org.apache.any23.extractor
Class SingleDocumentExtraction

java.lang.Object
  extended by org.apache.any23.extractor.SingleDocumentExtraction

public class SingleDocumentExtraction
extends Object

This class acts as a facade through which all the extractors are called on a single document.


Field Summary
static String EXTRACTION_CONTEXT_URI_PROPERTY
           
static String METADATA_DOMAIN_PER_ENTITY_FLAG
           
static String METADATA_NESTING_FLAG
           
static String METADATA_TIMESIZE_FLAG
           
 
Constructor Summary
SingleDocumentExtraction(Configuration configuration, DocumentSource in, ExtractorFactory<?> factory, TripleHandler output)
          Builds an extractor from the given document source, extractor factory and output triple handler.
SingleDocumentExtraction(Configuration configuration, DocumentSource in, ExtractorGroup extractors, TripleHandler output)
          Builds an extractor from the given document source, list of extractors and output triple handler.
SingleDocumentExtraction(DocumentSource in, ExtractorFactory<?> factory, TripleHandler output)
          Builds an extractor from the given document source, extractor factory and output triple handler, using the DefaultConfiguration.
 
Method Summary
 String getDetectedMIMEType()
          Returns the detected mimetype for the given DocumentSource.
 List<Extractor> getMatchingExtractors()
          Returns the list of all the activated extractors for the given DocumentSource.
 String getParserEncoding()
          Returns the configured parsing encoding.
 boolean hasMatchingExtractors()
          Checks whether the given DocumentSource content activates at least one extractor.
 SingleDocumentExtractionReport run()
          Triggers the execution of all the Extractors registered to this class using the default extraction parameters.
 SingleDocumentExtractionReport run(ExtractionParameters extractionParameters)
          Triggers the execution of all the Extractors registered to this class using the specified extraction parameters.
 void setLocalCopyFactory(LocalCopyFactory copyFactory)
          Sets the internal factory for generating the document local copy; if null, the MemCopyFactory will be used.
 void setMIMETypeDetector(MIMETypeDetector detector)
          Sets the internal MIME type detector; if null, MIME type detection will be skipped and all extractors will be activated.
 void setParserEncoding(String encoding)
          Sets the document parser encoding.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

EXTRACTION_CONTEXT_URI_PROPERTY

public static final String EXTRACTION_CONTEXT_URI_PROPERTY
See Also:
Constant Field Values

METADATA_TIMESIZE_FLAG

public static final String METADATA_TIMESIZE_FLAG
See Also:
Constant Field Values

METADATA_NESTING_FLAG

public static final String METADATA_NESTING_FLAG
See Also:
Constant Field Values

METADATA_DOMAIN_PER_ENTITY_FLAG

public static final String METADATA_DOMAIN_PER_ENTITY_FLAG
See Also:
Constant Field Values
Constructor Detail

SingleDocumentExtraction

public SingleDocumentExtraction(Configuration configuration,
                                DocumentSource in,
                                ExtractorGroup extractors,
                                TripleHandler output)
Builds an extractor from the given document source, list of extractors and output triple handler.

Parameters:
configuration - configuration applied during extraction.
in - input document source.
extractors - list of extractors to be applied.
output - output triple handler.

SingleDocumentExtraction

public SingleDocumentExtraction(Configuration configuration,
                                DocumentSource in,
                                ExtractorFactory<?> factory,
                                TripleHandler output)
Builds an extractor from the given document source, extractor factory and output triple handler.

Parameters:
configuration - configuration applied during extraction.
in - input document source.
factory - the extractors factory.
output - output triple handler.

SingleDocumentExtraction

public SingleDocumentExtraction(DocumentSource in,
                                ExtractorFactory<?> factory,
                                TripleHandler output)
Builds an extractor from the given document source, extractor factory and output triple handler, using the DefaultConfiguration.

Parameters:
in - input document source.
factory - the extractors factory.
output - output triple handler.
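
The facade shape described by the three constructors can be illustrated with a minimal, self-contained sketch. Every type below (FacadeSketch, FacadeSketchExtractor, FacadeSketchOutput) is a hypothetical stand-in written for this illustration, not an Any23 class, and the chaining from the shorter constructor to the full one is an assumption about how a default configuration might be supplied.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an extractor: turns document text into output statements. */
interface FacadeSketchExtractor {
    List<String> extract(String document);
}

/** Hypothetical stand-in for an output handler; it simply collects what it receives. */
final class FacadeSketchOutput {
    final List<String> received = new ArrayList<>();

    void receive(String statement) {
        received.add(statement);
    }
}

/** Facade sketch: one document, many extractors, one output. */
final class FacadeSketch {
    // Assumed placeholder standing in for a default configuration object.
    static final String DEFAULT_CONFIGURATION = "default";

    final String configuration;
    private final String document;
    private final List<FacadeSketchExtractor> extractors;
    private final FacadeSketchOutput output;

    /** Full constructor: the caller supplies the configuration explicitly. */
    FacadeSketch(String configuration, String document,
                 List<FacadeSketchExtractor> extractors, FacadeSketchOutput output) {
        this.configuration = configuration;
        this.document = document;
        this.extractors = extractors;
        this.output = output;
    }

    /** Convenience constructor: delegates with the default configuration. */
    FacadeSketch(String document, List<FacadeSketchExtractor> extractors,
                 FacadeSketchOutput output) {
        this(DEFAULT_CONFIGURATION, document, extractors, output);
    }

    /** Calls every extractor on the single document and forwards the results. */
    void run() {
        for (FacadeSketchExtractor extractor : extractors) {
            for (String statement : extractor.extract(document)) {
                output.receive(statement);
            }
        }
    }
}
```

In this sketch the caller only ever touches the facade: it hands over a document, a set of extractors and an output, then calls run() once.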
Method Detail

setLocalCopyFactory

public void setLocalCopyFactory(LocalCopyFactory copyFactory)
Sets the internal factory for generating the document local copy; if null, the MemCopyFactory will be used.

Parameters:
copyFactory - local copy factory.
See Also:
DocumentSource
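
The null fallback described above can be sketched as follows. All types here are hypothetical stand-ins rather than the Any23 LocalCopyFactory or MemCopyFactory, and the buffering behavior of the in-memory factory is an assumption based only on its name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for a local copy: a snapshot that can be reopened. */
interface CopySketch {
    InputStream open();
}

/** Hypothetical stand-in for a local copy factory. */
interface CopyFactorySketch {
    CopySketch copy(InputStream in) throws IOException;
}

/** Assumed in-memory behavior: buffer every byte once, reopen at will. */
final class InMemoryCopyFactorySketch implements CopyFactorySketch {
    @Override
    public CopySketch copy(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        final byte[] bytes = buffer.toByteArray();
        // Each call to open() returns a fresh stream over the same bytes.
        return () -> new ByteArrayInputStream(bytes);
    }
}

/** Holder showing the setter contract: null restores the in-memory default. */
final class CopyFactoryHolderSketch {
    private CopyFactorySketch copyFactory = new InMemoryCopyFactorySketch();

    void setLocalCopyFactory(CopyFactorySketch factory) {
        this.copyFactory = (factory != null) ? factory : new InMemoryCopyFactorySketch();
    }

    CopyFactorySketch getLocalCopyFactory() {
        return copyFactory;
    }
}
```

A re-readable copy matters in a design like this because several extractors may each need to read the same document from its beginning.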

setMIMETypeDetector

public void setMIMETypeDetector(MIMETypeDetector detector)
Sets the internal MIME type detector; if null, MIME type detection will be skipped and all extractors will be activated.

Parameters:
detector - detector instance.
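
The effect of a null detector on extractor activation can be sketched as below. The types are hypothetical stand-ins, not the Any23 MIMETypeDetector or Extractor interfaces, and matching by exact MIME type string is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a MIME type detector. */
interface MimeSketchDetector {
    String detect(String content);
}

/** Hypothetical stand-in for an extractor that declares the MIME types it supports. */
final class MimeSketchExtractor {
    final String name;
    final List<String> supportedTypes;

    MimeSketchExtractor(String name, String... supportedTypes) {
        this.name = name;
        this.supportedTypes = Arrays.asList(supportedTypes);
    }
}

final class MimeSelectionSketch {
    /**
     * Returns the extractors to activate for the given content.
     * A null detector skips detection and activates every extractor.
     */
    static List<MimeSketchExtractor> select(MimeSketchDetector detector, String content,
                                            List<MimeSketchExtractor> all) {
        if (detector == null) {
            return new ArrayList<>(all);
        }
        String detected = detector.detect(content);
        List<MimeSketchExtractor> matching = new ArrayList<>();
        for (MimeSketchExtractor extractor : all) {
            if (extractor.supportedTypes.contains(detected)) {
                matching.add(extractor);
            }
        }
        return matching;
    }

    /** True when at least one extractor would be activated. */
    static boolean hasMatching(MimeSketchDetector detector, String content,
                               List<MimeSketchExtractor> all) {
        return !select(detector, content, all).isEmpty();
    }
}
```

Under these assumptions, passing null trades precision for coverage: no extractor is filtered out, so each one has to cope with content it may not understand.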

run

public SingleDocumentExtractionReport run(ExtractionParameters extractionParameters)
                                   throws ExtractionException,
                                          IOException
Triggers the execution of all the Extractors registered to this class using the specified extraction parameters.

Parameters:
extractionParameters - the parameters applied to the run execution.
Returns:
the report generated by the extraction.
Throws:
ExtractionException - if an error occurred during the data extraction.
IOException - if an error occurred during the data access.

run

public SingleDocumentExtractionReport run()
                                   throws IOException,
                                          ExtractionException
Triggers the execution of all the Extractors registered to this class using the default extraction parameters.

Returns:
the extraction report.
Throws:
IOException - if an error occurred during the data access.
ExtractionException - if an error occurred during the data extraction.
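
The relationship between the two run overloads can be sketched as a simple delegation. This is an assumption about a common way to implement such a pair, not a description of the Any23 source; RunSketch, RunSketchParameters and RunSketchReport are hypothetical stand-ins, and the mode field is invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for extraction parameters; the mode label is invented. */
final class RunSketchParameters {
    final String mode;

    RunSketchParameters(String mode) {
        this.mode = mode;
    }

    static RunSketchParameters defaults() {
        return new RunSketchParameters("default");
    }
}

/** Hypothetical stand-in for the report: which extractors ran, and with what parameters. */
final class RunSketchReport {
    final List<String> executed;
    final RunSketchParameters parameters;

    RunSketchReport(List<String> executed, RunSketchParameters parameters) {
        this.executed = executed;
        this.parameters = parameters;
    }
}

final class RunSketch {
    private final List<String> extractorNames;

    RunSketch(List<String> extractorNames) {
        this.extractorNames = extractorNames;
    }

    /** No-argument overload: delegates with the default parameters. */
    RunSketchReport run() {
        return run(RunSketchParameters.defaults());
    }

    /** Executes every registered extractor and reports what ran. */
    RunSketchReport run(RunSketchParameters parameters) {
        List<String> executed = new ArrayList<>(extractorNames);
        return new RunSketchReport(Collections.unmodifiableList(executed), parameters);
    }
}
```

Keeping all the work in the parameterized overload means both entry points behave identically apart from the parameters they carry.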

getDetectedMIMEType

public String getDetectedMIMEType()
                           throws IOException
Returns the detected mimetype for the given DocumentSource.

Returns:
string containing the detected mimetype.
Throws:
IOException - if an error occurred while accessing the data.

hasMatchingExtractors

public boolean hasMatchingExtractors()
                              throws IOException
Checks whether the given DocumentSource content activates at least one extractor.

Returns:
true if at least one extractor is activated, false otherwise.
Throws:
IOException - if an error occurred while accessing the data.

getMatchingExtractors

public List<Extractor> getMatchingExtractors()
Returns:
the list of all the activated extractors for the given DocumentSource.

getParserEncoding

public String getParserEncoding()
Returns:
the configured parsing encoding.

setParserEncoding

public void setParserEncoding(String encoding)
Sets the document parser encoding.

Parameters:
encoding - parser encoding.


Copyright © 2010-2012 The Apache Software Foundation. All Rights Reserved.