org.apache.any23.extractor.html
Class EntityBasedMicroformatExtractor

java.lang.Object
  extended by org.apache.any23.extractor.html.MicroformatExtractor
      extended by org.apache.any23.extractor.html.EntityBasedMicroformatExtractor
All Implemented Interfaces:
Extractor<Document>, Extractor.TagSoupDOMExtractor
Direct Known Subclasses:
AdrExtractor, GeoExtractor, HCardExtractor, HListingExtractor, HRecipeExtractor, HResumeExtractor, HReviewExtractor, SpeciesExtractor

public abstract class EntityBasedMicroformatExtractor
extends MicroformatExtractor

Base class for microformat extractors based on entities.

Author:
Gabriele Renzi

Nested Class Summary
 
Nested classes/interfaces inherited from interface org.apache.any23.extractor.Extractor
Extractor.BlindExtractor, Extractor.ContentExtractor, Extractor.TagSoupDOMExtractor
 
Field Summary
 
Fields inherited from class org.apache.any23.extractor.html.MicroformatExtractor
BEGIN_SCRIPT, END_SCRIPT, valueFactory
 
Constructor Summary
EntityBasedMicroformatExtractor()
           
 
Method Summary
 boolean extract()
          Performs the extraction of the data and writes it to the model.
protected abstract  boolean extractEntity(Node node, ExtractionResult out)
          Extracts an entity from a DOM node.
protected abstract  String getBaseClassName()
          Returns the base class name for the extractor.
protected  org.openrdf.model.BNode getBlankNodeFor(Node node)
           
protected abstract  void resetExtractor()
          Resets the internal state of the extractor to prepare it for a new extraction.
 
Methods inherited from class org.apache.any23.extractor.html.MicroformatExtractor
addBNodeProperty, addBNodeProperty, addURIProperty, conditionallyAddLiteralProperty, conditionallyAddResourceProperty, conditionallyAddStringProperty, fixLink, fixLink, getCurrentExtractionResult, getDescription, getDocumentURI, getExtractionContext, getHTMLDocument, includes, openSubResult, run
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

EntityBasedMicroformatExtractor

public EntityBasedMicroformatExtractor()
Method Detail

getBaseClassName

protected abstract String getBaseClassName()
Returns the base class name for the extractor.

Returns:
a string containing the base class name of the extractor.

resetExtractor

protected abstract void resetExtractor()
Resets the internal state of the extractor to prepare it for a new extraction.


extractEntity

protected abstract boolean extractEntity(Node node,
                                         ExtractionResult out)
                                  throws ExtractionException
Extracts an entity from a DOM node.

Parameters:
node - the DOM node.
out - the extraction result collector.
Returns:
true if the extraction has produced something, false otherwise.
Throws:
ExtractionException

extract

public boolean extract()
                throws ExtractionException
Description copied from class: MicroformatExtractor
Performs the extraction of the data and writes it to the model. The nodes generated in the model can have any name or implicit label, but where possible they SHOULD have names (either URIs or AnonId) that are uniquely derivable from their position in the DOM tree, so that multiple extractors can merge information.

Specified by:
extract in class MicroformatExtractor
Throws:
ExtractionException
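
The abstract methods of this class suggest a template-method design: extract() drives the traversal and subclasses supply the per-entity logic. The following is a minimal, self-contained sketch of that idea using only JDK classes. It is not the Any23 implementation; the class names EntityExtractorSketch and GeoLikeSketch, the List<String> collector standing in for ExtractionResult, and the point at which resetExtractor() is called are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical stand-in for EntityBasedMicroformatExtractor; not the Any23 class.
abstract class EntityExtractorSketch {
    protected abstract String getBaseClassName();
    protected abstract void resetExtractor();
    // A List<String> replaces ExtractionResult to keep the sketch self-contained.
    protected abstract boolean extractEntity(Node node, List<String> out);

    // Assumed control flow: visit every element whose class attribute
    // contains the base class name and delegate to extractEntity.
    public boolean extract(Document doc, List<String> out) {
        boolean foundAny = false;
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            if (hasClass(element, getBaseClassName())) {
                resetExtractor(); // assumption: state is reset before each entity
                foundAny |= extractEntity(element, out);
            }
        }
        return foundAny;
    }

    // True if the whitespace-separated class attribute contains the given token.
    private static boolean hasClass(Element element, String name) {
        for (String token : element.getAttribute("class").trim().split("\\s+")) {
            if (token.equals(name)) {
                return true;
            }
        }
        return false;
    }
}

// Hypothetical subclass: collects the text of every element with class "geo".
final class GeoLikeSketch extends EntityExtractorSketch {
    @Override protected String getBaseClassName() { return "geo"; }
    @Override protected void resetExtractor() { /* no per-entity state to clear */ }

    @Override protected boolean extractEntity(Node node, List<String> out) {
        String text = node.getTextContent().trim();
        if (text.isEmpty()) {
            return false; // nothing extracted from this node
        }
        out.add(text);
        return true;
    }

    // Parses well-formed XHTML and returns the collected entity texts.
    static List<String> run(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            List<String> out = new ArrayList<>();
            new GeoLikeSketch().extract(doc, out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }
}
```

In this sketch a concrete extractor only has to name its root class and say how one matching node becomes output; the traversal and the boolean "anything found" result stay in the base class.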

getBlankNodeFor

protected org.openrdf.model.BNode getBlankNodeFor(Node node)
Parameters:
node - a DOM node representing a blank node
Returns:
the RDF blank node corresponding to that DOM node, identified by a blank node ID like "MD5 of http://doc-uri/#xpath/to/node"
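
To illustrate the kind of ID described above, here is a minimal sketch that derives a stable identifier from a document URI and an XPath by hashing them with MD5. The helper name BlankNodeIdSketch, the "#" separator, the UTF-8 encoding, and the hex output format are assumptions; the exact string that Any23 hashes may differ.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Any23.
final class BlankNodeIdSketch {

    // Returns the hex-encoded MD5 of "<documentUri>#<xpath>".
    // The separator and the encoding are assumptions for illustration.
    static String blankNodeId(String documentUri, String xpath) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(
                    (documentUri + "#" + xpath).getBytes(StandardCharsets.UTF_8));
            // Left-pad to 32 hex digits so leading zero bytes are preserved.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be supported by every Java platform implementation.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // The same URI and XPath always yield the same ID.
        System.out.println(
                blankNodeId("http://example.org/page.html", "/HTML[1]/BODY[1]/DIV[2]"));
    }
}
```

Because the ID depends only on the document URI and the node's position, two extractors that visit the same DOM node would arrive at the same blank node, which is what allows their output to be merged.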


Copyright © 2010-2012 The Apache Software Foundation. All Rights Reserved.