org.apache.any23.plugin.htmlscraper
Class HTMLScraperExtractor

java.lang.Object
  extended by org.apache.any23.plugin.htmlscraper.HTMLScraperExtractor
All Implemented Interfaces:
Extractor<InputStream>, Extractor.ContentExtractor

public class HTMLScraperExtractor
extends Object
implements Extractor.ContentExtractor

A content extractor implementation that performs HTML scraping.

Author:
Michele Mostarda (mostarda@fbk.eu)
See Also:
HTMLScraperPlugin

Nested Class Summary
 
Nested classes/interfaces inherited from interface org.apache.any23.extractor.Extractor
Extractor.BlindExtractor, Extractor.ContentExtractor, Extractor.TagSoupDOMExtractor
 
Field Summary
protected static ExtractorFactory<HTMLScraperExtractor> factory
           
static String NAME
           
static org.openrdf.model.URI PAGE_CONTENT_AE_PROPERTY
           
static org.openrdf.model.URI PAGE_CONTENT_CE_PROPERTY
           
static org.openrdf.model.URI PAGE_CONTENT_DE_PROPERTY
           
static org.openrdf.model.URI PAGE_CONTENT_LCE_PROPERTY
           
 
Constructor Summary
HTMLScraperExtractor()
           
 
Method Summary
 void addTextExtractor(String name, org.openrdf.model.URI property, de.l3s.boilerpipe.BoilerpipeExtractor extractor)
           
 ExtractorFactory getDescription()
           
 String[] getTextExtractors()
           
 void run(ExtractionParameters extractionParameters, ExtractionContext extractionContext, InputStream inputStream, ExtractionResult extractionResult)
           
 void setStopAtFirstError(boolean b)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

NAME

public static final String NAME
See Also:
Constant Field Values

PAGE_CONTENT_DE_PROPERTY

public static final org.openrdf.model.URI PAGE_CONTENT_DE_PROPERTY

PAGE_CONTENT_AE_PROPERTY

public static final org.openrdf.model.URI PAGE_CONTENT_AE_PROPERTY

PAGE_CONTENT_LCE_PROPERTY

public static final org.openrdf.model.URI PAGE_CONTENT_LCE_PROPERTY

PAGE_CONTENT_CE_PROPERTY

public static final org.openrdf.model.URI PAGE_CONTENT_CE_PROPERTY

factory

protected static final ExtractorFactory<HTMLScraperExtractor> factory

Constructor Detail

HTMLScraperExtractor

public HTMLScraperExtractor()

Method Detail

addTextExtractor

public void addTextExtractor(String name,
                             org.openrdf.model.URI property,
                             de.l3s.boilerpipe.BoilerpipeExtractor extractor)

getTextExtractors

public String[] getTextExtractors()
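
This page gives no description for addTextExtractor or getTextExtractors, so the following is only a stand-alone sketch of the pattern their signatures suggest: extractors are registered under a name together with the property their output is published under, and the registered names can be listed afterwards. Everything in it is hypothetical and is not the Any23 implementation: the class TextExtractorRegistry, the helper method extractAll, the use of a plain String in place of org.openrdf.model.URI, and the use of java.util.function.Function in place of de.l3s.boilerpipe.BoilerpipeExtractor are all assumptions made so that the example compiles with the JDK alone.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for a registry of named text extractors (not Any23 code). */
final class TextExtractorRegistry {

    /** Pairs the property a result is published under with the function that produces it. */
    private static final class Entry {
        final String property;
        final Function<String, String> extractor;

        Entry(String property, Function<String, String> extractor) {
            this.property = property;
            this.extractor = extractor;
        }
    }

    // Insertion order is kept so the listed names follow registration order.
    private final Map<String, Entry> extractors = new LinkedHashMap<>();

    /** Registers an extractor under a name; a later registration with the same name replaces it. */
    void addTextExtractor(String name, String property, Function<String, String> extractor) {
        if (name == null || property == null || extractor == null) {
            throw new IllegalArgumentException("name, property and extractor must be non-null");
        }
        extractors.put(name, new Entry(property, extractor));
    }

    /** Returns the names of all registered extractors. */
    String[] getTextExtractors() {
        return extractors.keySet().toArray(new String[0]);
    }

    /** Hypothetical helper: applies every extractor and maps each property to its extracted text. */
    Map<String, String> extractAll(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Entry entry : extractors.values()) {
            result.put(entry.property, entry.extractor.apply(html));
        }
        return result;
    }
}
```

A caller of this sketch would register, for example, a tag-stripping function under the name "strip-tags" and a property string of its choosing, then pass page markup to extractAll. With the real class, the third argument would be a boilerpipe extractor instance and the second an org.openrdf.model.URI.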

run

public void run(ExtractionParameters extractionParameters,
                ExtractionContext extractionContext,
                InputStream inputStream,
                ExtractionResult extractionResult)
         throws IOException,
                ExtractionException
Specified by:
run in interface Extractor<InputStream>
Throws:
IOException
ExtractionException

getDescription

public ExtractorFactory getDescription()
Specified by:
getDescription in interface Extractor<InputStream>

setStopAtFirstError

public void setStopAtFirstError(boolean b)
Specified by:
setStopAtFirstError in interface Extractor.ContentExtractor


Copyright © 2010-2012 The Apache Software Foundation. All Rights Reserved.