org.apache.any23.plugin.htmlscraper
Class HTMLScraperExtractor
java.lang.Object
org.apache.any23.plugin.htmlscraper.HTMLScraperExtractor
- All Implemented Interfaces:
- Extractor<InputStream>, Extractor.ContentExtractor
public class HTMLScraperExtractor
- extends Object
- implements Extractor.ContentExtractor
Implementation of a content extractor that performs HTML scraping.
- Author:
- Michele Mostarda (mostarda@fbk.eu)
- See Also:
- HTMLScraperPlugin
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
NAME
public static final String NAME
- See Also:
- Constant Field Values
PAGE_CONTENT_DE_PROPERTY
public static final org.openrdf.model.URI PAGE_CONTENT_DE_PROPERTY
PAGE_CONTENT_AE_PROPERTY
public static final org.openrdf.model.URI PAGE_CONTENT_AE_PROPERTY
PAGE_CONTENT_LCE_PROPERTY
public static final org.openrdf.model.URI PAGE_CONTENT_LCE_PROPERTY
PAGE_CONTENT_CE_PROPERTY
public static final org.openrdf.model.URI PAGE_CONTENT_CE_PROPERTY
factory
protected static final ExtractorFactory<HTMLScraperExtractor> factory
HTMLScraperExtractor
public HTMLScraperExtractor()
addTextExtractor
public void addTextExtractor(String name,
org.openrdf.model.URI property,
de.l3s.boilerpipe.BoilerpipeExtractor extractor)
getTextExtractors
public String[] getTextExtractors()
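This page does not document how registered text extractors are stored or applied. The sketch below is a minimal model of the registration pattern that the two signatures above suggest, written under stated assumptions: rules are kept in insertion order, each rule pairs a name with a property and an extractor, and every rule is applied to the page content. The names TextExtractor, ScraperRegistrySketch, apply, and the example.org property are hypothetical stand-ins and are not part of Any23 or Boilerpipe.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for de.l3s.boilerpipe.BoilerpipeExtractor.
interface TextExtractor {
    String getText(String html);
}

// Hypothetical model of name/property/extractor registration; not the Any23 class.
class ScraperRegistrySketch {

    // One registered rule: a name, the property to emit, and the extractor to apply.
    private static final class Rule {
        final String name;
        final String property; // a plain String stands in for org.openrdf.model.URI
        final TextExtractor extractor;

        Rule(String name, String property, TextExtractor extractor) {
            this.name = name;
            this.property = property;
            this.extractor = extractor;
        }
    }

    // Assumption: rules are kept in the order they were added.
    private final List<Rule> rules = new ArrayList<>();

    void addTextExtractor(String name, String property, TextExtractor extractor) {
        rules.add(new Rule(name, property, extractor));
    }

    // Returns the names of all registered rules, in insertion order.
    String[] getTextExtractors() {
        String[] names = new String[rules.size()];
        for (int i = 0; i < rules.size(); i++) {
            names[i] = rules.get(i).name;
        }
        return names;
    }

    // Assumed behavior: apply every rule to the page and map property -> extracted text.
    Map<String, String> apply(String html) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Rule r : rules) {
            out.put(r.property, r.extractor.getText(html));
        }
        return out;
    }

    public static void main(String[] args) {
        ScraperRegistrySketch scraper = new ScraperRegistrySketch();
        // Naive tag stripper, for illustration only.
        scraper.addTextExtractor("strip-tags", "http://example.org/vocab#text",
                html -> html.replaceAll("<[^>]*>", "").trim());
        System.out.println(String.join(",", scraper.getTextExtractors()));
        System.out.println(scraper.apply("<p>Hello</p>"));
    }
}
```

With the real class, the third argument would be a Boilerpipe extractor instance and the second an org.openrdf.model.URI, as the signature of addTextExtractor shows.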
run
public void run(ExtractionParameters extractionParameters,
ExtractionContext extractionContext,
InputStream inputStream,
ExtractionResult extractionResult)
throws IOException,
ExtractionException
- Specified by:
- run in interface Extractor<InputStream>
- Throws:
- IOException
- ExtractionException
getDescription
public ExtractorFactory getDescription()
- Specified by:
- getDescription in interface Extractor<InputStream>
setStopAtFirstError
public void setStopAtFirstError(boolean b)
- Specified by:
- setStopAtFirstError in interface Extractor.ContentExtractor
Copyright © 2010-2012 The Apache Software Foundation. All Rights Reserved.