HTMLScraperExtractor (Apache Any23 :: Plugins :: HTML Scraper 1.0.1-incubating-SNAPSHOT API)

org.apache.any23.plugin.htmlscraper
Class HTMLScraperExtractor

java.lang.Object
  org.apache.any23.plugin.htmlscraper.HTMLScraperExtractor

All Implemented Interfaces: Extractor<InputStream>, Extractor.ContentExtractor

public class HTMLScraperExtractor
extends Object
implements Extractor.ContentExtractor

Implementation of a content extractor that performs HTML scraping.

Author: Michele Mostarda (mostarda@fbk.eu)
See Also: HTMLScraperPlugin

Nested Class Summary

Nested classes/interfaces inherited from interface org.apache.any23.extractor.Extractor

Extractor.BlindExtractor, Extractor.ContentExtractor, Extractor.TagSoupDOMExtractor

Field Summary

protected static ExtractorFactory<HTMLScraperExtractor> factory

static String NAME

static org.openrdf.model.URI PAGE_CONTENT_AE_PROPERTY

static org.openrdf.model.URI PAGE_CONTENT_CE_PROPERTY

static org.openrdf.model.URI PAGE_CONTENT_DE_PROPERTY

static org.openrdf.model.URI PAGE_CONTENT_LCE_PROPERTY

Constructor Summary

HTMLScraperExtractor()


Method Summary

void addTextExtractor(String name, org.openrdf.model.URI property, de.l3s.boilerpipe.BoilerpipeExtractor extractor)

ExtractorFactory getDescription()

String[] getTextExtractors()

void run(ExtractionParameters extractionParameters, ExtractionContext extractionContext, InputStream inputStream, ExtractionResult extractionResult)

void setStopAtFirstError(boolean b)

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

NAME

public static final String NAME

See Also:
Constant Field Values

PAGE_CONTENT_DE_PROPERTY

public static final org.openrdf.model.URI PAGE_CONTENT_DE_PROPERTY

PAGE_CONTENT_AE_PROPERTY

public static final org.openrdf.model.URI PAGE_CONTENT_AE_PROPERTY

PAGE_CONTENT_LCE_PROPERTY

public static final org.openrdf.model.URI PAGE_CONTENT_LCE_PROPERTY

PAGE_CONTENT_CE_PROPERTY

public static final org.openrdf.model.URI PAGE_CONTENT_CE_PROPERTY

factory

protected static final ExtractorFactory<HTMLScraperExtractor> factory

Constructor Detail

HTMLScraperExtractor

public HTMLScraperExtractor()

Method Detail

addTextExtractor

public void addTextExtractor(String name, org.openrdf.model.URI property, de.l3s.boilerpipe.BoilerpipeExtractor extractor)

getTextExtractors

public String[] getTextExtractors()
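
The signatures of addTextExtractor and getTextExtractors suggest a registry of named text extractors, each paired with the property under which its output is reported. This page does not document their behavior, so the following is only a minimal sketch of that shape. TextExtractorStandIn and TextExtractorRegistrySketch are hypothetical stand-ins, not Any23 or boilerpipe classes; the property is a plain String rather than an org.openrdf.model.URI so the sketch compiles without external libraries; and returning names in registration order is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for de.l3s.boilerpipe.BoilerpipeExtractor:
// anything that turns an HTML string into plain text.
interface TextExtractorStandIn {
    String getText(String html);
}

// Hypothetical registry mirroring the shape of addTextExtractor and
// getTextExtractors. This is NOT the Any23 implementation.
class TextExtractorRegistrySketch {

    // One registered entry: a name, the property to report under, and the extractor.
    private static final class Entry {
        final String name;
        final String property;
        final TextExtractorStandIn extractor;

        Entry(String name, String property, TextExtractorStandIn extractor) {
            this.name = name;
            this.property = property;
            this.extractor = extractor;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // Registers an extractor under a name; all three arguments are required.
    void addTextExtractor(String name, String property, TextExtractorStandIn extractor) {
        if (name == null || property == null || extractor == null) {
            throw new NullPointerException("name, property and extractor must not be null");
        }
        entries.add(new Entry(name, property, extractor));
    }

    // Returns the registered names (assumption: in registration order).
    String[] getTextExtractors() {
        String[] names = new String[entries.size()];
        for (int i = 0; i < names.length; i++) {
            names[i] = entries.get(i).name;
        }
        return names;
    }

    public static void main(String[] args) {
        TextExtractorRegistrySketch registry = new TextExtractorRegistrySketch();
        // A toy extractor that strips tags; the property URI is a placeholder.
        registry.addTextExtractor(
                "strip-tags",
                "http://example.org/page-content",
                html -> html.replaceAll("<[^>]*>", "").trim());
        for (String name : registry.getTextExtractors()) {
            System.out.println(name);
        }
    }
}
```

With the real class, the third argument would be a boilerpipe BoilerpipeExtractor instance and the second an org.openrdf.model.URI, as the signature above states.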

run

public void run(ExtractionParameters extractionParameters, ExtractionContext extractionContext, InputStream inputStream, ExtractionResult extractionResult) throws IOException, ExtractionException

Specified by:
run in interface Extractor<InputStream>

Throws:
IOException
ExtractionException

getDescription

public ExtractorFactory getDescription()

Specified by:
getDescription in interface Extractor<InputStream>

setStopAtFirstError

public void setStopAtFirstError(boolean b)

Specified by:
setStopAtFirstError in interface Extractor.ContentExtractor

Copyright © 2010-2012 The Apache Software Foundation. All Rights Reserved.
