org.apache.nutch.indexer.anchor
Class AnchorIndexingFilter
java.lang.Object
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
- All Implemented Interfaces:
- Configurable, IndexingFilter, Pluggable
public class AnchorIndexingFilter
extends Object
implements IndexingFilter
Indexing filter that offers an option to either index all inbound anchor text for
a document or deduplicate anchors. Deduplication does have its drawbacks.
- See Also:
anchorIndexingFilter.deduplicate in nutch-default.xml
Field Summary
static org.slf4j.Logger LOG
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOG
public static final org.slf4j.Logger LOG
AnchorIndexingFilter
public AnchorIndexingFilter()
setConf
public void setConf(Configuration conf)
- Set the Configuration object
- Specified by:
setConf in interface Configurable
getConf
public Configuration getConf()
- Get the Configuration object
- Specified by:
getConf in interface Configurable
filter
public NutchDocument filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
throws IndexingException
- The AnchorIndexingFilter filter object, which supports a boolean
configuration setting for the deduplication of anchors.
See anchorIndexingFilter.deduplicate in nutch-default.xml.
- Specified by:
filter in interface IndexingFilter
- Parameters:
doc - The NutchDocument object
parse - The relevant Parse object passing through the filter
url - URL to be filtered for anchor text
datum - The CrawlDatum entry
inlinks - The Inlinks containing anchor text
- Returns:
- filtered NutchDocument
- Throws:
IndexingException
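The effect of the deduplication switch described for filter can be illustrated with a small standalone sketch. It is not the Nutch implementation and deliberately avoids Nutch and Hadoop types: the class AnchorDedupSketch and the method selectAnchors are hypothetical names, the boolean parameter stands in for the value of anchorIndexingFilter.deduplicate, and the case-insensitive comparison is an assumption rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of org.apache.nutch.
final class AnchorDedupSketch {

    /**
     * Returns the anchors that would be added to the document.
     * With deduplicate == false every inbound anchor is kept.
     * With deduplicate == true only the first occurrence of each anchor is
     * kept, compared case-insensitively (an assumption of this sketch).
     */
    static List<String> selectAnchors(String[] anchors, boolean deduplicate) {
        List<String> selected = new ArrayList<>();
        if (anchors == null) {
            return selected; // no inlinks means no anchor text to index
        }
        Set<String> seen = new HashSet<>();
        for (String anchor : anchors) {
            if (!deduplicate) {
                selected.add(anchor);
                continue;
            }
            // Set.add returns false when the normalized anchor was already seen.
            if (seen.add(anchor.toLowerCase(Locale.ROOT))) {
                selected.add(anchor);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        String[] inbound = {"Home", "home", "About", "HOME"};
        System.out.println(selectAnchors(inbound, false)); // [Home, home, About, HOME]
        System.out.println(selectAnchors(inbound, true));  // [Home, About]
    }
}
```

One drawback this makes visible: with deduplication on, how often a given anchor text points at the document is lost, since repeated anchors collapse to a single entry.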
Copyright © 2012 The Apache Software Foundation