org.apache.nutch.indexer.anchor
Class AnchorIndexingFilter

java.lang.Object
  extended by org.apache.nutch.indexer.anchor.AnchorIndexingFilter
All Implemented Interfaces:
Configurable, IndexingFilter, Pluggable

public class AnchorIndexingFilter
extends Object
implements IndexingFilter

Indexing filter that offers an option to either index all inbound anchor text for a document or deduplicate anchors. Deduplication does have its drawbacks; see the description of the anchorIndexingFilter.deduplicate property in nutch-default.xml.

See Also:
anchorIndexingFilter.deduplicate in nutch-default.xml.

Field Summary
static org.slf4j.Logger LOG
           
 
Fields inherited from interface org.apache.nutch.indexer.IndexingFilter
X_POINT_ID
 
Constructor Summary
AnchorIndexingFilter()
           
 
Method Summary
 NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
          Adds inbound anchor text to the document, deduplicating anchors when the boolean anchorIndexingFilter.deduplicate setting is enabled.
 Configuration getConf()
          Get the Configuration object
 void setConf(Configuration conf)
          Set the Configuration object
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.slf4j.Logger LOG
Constructor Detail

AnchorIndexingFilter

public AnchorIndexingFilter()
Method Detail

setConf

public void setConf(Configuration conf)
Set the Configuration object

Specified by:
setConf in interface Configurable

getConf

public Configuration getConf()
Get the Configuration object

Specified by:
getConf in interface Configurable

filter

public NutchDocument filter(NutchDocument doc,
                            Parse parse,
                            Text url,
                            CrawlDatum datum,
                            Inlinks inlinks)
                     throws IndexingException
Adds inbound anchor text to the document. Anchors are deduplicated when the boolean anchorIndexingFilter.deduplicate setting is enabled. See anchorIndexingFilter.deduplicate in nutch-default.xml.

Specified by:
filter in interface IndexingFilter
Parameters:
doc - The NutchDocument object
parse - The relevant Parse object passing through the filter
url - URL to be filtered for anchor text
datum - The CrawlDatum entry
inlinks - The Inlinks containing anchor text
Returns:
filtered NutchDocument
Throws:
IndexingException
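
The following is a minimal sketch of the behaviour described above, not the Nutch source. It assumes that the flag is read once in setConf and that deduplication compares anchors case-insensitively. The class AnchorDedupSketch is hypothetical: it uses java.util.Properties in place of Configuration and a plain list of strings in place of Inlinks, so that it runs without Nutch or Hadoop on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;

// Hypothetical stand-in, NOT the Nutch class. Properties replaces Hadoop's
// Configuration and List<String> replaces Inlinks (both are assumptions).
class AnchorDedupSketch {
    private boolean deduplicate = false;

    // Reads the boolean flag once, when the configuration is set.
    public void setConf(Properties conf) {
        deduplicate = Boolean.parseBoolean(
            conf.getProperty("anchorIndexingFilter.deduplicate", "false"));
    }

    // Returns the anchors that would be indexed, in their original order.
    public List<String> filter(List<String> anchors) {
        List<String> indexed = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String anchor : anchors) {
            if (deduplicate) {
                // Assumption: anchors differing only in case count as duplicates.
                String key = anchor.toLowerCase(Locale.ROOT);
                if (!seen.add(key)) {
                    continue; // already seen, skip this anchor
                }
            }
            indexed.add(anchor);
        }
        return indexed;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("anchorIndexingFilter.deduplicate", "true");

        AnchorDedupSketch sketch = new AnchorDedupSketch();
        sketch.setConf(conf);

        List<String> anchors = Arrays.asList("Home", "home", "About", "HOME");
        System.out.println(sketch.filter(anchors)); // prints [Home, About]
    }
}
```

With the flag left at false, the same input would be returned unchanged, so every inbound anchor, including repeats, would be indexed.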


Copyright © 2012 The Apache Software Foundation