org.apache.nutch.indexer.anchor
Class AnchorIndexingFilter

java.lang.Object
  extended by org.apache.nutch.indexer.anchor.AnchorIndexingFilter
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, IndexingFilter, FieldPluggable, Pluggable

public class AnchorIndexingFilter
extends Object
implements IndexingFilter

Indexing filter that offers an option to either index all inbound anchor text for a document or deduplicate anchors. Deduplication does have its cons.

See Also:
anchorIndexingFilter.deduplicate in nutch-default.xml.

Field Summary
static org.slf4j.Logger LOG
           
 
Fields inherited from interface org.apache.nutch.indexer.IndexingFilter
X_POINT_ID
 
Constructor Summary
AnchorIndexingFilter()
           
 
Method Summary
 void addIndexBackendOptions(org.apache.hadoop.conf.Configuration conf)
           
 NutchDocument filter(NutchDocument doc, String url, WebPage page)
          Indexes the inbound anchor text for a document; a boolean configuration setting controls whether anchors are deduplicated.
 org.apache.hadoop.conf.Configuration getConf()
          Get the Configuration object
 Collection<WebPage.Field> getFields()
          Gets all the fields for a given WebPage. Many datastores need to set up the MapReduce job by specifying the fields needed.
 void setConf(org.apache.hadoop.conf.Configuration conf)
          Set the Configuration object
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.slf4j.Logger LOG

Constructor Detail

AnchorIndexingFilter

public AnchorIndexingFilter()

Method Detail

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object

Specified by:
setConf in interface org.apache.hadoop.conf.Configurable

getConf

public org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object

Specified by:
getConf in interface org.apache.hadoop.conf.Configurable

addIndexBackendOptions

public void addIndexBackendOptions(org.apache.hadoop.conf.Configuration conf)

filter

public NutchDocument filter(NutchDocument doc,
                            String url,
                            WebPage page)
                     throws IndexingException
Indexes the inbound anchor text for a document. A boolean configuration setting controls whether anchors are deduplicated. See anchorIndexingFilter.deduplicate in nutch-default.xml.

Specified by:
filter in interface IndexingFilter
Parameters:
doc - The NutchDocument object
url - URL to be filtered for anchor text
page - WebPage object relative to the URL
Returns:
filtered NutchDocument
Throws:
IndexingException
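
The choice between indexing every inbound anchor and deduplicating them can be illustrated with a small, self-contained sketch. It is not the Nutch source: the class AnchorDedupSketch and the method collectAnchors are hypothetical, and two behaviours are assumptions made for illustration only (anchors are compared case-insensitively, and empty anchors are skipped).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in; this is not the actual AnchorIndexingFilter source.
class AnchorDedupSketch {

    /**
     * Returns the anchor values that would be added to the index.
     * When deduplicate is true, anchors that differ only by case are kept
     * once, and the first spelling encountered wins (an assumption).
     */
    static List<String> collectAnchors(List<String> inboundAnchors, boolean deduplicate) {
        List<String> indexed = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String anchor : inboundAnchors) {
            if (anchor == null || anchor.isEmpty()) {
                continue; // assumption: an empty anchor carries nothing to index
            }
            // Set.add returns false when the lower-cased text was already seen
            if (deduplicate && !seen.add(anchor.toLowerCase(Locale.ROOT))) {
                continue;
            }
            indexed.add(anchor);
        }
        return indexed;
    }

    public static void main(String[] args) {
        List<String> anchors = Arrays.asList("Apache Nutch", "apache nutch", "", "Nutch home");
        System.out.println(collectAnchors(anchors, true));  // deduplicated
        System.out.println(collectAnchors(anchors, false)); // all non-empty anchors
    }
}
```

One trade-off the sketch makes visible: with deduplication on, the number of pages that link with the same text is no longer reflected in the indexed anchors.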

getFields

public Collection<WebPage.Field> getFields()
Gets all the fields for a given WebPage. Many datastores need to set up the MapReduce job by specifying the fields needed. All extensions that work on WebPage are able to specify what fields they need.

Specified by:
getFields in interface FieldPluggable
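
A rough sketch of why each extension declares its fields: a job can union the declarations and request only those fields from the datastore. Everything below is hypothetical and stands in for the real types. PageField, NeedsFields, and fieldsForJob are illustration names, not the Nutch WebPage.Field or FieldPluggable API.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins; these are not the Nutch types.
class FieldUnionSketch {

    enum PageField { CONTENT, TITLE, INLINKS }

    /** Mirrors the idea of an extension that reports the fields it reads. */
    interface NeedsFields {
        Collection<PageField> getFields();
    }

    /** Unions the fields requested by every extension so each is loaded once. */
    static Set<PageField> fieldsForJob(List<NeedsFields> extensions) {
        Set<PageField> needed = EnumSet.noneOf(PageField.class);
        for (NeedsFields extension : extensions) {
            needed.addAll(extension.getFields());
        }
        return needed;
    }

    public static void main(String[] args) {
        // An anchor-style extension might only need inbound links (assumption).
        NeedsFields anchorLike = () -> EnumSet.of(PageField.INLINKS);
        NeedsFields basicLike = () -> EnumSet.of(PageField.TITLE, PageField.CONTENT);
        System.out.println(fieldsForJob(Arrays.asList(anchorLike, basicLike)));
    }
}
```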


Copyright © 2013 The Apache Software Foundation