org.apache.nutch.indexer.basic
Class BasicIndexingFilter

java.lang.Object
  extended by org.apache.nutch.indexer.basic.BasicIndexingFilter
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, IndexingFilter, FieldPluggable, Pluggable

public class BasicIndexingFilter
extends Object
implements IndexingFilter

Adds basic searchable fields to a document. The fields are:
host - add host as un-stored, indexed and tokenized
url - url is both stored and indexed, so it's both searchable and returned. This is also a required field.
orig - also store original url as both stored and indexed
content - content is indexed, so that it's searchable, but not stored in index
title - title is stored and indexed
cache - add cached content/summary display policy, if available
tstamp - add timestamp when fetched, for deduplication
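
As a rough illustration of the field list above, the following standalone sketch collects comparable values into a plain map. It does not use the Nutch API: the class `BasicFieldsSketch`, the helper `basicFields`, and the use of a `Map` in place of `NutchDocument` are assumptions made for this example, and the cache and tstamp fields are left out.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

public class BasicFieldsSketch {

    // Hypothetical helper: gathers the basic field values for one page.
    // A real indexing filter would add them to a NutchDocument instead.
    public static Map<String, String> basicFields(String url, String origUrl,
                                                  String title, String content) {
        Map<String, String> fields = new LinkedHashMap<>();
        // host is derived from the URL; skip it when the URL has no host part
        String host = URI.create(url).getHost();
        if (host != null) {
            fields.put("host", host);
        }
        fields.put("url", url);         // required field
        fields.put("orig", origUrl);    // original URL
        fields.put("content", content); // searchable text
        fields.put("title", title);
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = basicFields(
                "http://nutch.apache.org/index.html",
                "http://nutch.apache.org/",
                "Apache Nutch",
                "Highly extensible web crawler");
        fields.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Whether a field is stored, indexed, or tokenized is decided by the indexing backend and is not modelled here.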


Field Summary
static org.slf4j.Logger LOG
           
 
Fields inherited from interface org.apache.nutch.indexer.IndexingFilter
X_POINT_ID
 
Constructor Summary
BasicIndexingFilter()
           
 
Method Summary
 void addIndexBackendOptions(org.apache.hadoop.conf.Configuration conf)
           
 NutchDocument filter(NutchDocument doc, String url, WebPage page)
          Filters the document, supporting a configurable limit on the number of characters permitted in the title. See indexer.max.title.length in nutch-default.xml.
 org.apache.hadoop.conf.Configuration getConf()
          Get the Configuration object
 Collection<WebPage.Field> getFields()
          Gets all the fields for a given WebPage. Many datastores need to set up the MapReduce job by specifying the fields needed.
 void setConf(org.apache.hadoop.conf.Configuration conf)
          Set the Configuration object
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.slf4j.Logger LOG
Constructor Detail

BasicIndexingFilter

public BasicIndexingFilter()
Method Detail

filter

public NutchDocument filter(NutchDocument doc,
                            String url,
                            WebPage page)
                     throws IndexingException
Filters the document, supporting a configurable limit on the number of characters permitted in the title. See indexer.max.title.length in nutch-default.xml.

Specified by:
filter in interface IndexingFilter
Parameters:
doc - The NutchDocument object
url - URL of the page being indexed
page - WebPage object relative to the URL
Returns:
filtered NutchDocument
Throws:
IndexingException
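
The title length limit mentioned above can be pictured with a small standalone sketch. It uses `java.util.Properties` as a stand-in for the Hadoop `Configuration` object. The class `TitleLimitSketch`, the helper `limitTitle`, the default of 100, and the treatment of negative values as "no limit" are all assumptions for illustration, not a description of the actual Nutch implementation.

```java
import java.util.Properties;

public class TitleLimitSketch {

    // Property name taken from the method description above.
    public static final String MAX_TITLE_LENGTH_KEY = "indexer.max.title.length";

    // Default used by this sketch only; consult nutch-default.xml for the real value.
    public static final int ASSUMED_DEFAULT = 100;

    // Hypothetical helper: truncates the title to the configured number of characters.
    public static String limitTitle(String title, Properties conf) {
        int max = Integer.parseInt(
                conf.getProperty(MAX_TITLE_LENGTH_KEY, String.valueOf(ASSUMED_DEFAULT)));
        // Assumption: a negative value disables the limit.
        if (max >= 0 && title.length() > max) {
            return title.substring(0, max);
        }
        return title;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(MAX_TITLE_LENGTH_KEY, "12");
        // prints "Apache Nutch"
        System.out.println(limitTitle("Apache Nutch project home", conf));
    }
}
```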

addIndexBackendOptions

public void addIndexBackendOptions(org.apache.hadoop.conf.Configuration conf)

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object

Specified by:
setConf in interface org.apache.hadoop.conf.Configurable

getConf

public org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object

Specified by:
getConf in interface org.apache.hadoop.conf.Configurable

getFields

public Collection<WebPage.Field> getFields()
Gets all the fields for a given WebPage. Many datastores need to set up the MapReduce job by specifying the fields needed. All extensions that work on WebPage are able to specify what fields they need.

Specified by:
getFields in interface FieldPluggable
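
The idea of an extension declaring the fields it needs, so that a job can request only those from the datastore, might look roughly like the standalone sketch below. The nested `Field` enum is a stand-in for `WebPage.Field`, and its constants, the class `FieldsSketch`, and the helper `union` are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

public class FieldsSketch {

    // Stand-in for WebPage.Field; the constants are illustrative only.
    public enum Field { TITLE, TEXT, FETCH_TIME, STATUS }

    // Fields this hypothetical filter reads from a page.
    private static final Set<Field> FIELDS = Collections.unmodifiableSet(
            EnumSet.of(Field.TITLE, Field.TEXT, Field.FETCH_TIME));

    // Mirrors the shape of getFields(): report the fields the extension needs.
    public static Collection<Field> getFields() {
        return FIELDS;
    }

    // A job could merge the declarations of all active extensions
    // and load only the resulting set of fields.
    public static Set<Field> union(List<Collection<Field>> declared) {
        EnumSet<Field> all = EnumSet.noneOf(Field.class);
        for (Collection<Field> fields : declared) {
            all.addAll(fields);
        }
        return all;
    }

    public static void main(String[] args) {
        Collection<Field> other = EnumSet.of(Field.STATUS);
        System.out.println(union(Arrays.asList(getFields(), other)));
    }
}
```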


Copyright © 2013 The Apache Software Foundation