org.apache.nutch.indexer.basic
Class BasicIndexingFilter
java.lang.Object
org.apache.nutch.indexer.basic.BasicIndexingFilter
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, IndexingFilter, FieldPluggable, Pluggable
public class BasicIndexingFilter
- extends Object
- implements IndexingFilter
Adds basic searchable fields to a document. The fields are:
host - the host, indexed and tokenized but not stored
url - the URL, stored and indexed, so it is both searchable and returned with results. This is a required field.
orig - the original URL, stored and indexed
content - the page content, indexed so that it is searchable, but not stored in the index
title - the title, stored and indexed
cache - the cached content/summary display policy, if available
tstamp - the timestamp of the fetch, used for deduplication
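The distinction between stored and indexed fields in the list above can be sketched as a small table in code. This is only an illustration, not Nutch source: the class FieldPolicySketch, the enum FieldPolicy and the method storedFieldNames are hypothetical names, and only the five fields whose stored/indexed policy is stated above are modelled.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; none of these types exist in Nutch.
final class FieldPolicySketch {

    // One entry per field whose policy is stated in the class description.
    enum FieldPolicy {
        HOST("host", false, true),       // searchable, not returned
        URL("url", true, true),          // searchable and returned
        ORIG("orig", true, true),        // searchable and returned
        CONTENT("content", false, true), // searchable, not returned
        TITLE("title", true, true);      // searchable and returned

        final String fieldName;
        final boolean stored;
        final boolean indexed;

        FieldPolicy(String fieldName, boolean stored, boolean indexed) {
            this.fieldName = fieldName;
            this.stored = stored;
            this.indexed = indexed;
        }
    }

    // Names of the fields whose values can be returned with a search hit.
    static List<String> storedFieldNames() {
        List<String> names = new ArrayList<>();
        for (FieldPolicy policy : FieldPolicy.values()) {
            if (policy.stored) {
                names.add(policy.fieldName);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(storedFieldNames()); // prints [url, orig, title]
    }
}
```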
Field Summary
static org.slf4j.Logger LOG
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOG
public static final org.slf4j.Logger LOG
BasicIndexingFilter
public BasicIndexingFilter()
filter
public NutchDocument filter(NutchDocument doc,
String url,
WebPage page)
throws IndexingException
- The BasicIndexingFilter filter method, which supports a configurable limit on the number of characters permitted in the title. See indexer.max.title.length in nutch-default.xml.
- Specified by:
filter
in interface IndexingFilter
- Parameters:
doc - The NutchDocument object
url - URL to be filtered for anchor text
page - WebPage object relative to the URL
- Returns:
- filtered NutchDocument
- Throws:
IndexingException
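The title-length limit described for filter can be sketched as follows. This is a minimal sketch, not the Nutch implementation: the class TitleLimitSketch and the method limitTitle are hypothetical, and treating a negative limit as "no limit" is an assumption about how indexer.max.title.length is interpreted.

```java
// Illustrative sketch only; not taken from the Nutch source.
final class TitleLimitSketch {

    // Returns the title cut to at most maxTitleLength characters.
    // Assumption: a negative maxTitleLength disables the check.
    static String limitTitle(String title, int maxTitleLength) {
        if (title == null) {
            return null;
        }
        if (maxTitleLength > -1 && title.length() > maxTitleLength) {
            return title.substring(0, maxTitleLength);
        }
        return title;
    }

    public static void main(String[] args) {
        // The limit would normally come from the configuration; it is hard-coded here.
        System.out.println(limitTitle("Apache Nutch: a web crawler", 12)); // prints Apache Nutch
        System.out.println(limitTitle("Apache Nutch: a web crawler", -1)); // prints the full title
    }
}
```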
addIndexBackendOptions
public void addIndexBackendOptions(org.apache.hadoop.conf.Configuration conf)
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
- Sets the Configuration object.
- Specified by:
setConf
in interface org.apache.hadoop.conf.Configurable
getConf
public org.apache.hadoop.conf.Configuration getConf()
- Gets the Configuration object.
- Specified by:
getConf
in interface org.apache.hadoop.conf.Configurable
getFields
public Collection<WebPage.Field> getFields()
- Gets all the fields for a given WebPage. Many datastores need to set up the MapReduce job by specifying the fields needed. All extensions that work on WebPage are able to specify what fields they need.
- Specified by:
getFields
in interface FieldPluggable
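The idea behind getFields, that each extension declares the fields it needs so the job can request only those, can be sketched like this. The sketch is self-contained and uses hypothetical stand-ins: PageField stands in for WebPage.Field, FieldAware for FieldPluggable, and the fields each sample plugin declares are assumptions chosen for illustration.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

// Illustrative sketch only; these types are stand-ins, not Nutch classes.
final class FieldDeclarationSketch {

    // Hypothetical stand-in for WebPage.Field.
    enum PageField { TITLE, TEXT, FETCH_TIME, INLINKS, OUTLINKS }

    // Hypothetical stand-in for FieldPluggable.
    interface FieldAware {
        Collection<PageField> getFields();
    }

    // A plugin that declares a fixed set of fields (assumed for the example).
    static final class BasicFilterLike implements FieldAware {
        private static final Set<PageField> FIELDS = Collections.unmodifiableSet(
                EnumSet.of(PageField.TITLE, PageField.TEXT, PageField.FETCH_TIME));

        @Override
        public Collection<PageField> getFields() {
            return FIELDS;
        }
    }

    // A second plugin with different needs (assumed for the example).
    static final class AnchorFilterLike implements FieldAware {
        @Override
        public Collection<PageField> getFields() {
            return EnumSet.of(PageField.INLINKS);
        }
    }

    // The job setup would take the union of every plugin's declared fields.
    static Set<PageField> requiredFields(Collection<? extends FieldAware> plugins) {
        Set<PageField> all = EnumSet.noneOf(PageField.class);
        for (FieldAware plugin : plugins) {
            all.addAll(plugin.getFields());
        }
        return all;
    }
}
```

Collecting the union up front means fields that no plugin asks for (OUTLINKS in this example) never have to be loaded from the datastore.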
Copyright © 2013 The Apache Software Foundation