org.apache.nutch.indexer.more
Class MoreIndexingFilter
java.lang.Object
org.apache.nutch.indexer.more.MoreIndexingFilter
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, IndexingFilter, FieldPluggable, Pluggable
public class MoreIndexingFilter
- extends Object
- implements IndexingFilter
Adds (or resets) a few metadata properties as fields, when they are
available, so that they can be used accurately within the search index.
'lastModified' is indexed to support query by date; 'contentLength' is taken from the HTTP
header; 'type' is indexed to support query by type; and 'title' is reset
if a Content-Disposition hint exists. The reasoning is that the presence of such a hint indicates
that the content provider wants the filename therein to be used as the title.
TODO: content-length still needs to be made searchable.
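The title reset described above can be illustrated with a small, self-contained sketch. It is not the code of this class: the class name ContentDispositionTitleSketch, the method titleFromContentDisposition, and the regular expression are all assumptions made for illustration, and only the quoted filename="..." form of the header is handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentDispositionTitleSketch {
    // Assumption of this sketch: only the quoted form filename="..." or
    // filename='...' is recognized; unquoted and RFC 5987 forms are ignored.
    private static final Pattern FILENAME =
        Pattern.compile("\\bfilename=['\"]([^'\"]+)['\"]", Pattern.CASE_INSENSITIVE);

    /** Returns the filename hint from a Content-Disposition value, or null if none is present. */
    public static String titleFromContentDisposition(String header) {
        if (header == null) {
            return null;
        }
        Matcher m = FILENAME.matcher(header);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // A filename hint is present, so it would replace the title.
        System.out.println(titleFromContentDisposition("attachment; filename=\"report.pdf\"")); // prints report.pdf
        // No hint, so the existing title would be left alone.
        System.out.println(titleFromContentDisposition("inline")); // prints null
    }
}
```

A caller would overwrite the 'title' field only when the helper returns a non-null value, which mirrors the "reset only if a hint exists" behaviour described above.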
- Author:
- John Xing
Field Summary
static org.slf4j.Logger LOG
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOG
public static final org.slf4j.Logger LOG
MoreIndexingFilter
public MoreIndexingFilter()
filter
public NutchDocument filter(NutchDocument doc,
String url,
WebPage page)
throws IndexingException
- Description copied from interface:
IndexingFilter
- Adds fields or otherwise modifies the document that will be indexed for a
parse. Unwanted documents can be removed from indexing by returning a null value.
- Specified by:
filter
in interface IndexingFilter
- Parameters:
doc
- document instance for collecting fields
url
- page url
- Returns:
- modified (or a new) document instance, or null (meaning the document
should be discarded)
- Throws:
IndexingException
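The return contract of filter (a modified document, or null to discard it) can be sketched without any Nutch dependency. Everything in the block below is a hypothetical stand-in: the DocFilter interface, the use of a Map in place of NutchDocument, and the two example filters are assumptions chosen only to show how a caller might treat a null result.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FilterChainSketch {
    /** Hypothetical stand-in for an indexing filter; a Map plays the role of the document. */
    public interface DocFilter {
        Map<String, String> filter(Map<String, String> doc, String url);
    }

    /** Applies the filters in order; a null result means the document is discarded. */
    public static Map<String, String> runFilters(List<DocFilter> filters,
                                                 Map<String, String> doc,
                                                 String url) {
        for (DocFilter f : filters) {
            doc = f.filter(doc, url);
            if (doc == null) {
                return null; // discarded: later filters are not consulted
            }
        }
        return doc;
    }

    /** Two illustrative filters: one adds a field, one rejects some URLs. */
    public static List<DocFilter> exampleFilters() {
        DocFilter addType = (doc, url) -> {
            doc.put("type", "text/html");
            return doc;
        };
        DocFilter dropPrivate = (doc, url) -> url.contains("/private/") ? null : doc;
        return Arrays.asList(addType, dropPrivate);
    }

    public static void main(String[] args) {
        Map<String, String> kept = runFilters(exampleFilters(),
            new HashMap<>(), "http://example.com/a.html");
        System.out.println(kept); // prints {type=text/html}

        Map<String, String> dropped = runFilters(exampleFilters(),
            new HashMap<>(), "http://example.com/private/b.html");
        System.out.println(dropped); // prints null
    }
}
```

Stopping at the first null keeps the meaning of "discarded" unambiguous: once any filter rejects a document, nothing further is added to it.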
addIndexBackendOptions
public void addIndexBackendOptions(org.apache.hadoop.conf.Configuration conf)
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
- Specified by:
setConf
in interface org.apache.hadoop.conf.Configurable
getConf
public org.apache.hadoop.conf.Configuration getConf()
- Specified by:
getConf
in interface org.apache.hadoop.conf.Configurable
getFields
public Collection<WebPage.Field> getFields()
- Specified by:
getFields
in interface FieldPluggable
Copyright © 2013 The Apache Software Foundation