org.apache.nutch.analysis.lang
Class LanguageIndexingFilter

java.lang.Object
  extended by org.apache.nutch.analysis.lang.LanguageIndexingFilter
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, IndexingFilter, FieldPluggable, Pluggable

public class LanguageIndexingFilter
extends Object
implements IndexingFilter

An IndexingFilter that adds a lang (language) field to the document. It determines the document's language by checking whether HTMLLanguageParser has already added language information to the page.

Author:
Sami Siren, Jerome Charron

Field Summary
 
Fields inherited from interface org.apache.nutch.indexer.IndexingFilter
X_POINT_ID
 
Constructor Summary
LanguageIndexingFilter()
          Constructs a new Language Indexing Filter.
 
Method Summary
 void addIndexBackendOptions(org.apache.hadoop.conf.Configuration conf)
           
 NutchDocument filter(NutchDocument doc, String url, WebPage page)
          Adds fields or otherwise modifies the document that will be indexed for a parse.
 org.apache.hadoop.conf.Configuration getConf()
           
 Collection<WebPage.Field> getFields()
           
 void setConf(org.apache.hadoop.conf.Configuration conf)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LanguageIndexingFilter

public LanguageIndexingFilter()
Constructs a new Language Indexing Filter.

Method Detail

filter

public NutchDocument filter(NutchDocument doc,
                            String url,
                            WebPage page)
                     throws IndexingException
Description copied from interface: IndexingFilter
Adds fields or otherwise modifies the document that will be indexed for a parse. Unwanted documents can be removed from indexing by returning a null value.

Specified by:
filter in interface IndexingFilter
Parameters:
doc - document instance for collecting fields
url - page url
page - the WebPage being indexed
Returns:
modified (or a new) document instance, or null (meaning the document should be discarded)
Throws:
IndexingException
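
The contract described above (add a field, or return null to discard the document) can be illustrated with a minimal, self-contained sketch. It does not use the Nutch classes: SimpleDoc, SimpleFilter, LANG_FILTER and runFilters are hypothetical stand-ins, the page metadata is modelled as a plain Map, and the "unknown" fallback value is an assumption made for illustration rather than documented behaviour of this class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LangFilterSketch {

    /** Hypothetical stand-in for NutchDocument: maps a field name to its values. */
    static class SimpleDoc {
        final Map<String, List<String>> fields = new HashMap<>();

        void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
    }

    /** Hypothetical stand-in for the filter contract: return the document, or null to discard it. */
    interface SimpleFilter {
        SimpleDoc filter(SimpleDoc doc, String url, Map<String, String> pageMetadata);
    }

    /** Adds a "lang" field taken from metadata that an earlier parse step may have set. */
    static final SimpleFilter LANG_FILTER = (doc, url, meta) -> {
        String lang = meta.get("lang");
        if (lang == null || lang.isEmpty()) {
            lang = "unknown"; // assumption: fallback value chosen for this sketch
        }
        doc.add("lang", lang);
        return doc;
    };

    /** Runs the filters in order and stops as soon as one of them returns null. */
    static SimpleDoc runFilters(List<SimpleFilter> filters, SimpleDoc doc,
                                String url, Map<String, String> meta) {
        for (SimpleFilter f : filters) {
            doc = f.filter(doc, url, meta);
            if (doc == null) {
                return null; // the document is dropped from indexing
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        meta.put("lang", "fi");

        List<SimpleFilter> filters = new ArrayList<>();
        filters.add(LANG_FILTER);

        SimpleDoc doc = runFilters(filters, new SimpleDoc(), "http://example.org/", meta);
        System.out.println(doc.fields.get("lang")); // prints [fi]
    }
}
```

A caller written against this contract has to check the result for null before handing the document to the index writer, since any filter in the chain may discard it.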

getFields

public Collection<WebPage.Field> getFields()
Specified by:
getFields in interface FieldPluggable

addIndexBackendOptions

public void addIndexBackendOptions(org.apache.hadoop.conf.Configuration conf)

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)
Specified by:
setConf in interface org.apache.hadoop.conf.Configurable

getConf

public org.apache.hadoop.conf.Configuration getConf()
Specified by:
getConf in interface org.apache.hadoop.conf.Configurable


Copyright © 2013 The Apache Software Foundation