public class BasicIndexingFilter extends Object implements IndexingFilter
domain is included depending on indexer.add.domain in nutch-default.xml.
title is truncated as per indexer.max.title.length in nutch-default.xml. (As per NUTCH-1004, a zero-length title is not added.)
content is truncated as per indexer.max.content.length in nutch-default.xml.
Modifier and Type | Field and Description |
---|---|
static org.slf4j.Logger | LOG |
Fields inherited from interface IndexingFilter
X_POINT_ID
Constructor and Description |
---|
BasicIndexingFilter() |
Modifier and Type | Method and Description |
---|---|
NutchDocument | filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) The BasicIndexingFilter filter object, which supports a few configuration settings for adding basic searchable fields. |
org.apache.hadoop.conf.Configuration | getConf() Get the Configuration object. |
void | setConf(org.apache.hadoop.conf.Configuration conf) Set the Configuration object. |
public NutchDocument filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
The BasicIndexingFilter filter object, which supports a few configuration settings for adding basic searchable fields. See indexer.add.domain, indexer.max.title.length, and indexer.max.content.length in nutch-default.xml.
Specified by:
filter in interface IndexingFilter
Parameters:
doc - The NutchDocument object
parse - The relevant Parse object passing through the filter
url - URL to be filtered for anchor text
datum - The CrawlDatum entry
inlinks - The Inlinks containing anchor text
Throws:
IndexingException
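The truncation rules described for title and content can be sketched in plain Java as follows. This is a minimal illustration, not the Nutch source: the class TruncationSketch and its methods truncate and titleToIndex are hypothetical names, and treating a negative limit as "no truncation" is an assumption made for the sketch.

```java
// Illustrative sketch only; class and method names are hypothetical.
final class TruncationSketch {

    /**
     * Truncates value to at most maxLength characters.
     * Assumption: a negative limit means "do not truncate".
     */
    static String truncate(String value, int maxLength) {
        if (maxLength > -1 && value.length() > maxLength) {
            return value.substring(0, maxLength);
        }
        return value;
    }

    /**
     * Returns the title to index, or null when no title field should be
     * added (a zero-length title is skipped, as noted for NUTCH-1004).
     */
    static String titleToIndex(String rawTitle, int maxTitleLength) {
        String title = truncate(rawTitle, maxTitleLength);
        return title.isEmpty() ? null : title;
    }

    public static void main(String[] args) {
        // A long title is cut down to the configured limit.
        System.out.println(titleToIndex("Apache Nutch crawler", 6));
        // An empty title yields null, meaning "add no title field".
        System.out.println(titleToIndex("", 6));
        // A negative limit leaves the content untouched in this sketch.
        System.out.println(truncate("page body text", -1));
    }
}
```

In a real filter the two limits would come from indexer.max.title.length and indexer.max.content.length rather than from method arguments.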
public void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object.
Specified by:
setConf in interface org.apache.hadoop.conf.Configurable
public org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object.
Specified by:
getConf in interface org.apache.hadoop.conf.Configurable
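One common way to use a setConf/getConf pair is to read the relevant settings once when the configuration is set and hand the same object back from getConf. The sketch below shows that pattern under stated assumptions: java.util.Properties stands in for org.apache.hadoop.conf.Configuration so the example has no dependencies, the class ConfigurableSketch and its accessors are hypothetical, and the fallback values are placeholders for the sketch rather than values confirmed by this page.

```java
import java.util.Properties;

// Illustrative sketch only: Properties stands in for the Hadoop
// Configuration class, and all names here are hypothetical.
final class ConfigurableSketch {
    private Properties conf;
    private boolean addDomain;
    private int maxTitleLength;
    private int maxContentLength;

    /** Stores the configuration and caches the three indexer settings. */
    void setConf(Properties conf) {
        this.conf = conf;
        // Fallback values below are placeholders chosen for this sketch.
        this.addDomain = Boolean.parseBoolean(
                conf.getProperty("indexer.add.domain", "false"));
        this.maxTitleLength = Integer.parseInt(
                conf.getProperty("indexer.max.title.length", "100"));
        this.maxContentLength = Integer.parseInt(
                conf.getProperty("indexer.max.content.length", "-1"));
    }

    /** Returns the same configuration object that was passed to setConf. */
    Properties getConf() {
        return conf;
    }

    boolean addDomain() { return addDomain; }
    int maxTitleLength() { return maxTitleLength; }
    int maxContentLength() { return maxContentLength; }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("indexer.max.title.length", "50");

        ConfigurableSketch sketch = new ConfigurableSketch();
        sketch.setConf(props);

        // The explicit setting wins; the others fall back to placeholders.
        System.out.println(sketch.maxTitleLength());
        System.out.println(sketch.maxContentLength());
        System.out.println(sketch.addDomain());
    }
}
```

Caching the values in setConf keeps per-document work in filter free of repeated configuration lookups, which is the main reason to structure the class this way.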
Copyright © 2014 The Apache Software Foundation