public class AnchorIndexingFilter extends Object implements IndexingFilter
Indexing filter that supports a boolean configuration setting for the deduplication of anchors. See anchorIndexingFilter.deduplicate in nutch-default.xml.
Modifier and Type | Field and Description |
---|---|
static org.slf4j.Logger | LOG |
Fields inherited from interface IndexingFilter: X_POINT_ID
Constructor and Description |
---|
AnchorIndexingFilter() |
Modifier and Type | Method and Description |
---|---|
NutchDocument | filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) The AnchorIndexingFilter filter object, which supports boolean configuration settings for the deduplication of anchors. |
org.apache.hadoop.conf.Configuration | getConf() Get the Configuration object. |
void | setConf(org.apache.hadoop.conf.Configuration conf) Set the Configuration object. |
public void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object.
Specified by: setConf in interface org.apache.hadoop.conf.Configurable
public org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object.
Specified by: getConf in interface org.apache.hadoop.conf.Configurable
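The setConf/getConf pair can be illustrated with a minimal sketch of how a filter might read its deduplication flag when it is configured. Everything here is a hypothetical stand-in: `ConfigurableFilterSketch` is not the Nutch class, a plain `Map` replaces `org.apache.hadoop.conf.Configuration` so the example runs without Hadoop, and the default of `false` is an assumption rather than something this page states.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a configurable indexing filter.
// A Map<String, String> plays the role of the Hadoop Configuration object.
class ConfigurableFilterSketch {

  // Property name taken from this page; its default value is assumed below.
  static final String DEDUP_KEY = "anchorIndexingFilter.deduplicate";

  private Map<String, String> conf = new HashMap<>();
  private boolean deduplicate = false;

  // Store the configuration and cache the boolean flag once,
  // so later calls do not need to look the property up again.
  void setConf(Map<String, String> conf) {
    this.conf = (conf != null) ? conf : new HashMap<>();
    this.deduplicate =
        Boolean.parseBoolean(this.conf.getOrDefault(DEDUP_KEY, "false"));
  }

  // Return the configuration object that was last set.
  Map<String, String> getConf() {
    return conf;
  }

  boolean isDeduplicate() {
    return deduplicate;
  }
}
```

With the real Hadoop class, the equivalent lookup would likely be `conf.getBoolean("anchorIndexingFilter.deduplicate", false)`; whether AnchorIndexingFilter caches the value in exactly this way is not confirmed by this page.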
public NutchDocument filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
The AnchorIndexingFilter filter object, which supports boolean configuration settings for the deduplication of anchors. See anchorIndexingFilter.deduplicate in nutch-default.xml.
Specified by: filter in interface IndexingFilter
Parameters:
doc - The NutchDocument object
parse - The relevant Parse object passing through the filter
url - URL to be filtered for anchor text
datum - The CrawlDatum entry
inlinks - The Inlinks containing anchor text
Throws:
IndexingException
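The effect of the deduplication setting on the anchors passed to filter can be sketched as follows. This is an illustration only, not the Nutch source: `AnchorDedupSketch` and `selectAnchors` are hypothetical names, the anchors are supplied as a plain `String[]` instead of being read from an Inlinks object, and treating anchors that differ only in case as duplicates is an assumption.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper showing which anchors would be kept for indexing.
class AnchorDedupSketch {

  // Returns the anchors to index, in their original order.
  // When deduplicate is false, every non-null anchor is kept.
  // When deduplicate is true, only the first occurrence of each anchor
  // is kept, compared case-insensitively (an assumption of this sketch).
  static List<String> selectAnchors(String[] anchors, boolean deduplicate) {
    List<String> kept = new ArrayList<>();
    if (anchors == null) {
      return kept; // no inbound anchor text at all
    }
    Set<String> seen = new HashSet<>();
    for (String anchor : anchors) {
      if (anchor == null) {
        continue; // skip missing anchor text
      }
      if (!deduplicate) {
        kept.add(anchor);
        continue;
      }
      // Set.add returns false when the key was already present.
      if (seen.add(anchor.toLowerCase(Locale.ROOT))) {
        kept.add(anchor);
      }
    }
    return kept;
  }
}
```

Keeping the first spelling of each anchor preserves the text as it appeared on the linking page, while the lowercased key is used only for comparison.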
Copyright © 2014 The Apache Software Foundation