org.apache.nutch.indexer.urlmeta
Class URLMetaIndexingFilter
java.lang.Object
org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
- All Implemented Interfaces:
- Configurable, IndexingFilter, Pluggable
public class URLMetaIndexingFilter
- extends Object
- implements IndexingFilter
This is part of the URL Meta plugin. It is designed to enhance the NUTCH-655
patch by doing two things:
1. Meta tags that are supplied with your crawl URLs during injection are
propagated to the outlinks of those crawl URLs.
2. When you index your URLs, the meta tags that you specified with them are
indexed alongside those URLs and can be queried directly, provided the
configuration described below is in place.
The flat-file of URLs you are injecting should, per NUTCH-655, be
tab-delimited in the form of:
[www.url.com]\t[key1]=[value1]\t[key2]=[value2]...[keyN]=[valueN]
Be aware that keys which collide with keywords already in use (such as
nutch.score or nutch.fetchInterval) will cause unpredictable behavior.
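The tab-delimited layout above can be illustrated with a small parser. This is only a sketch of the file format, not the injector's actual code: the class SeedLineSketch, the nested SeedEntry type, and the sample keys "lang" and "topic" are hypothetical names chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SeedLineSketch {

    /** Hypothetical holder for one parsed seed line: the URL plus its key=value pairs. */
    public static final class SeedEntry {
        public final String url;
        public final Map<String, String> meta;

        SeedEntry(String url, Map<String, String> meta) {
            this.url = url;
            this.meta = meta;
        }
    }

    /** Splits a line of the form url\tkey1=value1\tkey2=value2 into its parts. */
    public static SeedEntry parse(String line) {
        String[] fields = line.split("\t");
        Map<String, String> meta = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq <= 0) {
                continue; // assumption of this sketch: silently skip fields without a key
            }
            String key = fields[i].substring(0, eq).trim();
            String value = fields[i].substring(eq + 1).trim();
            meta.put(key, value);
        }
        return new SeedEntry(fields[0].trim(), meta);
    }

    public static void main(String[] args) {
        SeedEntry entry = parse("http://www.url.com/\tlang=en\ttopic=news");
        System.out.println(entry.url + " -> " + entry.meta);
    }
}
```

Every key that appears in such a file (here "lang" and "topic") would also need to be listed in the "urlmeta.tags" property for it to be propagated and indexed.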
Furthermore, in your nutch-site.xml config, you must specify (1) that this
plugin is to be used and (2) which meta tags it should actively look for.
You do not have to use these tags for every URL, but you must list _all_ of
the meta tags that you have specified if you want them to be propagated and
indexed.
1. As of Nutch 1.2, the property "plugin.includes" looks as follows:
protocol-http|urlfilter-regex|parse-(text|html|js|tika|rss)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
You must change "index-(basic|anchor)" to "index-(basic|anchor|urlmeta)" in
order to call this plugin.
2. You must also specify the property "urlmeta.tags", whose value is a
comma-delimited list of keys, for example: key1, key2, key3
TODO: It may be ideal to offer two separate properties, to specify what gets
indexed versus merely propagated.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
URLMetaIndexingFilter
public URLMetaIndexingFilter()
filter
public NutchDocument filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
throws IndexingException
- This takes the meta tags that you have listed in your "urlmeta.tags"
property and looks for them inside the CrawlDatum object. Each tag that
exists there is added as an attribute of the NutchDocument.
- Specified by:
filter
in interface IndexingFilter
- Parameters:
doc - document instance for collecting fields
parse - parse data instance
url - page url
datum - crawl datum for the page
inlinks - page inlinks
- Returns:
- modified (or a new) document instance, or null (meaning the document
should be discarded)
- Throws:
IndexingException
- See Also:
IndexingFilter.filter(org.apache.nutch.indexer.NutchDocument, org.apache.nutch.parse.Parse, org.apache.hadoop.io.Text, org.apache.nutch.crawl.CrawlDatum, org.apache.nutch.crawl.Inlinks)
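The behavior described for filter can be sketched without any Nutch classes. In the example below, plain maps stand in for the CrawlDatum metadata and the NutchDocument fields. The class UrlMetaFilterSketch and the method copyTags are hypothetical, and the real filter may differ in details such as value types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UrlMetaFilterSketch {

    /**
     * Copies each configured tag from the datum metadata into the document.
     * datumMeta stands in for the CrawlDatum metadata and docFields for the
     * NutchDocument; both are simplifications made for this sketch.
     */
    public static Map<String, String> copyTags(String[] urlMetaTags,
                                               Map<String, String> datumMeta,
                                               Map<String, String> docFields) {
        if (urlMetaTags == null) {
            return docFields; // nothing configured, document is left untouched
        }
        for (String tag : urlMetaTags) {
            String value = datumMeta.get(tag);
            if (value != null) {
                docFields.put(tag, value); // only tags present on the datum are added
            }
        }
        return docFields;
    }

    public static void main(String[] args) {
        Map<String, String> datumMeta = new LinkedHashMap<>();
        datumMeta.put("lang", "en");
        datumMeta.put("unlisted", "ignored"); // not in the tag list, so not copied

        Map<String, String> docFields = new LinkedHashMap<>();
        docFields.put("url", "http://www.url.com/");

        String[] tags = {"lang", "topic"}; // "topic" is absent from the datum
        System.out.println(copyTags(tags, datumMeta, docFields));
    }
}
```

In this sketch a listed tag that is missing from the datum is skipped, and metadata that is not listed in the tag array is never copied, which mirrors the requirement to list every tag you want indexed.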
getConf
public Configuration getConf()
- Boilerplate: returns the current configuration.
- Specified by:
getConf
in interface Configurable
setConf
public void setConf(Configuration conf)
- Handles conf assignment and reads the list of tags from the
"urlmeta.tags" property
- Specified by:
setConf
in interface Configurable
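As an illustration of how a comma-delimited "urlmeta.tags" value might be turned into a list of keys, here is a minimal sketch. The class UrlMetaTagsSketch and the method parseTags are hypothetical. Trimming whitespace and dropping empty entries are assumptions of this example, not documented behavior of the plugin.

```java
import java.util.ArrayList;
import java.util.List;

public class UrlMetaTagsSketch {

    /** Splits a raw property value such as "key1, key2, key3" into individual keys. */
    public static List<String> parseTags(String raw) {
        List<String> tags = new ArrayList<>();
        if (raw == null) {
            return tags; // property not set: no tags to look for
        }
        for (String part : raw.split(",")) {
            String tag = part.trim(); // assumption: surrounding whitespace is not significant
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(parseTags("key1, key2, key3"));
    }
}
```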
Copyright © 2012 The Apache Software Foundation