public class GeoIPIndexingFilter extends Object implements IndexingFilter
This plugin implements an indexing filter which takes advantage of the GeoIP2-java API.
The third-party library distribution provides an API for the GeoIP2 Precision web services and databases. The API also works with the free GeoLite2 databases.
Depending on the service level agreement you have with the GeoIP service provider, the plugin can add a number of the following fields to the index data model:
Some of these services are documented on the GeoIP2 Precision Services webpage, where more information is available.
You should also consult the following three properties in nutch-site.xml:
<!-- index-geoip plugin properties -->
<property>
<name>index.geoip.usage</name>
<value>insightsService</value>
<description>
A string representing the information source to be used for GeoIP information
association. Enter one of 'cityDatabase', 'connectionTypeDatabase',
'domainDatabase', 'ispDatabase' or 'insightsService'. If you use one of the
database options, make the corresponding file (GeoIP2-City.mmdb, GeoIP2-Connection-Type.mmdb,
GeoIP2-Domain.mmdb or GeoIP2-ISP.mmdb, respectively) available on the Hadoop classpath
at runtime. This can be achieved by adding it to $NUTCH_HOME/conf.
</description>
</property>
<property>
<name>index.geoip.userid</name>
<value></value>
<description>
The userId associated with the GeoIP2 Precision Services account.
</description>
</property>
<property>
<name>index.geoip.licensekey</name>
<value></value>
<description>
The license key associated with the GeoIP2 Precision Services account.
</description>
</property>
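The relationship between the `index.geoip.usage` value and the file or credentials it calls for can be sketched as below. This is a minimal illustration built only from the property descriptions above, not the plugin's actual code: the class name `GeoIpUsageCheck` and its methods are hypothetical, and the rule that only `insightsService` needs `index.geoip.userid` and `index.geoip.licensekey` is an assumption inferred from the descriptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Nutch: checks an index.geoip.usage value
// and reports which .mmdb file (if any) it expects on the classpath.
public class GeoIpUsageCheck {

    // Database-backed usage values mapped to their files, following the
    // "respectively" ordering in the property description.
    private static final Map<String, String> DATABASE_FILES = new LinkedHashMap<>();
    static {
        DATABASE_FILES.put("cityDatabase", "GeoIP2-City.mmdb");
        DATABASE_FILES.put("connectionTypeDatabase", "GeoIP2-Connection-Type.mmdb");
        DATABASE_FILES.put("domainDatabase", "GeoIP2-Domain.mmdb");
        DATABASE_FILES.put("ispDatabase", "GeoIP2-ISP.mmdb");
    }

    private static final String INSIGHTS_SERVICE = "insightsService";

    // True if the value is one of the five documented options.
    static boolean isValidUsage(String usage) {
        return INSIGHTS_SERVICE.equals(usage) || DATABASE_FILES.containsKey(usage);
    }

    // The database file a usage value expects; empty for the web service
    // and for unknown values.
    static Optional<String> databaseFileFor(String usage) {
        return Optional.ofNullable(DATABASE_FILES.get(usage));
    }

    // Assumption: only the web service option uses the account credentials.
    static boolean needsCredentials(String usage) {
        return INSIGHTS_SERVICE.equals(usage);
    }

    public static void main(String[] args) {
        String usage = args.length > 0 ? args[0] : INSIGHTS_SERVICE;
        if (!isValidUsage(usage)) {
            System.err.println("Unsupported index.geoip.usage value: " + usage);
            return;
        }
        databaseFileFor(usage).ifPresent(
            file -> System.out.println("Place " + file + " in $NUTCH_HOME/conf"));
        if (needsCredentials(usage)) {
            System.out.println("Set index.geoip.userid and index.geoip.licensekey");
        }
    }
}
```

Running the sketch with `ispDatabase` as its argument reports the file to place in `$NUTCH_HOME/conf`; running it with no argument falls back to `insightsService`, the value shown in the sample configuration.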
Constructor and Description |
---|
GeoIPIndexingFilter(): Default constructor for this plugin. |
Modifier and Type | Method and Description |
---|---|
NutchDocument | filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks): Adds fields or otherwise modifies the document that will be indexed for a parse. |
Configuration | getConf() |
void | setConf(Configuration conf) |
public GeoIPIndexingFilter()
public Configuration getConf()
Specified by: getConf in interface Configurable
See Also: Configurable.getConf()
public void setConf(Configuration conf)
Specified by: setConf in interface Configurable
See Also: Configurable.setConf(org.apache.hadoop.conf.Configuration)
public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
Specified by: filter in interface IndexingFilter
Parameters:
doc - document instance for collecting fields
parse - parse data instance
url - page url
datum - crawl datum for the page (fetch datum from segment containing fetch status and fetch time)
inlinks - page inlinks
Throws: IndexingException
See Also: IndexingFilter.filter(org.apache.nutch.indexer.NutchDocument, org.apache.nutch.parse.Parse, org.apache.hadoop.io.Text, org.apache.nutch.crawl.CrawlDatum, org.apache.nutch.crawl.Inlinks)
Copyright © 2015 The Apache Software Foundation