Package | Description |
---|---|
org.apache.nutch.analysis.lang |
Text document language identifier.
|
org.apache.nutch.indexer |
Index content, configure and run indexing and cleaning jobs to
add, update, and delete documents from an index.
|
org.apache.nutch.indexer.anchor |
An indexing plugin for inbound anchor text.
|
org.apache.nutch.indexer.basic |
A basic indexing plugin, adds basic fields: url, host, title, content, etc.
|
org.apache.nutch.indexer.feed |
Indexing filter to index meta data from RSS feeds.
|
org.apache.nutch.indexer.filter | |
org.apache.nutch.indexer.geoip |
This plugin implements an indexing filter which takes
advantage of the
GeoIP2-java API.
|
org.apache.nutch.indexer.metadata |
Indexing filter to add document metadata to the index.
|
org.apache.nutch.indexer.more |
A more indexing plugin, adds "more" index fields:
last modified date, MIME type, content length.
|
org.apache.nutch.indexer.staticfield |
A simple plugin called at indexing that adds fields with static data.
|
org.apache.nutch.indexer.subcollection |
Indexing filter to assign documents to subcollections.
|
org.apache.nutch.indexer.tld |
Top Level Domain Indexing plugin.
|
org.apache.nutch.indexer.urlmeta |
URL Meta Tag Indexing Plugin
|
org.apache.nutch.microformats.reltag |
A microformats Rel-Tag
Parser/Indexer/Querier plugin.
|
org.creativecommons.nutch |
Sample plugins that parse and index Creative Commons medadata.
|
Modifier and Type | Method and Description |
---|---|
NutchDocument |
LanguageIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
IndexingFilters.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
Run all defined filters.
|
NutchDocument |
IndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
Adds fields or otherwise modifies the document that will be indexed for a
parse.
|
Modifier and Type | Method and Description |
---|---|
NutchDocument |
AnchorIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
The
AnchorIndexingFilter filter object which supports boolean
configuration settings for the deduplication of anchors. |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
BasicIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
The
BasicIndexingFilter filter object which supports few
configuration settings for adding basic searchable fields. |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
FeedIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
Extracts out the relevant fields:
FEED_AUTHOR
FEED_TAGS
FEED_PUBLISHED
FEED_UPDATED
FEED
And sends them to the
Indexer for indexing within the Nutch index. |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
MimeTypeIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
static void |
MimeTypeIndexingFilter.main(String[] args)
Main method for invoking this tool
|
Modifier and Type | Method and Description |
---|---|
NutchDocument |
GeoIPIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
MetadataIndexer.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
MoreIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
StaticFieldIndexer.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
The
StaticFieldIndexer filter object which adds fields as per
configuration setting. |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
SubcollectionIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
TLDIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text urlText,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
URLMetaIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
This will take the metatags that you have listed in your "urlmeta.tags"
property, and looks for them inside the CrawlDatum object.
|
Modifier and Type | Method and Description |
---|---|
NutchDocument |
RelTagIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
CCIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Copyright © 2015 The Apache Software Foundation