Modifier and Type | Method and Description |
---|---|
NutchDocument | LanguageIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
abstract byte[] | Signature.calculate(Content content, Parse parse) |
byte[] | MD5Signature.calculate(Content content, Parse parse) |
byte[] | TextProfileSignature.calculate(Content content, Parse parse) |
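
Each `calculate` implementation listed above reduces a page to a `byte[]` signature. The following is a minimal sketch of that idea, assuming an MD5-based signature simply hashes the raw content bytes (the listing itself does not say so). `SignatureSketch` and `Md5SignatureSketch` are hypothetical stand-ins, not the Nutch classes, and plain `byte[]`/`String` values replace `Content` and `Parse`.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for the abstract Signature class (not the Nutch type).
abstract class SignatureSketch {
    // Mirrors the shape of calculate(Content, Parse) using plain values.
    abstract byte[] calculate(byte[] rawContent, String parsedText);
}

// Assumption: an MD5 signature hashes only the raw content bytes.
class Md5SignatureSketch extends SignatureSketch {
    @Override
    byte[] calculate(byte[] rawContent, String parsedText) {
        try {
            return MessageDigest.getInstance("MD5").digest(rawContent);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Helper for printing a signature as lowercase hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(new Md5SignatureSketch().calculate(raw, "hello")));
    }
}
```

Two pages with byte-identical content produce the same signature under this sketch, which is what makes such a value usable for duplicate detection.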
Modifier and Type | Method and Description |
---|---|
NutchDocument | IndexingFilters.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks): Run all defined filters. |
NutchDocument | IndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks): Adds fields or otherwise modifies the document that will be indexed for a parse. |
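
The relationship between a single filter and "run all defined filters" can be sketched as a simple chain. This is an illustration only: `DocFilterSketch` and `FilterChainSketch` are hypothetical names, a `Map<String, String>` stands in for `NutchDocument`, and the rule that a `null` result drops the document and stops the chain is an assumption rather than something stated in the listing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one indexing filter. The real method takes
// (NutchDocument, Parse, Text, CrawlDatum, Inlinks); plain types are used here.
interface DocFilterSketch {
    Map<String, String> filter(Map<String, String> doc, String parseText, String url);
}

// Hypothetical stand-in for the class that runs all defined filters.
class FilterChainSketch {
    private final List<DocFilterSketch> filters = new ArrayList<>();

    FilterChainSketch add(DocFilterSketch f) {
        filters.add(f);
        return this;
    }

    // Run every filter in order. Assumption: a null result means the
    // document was dropped, so the remaining filters are skipped.
    Map<String, String> filter(Map<String, String> doc, String parseText, String url) {
        for (DocFilterSketch f : filters) {
            doc = f.filter(doc, parseText, url);
            if (doc == null) {
                return null;
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        FilterChainSketch chain = new FilterChainSketch()
            .add((d, t, u) -> { d.put("url", u); return d; })
            .add((d, t, u) -> { d.put("length", String.valueOf(t.length())); return d; });
        System.out.println(chain.filter(new HashMap<>(), "hello", "http://example.com/"));
    }
}
```

Each filter receives the document produced by the previous one, so later filters can see or overwrite fields added earlier.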
Modifier and Type | Method and Description |
---|---|
NutchDocument | AnchorIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks): The AnchorIndexingFilter filter object, which supports boolean configuration settings for the deduplication of anchors. |
Modifier and Type | Method and Description |
---|---|
NutchDocument | BasicIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks): The BasicIndexingFilter filter object, which supports a few configuration settings for adding basic searchable fields. |
Modifier and Type | Method and Description |
---|---|
NutchDocument | FeedIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks): Extracts the relevant fields (FEED_AUTHOR, FEED_TAGS, FEED_PUBLISHED, FEED_UPDATED, FEED) and sends them to the Indexer for indexing within the Nutch index. |
Modifier and Type | Method and Description |
---|---|
NutchDocument | MetadataIndexer.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument | MoreIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument | StaticFieldIndexer.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks): The StaticFieldIndexer filter object, which adds fields as specified by the configuration settings. |
Modifier and Type | Method and Description |
---|---|
NutchDocument | SubcollectionIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument | TLDIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text urlText, CrawlDatum datum, Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument | URLMetaIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks): Takes the metatags listed in your "urlmeta.tags" property and looks for them inside the CrawlDatum object. |
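
The lookup described for URLMetaIndexingFilter can be pictured as follows. This is a hedged sketch, not the plugin's code: `UrlMetaSketch` and `copyTags` are hypothetical names, plain maps stand in for the `CrawlDatum` metadata and the `NutchDocument`, and treating the "urlmeta.tags" value as a comma-separated list is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of copying configured tags from datum metadata
// into the document being indexed.
class UrlMetaSketch {
    // tagsProperty stands in for the value of "urlmeta.tags"
    // (assumed here to be comma-separated).
    static Map<String, String> copyTags(String tagsProperty,
                                        Map<String, String> datumMeta,
                                        Map<String, String> doc) {
        if (tagsProperty == null || tagsProperty.trim().isEmpty()) {
            return doc; // nothing configured, leave the document unchanged
        }
        for (String tag : tagsProperty.split(",")) {
            String name = tag.trim();
            String value = datumMeta.get(name);
            // Only tags that are both configured and present are copied.
            if (!name.isEmpty() && value != null) {
                doc.put(name, value);
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> datumMeta = new HashMap<>();
        datumMeta.put("author", "alice");
        datumMeta.put("internal", "not-listed");
        System.out.println(copyTags("author, topic", datumMeta, new HashMap<>()));
    }
}
```

Metadata keys that are not named in the property stay out of the document, so the property acts as an allow-list.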
Modifier and Type | Method and Description |
---|---|
NutchDocument | RelTagIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) |
Modifier and Type | Class and Description |
---|---|
class | ParseImpl: The result of parsing a page's raw content. |
Modifier and Type | Method and Description |
---|---|
Parse | ParseResult.get(String key): Retrieve a single parse output. |
Parse | ParseResult.get(org.apache.hadoop.io.Text key): Retrieve a single parse output. |
Parse | ParseStatus.getEmptyParse(org.apache.hadoop.conf.Configuration conf): A convenience method. |
Modifier and Type | Method and Description |
---|---|
org.apache.hadoop.mapred.RecordWriter<org.apache.hadoop.io.Text,Parse> | ParseOutputFormat.getRecordWriter(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.mapred.JobConf job, String name, org.apache.hadoop.util.Progressable progress) |
Iterator<Map.Entry<org.apache.hadoop.io.Text,Parse>> | ParseResult.iterator(): Iterate over all entries in the <url, Parse> map. |
Modifier and Type | Method and Description |
---|---|
static ParseResult | ParseResult.createParseResult(String url, Parse parse): Convenience method for obtaining a ParseResult from a single Parse output. |
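
The ParseResult methods listed above (`get`, `iterator`, `createParseResult`) describe a container that maps URLs to parse outputs. A minimal sketch of that shape follows, with `String` standing in for both `org.apache.hadoop.io.Text` and `Parse`. `ParseResultSketch` and its `put` method are hypothetical and only illustrate the access pattern.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for ParseResult: String replaces both Text and Parse.
class ParseResultSketch implements Iterable<Map.Entry<String, String>> {
    private final Map<String, String> parses = new LinkedHashMap<>();

    // Wrap a single parse output, in the spirit of createParseResult(url, parse).
    static ParseResultSketch createParseResult(String url, String parse) {
        ParseResultSketch result = new ParseResultSketch();
        result.put(url, parse);
        return result;
    }

    // Hypothetical helper for adding further entries.
    void put(String url, String parse) {
        parses.put(url, parse);
    }

    // Retrieve a single parse output, or null if the key is unknown.
    String get(String url) {
        return parses.get(url);
    }

    // Iterate over all entries in the <url, parse> map, in insertion order.
    @Override
    public Iterator<Map.Entry<String, String>> iterator() {
        return parses.entrySet().iterator();
    }

    public static void main(String[] args) {
        ParseResultSketch result = createParseResult("http://example.com/", "main page");
        result.put("http://example.com/feed#1", "feed entry");
        for (Map.Entry<String, String> entry : result) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

Implementing `Iterable` is what lets callers use a for-each loop over the entries, as `main` does.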
Constructor and Description |
---|
ParseImpl(Parse parse) |
Modifier and Type | Method and Description |
---|---|
float | ScoringFilter.indexerScore(org.apache.hadoop.io.Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore): This method calculates a Lucene document boost. |
float | AbstractScoringFilter.indexerScore(org.apache.hadoop.io.Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) |
float | ScoringFilters.indexerScore(org.apache.hadoop.io.Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) |
void | ScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url, Content content, Parse parse): Currently a part of score distribution is performed using only data coming from the parsing process. |
void | AbstractScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url, Content content, Parse parse) |
void | ScoringFilters.passScoreAfterParsing(org.apache.hadoop.io.Text url, Content content, Parse parse) |
Modifier and Type | Method and Description |
---|---|
float | LinkAnalysisScoringFilter.indexerScore(org.apache.hadoop.io.Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) |
void | LinkAnalysisScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url, Content content, Parse parse) |
Modifier and Type | Method and Description |
---|---|
float | OPICScoringFilter.indexerScore(org.apache.hadoop.io.Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore): Dampen the boost value by scorePower. |
void | OPICScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url, Content content, Parse parse): Copy the value from Content metadata under Fetcher.SCORE_KEY to parseData. |
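
"Dampen the boost value by scorePower" can be read as raising the stored score to a power before it is used as a boost. The sketch below assumes that reading, and further assumes the result is multiplied by `initScore`. Neither detail is confirmed by this listing, and `OpicBoostSketch` is a hypothetical name.

```java
// Hypothetical illustration of dampening a score with an exponent.
class OpicBoostSketch {
    // dbScore stands in for the score stored on the CrawlDatum (assumption).
    // With 0 < scorePower < 1, scores above 1 are compressed toward 1.
    static float indexerScore(float dbScore, float scorePower, float initScore) {
        return (float) Math.pow(dbScore, scorePower) * initScore;
    }

    public static void main(String[] args) {
        // A score of 100 with scorePower 0.5 is dampened before scaling.
        System.out.println(indexerScore(100f, 0.5f, 1.0f));
    }
}
```

An exponent below 1 keeps very highly scored pages from dominating the boost, while an exponent of exactly 1 leaves the score undampened.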
Modifier and Type | Method and Description |
---|---|
float | TLDScoringFilter.indexerScore(org.apache.hadoop.io.Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) |
void | TLDScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url, Content content, Parse parse) |
Modifier and Type | Method and Description |
---|---|
float | URLMetaScoringFilter.indexerScore(org.apache.hadoop.io.Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore): Boilerplate |
void | URLMetaScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url, Content content, Parse parse): Takes the metadata, which was lumped inside the content, and replicates it within your parse data. |
Modifier and Type | Method and Description |
---|---|
NutchDocument | CCIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) |
Copyright © 2014 The Apache Software Foundation