Package | Description |
---|---|
org.apache.nutch.analysis.lang | Text document language identifier. |
org.apache.nutch.crawl | Crawl control code. |
org.apache.nutch.fetcher | The Nutch robot. |
org.apache.nutch.indexer | Maintain Lucene full-text indexes. |
org.apache.nutch.indexer.anchor | An indexing plugin for inbound anchor text. |
org.apache.nutch.indexer.basic | A basic indexing plugin. |
org.apache.nutch.indexer.feed | |
org.apache.nutch.indexer.metadata | |
org.apache.nutch.indexer.more | A "more" indexing plugin. |
org.apache.nutch.indexer.staticfield | A simple plugin called at indexing time that adds fields with static data. |
org.apache.nutch.indexer.subcollection | |
org.apache.nutch.indexer.tld | Top-level domain indexing plugin. |
org.apache.nutch.indexer.urlmeta | URL meta tag indexing plugin. |
org.apache.nutch.microformats.reltag | A microformats Rel-Tag parser/indexer/querier plugin. |
org.apache.nutch.protocol | |
org.apache.nutch.protocol.file | Protocol plugin which supports retrieving local file resources. |
org.apache.nutch.protocol.ftp | Protocol plugin which supports retrieving documents via the FTP protocol. |
org.apache.nutch.protocol.http | Protocol plugin which supports retrieving documents via the HTTP protocol. |
org.apache.nutch.protocol.http.api | Common API used by the HTTP plugins (http, httpclient). |
org.apache.nutch.protocol.httpclient | Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes against web servers as well as proxy servers. |
org.apache.nutch.scoring | |
org.apache.nutch.scoring.link | |
org.apache.nutch.scoring.opic | |
org.apache.nutch.scoring.tld | Top-level domain scoring plugin. |
org.apache.nutch.scoring.urlmeta | URL meta tag scoring plugin. |
org.apache.nutch.scoring.webgraph | |
org.apache.nutch.segment | |
org.apache.nutch.tools | |
org.creativecommons.nutch | Sample plugins that parse and index Creative Commons metadata. |
Modifier and Type | Method and Description |
---|---|
NutchDocument | LanguageIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) |
Modifier and Type | Field and Description |
---|---|
CrawlDatum | Generator.SelectorEntry.datum |
Modifier and Type | Method and Description |
---|---|
CrawlDatum | FetchSchedule.forceRefetch(org.apache.hadoop.io.Text url, CrawlDatum datum, boolean asap) Resets fetchTime, fetchInterval, modifiedTime and the page signature, so that it forces refetching. |
CrawlDatum | AbstractFetchSchedule.forceRefetch(org.apache.hadoop.io.Text url, CrawlDatum datum, boolean asap) Resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and the page signature, so that it forces refetching. |
CrawlDatum | CrawlDbReader.get(String crawlDb, String url, org.apache.hadoop.conf.Configuration config) |
CrawlDatum | FetchSchedule.initializeSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum) Initialize fetch schedule related data. |
CrawlDatum | AbstractFetchSchedule.initializeSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum) Initialize fetch schedule related data. |
static CrawlDatum | CrawlDatum.read(DataInput in) |
CrawlDatum | FetchSchedule.setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) Sets the fetchInterval and fetchTime on a successfully fetched page. |
CrawlDatum | AbstractFetchSchedule.setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) Sets the fetchInterval and fetchTime on a successfully fetched page. |
CrawlDatum | AdaptiveFetchSchedule.setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) |
CrawlDatum | DefaultFetchSchedule.setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) |
CrawlDatum | MimeAdaptiveFetchSchedule.setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) |
CrawlDatum | FetchSchedule.setPageGoneSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime) Specifies how to schedule refetching of pages marked as GONE. |
CrawlDatum | AbstractFetchSchedule.setPageGoneSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime) Specifies how to schedule refetching of pages marked as GONE. |
CrawlDatum | FetchSchedule.setPageRetrySchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime) Adjusts the fetch schedule if fetching needs to be retried due to transient errors. |
CrawlDatum | AbstractFetchSchedule.setPageRetrySchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime) Adjusts the fetch schedule if fetching needs to be retried due to transient errors. |
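For illustration, a minimal sketch of a custom schedule built on the FetchSchedule methods above; the class FixedDelayFetchSchedule and its one-week interval are hypothetical, not part of Nutch, and only the overridden method is shown.

```java
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.AbstractFetchSchedule;
import org.apache.nutch.crawl.CrawlDatum;

// Hypothetical schedule: always refetch after a fixed delay, regardless of
// whether the page was modified. The signature follows the table above.
public class FixedDelayFetchSchedule extends AbstractFetchSchedule {

  private static final int DELAY_SECONDS = 7 * 24 * 3600; // one week

  @Override
  public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime,
      long fetchTime, long modifiedTime, int state) {
    // Let the base class update fetchTime and modifiedTime first.
    datum = super.setFetchSchedule(url, datum, prevFetchTime,
        prevModifiedTime, fetchTime, modifiedTime, state);
    // Then pin the interval to a constant value.
    datum.setFetchInterval(DELAY_SECONDS);
    return datum;
  }
}
```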
Modifier and Type | Method and Description |
---|---|
org.apache.hadoop.mapred.RecordWriter<org.apache.hadoop.io.Text,CrawlDatum> | CrawlDbReader.CrawlDatumCsvOutputFormat.getRecordWriter(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.mapred.JobConf job, String name, org.apache.hadoop.util.Progressable progress) |
Modifier and Type | Method and Description |
---|---|
long | FetchSchedule.calculateLastFetchTime(CrawlDatum datum) Calculates the last fetch time of the given CrawlDatum. |
long | AbstractFetchSchedule.calculateLastFetchTime(CrawlDatum datum) Returns the last fetch time of the given CrawlDatum. |
int | CrawlDatum.compareTo(CrawlDatum that) Sort by decreasing score. |
CrawlDatum | FetchSchedule.forceRefetch(org.apache.hadoop.io.Text url, CrawlDatum datum, boolean asap) Resets fetchTime, fetchInterval, modifiedTime and the page signature, so that it forces refetching. |
CrawlDatum | AbstractFetchSchedule.forceRefetch(org.apache.hadoop.io.Text url, CrawlDatum datum, boolean asap) Resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and the page signature, so that it forces refetching. |
static boolean | CrawlDatum.hasDbStatus(CrawlDatum datum) |
static boolean | CrawlDatum.hasFetchStatus(CrawlDatum datum) |
CrawlDatum | FetchSchedule.initializeSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum) Initialize fetch schedule related data. |
CrawlDatum | AbstractFetchSchedule.initializeSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum) Initialize fetch schedule related data. |
void | DeduplicationJob.DBFilter.map(org.apache.hadoop.io.Text key, CrawlDatum value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.BytesWritable,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) |
void | Generator.Selector.map(org.apache.hadoop.io.Text key, CrawlDatum value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.FloatWritable,Generator.SelectorEntry> output, org.apache.hadoop.mapred.Reporter reporter) Select & invert subset due for fetch. |
void | CrawlDbReader.CrawlDbTopNMapper.map(org.apache.hadoop.io.Text key, CrawlDatum value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.FloatWritable,org.apache.hadoop.io.Text> output, org.apache.hadoop.mapred.Reporter reporter) |
void | CrawlDbReader.CrawlDbDumpMapper.map(org.apache.hadoop.io.Text key, CrawlDatum value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) |
void | Generator.CrawlDbUpdater.map(org.apache.hadoop.io.Text key, CrawlDatum value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) |
void | CrawlDbFilter.map(org.apache.hadoop.io.Text key, CrawlDatum value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) |
void | CrawlDbReader.CrawlDbStatMapper.map(org.apache.hadoop.io.Text key, CrawlDatum value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable> output, org.apache.hadoop.mapred.Reporter reporter) |
void | CrawlDatum.putAllMetaData(CrawlDatum other) Add all metadata from the other CrawlDatum to this CrawlDatum. |
void | CrawlDatum.set(CrawlDatum that) Copy the contents of another instance into this instance. |
CrawlDatum | FetchSchedule.setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) Sets the fetchInterval and fetchTime on a successfully fetched page. |
CrawlDatum | AbstractFetchSchedule.setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) Sets the fetchInterval and fetchTime on a successfully fetched page. |
CrawlDatum | AdaptiveFetchSchedule.setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) |
CrawlDatum | DefaultFetchSchedule.setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) |
CrawlDatum | MimeAdaptiveFetchSchedule.setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) |
CrawlDatum | FetchSchedule.setPageGoneSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime) Specifies how to schedule refetching of pages marked as GONE. |
CrawlDatum | AbstractFetchSchedule.setPageGoneSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime) Specifies how to schedule refetching of pages marked as GONE. |
CrawlDatum | FetchSchedule.setPageRetrySchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime) Adjusts the fetch schedule if fetching needs to be retried due to transient errors. |
CrawlDatum | AbstractFetchSchedule.setPageRetrySchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime) Adjusts the fetch schedule if fetching needs to be retried due to transient errors. |
boolean | FetchSchedule.shouldFetch(org.apache.hadoop.io.Text url, CrawlDatum datum, long curTime) Indicates whether the page is suitable for selection in the current fetchlist. |
boolean | AbstractFetchSchedule.shouldFetch(org.apache.hadoop.io.Text url, CrawlDatum datum, long curTime) Indicates whether the page is suitable for selection in the current fetchlist. |
void | CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter.write(org.apache.hadoop.io.Text key, CrawlDatum value) |
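A short usage sketch of the CrawlDatum helpers listed above (read, hasDbStatus, hasFetchStatus, set, putAllMetaData, compareTo); the input file is a placeholder for a stream holding one serialized datum.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import org.apache.nutch.crawl.CrawlDatum;

public class CrawlDatumDemo {
  public static void main(String[] args) throws Exception {
    // args[0] is a placeholder path to a file containing one datum.
    try (DataInputStream in =
             new DataInputStream(new FileInputStream(args[0]))) {
      CrawlDatum datum = CrawlDatum.read(in);               // static factory
      System.out.println(CrawlDatum.hasDbStatus(datum));    // db_* status?
      System.out.println(CrawlDatum.hasFetchStatus(datum)); // fetch_* status?

      CrawlDatum copy = new CrawlDatum();
      copy.set(datum);            // copy contents of another instance
      copy.putAllMetaData(datum); // merge metadata from the other instance
      System.out.println(copy.compareTo(datum)); // sorts by decreasing score
    }
  }
}
```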
Modifier and Type | Method and Description |
---|---|
void | DeduplicationJob.DBFilter.map(org.apache.hadoop.io.Text key, CrawlDatum value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.BytesWritable,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) |
void | CrawlDbReader.CrawlDbDumpMapper.map(org.apache.hadoop.io.Text key, CrawlDatum value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) |
void | Generator.CrawlDbUpdater.map(org.apache.hadoop.io.Text key, CrawlDatum value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) |
void | CrawlDbFilter.map(org.apache.hadoop.io.Text key, CrawlDatum value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) |
void | Injector.InjectMapper.map(org.apache.hadoop.io.WritableComparable<?> key, org.apache.hadoop.io.Text value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) |
void | DeduplicationJob.DedupReducer.reduce(org.apache.hadoop.io.BytesWritable key, Iterator<CrawlDatum> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) |
void | CrawlDbReducer.reduce(org.apache.hadoop.io.Text key, Iterator<CrawlDatum> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) |
void | CrawlDbMerger.Merger.reduce(org.apache.hadoop.io.Text key, Iterator<CrawlDatum> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) |
void | Injector.InjectReducer.reduce(org.apache.hadoop.io.Text key, Iterator<CrawlDatum> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) |
void | DeduplicationJob.StatusUpdateReducer.reduce(org.apache.hadoop.io.Text key, Iterator<CrawlDatum> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) |
void | Generator.CrawlDbUpdater.reduce(org.apache.hadoop.io.Text key, Iterator<CrawlDatum> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) |
void | Generator.PartitionReducer.reduce(org.apache.hadoop.io.Text key, Iterator<Generator.SelectorEntry> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) |
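The reduce methods above all share the old org.apache.hadoop.mapred API shape. As a sketch, a hypothetical reducer of the same shape that keeps only the highest-scoring datum per URL (not part of Nutch):

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

public class HighestScoreReducer extends MapReduceBase
    implements Reducer<Text, CrawlDatum, Text, CrawlDatum> {

  @Override
  public void reduce(Text key, Iterator<CrawlDatum> values,
      OutputCollector<Text, CrawlDatum> output, Reporter reporter)
      throws IOException {
    CrawlDatum best = null;
    while (values.hasNext()) {
      CrawlDatum d = values.next();
      if (best == null || d.getScore() > best.getScore()) {
        best = new CrawlDatum();
        best.set(d); // value objects are reused by Hadoop, so copy
      }
    }
    if (best != null) {
      output.collect(key, best);
    }
  }
}
```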
Modifier and Type | Method and Description |
---|---|
void | Fetcher.run(org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,CrawlDatum> input, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,NutchWritable> output, org.apache.hadoop.mapred.Reporter reporter) |
Modifier and Type | Method and Description |
---|---|
NutchDocument | IndexingFilters.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) Run all defined filters. |
NutchDocument | IndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) Adds fields or otherwise modifies the document that will be indexed for a parse. |
void | CleaningJob.DBFilter.map(org.apache.hadoop.io.Text key, CrawlDatum value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.ByteWritable,org.apache.hadoop.io.Text> output, org.apache.hadoop.mapred.Reporter reporter) |
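A minimal sketch of a custom IndexingFilter matching the filter signature above; the class InlinkCountIndexingFilter and the field name inlinkCount are hypothetical, and it is assumed the interface requires only filter() plus the Configurable methods.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Hypothetical filter: index the number of inbound links as a field.
public class InlinkCountIndexingFilter implements IndexingFilter {

  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) {
    doc.add("inlinkCount", String.valueOf(inlinks.size()));
    return doc; // returning null would drop the document from the index
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}
```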
Modifier and Type | Method and Description |
---|---|
NutchDocument | AnchorIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) The AnchorIndexingFilter filter object, which supports boolean configuration settings for the deduplication of anchors. |
Modifier and Type | Method and Description |
---|---|
NutchDocument | BasicIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) The BasicIndexingFilter filter object, which supports a few configuration settings for adding basic searchable fields. |
Modifier and Type | Method and Description |
---|---|
NutchDocument | FeedIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) Extracts the relevant fields (FEED_AUTHOR, FEED_TAGS, FEED_PUBLISHED, FEED_UPDATED, FEED) and sends them to the Indexer for indexing within the Nutch index. |
Modifier and Type | Method and Description |
---|---|
NutchDocument | MetadataIndexer.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument | MoreIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument | StaticFieldIndexer.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) The StaticFieldIndexer filter object, which adds fields as per configuration settings. |
Modifier and Type | Method and Description |
---|---|
NutchDocument | SubcollectionIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument | TLDIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text urlText, CrawlDatum datum, Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument | URLMetaIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) Takes the metatags listed in your "urlmeta.tags" property and looks for them inside the CrawlDatum object. |
Modifier and Type | Method and Description |
---|---|
NutchDocument | RelTagIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
ProtocolOutput | Protocol.getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum) Returns the Content for a fetchlist entry. |
crawlercommons.robots.BaseRobotRules | Protocol.getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum) Retrieve robot rules applicable for this URL. |
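A sketch of fetching a single URL through the Protocol interface above; ProtocolFactory and NutchConfiguration come from elsewhere in Nutch and are assumed here, and error handling is elided.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.ProtocolFactory;
import org.apache.nutch.protocol.ProtocolOutput;
import org.apache.nutch.util.NutchConfiguration;

public class FetchOne {
  public static void main(String[] args) throws Exception {
    String urlString = args[0]; // e.g. "http://example.com/"
    Configuration conf = NutchConfiguration.create();
    // ProtocolFactory picks the plugin (http, ftp, file, ...) for the URL.
    Protocol protocol = new ProtocolFactory(conf).getProtocol(urlString);

    ProtocolOutput out =
        protocol.getProtocolOutput(new Text(urlString), new CrawlDatum());
    Content content = out.getContent(); // raw bytes plus content metadata
    System.out.println(out.getStatus() + " " + content.getContentType());
  }
}
```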
Modifier and Type | Method and Description |
---|---|
ProtocolOutput | File.getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum) Creates a FileResponse object corresponding to the URL and returns a ProtocolOutput object as per the content received. |
crawlercommons.robots.BaseRobotRules | File.getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum) No robots parsing is done for the file protocol. |
Constructor and Description |
---|
FileResponse(URL url, CrawlDatum datum, File file, org.apache.hadoop.conf.Configuration conf) Default public constructor. |
Modifier and Type | Method and Description |
---|---|
ProtocolOutput | Ftp.getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum) Creates an FtpResponse object corresponding to the URL and returns a ProtocolOutput object as per the content received. |
crawlercommons.robots.BaseRobotRules | Ftp.getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum) Gets the robot rules for a given URL. |
Constructor and Description |
---|
FtpResponse(URL url, CrawlDatum datum, Ftp ftp, org.apache.hadoop.conf.Configuration conf) |
Modifier and Type | Method and Description |
---|---|
protected Response | Http.getResponse(URL url, CrawlDatum datum, boolean redirect) |
Constructor and Description |
---|
HttpResponse(HttpBase http, URL url, CrawlDatum datum) Default public constructor. |
Modifier and Type | Method and Description |
---|---|
ProtocolOutput | HttpBase.getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum) |
protected abstract Response | HttpBase.getResponse(URL url, CrawlDatum datum, boolean followRedirects) |
crawlercommons.robots.BaseRobotRules | HttpBase.getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum) |
Modifier and Type | Method and Description |
---|---|
protected Response | Http.getResponse(URL url, CrawlDatum datum, boolean redirect) Fetches the URL with a configured HTTP client and gets the response. |
Modifier and Type | Method and Description |
---|---|
CrawlDatum | ScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) Distribute score value from the current page to all its outlinked pages. |
CrawlDatum | AbstractScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) |
CrawlDatum | ScoringFilters.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) |
Modifier and Type | Method and Description |
---|---|
CrawlDatum | ScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) Distribute score value from the current page to all its outlinked pages. |
CrawlDatum | AbstractScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) |
CrawlDatum | ScoringFilters.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) |
float | ScoringFilter.generatorSortValue(org.apache.hadoop.io.Text url, CrawlDatum datum, float initSort) Prepares a sort value for the purpose of sorting and selecting the top N scoring pages during fetchlist generation. |
float | AbstractScoringFilter.generatorSortValue(org.apache.hadoop.io.Text url, CrawlDatum datum, float initSort) |
float | ScoringFilters.generatorSortValue(org.apache.hadoop.io.Text url, CrawlDatum datum, float initSort) Calculate a sort value for Generate. |
float | ScoringFilter.indexerScore(org.apache.hadoop.io.Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) Calculates a Lucene document boost. |
float | AbstractScoringFilter.indexerScore(org.apache.hadoop.io.Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) |
float | ScoringFilters.indexerScore(org.apache.hadoop.io.Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) |
void | ScoringFilter.initialScore(org.apache.hadoop.io.Text url, CrawlDatum datum) Set an initial score for newly discovered pages. |
void | AbstractScoringFilter.initialScore(org.apache.hadoop.io.Text url, CrawlDatum datum) |
void | ScoringFilters.initialScore(org.apache.hadoop.io.Text url, CrawlDatum datum) Calculate a new initial score, used when adding newly discovered pages. |
void | ScoringFilter.injectedScore(org.apache.hadoop.io.Text url, CrawlDatum datum) Set an initial score for newly injected pages. |
void | AbstractScoringFilter.injectedScore(org.apache.hadoop.io.Text url, CrawlDatum datum) |
void | ScoringFilters.injectedScore(org.apache.hadoop.io.Text url, CrawlDatum datum) Calculate a new initial score, used when injecting new pages. |
void | ScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url, CrawlDatum datum, Content content) Takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata. |
void | AbstractScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url, CrawlDatum datum, Content content) |
void | ScoringFilters.passScoreBeforeParsing(org.apache.hadoop.io.Text url, CrawlDatum datum, Content content) |
void | ScoringFilter.updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) Calculates a new score of CrawlDatum during CrawlDb update, based on the initial value of the original CrawlDatum and the score values contributed by inlinked pages. |
void | AbstractScoringFilter.updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) |
void | ScoringFilters.updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) Calculate updated page score during CrawlDb.update(). |
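A minimal sketch of a scoring plugin built on AbstractScoringFilter, overriding only the two initial-score callbacks listed above; the class name and score constants are hypothetical, and all other callbacks keep the behaviour inherited from the base class.

```java
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.AbstractScoringFilter;

// Hypothetical filter: newly discovered pages start at a small constant
// score, injected seeds at a higher one.
public class SeedBoostScoringFilter extends AbstractScoringFilter {

  @Override
  public void initialScore(Text url, CrawlDatum datum) {
    datum.setScore(0.1f); // newly discovered page
  }

  @Override
  public void injectedScore(Text url, CrawlDatum datum) {
    datum.setScore(1.0f); // injected seed URL
  }
}
```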
Modifier and Type | Method and Description |
---|---|
CrawlDatum | ScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) Distribute score value from the current page to all its outlinked pages. |
CrawlDatum | AbstractScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) |
CrawlDatum | ScoringFilters.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) |
void | ScoringFilter.updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) Calculates a new score of CrawlDatum during CrawlDb update, based on the initial value of the original CrawlDatum and the score values contributed by inlinked pages. |
void | AbstractScoringFilter.updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) |
void | ScoringFilters.updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) Calculate updated page score during CrawlDb.update(). |
Modifier and Type | Method and Description |
---|---|
CrawlDatum | LinkAnalysisScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) |
Modifier and Type | Method and Description |
---|---|
CrawlDatum | LinkAnalysisScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) |
float | LinkAnalysisScoringFilter.generatorSortValue(org.apache.hadoop.io.Text url, CrawlDatum datum, float initSort) |
float | LinkAnalysisScoringFilter.indexerScore(org.apache.hadoop.io.Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) |
void | LinkAnalysisScoringFilter.initialScore(org.apache.hadoop.io.Text url, CrawlDatum datum) |
void | LinkAnalysisScoringFilter.injectedScore(org.apache.hadoop.io.Text url, CrawlDatum datum) |
void | LinkAnalysisScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url, CrawlDatum datum, Content content) |
void | LinkAnalysisScoringFilter.updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) |
Modifier and Type | Method and Description |
---|---|
CrawlDatum | LinkAnalysisScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) |
void | LinkAnalysisScoringFilter.updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) |
Modifier and Type | Method and Description |
---|---|
CrawlDatum | OPICScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply. |
Modifier and Type | Method and Description |
---|---|
CrawlDatum | OPICScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply. |
float | OPICScoringFilter.generatorSortValue(org.apache.hadoop.io.Text url, CrawlDatum datum, float initSort) Use getScore(). |
float | OPICScoringFilter.indexerScore(org.apache.hadoop.io.Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) Dampen the boost value by scorePower. |
void | OPICScoringFilter.initialScore(org.apache.hadoop.io.Text url, CrawlDatum datum) Set to 0.0f (unknown value); inlink contributions will bring it to a correct level. |
void | OPICScoringFilter.injectedScore(org.apache.hadoop.io.Text url, CrawlDatum datum) |
void | OPICScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url, CrawlDatum datum, Content content) Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY. |
void | OPICScoringFilter.updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) Increase the score by a sum of inlinked scores. |
Modifier and Type | Method and Description |
---|---|
CrawlDatum | OPICScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply. |
void | OPICScoringFilter.updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) Increase the score by a sum of inlinked scores. |
Modifier and Type | Method and Description |
---|---|
CrawlDatum | TLDScoringFilter.distributeScoreToOutlink(org.apache.hadoop.io.Text fromUrl, org.apache.hadoop.io.Text toUrl, ParseData parseData, CrawlDatum target, CrawlDatum adjust, int allCount, int validCount) |
CrawlDatum | TLDScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) |
Modifier and Type | Method and Description |
---|---|
CrawlDatum | TLDScoringFilter.distributeScoreToOutlink(org.apache.hadoop.io.Text fromUrl, org.apache.hadoop.io.Text toUrl, ParseData parseData, CrawlDatum target, CrawlDatum adjust, int allCount, int validCount) |
CrawlDatum | TLDScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) |
float | TLDScoringFilter.generatorSortValue(org.apache.hadoop.io.Text url, CrawlDatum datum, float initSort) |
float | TLDScoringFilter.indexerScore(org.apache.hadoop.io.Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) |
void | TLDScoringFilter.initialScore(org.apache.hadoop.io.Text url, CrawlDatum datum) |
void | TLDScoringFilter.injectedScore(org.apache.hadoop.io.Text url, CrawlDatum datum) |
void | TLDScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url, CrawlDatum datum, Content content) |
void | TLDScoringFilter.updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) |
Modifier and Type | Method and Description |
---|---|
CrawlDatum | TLDScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) |
void | TLDScoringFilter.updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) |
Modifier and Type | Method and Description |
---|---|
CrawlDatum | URLMetaScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) Takes the metatags listed in your "urlmeta.tags" property and looks for them inside the parseData object. |
Modifier and Type | Method and Description |
---|---|
CrawlDatum | URLMetaScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) Takes the metatags listed in your "urlmeta.tags" property and looks for them inside the parseData object. |
float | URLMetaScoringFilter.generatorSortValue(org.apache.hadoop.io.Text url, CrawlDatum datum, float initSort) Boilerplate. |
float | URLMetaScoringFilter.indexerScore(org.apache.hadoop.io.Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) Boilerplate. |
void | URLMetaScoringFilter.initialScore(org.apache.hadoop.io.Text url, CrawlDatum datum) Boilerplate. |
void | URLMetaScoringFilter.injectedScore(org.apache.hadoop.io.Text url, CrawlDatum datum) Boilerplate. |
void | URLMetaScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url, CrawlDatum datum, Content content) Takes the metadata, specified in your "urlmeta.tags" property, from the datum object and injects it into the content. |
void | URLMetaScoringFilter.updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) Boilerplate. |
Modifier and Type | Method and Description |
---|---|
CrawlDatum | URLMetaScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) Takes the metatags listed in your "urlmeta.tags" property and looks for them inside the parseData object. |
void | URLMetaScoringFilter.updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) Boilerplate. |
Modifier and Type | Method and Description |
---|---|
void | ScoreUpdater.reduce(org.apache.hadoop.io.Text key, Iterator<org.apache.hadoop.io.ObjectWritable> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) Creates new CrawlDatum objects with the updated score from the NodeDb or with a cleared score. |
Modifier and Type | Method and Description |
---|---|
boolean | SegmentMergeFilter.filter(org.apache.hadoop.io.Text key, CrawlDatum generateData, CrawlDatum fetchData, CrawlDatum sigData, Content content, ParseData parseData, ParseText parseText, Collection<CrawlDatum> linked) The filtering method, which gets all information being merged for a given key (URL). |
boolean | SegmentMergeFilters.filter(org.apache.hadoop.io.Text key, CrawlDatum generateData, CrawlDatum fetchData, CrawlDatum sigData, Content content, ParseData parseData, ParseText parseText, Collection<CrawlDatum> linked) Iterates over all SegmentMergeFilter extensions and, if any of them returns false, returns false as well. |
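A minimal sketch of a SegmentMergeFilter implementation matching the signature above; the class FetchedOnlyMergeFilter is hypothetical and assumes CrawlDatum.STATUS_FETCH_SUCCESS as the success status.

```java
import java.util.Collection;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.segment.SegmentMergeFilter;

// Hypothetical merge filter: drop entries that were never fetched
// successfully. Returning false excludes the URL from the merged segment.
public class FetchedOnlyMergeFilter implements SegmentMergeFilter {

  @Override
  public boolean filter(Text key, CrawlDatum generateData,
      CrawlDatum fetchData, CrawlDatum sigData, Content content,
      ParseData parseData, ParseText parseText,
      Collection<CrawlDatum> linked) {
    return fetchData != null
        && fetchData.getStatus() == CrawlDatum.STATUS_FETCH_SUCCESS;
  }
}
```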
Modifier and Type | Method and Description |
---|---|
boolean | SegmentMergeFilter.filter(org.apache.hadoop.io.Text key, CrawlDatum generateData, CrawlDatum fetchData, CrawlDatum sigData, Content content, ParseData parseData, ParseText parseText, Collection<CrawlDatum> linked) The filtering method, which gets all information being merged for a given key (URL). |
boolean | SegmentMergeFilters.filter(org.apache.hadoop.io.Text key, CrawlDatum generateData, CrawlDatum fetchData, CrawlDatum sigData, Content content, ParseData parseData, ParseText parseText, Collection<CrawlDatum> linked) Iterates over all SegmentMergeFilter extensions and, if any of them returns false, returns false as well. |
Modifier and Type | Method and Description |
---|---|
void | FreeGenerator.FG.reduce(org.apache.hadoop.io.Text key, Iterator<Generator.SelectorEntry> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) |
Modifier and Type | Method and Description |
---|---|
NutchDocument | CCIndexingFilter.filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) |
Copyright © 2014 The Apache Software Foundation