Interface | Description |
---|---|
FetchSchedule | This interface defines the contract for implementations that manipulate fetch times and re-fetch intervals. |
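The fetch-time contract above can be illustrated with a minimal, self-contained sketch of an adaptive re-fetch policy in the spirit of `AdaptiveFetchSchedule`: lengthen the interval when a page is unchanged, shorten it when it changed. This is not the Nutch API; the class name, method, and constants here are hypothetical.

```java
// Illustrative sketch of an adaptive re-fetch interval policy, in the spirit
// of FetchSchedule/AdaptiveFetchSchedule. All names and constants are
// hypothetical, not the actual Nutch interface.
public class AdaptiveScheduleSketch {
    static final long MIN_INTERVAL = 60;               // seconds
    static final long MAX_INTERVAL = 30L * 24 * 3600;  // 30 days
    static final double INC_FACTOR = 0.2;  // grow interval when unchanged
    static final double DEC_FACTOR = 0.2;  // shrink interval when changed

    /** Returns the next re-fetch interval given whether the page changed. */
    static long nextInterval(long interval, boolean modified) {
        double next = modified
            ? interval * (1.0 - DEC_FACTOR)   // page changed: fetch sooner
            : interval * (1.0 + INC_FACTOR);  // unchanged: back off
        return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, Math.round(next)));
    }

    public static void main(String[] args) {
        long interval = 3600;                          // start at one hour
        interval = nextInterval(interval, false);
        System.out.println(interval);                  // prints 4320
        interval = nextInterval(interval, true);
        System.out.println(interval);                  // prints 3456
    }
}
```

The clamp keeps the interval inside a sane range, so a page that never changes is still revisited eventually and a rapidly changing page is not hammered.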
Class | Description |
---|---|
AbstractFetchSchedule | This class provides common methods for implementations of FetchSchedule. |
AdaptiveFetchSchedule | This class implements an adaptive re-fetch algorithm. |
CrawlDatum | |
CrawlDatum.Comparator | A Comparator optimized for CrawlDatum. |
CrawlDb | This class takes the output of the fetcher and updates the crawldb accordingly. |
CrawlDbFilter | This class provides a way to separate the URL normalization and filtering steps from the rest of the CrawlDb manipulation code. |
CrawlDbMerger | This tool merges several CrawlDb-s into one, optionally filtering URLs through the current URLFilters to skip prohibited pages. |
CrawlDbMerger.Merger | |
CrawlDbReader | Read utility for the CrawlDB. |
CrawlDbReader.CrawlDatumCsvOutputFormat | |
CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter | |
CrawlDbReader.CrawlDbDumpMapper | |
CrawlDbReader.CrawlDbStatCombiner | |
CrawlDbReader.CrawlDbStatMapper | |
CrawlDbReader.CrawlDbStatReducer | |
CrawlDbReader.CrawlDbTopNMapper | |
CrawlDbReader.CrawlDbTopNReducer | |
CrawlDbReducer | Merge new page entries with existing entries. |
DeduplicationJob | Generic deduplicator which groups fetched URLs with the same digest and marks all of them as duplicates except the one with the highest score (based on the score in the crawldb, which is not necessarily the same as the score indexed). |
DeduplicationJob.DBFilter | |
DeduplicationJob.DedupReducer | |
DeduplicationJob.StatusUpdateReducer | Combine multiple new entries for a URL. |
DefaultFetchSchedule | This class implements the default re-fetch schedule. |
FetchScheduleFactory | Creates and caches a FetchSchedule implementation. |
Generator | Generates a subset of a crawl db to fetch. |
Generator.CrawlDbUpdater | Update the CrawlDB so that the next generate won't include the same URLs. |
Generator.DecreasingFloatComparator | |
Generator.GeneratorOutputFormat | |
Generator.HashComparator | Sort fetch lists by hash of URL. |
Generator.PartitionReducer | |
Generator.Selector | Selects entries due for fetch. |
Generator.SelectorEntry | |
Generator.SelectorInverseMapper | |
Injector | This class takes a flat file of URLs and adds them to the database of pages to be crawled. |
Injector.InjectMapper | Normalize and filter injected URLs. |
Injector.InjectReducer | Combine multiple new entries for a URL. |
Inlink | |
Inlinks | A list of Inlink objects. |
LinkDb | Maintains an inverted link map, listing incoming links for each URL. |
LinkDbFilter | This class provides a way to separate the URL normalization and filtering steps from the rest of the LinkDb manipulation code. |
LinkDbMerger | This tool merges several LinkDb-s into one, optionally filtering URLs through the current URLFilters to skip prohibited URLs and links. |
LinkDbReader | Read utility for the LinkDb. |
MapWritable | Deprecated. Use org.apache.hadoop.io.MapWritable instead. |
MD5Signature | Default implementation of a page signature. |
MimeAdaptiveFetchSchedule | Extension of AdaptiveFetchSchedule that allows for more flexible configuration of DEC and INC factors for various MIME types. |
NutchWritable | |
Signature | |
SignatureComparator | |
SignatureFactory | Factory class which instantiates a Signature implementation according to the current Configuration. |
TextProfileSignature | An implementation of a page signature. |
URLPartitioner | Partition URLs by host, domain name, or IP depending on the value of the parameter 'partition.url.mode', which can be 'byHost', 'byDomain', or 'byIP'. |
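The by-host mode described for URLPartitioner can be sketched with a small stand-alone example: hash the URL's host and take it modulo the number of partitions, so all pages on one host land in the same partition. This is a simplified illustration, not the Nutch implementation; the class and method names are hypothetical.

```java
import java.net.URI;

// Simplified illustration of by-host URL partitioning, in the spirit of
// URLPartitioner's 'byHost' mode. Names are hypothetical, not the Nutch class.
public class HostPartitionSketch {
    /** Maps a URL to one of numPartitions buckets by hashing its host name. */
    static int partitionByHost(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        // Mask the sign bit so the modulo result is always non-negative.
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // URLs on the same host always map to the same partition, so one
        // fetch task sees all of a host's pages and politeness limits hold.
        int a = partitionByHost("http://example.org/a.html", 8);
        int b = partitionByHost("http://example.org/b/c.html", 8);
        System.out.println(a == b); // prints "true"
    }
}
```

Keying the partition on the host (rather than the full URL) is what makes per-host politeness enforceable: no two fetch tasks can hit the same host concurrently.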
Copyright © 2014 The Apache Software Foundation