Interface | Description |
---|---|
FetchSchedule | This interface defines the contract for implementations that manipulate fetch times and re-fetch intervals. |
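The fetch-time contract above can be illustrated with a minimal, self-contained sketch of an adaptive re-fetch policy in the spirit of `AdaptiveFetchSchedule`: lengthen the interval when a page is unchanged, shorten it when it changed. This is not the Nutch API; the class name, method, and constants here are hypothetical.

```java
// Illustrative sketch of an adaptive re-fetch interval policy, in the spirit
// of FetchSchedule/AdaptiveFetchSchedule. All names and constants are
// hypothetical, not the actual Nutch interface.
public class AdaptiveScheduleSketch {
    static final long MIN_INTERVAL = 60;               // seconds
    static final long MAX_INTERVAL = 30L * 24 * 3600;  // 30 days
    static final double INC_FACTOR = 0.2;  // grow interval when unchanged
    static final double DEC_FACTOR = 0.2;  // shrink interval when changed

    /** Returns the next re-fetch interval given whether the page changed. */
    static long nextInterval(long interval, boolean modified) {
        double next = modified
            ? interval * (1.0 - DEC_FACTOR)   // page changed: fetch sooner
            : interval * (1.0 + INC_FACTOR);  // unchanged: back off
        return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, Math.round(next)));
    }

    public static void main(String[] args) {
        long interval = 3600;                          // start at one hour
        interval = nextInterval(interval, false);
        System.out.println(interval);                  // prints 4320
        interval = nextInterval(interval, true);
        System.out.println(interval);                  // prints 3456
    }
}
```

The clamp keeps the interval inside a sane range, so a page that never changes is still revisited eventually and a rapidly changing page is not hammered.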
Class | Description |
---|---|
AbstractFetchSchedule | This class provides common methods for implementations of FetchSchedule. |
AdaptiveFetchSchedule | This class implements an adaptive re-fetch algorithm. |
CrawlDatum | |
CrawlDatum.Comparator | A Comparator optimized for CrawlDatum. |
CrawlDb | This class takes the output of the fetcher and updates the crawldb accordingly. |
CrawlDbFilter | This class provides a way to separate the URL normalization and filtering steps from the rest of the CrawlDb manipulation code. |
CrawlDbMerger | This tool merges several CrawlDb-s into one, optionally filtering URLs through the current URLFilters to skip prohibited pages. |
CrawlDbMerger.Merger | |
CrawlDbReader | Read utility for the CrawlDB. |
CrawlDbReader.CrawlDatumCsvOutputFormat | |
CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter | |
CrawlDbReader.CrawlDbDumpMapper | |
CrawlDbReader.CrawlDbStatCombiner | |
CrawlDbReader.CrawlDbStatMapper | |
CrawlDbReader.CrawlDbStatReducer | |
CrawlDbReader.CrawlDbTopNMapper | |
CrawlDbReader.CrawlDbTopNReducer | |
CrawlDbReducer | Merge new page entries with existing entries. |
DeduplicationJob | Generic deduplicator which groups fetched URLs with the same digest and marks all of them as duplicates except the one with the highest score (based on the score in the crawldb, which is not necessarily the same as the score indexed). |
DeduplicationJob.DBFilter | |
DeduplicationJob.DedupReducer | |
DeduplicationJob.StatusUpdateReducer | Combine multiple new entries for a URL. |
DefaultFetchSchedule | This class implements the default re-fetch schedule. |
FetchScheduleFactory | Creates and caches a FetchSchedule implementation. |
Generator | Generates a subset of a crawl db to fetch. |
Generator.CrawlDbUpdater | Update the CrawlDB so that the next generate won't include the same URLs. |
Generator.DecreasingFloatComparator | |
Generator.GeneratorOutputFormat | |
Generator.HashComparator | Sort fetch lists by hash of URL. |
Generator.PartitionReducer | |
Generator.Selector | Selects entries due for fetch. |
Generator.SelectorEntry | |
Generator.SelectorInverseMapper | |
Injector | This class takes a flat file of URLs and adds them to the database of pages to be crawled. |
Injector.InjectMapper | Normalize and filter injected URLs. |
Injector.InjectReducer | Combine multiple new entries for a URL. |
Inlink | |
Inlinks | A list of Inlink objects. |
LinkDb | Maintains an inverted link map, listing incoming links for each URL. |
LinkDbFilter | This class provides a way to separate the URL normalization and filtering steps from the rest of the LinkDb manipulation code. |
LinkDbMerger | This tool merges several LinkDb-s into one, optionally filtering URLs through the current URLFilters to skip prohibited URLs and links. |
LinkDbReader | Read utility for the LinkDb. |
MapWritable | Deprecated. Use org.apache.hadoop.io.MapWritable instead. |
MD5Signature | Default implementation of a page signature. |
MimeAdaptiveFetchSchedule | Extension of AdaptiveFetchSchedule that allows for more flexible configuration of DEC and INC factors for various MIME types. |
NutchWritable | |
Signature | |
SignatureComparator | |
SignatureFactory | Factory class which instantiates a Signature implementation according to the current Configuration. |
TextProfileSignature | An implementation of a page signature. |
URLPartitioner | Partition URLs by host, domain name, or IP depending on the value of the parameter 'partition.url.mode', which can be 'byHost', 'byDomain', or 'byIP'. |
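The by-host mode described for URLPartitioner can be sketched with a small stand-alone example: hash the URL's host and take it modulo the number of partitions, so all pages on one host land in the same partition. This is a simplified illustration, not the Nutch implementation; the class and method names are hypothetical.

```java
import java.net.URI;

// Simplified illustration of by-host URL partitioning, in the spirit of
// URLPartitioner's 'byHost' mode. Names are hypothetical, not the Nutch class.
public class HostPartitionSketch {
    /** Maps a URL to one of numPartitions buckets by hashing its host name. */
    static int partitionByHost(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        // Mask the sign bit so the modulo result is always non-negative.
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // URLs on the same host always map to the same partition, so one
        // fetch task sees all of a host's pages and politeness limits hold.
        int a = partitionByHost("http://example.org/a.html", 8);
        int b = partitionByHost("http://example.org/b/c.html", 8);
        System.out.println(a == b); // prints "true"
    }
}
```

Keying the partition on the host (rather than the full URL) is what makes per-host politeness enforceable: no two fetch tasks can hit the same host concurrently.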
Copyright © 2014 The Apache Software Foundation