Package org.apache.nutch.crawl

Crawl control code.


Interface Summary
FetchSchedule This interface defines the contract for implementations that manipulate fetch times and re-fetch intervals.
 

Class Summary
AbstractFetchSchedule This class provides common methods for implementations of FetchSchedule.
AdaptiveFetchSchedule This class implements an adaptive re-fetch algorithm.
Crawl  
CrawlDatum  
CrawlDatum.Comparator A Comparator optimized for CrawlDatum.
CrawlDb This class takes the output of the fetcher and updates the crawldb accordingly.
CrawlDbFilter This class provides a way to separate the URL normalization and filtering steps from the rest of CrawlDb manipulation code.
CrawlDbMerger This tool merges several CrawlDb-s into one, optionally filtering URLs through the current URLFilters, to skip prohibited pages.
CrawlDbMerger.Merger  
CrawlDbReader Read utility for the CrawlDB.
CrawlDbReader.CrawlDatumCsvOutputFormat  
CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter  
CrawlDbReader.CrawlDbStatCombiner  
CrawlDbReader.CrawlDbStatMapper  
CrawlDbReader.CrawlDbStatReducer  
CrawlDbReader.CrawlDbTopNMapper  
CrawlDbReader.CrawlDbTopNReducer  
CrawlDbReducer Merge new page entries with existing entries.
DefaultFetchSchedule This class implements the default re-fetch schedule.
FetchScheduleFactory Creates and caches a FetchSchedule implementation.
Generator Generates a subset of a crawl db to fetch.
Generator.CrawlDbUpdater Update the CrawlDB so that the next generate won't include the same URLs.
Generator.DecreasingFloatComparator  
Generator.GeneratorOutputFormat  
Generator.HashComparator Sort fetch lists by hash of URL.
Generator.PartitionReducer  
Generator.Selector Selects entries due for fetch.
Generator.SelectorEntry  
Generator.SelectorInverseMapper  
Injector This class takes a flat file of URLs and adds them to the database of pages to be crawled.
Injector.InjectMapper Normalize and filter injected URLs.
Injector.InjectReducer Combine multiple new entries for a URL.
Inlink  
Inlinks A list of Inlinks.
LinkDb Maintains an inverted link map, listing incoming links for each url.
LinkDbFilter This class provides a way to separate the URL normalization and filtering steps from the rest of LinkDb manipulation code.
LinkDbMerger This tool merges several LinkDb-s into one, optionally filtering URLs through the current URLFilters, to skip prohibited URLs and links.
LinkDbReader  
MapWritable Deprecated. Use org.apache.hadoop.io.MapWritable instead.
MD5Signature Default implementation of a page signature.
NutchWritable  
Signature  
SignatureComparator  
SignatureFactory Factory class that instantiates a Signature implementation according to the current Configuration.
TextProfileSignature An implementation of a page signature.
URLPartitioner Partition URLs by host, domain name, or IP, depending on the value of the parameter 'partition.url.mode', which can be 'byHost', 'byDomain', or 'byIP'.
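
The 'byHost' mode described for URLPartitioner can be illustrated with a minimal sketch. This is not the Nutch implementation: the class name HostPartitionSketch and the method partitionByHost are hypothetical, and the sketch assumes that a partition is chosen by hashing the host name modulo the number of partitions. The 'byDomain' and 'byIP' modes are not shown.

```java
import java.net.URI;

// Hypothetical illustration only; not part of org.apache.nutch.crawl.
class HostPartitionSketch {

    // Returns a partition index in [0, numPartitions) so that URLs
    // sharing a host always land in the same partition.
    static int partitionByHost(String urlString, int numPartitions) {
        String key = urlString; // fall back to the whole string
        try {
            String host = URI.create(urlString).getHost();
            if (host != null) {
                key = host;
            }
        } catch (IllegalArgumentException e) {
            // Malformed URL: keep the fallback key.
        }
        // Mask the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int a = partitionByHost("http://example.com/a.html", 4);
        int b = partitionByHost("http://example.com/b.html", 4);
        System.out.println(a == b); // same host, same partition
    }
}
```

Grouping by host in this way keeps all URLs of one host in a single partition, which is the property a per-host fetch list relies on.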
 


Copyright © 2011 The Apache Software Foundation