Interface Summary | |
---|---|
FetchSchedule | This interface defines the contract for implementations that manipulate fetch times and re-fetch intervals. |
Class Summary | |
---|---|
AbstractFetchSchedule | This class provides common methods for implementations of FetchSchedule. |
AdaptiveFetchSchedule | This class implements an adaptive re-fetch algorithm. |
Crawl | |
CrawlDatum | |
CrawlDatum.Comparator | A Comparator optimized for CrawlDatum. |
CrawlDb | This class takes the output of the fetcher and updates the crawldb accordingly. |
CrawlDbFilter | This class provides a way to separate the URL normalization and filtering steps from the rest of CrawlDb manipulation code. |
CrawlDbMerger | This tool merges several CrawlDb-s into one, optionally filtering URLs through the current URLFilters, to skip prohibited pages. |
CrawlDbMerger.Merger | |
CrawlDbReader | Read utility for the CrawlDB. |
CrawlDbReader.CrawlDatumCsvOutputFormat | |
CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter | |
CrawlDbReader.CrawlDbStatCombiner | |
CrawlDbReader.CrawlDbStatMapper | |
CrawlDbReader.CrawlDbStatReducer | |
CrawlDbReader.CrawlDbTopNMapper | |
CrawlDbReader.CrawlDbTopNReducer | |
CrawlDbReducer | Merge new page entries with existing entries. |
DefaultFetchSchedule | This class implements the default re-fetch schedule. |
FetchScheduleFactory | Creates and caches a FetchSchedule implementation. |
Generator | Generates a subset of a crawl db to fetch. |
Generator.CrawlDbUpdater | Update the CrawlDB so that the next generate won't include the same URLs. |
Generator.DecreasingFloatComparator | |
Generator.GeneratorOutputFormat | |
Generator.HashComparator | Sort fetch lists by hash of URL. |
Generator.PartitionReducer | |
Generator.Selector | Selects entries due for fetch. |
Generator.SelectorEntry | |
Generator.SelectorInverseMapper | |
Injector | This class takes a flat file of URLs and adds them to the database of pages to be crawled. |
Injector.InjectMapper | Normalize and filter injected urls. |
Injector.InjectReducer | Combine multiple new entries for a url. |
Inlink | |
Inlinks | A list of Inlinks. |
LinkDb | Maintains an inverted link map, listing incoming links for each url. |
LinkDbFilter | This class provides a way to separate the URL normalization and filtering steps from the rest of LinkDb manipulation code. |
LinkDbMerger | This tool merges several LinkDb-s into one, optionally filtering URLs through the current URLFilters, to skip prohibited URLs and links. |
LinkDbReader | |
MapWritable | Deprecated. Use org.apache.hadoop.io.MapWritable instead. |
MD5Signature | Default implementation of a page signature. |
NutchWritable | |
Signature | |
SignatureComparator | |
SignatureFactory | Factory class that instantiates a Signature implementation according to the current Configuration. |
TextProfileSignature | An implementation of a page signature. |
URLPartitioner | Partition urls by host, domain name or IP depending on the value of the parameter 'partition.url.mode', which can be 'byHost', 'byDomain' or 'byIP'. |
Crawl control code.
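
The URLPartitioner entry above describes choosing a partition from the host, domain name, or IP of a URL according to 'partition.url.mode'. The following is a minimal sketch of that idea only, not the actual URLPartitioner source. The class `PartitionSketch` and its methods are hypothetical names, the domain extraction is a naive "last two labels" assumption rather than a real public-suffix lookup, and the hash-modulo step is an assumed way of mapping a key to a partition.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;

// Hypothetical illustration; not part of the package documented above.
public class PartitionSketch {

    /** Returns the string that decides the partition for the given mode. */
    static String partitionKey(String url, String mode) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return url; // no host component: fall back to the whole URL
        }
        switch (mode) {
            case "byDomain": {
                // Assumption: treat the last two labels as the domain name.
                String[] labels = host.split("\\.");
                int n = labels.length;
                return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
            }
            case "byIP":
                try {
                    return InetAddress.getByName(host).getHostAddress();
                } catch (UnknownHostException e) {
                    return host; // unresolvable host: fall back to the host name
                }
            case "byHost":
            default:
                return host;
        }
    }

    /** Maps a URL to a partition in [0, numPartitions). */
    static int partition(String url, String mode, int numPartitions) {
        int hash = partitionKey(url, mode).hashCode();
        // Mask the sign bit so the result is never negative.
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String a = "http://www.example.com/page1";
        String b = "http://blog.example.com/page2";
        System.out.println("byHost keys:   " + partitionKey(a, "byHost")
                + ", " + partitionKey(b, "byHost"));
        System.out.println("byDomain keys: " + partitionKey(a, "byDomain")
                + ", " + partitionKey(b, "byDomain"));
        System.out.println("same byDomain partition: "
                + (partition(a, "byDomain", 8) == partition(b, "byDomain", 8)));
    }
}
```

In this sketch, URLs that share a key always land in the same partition, so in 'byDomain' mode the two example URLs are grouped together even though their hosts differ.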