Revision 359822

Author: ab
Date: Thu Dec 29 15:28:30 2005 UTC
Changed paths: 17
Log Message:
A framework for using different page signature implementations. An
ordinary MD5 hash of raw page content is often unsuitable when many
near-duplicate pages are crawled.

Now users can select their own page signature implementation, possibly
with better properties than the old one.

Two implementations are provided:

* MD5Signature: backward-compatible with the old schema.

* TextProfileSignature: an example signature implementation that
  yields the same value for near-duplicate pages. Please see the Javadoc
  for more information.
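
The difference between the two kinds of signature can be sketched as follows. This is only an illustration: `PageSignature`, `ExactMd5Signature`, and `TokenProfileSignature` are hypothetical names, not the classes added in this commit, and the profile algorithm here (hash the sorted set of frequent tokens) is a simplified stand-in for whatever the TextProfileSignature Javadoc describes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical interface; the Signature class in this commit may differ.
interface PageSignature {
    byte[] calculate(String pageText);
}

// Small helper that wraps the checked exception from MessageDigest.
final class Md5 {
    static byte[] digest(String s) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}

// Exact signature: any change in the raw text changes the hash.
final class ExactMd5Signature implements PageSignature {
    public byte[] calculate(String pageText) {
        return Md5.digest(pageText);
    }
}

// Simplified profile signature (an assumption, not the commit's algorithm):
// keep only tokens that are long enough and frequent enough, then hash
// the sorted token list. Small edits rarely change that list.
final class TokenProfileSignature implements PageSignature {
    private final int minTokenLength;
    private final int minCount;

    TokenProfileSignature(int minTokenLength, int minCount) {
        this.minTokenLength = minTokenLength;
        this.minCount = minCount;
    }

    public byte[] calculate(String pageText) {
        // TreeMap keeps tokens sorted, so the profile is order-independent.
        Map<String, Integer> counts = new TreeMap<>();
        String[] tokens = pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            if (token.length() >= minTokenLength) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        StringBuilder profile = new StringBuilder();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minCount) {
                profile.append(entry.getKey()).append('\n');
            }
        }
        return Md5.digest(profile.toString());
    }
}
```

With a minimum token length of 3 and a minimum count of 2, two pages that differ only in punctuation or a timestamp produce the same profile signature, while their exact MD5 signatures differ.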

This commit changes CrawlDatum to store page signatures in CrawlDb.
A last-modified time field was added, too. Both changes are in preparation
for patches implementing a self-adjusting fetch interval.

NutchConf was extended to also store and retrieve plain Object values.
This is useful when caching per-job instances.
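
A minimal sketch of the caching pattern this enables, under stated assumptions: `SimpleConf`, `setObject`, `getObject`, `ExpensiveHelper`, and `HelperFactory` are hypothetical names used for illustration and are not taken from NutchConf.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical configuration class holding both string properties
// and arbitrary objects; not the real NutchConf.
final class SimpleConf {
    private final Map<String, String> properties = new HashMap<>();
    private final Map<String, Object> objects = new HashMap<>();

    void set(String name, String value) {
        properties.put(name, value);
    }

    String get(String name, String defaultValue) {
        String value = properties.get(name);
        return value != null ? value : defaultValue;
    }

    void setObject(String name, Object value) {
        objects.put(name, value);
    }

    Object getObject(String name) {
        return objects.get(name);
    }
}

// Stand-in for something that is costly to construct.
final class ExpensiveHelper {
    final String mode;

    ExpensiveHelper(String mode) {
        this.mode = mode;
    }
}

// Builds the helper once per configuration and caches it in the
// configuration itself, so each job (one conf per job) gets its own instance.
final class HelperFactory {
    private static final String CACHE_KEY = HelperFactory.class.getName();

    static synchronized ExpensiveHelper get(SimpleConf conf) {
        ExpensiveHelper helper = (ExpensiveHelper) conf.getObject(CACHE_KEY);
        if (helper == null) {
            helper = new ExpensiveHelper(conf.get("helper.mode", "default"));
            conf.setObject(CACHE_KEY, helper);
        }
        return helper;
    }
}
```

Keeping the cached instance inside the configuration object, rather than in a static field, means two jobs with different settings never share an instance.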

StringUtil: added methods to display / parse byte[] values.
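
One plausible way to display and parse byte[] values such as signatures is a hexadecimal round trip. The log message does not say which text format StringUtil uses, so hex is an assumption here, and `HexBytes`, `toHex`, and `fromHex` are hypothetical names.

```java
// Hypothetical helper; the methods added to StringUtil may be named
// and formatted differently.
final class HexBytes {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    // Renders each byte as two lower-case hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(DIGITS[(b >> 4) & 0x0f]);
            sb.append(DIGITS[b & 0x0f]);
        }
        return sb.toString();
    }

    // Parses a string produced by toHex back into bytes.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("invalid hex digit near index " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }
}
```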

Added SegmentReader (based on a contribution in NUTCH-121 by Rod Taylor).

Fixed Fetcher to actually use the command-line parameters.


Changed paths

Path  Details
lucene/nutch/trunk/bin/nutch  modified, text changed
lucene/nutch/trunk/conf/nutch-default.xml  modified, text changed
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java  modified, text changed
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java  modified, text changed
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/MD5Signature.java  added
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Signature.java  added
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/SignatureComparator.java  added
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/SignatureFactory.java  added
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/TextProfileSignature.java  added
lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java  modified, text changed
lucene/nutch/trunk/src/java/org/apache/nutch/indexer/Indexer.java  modified, text changed
lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java  modified, text changed
lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java  modified, text changed
lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java  added
lucene/nutch/trunk/src/java/org/apache/nutch/util/NutchConf.java  modified, text changed
lucene/nutch/trunk/src/java/org/apache/nutch/util/StringUtil.java  modified, text changed
lucene/nutch/trunk/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexingFilter.java  modified, text changed
