Log Message: |
Add a framework for pluggable page signature implementations. A plain
MD5 hash of the raw page content is often unsuitable when many
near-duplicate pages are crawled.
Users can now select their own page signature implementation, possibly
with better properties than the old one.
Two implementations are provided:
* MD5Signature: backward-compatible with the old schema.
* TextProfileSignature: an example implementation of a signature that
gives the same values for near-duplicate pages. Please see the Javadoc
for more information.
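A rough sketch of the idea behind the two implementations, for
illustration only. The names PageSignature, RawMd5Signature,
NormalizedTextSignature and Digests are hypothetical and are not the
classes added by this commit. The normalizing variant is a simple
stand-in and is not the TextProfileSignature algorithm (see its Javadoc
for that):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical plug-in point: one method that maps page content to a signature.
interface PageSignature {
    byte[] calculate(String content);
}

// Exact signature: MD5 over the raw content, so any byte difference changes it.
class RawMd5Signature implements PageSignature {
    public byte[] calculate(String content) {
        return Digests.md5(content.getBytes(StandardCharsets.UTF_8));
    }
}

// Tolerant signature (assumed stand-in): lowercase the text and collapse every
// run of non-letter, non-digit characters to one space before hashing.
class NormalizedTextSignature implements PageSignature {
    public byte[] calculate(String content) {
        String normalized = content.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                .trim();
        return Digests.md5(normalized.getBytes(StandardCharsets.UTF_8));
    }
}

final class Digests {
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // Unexpected on a standard JDK; surface it as a programming error.
            throw new IllegalStateException(e);
        }
    }
}

class SignatureDemo {
    public static void main(String[] args) {
        String a = "Hello,   World!";
        String b = "hello world";
        PageSignature raw = new RawMd5Signature();
        PageSignature tolerant = new NormalizedTextSignature();
        // The raw signatures differ; the normalized ones match.
        System.out.println(Arrays.equals(raw.calculate(a), raw.calculate(b)));
        System.out.println(Arrays.equals(tolerant.calculate(a), tolerant.calculate(b)));
    }
}
```

Because callers only see the interface, the signature strategy can be
swapped without touching the code that stores or compares signatures.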
This commit changes CrawlDatum to store page signatures in CrawlDb.
A last-modified time field was added, too. Both changes are in
preparation for patches implementing a self-adjusting fetch interval.
NutchConf was extended to also store and retrieve plain Object values.
This is useful for caching per-job instances.
StringUtil: added methods to display / parse byte[] values.
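For readers unfamiliar with this kind of helper, a minimal sketch of
displaying and parsing byte[] values as hexadecimal text. The class and
method names (HexBytes, toHex, fromHex) are hypothetical and the hex
encoding is an assumption; the actual StringUtil methods may be named
and formatted differently:

```java
// Hypothetical helper, not the StringUtil code from this commit.
final class HexBytes {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    // Render each byte as two lowercase hex digits.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(DIGITS[(b >> 4) & 0x0f]).append(DIGITS[b & 0x0f]);
        }
        return sb.toString();
    }

    // Parse a hex string back into bytes; rejects odd lengths and non-hex characters.
    static byte[] fromHex(String text) {
        if (text.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have even length");
        }
        byte[] out = new byte[text.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(text.charAt(2 * i), 16);
            int lo = Character.digit(text.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("invalid hex digit near index " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }
}
```

A textual form like this makes it possible to print a stored signature
and to read it back into an equal byte[].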
Added SegmentReader (based on a contribution in NUTCH-121 by Rod Taylor).
Fixed Fetcher to actually use the command-line parameters.
|