- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
public class DeduplicationJob
extends org.apache.hadoop.conf.Configured
implements org.apache.hadoop.util.Tool
Generic deduplicator which groups fetched URLs with the same digest and marks
all of them as duplicates except the one with the highest score (based on the
score in the crawldb, which is not necessarily the same as the score indexed).
If two or more documents have the same score, the document with the latest
timestamp is kept. If the timestamps are also equal, the one with the shortest
URL is kept. The documents marked as duplicates can then be deleted with the
CleaningJob command.
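The keep/duplicate decision described above (highest score wins, then latest timestamp, then shortest URL) can be sketched as a comparator. This is a minimal illustration, not Nutch's actual implementation; the `Doc` class and field names are hypothetical:

```java
import java.util.Arrays;
import java.util.Comparator;

public class DedupSketch {
    // Hypothetical stand-in for a fetched document within one digest group.
    static class Doc {
        final String url;
        final float score;     // score from the crawldb
        final long fetchTime;  // fetch timestamp
        Doc(String url, float score, long fetchTime) {
            this.url = url; this.score = score; this.fetchTime = fetchTime;
        }
    }

    // Orders documents so the one to KEEP comes first:
    // highest score, then latest timestamp, then shortest URL.
    static final Comparator<Doc> KEEP_FIRST =
        Comparator.comparingDouble((Doc d) -> d.score).reversed()
                  .thenComparing(Comparator.comparingLong((Doc d) -> d.fetchTime).reversed())
                  .thenComparingInt(d -> d.url.length());

    public static void main(String[] args) {
        // All three documents share the same digest, score, and (for two of
        // them) timestamp, so the shortest URL breaks the final tie.
        Doc[] group = {
            new Doc("http://example.com/a?session=1", 1.0f, 100L),
            new Doc("http://example.com/a",           1.0f, 100L),
            new Doc("http://example.com/old",         1.0f,  50L),
        };
        Arrays.sort(group, KEEP_FIRST);
        System.out.println("keep: " + group[0].url);
        // Every other element of the sorted group would be marked duplicate.
    }
}
```

Sorting the group and keeping only the first element mirrors the rule order stated in the description: score is compared first, and the timestamp and URL-length rules only apply as successive tie-breakers.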