org.apache.nutch.indexer
Class DeleteDuplicates
java.lang.Object
org.apache.hadoop.conf.Configured
org.apache.nutch.indexer.DeleteDuplicates
- All Implemented Interfaces:
- Closeable, Configurable, JobConfigurable, Mapper<WritableComparable,Writable,Text,IntWritable>, OutputFormat<WritableComparable,Writable>, Reducer<Text,IntWritable,WritableComparable,Writable>, Tool
public class DeleteDuplicates
- extends Configured
- implements Tool, Mapper<WritableComparable,Writable,Text,IntWritable>, Reducer<Text,IntWritable,WritableComparable,Writable>, OutputFormat<WritableComparable,Writable>
Deletes duplicate documents from a set of Lucene indexes.
Two documents are duplicates if they have the same URL or the same content, as determined by an MD5 hash.
This tool uses the following algorithm:
- Phase 1 - remove URL duplicates:
Documents with the same URL are compared and only the most recent one is
retained. All other URL duplicates are scheduled for deletion.
- Phase 2 - remove content duplicates:
Documents with the same content hash are compared. If the property
"dedup.keep.highest.score" is true (the default), only the document with
the highest score is retained. If it is false, only the document with the
shortest URL is retained. All other content duplicates are scheduled for
deletion.
- Phase 3 - delete documents:
Documents scheduled for deletion are marked as deleted in the Lucene
index(es).
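The selection rules of phases 1 and 2 can be illustrated with a small in-memory sketch. It is not the Nutch implementation, which runs as MapReduce jobs over Lucene indexes. The `DedupSketch` class, its `Doc` type, and the `keepBest` and `survivors` helpers are hypothetical names introduced here, and the tie-breaking rule (the earlier document wins) is an assumption, since the description above does not specify one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustrative only: an in-memory model of phases 1 and 2, not the Nutch code. */
class DedupSketch {

    /** Hypothetical stand-in for an indexed document. */
    static final class Doc {
        final String url;
        final String hash;   // content hash, e.g. an MD5 hex string
        final float score;
        final long time;     // larger means more recent

        Doc(String url, String hash, float score, long time) {
            this.url = url;
            this.hash = hash;
            this.score = score;
            this.time = time;
        }
    }

    /** Keeps one document per key: the one that ranks highest under {@code rank}. */
    static List<Doc> keepBest(List<Doc> docs, Function<Doc, String> key, Comparator<Doc> rank) {
        Map<String, Doc> winners = new LinkedHashMap<>();
        for (Doc d : docs) {
            // On a tie the earlier document wins (an assumption of this sketch).
            winners.merge(key.apply(d), d, (a, b) -> rank.compare(a, b) >= 0 ? a : b);
        }
        return new ArrayList<>(winners.values());
    }

    /** Returns the documents that survive phases 1 and 2. */
    static List<Doc> survivors(List<Doc> docs, boolean keepHighestScore) {
        // Phase 1: per URL, keep the most recent document.
        List<Doc> byUrl = keepBest(docs, d -> d.url, (a, b) -> Long.compare(a.time, b.time));

        // Phase 2: per content hash, keep the highest score or the shortest URL.
        Comparator<Doc> rank;
        if (keepHighestScore) {
            rank = (a, b) -> Float.compare(a.score, b.score);
        } else {
            rank = (a, b) -> Integer.compare(b.url.length(), a.url.length());
        }
        return keepBest(byUrl, d -> d.hash, rank);
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("http://a.example/", "h1", 1.0f, 100L),
            new Doc("http://a.example/", "h2", 2.0f, 200L),
            new Doc("http://b.example/long/path", "h2", 5.0f, 50L));
        for (boolean flag : new boolean[] {true, false}) {
            for (Doc d : survivors(docs, flag)) {
                System.out.println("keepHighestScore=" + flag + " keeps " + d.url);
            }
        }
    }
}
```

In this model, phase 3 would mark every document that is not among the survivors as deleted in its index.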
- Author:
- Andrzej Bialecki
Method Summary
Return type | Method | Description
void | checkOutputSpecs(FileSystem fs, JobConf job) |
void | close() |
void | configure(JobConf job) |
void | dedup(Path[] indexDirs) |
RecordWriter<WritableComparable,Writable> | getRecordWriter(FileSystem fs, JobConf job, String name, Progressable progress) | Write nothing.
static void | main(String[] args) |
void | map(WritableComparable key, Writable value, OutputCollector<Text,IntWritable> output, Reporter reporter) | Map [*,IndexDoc] pairs to [index,doc] pairs.
void | reduce(Text key, Iterator<IntWritable> values, OutputCollector<WritableComparable,Writable> output, Reporter reporter) | Delete docs named in values from index named in key.
int | run(String[] args) |
void | setConf(Configuration conf) |
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
DeleteDuplicates
public DeleteDuplicates()
DeleteDuplicates
public DeleteDuplicates(Configuration conf)
configure
public void configure(JobConf job)
- Specified by:
configure
in interface JobConfigurable
setConf
public void setConf(Configuration conf)
- Specified by:
setConf
in interface Configurable
- Overrides:
setConf
in class Configured
close
public void close()
- Specified by:
close
in interface Closeable
map
public void map(WritableComparable key,
Writable value,
OutputCollector<Text,IntWritable> output,
Reporter reporter)
throws IOException
- Map [*,IndexDoc] pairs to [index,doc] pairs.
- Specified by:
map
in interface Mapper<WritableComparable,Writable,Text,IntWritable>
- Throws:
IOException
reduce
public void reduce(Text key,
Iterator<IntWritable> values,
OutputCollector<WritableComparable,Writable> output,
Reporter reporter)
throws IOException
- Delete docs named in values from index named in key.
- Specified by:
reduce
in interface Reducer<Text,IntWritable,WritableComparable,Writable>
- Throws:
IOException
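The data flow between map and reduce can be modelled without Hadoop or Lucene. In this sketch, which rests on assumptions and does not reproduce the class's code, the [index,doc] pairs emitted by map are grouped by index name, and a stand-in for reduce marks each listed doc id as deleted. The `DeletePhaseSketch` class, its `groupByIndex` and `markDeleted` helpers, the `BitSet` used in place of a Lucene index, and the index names are all hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: models the [index,doc] grouping without Hadoop or Lucene. */
class DeletePhaseSketch {

    /** Groups [index,doc] pairs by index name, as the framework does between map and reduce. */
    static Map<String, List<Integer>> groupByIndex(List<SimpleEntry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (SimpleEntry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** Stand-in for reduce: marks each doc id in {@code values} as deleted for one index. */
    static BitSet markDeleted(Iterator<Integer> values) {
        BitSet deleted = new BitSet();
        while (values.hasNext()) {
            deleted.set(values.next());
        }
        return deleted;
    }

    public static void main(String[] args) {
        // Hypothetical index names and doc ids.
        List<SimpleEntry<String, Integer>> pairs = Arrays.asList(
            new SimpleEntry<String, Integer>("index-a", 3),
            new SimpleEntry<String, Integer>("index-b", 0),
            new SimpleEntry<String, Integer>("index-a", 7));
        for (Map.Entry<String, List<Integer>> e : groupByIndex(pairs).entrySet()) {
            System.out.println(e.getKey() + " -> " + markDeleted(e.getValue().iterator()));
        }
    }
}
```

Keying the pairs by index name means each reduce call only ever touches one index, which is presumably why the map output is shaped as [index,doc].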
getRecordWriter
public RecordWriter<WritableComparable,Writable> getRecordWriter(FileSystem fs,
JobConf job,
String name,
Progressable progress)
throws IOException
- Write nothing.
- Specified by:
getRecordWriter
in interface OutputFormat<WritableComparable,Writable>
- Throws:
IOException
checkOutputSpecs
public void checkOutputSpecs(FileSystem fs,
JobConf job)
- Specified by:
checkOutputSpecs
in interface OutputFormat<WritableComparable,Writable>
dedup
public void dedup(Path[] indexDirs)
throws IOException
- Throws:
IOException
main
public static void main(String[] args)
throws Exception
- Throws:
Exception
run
public int run(String[] args)
throws Exception
- Specified by:
run
in interface Tool
- Throws:
Exception
Copyright © 2006 The Apache Software Foundation