org.apache.nutch.indexer
Class DeleteDuplicates

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.nutch.indexer.DeleteDuplicates
All Implemented Interfaces:
Closeable, Configurable, JobConfigurable, Mapper<WritableComparable,Writable,Text,IntWritable>, OutputFormat<WritableComparable,Writable>, Reducer<Text,IntWritable,WritableComparable,Writable>, Tool

public class DeleteDuplicates
extends Configured
implements Tool, Mapper<WritableComparable,Writable,Text,IntWritable>, Reducer<Text,IntWritable,WritableComparable,Writable>, OutputFormat<WritableComparable,Writable>

Delete duplicate documents in a set of Lucene indexes. Duplicates have either the same contents (via MD5 hash) or the same URL. This tool uses the following algorithm:

Author:
Andrzej Bialecki
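
As an illustration of the two duplicate criteria named in the class description (same URL, or same content by MD5 hash), the following is a minimal in-memory sketch. It is not the Nutch implementation: the Doc class, its time field, and the "newest document wins" policy are all assumptions made for the example, and no Lucene or Hadoop classes are used.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a stand-alone model of URL and content deduplication.
final class DedupSketch {

    // Hypothetical stand-in for an indexed document.
    static final class Doc {
        final String url;
        final String hash; // hex MD5 of the content
        final long time;   // assumed tie-breaker: larger means more recent

        Doc(String url, String content, long time) {
            this.url = url;
            this.hash = md5Hex(content);
            this.time = time;
        }
    }

    // Hex-encoded MD5 digest of a string's UTF-8 bytes.
    static String md5Hex(String content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Keep one document per key (URL or hash); the newest wins (assumed policy).
    static List<Doc> keepNewestPerKey(List<Doc> docs, boolean byUrl) {
        Map<String, Doc> best = new LinkedHashMap<String, Doc>();
        for (Doc d : docs) {
            String key = byUrl ? d.url : d.hash;
            Doc seen = best.get(key);
            if (seen == null || d.time > seen.time) {
                best.put(key, d);
            }
        }
        return new ArrayList<Doc>(best.values());
    }

    // First collapse documents sharing a URL, then documents sharing a hash.
    static List<Doc> dedup(List<Doc> docs) {
        return keepNewestPerKey(keepNewestPerKey(docs, true), false);
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc("http://a/", "old", 1L),
                new Doc("http://a/", "same", 2L),   // same URL as the first
                new Doc("http://b/", "same", 3L),   // same content as the second
                new Doc("http://c/", "other", 1L));
        for (Doc d : dedup(docs)) {
            System.out.println(d.url + " " + d.hash);
        }
    }
}
```

With the sample input, only the documents for http://b/ and http://c/ survive: the first is dropped as a URL duplicate and the second as a content duplicate.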

Nested Class Summary
static class DeleteDuplicates.HashPartitioner
static class DeleteDuplicates.HashReducer
static class DeleteDuplicates.IndexDoc
static class DeleteDuplicates.InputFormat
static class DeleteDuplicates.UrlsReducer

Constructor Summary
DeleteDuplicates()
DeleteDuplicates(Configuration conf)

Method Summary
void checkOutputSpecs(FileSystem fs, JobConf job)
void close()
void configure(JobConf job)
void dedup(Path[] indexDirs)
RecordWriter<WritableComparable,Writable> getRecordWriter(FileSystem fs, JobConf job, String name, Progressable progress)
          Returns a record writer that writes nothing.
static void main(String[] args)
void map(WritableComparable key, Writable value, OutputCollector<Text,IntWritable> output, Reporter reporter)
          Maps [*, IndexDoc] pairs to [index, doc] pairs.
void reduce(Text key, Iterator<IntWritable> values, OutputCollector<WritableComparable,Writable> output, Reporter reporter)
          Deletes the documents listed in values from the index named by key.
int run(String[] args)
void setConf(Configuration conf)

Methods inherited from class org.apache.hadoop.conf.Configured
getConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf
 

Constructor Detail

DeleteDuplicates

public DeleteDuplicates()

DeleteDuplicates

public DeleteDuplicates(Configuration conf)

Method Detail

configure

public void configure(JobConf job)
Specified by:
configure in interface JobConfigurable

setConf

public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable
Overrides:
setConf in class Configured

close

public void close()
Specified by:
close in interface Closeable

map

public void map(WritableComparable key,
                Writable value,
                OutputCollector<Text,IntWritable> output,
                Reporter reporter)
         throws IOException
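Maps [*, IndexDoc] pairs to [index, doc] pairs. The input key may be anything (*); each output pair names an index and a document in that index.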

Specified by:
map in interface Mapper<WritableComparable,Writable,Text,IntWritable>
Throws:
IOException
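
The projection described for map can be sketched without Hadoop as follows. This is a simplified model under stated assumptions: the IndexDoc class here is a hypothetical stand-in with only two fields (index and doc), plain String and Integer replace Text and IntWritable, and a returned list replaces the OutputCollector.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Illustrative only: models "[*, IndexDoc] -> [index, doc]" with plain Java types.
final class MapSketch {

    // Hypothetical stand-in for DeleteDuplicates.IndexDoc; field names are assumed.
    static final class IndexDoc {
        final String index; // name of the index holding the document
        final int doc;      // document number within that index

        IndexDoc(String index, int doc) {
            this.index = index;
            this.doc = doc;
        }
    }

    // The input key is ignored; each value is projected to an (index, doc) pair.
    static List<Map.Entry<String, Integer>> map(List<IndexDoc> values) {
        List<Map.Entry<String, Integer>> out = new ArrayList<Map.Entry<String, Integer>>();
        for (IndexDoc v : values) {
            out.add(new SimpleEntry<String, Integer>(v.index, v.doc));
        }
        return out;
    }

    public static void main(String[] args) {
        List<IndexDoc> input = Arrays.asList(
                new IndexDoc("part-0", 7),
                new IndexDoc("part-1", 3));
        for (Map.Entry<String, Integer> e : map(input)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Keying the output by index name means that all document numbers for one index arrive together at a single reduce call.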

reduce

public void reduce(Text key,
                   Iterator<IntWritable> values,
                   OutputCollector<WritableComparable,Writable> output,
                   Reporter reporter)
            throws IOException
Deletes the documents listed in values from the index named by key.

Specified by:
reduce in interface Reducer<Text,IntWritable,WritableComparable,Writable>
Throws:
IOException
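
A minimal model of this reduce step, assuming each index can be represented as a set of live document numbers. The map of indexes and the returned deletion count are hypothetical conveniences for the example; the real method operates on Lucene indexes and Hadoop Writable types.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: an index is modeled as a set of live document numbers.
final class ReduceSketch {

    // Remove every document number in values from the index named by key.
    // Returns how many documents were actually removed.
    static int reduce(String key, Iterator<Integer> values,
                      Map<String, Set<Integer>> indexes) {
        Set<Integer> live = indexes.get(key);
        if (live == null) {
            throw new IllegalArgumentException("unknown index: " + key);
        }
        int deleted = 0;
        while (values.hasNext()) {
            if (live.remove(values.next())) {
                deleted++;
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> indexes = new HashMap<String, Set<Integer>>();
        indexes.put("part-0", new TreeSet<Integer>(Arrays.asList(0, 1, 2, 3)));

        int deleted = reduce("part-0", Arrays.asList(1, 3).iterator(), indexes);
        System.out.println(deleted + " deleted, remaining " + indexes.get("part-0"));
    }
}
```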

getRecordWriter

public RecordWriter<WritableComparable,Writable> getRecordWriter(FileSystem fs,
                                                                 JobConf job,
                                                                 String name,
                                                                 Progressable progress)
                                                          throws IOException
Returns a record writer that writes nothing.

Specified by:
getRecordWriter in interface OutputFormat<WritableComparable,Writable>
Throws:
IOException
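
A writer that discards its input can be sketched as below. The SimpleRecordWriter interface is a hypothetical, simplified stand-in defined only for this example; it is not Hadoop's RecordWriter, whose methods take different parameters.

```java
// Illustrative only: a writer whose methods accept records and do nothing.
final class NoOpWriterSketch {

    // Hypothetical, simplified stand-in for a record writer interface.
    interface SimpleRecordWriter<K, V> {
        void write(K key, V value);
        void close();
    }

    // Returns a writer that discards every record it is given.
    static <K, V> SimpleRecordWriter<K, V> nothingWriter() {
        return new SimpleRecordWriter<K, V>() {
            public void write(K key, V value) {
                // intentionally empty: the job produces no output records
            }

            public void close() {
                // nothing to release
            }
        };
    }

    public static void main(String[] args) {
        SimpleRecordWriter<String, Integer> writer = nothingWriter();
        writer.write("part-0", 1);
        writer.close();
        System.out.println("done");
    }
}
```

A no-op writer suits a job whose useful effect is a side effect, such as deleting documents from an index, rather than a set of output files.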

checkOutputSpecs

public void checkOutputSpecs(FileSystem fs,
                             JobConf job)
Specified by:
checkOutputSpecs in interface OutputFormat<WritableComparable,Writable>

dedup

public void dedup(Path[] indexDirs)
           throws IOException
Throws:
IOException

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception

run

public int run(String[] args)
        throws Exception
Specified by:
run in interface Tool
Throws:
Exception


Copyright © 2006 The Apache Software Foundation