org.apache.nutch.indexer.solr
Class SolrDeleteDuplicates

java.lang.Object
  extended by org.apache.nutch.indexer.solr.SolrDeleteDuplicates
All Implemented Interfaces:
Closeable, Configurable, JobConfigurable, Reducer<Text,SolrDeleteDuplicates.SolrRecord,Text,SolrDeleteDuplicates.SolrRecord>, Tool

public class SolrDeleteDuplicates
extends Object
implements Reducer<Text,SolrDeleteDuplicates.SolrRecord,Text,SolrDeleteDuplicates.SolrRecord>, Tool

Utility class for deleting duplicate documents from a Solr index. The algorithm works as follows:

Preparation:

  1. Query the solr server for the number of documents (say, N)
  2. Partition N among M map tasks. For example, with two map tasks, the first map task handles Solr documents 0 to (N / 2 - 1) and the second handles documents (N / 2) to (N - 1).
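
The partitioning step can be illustrated with a minimal, standalone sketch. It is not the code used by SolrInputFormat; the class name SolrSplitSketch and the method partition are hypothetical, and the choice to give any remainder to the last task is an assumption made only so that every document index is covered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Nutch.
public class SolrSplitSketch {

    /**
     * Splits document indexes 0..numDocs-1 into numSplits contiguous ranges.
     * Each element is {start, end}, both inclusive. The last range absorbs
     * any remainder (an assumption of this sketch).
     */
    static List<long[]> partition(long numDocs, int numSplits) {
        if (numSplits < 1) {
            throw new IllegalArgumentException("numSplits must be >= 1");
        }
        List<long[]> ranges = new ArrayList<long[]>();
        long perSplit = numDocs / numSplits;
        long start = 0;
        for (int i = 0; i < numSplits; i++) {
            // The final split always ends at the last document index.
            long end = (i == numSplits - 1) ? numDocs - 1 : start + perSplit - 1;
            ranges.add(new long[] {start, end});
            start = end + 1;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // With N = 10 and M = 2 this yields 0 to 4 and 5 to 9,
        // i.e. 0 to (N / 2 - 1) and (N / 2) to (N - 1).
        for (long[] r : partition(10, 2)) {
            System.out.println(r[0] + " to " + r[1]);
        }
    }
}
```
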
MapReduce: Note that, unlike DeleteDuplicates, this class assumes that two documents in a Solr index never have the same URL. It therefore deals only with documents that have different URLs but the same digest.
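
As a rough illustration of what "different URLs but the same digest" means in practice, the sketch below groups in-memory records by digest and reports every record after the first in each group as a deletion candidate. It is an assumption-laden example, not the reduce implementation: the classes DigestGroupingSketch and Doc are hypothetical, and this page does not say which record of a group is kept, so keeping the first one encountered is only a placeholder rule.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch.
public class DigestGroupingSketch {

    /** Minimal stand-in for an indexed document: an id (URL) and a content digest. */
    static final class Doc {
        final String id;
        final String digest;

        Doc(String id, String digest) {
            this.id = id;
            this.digest = digest;
        }
    }

    /**
     * Groups documents by digest and returns the ids of all but the first
     * document in each group. Keeping the first is a placeholder choice.
     */
    static List<String> idsToDelete(List<Doc> docs) {
        Map<String, List<Doc>> byDigest = new LinkedHashMap<String, List<Doc>>();
        for (Doc d : docs) {
            List<Doc> group = byDigest.get(d.digest);
            if (group == null) {
                group = new ArrayList<Doc>();
                byDigest.put(d.digest, group);
            }
            group.add(d);
        }
        List<String> toDelete = new ArrayList<String>();
        for (List<Doc> group : byDigest.values()) {
            // Skip index 0 (the kept record); everything else is a duplicate.
            for (int i = 1; i < group.size(); i++) {
                toDelete.add(group.get(i).id);
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<Doc>();
        docs.add(new Doc("http://example.com/a", "d1"));
        docs.add(new Doc("http://example.com/b", "d2"));
        docs.add(new Doc("http://example.com/c", "d1"));
        System.out.println(idsToDelete(docs));
    }
}
```
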


Nested Class Summary
static class SolrDeleteDuplicates.SolrInputFormat
           
static class SolrDeleteDuplicates.SolrInputSplit
           
static class SolrDeleteDuplicates.SolrRecord
           
 
Field Summary
static org.apache.commons.logging.Log LOG
           
 
Constructor Summary
SolrDeleteDuplicates()
           
 
Method Summary
 void close()
           
 void configure(JobConf job)
           
 void dedup(String solrUrl)
           
 Configuration getConf()
           
static void main(String[] args)
           
 void reduce(Text key, Iterator<SolrDeleteDuplicates.SolrRecord> values, OutputCollector<Text,SolrDeleteDuplicates.SolrRecord> output, Reporter reporter)
           
 int run(String[] args)
           
 void setConf(Configuration conf)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.apache.commons.logging.Log LOG
Constructor Detail

SolrDeleteDuplicates

public SolrDeleteDuplicates()
Method Detail

getConf

public Configuration getConf()
Specified by:
getConf in interface Configurable

setConf

public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable

configure

public void configure(JobConf job)
Specified by:
configure in interface JobConfigurable

close

public void close()
           throws IOException
Specified by:
close in interface Closeable
Throws:
IOException

reduce

public void reduce(Text key,
                   Iterator<SolrDeleteDuplicates.SolrRecord> values,
                   OutputCollector<Text,SolrDeleteDuplicates.SolrRecord> output,
                   Reporter reporter)
            throws IOException
Specified by:
reduce in interface Reducer<Text,SolrDeleteDuplicates.SolrRecord,Text,SolrDeleteDuplicates.SolrRecord>
Throws:
IOException

dedup

public void dedup(String solrUrl)
           throws IOException
Throws:
IOException

run

public int run(String[] args)
        throws IOException
Specified by:
run in interface Tool
Throws:
IOException

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception


Copyright © 2006 The Apache Software Foundation