org.apache.nutch.indexer.solr
Class SolrDeleteDuplicates

java.lang.Object
  extended by org.apache.hadoop.mapreduce.Reducer<org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord,org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord>
      extended by org.apache.nutch.indexer.solr.SolrDeleteDuplicates
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public class SolrDeleteDuplicates
extends org.apache.hadoop.mapreduce.Reducer<org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord,org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord>
implements org.apache.hadoop.util.Tool

Utility class for deleting duplicate documents from a Solr index. The algorithm works as follows.

Preparation:

  1. Query the Solr server for the number of documents (say, N).
  2. Partition N among M map tasks. For example, with two map tasks, the first handles Solr documents 0 to (N / 2 - 1) and the second handles documents (N / 2) to (N - 1).
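MapReduce: records with the same digest are grouped together, and the duplicates in each group are deleted from the Solr index.

Note: two documents in a Solr index are assumed never to have the same URL, so this class only deals with documents that have different URLs but the same digest.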
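
The partitioning in the preparation step can be illustrated with a small standalone sketch. It is not the class's actual code: `SplitRanges`, `Range`, and `partition` are hypothetical names, and giving the leftover documents to the last task when N is not divisible by M is an assumption chosen to match the two-task example above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits N document positions among M map tasks. */
final class SplitRanges {

    /** Inclusive range of document positions handled by one map task. */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return start + "-" + end;
        }
    }

    /**
     * Each task gets N / M documents; the last task also takes the
     * remainder (an assumption, not confirmed by this page).
     */
    static List<Range> partition(long numDocs, int numTasks) {
        if (numDocs < 0 || numTasks <= 0) {
            throw new IllegalArgumentException("numDocs >= 0 and numTasks > 0 required");
        }
        List<Range> ranges = new ArrayList<>();
        long perTask = numDocs / numTasks;
        for (int i = 0; i < numTasks; i++) {
            long start = i * perTask;
            long end = (i == numTasks - 1) ? numDocs - 1 : start + perTask - 1;
            if (end >= start) { // skip empty ranges when there are more tasks than documents
                ranges.add(new Range(start, end));
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        System.out.println(partition(10, 2)); // prints [0-4, 5-9]
    }
}
```

With N = 11 and M = 2 this sketch yields 0-4 and 5-10, which agrees with the formula 0 to (N / 2 - 1) and (N / 2) to (N - 1).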


Nested Class Summary
static class SolrDeleteDuplicates.SolrInputFormat
           
static class SolrDeleteDuplicates.SolrInputSplit
           
static class SolrDeleteDuplicates.SolrRecord
           
static class SolrDeleteDuplicates.SolrRecordReader
           
 
Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.Reducer
org.apache.hadoop.mapreduce.Reducer.Context
 
Field Summary
static org.slf4j.Logger LOG
           
 
Constructor Summary
SolrDeleteDuplicates()
           
 
Method Summary
 void cleanup(org.apache.hadoop.mapreduce.Reducer.Context context)
           
 boolean dedup(String solrUrl)
           
 org.apache.hadoop.conf.Configuration getConf()
           
static void main(String[] args)
           
 void reduce(org.apache.hadoop.io.Text key, Iterable<SolrDeleteDuplicates.SolrRecord> values, org.apache.hadoop.mapreduce.Reducer.Context context)
           
 int run(String[] args)
           
 void setConf(org.apache.hadoop.conf.Configuration conf)
           
 void setup(org.apache.hadoop.mapreduce.Reducer.Context job)
           
 
Methods inherited from class org.apache.hadoop.mapreduce.Reducer
run
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.slf4j.Logger LOG
Constructor Detail

SolrDeleteDuplicates

public SolrDeleteDuplicates()
Method Detail

getConf

public org.apache.hadoop.conf.Configuration getConf()
Specified by:
getConf in interface org.apache.hadoop.conf.Configurable

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)
Specified by:
setConf in interface org.apache.hadoop.conf.Configurable

setup

public void setup(org.apache.hadoop.mapreduce.Reducer.Context job)
           throws IOException
Overrides:
setup in class org.apache.hadoop.mapreduce.Reducer<org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord,org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord>
Throws:
IOException

cleanup

public void cleanup(org.apache.hadoop.mapreduce.Reducer.Context context)
             throws IOException
Overrides:
cleanup in class org.apache.hadoop.mapreduce.Reducer<org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord,org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord>
Throws:
IOException

reduce

public void reduce(org.apache.hadoop.io.Text key,
                   Iterable<SolrDeleteDuplicates.SolrRecord> values,
                   org.apache.hadoop.mapreduce.Reducer.Context context)
            throws IOException
Overrides:
reduce in class org.apache.hadoop.mapreduce.Reducer<org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord,org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord>
Throws:
IOException
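
As the signature suggests, reduce is called with a Text key and the SolrRecord values grouped under that key. The following standalone sketch shows one way such a group could be reduced to a single surviving document. This page does not say which record is kept, so the rule used here (higher boost wins, then the newer timestamp) is an assumption, and `DigestDedupSketch`, `Doc`, and its fields are hypothetical stand-ins rather than the real SolrRecord API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of digest-based duplicate selection. */
final class DigestDedupSketch {

    /** Stand-in for a Solr record; the field names are assumptions. */
    static final class Doc {
        final String id;
        final String digest;
        final float boost;
        final long tstamp;

        Doc(String id, String digest, float boost, long tstamp) {
            this.id = id;
            this.digest = digest;
            this.boost = boost;
            this.tstamp = tstamp;
        }
    }

    /** Assumed keep rule: higher boost first, then the more recent timestamp. */
    static boolean isPreferred(Doc candidate, Doc current) {
        if (candidate.boost != current.boost) {
            return candidate.boost > current.boost;
        }
        return candidate.tstamp > current.tstamp;
    }

    /** Returns the ids of all documents that lose to another document with the same digest. */
    static List<String> idsToDelete(List<Doc> docs) {
        Map<String, Doc> keptByDigest = new LinkedHashMap<>();
        List<String> toDelete = new ArrayList<>();
        for (Doc doc : docs) {
            Doc kept = keptByDigest.get(doc.digest);
            if (kept == null) {
                keptByDigest.put(doc.digest, doc);   // first document seen for this digest
            } else if (isPreferred(doc, kept)) {
                toDelete.add(kept.id);               // the previous keeper loses
                keptByDigest.put(doc.digest, doc);
            } else {
                toDelete.add(doc.id);                // the newcomer loses
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("a", "d1", 1.0f, 100L));
        docs.add(new Doc("b", "d1", 2.0f, 50L));
        docs.add(new Doc("c", "d2", 1.0f, 10L));
        docs.add(new Doc("d", "d1", 2.0f, 60L));
        System.out.println(idsToDelete(docs)); // prints [a, b]
    }
}
```

In the real job the grouping by key is done by the MapReduce framework before reduce runs; the map in this sketch only simulates that grouping in memory.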

dedup

public boolean dedup(String solrUrl)
              throws IOException,
                     InterruptedException,
                     ClassNotFoundException
Throws:
IOException
InterruptedException
ClassNotFoundException

run

public int run(String[] args)
        throws IOException,
               InterruptedException,
               ClassNotFoundException
Specified by:
run in interface org.apache.hadoop.util.Tool
Throws:
IOException
InterruptedException
ClassNotFoundException

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception


Copyright © 2013 The Apache Software Foundation