org.apache.nutch.indexer.solr
Class SolrDeleteDuplicates
java.lang.Object
org.apache.hadoop.mapreduce.Reducer<org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord,org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord>
org.apache.nutch.indexer.solr.SolrDeleteDuplicates
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
public class SolrDeleteDuplicates
- extends org.apache.hadoop.mapreduce.Reducer<org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord,org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord>
- implements org.apache.hadoop.util.Tool
Utility class for deleting duplicate documents from a Solr index.
The algorithm works as follows:
Preparation:
- Query the Solr server for the total number of documents (say, N).
- Partition N among M map tasks. For example, with two map tasks,
the first handles Solr documents 0 to (N / 2 - 1) and
the second handles documents (N / 2) to (N - 1).
MapReduce:
- Map: Identity map where keys are digests and values are
SolrDeleteDuplicates.SolrRecord instances (which contain the id, boost, and timestamp).
- Reduce: After the map phase, SolrDeleteDuplicates.SolrRecord instances with the same
digest are grouped together. Of these documents with the same digest, all
are deleted except the one with the highest score (boost field). If two
(or more) documents have the same score, the document with the latest
timestamp is kept; every other one is deleted from the Solr index.
Note that we assume two documents in
a Solr index never have the same URL, so this class only deals with
documents that have different URLs but the same digest.
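The keep/delete rule from the reduce step can be sketched in plain Java. The `Record` class below is a hypothetical stand-in for SolrDeleteDuplicates.SolrRecord, reduced to the two fields the comparison uses (the real class also carries the document id used to issue deletes):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for SolrDeleteDuplicates.SolrRecord: only the
// fields the dedup comparison needs.
class Record {
    final String id;
    final float boost;   // Solr score (boost field)
    final long tstamp;   // index timestamp

    Record(String id, float boost, long tstamp) {
        this.id = id;
        this.boost = boost;
        this.tstamp = tstamp;
    }
}

public class DedupRule {
    // Among records sharing one digest, return the single record to keep:
    // highest boost wins; on a boost tie, the latest timestamp wins.
    // Every other record in the group would be deleted from the index.
    static Record keep(List<Record> sameDigest) {
        Record best = sameDigest.get(0);
        for (Record r : sameDigest.subList(1, sameDigest.size())) {
            if (r.boost > best.boost
                    || (r.boost == best.boost && r.tstamp > best.tstamp)) {
                best = r;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Record> group = Arrays.asList(
            new Record("a", 1.0f, 100L),
            new Record("b", 2.0f, 50L),   // highest boost: kept
            new Record("c", 2.0f, 40L));  // ties "b" on boost, but older
        System.out.println(keep(group).id); // prints "b"
    }
}
```

This is a sketch of the selection rule only; the actual reducer also buffers the ids of the losing records and sends delete requests to Solr in cleanup.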
Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.Reducer:
- org.apache.hadoop.mapreduce.Reducer.Context
Field Summary:
- static org.slf4j.Logger LOG
Methods inherited from class org.apache.hadoop.mapreduce.Reducer:
- run
Methods inherited from class java.lang.Object:
- clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOG
public static final org.slf4j.Logger LOG
SolrDeleteDuplicates
public SolrDeleteDuplicates()
getConf
public org.apache.hadoop.conf.Configuration getConf()
- Specified by:
getConf
in interface org.apache.hadoop.conf.Configurable
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
- Specified by:
setConf
in interface org.apache.hadoop.conf.Configurable
setup
public void setup(org.apache.hadoop.mapreduce.Reducer.Context job)
throws IOException
- Overrides:
setup
in class org.apache.hadoop.mapreduce.Reducer<org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord,org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord>
- Throws:
IOException
cleanup
public void cleanup(org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException
- Overrides:
cleanup
in class org.apache.hadoop.mapreduce.Reducer<org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord,org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord>
- Throws:
IOException
reduce
public void reduce(org.apache.hadoop.io.Text key,
Iterable<SolrDeleteDuplicates.SolrRecord> values,
org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException
- Overrides:
reduce
in class org.apache.hadoop.mapreduce.Reducer<org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord,org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord>
- Throws:
IOException
dedup
public boolean dedup(String solrUrl)
throws IOException,
InterruptedException,
ClassNotFoundException
- Throws:
IOException
InterruptedException
ClassNotFoundException
run
public int run(String[] args)
throws IOException,
InterruptedException,
ClassNotFoundException
- Specified by:
run
in interface org.apache.hadoop.util.Tool
- Throws:
IOException
InterruptedException
ClassNotFoundException
main
public static void main(String[] args)
throws Exception
- Throws:
Exception
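Since main delegates to run via org.apache.hadoop.util.Tool, the class is typically invoked from the command line. A usage fragment, assuming a Nutch 1.x installation where the `solrdedup` command maps to this class (verify against your version's `bin/nutch` help):

```
# Remove duplicate documents from the Solr index at the given URL;
# the single argument is passed through to run(String[] args).
bin/nutch solrdedup http://localhost:8983/solr
```

The return value of run (0 on success, non-zero on failure) becomes the shell exit code via ToolRunner.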
Copyright © 2013 The Apache Software Foundation