org.apache.nutch.indexer.solr
Class SolrDeleteDuplicates
java.lang.Object
org.apache.nutch.indexer.solr.SolrDeleteDuplicates
- All Implemented Interfaces:
- Closeable, Configurable, JobConfigurable, Reducer<Text,SolrDeleteDuplicates.SolrRecord,Text,SolrDeleteDuplicates.SolrRecord>, Tool
public class SolrDeleteDuplicates
- extends Object
- implements Reducer<Text,SolrDeleteDuplicates.SolrRecord,Text,SolrDeleteDuplicates.SolrRecord>, Tool
Utility class for deleting duplicate documents from a Solr index.
The algorithm works as follows:
Preparation:
- Query the Solr server for the number of documents (say, N).
- Partition N among M map tasks. For example, with two map tasks,
the first map task handles Solr documents 0 to (N / 2 - 1) and
the second handles documents (N / 2) to (N - 1).
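The partitioning step above can be sketched as follows. This is a minimal illustration, not the Nutch implementation: the class `SolrPartitionSketch`, its `Range` type, and the choice to give the remainder to the last task are all assumptions made for the example. The description only specifies the two-task case.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Nutch: splits N documents into M contiguous ranges. */
final class SolrPartitionSketch {

    /** Inclusive range [start, end] of document offsets handled by one map task. */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return start + "-" + end;
        }
    }

    /** Partitions numDocs documents among numMaps map tasks. */
    static List<Range> partition(long numDocs, int numMaps) {
        if (numDocs < 0 || numMaps <= 0) {
            throw new IllegalArgumentException("numDocs must be >= 0 and numMaps > 0");
        }
        List<Range> ranges = new ArrayList<>();
        long base = numDocs / numMaps;
        long start = 0;
        for (int i = 0; i < numMaps; i++) {
            // Assumption: the last task absorbs any remainder when N is not divisible by M.
            long size = (i == numMaps - 1) ? numDocs - start : base;
            if (size > 0) {
                ranges.add(new Range(start, start + size - 1));
            }
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Two map tasks over 10 documents: 0 to (N/2 - 1) and (N/2) to (N - 1).
        System.out.println(partition(10, 2)); // prints [0-4, 5-9]
    }
}
```

With N = 10 and M = 2 this reproduces the split described above: the first task gets documents 0 to 4 and the second gets 5 to 9.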
MapReduce:
- Map: Identity map where keys are digests and values are
SolrDeleteDuplicates.SolrRecord instances (which contain id, boost and timestamp).
- Reduce: After the map phase, SolrDeleteDuplicates.SolrRecord instances
with the same digest are grouped together. Of the documents that share a
digest, all are deleted except the one with the highest score (boost field).
If two (or more) documents have the same score, the document with the latest
timestamp is kept and every other one is deleted from the Solr index.
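The selection rule in the reduce step can be sketched as below. This is an illustration under stated assumptions rather than the actual reducer: `DedupSelectionSketch` and its `Doc` type are hypothetical stand-ins for SolrDeleteDuplicates.SolrRecord, the field names are invented, and the sketch only collects the ids that would be deleted instead of talking to a Solr server. When both boost and timestamp are equal, it keeps the record seen first, which the description does not specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for the reduce-side selection rule; not the Nutch code. */
final class DedupSelectionSketch {

    /** Simplified record holding the three fields named in the description. */
    static final class Doc {
        final String id;
        final float boost;
        final long tstamp;

        Doc(String id, float boost, long tstamp) {
            this.id = id;
            this.boost = boost;
            this.tstamp = tstamp;
        }
    }

    /** True if candidate should replace current as the document to keep. */
    static boolean isBetter(Doc candidate, Doc current) {
        if (candidate.boost != current.boost) {
            return candidate.boost > current.boost; // higher score wins
        }
        return candidate.tstamp > current.tstamp;   // tie on score: later timestamp wins
    }

    /** Returns the ids to delete for one group of documents sharing a digest. */
    static List<String> idsToDelete(Iterator<Doc> group) {
        List<String> deletions = new ArrayList<>();
        if (!group.hasNext()) {
            return deletions;
        }
        Doc keep = group.next();
        while (group.hasNext()) {
            Doc next = group.next();
            if (isBetter(next, keep)) {
                deletions.add(keep.id); // previous best is now a duplicate
                keep = next;
            } else {
                deletions.add(next.id);
            }
        }
        return deletions;
    }

    public static void main(String[] args) {
        List<Doc> group = Arrays.asList(
                new Doc("a", 1.0f, 100L),
                new Doc("b", 2.0f, 50L),
                new Doc("c", 2.0f, 75L));
        // "c" is kept: it ties with "b" on boost but has the later timestamp.
        System.out.println(idsToDelete(group.iterator())); // prints [a, b]
    }
}
```

The single pass over an `Iterator` mirrors the shape of the `reduce` signature on this page, where the values for one digest arrive as an iterator rather than a list.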
Note that, unlike DeleteDuplicates, we assume that two documents in
a Solr index will never have the same URL. So this class only deals with
documents with different URLs but the same digest.
Field Summary
static org.apache.commons.logging.Log | LOG
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOG
public static final org.apache.commons.logging.Log LOG
SolrDeleteDuplicates
public SolrDeleteDuplicates()
getConf
public Configuration getConf()
- Specified by:
getConf
in interface Configurable
setConf
public void setConf(Configuration conf)
- Specified by:
setConf
in interface Configurable
configure
public void configure(JobConf job)
- Specified by:
configure
in interface JobConfigurable
close
public void close()
throws IOException
- Specified by:
close
in interface Closeable
- Throws:
IOException
reduce
public void reduce(Text key,
Iterator<SolrDeleteDuplicates.SolrRecord> values,
OutputCollector<Text,SolrDeleteDuplicates.SolrRecord> output,
Reporter reporter)
throws IOException
- Specified by:
reduce
in interface Reducer<Text,SolrDeleteDuplicates.SolrRecord,Text,SolrDeleteDuplicates.SolrRecord>
- Throws:
IOException
dedup
public void dedup(String solrUrl)
throws IOException
- Throws:
IOException
run
public int run(String[] args)
throws IOException
- Specified by:
run
in interface Tool
- Throws:
IOException
main
public static void main(String[] args)
throws Exception
- Throws:
Exception
Copyright © 2006 The Apache Software Foundation