org.apache.nutch.indexer.solr
Class SolrDeleteDuplicates
java.lang.Object
org.apache.hadoop.mapreduce.Reducer<org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord,org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord>
org.apache.nutch.indexer.solr.SolrDeleteDuplicates
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
public class SolrDeleteDuplicates
- extends org.apache.hadoop.mapreduce.Reducer<org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord,org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord>
- implements org.apache.hadoop.util.Tool
Utility class for deleting duplicate documents from a Solr index.
The algorithm works as follows:
Preparation:
- Query the Solr server for the total number of documents (say, N).
- Partition N among M map tasks. For example, with two map tasks,
the first handles Solr documents 0 to (N / 2 - 1) and
the second handles documents (N / 2) to (N - 1).
MapReduce:
- Map: Identity map where keys are digests and values are
SolrDeleteDuplicates.SolrRecord instances (which contain the id, boost, and timestamp).
- Reduce: After the map phase, SolrDeleteDuplicates.SolrRecord instances with the same
digest are grouped together. Of these documents with the same digest, all
are deleted except the one with the highest score (boost field). If two
(or more) documents have the same score, the document with the latest
timestamp is kept; every other one is deleted from the Solr index.
Note that we assume two documents in
a Solr index never have the same URL, so this class only deals with
documents that have different URLs but the same digest.
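The keep/delete rule from the reduce step can be sketched in plain Java. The `Record` class below is a hypothetical stand-in for SolrDeleteDuplicates.SolrRecord, reduced to the two fields the comparison uses (the real class also carries the document id used to issue deletes):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for SolrDeleteDuplicates.SolrRecord: only the
// fields the dedup comparison needs.
class Record {
    final String id;
    final float boost;   // Solr score (boost field)
    final long tstamp;   // index timestamp

    Record(String id, float boost, long tstamp) {
        this.id = id;
        this.boost = boost;
        this.tstamp = tstamp;
    }
}

public class DedupRule {
    // Among records sharing one digest, return the single record to keep:
    // highest boost wins; on a boost tie, the latest timestamp wins.
    // Every other record in the group would be deleted from the index.
    static Record keep(List<Record> sameDigest) {
        Record best = sameDigest.get(0);
        for (Record r : sameDigest.subList(1, sameDigest.size())) {
            if (r.boost > best.boost
                    || (r.boost == best.boost && r.tstamp > best.tstamp)) {
                best = r;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Record> group = Arrays.asList(
            new Record("a", 1.0f, 100L),
            new Record("b", 2.0f, 50L),   // highest boost: kept
            new Record("c", 2.0f, 40L));  // ties "b" on boost, but older
        System.out.println(keep(group).id); // prints "b"
    }
}
```

This is a sketch of the selection rule only; the actual reducer also buffers the ids of the losing records and sends delete requests to Solr in cleanup.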
Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.Reducer:
- org.apache.hadoop.mapreduce.Reducer.Context
Field Summary:
- static org.slf4j.Logger LOG
Methods inherited from class org.apache.hadoop.mapreduce.Reducer:
- run
Methods inherited from class java.lang.Object:
- clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOG
public static final org.slf4j.Logger LOG
SolrDeleteDuplicates
public SolrDeleteDuplicates()
getConf
public org.apache.hadoop.conf.Configuration getConf()
- Specified by:
getConf
in interface org.apache.hadoop.conf.Configurable
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
- Specified by:
setConf
in interface org.apache.hadoop.conf.Configurable
setup
public void setup(org.apache.hadoop.mapreduce.Reducer.Context job)
throws IOException
- Overrides:
setup
in class org.apache.hadoop.mapreduce.Reducer<org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord,org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord>
- Throws:
IOException
cleanup
public void cleanup(org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException
- Overrides:
cleanup
in class org.apache.hadoop.mapreduce.Reducer<org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord,org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord>
- Throws:
IOException
reduce
public void reduce(org.apache.hadoop.io.Text key,
Iterable<SolrDeleteDuplicates.SolrRecord> values,
org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException
- Overrides:
reduce
in class org.apache.hadoop.mapreduce.Reducer<org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord,org.apache.hadoop.io.Text,SolrDeleteDuplicates.SolrRecord>
- Throws:
IOException
dedup
public boolean dedup(String solrUrl)
throws IOException,
InterruptedException,
ClassNotFoundException
- Throws:
IOException
InterruptedException
ClassNotFoundException
run
public int run(String[] args)
throws IOException,
InterruptedException,
ClassNotFoundException
- Specified by:
run
in interface org.apache.hadoop.util.Tool
- Throws:
IOException
InterruptedException
ClassNotFoundException
main
public static void main(String[] args)
throws Exception
- Throws:
Exception
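Since main delegates to run via org.apache.hadoop.util.Tool, the class is typically invoked from the command line. A usage fragment, assuming a Nutch 1.x installation where the `solrdedup` command maps to this class (verify against your version's `bin/nutch` help):

```
# Remove duplicate documents from the Solr index at the given URL;
# the single argument is passed through to run(String[] args).
bin/nutch solrdedup http://localhost:8983/solr
```

The return value of run (0 on success, non-zero on failure) becomes the shell exit code via ToolRunner.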
Copyright © 2013 The Apache Software Foundation