- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
public class DeduplicationJob
extends org.apache.hadoop.conf.Configured
implements org.apache.hadoop.util.Tool
Generic deduplicator which groups fetched URLs with the same digest and marks
all of them as duplicates except the one with the highest score (based on the
score in the crawldb, which is not necessarily the same as the score indexed).
If two or more documents have the same score, the document with the latest
timestamp is kept. If the timestamps are also equal, the one with the shortest
URL is kept. The documents marked as duplicates can then be deleted with the
CleaningJob command.
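The keep/duplicate decision described above (highest score wins, then latest timestamp, then shortest URL) can be sketched as a comparator. This is a minimal illustration, not Nutch's actual implementation; the `Doc` class and field names are hypothetical:

```java
import java.util.Arrays;
import java.util.Comparator;

public class DedupSketch {
    // Hypothetical stand-in for a fetched document within one digest group.
    static class Doc {
        final String url;
        final float score;     // score from the crawldb
        final long fetchTime;  // fetch timestamp
        Doc(String url, float score, long fetchTime) {
            this.url = url; this.score = score; this.fetchTime = fetchTime;
        }
    }

    // Orders documents so the one to KEEP comes first:
    // highest score, then latest timestamp, then shortest URL.
    static final Comparator<Doc> KEEP_FIRST =
        Comparator.comparingDouble((Doc d) -> d.score).reversed()
                  .thenComparing(Comparator.comparingLong((Doc d) -> d.fetchTime).reversed())
                  .thenComparingInt(d -> d.url.length());

    public static void main(String[] args) {
        // All three documents share the same digest, score, and (for two of
        // them) timestamp, so the shortest URL breaks the final tie.
        Doc[] group = {
            new Doc("http://example.com/a?session=1", 1.0f, 100L),
            new Doc("http://example.com/a",           1.0f, 100L),
            new Doc("http://example.com/old",         1.0f,  50L),
        };
        Arrays.sort(group, KEEP_FIRST);
        System.out.println("keep: " + group[0].url);
        // Every other element of the sorted group would be marked duplicate.
    }
}
```

Sorting the group and keeping only the first element mirrors the rule order stated in the description: score is compared first, and the timestamp and URL-length rules only apply as successive tie-breakers.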