CrawlDbMerger (apache-nutch 1.8 API)

java.lang.Object
- org.apache.hadoop.conf.Configured
- - org.apache.nutch.crawl.CrawlDbMerger

All Implemented Interfaces:

org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
```
public class CrawlDbMerger
extends org.apache.hadoop.conf.Configured
implements org.apache.hadoop.util.Tool
```
This tool merges several CrawlDbs into one, optionally filtering URLs through the current URLFilters to skip prohibited pages.
The tool can also be used for filtering alone; in that case, specify only one CrawlDb in the arguments.
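
The filtering-only case can be pictured with a small plain-Java sketch. It does not use the Nutch classes: `FilterSketch`, `NO_EXECUTABLES`, and `filterUrls` are hypothetical names, and the sketch assumes the convention used by Nutch's `URLFilter` interface, where a filter returns the URL to keep it and `null` to reject it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;

final class FilterSketch {
    // Hypothetical filter for illustration: rejects URLs ending in ".exe".
    static final UnaryOperator<String> NO_EXECUTABLES =
            url -> url.endsWith(".exe") ? null : url;

    // Passes each URL through every filter in order.
    // A null result from any filter drops the URL from the output.
    static List<String> filterUrls(List<String> urls,
                                   List<UnaryOperator<String>> filters) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String current = url;
            for (UnaryOperator<String> filter : filters) {
                current = filter.apply(current);
                if (current == null) {
                    break; // rejected, skip the remaining filters
                }
            }
            if (current != null) {
                kept.add(current);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://example.com/index.html",
                "http://example.com/setup.exe");
        System.out.println(
                filterUrls(urls, Collections.singletonList(NO_EXECUTABLES)));
    }
}
```

In the real tool the same effect would come from passing a single CrawlDb with filtering enabled; the sketch only illustrates why entries for prohibited URLs disappear from the output.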

If more than one CrawlDb contains information about the same URL, only the most recent version is retained, as determined by the value of CrawlDatum.getFetchTime(). However, all metadata information from all versions is accumulated, with newer values taking precedence over older values.
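
The merge rule above can be illustrated with a minimal plain-Java sketch. It is not the actual `CrawlDbMerger.Merger` implementation: `MergeSketch`, its nested `Datum` class, and `mergeVersions` are hypothetical stand-ins, with `fetchTime` playing the role of `CrawlDatum.getFetchTime()` and a string map standing in for the entry's metadata.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MergeSketch {
    // Hypothetical stand-in for one CrawlDb entry of a single URL.
    static final class Datum {
        final long fetchTime;
        final Map<String, String> metadata;

        Datum(long fetchTime, Map<String, String> metadata) {
            this.fetchTime = fetchTime;
            this.metadata = metadata;
        }
    }

    // Merges all versions of one URL's entry: the version with the latest
    // fetch time is retained, and metadata from every version is accumulated
    // with newer values overriding older ones.
    static Datum mergeVersions(List<Datum> versions) {
        if (versions.isEmpty()) {
            throw new IllegalArgumentException("at least one version is required");
        }
        List<Datum> sorted = new ArrayList<>(versions);
        sorted.sort(Comparator.comparingLong(d -> d.fetchTime)); // oldest first
        Map<String, String> merged = new LinkedHashMap<>();
        for (Datum d : sorted) {
            merged.putAll(d.metadata); // later (newer) puts override earlier ones
        }
        Datum newest = sorted.get(sorted.size() - 1);
        return new Datum(newest.fetchTime, merged);
    }

    public static void main(String[] args) {
        Map<String, String> olderMeta = new HashMap<>();
        olderMeta.put("source", "db1");
        olderMeta.put("score", "0.5");
        Map<String, String> newerMeta = new HashMap<>();
        newerMeta.put("score", "0.9");

        Datum merged = mergeVersions(Arrays.asList(
                new Datum(200L, newerMeta), new Datum(100L, olderMeta)));
        System.out.println(merged.fetchTime + " " + merged.metadata);
    }
}
```

Sorting oldest-first and applying `putAll` in that order is one simple way to get "newer values take precedence": keys present only in an older version survive, while keys present in both end up with the newer value.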

Author:

Andrzej Bialecki

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`CrawlDbMerger.Merger`

Constructor Summary

Constructors
Constructor and Description
`CrawlDbMerger()`
`CrawlDbMerger(org.apache.hadoop.conf.Configuration conf)`

Method Summary

Methods
Modifier and Type	Method and Description
`static org.apache.hadoop.mapred.JobConf`	`createMergeJob(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path output, boolean normalize, boolean filter)`
`static void`	`main(String[] args)`
`void`	`merge(org.apache.hadoop.fs.Path output, org.apache.hadoop.fs.Path[] dbs, boolean normalize, boolean filter)`
`int`	`run(String[] args)`

Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf

Constructor Detail

CrawlDbMerger
```
public CrawlDbMerger()
```

CrawlDbMerger

public CrawlDbMerger(org.apache.hadoop.conf.Configuration conf)

Method Detail

merge

public void merge(org.apache.hadoop.fs.Path output,
         org.apache.hadoop.fs.Path[] dbs,
         boolean normalize,
         boolean filter)
           throws Exception

Throws:

Exception

createMergeJob

public static org.apache.hadoop.mapred.JobConf createMergeJob(org.apache.hadoop.conf.Configuration conf,
                                              org.apache.hadoop.fs.Path output,
                                              boolean normalize,
                                              boolean filter)

main

public static void main(String[] args)
                 throws Exception

Parameters:

args - command-line arguments

Throws:

Exception

run
```
public int run(String[] args)
        throws Exception
```
Specified by:

run in interface org.apache.hadoop.util.Tool

Throws:

Exception
