org.apache.nutch.crawl
Class CrawlDbMerger
java.lang.Object
  org.apache.hadoop.conf.Configured
    org.apache.nutch.crawl.CrawlDbMerger
- All Implemented Interfaces:
- Configurable, Tool
public class CrawlDbMerger
extends Configured
implements Tool
This tool merges several CrawlDb-s into one, optionally filtering
URLs through the current URLFilters, to skip prohibited pages.
It is possible to use this tool just for filtering - in that case
only one CrawlDb should be specified in the arguments.
If more than one CrawlDb contains information about the same URL,
only the most recent version is retained, as determined by the
value of CrawlDatum.getFetchTime().
However, all metadata information from all versions is accumulated,
with newer values taking precedence over older values.
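The precedence rules above can be sketched in plain Java. The `Record` class below is a hypothetical stand-in for `CrawlDatum` (just a fetch time and a metadata map), not the real Nutch type; it only illustrates the merge semantics described here:

```java
import java.util.HashMap;
import java.util.Map;

public class MergeSketch {
    // Simplified stand-in for CrawlDatum: a fetch time plus metadata.
    static class Record {
        final long fetchTime;            // analogous to CrawlDatum.getFetchTime()
        final Map<String, String> meta;
        Record(long fetchTime, Map<String, String> meta) {
            this.fetchTime = fetchTime;
            this.meta = meta;
        }
    }

    // For duplicate URLs, keep the record with the most recent fetch time,
    // but accumulate metadata from both versions, newer values winning.
    static Record merge(Record a, Record b) {
        Record newer = a.fetchTime >= b.fetchTime ? a : b;
        Record older = (newer == a) ? b : a;
        Map<String, String> meta = new HashMap<>(older.meta); // older values first
        meta.putAll(newer.meta);                              // newer values override
        return new Record(newer.fetchTime, meta);
    }

    public static void main(String[] args) {
        Map<String, String> m1 = new HashMap<>();
        m1.put("score", "0.1");
        m1.put("host", "a");
        Map<String, String> m2 = new HashMap<>();
        m2.put("score", "0.9");
        Record merged = merge(new Record(100L, m1), new Record(200L, m2));
        // Newest version retained; metadata accumulated across versions.
        System.out.println(merged.fetchTime + " " + merged.meta.get("score")
                + " " + merged.meta.get("host"));
        // prints "200 0.9 a"
    }
}
```

Note that the newer record wins wholesale, but its metadata map is layered on top of the older one, so keys present only in the older version still survive the merge.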
- Author:
- Andrzej Bialecki
Methods inherited from class java.lang.Object:
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
CrawlDbMerger
public CrawlDbMerger()
CrawlDbMerger
public CrawlDbMerger(Configuration conf)
merge
public void merge(Path output,
Path[] dbs,
boolean normalize,
boolean filter)
throws Exception
- Throws:
Exception
createMergeJob
public static JobConf createMergeJob(Configuration conf,
Path output,
boolean normalize,
boolean filter)
main
public static void main(String[] args)
throws Exception
- Parameters:
- args -
- Throws:
Exception
run
public int run(String[] args)
throws Exception
- Specified by:
run
in interface Tool
- Throws:
Exception
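From the command line, run(String[]) is normally reached through the nutch launcher script. The synopsis below reflects the common usage pattern, CrawlDbMerger output_crawldb crawldb1 [crawldb2 ...] [-normalize] [-filter]; the paths are illustrative and the option names should be verified against the installed Nutch version:

```shell
# Merge two CrawlDbs into one, filtering URLs through the configured
# URLFilters to skip prohibited pages (paths are illustrative):
bin/nutch mergedb crawl/merged_db crawl/db1 crawl/db2 -filter

# Filtering-only mode: a single input CrawlDb, no actual merging.
bin/nutch mergedb crawl/filtered_db crawl/db1 -filter
```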
Copyright © 2011 The Apache Software Foundation