org.apache.nutch.crawl
Class CrawlDbMerger

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.nutch.crawl.CrawlDbMerger
All Implemented Interfaces:
Configurable, Tool

public class CrawlDbMerger
extends Configured
implements Tool

This tool merges several CrawlDbs into one, optionally filtering URLs through the current URLFilters to skip prohibited pages.

The tool can also be used for filtering alone; in that case, specify only one CrawlDb in the arguments.

If more than one CrawlDb contains information about the same URL, only the most recent version is retained, as determined by the value of CrawlDatum.getFetchTime(). However, all metadata information from all versions is accumulated, with newer values taking precedence over older values.
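
The merge rule described above can be illustrated with a minimal, self-contained sketch. It does not use the Nutch or Hadoop classes: `CrawlDbMergeSketch`, `Entry`, and `mergeVersions` are hypothetical names introduced only for this example, with a plain `long` standing in for the value of CrawlDatum.getFetchTime() and a string map standing in for the metadata. How the real tool resolves entries with equal fetch times is not stated here; in this sketch the later entry in the input list wins, which is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the merge rule; not part of Nutch.
class CrawlDbMergeSketch {

    /** Hypothetical stand-in for one version of a URL's CrawlDb record. */
    static final class Entry {
        final long fetchTime;                 // stands in for CrawlDatum.getFetchTime()
        final Map<String, String> metadata;   // stands in for the record's metadata

        Entry(long fetchTime, Map<String, String> metadata) {
            this.fetchTime = fetchTime;
            this.metadata = metadata;
        }
    }

    /**
     * Merges all versions of a single URL: the result carries the fetch time
     * of the most recent version, and metadata accumulated from every version,
     * with newer values overwriting older ones.
     */
    static Entry mergeVersions(List<Entry> versions) {
        if (versions.isEmpty()) {
            throw new IllegalArgumentException("at least one version is required");
        }
        // Sort oldest first so that later putAll calls overwrite older values.
        List<Entry> sorted = new ArrayList<>(versions);
        sorted.sort(Comparator.comparingLong(e -> e.fetchTime));

        // TreeMap keeps keys sorted, so the printed output is deterministic.
        Map<String, String> accumulated = new TreeMap<>();
        for (Entry e : sorted) {
            accumulated.putAll(e.metadata);
        }
        Entry newest = sorted.get(sorted.size() - 1);
        return new Entry(newest.fetchTime, accumulated);
    }

    public static void main(String[] args) {
        Entry older = new Entry(1000L, Map.of("score", "0.5", "lang", "en"));
        Entry newer = new Entry(2000L, Map.of("score", "0.9"));

        Entry merged = mergeVersions(List.of(newer, older));
        // prints: 2000 {lang=en, score=0.9}
        System.out.println(merged.fetchTime + " " + merged.metadata);
    }
}
```

In the sketch, `lang` survives from the older version because the newer version does not define it, while `score` takes the newer value.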

Author:
Andrzej Bialecki

Nested Class Summary
static class CrawlDbMerger.Merger
           
 
Constructor Summary
CrawlDbMerger()
           
CrawlDbMerger(Configuration conf)
           
 
Method Summary
static JobConf createMergeJob(Configuration conf, Path output, boolean normalize, boolean filter)
           
static void main(String[] args)
           
 void merge(Path output, Path[] dbs, boolean normalize, boolean filter)
           
 int run(String[] args)
           
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Constructor Detail

CrawlDbMerger

public CrawlDbMerger()

CrawlDbMerger

public CrawlDbMerger(Configuration conf)
Method Detail

merge

public void merge(Path output,
                  Path[] dbs,
                  boolean normalize,
                  boolean filter)
           throws Exception
Throws:
Exception

createMergeJob

public static JobConf createMergeJob(Configuration conf,
                                     Path output,
                                     boolean normalize,
                                     boolean filter)

main

public static void main(String[] args)
                 throws Exception
Parameters:
args - command-line arguments
Throws:
Exception

run

public int run(String[] args)
        throws Exception
Specified by:
run in interface Tool
Throws:
Exception


Copyright © 2011 The Apache Software Foundation