public class CrawlDbMerger extends Configured implements Tool
Merges several CrawlDbs into one, optionally normalizing and filtering URLs along the way. The tool can also be used for filtering alone; in that case, specify only one CrawlDb in the arguments.
If more than one CrawlDb contains information about the same URL, only the most recent version is retained, as determined by the value of CrawlDatum.getFetchTime(). However, metadata from all versions is accumulated, with newer values taking precedence over older values.
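The merge rule described above can be illustrated with a minimal, self-contained sketch. It does not use the Nutch or Hadoop APIs: `MergeSketch`, `Entry`, and `mergeVersions` are hypothetical names introduced here, `Entry` is only a stand-in for a CrawlDb record, and the fetch time is assumed to be a plain `long`. How ties between equal fetch times are resolved is also an assumption of this sketch, not something the description specifies.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeSketch {

    /** Hypothetical stand-in for one CrawlDb record; not the real CrawlDatum. */
    public static final class Entry {
        public final long fetchTime;
        public final Map<String, String> metadata;

        public Entry(long fetchTime, Map<String, String> metadata) {
            this.fetchTime = fetchTime;
            this.metadata = new HashMap<>(metadata); // defensive copy
        }
    }

    /**
     * Merges all versions of one URL: the newest fetch time is kept,
     * and metadata from every version is accumulated so that values
     * from newer versions overwrite values from older ones.
     */
    public static Entry mergeVersions(List<Entry> versions) {
        if (versions.isEmpty()) {
            throw new IllegalArgumentException("no versions to merge");
        }
        // Sort a copy from oldest to newest. The sort is stable, so on
        // equal fetch times the later list element wins (an assumption).
        List<Entry> sorted = new ArrayList<>(versions);
        sorted.sort(Comparator.comparingLong((Entry e) -> e.fetchTime));

        // Apply metadata oldest-first so newer values take precedence.
        Map<String, String> merged = new HashMap<>();
        for (Entry e : sorted) {
            merged.putAll(e.metadata);
        }
        Entry newest = sorted.get(sorted.size() - 1);
        return new Entry(newest.fetchTime, merged);
    }

    public static void main(String[] args) {
        Map<String, String> oldMeta = new HashMap<>();
        oldMeta.put("lang", "en");
        oldMeta.put("score", "old");
        Map<String, String> newMeta = new HashMap<>();
        newMeta.put("score", "new");

        List<Entry> versions = new ArrayList<>();
        versions.add(new Entry(200L, newMeta));
        versions.add(new Entry(100L, oldMeta));

        Entry merged = mergeVersions(versions);
        System.out.println(merged.fetchTime + " " + merged.metadata);
    }
}
```

In this sketch the merged entry keeps the fetch time of the newest version, retains `lang` from the older version because nothing newer overrides it, and takes `score` from the newer one.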
Modifier and Type | Class and Description |
---|---|
static class | CrawlDbMerger.Merger |
Constructor and Description |
---|
CrawlDbMerger() |
CrawlDbMerger(Configuration conf) |
Modifier and Type | Method and Description |
---|---|
static JobConf | createMergeJob(Configuration conf, Path output, boolean normalize, boolean filter) |
static void | main(String[] args) |
void | merge(Path output, Path[] dbs, boolean normalize, boolean filter) |
int | run(String[] args) |
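Methods inherited from class org.apache.hadoop.conf.Configured: getConf, setConf
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.hadoop.conf.Configurable: getConf, setConf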
public CrawlDbMerger()
public CrawlDbMerger(Configuration conf)
Copyright © 2015 The Apache Software Foundation