public class CrawlDbMerger
extends org.apache.hadoop.conf.Configured
implements org.apache.hadoop.util.Tool
This tool can also be used purely for filtering; in that case, specify only one CrawlDb in the arguments.
If more than one CrawlDb contains information about the same URL,
only the most recent version is retained, as determined by the
value of CrawlDatum.getFetchTime().
However, all metadata information from all versions is accumulated,
with newer values taking precedence over older values.
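The merge rule described above can be illustrated with a minimal, self-contained sketch. It does not use the Nutch classes: `MergeSketch`, `Entry`, and `mergeVersions` are hypothetical names introduced only for illustration, and metadata is modeled as a plain string map. How Nutch itself resolves entries with equal fetch times is not stated here, so the tie behavior of this sketch (the later list element wins) is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MergeSketch {
    /** Hypothetical stand-in for one CrawlDb record: a fetch time plus metadata. */
    static final class Entry {
        final long fetchTime;
        final Map<String, String> metadata;

        Entry(long fetchTime, Map<String, String> metadata) {
            this.fetchTime = fetchTime;
            this.metadata = new HashMap<>(metadata);
        }
    }

    /**
     * Merges all versions of a single URL: the result carries the newest
     * fetch time, and metadata from every version is accumulated so that
     * values from newer versions overwrite values from older ones.
     */
    static Entry mergeVersions(List<Entry> versions) {
        if (versions.isEmpty()) {
            throw new IllegalArgumentException("at least one version is required");
        }
        // Sort oldest first so that later putAll calls let newer values win.
        List<Entry> sorted = new ArrayList<>(versions);
        sorted.sort(Comparator.comparingLong(e -> e.fetchTime));

        Map<String, String> accumulated = new HashMap<>();
        for (Entry e : sorted) {
            accumulated.putAll(e.metadata);
        }
        Entry newest = sorted.get(sorted.size() - 1);
        return new Entry(newest.fetchTime, accumulated);
    }

    public static void main(String[] args) {
        Entry older = new Entry(100L, Map.of("lang", "en", "score", "0.1"));
        Entry newer = new Entry(200L, Map.of("score", "0.9"));

        Entry merged = mergeVersions(List.of(newer, older));
        // Prints: 200 {lang=en, score=0.9}
        System.out.println(merged.fetchTime + " " + new TreeMap<>(merged.metadata));
    }
}
```

In this example the `lang` key survives even though only the older version carries it, while `score` takes the value from the newer version.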
Modifier and Type | Class and Description |
---|---|
static class | CrawlDbMerger.Merger |
Constructor and Description |
---|
CrawlDbMerger() |
CrawlDbMerger(org.apache.hadoop.conf.Configuration conf) |
Modifier and Type | Method and Description |
---|---|
static org.apache.hadoop.mapred.JobConf | createMergeJob(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path output, boolean normalize, boolean filter) |
static void | main(String[] args) |
void | merge(org.apache.hadoop.fs.Path output, org.apache.hadoop.fs.Path[] dbs, boolean normalize, boolean filter) |
int | run(String[] args) |
public CrawlDbMerger()
public CrawlDbMerger(org.apache.hadoop.conf.Configuration conf)
public void merge(org.apache.hadoop.fs.Path output, org.apache.hadoop.fs.Path[] dbs, boolean normalize, boolean filter) throws Exception
Throws: Exception
public static org.apache.hadoop.mapred.JobConf createMergeJob(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path output, boolean normalize, boolean filter)
Copyright © 2014 The Apache Software Foundation