org.apache.nutch.crawl
Class LinkDbMerger
java.lang.Object
org.apache.hadoop.conf.Configured
org.apache.nutch.crawl.LinkDbMerger
- All Implemented Interfaces:
- Closeable, Configurable, JobConfigurable, Reducer<Text,Inlinks,Text,Inlinks>, Tool
public class LinkDbMerger
- extends Configured
- implements Tool, Reducer<Text,Inlinks,Text,Inlinks>
This tool merges several LinkDbs into one, optionally passing URLs
through the current URLFilters to skip prohibited URLs and links.
The tool can also be used for filtering alone; in that case,
specify only one LinkDb in the arguments.
If more than one LinkDb contains information about the same URL,
the inlinks from all of them are accumulated, but at most
db.max.inlinks inlinks are kept for that URL.
If filtering is activated, URLFilters are applied both to target URLs
and to the URLs of incoming links. If a target URL is prohibited, the
target and all of its inlinks are removed. If only some of the incoming
links are prohibited, only those links are removed, and they do not
count toward the db.max.inlinks limit.
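The merge and filter rules described above can be sketched with plain Java collections. This is a minimal illustrative model, not the Nutch implementation: `LinkDbMergeSketch`, its `merge` method, the `allowed` predicate (a stand-in for URLFilters) and the `maxInlinks` parameter (a stand-in for db.max.inlinks) are all hypothetical names, and inlinks are modeled as bare URL strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Illustrative model only: plain maps stand in for LinkDb, Inlinks and URLFilters.
class LinkDbMergeSketch {

    /**
     * Merges several "link databases" (target URL -> inlink URLs).
     * allowed    hypothetical stand-in for URLFilters; null disables filtering
     * maxInlinks hypothetical stand-in for the db.max.inlinks setting
     */
    static Map<String, List<String>> merge(List<Map<String, List<String>>> dbs,
                                           Predicate<String> allowed,
                                           int maxInlinks) {
        Map<String, Set<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> db : dbs) {
            for (Map.Entry<String, List<String>> entry : db.entrySet()) {
                String target = entry.getKey();
                // A prohibited target is dropped together with all of its inlinks.
                if (allowed != null && !allowed.test(target)) {
                    continue;
                }
                Set<String> inlinks = merged.computeIfAbsent(target, k -> new LinkedHashSet<>());
                for (String from : entry.getValue()) {
                    // Stop accumulating once the cap is reached.
                    if (inlinks.size() >= maxInlinks) {
                        break;
                    }
                    // A prohibited inlink is skipped and does not count toward the cap.
                    if (allowed != null && !allowed.test(from)) {
                        continue;
                    }
                    inlinks.add(from);
                }
            }
        }
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : merged.entrySet()) {
            // Assumption of this sketch: targets left with no inlinks are omitted.
            if (!e.getValue().isEmpty()) {
                result.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> first = new LinkedHashMap<>();
        first.put("http://example.org/", Arrays.asList("http://a.example/", "http://spam.example/"));
        Map<String, List<String>> second = new LinkedHashMap<>();
        second.put("http://example.org/", Arrays.asList("http://b.example/"));
        Predicate<String> noSpam = url -> !url.contains("spam");
        System.out.println(merge(Arrays.asList(first, second), noSpam, 10));
    }
}
```

Passing a single map with a non-null predicate corresponds to the filtering-only use, and passing `null` for the predicate corresponds to merging without filtering.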
- Author:
- Andrzej Bialecki
Method Summary
- void close()
- void configure(JobConf job)
- static JobConf createMergeJob(Configuration config, Path linkDb, boolean normalize, boolean filter)
- static void main(String[] args)
- void merge(Path output, Path[] dbs, boolean normalize, boolean filter)
- void reduce(Text key, Iterator<Inlinks> values, OutputCollector<Text,Inlinks> output, Reporter reporter)
- int run(String[] args)
Methods inherited from class java.lang.Object
- clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LinkDbMerger
public LinkDbMerger()
LinkDbMerger
public LinkDbMerger(Configuration conf)
reduce
public void reduce(Text key,
Iterator<Inlinks> values,
OutputCollector<Text,Inlinks> output,
Reporter reporter)
throws IOException
- Specified by:
reduce
in interface Reducer<Text,Inlinks,Text,Inlinks>
- Throws:
IOException
configure
public void configure(JobConf job)
- Specified by:
configure
in interface JobConfigurable
close
public void close()
throws IOException
- Specified by:
close
in interface Closeable
- Throws:
IOException
merge
public void merge(Path output,
Path[] dbs,
boolean normalize,
boolean filter)
throws Exception
- Throws:
Exception
createMergeJob
public static JobConf createMergeJob(Configuration config,
Path linkDb,
boolean normalize,
boolean filter)
main
public static void main(String[] args)
throws Exception
- Parameters:
args - command-line arguments
- Throws:
Exception
run
public int run(String[] args)
throws Exception
- Specified by:
run
in interface Tool
- Throws:
Exception
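The parameters of merge (an output path, one or more LinkDb paths, and the normalize and filter switches) suggest how run might map command-line arguments onto such a call. The sketch below shows one way to do that mapping. The argument order and the flag spellings `-normalize` and `-filter` are assumptions not stated on this page, and `MergeArgsSketch` is a hypothetical helper that is not part of Nutch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical argument holder; not part of Nutch.
class MergeArgsSketch {
    final String output;
    final List<String> dbs = new ArrayList<>();
    boolean normalize;
    boolean filter;

    MergeArgsSketch(String output) {
        this.output = output;
    }

    // Assumed layout: <output> <linkdb> [<linkdb> ...] [-normalize] [-filter]
    static MergeArgsSketch parse(String[] args) {
        if (args.length < 2) {
            throw new IllegalArgumentException(
                "expected an output path and at least one LinkDb path");
        }
        MergeArgsSketch parsed = new MergeArgsSketch(args[0]);
        for (int i = 1; i < args.length; i++) {
            if ("-normalize".equals(args[i])) {
                parsed.normalize = true;
            } else if ("-filter".equals(args[i])) {
                parsed.filter = true;
            } else {
                // Anything that is not a recognized flag is treated as a LinkDb path.
                parsed.dbs.add(args[i]);
            }
        }
        if (parsed.dbs.isEmpty()) {
            throw new IllegalArgumentException("at least one LinkDb path is required");
        }
        return parsed;
    }

    public static void main(String[] args) {
        MergeArgsSketch parsed = parse(new String[] {"merged_linkdb", "linkdb1", "linkdb2", "-filter"});
        System.out.println(parsed.output + " <- " + parsed.dbs
            + " normalize=" + parsed.normalize + " filter=" + parsed.filter);
    }
}
```

With a layout like this, the filtering-only case is a call with exactly one LinkDb path plus the filter switch.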
Copyright © 2011 The Apache Software Foundation