org.apache.nutch.tools
Class CrawlDBScanner
java.lang.Object
org.apache.hadoop.conf.Configured
org.apache.nutch.tools.CrawlDBScanner
- All Implemented Interfaces:
- Closeable, Configurable, JobConfigurable, Mapper<Text,CrawlDatum,Text,CrawlDatum>, Reducer<Text,CrawlDatum,Text,CrawlDatum>, Tool
public class CrawlDBScanner
- extends Configured
- implements Tool, Mapper<Text,CrawlDatum,Text,CrawlDatum>, Reducer<Text,CrawlDatum,Text,CrawlDatum>
Dumps all CrawlDB entries whose URL matches a regular expression. Generates
either a text representation of the CrawlDatum objects or binary objects, which
can then be used as a new CrawlDB. The dump mechanism of the CrawlDB reader is
not very useful on large CrawlDBs because the output can be extremely large, and
its -url option does not help when the URL to inspect is not known in advance.
- Author:
- Julien Nioche
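The URL filtering described above can be illustrated without Hadoop or Nutch. The following is a minimal sketch, not the class's actual implementation: `UrlRegexFilterSketch` and its `matching` helper are hypothetical names introduced for illustration, and the sketch assumes the regular expression must match the whole URL, which this page does not specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Nutch.
public class UrlRegexFilterSketch {

    // Returns the URLs that the regular expression matches in full.
    // Assumption: whole-string matching (Matcher.matches), not substring search.
    public static List<String> matching(List<String> urls, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (pattern.matcher(url).matches()) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://example.org/",
                "http://example.org/news/2011/index.html",
                "http://example.com/about.html");

        // Prints only http://example.org/news/2011/index.html
        for (String url : matching(urls, ".*example\\.org/news/.*")) {
            System.out.println(url);
        }
    }
}
```

Under the whole-string assumption, a pattern meant to match part of a URL needs leading and trailing `.*`, as in the usage above.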
Field Summary
static org.slf4j.Logger LOG
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOG
public static final org.slf4j.Logger LOG
CrawlDBScanner
public CrawlDBScanner()
CrawlDBScanner
public CrawlDBScanner(Configuration conf)
close
public void close()
- Specified by:
close
in interface Closeable
configure
public void configure(JobConf job)
- Specified by:
configure
in interface JobConfigurable
map
public void map(Text url,
CrawlDatum crawlDatum,
OutputCollector<Text,CrawlDatum> output,
Reporter reporter)
throws IOException
- Specified by:
map
in interface Mapper<Text,CrawlDatum,Text,CrawlDatum>
- Throws:
IOException
reduce
public void reduce(Text key,
Iterator<CrawlDatum> values,
OutputCollector<Text,CrawlDatum> output,
Reporter reporter)
throws IOException
- Specified by:
reduce
in interface Reducer<Text,CrawlDatum,Text,CrawlDatum>
- Throws:
IOException
main
public static void main(String[] args)
throws Exception
- Throws:
Exception
run
public int run(String[] args)
throws Exception
- Specified by:
run
in interface Tool
- Throws:
Exception
Copyright © 2011 The Apache Software Foundation