org.apache.nutch.tools
Class CrawlDBScanner

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.nutch.tools.CrawlDBScanner
All Implemented Interfaces:
Closeable, Configurable, JobConfigurable, Mapper<Text,CrawlDatum,Text,CrawlDatum>, Reducer<Text,CrawlDatum,Text,CrawlDatum>, Tool

public class CrawlDBScanner
extends Configured
implements Tool, Mapper<Text,CrawlDatum,Text,CrawlDatum>, Reducer<Text,CrawlDatum,Text,CrawlDatum>

Dumps all CrawlDB entries whose URL matches a regular expression. The output is either a text representation of the matching CrawlDatum objects or binary objects that can then be used as a new CrawlDB. The dump mechanism of the CrawlDB reader is of limited use on large CrawlDBs, because the output can be extremely large, and its -url function does not help when the URL of interest is not known in advance.
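
The filtering idea described above can be illustrated with a minimal, self-contained sketch. It is not the Nutch implementation: the class name UrlRegexFilterSketch, the scan method, and the use of a plain Map of URL to status string in place of Text/CrawlDatum pairs are all hypothetical stand-ins. It also assumes that the whole URL must match the expression; this page does not state whether the real tool uses full or partial matching.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: keeps only the entries whose URL key matches a regex.
// A Map<String, String> stands in for the CrawlDB (URL -> status label).
public class UrlRegexFilterSketch {

    // Returns the entries whose URL matches the regex in full (assumption:
    // full-string matching, via Matcher.matches()).
    static Map<String, String> scan(Map<String, String> entries, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, String> matches = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : entries.entrySet()) {
            if (pattern.matcher(entry.getKey()).matches()) {
                matches.put(entry.getKey(), entry.getValue());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Placeholder data, invented for illustration only.
        Map<String, String> crawlDb = new LinkedHashMap<>();
        crawlDb.put("http://example.org/a", "db_fetched");
        crawlDb.put("http://example.com/b", "db_unfetched");
        crawlDb.put("http://example.org/c", "db_unfetched");

        // Select every entry hosted on example.org.
        Map<String, String> selected = scan(crawlDb, ".*example\\.org.*");
        for (Map.Entry<String, String> entry : selected.entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Filtering on a pattern keeps the output small even when the source is very large, which is the motivation given above for scanning by regular expression rather than dumping the whole CrawlDB.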

Author:
Julien Nioche

Field Summary
static org.slf4j.Logger LOG
           
 
Constructor Summary
CrawlDBScanner()
           
CrawlDBScanner(Configuration conf)
           
 
Method Summary
 void close()
           
 void configure(JobConf job)
           
static void main(String[] args)
           
 void map(Text url, CrawlDatum crawlDatum, OutputCollector<Text,CrawlDatum> output, Reporter reporter)
           
 void reduce(Text key, Iterator<CrawlDatum> values, OutputCollector<Text,CrawlDatum> output, Reporter reporter)
           
 int run(String[] args)
           
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Field Detail

LOG

public static final org.slf4j.Logger LOG
Constructor Detail

CrawlDBScanner

public CrawlDBScanner()

CrawlDBScanner

public CrawlDBScanner(Configuration conf)
Method Detail

close

public void close()
Specified by:
close in interface Closeable

configure

public void configure(JobConf job)
Specified by:
configure in interface JobConfigurable

map

public void map(Text url,
                CrawlDatum crawlDatum,
                OutputCollector<Text,CrawlDatum> output,
                Reporter reporter)
         throws IOException
Specified by:
map in interface Mapper<Text,CrawlDatum,Text,CrawlDatum>
Throws:
IOException

reduce

public void reduce(Text key,
                   Iterator<CrawlDatum> values,
                   OutputCollector<Text,CrawlDatum> output,
                   Reporter reporter)
            throws IOException
Specified by:
reduce in interface Reducer<Text,CrawlDatum,Text,CrawlDatum>
Throws:
IOException

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception

run

public int run(String[] args)
        throws Exception
Specified by:
run in interface Tool
Throws:
Exception


Copyright © 2011 The Apache Software Foundation