org.apache.nutch.tools.compat
Class ReprUrlFixer

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.nutch.tools.compat.ReprUrlFixer
All Implemented Interfaces:
Closeable, Configurable, JobConfigurable, Reducer<Text,CrawlDatum,Text,CrawlDatum>, Tool

public class ReprUrlFixer
extends Configured
implements Tool, Reducer<Text,CrawlDatum,Text,CrawlDatum>

Significant changes were made to the representative URL logic used for redirects. This tool fixes the representative URLs stored in existing segments and crawl databases. Any new fetches will use the new representative URL logic.

All crawl datums are assumed to be temporary URL redirects. While this may cause some URLs to be incorrectly removed, this tool is a temporary measure to be used until fetches can be rerun. The same reduce logic is applied to the fetch and parse directories of segments and to existing crawl databases.
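
The reduce step described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the tool's actual implementation: the `Datum` type, the `REPR_KEY` metadata key, and the `chooseRepr` rule (prefer the shorter URL) are all hypothetical stand-ins chosen for the example. The only part taken from the description is that every datum is treated as a temporary redirect.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-ins only; these are NOT the Nutch or Hadoop classes.
class ReprUrlSketch {

    /** Hypothetical, simplified crawl datum: just a metadata map. */
    static final class Datum {
        final Map<String, String> meta = new HashMap<>();
    }

    // Hypothetical metadata key under which a representative URL is stored.
    static final String REPR_KEY = "repr_url";

    /**
     * Placeholder selection rule (an assumption for illustration, not Nutch's rule):
     * for a temporary redirect prefer the shorter URL, keeping the source on a tie;
     * otherwise keep the destination.
     */
    static String chooseRepr(String src, String dst, boolean temp) {
        if (!temp) {
            return dst;
        }
        return dst.length() < src.length() ? dst : src;
    }

    /**
     * Reduce-style step: for one URL key, re-evaluate the stored representative
     * URL of every datum and emit the (possibly modified) datums.
     */
    static List<Datum> reduce(String url, List<Datum> values) {
        List<Datum> out = new ArrayList<>();
        for (Datum d : values) {
            String stored = d.meta.get(REPR_KEY);
            if (stored != null) {
                // Every datum is assumed to be a temporary redirect.
                String chosen = chooseRepr(url, stored, true);
                if (chosen.equals(url)) {
                    d.meta.remove(REPR_KEY); // the URL represents itself
                } else {
                    d.meta.put(REPR_KEY, chosen);
                }
            }
            out.add(d);
        }
        return out;
    }

    public static void main(String[] args) {
        Datum d = new Datum();
        d.meta.put(REPR_KEY, "http://a.example/");
        List<Datum> out = reduce("http://a.example/some/long/page.html",
                Collections.singletonList(d));
        System.out.println(out.get(0).meta);
    }
}
```

Because the redirect type is forced to temporary, a stored representative URL can be dropped even where a permanent redirect would have kept it, which is the trade-off the description accepts until fetches are rerun.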


Field Summary
static org.apache.commons.logging.Log LOG
           
 
Constructor Summary
ReprUrlFixer()
           
 
Method Summary
 void close()
           
 void configure(JobConf conf)
           
static void main(String[] args)
          Runs the ReprUrlFixer.
 void reduce(Text key, Iterator<CrawlDatum> values, OutputCollector<Text,CrawlDatum> output, Reporter reporter)
          Runs the new representative URL logic on all crawl datums.
 int run(String[] args)
          Parses command-line options and executes the main update logic.
 void update(Path crawlDb, Path[] segments)
          Runs the fixer on the specified crawl database and segments.
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Field Detail

LOG

public static final org.apache.commons.logging.Log LOG
Constructor Detail

ReprUrlFixer

public ReprUrlFixer()
Method Detail

configure

public void configure(JobConf conf)
Specified by:
configure in interface JobConfigurable

reduce

public void reduce(Text key,
                   Iterator<CrawlDatum> values,
                   OutputCollector<Text,CrawlDatum> output,
                   Reporter reporter)
            throws IOException
Runs the new representative URL logic on all crawl datums.

Specified by:
reduce in interface Reducer<Text,CrawlDatum,Text,CrawlDatum>
Throws:
IOException

close

public void close()
Specified by:
close in interface Closeable

update

public void update(Path crawlDb,
                   Path[] segments)
            throws IOException
Runs the fixer on the specified crawl database and segments.

Throws:
IOException
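
To make concrete which directories such an update would plausibly touch, here is a small standard-library sketch. It is an illustration under assumptions, not the method's actual body: it uses `java.nio.file.Path` rather than Hadoop's `Path`, the helper name `inputsFor` is hypothetical, the handling of `null` arguments is assumed, and the subdirectory names `current`, `crawl_fetch`, and `crawl_parse` follow the usual Nutch layout rather than anything stated on this page.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustration only: lists the directories a fixer run would plausibly read.
class FixerInputsSketch {

    /**
     * Hypothetical helper. Returns the crawl database's current directory plus
     * the fetch and parse directories of each segment. Either argument may be
     * null (an assumption), in which case it contributes nothing.
     */
    static List<Path> inputsFor(Path crawlDb, Path[] segments) {
        List<Path> inputs = new ArrayList<>();
        if (crawlDb != null) {
            inputs.add(crawlDb.resolve("current"));
        }
        if (segments != null) {
            for (Path segment : segments) {
                inputs.add(segment.resolve("crawl_fetch"));
                inputs.add(segment.resolve("crawl_parse"));
            }
        }
        return inputs;
    }

    public static void main(String[] args) {
        Path crawlDb = Paths.get("crawl", "crawldb");
        Path[] segments = { Paths.get("crawl", "segments", "20240101000000") };
        for (Path p : inputsFor(crawlDb, segments)) {
            System.out.println(p);
        }
    }
}
```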

main

public static void main(String[] args)
                 throws Exception
Runs the ReprUrlFixer.

Throws:
Exception

run

public int run(String[] args)
        throws Exception
Parses command-line options and executes the main update logic.

Specified by:
run in interface Tool
Throws:
Exception


Copyright © 2006 The Apache Software Foundation