org.apache.nutch.tools.arc
Class ArcSegmentCreator

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.nutch.tools.arc.ArcSegmentCreator
All Implemented Interfaces:
Closeable, Configurable, JobConfigurable, Mapper<Text,BytesWritable,Text,NutchWritable>, Tool

public class ArcSegmentCreator
extends Configured
implements Tool, Mapper<Text,BytesWritable,Text,NutchWritable>

The ArcSegmentCreator is a replacement for the fetcher that takes arc files as input and produces a Nutch segment as output.

Arc files are archives of gzip-compressed records, produced by both the Internet Archive project and the Grub distributed crawler project.
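
As a rough illustration of a container of gzip-compressed records, the sketch below assumes each record is stored as its own gzip member and that the members are simply concatenated. It does not use any Nutch code; the class name GzipMembersSketch and its methods are hypothetical and exist only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Nutch.
class GzipMembersSketch {

    /** Compresses one record into its own gzip member. */
    static byte[] gzipMember(String record) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(record.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Concatenates the gzip members, then decompresses them as one stream. */
    static String roundTrip(String... records) {
        ByteArrayOutputStream archive = new ByteArrayOutputStream();
        for (String record : records) {
            byte[] member = gzipMember(record);
            archive.write(member, 0, member.length);
        }
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(archive.toByteArray()))) {
            byte[] chunk = new byte[4096];
            int n;
            // GZIPInputStream continues into the next member when one follows.
            while ((n = in.read(chunk)) != -1) {
                plain.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(plain.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Calling GzipMembersSketch.roundTrip with several record strings returns their concatenation, which shows that separately compressed records can be read back as a single stream.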


Field Summary
static org.apache.commons.logging.Log LOG
           
static String URL_VERSION
           
 
Constructor Summary
ArcSegmentCreator()
           
ArcSegmentCreator(Configuration conf)
          Constructor that sets the job configuration.
 
Method Summary
 void close()
           
 void configure(JobConf job)
          Configures the job.
 void createSegments(Path arcFiles, Path segmentsOutDir)
          Creates and runs the job that converts arc files into segments.
static String generateSegmentName()
          Generates a random name for the segments.
static void main(String[] args)
           
 void map(Text key, BytesWritable bytes, OutputCollector<Text,NutchWritable> output, Reporter reporter)
          Runs the Map job to translate an arc record into output for Nutch segments.
 int run(String[] args)
           
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Field Detail

LOG

public static final org.apache.commons.logging.Log LOG

URL_VERSION

public static final String URL_VERSION
See Also:
Constant Field Values
Constructor Detail

ArcSegmentCreator

public ArcSegmentCreator()

ArcSegmentCreator

public ArcSegmentCreator(Configuration conf)

Constructor that sets the job configuration.

Parameters:
conf - The job configuration.
Method Detail

generateSegmentName

public static String generateSegmentName()
Generates a random name for the segments.

Returns:
The generated segment name.
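
The description above calls the name random. Nutch segment directories are conventionally named after a timestamp, so a generator along those lines might look like the following. This is a sketch under that assumption and not the method's actual body; the class SegmentNameSketch, the method segmentNameFor and the yyyyMMddHHmmss pattern are illustrative choices.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper, not part of Nutch.
class SegmentNameSketch {

    /** Formats the given instant as a 14-digit, timestamp-style segment name. */
    static String segmentNameFor(Date when) {
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmss", Locale.ROOT);
        return format.format(when);
    }
}
```

A timestamp at one-second resolution is unique only if names are generated at least a second apart, so a caller producing several segments in quick succession would need some additional guard.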

configure

public void configure(JobConf job)

Configures the job. Sets the URL filters, scoring filters, URL normalizers and other relevant data.

Specified by:
configure in interface JobConfigurable
Parameters:
job - The job configuration.

close

public void close()
Specified by:
close in interface Closeable

map

public void map(Text key,
                BytesWritable bytes,
                OutputCollector<Text,NutchWritable> output,
                Reporter reporter)
         throws IOException

Runs the Map job to translate an arc record into output for Nutch segments.

Specified by:
map in interface Mapper<Text,BytesWritable,Text,NutchWritable>
Parameters:
key - The arc record header.
bytes - The arc record raw content bytes.
output - The output collector.
reporter - The progress reporter.
Throws:
IOException
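
To make the key parameter more concrete, the sketch below splits an arc record header into its fields. It assumes the version 1 ARC layout of five whitespace-separated fields (URL, IP address, archive date, content type, length); the class ArcHeaderSketch is hypothetical and is not how ArcSegmentCreator itself is necessarily written.

```java
// Hypothetical helper, not part of Nutch.
class ArcHeaderSketch {
    final String url;
    final String ipAddress;
    final String archiveDate;
    final String contentType;
    final long length;

    private ArcHeaderSketch(String url, String ipAddress, String archiveDate,
                            String contentType, long length) {
        this.url = url;
        this.ipAddress = ipAddress;
        this.archiveDate = archiveDate;
        this.contentType = contentType;
        this.length = length;
    }

    /** Parses a header line, assuming five whitespace-separated fields. */
    static ArcHeaderSketch parse(String headerLine) {
        String[] fields = headerLine.trim().split("\\s+");
        if (fields.length != 5) {
            throw new IllegalArgumentException(
                "Expected 5 header fields but found " + fields.length);
        }
        return new ArcHeaderSketch(fields[0], fields[1], fields[2], fields[3],
                                   Long.parseLong(fields[4]));
    }
}
```

For example, ArcHeaderSketch.parse("http://www.example.com/ 192.0.2.1 20110101000000 text/html 1234") yields an object whose url field is http://www.example.com/ and whose contentType field is text/html.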

createSegments

public void createSegments(Path arcFiles,
                           Path segmentsOutDir)
                    throws IOException

Creates and runs the job that converts arc files into segments.

Parameters:
arcFiles - The path to the directory holding the arc files.
segmentsOutDir - The output directory for writing the segments.
Throws:
IOException - If an IO error occurs while running the job.

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception

run

public int run(String[] args)
        throws Exception
Specified by:
run in interface Tool
Throws:
Exception


Copyright © 2011 The Apache Software Foundation