public class ArcSegmentCreator extends org.apache.hadoop.conf.Configured implements org.apache.hadoop.util.Tool, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.Text,NutchWritable>
The ArcSegmentCreator is a replacement for the fetcher that takes arc files as input and produces a Nutch segment as output.
Arc files are tars of compressed gzips, produced by both the Internet Archive project and the Grub distributed crawler project.
Modifier and Type | Field and Description |
---|---|
static org.slf4j.Logger | LOG |
static String | URL_VERSION |
Constructor and Description |
---|
ArcSegmentCreator() |
ArcSegmentCreator(org.apache.hadoop.conf.Configuration conf) Constructor that sets the job configuration. |
Modifier and Type | Method and Description |
---|---|
void | close() |
void | configure(org.apache.hadoop.mapred.JobConf job) Configures the job. |
void | createSegments(org.apache.hadoop.fs.Path arcFiles, org.apache.hadoop.fs.Path segmentsOutDir) Creates the arc files to segments job. |
static String | generateSegmentName() Generates a random name for the segments. |
static void | main(String[] args) |
void | map(org.apache.hadoop.io.Text key, org.apache.hadoop.io.BytesWritable bytes, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,NutchWritable> output, org.apache.hadoop.mapred.Reporter reporter) Runs the Map job to translate an arc record into output for Nutch segments. |
int | run(String[] args) |
public static final org.slf4j.Logger LOG

public static final String URL_VERSION

public ArcSegmentCreator()

public ArcSegmentCreator(org.apache.hadoop.conf.Configuration conf)
Constructor that sets the job configuration.
Parameters:
conf - The job configuration.

public static String generateSegmentName()
Generates a random name for the segments.
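As a rough illustration of what a generated segment name could look like, the sketch below builds a timestamp-style name of the kind commonly seen on Nutch segment directories. It is an assumption, not this class's implementation: the summary only says the name is random, and the class `SegmentNameSketch`, the method `segmentNameFor` and the `yyyyMMddHHmmss` pattern are hypothetical names chosen for the example.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Nutch: derives a segment directory name
// from a point in time (assumed pattern: yyyyMMddHHmmss).
class SegmentNameSketch {

    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Formats the given time as a 14-digit name such as 20140102030405.
    static String segmentNameFor(LocalDateTime time) {
        return FORMAT.format(time);
    }

    public static void main(String[] args) {
        // Prints a name based on the current local time.
        System.out.println(segmentNameFor(LocalDateTime.now()));
    }
}
```

Names built this way are only unique to the second, so a caller creating several segments in quick succession would need an extra guard such as a short wait or a suffix.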
public void configure(org.apache.hadoop.mapred.JobConf job)
Configures the job. Sets the URL filters, scoring filters, URL normalizers and other relevant data.
Specified by:
configure in interface org.apache.hadoop.mapred.JobConfigurable
Parameters:
job - The job configuration.

public void close()
Specified by:
close in interface Closeable
Specified by:
close in interface AutoCloseable

public void map(org.apache.hadoop.io.Text key, org.apache.hadoop.io.BytesWritable bytes, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,NutchWritable> output, org.apache.hadoop.mapred.Reporter reporter) throws IOException
Runs the Map job to translate an arc record into output for Nutch segments.
Specified by:
map in interface org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.Text,NutchWritable>
Parameters:
key - The arc record header.
bytes - The arc record raw content bytes.
output - The output collector.
reporter - The progress reporter.
Throws:
IOException
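
To make the `key` parameter more concrete, here is a minimal sketch of splitting an arc record header into its fields. It assumes the ARC version 1 URL-record layout (URL, IP address, archive date, content type and length, separated by spaces) and does not reproduce what `map` actually does. The class `ArcHeaderSketch` and its members are hypothetical names used only for illustration.

```java
// Hypothetical value class, not part of Nutch: holds the fields of one
// ARC v1 URL-record header line (assumed layout: url ip date type length).
class ArcHeaderSketch {
    final String url;
    final String ip;
    final String archiveDate;
    final String contentType;
    final long length;

    ArcHeaderSketch(String url, String ip, String archiveDate,
                    String contentType, long length) {
        this.url = url;
        this.ip = ip;
        this.archiveDate = archiveDate;
        this.contentType = contentType;
        this.length = length;
    }

    // Splits a header line on whitespace and maps the first five fields.
    static ArcHeaderSketch parse(String headerLine) {
        String[] fields = headerLine.trim().split("\\s+");
        if (fields.length < 5) {
            throw new IllegalArgumentException(
                    "Expected at least 5 header fields, got " + fields.length);
        }
        return new ArcHeaderSketch(fields[0], fields[1], fields[2],
                fields[3], Long.parseLong(fields[4]));
    }

    // The first record of an arc file describes the file itself and uses a
    // filedesc:// URL; a converter would typically skip it.
    boolean isFileDescription() {
        return url.startsWith("filedesc://");
    }

    public static void main(String[] args) {
        ArcHeaderSketch header = parse(
                "http://www.example.com/ 192.0.2.1 20140102030405 text/html 1234");
        System.out.println(header.url + " (" + header.contentType + ")");
    }
}
```

Splitting on whitespace keeps the sketch short, but it relies on the URL field containing no spaces, which a stricter parser would have to verify.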

public void createSegments(org.apache.hadoop.fs.Path arcFiles, org.apache.hadoop.fs.Path segmentsOutDir) throws IOException
Creates the arc files to segments job.
Parameters:
arcFiles - The path to the directory holding the arc files.
segmentsOutDir - The output directory for writing the segments.
Throws:
IOException - If an IO error occurs while running the job.

Copyright © 2014 The Apache Software Foundation