public class SegmentMerger extends org.apache.hadoop.conf.Configured implements org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,MetaWrapper,org.apache.hadoop.io.Text,MetaWrapper>, org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.Text,MetaWrapper,org.apache.hadoop.io.Text,MetaWrapper>
Optionally, you can apply the current URLFilters to remove prohibited URLs.
Also, it's possible to slice the resulting segment into chunks of fixed size.
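As an illustration of what slicing into fixed-size chunks means, here is a minimal sketch. It is not the tool's implementation: the class `SliceSketch`, the method `slice`, and the rule that a non-positive size disables slicing are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SliceSketch {
    /**
     * Splits records into consecutive chunks of at most sliceSize entries.
     * Assumption for this sketch: sliceSize <= 0 means "do not slice".
     */
    static List<List<String>> slice(List<String> records, long sliceSize) {
        List<List<String>> slices = new ArrayList<>();
        if (sliceSize <= 0) {
            slices.add(new ArrayList<>(records));
            return slices;
        }
        long count = 0;
        for (String record : records) {
            // The running record count decides which slice a record lands in.
            int index = (int) (count / sliceSize);
            if (index == slices.size()) {
                slices.add(new ArrayList<>());
            }
            slices.get(index).add(record);
            count++;
        }
        return slices;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(slice(urls, 2)); // prints [[a, b], [c, d], [e]]
    }
}
```

Every slice except possibly the last holds exactly `sliceSize` records.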
It doesn't make sense to merge data from segments that are at different stages of processing (e.g. one unfetched segment, one fetched but not parsed, and one fetched and parsed). Therefore, prior to merging, the tool determines the lowest common set of input data, and only this data is merged. This may have unintended consequences: e.g. if the majority of input segments are fetched and parsed, but one of them is unfetched, the tool falls back to merging only fetchlists and skips all other data from all segments.
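The "lowest common set" rule can be pictured as a set intersection over the parts each segment contains. The sketch below is an assumption-laden illustration, not the tool's code: the class `CommonPartsSketch`, the method `commonParts`, and the segment names are hypothetical, and the part names are used only as sample labels.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class CommonPartsSketch {
    /** Returns the parts that are present in every segment. */
    static Set<String> commonParts(Map<String, Set<String>> partsBySegment) {
        Set<String> common = null;
        for (Set<String> parts : partsBySegment.values()) {
            if (common == null) {
                common = new TreeSet<>(parts); // start from the first segment
            } else {
                common.retainAll(parts);       // keep only parts seen in all segments
            }
        }
        return common == null ? new TreeSet<>() : common;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> segments = new LinkedHashMap<>();
        // Two fetched and parsed segments (hypothetical names).
        segments.put("20140101000000", new TreeSet<>(Arrays.asList(
                "crawl_generate", "crawl_fetch", "content",
                "crawl_parse", "parse_data", "parse_text")));
        segments.put("20140102000000", new TreeSet<>(Arrays.asList(
                "crawl_generate", "crawl_fetch", "content",
                "crawl_parse", "parse_data", "parse_text")));
        // One unfetched segment: it only has a fetchlist.
        segments.put("20140103000000", new TreeSet<>(Arrays.asList("crawl_generate")));

        System.out.println(commonParts(segments)); // prints [crawl_generate]
    }
}
```

A single unfetched segment reduces the common set to the fetchlist alone, which is the unintended consequence described above.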
Merging segments that contain just fetchlists (i.e. prior to fetching) is not recommended, because this tool (unlike the Generator) doesn't ensure that fetchlist parts for each map task are disjoint.
For some types of data (especially ParseText) it's not possible to determine which version is really older. Therefore the tool always uses segment names as timestamps, for all types of input data. Segment names are compared in forward lexicographic order (0-9a-zA-Z), and data from segments with "higher" names will prevail. It is therefore essential that segment names increase lexicographically as their creation time increases.
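A minimal sketch of "the higher segment name wins" follows. It is illustrative only: the class `LatestBySegmentNameSketch`, the nested `Tagged` record, and the method `latest` are hypothetical, and `String.compareTo` is used as a stand-in for the tool's comparison. The names in the example are digit-only, so differences in how letters are ordered do not come into play.

```java
import java.util.Arrays;
import java.util.List;

class LatestBySegmentNameSketch {
    /** Hypothetical record: a value tagged with the segment it came from. */
    static final class Tagged {
        final String segmentName;
        final String value;

        Tagged(String segmentName, String value) {
            this.segmentName = segmentName;
            this.value = value;
        }
    }

    /** Returns the version whose segment name is lexicographically highest. */
    static Tagged latest(List<Tagged> versions) {
        Tagged best = null;
        for (Tagged t : versions) {
            if (best == null || t.segmentName.compareTo(best.segmentName) > 0) {
                best = t;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Tagged> versions = Arrays.asList(
                new Tagged("20140102000000", "newer text"),
                new Tagged("20140101000000", "older text"));
        System.out.println(latest(versions).value); // prints newer text

        // Pitfall: names that are not zero-padded compare as text, not numbers.
        List<Tagged> badNames = Arrays.asList(
                new Tagged("9", "from segment 9"),
                new Tagged("10", "from segment 10"));
        System.out.println(latest(badNames).value); // prints from segment 9
    }
}
```

The second case shows why fixed-width, timestamp-style names matter: as text, "9" sorts after "10", so data from the older segment would prevail.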
Modifier and Type | Class and Description |
---|---|
static class | SegmentMerger.ObjectInputFormat: Wraps inputs in a MetaWrapper, to permit merging different types in reduce and use additional metadata. |
static class | SegmentMerger.SegmentOutputFormat |
Constructor and Description |
---|
SegmentMerger() |
SegmentMerger(org.apache.hadoop.conf.Configuration conf) |
Modifier and Type | Method and Description |
---|---|
void | close() |
void | configure(org.apache.hadoop.mapred.JobConf conf) |
static void | main(String[] args) |
void | map(org.apache.hadoop.io.Text key, MetaWrapper value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,MetaWrapper> output, org.apache.hadoop.mapred.Reporter reporter) |
void | merge(org.apache.hadoop.fs.Path out, org.apache.hadoop.fs.Path[] segs, boolean filter, boolean normalize, long slice) |
void | reduce(org.apache.hadoop.io.Text key, Iterator<MetaWrapper> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,MetaWrapper> output, org.apache.hadoop.mapred.Reporter reporter): NOTE: in selecting the latest version we rely exclusively on the segment name (not all segment data contain time information). |
void | setConf(org.apache.hadoop.conf.Configuration conf) |
public SegmentMerger()
public SegmentMerger(org.apache.hadoop.conf.Configuration conf)
public void setConf(org.apache.hadoop.conf.Configuration conf)
Specified by: setConf in interface org.apache.hadoop.conf.Configurable
Overrides: setConf in class org.apache.hadoop.conf.Configured
public void close() throws IOException
Specified by: close in interface Closeable
Specified by: close in interface AutoCloseable
Throws: IOException
public void configure(org.apache.hadoop.mapred.JobConf conf)
Specified by: configure in interface org.apache.hadoop.mapred.JobConfigurable
public void map(org.apache.hadoop.io.Text key, MetaWrapper value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,MetaWrapper> output, org.apache.hadoop.mapred.Reporter reporter) throws IOException
Specified by: map in interface org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,MetaWrapper,org.apache.hadoop.io.Text,MetaWrapper>
Throws: IOException
public void reduce(org.apache.hadoop.io.Text key, Iterator<MetaWrapper> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,MetaWrapper> output, org.apache.hadoop.mapred.Reporter reporter) throws IOException
Specified by: reduce in interface org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.Text,MetaWrapper,org.apache.hadoop.io.Text,MetaWrapper>
Throws: IOException
public void merge(org.apache.hadoop.fs.Path out, org.apache.hadoop.fs.Path[] segs, boolean filter, boolean normalize, long slice) throws Exception
Throws: Exception
Copyright © 2014 The Apache Software Foundation