java.lang.Object
  org.apache.hadoop.conf.Configured
    org.apache.nutch.segment.SegmentMerger
public class SegmentMerger
This tool takes several segments and merges their data together. Only the latest version of the data is retained.
Optionally, you can apply the current URLFilters to remove prohibited URLs.
Also, it's possible to slice the resulting segment into chunks of fixed size.
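As an illustration of slicing, here is a minimal sketch that cuts a list of records into consecutive chunks of at most a given size. The class `SliceSketch` and its `slice` method are hypothetical and are not part of Nutch. How the tool assigns records to slices internally, and how it treats a non-positive slice size, are not stated here, so both are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not part of Nutch. */
final class SliceSketch {

    /** Splits records into consecutive chunks of at most sliceSize entries. */
    static <T> List<List<T>> slice(List<T> records, int sliceSize) {
        List<List<T>> slices = new ArrayList<>();
        if (sliceSize <= 0) {
            // Assumption: a non-positive size means "do not slice".
            slices.add(new ArrayList<>(records));
            return slices;
        }
        for (int start = 0; start < records.size(); start += sliceSize) {
            int end = Math.min(start + sliceSize, records.size());
            // Copy the sub-list so each chunk is independent of the input.
            slices.add(new ArrayList<>(records.subList(start, end)));
        }
        return slices;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://a/", "http://b/", "http://c/", "http://d/", "http://e/");
        // Five records with a slice size of two give three chunks;
        // the last chunk holds the single remaining record.
        System.out.println(slice(urls, 2));
    }
}
```

In this sketch the last chunk may be smaller than the requested size when the record count is not a multiple of it.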
It doesn't make sense to merge data from segments that are at different stages of processing (e.g. one unfetched segment, one fetched but not parsed, and one fetched and parsed). Therefore, prior to merging, the tool determines the lowest common set of input data, and only this data is merged. This may have unintended consequences: e.g. if the majority of input segments are fetched and parsed, but one of them is unfetched, the tool falls back to merging only fetchlists and skips all other data from all segments.
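The "lowest common set" rule can be pictured as a set intersection over the data directories present in each segment. The sketch below is an assumption about the idea, not the tool's actual detection code. The class `CommonPartsSketch` and its method are hypothetical. The directory names follow the conventional Nutch segment layout (`crawl_generate`, `crawl_fetch`, `crawl_parse`, `content`, `parse_data`, `parse_text`).

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; not part of Nutch. */
final class CommonPartsSketch {

    /** Returns the data directories that every segment contains. */
    static Set<String> commonParts(List<Set<String>> partsPerSegment) {
        if (partsPerSegment.isEmpty()) {
            return new LinkedHashSet<>();
        }
        // Start from the first segment and keep only what all others share.
        Set<String> common = new LinkedHashSet<>(partsPerSegment.get(0));
        for (Set<String> parts : partsPerSegment) {
            common.retainAll(parts);
        }
        return common;
    }

    public static void main(String[] args) {
        Set<String> parsed = new LinkedHashSet<>(Arrays.asList(
                "crawl_generate", "crawl_fetch", "crawl_parse",
                "content", "parse_data", "parse_text"));
        Set<String> unfetched = new LinkedHashSet<>(Arrays.asList("crawl_generate"));
        // Two fully processed segments plus one unfetched segment:
        // only the fetchlist directory is common to all three.
        System.out.println(commonParts(Arrays.asList(parsed, parsed, unfetched)));
    }
}
```

With one unfetched segment among the inputs, the intersection shrinks to the fetchlist alone, which mirrors the fallback described above.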
Merging segments that contain just fetchlists (i.e. prior to fetching)
is not recommended, because this tool (unlike the Generator)
doesn't ensure that fetchlist parts for each map task are disjoint.
For some types of data (especially ParseText) it's not possible to determine which version is really older. Therefore the tool always uses segment names as timestamps, for all types of input data. Segment names are compared in forward lexicographic order (0-9a-zA-Z), and data from segments with "higher" names will prevail. It is therefore essential that segment names increase lexicographically as their creation time increases.
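To show why naming matters, the following sketch picks the "winning" segment name from a list using `String.compareTo`. The class `SegmentNameOrderSketch` is hypothetical, and using `String.compareTo` as a stand-in for the tool's comparison is an assumption. The example names are made up. For digit-only names of equal length the result agrees with the ordering described above.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not part of Nutch. */
final class SegmentNameOrderSketch {

    /** Returns the lexicographically greatest name, or null for an empty list. */
    static String winner(List<String> segmentNames) {
        String best = null;
        for (String name : segmentNames) {
            // Assumption: plain String comparison stands in for the tool's ordering.
            if (best == null || name.compareTo(best) > 0) {
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Fixed-width timestamp names sort in creation order.
        System.out.println(winner(Arrays.asList("20090101235959", "20090102000000")));
        // Unpadded counters do not: "segment_9" outranks "segment_10".
        System.out.println(winner(Arrays.asList("segment_9", "segment_10")));
    }
}
```

The second call illustrates the pitfall: with unpadded numbers the older segment can carry the "higher" name, so its data would prevail. Fixed-width, zero-padded names avoid this.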
Nested Class Summary | |
---|---|
static class | SegmentMerger.ObjectInputFormat Wraps inputs in a MetaWrapper, to permit merging different types in reduce and use additional metadata. |
static class | SegmentMerger.SegmentOutputFormat |
Constructor Summary | |
---|---|
SegmentMerger() | |
SegmentMerger(Configuration conf) | |
Method Summary | |
---|---|
void | close() |
void | configure(JobConf conf) |
static void | main(String[] args) |
void | map(Text key, MetaWrapper value, OutputCollector<Text,MetaWrapper> output, Reporter reporter) |
void | merge(Path out, Path[] segs, boolean filter, boolean normalize, long slice) |
void | reduce(Text key, Iterator<MetaWrapper> values, OutputCollector<Text,MetaWrapper> output, Reporter reporter) NOTE: in selecting the latest version we rely exclusively on the segment name (not all segment data contain time information). |
void | setConf(Configuration conf) |
Methods inherited from class org.apache.hadoop.conf.Configured |
---|
getConf |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public SegmentMerger()
public SegmentMerger(Configuration conf)
Method Detail |
---|
public void setConf(Configuration conf)
Specified by: setConf in interface Configurable
Overrides: setConf in class Configured
public void close() throws IOException
Specified by: close in interface Closeable
Throws: IOException
public void configure(JobConf conf)
Specified by: configure in interface JobConfigurable
public void map(Text key, MetaWrapper value, OutputCollector<Text,MetaWrapper> output, Reporter reporter) throws IOException
Specified by: map in interface Mapper<Text,MetaWrapper,Text,MetaWrapper>
Throws: IOException
public void reduce(Text key, Iterator<MetaWrapper> values, OutputCollector<Text,MetaWrapper> output, Reporter reporter) throws IOException
Specified by: reduce in interface Reducer<Text,MetaWrapper,Text,MetaWrapper>
Throws: IOException
public void merge(Path out, Path[] segs, boolean filter, boolean normalize, long slice) throws Exception
Throws: Exception
public static void main(String[] args) throws Exception
Parameters: args -
Throws: Exception