SegmentMerger (apache-nutch 1.8 API)

java.lang.Object
- org.apache.hadoop.conf.Configured
- - org.apache.nutch.segment.SegmentMerger

All Implemented Interfaces:

Closeable, AutoCloseable, org.apache.hadoop.conf.Configurable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,MetaWrapper,org.apache.hadoop.io.Text,MetaWrapper>, org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.Text,MetaWrapper,org.apache.hadoop.io.Text,MetaWrapper>
```
public class SegmentMerger
extends org.apache.hadoop.conf.Configured
implements org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,MetaWrapper,org.apache.hadoop.io.Text,MetaWrapper>, org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.Text,MetaWrapper,org.apache.hadoop.io.Text,MetaWrapper>
```
This tool takes several segments and merges their data together. Only the latest version of each piece of data is retained.
Optionally, you can apply the current URLFilters to remove prohibited URLs.

Also, it's possible to slice the resulting segment into chunks of fixed size.
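The idea of fixed-size slicing can be illustrated with a minimal sketch. It is not the tool's implementation: the class `SliceSketch`, the method `slice`, and the use of plain URL strings are hypothetical and only show what "chunks of fixed size" means, with the last chunk holding whatever remains.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; SegmentMerger itself works on Hadoop records.
public class SliceSketch {

    /** Splits the URLs into consecutive chunks of at most sliceSize entries. */
    static List<List<String>> slice(List<String> urls, int sliceSize) {
        if (sliceSize <= 0) {
            throw new IllegalArgumentException("sliceSize must be positive");
        }
        List<List<String>> slices = new ArrayList<List<String>>();
        for (int start = 0; start < urls.size(); start += sliceSize) {
            int end = Math.min(start + sliceSize, urls.size());
            // Copy the view so each slice is independent of the input list.
            slices.add(new ArrayList<String>(urls.subList(start, end)));
        }
        return slices;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(slice(urls, 2)); // prints [[a, b], [c, d], [e]]
    }
}
```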

Important Notes

Which parts are merged?

It doesn't make sense to merge data from segments that are at different stages of processing (e.g. one unfetched segment, one fetched but not parsed, and one fetched and parsed). Therefore, prior to merging, the tool determines the lowest common set of input data, and only this data is merged. This may have unintended consequences: e.g. if the majority of input segments are fetched and parsed but one of them is unfetched, the tool falls back to merging only fetchlists and skips all other data from all segments.
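The "lowest common set" behaves like a set intersection over the parts present in each input segment. The sketch below illustrates that idea only; it is not the tool's code. The class `CommonPartsSketch` and the method `commonParts` are hypothetical, and the part names are the usual segment subdirectory names, given here as plain strings.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of "lowest common set of input data".
public class CommonPartsSketch {

    /** Returns the parts that are present in every segment. */
    static Set<String> commonParts(Map<String, Set<String>> partsBySegment) {
        Set<String> common = null;
        for (Set<String> parts : partsBySegment.values()) {
            if (common == null) {
                common = new TreeSet<String>(parts); // start from the first segment
            } else {
                common.retainAll(parts);             // keep only shared parts
            }
        }
        return common == null ? new TreeSet<String>() : common;
    }

    public static void main(String[] args) {
        Set<String> full = new TreeSet<String>(Arrays.asList(
                "crawl_generate", "crawl_fetch", "content",
                "crawl_parse", "parse_data", "parse_text"));
        Set<String> unfetched = new TreeSet<String>(Arrays.asList("crawl_generate"));

        Map<String, Set<String>> segments = new LinkedHashMap<String, Set<String>>();
        segments.put("20140101000000", full);
        segments.put("20140102000000", full);
        segments.put("20140103000000", unfetched);

        // One unfetched segment reduces the merge to fetchlists only.
        System.out.println(commonParts(segments)); // prints [crawl_generate]
    }
}
```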

Merging fetchlists

Merging segments that contain just fetchlists (i.e. prior to fetching) is not recommended, because this tool (unlike the Generator) doesn't ensure that fetchlist parts for each map task are disjoint.

Duplicate content
Merging segments removes older content whenever possible (see below). However, this is NOT the same as de-duplication, which in addition removes identical content found at different URLs. In other words, running DeleteDuplicates is still necessary.

For some types of data (especially ParseText) it's not possible to determine which version is really older. Therefore the tool always uses segment names as timestamps, for all types of input data. Segment names are compared in forward lexicographic order (0-9a-zA-Z), and data from segments with "higher" names will prevail. It follows then that it is extremely important that segments be named in an increasing lexicographic order as their creation time increases.
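The "higher name wins" rule can be sketched as follows. This is an illustration under stated assumptions, not the tool's reduce logic: the class `LatestBySegmentName` and the method `pickLatest` are hypothetical, and the comparison uses plain `String.compareTo`. Note that `compareTo` orders by character code, so uppercase letters sort before lowercase ones; the example uses timestamp-style, digit-only names, for which the ordering is unambiguous.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: segment names act as timestamps.
public class LatestBySegmentName {

    /** Returns the value that came from the segment with the highest name. */
    static String pickLatest(Map<String, String> valueBySegment) {
        String bestName = null;
        String bestValue = null;
        for (Map.Entry<String, String> e : valueBySegment.entrySet()) {
            if (bestName == null || e.getKey().compareTo(bestName) > 0) {
                bestName = e.getKey();
                bestValue = e.getValue();
            }
        }
        return bestValue; // null if the map is empty
    }

    public static void main(String[] args) {
        // Two versions of the same URL's data, keyed by segment name.
        Map<String, String> versions = new LinkedHashMap<String, String>();
        versions.put("20140215093000", "newer text");
        versions.put("20140101120000", "older text");
        System.out.println(pickLatest(versions)); // prints newer text
    }
}
```

A segment given a name that sorts lower than an earlier segment's name would lose to the earlier data under this rule, which is why the naming order matters.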

Merging and indexes
The merged segment gets a different name. Since Indexer embeds segment names in indexes, any indexes originally created for the input segments will NOT work with the merged segment. Newly created merged segments need to be indexed afresh. This tool doesn't use existing indexes in any way, so if you plan to merge segments you don't have to index them prior to merging.

Author:

Andrzej Bialecki

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`SegmentMerger.ObjectInputFormat` Wraps inputs in a `MetaWrapper`, to permit merging different types in reduce and use additional metadata.
`static class`	`SegmentMerger.SegmentOutputFormat`

Constructor Summary

Constructors
Constructor and Description
`SegmentMerger()`
`SegmentMerger(org.apache.hadoop.conf.Configuration conf)`

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`close()`
`void`	`configure(org.apache.hadoop.mapred.JobConf conf)`
`static void`	`main(String[] args)`
`void`	`map(org.apache.hadoop.io.Text key, MetaWrapper value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,MetaWrapper> output, org.apache.hadoop.mapred.Reporter reporter)`
`void`	`merge(org.apache.hadoop.fs.Path out, org.apache.hadoop.fs.Path[] segs, boolean filter, boolean normalize, long slice)`
`void`	`reduce(org.apache.hadoop.io.Text key, Iterator<MetaWrapper> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,MetaWrapper> output, org.apache.hadoop.mapred.Reporter reporter)` NOTE: in selecting the latest version we rely exclusively on the segment name (not all segment data contain time information).
`void`	`setConf(org.apache.hadoop.conf.Configuration conf)`

Methods inherited from class org.apache.hadoop.conf.Configured
getConf

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail

SegmentMerger
```
public SegmentMerger()
```

SegmentMerger

public SegmentMerger(org.apache.hadoop.conf.Configuration conf)

Method Detail

setConf
```
public void setConf(org.apache.hadoop.conf.Configuration conf)
```
Specified by:

setConf in interface org.apache.hadoop.conf.Configurable

Overrides:

setConf in class org.apache.hadoop.conf.Configured

close
```
public void close()
           throws IOException
```
Specified by:

close in interface Closeable

Specified by:

close in interface AutoCloseable

Throws:

IOException

configure
```
public void configure(org.apache.hadoop.mapred.JobConf conf)
```
Specified by:

configure in interface org.apache.hadoop.mapred.JobConfigurable

map

public void map(org.apache.hadoop.io.Text key,
       MetaWrapper value,
       org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,MetaWrapper> output,
       org.apache.hadoop.mapred.Reporter reporter)
         throws IOException

Specified by:

map in interface org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,MetaWrapper,org.apache.hadoop.io.Text,MetaWrapper>

Throws:

IOException

reduce
```
public void reduce(org.apache.hadoop.io.Text key,
          Iterator<MetaWrapper> values,
          org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,MetaWrapper> output,
          org.apache.hadoop.mapred.Reporter reporter)
            throws IOException
```
NOTE: in selecting the latest version we rely exclusively on the segment name (not all segment data contain time information). Therefore it is extremely important that segments be named in an increasing lexicographic order as their creation time increases.

Specified by:

reduce in interface org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.Text,MetaWrapper,org.apache.hadoop.io.Text,MetaWrapper>

Throws:

IOException

merge

public void merge(org.apache.hadoop.fs.Path out,
         org.apache.hadoop.fs.Path[] segs,
         boolean filter,
         boolean normalize,
         long slice)
           throws Exception

Throws:

Exception

main

public static void main(String[] args)
                 throws Exception

Parameters:

args -

Throws:

Exception
