Map sizing

By default, DistCp makes an attempt to size each map comparably so that each copies roughly the same number of bytes. Note that files are the finest level of granularity, so increasing the number of simultaneous copiers (i.e. maps) may not always increase the number of simultaneous copies nor the overall throughput.

The new DistCp also provides a strategy to "dynamically" size maps, allowing faster data-nodes to copy more bytes than slower nodes. Using -strategy dynamic (explained in the Architecture), rather than to assign a fixed set of source-files to each map-task, files are instead split into several sets. The number of sets exceeds the number of maps, usually by a factor of 2-3. Each map picks up and copies all files listed in a chunk. When a chunk is exhausted, a new chunk is acquired and processed, until no more chunks remain.

By not assigning a source-path to a fixed map, faster map-tasks (i.e. data-nodes) are able to consume more chunks, and thus copy more data, than slower nodes. While this distribution isn't uniform, it is fair with regard to each mapper's capacity.

The dynamic-strategy is implemented by the DynamicInputFormat. It provides superior performance under most conditions.

Tuning the number of maps to the size of the source and destination clusters, the size of the copy, and the available bandwidth is recommended for long-running and regularly run jobs.

Copying between versions of HDFS

For copying between two different versions of Hadoop, one will usually use HftpFileSystem. This is a read-only FileSystem, so DistCp must be run on the destination cluster (more specifically, on TaskTrackers that can write to the destination cluster). Each source is specified as hftp://<dfs.http.address>/<path> (the default dfs.http.address is <namenode>:50070).

Map/Reduce and other side-effects

As has been mentioned in the preceding, should a map fail to copy one of its inputs, there will be several side-effects.

  • Unless -overwrite is specified, files successfully copied by a previous map on a re-execution will be marked as "skipped".
  • If a map fails mapred.map.max.attempts times, the remaining map tasks will be killed (unless -i is set).
  • If mapred.speculative.execution is set set final and true, the result of the copy is undefined.

SSL Configurations for HSFTP sources:

To use an HSFTP source (i.e. using the hsftp protocol), a Map-Red SSL configuration file needs to be specified (via the -mapredSslConf option). This must specify 3 parameters:

  • ssl.client.truststore.location: The local-filesystem location of the trust-store file, containing the certificate for the namenode.
  • ssl.client.truststore.type: (Optional) The format of the trust-store file.
  • ssl.client.truststore.password: (Optional) Password for the trust-store file.

The following is an example of the contents of the contents of a Map-Red SSL Configuration file:


<configuration>


<property>

<name>ssl.client.truststore.location</name>

<value>/work/keystore.jks</value>

<description>Truststore to be used by clients like distcp. Must be specified. </description>


</property>

<property>

<name>ssl.client.truststore.password</name>

<value>changeme</value>

<description>Optional. Default value is "". </description>

</property>


<property>

<name>ssl.client.truststore.type</name>

<value>jks</value>

<description>Optional. Default value is "jks". </description>

</property>


</configuration>


The SSL configuration file must be in the class-path of the DistCp program.