Basic Usage

The most common invocation of DistCp is an inter-cluster copy:

bash$ hadoop jar hadoop-distcp.jar hdfs://nn1:8020/foo/bar \
hdfs://nn2:8020/bar/foo

This will expand the namespace under /foo/bar on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each TaskTracker from nn1 to nn2.

One can also specify multiple source directories on the command line:

bash$ hadoop jar hadoop-distcp.jar hdfs://nn1:8020/foo/a \
hdfs://nn1:8020/foo/b \
hdfs://nn2:8020/bar/foo

Or, equivalently, from a file using the -f option:

bash$ hadoop jar hadoop-distcp.jar -f hdfs://nn1:8020/srclist \
hdfs://nn2:8020/bar/foo

Where srclist contains:

hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b

When copying from multiple sources, DistCp will abort the copy with an error message if two sources collide, but collisions at the destination are resolved per the options specified. By default, files already existing at the destination are skipped (i.e. not replaced by the source file). A count of skipped files is reported at the end of each job, but it may be inaccurate if a copier failed for some subset of its files and then succeeded on a later attempt.

It is important that each TaskTracker can reach and communicate with both the source and destination file systems. For HDFS, both the source and destination must be running the same version of the protocol or use a backwards-compatible protocol (see Copying Between Versions).

After a copy, it is recommended that one generate and cross-check a listing of the source and destination to verify that the copy was truly successful. Since DistCp employs both Map/Reduce and the FileSystem API, issues in or between any of the three (DistCp, Map/Reduce, and the FileSystem API) could adversely and silently affect the copy. Some have had success running with -update enabled to perform a second pass, but users should be acquainted with its semantics before attempting this.
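
One possible way to carry out the cross-check described above is to save a recursive listing of each side and compare relative paths and file sizes. The following is only a sketch under stated assumptions: the helper names normalize_listing and compare_listings are hypothetical, the listings are assumed to come from hadoop fs -ls -R (hadoop fs -lsr on older releases) with eight whitespace-separated fields per entry (size in field 5, path in field 8), and paths containing spaces are not handled. It compares sizes only, not file contents.

```shell
# normalize_listing PREFIX FILE
# Prints "relative/path<TAB>size" for every file entry in a saved listing, sorted.
normalize_listing() {
  local prefix="$1" file="$2"
  awk -v p="$prefix" '
    # Skip "Found N items" headers (fewer than 8 fields) and directories (mode starts with d).
    NF >= 8 && $1 !~ /^d/ {
      path = $8
      if (index(path, p) == 1) path = substr(path, length(p) + 1)
      sub(/^\//, "", path)
      print path "\t" $5
    }' "$file" | LC_ALL=C sort
}

# compare_listings SRC_PREFIX SRC_FILE DST_PREFIX DST_FILE
# Prints the differences; returns 0 only if both sides list the same files and sizes.
compare_listings() {
  local src_tmp dst_tmp status=0
  src_tmp=$(mktemp) || return 2
  dst_tmp=$(mktemp) || { rm -f "$src_tmp"; return 2; }
  normalize_listing "$1" "$2" > "$src_tmp"
  normalize_listing "$3" "$4" > "$dst_tmp"
  diff "$src_tmp" "$dst_tmp" || status=$?
  rm -f "$src_tmp" "$dst_tmp"
  return "$status"
}

# Example (assumed file names src.txt and dst.txt):
#   hadoop fs -ls -R hdfs://nn1:8020/foo/bar > src.txt
#   hadoop fs -ls -R hdfs://nn2:8020/bar/foo > dst.txt
#   compare_listings hdfs://nn1:8020/foo/bar src.txt hdfs://nn2:8020/bar/foo dst.txt
```

A non-zero return value only indicates that the two listings disagree; it does not say which side is at fault.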

It's also worth noting that if another client is still writing to a source file, the copy will likely fail. Attempting to overwrite a file being written at the destination should also fail on HDFS. If a source file is (re)moved before it is copied, the copy will fail with a FileNotFoundException.

Please refer to the detailed Command Line Reference for information on all the options available in DistCp.

Update and Overwrite

-update is used to copy files from source that don't exist at the target, or have different contents. -overwrite overwrites files that exist at the target, even if they have the same contents as the source.


The -update and -overwrite options warrant special attention, since their handling of source paths differs subtly from the defaults. Consider a copy from /source/first/ and /source/second/ to /target/, where the source paths have the following contents:

hdfs://nn1:8020/source/first/1
hdfs://nn1:8020/source/first/2
hdfs://nn1:8020/source/second/10
hdfs://nn1:8020/source/second/20


When DistCp is invoked without -update or -overwrite, it creates the directories first/ and second/ under /target by default. Thus:

distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target


would yield the following contents in /target:

hdfs://nn2:8020/target/first/1
hdfs://nn2:8020/target/first/2
hdfs://nn2:8020/target/second/10
hdfs://nn2:8020/target/second/20


When either -update or -overwrite is specified, the contents of the source-directories are copied to target, and not the source directories themselves. Thus:

distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target


would yield the following contents in /target:

hdfs://nn2:8020/target/1
hdfs://nn2:8020/target/2
hdfs://nn2:8020/target/10
hdfs://nn2:8020/target/20


By extension, if both source folders contained a file with the same name (say, 0), then both sources would map an entry to /target/0 at the destination. Rather than permit this conflict, DistCp will abort.
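
The path mapping described above, and the collision it can produce, can be sketched as two small shell functions. This is an illustration only, not DistCp's implementation: map_target and find_collisions are hypothetical names, the mode labels "default", "update" and "overwrite" are chosen for this sketch, and only the multi-source directory case from the example is modelled.

```shell
# map_target MODE SRC_DIR SRC_FILE TARGET
# Prints the destination path that SRC_FILE (a file under SRC_DIR) maps to.
map_target() {
  local mode="$1" src_dir="${2%/}" src_file="$3" target="${4%/}"
  local rel="${src_file#"$src_dir"/}"   # path of the file relative to its source directory
  case "$mode" in
    # Default: the source directory itself is recreated under the target.
    default) printf '%s/%s/%s\n' "$target" "${src_dir##*/}" "$rel" ;;
    # -update / -overwrite: only the contents of the source directory are copied.
    update|overwrite) printf '%s/%s\n' "$target" "$rel" ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

# find_collisions
# Reads mapped destination paths on stdin and prints each path that occurs more than once.
find_collisions() {
  LC_ALL=C sort | uniq -d
}
```

For example, piping the mapped paths of /source/first/0 and /source/second/0 in update mode through find_collisions would print hdfs://nn2:8020/target/0, which is the conflict on which DistCp aborts.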


Now, consider the following copy operation:

distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target


With sources/sizes:

hdfs://nn1:8020/source/first/1 32
hdfs://nn1:8020/source/first/2 32
hdfs://nn1:8020/source/second/10 64
hdfs://nn1:8020/source/second/20 32


And destination/sizes:

hdfs://nn2:8020/target/1 32
hdfs://nn2:8020/target/10 32
hdfs://nn2:8020/target/20 64


The copy will result in:

hdfs://nn2:8020/target/1 32
hdfs://nn2:8020/target/2 32
hdfs://nn2:8020/target/10 64
hdfs://nn2:8020/target/20 32


With -update, 1 is skipped because the file-length and contents match. 2 is copied because it doesn't exist at the target. 10 and 20 are overwritten since the contents don't match the source.

If -overwrite is used instead, 1 is overwritten as well.
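
The per-file decision in this example can be summarized as a small function. This is a simplified sketch, not DistCp's actual logic: decide_action is a hypothetical name, the mode labels are chosen for illustration, and file size stands in for the full "length and contents" comparison, so two different files of equal size would be wrongly reported as a skip.

```shell
# decide_action MODE SRC_SIZE [DST_SIZE]
# MODE is "default", "update", or "overwrite"; omit DST_SIZE (or pass "") if the
# file does not exist at the target. Prints one of: copy, skip, overwrite.
decide_action() {
  local mode="$1" src_size="$2" dst_size="${3-}"
  # A file missing at the target is always copied.
  if [ -z "$dst_size" ]; then
    echo copy
    return 0
  fi
  case "$mode" in
    # Without either option, files already at the destination are skipped.
    default) echo skip ;;
    # -update replaces the target only when it differs from the source.
    update)
      if [ "$src_size" -eq "$dst_size" ]; then echo skip; else echo overwrite; fi ;;
    # -overwrite replaces the target unconditionally.
    overwrite) echo overwrite ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}
```

Applied to the sizes listed above in update mode, the function yields skip for 1, copy for 2, and overwrite for 10 and 20; in overwrite mode, 1 is overwritten too.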