Replace x.y.z with the Tez release number that you are using, e.g. 0.5.0. Tez versions 0.8.3 and higher require Apache Hadoop 2.6.0 or higher. Tez versions 0.9.0 and higher require Apache Hadoop 2.7.0 or higher.
Deploy Apache Hadoop version 2.7.0 or higher.
$ hadoop version
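The version requirements above can be checked with a small script. The following is a minimal sketch, not part of the Tez distribution: the function names version_ge and min_hadoop_for_tez are hypothetical, and it assumes a sort that supports the -V (version sort) option, as GNU coreutils does.

```shell
# Succeed when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print the minimum Hadoop version for a given Tez version,
# following the requirements stated in this guide.
min_hadoop_for_tez() {
  if version_ge "$1" "0.9.0"; then
    echo "2.7.0"
  elif version_ge "$1" "0.8.3"; then
    echo "2.6.0"
  else
    echo ""   # this guide states no minimum for older Tez releases
  fi
}
```

For example, `version_ge "2.7.3" "$(min_hadoop_for_tez 0.9.2)"` succeeds, because Tez 0.9.2 needs Hadoop 2.7.0 or higher. The installed Hadoop version itself has to be read from the output of hadoop version.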
Build tez using mvn clean package -DskipTests=true -Dmaven.javadoc.skip=true
Copy the relevant tez tarball into HDFS, and configure tez-site.xml
hadoop fs -mkdir /apps/tez-x.y.z-SNAPSHOT
hadoop fs -copyFromLocal tez-dist/target/tez-x.y.z-SNAPSHOT.tar.gz /apps/tez-x.y.z-SNAPSHOT/
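As a hedged sketch of the tez-site.xml step, the snippet below writes a minimal configuration whose tez.lib.uris points at the tarball uploaded above. The helper name write_tez_site is hypothetical, and the use of ${fs.defaultFS} as the URI prefix assumes the tarball lives on the cluster's default filesystem. Adjust the path to match where the tarball was actually copied.

```shell
# Write a minimal tez-site.xml.
# $1: output file, $2: URI of the Tez tarball on HDFS.
write_tez_site() {
  cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>$2</value>
  </property>
</configuration>
EOF
}
```

A possible invocation, with the URI in single quotes so that the shell leaves ${fs.defaultFS} for Hadoop to resolve:

write_tez_site "$TEZ_CONF_DIR/tez-site.xml" '${fs.defaultFS}/apps/tez-x.y.z-SNAPSHOT/tez-x.y.z-SNAPSHOT.tar.gz'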
Configure the client node to include the tez-libraries in the hadoop classpath
tar -xvzf tez-dist/target/tez-x.y.z-minimal.tar.gz -C $TEZ_JARS
export HADOOP_CLASSPATH=${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*
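The two commands above assume that TEZ_JARS and TEZ_CONF_DIR are already set and that the TEZ_JARS directory exists (tar -C does not create it). A small sketch of a helper that assembles the same classpath string is shown below; the function name tez_classpath and the example paths are hypothetical.

```shell
# Print a classpath made of the Tez conf dir, the Tez jars, and their lib/ jars.
# $1: directory containing tez-site.xml, $2: directory holding the unpacked Tez jars.
tez_classpath() {
  printf '%s:%s/*:%s/lib/*\n' "$1" "$2" "$2"
}
```

One way to use it on the client node, with illustrative paths:

export TEZ_CONF_DIR=/etc/tez/conf
export TEZ_JARS=/opt/tez
mkdir -p "$TEZ_JARS"
export HADOOP_CLASSPATH="$(tez_classpath "$TEZ_CONF_DIR" "$TEZ_JARS")"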
There is a basic example of using an MRR job in the tez-examples.jar. Refer to OrderedWordCount.java in the source code. To run this example:
$HADOOP_PREFIX/bin/hadoop jar tez-examples.jar orderedwordcount <input> <output>
This will use the TEZ DAG ApplicationMaster to run the ordered word count job. This job is similar to the word count example except that it also orders all words based on the frequency of occurrence.
Tez DAGs can be run separately as different applications or serially within a single TEZ session. A variation of orderedwordcount in tez-tests supports the use of sessions and handles multiple input-output pairs. You can use it to run multiple DAGs serially on different inputs/outputs.
$HADOOP_PREFIX/bin/hadoop jar tez-tests.jar testorderedwordcount <input1> <output1> <input2> <output2> <input3> <output3> ...
The above will run multiple DAGs, one for each input-output pair.
To use TEZ sessions, set -DUSE_TEZ_SESSION=true
$HADOOP_PREFIX/bin/hadoop jar tez-tests.jar testorderedwordcount -DUSE_TEZ_SESSION=true <input1> <output1> <input2> <output2>
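Because the inputs and outputs must be given in pairs, it can help to assemble the command line with a small wrapper that checks the argument count first. This is only a sketch: the function name build_wordcount_cmd is hypothetical, it prints the command rather than running it, and it assumes hadoop is on the PATH.

```shell
# Print the testorderedwordcount command line.
# $1: "true" to run all DAGs in one Tez session; remaining args: input/output pairs.
build_wordcount_cmd() {
  use_session="$1"
  shift
  # Require at least one pair and an even number of paths.
  if [ "$#" -lt 2 ] || [ $(( $# % 2 )) -ne 0 ]; then
    echo "inputs and outputs must be given in pairs" >&2
    return 1
  fi
  cmd="hadoop jar tez-tests.jar testorderedwordcount"
  if [ "$use_session" = "true" ]; then
    cmd="$cmd -DUSE_TEZ_SESSION=true"
  fi
  printf '%s %s\n' "$cmd" "$*"
}
```

Printing the command first makes it easy to review the pairs before submitting the job.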
Submit an MR job as you normally would, using something like:
$HADOOP_PREFIX/bin/hadoop jar hadoop-mapreduce-client-jobclient-3.0.0-SNAPSHOT-tests.jar sleep -mt 1 -rt 1 -m 1 -r 1
This will use the TEZ DAG ApplicationMaster to run the MR job. This can be verified by looking at the AM’s logs from the YARN ResourceManager UI. This requires mapred-site.xml to have “mapreduce.framework.name” set to “yarn-tez”.
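The mapred-site.xml setting mentioned above could look like the fragment written by the sketch below. The helper name write_mapred_site is hypothetical, and the generated file contains only this one property, so merge it into an existing mapred-site.xml rather than overwriting one.

```shell
# Write a mapred-site.xml that routes MR jobs through Tez.
# $1: output file.
write_mapred_site() {
  cat > "$1" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn-tez</value>
  </property>
</configuration>
EOF
}
```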
The tez.lib.uris configuration property supports a comma-separated list of values. The types of values supported are:
- Path to a simple file
- Path to a directory
- Path to a compressed archive (tarball, zip, etc.)
For simple files and directories, Tez adds the files and the first-level entries of the directories (recursive traversal of directories is not supported) to the working directory of the Tez runtime, and they are automatically included in the classpath. Archives, i.e. files whose names end with generally known compressed archive suffixes such as ‘tgz’, ‘tar.gz’, ‘zip’, etc., are uncompressed into the container working directory too. However, because the archive structure is not known to the Tez framework, the user is expected to configure tez.lib.uris.classpath to ensure that the nested directory structure of an archive is added to the classpath. These classpath values should be relative, i.e. the entries should start with “./”.
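To make the distinction concrete, the sketch below splits a comma-separated tez.lib.uris value and labels each entry by its suffix. It is a rough illustration only, not Tez's actual logic: the function names are hypothetical, the example paths are made up, and only the three suffixes named above are recognized.

```shell
# Label a single URI as an archive or a plain file/directory by its suffix.
lib_uri_kind() {
  case "$1" in
    *.tar.gz|*.tgz|*.zip) echo "archive" ;;
    *) echo "file-or-directory" ;;
  esac
}

# Print one "kind uri" line per entry of a comma-separated list.
describe_lib_uris() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r uri; do
    printf '%s %s\n' "$(lib_uri_kind "$uri")" "$uri"
  done
}
```

Entries labeled as archives are the ones that need matching relative entries in tez.lib.uris.classpath.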
The above install instructions use Tez with pre-packaged Hadoop libraries included in the package, which is the recommended method for installation. A full tarball with all dependencies is a better approach to ensure that existing jobs continue to run during a cluster’s rolling upgrade.
Although the tez.lib.uris configuration options enable a wide variety of usage patterns, there are 2 main alternative modes that are supported by the framework:
Both these modes will require a tez build without Hadoop dependencies and that is available at tez-dist/target/tez-x.y.z-minimal.tar.gz.
This mode is not recommended for clusters that use rolling upgrades. Additionally, it is the user’s responsibility to ensure that the Tez version being used is compatible with the version of Hadoop running on the cluster. Step 3 above changes as follows. Subsequent steps should also use tez-dist/target/tez-x.y.z-minimal.tar.gz instead of tez-dist/target/tez-x.y.z.tar.gz.
A Tez build without Hadoop dependencies will be available at tez-dist/target/tez-x.y.z-minimal.tar.gz. Assuming that the Tez jars are put in /apps/ on HDFS, the commands would be:
hadoop fs -mkdir /apps/tez-x.y.z
hadoop fs -copyFromLocal tez-dist/target/tez-x.y.z-minimal.tar.gz /apps/tez-x.y.z
tez-site.xml configuration
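The settings for this mode are not spelled out here. One plausible sketch, assuming this mode relies on the Hadoop libraries already installed on the cluster, sets tez.lib.uris to the minimal tarball and enables tez.use.cluster.hadoop-libs. The helper name write_cluster_libs_tez_site is hypothetical, and both values should be verified against the Tez configuration reference for your release.

```shell
# Write a tez-site.xml for a minimal Tez tarball plus cluster-provided Hadoop jars.
# $1: output file.
write_cluster_libs_tez_site() {
  cat > "$1" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>${fs.defaultFS}/apps/tez-x.y.z/tez-x.y.z-minimal.tar.gz</value>
  </property>
  <property>
    <!-- Assumption: use the Hadoop libraries installed on the cluster nodes. -->
    <name>tez.use.cluster.hadoop-libs</name>
    <value>true</value>
  </property>
</configuration>
EOF
}
```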
This mode supports rolling upgrades. It is the user’s responsibility to ensure that the versions of Tez and Hadoop being used are compatible. To configure this mode, change Step 3 of the default instructions as follows.
hadoop fs -mkdir /apps/tez-x.y.z
hadoop fs -copyFromLocal tez-dist/target/tez-x.y.z-minimal.tar.gz /apps/tez-x.y.z
hadoop fs -copyFromLocal tez-dist/target/tez-x.y.z-minimal/* /apps/tez-x.y.z
hadoop fs -mkdir /apps/hadoop-x.y.z
hadoop fs -copyFromLocal hadoop-dist/target/hadoop-x.y.z-SNAPSHOT.tar.gz /apps/hadoop-x.y.z
tez-site.xml configuration
./tez/*:./tez/lib/*:./hadoop-mapreduce/hadoop-x.y.z-SNAPSHOT/share/hadoop/common/*:./hadoop-mapreduce/hadoop-x.y.z-SNAPSHOT/share/hadoop/common/lib/*:./hadoop-mapreduce/hadoop-x.y.z-SNAPSHOT/share/hadoop/hdfs/*:./hadoop-mapreduce/hadoop-x.y.z-SNAPSHOT/share/hadoop/hdfs/lib/*:./hadoop-mapreduce/hadoop-x.y.z-SNAPSHOT/share/hadoop/yarn/*:./hadoop-mapreduce/hadoop-x.y.z-SNAPSHOT/share/hadoop/yarn/lib/*:./hadoop-mapreduce/hadoop-x.y.z-SNAPSHOT/share/hadoop/mapreduce/*:./hadoop-mapreduce/hadoop-x.y.z-SNAPSHOT/share/hadoop/mapreduce/lib/*
./hadoop-mapreduce/hadoop-x.y.z-SNAPSHOT/share/hadoop/common/*:./hadoop-mapreduce/hadoop-x.y.z-SNAPSHOT/share/hadoop/common/lib/*:./hadoop-mapreduce/hadoop-x.y.z-SNAPSHOT/share/hadoop/hdfs/*:./hadoop-mapreduce/hadoop-x.y.z-SNAPSHOT/share/hadoop/hdfs/lib/*:./hadoop-mapreduce/hadoop-x.y.z-SNAPSHOT/share/hadoop/yarn/*:./hadoop-mapreduce/hadoop-x.y.z-SNAPSHOT/share/hadoop/yarn/lib/*:./hadoop-mapreduce/hadoop-x.y.z-SNAPSHOT/share/hadoop/mapreduce/*:./hadoop-mapreduce/hadoop-x.y.z-SNAPSHOT/share/hadoop/mapreduce/lib/*
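The long classpath values above repeat one pattern for the common, hdfs, yarn, and mapreduce components, so they can be generated instead of typed by hand. The following is a sketch under that assumption; the function name hadoop_archive_classpath is hypothetical.

```shell
# Print the Hadoop part of tez.lib.uris.classpath for an unpacked Hadoop tarball.
# $1: directory the archive is unpacked under (e.g. hadoop-mapreduce)
# $2: top-level directory inside the tarball (e.g. hadoop-x.y.z-SNAPSHOT)
hadoop_archive_classpath() {
  base="./$1/$2/share/hadoop"
  classpath=""
  for component in common hdfs yarn mapreduce; do
    # Add the component jars and its lib/ jars, separated by ':'.
    classpath="${classpath:+$classpath:}$base/$component/*:$base/$component/lib/*"
  done
  printf '%s\n' "$classpath"
}
```

Prefixing the result with ./tez/*:./tez/lib/*: yields the longer of the two values shown above.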