Getting Started With Hadoop On Demand (HOD)
===========================================

1. Pre-requisites:
==================

Hardware:
HOD requires a minimum of 3 nodes configured through a resource manager.

Software:
The following components are assumed to be installed before using HOD:
* Torque:
  (http://www.clusterresources.com/pages/products/torque-resource-manager.php)
  Currently HOD supports Torque out of the box. We assume that you are
  familiar with configuring Torque. You can get information about this
  from the following link:
  http://www.clusterresources.com/wiki/doku.php?id=torque:torque_wiki
* Python (http://www.python.org/)
  HOD requires version 2.5.1 of Python.

The following components can optionally be installed for better
functionality from HOD:
* Twisted Python: This can be used for improving the scalability of HOD
  (http://twistedmatrix.com/trac/)
* Hadoop: HOD can automatically distribute Hadoop to all nodes in the
  cluster. However, it can also use a pre-installed version of Hadoop,
  if it is available on all nodes in the cluster.
  (http://hadoop.apache.org/core)
  HOD currently supports Hadoop 0.15 and above.

NOTE: HOD configuration requires the location of installs of these
components to be the same on all nodes in the cluster. Configuration is
also simpler if the install locations are the same on the submit nodes.

2. Resource Manager Configuration Pre-requisites:
=================================================

For using HOD with Torque:
* Install Torque components: pbs_server on a head node, pbs_moms on all
  compute nodes, and PBS client tools on all compute nodes and submit
  nodes.
* Create a queue for submitting jobs on the pbs_server.
* Specify a name for all nodes in the cluster, by setting a 'node
  property' on all the nodes. This can be done by using the 'qmgr'
  command. For example:

    qmgr -c "set node node_name properties=cluster-name"

  where node_name is the name of a node in the cluster.
* Ensure that jobs can be submitted to the nodes. This can be done by
  using the 'qsub' command. For example:

    echo "sleep 30" | qsub -l nodes=3

* More information about setting up Torque can be found in the
  documentation under:
  http://www.clusterresources.com/pages/products/torque-resource-manager.php

3. Setting up HOD:
==================

* HOD is available under the 'contrib' section of Hadoop under the root
  directory 'hod'.
* Distribute the files under this directory to all the nodes in the
  cluster. Note that the location where the files are copied should be
  the same on all the nodes.
* On the node from where you want to run hod, edit the file hodrc which
  can be found in the 'conf' directory of the HOD installation. This
  file contains the minimal set of values required for running hod.
* Specify values suitable to your environment for the following
  variables defined in the configuration file. Note that some of these
  variables are defined at more than one place in the file. A sketch of
  a filled-in hodrc appears at the end of this section.
  * ${JAVA_HOME}: Location of Java for Hadoop. Hadoop supports Sun JDK
    1.5.x.
  * ${CLUSTER_NAME}: Name of the cluster which is specified in the
    'node property' as mentioned in the resource manager configuration.
  * ${HADOOP_HOME}: Location of the Hadoop installation on the compute
    and submit nodes.
  * ${RM_QUEUE}: Queue configured for submitting jobs in the resource
    manager configuration.
  * ${RM_HOME}: Location of the resource manager installation on the
    compute and submit nodes.
* The following environment variables *may* need to be set depending on
  your environment. These variables must be defined where you run the
  HOD client, and also be specified in the HOD configuration file as
  the value of the key resource_manager.env-vars. Multiple variables
  can be specified as a comma separated list of key=value pairs.
  * HOD_PYTHON_HOME: If you install python to a non-default location on
    the compute nodes, or submit nodes, then this variable must be
    defined to point to the python executable in the non-standard
    location.

NOTE: You can also review other configuration options in the file and
modify them to suit your needs. Refer to the file config.txt for
information about the HOD configuration.
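As an illustration, here is a minimal sketch of how the placeholders
above might be filled in. The section and option names follow the
sample hodrc shipped with HOD, but verify them against config.txt; all
paths and names below (/usr/java/jdk1.5.0, 'mycluster', 'hodqueue',
/usr/local/hadoop, /usr/torque) are hypothetical values that must be
replaced with ones from your environment:

  # Sketch of a filled-in hodrc (illustrative values only).
  [hod]
  java-home     = /usr/java/jdk1.5.0         # was ${JAVA_HOME}
  cluster       = mycluster                  # was ${CLUSTER_NAME}

  [resource_manager]
  id            = torque
  queue         = hodqueue                   # was ${RM_QUEUE}
  batch-home    = /usr/torque                # was ${RM_HOME}
  # Needed only if python is in a non-default location:
  env-vars      = HOD_PYTHON_HOME=/usr/local/python-2.5.1/bin/python

  [gridservice-mapred]
  pkgs          = /usr/local/hadoop          # was ${HADOOP_HOME}

  [gridservice-hdfs]
  pkgs          = /usr/local/hadoop          # was ${HADOOP_HOME}

Remember that ${JAVA_HOME}, ${HADOOP_HOME} and ${RM_HOME} may appear in
more than one section of the actual file; set them everywhere they
occur.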
4. Running HOD:
===============

4.1 Overview:
-------------

A typical session of HOD involves at least three steps: allocate, run
Hadoop jobs, deallocate. A sketch of a complete session appears at the
end of this overview.

4.1.1 Operation allocate
------------------------

The allocate operation is used to allocate a set of nodes and install
and provision Hadoop on them. It has the following syntax:

  hod -c config_file -t hadoop_tarball_location -o "allocate \
                                              cluster_dir number_of_nodes"

The hadoop_tarball_location must be a location on a shared file system
accessible from all nodes in the cluster. Note that the cluster_dir
must exist before running the command. If the command completes
successfully, cluster_dir/hadoop-site.xml will be generated and will
contain information about the allocated cluster's JobTracker and
NameNode.

For example, the following command uses a hodrc file in
~/hod-config/hodrc and allocates Hadoop (provided by the tarball
~/share/hadoop.tar.gz) on 10 nodes, storing the generated Hadoop
configuration in a directory named ~/hadoop-cluster:

  $ hod -c ~/hod-config/hodrc -t ~/share/hadoop.tar.gz -o "allocate \
                                                      ~/hadoop-cluster 10"

HOD also supports an environment variable called HOD_CONF_DIR. If this
is defined, HOD will look for a default hodrc file at
$HOD_CONF_DIR/hodrc. Defining this allows the above command to also be
run as follows:

  $ export HOD_CONF_DIR=~/hod-config
  $ hod -t ~/share/hadoop.tar.gz -o "allocate ~/hadoop-cluster 10"

4.1.2 Running Hadoop jobs using the allocated cluster
-----------------------------------------------------

Now, one can run Hadoop jobs using the allocated cluster in the usual
manner:

  hadoop --config cluster_dir hadoop_command hadoop_command_args

Continuing our example, the following command will run a wordcount
example on the allocated cluster:

  $ hadoop --config ~/hadoop-cluster jar \
        /path/to/hadoop/hadoop-examples.jar wordcount \
        /path/to/input /path/to/output

4.1.3 Operation deallocate
--------------------------

The deallocate operation is used to release an allocated cluster. When
finished with a cluster, deallocate must be run so that the nodes
become free for others to use. The deallocate operation has the
following syntax:

  hod -o "deallocate cluster_dir"

Continuing our example, the following command will deallocate the
cluster:

  $ hod -o "deallocate ~/hadoop-cluster"
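The three steps above can be tied together in a wrapper script. The
following is a minimal sketch, reusing the hodrc, tarball and cluster
directory paths from the running example (all of which are
illustrative); the trap ensures a deallocate is attempted even if the
job fails, so nodes are not left held:

  #!/bin/sh
  # Illustrative HOD session wrapper; adapt paths to your environment.
  CLUSTER_DIR=~/hadoop-cluster
  mkdir -p "$CLUSTER_DIR"              # cluster_dir must exist

  export HOD_CONF_DIR=~/hod-config     # HOD reads $HOD_CONF_DIR/hodrc

  # Allocate 10 nodes, provisioning Hadoop from the tarball.
  hod -t ~/share/hadoop.tar.gz -o "allocate $CLUSTER_DIR 10" || exit 1

  # Always try to free the nodes, even if the job below fails.
  trap 'hod -o "deallocate $CLUSTER_DIR"' EXIT

  # Run a job against the allocated cluster.
  hadoop --config "$CLUSTER_DIR" jar \
      /path/to/hadoop/hadoop-examples.jar wordcount \
      /path/to/input /path/to/output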
4.2 Command Line Options
------------------------

This section covers the major command line options available via the
hod command:

--help
  Prints the help message showing the basic options.

--verbose-help
  All configuration options provided in the hodrc file can be passed on
  the command line, using the syntax --section_name.option_name[=value].
  When provided this way, the value provided on the command line
  overrides the option provided in hodrc. The verbose-help command
  lists all the available options in the hodrc file. This is also a
  nice way to see the meaning of the configuration options.

-c config_file
  Provides the configuration file to use. Can be used with all other
  options of HOD. Alternatively, the HOD_CONF_DIR environment variable
  can be defined to specify a directory that contains a file named
  hodrc, alleviating the need to specify the configuration file in each
  HOD command.

-b 1|2|3|4
  Enables the given debug level. Can be used with all other options of
  HOD. 4 is most verbose.

-o "help"
  Lists the operations available in the operation mode.

-o "allocate cluster_dir number_of_nodes"
  Allocates a cluster on the given number of cluster nodes, and stores
  the allocation information in cluster_dir for use with subsequent
  hadoop commands. Note that the cluster_dir must exist before running
  the command.

-o "list"
  Lists the clusters allocated by this user. Information provided
  includes the Torque job id corresponding to the cluster, the cluster
  directory where the allocation information is stored, and whether the
  Map/Reduce daemon is still active or not.

-o "info cluster_dir"
  Lists information about the cluster whose allocation information is
  stored in the specified cluster directory.

-o "deallocate cluster_dir"
  Deallocates the cluster whose allocation information is stored in the
  specified cluster directory.

-t hadoop_tarball
  Provisions Hadoop from the given tar.gz file. This option is only
  applicable to the allocate operation. For better distribution
  performance it is recommended that the Hadoop tarball contain only
  the libraries and binaries, and not the source or documentation.

-Mkey1=value1 -Mkey2=value2
  Provides configuration parameters for the provisioned Map/Reduce
  daemons (JobTracker and TaskTrackers). A hadoop-site.xml is generated
  with these values on the cluster nodes.

-Hkey1=value1 -Hkey2=value2
  Provides configuration parameters for the provisioned HDFS daemons
  (NameNode and DataNodes). A hadoop-site.xml is generated with these
  values on the cluster nodes.

-Ckey1=value1 -Ckey2=value2
  Provides configuration parameters for the client from where jobs can
  be submitted. A hadoop-site.xml is generated with these values on the
  submit node. An example combining these options appears below.
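To illustrate how these options combine, here is a hypothetical
allocate command that overrides one parameter for each of the
Map/Reduce daemons, the HDFS daemons, and the client. The property
names (mapred.child.java.opts, dfs.replication, mapred.reduce.tasks)
are standard Hadoop configuration keys, used here purely as examples:

  $ hod -c ~/hod-config/hodrc -t ~/share/hadoop.tar.gz \
        -Mmapred.child.java.opts=-Xmx512m \
        -Hdfs.replication=2 \
        -Cmapred.reduce.tasks=10 \
        -o "allocate ~/hadoop-cluster 10"

Each key=value pair ends up in the hadoop-site.xml generated for the
corresponding set of nodes: -M values on the Map/Reduce nodes, -H
values on the HDFS nodes, and -C values on the submit node.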