****************** FailMon Quick Start Guide ***********************

This document is a guide to quickly setting up and running FailMon.
For more information and details, please see the FailMon User Manual.

***** Building FailMon *****

Normally, FailMon lies under <hadoop-dir>/src/contrib/failmon, where
<hadoop-dir> is the Hadoop project root folder. To compile it, one can
either run ant for the whole Hadoop project, i.e.:

$ cd <hadoop-dir>
$ ant

or run ant only for FailMon:

$ cd <hadoop-dir>/src/contrib/failmon
$ ant

The above will compile FailMon and place all class files under
<hadoop-dir>/build/contrib/failmon/classes. By invoking:

$ cd <hadoop-dir>/src/contrib/failmon
$ ant tar

FailMon is packaged as a standalone jar application in
<hadoop-dir>/src/contrib/failmon/failmon.tar.gz.

***** Deploying FailMon *****

There are two ways FailMon can be deployed in a cluster:

a) Within Hadoop, in which case the whole Hadoop package is uploaded
to the cluster nodes. In that case, nothing else needs to be done on
individual nodes.

b) Independently of the Hadoop deployment, i.e., by uploading
failmon.tar.gz to all nodes and uncompressing it. In that case, the
bin/failmon.sh script needs to be edited; the environment variable
HADOOPDIR should point to the root directory of the Hadoop
distribution. Also, the location of the Hadoop configuration files
should be specified in the property 'hadoop.conf.path' in the file
conf/failmon.properties. Note that these files refer to the HDFS in
which we want to store the FailMon data (which can potentially be
different from the one on the cluster we are monitoring). A sample of
these settings is given at the end of this guide.

We assume that either way FailMon is placed in the same directory on
all nodes, which is typical for most clusters. If this is not
feasible, one should create on all nodes of the cluster an identical
symbolic link that points to the FailMon directory of each node.

One should also edit the conf/failmon.properties file on each node to
set the desired property values. However, the default values are
expected to serve most practical cases. Refer to the FailMon User
Manual for the various properties and configuration parameters.

***** Running FailMon *****

In order to run FailMon using a node to do the ad-hoc scheduling of
monitoring jobs, one needs to edit the hosts.list file to specify the
list of machine hostnames on which FailMon is to be run. Also, in the
file conf/global.config, the username used to connect to the machines
has to be specified (passwordless SSH is assumed) in the property
'ssh.username'. In the property 'failmon.dir', the path to the FailMon
folder has to be specified as well (it is assumed to be the same on
all machines in the cluster). Sample entries for both files are given
at the end of this guide. Then one only needs to invoke the command:

$ cd <failmon-dir>
$ bin/scheduler.py

to start the system.

***** Merging HDFS files *****

For the purpose of merging the files created on HDFS by FailMon, the
following command can be used:

$ cd <failmon-dir>
$ bin/failmon.sh --mergeFiles

This will concatenate all files in the HDFS folder (pointed to by the
'hdfs.upload.dir' property in the conf/failmon.properties file) into a
single file, which will be placed in the same folder. Also, the
location of the Hadoop configuration files should be specified in the
property 'hadoop.conf.path' in the file conf/failmon.properties. Note
that these files refer to the HDFS in which we have stored the FailMon
data (which can potentially be different from the one on the cluster
we are monitoring). Also, the scheduler.py script can be set up to
merge the HDFS files when their number surpasses a configurable limit
(see the conf/global.config file); a sketch of the relevant settings
is given at the end of this guide.

Please refer to the FailMon User Manual for more details.
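
***** Sample Configuration Snippets *****

The following snippets illustrate the settings discussed above. All
paths, hostnames, and usernames are made-up examples; only the
property names quoted in the sections above come from FailMon itself.

For a stand-alone deployment (case (b) above), bin/failmon.sh and
conf/failmon.properties might be edited along these lines:

# In bin/failmon.sh -- root of the Hadoop distribution
# (the path is an example):
HADOOPDIR=/opt/hadoop

# In conf/failmon.properties -- Hadoop configuration of the HDFS
# that will store the FailMon data (example path):
hadoop.conf.path = /opt/hadoop/conf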
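
For the scheduler, hosts.list holds one hostname per line, and
conf/global.config carries the SSH settings. Assuming global.config
uses the same 'key = value' syntax as failmon.properties, the entries
might look like this (hostnames, username, and path are examples):

# hosts.list -- machines on which FailMon is to be run:
node01.example.com
node02.example.com

# conf/global.config -- user for passwordless SSH, and the FailMon
# folder (assumed identical on all nodes):
ssh.username = hadoop
failmon.dir = /opt/failmon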
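
Finally, the merging step reads the 'hdfs.upload.dir' property, and
the automatic merge threshold lives in conf/global.config. The
threshold key shown below is a hypothetical placeholder; check
conf/global.config for the actual name:

# In conf/failmon.properties -- HDFS folder holding FailMon output
# (example path):
hdfs.upload.dir = /failmon

# In conf/global.config -- merge once this many files accumulate
# (hypothetical key name; see the file for the real one):
max.hdfs.files = 100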