#Distributed Training on Mesos

This guide explains how to start SINGA distributed training on a Mesos cluster. It assumes that both Mesos and HDFS are already running, and every node has SINGA installed.
We assume the architecture depicted below, in which a cluster nodes are Docker container. Refer to [Docker guide](docker.html) for details of how to start individual nodes and set up network connection between them (make sure [weave](http://weave.works/guides/weave-docker-ubuntu-simple.html) is running at each node, and the cluster's headnode is running in container `node0`)

![Nothing](http://www.comp.nus.edu.sg/~dinhtta/files/singa_mesos.png)

---

## Start HDFS and Mesos
Go inside each container, using:
````
docker exec -it nodeX /bin/bash
````
and configure it as follows:

* On container `node0`

        hadoop namenode -format
        hadoop-daemon.sh start namenode
        /opt/mesos-0.22.0/build/bin/mesos-master.sh --work_dir=/opt --log_dir=/opt --quiet > /dev/null &
        zk-service.sh start

* On container `node1, node2, ...`

        hadoop-daemon.sh start datanode
        /opt/mesos-0.22.0/build/bin/mesos-slave.sh --master=node0:5050 --hostname=XX.XX.XX.XX --log_dir=/opt --quiet > /dev/null &

  where XX.XX.XX.XX is the **public IP address** of the slave node

To check if the setup has been successful, check that HDFS namenode has registered `N` datanodes, via:

````
hadoop dfsadmin -report
````

####Important If the Docker version is 1.9 or newer, make sure [name resolution is set up
properly](docker.html#launch_pseudo) 

#### Mesos logs
Mesos logs are stored at `/opt/lt-mesos-master.INFO` on `node0` and `/opt/lt-mesos-slave.INFO` at other nodes.

---

## Starting SINGA training on Mesos
Assumed that Mesos and HDFS are already started, SINGA job can be launched at **any** container.

#### Launching job

1. Log in to any container, then go to `incubator-singa/tool/mesos`
<a name="job_start"></a>
2. Check that configuration files are correct:
    + `scheduler.conf` contains information about the master nodes
    + `singa.conf` contains information about Zookeeper node0
    + Job configuration file `job.conf` **contains full path to the examples directories (NO RELATIVE PATH!).**
3. Start the job:
    + If starting for the first time:
            
              make
	          ./scheduler <job config file> -scheduler_conf <scheduler config file> -singa_conf <SINGA config file>
    + If not the first time:

	          ./scheduler <job config file>

**Notes.** Each running job is given a `frameworkID`. Look for the log message of the form:

             Framework registered with XXX-XXX-XXX-XXX-XXX-XXX

#### Monitoring and Debugging

Each Mesos job is given a `frameworkID` and a *sandbox* directory is created for each job.
The directory is in the specified `work_dir` (or `/tmp/mesos`) by default. For example, the error
during SINGA execution can be found at:

            /tmp/mesos/slaves/xxxxx-Sx/frameworks/xxxxx/executors/SINGA_x/runs/latest/stderr

Other artifacts, like files downloaded from HDFS (`job.conf`) and `stdout` can be found in the same
directory.

#### Stopping

There are two way to kill the running job:

1. If the scheduler is running in the foreground, simply kill it (using `Ctrl-C`, for example).

2. If the scheduler is running in the background, kill it using Mesos's REST API:

          curl -d "frameworkId=XXX-XXX-XXX-XXX-XXX-XXX" -X POST http://<master>/master/shutdown