# Distributed Training on Mesos
This guide explains how to start SINGA distributed training on a Mesos cluster. It assumes that both Mesos and HDFS are already running, and every node has SINGA installed.
We assume the architecture depicted below, in which the cluster nodes are Docker containers. Refer to the Docker guide for details on how to start individual nodes and set up network connections between them (make sure weave is running on each node, and that the cluster’s head node is running in container node0).
## Start HDFS and Mesos
Go inside each container, using:
docker exec -it nodeX /bin/bash
and configure it as follows:
On container node0:
hadoop namenode -format
hadoop-daemon.sh start namenode
/opt/mesos-0.22.0/build/bin/mesos-master.sh --work_dir=/opt --log_dir=/opt --quiet > /dev/null &
zk-service.sh start
On containers node1, node2, ...:
hadoop-daemon.sh start datanode
/opt/mesos-0.22.0/build/bin/mesos-slave.sh --master=node0:5050 --hostname=XX.XX.XX.XX --log_dir=/opt --quiet > /dev/null &
where XX.XX.XX.XX is the public IP address of the slave node.
To verify the setup, check that the HDFS namenode has registered N datanodes:
hadoop dfsadmin -report
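The check above can be scripted. The following is a minimal sketch, not part of SINGA. It assumes that each datanode entry in the report begins with a line starting with `Name:`, which may differ between Hadoop versions. The function names `count_datanodes` and `check_datanodes` are hypothetical.

```shell
# Count datanode entries in a dfsadmin report read from stdin.
# Assumption: each datanode entry starts with a "Name:" line.
# Note: nodes that the report lists as dead would be counted as well.
count_datanodes() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status clean in that case.
    grep -c '^Name:' || true
}

# Succeed only if the report on stdin lists exactly $1 datanodes.
check_datanodes() {
    expected="$1"
    actual="$(count_datanodes)"
    [ "$actual" -eq "$expected" ]
}
```

For a cluster with two datanodes, this could be run on node0 as `hadoop dfsadmin -report | check_datanodes 2 && echo "HDFS ready"`.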
#### Important
If the Docker version is 1.9 or newer, make sure name resolution is set up properly.
### Mesos logs
Mesos logs are stored at /opt/lt-mesos-master.INFO on node0 and at /opt/lt-mesos-slave.INFO on the other nodes.
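When debugging, it can help to pick the right log file for a node and show only the problem lines. The sketch below is illustrative. It assumes the master runs only on node0, and that the logs use glog-style lines whose first character is the severity (I, W, E, or F). The helper names `mesos_log_file` and `filter_problems` are hypothetical.

```shell
# Print the Mesos log file for a node, following the layout described above.
# Assumption: the master runs only on node0; every other node is a slave.
mesos_log_file() {
    case "$1" in
        node0) echo "/opt/lt-mesos-master.INFO" ;;
        *)     echo "/opt/lt-mesos-slave.INFO" ;;
    esac
}

# Print only warning, error, and fatal lines from a log read on stdin.
# Assumption: glog-style lines, where the first character is the severity
# (I, W, E, or F) followed by the date digits.
filter_problems() {
    grep -E '^[WEF][0-9]{4}' || true
}
```

Inside a container whose hostname matches its node name (an assumption), a possible invocation is `filter_problems < "$(mesos_log_file "$(hostname)")"`.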
## Starting SINGA training on Mesos
Assuming that Mesos and HDFS are already running, a SINGA job can be launched from any container.
### Launching job
Log in to any container, then go to incubator-singa/tool/mesos
Check that the configuration files are correct:
- scheduler.conf contains information about the master nodes
- singa.conf contains information about Zookeeper node0
- Job configuration file job.conf contains full paths to the examples directories (NO RELATIVE PATH!).
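Relative paths in the job configuration are easy to miss. The heuristic below is a sketch under stated assumptions, not a SINGA tool. It assumes that paths appear as double-quoted strings in the file, and it flags any quoted value that contains a slash but is neither absolute nor a URI. The function name `find_relative_paths` is hypothetical.

```shell
# Print every double-quoted value in the file "$1" that looks like a relative path.
# Heuristic: a value containing "/" that is neither absolute ("/...")
# nor a URI ("scheme://...") is reported.
find_relative_paths() {
    grep -o '"[^"]*"' "$1" \
        | tr -d '"' \
        | grep '/' \
        | grep -v -e '^/' -e '^[A-Za-z][A-Za-z0-9+.-]*://' \
        || true
}
```

Running `find_relative_paths job.conf` prints nothing when every quoted path is absolute. Any line it does print is a candidate to replace with a full path.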
Start the job:
- If starting for the first time:
make
./scheduler <job config file> -scheduler_conf <scheduler config file> -singa_conf <SINGA config file>
- If not the first time:
./scheduler <job config file>
Note: each running job is given a frameworkID. Look for a log message of the form:
Framework registered with XXX-XXX-XXX-XXX-XXX-XXX
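Since the frameworkID is needed later to stop a background job, it may be convenient to capture it from the scheduler output. This is a minimal sketch that assumes the message format shown above. The function name `extract_framework_id` and the file name `scheduler.log` are hypothetical.

```shell
# Read scheduler output on stdin and print the first registered frameworkID.
# Assumption: the ID is the whitespace-free token that follows
# "Framework registered with".
extract_framework_id() {
    sed -n 's/.*Framework registered with \([^[:space:]]*\).*/\1/p' | head -n 1
}
```

For example, if the scheduler output was saved with `./scheduler <job config file> 2>&1 | tee scheduler.log`, the ID could be recovered with `extract_framework_id < scheduler.log`.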
### Monitoring and Debugging
Each Mesos job is given a frameworkID, and a sandbox directory is created for each job. The directory is located under the specified work_dir (/tmp/mesos by default). For example, the error output of a SINGA execution can be found at:
/tmp/mesos/slaves/xxxxx-Sx/frameworks/xxxxx/executors/SINGA_x/runs/latest/stderr
Other artifacts, such as files downloaded from HDFS (job.conf) and stdout, can be found in the same directory.
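Because the slave and framework IDs in the sandbox path change between runs, a small search helper can save some typing. The sketch below assumes the directory layout shown above and that `latest` may be a symbolic link, so it asks `find` to follow links. The function name `find_singa_stderr` is hypothetical.

```shell
# List the stderr files of the latest SINGA executor runs under a Mesos work_dir.
# $1: work_dir (defaults to /tmp/mesos, as described above).
find_singa_stderr() {
    work_dir="${1:-/tmp/mesos}"
    # -L follows symbolic links, in case "latest" is a link to the newest run.
    find -L "$work_dir/slaves" \
        -path '*/frameworks/*/executors/SINGA_*/runs/latest/stderr' 2>/dev/null
}
```

On a slave node, `find_singa_stderr | xargs tail -n 20` would show the last lines of each matching file.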
### Stopping
There are two ways to kill a running job:
- If the scheduler is running in the foreground, simply kill it (using Ctrl-C, for example).
- If the scheduler is running in the background, kill it using Mesos’s REST API:
curl -d "frameworkId=XXX-XXX-XXX-XXX-XXX-XXX" -X POST http://<master>/master/shutdown
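The REST call above can be wrapped in a small function so that the master address and frameworkID are passed as arguments. This is an illustrative sketch that only reuses the `curl` invocation shown above. The function name `shutdown_framework` is hypothetical, and `node0:5050` in the usage note is taken from the slave start-up command earlier in this guide.

```shell
# Ask the Mesos master at "$1" (host:port) to shut down framework "$2".
shutdown_framework() {
    master="$1"
    framework_id="$2"
    # Refuse to send a request with a missing master or frameworkID.
    if [ -z "$master" ] || [ -z "$framework_id" ]; then
        echo "usage: shutdown_framework <master host:port> <frameworkID>" >&2
        return 2
    fi
    curl -d "frameworkId=${framework_id}" -X POST "http://${master}/master/shutdown"
}
```

A call would then look like `shutdown_framework node0:5050 XXX-XXX-XXX-XXX-XXX-XXX`, with the placeholder replaced by the frameworkID reported when the job was registered.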