Please refer to the installation page for guidance on installing SINGA.
SINGA uses zookeeper to coordinate the training. Please make sure the zookeeper service is started before running SINGA.
If you installed zookeeper using our third-party script, you can simply start it by:
# goto top level folder
cd SINGA_ROOT
./bin/zk-service.sh start
(./bin/zk-service.sh stop stops the zookeeper).
Otherwise, if you launched a zookeeper instance yourself and it is not listening on the default port, please edit conf/singa.conf:
zookeeper_host: "localhost:YOUR_PORT"
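If you are unsure whether a zookeeper instance is reachable on the configured port, a quick sanity check (assuming the netcat utility nc is installed) is to send zookeeper's standard ruok command; a healthy server replies with imok:

# replace 2181 with YOUR_PORT if you changed it
echo ruok | nc localhost 2181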
Running SINGA in standalone mode means running it without cluster managers like Mesos or YARN.
For single node training, one process will be launched to run SINGA on the local host. We train a CNN model over the CIFAR-10 dataset as an example. The hyper-parameters are set following cuda-convnet. More details are available in the CNN example.
Download the dataset and create the data shards for training and testing.
cd examples/cifar10/
cp Makefile.example Makefile
make download
make create
A training dataset and a test dataset are created in the cifar10-train-shard and cifar10-test-shard folders respectively. An image_mean.bin file is also generated, which contains the feature mean of all images.
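You can quickly verify that these outputs were created, e.g.:

# still inside examples/cifar10/
ls
# the listing should include: cifar10-train-shard  cifar10-test-shard  image_mean.bin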
Since all code used for training this CNN model is provided by SINGA as built-in implementations, there is no need to write any code. Instead, users just execute the running script (../../bin/singa-run.sh), providing the job configuration file (job.conf). To write your own code in SINGA, please refer to the programming guide.
By default, the cluster topology has a single worker and a single server. In other words, neither the training data nor the neural net is partitioned.
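For reference, this default topology could be spelled out explicitly in job.conf roughly as below. This is only a sketch: nworker_groups and nworkers_per_group appear in the cluster blocks later in this guide, while nserver_groups and nservers_per_group are assumed analogues for servers.

# job.conf (sketch of the default single-worker, single-server topology)
cluster {
  nworker_groups: 1
  nworkers_per_group: 1
  nserver_groups: 1      # assumed field name, by analogy with nworker_groups
  nservers_per_group: 1  # assumed field name
  workspace: "examples/cifar10/"
}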
The training is started by running:
# goto top level folder
cd ../../
./bin/singa-run.sh -conf examples/cifar10/job.conf
You can list the currently running jobs by:
./bin/singa-console.sh list

JOB ID    |NUM PROCS
----------|-----------
24        |1
Jobs can be killed by:
./bin/singa-console.sh kill JOB_ID
Logs and job information are available in the /tmp/singa-log folder, which can be changed to another folder by setting log-dir in conf/singa.conf.
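For example, a sketch of the corresponding line in conf/singa.conf, assuming the setting is written in the same text format as zookeeper_host above (the exact field spelling may differ, and the path is a placeholder):

# conf/singa.conf
log-dir: "/your/log/folder"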
# job.conf
...
cluster {
  nworker_groups: 2
  nworkers_per_procs: 2
  workspace: "examples/cifar10/"
}
In SINGA, asynchronous training is enabled by launching multiple worker groups. For example, we can change the original job.conf to have two worker groups as shown above. By default, each worker group has one worker. Since one process is configured to contain two workers, the two worker groups will run in the same process; consequently, they run the in-memory Downpour training framework. Users do not need to split the dataset explicitly for each worker (group); instead, they can assign each worker (group) a random offset to the start of the dataset, so that the workers run as if on different data partitions, e.g., via the random_skip configuration shown below.
# job.conf
...
neuralnet {
  layer {
    ...
    sharddata_conf {
      random_skip: 5000
    }
  }
  ...
}
The running command is:
./bin/singa-run.sh -conf examples/cifar10/job.conf
# job.conf
...
cluster {
  nworkers_per_group: 2
  nworkers_per_procs: 2
  workspace: "examples/cifar10/"
}
In SINGA, synchronous training is enabled by launching multiple workers within one worker group. For instance, we can change the original job.conf to have two workers in one worker group as shown above. The workers will run synchronously as they are from the same worker group. This framework is the in-memory Sandblaster. The model is partitioned among the two workers. Specifically, each layer is sliced over the two workers. The sliced layer is the same as the original layer except that it only has B/g feature instances, where B is the number of instances in a mini-batch and g is the number of workers in a group. It is also possible to partition the layer (or neural net) using other schemes; a sketch is given below. All other settings are the same as running without partitioning, and the training is started with the same command:
./bin/singa-run.sh -conf examples/cifar10/job.conf
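As noted above, other partitioning schemes are possible. If the neuralnet configuration exposes a partition dimension field (an assumption here, not confirmed by this guide), choosing between slicing on the batch dimension and the feature dimension might be expressed roughly as:

# job.conf (sketch; partition_dim is assumed: 0 = batch dimension, 1 = feature dimension)
neuralnet {
  partition_dim: 0
  ...
}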
We can extend the above two training frameworks to a cluster by updating the cluster configuration with:
nworkers_per_procs: 1
Every process would then create only one worker thread; consequently, the workers would be created in different processes (i.e., nodes). A hostfile listing the nodes in the cluster must be provided under SINGA_ROOT/conf/, e.g.:
logbase-a01
logbase-a02
The zookeeper location must also be configured correctly, e.g.:
# conf/singa.conf
zookeeper_host: "logbase-a01"
The running command is the same as for single node training:
./bin/singa-run.sh -conf examples/cifar10/job.conf
The programming guide pages will describe how to submit a training job in SINGA.