Quick Start


SINGA setup

Please refer to the installation page for guidance on installing SINGA.

Training on a single node

For single-node training, one process is launched to run SINGA on the local host. As an example, we train a CNN model over the CIFAR-10 dataset. The hyper-parameters are set following cuda-convnet. More details are available in the CNN example.

Preparing data and job configuration

Download the dataset and create the data shards for training and testing.

cd examples/cifar10/
cp Makefile.example Makefile
make download
make create

This creates a training dataset and a test dataset. An image_mean.bin file is also generated, which contains the feature mean of all images.

Since all code used for training this CNN model is provided by SINGA as built-in implementations, there is no need to write any code. Users simply execute the running script, providing the job configuration file (job.conf). To write your own code in SINGA, please refer to the programming guide.
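
For reference, job.conf is a plain-text configuration file. The sketch below only illustrates the kind of fields it contains; the names and values are abbreviated, and the actual examples/cifar10/job.conf is the authoritative version.

# job.conf (simplified sketch)
name: "cifar10-convnet"   # job name
train_steps: ...          # number of training iterations
test_steps: ...           # number of iterations per test phase
test_freq: ...            # run a test phase every test_freq training steps
updater { ... }           # parameter updater settings, e.g., SGD learning rate
neuralnet {               # the CNN layers: data, convolution, pooling, etc.
  layer { ... }
  ...
}
cluster {                 # cluster topology, explained in the sections below
  workspace: "examples/cifar10/"
}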

Training without parallelism

By default, the cluster topology has a single worker and a single server. In other words, neither the training data nor the neural net is partitioned.
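
Written out explicitly, this default topology corresponds to a cluster section like the one below. The worker fields reuse the names shown later in this guide; the server fields are our assumption of the defaults and are normally omitted.

# job.conf
...
cluster {
  nworker_groups: 1      # a single worker group
  nworkers_per_group: 1  # with a single worker
  nserver_groups: 1      # a single server group (assumed default)
  nservers_per_group: 1  # with a single server (assumed default)
  workspace: "examples/cifar10/"
}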

The training is started by running:

# goto top level folder
cd ../../
./singa -conf examples/cifar10/job.conf

Asynchronous parallel training

# job.conf
...
cluster {
  nworker_groups: 2
  nworkers_per_procs: 2
  workspace: "examples/cifar10/"
}

In SINGA, asynchronous training is enabled by launching multiple worker groups. For example, we can change the original job.conf to have two worker groups as shown above. By default, each worker group has one worker. Since one process is configured to contain two workers, the two worker groups run in the same process. Consequently, they run the in-memory Downpour training framework. Users do not need to split the dataset explicitly for each worker (group); instead, each worker (group) can be assigned a random offset to the start of the dataset, so the workers run as if on different data partitions.

# job.conf
...
neuralnet {
  layer {
    ...
    store_conf {
      random_skip: 5000
    }
  }
  ...
}

The running command is:

./singa -conf examples/cifar10/job.conf

Synchronous parallel training

# job.conf
...
cluster {
  nworkers_per_group: 2
  nworkers_per_procs: 2
  workspace: "examples/cifar10/"
}

In SINGA, synchronous training is enabled by launching multiple workers within one worker group. For instance, we can change the original job.conf to have two workers in one worker group as shown above. The workers run synchronously as they are from the same worker group. This framework is the in-memory Sandblaster. The model is partitioned among the two workers: each layer is sliced over the two workers. The sliced layer is the same as the original layer except that it only has B/g feature instances, where B is the number of instances in a mini-batch and g is the number of workers in a group. It is also possible to partition the layer (or neural net) using other schemes (see the sketch after the running command below). All other settings are the same as running without partitioning.

./singa -conf examples/cifar10/job.conf
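
As a sketch of such an alternative scheme, the configuration below assumes a per-neuralnet partition_dim field as the knob; the exact field name and values are assumptions, so check the neural net configuration page for the actual options.

# job.conf (sketch; partition_dim is an assumed field name)
...
neuralnet {
  partition_dim: 1   # e.g., 0 to slice along the batch dimension, 1 along the feature dimension
  layer { ... }
  ...
}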

Training in a cluster

Starting Zookeeper

SINGA uses zookeeper to coordinate the training, and uses ZeroMQ for transferring messages. After installing zookeeper and ZeroMQ, you need to configure SINGA with --enable-dist before compiling. Please make sure the zookeeper service is started before running SINGA.
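
For example, assuming the standard configure-and-make build described on the installation page, the distributed build would be enabled like this:

# in the top level folder (SINGA_ROOT)
./configure --enable-dist
make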

If you installed zookeeper using our thirdparty script, you can simply start it by:

# goto top level folder
cd SINGA_ROOT
./bin/zk-service.sh start

(./bin/zk-service.sh stop stops the zookeeper).

Otherwise, if you launched zookeeper yourself and did not use the default port, please edit conf/singa.conf:

zookeeper_host: "localhost:YOUR_PORT"

We can extend the above two training frameworks to a cluster by updating the cluster configuration with:

nworkers_per_procs: 1
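
For instance, the synchronous configuration from the previous section, extended so that its two workers run on two different nodes, would become the sketch below (keep the other fields of your job.conf unchanged):

# job.conf
...
cluster {
  nworkers_per_group: 2   # two workers in one (synchronous) worker group
  nworkers_per_procs: 1   # one worker per process, so the workers land on different nodes
  workspace: "examples/cifar10/"
}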

Every process then creates only one worker thread; consequently, the workers are created in different processes (i.e., on different nodes). A hostfile listing the nodes in the cluster must be provided under SINGA_ROOT/conf/, e.g.,

192.168.0.1
192.168.0.2

And the zookeeper location must be configured correctly, e.g.,

#conf/singa.conf
zookeeper_host: "logbase-a01"

The running command is:

./bin/singa-run.sh -conf examples/cifar10/job.conf

You can list the currently running jobs by:

./bin/singa-console.sh list

JOB ID    |NUM PROCS
----------|-----------
24        |2

Jobs can be killed by:

./bin/singa-console.sh kill JOB_ID
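
For example, to kill the job with ID 24 from the listing above:

./bin/singa-console.sh kill 24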

Logs and job information are available in the /tmp/singa-log folder, which can be changed to another folder by setting log-dir in conf/singa.conf.

Training with GPUs

Please refer to the [GPU page](gpu.html) for details on training using GPUs.

Where to go next

The programming guide pages will describe how to submit a training job in SINGA.