Using HDFS with SINGA

This guide explains how to make use of HDFS as the data store for SINGA jobs.

  1. Quick start using Docker
  2. Setup HDFS
  3. Examples

Quick start using Docker

We provide a Docker container built on top of singa/mesos (see the guide on building SINGA on Docker).

git clone https://github.com/ug93tad/incubator-singa
cd incubator-singa
git checkout SINGA-97-docker
cd tool/docker/hdfs
sudo docker build -t singa/hdfs .

Once built, the image singa/hdfs contains an installation of the HDFS C++ client library (libhdfs3) and the latest SINGA code. Multiple distributed nodes can be launched, and HDFS set up, by following the guide for running distributed SINGA on Mesos.

In the following, we assume an HDFS setup in which node0 is the namenode and nodei (i>0) are the datanodes.
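
As a rough illustration, the containers can be started with plain Docker commands like the ones below; the container names (node0, node1) are only assumptions matching the naming used in this guide, and the full cluster launch should follow the Mesos guide above.

# minimal sketch: start one container per node (names are assumptions)
sudo docker run -dt --name node0 singa/hdfs   # will act as the HDFS namenode
sudo docker run -dt --name node1 singa/hdfs   # will act as a datanode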

Setup HDFS

There are at least two C/C++ client libraries for interacting with HDFS. One is libhdfs, shipped with Hadoop, which is JNI based, meaning that all communication goes through a JVM. The other is libhdfs3, a native C++ library developed by Pivotal, in which the client communicates directly with HDFS via RPC. The current implementation uses the latter.

  1. Install libhdfs3: follow the official guide.

  2. Additional setup: recent versions of Hadoop (>2.4.x) support short-circuit local reads, which bypass network communication (TCP sockets) when data is read from the local node. libhdfs3 throws errors (but continues to work) when it finds that short-circuit reads are not enabled. To silence these complaints, and to improve performance, add the following configuration to both hdfs-site.xml and hdfs-client.xml:

      <property>
        <name>dfs.client.read.shortcircuit</name>
        <value>true</value>
      </property>
      <property>
        <name>dfs.domain.socket.path</name>
        <value>/var/lib/hadoop-hdfs/dn_socket</value>
      </property>

    Next, at each client, set the LIBHDFS3_CONF environment variable to point to the hdfs-client.xml file:

      export LIBHDFS3_CONF=$HADOOP_HOME/etc/hadoop/hdfs-client.xml
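
    For reference, a complete hdfs-client.xml may look like the following minimal sketch; the <configuration> wrapper is standard Hadoop configuration boilerplate, and the socket path is just the example value used above:

      <?xml version="1.0"?>
      <configuration>
        <property>
          <name>dfs.client.read.shortcircuit</name>
          <value>true</value>
        </property>
        <property>
          <name>dfs.domain.socket.path</name>
          <value>/var/lib/hadoop-hdfs/dn_socket</value>
        </property>
      </configuration>

    Note that the directory containing the domain socket (/var/lib/hadoop-hdfs in this example) must exist on each datanode; otherwise short-circuit reads cannot be established.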
    

Examples

We explain how to run the CIFAR10 and MNIST examples. Before training, the data must be uploaded to HDFS.

CIFAR10

  1. Upload the data to HDFS (this can be done at any of the HDFS nodes)

    • Change job.conf to use HDFS: in examples/cifar10/job.conf, set the backend property to hdfsfile (see the configuration sketch after this list)
    • Create and upload data:
    cd examples/cifar10
    cp Makefile.example Makefile
    make create
    hadoop dfs -mkdir /examples/cifar10
    hadoop dfs -copyFromLocal cifar-10-batches-bin /examples/cifar10/
    

    If successful, the uploaded files can be listed in HDFS via hadoop dfs -ls /examples/cifar10
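
    As a rough guide, the data layer's store configuration in examples/cifar10/job.conf would then combine the backend and the HDFS paths as in the sketch below; the store_conf block and its other fields are assumed to stay as in the stock example, with only backend, path and mean_file changed:

      store_conf {
        backend: "hdfsfile"
        path: "hdfs://node0:9000/examples/cifar10/train_data.bin"
        mean_file: "hdfs://node0:9000/examples/cifar10/image_mean.bin"
      }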

  2. Training:

    • Make sure conf/singa.conf has the correct address of the Zookeeper service:
    zookeeper_host: "node0:2181"
    
    • Make sure job.conf has the correct paths to the train and test datasets:
    // train layer
    path: "hdfs://node0:9000/examples/cifar10/train_data.bin"
    mean_file: "hdfs://node0:9000/examples/cifar10/image_mean.bin"
    // test layer
    path: "hdfs://node0:9000/examples/cifar10/test_data.bin"
    mean_file: "hdfs://node0:9000/examples/cifar10/image_mean.bin"
    
    • Start training: execute the following command on every node
    ./singa -conf examples/cifar10/job.conf -singa_conf singa.conf -singa_job 0
    

MNIST

  1. Upload the data to HDFS (this can be done at any of the HDFS nodes)

    • Change job.conf to use HDFS: in examples/mnist/job.conf, set the backend property to hdfsfile (see the configuration sketch after this list)
    • Create and upload data:
    cd examples/mnist
    cp Makefile.example Makefile
    make create
    make compile
    ./create_data.bin train-images-idx3-ubyte train-labels-idx1-ubyte hdfs://node0:9000/examples/mnist/train_data.bin
    ./create_data.bin t10k-images-idx3-ubyte t10k-labels-idx1-ubyte hdfs://node0:9000/examples/mnist/test_data.bin
    

    If successful, the uploaded files can be listed in HDFS via hadoop dfs -ls /examples/mnist
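
    The corresponding store configuration in examples/mnist/job.conf mirrors the CIFAR10 sketch above, except that there is no mean_file; only backend and path are assumed to change:

      store_conf {
        backend: "hdfsfile"
        path: "hdfs://node0:9000/examples/mnist/train_data.bin"
      }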

  2. Training:

    • Make sure conf/singa.conf has the correct address of the Zookeeper service:
    zookeeper_host: "node0:2181"
    
    • Make sure job.conf has the correct paths to the train and test datasets:
    // train layer
    path: "hdfs://node0:9000/examples/mnist/train_data.bin"
    // test layer
    path: "hdfs://node0:9000/examples/mnist/test_data.bin"
    
    • Start training: execute the following command on every node
    ./singa -conf examples/mnist/job.conf -singa_conf singa.conf -singa_job 0