Data Preparation


SINGA uses input layers to load data. Users can store their data in any format (e.g., CSV or binary) and in any location (e.g., a disk file or HDFS), as long as a corresponding input layer can read the data records and parse them.

To make it easy for users, SINGA provides a [StoreInputLayer] that reads data as (string:key, string:value) tuples from a couple of sources. These sources are abstracted by a Store class, which is a simplified version of the DB abstraction in Caffe. The base Store class provides the following operations for reading and writing tuples:

Open(string path, Mode mode); // open the store for kRead or kCreate or kAppend
Close();

Read(string* key, string* val); // read a tuple; return false if fail
Write(string key, string val);  // write a tuple
Flush();
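
To make the interface above concrete, here is a minimal sketch of a store backed by a plain text file. It is not SINGA's implementation: the class name SimpleTextStore, the unqualified Mode enum, and the "key<TAB>value" line layout are all assumptions made for illustration.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical names: this only mirrors the operations listed above and is
// NOT SINGA's Store class.
enum Mode { kRead, kCreate, kAppend };

class SimpleTextStore {
 public:
  // Open the file for reading, creating (truncating), or appending.
  bool Open(const std::string& path, Mode mode) {
    std::ios_base::openmode flags = std::ios_base::in;
    if (mode == kCreate) flags = std::ios_base::out | std::ios_base::trunc;
    if (mode == kAppend) flags = std::ios_base::out | std::ios_base::app;
    file_.clear();  // reset error flags left over from a previous use
    file_.open(path, flags);
    return file_.is_open();
  }

  void Close() {
    if (file_.is_open()) file_.close();
  }

  // Read one "key<TAB>value" line; return false at end of file or on a
  // malformed line.
  bool Read(std::string* key, std::string* val) {
    assert(key != nullptr && val != nullptr);
    std::string line;
    if (!std::getline(file_, line)) return false;
    const std::string::size_type tab = line.find('\t');
    if (tab == std::string::npos) return false;
    *key = line.substr(0, tab);
    *val = line.substr(tab + 1);
    return true;
  }

  // Write one tuple as a single line. In this simplified layout the key
  // must not contain tabs and neither field may contain newlines.
  bool Write(const std::string& key, const std::string& val) {
    file_ << key << '\t' << val << '\n';
    return file_.good();
  }

  void Flush() { file_.flush(); }

 private:
  std::fstream file_;
};
```

A caller would open the store with kCreate, call Write once per tuple, then Flush and Close; reopening with kRead and calling Read until it returns false yields the tuples in the order they were written.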

Currently, two implementations are provided:

  1. [KVFileStore] for storing tuples in a KVFile (a binary file). The create_data.cc files in examples/cifar10 and examples/mnist provide examples of storing records using KVFileStore.

  2. [TextFileStore] for storing tuples in a plain text file (one line per tuple).

The (key, value) tuples are parsed by subclasses of StoreInputLayer, depending on the format of the tuple:

  • [ProtoRecordInputLayer] parses the value field from one tuple into a [SingleLabelImageRecord], which is generated by Google Protobuf according to [common.proto]. It can be used to store features for images (e.g., using the pixel field) or other objects (using the data field). The key field is not used.

  • [CSVRecordInputLayer] parses one tuple as a CSV line (separated by comma).
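
As an illustration of the CSV case, the sketch below splits one comma-separated line into floating-point fields. The helper name ParseCsvLine is hypothetical and this is not the code of CSVRecordInputLayer; how the fields are interpreted (for example, whether one of them is a label) is left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of SINGA: split one comma-separated line
// into floats. Returns false if the line is empty or any field is not a
// plain number (trailing characters, including spaces, are rejected).
bool ParseCsvLine(const std::string& line, std::vector<float>* fields) {
  assert(fields != nullptr);
  fields->clear();
  std::stringstream ss(line);
  std::string cell;
  while (std::getline(ss, cell, ',')) {
    try {
      std::size_t used = 0;
      const float value = std::stof(cell, &used);
      if (used != cell.size()) return false;  // e.g. "1.5abc"
      fields->push_back(value);
    } catch (const std::exception&) {
      return false;  // empty or non-numeric field, or out of range
    }
  }
  return !fields->empty();
}
```

For example, the line "3,0.5,1.25" produces three fields, while "1,abc" is rejected.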

Using built-in record format

SingleLabelImageRecord is a built-in record in SINGA for storing image features. It is used in the cifar10 and mnist examples.

message SingleLabelImageRecord {
  repeated int32 shape = 1;                // e.g., 3 (RGB channels), 32 (rows), 32 (cols)
  optional int32 label = 2;                // label
  optional bytes pixel = 3;                // pixels
  repeated float data = 4 [packed = true]; // it is used for normalization
}

The rest of this section walks through data preparation for the CIFAR-10 image dataset. The dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class, split into 50,000 training images and 10,000 test images. Each image has a single label. The dataset is distributed as binary files in a dataset-specific format. SINGA provides create_data.cc to convert the images in these binary files into SingleLabelImageRecords and insert them into the training and test stores.
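
For orientation, each record in the binary version of CIFAR-10 is 3073 bytes: one label byte followed by 3072 pixel bytes (the 1024 red values, then green, then blue, each in row-major order). The sketch below extracts one such record from an in-memory buffer. The names RawCifarImage and ParseCifarRecord are hypothetical; this is not the code in create_data.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Sizes of one record in the CIFAR-10 binary files.
const std::size_t kCifarPixelBytes = 3 * 32 * 32;            // 3072
const std::size_t kCifarRecordBytes = 1 + kCifarPixelBytes;  // 3073

// Hypothetical holder for one raw image; the two fields correspond to the
// label and pixel fields of SingleLabelImageRecord.
struct RawCifarImage {
  int label = -1;
  std::string pixel;
};

// Extract the record that starts at byte `offset` of `buf`.
// Returns false if the buffer does not hold a full record at that offset.
bool ParseCifarRecord(const std::string& buf, std::size_t offset,
                      RawCifarImage* out) {
  assert(out != nullptr);
  if (buf.size() < kCifarRecordBytes ||
      offset > buf.size() - kCifarRecordBytes) {
    return false;
  }
  // The first byte is the class label (0 to 9).
  out->label = static_cast<unsigned char>(buf[offset]);
  // The remaining 3072 bytes are the pixel values.
  out->pixel.assign(buf, offset + 1, kCifarPixelBytes);
  return true;
}
```

A conversion program would read a whole batch file into a buffer and call this helper at offsets 0, 3073, 6146, and so on, filling one record per image.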

  1. Download the raw data. The following commands download the dataset into the cifar-10-batches-bin folder.

    # in SINGA_ROOT/examples/cifar10
    $ cp Makefile.example Makefile   # an example makefile is provided
    $ make download
    
  2. Fill one record for each image and insert it into the store.

    KVFileStore store;
    store.Open(output_file_path, singa::io::kCreate);
    
    singa::SingleLabelImageRecord image;
    for (int image_id = 0; image_id < 50000; image_id++) {
      // fill the record with the image features and label from the downloaded binary files
      string str;
      image.SerializeToString(&str);
      store.Write(to_string(image_id), str);
    }
    store.Flush();
    store.Close();
    

    The data store for the test data is created similarly. In addition, the program computes the average values of the image pixels (not shown here) and inserts the mean values into a SingleLabelImageRecord, which is then written into another store.

  3. Compile and run the program. SINGA provides an example Makefile that contains instructions for compiling the source code and linking it with libsinga.so. Users only need to run the following command.

    $ make create
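
The mean computation mentioned in step 2 is not shown in this document. One way it could look is sketched below, assuming a per-position average over the raw pixel bytes of all images; the function name ComputeMeanImage is hypothetical, and the actual create_data.cc may compute the mean differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: average the byte at each position over all images.
// Every image must have the same number of bytes.
std::vector<float> ComputeMeanImage(const std::vector<std::string>& images) {
  std::vector<float> mean;
  if (images.empty()) return mean;

  // Accumulate in double to limit rounding error over many images.
  std::vector<double> sum(images[0].size(), 0.0);
  for (const std::string& image : images) {
    assert(image.size() == sum.size());
    for (std::size_t i = 0; i < sum.size(); ++i) {
      // Pixels are unsigned bytes; cast before widening to avoid negatives.
      sum[i] += static_cast<unsigned char>(image[i]);
    }
  }

  mean.resize(sum.size());
  for (std::size_t i = 0; i < sum.size(); ++i) {
    mean[i] = static_cast<float>(sum[i] / images.size());
  }
  return mean;
}
```

The resulting values are floats, so they would fit the repeated float data field of a SingleLabelImageRecord rather than the bytes pixel field.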
    

Using user-defined record format

If users cannot use SingleLabelImageRecord or the CSV record for their data, they can define their own record format, e.g., using Google Protobuf. A record can be written into a data store as long as it can be converted into a byte string. Correspondingly, subclasses of StoreInputLayer are required to parse the user-defined records.
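
As a rough illustration of "converted into a byte string", the sketch below serializes a hand-written record into a std::string and parses it back. The FeatureRecord type and both functions are hypothetical and do not use Protobuf; the layout (a 32-bit label, a 32-bit count, then raw floats in host byte order) is an assumption chosen for brevity and is not portable across machines with different endianness.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical user-defined record.
struct FeatureRecord {
  std::int32_t label = 0;
  std::vector<float> features;
};

// Layout: [int32 label][uint32 count][count floats], host byte order.
std::string SerializeRecord(const FeatureRecord& record) {
  const std::uint32_t count =
      static_cast<std::uint32_t>(record.features.size());
  std::string out(sizeof(record.label) + sizeof(count) +
                      count * sizeof(float), '\0');
  char* p = &out[0];
  std::memcpy(p, &record.label, sizeof(record.label));
  p += sizeof(record.label);
  std::memcpy(p, &count, sizeof(count));
  p += sizeof(count);
  if (count > 0) {
    std::memcpy(p, record.features.data(), count * sizeof(float));
  }
  return out;
}

// Returns false if the byte string does not match the layout above.
bool ParseRecord(const std::string& bytes, FeatureRecord* record) {
  assert(record != nullptr);
  const std::size_t header = sizeof(std::int32_t) + sizeof(std::uint32_t);
  if (bytes.size() < header) return false;
  std::uint32_t count = 0;
  std::memcpy(&record->label, bytes.data(), sizeof(std::int32_t));
  std::memcpy(&count, bytes.data() + sizeof(std::int32_t), sizeof(count));
  if (bytes.size() != header + count * sizeof(float)) return false;
  record->features.resize(count);
  if (count > 0) {
    std::memcpy(record->features.data(), bytes.data() + header,
                count * sizeof(float));
  }
  return true;
}
```

The string returned by SerializeRecord could be passed as the value argument of Write, and a custom subclass of StoreInputLayer would apply the matching parse step to each value it reads.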