Apache SINGA
A distributed deep learning platform.
singa::DataShard Class Reference

Data shard stores training/validation/test tuples.

#include <data_shard.h>

Public Types

enum  { kRead = 0, kCreate = 1, kAppend = 2 }
 

Public Member Functions

 DataShard (const std::string &folder, int mode)
 Initialize the shard object.

 DataShard (const std::string &folder, int mode, int capacity)

bool Next (std::string *key, google::protobuf::Message *val)
 Read the next tuple from the shard.

bool Next (std::string *key, std::string *val)
 Read the next tuple from the shard.

bool Insert (const std::string &key, const google::protobuf::Message &tuple)
 Append one tuple to the shard.

bool Insert (const std::string &key, const std::string &tuple)
 Append one tuple to the shard.

void SeekToFirst ()
 Move the read pointer to the head of the shard file.

void Flush ()
 Flush buffered data to disk.

int Count ()
 Iterate through all tuples to count them.

std::string path ()
 

Protected Member Functions

int Next (std::string *key)
 Read the next key and prepare the buffer for reading the value.

int PrepareForAppend (const std::string &path)
 Set the disk pointer to the right position for appending, in case the previous write crashed.

bool PrepareNextField (int size)
 Read data from disk if the current data in the buffer is not a full field.
 

Detailed Description

Data shard stores training/validation/test tuples.

Every worker node should have a training shard; validation and test shards are optional. The training shard file is singa::Cluster::workspace()/train/shard.dat. The validation and test shard files follow the same pattern under their own folders.

shard.dat consists of a set of unordered tuples. Each tuple is encoded as [key_len key record_len val], where key_len and record_len are of type uint32 and give the size in bytes of the key and the record, respectively.
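To make the tuple layout concrete, here is a minimal sketch of an encoder for [key_len key record_len val]. EncodeTuple and AppendU32 are hypothetical helpers, not part of singa::DataShard, and the host byte order used for the two length fields is an assumption, since this page does not state how the lengths are stored on disk.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper, not part of singa::DataShard.
// Appends a 32-bit length in host byte order (an assumption; the
// reference does not state the byte order used on disk).
inline void AppendU32(std::string* out, uint32_t n) {
  char bytes[sizeof(n)];
  std::memcpy(bytes, &n, sizeof(n));
  out->append(bytes, sizeof(n));
}

// Encodes one tuple as [key_len key record_len val].
inline std::string EncodeTuple(const std::string& key,
                               const std::string& val) {
  std::string out;
  AppendU32(&out, static_cast<uint32_t>(key.size()));
  out.append(key);
  AppendU32(&out, static_cast<uint32_t>(val.size()));
  out.append(val);
  return out;
}
```

Under these assumptions, a tuple occupies 8 bytes of length fields plus the key and record bytes.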

When a DataShard object is created, it removes the last key if its record size and key size do not match, which indicates that the last tuple write crashed.
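The recovery idea can be sketched as a scan that finds where the last complete tuple ends, so that a trailing partial write can be discarded. This is an illustration over an in-memory buffer, not the actual DataShard code; LastCompleteTupleEnd and ReadU32 are hypothetical names, and the length fields are assumed to be uint32 values in host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: reads a host-byte-order uint32 at `pos`.
// Returns false if fewer than 4 bytes remain.
inline bool ReadU32(const std::string& data, std::size_t pos, uint32_t* n) {
  if (pos > data.size() || data.size() - pos < sizeof(*n)) return false;
  std::memcpy(n, data.data() + pos, sizeof(*n));
  return true;
}

// Returns the offset just past the last complete
// [key_len key record_len val] tuple in `data`.
inline std::size_t LastCompleteTupleEnd(const std::string& data) {
  std::size_t good = 0;  // end of the last tuple that was fully present
  std::size_t pos = 0;
  while (true) {
    uint32_t key_len = 0, rec_len = 0;
    if (!ReadU32(data, pos, &key_len)) break;
    pos += sizeof(key_len);
    if (data.size() - pos < key_len) break;  // key is truncated
    pos += key_len;
    if (!ReadU32(data, pos, &rec_len)) break;
    pos += sizeof(rec_len);
    if (data.size() - pos < rec_len) break;  // record is truncated
    pos += rec_len;
    good = pos;
  }
  return good;
}
```

Truncating the data to the returned offset before appending would drop a partially written final tuple.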

TODO

  1. split one shard into multiple shards.
  2. add threading to prefetch and parse records

Constructor & Destructor Documentation

singa::DataShard::DataShard ( const std::string &  folder,
int  mode 
)

Initialize the shard object.

Parameters
folder   Shard folder (path excluding shard.dat) on the worker node
mode     Shard open mode: DataShard::kRead, DataShard::kCreate or DataShard::kAppend
bufsize  Number of bytes to batch for every disk operation (read or write); the default is 100MB

Member Function Documentation

int singa::DataShard::Count ( )

Iterate through all tuples to count them.

Returns
number of tuples

void singa::DataShard::Flush ( )

Flush buffered data to disk.

Used only for kCreate or kAppend.

bool singa::DataShard::Insert ( const std::string &  key,
const google::protobuf::Message &  tuple 
)

Append one tuple to the shard.

Parameters
key    e.g., image path
tuple  Record to append, as a protobuf Message
Returns
false if the insert fails, e.g., the key was inserted before

bool singa::DataShard::Insert ( const std::string &  key,
const std::string &  tuple 
)

Append one tuple to the shard.

Parameters
key    e.g., image path
tuple  Record to append, as a string
Returns
false if the insert fails, e.g., the key was inserted before

bool singa::DataShard::Next ( std::string *  key,
google::protobuf::Message *  val 
)

Read the next tuple from the shard.

Parameters
key  Tuple key
val  Record of type Message
Returns
false if the read fails, e.g., the tuple was not inserted completely.

bool singa::DataShard::Next ( std::string *  key,
std::string *  val 
)

Read the next tuple from the shard.

Parameters
key  Tuple key
val  Record of type string
Returns
false if the read fails, e.g., the tuple was not inserted completely.

int singa::DataShard::Next ( std::string *  key)
protected

Read the next key and prepare the buffer for reading the value.

Parameters
key  Output: the next key
Returns
length (i.e., bytes) of the value field.

std::string singa::DataShard::path ( )
inline
Returns
path to the shard file

int singa::DataShard::PrepareForAppend ( const std::string &  path)
protected

Set the disk pointer to the right position for appending, in case the previous write crashed.

Parameters
path  Shard path.
Returns
offset (end position) of the last successfully written record.

bool singa::DataShard::PrepareNextField ( int  size)
protected

Read data from disk if the current data in the buffer is not a full field.

Parameters
size  Size of the next field.

void singa::DataShard::SeekToFirst ( )

Move the read pointer to the head of the shard file.

Used for repeated reading.
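The read-side behaviour described above (Next, Count, SeekToFirst) can be illustrated with a small in-memory stand-in. BufferReader is a hypothetical class written for this sketch, not the singa::DataShard implementation; it assumes the [key_len key record_len val] layout with uint32 lengths in host byte order, and its choice to restore the read position after Count is an assumption of the sketch rather than documented behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical in-memory stand-in for a shard opened for reading.
class BufferReader {
 public:
  explicit BufferReader(std::string data) : data_(std::move(data)) {}

  // Reads the next tuple; returns false at the end of the data or
  // when the remaining bytes do not form a complete tuple.
  bool Next(std::string* key, std::string* val) {
    std::size_t pos = pos_;
    if (!ReadField(&pos, key) || !ReadField(&pos, val)) return false;
    pos_ = pos;  // advance only after a full tuple was read
    return true;
  }

  // Moves the read position back to the first tuple.
  void SeekToFirst() { pos_ = 0; }

  // Counts tuples by scanning from the start, then restores the
  // read position (a choice made for this sketch).
  int Count() {
    const std::size_t saved = pos_;
    pos_ = 0;
    int n = 0;
    std::string key, val;
    while (Next(&key, &val)) ++n;
    pos_ = saved;
    return n;
  }

 private:
  // Reads one [len bytes] field starting at *pos and advances *pos.
  bool ReadField(std::size_t* pos, std::string* out) const {
    uint32_t len = 0;
    if (data_.size() - *pos < sizeof(len)) return false;
    std::memcpy(&len, data_.data() + *pos, sizeof(len));
    *pos += sizeof(len);
    if (data_.size() - *pos < len) return false;
    out->assign(data_, *pos, len);
    *pos += len;
    return true;
  }

  std::string data_;
  std::size_t pos_ = 0;
};
```

A caller would loop on Next until it returns false, then call SeekToFirst to start another pass over the same tuples.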


The documentation for this class was generated from the following file: