Train-One-Batch¶
For each SGD iteration, every worker calls the TrainOneBatch function to compute gradients of the parameters associated with its local layers (i.e., layers dispatched to it). SINGA implements two algorithms for the TrainOneBatch function; users select the one appropriate for their model in the job configuration.
Basic user guide¶
Back-propagation¶
The BP algorithm is used for computing gradients of feed-forward models, e.g., CNN and MLP, and of RNN models in SINGA.
# in job.conf
alg: kBP
To use the BP algorithm for the TrainOneBatch function, users simply configure the alg field with kBP. If a neural net contains user-defined layers, these layers must be implemented properly to be consistent with the implementation of the BP algorithm in SINGA (see below).
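Concretely, such a layer must produce its feature from its source layers in ComputeFeature and fill in both the parameter gradients and the source layers' gradients in ComputeGradient. The sketch below illustrates this contract; the exact base-class signatures and data members vary across SINGA versions and are assumptions here.
// Minimal sketch (not the exact SINGA API) of a user-defined layer that
// is consistent with BP; the method names follow the pseudo code in the
// advanced guide below.
class FooLayer : public Layer {
 public:
  void ComputeFeature(int flag, Metric* perf) override {
    // Forward pass: read the features of the source layers and write
    // this layer's feature into data_, e.g., data_ = f(src.data_, weight_).
  }
  void ComputeGradient(int flag) override {
    // Backward pass: read this layer's grad_, then write gradients into
    // weight_->mutable_grad() and into the source layers' grad_.
  }
};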
Contrastive Divergence¶
The CD algorithm is used for computing gradients of energy models like RBM.
# job.conf
alg: kCD
cd_conf {
  cd_k: 2
}
To use the CD algorithm for the TrainOneBatch function, users just configure the alg field to kCD. Users can also configure the number of Gibbs sampling steps in the CD algorithm through the cd_k field. By default, it is set to 1.
Advanced user guide¶
Implementation of BP¶
The BP algorithm is implemented in SINGA following the pseudo code below,
BPTrainOneBatch(step, net) {
  // forward propagate
  foreach layer in net.local_layers() {
    if IsBridgeDstLayer(layer)
      recv data from the src layer (i.e., BridgeSrcLayer)
    foreach param in layer.params()
      Collect(param)  // recv response from servers for last update
    layer.ComputeFeature(kForward)
    if IsBridgeSrcLayer(layer)
      send layer.data_ to dst layer
  }
  // backward propagate
  foreach layer in reverse(net.local_layers) {
    if IsBridgeSrcLayer(layer)
      recv gradient from the dst layer (i.e., BridgeDstLayer)
      recv response from servers for last update
    layer.ComputeGradient()
    foreach param in layer.params()
      Update(step, param)  // send param.grad_ to servers
    if IsBridgeDstLayer(layer)
      send layer.grad_ to src layer
  }
}
It forwards features through all local layers (which can be checked by layer partition ID and worker ID) and propagates gradients backward in the reverse order. BridgeDstLayer (resp. BridgeSrcLayer) is blocked until the feature (resp. gradient) from the source (resp. destination) layer arrives. Parameter gradients are sent to servers via the Update function. Updated parameters are collected via the Collect function, which is blocked until the parameters have been updated. Param objects have versions, which can be used to check whether a Param object has been updated or not.
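Conceptually, the blocking behaviour of Collect amounts to the version check sketched below; the helper name WaitForServerResponse and the exact Param accessors are hypothetical, used only to illustrate the description above.
// Conceptual sketch only; not the real SINGA API.
void Collect(Param* param, int expected_version) {
  // Block until the server's response brings the local copy up to the
  // version produced by the last Update() call for this Param.
  while (param->version() < expected_version)
    WaitForServerResponse(param);  // hypothetical helper
}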
Since RNN models are unrolled into feed-forward models, users need to implement the forward propagation in the recurrent layer's ComputeFeature function, and the backward propagation in the recurrent layer's ComputeGradient function. As a result, the whole TrainOneBatch runs the back-propagation through time (BPTT) algorithm.
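For illustration, one unrolled copy of a vanilla recurrent layer could look like the hedged sketch below; each copy connects to the input of its time step and to the previous copy's feature, so plain BP over all copies performs BPTT. Signatures and member names are assumptions, as above.
// Sketch of one unrolled time step, h_t = f(x_t, h_{t-1}).
class RecurrentLayer : public Layer {
 public:
  void ComputeFeature(int flag, Metric* perf) override {
    // data_ = tanh(x_t * W_ + h_prev * U_), where x_t and h_prev are the
    // features of the two source layers (input and previous copy).
  }
  void ComputeGradient(int flag) override {
    // Propagate grad_ back through both connections (to x_t and h_{t-1})
    // and accumulate the gradients of W_ and U_ across time steps.
  }
};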
Implementation of CD¶
The CD algorithm is implemented in SINGA following the pseudo code below,
CDTrainOneBatch(step, net) {
  # positive phase
  foreach layer in net.local_layers()
    if IsBridgeDstLayer(layer)
      recv positive phase data from the src layer (i.e., BridgeSrcLayer)
    foreach param in layer.params()
      Collect(param)  // recv response from servers for last update
    layer.ComputeFeature(kPositive)
    if IsBridgeSrcLayer(layer)
      send positive phase data to dst layer

  # negative phase
  foreach gibbs in [0...layer_proto_.cd_k]
    foreach layer in net.local_layers()
      if IsBridgeDstLayer(layer)
        recv negative phase data from the src layer (i.e., BridgeSrcLayer)
      layer.ComputeFeature(kNegative)
      if IsBridgeSrcLayer(layer)
        send negative phase data to dst layer

  foreach layer in net.local_layers()
    layer.ComputeGradient()
    foreach param in layer.params()
      Update(param)
}
Parameter gradients are computed after the positive phase and negative phase.
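For a standard RBM this realizes the usual CD-k approximation of the log-likelihood gradient; for the weight matrix $W$ it reads

$$\Delta W \propto \langle v h^\top \rangle_{\mathrm{data}} - \langle v h^\top \rangle_{k},$$

where the first term is estimated from the positive phase and the second from the samples obtained after k Gibbs steps of the negative phase.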
Implementing a new algorithm¶
SINGA implements BP and CD by creating two subclasses of the Worker class: BPWorker's TrainOneBatch function implements the BP algorithm; CDWorker's TrainOneBatch function implements the CD algorithm. To implement a new algorithm for the TrainOneBatch function, users need to create a new subclass of Worker, e.g.,
class FooWorker : public Worker {
void TrainOneBatch(int step, shared_ptr<NeuralNet> net, Metric* perf) override;
void TestOneBatch(int step, Phase phase, shared_ptr<NeuralNet> net, Metric* perf) override;
};
FooWorker must implement the above two functions for training one mini-batch and testing one mini-batch. The perf argument is for collecting training or testing performance, e.g., the objective loss or accuracy. It is passed to the ComputeFeature function of each layer.
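A hedged skeleton of such an implementation is sketched below for a plain feed-forward pass; the NeuralNet/Layer accessors and phase constants follow the pseudo code above and are assumptions about the exact API.
// Sketch only; accessor names (layers(), GetParams()) are assumptions.
void FooWorker::TrainOneBatch(int step, shared_ptr<NeuralNet> net,
                              Metric* perf) {
  const auto& layers = net->layers();
  for (auto* layer : layers) {               // forward pass
    for (auto* param : layer->GetParams())
      Collect(step, param);                  // wait for updated params
    layer->ComputeFeature(kForward, perf);   // perf accumulates loss etc.
  }
  for (auto it = layers.rbegin(); it != layers.rend(); ++it) {
    (*it)->ComputeGradient(kBackward);       // backward pass
    for (auto* param : (*it)->GetParams())
      Update(step, param);                   // push gradients to servers
  }
}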
Users can define configuration fields for FooWorker, e.g.,
# in user.proto
message FooWorkerProto {
  optional int32 b = 1;
}

extend JobProto {
  optional FooWorkerProto foo_conf = 101;
}

# in job.proto
message JobProto {
  ...
  extensions 101 to max;
}
This is similar to adding configuration fields for a new layer.
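Inside FooWorker, the extension can then be read with the standard proto2 extension API; how the worker obtains its parsed JobProto (the job_conf() accessor below) is an assumption.
// Proto2 extension lookup; foo_conf is the extension declared above.
int b = job_conf().GetExtension(foo_conf).b();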
To use FooWorker, users need to register it in main.cc and configure the alg and foo_conf fields,
# in main.cc
const int kFoo = 3;  // worker ID, must be different from those of CDWorker and BPWorker
driver.RegisterWorker<FooWorker>(kFoo);

# in job.conf
...
alg: 3
[foo_conf] {
  b: 4
}