Hybrid Parallelism
User Guide
SINGA supports different parallelism options for distributed training. Users select the option in the job configuration.
Both NetProto and LayerProto have a partition_dim field that controls the parallelism option:
partition_dim=0: the neuralnet/layer is partitioned on the data dimension, i.e., each worker processes a subset of the data records.
partition_dim=1: the neuralnet/layer is partitioned on the feature dimension, i.e., each worker maintains a subset of the feature parameters.
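To make the two options concrete, here is a minimal Python sketch that splits a small two-dimensional blob across workers along either dimension. It is illustrative only: the `partition` helper and the list-of-lists blob are assumptions made for this example and are not part of SINGA's API.

```python
# Illustrative only: a "blob" is modeled as a list of records (rows),
# each record being a list of feature values (columns).

def partition(blob, num_workers, partition_dim):
    """Split a 2-D blob into num_workers contiguous chunks.

    partition_dim=0 splits along records (data dimension);
    partition_dim=1 splits along features (feature dimension).
    Assumes the partitioned dimension is divisible by num_workers.
    """
    if partition_dim == 0:
        step = len(blob) // num_workers
        return [blob[i * step:(i + 1) * step] for i in range(num_workers)]
    if partition_dim == 1:
        step = len(blob[0]) // num_workers
        return [[row[i * step:(i + 1) * step] for row in blob]
                for i in range(num_workers)]
    raise ValueError("partition_dim must be 0 or 1")


# A mini-batch of 4 records with 6 features each.
batch = [[r * 10 + c for c in range(6)] for r in range(4)]

by_data = partition(batch, 2, 0)     # each worker: 2 records x 6 features
by_feature = partition(batch, 2, 1)  # each worker: 4 records x 3 features
```

With two workers, partitioning on dimension 0 gives each worker half of the records with all features, while partitioning on dimension 1 gives each worker all records with half of the features.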
The partition_dim field in NetProto is applied to all layers, unless a layer has its own partition_dim field set.
For data parallelism across the whole model, leave partition_dim at its default (which is 0), or configure job.conf like this:
neuralnet {
  partition_dim: 0
  layer {
    name: ...
    type: ...
  }
  ...
}
With hybrid parallelism, each layer can be partitioned on either the data dimension or the feature dimension. For example, to partition one specific layer on the feature dimension, set partition_dim on that layer:
neuralnet {
  partition_dim: 0
  layer {
    name: "layer1_partition_on_data_dimension"
    type: ...
  }
  layer {
    name: "layer2_partition_on_feature_dimension"
    type: ...
    partition_dim: 1
  }
  ...
}
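The precedence rule described above (a layer's own partition_dim overrides the net-level value, which defaults to 0) can be sketched in a few lines of Python. The dictionary layout and the `effective_partition_dims` helper are hypothetical stand-ins for a parsed configuration. They do not reflect how SINGA represents NetProto internally.

```python
DEFAULT_PARTITION_DIM = 0  # the guide states that the default is 0


def effective_partition_dims(net):
    """Return {layer name: partition_dim} after applying the net-level value.

    `net` is a plain dict standing in for a parsed neuralnet block;
    it is not SINGA's NetProto class.
    """
    net_dim = net.get("partition_dim", DEFAULT_PARTITION_DIM)
    return {layer["name"]: layer.get("partition_dim", net_dim)
            for layer in net["layer"]}


# Mirrors the hybrid configuration shown above.
net = {
    "partition_dim": 0,
    "layer": [
        {"name": "layer1_partition_on_data_dimension"},
        {"name": "layer2_partition_on_feature_dimension", "partition_dim": 1},
    ],
}

dims = effective_partition_dims(net)
```

Here the first layer inherits 0 from the net, and the second keeps its own value of 1.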
Developer Guide
To support hybrid parallelism, after SINGA reads the user's model and partition configuration, a set of connection layers is automatically added between layers when needed:
BridgeSrcLayer & BridgeDstLayer are added when two connected layers are not on the same machine. They are paired and are responsible for sending data/gradients to the other side during each iteration.
ConcateLayer is added when there are multiple source layers. It combines their feature blobs along a given dimension.
SliceLayer is added when there are multiple dest layers, each of which needs only a subset (slice) of this layer's feature blob.
SplitLayer is added when there are multiple dest layers, each of which needs the whole feature blob.
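The following Python sketch illustrates what slicing, concatenating, and splitting a feature blob mean, using plain lists in place of blobs. The helper names are hypothetical, and the repartitioning example at the end is one reading of how a Slice followed by a Concate could move data from a feature-partitioned source to a data-partitioned destination. It is not taken from SINGA's source.

```python
def slice_blob(blob, n, dim):
    """Cut a 2-D blob into n equal contiguous pieces along dim (0 or 1)."""
    if dim == 0:
        step = len(blob) // n
        return [blob[i * step:(i + 1) * step] for i in range(n)]
    step = len(blob[0]) // n
    return [[row[i * step:(i + 1) * step] for row in blob] for i in range(n)]


def concate(blobs, dim):
    """Join blobs along dim: stack rows (dim 0) or join rows side by side (dim 1)."""
    if dim == 0:
        return [row for blob in blobs for row in blob]
    return [[x for part in rows for x in part] for rows in zip(*blobs)]


def split(blob, n):
    """Give each of n destinations its own copy of the whole blob."""
    return [[row[:] for row in blob] for _ in range(n)]


full = [[r * 10 + c for c in range(4)] for r in range(4)]

# Source layer partitioned on the feature dimension: two 4x2 pieces.
src_parts = slice_blob(full, 2, 1)

# Each source piece is sliced along the data dimension; destination k
# concatenates the k-th slices from all sources along the feature dimension.
dst_parts = [
    concate([slice_blob(part, 2, 0)[k] for part in src_parts], 1)
    for k in range(2)
]
```

In this sketch each destination partition ends up with half of the records and all of the features, which is the same result as partitioning the full blob on the data dimension directly.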
The following logic is used in the code to add connection layers:
Add Slice, Concate, Split Layers for Hybrid Partition
All cases are as follows:
src_pdim | dst_pdim | connection_type | Action
0 | 0 | OneToOne | Direct Connection
1 | 1 | OneToOne | Direct Connection
0 | 0 | OneToAll | Direct Connection
1 | 0 | OneToOne | Slice -> Concate
0 | 1 | OneToOne | Slice -> Concate
1 | 0 | OneToAll | Slice -> Concate
0 | 1 | OneToAll | Split -> Concate
1 | 1 | OneToAll | Split -> Concate
Logic:
dst_pdim = 1 && OneToAll ?
  (YES) Split -> Concate
  (NO) src_pdim = dst_pdim ?
    (YES) Direct Connection
    (NO) Slice -> Concate
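As a cross-check, the decision logic above can be written as a small Python function and compared against every row of the table. The function name `connection_action` and the string labels are chosen for this example. SINGA's actual implementation is in C++ and may be structured differently.

```python
def connection_action(src_pdim, dst_pdim, connection_type):
    """Return the action for one source/destination layer pair."""
    if dst_pdim == 1 and connection_type == "OneToAll":
        return "Split -> Concate"
    if src_pdim == dst_pdim:
        return "Direct Connection"
    return "Slice -> Concate"


# The eight cases from the table, keyed by (src_pdim, dst_pdim, connection_type).
TABLE = {
    (0, 0, "OneToOne"): "Direct Connection",
    (1, 1, "OneToOne"): "Direct Connection",
    (0, 0, "OneToAll"): "Direct Connection",
    (1, 0, "OneToOne"): "Slice -> Concate",
    (0, 1, "OneToOne"): "Slice -> Concate",
    (1, 0, "OneToAll"): "Slice -> Concate",
    (0, 1, "OneToAll"): "Split -> Concate",
    (1, 1, "OneToAll"): "Split -> Concate",
}

# Collect any case where the function disagrees with the table.
mismatches = [case for case, action in TABLE.items()
              if connection_action(*case) != action]
```

The list of mismatches is empty, so the three-branch logic covers all eight cases in the table.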