CheckPoint¶
SINGA checkpoints model parameters onto disk periodically, at a user-configured frequency. By checkpointing model parameters, we can
- resume training from the last checkpoint. For example, if the program crashes before finishing all training steps, we can continue the training from the checkpoint files.
- use them to initialize a similar model. For example, the parameters learned from training an RBM model can be used to initialize a deep auto-encoder model.
Configuration¶
Checkpointing is controlled by two configuration fields:
- checkpoint_after: start checkpointing after this number of training steps.
- checkpoint_freq: frequency of checkpointing, i.e., a snapshot is taken every this number of training steps.
For example,
# job.conf
checkpoint_after: 100
checkpoint_freq: 300
...
Checkpoint files are located at WORKSPACE/checkpoint/stepSTEP-workerWORKERID, where WORKSPACE is configured via
cluster {
  workspace:
}
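For instance, the workspace could be set as follows (the path below is a hypothetical example):
# job.conf
cluster {
  workspace: "/home/user/singa/workspace"
}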
For the above configuration, after training for 700 steps there would be two checkpoint files, written at step 400 (= 100 + 300) and step 700 (= 400 + 300):
step400-worker0
step700-worker0
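For example, with WORKSPACE standing for the configured workspace path, the snapshots can be listed with:
ls WORKSPACE/checkpoint/
# step400-worker0  step700-worker0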
Application - resuming training¶
We can resume the training from the last checkpoint (i.e., step 700) by running,
./bin/singa-run.sh -conf JOB_CONF -resume
There is no change to the job configuration.
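For instance, if the job was originally launched with a configuration file at examples/cifar10/job.conf (a hypothetical path used here for illustration), the resume command would be:
./bin/singa-run.sh -conf examples/cifar10/job.conf -resume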
Application - model initialization¶
We can also use the checkpoint file from step 400 to initialize a new model by configuring the new job as,
# job.conf
checkpoint: "WORKSPACE/checkpoint/step400-worker0"
...
If there are multiple checkpoint files for the same snapshot due to model partitioning, all of them should be listed,
# job.conf
checkpoint: "WORKSPACE/checkpoint/step400-worker0"
checkpoint: "WORKSPACE/checkpoint/step400-worker1"
...
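The checkpoint field can be combined with the checkpointing fields described above, so that the new job both starts from the old parameters and writes its own snapshots. A minimal sketch, reusing the paths and values from the earlier examples:
# job.conf
checkpoint: "WORKSPACE/checkpoint/step400-worker0"
checkpoint: "WORKSPACE/checkpoint/step400-worker1"
checkpoint_after: 100
checkpoint_freq: 300
...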
The training command is the same as for starting a new job,
./bin/singa-run.sh -conf JOB_CONF