Apache SINGA
A distributed deep learning platform.
|
Every running process has a training object which launches one or more worker (and server) threads. More...
#include <trainer.h>
Public Member Functions | |
void | Start (bool resume, const SingaProto &singaConf, JobProto *jobConf) |
Entrance function which constructs the workers and servers, and launches one thread per worker/server. More... | |
Protected Member Functions | |
void | Resume (JobProto *jobConf) |
Set the checkpoint field of the model configuration to resume training. More... | |
vector< Server * > | CreateServers (int nthread, const JobProto &jobConf) |
Create server instances. More... | |
vector< Worker * > | CreateWorkers (int nthread, const JobProto &jobConf) |
Create worker instances. More... | |
void | SetupWorkerServer (const JobProto &jobConf, const vector< Worker * > &workers, const vector< Server * > &servers) |
Setup workers and servers. More... | |
void | Run (const vector< Worker * > &workers, const vector< Server * > &servers) |
void | DisplayMetric (Msg **msg) |
Display metrics to log (standard output) | |
Dealer * | CreateInterProcsDealer (int dst_procs) |
Create a socket to send msg to the specified process. More... | |
void | HandleLocalMsg (std::queue< Msg * > *msg_queue, Msg **msg) |
Handle messages to local servers and local stub. | |
const vector< Msg * > | HandleGet (ParamEntry *entry, Msg **msg) |
Generate a request message to Get the parameter object. | |
void | HandleGetResponse (ParamEntry *entry, Msg **msg) |
const vector< Msg * > | HandleUpdate (ParamEntry *entry, Msg **msg) |
Generate a request message to Update the parameter object. | |
void | HandleUpdateResponse (ParamEntry *entry, Msg **msg) |
const vector< Msg * > | HandlePut (ParamEntry *entry, Msg **msg) |
Generate a request message to Put the parameter object. | |
void | GenMsgs (int type, int version, ParamEntry *entry, Msg *msg, vector< Msg * > *ret) |
Called by HandlePut, HandleUpdate and HandleGet functions. More... | |
int | Hash (int grp_id, int param_id) |
Get a hash id for a Param object from a group. More... | |
Protected Attributes | |
int | procs_id_ |
Router * | router_ |
std::unordered_map< int, ParamEntry * > | worker_shard_ |
vector< int > | slice2server_ |
map from slice to the server that updates it | |
Every running process has a training object which launches one or more worker (and server) threads.
The main thread runs a loop to forward messages between workers and servers.
|
protected |
Create a socket to send msg to the specified process.
dst_procs | the destination process's (logical) ID |
|
protected |
Create server instances.
nthread | total number of threads in the current process, used to assign each thread a local thread ID. The number of servers is extracted from Cluster |
jobConf |
|
protected |
Create worker instances.
nthread | total number of threads in the current process, used to assign each thread a local thread ID. The number of workers is extracted from Cluster |
jobConf |
|
protected |
Called by HandlePut, HandleUpdate and HandleGet functions.
type | message type |
version | param version |
entry | |
msg | |
ret | generated messages |
|
inlineprotected |
Get a hash id for a Param object from a group.
Simply multiply the group ID by a large prime number, 997 (assuming there are no more than 997 worker groups), and add the owner param ID.
|
protected |
Set the checkpoint field of the model configuration to resume training.
The checkpoint folder is searched for the files of the latest checkpoint, which are added to the checkpoint field. The workers then load the parameter values from these checkpoint files.
jobConf | job configuration |
|
protected |
Setup workers and servers.
For each worker, create and assign a neuralnet to it. For each server, create and assign the param shard to it. Create the partition map from slice ID to server.
modelConf | |
workers | |
servers |
void singa::Trainer::Start ( bool resume, const SingaProto & singaConf, JobProto * jobConf )
Entrance function which constructs the workers and servers, and launches one thread per worker/server.
resume | if true, resume training from the latest checkpoint files |
singaConf | global singa configuration, including zookeeper settings |
jobConf | job configuration, including cluster and model configuration |