0x00 Summary

Paracel is a distributed computing framework developed by Douban. It is based on the parameter server paradigm and solves machine learning problems such as logistic regression, SVD, matrix factorization (BFGS, SGD, ALS, CG), LDA, and Lasso.

Paracel supports data and model parallelism, providing users with an easy-to-use communication interface that is more flexible than MapReduce-style systems. Paracel also supports an asynchronous training mode that lets iterative problems converge faster. In addition, Paracel programs are structured like serial programs, allowing users to focus on the algorithm itself rather than on distributed logic.

Since PS-Lite doesn’t go deep into SSP, whereas Paracel does, we’ll take a look at how SSP is implemented in this article.

The other articles in this series are:

Machine learning parameter server PS-Lite (1) —– PostOffice

Machine learning parameter server PS-Lite (2) —– communication module Van

Machine learning parameter server PS-Lite (3) —– agent Customer

Machine learning parameter server PS-Lite (4) —– Application node implementation

Machine learning parameter server Paracel (1)—– overall architecture

Some non-essential code has been removed from the listings in this article.

0x01 Background information

When different workers perform parallel operations at the same time, the progress of different workers may be different due to external reasons such as network and machine configuration. How to control the synchronization mechanism of workers is an important topic.

1.1 Asynchronous control protocol

Many machine learning problems can be transformed into iterative tasks. For iteration control there are, in general, three levels of asynchronous control protocol: Bulk Synchronous Parallel (BSP), Staleness Synchronous Parallel (SSP), and Asynchronous Parallel (ASP); their synchronization restrictions are relaxed in turn. To pursue faster computation, an algorithm can choose a looser synchronization protocol.

For a better explanation and for completeness, we repeat here the paragraphs that introduced these protocols in the PS-Lite articles.

The three protocols are as follows:

  • ASP: tasks do not need to wait for each other at all; regardless of the order among workers, each worker moves at its own pace, updating after each iteration, and the tasks that finish first continue to the next round of training.

    • Advantages: eliminates the time spent waiting for slow tasks and reduces hardware idle time, improving hardware efficiency compared with BSP. Computation is fast, making maximum use of the cluster's computing power, and no worker's machine ever sits waiting for another.

    • Disadvantages:

      • This process can cause gradients to be calculated with outdated weights, thereby reducing statistical efficiency.
      • Poor applicability; in some cases convergence cannot be guaranteed

  • BSP: the synchronization protocol used in general distributed computing. Every worker must run in the same iteration; only when all workers have completed the current iteration do the synchronization and shard updates between workers and servers take place.

    • The BSP mode differs from the single-machine serial mode only by batch size, so the convergence of the model is exactly the same. At the same time, because each worker can perform parallel computation within a cycle, it has certain parallel capability. This is what Spark does.

    • Advantages: wide range of application; the convergence quality of each iteration is high

    • Disadvantages: in every iteration, BSP requires each worker to wait for the gradients from the other workers, i.e. to wait for the slowest task. This significantly reduces hardware efficiency and lengthens the overall computation: the performance of the whole worker group is determined by its slowest worker, commonly referred to as the straggler.

  • SSP: a certain degree of task progress inconsistency is allowed, but the inconsistency has a ceiling called the staleness value; that is, the fastest task may lead the slowest task by at most staleness iterations.

    • It is a compromise between ASP and BSP. ASP allows the iteration gap between different workers to be arbitrarily large, while BSP forces it to be 0; SSP takes a constant s in between. With SSP, BSP can be obtained by specifying s = 0, and ASP by specifying s = infinity (see the sketch after this list).

    • Advantages: Reduces the waiting time between tasks to a certain extent, and the calculation speed is fast.

    • Disadvantages: The convergence quality of each iteration is not as good as that of BSP, more iterations may be needed to achieve the same convergence effect, and the applicability is not as good as that of BSP, so some algorithms are not applicable.
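To make the relationship among the three protocols concrete, here is a minimal sketch of the proceed-or-wait decision that unifies them. This is illustrative only, not Paracel code; `can_proceed` and its arguments are hypothetical names:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper (not Paracel code): may this worker start its next
// iteration under staleness bound s?
//   s = 0            -> BSP: all workers must be in the same iteration.
//   s = k            -> SSP: the fastest may lead the slowest by at most k.
//   s = a huge value -> ASP: effectively never wait.
// all_clocks holds every worker's current iteration count (non-empty).
bool can_proceed(int my_clock, const std::vector<int>& all_clocks, int s) {
  int slowest = *std::min_element(all_clocks.begin(), all_clocks.end());
  return my_clock - slowest <= s;
}
```

Setting s = 0 reproduces BSP, a finite positive s gives SSP, and an effectively infinite s gives ASP.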

1.2 Straggler problem

The traditional approach is to use BSP to complete the iteration, which means synchronizing at the end of every iteration. This leads to the straggler problem: because of hardware and software differences, nodes often have different computing capabilities. For an iterative problem, at the end of each round the fast nodes have to wait for the slow nodes to finish before proceeding to the next iteration. This waiting becomes particularly noticeable as the number of nodes increases, slowing overall performance.

There are two ways to solve this problem:

  • First, write some complex load-balancing code so that a fast worker trains on more data.
  • Second, we can do some asynchronous controls to relax the synchronization condition.

Paracel uses the second method, which relaxes the synchronization condition by relaxing the “wait at every iteration step” constraint:

It is assumed that the gap between the fastest worker and the slowest worker does not exceed a bounded parameter s; this is a compromise between per-iteration convergence and total convergence time. At the end of an iteration, a fast node can continue to the next iteration, but it cannot be more than s iterations ahead of the slowest node; Paracel forces a wait only when the lead exceeds s iterations.

This asynchronous mode of control not only saves the waiting time on the whole, but also indirectly helps slow nodes to catch up. From the point of view of optimization problem, although the single iteration step converges slowly, the time cost of each iteration step is reduced, and the overall convergence is faster.

This approach is known as Staleness Synchronous Parallel (SSP), the basic idea being to allow each machine to update the model at different paces, but with a restriction so that the progress of the fastest machine is not too different from that of the slowest machine. This has the advantage of reducing the drag of the slow machine on the whole system and ensuring the final convergence of the model.
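Formally, if clock_i denotes worker i's current iteration count, SSP maintains the invariant max_i clock_i - min_j clock_j <= s at all times; BSP and ASP are the special cases s = 0 and s = infinity.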

0x02 Implementation

Let's first recall the overall structure from the previous article.

2.1 ssp_switch

The ssp_switch flag controls whether SSP is used.

Take paracel_read in include/ps.hpp as an example.

If SSP is enabled, then:

  • If the clock is 0 or equals total_iters, SSP is either just starting or the total number of iterations has been reached, so the value must be fetched from the server again and the local cache updated.

  • If the cache is hit, it is returned directly.

  • On a cache miss, if the local clock is too far ahead (stale_cache + limit_s < clock), a while loop waits.

    • That is, a fast node may continue to the next iteration, but it cannot be more than s iterations ahead of the slowest node. When the lead exceeds s iterations, Paracel forces a wait, repeatedly calling pull_int(paracel::str_type("server_clock")) to refresh the server clock. Recall the SSP core idea: a certain degree of task inconsistency is allowed, but the inconsistency has a ceiling called the staleness value, i.e. the fastest task may lead the slowest by at most staleness iterations.
    • "server_clock" is dedicated to SSP clock coordination. It is the server clock; a worker reads this number to see whether it is behind or ahead.
    • stale_cache is initialized to 0 and, on every pass of the forced-wait loop, is set to the value returned for "server_clock".

Where the cache is defined:

```cpp
paracel::dict_type<paracel::str_type, boost::any> cached_para;
```
And paracel_read itself:

```cpp
template <class V>
bool paracel_read(const paracel::str_type & key,
                  V & val,
                  int replica_id = -1) {
  if(ssp_switch) {
    if(clock == 0 || clock == total_iters) {
      // check total_iters for last pull
      // SSP is starting, or the interval (number of iterations) is due:
      // fetch the value from the server again and refresh the local cache
      cached_para[key] = boost::any_cast<V>(ps_obj->
          kvm[ps_obj->p_ring->get_server(key)].pull<V>(key));
      val = boost::any_cast<V>(cached_para[key]);
    } else if(stale_cache + limit_s > clock) {
      // cache hit: return the locally cached value directly
      val = boost::any_cast<V>(cached_para[key]);
    } else {
      // cache miss
      // the local clock is too far ahead:
      // pull from server until leading by less than limit_s clocks
      while(stale_cache + limit_s < clock) {
        // clock synchronization: refresh the server clock
        stale_cache = ps_obj->
            kvm[clock_server].pull_int(paracel::str_type("server_clock"));
      }
      cached_para[key] = boost::any_cast<V>(ps_obj->
          kvm[ps_obj->p_ring->get_server(key)].pull<V>(key));
      val = boost::any_cast<V>(cached_para[key]);
    }
    return true;
  }
  return ps_obj->kvm[ps_obj->p_ring->get_server(key)].pull(key, val);
}
```
The client-side pull_int that it calls:

```cpp
int pull_int(const paracel::str_type & key) {
  if(p_ssp_sock == nullptr) {
    p_ssp_sock.reset(create_req_sock(ports_lst[4]));
  }
  auto scrip = paste(paracel::str_type("pull_int"), key);
  int val = -1;
  bool r = req_send_recv(*p_ssp_sock, scrip, val);
  if(!r) ERROR_ABORT("key: pull_int does not exist");
  return val;
}
```
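As a concrete (made-up) example: with limit_s = 3, a worker whose local clock is 10 and whose stale_cache is 6 satisfies 6 + 3 < 10, so it blocks in the while loop, re-reading "server_clock"; once the returned value reaches 7, the condition 7 + 3 < 10 no longer holds, the loop exits, and the worker pulls the fresh parameter value.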

2.2 thrd_exec_ssp

On the server side, requests such as pull_int are handled by a dedicated thread.

In include/server.hpp, thrd_exec_ssp is that thread, dedicated to handling SSP.

The ssp_tbl it uses is defined in include/kv_def.hpp:

```cpp
namespace paracel {
paracel::kvs<paracel::str_type, int> ssp_tbl;                   // SSP-dedicated KV store
paracel::kvs<paracel::str_type, paracel::str_type> tbl_store;   // parameter KV store
}
```

The code for thrd_exec_ssp is as follows:

```cpp
// thread entry for ssp
void thrd_exec_ssp(zmq::socket_t & sock) {
  paracel::packer<> pk;
  paracel::ssp_tbl.set("server_clock", 0);
  while(1) {
    zmq::message_t s;
    sock.recv(&s);
    auto scrip = paracel::str_type(static_cast<const char *>(s.data()), s.size());
    auto msg = paracel::str_split_by_word(scrip, paracel::seperator);
    auto indicator = pk.unpack(msg[0]);
    if(indicator == "push_int") {
      auto key = pk.unpack(msg[1]);
      paracel::packer<int> pk_i;
      auto val = pk_i.unpack(msg[2]);
      paracel::ssp_tbl.set(key, val);
      bool result = true;
      rep_pack_send(sock, result);
    }
    if(indicator == "incr_int") {
      auto key = pk.unpack(msg[1]);
      if(paracel::startswith(key, "client_clock_")) {
        if(paracel::ssp_tbl.get(key)) {
          paracel::ssp_tbl.incr(key, 1);
        } else {
          paracel::ssp_tbl.set(key, 1);
        }
        if(paracel::ssp_tbl.get(key) >= paracel::ssp_tbl.get("worker_sz")) {
          paracel::ssp_tbl.incr("server_clock", 1);
          paracel::ssp_tbl.set(key, 0);
        }
      }
      paracel::packer<int> pk_i;
      int delta = pk_i.unpack(msg[2]);
      paracel::ssp_tbl.incr(key, delta);
      bool result = true;
      rep_pack_send(sock, result);
    }
    if(indicator == "pull_int") {
      auto key = pk.unpack(msg[1]);
      int result = 0;
      auto exist = paracel::ssp_tbl.get(key, result); // look up the key
      if(!exist) {
        paracel::str_type tmp = "nokey";
        rep_send(sock, tmp);
      }
      rep_pack_send(sock, result);
    }
  } // while
}
```

Take pull_int as an example: it pulls the value for a key from the SSP-dedicated KV store ssp_tbl on the server.

The logic is as follows (note that, due to space constraints, some variables from the figure above are omitted and new variables and logic are added):

(Figure: on the worker side, paralg holds a parasrv *ps_obj; parasrv holds the kvclt clients in kvm — each with host, ports_lst, context, conn_prefix and p_ssp_sock — and the ring p_ring (srv_hashring, srv_hashring_dct) that maps keys to servers. On the server side, start_server launches thrd_exec and thrd_exec_ssp, which operate on tbl_store and ssp_tbl respectively.)
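To make the server-side bookkeeping concrete, here is a minimal self-contained sketch of the semantics of ssp_tbl and of the incr_int handling of client_clock_ keys, per the thrd_exec_ssp code above. toy_ssp_tbl and on_incr_client_clock are hypothetical names for illustration, not Paracel's kvs implementation:

```cpp
#include <map>
#include <string>

// Toy stand-in for paracel::ssp_tbl: an int-valued KV table with the three
// operations the SSP thread needs (illustrative, not paracel::kvs).
struct toy_ssp_tbl {
  std::map<std::string, int> m;
  void set(const std::string& k, int v) { m[k] = v; }
  bool get(const std::string& k, int& out) {
    auto it = m.find(k);
    if(it == m.end()) return false;
    out = it->second;
    return true;
  }
  void incr(const std::string& k, int delta) { m[k] += delta; }
};

// What the "incr_int" branch does for a "client_clock_X" key, assuming
// "worker_sz" was initialized at startup (as in paralg's constructor).
void on_incr_client_clock(toy_ssp_tbl& tbl, const std::string& key) {
  int v = 0;
  if(tbl.get(key, v)) tbl.incr(key, 1);  // another worker finished this step
  else tbl.set(key, 1);                  // first worker to finish this step
  int cnt = 0, wsz = 0;
  tbl.get(key, cnt);
  tbl.get("worker_sz", wsz);
  if(cnt >= wsz) {                       // all live workers have finished it
    tbl.incr("server_clock", 1);         // the global clock advances one step
    tbl.set(key, 0);                     // reuse the slot for a later step
  }
}
```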

2.3 conversion

The user can turn the BSP process into an asynchronous process by adding just a few lines of code. Take a very simple example.

The main thing is to use iter_commit() to commit local updates to the parameter server at the end of each iteration.

```cpp
class logistic_regression : public paracel::paralg {
 public:
  logistic_regression(paracel::Comm comm,
                      std::string hosts_dct_str,
                      std::string _output,
                      int _rounds,
                      int _limit_s,
                      bool _ssp_switch)
      : paracel::paralg(hosts_dct_str, comm, _output,
                        _rounds, _limit_s, _ssp_switch) {}

  void training() {
    theta = paracel::random_double_list(data_dim);
    paracel_write("theta", theta); // init push
    for(int iter = 0; iter < rounds; ++iter) {
      for(int i = 0; i < data_dim; ++i) {
        delta[i] = 0.;
      }
      random_shuffle(idx.begin(), idx.end());
      // pull theta
      theta = paracel_read<vector<double> >("theta");
      for(auto sample_id : idx) { // traverse
        for(int i = 0; i < data_dim; ++i) {
          delta[i] += coff1 * samples[sample_id][i] - coff2 * theta[i];
        }
      }
      // update theta with delta
      paracel_bupdate("theta", delta, "update.so", "lg_theta_update");
      // commit to server at the end of each iteration
      iter_commit();
    }
    // last pull
    theta = paracel_read<vector<double> >("theta");
  }

  void solve() {
    // init training data
    auto parser = [](const std::vector<std::string>) { /* ... */ };
    auto lines = paracel_load(input);
    parser(lines);
    paracel_sync();
    // set total iterations of your training process
    set_total_iters(rounds);
    // training
    training();
  }
}; // class logistic_regression
```
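Note how little SSP-specific code there is: compared with a plain BSP version, the changes are just the _limit_s and _ssp_switch constructor arguments, the set_total_iters(rounds) call, and the iter_commit() at the end of each round.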
2.4 Logical series

Each of the previous pieces was explained in isolation; this section connects them.

We assume that there are 5 workers and that limit_s is 3, i.e. the fastest node cannot be more than 3 iterations ahead of the slowest node. When it is more than 3 iterations ahead, Paracel forces a wait.

2.4.1 Initialization

In the worker's paralg constructor, various data are initialized. What matters here is that the value of the server key "worker_sz" is set to worker_comm.get_size(), which is the number of workers, 5.

"worker_sz" means: how many workers should currently be training together.

```cpp
paralg(paracel::str_type hosts_dct_str,
       paracel::Comm comm,
       paracel::str_type _output = "",
       int _rounds = 1,
       int _limit_s = 0,
       bool _ssp_switch = false)
    : worker_comm(comm),
      output(_output),
      nworker(comm.get_size()),
      rounds(_rounds),
      limit_s(_limit_s),
      ssp_switch(_ssp_switch) {
  ps_obj = new parasrv(hosts_dct_str);
  init_output(_output);
  clock = 0;
  stale_cache = 0;
  clock_server = 0;
  total_iters = rounds;
  if(worker_comm.get_rank() == 0) {
    paracel::str_type key = "worker_sz";
    (ps_obj->kvm[clock_server]).push_int(key, worker_comm.get_size());
  }
  paracel_sync();
}
```

2.4.2 Worker end iter_commit

Within iter_commit, the logic is as follows:

  • iter_commit increments the local clock on each iteration;
  • if (clock == total_iters), this worker has reached its overall iteration count, so it decrements the server's "worker_sz" value. That is: this worker has finished training, so the number of workers still training together is reduced by 1.

```cpp
// put where you want to control iter with ssp
void iter_commit() {
  paracel::str_type clock_key;
  if(limit_s == 0) {
    clock_key = "client_clock_0";
  } else {
    clock_key = "client_clock_" + std::to_string(clock % limit_s);
  }
  ps_obj->kvm[clock_server].incr_int(paracel::str_type(clock_key), 1); // value 1 is not important
  clock += 1;
  if(clock == total_iters) {
    // this worker has reached its total iteration count,
    // so the number of workers training together is reduced by 1
    ps_obj->kvm[clock_server].incr_int(paracel::str_type("worker_sz"), -1);
  }
}
```

kvclt contains the following code, which simply forwards the request to the server, so we can skip over it:

```cpp
bool incr_int(const paracel::str_type & key, int delta) {
  if(p_ssp_sock == nullptr) {
    p_ssp_sock.reset(create_req_sock(ports_lst[4]));
  }
  auto scrip = paste(paracel::str_type("incr_int"), key, delta);
  bool stat;
  auto r = req_send_recv(*p_ssp_sock, scrip, stat);
  return r && stat;
}

int pull_int(const paracel::str_type & key) {
  if(p_ssp_sock == nullptr) {
    p_ssp_sock.reset(create_req_sock(ports_lst[4]));
  }
  auto scrip = paste(paracel::str_type("pull_int"), key);
  int val = -1;
  bool r = req_send_recv(*p_ssp_sock, scrip, val);
  assert(val != -1);
  assert(r);
  if(!r) ERROR_ABORT("key: pull_int does not exist");
  return val;
}
```

2.4.3 Server incr_int

The server receives the request forwarded by kvclt. In thrd_exec_ssp, the incr_int part of the code looks like this:

```cpp
if(indicator == "incr_int") {
  auto key = pk.unpack(msg[1]);
  if(paracel::startswith(key, "client_clock_")) {
    if(paracel::ssp_tbl.get(key)) {
      paracel::ssp_tbl.incr(key, 1);   // another worker finished this step
    } else {
      paracel::ssp_tbl.set(key, 1);    // first worker to finish this step
    }
    if(paracel::ssp_tbl.get(key) >= paracel::ssp_tbl.get("worker_sz")) {
      paracel::ssp_tbl.incr("server_clock", 1); // all workers done: advance
      paracel::ssp_tbl.set(key, 0);             // reset for the next round
    }
  }
  paracel::packer<int> pk_i;
  int delta = pk_i.unpack(msg[2]);
  paracel::ssp_tbl.incr(key, delta);
  bool result = true;
  rep_pack_send(sock, result);
}
```

That is:

  • If the key starts with "client_clock_", then:

    • the corresponding key is incremented, or created with value 1 if it does not exist yet;

    • if the value of the key reaches the value of "worker_sz", all workers have completed this round of iteration, so the following is done:

      • "server_clock" is incremented by 1. "server_clock" is the server clock; a worker reads this number to see whether it is behind or ahead;
      • the corresponding "client_clock_" is reset to 0, ready for the next iteration.
  • For any other key, its value is simply incremented by delta.

2.4.4 Series

Putting all the logic together, the terms are explained as follows:

  • client_clock_X records how many workers have finished actual iteration X within the current round of virtual iteration, where 0 <= X < limit_s.
  • worker_sz indicates how many workers should currently be training together.
  • server_clock is the server clock; it represents the total number of (actual) iterations completed so far, and a worker reads this value to see whether it is behind or ahead.

The details are as follows:

  • limit_s is 3, meaning the fastest node cannot be more than three iterations ahead of the slowest node. When it is more than three iterations ahead, Paracel forces a wait. Thus there are two kinds of iteration:

    • A large iteration is a virtual iteration, consisting of three (i.e. limit_s) small iteration steps.
    • A small iteration is an actual iteration step, represented by client_clock_X; client_clock_0 records how many workers have finished the first actual iteration of the current round of virtual iteration.
  • In the worker's paralg constructor, various data are initialized. What matters here is that the server key "worker_sz" is set to worker_comm.get_size(), which is the number of workers, 5.

    "worker_sz" means: how many workers should currently be training together.

  • In the worker's paracel_read, the local clock is continually compared with the remote "server_clock"; if the lead exceeds limit_s, the worker is forced to wait.

  • In the worker's iter_commit:

    • The local clock is incremented:

      • clock counts up from 0 and is the number of actual local iterations;
      • if (clock == total_iters), this worker's local training has reached its overall iteration count, so the server's "worker_sz" value is decremented. That is: this worker has finished training, so the number of workers training together is reduced by 1.
    • With limit_s equal to 3, clock_key cycles through client_clock_0, client_clock_1, client_clock_2; based on the local clock, client_clock_(clock % limit_s) on the server is incremented by 1. client_clock_0 records how many workers have finished the first actual iteration of the current round of virtual iteration.

  • After iter_commit is submitted, on the server:

    • If the key starts with "client_clock_", then:

      • the corresponding key is incremented by the given value;

      • if the value of the key reaches the value of "worker_sz", all workers have completed this round of iteration, so the following is done:

        • "server_clock" is incremented by 1. "server_clock" is the server clock; a worker reads this number to see whether it is behind or ahead;
        • the corresponding "client_clock_" is reset to 0, ready for the next iteration.
    • For any other key, its value is simply incremented.

We can look at the logic diagram:

(Figure: worker 1, the fast one, sits in paracel_read() executing while(stale_cache + limit_s < clock) { stale_cache = get("server_clock"); }. Worker 2, the slow one, executes iter_commit(): clock_key = "client_clock_" + (clock % limit_s) (or "client_clock_0" when limit_s == 0), then incr_int(clock_key, 1), clock += 1, and incr_int("worker_sz", -1) when clock == total_iters. The server, on receiving incr_int for a "client_clock_" key, increments it or sets it to 1; once it reaches "worker_sz", it increments "server_clock" and resets the key to 0, then applies ssp_tbl.incr(key, delta).)


We can also use a diagram to show the logical process, where:

  • client_clock_X is abbreviated as c_c_X; it records how many workers have finished actual iteration X within the current round of virtual iteration.
  • worker_sz is abbreviated as w_sz; it indicates how many workers should currently be training together.
  • server_clock is abbreviated as s_c; it is the server clock, representing the total number of (actual) iterations completed so far, and a worker reads this value to see whether it is behind or ahead.
  • All of these are server-side variables.

Training now starts; read each table from top to bottom.

The first worker starts training; it actually trains two steps, incrementing c_c_0 and c_c_1.

|         | c_c_0 | c_c_1 | c_c_2 | w_sz | s_c | instructions |
|---------|-------|-------|-------|------|-----|--------------|
| worker1 | 1 | 1 | | 5 | | The first worker starts training; it actually trains two steps, incrementing c_c_0 and c_c_1 |
| worker2 | | | | | | |
| worker3 | | | | | | |
| worker4 | | | | | | |
| worker5 | | | | | | |

The second worker starts training; it actually trains two steps, incrementing c_c_0 and c_c_1.

|         | c_c_0 | c_c_1 | c_c_2 | w_sz | s_c | instructions |
|---------|-------|-------|-------|------|-----|--------------|
| worker1 | 1 | 1 | | 5 | | The first worker starts training; it actually trains two steps, incrementing c_c_0 and c_c_1 |
| worker2 | 2 | 2 | | 5 | | The second worker starts training; it actually trains two steps, incrementing c_c_0 and c_c_1 |
| worker3 | | | | | | |
| worker4 | | | | | | |
| worker5 | | | | | | |

The third worker starts training; it actually trains two steps, incrementing c_c_0 and c_c_1.

|         | c_c_0 | c_c_1 | c_c_2 | w_sz | s_c | instructions |
|---------|-------|-------|-------|------|-----|--------------|
| worker1 | 1 | 1 | | 5 | | The first worker starts training; it actually trains two steps, incrementing c_c_0 and c_c_1 |
| worker2 | 2 | 2 | | 5 | | The second worker starts training; it actually trains two steps, incrementing c_c_0 and c_c_1 |
| worker3 | 3 | 3 | | 5 | | The third worker starts training; it actually trains two steps, incrementing c_c_0 and c_c_1 |
| worker4 | | | | | | |
| worker5 | | | | | | |

The fourth worker starts training; it actually trains two steps, incrementing c_c_0 and c_c_1.

|         | c_c_0 | c_c_1 | c_c_2 | w_sz | s_c | instructions |
|---------|-------|-------|-------|------|-----|--------------|
| worker1 | 1 | 1 | | 5 | | The first worker starts training; it actually trains two steps, incrementing c_c_0 and c_c_1 |
| worker2 | 2 | 2 | | 5 | | The second worker starts training; it actually trains two steps, incrementing c_c_0 and c_c_1 |
| worker3 | 3 | 3 | | 5 | | The third worker starts training; it actually trains two steps, incrementing c_c_0 and c_c_1 |
| worker4 | 4 | 4 | | 5 | | The fourth worker starts training; it actually trains two steps, incrementing c_c_0 and c_c_1 |
| worker5 | | | | | | |

The fifth worker starts training; it actually trains one step, incrementing c_c_0. Since all 5 workers have now completed this actual iteration, server_clock is incremented by 1.

At this point, worker 5 is one iteration behind the others (server_clock = 1).

|         | c_c_0 | c_c_1 | c_c_2 | w_sz | s_c | instructions |
|---------|-------|-------|-------|------|-----|--------------|
| worker1 | 1 | 1 | | 5 | | The first worker starts training; it actually trains two steps, incrementing c_c_0 and c_c_1 |
| worker2 | 2 | 2 | | 5 | | The second worker starts training; it actually trains two steps, incrementing c_c_0 and c_c_1 |
| worker3 | 3 | 3 | | 5 | | The third worker starts training; it actually trains two steps, incrementing c_c_0 and c_c_1 |
| worker4 | 4 | 4 | | 5 | | The fourth worker starts training; it actually trains two steps, incrementing c_c_0 and c_c_1 |
| worker5 | 5 -> 0 | | | 5 | 1 | The fifth worker starts training; it actually trains one step, incrementing c_c_0. Since all 5 workers have completed this actual iteration, server_clock is incremented by 1 and "client_clock_0" is reset to 0, ready for the next iteration |

Now let’s look at the special case.

First, suppose the other 4 workers have each finished 3 steps while worker 5 has not yet run; the state is as follows:

|         | c_c_0 | c_c_1 | c_c_2 | w_sz | s_c | instructions |
|---------|-------|-------|-------|------|-----|--------------|
| worker1 | 1 | 1 | 1 | 5 | | The first worker in this round actually trains three steps, incrementing c_c_0, c_c_1 and c_c_2 |
| worker2 | 2 | 2 | 2 | 5 | | The second worker in this round actually trains three steps, incrementing c_c_0, c_c_1 and c_c_2 |
| worker3 | 3 | 3 | 3 | 5 | | The third worker in this round actually trains three steps, incrementing c_c_0, c_c_1 and c_c_2 |
| worker4 | 4 | 4 | 4 | 5 | | The fourth worker in this round actually trains three steps, incrementing c_c_0, c_c_1 and c_c_2 |
| worker5 | | | | | | |

Now consider worker 5's iter_commit: when worker 5 finds that clock == total_iters, it has reached its overall iteration count, so it decrements the server's "worker_sz" value. That is: this worker has finished training, so the number of workers training together is reduced by 1.

Because worker 5 completes its 3 steps of training in one go, s_c becomes 3; that is, 3 full (actual) iterations have been completed.
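To see the arithmetic: worker 5's three commits increment client_clock_0, client_clock_1 and client_clock_2 in turn, each from 4 to 5; each time a counter reaches worker_sz = 5, the server increments server_clock and resets that counter, so s_c goes 1 -> 2 -> 3. Only after the third commit does clock == total_iters fire, dropping worker_sz from 5 to 4.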

In this virtual iteration, all 5 workers have completed training, so c_c_0 ~ c_c_2 each first reach 5 and are then reset to 0.

|         | c_c_0 | c_c_1 | c_c_2 | w_sz | s_c | instructions |
|---------|-------|-------|-------|------|-----|--------------|
| worker1 | 1 | 1 | 1 | 5 | | The first worker in this round actually trains three steps, incrementing c_c_0, c_c_1 and c_c_2 |
| worker2 | 2 | 2 | 2 | 5 | | The second worker in this round actually trains three steps, incrementing c_c_0, c_c_1 and c_c_2 |
| worker3 | 3 | 3 | 3 | 5 | | The third worker in this round actually trains three steps, incrementing c_c_0, c_c_1 and c_c_2 |
| worker4 | 4 | 4 | 4 | 5 | | The fourth worker in this round actually trains three steps, incrementing c_c_0, c_c_1 and c_c_2 |
| worker5 | 5 -> 0 | 5 -> 0 | 5 -> 0 | 4 | 3 | After the fifth worker finishes training in this round, it also finds clock == total_iters, so "worker_sz" is decremented by 1; from now on only 4 workers are considered |

So far, we have completed the analysis of SSP; the next article analyzes data/model loading.

0xEE Personal information

Thoughts on life and technology

Wechat public account: Rosie’s Thinking
