Neural networks are computationally intensive and often take hours or days to train. Data parallelism is a way to speed up training by adding workers (such as GPUs). At each step, the training data is split into mini-batches that are distributed across the workers; each worker computes gradients on its own mini-batch, and the gradients are then aggregated across workers so that the same update is applied to every model replica. All-reduce, the default cross-device communication operation in TensorFlow, PyTorch, and Horovod, collects the gradients from all workers and sums them in every iteration. This communication in each training iteration consumes a large amount of network bandwidth.
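To make this concrete, here is a minimal sketch of a data-parallel training loop using PyTorch's DistributedDataParallel. It is illustrative rather than Vertex AI-specific: the model, dataset, batch size, and learning rate are placeholders, and the gradient all-reduce happens inside loss.backward().

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model, dataset, epochs=1):
    # Launched with torchrun, which sets RANK, WORLD_SIZE, LOCAL_RANK, etc.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    ddp_model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)
        for inputs, targets in loader:
            inputs, targets = inputs.cuda(local_rank), targets.cuda(local_rank)
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(ddp_model(inputs), targets)
            loss.backward()   # DDP all-reduces (averages) gradients across workers here
            optimizer.step()  # every replica applies the same averaged update
```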

To speed up data-parallel training on GPU clusters, Vertex AI has introduced Reduction Server, a faster gradient aggregation algorithm developed at Google that doubles the algorithmic bandwidth of all-reduce operations. Reduction Server lets distributed ML training jobs use network bandwidth more efficiently (up to 2x throughput) and finish faster, and the reduced training time can in turn lower total operating costs. In addition, users can enable Reduction Server on Vertex AI without changing any of their training code.

This post introduces the Reduction Server concept and demonstrates how Google Cloud customers can take advantage of this feature on Vertex AI to improve their training time. In the next section, we delve into the technical details and examine all-reduce, a key operation in distributed data-parallel training.

All-reduce

All-reduce is a collective operation that reduces the target arrays from all workers into a single array, using an operation such as sum, product, max, or min, and returns the resulting array to all workers. It is widely used in distributed neural network training, where the gradients computed by multiple workers need to be summed and made available to every worker. Figure 1 illustrates the semantics of all-reduce.
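As a small illustration of these semantics (assuming a standard PyTorch installation and a torchrun launch; the tensor values are arbitrary), every rank contributes an array and every rank receives the element-wise sum:

```python
# Run with, e.g.:  torchrun --nproc_per_node=4 this_script.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("gloo")   # "nccl" on GPUs; "gloo" works on CPU for this demo
    rank = dist.get_rank()
    # Each worker holds its own "gradient" array.
    grad = torch.tensor([float(rank), float(rank) + 1.0])
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    # After all_reduce, every rank holds the same summed array.
    print(f"rank {rank}: {grad.tolist()}")

if __name__ == "__main__":
    main()
```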

There are many ways to implement all-reduce efficiently. In traditional all-reduce algorithms, workers communicate with each other and exchange gradients over a communication topology such as a ring or a tree. Ring all-reduce is a bandwidth-optimal all-reduce algorithm in which the workers form a logical ring and each worker communicates only with its nearest neighbors. However, even a bandwidth-optimal all-reduce algorithm still requires the input data to be transferred twice over the network.¹
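The following toy simulation (illustrative only; real implementations such as NCCL are far more sophisticated) runs ring all-reduce as a reduce-scatter followed by an all-gather and counts how many elements each worker sends, matching the 2(n−1)/n × N figure in footnote 1:

```python
import numpy as np

def ring_all_reduce(arrays):
    n = len(arrays)
    chunks = [list(np.array_split(a.astype(float), n)) for a in arrays]
    sent = np.zeros(n, dtype=int)

    # Reduce-scatter: after n-1 steps, worker r owns the fully summed chunk (r+1) % n.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n                      # chunk worker r forwards this step
            sent[r] += chunks[r][c].size
            chunks[(r + 1) % n][c] = chunks[(r + 1) % n][c] + chunks[r][c]

    # All-gather: each worker forwards the reduced chunk it received in the previous step.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n
            sent[r] += chunks[r][c].size
            chunks[(r + 1) % n][c] = chunks[r][c]

    return [np.concatenate(ch) for ch in chunks], sent

if __name__ == "__main__":
    n, N = 4, 1000
    data = [np.random.rand(N) for _ in range(n)]
    result, sent = ring_all_reduce(data)
    assert np.allclose(result[0], sum(data))        # every worker ends up with the full sum
    print("elements sent per worker:", sent, "expected ~", 2 * (n - 1) / n * N)
```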

Reduction Server

Reduction Server is a faster all-reduce algorithm for GPU clusters developed at Google. It uses two types of nodes: workers and reducers. Workers run replicas of the model, compute gradients, and apply the optimization step. Reducers are lightweight CPU VM instances (much cheaper than GPU VMs) dedicated to aggregating gradients from the workers. Figure 2 shows the overall architecture of Reduction Server with four workers and a single reducer shard.

Each worker only needs to transfer one copy of the input data over the network, so Reduction Server effectively halves the amount of data that must be transferred. Another advantage of Reduction Server is that its latency does not depend on the number of workers. The reducers are also stateless: they only reduce gradients and send the results back to the workers.
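As a rough, back-of-the-envelope illustration of this halving (the parameter count and worker count below are arbitrary example values, not benchmark figures; the formulas follow footnote 1 and the argument above):

```python
# Approximate gradient traffic sent (and, symmetrically, received) per worker per
# iteration. Example values are illustrative only.
params = 340e6                 # e.g., a BERT-large-sized model
bytes_per_element = 4          # fp32 gradients
n = 32                         # number of workers
gradient_bytes = params * bytes_per_element

ring_all_reduce = 2 * (n - 1) / n * gradient_bytes   # ~2x the gradient size (footnote 1)
reduction_server = gradient_bytes                    # ~1x: each worker sends its gradients once

print(f"ring all-reduce:  {ring_all_reduce / 1e9:.2f} GB per worker per iteration")
print(f"Reduction Server: {reduction_server / 1e9:.2f} GB per worker per iteration")
```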

The following table summarizes the per-worker data transfer volume and latency of Reduction Server compared with ring- and tree-based all-reduce algorithms, where n is the number of workers.

Reduction Server provides transparent support for many frameworks that use NCCL for distributed GPU training, such as TensorFlow and PyTorch, and is available on Vertex AI. This enables ML practitioners to use Reduction Server without changing the training code.
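For example, a multi-worker job with Reduction Server might be launched with the Vertex AI Python SDK roughly as follows. This is a sketch: the project, bucket, training image, and machine shapes are placeholders, and the reduction_server_* parameters and reducer container URI reflect our reading of the SDK and docs, so check the current Vertex AI documentation before use.

```python
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",                       # placeholder project ID
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",    # placeholder bucket
)

job = aiplatform.CustomContainerTrainingJob(
    display_name="bert-finetune-reduction-server",
    # Your training image; it should use NCCL and include the Reduction Server
    # NCCL transport plugin (Vertex AI's pre-built GPU training images do).
    container_uri="us-docker.pkg.dev/my-project/training/bert-trainer:latest",
)

job.run(
    replica_count=8,                            # GPU worker replicas
    machine_type="a2-highgpu-8g",
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=8,
    reduction_server_replica_count=20,          # lightweight CPU reducer VMs
    reduction_server_machine_type="n1-highcpu-16",
    # Reducer image as documented for Vertex AI; verify against current docs.
    reduction_server_container_uri=(
        "us-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest"
    ),
)
```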

Performance advantage

Figure 3 shows the performance improvement when Reduction Server is used to fine-tune the BERT model from the TensorFlow Model Garden on the MNLI dataset, with 8 NVIDIA A100 GPUs per worker node. In this experiment, training throughput increased by 75% when 20 reducer nodes were used. Other large models also benefit from Reduction Server, with higher throughput and shorter training times.

Conclusion

In this blog post, we described how Reduction Server on Vertex AI delivers significant improvements for distributed data-parallel GPU training and lets ML practitioners transition transparently from traditional all-reduce to Reduction Server.

To learn more, visit our documentation for in-depth information and to get hands-on experience with Reduction Server on Vertex AI.


1. If the target array has N elements and there are n workers, each worker needs to send and receive 2(n−1)/n × N ≈ 2N elements over the network during a ring all-reduce operation.
