Recently, PaddlePaddle proposed a 4D hybrid parallel strategy for training dense models with hundreds of billions of parameters.

In recent years, deep learning developers have been pursuing ever better model quality and repeatedly topping benchmark leaderboards, and this is usually the result of "training at scale": training with large-scale data or large-scale models. Large-scale data gives the model enough "teaching material" to learn from, while a large number of parameters gives the model stronger "learning capacity", making it easier to extract "knowledge" from that material. As data and parameter scales grow, conventional single-machine training becomes insufficient due to hardware limitations, and distributed training has become the inevitable choice for most developers.

So-called distributed training means using multiple machines to complete a training task. It involves many important technologies, such as splitting the task across machines, scheduling cluster training resources, balancing training speed against convergence speed, and providing elasticity and fault tolerance. It is also the "strategic high ground" on which the major deep learning frameworks demonstrate their technical strength.

PaddlePaddle is China's first open-source, self-developed, fully featured industrial-grade deep learning platform. Its English name "PaddlePaddle" is derived from "PArallel Distributed Deep LEarning". PaddlePaddle was the first in the industry to support training trillion-parameter sparse models, and recently it innovatively proposed the 4D hybrid parallel strategy to train dense models with hundreds of billions of parameters. Distributed training can be called one of PaddlePaddle's most distinctive technologies. So how does it work? The answer is inseparable from the tempering of real business workloads.

▲ Figure 1 Baidu's rich business scenarios

Before being offered externally, PaddlePaddle's distributed training technology was already widely used inside Baidu, for example in the search engine, feed recommendation, Baidu Translate, Baidu Maps, Haokan Video, and Wenxin ERNIE. These workloads include computer vision (CV) / natural language processing (NLP) training scenarios with complex networks and dense parameters, as well as search and recommendation training scenarios with huge Embedding layers and massive data volumes. Baidu can therefore be described as a uniquely well-equipped "training room" for distributed training technology.

▲ Figure 2 Large-scale training scenario

Honed by search and recommendation workloads: mature trillion-parameter sparse model training that leads the field

Search and recommendation scenarios often face large data volumes and high-dimensional, sparse features. The parameter server mode of distributed training manages model parameters in a centralized way, implementing distributed storage and update of model parameters. This mode has two roles, Server and Worker: Workers perform the forward and backward computation of the model, while Servers collect and aggregate the gradients from the Workers and update the parameters. It is therefore very friendly to training scenarios that must store extremely large models, and it is often used to train search and recommendation models with massive sparse parameters.
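
To make the Worker/Server division of labor concrete, here is a minimal, framework-agnostic sketch of the parameter-server pattern described above; the class and method names are illustrative assumptions, not PaddlePaddle APIs.

```python
# Minimal sketch of the parameter-server pattern: Workers compute gradients,
# the Server stores the sparse parameter table and applies updates.
import numpy as np

class ParameterServer:
    """Holds the (sparse) parameter table and applies gradient updates."""
    def __init__(self, dim, lr=0.01):
        self.table = {}          # feature id -> embedding vector
        self.dim = dim
        self.lr = lr

    def pull(self, ids):
        # Lazily create embeddings for unseen feature ids.
        return {i: self.table.setdefault(i, np.zeros(self.dim)) for i in ids}

    def push(self, grads):
        # Aggregate gradients sent by workers and update parameters.
        for i, g in grads.items():
            self.table[i] -= self.lr * g

class Worker:
    """Runs forward/backward on its own shard of data."""
    def __init__(self, server):
        self.server = server

    def train_step(self, feature_ids, label):
        params = self.server.pull(feature_ids)          # fetch needed parameters
        pred = sum(p.sum() for p in params.values())    # toy "forward" pass
        err = pred - label
        grads = {i: np.full(self.server.dim, err) for i in feature_ids}
        self.server.push(grads)                         # send gradients back

server = ParameterServer(dim=8)
workers = [Worker(server) for _ in range(4)]
for step, w in enumerate(workers):
    w.train_step(feature_ids=[step, step + 1], label=1.0)
```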

▲ Figure 3. Traditional parameter server

As the world's largest Chinese search engine, Baidu Search places very high demands on model scale and performance. To cope with these demanding business challenges, PaddlePaddle's pure-CPU parameter server mode could already support training trillion-parameter sparse models as early as 2018. Later, as network structures became more complex and training efficiency and cost-effectiveness were pursued further, PaddlePaddle's parameter server technology kept evolving: from the early pure-CPU parameter server, in which all Worker nodes use identical hardware, to the pure-GPU parameter server, and then to the heterogeneous parameter server that mixes CPUs, GPUs, and other AI hardware in one training job, it has consistently led the development of parameter server technology. It has also been adopted in more applications, such as recommendation in the OPPO App Store and anchor recommendation in NetEase Cloud Music.

From the traditional pure-CPU parameter server to the pure-GPU parameter server

The traditional pure-CPU parameter server consists of high-performance asynchronous Worker training, efficient communication strategies, and high-performance Servers. Since CPUs are usually available in large numbers, the throughput advantage of multi-core CPUs can be fully exploited during training: training simple models in asynchronous mode greatly increases data throughput, and overall training speed is excellent.

▲ Figure 4 Traditional parameter server workflow

However, as networks became more and more complex and models demanded ever more compute on top of already huge data volumes, the weakness of CPU compute performance began to show. Although this could be mitigated by adding more CPU machines, even hundreds of them, that approach not only increases cost sharply but also causes serious problems with cluster stability and scalability. PaddlePaddle therefore introduced the pure-GPU parameter server to improve computing performance: a single multi-card GPU machine can now complete training that previously required a hundred CPU machines. Of course, the hardware change also brought new problems to solve.

The powerful compute of GPUs undoubtedly improves the computing performance of the cluster. As a consequence, however, the model size becomes constrained by the machine's GPU memory and host memory, and communication bandwidth becomes a bottleneck because the cluster now contains far fewer network cards. To solve these two problems, PaddlePaddle introduced two highlight technologies, SSD-MEM-HBM three-level storage and RPC&NCCL hybrid communication, forming PaddlePaddle's unique pure-GPU parameter server, PaddleBox [1]:

  • SSD-MEM-HBM three-level storage stores the full set of parameters on SSD, keeps high-frequency parameters in host memory, and places the parameters used by the current batch in GPU memory (HBM), with fast parameter copies between SSD, memory, and GPU memory. An asynchronous pipelined execution mechanism hides the extra I/O cost, so training speed is maintained while the model size is no longer limited by GPU memory or host memory, greatly increasing the trainable model size (a minimal sketch of this lookup flow follows the list below).

  • RPC&NCCL hybrid communication uses the RPC protocol for inter-node communication of some sparse parameters and NCCL for inter-card communication of the other parameters, making full use of bandwidth resources.
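
As an illustration of the three-level storage idea (not PaddleBox source code), the sketch below stages parameters from a full table standing in for SSD, through a host-memory cache, into the set of parameters needed by the current batch; all class and method names are hypothetical.

```python
# Illustrative three-level lookup: SSD holds everything, host memory caches the
# hot entries, and the current batch's parameters are staged into "HBM".
import numpy as np

class ThreeLevelParamStore:
    def __init__(self, dim, mem_capacity):
        self.dim = dim
        self.ssd = {}                 # full parameter table (stand-in for SSD files)
        self.mem = {}                 # high-frequency parameters cached in host memory
        self.mem_capacity = mem_capacity

    def _load_from_ssd(self, key):
        return self.ssd.setdefault(key, np.zeros(self.dim, dtype=np.float32))

    def fetch_batch(self, keys):
        """Stage the parameters needed by the current batch into 'HBM'."""
        hbm = {}
        for k in keys:
            if k not in self.mem:                     # miss: promote from SSD to memory
                if len(self.mem) >= self.mem_capacity:
                    evicted = next(iter(self.mem))    # naive eviction policy
                    self.ssd[evicted] = self.mem.pop(evicted)
                self.mem[k] = self._load_from_ssd(k)
            hbm[k] = self.mem[k].copy()               # copy host memory -> GPU memory
        return hbm

store = ThreeLevelParamStore(dim=8, mem_capacity=1000)
batch_params = store.fetch_batch(keys=[3, 17, 42])
```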

▲ Figure 5 Pure GPU parameter server workflow

Although PaddlePaddle's pure-GPU parameter server solved the problems faced by the pure-CPU mode, a new question arose: how can the utilization of training resources be improved further?

From the traditional pure-GPU parameter server to the heterogeneous parameter server

In the pure-GPU parameter server, all training runs on GPUs. When some network layers in the model are complex, it is hard to keep GPU utilization high, and the ratio of CPU to GPU hardware inside a GPU machine is fixed and cannot be adjusted flexibly. There are two solutions to this situation:

  • Customize GPU machine models to adjust the ratio of CPU to GPU hardware inside a machine.

  • Mix CPU machines and GPU machines in one cluster to adjust the hardware ratio across machines.

Based on these two solutions, PaddlePaddle Framework 2.0 innovatively introduced the general-purpose heterogeneous parameter server capability, removing the constraint of the traditional parameter server mode that all Worker nodes must use exactly the same hardware model. Training tasks become insensitive to hardware model: different kinds of hardware can be mixed in one heterogeneous training job, for example CPUs, AI accelerator chips (such as Baidu Kunlun XPU), and GPUs of different models such as V100, P40, and K40. It also alleviates the low chip utilization caused by the high proportion of I/O in large-scale sparse-feature model training. With the heterogeneous parameter server mode, users can deploy distributed training tasks on heterogeneous hardware clusters, such as clusters of cloud servers, efficiently utilizing chips of different compute capabilities to provide higher-throughput, lower-cost training.

▲ Figure 6 Schematic diagram of heterogeneous parameter server

The biggest highlight of the heterogeneous parameter server is hardware-aware task splitting. As shown in Figure 6, a training task that is both compute intensive and I/O intensive, such as ERNIE+CTR, can be split into multiple sub-tasks: I/O-intensive sub-tasks (such as data reading and Embedding lookup) are assigned to CPU machines, while compute-intensive sub-tasks are assigned to GPU machines. Users can flexibly decide the machine ratio according to the computational complexity of each sub-task, and the mode remains compatible with the training tasks supported by the traditional pure-CPU and pure-GPU parameter servers.
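
A toy sketch of such hardware-aware splitting is shown below; the task names, node names, and the simple round-robin placement policy are illustrative assumptions rather than PaddlePaddle's actual scheduler.

```python
# Illustrative hardware-aware task splitting: I/O-bound stages go to CPU nodes,
# compute-bound stages go to GPU nodes.
from dataclasses import dataclass

@dataclass
class SubTask:
    name: str
    kind: str          # "io" or "compute"

def split_tasks(subtasks, cpu_nodes, gpu_nodes):
    """Assign sub-tasks to node pools by resource profile (round-robin)."""
    placement = {}
    io_i = comp_i = 0
    for t in subtasks:
        if t.kind == "io":
            placement[t.name] = cpu_nodes[io_i % len(cpu_nodes)]
            io_i += 1
        else:
            placement[t.name] = gpu_nodes[comp_i % len(gpu_nodes)]
            comp_i += 1
    return placement

tasks = [
    SubTask("data_reading", "io"),
    SubTask("embedding_lookup", "io"),
    SubTask("ernie_forward_backward", "compute"),
    SubTask("ctr_dense_net", "compute"),
]
print(split_tasks(tasks, cpu_nodes=["cpu-0", "cpu-1"], gpu_nodes=["gpu-0"]))
```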

Helping Wenxin ERNIE iterate rapidly: pioneering 4D hybrid parallelism to lead the trend of ultra-large-scale pre-training

In the NLP field, Baidu's semantic understanding technology and platform, Wenxin ERNIE, has won numerous honors: in March last year it took five championships at SemEval 2020; in May, the language generation pre-training model ERNIE-GEN was released and refreshed the SOTA; in June, the multimodal model ERNIE-ViL was released, setting records on five tasks and topping the authoritative VCR leaderboard; in July, it appeared at the 2020 World Artificial Intelligence Conference and won the SAIL award, the conference's highest honor; in November, it won the Outstanding Scientific and Technological Achievement Award of the Chinese Association for Artificial Intelligence. Behind ERNIE's shining achievements is also the contribution of PaddlePaddle's distributed training technology.

First, for NLP and CV models with complex networks and dense parameters, the collective communication mode of PaddlePaddle's distributed training supports such models well. In this mode there is no central node managing model parameters; every node is a Worker. Each Worker is responsible for model training and also needs to obtain the latest global gradient information. The collective communication mode has high requirements on chip compute power and on the interconnect between chips, such as high-performance GPUs and high-speed interconnects like NVLink and InfiniBand, so it is well suited to compute-intensive training tasks in CV and NLP.

However, in early collective communication architectures, parameter information was usually transmitted among nodes through many rounds of point-to-point communication between the Workers, which was inefficient. Baidu made a breakthrough in 2016 by proposing and using Ring-AllReduce multi-GPU training, which completes the transmission of model parameters across all nodes with fewer point-to-point communication rounds. This greatly improved the multi-GPU scaling of synchronous parallel training and the training speed of the collective communication mode, enabling it to be applied more widely in NLP and CV.
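
To show the structure of Ring-AllReduce, here is a small single-process simulation of its two phases (scatter-reduce, then all-gather); it is an illustration of the algorithm's idea, not PaddlePaddle's NCCL-based implementation.

```python
# Minimal simulation of Ring-AllReduce (sum) over N simulated workers.
import numpy as np

def ring_allreduce(chunks_per_worker):
    """chunks_per_worker[w] is a list of N chunks held by worker w (N workers)."""
    n = len(chunks_per_worker)
    data = [list(c) for c in chunks_per_worker]

    # Scatter-reduce: after n-1 steps, each worker owns one fully reduced chunk.
    for step in range(n - 1):
        for w in range(n):
            send_idx = (w - step) % n
            dst = (w + 1) % n
            data[dst][send_idx] = data[dst][send_idx] + data[w][send_idx]
    # All-gather: circulate the reduced chunks so every worker has all of them.
    for step in range(n - 1):
        for w in range(n):
            send_idx = (w + 1 - step) % n
            dst = (w + 1) % n
            data[dst][send_idx] = data[w][send_idx]
    return data

workers = [[np.full(2, w + 1, dtype=float) for _ in range(4)] for w in range(4)]
reduced = ring_allreduce(workers)
print(reduced[0][0])   # every chunk on every worker now equals the sum 1+2+3+4 = 10
```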

▲ Figure 7 Collective communication training

The 4D hybrid parallel strategy supports training the hundred-billion-parameter ERNIE language model

PaddlePaddle's collective communication mode can already support training Wenxin ERNIE at the hundred-billion-parameter scale, and its Sharding-DP strategy recently helped Wenxin ERNIE's multi-task scores top the GLUE leaderboard. Sharding-DP is one of the many parallel strategies that PaddlePaddle's collective communication mode provides for training large, complex models like ERNIE. So which strategies were used to successfully support training the hundred-billion-parameter ERNIE language model, and how do they work? The following sections explain them in detail.

The hundred-billion-parameter ERNIE model uses a Transformer network with more than 100 layers, its computation is heavy, and training requires terabyte-scale GPU memory. To train it efficiently on a limited number of machines, a series of performance optimizations and memory optimizations must be applied.

Let's start with performance optimization. A simple formula shows which factors determine training speed in a fixed hardware environment:

**Total training speed ∝ single-card speed × number of cards × multi-card acceleration ratio**

Single-card speed is determined by data reading speed and computation speed, and the multi-card acceleration ratio is determined by computation/communication efficiency. These are clearly the three key factors. Beyond basic single-card optimizations such as operator fusion and mixed precision, distributed training introduces a series of parallel strategies. The core idea of a parallel strategy is to partition the data and the computation-related graph/operators across different devices, reduce inter-device communication as much as possible, make good use of the resources of multiple devices, and achieve efficient concurrent scheduling to maximize training speed. Common parallel strategies are data parallelism (DP), inter-layer parallelism (pipeline parallelism, PP), and intra-layer parallelism (model parallelism, MP), as summarized in the table below. We analyze the advantages and disadvantages of the three strategies in terms of device resources and computation/communication efficiency:

  • Data parallelism has the highest training acceleration ratio, but each device must hold a full copy of the model, so GPU memory consumption is relatively high. Our improved scheme is therefore the grouped parameter sharding data-parallel strategy (its principle is detailed later), which combines the advantages of MP and DP at the cost of a higher communication volume.

  • Model parallelism has a high communication-to-computation ratio; it is suitable for parallelism within a machine and supports only a limited set of model types.

  • Pipeline parallelism tends to leave training devices idle, so its acceleration efficiency is not as high as DP; however, it reduces communication at stage boundaries and supports more layers, making it suitable for use across machines.

Next, consider the GPU memory problem. From the sources of memory consumption analyzed in the table below, the parallel strategies above can also handle memory pressure from different sources: memory consumed by a large number of layers can be addressed by pipeline parallelism and the grouped parameter sharding strategy, while a single layer with very large parameters can be addressed by model parallelism. Beyond that, PaddlePaddle also provides other flexible optimizations, such as recomputation and Offload, which deal with the memory occupied by the intermediate outputs of each layer.

To sum up, each of these parallel strategies helps with performance optimization or memory optimization, but each also has its own limitations. To train a hundred-billion-parameter model efficiently, the strategies must be combined so that they complement one another and their advantages are fully exploited.

So how should they be combined? Inside a single machine, the 2D combination of model parallelism and grouped parameter sharding is used first, because both have a high communication volume and are suited to the inter-card communication within a machine. Then, to hold the hundred-billion-parameter model, pipeline parallelism is layered on top so that the model is shared across multiple machines. Finally, to train fast, data parallelism is added on the outermost level to increase concurrency and improve overall training speed. Thus the industry's first 4D hybrid parallel strategy was born.
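
As a rough illustration of how the four degrees compose (not PaddlePaddle's actual process-group construction), the sketch below maps each global GPU rank to a (dp, pp, sharding, mp) coordinate, with the inner degrees corresponding to cards inside a machine; the degree values are example settings, not the exact ERNIE configuration.

```python
# Map each global rank to its 4D coordinate: DP outermost, then PP, then
# sharding, with MP innermost (varying fastest across cards of one machine).
from itertools import product

def build_4d_layout(dp, pp, sharding, mp):
    assert dp * pp * sharding * mp > 0
    layout = {}
    rank = 0
    for coord in product(range(dp), range(pp), range(sharding), range(mp)):
        layout[rank] = dict(zip(("dp", "pp", "sharding", "mp"), coord))
        rank += 1
    return layout

layout = build_4d_layout(dp=2, pp=4, sharding=2, mp=4)   # 2*4*2*4 = 64 GPUs
print(len(layout))           # 64 global ranks
print(layout[0], layout[5])  # coordinates of rank 0 and rank 5
```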

▲ Figure 8 schematic diagram of 4D hybrid parallel strategy

Let’s briefly introduce the principles of several parallel strategies.

The model parallel strategy cuts one network layer into multiple pieces that are distributed to different cards and computed in parallel; each card only computes part of the result. For the Transformer structure in ERNIE, model parallelism splits the fully connected (FC) layers and then merges the partial results through communication operations [2].
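
The sketch below uses numpy on a single process to mimic this idea: one FC weight is split by output columns across two "cards", each card computes a partial product, and a concatenation stands in for the communication step that merges the results (in the spirit of Megatron-LM [2]); it is not PaddlePaddle's actual implementation.

```python
# Splitting one FC layer across two simulated cards (column parallelism).
import numpy as np

np.random.seed(0)
x = np.random.randn(8, 16)          # batch of activations
w = np.random.randn(16, 32)         # full FC weight

# Split the weight by output columns: each card holds half of the columns.
w_card0, w_card1 = np.split(w, 2, axis=1)

y_card0 = x @ w_card0               # partial result computed on card 0
y_card1 = x @ w_card1               # partial result computed on card 1

# "Communication": gather the partial outputs to recover the full result.
y_parallel = np.concatenate([y_card0, y_card1], axis=1)
assert np.allclose(y_parallel, x @ w)
```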

The pipeline parallel strategy places different layers of the model on different devices, so that GPU memory consumption is shared across multiple devices, making it possible to train very large models. Data flows between adjacent devices through communication links; since each device only transmits the output tensors at the boundary between adjacent stages, the communication volume is small, so pipeline parallelism is relatively well suited to communication between machines.

It is worth noting that pipeline parallelism is a special case of model parallelism in the broad sense. In this article, model parallelism refers specifically to tensor splitting, i.e., assigning different parts of the same network layer to different cards, whereas pipeline parallelism splits the model at the granularity of network layers.

▲ Figure 9 schematic diagram of pipeline parallel strategy

The pipeline parallel strategy itself also leaves considerable room for optimization. As shown in Figure 10(a), before optimization only one computing device is busy at any given moment while the other devices sit idle; this idle period is called Bubble time [3]. To reduce the Bubble time, as shown in Figure 10(b), PaddlePaddle further splits the mini-batch into several finer-grained micro-batches, and each device computes the result of one micro-batch at a time, which increases concurrency across devices and reduces the proportion of Bubble time in the pipeline.

Furthermore, after analyzing the pipeline training process, the researchers found that GPU memory utilization could be optimized further. As shown in Figure 10(c), once a micro-batch finishes its forward computation, its corresponding backward computation is scheduled as early as possible; this releases part of the GPU memory to accept new data and improves overall training throughput. Measured on the ERNIE model, going from Figure 10(b) to 10(c) increases the total batch size by 32 times and improves performance by 9 times.
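
The toy scheduler below generates a Figure 10(b)-style timeline: forward passes (F) of the micro-batches flow down the pipeline stages one time step apart, and backward passes (B) flow back after the forwards finish. It only illustrates how micro-batches fill the Bubble time and does not implement the early-backward reordering of Figure 10(c).

```python
# Toy micro-batch pipeline schedule: prints which device runs which pass per step.
def pipeline_schedule(stages, micro_batches):
    timeline = {}
    for s in range(stages):
        for m in range(micro_batches):
            timeline.setdefault(s + m, []).append(f"dev{s}:F{m}")   # forward wave
    total_f = stages + micro_batches - 1
    for s in reversed(range(stages)):
        for m in range(micro_batches):
            # backward wave flows from the last stage back to the first
            timeline.setdefault(total_f + (stages - 1 - s) + m, []).append(f"dev{s}:B{m}")
    return [timeline[t] for t in sorted(timeline)]

for step, ops in enumerate(pipeline_schedule(stages=4, micro_batches=4)):
    print(step, ops)   # more micro-batches -> more devices busy at the same step
```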

▲ Figure 10 Pipeline parallel timelines

Finally, PaddlePaddle's distinctive grouped parameter sharding strategy saves GPU memory by slicing the model parameters, and it combines with the data parallel strategy into the more powerful Sharding-DP strategy. In short, the combined strategy is very flexible: users can set the parameter sharding degree (sharding_degree) and the data parallel degree (dp_degree) according to their hardware environment, as long as sharding_degree × dp_degree equals the total number of cards.

For example, suppose a user has four machines with four cards each (16 cards in total) and a 16-layer network model. If a single machine can hold the full model parameters, dp_degree=4 & sharding_degree=4 is recommended, as shown in Figure 11. The advantage of this configuration is that sharding communication only happens between cards within a machine; the drawback is that the model cannot exceed what a single machine can store.
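
The helper below (a hypothetical illustration, not a PaddlePaddle API) works out the two kinds of communication groups for the configurations in Figures 11 to 13: the cards inside one sharding group each hold a different slice of the parameters, while cards in the same data-parallel group hold the same slice and exchange gradients.

```python
# Derive Sharding-DP groups for a given card count and degree settings.
def sharding_dp_groups(total_cards, sharding_degree, dp_degree):
    assert sharding_degree * dp_degree == total_cards, \
        "sharding_degree x dp_degree must equal the total number of cards"
    # Consecutive cards form one sharding group (e.g. the cards of one machine).
    sharding_groups = [list(range(g * sharding_degree, (g + 1) * sharding_degree))
                       for g in range(dp_degree)]
    # Cards holding the same parameter shard form one data-parallel group.
    dp_groups = [[g * sharding_degree + i for g in range(dp_degree)]
                 for i in range(sharding_degree)]
    return sharding_groups, dp_groups

# Figure 11: dp_degree=4 & sharding_degree=4 on 16 cards.
shard_g, dp_g = sharding_dp_groups(16, sharding_degree=4, dp_degree=4)
print(shard_g[0])   # [0, 1, 2, 3]  -> one machine jointly holds one full model copy
print(dp_g[0])      # [0, 4, 8, 12] -> cards that exchange gradients for shard 0

# Figure 12: dp_degree=2 & sharding_degree=8 doubles the supported model size.
print(sharding_dp_groups(16, sharding_degree=8, dp_degree=2)[0][0])
```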

▲ Figure 11 Sharding-DP diagram of dp_degree=4 & sharding_degree=4

If the model is larger than a single machine can hold, that is still not a problem: the user can flexibly choose dp_degree=2 & sharding_degree=8, as shown in Figure 12. Compared with the previous configuration, this one supports model parameters twice as large.

▲ Figure 12 Sharding-DP diagram of dp_degree=2 & sharding_degree=8

In special cases where the model parameters are so large that even half of the machines cannot hold them, dp_degree=1 & sharding_degree=16 can be used, i.e., the full set of model parameters is spread across all machines; this is exactly the standard ZeRO-DP [4] approach, shown in Figure 13. In this configuration the number of cross-machine communications is very high, which significantly hurts training speed. Sharding-DP can thus be seen as an enhancement of ZeRO-DP: it lets users handle most training tasks in a more efficient way and reserves the fully sharded configuration for these special scenarios.

▲ Figure 13 Sharding-DP diagram of dp_degree=1 & sharding_degree=16

Finally, from the perspective of theoretical performance, we compare several hybrid parallel strategies: DP2+PP32+Sharding2+MP4, PP64+Sharding2+MP4, and DP2+PP32+MP8. As shown in the table below, compared with the two 3D schemes, the 4D hybrid parallel strategy does not significantly increase the communication volume or the Bubble time (for the detailed derivation and examples, see the related tutorial [5]), yet it greatly increases the number of data-parallel replicas!
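
As a quick sanity check, multiplying the parallel degrees of each configuration gives the number of GPUs it occupies; all three schemes use the same 512 cards, matching the 64-machine, 8-card cluster used in the test below.

```python
# Verify that the three configurations occupy the same number of GPUs.
configs = {
    "DP2+PP32+Sharding2+MP4": {"dp": 2, "pp": 32, "sharding": 2, "mp": 4},
    "PP64+Sharding2+MP4":     {"dp": 1, "pp": 64, "sharding": 2, "mp": 4},
    "DP2+PP32+MP8":           {"dp": 2, "pp": 32, "sharding": 1, "mp": 8},
}
for name, c in configs.items():
    total = c["dp"] * c["pp"] * c["sharding"] * c["mp"]
    print(f"{name}: {total} cards")   # each prints 512 = 64 machines x 8 cards
```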

Validation tests

The theoretical analysis above suggests that the 4D hybrid parallel strategy should perform better, but what about in practice? We entered the field-test stage: using a cluster of 64 machines with 8 V100 GPUs each, we verified the training performance of the different strategy combinations, with the "heavyweight" 230-billion-parameter ERNIE model as the test subject. The 4D hybrid parallel strategy trains faster than the other two 3D hybrid parallel strategies, reaching 8698 tokens/s, at least 23.7% faster.

Closing remarks

PaddlePaddle has been researching distributed training technology since its initial design, in order to handle the training of large-scale parameter models. Driven by rich search and recommendation businesses, PaddlePaddle's parameter server mode has gone through three generations: the earliest pure-CPU parameter server could already train trillion-parameter sparse models; later, driven by business needs and cutting-edge technology, the pure-GPU parameter server mode with stronger compute emerged; and the newly launched, industry-first heterogeneous parameter server mode supports even more scenarios and greatly improves hardware utilization. For large-scale dense parameter models, PaddlePaddle's distributed training technology is also tightly integrated with the business: its collective communication mode supports distributed training of the 230-billion-parameter Wenxin ERNIE model through the latest 4D hybrid parallel strategy. Today, PaddlePaddle has begun researching the next generation of distributed technology that can train both super-large dense and super-large sparse parameter models. Driven by real industrial applications, PaddlePaddle's distributed training will surely be a North Star in the sea of stars, guiding the course for developers.

[1] Zhao W, Xie D, Jia R, et al. Distributed hierarchical GPU parameter server for massive scale deep learning ads systems. arXiv preprint arXiv:2003.05622, 2020.

[2] Shoeybi M, Patwary M, Puri R, et al. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

[3] Huang Y, Cheng Y, Bapna A, et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965, 2018.

[4] Rajbhandari S, Rasley J, Ruwase O, et al. ZeRO: Memory optimizations toward training trillion parameter models. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020: 1-16.

[5] Related tutorial:

fleet-x.readthedocs.io/en/latest/p…

If you encounter any problems while using it, you can join the official QQ group for discussion: 778260830.

If you want to learn more about PaddlePaddle, please refer to the following documentation.

PaddlePaddle official website:

www.paddlepaddle.org.cn/

PaddlePaddle open source framework project addresses:

GitHub:

github.com/PaddlePaddl…

Gitee:

gitee.com/paddlepaddl…