Recently, the Alibaba Cloud PAI team and the DAMO Academy Intelligent Computing Lab released a "low-carbon version" of the giant model M6, greatly reducing the energy consumption of training a trillion-parameter super-large model. With the help of the self-developed Whale framework, only 480 GPUs were needed to train the trillion-parameter multimodal large model M6, whose parameter count is roughly ten times the number of neurons in the human brain. Compared with trillion-parameter models previously trained overseas, energy consumption is reduced by over 80% and efficiency is improved by nearly 11 times.
Author | Wang Lin (Alibaba)    Source | Alibaba Technology official account
M6 is the first commercialized multimodal large model in China. It has cognitive and creative abilities beyond traditional AI: it is good at painting, writing, and question answering, and has broad application prospects in e-commerce, manufacturing, literature and art, and many other fields.
The following introduces the design of the Whale framework that supports trillion-parameter model training.
Model development trends and challenges
1 Model development trends
With the popularity of deep learning, model parameter scales have grown rapidly. OpenAI's data shows that:
- Before 2012, the compute used to train models doubled roughly every two years, consistent with Moore's Law;
- After 2012, the compute used to train models doubled every 3.4 months, far outpacing hardware development.
In the past year, model parameter scales have grown especially fast: Google, Nvidia, Alibaba, and the Beijing Academy of Artificial Intelligence (BAAI) have all released trillion-parameter models, and other major companies have released models with tens or hundreds of billions of parameters. At the same time, model quality improves as the parameter scale grows: Nvidia tested BERT at different parameter scales and found that model perplexity decreases as the parameter scale increases.
Google likewise found in the GShard paper that the larger the MoE Transformer model, the higher the translation quality.
2 Challenges of training large models
Large models improve model quality, but they also pose greater challenges to the training framework. For example, training a trillion-parameter model faces the following challenges:
- Difficult training: GPU memory cannot hold even one copy of the model, so data parallelism alone is not enough; the framework must provide new parallel strategies that use multiple GPUs to store and train the model. It must also offer a simple, easy-to-use interface so that users can easily obtain a distributed version of their model. A model of this scale also puts great pressure on computation and communication efficiency, and the framework must further connect to downstream tasks and support batch prediction and online inference;
- High cost: for a trillion-parameter model, the parameters alone occupy 4 TB and the gradients another 4 TB; adding optimizer states and activation tensors, the GPU memory requirement is enormous (see the rough calculation after this list). The resources used in the industry to train models of this scale, 3,072 A100 GPUs at Nvidia and 2,048 TPU v3 chips at Google, are too expensive for most teams; the question is how to reduce cost and improve efficiency, using fewer resources to reach convergence faster.
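To make the GPU memory pressure concrete, here is a rough back-of-the-envelope calculation. The fp32 storage and the two optimizer states per parameter (Adam-style) are assumptions for illustration, not figures from the M6 training setup.

```python
# Rough memory estimate for a 1-trillion-parameter model.
# Assumptions (illustrative only): fp32 storage, Adam-style optimizer
# keeping two extra fp32 states per parameter, activations ignored.
params = 1e12                 # number of parameters
bytes_per_value = 4           # fp32

weights_tb = params * bytes_per_value / 1e12        # ~4 TB of parameters
grads_tb = params * bytes_per_value / 1e12          # ~4 TB of gradients
optimizer_tb = params * 2 * bytes_per_value / 1e12  # ~8 TB of optimizer states

total_tb = weights_tb + grads_tb + optimizer_tb
print(f"weights {weights_tb:.0f} TB + grads {grads_tb:.0f} TB + "
      f"optimizer {optimizer_tb:.0f} TB = {total_tb:.0f} TB")
# ~16 TB before counting activations: roughly 500x the 32 GB of a single V100,
# so the model state must be sharded across many devices.
```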
Several distributed training frameworks already exist: Horovod, TensorFlow Estimator, and PyTorch DDP support data parallelism; GPipe, PipeDream, and PipeMare support pipeline parallelism; Mesh TensorFlow, FlexFlow, OneFlow, and MindSpore support operator splitting. But these frameworks still have shortcomings:
- Single parallel mode: many frameworks support only some parallel strategies and cannot combine them into arbitrary hybrid strategies;
- High entry barrier: implementing a distributed version of a model is difficult and costly for users, and domain expertise is required to realize an efficient distributed parallel strategy;
- High migration cost: the parallel implementations of different distributed frameworks are fragmented, and each framework has its own DSL; when users want to switch parallel strategies, they have to learn new interfaces and rewrite the model;
- Suboptimal performance: some framework implementations do not take the physical cluster environment into account.
To meet the current challenges of distributed training, we developed the distributed training framework Whale, with the following main goals:
- Unify multiple parallel strategies: support various parallel strategies and arbitrary combinations of them within one framework;
- Simple, easy-to-use interface: users only need to add a few lines of annotations to configure a parallel strategy, and the model code itself does not need to change;
- Efficient training framework: co-optimize across hardware resources, network topology, and the model to build an efficient distributed training framework.
PAI self-developed Whale framework
1 Whale architecture
We launched Whale, a high-performance distributed training framework that integrates multiple parallel strategies, to meet the challenges of distributed training from the following perspectives:
- Different parallel strategies are abstracted and encapsulated in a unified way, so that multiple parallel strategies can be supported within one distributed training framework;
- A set of distributed parallel interfaces is designed on top of TensorFlow and is fully compatible with TensorFlow; users only need to add a few lines of annotations to realize rich distributed parallel strategies;
- Scheduling and communication are optimized jointly with the model structure and the network topology to provide efficient distributed training.
The Whale framework, shown in the figure below, is mainly divided into four modules:
- API: provides a concise, easy-to-use interface that allows users to combine various hybrid parallel strategies;
- Whale IR: converts the parallel strategy into an internal representation, expressing various parallel strategies through the TaskGraph, Multi-Dimension, and VirtualDevices abstractions;
- Whale Engine: based on Whale IR, uses graph-editing tools to build the distributed execution graph;
- Runtime: converts the distributed execution graph into a TF Graph and then calls the TF Runtime to execute it.
2 Whale's easy-to-use interface
Whale provides a concise and easy-to-use interface to describe various parallel strategies. The main primitives are:
- Cluster: divides virtual devices
- Replica: data parallelism
- Stage: divides the model into TaskGraphs
- Pipeline: pipeline parallelism
- Split: operator splitting
These primitives can be combined to express various parallel strategies, for example:
- data parallelism;
- pipeline parallelism;
- pipeline parallelism + data parallelism;
- and further hybrid parallel strategies (a rough sketch of one combination follows this list).
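The original code snippets for these examples are not reproduced in this text. As a rough, pseudocode-style sketch only, the following shows how a pipeline + data parallel configuration might be written with the primitives listed above; the module name `whale`/`wh`, the exact function signatures, and the model functions are assumptions for illustration, not Whale's documented API.

```python
# Pseudocode-style sketch: primitive names follow the article, but the module
# name, signatures, and model functions are illustrative assumptions.
import whale as wh  # hypothetical module name

with wh.cluster():                            # partition physical GPUs into virtual devices
    with wh.replica():                        # data parallelism: replicate each stage
        with wh.pipeline(num_micro_batch=4):  # pipeline parallelism across stages
            with wh.stage():                  # TaskGraph 1
                features = encoder(inputs)    # placeholder model functions
            with wh.stage():                  # TaskGraph 2
                logits = classifier(features)
```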
3 Whale training process
The flow of distributed training with Whale is:
- Parallel strategy configuration: use the Whale API to configure a parallel strategy for the model; only a few lines of annotations need to be added, without modifying the model code, as shown in Section 2.2. The model can be divided into multiple TaskGraphs, and each TaskGraph can be configured with a different parallel strategy;
- Virtual resource partitioning: virtual devices are divided according to the parallel strategy, with each TaskGraph corresponding to one VirtualDevice; physical devices are then assigned to each VirtualDevice based on GPU resources and the network topology;
- Distributed execution graph: based on the parallel strategy and the resource assignment, graph-editing tools are used to edit the execution graph (graph replication, splitting, inserting communication nodes, etc.) to generate the final distributed execution graph, and the TF Runtime is then called to execute it.
Trillion-parameter M6 model pre-training
The compute required for a trillion-parameter model is enormous. To reduce it, a Mixture-of-Experts (MoE) structure is implemented in Whale. The key characteristic of MoE is sparse activation: a gating network (router) selects the top-k experts for each input (k is usually 1 or 2), which greatly reduces the compute requirement.
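To make sparse activation concrete, here is a minimal TensorFlow toy example of top-1 gating over a small expert pool. It is only for intuition and is not M6's or Whale's actual MoE implementation; all shapes and weights are made up.

```python
import tensorflow as tf

# Toy top-1 MoE routing, for intuition only.
num_experts, d_model, d_hidden = 4, 8, 16
tokens = tf.random.normal([32, d_model])                   # [batch, d_model]

gate_w = tf.random.normal([d_model, num_experts])          # gating (router) weights
expert_w = tf.random.normal([num_experts, d_model, d_hidden])

gate_logits = tf.matmul(tokens, gate_w)                    # [batch, num_experts]
gate_probs = tf.nn.softmax(gate_logits, axis=-1)
expert_idx = tf.argmax(gate_probs, axis=-1)                # top-1 expert per token
top_prob = tf.reduce_max(gate_probs, axis=-1, keepdims=True)

# Each token runs through only its selected expert, so compute grows with k,
# not with the total number of experts.
selected_w = tf.gather(expert_w, expert_idx)               # [batch, d_model, d_hidden]
out = tf.einsum('bd,bdh->bh', tokens, selected_w) * top_prob
print(out.shape)                                           # (32, 16)
```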
The Mixture of Experts (MoE) layer implemented in Whale supports expert parallelism: experts are split across multiple devices, reducing the GPU memory and compute requirements of a single device. Because data parallelism helps improve training concurrency, a hybrid strategy of data parallelism + expert parallelism is adopted to train the M6 model: the MoE layers use expert parallelism, while the other layers use data parallelism.
Whale provides a simple, easy-to-use interface for hybrid parallel training of the model: parallel strategies are configured by adding only a few lines of annotations, and the model itself does not need any modification. For the M6 model's data parallelism + expert parallelism strategy, the annotations take roughly the form sketched below.
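The original annotation snippet is not reproduced here; the following pseudocode-style sketch shows roughly what such a configuration could look like, reusing the primitive names from Section 2. The module name, the signatures, and the layer functions are illustrative assumptions, not Whale's documented API.

```python
# Pseudocode-style sketch of data parallelism + expert parallelism;
# module name, signatures, and layer functions are illustrative placeholders.
import whale as wh  # hypothetical module name

with wh.replica():            # non-MoE layers: data parallelism
    hidden = transformer_block(inputs)

with wh.split():              # MoE layer: experts split across devices (expert parallelism)
    hidden = moe_layer(hidden)

with wh.replica():            # remaining layers: data parallelism
    logits = output_layer(hidden)
```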
Meanwhile, to save training resources and improve training efficiency, various optimization techniques are provided in Whale:
GPU memory optimization:
- Auto Gradient Checkpoint: automatically selects the optimal checkpoint nodes to save activation memory (a generic sketch of the underlying technique follows this list);
- Group-wise Apply: reduces GPU memory usage during the optimizer's apply phase;
- CPU offload: offloads optimizer states and weights to CPU memory to save GPU memory;
- Communication pooling: controls the size and concurrency of communication data blocks to save communication memory.
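Whale's Auto Gradient Checkpoint selects checkpoint nodes automatically inside the framework. The sketch below only illustrates the underlying recomputation idea using stock TensorFlow's `tf.recompute_grad`; it is not Whale's implementation, and the weight tensors are made-up placeholders.

```python
import tensorflow as tf

# Generic gradient-checkpointing illustration (not Whale's automatic version):
# activations inside `block` are not kept for backprop; they are recomputed
# during the backward pass, trading extra compute for lower GPU memory.
w1 = tf.random.normal([1024, 1024]) * 0.01
w2 = tf.random.normal([1024, 1024]) * 0.01

@tf.recompute_grad
def block(x, w1, w2):
    h = tf.nn.relu(tf.matmul(x, w1))
    return tf.nn.relu(tf.matmul(h, w2))

x = tf.random.normal([64, 1024])
with tf.GradientTape() as tape:
    tape.watch([w1, w2])
    loss = tf.reduce_sum(block(x, w1, w2))
grads = tape.gradient(loss, [w1, w2])
```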
Computing and communication acceleration:
- The DP + EP (data parallelism + expert parallelism) hybrid strategy reduces the compute requirement;
- Grouped fused communication, half-precision communication, and a topology-aware All2All communication operator improve communication efficiency;
- Mixed precision and compiler optimization are combined to improve training efficiency (a generic example follows this list).
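Whale's mixed-precision and compiler optimizations are internal to the framework. As a generic point of reference only, the snippet below shows how mixed precision and XLA compilation are enabled in stock TensorFlow Keras (TF 2.8+ assumed); it is conceptually related but is not Whale's code.

```python
import tensorflow as tf

# Generic TensorFlow example of mixed precision + XLA compilation; shown only
# to illustrate the idea, not Whale's internal optimizations.
tf.keras.mixed_precision.set_global_policy("mixed_float16")  # fp16 compute, fp32 master weights

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(256,)),
    tf.keras.layers.Dense(10, dtype="float32"),  # keep logits/loss in fp32 for stability
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    jit_compile=True,  # compile the training step with XLA (TF 2.8+)
)

x = tf.random.normal([1024, 256])
y = tf.random.uniform([1024], maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=128, epochs=1)
```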
With the help of the Whale framework, pre-training of the trillion-parameter M6 model was completed on 480 V100 GPUs in three days. Compared with Nvidia's trillion-parameter model trained on 3,072 A100 GPUs and Google's 1.6-trillion-parameter model trained on 2,048 TPUs, DAMO Academy achieved the trillion-parameter M6 with only 480 V100 32 GB GPUs, saving over 80% of compute resources and improving training efficiency by nearly 11 times.
Conclusion
Model parameter scales keep growing, and large models have become a development trend. To address the challenge of training super-large models, we developed the Whale framework, which abstracts and encapsulates different parallel strategies in a unified way and supports multiple parallel strategies within one distributed training framework. Whale provides a clean, easy-to-use interface that lets users implement various parallel strategies by adding a few lines of annotations, without modifying the model itself. At the same time, we co-optimize software and hardware by combining hardware resources, network topology, and the model to provide an efficient distributed training framework.
Using the Whale framework, we trained the trillion-parameter model on 480 V100 GPUs and reached training convergence within three days, making super-large-scale model training feasible. In the future, we will further improve the Whale framework along three dimensions: larger scale, faster speed, and higher cost-effectiveness. We will also promote the use of Whale in more business scenarios and turn its technical capabilities into product capabilities.
This article is original content from Alibaba Cloud and may not be reproduced without permission.