0x00 Abstract
Parameter sharding splits model parameters evenly across GPUs, so that large-scale models can be trained with fewer GPUs. This series analyzes parameter sharding based on papers, blog posts, and code from Google, Microsoft, and Facebook, and will run to roughly 5-6 articles.
ZeRO (ZeRO: Memory Optimizations Toward Training Trillion Parameter Models) is a memory optimizer developed by Microsoft to use GPU memory efficiently. It distributes the model states (optimizer states, gradients, and model parameters) across multiple data parallel GPUs, with the goal of training multi-billion-parameter models without resorting to model parallelism.
ZeRO is a combination of ZeRO-DP and ZeRO-R. ZeRO-DP is an enhanced data parallelism mechanism that uses a dynamic communication strategy to partition optimizer states, gradients, and parameters, minimizing communication volume and eliminating redundant copies of the model states. ZeRO-R uses partitioned activation recomputation, constant-size buffers, and dynamic memory defragmentation to optimize the memory consumed by the residual states.
This article does not analyze the paper word for word; instead it selects some key points and adds my own understanding.
Other articles in this series are as follows:
PyTorch Distributed ZeroRedundancyOptimizer
0x01 Overview
DeepSpeed: Extreme-Scale Model Training for Everyone.
1.1 The challenges
First, we need to understand the GPU memory and compute efficiency challenges of training large models.
1.1.1 Memory efficiency
The GPU memory required to train a trillion-parameter model far exceeds what a single GPU provides. For example, mixed-precision training with the Adam optimizer requires approximately 16 TB of memory just to store the model states (parameters, gradients, and optimizer states). Storing the model states alone would take 400 NVIDIA A100 GPUs (40 GB of memory each).
Activations also consume additional memory, and this grows with the batch size. A trillion-parameter model trained with a batch size of just 1 would need more than 1 TB of memory to store activations. Activation checkpointing trades recomputation for memory and can bring this down to about 20 GB, but that is still far too much for training.
Therefore, both the model states and the activation memory must be partitioned efficiently across multiple GPUs for a model of this size to be trained without running out of memory.
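As a quick sanity check of these numbers, here is a back-of-the-envelope calculation in Python, assuming the usual mixed-precision Adam accounting of 16 bytes of model state per parameter (explained in detail in Section 0x04 below); the figures are illustrative only.

```python
# Rough model-state accounting for a trillion-parameter model (illustrative only).
params = 1e12                          # one trillion parameters
bytes_per_param = 2 + 2 + 4 + 4 + 4    # fp16 params + fp16 grads + fp32 params, momentum, variance

total_bytes = params * bytes_per_param
print(f"model states: {total_bytes / 1e12:.0f} TB")                       # ~16 TB

a100_capacity = 40e9                   # 40 GB per NVIDIA A100
print(f"A100 GPUs just to hold them: {total_bytes / a100_capacity:.0f}")  # ~400
```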
1.1.2 Compute efficiency
Based on estimates from OpenAI's scaling laws, training a trillion-parameter model end to end takes roughly 5,000 zettaflops (a 5 followed by 24 zeros). Training such a model would require 4,000 A100 GPUs running at 50% compute efficiency for about 100 days.
Although large supercomputing GPU clusters can have more than 4,000 GPUs, achieving high compute efficiency at this scale is still challenging because of batch size constraints. Compute efficiency increases with the ratio of computation time to communication time, and that ratio is proportional to the batch size. However, there is an upper limit on the batch size a model can be trained with; beyond it, convergence deteriorates rapidly.
1.2 Trade-offs
Let’s look at the tradeoff between data parallelism, model parallelism, and pipeline parallelism.
1.2.1 Data Parallelism
Data parallelism is a very common technique in deep learning. Each batch of training data is split evenly among the data parallel workers. After back propagation, the gradients must be reduced across workers so that the optimizer applies the same update on every worker. Data parallelism has clear advantages, including high compute efficiency and low implementation effort. However, its batch size grows with the number of workers, and we cannot keep increasing the batch size indefinitely without hurting convergence.
- Memory efficiency: Data parallelism replicates the model and the optimizer on every worker, so it is not memory efficient.
- Compute efficiency: As parallelism increases, the amount of computation each worker performs stays constant, so data parallelism scales almost linearly at small scale. However, the cost of reducing gradients across workers grows with the model size, so compute efficiency is limited when the model is large or the communication bandwidth is low. Gradient accumulation is a common strategy for amortizing this communication cost: it increases the batch size by running multiple forward and backward passes locally on micro-batches and reducing the gradients only before the optimizer update, as sketched below.
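For readers unfamiliar with gradient accumulation, the following is a minimal PyTorch sketch of the idea. The toy model, random data, and `accum_steps` value are placeholders for illustration, not anything prescribed by DeepSpeed: gradients from several micro-batches are accumulated locally, and the optimizer step (where DDP-style gradient reduction would be amortized) runs only once per effective batch.

```python
import torch
from torch import nn

# Minimal gradient-accumulation sketch (toy model and random data stand in
# for a real training loop; this is not DeepSpeed's implementation).
model = nn.Linear(128, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4                       # micro-batches per optimizer step

optimizer.zero_grad()
for step in range(16):
    x = torch.randn(8, 128)           # one micro-batch of inputs
    y = torch.randint(0, 10, (8,))
    loss = loss_fn(model(x), y) / accum_steps   # scale so the sum is an average
    loss.backward()                             # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()              # under DDP, gradient all-reduce normally runs per backward;
        optimizer.zero_grad()         # with no_sync() it can be deferred to this boundary
```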
1.2.2 Model parallelism
Model parallelism is another broad class of techniques. It partitions the individual layers of the model across multiple workers. By its nature, the computation and communication of model parallelism are specific to the model architecture, so it takes significant implementation effort. DeepSpeed leverages NVIDIA's Megatron-LM to provide large-scale model parallelism for Transformer-based language models. Model parallelism reduces memory usage in proportion to the number of workers, which makes it the most memory efficient of the three forms of parallelism, but it comes at the cost of the lowest compute efficiency.
- Memory efficiency: Model parallelism reduces memory usage in proportion to the number of workers. Crucially, it is the only approach that reduces the activation memory of an individual layer. DeepSpeed further improves memory efficiency by partitioning the activation memory among the model parallel workers.
- Compute efficiency: Model parallelism has low compute efficiency because of the extra communication needed to pass activations in every forward and backward pass. It requires high communication bandwidth and does not scale well beyond a single node, where bandwidth is limited. In addition, each model parallel worker performs less computation between communication steps, which further hurts compute efficiency. Model parallelism is often combined with data parallelism to trade off memory against compute efficiency.
1.2.3 Pipeline Parallelism
Pipeline parallelism divides the layers of a model into stages that can be processed in parallel. When a stage finishes the forward pass of a micro-batch, it sends the activations to the next stage in the pipeline. Similarly, when the next stage finishes its backward pass, the gradients are passed backward through the pipeline. To keep all pipeline stages busy in parallel, multiple micro-batches must be in flight at once. Several implementations, such as PipeDream, make different trade-offs among memory, compute efficiency, and convergence behavior. DeepSpeed's pipeline parallelism uses gradient accumulation, so with the same total batch size it reaches the same convergence as conventional data parallel and model parallel training.
- Memory efficiency: The memory reduction from pipeline parallelism is proportional to the number of pipeline stages, which allows the model size to scale linearly with the number of workers. However, pipeline parallelism does not reduce the activation memory footprint of each layer. In addition, each worker must store the activations for every micro-batch in flight, so the first stage of the pipeline ends up holding roughly as much activation memory as the total activation memory of a single micro-batch.
- Compute efficiency: Pipeline parallelism has the lowest communication volume of the three, since it only communicates the activations at the stage boundaries between layers. However, it cannot scale indefinitely. As with model parallelism, increasing the pipeline depth reduces the computation per pipeline stage, which lowers the computation-to-communication ratio. Good compute efficiency also requires perfect load balancing of the computation across stages.
In addition, pipeline parallelism incurs bubble overhead at the start and end of every batch, because the pipeline must be filled and drained. Training with gradient accumulation steps (and therefore batch sizes) of 4x or 8x the number of pipeline stages achieves 81% and 90% scaling efficiency, respectively, relative to a single pipeline stage.
1.3 Achieving memory and compute efficiency with 3D parallelism
Data, model, and pipeline parallelism each play a specific role in improving memory and compute efficiency, so DeepSpeed combines these three powerful techniques to train trillion-scale models and scale to thousands of GPUs. The symbiosis of the three strategies addresses both fundamental challenges of trillion-parameter training, memory efficiency and compute efficiency, at the same time, pushing the achievable training scale far beyond what each strategy can reach alone. As a result, DeepSpeed can fit enormous models in GPU memory without sacrificing speed.
Memory efficiency: The layers of the model are first divided into pipeline stages, and the layers within each stage are then further split via model parallelism. This 2D combination simultaneously reduces the memory consumed by the model, the optimizer, and the activations. However, we cannot keep partitioning the model without introducing communication overhead, which eventually limits compute efficiency.
Compute efficiency: To scale the number of workers beyond what model and pipeline parallelism support without sacrificing compute efficiency, DeepSpeed uses ZeRO-powered data parallelism (ZeRO-DP). ZeRO-DP not only further improves memory efficiency by partitioning the optimizer states, but also scales to an arbitrary number of GPUs with minimal communication overhead thanks to its communication-topology-aware mapping.
Figure 1: An example of 3D parallelism with 32 workers. The layers of the neural network are divided among four pipeline stages. The layers within each pipeline stage are further split among four model parallel workers. Finally, each pipeline stage is replicated across two data parallel instances, and ZeRO partitions the optimizer states between them.
The figure below shows the communication-topology-aware 3D mapping: by exploiting two key properties of the hardware architecture, each dimension of 3D parallelism is carefully mapped onto the workers to maximize compute efficiency.
- Optimizing for intra-node and inter-node bandwidth: Model parallelism has the largest communication overhead of the three strategies, so model parallel groups are placed within a node to exploit the higher intra-node bandwidth; here, tensor-slicing model parallelism based on NVIDIA Megatron-LM is used. When a model parallel group does not occupy all the workers in a node, data parallel groups are placed within the same node; otherwise, data parallelism runs across nodes. Pipeline parallelism has the lowest communication volume, so pipeline stages can be scheduled across nodes without being limited by the inter-node bandwidth.
- Increasing effective bandwidth through parallel communication: The amount of gradient data each data parallel group must communicate decreases linearly with the degrees of pipeline and model parallelism, so the total 3D communication volume is lower than with pure data parallelism. In addition, each data parallel group communicates independently among a small, local subset of workers, and the groups can communicate in parallel with one another. As a result, the effective bandwidth for data parallel communication is amplified by the reduced communication volume and the increased locality and parallelism.
Figure 2: The mapping of the workers in Figure 1 onto the GPUs of a system with eight nodes (four GPUs per node). GPUs of the same color are on the same node.
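To make the layout concrete, here is a toy Python sketch that assigns the 32 workers of Figure 1 to coordinates along the three dimensions (4 pipeline stages × 4 model parallel workers × 2 data parallel replicas). It is only an illustration of the grouping; the real DeepSpeed/Megatron topology mapping is communication-topology aware and more involved.

```python
# Toy layout of 32 worker ranks along the three parallel dimensions of Figure 1.
MP, DP, PP = 4, 2, 4                 # model-, data-, pipeline-parallel degrees

def coords(rank):
    mp = rank % MP                   # fastest-varying: keep MP groups inside a node
    dp = (rank // MP) % DP           # data-parallel replicas placed next
    pp = rank // (MP * DP)           # slowest-varying: pipeline stages may span nodes
    return pp, mp, dp

for rank in range(MP * DP * PP):
    pp, mp, dp = coords(rank)
    print(f"rank {rank:2d}: pipeline stage {pp}, model-parallel rank {mp}, data-parallel replica {dp}")
```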
1.4 How 3D parallelism leverages each form of parallelism
A trillion-parameter model can be scaled across 4,096 NVIDIA A100 GPUs using 8-way model parallelism, 64-way pipeline parallelism, and 8-way data parallelism.
- By combining model parallelism and pipeline parallelism, 3D parallelism achieves excellent memory efficiency and high compute efficiency across multiple nodes. Model parallelism provides memory-efficient storage of activations and model states within a node, while pipeline parallelism (compared with model parallelism alone) stores model states efficiently across nodes without sacrificing compute efficiency.
- By combining model parallelism with pipeline parallelism, pipeline parallelism can run at high compute efficiency with minimal bubble overhead even at very small batch sizes. With 8-way model parallelism, using one micro-batch per model results in an effective micro-batch size of 1/8 per GPU. Therefore, pipeline parallelism can reach 90% compute efficiency with a gradient accumulation step of 8x the pipeline parallelism degree, while the total accumulated batch size per GPU is only 1. Combined with data parallelism, this gives a total effective batch size of 4,096 on 4,096 GPUs while still achieving 90% pipeline efficiency.
But what about the compute efficiency of data parallelism? Doesn't data parallelism need a large batch size per GPU to be efficient?
Model parallelism can shrink the effective batch size per GPU to below 1, which lets pipeline parallelism hide the pipeline bubble overhead even with small batches. Note that with cross-node pipeline parallelism, the data parallel communication at each pipeline stage happens independently and in parallel with the other pipeline stages. In the fully connected network topologies common in high-end GPU clusters, this has a significant effect on the effective bandwidth available for data parallel training: since each node in a pipeline stage can communicate with its corresponding data parallel nodes in parallel, the effective communication bandwidth is proportional to the number of pipeline stages. With 64 pipeline stages, the effective bandwidth is 64 times the bandwidth to and from a single node. With such high effective bandwidth, data parallelism scales well even at small batch sizes where the computation-to-communication ratio is very low.
0x02 Introduction to the paper
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
2.1 Abstract of the original paper
Large deep learning models can deliver significant accuracy gains, but training models with billions to trillions of parameters is challenging because a single GPU cannot hold the model and its states. Existing solutions such as data and model parallelism across GPUs have fundamental limitations: they trade off among compute, communication, and development efficiency, yet all face the same basic problem that the model must fit in limited device memory.
The authors develop a new solution, the Zero Redundancy Optimizer (ZeRO), to optimize memory, greatly increasing training speed while increasing the model size that can be trained efficiently. ZeRO seeks a middle ground between data parallelism and model parallelism: it eliminates the memory redundancy of data and model parallel training while keeping communication volume low and computational granularity high, allowing model size to scale in proportion to the number of devices with sustained efficiency. ZeRO thus captures the benefits of both: the same memory budget can hold larger models, and data parallelism can train models that previously required model parallelism.
2.2 Introduction
Plain data parallelism (DP) does not reduce per-device memory, while other existing solutions such as pipeline parallelism (PP), model parallelism (MP), and CPU offloading trade off functionality, usability, memory, and compute/communication efficiency. Of the existing solutions for training large models, MP is probably the most promising, yet it cannot scale much further. MP splits the model vertically, partitioning the computation and parameters of each layer across devices, which requires substantial communication between layers. As a result, MP works well within a single node where inter-GPU bandwidth is high, but its efficiency degrades rapidly across nodes, so it cannot scale effectively beyond a single node.
So how do we overcome the limitations of existing solutions and train large models more effectively? To answer this question, we first analyze the total memory consumption of existing systems during model training and divide it into two parts:
- For large models, most of the memory is taken up by model states, including optimizer states (such as momentum and variance in Adam), gradients, and parameters.
- The remaining memory is consumed by activations, temporary buffers, and unusable memory fragments, which we collectively call the residual states.
To address both, we developed ZeRO (Zero Redundancy Optimizer), which optimizes memory efficiency for each of these two parts while maintaining high compute and communication efficiency.
2.2.1 Optimizing model states
Model states typically consume the largest amount of memory during training, but existing methods such as DP and MP do not provide a satisfactory solution.
DP has good compute/communication efficiency but poor memory efficiency, while MP has the opposite problem. More specifically, DP replicates the entire model state in every data parallel process, causing redundant memory consumption; MP partitions these states for good memory efficiency, but tends to produce overly fine-grained computation and expensive communication, which hurts scaling efficiency. In addition, both approaches statically keep all the model states needed over the entire training process, even though not all states are needed at all times during training.
Based on these observations, we developed ZeRO-DP, which achieves the memory efficiency of MP while retaining the compute/communication efficiency of DP. ZeRO-DP eliminates redundant memory across the data parallel processes by partitioning the model states rather than replicating them, so that the memory consumption on each GPU is inversely proportional to the data parallel degree. It uses dynamic communication scheduling during training to keep the computation granularity and communication volume essentially the same as DP, thereby preserving DP's compute/communication efficiency.
During model training, most of the memory is consumed in one of three ways:
- Activations.
- OGP model states: tensors composed of the optimizer states (O), parameter gradients (G), and the parameters themselves (P).
- Temporary buffers.
One might ask why the memory consumed by the input data is not considered. In practice it is small: training data is generally read through an iterator, so it is not loaded into memory all at once, and the memory occupied by each input batch is negligible compared with the network parameters.
ZeRO-DP has three main optimization stages (shown in Figure 1 below), corresponding to the partitioning of optimizer states, gradients, and parameters. Enabled progressively:
1) Optimizer state partitioning ($P_{os}$): memory is reduced 4x, with the same communication volume as DP. This stage is also called ZeRO-OS.
2) Adding gradient partitioning ($P_{os+g}$): memory is reduced 8x, with the same communication volume as DP.
3) Adding parameter partitioning ($P_{os+g+p}$): memory reduction is linear in the DP degree. The model states are spread evenly across the GPUs, so per-GPU memory consumption is inversely proportional to the data parallel degree, while communication volume grows only modestly. For example, splitting across 64 GPUs ($N_d = 64$) yields a 64x memory reduction at the cost of a modest 50% increase in communication volume.
The memory consumption is quantified in the figure below:
Figure 1: Comparison of per-device memory consumption across the three stages of ZeRO-DP optimization. $\psi$ is the model size (number of parameters), $K$ is the memory multiplier of the optimizer states, and $N_d$ is the DP degree (i.e., $N_d$ GPUs). The example assumes a model size of $\psi = 7.5\text{B}$, $N_d = 64$, and $K = 12$ for mixed-precision training with Adam.
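The per-GPU numbers in Figure 1 can be reproduced from the formulas derived later in Section 0x06. Here is a small Python check of those formulas, using the paper's $\psi$, $K$, and $N_d$ notation (a sketch for verification, not part of ZeRO itself):

```python
# Reproduces the per-GPU model-state numbers in Figure 1 using the ZeRO formulas:
# baseline (2+2+K)*psi, P_os 4*psi + K*psi/Nd, P_os+g 2*psi + (2+K)*psi/Nd,
# P_os+g+p (2+2+K)*psi/Nd.  psi = 7.5B parameters, K = 12 (mixed-precision Adam), Nd = 64.
psi, K, Nd = 7.5e9, 12, 64

baseline = (2 + 2 + K) * psi
p_os     = (2 + 2) * psi + K * psi / Nd
p_os_g   = 2 * psi + (2 + K) * psi / Nd
p_os_g_p = (2 + 2 + K) * psi / Nd

for name, b in [("baseline DP", baseline), ("P_os", p_os),
                ("P_os+g", p_os_g), ("P_os+g+p", p_os_g_p)]:
    print(f"{name:12s}: {b / 1e9:6.1f} GB per GPU")
# baseline 120.0 GB, P_os 31.4 GB, P_os+g 16.6 GB, P_os+g+p 1.9 GB
```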
2.2.2 Optimizing residual memory
Once ZeRO-DP optimizes the memory used by the model states, the residual state memory, consisting of activations, temporary buffers, and unusable memory fragments, becomes the secondary memory bottleneck. We developed ZeRO-R to optimize the memory consumed by each of these three factors.
- For activations (stored from the forward pass to support the backward pass), we note that activation checkpointing helps, but is not sufficient for large models. ZeRO-R therefore optimizes activation memory by identifying and removing the replicated activations in existing MP schemes, and it can also offload activations to the CPU when appropriate.
- ZeRO-R defines an appropriate size for temporary buffers to strike a balance between memory and compute efficiency.
- We observe memory fragmentation during training because different tensors have different lifetimes. Because of this fragmentation, allocation can fail for lack of contiguous memory even when enough total memory is available. ZeRO-R proactively manages memory according to the tensors' different lifetimes to prevent fragmentation.
ZeRO-DP and ZeRO-R together form a powerful memory optimization system for DL training, which we collectively call ZeRO.
2.2.3 ZeRO and MP
Because ZeRO eliminates the memory inefficiency of DP, it is natural to ask: do we still need MP, and when? And how does ZeRO work together with MP?
With ZeRO, MP becomes less attractive for large models. ZeRO-DP is at least as effective as MP at reducing per-device memory footprint, and sometimes more effective when MP cannot divide the model evenly. It also has comparable or better scaling efficiency. Furthermore, data parallelism is easy to use and applies broadly across workloads, whereas today's MP approaches typically require extra work from model developers to modify their models, and existing frameworks such as Megatron-LM support only a limited set of operators and models.
However, there are still a few situations where we want to take advantage of MP:
i) When used together with ZeRO-R, MP can reduce the activation memory footprint of very large models.
ii) For smaller models where activation memory is not an issue, MP can still help when DP alone would have too large an aggregate batch size for good convergence. In that case, ZeRO can be combined with MP to keep the aggregate batch size at an acceptable level.
0x03 Related work
3.1 Data, model and pipeline parallelism
Parallelization is a key strategy for training large models. For a model that fits in device memory, data parallelism (DP) scales training to multiple devices: the model parameters are replicated on each device, and at every step a mini-batch is split evenly across all data parallel processes, so each process runs forward and backward propagation on a different subset of data samples and updates the model locally with the gradient averaged across processes.
When a model does not fit into device memory, model parallelism (MP) and pipeline parallelism (PP) split models vertically and horizontally between processes, respectively.
PP splits a model horizontally across layers, runs each partition on a different device, and uses micro-batching to hide pipeline bubbles. Features such as tied weights and batch normalization are difficult to implement because of the horizontal splitting and micro-batching.
Popular PP implementations such as GPipe partition both the model parameters and the total activations, but require a batch size proportional to the number of pipeline partitions to hide the pipeline bubble. The large batch size can affect convergence, and PP also requires a large amount of memory to store activations.
PipeDream is another PP implementation; it keeps multiple copies of stale parameters to hide the pipeline bubble without increasing the batch size much, which makes it less memory efficient. Moreover, this approach is not equivalent to standard DL training and has implications for training convergence.
By contrast, ZeRO achieves the same or better memory efficiency as PP without PP's limitations on functionality, performance, and convergence.
3.2 Non-parallelism-based work
Original subtitle: Non-Parallelism Based Approach to Reduce Memory.
In addition to MP and PP, there is a lot of work aimed at reducing the memory overhead for DL training.
3.2.1 Reducing activation memory
A great deal of work focuses on reducing the memory footprint of activations, through compression, activation checkpointing, or live analysis. These efforts are complementary to ZeRO and can work alongside it. In fact, the activation memory reduction in ZeRO-R works together with activation checkpointing.
3.2.2 CPU Offload
Other efforts exploit the heterogeneity of compute nodes to offload model states to CPU memory, through algorithm design or virtualized memory, respectively. But up to 50% of training time can then be wasted on GPU-CPU-GPU transfers. ZeRO differs in that it significantly reduces memory consumption without storing model states in CPU memory. In rare cases, ZeRO-R may offload only the activation checkpoints of very large models to improve performance.
3.2.3 Memory-efficient optimizers
Other work reduces the memory consumption of adaptive optimization methods by maintaining coarse-grained statistics of model parameters and gradients, potentially at the cost of convergence guarantees. ZeRO is orthogonal to these efforts: its optimizations do not change the optimization method or affect convergence, but effectively reduce the per-device memory footprint of optimizer states and gradients.
3.3 Training the optimizer
Adaptive optimization methods are crucial for reaching SOTA performance and accuracy on large models. Compared with SGD, they maintain fine-grained first- and second-order statistics for every model parameter and gradient, at the cost of a significant memory footprint. ZeRO can reduce the memory footprint of these optimizers by orders of magnitude, making these sophisticated optimization methods practical for training large models on hardware with modest device memory. It also makes it possible to develop and use even more complex and memory-hungry optimizers with better convergence.
0x04 Where does the model memory go?
Let's take a step back and look at the memory consumption of current training systems. For example, a 1.5B-parameter GPT-2 model needs only 3 GB of memory for its 16-bit weights (parameters), yet it cannot be trained on a 32 GB GPU with TensorFlow or PyTorch. One might wonder where all the memory goes. During training, most of the memory is consumed by the model states: the tensors holding the optimizer states, gradients, and parameters. Beyond the model states, the rest of the memory is consumed by activations, temporary buffers, and fragmentation, which we call the residual states. We examine memory consumption in both of these areas in detail.
4.1 Model states: optimizer states, gradients and parameters
The original subtitle: Model States: Optimizer States, Gradients and Parameters
During training, most of the device memory is consumed by the model states. Take Adam, one of the most popular optimizers in DL training: to compute updates, Adam needs to store two optimizer states, i) the time-averaged momentum and ii) the variance of the gradients. Training a model with Adam therefore requires enough memory to hold the momentum and variance of the gradients, plus memory for the gradients and the weights themselves. Of these three parameter-related types of tensors, the optimizer states usually consume the most memory, especially with mixed-precision training.
4.1.1 Mixed precision training
The state-of-the-art way to train large models on the current generation of NVIDIA GPUs is mixed-precision (fp16/32) training, in which parameters and activations are stored in fp16, enabling the high-throughput tensor cores on these GPUs. During mixed-precision training, forward and backward propagation are performed with fp16 weights and activations. However, to compute and apply the weight updates effectively at the end of the backward pass, the mixed-precision optimizer must keep an fp32 copy of the parameters as well as fp32 copies of all other optimizer states.
Take Adam as an example. Mixed-precision training of a model with $\psi$ parameters requires enough memory to store fp16 copies of the parameters and gradients, which need $2\psi$ bytes and $2\psi$ bytes respectively. In addition, it must store the optimizer states: fp32 copies of the parameters, the momentum, and the variance, which need $4\psi$, $4\psi$, and $4\psi$ bytes respectively.
Let $K$ denote the memory multiplier of the optimizer states, i.e., the extra memory needed to store them is $K\psi$ bytes. For mixed-precision Adam, $K = 12$. In total, the memory requirement is $2\psi + 2\psi + K\psi = 16\psi$ bytes. For a GPT-2-like model with 1.5 billion parameters, this is at least 24 GB of memory, far more than the 3 GB needed to store the fp16 parameters alone.
4.2 Residual memory consumption
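As a generic illustration of this fp16/fp32 split (not DeepSpeed's or Megatron's own mixed-precision engine), standard PyTorch AMP keeps fp32 master state in the optimizer while running the forward and backward passes in fp16:

```python
import torch
from torch import nn

# Generic PyTorch AMP sketch: fp16 forward/backward, fp32 optimizer state.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # fp32 params, momentum, variance
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(16, 1024, device=device)
with torch.autocast(device_type=device, dtype=torch.float16, enabled=(device == "cuda")):
    loss = model(x).square().mean()        # fp16 activations / matmuls on GPU
scaler.scale(loss).backward()              # fp16 gradients, loss-scaled to avoid underflow
scaler.step(optimizer)                     # unscales, then applies the fp32 update
scaler.update()
```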
Originally titled Residual Memory Consumption
4.2.1 Activation
Activations take up substantial memory during training. As a concrete example, training the 1.5B-parameter GPT-2 model with a sequence length of 1K and a batch size of 32 requires approximately 60 GB of memory for activations. Activation checkpointing (also called activation recomputation) is a common technique that reduces activation memory roughly to the square root of the total activations at the cost of about 33% recomputation; it brings the activation memory of this model down to about 8 GB.
Although this is a significant reduction, activation memory can still become very large for bigger models. For example, a GPT-like model with 100 billion parameters needs approximately 60 GB of activation memory at a batch size of 32, even with activation checkpointing.
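For reference, generic activation checkpointing is available directly in PyTorch. The sketch below (a toy 8-layer MLP, purely illustrative, not DeepSpeed's checkpointing engine; `use_reentrant` needs a recent PyTorch) stores activations only at segment boundaries and recomputes the rest during the backward pass:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy model: 8 small blocks, checkpointed in 4 segments.
model = nn.Sequential(*[nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(8)])
x = torch.randn(32, 256, requires_grad=True)

out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()                 # activations inside each segment are recomputed here
```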
4.2.2 Temporary buffer
For large models, the temporary buffers used to hold intermediate results consume a nontrivial amount of memory. Operations such as gradient all-reduce or gradient norm computation tend to fuse all gradients into a single flat buffer before running, which improves throughput: for example, the effective bandwidth of all-reduce improves as the message size grows. While the gradients themselves are usually stored as fp16 tensors, the fused buffer may be an fp32 tensor, depending on the operation. These buffers matter when the model is large: a flat fp32 buffer for a 1.5B-parameter model requires 6 GB of memory.
4.2.3 Memory Fragmentation
So far we have discussed actual memory consumption during training. In addition, it is possible to run out of usable memory even when enough memory is nominally available. Memory fragmentation causes this: a memory request fails if there is not enough contiguous memory to satisfy it, even when the total free memory exceeds the request. When training very large models we observe significant fragmentation, leading to out-of-memory errors, in some extreme cases even with more than 30% of memory still free.
0x05 ZeRO: Insight and Overview
ZeRO has two sets of optimizations: i) ZeRO-DP, which reduces the memory footprint of the model states, and ii) ZeRO-R, which reduces the residual memory consumption.
5.1 Insights and Overview: Zero-DP
ZeRO-powered DP is based on three key insights:
- DP has better scaling efficiency than MP, because MP reduces the computation granularity while increasing the communication overhead. Beyond a certain point, the lower computational granularity reduces per-GPU efficiency, while the increased communication overhead hinders scalability across GPUs, especially across node boundaries. DP, by contrast, has higher computational granularity and lower communication volume, and therefore much higher efficiency.
- DP is memory inefficient because the model states are stored redundantly in every data parallel process. MP, in contrast, partitions the model states to be memory efficient.
- Both DP and MP keep all the model states for the entire training process, even though they are not all needed all the time. For example, the parameters of a given layer are needed only during that layer's forward and backward propagation.
Based on these insights, ZeRO-DP achieves the memory efficiency of MP while retaining the training efficiency of DP. ZeRO-DP partitions the model states instead of replicating them, and uses a dynamic communication schedule that exploits the temporal nature of model states while minimizing communication volume. In this way, ZeRO-DP reduces the per-device memory footprint of the model linearly with the DP degree while keeping the communication volume close to that of default DP, thus maintaining efficiency.
5.2 Insights and Overview: Zero-R
5.2.1 Reducing activation memory
The two key insights are:
- MP partitions the model states but often requires replicating the activation memory. For example, if we split the parameters of a linear layer vertically and compute it in parallel on two GPUs, each GPU needs the entire input activation to compute its partition.
- For GPT-2 and larger models, the arithmetic intensity (the ratio of computation per iteration to activation checkpoint volume per iteration) is very large (≥ 10K) and grows linearly with the hidden dimension, which makes it possible to hide the data-movement cost of the activation checkpoints even when bandwidth is low.
ZeRO removes the activation memory redundancy in MP by partitioning the activation checkpoints across GPUs and reconstructing them on demand with all-gather. The activation memory reduction is proportional to the MP degree. For very large models, ZeRO can even move the activation partitions to CPU memory and still achieve good efficiency, thanks to the high arithmetic intensity of these models.
5.2.2 Managing temporary Buffers
ZeRO-R uses constant-size buffers so that temporary buffers do not blow up as the model grows, while keeping them large enough to remain efficient.
5.2.3 Managing Memory Fragments
Memory fragmentation results from the interleaved allocation of short-lived and long-lived memory objects. During forward propagation, activation checkpoints are long-lived while the recomputed activations are short-lived. Similarly, during the backward pass, activation gradients are short-lived while parameter gradients are long-lived. Based on this observation, ZeRO performs on-the-fly memory defragmentation by moving activation checkpoints and gradients into pre-allocated contiguous memory buffers. This not only increases usable memory but also improves efficiency by reducing the time the memory allocator spends searching for contiguous free memory.
0x06 A deeper look at ZeRO-DP
While existing DP replicates the model states on every device, incurring significant memory overhead, ZeRO-DP eliminates this redundancy by partitioning them (optimizer states, gradients, and parameters) across the data parallel processes. Figure 1 quantifies and visualizes the memory requirements with and without ZeRO-DP, showing the cumulative memory footprint after partitioning (1) the optimizer states, (2) the gradients, and (3) the parameters. We refer to these as the three optimization stages of ZeRO-DP: $P_{os}$, $P_g$, and $P_p$, detailed below. I repost Figure 1 here.
6.1 $P_{os}$: Optimizer state partitioning
For DP of degree $N_d$, we group the optimizer states into $N_d$ equal partitions so that the $i$-th data parallel process updates only the optimizer states of the $i$-th partition. Thus, each data parallel process needs to store and update only $\frac{1}{N_d}$ of the total optimizer states and then updates only $\frac{1}{N_d}$ of the parameters. At the end of each training step, we perform an all-gather across the data parallel processes to obtain the fully updated parameters on all processes.
In the concrete example of Figure 1, a 7.5B-parameter model with 64-way DP ($N_d = 64$) requires 31.4 GB of memory with $P_{os}$, compared with 120 GB for standard DP. Moreover, when $N_d$ is large, the model-state memory requirement drops from $4\psi + 12\psi = 16\psi$ bytes to $4\psi + \frac{12\psi}{N_d} \approx 4\psi$ bytes, a 4x reduction.
6.2 $P_g$: Gradient partitioning
Since each data parallel process only updates its own parameter partition, each process only needs the reduced gradients for that partition. After the reduction, each process keeps only the gradients corresponding to its parameter partition; the rest are no longer needed and their memory can be freed. This reduces the gradient memory footprint from $2\psi$ bytes to $\frac{2\psi}{N_d}$ bytes.
In effect, this is a reduce-scatter operation: the gradients of different parameter partitions are reduced onto different processes. To make it efficient, we use a bucketization strategy: all gradients corresponding to a particular partition are bucketized together and the reduction is performed on the whole bucket at once. In our implementation, we perform a reduce rather than an all-reduce at each partition boundary, reducing the memory footprint and overlapping computation with communication.
Memory savings: By eliminating the gradient and optimizer state redundancy, we further reduce the memory footprint to $2\psi + \frac{14\psi}{N_d} \approx 2\psi$. As shown in the example in Figure 1, a 7.5B-parameter model with $P_{os+g}$ and 64-way DP ($N_d = 64$) requires only 16.6 GB of memory, compared with 120 GB for standard DP. When $N_d$ is large, the model-state memory requirement drops from $2\psi + 14\psi = 16\psi$ bytes to $2\psi + \frac{14\psi}{N_d} \approx 2\psi$ bytes, an 8x reduction.
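The collective at the heart of $P_g$ can be sketched with raw PyTorch distributed primitives. The snippet below is illustrative only (toy sizes, hypothetical launch via `torchrun`, NCCL backend required for reduce-scatter) and is not ZeRO's implementation:

```python
import torch
import torch.distributed as dist

# Sketch of the P_g idea: instead of all-reduce (every rank gets the full reduced
# gradient), reduce-scatter leaves each rank with only its own shard.
# Launch with e.g. `torchrun --nproc_per_node=4 this_file.py` on a multi-GPU node.
dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank)                           # single-node: rank == local rank

full_grad = torch.full((8 * world,), float(rank + 1), device="cuda")  # pretend local gradient
my_shard = torch.empty(8, device="cuda")              # this rank's partition of the reduced gradient

dist.reduce_scatter_tensor(my_shard, full_grad)       # sum-reduce, keep shard `rank`, free the rest
print(f"rank {rank}: shard value = {my_shard[0].item()}")  # = world*(world+1)/2 on every rank
```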
6.3 $P_p$: Parameter partitioning
Just as with optimizer states and gradients, each process stores only the parameters of its own partition. When parameters outside a process's partition are needed for forward or backward propagation, they are received from the appropriate data parallel process via a broadcast. While at first glance this may seem to incur heavy communication, this approach increases the total communication volume to only 1.5x that of baseline DP while achieving a memory reduction proportional to $N_d$.
Memory savings: With parameter partitioning, the memory footprint drops from $16\psi$ to $\frac{16\psi}{N_d}$. As shown in the example in Figure 1, a 7.5B-parameter model with $P_{os+g+p}$ and 64-way DP ($N_d = 64$) requires only 1.9 GB of memory, compared with 120 GB for standard DP.
This has a profound implication: ZeRO-DP can fit a model of any size as long as there are enough devices to share the model states.
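A toy sketch of the $P_p$ data flow with PyTorch collectives is shown below: the owner of each layer's parameters broadcasts them just before that layer's forward computation, and the other ranks drop them immediately afterwards. This is a simplification of ZeRO's actual mechanism (which gathers shards in a pipelined fashion); the layer count, shapes, and gloo backend are arbitrary choices for illustration.

```python
import torch
import torch.distributed as dist

# Toy P_p flow: parameters live permanently only on their owner rank and are
# broadcast just-in-time for each layer's forward pass. Launch with torchrun.
dist.init_process_group(backend="gloo")
rank, world = dist.get_rank(), dist.get_world_size()

layers = 4                                     # pretend model with `layers` weight tensors
for layer in range(layers):
    owner = layer % world                      # the rank that permanently stores this layer
    if rank == owner:
        weight = torch.randn(16, 16)           # the persistent partition
    else:
        weight = torch.empty(16, 16)           # temporary buffer, lives only for this layer
    dist.broadcast(weight, src=owner)          # everyone receives the layer's parameters
    # ... the forward computation of this layer would run here ...
    print(f"rank {rank}: used layer {layer} params owned by rank {owner}")
    if rank != owner:
        del weight                             # discard immediately after use
```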
6.4 Influence on model size
The three stages $P_{os}$, $P_{os+g}$, and $P_{os+g+p}$ reduce the model-state memory consumption of each data parallel process by 4x, 8x, and $N_d$x, respectively. Table 1 analyzes the model-state memory consumption of several example models under the three ZeRO-DP stages at different DP degrees.
Without ZeRO, the memory consumption equals the first row of the table regardless of the DP degree. Note that with $N_d = 64$, ZeRO can train models of up to 7.5B, 14B, and 128B parameters using $P_{os}$, $P_{os+g}$, and $P_{os+g+p}$, respectively. With $N_d = 1024$, ZeRO with all optimizations enabled ($P_{os+g+p}$) can train a model with a trillion parameters! Or, potentially, a model of arbitrary size! Without ZeRO, the largest model plain DP can run has fewer than 1.5 billion parameters.
0x07 A deeper look at ZeRO-R
7.1 $P_a$: Partitioned activation checkpointing
As discussed earlier, MP by design requires the activations to be replicated, producing redundant copies of the activations across the model parallel GPUs. ZeRO eliminates this redundancy by partitioning the activations and materializing a replicated copy, one activation layer at a time, only right before the activations are needed for computation.
More specifically, once the forward pass of a layer has been computed, its input activations are partitioned across all the model parallel processes until they are needed again in the backward pass. At that point, ZeRO reconstructs the replicated activations with an all-gather operation. We call this optimization $P_a$. It works together with activation checkpointing: only partitioned activation checkpoints are stored instead of replicated copies. In addition, for very large models with very limited device memory, these partitioned activation checkpoints can be offloaded to the CPU, reducing activation memory overhead to almost zero at an extra communication cost; we call this $P_{a+cpu}$.
With partitioned activation checkpoints, ZeRO reduces the activation footprint by a factor proportional to the MP degree. Consider training a 100B model with a batch size of 32, a sequence length of 1024, and an MP degree of 16. If we checkpoint one activation per transformer layer, storing the activation checkpoints alone would need roughly 33 GB per GPU. With $P_a$ in ZeRO, this drops to about 2 GB per GPU. Furthermore, that 2 GB can be offloaded to the CPU, reducing the activation memory footprint to almost zero.
7.2 CB: Constant-size buffers
ZeRO chooses the sizes of its temporary data buffers carefully to balance memory and compute efficiency. During training, the compute efficiency of some operations depends heavily on the input size: the larger the input, the higher the efficiency. A large all-reduce, for example, achieves higher bandwidth than a small one. For this reason, high-performance libraries such as NVIDIA Apex or Megatron fuse all the parameters into a single buffer before applying these operations. However, the memory overhead of such a fused buffer is proportional to the model size; for a 3B-parameter model, a 32-bit fused buffer would need 12 GB of memory. To address this, we simply switch to performance-efficient constant-size fused buffers once the model becomes too large. The buffer size is then independent of the model size, and by keeping the buffer large enough we still achieve good efficiency.
7.3 MD: Memory defragmentation
Memory fragmentation during model training is a consequence of activation checkpointing and gradient computation. With activation checkpointing, only selected activations are stored in the forward pass for backpropagation, and most activations are discarded because they can be recomputed during the backward pass. This interleaves short-lived memory (discarded activations) with long-lived memory (checkpointed activations), causing fragmentation. Similarly, during the backward pass, parameter gradients are long-lived, while activation gradients and any other buffers needed to compute the parameter gradients are short-lived; this interleaving of short- and long-lived memory again leads to fragmentation.
Limited fragmentation is usually not a problem when there is plenty of memory, but for large-model training that runs with limited memory, fragmentation causes two problems: i) OOM errors due to the lack of contiguous memory even when enough total memory is free, and ii) poor efficiency, because the memory allocator spends a lot of time searching for contiguous blocks to satisfy requests.
ZeRO performs memory defragmentation on the fly by pre-allocating contiguous chunks of memory for activation checkpoints and gradients, and copying them into this pre-allocated memory as they are produced. MD not only lets ZeRO train larger models with larger batch sizes, but also improves training efficiency when memory is limited.
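The core idea of MD can be illustrated with a few lines of PyTorch: long-lived tensors are copied into slices of one pre-allocated contiguous buffer instead of living in many small interleaved allocations. This is only a conceptual sketch, not ZeRO-R's actual allocator; the buffer size and tensor shapes are arbitrary.

```python
import torch

# Pre-allocated contiguous pool for long-lived tensors (activation checkpoints,
# parameter gradients). Short-lived tensors keep using the regular allocator.
buffer = torch.empty(1_000_000)
offset = 0

def stash(t: torch.Tensor) -> torch.Tensor:
    """Copy a long-lived tensor into the contiguous pool and return the packed view."""
    global offset
    n = t.numel()
    view = buffer[offset:offset + n].view_as(t)
    view.copy_(t)
    offset += n
    return view

ckpt1 = stash(torch.randn(256, 128))     # e.g. an activation checkpoint
ckpt2 = stash(torch.randn(512, 64))      # packed right after it, no fragmentation between them
print(buffer.data_ptr() == ckpt1.data_ptr(), ckpt2.storage_offset())  # True 32768
```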
0x08 ZeRO-DP communication analysis
Since ZeRO increases the trainable model size by eliminating memory redundancy, it is natural to ask whether communication volume is being traded for memory efficiency. In other words, how much communication does ZeRO-powered DP incur compared with baseline DP? The answer has two parts: i) ZeRO-DP with $P_{os}$ and $P_g$ requires no additional communication while achieving an 8x memory reduction; ii) ZeRO-DP with $P_p$ on top of $P_{os}$ and $P_g$ incurs at most 1.5x the communication, while further reducing the memory footprint by a factor of $N_d$.
8.1 Data parallel communication volume
In data parallel training, the gradients of all data parallel processes are averaged at the end of the backward pass, before computing the next update. The averaging is done with an all-reduce. For large models, the all-reduce is entirely bandwidth-bound, so our analysis focuses on the total communication volume sent to and from each data parallel process.
State-of-the-art all-reduce implementations use a two-step approach. The first step is a reduce-scatter, which reduces different parts of the data onto different processes. The second step is an all-gather, in which each process collects the reduced data from all processes. Together the two steps form an all-reduce. Both reduce-scatter and all-gather are implemented in a pipelined fashion, each moving a total of $\psi$ elements (for data of $\psi$ elements). Standard DP therefore moves $2\psi$ elements of data per training step.
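In the standard ring-algorithm accounting, the per-process data movement works out as follows (the $(N_d-1)/N_d$ factor is usually approximated as 1):

```latex
\text{reduce-scatter: } \tfrac{N_d-1}{N_d}\,\psi \approx \psi, \qquad
\text{all-gather: } \tfrac{N_d-1}{N_d}\,\psi \approx \psi, \qquad
\text{all-reduce} = \text{reduce-scatter} + \text{all-gather} \approx 2\psi .
```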
8.2 ZeRO-DP communication volume
8.2.1 Communication volume with $P_{os+g}$
With gradient partitioning, each process stores only the portion of the gradients needed to update its own parameter partition. Therefore, instead of an all-reduce, ZeRO only needs a reduce-scatter over the gradients, with a communication volume of $\psi$. After each process updates the parameter partition it is responsible for, an all-gather collects the updated parameters from all data parallel processes, again with a communication volume of $\psi$. The total per-step communication volume is thus $\psi + \psi = 2\psi$, exactly the same as baseline DP.
8.2.2 Communication volume with $P_{os+g+p}$
With parameter partitioning, each data parallel process stores only the parameters it is responsible for updating, so during forward propagation it must receive the parameters of all the other partitions. However, this can be pipelined to avoid the memory overhead: the data parallel process that owns a given partition broadcasts its weights to all processes just before the forward computation of the corresponding part of the model, and once that partition's forward pass is done, the parameters can be discarded. The total communication volume is therefore $\frac{\psi \times N_d}{N_d} = \psi$. In other words, the parameters are effectively re-gathered across the whole forward pass via an all-gather spread over time, and discarded after use. Note that this all-gather has to happen again for the backward pass (in reverse order).
The total communication volume is therefore the sum of the reduce-scatter and the all-gathers, i.e., $3\psi$, or 1.5x the baseline. Both gradient and parameter partitioning exploit the insight that not all gradients and parameters are needed at all times, and optimize memory by communicating state judiciously, as summarized below.
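Putting the two cases together, the per-process, per-step communication volumes are (same accounting as above):

```latex
\text{baseline DP (all-reduce): } 2\psi \qquad
P_{os+g}: \underbrace{\psi}_{\text{reduce-scatter grads}} + \underbrace{\psi}_{\text{all-gather params}} = 2\psi \qquad
P_{os+g+p}: \underbrace{2\psi}_{\text{all-gather params (fwd + bwd)}} + \underbrace{\psi}_{\text{reduce-scatter grads}} = 3\psi = 1.5 \times \text{baseline}
```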
0x09 ZeRO-R communication analysis
We compare the communication volume of partitioned activation checkpointing ($P_a$) in ZeRO-R against baseline MP and show that $P_a$ typically adds less than one-tenth of the baseline MP communication volume. We also analyze how the $P_a$ communication overhead relates to DP communication volume, to identify the cases where $P_a$ improves efficiency by enabling larger batch sizes and reducing DP communication. This analysis is used to decide whether and when to apply $P_a$ and $P_{a+cpu}$.
The communication trade-off of partitioned activation checkpointing depends on the model size, the checkpointing strategy, and the MP strategy. To give concrete insights, we analyze it in the context of models implemented on top of the SOTA MP approach, Megatron-LM.
In Megatron-LM with activation checkpointing, each transformer block performs two all-reduce operations of size batch × seq_length × hidden_dim in the forward pass, two all-reduces for the forward recomputation, and two more in the backward pass. The total communication per block is 12 × batch × seq_length × hidden_dim, since the communication volume of an all-reduce is 2 × message size.
When ZeRO-R partitions the activation checkpoints, an additional all-gather is needed before the forward recomputation of each activation checkpoint during backpropagation. Typically, we checkpoint the input activations of each transformer block, so each block needs one all-gather. The communication overhead of $P_a$ is therefore batch × seq_length × hidden_dim, since the communication volume of an all-gather equals the message size. Hence the total communication overhead of $P_a$ is less than 10% of the model parallel communication volume, as the arithmetic below shows.
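Concretely, following the accounting above and writing $M$ for the per-block message size:

```latex
M = \text{batch} \times \text{seq\_length} \times \text{hidden\_dim}, \qquad
\text{MP all-reduce volume per block} = 6 \times 2M = 12M, \qquad
\text{extra } P_a \text{ all-gather} = M, \qquad
\frac{M}{12M} \approx 8.3\% < 10\% .
```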
When MP is used together with DP, $P_a$ can cut data parallel communication by an order of magnitude at the cost of a 10% increase in model parallel communication, which significantly boosts efficiency when data parallel communication is the performance bottleneck. Note that $P_a$ reduces activation memory consumption by a factor of the MP degree, allowing the batch size to increase proportionally. For large models, MP can be as high as 16 (the number of GPUs on a DGX-2 node), allowing up to a 16x increase in batch size. Since the communication volume of data parallel training is inversely proportional to the batch size, an order-of-magnitude increase in batch size due to $P_a$ can yield an order-of-magnitude reduction in data parallel communication.
Finally, with $P_{a+cpu}$ the partitioned activation checkpoints are offloaded to the CPU, reducing the activation memory requirement to nearly zero at the cost of 2x the data movement to and from CPU memory compared with $P_a$. In the extreme case where DP communication is the main bottleneck because the batch size is small even with $P_a$, $P_{a+cpu}$ can improve efficiency by enabling a larger batch size, as long as the CPU data-transfer overhead is smaller than the DP communication overhead, which is usually the case for small batch sizes.
Given the model and hardware characteristics, we use the above analysis to decide whether and when to apply $P_a$ and $P_{a+cpu}$.
0xEE Personal information
★★★★ Thoughts on life and technology ★★★★★
Wechat official account: Rosie’s Thoughts
0xFF References
ZeRO: Model training method for trillion-level parameters
DeepSpeed: a large-scale model training tool available to all
DeepSpeed: Extreme-scale model training for everyone