Preface

At present, the most common form of parallel training is data parallelism, which assumes that the whole model fits in the memory of a single GPU. When this assumption does not hold, the model must be spread across multiple GPUs. This paper proposes a heterogeneous training system named PatrickStar, which manages model data in a fine-grained, chunk-based manner to make more efficient use of heterogeneous memory.

A usage example of PatrickStar is included in this article. PatrickStar is independent of the model definition: adding a few lines of code to an existing PyTorch script is enough to obtain end-to-end acceleration.


PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management

Code: github.com/Tencent/Pat…

Usage Example

Let us first address the question most readers care about: is it complicated to use, and is it easy to integrate?

PatrickStar is independent of the model definition; adding a few lines of code to a PyTorch script provides end-to-end acceleration.
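
To make that concrete, here is a minimal sketch of what such a script might look like. It loosely follows the usage pattern shown in the PatrickStar repository; the import path, config keys, and function signatures below are assumptions that may differ between versions, and MyGPTModel and train_loader are placeholders for your own model and data pipeline.

```python
# A minimal sketch of wrapping an existing PyTorch training loop with
# PatrickStar, loosely following the usage pattern in the project's README.
# The import path, config keys, and signatures are assumptions and may
# differ between versions; MyGPTModel and train_loader are placeholders.
from patrickstar.runtime import initialize_engine  # assumed import path

config = {
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4, "use_hybrid_adam": True}},
    "fp16": {"enabled": True, "loss_scale": 0, "initial_scale_power": 10},
    "default_chunk_size": 64 * 1024 * 1024,  # elements per chunk (assumed key)
}

def model_func():
    # PatrickStar takes a constructor so it can intercept parameter
    # allocation and place the tensors into chunks as the model is built.
    return MyGPTModel()  # placeholder for your own nn.Module

model, optimizer = initialize_engine(model_func=model_func, local_rank=0, config=config)

for batch in train_loader:       # placeholder data pipeline
    optimizer.zero_grad()
    loss = model(batch)
    model.backward(loss)         # the backward pass goes through the engine
    optimizer.step()
```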

Background

The current consensus in AI is to use pre-trained models (PTMs) as the backbone of a task rather than training a model from scratch on a task-specific dataset. The high performance of PTMs comes with a huge number of parameters, which places enormous demands on computing and storage resources.

Since the model data of large models cannot fit in the memory of a single GPU, the most commonly used technique, data parallelism (of which AlexNet in 2012 is an early, well-known example), is not suitable for PTM training. Recent advances make it possible to increase the scale of PTMs by using parallel training to distribute model data across the memory of multiple GPUs, for example ZeRO, model parallelism, and pipeline parallelism. The SOTA solution combines them into 3D parallelism, which can scale PTMs to trillions of parameters on thousands of GPUs.

Innovative Ideas

We observe that two types of training data must be managed during PTM training: model data, which consists of parameters, gradients, and optimizer states and whose footprint is determined by the model structure definition; and non-model data, which consists of the intermediate tensors generated by operators. Non-model data changes dynamically with the training configuration, such as the batch size. Model data and non-model data compete for GPU memory.

Existing solutions statically partition model data between CPU and GPU memory without considering non-model data, and their memory layout is fixed across different training configurations. This static partitioning strategy leads to several problems.

First, the system crashes when GPU or CPU memory is insufficient for its share of the model data, even if memory is still available on the other device at that moment. Second, communication is inefficient when data is transferred between different memory spaces at tensor granularity, and CPU-GPU traffic is unnecessary when model data could have been placed on the target computing device ahead of time.

To address these problems, this paper proposes a heterogeneous training system named PatrickStar. PatrickStar overcomes these shortcomings by managing model data in a fine-grained manner, making more efficient use of heterogeneous memory.

The authors organize the tensors of model data into chunks, i.e. contiguous pieces of memory of the same size. During training, the distribution of chunks across the heterogeneous memory space is orchestrated dynamically according to the states of their tensors. By reusing chunks that never coexist, PatrickStar reduces the memory footprint of model data even further than the SOTA solution.
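
To make the chunk abstraction concrete, here is a toy illustration (not PatrickStar's actual implementation) of how tensors can be packed into fixed-size chunks and addressed by offset; equal-sized chunks are what later allow memory reuse and collective communication at chunk granularity.

```python
# Toy illustration of the chunk idea: tensors are packed into fixed-size
# contiguous buffers and addressed by (chunk_id, offset), so same-sized
# chunks can reuse each other's memory and be moved as a whole.
import torch

CHUNK_NUMEL = 1024 * 1024  # elements per chunk; the real size is configurable

class ChunkPool:
    def __init__(self, dtype=torch.float16):
        self.dtype = dtype
        self.chunks = []           # flat, same-sized buffers
        self.cursor = CHUNK_NUMEL  # forces allocation of the first chunk
        self.mapping = {}          # tensor name -> (chunk_id, offset, numel)

    def register(self, name, numel):
        if self.cursor + numel > CHUNK_NUMEL:
            self.chunks.append(torch.empty(CHUNK_NUMEL, dtype=self.dtype))
            self.cursor = 0
        self.mapping[name] = (len(self.chunks) - 1, self.cursor, numel)
        self.cursor += numel

    def view(self, name, shape):
        chunk_id, offset, numel = self.mapping[name]
        return self.chunks[chunk_id][offset:offset + numel].view(shape)

pool = ChunkPool()
pool.register("layer1.weight", 512 * 512)
pool.register("layer1.bias", 512)
weight = pool.view("layer1.weight", (512, 512))  # a view into chunk 0
```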

The authors use a warm-up iteration to collect runtime statistics on the GPU memory available for model data. Based on these statistics, they design an efficient chunk eviction strategy and a device-aware operator placement strategy to reduce CPU-GPU data movement. With the help of the zero redundancy optimizer, chunk-based memory management can efficiently coexist with data parallelism through inter-GPU communication of chunks.

Contributions

1. The authors built a chunk-based memory management DNN training system named PatrickStar from scratch. Compared with existing heterogeneous training methods, it supports larger model sizes and higher computational efficiency by reducing the non-model memory footprint, improving memory efficiency, and reducing CPU-GPU traffic.

2. Chunk-based management is naturally symbiotic with zero redundancy optimizer data parallelism. The chunk-based collective communication pattern lowers the bandwidth requirement and improves bandwidth utilization.

3. The authors evaluated the system on cloud computing nodes with 8x V100 GPUs and CPU DRAM of 240 GB and 120 GB. With 240 GB of CPU memory, PatrickStar trained a 12-billion-parameter GPT-like model, 1.5 times the size of DeepSpeed's largest model. With 120 GB of CPU memory, PatrickStar's largest model was four times larger than DeepSpeed's.

4. PatrickStar is more computationally efficient than DeepSpeed, achieving superlinear scaling on 8 GPUs.

Design Overview

This paper designs a parallel PTM training system, PatrickStar, whose working principle is shown in the figure below.

PatrickStar improves the model size and efficiency of existing heterogeneous training by managing model data in chunks and storing them in heterogeneous memory space. The circles in the figure represent parameter elements, which are arranged into chunks in memory. When an operator's computation is triggered (colored on the right), PatrickStar moves the chunks holding its parameters onto the required computing device.

This paper presents an efficient scheme for orchestrating chunks in heterogeneous memory space. In addition, it can be combined with the zero redundancy optimizer to scale to multiple GPUs, moving chunks to the required devices during training as needed.

System Design on a Single GPU

PatrickStar acts as middleware between PyTorch and heterogeneous memory, as shown in the figure above.

The system consists of a static module that works in the preprocessing stage, before training begins, and a runtime module that works during training.

Based on the structure of the neural network, the static module builds the mapping between tensors and chunks.

During training, the runtime module takes over PyTorch's memory accesses by redirecting tensors to the chunk-based storage it manages, and uses a chunk manager to intelligently orchestrate chunks in heterogeneous memory.

Preprocessing Stage

Before training, in the preprocessing stage, every tensor of the model data is assigned a piece of chunk space, producing a chunk-tensor mapping. An effective mapping needs three properties: 1. it increases the locality of tensor accesses; 2. it reduces peak memory consumption; 3. it is parallel-friendly.

This paper proposes an efficient mapping scheme.

Simply put, according to the type of the tensors in the model data, chunks are divided into four lists: the param FP16 list, the param FP32 list, the momentum list, and the variance list, for a total of 14M bytes (where M is the number of parameters). All chunks have the same size, so different chunks can reuse the same memory, and collective communication in parallel training becomes convenient. In particular, PatrickStar does not allocate a gradient FP16 list: the gradient FP16 tensors reuse the chunk space of the param FP16 list, which is made possible by eliminating the dependency of the gradient FP16 tensors on the param FP16 tensors.

Compared with ZeRO-Offload, whose minimum model-data memory footprint is 16M bytes, PatrickStar reduces the footprint. In addition, ZeRO-Offload allocates extra GPU memory to hold gradients that are about to be moved to the CPU, while PatrickStar eliminates this overhead. Therefore, compared with other PTM training schemes, PatrickStar's model-data memory footprint is minimal.
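
As a quick sanity check on those numbers, the per-parameter byte counts simply restate the chunk lists above (M is the number of parameters, Adam with mixed precision assumed):

```python
# Per-parameter model-data footprint implied by the chunk lists above.
BYTES_FP16, BYTES_FP32 = 2, 4

patrickstar = {
    "param_fp16": BYTES_FP16,   # grad FP16 reuses this chunk space
    "param_fp32": BYTES_FP32,
    "momentum":   BYTES_FP32,
    "variance":   BYTES_FP32,
}
zero_offload = dict(patrickstar, grad_fp16=BYTES_FP16)  # separate grad FP16 copy

print(sum(patrickstar.values()))   # 14 -> 14M bytes of model data
print(sum(zero_offload.values()))  # 16 -> 16M bytes of model data
```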

Training Stage

During training, PatrickStar must orchestrate chunks correctly and efficiently in the heterogeneous memory space. The paper introduces the mechanism that guides correct chunk movement, together with optimization strategies that improve its efficiency. The mechanism is fairly involved and is described in detail in the paper, so it is not covered here.

Scaling to Multiple GPUs

PatrickStar uses multiple processes to perform data parallelism across multiple GPUs. Assume the number of processes is nproc. Each process is responsible for a single GPU, while all processes share the CPU. The heterogeneous memory space of a process consists of all of its GPU's memory and 1/nproc of the CPU memory. In the same spirit as the zero redundancy optimizer, each process manages the 1/nproc of all chunks that reside in its local heterogeneous memory space. Chunks in the local space are called local chunks, and chunks outside it are called remote chunks.

As shown in the figure, a communication group consists of nproc consecutive chunks of the chunk list, each of which belongs to a different process. Based on the chunk-tensor mapping described earlier, a data-parallel communication scheme with minimal inter-process traffic can be designed: processes only need to exchange param FP16 and gradient FP16 during the FWD and BWD phases, while the ADAM phase, which touches the largest amount of data (momentum, variance, param FP32, and gradient FP16), is executed entirely locally.
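
This pattern can be sketched with torch.distributed primitives as below. It is an illustrative sketch, not PatrickStar's code: it assumes the process group is already initialized and that local_chunk is this rank's shard of one chunk communication group.

```python
# Sketch of chunk-level collectives for one communication group of nproc chunks:
# all-gather param FP16 chunks before FWD/BWD, reduce-scatter grad FP16 chunks
# after BWD, and run ADAM only on the local shard. Illustrative only.
import torch
import torch.distributed as dist

def gather_param_chunks(local_chunk: torch.Tensor) -> list:
    """Each rank contributes its local chunk; every rank gets the whole group."""
    gathered = [torch.empty_like(local_chunk) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local_chunk)
    return gathered

def reduce_grad_chunks(grad_chunks: list, local_out: torch.Tensor) -> torch.Tensor:
    """Sum grad FP16 chunks across ranks; each rank keeps only its own shard."""
    dist.reduce_scatter(local_out, grad_chunks, op=dist.ReduceOp.SUM)
    local_out /= dist.get_world_size()  # average gradients across data-parallel ranks
    return local_out
```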

A very detailed description of the procedure and the algorithm can be found in the paper and is not repeated here.

Compared with related work, PatrickStar achieves lower inter-GPU bandwidth requirements and higher bandwidth utilization. According to the cost model, PatrickStar's communication volume is 2 × (P−1)/P × 2M (all-gather) + (P−1)/P × 2M (reduce-scatter) = 6(P−1)/P × M, where P is the degree of parallelism and M is the number of parameters. In ZeRO-Offload and ZeRO-DP, the parameters of each layer are owned by a single process and broadcast to the rest. Compared with all-gather, broadcast concentrates data transmission on a single GPU, under-utilizing the aggregate bandwidth.

The broadcast-based bandwidth requirement is 4 × (P−1)/P × 2M (broadcast) + (P−1)/P × 2M = 10(P−1)/P × M, which is 2/3 higher than PatrickStar's. Although ZeRO-Infinity also adopts an all-gather approach, PatrickStar still achieves higher bandwidth utilization.
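
Plugging a few values of P into the two cost models above makes the gap concrete (communication volume is expressed in multiples of M):

```python
# Communication volume per iteration, in multiples of M, from the two
# cost models above: 6(P-1)/P for PatrickStar vs 10(P-1)/P for broadcast.
def patrickstar_volume(P):
    return 2 * (P - 1) / P * 2 + (P - 1) / P * 2   # all-gather + reduce-scatter

def broadcast_volume(P):
    return 4 * (P - 1) / P * 2 + (P - 1) / P * 2   # broadcast-based scheme

for P in (2, 4, 8):
    ps, bc = patrickstar_volume(P), broadcast_volume(P)
    print(f"P={P}: PatrickStar {ps:.2f}M vs broadcast {bc:.2f}M ({bc / ps:.2f}x)")
```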

It has been shown that a bucketing strategy, which places consecutive tensors into buckets for transmission, yields higher bandwidth utilization because more data is transmitted per communication. PatrickStar's chunk-based approach is naturally bucketed, whereas ZeRO-Infinity's transmission unit is a single tensor. PatrickStar further avoids data-copy overhead, improving performance.

Optimization

Compared with keeping all model data in GPU memory, heterogeneous training introduces additional CPU-GPU data movement overhead. PatrickStar therefore further optimizes its chunk-based memory management to make it more efficient.

First, operators can be placed in a fine-grained way: some memory-intensive operators are run off their preferred device, which reduces the amount of data movement and improves the overall efficiency of the system. To this end, PatrickStar proposes a device-aware operator placement optimization.
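
One way to picture such a decision is the toy rule below; it is a simplified heuristic for illustration only, not the paper's actual placement policy.

```python
# Toy device-aware placement rule: compute-bound operators go to the GPU,
# while a memory-bound operator (e.g. the ADAM update) runs where its chunks
# already live if moving them would not fit in the GPU's chunkable memory.
# An illustrative simplification, not the paper's exact policy.
def place_operator(compute_intensive: bool, chunk_device: str,
                   gpu_chunkable_bytes: int, chunk_bytes: int) -> str:
    if compute_intensive:
        return "cuda"
    if chunk_device == "cpu" and gpu_chunkable_bytes < chunk_bytes:
        return "cpu"        # avoid a CPU->GPU move the GPU cannot accommodate
    return chunk_device     # otherwise compute where the data already is

print(place_operator(False, "cpu", gpu_chunkable_bytes=0, chunk_bytes=64 << 20))  # cpu
```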

Second, when data cannot reside permanently on its computing device, chunks are evicted when they are not in use. Besides enabling a larger model scale, PatrickStar also minimizes the amount of eviction through a chunk eviction strategy.

To realize these optimizations, the system must know how much memory on each device can be allocated to chunks during training, referred to in this section as chunkable memory. PatrickStar therefore collects statistics on chunkable memory at run time.
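
A rough idea of how such runtime statistics can be collected with PyTorch's allocator counters during the warm-up iteration is sketched below; the real system's bookkeeping is more detailed than this.

```python
# Rough sketch: after a warm-up iteration, use PyTorch's allocator statistics
# to estimate how much GPU memory is still available for chunks ("chunkable
# memory"). The real system tracks this throughout training.
import torch

def chunkable_gpu_memory(device: int = 0, safety_margin: float = 0.1) -> int:
    total = torch.cuda.get_device_properties(device).total_memory
    # Peak allocator usage during the warm-up iteration is a proxy for what
    # one iteration needs (activations plus currently resident chunks).
    peak_used = torch.cuda.max_memory_allocated(device)
    budget = int(total * (1 - safety_margin)) - peak_used
    return max(budget, 0)

# After one warm-up forward/backward pass:
# print(chunkable_gpu_memory())
```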

The details of these schemes are described in the paper.

Conclusion

PatrickStar supports larger model sizes and higher computational efficiency by reducing non-model memory space, improving memory efficiency, and reducing CPU-GPU traffic.

Training throughput of PyTorch, DeepSpeed, and PatrickStar on the same GPU. OOM indicates out of memory.

Training throughput of PyTorch, DeepSpeed, and PatrickStar using data parallelism on multiple GPUs.

This paper proposes an innovative heterogeneous training system called PatrickStar. It organizes model data into chunks, allowing more flexibility when orchestrating them in heterogeneous memory spaces. The system is symbiotic with zero redundancy optimizer data parallelism. PatrickStar lowers the hardware requirements of PTM training and scales to multiple GPUs more efficiently.

On an 8-GPU node with 240 GB of CPU memory on a cloud computing platform, it reaches about 2x the SOTA in both maximum model size and speed.

Several directions remain for future work on PatrickStar.

First, better chunk eviction could be achieved by jointly optimizing the checkpointing and offloading strategies for non-model data. Second, the chunk-based heterogeneous method could be combined with other parallel training methods for multi-node scaling.
