0x00 Abstract

As we mentioned earlier, Microsoft ZeRO can scale to a trillion-parameter model on 4,096 NVIDIA A100 GPUs using 8-way model parallelism, 64-way pipeline parallelism, and 8-way data parallelism.

FSDP (Fully Sharded Data Parallel) is an upgraded version of PyTorch DDP that Facebook proposed after learning deeply from Microsoft ZeRO; it can be regarded as the counterpart of Microsoft ZeRO. Its essence is parameter sharding, which splits the model parameters evenly across the GPUs. We will use papers, blogs, and code from Google, Microsoft, and Facebook for study and analysis.

Other articles in this series are as follows:

PyTorch Distributed ZeroRedundancyOptimizer

Distributed Training Parameter Sharding ZeRO

Distributed Training Parameter Sharding: Google Weight Sharding

0x01 Introduction

1.1 FAIR & FSDP

Training AI models at large scale is not easy. In addition to the computing power and resources required, there is considerable engineering complexity behind training very large models. Facebook AI Research (FAIR) engineering has been building tools and infrastructure to make training large AI models easier.

Fully Sharded Data Parallel (FSDP) is the latest tool introduced by FAIR. It shards the parameters of an AI model across data parallel workers and can optionally offload part of the training computation to the CPU. As the name suggests, FSDP is a data parallel training algorithm: although the parameters are sharded across different GPUs, the computation for each microbatch of data is still local to each GPU worker. This conceptual simplicity makes FSDP easier to understand and better suited to a wide range of usage scenarios (compared with intra-layer parallelism and pipeline parallelism). Compared with data parallel methods that shard only the optimizer state and gradients, FSDP shards the model parameters more uniformly and achieves better performance by overlapping communication with computation during training.

FSDP can train models that are orders of magnitude larger, more efficiently and with fewer GPUs. FSDP has been implemented in the FairScale library, allowing engineers and developers to scale and optimize their model training with simple APIs. At Facebook, FSDP has already been integrated and tested for training some NLP and vision models.

1.2 Demand for large-scale training computing capacity

Large-scale model training requires a lot of computing resources. For example, OpenAI’s GPT-3 has 175 billion parameters, and its training is estimated to have taken 355 GPU-years, the equivalent of 1,000 GPUs working continuously for more than four months.

In addition to requiring significant computational and engineering resources, most approaches to scaling training incur additional communication costs and require engineers to carefully weigh memory usage against computational efficiency. For example, typical data parallel training requires maintaining a redundant copy of the model on every GPU, while model parallel training introduces additional communication costs for moving activations between workers (GPUs).

FSDP, by contrast, is relatively free of trade-offs. It improves memory efficiency by sharding model parameters, gradients, and optimizer states across GPUs, and improves computational efficiency by decomposing communication and overlapping it with the forward and backward passes. FSDP produces the same results as standard distributed data parallel (DDP) training and provides an easy-to-use interface that is a drop-in replacement for PyTorch’s DistributedDataParallel module. Facebook’s early testing shows that FSDP can scale to trillions of parameters.

0x02 How does FSDP work

In standard DDP training, each worker processes a separate batch, and all-reduce is used to sum the gradients across workers. While DDP has become very popular, it uses more GPU memory than it really needs, because the model weights and optimizer states are replicated on every DDP worker.

2.1 Full parameter sharding

One way to reduce the duplication is to apply a process called full parameter sharding, in which only the subset of the model parameters, gradients, and optimizer state needed for a local computation is kept on each GPU. An implementation of this approach, ZeRO-3, has been popularized by Microsoft.

The key to unlocking full parameter sharding is that we can split the all-reduce operation in DDP into separate reduce-scatter and all-gather operations.

Image from: engineering.fb.com/wp-content/…

All-reduce is a combination of reduce-scatter and all-gather: the standard all-reduce operation used to aggregate gradients can be broken down into two separate stages, reduce-scatter and all-gather.

  • In the reduce-scatter phase, the gradients are summed in equal-sized blocks on each GPU, with each rank obtaining the sum of the block that corresponds to its rank index.
  • In the all-gather phase, the aggregated gradient shard available on each GPU is made available to all GPUs.

By rearranging the reduce-scatter and all-gather in this way, each DDP worker only needs to store a single shard of the parameters and of the optimizer state.
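
To make the decomposition concrete, here is a small, self-contained sketch (a toy illustration of the idea, not FSDP’s actual implementation) that runs two CPU processes with the gloo backend and verifies that reduce-scatter followed by all-gather reproduces the result of a plain all-reduce. The reduce-scatter step is emulated with one dist.reduce call per shard so that the script runs on any backend; NCCL exposes a fused dist.reduce_scatter for this.

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def demo(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # A fake "gradient": 8 elements, different on every rank.
    grad = torch.arange(8, dtype=torch.float32) + rank

    # Path 1: the standard all-reduce used by DDP.
    all_reduced = grad.clone()
    dist.all_reduce(all_reduced)  # sum over all ranks

    # Path 2: reduce-scatter, then all-gather.
    shards = list(grad.clone().chunk(world_size))
    for r in range(world_size):
        dist.reduce(shards[r], dst=r)  # rank r ends up owning the summed shard r
    my_shard = shards[rank]

    gathered = [torch.empty_like(my_shard) for _ in range(world_size)]
    dist.all_gather(gathered, my_shard)  # reassemble the full reduced tensor
    recombined = torch.cat(gathered)

    assert torch.allclose(all_reduced, recombined)
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(demo, args=(2,), nprocs=2)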

2.2 Comparison

The following figure shows standard DDP training (top half) and FSDP training (bottom half):

  • In the standard data parallel training method, a copy of the model exists on each GPU, and a sequence of forward and backward passes runs only on that GPU’s shard of the data. After these local computations, the gradients of each local process are shared across the GPUs in order to compute the global weight update.

  • In FSDP:

    • Model shard: each GPU holds only a shard of the model.
    • All-gather: each GPU gathers the weights it is missing from the other GPUs via all-gather before the local forward computation. This corresponds to the underlined Pp part of the paper’s ideas (Section 2.3.1).
    • Forward (local): the forward pass is performed locally. Both the forward and backward computations use the full model.
    • All-gather: the weights are gathered again before the backward pass. Again this corresponds to the underlined Pp part of the paper’s ideas.
    • Backward (local): the backward pass is performed locally. Both the forward and backward computations use the full model, and the full gradients are present on each GPU.
    • Reduce-scatter: after the backward pass, the local gradients are aggregated and sharded across the GPUs via reduce-scatter; the gradient kept on each GPU is the aggregated portion corresponding to its partition. This corresponds to the underlined Pg part of the paper’s ideas.
    • Update weights (local): each GPU updates its local shard of the weights.

To maximize memory efficiency, we can discard the full weights after each layer’s forward pass, saving memory for the subsequent layers. This can be done by applying the FSDP wrapper to every layer in the network (with reshard_after_forward=True).

Here is the pseudo-code implementation:

FSDP forward pass:
    for layer_i in layers:
        all-gather full weights for layer_i # weights
        forward pass for layer_i
        discard full weights for layer_i # weights

FSDP backward pass:
    for layer_i in layers:
        all-gather full weights for layer_i # weights
        backward pass for layer_i
        discard full weights for layer_i # weights
        reduce-scatter gradients for layer_i # gradient
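
In FairScale, this per-layer wrapping looks roughly like the sketch below. It is a minimal illustration, assuming a torch.distributed process group has already been initialized; the layer sizes and layer count are arbitrary placeholders.

import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

def build_sharded_mlp(dim=1024, n_layers=4):
    # Each layer gets its own FSDP wrapper, so its full weights are all-gathered
    # right before use and discarded right after (reshard_after_forward=True).
    layers = [FSDP(torch.nn.Linear(dim, dim), reshard_after_forward=True)
              for _ in range(n_layers)]
    # The outer wrapper ties the pieces together and handles anything left unwrapped.
    return FSDP(torch.nn.Sequential(*layers), reshard_after_forward=True)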

2.3 Review

Let us use the ideas from the ZeRO paper to walk through FSDP once more.

2.3.1 Ideas from the paper

The ideas in the paper are as follows:

  • Pp: Parameter Partitioning. Each process stores only the parameters corresponding to its partition. When parameters outside its partition are needed for the forward or backward pass, they are received from the appropriate data parallel process via a broadcast operation. While at first glance this might seem to cause significant communication overhead, the paper found that this approach only increases the total communication volume of a baseline DP system to 1.5x, while achieving a memory reduction proportional to N_d.
  • Pos: Optimizer State Partitioning. For data parallelism of degree N_d, the optimizer state is grouped into N_d equal partitions, so that the i-th data parallel process updates only the optimizer state of the i-th partition. Therefore, each data parallel process needs to store and update only 1/N_d of the total optimizer state, and correspondingly updates only 1/N_d of the parameters. At the end of each training step, an all-gather is performed across the data parallel processes to obtain the fully updated parameters on all processes.
  • Pg: Gradient Partitioning. Since each data parallel process updates only its corresponding parameter partition, each node only needs the reduced gradients for that partition. After the reduction, each node keeps only the gradients corresponding to its parameter partition; the memory for the other gradients can be freed. This reduces the gradient memory footprint from 2Ψ bytes to 2Ψ/N_d bytes. In effect, this is a reduce-scatter operation in which the gradients of different parameters are reduced to different processes.

To summarize: because the model parameters are partitioned, the parameter gradients (in framework implementations, gradients are usually member variables of the parameters) are naturally partitioned as well. The optimizer is constructed over the partitioned parameters, so it only optimizes those partitioned parameters, and therefore the optimizer state is naturally partitioned too. Note that during the forward and backward passes every GPU computes with the full model, and the gradients obtained are the full gradients, but each GPU only stores the part corresponding to its own partition.
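
As a toy, single-process illustration of this point (the rank below is simulated and there is no communication): because the optimizer is constructed only over this rank’s partition of the parameters, its state (for example Adam’s momentum and variance) is automatically partitioned as well.

import torch

model = torch.nn.Sequential(*[torch.nn.Linear(64, 64) for _ in range(4)])
world_size, rank = 4, 1                      # pretend we are GPU_1 out of 4 ranks

params = list(model.parameters())
local_partition = params[rank::world_size]   # a simplistic round-robin partition

optimizer = torch.optim.Adam(local_partition, lr=1e-3)  # state exists only for P_n
print(sum(p.numel() for p in local_partition), "of",
      sum(p.numel() for p in params), "parameters are optimized on this rank")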

2.3.2 Process steps

Let us walk through the process once more. Assume the data parallel degree is n, so there are n GPUs, and each GPU stores 1/n of the total model parameters. At the same time, the optimizer state is naturally partitioned across the GPUs along with the data parallelism.

  • Starting state: each GPU_n holds P_n, G_n, and O_n. Note that since the model shard on this GPU is P_n, O_n naturally corresponds to P_n, i.e., the optimizer state is automatically sharded.
  • In the forward pass, each GPU_n broadcasts its own parameters P_n to all other GPUs. After the forward computation, each GPU_n obtains the loss_n of its own input training data data_n.
  • In the backward pass, each GPU_n again broadcasts its own parameters P_n to all other GPUs, and finally computes the gradient G_n corresponding to data_n.
  • The gradients G_n are aggregated onto the corresponding GPU_n; the gradient left on GPU_n is the part of reduce(G_0, …, G_n) corresponding to its own rank. Note that reduce-scatter is used for this gradient aggregation: because each GPU only needs to update its own P_n, G_n, O_n, no all-gather is needed here.

0x03 How to use FSDP

Currently, FAIR offers four ways to use FSDP, suited to different needs.

3.1 Using FSDP in language models

For language models, FSDP is supported in the fairseq framework via the following new parameters:

  • --ddp-backend=fully_sharded: enables full sharding via FSDP.
  • --cpu-offload: offloads the optimizer state and the FP32 model copy to the CPU (used together with --optimizer=cpu_adam).
  • --no-reshard-after-forward: increases training speed for large models (1B+ parameters), similar to ZeRO stage 2.
  • Other common options (--fp16, --update-freq, --checkpoint-activations, --offload-activations, etc.) continue to work as usual.

See the Fairseq tutorial for details.
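
As a rough sketch of how a run with the flags above could be launched (the dataset path, task, and architecture below are placeholders, and this assumes a fairseq installation with FSDP support):

import subprocess

subprocess.run([
    "fairseq-train", "data-bin/my_dataset",      # placeholder preprocessed dataset
    "--task", "language_modeling",               # placeholder task
    "--arch", "transformer_lm",                  # placeholder architecture
    "--ddp-backend=fully_sharded",               # enable full sharding via FSDP
    "--cpu-offload", "--optimizer=cpu_adam",     # offload optimizer state / FP32 copy to CPU
    "--fp16",
    "--checkpoint-activations",
], check=True)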

3.2 Using FSDP in computer vision models

For computer vision models, FSDP is supported in VISSL and has been tested on RegNet architectures. Layers such as BatchNorm and ReLU are handled seamlessly, and convergence has been verified. FSDP can be enabled with the following options:

  • config.MODEL.FSDP_CONFIG.AUTO_SETUP_FSDP=True
  • config.MODEL.SYNC_BN_CONFIG.SYNC_BN_TYPE=pytorch
  • config.MODEL.AMP_PARAMS.AMP_TYPE=pytorch

You can continue studying this section in the links below.

3.3 Using FSDP in PyTorch Lightning

To make integration with more general use cases easier, PyTorch Lightning has included FSDP as a beta feature. [This tutorial](pytorch-lightning.readthedocs.io/en/latest/a…) contains a detailed example of how to use the FSDP plugin with PyTorch Lightning. Add plugins='fsdp' to activate it, as shown below.

from pytorch_lightning import Trainer

model = MyModel()  # MyModel is a LightningModule defined elsewhere
trainer = Trainer(gpus=4, plugins='fsdp', precision=16)
trainer.fit(model)

trainer.test()
trainer.predict()

3.4 Using the FSDP library from FairScale directly

FSDP’s main development library is FairScale. You can use FairScale’s FSDP directly by replacing your DDP wrapper, as in the following example.

from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
...
# sharded_module = DDP(my_module)   # replace the DDP wrapper ...
sharded_module = FSDP(my_module)    # ... with FSDP
optim = torch.optim.Adam(sharded_module.parameters(), lr=0.0001)
for sample, label in dataload.next_batch:  # dataload is assumed to yield (sample, label) batches
  optim.zero_grad()
  out = sharded_module(x=sample, y=3, z=torch.Tensor([1]))
  loss = criterion(out, label)
  loss.backward()
  optim.step()

The FSDP library in FairScale provides options for many important aspects of large-scale training. If you want to use the full functionality of FSDP, you should look into the following aspects yourself.

  1. **Model wrapping:** to minimize transient GPU memory requirements, users need to wrap the model in a nested way (see the sketch after this list). This adds complexity, but is useful when porting existing PyTorch model code.
  2. **Model initialization:** unlike DDP, FSDP does not automatically synchronize model weights between GPU worker processes. This means model initialization must be done carefully so that all GPU workers start with identical weights.
  3. **Optimizer settings:** because of sharding and wrapping, FSDP only supports certain optimizers and optimizer settings. In particular, if a module is wrapped by FSDP and its parameters are flattened into a single tensor, users cannot use different hyperparameters for different parameter groups within that module.
  4. **Mixed precision:** FSDP supports advanced mixed-precision training with FP16 weights, as well as FP16 reduce and scatter of gradients. However, some parts of a model may only converge in full precision; in those cases additional wrapping is needed to selectively run parts of the model in full precision.
  5. **State checkpointing and inference:** saving and loading model state can become difficult when the model is large. FSDP supports several ways to make this task feasible, but they come at a cost.
  6. Finally, FSDP is often used together with activation checkpointing functions such as checkpoint_wrapper. Users may need to carefully tune the activation checkpointing strategy to fit a large model within limited GPU memory.
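
A minimal sketch of points 1 and 6 above (nested wrapping plus activation checkpointing), assuming an initialized process group; the exact import paths may vary between FairScale versions, and the layer sizes are placeholders.

import torch.nn as nn
from fairscale.nn import checkpoint_wrapper
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.nn.wrap import enable_wrap, wrap

# Inner blocks become their own FSDP units; activation checkpointing is applied
# inside the FSDP wrapper, i.e. FSDP(checkpoint_wrapper(module)).
with enable_wrap(wrapper_cls=FSDP, reshard_after_forward=True):
    block1 = wrap(checkpoint_wrapper(nn.Linear(1024, 1024)))
    block2 = wrap(checkpoint_wrapper(nn.Linear(1024, 1024)))
model = FSDP(nn.Sequential(block1, block2))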

0x04 Memory Management

Let’s look at how FSDP manages memory.

FairScale provides algorithms inspired by ZeRO (https://arxiv.org/pdf/1910.02054.pdf): with data parallel training, you trade memory for computation/communication efficiency; conversely, with model parallel training, you trade computation/communication for memory.

Memory usage for model training generally falls into two categories:

  • Model states: optimizer states, gradients, and parameters.

  • Residual states: activations, temporary buffers, and fragmented memory.

To reduce redundancy in the model states, ZeRO proposes three different algorithms. These are implemented in FairScale as Optimizer State Sharding (OSS), Sharded Data Parallel (SDP), and Fully Sharded Data Parallel (FSDP). Let’s take a closer look at the actual mechanics of each algorithm and understand why they save memory.

4.1 Optimizer State Sharding (OSS)

FairScale implements OSS, a memory optimization that targets the optimizer’s memory footprint.

Optimizers such as Adam typically need to maintain momentum and variance. Even when training with FP16-precision parameters and gradients, these values need to be kept in FP32 precision for the optimizer update. When every rank updates the full model, a significant portion of memory is therefore occupied by redundant copies of the optimizer state.

To overcome this redundancy, optimizer state sharding divides the model optimization step across the ranks, so that each rank is responsible for updating only its own shard of the model. This in turn ensures that the optimizer state on each rank is much smaller and contains no information that is redundant across ranks.
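
A minimal sketch of wrapping a standard optimizer with OSS (assuming model is an ordinary module and a process group has been initialized; the learning rate is a placeholder):

import torch
from fairscale.optim.oss import OSS

base_optimizer = torch.optim.Adam          # any torch optimizer can serve as the base
base_optimizer_arguments = {"lr": 1e-4}    # placeholder hyperparameters

# OSS shards the base optimizer's state across the ranks of the default process group.
optimizer = OSS(params=model.parameters(), optim=base_optimizer, **base_optimizer_arguments)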

4.1.1 Training process

The training process can be modified as follows from the DDP execution process:

  1. The wrapped optimizer shards the optimizer state using a greedy algorithm based on parameter sizes, not on the order in which parameters are used. This ensures that each rank has a nearly identical optimizer memory footprint.

  2. The training process is similar to PyTorch’s Distributed Data Parallel (DDP) process: the forward pass is completed on each rank, followed by the backward pass. All-reduce is used to synchronize gradients during the backward pass.

  3. Each rank updates only the optimizer state and parameters it is responsible for, and discards the rest.

  4. After the update, a broadcast or all-gather is performed to ensure that all ranks receive the latest parameter values.

OSS is useful when you use an optimizer with additional state, such as Adam. If you use SGD or any optimizer with a limited memory footprint, you will likely see a slowdown when using multiple nodes, due to the extra communication in step 4. There is also some wasteful memory used to store gradients during the all-reduce in step 2, which is discarded afterwards.

4.1.2 Best Practices

  • OSS exposes a broadcast_fp16 flag that you should probably use in multi-node jobs; it is usually unnecessary in single-node experiments.

  • If your model is highly uneven in terms of parameter sizes (for example, it contains one huge tensor), then this method will not help much, and tensor-sharding options such as fairscale.nn.FullyShardedDataParallel would be preferable.

  • OSS should be a temporary solution in a DDP context, and it remains compatible with most DDP features.

4.1.3 Performance

  • On a single node, OSS should always be faster than vanilla PyTorch; the memory savings will vary depending on the optimizer used.

  • When using multiple nodes, OSS can be either faster or slower than vanilla PyTorch, depending on the optimizer used and on optional flags (the broadcast_fp16 flag mentioned above, gradient compression, gradient accumulation).

  • If your experiment allows it, it is usually beneficial to use a larger batch size and reduce the number of ranks involved, or to use gradient accumulation, since this lowers the communication cost.

4.2 Optimizer + Gradient State Sharding

Although OSS addresses the redundancy in the optimizer, there is still duplicated gradient aggregation computation and extra memory used for gradients. To overcome redundant gradient memory, we can use gradient sharding, i.e., ZeRO-2. This is implemented by the Sharded Data Parallel (SDP) API in FairScale.

To enable gradient sharding, each rank is assigned a set of parameters for which it manages the optimizer state and gradient aggregation. By assigning a model shard to a given rank, we ensure that gradients are reduced to specific ranks, which are in turn responsible for the corresponding updates. This reduces both communication and memory usage.
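
A minimal sketch of SDP usage (assuming model is defined and a process group has been initialized): the model is wrapped with ShardedDataParallel and paired with an OSS optimizer, so that gradients are reduced directly to the rank that owns the corresponding optimizer shard.

import torch
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP
from fairscale.optim.oss import OSS

optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=1e-4)
model = ShardedDDP(model, optimizer)  # replaces the DDP wrapper and adds gradient-sharding hooks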

4.2.1 Training process

The training process is as follows:

  1. As before, the wrapped optimizer shards the parameters across the different ranks.

  2. The model is now wrapped with a Sharded Data Parallel (SDP) wrapper, which allows us to add the appropriate hooks and maintain state during training.

  3. SDP focuses on the trainable parameters and adds a backward hook to each of them.

  4. During backpropagation, gradients are reduced to the rank they were assigned to as part of the sharding process in step 1. A reduce op is used instead of an all-reduce op, which lowers the communication cost.

  5. Each rank updates the parameters it is responsible for.

  6. After the update, a broadcast or AllGather is performed to ensure that all ranks receive the latest updated parameter values.

Both the OSS and SDP APIs let you reduce the memory used for gradients and optimizer state, but they may incur additional communication costs if the network is slow. OSS and SDP are a good first step when you run into out-of-memory (OOM) problems.

4.2.2 Best Practices

  • If you use multiple nodes, make sure SDP is using reduce buffers by specifying the reduce_buffer_size parameter. Tuning their size can be worthwhile, and the optimal value may depend on the interconnect.

  • If you are on a single node, it is usually better not to use reduce_buffer_size, since it incurs a latency cost without any memory gain. Setting the value to 0 disables this feature.

  • If your experiment allows it, it is usually beneficial to use a larger batch size and reduce the number of ranks involved, or to use gradient accumulation, since this lowers the communication cost.

4.3 Optimizer + Gradient + Horizontal Model Sharding

To further optimize training and achieve greater memory savings, we need to enable parameter sharding.

Parameter sharding works like gradient and optimizer state sharding: each data parallel rank is responsible for one shard of the model parameters. FairScale implements parameter sharding through the Fully Sharded Data Parallel (FSDP) API, which is heavily inspired by ZeRO-3 (https://arxiv.org/pdf/1910.02054.pdf).

Parameter sharding has two key points:

  • The all-reduce operation can be split into separate reduce and all-gather operations, just as in the previous sharding techniques (optimizer state sharding and gradient sharding).

  • Layers can be wrapped with the FSDP API, which allows us to bring all the parameters required by a single layer onto a given GPU only when they are needed, compute the forward pass, and then discard the parameters that do not belong to that rank.

Using FSDP is as simple as replacing the original DDP wrapper in your code. The example below builds a small nn.Sequential model and a fake dataset, which we then wrap with FSDP.

import torch
from torch.utils.data.dataloader import DataLoader
from torchvision.datasets import FakeData
from torchvision.transforms import ToTensor

num_inputs = 8
num_outputs = 8
num_hidden = 4
num_layers = 2
batch_size = 8

# A fake image dataset and a small nn.Sequential model used for illustration.
transform = ToTensor()
dataloader = DataLoader(
    FakeData(
        image_size=(1, num_inputs, num_inputs),
        num_classes=num_outputs,
        transform=transform,
    ),
    batch_size=batch_size,
)

model = torch.nn.Sequential(
    torch.nn.Linear(num_inputs * num_inputs, num_hidden),
    *([torch.nn.Linear(num_hidden, num_hidden) for _ in range(num_layers)]),
    torch.nn.Linear(num_hidden, num_outputs),
)
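
Continuing the snippet above, here is a hedged sketch of wrapping the model with FSDP and running one training step. It assumes torch.distributed has already been initialized (for example, the script was launched with torchrun) and that the current device has been set when GPUs are used.

from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

fsdp_model = FSDP(model)                      # shard the parameters across ranks
optim = torch.optim.Adam(fsdp_model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

for image, label in dataloader:
    optim.zero_grad()
    out = fsdp_model(image.view(image.size(0), -1))  # flatten the fake 1x8x8 images
    loss = criterion(out, label)
    loss.backward()
    optim.step()
    break  # one step is enough for this illustration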

4.3.1 Training process

The specific training process is as follows:

  • All-gather the parameters required for the forward pass of each layer of the model just before the computation of that layer starts.

  • Compute the forward pass.

  • All-gather the parameters required for the backward pass of each layer of the model just before the backward computation of that layer starts.

  • Compute the backward pass.

  • Reduce-scatter the gradients, so that the aggregated gradients are accumulated on the rank responsible for the corresponding parameters.

  • Have each rank update the parameters assigned to it using the aggregated gradients.

With FSDP, some small changes are needed when using the API to set up checkpoints and save optimizer state. Given the sharded nature of the optimizer state and parameters, any API that aims to save the model state for training or inference needs to account for saving weights from all workers. FSDP implements the plumbing required to save the weights of all workers, save the weights of individual workers, and save the optimizer state of all workers.

FSDP also supports mixed precision training, where computation and communication are performed in FP16. If you want the reduce operations to be performed in FP32 (which is the default behavior of DDP), you must set fp32_reduce_scatter=True.

To further save memory, FSDP supports offloading currently unused parameters and gradients to the CPU. This can be enabled by setting move_params_to_cpu and move_grads_to_cpu to True.
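
Putting these flags together, a minimal sketch (assuming my_module is defined and a process group has been initialized):

from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

sharded_module = FSDP(
    my_module,
    mixed_precision=True,        # FP16 compute and communication
    fp32_reduce_scatter=True,    # perform the gradient reduction in FP32, like DDP
    move_params_to_cpu=True,     # offload currently unused parameters to the CPU
    move_grads_to_cpu=True,      # offload gradients to the CPU
)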

4.3.2 Best practices

  • For FSDP, it is best to use model.zero_grad(set_to_none=True), since it saves a large amount of memory after stepping.

  • torch.cuda.amp.autocast is fully compatible with FSDP; however, you need to set the mixed_precision argument to True.

  • When combining FSDP with activation checkpointing, it is best to use FSDP(checkpoint_wrapper(module)) rather than checkpoint_wrapper(FSDP(module)); the latter results in more communication and is slower.

  • FSDP produces results identical to DDP when using pointwise optimizers such as Adam, AdamW, Adadelta, Adamax, SGD, etc. Sharding leads to slightly different results when using non-pointwise optimizers such as Adagrad, Adafactor, LAMB, etc.

4.3.3 Performance

  • For maximum memory efficiency, use auto_wrap to wrap every layer in the network with FSDP and set reshard_after_forward=True. This is slower, but minimizes GPU memory overhead.

  • For maximum training speed, set reshard_after_forward=False (wrapping each individual layer is not required, but doing so will improve speed further).

Now that we have covered the basics of FSDP and how to use it, the next article will look at the code in detail and see how FSDP minimizes memory usage.

0xEE Personal information

★★★★ Thoughts on life and technology ★★★★★

Wechat official account: Rosie’s Thoughts

0xFF References

Fully Sharded Data Parallel: faster AI training with fewer GPUs

ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters (Microsoft)

Github.com/microsoft/D…

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training