0x00 Summary

The previous articles covered the basics of PyTorch pipeline parallelism, automatic balancing, and data sharding. In this article, we will look at how torchgpipe ensures the execution order of the forward computation.

Other articles in this pipeline parallelism series are linked below:

Deep learning pipeline parallelism GPipe (1): basic pipeline implementation

Deep learning pipeline parallelism GPipe (2): gradient accumulation

Deep learning pipeline parallelism GPipe (3): recomputation

Deep learning pipeline parallelism PipeDream (1): profile stage

Deep learning pipeline parallelism PipeDream (2): computing partitions

Deep learning pipeline parallelism PipeDream (3): model transformation

Deep learning pipeline parallelism PipeDream (4): runtime engine

Deep learning pipeline parallelism PipeDream (5): communication module

Deep learning pipeline parallelism PipeDream (6): the 1F1B strategy

PyTorch pipeline parallelism implementation (1): basics

PyTorch pipeline parallelism implementation (2): how to partition the model

PyTorch pipeline parallelism implementation (3): sharding data and the runtime system

The images in this article are taken from the paper and from the GitHub source code.

0x01 The paper

As mentioned earlier, GPipe is based on TensorFlow (it is a Google product, after all), so some Kakao Brain engineers reimplemented GPipe on top of PyTorch and open-sourced it as torchgpipe. The repository is at Github.com/kakaobrain/… and it can be installed with pip install torchgpipe.

The same team also published a paper: arxiv.org/pdf/2004.09… .

Next, we will analyze this paper. We will not translate it in full; instead, we select and analyze the parts most closely related to the implementation.

1.1 Introduction

One obstacle to parallel training is that the optimization techniques commonly used to train neural networks are sequential in nature. These algorithms repeatedly perform the following operation: for a given mini-batch of data, compute the gradients of the loss function with respect to the model parameters, and use these gradients to update the parameters.
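To make this concrete, here is a minimal sketch of such a sequential optimization loop in PyTorch; the model, loss function, and data below are placeholders invented for illustration only.

import torch
from torch import nn

# A minimal sketch of the sequential optimization loop described above.
# The model, loss function, and data are illustrative placeholders.
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(4)]

for inputs, targets in loader:              # one mini-batch at a time
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)  # forward pass on this mini-batch
    loss.backward()                         # compute gradients with respect to the loss
    optimizer.step()                        # update parameters before the next mini-batch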

1.1.1 Data Parallelism

When plenty of computing resources are available, data parallelism divides the mini-batch into micro-batches and delegates the computation of each micro-batch to an available device, speeding up the overall optimization process. With careful hyperparameter tuning, data parallelism effectively reduces the training time for mini-batches up to a certain size, which may depend on the model, the optimization algorithm, and the data.

The problem with data-parallel training is that each device holds its own replica of the model to perform its sub-task, and the model parameters must be synchronized after each parameter update. This causes a heavy communication load when many parameters need to be synchronized.
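As a reminder of what this looks like in code, here is a minimal sketch using PyTorch's nn.DataParallel, shown only for contrast with pipeline parallelism; the model and sizes are illustrative, and torchgpipe does not use this mechanism.

import torch
from torch import nn

# A minimal sketch of data parallelism with nn.DataParallel (for contrast only).
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

if torch.cuda.device_count() > 1:
    # Each GPU holds a replica of the model; a mini-batch is scattered across the
    # replicas, and gradients are reduced so the replicas stay synchronized.
    model = nn.DataParallel(model)

model = model.to('cuda' if torch.cuda.is_available() else 'cpu')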

Moreover, data parallelism does not apply at all when the model is so large that it cannot fit on a single device, since the gradients cannot even be computed there.

1.1.2 Model Parallelism

Model parallelism trains a large model by breaking it into parts and placing them on different devices. Each device evaluates only a small part of the model and updates only the parameters of that part. However, model parallelism suffers from underutilization: because most neural networks consist of a sequence of layers, the device holding a later part of the model must wait until the device holding an earlier part has finished its computation.
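A minimal sketch of this naive form of model parallelism is shown below; the two-stage split, layer sizes, and device names are illustrative only and assume two visible GPUs.

import torch
from torch import nn

# A naive model-parallel sketch: two halves of a model on two GPUs (illustrative only).
class TwoStageModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to('cuda:0')
        self.stage1 = nn.Linear(512, 10).to('cuda:1')

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stage0(x.to('cuda:0'))
        # While cuda:1 runs stage1, cuda:0 sits idle -- the underutilization described above.
        return self.stage1(x.to('cuda:1'))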

One possible workaround is gradient checkpointing, which stores only a subset of the activations and recomputes the discarded activations when they are needed, saving memory. Obviously, this requires computing some parts of the model twice and increases the overall training time.
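PyTorch ships a utility for this, torch.utils.checkpoint; a minimal sketch on a sequential model (sizes here are illustrative) looks like the following.

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

# Gradient checkpointing on a sequential model: only the segment boundaries keep
# activations; everything inside a segment is recomputed during backward.
model = nn.Sequential(*[nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(8)])
x = torch.randn(32, 256, requires_grad=True)

out = checkpoint_sequential(model, 2, x)  # split into 2 checkpointed segments
out.sum().backward()                      # discarded activations are recomputed here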

In the following sections, we discuss how to break the forward and backward passes into subtasks (under certain assumptions), describe the device allocation strategy of micro-batch pipeline parallelism, and show the execution order required on each device. We also discuss the complications of realizing the optimal timeline of pipeline parallelism in PyTorch and explain how torchgpipe solves them.

In addition, we relax the assumption that the model is sequentially composed and provide a way to express models with long skip connections, so that pipeline parallelism can still be applied without giving up efficiency.

1.2 Model Definition

Suppose we have a neural network composed of a sequence of subnetworks. We denote these subnetworks by f^1, …, f^n with parameters θ^1, …, θ^n respectively; the whole network is then f = f^n ∘ f^{n-1} ∘ … ∘ f^1, with parameters θ = (θ^1, …, θ^n).

Because the Nuggets platform currently cannot render formulas, some formulas in this article are shown as images; please bear with it, or read the version at www.cnblogs.com/rossiXYZ/p/…

In the figure, the three F's are the forward propagation tasks of the three subnetworks, and the three B's are the corresponding backward propagation tasks.

The following figure shows the first micro-batch completing, in order, the forward propagation and then the backward propagation of the three subnetworks.
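In code, such a sequentially composed network and its forward/backward task order can be sketched as follows; the three subnetworks here are illustrative, not taken from torchgpipe.

import torch
from torch import nn

# The setup above: a network f viewed as the composition f3(f2(f1(x))).
f1 = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
f2 = nn.Sequential(nn.Linear(32, 32), nn.ReLU())
f3 = nn.Linear(32, 4)
f = nn.Sequential(f1, f2, f3)   # f = f3 . f2 . f1, with parameters (theta1, theta2, theta3)

x = torch.randn(8, 16)
y = f(x)               # forward:  F1, F2, F3 run in order
y.sum().backward()     # backward: B3, B2, B1 run in reverse order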

Given a set of tasks and a pool of devices that can work in parallel, different parallelization strategies have their own rules for assigning tasks to devices.

Once its dependencies are resolved, each device computes one or more of its assigned tasks. In the setup above, all dependencies of the tasks for micro-batch i have the same micro-batch index i. Therefore, tasks can be effectively parallelized by assigning tasks with different micro-batch indexes to different devices; this is data parallelism.

1.3 The GPipe Computation Graph

The strategy of pipeline parallelism is to assign tasks according to the partition index j, so that the j-th partition lives entirely on the j-th device. In addition, it is required that F_{i,j} be completed before F_{i+1,j} is executed, and that B_{i,j} be completed before B_{i-1,j} is executed.

In addition to micro-batch pipelining, GPipe further reduces memory requirements by applying gradient checkpointing to every F_{i,j}. Since the j-th device executes only one F_{i,j} at a time, only the activations needed for that single micro-batch have to be kept while it is being computed.

Because the forward pass F_{i,j} is recomputed just before B_{i,j} is executed, memory consumption is reduced by a factor of m. Moreover, the recomputation can be performed while the device is waiting, as shown in the following figure:

The dashed arrows represent the execution order between otherwise independent tasks that is introduced by the micro-batch order; the colors indicate different devices.

We note that the recomputation for the last micro-batch, i.e. F_{m,j}, is unnecessary here.

This is because on the j-th device, the last task of the forward pass is F_{m,j}; discarding its intermediate activations and recomputing them at the beginning of the backward pass would not reduce memory, it would only slow down the pipeline. Therefore the corresponding recomputation task is omitted from the figure.

1.4 Devicewise Execution Order

In summary, in pipeline parallelism (with checkpointing), each device is assigned a set of tasks in a prescribed order, and each device carries out its tasks one by one as soon as the cross-device dependencies are satisfied. However, one component is missing from this picture: the data transfer between devices. For completeness, device j must follow the full execution order shown in the figure, in which the data-transfer operations are written out explicitly as "receive" and "send".

For convenience, the library provides the submodule torchgpipe.balance, which computes a partition whose pairwise resource differences between partitions are as small as possible; resource usage is measured by profiling. [2] Imre Bárány and Victor S Grinberg. Block partitions of sequences. Israel Journal of Mathematics, 206(1):155-164.
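As a reminder from the earlier article on automatic balancing, using this submodule looks roughly like the following sketch; the model and sample shapes are placeholders, and the exact options are documented in torchgpipe itself.

import torch
from torch import nn
from torchgpipe import GPipe
from torchgpipe.balance import balance_by_time

# Sketch of automatic balancing: profile the model once, then partition it so that
# the pairwise resource differences between partitions are as small as possible.
model = nn.Sequential(*[nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(12)])

partitions = torch.cuda.device_count()
sample = torch.rand(64, 256)                           # representative input used for profiling
balance = balance_by_time(partitions, model, sample)   # e.g. [4, 4, 4] for 3 similar devices

model = GPipe(model, balance, chunks=8)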

1.5 PyTorch Implementation Difficulties

What we care about most is efficiency. For pipeline parallelism to work as intended, tasks must be issued to each device in the correct order. There are several complications in implementing this in PyTorch.

  • First, because of PyTorch's define-by-run style and its eager execution behavior (as opposed to construct-and-run frameworks), kernels are dispatched to each device dynamically.

    • Therefore, the host code must be carefully designed so that not only are the device-bound tasks issued in the correct order on each device, but also their execution on the device (which is asynchronous with respect to the CPU) is not delayed because the Python interpreter failed to request them ahead of time.
    • Such delays can occur when some of the tasks are CPU-intensive or involve a large number of cheap kernel calls. As a solution, torchgpipe introduces the deterministic clock-cycle algorithm, which gives the overall order of the tasks.
  • Second, the computation graph for backward propagation is constructed dynamically during forward propagation. In other words, PyTorch "avoids materializing the 'forward graph' and records only what is needed to differentiate the computation". Since PyTorch never records the forward computation graph or maintains a gradient tape, its autograd engine only traverses the backward graph. This means the autograd engine may not run in exactly the reverse of the forward execution order unless this is enforced by the structure of the graph. To solve this problem, torchgpipe developed a pair of primitive functions called "fork" and "join" that dynamically create explicit dependencies in the backward computation graph (a conceptual sketch follows this list).

  • Third, communication between multiple devices can cause two-way synchronization if it is not managed carefully. This leads to underutilization, because the sender may wait to synchronize with the receiver even when there is no explicit dependency between the copy and the next task in the queue, and vice versa. torchgpipe avoids this problem by using non-default CUDA streams, so that copies never block computation unless the computation really has to wait for the data.
  • Finally, torchgpipe tries to relax the constraint of micro-batch pipeline parallelism that the model must be sequential.

    • Although in principle any neural network can be written in sequential form, this requires knowing the whole computation graph in advance, which is not the case in PyTorch. In particular, if a tensor skips from a layer on one device to a layer on another device, the tensor will be copied to every device in between, because torchgpipe cannot know about the skip ahead of time. To avoid this problem, we designed an interface for indicating which intermediate tensors are skipped and which layers use them.
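As promised above, here is a conceptual sketch of the fork/join idea, simplified for illustration; torchgpipe's actual dependency primitives (analyzed in the next article) follow the same pattern. A zero-sized "phony" tensor threads an explicit edge through the autograd graph, so the backward pass cannot release the forked branch before the joined branch has been processed.

import torch
from torch import Tensor
from typing import Tuple

# Conceptual sketch of fork/join; see torchgpipe's dependency module for the real thing.
class Fork(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: Tensor) -> Tuple[Tensor, Tensor]:
        phony = torch.empty(0, device=x.device)   # carries no data, only a graph edge
        return x.detach(), phony

    @staticmethod
    def backward(ctx, grad_x: Tensor, grad_phony: Tensor) -> Tensor:
        return grad_x

class Join(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: Tensor, phony: Tensor) -> Tensor:
        return x.detach()

    @staticmethod
    def backward(ctx, grad_x: Tensor) -> Tuple[Tensor, None]:
        return grad_x, None

# Usage: force an ordering between two otherwise independent branches in backward.
a = torch.randn(3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
a2, phony = Fork.apply(a * 2)        # a2 and phony both hang off a's branch
b2 = Join.apply(b * 3, phony)        # b2's backward node now also points at the Fork node
(a2.sum() + b2.sum()).backward()     # autograd must process Join before releasing Fork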

1.6 Summary

Let's summarize the core difficulties so far, as a lead-in to the work that follows.

  • The original pipeline status is as follows:

    • The strategy of pipeline parallelism is to assign tasks according to the partition index j, so that the j-th partition lives entirely on the j-th device.
    • Devices holding later parts of the model must wait until devices holding earlier parts of the model finish their computation.

  • The target pipeline status is as follows:

  • Current issues:

    • If the data is split into several micro-batches, we need to enforce that F_{i,j} completes before F_{i+1,j} executes, and that B_{i,j} completes before B_{i-1,j} executes.
    • The computation graph for backward propagation is constructed dynamically during forward propagation. PyTorch does not record the forward computation graph and does not maintain a gradient tape; its autograd engine only traverses the backward graph. This means the autograd engine may not run in exactly the reverse of the forward execution order unless this is enforced by the structure of the graph.


  • Current difficulties:

    • How to issue the device-bound tasks in the correct order on each device, so that their execution on the device (which is asynchronous with respect to the CPU) is not delayed because the Python interpreter failed to request them ahead of time.
    • How to establish cross-device dependencies between these micro-batches.
  • Implementation scheme:

    • How is the correct execution order guaranteed? torchgpipe introduces the deterministic clock-cycle algorithm, which gives the overall order of the tasks.

    • How are explicit dependencies created dynamically in the computation graph? For each schedule generated by clock_cycles:

      • The fence function calls "fork" and "join" to dynamically create explicit backpropagation dependencies in the backward computation graph.
      • compute(schedule, skip_trackers, in_queues, out_queues) performs the actual computation.

This article first looks at how the correct execution order is guaranteed in the forward computation.

0x02 Execution order

Let's look at the forward dependency, namely the deterministic clock-cycle algorithm. This ordering is used specifically for forward propagation: the forward tasks are executed one by one in the order the algorithm produces.

Normally, forward propagation simply follows the model structure, but because pipeline parallelism is special and the model has been split, torchgpipe needs to provide its own forward execution order to run the individual micro-batches.

2.1 Paper Content

The total order of the tasks is determined by the host code that drives the forward pass. Each device implicitly learns the dependencies between its tasks from the order in which the CPU assigns them. Ideally, if tasks could be assigned to devices at no cost, the CPU could assign them in any order, as long as the per-device order is correct. However, this assumption is not realistic, because launching a kernel on a GPU is not free for the CPU, memory transfers between GPUs may require synchronization, and some tasks are CPU-intensive. Therefore, to minimize the delay coming from the CPU, we prioritize all tasks by their distance to the very first task F_{1,1}.

This scheme is named the deterministic clock-cycle algorithm. In this algorithm, the CPU runs in clock cycles, counting from k = 1 up to m + n - 1. In the k-th clock cycle, the tasks F_{i,j} whose indexes satisfy i + j - 1 = k are handled as follows:

  • First, all the data copies required by these tasks are executed.
  • Then, the computation kernels of these tasks are registered with the corresponding devices (this can safely be done with multiple threads, because tasks within the same clock cycle are independent of each other).
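The connection between "sort by distance to F_{1,1}" and the index condition above is a one-line calculation. For a task F_{i,j} scheduled in the k-th clock cycle (1-based indexes, as in the paper's figure):

\mathrm{dist}(F_{i,j}, F_{1,1}) = (i - 1) + (j - 1) = (i + j - 1) - 1 = k - 1

So all tasks within the same clock cycle share the same distance to F_{1,1}, and the clock cycles enumerate the distances 0, 1, …, m + n - 2 in increasing order.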

2.2 Analysis

Let's take a look at the figure from the paper:

  • At clock 1, F_{1,1} is on the chart; that is, the first micro-batch starts on the first partition.
  • At clock 2, F_{1,2} and F_{2,1} are on the chart; that is, the first micro-batch moves one partition to the right (F_{1,2}) while the second micro-batch enters training (F_{2,1}).
  • At clock 3, F_{1,3}, F_{2,2} and F_{3,1} are on the chart; that is, the first and second micro-batches each move one partition to the right, and the third micro-batch enters training (F_{3,1}).
  • At clock 4, F_{2,3}, F_{3,2} and F_{4,1} are on the chart; that is, the second and third micro-batches each move one partition to the right, and the fourth micro-batch enters training (F_{4,1}).
  • And so on.

Corresponding to the figure, we can see that:

  • F_{2,1} and F_{1,2} are at distance 1 from F_{1,1}: they are one step away.
  • F_{3,1}, F_{2,2} and F_{1,3} are at distance 2 from F_{1,1}: they are two steps away.

This logic can be seen clearly in the figure below. So the clock algorithm simply sorts all tasks by their distance to F_{1,1}. It is much like the ripples produced by throwing a stone into water: they spread outward layer by layer, from near the drop point to far away.

Here the colors represent different devices.

2.3 Code

Let's look at the code. The first piece generates the clock cycles, where:

  • min(1+k, n) is the exclusive upper bound on the partition index j at clock k: at most 1+k partitions can have been started by then, capped at the total number of partitions n.
  • max(1+k-m, 0) is the lower bound on the partition index j at clock k: since there are only m micro-batches, the earliest partitions have already finished their last micro-batch and drop out of the schedule.

Therefore, the sequence yielded for clock k is the list of (index of micro-batch, index of partition) pairs that can be started in that clock cycle.

def clock_cycles(m: int, n: int) -> Iterable[List[Tuple[int, int]]]:
    """Generates schedules for each clock cycle."""
    # m: number of micro-batches
    # n: number of partitions
    # i: index of micro-batch
    # j: index of partition
    # k: clock number
    #
    # k (i,j) (i,j) (i,j)
    # - ----- ----- -----
    # 0 (0,0)
    # 1 (1,0) (0,1)
    # 2 (2,0) (1,1) (0,2)
    # 3       (2,1) (1,2)
    # 4             (2,2)
    #
    # k is the clock number; there are at most m+n-1 clock cycles.
    # At clock k, partition j runs micro-batch i = k-j, with
    # j ranging from max(1+k-m, 0) to min(1+k, n) (exclusive).
    for k in range(m+n-1):
        yield [(k-j, j) for j in range(max(1+k-m, 0), min(1+k, n))]

For example, with m = 4 micro-batches and n = 3 partitions, the output of clock_cycles(4, 3) (0-based indexes) is:

[(0, 0)]
[(1, 0), (0, 1)]
[(2, 0), (1, 1), (0, 2)]
[(3, 0), (2, 1), (1, 2)]
[(3, 1), (2, 2)]
[(3, 2)]

The paper contains a sample diagram, and that diagram is not completely consistent with the docstring annotation and the code: the figure's numbering starts from 1, while the code's starts from 0. For better illustration we will follow the diagram, so we rewrite the annotation as follows:

# 0 (0,0)              --> clock 1 runs (1,1)
# 1 (1,0) (0,1)        --> clock 2 runs (2,1) (1,2)
# 2 (2,0) (1,1) (0,2)  --> clock 3 runs (3,1) (2,2) (1,3)
# 3       (2,1) (1,2)  --> clock 4 runs (3,2) (2,3)
# 4             (2,2)  --> clock 5 runs (3,3)

Let's also modify the code above to print 1-based indexes, so that the output can be matched against the figure more easily.

m = 4  # m: number of micro-batches
n = 3  # n: number of partitions
for k in range(m + n - 1):
    print([(k - j + 1, j + 1) for j in range(max(1 + k - m, 0), min(1 + k, n))])

# Output:
# [(1, 1)]                    # round 1 training plan & data
# [(2, 1), (1, 2)]            # round 2 training plan & data
# [(3, 1), (2, 2), (1, 3)]    # round 3 training plan & data
# [(4, 1), (3, 2), (2, 3)]    # round 4 training plan & data
# [(4, 2), (3, 3)]            # round 5 training plan & data
# [(4, 3)]                    # round 6 training plan & data

Let's redraw the pipeline.


Now let's lay out the output above in the same form as the paper's pipeline figure, for comparison.

It can be seen that in the first 4 clock cycles, the 4 micro-batches enter cuda:0 one after another, namely (1,1), (2,1), (3,1), (4,1). Then, within each iteration (clock cycle), the schedules given by the clock_cycles algorithm are executed in order. After 6 clock cycles, the first round of forward computation is complete. This forms a pipeline.

The advantage of pipelining is that, if the number of micro-batches is configured properly, every device can be kept working on some batch within each clock cycle as far as possible. By contrast, a naive pipeline can only have one device active at any moment.

          +---------+---------+---------+---------+---------+---------+
 cuda:0   |  (1,1)  |  (2,1)  |  (3,1)  |  (4,1)  |         |         |
          +---------+---------+---------+---------+---------+---------+
 cuda:1   |         |  (1,2)  |  (2,2)  |  (3,2)  |  (4,2)  |         |
          +---------+---------+---------+---------+---------+---------+
 cuda:2   |         |         |  (1,3)  |  (2,3)  |  (3,3)  |  (4,3)  |
          +---------+---------+---------+---------+---------+---------+
          | clock 1 | clock 2 | clock 3 | clock 4 | clock 5 | clock 6 |
          +------------------------------------------------------------------> Time

The flow of the individual data batches is as follows:

 cuda:0    (1,1)     (2,1)     (3,1)     (4,1)
              |         |         |         |
              +------+  +------+  +------+  +------+
                     |         |         |         |
                     v         v         v         v
 cuda:1              (1,2)     (2,2)     (3,2)     (4,2)
                        |         |         |         |
                        +------+  +------+  +------+  +------+
                               |         |         |         |
                               v         v         v         v
 cuda:2                        (1,3)     (2,3)     (3,3)     (4,3)

           clock 1   clock 2   clock 3   clock 4   clock 5   clock 6
         +-------------------------------------------------------------------> Time
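A quick back-of-the-envelope check of the utilization of this schedule (m = 4 micro-batches, n = 3 partitions), just to quantify the idle corners visible in the table above:

# Utilization of the 6-clock schedule above: each clock cycle offers n device slots,
# and each (micro-batch, partition) pair occupies exactly one slot.
m, n = 4, 3
total_slots = n * (m + n - 1)     # 3 devices * 6 clock cycles = 18 slots
busy_slots = m * n                # 4 micro-batches * 3 partitions = 12 slots doing work
print(busy_slots / total_slots)   # 0.666..., i.e. m / (m + n - 1); approaches 1 as m grows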

2.4 Usage

In the Pipeline class, we can see that computation is launched clock cycle by clock cycle, so that during forward propagation the execution spreads like ripples on water.

def run(self) -> None:
    """Runs pipeline parallelism.

    It modifies the given batches in place.
    """
    batches = self.batches
    partitions = self.partitions
    devices = self.devices
    skip_layout = self.skip_layout

    m = len(batches)
    n = len(partitions)
    skip_trackers = [SkipTrackerThroughPotals(skip_layout) for _ in batches]

    with spawn_workers(devices) as (in_queues, out_queues):
        for schedule in clock_cycles(m, n):  # clock_cycles is used here; it gives the execution plan for each clock cycle
            self.fence(schedule, skip_trackers)                           # build dependencies and copy data between devices
            self.compute(schedule, skip_trackers, in_queues, out_queues)  # perform the computation
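From the user's perspective, all of this is driven by the chunks argument of GPipe: the mini-batch is split into chunks micro-batches, and the forward pass runs the clock-cycle schedule shown above. A minimal usage sketch follows; the model, balance, and sizes are illustrative, and it assumes three visible GPUs (otherwise pass devices= explicitly).

import torch
from torch import nn
from torchgpipe import GPipe

# Illustrative 3-partition model; assumes at least 3 GPUs are visible.
model = nn.Sequential(
    nn.Sequential(nn.Linear(128, 128), nn.ReLU()),
    nn.Sequential(nn.Linear(128, 128), nn.ReLU()),
    nn.Sequential(nn.Linear(128, 10)),
)
model = GPipe(model, balance=[1, 1, 1], chunks=4)  # n=3 partitions, m=4 micro-batches

x = torch.randn(64, 128).to(model.devices[0])      # input goes to the first partition's device
output = model(x)                                  # forward runs the 6-clock pipeline shown above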

At this point, the analysis of the forward propagation execution order is complete. In the next article, we will analyze the dependency relationships.

0xEE Personal information

★★★★ Thoughts on life and technology ★★★★★

Wechat official account: Rosie’s Thoughts
