0x00 Abstract

In the previous articles, we covered the basics of PyTorch pipeline parallelism, the automatic balancing mechanism, micro-batch splitting, and so on. In this article we look at how pipeline dependencies are implemented; the core question is how to establish cross-device dependencies between these micro-batches.

The other articles in this pipeline-parallelism series are linked below:


Deep learning pipeline parallel GPipe (1) -- basic pipeline implementation

Deep learning pipeline parallel GPipe (2) -- gradient accumulation

Deep learning pipeline parallel GPipe (3) -- recomputation

Deep learning pipeline parallel PipeDream (1) -- Profile stage

Deep learning pipeline parallel PipeDream (2) -- computing partitions

Deep learning pipeline parallel PipeDream (3) -- model transformation

Deep learning pipeline parallel PipeDream (4) -- runtime engine

Deep learning pipeline parallel PipeDream (5) -- communication module

Deep learning pipeline parallel PipeDream (6) -- 1F1B strategy

PyTorch pipeline parallel implementation (1) -- basics

PyTorch pipeline parallel implementation (2) -- how to divide the model

PyTorch pipeline parallel implementation (3) -- shard data and runtime system

PyTorch pipeline parallel implementation (4) -- forward computation

The images in this article are taken from the paper and the GitHub source code.

0x01 Previous review

To better understand this article, let’s first look at the key parts of the previous article.

  • The original pipeline status is as follows:

    • The strategy for pipeline parallelism is to assign tasks by partition index j, so that the j-th partition runs entirely on the j-th device.
    • Devices holding later portions of the model must wait until devices holding earlier portions of the model have finished their computation.

  • The target pipeline status is as follows:

  • Current issues:

    • If the mini-batch is split into several micro-batches, we must enforce that F[i, j] completes before F[i+1, j] starts, and that B[i, j] completes before B[i-1, j] starts.
    • The computation graph for backward propagation is constructed dynamically during forward propagation. PyTorch does not record the forward computation or maintain a gradient tape; instead, its autograd engine propagates backward through the recorded graph. This means the autograd engine will not necessarily run in exactly the reverse of the forward execution order unless that order is enforced by the structure of the graph.
  • Current difficulties:

    • How to publish the device-bound tasks on each device in the correct order, so that execution on the device (which runs asynchronously with respect to the CPU) is not delayed because the Python interpreter failed to request them ahead of time. [This has already been introduced]
    • How to establish cross-device dependencies between these small batches.
  • Implementation scheme:

    • How to ensure the correct execution order? torchgpipe introduces a deterministic clock-cycle algorithm, which gives the total order of tasks. [This has already been introduced]

    • How to guarantee the dynamic, explicit dependencies in the computation graph? For each run plan generated by clock_cycles:

      • Use the fence function, which calls fork and join, to dynamically create explicit backpropagation dependencies in the backward computation graph.
      • Use compute(schedule, skip_trackers, in_queues, out_queues) to perform the computation.

Since the execution-order scheme was described in the previous article, this article focuses on how the dependencies are computed.

0x02 Compute dependencies

+-----------------------------------------------------------------------------------------+
|                                                                                         |
| Layer 1 +--->  Layer 2 +-----> Layer 3 +----->  Layer 4 +-----> Layer 5  +---> Layer 6  |
|                                                                                         |
+--------------------------+---------------------------+----------------------------------+
                                          +
                                          |
                                          |
                                          v
 +------------------------------------------------------------------------------------+
 | +--------------------+         +---------------------+      +--------------------+ |
 | |Partition 1         |         |Partition 2          |      |Partition 3         | |
 | |                    |         |                     |      |                    | |
 | |      Layer 1       |    +---------> Layer 4        |      |                    | |
 | |         +          |    |    |         +           |  +------->   Layer 6      | |
 | |         |          |    |    |         |           |  |   |                    | |
 | |         v          |    |    |         |           |  |   |                    | |
 | |      Layer 2       |    |    |         |           |  |   |                    | |
 | |         +          |    |    |         v           |  |   |                    | |
 | |         |          |    |    |      Layer 5 +---------+   |                    | |
 | |         v          |    |    |                     |      |                    | |
 | |      Layer 3  +---------+    |                     |      |                    | |
 | |                    |         |                     |      |                    | |
 | +---------+----------+         +---------+-----------+      +-----------+--------+ |
 |                                                                                    |
 +------------------------------------------------------------------------------------+

Why do we need to calculate dependencies?

  • Because the model has been split into layers, different parts of the model have been placed on different devices, and the data has been divided into micro-batches, the original linear dependencies inside the model need to become pipeline dependencies. The original computation graph therefore no longer meets the requirements and must be supplemented with explicit dependencies. As shown in the figure above, the six layers are divided into three partitions; how are the dependencies between these three partitions built?
  • Whereas the linear dependencies were previously established once at model-definition time, a dynamic dependency now needs to be established at every run.

So for pipeline parallelism, torchgpipe needs to add its own cross-device, pseudo-distributed dependencies. It does this by making various adjustments to the forward and backward computation graphs. The computation graph implies various dependency logic, and the missing pieces are filled in by the Fork and Join functions described in this section.

I initially wondered how torchgpipe could build a remote backward computation graph without using PyTorch RPC or P2P. As it turns out, I was overthinking it: torchgpipe does not consider that case at all. It targets GPUs that are all on the same host and does not involve remote multi-machine computation.

torchgpipe is essentially a single process that runs multiple threads for computation, an alternative to DP (nn.DataParallel). For example, there is a comparison in the source code as follows:

### ResNet-101 Accuracy Benchmark

Batch size | torchgpipe | nn.DataParallel | Goyal et al.
---------- | ---------: | --------------: | -----------:
256        | 21.99±0.13 | 22.02±0.11      | 22.08±0.06
1K         | 22.24±0.19 | 22.04±0.24      | N/A
4K         | 22.13±0.09 | N/A             | N/A

For example, the code explicitly says:

If you decide not to use checkpointing at all, :class:`nn.DataParallel
<torch.nn.DataParallel>` might be more efficient than GPipe.

0x03 Backpropagation Dependency

Let's first look at the backward-propagation dependencies, which are the focus of this article.

3.1 Analysis

Again, let’s recall the first two illustrations.

Figure 1

Figure 2

There are two dependencies that need to be established (because Juejin has had problems rendering formulas recently, they are shown as images here):

  • Row dependencies, between micro-batches on the same device: F[i, j] must finish before F[i+1, j], and B[i, j] must finish before B[i-1, j].
  • Column dependencies, between partitions (devices): F[i, j] must finish before F[i, j+1], because the output of device j is the input of device j+1.

3.2 Basic Functions

3.2.1 Function

First, let's look at what torch.autograd.Function does.

The torch.autograd.Function class is the basic parent class of an operation function. Such an operation function must implement two basic procedures: the forward computation and the backward derivation.

If something cannot be done with PyTorch's existing layers or methods, a new operation needs to be implemented to extend PyTorch. When you need to customize the derivation rules rather than rely on automatic differentiation, you should extend the torch.autograd.Function class: since PyTorch cannot derive the gradient automatically in this case, the user has to define both the forward computation and the backward propagation. This is what "Extending torch.autograd" refers to.
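As a quick illustration, here is a minimal sketch (a hypothetical example, not from the torchgpipe source) of extending torch.autograd.Function with a hand-written forward and backward:

import torch

class MulConstant(torch.autograd.Function):
    """Multiplies the input by a constant, with a hand-written backward."""

    @staticmethod
    def forward(ctx, input, constant):
        ctx.constant = constant          # stash what backward() will need
        return input * constant

    @staticmethod
    def backward(ctx, grad_output):
        # One gradient per forward input; the non-tensor `constant` gets None.
        return grad_output * ctx.constant, None

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = MulConstant.apply(x, 3.0)
y.sum().backward()
print(x.grad)   # tensor([3., 3.])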

Next, we introduce the key algorithm of Backward Dependency: Fork and Join.

3.2.2 Fork

Fork is an autograd function that maps a tensor x to the pair (x, phony), where phony is an empty tensor. The fork method is built on a Fork class that extends torch.autograd.Function.

def fork(input: Tensor) -> Tuple[Tensor, Tensor]:
    """Branches out from an autograd lane of the given tensor."""
    if torch.is_grad_enabled() and input.requires_grad:
        input, phony = Fork.apply(input)
    else:
        phony = get_phony(input.device, requires_grad=False)

    return input, phony


class Fork(torch.autograd.Function):
    @staticmethod
    def forward(ctx: 'Fork', input: Tensor) -> Tuple[Tensor, Tensor]:  # type: ignore
        phony = get_phony(input.device, requires_grad=False)
        return input.detach(), phony.detach()

    @staticmethod
    def backward(ctx: 'Fork', grad_input: Tensor, grad_grad: Tensor) -> Tensor:  # type: ignore
        return grad_input

3.2.3 Join

Join is an autograd function that maps the pair (x, phony) back to the tensor x, where phony is again an empty tensor. The join method is likewise built on a Join class that extends torch.autograd.Function.

def join(input: Tensor, phony: Tensor) -> Tensor:
    """Merges two autograd lanes."""
    if torch.is_grad_enabled() and (input.requires_grad or phony.requires_grad):
        input = Join.apply(input, phony)

    return input


class Join(torch.autograd.Function):
    @staticmethod
    def forward(ctx: 'Join', input: Tensor, phony: Tensor) -> Tensor:  # type: ignore
        return input.detach()

    @staticmethod
    def backward(ctx: 'Join', grad_input: Tensor) -> Tuple[Tensor, None]:  # type: ignore
        return grad_input, None

3.2.4 Phony

Phony is a tensor that occupies no space. Because it does not require any gradient accumulation, it can be used to build arbitrary dependencies in an autograd graph.

def get_phony(device: torch.device, *, requires_grad: bool) -> Tensor:
    """Gets a phony. Phony is tensor without space.

    It is useful to make arbitrary dependency in a autograd graph because it
    doesn't require any gradient accumulation.

    .. note::

        Phonies for each device are cached. If an autograd function gets a phony
        internally, the phony must be detached to be returned. Otherwise, the
        autograd engine will mutate the cached phony in-place::

            class Phonify(torch.autograd.Function):
                @staticmethod
                def forward(ctx, input):
                    phony = get_phony(input.device, requires_grad=False)
                    return phony.detach()  # detach() is necessary.

    """
    key = (device, requires_grad)

    try:
        phony = _phonies[key]
    except KeyError:
        with use_stream(default_stream(device)):
            phony = torch.empty(0, device=device, requires_grad=requires_grad)

        _phonies[key] = phony

    return phony

3.2.5 detach

In the code you will often see detach being used. As the comments explain, this works around a restriction in the then-current version of PyTorch.

    # A Python autograd function might fail with this error:
    #
    #   RuntimeError: Returning Variables sharing storage with other Variables
    #   that require grad is not supported in Python functions. Please submit a
    #   feature request if you hit this error.
    #
    # It doesn't look like an essential restriction. But it happens on the
    # current PyTorch version. To avoid it, we should detach the tensor before
    # returning by identity autograd functions, such as Wait, Fork, and Join.
    #

3.3 Usage

We can see how these are used in Pipeline. The fence method uses depend to build the backpropagation dependency, ensuring that batches[i-1] is executed after batches[i] during backpropagation.

def fence(self,
          schedule: List[Tuple[int, int]],
          skip_trackers: List[SkipTrackerThroughPotals],
          ) -> None:
    """Copies micro-batches after computation for the previous
    micro-batches.
    """
    batches = self.batches
    copy_streams = self.copy_streams
    skip_layout = self.skip_layout

    for i, j in schedule:
        # Ensure that batches[i-1] is executed after batches[i] in
        # backpropagation by an explicit dependency.
        if i != 0:
            depend(batches[i-1], batches[i])  # build the backward dependency

        next_stream = copy_streams[j][i]

        for prev_j, ns, name in skip_layout.copy_policy(j):
            prev_stream = copy_streams[prev_j][i]
            skip_trackers[i].copy(batches[i], prev_stream, next_stream, ns, name)

        if j != 0:
            prev_stream = copy_streams[j-1][i]
            copy(batches[i], prev_stream, next_stream)

The depend code itself is as follows:

def depend(fork_from: Batch, join_to: Batch) -> None:
    fork_from[0], phony = fork(fork_from[0])
    join_to[0] = join(join_to[0], phony)

To make this easier to follow, let's substitute the actual arguments into the function and rewrite it (as illustrative pseudocode) with those names:

def depend(batches[i-1]: Batch, batches[i]: Batch) -> None:
    batches[i-1][0], phony = fork(batches[i-1][0])
    batches[i][0] = join(batches[i][0], phony)

The logic is as follows: the phony tensor acts as a bridge, so that in the forward graph batches[i] depends on the result of batches[i-1]; consequently, during backpropagation, the backward pass of batches[i-1] can only run after the backward pass of batches[i].

      +----------------+          +--------------+
      |                |          |              |
      |  batches[i-1]  |          |  batches[i]  |
      |                |          |              |
      +----------+-----+          +-----+--------+
                 |                      |
                 |                      |
                 |                      |
                 v                      v
+--------------------------------------------------------+
| depend         |                      |                |
|                |                      |                |
|                |                      |                |
|                v                      |                |
|        +-----------------------+      |                |
|        | fork  |               |      |                |
|        |       |    get_phony  |      |                |
|        |       |        +      |      |                |
|        |       |        |      |      |                |
|        |       |        |      |      |                |
|        +-----------------------+      |                |
|                |        |             |                |
|                |        |             |                |
|                |        |             |                |
|                v        v             |                |
|    +-----------+--+  +--+-----+       |                |
|    |              |  |        |       |                |
|    | batches[i-1] |  | phony  |       |                |
|    |              |  |        |       |                |
|    +--------------+  +--+-----+       |                |
|                         |             |                |
|                         |             |                |
|                         v             v                |
|                      +--+------------------+           |
|                      |Join            |    |           |
|                      |                |    |           |
|                      |                |    |           |
|                      |                v    |           |
|                      +---------------------+           |
|                                       |                |
|                                       |                |
|                                       |                |
|                                       v                |
|                                 +-----+------+         |
|                                 |            |         |
|                                 | batches[i] |         |
|                                 |            |         |
|                                 +------------+         |
|                                                        |
+--------------------------------------------------------+
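To see the effect of this bridge, here is a self-contained sketch: simplified re-creations of Fork, Join and depend, plus a hypothetical Work function standing in for a partition's computation. It is an illustration under those assumptions, not the torchgpipe source. Running it prints the backward of batches[i] before the backward of batches[i-1], because Fork's backward must wait for Join's backward through the phony edge:

import torch

class Fork(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        phony = torch.empty(0, device=input.device, requires_grad=False)
        return input.detach(), phony.detach()

    @staticmethod
    def backward(ctx, grad_input, grad_phony):
        return grad_input

class Join(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, phony):
        return input.detach()

    @staticmethod
    def backward(ctx, grad_input):
        return grad_input, None

def depend(fork_from, join_to):
    # fork_from / join_to are one-element lists standing in for Batch objects.
    fork_from[0], phony = Fork.apply(fork_from[0])
    join_to[0] = Join.apply(join_to[0], phony)

class Work(torch.autograd.Function):
    """A stand-in for a partition's computation that logs its backward."""
    @staticmethod
    def forward(ctx, input, tag):
        ctx.tag = tag
        return input * 2

    @staticmethod
    def backward(ctx, grad_output):
        print('backward of', ctx.tag)
        return grad_output * 2, None

# micro-batch i-1 has already gone through "partition j"...
prev = [Work.apply(torch.ones(2, requires_grad=True), 'batches[i-1]')]
# ...and micro-batch i is about to; the fence inserts the dependency first.
nxt = [torch.ones(2, requires_grad=True)]
depend(prev, nxt)
out = Work.apply(nxt[0], 'batches[i]')   # forward of batches[i] on partition j

(prev[0].sum() + out.sum()).backward()
# prints: backward of batches[i]
#         backward of batches[i-1]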

If we chain a number of micro-batches together in this way, we get a dependency chain, as sketched below.

(Diagram: a chain of depend blocks. In the first block, fork branches batches[i] into (batches[i], phony) and Join merges the phony into batches[i+1]; in the next block, fork branches batches[i+1] and its phony joins into batches[i+2]; and so on for the following micro-batches.)

These dependencies are established in the forward computation graph, so that during backpropagation batches[i] must be completed before batches[i-1].

depend(batches[i-1], batches[i])

In order to correspond to the figure in the paper, we modified it as:

depend(batches[i], batches[i+1])

The depend code also changes to:

def depend(batches[i]: Batch, batches[i+1]: Batch) -> None:
    batches[i][0], phony = fork(batches[i][0])
    batches[i+1][0] = join(batches[i+1][0], phony)
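Chaining this over all micro-batches is then just a loop; a minimal sketch (illustrative only; in torchgpipe the calls actually happen inside fence, once per clock cycle):

def build_backward_chain(batches):
    # After this loop, the backward passes on this partition are forced to run
    # in the order batches[m-1], batches[m-2], ..., batches[0].
    for i in range(len(batches) - 1):
        depend(batches[i], batches[i+1])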

0x04 Forward propagation dependency

Let's now turn to the forward dependencies. Part of the purpose of forward propagation is to set up the backpropagation dependencies; so far, backpropagation only has the dependencies between rows (micro-batches), not between columns (partitions), so we complete those now.

A dependency between columns is a dependency between devices: the output of one device is the input of the next.

4.1 Splitting the Model

First of all, we need to review how the model is split, as seen in split_module.

GPipe's partitions member variable is of type nn.ModuleList. nn.ModuleList is a container that stores different modules and automatically registers each module's parameters with the network. However, nn.ModuleList does not define a network by itself; it merely stores modules together. There is no execution order among these modules; the order in which the network executes is determined by the forward function.

def split_module(module: nn.Sequential,
                 balance: Iterable[int],
                 devices: List[torch.device],
                 ) -> Tuple[List[nn.Sequential], List[int], List[torch.device]]:
    balance = list(balance)

    j = 0
    partitions = []
    layers: NamedModules = OrderedDict()

    for name, layer in module.named_children():
        layers[name] = layer

        if len(layers) == balance[j]:
            # Group buffered layers as a partition.
            partition = nn.Sequential(layers)  # a partition is an nn.Sequential

            device = devices[j]
            partition.to(device)  # the partition is placed on its device

            partitions.append(partition)

            # Prepare for the next partition.
            layers.clear()
            j += 1

    partitions = cast(List[nn.Sequential], nn.ModuleList(partitions))
    del devices[j:]

    return partitions, balance, devices
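As a quick aside, here is a toy contrast (a hypothetical example, not from torchgpipe) between nn.Sequential, which fixes the execution order, and nn.ModuleList, which leaves the order to the surrounding code:

import torch
import torch.nn as nn

layers = [nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2)]

# nn.Sequential defines forward(): the layers always run in this order.
seq = nn.Sequential(*layers)
out = seq(torch.randn(1, 4))       # Linear -> ReLU -> Linear

# nn.ModuleList only registers the modules and their parameters;
# it has no forward() of its own, so the caller decides the order.
mlist = nn.ModuleList(layers)
x = torch.randn(1, 4)
for layer in mlist:                # the order is chosen here, by our loop
    x = layer(x)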

The question then becomes: inside a partition, nn.Sequential executes its layers in order, but how is the execution order between partitions configured?

+-----------------------------------------------------------------------------------------+
|                                                                                         |
| Layer 1 +--->  Layer 2 +-----> Layer 3 +----->  Layer 4 +-----> Layer 5  +---> Layer 6  |
|                                                                                         |
+-----------------------------------------+-----------------------------------------------+
                                          |
                                          |
                                          |
                                          v
+-----------------------------------------------------------------------------------------+
| +--------------------+           +---------------------+         +--------------------+ |
| |Partition 1         |           |Partition 2          |         |Partition 3         | |
| |                    |   ???     |                     |         |                    | |
| |      Layer 1       |     +----------> Layer 4        |   ???   |                    | |
| |         +          |     |     |         +           |     +------->   Layer 6      | |
| |         |          |     |     |         |           |     |   |                    | |
| |         v          |     |     |         |           |     |   |                    | |
| |      Layer 2       |     |     |         |           |     |   |                    | |
| |         +          |     |     |         v           |     |   |                    | |
| |         |          |     |     |      Layer 5 +------------+   |                    | |
| |         v          |     |     |                     |         |                    | |
| |      Layer 3  +----------+     |                     |         |                    | |
| |                    |           |                     |         |                    | |
| +--------------------+           +---------------------+         +--------------------+ |
|                                                                                         |
+-----------------------------------------------------------------------------------------+

4.2 Establishing Dependencies

Let's start with the paper. Suppose we have a neural network that is represented as a composition of a sequence of sub-networks. Let the sub-networks be f^1, ..., f^n, with parameters θ^1, ..., θ^n respectively, so that the whole network is

f = f^n ∘ f^(n-1) ∘ ... ∘ f^1

with parameter θ = (θ^1, ..., θ^n). For clarity, we call f^j the j-th partition of f, and we assume that the parameters of different partitions are mutually disjoint.

When training the network, gradient-based methods (such as stochastic gradient descent) require, given a mini-batch of training data x and the corresponding loss, computing the network output f(x) and the gradient g of the loss with respect to the network parameters θ. These two stages are called forward propagation and backward propagation, respectively.

Since f is the sequential composition of its n partitions f^j, the forward pass can be computed by letting x^0 = x (i.e., feeding in the input x) and then applying each partition in turn, x^j = f^j(x^(j-1)) for j = 1, ..., n, so that f(x) = x^n.

So we know that the order of forward propagation is determined by this sequential application of the partitions f^j.
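In code, this sequential application of the partitions is simply the following loop (a minimal sketch with hypothetical names):

def forward_through_partitions(partitions, x):
    # x^0 = x; x^j = f^j(x^(j-1)) for j = 1, ..., n; the result is f(x) = x^n.
    for partition in partitions:      # partitions[0] is f^1, and so on
        x = partition(x)
    return x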

We can take a closer look at the code to see how sequential dependencies between partitions are enforced.

def run(self) -> None:
    """Runs pipeline parallelism.

    It modifies the given batches in place.
    """
    batches = self.batches
    partitions = self.partitions
    devices = self.devices
    skip_layout = self.skip_layout

    m = len(batches)
    n = len(partitions)
    skip_trackers = [SkipTrackerThroughPotals(skip_layout) for _ in batches]

    with spawn_workers(devices) as (in_queues, out_queues):
        for schedule in clock_cycles(m, n):
            # clock_cycles generates the execution plan; fence and compute
            # then follow that plan.
            self.fence(schedule, skip_trackers)
            self.compute(schedule, skip_trackers, in_queues, out_queues)

The key to this analysis is the loop for schedule in clock_cycles(m, n).

Our question is: how does this for loop guarantee that F[i, j] runs before F[i, j+1], and hence that B[i, j+1] runs before B[i, j] during backward propagation? In other words, how is the inter-column (inter-device) dependency established?

This can be analyzed through the source code of compute. The highlights are:

  • The key is how batches[i] changes: after being computed by partitions[j], batches[0] becomes, conceptually, batches[0][j].
  • In the compute method, the crucial statement is batches[i] = batch near the bottom: the output computed by the j-th device for the i-th micro-batch is assigned back into batches[i]. At that point batches[i] is effectively F[i, j]; the next clock cycle builds F[i, j+1] on top of it, and the next depend operation inside fence acts on this updated batches[i].
  • Therefore, through this assignment, F[i, j+1] depends on F[i, j] in the forward computation, and consequently B[i, j+1] must be completed before B[i, j] in the backward computation.
def compute(self,
            schedule: List[Tuple[int, int]],
            skip_trackers: List[SkipTrackerThroughPotals],
            in_queues: List[InQueue],
            out_queues: List[OutQueue],
            ) -> None:
    """Runs tasks with synchronization to copy streams."""
    batches = self.batches
    partitions = self.partitions
    devices = self.devices
    copy_streams = self.copy_streams

    n = len(partitions)
    streams = [current_stream(d) for d in devices]

    for i, j in schedule:
        batch = batches[i]
        partition = partitions[j]

        # Synchronize with the copied input. ([1] in the diagram)
        if j != 0:
            wait(batch, copy_streams[j][i], streams[j])

        # Determine whether checkpointing or not.
        if checkpoint:
            ...  # the checkpointing branch is omitted here
        else:
            def compute(batch: Batch = batch,
                        partition: nn.Sequential = partition,
                        skip_tracker: SkipTrackerThroughPotals = skip_trackers[i],
                        ) -> Batch:
                with use_skip_tracker(skip_tracker):
                    # Forward computation. Computation is done per partition;
                    # the layers inside a partition run sequentially, which is
                    # guaranteed by nn.Sequential.
                    return batch.call(partition)

            task = Task(streams[j], compute=compute, finalize=None)

        # Compute tasks in parallel. ([2] in the diagram)
        in_queues[j].put(task)  # let the worker of device j compute

    for i, j in schedule:
        ok, payload = out_queues[j].get()

        task, batch = cast(Tuple[Task, Batch], payload)

        # The copy stream synchronizes to copy the output. ([3] in the
        # diagram)
        if j != n-1:
            wait(batch, streams[j], copy_streams[j][i])

        # Finalize tasks. If checkpointing is enabled, here the
        # recomputation is scheduled at backpropagation. ([4] in the
        # diagram)
        with use_device(devices[j]):
            task.finalize(batch)

        # batches[i] is now F[i, j]; the next clock cycle builds F[i, j+1]
        # on top of it, and the next depend inside fence targets it.
        batches[i] = batch

For this assignment, the corresponding grad_fn is PermuteBackward, as in:

a = torch.tensor([2., 3.], requires_grad=True)
external_grad = torch.tensor([1., 1.])
c = a
c.backward(gradient=external_grad)
print(c)

Concretely, in the debugger we see:

c = {Tensor: 2} tensor([2., 3.], requires_grad=True)
  T = {Tensor: 2} tensor([2., 3.], grad_fn=<PermuteBackward>)
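Stripping away streams, queues and workers, the dependency wiring of run()/fence()/compute() boils down to the following single-threaded toy. This is a sketch with hypothetical names; clock_cycles is re-derived from the schedule described in the previous article:

from typing import Iterable, List, Tuple

import torch
import torch.nn as nn

def clock_cycles(m: int, n: int) -> Iterable[List[Tuple[int, int]]]:
    # At clock k, micro-batch i runs on partition j whenever i + j == k.
    for k in range(m + n - 1):
        yield [(k - j, j) for j in range(max(1 + k - m, 0), min(1 + k, n))]

def toy_run(batches, partitions):
    m, n = len(batches), len(partitions)
    for schedule in clock_cycles(m, n):
        # (here the fence step of section 0x03 would call depend() to add the
        #  row dependency between batches[i-1] and batches[i])
        for i, j in schedule:
            # Writing the result back makes partition j's output the very
            # tensor that partition j+1 reads in a later clock cycle, so the
            # autograd graph links F[i, j] -> F[i, j+1] automatically.
            batches[i] = partitions[j](batches[i])
    return batches

partitions = [nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 2)]
batches = [torch.randn(1, 4, requires_grad=True) for _ in range(4)]
outputs = toy_run(batches, partitions)
torch.stack(outputs).sum().backward()   # gradients flow through all partitions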

Now let's update the earlier diagram; the result is sketched below.

(Diagram: inside depend, fork splits batches[i] into (batches[i], phony) via get_phony, and Join merges the phony into batches[i+1]; batches[i+1] therefore depends on batches[i] in the autograd graph.)

We now expand the batches horizontally: a mini-batch is split into two micro-batches, batches[i] and batches[i+1], which are pipelined across partitions[j] and partitions[j+1] on two devices, so that both the rows and the columns carry the backpropagation dependencies.

(Diagram: for F[i,j] on partitions[j], the fence's depend links batches[i] and batches[i+1] (row dependency), and compute's forward turns batches[i] into batches[i][j]; F[i,j+1] on partitions[j+1] then consumes batches[i][j] and produces batches[i][j+1] (column dependency), and likewise for batches[i+1].)


0x05 Summary

Consider the figure below (from the paper): the model is divided into 3 sub-networks (partitions), and the mini-batch is divided into 4 micro-batches. The subscripts of F and B are (m, n), where m is the micro-batch index and n is the partition index.

As shown above, there are two dependencies that need to be completed:

  • Row dependencies are dependencies between micro-batches, i.e., dependencies inside one device. They are drawn as dashed lines: F[m, n] must be completed before F[m+1, n], and B[m+1, n] must be completed before B[m, n].
  • Column dependencies are dependencies between partitions (devices). They are drawn as solid lines: F[m, n] must be completed before F[m, n+1], i.e., the n-th device must finish before the (n+1)-th device, because the output of the n-th device is the input of the (n+1)-th device.

As shown above, we need to complete both row and column dependencies.

  • The dependency between rows (micro-batches) is guaranteed by Fork & Join: the dependency is carried by an empty phony tensor, which ensures that in backpropagation batches[i-1] completes only after batches[i].
  • The dependency between columns (devices) is established through the assignment batches[i] = batch: the output of one device becomes the input of the next, with the grad_fn (such as PermuteBackward) carrying the dependency between devices.

Now that we have the execution order and dependencies set up, the next article will show you how to do parallel processing.

0xEE Personal information

★★★★ Thoughts on life and technology ★★★★★

Wechat official account: Rosie’s Thoughts
