0x00 Abstract

Following the GPipe series, we open a new series on pipeline-parallel training, introducing Microsoft's PipeDream. This article covers its general idea, its architecture, and the Profile stage.

The GPipe pipeline has two problems: low hardware utilization and high memory consumption. In another pipelining paper, Microsoft's PipeDream proposed an improvement for these problems, the 1F1B (One Forward pass followed by One Backward pass) strategy. With 1F1B, the number of cached activation copies depends only on the number of pipeline stages rather than on the number of micro-batches, which further saves GPU memory and allows larger models to be trained.

PipeDream's workflow can be divided into four phases: Profile, Compute Partition, Convert Model, and Runtime.

  • Profile phase: infer per-layer DNN training times by profiling a small number of minibatches.
  • Compute Partition phase: the running time of every layer is determined from the profile results, and the partitioning is then optimized. The optimizer returns an annotated operator graph that maps each model layer to a stage ID.
  • Convert Model phase: the operator graph is traversed with BFS, generating separate torch.nn.Module code for each stage. PipeDream orders the operators in each stage so that they remain consistent with the input/output dependencies of the original PyTorch model graph.
  • Runtime phase: the PipeDream runtime assigns each stage (including replicas of replicated stages) to a single worker process according to its 1F1B-RR scheduling policy.

This article looks at PipeDream's overall idea, its architecture, and the Profile phase.

0x01 Overview

1.1 Review

As mentioned above, there are several necessary parallel techniques for distributed model training at present:

  • Pipeline parallelism, especially how to set up the pipeline automatically;
  • Gradient accumulation;
  • Activation recomputation in the backward pass (checkpointing);
  • The 1F1B strategy;

In previous articles we covered the first three techniques in the context of GPipe. Starting with this article, we look at how PipeDream, Microsoft's distributed DNN training system, implements pipeline parallelism and the 1F1B strategy.

1.2 Current Problems

DNN training is bidirectional: computation proceeds iteratively in forward and backward passes, and the two passes traverse the same layers in opposite orders. In each iteration, the training process consumes a minibatch of input data and updates the model parameters.

1.2.1 Data Parallelism

The most common way to parallelize DNN training is data parallelism, which distributes the input data across workers (each worker holds a full copy of the model). Unfortunately, although performance optimizations have made progress in speeding up data parallelism, training on cloud infrastructure can still incur significant communication overhead. Moreover, as GPUs get faster, the bottleneck of model training shifts further toward communication.

The figure below illustrates data parallelism; its hardware utilization is still too low, because each worker in data parallelism has to communicate and wait while gradients are exchanged.

1.2.2 Model parallelism

Model parallelism is another form of parallel training, in which the operators are spread across workers; it is typically used to train large DNN models. In this article, model parallelism means placing different layers of the model on different machines; it does not cover sharding the same layer across different machines.

The figure below shows model parallelism: a computation timeline with four machines forming a single pipeline.

  • In the forward direction, each stage performs the forward pass of a minibatch through the layers of that stage and sends the result to the next stage. The output stage computes the minibatch loss after its forward pass.
  • In the backward direction, each stage performs the backward pass and propagates the gradients to the previous stage, one stage at a time.

Only one minibatch is active in the system at any time, so at most one worker is busy at a given moment, which greatly limits hardware utilization.

Furthermore, model parallelism requires the programmer to decide how to split a particular model across the given hardware resources, which puts an extra burden on the programmer.

1.2.3 Gpipe

Besides these general issues, GPipe's pipeline-parallel strategy has a memory problem: multiple activations need to be cached.

If a batch is divided into N micro-batches, N sets of activations need to be cached, where N is the number of gradient-accumulation steps. To keep the pipeline as full as possible, N is usually large, typically more than twice the number of stages. So even if only a few tensors are cached per micro-batch, this strategy still requires a lot of GPU memory.
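As a rough, purely illustrative comparison (a minimal sketch with made-up numbers, not a measurement), the number of activation copies each stage keeps in flight under the two strategies can be counted as follows:

# Illustrative only: in-flight activation copies per stage for GPipe-style
# accumulation vs. 1F1B. The numbers are hypothetical.
num_stages = 4
num_microbatches = 8            # gradient-accumulation steps, >= 2 * num_stages

# GPipe: each stage keeps activations for all in-flight micro-batches until
# the backward passes of the whole batch begin.
gpipe_cached_per_stage = [num_microbatches] * num_stages

# 1F1B: stage s keeps at most (num_stages - s) activations, independent of
# how many micro-batches are accumulated.
one_f_one_b_cached_per_stage = [num_stages - s for s in range(num_stages)]

print(gpipe_cached_per_stage)        # [8, 8, 8, 8]
print(one_f_one_b_cached_per_stage)  # [4, 3, 2, 1]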

0x02 The Paper

PipeDream proposed the 1F1B improvement to solve these problems. PipeDream is the first system to combine pipeline parallelism, model parallelism, and data parallelism in an automated and general way. PipeDream first partitions the DNN with model parallelism and assigns a subset of layers to each worker. Unlike traditional model parallelism, however, PipeDream realizes the potential of pipeline parallelism by pipelining minibatches of data: at any time, different workers process different inputs, which keeps the pipeline fully loaded and the workers computing in parallel.

Microsoft describes PipeDream in detail in the paper PipeDream: Fast and Efficient Pipeline Parallel DNN Training, so our analysis is based on that paper.

2.1 Solution Overview

2.1.1 Parallel mode

The basic unit PipeDream works with is the layer. PipeDream divides the layers of the DNN into stages; each stage consists of a set of consecutive layers of the model.

PipeDream's main form of parallelism is to put different layers of the model into different stages and deploy different stages on different machines, which perform the forward and backward computations in sequence, forming a pipeline.

Each stage performs the forward and backward passes for all layers within that stage. PipeDream calls the stage containing the input layer the input stage and the stage containing the output layer the output stage. Each stage may also have a different replication factor, which is where data parallelism comes in.

For stages that use data parallelism, round-robin is used to distribute work across the devices, so it must be ensured that each minibatch is handled by the same machine in both the forward and the backward direction.

2.1.2 1F1B

The activations produced by a forward pass cannot be released until the corresponding backward pass completes (whether or not checkpointing is used). Therefore, to keep the number of cached activation copies as small as possible under pipeline parallelism, each activation should be released as early as possible, which means the backward pass of each micro-batch should finish as early as possible. This requires raising the priority of backward computation, so that the backward pass of an earlier (smaller-numbered) micro-batch runs before the forward pass of a later (larger-numbered) micro-batch. In particular, if the last stage starts the backward pass of a micro-batch immediately after finishing its forward pass, the other stages can also start their backward passes as early as possible. This is the 1F1B strategy.

Under 1F1B (one-forward-one-backward) scheduling, each worker machine alternates between forward and backward computations on minibatches, while ensuring that each minibatch is routed to the same worker for its backward pass as for its forward pass.

This scheme keeps a batch in flight on every GPU, so all workers stay busy without pipeline stalls and the whole pipeline stays fairly balanced. It also ensures that parameter updates on each stage happen at a fixed cadence, which prevents too many minibatches from being in flight at once and helps model convergence.
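To make the schedule concrete, here is a minimal sketch (not PipeDream's runtime code) that generates the order of forward (F) and backward (B) work items a single stage executes under 1F1B, for an assumed number of stages and micro-batches:

def one_f_one_b_schedule(stage_id, num_stages, num_microbatches):
    """Per-stage 1F1B ordering: a few warm-up forwards, then alternate one
    forward with one backward, then drain the remaining backwards."""
    schedule = []
    # Earlier stages run more warm-up forwards before their first backward.
    num_warmup = min(num_stages - stage_id - 1, num_microbatches)
    fwd = bwd = 0
    for _ in range(num_warmup):
        schedule.append(("F", fwd)); fwd += 1
    # Steady state: one forward followed by one backward.
    while fwd < num_microbatches:
        schedule.append(("F", fwd)); fwd += 1
        schedule.append(("B", bwd)); bwd += 1
    # Drain the outstanding backwards.
    while bwd < fwd:
        schedule.append(("B", bwd)); bwd += 1
    return schedule

for stage in range(4):
    print(stage, one_f_one_b_schedule(stage, num_stages=4, num_microbatches=8))

With 4 stages and 8 micro-batches, the last stage alternates F0, B0, F1, B1, ..., while stage 0 runs three warm-up forwards before its first backward.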

0x03 Pipeline

PipeDream’s Pipeline Parallelism (PP) is a new parallelization strategy that combines intra-batch parallelism with inter-batch parallelism.

3.1 Pipeline improvements

Let’s first look at pipelining’s improvement on model parallelism.

In the example in the previous section, if there was only one active minibatch, the system would have at most one GPU active at any given point in time.

But we want all GPUs to be active. Therefore, PipeDream injects multiple minibatches into the pipeline one by one, enhancing model-parallel training with pipelining. When a stage finishes the forward pass of a minibatch, it asynchronously sends the output activation to the next stage and simultaneously starts processing another minibatch. Similarly, after finishing the backward pass of a minibatch, each stage asynchronously sends gradients to the previous stage while computing on another minibatch.

Compared with plain inter-layer (model-parallel) training, pipelining has two main advantages:

  • Lower communication volume. Pipeline parallelism requires much less communication than data parallelism. Unlike data parallelism, which aggregates gradients for all parameters (using collective communication or a parameter server) and sends the result to all workers, each worker in pipeline parallelism only has to send a subset of gradients and output activations across one stage boundary to one other worker. For some models this yields a large reduction in traffic.
  • Overlap of computation and communication. Asynchronously communicating forward output activations and backward gradients across stages allows this communication to overlap with the computation of subsequent minibatches, because they operate on different inputs and have no dependency edges between them, so they can easily proceed in parallel. In the ideal steady state, all workers are running all the time, unlike model-parallel training where workers stop and wait.

Below is a pipeline with 1F1B. Machine 1 first computes blue 1 and sends blue 1 to Machine 2 to continue the computation; Machine 1 then computes blue 2. Only a subset of the model's data is transmitted between Machine 1 and Machine 2, and computation and communication can run in parallel.

3.2 Challenges

The goal of PipeDream is to combine pipeline parallelism, model parallelism, and data parallelism in a way that minimizes the total training time. However, to make this approach effective for large DNN models and reap the potential benefits of pipeline parallelization training, PipeDream must overcome several major challenges:

  • How to partition the pipeline efficiently. Like a processor pipeline, the DNN must be divided efficiently and correctly into several "stages" (sequences of layers), with each stage deployed for execution on a different worker.

    • Model characteristics and hardware topology affect efficiency, so partitioning must depend on both the model architecture and the hardware deployment. A bad partition (with widely skewed stage workloads) leads to long stretches of idle workers. Partitioning therefore needs to follow certain principles (communication and resource utilization): for example, layers that communicate with each other should be placed on adjacent processors; layers that operate on the same data structures should be placed on the same processor, while independent layers can be mapped to different processors. The allocation algorithm must take both model characteristics and hardware topology into account.
    • Excessive communication between machines reduces hardware efficiency.
    • How to schedule calculations to maximize throughput while ensuring that training tasks move forward.
  • How to prevent pipeline bottlenecks.

    • According to the bucket principle, in steady state the throughput of a pipeline is determined by the throughput of its slowest link. If the processing capacities of the links differ widely, idle time (known as bubbles) appears in the pipeline: the faster links have to stop and wait for the others, causing starvation and poor resource utilization. It is therefore necessary to ensure that all stages of the pipeline take roughly the same computation time; otherwise the slowest stage becomes the bottleneck of the entire pipeline.
  • How to schedule work between different input data to balance the pipeline.

    • Unlike a traditional one-directional pipeline, DNN training is bidirectional: forward propagation and backward propagation traverse the same layers in opposite orders. Coordinating this pipelined work is itself a problem.
  • How to ensure effective training in the face of asynchrony brought by pipelining.

    • One problem with pipelining is that many versions of the weights exist. If the backward pass uses a newer weight version than the forward pass did, the quality of the trained model degrades.
    • PipeDream solves this by managing weight versions: it keeps a version number for the weights used by each minibatch, so that the backward pass uses the same weight version as the forward pass and the gradient is computed correctly (we will explain this in a later article).

3.4 Pipeline partitioning algorithm

PipeDream automatically divides the layers of the DNN based on the results of a short profiling run, using an algorithm that balances the computational load across stages while minimizing communication. The overall goal of PipeDream's automatic partitioning algorithm is to output a balanced pipeline in which each stage performs roughly the same total amount of work, while keeping the amount of data communicated between stages as small as possible to avoid communication stalls. The algorithm works as follows:

  • The DNN layer is divided into stages so that each stage is completed at roughly the same rate, that is, it takes roughly the same amount of computing time.
  • Try to minimize communication between workers in a topology-aware manner (e.g., send larger outputs to higher-bandwidth links if possible).
  • Since the DNN cannot always be distributed evenly across the available workers, PipeDream allows a stage to be replicated, i.e. to use multiple workers in a data-parallel fashion, to further improve load balancing. In this way, several workers can be assigned to the same pipeline stage and process different minibatches in parallel, improving throughput. This policy is called 1F1B-RR (one-forward-one-backward-round-robin) because round-robin is used for the data-parallel replicas.

This partitioning problem is equivalent to minimizing the time taken by the slowest stage of the pipeline, and it has the optimal-subproblem property: a pipeline that maximizes throughput for a given worker count is composed of sub-pipelines, each of which maximizes throughput for a smaller worker count. PipeDream therefore uses dynamic programming to find the optimal solution.

The paper formalizes this with a dynamic-programming formulation; a simplified sketch of the idea is shown below.
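The following is a rough sketch of that idea, not PipeDream's optimizer: it splits a chain of profiled layer times into a given number of contiguous stages so that the slowest stage, including the time to ship its output activation to the next stage, is as fast as possible. Stage replication and hierarchical hardware topologies, which the real algorithm also handles, are ignored here.

import functools

def partition(layer_times, activation_sizes, num_stages, bandwidth):
    """Split a chain of layers into num_stages contiguous stages, minimizing
    the time of the slowest stage (its compute time plus the time to send its
    output activation to the next stage)."""
    n = len(layer_times)
    prefix = [0.0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)

    def comm_time(last_layer):
        # Activation transfer to the next stage; the final stage sends nothing.
        return activation_sizes[last_layer] / bandwidth if last_layer < n - 1 else 0.0

    @functools.lru_cache(maxsize=None)
    def best(j, s):
        # Best (slowest-stage time, cut points) for layers 0..j-1 in s stages.
        if s == 1:
            return prefix[j] + comm_time(j - 1), (j,)
        result = (float("inf"), None)
        for k in range(s - 1, j):   # k layers go to the first s - 1 stages
            sub_time, sub_cuts = best(k, s - 1)
            stage_time = prefix[j] - prefix[k] + comm_time(j - 1)
            candidate = max(sub_time, stage_time)
            if candidate < result[0]:
                result = (candidate, sub_cuts + (j,))
        return result

    return best(n, num_stages)

# Example with made-up per-layer times (ms) and activation sizes (bytes).
times = [1.0, 2.0, 2.0, 3.0, 1.0, 4.0, 1.0]
acts = [1e6] * 7
print(partition(times, acts, num_stages=3, bandwidth=1e9))
# -> (5.001, (3, 5, 7)): stages cover layers [0:3], [3:5] and [5:7]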

3.5 Profile

One characteristic of DNN training is that the computation time varies very little across inputs, and PipeDream takes full advantage of this fact. Given a DNN with N layers and M available machines, PipeDream first profiles the model on a single machine, recording for each layer the forward and backward computation times, the size of the layer's output, and the size of its parameters, and finally writes the results to a file.

The partitioning algorithm takes the profile result file as input, but also considers other constraints such as hardware topology and bandwidth, the number of workers, and the memory capacity of the compute devices, and finally divides the layers into multiple stages. It also determines the replication factor of each stage so as to minimize the model's total training time.

So the overall flow is roughly: profile the model on one machine, feed the results together with the hardware constraints to the partitioner, and use the resulting stage assignment at runtime.

Since PipeDream borrows a lot of ideas from GPipe, you can see how it improves over GPipe.

For example, GPipe balances the pipeline load by statically estimating the operation counts of each layer in code, while PipeDream profiles first and derives the partition from the measured behavior.

0x04 Profile stage

Profile is the first phase of PipeDream's workflow and the foundation of the partitioning algorithm. Based on the profiling results, PipeDream uses dynamic programming to partition the model into stages and to determine the number of replicas per stage.

This is where PipeDream improves on GPipe: both estimate the running time of each layer and then partition the model.

  • GPipe uses empirical or mathematical methods to estimate the running time.
  • PipeDream estimates run times based on the results of profiling.

PipeDream is more accurate and advanced because it has real data to back it up.

4.1 Approach

The estimation mechanism exploits the fact that computation and communication times change very little across iterations of DNN training, so DNN training time can be deduced from a profile over a small number of minibatches. To determine the running time of all layers, PipeDream profiles a short (a few minutes) run of the DNN model on one machine using 1000 minibatches.
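PipeDream collects these per-layer timings through a modified torchsummary (shown in the code section below). Purely as an illustration of the idea, a minimal sketch that measures average per-layer forward times with ordinary PyTorch forward hooks could look like this (backward times, which PipeDream also records, would need additional hooks):

import time
import torch
import torch.nn as nn

def profile_forward_times(model, sample_input, steps=100):
    """Measure the average forward time of each leaf module using forward
    hooks. This is only an illustration, not PipeDream's profiler."""
    timings = {}

    def sync():
        if torch.cuda.is_available():
            torch.cuda.synchronize()

    def pre_hook(module, inputs):
        sync()
        module._start_time = time.time()

    def post_hook(module, inputs, outputs):
        sync()
        timings[module] = timings.get(module, 0.0) + (time.time() - module._start_time)

    handles = []
    for m in model.modules():
        if len(list(m.children())) == 0:          # leaf modules only
            handles.append(m.register_forward_pre_hook(pre_hook))
            handles.append(m.register_forward_hook(post_hook))

    with torch.no_grad():
        for _ in range(steps):
            model(sample_input)

    for h in handles:
        h.remove()
    return {m: t / steps for m, t in timings.items()}

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
print(profile_forward_times(model, torch.randn(32, 1024)))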

4.1.1 How to Calculate

Running time

The running time of each layer is obtained as: running time = computation time + communication time.

  • The calculation time is the forward and backward calculation time of each layer, which can be obtained from the profile.
  • The communication time needs to be estimated according to the size of the model. PipeDream estimated the communication time as “the amount of data to be transmitted” divided by “the bandwidth on the communication link”.

Communication time

On the pipeline, most communication takes place in three steps:

1) Transfer mobile data from GPU to CPU on the sending end machine.

2) Send data from sender to receiver over the network.

3) At the receiving end, move data from CPU to GPU.

Sending data from sender to receiver over the network is the most time-consuming, so PipeDream’s main consideration is this. If this factor is further subdivided, then:

  • PipeDream estimates the time to transfer the output activations from layer i to layer i+1 based on the activation size.

  • If data parallelism is configured (m workers replicate layer i), the weight-synchronization time is estimated from the weight size (a small sketch combining these estimates follows this list):

    • If a distributed parameter server is used, the amount of weight data to synchronize is estimated as 4 x (m - 1) x |w_i| / m.
    • If all_reduce is used, each worker sends (m - 1) x |w_i| / m bytes to the other workers and receives the same amount.
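The sketch below (my own function and variable names, not PipeDream's code) shows how these pieces could be combined into a single per-layer time estimate, using the formulas above and computing each communication time as bytes transferred divided by link bandwidth:

def estimate_layer_time(compute_time, activation_bytes, weight_bytes,
                        bandwidth, num_replicas=1, use_parameter_server=True):
    """Per-layer time estimate: profiled compute time plus communication time,
    where each communication time = bytes to transfer / link bandwidth."""
    # Activation (forward) / gradient (backward) transfer to the next stage.
    activation_comm = activation_bytes / bandwidth

    # Weight synchronization when the layer is replicated over m workers.
    m = num_replicas
    if m > 1:
        if use_parameter_server:
            sync_bytes = 4.0 * (m - 1) * weight_bytes / m   # formula above
        else:
            # all_reduce: each worker sends (m - 1) * |w_i| / m bytes and
            # receives the same amount; both directions are counted here.
            sync_bytes = 2.0 * (m - 1) * weight_bytes / m
        weight_comm = sync_bytes / bandwidth
    else:
        weight_comm = 0.0

    return compute_time + activation_comm + weight_comm

# Example using the Dropout layer profiled in section 4.3 (Ti = 0.064 + 0.128 ms,
# Ai = 6291456 bytes, no weights) over a hypothetical 10 GB/s link.
bytes_per_ms = 10e9 / 1000.0
print(estimate_layer_time(0.192, 6291456.0, 0.0, bandwidth=bytes_per_ms))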

4.1.2 Profile content

To sum up, PipeDream records three quantities for each layer i in the profile:

  • Ti, the sum of the forward and backward computation times of layer i on the GPU, i.e. the per-layer compute time;
  • Ai, the size of layer i's output activations (and of the input gradients in the backward pass) in bytes, i.e. the per-layer output size;
  • Wi, the size of layer i's weight parameters in bytes, i.e. the per-layer parameter size.
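These three quantities end up as the fields of each node line in the graph.txt file shown in section 4.3. As a small illustration (a hypothetical helper, assuming the "nodeX -- description -- key=value, ..." format of that file), a per-layer record can be read back like this:

def parse_profile_line(line):
    """Parse one 'nodeX -- desc -- k=v, ...' node line from graph.txt and
    return (node id, description, Ti, Ai, Wi)."""
    node_id, node_desc, attrs = [p.strip() for p in line.split(" -- ", 2)]
    fields = dict(item.split("=", 1) for item in attrs.split(", "))
    t_i = float(fields["forward_compute_time"]) + float(fields["backward_compute_time"])
    a_i = fields["activation_size"]   # a single size, or "[a; b; c]" for multi-output layers
    w_i = float(fields["parameter_size"])
    return node_id, node_desc, t_i, a_i, w_i

line = ("node10 -- Dropout(p=0.2) -- forward_compute_time=0.064, "
        "backward_compute_time=0.128, activation_size=6291456.0, parameter_size=0.000")
print(parse_profile_line(line))   # node10 has Ti ~= 0.192, Ai = 6291456.0 bytes, Wi = 0.0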

4.2 Code

Different models or domains have different profiles.

Our analysis uses profiler/translation/train.py as the entry point.

4.2.1 Training Script

We omit the extraneous code below.

4.2.1.1 Training Process
class Seq2SeqTrainer:
    def feed_data(self, data_loader, training=True):
        """
        Runs training or validation on batches from data_loader.

        :param data_loader: data loader
        :param training: if True runs training, else runs validation
        """
        # Modules that the profiler treats as single (un-split) layers.
        module_whitelist = ["EmuBidirLSTM", "RecurrentAttention", "Classifier"]

        # Grab one sample batch.
        for i, (src, tgt) in enumerate(data_loader):
            break
        (src, src_length) = src
        (tgt, tgt_length) = tgt
        src_length = torch.LongTensor(src_length).cuda()
        src = src.cuda()
        tgt = tgt.cuda()
        model_input = (src, src_length, tgt[:-1])

        # Run torchsummary to collect per-layer compute times, output shapes
        # and parameter counts.
        summary = torchsummary.summary(model=self.model,
                                       module_whitelist=module_whitelist,
                                       model_input=model_input, verbose=True)

        # ... the regular training / validation loop is omitted here ...

        # Train model
        self.model.train()
        self.preallocate(data_loader, training=True)

        # Build the model DAG and persist it together with the profile results.
        create_graph(self.model, module_whitelist, (src, tgt), summary,
                     os.path.join("profiles", self.arch))
4.2.1.2 Calculation Parameters

In the training script shown in the previous section, torchsummary is used to compute the network's per-layer parameter counts and related information. A standalone example of torchsummary follows:

import torch
import torch.nn as nn
from torchsummary import summary

class SimpleConv(nn.Module):
    def __init__(self):
        super(SimpleConv, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
        )

    def forward(self, x, y):
        x1 = self.features(x)
        x2 = self.features(y)
        return x1, x2

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleConv().to(device)
summary(model, [(1, 16, 16), (1, 28, 28)])

It is printed as follows:

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1             [-1, 1, 16, 16]              10
              ReLU-2             [-1, 1, 16, 16]               0
            Conv2d-3             [-1, 1, 28, 28]              10
              ReLU-4             [-1, 1, 28, 28]               0
================================================================
Total params: 20
Trainable params: 20
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.77
Forward/backward pass size (MB): 0.02
Params size (MB): 0.00
Estimated Total Size (MB): 0.78
----------------------------------------------------------------
4.2.1.3 Creating the graph

create_graph uses torchgraph.GraphCreator to build a graph that can be interpreted as the DAG inside the model. Each node records information such as the following.

node10 -- Dropout(p=0.2) -- forward_compute_time=0.064, backward_compute_time=0.128, activation_size=6291456.0, parameter_size=0.000

The specific code is as follows:

def create_graph(model, module_whitelist, model_input, summary, directory):
    """Given a model, creates and visualizes the computation DAG
       of the model in the passed-in directory."""
    # Create the graph creator.
    graph_creator = torchgraph.GraphCreator(model, summary, module_whitelist)
    # Install the forward-wrapper hooks.
    graph_creator.hook_modules(model, root=True)

    (src, tgt) = model_input
    (src, src_length) = src
    (tgt, tgt_length) = tgt
    src_length = torch.LongTensor(src_length).cuda()
    src = src.cuda()
    tgt = tgt.cuda()

    # Run one forward pass to record the DAG.
    model(src, src_length, tgt[:-1])

    # Remove the hooks and persist the graph to the given directory.
    graph_creator.unhook_modules()
    graph_creator.persist_graph(directory)

4.2.2 Building the graph

Creating a graph is basically done in GraphCreator.

class GraphCreator(object):
    def __init__(self, model, summary, module_whitelist):
        if isinstance(model, torch.nn.Module) is False:
            raise Exception("Not a valid model, please provide a 'nn.Module' instance.")
​
        self.model = model
        self.module_whitelist = module_whitelist
        self.summary = copy.deepcopy(summary)
        self.forward_original_methods = {}
        self.graph = graph.Graph()
        self.inputs = {}
4.2.2.1 Setting up the wrapper

hook_modules installs a wrapper around the model's forward function and recursively does the same for its submodules, so that the relationships between modules can be tracked as the model runs.

def hook_modules(self, module, root=False):
    this_creator = self
    sub_modules = module.__dict__['_modules']

    # Wrapper function to "forward()", keeping track of dependencies.
    def forward_wrapper(self, *wrapped_inputs):
        input = []
        wrapped_inputs_list = list(wrapped_inputs)
        for i in range(len(wrapped_inputs_list)):
            if isinstance(wrapped_inputs_list[i], TensorWrapper):
                # Already wrapped: just unwrap the underlying tensor.
                input.append(wrapped_inputs_list[i].tensor)
            else:
                key = wrapped_inputs_list[i]
                if key in this_creator.inputs:
                    # Re-use the wrapper created for this graph input earlier.
                    wrapped_inputs_list[i] = this_creator.inputs[key]
                else:
                    # First time this graph input is seen: wrap it as "Input%d".
                    j = len(this_creator.inputs)
                    wrapped_inputs_list[i] = TensorWrapper(wrapped_inputs_list[i],
                                                           "Input%d" % j, this_creator)
                    this_creator.inputs[key] = wrapped_inputs_list[i]
                input.append(wrapped_inputs_list[i].tensor)
        # Call the original (un-wrapped) forward method.
        result = this_creator.forward_original_methods[self](*input)
        # Wrap the result, which creates a graph node for this module.
        wrapped_result = TensorWrapper(result, str(self), this_creator)
        # Add an edge from every input node to the output node.
        for wrapped_input in wrapped_inputs_list:
            this_creator.graph.add_edge(wrapped_input.node(), wrapped_result.node())
        return wrapped_result

    # Wrapper for the root module's "forward()": it unwraps the inputs but
    # does not wrap the result or add edges.
    def forward_wrapper_root(self, *wrapped_inputs):
        input = []
        wrapped_inputs_list = list(wrapped_inputs)
        for i in range(len(wrapped_inputs_list)):
            if isinstance(wrapped_inputs_list[i], TensorWrapper):
                input.append(wrapped_inputs_list[i].tensor)
            else:
                key = wrapped_inputs_list[i]
                if key in this_creator.inputs:
                    wrapped_inputs_list[i] = this_creator.inputs[key]
                else:
                    j = len(this_creator.inputs)
                    wrapped_inputs_list[i] = TensorWrapper(wrapped_inputs_list[i],
                                                           "Input%d" % j, this_creator)
                    this_creator.inputs[key] = wrapped_inputs_list[i]
                input.append(wrapped_inputs_list[i].tensor)
        result = this_creator.forward_original_methods[self](*input)
        return result

    # Recursively install the wrappers on sub-modules.
    for name, sub_module in sub_modules.items():
        # nn.Module is the only thing we care about.
        if sub_module is None or isinstance(sub_module, torch.nn.Module) is False:
            break

        sub_module_name = sub_module.__class__.__name__
        sub_sub_modules = sub_module.__dict__['_modules']
        if len(sub_sub_modules) == 0 or sub_module_name in self.module_whitelist:
            sub_module.reset_hooks()
            #
            # Hook nn.Module with no descendants, or modules in the whitelist.
            # Replace "forward" with "wrapped_forward".
            #
            if sub_module not in this_creator.forward_original_methods:
                this_creator.forward_original_methods.update({sub_module:
                                                              sub_module.forward})
                sub_module.forward = forward_wrapper.__get__(sub_module,
                                                             sub_module.__class__)

        if len(sub_sub_modules) > 0 and sub_module_name not in self.module_whitelist:
            #
            # Recursively visit this module's descendants.
            #
            self.hook_modules(sub_module)

    if root:
        this_creator.forward_original_methods.update({module: module.forward})
        module.forward = forward_wrapper_root.__get__(module, module.__class__)
4.2.2.2 TensorWrapper

TensorWrapper implements the wrapper functionality; graph_creator.summary is the per-layer information obtained from torchsummary.summary. As shown below, the class looks up the matching entry in the summary, reads forward_compute_time and the other fields, and builds a graph node.

Note that activation_sizes is computed from output_shape.

class TensorWrapper(object):
    def __init__(self, tensor, node_desc, graph_creator, activation_size=None):
        self.tensor = tensor
        global object_id
        self.object_id = object_id
        object_id += 1
        self.node_desc = node_desc

        # Find the torchsummary entry that matches this module.
        i = 0
        for i in range(len(graph_creator.summary)):
            if str(graph_creator.summary[i]['layer_name']) == node_desc:
                break

        if i < len(graph_creator.summary) and \
                node_desc == str(graph_creator.summary[i]['layer_name']):
            summary_elem = graph_creator.summary.pop(i)
            forward_compute_time = summary_elem['forward_time']
            backward_compute_time = summary_elem['backward_time']
            # activation_sizes is derived from output_shape
            # (4 bytes per float element).
            if isinstance(summary_elem['output_shape'][0], list):
                activation_sizes = [4.0 * functools.reduce(lambda x, y: x * y, elem)
                                    for elem in summary_elem['output_shape']]
            else:
                activation_sizes = 4.0 * functools.reduce(
                    lambda x, y: x * y, summary_elem['output_shape'])
            parameter_size = 4.0 * float(summary_elem['nb_params'])
            self._node = graph.Node("node%d" % object_id, node_desc=node_desc,
                                    forward_compute_time=forward_compute_time,
                                    backward_compute_time=backward_compute_time,
                                    activation_size=activation_sizes,
                                    parameter_size=parameter_size)
        elif activation_size is not None:
            self._node = graph.Node("node%d" % object_id, node_desc=node_desc,
                                    activation_size=activation_size)
        else:
            self._node = graph.Node("node%d" % object_id, node_desc=node_desc)

        self.graph_creator = graph_creator

Some built-in methods are also handled, such as the following.

def __iadd__(self, other):
    self_activation_size = self.node().activation_size
    other_activation_size = other.node().activation_size
    assert(self_activation_size == other_activation_size)
    wrapped_result = TensorWrapper(self.tensor, "Add(inplace)", self.graph_creator,
                                   activation_size=self_activation_size)
    self.tensor += other.tensor
    self.graph_creator.graph.add_edge(self._node, wrapped_result.node())
    self.graph_creator.graph.add_edge(other.node(), wrapped_result.node())
    return wrapped_result

This ultimately corresponds to a node such as:

node58 -- Add(inplace) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=102760448.000, parameter_size=0.000

4.2.2.3 Persistence

persist_graph writes the profiling results to files.

def persist_graph(self, directory):
    self.graph.to_dot(os.path.join(directory, "graph.dot"))
    with open(os.path.join(directory, "graph.txt"), 'w') as f:
        f.write(str(self.graph))
    self.graph.render_bar_graphs_and_cdfs(directory)

The to_dot function is extracted as follows:

def to_dot(self, arch):
    dot = graphviz.Digraph()
    for node in self.nodes.values():
        node_desc = "%s\n[forward_compute_time=%.3f,backward_compute_time=%.3f,activation_size=%s,parameter_size=%.1f]" % (
            node.node_desc, node.forward_compute_time, node.backward_compute_time,
            node.activation_size, node.parameter_size)
        if node.stage_id is not None:
            color = self._colors[node.stage_id % len(self._colors)]
            dot.node(node.node_id, node_desc,
               color=color, style='filled')
        else:
            dot.node(node.node_id, node_desc)
    for node in self.nodes.values():
        if node.node_id not in self.edges:
            continue
        for out_node in self.edges[node.node_id]:
            dot.edge(node.node_id, out_node.node_id)
    dot.render(arch)

4.3 Results

We use a result file from the source code, profiler/translation/profiles/gnmt/graph.txt, to show what the output looks like.

node1 -- Input0 -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=0.0, parameter_size=0.000
node4 -- Embedding(32320, 1024, padding_idx=0) -- forward_compute_time=0.073, backward_compute_time=6.949, activation_size=6291456.0, parameter_size=132382720.000
node5 -- EmuBidirLSTM(  (bidir): LSTM(1024, 1024, bidirectional=True)  (layer1): LSTM(1024, 1024)  (layer2): LSTM(1024, 1024)) -- forward_compute_time=5.247, backward_compute_time=0.016, activation_size=12582912.0, parameter_size=67174400.000
node2 -- Input1 -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=0.0, parameter_size=0.000
node6 -- Dropout(p=0.2) -- forward_compute_time=0.077, backward_compute_time=0.196, activation_size=12582912.0, parameter_size=0.000
node7 -- LSTM(2048, 1024) -- forward_compute_time=3.190, backward_compute_time=5.348, activation_size=[6291456.0; 131072.0; 131072.0], parameter_size=50364416.000
node8 -- __getitem__(0) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=6291456.0, parameter_size=0.000
node9 -- __getitem__(1) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=131072.0, parameter_size=0.000
node10 -- Dropout(p=0.2) -- forward_compute_time=0.064, backward_compute_time=0.128, activation_size=6291456.0, parameter_size=0.000
node11 -- LSTM(1024, 1024) -- forward_compute_time=2.491, backward_compute_time=4.203, activation_size=[6291456.0; 131072.0; 131072.0], parameter_size=33587200.000
node12 -- __getitem__(0) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=6291456.0, parameter_size=0.000
node13 -- __getitem__(1) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=131072.0, parameter_size=0.000
node14 -- Add -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=6291456.0, parameter_size=0.000
node15 -- Dropout(p=0.2) -- forward_compute_time=0.059, backward_compute_time=0.121, activation_size=6291456.0, parameter_size=0.000
node16 -- LSTM(1024, 1024) -- forward_compute_time=2.492, backward_compute_time=4.201, activation_size=[6291456.0; 131072.0; 131072.0], parameter_size=33587200.000
node17 -- __getitem__(0) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=6291456.0, parameter_size=0.000
node18 -- __getitem__(1) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=131072.0, parameter_size=0.000
node19 -- Add -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=6291456.0, parameter_size=0.000
node3 -- Input2 -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=0.0, parameter_size=0.000
node21 -- Embedding(32320, 1024, padding_idx=0) -- forward_compute_time=0.066, backward_compute_time=0.328, activation_size=6291456.0, parameter_size=132382720.000
node20 -- hidden -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=0.0, parameter_size=0.000
node22 -- __getitem__(0) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=0.0, parameter_size=0.000
node23 -- RecurrentAttention(  (rnn): LSTM(1024, 1024)  (attn): BahdanauAttention(    (linear_q): Linear(in_features=1024, out_features=1024, bias=False)    (linear_k): Linear(in_features=1024, out_features=1024, bias=False)    (dropout): Dropout(p=0)  )  (dropout): Dropout(p=0)) -- forward_compute_time=4.546, backward_compute_time=6.141, activation_size=[6160384.0; 131072.0; 131072.0; 6160384.0; 288768.0], parameter_size=41979904.000
node24 -- __getitem__(0) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=6160384.0, parameter_size=0.000
node25 -- __getitem__(1) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=131072.0, parameter_size=0.000
node26 -- __getitem__(2) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=131072.0, parameter_size=0.000
node27 -- __getitem__(3) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=6160384.0, parameter_size=0.000
node28 -- Dropout(p=0.2) -- forward_compute_time=0.058, backward_compute_time=0.176, activation_size=6160384.0, parameter_size=0.000
node29 -- Concat(2) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=6160384.0, parameter_size=0.000
node30 -- __getitem__(1) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=0.0, parameter_size=0.000
node31 -- LSTM(2048, 1024) -- forward_compute_time=3.151, backward_compute_time=5.288, activation_size=[6160384.0; 131072.0; 131072.0], parameter_size=50364416.000
node32 -- __getitem__(0) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=6160384.0, parameter_size=0.000
node33 -- __getitem__(1) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=131072.0, parameter_size=0.000
node34 -- Dropout(p=0.2) -- forward_compute_time=0.061, backward_compute_time=0.174, activation_size=6160384.0, parameter_size=0.000
node35 -- Concat(2) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=6160384.0, parameter_size=0.000
node36 -- __getitem__(2) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=0.0, parameter_size=0.000
node37 -- LSTM(2048, 1024) -- forward_compute_time=3.145, backward_compute_time=5.306, activation_size=[6160384.0; 131072.0; 131072.0], parameter_size=50364416.000
node38 -- __getitem__(0) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=6160384.0, parameter_size=0.000
node39 -- __getitem__(1) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=131072.0, parameter_size=0.000
node40 -- Add -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=6160384.0, parameter_size=0.000
node41 -- Dropout(p=0.2) -- forward_compute_time=0.055, backward_compute_time=0.198, activation_size=6160384.0, parameter_size=0.000
node42 -- Concat(2) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=6160384.0, parameter_size=0.000
node43 -- __getitem__(3) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=0.0, parameter_size=0.000
node44 -- LSTM(2048, 1024) -- forward_compute_time=3.149, backward_compute_time=15.883, activation_size=[6160384.0; 131072.0; 131072.0], parameter_size=50364416.000
node45 -- __getitem__(0) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=6160384.0, parameter_size=0.000
node46 -- __getitem__(1) -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=131072.0, parameter_size=0.000
node47 -- Add -- forward_compute_time=0.000, backward_compute_time=0.000, activation_size=6160384.0, parameter_size=0.000
node48 -- Classifier(  (classifier): Linear(in_features=1024, out_features=32320, bias=True)) -- forward_compute_time=5.609, backward_compute_time=1.227, activation_size=194437120.0, parameter_size=132512000.000
   node1 -- node4
   node4 -- node5
   node2 -- node5
   node5 -- node6
   node6 -- node7
   node7 -- node8
   node7 -- node9
   node8 -- node10
   node10 -- node11
   node11 -- node12
   node11 -- node13
   node12 -- node14
   node8 -- node14
   node14 -- node15
   node15 -- node16
   node16 -- node17
   node16 -- node18
   node17 -- node19
   node14 -- node19
   node3 -- node21
   node20 -- node22
   node21 -- node23
   node22 -- node23
   node19 -- node23
   node2 -- node23
   node23 -- node24
   node23 -- node25
   node23 -- node26
   node23 -- node27
   node24 -- node28
   node28 -- node29
   node26 -- node29
   node20 -- node30
   node29 -- node31
   node30 -- node31
   node31 -- node32
   node31 -- node33
   node32 -- node34
   node34 -- node35
   node26 -- node35
   node20 -- node36
   node35 -- node37
   node36 -- node37
   node37 -- node38
   node37 -- node39
   node38 -- node40
   node32 -- node40
   node40 -- node41
   node41 -- node42
   node26 -- node42
   node20 -- node43
   node42 -- node44
   node43 -- node44
   node44 -- node45
   node44 -- node46
   node45 -- node47
   node40 -- node47
   node47 -- node48

At this point we know what the Profile stage does: run the training script, compute per-layer statistics from the run, build the DAG inside the model, and persist the statistics and the DAG to a file whose contents will be used by the subsequent stages.

In the next article we will look at how the automatic partitioning is computed.

0xEE Personal information

★★★★ Thoughts on life and technology ★★★★★

Wechat official account: Rosie’s Thoughts

0xFF Reference

www.microsoft.com/en-us/resea…

Lingvo framework day reading notes

Tensorflow implements the accumulation of multiple minibatch-computed gradients before propagating them back

Gradient accumulation is realized by tensorflow2

Tenfold model calculation time increased by only 20% : OpenAI open source gradient replacement plugin

PipeDream: Fast and Efficient Pipeline Parallel DNN Training

Paper interpretation series 5: Microsoft Stanford and other PipeDream fast training large-scale neural network

cs231n.github.io/neural-netw…

www.cnblogs.com/geekfx/p/14…

Video memory optimization techniques during training — OP merge and Gradient Checkpoint

Pytorch Note 04 - Customizing torch.autograd.Function

PyTorch tutorial Autograd

A simple definition and case study of torch.autograd.Function

Pytorch custom extensions (2): torch.autograd.Function completes a custom layer

torch.autograd: gradient computation in detail

Back Propagation

CS231n Course Notes Translation: Backpropagation notes

Maximal antichain of poset

Topological Sorting