The improvements in Profiler V1.9 focus on the execution steps that are the most costly in runtime and/or memory, and visualize the workload distribution between GPUs and CPUs.
Profiler V1.9 has five new features:
1. Distributed training view: This helps you keep track of the time and memory consumed in distributed training jobs. When you split a training job across multiple worker nodes to run in parallel, all kinds of problems can arise, and the process can feel like a black box. The overall goal is to improve training speed; this view helps you diagnose and debug problems down to the level of a single node.
2. Memory view: With this view, you can better understand memory usage. It helps you avoid Out of Memory errors by showing how much active memory is allocated during different phases of your application.
3. GPU utilization visualization: This tool helps you make sure the GPU is being fully utilized.
4. Cloud storage support: The TensorBoard plugin can now read profiling data from Azure Blob Storage, Amazon S3, and Google Cloud Platform.
5. Jump to source code: This feature visualizes stack trace information and lets you jump directly to the source code. This helps you quickly optimize and iterate on your code based on the profiling results.
PyTorch Profiler Colab portal
Chinese version of the Colab portal
Colab content overview:

- Prepare the data and model
- Record execution events with the profiler
- Run the profiler
- Use TensorBoard to view the results and analyze model performance
- Use the profiler to improve performance
- Analyze performance with other advanced features
Start using the PyTorch Profiling tool
First, install the TensorBoard plugin:

$ pip install torch-tb-profiler

Then import and use the profiler in your code:

import torch.profiler as profiler
with profiler.profile(...) as prof:
    ...
Note: For CUDA and CPU analysis, see Here
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA],
) as prof:
    ...
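Putting the pieces together, here is a minimal sketch (not the article's exact code) of profiling a few training steps and writing the trace where the TensorBoard plugin can read it. The model, loader, and train_step names are placeholders for your own training code, and the schedule values are only an example:

import torch
import torch.profiler

# loader and train_step are placeholders for your own data pipeline and training step
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet50'),
    record_shapes=True,
    with_stack=True
) as prof:
    for step, batch in enumerate(loader):
        train_step(batch)
        prof.step()  # tell the profiler that one training step has finished

You can then point TensorBoard at the same log directory (tensorboard --logdir=./log) to view the results.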
- profiler.record_function("$NAME"): lets you attach a user-defined label to a block of code (as a decorator or context manager), so that block shows up under that name in the results.
- profile_memory=True in profiler.profile enables analysis of CPU and GPU memory usage. Both options are combined in the sketch after this list.
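As an illustration, here is a minimal sketch combining both options; the model, input, and label names are throwaway placeholders, not code from the article:

import torch
import torch.profiler as profiler

model = torch.nn.Linear(128, 10)   # placeholder model
inputs = torch.randn(32, 128)      # placeholder input batch

with profiler.profile(
    activities=[profiler.ProfilerActivity.CPU],
    profile_memory=True              # also record memory usage per operator
) as prof:
    with profiler.record_function("forward_pass"):      # custom label in the trace
        out = model(inputs)
    with profiler.record_function("loss_and_backward"):  # another labeled region
        loss = out.sum()
        loss.backward()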
Visualize PyTorch model performance
Distributed training
Recent advances in deep learning demonstrate the value of large data sets and large models, which also means that model training requires more computational resources.
Distributed Data Parallel (DDP) and the NVIDIA Collective Communications Library (NCCL) are widely used in PyTorch to accelerate deep learning training.
This version of PyTorch Profiler now supports DDP with the NCCL backend.
Computation/communication overview
In the computation/communication overview of the distributed training view, you can observe the computation-to-communication ratio of each worker and the load balance among workers, measured at step granularity.
Load balancer link: Here
Scenario 1:
If one worker's computation plus overlap time is much longer than the other workers', it may indicate a problem with workload balancing, or that one of the nodes is a straggler. Computation is the sum of GPU kernel time minus the overlap time. Overlap time is the time saved by interleaving communication with computation.
A longer overlap time indicates better parallelism between computation and communication; ideally, computation and communication overlap completely. Communication is the total communication time minus the overlap time.
The following example shows this in TensorBoard (straggler example).
Scenario 2:
If the batch size is small (i.e., there is relatively little computation on each worker) or the data that needs to be transferred is large, the computation-to-communication ratio may also be small; in the Profiler this shows up as low GPU utilization and long waiting times.
Based on this computation/communication view, you can review your code and reduce communication by using gradient accumulation, or decrease the relative cost of communication by increasing the batch size. DDP communication time depends on the model size, while batch size is independent of model size. Therefore, increasing the batch size leads to longer computation time and a larger computation-to-communication ratio.
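One way to act on this advice is gradient accumulation, which skips gradient synchronization on all but the last micro-batch of each accumulation window. Below is a minimal sketch under the assumption that the model is already wrapped in torch.nn.parallel.DistributedDataParallel; loader, criterion, optimizer, and accum_steps are placeholders:

import contextlib

accum_steps = 4  # accumulate gradients over 4 micro-batches before syncing

optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    # run DDP's gradient all-reduce only on the last micro-batch of each window
    sync = ((i + 1) % accum_steps == 0)
    context = contextlib.nullcontext() if sync else model.no_sync()
    with context:
        loss = criterion(model(x), y) / accum_steps
        loss.backward()
    if sync:
        optimizer.step()
        optimizer.zero_grad()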
Synchronization/communication overview
In the synchronization/communication view, you can assess communication efficiency. This is computed as step time minus computation and communication time. Synchronization time is the part of the total communication time spent waiting for and synchronizing with other workers. The remaining ("other") time includes initialization, the data loader, CPU computation, and so on.
From this view, you can see what proportion of the total communication time is actually used for exchanging data, and how much is idle time spent waiting for data from other workers.
For example, if there is a workload-balancing problem or a straggler, you can find it in the synchronization/communication view: this view will show that some workers wait longer than others.
The table in this view provides detailed statistics for all communication operators on each node. It tells you which operator types are called, how many times each operator is called, how much data each operator transfers, and so on.
Memory view
Using this tool, you can understand the hardware resources consumed by the operators in your model. Understanding time and memory consumption at the operator level helps you resolve performance bottlenecks and speed up your model. Given the limited size of GPU memory, improving memory-usage efficiency helps you:
- Run larger models, which can perform better on end-level tasks.
- Use larger batch sizes, which can improve training speed.
The profiler records all memory allocations during the profiling interval. Select a Device to view memory-usage details for each operator on the GPU or host side.
Note: profile_memory=True must be set to generate the memory data below.
Related links: Here
profiler.profile(profile_memory=True, ...)  # note: this can take 1-2 minutes to complete
Important definitions:

- "Size Increase" shows the sum of all allocated bytes minus all freed bytes.
- "Allocation Size" shows the total of all allocated bytes, not counting frees.
- "Self" means the allocated memory does not come from any child operator; it is allocated by the operator itself.
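Besides the memory view in TensorBoard, the same per-operator memory statistics can be inspected directly in Python. A minimal sketch, assuming prof is a profile collected with profile_memory=True as shown earlier; the sort key is one of the profiler's built-in table columns:

# prof comes from a torch.profiler.profile(...) run with profile_memory=True
print(prof.key_averages().table(
    sort_by="self_cuda_memory_usage",  # or "self_cpu_memory_usage" for host memory
    row_limit=10))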
GPU metrics on the timeline
With this feature, you can easily debug performance issues when one or more GPUs are underutilized. Ideally, your program should have high GPU utilization (as close to 100% as possible), minimal CPU-to-GPU communication cost, and no unnecessary overhead.
Overview: The overview page highlights the results of three important GPU usage metrics (i.e., GPU Utilization, Est. SM Efficiency, and Est. Achieved Occupancy) at different levels.
Essentially, each GPU has many streaming multiprocessors (SMs), each SM has many warps, and each warp can execute many threads simultaneously; how many threads a warp executes depends on the GPU. From a higher-level perspective, GPU metrics on the timeline help developers get a holistic view of the stack, which is very important.
If GPU utilization is low, it indicates a potential problem with the model. Common causes are:

- Insufficient parallelism in the kernels, i.e., the batch size is too small
- Small kernels called in a loop, i.e., launch overhead is not amortized
- CPU or I/O bottlenecks that leave the GPU without enough work, resulting in low GPU usage
On the overview page, the performance recommendation section lists possible suggestions for improving GPU utilization. In this example, GPU utilization is low, so the performance recommendation is to increase the batch size. Following that recommendation and increasing the batch size from 4 to 32 raised GPU utilization by 60.68%.
GPU Utilization: the proportion of the step interval during which a GPU engine is executing a workload. The higher the percentage, the better. Judging performance bottlenecks by GPU utilization alone is not accurate, however: it cannot tell you how many streaming multiprocessors are in use.
Note that while this metric is useful for detecting idle periods, a high value does not by itself indicate that the GPU is being used effectively. For example, a kernel with a single thread running continuously will show 100% GPU utilization.
Estimated SM Efficiency (Est. SM Efficiency) is a finer-grained metric. It indicates what percentage of SMs are in use over the course of the trace, i.e., the percentage of time during which there is at least one active warp on an SM, versus SMs that are idle.
NVIDIA documentation: Here
Est. SM Efficiency also has limitations. For example, a kernel with only one thread per block cannot fully use every SM. SM Efficiency alone does not tell you how busy each SM is, only that each SM is doing something at all, which can include stalls while waiting for the result of a memory load.
To keep an SM busy, there must be enough ready warps that can run whenever a stall occurs.
Est. Achieved Occupancy is a more accurate metric for performance diagnosis than Est. SM Efficiency and GPU Utilization. Estimated achieved occupancy indicates how many warps can be active at once on each SM. Having a sufficient number of active warps is usually the key to good throughput. Unlike GPU Utilization and SM Efficiency, making this value as high as possible is not the ultimate goal.
As a rule of thumb, good throughput gains can be achieved by raising this metric to 15% or more, but at some point you hit diminishing returns; for example, if the value has already reached 30%, further gains become uncertain. This metric reports the average over all warp schedulers during kernel execution.
NVIDIA documentation: Here
For Est. Achieved Occupancy, larger values are better.
Details: Resnet50_batchsize4
Details: Resnet50_batchsize32
Kernel view: each kernel has "Blocks per SM" and "Est. Achieved Occupancy" columns.
Est. Achieved Occupancy is a useful tool for comparing model health.
Mean Blocks per SM:
Blocks per SM = number of blocks of the kernel / number of SMs on the GPU. If this value is less than 1, the GPU's multiprocessors are not fully utilized. "Mean Blocks per SM" is the weighted average over all runs of this kernel name, using the duration of each run as the weight.
Mean Est. Achieved Occupancy:
Est. Achieved Occupancy is defined as described above. "Mean Est. Achieved Occupancy" is the weighted average over all runs of this kernel name, using the duration of each run as the weight.
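To get a feel for the "Blocks per SM" number, you can query the SM count of your GPU directly. A minimal sketch, assuming a CUDA device is available; the block count of 64 is just an illustrative value, not taken from the article:

import torch

props = torch.cuda.get_device_properties(0)
num_sms = props.multi_processor_count    # number of SMs on this GPU

kernel_blocks = 64                       # example: a kernel launched with 64 blocks
blocks_per_sm = kernel_blocks / num_sms
print(f"{num_sms} SMs -> {blocks_per_sm:.2f} blocks per SM")
# A value below 1.0 means some SMs have no block to run for this kernel.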
Trace view:
The trace view shows a timeline of how long each operator in the model ran and which device executed it. This view can help you identify whether high cost and long execution time are due to input processing or model training. Currently, the trace view shows GPU Utilization and Est. SM Efficiency along the timeline.
In the example above, the GPU utilization during ProfilerStep5 on thread 28022 is higher than during Optimizer.step. You can zoom in to see why.
As the figure shows, the former's kernels run longer than the latter's. The latter's kernels are too short, which results in low GPU utilization.
Est. SM Efficiency: each kernel has a computed Est. SM Efficiency between 0 and 100%. For example, if a kernel has only 64 blocks and the GPU has 80 SMs, its Est. SM Efficiency is 64/80, or 0.8.
Cloud Storage Support
After running pip install tensorboard, install the extra that matches your cloud provider in order to read profiling data from it:

pip install torch-tb-profiler[blob]
pip install torch-tb-profiler[gs]
pip install torch-tb-profiler[s3]

These read data from Azure Blob Storage, Google Cloud Storage, and Amazon S3, respectively.
For more information, see: Here
Jump to source code
One of the benefits of integrating TensorBoard and PyTorch Profiler directly into Visual Studio Code (VS Code) is the ability to jump from a stack trace in the Profiler straight to the source code (file and line). The VS Code Python extension now supports TensorBoard integration.
Jumping to source code is only available when TensorBoard is running inside VS Code. If you profile with with_stack=True, the stack traces will appear in the plugin UI. Clicking a stack trace in PyTorch Profiler opens the corresponding file and jumps directly to the relevant code for debugging. This lets you quickly optimize and modify your code based on the analysis results and recommendations.
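A minimal sketch of enabling stack capture, combined with the trace handler from earlier so the plugin can surface the stack traces; train_step and batch are placeholders:

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet50'),
    with_stack=True          # record source file/line info for each operator
) as prof:
    train_step(batch)        # placeholder for one training step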
Jump to source code from the Visual Studio Code plugin UI
For details on how to optimize batch size performance, see the tutorial: Here
PyTorch Profiler can also be integrated with PyTorch Lightning: simply launch your Lightning training job with the trainer.profiler=pytorch flag to generate traces.
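From Python, this corresponds to passing the profiler option to the Trainer. A minimal sketch, assuming PyTorch Lightning is installed; MyLightningModule and my_datamodule are placeholders for your own module and data:

import pytorch_lightning as pl

# "pytorch" selects Lightning's built-in wrapper around the PyTorch profiler
trainer = pl.Trainer(max_epochs=1, profiler="pytorch")
trainer.fit(MyLightningModule(), datamodule=my_datamodule)  # placeholders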
Detailed example: Here
Original address: Here