
IFX is Didi's self-developed AI inference engine framework, which provides AI deployment solutions for cloud, edge, and on-device scenarios. Many PyTorch models inside Didi have already been integrated into the IFX cloud framework and are delivered online via Serving.

One necessary step on a model's path from integration to going live is to measure the performance benefit of IFX over PyTorch (1.3.1) in the same environment.
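As a reference point, the sketch below shows one minimal way to measure GPU inference latency in PyTorch; it is illustrative only (not IFX's internal benchmark harness), and the model and input are placeholders:

import time
import torch

def measure_latency(model, example_input, warmup=20, iters=100):
    # Warm up first so lazy initialization and cuDNN autotuning do not skew the timing.
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):
            model(example_input)
        torch.cuda.synchronize()          # make sure all warmup work has finished
        start = time.time()
        for _ in range(iters):
            model(example_input)
        torch.cuda.synchronize()          # wait for all queued GPU work to finish
    return (time.time() - start) / iters * 1000.0  # average latency in milliseconds

# Usage (hypothetical): model and input must both live on the GPU, e.g.
# print(measure_latency(model.cuda(), torch.randn(1, 3, 224, 224, device="cuda")))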

The following figure shows the measured performance of several recently added models:

These models fall into two broad categories: ResNet and MobileNet. As the results above show, the ResNet-series models (A, B, C, D) run about 3-4 times faster after being adapted to the IFX framework, while the MobileNet-series models (E, F) run about 17 times faster.

For the user-integrated E and F models, IFX is about 17 times faster than PyTorch. This article analyzes, step by step, why there is so much room for optimization.

▎ Basic analysis

The network structure of the E and F models is MobileNet-V3; see the introduction to MobileNet-V3 for its main structure and usage. A rough estimate from the ONNX model exported from PyTorch shows that MobileNet-V3 contains about 460 operators.
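The actual E and F models are internal, but a comparable operator count can be reproduced with torchvision's public MobileNet-V3 as a stand-in (an assumption, and it requires a recent torchvision) by exporting it to ONNX and counting graph nodes:

import torch
import torchvision
import onnx

# torchvision's MobileNet-V3 as a public stand-in for the internal E/F models.
model = torchvision.models.mobilenet_v3_small().eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "mobilenet_v3.onnx", opset_version=11)

graph = onnx.load("mobilenet_v3.onnx").graph
print("operator count:", len(graph.node))  # on the order of a few hundred nodes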

In addition, PyTorch's conv operator implementation calls extra data-handling operators, which adds further to the operator count.

Next, we analyze the model structure in detail. The following is the most common structure in the MobileNet-V3 network.

The MobileNet-V3 network uses two special activation functions, hswish and hsigmoid; the four black operators in the figure above implement hswish.

The Python code is as follows:

import torch.nn as nn
import torch.nn.functional as F

class hswish(nn.Module):
    # hswish(x) = x * relu6(x + 3) / 6
    def forward(self, x):
        out = x * F.relu6(x + 3, inplace=True) / 6
        return out

class hsigmoid(nn.Module):
    # hsigmoid(x) = relu6(x + 3) / 6
    def forward(self, x):
        out = F.relu6(x + 3, inplace=True) / 6
        return out

In the whole MobileNet-V3 network, hswish is called 31 times and hsigmoid 13 times, so the related basic operators number 31×4 + 13×3 = 163, a fairly large share of the model. PyTorch, however, applies no special optimization to this part; it simply executes the basic operations one after another.

The figure shows PyTorch's execution of hswish. The four cells pointed to by the red arrows implement hswish's add, clip, mul, and div operations respectively; the four operators take 10.88 us in total.

The following is IFX's optimization for this model. Since hswish and hsigmoid are common operators, IFX fuses each of them into a single dedicated operator. This fusion reduces the number of operators, and the fused operators also perform better than the original sequence of basic operations.
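IFX's graph rewriting is not public, but the idea of this fusion can be sketched as a toy pass that scans the operator sequence and collapses the add → clip → mul → div pattern into a single hswish node (the node representation and names below are made up purely for illustration):

# Toy operator sequence standing in for a slice of the exported graph.
ops = ["Conv", "Add", "Clip", "Mul", "Div", "Conv", "Add", "Clip", "Div"]

def fuse_hswish(op_list):
    # Collapse every Add -> Clip -> Mul -> Div run into one HSwish node.
    pattern = ["Add", "Clip", "Mul", "Div"]
    fused, i = [], 0
    while i < len(op_list):
        if op_list[i:i + 4] == pattern:
            fused.append("HSwish")
            i += 4
        else:
            fused.append(op_list[i])
            i += 1
    return fused

print(fuse_hswish(ops))
# ['Conv', 'HSwish', 'Conv', 'Add', 'Clip', 'Div']
# A similar pass collapsing Add -> Clip -> Div would yield the fused hsigmoid.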

Take a look at the test results:

As can be seen, the hswish operator implemented by IFX takes only 2.14 us, roughly 5 times faster than PyTorch's 10.88 us.

In fact, this hswish fusion is only one of IFX's fusion strategies. Common structures such as Conv + Elementwise and Conv + BN + ReLU can also be fused into a single Conv.
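Conv + BN fusion, for instance, amounts to folding the BatchNorm statistics into the convolution's weights and bias. A minimal PyTorch sketch of that folding (inference-time only, and not IFX's actual implementation) looks like this:

import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # Fold y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta
    # into a single convolution with adjusted weight and bias.
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    with torch.no_grad():
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused

# Sanity check: the fused conv matches conv followed by BN in eval mode.
conv, bn = nn.Conv2d(8, 16, 3, padding=1), nn.BatchNorm2d(16).eval()
x = torch.randn(1, 8, 32, 32)
print(torch.allclose(fold_bn_into_conv(conv, bn)(x), bn(conv(x)), atol=1e-5))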

After all fusion strategies are applied to MobileNet-V3, the number of operators in the model drops from about 460 to about 160, and the individual operators also run faster.

The following shows the performance improvement brought by IFX's model optimization:

Test condition | Time
PyTorch | 25.6 ms
IFX (hswish + hsigmoid fusion and basic operator optimization only) | 3.17888 ms
IFX (all fusion and operator optimization strategies applied) | 1.49237 ms
(Note: we did not compare against TensorRT directly, mainly because a large number of business models, including MobileNet-V3, do not run directly on TensorRT 5.0.2, and adapting them would require extra development effort.)

As can be seen, fusing just these two operators plus optimizing the basic operators (Conv, BN, FC, etc.) already makes the model about 8 times faster. Applying all optimization strategies then more than doubles the performance again.

It is important to understand that every operator call has to be launched from the CPU before it can be computed on the GPU, and this launch overhead is quite expensive in PyTorch's implementation.
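The launch-overhead effect can be illustrated outside any framework internals by timing the four-op hswish sequence on tensors of very different sizes; a rough sketch, assuming a CUDA device is available (actual numbers will vary by hardware):

import torch

def hswish_ops(x):
    # The four basic ops PyTorch runs for hswish -- add, clip, mul, div --
    # each of which is a separate CUDA kernel launch in eager mode.
    return x * (x + 3).clamp(0, 6) / 6

def time_us(fn, x, iters=1000):
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters * 1000  # microseconds per call

small = torch.randn(1, 16, 14, 14, device="cuda")
large = torch.randn(1, 16, 112, 112, device="cuda")
print(time_us(hswish_ops, small), time_us(hswish_ops, large))
# On small tensors the two times stay close: per-kernel launch overhead,
# rather than arithmetic, dominates the cost of the four separate ops.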

Here are the nvprof results for this sub-network:

As can be seen, the 17 operators take about 1.39 ms on the GPU, of which actual computation is only a small fraction (about 104 us). The blank regions are where PyTorch calls other GPU Runtime APIs, which shows that GPU compute resources are far from fully utilized. So for the PyTorch framework: the more operators, the more waste.

Now let's look at the operator calls in the IFX framework:

As can be seen, the same sub-network is implemented with five IFX operators taking about 80 us in total, of which computation accounts for about 53 us, a much more efficient use of the GPU than PyTorch.

On this typical sub-network structure, IFX is thus about 17 times faster, which is consistent with the end-to-end improvement. The profiling process and results show that PyTorch is slow for identifiable reasons.

▎ Summary

PyTorch's implementation does involve CUDA invocation mechanisms that are, directly or indirectly, expensive.

In addition, nvprof shows that during model inference PyTorch has many gaps beyond operator computation time and extra Runtime API time; these gaps are spent executing CPU instructions. Is there room to optimize here? Yes: one can further analyze where this CPU time goes.

You are welcome to use Didi Cloud GPU hosts for deep learning model training and inference; the performance is excellent.

Use the Didi Cloud AI master code (9999) at purchase to enjoy a 10% discount!

www.didiyun.com/production/…