Author: Chen Sixu | Megvii MegEngine Architect
Performance comparison
For a deep learning framework, the inference performance of a model is a key metric that users care about. One factor that has a real impact on performance is the Tensor Layout Format (NCHW, NHWC, etc.): choosing the right Layout directly affects inference performance.
MegEngine's global graph optimization builds on the global optimization algorithm for data Formats in GPU inference proposed by Alibaba engineer Chen Yuankai, and extends it to CPU. It provides a globally optimized selection of Tensor Formats that improves the overall inference performance of a model. On int8 models, the inference time comparison between global graph optimization and MegEngine's original Format optimization (hereinafter called traditional graph optimization) is shown in the following figure:
Usage
How global graph optimization is used at the source level can be seen in MegEngine's load_and_run executable. If you only want to test model performance with load_and_run, the options you need are:

- --layout-transform [cuda|cpu]: enables global graph optimization for the computational graph and specifies the target device of the optimization.
- --layout-transform-dump dump_path: specifies the file path where the optimized graph is dumped.
Global graph optimization can be enabled together with other optimization options such as Fast Run. Usage is as follows:
```
load_and_run model.mge --layout-transform cuda --layout-transform-dump model_opt.mge
load_and_run model_opt.mge
```
If you are interested in the underlying technical principles, read on ~ 😀
Problems solved by global graph optimization
In neural networks, a Tensor is usually a multidimensional array, while computer memory is linear, so there are many different storage formats in which a Tensor can be laid out in memory. A Tensor is usually stored in NCHW Format, whose dimensions can be of arbitrary size, but hardware architectures tend to prefer more regular data (such as 4-aligned or 32-aligned) in order to reach the best performance on the device. Therefore, many high-performance computing libraries implement operators with different Formats.
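As a toy illustration of such Formats (numpy here stands in for a real framework; NCHW4 is one of the MegEngine Formats mentioned below), the same logical Tensor can be stored either as plain NCHW or packed into 4-aligned NCHW4:

```python
import numpy as np

N, C, H, W = 1, 8, 2, 2
x = np.arange(N * C * H * W, dtype=np.int8).reshape(N, C, H, W)  # plain NCHW

# NCHW4 packs channels in groups of 4: logical shape (N, C//4, H, W, 4).
# After the copy, each group of 4 channel values is contiguous in memory,
# which is what 4-aligned hardware kernels want to load.
nchw4 = x.reshape(N, C // 4, 4, H, W).transpose(0, 1, 3, 4, 2).copy()

print(x.shape, nchw4.shape)  # (1, 8, 2, 2) (1, 2, 2, 2, 4)
```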
MegEngine supports all major deep learning inference platforms, including CUDA, ARM, X86, etc. The differences between platforms are huge, so NCHW Format does not achieve optimal memory-access performance in most cases. To address this, MegEngine provides a number of Format-specific operators; see the user guide for their features. Different Formats are applicable on different platforms: for example, NCHW4, NCHW32, CHWN4, and so on can be used on NVIDIA GPU devices.
The common approach is to have the user specify the Format to use, and then rely on traditional graph optimization to transform the input Tensors of the relevant operators in the computational graph (such as convolutions) into the user-specified Format (a process called Layout transformation).
However, users often have no effective way to decide which Formats suit the current platform and which give the best performance. In addition, traditional graph optimization may involve multiple Layout transformations, introducing a lot of extra memory-rearrangement overhead that users do not take into account when choosing a Format. In summary, traditional graph optimization currently faces the following problems:
- Performance: users cannot easily tell which Format performs better, so negative optimization can occur.
- Locality: the extra overhead of Layout transformations is not taken into account.
- Extensibility: every newly added Format requires hand-writing additional Layout transformation methods.
To solve these problems of traditional graph optimization, MegEngine introduced global graph optimization, which automatically selects the most appropriate Format while taking the extra overhead of Layout transformations into account, so that users do not need to think about Format at all.
Principle of global graph optimization
The essence of global graph optimization is to solve the Tensor Format selection problem in the computational graph: select an appropriate Layout for each operator so that the inference performance of the whole graph is optimal. Its core is a dynamic programming algorithm.
Process
The overall process of global graph optimization is as follows:
- Extract connected subgraphs composed of specific operators from the computational graph; global graph optimization is applied to these connected subgraphs.
  - The specific operators include Convolution, Elemwise, CV, and Pooling operators.
- Collect the performance data of the subgraph's operators under each Layout, as well as the cost of converting between different Layouts, and use these data to construct the Layout selection problem.
  - All of the above data is obtained by actually running the operators on the target device.
- Solve the Layout selection problem to obtain the Layout each operator in the subgraph should select.
- Replace the Layout of each operator in the subgraph with the one obtained in the previous step, and reconnect the subgraph into the computational graph.
The focus of the global graph optimization strategy is how to solve the Layout selection problem. This part uses a dynamic-programming-based global optimization algorithm, introduced below.
Algorithm
Before introducing the global graph optimization algorithm, two concepts need a brief explanation:

- $\text{Op}$: compute node; an operator is a compute node.
- $\text{Var}$: data node; the input and output Tensors of an $\text{Op}$.
The following figure shows part of a computational graph after topological sorting. It consists of multiple $\text{Op}$s connected through $\text{Var}$s; each $\text{Var}$ is the output of the preceding $\text{Op}$ and an input of the following $\text{Op}$. For example, $\text{Var}_2$ is both the output of $\text{Op}_1$ and an input of $\text{Op}_3$.
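To make this structure concrete, here is a minimal sketch in code (the classes are hypothetical, not MegEngine's graph API):

```python
from dataclasses import dataclass

# Hypothetical minimal Op/Var structure, just to make the topology concrete.
@dataclass
class Var:
    name: str

@dataclass
class Op:
    name: str
    inputs: list
    outputs: list

var1, var2, var3 = Var("var1"), Var("var2"), Var("var3")
op1 = Op("op1", inputs=[var1], outputs=[var2])  # Var2 is the output of Op1 ...
op3 = Op("op3", inputs=[var2], outputs=[var3])  # ... and an input of Op3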
The solution of the global graph optimization algorithm is the Layout each $\text{Op}$ in the connected subgraph should select for the overall performance to be optimal. To facilitate the introduction of the algorithm, we agree on the meanings of the following symbols:
- $\text{Cut}_n$: the $n$-th cut. A cut divides the computational graph into a left part and a right part, which are connected through some $\text{Var}$s. For example, $\text{Cut}_2$ consists of $\text{Var}_2$ and $\text{Var}_3$.
- $\text{State}_n$: a state of $\text{Cut}_n$, i.e. one combination of the Formats of the $\text{Var}$s connecting the two parts. For example, if the Formats of $\text{Var}_2$ and $\text{Var}_3$ are $f_1$ and $f_2$ respectively, then $(f_1, f_2)$ is one state of $\text{Cut}_2$. If $\text{Cut}_n$ contains $N$ $\text{Var}$s and each $\text{Var}$ has $M$ candidate Formats, the number of states of $\text{Cut}_n$ is $M^N$ (see the snippet after this list).
- $O_k^n$: the execution time of $\text{Op}_n$ when $\text{Cut}_{n-1}$ is in $\text{State}_{n-1}^k$.
- $V_{s \to k}^n$: the time for $\text{Cut}_n$ to convert from $\text{State}_n^s$ to $\text{State}_n^k$, i.e. the total time for all $\text{Var}$s of $\text{Cut}_n$ to convert from state $s$ to state $k$.
- $T_k^n$: the minimum total time spent when $\text{Cut}_n$ is in $\text{State}_n^k$, i.e. when the output $\text{Var}$s of $\text{Op}_n$ take Format combination $k$.
- $T_n$: the shortest time at $\text{Cut}_n$, i.e. the minimum of $T_k^n$ over all states $k$ of $\text{Cut}_n$.
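The toy snippet below illustrates the state space of a single cut (the Format names are examples only):

```python
from itertools import product

formats = ["NCHW", "NHWC", "NCHW4", "NCHW32"]        # M = 4 candidate Formats
cut_vars = ["var2", "var3"]                           # N = 2 Vars crossing the cut

# Each State of the Cut assigns one Format to every Var: M**N states in total.
states = list(product(formats, repeat=len(cut_vars)))
print(len(states))    # 16
print(states[0])      # ('NCHW', 'NCHW')
```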
Of the quantities defined above, $O_k^n$ and $V_{s \to k}^n$ are obtained by profiling on the target device, while $T_k^n$ is what the algorithm needs to compute. Let us now derive the state transition equation.
At $\text{Cut}_0$, $\text{Op}_0$ is the first $\text{Op}$, so $T_k^0$ is simply the execution time of $\text{Op}_0$ in $\text{State}^k$, i.e. $T_k^0 = O_k^0$.
At $\text{Cut}_1$, $T_k^1$ should equal the minimum, over the states $s$ of $\text{Cut}_0$, of three terms: the shortest time $T_s^0$ at $\text{Cut}_0$ in $\text{State}_0^s$, the time $V_{s \to k}^0$ to convert $\text{Cut}_0$ from $\text{State}_0^s$ to $\text{State}_0^k$, and the execution time $O_k^1$ of $\text{Op}_1$ in $\text{State}_0^k$:

$$T_k^1 = \min_s \left( T_s^0 + V_{s \to k}^0 + O_k^1 \right)$$
Extending to $\text{Cut}_{i+1}$, we obtain the state transition equation:

$$T_k^{i+1} = \min_s \left( T_s^i + V_{s \to k}^i + O_k^{i+1} \right)$$
Given $T_k^0$, $V_{s \to k}^i$, and $O_k^i$, suppose there are $n$ Cuts in total. From the equation above we first obtain the shortest time $T_n$ of $\text{Cut}_n$, which determines the state $x$ at which $T_n$ reaches its minimum; this fixes the Layout $\text{Op}_n$ should select. Then, from the state transition equation, the state $y$ that minimizes $T_x^n$ can be determined, which fixes the Layout of $\text{Op}_{n-1}$, and so on: working backwards from the last $\text{Cut}$ along the state transition equation determines the Layout every $\text{Op}$ should select, and the problem is solved.
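To make the procedure concrete, here is a minimal Python sketch of the dynamic program and its backtracking, assuming for simplicity that every cut has the same number of states and that the profiled costs are given as plain nested lists (op_cost[n][k] for $O_k^n$, convert_cost[n][s][k] for $V_{s \to k}^n$; the names are illustrative, not MegEngine's implementation):

```python
def solve_layouts(op_cost, convert_cost):
    """op_cost[n][k] = O_k^n, convert_cost[n][s][k] = V_{s->k}^n."""
    n_cuts = len(op_cost)
    num_states = len(op_cost[0])
    T = [list(op_cost[0])]          # base case: T_k^0 = O_k^0
    choice = [[0] * num_states]     # argmin s per (cut, state), for backtracking

    for i in range(n_cuts - 1):
        row, arg = [], []
        for k in range(num_states):
            # T_k^{i+1} = min_s(T_s^i + V_{s->k}^i + O_k^{i+1})
            best_s = min(range(num_states),
                         key=lambda s: T[i][s] + convert_cost[i][s][k])
            arg.append(best_s)
            row.append(T[i][best_s] + convert_cost[i][best_s][k] + op_cost[i + 1][k])
        T.append(row)
        choice.append(arg)

    # backtrack from the state that minimizes T at the last cut
    k = min(range(num_states), key=lambda s: T[-1][s])
    states = [k]
    for i in range(n_cuts - 1, 0, -1):
        k = choice[i][k]
        states.append(k)
    return states[::-1]             # chosen state (Layout combination) per cut

# tiny usage example: two cuts, two candidate states each
op_cost = [[1.0, 3.0], [2.0, 1.0]]
convert_cost = [[[0.0, 0.5], [0.5, 0.0]]]
print(solve_layouts(op_cost, convert_cost))  # [0, 1]
```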
If there are $K$ Cuts, a cut contains at most $N$ $\text{Var}$s, and each $\text{Var}$ has $M$ Formats, the complexity of the algorithm is $O(K M^N)$. If the model is complex and $N$ is large, the number of States $M^N$ grows exponentially and the algorithm takes a very long time to run. Therefore, to reduce the complexity, the States are pruned.
Pruning
Suppose $\text{Cut}_n$ is connected by $m$ $\text{Var}$s $(v_1, v_2, \ldots, v_m)$, and denote the time for $\text{Cut}_n$ to convert from $\text{State}_n^i$ to $\text{State}_n^j$ as $\text{dist}(i, j)$.

$T_i^n$ denotes the minimum time spent when $\text{Cut}_n$ is in $\text{State}_n^i$. When $T_i^n < T_j^n$ and $T_i^n + \text{dist}(i, j) \le T_j^n$, any solution that goes through $\text{State}_n^j$ can be replaced, at no extra cost, by one that first reaches $\text{State}_n^i$ and then converts to $\text{State}_n^j$. This leads to the pruning principle:

$$T_i^n + \text{dist}(i, j) \le T_j^n \implies \text{State}_n^j \text{ can be pruned}$$
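A minimal sketch of this pruning, in the same illustrative style as the dynamic-programming sketch above (dist[i][j] is the state-conversion cost defined earlier; the names are assumptions, not MegEngine's implementation):

```python
def prune_states(T_n, dist):
    """T_n: per-state minimum time at Cut_n; dist[i][j]: cost of converting state i to j."""
    states = sorted(T_n, key=T_n.get)  # examine cheaper states first
    pruned = set()
    for j in states:
        for i in states:
            if i == j or i in pruned:
                continue
            if T_n[i] + dist[i][j] <= T_n[j]:
                pruned.add(j)          # reaching j via i is never worse: drop j
                break
    return {s: T_n[s] for s in states if s not in pruned}

# usage: with a uniform conversion cost of 1.5, state (f2,f2) is dominated
T2 = {"(f1,f1)": 5.0, "(f1,f2)": 4.0, "(f2,f2)": 6.0}
dist = {s: {t: 0.0 if s == t else 1.5 for t in T2} for s in T2}
print(prune_states(T2, dist))  # {'(f1,f2)': 4.0, '(f1,f1)': 5.0}
```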
Engineering practice
The global graph optimization algorithm solves the performance and locality problems of traditional graph optimization, but the extensibility problem mentioned above remains. Therefore, besides implementing the algorithm, the global graph optimization strategy includes some extra engineering work.
Data format converter
MegEngine provides many Formats, such as NCHW, NHWC, NCHW4, CHWN4, and so on. Given $N$ Formats, maintaining $O(N^2)$ hand-written Layout conversion algorithms is practically unacceptable.
Considering that a Layout transformation is essentially a rearrangement of data, all Layout transformations can be expressed with MegEngine's Reshape and Transpose operators, so we designed a data format converter that completes every Layout transformation using only Reshape and Transpose. The following is an example of a Layout transformation done with Reshape and Transpose.
To simplify the description, we agree on a notation for Layout formats: for example, NCHW4 splits the Channel dimension by 4 and is written as (N, C//4, H, W, C%4). Example: if NCHW16 is newly added, how is NCHW4 converted to NCHW16? The steps are listed below and replayed in the numpy sketch that follows.
- layout0 = (N, C//4, H, W, C%4)
- layout1 = (N, C//16, H, W, C%16)
- Reshape splits the C dimension: (N, C//4//4, C//4%4, H, W, C%4)
- Transpose adjusts the dimension order: (N, C//4//4, H, W, C//4%4, C%4)
- Reshape merges the last two dimensions: (N, C//16, H, W, C%16) = layout1
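These three steps can be replayed and checked with a small numpy sketch (numpy's reshape/transpose stand in for MegEngine's Reshape and Transpose operators):

```python
import numpy as np

N, C, H, W = 1, 32, 2, 2
x = np.arange(N * C * H * W).reshape(N, C, H, W)        # plain NCHW

# layout0: NCHW4 = (N, C//4, H, W, C%4)
nchw4 = x.reshape(N, C // 4, 4, H, W).transpose(0, 1, 3, 4, 2)

# step 1, Reshape: split the C//4 dim  -> (N, C//4//4, C//4%4, H, W, C%4)
t = nchw4.reshape(N, C // 16, 4, H, W, 4)
# step 2, Transpose: reorder the dims  -> (N, C//4//4, H, W, C//4%4, C%4)
t = t.transpose(0, 1, 3, 4, 2, 5)
# step 3, Reshape: merge the last two  -> (N, C//16, H, W, C%16) = layout1
nchw16 = t.reshape(N, C // 16, H, W, 16)

# check against converting NCHW directly to NCHW16
ref = x.reshape(N, C // 16, 16, H, W).transpose(0, 1, 3, 4, 2)
assert (nchw16 == ref).all()
```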
With the data format converter, adding a new Format no longer requires hand-writing the corresponding Layout conversion algorithms, so the extensibility problem of traditional graph optimization is solved in global graph optimization.
Conclusion
In terms of performance, global graph optimization brings more benefit than traditional graph optimization. Because of the number of int8 Formats MegEngine supports, the speedup is especially significant on int8 models. The precondition for using global graph optimization is that a real device is available for profiling, so that the right decisions can be made; in this respect it is not a perfect substitute for traditional graph optimization. When that condition is met, however, global graph optimization is the better choice: it is both easier to use and faster.
Links:
GitHub: MegEngine (welcome to star ~)
Gitee: MegEngine/MegEngine
Website: MegEngine - deep learning, simple development
Welcome to join the MegEngine technical exchange QQ group: 1029741705