The User Practices series collects articles from MegEngine users about how they apply the framework in their own work, in the hope of helping others use MegEngine better.

Author: Wang Lei | R&D Engineer, Megvii Technology

Background

With the development of artificial intelligence technology and the expansion of its application fields, mobile devices with relatively weak computing power have become an important platform for model inference, so optimizing inference performance on them has become an important engineering problem. It is generally believed that running a model on the GPU has a great advantage over running it on the CPU and brings considerable performance improvement. This is often the case, but in engineering practice we have also found that for models of smaller size, running on a mobile GPU brings no performance gain and introduces additional compatibility issues. Therefore, in some application scenarios, we need to take the CPU as the execution target and try various methods to improve model inference performance.

In the engineering practice of optimizing the inference performance of a keypoint model, we found two optimizations offered by the MegEngine inference engine to be particularly effective: NCHW44 and Record. This article describes their principles and usage in detail.

NCHW44 optimization

Principle

As we all know, increasing the degree of parallelism is an important means of speeding up computation. On the CPU, this calls for SIMD (Single Instruction, Multiple Data) instructions: a single instruction operates on multiple pieces of data at once. Take addition as an example: with an ordinary (non-SIMD) add instruction, only one number can be processed at a time, and in model inference the operands are often 8-bit or 16-bit integers, or at most 32-bit floats, which wastes much of a modern 64-bit register. If multiple numbers are packed into one register and a single instruction computes on all of them, the calculation speed can be multiplied. On x86 CPUs, SIMD is provided by the SSE, AVX, and other instruction sets; on ARM CPUs it is the NEON instruction set. CPUs also provide dedicated registers for SIMD instructions: 128, 256, or even 512 bits wide on the x86 platform, and 128 bits on the ARM platform, enough to operate on four float32 values at once. Therefore, if model inference uses SIMD as much as possible, inference performance can be improved.

Let’s look at the problems with using SIMD in model inference. A tensor is usually stored in memory in NCHW order, that is, the rows and columns of each channel are laid out contiguously, and the channels are then stored one after another. Take the common convolution as an example, with a kernel size of, say, 3×3: each step multiplies a row of 3 consecutive pixels by the corresponding kernel weights (the other rows and channels are handled likewise). A SIMD instruction, however, typically uses a 128-bit register and must process 4 values at a time to make full use of its float32 lanes, and those 4 values must be adjacent in memory. This pattern of computation therefore greatly limits the use of SIMD instructions.

As an improvement, in the NCHW44 layout (also known as NC4HW4), the data of 4 channels at the same spatial position (H, W) are arranged contiguously. During convolution, these 4 values participate in the computation together and can be loaded into a register by a single SIMD instruction, which improves computational efficiency. [Figure: the data storage arrangement of NCHW44.]
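To make the layout concrete, here is a small numpy sketch (illustrative only, not MegEngine code) that repacks an NCHW tensor into NCHW44; it assumes the channel count is a multiple of 4:

import numpy as np

# Toy NCHW tensor: 1 batch, 8 channels, 2x2 spatial positions.
N, C, H, W = 1, 8, 2, 2
x = np.arange(N * C * H * W, dtype=np.float32).reshape(N, C, H, W)

# Split C into C//4 groups of 4 channels, then move the 4-channel axis
# innermost so the 4 channel values at each (h, w) become contiguous.
x_nchw44 = np.ascontiguousarray(
    x.reshape(N, C // 4, 4, H, W).transpose(0, 1, 3, 4, 2))

print(x_nchw44.shape)        # (1, 2, 2, 2, 4): N, C//4, H, W, 4
print(x_nchw44[0, 0, 0, 0])  # channels 0-3 at position (0, 0): [0. 4. 8. 12.]

After repacking, the 4 channel values at each spatial position sit side by side in memory, so one 128-bit SIMD load can fetch all of them with a single instruction.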

Practice

MegEngine supports two ways to use NCHW44 optimization:

1. Dump (serialize) the model as an NCHW44 model offline; during inference, MegEngine automatically recognizes the layout and dispatches to the corresponding operator implementations. Either of the following two interfaces can be used for the dump:

megengine.jit.trace.dump

megengine.core.tensor.megbrain_graph.optimize_for_inference

Both interfaces accept an enable_nchw44 parameter; setting enable_nchw44=True produces an NCHW44 model.

Similarly, when dumping with the sdk/load-and-run/dump_with_testcase_mge.py script, adding the --enable-nchw44 argument generates an NCHW44 model that load_and_run can load and execute.
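As a minimal sketch of the first interface (per the MegEngine 1.x Python API; the toy model and file name here are placeholders, not from the original article):

import numpy as np
import megengine as mge
import megengine.module as M
from megengine.jit import trace

net = M.Conv2d(3, 16, kernel_size=3)  # stand-in for a real trained model

@trace(symbolic=True, capture_as_const=True)
def infer(data):
    return net(data)

# Run once so the static graph is traced and captured.
data = mge.tensor(np.random.random((1, 3, 64, 64)).astype(np.float32))
infer(data)

# enable_nchw44=True serializes the model in the NCHW44 layout.
infer.dump("model_nchw44.mge", arg_names=["data"], enable_nchw44=True)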

2. Enable the transformation online: dump the model without any NCHW44 configuration, and turn the conversion on at runtime through the graph options:

serialization::GraphLoader::LoadConfig load_config;
load_config.comp_graph = ComputingGraph::make();
auto&& graph_opt = load_config.comp_graph->options();
graph_opt.graph_opt.enable_nchw44();

Correspondingly, if you want to pre-test performance with load_and_run, add the command-line argument --enable-nchw44 when executing load_and_run.

The two methods can be chosen according to the actual situation: if the SDK or app we develop may load multiple models, some using NCHW44 and others not, the offline approach is more suitable; if for some reason we cannot re-dump the model (for example, the original model file has been lost), then we have to use the online approach.

Effect

In our engineering practice, this improved the inference speed of one model by about 20-30% on current mainstream Android phones.

Record optimization

Principle

When MegEngine performs inference, the underlying execution runs over a static graph whose execution sequence is fixed. For each operator in the graph, execution is divided into two steps: preparing the kernel and actually executing it. In the kernel preparation phase, MegEngine decides which algorithm to run based on the filter size, stride, shape, and so on; in other words, it selects the function, i.e. the kernel, to execute (a convolution, for example, may have several different implementations). In the execution phase, these functions are actually called.

If the selection criteria stay the same (in practice, this means the shapes do not change), then kernel preparation only needs to be performed once, and the selected function objects can be recorded in a list. On later executions, the function objects are fetched directly from the list and run in order, saving the kernel preparation time on every run after the first. That is where the name Record comes from.
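The idea can be sketched in a few lines of Python (purely illustrative; this is not MegEngine's internal code): the expensive selection step runs once, its result is stored as a callable, and later runs simply replay the recorded list.

def prepare_conv_kernel(shape):
    # Expensive step: choose the best algorithm for this shape, stride, etc.
    print("selecting algorithm for", shape)  # runs only on the first execution

    def kernel():
        print("running the selected conv kernel")  # runs on every execution
    return kernel

# First execution: prepare a kernel for each operator and record them in order.
recorded = [prepare_conv_kernel((1, 4, 224, 224))]

# Later executions: replay the recorded list, skipping preparation entirely.
for _ in range(3):
    for k in recorded:
        k()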

Two levels of Record currently exist in MegEngine. Record1 is used to speed up execution, as described above; Record2 is mainly used to save memory: if shapes never change, MegEngine can release some of the shape-related information it stores (information that would otherwise be used to re-derive shapes when they change). For scenarios where we want to improve computing performance, Record1 is generally the appropriate choice.

Note that the most important limitation of Record is that shapes cannot change. Some detection models, for example, may need to be resized according to the size of the input image; in that case Record cannot be used. Even for models whose input height, width, and channel count are constant, note that the batch size (the N in NCHW) must not change either, which is easy to overlook. In addition, we can still change shapes after the model is loaded but before the first run; as long as the shapes stay fixed after the first run, the use of Record is unaffected.

In addition to the requirement that shapes do not change, there are some further restrictions:

  1. No operator may rely on dynamic memory allocation, because the recorded function objects capture pointers to their inputs and outputs, and dynamically allocated memory would change those pointers;

  2. The input/output pointers on the host cannot change;

  3. Synchronization can only happen at the end of network execution, that is, it cannot be performed at an intermediate node during execution;

  4. There cannot be more than one CompNode in the entire graph.

For general use, these conditions are usually easy to satisfy.

Practice

Enable it in the graph options:

serialization::GraphLoader::LoadConfig load_config;
load_config.comp_graph = ComputingGraph::make();
auto&& graph_opt = load_config.comp_graph->options();
graph_opt.comp_node_seq_record_level = 1;  // or 2 for Record2

Correspondingly, if you want to pre-test performance with load_and_run, add the command-line argument --record-comp-seq or --record-comp-seq2 when executing load_and_run.

Effect

In our engineering practice, this improved the inference speed of one model by about 10% on current mainstream Android phones.

Conclusion

This article introduced the principles and usage of two MegEngine optimizations, NCHW44 and Record. They are just two effective methods we found while optimizing the inference performance of a keypoint model. How effective an optimization is depends on the characteristics of the model, so for a specific model you can try MegEngine's other optimization options to find a more suitable approach. Of course, optimization has many aspects: besides model inference itself, optimizing pre- and post-processing, reducing data copies, and setting CPU affinity sensibly on Android devices are all worth trying and considering.

Appendix:

GitHub: MegEngine (Tianyuan)

MegEngine: deep learning, simple development

Welcome to join the MegEngine technical exchange QQ group: 1029741705