Abstract: In this paper, GreaterEqual is used as the test operator. Its calculation logic is relatively simple (Output = Input1 >= Input2), which keeps the pure computation time as low as possible so that data movement and operator scheduling dominate the operator's total time.

This article is shared from the Huawei Cloud community post “CANN AICPU Operator Time Analysis and Optimization Exploration” by DavilSu.

1. Purpose of the analysis

In the actual development of CANN operators, it often happens that an operator functions correctly but its performance is far lower than that of the corresponding TensorFlow operator. To investigate this problem, this paper takes GreaterEqual as the test operator. Its calculation logic is relatively simple (Output = Input1 >= Input2), which keeps the pure computation time as low as possible so that data movement and operator scheduling dominate the operator's total time.
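For reference, the core computation being measured is just this element-wise comparison. A minimal scalar sketch (not the canndev implementation, and ignoring broadcasting and type dispatch) looks like this:

```cpp
#include <cstddef>
#include <cstdint>

// Minimal scalar sketch of the GreaterEqual kernel (no broadcasting):
// output[i] = (input1[i] >= input2[i]). The real operator additionally
// handles type dispatch, broadcasting and parallelization.
template <typename T>
void GreaterEqualScalar(const T* input1, const T* input2, bool* output, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    output[i] = input1[i] >= input2[i];
  }
}
```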

2. Test code and platform

The test platform is an Ascend server provided by OpenLab, equipped with an Ascend 910A processor. The CANN Toolkit version is 5.0.2.alpha005.

The self-developed test code is based on commit cac625f243dfe7b04dbb2a82059cd0e4349f77d1, modified to optimize broadcast performance. The self-developed operator's parallel thresholds are 8K when a broadcast operation is involved and 32K when it is not.

The TensorFlow GreaterEqual operator is taken from TensorFlow 1.15; the canndev GreaterEqual operator is taken from commit d660e086717b94b8cfb3f35a8e08046ca0461772. That version of the operator uses the Eigen library's broadcast operation to work around the poor performance of the canndev repository's Bcast, but does not enable parallel computing for acceleration.

The test data covers two cases: the Tensor to be broadcast has exactly one element, and it has more than one element. Eight data types are tested: int8, int16, int32, int64, uint8, float16, float32 and float64. For each data type, 14 data-scale gradients are set: 128B, 256B, 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K, 256K, 1M, 2M and 8M. The detailed mapping between data scale and shape is as follows:

3. Single-thread performance analysis

This part tests the performance gap between the CANN operator and the TensorFlow operator when data is processed in a single thread. To avoid the impact of broadcast operations on the results, this test uses data batches that do not involve broadcasting.

Figure 1. Single-thread time-consumption ratio

As can be seen, the CANN operator has a certain performance advantage over TensorFlow at small data scales below 2K. However, as the data volume increases, the CANN operator's performance deteriorates significantly, especially for the uint8 data type, whose time consumption grows to as much as 6.57 times that of TensorFlow. For float16, which is not a standard C++ type, both implementations use the half data type from the Eigen library, and their test results are close.

Figure 2. Time to compute 1K of data

I also measured the average time to compute 1K of data when CANN and TensorFlow process 16K to 8M of data on a single core.

It can be seen that TensorFlow's time consumption increases in proportion to the space occupied by the data type. Strangely, CANN's int8 and uint8 times are close to its int16 time, which matches the observation that the performance degradation of int8 and uint8 in the time-consumption ratio is much higher than that of other data types; it may be that int8 and uint8 are widened to 16 bits before calculation. CANN's behaviour for float32 and float64 is also odd: as the data volume increases, the time consumption fluctuates greatly. The specifics are analyzed and optimized in the section on vectorized code and performance analysis below.

4. Performance comparison between the self-developed operator and the operator in the master repository

The GreaterEqual operator in the canndev main repository uses the Eigen library's broadcast operation to avoid the insufficient broadcast performance of the canndev repository's Bcast, but does not enable parallel computing for acceleration. The self-developed operator uses the Bcast class in the canndev repository for broadcasting, specializes the cases where broadcasting is and is not needed, and sets parallel thresholds for different data scales, as sketched below.
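The overall structure of the self-developed operator can be illustrated as follows. This is a simplified sketch rather than the actual canndev code: ParallelForStandIn is a plain std::thread stand-in for the AICPU framework's own parallel utility, and the thresholds are the ones quoted in Section 2, compared against the total data size in bytes.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Plain std::thread stand-in for the AICPU framework's parallel utility
// (the real operator would call the framework's ParallelFor-style helper).
inline void ParallelForStandIn(size_t total, size_t num_threads,
                               const std::function<void(size_t, size_t)>& work) {
  const size_t chunk = (total + num_threads - 1) / num_threads;
  std::vector<std::thread> threads;
  for (size_t t = 0; t < num_threads; ++t) {
    const size_t begin = t * chunk;
    const size_t end = std::min(total, begin + chunk);
    if (begin >= end) break;
    threads.emplace_back(work, begin, end);
  }
  for (auto& th : threads) th.join();
}

// Threshold quoted in the text for the no-broadcast case: 32K,
// compared against the total data size rather than the element count.
constexpr size_t kParallelBytesNoBcast = 32 * 1024;

template <typename T>
void GreaterEqualNoBroadcast(const T* x, const T* y, bool* out, size_t n) {
  auto body = [&](size_t begin, size_t end) {
    for (size_t i = begin; i < end; ++i) out[i] = x[i] >= y[i];
  };
  if (n * sizeof(T) < kParallelBytesNoBcast) {
    body(0, n);                       // small input: single thread
  } else {
    ParallelForStandIn(n, 8, body);   // large input: split across threads
  }                                    // (thread count is arbitrary here)
}
```

The broadcast path would follow the same single-thread/multi-thread split with the 8K threshold, with element indices computed through the Bcast class.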

This part tests two batches of data, one involving broadcasting and one not, aiming to compare the Bcast provided by canndev with the broadcast operation provided by Eigen, and to evaluate the performance advantage of the self-developed operator.

Figure 3. Time-consumption ratio without broadcast operation

Figure 4. Time-consumption ratio with broadcast operation

As can be seen from the results, when broadcasting is not involved, the self-developed operator outperforms the existing operator. For small data volumes this is because, after checking that no broadcast is needed, the self-developed operator operates directly on raw pointers, while the existing operator does not specialize this case and still goes through Eigen's broadcast method. For large data volumes, the self-developed operator is far better because multithreading is enabled.

However, when broadcasting is involved, the parallel threshold is set at 8K, so all smaller data volumes are processed by a single thread. It can be seen that CANN's Bcast performs worse than the broadcast implemented by Eigen. When the data volume exceeds 8K, the self-developed operator is far superior to the existing operator thanks to multithreaded parallel processing.

The broadcast operation implemented by TensorFlow has a significant performance advantage over both the broadcast implemented by Eigen and the Bcast implemented by CANN: it is 8 to 26 times faster than Eigen's broadcast, and even further ahead of CANN's.

5. Parallel threshold comparison

Since the self-developed operator was written with reference to the broadcast-optimized Less operator, I set up a control group using the same thresholds as the Less operator (2K with broadcast and 7K without broadcast) to verify whether those parallel thresholds are reasonable. To avoid the impact of broadcast operations on the results, this test uses data batches that do not involve broadcasting.

The test results are as follows:

Figure 5. Time-consumption ratio of the Less operator's thresholds vs. the self-developed operator's thresholds

It can be seen that the Less operator's parallel thresholds are not set reasonably: at a data scale of 8K there is an obvious spike in time, where parallel communication rather than computation dominates, while the self-developed operator's curve is relatively smooth. The self-developed thresholds were obtained by a binary-search style sweep, so that the parallel speedup ratio at the critical point is close to 1.

6. Vectorized code and performance analysis

During the single-thread performance analysis, I noticed a strange phenomenon: the int8 and int16 times were very close (as shown in Figure 2), which caught my attention. A processor's time for processing data depends on factors such as whether the data is fixed-point or floating-point, the bit width of the data, and the instructions used to process it. For the same number of elements, int16 should take longer to process than int8. Looking at the TensorFlow operator's execution times, its int8 and uint8 times are indeed lower than its int16 time.

Modern processors generally support SIMD (single instruction, multiple data), which exploits data-level parallelism (DLP) to speed up data-intensive operations by packing data into a vector register and computing on multiple elements with one instruction. The calculation of the GreaterEqual operator contains no branch-selection structure, and its logic is simple and repetitive, which makes it well suited to SIMD acceleration.

After consulting the documentation, I found that the AICPU in the Ascend 910 processor is a 16-core TaiShan core. A system query shows that it supports the AArch64 instruction set, which includes the NEON extension.

I tried embedding assembly code in the C++ implementation for manual vectorization, and performance did improve. Although manual vectorization can in theory achieve the highest degree of vectorization, the SIMD extension instruction sets provided by different processors differ, and applications are complex and varied, so hand-written SIMD code has poor readability and portability and is hard to keep optimizing. Considering that the operator code may later need to be ported to x86-64, ARM and other CPU architectures, I chose to let the compiler automatically generate vectorized code for the target processor's SIMD extension. With automatic vectorization, programmers do not need to care about the underlying SIMD components and instruction set; they only need to express the parallelism in the program clearly, which largely solves the portability problem of high-performance code.
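As a concrete illustration of the manual approach mentioned above, the same comparison can also be written with NEON intrinsics instead of inline assembly. This is only a sketch for an AArch64 target, not the code used in the experiment, and it assumes bool occupies one byte with 0/1 values, as on typical AArch64 ABIs:

```cpp
#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Manual NEON vectorization sketch for int8 GreaterEqual.
// vcgeq_s8 compares 16 lanes at once and yields 0xFF / 0x00 per lane.
void GreaterEqualInt8Neon(const int8_t* x, const int8_t* y, bool* out, size_t n) {
  size_t i = 0;
  for (; i + 16 <= n; i += 16) {
    int8x16_t vx = vld1q_s8(x + i);
    int8x16_t vy = vld1q_s8(y + i);
    uint8x16_t mask = vcgeq_s8(vx, vy);        // 0xFF where x >= y
    uint8x16_t ones = vshrq_n_u8(mask, 7);     // 0xFF -> 0x01, 0x00 -> 0x00
    vst1q_u8(reinterpret_cast<uint8_t*>(out + i), ones);
  }
  for (; i < n; ++i) {                         // scalar tail
    out[i] = x[i] >= y[i];
  }
}
```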

Searching the canndev main repository, vectorization-related keywords appear only in TFPlugin, and the compilation options in CMakeLists.txt only enable O2 optimization. Since the compiler used to build the AICPU code is GCC, according to the GCC documentation O2 adds the following options on top of the O1 optimizations:

As Table 3 shows, these do not include any vectorization options. Therefore, automatic vectorization is enabled by adding -ftree-vectorize (which turns on both -ftree-loop-vectorize and -ftree-slp-vectorize) to CMakeLists.txt. The optimization results are as follows:

Figure 6. Single-thread time to compute 1K of data with vectorization

Looking at the results in Figure 6, single-threaded vectorization brings a significant performance improvement. We can also observe that, for fixed-point or floating-point types of the same signedness, the time consumption doubles as the data bit width doubles. This matches the fact that the vector registers of the SIMD extension have a fixed length (NEON vector registers are 128-bit), so the parallel threshold should not be designed in terms of element count but determined by the total data size in bytes.
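Independently of the timing figures, GCC's own vectorization report can confirm whether a given loop was vectorized once the flag is added. The commands below are a standalone local experiment under my own assumptions, not part of the canndev build, and the exact report depends on the GCC version and cost model:

```cpp
// greater_equal_vec_check.cpp -- standalone experiment, not canndev code.
//
// With plain -O2 (GCC versions before 12) the report stays empty for this loop,
// while adding -ftree-vectorize typically makes GCC report it as vectorized:
//
//   g++ -O2 -c greater_equal_vec_check.cpp -fopt-info-vec-optimized
//   g++ -O2 -ftree-vectorize -c greater_equal_vec_check.cpp -fopt-info-vec-optimized
#include <cstddef>
#include <cstdint>

void GreaterEqualInt32(const int32_t* x, const int32_t* y, bool* out, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    out[i] = x[i] >= y[i];
  }
}
```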

Figure 7. Time-consumption ratio for FP16 with and without temporary variables

I tried converting the half values in the Tensor to float and storing them in a temporary float array, but performance deteriorated: the overhead of the per-element conversion and assignment is much greater than the performance gained from vectorization.

Figure 8. Single-thread time-consumption ratio with and without vectorization

Figure 9. Multithreaded time-consumption ratio with and without vectorization

Figure 9 shows that, after vectorization, all C++ native data types outperform the TensorFlow operator.

Figure 10 shows that vectorization improves the operator's performance overall, but for some data types the 128K case performs worse than the non-vectorized parallel version. This is because the parallel threshold is set only by data size; a more fine-grained parallel threshold could be set per data type.

Figure 10. Time-consumption ratio with and without vectorization for broadcast operations (the Tensor to be broadcast has one element)

I also tested the special case of broadcasting a single element after vectorization. The compiler vectorizes this case correctly, because the single-element pointer is simply dereferenced instead of going through the broadcast machinery, which yields a significant performance improvement.

Unfortunately, when a real broadcast is needed, accessing elements requires calling the Bcast class's GetBroadcastXIndex and GetBroadcastYIndex methods to compute the post-broadcast address offsets. This involves considerable computation that the compiler cannot vectorize, and the overhead of creating temporary buffers and copying values far outweighs the gain from vectorization, so how to optimize this path remains to be studied.

Figure 11. Comparison of disassembly before and after -ftree-vectorize is enabled

As shown in Figure 11, with -ftree-vectorize enabled the compiler not only performs automatic SIMD optimization but also unrolls the loop, which reduces loop overhead, provides instruction-level parallelism, and improves instruction pipeline scheduling.

For float16, when running on the CPU, Eigen's half performs most calculations (except operator/) by converting to float and then converting the result back to half. The code snippet is as follows:

Figure 12. Function definition of the half data type operator>= in the Eigen library

This implementation involves two data type conversions and, because it does not use an ARM-native data type, it cannot be SIMD-optimized and is unfriendly to loop unrolling, so its actual efficiency is far lower than that of the other native data types.
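For readers who cannot see Figure 12, the pre-3.4 behaviour is essentially a compare-after-conversion. Paraphrased rather than quoted from the Eigen source, it amounts to:

```cpp
#include <Eigen/Core>

// Paraphrase of how Eigen < 3.4 compares half values (not the verbatim source):
// each operand is converted half -> float and the comparison runs on floats,
// so no half-precision SIMD instructions can be generated for this path.
inline bool HalfGreaterEqual(const Eigen::half& a, const Eigen::half& b) {
  return static_cast<float>(a) >= static_cast<float>(b);
}
```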

Figure 13. Disassembly code, GCC 11.1 on the left and Clang 9.0.0 on the right

Looking through the official Arm architecture documentation, I found that Armv8.2-A includes half-precision floating-point instructions, which eliminates the need to convert to and from single precision and yields higher-performance code. This means the AICPU can use the __fp16 data type for native half-precision floating-point calculation. For the greater-or-equal comparison, GCC <= 11.1 still converts to float before comparing, while Clang >= 9.0.0 generates the corresponding half-precision SIMD instructions.

However, __fp16 is an Arm C language extension. On x86-64 only storage is natively supported for FP16, and calculations must convert it to float; GCC 7.3 cannot compile it at all, while Clang can. To keep the code portable, this data type is not recommended.
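To illustrate the portability caveat, a direct __fp16 kernel like the sketch below only becomes half-precision SIMD code on a suitable AArch64 toolchain and target (for example when built with -march=armv8.2-a+fp16); on x86-64 or with older GCC it either falls back to float conversion or fails to compile, which is exactly why the type is not recommended here:

```cpp
#include <cstddef>

// __fp16 is an Arm C language extension; this sketch is AArch64-specific.
// Example build on AArch64: g++ -O2 -ftree-vectorize -march=armv8.2-a+fp16 ...
void GreaterEqualFp16(const __fp16* x, const __fp16* y, bool* out, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    out[i] = x[i] >= y[i];
  }
}
```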

Is there a solution with both high portability and high performance? While reading the Eigen changelog, I noticed that in version 3.4-rc1, released on 2021/04/19, Eigen::half is implemented with the ARM-native __fp16 on ARM targets, with improved vectorization support for all back ends and improved NEON scheduling for matrix computation.

Figure 14. Eigen update log

Figure 15. Eigen 3.4.0 Half.h: definition of Eigen::half when the architecture is ARM64

Figure 16. Disassembly of the add operator (left: __fp16 and Eigen 3.4.0 Eigen::half; right: Eigen 3.3.9 Eigen::half)

Looking at the disassembly in Figure 16, the compiler has successfully emitted FP16 SIMD instructions, and the code generated for Eigen::half is essentially the same as for __fp16. It is more efficient than the version that neither uses the SIMD instruction set nor enables native FP16, and the amount of data processed per loop iteration is higher: SIMD computes 8 FP16 values per instruction, while the non-SIMD code handles only 4 values per iteration even with loop unrolling, and its instruction count is much larger than the optimized version.

Since I am more familiar with PyTorch's source code than with TensorFlow's, PyTorch was chosen as the comparison object. PyTorch makes some manual SIMD optimizations, such as the Vectorized class and a series of common computation functions wrapped under the aten/src/ATen/cpu/vec directory. To some extent this avoids the readability loss caused by embedding SIMD functions in the implementation files. At the same time, it determines the target CPU architecture through a series of macro definitions and enables the SIMD functions of the corresponding architecture, further improving actual vectorization performance on top of automatic vectorization.
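The wrapper idea can be shown with a much-simplified sketch. This is not PyTorch's actual Vectorized API, only an illustration of the pattern: the architecture-specific intrinsics are hidden behind one small type chosen at compile time, so kernel code stays readable while still using SIMD where available.

```cpp
#include <cstddef>
#include <cstdint>

// Much-simplified illustration of the wrapper pattern (NOT PyTorch's API):
// intrinsics live behind one type selected at compile time.
#if defined(__aarch64__)
#include <arm_neon.h>
struct VecInt8 {
  static constexpr size_t kLanes = 16;
  int8x16_t v;
  static VecInt8 Load(const int8_t* p) { return {vld1q_s8(p)}; }
  // Writes kLanes 0/1 results starting at out.
  void GreaterEqual(const VecInt8& o, bool* out) const {
    uint8x16_t ones = vshrq_n_u8(vcgeq_s8(v, o.v), 7);
    vst1q_u8(reinterpret_cast<uint8_t*>(out), ones);
  }
};
#else
struct VecInt8 {                                  // scalar fallback
  static constexpr size_t kLanes = 1;
  int8_t v;
  static VecInt8 Load(const int8_t* p) { return {*p}; }
  void GreaterEqual(const VecInt8& o, bool* out) const { *out = v >= o.v; }
};
#endif
```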

Figure 17. PyTorch aten/src/ATen/cpu/vec/vec256 file directory

7. Limitations of vectorization

Is enabling vectorization a silver bullet, then? Of course not; vectorization has its limitations.

1. The vector registers of existing SIMD extensions have a fixed length. If the vector register is long but the number of loop iterations or of isomorphic statements in a basic block is small, the code cannot be vectorized.

2. Whether data addresses are contiguous and aligned has a great influence on SIMD execution efficiency. When the access address is not on an aligned boundary, additional shift and merge operations are needed to obtain vector data that satisfies the SIMD extension's requirements, which adds extra memory accesses and special operations. Because the logical addresses of Tensor data are aligned, this problem does not have much effect on element-wise operators.

3. Some programs can only be partially vectorized because the number of iterations is insufficient, or the vector-parallel statements in a basic block do not provide enough parallelism to fill the vector registers.

4. Manual vectorization, by embedding hand-written assembly or compiler intrinsics in the operator implementation, can in theory achieve the highest degree of vectorization, but because the SIMD extension instruction sets of different processors differ, it drastically reduces code portability and makes further optimization difficult. Automatic vectorization, for its part, still has limitations in how well it can optimize the code.

5. Loop unrolling causes some code bloat.

6. The floating-point calculations of ARM's NEON extension do not fully comply with the IEEE 754 standard; in particular, denormal values are flushed to zero. GCC therefore does not auto-vectorize some floating-point calculations to NEON code that it cannot prove safe, which further limits ARM's SIMD performance.

8. Summary and optimization suggestions

Conclusions

1. With the current compilation options of the canndev source repository, every data type shows a large performance gap relative to TensorFlow once the data scale exceeds 4K, and int8 and uint8 take abnormally long, probably because they are computed as 16-bit values. Both canndev and TensorFlow use the Eigen library's half for float16; the gap there is the smallest of all data types, but the ratio is still as high as 1.3x.

2. The GreaterEqual operator currently in the canndev source repository does not enable multiple cores and does not specialize the case where no broadcasting is needed, so its performance there is far lower than the self-developed operator. When broadcasting of a non-single-element Tensor is involved, the Eigen library's broadcast outperforms canndev's Bcast, so the repository's GreaterEqual operator is faster than the self-developed operator at small data volumes; however, as the data volume grows, the self-developed operator overtakes it once multi-core processing kicks in.

3. The self-developed operator was designed with reference to the Less operator in the source repository, and the calculation logic of the two is basically the same. However, the Less operator's parallel threshold is too low, causing an obvious time-consumption spike for all data types at the 8K data scale; raising the threshold improves the situation.

4. The current compilation options of the canndev master repository do not enable automatic vectorization. After enabling it, the performance of code that can be vectorized correctly improves greatly, and calculation accuracy does not change noticeably as long as the -funsafe-math-optimizations option is not enabled.

5. Operator code vectorization was explored from the perspective of assembly instructions. The half data type of Eigen < 3.4 is not implemented with the ARM-native __fp16, so it cannot be vectorized; with Eigen 3.4 or __fp16, the FP16 SIMD instructions are emitted correctly and performance improves greatly.

Optimization Suggestions

1. Optimize the parallel threshold of the Less operator so that the parallel speedup ratio at the critical data volume is as close as possible to 1.

2. Enable the compiler's automatic vectorization option -ftree-vectorize to improve per-cycle CPU computation efficiency.

3. Upgrade Eigen to version 3.4 or later and, when cross-compiling, specify the corresponding ARM architecture and enable FP16 support (for example -march=armv8.2-a+fp16). This enables native FP16 support on the ARM platform, allowing the compiler to perform SIMD optimization and loop unrolling and improving Eigen::half performance on the ARM architecture.

4. Optimize the implementation logic of Bcast. The current version relies on operator developers to manually determine whether broadcasting is needed and to hand-implement three special cases (no broadcast, X is a single element, Y is a single element), which fills operator implementations with redundant code. Decisions such as whether to broadcast should be abstracted away, with elements accessed through a unified interface; see the sketch after this list.

5. Optimize how Bcast obtains element indices in the broadcast case. The performance of the repository's current Bcast is far lower than that of TensorFlow and of Eigen's broadcast, and the implementation of the GetBroadcastXIndex method is unfriendly to compiler optimization.
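A possible shape for suggestion 4, purely as a hypothetical sketch and not existing canndev code: the broadcast decision is made once when an index mapper is built, and the kernel always accesses elements through the same mapping calls.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical unified element-access interface (not existing canndev code):
// the broadcast decision is made once when the mapper is built, so the kernel
// body no longer contains per-case branches.
struct IndexMapper {
  std::function<size_t(size_t)> x_index;   // output index -> input1 index
  std::function<size_t(size_t)> y_index;   // output index -> input2 index
};

// The three special cases named above, expressed as mappers.
inline IndexMapper NoBroadcast() { return {[](size_t i) { return i; }, [](size_t i) { return i; }}; }
inline IndexMapper ScalarX()     { return {[](size_t)   { return size_t{0}; }, [](size_t i) { return i; }}; }
inline IndexMapper ScalarY()     { return {[](size_t i) { return i; }, [](size_t)   { return size_t{0}; }}; }

template <typename T>
void GreaterEqualUnified(const T* x, const T* y, bool* out, size_t n, const IndexMapper& m) {
  for (size_t i = 0; i < n; ++i) {
    out[i] = x[m.x_index(i)] >= y[m.y_index(i)];
  }
}
```

In a real implementation the mapper would be a template parameter or inlined functor rather than std::function, so that the per-element indirection does not block vectorization.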

9. Conclusion

This paper is only one CANN operator developer's preliminary analysis of AICPU operator time consumption and exploration of optimization schemes. The analysis and optimization ideas are fairly rough; I welcome corrections from Huawei's experts wherever something is inappropriate, and I hope to have the opportunity to discuss and exchange optimization ideas with the relevant experts.
