
Please follow my official account [Jizhi Vision] for more notes to share

Hi, I’m Jizhi Vision. This article focuses on the principles of TVM Ansor.

Reference paper:

Ansor: Generating High-Performance Tensor Programs for Deep Learning (OSDI 2020)

Reference code:

github.com/apache/tvm

1. Deep learning system stack

The deep learning system stack has various AI frameworks (PyTorch, TF, MXNet, Caffe, …) at the front end and various hardware (NV GPU, ARM CPU, x86 CPU, NPU, …) at the back end.

The job of a deep learning compiler is to take a network from a front-end AI framework, run it through code generation, and compile it into high-performance machine code for the target hardware back end. TVM/Halide split the compilation process into two parts, compute definition and schedule optimization, for a simple reason: after decoupling, only the schedule needs attention, because the schedule is strongly tied to the hardware, and there is no need to worry about the schedule affecting the result. For a dense_layer + relu subgraph, the compute representations of Halide and TVM might look like this:

```
# halide
dense(o, b) += data(i, b) * weight(i, o)
relu(o, b) = max(dense(o, b), 0.0)

# tvm
dense = compute(shape, lambda b, o: sum(data[b, i] * weight[o, i], i))
relu = compute(shape, lambda b, o: max(dense[b, o], 0.0))
```
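To make the compute definitions concrete, here is a NumPy sketch (my own illustration, not TVM or Halide code) of the numerical result both representations describe:

```python
import numpy as np

def dense_relu(data, weight):
    # dense[b, o] = sum_i data[b, i] * weight[o, i]
    dense = data @ weight.T
    # relu[b, o] = max(dense[b, o], 0.0)
    return np.maximum(dense, 0.0)

data = np.array([[1.0, -2.0], [0.5, 4.0]])    # shape (batch, in_dim)
weight = np.array([[3.0, 1.0], [-1.0, 2.0]])  # shape (out_dim, in_dim)
out = dense_relu(data, weight)
```

Whatever schedule (tiling, vectorization, parallelization) is later applied, the output must stay numerically identical to this reference; that is why the schedule can be searched freely.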

2. Related work on high-performance tensor program generation

2.1 AutoTVM

Its approach is template-guided search: a hand-written template defines the search space for each op.

Its disadvantages are:

(1) Not fully automatic: a lot of manual template-writing work is needed;

(2) Limited search space: the templates cap the achievable performance;

2.2 Halide auto-scheduler

Its approach is sequential construction: it uses beam search to generate the program step by step. Beam search improves on greedy search by keeping the top-k candidates at each step instead of only the best one, while still avoiding the exponential cost of exhaustive search; it is a compromise between the two.
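For intuition, here is a minimal beam-search sketch on a hypothetical toy problem (my own illustration, not Halide's implementation): with k = 1 it degenerates to greedy search, while keeping every candidate would be exhaustive search.

```python
def beam_search(steps, choices, score, k=2):
    """Build a sequence step by step, keeping only the top-k partial candidates."""
    beam = [()]  # partial sequences, represented as tuples of choices
    for _ in range(steps):
        # expand every kept candidate by every possible next choice
        candidates = [seq + (c,) for seq in beam for c in choices]
        # keep only the k best-scoring partial candidates
        beam = sorted(candidates, key=score, reverse=True)[:k]
    return beam[0]

# toy score: prefer larger values, so the best length-3 sequence is (3, 3, 3)
best = beam_search(steps=3, choices=[1, 2, 3], score=sum, k=2)
```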

The disadvantages of Halide Auto-Scheduler are:

(1) Because the program is generated sequentially, the partially built program mid-way through construction is incomplete, so the cost model cannot predict its performance accurately, and errors accumulate;

(2) Like AutoTVM, its search space is also limited;

The discussion of these two schedule-search approaches surfaces three pain points: (1) how to automatically build a large search space; (2) how to make the search more efficient; (3) how to allocate resources across many search tasks. Ansor's answers are: for (1), a hierarchical search space; for (2), sampling complete programs and fine-tuning them; for (3), a task scheduler that prioritizes important tasks.

3. TVM Ansor principles

Take a look at TVM Ansor’s logical stack first, and then discuss the main operations.

The front end takes models exported from the AI frameworks and partitions them into many subgraphs; each subgraph then goes through the scheduling optimization described below.

3.1 Program sampling

The goal of program sampling is to automatically construct a large, continuous search space. As for what is being searched: my understanding is that it is the way the computation is scheduled, not the computation itself. The method combines sketch generation with random annotation: a sketch is one of a few good high-level structures, while annotation fills in the billions of low-level details. Together, these two steps build up a large search space.

This addresses pain point (1), the size of the search space.

3.1.1 Sketch generation

First post some descriptions from the paper:

The general process is as follows: sketches are generated according to designed derivation rules (for the CPU, for example, the "SSRSRS" multi-level tile structure, where "S" stands for a space loop level and "R" for a reduction loop level). The whole sketch-generation process can be seen in the dense_layer + relu example below. Note that the generated sketch contains many parameters such as TILE_I0; these are the hyperparameters whose values still need to be tuned, which is part of random annotation's job. Usually more than one sketch is generated, so the result is typically a list of sketches.
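As a rough illustration of the idea (my own simplification, not Ansor's actual rules), a sketch can be thought of as a loop structure with symbolic tile-size parameters left open:

```python
def generate_sketch(space_axes, reduce_axes):
    """Expand a loop nest into a multi-level tile structure in "SSRSRS" order:
    space, space, reduction, space, reduction, space tile levels.
    Tile sizes (TILE_*) are left symbolic for the later annotation step."""
    levels = ["S0", "S1", "R0", "S2", "R1", "S3"]
    loops = []
    for level in levels:
        axes = space_axes if level.startswith("S") else reduce_axes
        for ax in axes:
            loops.append(f"{ax}.{level}(TILE_{ax}_{level})")
    return loops

# dense_layer: space axes b, o and reduction axis i
sketch = generate_sketch(space_axes=["b", "o"], reduce_axes=["i"])
```

The point is that the structure (which loops, in what tile order) is decided here, while every `TILE_*` value remains a free parameter of the search space.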

And the same goes for this example:

3.1.2 Random annotation

Annotation is the follow-up step after obtaining the sketches: it takes a sketch, randomly fills in tile sizes, parallelizes outer loops, vectorizes inner loops, and randomly changes computation locations, fine-tuning the tile structure. Again using the dense_layer + relu example above:
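A toy sketch of the tile-size part of annotation (my own illustration; `sample_tile_sizes` is a hypothetical helper): sizes are drawn randomly, but so that their product equals the loop extent, keeping the tiled loops equivalent to the original loop:

```python
import random

def sample_tile_sizes(extent, num_levels, rng):
    """Randomly split a loop extent into num_levels tile sizes whose
    product equals the extent, so the tiling covers exactly the
    original iteration space."""
    sizes = []
    remaining = extent
    for _ in range(num_levels - 1):
        divisors = [d for d in range(1, remaining + 1) if remaining % d == 0]
        s = rng.choice(divisors)
        sizes.append(s)
        remaining //= s
    sizes.append(remaining)  # last level takes whatever is left
    return sizes

rng = random.Random(0)
tiles = sample_tile_sizes(extent=512, num_levels=4, rng=rng)
```

Each random draw like this yields one complete program, which is what lets the cost model score whole programs rather than partial ones.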

3.2 Performance fine-tuning

After program sampling, the next stage is the performance tuner. It covers the design of the search algorithm and the cost model, allowing Ansor to search and converge more efficiently.

This addresses pain point (2), search efficiency.

3.2.1 Evolutionary search

Programs produced by random annotation cannot guarantee good performance. Ansor therefore applies an evolutionary search, similar to a genetic algorithm, to further evolve the annotated sample programs. It mainly involves mutation and crossover.

(1) Mutation: randomly change tile sizes; randomly change parallel/unroll/vectorize factors and granularity; randomly change computation locations;

(2) Crossover: explained from the perspective of the operators' directed acyclic computation graph, as follows:
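These two operations can be sketched as a minimal evolutionary loop (my own toy illustration, not Ansor's code; a toy score function stands in for the learned cost model, and candidates are simplified to tuples of integer "genes"):

```python
import random

def evolve(score, rng, genes=4, pop_size=8, generations=20):
    """Toy evolutionary search: crossover splices two parents at a random
    cut point, mutation re-rolls one gene, and the best-scoring
    candidates survive to the next generation."""
    pop = [tuple(rng.randint(0, 5) for _ in range(genes))
           for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            a, b = rng.sample(pop, 2)
            cut = rng.randrange(1, genes)       # crossover point
            child = list(a[:cut] + b[cut:])
            child[rng.randrange(genes)] = rng.randint(0, 5)  # mutation
            children.append(tuple(child))
        # selection: keep the pop_size best candidates overall
        pop = sorted(set(pop + children), key=score, reverse=True)[:pop_size]
    return pop[0]

rng = random.Random(0)
best = evolve(score=sum, rng=rng)  # toy score: larger genes are better
```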

3.2.2 Learned cost model

Predicting a score for each innermost non-loop statement may sound abstract; here is an example from the paper:

Statements B and C are the innermost non-loop statements.

For each innermost non-loop statement, features such as the cache lines used, memory touched, reuse distance, and arithmetic intensity are extracted.
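The shape of the prediction can be sketched as follows (my own toy simplification: a linear model stands in for Ansor's learned model, and the feature values are made up). The program's predicted score is the sum of the per-statement scores:

```python
def predict_program_score(statement_features, weights):
    """Score each innermost non-loop statement from its feature vector,
    then sum the statement scores into one program score."""
    def score(features):
        # a linear model stands in for the learned model here
        return sum(w * f for w, f in zip(weights, features))
    return sum(score(f) for f in statement_features)

# hypothetical features per statement:
# [cache_lines_used, memory_bytes, reuse_distance, arithmetic_intensity]
features = [
    [4.0, 256.0, 8.0, 2.0],  # statement B
    [2.0, 128.0, 4.0, 1.0],  # statement C
]
weights = [0.5, -0.01, 0.1, 2.0]  # hypothetical learned weights
score = predict_program_score(features, weights)
```

Because the decomposition is per statement, the model can score any complete program regardless of how many statements its schedule produced.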

3.3 Task scheduler

The task scheduler in Ansor prioritizes important tasks.

This addresses pain point (3), resource allocation across many search tasks.

Take ResNet-50, which Ansor slices into 29 distinct subgraphs. With the traditional task-scheduling method, the subgraphs would be tuned sequentially, like this:

The task scheduler in Ansor instead slices time and prioritizes important subgraphs, like this, much like pipelining optimizations in inference:
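The time-slicing idea can be sketched as follows (a toy illustration with made-up latencies, not Ansor's implementation; the `improvement` function stands in for Ansor's priority estimate): each round, the next slice of tuning trials goes to whichever subgraph currently promises the largest latency improvement.

```python
def schedule_rounds(latencies, improvement, rounds, trials_per_round):
    """Toy time-sliced task scheduler: every round, give the next batch of
    tuning trials to the task with the largest estimated improvement."""
    allocated = [0] * len(latencies)
    for _ in range(rounds):
        task = max(range(len(latencies)),
                   key=lambda t: improvement(t, latencies[t]))
        latencies[task] -= improvement(task, latencies[task])
        allocated[task] += trials_per_round
    return latencies, allocated

# three hypothetical subgraphs; assume each slice cuts 10% off a task's latency
latencies, allocated = schedule_rounds(
    latencies=[10.0, 4.0, 1.0],
    improvement=lambda t, lat: 0.1 * lat,
    rounds=5, trials_per_round=64)
```

Here all five slices go to the dominant 10 ms subgraph, while the cheap subgraphs get none: tuning effort concentrates where it moves end-to-end latency the most.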

4. Performance data

Here are some performance data for the tests, also in the paper:

(1) single operator

The test platform was an Intel Xeon Platinum 8124M (18 cores):

(2) the subgraph

The test platforms are an Intel CPU (8124M), marked “@C”, and an NVIDIA GPU (V100), marked “@G”.

(3) the whole network

The test platforms are an Intel CPU (8124M), an NVIDIA GPU (V100), and an ARM CPU (A53).

As the data above shows, Ansor performs well on single operators, subgraphs, and whole networks.

The above is my current understanding of TVM Ansor. I hope this sharing helps you a little.

