
1. Introduction

With the development of IT infrastructure, modern data processing systems need to handle ever more data and support ever more complex algorithms. Growing data volumes and algorithmic complexity pose severe performance challenges for data analysis systems. In recent years, many performance optimization techniques have emerged in the fields of databases, big data systems, and AI platforms, ranging from architecture to compilation to high-performance computing. As a representative compiler optimization technique, this paper mainly introduces code generation (Codegen for short) based on LLVM.

LLVM is a popular open source compiler framework that supports a variety of languages and underlying hardware. Developers can build their own compilation framework on top of LLVM and extend it to compile different languages or logic into executables that run on various hardware. For Codegen, the two things to focus on are the format of LLVM IR and the APIs that generate it. In the rest of this paper, we first introduce LLVM IR, then the principles and usage scenarios of Codegen, and finally the typical applications of Codegen in AnalyticDB PostgreSQL, the cloud-native data warehouse developed by Alibaba Cloud.

2. Introduction to LLVM IR and getting started

IR is a very important link in compiler theory and practice. The full name of IR is Intermediate Representation. For a compiler, there are many steps (passes) from the high-level language at the top of the abstraction down to the underlying assembly, passing through different forms of representation. There are many kinds of compiler optimization techniques, each playing its role at a different stage of the compilation process. But IR is a clear watershed: optimizations above the IR do not need to care about the details of the underlying hardware, such as its instruction set or register file size, while optimizations below the IR must deal with the hardware. LLVM is best known for its IR design. Thanks to this clever design, LLVM can support different languages above and different hardware below, and different languages can reuse the optimization algorithms of the IR layer.

The figure above shows the framework of LLVM. LLVM divides the whole compilation process into three phases: (1) the front end, which converts a high-level language to IR; (2) the middle end, which optimizes at the IR layer; (3) the back end, which converts IR into the assembly language of the target hardware platform. This makes LLVM scale very well. For example, if you want to implement a new language called toyc that runs on the ARM platform, you only need to implement a toyc -> LLVM IR front end and reuse the rest of LLVM's modules. Or, if you want to support a new hardware platform, you only need to handle the LLVM IR -> new hardware stage, and the hardware can then run the many languages LLVM already supports. Therefore, the IR is LLVM's most competitive asset and the centerpiece of learning LLVM Codegen.

2.1 Basic knowledge of LLVM IR

The IR format of LLVM is very similar to assembly. For those who have learned assembly language, learning to program in LLVM IR is easy. For those who haven't, don't worry: it is not that difficult. What is hard about assembly is not learning it but the engineering, because the difficulty of developing in assembly grows exponentially with the complexity of the project. Next we need to understand the three most important parts of the IR: the instruction format, Basic Blocks & CFG, and SSA. For the complete LLVM IR reference, see https://llvm.org/docs/LangRef… .

Instruction format. LLVM IR provides a three-address-code instruction format similar to assembly language. The code snippet below is a very simple function implemented in LLVM IR: it takes five integers of type i32 (int32) and returns their sum. LLVM IR supports some basic data types, such as i8, i32, and floating-point types. Variables in LLVM IR are named with a leading "%". By default, %0 is the first argument of the function, %1 is the second, and so on. Machine-generated variables are usually numbered, but handwritten IR can use whatever naming you prefer. An LLVM IR instruction consists of an operator, a type, inputs, and a return value. For example, "%6 = add i32 %0, %1" has the operator "add", the type "i32", the inputs "%0" and "%1", and the return value "%6". In general, the IR supports a set of basic instructions from which the compiler composes complex operations. For example, an expression written in C as "A * B + C" is implemented in LLVM IR with a multiplication and an addition instruction, plus possibly some type-conversion instructions.

define i32 @ir_add(i32, i32, i32, i32, i32){
  %6 = add i32 %0, %1
  %7 = add i32 %6, %2
  %8 = add i32 %7, %3
  %9 = add i32 %8, %4
  ret i32 %9
}
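
As a concrete illustration of the "A * B + C" example above, here is a minimal sketch (the function name @ir_madd and the named arguments are our own, and all operands are assumed to already be i32, so no conversion instructions are needed):

define i32 @ir_madd(i32 %a, i32 %b, i32 %c) {
  %mul = mul i32 %a, %b        ; A * B
  %sum = add i32 %mul, %c      ; (A * B) + C
  ret i32 %sum
}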

Basic Block & CFG. Now that we know the instruction format of the IR, we need to understand two concepts: the Basic Block (BB for short) and the Control Flow Graph (CFG). The figure below (left) shows a simple C function; the middle shows the corresponding LLVM IR compiled with Clang; the right shows the CFG drawn with Graphviz. With this figure in mind, we explain the concepts of the Basic Block and the CFG.

Every high-level language we work with has plenty of branch-and-jump statements; in C, keywords such as for, while, and if all express branch jumps, which developers use to implement different logic. Assembly language usually implements such logic with conditional and unconditional jump instructions, and LLVM IR is the same. For example, in LLVM IR, "br label %7" jumps to the label named %7 unconditionally. "br i1 %10, label %11, label %22" is a conditional jump: it jumps to label %11 if %10 is true and to label %22 otherwise.

After understanding jump instructions, we can introduce the Basic Block. A Basic Block is a sequence of instructions executed in order, containing no jump instructions except for the last one. The first instruction at the entry of a Basic Block is called the leading instruction. Every Basic Block except the first has a name (label); the first one may have a label too, but it is sometimes unnecessary. For example, the code in the figure contains five Basic Blocks. The Basic Block concept solves the problem of organizing control logic: with Basic Blocks, we can divide code into blocks, and during compiler optimization some passes operate on a single Basic Block while others span multiple Basic Blocks. A minimal hand-written sketch follows below.
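
Since the original figure is not reproduced here, the sketch below (function and label names are our own) shows four Basic Blocks connected by one conditional and two unconditional br instructions; because phi nodes have not been introduced yet, the two branches communicate through a stack slot:

define i32 @bb_demo(i32 %n) {
entry:
  %result = alloca i32, align 4
  %cmp = icmp sgt i32 %n, 0            ; is n > 0 ?
  br i1 %cmp, label %pos, label %neg
pos:                                   ; reached only when n > 0
  store i32 1, i32* %result, align 4
  br label %done
neg:                                   ; reached when n <= 0
  store i32 -1, i32* %result, align 4
  br label %done
done:                                  ; both branches converge here
  %sign = load i32, i32* %result, align 4
  ret i32 %sign
}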

The CFG (Control Flow Graph) is simply the graph formed by Basic Blocks and the jump relationships between them. In the code shown in the figure above, there are five Basic Blocks in total, and the arrows indicate the jumps between them; together they constitute a CFG. If a Basic Block has only one outgoing arrow to another block, its final jump is unconditional; otherwise it is conditional. The CFG is a relatively simple and very basic concept in compiler theory. One step beyond the CFG is the DFG (Data Flow Graph), on which many advanced compiler optimization algorithms are based. For CodeGen development with LLVM, understanding the CFG is sufficient.

SSA. SSA stands for Static Single Assignment, a very fundamental concept in compiler technology. SSA is something you must become familiar with when learning LLVM IR, and it is also the hardest concept to understand. Careful readers will have noticed that in the IR code listed above, every "variable" is assigned only once; that is the core idea of SSA. From the compiler's point of view, the compiler does not care about "variables"; it is designed around "data". Every write to a "variable" produces a new version of the data, and it is around these versions that the compiler optimizes. Let's use the following C code to explain this idea.

The image above (left) shows a simple piece of C code; the image above (right) is its SSA form, i.e. "the code as the compiler sees it." In C, data is stored in variables, so data manipulation centers on variables: developers care about a variable's lifetime, when it is assigned, and when it is used. The compiler, however, only cares about where the data flows, so every assignment produces a new left-hand value. For example, the code on the left has only one "a", while the code on the right has four variables, because the data in "a" has four versions: one created by each assignment, plus one created by the final phi node. That is, high-level languages are variable-centric, whereas SSA form is data-centric: each assignment in SSA creates a new version of the data. When writing IR, always keep in mind that IR variables are unlike high-level-language variables: an IR variable represents one version of the data. The phi node is an important concept in SSA. In this example, the value of a_4 depends on which branch executed before: if the first branch ran, then a_4 = a_1, and so on. A phi node selects the appropriate version of the data according to which Basic Block control jumped from. LLVM IR requires developers to write phi nodes themselves; around loops and conditional branches, many phi nodes often have to be written by hand, and that is the logically tricky part of writing LLVM IR.
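
To make the phi node concrete, here is a minimal hand-written sketch (function, label, and value names are our own, with %a_1, %a_2, and %a_4 echoing the versions discussed above). The value of %a_4 is %a_1 if control arrives from %br1 and %a_2 if it arrives from %br2:

define i32 @ssa_phi_demo(i32 %cond) {
entry:
  %flag = icmp ne i32 %cond, 0
  br i1 %flag, label %br1, label %br2
br1:
  %a_1 = add i32 %cond, 10
  br label %merge
br2:
  %a_2 = sub i32 0, %cond
  br label %merge
merge:
  ; pick the version of the data by the predecessor block
  %a_4 = phi i32 [ %a_1, %br1 ], [ %a_2, %br2 ]
  ret i32 %a_4
}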

2.2 Learning to write programs in LLVM IR

The best way to get familiar with LLVM IR is to write a few programs in it. Before starting, it is recommended to spend half an hour to an hour skimming the official manual (https://llvm.org/docs/LangRef…) to get familiar with the kinds of instructions available. Next, let's walk through the whole process of LLVM IR programming with two simple cases. The following is a loop-accumulation function. It contains three Basic Blocks: loop, loop_body, and final. loop is the beginning of the whole function, loop_body is the body of the loop, and final is the end of the function. In lines 5 and 6, we use phi nodes to implement the accumulator and the loop variable.

define i32 @ir_loopadd_phi(i32*, i32){
  br label %loop
      
loop:
  %i = phi i32 [0, %2], [%newi, %loop_body]
  %res = phi i32 [0, %2], [%new_res, %loop_body]
  %break_flag = icmp sge i32 %i, %1
  br i1 %break_flag, label %final, label %loop_body 
      
loop_body:
  %addr = getelementptr inbounds i32, i32* %0, i32 %i
  %val = load i32, i32* %addr, align 4
  %new_res = add i32 %res, %val
  %newi = add i32 %i, 1
  br label %loop
final:
  ret i32 %res
}

Here is a fragment of an array bubble-sort function. The function contains two nested loops. Implementing even a single loop in LLVM IR is complicated, and nesting two loops is more complicated still. If you can implement bubble sort in handwritten LLVM IR, you have basically understood the whole logic of LLVM IR.

define void @ir_bubble(i32*, i32) {
  %r_flag_addr = alloca i32, align 4
  %j = alloca i32, align 4
  %r_flag_ini = add i32 %1, -1
  store i32 %r_flag_ini, i32* %r_flag_addr, align 4
  br label %out_loop_head
out_loop_head:                                ; check break
  store i32 0, i32* %j, align 4
  %tmp_r_flag = load i32, i32* %r_flag_addr, align 4
  %out_break_flag = icmp sle i32 %tmp_r_flag, 0
  br i1 %out_break_flag, label %final, label %in_loop_head
in_loop_head:                                 ; check break
  %tmpj_1 = load i32, i32* %j, align 4
  %in_break_flag = icmp sge i32 %tmpj_1, %tmp_r_flag
  br i1 %in_break_flag, label %out_loop_tail, label %in_loop_body
in_loop_body:                                 ; read & swap
  %tmpj_left = load i32, i32* %j, align 4
  %tmpj_right = add i32 %tmpj_left, 1
  %left_addr = getelementptr inbounds i32, i32* %0, i32 %tmpj_left
  %right_addr = getelementptr inbounds i32, i32* %0, i32 %tmpj_right
  %left_val = load i32, i32* %left_addr, align 4
  %right_val = load i32, i32* %right_addr, align 4
  ; swap check
  %swap_flag = icmp sge i32 %left_val, %right_val
  %left_res = select i1 %swap_flag, i32 %right_val, i32 %left_val
  %right_res = select i1 %swap_flag, i32 %left_val, i32 %right_val
  store i32 %left_res, i32* %left_addr, align 4
  store i32 %right_res, i32* %right_addr, align 4
  br label %in_loop_end
in_loop_end:                                  ; update j
  %tmpj_2 = load i32, i32* %j, align 4
  %newj = add i32 %tmpj_2, 1
  store i32 %newj, i32* %j, align 4
  br label %in_loop_head
out_loop_tail:                                ; update r_flag
  %tmp_r_flag_1 = load i32, i32* %r_flag_addr, align 4
  %new_r_flag = sub i32 %tmp_r_flag_1, 1
  store i32 %new_r_flag, i32* %r_flag_addr, align 4
  br label %out_loop_head
final:
  ret void
}

If we compile the above LLVM IR into an object file with Clang and link it with a program written in C, the C program can call these functions normally. The cases above only use basic data types such as i32 and i64, but LLVM IR also supports advanced data types such as structs, with which more complex functionality can be implemented.
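
To give a flavor of those advanced types, below is a minimal hand-written sketch (the %pair type and the @sum_pair function are our own invention) that loads and sums the two i32 fields of a struct:

%pair = type { i32, i32 }

define i32 @sum_pair(%pair* %p) {
  %x_addr = getelementptr inbounds %pair, %pair* %p, i32 0, i32 0   ; address of field 0
  %y_addr = getelementptr inbounds %pair, %pair* %p, i32 0, i32 1   ; address of field 1
  %x = load i32, i32* %x_addr, align 4
  %y = load i32, i32* %y_addr, align 4
  %sum = add i32 %x, %y
  ret i32 %sum
}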

2.3 Implementing Codegen with the LLVM API

A compiler is essentially a program that calls various APIs to generate code from its input, and LLVM Codegen is no exception. Inside LLVM, a function is a class, a Basic Block is a class, an instruction is a class, and a variable is a class. Implementing CodeGen with the LLVM API means using LLVM's internal data structures to assemble the required IR.

// Builder is an IRBuilder<>; fooFunc, createArith, createBB, createIfElse,
// ValList and BBList are helpers defined elsewhere in this example.
Value *constant = Builder.getInt32(16);
Value *Arg1 = fooFunc->arg_begin();                 // first argument of the function
Value *val = createArith(Builder, Arg1, constant);  // arithmetic on Arg1 and 16
Value *val2 = Builder.getInt32(100);
Value *Compare = Builder.CreateICmpULT(val, val2, "cmptmp");
Value *Condition = Builder.CreateICmpNE(Compare, Builder.getInt1(0), "ifcond");
ValList VL;
VL.push_back(Condition);
VL.push_back(Arg1);
// Create the three basic blocks of the if/else and wire them together
BasicBlock *ThenBB = createBB(fooFunc, "then");
BasicBlock *ElseBB = createBB(fooFunc, "else");
BasicBlock *MergeBB = createBB(fooFunc, "ifcont");
BBList List;
List.push_back(ThenBB);
List.push_back(ElseBB);
List.push_back(MergeBB);
Value *v = createIfElse(Builder, List, VL);

Above is an example of implementing CodeGen with the LLVM API. It is essentially writing IR in C++: once you know how to write IR, you only need to get familiar with the API. The API provides the basic data structures — instructions, functions, basic blocks, the LLVM builder, and so on — and we simply call the corresponding interfaces to generate these objects. In general, we start with the function prototype, including the function name, parameter list, and return type. Then, according to what the function does, we determine which Basic Blocks are needed and the jump relationships between them, and generate the (still empty) Basic Blocks. Finally, we fill each Basic Block with instructions in order. Logically, this process is just like writing the program directly in LLVM IR.
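
For orientation, the sketch below shows one plausible shape of the IR such a sequence could produce. The bodies of the helper functions createArith and createIfElse are not shown in the fragment above, so the arithmetic operation and the branch bodies here are assumptions chosen purely for illustration:

define i32 @foo(i32 %arg) {
entry:
  %val = mul i32 %arg, 16                 ; createArith (operation assumed)
  %cmptmp = icmp ult i32 %val, 100        ; CreateICmpULT
  %ifcond = icmp ne i1 %cmptmp, false     ; CreateICmpNE
  br i1 %ifcond, label %then, label %else
then:                                     ; branch body assumed for illustration
  %thenval = add i32 %arg, 1
  br label %ifcont
else:                                     ; branch body assumed for illustration
  %elseval = add i32 %arg, 2
  br label %ifcont
ifcont:
  %retval = phi i32 [ %thenval, %then ], [ %elseval, %else ]
  ret i32 %retval
}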

3. Codegen technical analysis

If we use the method described above to generate some simple functions and write the corresponding versions in C for performance comparison, we will find that LLVM IR is not faster than C. On the one hand, what the machine ultimately executes is assembly, and C is very close to assembly; programmers who understand the lower layers can often predict what assembly their C code will produce. On the other hand, modern compilers apply many optimizations, some quite aggressive, which greatly lighten the programmer's optimization burden. Therefore, doing Codegen with LLVM IR does not yield better performance than handwritten C, and LLVM Codegen has some obvious drawbacks. To really use LLVM well, we also need to be familiar with its characteristics.

3.1 Analysis of disadvantages

Disadvantage 1: difficult development. In practice, almost no project uses assembly as its main development language, because the development cost is simply too high; interested readers can try writing a quicksort in assembly to get a feel for it. Even in foundational software such as databases and operating systems, assembly is used only in a few places. Developing in LLVM IR runs into the same problem. For example, the most complex example above is the bubble sort: a developer can write bubble sort in C within minutes, but writing it in LLVM IR may take an hour. In addition, LLVM IR handles complex data structures, such as structs and classes, poorly; adding a complex data structure beyond the basic types is very difficult in LLVM IR. Therefore, in real-world development, adopting Codegen raises the development cost exponentially.

Disadvantage 2: difficult debugging. Developers usually debug code by single-stepping through it, but LLVM IR does not support that. Once the code goes wrong, you can only pore over the LLVM IR by hand, again and again. If you know assembly, you can single-step through the generated assembly instead; however, the mapping between assembly and IR is not straightforward, so this only reduces the difficulty of debugging to a degree — it does not solve the problem.

Disadvantage 3: runtime cost. Generating LLVM IR is usually fast, but the generated IR must then be optimized with LLVM's tooling and compiled into a binary, which takes time (think of GCC's compilation speed). During database development, our rule of thumb was that each function costs roughly 10ms-100ms of CodeGen time, most of it spent optimizing the IR and lowering IR to assembly.

3.2 Application scenarios

Knowing the weaknesses of LLVM Codegen, we can analyze its strengths and choose suitable scenarios. The following are scenarios that our team, during development, found appropriate for LLVM CodeGen.

Scenario 1: Java, Python, and similar languages. As mentioned above, LLVM IR will not be faster than C, but it can be faster than Java or Python. For example, Java programs sometimes call C functions through JNI to improve performance; in the same way, Java can call functions generated via LLVM IR for a speedup.

Scenario 2: hardware and language incompatibility. LLVM supports a variety of back ends, such as x86, ARM, and GPUs. For scenarios where the hardware and the language are otherwise incompatible, LLVM can bridge the gap. For example, if our system is developed in Java and wants to use a GPU, we can consider generating GPU code with LLVM IR and calling it through JNI. This scheme supports not only NVIDIA GPUs but also AMD GPUs, and the same generated IR can also be executed on the CPU.

Scenario 3: logic simplification. Take databases as an example: the execution engine must make many judgments about data types and algorithm logic during execution. This is mainly because much of the data typing and logic in a SQL statement cannot be determined when the database is developed, only at run time. This mode of processing is called "interpreted execution." With LLVM we can generate code at run time; since the data types and logic are known by then, the unnecessary judgments can be removed from the LLVM IR, which yields the performance improvement.

4. Application of LLVM in databases

Within the database, our team uses LLVM to process expressions. In what follows, we compare stock PostgreSQL with the cloud-native data warehouse AnalyticDB PostgreSQL to explain how LLVM is applied.

To interpret expressions, PostgreSQL uses a "function stitching" scheme. PostgreSQL implements a large number of C functions — addition and subtraction, size comparisons, and so on — for the various data types. During the planning phase, the corresponding functions are selected according to the expression's operators and the data types involved, and their pointers are saved; at execution time they are called in turn. So for a filter condition like "a > 10 and b < 5", assuming a and b are both int32, PostgreSQL actually calls a combination of functions such as "Int8AndOp(Int32GT(a, 10), Int32LT(b, 5))", like stacking building blocks. This scheme has two obvious performance problems. On the one hand, it leads to a relatively large number of function calls, and function calls themselves have a cost. On the other hand, it must implement a unified function interface, which requires type conversions both inside and outside each function — additional overhead again. AnalyticDB PostgreSQL instead uses LLVM for codegen, which can generate minimal code. Once the SQL is issued, the database knows the expression's operators and the types of its inputs, so it only needs to pick the corresponding IR instructions: this expression takes just three IR instructions, which are then wrapped into a single function to be called at execution time. That turns multiple function calls into one and greatly reduces the total number of instructions.

select count(*) from table where a > 10 and b < 5;

// PostgreSQL interpreted execution: a chain of function calls
result = Int8AndOp(Int32GT(a, 10), Int32LT(b, 5));

// LLVM codegen: three IR instructions (signed comparisons, i1 result)
%res1 = icmp sgt i32 %a, 10
%res2 = icmp slt i32 %b, 5
%res = and i1 %res1, %res2

In a database, expressions appear mainly in a few places: filter conditions, which usually appear in the WHERE clause; output lists, which follow the SELECT; and the conditions of operators such as join and agg, which may contain more complex expressions. Expression processing therefore happens throughout the modules of the execution engine. In AnalyticDB PostgreSQL, the development team abstracted an expression-processing framework that handles these expressions with LLVM CodeGen, thereby improving the overall performance of the execution engine.

5. Summary

As a popular open source compiler framework, LLVM has in recent years been used to accelerate databases, AI systems, and more. Because compiler theory has a high entry barrier, LLVM is not easy to learn. Moreover, from an engineering perspective, one needs a fairly precise understanding of LLVM's engineering and performance characteristics to find the right scenarios for acceleration. AnalyticDB PostgreSQL, the cloud-native data warehouse of the Alibaba Cloud database team, implements a runtime expression-processing framework based on LLVM that effectively improves the system's performance on complex data analysis.

This article is original content from Alibaba Cloud and may not be reproduced without permission.