JIT Compilation: Understanding and Implementation
This article introduces the basics of JIT compilation and shows how to quickly build a simple JIT compiler in C++.
The idea of writing an article on how a JIT compiler works had been languishing on my TODO list for about a year, and I finally found the time to act on it. By the end of this article, I hope you will know:
- What is JIT Compilation technology? What are its features?
- How do I use C++ to implement a JIT Compiler without relying on any framework?
Due to the limited space and scope of topics, the following will not be covered in this article:
- How to write a complete Interpreter/Compiler?
- Related advanced compiler optimization techniques.
Since writing a JIT compiler touches on many areas, from high-level programming languages such as C/C++, through assembly and computer architecture, down to the operating system, I will assume the reader has basic knowledge of these topics, and I will briefly introduce the relevant background wherever it comes up.
In the examples that follow, for completeness and benchmarking purposes, we will implement a simple JIT compiler for a real programming language called Brainfuck. We will also implement an interpreter so that we can compare JIT compilation with interpretation in terms of overall execution efficiency. For the implementation details of the interpreter, please refer to the source code in the example repository; it is not covered here due to space constraints. Before we start in earnest, here are a few things you should know:
- Code repository: github.com/Becavalier/…
- The JIT compiler we build targets the x86-64 platform and runs on macOS and Linux.
- Since the compiler implementation involves a fair amount of code, this article only walks through selected parts; for the complete code, please refer to the repository above.
Ok, so let’s get started.
Brainfuck programming language
Brainfuck is, as its name suggests, a very unusual programming language. It was created by Urban Müller in 1993. The word "brainfuck" itself is slang for something so complex and out of the ordinary that it exceeds human understanding, which is exactly what the language is known for. For example, the following Brainfuck code prints the characters "Hello, world!" to the console when executed.
++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.
As you can see, it is impossible to tell the intent of the program by looking at the code itself, which lives up to the name Brainfuck. Nonetheless, Brainfuck is a Turing-complete, minimalist programming language. It has only eight distinct instructions, and every Brainfuck program is simply a sequence built from these eight instructions, executed in order at run time. Brainfuck's execution model is also very simple: besides the instruction sequence, a running program maintains a one-dimensional byte array of at least 30,000 cells (which we will refer to as the "tape" later). When execution starts, all cells in the array are initialized to 0, and a movable "data pointer" points to the first cell by default. As the program runs, it moves the data pointer back and forth according to the instructions, and updates or uses the contents of the cell the pointer currently points to. The eight instructions and their corresponding characters are as follows:
Instruction character | Meaning |
---|---|
> | Increment the data pointer by 1 (that is, move it to point to the cell to the right of the current cell) |
< | Decrement the data pointer by 1 (that is, move it to point to the cell to the left of the current cell) |
+ | Increment the byte value in the cell the data pointer currently points to by 1 |
- | Decrement the byte value in the cell the data pointer currently points to by 1 |
. | Output the byte value stored in the cell the data pointer currently points to |
, | Read one byte of input and store it in the cell the data pointer currently points to |
[ | If the byte value in the cell the data pointer currently points to is 0, jump forward to the instruction after the matching "]" |
] | If the byte value in the cell the data pointer currently points to is not 0, jump back to the instruction after the matching "[" |
To understand this, we can use a simple example, such as this Brainfuck code.
++[->+<]
This code first increments the value in the first cell of the tape twice (++), making it 2. Then the [ instruction checks whether the value in the current cell is 0; since it is 2, execution continues with the next instruction. The following four instructions ->+< first decrement the value in the current cell by one, move the data pointer right to the second cell, increment the value in that cell by one, and then move the pointer back to the first cell. When the final ] instruction finds that the value in the first cell has reached 0, the loop (and the program) ends. So this program effectively moves the value from the first cell of the tape into the second. You can also use a Brainfuck Visualizer to watch the full dynamic execution of this program.
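To make the execution model above concrete, here is a minimal sketch of a switch-based Brainfuck interpreter in C++. It is not the interpreter from the example repository (which is more elaborate), and the function name interpret is made up purely for illustration, but it is enough to run the snippet above, e.g. interpret("++[->+<]").

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// A minimal switch-based Brainfuck interpreter, for illustration only.
void interpret(const std::string& code) {
  std::vector<uint8_t> tape(30000, 0);  // the "paper tape".
  size_t ptr = 0;                       // the "data pointer".
  for (size_t pc = 0; pc < code.size(); ++pc) {
    switch (code[pc]) {
      case '>': ++ptr; break;
      case '<': --ptr; break;
      case '+': ++tape[ptr]; break;
      case '-': --tape[ptr]; break;
      case '.': std::putchar(tape[ptr]); break;
      case ',': tape[ptr] = static_cast<uint8_t>(std::getchar()); break;
      case '[':
        // on a zero cell, skip forward to the instruction after the matching "]".
        if (tape[ptr] == 0) {
          for (int depth = 1; depth > 0;) {
            ++pc;
            if (code[pc] == '[') ++depth;
            else if (code[pc] == ']') --depth;
          }
        }
        break;
      case ']':
        // on a non-zero cell, jump back to the instruction after the matching "[".
        if (tape[ptr] != 0) {
          for (int depth = 1; depth > 0;) {
            --pc;
            if (code[pc] == ']') ++depth;
            else if (code[pc] == '[') --depth;
          }
        }
        break;
      default: break;  // any other character is treated as a comment.
    }
  }
}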
JIT Compilation
Now that we know our target language, let's look at what JIT compilation really is. Whether you work on the front end, back end, or mobile, you have probably heard the term "JIT" before. JIT stands for "just-in-time compilation". Its most notable feature is that compilation of the code happens while the program is running rather than beforehand. In compilation technology, JIT is usually contrasted with AOT ("ahead-of-time") compilation: compiling C/C++ code with Clang, compiling ES6 code with Babel, and even compiling JavaScript code into an engine-specific IR (Intermediate Representation) can all be regarded as forms of AOT compilation. The biggest difference between JIT and AOT is the point in time at which compilation takes place: for JIT, compilation happens while the program runs; for AOT, it happens before the program executes (usually at build time).
A traditional JIT compiler profiles the original code, or its corresponding IR, before actually generating machine code on the fly. Through this analysis, the compiler can find the "hot code paths" whose performance can be improved by JIT compilation. The trade-off is that the performance gained by optimizing this code at run time has to outweigh the associated overhead. As we will see later, for our example this overhead comes mainly from compiling the code at run time and from performing OSR (On-Stack Replacement). To keep things easy to follow, the examples in this article do not implement the pre-analysis step that traditional JITs perform.
It is also important to note that, because of concerns about startup latency, JIT compilers are typically used in conjunction with an interpreter. The more detailed the code analysis and optimization a JIT compiler performs, the higher the quality of the dynamically generated machine code, but also the greater the delay before the code first runs. Adding an interpreter allows the code to start executing earlier; meanwhile, the JIT compiler analyzes and optimizes the code in parallel, and at certain points the program's execution flow is switched from interpretation to the dynamically generated, optimized machine code. So for JIT compilation itself, one of the key design trade-offs is between compilation time and the quality of the generated code. For example, the JVM offers two JIT modes, client and server: the former uses minimal compilation and optimization to keep startup latency low, while the latter maximizes compilation and optimization at the expense of startup time.
Implementation details
Ok, it's time for the fun part :). First, a disclaimer: the JIT compiler we implement for Brainfuck in this article is only a proof of concept and does not aim for production-level completeness in areas such as exception handling, thread safety, profiling, or assembly fine-tuning. Second, the discussion of implementation details will focus on the functions bfJITCompile and allocateExecMem, and on the VM class in the source code. It is recommended to skim the source code before continuing.
As mentioned above, JIT compilation happens while a program runs. As you can see from the source code, we use a command-line flag (--jit) supplied by the user when launching the program to decide whether to JIT-compile or to interpret. For the JIT-compiled path, the process can be summarized as follows:
- Read in the source code (a sequence of instruction characters in ASCII form);
- Call the bfJITCompile function to compile the source code into machine code;
- Call the allocateExecMem function to place the machine code into a dynamically allocated executable memory segment;
- Call the VM::exec function to transfer execution to the generated code via OSR;
- After the generated code finishes, execution is transferred back to the main program;
- Perform some cleanup.
Next, we will focus on the implementation details of items 2, 3, and 4 in the above process.
Compiling machine code
In this step, we "extract" the instruction characters one by one from the Brainfuck source code supplied at startup and directly generate the corresponding binary machine code for them in sequence. The generated machine code is stored in a std::vector for later use. To keep machine-code generation simple, we simply recognize each instruction character through a switch statement and "return" the x86-64 binary machine code corresponding to that instruction. The returned machine code is then "concatenated" directly onto the vector holding the machine code emitted so far.
Note that before it is actually stored in the vector, the returned binary machine code may contain RIP-relative address information. We therefore need to "relocate" the binary machine code with methods such as _relocateAddrOfPrintFunc. With these methods, we can compute the actual values of these relative addresses and patch them in.
First, at the beginning of the bfJITCompile function we can find the following code. With it, we store the address of the "data pointer" from the Brainfuck execution model in register %rbx, so that later we can control the position of the data pointer by modifying or reading the value in this register, and read or modify the contents of the tape cell the data pointer currently points to. The comment "/* mem slot */" indicates that this position will be filled in at compile time with the actual referenced memory address, which comes from the value of bfState::ptr converted to little-endian bytes by _resolvePtrAddr.
// prologue.
std::vector<uint8_t> machineCode {
  // save dynamic pointer in %rbx.
  0x48, 0xbb, /* mem slot */
};
// ...
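The helpers _resolvePtrAddr and _appendBytecode used throughout this section are small utilities; the repository's exact definitions may differ, but plausible versions look like the sketch below. The 0x48, 0xbb prefix above is the opcode of "movabsq $imm64, %rbx", so the address has to be emitted as eight little-endian bytes.

#include <cstdint>
#include <vector>

// Split a 64-bit address into its 8 little-endian bytes so it can be
// embedded directly into the instruction stream (the "/* mem slot */" above).
std::vector<uint8_t> _resolvePtrAddr(uint64_t addr) {
  std::vector<uint8_t> bytes;
  for (size_t i = 0; i < 8; ++i) {
    bytes.push_back(static_cast<uint8_t>(addr >> (i * 8)));
  }
  return bytes;
}

// Append one chunk of generated bytes onto the end of the machine-code buffer.
void _appendBytecode(const std::vector<uint8_t>& src, std::vector<uint8_t>& dest) {
  dest.insert(dest.end(), src.begin(), src.end());
}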
Then, as the instruction characters are read in, the bfJITCompile function converts each of them into its machine-code equivalent. For the four instructions "+ - > <", the corresponding machine instructions change the state of the Brainfuck abstract machine simply by manipulating the data-pointer address we previously stored in %rbx, or the byte it points to. For example, for the "+" instruction we can find the following code:
// ...
case '+': {
  for (n = 0; *tok == '+'; ++n, ++tok);
  const auto ptrBytes = _resolvePtrAddr(ptrAddr);
  std::vector<uint8_t> byteCode {
    0x80, 0x3, static_cast<uint8_t>(n),  // addb $N, (%rbx)
  };
  _appendBytecode(byteCode, machineCode);
  --tok;
  break;
}
// ...
In this code we use an optimization strategy that is easy to come up with: instead of generating the same machine code for every single "+" instruction, we first count the number of consecutive "+" instructions encountered, and then complete the resulting state change with a single assembly instruction, addb $N, (%rbx). The same idea applies to the remaining three instructions, which change either the value of the cell the data pointer points to or the value of the data pointer itself; a possible handler for ">" in the same style is sketched below.
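For reference, a handler for the ">" instruction could look like the following. This is a plausible sketch rather than the repository's exact code; it folds N consecutive ">" instructions into a single addq on the data-pointer register:

// ...
case '>': {
  for (n = 0; *tok == '>'; ++n, ++tok);
  std::vector<uint8_t> byteCode {
    0x48, 0x83, 0xc3, static_cast<uint8_t>(n),  // addq $N, %rbx
  };
  _appendBytecode(byteCode, machineCode);
  --tok;
  break;
}
// ...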
As for the "," and "." instructions, since they involve I/O, the corresponding machine code needs to invoke operating system calls. System calls must follow a specific calling convention: register %rax holds the system call number, %rdi the first argument, %rsi the second, %rdx the third, and so on. Also note that macOS and Linux use different system call numbers, which we distinguish with preprocessor macros. For example, the "," instruction (which reads one byte of input) compiles to the following:
// ...
case ',': {
  /**
    movl $0x2000003, %eax
    movl $0x0, %edi
    movq %rbx, %rsi
    movl $0x1, %edx
    syscall
  */
  std::vector<uint8_t> byteCode {
#if __APPLE__
    0xb8, 0x3, 0x0, 0x0, 0x2,
#elif __linux__
    0xb8, 0x0, 0x0, 0x0, 0x0,
#endif
    0xbf, 0x0, 0x0, 0x0, 0x0,
    0x48, 0x89, 0xde,
    0xba, 0x1, 0x0, 0x0, 0x0,
    0xf, 0x5,
  };
  _appendBytecode(byteCode, machineCode);
  break;
}
// ...
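The "." instruction is handled analogously, except that the repository buffers its output through the stdout buffer set up in registers %r10/%r11 (see the OSR section below) to reduce the number of system calls, so its real handler is more involved. A simpler, unbuffered sketch that issues one write(2) per "." would look roughly like this (again, the macOS and Linux system call numbers differ):

// ...
case '.': {
  /**
    movl $0x2000004, %eax   (write; $0x1 on Linux)
    movl $0x1, %edi         (fd = stdout)
    movq %rbx, %rsi
    movl $0x1, %edx
    syscall
  */
  std::vector<uint8_t> byteCode {
#if __APPLE__
    0xb8, 0x4, 0x0, 0x0, 0x2,
#elif __linux__
    0xb8, 0x1, 0x0, 0x0, 0x0,
#endif
    0xbf, 0x1, 0x0, 0x0, 0x0,
    0x48, 0x89, 0xde,
    0xba, 0x1, 0x0, 0x0, 0x0,
    0xf, 0x5,
  };
  _appendBytecode(byteCode, machineCode);
  break;
}
// ...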
Finally, the implementation logic is a little more complicated for the "[" and "]" instructions. Take "[" as an example, shown in the code below. It is straightforward to map the semantic logic of "[" directly to assembly: check whether the value of the cell the data pointer currently points to is 0; if it is, use the je instruction to jump to the instruction following the matching "]", otherwise fall through to the next instruction. The "address of the subsequent instruction" used here will only be known later, while processing the paired "]" instruction (a plausible sketch of that back-patching step appears after the code below), so for now we use four consecutive 0x0 bytes as a placeholder. One more thing to know is that we stick to "near jump" mode here to simplify the implementation.
// ...
case '[': {
  /*
    cmpb $0x0, (%rbx)
    je <>
  */
  std::vector<uint8_t> byteCode {
    0x80, 0x3b, 0x0,
    0xf, 0x84, 0x0, 0x0, 0x0, 0x0,  /* near jmp */
  };
  _appendBytecode(byteCode, machineCode);
  // record the jump relocation pos.
  jmpLocIndex.push_back(machineCode.size());
  break;
}
// ...
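For completeness, here is a plausible sketch of the matching "]" handler that performs the back-patching mentioned above (the repository's actual code may differ in detail). It emits a cmpb plus a backward jne, then fills in both the backward offset and the forward offset left as a placeholder by the "[" handler; std::memcpy from <cstring> is used to write the 32-bit little-endian offsets:

// ...
case ']': {
  /*
    cmpb $0x0, (%rbx)
    jne <loop body>
  */
  std::vector<uint8_t> byteCode {
    0x80, 0x3b, 0x0,
    0xf, 0x85, 0x0, 0x0, 0x0, 0x0,  /* near jmp, patched below */
  };
  _appendBytecode(byteCode, machineCode);
  // position of the first loop-body instruction (right after the "[" je).
  const auto loopStart = jmpLocIndex.back();
  jmpLocIndex.pop_back();
  const auto loopEnd = machineCode.size();
  // back edge: if the cell is non-zero, jump back to the loop body.
  const int32_t backOffset = static_cast<int32_t>(loopStart) - static_cast<int32_t>(loopEnd);
  std::memcpy(machineCode.data() + loopEnd - 4, &backOffset, sizeof(backOffset));
  // forward edge: patch the "[" je so that a zero cell skips past this "]".
  const int32_t fwdOffset = static_cast<int32_t>(loopEnd) - static_cast<int32_t>(loopStart);
  std::memcpy(machineCode.data() + loopStart - 4, &fwdOffset, sizeof(fwdOffset));
  break;
}
// ...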
At this point, we have completed the dynamic compilation of the machine instructions: our program can now convert an input Brainfuck instruction sequence into the corresponding platform-specific binary machine code. At the end of the bfJITCompile function you will find the following code. It is mainly used to flush the stdout buffer before the program exits and to jump back so that the program's execution flow returns to the C++ code. We will come back to this later.
// epilogue.
// mainly restoring the previous pc, flushing the stdout buffer.
/**
  cmpq $0, %r11
  je 8
  callq
  jmpq *(%rsp)
*/
std::vector<uint8_t> byteCode {
  0x49, 0x83, 0xfb, 0x0,
  0x74, 0x8,
  0xe8, /* mem slot */
  0xff, 0x24, 0x24,
};
Executable memory allocation
Next, our focus will shift from “how to dynamically generate machine code” to “how to dynamically execute machine code.” For this part of the implementation, you can refer to a function named allocateExecMem, as shown in the code below.
// ...
uint8_t* allocateExecMem(size_t size) {
  return static_cast<uint8_t*>(mmap(
    NULL,
    size,
    PROT_READ | PROT_WRITE | PROT_EXEC,
    MAP_PRIVATE | MAP_ANONYMOUS,
    -1,
    0));
}
// ...
In this function we call mmap, which is the key to enabling "dynamic execution of machine code". mmap is a system call, exposed through the C library, that creates a mapping in the virtual address space (VAS) of the current process. The mapping can be backed by a specific file or be anonymous. The most familiar use of mmap is file-backed mapping, where the operating system points page-table entries at the appropriate locations in a target file so that the file can be accessed through virtual memory. Here, however, we use it to create an "anonymous" mapping that is not backed by any file, and we place the binary machine code compiled in the previous step contiguously into this memory region.
Moreover, by passing PROT_EXEC in the third argument of mmap, we mark the requested anonymous memory region as "executable", which means the machine instructions stored in it can be executed by the CPU. For details on how to configure the remaining parameters, refer to the mmap documentation. The actual call to allocateExecMem is placed in the constructor of the VM class, where we simply encapsulate resource allocation and release with RAII, roughly as sketched below.
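Here is a stripped-down sketch of how such a VM class could look; the repository's actual class holds more state (the stdout buffer, the offset of the static subroutines, and so on), so treat this only as an illustration of the RAII idea, building on the allocateExecMem function above:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>
#include <sys/mman.h>

class VM {
 public:
  explicit VM(const std::vector<uint8_t>& machineCode)
    : size(machineCode.size()), mem(allocateExecMem(machineCode.size())) {
    // copy the dynamically generated machine code into the executable segment.
    std::memcpy(mem, machineCode.data(), size);
  }
  ~VM() {
    // release the mapping when the VM instance goes out of scope.
    munmap(mem, size);
  }
  // exec() is discussed in the next section.
 private:
  size_t size;
  uint8_t* mem;
};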
OSR (On-Stack Replacement)
Once the compiled binary machine code has been placed into the executable anonymous memory segment, the next question is: how do we move the program's instruction execution flow to the beginning of this memory segment? For this part of the implementation we need to use the "C++ inline assembly" feature provided by the Clang/GCC compilers. You can find the answer in the implementation of the VM::exec function, which looks like this:
// ...
void exec() {
  // save the current %rip on stack (by PC-relative).
  // %r10 - stdout buffer entry.
  // %r11 - stdout buffer counter.
  asm(R"(
    pushq %%rax
    pushq %%rbx
    pushq %%r10
    pushq %%r11
    pushq %%r12
    movq %1, %%r10
    xorq %%r11, %%r11
    lea 0xe(%%rip), %%rax
    pushq %%rax
    movq %0, %%rax
    addq %2, %%rax
    jmpq *%%rax
  )" : : "m" (mem), "m" (stdoutBuf), "m" (prependStaticSize));
  // clean the stack.
  asm(R"(
    addq $8, %rsp
    popq %r12
    popq %r11
    popq %r10
    popq %rbx
    popq %rax
  )");
}
// ...
In this code, the asm keyword is used twice. The purpose of the first inline assembly block is to transfer the program's execution flow to the dynamically compiled machine code. The first five pushq instructions save the values of those registers on the stack to preserve their state at that moment; they will be restored after the program's execution flow returns to the C++ code. The movq instruction on line 6 stores the start address of the stdout buffer in register %r10. This buffer is used to cache the characters output by the "." instruction, so as to reduce the number of actual system calls and improve performance. On lines 8-9, we push the address of the next instruction in the normal C++ execution flow onto the stack so that the dynamic execution flow can return to it later. On lines 10-11, we compute the address inside the anonymous executable memory segment plus the proper offset (skipping the static subroutine definitions at its start). Finally, with the jmpq instruction, we redirect the CPU's execution flow to the memory address held in %rax, which is the first dynamically generated instruction we want to execute.
At this point, the execution transition from C++ code to the dynamic instructions is complete. When the dynamically generated instructions have all executed, we need to move execution back to the normal C++ code in a similar way. Remember the little "epilogue" assembly code we discussed at the end of the "Compiling machine code" section? If you look back, you will see that its last instruction is jmpq *(%rsp), which redirects the CPU to the address stored in the quadword at the top of the stack, i.e. the C++ return address we pushed in the previous step. When execution returns to the C++ code, we reach the second asm block, which cleans up the stack and restores the saved registers. At that point, the program's execution is essentially over.
Let's return to the topic of this section, OSR. OSR stands for On-Stack Replacement. With a little help from Google, we can find the following definition:
On-stack-replacement (OSR) describes the ability to replace currently executing code with a different version, either a more optimized one (tiered execution) or a more general one (deoptimization to undo speculative optimization).
In fact, OSR can be understood simply as “the transformation process from one execution environment to another.” For example, when the VM::exec function executes, it moves the execution environment from C++ code to dynamically generated machine code and back again in the same way. Such an execution environment transformation can be considered an OSR process. The following figure is a visual representation of the OSR process described above.
Benchmark
So far, we have covered the key implementation details of our Brainfuck JIT compiler. Now let's look at how this crude version of a JIT compiler actually performs. The project's source code provides two sets of tests, one for an "IO-intensive" scenario and one for a "compute-intensive" scenario. One set of results is as follows:
- IO intensive case:
Benchmark for 10 seconds: (higher score is better)
12950 interpreter
35928 jit (win)
- Computation-intensive case:
Benchmark Result: (lower time is better)
13.018s interpreter
0.885s jit (win)
As you can see, the overall results are quite good. For the IO-intensive test case, JIT compilation provides nearly a threefold improvement over interpretation alone. For the compute-intensive scenario, the gains from JIT are even more significant: in our "print the Mandelbrot set" test case, JIT compilation was nearly 15 times faster than interpretation. Of course, since we did not adopt a more rigorous test suite and methodology, these results are for reference only.
For more information
Next, let's briefly touch on a few additional topics. Each of them could be expanded into an entire article of its own, so what follows is only an overview; for more information, please search further on your own.
Why interpretation is slow
Arguably, "branch misprediction" is the most important of the many reasons why interpreters run slowly. Take the switch-statement-based interpreter we implemented in the source code of this example: the interpreter reads one instruction character at a time from the input source and changes the interpreter's current state accordingly (the data pointer, the tape contents, and so on). The problem is that, at a macro level, the CPU cannot know which instruction will come next when the switch statement executes, so branch prediction frequently fails. At a micro level, a misprediction (or the inability to predict at all) wastes CPU clock cycles, whether waiting for results or reloading the pipeline after discarding wrongly speculated work. The latency introduced by these pipeline hazards becomes significant once a large number of instructions have been executed.
As for interpreter models such as "direct threading" and "subroutine threading" (sketched below), although they largely solve the branch-misprediction problem, they have issues of their own, such as executing many extra jmp instructions and creating useless stack frames (no inlining), which can also noticeably reduce interpreter performance. JIT compilation, by contrast, can avoid these problems by dynamically generating machine code and applying basic optimization strategies such as inlining. Moreover, some JIT compilers can even achieve better run-time performance than AOT compilation, mainly because a JIT can perform more dynamic, heuristic code optimization at run time based on the current operating system, CPU ISA, and code profiling results.
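As a rough illustration of what "direct threading" means (this is not part of the example repository), the following sketch uses the computed-goto extension of GCC/Clang: opcodes are translated into handler addresses up front, and every handler ends with its own indirect jump, which gives the branch predictor one prediction site per opcode instead of a single shared switch.

#include <cstdint>
#include <cstdio>
#include <vector>

enum Op : uint8_t { OP_INC, OP_DEC, OP_PRINT, OP_HALT };

// A toy direct-threaded dispatcher; the program must end with OP_HALT.
void run(const std::vector<uint8_t>& ops) {
  static void* labels[] = { &&op_inc, &&op_dec, &&op_print, &&op_halt };
  // "direct threading": pre-translate opcodes into their handler addresses.
  std::vector<void*> code;
  for (auto op : ops) code.push_back(labels[op]);
  uint8_t cell = 0;
  size_t pc = 0;
  goto *code[pc];
op_inc:   ++cell; goto *code[++pc];
op_dec:   --cell; goto *code[++pc];
op_print: std::putchar(cell); goto *code[++pc];
op_halt:  return;
}

For example, run({ OP_INC, OP_INC, OP_PRINT, OP_HALT }) increments a cell twice and writes the resulting byte to stdout, with each handler dispatching directly to the next one.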
JIT implementation strategies and methods
Common JIT strategies can be classified as method-based JIT, trace-based JIT, and region-based JIT. In a method-based JIT, functions serve as the unit of compilation: the compiler identifies hot functions during execution (for example, by counting how many times a function has been called) and then replaces them with compiled machine-code versions. This approach is simple to implement, but it has its own problems, such as coarse JIT granularity, a low hit rate for hot code, and an inability to precisely capture the time-consuming logic (such as loops) inside a function body.
In contrast, a trace-based JIT uses a "trace" as the unit of compilation for hot code. A trace is a hot code path that the program follows at run time; viewed from the source code, such a path may span multiple functions. The JIT compiler then compiles and optimizes the hot code along this path at run time.
Finally, the region-based JIT, which comes mainly from Facebook's HHVM virtual machine, uses a "tracelet" as its compilation unit. A tracelet is usually the longest execution path that can be "type-specialized". For more information, please refer to the paper on HHVM.
Beyond these three common JIT strategies, in practice we usually choose a compilation framework rather than hand-assembling machine code as we did in this article; such frameworks provide much better instruction selection and code generation. Common choices include DynASM, LLVM, and Cranelift. These frameworks often offer not only basic, platform-specific machine-code generation but also code optimization capabilities. For a method-based JIT, for instance, many optimization strategies used in static AOT compilation can also be applied directly by the JIT compiler; using LLVM makes it easy to reuse these mature optimizations without reimplementing them.
References
- solarianprogrammer.com/2018/01/10/….
- solarianprogrammer.com/2018/01/12/….
- corsix.github.io/dynasm-doc.
- en.wikipedia.org/wiki/Just-i….
- en.wikipedia.org/wiki/Tracin….
- en.wikipedia.org/wiki/Brainf….
- gcc.gnu.org/onlinedocs/….
- github.com/opensource-….
- en.wikipedia.org/wiki/Ahead-….
- prl.ccs.neu.edu/blog/2019/0….
- javapapers.com/core-java/j….