Note: this article is a transcript of the video talk. Errors may have crept in while transcribing the recording; corrections are welcome.

Post Editor: Dahai, programming enthusiast, passionate about technology.

Instructor:

Zhou Heyang is a core developer of Wasmer. He is a 2018 undergraduate at China Southern University and is well-versed in compilers, operating systems, virtual machines, microarchitecture, and related technologies.

Video address: www.bilibili.com/video/BV1Yy…

The meaning and applications of JIT technology

JIT compilation, or just-in-time compilation, is a method of compiling at run time, converting source code or, more commonly, bytecode into machine code and then executing it directly. JIT technology is mainly applied in language virtual machines, but it appears in other places too: a dynamic linker rewrites (relocates) the program before it runs in order to link it, and in the Linux kernel, both eBPF and the static-calls mechanism recently introduced in version 5.10 use JIT-like mechanisms.
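
To make the idea concrete, here is a minimal sketch (my addition, not from the talk) of the core JIT mechanism on x86-64 Linux, assuming the libc crate: emit a few bytes of machine code into an executable page, then call them as a function.

    use libc::{mmap, MAP_ANONYMOUS, MAP_FAILED, MAP_PRIVATE, PROT_EXEC, PROT_READ, PROT_WRITE};
    use std::ptr;

    fn main() {
        // x86-64 machine code for `mov eax, 42; ret`.
        let code: [u8; 6] = [0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3];
        unsafe {
            // Allocate one page that is writable and executable
            // (a production JIT would map W^X and flip protections).
            let page = mmap(
                ptr::null_mut(),
                4096,
                PROT_READ | PROT_WRITE | PROT_EXEC,
                MAP_PRIVATE | MAP_ANONYMOUS,
                -1,
                0,
            );
            assert_ne!(page, MAP_FAILED);
            // "Compile": emit the machine code into the page, then call it.
            ptr::copy_nonoverlapping(code.as_ptr(), page as *mut u8, code.len());
            let jitted: extern "C" fn() -> i32 = std::mem::transmute(page);
            println!("{}", jitted()); // prints 42
        }
    }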

Taking virtual machines (VMs) as an example, let me briefly introduce where JIT technology is applied. VM implementations can be broadly divided into three categories: simple interpreters, optimized interpreters, and just-in-time compilers. Simple interpreters, like wasmi, focus on implementing the standard faithfully and do not invest resources in optimizing for efficiency. The second category, optimized interpreters, includes CPython, wasm3, and BEAM (the Erlang interpreter). The third category covers most high-performance runtimes: the JVM, the CLR, V8, LuaJIT, Wasmer, and Wasmtime.

Virtual machines are mainly used for translation when the format of the code we need to execute does not match the machine's instruction format. When we need dynamic features (such as JavaScript's dynamic semantics), sandboxing features that are hard to implement at the hardware level (such as WebAssembly's memory isolation), or translation across instruction sets (for example, dynamically translating RISC-V to AArch64 or x86-64), we need binary translators and JIT compilation.

The advantages of JIT are obvious: it makes programs run more efficiently. It can ① optimize code dynamically, ② efficiently support a language's dynamic features and security requirements, and ③ in some special scenarios, such as the static-calls mechanism and the dynamic linker, support initialization of the runtime environment in a way that avoids a large amount of runtime overhead.

Let's talk about the key aspects of JIT compared with traditional static compilation, focusing on dynamic optimization.

As shown in Figure 1, engines such as JavaScriptCore, V8, and Wasmer all let the user freely choose which backend to run on, or automatically switch between backends at run time, so that execution can move from a low optimization level to a high one and switch back again.

The dynamic optimization process works like this: we track the running state through profiling and recompile hot code at a higher optimization level. There is also deoptimization: when the optimized code's assumptions turn out to be wrong, we must roll back.

The main technique used to switch optimization levels dynamically is OSR, i.e. on-stack replacement.

Let's look at a simple OSR flow. As shown in Figure 2, suppose the call stack on the left occurs and the function Baz is to be promoted from interpretation to JIT level 1. The runtime triggers compilation of Baz; once compilation finishes, the call stack is reconstructed: every record of Baz on the call stack is mapped onto the JIT level 1 stack layout, so that execution continues in JIT level 1 machine code on top of the original state. The trade-off is increased implementation complexity.
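
To illustrate the remapping step, here is a purely conceptual toy in Rust (all types and names are invented; this is not Wasmer's actual data layout): every interpreter frame of the promoted function is rebuilt as a JIT tier-1 frame carrying the same abstract state.

    struct AbstractState { pc: usize, locals: Vec<u64> }

    enum Frame {
        Interpreted { func: &'static str, state: AbstractState },
        JitTier1 { func: &'static str, state: AbstractState },
    }

    fn osr_promote(stack: Vec<Frame>, target: &str) -> Vec<Frame> {
        stack
            .into_iter()
            .map(|frame| match frame {
                // Rebuild every frame of the promoted function in the JIT
                // tier-1 layout, carrying over the same abstract state;
                // execution then resumes in machine code on that state.
                Frame::Interpreted { func, state } if func == target => {
                    Frame::JitTier1 { func, state }
                }
                other => other,
            })
            .collect()
    }

    fn main() {
        let stack = vec![
            Frame::Interpreted { func: "main", state: AbstractState { pc: 0, locals: vec![] } },
            Frame::Interpreted { func: "Baz", state: AbstractState { pc: 7, locals: vec![1, 2] } },
        ];
        let _promoted = osr_promote(stack, "Baz");
    }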

One of my earlier projects was implementing OSR in Wasmer: on OSR entry the runtime dynamically loads a stack image, and on OSR exit it extracts one (mapping from the native call stack to a Wasm-level abstract representation, then back into the call-stack structure of another optimization level inside Wasmer).

Figure 3 shows a benchmark of my project at the time. In Figure 3, Singlepass is a backend I wrote that compiles fastest but produces the slowest code; LLVM is the most optimizing backend. The red line is the performance curve using the LLVM backend alone, and the blue line uses the Singlepass backend for roughly the first 2 s and the LLVM backend afterwards.

If we compiled directly with LLVM, our test program would have to wait about 2 s before executing. If we introduce a dynamic switching mechanism, we can start the program on a fast-to-compile but slow-to-execute engine, and once the highly optimizing compiler is ready, dynamically switch the execution flow over, striking a balance between the two. The red and blue lines do not overlap exactly, because we do some extra work on the blue line, but the steady-state performance is theoretically the same.
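
The switching idea itself can be sketched as a hot-swappable entry point (a toy of my own, not Wasmer's real mechanism; the two functions merely stand in for Singlepass- and LLVM-generated code):

    use std::sync::atomic::{AtomicUsize, Ordering};
    use std::sync::Arc;
    use std::thread;

    // Stand-ins for code produced by the two backends.
    fn singlepass_version(x: u64) -> u64 { x + 1 } // compiles fast, runs slower
    fn llvm_version(x: u64) -> u64 { x + 1 }       // compiles slowly, runs faster

    fn main() {
        // Each function's entry point is a swappable pointer.
        let entry = Arc::new(AtomicUsize::new(
            singlepass_version as fn(u64) -> u64 as usize,
        ));
        let bg = {
            let entry = Arc::clone(&entry);
            thread::spawn(move || {
                // ... the expensive optimizing compilation would happen here ...
                // Hot-swap the entry point once the optimized code is ready.
                entry.store(llvm_version as fn(u64) -> u64 as usize, Ordering::Release);
            })
        };
        for i in 0..4 {
            // Callers always dispatch through the pointer, so they pick up
            // the optimized code as soon as it is installed.
            let f: fn(u64) -> u64 =
                unsafe { std::mem::transmute(entry.load(Ordering::Acquire)) };
            println!("{}", f(i));
        }
        bg.join().unwrap();
    }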

**The second dynamic optimization technique I want to introduce is inline caching.** As far as I know, there are two typical use cases.

  1. Method Lookup in some dynamic languages

    for (let x of list){
        document.write(x); // method lookup
    }

    The write function can be rewritten dynamically, but the probability of that happening is very small, so we can assume it does not change at run time and compile machine code under that assumption, rolling back when the assumption turns out to be false. Without this assumption, every call would have to look up write in a hash table, which is cache-unfriendly and slow.

    Therefore, we can map this call site directly to a cache slot, storing a marker identifying the write function together with the write function's address. At run time we check whether the cached condition still holds; if it does, we execute directly, avoiding the hash-table lookup overhead; otherwise we roll back.
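
    A minimal runnable sketch of such a cache slot in Rust (my illustration, not from the talk; the object model and all names are invented, with Shape standing in for a hidden-class id):

        use std::collections::HashMap;

        type Shape = u32;
        type Method = fn(&Object);

        struct Object {
            shape: Shape, // stands in for a hidden-class / object-layout id
            methods: HashMap<&'static str, Method>,
        }

        // One slot per call site: the expected shape plus the resolved target.
        struct CallSite {
            expected: Shape,
            target: Option<Method>,
        }

        fn call(obj: &Object, name: &str, ic: &mut CallSite) {
            if let Some(m) = ic.target {
                if ic.expected == obj.shape {
                    return m(obj); // fast path: the cached assumption holds
                }
            }
            let m = obj.methods[name]; // slow path: cache-unfriendly hash lookup
            ic.expected = obj.shape;   // refill the slot for next time
            ic.target = Some(m);
            m(obj);
        }

        fn write(_doc: &Object) { println!("write called"); }

        fn main() {
            let mut methods: HashMap<&'static str, Method> = HashMap::new();
            methods.insert("write", write as Method);
            let document = Object { shape: 1, methods };
            let mut ic = CallSite { expected: 0, target: None };
            call(&document, "write", &mut ic); // slow path, fills the cache
            call(&document, "write", &mut ic); // fast path
        }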

  2. RISC-V binary translation

In RISC-V binary translation, the expensive cases are mainly memory-access instructions and jump instructions, both of which involve lookups in large memory structures.

① For memory-access instructions (load/store) in full-system emulation, every access needs a TLB lookup in the memory management unit; doing this in software is very slow, as it walks a 4-level page table. In user-level emulation, searching a B-tree-like structure for the memory region is also inefficient.

For this kind of instruction, we can associate a cache slot with the instruction. The first time the instruction needs to look up the table, the expected virtual address range and the real physical address are written into the cache slot; on later executions, the cached information can be used to access memory directly.

② Figure 4 shows the jalr instruction, an example of an indirect jump. For such instructions, besides the MMU lookup, we also need to find the JIT-translated code for the target (a translation lookup). That makes two levels of lookup, and inline caching can eliminate the overhead of both.
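
The same shape of cache works for the indirect-jump case. Here is an invented-name sketch where one slot pins the guest target PC and the host code address together, skipping both lookups on a hit:

    // A per-call-site cache for an indirect jump (e.g. jalr):
    // guest target pc -> host code address.
    struct JumpSlot { guest_pc: u64, host_code: usize }

    struct Translator { slot: JumpSlot /* + page tables, translation table, ... */ }

    impl Translator {
        fn dispatch(&mut self, target: u64) -> usize {
            if self.slot.guest_pc == target {
                self.slot.host_code // fast path: both lookups skipped
            } else {
                // Slow path: MMU lookup + translation lookup, then refill the slot.
                let host = self.translate(target);
                self.slot = JumpSlot { guest_pc: target, host_code: host };
                host
            }
        }

        fn translate(&mut self, target: u64) -> usize {
            // Stand-in for the real two-level lookup.
            target as usize
        }
    }

    fn main() {
        let mut t = Translator { slot: JumpSlot { guest_pc: u64::MAX, host_code: 0 } };
        assert_eq!(t.dispatch(0x1000), 0x1000); // slow path: fills the slot
        assert_eq!(t.dispatch(0x1000), 0x1000); // fast path
    }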

Let me show you a simple application of inline caching that I built. The rvjit-aa64 project is a RISC-V-to-AArch64 JIT engine I wrote.

Figure 5 shows the fast path of a memory-access instruction. You can see that we allocate two cache slots holding the expected upper and lower bounds. We check whether the target virtual address falls within the expected range; if so, we load directly, without falling back to interpretation. Otherwise we take the slow path and perform the table lookup.

Figure 6 shows the slow path of a memory-access instruction. When a load/store misses, we look up the table for the address addr, checking read/write permissions and related information. If the access is allowed, we write the result into the cache slot so that the next execution can take the fast path.
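
Here is a runnable toy version of this fast/slow-path pairing (my sketch, not the rvjit-aa64 code; the "table lookup" is reduced to an identity mapping of one 4 KiB page):

    // Each memory-access instruction owns a cache slot holding the
    // last-resolved virtual-address range and its host backing offset.
    struct CacheSlot { lo: u64, hi: u64, base: usize }

    struct Emu { mem: Vec<u8>, slot: CacheSlot }

    impl Emu {
        fn load_u8(&mut self, vaddr: u64) -> u8 {
            let (lo, hi, base) = (self.slot.lo, self.slot.hi, self.slot.base);
            if vaddr >= lo && vaddr < hi {
                // Fast path: reuse the cached translation, no table walk.
                self.mem[base + (vaddr - lo) as usize]
            } else {
                self.load_slow(vaddr)
            }
        }

        fn load_slow(&mut self, vaddr: u64) -> u8 {
            // Slow path: walk the page table, check permissions, then refill
            // the slot so the next access to this page takes the fast path.
            // Here the "walk" is a toy identity mapping of one 4 KiB page.
            let lo = vaddr & !0xfff;
            self.slot = CacheSlot { lo, hi: lo + 0x1000, base: lo as usize };
            self.mem[vaddr as usize]
        }
    }

    fn main() {
        let mut emu = Emu { mem: vec![0; 8192], slot: CacheSlot { lo: 0, hi: 0, base: 0 } };
        emu.mem[0x42] = 7;
        assert_eq!(emu.load_u8(0x42), 7); // slow path, fills the slot
        assert_eq!(emu.load_u8(0x42), 7); // fast path
    }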

Next, let me talk about memory safety.

We know that Rust, a language famous for safety, keeps safe code memory-safe. But JIT-generated machine code lies outside those guarantees, so we need dynamic mechanisms to ensure memory safety at run time.

I will use null-pointer checks and out-of-bounds access checks as examples of how a JIT can ensure memory safety.

① Null pointer check:

For example, in languages with null pointers, such as Java and C#, a very common situation is that when a reference is null, dereferencing it must not succeed: we should check whether it is null and, if so, raise an exception instead of dereferencing it. The obvious method is if (a == null){throw Exception(…) }, but this is expensive: as the code below shows, cmp and je instructions must be inserted before the mov, adding extra branch overhead.

1: cmp $0, %rdi
2: je null_pointer_exception
3: mov %rdi, 16(%rsp)
...

null_pointer_exception:
	call host_npe_handler
	...

So we can try something else: use the hardware trap mechanism. When the null pointer is accessed, the mov instruction on line 3 traps directly into a SIGSEGV signal (taking Linux as an example), letting the hardware check the validity of our pointer.
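
A sketch of the trap approach on Linux, assuming the libc crate (a toy: a real runtime would verify that the faulting PC lies inside JIT'd code and unwind to the guest's exception handler, and printing inside a signal handler is not async-signal-safe):

    use libc::{c_int, c_void, sigaction, siginfo_t, SA_SIGINFO, SIGSEGV};

    extern "C" fn on_segv(_sig: c_int, info: *mut siginfo_t, _ctx: *mut c_void) {
        // A real runtime would check that the faulting PC is inside JIT'd
        // code, then unwind to the language-level exception handler.
        let fault = unsafe { (*info).si_addr() as usize };
        eprintln!("trapped access to {:#x}", fault); // toy only
        unsafe { libc::_exit(1) }
    }

    fn main() {
        unsafe {
            let mut sa: sigaction = std::mem::zeroed();
            sa.sa_sigaction = on_segv as libc::sighandler_t;
            sa.sa_flags = SA_SIGINFO;
            sigaction(SIGSEGV, &sa, std::ptr::null_mut());

            // Stand-in for the JIT'd `mov`: no cmp/je pair, the null
            // access simply traps and the handler takes over.
            let p: *const u64 = std::ptr::null();
            let _ = std::ptr::read_volatile(p);
        }
    }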

② Out-of-bounds access checks

The trap mechanism can also be used to handle linear memory accesses in WebAssembly. For example, Wasmer and Wasmtime directly reserve 6 GB of virtual address space and map only the regions actually allocated by the WebAssembly module. Once an access falls outside the mapped region, an exception is raised and caught by the SIGSEGV handler. This trades extra time on the slow path for savings on the fast path: the slow path goes through the SIGSEGV exception-handling machinery, while the fast path no longer needs any bounds check.
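
The reservation trick itself is easy to sketch with mmap (illustrative numbers and a 64-bit Linux system assumed; real engines also handle memory growth and guard pages):

    use libc::{mmap, mprotect, MAP_ANONYMOUS, MAP_FAILED, MAP_PRIVATE, PROT_NONE, PROT_READ, PROT_WRITE};
    use std::ptr;

    fn main() {
        const RESERVE: usize = 6 << 30; // reserve ~6 GiB of address space, unmapped
        const IN_USE: usize = 64 << 10; // commit only what the module allocated

        unsafe {
            let base = mmap(
                ptr::null_mut(),
                RESERVE,
                PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS,
                -1,
                0,
            );
            assert_ne!(base, MAP_FAILED);
            // Only the first pages become accessible. Any Wasm access beyond
            // them hits PROT_NONE memory and raises SIGSEGV, so the JIT'd
            // fast path needs no explicit bounds check at all.
            assert_eq!(mprotect(base, IN_USE, PROT_READ | PROT_WRITE), 0);
        }
    }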

Of course, the details are a bit more complicated. For example, the code in Wasmer uses Unix signals to handle synchronous exceptions: it calls low-level system APIs to bind these exception signals to a handler, which dispatches them and then locates the appropriate recovery path.

Finally, let's look at some techniques in the Linux kernel that use JIT.

① eBPF, for example, is a mechanism that lets user code run safely inside the kernel. It has both interpreter and JIT implementations, and on most mainstream architectures the JIT is used.

② The static-calls mechanism introduced in Linux 5.10. Previously, retpoline was used to mitigate the Spectre vulnerabilities, particularly Spectre v2.

Retpoline relies on the Return Stack Buffer (RSB) and aims to keep all indirect calls away from the Branch Target Buffer, so that the attack cannot take effect.

To help you understand how retpoline works, I follow the article Retpoline: Principle and Deployment. As shown in Figure 7, the jmp instruction makes an indirect jump through the value in %rax. In the original mode, the CPU queries the indirect branch predictor; if an attacker has previously trained that branch, the CPU can be made to speculatively jump into specific code. The retpoline mechanism prevents this speculative execution. In retpoline mode:

① call L2 pushes the address of the lfence instruction onto the stack, fills it into the Return Stack Buffer (RSB), and then jumps to L2.

② The mov %rax, (%rsp) instruction overwrites the top of the stack with the indirect jump target (*%rax), so the return address on the stack now differs from the one in the RSB.

③ If the CPU speculatively executes the ret instruction, it uses the RSB entry from step ①, and the lfence; jmp L1 sequence traps the speculation in a harmless infinite loop.

④ The CPU then finds that the return address on the in-memory stack differs from the RSB's speculative address, so speculative execution terminates and control jumps to *%rax.

Thus the retpoline mechanism prevents the CPU's speculative execution from being exploited.

However, in the Linux kernel there are many indirect-call sites whose targets are effectively fixed (virtual tables, for example), so the static-calls mechanism turns such a call into two direct jumps: the caller makes a direct call to a trampoline, and the trampoline's code, shown as _trampoline in Figure 7, is rewritten JIT-style whenever the target changes. This rules out Spectre v2 at these sites and also reduces the overhead of the call (since direct calls are used).

Should I use JIT in my project?

As Figure 8 shows, wasm3 is an interpreter, yet it is only about 10x slower than the LLVM backend (the best-performing Wasmer JIT). That is very good performance for an interpreter, and wasm3's engineering complexity is much lower.

Weighing execution efficiency against engineering complexity: low complexity means fewer bugs and safer code. So for scenarios with high security requirements, adopting a JIT needs careful consideration.

This year, two LPE (local privilege escalation) bugs were found in the Linux kernel's eBPF JIT (CVE-2020-8835, CVE-2020-27194). Even in the Linux kernel, with its huge number of developers, a language as small as eBPF still has serious JIT bugs; this shows that JIT compilers are very complex and require substantial team resources to maintain.

Experience implementing a JIT in Rust

Rust has advantages over C and C++ here, but a JIT compiles a source language into a target language, and Rust cannot guarantee safety outside the language itself: the generated machine code is beyond Rust's guarantees.

Q&A

Question 1 (guessed from the recording): does the benchmark used in Figure 3 use a HashMap?

Answer: the benchmark I use is about 50% slower with a HashMap, because hash maps are not cache-friendly.

Question 2: do the JIT's null-pointer checks improve efficiency by replacing ordinary software checks with traps and hardware interrupts?

Answer: a trap is analogous to a panic in Rust, for example on an out-of-bounds array access. The fast path executes in the vast majority of cases; the trap path executes only when the program has a bug.

Question 3: about out-of-bounds memory: if memory A and memory B are adjacent and memory A is already mapped, does this check become ineffective?

Answer: no, because we store upper- and lower-bound caches in the inline cache. Every access is compared against those bounds, and anything that crosses them is rejected. As for the cost of that comparison, it is a single layer of checks that is cache-friendly and cheap compared with a table lookup.

Question 4: how is Wasmer compatible with both the x86 and ARM instruction sets?

Answer: the compiler backends we use, Singlepass and LLVM, both support ARM instructions.

Addendum: improving JIT performance through hardware

Wasmer does some checking at run time. For example, for a jump it queries a table, finds the destination address in the table, and then jumps, so the generated code needs branches. With the RISC-V Physical Memory Protection (PMP) extension, the table-lookup overhead above can be avoided in some cases. The x86-compatibility mechanism in Apple's M1 improves emulation efficiency by adding a hardware switch for x86 Total Store Ordering (TSO), so x86 memory ordering can be used directly. We can also see that recent versions of the ARM instruction set have introduced instructions supporting JavaScript operations, which let some common JIT source languages execute more efficiently.

Reference articles:

  1. Terenceli, Retpoline: Principle and Deployment
  2. Sergio De Simone, How x86 to ARM64 Translation Works in Rosetta 2