Introduction: phone taobao client in history took various scripting engine, a language for support include: js/python/wasm lua, including js engine took over there: javascriptcore duktape/v8 / quickjs and so on. A large number of engines will face problems related to package size and performance. Can we provide a solution that can support as many languages as possible with one engine on the premise of supporting business requirements, so as to better balance small package size and excellent performance? To solve this problem, we started exploring HyEngine.
The author | know soldier source | ali technology to the public
1. Background
Phone taobao in history took over a variety of client script engine, a language for support include: js/python/wasm lua, including js engine took over there: javascriptcore duktape/v8 / quickjs and so on. A large number of engines will face problems related to package size and performance. Can we provide a solution that can support as many languages as possible with one engine on the premise of supporting business requirements, so as to better balance small package size and excellent performance? To solve this problem, we started exploring HyEngine.
Ii Introduction to Design
Hyengine is designed for the implementation of various scripting languages (WASM/JS/Python, etc.) for unified mobile technologies. Hyengine is designed and developed with lightweight, high performance and multi-language support. Wasm3 / QuickJS has been jit compiled and runtime optimized to achieve 2 to 3 times faster wASM/JS execution at the cost of a very small package size. Support for Python and other languages will be increased by implementing its own bytecode and Runtime in the future.
Note: Since most current mobile phones already support ARM64, HyEngine only supports the JIT implementation of ARM64.
Note: Since ios does not support JIT, HyEngine is currently available only on Android.
Hyengine is divided into two parts, the compiler part and the engine (VM) part.
Compiler part is divided into front end, middle end and back end. The front end uses the implementation of existing script engine, such as QuickJS for JS and Emscripten for WASM. The middle end plans to implement its own bytecode, optimizer and bytecode converter. The back end implements quickJS and WASM’s JIT and assembler and optimizer.
Vm is divided into interpreter, Runtime, API, debugging and basic library. Due to limited manpower, there is no complete implementation of VM at present. We reuse QuickJS/WASM3 code, implement a set of our own internal distributor and GC, and optimize the existing Runtime implementation to improve performance.
The business code (in the case of WASM) is compiled into executable code through the process shown below:
The c/ C ++ code is compiled by Emscripten into wASM files. Wasm is loaded by HyEngine (WASM3) and compiled into arm64 instructions. Arm64 instructions are optimized by Optimizer to generate optimized ARM64 instructions. The business side executes the corresponding code by calling the entry API.
Note: HyEngine itself is expected to have its own set of low-level (assemble-level) base capabilities, planned for mobile client package size, performance tuning, debugging assistance, and more, in addition to JIT-related uses.
Note: There may be some similarities between the Ark compiler and GraalVM in this solution industry.
Iii Implementation Introduction
1. Compiler
To keep the implementation simple, HyEngine is compiled using direct translation, which is generally slow and needs to be optimized by the optimizer to improve performance. The following is a concrete implementation of the related modules:
The compiler
To generate code that the CPU can execute, we need to implement an assembler that translates the associated script’s Opcode into machine code.
The core code of the assembler is generated according to the script based on the existing instruction data of Golang’s ARCH project, and assisted manual modification and corresponding tool code.
The following is an example of a single assembly code:
// Name: ADC
// Arch: 32-bit variant
// Syntax: ADC <Wd>, <Wn>, <Wm>
// Alias:
// Bits: 0|0|0|1|1|0|1|0|0|0|0|Rm:5|0|0|0|0|0|0|Rn:5|Rd:5
static inline void ADC_W_W_W(uint32_t *buffer, int8_t rd, int8_t rn, int8_t rm) {
uint32_t code = 0b00011010000000000000000000000000;
code |= IMM5(rm) << 16;
code |= IMM5(rn) << 5;
code |= IMM5(rd);
*buffer = code;
}
Copy the code
The code is used to assemble ADC instructions. The first parameter is the buffer where the machine code is stored. The last three parameters are the operands of the assembly instruction, Wd/Wn/Wm. Code in line 7 is the fixed part of the machine code, actions 8 to 10 put the register number corresponding to the operand into the machine code location (see the Bits section of the comment), and actions 9 put the machine code into the buffer. IMM5 means to take the lowest 5bits of the value, because the register is a 5bits long number. The advantage of this naming is that the method name of the assembler can be intuitively associated with the mnemonic form of the machine code it produces.
IMM5 is implemented as follows:
#define IMM5(v) (v & 0b11111)
Copy the code
In order to ensure the correctness of assembly method, gnucases. TXT in the Arch project based on Golang was generated by machine + manual modification to produce single test case in the following format:
// 0a011f1a| adc w10, w8, wzr
ADC_W_W_W(&buffer, R10, R8, RZR);
assert(buffer == bswap32(0x0a011f1a));
Copy the code
In the first line of comments, the first part is the big-endian representation of the machine code, and the second part is the assembly code corresponding to the machine code. The second act is an assembly method call to the assembler. The third act is to check the assembly result to confirm that the result is consistent with the machine code in the annotation. Because the machine code in the annotation is the big-endian representation, byte swap is required to match the assembly result.
disassemblers
The disassembler here does not contain full disassembly functionality, but is intended to be used to identify machine code in the optimizer and to use parameters from machine code. A simple example:
#define IS_MOV_X_X(ins) \
(IMM11(ins >> 21) == IMM11(HY_INS_TEMPLATE_MOV_X_X >> 21) && \
IMM11(ins >> 5) == IMM11(HY_INS_TEMPLATE_MOV_X_X >> 5))
Copy the code
This command can be used to determine whether an instruction is mov xd, xm in the optimizer, and then the specific value of d in XD can be extracted by the following code:
#define RM(ins) IMM5(ins >> 16)
#define RN(ins) IMM5(ins >> 5)
#define RD(ins) IMM5(ins)
Copy the code
Similarly, we also did the corresponding single test for disassembler:
// e7031eaa| mov x7, x30
assert(IS_MOV_X_X(bswap32(0xe7031eaa)));
Copy the code
Wasm compilation
At compile time, we iterate over each method of the WASM module, estimate the amount of memory needed to store the production code, and then translate the bytecode in the method into machine code.
The overall implementation of the core translation is a large loop + switch, which generates a corresponding machine code each time through an Opcode. The code example is as follows:
M3Result h3_JITFunction(IH3JITState state, IH3JITModule hmodule, IH3JITFunction hfunction) { uint32_t *alloc = state->code + state->codeOffset; . // prologue // stp x28, x27, [sp, #-0x60]! // stp x26, x25, [sp, #0x10]! // stp x24, x23, [sp, #0x20] // stp x22, x21, [sp, #0x30] // stp x20, x19, [sp, #0x40] // stp x29, x30, [sp, #0x50] // add x20, sp, #0x50 STP_X_X_X_I_PR(alloc + codeOffset++, R28, R27, RSP, -0x60); STP_X_X_X_I(alloc + codeOffset++, R26, R25, RSP, 0x10); STP_X_X_X_I(alloc + codeOffset++, R24, R23, RSP, 0x20); STP_X_X_X_I(alloc + codeOffset++, R22, R21, RSP, 0x30); STP_X_X_X_I(alloc + codeOffset++, R20, R19, RSP, 0x40); STP_X_X_X_I(alloc + codeOffset++, R29, R30, RSP, 0x50); ADD_X_X_I(alloc + codeOffset++, R29, RSP, 0x50); . for (bytes_t i = wasm; i < wasmEnd; i += opcodeSize) { uint32_t index = (uint32_t)(i - wasm) / sizeof(u8); uint8_t opcode = *i; . switch (opcode) { case OP_UNREACHABLE: { BRK_I(alloc + codeOffset++, 0); break; } case OP_NOP: { NOP(alloc + codeOffset++); break; }... case OP_REF_NULL: case OP_REF_IS_NULL: case OP_REF_FUNC: default: break; } if (spOffset > maxSpOffset) { maxSpOffset = spOffset; }}... // return 0(m3Err_none) MOV_X_I(alloc + codeOffset++, R0, 0); // epilogue // ldp x29, x30, [sp, #0x50] // ldp x20, x19, [sp, #0x40] // ldp x22, x21, [sp, #0x30] // ldp x24, x23, [sp, #0x20] // ldp x26, x25, [sp, #0x10] // ldp x28, x27, [sp], #0x60 // ret LDP_X_X_X_I(alloc + codeOffset++, R29, R30, RSP, 0x50); LDP_X_X_X_I(alloc + codeOffset++, R20, R19, RSP, 0x40); LDP_X_X_X_I(alloc + codeOffset++, R22, R21, RSP, 0x30); LDP_X_X_X_I(alloc + codeOffset++, R24, R23, RSP, 0x20); LDP_X_X_X_I(alloc + codeOffset++, R26, R25, RSP, 0x10); LDP_X_X_X_I_PO(alloc + codeOffset++, R28, R27, RSP, 0x60); RET(alloc + codeOffset++); . return m3Err_none; }Copy the code
This code will prologue the method, then iterate through the WASM bytecode for the for loop, producing the corresponding ARM64 machine code, and finally add the epilogue of the method.
Bytecode generation machine code for wASM opcode i32.add as an example:
case OP_I32_ADD: {
LDR_X_X_I(alloc + codeOffset++, R8, R19, (spOffset - 2) * sizeof(void *));
LDR_X_X_I(alloc + codeOffset++, R9, R19, (spOffset - 1) * sizeof(void *));
ADD_W_W_W(alloc + codeOffset++, R9, R8, R9);
STR_X_X_I(alloc + codeOffset++, R9, R19, (spOffset - 2) * sizeof(void *));
spOffset--;
break;
}
Copy the code
In the code, alloc is the first address of the machine code stored in the method currently being compiled, codeOffset is the offset of the current machine code relative to the first address, R8/R9 represents the two temporary registers we agreed, R19 is the address stored at the bottom of the stack, and spOffset is the stack offset relative to the bottom of the stack when running to the current opcode.
This code generates four machine codes to load two pieces of data at spoffset-2 and SPoffset-1 on the stack, add them together, and store the result at SPoffset-2 on the stack. Since the i32.add instruction consumes two stacks of data and generates one stack of data, the final stack offset is -1.
The machine code generated by the above code and its corresponding mnemonic form are as follows:
f9400a68: ldr x8, [x19, #0x10]
f9400e69: ldr x9, [x19, #0x18]
0b090109: add w9, w8, w9
f9000a69: str x9, [x19, #0x10]
Copy the code
X represents the 64-bit register and w represents the lower 32 bits of the 64-bit register. Since the i32.add instruction does 32-bit addition, only the lower 32 bits need to be added.
With the following Fibonacci C code:
uint32_t fib_native(uint32_t n) {
if (n < 2) return n;
return fib_native(n - 1) + fib_native(n - 2);
}
Copy the code
Compile the resulting WASM code:
parse | load module: 61 bytes
parse | found magic + version
parse | ** Type [1]
parse | type 0: (i32) -> i32
parse | ** Function [1]
parse | ** Export [1]
parse | index: 0; kind: 0; export: 'fib';
parse | ** Code [1]
parse | code size: 29
compile | compiling: 'fib'; wasm-size: 29; numArgs: 1; return: i32
compile | estimated constant slots: 3
compile | start stack index: 1
compile | 0 | 0x20 .. local.get
compile | 1 | 0x41 .. i32.const
compile | | .......... (const i32 = 2)
compile | 2 | 0x49 .. i32.lt_u
compile | 3 | 0x04 .. if
compile | 4 | 0x20 .... local.get
compile | 5 | 0x0f .... return
compile | 6 | 0x0b .. end
compile | 7 | 0x20 .. local.get
compile | 8 | 0x41 .. i32.const
compile | | .......... (const i32 = 2)
compile | 9 | 0x6b .. i32.sub
compile | 10 | 0x10 .. call
compile | | .......... (func= 'fib'; args= 1)
compile | 11 | 0x20 .. local.get
compile | 12 | 0x41 .. i32.const
compile | | .......... (const i32 = 1)
compile | 13 | 0x6b .. i32.sub
compile | 14 | 0x10 .. call
compile | | .......... (func= 'fib'; args= 1)
compile | 15 | 0x6a .. i32.add
compile | 16 | 0x0f .. return
compile | 17 | 0x0b end
compile | unique constant slots: 2; unused slots: 1
compile | max stack slots: 7
Copy the code
The hyEngine JIT-compiled output code looks like this:
0x107384000: stp x28, x27, [sp, #-0x60]!
0x107384004: stp x26, x25, [sp, #0x10]
0x107384008: stp x24, x23, [sp, #0x20]
0x10738400c: stp x22, x21, [sp, #0x30]
0x107384010: stp x20, x19, [sp, #0x40]
0x107384014: stp x29, x30, [sp, #0x50]
0x107384018: add x29, sp, #0x50 ; =0x50
0x10738401c: mov x19, x0
0x107384020: ldr x9, [x19]
0x107384024: str x9, [x19, #0x8]
0x107384028: mov w9, #0x2
0x10738402c: str x9, [x19, #0x10]
0x107384030: mov x9, #0x1
0x107384034: ldr x10, [x19, #0x8]
0x107384038: ldr x11, [x19, #0x10]
0x10738403c: cmp w10, w11
0x107384040: csel x9, x9, xzr, lo
0x107384044: str x9, [x19, #0x8]
0x107384048: ldr x9, [x19, #0x8]
0x10738404c: cmp x9, #0x0 ; =0x0
0x107384050: b.eq 0x107384068
0x107384054: ldr x9, [x19]
0x107384058: str x9, [x19, #0x8]
0x10738405c: ldr x9, [x19, #0x8]
0x107384060: str x9, [x19]
0x107384064: b 0x1073840dc
0x107384068: ldr x9, [x19]
0x10738406c: str x9, [x19, #0x10]
0x107384070: mov w9, #0x2
0x107384074: str x9, [x19, #0x18]
0x107384078: ldr x8, [x19, #0x10]
0x10738407c: ldr x9, [x19, #0x18]
0x107384080: sub w9, w8, w9
0x107384084: str x9, [x19, #0x10]
0x107384088: add x0, x19, #0x10 ; =0x10
0x10738408c: bl 0x10738408c
0x107384090: ldr x9, [x19]
0x107384094: str x9, [x19, #0x18]
0x107384098: mov w9, #0x1
0x10738409c: str x9, [x19, #0x20]
0x1073840a0: ldr x8, [x19, #0x18]
0x1073840a4: ldr x9, [x19, #0x20]
0x1073840a8: sub w9, w8, w9
0x1073840ac: str x9, [x19, #0x18]
0x1073840b0: add x0, x19, #0x18 ; =0x18
0x1073840b4: bl 0x1073840b4
0x1073840b8: ldr x8, [x19, #0x10]
0x1073840bc: ldr x9, [x19, #0x18]
0x1073840c0: add w9, w8, w9
0x1073840c4: str x9, [x19, #0x10]
0x1073840c8: ldr x9, [x19, #0x10]
0x1073840cc: str x9, [x19]
0x1073840d0: b 0x1073840dc
0x1073840d4: ldr x9, [x19, #0x10]
0x1073840d8: str x9, [x19]
0x1073840dc: mov x0, #0x0
0x1073840e0: ldp x29, x30, [sp, #0x50]
0x1073840e4: ldp x20, x19, [sp, #0x40]
0x1073840e8: ldp x22, x21, [sp, #0x30]
0x1073840ec: ldp x24, x23, [sp, #0x20]
0x1073840f0: ldp x26, x25, [sp, #0x10]
0x1073840f4: ldp x28, x27, [sp], #0x60
0x1073840f8: ret
Copy the code
This code took 1716ms to run FIB_native (40), while WASM3 explained that wASM ran the same code in 3637ms, which was only about 47% of the explained execution. But is this fast enough?
The optimizer
The code above doesn’t seem to have a major problem, but when compared to the native code compiled by LLVM, the gap becomes apparent. The FIB_native C code and LLVM compiled disassembly code are as follows:
Hyengine produces 63 instructions, while LLVM produces only 17 instructions, which is about 3.7 times as many as LLVM! However, the actual performance gap is larger. Fib_native (40) generated by Hyengine takes 1716ms to run, while LLVM takes 308ms, 5.57 times that of LLVM.
To close the gap, it’s time to do some optimization.
1) Main flow of the optimizer
The main flow of the optimizer is as follows:
First, the code of the whole method body is split according to the jump instruction (such as B/CBZ) and the jump target address, and the method body is divided into multiple code blocks. Then run an optimized pass for each block to optimize the code inside the block. Finally, the optimized instruction block was merged into a method body, and the optimization was completed.
One of the reasons for splitting the method body into blocks is that the optimizer may remove or add code that changes the jump target address of the jump instruction, requiring a recalculation of the jump target, which is easier to calculate when broken into blocks.
The aforementioned disassembler and assembler are used in the implementation of block splitting and optimization pass, which are the core dependencies of the overall optimizer.
Part of the code in our previous article as an example, do the introduction of optimization process, the first is block split:
; --- code block 0 ---
0x107384048: ldr x9, [x19, #0x8]
0x10738404c: cmp x9, #0x0 ; =0x0
-- 0x107384050: b.eq 0x107384068 ; b.eq 6
|
| ; --- code block 1 ---
| 0x107384054: ldr x9, [x19]
| 0x107384058: str x9, [x19, #0x8]
| 0x10738405c: ldr x9, [x19, #0x8]
| 0x107384060: str x9, [x19]
| 0x107384064: b 0x1073840dc
|
| ; --- code block 2 ---
-> 0x107384068: ldr x9, [x19]
0x10738406c: str x9, [x19, #0x10]
Copy the code
This will be split according to the fourth line of the code b.eq instruction and the target address of the jump line 14, the code is split into 3 blocks. Originally, b instruction in line 11 should also be split, but the previous B.EQ has been split, so it will not be split again.
The next pair runs a bunch of optimized passes on the chunked code, which results in the following:
; --- code block 0 ---
0x104934020: cmp w9, #0x2 ; =0x2
-- 0x104934024: b.hs 0x104934038 ; b.hs 5
|
| ; --- code block 1 ---
| 0x104934028: mov x9, x20
| 0x10493402c: mov x21, x9
| 0x104934030: mov x20, x9
| 0x104934034: b 0x104934068
|
| ; --- code block 2 ---
-> 0x104934038: sub w22, w20, #0x2 ; =0x2
Copy the code
After running a bunch of passes, the code is completely different (see the next section for the implementation of the key optimization), but you can see that the code of code block 1 is changed from 5 instructions to 4 instructions, and the previous B.eq is optimized to reduce the offset of the target address of b.hs jump by 1, from 6 to 5.
Finally, remerge the block into the new method body instruction:
0x104934020: cmp w9, #0x2 ; =0x2
0x104934024: b.hs 0x104934038 ; b.hs 5
0x104934028: mov x9, x20
0x10493402c: mov x21, x9
0x104934030: mov x20, x9
0x104934034: b 0x104934068
0x104934038: sub w22, w20, #0x2 ; =0x2
Copy the code
2) Key optimization of register allocation
One of the main reasons why 3.7x code is 5.57x slower is that the data in the code we produce is stored entirely on the stack, which is in memory. Memory access by various LDR/STR instructions is 4 times slower than register access, even if the data is in the CPU’s L1 cache. To do this, we can further improve performance if we keep data in registers as much as possible and reduce access to memory.
There are some mature schemes for register allocation, commonly used include: linear scan memory allocation based on live range, linear scan memory allocation based on Live Internal, memory allocation based on graph dyeing, etc. In a common JIT implementation, a live Internal based linear scan memory allocation scheme is used to achieve a balance between product performance and the time complexity of register allocation code.
For simplicity, HyEngine uses an alternative minimalist scheme that allocates memory based on a linear scan based on the number of code accesses, which in other words, allocates registers to stack offsets that occur most frequently in the code.
Assume the following code (excerpted from the Hyengine JIT output code) :
0x107384020: ldr x9, [x19]
0x107384024: str x9, [x19, #0x8]
0x107384028: mov w9, #0x2
0x10738402c: str x9, [x19, #0x10]
0x107384030: mov x9, #0x1
0x107384034: ldr x10, [x19, #0x8]
0x107384038: ldr x11, [x19, #0x10]
Copy the code
After assigning registers to the hypothetical code, the code looks like this:
0x107384020: ldr x9, [x19] ; Offset 0 does not change 0x107384024: mov x20, x9; X20 0x107384028: mov w9, #0x2 0x10738402c: mov x21, x9; X21 0x107384030: mov X9, #0x1 0x107384034: mov x10, x20; Offset 8 to x20 0x107384038: mov x11, x21; Offset 16 becomes x21Copy the code
The previous JIT production code was optimized as follows (note: a small amount of instruction fusion was done) :
0x102db4000: stp x28, x27, [sp, #-0x60]!
0x102db4004: stp x26, x25, [sp, #0x10]
0x102db4008: stp x24, x23, [sp, #0x20]
0x102db400c: stp x22, x21, [sp, #0x30]
0x102db4010: stp x20, x19, [sp, #0x40]
0x102db4014: stp x29, x30, [sp, #0x50]
0x102db4018: add x29, sp, #0x50 ; =0x50
0x102db401c: mov x19, x0
0x102db4020: ldr x9, [x19]
0x102db4024: mov x20, x9
0x102db4028: mov x21, #0x2
0x102db402c: mov x9, #0x1
0x102db4030: cmp w20, w21
0x102db4034: csel x9, x9, xzr, lo
0x102db4038: mov x20, x9
0x102db403c: cmp x9, #0x0 ; =0x0
0x102db4040: b.eq 0x102db4054
0x102db4044: ldr x9, [x19]
0x102db4048: mov x20, x9
0x102db404c: str x20, [x19]
0x102db4050: b 0x102db40ac
0x102db4054: ldr x9, [x19]
0x102db4058: mov x21, x9
0x102db405c: mov x22, #0x2
0x102db4060: sub w9, w21, w22
0x102db4064: mov x21, x9
0x102db4068: add x0, x19, #0x10 ; =0x10
0x102db406c: str x21, [x19, #0x10]
0x102db4070: bl 0x102db4070
0x102db4074: ldr x21, [x19, #0x10]
0x102db4078: ldr x9, [x19]
0x102db407c: mov x22, x9
0x102db4080: mov x23, #0x1
0x102db4084: sub w9, w22, w23
0x102db4088: mov x22, x9
0x102db408c: add x0, x19, #0x18 ; =0x18
0x102db4090: str x22, [x19, #0x18]
0x102db4094: bl 0x102db4094
0x102db4098: ldr x22, [x19, #0x18]
0x102db409c: add w9, w21, w22
0x102db40a0: mov x21, x9
0x102db40a4: str x21, [x19]
0x102db40a8: nop
0x102db40ac: mov x0, #0x0
0x102db40b0: ldp x29, x30, [sp, #0x50]
0x102db40b4: ldp x20, x19, [sp, #0x40]
0x102db40b8: ldp x22, x21, [sp, #0x30]
0x102db40bc: ldp x24, x23, [sp, #0x20]
0x102db40c0: ldp x26, x25, [sp, #0x10]
0x102db40c4: ldp x28, x27, [sp], #0x60
0x102db40c8: ret
Copy the code
After optimization, the number of code is reduced from 63 to 51, and the amount of memory access is significantly reduced, and the time consumption is also reduced to 1361ms, which is about 4.42 times of LLVM.
3) Key optimization of register parameter transfer
In the last article of register allocation optimization, the need to copy the value of the register back to the stack during method calls increases the overhead of memory access. The relevant assembly code is:
0x102db4068: add x0, x19, #0x10 ; =0x10
0x102db406c: str x21, [x19, #0x10]
0x102db4070: bl 0x102db4070
0x102db4074: ldr x21, [x19, #0x10]
Copy the code
In ARM64, arguments are passed through registers, saving two memory accesses per method call. The first argument, x22, goes directly to x1, and the return value of the method call, x0, goes directly to x22.
0x1057e405c: add x0, x19, #0x10 ; =0x10
0x1057e4060: mov x1, x22
0x1057e4064: bl 0x1057e4064
0x1057e4068: mov x22, x0
Copy the code
Note: Because the stack offset 0 is also allocated a register, the register number is 1 more than the code before optimization. At the same time, the method header takes arguments from the code:
0x102db4020: ldr x9, [x19]
0x102db4024: mov x20, x9
Copy the code
Optimization for:
0x1057e4020: mov x20, x1
Copy the code
There is one less memory access and one less instruction. The final complete code after optimization is as follows:
0x1057e4000: stp x28, x27, [sp, #-0x60]!
0x1057e4004: stp x26, x25, [sp, #0x10]
0x1057e4008: stp x24, x23, [sp, #0x20]
0x1057e400c: stp x22, x21, [sp, #0x30]
0x1057e4010: stp x20, x19, [sp, #0x40]
0x1057e4014: stp x29, x30, [sp, #0x50]
0x1057e4018: add x29, sp, #0x50 ; =0x50
0x1057e401c: mov x19, x0
0x1057e4020: mov x20, x1
0x1057e4024: mov x21, x20
0x1057e4028: mov x22, #0x2
0x1057e402c: mov x9, #0x1
0x1057e4030: cmp w21, w22
0x1057e4034: csel x9, x9, xzr, lo
0x1057e4038: mov x21, x9
0x1057e403c: cmp x9, #0x0 ; =0x0
0x1057e4040: b.eq 0x1057e404c
0x1057e4044: mov x21, x20
0x1057e4048: b 0x1057e409c
0x1057e404c: mov x22, x20
0x1057e4050: mov x23, #0x2
0x1057e4054: sub w9, w22, w23
0x1057e4058: mov x22, x9
0x1057e405c: add x0, x19, #0x10 ; =0x10
0x1057e4060: mov x1, x22
0x1057e4064: bl 0x1057e4064
0x1057e4068: mov x22, x0
0x1057e406c: mov x23, x20
0x1057e4070: mov x24, #0x1
0x1057e4074: sub w9, w23, w24
0x1057e4078: mov x23, x9
0x1057e407c: add x0, x19, #0x18 ; =0x18
0x1057e4080: mov x1, x23
0x1057e4084: bl 0x1057e4084
0x1057e4088: mov x23, x0
0x1057e408c: add w9, w22, w23
0x1057e4090: mov x22, x9
0x1057e4094: mov x20, x22
0x1057e4098: nop
0x1057e409c: mov x0, x20
0x1057e40a0: ldp x29, x30, [sp, #0x50]
0x1057e40a4: ldp x20, x19, [sp, #0x40]
0x1057e40a8: ldp x22, x21, [sp, #0x30]
0x1057e40ac: ldp x24, x23, [sp, #0x20]
0x1057e40b0: ldp x26, x25, [sp, #0x10]
0x1057e40b4: ldp x28, x27, [sp], #0x60
0x1057e40b8: ret
Copy the code
After optimization, the number of code is reduced from 51 to 47, and the time is also reduced to 687ms, which is about 2.23 times that of LLVM. Although the number of pieces of code is only reduced by four, the time taken is significantly reduced by about 50%.
Note: this optimization skips only methods with short method bodies and frequently called methods, and the effect is not obvious for code with long method bodies.
4) Feature matching for key optimization
Feature matching is to traverse the preset code features in the code and optimize the code that conforms to the features. For example in the code above:
0x1057e404c: mov x22, x20
0x1057e4050: mov x23, #0x2
0x1057e4054: sub w9, w22, w23
0x1057e4058: mov x22, x9
Copy the code
Can be optimized as:
0x104934038: sub w22, w20, #0x2 ; =0x2
Copy the code
Four instructions become one.
5) Optimization results
After the above optimizations, the code becomes:
0x104934000: stp x24, x23, [sp, #-0x40]!
0x104934004: stp x22, x21, [sp, #0x10]
0x104934008: stp x20, x19, [sp, #0x20]
0x10493400c: stp x29, x30, [sp, #0x30]
0x104934010: add x29, sp, #0x30 ; =0x30
0x104934014: mov x19, x0
0x104934018: mov x20, x1
0x10493401c: mov x9, x20
0x104934020: cmp w9, #0x2 ; =0x2
0x104934024: b.hs 0x104934038
0x104934028: mov x9, x20
0x10493402c: mov x21, x9
0x104934030: mov x20, x9
0x104934034: b 0x104934068
0x104934038: sub w22, w20, #0x2 ; =0x2
0x10493403c: add x0, x19, #0x10 ; =0x10
0x104934040: mov x1, x22
0x104934044: bl 0x104934000
0x104934048: mov x22, x0
0x10493404c: sub w23, w20, #0x1 ; =0x1
0x104934050: add x0, x19, #0x18 ; =0x18
0x104934054: mov x1, x23
0x104934058: bl 0x104934000
0x10493405c: add w9, w22, w0
0x104934060: mov x22, x9
0x104934064: mov x20, x9
0x104934068: mov x0, x20
0x10493406c: ldp x29, x30, [sp, #0x30]
0x104934070: ldp x20, x19, [sp, #0x20]
0x104934074: ldp x22, x21, [sp, #0x10]
0x104934078: ldp x24, x23, [sp], #0x40
0x10493407c: ret
Copy the code
After optimization, the number of code is reduced from 63 to 32, and the time is reduced from 1716ms to 493ms, which is about 1.6 times that of LLVM.
Quickjs compilation
Note: Js takes most of the time at Runtime, so the JIT only accounts for about 20% of the overall js performance optimization. More JIT optimization details will be introduced later.
The quickJS compilation process is similar to that of WASM, except that the implementation of OpCode is slightly more complex. For example, OP_object:
// *sp++ = JS_NewObject(ctx);
// if (unlikely(JS_IsException(sp[-1])))
// goto exception;
case OP_object: {
MOV_FUNCTION_ADDRESS_TO_REG(R8, JS_NewObject);
MOV_X_X(NEXT_INSTRUCTION, R0, CTX_REG);
BLR_X(NEXT_INSTRUCTION, R8);
STR_X_X_I(NEXT_INSTRUCTION, R0, R26, SP_OFFSET(0));
CHECK_EXCEPTION(R0, R9);
break;
}
Copy the code
We first place the address of the JS_NewObject method to be called into register R8 using the MOV_FUNCTION_ADDRESS_TO_REG macro:
#define MOV_FUNCTION_ADDRESS_TO_REG(reg, func) \ { \ uintptr_t func##Address = (uintptr_t)func; \ MOVZ_X_I_S_I(NEXT_INSTRUCTION, reg, IMM16(func##Address), LSL, 0); \ if (IMM16(func##Address >> 16) ! = 0) { \ MOVK_X_I_S_I(NEXT_INSTRUCTION, reg, IMM16(func##Address >> 16), \ LSL, 16); \ } else { \ NOP(NEXT_INSTRUCTION); \ } \ if (IMM16(func##Address >> 32) ! = 0) { \ MOVK_X_I_S_I(NEXT_INSTRUCTION, reg, IMM16(func##Address >> 32), \ LSL, 32); \ } else { \ NOP(NEXT_INSTRUCTION); \ } \ if (IMM16(func##Address >> 48) ! = 0) { \ MOVK_X_I_S_I(NEXT_INSTRUCTION, reg, IMM16(func##Address >> 48), \ LSL, 48); \ } else { \ NOP(NEXT_INSTRUCTION); \} \}Copy the code
Then put CTX_REG(the CTX address stored in it) into R0 as the first argument and call JS_NewObject. The result is then stored at SP_OFFSET(0) in the JS stack. Then check whether the result is abnormal by CHECK_EXCEPTION:
#define EXCEPTION(tmp) \
LDR_X_X_I(NEXT_INSTRUCTION, tmp, CTX_REG, HYJS_BUILTIN_OFFSET(0)); \
MOV_X_I(NEXT_INSTRUCTION, R0, SP_OFFSET(0)); \
BLR_X(NEXT_INSTRUCTION, tmp);
#define CHECK_EXCEPTION(reg, tmp) \
MOV_X_I(NEXT_INSTRUCTION, tmp, ((uint64_t)JS_TAG_EXCEPTION<<56)); \
CMP_X_X_S_I(NEXT_INSTRUCTION, reg, tmp, LSL, 0); \
B_C_L(NEXT_INSTRUCTION, NE, 4 * sizeof(uint32_t)); \
EXCEPTION(tmp)
Copy the code
This single opcode generates 13 arm64 machine codes! And that’s not much.
The SAME Fibonacci implementation, wASM jit production code is only 32, quickJS has 467!! And the fear of being at the mercy of assembly.
Note: This instruction comes from the call to Builtin, reference counting, type judgment. Later VM optimizations reduced the number of lines of code to 420 after the reference count was dead.
2 Engine (VM) part
Because WASM itself is strongly typed bytecode, the Runtime itself provides less power, and the performance bottleneck is mostly in code interpretation execution, the VM part is largely unoptimized. Quickjs bytecodes are weakly typed bytecodes that rely on Runtime for their main functions, and because the language itself takes over memory management, the gc overhead is also significant.
After the previous performance analysis of a business JS code, it is found that more than 50% of the performance cost is in memory allocation and GC. This engine part will mainly introduce memory allocation and GC optimization of QuickJS. Some runtime builtin fast path, inline cache optimization is not currently high, only a small introduction.
Hymalloc memory allocator
In order to achieve hyEngine’s quickJS performance optimization while maintaining the memory management required for GC optimization, a faster (lock-free, non-thread-safe) memory allocator was designed. You also need to consider customizations that might be required for other engines to make Hymalloc as versatile as possible.
1) Introduction to implementation
Hymalloc divides the memory into 19 regions and 18 small regions /1 large region. Small region Stores regular memory. The size of each region ranges from 116 to 1916 bytes. The large region is used to store memory larger than 9 x 16 bytes.
Each area can contain multiple pools, and each pool can contain multiple items of target size. Large regions are special, with only one entry per pool. When applying for memory from the system, apply for memory according to the pool, and then divide the pool into corresponding items.
Each small Region initializes a pool with a configurable size (1024 items by default). Large region is empty by default.
The area/block/pool is as follows:
Here’s a quick look at the two most critical data structures:
// hymalloc item
struct HYMItem {
union {
HYMRegion* region; // set to region when allocated
HYMItem* next; // set to next free item when freed
};
size_t flags;
uint8_t ptr[0];
};
// hymalloc pool
struct HYMPool {
HYMRegion *region;
HYMPool *next;
size_t item_size;
};
Copy the code
HYMItem is the data structure of the item mentioned above, where the size of the item is not fixed. The data structure itself is more like the description of the item header, where flags currently exist as special flags for GC. PTR is used to get the address of the actual available portion of memory for an item (obtained by &item-> PTR). Region/Next in a union is a design for saving memory. Before the item is allocated, the value of next points to the next free item of the region; After the item is allocated, region is set to the region address to which the item belongs. The idle item chain of a region represents the following intent:
During memory allocation, the first item of the linked list is taken as the allocation result. If the linked list is empty, a new pool is applied to the system and the item of the pool is added to the linked list. The allocation diagram is as follows:
The assignment code is as follows:
static void* _HYMallocFixedSize(HYMRegion *region, size_t size) { // allocate new pool, if no free item exists if (region->free_item_list == NULL) { // notice: large region's item size is 0, use 'size' instead size_t item_size = region->item_size ? region->item_size : size; int ret = _HYMAllocPool(region, region->pool_initial_item_count, item_size); if (! ret) { return NULL; } } // get free list item head, and set region to item's region HYMItem *item = region->free_item_list; region->free_item_list = item->next; item->region = region; item->flags = 0; return &item->ptr; }Copy the code
To free memory, insert the item into the head of the free list of the region:
void HYMFree(void *ptr) {
HYMItem *item = (HYMItem *)((uint8_t *)ptr - HYM_ITEM_SIZE_OVERHEAD);
// set item as head of region's free item list
HYMRegion *region = item->region;
HYMItem *first_item_in_region = region->free_item_list;
region->free_item_list = item;
item->next = first_item_in_region;
}
Copy the code
In a simple memory allocation /free test case, the above implementation was about 4 times faster than system-supplied Malloc/Free on a macbook M1 device.
2) Memory Compact + Update
To reduce memory usage, Hymalloc implements partial memory Compact to clear unused Pools of small regions and all pools of large regions. However, the update function is not implemented at present. Therefore, items in different pools cannot be copied from each other to save more memory.
However, in the client usage scenario, the memory usage of the running code is not high, and the implementation complexity of the complete compact + Update combination is high, and the cost performance is insufficient. The need for complete Compact + Update will be re-evaluated based on actual service usage.
3) Limitations of Hymalloc
To improve allocation and release performance, each item of Hymalloc has a header, which requires additional memory space, resulting in a certain amount of memory waste.
Hymalloc provides the compact method to release free memory, but memory is allocated in batches according to the pool. As long as one item in the pool is used, the pool will not be released, so the memory cannot be released efficiently.
In addition, considering the probability of memory overcommitment, memory of large regions will be allocated with 256bytes alignment by default, which may also lead to waste.
The above problem can be reduced by setting the default number of items for smaller pools and smaller alignment sizes, sacrificing a small amount of performance.
A more rational data structure and a more sophisticated compact + update mechanism can then be introduced to reduce memory waste.
Garbage collector HYGC
The original GC of QuickJS is based on reference counting + Mark Sweep. Its design and implementation are relatively simple and efficient, but it does not realize generational, multi-threading, compact, idle GC, and copy GC, which makes GC occupy a relatively high proportion of the overall execution time, as well as potential performance degradation caused by memory fragmentation. In addition, due to the existence of reference counting, jit generated code will have a large number of reference counting operation instructions, resulting in large code volume.
In order to optimize QuickJS performance for HyEngine, reduce the proportion of GC in the overall elapsed time and reduce the amount of time that GC can cause long rundowns. With reference to the GC design ideas of v8 and other advanced engines, a set of lightweight, high-performance and simple GC is implemented for mobile terminal services.
Note: This implementation is for QuickJS only; general-purpose GC implementations may follow.
Note: Gc pause time should be controlled within 30ms in order to ensure that business experience does not lag.
1) Common garbage collection implementation
There are three main categories of garbage collection commonly used:
- Reference counting adds one more reference to each object, one more reference +1, one less reference -1, and if the number of references is zero, it is released. Disadvantages: Cannot solve the circular reference problem.
- Mark sweep traverses an object to mark whether it has a reference or not, and then cleans it up if it doesn’t.
- Copy gc iterates through objects, marks them for references, makes a new copy of the referenced objects, and discards all old memory.
Based on these three categories, there will be some derivatives to achieve multithreading support, such as:
- Three-color gc traverses the object, marking whether the object has a reference, state than pure reference (black) and no reference (white) more than an intermediate state flag/uncertain (gray), can support multi-threading.
In order to minimize GC pause time and js execution time, HYGC adopts multi-threaded tri-color GC scheme. In business case testing, it was found that the memory usage was not large, so generational support was not introduced.
2) HyGC’s business strategy
Hygc plans to expose policies to users to meet performance requirements in different scenarios. It provides no GC, idle GC, and multi-threaded GC to meet different memory and performance requirements in different scenarios. Select a GC policy based on actual requirements. You are advised to enable the GC policy to avoid unexpected results.
- No GC runtime does not trigger GC operations. Do a full GC to free the entire memory after the code has completely run and destroyed the Runtime.
- No GC operation is triggered during the idle GC run and gc is performed on the asynchronous thread after the run. When the code has completely run and destroyed the Runtime, do a full GC to free the entire memory.
- The default GC runtime triggers gc. When the code has completely run and destroyed the Runtime, do a full GC to free the entire memory.
One of our business cases can be set to no OR idle GC because no memory can be reclaimed during code execution and GC is a waste of time.
3) Implementation scheme of HYGC
Quickjs’s original reference counting + Mark Sweep GC scheme was removed during GC optimization and replaced with a new multi-threaded tri-color tag GC scheme. The implementation of HYGC reused some of the original QuickJS code to make the required functionality as simple as possible.
Hygc’s tricolor tagging process (single-threaded version) :
First of all, the main operation of collecting the root object is to scan the stack of THE JS thread, and collect the JS object on the stack and the object associated with the JS call stack as the root object of the tricolor mark. The child objects are then recursively marked from the root object as the mark entry. Iterate through the gc_obj_list (where all quickJS objects that need to be gc are on the bidirectional linked list) and place unmarked objects into tmp_obj_list. Finally, the objects in tmp_obj_list are freed.
Single-threaded GC completely suspends JS execution during GC, potentially causing business stutter (just potentially, there is no stutter caused by GC due to the small memory usage of the actual business), and making JS execution time relatively long. For this purpose, hyGC introduces multi-threaded tricolor tags, which flow as follows:
In the multi-threaded version, there are js and GC threads. The JS thread collects the root object and moves the old object to the asynchronous GC linked list, and then JS execution continues. The GC thread sets all the tri-color flags of the old objects to 0, then starts marking the live objects, and then collects the garbage objects. Here we split the garbage object release into two phases, one is garbage object related property modification and nulling that can be performed in the GC thread, and the other is memory release that needs to be done in the JS thread, because Hymalloc is not thread-safe. This leaves only three relatively time-consuming GC operations in the JS thread: root object collection, old object transfer, and memory release.
Note: Sadly, since mark and garbage collection are still done on a single thread, only two colors are used to mark it, and gray is not really used. Subsequent optimizations allow the HYGC implementation to coexist with the original QuickJS GC, making gc migration less risky.
4) Limitations of HYGC
Asynchronous threads of HYGC will only gc old objects during garbage collection, and new objects will not participate in GC after the transfer of old objects, which may result in a spike in memory usage, depending on the execution time of the GC thread.
This problem will be determined whether to optimize the solution according to the actual situation.
Other optimization Examples
1) Inline cache for the global object
The operations on QuickJS’s global object are separately compiled as OP_get_var/OP_put_var and other op’s, which are particularly slow to implement, so we added inline cache for global object access. Inline cache is used to cache offsets from the array of attributes accessed by a piece of code, so that the next fetch will be the offset instead of iterating through the array.
The data structure of the global inline cache is as follows:
typedef struct {
JSAtom prop; // property atom
int offset; // cached property offset
void *obj; // global_obj or global_var_obj
} HYJSGlobalIC;
Copy the code
Void *obj in line 4 is special because quickJS’s global can be stored in the context’s global_obj or global_var_obj, which needs to be stored in the cache. The specific code is as follows:
case OP_get_var: { // 73
JSAtom atom = get_u32(buf + i + 1);
uint32_t cache_index = hyjs_GetGlobalICOffset(ctx, atom);
JSObject obj;
JSShape shape;
LDR_X_X_I(NEXT_INSTRUCTION, R8, CTX_REG, (int32_t)((uintptr_t)&ctx->global_ic - (uintptr_t)ctx));
ADD_X_X_I(NEXT_INSTRUCTION, R8, R8, cache_index * sizeof(HYJSGlobalIC));
LDP_X_X_X_I(NEXT_INSTRUCTION, R0, R9, R8, 0);
CBZ_X_L(NEXT_INSTRUCTION, R9, 12 * sizeof(uint32_t)); // check cache exsits
LSR_X_X_I(NEXT_INSTRUCTION, R1, R0, 32); // get offset
LDR_X_X_I(NEXT_INSTRUCTION, R2, R9, (int32_t)((uintptr_t)&obj.shape - (uintptr_t)&obj)); // get shape
ADD_X_X_I(NEXT_INSTRUCTION, R2, R2, (int32_t)((uintptr_t)&shape.prop - (uintptr_t)&shape)); // get prop
LDR_X_X_W_E_I(NEXT_INSTRUCTION, R3, R2, R1, UXTW, 3); // get prop
LSR_X_X_I(NEXT_INSTRUCTION, R3, R3, 32);
CMP_W_W_S_I(NEXT_INSTRUCTION, R0, R3, LSL, 0);
B_C_L(NEXT_INSTRUCTION, NE, 5 * sizeof(uint32_t));
LDR_X_X_I(NEXT_INSTRUCTION, R2, R9, (int32_t)((uintptr_t)&obj.prop - (uintptr_t)&obj)); // get prop
LSL_W_W_I(NEXT_INSTRUCTION, R1, R1, 4); // R1 * sizeof(JSProperty)
LDR_X_X_W_E_I(NEXT_INSTRUCTION, R0, R2, R1, UXTW, 0); // get value
B_L(NEXT_INSTRUCTION, 17 * sizeof(uint32_t));
MOV_FUNCTION_ADDRESS_TO_REG(R8, HYJS_GetGlobalVar);
MOV_X_X(NEXT_INSTRUCTION, R0, CTX_REG);
MOV_IMM32_TO_REG(R1, atom);
MOV_X_I(NEXT_INSTRUCTION, R2, opcode - OP_get_var_undef);
MOV_X_I(NEXT_INSTRUCTION, R3, cache_index);
BLR_X(NEXT_INSTRUCTION, R8);
CHECK_EXCEPTION(R0, R9);
STR_X_X_I(NEXT_INSTRUCTION, R0, R26, SP_OFFSET(0));
i += 4;
break;
}
Copy the code
The first is hyjs_GetGlobalICOffset on line 5, which assigns the cache_index of an inline cache to the current opcode. This cache_index is set to the fourth parameter of the HYJS_GetGlobalVar method call at line 31. In lines 9 through 19, cache_index is used to fetch the cache, and offset is used to fetch prop(the attribute ID, of type atom) stored in the offset of the global object. Check that the cache is still valid by comparing it to atom for the attribute of the object you currently want to fetch. If the cache is valid, fetch the object property array directly through lines 20-22. If not, walk down a slow path to line 26, traverse the property array, and update the inline cache.
2) Builtin fast path optimization
Fast path optimization is the process of isolating parts of the code that have a higher probability of execution to avoid execution delays in redundant code.
Take the implementation of array.indexof as an example:
static JSValue hyjs_array_indexOf(JSContext *ctx, JSValueConst func_obj, JSValueConst obj, int argc, JSValueConst *argv, int flags) { ...... res = -1; if (len > 0) { ...... // fast path if (JS_VALUE_GET_TAG(element) == JS_TAG_INT) { for (; n < count; n++) { if (JS_VALUE_GET_PTR(arrp[n]) == JS_VALUE_GET_PTR(element)) { res = n; goto done; } } goto property_path; } // slow path for (; n < count; n++) { if (js_strict_eq2(ctx, JS_DupValue(ctx, argv[0]), JS_DupValue(ctx, arrp[n]), JS_EQ_STRICT)) { res = n; goto done; e } } ...... } done: return JS_NewInt64(ctx, res); exception: return JS_EXCEPTION; }Copy the code
The original implementation is a slow path starting at line 23, where the js_strict_eQ2 method is called to determine whether the array index is equal. This comparison method is relatively heavy. If the index is an int, it will be faster to compare int data than to call js_strict_eq2.
Iv. Optimization Results
Performance test device is based on M1 (ARM64) chip macbook, wASM service performance test is based on Huawei Mate 8 mobile phone; The selection method of test results is to run 5 times for each case, and take the third place result; Fibonacci series, Benchmark and business case are selected as test cases to evaluate performance changes brought by optimization in different scenarios.
1 wasm performance
Note: The time obtained in business case is the overall time of single frame rendering, including wASM execution and rendering time.
Note: The JIT time of CoreMark HyEngine is about 3 times that of the LLVM compiled version due to insufficient optimization of computation instructions. More computation instructions can be optimized in the optimizer later.
Note: The above test compilation optimization option is O3.
2 js performance
Note: Some microbench items are negatively optimized in GC optimization, so the improvement of overall optimization is not obvious, but the change of items has little impact on business.
Note: It can be seen from the business case that the improvement brought by VM optimization is much greater than that brought by JIT at present, because JIT currently introduces few optimization methods and there is still a lot of optimization space. In addition, case 1 saw only about 30% jit improvement over Jitless on V8. In JIT implementation, a single optimization single may bring less than 1% of the improvement, need to heap dozens or hundreds of different optimization, to make the performance to achieve such as 30% improvement, the subsequent will be more performance requirements and development costs to make a balanced choice. Note: The above test compilation optimization option is Os.
V. Follow-up Plan
The following plans are mainly divided into two directions: performance optimization and multi-language support, among which performance optimization will continue. Performance optimization points include:
- Compiler optimization to introduce its own bytecode support.
- Optimizer optimization, introducing more optimization passes.
- Own runtime, hot method assembly implementation.
Vi Reference Content
- wasm3: github.com/wasm3/wasm3
- quickjs: bellard.org/quickjs/
- v8: chromium.googlesource.com/v8/v8.git
- Javascriptcore: opensource.apple.com/tarballs/Ja…
- golang/arch: github.com/golang/arch
- Libmalloc: opensource.apple.com/tarballs/li…
- Trash talk: The Orinoco garbage collector: v8.dev/blog/ Trash -…
- JavaScript engine fundamentals: Shapes and Inline Caches: mathiasbynens. Be/notes/shape…
- Cs143: web.stanford.edu/class/cs143…
- C in ASM(ARM64) : www.zhihu.com/column/c\_1…
The original link
This article is the original content of Aliyun and shall not be reproduced without permission.