Author: Su Ok-tong

In early 2021, Python's author Guido van Rossum was rehired by Microsoft to continue working on CPython, and he and his team proposed the faster-cpython plan to improve CPython's performance fivefold over four years. The whole project is developed in the open under the faster-cpython organization on GitHub; from its activity you can see that some of the project's ideas have already been implemented and validated by the corresponding code.

This article interprets PEP 659, an important proposal in that project, and analyzes its source code, in order to learn the ideas and techniques of virtual machine performance optimization at the bytecode level. Hopefully it will give you some inspiration.

Interpreting the proposal

PEP 659 was created in April 2021; its full title is Specializing Adaptive Interpreter. There are two keywords here, Specializing and Adaptive, which can be roughly understood as adaptively replacing the code at a specific position with specialized code, so that the operation at that position runs faster. For example, if we observe that the dict accessed by a piece of code never changes across multiple executions, we can optimize that code to cache the index of the dict entry directly, so the next query avoids the hashtable lookup and performance improves. The observation here corresponds to Adaptive, and the replacement of the code corresponds to Specializing.
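
To make the idea concrete, here is a minimal, self-contained C sketch of that caching pattern (a toy model, not CPython code): a call site remembers the table's version and the slot index it found last time, and skips the hash probe as long as the version is unchanged.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { const char *key; int value; } Entry;

typedef struct {
    uint32_t version;    /* bumped on every mutation of the table */
    Entry entries[8];    /* toy stand-in for a dict's hashtable */
} Table;

typedef struct { uint32_t version; int index; } InlineCache;

/* Slow path: a linear scan standing in for a real hash lookup. */
static int lookup_slow(const Table *t, const char *key) {
    for (int i = 0; i < 8; i++) {
        if (t->entries[i].key && strcmp(t->entries[i].key, key) == 0) {
            return i;
        }
    }
    return -1;
}

/* Fast path: if the table version matches the cached one, the cached slot
   index is still valid and we can read the value directly. */
static int lookup_cached(const Table *t, const char *key, InlineCache *c) {
    if (c->index >= 0 && c->version == t->version) {
        return t->entries[c->index].value;   /* "specialized": no probing */
    }
    int i = lookup_slow(t, key);             /* "adaptive": observe and refill */
    c->version = t->version;
    c->index = i;
    return i >= 0 ? t->entries[i].value : -1;
}

int main(void) {
    Table t = { .version = 1 };
    t.entries[3] = (Entry){ "x", 42 };
    InlineCache c = { 0, -1 };
    printf("%d\n", lookup_cached(&t, "x", &c));  /* slow lookup, fills the cache */
    printf("%d\n", lookup_cached(&t, "x", &c));  /* served from the cached index */
    return 0;
}

PEP 659 applies this same pattern inside the interpreter, with the dict keys version (dk_version, which we will meet later) playing the role of the table version.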

The example above is not precise; it is only meant to give you a first impression of the Specializing Adaptive Interpreter. Next, let us go through some key statements in the proposal.

First of all, it should be clear that PEP 659 is not a JIT solution: it is intended to bring the benefits of Faster CPython to users who cannot use PyPy or other JIT compilers. On iOS, for example, code signing prevents user processes from dynamically creating executable code pages; a dynamically generated page is rejected on a page fault because it does not carry a valid signature, so a Python virtual machine with a JIT compiler cannot be used there directly.

Reading this, some people may wonder: without a JIT, how much room and benefit is there in optimizing at the virtual machine level alone? The authors give some explanation in PEP 659:

Specialization is typically done in the context of a JIT compiler, but research shows specialization in an interpreter can boost performance significantly, even outperforming a naive compiler.

That is, research has found that performing specialization at the interpreter level alone can bring significant performance gains, even outperforming some naive JIT compilers. The authors cite an earlier paper of their own; interested readers can find it in the references section of PEP 659.

Since PEP 659 does not include a JIT compiler, it does not generate code; it only selects among code that already exists. We therefore need to enumerate the possible optimizations in advance and prepare the specialized code, so that the bytecode can be replaced when a matching optimization opportunity is observed. When the optimization conditions no longer hold, it must also be able to gracefully fall back to the pre-optimization code to guarantee the correctness of the program.

In order to better enumerate the optimization cases and switch between code variants, an appropriate optimization granularity has to be chosen. The proposal puts it this way:

By using adaptive and speculative specialization at the granularity of individual virtual machine instructions, we get a faster interpreter that also generates profiling information for more sophisticated optimizations in the future.

That is, the optimization is performed at the level of individual virtual machine instructions, rather than at the region or function granularity of a JIT, so it can be tailored to specific instructions. For example, in CPython both globals and builtins lookups go through the LOAD_GLOBAL instruction, so two specialized instructions can be provided: LOAD_GLOBAL_MODULE and LOAD_GLOBAL_BUILTIN. When we find that the LOAD_GLOBAL at a particular position always hits globals, we can specialize it to LOAD_GLOBAL_MODULE, and vice versa for LOAD_GLOBAL_BUILTIN. We can also cache the entry index of the globals or builtins dict to avoid repeatedly probing the dict hashtable. If the optimization conditions are no longer met (for example, the lookup fails or the dict has changed), the instruction rolls back to LOAD_GLOBAL to preserve correctness.
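
For reference, the un-specialized behavior being optimized looks roughly like the following sketch (a simplification, not the actual ceval.c code): every execution of a generic LOAD_GLOBAL has to probe the globals dict and then, on a miss, the builtins dict.

#include <Python.h>

/* Simplified sketch of the generic LOAD_GLOBAL semantics (not the real
   interpreter code): two hashtable lookups on every execution. */
static PyObject *
load_global_generic(PyObject *globals, PyObject *builtins, PyObject *name)
{
    PyObject *v = PyDict_GetItemWithError(globals, name);    /* 1st dict probe */
    if (v == NULL && !PyErr_Occurred()) {
        v = PyDict_GetItemWithError(builtins, name);          /* 2nd dict probe */
    }
    Py_XINCREF(v);  /* PyDict_GetItemWithError returns a borrowed reference */
    return v;       /* NULL with no error set would become a NameError */
}

The specialized forms aim to skip one or both of these probes by validating cached information instead.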

The transition above from LOAD_GLOBAL to LOAD_GLOBAL_MODULE/LOAD_GLOBAL_BUILTIN is the Specializing in the PEP title, while choosing whether to replace the instruction with LOAD_GLOBAL_MODULE or LOAD_GLOBAL_BUILTIN is the Adaptive part: its job is to observe how the instruction executes at a particular code position and select the right specialized instruction for it. Since observation can only happen while the virtual machine is executing the code, an additional adaptive instruction, LOAD_GLOBAL_ADAPTIVE, is introduced to carry out the observation and replacement logic.

Although specializing can improve speed by eliminating checks and adding caches, quickening the original instruction into the adaptive instruction, and the observation that the adaptive instruction performs, both have a cost of their own. Optimization therefore has to follow certain strategies rather than blindly trying to optimize every instruction in the code. As the proposal puts it:

Typical optimizations for virtual machines are expensive, so a long “warm up” time is required to gain confidence that the cost of optimization is justified.

Specializing optimizations in the virtual machine are expensive: the cost lies in having to insert the optimization code into the big bytecode evaluation loop, sometimes even adding extra processing logic to every iteration. Moreover, a lot of code may only ever run once, and optimizing it would be a waste of time. We therefore need to identify the code that is worth optimizing, and one of its main characteristics is that it is called frequently. We can add a counter to each PyCodeObject (the CPython object that holds bytecode and environment information) and only enter the optimization logic once the number of executions exceeds a threshold. This process is called warm up.

At the virtual machine level, the co_warmup counter of a PyCodeObject is incremented whenever the bytecode object is executed (roughly, whenever a piece of Python bytecode runs) or an absolute jump occurs within it. When the counter reaches the threshold, all optimizable instructions in the bytecode are replaced with adaptive instructions for observation and specialization; from then on the instructions only switch between their Adaptive and Specialized forms. When the optimization conditions of a specialized instruction are broken, we do not roll back to the adaptive instruction immediately but tolerate a certain number of misses, to prevent thrashing between optimization and de-optimization. Similarly, after de-optimizing back to the adaptive instruction, we pause the observation and optimization logic for a while and let the instruction run with its original logic. This process is called deferred.
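
The original text follows this with a state diagram of the whole process; since the image is not reproduced here, the transitions described above can be sketched roughly as:

                        warm up threshold reached
  original bytecode  ------------------------------->  ADAPTIVE instruction
                                                           |          ^
                         observes a matching               |          |  miss counter
                         optimization condition            v          |  decays to 0
                                                        SPECIALIZED instruction

  (a limited number of cache misses are tolerated before de-optimizing; after falling
   back to ADAPTIVE, observation is deferred for a while before re-specializing)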

By now we have a grasp of the working principle and overall flow of PEP 659, but to build a high-performance optimizer many details still need to be worked out: which instructions should be optimized, how to replace instructions and roll them back gracefully, how to design the instruction caches, and so on. Analyzing these concrete questions purely in theory is dry and difficult, so let us look at them together with the source-level implementation of PEP 659 in Python 3.11.

Source code analysis

In fact, the PEP 659 infrastructure and some of the optimized instructions are already present in the CPython 3.11 branch. Below we take the transformation of LOAD_GLOBAL as an example to analyze the process described above in detail. LOAD_GLOBAL is chosen because it is relatively simple at the virtual machine level: it only involves looking up two namespaces, and both the optimization and the checks on whether the optimization still holds are fairly direct, which is clear enough to explain the problem without burdening readers with an obscure instruction.

The whole optimization process consists of the Warmup, Adaptive, Specializing and Deoptimize stages; below we analyze the role and core code of each.

Warmup

As mentioned above, warmup solves the problem of finding the code that really is executed frequently and avoiding the cost of optimizing code that will never run again. We do not need to measure exactly how often code runs; it is enough to set a threshold and count how many times a particular bytecode object (PyCodeObject) is executed, considering warmup complete once the count reaches the threshold. For this, a new field co_warmup is added to PyCodeObject:

/* Bytecode object */
struct PyCodeObject {
    PyObject_HEAD
    // The hottest fields (in the eval loop) are grouped here at the top.
    PyObject *co_consts;        /* list (constants used) */
    PyObject *co_names;         /* list of strings (names used) */
    // ...
    int co_warmup;              /* Warmup counter for quickening */
    // ...

    /* Quickened instructions and cache, or NULL
       This should be treated as opaque by all code except the specializer and
       interpreter. */
    union _cache_or_instruction *co_quickened;
};

When a PyCodeObject is created, co_warmup is initialized to QUICKENING_INITIAL_WARMUP_VALUE, a negative value (the negation of QUICKENING_WARMUP_DELAY). co_warmup is incremented by 1 each time the PyCodeObject is executed or an absolute jump occurs within the code, and once it climbs to 0 the optimization logic kicks in:

#define QUICKENING_WARMUP_DELAY 8
/* We want to compare to zero for efficiency, so we offset values accordingly */
#define QUICKENING_INITIAL_WARMUP_VALUE (-QUICKENING_WARMUP_DELAY)

static void
init_code(PyCodeObject *co, struct _PyCodeConstructor *con) {
    // ...
    co->co_warmup = QUICKENING_INITIAL_WARMUP_VALUE;
    co->co_quickened = NULL;
}
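
The increment itself happens in the evaluation loop; a simplified sketch of that check might look like the following (maybe_quicken is a hypothetical name and the exact placement in CPython differs, but the idea is the same):

/* Hedged sketch: bump co_warmup when the code object starts executing (or on
   an absolute jump) and trigger quickening once it climbs from its negative
   initial value up to zero. maybe_quicken is a hypothetical helper, not a
   CPython function; error handling of _Py_Quicken is omitted. */
static inline void
maybe_quicken(PyCodeObject *code)
{
    if (code->co_warmup != 0) {          /* not yet quickened */
        code->co_warmup++;
        if (code->co_warmup == 0) {      /* warm-up threshold reached */
            _Py_Quicken(code);           /* rewrite optimizable instructions
                                            into their ADAPTIVE forms */
        }
    }
}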

Adaptive

When co_warmup has accumulated up to 0, the _Py_Quicken function performs the optimization logic. Since the bytecode will be modified, a copy of the original bytecode co_code is made in the newly allocated quickened region, and all subsequent modifications happen there:

/* Bytecode object */
struct PyCodeObject {
    PyObject_HEAD
    // The hottest fields (in the eval loop) are grouped here at the top.
    PyObject *co_consts;        /* list (constants used) */
    PyObject *co_names;         /* list of strings (names used) */
    _Py_CODEUNIT *co_firstinstr; /* Pointer to first instruction, used for quickening. */
    // ...
    PyObject *co_code;          /* instruction opcodes */

    /* Quickened instructions and cache, or NULL
       This should be treated as opaque by all code except the specializer and
       interpreter. */
    union _cache_or_instruction *co_quickened;
};

typedef uint16_t _Py_CODEUNIT;

int
_Py_Quicken(PyCodeObject *code) {
    if (code->co_quickened) {
        return 0;
    }
    Py_ssize_t size = PyBytes_GET_SIZE(code->co_code);
    int instr_count = (int)(size/sizeof(_Py_CODEUNIT));
    if (instr_count > MAX_SIZE_TO_QUICKEN) {
        code->co_warmup = QUICKENING_WARMUP_COLDEST;
        return 0;
    }
    int entry_count = entries_needed(code->co_firstinstr, instr_count);
    SpecializedCacheOrInstruction *quickened = allocate(entry_count, instr_count);
    if (quickened == NULL) {
        return -1;
    }
    _Py_CODEUNIT *new_instructions = first_instruction(quickened);
    memcpy(new_instructions, code->co_firstinstr, size);
    optimize(quickened, instr_count);
    code->co_quickened = quickened;
    code->co_firstinstr = new_instructions;
    return 0;
}

Looking at the code above, you may wonder why quickened contains extra space beyond the bytecode. The reason is that instruction substitution alone is rarely enough: we also want to cache, as far as possible, information about the objects the instruction operates on. Take LOAD_GLOBAL as an example, which looks up globals first and then builtins. In CPython a dict is backed by a hashtable, so its lookup cost exceeds O(1) when hash collisions occur. For dictionaries that rarely change, we can therefore cache the hashtable index corresponding to the key. Obviously these caches correspond one-to-one with instructions, so the extra space in quickened is used to store the per-instruction data of the optimized instructions.

quickened is a bidirectional array, with cache entries on one side and bytecode on the other. One extra cache entry is allocated at the base of the array to store the cache count; with this count we can quickly locate the bytecode array, and the first_instruction call we saw above uses it to get from quickened to instr 0:

/* We layout the quickened data as a bi-directional array:
 * Instructions upwards, cache entries downwards.
 * first_instr is aligned to a SpecializedCacheEntry.
 * The nth instruction is located at first_instr[n]
 * The nth cache is located at ((SpecializedCacheEntry *)first_instr)[-1-n]
 * The first (index 0) cache entry is reserved for the count, to enable finding
 * the first instruction from the base pointer.
 * The cache_count argument must include space for the count.
 * We use the SpecializedCacheOrInstruction union to refer to the data
 * to avoid type punning.

 Layout of quickened data, each line 8 bytes for M cache entries and N instructions:

 <cache_count>                              <---- co->co_quickened
 <cache M-1>
 <cache M-2>
 ...
 <cache 0>
 <instr 0> <instr 1> <instr 2> <instr 3>    <--- co->co_first_instr
 <instr 4> <instr 5> <instr 6> <instr 7>
 ...
 <instr N-1>
*/

Each optimized instruction needs its own cache entries, so we need an O(1) way to index from an instruction to its cache. In CPython each instruction consists of an 8-bit opcode and an 8-bit operand. The official approach is to build a two-level index from the instruction offset plus the operand: the offset determines which block the index falls in (a bit like a page-table lookup, with the offset playing the role of the page and the operand the page offset), and the operand refines it. The original operand that gets overwritten by this partial index is stored in the cache, as PEP 659 describes:

For instructions that need specialization data, the operand in the quickened array will serve as a partial index, along with the offset of the instruction, to find the first specialization data entry for that instruction.
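
Putting this together with the layout comment above, reaching a given cache entry from the instruction stream is just negative indexing from first_instr. A sketch (cache_entry_for is a hypothetical name, and the exact offset formula in specialize.c is an implementation detail not shown here):

/* Sketch based on the layout comment: the nth cache entry sits below the
   instructions, at ((SpecializedCacheEntry *)first_instr)[-1-n]. The index n
   is derived from the instruction's offset plus the operand stored in the
   quickened bytecode (the "partial index" PEP 659 mentions). */
static SpecializedCacheEntry *
cache_entry_for(_Py_CODEUNIT *first_instr, Py_ssize_t n)
{
    return &((SpecializedCacheEntry *)first_instr)[-1 - n];
}

This is also why the LOAD_GLOBAL_BUILTIN code we will see later can address its two entries as caches[0] and caches[-1]: consecutive entries for one instruction sit next to each other, growing downwards.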

For LOAD_GLOBAL, what mainly needs to be cached is the dict version information and the key's index, plus some additional bookkeeping the optimizer needs, as follows:

  1. The dk_version of the dict's keys. Since both globals and builtins are involved, two dk_versions need to be cached; each is a uint32_t, so together they take 8 bytes, exactly one cache entry.
  2. The index corresponding to the key, a uint16_t. Since the dk_versions already occupy one cache entry, the index can only share the remaining one.
  3. Because the original operand is repurposed as the partial index, an extra uint8_t is needed to store original_oparg; it is packed into the same cache entry as the uint16_t index from item 2.
  4. As mentioned in the proposal-reading section, switching between Adaptive, Specializing and Deoptimize needs a counter as a back-off buffer to avoid thrashing; the official implementation uses a uint8_t for it.

So LOAD_GLOBAL needs two cache entries: one holding original_oparg + counter + index, and one holding the dk_versions of globals and builtins:

typedef struct {
    uint8_t original_oparg;
    uint8_t counter;
    uint16_t index;
} _PyAdaptiveEntry;

typedef struct {
    uint32_t module_keys_version;
    uint32_t builtin_keys_version;
} _PyLoadGlobalCache;

For LOAD_GLOBAL there are only two outcomes: the name is found in globals or in builtins. When the adaptive logic sees a LOAD_GLOBAL hit globals, we specialize it to LOAD_GLOBAL_MODULE and cache the index and module_keys_version. When a LOAD_GLOBAL misses globals and hits builtins, we need to cache both module_keys_version and builtin_keys_version, because once globals changes, the same LOAD_GLOBAL may no longer resolve to builtins next time; in this case we specialize it to LOAD_GLOBAL_BUILTIN. This selection process is exactly what Adaptive means.
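
A simplified sketch of that decision is shown below. This is not the actual specializer from specialize.c: adapt_load_global and find_key_index are hypothetical names, with find_key_index standing in for the dict-keys lookup the real specializer performs (returning the entry index of name, or -1 if it is absent or the dict is not in a cacheable state).

/* find_key_index is a hypothetical helper, not a CPython API. */
static Py_ssize_t find_key_index(PyDictObject *dict, PyObject *name);

/* Hedged sketch of the adaptive decision for LOAD_GLOBAL; the real specializer
   handles more corner cases (index ranges, dict layouts, etc.). */
static void
adapt_load_global(PyDictObject *globals, PyDictObject *builtins, PyObject *name,
                  _Py_CODEUNIT *instr, _PyAdaptiveEntry *cache0,
                  _PyLoadGlobalCache *cache1)
{
    Py_ssize_t index = find_key_index(globals, name);
    if (index >= 0) {
        /* Hit in globals: cache the entry index and the globals keys version. */
        cache0->index = (uint16_t)index;
        cache1->module_keys_version = globals->ma_keys->dk_version;
        *instr = _Py_MAKECODEUNIT(LOAD_GLOBAL_MODULE, _Py_OPARG(*instr));
        return;
    }
    index = find_key_index(builtins, name);
    if (index >= 0) {
        /* Miss in globals, hit in builtins: cache BOTH keys versions, because
           a later change to globals could shadow the name. */
        cache0->index = (uint16_t)index;
        cache1->module_keys_version = globals->ma_keys->dk_version;
        cache1->builtin_keys_version = builtins->ma_keys->dk_version;
        *instr = _Py_MAKECODEUNIT(LOAD_GLOBAL_BUILTIN, _Py_OPARG(*instr));
        return;
    }
    /* Otherwise leave the instruction in its adaptive form. */
}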

Specializing & Deoptimize

As described above, after the Warmup and Adaptive stages the optimizable instructions in a bytecode object (PyCodeObject) have become specialized: for example, every LOAD_GLOBAL is now in either its LOAD_GLOBAL_MODULE or LOAD_GLOBAL_BUILTIN form (leaving Deoptimize aside for the moment), i.e., the bytecode instructions best suited to the current environment. Let us take a look at the code of these specialized instructions.

In fact, after the analysis above, you can probably already guess most of what the specialized code looks like. Let us take the more complex LOAD_GLOBAL_BUILTIN as an example and analyze its source:

TARGET(LOAD_GLOBAL_BUILTIN) {
    // ...
    PyDictObject *mdict = (PyDictObject *)GLOBALS();
    PyDictObject *bdict = (PyDictObject *)BUILTINS();
    SpecializedCacheEntry *caches = GET_CACHE();
    _PyAdaptiveEntry *cache0 = &caches[0].adaptive;
    _PyLoadGlobalCache *cache1 = &caches[-1].load_global;
    DEOPT_IF(mdict->ma_keys->dk_version != cache1->module_keys_version, LOAD_GLOBAL);
    DEOPT_IF(bdict->ma_keys->dk_version != cache1->builtin_keys_version, LOAD_GLOBAL);
    PyDictKeyEntry *ep = DK_ENTRIES(bdict->ma_keys) + cache0->index;
    PyObject *res = ep->me_value;
    DEOPT_IF(res == NULL, LOAD_GLOBAL);
    STAT_INC(LOAD_GLOBAL, hit);
    Py_INCREF(res);
    PUSH(res);
    DISPATCH();
}

Let us ignore DEOPT_IF for now and look at the main logic: first fetch the cache entries for the current instruction; the first entry holds the index and the second holds the dk_versions of globals and builtins. On a cache hit, we simply take the key-value pair at the cached index in the builtins dict's hashtable and return its value, which is much faster than looking up the globals dict and then the builtins dict.

We know that builtins is only consulted when the name is not found in globals, so if globals changes, the cache must be considered invalid; likewise, our cached index is only valid as long as the builtins dict has not changed. These are the triggers for Deoptimize, and they are exactly what DEOPT_IF checks:

#define DEOPT_IF(cond, instname) if (cond) { goto instname ## _miss; }
#define JUMP_TO_INSTRUCTION(op) goto PREDICT_ID(op)
#define ADAPTIVE_CACHE_BACKOFF 64

static inline void
cache_backoff(_PyAdaptiveEntry *entry) {
    entry->counter = ADAPTIVE_CACHE_BACKOFF;
}

LOAD_GLOBAL_miss: 
{ 
    STAT_INC(LOAD_GLOBAL, miss); 
    _PyAdaptiveEntry *cache = &GET_CACHE()->adaptive; 
    cache->counter--; 
    if (cache->counter == 0) { 
        next_instr[-1] = _Py_MAKECODEUNIT(LOAD_GLOBAL_ADAPTIVE, _Py_OPARG(next_instr[-1])); 
        STAT_INC(LOAD_GLOBAL, deopt); 
        cache_backoff(cache); 
    } 
    oparg = cache->original_oparg; 
    STAT_DEC(LOAD_GLOBAL, unquickened); 
    JUMP_TO_INSTRUCTION(LOAD_GLOBAL); 
}

DEOPT_IF simply jumps to the instruction's cache-miss branch, which for LOAD_GLOBAL is LOAD_GLOBAL_miss. The miss branch does a very fixed set of things. The first thing to be clear about is that it falls back to the regular LOAD_GLOBAL logic for this execution (the JUMP_TO_INSTRUCTION at the bottom), but it never rewrites the instruction back to LOAD_GLOBAL; instructions only ever move between their Adaptive and Specialized forms. To avoid cache thrashing, the specialized instruction only degrades to the Adaptive form once the counter decays to 0; until then, it keeps trying the cache first.

At this point we have walked through the whole optimization and de-optimization process. It is intricate, but the pattern is clear: once the cache and adaptive infrastructure is in place, converting other instructions is relatively easy. One difficulty of speeding up instructions this way is designing the cache so as to reduce branching, because DEOPT_IF makes it hard to simply remove the precondition checks; they can only be turned into a cheaper form.
