PHP 7 new Hash table implementation

The statement

This post was translated from Nikita Popov(@Nikic), a member of the PHP development team, with some adjustments for Chinese custom.

PHP’s new HashTable implementation

Disclaimer

This article is translated from a blog post of Nikita Popov(@nikic), in which I have done a small amount of adjustment based on the reading habit of Chinese.

Original Post: PHP’s new hashtable implementation

The body of the

About three years ago, I wrote an article analyzing the memory consumption of PHP 5 arrays. Much of the Zend engine has been rewritten as part of the work for the upcoming PHP 7, which is aimed at smaller data structures and less memory allocation. In this article, I’ll give an overview of the new Hash table implementation and show why it’s more efficient than the previous implementation.

I measure memory usage using a script that tests the amount of memory needed to create an array of 100,000 different integers:

$startMemory = memory_get_usage();
$array = range(1, 100000);
echo memory_get_usage() - $startMemory, " bytes\n";
Copy the code

The following table shows PHP 5.6 vs. PHP 7 memory usage on 32-bit and 64-bit systems:

32 bit | | 64 - bit -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- PHP 5.6 7.37 MiB | | 13.97 MiB -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - | PHP 7.0 3.00 the MiB | 4.00 MiBCopy the code

In other words, arrays in PHP 7 save 2.5 times memory on a 32-bit system and 3.5 times memory on a 64-bit (LP64) system, which is pretty impressive.

The Hash table is introduced

In essence, ARRAYS in PHP are ordered dictionaries. That is, they represent an ordered list of key/value pairs whose key/value mappings are implemented by a Hash table.

Hash tables are a common data structure. Basically, it solves the problem that computers can only express arrays directly using consecutive integer subscripts, whereas programmers often want to use strings or other more complex types as keys.

The concept behind a Hash table is simple: a string key returns an integer, via a Hash function, as the subscript of a “normal” array. The problem is that two different strings can produce the same Hash value, because the possibilities of string combinations are virtually infinite, whereas hashes are limited by integers. So, these Hash tables need to implement some sort of conflict handling mechanism.

There are two main approaches to conflict handling in the industry: the open address method, where if a conflict occurs, the element is stored in a different subscript; Chained address method, so elements with the same Hash value will be stored in a linked list. PHP uses the latter.

In general, Hash tables have no explicit order: the order in which elements are stored at the bottom of the array depends on the Hash function and is fairly random. But this behavior is inconsistent with the semantics of PHP arrays: If you iterate through a PHP array, you’ll get elements in insertion order. This means that PHP Hash table implementations have to incorporate additional mechanisms to keep track of the order of array elements.

Old Hash table implementation

I’ll give you a brief overview of older Hash table implementations here, but for a more comprehensive explanation you can read the Hash table section of the PHP kernel book. Here is a high-level view of a PHP 5 Hash table:

The elements in the Conflict Handling linked list are buckets. Each bucket is assigned individually. The image hides the actual value stored in the bucket (only the key is shown). Values are stored in independently allocated Zval structures of either 16 bytes (32-bit) or 24 bytes (64-bit).

The other thing is that the image doesn’t show that the conflict-handling list is actually a two-way list (which makes it easier to delete). Next to the conflict-handling list, there is another bidirectional list for storing the order of array elements. For an array with three keys in the order “a”, “b”, and “c”, the sequential list is as follows:

So why are old Hash table structures so inefficient, both in terms of memory consumption and performance? There are many main reasons for this:

bucketMemory needs to be allocated independently. Memory allocation is slow and requires an additional 8/16 bytes. Allocating memory independently also meansbucketIt’s more spread out in memory, reducing caching.
zvalMemory also needs to be allocated independently. Similarly, memory allocation is slow and leads to more memory allocation. In addition, this also needs to be in eachbucketStore a pointer tozvalThe pointer. Because the old implementation was overly generic, it actually required not one, but two Pointers.
Bidirectional linked lists are required for eachbucketStore 4 Pointers, which alone takes up 16/32 bytes… In addition, traversing a linked list is a very cache-unfriendly operation.

New Hash table implementations try to solve (or at least improve on) these problems.

New zval implementation

Before I dive into the actual Hash table, I want to take a quick look at the new zval structure and highlight its differences from the old one. The zval structure is defined as follows:

struct _zval_struct {
    zend_value value;
    union {
        struct {
            ZEND_ENDIAN_LOHI_4(
                zend_uchar type,
                zend_uchar type_flags,
                zend_uchar const_flags,
                zend_uchar reserved)
        } v;
        uint32_t type_info;
    } u1;
    union {
        uint32_t var_flags;
        uint32_t next;       /* hash collision chain */
        uint32_t cache_slot; /* literal cache slot */
        uint32_t lineno;     /* line number (for ast nodes) */
    } u2;
};
Copy the code

You can safely ignore the macro ZEND_ENDIAN_LOHI_4 in the definition, which is only used to ensure predictable memory layouts on machines with different byte orders.

The zval structure has three parts: the first member is a value. Zend_value union takes up 8 bytes to store different types of values, including integers, strings, arrays, and so on. What is stored depends on the type of zval.

The second part is a 4-bytes type_INFO that contains the actual type (like IS_STRING or IS_ARRAY), with some additional identifiers that provide this type information. For example, if the zVal stores an object, the type identifier indicates that it is an insubstantial, reference-counted, garbageable, non-replicable type.

The last 4 bytes are typically unused in the zval structure (it’s just explicit padding that the compiler automatically introduces). However, in a particular context, this part of the space is stored with some additional information. For example, AST nodes use it to store row numbers, virtual machine constants use it to store subscripts of cache slots, Hash tables use it to store the next element in the collision handling list — the last part is important to us.

If you compare it to the previous zval implementation, one of the most significant differences is that the new Zval structure no longer has a reference count refcount. The reason for this is that the ZVal is no longer allocated separately, but is integrated directly wherever it is stored (for example, a bucket in a Hash table).

Zval itself no longer uses reference counting, but complex data types such as String, array, Object, and Resource also use reference counting. In fact, the new Zval design moves reference counting from Zval to structures like array and Object. There are many benefits to this approach, some of which are listed below:

zvalStore only simple values (such as booleans, ints, or floating-point numbers) that no longer include any memory allocation. So this saves extra overhead on allocation headers and improves cache access by avoiding unnecessary allocation and freeing memory, which improves performance.
zvalStoring simple values does not require storing reference counts and GC root buffers.
We avoid double reference counting. For example, in the past, objects were both usedzvalAn additional object reference count has been added to support the object pass syntax.
Because all complex values are now integrated with reference counting, they can be shared independentlyzvalEspecially now that strings can be shared. This is important for implementation of the Hash table because it no longer needs to copy a non-retained copy.string interningIs a method that stores only one copy of each different immutable string value.

New Hash table implementation

After all our prep work, we finally moved into the new Hash table implementation in PHP 7. Let’s start with the bucket structure:

typedef struct _Bucket {
    zend_ulong        h;
    zend_string      *key;
    zval              val;
} Bucket;
Copy the code

A bucket is an entry in a Hash table. It contains a lot of what you would expect: a Hash value h, a string key value key, and a zval value val. The key value of the integer will be stored in h (key and Hash are the same here), and the key will be set to NULL.

As you can see, ZVal is built directly into bucket, so it doesn’t need a separate allocation, and we don’t have to pay for allocation.

The structure of the main Hash is even more interesting:

typedef struct _HashTable {
    uint32_t          nTableSize;
    uint32_t          nTableMask;
    uint32_t          nNumUsed;
    uint32_t          nNumOfElements;
    zend_long         nNextFreeElement;
    Bucket           *arData;
    uint32_t         *arHash;
    dtor_func_t       pDestructor;
    uint32_t          nInternalPointer;
    union {
        struct {
            ZEND_ENDIAN_LOHI_3(
                zend_uchar    flags,
                zend_uchar    nApplyCount,
                uint16_t      reserve)
        } v;
        uint32_t flags;
    } u;
} HashTable;
Copy the code

Buckets (equivalent to array elements) are stored in an arData array and allocated as a power of 2. The size of the array is stored in nTableSize (minimum 8). The actual number of elements is nNumOfElements. Notice that this array contains the bucket directly. Previously, we used an array of Pointers to allocate buckets individually, which meant that we needed to allocate/release more and had to pay the price of allocating memory and additional Pointers.

Order of elements

The arData array stores elements in the order they were inserted. So the first element will be stored in arData[0], the second in arData[1], and so on. It doesn’t depend at all on the key used, it just depends on the insertion order.

So if you have five Hash elements, arData[0] through arData[4] will be occupied, the next free slot arData[5]. We store this number in nNumUsed. You may be wondering: Why store them separately? Is there a difference between this and nNumOfElements?

This question arises because we have only looked at the case when an insert is performed. When deleting an element from a Hash table, we obviously don’t want to keep the array contiguous by moving all the elements after the deleted element in arData. Instead, we simply annotate the IS_UNDEF type in zval.

Use the following code as an example:

$array = [
    'foo' => 0,
    'bar' => 1,
    0     => 2,
    'xyz' => 3,
    2     => 4
];
unset($array[0]);
unset($array['xyz']);
Copy the code

This forms the following arData structure:

nTableSize     = 8
nNumOfElements = 3
nNumUsed       = 5

[0]: key="foo", val=int(0)
[1]: key="bar", val=int(1)
[2]: val=UNDEF
[3]: val=UNDEF
[4]: h=2, val=int(4)
[5]: NOT INITIALIZED
[6]: NOT INITIALIZED
[7]: NOT INITIALIZED
Copy the code

As you can see, the first five arData elements are used, but the subscripts 2 (key is 0) and 3 (key is’ xyz ‘) are replaced with IS_UNDEF because they are unset. These elements now also waste memory. However, once nNumUsed reaches nTableSize, PHP automatically compresses the arData array by discarding any UNDEF records. Only when all buckets are truly valuable will they be reassigned to twice the size.

The new way of maintaining array order has many advantages over PHP 5.x’s bidirectional lists. One obvious advantage is that we save two Pointers, equivalent to 8/16 bytes, for each bucket. Also, this means that the array iteration is roughly as follows:

uint32_t i;
for (i = 0; i < ht->nNumUsed; ++i) {
    Bucket *b = &ht->arData[i];
    if (Z_ISUNDEF(b->val)) continue;

    // do stuff with bucket
}
Copy the code

This is equivalent to a linear memory scan, which is more efficient for caching than list traversal, which requires relatively random memory addressing backwards and forwards.

One problem with the current implementation is that arData never contracts unless it is explicitly told to do so. So if you create an array of a few million elements and delete them later, the array still takes up a lot of memory. We should probably halve the size of arData if the usage is reduced to a certain extent.

Hash table lookup

So far, we’ve only discussed how PHP arrays represent order. The lookup of the actual Hash table uses a second arHash array containing the uint32_t values. The arHash number combination arData has the same size (nTableSize). Both arHash numbers are allocated to the same memory fragment.

The Hash value returned is a 32-bit or 64-bit unsigned integer returned by the Hash function (using the DJBX33A algorithm for strings) that is too large to be used directly as a subscript of the Hash array. We need to first convert them to the table size with a mod operation. We use hash & (HT -> ntablesize-1) instead of hash % ht->nTableSize, which gives consistent results in cases where the array size is a power of two but does not require “expensive” integer division operations. Ht -> ntablesize-1 values are stored in HT ->nTableMask.

Next, we look for the subscript idx = ht->arHash[Hash & HT ->nTableMask] in the Hash array. This subscript corresponds to the head of the conflict-handling linked list, so HT ->arData[IDX] is the first record we examine. If the key matches what we’re looking for, we’re done.

Otherwise, we must move on to the next conflict handling list. The subscript of this element is stored in bucket->val.u2.next, a 4 bytes zval that is not normally used but only makes sense in a specific context. We continue to traverse the list until we find the correct bucket, or we encounter INVALID_IDX, which means that there is no element corresponding to the key we are looking for.

The lookup mechanism is shown in the following code:

zend_ulong h = zend_string_hash_val(key); uint32_t idx = ht->arHash[h & ht->nTableMask]; while (idx ! = INVALID_IDX) { Bucket *b = &ht->arData[idx]; if (b->h == h && zend_string_equals(b->key, key)) { return b; } idx = Z_NEXT(b->val); // b->val.u2.next } return NULL;Copy the code

Let’s consider how optimizations have been made to previous implementations: in PHP 5.x, the conflict-handling list is a bidirectional list. Using the Uint32_t subscript is better than using Pointers because it requires half as much memory on 64-bit systems. Also, 4 bytes is just enough to build a link to the next node into the unused Zval slot, so we’re essentially free to use it.

We now also use one-way linked lists with no prev nodes. The prev node is useful for deleting elements because when you delete, you have to adjust the next node of the prev node. However, if you delete by key, you already know the previous element as you traverse the collision handling list.

Deletion in some context (for example, deleting the element where the iterator is currently located) may require traversing the conflicting list to find the previous element. This is not a very common scenario, however, and in this case we prefer to save memory rather than save a traversal.

Pack the Hash table

PHP uses Hash tables for any array. However, for some very common arrays with contiguous integer subscripts (such as arrays of real numbers), the entire Hash system is not useful. This is why PHP 7 introduced the concept of “packaged Hash tables.”

In the packaged Hash table, the arHash array is NULL and is looked up directly through arData. If you look for an element with subscript 5, the element is in arData[5] or does not exist at all, and there is no need to traverse the collision handling list.

Note: Even integer subscript PHP arrays need to maintain order. Arrays [0 => 1, 1 => 2] and [1 => 2, 0 => 1] are not the same. The optimization of packaged Hash tables is only useful for arrays sorted by ascending subscripts. Arrays can have intervals (discontinuous subscripts), but must be in ascending order. So if elements are inserted in the wrong order (e.g., in reverse order), packaged Hash table optimization will not be used.

Also note that packaged Hash tables still store a lot of useless information. For example, we can determine the index of a bucket based on the memory address, so bucket->h is redundant. The value of bucket->key will always be NULL, so this also wastes memory.

We keep these useless values, so buckets have the same structure regardless of whether packaging is used. This means that iterations can use the same code. However, we may switch to a “fully packaged” structure in the future, using pure ZVal arrays if possible.

Empty the Hash table

Empty Hash tables are treated specially in BOTH PHP 5.x and PHP 7. If you create an empty array array [], chances are you won’t actually insert any elements. The arData/arHash array only allocates memory when you insert the first element.

To avoid checking this particular case in many places, a little trick is applied here: when nTableSize is set to the implied size or the default value 8, nTableMask (actually ntablesize-1) is set to 0. This means that hash & HT ->nTableMask will also result in 0.

So in this case, the arHash array has only one element with an INVALID_IDX value subscript 0 (this particular array is called uninitialized_bucket and is statically allocated memory). When doing the lookup, we will always find the INVALID_IDX value, meaning that the key (which you really just want to statically allocate to create an empty table) was not found.

Memory usage

Memory usage should address the most important aspect of PHP 7 Hash table implementation. First, let’s summarize why the new implementation saves within province. I will only use 64-bit system data here and look at the size of individual elements, ignoring the main HashTable structure (which is asymptotically unimportant).

In PHP 5.x each element requires a whopping 144 bytes. In PHP 7, this was reduced to 36 bytes, or 32 bytes in the packaged case. Here are the differences:

zvalNot allocated independently, so we saved 16 bytes.
bucketNot allocated independently, so we saved another 16 bytes.
eachzvalThe value saves 16 bytes.
Ensuring order no longer requires a 16 bytes two-way queue, but absolute order.
The collision chain appears to be one-way, saving 8 bytes. In addition, it is now a linked list with subscripts, and subscripts are built inzval, so we actually saved another 8 bytes.
Due to thezvalBe built into thebucketWe no longer need to save a pointer to it. Because of the details of the previous implementation, we actually saved two Pointers, so we saved another 16 bytes.
The length of the key is no longer stored inbucketThis has 8 bytes. However, if the key is a string rather than an integer, the length of the key will still be storedzend_stringIn the. The exact memory impact in this case is difficult to quantify becausezend_stringIs shared, since previously the Hash table had to copy the string if it wasn’t retained.
The array contains conflicting chains that are subscription-based, so each element saves 4 bytes. This is completely unnecessary for packing arrays, and we can save another 4 bytes.

To be clear, however, this summary makes things look better than they really are in many ways. First, the new Hash table implementation uses many of the built-in constructs that correspond to allocation. What’s the downside of that?

If you did look at the measured data at the beginning of this article, you’ll notice that a 100,000-element array in 64-bit PHP 7 takes up 4 MB of memory. In this case, we did it with a packed array, so we actually expect to consume 32 * 100000 = 3.05 MB of memory. The reason for this is that we allocate memory to a power of two for everything. In this case, the size of nTableSize is 2^17 = 131072, so we need to allocate 32 * 131072 bytes of memory (or 4 MB).

Of course, previous implementations of Hash tables also allocated memory to a power of two. However, it allocates this way only for arrays with bucket Pointers (where each pointer is 8 bytes), and everything else is allocated on demand. So, in PHP 7 we were wasting 32 * 31072 (0.95 MB) of useless memory, while in PHP 5.x we were wasting only 8 * 31072 (0.24 MB).

Another thing to consider is what happens if not all stored values are different. For simplicity, let’s assume that all values are exactly the same. So let’s replace the initial script range function with array_fill.

$startMemory = memory_get_usage();
$array = array_fill(0, 100000, 42);
echo memory_get_usage() - $startMemory, " bytes\n";
Copy the code

The script results in the following:

32 bit | | 64 - bit -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- PHP 5.6 4.70 MiB | | 9.39 MiB -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - | PHP 7.0 3.00 the MiB | 4.00 MiBCopy the code

As you can see from memory usage, PHP 7 maintains the same footprint. There is no reason for this to change, as each Zval is separate. Memory consumption is now significantly lower in PHP 5.x, because you only need a zVal to store all the values. So, while PHP 7 still has some advantages, the gap has narrowed.

Things get even more complicated if we consider strings as keys (which may be shared or retained) and complex values. PHP 7 will save significantly more memory than PHP 5.x in this case, but the numbers in the introduction may be overly optimistic in many cases.

performance

We’ve talked a lot about memory footprint, now let’s move on to the next step, which is called performance. Ultimately, the goal of the PHPNG project is not to improve memory footprint, but to improve performance. Memory footprint optimization is only one way to achieve this goal, because reducing memory footprint leads to better CPU cache utilization for better performance.

However, there are of course some other reasons why the new implementation is faster: First, we reduce memory allocation. We reduce the allocation twice for each element, depending on whether the values are shared. Memory allocation is an expensive operation, so this effect makes sense.

Array iteration is now particularly cache-friendly, since it is now linear memory traversal rather than random-access linked list traversal.

More might be needed on the subject of performance, but the main interest of this article is memory usage, so I won’t expand on the details here.

At the end of thinking

PHP 7’s Hash table implementation is definitely a big step forward, and a lot of unused memory is no longer being used.

So the question is: Where do we go from here? One idea, which I also mentioned earlier, is to use a fully packaged Hash in the case of an ascending integer key. This means using pure ZVal arrays, which is the best we can do before we start dealing specifically with arrays of consistent type.

We have some other directions as well. For example, switching from linked addresses to open addresses (e.g., using Robin Hood lookups) can optimize both memory usage (no conflicting handling of linked lists) and performance (better caching efficiency, depending on the details of the lookups). However, the open address method is relatively difficult to combine with sorting requirements, so this may not be feasible in practical situations.

Another idea is to consolidate the H and key fields in the bucket into the bucket structure. Integer keys only use H. String keys also store hash values in keys. However, doing so can have a detrimental effect on performance, as extracting hash values also requires additional memory overhead.

The last thing I need to say is that PHP 7 has not only optimized the internal implementation of Hash tables, but also improved the associated API. I often look at even simple operations like zend_hash_find, especially considering how many layers of indirect calls are required (hint: three). In PHP 7, you just say zend_hash_find(ht, key) and you get *zval. Overall, writing extensions for PHP 7 has become more interesting.

Hopefully I can give you some insight into the PHP 7 Hash table kernel. Maybe I’ll write a follow-up article on Zval, I’ve already touched on some of its differences in this article, but there’s more to be said on this topic.

PHP

PHP7

Internal

Array

Hash