0 x00 preface
Recently, I read “PHP kernel analysis”, about PHP array, hereby write a summary record (∩_∩). Because arrays in PHP are powerful and important data types, it supports both simple arrays and key-value arrays, which are similar to Go’s Map but can be traversed sequentially, and the basic search time complexity is O(1) thanks to the hash table implementation. So let’s take a look at the underlying implementation of PHP arrays
0x01 array structure
What does an array look like in the PHP kernel? We can see its structure in the PHP source code as follows:
// Define the structure alias as HashTable
typedef struct _zend_array HashTable;
struct _zend_array {
// gc stores reference counts, memory management related; This article does not cover
zend_refcounted_h gc;
// u Store auxiliary information; This article does not cover
union {
struct {
ZEND_ENDIAN_LOHI_4(
zend_uchar flags,
zend_uchar nApplyCount,
zend_uchar nIteratorsCount,
zend_uchar consistency)
} v;
uint32_t flags;
} u;
// For hash functions
uint32_t nTableMask;
// arData points to the first Bucket in the array that stores the elements. Bucket is a uniform array element type
Bucket *arData;
// The number of buckets used
uint32_t nNumUsed;
// Number of valid elements in the array
uint32_t nNumOfElements;
// Total capacity of array
uint32_t nTableSize;
// Internal pointer for traversal
uint32_t nInternalPointer;
// The next available numeric index
zend_long nNextFreeElement;
// destructor
dtor_func_t pDestructor;
};
Copy the code
nNumUsed
和nNumOfElements
The difference between:nNumUsed
Refers to thearData
Array already usedBucket
Number, because the array simply removes the element after deleting itBucket
The type of the value is set toIS_UNDEF
(because it would be too time-consuming to move and reindex the array every time an element was deleted), andnNumOfElements
That corresponds to the actual number of elements in the array.nTableSize
The capacity of the array, which is a power of 2. PHP array is variable length but C language array fixed length, in order to achieve PHP variable length array function, the use of “expansion” mechanism, that is, at the time of each insert element judgmentnTableSize
Is it enough to store. If it is not enough, apply again for 2 timesnTableSize
Size of the new array, and copy the original array (this is to clear the original array of typeIS_UNDEF
Element) and re-index.nNextFreeElement
Save the next available numeric index, for example in PHP$a[] = 1;
This usage inserts an index ofnNextFreeElement
And thennNextFreeElement
Since the increased 1.
The _zend_array structure is discussed here, and the functions of some of the structure members are explained below. Don’t be nervous. Let’s look at the Bucket structure as an array member:
typedef struct _Bucket {
// The value of an array element
zval val;
// key A hash value or numeric index calculated using the Time 33 algorithm
zend_ulong h;
// The numeric index is NULL
zend_string *key;
} Bucket;
Copy the code
0x01 Array access
We know that PHP arrays are implemented based on hash tables, and different from ordinary hash tables, PHP arrays also implement order of elements, that is, the insertion of elements from memory is contiguous rather than out of order, to achieve this order PHP uses “map table” technology. Here’s how to access the elements of a PHP array using a legend: -d.
Note: Since the key name is hashed twice to the subscript of the mapping table, the hash is used to distinguish the first hash from the second hash.
As can be seen from the figure, the mapping table and the array elements are in the same continuous piece of memory. The mapping table is an integer array with the same length as the storage element. Its default value is -1, and its valid value is the subscript of the Bucket array. HashTable->arData points to the first element of the Bucket array in memory.
$a[‘key’]; $a[‘key’]; First, the hash value of the key is calculated by the Time 33 algorithm, and then the mapping table subscript corresponding to the hash value is calculated by the hash algorithm. Since the value stored in the mapping table is the subscript value in the Bucket array, the corresponding elements in the Bucket array can be obtained.
Now let’s talk about the hash algorithm, which is the algorithm that maps the index of the “mapping table” by the hash value of the key name. It’s as simple as one line of code:
nIndex = h | ht->nTableMask;
Copy the code
The subscript of the mapping table can be obtained by or operation of the hash value and nTableMask. The value of nTableMask is the negative of nTableSize. And due to the value of 2 nTableSize powers, so h | ht – > the scope of nTableMask between [- nTableSize, 1], is just the subscript range on the map. As for why not use simple “mod” instead of the painstaking “bitwise or” operation? Because bitwise or operations are much faster than mod operations, I feel that the time optimization of a more complex implementation is worth it for this frequently used operation.
Hash conflict
Hash conflicts occur when the hash values of different key names may have the same subscripts in the “mapping table”. PHP uses the “chain address method” to solve this problem. The following is an example of accessing the hashed element:
This looks similar to the first figure, but we also access $a[‘key’] with a few more steps. First, the mapping table with subscript -2 is obtained through hash operation, and then the contents of the mapping table are found pointing to the element with subscript 1 of arData array. At this point, we compare the key of the element and the name of the key to be accessed, and find that the two are not equal, then the element is not the element we want to access, and the value of the element val.u2.next is the index of the arData array corresponding to the next element with the same hash value. So we can keep iterating through next’s values until we find an element with the same key name or the lookup fails.
0x02 Insert element
The _zend_hash_add_or_update_i function inserts elements as follows:
static zend_always_inline zval *_zend_hash_add_or_update_i(HashTable *ht, zend_string *key, zval *pData, uint32_t flag ZEND_FILE_LINE_DC)
{
zend_ulong h;
uint32_t nIndex;
uint32_t idx;
Bucket *p;
IS_CONSISTENT(ht);
HT_ASSERT_RC1(ht);
if(UNEXPECTED(! (ht->u.flags & HASH_FLAG_INITIALIZED))) {// The array is not initialized
// Initialize the array
CHECK_INIT(ht, 0);
// Jump to insert element segment
goto add_to_hash;
} else if (ht->u.flags & HASH_FLAG_PACKED) { // The array is a continuous numerically indexed array
// Convert to associative array
zend_hash_packed_to_hash(ht);
} else if ((flag & HASH_ADD_NEW) == 0) { // Add a new element
// Find the element corresponding to the key name
p = zend_hash_find_bucket(ht, key);
if (p) { // If the same key name element exists
zval *data;
/* Internal _zend_hash_add API logic, can be ignored */
if (flag & HASH_ADD) { // Specify the add operation
if(! (flag & HASH_UPDATE_INDIRECT)) {// If update of indirect type variable is not allowed, return directly
return NULL;
}
// Make sure the current value is different from the new valueZEND_ASSERT(&p->val ! = pData);// data points to the original array member value
data = &p->val;
if (Z_TYPE_P(data) == IS_INDIRECT) { // The original array element variable type is indirect
// take the corresponding variable of the indirect variable
data = Z_INDIRECT_P(data);
if(Z_TYPE_P(data) ! = IS_UNDEF) {// Return if the corresponding variable exists
return NULL; }}else { // Non-indirect types return directly
return NULL;
}
/* Common PHP array update logic */
} else { // No add operation is specified
// Make sure the current value is different from the new valueZEND_ASSERT(&p->val ! = pData);// Data points to the original array element value
data = &p->val;
// Data points to the corresponding variable when indirect type variables are allowed to be updated
if((flag & HASH_UPDATE_INDIRECT) && Z_TYPE_P(data) == IS_INDIRECT) { data = Z_INDIRECT_P(data); }}if (ht->pDestructor) { // The destructor exists
// Execute the destructor
ht->pDestructor(data);
}
// Copy the value of pData to data
ZVAL_COPY_VALUE(data, pData);
returndata; }}// If the hash table is full, expand it
ZEND_HASH_IF_FULL_DO_RESIZE(ht);
add_to_hash:
// The array already uses Bucket +1
idx = ht->nNumUsed++;
// The number of valid elements in the array +1
ht->nNumOfElements++;
// If the internal pointer is invalid, it points to the current subscript
if (ht->nInternalPointer == HT_INVALID_IDX) {
ht->nInternalPointer = idx;
}
zend_hash_iterators_update(ht, HT_INVALID_IDX, idx);
// p is the Bucket corresponding to the new element
p = ht->arData + idx;
// Set the key name
p->key = key;
if(! ZSTR_IS_INTERNED(key)) { zend_string_addref(key); ht->u.flags &= ~HASH_FLAG_STATIC_KEYS; zend_string_hash_val(key); }// Hash the key name and assign it to p
p->h = h = ZSTR_H(key);
// Assign pData to val of the Bucket
ZVAL_COPY_VALUE(&p->val, pData);
// Calculate the subscript of the mapping table
nIndex = h | ht->nTableMask;
// Resolve the conflict by assigning the contents of the original mapping table to the u2.next member of the new element variable value
Z_NEXT(p->val) = HT_HASH(ht, nIndex);
// Set the value in the mapping table to idx
HT_HASH(ht, nIndex) = HT_IDX_TO_HASH(idx);
return &p->val;
}
Copy the code
0 x03 capacity
ZEND_HASH_IF_FULL_DO_RESIZE calls zend_hash_do_resize to expand and reindex the array. Note: Not every Bucket array is full and needs to be expanded. If a large proportion of the IS_UNDEF element is in the Bucket array, the IS_UNDEF element is removed and re-indexed directly to save memory. Let’s look at the zend_hash_do_resize function:
static void ZEND_FASTCALL zend_hash_do_resize(HashTable *ht)
{
IS_CONSISTENT(ht);
HT_ASSERT_RC1(ht);
if (ht->nNumUsed > ht->nNumOfElements + (ht->nNumOfElements >> 5)) { // The IS_UNDEF element is larger than 1/33 of the Bucket array
// Reindex directly
zend_hash_rehash(ht);
} else if (ht->nTableSize < HT_MAX_SIZE) { // Array size < maximum limit
void *new_data, *old_data = HT_GET_DATA_ADDR(ht);
// The new memory is twice the size of the old memory. Addition is used because addition is faster than multiplication
uint32_t nSize = ht->nTableSize + ht->nTableSize;
Bucket *old_buckets = ht->arData;
// Request new array memory
new_data = pemalloc(HT_SIZE_EX(nSize, -nSize), ht->u.flags & HASH_FLAG_PERSISTENT);
// Update the array structure member value
ht->nTableSize = nSize;
ht->nTableMask = -ht->nTableSize;
HT_SET_DATA_ADDR(ht, new_data);
// Copy the original array to the new array
memcpy(ht->arData, old_buckets, sizeof(Bucket) * ht->nNumUsed);
// Free the memory of the original array
pefree(old_data, ht->u.flags & HASH_FLAG_PERSISTENT);
// reindex
zend_hash_rehash(ht);
} else { // The size of the array exceeds the memory limit
zend_error_noreturn(E_ERROR, "Possible integer overflow in memory allocation (%u * %zu + %zu)", ht->nTableSize * 2.sizeof(Bucket) + sizeof(uint32_t), sizeof(Bucket)); }}Copy the code
The reindexing logic is in the zend_hash_rehash function as follows:
ZEND_API int ZEND_FASTCALL zend_hash_rehash(HashTable *ht)
{
Bucket *p;
uint32_t nIndex, i;
IS_CONSISTENT(ht);
if (UNEXPECTED(ht->nNumOfElements == 0)) { // The array is empty
if (ht->u.flags & HASH_FLAG_INITIALIZED) { // Initialized
// Set the Bucket number used to 0
ht->nNumUsed = 0;
// Mapping table reset
HT_HASH_RESET(ht);
}
// Return success
return SUCCESS;
}
// Mapping table reset
HT_HASH_RESET(ht);
i = 0;
p = ht->arData;
if (HT_IS_WITHOUT_HOLES(ht)) { // Bucket array all valid values, no IS_UNDEF
// ----------------------------
// Resets the value of the mapping table
do {
nIndex = p->h | ht->nTableMask;
Z_NEXT(p->val) = HT_HASH(ht, nIndex);
HT_HASH(ht, nIndex) = HT_IDX_TO_HASH(i);
p++;
} while (++i < ht->nNumUsed);
// ----------------------------
} else {
do {
if (UNEXPECTED(Z_TYPE(p->val) == IS_UNDEF)) { // The current Bucket type is IS_UNDEF
uint32_t j = i;
Bucket *q = p;
if (EXPECTED(ht->u.v.nIteratorsCount == 0)) {
// Move the array to override the IS_UNDEF element
while (++i < ht->nNumUsed) {
p++;
if(EXPECTED(Z_TYPE_INFO(p->val) ! = IS_UNDEF)) { ZVAL_COPY_VALUE(&q->val, &p->val); q->h = p->h; nIndex = q->h | ht->nTableMask; q->key = p->key; Z_NEXT(q->val) = HT_HASH(ht, nIndex); HT_HASH(ht, nIndex) = HT_IDX_TO_HASH(j);if(UNEXPECTED(ht->nInternalPointer == i)) { ht->nInternalPointer = j; } q++; j++; }}}else {
uint32_t iter_pos = zend_hash_iterators_lower_pos(ht, 0);
// Move the array to override the IS_UNDEF element
while (++i < ht->nNumUsed) {
p++;
if(EXPECTED(Z_TYPE_INFO(p->val) ! = IS_UNDEF)) { ZVAL_COPY_VALUE(&q->val, &p->val); q->h = p->h; nIndex = q->h | ht->nTableMask; q->key = p->key; Z_NEXT(q->val) = HT_HASH(ht, nIndex); HT_HASH(ht, nIndex) = HT_IDX_TO_HASH(j);if (UNEXPECTED(ht->nInternalPointer == i)) {
ht->nInternalPointer = j;
}
if (UNEXPECTED(i == iter_pos)) {
zend_hash_iterators_update(ht, i, j);
iter_pos = zend_hash_iterators_lower_pos(ht, iter_pos + 1);
}
q++;
j++;
}
}
}
ht->nNumUsed = j;
break;
}
nIndex = p->h | ht->nTableMask;
Z_NEXT(p->val) = HT_HASH(ht, nIndex);
HT_HASH(ht, nIndex) = HT_IDX_TO_HASH(i);
p++;
} while (++i < ht->nNumUsed);
}
return SUCCESS;
}
Copy the code
0 x04 summary
HMMMM, this is the end of the article, because of their own level of reasons can not explain very detailed clear. “This was one of the hardest things I have ever done, and came away feeling like I was the only one remaining to keep track of it because the writing was so spicy. Think of the phrase “If you can’t explain something simply, you don’t really understand it.” There are a lot of details and IMPLEMENTATION of PHP source CODE I am not familiar with, this article is just a start of my PHP bottom learning, I hope to be able to write a really simple good article.
There are other good article gsmtoday. Making. IO / 2018/03/21 /…
Original link – PHP array low-level implementation