PHP array implementation

0 x00 preface

Recently, I read “PHP kernel analysis”, about PHP array, hereby write a summary record (∩_∩). Because arrays in PHP are powerful and important data types, it supports both simple arrays and key-value arrays, which are similar to Go’s Map but can be traversed sequentially, and the basic search time complexity is O(1) thanks to the hash table implementation. So let’s take a look at the underlying implementation of PHP arrays

0x01 array structure

What does an array look like in the PHP kernel? We can see its structure in the PHP source code as follows:

// Define the structure alias as HashTable
typedef struct _zend_array HashTable;

struct _zend_array {
	// gc stores reference counts, memory management related; This article does not cover
	zend_refcounted_h gc;
	// u Store auxiliary information; This article does not cover
	union {
		struct {
			ZEND_ENDIAN_LOHI_4(
				zend_uchar    flags,
				zend_uchar    nApplyCount,
				zend_uchar    nIteratorsCount,
				zend_uchar    consistency)
		} v;
		uint32_t flags;
	} u;
	// For hash functions
	uint32_t          nTableMask;
	// arData points to the first Bucket in the array that stores the elements. Bucket is a uniform array element type
	Bucket           *arData;
	// The number of buckets used
	uint32_t          nNumUsed;
	// Number of valid elements in the array
	uint32_t          nNumOfElements;
	// Total capacity of array
	uint32_t          nTableSize;
	// Internal pointer for traversal
	uint32_t          nInternalPointer;
	// The next available numeric index
	zend_long         nNextFreeElement;
	// destructor
	dtor_func_t       pDestructor;
};
Copy the code

nNumUsed 和 nNumOfElementsThe difference between:nNumUsedRefers to thearDataArray already usedBucketNumber, because the array simply removes the element after deleting itBucketThe type of the value is set toIS_UNDEF(because it would be too time-consuming to move and reindex the array every time an element was deleted), andnNumOfElementsThat corresponds to the actual number of elements in the array.
nTableSizeThe capacity of the array, which is a power of 2. PHP array is variable length but C language array fixed length, in order to achieve PHP variable length array function, the use of “expansion” mechanism, that is, at the time of each insert element judgmentnTableSizeIs it enough to store. If it is not enough, apply again for 2 timesnTableSizeSize of the new array, and copy the original array (this is to clear the original array of typeIS_UNDEFElement) and re-index.
nNextFreeElementSave the next available numeric index, for example in PHP$a[] = 1;This usage inserts an index ofnNextFreeElementAnd thennNextFreeElementSince the increased 1.

The _zend_array structure is discussed here, and the functions of some of the structure members are explained below. Don’t be nervous. Let’s look at the Bucket structure as an array member:

typedef struct _Bucket {
	// The value of an array element
	zval              val;
	// key A hash value or numeric index calculated using the Time 33 algorithm
	zend_ulong        h;
	// The numeric index is NULL
	zend_string      *key;
} Bucket;
Copy the code

0x01 Array access

We know that PHP arrays are implemented based on hash tables, and different from ordinary hash tables, PHP arrays also implement order of elements, that is, the insertion of elements from memory is contiguous rather than out of order, to achieve this order PHP uses “map table” technology. Here’s how to access the elements of a PHP array using a legend: -d.

Note: Since the key name is hashed twice to the subscript of the mapping table, the hash is used to distinguish the first hash from the second hash.

As can be seen from the figure, the mapping table and the array elements are in the same continuous piece of memory. The mapping table is an integer array with the same length as the storage element. Its default value is -1, and its valid value is the subscript of the Bucket array. HashTable->arData points to the first element of the Bucket array in memory.

$a[‘key’]; $a[‘key’]; First, the hash value of the key is calculated by the Time 33 algorithm, and then the mapping table subscript corresponding to the hash value is calculated by the hash algorithm. Since the value stored in the mapping table is the subscript value in the Bucket array, the corresponding elements in the Bucket array can be obtained.

Now let’s talk about the hash algorithm, which is the algorithm that maps the index of the “mapping table” by the hash value of the key name. It’s as simple as one line of code:

nIndex = h | ht->nTableMask; 
Copy the code

The subscript of the mapping table can be obtained by or operation of the hash value and nTableMask. The value of nTableMask is the negative of nTableSize. And due to the value of 2 nTableSize powers, so h | ht – > the scope of nTableMask between [- nTableSize, 1], is just the subscript range on the map. As for why not use simple “mod” instead of the painstaking “bitwise or” operation? Because bitwise or operations are much faster than mod operations, I feel that the time optimization of a more complex implementation is worth it for this frequently used operation.

Hash conflict

Hash conflicts occur when the hash values of different key names may have the same subscripts in the “mapping table”. PHP uses the “chain address method” to solve this problem. The following is an example of accessing the hashed element:

This looks similar to the first figure, but we also access $a[‘key’] with a few more steps. First, the mapping table with subscript -2 is obtained through hash operation, and then the contents of the mapping table are found pointing to the element with subscript 1 of arData array. At this point, we compare the key of the element and the name of the key to be accessed, and find that the two are not equal, then the element is not the element we want to access, and the value of the element val.u2.next is the index of the arData array corresponding to the next element with the same hash value. So we can keep iterating through next’s values until we find an element with the same key name or the lookup fails.

0x02 Insert element

The _zend_hash_add_or_update_i function inserts elements as follows:

static zend_always_inline zval *_zend_hash_add_or_update_i(HashTable *ht, zend_string *key, zval *pData, uint32_t flag ZEND_FILE_LINE_DC)
{
	zend_ulong h;
	uint32_t nIndex;
	uint32_t idx;
	Bucket *p;

	IS_CONSISTENT(ht);
	HT_ASSERT_RC1(ht);
	if(UNEXPECTED(! (ht->u.flags & HASH_FLAG_INITIALIZED))) {// The array is not initialized
		// Initialize the array
		CHECK_INIT(ht, 0);
		// Jump to insert element segment
		goto add_to_hash;
	} else if (ht->u.flags & HASH_FLAG_PACKED) { // The array is a continuous numerically indexed array
		// Convert to associative array
		zend_hash_packed_to_hash(ht);
	} else if ((flag & HASH_ADD_NEW) == 0) { // Add a new element
		// Find the element corresponding to the key name
		p = zend_hash_find_bucket(ht, key);

		if (p) { // If the same key name element exists
			zval *data;
			/* Internal _zend_hash_add API logic, can be ignored */
			if (flag & HASH_ADD) { // Specify the add operation
				if(! (flag & HASH_UPDATE_INDIRECT)) {// If update of indirect type variable is not allowed, return directly
					return NULL;
				}
				// Make sure the current value is different from the new valueZEND_ASSERT(&p->val ! = pData);// data points to the original array member value
				data = &p->val;
				if (Z_TYPE_P(data) == IS_INDIRECT) { // The original array element variable type is indirect
 					// take the corresponding variable of the indirect variable
					data = Z_INDIRECT_P(data);
					if(Z_TYPE_P(data) ! = IS_UNDEF) {// Return if the corresponding variable exists
						return NULL; }}else { // Non-indirect types return directly
					return NULL;
				}
			/* Common PHP array update logic */
			} else { // No add operation is specified
				// Make sure the current value is different from the new valueZEND_ASSERT(&p->val ! = pData);// Data points to the original array element value
				data = &p->val;
				// Data points to the corresponding variable when indirect type variables are allowed to be updated
				if((flag & HASH_UPDATE_INDIRECT) && Z_TYPE_P(data) == IS_INDIRECT) { data = Z_INDIRECT_P(data); }}if (ht->pDestructor) { // The destructor exists
				// Execute the destructor
				ht->pDestructor(data);
			}
			// Copy the value of pData to data
			ZVAL_COPY_VALUE(data, pData);
			returndata; }}// If the hash table is full, expand it
	ZEND_HASH_IF_FULL_DO_RESIZE(ht);

add_to_hash:
	// The array already uses Bucket +1
	idx = ht->nNumUsed++;
	// The number of valid elements in the array +1
	ht->nNumOfElements++;
	// If the internal pointer is invalid, it points to the current subscript
	if (ht->nInternalPointer == HT_INVALID_IDX) {
		ht->nInternalPointer = idx;
	}
    
	zend_hash_iterators_update(ht, HT_INVALID_IDX, idx);
	// p is the Bucket corresponding to the new element
	p = ht->arData + idx;
	// Set the key name
	p->key = key;
	if(! ZSTR_IS_INTERNED(key)) { zend_string_addref(key); ht->u.flags &= ~HASH_FLAG_STATIC_KEYS; zend_string_hash_val(key); }// Hash the key name and assign it to p
	p->h = h = ZSTR_H(key);
	// Assign pData to val of the Bucket
	ZVAL_COPY_VALUE(&p->val, pData);
	// Calculate the subscript of the mapping table
	nIndex = h | ht->nTableMask;
	// Resolve the conflict by assigning the contents of the original mapping table to the u2.next member of the new element variable value
	Z_NEXT(p->val) = HT_HASH(ht, nIndex);
	// Set the value in the mapping table to idx
	HT_HASH(ht, nIndex) = HT_IDX_TO_HASH(idx);

	return &p->val;
}
Copy the code

0 x03 capacity

ZEND_HASH_IF_FULL_DO_RESIZE calls zend_hash_do_resize to expand and reindex the array. Note: Not every Bucket array is full and needs to be expanded. If a large proportion of the IS_UNDEF element is in the Bucket array, the IS_UNDEF element is removed and re-indexed directly to save memory. Let’s look at the zend_hash_do_resize function:

static void ZEND_FASTCALL zend_hash_do_resize(HashTable *ht)
{

	IS_CONSISTENT(ht);
	HT_ASSERT_RC1(ht);

	if (ht->nNumUsed > ht->nNumOfElements + (ht->nNumOfElements >> 5)) { // The IS_UNDEF element is larger than 1/33 of the Bucket array
		// Reindex directly
		zend_hash_rehash(ht);
	} else if (ht->nTableSize < HT_MAX_SIZE) {	// Array size < maximum limit
		void *new_data, *old_data = HT_GET_DATA_ADDR(ht);
		// The new memory is twice the size of the old memory. Addition is used because addition is faster than multiplication
		uint32_t nSize = ht->nTableSize + ht->nTableSize;
		Bucket *old_buckets = ht->arData;
		// Request new array memory
		new_data = pemalloc(HT_SIZE_EX(nSize, -nSize), ht->u.flags & HASH_FLAG_PERSISTENT);
        
		// Update the array structure member value
		ht->nTableSize = nSize;
		ht->nTableMask = -ht->nTableSize;
		HT_SET_DATA_ADDR(ht, new_data);
        
		// Copy the original array to the new array
		memcpy(ht->arData, old_buckets, sizeof(Bucket) * ht->nNumUsed);
		// Free the memory of the original array
		pefree(old_data, ht->u.flags & HASH_FLAG_PERSISTENT);
		// reindex
		zend_hash_rehash(ht);
	} else { // The size of the array exceeds the memory limit
		zend_error_noreturn(E_ERROR, "Possible integer overflow in memory allocation (%u * %zu + %zu)", ht->nTableSize * 2.sizeof(Bucket) + sizeof(uint32_t), sizeof(Bucket)); }}Copy the code

The reindexing logic is in the zend_hash_rehash function as follows:

ZEND_API int ZEND_FASTCALL zend_hash_rehash(HashTable *ht)
{
	Bucket *p;
	uint32_t nIndex, i;

	IS_CONSISTENT(ht);

	if (UNEXPECTED(ht->nNumOfElements == 0)) { // The array is empty
		if (ht->u.flags & HASH_FLAG_INITIALIZED) { // Initialized
			// Set the Bucket number used to 0
            ht->nNumUsed = 0;
			// Mapping table reset
			HT_HASH_RESET(ht);
		}
		// Return success
		return SUCCESS;
	}
	// Mapping table reset
	HT_HASH_RESET(ht);
	i = 0;
	p = ht->arData;
	if (HT_IS_WITHOUT_HOLES(ht)) { // Bucket array all valid values, no IS_UNDEF
		// ----------------------------
		// Resets the value of the mapping table
		do {
			nIndex = p->h | ht->nTableMask;
			Z_NEXT(p->val) = HT_HASH(ht, nIndex);
			HT_HASH(ht, nIndex) = HT_IDX_TO_HASH(i);
			p++;
		} while (++i < ht->nNumUsed);
		// ----------------------------
	} else {
		do {
			if (UNEXPECTED(Z_TYPE(p->val) == IS_UNDEF)) { // The current Bucket type is IS_UNDEF
				uint32_t j = i;
				Bucket *q = p;

				if (EXPECTED(ht->u.v.nIteratorsCount == 0)) {
					// Move the array to override the IS_UNDEF element
					while (++i < ht->nNumUsed) {
						p++;
						if(EXPECTED(Z_TYPE_INFO(p->val) ! = IS_UNDEF)) { ZVAL_COPY_VALUE(&q->val, &p->val); q->h = p->h; nIndex = q->h | ht->nTableMask; q->key = p->key; Z_NEXT(q->val) = HT_HASH(ht, nIndex); HT_HASH(ht, nIndex) = HT_IDX_TO_HASH(j);if(UNEXPECTED(ht->nInternalPointer == i)) { ht->nInternalPointer = j; } q++; j++; }}}else {
					uint32_t iter_pos = zend_hash_iterators_lower_pos(ht, 0);
					// Move the array to override the IS_UNDEF element
					while (++i < ht->nNumUsed) {
						p++;
						if(EXPECTED(Z_TYPE_INFO(p->val) ! = IS_UNDEF)) { ZVAL_COPY_VALUE(&q->val, &p->val); q->h = p->h; nIndex = q->h | ht->nTableMask; q->key = p->key; Z_NEXT(q->val) = HT_HASH(ht, nIndex); HT_HASH(ht, nIndex) = HT_IDX_TO_HASH(j);if (UNEXPECTED(ht->nInternalPointer == i)) {
								ht->nInternalPointer = j;
							}
							if (UNEXPECTED(i == iter_pos)) {
								zend_hash_iterators_update(ht, i, j);
								iter_pos = zend_hash_iterators_lower_pos(ht, iter_pos + 1);
							}
							q++;
							j++;
						}
					}
				}
				ht->nNumUsed = j;
				break;
			}
			nIndex = p->h | ht->nTableMask;
			Z_NEXT(p->val) = HT_HASH(ht, nIndex);
			HT_HASH(ht, nIndex) = HT_IDX_TO_HASH(i);
			p++;
		} while (++i < ht->nNumUsed);
	}
	return SUCCESS;
}
Copy the code

0 x04 summary

HMMMM, this is the end of the article, because of their own level of reasons can not explain very detailed clear. “This was one of the hardest things I have ever done, and came away feeling like I was the only one remaining to keep track of it because the writing was so spicy. Think of the phrase “If you can’t explain something simply, you don’t really understand it.” There are a lot of details and IMPLEMENTATION of PHP source CODE I am not familiar with, this article is just a start of my PHP bottom learning, I hope to be able to write a really simple good article.

There are other good article gsmtoday. Making. IO / 2018/03/21 /…

Original link – PHP array low-level implementation

0 x00 preface

0x01 array structure

0x01 Array access

Hash conflict

0x02 Insert element

0 x03 capacity

0 x04 summary

Related Posts

Share 14 Top Spring MVC Tips!

Interview: Chapter 9: Distributed, High Concurrency, Clustering, load balancing, High Availability

Chain accounting principle | introduced.in the block chain