Optimization of string processing logic in PHP 7

1. Compare the string processing in PHP 5 with that in PHP 7

Look at the following example code:

$a = 'foo';
$b = 'bar';

$c = "I like $a and $b";
Copy the code

Execute the code in PHP 5.6, and the output of OpCode looks like this:

Specific implementation process:

First, apply for memory for I like
Then I like foo is allocated memory, and I like and foo are copied to the newly allocated memory space and returned
Continue to allocate memory for I like foo and, then copy I like foo and and to the newly allocated memory space and return
Finally, I like foo and bar are allocated memory, and then I like foo and bar are copied to the newly allocated memory and returned
Frees the memory space requested during string processing

Execute the code in PHP 7, and the output of opcode is as follows:

String processing in PHP 7 is relatively simple. First, create a stack, then store the concatenated string fragments on the stack, and finally allocate memory space once, and then move the result from the stack to the allocated memory space. Compared to PHP 5, PHP 7 avoids the repeated memory allocation process during processing. PHP 7’s performance improvement in string processing is due to the use of data structure ropes.

Introduction to Rope data structure

___________ the Rope is introduced

A Rope is a binary tree in which each leaf node stores a substring of a string, and each non-leaf node stores the total number of characters contained in the substring stored by each leaf node to the left of that node (making it easy to find the specified character by index).

⒉ Advantages and Disadvantages of Rope

Advantage:

A Rope makes string insertion, deletion, and other operations faster. The time complexity of a Rope is O(log⁡N)O(\log{N})O(logN), and the time complexity of a string array is O(N)O(N)O(N) O(N)O(N).
When manipulating strings as arrays of characters, you first need to copy the array, which takes up O(N)O(N)O(N) memory, whereas ropes do not need to be copied
Ropes do not require large contiguous chunks of memory as character arrays do
With Rope data structures, the undoing of string operations is very convenient
With ropes, performance is stable regardless of the size of the string being manipulated

Inadequate:

Additional memory space is required to store non-leaf nodes, increasing memory consumption compared to character arrays
Because the structure of binary tree is more complex than that of character array, the performance of the operation of small size string is worse
Manipulating strings with ropes increases code complexity and increases the risk of bugs

The code implementation of a judge Rope

Rope is essentially a binary tree, and its basic insert, delete, and search operations are the same as those of a binary tree. Only concat and split strings are described here.

In addition, to minimize the time complexity of the operation, the Rope can be constructed as an AVL tree that is self-balancing after each operation. However, this may increase the complexity of some undo operations. For example, concat two ropes of different heights to form a new Rope that is self-balanced by node rotation may break the structure of the original two trees, making it very complicated to undo the concat operation.

⓵ concat

Concat is relatively simple: you simply join two Rope trees to form a new Rope tree

⓶ split

When you split a Rope tree, you encounter two situations: you split it right at the end of a node or you split it in the middle of a leaf node. For the second case, we can split the corresponding leaf nodes to transform into the first case. As for the first case, it can be divided into two cases according to the position of the node:

A. The node is located in the leftmost or rightmost branch

B. The node is a branch within the Rope tree

class Node:
    def __init__(self, data) :
        self.parent = None
        self.left = None
        self.right = None
        self.data = data
        self.weight = 0

    def __repr__(self) :
        return str(self.__dict__)

    def __str__(self) :
        return str(self.__dict__)


class Rope:
    # Each leaf node stores up to 5 characters, including null characters
    LEAF_DATA_LEN = 5

    def __init__(self) :
        self.root = None

    def create_rope(self, parent, data, left_index, right_index) :
        """ create rope data structure :param parent: parent (root parent is null) Param left_index: start index: param right_index: return: Node """
        if isinstance(data, str):
            data = list(data)
        elif not isinstance(data, list) :return

        if right_index - left_index > self.LEAF_DATA_LEN:
            node = Node("")

            node.parent = parent
            middle_index = (left_index + right_index) // 2
            node.weight = middle_index - left_index
            node.left = self.create_rope(node, data, left_index, middle_index)
            node.right = self.create_rope(node, data, middle_index, right_index)
        else:
            node = Node(data[left_index: right_index])

            node.parent = parent
            node.weight = right_index - left_index

        if node.parent is None:
            self.root = node

        return node

    @staticmethod
    def calc_weight(node) :
        Param node: :return: """
        if node is None:
            return 0

        init_weight = node.weight
        while node.right is not None:
            node = node.right
            init_weight += node.weight

        return init_weight

    def concat_rope(self, data1, data2) :
        String concatenation :param data1: :param data2: :return: """
        r1 = Rope()
        r1.create_rope(None, data1, 0.len(data1))
        r2 = Rope()
        r2.create_rope(None, data2, 0.len(data2))

        node = Node("")
        node.left = r1.root
        node.right = r2.root
        r1.root.parent = node
        r2.root.parent = node

        node.weight = self.calc_weight(r1)

        self.root = node

    def split_rope(self, data, index) :
        """ String splitting :param data: string to be split :param index: position to split (string index is calculated from 0) :return: Rope """
        if index < 0 or index > len(data) - 1:
            return

        node = self.create_rope(None, data, 0.len(data))
        original_index = index

        if index == self.root.weight - 1:
            # Split at root
            rope_left = node.left
            rope_left.parent = None
            rope_right = node.right
            rope_right.parent = None
            return rope_left, rope_right
        elif index < self.root.weight - 1:
            while index < node.weight - 1 and node.data == "":
                node = node.left
        else:
            while index > node.weight - 1 and node.data == "":
                index -= node.weight
                node = node.right

        ifnode.data ! ="":
            # index falls on the leftmost and rightmost leaves
            if original_index < self.root.weight - 1:
                # index falls on the leftmost leaf node
                rope_left = self.create_rope(None, node.data[0:index + 1].0, index + 1)
                rope_right = self.root
                Update rope_right's weight
                node.data = node.data[index + 1:]
                while node is not None:
                    node.weight -= (index + 1)
                    node = node.parent
            else:
                # index is placed on the right-most leaf node
                rope_left = self.root
                rope_right = self.create_rope(None, node.data[index + 1:].0.len(node.data[index + 1:]))
                node.data = node.data[0:index + 1]
        elif index == node.weight - 1:
            # index is right at the end of the node
            if original_index < self.root.weight:
                # index falls at the end of the non-leaf node in the leftmost branch
                weight_sub = node.weight
                rope_left = node.left
                rope_left.parent = None
                node.left = None
                rope_right = self.root
                Update node weight
                while node is not None:
                    node.weight -= weight_sub
                    node = node.parent
            else:
                # index falls at the end of the non-leaf node in the rightmost branch
                rope_left = self.root
                rope_right = node.right
                rope_right.parent = None
                node.right = None
        else:
            stack = []
            if original_index < self.root.weight:
                # index falls on a node in the left subtree
                index -= node.weight
                rope_left = node
                rope_right = self.root
                node.parent.left = None
                node.parent = None
                node = node.right
            else:
                # index falls into the node in the right subtree
                rope_left = self.root
                stack.append(node.right)
                rope_right = None
                node.right = None
                node = node.left
            while node.data == "" and index >= 0:
                if index < node.weight - 1:
                    stack.append(node.right)
                    node.right = None
                    node = node.left
                elif index > node.weight - 1:
                    node = node.right
                    index -= node.weight
                else:
                    stack.append(node.right)
                    node.right = None
                    break
            ifnode.data ! ="":
                # Need to split leaf nodes
                new_node = Node(node.data[index + 1:])
                new_node.weight = node.weight - index - 1
                stack.append(new_node)
                node.data = node.data[0:index + 1]
            Update node weight information
            while node is not None:
                ifnode.data ! ="":
                    node.weight = len(node.data)
                else:
                    node.weight = self.calc_weight(node.left)
                node = node.parent
            Assemble rope_right and update the weight value of the node
            left_node = None
            while len(stack) > 0:
                root_node = Node("")
                if left_node is None:
                    left_node = stack.pop()
                    root_node = left_node
                else:
                    root_node.left = left_node
                    left_node.parent = root_node
                    right_node = stack.pop()
                    root_node.right = right_node
                    right_node.parent = root_node
                    root_node.weight = self.calc_weight(root_node.left)
                    left_node = root_node

            if rope_right is None:
                # index > self.root.weight - 1
                rope_right = root_node
            else:
                # index < self.root.weight - 1
                tmp = rope_right
                while tmp.left is not None:
                    tmp = tmp.left
                tmp.left = root_node
                root_node.parent = tmp
                while tmp.parent is not None:
                    tmp.weight = self.calc_weight(tmp.left)
                    tmp = tmp.parent
                rope_right = tmp
                rope_right.weight = self.calc_weight(rope_right.left)

        return rope_left, rope_right


rope = Rope()
data = "php is a script language"
index = 18
left, right = rope.split_rope(data, index)
print(left)
print(right)

Copy the code

3. PHP 5 and PHP 7 low-level string processing logic comparison

(1) String processing logic and existing problems in PHP 5

4. Utility logic

Strings in PHP 5 do not have fixed data structures. Instead, they are treated like NULl-terminated character arrays in C
To support binary strings that contain NULL characters, you need to record the length of the string as an extra

Problems with ⓶

A. The length of the string

In PHP 5, strings defined in zval are of int (signed integer) type, which means that even on a 64 machine the length of a string cannot exceed 2312^{31}231 (in the LP64 data model, ints are always 32 bits).

B. The format is inconsistent

// Zval defines a string of length int
typedef union _zvalue_value {
		long lval;
		double dval;
		struct {
			char *val;     /* C string buffer, NULL terminated */
			int len;      /* String length : num of ASCII chars */
		} str;            /* string structure */
		HashTable *ht;
		zend_object_value obj;
		zend_ast *ast;
} zvalue_value;
	
Zend_uint = zend_uint = zend_uint = zend_uint = zend_uint
struct _zend_class_entry {
	char type;
	const char *name;		/* C string buffer, NULL terminated */
	zend_uint name_length;		/* String length : num of ASCII chars */
	struct _zend_class_entry *parent;
	int refcount;
	zend_uint ce_flags;

	/ *... * /
}	

// The key of the hashtable defines a string of uint length
typedef struct _zend_hash_key {
		const char *arKey;  /* C string buffer, NULL terminated */
		uint nKeyLength;    /* String length : num of ASCII chars */
		ulong h;
} zend_hash_key;
Copy the code

Binary strings are supported in many places in PHP 5, and since strings don’t have a fixed structure, this leads to many definitions of strings. At the same time, because the string length is defined differently from place to place, the supported string length varies from place to place.

C. High memory consumption

In older versions of PHP, interned strings were not introduced, so if the same string was used in multiple places, it would have to be copied out in order to not affect each other, which caused a lot of memory consumption.

static PHP_FUNCTION(session_id)
	{
		char *name = NULL;
		int name_len, argc = ZEND_NUM_ARGS();

		if (zend_parse_parameters(argc TSRMLS_CC, "|s", &name, &name_len) == FAILURE) {
		    return;
		}
	
		/ *... * /

		if (name) {
		    if(PS(id)) { efree(PS(id)); } PS(id) = estrndup(name, name_len); }}Copy the code

Take the PHP function session_id, which eventually copies the name and stores it in PS(id). This way, whatever subsequent code does to the name will not affect the value stored in PS(id).

⓷ interned string

Starting in PHP 5.4, the interned string was introduced to address the memory consumption caused by string copying. In addition, because interned strings require shared memory space, they are not supported in thread-safe versions of PHP.

An interned string is a string that is stored only once in a process. When a PHP-fpm process starts, it requests a buffer of 1MB of persistent strings (string constants, function names, variable names, class names, method names…). Is stored in the buffer and added to a Hashtable. Since PHP processes requests independently of each other, any data generated in memory is destroyed once a request is processed. In order to use the interned string when processing a request, PHP will record the top position of the current Interned string buffer when the request arrives. After the request is processed, The space consumed after the top position is restored.

The initialization of an Interned string only occurs during PHP startup and PHP script compilation

Interned String is read-only and cannot be modified or destroyed

Since the Interned string buffer has only 1MB of space, when the 1MB of space is used up, subsequent string processing will return to the state before the interned string was introduced

Advantages of interned string

A string is stored only once, saving memory
The hash of an interned string is evaluated only once and then used multiple times
The comparison of interned strings is eventually translated into a comparison of pointer addresses

The initialization and creation of an Interned string is complicated because of the hashtable manipulation, and because of this, not all strings are suitable for storing in an Interned string.

⒉ STRING processing optimization in PHP 7

struct _zend_string {
	zend_refcounted_h gc;
	zend_ulong        h;                /* hash value */
	size_t            len;
	char              val[1];
};

typedef struct _zend_refcounted_h {
	uint32_t         refcount;			/* reference counter 32-bit */
	union {
		uint32_t type_info;
	} u;
} zend_refcounted_h;
Copy the code

In PHP 7, strings have a fixed structure, zend_string

The string length type in PHP 7 is size_t. The string length is no longer limited to 2312^{31}231 as in PHP 5. Especially on 64-bit machines, the string length depends on the maximum length supported by the platform.

Instead of using a char *, PHP 7 uses a struct hack to store string content in a contiguic memory space along with zend_string. Zend_string also has a nested hash value for the string, so that only one hash calculation is needed for a specified string.

PHP 7 introduced reference counting for strings, so as long as the string is in use, you don’t have to worry about being destroyed.

static PHP_FUNCTION(session_id)
{
	zend_string *name = NULL;
	int argc = ZEND_NUM_ARGS();

	/ *... * /

	if (name) {
	    if(PS(id)) { zend_string_release(PS(id)); } PS(id) = zend_string_copy(name); }}static zend_always_inline zend_string *zend_string_copy(zend_string *s)
{
	if(! ZSTR_IS_INTERNED(s)) { GC_REFCOUNT(s)++; }return s;
}

static zend_always_inline void zend_string_release(zend_string *s)
{
	if(! ZSTR_IS_INTERNED(s)) {if (--GC_REFCOUNT(s) == 0) { pefree(s, GC_FLAGS(s) & IS_STR_PERSISTENT); }}}Copy the code

The session_id function is used as an example. When you set session_id, you do not need to fully copy name. Instead, you just increment the reference count of name by one. The same is true when deleting a string. The reference count of the string is subtracted by one, and only when the reference count is zero is the string actually destroyed.

Interned strings in PHP 7 no longer require separate interned string buffers. Instead, type_info is marked as IS_STR_INTERNED in the GC of zend_string. Thus, when destroying the string, Zend_string_release checks to see if the string is interned string, and does nothing if it is.