preface
The concept of an index is easy to understand, and we must have used dictionaries in our school days. An index is just like a dictionary’s table of contents, with which we can quickly retrieve the explanation of the word we need. Similarly, in the database, indexes can also help us quickly retrieve the data we need, and the query efficiency is very high.
In general, an index is a data structure. Let’s explore what an index is today. Why do we use B+ tree as an indexed data structure?
Why does indexing make queries faster?
We all know that database storage has two storage media, one is memory, the other is hard disk. Memory is a temporary storage medium and has very limited capacity, which can result in data loss if the server fails. Hard disks are a permanent (if undamaged) storage medium, so we need to store our data on them to be safe.
But there’s a problem? If we put data on the hard disk, when we query the data on the hard disk, it will generate I/O operations. Hard disks take much longer to access I/O than memory. The purpose of an index is to minimize I/O operations on the hard disk, thus reducing the time spent. You can compare the operations of looking up a dictionary, and the table of contents (index) can help you reduce the action of turning pages, for the same reason.
So let’s talk about binary trees
A binary tree, as its name implies, has at most two “forks”, or children, of each node, the left and right child. However, a binary tree does not require every node to have two children. Some nodes have only the left child and some have only the right child.
Let’s start with the basic Binary Search Tree. Searching for a node is the same as inserting a node. Let’s assume that the value of the Search insert is key:
- If the key is larger than the root node, the search is performed in the right subtree.
- If the key is smaller than the root node, the search is performed in the left subtree.
- If the key is equal to the root node, which means the node is found, return the root node.
For example, to create a sequence {30,25,36,32,40,20,28}, the same data, different insertion order, the tree result is different, as shown in the following figure:
But there are extreme cases, when the depth of the binary tree is very large it may degenerate into a linked list. In the figure above, you can see that the depth of the first tree is 3, which means that it takes at most three comparisons to find the node, while the depth of the second tree is 7, which takes at most seven comparisons to find the node.
The right-hand side of the graph is also a binary search tree, but the performance aspect has been reduced to a linked list, and the search time has become O(n). How to solve this problem? , people proposed balanced binary search tree (AVL tree), which added constraints on the basis of binary search tree to ensure that the height difference between the left and right subtrees of each node cannot exceed 1, that is to say, the left and right subtrees of the node are still balanced binary trees.
Here, there are many kinds of common balanced binary trees, including balanced binary search tree, red black tree, number heap, stretch tree. Balanced binary search tree is the first self-balanced binary search tree. When we refer to balanced binary search tree, we generally refer to balanced binary search tree. In fact, the first tree is a balanced binary search tree, and the search time is O(log2n).
As mentioned above, the query time depends on the I/O operation of the disk. If we use the binary tree form, even if we improve it by balancing the binary search tree, the tree depth is O(log2n). When N is large, the depth is high, as shown in the following figure:
Each node access requires one disk I/O operation, and for the tree above, we need five I/ OS. Although balanced binary tree comparison is efficient, the tree depth is also high, which means that the number of disk I/O operations affects the overall data query efficiency.
Now, what is a B tree
From the above, we know that binary trees as indexes will cause the tree to become very tall, increase the number of I/O times of hard disks, and affect the time of data query. The appearance of B Tree is to solve this problem, the English name of B Tree is Balance Tree, also known as balanced multi-path search Tree, its height is far less than the height of balanced binary Tree. The index structure in file system and database system is often implemented by B tree.
Let’s take a look at the schematic diagram of B-tree structure, as shown in the figure below:
As a balanced multi-path search tree, each node of b-tree can contain at most M child nodes, and M is called the order of b-tree. As you can see in the figure, each disk block contains Pointers to keywords and child nodes. If a disk block contains X keywords, the number of Pointers is x+1. For a 100-rank B-tree, if there are three levels, it can store up to about 1 million index data. For large amounts of index data, the b-tree structure is very suitable because the tree is much smaller than the binary tree.
Combined with the b-tree structure diagram, let’s take a look at how b-tree searches. Assuming that the keyword we want to find is 9, the steps can be divided into the following steps:
- We compare it with the keyword (17, 35) of the root node. If 9 is less than 17, we get pointer P1.
- Find disk block 2 according to pointer P1, keyword is (8, 12), because 9 is between 8 and 12, so we get pointer P2;
- Follow pointer P2 to find disk block 6 with keyword (9, 10), and then we find keyword 9.
As you can see, we’re not comparing a lot of times in the search of the B tree, but if we’re reading the data out and comparing it in memory, that’s negligible. However, reading the disk block itself requires I/O operation, which consumes more time than the time required for comparison in memory, and is an important factor in data search. Compared with balanced binary tree, B tree has less disk I/O operation, and is more efficient in data query than balanced binary tree.
B tree Plus
B+ tree is modified based on B tree. At present, mainstream databases all support B+ tree as an index method. Let’s take MySQL as an example to compare the differences between B+ tree and B tree:
- In a B+ tree, nodes with k children have K keywords. So the number of children is equal to the number of key words, and in the CASE of B tree, the number of children is equal to the number of key words plus 1.
- Keywords of non-leaf nodes also exist in child nodes and are the maximum (or minimum) of all keywords in child nodes.
- In a B+ tree, non-leaf nodes are only used for indexes and do not hold data records. Record-related information is stored in leaf nodes. In a B-tree, non-leaf nodes store both indexes and data records.
- All keywords appear in the leaf node, which forms an ordered linked list, and the leaf node itself is linked in order of keyword size from smallest to largest.
The following figure is a B+ tree with order 3. Keywords 1, 18 and 35 in the root node are the minimum values in the child nodes (1, 8, 14), (18, 24, 31) and (35, 41, 53) respectively. The keywords of the parent node of each layer will appear in the keywords of the child node of the next layer, so all the keyword information is included in the leaf node, and each leaf node has a pointer to the next node, thus forming a linked list.
For example, if we want to find the keyword 16, the B+ tree will search from the top down layer by layer:
- Compared with the root keyword (1, 18, 35), 16 is between 1 and 18, resulting in pointer P1 (pointing to disk block 2)
- Find disk block 2, keyword (1,8,14), because 16 is greater than 14, so get pointer P3 (to disk block 7)
- Find disk block 7, keyword (14,16,17), and then we find keyword 16, so we can find the data corresponding to keyword 16.
There are three I/ OS in the whole process. It looks like the query process of A B+ tree is similar to that of a B+ tree, but the fundamental difference is that the middle node of a B+ tree does not store data directly. What are the benefits?
First, B+ tree query efficiency is more stable. Because B + tree each time the only access to the leaf node to find the corresponding data, in the B tree, not a leaf node will store data, this will cause the query efficiency is not stable, sometimes access to the leaf node can find the key words, and sometimes need to access to the leaf node to find the key.
Second, B+ trees are more efficient queries because they are generally dumber (larger order, lower depth) and require less disk I/O to query. The B+ tree can store more node keywords for the same disk page size.
The efficiency of B+ tree is higher than that of B tree not only in the query of a single keyword, but also in the query range. This is because all the keywords appear in the leaf nodes of the B+ tree and are linked through an ordered linked list. In B tree, it is necessary to complete the search of the query range through middle-order traversal, which is much less efficient.
conclusion
The number of DISK I/O operations is critical to index efficiency. Although the traditional binary tree data structure is efficient in searching data, it is easy to increase the number of disk I/O operations, affecting the efficiency of index use. Therefore, we tend to use “chunky” data structures when constructing indexes.
Both B tree and B+ tree can be used as index data structures. In MySQL, B+ tree is used. The B+ tree is more stable in query performance.