The statement
The various data structures described in this article (binary trees, etc.) do not take into account the case of duplicate values. The differences in the various data structures described in this article are merely a preparation for understanding the needs of MySQL indexes.
What is an index
Speaking of indexes, we all know that creating indexes makes database queries faster, so what exactly is an index? I don’t think everyone can say that. An index is a sorted data structure used in a database management system to help query and update data in a database table quickly. Yes, an index is a data structure, but of all the data structures, why does MySQL choose a B+ tree? Let’s take a look at what makes B+ trees unique compared to other data structures.
Binary Search
First of all, let’s think for ourselves, if we were to design, how would we store? I think most people think of it as a linked list or an array, and then they sort it in the default order, and then they look it up, and an ordered linked list is something that we can query efficiently with binary search.
Binary search, also known as half search, is a high efficiency search method. Let’s say we have 10 numbers from 1 to 10, and we want to find 8, and we start with 5 in the middle, and we find that 8 is bigger than 5, so we can get rid of the number to the left of 5, and then we have 6 to 10, and we start in the middle, and so on, until we find 8. But this lookup method has a premise is that the data must be ordered, and it belongs to the list of storage, but we want to insert or modify a data, might be accompanied by a large number of subscript mobile, for example, we put 1-10 array, subscript respectively 0-9, and now want to insert a zero, in order to ensure orderly, 0 must be in the first place, So all the indices from 1 to 10 have to be moved one bit back, and that’s a bit of a hassle, so to solve that problem, we have binary trees.
Binary Search tree (BST)
Binary Search Tree (BST) is a Binary Search Tree. Please see below
So in this tree up here, we’re going to find 8, and we’re going to start at root 6, and we’re going to find 8, and we’re going to go to the right, and we’re going to find 8
Characteristics of binary trees
Binary trees have two characteristics: 1. All nodes in the left subtree are smaller than the parent node. 2
Problems with binary trees
A serious problem with binary trees is that their search time is dependent on the depth of the tree, which in the worst case degrades to O(n). The diagram below:
So this is an extreme case of a binary tree, which degenerated into a linear linked list, where if you want to find the last number, 6, you have to start at 1 and go through the whole tree, which is very inefficient. So is there a data structure that’s a little bit more balanced, that doesn’t have this extreme situation, and so you have balanced binary trees.
AVL Tree balanced binary Tree
Balanced binary search trees are Balanced binary search trees, or AVL trees, which are not short for AVL trees, but for the inventors g. M. Adelson-Velsky and E. M. Landis. See below for an example of a balanced binary tree:
In the figure above, 6 is inserted from 1. If it is a binary tree, it will become a linear structure, but a balanced binary tree will be rotated left and right to produce the structure shown in the figure above.
Characteristics of balanced binary trees
Compared with binary trees, balanced binary trees have one characteristic: the absolute value of the depth difference between left and right subtrees cannot exceed 1. Of course, balanced binary trees are first a binary tree, but the depth difference between left and right subtrees is not more than 1 through left-rotation and right-rotation, so as to avoid the occurrence of extreme cases of binary trees.
Why doesn’t MySQL choose balanced binary trees
If balanced binary trees solve the problem of normal binary trees, why mysql doesn’t choose balanced binary trees as indexes?
What does an index need to store
Let’s think about what information we should store if we want to store the index. It should store three pieces of information:
- Index value: is the value of the index column in the table.
- The disk address of the data (find the current data by the disk address) or store the whole data directly.
- References to child nodes: We need to go down from the root node, so we need to know the addresses of the left and right child nodes.
According to these three points, we can have a simple structure diagram as follows:
The numbers in the figure above represent the values of the index. The numbers starting with 0x indicate the disk address. The root node stores references to the left and right nodes.
What is the problem with AVL trees being used to store indexes
As we know, pages are the smallest unit of disk used by Innodb storage engine to manage data. The default size of pages is 16KB. Page is also the node in the figure above. Every query of the node requires an IO operation. IO operation is a very time-consuming operation, and the bottleneck of many business systems is stuck in IO operation. AVL tree only stores a keyword (index value)+ a disk address + references of the left and right nodes on a node, which is far less than 16KB and will waste a lot of space.
Above, if we want to find this article data needs 3 IO (takes a node is an IO operation), if the tree is very high, will do a lot of IO operations, so the AVL tree is the biggest problem is the lack of space utilization, wasted a lot of space, when a large quantity of data will become a tall tree, So how can we improve? The answer is obvious, that is, each disk block to store a little more stuff, that is, each disk to store a few more keywords, because the more keywords, the more ways; The more paths there are, the shorter and fatter the tree will be, and the fewer IO operations there will be.
Balanced Tree
Multi-path balanced tree is short for B-tree, also known as B-tree. Like AVL tree, B-tree stores key values, disk addresses, and left and right node references in branch and leaf nodes. See the following figure for an example of a multipath balanced tree:
B tree characteristics
Compared with AVL tree, B tree can store multiple keywords (values) on a disk, and has one feature:
- The number of forks (paths) is always 1 more than the key word count.
We can draw the following schematic diagram (there are only 3 ways, i.e. 2 keywords, depending on how many keywords can be stored on a page) :
From the figure above, it is obvious that the data stored by B tree of the same height is much larger than that of balanced binary tree.
How does a B tree find data
For example, if we want to find the number key=32, we first get the root node, find that 18 is less than key, so we go to the right, get the data on the right, 54 and 76, then follow the following principles:
- Key <54, hit the leftmost fork;
- Key =54, direct hit, return data;
- 54
- Key =76, direct hit, return data;
- Key >76, hit the right branch; In this case, I hit the left branch because key=32, so I go to the first one, hit the left branch, and then I go to the left branch, get 32 and 50, and I find key=32, hit, and return the data.
From the above we can see that compared with AVL tree, the efficiency of B tree has been improved a lot in the case of large data volume, so why MySQL still does not choose B tree as the index? So let’s take a look at the modified B+ tree first, and then jump to conclusions.
B + tree
B+ tree is improved from B tree, which is an improved version of multi-way balanced search tree.
Let’s first look at what a B+ tree looks like:
Compared to B+ trees, one obvious difference is that the leaf nodes are guided by an arrow and are ordered from left to right.
Compared with the traditional B+ tree, the improved B+ tree in InnoDB has the following characteristics
InnoDB B+ tree features
- The number of keywords is equal to the number of routes.
- No data is stored in the root or branches of a B+ tree, only in the leaves. Search keyword will not directly return, will be the last layer of the leaf node.
- Each leaf node of the B+ tree adds a pointer to the adjacent leaf node, and its last data points to the first data of the next leaf node, forming an ordered linked list structure.
- It retrieves data based on an interval between left closed and right open
According to the characteristics of B+ tree, we can draw a simple diagram for storing data, as follows:
How does a B+ tree find data
Suppose we now want to find a key=66, follow the following steps: 1. Obtain the root node, according to the left closed and right open have the following interval: [1,28),[28,66),[66,+∞), hit the last interval, although 66 is in the root node, because the root node does not store data, it will continue to search the node on the right. 2. The node on the right can be obtained according to the following interval: [66,78),[78,89),[89,+∞), hit the left range.
B+ tree relative to B tree improvement point
B+ trees are improved from B trees, so B+ trees can solve all the problems that B trees can solve, so what problems can B+ trees solve that B trees can’t solve? 2. The disk read and write capability of B+Tree is stronger than that of B+Tree: If we want to scan the whole table, we only need to scan the leaf nodes, not the whole B+Tree. Root nodes and branches do not hold data areas, so a node can hold more keywords, and a disk load (IO operation) can fetch relatively more keywords. 3. Natural sorting ability: there is a pointer to the next data area on the leaf node, and the data forms a linked list. 4. Stable efficiency: B+Tree always gets the data from the leaf node, so the IO times are stable. However, if B Tree is lucky, the root node will get the data, and if B Tree is unlucky, it will get the data from the leaf node, which will take different time.
conclusion
This article Outlines the evolution from binary trees to B+ trees, explains the differences between the various data structures, and explains why MySQL ultimately chose B+ trees as indexes.