1. Introduction
An index is a data structure that helps us quickly locate the data we are looking for in a large amount of data.
In MySQL index category:
- B + tree index
- A Hash index
- The full text indexing
Today we are going to introduce the B+ tree index in InnoDB storage engine, which is the most commonly used in working development. To introduce B+ tree index, we have to mention binary lookup tree, balanced binary tree and B tree. The B+ tree evolved from these three.
2. Binary search tree
First, let’s take a look ata graph. The circle in the graph is the node of the binary search tree, where keys and data are stored. The key corresponds to the ID in the user table, and the data corresponds to the row data in the User table.As you can see from the figure, we created a binary lookup tree index for the User table (user information table).
Features of binary search trees
The key of the left child of any node is smaller than the key of the current node, and the key of the right child is greater than the key of the current node. The nodes at the top are called root nodes, and the nodes without children are called leaf nodes.
If we need to find the user information whose ID is 12, we can use the binary search tree index created by us. The search process is as follows:
- We take the root node as the current node, compare 12 with the key value of the current node 10, 12 is greater than 10, and then we take the right child of the current node > as the current node.
- Continue to compare 12 with the key of the current node 13, and find that 12 is less than 13. Consider the left child of the current node as the current node.
- 12 =12, id=12, name=xm, id=12, name=xm
With the binary search tree we only need 3 times to find the data that matches. If we look in the table, we need 6 times to find it.
3. Balanced binary trees
Above we explained the use of binary search tree can quickly find the data. However, if the binary lookup tree above is constructed like this:At this point you can see that our binary lookup tree has become a linked list. If we need to find the user whose id is 17, we need to find 7 times, which is equivalent to a full table scan. The reason for this phenomenon is that the binary search tree becomes unbalanced, that is, the height is too high, resulting in unstable search efficiency.
To solve this problem, we need to make sure that the binary search tree is always balanced, so we need to use a balanced binary tree.
- Balanced binary tree, also known as AVL tree, requires that the height difference between the left and right subtrees of each node should not exceed 1 on the basis of satisfying the characteristics of binary search tree.
Here is a comparison of balanced and non-balanced binary trees:
From the construction of balanced binary tree, we can find that the binary tree in the first picture is actually a balanced binary tree.
Balanced binary tree ensures that the structure of the tree is balanced. When we insert or delete data that does not satisfy the unbalanced binary tree, the balanced binary tree will adjust the nodes in the tree to keep the balance. The specific adjustment method is not introduced here.
Compared with binary search tree, the search efficiency of balanced binary tree is more stable and the overall search speed is faster.
4. B tree
Because of the volatile nature of memory. In general, we would choose to store the data and indexes in the User table on a peripheral device such as disk.
However, reading data from disk can be hundreds, thousands, or even thousands of times slower than memory, so we should try to minimize the number of times we read data from disk.
In addition, when data is read from the disk, it is read block by block, not piece by piece.
- If we can put as much data as possible into the disk block, then a single disk read will read more data, and our search time will be greatly reduced.
- If we use a data structure like a tree as the index data structure, then every time we look for data, we need to read a node from the disk, which is what we call a disk block.
We all know that balanced binary trees store only one key and data per node. What does that mean? Note Each disk block only stores a key value and data! What if we were to store huge amounts of data?
As you can imagine, the nodes of the binary tree will be very large, and the height will be very high, and we will have to do a lot of disk I/O to find the data, and the efficiency of finding the data will be very low!
In order to solve this problem of balanced binary tree, we should find a balanced tree that can store multiple keys and data in a single node. Which is the B tree that we’re going to talk about.
B Tree is a Balance Tree. The following figure is a B Tree:
The P node in the graph is a pointer to the child node. There are also binary search trees and balanced binary trees, which are omitted because of the beauty of the graph.
Each node in the graph is called a page. A page is the disk block we talked about above. In MySQL, the basic unit of data read is a page, so we call it a page here more in line with the underlying data structure of MySQL indexes.
As can be seen from the figure above, compared with the balanced binary tree, each node of B-tree stores more keys and data, and each node has more child nodes, the number of child nodes is generally called order. The B-tree in the above figure is a third-order B-tree with a low height. If we want to find the user information whose ID is 28, then the search process in the tree B above is as follows:
- Find the root node, which is page 1, and determine that 28 is between the key values 17 and 35. Then we will find page 3 based on the pointer p2 in page 1.
- Comparing 28 with the key value on page 3, 28 is between 26 and 30. We find page 8 by referring to p2 on page 3.
- Comparing 28 with the key value on page 8, we find that there is a matching key value 28, which corresponds to the user information (28, bv).
5. B + tree
B+ tree is a further optimization of B tree. Let’s take a look at the structure of the B+ tree:
Let’s look at the difference between a B+ tree and a B tree based on the figure above:
- B+ tree non-leaf nodes do not store data, only key values, while B tree non-leaf nodes store not only key values, but also data.
The reason for this is that in the database the page size is fixed, InnoDB’s default page size is 16KB.
If we don’t store data, we store more keys, the order of the tree (the child tree of the node) will be larger, the tree will be shorter and fatter, and we will need less DISK I/O to find the data, and the data query will be faster.
In addition, the order of B+ tree is equal to the number of keys. If our B+ tree can store 1000 keys per node, then 3 layers of B+ tree can store 1000×1000×1000=1 billion data.
Typically, the root node is resident in memory, so it takes only two disk IO times to find a billion pieces of data.
- Because all the data of the B+ tree index is stored in the leaf node, and the data is arranged in order.
So B+ trees make range lookup, sorted lookup, group lookup, and de-lookup incredibly easy. With a B-tree, this is not easy because the data is scattered across nodes.
Interested readers may also notice that the pages in the B+ tree above are connected by a two-way list, and the data in the leaf nodes are connected by a one-way list.
In fact, in the B tree above, we could have added a linked list to each node. These are not their previous differences, because this is how indexes are stored in MySQL’s InnoDB storage engine.
In other words, the B+ tree index shown in the figure above is exactly how InnoDB’s B+ tree index is implemented, specifically as a clustered index
The B+ tree index implementation in MyISAM is slightly different from InnoDB. In MyISAM, the leaf node of the B+ tree index does not store the data, but the file address of the data.
6. Clustered versus non-clustered indexes
What is a clustered index? In MySQL, B+ tree indexes are divided into clustered indexes and non-clustered indexes according to different storage modes.
- Clustered indexes: All tables stored in InnoDB will have a primary key. Even if you don’t create a primary key, the system will create an implicit primary key for you. This is because InnoDB stores data in a B+ tree, and the B+ tree keys are the primary keys, and all the data in the table is stored in the leaves of the B+ tree. Such a B+ tree index constructed with the primary key as the key of the B+ tree index is called a clustered index
- Non-clustered index (non-clustered index) : A B+ tree index constructed with column values other than the primary key as the key values is called a non-clustered index. The difference between a non-clustered index and a clustered index is that the leaf node of a non-clustered index does not store the data in the table, but stores the corresponding primary key of the column. To search for data, we need to search in the clustered index according to the primary key. This process of looking for data according to the clustered index is called back table.
Use clustered and non-clustered indexes to find data
When we explained the B+ tree index, we did not talk about how to find data in the B+ tree, mainly because we did not introduce the concept of clustered index and non-clustered index.
In the following section, we will introduce the B+ tree index lookup method by explaining how to find data in the table by clustered indexes and non-clustered indexes.
Use clustered indexes to find data
This is the same B+ tree index graph, and now we should know that this is the clustered index, where the data in the table is stored. Now suppose we want to look for user data with id>=18 and id<40. The corresponding SQL statement is:
select * from user where id> =18 and id <40
Copy the code
Where ID is the primary key, the specific search process is as follows:
(1) Generally, the root node is resident in memory, that is, page 1 is already in memory, in this case, do not need to read disk data, directly read from memory.
- To look for the id>=18 and id <40 or a range value, we first need to find the key with id=18.
- From page 1 we can find the key value 18, and we need to navigate to page 3 according to the pointer p2.
(2) To find the data from page 3, we need to take p2 pointer to the disk and read page 3.
- We read page 3 from disk, put page 3 into memory, and then look for it. We can find the key value 18, and then take the pointer p1 from page 3, and navigate to page 8.
(3) The same page 8 is not in memory, we need to go to disk to read page 8 into memory.
- After reading page 8 into memory. Because the data in the page is linked, and the key values are stored in order, you can find the legal bit to the key value 18 by binary search.
- And at this point, because we’re on the data page, we’ve found a piece of data that meets our criteria, which is the data that corresponds to key 18.
- Since it is a range lookup, and all the data has leaf nodes and is arranged in order, we can iterate through the key values in page 8 to find and match the data that meets the condition.
- We can go all the way to 22, and then there is no data on page 8, so we need to take the p pointer on page 8 and read the data on page 9.
(4) Since page 9 is not in memory, page 9 will be loaded into memory again, and data search will be carried out in the same way as in page 8 until page 12 is loaded into memory, and 41 is found to be greater than 40. At this time, the condition is not met. So the search ends here. Finally, we found all the data that met the condition, a total of 12 records:
Kl, (18), (19, kl), (22, hj), (24, IO), (25, vg), (29, jk), (31, jk), (33, rt), (34, ty), (35, yu), (37, rt), (39, rt).
Let’s take a look at the specific lookup flow chart
Use non-clustered indexes to find data
Readers may wonder when they see this picture, what is it? It’s all numbers. If you feel this way, look carefully at the explanation in the red letter below. What? Don’t get it? So let me explain it again. First, the non-clustered index represents the index of the user’s lucky number (why lucky number? On a whim :-)), the table structure looks like this.In a leaf node, instead of storing all data, it stores key values and primary keys. For x minus y in the leaves, let’s say 1 minus 1. The 1 on the left represents the key of the index, and the 1 on the right represents the primary key.
If we want to find the user whose lucky number is 33, the corresponding SQL statement is:
select * from user where luckNum=33
Copy the code
The lookup process is the same as for clustered indexes, so I won’t go into details here. When we find the primary key 47, we need to go back to the clustered index to find the corresponding data. At this point, we go back to the clustered index search process.
Let’s take a look at the specific lookup flow chartIn MyISAM, both clustered and non-clustered leaf nodes store the file address of the data.