preface
In daily work, I often need to contact with mysql database. In order to better use mysql database, I began to study its in-depth knowledge, so that I can apply mysql database in the working environment, optimize its SQL, understand the principle of lock, transaction, and THE principle of MVVC version control. All of them will be summarized in the following articles.
In this article, I’ll start with indexes and explore why mysql is so much more efficient with index hits. And how we should use mysql index, and what we need to be aware of when using it.
Search algorithm
In order to better understand the PRINCIPLE of B + tree, we need to explain its algorithm first. Step by step to understand how B + trees do searches.
- Binary search
This algorithm allows us to quickly understand the data structure of a tree and how does a tree look up, suppose we have, okay
1,2,3,4,5,6,7,8,9,10
Copy the code
Ten numbers. What’s the number of times I want to find five? That must be five searches. Because it’s sequential. So what is binary search?
Binary search means that each search starts at the middle of the data, dividing the whole data 1 into 2 for example, with 10 numerons. So let’s start with the index 5
1,2,3,4,5,6,7,8,9,10
^
Copy the code
In that case, we’re lucky we only need one query to find this number. What about looking for nine numbers?
1,2,3,4,5,6,7,8,9,10
^
1,2,3,4,5,6,7,8,9,10
^
1,2,3,4,5,6,7,8,9,10
^
Copy the code
So we only need 3 times to find the number 9.
The data structure
The tree
For a normal tree, you only need to reach a node with two leaves below it, but there’s no limit to where you can put leaves. That is, you can put it like this:
1 | 1 | 3
\ | / \ | / \
2 | 5 3 | 1 2\ | the \ |3 | 2 |
Copy the code
This is the most common tree structure, and I won’t explore the advantages and disadvantages of this data structure in detail. It should be obvious that this kind of arbitrary data structure can cause a lot of trouble and specification problems. The higher the level of the tree, the higher the IO and the longer the search takes.
Balanced binary trees
And then I derived other trees, like this balanced binary tree. There are several rules for balancing binary trees that need to be met:
- The value of the left subtree is always less than the root node
- The value of the right subtree is always greater than the root node
Balanced binary tree search result tree is definitely faster than sequential search. How to have the same 10 nodes. Then the sequential search times are:
(1 + 2 + 3 +… 10) / 10 = 5.5 times
So what’s the search result for a binary tree?
(4 + 4 + 4 + 3 + 3 + 3 + 3 + 2 + 2 + 1) / 10 = 2.9 times
Obviously, the efficiency is nearly doubled under balanced binary tree search. So balanced binary trees are all that good, right? Why not use it as a data structure for mysql? Why do we have b+ trees?
Let’s think a little bit further, what are the disadvantages of balanced binary trees?
- The cost of insertion is high
Think about a principle that we talked about earlier. Balanced binary trees must take care that their left and right nodes are smaller or larger than the root node. The problem is that every time I insert, I have to do a rotation of the tree to keep this property. So every insert is actually looking for a + rotation and that makes it very expensive to maintain a one-moment balanced binary search tree.
In the figure above we can see that the insertion of each node is actually a search + rotation subtree. The higher the height of the tree, the more expensive it is to maintain it. And that’s where our hero, the B + tree, comes in
B + tree
Today we are going to talk about the protagonist, B + tree, B + tree and balanced binary tree are very classical data structures, B + tree also evolved from the index sequential access of B tree.
B + tree is defined as:
- Non-leaf nodes store only their keys
- Leaf nodes store data
The design of b+ tree is a data structure specially designed for disks or other directly accessible devices, and all nodes of b+ tree are in accordance withThe size of the keyStore in orderNodes at the same level After all, the main purpose of this time is to talk about indexes, so bury a dot here. Next time we will focus on sharing the data structure of b+ trees, inserting and deleting nodes, and how InnoDB splits and requests more pages when leaf and intermediate nodes are full
The figure above is when the number of leaves is 3. B + tree insert and delete operations
B + tree index type
Clustered index
In InnoDB, each table has an index, and the clustered index is a B + tree constructed by the primary key of the table, and the leaf nodes are the data of the table.
Because data pages can be sorted by only one B + tree, there is only one clustered index per table page. In most cases, query optimization will prefer clustered indexes to lookup because clustered indexes can directly find data on leaf nodes. There is no need to repeat the table
Clustered indexes can be created in one of the following ways:
- The table has a defined primary key, which InnoDB uses as the clustered index
- Without a primary key, InnoDB selects a non-empty unique index as the clustered index
- Innodb creates an implicit clustered index (_rowid).
There is one point to note here. When we say data storage is sequential, we mean it’s logically ordered, not physically ordered, so it’s all connected in a two-way linked list.
Secondary index
Secondary indexes refer to non-clustered indexes, that is, all indexes except clustered indexes can be called secondary indexes. The leaf node of the secondary index does not contain row data, but the key value of the primary key. Each table can hold multiple secondary indexes, so using secondary indexes to find data requires a two-step operation.
- The first step is to find the primary key index through the secondary index
- The second step is to find the data of the leaf node through the primary key index
That’s what we’ve been saying about table-back queries
At the top is the data structure for our secondary index. Once we get to secondary indexes here, we need to take them out a little furtherJoint indexAs well asIndexes coverThe meaning of
Joint index
We can see from the secondary index diagram that when storing data, we are storing it in order. Suppose we have the index above (a, b) by (1, 1), (1, 2), (2, 1), (2, 4), (3, 1) sequence of storage, to accelerate the speed of the query.
When you use a query statement
select * from table where a=? and b=?
Copy the code
The (a,b) index can be used to speed up the query. Because of the left-most rule, the order of where conditions must be the same as the order of the index to hit the index. That is to say when you use
select * from table where b=?
Copy the code
(1,2,1,4,1) on the b+ tree is obviously not sorted in order.
Cover index
The principle of index coverage is very simple, mainly refers to the secondary index can be directly obtained from the data, without the need to return to the table query. The advantage of overwriting an index is that it does not contain the entire record, so it is much smaller than a clustered index and can greatly reduce IO operations during lookups.
The next two examples will give you an idea of what overwriting an index means. Let’s create a table
create table buy_log(
userid INT UNSIGNED NOT NULL,
buy_date DATE
)
Copy the code
Then we randomly insert some data and add two secondary indexes
alert table buy_log add key (userid);
alert table buy_log add key (userid, buy_date);
Copy the code
Next, let’s do a query assuming we look at all the data with userID < 2
select * from buy_log where userid < 2;
Copy the code
Which index should this statement use?
Select (userID, buy_date, userid, buy_date, userid, buy_date); We just mentioned that an overwrite index contains only a portion of the data, not the entire record. So over hereselect *Just enough data for this overwrite index.
What about another one?
select userid from buy_log where userid < 2;
Copy the code
This time, we just need the value userID
Obviously. This time the query optimizer selects the index (userID) because the index overrides the userID.
thinking
- Why not create indexes in fields with high duplicates (such as gender)?
- Why not just add an index?
The answer
- When you create an index in the gender column, the data structure of the B + tree points directly to the primary key of the clustered index. As a result, each item of the index needs to be queried back into the table, and additional maintenance of the index is required
- As I said. Too many indexes cause us to maintain all the B + trees every time we insert or delete them
supplement
Full text on hash index index is not introduced in detail in this article