MySQL index principles, a from start to finish explain

Index, may make a lot of people daunting, after all, every interview when MySQL index must be asked content, even if first put aside the interview, just in the usual development, for SQL optimization is also a top priority.

It is no exaggeration to say that the quality of SQL in the system can directly determine the speed of your system. But is there a question that you think about before you optimize? Which is: What are our principles for optimization? What is the theoretical basis for tuning SQL?

Although it is said that real knowledge comes from practice, I believe that theory is the basis to support practice, because we can not blindly practice without purpose, because it often achieves half the result with twice the effort.

So I just want to tell you that before we can really start indexing, we need to understand how indexing works. This way you will feel more silky when talking about optimization

1. The nature of indexes

The essence of an index is a sorted data structure. I believe that we are not unfamiliar, because when it comes to indexes, many people will naturally associate the contents of dictionaries.

Yes, the analogy is very graphic, but if you go further, I’m afraid many partners will be a little tongue-tied, that since you already know the essence of the index, so you have the basis to see this article, I believe that you read this article, will certainly have a new understanding of the principle of the index.

2. Classification of indexes

In the database, there are many kinds of indexes (don’t think of indexes as B+ tree, because we use MySQL). While different categories are obviously for different occasions, what are the categories of indexes? Let’s take a look at it.

2.1. Hash Index

Hash index is a common index. The query efficiency of a single record is high and the time complexity is 1. However, Hash indexes are not the most commonly used database index type, especially the Mysql Innodb engine that we commonly use does not support Hash indexes. The main reasons are as follows:

Hash indexes are good for precise lookups, but range lookups are not
- Because the storage engine computes a hash code for each row, the hash code is small, and the hash code for different key rows is usually different. The hash code stored in the hash index is the hash code, and the hash code is irregular between each other, and the sequence of hash operations cannot be guaranteed. Therefore, two data with similar values, Hash values differ greatly and are divided into different buckets. This is why hash indexes can only be queried full-time, because only then can the hash code match the data.

This is all you need to know about hash indexes.

2.2. Binary tree

In addition, the common index data structure is the tree structure, first let’s introduce the most classical binary tree.

First, let’s introduce the characteristics of binary trees:

1. The time complexity of binary trees is O(n).
1. A node can have only two child nodes. The degree is not more than 2
1. The left child node is smaller than the current node, and the right child node is larger than the current node

So let’s first look at what a binary tree looks like

But in extreme cases, there is chain, where nodes keep increasing on one side. The following figure

In binary trees, there is a special structure — balanced binary tree. The characteristics of balanced binary tree are as follows:

1. The root node changes as the data changes
1. The more data there is, the more times there are, and the more I/OS there are, the slower the I/OS are. (The disk I/OS are determined by the tree height.)

2.4. B Tree (Second and third trees)

Now that we know about binary trees, we can talk a little bit more about what a B tree is. B trees look something like this:

It can be seen from the structure diagram of B-tree that each node contains not only the key value but also the data value.

The storage space of each page is limited. If the data volume is large, the key of each node is small. If the data volume is large, the B tree is very deep, which increases the disk I/O times and affects the query efficiency.

MySQL B+ tree MySQL B+ tree MySQL B+ tree

2.5 and B + tree

The most common index data structure in MySQL is B+ tree, which has the following characteristics:

In B+ tree, all data record nodes are stored in leaf nodes of the same layer according to the size of key values, instead of non-leaf nodes only storing key information, which can greatly reduce the number of keys stored in each node and reduce the height of the B+ tree
The keywords of the leaf nodes of B+ tree are arranged in order from small to large, and the data at the end of the left will save the pointer to the data at the beginning of the node on the right.
B+ trees have fewer levels: Compared to B+ trees, which store more key words per non-leaf node, there are fewer levels of the tree so data is queried faster
B+ tree is more stable in query speed: B+ all keyword data addresses are stored on leaf nodes, so the search times are the same, so the query speed is more stable than B tree.
B+ tree has the natural sorting function: all the leaf node data of B+ tree forms an ordered linked list, which is more convenient for querying the data in the size range, with high data tightness and higher cache hit ratio than B tree.
B+ tree traversal of all nodes is faster: A B+ tree traversal of the whole tree only requires traversal of all leaf nodes, rather than traversal of each layer like a B tree, which is conducive to the database to perform full table scan.

Now that we’ve talked about the characteristics of B+ trees, let’s draw a picture of what a B+ tree looks like.

The data page above is where the data pages are actually stored, and the data pages are connected through a two-way linked list. Well, here we will quickly understand the types of each index, and now we will start the formal ANALYSIS of B+ tree.

3. Primary key directory

We take the data page from the above image and refine it to form the image below

As we all know, MySQL stores data in the smallest unit of data pages, and data is stored continuously in data pages. Data in data pages is sorted by primary key (no primary key is sorted by ROW_ID maintained by MySQL itself), and data pages and data pages are associated by bidirectional linked lists. Data and data times are correlated by a one-way linked list.

This means there is a in each data page, he must have a minimum of a primary key, and then each data page page number and the smallest primary key will be composed of a primary key directory (as above) on the left portion, suppose now to find the primary key data for the 2, through the binary search method finally determine the primary key for the record in data page 1 of 2, At this point, we will locate the data page 1 and then locate the record with primary key 2. We first know the general process, and do not go into details. We will first look at the structure principle from the macro, and then look at the implementation principle from the micro.

A primary key index is the simplest and most basic index. At this point you can see why it’s faster if you set up a primary key query, right?

4. Index page

But now assuming that there are lots and lots of data pages, would the corresponding primary key directory be really, really big?

What if there were 10 million records, 50 million records? Even if the binary search, its efficiency is still very low, so in order to solve this problem MySQL designed a new storage structure – index page. For example, if I have the following situation,

If the primary key directory has a very, very large number of records, MySQL will split the records into different index pages like this

The index page records the page number of each data page and the record of the smallest primary key in the data page, that is to say, the minimum primary key and data page number is not simply maintained in the primary key directory, but evolved into the index page, index page and data page is similar, one is not enough to split to the next.

Now if I want to find the record whose ID is equal to 20, huh? Which index page should I look for this record in? So this is definitely the time to maintain the index page.

Yes, MySQL has also designed data structures to maintain index pages, which are also called index pages, but at different levels, like the following:

That is, the index page that maintains the index page is one level up from the actual index page that stores the records and data. Now if you want to find the record with ID =20, you start at the top of the index page. If you do a binary search, you can quickly locate the record with ID =20 s on index page 2. Index page 2 is the same as before (note that the index page is also linked by one-way list), based on the smallest primary key can be located on the data page 5, assuming that the data page 5 looks like this

Are you able to figure out where the data is located at this point?

5. Index page layering

Ok, now that you know that too many entries to index pages will spread up one level, what if there are too many entries to index pages up one level? It’s very simple, let’s keep splitting, let’s keep going up another level, and let me draw it to help you understand

I get it. Do you get it? Let’s simulate a search. Suppose you want to find the record 37, which FRANKLY I have no idea where it is. MySQL > query (id=37); MySQL > query (id=37); MySQL > query (id=37); Finally, I can locate the data reality on data page 8, assuming that data page 8 looks like this

Isn’t it perfect? If I have to complete the above picture,…. (The map is too large to draw the linked list structure of the data in the index page)

Have you discovered a little secret by now? Doesn’t it look like a binary tree? This is actually the structure of a B+ tree, and this is the physical structure of how data is actually stored on disk. What are the properties of B+ trees? A B+ tree is a type of binary search tree, but its data is stored only in leaf nodes (in this case, data pages). A B+ tree consisting of index pages and data pages is a clustered index.

Clustered indexes are created by MySQL based on the primary key index structure

6. Non-primary key indexes

But now the problem comes again, since the emphasis here is the primary key index, that we usually develop in addition to the primary key index other index also use a lot, this time how to do? Suppose you now index name and age. Now reviewing the primary key index, is it possible to maintain a B+ tree based on the order of the primary keys when inserting data?

MySQL maintains a B+ tree as many indexes as you create. MySQL maintains B+ trees as many indexes as you create. We knew not to create too many indexes because indexes also take up space, which is actually the root cause.)

If name+age is now indexed, what is the value of the index? MySQL maintains a separate B+ tree structure based on name+age, and the data is still stored in the data page, except that each entry in the data page is written as id=xx, now it is written as name=xx, age=xx, id=xx. Let’s start with a picture

When inserting data, MySQL will first sort by name. If name is the same, it will sort by age in the federated index. If name is the same, it will sort by primary key fields. That’s how insertion works.

At this point, records in each data page are actually stored in index fields and primary key fields, and other fields are not stored (why not? SQL > alter table (); SQL > alter table (); SQL > alter table (); SQL > alter table ()

SELECT name FROM student WHERE name='wx'
Copy the code

Then the query is perfect, using the index and no need to return to the table

7. Back to the table

Is it, now according to the name lookup to actual orderdate, and query field (i.e., select the back of the query field) also there is only the name (as long as it is in the name, age, id) in these three fields at this time is to be able to direct access to the records in the end

In other words, because there are only name, age, and ID in the federated index, if you query only these three fields, you will get the desired result in the B+ tree.

So let’s assume that the SQL for the query looks like this (we’re assuming that student has other fields besides name, age, id)

SELECT * FROM student WHERE name='wx'
Copy the code

That now was ended, because although you now according to the name quickly locate into the records, but because the name + the age is not a clustering index, at this time of the data page of the B + tree is just placed with their associated indexes, and the primary key index fields, will not save other fields, so this time other attribute values is to get less than, What should I do?

In this case, MySQL will need to perform back table query. MySQL will perform a clustered index lookup based on the ID of the located record, that is, it will maintain a lookup based on the ID of the ID in the B+ tree. Because the data page in the clustered index records the complete record of a record, this process is called back to the table.

The result of a query based on a non-primary key index does not have the value of the field searched. In this case, we need to start the search again from the root node of the cluster index based on the primary key.

Finally, let me take a look at MySQL’s maintenance process for non-primary key indexes:

For non-primary key indexes (generally joint indexes), the maintenance of the B+ tree will be based on the fields of the joint index. Assume that the joint index is: MySQL > select * from B+ tree; select * from B+ tree; select * from B+ tree; select * from B+ tree; select * from B+ tree; If age is the same, then it will be sorted by the primary key field value, and for non-primary key indexes, MySQL maintains only the index field and primary key field when maintaining the B+ tree.