MySQL introduction (2) : indexes

Getting started with MySQL (1) : Internal implementation of queries and updates
MySQL introduction (2) : indexes
MySQL Introduction (3) : Transaction isolation
MySQL > lock MySQL > lock
MySQL Introduction (5) : Replication

Abstract

In this article, I will first introduce what an index is and what it does.

Then we’ll look at what the data structure of an index looks like, what its benefits are, and what the problems are.

After analyzing the data structure, we can study the use of indexes and how to design more efficient caches based on this data structure.

Finally, I will supplement the content of the last article, introduce the role of Change Buffer and analyze the impact of change buffer on performance.

1 purpose

Before we look at indexes, we need to understand what they are and what they do.

The official definition of an index is as follows:

Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows.

That is, an index is used to quickly find a row of data with a particular value. If there is no index, MySQL must scan the data line by line starting from the first row.

Especially when our data volume is increasing, proper indexes can help us to have better performance.

If an index is poorly designed, it can make our database performance worse.

So what exactly is an index? Let’s move on.

2 model

Before we talk about indexing specific data structures, let’s imagine that we are looking for a word in an English dictionary.

If we need to find a word: “awesome”!

We’ll find A list of words beginning with the letter A, then W, then E…

Just keep looking down, keep narrowing our search. If we don’t use the table of contents and look for the word directly in the text, it may take more time.

Besides, the words in this dictionary are arranged in such a way that if we were to look for the letter Z, we would have to search hundreds of pages before we finally found it.

This example is not particularly accurate, but reflects the core of indexing: reducing the number of lookups.

As we all know, MySQL data is stored on disk. Disk IO is the slowest. Therefore, reducing disk reads and writes is essential to improve performance. Although most computers now use SSDS and do not need seek, the principle of indexing still holds.

Here’s how InnoDB’s B+ tree is implemented (from High Performance MySQL) :

You can see that this is an N-fork tree, and each node in the tree is a data page in MySQL.

In fact, the N tree, in this case, is the same logic as the binary search tree. The difference is that each node here contains more data and Pointers than a binary search tree. The idea is that with the same amount of data, B+ trees can make the height of the tree lower.

And because all data pages are persisted on disk, the lower height means fewer disk I/OS to find a piece of data, making it more efficient.

Notice, because the larger N in an n-tree, the lower the height of the corresponding tree. The size of each node (data page) is fixed (the default is 16K and can be changed using innodb_page_size), so the smaller the key is set to the index, the larger the N is.

3 classification

After the introduction above, I think you can understand the lookup method of index. Let’s talk about the classification of indexes:

Primary and non-primary key indexes.

A primary key index, which is a non-leaf node where the values stored are the values of the primary key, is looked up by the primary key. Until you find the last leaf node. The entire row corresponding to the primary key is stored in the last leaf node.

A non-primary key index is a non-leaf node where the value stored is the value of the index, and the lookup is performed by this value. Find the last leaf node, save the corresponding primary key ID. MySQL then searches the B+ tree corresponding to the primary key index until it finds all data in that row. And this process of finding the primary key and then using the primary key to find it again, or this row of data, is called back to the table.

Note that when we create a new table, there must be a B+ tree indexed by the primary key. Even if you do not set a primary key, MySQL selects the first unique index column that does not contain NULL as the primary key column and uses it as a primary key index. If there is no such index, a clustered index is generated using the row number as the primary key.

In addition, for each additional index, MySQL maintains an additional B+ tree. The process of maintaining B+ trees is also complicated, involving page splitting and so on, which I want to cover in a future article.

As mentioned earlier, one of the most important factors affecting MySQL performance is disk IO. And back to the table this operation, is tantamount to increasing a lot of IO times.

So is there any way to reduce the cost of that? Let’s move on.

4 Joint Index

The indexes we mentioned above are all individual data to look up.

Thus, each time we create an index for one of the columns, we have to maintain an additional B+ tree, which also wastes performance and space.

So is it possible to sort multiple numbers at once and then look them up? The answer is yes, we can use federated indexes.

4.1 Left-most prefix

Take the picture above:

We set up a joint index (name, age).

If we need to find a 15-year-old outlaw maniac:

select * from user where name = "Zhang" and age = 15;
Copy the code

Because we simply match the search conditions we define indexes, so the MySQL will be starting to find the first condition, named “zhang SAN” data, then this time will continue to determine the condition of the second age of 15 years, since the item is greater than the first data of 10, and less than a second item data of 20 years old, So it looks down from the second pointer, looking for “Joe” older than 10 and younger than 20.

The search process in which conditions and indexes match exactly is called full-value matching query.

However, suppose we don’t set multiple search criteria and only search for people with the name “John”.

select * from user where name = "Zhang";
Copy the code

The search process will not match the age column, only the name column. So, we’re going to start at the leftmost node of this B+ tree, eight-year-old Joe, and we’re going to keep going back until the name of this guy isn’t Joe.

Such a search process is called left-most prefix search. Simply to explain, is the search conditions as long as it is consistent with the joint index, or the most left of the joint index, the index will take effect, it will achieve the purpose of “pruning”, accelerate the speed of the search. Only the remaining ones that don’t match the leftmost prefix will be matched through the history.

That is, indexes can be used to speed up retrieval as long as the leftmost prefix is satisfied. The leftmost prefix can be the leftmost N field of a union index or the leftmost M characters of a string index.

So, when does the left-most prefix not work?

Suppose there is such a joint index (a, b, c, D, e, f). So the search conditions are (a), (a, b), (a, b, c) and so on are called accords with the leftmost prefix. That is, it must start at the left of the index with any N fields or M characters.

But if we use a search condition like (a, c, d), then it only works on (a), not (c, d). Because the leftmost prefix is interrupted.

If the search condition is (e, f), it will also not take effect, because it does not comply with the rule of the leftmost N fields and does not belong to the leftmost prefix.

Therefore, the index reuse ability is one of the evaluation criteria when we establish the joint index. Since the left-most prefix can be supported, there is generally no need to create a separate index on a when there is already a joint index (a,b). Therefore, the first rule of thumb is that if you can maintain one less index by adjusting the order, that order is often the preferred one.

However, the longer the index, the better. As mentioned earlier, it is important to make the N of the n-tree as large as possible so that the height of the tree is low to reduce the number of DISK I/OS. If the federated index contains many fields, the N value will be reduced when the page size is fixed, which will slow down the efficiency.

4.2 Index push-down

Continue with the outlaw fanatics example above.

Suppose our statement looks like this:

select * from user where name like "Zhang %" and age = 15;
Copy the code

It makes sense that MySQL would start with data whose name starts with “Zhang” and determine if the age is 15.

But there is a very important rule about the left-most prefix: MySQL will keep matching to the right until it encounters a range query (>, <, between, like) and stops matching.

In other words, the age index is not needed for our query.

So, before MySQL5.6, whenever a name that starts with “zhang” is found, the primary key ID of the data is returned to the table, and the age of the data is 15.

The index condition pushdown feature introduced in MySQL 5.6 can be used to determine the fields contained in the index during index traversal and filter out the records that do not meet the conditions to reduce the number of table returns. That is, until a name starting with “Zhang” is found and the age is 15, the table will not be returned.

In addition, if the multi-range Read (MRR) strategy is used, all acquired primary keys will be sorted before the table is retrieved.

4.3 Overwriting indexes

Remember that if we were using a non-primary key index, we would need to go back to the table again based on the primary key in the leaf.

Overwriting indexes solves this problem. For example, when we search for “Zhang SAN” in front of us, we can also find his age. Syndication indexes like (a,b) are used by us

select b form table_name where a = xxx;
Copy the code

If the value of a is found, the value of b is returned directly from the b + tree. That is, in this query, indexes (A, B) have “covered” our query requirements, which we call overridden indexes.

5 Unique index and common index

Plain indexes: Speed up data access
Unique index: a normal index that is not allowed to duplicate

5.1 the query

Let’s start by looking at query performance.

For a query, if this were a normal index, after finding the data that meets the criteria, it would continue to iterate until it encounters the data that does not.

If it’s a unique index, because of its uniqueness, as soon as it’s found, it just returns, it doesn’t need to go through.

In fact, the performance difference between the two is very small.

Why is that? You might think that a normal index would still need to be traversed, perhaps even slower. However, as we mentioned earlier, query operations need to read data into memory, in the form of data pages. In memory traversal operations, the speed difference is very small.

Even if the last entry of a normal index is the same, disk IO is required to read the next page, which can be time consuming. However, because a data page contains so much data, this probability is extremely low.

5.2 insert

Before we get to inserts, I want to tell you about the change Buffer.

I mentioned in my last article that when we need to update data, we first read data from disk into memory, modify the data, then modify redolog, add binlog, and then flush the dirty pages back to disk when memory is full or redolog is full.

What about inserting data?

After we add a new piece of data, MySQL does not insert it directly into disk, but writes the change buffer.

The data page is read when there is a subsequent query request about the data page, and then the data page is updated to the data page in memory according to the records of the data page in the Change buffer. This process is called merge. After the update is complete, the query results are returned.

But what good would that do?

Assuming that the normal index we inserted is not in memory, it does two things:

First, when we insert a piece of data, we do not need to call the data page that needs to be written into the memory for modification through disk IO. Instead, we record the insertion behavior, and then flush the dirty pages back to the disk. In other words, change Buffer avoids the problem of having to import a data page into memory each time to make changes, resulting in too many dirty pages.

Second, and most importantly, change Buffer is designed to avoid random IO to find pages of data during each insert. In addition, MySQL will make sure that dirty pages are flushed back to disk in a sequential IO manner when they are refreshed later.

This process is very large for normal indexes.

Simply put, the main purpose of the change buffer is to cache recorded changes. So before a data page is merged, the more changes recorded in the change buffer (that is, the more times the page has to be updated), the greater the benefit.

But for unique indexes, because the constraint for unique indexes is “data uniqueness.” Therefore, you still need to find the data page and determine whether there is a conflict before inserting. In this case, the change buffer does not work.

Then let’s associate the change buffer with the redo log mentioned earlier.

For example, we need to insert two data pages, one data page is in memory, the other data page is in disk (not yet read into memory), and the two data using a normal index (do not need to verify whether duplicate).

At this point, the memory is directly modified for the data page insertion operation in memory, and the data page insertion operation out of memory is recorded in the Change Buffer. Log these two operations in the redo log and add a binlog. When both log files are written, return and the operation is complete.

When to flush dirty pages from memory back to disk is another operation.

In addition, the change buffer here can also be persisted, following the checkpoint mechanism, that is, the change buffer marks which records have been merged into the data page and which have not.

After MySQL5.5, in addition to insert, update and delete operations, change Buffer is also supported. In other words, update and delete operations are also recorded by the Change Buffer before the merge is performed.

Wrote last

First of all, thank you for being here!

In addition, I also want to thank the male brother, in the index of this part of the content also gave me a special great help!

That’s all you need to know about MySQL indexes. Similarly, there are many holes left unfilled in this article. Due to the length and coherence of the article, no detailed introduction. But it will be covered in a later article.

If there is anything I did not explain clearly in this article, or there is a mistake in my understanding, please leave a comment, thank you!

PS: If you have other questions, you can also find the author on the official account. And, all articles will be updated in the public account at the first time, welcome to find the author to play ~