Query plan: EXPLAIN

Interviewer: I see on your resume that you are familiar with SQL optimization. How do you optimize queries in your work?

We enable the slow query log to catch slow SQL, then use EXPLAIN to look at the query plan — mainly the type, possible_keys, key, rows, and Extra columns — and optimize accordingly.
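
For instance, a minimal sketch of this workflow; the threshold and the orders table are assumptions for illustration, not values from production:

SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 1;   -- log statements slower than 1 second

EXPLAIN SELECT * FROM orders WHERE user_id = 42;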

Interviewer: Tell me more about these columns

type is an important indicator of SQL performance; generally it should reach at least the index level. Next come possible_keys and key: check which indexes could be used and whether one is actually used, and if not, analyze why.

If no suitable index exists, the fix is simple: create one.

If the desired index is used and the execution plan looks fine, look at the number of rows scanned. If that number is very large, the SQL has to be gradually rewritten to reduce the rows scanned (this takes solid experience), or you consider splitting the table or the database. Finally, look in Extra for time-consuming operations such as filesort.

In short, the bottom line is to check whether the query uses an index; if it does and is still slow, optimize the number of scanned rows.

Interviewer: What values can type take, and what do they mean?

The following values are the most common, with performance deteriorating from top to bottom:

  • const — The primary key (or a unique index) locates the record in a single lookup; very fast, since at most one row matches.
  • eq_ref — Usually appears with joins: for each row from the left table there is exactly one matching row in the right table, e.g. a user account table joined to a user profile table on userId.
  • ref — An index is used, but the indexed column's values are not unique and contain duplicates, so even after the first match is found quickly via the index the scan cannot stop; a small range around the target value is scanned.
  • range — A range scan over an indexed column, e.g. WHERE with "<", ">", or "BETWEEN ... AND".
  • index — A full scan in index order; compared with ALL, it scans the whole table following the order of the index.
  • ALL — A full table scan; every row record must be scanned.

So what is the difference between index and ALL? Many tutorials online say that index only scans the index tree and does not touch the row records, which is actually wrong. Both index and ALL scan the whole table, but with index MySQL scans in index order rather than in row order — in other words, sequential disk I/O. The index tree alone is scanned only when the Extra column shows Using index, because Using index means MySQL is using a covering index and the desired result can be obtained directly from the index tree, without reading the rows.
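
For a feel of how these values show up, here is a minimal sketch; the users table, its columns, and its indexes are assumptions, and the type actually reported depends on the data and the optimizer:

EXPLAIN SELECT * FROM users WHERE id = 42;                -- type = const (primary key lookup)
EXPLAIN SELECT * FROM users WHERE age BETWEEN 20 AND 30;  -- type = range, if age is indexed
EXPLAIN SELECT id FROM users;                             -- typically type = index (scan in index order)
EXPLAIN SELECT * FROM users;                              -- type = ALL (full table scan)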

The advantages and disadvantages of indexes

Interviewer: You just said that optimization boils down to looking at the index. Why is the query faster when there is an index?

An index is a data structure built to speed up queries. We use the InnoDB storage engine, which stores the indexed fields of records in a B+TREE. Like a binary search tree, it can be searched very quickly. Without an index, every page on disk has to be scanned; with an index, only the relevant pages need to be read. Random disk I/O becomes selective disk I/O, and far fewer disk I/Os are performed.

Interviewer: Is it a good idea to index every field?

No. An index is a B+TREE that has to be maintained: when data in the table is added or deleted, the index tree must be adjusted dynamically as well. This slows down inserts and deletes, so indexes should not be created mindlessly.

Interviewer: Tell me which fields are suitable for indexing and which are not

If a column has few duplicate values, it is suitable for indexing; if it has a high repetition rate, it is not. For example, a user's mobile phone number or an order number in an orders table are good candidates for an index; user gender or order status are not.
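
A rough way to gauge whether a column is worth indexing is its selectivity. This is only a sketch, with the users table and its columns assumed:

-- Closer to 1 means fewer duplicates and a better index candidate
SELECT COUNT(DISTINCT phone)  / COUNT(*) AS phone_selectivity,
       COUNT(DISTINCT gender) / COUNT(*) AS gender_selectivity
FROM users;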

Interviewer: Why?

The index is a B+TREE. If the indexed column has no duplicate values, the search can stop as soon as the target is found while walking the index tree. If many rows share the same value, many leaf-node disk blocks contain that value, so even after the first match is found the search has to continue into other disk blocks, which means more I/O operations (more on B+TREE below).

Index types & going back to the table

Interviewer: Tell me what kinds of indexes MySQL has

Broadly, they fall into clustered indexes and non-clustered indexes.

All indexes except the primary key index are non-clustered indexes. A clustered index tree stores not only the index field values but also the complete row records in its leaf nodes. A non-clustered index tree's leaf nodes store primary key values rather than the full row records. Therefore, when querying through a non-clustered index, MySQL first finds the primary key value, then looks up the full row in the clustered index tree by that primary key; this process is called going back to the table.
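
As a sketch (the users table, phone column, and phone number are assumptions): querying through a secondary index first finds the primary key in the secondary index tree, then fetches the full row from the clustered index tree.

CREATE INDEX idx_phone ON users (phone);
EXPLAIN SELECT * FROM users WHERE phone = '13800000000';  -- secondary index lookup, then back to the table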

Interviewer: If the table has no primary key, is there still a clustered index?

(Hmm...) If there is no primary key, MySQL selects a unique, non-null key as the clustered index. If there is no such unique key, MySQL implicitly defines a primary key to serve as the clustered index. So a table always has a clustered index.

The underlying principle of indexes: B+TREE

Interviewer: I see you are quite familiar with indexes. Could you tell me more about B+TREE?

We can use an online B+TREE visualization website to get a feel for the data structure. Here I chose an order of 7 (MySQL's actual implementation uses an order in the hundreds) and inserted 42 primary keys. You can watch how the tree changes as we keep adding data to the table; these changes in the index tree are what we call page splits.

Each rectangle here is a disk block, also called a page. The pages the arrows point to are the leaf-node pages. Note that only the leaf nodes of the primary key B+TREE store the full row data; non-leaf nodes store only primary key values and pointers to other nodes. Also note that in MySQL's implementation the pointers between leaf nodes are bidirectional, which makes reverse-order scans easy.

As we said, only the leaf nodes store the full row records. The right-hand part of the picture above can be understood with the following image:

Interviewer: Walk me through the process of querying data through the primary key index

When a query is executed, MySQL does not jump straight to the target record. Instead it uses the primary key to find the page the record lives on, reads that page from disk into memory, and then searches within the page in memory. For example, to query the record with id = 42, we walk down from the root node following the structure of the tree and finally locate 42 on the rightmost leaf page. That page is then read into memory and traversed to find the record with id 42.

Data here means a complete row of the table. You can see that the pages are linked together in a doubly linked list. Inside a page, the row records themselves also form a linked list, but between records it is a singly linked list.

Interviewer: In MySQL the B+ tree's order is in the hundreds, so a page is a singly linked list of hundreds of elements. Since searching a singly linked list means traversing from the head, isn't retrieval rather inefficient?

Inside each page MySQL maintains a page directory over the primary keys and row records in that page. Each slot in the page directory corresponds to a group of records, and each group contains only a few records. The directory is used to locate the right group, and then the singly linked list is traversed only within that group. Since a group holds just a few records and the page is already in memory, the time this takes is negligible.

Specific illustration:

Interviewer: How is B+TREE better than BTREE for MySQL storage?

The difference between B+TREE and BTREE is that B+TREE stores row records only in the leaf nodes, while BTREE stores data in every node (page). As we know, a MySQL page is 16KB on disk, and storing full row data in non-leaf nodes would eat up that space. Because a non-leaf B+TREE page stores only primary keys (and pointers), it can hold many more of them, which keeps the tree height low; fewer disk I/O operations are needed per query, so performance is higher. Secondly, since the leaf nodes of a B+ tree are all linked together, range scans and sorting are more convenient.
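
A rough, illustrative calculation of why that matters; the entry sizes are assumptions, and 16KB is InnoDB's default page size:

SHOW VARIABLES LIKE 'innodb_page_size';   -- 16384 bytes by default
-- If each non-leaf entry is roughly a BIGINT key (8 bytes) plus a 6-byte page pointer,
-- one page holds on the order of 16384 / 14 ≈ 1170 entries, so a three-level tree can
-- address roughly 1170 * 1170 * (rows per leaf page) rows with only a few disk reads.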

Now look at the fields that are not suitable for indexing

We have already stated that a gender field (0 for female, 1 for male) is not suitable for indexing. If we created an index on it, it would look like this:

Look how many disk blocks have to be scanned for sex = 0 with this index; the index is practically meaningless.

A closer look at going back to the table

We have already introduced the concept of going back to the table, so let's use a concrete diagram to understand it.

A non-primary-key index B+TREE is almost the same as a primary key index B+TREE; the only difference is that its leaf nodes store primary key values. I won't draw another diagram here (you can think of the tree below as a non-primary-key index on, say, a phone number column). If we query the complete record by phone number, MySQL first finds, in the non-primary-key index B+TREE, that the primary key corresponding to the target phone number is 42, and then uses 42 to look up the complete target record in the primary key B+TREE following the process described above. As you may have noticed, this means first doing the disk I/O of the non-primary-key B+TREE, and then, for the back-to-table lookup, the disk I/O of the primary key B+TREE.

Composite indexes & leftmost prefix matching

Interviewer: How did you decide the column order when you created a composite index? How must the composite index columns be used in a query?

Create the composite index in order of selectivity, from high to low: the column with almost no duplicates goes on the left, moderately repetitive columns go in the middle, and highly repetitive columns go on the right. The principle is the same as above — a column like sex is not suitable for indexing. When querying, the leftmost index column must be used; this is called the leftmost prefix matching rule.
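
A minimal sketch of such an index; the orders table and its columns are assumptions. order_no is almost unique, user_id repeats somewhat, and status repeats heavily, so the columns are ordered accordingly:

CREATE INDEX idx_orderno_userid_status ON orders (order_no, user_id, status);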

Interviewer: Why can't a composite index be used if the leftmost prefix match is not followed?

Earlier we said that a composite index is a non-clustered index, so it also has its own B+TREE. Take this table as an example:

id a b c d
1 1 1 1 data
2 2 2 2 data
3 3 2 2 data
4 3 2 2 data
5 2 3 5 data
6 6 4 4 data
7 4 5 5 data
8 8 8 8 data

We create the composite index (a, b, c). Its B+TREE looks something like this:

First the entries are sorted by the first column, a; if a is equal, they are sorted by b; if both a and b are equal, they are sorted by c. In other words, to use a column of this index, every column to its left must be specified. So obviously the leftmost column is a must; otherwise the query cannot follow the index order. This is the leftmost prefix match.
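
As a quick sketch of which WHERE clauses can use the composite index (a, b, c) on the test table above (actual plans depend on the data):

EXPLAIN SELECT * FROM test WHERE a = 3;                   -- can use the index (leftmost column present)
EXPLAIN SELECT * FROM test WHERE a = 3 AND b = 2;         -- can use the index
EXPLAIN SELECT * FROM test WHERE b = 2 AND c = 2;         -- cannot use the index: a is missing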

Interviewer: Your a, b, and c are all integers. How are they sorted if they are strings?

Open Navicat and, in the table design interface, select a string-typed field: there is a Collation setting, and values will be compared according to the collation we specify.
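
The same thing can be done in plain SQL rather than through Navicat; the name column here is hypothetical:

ALTER TABLE test MODIFY name VARCHAR(50) COLLATE utf8mb4_general_ci;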

Covering indexes

Interviewer: You said before that a query through a non-primary-key index has to go back to the table. Does a non-primary-key index query always have to go back to the table?

Not necessarily. With a covering index, no back-to-table lookup happens. For example, suppose we execute SQL like this:

SELECT a, b, c FROM test;

Since the three columns we query are exactly columns of the composite index, the result can be fetched directly from the composite index's B+TREE and returned to the client; there is no need to go back to the table. So where possible, have the queried columns covered by an index during development; this can significantly improve SQL performance.
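
A quick way to confirm this, assuming the composite index (a, b, c) from above exists: when the index covers the query, the Extra column of EXPLAIN shows "Using index".

EXPLAIN SELECT a, b, c FROM test WHERE a = 3;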

Index failure scenarios

Interviewer: Do you know of any scenarios that would cause the index to fail?

It can be classified into the following three categories:

  • The leftmost prefix matching rule is not followed
  • Use functions on indexed columns
  • Indexes are not as efficient as full table scans

Concretely, any number of cases can be listed:

  • The WHERE condition does a NULL-value check on an indexed column
  • The WHERE condition uses "!=" on an indexed column
  • The WHERE condition joins conditions with OR
  • A LIKE pattern starting with a wildcard on an indexed column, e.g. WHERE id LIKE "%xx%"
  • The WHERE condition applies a function or expression to an indexed column, for example to_days(t1.update_time) > 'xxx'
  • If a table has only 10 rows in total, all 10 rows end up being read from disk into memory whether or not an index exists, so the optimizer may simply not use the index in this case

All of the above can cause the index to fail and the query to degrade into a full table scan. If you are familiar with how indexes work, the reasons behind these failures are straightforward, so I will not go into detail here; one of them is sketched below.
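
A hedged illustration of the function case; the orders table and an index on update_time are assumptions:

EXPLAIN SELECT * FROM orders WHERE to_days(update_time) > to_days('2023-01-01');  -- wraps the column in a function, likely type = ALL
EXPLAIN SELECT * FROM orders WHERE update_time > '2023-01-01';                    -- bare column, can use the index (type = range)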

ICP & MRR

Interviewer: You said earlier that you are running MySQL 5.7 in production. Do you know what optimizations MySQL 5.6 introduced?

There are two important ones. We can check whether the related optimizations are enabled with the following SQL; by default both are on:

SELECT @@optimizer_switch
  • MRR (Multi-Range Read)

The official documentation says a lot about it and I don't fully understand all of it... The net effect is to reduce random disk access by turning random I/O into sequential I/O. If MRR is used, the Extra column in EXPLAIN will show Using MRR.

  • ICP (Index Condition Pushdown)

It is called index condition pushdown, and the example in the official documentation makes it very clear. Suppose there is a table people with a composite index on (zipcode, lastname, address), and we execute the following SQL statement:

SELECT * FROM people WHERE zipcode='95054' AND lastname LIKE '%etrunia%' AND address LIKE '%Main Street%';

Without index condition pushdown, MySQL scans all index entries matching zipcode = '95054', goes back to the table for each of them, and only then filters on the last two conditions. That means reading a lot of rows from disk that do not satisfy all the conditions and filtering them afterwards. With index condition pushdown, MySQL also checks the next two columns against the index entries while scanning the composite index, discarding the ones that do not match, so only rows satisfying all the pushed-down conditions need a back-to-table lookup. In effect, part of the WHERE filtering is pushed down to the storage engine level.

Note that for index condition pushdown to take effect, the columns in the WHERE condition must be columns of the index.
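
ICP can be toggled at runtime through optimizer_switch, and when it kicks in the Extra column of EXPLAIN shows "Using index condition":

SET optimizer_switch = 'index_condition_pushdown=on';
EXPLAIN SELECT * FROM people WHERE zipcode='95054' AND lastname LIKE '%etrunia%' AND address LIKE '%Main Street%';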

Conclusion

Please like and follow this article if it’s been helpful to you. Your support is my motivation to continue to create!