I heard there's a leftmost principle? This is a good example of how MySQL can use multiple columns in a joint index

Tag: article of public account

The preparatory work

For the story to run smoothly, we need to create a table:

CREATE TABLE single_table (
    id INT NOT NULL AUTO_INCREMENT,
    key1 VARCHAR(100),
    key2 INT,
    key3 VARCHAR(100),
    key_part1 VARCHAR(100),
    key_part2 VARCHAR(100),
    key_part3 VARCHAR(100),
    common_field VARCHAR(100),
    PRIMARY KEY (id),
    KEY idx_key1 (key1),
    UNIQUE KEY uk_key2 (key2),
    KEY idx_key3 (key3),
    KEY idx_key_part(key_part1, key_part2, key_part3)
) Engine=InnoDB CHARSET=utf8;
Copy the code

We create 1 cluster index and 4 secondary indexes for single_table, respectively:

The clustered index for the ID column.
Idx_key1 secondary index for key1 column.
The uk_KEY2 secondary index for the KEY2 column, and it is the only secondary index.
Idx_key3 secondary index for key3 column.
Idx_key_part secondary index for key_part1, key_part2, and key_part3 columns, which is also a federated index.

Insert 10000 rows into the table. Insert random values into all columns except the id column.

Let’s draw the clustering index of single_table:

Through the B+ tree corresponding to the cluster index, we can easily locate the cluster index record whose primary key value is equal to a certain value. For example, we want to locate the cluster index record whose ID value is 1438 through the B+ tree, then the schematic diagram is as follows:

If we want to find the secondary index record whose key1 value is equal to a certain value, we can easily locate the secondary index record whose first key1 column value is equal to a certain value through the B+ tree corresponding to IDx_KEY1, and then scan back along the one-way list where the record is located. For example, if we want to locate the first record with key1 value ‘ABC’ through the B+ tree, the schematic diagram would look like this:

Scan interval and boundary conditions

The roughest execution for a query is to scan all the records in the table, determine if each record meets the search criteria, and send it to the client if it does, or skip it if it doesn’t. This execution scheme is also known as a full table scan. For tables using the InnoDB storage engine, a full table scan means starting from the first record of the first leaf node in the clustered index and scanning back along the one-way linked list where the record resides until the last record of the last leaf node. While a full table scan is a clumsy implementation, it is a universal implementation that can be used for all queries.

We described using B+ trees to find records whose index column value is equal to a certain value, which significantly reduces the number of records that need to be scanned. In fact, since the records in the leaf node of B+ tree are sorted in ascending order according to the index column values, the number of records that need to be scanned can be significantly reduced by scanning only the records in a certain interval or interval. For example, for the following query:

SELECT * FROM single_table 
    WHERE id >= 2 AND id <= 100;
Copy the code

[2, 100]; [2, 100]; [2, 100]; [2, 100] Until the value of a cluster index is not in the range [2, 100] (i.e., until the value of the cluster index does not meet the condition id<=100). Compared with scanning all cluster index records, scanning records in the range of [2, 100] have greatly reduced the number of records that need to be scanned, thus improving the query efficiency. For simplicity, the interval where the ID value of the record to be scanned is called the scanning interval, AND the query conditions forming the scanning interval, namely ID >= 2 AND ID <= 100, are called the boundary conditions forming the scanning interval.

Tip: In fact, for full table scan, we need to scan the records of id values in the interval of (-∞, +∞), that is to say, the corresponding scanning interval of full table scan is (-∞, +∞).

For the following query:

SELECT * FROM single_table 
    WHERE key2 IN (1438, 6328) OR 
          (key2 >= 38 AND key2 <= 79);
Copy the code

Of course we can use a full table scan way to execute the query, but observed that the query search criteria involves key2 columns, and we just for key2 uk_key2 index is established, if we use uk_key2 index executing the query, then from three at the bottom of the scan interval for secondary index record:

[1438, 1438], the corresponding boundary condition is key2 IN (1438).
[6328, 6328], the corresponding boundary condition is key2 IN (6328).
[38, 79], the corresponding boundary conditions are key2 >= 38 AND key2 <= 79

These scan intervals correspond to the number line as shown below:

Tip: You can use idx_KEY1, IDx_KEY3, and idx_KEYPart, not just uk_key2. Taking IDx_KEY1 as an example, it is obvious that we cannot reduce the number of idX_key secondary index records that need to be scanned by forming an appropriate scanning interval through search conditions, and can only scan all secondary index records of IDX_KEY1. For each secondary index record retrieved, a back table operation is performed to retrieve the complete user record. We can also say that the scan interval corresponding to idx_KEY1 is (-∞, +∞). It works, but what are we trying to do? The most crude full table scan method is to scan all the cluster index records, and you need to scan all the idX_KEY1 secondary index records. In the process, not reduce the number of records that need to scan, but efficiency is less than a full table scan, so if we want to use one of the indexes to execute queries, but can’t through search terms to form proper scanning interval when the number of records to reduce the need to scan, so we are not consider using this index to execute queries.

Not all search criteria can be boundary conditions, such as the following query:

SELECT * FROM single_table 
    WHERE key1 < 'a' AND 
          key3 > 'z' AND 
          common_field = 'abc';
Copy the code

So:

Key1 < ‘a’ AND key3 > ‘z’ AND common_field = ‘ABC’ are common search conditions. These common search conditions need to obtain the secondary index records of IDX_KEY1, then perform the table operation, and obtain the complete user records to determine whether they are valid.
Key3 < ‘a’ AND common_field = ‘ABC’ if idx_key3 = ‘z’, +∞ if idx_key3 = ‘z’, +∞ These common search conditions need to obtain idx_KEY3 secondary index records, and then perform the table operation, obtain complete user records to determine whether they are valid.

The scan interval corresponding to a query executed using a federated index

The index column of the joint index contains multiple columns, and the sorting rules for each page of the B+ tree and the records in each page are relatively complex. Taking the IDX_KEY_PART joint index of singLE_TABLE as an example, its sorting rules are as follows:

First sort by the value of the key_part1 column.
If the values of the key_part1 column are the same, sort by the values of the key_part2 column.
If the values of key_part1 and key_part2 are the same, sort by the values of key_part3.

Let’s draw the idx_KEY_part index:

Q1: SELECT * FROM single_table WHERE key_part1 ='a';
Copy the code

Since the secondary index records are sorted first by the value of the key_part1 column, all records that meet the condition key_part1 = ‘A’ must be adjacent. We can locate the first record that meets the condition key_part1 = ‘a’ and scan back along the one-way list where the record is located. Until one of the records does not meet the condition key_part1 = ‘a’ (of course, we won’t show back table operations for every secondary index record retrieved), as shown below.

For the following query Q2:

Q2: SELECT * FROM single_table WHERE key_part1 ='a' AND 
              key_part2 = 'b';
Copy the code

Since the secondary index records are first sorted by the value of the KEY_PART1 column; The key_part2 column is sorted if the values of the key_part1 column are equal. Key_part1 =’a’ AND key_part2=’b’ are adjacent to each other, so we can locate the first index that matches the conditions key_part1=’a’ AND key_part2=’b’. It then scans back along the one-way linked list of records until one of the records does not meet the conditions key_part1=’ A ‘or key_part2=’b’ (of course, backtable operations are performed for each secondary index record retrieved, which we will not show here), as shown below.

For the following query Q3:

Q3: SELECT * FROM single_table WHERE key_part1 ='a' AND 
              key_part2 = 'b' AND 
              key_part3 = 'c';
Copy the code

Since the secondary index records are first sorted by the value of the KEY_PART1 column; If the values of keypart1 are equal, then the key_part2 column is sorted. With the key_part1 and key_part2 columns equal, sort by the key_part3 column. Key_part1 = ‘a’ AND key_part2 = ‘b’ AND key_part3 = ‘c’ We can locate the first record that meets the conditions key_part1=’ A ‘AND key_part2=’b’ AND key_part3=’ C’ AND scan back along the one-way list of records, Until a record does not meet the key_part1=’ A ‘condition, or key_part2=’b’ condition, or key_part3=’ C’ condition (of course, for every secondary index record retrieved, a back-table operation is performed), we will not draw diagrams. If we use idx_key_part index, when performing a query Q3 can form scanning interval [(‘ a ‘, ‘b’, ‘c’), (‘ a ‘, ‘b’, ‘c’)]. Key_part1 = ‘a’ AND key_part2 = ‘b’ AND key_part3 = ‘c’.

For the following query Q4:

Q4: SELECT * FROM single_table WHERE key_part1 <'a';
Copy the code

Since the secondary index records are sorted first by the value of the KEY_part1 column, all records that meet the condition key_part1 < ‘a’ must be contiguous, We can locate the first record that matches key_part1 < ‘A’ (that is, the first record of the first leaf node indexed by IDx_KEY_PART) and then scan back along the one-way list of records until one of the records does not match key_part1 < ‘a’ (of course, A back table operation is performed for each secondary index record retrieved, which we won’t show here), as shown below.

For the following query Q5:

Q5: SELECT * FROM single_table WHERE key_part1 ='a' AND 
          key_part2 > 'a' AND 
          key_part2 < 'd';
Copy the code

Since the secondary index records are first sorted by the value of the KEY_PART1 column; The key_part2 column is sorted if the values of the key_part1 column are equal. That is, secondary index records that meet the condition key_part1 = ‘a’ are sorted by the value of the key_part2 column, Key_part1 = ‘a’ AND key_part2 > ‘a’ AND key_part2 < ‘d’ are adjacent to each other. We can locate the first record that meets the conditions key_part1=’ A ‘AND key_part2 > ‘a’ AND key_part2 < ‘c’ AND scan back along the one-way list of records, Until a record does not meet the conditions key_part1=’ A ‘or key_part2 >’ A ‘or key_part2 < ‘d’ (of course, backtable operations are performed for every secondary index record retrieved, so we won’t show backtable operations here), as shown below.

For the following query Q6:

Q6: SELECT * FROM single_table WHERE key_part2 ='a';
Copy the code

Since the secondary index records are not sorted directly by the value of the key_part2 column, the secondary index records that match key_part2 = ‘a’ may not be adjacent, which means that we cannot reduce the number of records that need to be scanned by the key_part2 = ‘a’ search condition. In this case, we would not use the IDx_KEY_part index to perform the query.

For the following query Q7:

Q7: SELECT * FROM single_table WHERE key_part1 ='a' AND 
              key_part3 = 'c';
Copy the code

Since the secondary index records are sorted by the value of the key_part1 column first, the secondary index records that meet the condition key_part1 = ‘a’ must be adjacent, but for the secondary index records that meet the condition key_part1 = ‘a’, It is not sorted directly by the key_part3 column, which means that we cannot further reduce the number of records to scan based on the search condition key_part3 = ‘c’. So if we use the IDx_KEY_PART index to perform the query, we can locate the first record that meets the condition key_part1=’ A ‘, and then scan back along the one-way list until one of the records does not meet the condition. Therefore, in the process of query Q7 using idx_KEY_PART, the corresponding scan interval is actually [‘ A ‘, ‘a’], which is formed by the search condition key_part1 = ‘a’, regardless of key_part3 = ‘c’.

Tip: For each secondary index record obtained, if the index condition push is not enabled, you must first perform the table back operation, obtain the complete user record, then determine whether the condition key_part3 = ‘c’ is valid. If you enable the index condition push down feature, you can immediately determine whether the secondary index record meets the condition key_part3 = ‘c’. If yes, the table operation is performed again. If not, the table operation is not performed and the next secondary index record is directly skipped. The index conditional push feature was introduced in MySQL 5.6 and is enabled by default.

For the following query Q8:

Q8: SELECT * FROM single_table WHERE key_part1 <'b' AND 
              key_part2 = 'a';
Copy the code

Since the secondary index records are sorted first by the value of the KEY_part1 column, the secondary index records that meet the key_part1 < ‘b’ condition must be adjacent, but for the secondary index records that meet the key_part1 < ‘b’ condition, It is not sorted directly by the key_part2 column, which means that we cannot further reduce the number of records to scan based on the search condition key_part2 = ‘a’. So if we use the idx_KEY_part index, we can locate the first record that meets the condition key_part1<‘b’ and scan back along the one-way list of records. Until a record does not meet the key_part1 < ‘b’ condition, as shown below.

For the following query Q9:

Q9: SELECT * FROM single_table WHERE key_part1 <='b' AND 
              key_part2 = 'a';
Copy the code

Q8 and Q9 obviously look very similar, except that among the conditions involving key_part1, the condition in Q8 is key_part1 < ‘b’ and the condition in Q9 is key_part1 <= ‘b’. It is obvious that secondary index records that meet the key_part1 <= ‘b’ condition are adjacent, but secondary index records that meet the key_part1 <= ‘b’ condition are not sorted directly by the key_part2 column. However, for secondary index records that conform to key_part1 = ‘b’, they are sorted by the value of the key_part2 column. So when we determine the range of secondary index records to scan, when the value of key_part1 column of the secondary index record is’ B ‘, we can also use the condition key_part2 = ‘A’ to reduce the range of secondary index records to scan. Key_part1 = ‘b’, key_part2 = ‘a’, key_part1 = ‘b’

It may be easier to understand the scan interval and the conditions under which query Q9 is executed using the idx_KEY_part index:

SELECT * FROM single_table 
    WHERE (key_part1 < 'b' AND key_part2 = 'a') 
       OR (key_part1 = 'b' AND key_part2 = 'a');
Copy the code

For more information, please refer to the Nuggets booklet how MySQL Works: Understanding MySQL from the Root

digression

This article was first published on the public account “We are all little Frogs”.

Writing articles is tiring, and sometimes you feel that the reading is smooth, but it is actually the result of countless revisions behind. If you think it is good, please help to forward it, thanks a million ~ here is my public account “we are all small frog”, there are more technical dry goods, occasionally pull a calf, welcome to pay attention to:

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

I heard there’s a leftmost principle? This is a good example of how MySQL can use multiple columns in a joint index

The preparatory work

Scan interval and boundary conditions

The scan interval corresponding to a query executed using a federated index

digression

I heard there’s a leftmost principle? This is a good example of how MySQL can use multiple columns in a joint index

The preparatory work

Scan interval and boundary conditions

The scan interval corresponding to a query executed using a federated index

digression

Related Posts

40 maps to understand TCP and UDP

Static static static static

A small optimization of spring’s redis cluster integration