Feelings and solid content both. Search "三太子敖丙" (Santaizi Aobing) on WeChat to follow this programmer who's a little different.

This article is included on GitHub at github.com/JavaFamily, which has a complete map of big-tech interview topics, study materials, and my series of articles.

Preface

By now I think you have a general idea of all the concepts. While reading the comments this week, I came across a question I found interesting: how do you actually design an index? How do you design one that is more efficient?

I thought I had written plenty about indexes, so there was no reason readers wouldn't get it. But when I looked back: yes, I had written about the concept of indexes and their pros and cons, just not about how to design them. That is where this article came from.

This article will repeat many of the concepts I have written about before, but the goal is to give you a better understanding of the principles behind MySQL index design.

Main content

As we know, an index is a tree structure whose leaf nodes are chained together like a linked list, which allows data to be retrieved quickly. Almost every RDBMS implements indexes, for example MySQL's B+Tree index and MongoDB's B-Tree index.

During business development, how well the indexes are designed determines how efficiently the SQL behind an interface executes. Efficient indexes reduce the interface's response time and also reduce cost. The practical chain of goals is: good index design -> lower interface response time -> lower server configuration -> lower cost. It ultimately comes down to cost, because that is what the boss cares about most.

Today we will talk about MySQL indexes and how to design them, using indexes to improve interface RT and the user experience.

How MySQL stores indexes

The InnoDB engine in MySQL uses a B+Tree structure to store indexes, which keeps the number of disk I/O operations during a query as small as possible. The height of the tree directly affects query performance, and it is generally kept at 3 to 4 levels.
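For a rough sense of why 3 to 4 levels is usually enough: with the default 16 KB page size, if an internal page holds on the order of 1,000+ child pointers and a leaf page holds on the order of 100 rows, then a 3-level tree can already address roughly 1,000 x 1,000 x 100 = 100 million rows. The exact numbers depend on key and row sizes, so treat this only as an order-of-magnitude estimate.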

A B+Tree consists of three kinds of nodes: root, branch, and leaf. Root and branch nodes do not store data, only pointers; all data is stored in leaf nodes, which are linked together in a doubly linked list.

Each leaf node consists of three parts: a predecessor pointer p_prev, the data itself, and a successor pointer p_next. The data inside a node is ordered, ascending by default, and key values on the right side of the B+Tree are always larger than those on the left. The distance from the root to every leaf is the same, so reaching any leaf node costs the same amount of I/O: for an index tree of height Level, a lookup takes Level + 1 I/O operations.

Because index entries are stored in sorted order, a query that would otherwise have to sort in the sort buffer (and spill to disk once the data exceeds sort_buffer_size) can simply read rows already in order. Most importantly, indexes can avoid explicit sorting operations (distinct, group by, order by).
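A minimal sketch of that last point, assuming the users table and the idx_user_name index created later in this article (phone is assumed to have no index of its own):

mysql> explain select phone from users order by phone limit 10;
-- no usable index on phone: Extra is expected to show "Using filesort"
mysql> explain select user_name from users order by user_name limit 10;
-- idx_user_name already stores user_name in order: no "Using filesort" in Extra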

Clustered index

Tables in MySQL's InnoDB engine are index-organized tables (IOT): rows are stored in primary key order (logically contiguous, though not necessarily physically contiguous). For a table users(id, user_id, user_name, phone, primary key(id)), id is the clustered index, and its leaf nodes store the entire row (id, user_id, user_name, phone).
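For reference, a minimal sketch of the users table assumed in the rest of the article. The column types are my assumption, chosen so that the key_len values in the later execution plans add up; they are not the author's original DDL:

create table users (
  id          bigint unsigned not null auto_increment,
  user_id     bigint unsigned not null,
  user_name   varchar(20) not null default '',   -- key_len 82 = 20 x 4 + 2 with utf8mb4
  nickname    varchar(128) default null,         -- key_len 515 = 128 x 4 + 2 + 1
  phone       varchar(20) default null,
  gender      tinyint default 0,
  create_time datetime not null,
  primary key (id)
) engine=InnoDB default charset=utf8mb4;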

Secondary index

A secondary index, also called an auxiliary index, stores the primary key id in addition to the index columns. For the index idx_user_name(user_name), the secondary index is effectively idx_user_name(user_name, id): MySQL automatically appends the primary key id to the end of the secondary index. Those familiar with Oracle will know its indexes similarly store a rowid next to the index columns (representing the physical location of the row, composed of object number + data file number + data block number + row number). We can also explicitly add the primary key id when creating a secondary index:

-- Create an index on the user_name column
mysql> create index idx_user_name on users(user_name);
-- Create an index that explicitly includes the primary key id
mysql> create index idx_user_name_id on users(user_name,id);
-- Compare the statistics of the two indexes
mysql> select a.space as tbl_spaceid, a.table_id, a.name as table_name, row_format, space_type,  b.index_id , b.name as index_name, n_fields, page_no, b.type as index_type  from information_schema.INNODB_TABLES a left join information_schema.INNODB_INDEXES b  on a.table_id =b.table_id where a.name = 'test/users';
+-------------+----------+------------+------------+------------+----------+------------------+----------+------
| tbl_spaceid | table_id | table_name | row_format | space_type | index_id | index_name       | n_fields | page_no | index_type |
+-------------+----------+------------+------------+------------+----------+------------------+----------+------
|         518 |     1586 | test/users | Dynamic    | Single     |     1254 | PRIMARY          |        9 |       4 |          3 |
|         518 |     1586 | test/users | Dynamic    | Single     |     4003 | idx_user_name    |        2 |       5 |          0 |
|         518 |     1586 | test/users | Dynamic    | Single     |     4004 | idx_user_name_id |        2 |      45 |          0 |
mysql> select index_name, last_update, stat_name, stat_value, stat_description from mysql.innodb_index_stats where index_name in ('idx_user_name','idx_user_name_id');
+------------------+---------------------+--------------+------------+-----------------------------------+
| index_name       | last_update         | stat_name    | stat_value | stat_description                  |
+------------------+---------------------+--------------+------------+-----------------------------------+   
| idx_user_name    | 2021- 01. 17:14:48 | n_leaf_pages |       1358 | Number of leaf pages in the index |
| idx_user_name    | 2021- 01. 17:14:48 | size         |       1572 | Number of pages in the index      |
| idx_user_name_id | 2021- 01. 17:14:48 | n_leaf_pages |       1358 | Number of leaf pages in the index |
| idx_user_name_id | 2021- 01. 17:14:48 | size         |       1572 | Number of pages in the index      |

Comparing the two indexes: n_fields is the number of columns in the index, n_leaf_pages is the number of leaf pages, and size is the total number of pages. The numbers show that the secondary index idx_user_name does implicitly contain the primary key id, which is why the two indexes are exactly the same size.

index_name         n_fields  n_leaf_pages  size
idx_user_name      2         1358          1572
idx_user_name_id   2         1358          1572

Back-to-table lookups

A secondary index only contains the indexed columns plus the primary key id, so filtering on a secondary index column may require going back to the table for the remaining columns. For example:

select  user_id, user_name, phone from users where user_name = 'Laaa';

Let's first look at two SQL statements that can both be served entirely by idx_user_name:

SQL 1: select id, user_name from users where user_name = 'Laaa';

SQL 2: select id from users where user_name = 'Laaa';

mysql> explain select id, user_name from users where user_name = 'Laaa';
+----+-------------+-------+------------+------+---------------+---------------+---------+-------+------+-------
| id | select_type | table | partitions | type | possible_keys | key           | key_len | ref   | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+---------------+---------+-------+------+-------
|  1 | SIMPLE      | users | NULL       | ref  | idx_user_name | idx_user_name | 82      | const |    1 |   100.00 | Using index |
mysql> explain select id from users where user_name = 'Laaa';
+----+-------------+-------+------------+------+---------------+---------------+---------+-------+------+-------
| id | select_type | table | partitions | type | possible_keys | key           | key_len | ref   | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+---------------+---------+-------+------+-------
|  1 | SIMPLE      | users | NULL       | ref  | idx_user_name | idx_user_name | 82      | const |    1 |   100.00 | Using index |

For both SQL 1 and SQL 2 the plan shows Extra = Using index, meaning a covering index scan: every column the query needs (user_name and id) is already in the secondary index, so there is no need to go back to the table. Now consider the original query again:

select user_id, user_name, phone from users where user_name = 'Laaa';

Here user_id and phone are not stored in idx_user_name, so MySQL internally splits the query into two steps:

Step 1: select id from users where user_name = 'Laaa';  -- suppose this returns id = 100101

Step 2: select user_id, user_name, phone from users where id = 100101;

Step 2 is the back-to-table lookup: the primary key id found in the secondary index is used to fetch the full row from the clustered index.
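If such a hot query must avoid the back-to-table step entirely, one option is a wider covering index. A minimal sketch under that assumption (idx_user_name_cover is a hypothetical index, not one from the original article); the trade-off is a bigger index and slower writes, so it is only worth it for genuinely hot queries:

-- hypothetical covering index: every selected column lives in the secondary index
mysql> create index idx_user_name_cover on users(user_name, user_id, phone);
mysql> explain select user_id, user_name, phone from users where user_name = 'Laaa';
-- Extra is expected to show "Using index", i.e. a covering scan with no back-to-table lookup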

Index height

The height of an index can be read from the PAGE_LEVEL field of its root page. From the earlier query, idx_user_name has index_id = 4003 and page_no = 5, so the offset of PAGE_LEVEL in the data file is page_no x innodb_page_size + 64 = 5 x 16384 + 64 = 81984. Inspect it with hexdump:

$hexdump -s 81984 -n 10 /usr/local/var/mysql/test/users.ibd
0014040 00 02 00 00 00 00 00 00 0f a3                  
001404a

Here PAGE_LEVEL is 00, meaning the idx_user_name index has height 1. The bytes 0f a3 are the index id, 4003 in decimal, which matches index_id above.

Data scan methods

Full table scan

A full table scan walks the leaf nodes of the clustered index from left to right and reads every row. The I/O overhead is high, it is slow, and locking is heavy, which hurts MySQL's concurrency.

In OLAP scenarios, where a scan needs to return a large share of the data, the sequential I/O of a full table scan is actually more efficient.

Index scan

Generally, an index is smaller than the table, so an index scan reads less data, consumes less I/O, executes faster, and takes far fewer locks, which improves MySQL's concurrency.

For OLTP systems, we always want every SQL statement to hit an appropriate index.

A full table scan is sequential I/O, whereas an index scan involves random I/O. MySQL mitigates this with the change buffer feature, which batches changes to secondary index pages to reduce random I/O.
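As a quick check, the change buffer configuration can be inspected directly; these are standard InnoDB variables, and the right values depend on your workload:

mysql> show variables like 'innodb_change_buffer%';
-- innodb_change_buffering: which secondary-index changes get buffered
-- innodb_change_buffer_max_size: the buffer's cap, as a percentage of the buffer pool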

Index optimization cases

Paging query optimization

The service needs to query transaction records by time range. The original SQL behind the interface is as follows:

select * from trade_info where status = 0 and create_time >= '2020-10-01 00:00:00' and create_time <= '2020-10-07 23:59:59' order by id desc limit 102120, 20;

The trade_info table has a composite index idx_status_create_time(status, create_time). The deeper the page, the slower the query: the larger the offset m in limit m, n, the more rows have to be scanned and thrown away just to reach row m, so the I/O overhead keeps growing. A covering scan on the secondary index can optimize this: first fetch only the ids (this step is a covering index scan and never goes back to the table), then join the ids back to trade_info. The rewritten SQL is as follows:

select * from trade_info a,
(select id from trade_info where status = 0 and create_time >= '2020-10-01 00:00:00' and create_time <= '2020-10-07 23:59:59' order by id desc limit 102120, 20) as b   -- this subquery is a covering index scan and does not go back to the table
where a.id = b.id;

Many of you may know that this rewrite is efficient, but perhaps not why. Understanding how indexes work is what allows you to write high-quality SQL.

Divide and conquer is always good

The marketing system needs to invalidate a batch of expired coupons. The core SQL is as follows:

-- 5 million rows need to be updated
update coupons set status = 1 where status = 0 and create_time >= '2020-10-01 00:00:00' and create_time <= '2020-10-07 23:59:59';

If this SQL runs slowly, it blocks subsequent SQL requests, the number of active connections spikes, MySQL CPU goes to 100%, and the corresponding interface times out. Worse, with a master/slave replication architecture and read/write splitting, updating 5 million rows might take 5 minutes on the master, another 5 minutes to ship the binlog, and another 5 minutes to replay on the slave, so the slave ends up lagging far behind. During that window the business reads stale data, which leads to problems such as duplicate orders.

A better approach is to first obtain the minimum and maximum ids matching the WHERE condition, and then update in batches of about 1000 ids at a time. This completes the update quickly and keeps master/slave replication lag negligible.

The optimization is as follows:

  1. Get the minimum and maximum ids of the range to be updated (the table has no physical deletes, so the ids are contiguous):
mysql> explain select min(id) min_id, max(id) max_id from coupons where status = 0 and create_time >= '2020-10-01 00:00:00' and create_time <= '2020-10-07 23:59:59';
+----+-------------+-------+------------+-------+------------------------+------------------------+---------+---
| id | select_type | table | partitions | type  | possible_keys          | key                    | key_len | ref  | rows   | filtered | Extra                    |
+----+-------------+-------+------------+-------+------------------------+------------------------+---------+---
|  1 | SIMPLE      | users | NULL       | range | idx_status_create_time | idx_status_create_time | 6       | NULL | 180300 |   100.00 | Using where; Using index |

Extra = Using where; Using index: the idx_status_create_time index is used and all required data is found in the index itself, so there is no back-to-table lookup.

  2. Loop over the id range, updating and committing about 1000 rows at a time. The main pseudocode is as follows:
current_id = min_id;
while current_id <= max_id do
  update coupons set status = 1 where id >= current_id and id < current_id + 1000 and status = 0;  -- ~1000 rows located quickly by primary key
  commit;
  current_id += 1000;
done
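For completeness, a minimal stored-procedure sketch of the same divide-and-conquer loop. batch_expire_coupons is a hypothetical name; in practice this loop is usually driven from application code or a script:

DELIMITER //
create procedure batch_expire_coupons(in p_min_id bigint, in p_max_id bigint)
begin
  declare v_current_id bigint default p_min_id;
  while v_current_id <= p_max_id do
    -- each pass touches ~1000 rows located by primary key; status = 0 keeps re-runs harmless
    update coupons set status = 1
     where id >= v_current_id and id < v_current_id + 1000 and status = 0;
    commit;
    set v_current_id = v_current_id + 1000;
  end while;
end //
DELIMITER ;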

These two cases show how to exploit the fact that a secondary index contains the primary key id: it is efficient to fetch the ids with a covering index scan, avoid the back-to-table lookup, and then join on the ids. And, given how MySQL replication works, the divide-and-conquer approach finishes the work efficiently while avoiding the business data problems caused by replication lag.

MySQL index design

Once you are familiar with the features of indexes, you can design high-quality indexes during business development to reduce interface response time.

Prefix index

For InnoDB tables that use REDUNDANT or COMPACT formats, the index key prefix length is limited to 767 bytes. This limit may be reached if the column prefix index of a TEXT or VARCHAR column exceeds 191 characters, assuming a UTF8MB4 character set with a maximum of four bytes per character.

When innodb_large_prefix = OFF, the 767-byte limit applies: creating a longer index only produces a warning and the key is silently truncated to the first 767 bytes. Limit aside, an overly long index is not very efficient in the first place.

-- With innodb_large_prefix = OFF, the index key prefix is limited to 767 bytes
mysql> create index idx_nickname on users(nickname);    // `nickname` varchar(255)
Records: 0  Duplicates: 0  Warnings: 1
mysql> show warnings;
+---------+------+---------------------------------------------------------+
| Level   | Code | Message                                                 |
+---------+------+---------------------------------------------------------+
| Warning | 1071 | Specified key was too long; max key length is 767 bytes |

In the early stages of development, to ship features quickly, field lengths are often defined loosely. For example, nickname in the users table is defined as varchar(128), and some interfaces need to query by nickname. After the system has run for a while we find that the longest nickname is only 30 characters, so we can create a prefix index to shorten the index and improve performance.

-- Execution plan with `nickname` varchar(128) DEFAULT NULL and a full-column index idx_nickname
mysql> explain select * from users where nickname = 'Laaa';
+----+-------------+-------+------------+------+---------------+--------------+---------+-------+------+--------
| id | select_type | table | partitions | type | possible_keys | key          | key_len | ref   | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+--------------+---------+-------+------+--------
|  1 | SIMPLE      | users | NULL       | ref  | idx_nickname  | idx_nickname | 515     | const |    1 |   100.00 | NULL  |

key_len = 515: the table uses the utf8mb4 character set, so each character takes up to 4 bytes, a variable-length type adds 2 bytes, and a nullable column adds 1 byte, i.e. 128 x 4 + 2 + 1 = 515 bytes. When creating a prefix index, the prefix length does not have to cover the longest value currently in the column; it only needs to cover the most distinguishing part, generally keeping selectivity above 90% of the full column. For example, if an email column stores values like xxxx@example.com, the prefix only needs to cover the local part before the @.
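One simple way to pick the prefix length is to compare the selectivity of a few candidate prefixes against the full column and take the shortest one that stays close to it, for example above 90%. A minimal sketch against the same nickname column:

mysql> select count(distinct nickname) / count(*)           as full_selectivity,
              count(distinct left(nickname, 10)) / count(*) as prefix_10,
              count(distinct left(nickname, 20)) / count(*) as prefix_20,
              count(distinct left(nickname, 30)) / count(*) as prefix_30
       from users;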

-- Create a prefix index with prefix length 30
mysql> create index idx_nickname_part on users(nickname(30));
-- View the execution plan
mysql> explain select * from users where nickname = 'Laaa';
+----+-------------+-------+------------+------+--------------------------------+-------------------+---------+-
| id | select_type | table | partitions | type | possible_keys                  | key               | key_len | ref   | rows | filtered | Extra       |
+----+-------------+-------+------------+------+--------------------------------+-------------------+---------+-
|  1 | SIMPLE      | users | NULL       | ref  | idx_nickname_part,idx_nickname | idx_nickname_part | 123     | const |    1 |   100.00 | Using where |

You can see that the optimizer selected the prefix index, which is 123 in length, i.e. 30 x 4 + 2 + 1 = 123 Bytes, less than a quarter of the original size.

A prefix index reduces index size, but it cannot be used to avoid sorting.

mysql> explain select gender,count(*) from users where nickname like 'User100%' group by nickname limit 10;
+----+-------------+-------+------------+-------+--------------------------------+--------------+---------+-----
| id | select_type | table | partitions | type  | possible_keys                  | key          | key_len | ref  | rows | filtered | Extra                 |
+----+-------------+-------+------------+-------+--------------------------------+--------------+---------+-----
|  1 | SIMPLE      | users | NULL       | range | idx_nickname_part,idx_nickname | idx_nickname | 515     | NULL |  899 |   100.00 | Using index condition |
-- Extra = Using index condition: the full-column index idx_nickname satisfies the GROUP BY without a temporary table. Once only the prefix index idx_nickname_part is available, the plan changes:
mysql> explain select gender,count(*) from users where nickname like  'User100%' group by nickname limit 10;
+----+-------------+-------+------------+-------+-------------------+-------------------+---------+------+------
| id | select_type | table | partitions | type  | possible_keys     | key               | key_len | ref  | rows | filtered | Extra                        |
+----+-------------+-------+------------+-------+-------------------+-------------------+---------+------+------
|  1 | SIMPLE      | users | NULL       | range | idx_nickname_part | idx_nickname_part | 123     | NULL |  899 |   100.00 | Using where; Using temporary |
-- Extra = Using where; Using temporary: with only the prefix index, MySQL has to go back to the table for the full nickname values and uses a temporary table for the GROUP BY, i.e. the sort can no longer be avoided.

Composite index

If a single-column index does not filter the data well enough, you can combine it with other columns from the WHERE condition into a composite index, which filters better and reduces the number of I/O scans. For example, the service needs to query transaction records by status and time range:

select * from trade_info where status = 1 and create_time >= '2020-10-01 00:00:00' and create_time <= '2020-10-07 23:59:59';

Developers often rely on this rule of thumb when designing composite indexes: the more unique values a column has (the better its selectivity), the better it is as the leading column. By that rule idx_create_time_status would be the efficient choice, because create_time has nearly one unique value per second and is highly selective, while status has only six discrete values. But that rule of thumb only applies to equality conditions: idx_user_id_status(user_id, status) is fine because both columns are matched with equality, whereas it breaks down for a composite index in which create_time is used as a range. Let's look at the difference between the two index orders, idx_status_create_time and idx_create_time_status.

-- Create two composite indexes with different column orders
mysql> create index idx_status_create_time on trade_info(status, create_time);
mysql> create index idx_create_time_status on trade_info(create_time,status);
-- View the SQL execution plan
mysql> explain select * from trade_info where status = 1 and create_time >= '2021-10-01 00:00:00' and create_time <= '2021-10-07 23:59:59';
+----+-------------+-------+------------+-------+-----------------------------------------------+---------------
| id | select_type | table | partitions | type  | possible_keys                                 | key                    | key_len | ref  | rows  | filtered | Extra                 |
+----+-------------+-------+------------+-------+-----------------------------------------------+---------------
|  1 | SIMPLE      | trade_info | NULL       | range | idx_status_create_time,idx_create_time_status | idx_status_create_time | 6       | NULL | 98518 |   100.00 | Using index condition |

The optimizer chose idx_status_create_time rather than idx_create_time_status. Why? We can trace the optimizer's decision through optimizer_trace.

-- Enable Optimizer_Trace tracing
mysql> set session optimizer_trace="enabled=on",end_markers_in_json=on;
-- Execute SQL statements
mysql> select * from trade_info where status = 1 and create_time >= '2021-10-01 00:00:00' and create_time <= '2021-10-07 23:59:59';
-- View trace results
mysql> SELECT trace FROM information_schema.OPTIMIZER_TRACE\G

Compare the statistics for the two indexes, as shown below:

Composite index          Type              Rows   Columns that filter      Chosen  Reason
idx_status_create_time   Index Range Scan  98518  status AND create_time   Yes     Lower cost
idx_create_time_status   Index Range Scan  98518  create_time              No      Higher cost

The MySQL optimizer is cost-based; the cost mainly consists of IO_COST and CPU_COST, and the CBO (Cost-Based Optimizer) always chooses the plan with the lowest cost as the final execution plan. Here it picks the composite index idx_status_create_time because both status and create_time in that index participate in filtering, so the cost is lower. With idx_create_time_status only create_time filters and status is ignored, because it sits after the range condition; in effect the CBO degrades it to idx_create_time, whose selectivity is worse than that of the composite idx_status_create_time.

Composite index design principles

  1. Place the column used for range queries at the end of the composite index, e.g. idx_status_create_time(status, create_time).
  2. The more often a column is used for equality filtering and the better its selectivity, the further forward it should sit in the composite index, e.g. idx_user_id_status(user_id, status).

The two principles are not contradictory, but mutually reinforcing.
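A small illustration of both principles, reusing the index names above; the exact query shapes here are my assumption:

-- Principle 2: user_id is selective and matched by equality, so it leads the index
mysql> create index idx_user_id_status on trade_info(user_id, status);
mysql> explain select * from trade_info where user_id = 100101 and status = 1;
-- Principle 1: create_time is a range condition, so it goes last and status still filters first
mysql> explain select * from trade_info where status = 1 and create_time >= '2020-10-01 00:00:00' and create_time <= '2020-10-07 23:59:59';
-- the second query is expected to pick idx_status_create_time(status, create_time)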

Index skip scan

We created the composite index idx_status_create_time(status, create_time) earlier. But what if a query filters only on create_time and skips the leading column status? MySQL 8.0.13 introduced index skip scan: when skip_scan=on in optimizer_switch, the optimizer can still use the composite index by scanning it once for each distinct value of the leading column.

| optimizer_switch             |use_invisible_indexes=off,skip_scan=on,hash_join=on |

The MySQL CBO chooses an index skip scan only when the leading column of the composite index has few distinct values and the following columns are selective; whether it is used ultimately depends on the data distribution of the index columns.

mysql> explain select id, user_id, status, phone from users where create_time >= '2021-01-02 23:01:00' and create_time <= '2021-01-03 23:01:00';
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----
|  1 | SIMPLE      | users | NULL       | range  | idx_status_create_time          | idx_status_create_time | NULL    | NULL | 15636 |    11.11 | Using where; Using index for skip scan|

You can also turn off the index skip scan feature with optimizer_switch=’skip_scan=off’.
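A minimal sketch of both ways to control it; the NO_SKIP_SCAN optimizer hint is available from MySQL 8.0.13:

-- session level
mysql> set session optimizer_switch = 'skip_scan=off';
-- or per statement, via an optimizer hint
mysql> select /*+ NO_SKIP_SCAN(users idx_status_create_time) */ id, user_id, status, phone from users where create_time >= '2021-01-02 23:01:00' and create_time <= '2021-01-03 23:01:00';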

Conclusion

This article introduced MySQL indexes, including the clustered index and secondary indexes; a secondary index contains the primary key id, which is what back-to-table lookups use, and covering index scans can be used to optimize SQL.

It also covered how to design better MySQL indexes: prefix indexes, column order in composite indexes, and the index skip scan introduced in MySQL 8.0. As we all know, an index speeds up retrieval and reduces I/O, but it takes disk space: it trades space for time. Updates also cause index pages to split and merge, which affects write performance. In real business development, what matters is designing the right index for the business scenario. That's all for today; I hope it helps.


I'm Aobing: the more you know, the more you realize you don't know. Thanks for your "triple" of likes, favorites, and comments, and see you next time!


This article is continuously updated. Search "Santaizi Aobing" on WeChat to read it as soon as it is published. Reply [Information] to get the big-tech interview materials and resume templates I have prepared.