preface

MySQL supports many storage engines, and each storage engine supports different indexes. Therefore, MySQL database supports many index types, such as BTree index, hash index, full-text index, and so on. To avoid confusion, this article will focus only on the BTree index, as this is the index that you deal with most often when using MySQL.

MySQL index

An Index is a data structure that helps MySQL obtain data efficiently. An index is a data structure. An index is a data structure. An index is a data structure.

MySQL > create index

The index purpose

The purpose of an index is to improve query efficiency. It is analogous to a dictionary. If you want to look up the word “mysql”, you need to locate the letter M, and then go down to the letter Y, and then find the rest of the SQL. If there’s no index, then you might have to look through all the words to find what you want, but what if I want to find the m word? Or what about words that start with ze? Don’t you think you could have done it without an index?

It is the same for us to borrow books from the library. If you want to borrow a book, you must first find the corresponding classification subject, and then find the corresponding number. This is a living example in life.

The index principle

The principle of all indexing is the same, we filter out the desired results by narrowing down the range of data we want to obtain, and turn random events into sequential events, that is, we always lock the data by the same search method.

The same is true for databases, but it is obviously much more complex, because there are not only equivalent queries, but also range queries (>, <, between), fuzzy queries (like), union queries (OR), multi-value matching (in is essentially multiple OR), and so on. How should the database choose to deal with all these problems?

If we think back to the dictionary example, can we break up the data into segments and query it in segments? The simplest is if 1000 pieces of data, 1 to 100 into the first paragraph, 101 to 200 into the second paragraph, 201 to 300 into the third paragraph… If you look up the 250 data, just look for the third paragraph, and you eliminate 90% of the invalid data in one fell swoop. But what if it’s 10 million records? What’s a good number?

Those of you with a little algorithmic foundation will think of search trees, which have an average complexity of lgN and good query performance. But here we ignore one of the key problems, every time the complexity of the model is based on the same operating costs to consider, database implementation is more complex, the data stored in the disk, and in order to improve performance, every time can put some data into memory to calculate again, because we know that the cost of access disk access memory is probably about one hundred thousand times, Therefore, simple search trees cannot meet complex application scenarios.

Index structure

Any kind of data structure is not without foundation generation, will have its background and usage scenarios, we summarize, now we need to what could be done this kind of data structure, actually very simple, that is: each time to find the data to disk I/o control in a small number of orders of magnitude, it is best to constant orders of magnitude. So we wondered if a highly controllable multi-path search tree could meet the requirements? And so the B + tree was born.

B + tree index structure explanation

Each disk block contains several data items (shown in dark blue) and Pointers (shown in yellow). For example, disk block 1 contains items 17 and 35, and Pointers P1, P2, and P3. P1 indicates the disk block less than 17, and P2 indicates the disk block between 17 and 35. P3 indicates disk blocks larger than 35. Real data exists in leaf nodes 3, 5, 9, 10, 13, 15, 28, 29, 36, 60, 75, 79, 90, 99. Non-leaf nodes do not store real data and only store data items that guide the search direction, such as 17 and 35, which do not really exist in the data table.

B + tree search process

As shown in the figure, if data item 29 is to be searched, disk block 1 will be loaded from disk to memory first. At this time, I/O will occur. In memory, binary search is used to determine that 29 is between 17 and 35. By disk 1 piece of P2 pointer disk address the disk block 3 by disk loaded into memory, in the second IO, 29 between 26 and 30, locking disk blocks of the 3 P2 pointer, pointer loading disk blocks through eight into memory, in the third IO, binary search to find memory do 29 at the same time, the end of the query, a total of three IO.

The reality is that a 3-tier B + tree can represent millions of data. If millions of data lookups take only three IO, the performance improvement is huge. If there is no index and each data item takes one IO, the total number of IO will be millions, which is obviously very, very expensive.

B + tree

1. According to the above analysis, we know that the smaller the interval, the more data items, and the lower the height of the tree. That’s why each data item, or index field, should be as small as possible; for example, int is 4 bytes, less than half of bigInt8 bytes. This is why b+ trees require real data to be placed in leaf nodes rather than inner nodes, where the number of data items in a disk block drops dramatically, causing the tree to grow taller. When the data item is equal to 1 it will degenerate into a linear table.

2. When the data items of the b+ tree are compound data structures, such as (name,age,sex), the search tree is built on the order of b+ numbers from left to right. For example, when the data of (zhang 3,20,F) is searched, the b+ tree will compare the name preferentially to determine the direction of the next search. If the name is the same, then age and sex are compared in turn to obtain the retrieved data. However, when data such as (20,F) without name comes, b+ tree does not know which node to search next, because name is the first comparison factor when the search tree is established, and search must be conducted according to name first to know where to search next.

For example, when (three,F) data to search, b+ tree can use name to specify the search direction, but the next field age is missing, so we can only find the data whose name is equal to three, and then match the data whose gender is F, this is a very important property, that is, the left-most matching feature of the index.

MySQL index implementation

In MySQL, index belongs to the concept of storage engine level. Different storage engines implement index in different ways. This paper mainly discusses the index implementation of MyISAM and InnoDB storage engines.

MyISAM index implementation

MyISAM engine uses B+Tree as the index structure, and the data field of the leaf node stores the address of the data record.

Here is the MyISAM index schematic:

Select Col1 as Primary key, MyISAM as Primary key, Col1 as Primary key You can see that MyISAM’s index file only holds the address of the data record. In MyISAM, there is no difference in structure between primary and Secondary keys, except that the primary index requires a unique key, while Secondary keys can be repeated. If we create a secondary index on Col2, the structure of this index is as follows:

Also a B+Tree, data field holds the address of the data record. Therefore, the index retrieval algorithm in MyISAM is to search the index according to THE B+Tree search algorithm. If the specified Key exists, the value of its data field is extracted, and then the corresponding data record is read with the value of the data field as the address.

MyISAM’s index is also called “non-clustered” to distinguish it from InnoDB’s clustered index.

InnoDB index implementation

InnoDB also uses B+Tree as an index structure, but the implementation is quite different from MyISAM.

The first big difference is that InnoDB’s data files are themselves index files. MyISAM index files and data files are separate, and the index file only holds the address of the data record. In InnoDB, the table data file itself is an index structure organized by B+Tree, and the data field of the leaf node of this Tree holds complete data records. The key of this index is the primary key of the table, so the InnoDB table data file itself is the primary index.

Above is a diagram of the main index (which is also a data file) of InnoDB. You can see that the leaf node contains the complete data record. This kind of index is called a clustered index. InnoDB requires tables to have primary keys (MyISAM does not have primary keys) because InnoDB data files themselves are aggregated by primary keys. If this is not explicitly specified, the MySQL system automatically selects a column that uniquely identifies the data record as the primary key. MySQL automatically generates an implied field for the InnoDB table as the primary key. This field is 6 bytes long and of type long integer.

The second difference from MyISAM index is that InnoDB’s secondary index data field stores the value of the primary key of the corresponding record rather than the address. In other words, all InnoDB secondary indexes refer to the primary key as the data field. For example, here is a secondary index defined on Col3:

The ASCII code of English characters is used as the comparison criterion. Clustered index implementations make searching by primary key very efficient, but secondary index searches require retrieving the index twice: first, retrieving the secondary index for the primary key, and then using the primary key to retrieve records from the primary index.

Understand the different storage engines index is implemented for the proper use of and optimize the index are very helpful, know the InnoDB index after implementation, for example, it is easy to understand why not suggest using long field as the primary key, because all the auxiliary index reference primary index, long primary index will make auxiliary index becomes too large. For example, using non-monotonic fields as primary keys is not a good idea in InnoDB, because InnoDB data files themselves are a B+Tree. Non-monotonic primary keys cause data files to be constantly split and adjusted to maintain B+Tree features when new records are inserted, which is inefficient. Using an increment field as a primary key is a good choice.

How to build a suitable index

Principles of indexing

An index in MySQL can refer to multiple columns in a certain order. This type of index is called a federated index. Generally, a federated index is an ordered tuple in which each element is a column of a table. In addition, a single-column index can be regarded as a special case where the number of elements in a joint index is 1.

If the index columns are A, B, and C, the order is A, B, and C:

If you want to query [A] [A, B] [A, B, C], then you can use the index to query
If [A, C] is used in the query, C is an index, but B is missing in the middle, so C is not used, only A can be used
Select * from (select * from (select * from (select * from (select * from (select * from (select * from (select * from (select * from (select * from (select * from (select * from (select * from (select * from (select * from (select * from))))))
If a range is used and the left-most prefix is the first column index, then the index can be used, but the column after the range cannot use the index

Because indexes speed up queries, they also come at a cost: index files themselves consume storage space, and indexes increase the burden of inserting, deleting, and modifying records. In addition, MySQL also consumes resources to maintain indexes at runtime, so more indexes are not always better

When using the InnoDB storage engine, always use a business-independent increment field as the primary key unless you have a specific need. Using the InnoDB engine without auto-increment primary keys is definitely a bad idea from a database index optimization perspective.

InnoDB uses clustered indexes, where the data records themselves are stored on a leaf node of the main index (a B+Tree). This request within the same leaf node (the size of a memory or disk pages) of the individual data records in the primary key order, so every time when we have a new record into the MySQL will according to its primary key node and insert it into the appropriate position, if the page to load factor (InnoDB default for 15/16), then create a new page (nodes). If the table uses auto-increment primary keys, each time a new record is inserted, the records are sequentially added to the place following the current index node, and when a page is full, a new page is automatically opened. As follows:

This results in a compact index structure that is approximately sequentially filled. Since there is no need to move the existing data with each insert, this is efficient and does not add much overhead to maintaining the index.

If you use a non-increment primary key (if id number or student number, etc.), since the value of the primary key is approximately random, each new record is inserted somewhere in the middle of the existing index page, as follows:

MySQL at this time have to in order to insert a new record to the appropriate location and mobile data, or even the target page might have been back to disk and clear it from the cache, at this time to read from the disk back again, it cost a lot of, at the same time, frequent mobile, paging are responsible for a large number of pieces, the index structure of compact enough, The OPTIMIZE TABLE then had to be used to rebuild the TABLE and OPTIMIZE the fill page.

Therefore, whenever possible, use auto-increment field primary keys on InnoDB.

Common techniques for building indexes

Mysql > select * from (a,b,c,d); mysql > select * from (b,c,d); mysql > select * from (a,b,c,d); D (a,b,d,c); d (a, B,d,c); d (a, B,d);

2, = and in can be in random order, such as a = 1 and b = 2 and c = 3 to create (a,b,c) index can be in any order, mysql query optimizer will help you optimize the form to be recognized by the index

3. Try to select columns with high distinction as indexes. The formula of distinction is count(DISTINCT Col)/count(*), which indicates the proportion of fields that are not duplicated. So one might ask, is there any empirical value to this ratio? This value is also difficult to determine in different application scenarios. Generally, we require join fields to be above 0.1, that is, 10 records are scanned per item on average

Select froM_unixtime (create_time) = ‘2014-05-29’ from ‘b’; select from_unixtime(create_time) = ‘2014-05-29’ from ‘b’; Obviously the cost is too great. Create_time = unix_timestamp(‘ 2014-05-29 ‘);

5, as far as possible to expand the index, do not create a new index. For example, if you want to add an index (a,b) to a table that already has an index (A, B), you only need to modify the original index

MySQL optimization

Configuration optimization

Configuration optimization refers to the configuration of MySQL server, generally for the business side, can not pay attention to, after all, there will be a special DBA to deal with, but for the understanding of the principle, I think, we need to understand the development.

MySQL optimization, also can refer to: super comprehensive MySQL optimization interview parsing

The basic configuration

innodb_buffer_pool_size

This is the first option that should be set after InnoDB is installed. The buffer pool is where the data and indexes are cached: the larger the value, the better, to ensure that you use memory rather than hard disk for most read operations. Typical values are 5-6GB(8GB of memory), 20-25GB(32GB of memory), and 100-120GB(128GB of memory).

innodb_log_file_size

This is the size of the redo log. Redo logs are used to ensure that write operations are fast and reliable and recover in the event of a crash. Up to MySQL 5.1, it was difficult to adjust because on the one hand you wanted it to be larger for better performance, and on the other hand you wanted it to be smaller for faster recovery from crashes.

Fortunately, crash recovery performance has improved significantly since MySQL 5.5, so you can have high write performance and crash recovery performance at the same time. Until MySQL 5.5, the total size of redo logs was limited to 4GB(2 log files by default). This has been improved in MySQL 5.6. If you know that your application needs to write data frequently and you are using MySQL 5.6, you can call it 4G at first.

max_connections

If you often see ‘Too many connections’ errors, it is because the max_connections value is Too low. This is very common because the application does not close the database connection properly, and you need a higher value than the default 151 connection count.

One of the major drawbacks when the max_connection value is set high (for example, 1000 or more) is that the server becomes unresponsive when running 1000 or more active transactions. Using connection pools in applications or process pools in MySQL can help solve this problem.

InnoDB configuration

innodb_file_per_table

This setting tells InnoDB whether to store all table data and indexes in a shared tablespace (Innodb_file_per_TABLE = OFF) or in a separate. Ibd file (Innodb_file_per_table = ON) for each table. One file per table allows you to reclaim disk space when you drop, TRUNCate, or rebuild a table.

This is also necessary for advanced features such as data compression. But it doesn’t bring any performance gains. The main scenario where you don’t want to have one file per table is when there are too many tables (such as 10K +). In MySQL 5.6, this property defaults to ON, so in most cases you don’t need to do anything. With previous versions you had to set this property to ON before loading data because it only affects newly created tables.

innodb_flush_log_at_trx_commit

The default value is 1, indicating that InnoDB fully supports ACID. This value is most appropriate when your primary concern is data security, such as on a primary node. However, on systems with slow disk speeds, it can be very expensive, as additional FSYNcs are required for each flush to redo log change.

Setting this value to 2 makes it less reliable because committed transactions are only flushed to the redo log once per second, but it is acceptable for some scenarios, such as the backup node for the primary node. A value of 0 is faster, but some data may be lost in the event of a system crash: only for backup nodes.

innodb_flush_method

This configuration determines how data and logs are written to the hard disk. In general, if you have a hardware RAID controller with a write-back independent cache and battery power-off protection, you should set it to O_DIRECT. Otherwise, it should be set to fdatasYNc (the default) in most cases. Sysbench is a good tool to help you decide on this option.

innodb_log_buffer_size

This configuration determines the cache allocated for transactions that have not yet been executed. The default value (1MB) is generally sufficient, but if your transaction contains binary large objects or large text fields, this cache can fill up quickly and trigger additional I/O operations. Look at the Innodb_log_waits state variable, if it is not 0, increase innodb_log_BUFFer_size.

Other Settings

query_cache_size

Query Cache is a well-known bottleneck, even when there is not a lot of concurrency. The best option is to disable it from the start, set query_cache_size to 0 (now MySQL 5.6’s default) and use other methods to speed up queries: optimize indexes, increase copy-spread load, or enable additional caches (such as memcache or Redis).

Query Cache may be useful if you have query Cache enabled for your application and have not yet found any problems. If you want to stop using it, be careful.

log_bin

Binary logging is a must if you want the database server to act as a backup node to the primary node. If you do, don’t forget to set server_id to a unique value. Even with only one server, this is useful if you want to do point-in-time data recovery: recover from your most recent backup (full backup) and apply the changes in binary logs (incremental backup).

Binary logs, once created, are saved forever. So if you don’t want to run out of disk space, you can either PURGE BINARY LOGS of old files with PURGE PURGE or set expire_logs_days to specify how many days the log will be automatically cleared. Binary logging is not cost-free, so it is recommended to turn this option off if you do not need it on a replica node that is not the primary node.

skip_name_resolve

When a client connects to a database server, the server does host name resolution, and when DNS is slow, it is slow to establish a connection. Therefore, it is recommended to turn off the skip_name_resolve option without DNS lookup when starting the server. The only limitation is that only IP addresses can be used in GRANT statements, so care must be taken when adding this setting to an existing system.

SQL tuning

The system or server can enable slow query logs, especially for online systems. If slow query logs exist, you can filter them by using logs. But now that you know the SQL that needs to be tuned, how do you tune it

Slow query optimization basic steps

1. Run it first to see if it’s really slow, and set SQL_NO_CACHE
1. Lock the minimum return record table. Select * from table where the smallest number of entries is returned. Select * from table where the smallest number of entries is returned. Select * from table where the smallest number of entries is returned
1. Explain to see if the execution plan is consistent with 1’s expectations (start with tables with fewer locked records)
1. SQL statements in the form of order by limit give precedence to sorted tables
1. Understand the application scenarios of the service side
1. When adding an index, follow the principles of building an index
1. Observed results that do not conform to expectations continue from 0 analysis

Commonly used tuning methods

Execution plan Explain

In daily work, we sometimes slow down the query to record some of the SQL statement execution time longer, find out the SQL statement does not mean to, we often use the explain this command to view a these SQL statement execution plan, view the SQL statements have used on the index, do a full table scan, This can all be viewed through the explain command.

So we looked into MySQL’s overhead optimizer and got a lot of details about the access policies that might be taken into account by the optimizer and which policies are expected to be adopted by the optimizer when running SQL statements.

Select ‘explain’ from ‘explain’; select ‘explain’ from ‘explain’;

mysql> explain select * from servers;
+----+-------------+---------+------+---------------+------+---------+------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+------+---------------+------+---------+------+------+-------+
| 1 | SIMPLE | servers | ALL | NULL | NULL | NULL | NULL | 1 | NULL |
+----+-------------+---------+------+---------------+------+---------+------+------+-------+
1 row in set (0.03 sec)
Copy the code

Briefly explain the meaning of each field

Id: Identifies the order in which the SQL is executed, from large to small
Select_type: specifies the type of each SELECT clause in the query
Table: Shows which table this row is about, sometimes not the actual table name
Type: indicates the method used by MySQL to find the required rows in the table, also called access type. Common types are: ALL, index, range, ref, eq_ref, const, system, NULL (left to right, poor to good performance)
Possible_keys: Possible_keys specifies which index MySQL can use to find records in a table. If an index exists on a column involved in the query, it will be listed, but not necessarily used by the query
Key: The Key column shows the Key (index) that MySQL actually decides to use, or NULL if no index is selected.
Key_len: Indicates the number of bytes used in the index. This column can be used to calculate the length of the index used in the query. (Key_len displays the maximum possible length of the index field, not the actual length, i.e. key_len is calculated from the table definition, not retrieved from the table.)
Ref: indicates the join matching criteria for the above table, that is, which columns or constants are used to find values on indexed columns
Rows: indicates the estimated number of rows that the MySQL database needs to read to find the required record based on the table statistics and index selection. The fewer rows, the better the query performance
Extra: This column contains details about how MySQL solves queries

The characteristics of the EXPLAIN

EXPLAIN does not tell you about triggers, stored procedures, or the impact of user-defined functions on queries
EXPLAIN does not consider various caches
EXPLAIN could not show the optimizations MySQL made when executing the query
Some statistics are estimates, not exact values
EXPALIN can only interpret SELECT operations. Other operations must be rewritten as SELECT to view the execution plan.

We practice

Table structure and query statements

Suppose there is the following table structure

circlemessage_idx_0 | CREATE TABLE `circlemessage_idx_0` (
`circle_id` bigint(20) unsigned NOT NULL COMMENT 'group id'.`from_id` bigint(20) unsigned NOT NULL COMMENT 'Send user ID'.`to_id` bigint(20) unsigned NOT NULL COMMENT 'Specify receiving user ID'.`msg_id` bigint(20) unsigned NOT NULL COMMENT 'the message ID'.`type` tinyint(3) unsigned NOT NULL DEFAULT '0' COMMENT 'Message type',
PRIMARY KEY (`msg_id`.`to_id`),
KEY `idx_from_circle` (`from_id`.`circle_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin

Copy the code

Analyze the following query statements by execution plan Explain

mysql> explain select msg_id from circlemessage_idx_0 where to_id = 113487 and circle_id=10019063 and msg_id>=6273803462253938690and from_id ! =113487 order by msg_id asc limit 30; +----+-------------+---------------------+-------+-------------------------+---------+---------+------+--------+-------- -----+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+---------------------+-------+-------------------------+---------+---------+------+--------+-------- -----+ |1 | SIMPLE | circlemessage_idx_0 | range | PRIMARY,idx_from_circle | PRIMARY | 16 | NULL | 349780| Using where | +----+-------------+---------------------+-------+-------------------------+---------+---------+------+--------+-------- -----+1 row in set (0.00 sec)

Copy the code

mysql> explain select msg_id from circlemessage_idx_0 where to_id = 113487 and circle_id=10019063and from_id ! =113487 order by msg_id asc limit 30;
+----+-------------+---------------------+-------+-----------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+-------+-----------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | circlemessage_idx_0 | index | idx_from_circle | PRIMARY | 16 | NULL | 30 | Using where |
+----+-------------+---------------------+-------+-----------------+---------+---------+------+------+-------------+
1 row in set (0.00 sec)

Copy the code

Problem analysis

If msg_id >= XXX does not exist, there are fewer rows to be retrieved, and the primary key index is used in both cases. That means the index should be unreasonable, not to the maximum effect.

SQL > select * from msg_id where msg_id >= XXX; SQL > select * from msg_id where msg_id >= XXX; SQL > select * from msg_id where msg_id = XXX; MySQL > select * from msg_id where msg_id = ‘PRIMARY’; MySQL > select * from msg_id where msg_id = ‘to_id’; 34W rows have been retrieved, and since msg_id is greater than or equal to the query condition, the to_id index cannot be used after this query condition.

Key (len, 16); key (PRIMARY, 16); key (len, 16);

Finally, the type value is range, which indicates that the query is either a range query or a multi-value match.

Please note that From_id! Statements like = XXX do not use indexes. Select * from from_id = XXX where from_id = XXX where from_id = XXX

How to optimize

Now that you know the index is out of order, analyze and adjust the index. In general, since we want to query from a single table, then we need to be able to know in general, a single table will have roughly what data, the current magnitude is about what.

If the msGID is set as a primary key, it must be globally unique, so there will be at least as many MsGids as there are data volumes. Retrieving msg_id is basically retrieving the entire table. The optimization we need to do is to minimize the index, reduce the number of rows in the query; Then you need to think, which fields can be queried to reduce the number of rows? For example, can a table contain fewer rows for a user than the number of rows used to query an MSGID? Is it even less to query a user and belong to a circle? And so on.

Select * from msg_id; select * from msg_id; select * from msg_id; select * from msg_id; However, since the table is already online, has a large amount of data, and the service is running, changing the primary key can cause many problems (of course, changing the index is OK). Therefore, it is not recommended to change the primary key directly.

To ensure that the to_id index is valid, create a new federated index; Therefore, the first index field of the newly created joint index must be to_id. For this business scenario, it is better to add circle_ID index, so that the index can be quickly indexed. This gives you the index for the new union index (to_id,circle_id), and then, because msg_id is being looked for, msg_id is added to that. The final union index is (to_id,circle_id,msg_id); In this case, you can quickly retrieve query statements like where to_id = XXX and circle_id = XXX and msgId >= XXX

Index creation does not mean that an SQL statement needs an index. If there are too many indexes, write performance will be affected (insert, delete, modify), and storage space will increase accordingly. In addition, mysql also consumes resources to maintain indexes at runtime. Therefore, it is not necessary to have more indexes. A single index or set of primary keys is a btree, and multiple indexes are multiple btrees.

conclusion

First of all, we need to deeply understand the principle and implementation of the index, when understanding the principle, can help us to establish a suitable index. Then when we build the index, don’t take it for granted, first think about the business logic, and then build the corresponding table structure and index. The following points need to be stressed again:

More indexes are not always better
Distinguish between primary keys and indexes
Understand index structure principles
Understand query index rules

Index is not the more the better, to understand the principle of index structure, to help us establish a suitable index!