[MySQL] SQL optimization from index data structure

The index

How to find data in MySQL

Suppose there is a table org_user with no primary key set and no index set:

SQL SELECT * FROM org_user where id = 7; , the full table scan is performed. Mysql will take the data item by item and compare it according to the condition after where. Finally, the data that meets the conditions are retrieved. Each time data is fetched, an I/O operation is performed. Each time data is fetched, a number of I/ OS are performed. When the amount of data reaches a certain size, this full table scan can be very performance consuming.

This is why mysql introduced the concept of indexes. Indexes can be added to frequently queried columns to optimize query speed.

What is an index

Indexes are sorted data structures that help MySQL retrieve data efficiently.

Different storage engines of MySQL have different underlying data structures when using indexes. There are three types of MySQL indexes: HASH index, BTREE index, and full-text index.

The index described in this article is a relatively used BTREE index.

BTREE index

The BTREE index is implemented using a B+ tree.

Schematic diagram of B+ tree structure:

B+ tree dynamic demonstration

Difference: The BTREE index is slightly different from a simple B+ number, which maintains a one-way pointer from small to large between leaf nodes (as shown in the figure above). The BTREE index maintains bidirectional Pointers between leaf nodes.

BTREE Index structure features

Non-leaf nodes do not store specific data, only indexes (redundant), in order to put more indexes

The leaf node contains all the index fields

Leaf nodes are connected with Pointers to improve the performance of interval access

Here is:

Why is the BTREE index implemented in B+ tree

Compared with other data structures, tree structure has greater advantages in query speed, and compared with Hash, tree structure is better in sorting scenarios.
Compared with other trees, B tree performs better in the single-ended growth problem. Compared with balanced binary trees and red-black trees, b-trees have fewer hierarchies when the data volume is large.
B+ tree compared to B tree, non-leaf nodes do not hold data, and each disk page can hold more nodes. B+ tree has fewer layers for the same data. And B+ tree maintains bidirectional pointer in leaf node, which is more suitable for range query.

Index structure of different mysql storage engines

MyISAM
- Non-clustered index: The index file is separate from the data file. The leaf node of the index file holds a pointer to the address of the data on disk.
- Here is:
InnoDB
- Clustered index: Index files and data files are not separated.
  - The leaf node of the primary key index file holds the detailed data.
  - The leaf node of a non-primary key index only stores the corresponding primary key ID. If you want to query other fields (except primary key and index fields), you need to query the primary key index tree based on the ID.
- Here is:

Aggregate index (federated index) and leftmost prefix principle

Aggregate index (joint index) : An index consists of multiple fields.

Here is:

In the data structure, the data will be sorted in the B + tree according to the order of the index.

Leftmost prefix principle: leftmost takes precedence and matches from the leftmost part of the union index when retrieving data.

The following table, for example,name.age.positionFor joint index

CREATE TABLE `org_user`  (
  `id` int(11) NOT NULL,
  `name` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `age` int(2) NOT NULL,
  `position` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `sex` varchar(2) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `org` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL.PRIMARY KEY (`id`) USING BTREE,
  INDEX `index_name_age_position`(`name`, `age`, `position`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8
Copy the code

Now, before we do that, we need to understand a little bit`EXPLAIN`

Explain how to see whether the index is gone, which joint index is gone?

Type attribute system > const > eq_ref > ref > range > index > ALL.

Type is eq_ref\ref\range\index, which indicates the index to be queried.

The length of key_len can be used to determine the first few digits of the joint index.

The rules for kan_len are as follows: 1.

Char (n) : 3n bytes if Chinese characters are stored

Varchar (n) : 3n + 2 bytes for Chinese characters, with the additional 2 bytes used to store the length of the string

2. Value type

Tinyint: 1 byte

Smallint: indicates 2 bytes

Int: 4 bytes

Bigint: 8 bytes

3. Time type

Date: 3 bytes

Timestamp: 4 bytes

Datetime: 8 bytes

4. Null value: 1 byte

Analyze SQL using the mysql EXPLAIN EXPLAIN statement

EXPLAIN SELECT * FROM `org_user` where name = 'zhouxh';
Copy the code

Key_len = 62; key_len = 62; key_len = 62Similarly, analyze the following SQL

EXPLAIN SELECT * FROM `org_user` where name = 'zhouxh' AND position = 'Front-end development';
EXPLAIN SELECT * FROM `org_user` where name = 'zhouxh' AND age = 24;
EXPLAIN SELECT * FROM `org_user` where age = 24 AND position = 'Front-end development';
EXPLAIN SELECT * FROM `org_user` where name = 'zhouxh' AND age = 24 and position = 'Front-end development';
EXPLAIN SELECT * FROM `org_user` where name like 'zhouxh%' AND age = 24 and position = 'Front-end development';
EXPLAIN SELECT * FROM `org_user` where name like '%zhouxh' AND age = 24 and position = 'Front-end development';
Copy the code

You can obtain the following index hits with the leftmost prefix (name, age, position are joint indexes)

	Where conditions	Index usage
1	where name = ‘xxx’	To use the name
2	where name = ‘xxx’ and age = xx	Use name and age
3	where name = ‘xxx’ and age = xx and position = ‘xxx’	Index full hit
4	where name = ‘xxx’ and position = ‘xx’	Only have the name
5	Where age = xx and position = ‘XXX’ where age = xx and position = ‘XXX’	No index used
6	where name = ‘xxx’ and age > xx and position = ‘xxx’	Use name and age
7	where name like ‘xxx%’ and age = xx and position = ‘xxx’	After MySQL5.6: only name is hit in/before the index full name
8	where name like ‘%xxx’ and age = xx and position = ‘xxx’	No index used
9	where name = ‘x%xx%’ and age = xx and position = ‘xxx’	Index in full name
10	where name = ‘xx’ and age like ‘%xx’ and position = ‘xxx’	Use name and age

Summary of reasons:

When the name field is determined (name = XXX), the age field is arranged in order. At this time, whether the age field is limited in scope (>, <) or determined in value (=), it can be quickly found from the B+ tree.
When the name field is uncertain (there is no qualified name after where), the age field is arranged unordered in the B+ tree (Bill, HanMeimei, LEff, lilEI may all have the same age). In this case, the age cannot be quickly located and the index is forcibly used, which is not as fast as the full table scan.
When the name field is a post-fuzzy query (name like ‘XXX %’), MySQL5.6 introduces the index push-down optimization, which can judge all the fields in the index first, filter out the records that do not meet the conditions and then return to the table in the process of index traversal, effectively reducing the number of back to the table. After matching an index with a name beginning with ‘LiLei’ in the union index, the query will filter the age and position fields in the index. After filtering the primary key IDS of the remaining indexes, the query will go back to the table to look up the whole row data. Index pushdown reduces the number of table reversions. For innoDB engines, index pushdown can only be used for secondary indexes.
When the name field is pre-fuzzy (name like ‘% XXX ‘), the name field is unordered, so the index cannot be matched.

SQL optimization

1, the first field of the joint index with a range does not necessarily walk the index

EXPLAIN SELECT * FROM `org_user` where name > 'zhouxh' AND age = 24 and position = 'Front-end development';
Copy the code

Index match is not confirmed: in a table with a certain amount of data, the first field is a range query. If mysql internal analysis shows that the range query data set is large, it will perform a full table scan. Otherwise, mysql will run the index. How to optimize overwrite indexing: Require that all query fields exist in the union index tree. Cause: In this case, all data can be obtained from the union index tree, without the need to query back to the table, which is faster.

EXPLAIN SELECT id, name, age, position FROM `org_user` where name > 'zhouxh' AND age = 24 and position = 'Front-end development'; 
Copy the code

2. Optimize Order by and Group BY

MySQL supports two types of sorting filesort and index. Using index means that MySQL scans the index itself to complete sorting. Index is efficient, while filesort is inefficient.
If order BY satisfies both cases, Using index is used.
- The Order BY statement uses the left-most front row of the index.
- Use the WHERE clause with the Order by clause to satisfy the left-most front row of the index.
Try to sort the index columns by following the leftmost prefix rule for index creation (index creation order).
Using filesort is generated if the condition of order BY is not on the index column.
Use overwrite indexes whenever possible
Group BY is similar to Order BY in that it sorts before grouping, following the leftmost prefix rule in the index creation order. Select order by NULL to disallow sorting. Note that where is higher than having. If you have a limit in where, don’t have a limit in having.

EXPLAIN SELECT * FROM `org_user` where name = 'zhouxh' AND age = '11' ORDER BY position;
Copy the code

If the leftmost prefix rule is not followed:

EXPLAIN SELECT * FROM `org_user` where name = 'zhouxh' ORDER BY position;
Copy the code

Use filesort to sort files.

Single-way sort:
- Load all query data into an in-memory buffer for sorting.
Dual sorting
- Just load the sort field and primary key into an in-memory buffer for sorting, and then go back to the table to fetch the queried field based on the primary key.

How to select mysql

MySQL compares the max_length_for_sort_data(1024 bytes by default) to the total size of the fields to be queried. If the total size of the fields is smaller than the system variable, single-way sort is used. Otherwise, double sort.

3. Paging query optimization

We often have a scenario where a table field (not a primary key index) is sorted and then paged. The crudest way to write it is, of course:

SELECT * FROM `org_user` order by name limit 100 , 10;
Copy the code

In this way, when querying a small number of sets, it can hit the index.

But once the jump page number becomes large, the page becomes very slow.

SELECT * FROM `org_user` order by name limit 10000 , 10;
Copy the code

We’re throughEXPLAINYou can see that although the name field is indexed, it is not used for actual sorting.

why:

To scan the secondary index and find the row data that is not on the secondary index, you need to return the table. Mysql decided that the cost of such a lookup was higher than the cost of a full table scan, so mysql automatically optimized the full table scan internally.

How to optimize You can query only the ID when sorting by secondary index, and then manually query the desired row data by ID.

SELECT * from `org_user` a inner join (SELECT id FROM `org_user` order by name limit 10000 , 10) b on a.id = b.id;
Copy the code

After optimization: Sorting using filesort sort, and query using index, higher efficiency.

In the end, the hard and fast rules of Alibaba for index are attached:

[Mandatory] A unique index must be set up for a field that has unique characteristics, even for a combination field.
- Note: Do not think that unique indexes affect insert speed. The speed loss is negligible, but the increase in search speed is significant. In addition, even if the verification control is very perfect at the application layer, as long as there is no unique index, according to Murphy’s Law, dirty data must be generated.
[Mandatory] Join cannot be joined for more than three tables. The data types of fields to be joined must be absolutely the same. When multiple tables are queried by association, ensure that the associated fields have indexes.
- Note: Pay attention to table index and SQL performance even if the join of two tables.
[Mandatory] When creating an index on a VARCHar field, the index length must be specified. It is not necessary to create an index for all fields. The index length is determined according to the actual text differentiation.
- Note: The length and distinctness of an index are a pair of contradictory bodies. Generally, the distinctness of an index with a length of 20 is more than 90% for string data. You can use count(distinct left(column name, index length))/count(*) to determine the distinctness.
[Mandatory] page search is strictly prohibited left or all fuzzy, if you need to go to the search engine to solve.
- Note: The index file has the leftmost prefix matching feature of b-tree. If the left value is not determined, then this index cannot be used.