Mysql index principle and optimization

Know the index

What is the index

Indexes are data structures that help Mysql efficiently retrieve data, similar to directories at the front of a book, and speed up database lookups.

  • Advantages:
    • It can improve the efficiency of data retrieval and reduce database IO cost
    • Reduce the cost of sorting data
  • Disadvantages:
    • Occupying Disk space
    • Reduce the efficiency of updating table data. To add, delete, and query a table, not only the data information but also the corresponding index information needs to be updated

Common index types of Mysql

  • Normal INDEX INDEX
  • The only index
    • PRIMARY KEY Index PRIMARY KEY
    • UNIQUE index UNIQUE
  • Joint index
    • Select * from PRIMARY KEY;
    • The only federated index UNIQUE
    • Common INDEX INDEX

Indexed data structure

The Hash table

An arbitrary value (key) is transformed by a hash function into a fixed-length key address, through which the data structure of the specific data is carried out.

The time complexity of hash search is O(1), and the search speed is very fast.

select * from test where id = 10; 
Copy the code

Although the index of hash algorithm can achieve fast data retrieval, but the method of data efficient range search, so the hash index is not suitable for the underlying index of Mysql data structure.

select * from test where id > 10; 
Copy the code

Binary search tree

define
  1. If the left subtree of any node is not empty, the value of all nodes in the left subtree is not greater than the value of its root.

  2. If the right subtree of any node is not empty, the value of all nodes in the right subtree is not less than the value of its root node.

  3. The left and right subtrees of any node are binary search trees respectively.

The time complexity of binary search tree is O(log N). For example, for the above binary tree structure, we need to calculate and compare three times to retrieve the data with ID =7. Compared with the direct traversal query, it saves half of the time

advantages
  • Efficient retrieval
  • Scope of the search
disadvantages
  • In extreme cases, the tree degenerates into a linked list and the lookup rate drops dramatically

Balanced binary search tree

The nature of the
  1. The absolute value of the difference between the height of the left and right subtrees is no more than 1

  2. Every left and right subtree of a tree is an AVL tree

The time complexity of balanced binary tree search is O(log n). In addition to having the characteristics of binary tree, the most important feature of balanced binary tree search is that the hierarchy difference between the left and right subtrees is at most 1. When inserting and deleting data, the binary tree is balanced by left-rotation/right-rotation operations, so that the left subtree is not high and the right subtree is short.

And given that, you might think that this is a good thing, that this is the ideal case for a binary tree. However, some problems remain:

Each node read corresponds to a disk I/O operation. The height of the tree is equal to the number of DISK I/O operations performed during each data query. When a large amount of data is generated, the number of I/O operations increases, seriously affecting the query rate.

b-tree

Multipath balanced search tree

The main features
  1. Balanced search tree
  2. A node stores multiple elements
  3. Elements in a node contain keys and data, and all nodes store data
  4. The search may end at a non-leaf node;
  5. All leaf nodes are in the same layer and there are no pointer connections between leaf nodes

The biggest difference is that a node can store multiple data, the height of the tree is much smaller than the balanced binary tree, the reduction of nodes, thus reducing the NUMBER of IO search, greatly improve the search rate.

An example of a query is the difference between looking up data in a balanced binary tree and looking up data in a B tree

Finding 15 in the balanced binary tree takes 4 IO

Finding 15 in the B tree only takes two I/OS,

First IO: Reads the data of layer 1 nodes, compares the data in memory, and finds the address of the next node

Second IO: Reads the data of layer 2 nodes and finds the target data

Compared to balanced binary trees, the number of comparisons is not significantly reduced, but the number of I/OS is significantly reduced

I think the B tree is pretty good up here, but there’s still some optimization

  1. The fast search of range query is not supported. If you want to search data between 10 and 35, you need to go back to the upper node and continue to search right after 10. The query efficiency needs to be improved

  2. Non-leaf nodes also store data. With the increase of row data, the number of keys that can be obtained by one IO will decrease and the IO times will be relatively more

B + tree

As an upgraded version of the B tree, there are two main features more than the B tree

  1. Only leaf nodes store data, non-leaf nodes do not store data
  2. A linked list is formed between leaf nodes using bidirectional Pointers

Equivalent query: search can only be completed in the leaf node

Range query: for example, data greater than 5 can be efficiently searched through linked lists after 5 data is found in leaf nodes

Mysql index implementation

Clustered Index

Each InnoDB table has a clustered index, which is built using B+ trees, and leaves store data as whole rows of records. In general, a clustered index is equivalent to a primary key index. When a table does not have a primary key index, InnoDB automatically creates a ROWID field to build the clustered index. InnoDB index creation rules are as follows:

  1. The PRIMARY KEY is defined on the table. InnoDB uses the PRIMARY KEY index as the clustered index.
  2. If the table does not have a primary key defined, InnoDB selects the first non-null unique index column as the clustered index.
  3. If neither of these is present, InnoDB builds the clustered index using an implicit ROWID field with a 6-byte long integer. The ROWID field is automatically incremented when a new row is inserted.

select * from user where id = 18;
Copy the code

Find row data of leaf node id=18 after three IO attempts

B+ trees store the amount of data

In the computer, the smallest unit of disk storage data is sector, and the size of a sector is 512 bytes, while the smallest unit of file system (such as XFS/EXT4) is block, and the size of a block is 4K. For our InnoDB storage engine, it also has its own minimum storage unit — Page, and the size of a Page is 16K. The size of all innoDB data files (files with suffix IBD) is always an integer multiple of 16384 (16K).

The size of a node in Mysql is one page, assuming that the primary key ID is bigint and the length is 8 bytes, and the pointer size is set to 6 bytes in the source code of InnoDB, which makes a total of 14 bytes. How many such units can we store in a page, actually represents how many Pointers there are. 16KB (16 x 1024=16384 Bytes) 16384/14=1170 (number of indexes).

Assuming that a piece of data is 1K and a leaf node can store 16 rows,

  1. A B+ tree of height 2 can hold 1170 (index number) x 16 (leaf number of rows) = 18720 such records.
  2. A B+ tree of height 3 can hold: **1170 (number of first-level indexes) *1170 (number of second-level indexes) *16 (number of leaf rows) =21902400 (20 million) ** such records

So in InnoDB, the B+ tree height is usually 1-3 layers, which can accommodate tens of millions of data storage. A page lookup represents one IO when looking for data, so a primary key index query typically takes only 1-3 IO operations to find the data

Non-clustered Secondary Index

An index added in addition to a clustered index is called a non-clustered index. The main difference is that a leaf node stores key values instead of row data.

select * from user where age=19;
Copy the code
  1. Three IO runs in a non-clustered index

  2. After finding the primary key in the non-clustered index and then finding the row data in the clustered index, it also takes three IO, so it takes a total of six IO, calling back the table

Joint index

select * from t_test where a=13 and b=16 and c =4;
Copy the code

The process of finding acDS

  1. In the first node comparison, the first two characters, ab < ac < AD, go to the middle node

  2. Ab < AC, go to the node on the right, compare in the leaf node and finally find ACD

Mysql > select * from database;

Explain keywords

Explain explains how mysql processes SQL statements, how tables are loaded, how tables are joined, and how indexes are used. It is an important tool for SQL optimization.

explain select * from user where name ='name';
Copy the code

1. id

The id column is numbered as the sequence number of the select, and the number of ids increases in the order in which the select appears. MySQL classifies SELECT queries into simple and complex queries. Complex queries fall into three categories: simple subqueries, derived tables (subqueries in from statements), and Union queries.

Order of execution:

  1. The ids are the same and the execution sequence is from top to bottom
  2. Different ids are executed in descending order
2. select_type

Select_type indicates whether the corresponding row is a simple or complex query, and if it is a complex query, which of the above three complex queries

1) simple: simple query. The query does not contain subqueries and unions

2) primary: the outermost select in a complex query

3) subquery: subquery included in select (not in the FROM clause)

4) Derived: Subqueries contained in the from clause. MySQL stores the results in a temporary table, also known as a derived table

5) Union: the second and subsequent select in union

6) Union result: select the result from union temporary table

3. table

The query table

4. type

This field is our optimization to focus on the field, this field directly reflects the performance of our SQL is efficient.

This field has a lot of values, so I’ll just focus on a few words that are often used in our development

Performance is in the order of system>const>eq_ref>ref>range>index>all

  1. System: The table has only one row. This is a const exception and can be ignored

  2. Const: indicates that the index was found once, and const is used to compare primary key or unique indexes. Because it only matches one row, it’s fast.

  3. Eq_ref: Unique index lookup in which only one record in the table matches it. Generally, two tables are associated, and the fields in the association condition are primary keys or unique indexes.

  4. Ref: A non-unique row index lookup that returns all rows matching a single value

  5. Range: retrieves rows in a given range. In general condition queries, such as >, <, IN, and between appear

  6. Index: scans the index tree

  7. All: indicates a full table scan, which is usually optimized by adding indexes

5. possible_keys

This column shows which indexes the query might use to look up.

6. key

This column represents which index Mysql actually uses to query

7. key_length

This column shows the number of bytes used by mysql in the index. This value can be used to figure out which columns are used in the index.

This column shows the columns or constants used by the table to find values in the index of the key column. Common values are const (constant), func, NULL, and field names

The rows column is the number of rows that mysql estimates to be read and examined. Note that this is not the number of rows in the result set.

10. The Extra column displays additional information.

  1. Using Index :

    This occurs when the columns requested for the table are all part of a field in the same index, and the returned column data only uses the information in the index without accessing the row records in the table, which is a high performance performance.

  1. Using temporary

    Mysql needs to create a temporary table to process the query.

    The remark field does not have an index, and a temporary table is to be de-duplicated.

  2. Using where

    Mysql will filter the results retrieved by the storage engine.

    The result extracted according to the index name should also be filtered according to the remark field. The filtering condition field has no index

  3. Using filesort

    Indicate that Mysql will use an external sort for the results instead of sorting by index. Explain does not specify whether to sort in memory or on disk, either way is possible

Index column types should be as small as possible

Indexes also occupy a portion of storage space. For example, ids are generally used as BIGINT instead of THE UUID primary key of VARCHAR. Integers consume less CPU space than strings

Columns with small cardinality need not be indexed

Cardinality: The number of unique keys after a single column has been deduplicated is called the cardinality.

SELECT COUNT(DISTINCT name),COUNT(DISTINCT gender) FROM user;
Copy the code

For example, there are 100 data items for user, 100 different values for name, and 2 different values for gender. The gender column has a small cardinality and can not be indexed

Add index search efficiency does not improve, mysql may not run index

Leftmost matching principle

  1. Rule of left-most prefix matching for composite indexes: When using composite indexes, mysql will keep matching to the right until it encounters a range query (>, <, between, like)

    In a joint index, if we add a joint index to (a, b, c), we add (a), (a, b), (a, b, c)

select * from t_test where a=13 and b=16 and c=4;  - walk index

select * from t_test where a=13 and b=16;  - walk index

select * from t_test where a=13;   - walk index


select * from t_test where a=13 and b=16 and c>4;  - walk index

select * from t_test where a=13 and b>16 and c=4;  -- A b go index

select * from t_test where a>13 and b=16 and c=4;  - a index


select * from t_test where b=16 and c=4;  -- Don't go to the index

select * from t_test where c=4;  -- Don't go to the index

Copy the code
  1. The same is true for strings like. To use the index, add % to the right. You can search the index from left to end
SELECT * FROM t_test WHERE d LIKE 'aa%'  - a index

SELECT * FROM t_test WHERE d LIKE '%aa%'  -- Don't go to the index
Copy the code

Cover index

When an index contains (or overwrites) the values of all the fields that need to be queried, it is called an overwrite index.

Imagine a user table where two columns are indexed and name and age have the following query statement:

select name,age from user where name = 'xxx'  -- No need to return to the table
select * from user where name = 'xxx' - to be back to the table
Copy the code

When a non-clustered index already contains all columns that need to be returned, the search speed can be increased by eliminating the need to find rows returned on the primary key index

Index Failure Scenario

  1. Indexed columns use functions or calculations

    select * from user where age+1=10;
    Copy the code
  2. Left blur is used

    select * from user where name like '%xx';
    Copy the code
  3. Some fields queried with OR do not have indexes

    select * from user where name = 'xx' or remark = 'aa';
    Copy the code
  4. A query that does not comply with the leftmost prefix rule

    select * from user where age = 1;
    Copy the code
  5. Index range queries may result in lost indexes

    select * from user where age > 0;
    select * from user where name in(...)Copy the code