An Index is a data structure that helps MySQL obtain data efficiently. Database query is one of the most important functions of the database, we all hope that the speed of the query data can be as fast as possible, so the designer of the database system will optimize from the point of view of the query algorithm, this article to do a systematic combing of the index, I hope to help you.

MySQL > create index ()

Indexes can be classified from multiple perspectives. The following three dimensions are data structure, physical storage, and business logic.

1. From a data structure perspective

(1) B+ tree index (O(log(n))

More on B+ tree indexes later

(2) Hash index

  • This only works for “=”,”IN” and “<=>” queries, not range queries
  • The search efficiency of Hash indexes is higher than that of B-tree indexes. The search efficiency of Hash indexes is much higher than that of B-tree indexes. The search efficiency of Hash indexes is much higher than that of B-Tree indexes
  • Only Memory storage engines display support for hash indexing

(3) FULLTEXT index

Now both MyISAM and InnoDB engines support it

(4) R-tree index

Used to create SPATIAL indexes for GIS data types

2. From the perspective of physical storage

(1) Clustered Index

  • Body content is stored sorted according to a specific dimension, which is the clustered index;
  • The Innodb storage engine stores rows in the order of clustered index dimensions. Innodb tables are also called index tables. Because row records can be sorted by only one dimension, a table can have only one clustered index.

(2) Non-clustered Index

  • An index is described by the data structure of a binary tree. We can think of a clustered index as: the leaf nodes of the index are data nodes. A leaf node that is not a clustered index is still an index node but has a pointer to the corresponding data block.

  • Non-clustered index Index items are stored sequentially, but the corresponding contents of index items are stored randomly.

Here’s an example:

create table student (
`id` INT UNSIGNED AUTO_INCREMENT,
`name` VARCHAR(255),
PRIMARY KEY(`id`),
KEY(`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Copy the code

In this table, the primary key ID is the clustered index of the table, and the primary key name is the non-clustered index. Each row in the table is stored sorted by clustered index ID; Name =’Arla’; name=’Arle’; name=’Arle’; The name index table node is sorted by name and retrieves the primary key for each row of data. The clustered index table, sorted by primary key ID, retrieves the true contents of each row of data.

3. From a logical perspective

(1) Primary key index

A primary key index is a special unique index that does not allow empty values

(2) Plain index or single column index

(3) Multi-column index (compound index)

A compound index is an index created on multiple fields. The index is used only when the first field is used in the query condition. Follow the leftmost prefix set when using composite indexes

(4) Unique index or non-unique index

(5) Spatial index

Spatial indexes are built on fields of spatial data types. There are four spatial data types in MYSQL: GEOMETRY, POINT, LINESTRING, and POLYGON.

MYSQL is extended with the SPATIAL keyword to enable the creation of SPATIAL indexes using the same syntax used to create regular index types. Columns that create a spatial index must be declared NOT NULL. A spatial index can only be created in a table whose storage engine is MYISAM.

Index creation method

CREATE TABLE table_name[col_name data type]
[unique|fulltext|spatial][index|key][index_name](col_name[length])[asc|desc]

Copy the code
  • Unique | fulltext | spatial for optional parameters, respectively, says that the index and full-text index and spatial index;
  • Index and key are synonyms. They are used to specify index creation
  • Col_name is the column for which the index is to be created. This column must be selected from multiple columns defined in the data table.
  • Index_name Specifies the name of the index. This parameter is optional. If this parameter is not specified, col_name is the index value by default. Length is an optional parameter, indicating the length of the index. The index length can be specified only for a string field.
  • Asc or DESC specifies the ascending or descending index value store

1. Create an index when creating a table

(1) Create a common index

create table table_name(
    id int(11),
    name varchar(20),
    sex boolean,
    INDEX(id)
);
Copy the code

View table structure

show create table table_name; You can make the EXPLAIN statement to see if the index is being used

explain select * from table_name where id = 1\G
Copy the code

(2) create unique index

create table index2(
    id int unique,
    name varchar(20),
    unique INDEX index_2(id asc)
);
Copy the code

(3) Create a full-text index

Full-text indexes can only be used on fields of type CHAR, VARCHar, or text. Also, only the MyISAM storage engine supports full-text indexing.

create table idnex3(
    id int,
    info varchar(20),
    FULLTEXT INDEX index3_info(info)
)ENGINE=MyISAM;

Copy the code

(4) Create a single column index

create table index4(
    id int,
    subject varchar(255),
    index index4_st(subject(10))
);
Copy the code

Note here that the length of the subject is 255, but the index index4_st is only 10. The purpose of this is to improve the speed of the query. For character data, you can query only the preceding characters instead of all the information.

(5) create a multi-column index

create table index5(
    id int,
    name varchar(20),
    sex char(4),
    index index5_ns(name.sex)
);
Copy the code

Here we can see that the index_NS index has been created on the name and sex fields.

2. Create index on existing table

(1) Create a common index

Create an index named index7_id on the IDS in the example0() table.

create index index7_id on example0(id);
Copy the code

(2) create unique index

create UNIQUE index index_name on table_name(name);
Copy the code

(3) Create a full-text index

create FULLTEXT index index_name on table_name(info);
Copy the code

(4) Create a single column index

create INDEX index_name ON table_name(name(10));
Copy the code

(5) create a multi-column index

create INDEX index_name ON table_name(name,sex);
Copy the code

Alter TABLE create index

(1) Create a common index

Create an index named INDx_name on the name field

alter table table_name ADD INDEX index_name(name(20));
Copy the code

(2) Create unique index

alter table table_name ADD UNIQUE INDEX index_name(id);

Copy the code

(3) Create a full-text index

alter table table_name ADD FULLTEXT INDEX index_name(info);
Copy the code

(4) Create a single column index

alter table table_name ADD INDEX index_name(name(4));
Copy the code

(5) create a multi-column index

alter tabel table_name ADD INDEX index_name(name.sex);
Copy the code

4, Drop index

DROP INDEX index_name ON table_name;
Copy the code

How is the index tree maintained

At present, most database systems and file systems use B-tree or its variant B+Tree as the index structure, so how to maintain the index Tree?

1. Find the evolution history of structures

Lookup is a very important concept in data structures and algorithms.

  • 1. To search one by one; Simple implementation; Too slow
  • Binary search: ordered; Simple; The requirements are ordered and insertion is very slow
  • HASH search: fast query; Take up space; Not very suitable for storing large-scale data
  • Binary lookup trees: inserts and queries are fast (log(n)); Unable to store large-scale data, complexity degradation
  • Balanced tree: to solve the BST degradation problem, the tree is balanced; When there are many nodes, the tree is still very tall
  • Multi-path search tree: one father multiple child node (degree); If there are too many nodes, the tree height will not be too deep
  • Multi-path balanced search Tree: B-tree

2, the b-tree

A b-tree is a multipath search Tree (not binary) :

  1. Define any non-leaf node to have at most M sons; And M > 2;
  2. The number of sons of the root node is [2, M];
  3. The number of sons of non-leaf nodes except root nodes is [M/2, M].
  4. Each node stores at least M/2-1 (rounded) and at most M-1 keywords; (At least 2 keywords)
  5. Number of keywords of non-leaf nodes = number of Pointers to sons -1;
  6. Keywords of non-leaf nodes: K[1], K[2]… [M, K – 1); And K[I] < K[I +1];
  7. Pointers to non-leaf nodes: P[1], P[2]… P [M]; Where P[1] points to the subtree whose keyword is less than K[1], P[M] points to the subtree whose keyword is greater than K[m-1], and other P[I] points to the subtree whose keyword belongs to (K[i-1], K[I]).
  8. All leaf nodes are located in the same layer;
  9. Each k corresponds to a data. For example :(M=3) is equivalent to a 2-3 tree. A 2-3 tree is a tree where each node has either 2 children and 1 data element, or 3 children and 2 data elements, while a leaf node has no children and 1 or 2 data elements.

B-tree search, starting from the root node, the keyword (ordered) sequence within the node for binary search, if the match is finished, otherwise enter the search keyword range of the son node; Repeat until the corresponding son pointer is null or already a leaf node; The pseudo-code of the search algorithm on b-tree is as follows:

BTree_Search(node, key) { if(node == null) return null; foreach(node.key) { if(node.key[i] == key) return node.data[i]; if(node.key[i] > key) return BTree_Search(point[i]->node); } return BTree_Search(point[i+1]->node); } data = BTree_Search(root, my_key);
Copy the code

(1) Characteristics of B-tree

  1. The keyword set is distributed in the whole tree.
  2. Any keyword appears and only appears in one node;
  3. The search may end at a non-leaf node;
  4. Its search performance is equivalent to a binary search in the whole set of keywords.
  5. Automatic hierarchical control;

(2) Self-control of B-tree

Each internal node in the B tree will contain a certain number of keys. Typically, the number of key values is selected between D and 2D. In practice, key values take up most of the space in a node. Factor 2 guarantees that nodes can be split or combined. If an internal node has 2d keys, the process of adding a key to the node will split the 2d key with 2 D keys and add the key to the parent node. Each split node requires a minimum number of keys. Similarly, if an internal node and its neighbor both have D keys, a key will be deleted by merging it with the neighbor. Deleting this key results in the node having D-1 keys; The merge with the neighbor adds d keys, plus one from the parent of the neighbor node. The result is a fully populated 2D key value.

(3) The construction process of B-tree

So let’s go ahead and insert into the B tree

1 7 8 8 6 3 6 21 5 15 15 6 32 23 45 65 7 8 65 4

3, B + Tree

There are many varieties of B-tree, among which the most common is B+Tree, which is widely used by MySQL to implement its index structure. Compared with b-tree, B+Tree has the following differences:

  1. The number of subtree Pointers and keywords of non-leaf nodes is the same.
  2. The subtree pointer P[I] of non-leaf node points to the subtree whose keyword value belongs to [K[I], K[I +1]) (b-tree is the open interval);
  3. Add a chain pointer to all leaf nodes.
  4. All the keywords appear in the leaf node;
  5. Internal nodes do not store data, but only keys. For example :(M=3)

The search of B+ is basically the same as that of B-tree. The difference is that B+ tree hits only leaf nodes (B-tree can hit non-leaf nodes), and its performance is equivalent to binary search in the whole set of keywords.

(1) Characteristics of B+

  1. All the keywords appear in the linked list of leaf nodes (dense index), and the keywords in the linked list happen to be ordered;
  2. Impossible to hit on non-leaf nodes;
  3. The non-leaf node is equivalent to the index of the leaf node (sparse index), and the leaf node is equivalent to the data layer storing (keyword) data.
  4. More suitable for file indexing systems;

(2) B+ tree construction process

Now I’m going to insert into the B+ tree

1 7 8 8 6 3 6 21 5 15 15 6 32 23 45 65 7 8 65 4

3. Physical storage of indexes

Generally, indexes themselves are too large to be stored in memory, so indexes are often stored on disk as index files.

Therefore, the disk I/O consumption of index lookups is several orders of magnitude higher than memory access, so the most important indicator of a data structure as an index is the progressive complexity of disk I/O operations during lookups. In other words, indexes should be structured to minimize the amount of disk I/O needed to access a lookup.

Suppose each disk block can store exactly one node of the B-tree (exactly two file names). A BTNODE node represents one disk block, and a subtree pointer is the address of another disk block.

(1) Simulate the B+ tree search process

Let’s simulate the process of finding file 29:

  1. Locate the root disk block 1 of the file directory based on the root pointer and import the information into memory. [Disk I/O operation once]
  2. At this point, memory has two file names 17, 35, and three to store the data for the page addresses of other disks. According to the algorithm, we find: 17<29<35, so we find the pointer P2.
  3. Based on the P2 pointer, we locate disk block 3 and import the information into memory. [Disk I/O operation twice]
  4. At this point in memory there are two file names 26,30, and three store the data for other disk page addresses. According to the algorithm, we find: 26<29<30, so we find the pointer P2.
  5. Based on the P2 pointer, we locate disk block 8 and import the information into memory. [Disk I/O operations for 3 times]
  6. There are two file names in memory: 28,29. According to the algorithm, we find the file name 29 and locate the disk address of the file memory. Three disk I/O operations and three memory lookup operations are required. As for the file name search in memory, because it is an ordered table structure, we can use split search to improve efficiency. IO operations are the decisive factor affecting the efficiency of the whole B-tree search. Of course, if we use the balanced binary tree disk storage structure for lookup, disk 4 times, maximum 5 times, and more files, B tree than balanced binary tree disk I/O operations will be fewer and more efficient.

4. Advantages of B+tree

(1) THE disk read and write cost of B+-tree is lower

The internal node of B+-tree does not have a pointer to the specific information of the keyword. So the internal nodes are smaller than the b-tree. If all the keywords of the same internal node are stored in the same disk block, then the disk block can contain more keywords. Read into memory at a time to find more keywords. The number of IO reads and writes is relatively low. For example, suppose that a block on a disk contains 16bytes, a keyword is 2bytes, and a pointer to a keyword is 2bytes. An internal node of order 9 b-tree (a node with up to 8 keywords) requires 2 disks. The internal nodes of the B+ tree only need one disk. When internal nodes need to be read into memory, the B-tree has one more block lookup time (in disk, the rotation time) than the B+ tree.

(2) The query efficiency of B+-tree is more stable

Because non-endpoints are not the nodes that ultimately point to the contents of the file, they are only the indexes of the keywords in the leaf nodes. So any keyword lookup must take a path from root to leaf. The length of all keyword query paths is the same, resulting in the same query efficiency of each data.

What are the principles of index creation

Index query is an important record query method in the database, whether to enter the index and set up the index on those fields should be combined with the query requirements of the actual database system to consider, the following gives some general principles in practice:

  1. Index fields that are often used as filters;
  2. SQL > create index on GROUP BY, ORDER BY;
  3. It is unnecessary to build indexes on fields with fewer different values, such as gender fields;
  4. Avoid indexing for frequently accessed columns;
  5. Build indexes on the columns (primary/external) used for the join;
  6. Building compound indexes on frequently accessed columns, but it should be noted that the order of building compound indexes should be determined according to the frequency of use.
  7. Non-clustered indexes are created by default, but clustered indexes are best considered if they contain a finite number (not very many) of unique columns; Conduct a wide range of queries; Making full use of indexes can reduce the number of I/0 table scans and effectively avoid searching the entire table.
  8. Data columns commonly used in WHERE clauses;
  9. Order by, group by, distinct; If a composite index is created, the index fields must be in the same order as those following the keywords; otherwise, the index will not be used.
  10. For columns that are rarely involved in a query, do not index columns with a high number of duplicate values;
  11. Do not index columns of data types defined as text, image, and bit;
  12. Avoid indexing for frequently accessed columns;
  13. Limit the number of indexes on a table. For a table with a large number of update operations, the number of indexes should not exceed 3 and at most 5. Indexes improve access speed, but too many indexes can affect data update operations.
  14. For composite indexes, the index is built based on the frequency of field occurrences in the query condition. In a composite index, records are first sorted by the first field. For records with the same values in the first field, the system sorts the records by the values in the second field, and so on. Therefore, the composite index can be used only when the first field of the composite index appears in the query condition. Therefore, the system can maximize the use of the composite index and play the role of the index by placing the field with high application frequency in front of the composite index.

1. Combine multiple indexes

A single index scan can only be used for conditional clauses that use operators in the indexed field AND index operator class, AND those conditions are joined by AND.

If there is an index on (a, b), then conditions like WHERE a = 5 AND b = 6 can use the index, but conditions like WHERE a = 5 OR b = 6 cannot use the index directly.

A query like WHERE x =42 OR x = 47 OR x = 53 OR x = 99 can be decomposed into four independent scans on X, each scan using a condition, and finally the results of these scans OR are grouped together to produce the final result.

As another example, if we have separate indexes on x AND y, a query like WHERE x = 5 AND y = 6 can be decomposed into several clauses that use separate indexes, AND then the results can be combined to produce the final result.

Five, there are several cases of index failure

  1. If there is an OR in the condition, it will not be used even if the condition is indexed (this is why use or as little as possible).
  2. For multi-column indexes that are not used in the first part (the first one), the index is not used
  3. A like query starts with %
  4. If the column type is a string, be sure to quote the data in the condition, otherwise no index is used
  5. If mysql estimates that a full table scan is faster than an index, then the index is not used

1. Conditions for invalidation of joint indexes

A combined index is also called a composite index. An index on two or more columns is called a composite index.

For composite indexes: Mysql uses fields in the index from left to right. A query can use only a portion of the index, but only the leftmost portion. For example, the index is the key index (a,b,c). A, can support a | b | a, b, c three combinations are looking for, but does not support b, c to look up. An index is useful when the leftmost field is a constant reference.

So when creating a composite index, you should carefully consider the order of the columns. Composite indexes are useful when searching for all or only the first few columns in an index; A composite index is not useful when a search is performed only on any of the following columns.

How to check index usage

There are two ways to record it

1. Use Handler_read to check index usage

Show status like 'Handler_read %';
Copy the code

You may note:

  • Handler_read_key: The higher the value, the better, the higher the number of times the index was queried
  • Handler_read_rnd_next: The higher the value, the less efficient the query
+-----------------------+--------------+
| Variable_name         | Value        |
+-----------------------+--------------+
| Handler_read_first    | 153      |
| Handler_read_key      | 364   |
| Handler_read_next     | 425    |
| Handler_read_prev     | 598     |
| Handler_read_rnd      | 605     |
| Handler_read_rnd_next | 860571 |
+-----------------------+--------------+
6 rows in set(0.00 SEC) -- -- -- -- -- -- -- -- --Copy the code

By analyzing these values, we can view the current index usage:

  • Handler_read_first: the number of times the first item in the index is read. If it is high, it indicates that the server is performing a lot of full index scans; For example, SELECT col1 FROM foo, assuming col1 has an index (the lower the value, the better).
  • Handler_read_key: If the index is working, this value represents the number of times a row is read by the index value, while a lower value indicates a modest performance improvement because the index is not used very often (a higher value is better).
  • Handler_read_next: Number of requests to read the next line in key order. This value is increased if you query indexed columns with range constraints or if an index scan is performed.
  • Handler_read_prev: Number of requests to read the previous row in key order. ORDER BY… ORDER BY… DESC.
  • Handler_read_rnd: Number of requests to read a line based on a fixed position. This value is higher if you are executing a large number of queries and need to sort the results. You may be using a lot of queries that require MySQL to scan the entire table or your join is not using keys correctly. This high value implies inefficiency and should be remedied by indexing.
  • Handler_read_rnd_next: Number of requests to read the next line in the data file. This value is higher if you are doing a lot of table scans. This usually means that your table index is incorrect or that a written query does not utilize the index.

2. Check for useless indexes in sys library

Query the schemA_unused_INDEXES library.

root@localhost [sys]>select * from schema_unused_indexes; +-------------------+-------------+------------+ | object_schema | object_name | index_name | +-------------------+-------------+------------+ | sysbench_testdata | sbtest1 | k_1 | | sysbench_testdata | sbtest10 | k_10 | | sysbench_testdata | sbtest3 | k_3 | | sysbench_testdata | sbtest4 | k_4 | | sysbench_testdata | sbtest5 | k_5 |  | sysbench_testdata | sbtest6 | k_6 | | sysbench_testdata | sbtest7 | k_7 | | sysbench_testdata | sbtest8 | k_8 | | sysbench_testdata | sbtest9 | k_9 | +-------------------+-------------+------------+ 9 rowsin set (0.00 sec)
Copy the code

7, EXPLAIN command to check whether the index takes effect

Explain shows how mysql uses indexes to process SELECT statements and join tables. It helps to choose better indexes and write better queries.

1. A practical example

Create a new table,

CREATE TABLE IF NOT EXISTS `article` (`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`author_id` int(10) unsigned NOT NULL,
`category_id` int(10) unsigned NOT NULL,
`views` int(10) unsigned NOT NULL,
`comments` int(10) unsigned NOT NULL,
`title` varbinary(255) NOT NULL,
`content` text NOT NULL,
PRIMARY KEY (`id`)
);
Copy the code

Execute the query,

EXPLAIN
SELECT author_id
FROM `article`
WHERE category_id = 1 AND comments > 1
ORDER BY views DESC
LIMIT 1
Copy the code

The response data is as follows,

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: article
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3
        Extra: Using where; Using filesort
1 row in set (0.00 sec)

Copy the code

Type is ALL, worst-case scenario. Using filesort appears in Extra, which is the worst case scenario.

2, EXPLAIN the explanation:

  • Table: Shows which table this row is about
  • Type: This is the important column that shows what type the connection is using. The best to worst connection types are const, eq_reg, ref, range, Indexhe, and ALL
  • Possible_keys: Displays possible indexes that can be applied to this table. If empty, there is no possible index. You can select an appropriate statement from the WHERE statement for the relevant domain
  • Key: indicates the actual index. If NULL, no index is used. In rare cases, MYSQL will select an index that is underoptimized. In this case, USE INDEX (indexName) in the SELECT statement to force the USE of an INDEX or IGNORE INDEX (indexName) to force MYSQL to IGNORE the INDEX
  • Key_len: the length of the index used. With no loss of accuracy, the shorter the length, the better
  • Ref: Shows which column of the index is used, if possible, as a constant
  • Rows: The number of rows that MYSQL deems necessary to check to return the request data
  • Extra: Additional information about how MYSQL parses queries. We’ll discuss this in Table 4.3, but the bad examples you can see here are Using temporary and Using filesort, which means MYSQL can’t use indexes at all, resulting in slow retrieval

3. Type An explanation of the returned result

MySQL finds rows in tables. Including (from left to right, from worst to best) : | | All index | range | ref | eq_ref | const, system | null |

  • The system table has only one row: the SYSTEM table. This is a special case of const connection types
  • Const: The maximum value of a record in the table that can match the query (index can be primary key or unique index). Because there’s only one line, this value is actually a constant, because MYSQL first reads this value and then treats it as a constant
  • Eq_ref: In a join, MYSQL reads a record from the previous table for each record union during a query, which is used when the entire query uses the primary key or unique key of the index
  • Ref: This join type occurs only when the query uses keys that are not unique or primary keys or parts of those types (for example, using the leftmost prefix). For each row union of the previous table, the entire record is read from the table. This type depends heavily on how many records are matched against the index – the fewer the better
  • Range: This connection type uses an index to return rows in a range, such as what happens when you use > or < to look up something
  • Index: This join type does a full scan of each record union in the previous table (better than ALL because indexes are generally smaller than table data)
  • ALL: This connection type does a full scan for each previous record union, which is generally bad and should be avoided

4. Meaning of the description returned by the extra column

  • Distinct: Once MYSQL finds a joint match with a row, it no longer searches
  • Not exists: MYSQL optimizes LEFT JOIN and does Not search once it finds a row that matches the LEFT JOIN standard
  • Range checked for each Record (index map:#) : No desired index was found, so for each combination of rows from the previous table, MYSQL checks which index to use and uses it to return rows from the table. This is one of the slowest connections using an index
  • Using filesort: When you see this, the query needs to be optimized. MYSQL takes additional steps to discover how to sort the rows returned. It sorts all rows based on the connection type and the row pointer that stores all rows with sort key values and matching conditions
  • Using Index: Column data is returned from a table that only uses the information in the index without reading the actual action. This occurs when all columns requested for the table are part of the same index
  • When you see this, the query needs to be optimized. In this case, MYSQL needs to create a temporary table to store the results, which usually happens to ORDER BY on different sets of columns rather than GROUP BY
  • Using WHERE uses the WHERE clause to restrict which rows will match the next table or be returned to the user. This can happen if you do not want to return ALL rows in the table and the join type is ALL or index, or if the query has a problem explaining the different join types (in order of efficiency)

Reference documentation

The most official mysql Explain Type field reads mysql clustered index and non-clustered index

Follow the public account: Architecture evolution, get first hand technical information and original articles