Indexes are ordered data structures that help Mysql efficiently retrieve data

What is the index?

An index is a data structure that helps Mysql obtain data efficiently. In more general terms, a database index is like a directory at the front of a book, which speeds up the query speed of the database. Indexes themselves are generally too large to be stored in memory, so indexes are often stored in files on disk (either in separate index files or in data files with data) 3. We usually refer to the index, including clustered index, overwrite index, composite index, prefix index, unique index and so on without special description, the default is to use B+ tree structure (multi-path search tree, not necessarily binary) index advantages and disadvantages: 1. Can improve the efficiency of data retrieval, reduce the IO cost of database similar to the book catalog 2. Through the index columns to sort data, reduce the cost of data sorting, reducing the consumption of the CPU indexed column will automatically sort, including single index and composite index 】 【 】 just combination index ordering more complicated If in accordance with the order of the index columns to sort, corresponding order by statement, efficiency will improve a lot of disadvantage:  1. Indexes occupy disk space. 2. Indexes improve query efficiency, but reduce table update efficiency. For example, Mysql not only saves the data, but also saves or updates the corresponding index file every time it adds or deletes a tableCopy the code

1. Data structure and index implementation

- binary tree (to the right of the element is greater than go to the parent element, the element on the left is less than their parent) - red and black tree (is a self-balancing binary search trees), Hash table (Hash table and then the equivalent query, the efficiency is very high, the time complexity is O (1) but does not support quickly find the scope, range finding, when he was only done by scanning a full table) -b +Trees (multi-fork balanced Tree non-coconut nodes do not store Data, only store index (redundant), can store more indexes, leaf nodes contain all index fields Leaf nodes are connected with Pointers to improve interval access performance)Copy the code

Approximately 21,902,400 indexes can be stored by B+Trees of 3 height

  • 1. Clustered index – [clustered index]: index and data together (leaf node contains complete data)
  • 2. Non-clustered index – [sparse index]: index and data are not stored together (e.g. MyISAM MYZ/MYD index and data separation)
    • Clustered index queries are faster than non-clustered index queries

    • A B+Tree stores more data than a B-tree
    • Mind using primary key integer increment

1.1 MyISAM index

1.2 the InnoDB index

Create InnoDB tables

CREATE TABLE `abc_innodb`
(
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `a`  int(11)     DEFAULT NULL,
  `b`  int(11)     DEFAULT NULL,
  `c`  varchar(10) DEFAULT NULL,
  `d`  varchar(10) DEFAULT NULL.PRIMARY KEY (`id`) USING BTREE,
  KEY `idx_abc` (`a`, `b`, `c`)
) ENGINE = InnoDB;

INSERT INTO abc_innodb(a,b,c) VALUES (1.4.2);
INSERT INTO abc_innodb(a,b,c) VALUES (8.5.6);
INSERT INTO abc_innodb(a,b,c) VALUES (6.7.5);
INSERT INTO abc_innodb(a,b,c) VALUES (4.7.3);
INSERT INTO abc_innodb(a,b,c) VALUES (4.7.2);
INSERT INTO abc_innodb(a,b,c) VALUES (9.2.1);
INSERT INTO abc_innodb(a,b,c) VALUES (19.6.2);
INSERT INTO abc_innodb(a,b,c) VALUES (10.1.9);

Copy the code

1.3 Left-most matching principle

The principle of left-most prefix matching is related to index storage structure and retrieval method of joint index. In the composite index tree, the lowest leaf nodes are arranged in ascending order from left to right according to the first column A, but columns B and C are disordered. Column B can only increase order in a small range when columns A are equal, while column C can only increase order in a small range when columns A and B are equal. Like the query above, the B+ tree first compares column A to determine which direction to search next, left or right. If column A is the same, compare column B. But if the query condition does not have a column, the B+ tree does not know which node to look up first. Idx_abc (a,b,c); idx_ABC (a,b,c) When using a combined index query, mysql will keep matching to the right until it encounters a range query (>, <, between, or like)Copy the code

2. Index type

2.1 Primary Key Index

Each InnodDB table has a clustered index, which is built using B+Tree. The data stored by leaf nodes is a whole row of records. In general, a clustered index is equivalent to a primary key index. InnoDB automatically creates a RowID field for a clustered index when a table does not have a primary key index. The rules for automatic index creation are as follows:

1. Define the PRIMARY KEY on the table. InnoDB will reference the PRIMARY KEY index to cluster index 2. If no primary key is defined, InooDB selects the first unique index column that is not NULL as the cluster index 3. If neither of the above is present, InnoDB builds the clustered index using a random ROWID field of 6 byte long integer type. The ROWID field is automatically incremented when a new row is insertedCopy the code

All indexes other than clustered indexes are called secondary indexes. In middle InnoDB, the leaf node in the secondary index stores data that is the primary key of the row. InnoDB uses this primary key value to search for row records in the clustered index at retrieval time.

Leaf nodes of primary key indexes store rows, while secondary indexes store only primary key values.

2.2 Common Indexes

The basic index type in MySql, which has no restrictions, can insert duplicate and null values in the defined index column

2.3 Unique Index

Values in index columns must be unique but allow null values

2.4 Full-text Index

Full-text indexes can only be created on fields of the TEXT types CHAR,VARCHAR, or TEXT. When the minimum length is large, if you create a normal index, the efficiency is low in like fuzzy query. In this case, you can create a full-text index. MyISAM and InnoDB can use full-text index

2.5 Spatial Index

Mysql supports spatial indexing in versions after 5.7 and also supports the OpenGIS geometric data model. MySql follows the OpenGIS geometric data model rules for spatial indexing

2.6 Prefix Index

When creating an index on a TEXT type such as CHAR,VARCHAR, or TEXT, you can specify the length of the index column, but not the numeric type

2.7 Single-column index

Indexes created by a single field (a table can have up to 16 indexes with a maximum byte length of 256)

2.8 Composite Index (Combined Index)

An index consisting of multiple fields is called a composite index (the use of a composite index must follow the left-most prefix matching principle. In general, if conditions permit, a composite index is used instead of multiple single-column indexes).

Joint index, in the establishment of the index, as far as possible in multiple single-column index to judge whether the use of joint index. The use of federated indexes not only saves space, but also makes it easier to use index coverage.

Think about it. The more fields you index, the easier it is to meet the requirements of the query. For example, the joint index (a_b_c) is equivalent to having three indexes: A, A_b, and a_b_c, which saves space. Of course, the space saved is not three times as much as that saved by (A, a_b, and a_b_c), because the data in the index tree does not change, but the data in the index data field is really saved.

Joint index creation principle, you should create a combined index with frequent use of the columns, the columns of the high degree of differentiation in front, the frequent use of representative indexes utilization rate is high, high degree of differentiation on behalf of the filter size is large, these are all in the optimization of index creation needs to consider the scenario, can also often need as a query returns on the fields of increase to joint index, If an override index is used by adding a field to the federated index, then mind using the federated index

Use of federated indexes

  1. Consider whether there are already multiple single-column indexes that can be merged, and if so, create the current multiple single-column indexes as a single federated index
  2. If there are any columns that are frequently used as return fields, you can consider whether the current column can also be added to the existing index so that the query statement can use the overwrite index

select * from abc_innodb where a = 13 and b = 16 and c = 4;
Copy the code

CREATE TABLE `abc_innodb`
(
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `a`  int(11)     DEFAULT NULL,
  `b`  int(11)     DEFAULT NULL,
  `c`  varchar(10) DEFAULT NULL,
  `d`  varchar(10) DEFAULT NULL.PRIMARY KEY (`id`) USING BTREE,
  KEY `idx_abc` (`a`, `b`, `c`)
) ENGINE = InnoDB;
Copy the code

2.9 Overwriting indexes (not Index structures)

Overwriting an index is not an index structure. Overwriting an index is a common optimization method. Because when we use secondary indexes, we can only get the primary key value, which means we need to query the primary key index and get the data. If I only need the ABC field, does that mean that I can return the leaf node of the composite index directly without returning the table? This case is called an overwrite index.

3. Index data structure

3.1 the hash table

TreeMap is a Hash table structure that stores data as key-value pairs. We use Hash tables to store table data. Keys can store index columns, and values can store row records or row disk addresses. The reequivalence query of Hash tables is efficient and the time complexity is O(1). However, fast range lookups are not supported, and range lookups are still done by scanning the entire table (obviously not suitable for database indexes that require frequent lookups and range lookups).

3.2 Binary tree Search

Below is an example binary tree diagramBinary tree features: Each node has a maximum of two forks. The data sequence of the left and right subtrees is smaller on the left and larger on the right. This is a feature that reduces the NUMBER of I/OS in order to make sure that every search is half done, but binary trees are very difficult to get the value of the first root because it’s very easy to get something that we don’t want to happen with this feature and say, “The tree doesn’t bifurcate,” which is very uncomfortable

The following figure

3.3 Balancing binary trees

The balanced binary tree adopts the dichotomy thinking. In addition to the characteristics of the balanced binary tree, the most important feature of the search tree is that the hierarchy difference between the left and right subtrees is at most 1. When inserting and deleting data, the binary tree is balanced by left-rotation/right-rotation operations, so that the left subtree is not tall and the right subtree is short

The performance of balanced binary tree query is close to binary search, the time complexity is O(log2n) query ID =6, and only two IO operations are requiredProblems of balanced binary trees:

  1. Time complexity is related to tree height. The height of the tree is determined by how many times it needs to be retrieved. For each node read, there is a disk IO operation. The tertiary height is equal to the number of DISK I/O operations performed during each data query. Each time the disk has a seek sword of 10ms, the query performance will be poor when the table data is large. (For 1 million data, log2n is approximately equal to 20 DISK I/O times 20 x 10=0.2ms)
  2. The balanced binary tree does not support quick search for range query. Therefore, the range query requires multiple traversal from the root node, resulting in low query efficiency

3.4 B-tree (Modified binary tree)

Mysql data is stored in disk files. When querying and processing data, data on disk needs to be loaded into memory first. Disk I/O operations are time-consuming, so we focus on reducing disk I/O operations. I/O occurs for each node that accesses the binary tree and if you want to reduce disk I/O you have to lower the tree height. If the key is bigint=8byte and each node has two Pointers of 4 bytes each, the space occupied by each node is 16 bytes (8+4*2=16), because each IO in Mysqk's InnoDB storage engine will read the default (16KB) amount of data in one page, However, binary trees only have 16 bytes of effective data per IO, which is extremely low space utilization. In order to maximize the utilization of IO space at a time, a simple idea is to store multiple elements in each node and store as much data as possible in each node. Each node can store 1000 indexes (16K /16=1000), thus transforming a binary tree into a multi-fork tree, changing the tree from tall to chunky by increasing the number of branches of the tree. To construct one million pieces of data, the height of the tree only needs two layers (1000*1000= one million), which means that only two IO operations are needed to get the data disk. The NUMBER of IO operations is reduced, and the efficiency of data query is improved. Each inner node has multiple forks 2. Elements in a node contain key values and data. The key values in a node are arranged from the largest to the smallest, which means that data can be stored in all nodes. 4. All leaf nodes are in the same layer, leaf nodes have the same depth, and there is no pointer connection between leaf nodesCopy the code

For example, when querying data in a B tree:

Suppose we query for data with a value equal to 10. Query the path Disk block 1 > Disk Block 2 > Disk Block 5.

First disk I/O: Load disk block 1 into memory. Compare disk block 1 in memory. 10<15, go left, and address disk block 2.

Second disk IO: Load disk block 2 into memory, compare the disk from the beginning in memory, 7<10, address disk to locate disk block 5.

Third disk I/O: Load disk block 5 into the memory, compare 10 in the memory, find 10, fetch data, if data stores row records, fetch data, the query ends. If the disk address is stored, you need to retrieve data from the disk based on the disk address, and the query ends.

Compared with binary balanced search tree, although the number of data comparison is not significantly reduced in the whole search process, disk I/O times will be greatly reduced. Also, since our comparison is in memory, the time of comparison is negligible. The height of b-tree is usually two or three layers, which can meet most application scenarios. Therefore, b-tree index construction can improve the efficiency of query.

The process is shown as follows:

Now, the problem is that you can modify the B tree

1. B tree also does not support fast search of range query, which means that when you perform range search, you will still traverse from the root node one by one, which can be expected to be very low query efficiency. If data stores row records, the space occupied by rows will increase as the number of columns increases. In this case, the amount of data that can be stored in a page will decrease, the tree will increase accordingly, and the disk I/O times will increaseCopy the code

3.5 B+Tree (B+Tree, modified B Tree)

B+Tree is an upgraded version of B Tree. On the basis of B Tree,Mysql continues to transform it and uses B+Tree to build indexes. The main difference between B+TreeB trees is whether non-leaf nodes store data

2. B+Tree: Only leaf nodes store data, and non-leaf nodes only store key values. Leaf nodes are connected using bidirectional Pointers, and the lowest leaf nodes form a bidirectional ordered listCopy the code

The lowest leaf node of a B+Tree contains all the indexes. As can be seen from the figure, B + Tree at the time of search data because the data is stored in the bottom of the leaf nodes, so every time find need to be retrieved leaf nodes to query data So we need to look at the data under the condition of each disk IO and is directly related to the height of the tree But on the other hand, since the data is in the child nodes, Put indexing disk blocks for storage of indexes is follow increase relative to B Tree, B + Tree, a Tree is high in theory than B Tree height of the Tree Index coverage is true, in the index data to meet the current query all the data required for now you just need to find the index return immediately, do not need to search to the bottom of the leaf node

3.5.1 Eg: Equivalence query

Suppose we query for data equal to 9. Query the path Disk block 1 > Disk Block 2 > Disk Block 6

First disk I/O: Load disk block 1 into memory and compare it from the beginning in memory. 9<15 go left to address disk block 2

Second disk I/O: Load disk block 2 into memory, compare the disk from the beginning in memory, 7<9<12, address disk to locate disk block 6

Third disk I/O: Load disk block 6 into memory, compare from the beginning in memory, find 9 in the third index, fetch data, if data stores the row record, fetch data, end of query. If the disk address is stored, Data needs to be retrieved from the disk according to the disk address, and the query terminates (what needs to be distinguished here is that Data in InnoDB stores row Data, while MyIsam stores disk address) as shown below

3.5.2 Range Query

Let’s say we want to find something between 9 and 26. The search path is Disk Block 1 > Disk Block 2 > Disk Block 6 > Disk Block 7.

We first look for data with a value equal to 9 and cache the data with a value equal to 9 into the result set. In this step, three disk I/O operations are performed as in the previous equivalent query process

After finding 15, the underlying leaf node is an ordered list, and we start from disk block 6 and key value 9 to filter all the data that meets the criteria

Fourth disk I/O: Locate disk block 7 based on the subsequent pointer to disk block 6, load disk 7 into the memory, and compare the data in the memory. 9<25<26, 9<26<=26 Cache the data in the result set

Since the primary key is unique (no <=26 data will follow), there is no need to continue the search to terminate the query and return the result