The index model

Three data structures are commonly used for indexes: hash tables (key-value pairs), ordered arrays, and search trees.

Key-value pairs

The key-value pair model uses hash tables to store data. When a hash collision occurs, a linked list can be used to resolve the conflict. However, if the list is too long, the query efficiency will be affected.

Key-value lookups are very fast, but they support only equality queries, not range queries. The key-value model is common in NoSQL stores such as Redis.
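
A minimal sketch of this trade-off, using a Python dict as a stand-in for a hash index. The names hash_index, equality_query and range_query and the sample keys are made up for illustration. An equality lookup is one probe, while a range query has no key order to exploit and has to look at every key.

```python
# A dict stands in for a hash index: key -> row (sample data is made up).
hash_index = {
    "user:17": {"name": "Ann"},
    "user:03": {"name": "Bob"},
    "user:42": {"name": "Cid"},
}

def equality_query(key):
    # One hash probe, independent of how many keys are stored.
    return hash_index.get(key)

def range_query(low, high):
    # Hash order is unrelated to key order, so every key must be checked.
    return sorted(k for k in hash_index if low <= k <= high)
```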

Ordered array

The ordered array model stores the data in an array and keeps the array sorted by the value of one field.

An ordered array can be searched with binary search, so a lookup takes O(log N) time, and range queries are convenient because matching entries sit next to each other. The problem is insertion: to keep the array sorted, every element after the insertion position has to be moved, which is expensive. Ordered arrays are therefore best for indexing static data that is built once and never inserted into again.
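
The three operations can be sketched with Python's standard bisect module. This is only an illustration on a small in-memory list; the sample keys and the function names are assumptions, not part of any database API.

```python
import bisect

# Ordered array index: keys are kept sorted (sample data is made up).
keys = [3, 8, 15, 21, 34, 55]

def find(key):
    # Binary search: O(log N) comparisons. Returns the position or -1.
    i = bisect.bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else -1

def range_query(low, high):
    # Two binary searches locate the ends of one contiguous slice.
    lo = bisect.bisect_left(keys, low)
    hi = bisect.bisect_right(keys, high)
    return keys[lo:hi]

def insert(key):
    # Finding the position is cheap, but the list must shift every
    # element after it, which is the O(N) cost of keeping the order.
    bisect.insort(keys, key)
```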

Search tree

A binary search tree is also ordered: every key in a node's left subtree is smaller than the node's key, and every key in its right subtree is larger. A balanced binary search tree has O(log N) query complexity and O(log N) update complexity. To keep that O(log N) bound, the tree has to stay balanced.

However, when the amount of data is very large, the height of the binary search tree will be very high, which increases the search time. For example, for a balanced binary tree with a million nodes and a tree height of 20, assuming that a random read from disk takes 10ms, then a simple query takes 200ms. This speed is obviously unacceptable.

To reduce the height, N-ary search trees were introduced, in which a parent node can have N children. For an N-ary search tree with a million nodes and N = 100, the height drops to 3.
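
The arithmetic behind these numbers can be checked with a rough model in which each level of the tree narrows the search by a factor of N. The function name levels_needed and the constant SEEK_MS are my own; the 10 ms figure is the random-read cost assumed above.

```python
def levels_needed(entries, fanout):
    # Smallest h such that fanout**h >= entries. Integer math is used
    # to avoid floating-point rounding in logarithms.
    h, capacity = 0, 1
    while capacity < entries:
        capacity *= fanout
        h += 1
    return h

SEEK_MS = 10  # assumed cost of one random disk read, as in the text

for fanout in (2, 100):
    h = levels_needed(1_000_000, fanout)
    print(f"fanout={fanout}: height={h}, about {h * SEEK_MS} ms per lookup")
```

With a fan-out of 2 the model gives a height of 20 (about 200 ms), and with a fan-out of 100 a height of 3 (about 30 ms).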

InnoDB index model

Clustered index and non-clustered index

A clustered index is not a separate index type but a way of storing data. When a table has a clustered index, its rows are stored in the leaf nodes of that index. Conversely, if the rows are stored separately and the index leaf nodes hold only pointers to them, the layout is a non-clustered index. Because a row cannot be stored in two places at once, a table can have at most one clustered index.

Next we compare how clustered and non-clustered indexes store data. The comparison follows the figure in the book “High Performance MySQL”: on the left, the primary key index and secondary index of a clustered layout; on the right, the primary key index and secondary index of a non-clustered layout.

Features of clustered index and non-clustered index:

In a clustered layout, the primary key index stores the rows together with the index;
In a non-clustered layout, the primary key index stores the rows separately from the index and keeps each row's address in the index;
In a clustered layout, a secondary index stores the primary key value rather than the row address. Page splits in the clustered index move rows and change their addresses; when that happens only the primary key index has to be updated and the secondary indexes are untouched, which reduces index maintenance;
In a non-clustered layout, there is no substantial difference between the primary key index and a secondary index.

Clustered indexes have some advantages:

Related data is kept together. For example, if a user's transaction records are clustered by user ID, all of that user's records are stored next to each other, so a query reads only a few disk pages to fetch them all.
Data access is faster. Rows are stored in the clustered index itself, so once the index search finishes the row has been found, with no extra disk I/O.
A covering-index query can use the primary key value straight from the secondary index's leaf node, because the primary key is stored there.

Clustered indexes also have some disadvantages:

Insert speed depends heavily on insertion order. Inserting in primary key order is fast. Inserting out of primary key order can cause page splits, which slow down the insert and leave the original page divided into two pages that are each only partly full, so more disk space is used. This is why DBAs usually recommend an auto-increment ID as the primary key when using InnoDB;
Updating the clustered index columns is expensive, because each updated row has to be moved to a new location;
A lookup through a secondary index takes two steps: the secondary index yields the primary key value, and a second lookup goes back to the table for the row;
Secondary indexes store the primary key value, so they may take more space.
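
The effect of insertion order on page splits can be seen in a toy model. This is not InnoDB's actual algorithm: the page capacity of 4 rows, the rule that an insert past the end of the last page opens a new page, and the half-and-half split are all simplifying assumptions of mine.

```python
import bisect
import random

PAGE_CAPACITY = 4  # rows per page; deliberately tiny (assumption)

def load(keys):
    """Insert keys one at a time; return (pages, split_count)."""
    pages, splits = [], 0
    for key in keys:
        if not pages:
            pages.append([key])
            continue
        # Target page: the last page whose first key is <= key.
        firsts = [p[0] for p in pages]
        i = max(bisect.bisect_right(firsts, key) - 1, 0)
        page = pages[i]
        if len(page) < PAGE_CAPACITY:
            bisect.insort(page, key)
        elif i == len(pages) - 1 and key > page[-1]:
            # Full last page and the key goes past its end: open a new page.
            pages.append([key])
        else:
            # Full page elsewhere: insert, then split it into two halves.
            bisect.insort(page, key)
            mid = len(page) // 2
            pages[i:i + 1] = [page[:mid], page[mid:]]
            splits += 1
    return pages, splits

n = 1000
seq_pages, seq_splits = load(range(1, n + 1))

shuffled = list(range(1, n + 1))
random.Random(0).shuffle(shuffled)
rnd_pages, rnd_splits = load(shuffled)

print("sequential:", seq_splits, "splits,", len(seq_pages), "pages")
print("random:    ", rnd_splits, "splits,", len(rnd_pages), "pages")
```

In this model the sequential load never splits and fills every page, while the shuffled load splits repeatedly and ends up with more, partly empty pages for the same rows.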

InnoDB index model

InnoDB has been MySQL's default storage engine since MySQL 5.5. InnoDB uses a B+ tree index model, and its primary key index is a clustered index.

What is a B+ tree? What are the advantages of B+ trees over B trees? How do B+ trees insert and delete data?

For these questions, see the blog post “B tree and B+ tree insert, delete graphic details”, which explains them simply and clearly.

InnoDB uses a B+ tree as its index model. For a B+ tree of order m holding N entries, an equality query takes O(log_m N) time. All data in a B+ tree lives in the leaf nodes, and each leaf node has a pointer to the next leaf node, so range queries are also fast.

Inserting rows in primary key order keeps the clustered index ordered without causing page splits, whereas random insertion may cause them. For this reason an auto-increment ID is generally chosen as the primary key, for example: id BIGINT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT

Every InnoDB table has a clustered index. If the table defines no primary key, InnoDB uses the first unique index whose columns are all NOT NULL instead. If there is no such index either, InnoDB implicitly generates a hidden primary key to serve as the clustered index.

MySQL index features

Back to the table

As described above, a secondary index on a clustered table stores the primary key value. So when searching through a secondary index, we first get the primary key value from the secondary index and then fetch the row through the primary key index. This second step, fetching the row by primary key value, is called going back to the table.
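
The two steps can be sketched with plain dicts. The table contents and the names clustered, by_age and rows_with_age are invented for the example; real indexes are B+ trees, not dicts.

```python
# Clustered index: primary key -> full row (rows live in the index leaves).
clustered = {
    1: {"id": 1, "name": "Ann", "age": 31},
    2: {"id": 2, "name": "Bob", "age": 25},
    3: {"id": 3, "name": "Cid", "age": 25},
}

# Secondary index on age: age -> primary key values, not row addresses.
by_age = {25: [2, 3], 31: [1]}

def rows_with_age(age):
    pks = by_age.get(age, [])              # step 1: search the secondary index
    return [clustered[pk] for pk in pks]   # step 2: go back to the table
```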

Left-most prefix rule

If you maintain a separate index for every query, you end up with many indexes, which take up disk space and are costly to maintain: every inserted row adds an entry, including the primary key value, to each index. The more indexes there are, the higher the maintenance cost and the slower the inserts. To reduce the number of indexes, you can create a joint index, which is built on several columns together. Before adding a new index, first consider extending an existing index into a joint index, or adding a column to an existing joint index.

Suppose we create a table as follows:

CREATE TABLE `t` (
  `id` int(11) NOT NULL,
  `a` varchar(32) DEFAULT NULL,
  `b` varchar(32) DEFAULT NULL,
  `c` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `a_b` (`a`, `b`)
) ENGINE=InnoDB;

Besides the primary key index on id, the table has a joint secondary index a_b on columns a and b. A query with where a = 'xxx' and b = 'yyy' can use the joint index, and so can a query with where a = 'xxx' alone, but a query with only where b = 'yyy' cannot. That is the leftmost prefix rule.
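
To see why the rule holds, here is a small model of the joint index as a list sorted by (a, b). The sample entries and the helper names seek and scan_b are made up. Entries that share a leftmost prefix are contiguous and can be found by binary search; entries that share only b are scattered.

```python
import bisect

# Joint index a_b: entries sorted by (a, b), each carrying the primary key.
a_b = sorted([
    ("ann", "x", 1), ("ann", "y", 2), ("bob", "x", 3),
    ("bob", "y", 4), ("cid", "x", 5),
])

def seek(prefix):
    """Return ids whose index entry starts with `prefix`, via binary search."""
    n = len(prefix)
    # Truncating the sort key keeps the list sorted. Building this list is
    # O(N) and is done here only to keep the sketch short.
    heads = [e[:n] for e in a_b]
    lo = bisect.bisect_left(heads, prefix)
    hi = bisect.bisect_right(heads, prefix)
    return [e[2] for e in a_b[lo:hi]]

def scan_b(b):
    # b alone is not a leftmost prefix: matches are scattered, so scan all.
    return [e[2] for e in a_b if e[1] == b]
```

Here seek(("ann",)) and seek(("bob", "y")) each return one contiguous slice, while scan_b("x") has to walk the whole index.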

One more thing to note about joint indexes and the leftmost prefix rule: a fuzzy (LIKE) or range condition on one column stops the columns after it in the joint index from being used to locate entries.

For example, in select * from t where a like 'test%' and b = 'luck', the joint index a_b can locate entries only by column a; the condition on b does not narrow the index search.

Index pushdown

select * from t where a like 'test%' and b = 'luck';

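Because of the prefix match on a, this statement can use only column a of the joint index a_b to locate entries. Without index pushdown, MySQL goes back to the table for every index entry that matches a like 'test%' and checks b = 'luck' on the full row.

If four index entries match the prefix, the table is visited four times.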

Starting with MySQL 5.6, an optimization called index condition pushdown was introduced. While traversing the index, MySQL first evaluates the conditions on columns contained in the index and filters out entries that do not match, which reduces the number of trips back to the table.
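
The saving can be counted in a small simulation of the query above. The five index entries are invented, and the function only models the order in which conditions are checked, not how MySQL executes the query internally.

```python
# Joint index a_b entries (a, b, id) and the table rows by primary key.
a_b = [("test1", "luck", 1), ("test2", "miss", 2),
       ("test3", "luck", 3), ("test4", "miss", 4), ("zzz", "luck", 5)]
table = {pk: {"id": pk, "a": a, "b": b} for a, b, pk in a_b}

def query(pushdown):
    """WHERE a LIKE 'test%' AND b = 'luck'; returns (ids, table lookups)."""
    ids, lookups = [], 0
    for a, b, pk in a_b:
        if not a.startswith("test"):
            continue                  # outside the index range on a
        if pushdown and b != "luck":
            continue                  # b is checked inside the index
        lookups += 1                  # go back to the table
        row = table[pk]
        if row["b"] == "luck":
            ids.append(row["id"])
    return ids, lookups
```

Both variants return the same rows, but with four entries matching the prefix the version without pushdown visits the table four times and the version with pushdown only twice.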

Index optimization

How should a joint index be built

The joint indexes and leftmost prefix rule described earlier let you reduce the number of indexes. So what principles should guide the column order of a joint index?

Put highly selective columns first. Take a user table covering the global population, with fields such as gender, nationality, and age. Nationality is generally more selective than gender: far fewer people are Chinese than are male. So nationality goes before gender in the joint index.
Put columns with few enumerable values first. Suppose we created a joint index key_a on (nationality, age, gender). A query for Chinese male users cannot use the whole index, because it has no condition on age; only the nationality column helps. Now suppose instead we created a joint index key_b on (gender, nationality, age). Can a query for 18-year-olds in China still use it? Yes: add gender IN ('male', 'female') to the query, which satisfies the leftmost prefix rule.
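
The IN trick from the second point can be sketched on a sorted list that models key_b. The rows and the function names are made up; the point is only that enumerating the leading column turns one query with a missing prefix into one full-prefix seek per value.

```python
import bisect

# Joint index (gender, nationality, age); each entry also carries the id.
index = sorted([
    ("female", "CN", 18, 1), ("female", "CN", 30, 2), ("female", "US", 18, 3),
    ("male", "CN", 18, 4), ("male", "JP", 18, 5), ("male", "US", 40, 6),
])
keys = [e[:3] for e in index]

def seek(gender, nationality, age):
    # Full-prefix equality seek using binary search.
    target = (gender, nationality, age)
    lo = bisect.bisect_left(keys, target)
    hi = bisect.bisect_right(keys, target)
    return [e[3] for e in index[lo:hi]]

def eighteen_year_olds_in(nationality):
    # gender IN ('female', 'male'): one seek per enumerated value.
    return [pk for g in ("female", "male") for pk in seek(g, nationality, 18)]
```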

Careful readers will notice that the first and second points can conflict, so how do you decide? There is no single rule for choosing indexes; the principles often conflict and have to be weighed case by case. In the example above, putting gender last gives the index a more selective leading column, but many queries then cannot use the full index. Putting gender first gives a less selective leading column, but queries can still use the full index through the IN condition, so performance does not suffer. Gender is therefore a reasonable first column.

When a query needs a range condition or a sort on some column, put that column as far back in the index as possible, because the index columns after a range condition cannot be used to locate entries.

Should I use a unique index

For reads, a normal index finds the first matching record and keeps scanning until it meets a record that does not match, while a unique index stops as soon as it finds the match. But InnoDB reads data in pages (16 KB by default), and by the time the first record is read, the following records are very likely in the same page, which is already in memory. So the read performance of the two is almost the same.

For writes, a change to a normal index can be written to the change buffer first when the target page is not in memory, which speeds up the write. A unique index cannot use the change buffer: to guarantee uniqueness it must first read the page and check whether the same key already exists, and only then insert.

If the application can guarantee uniqueness itself, use a normal index. If it cannot, use a unique index.

Should I use a UUID primary key

With InnoDB, DBAs advise using an auto-increment ID as the primary key rather than a random UUID, because non-sequential primary keys cause frequent page splits and slow down inserts. So we normally add an auto-increment ID column and make it the primary key. The cost is that a query for user information by UUID then goes through a secondary index and has to go back to the table, which is slower.

Here is how I decide:

If the table needs only one unique index, on the UUID, the UUID can be the primary key;
If condition 1 does not hold, use an auto-increment ID as the primary key;
If you are not sure, use an auto-increment ID; at least it will not be wrong.

Covering index

A covering index is an index that already contains all the data a query needs, so the query does not have to go back to the table. This reduces I/O and speeds up the query.

Consider a user table:

CREATE TABLE `user` (
  `id` int(11) NOT NULL,
  `id_card` varchar(32) DEFAULT NULL,
  `name` varchar(32) DEFAULT NULL,
  `age` int(11) DEFAULT NULL,
  `ismale` tinyint(1) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `id_card` (`id_card`),
  KEY `name_age` (`name`, `age`)
) ENGINE=InnoDB;

Suppose id_card has its own index and the name is frequently looked up by ID number. Every such query has to go back to the table. If the index is changed to a joint index on (id_card, name), the index already holds everything the query needs, and no trip back to the table is required.
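
A minimal sketch of the difference, again with dicts standing in for indexes. The ID numbers, names, and function names are invented; each function returns the name together with the number of trips back to the table.

```python
table = {
    1: {"id": 1, "id_card": "110101", "name": "Ann", "age": 31},
    2: {"id": 2, "id_card": "310104", "name": "Bob", "age": 25},
}

# Index on id_card alone: id_card -> primary key.
idx_id_card = {"110101": 1, "310104": 2}
# Joint index on (id_card, name): the entry itself carries the name.
idx_id_card_name = {"110101": ("Ann", 1), "310104": ("Bob", 2)}

def name_via_plain_index(id_card):
    pk = idx_id_card[id_card]
    return table[pk]["name"], 1      # one trip back to the table

def name_via_covering_index(id_card):
    name, _pk = idx_id_card_name[id_card]
    return name, 0                   # answered from the index alone
```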

Control index length

An index that is too long takes a lot of disk space and becomes bloated, which slows down index lookups. Finding a chapter through a book's table of contents is fast because the table of contents is light; that advantage disappears if the index grows too long. The data in an index also duplicates data in the table, so the longer the index, the more disk space is wasted. In addition, MySQL places specific limits on index length.

There are several ways to control index length:

Use a prefix index on string columns, which can shorten the index considerably;
Do not build a joint index on too many columns.

Prefix index

An index that is too long becomes bloated. A prefix index indexes only the first few characters of a string column, which keeps the index small.

CREATE TABLE User(
  ID bigint unsigned primary key,
  email varchar(64),
  ...
) engine=innodb;

Consider a user table in which users register with their email address. Suppose the users are abcdii@163.com, abcdrr@163.com, abcdoo@163.com, and abcdss@163.com. Users are most often looked up by email address, so the email column needs an index. The simple option is a normal index, key(email); the alternative is a prefix index such as key(email(4)).

The normal index stores the whole address (14 characters for each address here), while the prefix index stores only the first 4 characters. Suppose we query for abcdoo@163.com. With the normal index, we locate the entry for abcdoo@163.com, fetch the row from the table, and continue to the next index entry. It does not satisfy the condition, so the search ends. Only one trip back to the table is made, so the system counts 1 row scanned.

With the prefix index email(4), all four entries share the prefix abcd, so we go back to the table four times, each time checking whether the row is the one we want. With key(email(5)) instead, only one lookup is needed. A prefix index with a well-chosen length therefore keeps good selectivity while still shortening the index.
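
The lookup counts for the four addresses can be reproduced with a few lines. The function name table_lookups is mine, and the model simply counts how many rows share the target's prefix, since each of them has to be fetched and compared.

```python
emails = ["abcdii@163.com", "abcdoo@163.com", "abcdrr@163.com", "abcdss@163.com"]

def table_lookups(target, prefix_len):
    # Every row whose stored prefix equals the target's prefix must be
    # fetched from the table and compared against the full value.
    return sum(1 for e in emails if e[:prefix_len] == target[:prefix_len])
```

For abcdoo@163.com this gives 4 lookups with a prefix length of 4 and 1 lookup with a prefix length of 5.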

So how do you choose the length of the prefix index?

Start by counting the number of distinct values in the column with the following statement:

SELECT count(distinct email) as C FROM User;

Then count the distinct values for several prefix lengths. A prefix length is reasonable when its distinct count is close to that of the full column, or when the count stops growing noticeably as the prefix gets longer.

SELECT count(distinct left(email, 4)) as C4 FROM User;
SELECT count(distinct left(email, 5)) as C5 FROM User;
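
The same selection procedure can be tried offline on a sample of values. This is a sketch under my own assumptions: the 0.95 threshold is an illustrative cut-off, not a MySQL setting, and the function names are made up.

```python
emails = ["abcdii@163.com", "abcdoo@163.com", "abcdrr@163.com", "abcdss@163.com"]

def distinct_prefixes(values, n):
    # Same idea as count(distinct left(email, n)) in the SQL above.
    return len({v[:n] for v in values})

def shortest_prefix(values, target_ratio=0.95):
    # target_ratio is an illustrative threshold chosen for this sketch.
    total = len(set(values))
    for n in range(1, max(map(len, values)) + 1):
        if distinct_prefixes(values, n) >= total * target_ratio:
            return n
    return max(map(len, values))
```

On the four sample addresses every prefix up to length 4 has a single distinct value, and length 5 is the first that separates all of them.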

Effect of prefix indexes on covering indexes: a prefix index does not hold the complete string, so it cannot act as a covering index for that column. Take this into account when creating a prefix index.

What if there is a high degree of duplication at the beginning of the string

Sometimes the leading part of the string is highly repetitive, as with ID numbers. There are two solutions.

The first is to store the string reversed and to reverse the input when querying:

SELECT * FROM t WHERE id_card = reverse('input_id_card');

The second is to compute a hash of the string and build the index on the hash. Different ID numbers may produce the same hash value, so the query must also compare the ID number itself. Here id_card_crc is a column that stores the CRC32 of the ID number; with the hash, each index entry takes only four bytes.

SELECT * FROM t WHERE id_card_crc = crc32('input_id_card') and id_card='input_id_card'

Note that neither method supports range queries on the original value.
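
Both methods can be sketched in a few lines. The ID numbers are made up, dicts stand in for the indexes, and Python's zlib.crc32 stands in for the crc32() call in the SQL above. Neither the reversed strings nor the hash values keep the ordering of the original values, which is why range queries stop working.

```python
import zlib

# Made-up ID numbers that share a long common prefix.
ids = ["110101199001011234", "110101199001015678", "110101198512124321"]

# Method 1: index the reversed string, so the varying tail comes first.
reversed_index = {s[::-1]: s for s in ids}

def find_reversed(id_card):
    return reversed_index.get(id_card[::-1])

# Method 2: index a 4-byte CRC32, then compare the full value to
# rule out hash collisions.
crc_index = {}
for s in ids:
    crc_index.setdefault(zlib.crc32(s.encode()), []).append(s)

def find_crc(id_card):
    candidates = crc_index.get(zlib.crc32(id_card.encode()), [])
    return [s for s in candidates if s == id_card]
```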
