Mysql, as a relational database, should be the most widely used in China. Maybe you use Oracle, Pg, etc., but most Internet companies, such as our company, use Mysql the most, which is self-evident.

After the last article about MySQL infrastructure went out, someone said, can we talk about indexes? In our daily work, when we encounter slow SQL execution, we often receive advice like “add an index”. What exactly is an index? Why does it improve performance? Let’s talk about it

01 What is the index?

An index is a data structure designed to improve the efficiency of data query, like the table of contents of a book. If you think of a book with hundreds of pages, you’d have to go through a lot without a table of contents. Take a popular example, I brush in Zhihu, the metaphor is wonderful.

In the Xinhua dictionary we have used since childhood, the initial query is clustered index. A partial is a secondary index and a partial plus a stroke is a joint index.

Indexes themselves take up disk space (think of the pages of a book’s table of contents) and exist mainly on disk as files.

1.1 Advantages and disadvantages of indexes

advantages

  • Improves query execution efficiency and reduces I/O operations
  • Creating unique indexes ensures the uniqueness of each row in a database table
  • Indexed columns are sorted, which can significantly reduce the time spent grouping and sorting queries when using the grouping and sorting clauses

disadvantages

  • Indexes take up physical space
  • Creating and maintaining indexes takes time, which increases with the volume of data
  • When the data in the table is added, deleted, changed, or checked, the index must be maintained dynamically, which reduces the data update efficiency

1.2 Classification of indexes

The primary key index

A special unique index that does not allow null values. (Primary key constraint = unique index + non-null value)

The only index

Values in index columns must be unique, but null values are allowed.

Normal index

Add index type in MySQL, there are no restrictions. Null and duplicate values are allowed, purely for query efficiency.

Single index

There is only one column in the index, and each table can have multiple single-column indexes.

Composite index

An index composed of multiple column values is designed for combined search, which is more efficient than index merging. Note that you need to follow the leftmost matching rule when using it. Composite indexes are common in work when multiple columns are used as query criteria.

The full text indexing

Full-text indexes can only be built on columns of TEXT content, that is, TEXT, CHAR, VARCHAR data types. Didn’t someone say that creating a single-column index was the end of it? Consider a case where a like query is slow when the column is long, which is a good place to build a full-text index.

The prefix index

CHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR

//Create a length of x_NAME on x_test4Prefix index ofalter table x_test add index(x_name(4));
Copy the code

This length is determined according to the actual situation. Long too take up space, short great effect. For example: I have a table where the first character of the x_name is almost the same (assuming they are all 1), and if the length of the index is 1, the following query may be worse than it was before. Because there are too many columns in the database with the first character = 1, try to select columns that start with different lengths.

SELECT * FROM x_test WHERE x_name = '1892008.205824857823401.800099203178258.8904820949682635656.62526521254';
Copy the code

Spatial index

MySQL supports spatial indexing in versions after 5.7 and also supports the OpenGIS geometric data model. MySQL follows the OpenGIS geometric data model rules for spatial indexing.

02 Index memory model

There are many ways to achieve the index, here is the most common three: hash table, ordered array, binary tree, which binary tree is divided into binary search tree, balanced binary tree, B tree and B+ tree, thus explains why InnDB chose B+ tree? User has two columns, one is the ID number, and the other is the name.

CREATE TABLE IF NOT EXISTS `user`(
   `id_card` INT(6) NOT NULL,
   `name` VARCHAR(100) NOT NULL.PRIMARY KEY ( `id_card` )
)ENGINE=InnoDB DEFAULT CHARSET=utf8;
Copy the code

2.1 a hash table

A HashMap is a structure that stores data in key-value pairs. In MySQL, key is used to store index columns, and value is the data of a row or its disk address.

If you’ve used hashmaps, you probably know that when multiple keys are hashed to the same value, the structure of the value is a linked list. Suppose you are now asked to find the name by id number, and its hash index structure looks like this:

The hash key of user2 and user4 is M, and the value of value is a linked list. If you want to look up people whose ID_card = 66688, the procedure is: hash 66688 to M, then iterate through the list in order to find user2.

You may have noticed that the values of the four ID_cards in the figure above are not increasing, so adding new users is fast, so append later. But because it is not ordered, the speed of interval query will be slow.

Therefore, the hash table structure is suitable for only equivalent query scenarios, not suitable for range query.

2.2 Ordered Array

In order to solve the problem of slow interval query, ordered array came into being. Its equivalence and range queries are fast. Mysql > select user by id; mysql > select user by id;

If you want to find users whose ID card is 66666, use dichotomy, order log(N).

The array also supports range query, again using binary search, if you want to find users in the range [12345,66666], you only need binary search for users whose ID_card is greater than or equal to 12345 and less than 66666.

In terms of query efficiency, ordered arrays are perfect, but if we need to add new data, it’s very difficult. If you want to add users with ID_card = 12346, you have to move all the data back one place, which is too expensive.

So ordered arrays are only good for storing data that doesn’t change much, like a few years in the past.

2.3 Binary search tree

Binary search tree, also known as binary search tree, or binary sorting tree. Its definition is also relatively simple, either an empty tree, or a binary tree with the following properties: each node has only two forks, all the nodes of the left subtree are smaller than the right subtree, the left and right subtrees of each node are also a small binary tree, and there are no nodes with equal keys.

I’m a little confused about the overview, so here’s a picture. A typical binary search tree looks like this:

The binary order structure is designed because you can use binary search, which is order log(N) for both insertion and search, but O(N) for the worst case, because the tree is not balanced between insertion and deletion. For example, a binomial tree that bends:

So in this case, the query time complexity of the tree is high, and also unstable.

2.4 Balanced binary trees

A balanced binary tree, also known as an AVL tree, differs from a binary search tree in that the height difference between any left and right subtrees is no more than 1**. I made a comparison, as shown below:

I’m happy with that, because of the properties of balanced binary trees. Its query time is O(log(N)), and of course its update time is also O(log(N)) for maintenance balance. Seemingly perfect? But there’s a problem.

If you’ve studied data structures, you know that time complexity is related to tree height. If you think about it, let’s say I have a balanced binary tree of 1 million nodes, 20 tall. A query requires access to 20 data blocks. According to the principle of computer composition, it takes an average of 10ms to read a data fast from disk. PS: Indexes are not only stored in memory, but also written to disk, so the core of optimization is to reduce disk I/O times.

That is, for a table with a million rows, if stored using a balanced binary tree, it might take 20 10ms, or 0.2s, to access a single row, which is uncomfortable.

In addition, balanced binary trees do not support fast range queries, which require multiple traversals from the root node, which is really inefficient.

Therefore, most database stores do not use balanced binary trees.

2.5 B tree

As we know from the above analysis, the query is slow because of the tree height, which requires multiple accesses to the disk. To keep a query from touching the disk as much as possible. We can lower the height of the tree, since we have a binary. So if we add more crosses, won’t the height of the tree go down? So, that’s where the B tree comes in. Hahaha).

The InnoDB storage engine of MySQL will read one page of data (default page 16K) in one IO, while the effective data amount of binary tree in one IO is only 16 bytes, and the space utilization is very low. To maximize IO space at a time, a simple idea is to store multiple elements per node, storing as much data as possible on each node. Each node can store 1000 indexes (16K /16=1000), thus transforming a binary tree into a multi-fork tree, changing the tree from tall and thin to short and fat by increasing the number of branches of the tree. To build 1 million pieces of data, the height of the tree only needs 2 layers (1000*1000=1 million), which means it only takes 2 disk I/OS to query the data. Disk I/O times are reduced, and data query efficiency is improved.

A B-tree, also known as a B-tree, is a B-tree of order M (m indicates how many branches the tree has at most). Features are:

  • Each non-leaf node and non-root node has at least m/2 (rounded up), that is, the number of children of the internal node is at least M /2.
  • The root node has at least two child nodes, and each inner node (non-leaf nodes are inner nodes) has at most m forks.
  • All nodes of a B-tree store data. A node contains multiple elements, such as keys and data. The keys in the node are sorted from smallest to largest.
  • The leaf nodes are all at the same level, highly consistent and there are no Pointers connected between them.

The b-tree structure of order 3 is shown in the figure below:

  • Equivalent query

So in this structure we’re looking for 48, and again we’re using binary search. Its query path looks like this: Database 1-> Block 3-> Block 9. There are three disk I/OS in total, and for the same amount of data, the tree height of balanced binary tree storage is definitely higher. It obviously has a higher IO count. So B tree actually speeds up the query efficiency.

  • Range queries

I don’t know if you noticed? The leaves of the B tree are not connected by Pointers. That means if I’m doing a range query, let’s say I’m doing 41 to 58.

First, binary search method access: data block 1-> data block 3-> data block 9, find 41; Then go back to the root node: data block 1-> data block 3-> data block 10, find 58, a total of six I/O query is completed, so the query efficiency is much slower.

It also has the following problems:

1. No pointer is connected to leaf nodes, so range query increases disk I/O times and reduces query efficiency.

2. If data stores row records, the space occupied by rows will increase as the number of columns increases. In this case, the amount of data that can be stored in a page is reduced, the tree is correspondingly higher, and disk I/OS are increased.

So there’s room for optimization in B trees.

2.6 B + tree

B+ trees are actually derived from B trees. It differs from B trees in two ways:

  • The non-leaf nodes of a B+ tree do not store data, only keys.
  • The leaf nodes of B + tree are connected by bidirectional Pointers and are bidirectional ordered linked lists

Its data structure is shown below:

According to the figure above, data of B+ tree are stored in leaf nodes. Therefore, we need to retrieve the leaf node for each query to find out the data. Somebody said, well, is this going to be slower? B trees don’t have to retrieve leaves.

This is not true because the non-leaf nodes of B+ no longer store data. So it can store more indexes, which means that in theory the height of a B+ tree should be lower than that of a B tree. From this point of view, instead of choosing a B tree to store values on non-leaf nodes, you should choose a B+ tree and lower the tree height.

Let’s do some analysis to see if the B+ tree works.

  • Equivalent query

So in this structure we’re looking for 48, and again we’re using binary search. Its query path looks like this: Block 1-> Block 3-> Block 9. Three disk I/OS in total, that’s fine.

  • Range queries

Let’s say I look up data from 41 to 49. First binary lookup access: database 1-> data block 3-> data block 8. Again after three disk IO, find 41 cached into the result set.

But since the leaf is a bidirectional ordered list, you just need to go backwards. Load data block 9 where 49 is located into the memory, traverse, find 49, the query ends, only four disk I/OS.

Here you can see that the B+ tree is much more efficient for a range query than the B tree, which has to go through the same route.

Therefore, quick lookups are supported for both equivalence and range queries in B+ trees. So MySQL has chosen B+ tree as the memory model for indexing.

03 How do MySQL indexes work?

Ok, the data structures that can be used as the indexed memory model have been analyzed. Finally, MySQL chose B+ tree as the index memory model. How does B+ trees work in a specific engine? Take a look

3.1 InnDB index

First InnDB index, for space reasons, I will talk about the primary key index and ordinary index.

3.1.1 Primary key Index

A primary key index, also known as a clustered index, is built using a B+ tree. The leaf node stores a row of data in the table. When a table does not have a primary key index, InnDB automatically creates an ROWID field to build the cluster index. Here are the rules:

  1. The PRIMARY KEY is defined on the table. InnoDB uses the PRIMARY KEY index as the cluster index.
  2. If the table does not have a primary key defined, InnoDB selects the first non-null unique index column as the cluster index.
  3. If neither of these is present, InnoDB uses an implicit 6-byte integer ROWID field to build the clustered index. The ROWID field is automatically incremented when a new row is inserted.

Take the Student table below, where id is the primary key and age is the normal index.

CREATE TABLE `student`  (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `age` int(11) NULL DEFAULT NULL.PRIMARY KEY (`id`) USING BTREE,
  INDEX `index_age`(`age`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 66 CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Compact;
Copy the code

Table data is as follows:

  • SQL > select * from primary key;
select * from student where id = 38;
Copy the code

The process is as follows:

  • First disk IO: Retrieve from the root node, load data block 1 into memory, compare 38 < 44, go left.
  • Second disk IO: Load the left data block 2 into memory, compare 8<37<38, go to the right.
  • Third disk I/O: Load the right data block 6 into the memory. Compare 37<38, 38=38. After the query, return the data to the client.

Flowchart: Three DISK I/OS

  • Primary key index range query SQL
select * from student where id between 38 and 44;
Copy the code

As previously introduced, B+ trees have bidirectional Pointers to leaf nodes, so range queries can use bidirectional ordered lists directly.

The process is as follows:

  • First disk IO: Retrieve from the root node, load data block 1 into memory, compare 38 < 44, go left.
  • Second disk IO: Load the left data block 2 into memory, compare 8<37<38, go to the right.
  • Third disk I/O: Load the right data block 6 into the memory. Compare 37<38, 38=38. Walk on the right.
  • Fourth disk I/O: Load the right data block 7 into the memory. Compare 38<44=44. After the query, return the data to the client.

Flowchart: Four DISK I/OS

3.1.2 Common Indexes

  • Common index equivalent query SQL

In InnDB, B+ tree normal index does not store data, only stores the primary key value of the data. For example, in this table age, its index structure looks like this:

What is the flow of executing the following query statement?

select * from student where age = 48;
Copy the code

Using a normal index requires retrieving the index twice. The primary key is age = 48, and the primary key is age = 48. This process is called table back.

That is, queries based on non-primary key indexes require one more scan of the index tree. Therefore, we should use primary key queries as much as possible.

The process is as follows:

  • First disk IO: Retrieve from the root node, load data block 1 into memory, compare 48 < 54, go left.
  • Second disk IO: Load the left data block 2 into memory, compare 28<47<48, go to the right.
  • Third disk I/O: Load the right data block 6 into the memory. Compare 47<48, 48=48. I get primary key 38.
  • Fourth disk IO: Retrieve from the root node, load the root node into memory, compare 38 < 44, go left.
  • Fifth disk IO: Load the left data block 2 into memory, compare 8<37<38, go to the right.
  • Sixth disk I/O: Load the right data block 6 into the memory. Compare 37<38, 38=38. After the query, return the data to the client.

Flowchart: A total of six DISK I/OS.

3.1.3 Composite Indexes

Would it be too much to design an index for every query? Now if I were to look up the age of the student based on his name. Let’s say the profile of this requirement is low, but we can’t let it go full table scan, can we?

But isn’t it a bit wasteful to create an index for an infrequent requirement? So what to do? We can create a joint index (name, age) to solve this problem. The structure of a composite index is shown below:

What is the flow of executing the following query statement?

select * from student where name = 'two dogs 5' and age = 48;
Copy the code

The process is as follows:

  • First disk IO: Retrieve from root node, load data block 1 into memory, compare 2 dog 5 < 2 dog 6, go left.
  • Second disk IO: load the left data block 2 into memory, compare 2 dog 2< 2 dog 4< 2 dog 5, go to the right.
  • Third disk IO: load data block 6 on the right into memory, compare 2 dog 4< 2 dog 5, 2 dog 5= 2 dog 5. I get primary key 38.
  • Fourth disk IO: Retrieve from the root node, load the root node into memory, compare 38 < 44, go left.
  • Fifth disk IO: Load the left data block 2 into memory, compare 8<37<38, go to the right.
  • Sixth disk I/O: Load the right data block 6 into the memory. Compare 37<38, 38=38. After the query, return the data to the client.

Flowchart: A total of six DISK I/OS

3.1.4 Left-most matching principle

The principle of left-most prefix matching is related to index storage structure and retrieval method of joint index.

In the composite index tree, the lowest leaf nodes are arranged in ascending order from left to right according to the first column name, but the AGE column is unordered, and the age column is only in ascending order in a small range if the values of the name column are equal.

Just like the query above, the B+ tree first compares the name column to determine which direction to search next, left or right. Compare the age column if the name column is the same. However, if there is no name column in the query condition, the B + tree does not know which node to start the first step, which is known as the left-most matching principle.

Create idx_name_age (name,age) index (name,age); ,

Rule of left-most prefix matching for composite indexes: When using composite indexes, mysql will keep matching to the right until it encounters a range query (>, <, between, like).

3.1.5 Overwriting indexes

Overwriting indexes is a common optimization technique. Because in the plain index example above, the data required for the query result is only available on the primary key index, you have to go back to the table. Is it possible to optimize indexes to avoid returning to the table? Like this:

select age from student where age = 48;
Copy the code

In the normal index example above, if I only need the age field, does that mean that we can query the leaf node of the normal index and return it directly without returning the table? This case is called an overwrite index.

Take a look at the execution plan:

Overwriting an index:

Index not overwritten:

3.2 myisam index

The student table is the same as above.

CREATE TABLE `student`  (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `age` int(11) NULL DEFAULT NULL.PRIMARY KEY (`id`) USING BTREE,
  INDEX `index_age`(`age`) USING BTREE
) ENGINE = MyISAM AUTO_INCREMENT = 66 CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Compact;
Copy the code

3.2.1 Primary key Index

Unlike InnDB, myISam’s data files and index files are stored separately. Its leaf nodes store keys, and the data is the disk address of the row where the index resides. Its structure is as follows: the index files for table Student are stored in Student.myi and the data files are stored in student.myd.

  • Primary key index equivalent query
select * from student where id = 38;
Copy the code

Its specific execution process is as follows:

  • First disk IO: Retrieve from the root node, load data block 1 into memory, compare 38 < 44, go left.
  • Second disk IO: Load the left data block 2 into memory, compare 8<37<38, go to the right.
  • Third disk I/O: Load the right data block 6 into the memory. Compare 37<38, 38=38. Gets the memory address of the index row.
  • Fourth disk IO: get the corresponding row from the data file student.myd based on the address.

Flowchart: Four DISK I/OS.

  • Primary key index range query
select * from student where id between 38 and 44;
Copy the code

The process is as follows:

  • First disk IO: Retrieve from the root node, load data block 1 into memory, compare 38 < 44, go left.
  • Second disk IO: Load the left data block 2 into memory, compare 8<37<38, go to the right.
  • Third disk I/O: Load the right data block 6 into the memory. Compare 37<38, 38=38. Gets the memory address of the index row.
  • Fourth disk IO: obtain the row record for primary key 38 from the data file student.MYD by address.
  • Fifth disk I/O: Load the right data block 7 into the memory. Compare 38<44=44. Gets the memory address of the index row.
  • The sixth disk IO: obtain the row record for primary key 44 from the data file student.MYD by address.

3.2.2 Common Indexes

In MyISAM, secondary indexes are structured the same as primary key indexes without any difference, and the data stored in leaf nodes is the disk address of row records. However, the key value of the primary key index is unique, while the key value of the secondary index can be repeated.

When querying data, because the key value of the secondary index is not unique, there may be multiple records with the same value. Therefore, even for equivalent query, data needs to be retrieved in the secondary index tree in the way of range query.

3.3 Index use skills

3.3.1 Avoid returning to the table

SQL > select * from primary key; select * from primary key; select * from primary key; The back table is bound to affect performance. So how do you avoid that?

Use an overwrite index, for example: Student, whose SQL is commonly used in business:

select id, name, age from student where name = 'two dogs 2';
Copy the code

Select * from student; select * from student; select * from student; select * from student; Age) so that the full data of the current statement can be retrieved from the secondary index.

This avoids the need to retrieve age data from the table. Well, this is a classic case of using an overwrite index optimization strategy to reduce back to the table.

3.3.2 Use of federated indexes

Joint index, in the establishment of the index, as far as possible in multiple single-column index to judge whether the use of joint index. The use of federated indexes not only saves space, but also makes it easier to use index coverage. For example, in the student table above, I create (name,age) and age indexes.

When creating a federated index, you should put the frequently used columns and highly differentiated columns first. Frequent use means high index utilization, and high discrimination means large filtering granularity.

It can also be added to a federated index on fields that are often returned as queries, and should be used if an override index is used by adding a field to the federated index.

Use of federated indexes

  • Consider whether there are already multiple single-column indexes that can be merged, and if so, create the current multiple single-column indexes as a single federated index.
  • If the current index has columns that are frequently used as return fields, you can consider whether the current column can be added to the existing index so that the query statement can use the overwriting index.

3.3.3 Index push down

Now my table data looks like this: I have added a sex column.

When it comes to satisfying the left-most prefix rule, the left-most prefix can be used to locate records in an index. Now, you might ask, what about the parts that don’t fit the leftmost prefix?

Again, let’s take the union index (name, age) of the student table. If you now have a demand: retrieve all boys in the table whose names start with two and whose age is 38. So, the SQL statement looks like this:

select * from student where name like 'a %' and age=38 and sex='male';
Copy the code

According to the prefix index rule, this statement can only search the index tree with “zhang” and find three records that meet the criteria (red box data). Of course, this is not bad; it’s better than a full table scan.

And then what? Of course, whether the other conditions are met.

Before MySQL5.6, you could only return a table one by one from a record id=18 that met the condition. Find the rows on the primary key index and compare the fields

The index condition pushdown feature introduced in MySQL 5.6 can be used to determine the fields contained in the index during index traversal and filter out the records that do not meet the conditions to reduce the number of table returns.

The flow chart of its entire implementation looks like this:

InnoDB determines whether age equals 38 in the index (name, age), and skips records that do not. In this example, we only need to fetch data from the table with id=18 and id=65. We only need to fetch data from the table twice. This is called index push-down.

Give some books

If you read here and like this article, please click on it. 1000+ programming e-books, including C, C++, Java, Python, GO, Linux, Git, database, design pattern, front end, artificial intelligence, interview related, data structure and algorithm and computer foundation, see the following figure for details. Reply 1024 send you a complete set of Java video tutorials.

Shoulders of giants

  • High Performance MySQL
  • zhuanlan.zhihu.com/p/142361798
  • cnblogs.com/lianzhilei/p/11250589.html
  • blog.csdn.net/qq_35190492/article/details/109257302
  • time.geekbang.org/column/article/69636
  • cnblogs.com/happyflyingpig/p/7662881.html
  • tech.meituan.com/2014/06/30/mysql-index.html

conclusion

This article explains what an index is. Its advantages and disadvantages, classification; Six memory models of index; Why use B+ trees? InnDB and MyIsam engine primary key index, ordinary index, combined index, overwrite index are how to function? What is the leftmost matching rule? Finally, there are some tips for using indexes. It’s all about indexes, so to speak. After reading this one still don’t understand the words, you beat me!

Well, that’s dogdog’s summary of indexes. Thanks to the efforts of the tech community leaders, if I have seen further, it is because I am standing on your shoulders. Hope you found this article helpful, and we’ll see you in the next article