An overview of MySQL indexes

A database index is like the table of contents at the front of a book: it speeds up database queries. An index is a structure that sorts the values of one or more columns of a database table (for example, the 'name' column of the User table). If you want to look up a particular user by name, an index helps you find the information faster than searching all the rows of the table.

Creating an index has the following advantages:

  • It greatly accelerates data retrieval;
  • A unique index guarantees the uniqueness of each row of data in the table;
  • It speeds up joins between tables;
  • When grouping and sorting clauses are used for data retrieval, it can significantly reduce the grouping and sorting time in queries.

Of course, there are advantages and disadvantages. The disadvantages of indexes are as follows:

  • Indexes occupy physical storage space in addition to the data table;
  • It takes time to create and maintain indexes;
  • When the table is updated, its indexes must be maintained as well, which slows down data modification.

Common index types

Indexes exist to improve query efficiency, but there are many ways to implement one, so the concept of an index model is introduced here. Many data structures can be used to improve read and write efficiency. Here are three common and relatively simple ones: hash tables, ordered arrays, and search trees.

Hash table

A hash table stores key-value pairs. To use a hash table as an index, first apply a hash function to convert the key into a position in an array, then put the value at that position. If multiple keys map to the same position, a linked list is kept there to hold the key-value pairs that share that slot. This is essentially how Java's HashMap is implemented.
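As a toy illustration (this is a textbook sketch, not InnoDB's or HashMap's actual code; the bucket count and ID values are made up), a chained hash index might look like this in Python:

```python
class HashIndex:
    """Toy hash index: each bucket holds a list of (key, value) pairs."""

    def __init__(self, buckets=8):
        self.slots = [[] for _ in range(buckets)]

    def put(self, key, value):
        bucket = self.slots[hash(key) % len(self.slots)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # collision: append to the chain

    def get(self, key):
        bucket = self.slots[hash(key) % len(self.slots)]
        for k, v in bucket:              # walk the chain on collision
            if k == key:
                return v
        return None

idx = HashIndex()
idx.put("ID_card_n1", "User1")
idx.put("ID_card_n2", "User2")
```

Equality lookups are O(1) on average, but note that nothing here keeps the keys in order, which is exactly the weakness discussed below.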

If this is hard to picture, look at the diagram below: a table that maintains users' ID numbers and names.

In the figure above, the ID numbers of User2 and User4 both hash to position N, so they form a linked list there. Given an ID number, to look up the name you first apply the hash function to compute the array position; if several entries share that position, you traverse the linked list.

Note that the four ID_card_n values in the figure are not in increasing order. The advantage is that new users can be added quickly, by simply appending to the end. The disadvantage of a hash table is just as obvious: because it is unordered, range queries on a hash index are very slow.

So you can imagine that if you wanted to find all the users whose ID numbers fall in the range [ID_card_X, ID_card_Y], you would have to scan all of them.

Therefore, the hash table structure is only suitable for scenarios with equality queries, such as = and IN(); this is the model used by Memcached and other NoSQL engines.

Ordered arrays

Ordered arrays perform very well in both equality-query and range-query scenarios. If we store the same data in an ordered array, it looks like this:

The array is stored in ascending order of ID number, so if you want to look up the name for ID_card_n2, you can find it quickly with binary search, in O(log N) time.

Ordered arrays also support range queries. To find the users whose ID numbers fall in [ID_card_X, ID_card_Y], use binary search to locate ID_card_X (or, if ID_card_X is absent, the first value greater than it), then iterate to the right, exiting the loop at the first ID number greater than ID_card_Y.
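Both lookups can be sketched with Python's standard `bisect` module (the ID values and names below are invented for illustration):

```python
import bisect

# Toy ordered-array index: ID numbers sorted ascending, parallel names list.
ids = ["ID_card_a", "ID_card_c", "ID_card_f", "ID_card_m"]
names = ["User1", "User2", "User3", "User4"]

def lookup(id_card):
    """Equality query via binary search: O(log N)."""
    i = bisect.bisect_left(ids, id_card)
    if i < len(ids) and ids[i] == id_card:
        return names[i]
    return None

def range_query(lo, hi):
    """Range query [lo, hi]: binary-search for lo, then scan right."""
    i = bisect.bisect_left(ids, lo)      # first entry >= lo
    result = []
    while i < len(ids) and ids[i] <= hi: # stop at the first entry > hi
        result.append(names[i])
        i += 1
    return result
```

The range query does exactly one binary search plus a linear walk over the matching entries, which is why ordered storage wins here where a hash table cannot.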

Looking at query efficiency alone, the ordered array is the best of these data structures. The trouble comes when you need to update the data: inserting one record in the middle forces every record after it to move, which is too expensive.

Therefore, ordered array indexes are only good for static storage engines, such as when you want to store a certain type of data that will not be modified.

Search tree

Binary search trees are another classic textbook data structure. If we use a binary search tree to implement the example above, the schematic diagram is as follows:

The defining property of a binary search tree is that each node's left child is smaller than the node itself, which in turn is smaller than its right child. So to look up ID_card_n2, the search sequence is UserA -> UserC -> UserF -> User2, which takes O(log N) time.

Of course, to keep queries at O(log N) you must keep the tree balanced, and maintaining that balance makes updates cost O(log N) as well.
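A minimal sketch of the search walk (an unbalanced textbook BST, not a database structure; the keys and user names are made up to mirror the diagram):

```python
class Node:
    def __init__(self, key, name):
        self.key, self.name = key, name
        self.left = self.right = None

def insert(root, key, name):
    """Standard BST insert; no rebalancing, so worst case degrades to O(N)."""
    if root is None:
        return Node(key, name)
    if key < root.key:
        root.left = insert(root.left, key, name)
    else:
        root.right = insert(root.right, key, name)
    return root

def search(root, key):
    """Walk down the tree: O(log N) while the tree stays balanced."""
    while root is not None:
        if key == root.key:
            return root.name
        root = root.left if key < root.key else root.right
    return None

root = None
for k, n in [("ID_card_f", "UserA"), ("ID_card_c", "UserC"),
             ("ID_card_m", "UserF"), ("ID_card_d", "User2")]:
    root = insert(root, k, n)
```

Each comparison descends one level, so the number of comparisons equals the depth of the target node, which is what makes tree height the thing to minimize.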

Judged by query cost alone, the binary tree is the most efficient search structure. In practice, however, most database storage engines do not use binary trees; they choose multiway trees, that is, B trees and B+ trees, in which each node can have many children. The reasons are as follows:

When a large amount of data is stored, the index itself needs a large amount of space, so it is not practical to keep the whole index in memory. Instead, indexes are stored on disk and read from disk into memory as needed.

When querying data through an index, the computer reads the index from disk in pages, and each disk page corresponds to a node of the tree. Visiting a tree node therefore costs one disk I/O, and each disk I/O takes roughly 8-10 ms on a mechanical disk, which is extremely expensive. The shorter the tree, the fewer nodes a query passes through and the less time it takes. That is why databases use multiway trees instead of binary trees!

Let’s use a diagram to see the difference between a binary tree and a multiway tree:

In InnoDB, a node of an index on an integer field can hold roughly 1,200 children. A B+ tree of height 4 can then store 1,200 to the third power values at its leaves, about 1.7 billion. Since the root data block is almost always in memory, an integer-field index on a table with a billion rows needs at most three disk accesses to find a value. In fact, the second level of the tree also has a high probability of being in memory, so the average number of disk accesses is even smaller.
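A quick back-of-envelope check of those figures (the 16 KB page size is InnoDB's default; the 8-byte key and ~6-byte child pointer are rough assumptions for the estimate):

```python
# Rough fanout estimate for a non-leaf B+ tree page (assumed figures):
page_size = 16 * 1024        # InnoDB's default 16 KB page
entry_size = 8 + 6           # bigint key (8 B) + child-page pointer (~6 B)
fanout = page_size // entry_size
print(fanout)                # about 1,170 children per node

# A height-4 B+ tree has 3 levels of branching above the leaves:
reachable = 1200 ** 3
print(reachable)             # 1,728,000,000 -- about 1.7 billion values
```

The exact fanout depends on key type and page overhead, but the order of magnitude is what matters: three levels of ~1,200-way branching already cover a billion-row table.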

InnoDB index model

In MySQL, indexes are implemented at the storage engine layer, so there is no uniform index standard, that is, indexes of different storage engines work differently. Even if multiple storage engines support the same type of index, the underlying implementation may differ. Since InnoDB storage engine is the most widely used MySQL database, I will take InnoDB as an example and analyze the index model with you.

In InnoDB, tables are stored as indexes ordered by primary key; such tables are called index-organized tables. As mentioned earlier, InnoDB uses the B+ tree index model, so data is stored in B+ trees.

Each index in InnoDB corresponds to a B+ tree.

Suppose we have a table with a primary key column ID, a column k, and an index on k.

The table-creation statement is:

```sql
mysql> create table T(
    id int primary key,
    k int not null,
    name varchar(16),
    index (k)
) engine=InnoDB;
```

In the table, the (ID,k) values of R1~R5 are (100,1), (200,2), (300,3), (500,5) and (600,6) respectively. The schematic diagram of the two trees is as follows.

It is not difficult to see from the figure that, according to the content of the leaf node, the index type is divided into primary key index and non-primary key index.

Leaf nodes indexed by primary keys hold entire rows of data. In InnoDB, primary key indexes are also called clustered indexes.

The leaf nodes of a non-primary-key index hold primary key values. In InnoDB, non-primary-key indexes are also called secondary indexes.

Based on the above index structure description, let’s discuss a question: what is the difference between a query based on a primary key index and a normal index?

If the statement is select * from T where ID=500, the primary-key query mode, you only need to search the ID B+ tree. If the statement is select * from T where k=5, the ordinary-index query mode, you search the k index tree first to obtain the ID value 500, and then search the ID index tree. This second step is called going back to the table.

That is, a query based on a non-primary key index requires one more index tree to be scanned. Therefore, we should try to use primary key queries in our applications.

Index maintenance

B+ trees need to perform necessary maintenance when inserting new values in order to maintain index order. In the example above, if you insert a new row with an ID of 700, you only need to insert a new record after the R5 record. If the newly inserted ID value is 400, it is relatively troublesome, and you need to logically move the following data to make space.

Even worse, if the data page R5 is on is already full, according to the B+ tree algorithm, it needs to apply for a new data page and move some data there. This process is called page splitting. In this case, performance naturally suffers.

In addition to performance, page splitting also affects data page utilization. Data originally placed on one page is now divided into two pages, reducing overall space utilization by about 50%.

Of course there is fragmentation and there is consolidation. When two adjacent pages have low utilization due to data deletion, the data pages are merged. The process of merging can be regarded as the reverse of the process of splitting.

Based on the index maintenance process described above, let’s discuss a case:

You may have seen this described in some of the table building specifications, which require table building statements to have auto-increment primary keys. Of course, nothing is absolute, so let’s analyze where you should use auto-increment primary keys and where you shouldn’t.

An auto-increment primary key is a primary key defined on an auto-increment column; in a table-creation statement it is typically written as NOT NULL PRIMARY KEY AUTO_INCREMENT.

When inserting a new record, you do not need to specify the ID value. The system obtains the maximum value of the current ID plus 1 as the ID value of the next record.

In other words, the auto-increment primary key insert mode fits the incremental insert scenario we mentioned earlier. Every time a new record is inserted, it is an append operation, which does not involve moving other records and does not trigger the splitting of leaf nodes.

A primary key built on a field with business meaning, however, is hard to keep inserting in order, so the cost of writing data is relatively high.

In addition to considering performance, we can also look at it from a storage perspective. If you do have a unique field in your table, such as a string id number, should you use id number as the primary key or increment field as the primary key?

Because every leaf node of a non-primary-key index holds a primary key value, the leaf entries of each secondary index take about 20 bytes if the ID-number string is used as the primary key, but only 4 bytes with an int key, or 8 bytes with a bigint.

Obviously, the smaller the primary key length, the smaller the leaf nodes of the normal index, and the smaller the space taken up by the normal index.

Therefore, auto-increment of primary keys is often a more reasonable choice for performance and storage.

Are there any scenarios where you can use a business field directly as the primary key? There are. For example, some business scenarios have requirements like this:

  1. There is only one index;
  2. The index must be unique.

As you can see, this is a typical KV scenario. Since there are no other indexes, there is no need to worry about the size of leaf nodes for other indexes. In this case, we should give priority to the “use primary key as much as possible” principle mentioned in the previous paragraph. Setting this index to the primary key directly avoids the need to search two trees per query.

Index principles

Before introducing the principles of indexing, let’s look at an example:

In the following table T, if I perform select * from T where k between 3 and 5, how many tree searches are required and how many rows are scanned?

```sql
mysql> create table T (
    ID int primary key,
    k int NOT NULL DEFAULT 0,
    s varchar(16) NOT NULL DEFAULT '',
    index k(k)
) engine=InnoDB;

insert into T values (100, 1, 'aa'), (200, 2, 'bb'), (300, 3, 'cc'),
    (500, 5, 'ee'), (600, 6, 'ff'), (700, 7, 'gg');
```

Now let’s take a look at the execution flow of this SQL query:

  1. Find the record k=3 in the k index tree and obtain ID=300;
  2. Go to the ID index tree and fetch R3, the row with ID=300;
  3. Take the next value k=5 in the k index tree and obtain ID=500;
  4. Go to the ID index tree again and fetch R4, the row with ID=500;
  5. The next value in the k index tree, k=6, does not satisfy the condition, so the loop ends.

In this process, the step of going back to the primary key index tree is what we call going back to the table. As you can see, the query reads three records of the k index tree (steps 1, 3, and 5) and goes back to the table twice (steps 2 and 4). In this example, the data required by the query is only available on the primary key index, so the table lookups are unavoidable. So, is it possible to optimize the index to avoid going back to the table?
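To make the counting concrete, here is a toy Python model of the two trees (a dict for the primary key index and a sorted list for index k; these are stand-ins for B+ trees, not InnoDB's real structures) that reproduces the three index reads and two table lookups:

```python
import bisect

# Toy model of table T: primary key index (id -> full row) and
# secondary index on k: a sorted list of (k, id) pairs.
rows = {100: (100, 1, 'aa'), 200: (200, 2, 'bb'), 300: (300, 3, 'cc'),
        500: (500, 5, 'ee'), 600: (600, 6, 'ff'), 700: (700, 7, 'gg')}
k_index = sorted((k, id_) for id_, (_, k, _) in rows.items())

def select_star_k_between(lo, hi):
    """Simulate: select * from T where k between lo and hi."""
    index_reads = table_backs = 0
    result = []
    i = bisect.bisect_left(k_index, (lo, 0))
    while i < len(k_index):
        index_reads += 1       # read one entry from the k index tree
        k, id_ = k_index[i]
        if k > hi:             # first value past the range ends the loop
            break
        table_backs += 1       # fetch the full row from the primary key tree
        result.append(rows[id_])
        i += 1
    return result, index_reads, table_backs
```

Running `select_star_k_between(3, 5)` touches three k-index entries (k=3, 5, and 6, the last one only to discover the range is over) but goes back to the "table" only twice, matching the step-by-step flow above.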

Covering index

If the statement executed were select ID from T where k between 3 and 5, we would only need the ID values, and the ID values are already on the k index tree, so the query result can be returned directly without going back to the table. In other words, in this query index k has already "covered" our query requirement, so we call it a covering index.

Because a covering index can reduce the number of tree searches and significantly improve query performance, using covering indexes is a common performance-optimization technique.

Note that when the covering index is used, the engine actually reads three records on index k, R3~R5 (the corresponding entries on index k), but the MySQL server layer only fetches two records from the engine, so MySQL reports the number of scanned rows as 2.
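The covering variant of the toy model makes the difference visible: the id values sit right inside the k index entries, so the primary key tree is never touched (the (k, id) pairs below are the same invented data as before):

```python
import bisect

# (k, id) entries of index k for table T -- the id is stored in the index.
k_index = [(1, 100), (2, 200), (3, 300), (5, 500), (6, 600), (7, 700)]

def select_id_k_between(lo, hi):
    """select ID from T where k between lo and hi -- no table lookup needed."""
    i = bisect.bisect_left(k_index, (lo, 0))
    ids = []
    while i < len(k_index) and k_index[i][0] <= hi:
        ids.append(k_index[i][1])   # the id comes straight from index k
        i += 1
    return ids                      # zero visits to the primary key tree
```

The scan over index k is identical to the `select *` case; only the two table lookups disappear.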

Based on the above description of covering indexes, let's discuss a question: on a citizen-information table, is it necessary to have a joint index on (ID number, name)?

Assume that the creation statement of the citizen-information table is as follows:

```sql
CREATE TABLE `tuser` (
    `id` int(11) NOT NULL,
    `id_card` varchar(32) DEFAULT NULL,
    `name` varchar(32) DEFAULT NULL,
    `age` int(11) DEFAULT NULL,
    `ismale` tinyint(1) DEFAULT NULL,
    PRIMARY KEY (`id`),
    KEY `id_card` (`id_card`),
    KEY `name_age` (`name`, `age`)
) ENGINE=InnoDB;
```

As we know, the ID number is a citizen's unique identifier. That is to say, if the requirement is to query citizen information by ID number, we only need to build an index on the ID-number field. So isn't building a joint index (ID number, name) a waste of space?

If there is now a high frequency request to look up a citizen’s name based on his id number, this joint index makes sense. It can be used on this high frequency request, eliminating the need to go back to the table to look up the entire row, reducing statement execution time.

Of course, maintaining index fields always has a cost, so there are trade-offs when creating redundant indexes to support covering indexes. This is the job of a business DBA, or business data architect.

Left-most prefix rule

Reading this far, you must be wondering whether designing an index for every kind of query would produce too many indexes. And what if you had to look up a citizen's home address by ID number? Even though this query is infrequent in the business, we can't just let it do a full table scan, right? On the other hand, creating a separate (ID number, address) index for an infrequent request feels wasteful. What should you do?

Let me give you the conclusion first: the B+ tree index structure can use the leftmost prefix of an index to locate records.

To illustrate this concept, let's use the joint index (name, age).

As you can see, the index entries are sorted by the fields that appear in the index definition.

When your logical requirement is to find everyone named "Zhang San", you can quickly locate ID4 and iterate backwards to get all the results you need.

If instead you want to query everyone whose name starts with "Zhang", your SQL condition is where name like 'Zhang%'. In this case the index can still be used: the first record that meets the criteria is ID3, and you iterate backwards from there until the condition no longer holds.

As you can see, indexes can be used to speed up retrieval as long as the left-most prefix is satisfied, not just the entire definition of the index. The leftmost prefix can be the leftmost N field of a union index or the leftmost M characters of a string index.

Based on this description of the leftmost prefix, let's discuss how to arrange the field order when creating a joint index.

The evaluation criterion here is the reusability of the index. Since the leftmost prefix is supported, there is generally no need to create a separate index on a when a joint index (a, b) already exists. So the first rule of thumb is: if adjusting the order lets you maintain one less index, that order is often the one to choose.

So now you know: for the question at the start of this section, we create the joint index (ID number, name) for the high-frequency request, and rely on the leftmost prefix of that same index to support the "query address by ID number" requirement.

What if you have both joint queries and queries on a and b individually? A query whose condition mentions only b cannot use the (a, b) joint index, so you have no choice but to maintain a second index; that is, you need to maintain both (a, b) and (b).

At this point the principle to consider is space. For example, if the name field is larger than the age field, I recommend creating the joint index (name, age) plus a single-field index (age).

Index condition pushdown

We just saw that, when the leftmost prefix rule is satisfied, the leftmost prefix can be used to locate records in the index. Now you might ask: what happens to the parts of the condition that don't fit the leftmost prefix?

Again, let's take the joint index (name, age) of the citizen table. Suppose the requirement now is: retrieve all boys in the table whose name begins with "Zhang" and whose age is 10. The SQL statement looks like this:

mysql> select * from tuser where name like 'Zhang%' and age=10 and ismale=1;

You already know the leftmost prefix rule, so this statement can only use "Zhang" to search the index tree and find the first record that meets the condition, ID3. That's not bad, of course; it's better than a full table scan.

And then what?

Of course, whether the other conditions are met.

Before MySQL 5.6, you could only go back to the table one record at a time, starting from ID3: find the row on the primary key index, then compare the field values.

MySQL 5.6 introduced index condition pushdown: during the index traversal, the fields contained in the index are checked first, and records that fail the condition are filtered out, reducing the number of table lookups.

The following is a flowchart for the execution of the two processes.

In the first figure I deliberately omitted the age values, because without index condition pushdown InnoDB does not look at age at all: it simply takes each record whose name begins with "Zhang" and goes back to the table for it, one by one. So the table is visited 4 times.

The difference in the second figure is that InnoDB checks, inside the (name, age) index itself, whether age equals 10 and skips the records that don't match. In our example only the two records ID4 and ID5 need to go back to the table to fetch data, so only 2 table lookups are needed.
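The two figures can be boiled down to a counting sketch (a toy model, not InnoDB internals; the four "Zhang" index entries below are invented to match the example's ID3-ID6):

```python
# Toy comparison of table-lookup counts with and without
# index condition pushdown (ICP).
# Entries of the joint index (name, age) -> primary key id.
index_entries = [("Zhang Liu", 30, 3), ("Zhang San", 10, 4),
                 ("Zhang San", 10, 5), ("Zhang Wei", 22, 6)]

def table_backs(prefix, age, with_icp):
    """Count how many records must be fetched from the primary key index."""
    backs = 0
    for name, entry_age, _id in index_entries:
        if not name.startswith(prefix):
            continue                      # outside the leftmost-prefix range
        if with_icp and entry_age != age:
            continue                      # ICP: filter on age inside the index
        backs += 1                        # otherwise fetch the full row
    return backs

print(table_backs("Zhang", 10, with_icp=False))  # 4 table lookups
print(table_backs("Zhang", 10, with_icp=True))   # 2 table lookups
```

Same index range scan in both cases; pushdown only changes where the age filter runs, and that alone halves the table lookups in this example.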