The database index of the years of interactions with interviewers

I sat across from the interviewer and introduced myself passionately, while the interviewer looked through my resume with a blank face. I don’t know whether my little brother is too tall or attracted by my resume. After 2 minutes, my little brother still didn’t say a word to me. They seem to have two down sons. But it doesn’t matter. It doesn’t matter.

What is an index?

Interviewer: I see you have done SQL optimization in your project, so let’s talk about indexes today.

Index can ask what, nothing more than the concept of index, index rules, index classification, index principle. Hee hee ~ I was prepared) I:An index in a database, simply put, is like a table of contents in a book. It helps us quickly locate and find specific values, thus speeding up the efficiency of data query.

If we do not use an index, we have to search from the first record until all tables have been searched to find the data we want.

Interviewer: Do you mean the more indexes the better?

Isn’t the more indexes the better?

Me: Of course, index is not the more the better, index is not a panacea, in some cases the use of index will make efficiency become low.

The value of an index is to help us find the data we want from a large amount of data, if the data is small, then whether or not to use the index does not matter much.

When the number of rows in a table is small, such as less than 1000 rows, there is no need to create indexes. In addition, there is no need to index this field when the data is highly repetitive, such as more than 10%.

For example, if the field is gender, you don’t need to index it. Why is that? If you want to find 500,000 rows out of a million rows (say, male data), once you create an index, you need to access the index 500,000 times and then the table 500,000 times, which can add up to more overhead than not using the index at all.

Types of indexes

The interviewer nodded and seemed satisfied with my answer. Then he asked:

Interviewer: What kinds of indexes do you have?

(Hee hee, I am too familiar with the types of indexes. But I paused a little and began my answer.)

Categorize by functional logic

I: From the functional logic, there are four main indexes, which are common index, unique index, primary key index and full-text index.

A common index is a basic index that has no constraints and is mainly used to improve query efficiency.

Unique index is to add the constraint of data uniqueness on the basis of common index, there can be multiple unique indexes in a data table.

A primary key index is NOT NULL or UNIQUE, so there is only one primary key index in a table.

Full text index is not used much, MySQL’s own full text index only supports English. Specialized full-text search engines such as ES(ElasticSearch) and Solr can often be used.

In fact, the first three types of indexes (normal, unique, and primary key indexes) are all of the same kind of indexes, but the constraints on the data are gradually increasing.

There can only be one primary key index in a data table because of the physical implementation of primary key indexes, because data stores in files can only be stored in one order. But there can be multiple normal indexes or multiple unique indexes.

Classified by physical implementation

Me: According to the physical implementation, indexes can be divided into two types: clustered index and non-clustered index. We also refer to non-clustered indexes as secondary or secondary indexes.

Clustered indexes can store data sorted by primary key, which is very effective when looking up rows.

For example, if we want to look up the word “shu” in a Chinese dictionary, we can simply look for the location of The Chinese pinyin, which is the pinyin “Shu”. This gives us the location of the index, and behind it is the data row we are looking for.

A non-clustered index does not place the indexed content directly after the index like a clustered index, but maintains a separate index table (only the index is maintained, not the data pointed to by the index) to facilitate data retrieval.

We also take the Chinese dictionary as an example, if you want to find the word “number”, then according to the way of radical search, first find the radical of the word “number”, and then the directory will tell us how many pages the word “number” is stored in, and then we go to the specified page number to find the word.

That is, the system will do two lookups, the first to find the index, and the second to find the location of the index to retrieve the row.

Differences between clustered and non-clustered indexes

In fact, the above answer is enough, but in order to show my thorough understanding, I also made the following elaboration:

Me: Clustered indexes and non-clustered indexes have different principles and some differences in use:

Leaf nodes of clustered indexes store our data records, while leaf nodes of non-clustered indexes store data locations. Non-clustered indexes do not affect the physical storage order of data tables.
A table can have only one clustered index, because there can be only one sort of storage, but it can have multiple non-clustered indexes, that is, multiple index directories providing data retrieval.
When a clustered index is used, the data query efficiency is high, but when data is inserted, deleted, or updated, the efficiency is lower than that of a non-clustered index.

The data structure of the index

Interviewer: You have just explained the classification of indexes in terms of functional logic and physical implementation. It seems that you have a good understanding of the data structure of indexes. Tell me what you know about the index data structure.

(That’s easy, I blurted out)

Me: Hash, B tree and B+ tree can all be used as index data structure, but in MySQL, B+ tree is used, and B+ tree is also commonly used as index data structure.

Why do we often use B+ trees as data structures for indexes?

Interviewer: Why do we often use B+ trees as data structures for indexes? Don’t the other tree structures smell good?

I knew it wasn’t that simple. Oh, why did I say “often”? I couldn’t laugh or cry inside, but I tried to keep my composure. I:Before I answer this question, LET me say a few words about where to put an index and how to judge the data structure of an index.

The location of the index

Me: As we know, the database server has two storage media, namely hard disk and memory. Memory is temporary storage. When an accident occurs, such as a power failure or a faulty restart, data will be lost. Hard disks act as permanent storage media and data can be persisted, which is why we need to save data to hard disks.

How to evaluate the data structure design of the index?

Me: Even though memory reads fast, we still need to store indexes on hard disk. Therefore, when we perform queries on the hard disk, we also perform I/O operations on the hard disk.

As we all know, hard disk I/O access consumes much more time than memory access. The number of disk I/ OS required to find a row through an index increases with the number of disk I/ OS.

If we can structure the index to minimize disk I/O, the less time it takes, the better the index’s data structure design will be.

Binary tree

The interviewer nodded and motioned for me to continue. In order to give a satisfying answer to the question “Why do we use B+ trees as data structures for indexing?” I picked up my pen and started with binary trees. I:Next, we talk about binary tree. We know that binary search method is an efficient data retrieval method with O(log2n) time complexity, which can be said that the retrieval speed is very fast.

Using the Binary Search Tree as an example, searching for a node is the same as inserting a node. We assume that the value of the Search insert is key:

If the key is larger than the root node, the search is performed in the right subtree.
If the key is smaller than the root node, the search is performed in the left subtree.
If the key is equal to the root node, which means the node is found, return the root node.

For example, the binary search tree we created for the sequence (25,18,36,9,20,32,41) looks like this:

But there are special cases where the depth of the binary tree is very large. For example, the order of data given by us is (9, 18, 20, 25, 32, 36, 41), creating a binary search tree as shown in the figure below:

Now this tree is also a binary search tree, but it has been degraded to a linked list, and the search time has become O(n).

We can see that the depth of the first tree is 3, which means that it takes at most three comparisons to find the node, while the depth of the second tree is 7, which takes at most seven comparisons to find the node.

Balanced binary search tree

Interviewer: Since normal binary trees don’t work, how about balanced binary search trees? Because we know that it can be rotated to prevent data structures from degrading to linked lists in special cases.

Me: AS I mentioned earlier, the data query time mainly depends on the number of disk I/O. Even with the improved balanced binary search tree, the tree depth is O(log2n). When N is large, the tree depth is high, as shown in the following figure:

Each node access requires one disk I/O operation, and for the tree above, we need five I/ OS. Although balanced binary tree comparison is efficient, the tree depth is also high, which means that the number of disk I/O operations affects the overall data query efficiency.

What is a B tree?

Me: For the same data above, if we change the binary tree to m-tree (M>2), when M=3, the same 31 nodes can be stored by the following trigeminal tree:

You can see that the height of the tree is going down, and when N is large, and M is large, the height of the m-tree is going to be much lower than the height of the binary tree.

If the binary tree is used as the index structure, the tree will become very high, increasing the disk I/O times, and affecting the data query time. Therefore, a node cannot have only 2 children, but should be allowed to have M children (M>2).

The emergence of B Tree is to solve this problem, THE English name of B Tree is Balance Tree, also known as balanced multi-path search Tree, its height is far less than the height of balanced binary Tree. The index structure in file system and database system is often implemented by B tree.

The structure of b-tree is shown in the figure below:

As a balanced multi-path search tree, each node of b-tree can contain at most M child nodes, and M is called the order of b-tree. You can also see that each disk block contains Pointers to keywords and child nodes. If a disk block contains X keywords, the number of Pointers is x+1. For a 100-rank B-tree, if there are three levels, it can store up to about 1 million index data. For large amounts of index data, the b-tree structure is very suitable because the tree is much smaller than the binary tree.

An m-order B-tree (M>2) has the following properties:

The range of sons of the root node is [2,M].
Each intermediate node contains K-1 keywords and K children. The number of children is equal to the number of keywords +1. The value of K is [ceil(M/2), M].
The leaf node contains k-1 keywords (the leaf node has no children), and the value range of k is [ceil(M/2), M].
Assume that the keywords of the intermediate nodes are Key[1], Key[2]… , Key[k-1], and the keywords are sorted in ascending order, that is, Key[I]
All leaf nodes are in the same layer.

The b-tree represented in the diagram above is a third-order B-tree. If we look at disk block 2, the keyword is (8, 12), it has 3 children (3, 5), (9,10) and (13,15), you can see that (3, 5) is less than 8, (9,10) is between 8 and 12, and (13,15) is greater than 12. It fits exactly what we just gave you.

And then let’s see how we can use B trees to find things. Assuming that the keyword we want to find is 9, the steps can be divided into the following steps:

We compare it with the keyword (17, 35) of the root node. If 9 is less than 17, we get pointer P1.
Find disk block 2 according to pointer P1, keyword is (8, 12), because 9 is between 8 and 12, so we get pointer P2;
Follow pointer P2 to find disk block 6 with keyword (9, 10), and then we find keyword 9.

We can see that we’re not comparing a lot of times in the search of the B tree, but if we’re reading the data out and comparing it in memory, that’s negligible.

However, reading the disk block itself requires I/O operation, which consumes more time than the time required for comparison in memory, and is an important factor in data search. Compared with balanced binary tree, B tree has less disk I/O operation, and is more efficient in data query than balanced binary tree.

What is a B+ tree?

Me: In the end, B+ tree, B+ tree is based on B tree to make an improvement, the mainstream DBMS all support B+ tree index, such as MySQL. B+ trees differ from B trees in the following points:

A node with k children has k keywords. So the number of children is equal to the number of key words, and in the CASE of B tree, the number of children is equal to the number of key words plus 1.
Keywords of non-leaf nodes also exist in child nodes and are the maximum (or minimum) of all keywords in child nodes.
Non-leaf nodes are only used for indexing and do not hold data records. Records-related information is stored in leaf nodes. In a B-tree, non-leaf nodes store both indexes and data records.
All keywords appear in the leaf node, which forms an ordered linked list, and the leaf node itself is linked in order of keyword size from smallest to largest.

The following figure is a B+ tree with order 3. Keywords 1, 18 and 35 in the root node are the minimum values in the child nodes (1, 8, 14), (18, 24, 31) and (35, 41, 53) respectively. The keywords of the parent node of each layer will appear in the keywords of the child node of the next layer, so all the keyword information is included in the leaf node, and each leaf node has a pointer to the next node, thus forming a linked list.

For example, if we want to find the keyword 16, the B+ tree will search from the top down layer by layer:

Compared with the root keyword (1, 18, 35), 16 is between 1 and 18, resulting in pointer P1 (pointing to disk block 2)
Find disk block 2, keyword (1,8,14), because 16 is greater than 14, so get pointer P3 (to disk block 7)
Find disk block 7, keyword (14,16,17), and then we find keyword 16, so we can find the data corresponding to keyword 16.

Interviewer: There are 3 I/O operations in the whole process. It seems that the query process of B+ trees is similar to that of B+ trees. So why do we use B+ trees more?

Me: One fundamental difference between A B+ tree and a B tree is that the middle node of a B+ tree does not store data directly. The benefits are:

First, B+ tree query efficiency is more stable. Because B + tree each time the only access to the leaf node to find the corresponding data, in the B tree, not a leaf node will store data, this will cause the query efficiency is not stable, sometimes access to the leaf node can find the key words, and sometimes need to access to the leaf node to find the key.
Second, B+ trees are more efficient queries because they are generally dumber (larger order, lower depth) and require less disk I/O to query. The B+ tree can store more node keywords for the same disk page size.

The efficiency of B+ tree is higher than that of B tree not only in the query of a single keyword, but also in the query range. This is because all the keywords appear in the leaf nodes of the B+ tree and are linked through an ordered linked list. In B tree, it is necessary to complete the search of the query range through middle-order traversal, which is much less efficient.

(Said here I want to cry with satisfaction, finished I also don’t forget to summarize)

Overall, the number of DISK I/O operations is critical to index efficiency. Although the traditional binary tree data structure is efficient in searching data, it is easy to increase the number of disk I/O operations, affecting the efficiency of index use. Therefore, we tend to use “chunky” data structures when constructing indexes.

Both B tree and B+ tree can be used as index data structures. In MySQL, B+ tree is used. The B+ tree is more stable in query performance.

The interviewer took a sip of the cold coffee sitting nearby.

(The little girl has something.)

“Continuously updated…”

Love three link 💖

Finally, thank you for reading. The purpose of the article is to record and share, if there are obvious flaws in the article also welcome to point out, we study together in the discussion. Much appreciated!

If you find this article useful, give it a thumbs up. Your encouragement and support keeps me going

Welcome to follow my wechat public account [back-end program Yuan], we discuss code and life together.

In addition, I set up a technical exchange group, welcome everyone to study and exchange in the group.

This article refer to

Data Structures and Algorithms;
Geek time: SQL must know must know.