Q.1. Why do databases design indexes?

There are 1000W books in the library, from which we should find the Path of Architects. When should we check them? So, the librarian designed a set of rules: (1) the first floor put history class, the second floor put literature class, the third floor put IT class… (2) IT class, and divided into software class, hardware class… (3) Software class, and according to the title of the sort… To find a book quickly.

Shenjian = ‘shenjian’; shenjian = ‘shenjian’; shenjian = ‘shenjian’; Therefore, to have an index, used to improve the speed of the database search.

Question 2. Hashes are faster than trees. Why is the index structure designed as a tree?

There are two common types of data structures that speed up lookups:

(1) Hash, such as HashMap, the average time complexity of query/insert/modify/delete is O(1); (2) trees, such as balanced binary search trees, the average time complexity of query/insert/modify/delete is O(lg(n));

As you can see, whether it’s a read request or a write request, a hash index is faster than a tree index. So why is the index structure tree?

Voice-over: 80% of the students can't answer the interview.Copy the code

The index is designed as a tree, which is relevant to the requirements of SQL.

SQL requirement for such a single row query: select * from t where name= “shenjian”; It is true that hash indexes are faster because only one record is queried at a time.

Vo: So, if the business requirements are single-line access, such as passport, you can indeed use hash indexes.Copy the code

But for the SQL requirements for sorting queries:

(1) Group by group

(2) In order by

(3) <, >

(4)…

A hash index degrades to O(n) in time, but the order nature of the tree still maintains O(log(n)) efficiency.

Any design that deviates from the requirements is rogue.

In addition, InnoDB does not support manual hash indexing.

Voiceover: Adaptive hash indexing is the InnoDB kernel mechanism.Copy the code

Question 3. Why do database indexes use B+ trees?

In order to maintain the integrity of the knowledge system, briefly introduce some trees.

First: binary search tree

Binary search trees, shown above, are one of the most well-known data structures that I won’t go into. Why are they not suitable for database indexing?

(1) When there is a large amount of data, the height of the tree will be relatively high; when there is a large amount of data, the query will be slow;

(2) Each node stores only one record, which may lead to many disk I/OS for a query;

Voiceover: This tree is best known because it often appears in college textbooks.Copy the code

Second: B trees

B tree, as shown above, has the following characteristics:

(1) It is no longer binary search, but M-fork search;

(2) Both leaf and non-leaf nodes store data;

(3) In order traversal, can obtain all nodes;

Voiceover, I do not want to introduce this feature: the number of non-root nodes containing keywords J meet, (chrysene m/2 gp)-1 <= j <= m-1, node split to meet this condition.Copy the code

B-trees were created as data structures to implement indexes because they exploit the “locality principle” perfectly.

What is the locality principle?

The logic of the locality principle goes like this:

(1) memory read and write block, disk read and write slow, and much slower;

(2) Disk prefetch: Disk read and write does not read on demand, but prefetch on page. Data is read one page at a time and more data is loaded each time. If the data to be read in the future is in this page, disk I/O can be avoided and efficiency can be improved.

Voiceover: Typically, a page of data is 4K for an operating system and 16K for MySQL.Copy the code

(3) The principle of locality: software design should try to follow the “data read concentration” and “use a data, the probability of using its nearby data”, so that disk prefetch can fully improve disk IO;

Why are B-trees good for indexing?

(1) Due to the bifurcation of M, the height can be greatly reduced;

(2) Each node can store J records. If the node size is set to page size, such as 4K, the prefetch feature can be fully utilized to greatly reduce disk IO;

Third: B+ trees

The B+ tree, as shown in the figure above, is still an M-fork search tree, with some improvements made on the basis of the B tree:

(1) Non-leaf nodes no longer store data, and data is only stored on leaf nodes of the same layer;

Voice-over: In a B+ tree, the path from the root to each node is the same length, which is not the case in a B tree.Copy the code

(2) Between leaves, a linked list is added to obtain all nodes, and middle-order traversal is no longer needed;

These improvements give B+ trees superior properties over B trees:

(1) Range search: after min and Max are located, the middle leaf node is the result set without middle order backtracking; Voice-over: Range queries are used a lot in SQL, and this is the biggest advantage of B+ trees over B trees.

(2) The leaf node stores the actual record line, which is relatively close and suitable for disk storage with large data volume; Non-leaf node storage record PK, used for query acceleration, suitable for memory storage;

(3) If the non-leaf node does not store the actual record, but only stores the KEY of the record, then the B+ tree can store more indexes with the same memory;

And finally, to quantify, why is the height of the B+ tree of the m cross much, much lower than the height of the binary search tree?

Do the math:

(1) Locality principle, the size of a node is set as one page, one page 4K, assuming that a KEY has 8 bytes, a node can store 500 keys, that is, J =500;

(2) m fork tree, approximately m/2<= j <=m, that is, it can be almost 1000 fork tree;

(3) Then:

Layer 1 tree: 1 node, 1 x 500 keys, size 4K

Layer 2 tree: 1000 nodes, 1000500=50W keys, size 10004K=4M

Three-layer tree: 10001000 nodes, 10001000500= 500 million keys, 10001000 x 4K=4G

Voiceover: Well, please check if there is any mistake.

As you can see, storing a large amount of data (500 million) does not require much tree depth (height 3), nor does the index take up much memory (4G).

conclusion

(1) Database index is used to speed up the query;

(2) Although the hash index is O(1), the tree index is O(log(n)), but SQL has many “order” requirements, so the database uses tree index;

(3) InnoDB does not support manual creation of hash indexes;

(4) The idea of data prefetch is as follows: Disk read and write does not read data on demand, but prefetch data on page. The system reads one page of data at a time and loads more data each time to reduce disk I/OS in the future

(5) Locality principle: software design should try to follow the “data read concentration” and “use a data, the probability of using its nearby data”, so that disk prefetch can fully improve disk IO

(5) Database index is most commonly used B+ tree:

  • Very suitable for disk storage, can make full use of the principle of locality, disk preread;
  • Very low tree height, able to store a lot of data;
  • The index itself takes up very little memory;
  • Can well support single point query, range query, ordered query;

Original link: mp.weixin.qq.com/s?__biz=MjM…