We all know that in order to avoid full table scan when querying data, we will add indexes, which will greatly reduce the number of traversals, thus increasing the efficiency of the query. An index is essentially a data structure that optimizes queries. MySQL’s default index data structure is B+ tree.

Why does MySQL use B+ trees as the default index data structure? Why not hash, and why B+ tree out of all the trees? I also have this question of course, after consulting some data, and listen to me.

1. Hash structure

There are two types of data structures that speed up queries:

  • Hash: Hash is a very common data type. For example, hashMap is used to locate the key in the array after the hash operation. The average time complexity of the query, insert, modify, and delete of the hash is O (1).
  • Tree: For balanced binary search tree, the average time complexity of query/insert/modify/delete is O (log2 (n)).

Hash is faster than tree indexing for both read and write requests, so why use tree indexing? Here’s an example:

SELECT	* FROM	student WHERE age BETWEEN 18 AND 24
Copy the code

When hash is used for range/sort conditions, the time complexity of the hash lookup is reduced to O (n), and the tree is ordered, which still ensures the efficiency of O (log2 (n)).

2. Tree structure

1. Binary search tree

1. Binary tree features
  • A node can have only two child nodes
  • The left child node is smaller than the current node, and the right child node is greater than or equal to the current node
2. Retrieval process

The search times for a node with a depth of 1 are 1. The search times for a node with a depth of 2 are 2. The search times for a node with a depth of N are N

Summary: So if the tree height is 3, the average number of lookups is (1+2+2+3+3+3) /6=2.3

Problem 3.
  • If you keep inserting children larger than the parent, the tree will tilt to the right so much that the tree will grow taller

  • To make matters worse, if it is worthwhile to insert nodes lower down, the tree will split, which will undoubtedly make the retrieval slower

So we need a different kind of tree to avoid this situation where the tree always leans to one side, the AVL balanced tree.

2. The AVL tree

1. The AVL tree
  • It doesn’t tilt all the way to one side because the AVL tree rotates itself to maintain balance
  • Equilibrium condition: the absolute value (equilibrium factor) of the difference between the height of left and right subtrees of each node is at most 1
2. Retrieval process

For example, the data to be retrieved is 7

  • Compare the size of root nodes 4 and 7, 7>4, and continue to the right
  • Compare the size of child nodes 6 and 7, 7>6, continue to the right
  • Compare the size of child nodes 8 and 7, 7<8, continue to the left
  • Compare the size of child nodes 7 and 7, 7=7, find the destination node
Problem 2.
  • AVL tree ensures that it will not tilt to one side all the time. When the amount of data is small, the query performance is very good. For example, to search 7 in the figure above, only 4 times of query are needed to get the results

At this point, the question we need to consider is how to make the tree less “lean and tall” and more “chunky” as the data volume increases

3. B tree

1. B tree characteristics
  • A little bit different from a balanced binary tree is that a B tree is a multi-fork tree aka a balanced multi-path lookup tree.
  • Each node contains not only a data key value, but also a data value
2.B Tree search process

For example, the data to be retrieved is 7

  • Compare the size of root nodes 4 and 7, 7>4, and continue to the right
  • Compare the size of child nodes 6.8 and 7, 8>7>6, and continue
  • Compare the size of child nodes 7 and 7, 7=7, find the destination node

It can be seen that the same search for 7, B tree structure is 1 less than the AVL tree lock query

3. The question

This is pretty good lookup performance, so why would you want to bring out B+ trees? Look down.

4. B + tree

1. The characteristics of B + tree
  • Different from B tree, the data of B tree is stored in each node, while B+ tree moves all data to leaf nodes, and there are Pointers between sub-nodes, which can greatly improve the efficiency of range query and facilitate traversing the whole tree
  • A linked list is added between the leaves, so that middle order traversal is no longer required to obtain all nodes
  • Non-leaf nodes of a B+ tree store only keys. In this way, the number of keys stored in each node can be increased and the height of the B+ tree can be reduced
  • Query leaf nodes by non-leaf nodes to obtain corresponding data. All adjacent leaf nodes including non-leaf nodes are combined by linked list. Leaf nodes are ordered and adjacent nodes have sequential reference relationship

Here is just a brief introduction to the query process, the specific query/add/delete process can be through the process of data structure simulation hands-on practice, can be more in-depth understanding.

Data structure simulation address: www.cs.usfca.edu/~galles/vis…