The introduction

A year later, I remembered that when I looked at the database, I saw the B+ tree, which is the data structure used by the index of the database. Reorganize and see how much you haven’t forgotten.

An overview of the

So before we do the B+ tree, let’s look at the binary search tree (1,2,3,4,5,6,7)

Yeah, it looks something like this. Sure, it’s fast to find something in a binary search tree, binary search. But think of a database looking up data:

Select * from user where id > 10. So how do B+ trees solve this problem?

What is the most efficient data structure for interval lookup? Array, as long as you find the index of the element whose id is 10, then everything after that will match.

So I’m going to change the tree so that the leaves of the tree point directly to the subscripts of the array. The modified structure is as follows:

Select * from user where id > 2 and id < 5 select * from user where id > 2 and id < 5 That’s right. That’s a B plus tree.

I don’t know where this structure came from, but I suddenly realized today that its storage mode is very similar to that of a skip watch. Was it inspired by the skip watch? Or were they inspired by B+ trees? I don’t know.

extended

Good. B+ tree integration is clear. New problem. If the database uses this data structure for storage, it is certainly not practical to put all the data in memory, it must be stored in the hard disk, to be searched and then read in a file. But if you put it in a file, it is bound to cause SLOW IO, every time you read the node to access the file, if the tree to the height is very exaggerated, light IO will run out of patience.

In that case, lower the IO and increase the number of nodes at each level of the tree, so that the binary tree becomes an n-fork tree (which it does).

If it is a 3-tree with height 3 (the height is the height of the index tree), the indexable array length is :(3^4=81); If it is a 5-fork tree, height is 3, indexable array length is :(5^4=625); If it is a 100-fork tree, height is 3 and indexable length is :(100^4= 100 million). Index 100 million data volume, height is only 3, meaning that 3 IO can be located. Perfect.

The tree forks too much, does it make it less efficient to search for children at each node? Here we can reduce the time complexity by using some lookup algorithms.


The above is the content of my memories, the feeling is not obscure, most of them are recalled again. But look at the old and you’ll know the new. I don’t know how to write something new. That’s what I just learned recently.

The more branches the B+ tree has, the better

That is certainly not the more the better ah, if a layer of all the data stored, to him what use, did not play a role in fast positioning.

But that’s not what I’m talking about. As we know, when the operating system reads data from disk, it reads and manages data by page, a page size of 4KB. When reading data, if the size is greater than 4KB, multiple I/OS will be triggered. At 4kb, it is quite large for a storage node. That said, we’d better have a size of <= 4KB per node, otherwise we’ll fire multiple IO.

However, when a node is updated, it is bound to change its size. How do you make sure that an n-tree is always an N-tree?

Add a node

Actually, it’s very simple. Take it apart when you have more. If the node exceeds its size, split it into two nodes. But after splitting, there are more parents. Then the parent node is removed, all the way to the root node. If the root node is out of size, remove it again and the whole root node comes out.

Remove nodes

In fact, deleting a node does not affect the size of the node beyond the limit. However, in the long run, some node elements may be too few, which seriously affects the query efficiency. Then, if the number of elements in the node is less than n/2, the two adjacent nodes are merged into one node. What if the number of elements exceeds the size? Open bai again.