After reading this article, you will know:

  • B tree
    • Contrast balanced binary trees with B trees
    • How do I find data in a B tree
    • How can B trees be balanced
    • Usage scenarios
  • B + tree
  • Thanks

In the previous article “3 Minutes to understand the complete binary tree, balanced binary tree, binary search tree”, we learned about the functions and characteristics of several special binary trees, know that they can improve the efficiency of searching data, but it should be noted that this refers to the search in memory. If there is a large amount of data, it is impossible to read it into the memory at one time. At this time, you need to consider how to quickly find the needed data in the disk.

Today this article will introduce the “B tree, B+ tree”, their use scenario is: find a large amount of data in the disk.

B tree

B tree is often called “B minus tree (B- tree)”, also known as balanced multipath (that is, more than two subtrees) search tree, it and balanced binary tree differences in the following points:

  1. A balanced binary tree node can have at most two subtrees, while a B tree can have multiple subtrees per node. A B-tree of order M means that the tree has at most M subtrees per node
  2. Balanced binary trees have only one data and two Pointers to children per node, whereas B trees have k-1 keywords (which can be understood as data) and K subtrees per intermediate node (**k is between order M and M/2, and M/2 is rounded up)
  3. All leaf nodes of the B tree are in the same layer, and the leaf nodes only have keywords, and the pointer to the child is null

The same point with balanced binary tree is that the node data size of B tree is also large according to the left and right, and the size comparison between the subtree and node determines the location of the subtree pointer.

It might be a little bit confusing to look at, but let’s look at a picture of a balanced binary tree versus a B tree.

Contrast balanced binary trees with B trees

The nodes of the balanced binary tree are shown in the figure below. Each node has one data and at most two subtrees:

Each node in a B tree consists of two parts:

  1. Keywords (which can be understood as data)
  2. A pointer to the child node

Each node can have more than one data and at the same time have data number plus a subtree. At the same time, the data of the left subtree of each node is smaller than that of the current node, and the data of the right subtree is larger than that of the current node:

The figure above is for the convenience of readers to understand the content of each node of the B-tree. The actual graph is still drawn as a circle to represent each node.

Now that we know the difference between nodes, let’s take a look at the definition of a B-tree. A B-tree must satisfy the following conditions:

  1. If the root is not terminal, there are at least two subtrees
  2. All non-leaf nodes except root nodes have at least M/2 subtrees and at most M subtrees (the number of key words is subtree minus one)
  3. All the leaf nodes are in the same layer

Use a diagram to balance binary trees and B trees:

As you can see, each node of the B-tree can represent more information, so the whole tree is “fatter”. This reduces the number of DISK I/OS during the process of retrieving data from disk (read into memory first, then search), thus increasing the search speed.

How do I find data in a B tree

Because of the subtree size sorting rule, when looking for data in a B tree, we generally need to:

  1. Starting at the root, if the data is smaller than the root, go to the left subtree, otherwise go to the right subtree
  2. Compare it with multiple keywords in the subtree to find the range in which it resides, and then continue the search in the subtree corresponding to the range
  3. This cycle continues until the leaf node is found or not found

How can B trees be balanced

We know that balanced trees speed up lookups because something is done to keep things balanced as they are added and removed.

The equilibrium conditions of balanced binary trees are: the height difference between left and right subtrees is not more than 1; There are three equilibrium conditions for B trees:

  1. The leaf nodes are all in the same layer
  2. The key word count of each node is the number of subtrees minus one (the number of subtrees k is between the order M of the tree and one half of it
  3. The keys of the subtree are in the order of the smallest on the left and the largest on the right

In other words, a third-order B-tree (that is, a node has at most three subtrees) has at least 1 and at most 2 key words for each node. If the subtree to add data has the most key words, it needs to split nodes and adjust the structure of the tree.

We found a great GIF on the Internet to analyze how to maintain balance when adding elements to a B-tree.

This graph is used to represent the process of inserting the following sets of data into the fourth-order B-tree:

1 7 8 8 6 3 6 21 5 15 15 6 32 23 45 65 7 8 65 4

You are advised to enlarge the view.

Since I’m a bit lazy, let’s analyze the process of adding a B tree based on the previous steps:

  1. First of all, it is clear that the fourth-order B-tree means that each node has at most four subtrees and three keywords, and at least two subtrees and one keyword
  2. Add 6, the first node, nothing to say
  3. Add 10. A maximum of three keywords can be added to the root node in sequence
  4. Add 4, and you can still put it in the root node
  5. When 14 is added, the maximum limit of keywords is exceeded and 14 needs to be added as a subtree. At the same time, in order to ensure that “all leaf nodes are in the same layer”, several keywords need to be removed as subtrees:

Split as follows:

This process is complicated, the first to determine the root node to keep a few key words, because “the leaf node with at least two KeZi root node of the tree”, it will take at least two key points out, because “subtree is the key word + 1”, if the root node has two key words, have to have three subtree, unable to meet, So we have to split the three keywords except 6 into subtrees.

Who is in the same subtree as whom? According to the rule that the left subtree is smaller than the keyword and the right subtree is larger than the keyword, 4 is in the left subtree, 10 and 14 are in the right subtree.

Continue to add:

  1. Add 5, put it on the subtree of 4
  2. Add 11 to the right subtree where 10 and 14 are
  3. Add 15, which should be placed on the same subtree as 10, 11, and 14 by size, but have to be split because the key word limit has been exceeded

Since “the root nodes must all be at the same level”, we cannot add subtrees to the existing left and right subtrees, only to 6; But if 6 has three subtree, it must have two keywords, keyword ascension who do, it all depends on who does 6 in the middle of the subtree, because all the keywords right subtree is greater than the parent node keyword, so the promotion keywords only is smaller than the future key words in the right subtree, then can consider only 10 and 11.

There is no subtree smaller than this, so we can only go up by 11:

The same logic applies to adding elements:

  1. First consider whether the subtree to be inserted has exceeded the key word limit
  2. Otherwise, if the position to be inserted is a leaf node, only one keyword can be removed and added to the parent node of the position to be inserted
  3. If the node is not a leaf, the child must be removed from another child tree to be the child of the newly inserted element

The same is true for deletion. After deleting the child, it is necessary to consider whether the parent node still meets the condition that the subtree K is between M/2 and M. If it does not meet the condition, it is necessary to dismember the subtree from other nodes or even modify the related subtree structure to maintain balance.

In short, the process of adding and deleting is very complicated, and there are many conditions to consider. The specific implementation will not be investigated in detail. Here we have a basic understanding.

It is this complex balancing operation that enables the balanced B-tree to function as a quick look-up on disk.

Usage scenarios

This section is excerpted from: A Brief Introduction to Algorithms and Data Structures: B-trees of balanced Lookup trees

B/B+ tree is commonly used in file systems and database systems. By expanding the number of storage nodes, continuous data can be quickly located and accessed, which effectively reduces the search time and improves the storage space locality, thus reducing I/O operations. It is widely used in file systems and databases, such as:

  • Windows: HPFS file system

  • Mac: HFS, HFS+ file system

  • Linux: ResiserFS, XFS, Ext3FS, JFS file system

  • Database: ORACLE, MYSQL, SQLSERVER, etc

  • Database: ORACLE, MYSQL, SQLSERVER, etc

B + tree

This part is mainly learned from the cartoon of “Programmer Grey” : What is B+ tree?

After learning about the B tree, let’s learn about its variant: THE B+ tree, which has higher query performance than the B tree.

A B+ tree must satisfy the following conditions:

  1. The number of subtrees of a node is the same as the number of key words (B tree is one fewer key words than the number of subtrees)
  2. The keyword of the node represents the maximum number in the subtree, which also contains this data
  3. The leaf node contains all the data and conforms to the order of small on the left and large on the right

Briefly summarize the three characteristics of B+ tree:

  1. The number of key words is the same as the subtree
  2. A non-leaf node is used only as an index, and its keywords and child nodes have duplicate elements
  3. The leaf nodes are connected by Pointers

First of all, there is no need to introduce the first point. In B tree, the key word of the node is used to determine the query interval during the query, so the number of key words is one less than the number of subtrees. In B+ trees, the keyword of a node represents the maximum number of subtrees, so the number of key words is equal to the number of subtrees.

Secondly, the keywords of all nodes except the leaf node also exist in the subtree of the next level. Finally, all data are stored in the leaf node.

The maximum keyword of the root node represents the maximum element of the entire B+ tree.

Third, the leaf nodes contain all the data and are arranged in order, whereas B+ trees use a linked list to arrange them so that they can be queried more efficiently.

Since the middle node of a B+ tree contains no actual data but only the maximum data and Pointers of the subtree, more node elements can be contained in the disk page. In other words, with the same data, the B+ tree is shorter and fatter than the B tree, so the query efficiency is faster.

B+ tree search will find leaf nodes, more stable.

Sometimes it is necessary to query the data within a certain range. As the leaf node of B+ tree is an ordered linked list, it only needs to traverse on the leaf node, rather than traverse in order to compare the size as in B tree.

Three advantages of B+ trees:

  1. Lower levels, fewer I/OS
  2. Leaf nodes need to be queried every time, and the query performance is stable
  3. Leaf nodes form an ordered linked list, and range query is convenient

The process of adding elements to a B+ tree will not be studied in depth, but it will be used later. Here is a diagram of dynamically adding elements to a B+ tree:

Thanks