1. Basic disk knowledge
- Page:
Modern operating systems use virtual memory to map to physical memory, which is limited and expensive, so data is persisted on disk. Virtual memory, physical memory, and disk all use the page as the smallest unit of reads. A page is generally 4KB (8 sectors of 512B each: 8 × 512B = 4KB).
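The sector/page arithmetic above can be sketched in a couple of lines (the 512-byte sector size is the classic hard-disk figure assumed here):

```python
# Classic hard-disk geometry assumed for illustration:
# 8 sectors of 512 bytes each make up one 4 KB page.
SECTOR_SIZE = 512        # bytes per sector
SECTORS_PER_PAGE = 8

page_size = SECTOR_SIZE * SECTORS_PER_PAGE
print(page_size)         # 4096 bytes, i.e. 4 KB
```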
- Locality principle:
- When a piece of data is accessed, nearby data tends to be accessed soon afterwards.
- The data a running program needs is usually clustered together.
- Disk prefetch principle:
Disk reads rely on mechanical movement and break down into three parts: seek time, rotational delay, and transfer time. Their sum is the cost of one disk I/O, roughly 9ms, which is about 100,000 times the cost of a memory access;
Disk reads are thus far slower than memory reads, so minimizing the number of I/Os is the key to efficiency.
Because of the locality principle, and because sequential disk reads are efficient (no seek time, very little rotation time), the disk reads a whole page even when only one byte is needed. In other words, disk prefetching usually reads an integer multiple of the page size.
2. Review of tree basics
- Binary search tree: left subtree < root < right subtree;
- B tree: a multi-way balanced tree whose nodes store keys, data, and pointers (each node is essentially an ordered array);
- B+ tree: a multi-way balanced tree whose non-leaf nodes store only pointers and keys, not data, with leaf nodes linked into an ordered list;
- Red-black tree: a loosely balanced tree (strictly balanced trees are too demanding).

Balanced trees prevent binary search trees from degenerating into linked lists; a red-black tree maintains approximate balance to guarantee O(log2(n)) lookups without having to adjust the tree structure too frequently.
The storage structure of binary trees
- Sequential storage (for complete binary trees)
The index relation: a node at index i (0-based) has its children at indexes 2i+1 and 2i+2 and its parent at (i−1)/2.
Note: sequential storage is only suitable for complete binary trees; otherwise indexes no longer line up with nodes and things get messy.
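The index relation can be made concrete with a small sketch (the letters are placeholder keys stored level by level):

```python
# Sequential (array) storage of a complete binary tree, laid out level by level.
# For the node at 0-based index i: children sit at 2i+1 and 2i+2, parent at (i-1)//2.
tree = ['A', 'B', 'C', 'D', 'E', 'F', 'G']

def left(i):   return 2 * i + 1
def right(i):  return 2 * i + 2
def parent(i): return (i - 1) // 2

# Root 'A' has children 'B' and 'C'; 'E' (index 4) has parent 'B' (index 1).
assert tree[left(0)] == 'B' and tree[right(0)] == 'C'
assert tree[parent(4)] == 'B'
```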
- Linked storage
Make sure this part is well understood, as it underpins everything that follows.
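Linked storage, by contrast, gives each node explicit child pointers. A minimal sketch (placeholder keys again):

```python
# Linked storage: each node holds its key plus explicit left/right child pointers.
class TreeNode:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

# A small three-level tree built with pointers:
root = TreeNode('A',
                TreeNode('B', TreeNode('D'), TreeNode('E')),
                TreeNode('C', TreeNode('F'), TreeNode('G')))
```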
3. Why can’t we use binary trees to store database indexes
First the conclusion:
- When a balanced binary tree handles an insert/delete, it usually has to rebalance via left/right rotations;
- Rotations may require loading large parts of the tree, and frequent rotations are inefficient;
- The number of I/Os for a binary tree is approximately O(log2(n));
- For range queries, a binary tree degrades to O(n);
- When a binary search tree degenerates into a linked list, lookups degrade to roughly O(n);
- Binary trees cannot take advantage of disk prefetching.
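The degeneration point is easy to demonstrate: feeding already-sorted keys into a plain (unbalanced) binary search tree chains every node to the right, so the height equals the number of keys and lookups fall from O(log n) to O(n):

```python
# Inserting sorted keys into an unbalanced BST degenerates it into a linked list.
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def height(root):
    return 0 if root is None else 1 + max(height(root.left), height(root.right))

root = None
for k in range(1, 8):      # sorted input 1..7: worst case for an unbalanced BST
    root = insert(root, k)

print(height(root))        # 7 -- one node per level, a linked list in disguise
```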
In fact, range queries alone make binary trees a non-starter for a relational database. Still, for completeness, let's look at the other reasons.
Setting range queries aside: reasons 1, 2, and 5 are solved by the red-black tree, which leaves just two reasons:
- I/O count comparison;
- Utilization of disk prefetch function;
4. I/O count analysis of binary tree
Let's start with I/O counts:
Compared with a binary tree, B trees and B+ trees do not reduce the number of CPU operations; if anything they increase it. But CPU time is negligible next to I/O cost, so the I/O count is the key metric for evaluating a database index.
For a red-black tree, the number of I/Os is approximately log2(n). Why only approximately?
First of all, indexes are stored on disk. Data on disk is mostly contiguous, but as inserts, deletes, and updates accumulate, a lot of fragmentation can build up.
- The storage of indexes on disk is not necessarily continuous;
Here, for the sake of rigor, let’s divide it into two cases:
- Index nodes, that is, tree nodes stored on disk are contiguous;
Suppose a page can store 5 nodes. Suppose the binary tree is as follows:
Note that the serial number only represents the order stored on disk, not the keyword value of the corresponding node.
A binary tree can be stored linked or sequentially. But since we are assuming the nodes sit contiguously on disk, we can treat it as approximately sequential storage; even linked storage just means a pNext pointer to the next contiguous address.
Now suppose the search target is the leftmost leaf, node 16. Thanks to disk prefetch, and since one page holds 5 nodes, the first I/O:
As shown above, the first I/O reads 5 nodes: it brings not only the root into memory but also nodes 2 and 4, apparently saving two I/Os. Looks impressive so far…
The binary search now proceeds: compare node 1, then node 2, then node 4; since all of them are already in memory, no I/O is required.
Once again, the serial number does not represent the key of the node, but simply indicates the order in which the node is placed on disk.
Next, node 8 is needed. It is no longer in memory, so the second I/O again reads a whole page, i.e. 5 nodes:
Although 5 nodes were read this time, only node 8 is actually useful; the others are wasted (this is the essence of why a binary tree cannot exploit prefetching), though the drawback is not yet fully visible. After the comparison, node 16 is needed, so a third read follows:
Node 16 is now found and returned.
This tree has height 4 and only 31 nodes, but a real index never holds just 31 entries. Say we now search for node 32: nodes 17–20 after node 16 were loaded into memory but are useless, and another I/O is needed to load the page containing node 32, which also drags in nodes 33–36 that will never be used.
What if we are searching for node 1,000, or 10,000?
So, as you descend the levels, you get:
- Only one node per page is useful (a binary search descends to a child node, never a sibling);
- The number of I/Os is approximately log2(n);
That is:
- The apparent advantage of the first I/O fades as the levels deepen;
- Even a red-black tree can only maintain a time complexity of log2(n);
Deeper down, the probability that an I/O reads useful data keeps falling, so the I/O count grows rather than shrinks and remains approximately log2(n).
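The walkthrough above can be simulated directly. Using the same assumed numbers (31 nodes stored on disk in their serial order, 5 nodes per page), counting the distinct pages touched on the path 1 → 2 → 4 → 8 → 16 reproduces the three I/Os:

```python
# Simulate the example: nodes are laid out on disk by serial number,
# 5 nodes per page (figures assumed from the text). A page already in
# memory costs no further I/O.
NODES_PER_PAGE = 5

def page_of(serial):
    return (serial - 1) // NODES_PER_PAGE   # serial numbers start at 1

path = [1, 2, 4, 8, 16]                     # root down to the leftmost leaf
pages_loaded = []
for node in path:
    p = page_of(node)
    if p not in pages_loaded:
        pages_loaded.append(p)              # one list entry per disk I/O

print(pages_loaded)   # [0, 1, 3] -> 3 I/Os; only the first page served 3 nodes
```

Note how only the first page (nodes 1–5) serves multiple comparisons; every later page contributes a single useful node.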
5. B/B+ trees
A B tree is a multi-way balanced search tree.
The clever things about a B tree are:
- A node is sized to exactly one page;
- A node can store multiple keys (it is a multi-way tree);
- It is self-balancing.
Combining these three gives:
- When a node is loaded into memory, every key in it is useful: each comparison narrows down which child pointer to follow (for example, ruling out everything to the right);
- Because one node holds many keys, the height of the tree drops sharply.
The neat things about a B+ tree are:
- Non-leaf nodes store no data, further increasing the number of keys a page can hold;
- Leaf nodes store the data and carry a linked-list pointer to the next page, so sequential scans (and hence range queries) are supported.
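The height reduction from a wider fan-out is just a change of logarithm base. A rough sketch (the fan-out of 1024 is an illustrative figure, justified in the next section):

```python
import math

# Minimum number of levels needed to index n keys with branching factor m.
# This is a back-of-the-envelope bound, ignoring node fill factors.
def min_height(n, m):
    return math.ceil(math.log(n, m))

n = 1_000_000
print(min_height(n, 2))      # binary tree: 20 levels -> up to ~20 I/Os
print(min_height(n, 1024))   # 1024-way B+ tree: 2 levels -> 2 I/Os
```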
6. How much data can a B/B+ tree index?
Non-leaf nodes of a B+ tree store: pointers and keys (e.g. the primary key). Leaf nodes store: a linked-list pointer, keys, and the data.
Note that this is not absolute: in some B+ tree implementations the leaves store not the data itself but pointers to it; the lookup returns the pointer and the data is then fetched from the corresponding address. That costs an extra I/O, so it is a trade-off between data volume and I/O count, which we won't discuss here.
For example (the following figures are illustrative): page_size = 16KB, pointers are 8 bytes, keys are 8 bytes, and each row of data is 1KB.
A B tree node:
Rows that fit in one page: 16KB / (1KB + 8byte + 8byte) ≈ 16;
So a B tree of height 3 can store 16 × 16 × 16 = 4096 rows.
That already lowers the tree height compared with a binary tree, and with rows smaller than our assumed 1KB a height-3 B tree could hold more, but it is still nowhere near enough for a large database index.
B+ tree:
As shown above, the core of a B+ tree is that non-leaf nodes do not store data.
This reduces the amount of space taken up by non-leaf nodes, increases the amount of data that can be stored on a page, and minimizes the number of levels in the tree.
Again, let the height be 3: two levels store keys + pointers, and one level of leaf nodes stores the actual data.
Keys that fit in one internal page: 16 × 1024 / (8 + 8) = 1024. Rows that fit in one leaf page: 16KB / (1KB + 8byte + 8byte) ≈ 16 (not perfectly accurate; in reality a page needs only a single linked-list pointer to the next page). Keys addressable through the two internal levels: 1024 × 1024 = 1,048,576.
Each of the 1,048,576 bottom-level pointers points to a page of data, i.e. a leaf node. A leaf page stores 16 rows, so the total amount of data that can be indexed is 1,048,576 × 16 ≈ 16.7 million rows. At height 4, multiply by another 1024: roughly 17 billion…
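The capacity arithmetic above, restated with the same assumed figures (16KB pages, 8-byte keys and pointers, 1KB rows):

```python
# Capacity estimate for a height-3 B+ tree -- all figures assumed, as in the text.
PAGE = 16 * 1024       # page size in bytes
KEY, PTR, ROW = 8, 8, 1024

keys_per_page = PAGE // (KEY + PTR)          # internal page fan-out
rows_per_page = PAGE // (ROW + KEY + PTR)    # rows per leaf page
print(keys_per_page)                          # 1024
print(rows_per_page)                          # 15 (the text rounds this to 16)

# Two internal levels of fan-out 1024 address this many leaf pages:
leaf_pages = keys_per_page ** 2               # 1,048,576
total_rows = leaf_pages * 16                  # using the text's rounded 16 rows/leaf
print(total_rows)                             # 16777216, i.e. ~16.7 million rows
```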
The key to the reasoning above is that each bottom-level internal pointer points to a page storing keys + data + a linked-list pointer. The page layout is annotated below to aid understanding:
Although there are many leaf nodes (one page per leaf node, or even several pages per leaf), they all live on disk, and a page is loaded only after it has been located. The beauty of this design is that indexing a huge amount of data has almost no impact on the I/O count.
But there are downsides:
No matter what the query is, it must descend all the way to a leaf node before it can finish; that is, the number of I/Os is fixed at O(h), or log(n) with the fan-out as the base. This h is usually 2–3 (not counting the root node, which stays resident in memory): a height-3 B+ tree can index tens of millions of rows with two I/Os, and a height-4 tree billions of rows with three I/Os, which is still pretty good.
So, this disadvantage can also be said to be a strength: stability (steady as an old dog 🐶)
7. Practical application
- Red black tree advantages
Red-black trees are often used to hold ordered data in memory, where inserts and deletes are fast and involve no I/O.
- B/B+ tree
Better suited to disk storage: the lower tree height reduces the number of I/Os;
- B tree versus B+ tree
Both belong to the B-tree family, but the B+ tree is better suited to range queries (e.g. MySQL), with a very stable query cost of log(n); the B tree is better suited to key-value aggregate databases (e.g. MongoDB), where the best-case lookup is O(1) because data can live in non-leaf nodes.
In short: red-black trees suit in-memory storage, B trees suit key-value storage, and B+ trees suit range queries.