Common index data structures
Today’s database index structure mainly uses the following three data structures
- B-Tree
- B+Tree
- LSM (Log-structured Merge-trees)
For the above three data structures. We have the following databases for which we are relatively familiar:
- MySQL: In the case of InnoDB engine, is a data structure that uses B+ tree as index
- PostgresSQL: The most advanced database in the world, supports a variety of data structure index, one of our most common type is b-tree index.
CREATE INDEX ON rel_8192_$idx.user_live_rights using btree (user_id, right_id, right_type, scope_id) where status ! = 'deleted';Copy the code
- MongoDB: Advanced NoSQL database. Declare yourself in official documents as usingB-TreeThe index. The original content of the official document is as follows
MongoDB indexes use a B-tree data structure
But MongoDB implements an LSM-based index structure in the “Wired Tiger” engine. You can view the performance comparisonBtree vs LSM - HBase: HBase uses the LSM tree for file storage.
Data structure review
B-Tree
B-tree is a self-balanced search tree with the following characteristics:
-
- Multipath, non-binary tree
- Each node holds both indexes and data
- Search is equivalent to binary search
B+Tree
B+ trees are variants of B- trees and have the following characteristics:
-
- Multiplex is not binary
- Only leaf nodes hold data
- Search is equivalent to binary search
- Added pointer to adjacent contacts
So you can see the difference between a B- tree and a B+ tree
-
- The time complexity of B+ tree query is fixed at LogN, and the time complexity of B-tree query is O(1).
- Pointers to adjacent contacts of B+ tree can greatly increase interval access and can be used in range query, etc., while the key and data of each node of B- tree cannot be searched in the interval.
- B+ trees are better suited for external storage, that is, disk storage. Since the internal nodes have no data field, each node can index a larger and more accurate range
You can also see the evolution of the tree in the figure below
LSM Tree
The basic idea
Lsm-tree (Log Struct Merge Tree), in which the Log idea is derived from the Log Structured FileSystem, rather than the Write Ahead Log of the database. The tree in LSM-tree can theoretically be replaced by any ordered structure. Therefore, the kernel of LSM-tree is the Merge idea, which makes full use of the multi-layer storage structure and buffers write multiple layers, thus achieving better write performance.
As shown in the figure above, the main process is as follows:
-
- For a write operation, write to memTable first
- When memtable reaches certain limits, this part becomes immutable memtable.
- When imMUTABLE memtable reaches a certain limit, it is flushed to disk, that is, sstable
- Sstable operations are performed again
Primary data structure
There are three main structures that are important for LSM trees
-
- MemTable: MemTable is an in-memory data structure used to store the latest data and organize the data orderly by Key. LSM trees do not have a clear definition of data structure for organizing data orderly. For example, Hbase uses skip tables to ensure the order of keys in memory. Because data is temporarily stored in memory, memory is not reliable storage. If a power failure occurs, data will be lost. Therefore, write-ahead logging (WAL) is used to ensure data reliability.
- **Immutable MemTable: ** When MemTable reaches a certain size, it becomes Immutable MemTable. Immutable MemTable is an intermediate state that transforms a MemTable into an SSTable. Write operations are handled by the new MemTable and do not block data update operations during dump.
- SSTable(Sorted Sequence Table) : a collection of ordered key-value pairs. It is the data structure of the LSM tree group on the * disk. To speed up the reading of the SSTable, you can set up key indexes and Bloom filters to speed up the search for keys.
LSM number of data updates are journalized, when a data update is directly append an update record to complete. The purpose of this design is to flush Immutable MemTable to persistent storage without changing the keys in the SSTable.
Therefore, when the MemTable reaches a certain size flush and the persistent store becomes an SSTable, there may be records with the same Key in different SstAbles, of course, the latest record is accurate. While this design greatly improves write performance, it also brings some problems:
-
- Redundant storage. For a key, virtually all records except the most recent record are redundant and useless, but still occupy storage space. Therefore, a Compact operation (merging multiple SStAbles) is required to remove redundant records.
- When reading, you need to search backwards from the latest record until you find a key. The worst-case scenario is to query all of the Sstables, which can be optimized by using the index/Bloom filter mentioned earlier.
The internal of the Bloom filter relies on the hash algorithm. When detecting whether a certain piece of data is seen or not, there is a certain probability of False Positive, but there is definitely no False Negative. That is, when the Bloom filter thinks a piece of data has been present, it is likely to have been present; But if the Bloom filter considers a piece of data to have never been present, it must have never been present. This feature fits the need here to check if a piece of data is missing.
The characteristics of
-
When data is written, it is first cached in memory in an ordered tree structure called memtable. At the same time, it triggers the update of related structures, such as Bloom filter and sparse index.
-
When memtables are large enough, they are written to disk at once to generate an internal ordered segment file. This process is continuous writing, so it is highly efficient.
-
- When making a query, first check the Bloom filter. If the Bloom filter reports that the data does not exist, return that it does not exist. Otherwise, query each segment in order from new to old.
-
When querying each segment, binary search is first used to retrieve the corresponding sparse index and find the offset range where the data resides. It then reads the data in that range on the disk, performs a binary lookup again and gets the results.
-
Run a compaction operation in the background periodically to merge multiple files into a larger file to ensure that query efficiency does not deteriorate.
From B-Tree To LSM
Mainstream relational databases use B/B+ Tree as the data structure to build their indexes, because B tree provides the highest query efficiency in theory –. However, the pursuit of query performance also causes the corresponding disadvantage of B Tree, that is, each time a data is inserted or deleted, the index needs to be updated, resulting in a disk I/O. This feature determines that B Tree is only suitable for scenarios with frequent reads and few writes. In the scenario of frequent write, a large number of DISK I/OS will be generated, resulting in a performance slump. This kind of application scenario is common in traditional relational database.
While LSM tree avoids disk I/O overhead in frequent write scenarios, although its query efficiency is not idealBut still very fast and acceptable. In essence, the LSM tree sacrifices some query performance for considerable write performance. This is very important for key-value or journaled databases.
Is a B+Tree index better than a traditional B-tree index
MongoDB
Why Mongodb index uses B tree, Mysql index uses B+ tree? This article focuses on the description of MongoDB because it is a document database and an aggregation database, so the b-tree data structure is more suitable for such scenarios.
But is the actual MongoDB implementation also using the above described leaf node storage data b-tree index?
WiredTiger Storage Engine is the official documentation for MongoDB. WiredTiger has become the default MongoDB engine since MongoDB 3.2.
In the section of Tuning Page Size and Compression in the official WiredTiger document, there is a clear description of the indexing implementation:
WiredTiger maintains a table’s data in memory using a data structure called a B-Tree ( B+ Tree to be specific), referring to the nodes of a B-Tree as pages. Internal pages carry only keys. The leaf pages store both keys and values.
It is clearly pointed out that MongoDB is also using B+Tree index
For the data engine “MMAPv1” previously adopted by MongoDB, no clear literature has been found indicating that the original data index is the b-tree implementation of leaf node to store data.
See What’s New in MongoDB 3.0 Part 3: Performance & Efficiency Gains, New Storage Architecture for a detailed comparison of MongoDB’s New and old data engines
Postgres
We can see the following statement when using Pg to build database index
CREATE INDEX ON rel_8192_$idx.user_live_rights using btree (user_id, right_id, right_type, scope_id) where status ! = 'deleted';Copy the code
This statement explicitly states that using btree is used when creating a database index. But is this a btree that we remember as a btree whose leaves also store data?
The PostgreSQL B-tree is a variation of the high-concurrency B-tree management algorithm. For details about the concurrency algorithm, see the Github description. The main improvements are as follows
- compared to a classic B-tree, L&Y adds a right-link pointer to each page to the page’s right sibling.
- It also adds a “high key” to each page, which is an upper bound on the keys that are allowed on that page.
These two additions make it possible detect a concurrent page split, which allows the tree to be searched without holding any read locks (except to keep a single page from being modified while reading it).
PostgreSQL b-tree index structure PostgreSQL B-tree index Structure
conclusion
The advantages of B + Tree
- B+ trees have lower tree heights: the tree height of the balanced tree O(h)=O(logdN), where D is the output of each node. A red black Tree has an outgo of 2, and a B+ Tree has an outgo of 2, so the height of a red black Tree is obviously much higher than that of a B+ Tree.
- Principle of disk access: The operating system generally divides the memory and disk into fixed size blocks, and each block is called a page. The memory and disk exchange data in the unit of pages.
-
- The database system sets the size of one node of the index to the size of a page, so that a node can be fully loaded in one I/O.
- If the data is not on the same disk block, it is often necessary to move the brake arm for pathfinding, which, due to its physical structure, causes inefficiency in movement and increases disk reading time.
- A B+ tree has a lower tree height and the number of seek times is proportional to the tree height. Access on the same disk block takes only a short disk rotation time, so B+ trees are more suitable for disk data reading.
- Disk prefetch feature: To reduce disk I/O operations, disks are not read strictly on demand but prefetch data every time. During the prefetch process, sequential data reads are performed on disks. Sequential data reads do not require disk seek and require only a short disk rotation time. In addition, neighboring nodes can also be preloaded using the read-ahead feature.
Resources Resources
- MongoDB WiredTiger Internals
- Indexes in PostgreSQL — 4 (Btree)
- Procedure for adding, deleting, modifying, and checking LSM numbers
- MONGODB MANUAL – Indexes
- PostgreSQL B-tree index structure
- wiredtiger manual
- The Difference Between B-trees and B+trees
- Why does MySQL use B+ trees
- LSM-Trees and B-Trees: The Best of Both Worlds
- The advantages of an LSM vs a B-Tree
- B-Tree-vs-Log-Structured-Merge-Tree
- MongoDB · Kernel features · WiredTiger Page Excommunication
- Postgres B tree algorithm parsing
- Discuss B-/B+ tree from MongoDB and Mysql
- From Btree To LSM-tree
- What is the principle of the LSM algorithm? : Zhihu topic discussion. It provides many implementation details of LSM
- Why do Mongodb indexes use B+ trees while Mysql indexes use B+ trees?
- LSM tree details: The two LSM merge strategies are explained in detail, as well as the advantages and disadvantages of each strategy
- Effective MongoDB Indexing (Part 1)
- [mongodb] Comparison between MMAP and wiredTiger
- MongoDB Engines: MMAPV1 Vs WiredTiger
- Read, write & space amplification – B-Tree vs LSM
- Indexes in PostgreSQL — 4 (Btree)
- Comparison between old and new storage engines, WiredTiger and MMAPv1