This section is very informative, we want to grasp the LevelDB building structure as a whole. Once we are familiar with the overall structure, we can then break it down to learn its subtle details.
A metaphor
LevelDB is a bit like a building in that it is divided into the ground and the ground, disk and memory, and the ground is divided into many levels like the earth’s crust, and the different levels of data are periodically moved from top to bottom – deposition. If the cold data at the bottom of the disk is modified, it will enter memory again, and after a period of time it will be persistently flushed back to the shallow layer of the disk file, and then slowly moved down to the bottom layer, like the earth’s water cycle.
Memory structure
LevelDB maintains two jump lists in memory, a read-only Rtable and a modifiable Wtable. Jump lists are explained in detail in my other book, Redis Deep Adventures, so I won’t repeat them here. A jump list is a Set of key-ordered sets, sorted by a global comparator, lexicographical by default. Skip list search and update operations are Log(n) time complexity.
A skip list is a hierarchy of linked lists, the lowest of which stores all the keys and is ordered. Regular lists do not support fast binary lookup, but the special structure of skip lists allows the lowest level of the list to locate the specified node with approximately the efficiency of binary lookup algorithms. The simple understanding is that the jump list has both the ability to locate the ordered array quickly and the ability to add and delete the linked list efficiently. But it comes at a cost and complexity in implementation.
If skip lists only store keys, where does Value go? The answer is that the Value is also in the Key of the jump list. The Key stored in the skip list is a special one. It is a compound structured string that contains both the Key and Value of the key-value pair.
Type indicates the data type. The value is Put or Delete. 0 indicates Delete and 1 indicates Put.
internal_key = key + sequence + type
Key = internal_key_size + internal_key + value_size + value
Copy the code
In the case of a delete operation, the value_size field value is 0 and the value field value is null. We will equate the Delete operation with the Put operation. To save storage space, internal_KEY_size and value_size must be varINT integers.
After the Put and Delete operation logs are written to the log file, the Key and value pairs are merged into compound keys and inserted into the specified location of the WTABLE.
Because WTABLE supports multithreaded reads and writes, access to it requires locking control. The Rtable is read-only, so it’s not needed, but it’s short-lived, and once it’s generated, it’s quickly serialized to disk by an asynchronous thread, and then it’s empty. However, asynchronous thread serialization also takes some time. If the WTABLE grows too fast and becomes full quickly, what if the RTABLE is not serialized yet and the Wtable needs to be changed? At this point, the writer thread blocks and waits for the asynchronous thread to complete serialization, which is one of LevelDB’s problems and an optimization point for RocksDB in the future.
There is also a log file that logs recent write operations. If LevelDB encounters an unexpected outage, unpersisted WTABLE and Rtable data will be lost. The lost data must be recovered by replaying the command data in the log file. Note that there are also two log files, which correspond to the jump list in memory. When the WTable changes, the log file changes as well. After the RTABLE is successfully dropped, the read-only log file can be deleted.
The disk structure
LevelDB stores a number of Sorted String Table SST files on disk. SST stands for Sorted String Table. Each file has a hierarchy, and each hierarchy has multiple files. The contents of the lower level files come from the upper level, and eventually they all come from tier 0 files, which in turn come from rtable serialization in memory. An Rtable is serialized as a complete layer 0 file. This is what we called “subsidence” earlier.
Serialization from an IN-memory Rtable to a tier 0 SST file is called a Minor Compaction. Sinking a tier N SST file to a tier N +1 SST file is called a Major Compaction. The reason for this distinction is that Minor is fast and inexpensive, so serializing the Rtable as a complete SST file is done. Major involves merging multiple files, which is resource-consuming and slow. The higher the level, the greater the total file size. There is a level capacity formula in the LevelDB source code. Generally, each SST file is about the same size, so the difference is that each layer has a different number of files.
capacity = level > 0 && 10^(level+1) M
Copy the code
The keys in each file are ordered, which means that there is a definite range of keys inside the file. One obvious difference between layer 0 files and other layer files is that there is no overlap between the scope of the files inside the other layer, they are strictly shard according to the Key order. The contents of layer 0 files are dumped directly from the memory. Therefore, the Key values of files at layer 0 overlap.
When a read miss occurs in the memory and the disk is searched, the memory will be searched from layer 0 first. If the search fails, the disk will be searched at a deeper level.
With other levels, the search is faster because you can quickly determine which file a Key is likely to be in based on its scope. But for layer 0, because the file Key scope overlaps, it may exist in multiple files, which need to be searched. As a result, LevelDB limits the number of tier 0 files, and if the number of files exceeds the default of four, it forces a “sinking” operation to tier 1. This “sinking” operation is a Major Compaction.
Key ranges, hierarchies, and other meta information for all files are stored in the MANIFEST file in the database directory. When the database is opened, you can read this file to see the hierarchy of all files and the range of Key values.
The MANIFEST file also has a version number, as shown in the file name manifest-000361. Each time the database is reopened, a new MANIFEST file is generated with a different version number, and the old MANIFEST file needs to be deleted.
There is another file in the database directory, CURRENT, which simply contains the file name of the CURRENT MANIFEST. LevelDB reads the CURRENT file first to know which MANIFEST file is valid. In the event of a power outage, there is a low probability intermediate state where the old and new MANIFEST files coexist in the database directory.
The LevelDB database directory does not allow multiple processes to access it at the same time. How does it prevent other processes from accidentally reading or writing to the directory file? If you look closely at the database directory, you’ll also find a file named LOCK that is key to controlling multiple processes’ access to the database. When a process opens a database, it places a mutex lock on the file, which is automatically released when the process terminates.
There is a final, less important operation LOG file called LOG, which records a series of critical operation logs from the database, such as information from each Minor and Major Compaction.
Multi-channel merge
While Compaction is a resource-intensive operation, LevelDB assigns a Compaction to a single asynchronous thread in order to create a stable read and write operation online. If the workload is heavy, the single asynchronous thread can be a bit overwhelming. When asynchronous threads are overwhelmed, online memory reads and writes are also affected. Because the Wtable can change only if the Rtable sinks to disk. Only when the WTable is transformed can a new WTable be created to accommodate further key/value pairs. It’s just one ring after the other.
Let’s look at this Compaction. Minor Compaction is a simple operation that causes data from an Rtable to be dumped to a layer 0 file. Why is it necessary to Compact from layer 0 to layer 1? Because too many layer 0 files will affect the search efficiency. As mentioned earlier, Key ranges overlap between layer 0 files, so a single Key may exist in multiple files, and the number of I/O reads will be magnified by the number of files. A Major Compaction reduces the number of layer 0 files and improves read efficiency. Is it just a matter of sinking down to level 1? Why does LevelDB need so many levels?
So what does LevelDB do? It adopts the multi-way merging algorithm, takes the relevant layer 0 files and layer 1 SST files as input, carries out multi-way merging, generates multiple new layer 1 SST files, and then kills the old SST files, and generates a new MANIFEST file. For each layer 0 file, it searches for SST files that overlap layer 1 files based on the range of Key values. If the number of files at layer 1 is too large, the number of files involved in each multi-way merge will be too expensive. So LevelDB also needs to control the number of files at tier 1, and when tier 1 is full, it continues to sink to levels 2, 3, 4, and so on.
The non-0 layer consumes less resources because the Key value range of a single file is limited, the number of files that can cover the next layer is limited, and the number of input files that participate in the multi-merge is much less. However, there is a loophole in this logic, which is that there is a 10-fold difference in the number of files at the top and the number of files at the bottom. Based on the average range interval, this means that the average value range of a file at the top will cover 10 files at the bottom. This means that a non-zero multiple merge operation costs a lot of resources, and Major Compaction is a very resource-intensive operation.
In the next section, we’ll delve into the internal structure of disk files to see what each sstable looks like inside