This article was originally published on the personal official account TechFlow. Original writing is not easy; please follow for more.


Today brings the 10th article in our distributed systems series, in which we continue to talk about the LSMT data structure.

LSMT is an intuitive and simple data structure that is widely used in distributed systems. We discussed it in detail in the previous article, so if you've forgotten it or are new here, click the link below to review the previous installment.

Distributed — LSMT, the high-throughput data structure behind HBase

Introduction to LevelDB

In this article, we'll take a look at LevelDB, the classic KV database engine, to see how it uses and optimizes LSMT.

LevelDB is, as its name suggests, a DB, i.e. a database. Unlike typical relational databases, its data is stored entirely in KV form, i.e. as key-value pairs, and it does not support structured SQL queries; data is accessed only through API calls. In other words, it is a typical NoSQL data engine library. It was first developed and open-sourced by Google, and later optimized by Facebook into the even more popular RocksDB. Many distributed NoSQL databases built since then are based on LevelDB.

If you haven't heard of any of these terms, it doesn't matter. These libraries are easy to get started with but hard to master; understanding the principles before actually using them not only makes things simpler, it also gives you a deeper feel for how they work.

LevelDB's architecture

This is a diagram of LevelDB's architecture. It is a little more complex than the bare LSMT architecture introduced earlier, but the core essence is the same; let's go through it piece by piece.


First, the MemTable and the Immutable MemTable. A MemTable, as we know, is essentially a data structure stored in memory. It can be a SkipList, a red-black tree, or some other balanced structure (LevelDB uses a SkipList); we just need it to live in memory. An Immutable MemTable is also essentially a MemTable, only one that no longer accepts writes. As mentioned earlier, when the contents of the MemTable exceed a certain threshold, we need to write them out to an SSTable file, and this is where the Immutable MemTable comes in.

Before persisting, the MemTable is converted into an Immutable MemTable, after which it is considered immutable. A new MemTable is created to keep serving requests, and the Immutable MemTable is then written out to an SSTable file.
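As a minimal sketch of this rotation in Python (an illustration of the idea only, not LevelDB's actual implementation; a dict stands in for the SkipList, and the class name and threshold are made up):

```python
class MemTableRotation:
    def __init__(self, threshold=4):
        self.threshold = threshold      # max entries before rotation
        self.memtable = {}              # active, writable table (dict stands in for a SkipList)
        self.immutable = None           # frozen table awaiting flush
        self.sstables = []              # flushed sorted runs, newest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            self.rotate()

    def rotate(self):
        # Freeze the current MemTable; a fresh empty one keeps serving writes.
        self.immutable = self.memtable
        self.memtable = {}
        self.flush()                    # in LevelDB this happens in the background

    def flush(self):
        # Persist the frozen table as a sorted run (an SSTable).
        self.sstables.insert(0, sorted(self.immutable.items()))
        self.immutable = None
```

Note that writes never stall on the flush: they go to the new, empty MemTable while the frozen one is being written out.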

For example, the MemTable is like the cash box at a checkout counter. When we open a store, we obviously need a box to hold the money customers pay. When the box is nearly full, we need to deposit the money in the bank. But there is a lot of money in it and customers keep paying, and counting it on the spot would cost us time. So we take the whole box, lock it for safety, and send it off to the bank together with its contents. Meanwhile the cashier cannot work without a box, so we take out a new one for collecting money.

In this analogy, the box originally used to collect money is the MemTable; once locked, it becomes the Immutable MemTable. The comparison may not be perfectly accurate, but it should make the idea easy to grasp.

Next is the .log file, which is easy to understand and also appeared in the earlier LSMT article: it stores the changed data. Similar to a database binlog, it is used to recover data when the system fails and restarts.

In the original LSMT, SSTables were stored sequentially, so we also queried them sequentially: when the first SSTable did not contain the key we wanted, we moved on to the next file. In LevelDB, SSTables are stored in levels, the first level being Level0, the second Level1, and so on. This is how LevelDB got its name.

An SSTable is essentially a list of key-value pairs in which the keys are ordered. Since the keys are ordered, each SSTable obviously has a minimum and a maximum key. By recording these, we can quickly determine which SSTables might contain the key being queried, saving lookup time.


The file that records the minimum and maximum keys of each SSTable is called the manifest. Besides the minimum and maximum keys, it also records the level each SSTable belongs to and its file name. The following figure can serve as a reference:
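As a toy illustration of how those recorded key ranges narrow down a lookup (the manifest layout and file names here are invented for illustration, not LevelDB's real format):

```python
# A toy manifest: each entry records an SSTable's level, file name,
# and its smallest and largest key.
manifest = [
    {"level": 0, "file": "000005.sst", "min": "apple", "max": "melon"},
    {"level": 0, "file": "000007.sst", "min": "grape", "max": "pear"},
    {"level": 1, "file": "000003.sst", "min": "banana", "max": "kiwi"},
]

def candidate_sstables(key):
    """Return the files whose [min, max] key range may contain `key`."""
    return [e["file"] for e in manifest if e["min"] <= key <= e["max"]]
```

A key outside every recorded range needs no disk reads at all, which is exactly the time saving described above.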


Finally, there is the Current file, whose name already suggests a pointer, and indeed that is what it is. In an actual run there is more than one manifest file: as compactions and other operations proceed, new manifest files are generated. We need a pointer recording which manifest file is the latest so it can be found easily. And since a manifest holds no small amount of data, we cannot keep it all in memory; keeping a pointer in a file works best.

LevelDB's basic operations (add, delete, modify, query)

LevelDB writes

The write, delete, and modify operations in LevelDB are basically the same as in bare LSMT.

First, the changed data is written to the .log file. This persists the data so it can be recovered if the system crashes.

Once the .log write succeeds, the data is written to the MemTable. Because the MemTable in LevelDB is implemented as a SkipList, writing is also fast, with O(log N) complexity. If the MemTable's capacity reaches or exceeds the threshold, a further write to an SSTable is triggered: the MemTable is first converted to an Immutable MemTable, then an empty MemTable is created for subsequent requests. After the dump command is issued, the Immutable MemTable is written to an SSTable file for storage.


The above process is much the same as with LSMT, with a few minor differences. Strictly speaking, LevelDB does not support in-place modification either; a modify is converted into an insert of new data, or a delete of the old data followed by an insert. The two are essentially the same, and the duplicates are merged away during subsequent compaction.
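The write path just described — log first, then MemTable, with deletes turned into tombstone inserts — can be sketched as follows. This is a minimal illustration; the JSON log format and the tombstone marker are made up here and are not LevelDB's real ones.

```python
import json
import os
import tempfile

TOMBSTONE = "__deleted__"   # hypothetical tombstone marker

class WritePath:
    def __init__(self, log_path):
        self.log = open(log_path, "a")
        self.memtable = {}

    def _append_log(self, key, value):
        # Persist the change to the .log file before acknowledging it.
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()

    def put(self, key, value):
        self._append_log(key, value)     # 1. write-ahead log
        self.memtable[key] = value       # 2. then the MemTable

    def delete(self, key):
        self.put(key, TOMBSTONE)         # delete = insert a tombstone

# Example: writes go to a throwaway .log file.
wp = WritePath(os.path.join(tempfile.mkdtemp(), "000001.log"))
wp.put("k1", "v1")
wp.delete("k1")
```

Replaying the log on restart rebuilds the MemTable, which is exactly the crash-recovery role described above.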

LevelDB reads

Reads in LevelDB differ slightly from LSMT. Let's take a closer look with the help of the following diagram.


First, when we perform a lookup, we search the MemTable and the Immutable MemTable. This is easy to understand, since both are in-memory data structures that support fast lookup. Why search the Immutable MemTable too? Because after a MemTable is converted into an Immutable MemTable, there is a wait before it is written to disk; the flush is not executed immediately. Until the write completes, the Immutable MemTable may still hold data that exists nowhere else, so it must be searched.

If the key is found in neither the MemTable nor the Immutable MemTable, we read from disk to find it.

Unlike bare LSMT, which searches SSTables one by one, LevelDB first reads the manifest file and, based on the key ranges recorded there, finds the SSTables that might contain the key.

The same key may appear in SSTables at different levels at the same time, but given how LevelDB writes SSTables, the later data is written, the newer it is; in other words, the smaller the level number, the newer the data. Therefore, if multiple values are found, the result from the upper (lower-numbered) level is returned first.
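The lookup order described above can be sketched roughly like this (for simplicity each level is treated as a single lookup table; in reality a level holds multiple SSTables, and Level0 files may overlap):

```python
def get(key, memtable, immutable, levels):
    """Look up `key` newest-first: MemTable, then Immutable MemTable,
    then levels[0] (Level0), levels[1] (Level1), and so on."""
    for table in [memtable, immutable, *levels]:
        if table is not None and key in table:
            return table[key]   # first hit is the newest value
    return None                 # key not found anywhere
```

Because the search stops at the first hit, a newer value at an upper level automatically shadows any older copies further down.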

LevelDB's reads and writes can thus be seen as optimizations layered on the original LSMT; nothing here is too hard to understand, and it is all fairly straightforward. In scenarios where memory is plentiful and lookup performance matters, we can cache the manifest in memory, saving the time spent re-reading the manifest file and speeding things up. At the same time, though, maintaining the cache has its own cost, which will be covered in more detail in a future article on caching.

LevelDB's compaction strategy

Finally, let's look at LevelDB's compaction strategy, which is the essence of LevelDB.

Google's paper Bigtable: A Distributed Storage System for Structured Data can be regarded as the theoretical basis of LevelDB. The BigTable paper mentions three compaction strategies.

The first strategy is called minor Compaction. It is as simple as it sounds: dump the data in a MemTable into an SSTable.

The second strategy is called major Compaction, which merges SSTables at different levels together.

The final strategy is called full Compaction, which merges all SSTables together.

LevelDB implements the first two of these, minor Compaction and major Compaction, which we'll examine in detail.

minor Compaction

A minor Compaction is a simple operation: it writes the data in the MemTable (a SkipList) to disk, producing an SSTable.


SkipList, which we introduced in a previous article, is essentially an ordered linked list. And the SSTables we want to generate happen to be ordered too, so we just iterate through it and write the entries out. For those interested in SkipList, or who want to review it, click the link below:

Distributed — SkipList skip lists

Following the principle that the later an SSTable is generated, the smaller its level number, our newest SSTable becomes part of Level0. We then record the index in the newly generated SSTable to complete the write. Note that deleted data is not actually removed during this process. The reason is simple: we do not know which level's SSTable holds the data to be deleted, and finding and removing it would take a lot of time. So we keep the deletion record intact and wait for a subsequent merge to process it.

In addition, key index information is stored at the end of the file. Since we read the file starting from the end, we get the index first; based on it we can quickly locate data within the SSTable without reading the entire file.
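Here is a toy illustration of that layout: sorted records first, then an index block, with the index's byte offset stored in a fixed-size footer at the very end so a reader can jump straight to it. The on-disk format below is invented for this sketch; LevelDB's real block format is different.

```python
import json
import os
import tempfile

def write_sstable(path, memtable):
    index = {}                           # key -> byte offset of its record
    with open(path, "wb") as f:
        for key in sorted(memtable):     # a SkipList already iterates in order
            index[key] = f.tell()
            f.write(json.dumps([key, memtable[key]]).encode() + b"\n")
        index_offset = f.tell()
        f.write(json.dumps(index).encode())
        f.write(index_offset.to_bytes(8, "big"))   # footer: where the index starts

def read_index(path):
    with open(path, "rb") as f:
        f.seek(-8, 2)                    # read the fixed-size footer from the end
        index_offset = int.from_bytes(f.read(8), "big")
        f.seek(index_offset)
        return json.loads(f.read()[:-8]) # index block, minus the trailing footer

# Example: build a tiny SSTable in a temp directory.
sst_path = os.path.join(tempfile.mkdtemp(), "000001.sst")
write_sstable(sst_path, {"b": 2, "a": 1})
```

With the index in hand, a reader can seek directly to a key's offset instead of scanning the file.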

major Compaction

Let's move on to major Compaction. This is the heart of LevelDB's layering mechanism; without it, every inserted SSTable would stay at Level0 and no hierarchy could form.

Before we go into details, we need to establish one fact. In LevelDB, SSTables at every level other than Level0 are generated by major Compaction, so we can guarantee that SSTables within the same level have no overlapping key ranges. Level0 is different: its SSTables are generated by minor Compaction, so their key ranges may overlap.

There is more than one trigger for a major Compaction in LevelDB: a manual Compaction, a size Compaction, and a seek Compaction. Here is a brief introduction to each.

A manual Compaction is triggered artificially, through an interface call that tells LevelDB to execute a Compaction. A size Compaction acts as a balancing operation, triggered when the data at a level exceeds its threshold. The last is the seek Compaction: LevelDB records the miss rate of every SSTable file at each level, and when it finds that lookups in a file keep missing, it considers the file not worth its IO cost and merges it with files at the next level to reduce IO consumption.

Of course, the most common trigger is the size Compaction: it fires when LevelDB finds that the number of SSTable files at a level (in the case of Level0) or the total size of data at a level exceeds a certain threshold.

In a major Compaction, suppose LevelDB selects files at level i to merge; we then need to distinguish cases. If i = 0, we are merging Level0 data. As mentioned above, files at Level0 can have overlapping key ranges, so all files with overlapping keys must be included in the set to be merged. When selecting the set to merge, LevelDB records the maximum key of the previous Compaction at that level, and this time picks a file whose keys are greater than that key to start compacting.

LevelDB designed this rotation mechanism to ensure that every file in a level eventually gets a chance to be merged.

Once the level i files are chosen, it is time to select the files from level i+1 to merge with them. The selection criterion is very simple: we select all files whose key ranges overlap with those in the set to be merged.
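That selection criterion can be sketched in a few lines (the file names and tuple layout here are made up for illustration):

```python
def overlaps(a_min, a_max, b_min, b_max):
    # Two key ranges overlap iff each one starts before the other ends.
    return a_min <= b_max and b_min <= a_max

def pick_next_level_files(chosen, next_level):
    """`chosen` and `next_level` are lists of (min_key, max_key, name) tuples.
    Return the next-level files overlapping the chosen files' overall range."""
    lo = min(f[0] for f in chosen)
    hi = max(f[1] for f in chosen)
    return [name for (mn, mx, name) in next_level if overlaps(lo, hi, mn, mx)]
```

Because levels below Level0 have no internal overlaps, this range test is all that is needed to find every affected file.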

The merging process is essentially a multi-way merging process, as shown in the following figure:


Since the keys in all the files are ordered, we start from their heads. For each key we decide whether to keep or discard it, and the logic is simple: for a given key, if the same key also appears at a lower-numbered level, a newer value exists, so the current copy is discarded.

When this Compaction is complete, all files involved in the merge are no longer useful and can be deleted.

This merging process is essentially the same as in bare LSMT, just with the hierarchy added.
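As a rough sketch of this multi-way merge (the real thing streams through file iterators rather than loading everything into one heap, and the tombstone marker here is invented for illustration):

```python
import heapq

TOMBSTONE = "__deleted__"   # hypothetical deletion marker

def merge_runs(runs):
    """Merge sorted (key, value) lists, ordered from newest run to oldest.
    For duplicate keys the newest copy wins; tombstones are dropped."""
    merged = []
    # Tag every record with its run index so ties on a key pop newest-first.
    heap = [(key, run_id, value)
            for run_id, run in enumerate(runs) for key, value in run]
    heapq.heapify(heap)
    while heap:
        key, _, value = heapq.heappop(heap)
        while heap and heap[0][0] == key:      # discard older copies of this key
            heapq.heappop(heap)
        if value != TOMBSTONE:                 # deletions vanish from the output
            merged.append((key, value))
    return merged
```

This is also where the deletions deferred by minor Compaction are finally processed: a tombstone swallows every older copy of its key and then disappears itself.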

But we're not done yet. Remember the manifest file that records all the SSTable indexes? Any Compaction changes the structure of the levels, so every time a Compaction occurs, we need to create a new manifest file documenting the changes that Compaction caused. Finally, Current is pointed at the newly generated manifest.

So, we have the whole process strung together.

Conclusion

Reviewing the whole process shows that while the add, delete, update, query, and Compaction operations add plenty of detail, the underlying framework is still LSMT. LevelDB's SSTables can additionally be optimized with Bloom filters, and flexible use of caching can further improve query efficiency.

Also, note that LevelDB is strictly a database engine, not a complete database system. We can build a fuller database system on top of it, but on its own it provides only the core low-level add, delete, modify, and query primitives. Beyond that basic functionality, a mature database system requires a great deal of detail work and optimization. So far many database engines have been developed based on LevelDB, but very few complete database systems; after all, building one takes a long time of development and accumulation.

If we simply divide distributed systems into distributed computing systems and distributed storage systems, we find that distributed storage accounts for most of the substance. A distributed storage system can be roughly viewed as low-level data structures plus an upper-layer consensus protocol that solves the consistency problems distribution introduces. And the data structures commonly used at the bottom are few in number, so understanding and learning them is the foundation for a deep understanding of distributed systems. A qualified, excellent system architect is expected to solve distributed problems in business scenarios, and the core of that problem-solving ability lies in understanding and applying these fundamentals.

That's all for today's article. If you feel you've gained something, please follow or share; your support means a lot to me.