Introduction to LevelDB

LevelDB is a very fast persistent key-value store developed by Google. Keys and values are arbitrary byte arrays, and the data is sorted by key.

Typical Application Scenarios

Counters that are updated very frequently and whose data must not be lost.

features

  • Key and value can be any array of bytes
  • The data is sorted by key
  • Users can customize collation rules
  • The basic operation methods are Put(key, value), Get(key), and Delete(key) (see the usage sketch after this list)
  • Multiple changes can be applied as a single atomic batch
  • Users can create a temporary snapshot to get a consistent view of the data
  • Data supports forward and backward iteration.
  • Data is automatically compressed and stored using Snappy Compression Library
  • External activity (file system operations, etc.) goes through a virtual interface, so users can customize how LevelDB interacts with the operating system
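
To make the basic operations concrete, here is a minimal usage sketch. It uses the pure-Java iq80 port of LevelDB (org.iq80.leveldb), which is an assumption made for illustration rather than something this article depends on; the original C++ API follows the same Put/Get/Delete pattern.

    import java.io.File;

    import org.iq80.leveldb.DB;
    import org.iq80.leveldb.Options;
    import org.iq80.leveldb.impl.Iq80DBFactory;

    public class LevelDbQuickStart {
        public static void main(String[] args) throws Exception {
            Options options = new Options();
            options.createIfMissing(true);
            // Open (or create) a database directory; only one process may hold it at a time
            DB db = Iq80DBFactory.factory.open(new File("example-db"), options);
            try {
                db.put(Iq80DBFactory.bytes("name"), Iq80DBFactory.bytes("yuming"));
                System.out.println(Iq80DBFactory.asString(db.get(Iq80DBFactory.bytes("name")))); // yuming
                db.delete(Iq80DBFactory.bytes("name"));
            } finally {
                db.close();
            }
        }
    }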

Limitations

  • LevelDB is not a relational database and does not support relational data models, SQL queries, and indexes
  • Only one single process (possibly multithreaded) can access a particular database at a time
  • The library has no built-in client/server support; applications that need it must wrap LevelDB themselves

performance

The test database contains one million records; each record has a 16-byte key and a 100-byte value (roughly 50 bytes after compression).

Write performance

  • Sequential write: each operation takes 1.765 microseconds on average, i.e. about 550,000 sequential writes per second
  • Sequential write with a flush after every write: each operation takes 268.409 microseconds on average, i.e. about 3,700 flushed writes per second
  • Random write: each operation takes 2.460 microseconds on average, i.e. about 400,000 random writes per second
  • Update (overwrite) write: each operation takes 2.380 microseconds on average, similar to random writes

Read performance

  • Random read: each operation takes 16.677 microseconds on average, i.e. about 60,000 random reads per second
  • Sequential read: each operation takes 0.476 microseconds on average, i.e. about 2.1 million sequential reads per second
  • Reverse read: each operation takes 0.724 microseconds on average, i.e. about 1.3 million reverse reads per second

The numbers above were measured with compression disabled. With compression enabled, performance improves further; for example, random reads drop to 11.602 microseconds per operation, i.e. about 85,000 reads per second.

RocksDB

LevelDB is a great storage engine, but it has a few drawbacks: it does not support multithreaded compaction, its support for key-range lookups is rudimentary, and it leaves room for further optimization.

Facebook’s RocksDB is a more robust engine built as an improvement on LevelDB, and it is very similar to LevelDB in usage.

There are many databases in the modern open source market that use RocksDB as the underlying storage engine, such as TiDB.

Storage and retrieval

In introducing LevelDB’s implementation, I want to start with data storage and retrieval and discuss the evolution and differences between different structured storage approaches.

At the most basic level, a database should do two things:

  1. Give the database some data and it should be able to store it
  2. Ask the database for the data back, it should be able to give you the data

SimpleDB

We are trying to implement a simple KeyValue persistent database that supports the following functions:

  • persistence
  • set(String key, String value)
  • get(String key)

Create a new project, SimpleDB, and build a server that supports:

  • A SET interface that wraps the received key and value in a record structure, serializes it as a JSON string, and appends it to a designated persistence file
  • A GET interface that looks up the value of the given key in the persistence file

Set

    @GetMapping(path = "set")
    public String set(@RequestParam("key") String key, @RequestParam("value") String value) {
        if (StringUtils.isEmpty(key) || StringUtils.isEmpty(value)) return "FALSE";
        String recordLine = buildRecordLine(key, value);
        FileUtil.writeString(recordLine, PATH, UTF_8, true);
        return "OK";
    }

Get

    @GetMapping(path = "get")
    public String get(String key) {
        List<String> recordLines = FileUtil.readLines(PATH);
        List<Record> targetRecords = recordLines.stream()
                .map(line -> JsonUtil.decodeJson(line, Record.class))
                .filter(record -> key.equals(record.getKey()))
                .collect(Collectors.toList());
        return CollectionUtils.isEmpty(targetRecords) ? null : targetRecords.get(targetRecords.size() - 1).getValue();
    }

Usage

Perform the following SET requests:

http://localhost:8080/set?key=name&value=yuming
http://localhost:8080/set?key=age&value=24
http://localhost:8080/set?key=name&value=yuming2

The persistence file now contains:

{"key":"name","value":"yuming"}
{"key":"age","value":"24"}
{"key":"name","value":"yuming2"}

Perform the following GET requests:

http://localhost:8080/get?key=name
http://localhost:8080/get?key=age

The responses are:

yuming2
24

Analyzing the write operation

The set operation is very simple: it appends the received record to the persistence file, and it is efficient because the write is sequential.

Many databases internally use a similar append-only log. Of course, real databases have more issues to deal with (such as concurrency control, reclaiming disk space to avoid unbounded log growth, and handling errors and partially written records), but the basic principle is the same.

Concurrent writes

The existing set method does not handle concurrent writes safely; it needs to be tweaked so that there is only ever one writer thread, as shown below.

    private static ExecutorService executorService = Executors.newSingleThreadExecutor();

    /**
     * The SET interface appends the received JSON string of Key and Value to the specified persistence file.
     *
     * @param key
     * @param value
     * @return
     */
    @GetMapping(path = "set")
    public String set(@RequestParam("key") String key, @RequestParam("value") String value) {
        if (StringUtils.isEmpty(key) || StringUtils.isEmpty(value)) return "FALSE";
        String recordLine = buildRecordLine(key, value);
        Future<?> future = executorService.submit(() -> FileUtil.writeString(recordLine, PATH, UTF_8, true));
        try {
            future.get();
            return "OK";
        } catch (Exception e) {
            LOGGER.error("writeString error", e);
            return "FALSE";
        }
    }

Analyzing the read operation

The get operation, on the other hand, performs worse and worse as the log file grows, because every query has to traverse the entire log file (O(n)).

Possible optimizations

The first idea is to add a cache, but as the amount of data grows it quickly becomes clear that memory will run out.

If the data is mostly hot, most keys are overwritten quickly, and even with a delete method (append a row whose value is null, then compact with a background thread) the actual number of key-value pairs is not that large. So can we use a cache after all?

The first thing to do in this case is to deal with data consistency between the cache and the contents of the file.

The second point is that if it is all hot data and there is not much of it, why use a cache at all instead of simply keeping everything in memory?

Redis

Redis stores data directly in memory and appends writes to a log file for persistence. The log file is only used to recover data at startup (it is read sequentially and replayed exactly once).

This way Redis can ensure that:

  • Write operations: Write to the memory first and then to the disk sequentially, which is fast
  • Read operation, directly in the memory to find, fast

Because the data is stored in memory, Redis is also capable of:

  • Complex yet efficient in-memory data structures
  • A single processing thread, so commands are executed serially

How to compact the log file

  1. A new log file is generated after the original log file is compressed
  2. Generate a new log file directly from the data in memory (a sketch of this approach follows)
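
A minimal sketch of the second approach, rewriting the log directly from the in-memory data (the class and file names are made up for illustration; this is not Redis's actual rewrite code):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class LogRewriter {
        // All live data is held in memory; the log only exists for recovery at startup
        private final Map<String, String> memory = new ConcurrentHashMap<>();

        /** Rewrite the log from memory: one line per live key, with no stale entries. */
        public void rewriteLog(Path logPath) throws IOException {
            Path tmp = logPath.resolveSibling(logPath.getFileName() + ".rewrite");
            StringBuilder sb = new StringBuilder();
            for (Map.Entry<String, String> e : memory.entrySet()) {
                sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
            }
            Files.write(tmp, sb.toString().getBytes(StandardCharsets.UTF_8));
            // Swap the new log in place of the old one so readers never see a half-written file
            Files.move(tmp, logPath, StandardCopyOption.REPLACE_EXISTING);
        }
    }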

Indexes

Not all scenarios can store all data in memory, so we need to explore alternatives.

To efficiently find the value of a particular key in a database, we need a data structure: an index.

The general idea behind indexing is to keep some extra metadata as a roadmap to help you find the data you want. If you want to search the same data in several different ways, you may need different indexes, built on different parts of the data.

An index is an additional structure derived from the primary data, and it is necessarily more than a simple append-only log. Many databases allow indexes to be added and removed without affecting the contents of the data; doing so only affects query and insert performance. Maintaining the additional structure incurs overhead, especially on writes.

For write performance, it is hard to beat simply appending to a file, which is the simplest possible write operation. Any kind of index usually slows down writes, because the index also has to be updated every time data is written.

This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but every index slows down writes. For this reason, databases do not index everything by default; they require you to choose indexes manually, using your knowledge of the application's typical query patterns, so you can pick the indexes that benefit your application most without introducing more overhead than necessary.

Hash indexes

Let's continue optimizing SimpleDB. We discussed caching read operations earlier and noted that caching all of the data in memory is not affordable.

In order to reduce the size of the cache, we can cache only part of the hot data and use LRU to eliminate the cache that is not commonly used.

With this scheme, a cache miss still requires traversing the entire log file; with a large amount of data the search time is unacceptable, and query response times become wildly inconsistent.

Since we can’t partially cache, can we compress our cache?

Our previous cache built a Map<key, value> structure, and the values are the most space-consuming part. If we replace each value with the location (byte offset) of the actual value in the file, we can save a lot of space, and because we know the exact location we can still find the corresponding data quickly.

The structure we’ve built is essentially a hash index.
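
A minimal sketch of this idea, assuming one record per line in the append-only file and an in-memory map from key to byte offset (the class and method names are illustrative, and the sketch ignores thread safety):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    public class HashIndexedLog implements AutoCloseable {
        private final RandomAccessFile file;
        private final Map<String, Long> index = new HashMap<>(); // key -> offset of its latest record
        private long writeOffset;

        public HashIndexedLog(String path) throws IOException {
            this.file = new RandomAccessFile(path, "rw");
            this.writeOffset = file.length(); // keep appending at the end of an existing file
        }

        public void set(String key, String value) throws IOException {
            byte[] record = (key + "," + value + "\n").getBytes(StandardCharsets.UTF_8);
            file.seek(writeOffset);
            file.write(record);
            index.put(key, writeOffset); // remember where the newest value of this key lives
            writeOffset += record.length;
        }

        public String get(String key) throws IOException {
            Long offset = index.get(key);
            if (offset == null) return null; // the key was never written
            file.seek(offset);               // jump straight to the record: one seek, one read
            String line = file.readLine();
            return line.substring(line.indexOf(',') + 1);
        }

        @Override
        public void close() throws IOException {
            file.close();
        }
    }

On restart the in-memory index would of course have to be rebuilt by scanning the log once.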

Such a storage engine is ideal for situations where the value of each key is updated frequently. For example, the key might be the URL of the video, and the value might be the number of times it is played (increasing each time someone clicks the play button).

In this type of workload, there are many writes, but not many different keys — many writes per key, but keeping all keys in memory is feasible.

Compaction

Until now we have just been appending to a single file, so how do we avoid eventually running out of disk space? A good solution is to break the log into segments of a certain size: close the current segment file once it reaches that size, and start writing to a new segment file.

We can then compact these segments, as shown in the figure below. Compaction means discarding duplicate keys in the log and keeping only the most recent update for each key.

Also, since compaction often makes segments much smaller (assuming each key is overwritten several times on average within a segment), we can also merge several segments together while performing the compaction, as shown in the figure below. Segments are never modified after they have been written, so the merged result is written to a new file.

Merging and compacting frozen segments can be done in a background thread; while it runs, we can continue to serve read and write requests from the old segment files as usual. Once the merge is complete, we switch read requests over to the new merged segment instead of the old ones, and the old segment files can simply be deleted.
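
A hedged sketch of this compaction-plus-merge step, assuming each segment is a text file of key=value lines and that segments are passed in oldest-first order so that newer values win (for hash-indexed segments it is fine to hold all live keys in memory, as discussed above):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class SegmentCompactor {
        /**
         * Merge several frozen segments into one new segment.
         * Segments must be passed oldest-first so that later (newer) values
         * overwrite earlier ones; duplicate keys keep only the latest value.
         */
        public static void compact(List<Path> oldestFirstSegments, Path output) throws IOException {
            Map<String, String> latest = new LinkedHashMap<>();
            for (Path segment : oldestFirstSegments) {
                for (String line : Files.readAllLines(segment, StandardCharsets.UTF_8)) {
                    int sep = line.indexOf('=');
                    if (sep > 0) {
                        latest.put(line.substring(0, sep), line.substring(sep + 1));
                    }
                }
            }
            StringBuilder sb = new StringBuilder();
            latest.forEach((k, v) -> sb.append(k).append('=').append(v).append('\n'));
            Files.write(output, sb.toString().getBytes(StandardCharsets.UTF_8));
            // The old segment files can be deleted once readers have switched to the new one
        }
    }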

Why append logs instead of overwriting old values

  • Appending and segment merging are sequential write operations, which are generally much faster than random writes, especially on spinning disks. Sequential writes are also preferable, to some extent, on flash-based SSDs.
  • If segment files are append-only and immutable, concurrency and crash recovery become much simpler. For example, you never have to worry about a crash happening halfway through overwriting a value and leaving behind a file that contains part of the old value spliced together with part of the new one.

limitations

  • The hash table must fit in memory. In principle you could keep a hash map on disk, but unfortunately that is hard to do well: it requires a lot of random-access I/O, it is expensive to grow when it fills up, and hash collisions require fiddly logic.
  • Range queries are inefficient. For example, you cannot easily scan all the keys between Kitty00000 and Kitty99999; you have to look up each key individually in the hash map.

SSTable

With hash indexes, once there is a lot of data we can no longer keep the entire index in memory. So can we store only part of the index? With the file structure we have used so far this is clearly not possible. But if the contents of our files were ordered, could a partial index be enough?

We can make a simple change to the format of the segment files: require that the key-value pairs are sorted by key. At first glance, this requirement seems to break our ability to use sequential writes.

We call this format Sorted String Table, or SSTable for short. We also require that each key appears only once within each merged segment file (the compaction process already ensures this). SSTables have several major advantages over log segments with hash indexes:

  1. Merging segments is simple and efficient, even if the files are larger than the available memory. The approach is similar to the one used in the merge-sort algorithm, as shown below: you read the input files side by side, look at the first key in each file, copy the lowest key (in sort order) to the output file, and repeat. This produces a new merged segment file, also sorted by key.

     What if the same key appears in several input segments? Remember that each segment contains all the values written to the database during some period of time. This means that all the values in one input segment must be more recent than all the values in the other segment (assuming we always merge adjacent segments). When multiple segments contain the same key, we keep the value from the most recent segment and discard the values in the older segments.

  2. You no longer need to keep an index of all the keys in memory in order to find a particular key in a file. Here is an example: suppose you are looking for the key handiwork, but you do not know the exact offset of that key in the segment file. However, you do know the offsets of handbag and handsome, and because of the sorting you know that handiwork must appear between the two. This means you can jump to the offset of handbag and scan from there until you find handiwork (or conclude that it is not present).

     You still need an in-memory index to tell you the offsets of some of the keys, but it can be sparse: one key for every few kilobytes of segment file is sufficient, because a few kilobytes can be scanned very quickly.

  3. Since read requests need to scan over several key-value pairs in the requested range anyway, these records can be grouped into blocks and compressed before being written to disk (as shown in the shaded area in the figure above). Each entry of the sparse in-memory index then points at the start of a compressed block. Besides saving disk space, compression also reduces I/O bandwidth usage.
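
A sketch of point 2, a sparse in-memory index over one sorted segment (names are illustrative): find the greatest indexed key that is not larger than the one we want, seek to its offset, and scan forward from there.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.Map;
    import java.util.TreeMap;

    public class SparseIndexReader {
        // One entry every few kilobytes of the sorted segment: sampled key -> byte offset
        private final TreeMap<String, Long> sparseIndex = new TreeMap<>();

        /** Populated while the sorted segment is being written (every few kilobytes). */
        public void addIndexEntry(String key, long offset) {
            sparseIndex.put(key, offset);
        }

        public String get(RandomAccessFile segment, String key) throws IOException {
            Map.Entry<String, Long> floor = sparseIndex.floorEntry(key);
            long start = floor == null ? 0L : floor.getValue();
            segment.seek(start);
            String line;
            while ((line = segment.readLine()) != null) {
                int sep = line.indexOf('=');
                String currentKey = line.substring(0, sep);
                if (currentKey.equals(key)) return line.substring(sep + 1);
                if (currentKey.compareTo(key) > 0) return null; // we passed it: the key is absent
            }
            return null;
        }
    }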

Build and maintain SSTables

write

Fine so far, but how do we get our data sorted by key in the first place? Incoming writes can arrive in any order. Maintaining a sorted structure on disk is possible (see the B-tree section below), but it is much easier to maintain it in memory. There are many well-known tree data structures you can use, such as red-black trees or AVL trees. With these data structures, you can insert keys in any order and read them back in sorted order.

Now we can make our storage engine work as follows:

When writing, it is first added to a balanced tree data structure (for example, a red-black tree) in memory. This memory tree is sometimes called a memtable.

If the memory table exceeds a certain threshold, the memory table is written to the disk as an SSTable file. This can be done efficiently because the tree already maintains key-value pairs for key-sorting. The new SSTable file becomes the latest part of the database. When an SSTable is written to disk, writes can continue to a new memory table instance.

To serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, and so on.

A merge and compression process is run in the background to combine segment files and discard overwritten or deleted values.

This scheme works well. It has only one problem: if the database crashes, the most recent writes (those in the memtable but not yet written to disk) are lost. To avoid this, we can keep a separate log on disk to which every write is immediately appended.

The log is not in sorted order, but that does not matter, because its sole purpose is to restore the memtable after a crash. Whenever a memtable has been written out as an SSTable, the corresponding log can be discarded.
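
A minimal sketch of this write path (the class names and file formats are made up for illustration and are not LevelDB's actual implementation): every write first goes to an append-only log for crash recovery, then into a sorted in-memory table; once the table grows past a threshold it is flushed to a new sorted segment file and the log is discarded.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.Map;
    import java.util.TreeMap;

    public class MiniMemtableEngine {
        private static final int FLUSH_THRESHOLD = 1000;

        private final Path dataDir;
        private final Path walPath;
        private TreeMap<String, String> memtable = new TreeMap<>(); // kept sorted by key
        private int sstableCounter = 0;

        public MiniMemtableEngine(Path dataDir) throws IOException {
            this.dataDir = dataDir;
            this.walPath = dataDir.resolve("wal.log");
            Files.createDirectories(dataDir);
        }

        public void put(String key, String value) throws IOException {
            // 1. Append to the write-ahead log first, so the write survives a crash
            Files.write(walPath, (key + "=" + value + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            // 2. Insert into the sorted in-memory table
            memtable.put(key, value);
            // 3. Flush to an SSTable-like file once the memtable is large enough
            if (memtable.size() >= FLUSH_THRESHOLD) {
                flush();
            }
        }

        private void flush() throws IOException {
            Path sstable = dataDir.resolve(String.format("sstable-%06d.txt", sstableCounter++));
            StringBuilder sb = new StringBuilder();
            for (Map.Entry<String, String> e : memtable.entrySet()) { // already key-sorted
                sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
            }
            Files.write(sstable, sb.toString().getBytes(StandardCharsets.UTF_8));
            memtable = new TreeMap<>();    // start a fresh memtable
            Files.deleteIfExists(walPath); // the log is no longer needed
        }
    }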

read

Reading data from LevelDB is not complicated. The memtable and the immutable memtable (imm) act like two levels of cache that provide fast access in memory. If the value can be found in either of these in-memory structures, it is guaranteed to be the latest data.

LevelDB always writes new key-value pairs first and only removes historical data later, during compaction.

Data is read from the memtable, the immutable memtable, and SSTables at different levels; the first two live in memory, while the SSTables persist on disk as *.ldb files. The database is called LevelDB precisely because of these different levels of SSTables.

When LevelDB does not find the data in memory, it searches the SSTables on disk, level by level. The process is slightly more involved, because LevelDB may have to consult several levels and cannot skip any of them.

The lookup process involves a very important data structure called FileMetaData:

FileMetaData holds information about an entire file, including its smallest and largest keys, the number of seeks allowed before compaction, the number of references to the file, the file size, and the file number. Because all SSTables are stored in the same directory with a fixed naming scheme, the file number is enough to locate the corresponding file.

LevelDB first looks for the key at level 0. Unlike the other levels, the key ranges of the SSTables at level 0 may overlap, so each of the level-0 SSTables (at most a handful; a compaction is triggered once there are four) has to be checked for the value.

At higher levels, however, the SSTables within a level do not overlap, so we can use the largest-key information recorded for each SSTable to quickly locate the single candidate file.

We then check whether that SSTable actually contains the key; if it does not, we move on to the next level, until we either find the value or reach the last level, kNumLevels (7 by default).
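
The lookup order described above can be summarized in a small sketch (the interfaces and names are illustrative; LevelDB's real implementation is C++ and works with FileMetaData, versions, and iterators):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    public class ReadPathSketch {
        /** Minimal view of an on-disk table: its key range plus a point lookup. */
        interface Segment {
            boolean mayContain(String key); // key within [smallest, largest] of this file
            String get(String key);         // value from the on-disk table, or null
        }

        private final NavigableMap<String, String> memtable = new TreeMap<>();          // newest data
        private final NavigableMap<String, String> immutableMemtable = new TreeMap<>(); // being flushed
        private final List<List<Segment>> levels = new ArrayList<>();                   // level 0, 1, 2, ...

        public String get(String key) {
            // 1. Memory first: the memtable, then the immutable memtable
            String value = memtable.get(key);
            if (value == null) value = immutableMemtable.get(key);
            if (value != null) return value;

            // 2. Disk, level by level
            for (int level = 0; level < levels.size(); level++) {
                for (Segment segment : levels.get(level)) { // level 0 is assumed ordered newest-first
                    if (!segment.mayContain(key)) continue;
                    value = segment.get(key);
                    if (value != null) return value;
                    // At level 1 and deeper, key ranges are disjoint: only one file can match
                    if (level > 0) break;
                }
            }
            return null; // not found at any level
        }
    }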

LSM (Log-structured Merge Tree)

The algorithm described here is essentially the one used in LevelDB and RocksDB, key-value storage engine libraries designed to be embedded in other applications. Among other things, LevelDB can be used in Riak as an alternative to Bitcask. Similar storage engines are used in Cassandra and HBase, both of which were inspired by Google's Bigtable paper (which introduced the terms SSTable and memtable).

This indexing structure was originally described by Patrick O'Neil et al. as the Log-Structured Merge-Tree (LSM-tree), building on earlier work on log-structured file systems. Storage engines based on this principle of merging and compacting sorted files are therefore often called LSM storage engines.

Lucene, the full-text search indexing engine used by Elasticsearch and Solr, uses a similar approach to store its term dictionary. A full-text index is much more complex than a key-value index, but it is based on a similar idea: given a word in a search query, find all the documents (web pages, product descriptions, and so on) that mention the word. This is implemented with a key-value structure in which the key is a word (a term) and the value is the list of IDs of all the documents that contain the word (the postings list). In Lucene, this mapping from terms to postings lists is kept in SSTable-like sorted files, which are merged in the background as needed.

Performance optimization

As usual, a lot of detail makes the storage engine perform well in practice. For example, when looking for keys that don’t exist in the database, the LSM tree algorithm can be slow: you have to check the memory table and then work the segments all the way back to the oldest (you may have to read each one from disk) before you can determine that the key doesn’t exist.

To optimize this access, storage engines often use additional Bloom filters. A Bloom filter is a memory-efficient data structure for approximating the contents of a collection that can tell you if a key is present in the database, thereby saving a lot of unnecessary disk reads for keys that don’t exist.
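
As a hedged illustration of the idea, here is an example using Guava's BloomFilter; Guava is an assumption made for the example, since LevelDB ships its own built-in Bloom filter policy rather than using Guava:

    import java.nio.charset.StandardCharsets;

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    public class BloomFilterExample {
        public static void main(String[] args) {
            // Sized for ~1 million keys with a ~1% false-positive rate
            BloomFilter<String> filter = BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

            filter.put("name");
            filter.put("age");

            // A "definitely absent" answer lets us skip the disk reads entirely
            System.out.println(filter.mightContain("name"));    // true
            System.out.println(filter.mightContain("missing")); // false (almost always)
        }
    }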

There are also different strategies for deciding the order and timing in which SSTables are compacted and merged. The most common options are size-tiered and leveled compaction. LevelDB and RocksDB use leveled compaction (hence the name of LevelDB), HBase uses size-tiered compaction, and Cassandra supports both.

In size-tiered compaction, newer and smaller SSTables are successively merged into older and larger SSTables. In leveled compaction, the key range is split up into smaller SSTables and older data is moved into separate "levels", which allows the compaction to proceed more incrementally and to use less disk space.

Even with all these subtleties, the basic idea of LSM-trees, keeping a cascade of SSTables that are merged in the background, is simple and effective. It keeps working even when the dataset is much bigger than the available memory.

Because the data is stored in sorted order, range queries can be performed efficiently (scanning all keys between some minimum and maximum), and because the disk writes are sequential, the LSM-tree can support remarkably high write throughput.

B tree

The log-structured indexes just discussed are gaining acceptance, but they are not the most common type of index. The most widely used indexing structure, the B-tree, was introduced in 1970, was called "ubiquitous" less than a decade later, and has stood the test of time. B-trees remain the standard index implementation in almost all relational databases.

Like SSTables, B-trees keep key-value pairs sorted by key, which allows efficient key-value lookups and range queries. But that’s where the similarities end: B-trees have a very different design philosophy.

The log-structured indexes we saw earlier break the database down into variable-size segments, typically several megabytes or more, and always write segments sequentially.

In contrast, B-trees break the database down into fixed-size blocks or pages, traditionally 4 KB (sometimes larger), and read or write one page at a time. This design maps more closely onto the underlying hardware, since disks are also arranged in fixed-size blocks.

Lookup

Each page can be identified by an address or location, which allows one page to refer to another, similar to a pointer, except that it points to a location on disk rather than in memory. We can use these page references to build a tree of pages, as shown in the figure.
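
A simplified sketch of following page references to locate a key (in-memory objects stand in for disk pages here; a real B-tree would read each referenced page from disk by its location):

    import java.util.List;

    public class BTreeLookup {
        static class Page {
            List<String> keys;     // sorted separator keys (or, in a leaf, the stored keys)
            List<Page> children;   // children.size() == keys.size() + 1; null for leaf pages
            List<String> values;   // values, present only in leaf pages

            boolean isLeaf() { return children == null; }
        }

        /** Descend from the root, choosing the child whose key range covers the target. */
        static String get(Page page, String key) {
            while (!page.isLeaf()) {
                int i = 0;
                while (i < page.keys.size() && key.compareTo(page.keys.get(i)) >= 0) {
                    i++;
                }
                page = page.children.get(i); // in a real B-tree: read this page from disk
            }
            int idx = page.keys.indexOf(key);
            return idx >= 0 ? page.values.get(idx) : null;
        }
    }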

Insert & Modify

If you want to update the value of an existing key in the B tree, search the page that contains the key, change the value in the page, and write the page back to disk (any references to the page remain valid).

If you want to add a new key, you need to find the page whose range contains the new key and add it to that page. If there is not enough free space in the page for the new key, the page is split into two half-full pages, and the parent page is updated to account for the new subdivision of the key range, as shown.

The algorithm ensures that the tree remains balanced: a B-tree with n keys always has a depth of O(log n). Most databases fit into a three- or four-level B-tree, so you do not need to follow many page references to find the page you are looking for. (A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.)
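
As a rough check on that figure: 500^4 pages at 4 KB per page is about 2.56 × 10^14 bytes, i.e. roughly 256 TB.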

Reliability

The basic underlying write operation of a B-tree is to overwrite a page on disk with new data. It is assumed that the overwrite does not change the location of the page; that is, all references to the page remain intact when it is overwritten. This is in stark contrast to log-structured indexes such as LSM-trees, which only append to files (and eventually delete obsolete files) but never modify files in place.

redo log

To make the database resilient to crashes, B-tree implementations typically include an additional on-disk data structure: a write-ahead log (WAL, also known as a redo log). This is an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself. When the database comes back up after a crash, this log is used to restore the B-tree to a consistent state.

Concurrency control

An added complication of updating pages in place is that careful concurrency control is required if multiple threads are going to access the B-tree at the same time; otherwise a thread may see the tree in an inconsistent state. This is typically done by protecting the tree's data structures with latches (lightweight locks). Log-structured approaches are simpler in this regard, because they do all the merging in the background without interfering with incoming queries, and atomically swap old segments for new ones from time to time.

Compare B trees and LSM trees

Although B-tree implementations are generally more mature than LSM-tree implementations, LSM-trees are interesting because of their performance characteristics. As a rule of thumb, LSM-trees are typically faster for writes, whereas B-trees are typically faster for reads. Reads on LSM-trees tend to be slower because they have to check several different data structures and SSTables at different stages of compaction.

Advantages of LSM trees

Faster writes

A B-tree index must write every piece of data at least twice: once to the write-ahead log and once to the tree page itself (and perhaps again as pages are split). There is also the overhead of writing an entire page at a time, even if only a few bytes in that page changed. Some storage engines even overwrite the same page twice in order to avoid ending up with a partially updated page after a power failure.

Less fragmentation

LSM-trees can be compressed better, and so often produce smaller files on disk than B-trees. B-tree storage engines leave some disk space unused because of fragmentation: when a page is split, or when a row does not fit into an existing page, some space in a page goes unused. Because LSM-trees are not page-oriented and periodically rewrite SSTables to remove fragmentation, they have lower storage overhead, especially with leveled compaction.

Disadvantages of LSM trees

Compaction interference

A downside of log-structured storage is that the compaction process can sometimes interfere with ongoing reads and writes. Although storage engines try to perform compaction incrementally and without affecting concurrent access, disk resources are limited, so a request can easily end up waiting while the disk finishes an expensive compaction operation. B-trees, by contrast, behave more predictably.

Compaction throughput

Another issue with compaction arises at high write throughput: the disk's finite write bandwidth has to be shared between the initial writes (logging and flushing memtables to disk) and the compaction threads running in the background. When writing to an empty database, the full disk bandwidth can be used for the initial writes, but the bigger the database gets, the more disk bandwidth compaction requires.

If write throughput is high and compaction is not configured carefully, compaction cannot keep up with the rate of incoming writes. In that case, the number of unmerged segments on disk keeps growing until you run out of disk space, and reads also slow down because more segment files have to be checked. Typically, SSTable-based storage engines do not throttle incoming writes even when compaction cannot keep up, so explicit monitoring is needed to detect this situation.

Transactions

One advantage of B-trees is that each key exists in exactly one place in the index, whereas a log-structured storage engine may have multiple copies of the same key in different segments.

This makes B-trees attractive in databases that want to offer strong transactional semantics: in many relational databases, transaction isolation is implemented using locks on ranges of keys, and in a B-tree index those locks can be attached directly to the tree.

conclusion

B-trees are deeply ingrained in database architectures, providing consistently good performance for many workloads, so they’re not likely to go away anytime soon.

Log-structured indexes are becoming increasingly popular in new datastores. There is no quick and easy rule for determining which type of storage engine is better for your scenario, so it is worth doing some empirical testing.

reference

  • LevelDB GitHub project page
  • Research results for LevelDB’s “upgraded” storage engine RocksDB
  • How to create LevelDB for Redis?
  • Designing Data-Intensive Applications
  • LevelDB and Bigtable