This article is excerpted from the Bytedance Infrastructure Practices series.

“Bytedance Infrastructure Practice” is a series of articles written by the technical teams and experts of the Bytedance infrastructure department, sharing the team's practical experience and lessons from the development and evolution of its infrastructure, and exchanging ideas and growing together with fellow engineers.

RocksDB is one of the most widely used storage engines in the world, and a large number of Bytedance database products (graph database, NewSQL, etc.) are built on top of it. Our investment in storage engine development will continue to grow to support more services.

In this article, we introduce Bytedance's improvements to the RocksDB storage engine, including a large number of community contributions as well as some original work, in the hope of offering you more optimization ideas.

1. Background

As one of the best-known LSM-style storage engines, RocksDB plays a very important role at Bytedance. A large number of databases and storage systems are built on or improved from RocksDB. However, some well-known problems of the LSM family also trouble Bytedance's businesses, including performance issues, cost issues, functional limitations, and so on.

This article introduces five improvements we have made to RocksDB (developed on top of our internal storage engine, TerarkDB), hoping to provide some reference value to the community. We also welcome engineers interested in storage engines to join us in building a more powerful foundation for Bytedance.

2. Shortcomings of RocksDB

  • Severe read/write amplification
  • Insufficient peak-shaving ability when facing sudden traffic bursts
  • Limited compression ratio
  • Low index efficiency
  • etc.

3. Our Improvements

3.1 LazyBuffer

A previous version of RocksDB introduced PinnableSlice as the vehicle for transferring data within the engine. Its main purpose is to reduce data copying: when the data a user is looking for is already in the BlockCache, only a reference to it is returned.

But PinnableSlice does not help when a user's operation touches a large number of unwanted values (such as multiple versions of a value or a large number of tombstones): PinnableSlice still generates I/O for these useless values, which is entirely avoidable overhead.

For this reason, we built LazyBuffer to replace PinnableSlice. When the user gets a value, no disk I/O is actually performed; only when the user really needs the value does a fetch operation trigger the I/O.

LazyBuffer is an enhancement of PinnableSlice: beyond avoiding data copies, it also avoids unnecessary I/O, which is a great benefit for scenarios such as large scans, SeekRandom, and so on.
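As a rough illustration, a minimal sketch of a LazyBuffer-style wrapper is shown below. The interface is a simplified assumption for this article, not the actual TerarkDB API: the value is hidden behind a fetch callback, so entries that are skipped never touch the disk.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical sketch of a LazyBuffer-like wrapper (not the real TerarkDB API).
// The value starts as an unresolved reference; disk or block-cache I/O happens
// only when fetch() is called, so values that are skipped (old versions,
// tombstones) never trigger a read.
class LazyBuffer {
 public:
  using Fetcher = std::function<bool(std::string* out)>;

  explicit LazyBuffer(Fetcher fetcher) : fetcher_(std::move(fetcher)) {}

  // Perform the deferred I/O on first use only.
  bool fetch() {
    if (fetched_) return true;
    fetched_ = fetcher_(&value_);  // the actual read happens here
    return fetched_;
  }

  const std::string& value() const { return value_; }  // valid only after fetch()

 private:
  Fetcher fetcher_;
  std::string value_;
  bool fetched_ = false;
};
```

Under this model, an iterator that skips tombstones or stale versions simply never calls fetch() on them, so no I/O is issued for those entries.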

3.2 Lazy Compaction

Since LSM-tree Compaction was introduced, a variety of strategies have been developed to optimize it. The main ones include the following:

  • Leveled Compaction

    • All levels are merged from top to bottom in the standard way
    • Read and write amplification are severe, but space amplification is low
    • Elaborated in the paper Towards Accurate and Fast Evaluation of Multi-Stage Log-Structured Designs
  • Tiered Compaction

    • Universal Compaction in RocksDB
    • Space and read amplification are severe, but write amplification is minimal
    • Elaborated in the paper Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging
  • Tiered+Leveled Compaction

    • That is, the Leveled Compaction actually implemented in RocksDB (Level 0 is tiered, the other levels are leveled)
    • A hybrid strategy with lower space amplification than Tiered Compaction and lower write amplification than Leveled Compaction
  • Leveled-N Compaction

    • Lower write amplification and higher read amplification than Leveled Compaction
    • Merges all sorted runs from level N-1 into level N at once

As this classification shows, mainstream Compaction strategies are essentially trade-offs around when and how merging happens; Bytedance takes a slightly more sophisticated approach here.

First, note that if we allow SSTs to exist without maintaining a strict global order, we can gather more statistics before actually performing the external sort. The downside is that this introduces extra read amplification, which hurts read latency. Is there a way to keep that read amplification controllable, or even barely increase it at all?

We built a new SST data structure (Adaptive Map, aMap for short) that differs from RocksDB's default structure. aMap is a logical, virtual SST, as shown in the following figure:

Figure: Schematic diagram of aMap structure

In the figure, SST 1, SST 2, and SST 3 are three physical SST files. When they need to be merged, we first build a virtual mask that exposes the logically merged SST (logically one big SST) to the upper layer, while statistics such as overlap and key space are recorded and tracked.

Its main functions are:

  • The new Compaction policy inherits from RocksDB's Leveled Compaction (Universal Compaction can also be supported)
  • When a Compaction is triggered, an Adaptive Map is built first, forming a logical new SST out of several candidate SSTs (see the figure above)
  • The Adaptive Map partitions the key space into multiple overlapping segments (R1, R2, R3, R4, and R5), and the degree of overlap of each segment is tracked and recorded
  • The background GC thread preferentially picks the most heavily overlapped segments for GC; this makes GC more efficient, with write amplification much lower than the default (see the sketch below)
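The following is a minimal sketch of this bookkeeping, under the assumption of a simple per-segment overlap counter; the names and fields are illustrative and do not reflect the real TerarkDB structures.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch of Adaptive Map bookkeeping (illustrative only).
// A logical SST is split into key-range segments; each segment records how
// many physical SSTs overlap it, and background GC prefers the segment whose
// merge would remove the most redundancy.
struct Segment {
  std::string smallest_key;
  std::string largest_key;
  int overlap_count = 0;   // number of physical SSTs covering this range
  uint64_t bytes = 0;      // data volume in this range
};

struct AdaptiveMap {
  std::vector<Segment> segments;  // e.g. R1..R5 in the figure

  // Pick the most heavily overlapped segment as the next GC candidate.
  const Segment* PickGcCandidate() const {
    auto it = std::max_element(
        segments.begin(), segments.end(),
        [](const Segment& a, const Segment& b) {
          return a.overlap_count < b.overlap_count;
        });
    return it == segments.end() ? nullptr : &*it;
  }
};
```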

Compared with the original RocksDB, the theoretical analysis shows clear advantages:

| I/O overhead | Original RocksDB | Improved | Notes |
| --- | --- | --- | --- |
| Read amplification | O(log N) | O(1) | N is the total data size |
| Write amplification | O(log N) | O(log log N) | N is the total data size |

| Computational overhead | Original RocksDB | Improved | Notes |
| --- | --- | --- | --- |
| Read cost | O(p) | O(p) | p is the false positive rate of the BloomFilter |
| Write cost | O(log N) | O(log log N) | N is the total data size |

Table: Comparison of complexity analysis (read/write amplification and cost)

3.3 KV separation

The paper WiscKey: Separating Keys from Values in SSD-conscious Storage describes the idea of separating keys from values: large values are stored outside the LSM tree, while the original SST records only the actual location of the data in place of the value.

Figure: Basic principle of KV separation

When a value exceeds a given threshold, instead of storing it in the SST we store a file pointer to it, which reduces the overhead of operations such as Compaction and Seek.
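A minimal sketch of this idea follows; the threshold value, the pointer layout, and the ValueFile helper are assumptions for illustration, not the actual on-disk format.

```cpp
#include <cstdint>
#include <string>

// Illustrative sketch of threshold-based KV separation (not the actual
// TerarkDB on-disk format). Values below the threshold stay inline; larger
// values go to a separate value file and are replaced in the LSM tree by a
// small (file_number, offset, size) pointer.
struct ValuePointer {
  uint64_t file_number;
  uint64_t offset;
  uint64_t size;
};

constexpr size_t kSeparationThreshold = 4096;  // assumed threshold

struct ValueFile {  // stand-in for a real blob / value SST writer
  uint64_t file_number = 1;
  std::string data;
  ValuePointer Append(const std::string& v) {
    ValuePointer p{file_number, data.size(), v.size()};
    data += v;
    return p;
  }
};

// Decide what the LSM tree actually stores for a given value.
std::string EncodeForLsm(const std::string& value, ValueFile* vf) {
  if (value.size() < kSeparationThreshold) {
    return value;                      // small value: keep inline
  }
  ValuePointer p = vf->Append(value);  // large value: spill to the value file
  // The LSM tree stores only the pointer bytes. A real format also needs a
  // type flag so readers can tell an inline value from an encoded pointer.
  return std::string(reinterpret_cast<const char*>(&p), sizeof(p));
}
```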

The RocksDB community has a KV-separation feature called BlobDB, but it is not yet complete and still has a lot of work ahead, so we do not compare against it here. TitanDB is a more fully implemented KV-separated storage engine (built as a RocksDB plugin), so we give a simple comparison with it below:

| | WiscKey | TitanDB | TerarkDB (our improvements) |
| --- | --- | --- | --- |
| When values are separated | Always | Above a threshold | Above a threshold |
| Value storage | VLOG | Specially designed Blob files | Native SST files |
| Get path | 1) Key → VLOG position 2) VLOG position → Value | 1) Key → FileNumber + Handle 2) FileNumber → Blob 3) Blob + Handle → Value | 1) Key → FileNumber 2) FileNumber → SST 3) SST + Key → Value |
| Scan cost | Slower than LevelDB; supports Prefetch | Slower than RocksDB, with low I/O utilization; Prefetch not supported | Slower than RocksDB, with low I/O utilization; Prefetch not supported |
| GC process | 1) Look up the KV at the VLOG head and pop it 2) Valid data is rewritten to the VLOG tail and reinserted into the LSM tree | 1) Using event listeners, pick the Blob with the most garbage and look up its KVs 2) Generate a new Blob and write the valid data back into the LSM tree | 1) Using event listeners, pick the SST with the most garbage, look up its KVs, and generate a new SST |
| GC efficiency | Rolling GC over the VLOG; a full GC cycle is very long and unfriendly to hot data | Always GCs the Blob with the most garbage; hot-data friendly, but rewrites the LSM tree | Always GCs the SST with the most garbage; hot-data friendly, does not rewrite the LSM tree |

In general, our overall approach to KV separation is similar to the community's. However, thanks to the Adaptive Map, the actual GC work can be deferred to times of low load, which works quite well for absorbing sudden traffic spikes.

However, KV separation also has costs, the most important of which is the penalty on range queries; this can be mitigated by prefetching at the logical layer.

3.4 Support for multiple index types

For native RocksDB, the SST format is roughly as follows:

[data block 1]
[data block 2]
...
[data block N]
[meta block 1: filter block]          (see section: "filter" Meta Block)
[meta block 2: index block]
[meta block 3: compression dictionary block]  (see section: "compression dictionary" Meta Block)
[meta block 4: range deletion block]      (see section: "range deletion" Meta Block)
[meta block 5: stats block]          (see section: "properties" Meta Block)
...
[meta block K: future extended block]  (we may add more meta blocks in the future)
[metaindex block]
[Footer]                (fixed size; starts at file_size - sizeof(Footer))

The index block and filter block help users quickly locate the block where a target key resides. RocksDB's default index does not take the characteristics of different data types into account, so it cannot choose the index structure with the best compression and query efficiency for each data type. To solve this, we built an adaptive index selection and construction mechanism:

  • For the input data, we perform piecewise detection to determine the most efficient indexing algorithm
  • Each batch of data is then indexed separately, and the index is placed in the index block

Additional indexing algorithms currently supported are:

  • A compressed trie index, used for general compression of string-typed keys
  • A non-descending integer index, built with bitmaps and highly compressed
  • etc.

Supporting multiple index structures opens up more possibilities for long-term optimization, such as embedding B+-tree index data blocks inside an SST. This more flexible, modular structure makes the engine more adaptable, giving it better overall performance across different storage media and access patterns.
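A minimal sketch of what the piecewise detection might look like, assuming only the two index types listed above; the detection heuristic and the fixed-width key encoding are illustrative assumptions, not the real selection logic.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch of adaptive index selection (illustrative only).
// A batch of keys is probed and a suitable index structure is chosen: a
// bitmap-style index for non-descending integers, otherwise a compressed
// trie for general string keys.
enum class IndexType { kNonDescendingInteger, kCompressedTrie };

static bool LooksLikeNonDescendingIntegers(const std::vector<std::string>& keys) {
  uint64_t prev = 0;
  for (const auto& k : keys) {
    if (k.size() != sizeof(uint64_t)) return false;  // fixed-width encoding assumed
    uint64_t v = 0;
    for (unsigned char c : k) v = (v << 8) | c;      // big-endian decode
    if (v < prev) return false;                      // must be non-descending
    prev = v;
  }
  return true;
}

IndexType ChooseIndexType(const std::vector<std::string>& keys) {
  if (LooksLikeNonDescendingIntegers(keys)) {
    return IndexType::kNonDescendingInteger;  // bitmap-based, highly compressed
  }
  return IndexType::kCompressedTrie;          // general string keys
}
```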

3.5 Extreme Compression

For database applications, besides pursuing high performance, cost control is also very important, and a key part of cost control is data compression. For LSM structures, compression is done per block by default, and the compression ratio is strongly tied to the block size.

To address the trade-off between block size and compression ratio, we built a series of compression algorithms, the most commonly used of which is a global compression scheme that can still extract data per record. The basic idea is not complicated: through improvements to the LZ family, we efficiently preserve the offset of each record. The offset table itself can be stored compressed in many ways (it is obviously a non-decreasing sequence of integers), so optimizing its storage with techniques such as PForDelta is not hard.

Figure: Outline flow of the global compression algorithm

The general process is as follows:

  • The data is scanned and sampled to build a data dictionary
    • By default, SST Compaction needs to be modified to allow two scan passes
  • Using the data dictionary, the original data is compressed with a sliding window
  • Finally, an optional round of entropy compression is applied
    • Mainstream entropy coders include ANS and high-order Huffman, which can be chosen according to the actual data distribution
  • During compression, the offset of each original record is saved, forming an offset table

Figure: Build and save offset table

The offset table can be compressed and stored using the common PForDelta algorithm. Note that the offset table is accessed frequently; both first-order and second-order difference tables are viable and can be chosen according to the actual situation.
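For illustration, a simplified first-order-difference encoding of the offset table is sketched below; plain varints are used here in place of PForDelta, and the offsets are assumed to be non-decreasing.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Simplified sketch of offset-table compression (the article uses PForDelta;
// this sketch shows plain first-order deltas + varints). Offsets are
// non-decreasing, so the deltas are small non-negative integers that
// compress well.
static void PutVarint64(std::string* out, uint64_t v) {
  while (v >= 0x80) {
    out->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  out->push_back(static_cast<char>(v));
}

std::string EncodeOffsets(const std::vector<uint64_t>& offsets) {
  std::string out;
  uint64_t prev = 0;
  for (uint64_t off : offsets) {
    PutVarint64(&out, off - prev);  // first-order difference
    prev = off;
  }
  return out;
}
```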

The index can then map directly to a specific record offset, enabling direct record-level addressing later on.

3.6 New Hardware support

We have also adapted to and optimized for popular new hardware (such as persistent memory, FPGAs, and QAT). For example, when the machine has QAT hardware and the host CPU load is high, we offload some of the CPU compression work to QAT. As another example, we place some data in persistent memory to achieve true record-level data access (indexing directly by record instead of using the usual block-level index structure), and so on.

Here we take QAT compression as an example to illustrate:

  • The popularity of PCIe NVMe SSDs greatly improves disk bandwidth and IOPS

  • The increase in disk bandwidth in turn shifts the system bottleneck to the CPU

    • CPU tasks include data sorting (SSTs must remain ordered), CRC verification, data compression, query comparison, and so on
  • Our initial plan is to offload data compression and CRC verification to dedicated QAT hardware for acceleration

    • At present, QAT hardware is fairly cost-effective, and some motherboards even ship with it built in
  • QAT itself can support a limited number of compression algorithms, mainly zlib deflate

  • With data compression and CRC offloaded, we can devote more CPU to SST GC and Compaction, optimizing the SST layout as quickly as possible

At present, our use of QAT is still in the testing stage and has not been officially launched. Next, we plan to investigate FPGA applications in depth.
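The offload decision itself can be sketched as follows; the load threshold and the two compression callbacks are hypothetical placeholders, since the actual QAT integration is not covered here.

```cpp
#include <functional>
#include <string>

// Purely illustrative sketch of the offload decision. The real QAT
// integration is not shown; compress_with_qat / compress_with_cpu are
// hypothetical callbacks supplied by the caller.
struct CompressorCallbacks {
  std::function<std::string(const std::string&)> compress_with_qat;
  std::function<std::string(const std::string&)> compress_with_cpu;
};

std::string Compress(const std::string& input,
                     double host_cpu_load,  // e.g. 0.0 .. 1.0
                     bool qat_available,
                     const CompressorCallbacks& cb) {
  // Offload to QAT only when the hardware is present and the CPU is busy;
  // otherwise use the software path (e.g. zlib deflate).
  if (qat_available && host_cpu_load > 0.8) {
    return cb.compress_with_qat(input);
  }
  return cb.compress_with_cpu(input);
}
```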

4. Comparison tests

We performed some simple tests using RocksDB's bench tool to compare RocksDB, TitanDB, and TerarkDB. Note that the tool uses randomly generated data, which is not very friendly to TerarkDB's compression algorithms, so there is not much difference in compression ratio.

In this round of improvements we focused on the performance of KV separation, so we only benchmarked larger values to confirm the effect:

  • Test environment:

    • The original test data set size was 256 GB and memory was 128 GB
    • CPU : 48 Core
    • RAM : 128 GB
    • Disk: Intel NVMe 3.4T
    • The test program is db_bench
    • Linux version: 4.14
    • GCC version: 6.3.0
  • Test content:

    1. fillrandom: Multiple threads write data randomly, and duplicate keys exist
    2. readseq: Multithreaded sequential Scan
    3. readrandom: Multithreaded random Get
    4. seekrandom: Multithreaded random Seek

Figure: Benchmark results, Value size = 4 KB

Figure: Benchmark results, Value size = 32 KB

5. Future Work

Bytedance will continue to invest more in standalone storage engines, and will also consider building specialized engines for specific businesses, with the goal of providing upper-layer businesses with stronger performance, more flexible functionality, and more reliable service within a single machine.

To achieve these goals, there is still a lot to do, including offloading Compaction from a single engine's CPU to run distributed Compaction on a cluster, introducing SPDK-related technologies to improve I/O efficiency, introducing AI tuning to make I/O policies more adaptive to different workloads, and adopting new hardware (such as persistent memory and FPGAs).

To make Bytedance's storage engines more diverse and move them to the forefront of the industry, we are eager for ambitious candidates to join us in new explorations. We also hope to see Bytedance actively participate in mainstream journals and open source communities in the future and contribute to the technical community.

6. References

  1. WiscKey: Separating Keys from Values in SSD-conscious Storage
  2. Bitcask A Log-Structured Hash Table for Fast Key/Value Data
  3. LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data
  4. Towards Accurate and Fast Evaluation of Multi-Stage Log-Structured Designs

More articles

Deep understanding of the Linux kernel: TLB shootdown caused by jemalloc and its optimization

Bytedance's self-developed trillion-scale graph database & graph computing practice

Bytedance EB-level HDFS practice


Bytedance infrastructure team

Bytedance's infrastructure team supports the smooth operation of Bytedance's products with hundreds of millions of users, including Douyin, Toutiao, Xigua Video, and Huoshan Video, providing the guarantees and driving force for the rapid and stable development of Bytedance and its businesses.

Within the company, the infrastructure team is mainly responsible for building Bytedance's private cloud, managing clusters of tens of thousands of servers, handling the mixed deployment of tens of thousands of compute/storage units and online/offline workloads, and supporting the stable storage of EBs of data.

Culturally, the team embraces open source and innovative hardware and software architectures. We have long-term openings in the infrastructure field; see job.bytedance.com for details. If you are interested, please contact [email protected].

Welcome to the Bytedance technical team