Parker is a distributed KV storage system based on RocksDB, developed by the OPPO Internet team. It is a Redis-like storage system that mainly addresses the problems users run into with Redis at scale: long recovery times when memory is overloaded, and the inability to store massive amounts of data.

1. Introduction to Parker

Parker provides the following features:

  • Massive storage: a single cluster can store hundreds of TB of data, and its peak online TPS can reach millions.

  • Horizontal scaling: storage capacity and read/write performance scale out by adding machines, with data shards distributed across different nodes.

  • High service availability: each data shard has two replicas, and failover between the active and standby replicas completes within seconds, keeping reads and writes highly available for every shard.

  • Partial Redis protocol compatibility: Parker speaks the Redis Cluster protocol externally and supports the String and Hash types with their corresponding operations; that is, users can access data in Parker with a standard Redis client.

  • TTL support: once the timestamp specified by a record's TTL has passed, the record becomes invisible to users. Parker deletes expired data in a timely manner and reclaims the disk space it occupied.

2. The problem we ran into

We deployed 8 Parker instances on a server with 5 TB of storage, with all written data set to expire after 3 days. We expected each instance to hold steady at about 300 GB of storage, with the write rate and the expiration-reclaim rate in balance.

In practice, disk usage kept rising for five days after writes began: sharply for the first three days, then more slowly, but still at a high overall rate, which differed significantly from what we expected. Although data can no longer be read once its TTL expires, the disk space is not actually reclaimed in time, leaving disk utilization high.

3. Cause analysis

3.1 How RocksDB works

Parker’s underlying storage engine is RocksDB, whose data is organized on disk as an LSM tree. Among the available compaction styles we use leveled compaction, which is also RocksDB's default. As shown in the figure below:

When a user writes data to RocksDB, the data first goes into a MemTable. When a MemTable fills up, it becomes immutable, and a background flush thread writes it to disk as a Sorted String Table (SST) file at Level 0. When the number of SST files at Level 0 exceeds a threshold, they are merged into Level 1 according to the compaction policy, and so on down the levels.

Without compaction, writes are very fast, but read performance degrades and serious space amplification occurs. To balance write performance, read performance, and space usage, RocksDB runs compactions in the background that merge SST files across levels.
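To make this concrete, here is a minimal sketch of opening RocksDB with leveled compaction through its C++ API. The option values shown are RocksDB's defaults written out explicitly for illustration, not Parker's actual configuration:

```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Leveled compaction is RocksDB's default compaction style.
  options.compaction_style = rocksdb::kCompactionStyleLevel;
  // Once Level 0 accumulates this many SST files, a compaction
  // into Level 1 is triggered.
  options.level0_file_num_compaction_trigger = 4;
  // Target size of Level 1; each deeper level's target is larger by
  // max_bytes_for_level_multiplier.
  options.max_bytes_for_level_base = 256ULL << 20;  // 256 MB
  options.max_bytes_for_level_multiplier = 10;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/parker_demo", &db);
  assert(s.ok());
  delete db;
  return 0;
}
```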

3.2 How TTL is implemented

RocksDB provides a CompactionFilter mechanism: during a compaction, a Filter function is called on every record that passes through. This is a hook that users can customize. Parker's TTL is implemented by putting the expiration-deletion logic in this Filter function, as follows:

  1. Taking the String type as an example: when writing data to Parker, we append a four-byte TTL to the end of the value. The encoding format is shown in the following figure:

  2. In the Filter function, we implement the following logic: read the TTL at the end of the value and compare it with the current time. If the TTL is earlier than the current time, the record is considered expired and is not merged into the next level's files, which effectively deletes it. A sketch of both steps follows.
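Here is a minimal sketch of the two steps using RocksDB's C++ CompactionFilter API. The helper and class names are hypothetical, and the encoding assumes a 4-byte expiry timestamp in host byte order appended to the value; the article does not specify Parker's exact layout:

```cpp
#include <cstdint>
#include <cstring>
#include <ctime>
#include <string>

#include "rocksdb/compaction_filter.h"
#include "rocksdb/slice.h"

// Step 1 (hypothetical encoding): append a 4-byte absolute expiry
// timestamp to the user value, i.e. value | 4-byte TTL.
std::string EncodeValueWithTTL(const std::string& value, uint32_t expire_at) {
  std::string out = value;
  char buf[sizeof(expire_at)];
  memcpy(buf, &expire_at, sizeof(buf));
  out.append(buf, sizeof(buf));
  return out;
}

// Step 2: a compaction filter that drops records whose trailing TTL
// has passed. Returning true removes the record from the compaction
// output, so it is never written into the next level's files.
class TTLCompactionFilter : public rocksdb::CompactionFilter {
 public:
  bool Filter(int /*level*/, const rocksdb::Slice& /*key*/,
              const rocksdb::Slice& existing_value,
              std::string* /*new_value*/,
              bool* /*value_changed*/) const override {
    if (existing_value.size() < 4) return false;  // malformed: keep it
    uint32_t expire_at;
    memcpy(&expire_at,
           existing_value.data() + existing_value.size() - 4, 4);
    return expire_at < static_cast<uint32_t>(time(nullptr));
  }

  const char* Name() const override { return "TTLCompactionFilter"; }
};
```

Registering the filter is a single option: options.compaction_filter = &my_filter; (the filter object must outlive the DB).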

3.3 Problem Analysis

The maximum storage capacity of each level in Parker's RocksDB configuration is as follows:

A closer look at how RocksDB compaction is triggered: RocksDB compacts a level only when that level exceeds its maximum total file size (or, for Level 0, its maximum file count). For example, when the total size of Level 2 exceeds 10 GB, a compaction from Level 2 is triggered, which runs the compaction filter and filters out TTL-expired data.

Combining the problem we hit with this behavior of RocksDB, we found that in our scenario 380 GB of data had settled into Level 4, most of it already expired. But that level's maximum capacity is 1 TB, which was never reached, so no compaction was triggered there, the compaction filter never ran, and the disk space could not be reclaimed.
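The following sketch illustrates the capacity rule with hypothetical numbers (RocksDB's default base size and multiplier, not Parker's real limits): each level's target grows geometrically, and a level sitting below its target, like our Level 4 at 380 GB versus a 1 TB limit, is never chosen for compaction, so its filter never runs:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Hypothetical values: 256 MB Level-1 target, 10x growth per level.
  const uint64_t base = 256ULL << 20;
  const uint64_t multiplier = 10;
  uint64_t target = base;
  for (int level = 1; level <= 5; ++level) {
    // Prints L1 256 MB, L2 2.5 GB, L3 25 GB, L4 256 GB, L5 2.5 TB.
    printf("Level %d target: %llu MB\n", level,
           (unsigned long long)(target >> 20));
    target *= multiplier;  // compaction triggers only above the target
  }
  return 0;
}
```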

4. Solutions

Simply reconfiguring RocksDB so that each level has a smaller capacity limit would fix this particular case, but it is not a general solution: whenever the amount of stored data changes across scenarios, the RocksDB configuration would have to change too, and restarting the service to apply it is unacceptable.

Broadly speaking, there are two approaches to reclaiming the disk space held by expired data:

  • The service implements the expiration deletion logic and proactively deletes expired data.
  • When a compaction occurs, data that has expired is deleted.

To delete expired data and reclaim disk space promptly, we worked out the following schemes based on how RocksDB operates:

4.1 Implementing deletion logic in the service

Parker deletes expired data with its own logic; some well-regarded KV stores in the industry use a similar scheme. The idea is to keep the existing storage column family unchanged and add another column family that records each key's TTL, then have a goroutine delete expired data based on the current timestamp, triggered either at a fixed time (say, 1 a.m.) or at regular intervals. Keys in this column family are encoded as follows (sketched in code below):

  • Key: timestamp + key type + key value

  • Value: Stores a byte representing different data types
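For illustration, here is a minimal sketch of one possible encoding for this TTL-index key, written in C++ like the other examples (Parker itself would likely do this in Go, given the goroutine above). The 8-byte big-endian timestamp width is an assumption; the article does not specify it:

```cpp
#include <cstdint>
#include <string>

// Hypothetical layout: 8-byte big-endian expiry timestamp | 1-byte key
// type | user key. Big-endian ordering means a plain forward scan over
// this column family visits keys in expiry order, so the cleanup task
// can start at the beginning and stop at the first not-yet-expired key.
std::string EncodeTTLIndexKey(uint64_t expire_at, uint8_t key_type,
                              const std::string& user_key) {
  std::string out;
  for (int shift = 56; shift >= 0; shift -= 8) {
    out.push_back(static_cast<char>((expire_at >> shift) & 0xFF));
  }
  out.push_back(static_cast<char>(key_type));
  out.append(user_key);
  return out;
}
```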

The advantage of this scheme is that reclamation is fast and its timing is controllable; the disadvantage is that the implementation is complex, and in extreme cases TPS can drop by 50%.

4.2 OpenDbWithTTL scheme

This is a data-expiration scheme supported by RocksDB itself. The DB is opened through a specific API, and every key written to it follows the same TTL policy: with a TTL of 3 days, for example, any key automatically expires three days after it is written. The underlying implementation is again a compaction filter, so it shares the same drawback: stale data is merely hidden from the user while its disk space is not promptly reclaimed. Moreover, the TTL cannot be set per key.
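A minimal sketch of this API, using RocksDB's TTL utility (rocksdb/utilities/db_ttl.h); the path and TTL value are illustrative:

```cpp
#include <cassert>

#include "rocksdb/utilities/db_ttl.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  rocksdb::DBWithTTL* db = nullptr;
  // Every key written through this handle expires 3 days after it is
  // written. Expired keys are dropped by a built-in compaction filter,
  // i.e. only when a compaction happens to touch them, so their disk
  // space is not promptly reclaimed.
  rocksdb::Status s = rocksdb::DBWithTTL::Open(
      options, "/tmp/parker_ttl_demo", &db, /*ttl=*/3 * 24 * 3600);
  assert(s.ok());
  delete db;
  return 0;
}
```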

4.3 Proactively triggering a RocksDB compaction

When normal compactions do not trigger the compaction filter, we can manually call RocksDB's CompactRange function to force one, after which disk space is reclaimed quickly. However, an actively invoked CompactRange pauses RocksDB's own automatic compactions, which can trigger a Write Stall, so this is not ideal either.
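A minimal sketch of such a manual trigger, using the CompactRange method of RocksDB's C++ API:

```cpp
#include <cassert>

#include "rocksdb/db.h"

// Force a manual compaction over the whole key space so the compaction
// filter runs over all live data; nullptr bounds mean "everything".
void ForceReclaim(rocksdb::DB* db) {
  rocksdb::CompactRangeOptions opts;
  rocksdb::Status s =
      db->CompactRange(opts, /*begin=*/nullptr, /*end=*/nullptr);
  assert(s.ok());
}
```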

4.4 Periodic compaction + dynamic compaction

Periodic compaction: RocksDB records the creation time of each SST file and provides a periodic_compaction_seconds parameter; once a file's age exceeds this value, RocksDB proactively compacts it, giving the compaction filter a chance to reclaim its expired data. Dynamic compaction: setting level_compaction_dynamic_level_bytes to true makes RocksDB choose compaction targets dynamically instead of always merging strictly into the next level in order, so compactions happen more readily. This scheme is elegant to adopt: no changes to the existing code structure, only a few configuration changes. A sketch of the configuration follows.
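A minimal sketch of this configuration via the C++ API; the 3,600-second period matches the experiment below, while the summary later recommends 12 hours:

```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Proactively compact any SST file not rewritten within the last
  // hour, so the compaction filter revisits cold files that may be
  // full of expired data.
  options.periodic_compaction_seconds = 3600;
  // Size level targets dynamically (from the last level upward)
  // instead of strictly filling levels in order.
  options.level_compaction_dynamic_level_bytes = true;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/parker_demo", &db);
  assert(s.ok());
  delete db;
  return 0;
}
```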

Compared with the schemes above, scheme 4 is the most complete and changes the implementation logic the least, so we verified it experimentally.

5. Experimental verification

5.1 Machine Configuration

  • CPU: 64 cores
  • Memory: 188 GB
  • Disk: 5.9 TB NVMe SSD

5.2 Additional RocksDB configuration

  • RocksDB version: 6.4.6
  • level_compaction_dynamic_level_bytes = true;
  • periodic_compaction_seconds = 3600;

5.3 Test Results

We wrote String data to Parker: 30,000,000,000 (30 billion) records in total, each about 170 bytes, for an estimated total write volume of 5.32 TB (30 billion × 170 B ≈ 5.1 TB of raw values). Each record expires 10,800 seconds, i.e. three hours, after it is written. The write rate was 51 MB/s. The machine's disk usage over time is shown below:

5.4 Result Analysis

  1. In the first three hours, none of the written data had expired yet, so disk usage grew almost linearly at the write rate of 51 MB/s (about 180 GB/h). Because RocksDB compresses the data, disk usage grew by only about 520 GB.

  2. From the fourth hour, data starts to expire, but periodic compaction fires on only a few SST files: with dynamic level compaction enabled, compactions happen frequently enough that very few SST files go more than an hour without being rewritten. Moreover, even when a periodic compaction does fire, few keys are expired yet. As a result, the disk growth rate drops noticeably during this hour, but overall usage still rises: more space is written than reclaimed.

  3. From the fourth to the fifth hour, the number of expired keys grows, and more SST files trigger periodic compaction. The disk space being written and the disk space being reclaimed reach a rough balance, and the growth rate of disk usage drops to near zero.

  4. Ideally, from the fifth hour onward, disk usage should stabilize around some level, and from the fifth to the ninth hour this is roughly what happened.

  5. The RocksDB logs show that during this period a large number of SST files triggered periodic compaction and were deleted outright; the number of Level 6 SST files dropped by nearly 500. Since most keys in those files were completely expired by then, the files were reclaimed as soon as compaction touched them, and the reclaim rate exceeded the write rate.

  6. After that, fewer SST files trigger periodic compaction, causing a slight rise in disk usage; from hour 10 to hour 11 the usage can be seen increasing again.

  7. For a long time after that, disk usage stayed roughly level. At any given moment about 3 hours' worth of writes remain unexpired, roughly 600 GB, while disk usage hovered around 900 GB, a space amplification of about 1.5×.

  8. From 11:00 to 13:00 on 11/7 we throttled the writers, and overall throughput dropped from 58 MB/s to 10 MB/s. The chart shows disk usage falling sharply over this period: with the write rate reduced but the volume of expiring data, and hence the reclaim rate, unchanged, reclamation outpaced writes and disk usage fell.

  9. At 11:50 on 11/7 Parker stopped writing entirely and write throughput dropped to 0. A large amount of expired data still remained, and disk usage kept falling to a certain level.

  10. In the best case, usage would fall all the way back to its level before the data was written, but that is only the ideal. In the end, disk utilization stabilized at 10.239%, which is 3.622 percentage points above the original 6.617%; that is, about 218 GB of disk space was not reclaimed. This space is reclaimed once writes resume.

6. Summary

Reclaiming disk space through periodic compaction plus TTL expiration is feasible, and we can draw the following conclusions:

Monitoring shows that periodic compaction affects CPU load: the shorter periodic_compaction_seconds is, the higher the CPU load. This is easy to understand: the more aggressively RocksDB compacts SST files, the more CPU it consumes. We recommend setting this parameter to 12 hours.

In theory, roughly one periodic_compaction_seconds' worth of expired data across the whole RocksDB instance will be reclaimed late, causing space amplification (for example, at the test's 51 MB/s write rate and a 3,600 s period, that is on the order of 180 GB). Therefore, reserve some disk headroom at deployment time; we recommend reserving about 30% of redundant storage space.
