This article was first published on the author's Jianshu blog: www.jianshu.com/u/204b8aaab…

Version | Date       | Note
1.0     | 2021.10.19 | First published

0. Background

The title comes from InfluxDB’s background on the birth of their storage engine:

The workload of time series data is quite different from normal database workloads. There are a number of factors that conspire to make it very difficult to get it to scale and perform well:

  • Billions of individual data points
  • High write throughput
  • High read throughput
  • Large deletes to free up disk space
  • Mostly an insert/append workload, very few updates

The first and most obvious problem is one of scale. In DevOps, for instance, you can collect hundreds of millions or billions of unique data points every day. To prove out the numbers, let's say we have 200 VMs or servers running, with each server collecting an average of 100 measurements every 10 seconds. Given there are 86,400 seconds in a day, a single measurement will generate 8,640 points in a day, per server. That gives us a total of 200 * 100 * 8,640 = 172,800,000 individual data points per day, and even larger numbers in sensor data use cases.

Recently I have been responsible for monitoring some of our products. Thinking about the low-latency (RT) and high-IOPS requirements of a time-series database, I wanted to see how one is implemented internally, and whether it looks anything like Kafka or HBase, which I already knew.

A quick primer first. A time-series database stores data that changes over time and is indexed by time (a point in time or a time interval). It originated with the data collected and generated by real-time monitoring, inspection, and analysis equipment in industries such as electric power and chemicals. Such industrial data is typically produced at high frequency (a single monitoring point can generate multiple data points per second), is strongly tied to its collection time (each data point is meaningful only at its timestamp), and comes from a large number of measurement points (a conventional real-time monitoring system can have tens of thousands of monitoring points, each generating data every second). The data is historical in nature: immutable, unique, and ordered. A time-series database is therefore characterized by a simple data structure and a very large volume of data.

1. The Problem

Anyone who has used a time-series database knows this: data is usually appended, rarely (or never) deleted, and queries are generally over continuous ranges. For example:

  • On a monitoring page we usually look at data around a certain point in time, and drill into a narrower time range when we need more detail.
  • The time-series database pushes metrics of interest to the alerting system.

1.1 Pitfalls Prometheus ran into

Here, we'll take a quick look at the data model in Prometheus. It is a typical key-value pair: the key (generally called a Series) consists of a MetricName, Labels, and a Timestamp, and the value is the sample value.
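To make that model concrete, here is a minimal sketch in Go. The type names are hypothetical, chosen for illustration; they are not Prometheus's actual internals.

```go
package main

import "fmt"

// SampleKey is the "K" described above: the series identity (metric name +
// labels) together with a timestamp. The "V" is simply a float64 value.
type SampleKey struct {
	MetricName  string
	Labels      map[string]string
	TimestampMs int64
}

func main() {
	k := SampleKey{
		MetricName:  "http_requests_total",
		Labels:      map[string]string{"job": "api", "instance": "10.0.0.1:9090"},
		TimestampMs: 1634630400000,
	}
	v := 1027.0
	fmt.Printf("%s%v @ %d => %g\n", k.MetricName, k.Labels, k.TimestampMs, v)
}
```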

In early designs, Series were grouped according to certain rules, and files were likewise organized by time. Together this forms a matrix: one dimension is the Series, the other is time.

The advantage is that both writes and reads can be parallelized (by condition or by time range). The drawback is just as obvious: a query has to walk across the matrix, which easily degenerates into random reads and writes, painful on HDDs (limited seek speed) and SSDs (write amplification) alike.

So Prometheus refined its storage into another version: each Series became its own file, and a Series's data was buffered in memory until it reached 1KB, then flushed to disk in one go.
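A rough sketch of that buffering scheme, purely for illustration (the encoding, file layout, and names are assumptions, not Prometheus's real implementation): each Series gets its own in-memory buffer and its own file, and the buffer is flushed once it reaches 1KB.

```go
package main

import (
	"encoding/binary"
	"math"
	"os"
)

const flushThreshold = 1024 // flush a Series once ~1KB has accumulated

// seriesBuffer accumulates encoded samples for one Series in memory.
type seriesBuffer struct {
	file string // this Series's own file
	buf  []byte
}

// append encodes one (timestamp, value) pair as 16 bytes and flushes
// once the buffer reaches the threshold.
func (s *seriesBuffer) append(tsMs int64, value float64) error {
	var rec [16]byte
	binary.LittleEndian.PutUint64(rec[0:8], uint64(tsMs))
	binary.LittleEndian.PutUint64(rec[8:16], math.Float64bits(value))
	s.buf = append(s.buf, rec[:]...)
	if len(s.buf) >= flushThreshold {
		return s.flush()
	}
	return nil
}

// flush appends the buffered bytes to the Series file and resets the buffer.
// Anything still sitting in buf when the machine crashes is lost (problem 1 below).
func (s *seriesBuffer) flush() error {
	f, err := os.OpenFile(s.file, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.Write(s.buf); err != nil {
		return err
	}
	s.buf = s.buf[:0]
	return nil
}

func main() {
	sb := &seriesBuffer{file: "http_requests_total.series"}
	for i := 0; i < 100; i++ {
		_ = sb.append(int64(i)*1000, float64(i)) // 100 samples ~= 1.6KB, one flush
	}
}
```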

This alleviates the problem of random reads and writes, but it also introduces new problems:

  1. If the machine crashes while data that has not yet reached 1KB is still in memory, that data is lost
  2. The number of Series can easily explode, leading to high memory usage
  3. Following on from the above, the disk gets very busy when buffered data is flushed all at once
  4. Likewise, a huge number of files end up open, exhausting file descriptors (FDs)
  5. When an application has not reported data for a long time, should its in-memory data be flushed? There is no good way to decide

1.2 Pitfalls InfluxDB ran into

1.2.1 LevelDB, based on the LSM Tree

The write performance of an LSM Tree is much better than its read performance. However, InfluxDB exposes a deletion API, and deletions in an LSM Tree are troublesome: a tombstone record is inserted, queries must merge their result set against the tombstones, and only later does a compaction actually remove the underlying data. On top of that, InfluxDB supports TTL-based expiry, which amounts to range deletions.
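A toy in-memory model (not InfluxDB's or LevelDB's actual code) of why deletes hurt in an LSM-style store: a delete only appends a tombstone, every read must check against tombstones, and the data is physically removed only when a compaction runs.

```go
package main

import "fmt"

// record is an append-only entry; Tombstone marks a logical delete.
type record struct {
	Key       string
	Value     float64
	Tombstone bool
}

type lsmToy struct {
	log []record // append-only; stands in for memtable + SSTables
}

func (l *lsmToy) put(k string, v float64) { l.log = append(l.log, record{Key: k, Value: v}) }
func (l *lsmToy) del(k string)            { l.log = append(l.log, record{Key: k, Tombstone: true}) }

// get scans newest-to-oldest and must respect tombstones; the deleted data
// still occupies space until compaction.
func (l *lsmToy) get(k string) (float64, bool) {
	for i := len(l.log) - 1; i >= 0; i-- {
		if l.log[i].Key == k {
			if l.log[i].Tombstone {
				return 0, false
			}
			return l.log[i].Value, true
		}
	}
	return 0, false
}

// compact rewrites the log, finally dropping tombstoned keys for real.
func (l *lsmToy) compact() {
	latest := map[string]record{}
	for _, r := range l.log {
		latest[r.Key] = r
	}
	l.log = l.log[:0]
	for _, r := range latest {
		if !r.Tombstone {
			l.log = append(l.log, r)
		}
	}
}

func main() {
	db := &lsmToy{}
	db.put("cpu|host=a|t=1", 0.42)
	db.del("cpu|host=a|t=1")
	fmt.Println(db.get("cpu|host=a|t=1")) // 0 false, but the bytes are still on "disk"
	db.compact()                          // only now is the data truly gone
	fmt.Println(len(db.log))              // 0
}
```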

To avoid this slow deletion, InfluxDB adopted a sharding design: different time ranges are split into different LevelDB instances, so dropping a range only requires closing that database and deleting its files. However, when the amount of data is large, this leads to far too many open file handles.
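The sharding trick can be sketched roughly like this (hypothetical paths and helper names): each time range maps to its own shard directory holding an independent store, and retiring an expired range is just removing that directory instead of issuing range deletes inside the engine.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// shardKey buckets timestamps into one shard per day; real systems often
// shard by hour or by week instead.
func shardKey(t time.Time) string { return t.UTC().Format("2006-01-02") }

// shardDir returns the directory holding that shard's underlying store
// (in InfluxDB's early design, a separate LevelDB per shard).
func shardDir(base, key string) string { return filepath.Join(base, key) }

// dropShard retires an entire time range by deleting its files,
// avoiding expensive range deletes inside the storage engine.
func dropShard(base, key string) error {
	return os.RemoveAll(shardDir(base, key))
}

func main() {
	base := "/var/lib/tsdb/shards"
	now := time.Now()
	fmt.Println("today's shard:", shardDir(base, shardKey(now)))

	// Retention: drop everything older than 7 days, shard by shard.
	expired := shardKey(now.AddDate(0, 0, -8))
	if err := dropShard(base, expired); err != nil {
		fmt.Println("drop failed:", err)
	}
}
```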

1.2.2 BoltDB, based on an mmap'ed B+Tree

BoltDB stores its data in a single file, and its mmap-based B+Tree performs reasonably well at runtime. But once writes grow large, things get messy: how do you cope with hundreds of thousands of Series being written at the same time?

To alleviate this, InfluxDB introduced a WAL, which effectively mitigates random writes: multiple adjacent writes are buffered and then flushed together, much like MySQL's BufferPool. This does not solve the degradation of write throughput, however; it merely delays it.
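A bare-bones WAL sketch, assuming a trivial text encoding and made-up names rather than InfluxDB's real implementation: every incoming point is appended to one sequential log, and many adjacent writes are flushed and fsynced together.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// wal appends every incoming write to one sequential file before it is
// applied to the main storage, turning many small random writes into
// a single sequential stream.
type wal struct {
	w    *bufio.Writer
	file *os.File
}

func openWAL(path string) (*wal, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	return &wal{w: bufio.NewWriter(f), file: f}, nil
}

// append buffers one entry; many adjacent entries are flushed together.
func (l *wal) append(series string, tsMs int64, value float64) error {
	_, err := fmt.Fprintf(l.w, "%s %d %g\n", series, tsMs, value)
	return err
}

// sync flushes the buffered batch and fsyncs it so a crash cannot lose it.
func (l *wal) sync() error {
	if err := l.w.Flush(); err != nil {
		return err
	}
	return l.file.Sync()
}

func main() {
	l, err := openWAL("tsdb.wal")
	if err != nil {
		panic(err)
	}
	for i := 0; i < 1000; i++ {
		_ = l.append("cpu,host=a", int64(i)*1000, float64(i))
	}
	_ = l.sync() // one sequential flush instead of 1000 random writes
}
```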

2. Solutions

If you think about it, the hot data in a time-series database is concentrated in the most recent time range. There are far more writes than reads, almost no deletes, and data is only ever appended sequentially. We can therefore adopt very aggressive storage, access, and retention policies for a time-series database.

2.1 Key Data Structures

  • Use a variant of the Log-Structured Merge Tree (LSM Tree) in place of the traditional relational database's B+Tree as the storage structure. The LSM Tree suits workloads with many writes and few reads (it turns random writes into sequential writes) and data that almost never changes. A common implementation uses time as the key; in InfluxDB this structure is called the Time-Structured Merge Tree (TSM).

  • Time-series databases even have a more extreme, and not uncommon, form called the Round Robin Database (RRD). It is built on the idea of a ring buffer: it can only store a fixed number of the most recent data points, and once capacity is exceeded the oldest data is overwritten. The database therefore has a fixed size, yet it can accept an unlimited stream of input. A minimal sketch follows this list.
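A minimal ring-buffer sketch of the RRD idea (illustrative only): the capacity is fixed at creation time, and once the buffer is full, new samples overwrite the oldest ones.

```go
package main

import "fmt"

// rrd keeps only the newest cap(samples) values; older ones are
// silently overwritten, so the database size never grows.
type rrd struct {
	samples []float64
	next    int
	full    bool
}

func newRRD(capacity int) *rrd { return &rrd{samples: make([]float64, capacity)} }

// push stores a new sample, overwriting the oldest one when full.
func (r *rrd) push(v float64) {
	r.samples[r.next] = v
	r.next = (r.next + 1) % len(r.samples)
	if r.next == 0 {
		r.full = true
	}
}

// snapshot returns the retained samples, oldest first.
func (r *rrd) snapshot() []float64 {
	if !r.full {
		return append([]float64(nil), r.samples[:r.next]...)
	}
	return append(append([]float64(nil), r.samples[r.next:]...), r.samples[:r.next]...)
}

func main() {
	r := newRRD(5)
	for i := 1; i <= 8; i++ {
		r.push(float64(i))
	}
	fmt.Println(r.snapshot()) // [4 5 6 7 8]: only the latest 5 survive
}
```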

2.2 Key Policies

  • WAL (Write-Ahead Log): as in many data-intensive applications, the WAL ensures data durability and mitigates random writes. In a time-series database it also serves queries: when a request arrives, the storage engine merges data from the WAL with data already flushed to disk. In addition, the WAL is compressed with Snappy, a low-overhead compression algorithm.
  • Configure aggressive data retention policies, such as automatically deleting data past its TTL, to save storage space and improve query performance. For an ordinary database, automatically deleting data after a retention period would be unthinkable.
  • Resample (downsample) the data to save space. For example, data from the last few days may need per-second precision, while cold data from a month ago only needs per-day precision, and data from a year ago only needs per-week precision. Resampling in this way saves a great deal of storage space; a small sketch follows this list.
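A simple downsampling sketch (the function name and bucketing are assumptions, not any particular database's API): per-second raw points are rolled up into one averaged point per day.

```go
package main

import (
	"fmt"
	"time"
)

type point struct {
	TS    time.Time
	Value float64
}

// downsampleDaily rolls fine-grained points up into one averaged point per day.
func downsampleDaily(points []point) []point {
	type agg struct {
		sum   float64
		count int
	}
	buckets := map[time.Time]*agg{}
	for _, p := range points {
		day := p.TS.UTC().Truncate(24 * time.Hour)
		if buckets[day] == nil {
			buckets[day] = &agg{}
		}
		buckets[day].sum += p.Value
		buckets[day].count++
	}
	var out []point
	for day, a := range buckets {
		out = append(out, point{TS: day, Value: a.sum / float64(a.count)})
	}
	return out
}

func main() {
	base := time.Date(2021, 10, 1, 0, 0, 0, 0, time.UTC)
	var raw []point
	for i := 0; i < 3*86400; i += 10 { // three days of 10-second samples
		raw = append(raw, point{TS: base.Add(time.Duration(i) * time.Second), Value: float64(i % 100)})
	}
	daily := downsampleDaily(raw)
	fmt.Printf("%d raw points -> %d daily points\n", len(raw), len(daily))
}
```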

3. Summary

Overall, the internals of a time-series database are no simpler than those of Kafka or HBase, and they are well worth studying.

3.1 Reference Links

  • prometheus.io/docs/promet…
  • docs.influxdata.com/influxdb/v1…
  • blog.csdn.net/cymm_liu/ar…
  • zhuanlan.zhihu.com/p/32710333
  • archive.docs.influxdata.com/influxdb/v0…
  • Zhou Zhiming: The Phoenix Architecture