Time series databases are widely used in scenarios where multi-source data flows continuously into a data platform along a time line, such as data center operation and maintenance monitoring, IoT device collection and storage, and Internet advertising click analysis. InfluxDB was developed specifically for time series storage and has great potential, especially in intelligent manufacturing in the industrial field.

The data model

1. Characteristics of time series data

In a time series scenario, data floods in from multiple sources at every point on the time line; large volumes of data are generated across many dimensions of continuous time and must be written to storage in real time, at second or even millisecond granularity.

A traditional RDBMS processes writes row by row and maintains B-tree indexes; it is not designed for high-speed batch ingestion. When multi-dimensional time series data pours into the platform, an RDBMS storage engine inevitably buckles under the load, and its write throughput is far from adequate.

Therefore, time series storage designs generally bypass the traditional RDBMS and focus instead on LSM-tree and columnar data structures.

The LSM model batches data in memory and flushes it, layer by layer, into files on disk. Some systems, such as HBase and Cassandra, arrange the data as sorted K/V pairs grouped into column-family files; others, such as Druid, store the data in columnar files, which greatly improves compression.

The structure of time series data is: data source + metric + timestamp = data point, where each data point is one measured value of a metric on the time line.

Expressed as a two-dimensional chart, as shown in the figure below, the horizontal axis is time, the vertical axis is the measured value, and each point in the chart is a data point of the metric.

The figure above can be read as two time series line plots over continuous timestamps (TimeStamp): data source 1 (power and environment monitoring – High-tech Zone machine room – data area) with the metric humidity, and data source 2 (power and environment monitoring – Xi Xian New Area machine room – computing area) with the metric humidity.

Here, power and environment monitoring is the business domain of the data source, while the machine room and the area are its tags; the combination of business domain plus machine room and area identifies a unique time series data source.
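To make the structure concrete, here is a minimal Go sketch of a data point identified by data source, metric and timestamp. The type and field names are purely illustrative, not InfluxDB's actual types:

```go
package main

import (
	"fmt"
	"time"
)

// SeriesKey identifies a unique data source: the business domain (measurement)
// plus its tag set (machine room, area, ...).
type SeriesKey struct {
	Measurement string            // e.g. "env_monitoring"
	Tags        map[string]string // e.g. {"room": "high-tech-zone", "area": "data-area"}
}

// Point is one measured value of one metric on the time line.
type Point struct {
	Series    SeriesKey
	Metric    string    // e.g. "humidity"
	Timestamp time.Time // the position on the time line
	Value     float64   // the measured value
}

func main() {
	p := Point{
		Series: SeriesKey{
			Measurement: "env_monitoring",
			Tags:        map[string]string{"room": "high-tech-zone", "area": "data-area"},
		},
		Metric:    "humidity",
		Timestamp: time.Now(),
		Value:     42.5,
	}
	fmt.Printf("%+v\n", p)
}
```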

2. OpenTSDB data model based on HBase

Besides InfluxDB, OpenTSDB is another well-known time series database. OpenTSDB is a time series implementation built on top of HBase. In my previous answers I have analyzed the internals of HBase many times: its data model derives from Google BigTable, with the column family as the overall unit of data organization and storage.

A column family can be pictured as a wide Excel-like table. Every cell is a K/V pair whose KEY is composed of row key + column family name + column qualifier + timestamp, and so on. Once the keys are sorted, the cell-like K/V pairs are naturally laid out in order of row key, column family, column qualifier and timestamp, which makes it very convenient to grab a group of column values of a column family by row key.

From the characteristics of time series data we know that data source + metric + timestamp identifies a data point, so the obvious design for OpenTSDB is to use that combination as the row key. But because timestamps are always different, every K/V ends up with a different row key, producing an enormous number of rows that each hold a single column of data.

OpenTSDB therefore optimizes by splitting the timestamp: the row key keeps only the hour-aligned base time, and the seconds within the hour become up to 3600 column qualifiers, so one row of the column family represents an hour of time series data and each column holds the value for one second.
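As an illustration only (the real OpenTSDB row key encodes metric and tag UIDs in a binary format), the following Go sketch shows the idea of splitting a timestamp into an hour-aligned row key and a seconds-within-the-hour column qualifier:

```go
package main

import (
	"fmt"
	"time"
)

// rowKeyAndColumn splits a timestamp the way OpenTSDB does conceptually:
// the row key carries the metric, the hour-aligned base time and the tags,
// while the column qualifier is the offset in seconds within that hour.
func rowKeyAndColumn(metric, tags string, ts time.Time) (rowKey string, column int) {
	baseHour := ts.Truncate(time.Hour)
	rowKey = fmt.Sprintf("%s:%d:%s", metric, baseHour.Unix(), tags)
	column = int(ts.Sub(baseHour) / time.Second) // 0..3599
	return rowKey, column
}

func main() {
	ts := time.Date(2021, 6, 1, 10, 25, 42, 0, time.UTC)
	row, col := rowKeyAndColumn("humidity", "room=high-tech-zone,area=data-area", ts)
	fmt.Println(row, col) // one row per metric+tags+hour, one column per second
}
```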

From a mechanism design point of view, OpenTSDB already optimizes well within the constraints of HBase. But the essence of HBase remains that the K/V pair is the atomic unit; many K/V pairs are packed into blocks that form HFile files.

The resulting disadvantages are:

(1) Each K/V carries a lot of redundancy because of how the KEY is constructed, and conditional indexing on tags cannot be implemented effectively: the tags are all folded into the row key, so tag-based queries degenerate into full sequential scans.

(2) The KEY does not compress well under general-purpose compression algorithms, which ultimately costs more storage. The essential problem is that the timestamp cannot be stripped out and stored independently.

3. InfluxDB data model

InfluxDB did not set out to invent a completely new storage paradigm. Instead, it built a storage engine tailored to time series data, called TSM, on top of the same LSM-tree model that HBase uses.

To restate the HBase mechanism: writes are appended to the WAL and buffered in an in-memory MemStore; when the MemStore is flushed periodically or fills up, the data is written to StoreFiles on disk, and background compactions merge-sort the files and deduplicate records.

This data model based on LSM-tree structure can greatly improve write performance, and many NoSQL systems operate in this way.

Continuing the LSM-tree pattern, InfluxDB appends incoming time series data to the WAL before writing it to the in-memory Cache, and then flushes the Cache to TSM files on disk periodically or when it fills up.
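A minimal sketch of this WAL-then-Cache write path, purely illustrative and not InfluxDB's actual code (the flush to a TSM file is elided):

```go
package main

import (
	"fmt"
	"os"
)

// writePath illustrates the LSM-style flow: append to the WAL first,
// then update the in-memory cache, and flush to an immutable file
// (a TSM file in InfluxDB's case) once the cache grows large enough.
type writePath struct {
	wal       *os.File
	cache     map[string][]string // series key -> buffered entries
	cacheSize int
	flushAt   int
}

func (w *writePath) Write(seriesKey, entry string) error {
	// 1. Durability: append to the write-ahead log.
	if _, err := fmt.Fprintf(w.wal, "%s %s\n", seriesKey, entry); err != nil {
		return err
	}
	// 2. Fast writes: buffer in memory, grouped by series.
	w.cache[seriesKey] = append(w.cache[seriesKey], entry)
	w.cacheSize++
	// 3. Flush when the threshold is reached (writing the file is elided).
	if w.cacheSize >= w.flushAt {
		fmt.Println("flushing", w.cacheSize, "entries to an immutable file")
		w.cache = map[string][]string{}
		w.cacheSize = 0
	}
	return nil
}

func main() {
	wal, _ := os.CreateTemp("", "wal")
	defer os.Remove(wal.Name())
	w := &writePath{wal: wal, cache: map[string][]string{}, flushAt: 3}
	for i := 0; i < 4; i++ {
		w.Write("env_monitoring,room=high-tech-zone", fmt.Sprintf("humidity=%d", 40+i))
	}
}
```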

The difference from HBase lies mainly in the design of the data structure. HBase's MemStore simply wraps the written data into K/V pairs and then packs them, one after another, into larger chunks.

The most striking feature of that structure is its generality: HBase arranges near-universal atomic units (K/V pairs) in sorted order, packs them into larger chunks, and keeps the relationships between pieces of data very loose. Random lookups rely not on the data layout itself but on index-assisted scans. It is a cure-all design that any upper-layer application can build on: wide-table scan queries, aggregation analysis, or a TSDB that analyses along the time line.

InfluxDB, however, reshapes the Cache structure into Map<data source, Map<field, List<timestamp: value>>>. It is a Map-Map-List structure: the key of the first-level Map is the Series, where a Series is the unique data source defined by InfluxDB (measurement + tag set); the second-level Map is the collection of fields (metrics) of that data source; and the third level is the time-ordered list of values of one field of one data source.

So InfluxDB first readjusts the data structure around the characteristics of time series data: the data source indexes its fields, and a field indexes its list of values on the time line.
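A minimal Go sketch of that nested structure and the lookup path (the types and names are illustrative, not InfluxDB's own):

```go
package main

import "fmt"

// TimedValue is one (timestamp, value) pair on the time line.
type TimedValue struct {
	Timestamp int64
	Value     float64
}

// Cache mirrors the Map<Series, Map<Field, List<Timestamp, Value>>> layout:
// series key -> field (metric) name -> time-ordered values.
type Cache map[string]map[string][]TimedValue

func (c Cache) Add(series, field string, ts int64, v float64) {
	if c[series] == nil {
		c[series] = map[string][]TimedValue{}
	}
	c[series][field] = append(c[series][field], TimedValue{ts, v})
}

func main() {
	c := Cache{}
	series := "env_monitoring,room=high-tech-zone,area=data-area"
	c.Add(series, "humidity", 1622542800, 41.2)
	c.Add(series, "humidity", 1622542801, 41.5)

	// Lookup path: series -> field -> values on the time line.
	fmt.Println(c[series]["humidity"])
}
```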

When the in-memory Cache is flushed to a TSM file, the TSM file lays out data blocks and index blocks on disk following the same structure. Within each data block, the timestamps and the values are stored as two separate lists.

The Timestamps list can then be compressed on its own (delta-of-delta encoding), and the Values list, whose entries all share the same type, can be compressed with a type-specific encoder. The index blocks record the relationship between data blocks and Series, and a binary search over the time range quickly locates the offset of the data block to be read.
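Because timestamps within a block are usually evenly spaced, delta-of-delta encoding turns most of them into zeros, which compress extremely well. A simplified sketch of the idea (InfluxDB's real encoder additionally bit-packs the results):

```go
package main

import "fmt"

// deltaOfDelta encodes a sorted timestamp list as: the first value,
// the first delta, then the difference between consecutive deltas.
func deltaOfDelta(ts []int64) []int64 {
	if len(ts) == 0 {
		return nil
	}
	out := []int64{ts[0]}
	prevDelta := int64(0)
	for i := 1; i < len(ts); i++ {
		delta := ts[i] - ts[i-1]
		out = append(out, delta-prevDelta)
		prevDelta = delta
	}
	return out
}

func main() {
	// Timestamps collected roughly every 10 seconds.
	ts := []int64{1622542800, 1622542810, 1622542820, 1622542830, 1622542841}
	fmt.Println(deltaOfDelta(ts)) // [1622542800 10 0 0 1] -> mostly zeros
}
```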

4. InfluxDB index

InfluxDB has two kinds of indexes: the Series index built into TSM files, and the inverted index (TSI).

Series index: an index block consists of N index entries, and each entry records the minimum and maximum timestamp plus the offset of the corresponding Series data block within the TSM file. The TSM file contains a number of Series data blocks, each covering the timestamps between its entry's minimum and maximum time. The index entries are sorted by Series key, so a lookup can binary-search by key and then by time range, quickly locating the time series data of a given time window before scanning and matching.
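A sketch of how such index entries can be binary-searched by time (illustrative only; the real TSM index also sorts and searches by Series key first):

```go
package main

import (
	"fmt"
	"sort"
)

// indexEntry records the time range covered by one data block of a series
// and the offset of that block inside the TSM file.
type indexEntry struct {
	MinTime, MaxTime int64
	Offset           int64
}

// firstBlockCovering returns the index of the first block whose MaxTime is
// >= t, i.e. the first block that could contain points at or after t.
// Entries within a series are assumed sorted by time.
func firstBlockCovering(entries []indexEntry, t int64) int {
	return sort.Search(len(entries), func(i int) bool {
		return entries[i].MaxTime >= t
	})
}

func main() {
	entries := []indexEntry{
		{MinTime: 0, MaxTime: 999, Offset: 0},
		{MinTime: 1000, MaxTime: 1999, Offset: 4096},
		{MinTime: 2000, MaxTime: 2999, Offset: 8192},
	}
	i := firstBlockCovering(entries, 1500)
	fmt.Println("start scanning at block", i, "offset", entries[i].Offset)
}
```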

Besides TSM, InfluxDB also has the TSI structure. TSI is an inverted index over the time series metadata. For example, the metrics of every area in the High-tech Zone machine room can be queried with the condition power and environment monitoring + High-tech Zone machine room, or the metrics of the data areas of all machine rooms can be queried with the condition power and environment monitoring + data area.

The TSI data structure is Map<tag key, Map<tag value, List<data source>>>. The first layer is the tag key; the second layer is the set of values of that tag key, for example a tag named area contains values such as data area and computing area; the third layer is the list of data sources carrying a specific tag value, for example the data sources carrying the data area tag include power and environment monitoring – High-tech Zone – data area and power and environment monitoring – Xi Xian New Area – data area.

In this way, all data sources carrying a given tag can be located quickly by that tag; then, through the Series keys of those data sources, the matching data is fetched from the TSM files using the remaining criteria such as the time range, as the sketch below illustrates.
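A minimal sketch of that Map<tag key, Map<tag value, List<data source>>> structure and a tag lookup (names are illustrative):

```go
package main

import "fmt"

// InvertedIndex maps tag key -> tag value -> series keys carrying that tag.
type InvertedIndex map[string]map[string][]string

func (idx InvertedIndex) Add(tagKey, tagValue, series string) {
	if idx[tagKey] == nil {
		idx[tagKey] = map[string][]string{}
	}
	idx[tagKey][tagValue] = append(idx[tagKey][tagValue], series)
}

// SeriesWith returns all series that carry the given tag key/value pair.
func (idx InvertedIndex) SeriesWith(tagKey, tagValue string) []string {
	return idx[tagKey][tagValue]
}

func main() {
	idx := InvertedIndex{}
	idx.Add("area", "data-area", "env_monitoring,room=high-tech-zone,area=data-area")
	idx.Add("area", "data-area", "env_monitoring,room=xixian-new-area,area=data-area")
	idx.Add("area", "computing-area", "env_monitoring,room=high-tech-zone,area=computing-area")

	// Find every data source whose "area" tag is "data-area",
	// then fetch their data from TSM files by the remaining criteria.
	fmt.Println(idx.SeriesWith("area", "data-area"))
}
```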

The TSI data model follows the same recipe as TSM and is essentially an LSM structure as well. As data is written, the inverted index is first appended to its WAL, then updated in the in-memory index, and flushed to TSI files when the memory threshold is reached. A TSI file is the on-disk inverted index file; it contains Series blocks and tag blocks, and from a tag value in a tag block you can find all of the corresponding Series in the Series block.

With these structures, the engine is very efficient at analytical queries that filter and group by data source tags, which is exactly why InfluxDB fits the needs of time series workloads so well.

Distributed architecture

I have previously written a comparison of HBase and Cassandra; if you are interested, see: Experience sharing on the comparison and analysis of HBase and Cassandra architecture.

Compared with HBase, InfluxDB is closer to Cassandra. Cassandra is a fully decentralized distributed architecture, while InfluxDB has Meta nodes and Data nodes. The Meta nodes look like master nodes, but their role is weak: they mainly hold common cluster metadata and schedule continuous queries. Reads and writes are concentrated on the Data nodes, and the interaction between Data nodes is close to a decentralized model.

In terms of the CAP theorem, InfluxDB is an AP distributed system that emphasizes high availability over consistency. Its sharding is two-level: the first level is the ShardGroup, and the second level is the Shard.

A ShardGroup is a logical concept representing the set of Shards within one time range. The time range covered by a ShardGroup is fixed when the InfluxDB database and its RETENTION POLICY are created. In other words, time series data is first partitioned by time range, for example one hour per ShardGroup.
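A sketch of mapping a point's timestamp to its ShardGroup by truncating to the group duration (illustrative, assuming a 1-hour group duration):

```go
package main

import (
	"fmt"
	"time"
)

// shardGroupFor returns the [start, end) time range of the ShardGroup
// that a point with timestamp ts falls into, given the group duration.
func shardGroupFor(ts time.Time, groupDuration time.Duration) (start, end time.Time) {
	start = ts.Truncate(groupDuration)
	return start, start.Add(groupDuration)
}

func main() {
	ts := time.Date(2021, 6, 1, 10, 25, 0, 0, time.UTC)
	start, end := shardGroupFor(ts, time.Hour)
	fmt.Println("shard group:", start, "->", end)
}
```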

However, the data generated within that hour is not stored on a single data node. This differs from HBase, where a Region writes to one data node and data is only redistributed when the Region splits and regions are migrated.

InfluxDB appears to follow Cassandra's approach: using the Series as the key, it hashes the key and distributes the multiple shards of a ShardGroup across different data nodes in the cluster.

In the following code that locates a shard, key is the Series, i.e. the unique data source (measurement + tag set).

shard := shardGroup.shards[fnv.New64a(key) % len(shardGroup.shards)]

shards in the code above is the list of Shards in the ShardGroup; its length is N/X, where N is the number of data nodes in the cluster and X is the replication factor.

For example, with four nodes in the cluster and a replication factor of 2, the one-hour ShardGroup is split into 4/2 = 2 Shards, each stored as 2 replicas, distributed across the four data nodes. Hash distribution spreads the time series data fairly evenly and lets every data node in the cluster contribute to both reads and writes.
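A runnable sketch of that shard selection under the example above (four nodes, replication factor 2, hence two shards per ShardGroup). It uses the standard library FNV-64a hash, so the call differs slightly from the pseudocode quoted earlier:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor picks the shard for a series key by hashing it with FNV-64a
// and taking the result modulo the number of shards in the ShardGroup.
func shardFor(seriesKey string, numShards int) int {
	h := fnv.New64a()
	h.Write([]byte(seriesKey))
	return int(h.Sum64() % uint64(numShards))
}

func main() {
	// 4 data nodes, replication factor 2 -> 4/2 = 2 shards per ShardGroup.
	numShards := 4 / 2
	keys := []string{
		"env_monitoring,room=high-tech-zone,area=data-area",
		"env_monitoring,room=xixian-new-area,area=data-area",
	}
	for _, k := range keys {
		fmt.Printf("%s -> shard %d\n", k, shardFor(k, numShards))
	}
}
```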

For write consistency, InfluxDB is eventually consistent and, like Cassandra, offers tunable levels (Any, One, Quorum, All). It also strengthens the Hinted Handoff mechanism of the coordinating node with a durable queue, purely for the sake of high availability.

Even if a replica node is down for a long time, the writes destined for it are kept in the coordinating node's Hinted Handoff queue; after the replica node recovers, the data is replayed from that queue, achieving eventual consistency.

InfluxDB also has a background anti-entropy repair function for shards, like Cassandra 3.x and earlier. Interestingly, Cassandra dropped background read repair in version 4.0 and did not recommend enabling it in earlier versions: with active read repair on the read path, background read repair adds little benefit and hurts performance.

Of course, I am not sure whether InfluxDB, like Cassandra, performs active read repair during reads, since the clustered version is closed source.

InfluxDB's delete support is very weak: it only allows deleting a time range or a dimension set under a Series. That is a consequence of the time-oriented structure; deleting a single point is troublesome and expensive.

The delete mechanism also appears to borrow Cassandra's Tombstone approach. The problem in a decentralized design is this: after all replicas delete a record, a replica node that happened to be down comes back, does not know what happened, and during repair would treat the deleted replica as missing data and ask the other nodes to restore it.

Because Tombstone markers are retained for a long time, a recovered node that still holds the stale replica will instead delete it immediately, following the Tombstone markers on the other replica nodes.
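A tiny sketch of the tombstone rule during repair: a replica's value survives only if no newer tombstone covers it (illustrative only; real Cassandra/InfluxDB repair is far more involved):

```go
package main

import "fmt"

// resolve applies the tombstone rule: a replica's value survives repair
// only if its write timestamp is newer than the latest tombstone.
func resolve(valueTimestamp, tombstoneTimestamp int64) string {
	if tombstoneTimestamp >= valueTimestamp {
		return "delete stale replica (tombstone wins)"
	}
	return "keep value (written after the delete)"
}

func main() {
	// A node that was down during the delete comes back with the old value.
	fmt.Println(resolve(100, 200)) // delete stale replica (tombstone wins)
	fmt.Println(resolve(300, 200)) // keep value (written after the delete)
}
```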

conclusion

For example, we used to create date-based indexes in Elasticsearch to store logs, but we always struggled with when to create them and when to drop them, and spent a lot of coding effort on it. InfluxDB's retention policy and shard grouping solve these problems neatly.

Finally, the index design fits the time series model well. The inverted index, TSI, in particular makes it very efficient to fetch, aggregate and analyse data along a given dimension within a time range, which is exactly the core requirement of time series scenarios and greatly improves the utilization of the overall computing resources.

------------------------------------------------------------

This article was published by Lao Fang, CTO of Xi'an Guardian Stone Information Technology Co., Ltd. Please indicate the source and author when reprinting.