Deeply interpret the overall architecture design and implementation of MRS IoTDB sequential database

Huawei cloud community in June came, fresh out of the Top10 technology dry goods, heavy technical topics to share; There are big challenges in graduation season, huawei cloud experts take you to make a good career plan.

Abstract: This paper will systematically introduce the background and function characteristics of MRS IoTDB, focusing on the overall architecture design and implementation of MRS IoTDB sequential database.

This article is shared by Huawei cloud community “Overall Architecture Design and Implementation of MRS IoTDB Timing Database”, originally written by cloudsong.

MRS IoTDB is the latest timing database product of Huawei FusionInsightMRS Big Data Suite. Its leading design concept is becoming more and more competitive in the field of timing database and has been recognized by more and more users. In order to better understand MRS IoTDB, this paper will systematically introduce the ins and outs and functions of MRS IoTDB, focusing on the overall architecture design and implementation of MRS IoTDB timing database.

1. What is a sequential database

Sequence database is short for time series database, which is a special database system for storing, querying, analyzing and other processing operations of data with time label (according to the sequential change of time, namely time serialization). In layman’s terms, a timing database is a database dedicated to recording various values (measuring points and events) such as temperature, humidity, speed, pressure, voltage, current, and bid and offer rates of Internet of Things devices that change over time.

At present, with the deepening development and application of big data technology, two types of data represented by InternetOf Things (IoT) and financial analysis continuously produce a large number of sensor values or event data over time. Timeseries data is a continuous sequence of values based on the time (timestamp) of the data (event). For example, the temperature data of an Internet of Things device at different times constitute a time series data:

Both machine-generated sensor data and social event data generated by human activities have some common characteristics:

(1) High collection frequency: dozens, hundreds, hundreds or even millions of times per second;

(2) High accuracy of collection: at least millisecond level of collection is supported, and some need to support microsecond level and nanosecond level of collection;

(3) Large collection span: 7*24 hours to continuously collect data for several years or even decades;

(4) Long storage cycle: it needs to support the persistent storage of time series data, even some data need to be permanently stored for hundreds of years (such as seismic data);

(5) Query window length: need to support time window query of different granularity from milliseconds, seconds, minutes, hours to day, month, year and so on; Also need to support ten thousand, one hundred thousand, million, ten million and other granularity of the number of window query;

(6) Data cleaning is difficult: time series data are out of order, missing, abnormal and other complex situations, requiring special algorithms for efficient real-time processing;

(7) High real-time requirements: whether sensor data or event data, millisecond and second-level real-time processing capabilities are required to ensure real-time response and processing capabilities;

(8) Strong algorithm expertise: time series data in different fields such as earthquake, finance, electricity, transportation, etc., have many professional requirements for time series analysis in vertical fields, and need to use time series trend prediction and similarity sub-sequence.

Analysis, periodic prediction, time moving average, exponential smoothing, time autoregressive analysis and lSTM-based sequential neural network algorithms for professional analysis.

Can be seen from the common features of time-series data, time series special scene demand for traditional relational database storage and large data storage have brought challenges, not a relational database is adopted to structured storage, or with no database for storage, cannot satisfy the huge amounts of time-series data high concurrency real-time written demand and query. Therefore, there is an urgent need for a special database for the storage of time series data, and the concept and product of time series database was born in this way.

It is important to note that temporal databases are different from temporal and real-time databases. A TemporalDatabase, such as TimeDB, is a database that can record the change history of objects, that is, maintain the change history of data. Temporal database is a fine-grained maintenance system for the time states of time records in traditional relational database, while temporal database is completely different from relational database, and only stores the values of measurement points corresponding to different time stamps. A more detailed comparison between temporal and temporal databases will be introduced in a follow-up article and will not be detailed here.

Sequential databases are also different from real-time databases. Real-time database was born in traditional industry, mainly because of the development of modern industrial manufacturing process and large-scale industrial automation, traditional relational database is difficult to meet the storage and query requirements of industrial data. Thus, in the mid-1980s, a real-time database for industrial monitoring was born. Due to the early birth of real-time database, there are limitations in scalability, big data ecological docking, distributed architecture, data types and other aspects, but it also has the advantages of complete supporting products and complete industrial protocol docking. Time series database was born in the era of Internet of Things, which has more advantages in big data ecological docking and cloud native support.

The basic comparison information of time sequence database, temporal database and real-time database is as follows:

2. What is MRS IoTDB sequential database

MRS IoTDB is a timing database product in Huawei FusionInsightMRS Big Data suite. It is a high-performance enterprise-level timing database product launched on the basis of deep participation in the Open source edition of ApacheIoTDB community. IoTDB, as its name implies, is a special timing database software launched for the field of IoT and Internet of Things. It is a domestic Apache open source software initiated by Tsinghua University. Since the birth of IoTDB, Huawei has been deeply involved in the architecture design and core code contribution of IoTDB, invested a lot of manpower and put forward a lot of improvement suggestions and contributed a lot of code to the stability, high availability and performance optimization of IoTDB cluster edition.

At the beginning of design, IoTDB comprehensively analyzed the products related to timing database on the market. Mainstream timing databases, including Timescale based on traditional relational database, OpenTSDB based on HBase, KariosDB based on Cassandra, and InfluxDB based on timing exclusive structure, draw on the advantages of different timing data in implementation mechanisms. Has formed its own unique technical advantages:

(1) Support high-speed data writing

The unique tLSM algorithm based on two-stage LSM merger effectively ensures that IoTDB can easily realize the concurrent writing ability of ten million data points per second in a single machine even in the case of out-of-order data.

(2) Support high-speed query

Supports TB data query at the millisecond level

(3) Complete function

It supports CRUD and other complete data operations (update is realized by writing to the measurement point data of the same device with the same timestamp, and delete is realized by setting TTL expiration time). It supports frequency domain query, has rich aggregation functions, and supports professional time sequence processing such as similarity matching and frequency domain analysis.

(4) Rich interface, easy to use

Supports various interfaces such as JDBC, Thrift API, and SDK. Using SQL statements, the standard SQL statements added for the time sliding-window statistics and other commonly used time sequence processing functions, to provide system efficiency. The Thrift API supports Java, C\C++, Python, C# and other multi-language interfaces.

(5) Low storage cost

The TsFile timing file storage format independently developed by IoTDB is specially optimized for timing processing. Based on column storage, it supports explicit data type declaration, and different data types automatically match different compression algorithms such as SNAPPY, LZ4, GZIP and SDT. The compression ratio of 1:150 or even higher can be achieved (if the data accuracy is further reduced), which greatly reduces the storage cost of users. For example, if a user used 9 KariosDB servers to store time series data, IoTDB can be easily implemented with 1 server of the same configuration.

(6) Multi-mode deployment of cloud edge and end

The unique lightweight architecture design of IoTDB ensures that IoTDB can easily achieve “a set of engines through the cloud side, a data compatible with the whole scene”. In the cloud service center, IoTDB can adopt cluster deployment to give full play to the cluster processing advantages of cloud. In the edge computing location, IoTDB can deploy a single IoTDB on the edge server or a cluster version with a few nodes, depending on the configuration of the edge server. In the device terminal, IoTDB can be directly embedded in the form of TsFile file in the local storage of the terminal device, and can be directly read and written by the device terminal. It does not need the startup and running of the IoTDB database server, which greatly reduces the requirements on the processing capacity of the terminal device. Due to the open format of TsFile, any terminal language and development platform can directly read and write the binary byte stream of TsFile, or use the SDK of TsFile for reading and writing, or even send the TsFile file to the edge or cloud service center through FTP.

(7) Integration of query and analysis

IoTDB supports both real-time read and write and distributed computing engine analysis of a piece of data. The loose coupling design of TsFile and IoTDB engine ensures that IoTDB can efficiently write and query timing data by using the proprietary timing data processing engine. TsFile can also be read and written by Flink, Kafka, Hive, Pulsar, RabbitMQ, RocketMQ, Hadoop, Matlab, Grafana, Zeepelin and other big data related components. It greatly improves IoTDB’s query and analysis integration ability and ecological expansion ability.

3. Overall architecture of MRS IoTDB

MRS IoTDB on the basis of the existing ApacheIoTDB architecture, integrated MRS Manager powerful log management, operation and maintenance monitoring, rolling upgrade, security reinforcement, high availability guarantee, disaster recovery, fine-grained permission control, big data ecological integration, resource pool optimization scheduling and other enterprise core capabilities. The ApacheIoTDB kernel architecture, especially the distributed cluster architecture, has done a lot of reconstruction and optimization, and has done a lot of system-level enhancement in stability, reliability, availability and performance.

(1) Interface compatibility:

Further improve the northbound interface and southbound interface, support JDBC, Cli, API, SDK, MQTT, CoAP, Https and other access interfaces, further improve THE SQL like statement, compatible with most Influx SQL, support batch import and export

(2) Distributed peer architecture:

MRS IoTDB adopted the improved Multi-Raft protocol on the basis of Raft protocol, and optimized the low-level implementation of Muti-Raft protocol. The optimization strategies such as CacheLeader were adopted to ensure no single section failure. Further improve the performance of MRS IoTDB data query routing; At the same time, fine-grained optimization was carried out for strong consistency, medium consistency and weak consistency strategies. The virtual node strategy is added to the consistent hash algorithm to avoid data skew, and the algorithm strategy of table lookup and hash partition is combined to further guarantee the performance of cluster scheduling on the basis of improving the high availability of cluster.

(3) Two-layer granularity metadata management: Due to the peer-to-peer architecture, metadata information is naturally distributed on all nodes of the cluster for storage, but due to the large storage capacity of metadata, large memory consumption is caused. In order to balance memory consumption and performance, MRS IoTDB adopts a two-layer granularity metadata management architecture. Firstly, the metadata of time series group is synchronized among all nodes, and secondly, the metadata of time series is synchronized among partitioned nodes. In this way, when the metadata is queried, the filtering tree is pruned based on the time series group to greatly reduce the search space, and then the time series metadata is queried on the partitioned nodes after filtering.

(4) Local disk high-performance access:

MRS IoTDB uses the special TsFile format for time series optimization storage, uses the column format for adaptive coding and compression, supports pipeline optimization access and high-speed insertion of out-of-order data

(5) HDFS ecological integration:

MRS IoTDB supports HDFS file read and write, and implements local cache, short circuit read, HDFS I/O thread pool and other optimization methods to comprehensively improve THE READ and write performance of MRS IoTDB on HDFS. At the same time, MRS IoTDB supports Huawei OBS and provides high-performance and in-depth optimization.

Based on HDFS integration, MRS IoTDB supports efficient reading and writing of TSfiles by MRS components such as Spark, Flink, and Hive.

(6) Multi-level permission control:

Supports multi-level permission control for storage groups, devices, and sensors

Multi-level operations such as creating, deleting, and querying are supported

Kerberos Authentication

Ranger authority architecture is supported

(7) Cloud side deployment:

Flexible deployment is supported on the edge of the cloud. The edge of the cloud can be connected to Huawei IEF products or directly deployed on Huawei IES.

MRS IoTDB Cluster edition supports dynamic capacity expansion and shrinkage, providing more flexible deployment support for the cloud side.

4. Single-machine architecture of MRS IoTDB

4.1 Basic Concepts of MRS IoTDB

MRS IoTDB mainly focuses on the real-time processing of the measured point value of the device sensor in the field of IoT. Therefore, the infrastructure design of MRS IoTDB takes the device and sensor as the core concepts. At the same time, in order to facilitate the use of users and IoTDB management of time series data, the concept of storage group is added. Here is an explanation for you:

StorageGroup: a concept introduced by IoTDB to manage sequential data, similar to the concept of a database in a relational database. From the perspective of users, it is used to manage device data in groups. From the perspective of the IoTDB database, the storage group is also a unit of concurrency control and disk isolation, with parallel reads and writes between different storage groups.

Device: refers to physical devices in the real world, such as a manufacturing unit of a power plant, wind turbine, automobile, aircraft engine, and seismic wave acquisition instrument. In IoTDB, device is the unit of a single write to sequential data, and a single write request is limited to one device.

Sensor: a Sensor that corresponds to a physical device in the real world, for example, a Sensor that collects information about wind speed, steering Angle, and energy yield on a wind generator. In IoTDB, Sensor is also called Measurement point, which specifically refers to the Sensor value collected by the Sensor at a certain time and stored in the form of <time, value> in IoTDB.

The relationship between storage groups, devices, and sensors is as follows:

TimeSeries: Similar to a table in a relational database, except that this table has three main fields: Timestamp, Device ID, and Measurement. To facilitate more description of device information of time series, IoTDB also adds extended fields such as Tag and Field. Tag supports indexes, while Field does not. In some time series databases, it is also known as time line, which records the value of a sensor of a device with constant changes over time, forming a time line with continuous addition of measured point values along the time axis.

Path: IoTDB constructs a tree structure that takes root as the root node and connects storage groups, devices, and sensors in series. A Path is formed from the root node through storage groups, devices, and sensor leaves. As shown below:

Virtual storage group: The concept of storage group has dual functions of concurrent control for device groups and system. The excessive coupling of the two will affect the concurrent control of the system due to different usage modes of users. For example, if a user puts all irrelevant device data into a storage group, IoTDB locks the storage group for concurrent control, limiting the concurrent data read and write capability. In order to realize the relatively loose coupling between storage group and concurrency control, IoTDB designed the concept of virtual storage group, which split the fine granularity of concurrency control for storage group into virtual storage group, thus reducing the granularity of concurrency control.

4.2 Basic architecture of MRS IoTDB

A single MRS IoTDB consists of different storage groups. Each storage group is a unit of concurrency control and resource isolation. Each storage group contains multiple TimePartitions. Each storage group corresponds to a WAL pre-write log file and a TsFile timing data store file. The timing data in each TimePartition is first written to Memtable and then logged to WAL, and periodically flushed to TsFile asynchronously. The implementation mechanism will be described in detail later. The basic architecture of MRS IoTDB single machine is as follows:

5. Cluster architecture of MRS IoTDB

5.1 Distributed peer architecture based on Multi-raft

MRS IoTDB cluster is a completely peer-to-peer distributed architecture, which not only avoids the single point of failure based on Raft protocol, but also avoids the single point of performance caused by a single Raft consensus group through Multi-Raft protocol. Meanwhile, the underlying communication, concurrency control and high availability mechanism of distributed protocol are further optimized.

First, all nodes of the entire cluster form a metadata group (MetaGroup), which is only used to maintain the metadata information of the storage group. For example, a 4-node IoTDB cluster is shown in the blue and gray box below. All 4 nodes form a MetaGroup.

Secondly, the data group is constructed according to the number of data copies. For example, if the number of copies is 3, construct a DataGroup (DataGroup) containing three nodes. Storage groups are used to store time series data and corresponding metadata.

Reliable storage of data in distributed systems is usually implemented in the form of multiple copies. Multiple copies of the same data are stored on different nodes and must be consistent, so Raft consensus protocol is used to ensure data consistency, which breaks the problem of consistency into several relatively independent sub-issues such as leader election, log replication, consistency assurance, etc. The Raft protocol has the following important concepts:

(1) set of Raft. The Raft group has an elected leader node and the other nodes are followers. When a write request arrives, it is first submitted to the leader node for processing. The leader node records the write request in its own log and then distributes the log to the follower node.

(2) the Raft log. Raft maintains a Commit number and an Apply number in the log to ensure that actions are not lost. If a log is committed, it means that more than half the nodes in the current cluster have received and persisted the log. If a log is applied, the current node executes the log. When some node fails and recovers, the node’s log falls behind the leader’s log. The node cannot provide services to the outside world until it catches up with the leader’s log.

5.2 Layered Metadata Management

Metadata management strategy is the key point of MRS IoTDB distributed design. When designing the metadata management strategy, we should first consider the usage of metadata in the read and write process:

Metadata is required to check the validity of data types and permissions
Metadata is required to query routes. At the same time, due to the timing of the data scene in the number of elements

It is also important to consider the memory resource consumption of metadata.

Existing metadata management strategies either assign metadata to metadata nodes for special management, which reduces read and write performance. Or you can save metadata in full on all nodes of the cluster, which consumes a lot of memory resources.

In order to solve the above problems, MRS IoTDB designed a two-layer granularity metadata management strategy, whose core idea is to manage metadata by splitting it into storage groups and time series:

(1) StorageGroup metadata: the MetaGroup contains the routing information of data query, and the metadata information of the StorageGroup is fully stored on all nodes of the cluster. The granularity of storage groups is large. The order of magnitude of storage groups within a cluster is much smaller than the order of magnitude of time series. Therefore, the storage group metadata is saved on all nodes of the cluster, which greatly reduces the memory usage.

Each node in a metadata group is called a metadata holder and the Raft protocol is used to ensure data consistency between each holder and the other holders of the same group.

(2) time series metadata: the time series metadata in the DataGroup (DataGroup) contains the data type, permission and other information required for data writing, which is stored on the node where the DataGroup resides (some nodes of the cluster). Due to the time sequence of metadata particle size small, far more than the storage group metadata, so the time series of the metadata stored in the data set is located on the node, to avoid the unnecessary memory footprint, as well as by storing metadata set of first-order filter rapid positioning, at the same time, the Raft consistency of data sets to avoid the metadata storage time sequence of a single point of failure.

Each node in a data group is called a data partition holder and the Raft protocol is used to ensure data consistency between each holder and the other holders of the group.

According to this method will metadata storage group and time sequence is two layers of granularity holders in the metadata and data partitioning, respectively, holders of management, due to the time series data and metadata in the data synchronization in the group, so every time writing data do not need to inspection and the synchronous operation of metadata, only need to modify the time series of the metadata stored metadata set of checks and synchronous operation, This improves system performance. For example, when creating a time series and performing 500,000 data writes, the metadata check and synchronization operation is reduced from 500,000 to 1 **. 支那

5.3 Metadata Distribution

Based on hierarchical metadata management, metadata is classified into storage group metadata and time series metadata.

Storage group metadata has copies on all nodes of the whole group and belongs to the MetaGroup group.

Time series metadata is only stored on the corresponding DataGroup, which stores some information about time series attributes, field types, and field descriptions. The distribution of time series metadata is generated using slot hash, just like the distribution of data.

5.4 Time Series Data distribution

In distributed system implementation, sequential data is partitioned according to storage groups based on hash ring and look-up algorithm. Put each node of the cluster on the hash ring according to the hash value. For an incoming time series data point, calculate the hash value of the storage group corresponding to the name of the time series and place it on the hash ring. Search clockwise on the ring, and the first node found is the node to be inserted.

Using hash ring data partition, prone to the hash value of the two nodes of the small difference, so the consistency in the use of hash ring on the basis of introducing virtual node, and the specific practice is to each physical node virtual into several, and placing these virtual node according to the hash value to hash ring, to a great extent, to avoid the data skew, Make the data more evenly distributed.

First, 10000 slots are preset for the entire cluster and distributed evenly across dataggroups. As shown in the following figure, the IoTDB cluster has four datagroups and the entire cluster has 10,000 slots. Therefore, on average, each DataGroup has 10,000/4 =2500 slots. Since the number of datagroups is equal to the number of cluster nodes 4, this translates to an average of 2500 slots per node.

Second, complete the mapping of slot to DataGroup, Time Partition, and Time series.

The IoTDB cluster is divided into multiple DataGroup groups based on RAFT protocol. Each DataGroup group contains multiple slots and each slot contains multiple time partitions. At the same time, each time partition contains multiple time series, as shown in the figure below:

Finally, Hash the slot value to complete the input storage group and timestamp to slot mapping:

1) First partition by time range, which is convenient for time range query:

TimePartitionNum = TimeStamp %PartitionInterval

TimePartitionNum is the ID of the time partition,TimeStamp is the TimeStamp of the data to be inserted, and PartitionInterval is the interval of the time partition. The default value is 7 days.

2) Hash slot values by storage group partition:

Slot = Hash(StorageGroupName + TimePartitionNum) % maxSlotNum

In the command, StorageGroupName is the name of the storage group, TimePartitionNum is the ID of the time partition calculated in step 1, and maxSlotNum is the maximum slot number. The default value is 10000.

The relationship between Data Group and StorageGroup is shown in the following figure. Data Group 1 on node 3 and node 1 shows the situation where the same Data Group is distributed on two nodes:

Click to follow, the first time to learn about Huawei cloud fresh technology ~