Abstract: This article introduces the basic concepts, application scenarios, requirements and capabilities of time-series databases, and walks you through their past and present.
Time-series databases have suddenly become popular. Facebook has open-sourced its Beringei time-series database, and TimescaleDB, a PostgreSQL-based time-series database, has also gone open source. As a critical service for the Internet of Things, this frequent activity in the industry shows that enterprises are eagerly embracing the arrival of the IoT era.
This article introduces the basic concepts, application scenarios, requirements and capabilities of time-series databases, taking you through their history.
Application scenarios
A time-series database is a vertical database highly optimized for time-series data. Manufacturing, banking and finance, DevOps, social media, healthcare, smart home, networking, and many other industries have applications well suited to time-series databases:
- Manufacturing: for example, a lightweight production management cloud platform uses IoT and big data technology to collect and analyze the time-series data generated in the production process, presenting production progress, target achievement, and the utilization of people, machines and materials on the shop floor in real time, making the production site fully transparent and improving production efficiency.
- Banking and finance: trading systems for traditional securities and emerging cryptocurrencies collect and analyze the time-series data generated during trading to implement quantitative trading.
- DevOps: operations and maintenance systems for IT infrastructure and applications collect and analyze monitoring metrics from running devices and application services to track the health of devices and applications in real time.
- Social media: big data platforms for social apps track user interaction data to analyze user habits and improve the user experience; live-streaming systems collect monitoring metrics from streamers, viewers and every intermediate link to monitor stream quality.
- Healthcare: business intelligence tools collect health data from smartwatches and smart wristbands to track key indicators and overall health status.
- Smart home: home IoT platforms collect data from smart home devices for remote monitoring.
- Networking: network monitoring systems display network latency and bandwidth usage in real time.
Requirements for time-series data
In the above scenarios, especially in IoT and O&M monitoring, massive amounts of monitoring data must be stored and managed. Take the Huawei Cloud Eye Service (CES) as an example: a single region monitors more than 70 million metrics and processes 900,000 reported metric points every second. Assuming each point is 50 bytes, about 1 PB of monitoring data is generated each year. A self-driving car, for its part, generates 80 GB of sensor data a day.
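A quick back-of-the-envelope check, using only the numbers quoted above, shows that the yearly raw volume is indeed on the order of a petabyte:

```python
# Back-of-the-envelope check of the CES figures quoted above.
points_per_second = 900_000       # reported metric points per second
bytes_per_point = 50              # assumed size of one metric point
seconds_per_year = 365 * 24 * 3600

raw_bytes_per_year = points_per_second * bytes_per_point * seconds_per_year
print(raw_bytes_per_year / 1e15)  # ~1.4 PB of raw data per year, on the order of 1 PB
```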
Traditional relational databases cannot handle such data volumes or such write pressure, and Hadoop-based big data solutions and existing time-series databases also face great challenges. Large-scale IoT and public-cloud-scale O&M monitoring scenarios place the following requirements on a time-series database:
- Continuous high-performance writes: monitoring metrics are usually collected at a fixed frequency. In some industrial IoT scenarios the sampling interval is extremely short, reaching 100 ns in some cases, and public cloud O&M monitoring collects at second-level intervals. A time-series database must sustain uninterrupted, high-pressure writes 24/7.
- High-performance queries: the value of a time-series database lies in data analysis, which has strict real-time requirements. Typical analysis tasks such as anomaly detection and predictive maintenance frequently read large volumes of time-series data from the database, so to keep analysis real-time, the database must respond quickly to queries over massive amounts of data.
- Low storage cost: data volumes in IoT and O&M monitoring scenarios grow exponentially, exceeding typical OLTP database scenarios by a factor of 1,000 or more, and these workloads are highly cost sensitive, requiring low-cost storage solutions.
- Support for massive timelines: in large-scale IoT and public cloud O&M scenarios, the number of monitored metrics usually reaches tens or even hundreds of millions, so the time-series database must be able to manage hundreds of millions of timelines (see the sketch after this list for how timelines multiply).
- Elasticity: monitoring workloads can also surge suddenly. For example, the O&M monitoring data of the Huawei WeLink service grew 100-fold during the epidemic. A time-series database must be elastic enough to scale out quickly to absorb such sudden growth.
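To make the timeline requirement concrete: a timeline is one unique combination of a metric name and its tag values, so timelines multiply across hosts, metrics and sub-devices. The sketch below uses purely hypothetical numbers to show how quickly the count reaches hundreds of millions.

```python
# Hypothetical fleet: every unique (host, metric, sub-device) combination
# is one timeline that the time-series database must index and manage.
hosts = 200_000          # monitored hosts (hypothetical)
metrics_per_host = 100   # CPU, memory, disk, network ... metrics per host
devices_per_metric = 5   # per-core / per-disk / per-interface breakdown

timelines = hosts * metrics_per_host * devices_per_metric
print(f"{timelines:,} timelines")  # 100,000,000 timelines to index
```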
Capabilities of open-source time-series databases
Over the past 10 years, with the rapid adoption and development of the mobile Internet, big data, artificial intelligence, the Internet of Things, machine learning and related technologies, many time-series databases have emerged. Because they adopt different technologies and designs, they differ considerably in how well they meet the requirements of time-series data. The following sections take several of the most widely used open-source time-series databases as the objects of analysis.
- OpenTSDB
OpenTSDB uses HBase as its underlying storage and builds its own logic layer and external interface layer on top. This architecture makes full use of HBase's features to achieve high data availability and high write performance. However, compared with InfluxDB, OpenTSDB has a deeper data stack, leaving room for further optimization in read/write performance and data compression.
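For illustration, data points are typically written to OpenTSDB through its HTTP /api/put endpoint. The sketch below assumes an OpenTSDB instance on its default port 4242 and uses the third-party requests library; the metric and tag names are made up.

```python
import requests  # third-party HTTP library

# One data point in OpenTSDB's JSON format: metric name, Unix timestamp
# (seconds), numeric value, and a set of tags (the metric/tags are made up).
point = {
    "metric": "sys.cpu.user",
    "timestamp": 1609459200,
    "value": 42.5,
    "tags": {"host": "web01", "dc": "region-1"},
}

# POST to /api/put on a local OpenTSDB instance (default HTTP port 4242);
# OpenTSDB then persists the point into HBase.
resp = requests.post("http://localhost:4242/api/put", json=point)
resp.raise_for_status()
```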
- InfluxDB
InfluxDB is a popular time-series database. It has a self-developed storage engine and an inverted index that accelerates multi-dimensional conditional queries, which makes it well suited to time-series workloads. However, since dashboards and aggregation analysis are the main query scenarios of a time-series database, a single query may need to group and aggregate hundreds of millions of data points, and the volcano (iterator-based) execution model adopted by InfluxDB significantly limits the performance of such aggregation queries.
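The kind of multi-dimensional write and aggregation query discussed above looks roughly like the sketch below, which assumes the InfluxDB 1.x Python client (the influxdb package), a local instance, and a hypothetical database named monitoring.

```python
from influxdb import InfluxDBClient  # InfluxDB 1.x Python client

# Connect to a local InfluxDB 1.x instance; "monitoring" is an assumed database name.
client = InfluxDBClient(host="localhost", port=8086, database="monitoring")

# Each point has a measurement, tags (indexed dimensions) and fields (values).
client.write_points([{
    "measurement": "cpu_usage",
    "tags": {"host": "web01", "region": "cn-north-1"},
    "fields": {"value": 63.2},
}])

# A typical aggregation query: average CPU usage per host in 1-minute windows,
# the GROUP BY workload whose cost the volcano model amplifies.
result = client.query(
    "SELECT MEAN(value) FROM cpu_usage "
    "WHERE time > now() - 1h GROUP BY time(1m), host"
)
print(list(result.get_points()))
```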
- TimescaleDB
TimescaleDB is a time-series database built on the traditional relational database PostgreSQL. It inherits many of PostgreSQL's strengths, such as SQL support, trajectory data storage, JOIN support and scalability, and it offers good read and write performance. TimescaleDB uses a fixed schema and occupies more storage space, so it is a reasonable choice for time-series workloads whose schema stays stable over the long term and which are not sensitive to storage cost.
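TimescaleDB is driven through ordinary SQL; the sketch below, assuming the psycopg2 driver and a local PostgreSQL instance with the timescaledb extension installed, creates a regular table and converts it into a hypertable partitioned by time (table and column names are made up).

```python
import psycopg2  # standard PostgreSQL driver; connection details are assumptions

conn = psycopg2.connect("dbname=metrics user=postgres host=localhost")
cur = conn.cursor()

# A fixed-schema table for sensor readings, then converted into a
# TimescaleDB hypertable that is automatically partitioned by time.
cur.execute("""
    CREATE TABLE IF NOT EXISTS conditions (
        time        TIMESTAMPTZ NOT NULL,
        device_id   TEXT        NOT NULL,
        temperature DOUBLE PRECISION
    );
""")
cur.execute("SELECT create_hypertable('conditions', 'time', if_not_exists => TRUE);")

# Plain SQL, including JOINs and aggregates, works on the hypertable.
cur.execute("""
    SELECT device_id, AVG(temperature)
    FROM conditions
    WHERE time > NOW() - INTERVAL '1 hour'
    GROUP BY device_id;
""")
conn.commit()
```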
The arrival of GaussDB (for Influx)
No existing open-source solution handles high-performance writes, massive timelines and high data compression well at the same time. Drawing on the strengths of the open-source ecosystem, GaussDB (for Influx) designed a time-series database with a cloud-native architecture, shown below.
Compared with existing open-source time-series databases, the architecture reflects two main design considerations:
- Separation of storage and compute
On the one hand, storage-compute separation builds on a mature distributed storage system to improve reliability. Monitoring data is written continuously at high rates and queried heavily, so any failure that interrupts the service or loses data has a serious business impact. Using a proven distributed storage system significantly improves system reliability, reduces the risk of data loss, and noticeably shortens the time needed to build the system.
On the other hand, it removes the physical binding between data and nodes found in the traditional shared-nothing architecture: data belongs to a compute node only logically, making compute nodes stateless. Scaling out then requires handing over only a small amount of data from one node to another rather than migrating massive volumes, shortening cluster expansion from days to minutes.
In addition, by offloading multi-copy replication from compute nodes to the distributed storage layer, users avoid the 9-copy redundancy that arises when a self-built distributed database keeps 3 replicas on top of 3-replica distributed storage in a cloud-hosting setup, which significantly reduces storage cost.
- Kernel Bypass
To avoid the performance loss of copying data back and forth between user mode and kernel mode, GaussDB (for Influx) adopts an end-to-end kernel-bypass design. Instead of a standard distributed block or file service, it uses distributed storage customized for the database that exposes user-mode interfaces. Compute nodes are deployed as containers and communicate with storage nodes over a dedicated storage network.
Beyond the architecture, GaussDB (for Influx) makes further optimizations for IoT and O&M monitoring scenarios:
- A write-optimized LSM-tree layout and asynchronous logging improve write performance by 94% compared with current time-series databases.
- A vectorized query engine, an ARC block cache and an aggregation result cache improve aggregation query performance by up to 9 times over current time-series databases.
- A compression algorithm designed around the distribution characteristics of time-series data achieves twice the compression ratio of Gorilla (a simplified sketch of the Gorilla-style idea follows this list), and cold data is automatically tiered to object storage, reducing storage cost by 60%.
- An indexing algorithm optimized for massive timelines improves indexing efficiency; with tens of millions of timelines, write performance is 5 times that of current time-series databases.
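For context on the compression comparison, the sketch below illustrates the delta-of-delta idea that Gorilla popularized for timestamps: regularly sampled timestamps shrink to runs of zeros that encode in very few bits. This is a simplified conceptual illustration only, not the improved algorithm used by GaussDB (for Influx).

```python
def delta_of_delta(timestamps):
    """Delta-of-delta encode a sorted list of Unix timestamps (seconds).

    Gorilla-style encoders store the first timestamp, the first delta, and
    then only the change between consecutive deltas, which is zero for data
    collected at a fixed interval.
    """
    if len(timestamps) < 2:
        return list(timestamps)
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [deltas[0]] + [b - a for a, b in zip(deltas, deltas[1:])]
    return [timestamps[0]] + dods

# Metrics reported every 10 seconds: everything after the first delta is 0,
# so the encoded stream needs only a few bits per point.
ts = [1609459200, 1609459210, 1609459220, 1609459230, 1609459240]
print(delta_of_delta(ts))  # [1609459200, 10, 0, 0, 0]
```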
Following the successful launch of GaussDB (for Influx) in WeLink and CES, we will next explore how to effectively analyze valuable data within the flood of data and provide users with more appropriate analytics and insights.