Author: Ma Xiangrong
Editor's note: As a logistics company operating in many countries and regions, our technical environment is complex, so building systems on hybrid cloud has become the natural choice in the cloud era. Effectively managing systems that span multiple clouds is a challenge, however, and the problems, business demands, and pain points of the database layer underneath those systems are the current focus of the DBA team.
The data stores currently managed by the DBA team include MySQL, Redis, Elasticsearch, Kafka, MQ, Canal, and others. To keep monitoring close to real time, the sampling interval of our self-developed monitoring system is set to 10 seconds, which generates a huge amount of monitoring data every day: more than 2 billion metric data points. In the early days, with fewer managed instances and less monitoring data, everything was stored in MySQL. As the number of managed instances grew, storing the ever-larger volume of monitoring data in MySQL became more and more difficult, and an upgrade was urgently needed. Based on our actual requirements and an evaluation of different time-series databases, we finally chose TDengine and successfully completed the upgrade of our monitoring data storage.
1. Problems encountered in the development of the monitoring system
From the perspective of the storage path, each data store is distributed across multiple regions. Each region delivers its monitoring samples to a message bus, from which a consumer program consumes them uniformly: it updates the necessary alarm data in the alarm table and writes the raw monitoring data to MySQL. Take Redis as an example: with a sampling interval of 10 seconds, instance samples reach 30 million+ per day and monitoring metrics reach 600 million+, a huge amount of data. Early on, to make MySQL storage and chart rendering feasible, each sampling pass computed all of an instance's monitoring items and stored the results as a single record.
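To make that early design concrete, here is a rough sketch of what such a wide MySQL table might have looked like (hypothetical names; the real table carried 25+ metric columns):

CREATE TABLE redis_monitor_10s (
  id BIGINT NOT NULL AUTO_INCREMENT,
  created_at DATETIME NOT NULL,            -- 10-second sampling timestamp
  cluster_name VARCHAR(50) NOT NULL,
  ip VARCHAR(20) NOT NULL,
  port INT NOT NULL,
  qps INT,                                 -- one column per monitoring item
  used_memory BIGINT,
  connected_clients INT,
  PRIMARY KEY (id),
  KEY idx_node_time (cluster_name, ip, port, created_at)
) ENGINE=InnoDB;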
Even with this optimization, chart rendering over a large time span still lagged on the front end, and the MySQL instance storing the monitoring data was under heavy pressure, frequently running at full load and drawing complaints every day. So for these large-time-span queries we developed a series of data aggregation scheduling tasks that pre-aggregate the 10-second monitoring data at several granularities; the drawing program then selects the aggregation table that matches the requested time span, which solved the stutter and delay in long-span monitoring charts.
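A minimal sketch of one such pre-aggregation task (hypothetical table and column names), rolling the 10-second samples up into 1-minute averages for long-span charts:

INSERT INTO redis_monitor_1m (created_at, cluster_name, ip, port, qps)
SELECT
  FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(created_at) / 60) * 60),  -- align rows to 1-minute buckets
  cluster_name, ip, port,
  AVG(qps)
FROM redis_monitor_10s
WHERE created_at >= NOW() - INTERVAL 10 MINUTE                 -- the scheduled task sweeps the latest window
GROUP BY FLOOR(UNIX_TIMESTAMP(created_at) / 60), cluster_name, ip, port;

Similar tasks at coarser granularities would feed the drawing program's table selection.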
But this treated the symptoms rather than the cause. To keep the storage footprint under control, the raw 10-second monitoring table could only retain 3 days of data, and even that came to nearly 200 GB; adding the aggregation tables for the various time granularities pushed the total past 300 GB. And that was just the Redis monitoring data: adding the monitoring data of the other data stores would require at least 1 TB of MySQL storage.
Managing the complex aggregation tasks, losing the ability to look back at older raw monitoring data, and the hidden risks of ever-growing MySQL storage were all problems that urgently needed to be solved.
2. Selection of a time-series database
To solve the problem of storing monitoring data in MySQL, we turned our attention to time-series databases, which are well suited to this workload. There are many time-series database products on the market, both established and new. After a round of investigation and comparison, we chose TDengine, mainly for the following advantages:
- The distributed architecture supports linear scaling
- The data storage compression ratio is extremely high
- The cluster mode supports multiple copies without a single point of failure
- Strong performance in single machine mode
- The installation, deployment and maintenance are simple
- SQL supports time-dimension aggregated queries
- The client supports RESTful APIs
It is worth highlighting that TDengine's native SQL syntax supports time-dimension aggregated queries, and its data storage is highly compressed, occupying very little space; both of these hit our pain points directly. After going live, we measured the storage footprint of the same data at one tenth of its MySQL size or less. Another pleasant surprise was that a single 8C16GB TDengine node easily withstands all of the monitoring traffic and storage pressure that the existing MySQL-based storage had been struggling under, and it runs stably with almost no faults.
3. Transformation process
TDengine's design principles for super tables that store monitoring data favor simplicity and efficiency: few columns, with a separate super table per value type of monitoring item (INT, FLOAT, NCHAR, etc.). Such a super table typically has the key columns ts, type, and value, where the specific monitoring item is identified by the type field, plus the necessary tags and a few other fields.
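For reference, a minimal sketch of that generic layout (hypothetical names), with one super table per value type and the concrete monitoring item carried in the type column:

taos> create table monitor_int (ts TIMESTAMP, type NCHAR(30), value INT) TAGS (region NCHAR(10), ip NCHAR(20), port INT);
taos> create table monitor_float (ts TIMESTAMP, type NCHAR(30), value FLOAT) TAGS (region NCHAR(10), ip NCHAR(20), port INT);

Under this layout each sampled metric becomes its own row, which is why adopting it would have forced us to reshape our one-record-per-sample data structure.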
Previously, when monitoring data was stored in MySQL, each record contained all the monitoring items (25+) sampled from an instance at one time. Adopting the general super-table design above would have required changing the sampling data structure in two places:
- Modify the data structure in which monitoring data is delivered to the message bus
- Rework the consumer program's logic to reorganize the data structure
However, the consumer side has strict timeliness requirements that make it hard to change, and the producer side touches so many components that the resistance would not be small. After weighing everything, we decided to design the super table with a structure similar to the existing MySQL tables. This approach is the least intrusive to the original system and the least difficult to implement. The transformation process is as follows:
- In TDengine, the monitoring data of each data store is kept in its own database:
taos> create database redis;
taos> create database es;
...
- Build the super table in TDengine, taking Redis as an example:
taos> use redis;
taos> create table redis_node_meters (created_at TIMESTAMP, qps INT, ...) TAGS (region NCHAR(10), cluster_name NCHAR(50), ip NCHAR(20), port INT, role NCHAR(15));
- On the consumption side, newly added instance nodes are not known in advance; for example, Redis node resources are discovered automatically by an agent, and metric collection starts automatically after registration. At write time the consumer therefore cannot be sure whether a subtable already exists, so we use TDengine's automatic table creation on insert: if the subtable does not exist it is created, and if it already exists the data is simply written into it.
taos> INSERT INTO tablename USING redis_node_meters TAGS ('China', 'test', '127.0.0.3', 6379, 'master') VALUES ('2021-12-02 14:21:14', 6490, ...);
- For chart rendering, the drawing program does not install the client driver; it queries data over HTTP using the RESTful API provided by TDengine. The SQL queries make extensive use of TDengine's time-dimension aggregation function INTERVAL: for long time spans, choosing a reasonable aggregation INTERVAL is all that is needed to keep responses at the millisecond level, which keeps front-end rendering smooth and stable.
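As an illustration, a hypothetical query against the redis_node_meters super table above, rolling a week of 10-second samples up into one point per hour on the fly:

taos> select avg(qps) from redis.redis_node_meters where created_at > now - 7d interval(1h);

The same statement can be sent by the drawing program as an HTTP POST to TDengine's RESTful endpoint (by default /rest/sql on port 6041), so no client driver needs to be installed.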
4. Results of the transformation
After switching the monitoring data storage from MySQL to TDengine, we not only withstand the pressure of growing monitoring data but also cut the storage cost to one tenth of the original or even less. Historical raw monitoring data can now be kept much longer: the space that used to hold 3 days of raw plus aggregated data now holds 45 days of raw data.
In addition, we no longer need to maintain the complex data aggregation scheduling tasks, which greatly reduces the complexity of the monitoring system and its data management; at the same time, front-end chart queries have become simpler and more efficient.
Final thoughts
As our understanding of TDengine deepens and experience accumulates, we will gradually consider migrating other time-series data, such as the truck and trailer monitoring system and business data (driving tracks), to TDengine. For readers who want to make better use of TDengine, here are a few lessons from our experience:
- Because TDengine does not support very complex SQL query syntax, tags must be considered carefully when designing the super table. Currently only tag fields support GROUP BY queries on the results of function operations; ordinary fields do not (see the sketch after this list).
- A subtable's tag values can be modified, but all rows of a subtable share the same set of tag values. We recommend generating the subtable name from the combination of tag values.
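To make both points concrete, a hypothetical example against the redis_node_meters super table: aggregation results can be grouped by a tag field such as cluster_name, but not by an ordinary column.

taos> select avg(qps) from redis.redis_node_meters where created_at > now - 1h group by cluster_name;

And for the tag set ('China', 'test', '127.0.0.3', 6379, 'master'), a subtable name generated from the combined tag values might look like redis_china_test_127_0_0_3_6379_master, guaranteeing one subtable per node.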
In this project, TDengine has helped us reduce costs and increase efficiency; it is a time-series database product worth trying. We hope it will offer even richer functions and features in the future, and we look forward to closer and deeper cooperation.