Understanding and Selecting a Time Series Database (TSDB)
This article was written by “Borrowed Directions” from the MageByte team.
Background
Over the past two years a new trend has swept the Internet industry, and we keep hearing lofty new terms: big data, artificial intelligence, the Internet of Things, machine learning, business intelligence, intelligent alerting, and so on.
Earlier systems focused on data visualization, information management, and process control. Businesses are no longer satisfied with simple management and control: visual data analysis, big data mining, statistical prediction, modeling and simulation, and intelligent control have become goals for all kinds of businesses.
“All is lost in time like tears. Time is dying.” We used to use the Internet to solve present-day problems. Now we are no longer satisfied with the present: by connecting data into time series, we can review history, reveal regularities, grasp trends, and predict the future.
As we began storing large volumes of such data, we summarized its structural characteristics and common usage scenarios and kept optimizing for them, which gave rise to a new category of database: the time series database.
Time series model
A time series database is mainly used to process data that carries a timestamp, that is, data that changes in time order. Such timestamped data is called time series data.
Each data point in a time series has the following structure:
- Timestamp: the time at which the data point occurred.
- Metric: the metric name, which identifies the series the data point belongs to. Some systems call it name.
- Value: the value of the data point, usually a double, such as CPU usage or request count. In some systems a data point can hold only one value, so multiple values become multiple time series. Other systems allow several values per point, distinguished by key.
- Tag: an attached attribute, such as the host that produced the data.
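The structure above can be sketched as a small Python type. This is only an illustration of the single-value model; the class name `DataPoint` and the `series_key` helper are our own assumptions and do not come from any particular TSDB.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class DataPoint:
    """One point in a time series (single-value model)."""
    metric: str       # metric name, e.g. "sys.cpu.user"
    timestamp: int    # Unix time in seconds
    value: float      # the measured value
    tags: Dict[str, str] = field(default_factory=dict)  # attached attributes

    def series_key(self) -> str:
        """Identify the series: metric name plus tags in sorted order."""
        tag_part = ",".join(f"{k}={v}" for k, v in sorted(self.tags.items()))
        return f"{self.metric}{{{tag_part}}}"


point = DataPoint("sys.cpu.user", 1436333416, 23.0, {"host": "web01"})
print(point.series_key())  # sys.cpu.user{host=web01}
```

Points that share the same metric and tags belong to the same series. In a multi-value system, `value` would instead be a mapping from field name to value.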
Implementation
Suppose I want to record time series data from a series of sensors. The data structure is as follows:
- Identifier: device_id, timestamp
- Metadata: location_id, dev_type, firmware_version, customer_id
- Device metrics: cpu_1m_avg, free_mem, used_mem, net_rssi, net_loss, battery
- Sensor metrics: temperature, humidity, pressure, CO, NO2, PM10
With a traditional RDBMS, you would create a table with one column for each of the fields above.
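Such a table can be sketched as follows. This is only an illustration: it uses SQLite through Python's built-in `sqlite3` module as a stand-in for any RDBMS, the column names follow our reading of the field list above, and the table name `sensor_data` and the column types are our own assumptions.

```python
import sqlite3

# Column names follow the field list above; table name and types are assumed.
SCHEMA = """
CREATE TABLE sensor_data (
    device_id        TEXT    NOT NULL,
    timestamp        INTEGER NOT NULL,   -- Unix time in seconds
    location_id      TEXT,
    dev_type         TEXT,
    firmware_version TEXT,
    customer_id      TEXT,
    cpu_1m_avg       REAL,
    free_mem         REAL,
    used_mem         REAL,
    net_rssi         REAL,
    net_loss         REAL,
    battery          REAL,
    temperature      REAL,
    humidity         REAL,
    pressure         REAL,
    co               REAL,
    no2              REAL,
    pm10             REAL,
    PRIMARY KEY (device_id, timestamp)
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.execute(
    "INSERT INTO sensor_data (device_id, timestamp, temperature, humidity) "
    "VALUES (?, ?, ?, ?)",
    ("dev-1", 1436333416, 21.5, 0.43),
)

# A typical time series query: one device, one time range, ordered by time.
rows = conn.execute(
    "SELECT device_id, timestamp, temperature FROM sensor_data "
    "WHERE device_id = ? AND timestamp BETWEEN ? AND ? ORDER BY timestamp",
    ("dev-1", 1436333000, 1436334000),
).fetchall()
print(rows)  # [('dev-1', 1436333416, 21.5)]
```

The composite primary key on (device_id, timestamp) matches the identifier in the field list and also serves the range query.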
This is the simplest possible time series store. But it only satisfies the time series data model; we still need to do more for performance, storage efficiency, high availability, distribution, and ease of use.
It is worth thinking about: if you were to implement a time series database yourself, how would you design it, which performance optimizations would you consider, and how would you make it highly available and easy to use?
Timescale
This is a time series database built on a traditional relational database, PostgreSQL. PostgreSQL is a powerful, open source, extensible database system.
Timescale, Inc. developed Timescale, a SQL-compatible time series database, on top of PostgreSQL. It is delivered as a PostgreSQL extension. Its characteristics are as follows:
Basis:
- Supports all native PostgreSQL SQL through the full SQL interface (secondary indexes, non-time-based aggregation, subqueries, joins, and window functions).
- Existing PostgreSQL clients and tools work against the database without modification.
- Time-oriented features, API functions, and corresponding optimizations.
- Reliable data storage.
Scaling:
- Transparent time/space partitioning for both scaling up (single node) and scaling out.
- High data write rates (including bulk commit, in-memory indexing, transaction support, data backup support).
- Appropriately sized chunks (two-dimensional data partitions) on a single node to keep reads fast even with large data volumes.
- Parallel operations across chunks and servers.
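The idea of two-dimensional (time/space) partitioning can be sketched in a few lines of Python. This is not how Timescale is implemented, only an illustration of the concept. The chunk width, the number of space partitions, and the function names `chunk_for` and `chunks_for_range` are all our own assumptions.

```python
import zlib
from datetime import timedelta

CHUNK_INTERVAL = int(timedelta(days=1).total_seconds())  # assumed chunk width
SPACE_PARTITIONS = 4                                      # assumed partition count


def chunk_for(timestamp: int, device_id: str) -> tuple:
    """Map a row to a (time bucket, space bucket) chunk key."""
    time_bucket = timestamp // CHUNK_INTERVAL
    # crc32 is stable across runs, unlike the built-in hash() for strings.
    space_bucket = zlib.crc32(device_id.encode("utf-8")) % SPACE_PARTITIONS
    return (time_bucket, space_bucket)


def chunks_for_range(start: int, end: int) -> list:
    """Time buckets that a query over [start, end] has to touch."""
    return list(range(start // CHUNK_INTERVAL, end // CHUNK_INTERVAL + 1))


print(chunk_for(1436333416, "dev-1"))
print(chunks_for_range(1436313600, 1436400000))  # [16624, 16625]
```

Because recent writes land in only a few chunks, their indexes can stay in memory, and a time-range query only needs to open the chunks its range overlaps.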
Disadvantages:
- Because TimescaleDB does not use column storage, it does not compress time series data very well; compression ratios top out at around 4x.
- Distributed scaling is not yet fully supported (the feature is under development), so it places high demands on single-server performance.
This database is well worth digging into. We are all familiar with RDBMSs, and studying it gives deeper insight into their implementation and storage mechanisms. From its specialized handling of time series, we can also learn the characteristics of time series data and how to optimize an RDBMS for the time series model.
We may write a follow-up article that takes a closer look at this database.
InfluxDB
InfluxDB is a popular time series database in the industry, especially in IoT and monitoring. It is written in Go, and its standout feature is performance. Features:
- Efficient time series data write performance. Custom TSM engine for fast data writing and efficient data compression.
- No additional storage dependencies.
- Simple, high-performance HTTP query and write API.
- Plugin support for ingesting data over many different protocols, such as Graphite, CollectD, and OpenTSDB.
- SQL-like query language that simplifies query and aggregation operations.
- Tags are indexed, supporting fast and efficient queries over time series.
- Retention policies effectively remove expired data.
- Continuous queries automatically calculate aggregated data, making frequent queries more efficient.
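The last two features, retention policies and continuous queries, can be illustrated with a minimal Python sketch. It does not use InfluxDB or its query language. The function names `apply_retention` and `downsample_mean` are hypothetical, and the code only shows the idea of expiring old points and precomputing aggregates.

```python
from collections import defaultdict
from statistics import mean


def apply_retention(points, now, max_age):
    """Keep only (timestamp, value) points newer than now - max_age."""
    cutoff = now - max_age
    return [(ts, v) for ts, v in points if ts >= cutoff]


def downsample_mean(points, interval):
    """Average values per fixed-width time bucket, like a periodic rollup."""
    buckets = defaultdict(list)
    for ts, v in points:
        buckets[ts - ts % interval].append(v)
    return [(start, mean(vals)) for start, vals in sorted(buckets.items())]


raw = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0), (120, 9.0)]
print(downsample_mean(raw, 60))       # [(0, 2.0), (60, 6.0), (120, 9.0)]
print(apply_retention(raw, 120, 60))  # [(60, 5.0), (90, 7.0), (120, 9.0)]
```

A common pattern is to combine the two: keep raw data for a short period, and keep the downsampled series for much longer.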
InfluxDB has made its distributed version closed source, so clustering is a weak point: you need to implement it yourself.
OpenTSDB
“The Scalable Time Series Database” is the first sentence you see on the OpenTSDB website, and scalability is its key selling point. OpenTSDB runs on Hadoop and HBase and makes full use of HBase features. The service is provided by a standalone Time Series Daemon (TSD), so it scales easily by adding or removing service nodes.
- OpenTSDB is an HBase-based time series database (newer versions also support Cassandra). It builds on HBase's distributed column storage to provide high data availability and high write performance. Because of HBase, it uses a lot of storage space and compresses poorly. It depends on HBase and ZooKeeper.
- It uses a schemaless tagset data structure (sys.cpu.user 1436333416 23 host=web01 user=10001). The structure is simple, but multi-value queries are awkward.
- Queries use an HTTP DSL.
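To make the tagset structure concrete, here is a minimal Python sketch that parses the sample line quoted above into its parts. The function name `parse_put_line` is our own; this is not OpenTSDB's parser, and it assumes the fields are separated by whitespace in the order metric, timestamp, value, tags.

```python
def parse_put_line(line: str) -> dict:
    """Parse '<metric> <timestamp> <value> <tagk=tagv> ...' into a dict."""
    parts = line.split()
    if len(parts) < 4:
        raise ValueError("expected metric, timestamp, value and at least one tag")
    metric, timestamp, value, *tag_parts = parts
    # Each tag is a key=value pair; split on the first "=" only.
    tags = dict(part.split("=", 1) for part in tag_parts)
    return {
        "metric": metric,
        "timestamp": int(timestamp),
        "value": float(value),
        "tags": tags,
    }


point = parse_put_line("sys.cpu.user 1436333416 23 host=web01 user=10001")
print(point["metric"], point["value"], point["tags"])
```

Each line carries exactly one value. Recording several values for the same host and moment, for example user and system CPU time, takes several metrics and therefore several series, which is why multi-value queries are awkward.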
OpenTSDB’s table design and RowKey design for a TSDB on HBase are well worth further study. Interested readers can look up more detailed material on them.
Druid
Druid is a real-time online analytics system (OLAP). Its architecture combines characteristics of real-time online data analysis, full-text retrieval systems, and time series systems, so it can meet data storage requirements in different scenarios.
- Column storage: supports efficient scans and aggregation, and makes data easy to compress.
- Scalable distributed system: Druid implements its own scalable, fault-tolerant distributed cluster architecture. Deployment is simple.
- Powerful parallelism: Druid serves queries across cluster nodes in parallel.
- Real-time and batch data ingestion: Druid can ingest data in real time, for example through Kafka. You can also import data in batches, for example through Hadoop.
- Self-healing, self-balancing, easy to operate: Druid’s architecture is fault-tolerant and highly available. Service nodes of different kinds can be added or removed based on load.
- Fault-tolerant architecture that prevents data loss: Druid can maintain multiple copies of data. In addition, HDFS can be used as deep storage to prevent data loss.
- Indexing: Druid applies inverted indexing and bitmap indexes to String columns, so filtering and groupBy are efficient.
- Time-based partitioning: Druid partitions raw data by time, so time-based range queries are more efficient.
- Automatic preaggregation: Druid can preaggregate data at ingestion time.
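Preaggregation at ingestion time can be sketched in plain Python. This is not Druid code. The function name `rollup` and the choice of count and sum as the stored aggregates are our own assumptions, picked only to show how many raw events collapse into a few stored rows.

```python
from collections import defaultdict


def rollup(events, granularity):
    """Pre-aggregate events by (time bucket, dimensions) at ingestion time.

    events: iterable of (timestamp, dimensions_dict, value).
    Returns rows of (bucket_start, dimensions, count, sum).
    """
    acc = defaultdict(lambda: [0, 0.0])
    for ts, dims, value in events:
        key = (ts - ts % granularity, tuple(sorted(dims.items())))
        acc[key][0] += 1
        acc[key][1] += value
    return [
        (bucket, dict(dims), count, total)
        for (bucket, dims), (count, total) in sorted(acc.items())
    ]


events = [
    (0, {"page": "/home"}, 1.0),
    (20, {"page": "/home"}, 2.0),
    (45, {"page": "/cart"}, 4.0),
    (70, {"page": "/home"}, 8.0),
]
for row in rollup(events, 60):
    print(row)
```

Four events become three stored rows here. The trade-off is that individual events can no longer be recovered once they have been rolled up.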
Druid’s architecture is quite complex. It splits the system into multiple services by function: Query, Data, and Master services have different responsibilities and are deployed independently, and together they provide a unified storage and query service. It offers its underlying data storage as a distributed cluster service.
Druid’s architectural design is worth learning from. If you are interested not only in time series storage but also in distributed cluster architecture, take a look at Druid’s architecture. The design of the segment (Druid’s data storage structure) is another highlight, in particular its implementation of column storage and inverted indexing.
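The combination of a column-oriented layout and inverted (bitmap) indexes can be sketched as follows. This is a toy illustration, not Druid's segment format. The function names are hypothetical, and plain Python integers stand in for the compressed bitmaps a real system would use.

```python
def build_bitmap_index(column):
    """Dictionary-encode a string column and build one bitmap per value.

    Bit i of a value's bitmap is set when row i holds that value.
    Python ints serve as arbitrary-length bitmaps here.
    """
    dictionary = {}  # value -> integer id
    encoded = []     # the column stored as ids
    bitmaps = {}     # value -> bitmap
    for row, value in enumerate(column):
        encoded.append(dictionary.setdefault(value, len(dictionary)))
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return dictionary, encoded, bitmaps


def rows_of(bitmap):
    """Row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]


page = ["/home", "/cart", "/home", "/pay"]
country = ["US", "US", "DE", "US"]
_, encoded, page_idx = build_bitmap_index(page)
_, _, country_idx = build_bitmap_index(country)

# Filter: page == "/home" AND country == "US" is a bitwise AND of two bitmaps.
print(encoded)                                          # [0, 1, 0, 2]
print(rows_of(page_idx["/home"] & country_idx["US"]))   # [0]
```

Filters on several columns reduce to bitwise operations on bitmaps, which is why this kind of index makes filtering cheap.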
Elasticsearch
Elasticsearch is a distributed, open source search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured data. Elasticsearch is built on Apache Lucene and was first released in 2010 by Elasticsearch N.V. (now Elastic). It is known for its simple REST-style API, distributed nature, speed, and scalability.
Elasticsearch is best known as part of the ELK Stack. Many companies build log analysis and real-time search systems on ELK. My team started building a metrics monitoring system on ELK, with the idea of using Elasticsearch as the time series store. We optimized the Elasticsearch mapping to better fit the time series data model and got good results that fully met our business requirements. Elastic has also started releasing Metrics and APM components, adding time series storage alongside full-text search. That is exactly what we had in mind.
Further reading: “Elasticsearch as a Time Series Data Store”, and Elasticsearch’s metrics component, Elastic Metrics.
Beringei
Beringei is a high-performance in-memory time series storage engine that Facebook open-sourced in 2017. It offers fast reads and writes and a high compression ratio.
In 2015, Facebook published a paper called “Gorilla: A Fast, Scalable, In-Memory Time Series Database”, and Beringei is a time series database built on the ideas in that paper.
Beringei compresses timestamps with delta-of-delta encoding and values with XOR encoding, so it can hold a lot of data in very little memory.
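The two encodings can be illustrated with a short Python sketch. It only computes the residuals (delta-of-deltas and XOR results) that such a scheme stores. The variable-length bit packing that turns small residuals into few bits is omitted, and the function names are our own, not Beringei's.

```python
import struct


def delta_of_delta(timestamps):
    """Return the first timestamp, the first delta, then delta-of-deltas."""
    if len(timestamps) < 2:
        return list(timestamps)
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return [timestamps[0], deltas[0]] + dods


def float_bits(x):
    """Reinterpret a float64 as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_encode(values):
    """XOR each value's bit pattern with the previous value's pattern."""
    bits = [float_bits(v) for v in values]
    if not bits:
        return []
    return [bits[0]] + [b ^ a for a, b in zip(bits, bits[1:])]


ts = [1436333400, 1436333460, 1436333520, 1436333581, 1436333640]
print(delta_of_delta(ts))                            # [1436333400, 60, 0, 1, -2]
print([hex(x) for x in xor_encode([12.0, 12.0, 24.0])])
```

When points arrive at a regular interval, most delta-of-deltas are zero. When consecutive values are equal or close, the XOR result is zero or has long runs of zero bits. Both kinds of residual can then be stored in very few bits.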
How to choose a suitable time series database
- Data model: There are generally two kinds of time series data model. One is schemaless with multiple tags; the other is a fixed structure of name, timestamp, and value. The former suits multi-value data and more complex business models. The latter suits a one-dimensional data model.
- Query language: Most TSDBs currently support SQL-like queries over HTTP.
- Reliability: Reliability shows in the stability and high availability of the system and in highly available data storage. A good system should have an elegant, highly available architecture that is simple and stable.
- Performance: Performance is something we have to consider. When we start looking at data stores for more specialized domains, it is largely because, beyond the data model requirements, a general-purpose database system does not perform well enough for our needs. Most time series databases favor write-heavy, read-light workloads, so users need to weigh this against their own needs. Below is a comparison of the performance of each database for your reference.
- Ecosystem: I have always thought the ecosystem is something we have to take seriously when choosing an open source component. A system with a healthy ecosystem has more users and therefore fewer undiscovered pitfalls. When you do run into problems, the community can often offer good solutions. A good ecosystem also has mature surrounding tools, which give us more mature ways to integrate with other systems.
- Operations management: The system should be easy to operate and maintain.
- Company and support: The company behind a system also matters. A strong company or organization behind a project makes a big difference to its availability guarantees and to later maintenance and updates.
Performance comparison
| | Timescale | InfluxDB | OpenTSDB | Druid | Elasticsearch | Beringei |
|---|---|---|---|---|---|---|
| write (single node) | 15k/sec | 470k/sec | 32k/sec | 25k/sec | 30k/sec | 10M/sec |
| write (5 nodes) | | | 128k/sec | 100k/sec | 120k/sec | |
Conclusion
To sum up:
- If you want a system with extreme performance, consider Beringei and InfluxDB. For data availability, the client can double-write, sending a copy of each write to a second store.
- If you do not have a lot of data and do not need high performance, but need queries, deletes, and joins, consider Timescale.
- If you need both indexed search and time series storage, Druid and Elasticsearch are the best choices. Both perform well and have highly available, fault-tolerant architectures.
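The double-write client mode mentioned in the first point can be sketched as below. Everything here is hypothetical: `InMemoryStore` stands in for a real TSDB client, and the failure policy (succeed if at least one copy is written) is just one possible choice.

```python
class InMemoryStore:
    """Stand-in for a TSDB client; a real client would go over the network."""

    def __init__(self, fail=False):
        self.points = []
        self.fail = fail

    def write(self, point):
        if self.fail:
            raise ConnectionError("store unavailable")
        self.points.append(point)


class DualWriteClient:
    """Send every write to two independent stores."""

    def __init__(self, primary, secondary):
        self.stores = [primary, secondary]

    def write(self, point):
        errors = []
        for store in self.stores:
            try:
                store.write(point)
            except ConnectionError as exc:
                errors.append(exc)
        if len(errors) == len(self.stores):
            raise ConnectionError("write failed on every store")
        return len(self.stores) - len(errors)  # number of successful copies


healthy, broken = InMemoryStore(), InMemoryStore(fail=True)
client = DualWriteClient(healthy, broken)
copies = client.write(("sys.cpu.user", 1436333416, 23.0))
print(copies, healthy.points)
```

A production version would also need to re-sync a store after it recovers, since the two copies diverge while one side is down.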
Finally
Later we may dig deeper into one or two of these TSDBs, such as InfluxDB, Druid, or Elasticsearch. It is also worth learning the difference between row storage and column storage, how LSM trees are implemented, how numerical data is compressed, and how mmap improves read and write performance.