This article was first published on the vivo Internet Technology WeChat public account. Link: mp.weixin.qq.com/s/qayKiwk5Q… Author: DuZhimin

With the development of the Internet, and especially the Internet of Things, we need to monitor, inspect, and analyze all kinds of terminal devices in real time, and to collect and record the data they produce. Connecting the data points into lines along the time axis lets us review the past across multiple dimensions in reports, revealing trends, regularities, and anomalies. It also lets us look to the future through big data analysis, machine learning, prediction, and early warning.

Typical characteristics of this data are: high generation frequency (each monitoring point can produce multiple data points per second), heavy dependence on the collection time (every data point must correspond to a unique time), and a large number of measuring points with a large volume of information (a real-time monitoring system typically has tens of thousands of monitoring points, each producing data every second, for dozens of GB of data per day).

Given these characteristics, relational databases cannot store and process time series data efficiently, so a database system optimized for time series data is urgently needed.

1. Overview

1. Time series data

Time series data is a sequence of data points indexed by time.

2. Time series database

A time series database stores time series data. It must support basic functions such as fast writes, persistence, and multi-dimensional aggregate queries over that data.

Whereas a traditional database typically records only the current value of the data, a time series database records all historical data. In addition, queries on time series data always use time as a filter condition.

3. OpenTSDB

OpenTSDB receives and stores large volumes of time series data without loss.

3.1 Storage

  1. No conversion is needed: data is stored exactly as it is written
  2. Time series data is stored with millisecond precision
  3. Raw data is retained permanently

3.2 Scalability

  1. It runs on Hadoop and HBase
  2. Scalable to millions of writes per second
  3. You can expand capacity by adding nodes

3.3 Read capabilities

  1. Charts can be generated directly from the built-in GUI
  2. Data can also be queried through the HTTP API
  3. An open source front end can also be used to interact with it

4. OpenTSDB core concepts

Consider an example: at 2019-12-5 22:31:21, the home page PV (page views) of version 3.2.1 of a product client is 1000W (10 million).

  1. Metric: the monitored item, for example the PV above
  2. Tags: dimensions. In OpenTSDB, a tag is a key-value pair made up of a tagk and a tagv, written tagk=tagv. Tags describe the metric, such as version='3.2.1' in the example above
  3. Value: the actual value of a metric, for example 1000W
  4. Timestamp: when the value occurred, for example 2019-12-5 22:31:21
  5. Data Point: the value of a metric at a certain point in time. A data point consists of a metric, tags, a value, and a timestamp
  6. The data stored in OpenTSDB is an unbounded sequence of data points

[DataPoint] [DataPoint] [DataPoint] [DataPoint] [DataPoint] [DataPoint]
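The four parts of a data point listed above can be pictured as a small value class. The following is only a minimal sketch: the class name `DataPointSketch` and the method `seriesKey` are hypothetical and are not part of OpenTSDB's own API.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative model of one data point; not OpenTSDB's own class. */
final class DataPointSketch {
    final String metric;             // monitored item, e.g. "pv"
    final Map<String, String> tags;  // dimensions as tagk=tagv pairs
    final long value;                // measured value
    final long timestampSeconds;     // when the value occurred (Unix seconds)

    DataPointSketch(String metric, Map<String, String> tags, long value, long timestampSeconds) {
        this.metric = metric;
        // Sort tags by key so the same series always renders the same way.
        this.tags = Collections.unmodifiableMap(new TreeMap<>(tags));
        this.value = value;
        this.timestampSeconds = timestampSeconds;
    }

    /** Identifies the time series: metric plus all tag pairs, without value or time. */
    String seriesKey() {
        StringBuilder sb = new StringBuilder(metric);
        for (Map.Entry<String, String> e : tags.entrySet()) {
            sb.append(' ').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }
}
```

Two data points with the same metric and the same tags belong to the same time series and differ only in value and timestamp, which is why `seriesKey` leaves those two fields out.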

2. OpenTSDB’s deployment architecture

1. Architecture diagram

2. Description

  1. OpenTSDB uses HBase to store data, so an HBase environment must be set up before OpenTSDB is deployed.
  2. OpenTSDB consists of Time Series Daemons (TSDs) and a set of command-line utilities.
  3. Applications interact with OpenTSDB through one or more running TSDs.
  4. Each TSD is independent: there is no master and no shared state, so you can run as many TSDs as needed to handle the workload.

3. HBase Overview

The deployment architecture shows that OpenTSDB is built on HBase. To make the later analysis of OpenTSDB easier to follow, here is a brief introduction to HBase.

1. HBase is a distributed, open source NoSQL database that offers high reliability, consistency, high performance, column-oriented storage, scalability, and real-time reads and writes.

2. HBase is a schema-free database. You only need to define column families in advance and do not need to specify column qualifiers. It is also a typeless database: all data is stored as binary bytes.

3. It stores data in tables organized in a four-dimensional coordinate system of row key, column family, column qualifier, and time version. To uniquely locate a value, all four coordinates must be specified. This can be illustrated by analogy with an Excel spreadsheet:

4. HBase supports five basic operations: Get, Put, Delete, Scan, and Increment. The only way to query HBase by anything other than the row key is a scan with filters.

5. Data storage in HBase (physical):

6. Data storage in HBase (logical):
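The four-dimensional coordinate system described in this section can be pictured as nested sorted maps. The sketch below is a toy in-memory model only: the class `ToyTable` and its methods are hypothetical, it uses strings where HBase uses byte arrays, and it is not the HBase client API.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy in-memory model of an HBase table's logical layout; not the HBase client API. */
final class ToyTable {
    // row key -> column family -> column qualifier -> version (timestamp) -> value
    private final NavigableMap<String, Map<String, Map<String, NavigableMap<Long, byte[]>>>> rows =
            new TreeMap<>();

    /** Stores one cell version at the given four coordinates. */
    void put(String row, String family, String qualifier, long version, byte[] value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .computeIfAbsent(qualifier, q -> new TreeMap<>())
            .put(version, value);
    }

    /** Returns the newest version of a cell, or null if the cell does not exist. */
    byte[] get(String row, String family, String qualifier) {
        Map<String, Map<String, NavigableMap<Long, byte[]>>> families = rows.get(row);
        if (families == null) return null;
        Map<String, NavigableMap<Long, byte[]>> qualifiers = families.get(family);
        if (qualifiers == null) return null;
        NavigableMap<Long, byte[]> versions = qualifiers.get(qualifier);
        return versions == null ? null : versions.lastEntry().getValue();
    }

    /** Scans row keys in sorted order within [startRow, stopRow). */
    Iterable<String> scan(String startRow, String stopRow) {
        return rows.subMap(startRow, true, stopRow, false).keySet();
    }
}
```

Because rows are kept sorted by row key, a range scan is cheap while a lookup by any other field has to walk the rows, which mirrors why non-row-key queries in HBase need a scan with filters.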

4. HBase tables supporting OpenTSDB

If you are running OpenTSDB against your HBase instance for the first time, you need to create the necessary HBase tables. OpenTSDB needs only four tables to run: tsdb, tsdb-uid, tsdb-tree, and tsdb-meta. The statements that create them are as follows:

1. tsdb-uid

create 'tsdb-uid',
{NAME => 'id', COMPRESSION => 'NONE', BLOOMFILTER => 'ROW', DATA_BLOCK_ENCODING => 'PREFIX_TREE'},
{NAME => 'name', COMPRESSION => 'NONE', BLOOMFILTER => 'ROW', DATA_BLOCK_ENCODING => 'PREFIX_TREE'}

2. tsdb

create 'tsdb',
{NAME => 't', VERSIONS => 1, COMPRESSION => 'NONE', BLOOMFILTER => 'ROW', DATA_BLOCK_ENCODING => 'PREFIX_TREE'}

3. tsdb-tree

create 'tsdb-tree',
{NAME => 't', VERSIONS => 1, COMPRESSION => 'NONE', BLOOMFILTER => 'ROW', DATA_BLOCK_ENCODING => 'PREFIX_TREE'}

4. tsdb-meta

create 'tsdb-meta',
{NAME => 'name', COMPRESSION => 'NONE', BLOOMFILTER => 'ROW', DATA_BLOCK_ENCODING => 'PREFIX_TREE'}

The contents of each of the four tables are explained in the following sections using actual data.

How does OpenTSDB save a data point to HBase?

1. First, check the data in the four tables

As shown above, all four tables are empty.

2. Then we write a data point to OpenTSDB

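import java.util.HashMap;
import java.util.Map;
import org.junit.Test;

// `tsdb` is assumed to be an already initialized net.opentsdb.core.TSDB field of the test class.
@Test
public void addData() {
    String metricName = "metric";
    long value = 1;
    Map<String, String> tags = new HashMap<String, String>();
    tags.put("tagk", "tagv");
    // Millisecond timestamp of the data point
    long timestamp = System.currentTimeMillis();
    // Write one data point: metric name, timestamp, value, tags
    tsdb.addPoint(metricName, timestamp, value, tags);
    System.out.println("------------");
}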

3. After inserting the data, look at the four tables again

The tsdb-uid, tsdb, and tsdb-meta tables now contain data, but the tsdb-tree table is still empty.

4. tsdb-tree table

It is an index table that presents the data as a tree structure, similar to a file system, for use by other systems. We won't go into the details here.

The tsd.core.tree.enable_processing option controls whether data is written to this table.

5. tsdb-meta table

This table is an index of the different time series in OpenTSDB and can also store some additional information. It has only one column family, name, with two columns, ts_meta and ts_ctr. Whether the data in this table is generated, and which columns are generated, is controlled by the following configuration items:

tsd.core.meta.enable_realtime_ts
tsd.core.meta.enable_tsuid_incrementing
tsd.core.meta.enable_tsuid_tracking

Row Key: the same as the row key of the tsdb table, but without the timestamp: <metric_uid><tagk1><tagv1>[…<tagkN><tagvN>]

ts_meta column: similar to UIDMeta, its value is a UTF-8 encoded JSON string.

ts_ctr column: a counter that records the number of data points stored in a time series. The column name is ts_ctr, and its value is an 8-byte signed integer.

6. tsdb-uid table data analysis

The tsdb-uid table stores UID mappings, both forward and reverse. It has two column families: name, which maps a UID to a string, and id, which maps a string to a UID. Each row in a column family has at least one of the following three columns:

  • metrics: maps a metric name to a UID
  • tagk: maps a tag name to a UID
  • tagv: maps a tag value to a UID

If metadata is configured, the name column family can also contain additional metadata columns.

6.1 id column family

  • Row Key: the actual string, that is, a metric name, a tagk, or a tagv
  • Column Qualifier: one of the three column types metrics, tagk, and tagv
  • Column Value: an unsigned integer, an incrementing number encoded in 3 bytes by default. This value is the UID

6.2 name column family

  • Row Key: the UID, that is, the value from the id column family
  • Column Qualifier: one of the six column types metrics, tagk, tagv, metrics_meta, tagk_meta, and tagv_meta. The *_meta columns are generated only when tsd.core.meta.enable_realtime_uid is enabled
  • Column Value: the string corresponding to the UID. For a *_meta column, the value is a UTF-8 encoded JSON string. Do not modify this value outside OpenTSDB, because the order of the fields affects the CAS call.
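The forward and reverse mappings of the tsdb-uid table can be sketched with two in-memory maps. This is an illustration under assumptions, not OpenTSDB's implementation: the class `UidTableSketch` is hypothetical, it covers a single kind of name (for example metrics only), and starting the counter at 1 is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the forward and reverse UID mappings for one kind (e.g. metrics). */
final class UidTableSketch {
    private static final int UID_WIDTH = 3;   // bytes per UID, the default stated above
    private final Map<String, Integer> nameToId = new HashMap<>(); // like the id family
    private final Map<Integer, String> idToName = new HashMap<>(); // like the name family
    private int nextId = 1;                   // assumption: UIDs start at 1

    /** Returns the existing UID for a name, or assigns the next incrementing one. */
    int getOrCreateId(String name) {
        Integer existing = nameToId.get(name);
        if (existing != null) return existing;
        if (nextId >= (1 << (8 * UID_WIDTH))) throw new IllegalStateException("UIDs exhausted");
        int id = nextId++;
        nameToId.put(name, id);
        idToName.put(id, name);
        return id;
    }

    /** Reverse lookup: UID back to the original string, or null if unknown. */
    String getName(int id) {
        return idToName.get(id);
    }

    /** Encodes a UID as a fixed-width big-endian byte array. */
    static byte[] encode(int id) {
        byte[] out = new byte[UID_WIDTH];
        for (int i = UID_WIDTH - 1; i >= 0; i--) {
            out[i] = (byte) (id & 0xFF);
            id >>>= 8;
        }
        return out;
    }
}
```

With a 3-byte width, each kind of name can hold about 16.7 million distinct UIDs (2^24 - 1 in this sketch), and every string is replaced by a fixed-length value wherever it is referenced.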

7. tsdb table

Data points are stored in this table, which has only one column family, t:

7.1 RowKey format

  • UID: encoded in 3 bytes by default; the timestamp takes 4 bytes
  • Salt: spreads the different time series of the same metric apart to avoid hot spots
  • Metric, tagk, tagv: what is actually stored is the UID corresponding to each string (from the tsdb-uid table)
  • Timestamp: data is stored one row per hour, and the row key records the timestamp of the top of the hour, in seconds
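The row key rules above can be sketched as follows. This is a sketch under assumptions: the class `RowKeySketch` is hypothetical, the field order (metric UID, hour-aligned timestamp, then tagk/tagv UID pairs) follows OpenTSDB's documented layout rather than anything stated in this article, and the optional salt prefix is omitted.

```java
import java.io.ByteArrayOutputStream;
import java.util.Map;
import java.util.SortedMap;

/** Illustrative row key builder; not OpenTSDB's own code. */
final class RowKeySketch {
    private static final int SECONDS_PER_ROW = 3600; // one row per hour
    private static final int UID_WIDTH = 3;          // default UID width in bytes
    private static final int TIMESTAMP_WIDTH = 4;    // base time width in bytes

    /** Rounds a second-precision timestamp down to the top of its hour. */
    static long baseTime(long timestampSeconds) {
        return timestampSeconds - (timestampSeconds % SECONDS_PER_ROW);
    }

    /** Builds metric UID + base time + (tagk UID, tagv UID) pairs ordered by tagk UID. */
    static byte[] build(int metricUid, long timestampSeconds, SortedMap<Integer, Integer> tagUids) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeBigEndian(out, metricUid, UID_WIDTH);
        writeBigEndian(out, baseTime(timestampSeconds), TIMESTAMP_WIDTH);
        for (Map.Entry<Integer, Integer> tag : tagUids.entrySet()) {
            writeBigEndian(out, tag.getKey(), UID_WIDTH);
            writeBigEndian(out, tag.getValue(), UID_WIDTH);
        }
        return out.toByteArray();
    }

    /** Writes the low `width` bytes of `value`, most significant byte first. */
    private static void writeBigEndian(ByteArrayOutputStream out, long value, int width) {
        for (int shift = 8 * (width - 1); shift >= 0; shift -= 8) {
            out.write((int) (value >>> shift) & 0xFF);
        }
    }
}
```

For a series with one tag, the key is 3 + 4 + 3 + 3 = 13 bytes, and every data point written within the same hour maps to the same row.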

7.2 Column format

The column qualifier occupies 2 bytes or 4 bytes.

A 2-byte qualifier holds an offset in seconds, in the following format:

  • 12 bits: the delta in seconds relative to the row's hour. The delta is at most 3599, and 2^12 = 4096 > 3600, so 12 bits are enough
  • 1 bit: integer or floating point. 0 indicates an integer and 1 indicates a floating point
  • 3 bits: the data length, stored as the length in bytes minus 1. The length must be 1, 2, 4, or 8: 000 indicates 1 byte, 001 indicates 2 bytes, 011 indicates 4 bytes, and 111 indicates 8 bytes

A 4-byte qualifier holds an offset in milliseconds, in the following format:

  • 4 bits: all set to 1, that is, hexadecimal F
  • 22 bits: the millisecond offset
  • 2 bits: reserved
  • 1 bit: integer or floating point. 0 indicates an integer and 1 indicates a floating point
  • 3 bits: the data length, stored as the length in bytes minus 1. The length must be 1, 2, 4, or 8: 000 indicates 1 byte, 001 indicates 2 bytes, 011 indicates 4 bytes, and 111 indicates 8 bytes
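The two qualifier layouts can be expressed as bit operations. The sketch below is illustrative only: the class `QualifierSketch` and its method names are hypothetical, and it assumes that the three length bits hold the value length minus one and that the 4-byte form starts with four 1 bits (hexadecimal F), as in OpenTSDB's implementation.

```java
/** Illustrative encoder for the 2-byte and 4-byte column qualifiers. */
final class QualifierSketch {
    private static final int FLAG_BITS = 4;     // 1 type bit + 3 length bits
    private static final int MS_FLAG_BITS = 6;  // 2 reserved bits + the 4 flag bits
    private static final int FLOAT_FLAG = 0x8;  // type bit: 1 = floating point

    /** Flag bits: the type bit followed by (value length - 1) in 3 bits. */
    static int flags(boolean isFloat, int valueLength) {
        if (valueLength != 1 && valueLength != 2 && valueLength != 4 && valueLength != 8) {
            throw new IllegalArgumentException("length must be 1, 2, 4, or 8");
        }
        return (isFloat ? FLOAT_FLAG : 0) | (valueLength - 1);
    }

    /** 2-byte qualifier: 12-bit second offset, then the 4 flag bits. */
    static short secondQualifier(int offsetSeconds, boolean isFloat, int valueLength) {
        if (offsetSeconds < 0 || offsetSeconds >= 3600) {
            throw new IllegalArgumentException("offset must be within one hour");
        }
        return (short) ((offsetSeconds << FLAG_BITS) | flags(isFloat, valueLength));
    }

    /** 4-byte qualifier: 0xF marker, 22-bit millisecond offset, 2 reserved bits, 4 flag bits. */
    static int millisecondQualifier(int offsetMillis, boolean isFloat, int valueLength) {
        if (offsetMillis < 0 || offsetMillis >= 3_600_000) {
            throw new IllegalArgumentException("offset must be within one hour");
        }
        return 0xF0000000 | (offsetMillis << MS_FLAG_BITS) | flags(isFloat, valueLength);
    }
}
```

The bit budget adds up in both cases: 12 + 1 + 3 = 16 bits for the second form, and 4 + 22 + 2 + 1 + 3 = 32 bits for the millisecond form, where 2^22 = 4194304 comfortably covers the 3600000 milliseconds in an hour.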

7.3 Value

A value is stored in at most 8 bytes, so it can hold either a long or a double.

7.4 Characteristics of the tsdb table design

  1. Metrics and tags are mapped to UIDs, so the actual strings are not stored in this table, which saves space.
  2. The data points of each time series within one hour are grouped into one row, and each column is one data point. As a result, each column only needs to record its offset from the start time of the row, which also saves space.
  3. Each column is a KeyValue.

Closing remarks

1. Application scenarios

  • As a time series database, OpenTSDB can query raw data, aggregate raw data, filter it, and run aggregate calculations on the filtered result.
  • It supports downsampling queries. For example, if the raw data has one data point per minute and you want to display one data point per hour, that is supported.
  • It supports group-by queries along a dimension. For example, given data for cities in China, you can group it by province and then query it.
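The downsampling scenario above (one point per minute shown as one point per hour) comes down to bucketing timestamps and aggregating each bucket. The following is a minimal sketch using an average; the class `DownsampleSketch` is hypothetical and does not reflect how OpenTSDB implements downsampling internally.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative fixed-interval downsampler using the average of each bucket. */
final class DownsampleSketch {
    /**
     * Averages points into fixed-size buckets.
     *
     * @param points          timestamp in seconds -> value, e.g. one point per minute
     * @param intervalSeconds bucket width, e.g. 3600 for one point per hour
     * @return bucket start timestamp -> average of the values in that bucket
     */
    static TreeMap<Long, Double> average(Map<Long, Double> points, long intervalSeconds) {
        TreeMap<Long, double[]> sums = new TreeMap<>(); // bucket -> {sum, count}
        for (Map.Entry<Long, Double> p : points.entrySet()) {
            long bucket = p.getKey() - (p.getKey() % intervalSeconds);
            double[] acc = sums.computeIfAbsent(bucket, b -> new double[2]);
            acc[0] += p.getValue();
            acc[1] += 1;
        }
        TreeMap<Long, Double> result = new TreeMap<>();
        for (Map.Entry<Long, double[]> e : sums.entrySet()) {
            result.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return result;
    }
}
```

Other aggregators such as sum, minimum, or maximum would replace only the accumulation step, while the bucketing logic stays the same.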

2. Precautions for use

  • The default character set of OpenTSDB is ISO-8859-1. Because this is a single-byte encoding, the encoded length of each character is fixed.
  • By default, the HBase table creation statements do not pre-split the tables, which causes hot spots when large amounts of data are written. Pre-splitting is therefore recommended.
  • OpenTSDB is not suited to queries that return large amounts of data. It can extract tens of thousands of data points from tens of millions, such as the 5-minute data of one metric over half a year, and still respond quickly. However, if many more points are extracted, such as hundreds of thousands or millions, or an aggregation is performed after extraction, OpenTSDB struggles. It works without problems for server monitoring, but is slow for client app monitoring.
  • OpenTSDB has only four HBase tables, and all data points are stored in one table. This means that services cannot be treated differently at a finer granularity at the OpenTSDB level, for example by giving different services their own tables.
  • OpenTSDB supports real-time aggregate computing, but the computation runs on a single node, so its computing power is limited.

3. Outlook

Druid or InfluxDB is recommended if you need to support large amounts of time series data. InfluxDB is the easiest time series database to use.

Note: To reprint this article, please contact our WeChat account: Labs2020.