Abstract: At present, intelligent interconnection technologies are rapidly spreading across scenarios such as the Internet, the Industrial Internet, and the Internet of Vehicles, driving a dramatic increase in networked sensors and intelligent devices, and with them massive volumes of time series monitoring data to store and process. This places higher demands on the compression ratio, storage capacity, and efficiency of time series databases. For IoT time series big data storage with ever-growing throughput, standard compression methods still have their value, but some scenarios impose new requirements on the efficiency and performance of time series compression and decompression techniques. This paper surveys existing time series compression and decompression techniques and introduces the characteristics, advantages, and disadvantages of the different algorithms by category.

Author: Ren Wei (Alibaba) | Source: Alibaba Technology official account


Time series data is widely used in IoT, Industrial Internet, Internet of Vehicles, and related scenarios. IoT devices are now deployed in all kinds of industrial settings, from wearables to production equipment, and they generate (or will generate) enormous amounts of data. The new Boeing 787, for example, produces about 500 GB of sensor data per flight. These scenarios are typically characterized by highly concurrent writes and high-throughput data processing, so the choice of a time series compression algorithm must take full account of how the data is collected, stored, and analyzed. Particular attention should be paid to how business applications analyze current and historical time series data: a poorly chosen compression algorithm may lose key information and distort the analysis results. In practice, time series compression technology is applied inside time series databases, where compression and decompression are key data processing steps; the performance of the compression algorithm directly affects the ROI of building a time series database.

I. Time Series Data Compression

Data compression algorithms are commonly discussed in the industry in the context of general-purpose and business-specific scenarios such as video, audio, and image compression. This paper focuses on compression algorithms designed for time series data and used in time series databases. The algorithms analyzed here can continuously compress time series data in fairly general scenarios, and are specially designed for the following characteristics of sensor data in IoT scenarios:

1. Data redundancy: time series segments of certain patterns repeat frequently within one or more time series.

2. Function estimation: the generation pattern of the data produced by some sensors can be estimated by predefined functions.

3. Trend prediction: the future trend of some time series can be predicted by algorithms such as regression and deep neural networks.

Figure: Classification of time series data compression algorithms

This paper focuses on compression algorithms commonly used in time series databases and IoT sensor management, classifying them by technical approach (dictionary-based, functional approximation, autoencoders, sequential, etc.) and by technical properties (adaptiveness, lossless reconstruction, symmetry, tunability). See the figure above for the classification; the performance of the main algorithms is compared and analyzed below.

II. Background Concepts

Before introducing the compression algorithms, we define several key concepts: time series, data compression, and quality indices.

1 Time Series

Time series data refers to a set of data tuples arranged in ascending order by timestamp (t_i). It can be divided into:

1. Univariate Time Series (UTS): each collected data tuple contains a single real-valued variable.

2. Multivariate Time Series (MTS): each collected data tuple consists of multiple real-valued components, each component representing one feature.

For example, the fluctuation of a stock price within the specified time window in Figure 2 can be modeled as a univariate time series, while daily trading information (such as opening and closing prices, trading volume, etc.) forms a multivariate time series.

Figure: Sample UTS data of stock trading

A time series can be defined mathematically as:

TS = [(t_1, x_1), (t_2, x_2), ..., (t_n, x_n)], with t_1 < t_2 < ... < t_n,

where each x_i is a single real value (UTS) or a vector of real values (MTS).

2 Data Compression

Data compression (also known as source coding) can, following David Salomon's definition in Data Compression: The Complete Reference, be described simply as "the process of converting an input data stream into another, smaller output data stream (the compressed stream)". This process follows the Simplicity Power (SP) theory proposed by J. G. Wolff, which aims to reduce data redundancy while preserving as much of the data's information as possible.

Data decompression (also known as source decoding) performs the reverse process: it reconstructs the original data stream so that the application layer can express and understand the data more efficiently.

Existing compression algorithms can be divided into the following categories according to the differences in implementation principles:

  • Non-adaptive / adaptive: non-adaptive algorithms require no training on particular data streams, while adaptive algorithms require a training phase to improve their efficiency.
  • Lossless / lossy: a lossless algorithm guarantees that the original data can be reconstructed exactly from the compressed result, while a lossy algorithm does not.
  • Symmetric / asymmetric: a symmetric algorithm uses the same algorithm for both compression and decompression, switching between the two processes via different code paths; an asymmetric algorithm uses different algorithms for compression and decompression.

For a concrete compression and decompression process on a time series stream: a compression algorithm instance (the encoder) takes a time series stream TS of size s and returns a compressed stream TS' of size s', with s > s': E(TS) = TS'. A decompression algorithm instance (the decoder) performs the reverse, reconstructing a time series from the compressed stream: D(TS') = TS_r. If TS_r = TS, the compression algorithm is lossless; otherwise it is lossy.

3 Quality Indices

To measure the performance of a time series compression algorithm, three properties should be considered: compression ratio, compression speed, and accuracy.

1. Compression ratio: measures how much the algorithm shrinks the original time series, defined as:

ρ = s' / s

where s' is the size of the compressed time series and s is the size of the original series. The reciprocal 1/ρ is also known as the compression factor, and a related quality index, the compression gain, expresses the benefit of compression:

CG = 100 · log_e(1/ρ)
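For a quick worked example (with made-up sizes): if a 1,000-byte series compresses to 250 bytes, then ρ = 250/1000 = 0.25, the compression factor is 1/ρ = 4, and CG = 100 · ln(4) ≈ 138.6.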

2. Compression speed: measures the execution speed of the algorithm, usually expressed in Cycles Per Byte (CPB).

3. Accuracy: also known as Distortion Criteria (DC), measures how faithfully the time series reconstructed by the compression algorithm preserves the information of the original. To meet the measurement needs of different scenarios, several metrics are in common use:

Mean Squared Error: MSE = (1/n) Σ_{i=1..n} (x_i - y_i)^2

Root Mean Squared Error: RMSE = sqrt(MSE)

Signal-to-Noise Ratio: SNR = 10 · log10(σ_x^2 / MSE)

Peak Signal-to-Noise Ratio: PSNR = 10 · log10(x_max^2 / MSE)

where x_i is an original value, y_i the corresponding reconstructed value, σ_x^2 the variance of the original series, and x_max its peak absolute value.
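As a minimal sketch (not from the original article) of how these four indices can be computed with NumPy; the function and array names are illustrative:

```python
import numpy as np

def distortion_metrics(original, reconstructed):
    """Common accuracy (distortion) indices for a reconstructed time series."""
    x = np.asarray(original, dtype=float)
    y = np.asarray(reconstructed, dtype=float)
    mse = np.mean((x - y) ** 2)                          # Mean Squared Error
    rmse = np.sqrt(mse)                                  # Root MSE
    snr = 10 * np.log10(np.var(x) / mse)                 # Signal-to-Noise, dB
    psnr = 10 * np.log10(np.max(np.abs(x)) ** 2 / mse)   # Peak SNR, dB
    return {"MSE": mse, "RMSE": rmse, "SNR": snr, "PSNR": psnr}
```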

III. Compression Algorithms

Currently, the commonly used time series compression algorithms mainly fall into the following families:

1. Dictionary-Based (DB)
   1.1. TRISTAN
   1.2. CORAD
   1.3. A-LZSS
   1.4. D-LZW

2. Functional Approximation (FA)
   2.1. Piecewise Polynomial Approximation (PPA)
   2.2. Chebyshev Polynomial Transform (CPT)
   2.3. Discrete Wavelet Transform (DWT)

3. Autoencoders
   3.1. Recurrent Neural Network Autoencoder (RNNA)

4. Sequential Algorithms (SA)
   4.1. Delta encoding, Run-length and Huffman (DRH)
   4.2. Sprintz
   4.3. Run-length Binary Encoding (RLBE)
   4.4. RAKE

5. Others
   5.1. Segment Merging (SM)
   5.2. Continuous Hidden Markov Chain (CHMC)

1 Dictionary-Based (DB)

The core idea of DB algorithms is to identify segments that recur across the time series, define them as atomic segments, replace them with shorter identification markers, and build a dictionary keyed by those identifiers for recovery. This can achieve a high compression ratio while keeping the error rate low. Depending on the implementation, the resulting compression may be lossless. The main challenges of this architecture are:

  • maximizing the speed of searching the dictionary for matching time series segments;
  • making the segments stored in the dictionary as general as possible, to minimize the reconstruction error introduced in the compression phase.

TRISTAN is an algorithm based on the DB strategy. It splits compression into two stages: adaptive learning in the first stage and data compression in the second. In the learning phase, the TRISTAN dictionary is generated by learning from a training data set, or by encoding expert knowledge as atomic segments with specific patterns. Given the dictionary, the compression phase solves the following for w:

s = w · D,  w ∈ {0, 1}^k

where D is the dictionary, s is the segment being represented, and k is the length of the compressed representation. TRISTAN decompression is the process of interpreting the representation w against the dictionary D to recover the original time series.
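A minimal sketch of this dictionary-lookup idea (a toy dictionary with one-hot selection vectors and exact matching; the real TRISTAN learns D and allows sparser combinations):

```python
import numpy as np

# Hypothetical dictionary D: each row is an atomic segment from the learning phase.
D = np.array([[1.0, 2.0, 3.0],
              [3.0, 2.0, 1.0],
              [0.0, 0.0, 0.0]])

def compress_segment(s, D):
    """Find w in {0,1}^k with s = w @ D (one-hot exact match in this toy)."""
    for i, atom in enumerate(D):
        if np.allclose(atom, s):
            w = np.zeros(len(D), dtype=np.uint8)
            w[i] = 1
            return w
    raise ValueError("segment not representable by this toy dictionary")

def decompress_segment(w, D):
    """Decompression interprets the representation w against D."""
    return w @ D

w = compress_segment(np.array([3.0, 2.0, 1.0]), D)
assert np.allclose(decompress_segment(w, D), [3.0, 2.0, 1.0])
```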

The CORAD algorithm extends TRISTAN with autocorrelation information: it measures the correlation between pairs of time series segments using the Pearson correlation coefficient, stores the correlations in an adjacency matrix M, and combines M with the dictionary D to further improve the compression ratio and the accuracy of decompression.

Accelerometer LZSS (A-LZSS) is a DB implementation built on the LZSS string-matching algorithm. A-LZSS uses Huffman codes generated offline from a statistical probability distribution of the data.

Differential LZW (D-LZW) builds a very large dictionary that grows over time. Once the dictionary is created, if a buffered block is found in the dictionary it is replaced by the corresponding index; otherwise the new block is inserted into the dictionary as a new entry. New blocks are added under the lossless-compression principle, and the number of entries is not limited, so the dictionary can grow without bound. This makes D-LZW suitable only for certain scenarios, such as input streams composed of a finite set of phrases or an enumerable character set.
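For intuition, here is a minimal textbook LZW encoder over byte strings (a simplification, not the D-LZW variant, which operates on differentially encoded blocks):

```python
def lzw_encode(data: bytes) -> list[int]:
    """Minimal LZW: emit dictionary indices; the dictionary grows unboundedly."""
    dictionary = {bytes([i]): i for i in range(256)}  # seed with single bytes
    out, buf = [], b""
    for byte in data:
        candidate = buf + bytes([byte])
        if candidate in dictionary:
            buf = candidate                          # keep extending the match
        else:
            out.append(dictionary[buf])              # emit index of longest match
            dictionary[candidate] = len(dictionary)  # new entry, never evicted
            buf = bytes([byte])
    if buf:
        out.append(dictionary[buf])
    return out

print(lzw_encode(b"ababababab"))  # repeated phrases compress to few indices
```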

Zstandard (ZSTD) is a fast lossless DB-family compression algorithm whose entropy-coding stage is based on Huffman coding; dictionary support is an optional parameter that can be switched on or off. Open-sourced by Facebook, the algorithm supports on-demand trade-offs between compression speed and compression ratio: it can sacrifice speed for a higher ratio, and vice versa. The table below compares the performance of ZSTD with similar algorithms.
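As a usage sketch (via the third-party zstandard Python bindings; the sample payload is made up):

```python
import zstandard as zstd

data = b"sensor_id=42,ts=1700000000,value=23.5\n" * 1000  # repetitive telemetry

# Higher level = better ratio, lower speed (levels 1..22).
compressed = zstd.ZstdCompressor(level=3).compress(data)
restored = zstd.ZstdDecompressor().decompress(compressed)

assert restored == data              # lossless round trip
print(len(compressed) / len(data))   # compression ratio rho = s'/s
```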

Table: ZSTD algorithm performance comparison

2 Function Approximation (FA)

The main design idea of FA is to assume that a time series can be expressed as a function of time. Since new values that a single function cannot handle are hard to avoid, finding one function that accurately describes the whole series is not feasible. Instead, the series is divided into multiple segments, and an approximating time function is found to describe each segment.

Since finding a function f: T → X that completely describes the time series is not feasible, it is more practical to find a family of functions and their corresponding parameters to describe the segmented data, though this makes the implementation lossy.

By contrast, the advantage of FA algorithms is that they do not depend on the value range of the data, so no training phase on a sample data set is required. If a regression algorithm is adopted, only one individually divided time segment needs to be considered at a time.

Piecewise Polynomial Approximation (PPA) is a common FA implementation. It divides the time series into segments of fixed or variable length and searches for the best polynomial representation approximating each segment. Although the compression is lossy, the maximum deviation from the original data can be bounded a priori, so a given reconstruction accuracy can be guaranteed. PPA uses a greedy approach and three different online regression algorithms to approximate constant functions, straight lines, and polynomials.
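A minimal sketch of the fixed-length variant (least-squares polynomial fits with a raw-data fallback to bound the deviation; segment length, degree, and threshold are illustrative assumptions, and the real PPA also adapts segment boundaries online):

```python
import numpy as np

def ppa_compress(x, seg_len=32, degree=2, max_dev=0.5):
    """Fit one polynomial per fixed-length segment; keep only the coefficients."""
    segments = []
    for start in range(0, len(x), seg_len):
        seg = np.asarray(x[start:start + seg_len], dtype=float)
        t = np.arange(len(seg))
        coeffs = np.polyfit(t, seg, min(degree, len(seg) - 1))  # LS fit
        if np.max(np.abs(np.polyval(coeffs, t) - seg)) <= max_dev:
            segments.append(("poly", len(seg), coeffs))   # bounded deviation
        else:
            segments.append(("raw", len(seg), seg))       # fallback: store as-is
    return segments

def ppa_decompress(segments):
    parts = [np.polyval(p, np.arange(n)) if kind == "poly" else p
             for kind, n, p in segments]
    return np.concatenate(parts)
```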

Chebyshev Polynomial Transform (CPT) is similar in principle to PPA, improved to support different types of polynomials. Discrete Wavelet Transform (DWT) applies the wavelet transform to the time series; a wavelet is a wave-like oscillation whose amplitude begins at zero, rises, and then decays back to zero.

3 Autoencoders

An autoencoder is a special neural network trained to reproduce its input. Its architecture consists of two symmetric parts: an encoder and a decoder. Given an n-dimensional time series input, the encoder produces an m-dimensional output (m < n), and the decoder can restore the m-dimensional code back to the n-dimensional input. Recurrent Neural Network Autoencoder (RNNA) is a typical implementation, using RNNs to realize the compressed representation of time series.
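A minimal sketch of an LSTM-based autoencoder in Keras (shapes, sizes, and the toy training data are illustrative assumptions, not the specific RNNA model from the literature):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

n_steps, n_features, latent_dim = 64, 1, 8   # compress 64 values into 8

model = models.Sequential([
    layers.Input(shape=(n_steps, n_features)),
    layers.LSTM(latent_dim),                          # encoder: series -> code
    layers.RepeatVector(n_steps),                     # feed code to each step
    layers.LSTM(latent_dim, return_sequences=True),   # decoder RNN
    layers.TimeDistributed(layers.Dense(n_features)), # reconstruct each value
])
model.compile(optimizer="adam", loss="mse")           # reconstruction error

# Toy training data: phase-shifted sine waves standing in for sensor series.
t = np.linspace(0, 2 * np.pi, n_steps)
x = np.stack([np.sin(t + p) for p in np.random.rand(256) * 6.28])[..., None]
model.fit(x, x, epochs=5, verbose=0)                  # learn to reconstruct
```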

Figure: Autoencoder structure

4 Sequential Algorithms (SA)

The principle of sequential algorithms (SA) is to chain several simple compression techniques together to compress time series data. Common building blocks include the following (a sketch combining delta and run-length encoding follows the list):

  • Huffman coding: the encoder builds a dictionary associating each symbol with a binary representation and replaces each symbol of the original data with its representation.
  • Delta encoding: encodes a target file relative to one or more reference files. In the specific case of time series, each element is encoded as the difference from its predecessor, Δ(x_t, x_{t-1}).
  • Run-length encoding: each run (a sequence of consecutive repetitions of the same value) is replaced by the pair (v_t, o), where v_t is the value at time t and o is the number of consecutive occurrences.
  • Fibonacci binary encoding: a variable-length integer code based on the Fibonacci sequence.
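A minimal sketch chaining two of these building blocks, delta encoding followed by run-length encoding (illustrative only; real SA implementations add entropy coding on top):

```python
def delta_encode(xs):
    """Replace each value with its difference from the previous one."""
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]

def run_length_encode(xs):
    """Collapse runs of equal values into (value, count) pairs."""
    runs = []
    for v in xs:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

# A slowly ramping sensor series: the deltas are mostly a constant 1,
# so run-length encoding of the deltas is very compact.
series = [10, 11, 12, 13, 14, 14, 14, 15, 16, 17]
print(run_length_encode(delta_encode(series)))
# [(10, 1), (1, 4), (0, 2), (1, 3)]
```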

Delta encoding, Run-length and Huffman (DRH) algorithm is a compression algorithm that incorporates Delta Encoding, Huffman coding, Run-Length Encoding and Huffman Coding. The implementation of the algorithm is non-loose and the computational complexity is relatively small, which is suitable for the compression and decompression of data in the edge short. Therefore, it is suitable for the application of the Internet of Things, industrial Internet, Internet of vehicles and other scenarios requiring the edge short acquisition and processing of time series data.

Sprintz is an SA algorithm designed specifically for IoT scenarios. Its design takes into account the fluctuation patterns of time series metrics such as energy consumption and speed in IoT scenarios, and is specially optimized for the following requirements:

1) fast processing of small fragments of data;

2) low computational complexity of compression and decompression, to fit the limited computing resources at the edge;

3) fast compression and decompression of real-time time series data;

4) lossless compression.

To achieve a better compression effect on IoT time series data, Sprintz improves compression performance by predicting the trend of the incoming data. The algorithm pipeline mainly consists of the following parts (a simplified sketch follows this list):

1) Prediction: based on delta encoding or the FIRE algorithm, the generation pattern of new samples is predicted from historical time series data;

2) Bit packing: the prediction-error data is packed together with a packet header describing the information needed for decompression;

3) Run-length encoding: if a stretch of data is predicted without error during compression, the bit-packing output for it is skipped, and the length of the skipped run is recorded in the packet header of the next prediction-error block when an error next occurs;

4) Entropy coding: the packed output of the bit-packing step is encoded with Huffman coding.
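A highly simplified sketch of the predict-then-encode idea (a delta predictor with zero-error runs skipped; the real Sprintz adds FIRE prediction, per-block bit packing, and Huffman coding):

```python
def sprintz_like_encode(samples):
    """Emit (zero_run_length, error) pairs: perfect predictions cost ~nothing."""
    prev, zero_run, out = 0, 0, []
    for x in samples:
        err = x - prev          # delta predictor: predict "same as last value"
        prev = x
        if err == 0:
            zero_run += 1       # skip output; just count the perfect predictions
        else:
            out.append((zero_run, err))  # record skipped run in this "header"
            zero_run = 0
    if zero_run:
        out.append((zero_run, 0))
    return out

# A mostly flat sensor signal encodes into only a few pairs.
print(sprintz_like_encode([5, 5, 5, 5, 6, 6, 6, 7]))
# [(0, 5), (3, 1), (2, 1)]
```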

Run-length Binary Encoding (RLBE) is a commonly used lossless SA algorithm for time series compression and decompression in IoT scenarios, suited to the limited computing and storage resources at the IoT edge. RLBE combines three techniques: delta encoding, run-length encoding, and Fibonacci coding. Its execution flow is shown in the figure below.
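For the Fibonacci-coding building block, a minimal sketch (the standard textbook Zeckendorf representation with a terminating "11"; not RLBE's full pipeline):

```python
def fibonacci_encode(n: int) -> str:
    """Encode a positive integer as a Fibonacci (Zeckendorf) codeword."""
    fibs = [1, 2]
    while fibs[-1] < n:
        fibs.append(fibs[-1] + fibs[-2])     # 1, 2, 3, 5, 8, 13, ...
    bits = []
    for f in reversed(fibs):
        if f <= n:
            bits.append("1")
            n -= f                           # greedy: take largest Fibonacci
        else:
            bits.append("0")
    code = "".join(bits).lstrip("0")[::-1]   # least-significant digit first
    return code + "1"                        # trailing '11' marks the end

print([fibonacci_encode(n) for n in (1, 2, 3, 4, 11)])
# ['11', '011', '0011', '1011', '001011']
```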

Figure: Run-length Binary Encoding (RLBE) execution flow

The RAKE algorithm compresses data by exploiting the sparsity of the time series. RAKE is lossless, and its execution consists of two main stages: preprocessing and compression. In the preprocessing stage, RAKE transforms the raw data using a dictionary; in the compression stage, it detects sparsity in the preprocessed output and compresses adjacent identical values.

5 Other types of algorithms

The Major Extrema Extractor (MEE) algorithm compresses data by extracting the maxima and minima of the time series within a specified time window. The Segment Merging (SM) algorithm abstracts the time series into segments described by a timestamp, a value, and a deviation, expressed as tuples (t, y, δ), where t is the start time, y is the constant value of the segment, and δ is the segment's deviation. The Continuous Hidden Markov Chain (CHMC) algorithm uses a Markov-chain probability model to represent the time series as a finite set of state nodes S and a set A of state-transition probability edges linking the nodes.

IV. Summary

Time series data is the data type with the highest share of the total data generated in scenarios such as IoT, Industrial Internet, and Internet of Vehicles. An effective compression algorithm not only reduces data storage costs but also saves network bandwidth and shortens data synchronization time when data is transmitted from edge to center and from center to cloud. Compression performance is therefore critical to a time series database, the core storage facility for time series data. Lindorm TSDB, the time series engine of Alibaba Cloud's Lindorm multi-model database, has built-in self-developed, efficient data compression algorithms that are specifically optimized for intelligent, connected scenarios such as IoT, Industrial Internet, and Internet of Vehicles, further improving the ROI of time series data storage and providing necessary support for a new generation of cloud-native intelligent and connected systems.


This article is the original content of Aliyun and shall not be reproduced without permission.