The Heart of Machine, written by Wu Jiying, edited by Hao Wang.

In an age of limited network bandwidth, data compression is especially valuable: recall how Pied Piper's epic compression algorithm in the first season of Silicon Valley made the company the talk of the industry. Efficient data compression is what makes large-scale web applications feasible on mobile devices, and the prospects are attractive. With the advent of the big data era, both the volume of data and its growth rate have reached unprecedented heights, and the rapid development of 5G has created ever more applications and scenarios for edge computing, the Internet of Things, federated learning, and the like. With transmission networks and storage capacity limited, data compression technology plays an increasingly important role. While traditional compression algorithms continue to evolve, deep learning networks have in recent years been applied to data compression and have achieved good results.

This article briefly reviews the "past and present" of data compression, focuses on lossy and lossless compression methods based on deep learning, and closes with a discussion and outlook for deep learning-based data compression.

1. Background knowledge of data compression

It is well known that information theory and machine learning are closely related; they are often called "two sides of the same coin." One particularly important link between them is the essential equivalence between probabilistic models of data and data compression methods. Shannon's source coding theorem can be regarded as the basic theorem describing this idea. Shannon-Fano coding, Huffman coding, arithmetic coding, and the more recently developed Asymmetric Numeral Systems (ANS) are all classical compression algorithms based on statistical models, i.e., on the frequency statistics of the individual symbols in a message. Besides these statistical, probability-based methods, classical data compression also includes dictionary-based techniques such as LZ77, LZ78, and LZW, as well as entropy coding and run-length encoding.

The compression tools we use in daily life are essentially variations, combinations, or optimizations of the classical methods above, and rarely rely on a single technique. For example, GZIP first compresses the data with a variant of the LZ77 algorithm and then compresses the result with static or dynamic Huffman coding. Bzip2 first applies a run-length encoder, then block-sorting (Burrows-Wheeler) compression and a move-to-front (MTF) transform to produce long runs of identical symbols, applies another run-length encoder, Huffman-encodes the result, and wraps it with a header. LZMA is an improved and optimized algorithm based on Deflate and LZ77, while Deflate itself is a lossless compression algorithm that combines LZ77 with Huffman coding. However, faced with data processing in the big data era, traditional compression methods are increasingly unable to meet the demands of data that is large in volume, fast-growing, and complex in structure, especially real-time compression.
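As a rough illustration of how these classical pipelines compare in practice, the sketch below (an illustrative example, not from the article; the input text is arbitrary) uses Python's standard-library gzip, bz2, and lzma modules to compress the same data:

```python
# Compare classical compressors via Python's standard library.
# The input string and the comparison itself are illustrative assumptions.
import gzip, bz2, lzma

data = ("the quick brown fox jumps over the lazy dog. " * 200).encode("utf-8")

results = {
    "gzip (LZ77 + Huffman)": gzip.compress(data),
    "bzip2 (RLE + BWT + MTF + Huffman)": bz2.compress(data),
    "lzma": lzma.compress(data),
}

print(f"original: {len(data)} bytes")
for name, blob in results.items():
    ratio = len(data) / len(blob)
    print(f"{name}: {len(blob)} bytes (ratio {ratio:.1f}x)")
```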

In recent years, machine learning has developed explosively, and some researchers have obtained good compression results with traditional neural network models. After an original image is processed by such a network, only the network weights are stored rather than the image itself, so a high compression ratio is obtained without degrading image quality. Building on this, the application of neural networks to compression has been further advanced by combining networks with other compression techniques, improving network structures, and improving training algorithms. However, these networks are shallow, and their convergence speed, stability, and the validity of their weight updates are all inadequate. In addition, they rely on labeled pre-training data, which is hard to obtain in practical application scenarios.

2. Data compression based on deep learning

The introduction of deep learning effectively solves the problems existing in traditional methods. Compared with traditional compression technology, the deep learning-based method has the following natural advantages:

  1. The parameters of a deep learning model are derived from large amounts of real data, whereas traditional compression and coding methods are mainly hand-designed from prior knowledge; this gives deep learning excellent content adaptivity, superior to models based on signal processing.

  2. Deep learning methods make effective use of a large receptive field, so they can exploit not only adjacent samples but also distant ones to improve coding efficiency, whereas traditional coding tools use only adjacent samples and have difficulty exploiting distant ones.

  3. Compression methods based on deep learning are flexible: they can further reduce the bit rate according to the characteristics of a specific domain while still processing data quickly, and their internal representations are well suited to modern data processing.

Compared with compression based on traditional (shallow) neural networks, deep learning-based methods have the following advantages: (1) with their multi-layer structure, deep models have stronger nonlinear mapping ability and are therefore better at learning deep features of the data (especially images); (2) multi-layer learning with weight fine-tuning speeds up training, which helps meet the requirements of compressing large volumes of data; (3) deeper models remove redundant data features more effectively and thus achieve higher compression ratios.

So far, random neural networks, convolutional neural networks (CNN), recurrent neural networks (RNN), generative adversarial networks (GAN), variational autoencoders (VAE), and others have been applied to data compression. This article analyzes and discusses two kinds of deep learning compression techniques, based on GAN and on VAE, drawing on results from major academic conferences of the past two years.

1. Application of GAN in lossy compression

Title: Deep generative models for distribution preserving lossy compression, NIPS 2018

Address: https://arxiv.org/abs/1805.11057?context=stat

This paper proposes a distribution-preserving lossy compression (DPLC) system and applies it to image compression. At different compression rates, the encoder produces codes consistent with the distribution of the original data; at zero bit rate the decoder generates independent, identically distributed samples from that distribution, and as the bit rate increases the reconstructions gradually contain more content of the original image, until at a sufficiently high bit rate perfect reconstruction is achieved.

For a random variable X, let P_X denote its distribution and E_X[X] its expectation. W ~ P_X means that W follows the distribution P_X, and X ~ Y means that X and Y are identically distributed. In a standard lossy compression system, the encoder E: X→W encodes the original input into R bits of code, and the decoder D: W→X maps the code back to the original space while minimizing some distortion measure d: X×X → R+. The objective is expressed as:
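In standard notation, and consistent with the definitions above, the objective presumably takes the form (a reconstruction; the paper's exact notation may differ):

  \min_{E,\,D} \; \mathbb{E}\big[\, d\big(X, D(E(X))\big) \,\big]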

In traditional compression methods, E and D are typically deterministic, so the number of distinct reconstructions X^ := D(E(X)) is limited to 2^R, which degrades the quality of X^, for example through blurring and blocking artifacts. To address this, a constraint is added to the objective, namely that the distribution of the reconstructions should follow the distribution of the training data:
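A reconstruction of the constrained objective, consistent with the text, is:

  \min_{E,\,D} \; \mathbb{E}\big[\, d\big(X, D(E(X))\big) \,\big] \quad \text{s.t.} \quad D(E(X)) \sim P_X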

Adding this constraint means that at R = 0 the system is equivalent to learning an exact generative model of the generally unknown distribution P_X, which naturally suggests introducing a GAN. To reduce the bias introduced by relaxing the constrained objective above (which tends to push reconstructions toward the mean), D is decomposed as D = G ∘ B, where G is a generative model (taking samples from a fixed prior P_Z as input and trained to minimize the divergence between P_G(Z) and P_X) and B is a stochastic function (trained together with the encoder E under the original objective to minimize distortion with G fixed). The Wasserstein distance can be defined for an arbitrary transport cost function, so it is suitable as the distortion measure d for quantifying reconstruction quality in this paper. For random variables X and Y, the Wasserstein distance is:
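In standard form (reconstructed from the surrounding definitions), with c a transport cost function:

  W_c(P_X, P_Y) = \inf_{\pi \in \Pi(P_X, P_Y)} \mathbb{E}_{(X,Y) \sim \pi}\big[\, c(X, Y) \,\big]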


where Π(P_X, P_Y) is the set of all joint distributions (couplings) of (X, Y). Let (X, d′) denote a metric space; setting c(x, y) = d′(x, y) yields the Kantorovich-Rubinstein dual:
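The dual, again reconstructed in standard form consistent with the definitions that follow, is:

  W_{d'}(P_X, P_Y) = \sup_{f \in F_1} \; \mathbb{E}_X[f(X)] - \mathbb{E}_Y[f(Y)]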

where F_1 is the class of bounded 1-Lipschitz functions f: X→R. The generative model G used in this paper is learned by minimizing the Wasserstein distance between P_X and P_G(Z), for which there are two routes: (i) a Wasserstein Autoencoder (WAE), where G ∘ F parameterizes the coupling between P_X and P_G(Z), or (ii) a Wasserstein GAN (WGAN). This paper proposes Wasserstein++ to combine WAE and WGAN. The trained generative model G, together with the rate-constrained encoder E and the stochastic function B, implements the proposed DPLC system, minimizing the distortion between X and X^ while ensuring that P_X and P_X^ remain close at all rates. The complete DPLC system is shown in Figure 1.

Figure 1. Overall structure of DPLC.

To learn G, B, and E from raw data, each component is parameterized as a deep neural network and the corresponding optimization problem is solved by stochastic gradient descent (SGD). As mentioned above, G can be learned through either WGAN or WAE. The WAE framework naturally includes an encoder, which ensures that the structure of the latent space Z is easy to encode, something WGAN does not provide. In the experiments, edges in images reconstructed with WAE are less sharp than with WGAN; on the other hand, WAE is less prone to mode dropping than WGAN, because the reconstruction error term in the WAE objective heavily penalizes mode loss. To combine the advantages of the two methods, the authors propose optimizing a convex combination of the primal and dual forms of W_d′:
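The combined objective is presumably a convex combination of the two estimators (the mixing weight γ and the exact notation are assumptions here, following the description in the text):

  \gamma\, \widehat{W}_{d'}(P_X, P_{G(Z)}) \;+\; (1-\gamma)\, \widetilde{W}_{d'}(P_X, P_{G(Z)}), \qquad \gamma \in [0, 1],

where the first term is the primal (WAE-style) estimate and the second the dual (WGAN-style) estimate.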

The critic f is trained on pseudo-samples from G(Z) or G(F(X)); these two sets of samples generally do not follow the same distribution because of the mismatch between F(X) and P_Z (most pronounced at the beginning of optimization). The following adjustments are therefore needed:

  • Train f on samples G(z~), where z~ := U·Z + (1 − U)·F(X) and U ~ Uniform(0, 1);

  • Train G on samples F(X) when computing both the WGAN and the WAE loss terms.

Experiments were performed on two standard generative-modeling benchmark image datasets, CelebA and LSUN Bedrooms, with images downscaled to 64×64 resolution. Figure 2 shows the test MSE (smaller is better), reconstruction FID (smaller is better), and conditional pixel variance PV (larger is better) for the different methods; results for CelebA are in the top row and results for LSUN Bedrooms in the bottom row. The PV of the DPLC models proposed in this paper increases steadily as the rate decreases, i.e., in contrast to GC, they generate more and more meaningful image content.

Figure 2. Experimental results of different methods.

2. Application of VAE in lossless compression

Title: Bit-Swap: Recursive Bits-Back Coding for Lossless Compression with Hierarchical Latent Variables, ICML 2019

Address: https://arxiv.org/abs/1905.06845

This paper discusses the application of variational autoencoders (VAE) to compression and proposes Bit-Swap, a lossless compression method. According to classical statistical theory, any data distribution can be converted into a lossless code in which each data point is encoded into a number of bits equal to its negative log-probability under the model; the best expected code length is obtained when the model matches the true data distribution. A lossless compression algorithm designed around probabilistic likelihood therefore has to solve two problems: approximating the true data distribution with a model, and finding a practical compression algorithm that is compatible with the model and achieves a code length close to -log p_θ(x). The paper introduces a VAE to decompose the original probability distribution into multiple layers, uses the latent variable model to compute an approximate marginal distribution of the original data, and combines this with the classical ANS compression technique to achieve fast compression and decompression.

Given discrete original data x = (x1, ..., xD) with distribution p_data, the probabilistic-likelihood view of ANS lossless compression is as follows: using a model probability p_θ(x), when the code length satisfies -log p_θ(x) = -log p_data(x), the average code length E[-log p_θ(x)] approaches the entropy of the data H(x), i.e., the average code length of the optimal compression scheme. ANS encodes a symbol vector x under a univariate or fully factorized probability distribution p(x), producing about -log p(x) bits. The state of an ANS coder is a bitstream with a stack structure: each time a symbol is encoded, its bits are pushed onto the right end of the stack; each time a symbol is decoded, bits are popped from the right end of the stack. Figure 3 shows how ANS works.

Figure 3. ANS operates on bitstreams in a stack-like fashion.
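To make the stack-like (last-in, first-out) behaviour in Figure 3 concrete, here is a minimal toy rANS coder (an illustrative sketch with an assumed two-symbol alphabet, not the coder used in the paper); renormalization into an explicit bitstream is omitted because Python integers are unbounded:

```python
# Toy rANS: encode pushes symbols onto the integer state, decode pops them
# back in reverse order, illustrating ANS's stack behaviour.

freqs = {"a": 3, "b": 1}          # assumed symbol frequencies
cumul = {"a": 0, "b": 3}          # cumulative frequencies
M = sum(freqs.values())           # total frequency

def encode(symbols, state=1):
    for s in symbols:             # symbols are pushed in order...
        f, c = freqs[s], cumul[s]
        state = (state // f) * M + c + (state % f)
    return state

def decode(state, n):
    out = []
    for _ in range(n):            # ...and popped in reverse (stack) order
        slot = state % M
        s = "a" if slot < cumul["b"] else "b"
        f, c = freqs[s], cumul[s]
        state = f * (state // M) + slot - c
        out.append(s)
    return out[::-1], state

state = encode("aabab")
print(decode(state, 5))           # -> (['a', 'a', 'b', 'a', 'b'], 1)
```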

In this paper, a latent variable model is used to compute an approximate marginal distribution p_θ(x) of p_data(x):
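For a single latent variable z, this marginalization takes the standard form:

  p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz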

where z is the latent variable. The VAE introduces an inference model q_θ(z|x) to approximate the model posterior p_θ(z|x):
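The omitted equation is presumably the standard decomposition of the log-marginal likelihood:

  \log p_\theta(x) = \mathbb{E}_{q_\theta(z \mid x)}\big[ \log p_\theta(x, z) - \log q_\theta(z \mid x) \big] + D_{KL}\big( q_\theta(z \mid x) \,\|\, p_\theta(z \mid x) \big)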

Since D_KL >= 0, the inference model and the generative model can be obtained by jointly optimizing the evidence lower bound (ELBO), a lower bound on log p_θ(x):
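In standard form, the ELBO reads:

  \mathrm{ELBO}(\theta) = \mathbb{E}_{q_\theta(z \mid x)}\big[ \log p_\theta(x \mid z) \big] - D_{KL}\big( q_\theta(z \mid x) \,\|\, p(z) \big) \;\le\; \log p_\theta(x)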

The processing flow of the coding side is as follows:

  • The sender is to encode x together with a latent sample z ~ q_θ(z|x), using p_θ(x|z) for x and the prior p(z) for z;

  • Using q_θ(z|x), decode z from the bitstream, which removes -log q_θ(z|x) bits from the bitstream;

  • Using p_θ(x|z), encode x, which adds -log p_θ(x|z) bits to the bitstream;

  • Using p(z), encode z, which adds -log p(z) bits to the bitstream;

  • The result is a bitstream of length N_total := N_init + log q_θ(z|x) - log p_θ(x|z) - log p(z) bits, which is sent to the receiver.

The receiver decodes the data by initializing ANS with the received bitstream and then performing the same steps in reverse order, with the encoding and decoding operations swapped.

When the model has many latent variables, supplying the initial bitstream of N_init bits becomes an efficiency problem (compounded by the large number of parameters in deep models). To address this, a hierarchical latent variable model is used: the VAE has multiple latent variables whose sampling process forms a Markov chain, as shown in Figure 4:

Figure 4. Hierarchical latent variable model.

In this case, the approximate marginal distribution p_θ(x) is:
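For L latent layers linked in a Markov chain, the marginal presumably takes the hierarchical form:

  p_\theta(x) = \int p_\theta(x \mid z_1)\, p_\theta(z_1 \mid z_2) \cdots p_\theta(z_{L-1} \mid z_L)\, p(z_L)\, dz_1 \cdots dz_L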



Defining an inference model q_θ(z_i | z_{i-1}) for each latent layer z_i (with z_0 = x), the marginal p_θ(x) can be bounded variationally, and the ELBO objective becomes:
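In generic form (the paper's version factorizes both p_θ and q_θ along the Markov chain above):

  \log p_\theta(x) \;\ge\; \mathbb{E}_{q_\theta(z_{1:L} \mid x)}\big[ \log p_\theta(x, z_{1:L}) - \log q_\theta(z_{1:L} \mid x) \big]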

In the Bit-Swap model proposed in this paper, the sampling processes of both the generative model and the inference model obey the Markov chain dependencies between the random variables. As in a standard VAE, the data x are generated from the latent variable z_1; however, z_1 is no longer given a fixed prior but is generated from a second latent variable z_2, z_2 in turn is generated from a third latent variable z_3, and so on. These nested dependencies allow the bits-back argument to be applied recursively. For the first latent variable, the first two steps of the bits-back argument are applied: first decode z_1, then immediately encode x, which adds new bits to the bitstream, so the subsequent decoding operations for z_2, ..., z_L require fewer initial bits to proceed. The bits-back argument is then applied recursively to the second latent variable z_2 in the same way: first decode z_2, then encode z_1, and so on. The maximum number of initial bits required by Bit-Swap is:

The complete bit-swap data compression algorithm proposed in this paper is as follows:

Taking three latent layers as an example, the Bit-Swap process is shown in Figure 5 and traced in the sketch that follows it.

Figure 5. Bit-swap process.
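The operation order depicted in Figure 5 can also be written out as a small trace (an illustrative sketch only; the function below is not from the paper and performs no real entropy coding):

```python
# Trace the order of coder operations for the Bit-Swap sender with L latent
# layers, as described in the text: decode z1, encode x, then alternately
# decode z_i and encode z_{i-1}, and finally encode z_L under its prior.

def bitswap_sender_ops(L):
    ops = ["decode z1 with q(z1 | x)",
           "encode x with p(x | z1)"]
    for i in range(2, L + 1):
        ops.append(f"decode z{i} with q(z{i} | z{i-1})")
        ops.append(f"encode z{i-1} with p(z{i-1} | z{i})")
    ops.append(f"encode z{L} with the prior p(z{L})")
    return ops

for op in bitswap_sender_ops(3):   # the three-layer example of Figure 5
    print(op)
```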

In this paper, hierarchical latent variable models are trained on random 32×32-pixel blocks from ImageNet. For testing, 100 images were drawn independently from the test set; their compression ratios are reported in Table 1. Each of the 100 images was cropped to a multiple of 32 pixels on each side so that it could be tiled by a grid of 32×32 blocks; this grid is treated as a dataset that must be processed sequentially before it can be compressed. The baseline schemes used the same cropped images, without first dividing them into 32×32-pixel blocks.

Table 1. Compression ratio of ImageNet 100 images (unscaled and cropped).

The authors maintain a dedicated page on deep-learning-based data compression, where they post related research results, code, and video downloads; interested readers can visit https://bair.berkeley.edu/blog/2019/09/19/bit-swap/.

3. Prospects for data compression based on deep learning

As the analysis above and the two example papers show, when dealing with data compression, deep learning methods no longer try to hand-craft statistical and numerical features of the data as traditional methods do; instead, they automatically mine the internal features of structurally complex data. The data mining and reconstruction capabilities of deep learning allow it to achieve high compression ratios while preserving data characteristics and coding quality.

Despite these good results, deep learning still has its own limitations:

  • When the data volume is large and the data structure complex, the overall deep learning model also becomes very complicated, and the time required for deeper training grows rapidly. As the two examples in this article show, the authors downscaled the images in their experiments (to 32×32 or 64×64), and they note in their discussions that using full-size images would make the computation time and cost enormous.

  • Deep learning methods still lack an analysis of their internal mechanisms and are used as "black boxes" in different compression applications; the main way to improve compression performance is to adjust the number of training layers and the parameters.

As with deep learning in other machine learning scenarios, deep learning-based data compression has produced surprising results, but problems remain. With ever-higher video resolutions and bit rates, applications that depend heavily on low-latency networks, and a host of VR and AR devices, data compression will remain an area of long-term research, and there is still a long way to go for deep learning research and its application in industrial scenarios.

Jiying Wu holds a PhD in engineering from Beijing Jiaotong University. He worked as a research assistant at the Chinese University of Hong Kong and at the Hong Kong University of Science and Technology, and is now engaged in research on new information technologies in the field of e-government. His main research interests are pattern recognition and computer vision; he is passionate about research and hopes to keep learning and making progress.