From Stanford University

By Kedar Tatwawadi

Compiled by Heart of the Machine

Contributors: Li Zannan, Huang Xiaotian

Neural networks can not only analyze data, identify features, and make predictions, but also compress files. In a paper recently presented by researchers at Stanford University, the ability of recurrent neural networks to capture long-term dependencies is exploited for lossless compression. The technique, known as DeepZip, has been tested on text and genomic data files, and the researchers say the results are promising.

The ongoing big data revolution allows us to collect large amounts of data of many different types, such as images, text, and audio; new types of data, such as 3D VR data, point cloud data for autonomous driving, and different kinds of genomic data, occupy enormous amounts of storage. As a result, there is great demand for statistical models and efficient compression methods for a wide variety of data formats.

Lossless compression technology has seen many important developments over the last 50 years. In a classic study, the pioneer Claude Shannon showed that the entropy rate is the best compression ratio achievable for a given data source, and gave an (albeit impractical) way to achieve it. J. Rissanen proposed arithmetic coding, an efficient method for reaching the entropy bound when the source distribution is known. For sources with unknown distributions (such as text and DNA), he also designed adaptive variants of arithmetic coding, which compress by trying to learn the distribution of a conditional k-gram model. Because the complexity of this approach grows exponentially with k, the context is usually limited to about k = 20 symbols. This leads to a significant loss in compression ratio, because the model cannot capture long-term dependencies. We know that models based on recurrent neural networks (LSTM/GRU) are good at capturing long-term dependencies and can predict the next letter/word fairly accurately. Can an RNN-based framework therefore be used for compression? In this Stanford University study, the researchers explore combining RNN-based language models with arithmetic coding to improve lossless compression performance.
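To make the classical baseline concrete, below is a minimal sketch (not code from the paper) of an adaptive order-k context model of the kind an adaptive arithmetic coder relies on; the class name, the add-one smoothing, and the interface are illustrative assumptions. The count table indexed by k-symbol contexts is exactly what grows exponentially with k.

```python
from collections import defaultdict

class AdaptiveKgramModel:
    """Adaptive order-k model: estimate P(next symbol | last k symbols) from
    counts seen so far. Add-one smoothing keeps unseen symbols at nonzero
    probability so the arithmetic coder never assigns them zero mass."""

    def __init__(self, alphabet_size, k):
        self.alphabet_size = alphabet_size
        self.k = k
        # One count vector per context; up to alphabet_size**k contexts can appear.
        self.counts = defaultdict(lambda: [1] * alphabet_size)

    def predict(self, history):
        context = tuple(history[-self.k:])
        c = self.counts[context]
        total = sum(c)
        return [x / total for x in c]   # conditional distribution fed to the coder

    def update(self, history, symbol):
        self.counts[tuple(history[-self.k:])][symbol] += 1
```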

In the paper, the researchers first analyze the performance of RNNs combined with arithmetic coding on synthetic datasets whose entropy is known, in order to build intuition about the capabilities and limits of various RNN architectures. They also test pseudo-random number generator (PRNG) sequences, which have an entropy rate of zero (since they are deterministic) yet are extremely hard to compress with standard techniques. Building on the experience from the synthetic datasets, the researchers then apply the compression models to real text and genomic datasets.
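For intuition, the following sketch generates one kind of binary Markov-k sequence with a known entropy rate; the XOR-with-noise construction and the bias p = 0.2 are illustrative assumptions and need not match the exact synthetic sources used in the paper.

```python
import random

def markov_k_source(n, k=3, p=0.2, seed=0):
    """Binary Markov-k source: each new bit is the bit k steps back, flipped
    with probability p. Conditioned on the past, the per-symbol uncertainty is
    H(p) bits, so the entropy rate is known in closed form."""
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(k)]          # random initial context
    for _ in range(n - k):
        flip = 1 if rng.random() < p else 0
        bits.append(bits[-k] ^ flip)
    return bits
```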


2 Compressor framework


2.1 Overview

First is the compressor model used in the experiments, whose framework consists of two modules:

  1. RNN probability estimator module: for a data stream S_0, S_1, …, S_N, the RNN probability estimator estimates the conditional probability distribution of S_k based on the previously observed symbols S_0, S_1, …, S_{k-1}. This estimate P̂(S_k | S_0, S_1, …, S_{k-1}) is passed to the arithmetic coding module;
  2. Arithmetic encoder module: the arithmetic encoder module can be thought of as a finite-state machine (FSM) that receives the probability distribution estimate for the next symbol and encodes it into its state (the decoder performs the reverse operation); a sketch of how the two modules interact is given below.
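Here is a minimal sketch of the two-module interaction during compression; `estimator` and `coder` are hypothetical stand-ins for the RNN probability estimator and the arithmetic encoder, and their method names (`predict`, `encode`, `update`, `finish`) are made up for illustration.

```python
def compress(stream, estimator, coder):
    """One pass over the data: at step k the estimator turns the history
    S_0..S_{k-1} into an estimate of P(S_k | past), and the coder folds S_k
    into its state using that distribution."""
    history = []
    for symbol in stream:
        p_hat = estimator.predict(history)   # conditional distribution over the alphabet
        coder.encode(symbol, p_hat)          # update the coder's state/interval
        estimator.update(history, symbol)    # adaptive models learn as they go
        history.append(symbol)
    return coder.finish()                    # final state = compressed output
```

The `AdaptiveKgramModel` sketched earlier fits this `estimator` interface; DeepZip replaces it with an RNN.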


2.2 RNN probability estimator module

In fact, the RNN estimator module can be any recurrent neural network (LSTM/GRU) with a final softmax layer that produces the probability estimate. The arithmetic encoder module can be a classic arithmetic coding FSM or a faster Asymmetric Numeral Systems (ANS) module. There are some important constraints on how the model operates:

  1. Input causality: the RNN estimator must be causal; it may use as features only symbols that have already been encoded, and its estimate must be based solely on previously coded symbols (bidirectional models such as BiLSTM therefore may not be used; see the sketch below).
  2. Weight updates: weight updates, if performed, must be carried out identically in both the encoder and the decoder. This is necessary because both the encoder and the decoder need to produce the same distribution for each symbol.
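As an illustration of the causality constraint, here is a sketch of a causal character-level GRU estimator with a softmax output, written in PyTorch as an assumed framework (this is not the authors' exact model), using the 128 hidden units mentioned in the experiments.

```python
import torch
import torch.nn as nn

class CharGRUEstimator(nn.Module):
    """Maps the previous symbol plus the recurrent state to a distribution over
    the next symbol. A unidirectional GRU keeps the model causal; a BiLSTM/BiGRU
    would look at future symbols and could not be reproduced by the decoder."""

    def __init__(self, alphabet_size, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(alphabet_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, alphabet_size)

    def forward(self, prev_symbol, state=None):
        # prev_symbol: (batch, 1) integer tensor holding S_{k-1}
        x = self.embed(prev_symbol)
        h, state = self.gru(x, state)
        probs = torch.softmax(self.out(h), dim=-1)   # estimate of P(S_k | S_0..S_{k-1})
        return probs, state
```

Because the model consumes only S_{k-1} and its own recurrent state, the decoder can regenerate every distribution, provided any weight updates are applied identically on both sides.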

The researchers mainly explored two models: a character-level GRU model (DeepZip-ChGRU) and a feature-based model (DeepZip-Feat). In DeepZip-ChGRU, the input to the GRU module at step k is X_{k-1}, together with state_{k-1}, the recurrent state produced up to step k-1. In DeepZip-Feat, the input consists of causally computed features, such as the past 20 symbols and a record of the contexts observed so far in the stream. In addition, the researchers also considered an attention-based model (attention-RWA).
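For the feature-based variant, a hypothetical example of causally computed features is sketched below; the exact feature set (a 20-symbol window plus running symbol counts) is an assumption in the spirit of DeepZip-Feat, not the paper's definition.

```python
def causal_features(history, context_len=20, alphabet_size=2):
    """Features built only from symbols that are already encoded/decoded:
    the last `context_len` symbols (zero-padded at the start) plus running
    counts of each symbol observed so far in the stream."""
    pad = [0] * max(0, context_len - len(history))
    recent = pad + history[-context_len:]
    counts = [history.count(s) for s in range(alphabet_size)]
    return recent + counts
```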


2.3 Arithmetic encoder module

The arithmetic encoder maintains an interval within [0, 1]. Each symbol stream uniquely determines an interval, which can be computed sequentially from the probability estimates of the successive symbols. The interval acts as the state of the arithmetic encoder and is passed on to the next iteration; at the end, this interval is encoded, yielding the compressed data. Given the same probability estimates, the decoding operation reverses this process. The arithmetic coding operation is shown in Figure 2.

Figure 2: Arithmetic coding of the sequence (0, 2, 3) from an i.i.d. source with distribution (0.6, 0.2, 0.1, 0.1)
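The Figure 2 example can be reproduced with a few lines of interval arithmetic; this is a toy sketch that works in floating point and ignores the precision handling and renormalization a practical arithmetic coder needs.

```python
def arithmetic_encode(sequence, probs):
    """Shrink [0, 1) once per symbol; the final interval uniquely identifies
    the sequence (given its length). Float arithmetic is only safe for short
    toy inputs."""
    # Cumulative distribution: [cum[s], cum[s+1]) is symbol s's slice of [0, 1).
    cum = [0.0]
    for p in probs:
        cum.append(cum[-1] + p)

    low, high = 0.0, 1.0
    for s in sequence:
        width = high - low
        low, high = low + width * cum[s], low + width * cum[s + 1]
    return low, high   # any number in [low, high) encodes the sequence

# The i.i.d. source of Figure 2: symbols 0..3 with probabilities (0.6, 0.2, 0.1, 0.1).
low, high = arithmetic_encode([0, 2, 3], [0.6, 0.2, 0.1, 0.1])
print(low, high)       # roughly [0.534, 0.54)
```

The final interval has width 0.6 × 0.1 × 0.1 = 0.006, so about -log2(0.006) ≈ 7.4 bits suffice to pick a number inside it, matching the ideal code length for these probabilities.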


2.4 Encoder & decoder operation

The encoder and decoder operate as follows:

  1. The arithmetic encoder module starts with a custom probability distribution estimate for the first symbol S_0. Once this is done, the decoder can also decode the first symbol.
  2. Both the arithmetic encoder and the RNN estimator module pass state information from one iteration to the next; the final state of the arithmetic encoder serves as the compressed data.
  3. If the model is trained for more than one epoch, the weights of the RNN estimator module also need to be stored and counted toward the compressed size (MDL principle [14]).

Figure 3: Encoder model architecture
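The decoder mirrors the encoder loop: it runs the same estimator on the symbols it has already recovered, so it reproduces exactly the distributions the encoder used at each step. As before, `decoder` and `estimator` are hypothetical stand-ins with made-up method names.

```python
def decompress(decoder, length, estimator):
    """Recover `length` symbols by inverting the encoder's steps one by one.
    The estimator must behave identically to the one used at the encoder
    (same weights, same update rule), otherwise the distributions diverge."""
    history = []
    for _ in range(length):
        p_hat = estimator.predict(history)   # same estimate the encoder used at this step
        symbol = decoder.decode(p_hat)       # invert one interval/state update
        estimator.update(history, symbol)    # identical adaptive update on both sides
        history.append(symbol)
    return history
```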


The researchers then ran a series of experiments with different models on the datasets described above. The models are:

  1. DeepZip-ChRNN: a neural network model based on a character-level RNN.
  2. DeepZip-ChGRU: a neural network model based on a character-level GRU.
  3. DeepZip-Feat: a GRU-based model whose input contains features of all previously observed symbols, rather than just the previous input.


3 Experiments on synthetic datasets

Figure 5: Performance of the 128-unit DeepZip-ChRNN model on Markov-k sources


Figure 6: Performance of the 128-unit DeepZip-ChGRU model on Markov-k sources

Figure 7: Performance comparison of the 128-unit DeepZip models with GZIP [15] and adaptive arithmetic coding (CABAC)


Figure 8: Performance of the 128-unit DeepZip models on real datasets


Paper: DeepZip: Lossless Compression using Recurrent Networks

Paper address: web.stanford.edu/class/cs224…

Abstract: Today, the amount of data we generate is increasing dramatically. New types of data, such as genomic data [1], 3D/360-degree VR data, and point cloud data for autonomous driving, have emerged. A great deal of effort goes into analyzing the statistics of such data in order to design good compressors. We know from information theory that good compressors come from good predictors [2]. We also know that models based on recurrent neural networks (LSTM/GRU) are good at capturing long-term dependencies [3] and can predict the next character/word very well. Can RNNs therefore be used effectively for compression? We analyze the application of RNNs to data compression.

The compressor DeepZip consists of two main modules: an RNN-based probability estimator and an arithmetic coding module. First, we discuss the existing literature and the basic model architecture. We then present experimental results on synthetic as well as real text and genomic datasets. Finally, we discuss our findings and future work.