Data compression is a technique that reduces the amount of data, in order to save storage space and improve transmission, storage, and processing efficiency, without losing useful information; equivalently, it reorganizes data according to certain algorithms so as to reduce redundancy and storage space.

The paragraph above, taken from Baidu Baike, roughly describes what data compression is.

Concept

Essentially, “compressing” means finding the probability distribution of a file’s contents and replacing the parts that appear most often with shorter symbols. So the more repetitive a file is, the smaller it can be compressed; for example, “ABABABABABABAB” can be compressed to “7AB”. Correspondingly, if the content is non-repetitive, it is difficult to compress. At the extreme are uniformly distributed random strings, from which not a single character can be compressed away: an arbitrary sequence of 10 decimal digits (5271839406) is incompressible, and irrational numbers such as pi are also hard to compress. Compression is the process of eliminating redundancy, expressing the same content in a more concise form. As you can imagine, once a file has been compressed, the number of repeated strings in it is greatly reduced. A good compression algorithm minimizes redundancy so that no further compression is possible; therefore, compressing an already compressed file (recursive compression) usually does not make sense.
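As a quick illustration (not part of the original text), the sketch below uses Python’s standard zlib module (a DEFLATE-based lossless compressor) to show all three effects: repetitive data shrinks dramatically, uniformly random data barely shrinks at all, and compressing already compressed output gains nothing.

```python
# Repetitive data compresses well, random data does not, and a second
# "recursive" compression pass buys nothing.
import os
import zlib

repetitive = b"AB" * 50_000          # 100 000 bytes of pure repetition
random_data = os.urandom(100_000)    # 100 000 uniformly random bytes

for name, data in [("repetitive", repetitive), ("random", random_data)]:
    once = zlib.compress(data, level=9)
    twice = zlib.compress(once, level=9)   # compressing the compressed output
    print(name, len(data), len(once), len(twice))
    # repetitive: shrinks to a few hundred bytes; the second pass gains nothing
    # random: stays around 100 000 bytes (plus a small header) both times
```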

Classification

  • Lossy

Lossy compression means that when the compressed data is used for reconstruction, the reconstructed data differs from the original data, but the difference does not lead to misunderstanding of the original information. Lossy compression is used where the reconstructed signal does not have to be identical to the original signal. It can be divided into feature extraction and quantization: feature extraction includes methods such as pattern-based coding and fractal coding, while quantization includes zero-memory quantization, predictive coding, direct mapping, and transform coding (a small sketch contrasting lossy and lossless follows this list).

  • Lossless

Lossless compression means that the compressed data can be used to reconstruct (restore, decompress) data that is identical to the original. Lossless compression is used where the reconstructed signal is required to be identical to the original signal. Statistical coding techniques such as Huffman coding, arithmetic coding, and run-length coding are commonly used.
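The sketch below (my own illustration) contrasts the two kinds: a toy 4-bit quantizer stands in for the quantization family of lossy methods, while zlib stands in for a lossless statistical coder.

```python
# Lossy quantization loses information; a lossless round trip is exact.
import zlib

samples = [0, 13, 97, 128, 200, 255]             # pretend 8-bit signal values

# Lossy: quantize 8-bit values to 16 levels, then reconstruct.
quantized = [v // 16 for v in samples]           # keep only the top 4 bits
reconstructed = [q * 16 + 8 for q in quantized]  # mid-point of each bucket
print(reconstructed == samples)                  # False: information was lost

# Lossless: compress and decompress, the round trip is bit-for-bit identical.
payload = bytes(samples)
restored = list(zlib.decompress(zlib.compress(payload)))
print(restored == samples)                       # True
```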

Coding

Compression can be broken down into two steps:

  • The first step is to obtain the probability distribution of the file’s contents: which parts appear frequently and which parts appear rarely;
  • The second step is to encode the file, replacing the parts that appear repeatedly with shorter symbols.

There are many coding methods, such as RLE, Huffman, Rice, LZW, etc.
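As an illustration of one of these (my own sketch, not code from the article), here is a compact Huffman coder in Python. It carries out both steps above: count the symbol frequencies, then repeatedly merge the two rarest subtrees so that frequent symbols end up with shorter bit patterns.

```python
# A compact Huffman-coding sketch: frequent symbols get short prefix codes.
import heapq
from collections import Counter

def huffman_code(text: str) -> dict:
    freq = Counter(text)                               # step 1: distribution
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far})
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)                # two lightest subtrees
        w2, i, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, i, merged))     # step 2: grow the tree
    return heap[0][2]

text = "AAAAABBBC"                        # skewed distribution: A is frequent
code = huffman_code(text)
encoded = "".join(code[ch] for ch in text)
print(code)          # e.g. {'C': '00', 'B': '01', 'A': '1'}
print(len(encoded))  # 13 bits instead of the 72 bits of 8-bit ASCII
```

The frequent ‘A’ ends up with a one-bit code, while the rare ‘C’ needs two bits.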

Now for a short worked example of RLE (run-length encoding).

Original data (40 bits): 0000000000000001111111000000011111111111

The string contains 15 zeros, 7 ones, 7 zeros, and 11 ones, so we encode the run lengths as 15 (binary 1111), 7 (0111), 7 (0111), and 11 (1011). Since the string consists of alternating runs of 0s and 1s, we only need to encode the length of each run.

Encoded data: 1111011101111011 (16 bits)
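A minimal sketch of this scheme (my own illustration, not code from the article), assuming every run fits in a 4-bit field (length at most 15):

```python
# Run-length encoding of an alternating 0/1 bit-string, 4 bits per run length,
# as in the worked example above. Assumes every run is at most 15 bits long.
from itertools import groupby

def rle_encode(bits: str) -> str:
    runs = [len(list(g)) for _, g in groupby(bits)]    # lengths of the runs
    return "".join(format(n, "04b") for n in runs)     # each length in 4 bits

data = "0000000000000001111111000000011111111111"      # the 40-bit example
encoded = rle_encode(data)
print(encoded)        # 1111011101111011
print(len(encoded))   # 16 bits, down from 40
```

A decoder would additionally need to know which bit value the first run uses.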

Limit

The probability distribution is generally given, so now consider the coding: how do we find the shortest possible symbols to use as substitutes? If the file’s contents have only two possible cases (such as the result of a coin toss), a single binary bit is sufficient: 1 for heads and 0 for tails. If there are three possible cases (such as the result of a football match), at least two binary bits are required. If there are six possible cases (such as the result of a die roll), at least three binary bits are required.

In general, if a character (or string) appears in the file with probability p, then under a uniform distribution there are at most 1/p possible values at that position, so at most log2(1/p) binary bits are needed to represent the substitute symbol.
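A quick sanity check of these numbers (illustration only): with k equally likely outcomes, each outcome has probability p = 1/k and needs ceil(log2(1/p)) whole binary bits.

```python
# Coin toss, football match, die roll: bits needed for k equally likely cases.
import math

for name, k in [("coin toss", 2), ("match result", 3), ("die roll", 6)]:
    p = 1 / k
    bits = math.log2(1 / p)
    print(name, round(bits, 2), math.ceil(bits))
# coin toss 1.0 1
# match result 1.58 2
# die roll 2.58 3
```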

Expectation


$$\sum_n p_n \log_2(1/p_n) = E\big(\log_2(1/p)\big)$$

You can think of this as the average number of binary bits required per symbol.
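Written out as a small function (an illustration, not code from the article), this expectation is just the entropy in bits:

```python
# Average number of bits per symbol for a given probability distribution.
import math

def avg_bits_per_symbol(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(avg_bits_per_symbol([0.5, 0.5]))       # 1.0 (a fair coin toss)
print(avg_bits_per_symbol([0.5, 0.3, 0.2]))  # ~1.485 (file a in the example below)
```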

Example

Premise: suppose there are two files, a and b, each containing 1024 symbols; stored as ASCII they are the same length, 1 KB each.

Case 1: the contents of file a are 50% a, 30% b, and 20% c, so each symbol occupies 1.49 binary bits on average.


$$0.5\log_2(1/0.5) + 0.3\log_2(1/0.3) + 0.2\log_2(1/0.2) \approx 1.49$$

At 1.49 binary bits per symbol, compressing 1024 symbols theoretically requires at least about 1526 binary bits, or roughly 0.186 KB, meaning about 81% of the original volume has been squeezed out.

Case 2: the contents of file b are ten symbols, a through j, each appearing 10% of the time, so each symbol occupies 3.32 binary bits on average.


$$0.1\log_2(1/0.1) \times 10 \approx 3.32$$

At 3.32 binary bits per symbol, compressing 1024 symbols theoretically requires at least about 3400 binary bits, or roughly 0.415 KB, meaning only about 58% of the original volume has been squeezed out.
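Both cases can be checked with the same formula (a small sketch assuming the 1 KB files described above; the exact sums come out slightly below the rounded values 1.49 * 1024 ≈ 1526 and 3.32 * 1024 ≈ 3400 used in the text):

```python
# Bits per symbol, total bits for 1024 symbols, and compressed size in KB.
import math

def avg_bits(probs):
    return sum(p * math.log2(1 / p) for p in probs)

for name, probs in [("file a", [0.5, 0.3, 0.2]), ("file b", [0.1] * 10)]:
    bits = avg_bits(probs)
    total_bits = bits * 1024
    kb = total_bits / 8 / 1024
    print(name, round(bits, 2), round(total_bits), round(kb, 3))
# file a 1.49 1521 0.186
# file b 3.32 3402 0.415
```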

Conclusion

Comparing the two calculations above, you can see that the more scattered (random) the contents of a file are, the more bits each symbol requires. This value can therefore be used to measure the randomness (uncertainty) of the file’s contents, and it is called the information entropy.

Information entropy

Concept

Information entropy is a measure of the amount of information needed to eliminate uncertainty, that is, the amount of information an unknown event may contain.

Understanding information entropy

  • Information entropy reflects only the randomness of the content and has nothing to do with the content itself. Whatever a file contains, as long as it follows the same probability distribution, the same information entropy is obtained (see the small check after this list).
  • The higher the information entropy, the more bits are needed per symbol, and thus the more distinct symbols can be expressed. For this reason it is sometimes said that higher entropy means more information; but, because of the first point, this is easy to misread: higher entropy only means there are more possible symbols, not that you necessarily get more useful information out of them.
  • Information entropy has nothing to do with thermodynamic entropy. The two entropies are not the same thing: information entropy measures the disorder of information, while thermodynamic entropy measures the disorder of energy.
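A tiny check of the first point (illustration only): two strings with completely different symbols but the same distribution have exactly the same entropy.

```python
# Same distribution (2 of one symbol, 3 of another) -> same entropy,
# regardless of what the symbols actually are.
import math
from collections import Counter

def entropy(text):
    total = len(text)
    return sum((n / total) * math.log2(total / n) for n in Counter(text).values())

print(entropy("aabbb"))  # ~0.971 bits per symbol
print(entropy("xxyyy"))  # ~0.971 bits per symbol, identical
```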

The relationship between information entropy and compression

  • The higher the information entropy, the lower the data compression efficiency, i.e. the less the data can be compressed (demonstrated below).
  • The process of eliminating redundancy is itself an entropy-increasing process: the compressed output looks far more random than the input.
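A small demonstration of both points (illustration only; the skewed random source and the byte-level entropy estimate are my own choices, with zlib standing in for a real compressor):

```python
# A low-entropy (skewed) source compresses well, and the compressed output
# has near-maximal byte entropy, i.e. it looks almost random.
import math
import random
import zlib
from collections import Counter

def byte_entropy(data: bytes) -> float:
    total = len(data)
    return sum((n / total) * math.log2(total / n) for n in Counter(data).values())

random.seed(0)
original = bytes(random.choices(b"abc", weights=[8, 1, 1], k=100_000))
compressed = zlib.compress(original, level=9)

print(byte_entropy(original))       # roughly 0.9 bits per byte: very redundant
print(byte_entropy(compressed))     # close to 8 bits per byte: little redundancy left
print(len(original), len(compressed))  # the redundant source shrinks to a fraction
```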