An overview of data compression

When I first heard about data compression, I assumed it must rely on some enigmatic algorithm. Having recently looked into it, I found that at least the basics are quite understandable.

Lossless compression

As we all know, whether a file is an EXE, a Word document, a TXT, a DMG, or anything else, it is stored as binary. So when discussing compression, we can ignore the file format and treat every file as just a string of numbers.

Plan 1

Here we go. A string of twenty 1s: 11111111111111111111.

If you had to dictate this string to someone, how would you say it? "One, one, one, one…"? Even if you did, the listener would struggle to follow. But say it like this and it's a different story: "twenty ones". Everyone gets it instantly.

Think this trick is too simple to be useful? Consider a black-and-white photo: every pixel is either black or white, so there are bound to be long runs of identical content, and encoding the runs naturally makes the file much smaller. Of course, there is a limitation: the repeated content must be contiguous.

This is one way to compress: encoding runs of repeated data (run-length encoding).
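The "twenty ones" trick can be sketched in a few lines of Python. This is a toy illustration only; real run-length encoders work on bytes and pack counts into a compact binary form:

```python
def rle_encode(s):
    """Encode a string as (count, character) pairs, e.g. '1111' -> [(4, '1')]."""
    runs = []
    for ch in s:
        if runs and runs[-1][1] == ch:
            runs[-1][0] += 1        # extend the current run
        else:
            runs.append([1, ch])    # start a new run
    return [(n, ch) for n, ch in runs]

def rle_decode(runs):
    """Expand (count, character) pairs back into the original string."""
    return "".join(ch * n for n, ch in runs)
```

For our string, `rle_encode("1" * 20)` returns `[(20, "1")]`, which is exactly "twenty ones" in data form, and `rle_decode` restores the original losslessly.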

Plan 2

Here is another string of numbers: 123456-78-123456-987-12345678

You can probably guess what to do from the dashes I put in this string: some segments repeat segments that came earlier, i.e. repeated data. So instead of storing them again, you can copy them. The string is processed like this: 123456 78 (go back 8, copy 6) 987 (go back 17, copy 8). Of course, a real compressed string doesn't contain this pile of words; it uses symbolic codes. Let's write R for "return/go back" and C for "copy". The string now looks like this: 123456-78-R8C6-987-R17C8

Here is a very interesting detail. Recall the twenty 1s from Plan 1: they can be written as 1-R1C19, go back 1 and copy 19. There is only one digit in front, but copying happens one digit at a time, so each digit you copy immediately extends the data and becomes a source for the next copy, and so on. How about that? Interesting.

This is another idea for compression, copying data from earlier in the stream (essentially the idea behind the LZ77 family of algorithms).
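The R/C idea above, including the overlapping-copy trick, can be sketched as a tiny decoder. This is a toy model; real LZ77-family formats encode literals and back-references as compact bit fields, not Python tuples:

```python
def decode(tokens):
    """Decode a list of tokens: a literal string, or ('R', distance, length),
    meaning 'go back `distance` characters and copy `length` of them'."""
    out = []
    for t in tokens:
        if isinstance(t, str):
            out.extend(t)           # literal data, keep as-is
        else:
            _, dist, length = t
            # Copy one character at a time, so the copy may overlap the
            # region currently being written (dist < length is allowed).
            for _ in range(length):
                out.append(out[-dist])
    return "".join(out)
```

With this, `decode(["12345678", ("R", 8, 6), "987", ("R", 17, 8)])` restores the full string from Plan 2, and `decode(["1", ("R", 1, 19)])` produces twenty 1s from a single literal digit, exactly the overlapping trick described above.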

Plan 3

For convenience, let me switch to letters here.

Look at this string: aaaaaaaaaaaaabc.

Each letter needs a code to be stored. In ASCII (decimal): a is 97, b is 98, c is 99. Each code is two digits, so this 15-character string needs 15*2 = 30 digits. As you have probably noticed, the letter a appears many times; if a could be represented by a single digit, the overall length would be much smaller. So why not just write a as 9?

You can't, not without a special marker: the computer couldn't tell whether 98 means the single-digit codes 9 and 8 or the two-digit code 98. So you need a rule that makes codes self-delimiting, for example: codes starting with 7 are 1 digit long, codes starting with 2 are 3 digits long, and so on. The key property is that no code is a prefix of another, so the decoder always knows where each code ends.

Does this look familiar? There is nothing wrong here; it is the famous Huffman coding. Shorter codes represent more frequently used content, at the expense of longer codes for less frequently used content. I won't say much more about Huffman coding here; it deserves a post of its own.

Think Huffman coding can only compress text documents? It only needs a string of symbols, and after all, even text documents are written to disk as binary strings. You can treat each fixed-size block of binary (say, 2 bytes) as a "letter" like the ones above and feed it directly to Huffman coding.
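The core of Huffman coding fits in a short sketch using Python's heapq. This simplified version returns only the code table; a real implementation would also serialize the table and pack the bits:

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Build a prefix-free code: frequent symbols get shorter bit strings."""
    freq = Counter(data)
    # Heap items: (frequency, tiebreaker, {symbol: code-so-far}).
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    if count == 1:  # degenerate case: only one distinct symbol
        (_, _, codes), = heap
        return {sym: "0" for sym in codes}
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)   # the two rarest subtrees...
        n2, _, c2 = heapq.heappop(heap)
        # ...are merged, prefixing their codes with 0 and 1 respectively.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        count += 1
        heapq.heappush(heap, (n1 + n2, count, merged))
    return heap[0][2]

codes = huffman_codes("aaaaaaaaaaaaabc")
```

For the string above, a (13 occurrences) gets a 1-bit code while b and c get 2-bit codes, so the whole string encodes in 13*1 + 2 + 2 = 17 bits instead of 15*8 = 120.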

ZIP format

The zip file is a common compression format in daily use. It is the result of applying Plans 2 and 3 above (a combination known as DEFLATE). The compression steps are as follows:

  1. Run the file through Plan 2 to remove most of the repetition.
  2. Run the result of step 1 through Huffman coding, using shorter codes for higher-frequency content. This step also has to generate and record a code table, since each file's high-frequency content is different.
  3. Save the output of step 2 to the zip file, along with the code table (it is needed for decompression). Done.

It is said that step 2 includes an optimization: the file is split into blocks and each block is encoded with its own code table, so that different kinds of content within a file can each be compressed efficiently.
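The steps above are exactly what Python's standard zlib module performs, since zlib implements DEFLATE, the Plan 2 + Plan 3 combination used inside zip files:

```python
import zlib

# Highly repetitive input, the best case for Plan 2's back-references.
data = b"1234567812345698712345678" * 100

compressed = zlib.compress(data, level=9)
restored = zlib.decompress(compressed)

assert restored == data             # lossless: the original comes back exactly
assert len(compressed) < len(data)  # repetitive input shrinks dramatically
```

On this input the compressed form is a tiny fraction of the original, precisely because the same bytes keep recurring.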

Other uses

Of course, compression is not just for zip files. In many network transfers, files are compressed before sending to reduce the size of the transmitted payload.

Lossy compression

With the lossless compression above, decompressing the compressed file completely restores the original. That is great, but lossy compression produces much smaller files, at the cost of the original being unrecoverable. Don't think that makes it useless. Remember that when watching videos online, you choose standard definition, high definition, ultra definition and so on according to your current network speed; the lower the definition, the less traffic it consumes.

Lossy compression requires different processing for different kinds of files. I haven't studied it in depth, so I won't claim much; here is just a quick example.

Image compression

For example, a 1080 by 1080 image needs to record 1080 by 1080 pixels. Now, if you keep one column of pixels and discard the next, that is, drop one column out of every two, the overall size is halved. Of course, the sharpness of the picture decreases correspondingly.
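The column-dropping idea can be sketched in a line of Python. This is a toy model, with a nested list standing in for the pixel grid; real image formats use far more sophisticated transforms:

```python
def drop_alternate_columns(image):
    """Keep every other column of a 2D pixel grid: a crude lossy 'compression'
    that halves the width and cannot be undone exactly."""
    return [row[::2] for row in image]

img = [[1, 2, 3, 4],
       [5, 6, 7, 8]]
# drop_alternate_columns(img) keeps columns 0 and 2: [[1, 3], [5, 7]]
```

The discarded columns are gone for good; a viewer can only approximate them, e.g. by interpolating the neighbours, which is why this is lossy.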

Of course, this approach is a bit crude; JPEG compression is said to do much better, but I haven't looked into it yet. The same goes for video, audio and the rest, I imagine.

Conclusion

In lossless compression, the basic idea is to reduce redundant data: whether copying repeated data by back-reference or Huffman coding, both revolve around this idea.

Seeing compression codes reminds me of the error-correcting codes I read about before. What do error-correcting codes do? They add content to the original data, correcting errors through redundancy. And compression? It transforms the data to remove redundancy and shrink it. Interesting: like two sides of the same coin, neither is right or wrong, it just depends on how you use it.