preface

GZip is often used in HTTP-based network transport and can also be enabled in Nginx with half a line of code. How does GZip compression work? This article is a brief summary of some documents I read online.

From RFC 1952

RFC 1952 is GZIP File Format Specification Version 4.3. This specification mainly defines the data format specification of GZip compression to facilitate file transfer and exchange between different operating systems, cpus, and file systems. Let’s pick out a few interesting points, and if you’re interested, you can read the original text of RFC 1952.

The format of GZIP file is designed to allow a file to have multiple compressed data sets (compressed data sets) — the fragments of GZIP compressed together. But for most of our application scenarios, it’s basically a compressed data set per file, and if multiple files are packaged together, it’s often a tar file.

Each compressed dataset has the following structure:

CM | | ID1 | ID2 | FLG | MTIME (4 bytes) | XFL | | the OS – > more

| and | is 1 byte, it is Big endian (Big Edian)

  • ID1 and ID2 are 0x1f and 0x8b respectively, which indicate that the file format is GZIP

  • CM identity encryption algorithm, currently 0-7 is reserved word, 8 refers to deflate algorithm

  • FLG addresses from low address to high address are FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT, reserved, reserved, reserved. If you are interested in what significance each bit is set, please refer to RFC 1952 for details. More interesting is FEXTRA, which, if set, indicates the presence of additional extended fields. The structure of the extended field is as follows:

    • | SI1 | SI2 | LEN | … LEN bytes of subfield data … |
    • SI1 and SI2 are the IDS of the subdomain, consisting of ASCII codes. If you need to use it, ask his maintainer jean-Loup Gailly<[email protected]>Apply by email. Apollo File now has its own ID
  • MTIME refers to the time when the source file was last modified and stores Unix timestamps

  • XFL is a set of parameters passed to the compression algorithm to identify how to decompress. In the Defalte algorithm, 2 indicates the algorithm with the highest compression rate, and 4 indicates the algorithm with the fastest compression speed

  • The OS identifies the file system on which the compressor runs to handle problems such as EOF

  • The value of more depends on whether FLG is enabled. It may contain cyclic redundancy check code, source file length, additional information, and other information

Deflate compression core

At the heart of GZIP is Deflate, which was standardized in RFC 1951 and became very widely used at the time as an alternative to LZW.

Deflate is an algorithm that uses both LZ77 and Huffman Coding. Here is a brief introduction to the general ideas of the two algorithms:

LZ77

The core idea of LZ77 is that if there are two duplicate strings in a string, all you need to know is the contents of the first string and the distance between the following strings and the starting position of the first string + the length of the string.

For example, ABCDEFGABCDEFH → ABCDEFG(7,6)H 7 refers to the beginning of the seventh digit, and 6 refers to the length of the repeated string. ABCDEFG(7,6)H can represent the preceding string without ambiguity.

LZ77 implements this algorithm using sliding-window compression. The idea is that the scan head scans the string from the head of the string, and there is a sliding window of length N in front of the scan head. If it is found that the string on the top of the scanning and window of the longest matching string is the same, with (the distance between two strings, the length of the string) instead of a bunch of repeat after, at the same time also need to add a bunch said is true or the replacement of “string of bytes for decompression in front (the string need in real string and replace has existed before the” string “).

In the actual process, the size of the sliding window is fixed, and the matching string has a minimum length limit, so as to facilitate the identification that + the distance between two strings + the length of the string takes up a fixed number of bytes and does not reduce the compression volume. For more detailed implementation, please refer to Standford Edu. lz77 algorithm, LZ77 Compression algorithm and LZ77 Compression algorithm.

With this compression mechanism, it is easy to explain why CSS BEM GZIP compression can ignore the length and JPEG images may be larger after GZIP

Decompression: GZIP compression is relatively inefficient because of the need to find repeated strings in the window (LZ77 is still improved by a series of methods such as hashing). What about decompression? If you look at the compressed whole string, each small string is preceded by a mark to mark whether it is the original string or the replacement string. By this mark, you can read and replace the replacement string directly with O (1) complexity, which is very efficient overall.

Huffman Coding

Huffman Coding is an algorithm that is commonly mentioned in college textbooks. The core idea is to re-encode characters by constructing Huffman Tree (the core is to avoid the path of one leaf being the prefix of another leaf path), so as to ensure that characters with higher frequency path occupy fewer bytes. The structure of Huffman Tree is not detailed here, but can be referred to as Huffman Coding.

Decompression: After Huffman Coding, you need to maintain a Huffman Map table to record the encoded strings. According to this table, restoring the original strings is also very efficient.


Deflate uses a combination of LZ77 and Huffman Coding to compress files, which is a relative improvement. For details, see gZIP principles and implementation

Use in websites

GZIP has become one of the three standard HTTP compression formats specified in RFC 2016. The vast majority of websites use GZIP to transfer HTML, CSS, JavaScript and other resource files.

Nginx open

The ngx_http_gzip_module of Nginx also provides a way to enable GZIP compression, with the following common configurations:

# open
gzip on;

# Compression level, 1-9. Set how many you can refer to: http://serverfault.com/questions/253074/what-is-the-best-nginx-compression-gzip-level
gzip_comp_level 2;

# "MSIE [1-6]\." for example, disable GZIP from IE6
gzip_disable regex ...

Minimum compressed file length
gzip_min_length 20;

# Minimal HTTP version with GZIP compressionGzip_http_version 1.1;# compressed file type, Value is [the MIME type] (https://developer.mozilla.org/zh-CN/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Complete_list_of_MIME_types)
gzip_types text/html;
Copy the code

The related detection

Once GZIP is enabled on Nginx, compression is theoretically turned on in accordance with the GZIP configuration. So how do you check if it works?

Open your browser, visit your website, and look at Chrome Network. If there are two different sizes in Size (e.g. 222KB and 613KB), GZIP has been successfully enabled.

How does the browser work with the server?

The accept-encoding: gzip argument is added to the header when the browser requests resources. After receiving the Header, Nginx finds that it sends the file after GZIP if it has this configuration (the returned Header also contains instructions), and sends the source file if it doesn’t. The browser looks at the Response header to see if you want to extract the returned file and display it.

Reference documentation

  • RFC 1952 – GZIP file format specification version 4.3
  • RFC 1951-Deflate Compressed Data Format Specification Version 1.3
  • DEFLATE – Wikipedia, the free encyclopedia
  • LZW – Wikipedia, the free encyclopedia
  • LZ77 – Wikipedia, the free encyclopedia
  • Standford Edu. lz77 algorithm
  • LZ77 Compression Algorithm
  • LZ77 compression algorithm coding principle details (combined with pictures and simple code)
  • Huffman Code – Wikipedia, the free encyclopedia
  • Gzip principle and implementation
  • Module ngx_http_gzip_module
  • What is the best nginx compression gzip level?
  • MDN – A complete list of MIME types
  • Do you really know gzip?

Original link: github.com/rccoder/blo…