Background

Generally speaking, a service release platform builds and deploys a service (image-based deployment excepted) through the steps of build (synchronize code -> compile -> package -> upload) and deploy (download package -> extract to target machine -> restart the service). Taking Meituan's internal release platform Plus as an example, we recently found that some releases spent a long time packaging and compressing the build artifacts. The pack step shown below took a total of 1 minute and 23 seconds.

While solving operations problems for users, we also found that many users habitually put large machine learning or NLP related data into the repository. This data often occupies hundreds of megabytes or even several gigabytes of disk space, which greatly slows packaging. The same is true for Java projects: because of the numerous Java service frameworks and dependencies, these services usually take up hundreds of megabytes and more than ten seconds to package. The figure below shows the median of our pack step; most Java and Node.js services take at least 13s to pack.

Pack is a necessary step for almost every service that needs to be deployed, and its time consumption is second only to compilation and image building, so to improve overall build efficiency we decided to do a round of optimization on the pack (packaging and compression) step.

Scheme comparison

Preparing Scenario Data

Package size analysis of released projects

To simulate the build-and-deploy application scenario as closely as possible, we sorted and analyzed part of the build package data from 2020. The compressed package sizes are shown in the figure below. The bell-shaped curve indicates that package sizes are roughly normally distributed, with an obvious long tail. Most packages are within 200 MB after compression and roughly within 516.0 MB before compression.

99% of services produce compressed packages smaller than 1 GB, and for the compression step, the larger the project, the longer it takes and the more room there is for optimization. Therefore, we chose a build package of about 1 GB for the targeted comparison tests, which both covers 99% of scenarios and makes improvements between compression algorithms clearly visible.

The main reasons for this choice are as follows:

  1. With large data, measurement error is much smaller relative to the result than with small data.
  2. It covers most application scenarios.
  3. The contrast is clear, making it easy to see whether there is a significant improvement.

Note: with the same compression library, compression ratio, and other configurations, compression speed did not change significantly, so we did not run batch tests or summarize data for other package sizes.

The test project used in this paper is a large C++ project within Meituan. The file types include C++, Python, and Shell code files, as well as binary data such as NLP models and tools (not including committed data stored in .git); the data types are relatively comprehensive.

The directory size is 1.2 GB, which also provides a clear contrast between the different solutions.

gzip

Gzip is based on the DEFLATE algorithm, a combination of LZ77 and Huffman coding. DEFLATE was intended to replace LZW and other patented data compression algorithms that, at the time, limited the usability of compression and other popular archivers (Wikipedia).

We usually package and compress with the tar -czf command, where the z option selects gzip compression. The DEFLATE standard (RFC 1951) is a widely used lossless data compression standard. Its compressed data format consists of a series of blocks corresponding to blocks of input data, each compressed with LZ77 (a dictionary-based scheme that reduces data size by finding repeated strings and replacing them with back-references) and Huffman coding (which assigns the shortest codes to the most frequent symbols).
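As a baseline, the two commands below are equivalent ways to produce a gzip-compressed tarball (the directory name is illustrative):

```bash
# Package and compress in one step; -z selects gzip
tar -czf app.tar.gz ./app

# Equivalent explicit pipeline: tar writes to stdout, gzip compresses it
tar -cf - ./app | gzip -6 > app.tar.gz
```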

File format

  • A 10-byte header containing the magic number (1F 8B), the compression method (08 for DEFLATE), 1-byte header flags, a 4-byte timestamp, compression flags, and the operating system ID.
  • Optional extra headers, including the original file name, a comment field, an "extra" field, and the lower half of a CRC-32 checksum for the header.
  • The DEFLATE-compressed body.
  • An 8-byte footer containing a CRC-32 checksum and the length of the raw uncompressed data.

We can see that gzip is primarily built on CRC-32 and the DEFLATE algorithm (LZ77 plus Huffman coding), which are exactly the optimization targets of the ISA-L library discussed later.
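A quick way to see the header fields described above for yourself (a minimal sketch; any input will do):

```bash
# The first bytes show the magic number (1f 8b) and the method byte (08 = DEFLATE)
printf 'hello world' | gzip | hexdump -C | head -n 2
```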

Brotli

Alakuijala and Szabadka completed the Brotli specification between 2013 and 2016. This data format is designed to further improve compression ratio and is widely used to optimize website speed. Formal validation of the Brotli specification was independently implemented by Mark Adler. Brotli is a data format specification for compressed data streams that combines the generic LZ77 lossless compression algorithm, Huffman coding, and 2nd-order context modeling. You can refer to the paper in the references to see how it works.
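For reference, a typical command-line invocation looks like this (a sketch; the quality level is illustrative, and brotli supports -q 0 through 11):

```bash
# Compress the tar stream with Brotli at a medium quality level
tar -cf - ./app | brotli -q 5 > app.tar.br

# Decompress back into the current directory
brotli -d -c app.tar.br | tar -xf -
```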

Context modeling can achieve better compression ratios thanks to the statistical nature of language, but is difficult to popularize because of its performance cost. Currently there are two popular breakthrough algorithm families:

  • ANS: Zstd, LZFSE
  • Context modeling: Brotli, bzip2

Specific test data are shown below.

Zstd

Zstd, short for Zstandard, is a fast compression algorithm that provides high compression ratios. It is implemented mainly in C and was released by Yann Collet of Facebook in 2016. Zstd adopts the Finite State Entropy (FSE) encoder, a new entropy coder developed by Jarek Duda based on ANS theory, which offers a very strong compression-speed/compression-ratio trade-off (it really does let you have your cake and eat it too). At its maximum compression level, Zstd achieves ratios close to LZMA, LZHAM, and PPMX, and performs better than LZA or bzip2. Zstandard reaches the Pareto frontier (the ideal state of optimal resource allocation), as it decompresses faster than any other currently available algorithm with a similar or better compression ratio.

For small data, it also provides a way to optimize speed by loading a preset dictionary, which can be generated by training on the target data.
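A minimal sketch of that dictionary workflow (the paths and file names are illustrative):

```bash
# Train a dictionary on a corpus of small, similar sample files
zstd --train samples/* -o app.dict

# Compress and decompress small payloads with the trained dictionary
zstd -D app.dict payload.json -o payload.json.zst
zstd -D app.dict -d payload.json.zst -o restored.json
```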

Official benchmark data comparison

The compression level can be specified with --fast, which provides faster compression and decompression at some loss of compression ratio compared to level 1, as shown in the table above. Zstd can also trade compression speed for a stronger compression ratio: it is configurable in small increments, and decompression speed remains constant across all settings, a feature shared by most LZ compression algorithms such as zlib or LZMA.
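These trade-offs map directly to command-line flags (the levels shown are illustrative; -T0 uses all cores on zstd builds with multithreading support):

```bash
zstd --fast=3 -f big.tar -o big.tar.zst   # fastest, lowest ratio
zstd -19 -f big.tar -o big.tar.zst        # near-maximum ratio, much slower
zstd -T0 -3 -f big.tar -o big.tar.zst     # default-ish level on all cores
```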

  • Testing with Zstd's default parameters, the compression time of 8.471s is only 11.266% of the original, an improvement of 88.733%.
  • The decompression time of 3.211s is only 29.83% of the original, an improvement of about 70.169%.
  • The compression ratio also increased from 2.548 to 2.621.

LZ4

LZ4 is a lossless compression algorithm that provides compression speeds above 500 MB/s per core (above 0.15 bytes/cycle). Its defining feature is an extremely fast decoder, running at multiple GB/s per core (about 1 byte/cycle).

LZ4 focuses on compression and decompression speed, especially decompression speed; compression ratio is not its strong point. It supports compression levels from 1 to 9, which we tested separately.

With default parameters LZ4 compresses very fast, much faster than Zstd, but the compression ratio is not high: the output is 206 MB larger than Zstd's, a full 46% more, which means more data transfer time and disk space. Even its maximum compression ratio is not high, rising only from 1.79 to 2.11, while the elapsed time grew from 5s to 51s. By comparison, LZ4 is indeed not the best scheme in terms of compression ratio, and its time advantage basically disappears at ratios in the 2.x range. Moreover, LZ4 does not currently have official multi-core parallel compression support, so in the subsequent comparisons it loses its compression and decompression speed advantage.
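The commands behind these numbers look roughly like this (a sketch; -1 is the fastest level and -9 the highest ratio in the tested range):

```bash
tar -cf - ./app | lz4 -1 > app.tar.lz4   # fastest, lower ratio
tar -cf - ./app | lz4 -9 > app.tar.lz4   # highest ratio, much slower
lz4 -d -c app.tar.lz4 | tar -xf -        # decompress to stdout and extract
```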

Pigz

Pigz's author Mark Adler is also a co-author of Info-ZIP's zip and unzip, GNU gzip, and the zlib compression library, and a participant in the development of the PNG image format.

Pigz stands for parallel implementation of gzip, and the main idea is to exploit multiple processors and cores. It divides the input into 128 KB blocks and compresses each block in parallel; the individual check values for each block are also computed in parallel. It is implemented directly on top of the zlib and pthread libraries, is relatively readable, and, importantly, is gzip-compatible. Pigz uses one thread (the main thread) for decompression, but can create three more threads for reading, writing, and check calculation, so decompression can also be accelerated in some cases.
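Because pigz is gzip-compatible, it can simply replace gzip as tar's compression program (the thread count is illustrative):

```bash
# Compress with 16 threads; the output is a standard .tar.gz
tar --use-compress-program="pigz -p 16" -cf app.tar.gz ./app

# Any gzip-compatible tool can extract it
tar -xf app.tar.gz
```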

Some bloggers have tested Pigz's compression performance on home PC platforms such as the i7-4790K, where the speedup was modest, but the improvement was much more noticeable in the data we verified on real machines. In our test, its compression finished in only 1.768s of wall-clock time, fully exploiting the performance of our platform's physical machines; the user time (the sum of CPU time across cores) was 1m30.569s, almost the same as the previous single-threaded gzip run. The compression ratio of 2.5488 is almost identical to a normal tar -czf.

ISA-L

ISA-L is a set of open-source libraries that accelerate algorithm execution on IA architectures to address the computing needs of storage systems. ISA-L is licensed under BSD-3-Clause, so it can also be used commercially.

If you have used the Storage Performance Development Kit (SPDK) or the Data Plane Development Kit (DPDK), you have probably heard of ISA-L: the former uses its CRC part, the latter its compression optimizations. ISA-L uses efficient SIMD (Single Instruction, Multiple Data) and other specialized instructions to maximize the use of the CPU microarchitecture and speed up the calculation of storage algorithms. ISA-L's low-level functions are written in hand-written assembly and tuned in many details (ARM platforms are now often considered as well, and some of the instructions mentioned in this article are poorly supported or unsupported there).

ISA-L optimizes compression mainly through its CRC, DEFLATE, and Huffman-coding implementations, and official data indicates that ISA-L is 5 times faster than zlib level 1.

For example, the CRC implementations in many low-level open-source storage projects use the lookup-table method: a table of CRC values is stored or generated in the code, and values are looked up during the calculation. ISA-L's assembly code instead uses the carry-less multiplication instruction PCLMULQDQ, which multiplies two 64-bit numbers without carries, maximizing the throughput of Intel's PCLMULQDQ instruction to optimize CRC performance. Better yet, if the CPU supports AVX-512, further extensions such as VPCLMULQDQ (the 512-bit, EVEX-encoded version of PCLMULQDQ) can be used (see the appendix for how to check whether these are supported).
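On Linux, a quick check for these instruction sets looks like this (a sketch; the flag names match what the kernel exposes in /proc/cpuinfo):

```bash
# Look for carry-less multiply and AVX-512 foundation support
grep -o -E 'vpclmulqdq|pclmulqdq|avx512f' /proc/cpuinfo | sort -u
```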

Note: the snapshot is taken from crc32_ieee_by16_10.asm.

Usage

ISA-L supports compression optimization levels from 0 to 3, with 3 being the version with the highest compression ratio. Overall we used the highest ratio; although its compression ratio of 2.346 is slightly lower than gzip's, the impact is not significant. In v2.27, released in June 2019, ISA-L added a multithreading feature, so in the subsequent tests we enabled concurrent multithreading, which improved the result markedly.
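ISA-L ships a gzip-compatible command-line tool, igzip. A minimal sketch of driving it the same way as gzip (the thread count is illustrative, and since options vary across versions, check igzip --help on your build):

```bash
# Compress at the highest level (-3) with multiple threads,
# assuming gzip-like stdin/stdout behavior
tar -cf - ./app | igzip -3 -T 16 > app.tar.gz

# The output is gzip-compatible, so plain tar can extract it
tar -xf app.tar.gz
```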

Since our build machines run CentOS 7, we had to compile ISA-L's dependencies such as NASM ourselves, so installation and configuration are relatively involved. ISA-L supports a variety of optimization parameters, such as IGZIP_HIST_SIZE (which enlarges the sliding window during compression) and LONGER_HUFFTABLES (a larger Huffman-coding table); these compile-time parameters can also bring a large improvement to the library.
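The build itself follows the usual autotools flow (a sketch; as noted above, CentOS 7's packaged NASM is too old, so a recent NASM must be built from source first):

```bash
# Build and install ISA-L after installing a recent NASM
git clone https://github.com/intel/isa-l.git && cd isa-l
./autogen.sh && ./configure
make -j"$(nproc)" && make install
```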

Compression time

```
real    0m1.372s
user    0m5.512s
sys     0m1.791s
```

The tested result is quite striking: this is the fastest scheme in the comparison so far, saving 98.09% of the time.

Because it is gzip-compatible, the output can also be decompressed with the tar -xf command. In the subsequent decompression tests, however, we still used ISA-L's own decompression.

Pzstd

After the Pigz tests, we wondered whether an algorithm as good as Zstd could also support parallelism, and we were pleasantly surprised to find a treasure in the official repo.

Pzstd is a parallel version of Zstandard implemented in C++11 (Zstd itself has since added multithreading support), a tool similar to Pigz. It provides compression and decompression compatible with the Zstandard format and can utilize multiple CPU cores. It divides the input into equally sized blocks and compresses each block independently into a Zstandard frame; these frames are then concatenated to produce the final compressed output. Pzstd also supports parallel decompression of files: when decompressing files compressed with Zstandard, it performs IO in one thread and decompresses in another.
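Its usage mirrors pigz (a sketch; -p sets the thread count, and the stdin/stdout piping assumes the gzip-like behavior of the zstd tool family):

```bash
# Compress with 16 threads into concatenated Zstandard frames
tar -cf - ./app | pzstd -p 16 > app.tar.zst

# Parallel decompression back through tar
pzstd -d -c app.tar.zst | tar -xf -
```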

The figure below, taken from the official GitHub repository, compares compression and decompression speed with Pigz (machine configuration: Intel Core i7 @ 3.1 GHz, 4 threads); the results are even better than Pigz's. See the detailed comparison data below:

Pied Piper (Middle-out compression)

Middle-out compression was originally a concept from an American TV series, but real implementations of it now exist. The algorithm is currently applied in a small range of scenarios such as compressing time-series data. Due to the lack of mature theoretical support and comparison data, it has not been formally included in this comparison's benchmarks.

Note: Pied Piper is the fictional company and algorithm from the TV series Silicon Valley.

Compatibility

All the schemes compared in this research were tested uniformly on machines in the build cluster. Since the build cluster's machine configuration is highly uniform (including hardware platform and system version), compatibility behavior and test data are also highly consistent.

Selection

In fact, some official test machines are configured rather differently from the physical machines on the Meituan build platform, and the scenarios may differ as well. The CPUs, platform architectures, and build environments behind the results quoted above also differ somewhat from ours, and most are consumer hardware such as the i7-4790K or i9-9900K. This is why we tested on the build platform's physical machines with a concrete build package compression scenario: to obtain the data that most closely matches our usage scenario.

Data comparison

The data for the schemes are compared in the table below (the time data in this paper are medians of multiple runs):

Comparison of compression time

As the time visualization of post-build package compression shows, the original gzip compression is quite time-consuming, while Pzstd is the fastest scheme, ISA-L slightly slower, and Pigz slightly slower still. All three reduce the time from 1m11s to about 1s; the fastest saves 98.51% of the compression time.

Comparison of decompression time

Decompression times differ much less than compression times. For GB-scale projects with compression ratios in the 2.5-2.6 range, every scheme finishes within 11s. To maximize compatibility with existing implementations and maintain stability, the decompression side gives priority to gzip-compatible schemes, which are the least intrusive to deployment target machines; that is, schemes whose output can be extracted with tar -xf take precedence.

In the later rollout of the scheme, we therefore did not modify the deployment machines' environment for the time being, because deployment requires a stable and reliable environment.

The following time comparisons are made using their respective decompression schemes:

  • Pzstd decompresses fastest, saving 86.241% of the time compared with gzip.
  • The Zstd algorithm has the second-highest decompression efficiency, saving about 70.169% of the time.
  • ISA-L saves 61.9658% of the time.
  • Pigz saves 43.357% of the time.
  • Brotli saves 29.02% of the time.

Comparison of compression ratios

In the compression-ratio comparison, Zstd and Pzstd have a clear advantage. Brotli and LZ4 could not keep up in speed at a comparable compression ratio given the parameters they support, so lower-ratio parameters were chosen for them, yet their efficiency still lags behind Pigz and Pzstd.

The ISA-L implementation sacrifices some compression ratio, but not by much.

Summary of pros and cons

At the beginning of this article we introduced the build-and-deploy process, so the approximate formula for the target time of our optimization is as follows:
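(The original figure is not reproduced here; a plausible reconstruction based on the build and deploy steps described in the background section:)

Optimization target time ≈ compression time + upload time + download time + decompression time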

In the test-case comparison, the ordering by total time is Pzstd < ISA-L < Pigz < LZ4 < Zstd < Brotli < gzip (earlier is better), with compression and decompression accounting for a large share of the overall time. The candidate strategies are therefore Pzstd, ISA-L, and Pigz.

There is, of course, also the disk-space cost: optimizing the compression ratio saves disk space but negatively impacts build time. Since the time dimension is our main optimization target and matters more to us than disk and upload-bandwidth costs, we adopted the common or default near-optimal compression ratios, i.e., in the range 2.3 to 2.6. In scenarios where the storage medium is expensive, such as in-memory databases, more factors may need to be weighed together; please keep that in mind.

Selection strategy

Compared to gzip, the new algorithms deliver the desired speed, but because their compression formats are not compatible with the existing gzip tooling and require support on the client (the deployment target machine), the rollout cycle could be long.

Meanwhile, for deployment the benefit may not be very noticeable, while maintenance and operations costs would increase, so we did not give the Pzstd solution high priority for the time being.

The selection strategy is mainly based on the following principles:

  • The overall time-consumption improvement should be the largest; this is the starting point of the whole optimization plan.
  • Ensure maximum compatibility. To reduce the cost of change for services connected to the build platform and for the platform itself, the solution must stay compatible (priority goes to the most compatible policy, i.e., gzip-compatible solutions).
  • Ensure the stability and reliability of the deployment target machines' environment by choosing the solution least intrusive to them, i.e., one that requires no client or library installation.
  • Test on real machines so the compression scenario fully matches the Meituan build platform's scenario, i.e., data compared on our existing physical machines against the target compression scenario.
  • In reality this question could be scored across many more dimensions, such as the disk cost of object storage, bandwidth cost, task duration, and even machine cost. To simplify the overall selection, however, we omitted some of these calculations and used the compression-ratio ranges each project officially recommends.

Based on the above points, we decided to adopt ISA-L acceleration in the first phase, which improves the build platform's efficiency in the most stable yet still very fast way; the Pzstd scheme may be implemented in the future. The data below are the results of the first phase.

Optimization effect

To present the results clearly, we filtered out the projects with long pack times (these long-running projects tend to hurt the user experience, and account for about 10% of the total). For these tasks the time saved ranged from 27s to 72s; in the past, the larger the project, the longer the compression took, but compression times are now stable below 10s, even when our machines are running multiple tasks at the same time.

We then aggregated the data of the Pack step (compression + upload) before and after the optimization and obtained the average improvement. The data are as follows:

  1. In one of our earlier build statistics, most build packages were around 100 MB after compression and around 250 MB before compression; with the gzip algorithm, compression indeed took around 10s.
  2. Because a build machine may run multiple tasks at the same time, actual compression takes slightly more time than in the tests.

On average, compression now saves 90% of the time.

Postscript

  • Since some schemes in this paper involve CPU instruction sets of specific platform environments and even the libraries' build environments, compilation parameters also affect the results, so we recommend keeping the cluster environment uniform when rolling out a scheme, or doing dedicated optimization for the cluster environment.
  • When packaging RPM files for CentOS, pay special attention to the build environment configuration; otherwise the effect may differ from the tests.
  • Java JAR and WAR packages can also be compressed; in this scenario the compression ratio does not improve much, but the speed still does.

About the author

Hongda, R&D engineer of Meituan Basic Technology Department.

Team introduction

Basic Technology Department - R&D Quality and Efficiency - Code Repository and Build team. The team aims to build code repository management, collaboration, and code build capabilities, to improve the multi-dimensional workflow execution engine and build toolchain, to integrate with the company's other R&D tools, and to improve the overall development and delivery efficiency of the business.

Appendix

Machine environment

The tests in this paper were carried out on the following physical machine, using the same target files. The test machine uses non-PCIe SSD disks.

The output of inxi -Fx below omits some data. Some of the parallel instruction sets may be used in the optimizations; the Flags field here lists AVX and AVX2 but not AVX-512 (the fuller /proc/cpuinfo listing below does include the avx512 flags).

```
System:  Host: ****** Kernel: ****** bits: 64 Compiler: gcc v: 4.8.5 Console: tty 7
         Distro: CentOS Linux release 7.1.1503 (Core)
CPU:     Info: 2x 16-Core Model: Intel Xeon Gold 5218 bits: 64 type: MT MCP SMP
         Arch: Cascade Lake rev: 7 L2 Cache: 44.0 MiB
         Flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
         Bogomips: 293978 Speed: 2300 MHz min/max: N/A
Info:    Processes: 764 Memory: 187.19 GiB Used: 15.05 GiB (8.0%)
         Init: systemd Runlevel: 3 Compilers: gcc: 4.8.5
```

The /proc/cpuinfo file also displays the flags supported by the CPU:

```
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke spec_ctrl intel_stibp flush_l1d arch_capabilities
```

References

  • engineering.fb.com/core-data/s…
  • engineering.fb.com/core-data/z…
  • peazip.github.io/fast-compre…
  • news.ycombinator.com/item?id=162…
  • github.com/facebook/zs…
  • fastcompression.blogspot.com/2013/12/fin…
  • zlib.net/pigz/
  • cran.r-project.org/web/package…
  • bugs.python.org/issue41566
  • 01.org/sites/defau…
  • www.intel.com/content/dam…

Recruitment information

The Meituan R&D Quality and Efficiency Department - R&D Platform team is committed to building an industry-leading continuous delivery platform. We are hiring engineers for build and artifact repository, code repository, service release, and pipeline, based in Beijing/Shanghai. Interested candidates are welcome to send a resume to [email protected] (please mark the email subject: Meituan R&D Quality and Efficiency Department - R&D Platform).
