This is the 8th day of my participation in the August Text Challenge.
Data compression algorithm
Common compression formats in the field of big data include Gzip, Snappy, LZO, LZ4, Bzip2, and ZSTD.
Why data compression?
To optimize storage (reduce storage space) and make full use of network bandwidth, compression is usually used. Big data requires processing massive amounts of data, so data compression is very important.
In many enterprise scenarios, source data arrives in a variety of text formats (CSV, TSV, XML, JSON, and so on). These files are human-readable, but take up a lot of storage space.
However, in big data processing, data should be as machine-readable as possible. Serializing and compressing this human-readable data into a machine-readable form significantly reduces the storage space required.
Here are some commonly used compression formats, known as codecs, which provide compression/serialization and decompression/deserialization:
Gzip (extension .gz)
GNU Zip (Gzip) is a well-known compression format that is widely used on the Internet. You can use it to compress requests and responses to make efficient use of the bandwidth of your website or web application.
Advantages:
High compression ratio; supported by Hadoop itself, so processing gzip files in an application is the same as processing plain text; Hadoop Native library is available.
Disadvantages:
Does not support splitting
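To see why a plain gzip file cannot be split, consider a reader that is handed only the second half of the file. The following is a minimal Python sketch using the standard `gzip` module; the helper name `can_decompress` and the sample log lines are made up for illustration.

```python
import gzip
import zlib


def can_decompress(chunk: bytes) -> bool:
    """Return True if `chunk` decompresses on its own as a gzip stream."""
    try:
        gzip.decompress(chunk)
        return True
    except (OSError, EOFError, zlib.error):
        # OSError also covers gzip.BadGzipFile (missing or invalid header)
        return False


# Hypothetical sample data: many similar log lines
lines = b"".join(b"2021-08-08 INFO request id=%d ok\n" % i for i in range(10_000))
compressed = gzip.compress(lines)

# The whole file decompresses; a slice starting in the middle does not,
# because it has no gzip header and DEFLATE refers back to earlier bytes.
whole_ok = can_decompress(compressed)
second_half_ok = can_decompress(compressed[len(compressed) // 2:])

print(whole_ok, second_half_ok)
```

A mapper that starts reading in the middle of such a file is in the same position as the second call, which is why a single gzip file ends up being processed by one mapper.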
What is Hadoop Native?
For performance reasons, and because Java implementations of some components are not available, Hadoop provides its own native implementations of certain components. These components are packaged in a single dynamically linked native library, which is called libhadoop.so on Unix-like platforms.
Snappy (extension .snappy)
Developed by Google (and formerly known as Zippy), this codec is considered to offer the best performance among formats with a medium compression ratio. For this format, speed matters more than compression ratio. Snappy is one of the most widely used formats, thanks to its excellent performance.
Advantages:
Fast compression speed; Hadoop Native library is supported
Disadvantages:
Does not support splitting; low compression ratio; not bundled with Hadoop itself, so it must be installed separately; no corresponding command-line tool ships with Linux
LZO (extension .lzo)
Licensed under the GNU General Public License (GPL) and very similar to Snappy, LZO has a medium compression ratio and high compression and decompression performance. It is a lossless data compression algorithm that focuses on decompression speed.
Advantages:
Relatively fast compression/decompression with a reasonable compression ratio; supports splitting and is the most popular compression format in Hadoop; Hadoop Native library support; the lzop command can be installed on Linux and is easy to use
Disadvantages:
Not bundled with Hadoop itself, so it must be installed separately. LZO supports splitting, but only if the LZO files are indexed; otherwise Hadoop treats them as ordinary files that cannot be split. (To support splitting, the InputFormat must also be set to the LZO format.)
LZ4 (extension .lz4)
Advantages:
Good performance, high compression ratio, fast initialization, and good compression speed and stability
Disadvantages:
- LZ4 is cumbersome to decompress: the caller must specify the size of the original byte array, which adds development effort
- Does not support splitting
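The size requirement is usually handled by storing the original length next to the compressed bytes. Below is a minimal sketch of that framing in Python. Since LZ4 is not in the standard library, `zlib` stands in for the codec here, and the 4-byte little-endian header as well as the names `pack` and `unpack` are assumptions for illustration, not part of LZ4 itself.

```python
import struct
import zlib

# Assumed layout: a 4-byte little-endian header holding the original length
HEADER = struct.Struct("<I")


def pack(raw: bytes) -> bytes:
    """Compress `raw` and prepend its original length."""
    return HEADER.pack(len(raw)) + zlib.compress(raw)


def unpack(framed: bytes) -> bytes:
    """Read the stored length, decompress, and verify the size."""
    (original_size,) = HEADER.unpack_from(framed, 0)
    # bufsize pre-sizes the output buffer, which is the role the original
    # size plays when decoding a raw block
    raw = zlib.decompress(framed[HEADER.size:], bufsize=max(original_size, 1))
    if len(raw) != original_size:
        raise ValueError("length header does not match decompressed size")
    return raw


payload = b"big data " * 1000
restored = unpack(pack(payload))
print(len(payload), len(pack(payload)), restored == payload)
```

With the length stored in the frame, the reader no longer needs to know the original size from some other source.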
Bzip2 (extension .bz2)
More or less similar to Gzip, but with a higher compression ratio. As expected, though, Bzip2's decompression is slower than Gzip's. One important property is that it supports splitting, which matters when using HDFS as storage. If the data is only stored and not queried, this format is a good choice.
Advantages:
Supports splitting; high compression ratio, higher than gzip; supported by Hadoop itself; the bzip2 command ships with Linux and is easy to use
Disadvantages:
Slow compression/decompression speed; no Hadoop Native library support
ZSTD (extension .zst)
ZSTD (Zstandard) is a lossless compression algorithm that Facebook open-sourced in 2016. Its strengths are compression ratio and compression/decompression speed. ZSTD also has a special feature: in training mode it can generate a dictionary file, which greatly improves the compression ratio for small data compared with traditional compression methods.
- For text compression scenarios with large amounts of data, ZSTD is the best choice considering compression ratio and compression performance, followed by LZ4.
- For small data compression scenarios, if ZSTD's dictionary mode can be used, the compression effect is even better.
- Is ZSTD splittable in Hadoop/Spark/etc.?
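The effect of a dictionary on small payloads can be illustrated with a short Python sketch. This is not ZSTD: it uses the preset-dictionary feature of the standard `zlib` module as a stand-in, and the dictionary is simply a hand-picked sample record rather than one produced by ZSTD's training mode. The record contents and helper names are hypothetical.

```python
import zlib

# Hypothetical small records that share most of their structure
sample = b'{"user_id": 1001, "event": "click", "page": "/home", "device": "mobile"}'
record = b'{"user_id": 2042, "event": "click", "page": "/cart", "device": "mobile"}'


def compress(data: bytes, zdict: bytes = b"") -> bytes:
    """Deflate `data`, optionally primed with a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()


def decompress(data: bytes, zdict: bytes = b"") -> bytes:
    """Inverse of compress(); the same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(data) + decomp.flush()


plain = compress(record)
with_dict = compress(record, zdict=sample)

# Sizes in bytes: original, compressed without and with the dictionary
print(len(record), len(plain), len(with_dict))
```

A single short record contains little repetition for the compressor to exploit on its own; the dictionary supplies the shared structure up front, at the cost of having to ship the same dictionary to every reader.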
Data compression specification
Evaluating compression methods
- Compression methods can be evaluated using the following three criteria:
- Compression ratio: the higher the compression ratio, the smaller the compressed file, so higher is better
- Compression time: the faster, the better
- Whether the compressed format can be split: a splittable format allows a single file to be processed by multiple mappers, allowing for better parallelization
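The first two criteria are easy to measure directly. The sketch below uses only codecs from the Python standard library (`zlib`, which implements the DEFLATE algorithm used by gzip, and `bz2`); the sample data and the helper name `evaluate` are assumptions for illustration, and the numbers it prints depend on the machine and the data. Splittability is a property of the format and cannot be measured this way.

```python
import bz2
import time
import zlib


def evaluate(name, compress, data):
    """Return (name, compressed size / original size, seconds taken)."""
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    return name, len(compressed) / len(data), elapsed


# Hypothetical sample: repetitive CSV-like text, a little over 1 MB
data = b"".join(b"%d,user_%d,click,/home\n" % (i, i % 100) for i in range(50_000))

results = [
    evaluate("zlib level 1 (fast)", lambda d: zlib.compress(d, 1), data),
    evaluate("zlib level 9 (small)", lambda d: zlib.compress(d, 9), data),
    evaluate("bz2 level 9", lambda d: bz2.compress(d, 9), data),
]

for name, ratio, seconds in results:
    print(f"{name:22s} {ratio:6.1%} of original size, {seconds:.3f} s")
```

Running it on a sample of your own data is a more reliable guide than any published table, since both ratio and speed vary with the content being compressed.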
Common compression formats
Comparison
Compression format | Compression ratio (compressed size / original size) | Compression speed | Decompression speed | Splittable |
---|---|---|---|---|
gzip | 13.4% | 21 MB/s | 118 MB/s | no |
bzip2 | 13.2% | 2.4 MB/s | 9.5 MB/s | yes |
lzo | 20.5% | 135 MB/s | 410 MB/s | yes (if indexed) |
snappy | 22.2% | 172 MB/s | 409 MB/s | no |
Hadoop codecs
Compression format | Corresponding codec class |
---|---|
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
Gzip | org.apache.hadoop.io.compress.GzipCodec |
BZip2 | org.apache.hadoop.io.compress.BZip2Codec |
LZO | com.hadoop.compress.lzo.LzopCodec |
Snappy | org.apache.hadoop.io.compress.SnappyCodec |
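Hadoop typically chooses a codec from a file's extension. As a rough illustration of that kind of lookup (a sketch, not Hadoop's actual implementation), the table above can be expressed as a Python mapping. The function name `codec_for` is hypothetical, and the `.deflate` extension for DefaultCodec is added here as it does not appear in the article.

```python
import os

# File extension -> Hadoop codec class name, following the table above
CODEC_BY_EXTENSION = {
    ".deflate": "org.apache.hadoop.io.compress.DefaultCodec",
    ".gz": "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2": "org.apache.hadoop.io.compress.BZip2Codec",
    ".lzo": "com.hadoop.compress.lzo.LzopCodec",
    ".snappy": "org.apache.hadoop.io.compress.SnappyCodec",
}


def codec_for(path: str):
    """Return the codec class name for `path`, or None if no codec matches."""
    _, ext = os.path.splitext(path)
    return CODEC_BY_EXTENSION.get(ext.lower())


print(codec_for("/data/logs/part-00000.gz"))
print(codec_for("/data/logs/part-00000.txt"))
```

Files whose extension is not in the mapping are treated as uncompressed in this sketch.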
Using data compression
Compressing intermediate data in Hive
-- Set to true to enable intermediate data compression; the default is false (disabled)
set hive.exec.compress.intermediate=true;
-- Set the compression codec for intermediate data
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
Compress the output of the Hive table
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;