Hive uses HDFS to store data and MapReduce to compute over it. Through Hadoop's InputFormat and OutputFormat, Hive can read files from different data sources and write files in different formats to the file system. Hive can also compress intermediate results or final output using the compression codecs configured in Hadoop.
1. What is compression, and what are its pros and cons?
Data compression and decompression in Hive are not the same as compressing files on Windows. There are many compression algorithms, and they produce different results and different file extensions. The advantage of data compression is that it minimizes the disk space files occupy and the network I/O overhead; text files in particular can often compress to around 40% of their original size. Bandwidth is a scarce resource in a cluster, so improving network transfer performance matters a great deal. The cost is that compressing and decompressing adds CPU overhead.
Therefore, whether to use data compression depends on the job type: compression benefits I/O-intensive jobs, while for CPU-intensive jobs it can degrade performance. In practice, the job type can only be judged by comparing actual measured execution results.
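For example, for an I/O-bound job you would typically switch compression on at the session level. A minimal sketch using standard Hive/MapReduce properties (the same settings exercised in the tests below):
-- compress intermediate data passed between MapReduce stages
set hive.exec.compress.intermediate=true;
-- compress the final output written to HDFS
set hive.exec.compress.output=true;
-- pick a codec for the final output (it must appear in io.compression.codecs)
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;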
2. Common compression algorithms in HIVE
Note: the compression algorithms available in Hive depend on the Hadoop version. Different versions ship different compression codecs, and higher versions generally support more of them. For example, the Hadoop 2.9 release our company currently uses supports a variety of compression methods. You configure the codecs in Hadoop's core-site.xml file, and Hive uses the same configuration. The following is the compression configuration in my cluster; in actual development, configure codecs according to your own requirements. If nothing is configured, compression is not used by default; for example, our cluster has no Snappy compression configured.
<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
You can run the following command to view the compression codecs configured for Hive. With the set command you can view the property values from all Hive and Hadoop configuration files in the Hive installation environment. Output compression is off by default in Hive; you can check it with set hive.exec.compress.output, as shown after the codec listing below.
hive (fdm_sor)> set io.compression.codecs;
io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec,
org.apache.hadoop.io.compress.BZip2Codec
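For example, checking the output-compression switch in the same session (a quick sketch; the default value is consistent with the tests below):
hive (fdm_sor)> set hive.exec.compress.output;
hive.exec.compress.output=false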
The query above returns the underlying Hadoop codec classes. Why are there several compression algorithms? Mainly because they differ in compression ratio, compression time, and whether the compressed file can be split, so in development you choose one according to the actual situation (see the reference table below).
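As a rough reference, the generally cited characteristics of these codecs are (indicative only, not measured on this cluster; splittability determines whether one large file can feed multiple map tasks):
Codec        Extension           Compression ratio   Speed    Splittable
Gzip         .gz                 high                medium   no
Deflate      .deflate            high                medium   no
LZO / Lzop   .lzo_deflate/.lzo   medium              fast     .lzo only, after indexing
Bzip2        .bz2                highest             slow     yes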
3. Performance analysis of compression algorithms in HIVE
The file tested here is 516.4 MB, and the block size in this Hadoop environment is set to 256 MB, so the data is stored as multiple blocks and the computation incurs I/O overhead. This lets us compare data transfer and computation time, compression ratio, and other factors under the different compression algorithms.
[robot ~] hadoop fs -du -h /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201901
516.4 M  /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201901/201901.txt
This file was loaded directly from Linux into HDFS; its actual size is 516.4 MB.
1. Hive computing and storage without compression
--1. Perform data storage calculation without compression algorithm.
set hive.exec.compress.output=false; -- The default is false
insert overwrite table t_fin_demo partition(staits_date ='201900')
select
name,
id_type,
idcard,
org,
loan_no,
busi_type,
busi_category,
open_date,
dure_date,
loan_amount,
happen_time,
amout,
due_amt,
stat
from t_fin_demo where staits_date ='201901';
--2. Run the du -h command to check the file storage status on HDFS
[finance@master2-dev software]$ hadoop fs -du -h /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201900
271.0 M /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201900/000000_0
271.0 M /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201900/000001_0
4.7 M /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201900/000002_0
--3. Program running time
Total MapReduce CPU Time Spent: 54 seconds 200 msec
Time taken: 36.445 seconds
Summary: As the data above shows, without compression the data is stored in plain text format with no suffix and can be viewed directly with -cat. The stored output totals 271 + 271 + 4.7 = 546.7 MB (versus the 516.4 MB source file), and the job took 36.445 seconds.
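For instance, a quick spot-check that the output really is plain text (path as above):
[finance@master2-dev software]$ hadoop fs -cat /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201900/000000_0 | head -n 5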
2. Use Hive's default compression; the file extension is .deflate
--1. Use Deflate for compression
set hive.exec.compress.output=true;
--true enables compression. It is disabled by default. If no codec is specified after enabling compression, deflate (DefaultCodec) is used by default.
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec;
insert overwrite table t_fin_demo partition(staits_date ='201903')
select
name,
id_type,
idcard,
org,
loan_no,
busi_type,
busi_category,
open_date,
dure_date,
loan_amount,
happen_time,
amout,
due_amt,
stat
from t_fin_demo where staits_date ='201901';
--2. View the data storage and computing status
[finance@master2-dev hadoop]$ hadoop fs -du -h /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201903
75.9 M /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201903/000000_0.deflate
75.9 M /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201903/000001_0.deflate
1.3 M /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201903/000002_0.deflate
--3. Program running time
Time taken: 54.659 seconds
Summary: With the default deflate algorithm, the stored files carry the .deflate suffix. The stored size is 75.9 + 75.9 + 1.3 = 153.1 MB, about 28% of the uncompressed 546.7 MB, and the job took 54.659 seconds. Deflate's compression ratio is very high, but the job takes longer than without compression.
3. Use gzip compression; the file extension is .gz
--1. Use Gzip for compressed storage
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
insert overwrite table t_fin_demo partition(staits_date ='201904')
select
name,
id_type,
idcard,
org,
loan_no,
busi_type,
busi_category,
open_date,
dure_date,
loan_amount,
happen_time,
amout,
due_amt,
stat
from t_fin_demo where staits_date ='201901';
--2. Run the du -h command to check the file storage status on HDFS
[finance@master2-dev hadoop]$ hadoop fs -du -h /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201904
75.9 M /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201904/000000_0.gz
75.9 M /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201904/000001_0.gz
1.3 M /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201904/000002_0.gz
--3. Program running time
Total MapReduce CPU Time Spent: 1 minutes 33 seconds 430 msec
OK
Time taken: 62.436 seconds
Summary: With the gzip algorithm, the stored file extension is .gz. The stored size is 75.9 + 75.9 + 1.3 = 153.1 MB, identical to deflate (gzip wraps the same deflate stream with an extra header and checksum), and the job took 62.436 seconds. The files can also be downloaded to a local Windows machine and read after decompression.
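Instead of downloading, you can also read compressed output in place: hadoop fs -text decompresses a file through the configured codec based on its extension. For example:
[finance@master2-dev hadoop]$ hadoop fs -text /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201904/000000_0.gz | head -n 5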
4. Use the LZO compression algorithm; the file extension is .lzo_deflate
--1. Use LZO for compressed storage
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec;
insert overwrite table t_fin_demo partition(staits_date ='201905')
select
name,
id_type,
idcard,
org,
loan_no,
busi_type,
busi_category,
open_date,
dure_date,
loan_amount,
happen_time,
amout,
due_amt,
stat
from t_fin_demo where staits_date ='201901';
--2. Run the du -h command to check the file storage status on HDFS
[finance@master2-dev hadoop]$ hadoop fs -du -h /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201905
121.9 M /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201905/000000_0.lzo_deflate
121.9 M /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201905/000001_0.lzo_deflate
2.1 M /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201905/000002_0.lzo_deflate
--3. Program running time
Total MapReduce CPU Time Spent: 58 seconds 700 msec
OK
Time taken: 42.45 seconds
Summary: With the LZO algorithm (LzoCodec), the stored file suffix is .lzo_deflate. The stored size is 121.9 + 121.9 + 2.1 = 245.9 MB, and the job took 42.45 seconds.
5. Use Lzop compression; the file extension is .lzo
--1. Use LZOP for compressed storage
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
insert overwrite table t_fin_demo partition(staits_date ='201906')
select
name,
id_type,
idcard,
org,
loan_no,
busi_type,
busi_category,
open_date,
dure_date,
loan_amount,
happen_time,
amout,
due_amt,
stat
from t_fin_demo where staits_date ='201901';
--2. Run the du -h command to check the file storage status on HDFS
[finance@master2-dev hadoop]$ hadoop fs -du -h /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201906
121.9 M /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201906/000000_0.lzo
121.9 M /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201906/000001_0.lzo
2.1 M /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201906/000002_0.lzo
--3. Program running time
Total MapReduce CPU Time Spent: 47 seconds 280 msec
OK
Time taken: 34.439 seconds
Summary: With the Lzop algorithm (LzopCodec), the stored file suffix is .lzo. The stored size is 121.9 + 121.9 + 2.1 = 245.9 MB, and the job took 34.439 seconds. Unlike LzoCodec above, LzopCodec writes the standard lzop container format that external lzop tools can read, hence the different extensions.
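A practical note: .lzo files only become splittable for MapReduce after you build an index with the hadoop-lzo indexer. The jar path below is an assumption; adjust it to your deployment:
[finance@master2-dev hadoop]$ hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201906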
6. Use Bzip2 compression; the file extension is .bz2
--1. Use Bzip2 for compressed storage
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;
insert overwrite table t_fin_demo partition(staits_date ='201907')
select
name,
id_type,
idcard,
org,
loan_no,
busi_type,
busi_category,
open_date,
dure_date,
loan_amount,
happen_time,
amout,
due_amt,
stat
from t_fin_demo where staits_date ='201901';
--2. Run the du -h command to check the file storage status on HDFS
[finance@master2-dev hadoop]$ hadoop fs -du -h /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201907
52.5 M /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201907/000000_0.bz2
52.5 M /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201907/000001_0.bz2
935.2 K /user/finance/hive/warehouse/fdm_sor.db/t_fin_demo/staits_date=201907/000002_0.bz2
--3. Program running time
Total MapReduce CPU Time Spent: 2 minutes 47 seconds 530 msec
OK
Time taken: 96.42 seconds
Summary: With the Bzip2 algorithm, the stored file suffix is .bz2. The stored size is 52.5 + 52.5 + 0.9 ≈ 106 MB, the smallest of all, and the job took 96.42 seconds, the longest of all.
4. Comprehensive analysis of the compression algorithms
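Consolidating the measurements above (source file 516.4 MB; uncompressed output 546.7 MB):
Compression   Extension       Stored size   Time taken
None          (none)          546.7 MB      36.445 s
Deflate       .deflate        153.1 MB      54.659 s
Gzip          .gz             153.1 MB      62.436 s
LZO           .lzo_deflate    245.9 MB      42.45 s
Lzop          .lzo            245.9 MB      34.439 s
Bzip2         .bz2            ~106 MB       96.42 s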
As the table above shows, each compression algorithm has its own advantages and disadvantages. Which one to use depends on the stored data format and the computing pattern. For more on the use and principles of compression, refer to subsequent blogs.
1. Compression ratio: Bzip2 > Gzip ≈ Deflate > LZO, so Bzip2 saves the most storage space but is also the most time-consuming.
2. Compression/decompression speed: LZO > Deflate > Gzip > Bzip2, so LZO decompresses the fastest.