Background

Due to the lack of a unified specification and of platform tooling to enforce it, most business and technical staff pay little attention to the Hive table storage format when creating tables. As data volumes grow, the storage wasted by the TextFile format becomes more and more serious

In data warehouse construction, apart from the ODS layer, where TextFile is used to speed up data loading, Hive tables in the other layers should use ORC or another storage format with a high compression ratio

Hive tables can be stored in a variety of formats. This section describes two of them: TextFile and ORC

TextFile

Data is stored row by row and is uncompressed by default; a compression codec can be selected if needed

A Hive table in TextFile format loads data quickly, so the ODS layer of a data warehouse generally uses TextFile to speed up data loading
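As a minimal sketch of this pattern (the table, columns, and paths below are made-up examples, not taken from this article), an ODS table stored as TextFile can take ownership of raw files directly, so loading is little more than a file move:

-- Illustrative ODS-layer table stored as TextFile; names and paths are placeholders
CREATE EXTERNAL TABLE IF NOT EXISTS ods_demo_log (
  line string
)
PARTITIONED BY (dt string)
STORED AS TEXTFILE
LOCATION '/warehouse/ods/ods_demo_log';

-- Loading raw text is essentially a metadata update plus a file move,
-- which is why TextFile is convenient at the ODS layer
LOAD DATA INPATH '/raw/2024-01-01/log.txt'
INTO TABLE ods_demo_log PARTITION (dt = '2024-01-01');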

TextFile has several obvious disadvantages:

  1. Uncompressed data consumes a large amount of storage space
  2. Once compressed, the data cannot be split or merged
  3. Data is stored in rows
    1. The data types within a row are rarely consistent, so the compression ratio is low
    2. Even a query that involves only a few columns still reads the entire row; unnecessary columns cannot be skipped

ORC

ORC (Optimized Row Columnar) is an optimized version of the RCFile format

Data is stored by column. For each row, only the columns required by the query are read, which speeds up queries

Because the data within each column has a consistent type, ORC usually achieves a high compression ratio, which effectively reduces storage space
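As a sketch of what a non-ODS table might look like (names and columns are illustrative, not from this article), the same data can be declared as ORC, with the internal codec chosen through the orc.compress table property:

-- Illustrative ORC table; orc.compress selects the codec used inside the ORC files
-- (valid values include NONE, ZLIB, and SNAPPY)
CREATE EXTERNAL TABLE IF NOT EXISTS dwd_demo_log (
  user_id bigint,
  event   string
)
PARTITIONED BY (dt string)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');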

Storage size comparison

When I converted a set of TextFile Hive tables, I found that the ORC storage size was usually half that of TextFile or less

For some tables with many int columns, the ORC storage size was only 1.88% of the TextFile size, a compression ratio of more than 50x!

Original size: TextFile format size; Compressed size: ORC format size

Compression ratio: original table size / compressed table size; compression rate: the reciprocal of the compression ratio (compressed size / original size)
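The sizes used in such a comparison can be read straight from HDFS; a minimal sketch, with placeholder paths standing in for the real table Locations:

# Total on-disk size of the TextFile table and of the ORC intermediate table
hadoop fs -du -s -h <original table Location>
hadoop fs -du -s -h <ORC intermediate table Location>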

Considerations before conversion

Before thinking about how to convert TextFile tables to ORC, let's consider how a change of storage format affects production tasks

Storage format changes

The Hive metastore contains a table named SDS that records the input and output format classes of each Hive table: INPUT_FORMAT and OUTPUT_FORMAT

Different storage formats will use different input and output format classes

Storage format | INPUT_FORMAT | OUTPUT_FORMAT
TextFile | org.apache.hadoop.mapred.TextInputFormat | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
ORC | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
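To confirm which classes a given table is currently using, one option (a sketch, assuming read access to the metastore's backing database; the angle-bracket names are placeholders in the same style as the rest of this article) is to query SDS joined with TBLS; running DESCRIBE FORMATTED on the table in Hive prints the same InputFormat/OutputFormat information.

-- Run against the metastore database (e.g. MySQL), not inside Hive
SELECT t.TBL_NAME, s.INPUT_FORMAT, s.OUTPUT_FORMAT, s.LOCATION
FROM TBLS t
JOIN SDS s ON t.SD_ID = s.SD_ID
JOIN DBS d ON t.DB_ID = d.DB_ID
WHERE d.NAME = '<database>' AND t.TBL_NAME = '<original table>';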

Hive SQL tasks are compiled into the appropriate jobs automatically based on the SQL, so changing the storage format does not affect them. However, hand-written MapReduce jobs specify the input and output format in code when reading and writing data, so a change of storage format will cause such jobs to fail

Table data volume changes

Because of ORC's high compression, the data volume of an ORC table is usually less than half that of the corresponding TextFile table, and some special tables can achieve extremely high compression ratios

For Hive SQL tasks, the number of Map tasks is determined in the Map phase based on the input table size. Compared with a TextFile table, an ORC table therefore spawns fewer Map tasks because its data volume is smaller

If the ORC table has an extremely high compression ratio, for example 1/50 of the original size, it may get only 1/50 of the Map tasks of the TextFile table

After a TextFile table is converted to ORC, the number of Map tasks decreases, which may lengthen task running times. Typically, a compression rate above 30% has little impact on Hive SQL tasks. If the compression rate of the ORC table falls below 30%, downstream tasks may need to set mapred.max.split.size to reduce the split size and thus control the number of Map tasks
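A minimal sketch of that tuning; the 64 MB value is purely illustrative and should be chosen based on the table size and the original number of Map tasks:

-- Smaller maximum split size => each Map task reads less data => more Map tasks
-- (on newer Hadoop versions the equivalent key is mapreduce.input.fileinputformat.split.maxsize)
set mapred.max.split.size = 67108864;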

TextFile format optimization

The TextFile format wastes a large amount of storage space, and its query efficiency is worse than ORC's, so optimizing TextFile tables is urgent

If a Hive table is stored in TextFile format, how do we convert it to ORC?

The following optimization method applies to Hive external partitioned tables

Step 1: Copy data to the intermediate table

  1. Create an intermediate table in ORC format with the same structure as the original table

CREATE EXTERNAL TABLE IF NOT EXISTS <ORC intermediate table> LIKE <original table> STORED AS ORC;

  2. Copy data to the intermediate table

-- Enable dynamic partitioning so that all partitions can be written in one statement
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE <ORC intermediate table> PARTITION (<partition column>)
SELECT * FROM <original table>;

Step 2: Verify data

To avoid data errors, verify after the copy that the data in the intermediate table is consistent with the original table

Verification rules:

  1. Row count verification: compare the results of select count(1) on the original table and the intermediate table
  2. Partition verification: compare the output of show partitions on the two tables

In addition, you can add other verification rules, such as the null-value ratio of a key field
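A sketch of these checks in HiveQL; the partition and field names are placeholders in the same spirit as the rest of this article:

-- Row count comparison
SELECT count(1) FROM <original table> WHERE <partition column> = '<partition value>';
SELECT count(1) FROM <ORC intermediate table> WHERE <partition column> = '<partition value>';

-- Partition list comparison
SHOW PARTITIONS <original table>;
SHOW PARTITIONS <ORC intermediate table>;

-- Optional: null-value ratio of a key field
SELECT count(IF(<field> IS NULL, 1, NULL)) / count(1) AS null_ratio
FROM <ORC intermediate table>
WHERE <partition column> = '<partition value>';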

Step 3: Task test

Use the ORC intermediate table to test the table's production task and the downstream tasks that consume its data

After the test passes, modify the original table

Production task testing (Level of importance: Medium)

The impact of the change can be detected by rerunning the data production task. Normally, a change of Hive table storage format does not affect production tasks (except jobs that generate data with hand-written MR code).

Test contents:

  1. Check whether the production task is running successfully
  2. Check whether the running time of a production task is abnormal

Downstream task testing (Importance level: High)

Impact analysis

Changing the storage format of a Hive table changes its data volume

Hive sets the number of Map tasks based on the input table size. The data volume in ORC format is smaller than in TextFile format, so when a table is converted from TextFile to ORC, the number of Map tasks in downstream tasks decreases and their running time increases to some extent

The impact of compression in ORC format on downstream tasks depends on the compression rate of Hive tables

Compression rate: compressed data size / original data size (the reciprocal of the compression ratio)

Under normal circumstances, TextFile -> ORC gives a compression rate of 30%-50%. When the compression rate is above 30%, the impact on downstream tasks is small. If the compression rate is below 30%, adjust the mapred.max.split.size parameter of the downstream tasks to manually increase the number of Map tasks

Test downstream tasks

Test contents:

  1. Check whether downstream tasks are running successfully
  2. Check whether the running time of downstream tasks is abnormal

If a downstream task takes longer to run, you can manually increase its number of Map tasks by adjusting its mapred.max.split.size parameter

Step 4: Change the format of the original table

There are two ways to modify the original table storage format

1) Alter table Location

The core idea of this approach is to point the Location of the original table directly at the Location of the intermediate table

Main steps:

  1. Modify the format of the original table
  2. Alter table Location
  3. Rebuild the original table partition
1. Change the format of the original table
ALTER TABLE <original table> SET FILEFORMAT ORC;
2. Modify the original table Location

Point the Location of the original table to the Location of the intermediate table

ALTER TABLE <original table> SET LOCATION '<intermediate table Location>';

Here are some things to watch out for (both points are illustrated in the sketch after this list):

  1. Save the original Location of the table. First, this allows the change to be rolled back; second, the data under the original Location can be cleaned up promptly once the change is confirmed to have no impact
  2. After changing the Location, it is advisable to drop the table structure of the intermediate table. When multiple tables point to the same Location, data is easily damaged by accident during later operations and maintenance
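A small HiveQL sketch of those two precautions; the statements are illustrative:

-- Record the original Location before the change (printed in the "Location" field)
DESCRIBE FORMATTED <original table>;

-- After the change is confirmed, drop only the metadata of the intermediate table;
-- since it is an external table, the data it pointed to is not deleted
DROP TABLE IF EXISTS <ORC intermediate table>;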
3. Recreate the original table partition

In the Hive metastore, each partition records its own input_format, output_format, and location

The input_format and output_format fields determine how the table data is read, and they must match the storage format of the Hive table. Because the storage format has changed, the partition metadata needs to be rebuilt so that input_format and output_format are updated. Since this method also modifies the Location, the partition locations are updated at the same time

-- 1. Drop the existing partitions of the original table
ALTER TABLE <original table> DROP PARTITION (<partition spec>);
-- 2. Rebuild the partition metadata from the directories under the new Location
MSCK REPAIR TABLE <original table>;

2) Move intermediate table data to original table directory

Use the HDFS mv command to move the ORC intermediate table data to the original table Location

Main steps:

  1. Back up original table data: Move original table data to a specific directory for rolling back or deleting old data
  2. Move data: Move intermediate table data to the original table Location
  3. Modify the format of the original table
  4. Rebuild the original table partition
1. Back up the original table data
hadoop fs -mv <original table Location> <backup directory>
2. Move the data
hadoop fs -mv <intermediate table Location> <original table Location>
3. Change the format of the original table
ALTER TABLE <original table> SET FILEFORMAT ORC;
4. Recreate the original table partition

In the Hive metastore, each partition records its own input_format, output_format, and location

The input_format and output_format fields determine how the table data is read, and they must match the storage format of the Hive table. Because the storage format has changed, the partition metadata needs to be rebuilt so that input_format and output_format are updated

-- 1. Drop the existing partitions of the original table
ALTER TABLE <original table> DROP PARTITION (<partition spec>);
-- 2. Rebuild the partition metadata from the data directories
MSCK REPAIR TABLE <original table>;

Step 5: Verify the data of the original table

After the previous steps, both the storage format and the data of the Hive table have been converted to ORC

Check whether the data in the new format of the original table is normal:

  1. Data volume verification: verify that the data volume is the same as before
  2. Data verification: select some rows and check whether the data looks abnormal