Hive tables can be stored in various formats, such as TextFile, ORC, and RCFile.
Textfiles are stored in uncompressed plain text format, with fields divided by specified delimiters. When the data itself contains delimiters, it can cause data dislocation. Therefore, storing data using TextFile is a big security risk
If the storage format of the table is already TextFile, how do I convert it to another format?
This tutorial is intended for Hive partition tables. If the Hive table is not a partition table, modifying the storage format will cause data read failure.
First, check the form
View the storage format of the table itself by building table sentences
The command is as follows:
SHOW CREATE TABLE < TABLE name >Copy the code
The following information is displayed after the command is executed:
CREATE EXTERNAL TABLE 'XXX' (' name 'bigint COMMENT' name ') PARTITIONED BY (' XXX 'XXX) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'Copy the code
INPUTFORMAT and OUTPUTFORMAT determine the read and write modes and storage formats of tables
Reference table for storage format
The following table describes the storage format of Hive tables
OUTPUTFORMAT | Storage format |
---|---|
org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat | TextFile |
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | TextFile |
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat | Orc |
org.apache.hadoop.hive.ql.io.RCFileOutputFormat | RCFile |
org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | Parquet |
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat | Sequence |
Modify the table storage format
The following tutorial is about changing the storage format of Hive partition tables
We will introduce some concepts of storage formats in Hive Meta
Table storage format & partition storage format
In Hive Meta, TBLS is the metadata of the table, which records the storage format of the table, and PARTITIONS are metadata, which also records the storage format of PARTITIONS
What is the relationship between table storage format and partition storage format? Which one determines the storage format?
Table storage format Indicates the storage format of Hive tables. New partitions follow the storage format of Hive tables, that is
Storage format of the new partition = table storage formatCopy the code
Partitioned storage format: Determines the actual storage format of data for that partition
Take a chestnut
If db_name. Tbl_name_1 starts as a TextFile and two new partitions are added: part_1 and part_2. Part_1 and part_2 are stored in TextFile format
Later, we changed the table storage format to ORC and added a partition part_3, so the storage format of part_3 is ORC
In this case, the storage format of the table is ORC. The storage format of the three partitions is as follows
The partition name | Storage format |
---|---|
part_1 | TextFile |
part_2 | TextFile |
part_3 | ORC |
Thus, partitions in the same partitioned table may not all be stored in the same format
Difference between Hive and Spark in reading partition data of Hive tables
Hive reads and writes partition data based on the actual storage format of the partition. In the previous example:
- Part 3 will be read using ORC
- TextFile will be used to read part_1/part_2
- The new part 4 will be ORC
- OVERWRITE/ append part_2(INSERT OVERWRITE/INTO TABLE), the storage format of part_2 will become ORC
Spark Uses the partition storage format as the default table storage format to reduce the read of Hive Meta. However, Spark reads part_1 in ORC mode and fails to read it (the data in Part_1 is actually a TextFile).
3. Modify the table storage format
ALTER TABLE < TABLE name > SET FILEFORMAT < ORC>Copy the code
After modifying the table storage format:
- The storage format of the new partition is the new storage format
- The storage format of the history partition is still the old format
- The Hive engine can read historical partitions
- The Spark engine fails to read the historical partition.
4. Brush historical partition data
Rebrush the historical partition data to change the storage format of the historical partition to the latest storage format of the table
INSERT OVERWRITE TABLE < TABLE name > PARTITION SELECT XXX, XXX, XXX FROM < TABLE name > WHERE PARTITION expressionCopy the code