Hive tables can be stored in various formats, such as TextFile, ORC, and RCFile.

Textfiles are stored in uncompressed plain text format, with fields divided by specified delimiters. When the data itself contains delimiters, it can cause data dislocation. Therefore, storing data using TextFile is a big security risk

If the storage format of the table is already TextFile, how do I convert it to another format?

This tutorial is intended for Hive partition tables. If the Hive table is not a partition table, modifying the storage format will cause data read failure.

First, check the form

View the storage format of the table itself by building table sentences

The command is as follows:

SHOW CREATE TABLE < TABLE name >Copy the code

The following information is displayed after the command is executed:

CREATE EXTERNAL TABLE 'XXX' (' name 'bigint COMMENT' name ') PARTITIONED BY (' XXX 'XXX) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'Copy the code

INPUTFORMAT and OUTPUTFORMAT determine the read and write modes and storage formats of tables

Reference table for storage format

The following table describes the storage format of Hive tables

OUTPUTFORMAT Storage format
org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat TextFile
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat TextFile
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat Orc
org.apache.hadoop.hive.ql.io.RCFileOutputFormat RCFile
org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat Parquet
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Sequence

Modify the table storage format

The following tutorial is about changing the storage format of Hive partition tables

We will introduce some concepts of storage formats in Hive Meta

Table storage format & partition storage format

In Hive Meta, TBLS is the metadata of the table, which records the storage format of the table, and PARTITIONS are metadata, which also records the storage format of PARTITIONS

What is the relationship between table storage format and partition storage format? Which one determines the storage format?

Table storage format Indicates the storage format of Hive tables. New partitions follow the storage format of Hive tables, that is

Storage format of the new partition = table storage formatCopy the code

Partitioned storage format: Determines the actual storage format of data for that partition

Take a chestnut

If db_name. Tbl_name_1 starts as a TextFile and two new partitions are added: part_1 and part_2. Part_1 and part_2 are stored in TextFile format

Later, we changed the table storage format to ORC and added a partition part_3, so the storage format of part_3 is ORC

In this case, the storage format of the table is ORC. The storage format of the three partitions is as follows

The partition name Storage format
part_1 TextFile
part_2 TextFile
part_3 ORC

Thus, partitions in the same partitioned table may not all be stored in the same format

Difference between Hive and Spark in reading partition data of Hive tables

Hive reads and writes partition data based on the actual storage format of the partition. In the previous example:

  • Part 3 will be read using ORC
  • TextFile will be used to read part_1/part_2
  • The new part 4 will be ORC
  • OVERWRITE/ append part_2(INSERT OVERWRITE/INTO TABLE), the storage format of part_2 will become ORC

Spark Uses the partition storage format as the default table storage format to reduce the read of Hive Meta. However, Spark reads part_1 in ORC mode and fails to read it (the data in Part_1 is actually a TextFile).

3. Modify the table storage format

ALTER TABLE < TABLE name > SET FILEFORMAT < ORC>Copy the code

After modifying the table storage format:

  • The storage format of the new partition is the new storage format
  • The storage format of the history partition is still the old format
    • The Hive engine can read historical partitions
    • The Spark engine fails to read the historical partition.

4. Brush historical partition data

Rebrush the historical partition data to change the storage format of the historical partition to the latest storage format of the table

INSERT OVERWRITE TABLE < TABLE name > PARTITION SELECT XXX, XXX, XXX FROM < TABLE name > WHERE PARTITION expressionCopy the code