Common storage formats

1. TextFile

The default format for Hive data tables. Data is not compressed, resulting in high disk overhead and high parsing overhead. Storage mode: row storage.

You can use the Gzip compression algorithm, but the compressed files do not support splitting.

During deserialization, every character must be checked to determine whether it is a field delimiter or an end-of-line marker, so deserialization costs tens of times more than it does for SequenceFile.
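As a minimal HiveQL sketch of the above (table and column names are hypothetical), a TextFile table can be loaded directly from a gzipped file, at the cost of splittability:

CREATE TABLE access_log_text (
  ip STRING,
  url STRING,
  ts STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- The .gz file is usable as-is, but because Gzip is not splittable,
-- each file is read end-to-end by a single mapper.
LOAD DATA INPATH '/data/raw/access_log.tsv.gz' INTO TABLE access_log_text;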

2. RCFile

Storage mode: data is partitioned horizontally into row groups, and within each row group the data is stored column by column. This combines the advantages of row and column storage:

First, RCFile guarantees that the data in the same row is stored on the same node, so the cost of tuple reconstruction is low.

Second, like column storage, RCFile can exploit column-wise data compression and skip reads of unneeded columns.

Data appending: RCFile does not support arbitrary write operations; it provides only an append interface, because HDFS supports appending data only at the end of a file.

Row group size: larger row groups improve compression efficiency, but can hurt read performance because lazy decompression becomes more expensive. Larger row groups also consume more memory, which affects other MapReduce jobs running concurrently. Balancing storage space against query efficiency, Facebook chose 4 MB as the default row group size, while still allowing users to set the parameter themselves.
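As a hedged sketch of the above (hypothetical table; the buffer-size property name is as found in standard Hive configurations), an RCFile table is declared with STORED AS RCFILE, and the row group (record buffer) size can be tuned per session:

-- 4 MB (4194304 bytes) is the default row group size discussed above.
SET hive.io.rcfile.record.buffer.size=4194304;

CREATE TABLE access_log_rc (
  ip STRING,
  url STRING,
  ts STRING
)
STORED AS RCFILE;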

3. ORCFile

Storage mode: data is partitioned horizontally into stripes, and within each stripe the data is stored column by column.

Fast compression, splittable, fast column access. It is an improved version of RCFile and is more efficient than RCFile.

Supports a variety of data types, such as datetime, decimal, and complex types (struct, list, map, union).
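A minimal sketch (hypothetical table and column names) of an ORC table using some of those types; the orc.compress table property selects the internal codec:

CREATE TABLE orders_orc (
  order_id BIGINT,
  amount DECIMAL(10,2),
  created_at TIMESTAMP,
  address STRUCT<city:STRING, street:STRING>
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');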

4. SequenceFile

Compressing data files saves disk space, but one drawback of many native Hadoop compression formats is that they do not support splitting. Splitting matters because multiple mapper processes handle a large data file in parallel; most compressed files cannot be split, since they can only be read from the beginning. SequenceFile is a splittable file format that supports Hadoop's block-level compression.

SequenceFile is a binary format provided by the Hadoop API that serializes data into a file as key-value pairs. Storage mode: row storage.

SequenceFile supports three compression options: NONE, RECORD, and BLOCK. RECORD, the default, compresses each record individually and achieves a low compression ratio; BLOCK generally compresses better than RECORD.
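A minimal sketch of selecting BLOCK compression for SequenceFile output from Hive (standard Hive/Hadoop property names; the table name is hypothetical):

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;

CREATE TABLE access_log_seq (
  ip STRING,
  url STRING,
  ts STRING
)
STORED AS SEQUENCEFILE;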

One advantage is that SequenceFiles are compatible with MapFile in the Hadoop API.

Note: TextFile is the format used by default when creating tables; when data is imported into it, the data file is copied to HDFS directly, without processing. Data cannot be imported into SequenceFile, RCFile, or ORC tables directly from a local file. Import the data into a TextFile table first, then populate the SequenceFile or RCFile table with an INSERT ... SELECT from the TextFile table, as sketched below.
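A minimal sketch of that two-step import (table names are hypothetical):

-- 1. Load the raw file into a TextFile staging table (a plain copy to HDFS).
CREATE TABLE logs_staging (line STRING) STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/tmp/logs.txt' INTO TABLE logs_staging;

-- 2. Rewrite the rows into the binary/columnar format with INSERT ... SELECT.
CREATE TABLE logs_seq (line STRING) STORED AS SEQUENCEFILE;
INSERT OVERWRITE TABLE logs_seq SELECT line FROM logs_staging;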

5. Parquet

Parquet is also a columnar storage format with good compression performance; it also reduces the amount of data scanned and the deserialization time.
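A minimal sketch (hypothetical names; assumes a Hive version with native Parquet support), where the parquet.compression table property selects the codec:

CREATE TABLE access_log_parquet (
  ip STRING,
  url STRING,
  ts STRING
)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');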

Compression formats supported by Hive

  • Hive supports the Gzip, Bzip2, LZO, and Snappy compression formats
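For example, output compression can be switched on per session with settings like these (standard Hive/Hadoop property names; the codec class is interchangeable):

SET hive.exec.compress.intermediate=true;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;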

———-

An LZO file with an index supports splittable decompression. Without an index, a single LZO file is handled by one map task in a MapReduce job, and that map phase is very slow.

Create an LZO index:

hadoop jar /opt/hadoopgpl/lib/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /data/rawlog/your_log_file.lzo
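
Once indexed, the file can be queried from Hive through the input format shipped with hadoop-lzo (class names as in the hadoop-lzo project; the table definition itself is a hypothetical sketch, and the LZO codec must already be registered in io.compression.codecs):

CREATE EXTERNAL TABLE raw_log_lzo (line STRING)
STORED AS
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/data/rawlog/';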
