Row storage versus column storage

  • Today’s data processing can be roughly divided into two categories: Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). OLTP is the main workload of traditional relational databases and consists of basic, day-to-day transactions such as inserting, deleting, updating, and querying database records. OLAP is the main workload of distributed analytical databases; it has modest real-time requirements but processes very large volumes of data, and is typically used in complex, dynamic reporting systems.

Therefore, OLTP systems generally use row storage, because they need low latency and perform a large number of update operations, while OLAP systems use column storage, because real-time requirements are low and read performance over large scans matters most.

Features of row storage

  • When querying a full row that meets the criteria, only one of its values needs to be located; the remaining values are stored adjacently, so row storage answers whole-row queries faster.
  • Traditional relational databases such as Oracle, DB2, MySQL, and SQL Server use row-based storage. In a row-oriented database, the row is the basic logical storage unit, and the data of a row is stored contiguously on the storage medium.
  • Both TEXTFILE and SEQUENCEFILE are row-based storage formats.
  • This layout is convenient for INSERT/UPDATE operations. The disadvantage is that if a query involves only a few columns, the entire row is still read; unnecessary columns cannot be skipped. With small data volumes this is usually not a problem, but with large data volumes it hurts performance. In addition, because the data types within a row differ from column to column, it is hard to achieve a high compression ratio, so space utilization is relatively low.

Column storage features

  • When querying, only the columns involved in the query are read, and all other columns are skipped. When a query needs only a few fields, this greatly reduces the amount of data read. Because the data of each column is stored together, every value in a column has the same data type, so compression algorithms can be tailored to each column, yielding high compression ratios and saving not only storage space but also memory and CPU during computation.

  • The downside is that INSERT/UPDATE operations are cumbersome, and column storage is not well suited to scanning small amounts of whole-row data.

  • Column-based storage is defined relative to row storage. Newer distributed databases such as HBase, HP Vertica, and EMC Greenplum all use columnar storage. In a column-oriented database, the column is the basic logical storage unit, and the data of a column is stored contiguously on the storage medium.

To summarize the characteristics of column storage: because the data of each field is stored together, a query that needs only a few fields reads far less data; and because every value in a field has the same data type, compression algorithms can be designed specifically for each column. ORC and Parquet are both column-based storage formats.

Let’s make this less abstract: if you have a table with a billion rows, then by the definition of column storage you would store the billion values of one field before storing the values of the next field.
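To see why this layout matters in Hive, consider a query that touches only one column of a wide table. This is just a sketch; the table and column names below are made up for illustration.

-- With a row format (e.g. TEXTFILE), every full row must be read even though
-- only `city` is needed. With a columnar format (e.g. ORC or Parquet), only
-- the `city` column chunks are read from disk.
SELECT city, count(*) AS cnt
FROM user_visits      -- hypothetical wide table with many columns
GROUP BY city;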

Common data formats

Hive supports the following storage formats. The following describes the features of each format

  1. Text File
  2. SequenceFile
  3. RCFile
  4. Avro Files
  5. ORC Files
  6. Parquet
  7. Custom INPUTFORMAT and OUTPUTFORMAT

Hive uses plain text files by default: if you create a table without specifying a file format, it is stored as a text file. Hive also lets you change the default file format:

<property>
  <name>hive.default.fileformat</name>
  <value>TextFile</value>
  <description>
    Expects one of [textfile, sequencefile, rcfile, orc, parquet].
    Default file format for CREATE TABLE statement. Users can explicitly override it by CREATE TABLE ... STORED AS [FORMAT]
  </description>
</property>
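For an individual table, the format can also be specified explicitly with STORED AS, which overrides this default. A minimal sketch (the table name and columns are made up):

-- Change the session default (allowed values follow the property description above)
SET hive.default.fileformat=ORC;

-- Or declare the format per table
CREATE TABLE demo_orc (
  id int,
  name string)
STORED AS ORC;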

TextFile

Storage mode: row storage

The default storage format does not compress data, resulting in high disk overhead and high parsing overhead. It can be used together with Gzip or Bzip2 (Hive automatically detects and decompresses the files during query), but files compressed this way do not support split, so Hive cannot split the data and cannot process it in parallel.

In addition, during deserialization the field delimiter and line terminator must be checked character by character, so the deserialization cost is several tens of times higher than for SequenceFile.

SequenceFile

SequenceFile is a binary file format supported by the Hadoop API. It stores data by row and is easy to use, splittable, and compressible.

Compressed data files save disk space, but some of Hadoop’s native compressed file formats cannot be split, which is why Hadoop provides the SequenceFile format. Splittable files can be processed in parallel by multiple mapper tasks; most compressed files are not splittable because they can only be read from the beginning.

SequenceFile supports three compression options: NONE, RECORD, and BLOCK. RECORD compression (the default) has a low compression ratio, so BLOCK compression is recommended.

Another advantage of SequenceFile is that the files are compatible with MapFile in the Hadoop API.

Note: when a table is built on this kind of format, LOAD DATA simply copies the data files to HDFS without any processing, so data cannot be loaded into SequenceFile, RCFile, or ORC tables directly from local text files. The data must first be imported into a TextFile table and then written into the SequenceFile or RCFile table with an INSERT ... SELECT from the TextFile table, as shown below.
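A minimal sketch of that two-step load, reusing the log_text table created later in this article (the log_seq table name is made up) and enabling the recommended BLOCK compression with the properties used later in the compression section:

set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;   -- RECORD is the default, BLOCK is recommended

create table log_seq(
     id int,
     name string,
     city string,
     phone string,
     acctime string)
stored as sequencefile;

insert overwrite table log_seq select * from log_text;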

RCfile

Storage mode: data is first divided into row groups, and within each row group the data is stored by column.

RCFile is short for Record Columnar File and was the first columnar storage format in Hadoop. It provides good compression and fast query performance, but does not support schema evolution. It is a hybrid of row and column storage.

First, it partitions the data horizontally into row groups, keeping all values of a row in the same block, so a record never has to be assembled from multiple blocks. Second, within each row group the data is stored by column, which helps compression and fast column access, and allows unnecessary columns to be skipped.

ORCfile

Storage mode: data is divided horizontally into stripes (row groups), and within each stripe the data is stored by column. It is therefore not pure column storage, but column storage applied to row-based segments.

The Optimized Row Columnar (ORC) file format is a columnar storage format in the Hadoop ecosystem. It originated in early 2013 in Apache Hive, with the goal of reducing Hadoop storage space and accelerating Hive queries. Like Parquet, it is not a pure columnar format: the table is first split horizontally into row groups (stripes), and columns are stored together within each row group.

ORC files are self-describing; their metadata is serialized with Protocol Buffers, and the data in the file is compressed as much as possible to reduce storage consumption. ORC is supported by query engines such as Spark SQL and Presto, but Impala currently does not support ORC and still uses Parquet as its primary columnar format. In 2015, ORC was promoted to a top-level project of the Apache Software Foundation.

ORC files offer good compression and fast column access. ORC is an improved version of RCFile: it compresses better and queries faster than RCFile, and it supports complex data types such as datetime, decimal, and nested structs. Because the data is stored in binary form, ORC files cannot be read directly; they are self-describing and carry a large amount of metadata, serialized with Protocol Buffers.

Note that ORC spends extra CPU on compression and decompression while reading and writing, but this CPU overhead is quite low.

Format

ORC file: an ORC file consists of one or more stripes. Each stripe is typically the size of an HDFS block and contains multiple records; it corresponds to the row-group concept in Parquet. Each stripe has three parts: Index Data, Row Data, and Stripe Footer.

Stripe: a group of rows forms a stripe. Reads are performed stripe by stripe. A stripe is typically the size of an HDFS block and stores the index and data of each column.

File-level metadata: the PostScript (file description), file meta-information including statistics over the whole file, information about all stripes, and the file schema.

Stripe metadata: Stores the position of the stripe, statistics for each column in that stripe, and all stream types and positions.

Row group: the smallest unit of indexing. A stripe contains multiple row groups, each holding 10,000 rows by default.

ORC files keep statistics at three levels: file level, stripe level, and row-group level. Using these statistics together with search arguments (predicate pushdown conditions), the reader can decide whether whole sections of data can be skipped. The statistics collected depend on the data type of each column.

File level: statistics for every column across the whole file are recorded at the end of the ORC file. They are mainly used for query optimization, and can also directly answer simple aggregate queries such as max, min, and sum.

Stripe level: ORC files keep per-stripe statistics for each field. The ORC reader uses these statistics to decide which stripes need to be read for a query. For example, if a stripe has max(a) = 10 and min(a) = 3, then for a predicate a > 10 or a < 3 none of the records in that stripe need to be read.

Row-group level: to further avoid reading unnecessary data, the index of each column is logically divided into groups of a fixed number of rows (10,000 by default, configurable) and statistics are kept per group. The Hive query engine passes the constraints in the WHERE clause to the ORC reader, which uses these group-level statistics to filter out unnecessary data. Setting the value lower stores more statistics, so users should choose a reasonable value based on the characteristics of their data.
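To illustrate how these statistics are used, here is a sketch against the log_orc table created later in this article. The hive.optimize.index.filter property is given here as an assumption; verify that it enables ORC predicate pushdown in your Hive version.

-- A sketch: let Hive push the WHERE predicate down into the ORC reader
set hive.optimize.index.filter=true;

-- Stripes and row groups whose min/max statistics for `id` rule out id > 100
-- can be skipped without decompressing their data
select count(1) from log_orc where id > 100;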

Data access

An ORC file is read starting from the end: the reader first reads the last 16 KB of the file, which brings as much of the PostScript and Footer data into memory as possible. The very last byte of the file stores the length of the PostScript, which cannot exceed 256 bytes. The PostScript holds the file-level metadata: the compression format of the file, the maximum length of each compressed block (the amount of memory to allocate per read), the length of the Footer, and some version information. Between the PostScript and the Footer, statistics over the whole file are stored; they cover each column in each stripe and include counts, maximum values, minimum values, null indicators, and so on.

Next the reader reads the Footer, which contains the length and offset of each stripe, the file’s schema (the schema tree stored in an array, indexed by the numbering in the schema), statistics for the whole file, and the number of rows in each row group.

When processing a stripe, the reader first obtains from the Footer the actual position and length of each stripe, along with each stripe’s own footer (metadata recording the lengths of the index and data sections). The stripe is divided into an index part and a data part. The stripe is further partitioned into row groups (the number of records per row group is recorded in the file Footer), and each row group is stored by column. Each row group stores the data and index information of multiple streams, and the data of each stream is compressed with an algorithm chosen according to the column type. ORC has the following stream types:

  • PRESENT: keeps one bit per member value, indicating whether the value is NULL; this makes it possible to store only the non-NULL values in the DATA stream.
  • DATA: The value of a member of the current stripe in this column.
  • LENGTH: The LENGTH of each member. This is only available for string columns.
  • DICTIONARY_DATA: The contents of a dictionary after encoding string data.
  • SECONDARY: stores secondary values, such as the scale part of a Decimal or the nanosecond part of a timestamp.
  • ROW_INDEX: Saves the statistics of each row group in the stripe and the start position of each row group.

After all metadata has been read during initialization, the caller can specify the columns to read through the includes array (a Boolean array indexed by column number); if it is not provided, all columns are read. Filtering conditions can also be supplied through the SearchArgument parameter. The reader first reads the index information of each stripe according to the metadata, then uses the statistics in the index together with the SearchArgument to decide which row groups need to be read, and then uses the includes array to decide which columns to read from those row groups. After these two layers of filtering, the data that actually has to be read consists of only a few segments of each stripe, and ORC merges discrete segments that are close together to minimize the number of I/O operations. Finally, using the row-group positions stored in the index, the reader seeks to the first row group that needs to be read in the stripe.

When using the ORC file format, users can have each HDFS block store one stripe of the ORC file. For a given ORC file the stripe size should therefore be smaller than the HDFS block size; otherwise a stripe would span multiple HDFS blocks and reading it would require remote reads. If stripes are constrained to a single block and the current block does not have enough room for the next stripe, the ORC writer writes a smaller stripe into the remaining space so the block is filled, and the next stripe then starts at the beginning of the next block.

Because ORC keeps fairly precise index information, reading can start from any row, and the fine-grained statistics allow ORC files to skip entire row groups. ORC compresses every chunk of data and index information with ZLIB by default, so ORC files occupy relatively little storage space.

Parquet

Parquet compresses well, has good query performance, and supports limited schema evolution, but write speed is usually slower. The format is used heavily by Cloudera Impala. Parquet files are binary and cannot be read directly; each file contains both the data and its metadata, so Parquet files are self-describing.

Parquet’s design largely follows the record shredding and assembly algorithm for nested data structures described in Dremel: it implements columnar storage (column chunks within row groups) through efficient shredding and reassembly of records, and then applies encoding and compression schemes targeted at columnar data to reduce storage cost and improve performance. To understand the logic of this algorithm, see the Dremel paper: Dremel: Interactive Analysis of Web-Scale Datasets.
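Like ORC, Parquet compression in Hive is usually chosen per table. A minimal sketch follows; the table name is made up, and the parquet.compression table property is an assumption to verify against your Hive/Parquet version:

create table log_parquet_snappy(
     id int,
     name string,
     city string,
     phone string,
     acctime string)
stored as parquet
tblproperties ("parquet.compression"="SNAPPY");

insert overwrite table log_parquet_snappy select * from log_text;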

Test

Prepare test data

First we generate some test data. Here is the code that generates it:

// Assumes the javafaker and JUnit dependencies used by the original snippet
import com.github.javafaker.Address;
import com.github.javafaker.Faker;
import com.github.javafaker.Name;
import com.github.javafaker.Number;
import com.github.javafaker.PhoneNumber;
import org.junit.Test;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class ProduceTestData {
    // Pattern must be "yyyy-MM-dd HH:mm:ss" (MM = month, mm = minute)
    SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    @Test
    public void testRandomName() throws IOException {
        Faker faker = new Faker(Locale.CHINA);
        final Name name = faker.name();
        final Address address = faker.address();
        Number number = faker.number();
        PhoneNumber phoneNumber = faker.phoneNumber();

        // Write 10 million tab-separated records: id, name, city, phone, time
        BufferedWriter out = new BufferedWriter(new FileWriter("/Users/liuwenqiang/access.log"));
        int num = 0;
        while (num < 10000000) {
            int id = number.randomDigitNotZero();
            String userName = name.name();
            String time = simpleDateFormat.format(new Date(System.currentTimeMillis()));
            String city = address.city();
            String phonenum = phoneNumber.cellPhone();

            StringBuilder stringBuilder = new StringBuilder();
            stringBuilder.append(id);
            stringBuilder.append("\t");
            stringBuilder.append(userName);
            stringBuilder.append("\t");
            stringBuilder.append(city);
            stringBuilder.append("\t");
            stringBuilder.append(phonenum);
            stringBuilder.append("\t");
            stringBuilder.append(time);

            out.write(stringBuilder.toString());
            out.newLine();
            num++; // increment the counter so the loop terminates
        }
        out.flush();
        out.close();
    }
}

Prepare three tables: log_text, log_orc, and log_parquet.

create table log_text(
     id int,
     name string,
     city string,
     phone string,
     acctime string)
row format delimited fields terminated by '\t'
stored as textfile;
LOAD DATA LOCAL INPATH '/Users/liuwenqiang/access.log' OVERWRITE INTO TABLE ods.log_text;
create table log_orc(
     id int,
     name string,
     city string,
     phone string,
     acctime string)
row format delimited fields terminated by '\t'
stored as orc;
insert overwrite table ods.log_orc select * from ods.log_text;
create table log_parquet(
     id int,
     name string,
     city string,
     phone string,
     acctime string)
row format delimited fields terminated by '\t'
stored as parquet;
insert overwrite table ods.log_parquet select * from ods.log_text;

All ORCFile parameters are specified in the TBLPROPERTIES clause of Hive DDL statements. The most common ones are listed below, followed by an example.

Key                    Default   Notes
orc.compress           ZLIB      high level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size      262144    number of bytes in each compression chunk
orc.row.index.stride   10000     number of rows between index entries (must be >= 1000)
orc.create.index       true      whether to create row indexes
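For example, here is a sketch of an ORC table that switches the codec from the default ZLIB to SNAPPY (the table name is made up; the column list is reused from log_text):

create table log_orc_snappy(
     id int,
     name string,
     city string,
     phone string,
     acctime string)
stored as orc
tblproperties ("orc.compress"="SNAPPY");

insert overwrite table log_orc_snappy select * from log_text;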

Storage space size

(Screenshots of the storage sizes of the log_text, log_orc, and log_parquet tables are omitted here.)
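If you want to reproduce the size comparison, one way (a sketch; the paths below assume the default Hive warehouse location and the ods database used above) is to check each table’s directory from the Hive CLI:

dfs -du -h /user/hive/warehouse/ods.db/log_text;
dfs -du -h /user/hive/warehouse/ods.db/log_orc;
dfs -du -h /user/hive/warehouse/ods.db/log_parquet;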

Test SQL execution efficiency

Test SQL: select city, count(1) as cnt from log_text group by city order by cnt desc; the same query is then run against log_orc and log_parquet.

(Screenshots of the query execution times on the log_text, log_orc, and log_parquet tables are omitted here.)

Conclusion

  1. This section described the characteristics and application scenarios of row storage and column storage.
  2. Both Parquet and ORC store data in binary form and cannot be read directly, and both are Apache top-level projects. Parquet does not support ACID transactions or updates.
  3. We briefly compared the storage footprint and query performance of TextFile, ORC, and Parquet. Because our query is simple and the data set is not very large, the query times do not differ much, but the difference in storage space is still very large.

Hive compression

For data-intensive tasks, I/O operations and network data transfers can take quite a long time to complete. By enabling compression in Hive, you can improve Hive query performance and save storage space on HDFS clusters.

HiveQL statements are eventually converted into MapReduce jobs in Hadoop, and MapReduce jobs can compress the processed data.

Let’s first look at which stages of a MapReduce job can use compression: the input data can be compressed before it enters the map phase, where it is decompressed for processing; the map output can be compressed to reduce network I/O (map and reduce tasks usually run on different nodes), and the reducers decompress the data after copying it; finally, the reduce output can be compressed to reduce disk usage. The corresponding properties are sketched below.
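As a quick reference, here is a sketch of the properties that control compression at each stage, using the MRv2 (mapreduce.*) property names; the older mapred.* names used later in this article work as well.

-- Map input: determined by the codec of the input files (e.g. .gz, .bz2); nothing to set here.

-- Map output (shuffle phase):
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Job (reduce) output:
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;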

Hive intermediate data compression

Once submitted, a complex Hive query is typically transformed into a series of multi-stage MapReduce jobs that are linked through the Hive engine to complete the query. Therefore, the intermediate output here refers to the output of the previous MapReduce job, which will be used as the input data of the next MapReduce job.

By using the set command in the Hive shell, or by modifying the hive-site.xml configuration file, we can change the hive.exec.compress.intermediate property and thereby enable compression of Hive’s intermediate output.

hive.exec.compress.intermediate: the default is false; setting it to true enables compression of intermediate data, that is, the mapper output in the shuffle phase of the MapReduce jobs. These properties can be set with the set command in the Hive shell:

set hive.exec.compress.intermediate=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- or

set hive.exec.compress.intermediate=true;
set mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;

It can also be configured in a configuration file

<property>
   <name>hive.exec.compress.intermediate</name>
   <value>true</value>
   <description>
     This controls whether intermediate files produced by Hive between multiple map-reduce jobs are compressed.
     The compression codec and other options are determined from Hadoop config variables mapred.output.compress*
   </description>
</property>
<property>
   <name>hive.intermediate.compression.codec</name>
   <value>org.apache.hadoop.io.compress.SnappyCodec</value>
   <description/>
</property>

The final output is compressed

hive.exec.compress.output: users usually want the data written to the final Hive result table to be compressed. This parameter controls that behavior; set it to true to compress the final output files.

mapred.output.compression.codec: when hive.exec.compress.output is set to true, choose a suitable codec, for example SnappyCodec. The settings are as follows (the two forms below are equivalent ways of writing the same thing):

set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- or

set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzopCodec;

It can also be configured through a configuration file

<property>
  <name>hive.exec.compress.output</name>
  <value>true</value>
  <description>This controls whether the final outputs of a query (to a local/HDFS file or a Hive table) is compressed. The compression  codec and other options are determined from Hadoop config variables mapred.output.compress*</description>
</property>

Common compression formats

Hive supports compression formats such as Bzip2, Gzip, Deflate, Snappy, and LZO. Hive relies on Hadoop’s compression support, so newer Hadoop versions support more compression methods. The codecs can be configured in $HADOOP_HOME/conf/core-site.xml as follows:

<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>

<property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

Note that before enabling compression in the Hive configuration, Hadoop itself must be configured to support it: Hive only specifies which compression algorithm to use, while the codecs themselves are configured in Hadoop.

Common compression formats are:

In terms of compression ratio, roughly bzip2 > zlib > gzip > deflate > snappy > lzo > lz4 (the exact ordering varies between test scenarios; this is only a general ranking). Bzip2, zlib, gzip, and deflate produce the smallest compressed files but consume a lot of CPU time.

In terms of compression and decompression speed, roughly lz4 > lzo > snappy > deflate > gzip > bzip2. Lz4, lzo, and snappy compress and decompress quickly but have lower compression ratios.

Therefore, lz4, lzo, and snappy are the compression formats most often used in production, where computing efficiency matters.

Compression format   Codec class
DEFAULT              org.apache.hadoop.io.compress.DefaultCodec
Gzip                 org.apache.hadoop.io.compress.GzipCodec
Bzip2                org.apache.hadoop.io.compress.BZip2Codec
Snappy               org.apache.hadoop.io.compress.SnappyCodec
Lzo                  com.hadoop.compression.lzo.LzopCodec
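A common production combination, shown here as a sketch: a fast codec such as Snappy for intermediate data, and Snappy (or Gzip, if the output is mainly archived) for the final output. The property names are the ones that appear in the configuration snippets of this section.

set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;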

For files that are compressed using Gzip or Bzip2, we can import them directly into the table. Hive automatically decompresses the data for us

CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
 
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;

Native Libraries

Hadoop is written in Java, so most compression codecs have Java implementations. However, some algorithms are not well suited to Java, and Hadoop therefore also provides native libraries. Besides the bzip2, lz4, snappy, and zlib codecs shipped with the native libraries, you can install additional libraries as needed (such as snappy and LZO). Using the native library implementations can improve performance by roughly 50%.

Use the following command to view the loading status of native libraries:

hadoop checknative -a

You can compress Hive tables in either of two ways: configure MapReduce compression directly, or enable compression properties on the Hive table. Because Hive converts SQL jobs into MapReduce jobs, configuring MapReduce compression is enough to produce compressed output; for convenience, Hive also supports table-level compression properties (such as the TBLPROPERTIES shown for ORC and Parquet earlier), which compress the data automatically.

Available compression codecs in Hive

To enable compression in Hive, we first need to find out which compression codecs are available on the Hadoop cluster. We can list the available compression codecs using the set command below.

hive> set io.compression.codecs;
io.compression.codecs=
  org.apache.hadoop.io.compress.GzipCodec,
  org.apache.hadoop.io.compress.DefaultCodec,
  org.apache.hadoop.io.compress.BZip2Codec,
  org.apache.hadoop.io.compress.SnappyCodec,
  com.hadoop.compression.lzo.LzoCodec,
  com.hadoop.compression.lzo.LzopCodec

Demo

First we create an uncompressed table tmp_no_compress:

CREATE TABLE tmp_no_compress ROW FORMAT DELIMITED LINES TERMINATED BY '\n'
AS SELECT * FROM log_text;

Let’s look at the output without compression

Setting compression properties in Hive Shell:

set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;

Then we create a compressed table tmp_compress in the same way, now that the compression properties are set:

CREATE TABLE tmp_compress ROW FORMAT DELIMITED LINES TERMINATED BY '\n'
AS SELECT * FROM log_text;

Let’s look at the output after setting the compression property:

Conclusion

  1. Data compression can occur at several stages: (1) the input data can be compressed; (2) the intermediate data can be compressed; (3) the output data can be compressed.
  2. Hive configuration only decides whether compression is enabled and which codec is used; the codecs themselves are configured in Hadoop, and the actual compression work is done by MapReduce.
  3. For data-intensive tasks, I/O operations and network data transfers can take quite a long time to complete. By enabling compression in Hive, you can improve Hive query performance and save storage space on HDFS clusters.
