As data volumes grow, we often need to work on a subset of the data to speed up analysis, filtering the data set for pattern and trend recognition. Hive currently provides three sampling methods: random sampling, block sampling, and bucket table sampling.
1. Random sampling
Key words: rand() function.
The rand() function provides the randomness for sampling, and the limit keyword caps the number of rows returned. Placing the distribute by and sort by keywords in front of rand() ensures that the data is shuffled randomly in both the mapper and reducer stages.
Examples are as follows:
select * from table_name
where col=xxx
distribute by rand() sort by rand()
limit num;
Alternatively, use the order by keyword:
Examples are as follows:
select * from table_name
where col=xxx
order by rand()
limit num;
In a comparison test on tens of millions of rows, the order by approach to random sampling took noticeably longer, roughly 30 seconds more.
2. Block sampling
Key words: tablesample() function.
- tablesample(n percent): samples roughly n percent of the Hive table's data by size; the result can be stored in a new Hive table. For example, extract 10% of the data from the original Hive table.
Note: during testing it was found that a select with tablesample cannot carry a where condition and does not support subqueries. You can work around this by creating an intermediate table first or by falling back to random sampling; a sketch of the intermediate-table workaround follows the examples below.
select * from xxx tablesample(10 percent)
Note: there must be a space between the number and percent.
- tablesample(nM): samples a fixed amount of data by size, where the unit is megabytes (M).
select * from xxx tablesample(20M)
Note: there must be no space between the number and M.
- tablesample(n rows): samples by row count, where each map task takes n rows. The number of map tasks can be checked with a simple query against the Hive table.
select * from xxx tablesample(100 rows)
Note: there must be a space between the number and rows.
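As noted above, tablesample(n percent) cannot be combined with a where condition or a subquery, so one workaround is to materialize the filtered rows into an intermediate table and sample from that. A minimal sketch, assuming an illustrative table table_name, filter column col, and intermediate table tmp_table_name:
-- materialize the filtered rows first
create table tmp_table_name as
select * from table_name where col = 'xxx';

-- then block-sample from the intermediate table
select * from tmp_table_name tablesample(10 percent);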
3. Bucket table sampling
Key words: tablesample(bucket x out of y [on colname]).
Here x indicates which bucket to sample, numbered starting from 1; colname is the column to sample on; and y is the number of buckets the data is divided into.
In Hive, bucketing hashes a specified field and places each row into a bucket according to the hash value. For example, table table_1 is divided into 100 buckets by id; the algorithm is hash(id) % 100, so records with hash(id) % 100 = 1 are placed in the second bucket. The key clause for creating a bucketed table is CLUSTERED BY.
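A minimal DDL sketch of such a bucketed table, matching the table_1 example above (the column list is illustrative, and the source_table used to load it is hypothetical):
create table table_1 (
    id   bigint,
    name string
)
clustered by (id) into 100 buckets
stored as orc;

-- on older Hive versions, bucketing must be enforced before loading:
-- set hive.enforce.bucketing = true;
insert into table table_1
select id, name from source_table;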
For example, randomly divide the table into 10 buckets and extract the data from the first bucket:
select * from table_01
tablesample(bucket 1 out of 10 on rand())
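If the table is already bucketed on the sampled column, as table_1 above is bucketed on id, sampling on that column lets Hive read only the matching bucket files instead of scanning the whole table. A minimal sketch:
select * from table_1
tablesample(bucket 1 out of 100 on id);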
4. Hive storage format
Hive supports the following storage formats: TEXTFILE, SEQUENCEFILE, ORC, and PARQUET.
5. Row storage and column storage
Figure: the logical table (left) compared with its row-storage layout and its column-storage layout (right).
Characteristics of row storage: when a query needs a whole row that meets the criteria, column storage has to look up the value of every field in its own column block, while row storage only needs to locate one value and finds the rest of the row in adjacent positions, so row storage is faster for select * style queries.
Characteristics of column storage: because the data for each field is stored together, a query that needs only a few fields reads far less data; and because every value in a column has the same data type, compression algorithms can be designed per column more effectively. Queries that select only certain fields are therefore more efficient.
6. TEXTFILE
This is the default format. Data is not compressed, so disk overhead and data-parsing overhead are high. It can be combined with Gzip or Bzip2 compression (Hive detects the compression automatically and decompresses it during queries), but Hive will not split data stored this way, so the data cannot be processed in parallel.
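A minimal sketch of a TEXTFILE table loaded with a gzip-compressed file; the table name, columns, and file path are illustrative:
create table log_text (
    id  bigint,
    msg string
)
row format delimited fields terminated by '\t'
stored as textfile;

-- Hive recognizes the .gz suffix and decompresses during queries,
-- but the gzipped file is not splittable, so a single task reads it
load data local inpath '/tmp/log_data.txt.gz' into table log_text;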
7. The ORC format
Optimized Row Columnar (Orc) is a new storage format introduced in Hive 0.11.
Each ORC file consists of one or more stripes, and each stripe is about 250MB in size. A stripe is roughly equivalent to a row group, but growing the size from 4MB to 250MB improves sequential read throughput. Each stripe consists of three parts: Index Data, Row Data, and Stripe Footer:
- Index Data: a lightweight index, created every 10,000 rows by default. The index only records the offset of each field of each row within Row Data.
- Row Data: stores the actual data; rows are taken in batches and then stored by column, and each column is encoded into multiple streams for storage.
- Stripe Footer: stores the metadata of the stripe.
Each file has a File Footer, which stores the number of rows in each stripe, the data type of each column, and so on. At the very end of each file is a PostScript, which records the compression type of the whole file and the length of the File Footer. When reading a file, the reader first seeks to the end and reads the PostScript, parses the File Footer length from it, then reads the File Footer and parses the metadata of each stripe from it, and finally reads the stripes themselves, from back to front.
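A minimal DDL sketch for an ORC table; orc.compress is a standard ORC table property, while the table and column names are illustrative:
create table log_orc (
    id  bigint,
    msg string
)
stored as orc
tblproperties ("orc.compress" = "SNAPPY");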
8. PARQUET format
Parquet is a columnar storage format for analytics. Developed by Twitter and Cloudera, Parquet graduated from Apache’s incubator as an Apache Top-level project in May 2015.
A Parquet file is stored in binary form and therefore cannot be read directly. The file contains both its data and its metadata, so Parquet files are self-describing.
When Parquet data is stored, the row group size is generally set according to the HDFS block size. Since the minimum unit of data each Mapper task processes is normally one block, each row group can then be handled by a single Mapper task, which increases the parallelism of task execution. The layout of a Parquet file is described below.
A Parquet file can store multiple row groups. The file begins with its Magic Code, which is used to verify that it is a Parquet file; from this value and the length of the file, the offset of the file metadata can be calculated. The file metadata includes the metadata of every row group and the schema of the data stored in the file. In addition to the per-row-group metadata kept in the file metadata, the metadata of each page is stored at the beginning of that page. Parquet defines three types of pages: data pages, dictionary pages, and index pages. A data page stores the values of a column within the current row group; a dictionary page stores the encoding dictionary of that column's values, and each column chunk contains at most one dictionary page; an index page is meant to store the index of the column for the current row group, but index pages are not yet supported by Parquet.
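A minimal DDL sketch for a Parquet table; parquet.compression is a commonly used table property for setting the compression codec, and the table and column names are illustrative:
create table log_parquet (
    id  bigint,
    msg string
)
stored as parquet
tblproperties ("parquet.compression" = "SNAPPY");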