Features of mainstream storage formats

File storage layouts fall into three types: row storage, column storage, and hybrid storage. The choice of layout has a decisive impact on the overall performance of the system built on top of it; it directly determines how efficiently data can be loaded into the data warehouse, how quickly user queries are answered, and how well the underlying storage space is utilized.

In row storage, the fields of each record are stored together consecutively: apart from the metadata kept in each data block, every record is compressed and stored row by row. Row storage has two problems. First, many SQL queries touch only some of a record's fields, yet row storage has to read all of the fields before extracting the ones that are needed. Second, even though the data can be compressed, all fields of a record must use the same compression algorithm, so the compression ratio is low and disk space is used inefficiently.

Column storage: all records are split vertically by column, and the values of the same column are stored together consecutively. Column storage has its own problem: only data from the same column is stored together, while typical big-data jobs often need to traverse complete records. When the columns of a record are scattered across different data blocks, possibly on different machine nodes, reassembling a complete record means reading the column data over the network and merging it, which generates a great deal of network traffic. To mitigate this, column storage introduces column families: columns that are frequently used together are put into the same group, so that a lookup can be served from the same data block and unnecessary network transfers are avoided. But this only alleviates the problem. In the figure below, the row layout stores columns A1, B1, and C1 of the first row consecutively before A2, while the column layout stores all values of column A first, followed by column B.

Finally, hybrid storage. Hybrid storage combines the advantages of row and column storage: records are first grouped by row (every few rows form a group), and within each group the values of the same column are stored together consecutively. On the one hand, this layout guarantees, as row storage does, that all fields of a record live on the same machine node; on the other hand, the column-oriented layout within a group lets different columns use different compression algorithms and avoids reading irrelevant columns.
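To make the three layouts concrete, here is a small illustrative sketch (not tied to any real file format; the records and the group size of 2 are made up) showing where the same values end up under row, column, and hybrid layout:

```python
# Illustrative only: how the same records are laid out under the three schemes.
records = [
    {"a": "a1", "b": "b1", "c": "c1"},
    {"a": "a2", "b": "b2", "c": "c2"},
    {"a": "a3", "b": "b3", "c": "c3"},
    {"a": "a4", "b": "b4", "c": "c4"},
]
columns = ["a", "b", "c"]

# Row storage: all fields of a record are adjacent (a1 b1 c1 a2 b2 c2 ...).
row_layout = [r[col] for r in records for col in columns]

# Column storage: all values of a column are adjacent (a1 a2 a3 a4 b1 b2 ...).
column_layout = [r[col] for col in columns for r in records]

# Hybrid storage: rows are first grouped (here: 2 rows per group), then each
# group is written column by column, so a record never spans two groups but
# columns inside a group can still be encoded/compressed independently.
group_size = 2
hybrid_layout = [
    r[col]
    for i in range(0, len(records), group_size)
    for col in columns
    for r in records[i:i + group_size]
]

print(row_layout)
print(column_layout)
print(hybrid_layout)
```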

Both ORC and Parquet use hybrid storage. Note, however, that in English these formats are usually still called columnar file formats, which can give the impression that they are pure column storage.

ORC

The ORC format works as follows: an ORC file consists of several stripes, each containing the raw data, an index, and a stripe footer. The stripe footer records the location of the streams used to scan the data. The index records statistics for each column of the stripe, such as the maximum and minimum values, together with each column's position (offset) inside the data block; the index can also include a Bloom filter on chosen columns (configured with a table property along the lines of "orc.bloom.filter.columns"="cuid", available only in newer Hive versions). This information lets a reader locate the relevant stripe and jump straight to the right position. Finally, the file footer of the whole file records metadata about all the stripes: how many there are, how many records each stripe contains, and per-column statistics such as count, min, max, and sum. When hive.compute.query.using.stats=true is set on the Hive cluster (it is true by default; you can check how the cluster you use is configured), a query that can be answered from this metadata alone returns its result in seconds, with no computation at all.
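For a quick look at this structure from code, pyarrow's ORC reader exposes the stripe and row counts recorded in the footer; a minimal sketch, assuming a local ORC file at /tmp/example.orc (the path and the data are hypothetical):

```python
import pyarrow.orc as orc

# Hypothetical local file; any ORC file written by Hive/Spark will do.
f = orc.ORCFile("/tmp/example.orc")

print("number of stripes:", f.nstripes)  # stripes are the unit of data skipping
print("total rows:", f.nrows)            # answered from the file footer, no scan needed

# Reading a single stripe only touches that stripe's streams on disk.
first_stripe = f.read_stripe(0)
print(first_stripe.num_rows, first_stripe.schema)
```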

Parquet

Parquet's storage structure is divided into three levels, and its metadata is likewise kept at three levels. Each HDFS file consists of several row groups plus a footer. Within a row group, each column is stored as a column chunk, and a column chunk consists of several pages (possibly including a dictionary page). The footer holds the metadata of the row groups, column chunks, and data pages, such as the starting position of every column chunk. Down at the smallest, indivisible granularity, each page also carries a page header with metadata describing the page, such as the number of values and the encoding used. Because Parquet keeps metadata at all three levels, we know that if we sort the data, a reader can skip unneeded files and jump straight to the start of the correct column chunk.
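These three levels of metadata can be inspected directly with pyarrow; a minimal sketch (the file path and the column mentioned in the comment are hypothetical):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("/tmp/example.parquet")  # hypothetical file
meta = pf.metadata                           # file-level footer metadata

print("row groups:", meta.num_row_groups, "rows:", meta.num_rows)

# Row-group level: per-column-chunk offsets and statistics stored in the footer.
rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    stats = col.statistics
    print(col.path_in_schema,
          "starts at byte", col.data_page_offset,
          "min/max:", (stats.min, stats.max) if stats is not None else None)

# Readers use exactly this information to skip whole row groups, e.g.:
# pq.read_table("/tmp/example.parquet", filters=[("event_time", ">=", "2023-01-01")])
```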

Query optimization

Min-max query optimization

Next, let's talk about how understanding the storage format helps us optimize query efficiency. Take ORC as an example: we know that ORC stores the maximum and minimum values of each column in each stripe. At query time, the min-max values and column indexes in ORC's stripe footer can be evaluated against the predicates on the query's filter columns, that is, used to decide whether the data the query wants could be in this file at all. If it cannot be, the whole file is skipped (data skipping). To get the best effect, however, we have to specify a global sort on the relevant column ourselves; ORC will not sort the data for us, since sorting is CPU- and memory-intensive work that the developer has to decide on (unless the last step before the write happens to be, say, a broadcast join that already leaves the data ordered by the join key).
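As an illustration of what "specify the sort yourself" can look like in Spark, here is a hedged sketch (the table name, column name, and partition count are made up) that range-partitions and sorts by the column we want min-max skipping on before writing ORC:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("dw.events")  # hypothetical source table

# Range-partition by the filter column so each output file covers a narrow,
# non-overlapping range of values, then sort inside each file: the stripe-level
# min/max become tight and most files can be skipped at query time.
(df.repartitionByRange(200, "user_id")
   .sortWithinPartitions("user_id")
   .write
   .format("orc")
   .mode("overwrite")
   .saveAsTable("dw.events_sorted"))
```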

You will usually want to shuffle (redistribute) the data before the sorted write. If the column you want to optimize queries on is A, then besides sorting by A you should also distribute by A: every row's value of A goes through a hash function, and rows whose hash values match are written to the same reducer's output file. This ensures that identical values land in the same file and that similar values tend to end up together. If the job also writes to multiple partitions and the partition column is B, you can distribute by B, A, which additionally reduces the number of output files. Once similar data has been packed into the same files, another problem appears: different values of A carry very different volumes of data. After shuffling by A, values with a huge daily volume produce very large files while others stay tiny, i.e. the classic data-skew problem, which has to be handled with the usual skew remedies such as salting. In our case, combining this approach with ORC has been very effective: query time dropped to roughly 20% of the original without significantly increasing the amount of computation. Looking at it from the min-max angle, it is worth thinking about what kind of algorithm could both salt the hash to fix the skew and sort the data at the same time.
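As a sketch of the salting remedy mentioned above (not our exact production job; the table, column names, and bucket count are made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("dw.events")  # hypothetical source table

SALT_BUCKETS = 8  # a hot value of column `a` is spread over up to 8 files

# Distribute by (a, salt): identical values of `a` still land together, but a
# very hot value is split across several reducers/files instead of one huge file.
salted = df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

(salted.repartition("a", "salt")
       .sortWithinPartitions("a")  # keep the min/max of `a` tight inside each file
       .drop("salt")
       .write.format("orc")
       .mode("overwrite")
       .saveAsTable("dw.events_salted"))
```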

Z-order takes it up a notch

The optimization above only helps in the scenario where optimizing a single column benefits most downstream tasks, but it is quite common to have wide tables in which every column may be queried and analyzed. Is there an algorithm that lets us sort on all of those columns at once and still skip data efficiently? That is Z-order. Z-order has been applied in Alibaba Cloud's Delta Lake, and Iceberg also plans to support it.

A space-filling curve is a technique for reducing spatial dimensionality: it maps high-dimensional data to one dimension, and the data is then stored and queried using the transformed index values. A space-filling curve divides a multidimensional space (think of a multidimensional coordinate system) into a large number of grid cells through a finite number of recursive steps; each coordinate point marks the position of a cell, and a continuous curve passing through all the coordinate points fills all the cells. Z-order is one of these space-filling curves; it always fills the space in a Z-shaped order (see Figure 1).

The principle is as follows. Z-order maps the values of several columns of a real data set onto the Z-order curve. For each record in the table, the dimension values (dimension 1, dimension 2, dimension 3, ...) are converted to binary by some function and mapped onto the coordinate axes as (x1 y1 z1, x2 y2 z2, x3 y3 z3, ...) (see Figure 3, where dimension 1 and dimension 2 become x and y). The coordinate point x1 y1 z1 then interleaves the binary values of the individual dimensions into a new binary value, the z-index. Continuing in this way, we thread the coordinate points one by one along the Z-shaped route through the four quadrants, and finally take the smallest and largest z-index of each file as that file's min-max (see Figure 2). A filter condition generates a z-index from its own dimension values and compares it against these ranges to decide whether a file can be skipped.
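A minimal sketch of the bit interleaving that produces a z-index for two dimensions (the 8-bit width and the sample points are arbitrary choices for illustration):

```python
def z_index_2d(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of x and y (both assumed to fit in `bits` bits)
    into a single z-index: x supplies the even bit positions, y the odd ones."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bit i of x -> bit 2i of z
        z |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y -> bit 2i+1 of z
    return z

# Records sorted by z_index_2d(dim1, dim2) follow the Z-shaped curve, and a
# file's smallest/largest z-index can be compared against the z-index of a filter.
points = [(2, 6), (3, 1), (7, 7), (0, 5)]
print(sorted(points, key=lambda p: z_index_2d(*p)))
```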

Z-order means that a data set sorted this way gets usable min-max values for all of the Z-ordered columns, so files can be skipped at query time. In addition, Z-order places similar records next to each other, and similar records have a higher probability of being read together, so the cache hit rate rises as well, which further improves query efficiency. (Arranging data in the order of a space-filling curve improves spatial locality ==> the data is more likely to be found in cache ==> improvements at every level of the memory hierarchy.)

In the following example, a data set describing network traffic has 4 fields. The data is shuffled into 100 files, and two control groups are created: one sorted only by sourceIP, the other sorted by Z-order. Both data sets are then queried with a filter on each of the 4 fields, e.g. sourceIP = xxx or destIP = xxx. In the first group, a filter on sourceIP lets data skipping discard 99 files and read only one, but filters on the other columns can hardly skip anything. With the Z-order sort, data skipping is at least 44% effective on every column and 54% effective on average (versus an average of 25% for the first group).

One thing to weigh, though: the Z-order sort itself can take quite a long time (the more fields you sort on, the longer it takes). You have to judge whether adding a Z-order sort will noticeably speed up downstream queries (if they all do full-table reads there is no point) and whether the job can still meet its timeliness requirements (as the original source puts it: doing the sort operation is slow, but it will make all your queries fast). Also, the more dimensions you feed into the Z-order, the harder it becomes for similar values of any single column to stay adjacent, so choose the Z-order fields carefully.

Storage optimization

The compression algorithms involved

VLC (Variable Length Codes): the more frequently a character or string appears, the shorter the binary code assigned to it.

Huffman Encoding: assuming we know, or can predict reasonably accurately, what fraction of the string to be compressed each character accounts for (if we cannot, we can start by assuming all characters are equally likely and update the probabilities dynamically as the data is read, on the assumption that the data still to come is distributed like the data already processed), we can build a Huffman tree and encode the string with the least possible redundancy (see the figure below).
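A minimal Huffman-coding sketch built on a heap (toy input; this is the textbook construction, not literally what zlib does internally):

```python
import heapq
from collections import Counter

def huffman_codes(data: str) -> dict:
    """Build a Huffman tree from character frequencies and return a char -> bit-string map."""
    # Each heap entry is (frequency, tie_breaker, tree), where tree is a char or a (left, right) pair.
    heap = [(freq, i, ch) for i, (ch, freq) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent subtrees...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))  # ...are merged
        counter += 1

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: go left with 0, right with 1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: assign the accumulated bit string
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

print(huffman_codes("aaaaabbbc"))  # the frequent 'a' gets the shortest code
```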

LZ77: a dictionary-based algorithm that encodes long strings (also called phrases) as short tokens, replacing phrases found in the dictionary with small tokens to achieve compression. In other words, it compresses data by replacing long strings that repeat many times with small tags. The symbols it processes do not have to be text characters; they can be symbols of any size.
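And a toy LZ77 sketch that emits (offset, length, next_char) tokens over a small sliding window (real DEFLATE uses a 32 KB window and then Huffman-codes the tokens):

```python
def lz77_compress(data: str, window: int = 20):
    """Toy LZ77: emit (offset, length, next_char) tokens.
    offset/length point back into the already-seen sliding window."""
    tokens, i = [], 0
    while i < len(data):
        start = max(0, i - window)
        best_off, best_len = 0, 0
        # Find the longest match of the upcoming text inside the window.
        for j in range(start, i):
            length = 0
            while (i + length < len(data)
                   and data[j + length] == data[i + length]
                   and length < window):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        next_char = data[i + best_len] if i + best_len < len(data) else ""
        tokens.append((best_off, best_len, next_char))
        i += best_len + 1
    return tokens

# The repeated phrase "abc" collapses into a single back-reference token.
print(lz77_compress("abcabcabcabcd"))
```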

The default compression format for ORC files is ZLIB and for Parquet files it is SNAPPY, while the format with the highest compression ratio is GZIP, which also builds on the zlib library. ZLIB uses the DEFLATE algorithm, which combines Huffman encoding with LZ77. Roughly speaking, the two algorithms represent the characters/strings inside a column with independent, prefix-free binary codes, where the more frequent a character/string is, the shorter its code; replacing the large number of repeated values in a column with these codes is what compresses the data. I have only covered two scenarios that are relatively easy to explain, because what ZLIB actually does is more complex.
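Since zlib ships with the Python standard library, the combined effect is easy to observe on toy data (the byte strings below are made up):

```python
import os
import zlib

repetitive = b"userid=1234&city=beijing;" * 1000  # ~25 KB of repeated phrases
random_bytes = os.urandom(len(repetitive))         # same size, no repetition

print(len(zlib.compress(repetitive)))    # LZ77 turns the repeats into short back-references
print(len(zlib.compress(random_bytes)))  # barely shrinks: nothing for LZ77/Huffman to exploit
```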

How to improve compression efficiency

For the best compression results, we should try to place similar data in the same file and next to each other, that is, explicitly sort the data before writing. This compresses best because LZ77 builds its dictionary from the data it has just read as it goes. For example, the table dwd_xxx_xxx_xxx_log_hi has 30 fields, 14 of which describe the device and app environment and 8 of which describe event points; sorting this data by cuid gives the best result, since records from the same device share most of those field values.

We should also consider how the number of output files affects compression. Splitting rows with the same values across 10 files compresses quite differently from putting them in a single file: in a single file, each distinct column value needs only one set of binary codes instead of ten. On the other hand, if the final number of reducers/mappers is too small, processing may be slow and over-large files can hurt read performance, so there is a trade-off between compression efficiency and query efficiency.

The numbers below show the storage space (in GB) occupied under several strategies. The raw data of control group 1 was 14.6 GB (it was written by the real-time ETL with default settings, so it can be treated as sorted by timestamp). After reshuffling by a column that is not representative of the data set, the space immediately doubled to 28 GB; shuffling by a more representative column instead brought it back down to 11-13 GB. In comparison 2, starting from the data distributed by code in the previous step, additionally sorting by the representative column cuid reduced the space from 28 GB to about 18 GB, a very noticeable effect. In other words, we can distribute the data by the column whose min-max we want to optimize and, by sorting within the files, keep the space usage close to what it was. The third comparison looks at the number of files finally produced: the more files the data is split into, the more space it occupies.
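The effect of sorting on compression is easy to reproduce in miniature; a sketch with synthetic data (the value set and sizes are made up), comparing zlib on a shuffled versus a sorted "column":

```python
import random
import zlib

# Synthetic "column": 100k values drawn from a small set of device models.
models = [f"model-{i:03d}" for i in range(50)]
column = [random.choice(models) for _ in range(100_000)]

shuffled_bytes = ",".join(column).encode()
sorted_bytes = ",".join(sorted(column)).encode()

# Same content, different on-disk order: the sorted layout gives LZ77
# long runs of identical phrases and compresses noticeably better.
print("shuffled:", len(zlib.compress(shuffled_bytes)))
print("sorted:  ", len(zlib.compress(sorted_bytes)))
```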

```
~28 GB               <-- distribute by code
11.7286 ~ 13.9923 GB <-- distribute by cuid
17.9894 GB           <-- distribute by code + sort by cuid, 128 reducers (128 files)
20.3902 GB           <-- distribute by code + sort by cuid, 256 reducers (256 files)
```

The compression efficiency of data is mainly affected by four factors: first, the compression algorithm, which the storage library generally chooses based on the column's format; second, sorting the columns, since similar data stored together compresses better; third, writing fewer files, because the fewer the files, the more values each code in the compression dictionary can represent; and fourth, when several columns need to be sorted, sorting the low-cardinality columns first and the high-cardinality columns after.

References

Big Data Day Notes: Architecture and Algorithms, Chapter 8, Section 8.4

ORC: community.cloudera.com/t5/Communit…

Parquet official documentation: parquet.apache.org/documentati…

Why sorting affects storage efficiency: stackoverflow.com/questions/6…

Z-order curve: tildesites.bowdoin.edu/~ltoma/teac…

Delta Lake (Z-order section): databricks.com/wp-content/…

LZ77: www.youtube.com/watch?v=zev…