In this article, we’ll take a closer look at what CarbonData brings to the evolution of storage formats. In 2016 Huawei open-sourced a Parquet-like columnar storage format, CarbonData, and contributed it to the Apache community, where it graduated to a top-level project in less than a year. CarbonData is the first Apache top-level project led by a Chinese company (Kylin, which originated at eBay, was the first top-level open source project led by Chinese developers). I would like to salute the Huawei team for completing such a breakthrough from 0 to 1. This article tries to tease out, in technical detail, what makes CarbonData different from its predecessors and what we can learn when we actually use and design storage formats.
First, let’s take a look at the positioning of CarbonData itself, as shown below:
·1. Support full scans over massive data sets that read only a few columns;
·2. Support lookups by primary key with second-level response times;
·3. Support interactive, OLAP-style queries over massive data involving many filter conditions; this kind of workload should respond within a few seconds;
·4. Support quickly extracting individual records and retrieving all column values of those records;
·5. Support HDFS and integrate seamlessly with the Hadoop ecosystem; distributed by design.
For OLAP workloads there are many different types of queries, and different storage layouts favor different query types. CarbonData therefore positions itself as a general-purpose storage format that, together with Spark SQL, tackles queries over massive data while integrating seamlessly with the Hadoop ecosystem. CarbonData was initially used with Spark SQL and Spark DataFrames, and was later brought to Presto by Ctrip and to Hive by Didi.
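As a concrete illustration of that Spark integration, here is a minimal sketch based on the CarbonData 1.x CarbonSession API; the store path, table name, columns, and CSV path are placeholders of my own, not taken from the original article:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._   // adds getOrCreateCarbonSession to the builder

// Create a Spark session that knows where the CarbonData store lives.
val carbon = SparkSession.builder()
  .master("local[*]")
  .appName("CarbonDataQuickStart")
  .getOrCreateCarbonSession("/tmp/carbon/store")   // hypothetical store location

// CarbonData 1.x DDL: the table is stored in the CarbonData format.
carbon.sql(
  """CREATE TABLE IF NOT EXISTS sales (
    |  id     STRING,
    |  city   STRING,
    |  amount INT
    |) STORED BY 'carbondata'
  """.stripMargin)

// Bulk-load a CSV file, then run an ordinary Spark SQL query against it.
carbon.sql("LOAD DATA INPATH '/tmp/sales.csv' INTO TABLE sales")
carbon.sql("SELECT city, sum(amount) FROM sales WHERE city = 'shenzhen' GROUP BY city").show()
```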
In fact, whether for multidimensional OLAP queries, full-scan queries, or partial range queries, CarbonData’s predecessors ORCFile and Parquet can already do the job. So what makes CarbonData special as a newcomer?
The following figure shows the actual test data provided by Huawei. CarbonData’s performance was slightly better than Parquet’s in the vast majority of test scenarios.
Of course, faster queries come at a price: the price here is a lower compression ratio and a longer data-loading time.
What follows is a detailed discussion of the logical relationship between CarbonData’s performance and the underlying design.
The following figure shows CarbonData’s data storage format:
·File Header: the format of the File Header is relatively simple; it stores the version and schema information of the storage format. (This part is usually stable and immutable.)
·Blocklet: a single Blocklet has a maximum size of 64 MB, which means one HDFS Block can hold multiple Blocklets (depending on the Block size). This is in line with the design of ORCFile and Parquet, all of which use the PAX storage model to optimize query performance.
·File Footer: stores the indexes and statistics of the stored data at the end of the file. The indexes are a key part of CarbonData’s implementation and greatly improve its performance across different query scenarios.
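To make the layout easier to picture, here is a purely conceptual sketch in Scala. The real on-disk structures come from CarbonData’s own format definitions; the types and field names below are my own simplification, not the project’s actual classes:

```scala
// Conceptual sketch only: a CarbonData file is a small header, a run of
// Blocklets, and a footer carrying the indexes and statistics.
case class FileHeader(version: Int, columnSchema: Seq[String])

case class ColumnMinMax(column: String, min: Array[Byte], max: Array[Byte])

case class Blocklet(
  columnChunks: Seq[Array[Byte]],   // column-wise data pages, up to ~64 MB per Blocklet
  stats: Seq[ColumnMinMax]          // per-column min/max used for pruning
)

case class FileFooter(
  blockletIndex: Seq[ColumnMinMax], // index entries over all Blocklets (organized as a B+ tree)
  blockletOffsets: Seq[Long],       // where each Blocklet starts in the file
  numRows: Long
)

case class CarbonFile(header: FileHeader, blocklets: Seq[Blocklet], footer: FileFooter)
```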
CarbonData supports two-level indexing, which greatly improves its query performance.
CarbonData builds indexes at both the HDFS Block level and the internal Blocklet level, which avoids a great deal of unnecessary task startup and disk I/O. We all know that introducing indexes can speed up queries, but there is no such thing as a free lunch; by now readers probably have their own answer to why CarbonData’s compression ratio drops and its data-import time grows.
As you can see, at the end of a CarbonData file the index is implemented as a B+ tree. Given the append-only nature of HDFS, it should also be clear why this index data and these statistics have to be stored at the end of the file.
The figure above shows the complete flow of evaluating a filter query. With the two-level index, this flow avoids a large amount of unnecessary scanning and I/O, and the resulting performance gain is very noticeable.
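To make that flow concrete, here is a minimal, purely illustrative sketch of min/max pruning at the two index levels. The class names and integer statistics are my own simplification, not CarbonData’s actual implementation:

```scala
// Two-level pruning: first skip whole Blocks (files) whose index shows the
// filter value cannot be present, then skip Blocklets inside surviving Blocks.
case class MinMax(min: Int, max: Int)
case class BlockletIndexEntry(blockletId: Int, stats: MinMax)
case class BlockIndex(blockPath: String, stats: MinMax, blocklets: Seq[BlockletIndexEntry])

def prune(blocks: Seq[BlockIndex], value: Int): Seq[(String, Seq[Int])] =
  blocks
    .filter(b => value >= b.stats.min && value <= b.stats.max)      // Block-level pruning: skipped Blocks never even launch a task
    .map { b =>
      val hits = b.blocklets
        .filter(e => value >= e.stats.min && value <= e.stats.max)  // Blocklet-level pruning: only these are read from disk
        .map(_.blockletId)
      (b.blockPath, hits)
    }
    .filter(_._2.nonEmpty)
```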
Compared with the relatively lightweight summary indexes of ORCFile and Parquet, CarbonData puts a great deal of effort into indexing. That is how it surpasses its predecessors, but the design choice also comes at an extra price.
Global dictionary encoding is a controversial feature of CarbonData: it was applied by default in earlier versions and became optional in version 1.3. (A friend of the author who works in Huawei’s Gauss department joked that global dictionary encoding has a few ‘pitfalls’ in production.) So using dictionary encoding well really is worth discussing, and the author will briefly go through it here:
As shown in the figure above, global dictionary encoding simply replaces recurring values in a table with integer IDs plus a dictionary. The benefits are obvious (a small sketch of the idea follows the list below):
· Greatly reduces the amount of data needed to store the table;
· For fields that are frequently grouped by, global dictionary encoding can greatly reduce the amount of data shuffled during computation, which improves performance.
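Here is a minimal sketch of the idea in plain Scala, independent of CarbonData’s actual implementation; the column values are made up for illustration:

```scala
// Dictionary encoding in miniature: assign each distinct value an integer ID,
// store only the IDs in the column, and keep the dictionary to decode them.
def dictionaryEncode(column: Seq[String]): (Map[String, Int], Seq[Int]) = {
  val dictionary = column.distinct.zipWithIndex.toMap  // value -> surrogate key
  val encoded    = column.map(dictionary)              // small integers instead of strings
  (dictionary, encoded)
}

// A low-cardinality column compresses very well:
val (dict, ids) = dictionaryEncode(Seq("male", "female", "male", "male", "female"))
// dict = Map(male -> 0, female -> 1), ids = Seq(0, 1, 0, 0, 1)
// A group by can now aggregate on the integer IDs, shuffling far less data.
```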
However, when importing data into CarbonData, building a global dictionary on columns with a low repetition rate (high cardinality) noticeably slows down data loading and hurts the compression ratio. For columns with a high repetition rate, such as gender or age, building a global dictionary greatly improves CarbonData’s compression without affecting the import rate.
The author suggests analyzing and testing dictionary encoding against your specific business scenario before settling on how to use it (see the sketch below for how the choice can be made explicit per column); applying it blindly can actually be a negative optimization for performance.
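As a hedged example of making that choice explicit, CarbonData 1.x exposes table properties for selecting which columns get dictionary encoding. The table and column names below are illustrative, and `carbon` is the session from the earlier sketch:

```scala
// Only low-cardinality columns are listed in DICTIONARY_INCLUDE, so a
// high-cardinality column such as user_id is left unencoded.
carbon.sql(
  """CREATE TABLE IF NOT EXISTS user_profile (
    |  user_id STRING,
    |  gender  STRING,
    |  age     INT,
    |  city    STRING
    |) STORED BY 'carbondata'
    |TBLPROPERTIES ('DICTIONARY_INCLUDE'='gender,city')
  """.stripMargin)
```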
That is roughly the author’s understanding of CarbonData’s storage structure and the thoughts that came out of some simple hands-on practice. As the first Apache top-level project led by a Chinese company, CarbonData is something I will keep following and learning from, and I hope Chinese programmers continue to expand their influence in the open source community.