Global dictionary explanation
Why do we need a global dictionary?
In OLAP data analysis, count distinct is a very common requirement. Depending on the accuracy required of the result, it can be divided into approximate deduplication and exact deduplication.
With large data sets, it is challenging to be both accurate and responsive to queries. The most common approach to exact deduplication is the bitmap: for integer data, we can record the values directly in a bitmap, but real data also contains other types, such as strings. To deduplicate those exactly, we first need to build a dictionary that maps each value to an integer, and then do the counting with a bitmap.
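The idea can be sketched on a single node as follows. This is only an illustration of dictionary-plus-bitmap counting, not Kylin's actual code; the class and method names are made up for the example.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Minimal single-node sketch: map each string to an integer ID via a
// dictionary, set that bit in a bitmap, and read the exact distinct
// count off the bitmap's cardinality.
public class DictBitmapSketch {
    private final Map<String, Integer> dict = new HashMap<>();
    private final BitSet bitmap = new BitSet();

    // Assign a stable integer ID (starting from 1) to each distinct string.
    public int encode(String value) {
        Integer id = dict.get(value);
        if (id == null) {
            id = dict.size() + 1;
            dict.put(value, id);
        }
        return id;
    }

    public void add(String value) {
        bitmap.set(encode(value));
    }

    public int countDistinct() {
        return bitmap.cardinality();
    }

    public static void main(String[] args) {
        DictBitmapSketch s = new DictBitmapSketch();
        for (String v : new String[]{"a", "b", "a", "c", "b"}) s.add(v);
        System.out.println(s.countDistinct()); // prints 3
    }
}
```

The point of the dictionary step is that a bitmap can only index integers; once every string has a stable integer code, the rest reduces to the integer case.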
We all know that Kylin uses precomputation to accelerate big data analysis. When a Cube is built incrementally, all segments in the Cube must use the same dictionary, that is, the global dictionary, to avoid the errors that would result from building separate dictionaries for different segments.
Evolution of global dictionaries
Kylin has supported global dictionaries since version 1.5.3, but the way they were built had significant flaws:
The global dictionary was built at a single point on the Job Server, so as data accumulated, the build time became uncontrollable. Kylin 3.1 addressed this problem with a distributed global dictionary build; for more information, see the Kylin distributed global dictionary documentation. However, to fit the new build and query engine, Kylin 4.0 implements yet another way of building the global dictionary, based on Spark. Today we will describe in detail how Kylin 4.0 implements the global dictionary.
Global dictionary based on Spark
Kylin 4.0 builds the global dictionary with distributed encoding on Spark, which reduces the pressure on any single node and lets the number of dictionary entries exceed the maximum integer value.
Design and Principles
Structure of the global dictionary
- Each build task generates a new global dictionary
- The dictionary for each new build task is saved by version number, and the old global dictionary is gradually removed
- A dictionary consists of a metadata file and multiple dictionary files; each dictionary file is called a bucket
- Each bucket holds two mappings (Map&lt;String, Long&gt;), which store the relative and the absolute encodings of the dictionary values, respectively
Bucket concept and transformation
Kylin introduces the concept of buckets, which can be understood as dividing the data into several buckets (i.e., partitions) for parallel processing. On the first build, the values in each bucket are encoded starting from 1; after all buckets are encoded, each value's final dictionary code is derived by adding its bucket's offset. In the code, the two encodings are kept in two HashMaps: one stores the codes relative to the bucket, and the other stores the absolute codes across all buckets.
Each build creates a new version of the buckets (v1, v2, v3, and so on); the reason for version control is explained later. Curr (current) and Prev (previous) are the two HashMaps within a bucket: Curr stores the relative codes assigned to dictionary values in the current build, and Prev stores the absolute codes of all dictionary values encoded by previous builds.
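The two-phase bucket encoding described above can be sketched as follows. The class and field names are illustrative, not Kylin's actual classes: relative codes start at 1 inside each bucket, and absolute codes are obtained by adding the bucket's global offset once every bucket's size is known.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of one bucket's encoding state.
public class BucketEncodingSketch {
    // Prev: absolute codes from earlier builds; Curr: relative codes of this build.
    public final Map<String, Long> prev = new HashMap<>();
    public final Map<String, Long> curr = new HashMap<>();

    // Phase 1: assign relative codes (starting from 1) to values that
    // have never been encoded before.
    public void encodeRelative(List<String> values) {
        for (String v : values) {
            if (!prev.containsKey(v) && !curr.containsKey(v)) {
                curr.put(v, (long) curr.size() + 1);
            }
        }
    }

    // Phase 2: once the offsets of all preceding buckets are known,
    // turn relative codes into absolute ones.
    public Map<String, Long> toAbsolute(long bucketOffset) {
        Map<String, Long> absolute = new HashMap<>(prev);
        curr.forEach((v, rel) -> absolute.put(v, bucketOffset + rel));
        return absolute;
    }
}
```

Splitting the work this way is what makes the encoding parallelizable: buckets assign relative codes independently, and only the small per-bucket offsets need to be exchanged afterwards.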
Build steps
- Use Spark to create a flat table and obtain the distinct values of the columns that require exact count distinct
- Determine the number of buckets from the number of distinct values, and decide whether the existing buckets need to be expanded
- Repartition the data into multiple partitions, encode them separately, and store them in their respective dictionary files
- Assign a version number to the current build task
- Save the dictionary files and the metadata (the number of buckets and each bucket's offset)
- Delete an earlier version based on conditions
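In Kylin the repartitioning in step 3 is done by Spark, but the underlying idea, hash-partitioning the distinct values into buckets so they can be encoded independently, can be sketched single-node like this (the method name is illustrative):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Sketch of step 3: hash-partition distinct values into buckets so each
// bucket can be encoded in parallel. Spark does this with a partitioner;
// this is the same idea in plain Java.
public class RepartitionSketch {
    public static List<List<String>> repartition(Collection<String> distinctValues, int numBuckets) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) buckets.add(new ArrayList<>());
        for (String v : distinctValues) {
            // floorMod keeps the bucket index non-negative for any hashCode.
            buckets.get(Math.floorMod(v.hashCode(), numBuckets)).add(v);
        }
        return buckets;
    }
}
```

Because the same value always hashes to the same bucket, no value can receive two different codes from two different buckets.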
First build
- Calculate the bucket count: take the larger of the default bucket count and the number of distinct values divided by the per-bucket threshold
- Buckets are created and data is allocated for encoding
- Generate meta files to record the offsets of the bucket
The following are the configuration items and their default values.
```
kylin.dictionary.globalV2-min-hash-partitions=10
kylin.dictionary.globalV2-threshold-bucket-size=500000
```
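From the two settings above, the first-build bucket count can plausibly be computed as the larger of the minimum partition count and the distinct count divided by the per-bucket threshold. The sketch below shows that calculation; the exact formula in Kylin's source may differ.

```java
// Sketch of bucket sizing from the two config values above.
public class BucketSizing {
    // kylin.dictionary.globalV2-min-hash-partitions
    static final int MIN_HASH_PARTITIONS = 10;
    // kylin.dictionary.globalV2-threshold-bucket-size
    static final long THRESHOLD_BUCKET_SIZE = 500_000;

    public static int bucketCount(long distinctCount) {
        // Ceiling division: how many buckets are needed so that no bucket
        // exceeds the threshold.
        long needed = (distinctCount + THRESHOLD_BUCKET_SIZE - 1) / THRESHOLD_BUCKET_SIZE;
        return (int) Math.max(MIN_HASH_PARTITIONS, needed);
    }
}
```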
Subsequent builds
- Determine whether buckets need to be expanded based on the number of dictionaries
- If expansion is needed, redistribute the already encoded dictionary values across the expanded buckets
- Read the latest version of dictionary data and distribute it to each bucket
- Assign the new value to the bucket
- The dictionary values from the previous build will not change
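The key invariant of a subsequent build, stated in the last bullet, is that previously assigned codes never change; only values not seen before get new codes, appended after the current maximum. An illustrative sketch (names are made up, and the real implementation does this per bucket rather than globally):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: on a non-initial build, keep every existing code unchanged
// and assign new values codes after the current maximum.
public class IncrementalDictSketch {
    public static Map<String, Long> rebuild(Map<String, Long> previous, List<String> newValues) {
        Map<String, Long> dict = new HashMap<>(previous);
        long next = previous.values().stream()
                .mapToLong(Long::longValue).max().orElse(0L);
        for (String v : newValues) {
            if (!dict.containsKey(v)) dict.put(v, ++next);
        }
        return dict;
    }
}
```

Keeping old codes stable is what allows segments built at different times to share one bitmap space: a code written into a segment's bitmap last month still means the same value today.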
Version control
The global dictionary provides isolation by assigning a timestamp-based version number to each build. Versioning is needed because build tasks can run concurrently, while the current code for building global dictionaries does not support concurrency. With version control, each encoding pass can read the complete, previously built global dictionary, which ensures that the latest version always holds the most complete encoding, and a Cube's global dictionary is always read from the latest version. The dictionary is ultimately stored by version on the file storage system (HDFS in this case), as shown in the following figure.
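Picking the latest version on read then amounts to choosing the directory with the largest timestamp-based name. A sketch (directory names are illustrative; Kylin keeps these on HDFS):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Sketch: version directories are named by build timestamp, so the
// reader simply picks the numerically largest one.
public class VersionSelectSketch {
    public static Optional<String> latestVersion(List<String> versionDirs) {
        return versionDirs.stream()
                .max(Comparator.comparingLong(Long::parseLong));
    }
}
```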
Q&A
- Why do I need two maps in a BucketDictionary?
At the start of construction, the dictionary values assigned to each bucket are given a relative code, starting from 1; these relative codes are stored in one HashMap. Once relative encoding finishes, each bucket's offset, that is, the number of dictionary values it holds, is known. Because the buckets are ordered, each bucket can then compute the absolute code of its values by adding up the offsets of all preceding buckets; these absolute codes are stored in the second HashMap.
- Will there be a data skew problem?
Hot spots that block a build are rare, and skew at the billion-row scale can generally be handled. Very high-cardinality columns can indeed cause this problem, but the number of buckets can be increased almost without limit, so unless there is a single hot key, the build can usually be completed simply by tuning the parameters.
- Why can the number of global dictionaries exceed the maximum cardinality limit for integers (2.1 billion)?
Because Kylin 4.0 introduces a new bitmap data structure, Roaring64Bitmap: after the global dictionary encoding is complete, the codes are compressed into a Roaring64Bitmap object, and this bitmap stores values as Long rather than Integer.
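Roaring64Bitmap comes from the third-party RoaringBitmap library, so the sketch below stands in for it with a plain set of Long keys; the only point being illustrated is that 64-bit codes lift the roughly 2.1 billion ceiling of 32-bit integers.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: with 64-bit codes, dictionary IDs can exceed Integer.MAX_VALUE.
// A Set<Long> stands in here for Roaring64Bitmap (a third-party library
// that stores and compresses long values far more efficiently).
public class LongCodeSketch {
    public static boolean fitsInInt(long code) {
        return code <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        Set<Long> bitmap = new HashSet<>();
        long code = (long) Integer.MAX_VALUE + 1; // beyond the ~2.1 billion limit
        bitmap.add(code);
        System.out.println(fitsInInt(code)); // prints false
    }
}
```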