❝
In this article you will learn how Apache Druid stores its data on disk, why that design lets Druid combine data warehousing, full-text retrieval, and time series capabilities, and what an elegant underlying data file structure looks like.
❞
❝
Today’s motto: good software usually starts with imitation.
❞
Those of you who are familiar with Apache Druid or have read previous articles in this series know that Druid combines data warehousing, full-text retrieval, and time series capabilities. So where do these capabilities come from, and what design work did Druid put in to support them?
Druid’s underlying data storage is key to these capabilities. This article walks you through how Druid’s segment files are organized.
Read with these questions in mind:
- What is Druid’s data model like?
- What are the three data structures used to store a dimension column, and what is each one for?
- What components make up a segment file’s identifier?
- How is data laid out inside a segment?
- How does a new segment take effect (become queryable)?
The Segment file
Druid stores data in segment files, which are partitioned by time. In the basic configuration, one segment file is created for each time interval, where the interval is configured via the segmentGranularity parameter of granularitySpec. For Druid to run well under heavy query load, segment files should stay within the recommended 300MB-700MB range. If your segment files are larger than this, consider changing the interval granularity, or partition the data and tweak the targetPartitionSize parameter of partitionsSpec (the default is 5 million rows).
The data structure
The internal structure of a segment file, described below, is essentially columnar: each column’s data is laid out in a separate data structure. By storing each column separately, Druid can reduce query latency by scanning only the columns a query actually needs.
Druid has three basic column types: timestamp columns, dimension columns, and metric columns, as shown in the following figure:
The timestamp and metric columns are simple: under the hood, each is an array of integer or floating-point values compressed with LZ4. Once a query knows which rows to select, it simply decompresses them, pulls out the relevant rows, and applies the required aggregation operators. As with all columns, if a column is not needed by a query, its data is skipped.
Dimension columns are different because they support filtering and grouping operations, so each dimension requires the following three data structures:
- A “dictionary” that maps values (always treated as strings) to integer ids,
- A list of the column’s values, encoded using the dictionary in (1), and
- For each distinct value in the column, a “bitmap” indicating which rows contain that value.
Why are all three data structures needed? The dictionary maps strings to integer ids so that the values in (2) and (3) can be represented compactly. The bitmaps in (3), also known as inverted indexes, allow fast filtering operations (in particular, bitmaps make AND and OR operations cheap). Finally, group by and TopN queries need the list of values in (2); in other words, queries that only aggregate metrics based on a filter never need to touch the list of dimension values stored in (2).
To see these data structures in more detail, consider the “Page” column from the example above; the three data structures that represent this dimension are illustrated below.
1: Dictionary that encodes column values
  {
    "Justin Bieber": 0,
    "Ke$ha":         1
  }

2: Column data
  [0,
   0,
   1,
   1]

3: Bitmaps - one for each unique value
  value="Justin Bieber": [1,1,0,0]
  value="Ke$ha":         [0,0,1,1]
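To make these structures concrete, here is a minimal, self-contained Java sketch (purely illustrative, not Druid’s actual implementation; the metric values are made up) that builds the dictionary, the encoded column values, and the per-value bitmaps for the Page column above, then answers a filtered aggregation by consulting only the bitmap and the selected rows of the metric:

import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

public class DimensionColumnSketch {
    public static void main(String[] args) {
        // Rows of the Page dimension from the example above, plus a made-up metric column
        String[] pageColumn = {"Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha"};
        long[] charsAdded = {1800, 2912, 1953, 3194};  // hypothetical metric values

        Map<String, Integer> dictionary = new LinkedHashMap<>();  // 1: value -> integer id
        int[] columnData = new int[pageColumn.length];            // 2: dictionary-encoded values
        Map<String, BitSet> bitmaps = new LinkedHashMap<>();      // 3: one bitmap per distinct value

        for (int row = 0; row < pageColumn.length; row++) {
            String value = pageColumn[row];
            dictionary.putIfAbsent(value, dictionary.size());
            columnData[row] = dictionary.get(value);
            bitmaps.computeIfAbsent(value, v -> new BitSet()).set(row);
        }

        // Filtering consults only the bitmap: rows matching page = "Ke$ha"
        BitSet matching = bitmaps.get("Ke$ha");

        // Boolean filters are just bitmap operations, e.g. AND of two value bitmaps
        BitSet both = (BitSet) matching.clone();
        both.and(bitmaps.get("Justin Bieber"));  // empty: no row has both values in this single-value column

        // A filtered aggregation then touches only the selected rows of the metric column
        long sum = matching.stream().mapToLong(row -> charsAdded[row]).sum();

        System.out.println("dictionary  = " + dictionary);                              // {Justin Bieber=0, Ke$ha=1}
        System.out.println("column data = " + java.util.Arrays.toString(columnData));   // [0, 0, 1, 1]
        System.out.println("filtered sum = " + sum);                                    // 1953 + 3194 = 5147
    }
}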
Note that the bitmaps differ from the first two data structures: whereas the first two grow linearly with the size of the data (in the worst case), the size of the bitmap section is the product of data size and column cardinality. Compression helps us here, though, because we know that for each row in the column data, only one bitmap has a non-zero entry. This means high-cardinality columns have extremely sparse, and therefore highly compressible, bitmaps. Druid takes advantage of this with bitmap-specific compression algorithms such as Roaring bitmap compression (for those interested).
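To get a feel for why sparse bitmaps compress so well, the sketch below uses the open-source Roaring bitmap library directly (this assumes the org.roaringbitmap:RoaringBitmap artifact is on the classpath; Druid’s own integration is of course more involved):

import org.roaringbitmap.RoaringBitmap;

public class RoaringSketch {
    public static void main(String[] args) {
        // For a high-cardinality dimension, each value matches only a handful of millions of rows,
        // so its bitmap is extremely sparse and serializes to a few dozen bytes.
        RoaringBitmap sparse = RoaringBitmap.bitmapOf(12, 503_117, 7_420_001);
        sparse.runOptimize();
        System.out.println("matching rows    = " + sparse.getCardinality());
        System.out.println("serialized bytes = " + sparse.serializedSizeInBytes());

        // Fast AND/OR across bitmaps is what makes filter evaluation cheap
        RoaringBitmap other = RoaringBitmap.bitmapOf(12, 99);
        System.out.println("AND cardinality  = " + RoaringBitmap.and(sparse, other).getCardinality()); // 1
    }
}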
If a data source uses multi-value columns, the data structures inside the segment file look slightly different. Imagine that, in the example above, the second row were tagged with both the “Ke$ha” and “Justin Bieber” topics. In that case, the three data structures would look like this:
1: Dictionary that encodes column values
  {
    "Justin Bieber": 0,
    "Ke$ha":         1
  }

2: Column data
  [0,
   [0,1],  <--Row value of multi-value column can have array of values
   1,
   1]

3: Bitmaps - one for each unique value
  value="Justin Bieber": [1,1,0,0]
  value="Ke$ha":         [0,1,1,1]
                            ^
                            |
                            |
          Multi-value column has multiple non-zero entries
Note the changes to the second row in the column data and in the Ke$ha bitmap. If a row has more than one value for a column, its entry in the column data is an array of values. Additionally, a row with n values in the column data has n non-zero entries across the bitmaps.
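A tiny sketch (again illustrative only, not Druid code) of how a multi-value row could be indexed: the row’s entry in the column data becomes an array of ids, and the row is set in every corresponding bitmap, which is exactly why the second row above appears in both bitmaps:

import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

public class MultiValueSketch {
    public static void main(String[] args) {
        // Row 1 (the second row) carries both topics, so it contributes to two bitmaps
        String[][] rows = {
            {"Justin Bieber"},
            {"Justin Bieber", "Ke$ha"},
            {"Ke$ha"},
            {"Ke$ha"}
        };

        Map<String, BitSet> bitmaps = new LinkedHashMap<>();
        for (int row = 0; row < rows.length; row++) {
            for (String value : rows[row]) {
                bitmaps.computeIfAbsent(value, v -> new BitSet()).set(row);
            }
        }

        System.out.println(bitmaps.get("Justin Bieber"));  // {0, 1}    -> bitmap [1,1,0,0]
        System.out.println(bitmaps.get("Ke$ha"));          // {1, 2, 3} -> bitmap [0,1,1,1]
    }
}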
Naming conventions
A segment identifier is typically constructed from the data source, the interval start time (ISO 8601 format), the interval end time (ISO 8601 format), and a version number. If the data is additionally sharded beyond a time range, the segment identifier also contains a partition number:
segment identifier = datasource_intervalStart_intervalEnd_version_partitionNum
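As a quick illustration of the convention (a hypothetical helper, not Druid’s actual id-building code):

// Hypothetical helper mirroring the naming convention described above
static String segmentId(String dataSource, String intervalStart, String intervalEnd,
                        String version, int partitionNum) {
    return String.join("_", dataSource, intervalStart, intervalEnd, version,
                       String.valueOf(partitionNum));
}

// segmentId("sampleData", "2011-01-01T02:00:00:00Z", "2011-01-01T03:00:00:00Z", "v1", 0)
//   -> "sampleData_2011-01-01T02:00:00:00Z_2011-01-01T03:00:00:00Z_v1_0"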
Segment file composition
Under the hood, a segment consists of the following files:
- version.bin: four bytes representing the current segment version as an integer. For example, for a v9 segment the bytes are 0x0, 0x0, 0x0, 0x9.
- meta.smoosh: a file storing metadata (filenames and offsets) about the contents of the other smoosh files.
- XXXXX.smoosh: files that store a series of binary data.
These smoosh files are multiple individual files “smooshed” together; consolidating them reduces the number of file descriptors that must be held open. They are at most 2GB each (to match the limit of Java’s memory-mapped ByteBuffer). The smoosh files contain a separate chunk of data for each column, plus an index.drd file with additional metadata about the segment.
There is also a special column, called __time, which is the time column of this segment.
In the code base, the segment has an internal format version. The current segment format version is V9.
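As a small sanity check, the format version can be read straight out of version.bin, which as noted above holds the version as four bytes. A minimal sketch, assuming a segment that has already been unpacked to a local directory (the path is hypothetical):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadSegmentVersion {
    public static void main(String[] args) throws IOException {
        // Hypothetical location of an unpacked segment directory
        Path versionFile = Paths.get("/tmp/unpacked-segment/version.bin");
        byte[] bytes = Files.readAllBytes(versionFile);
        // version.bin holds the format version, e.g. bytes 0x0 0x0 0x0 0x9 for a v9 segment
        int version = ByteBuffer.wrap(bytes).getInt();
        System.out.println("segment format version = " + version);  // 9
    }
}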
Column format
Each column is stored in two parts:
- A Jackson-serialized ColumnDescriptor
- The rest of the column’s binary data
The ColumnDescriptor is essentially an object. It consists of some metadata about the column (what type it is, whether it is multi-valued, and so on), followed by a list of serialization/deserialization logic that can deserialize the rest of the column’s binary data.
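To illustrate the idea only (this is a stripped-down stand-in, not Druid’s actual ColumnDescriptor class), here is a simplified descriptor round-tripped with Jackson; the real descriptor additionally carries the serde list used to decode the column’s binary payload:

import com.fasterxml.jackson.databind.ObjectMapper;

public class ColumnDescriptorSketch {
    // Hypothetical, stripped-down stand-in for the metadata a column descriptor carries
    public static class SimpleColumnDescriptor {
        public String valueType;         // e.g. "STRING", "LONG", "FLOAT"
        public boolean hasMultipleValues;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        SimpleColumnDescriptor descriptor = new SimpleColumnDescriptor();
        descriptor.valueType = "STRING";
        descriptor.hasMultipleValues = true;

        String json = mapper.writeValueAsString(descriptor);
        System.out.println(json);  // e.g. {"valueType":"STRING","hasMultipleValues":true}

        SimpleColumnDescriptor back = mapper.readValue(json, SimpleColumnDescriptor.class);
        System.out.println(back.valueType + ", multi-valued = " + back.hasMultipleValues);
    }
}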
Sharding data
For the same data source, multiple segments may exist for the same time interval; together they form a block for that interval. Sharding is configured via shardSpec, and Druid queries do not complete until the block is complete. That is, if a block consists of 3 segments, for example:
sampleData_2011-01-01T02:00:00:00Z_2011-01-01T03:00:00:00Z_v1_0
sampleData_2011-01-01T02:00:00:00Z_2011-01-01T03:00:00:00Z_v1_1
sampleData_2011-01-01T02:00:00:00Z_2011-01-01T03:00:00:00Z_v1_2
all three segments must be loaded before a query over the interval 2011-01-01T02:00:00:00Z_2011-01-01T03:00:00:00Z can complete.
The exception to this rule is when a linear shard spec is used. Linear shard specs do not enforce “completeness”, so queries can complete even when some shards are not yet loaded into the system. For example, if your real-time ingestion creates 3 segments sharded with a linear shard spec and only 2 of them are loaded, queries simply return results for those 2 segments.
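The rule can be summed up in a toy helper (purely illustrative, not Druid code; the shard spec names are just example strings):

// Purely illustrative: when is a block of shards queryable?
static boolean blockIsQueryable(String shardSpecType, int loadedPartitions, int totalPartitions) {
    if ("linear".equals(shardSpecType)) {
        return true;  // linear shard specs do not enforce completeness: partial results are allowed
    }
    // other shard specs require every partition of the block to be loaded first
    return loadedPartitions == totalPartitions;
}

// blockIsQueryable("linear", 2, 3) -> true   (query returns results for the 2 loaded segments)
// blockIsQueryable("other",  2, 3) -> false  (query waits for the third segment)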
Schema changes
Replacing segments
Druid uniquely identifies a segment by its datasource, interval, version, and partition number. The partition number appears in the segment ID only when multiple segments are created for the same interval. For example, if you have hourly segments but one hour contains more data than a single segment can hold, you can create multiple segments for that hour. These segments share the same datasource, interval, and version, but their partition numbers increase linearly.
foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-01/2015-01-02_v1_1
foo_2015-01-01/2015-01-02_v1_2
In the example segments above, dataSource = foo, interval = 2015-01-01/2015-01-02, version = v1, and the partition numbers are 0, 1, and 2. If at some later point you reindex the data with a new schema, the newly created segments will have a higher version ID.
foo_2015-01-01/2015-01-02_v2_0
foo_2015-01-01/2015-01-02_v2_1
foo_2015-01-01/2015-01-02_v2_2
Druid batch indexing (either Hadoop-based or IndexTask-based) guarantees atomic updates on an interval-by-interval basis. In our example, until every v2 segment for 2015-01-01/2015-01-02 is loaded into the Druid cluster, queries use only the v1 segments. Once all v2 segments are loaded and queryable, every query ignores the v1 segments and switches to the v2 segments. Shortly afterwards, the v1 segments are unloaded from the cluster.
Note that updates spanning multiple segment intervals are atomic only within each interval; they are not atomic across the whole update. For example, suppose you have the following segments:
foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-02/2015-01-03_v1_1
foo_2015-01-03/2015-01-04_v1_2
v2 segments are loaded into the cluster as soon as they are built and replace the v1 segments for the period of time they overlap. Until all v2 segments are fully loaded, the cluster may therefore contain a mixture of v1 and v2 segments.
foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-02/2015-01-03_v2_1
foo_2015-01-03/2015-01-04_v1_2
In this case, queries may hit a mixture of v1 and v2 segments.
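The behavior can be sketched as follows (heavily simplified; Druid’s actual timeline management also handles overshadowing, partial overlaps, and partitions): for each interval, queries go to the highest version that is loaded, so a cluster in the intermediate state above serves v2 for the middle interval and v1 for the other two.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class VersionPickerSketch {
    public static void main(String[] args) {
        // interval -> versions currently loaded for that interval (hypothetical cluster state)
        Map<String, List<String>> loaded = new HashMap<>();
        loaded.put("2015-01-01/2015-01-02", List.of("v1"));
        loaded.put("2015-01-02/2015-01-03", List.of("v1", "v2"));
        loaded.put("2015-01-03/2015-01-04", List.of("v1"));

        // Per interval, queries go to the highest loaded version
        // (version strings here compare lexicographically for simplicity)
        loaded.forEach((interval, versions) -> {
            String chosen = versions.stream().max(String::compareTo).orElseThrow();
            System.out.println(interval + " -> " + chosen);
        });
        // 2015-01-01/2015-01-02 -> v1
        // 2015-01-02/2015-01-03 -> v2
        // 2015-01-03/2015-01-04 -> v1
    }
}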
Segments with different schemas
Segments of the same data source may have different schemas. If a string column (dimension) exists in one segment but not in another, queries involving both segments still work: queries on the segment missing the dimension behave as if the dimension contained only null values. Similarly, if one segment has a numeric column (metric) and another does not, queries on the segment missing that metric generally “do the right thing”: aggregations over the missing metric behave as if the metric does not exist.
Wrapping up
1. Can you now answer the questions from the beginning of the article?
- What is Druid’s data model like? (Timestamp columns, dimension columns, and metric columns)
- What are the three data structures used to store a dimension column, and what is each one for? (Value-to-id dictionary, list of encoded column values, bitmaps)
- What components make up a segment file’s identifier? (datasource, interval, version, partition number)
- How is data laid out inside a segment?
- How does a new segment take effect (become queryable)?
2. Topics to explore further
- What is column storage? What is the difference between column storage and row storage?
- Do you understand Bitmap data structures?
- Dig into the Roaring bitmap compression algorithm.
- How does Druid locate a specific piece of data? What does that process look like in detail?
* Stay tuned: more articles are on the way. Readers interested in Druid can also revisit my earlier articles in this series.
❝
Follow the public account MageByte; starring it and tapping “looking” keeps us motivated to write good articles. Reply “add group” to join our technical exchange group.
❞