Hologres Revealed: Cloud Native Storage Engine

Hologres (Chinese name) interactive analysis is ali cloud from the research on the number of one-stop real-time warehouse, the cloud native system combines real-time service and analysis of large data, fully compatible with PostgreSQL deal with large data ecological seamless get through, can use the same set of data architecture also supports real-time written real-time query and real-time offline federal analysis. Its emergence simplifies the architecture of the business, and at the same time provides the ability for the business to make real-time decisions, so that big data can play a greater role in the business value. From the birth of Ali Group to the commercialization on the cloud, with the development of business and the evolution of technology, Hologres is also continuously optimizing the competitiveness of its core technology. In order to make you better understand Hologres, we plan to continue to launch a series of revealing the underlying technical principles of Hologers, from high-performance storage engine to efficient query engine. High throughput write to high QPS query, a full range of interpretation of Hologers, please continue to pay attention! “Alibaba Hologres: A Cloud-Native Service for Hybrid Serving/Analytical Processing”, VLDB, 2020

In this issue, we’ll look at the storage engine for Hologers

1. Background introduction

MaxCompute Interactive Analytics (Hologres) is a Hybrid ad-serving/Analytical analytics integrated system developed by Ali Cloud. It integrates real-time services and analyzing big data scenarios. Full PostgreSQL compatibility and seamless integration with the big data ecosystem. Its emergence simplifies the architecture of the business, while providing the ability to make decisions in real time, allowing big data to exert greater business value. For a more detailed introduction to the architecture, see the VLDB paper at the end of this article.

Compared with traditional big data and OLAP systems, HSAP systems face the following challenges:

High concurrent mixed workloads: HSAP systems face concurrent queries far beyond traditional OLAP systems. In practice, the concurrency of data services goes well beyond OLAP queries. For example, we see in real-world applications that data services need to process tens of millions of queries per second, which is five orders of magnitude higher than the concurrency of OLAP queries. At the same time, compared with OLAP queries, data service queries have more stringent requirements for latency. Complex mixed query loads have very different tradeoffs between latency and throughput on the system. How to efficiently use the system’s resources while handling these very different queries and ensuring each SLO is a huge challenge.
High throughput real-time data import: While handling high concurrent query load, the HSAP system also needs to handle massive real-time data import. Synchronized data from traditional OLTP is only a small part of this, but a large amount of data comes from logging and other systems that do not have strong transaction semantics. The amount of data imported in real time is far greater than traditional HTAP or OLAP systems. Another difference between OLAP and traditional OLAP systems is the high requirement of real-time data. The imported data needs to be visible at the second level or even sub-second level, so as to ensure the timeliness of our service and analysis results.
Resilience and scalability: Data import and query load may have sudden spikes, which puts high requirements on the flexibility and scalability of the HSAP system. In the real world, we notice that the data import peak can be 2.5 times the average, and the query peak can be 3 times the average. Peaks in data import and query may not occur at the same time, which also requires the ability of the system to quickly adjust to different peaks.

Based on the background of appeal, we have developed a Storage Engine, which is mainly responsible for managing and processing data, including the methods of creating, querying, updating and deleting (CRUD) data. The design and implementation of the storage engine provides the capability of high throughput, high concurrency, low latency, flexibility and scalability required by the HSAP scenario. According to Alibaba Group’s business and the needs of customers on the cloud, we continue to innovate and refine, develop to today, can support a single table PB level storage, and perfectly support Tmall double 11 core scene 100 billion level point query and 10 million level real-time complex query.

Next, we will make a detailed introduction to the underlying storage engine of Hologres, and introduce the specific implementation principle and technical highlights of the storage engine landing Hologres.

Second, data model

The basic abstraction of the Hologres storage engine is the distributed tables, which we need to shard in order to make the system scalable. In order to support scenarios such as JOINs and multi-table updates more efficiently, users may need to store several related tables together, and for this reason Hologres introduces the concept of Table groups. The sharding strategy A set of identical tables constitutes a table group, and all tables in the same table group have the same number of sharding. Users can specify the number of shards in a table by “SHARD_COUNT” and the sharded columns by “DISTribution_KEY”. Currently we only support Hash sharding.

The data storage format of the table is divided into two types, one is row storage table, the other is column storage table, and the format can be specified through “orientation”.

The records in each table are stored in a certain order, which can be specified by the user through “clustering_key”. If no sequence is specified, the storage engine automatically sorts it in the order in which it was inserted. Choosing the right row sequence can greatly optimize the performance of some queries.

Tables can also support a variety of indexes, and we currently support dictionary indexes and bitmap indexes. Users can specify the columns to be indexed by “Dictionary_Encoding_Columns” and “Bitmap_Columns”.

Here’s an example:

This example creates two tables, LINEItem and Orders. Since the LINEItem table also specifies a PRIMARY KEY, the storage engine automatically builds an index to ensure that the PRIMARY KEY is unique. The user puts the two tables in the same table group by specifying “colocate_with”. This table group is divided into 24 shards (as specified by shard_count). LineItem will be sharded based on the data value of L_OrderKey, and Orders will be sharded based on the data value of O_OrderKey. The L_ShipInstruct field for LineItem and the O_OrderStatus field for Orders will create the dictionary. The L_OrderKey, L_LineNumber, and L_ShipInstruct fields of LineItem and the O_OrderKey, O_CustKey, and O_OrderStatus fields of Orders will create bitmap indexes.

Three, storage engine architecture

1) Overall structure

Each Shard (Table Group Shard) constitutes a storage management and Recovery Unit. The figure above shows the basic architecture of a shard. A shard consists of multiple tablets that share a write-ahead Log (Wal). The storage engine uses a technique called Log-Structured Merge (LSM) in which all new data is inserted in an APPEND-ONLY format. The data is written to the MemTable where the tablet is located, accumulated to a certain size, and then written to a file. When a data file is closed, its contents remain unchanged. The new data and subsequent updates are written to the new file. Compared with the B± Tree data structure of the traditional database, LSM reduces the random IO and greatly improves the write performance.

As writes come in, each tablet will accumulate many files. When a tablet accumulates to a certain number of small files, the storage engine compactions them in the background. This reduces the need for the system to open many files at the same time, reducing the use of system resources, and more importantly, improving read performance by reducing the number of files combined.

In DML functionality, the storage engine provides a single or batch create, query, update, and delete (CRUD operations) access method interface through which the query engine can access the stored data.

2) Storage engine components

The following are the important components of the storage engine:

Wal and Wal Manager Wal Manager is used to manage log files. Storage engines use pre-written logging (WAL) to ensure atomicity and persistence of data. When a CUD operation occurs, the storage engine first writes to a WAL and then to the MemTable of the corresponding tablet. When the MemTable has grown to a certain size or has reached a certain time, it will switch the MemTable to an immutable flushing MemTable. And open a new MemTable to receive new write requests. And the immutable flushing MemTable can be brushed to the disk to become an immutable file; When an immutable file is generated, the data is considered persistent. When the system crashes after an error, the system restarts to read the logs from WAL and recover the data that has not been persisted. Wal Manager will delete a log file only when the corresponding data is persisted.
Each tablet stores data in a set of files that are stored in the DFS (Aliba Pangu or Apache HDFS). I’m going to store my row files in Sorted String Table (SST) format. The column file supports two storage formats: a self-developed PAX-like format and an improved version of Apache’s ORC format (based on Aliorc with many optimizations for the Hologres scenario). Both of these column formats are optimized for file scanning scenarios.
In order to avoid the need to fetch IO from a file for each Read, the storage engine uses the BlockCache to store frequently used and recently used data in memory. This reduces unnecessary IO and speeds up Read performance. Within a node, all shards share a Block Cache. Block Cache has two types of obsolescence strategies: LRU (Least Recently Used,) and LFU (Least Frequently Used,). As the name implies, LRU algorithm eliminates the blocks that have been unused for the longest time first, while LFU eliminates the blocks that have been accessed least in a certain period of time first.

3) Principle of reading and writing

Hologres supports two types of writes: single-shard writes and distributed batch writes. Both types of writes are Atomic, that is, writes or rollbacks. Single-shard writes update one shard at a time, but need to support extremely high write rates. Distributed batching, on the other hand, is used for scenarios where large amounts of data are written to multiple shards as a single transaction, and is typically performed at a much lower frequency.

Single shard write

As shown in the figure above, after receiving a single-shard write request, the WAL manager (1) assigns a Log Sequence Number (LSN) to the write request, which consists of a timestamp and an incremental Sequence Number, and (2) creates a new Log and persists this Log in the file system. This log contains the information needed to recover the write operation. You don’t commit a write to the tablet until the log is fully preserved. After that, (3) we perform the write in the MemTable of the corresponding tablet and make it visible to the new read request. It’s worth noting that updates on different tablets can be parallelized. When a MemTable is full, (4) flashes it to the file system and initializes a new MemTable. Finally, (5) the multiple shard files are combined asynchronously in the background (CompAction). At the end of the merge or the Memtable refresh, the metadata file that manages the Tablet is updated accordingly.

Distributed batch writing

The foreground node that receives the write request distributes the write request to all relevant shards. These shards guarantee write atomicity for distributed batch writes through a two-phase Commit mechanism.

Version to read more

Hologres supports reading data in multiple versions of the tablet. The read request consistency is read-your-writes, which means that the client always sees its most recently committed write. Each read request contains a read timestamp that is used to construct the Snapshot LSN of the read. If there is a row of data whose LSN is larger than the snapshot LSN record, the row will be filtered because it was inserted into the tablet after the snapshot was read.

IV. Technical highlights of Hologres storage engine

1) Separation of storage and computation

The storage engine adopts the architecture of separation of storage and computation, and all data files are stored in a distributed file system (DFS, such as Alibaba Pangu or Apache HDFS). When the query load becomes large and more computing resources are needed, the computing resources can be expanded separately. When the amount of data grows rapidly, the storage resources can be rapidly expanded independently. The architecture that computing nodes and storage nodes can be independently expanded ensures that resources can be expanded quickly without waiting for data to be copied or moved. Furthermore, you can take advantage of DFS’s multi-copy mechanism to ensure high availability of data. This architecture not only greatly simplifies the operation and maintenance, but also provides a great guarantee for the stability of the system.

2) Asynchronous execution of the process

The storage engine adopts the pure asynchronous execution architecture based on event triggering and non-blocking, which can give full play to the processing capacity of modern CPU’s multi-core, improve the throughput, and support high concurrent writing and query. This architecture benefits from HOS (HOLOOS) framework. HOS not only provides efficient asynchronous execution and concurrency capabilities, but also can automatically do CPU load balancing to improve the utilization of the system.

3) Unified storage

In the HSAP scenario, there are two types of query patterns: simple point queries (data service ad-serving scenarios) and complex queries that scan large amounts of data (Analytical scenarios). Of course, there are many queries that fall somewhere in between. These two query modes put forward different requirements for data storage. Row storage can support point queries more efficiently, while column existence has a significant advantage over queries that support a large number of scans.

In order to be able to support various query modes, uniform real-time storage is important. The storage engine supports both row and column storage formats. Depending on the user’s needs, a tablet can be a storage format for rows (suitable for the ad-serving scenario); It can also be a column-based storage format (for Analytical scenarios). For example, in a typical HSAP scenario, many users will store their data in column storage format to facilitate large-scale scanning for analysis; At the same time, the index of the data is stored in the row storage format, which is easy to click. We also define Primary Key Constraint (which we implemented with row saving) to prevent data duplication. The interface for reading and writing is the same regardless of whether rows or columns are used at the bottom level. The user has no awareness of the interface and can only specify it when the table is being built.

4) Read/write isolation

The storage engine adopts the semantics of Snapshot Read. When reading data, the state of the data at the beginning of reading is adopted. There is no need for data lock, and the read operation is not blocked by the write operation. When a new write comes in, because the write operation is append-only, all writes are not blocked by reads. This is a good way to support HSAP’s high concurrency mixed workload scenarios.

5) Rich indexes

The storage engine provides a variety of index types to improve the efficiency of queries. A table can support both clustered and non-clustered indexes. A table can have only one clustered index, which contains all the columns in the table. A table can have multiple non-clustered indices. In addition to the non-clustered index key used for sorting, there is a Row Identifier (RID) used to find the whole Row of indexes in the non-clustered indexes. If the clustered index exists and is unique, the clustered index key is RID. Otherwise, the storage engine generates a unique RID. To improve the efficiency of queries, the clustered index can have other columns, so that the index can be scanned to retrieve the values of all the columns (covering index) in some queries.

Inside the data file, the storage engine supports dictionary and bitmap indexes. Dictionaries can be used to improve the efficiency of processing strings and the compression ratio of data, and bitmap indexes can help filter out unwanted records efficiently.

Five epilogue.

The design and development direction of the storage engine is to better support the HSAP scenario, can efficiently support high throughput real-time writes and interactive queries, and is optimized for offline batch writes. With the number of writes and storage doubling every year, Hologres has withstood the test of rigor and survived multiple Singles’ Day perfectly. “Hologres stands at a peak of 596 million megabits per second of real-time data, storing up to 2.5 petabytes per table. It provides multi-dimensional analysis and services based on trillions of data, and 99.99% of queries can return results within 80ms.” This set of data speaks volumes about the technical capabilities of the storage engine, as well as the strengths of our technical architecture and implementation.

Of course, Hologres is still a new product and HSAP is a new concept. “Today’s best is just the beginning of tomorrow”. We are still constantly improving and listening to the feedback of customers.

Author: Jiang Yuan, Alibaba senior technical expert, has many years of rich experience in the field of database and big data.

In the future, we will successively launch the series of unveiling the underlying technical principles of Hologres. The specific planning is as follows. Please keep your attention!

Hologres Revealed: First Public! Alibaba Cloud Native Real-Time Database Core Technology Reveals Hologres: First Reveals Cloud Native Hologres Storage Engine (this article) Reveals Hologres: Deep Analysis and Efficient Distributed Query Engine Reveals: Transparent Accelerate MaxCompute Query Core Principles Hologres: How to achieve maxCompute and Hologres data synchronization a hundred times faster How to support high QPS in online service scenarios Hologres Uncover: How to support high concurrency queries Hologres Uncover: How to support high availability architectures Hologres Uncover: How to support resource isolation and support multiple loads Hologres Uncover: The Principle and Practice of the Vector Retrieval Engine Proxima Hologres: Understanding Execution Plans and Improving Query Performance by a Factor of Ten How to support more Postgres Eco Extension Pack Hologres Reveals: High Puff Write Hologres in N Postures……