Hologres is Alibaba Cloud's self-developed one-stop real-time data warehouse. This cloud-native system unifies real-time serving and big data analytics, is fully compatible with PostgreSQL, and integrates seamlessly with the big data ecosystem, so a single data architecture can support real-time writes, real-time queries, and federated real-time/offline analysis. Its emergence simplifies business architecture while giving businesses the ability to make decisions in real time, enabling big data to deliver greater business value. From its birth inside Alibaba Group to its commercialization on the cloud, Hologres has continuously sharpened its core technical competitiveness as the business has grown and the technology has evolved. To help everyone understand Hologres better, we plan to publish an ongoing series disclosing the underlying technical principles of Hologres, covering everything from the high-performance storage engine and the efficient query engine to high-throughput writes and high-QPS queries. Stay tuned!
Alibaba Hologres: A Cloud-Native Service for Hybrid Serving/Analytical Processing (VLDB 2020)
In this installment we introduce the Hologres storage engine.
I. Background introduction
MaxCompute Interactive Analysis (Hologres) is a Hybrid Serving/Analytical Processing (HSAP) system developed by Alibaba Cloud that unifies real-time serving and big data analytics scenarios. It is fully compatible with PostgreSQL and integrates seamlessly with the big data ecosystem. Its emergence simplifies business architecture while giving businesses the ability to make decisions in real time, enabling big data to deliver greater business value. For a more detailed introduction to the architecture, see the VLDB paper referenced at the end of this article.
Compared with traditional big data and OLAP systems, HSAP systems face the following challenges:
- High-concurrency mixed workloads: HSAP systems must handle concurrency far beyond that of traditional OLAP systems. In practice, the concurrency of data serving far exceeds that of OLAP queries. For example, we see real-world applications where data serving needs to process tens of millions of queries per second, five orders of magnitude higher than the concurrency of OLAP queries. At the same time, serving queries have much stricter latency requirements than OLAP queries. The queries in such mixed workloads make very different trade-offs between latency and throughput. Efficiently using system resources while handling these very different queries simultaneously, and guaranteeing each query's SLO, is a huge challenge.
- High-throughput real-time data ingestion: While handling a highly concurrent query load, an HSAP system must also ingest massive amounts of data in real time. Data synchronized from traditional OLTP systems is only a small part of this; a large amount also comes from systems, such as logging, that lack strong transactional semantics. The volume of real-time ingestion far exceeds that of traditional HTAP or OLAP systems. Another difference from traditional OLAP systems is the demand for data freshness: ingested data must become visible within seconds or even sub-seconds to keep serving and analysis results timely.
- Elasticity and scalability: Data ingestion and query load can spike suddenly, which places high demands on the elasticity and scalability of an HSAP system. In one real-world application, we observed that peak data ingestion can be 2.5 times the average, and peak query load 3 times the average. Spikes in ingestion and queries may not occur at the same time, so the system must also be able to adapt quickly to different kinds of spikes.
Against this background, we developed a storage engine that is responsible for managing and processing data, including creating, querying, updating, and deleting (CRUD) it. The design and implementation of the storage engine provide the high throughput, high concurrency, low latency, elasticity, and scalability that HSAP scenarios require. Driven by the needs of Alibaba Group's businesses and of customers on the cloud, we have kept innovating and polishing it; today it supports PB-scale storage in a single table and, in the core scenarios of the 2020 Tmall Double 11, served hundred-billion-scale point queries and ten-million-scale real-time complex queries.
Below, we introduce the underlying storage engine of Hologres in detail, explaining its implementation principles and the technical highlights of how it is realized in Hologres.
II. Data model
The basic abstraction of the Hologres storage engine is the distributed table. To make the system scalable, tables must be sharded. Because users often need to store related tables together to efficiently support scenarios such as joins and multi-table updates, Hologres introduces the concept of table groups. A group of tables with the same sharding policy constitutes a table group, and all tables in the same table group have the same number of shards. The number of shards can be specified with "shard_count", and the sharding key with "distribution_key". Currently, we only support hash sharding.
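As a rough sketch, the routing implied by hash sharding looks like the following (the actual hash function is internal to Hologres; `crc32` here is purely illustrative):

```python
import zlib

def shard_for(key, shard_count):
    # Route a record to a shard by a stable hash of its distribution
    # key, modulo the table group's shard count. Colocated tables with
    # the same shard_count place matching key values on the same shard,
    # which is what makes shard-local joins possible.
    return zlib.crc32(str(key).encode()) % shard_count
```

Because every table in a table group uses the same shard count and a compatible distribution key, rows that join on that key land on the same shard and can be joined without shuffling data across shards.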
A table's data storage format falls into two categories, row storage and column storage; the format can be specified with "orientation".
Records in each table are stored in a certain order, which can be specified with "clustering_key". If no order is specified, the storage engine stores records in insertion order. Choosing the right order can greatly improve the performance of some queries.
Tables also support multiple kinds of indexes; currently we support dictionary and bitmap indexes. Users can specify the columns to be indexed with "dictionary_encoding_columns" and "bitmap_columns".
Here is an example:
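The example DDL appeared as an image in the original post; based on the description that follows, it would look roughly like this (column lists are abbreviated to the columns mentioned, and the exact column types are assumptions):

```sql
BEGIN;
CREATE TABLE LINEITEM (
    L_ORDERKEY     BIGINT NOT NULL,
    L_LINENUMBER   INT    NOT NULL,
    L_SHIPINSTRUCT TEXT,
    PRIMARY KEY (L_ORDERKEY, L_LINENUMBER)
);
CALL set_table_property('LINEITEM', 'shard_count', '24');
CALL set_table_property('LINEITEM', 'distribution_key', 'L_ORDERKEY');
CALL set_table_property('LINEITEM', 'dictionary_encoding_columns', 'L_SHIPINSTRUCT');
CALL set_table_property('LINEITEM', 'bitmap_columns', 'L_ORDERKEY,L_LINENUMBER,L_SHIPINSTRUCT');
COMMIT;

BEGIN;
CREATE TABLE ORDERS (
    O_ORDERKEY    BIGINT NOT NULL,
    O_CUSTKEY     BIGINT,
    O_ORDERSTATUS TEXT
);
CALL set_table_property('ORDERS', 'colocate_with', 'LINEITEM');
CALL set_table_property('ORDERS', 'distribution_key', 'O_ORDERKEY');
CALL set_table_property('ORDERS', 'dictionary_encoding_columns', 'O_ORDERSTATUS');
CALL set_table_property('ORDERS', 'bitmap_columns', 'O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS');
COMMIT;
```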
This example creates the LINEITEM and ORDERS tables. Since the LINEITEM table specifies a PRIMARY KEY, the storage engine automatically creates an index to ensure the primary key is unique. The two tables are placed into the same table group by specifying "colocate_with". The table group is divided into 24 shards (specified by shard_count). LINEITEM is sharded on the value of L_ORDERKEY, and ORDERS on the value of O_ORDERKEY. The L_SHIPINSTRUCT column of LINEITEM and the O_ORDERSTATUS column of ORDERS are dictionary-encoded. Bitmap indexes are created on the L_ORDERKEY, L_LINENUMBER, and L_SHIPINSTRUCT columns of LINEITEM and the O_ORDERKEY, O_CUSTKEY, and O_ORDERSTATUS columns of ORDERS.
III. Storage engine architecture
1) Overall architecture
Each table group shard (shard for short) constitutes a recovery unit for storage management and recovery. The figure above shows the basic architecture of a shard. A shard consists of multiple tablets that share one write-ahead log (WAL). The storage engine uses a Log-Structured Merge (LSM) tree, in which all new data is inserted append-only. Data is first written to the tablet's MemTable and, after accumulating to a certain size, is written out to a file. Once a data file is closed, its contents never change; new data and subsequent updates are written to new files. Compared with the B+-tree structures of traditional databases, the LSM tree reduces random I/O and greatly improves write performance.
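The append-only write path described above can be sketched in a few lines of Python (a toy model for illustration only; names and thresholds are hypothetical, and real SST files, compression, and concurrency are all omitted):

```python
import bisect

class Tablet:
    """Toy LSM tablet: writes go to an in-memory MemTable; when it
    grows past a threshold it is flushed as an immutable, sorted run
    (a sorted list here stands in for an on-disk file)."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}            # mutable, receives new writes
        self.immutable_files = []     # closed files, never modified again
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the MemTable into an immutable sorted run; later
        # updates to the same keys go to newer files, not this one.
        self.immutable_files.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: check the MemTable, then files newest-first.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.immutable_files):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

Note how an update never touches a closed file: it simply lands in a newer MemTable or file, and reads resolve the newest version first.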
As writes come in, many files accumulate in each tablet. When the number of small files in a tablet reaches a threshold, the storage engine compacts them together in the background. This avoids having to open many files at once, reducing system resource usage and, more importantly, improving read performance.
In terms of DML functionality, the storage engine provides single-row and batch create, query, update, and delete (CRUD) access methods, through which the query engine accesses the stored data.
2) Storage engine components
Here are some important components of the storage engine:
- WAL Manager
The WAL Manager manages log files. The storage engine uses a write-ahead log (WAL) to ensure the atomicity and durability of data. When a CUD operation arrives, the storage engine writes the WAL first and then applies the operation to the tablet's MemTable. When the MemTable accumulates to a certain size, or after a certain time, it is switched to an immutable flushing MemTable, and a new MemTable is opened to receive new write requests. The immutable flushing MemTable can then be flushed to disk as an immutable file, at which point the data is persisted. If the system crashes, the WAL is read on restart to recover data that had not yet been persisted. The WAL Manager deletes a log file only after all of its data has been persisted.
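The log-before-apply discipline and the recovery replay can be sketched as follows (a toy model; real WAL entries, file rotation, and group commit are omitted, and all names are hypothetical):

```python
class WalManager:
    """Toy write-ahead log: every mutation is appended to the log
    before being applied; on restart, entries newer than the last
    persisted LSN are replayed to rebuild unflushed MemTable state."""

    def __init__(self):
        self.log = []                 # stands in for durable log files
        self.persisted_lsn = 0        # everything <= this is in data files

    def append(self, lsn, op):
        self.log.append((lsn, op))    # write the WAL entry first...

    def mark_persisted(self, lsn):
        # ...and once immutable data files cover this LSN, the older
        # portion of the log is no longer needed and can be deleted.
        self.persisted_lsn = lsn
        self.log = [(l, op) for l, op in self.log if l > lsn]

    def recover(self, apply):
        # After a crash, replay only the entries that are not yet
        # reflected in persisted data files.
        for lsn, op in sorted(self.log):
            if lsn > self.persisted_lsn:
                apply(op)
```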
- File storage
Each tablet stores its data as a set of files in DFS (Alibaba Pangu or Apache HDFS). Row storage files use the Sorted String Table (SST) format. Column storage files support two formats: a custom PAX-like format and an improved version of Apache ORC (based on AliORC, with many optimizations for Hologres scenarios). Both formats are optimized for file scanning.
- Block Cache (Read Cache)
To avoid issuing an I/O to read data from files on every access, the storage engine uses a Block Cache to keep hot and recently used data in memory, reducing unnecessary I/O and improving read performance. All shards on a node share one Block Cache. The Block Cache has two eviction policies: LRU (Least Recently Used) and LFU (Least Frequently Used). As the names imply, LRU first evicts the block that has gone unused the longest, while LFU first evicts the block accessed least often within a given period.
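The LRU policy, for example, falls out naturally from an ordered map (a minimal sketch of the idea, not Hologres' cache; capacity is counted in blocks rather than bytes for simplicity):

```python
from collections import OrderedDict

class BlockCache:
    """Block cache with LRU eviction: the block untouched the longest
    is dropped first when the cache exceeds capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # least recently used first

    def get(self, block_id):
        if block_id not in self.blocks:
            return None               # miss: caller reads from DFS
        self.blocks.move_to_end(block_id)   # mark as recently used
        return self.blocks[block_id]

    def put(self, block_id, data):
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
```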
3) Read and write principles
Hologres supports two types of writes: single-shard writes and distributed batch writes. Both are atomic: a write either commits or rolls back. A single-shard write updates only one shard at a time but must sustain an extremely high write frequency. A distributed batch write, on the other hand, is used when a large amount of data must be written across multiple shards as a single transaction, typically at a much lower frequency.
– Single-shard write
As shown in the figure above, upon receiving a single-shard write request, the WAL Manager (1) assigns the request a Log Sequence Number (LSN), consisting of a timestamp and an increasing sequence number, and (2) creates a new log entry and persists it in the file system. The log entry contains the information needed to replay the write, and the write is committed to the tablet only after the log entry has been fully persisted. After that, (3) the write is applied to the MemTable of the corresponding tablet and becomes visible to new read requests. Notably, updates to different tablets can proceed in parallel. When a MemTable is full, (4) it is flushed to the file system and a new MemTable is initialized. Finally, (5) shard files are merged asynchronously in the background. When a merge or a MemTable flush finishes, the metadata file that manages the tablet is updated accordingly.
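The numbered steps above can be lined up in code (a toy single-threaded sketch; step 5, background merging, is omitted, and the LSN structure follows the timestamp-plus-sequence description in the text):

```python
import itertools
import time

class Shard:
    """Walks steps (1)-(4) of the single-shard write path: assign an
    LSN, persist a log entry, apply to the MemTable, flush when full."""

    def __init__(self, memtable_limit=2):
        self._seq = itertools.count(1)
        self.wal, self.memtable, self.files = [], {}, []
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # (1) LSN = (timestamp, increasing sequence number); the
        # sequence number breaks ties within one timestamp tick.
        lsn = (time.time_ns(), next(self._seq))
        # (2) persist the log entry before acknowledging the write
        self.wal.append((lsn, key, value))
        # (3) apply to the MemTable; now visible to new read requests
        self.memtable[key] = (lsn, value)
        # (4) flush a full MemTable to an immutable file
        if len(self.memtable) >= self.memtable_limit:
            self.files.append(sorted(self.memtable.items()))
            self.memtable = {}
        return lsn
```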
– Distributed batch write
The foreground node that receives a write request distributes it to all relevant shards, which guarantee the atomicity of the distributed batch write through a two-phase commit protocol.
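The two-phase commit structure can be sketched as follows (a bare-bones illustration of the protocol shape, not Hologres' coordinator; failure recovery and timeouts are omitted, and `FakeShard` is a stand-in for a real participant):

```python
class FakeShard:
    """Stand-in participant that can vote yes or no in phase 1."""
    def __init__(self, ok=True):
        self.ok, self.state = ok, "idle"
    def prepare(self, batch):
        self.state = "prepared" if self.ok else "aborted"
        return self.ok
    def rollback(self, batch):
        self.state = "rolled_back"
    def commit(self, batch):
        self.state = "committed"

def two_phase_commit(shards, batch):
    """Phase 1 asks every shard to prepare (write and hold the batch);
    the batch commits only if all shards vote yes, otherwise every
    already-prepared shard rolls back."""
    prepared = []
    for shard in shards:
        if shard.prepare(batch):
            prepared.append(shard)
        else:
            for p in prepared:        # any failure aborts the whole batch
                p.rollback(batch)
            return False
    for shard in shards:              # phase 2: all voted yes, commit
        shard.commit(batch)
    return True
```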
– Multi-version reads
Hologres supports multi-version reads in tablets. The consistency level of read requests is read-your-writes: a client always sees its own latest committed writes. Each read request carries a read timestamp, which is used to construct the snapshot LSN for the read. A row version whose LSN is greater than the snapshot LSN is filtered out, because it was inserted into the tablet after the snapshot was taken.
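The snapshot filtering rule can be written out directly (a minimal sketch over an in-memory version list; the `(key, lsn, value)` layout is an assumption for illustration):

```python
def snapshot_read(rows, snapshot_lsn):
    """Multi-version read: for each key, keep the newest version whose
    LSN is <= the snapshot LSN; versions written after the snapshot
    are invisible. rows is a list of (key, lsn, value) versions."""
    visible = {}
    for key, lsn, value in rows:
        if lsn > snapshot_lsn:
            continue                  # written after the snapshot
        if key not in visible or lsn > visible[key][0]:
            visible[key] = (lsn, value)
    return {k: v for k, (_, v) in visible.items()}
```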
IV. Technical highlights of the Hologres storage engine
1) Separation of storage and compute
The storage engine adopts a storage-compute separation architecture: all data files are stored in a distributed file system (DFS, such as Alibaba Pangu or Apache HDFS). When query load grows and more compute is needed, compute resources can be expanded; when data volume grows rapidly, storage resources can be expanded rapidly. Independently scaling compute nodes and storage nodes ensures resources can be expanded quickly, without waiting for data to be copied or moved. Moreover, the multi-replica mechanism of DFS guarantees high data availability. This architecture not only greatly simplifies operations and maintenance, but also provides a strong guarantee of system stability.
2) Fully asynchronous execution
The storage engine uses an event-driven, non-blocking, purely asynchronous execution architecture, which makes full use of the multi-core processing power of modern CPUs, improves throughput, and supports highly concurrent writes and queries. This architecture is built on the HOS (HoloOS) framework, which not only provides efficient asynchronous execution and concurrency, but also performs automatic CPU load balancing to improve system utilization.
3) Unified storage
In HSAP scenarios there are two query patterns: simple point queries (the data serving scenario) and complex queries that scan large amounts of data (the analytical scenario), with plenty of queries falling in between. These two patterns place different demands on data storage: row storage serves point queries efficiently, while column storage has a clear advantage for queries that scan heavily.
To support all of these query patterns, unified real-time storage is essential. The storage engine supports both row and column storage formats. Depending on the user's requirements, a tablet can use the row storage format (for serving scenarios) or the column storage format (for analytical scenarios). For example, in a typical HSAP scenario, many users store their data in the column format, which suits large-scale scans and analytics, while the data's indexes are stored in the row format, which suits point lookups. A primary-key constraint (which we implement with row storage) is defined to prevent duplicate data. The read and write interfaces are the same regardless of whether the underlying format is row or column; users do not perceive the difference and only specify the format when creating the table.
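The difference between the two layouts is easiest to see side by side (a schematic sketch with hypothetical helper names; real formats like SST and ORC add sorting, encoding, and compression on top of this basic idea):

```python
def to_row_store(records):
    """Row store: each record's fields are kept together, so a point
    lookup fetches a whole row in one access."""
    return {r["id"]: r for r in records}

def to_column_store(records, columns):
    """Column store: each column's values are kept together, so a scan
    that aggregates one column of many rows touches only that column."""
    return {c: [r[c] for r in records] for c in columns}
```

A serving-style lookup reads one entry of the row store; an analytical aggregation streams one list of the column store without touching the other columns.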
4) Read and write isolation
The storage engine uses snapshot-read semantics: data is read as of the snapshot taken when the read begins, without data locks, so read operations are not blocked by writes. And because writes are append-only, write operations are likewise never blocked by reads. This supports HSAP's highly concurrent mixed workloads well.
5) Rich indexes
The storage engine provides multiple index types to improve query efficiency. A table supports both clustered and non-clustered indexes: it can have only one clustered index, which contains all the columns of the table, and multiple non-clustered indexes. Besides the non-clustered index key, a non-clustered index stores a Row Identifier (RID) used to locate the full row. If a unique clustered index exists, its key serves as the RID; otherwise the storage engine generates a unique RID. To improve query efficiency, a non-clustered index can also include other columns, so that some queries can retrieve all the columns they need by scanning a single index.
Inside data files, the storage engine supports dictionary and bitmap indexes. Dictionaries improve string processing efficiency and data compression ratios, while bitmap indexes help filter out unneeded records efficiently.
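A bitmap index is simply one bit per row per distinct value, which makes filtering a bitwise operation (a toy sketch using Python integers as bit vectors; real implementations use compressed bitmaps):

```python
def build_bitmap_index(column):
    """Bitmap index over one column: for each distinct value, a bit
    per row saying whether that row holds the value."""
    index = {}
    for row_id, value in enumerate(column):
        index.setdefault(value, 0)
        index[value] |= 1 << row_id   # set this row's bit
    return index

def filter_rows(index, value, num_rows):
    """Row ids matching `value`, read straight off the bitmap."""
    bits = index.get(value, 0)
    return [r for r in range(num_rows) if bits >> r & 1]
```

Predicates on several indexed columns can then be combined by AND-ing or OR-ing the bitmaps before any row data is touched.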
V. Epilogue
The storage engine is designed and developed to better support HSAP scenarios: efficiently supporting high-throughput real-time writes and interactive queries, while also optimizing offline batch writes. Hologres has withstood the rigorous test of Double 11, with writes and storage growing exponentially every year: “Hologres withstood a real-time data flood of 596 million records per second, storing up to 2.5 PB in a single table. Providing multidimensional analysis and serving over trillions of rows of data, 99.99% of queries returned results within 80 ms.” These numbers demonstrate the technical capability of the storage engine and the strength of our technical architecture and implementation.
Of course, Hologres is still a young product and HSAP a new concept. “Today's best is only the beginning of tomorrow.” We are continuously improving, listening to customer feedback, and will keep enhancing stability, ease of use, functionality, and performance.
Author: Jiang Yuan, Senior Technical Expert at Alibaba, with many years of experience in the database and big data fields.
Going forward, we will publish a series revealing the underlying technical principles of Hologres. The plan is as follows; please stay tuned!
- Hologres Revealed: First public disclosure of the core technology behind Alibaba Cloud's native real-time data warehouse
- Hologres Revealed: A first look at the cloud-native Hologres storage engine (this article)
- Hologres Revealed: A deep analysis of the efficient distributed query engine
- Hologres Revealed: How data is synchronized from MaxCompute to Hologres
- Hologres Revealed: How online serving scenarios support ultra-high QPS
- Hologres Revealed: How high-concurrency queries are supported
- Hologres Revealed: How high-availability architectures are supported
- Hologres Revealed: How resource isolation supports multiple workloads
- Hologres Revealed: How the distributed system is designed around Shards and Table Groups
- Hologres Revealed: How more PostgreSQL ecosystem extensions are supported
- Hologres Revealed: N ways to achieve high-throughput writes
- …