Author: Guo Bei, senior technical expert of Aliyun, head of Hologres real-time data warehouse

On January 7, 2022, Ali Cloud real-time data warehouse Hologres held its annual press conference. At the press conference, senior technical experts from Ali explained the new development trend of real-time data warehouse “online, agile and one-stop” from the core scene of Ali. Through this article, we will deeply understand the problems faced by the development of real-time data warehouse, as well as the core development trend, to help you better do product selection and data warehouse planning. Real-time data warehouse is now a very hot concept in the field of big data (and it is probably one of the lake warehouse). After more than a decade of development, big data has become the standard of every company. Traditionally, offline repositories (open source represented by Hive/Spark, closed source represented by Alibaba MaxCompute, Snowflake, AWS Redshift, Google BigQuery, and traditional IT vendors like Vertica, Oracle, and HANA), The standard architecture of big data processing is Lambda architecture, represented by Flink/Spark Structured Streaming, and data service layers (HBase, MySQL, ES, Redis, etc.). The Lambda architecture provides real-time data serving capability. But the typical problems with Lambda architectures are complex development, data redundancy, and inflexible analysis.

In recent years, the rise of real-time data warehouse represented by ClickHouse, Apache Doris, Ali Hologres and so on has realized the de-lambda architecture through real-time writing detailed data + flexible interactive query, and achieved a good balance in real-time, flexibility, cost, management, operation and maintenance and other aspects.

With the perfect conclusion of The 2021 Double 11, real-time counting warehouse technology in alibaba Double 11 scene has experienced many years of practice and development. From early chimney type development, based on different operation to the introduction, based on hierarchical modeling in the field of number storehouse to analyze new fusion type “one-stop” work style architecture, service integration gradually improve development efficiency, data quality is more assured, and precipitation more technological innovation, let’s see some future possibility and trend of development and application of several storehouse.

Now let’s talk about some trends of real-time data warehouse development seen from Alibaba Double 11.

Real-time data warehouse has become a business standard

The first trend is that real-time data warehousing has become standard.

Business demands for timeliness and flexibility are becoming higher and higher, making real-time data a necessity. The great advantages of real-time data warehouse in cost and flexibility make it a priority for businesses to choose real-time data warehouse as the production, storage and use platform of real-time data. In Alibaba, Hologres serves about 90% of BU, the cluster size exceeds 600,000 core, and maintains a growth rate of 100%. In these services, there are more common real-time warehouse scenarios, such as:

  1. Digital operation:This kind of scene upstream docking Flink data flow processing; The downstream is connected with BI tools and large data screens to realize self-service development and online services. Greatly improved development efficiency and flexibility, supporting the WYSIWYG development experience.
  2. ** Network traffic analysis and Metrics analysis: ** By storing and monitoring network traffic and other Metrics data in real time, it can quickly warn and locate potential device faults. Query second-level response on trillion-level record, fault second-level discovery.
  3. ** Real-time logistics tracking: ** realizes real-time tracking of logistics information through real-time data warehouse to ensure real-time update and query of logistics flow status.

In addition to these relatively common real-time warehouse scenarios, due to Hybrid Serving/Analytics Processing (HSAP) capabilities (and their Hologres high-speed pure real-time write and point-and-search capabilities), Hologres is also used in many atypical real-time warehouse scenarios. Such as:

  1. Crowd selection for merchants: Hologres provides high QPS and low delay crowd selection and advertising service for merchants (to B).
  2. Unmanned vehicle delivery: Hologres carries the orders, logistics and other indicators of unmanned vehicle goods, reports logistics information in real time to the station B, so as to help the station owner complete intelligent parcel sorting, mobile container delivery and other tasks; For users, and then through the system scheduling capacity, to achieve “timed door-to-door, delivery to the building”.
  3. Feature storage and sample storage in search recommendation: Hologres’ powerful point-searching ability is utilized to realize real-time feature store, real-time feature store and real-time algorithm effect analysis.
  4. Full-link customer experience: The customer service department stores the relevant multi-channel data of customers in Hologres, realizing the ability to directly provide various detailed queries to consumers (TO C).

There are many similar scenarios. The real-time “being seen” and “being used” of data has become the driving force for the rapid development of enterprises.

Real-time number warehouse support online production system

The second trend is that real-time data warehousing is becoming more and more part of production systems.

Traditionally, real-time data warehouse (DATA warehouse) is a non-production system. Because it is mainly facing internal customers, so although the importance of large screen is very high, but the real time warehouse is not in the production critical link, that is to say, if the real time warehouse is unavailable, the impact on the customer is not great. This is why most real-time data warehouse products in high availability, resource isolation, disaster recovery and other capabilities of the database system is a big gap.

Traditionally, external services are provided through offline/streaming processing + result point search, that is, the key link between the user and the result point search (carried by systems such as HBase, Redis, and MySQL). The advantage of this mode is that it is simple and reliable, but it is also very limited and inflexible. The business is eager to open its internal real-time warehouse capability to external customers (to B, to C) in a controlled manner, and to maintain data and logical consistency between the internal and external systems. Ali advertising, unmanned vehicle delivery, customer full-link experience and other scenarios listed above are all cases of to B or even to C.

As the real-time data warehouse is provided as a service, users put forward higher demands on the concurrency, availability and stability of the service. This is where Hologres has focused his efforts over the past year. In the past year, Hologres introduced multi-copy, hot upgrade, fast failover, resource isolation, read/write separation, disaster recovery and other capabilities to achieve production-level high availability, and has been well used in this year’s Double 11. A few examples:

  1. Last year, Alibaba’s Chief Customer Office (CCO) made dual-link write and storage redundancy to ensure high availability. This year’s Double 11 uses Hologres native high availability scheme to remove manual double link, which saves the manpower input of real-time task development and data comparison on standby data link, reduces the data inconsistency during link switching, and reduces the overall development manpower cost by 200 person-days, which is more than 50% lower than last year. Reduced by 100+ backup link jobs for real-time reinsurance and reduced computing resources by 2000CU.
  2. Alibaba’s Data Technology and Products Department (DT) uses Hologres read-write separation scheme, with high throughput and flexible query without interference; Analysis query QPS increased by 80% while query jitter decreased significantly.

We believe that the systematic production of real-time data warehouse is an inevitable trend, and believe that each real-time data warehouse products will gradually increase the development input in this aspect.

Integration of Analytics Services (HSAP)

The third trend is the integration of analytics services (HSAP).

Hologres is the pioneer in this field. The source is that businesses within Ali Group have strong demands for integration of analytical services. The best practices of integration of analytical services are first implemented within Ali, but we also see more and more products and enterprises advocating and practicing integration of analytical services in the industry.

Analysis service Integration (HSAP) can be understood at several levels: The most basic is that users can use a technology stack (Flink+Hologres) to solve both ad-Hoc Query analysis (internal) and online services (internal, to B, to C), thereby reducing development and maintenance costs. While the real-time warehouse has traditionally done ad-Hoc Query, the Lambda architecture implements online services. The two are completely different in technology stack, data link, development operation and maintenance, but the source of data processed is often the same data, resulting in a large number of redundant development operations, and data consistency is also a big problem. By using a unified technology stack to meet the needs of both aspects, development, operation and maintenance, and governance become simple.

Take ali CCO scenario as an example. After data is written into Hologres row memory table (row memory table has high write throughput, fast primary key query and low Binlog overhead in updating scene), it will be consumed and processed by Flink twice through Hologres Binlog. Hologres stores tables to provide analysis (stores are fast for statistical queries). Inventory tables provide online services/point-of-view, inventory tables provide analytical capabilities.

Higher-level HSAP allows users to implement ad-Hoc Query and online services with a single piece of data on a single platform, while achieving good resource isolation and availability.

For example, in this year’s Double 11 DT department, Hologres read/write separation scheme (two Hologres instances are responsible for real-time write and query respectively, but share the same underlying data store), and multiple read instances are responsible for different types of query respectively, thus ensuring read/write isolation, analysis query isolation, and service query isolation. And there’s only one piece of data. One Data, Multi Workload.

In addition to the benefits mentioned above, another significant advantage of service integration is that the service online speed is significantly faster. After integration, the boundary between analysis and service becomes blurred, so there is little difference between service development and analysis. It can be considered that service is a simple and fixed pattern analysis. As a result, the traditionally complex process of bringing services online is greatly simplified. When there is an urgent need for temporary development, it can be launched immediately, without the cumbersome process.

We believe that the concept of integration of analytics services will be implemented in more scenarios as products like Hologres evolve. This will also feed back HSAP products like Hologres, and better deposit HSAP concepts, methodology and support capabilities into products, so that more users can benefit from HSAP more easily.

Real-time data governance becomes imperative

The fourth trend is the growing importance of real-time data governance.

Real-time data is fatally attractive to businesses. Therefore, the enterprise will consciously or unconsciously gradually increase the input on real-time warehouse. And the real-time data warehouse of each enterprise because of the real-time requirements, often does not implement offline data warehouse so strict methodology and management system. Because there is no governance, data is heavily redundant or irrational, often resulting in a sharp increase in cost and loss of data credibility. In such a large enterprise as Alibaba, the cost of this will stand out, and it has become a demand for real-time data warehouse.

Through the data governance of real-time data warehouse, offline data warehouse, streaming computing, message queue and other whole links, the data can not be “outside the law”, so as to save the cost, improve the quality of data, and truly turn the data into enterprise assets.

Real – time data warehouse database – like

The fifth trend is the real – time warehouse database – like. Big data was born from the sublation of traditional databases. From NoSQL to NewSQL, big data products have walked out of a path independent of databases. But just as with NoSQL to NewSQL, the real-time data warehouse in big data products is also learning from databases, providing better compatibility with databases so that users can use real-time data warehouse products at lower cost.

There are several aspects to this:

  1. SQL operation and traditional database in the protocol, syntax compatibility, so as to facilitate the development of students can use accustomed tools (BI, development tools, etc.) to docking development. The accumulation of big data in this aspect is still not on the database decades of accumulation, quite a number of business students for the database is very skilled, but for big data (especially real-time data warehouse) feel not easy to get started.
  2. The data model and semantics are similar to traditional databases. For example, the concept of Primary Key is lacking in traditional warehouse products, and the atomicity of operations is often not guaranteed in warehouse products, which limits the application of many scenarios. For example, Clickhouse lacks a primary key in the database sense (CK’s primary key is a different thing, not a unique constraint), so it is not appropriate to handle database CDC synchronization scenarios. These two years, the big data industry can clearly see the enhancement of this piece. The most typical examples are near-real time stacks represented by DeltaLake, Iceberge and Hudi that add ACID capability. Of course, due to the architecture, the performance and latency of this near-real time ACID in frequent update scenarios are bottlenecked.

In Alibaba, a large number of scenarios require such primary key-based updating capabilities. Take alibaba’s internal scenarios as an example:

  1. Real-time database synchronization: By real-time synchronization (mirroring) of upstream sub-database sub-tables and multiple business libraries into a big data real-time warehouse, it can provide powerful analysis capabilities of business data, which requires good handling of pure real-time high-frequency UPDATE and DELETE operations.
  2. Flink computes the resulting UPDATE and DELETE (RETRACTION) operations: For example, for GMV statistics, Flink generates UPDATE records when the results are updated, and RETRACTION records (DELETE) are generated in some scenarios, which require the downstream system to be able to handle these two types of events well.
  3. Operations such as risk control are computed by multiple jobs that update a large and wide table in real time (each job updates some of the fields), requiring downstream systems to provide partial update capabilities based on primary keys.

Traditionally, such services have been handled by NoSQL systems such as HBase and Redis or RDS databases such as MySQL and PostgreSQL. The problem with NoSQL, however, is that analysis is generally weak, while the problem with databases is that write performance and size are limited.

These businesses are ubiquitous in big data processing. However, the challenge in Ali is due to the large scale (especially for scenarios like Double 11) and the demanding requirements for primary key-based update performance and latency.

Hologres had both in mind from the beginning of the design. Hologres is fully compatible with PostgreSQL 11 protocols, syntax, functions, etc. Many PostgreSQL extensions (such as PostGIS) can be used directly. At the same time, Hologres provides complete primary key concepts and powerful update capabilities, and provides single SQL ACID. On November 11 this year, some businesses measured more than 3.5 million real-time write updates per second. These capabilities greatly broaden the application scenarios of real-time data warehouse, the traditional scenario carried by NoSQL and RDS is carried by real-time data warehouse, providing users with more powerful analysis and processing tools.

The database – like database of real-time data warehouse is not equivalent to HTAP database. HSAP is weaker in transaction capability than HTAP. Because in the Serving scenario, the full transactional capability of a traditional database is not required. This abandonment resulted in significant improvements in real-time write and query performance, as well as scalability improvements (since the global transaction manager was no longer required). Therefore, HSAP is more suitable for big data scenarios than HTAP.

Real-time data warehouse development agility

The last trend is a change in development methodology, with the development of real-time data warehouses becoming more agile to accommodate the flexibility of analysis scenarios.

In the past, the development of data warehouse often adopts the method of ODS->DWD->DWS->ADS layer by layer in accordance with the classical methodology, and the scheduling method of event-driven or micro-batch between layers is adopted. Layering brings better semantic abstraction and data reuse, but also increases scheduling dependence, reduces data timeliness, and reduces agility of flexible data analysis.

Real-time for warehouse drive business decisions in real time, usually need to rich context information in decision-making, so the height of the traditional custom ADS development method based on business by the bigger challenge, tens of thousands of ADS table maintenance difficulties, low utilization rate, more business side hopes that by DWS even DWD multi-angle data contrast analysis, This puts forward higher requirements for query engine’s computing efficiency, scheduling efficiency and IO efficiency.

With multiple query engine optimization technologies such as vectoring rewrite of computation operator, fine indexing, asynchronous execution and multi-level caching, the computational power of Hologres has been greatly improved in each version. So we see more and more users have adopted agile development way, the calculation of the front stage, only do data quality cleaning, big table width, basic modeling to DWD, DWS, reduce the modeling level, at the same time, the flexible query in the interactive query engine in real analysis, through the analysis of the interactive second stage experience, Underpinning important trends in the democratization of data analysis.

conclusion

Alibaba is one of the earliest companies in the industry to use real-time data warehouse to process massive data. Real-time number warehouse in ali’s development also gradually into the deep water area. Whether production systematic and analytical services integration, real-time data governance (platform) or class database, agile, number of real-time warehouse is with the rapid development of the business requirements and rapid iteration, and in double 11 this year’s drama coruscate gives more and more bright luster, becoming indispensable partner and business assistant.

Business driving technology, data brings value, real-time data warehouse Hologres is growing and polishing together with Alibaba’s core business, from multidimensional complex OLAP analysis to high QPS spot search, high-performance real-time write and update to high availability, providing unified analysis service export for big data platform. Meet the one-stop real-time data warehouse storage, development, governance, service process and scenarios.

We believe that the trend of real-time data warehouse is also applicable to the whole industry, we will gradually put the ability accumulated in Alibaba Double 11 into cloud products, to help customers make good use of real-time data warehouse, and grow together!