
Table of contents:

  1. The early stage of real-time computing
  2. Real-time data warehouse construction
  3. Real-time data warehouse of the Lambda architecture
  4. Real-time data warehouse of the Kappa architecture
  5. Stream-batch combined real-time data warehouse

1. The early stage of real-time computing

Although real-time computing has only caught fire in recent years, companies had real-time computing needs much earlier. Because data volumes were small, however, real-time work never formed a complete system: development was basically all ad hoc, analyzing each specific requirement on demand and paying little attention to the relationships between jobs. The development pattern looked like this:

As shown in the figure above, after a data source is ingested, it goes through data cleaning, dimension expansion, and business logic processing in Flink, and the result is output directly to the business. Breaking this link down, each job references the same data sources repeatedly, and the subsequent cleaning, filtering, and dimension-expansion operations are all duplicated; the only difference is the business code logic.
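This siloed pattern can be made concrete with a minimal sketch (plain Python standing in for independent Flink jobs; all names and data are illustrative): each job repeats the same cleaning and dimension expansion and differs only in its business logic.

```python
# Two independent "jobs" that each repeat cleaning and dimension expansion,
# differing only in the final business logic (the siloed pattern).
RAW_EVENTS = [
    {"uid": 1, "city_id": 10, "amount": 25.0},
    {"uid": 2, "city_id": 20, "amount": -5.0},   # invalid, filtered out
    {"uid": 1, "city_id": 10, "amount": 75.0},
]
CITY_DIM = {10: "Beijing", 20: "Shanghai"}

def clean(events):
    # Repeated in every job: drop invalid records.
    return [e for e in events if e["amount"] > 0]

def expand_dims(events):
    # Repeated in every job: join the city dimension.
    return [{**e, "city": CITY_DIM[e["city_id"]]} for e in events]

def job_total_gmv(events):
    # Business logic unique to job 1.
    return sum(e["amount"] for e in expand_dims(clean(events)))

def job_orders_per_city(events):
    # Business logic unique to job 2 -- cleaning runs again from scratch.
    counts = {}
    for e in expand_dims(clean(events)):
        counts[e["city"]] = counts.get(e["city"], 0) + 1
    return counts

print(job_total_gmv(RAW_EVENTS))        # 100.0
print(job_orders_per_city(RAW_EVENTS))  # {'Beijing': 2}
```

Every new metric added this way re-implements `clean` and `expand_dims`, which is exactly the coupling and duplication the article describes.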

As the demand for real-time data from product and business people increases, this development model exposes more and more problems:

  1. As the number of data metrics grows, this smokestack-style development leads to serious code coupling problems.

  2. Requirements keep increasing: some need detailed data, others need OLAP analysis. A single development model cannot satisfy them all.

  3. Every requirement applies for its own resources, so resource costs expand rapidly and resources cannot be used intensively and effectively.

  4. There is no comprehensive monitoring system, so problems cannot be detected and fixed before they affect the business.

Looking at this evolution, the development of the real-time data warehouse and its problems are very similar to those of the offline data warehouse, which also ran into all kinds of problems once data volumes grew. How did the offline data warehouse solve them? It decoupled data through a layered architecture, so that multiple services could share data. Can the real-time data warehouse also use a layered architecture? Of course it can, but the details of the layering differ from the offline case, which I'll discuss later.

2. Real-time data warehouse construction

In terms of methodology, real-time and offline data warehouse construction are very similar. In the early stage of the offline data warehouse, specific problems were likewise analyzed one by one; only when the data reached a certain scale did people consider how to govern it. Layering is a very effective way of governing data, so when asking how to manage the real-time data warehouse, the first thing to consider is likewise layered processing logic.

The structure of real-time data warehouse is shown as follows:

From the figure above, we can analyze the role of each layer in detail:

  • Data source: at the data source level, offline and real-time sources are the same. They mainly divide into log types and business types; log types include user logs, event-tracking logs, and server logs.

  • Real-time detail layer: to solve the problem of repeated construction, the detail layer is built in a unified way, following the offline data warehouse's modeling approach and managed by subject area. The purpose of the detail layer is to provide directly usable data to downstream consumers, so unified base processing must be done here, such as cleaning, filtering, and dimension expansion.

  • Summary layer: the summary layer computes results directly with Flink's concise operators and forms a pool of summary metrics. All metrics are processed in the summary layer, managed and built by everyone according to unified standards, forming reusable summary results.
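In contrast with the siloed pattern, the layered idea can be sketched as follows (plain Python, illustrative names and data): the detail layer is built once with unified cleaning, filtering, and dimension expansion, and every summary metric reuses it.

```python
# Layered sketch: the detail layer is built once (cleaning + dimension
# expansion, organized by subject) and every summary metric reuses it.
RAW = [
    {"uid": 1, "city_id": 10, "amount": 30.0},
    {"uid": 2, "city_id": 10, "amount": 0.0},   # invalid, filtered out
    {"uid": 3, "city_id": 20, "amount": 70.0},
]
CITY = {10: "Beijing", 20: "Shanghai"}

def build_detail_layer(raw):
    """Unified base processing: clean, filter, expand dimensions."""
    return [{**e, "city": CITY[e["city_id"]]} for e in raw if e["amount"] > 0]

def summary_gmv(detail):
    # One reusable metric in the summary pool.
    return sum(e["amount"] for e in detail)

def summary_uv(detail):
    # Another metric sharing the same detail layer.
    return len({e["uid"] for e in detail})

detail = build_detail_layer(RAW)        # built once, shared downstream
print(summary_gmv(detail))              # 100.0
print(summary_uv(detail))               # 2
```

Adding a new metric now means adding one function over `detail`, with no duplicated base processing.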

We can see that the layering of the real-time data warehouse is very similar to the offline one: a data source layer, detail layer, summary layer, even an application layer, and their naming may follow the same pattern. A careful comparison, however, reveals many differences between the two:

  • Compared with the offline data warehouse, the real-time data warehouse has fewer layers:

    • From current construction experience, the offline data warehouse's detail content is very rich; besides detail data it usually also contains a lightly aggregated summary layer, and the application-layer data also lives inside the offline warehouse itself. In the real-time data warehouse, by contrast, application-layer data already lands in the application system's storage medium, so the application layer can be separated from the warehouse's tables.

    • The advantage of building less of the application layer: when data is processed in real time, every layer adds some delay, so fewer layers mean lower end-to-end latency.

    • The advantage of building less of the summary layer: in summary statistics, some artificial delay may be introduced to tolerate late data and guarantee accuracy. For example, when aggregating a cross-day order event, the job may wait until 00:00:05 or 00:00:10 to make sure all the data before 00:00 has been received. Too many aggregation layers therefore multiply this artificial delay.

  • Compared with the offline data warehouse, the storage of the real-time data warehouse's data sources is different:

    • When building an offline warehouse, basically the entire warehouse sits on Hive tables. In the construction of a real-time warehouse, however, the same table may be stored in different ways: detail and summary data are stored in Kafka, while dimension information such as cities and channels needs to be stored in HBase, MySQL, or another KV store.
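The artificial-delay idea behind the cross-day example above can be sketched as follows (plain Python, illustrative timestamps): holding the day window open a few seconds past midnight lets a slightly late event still be counted.

```python
from datetime import datetime, timedelta

# Sketch of the artificial delay at the summary layer: the day window for
# 2024-01-01 is not finalized at midnight but a few seconds later, so that
# slightly late events (event time before 00:00, arrival after) are counted.
HOLD_BACK = timedelta(seconds=5)   # illustrative delay, i.e. close at 00:00:05

events = [
    # (event_time, arrival_time, amount)
    (datetime(2024, 1, 1, 23, 59, 58), datetime(2024, 1, 1, 23, 59, 59), 40.0),
    (datetime(2024, 1, 1, 23, 59, 59), datetime(2024, 1, 2, 0, 0, 3), 60.0),  # late
    (datetime(2024, 1, 2, 0, 0, 1), datetime(2024, 1, 2, 0, 0, 2), 10.0),     # next day
]

def day_total(events, day_end, close_time):
    """Sum events belonging to the day, accepting arrivals until the close."""
    return sum(a for et, at, a in events if et < day_end and at <= close_time)

day_end = datetime(2024, 1, 2)
print(day_total(events, day_end, day_end))              # 40.0 -- late event missed
print(day_total(events, day_end, day_end + HOLD_BACK))  # 100.0 -- late event included
```

Each extra aggregation layer that applies such a hold-back adds its own delay on top, which is why fewer summary layers keep results timely.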

3. Real-time data warehouse of the Lambda architecture

The concepts of the Lambda and Kappa architectures were explained in a previous article; if you are not familiar with them, you can follow the link and read Real-Time Computing of Big Data.

The following figure shows a concrete Lambda-architecture practice based on Flink and Kafka. The upper path is real-time computing and the lower path is offline computing; horizontally the figure is divided by computing engine, and vertically by the warehouse's layers:

The Lambda architecture is a classic. In the past there were few real-time scenarios; workloads were mainly offline. When real-time scenarios were added, their different timeliness requirements led to a different technology ecosystem, so the Lambda architecture amounts to bolting a real-time production link onto the existing one: the two paths produce independently and are integrated at the application level. This is a reasonable approach for business applications.
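The "dual production, integrated at the application level" idea can be sketched as follows (plain Python; the views and numbers are illustrative): the batch layer holds authoritative recomputed history, the speed layer holds today's incremental result, and the serving layer merges them at query time.

```python
# Lambda sketch: the batch layer owns finalized history, the speed layer owns
# today's partial results, and the serving layer merges the two at query time.
batch_view = {"2024-01-01": 500.0, "2024-01-02": 650.0}   # recomputed offline
speed_view = {"2024-01-03": 120.0}                        # incremental, real time

def serve(day):
    # Batch wins where both exist: it is the authoritative recomputation.
    if day in batch_view:
        return batch_view[day]
    return speed_view.get(day, 0.0)

print(serve("2024-01-02"))  # 650.0 -- from the batch path
print(serve("2024-01-03"))  # 120.0 -- from the speed path
```

The cost visible even in this toy is that the same metric logic must exist twice, once per path, which is exactly the doubling problem discussed next.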

Dual-path production brings problems, however: processing logic is doubled, development and operations work is doubled, and resources split into two separate resource links. Because of these problems, the Kappa architecture evolved.

4. Real-time data warehouse of the Kappa architecture

The Kappa architecture is equivalent to the Lambda architecture with the offline computing part removed, as shown in the figure below:

The Kappa architecture is relatively simple in design: unified production, with one set of logic serving both offline and real-time needs. In practical application scenarios, however, it is quite limited, because the real-time data of the same table may be stored in different ways, which forces associations across data sources and greatly restricts data operations. As a result, few production deployments in the industry use the Kappa architecture directly, and the applicable scenarios are relatively narrow.

Readers familiar with real-time warehouse production may have a question about the Kappa architecture. Business changes constantly, and much business logic must be iterated: if the caliber (definition) of previously produced data changes, it must be recomputed, and historical data may even need to be backfilled. How does a real-time data warehouse solve this recomputation problem?

The Kappa architecture's answer is a message queue, such as Kafka, that retains historical data and supports restarting consumption from a historical point. You start a new task that consumes the Kafka data from an earlier point in time, and once the new task has caught up with the currently running task, you switch the downstream over to the new task and stop the old one; its original output table can then be deleted.
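The replay-and-switch procedure can be sketched as follows (plain Python simulating a retained log and consumer offsets; all names are illustrative):

```python
# Kappa reprocessing sketch: the queue retains history; a new job replays from
# an earlier offset with the corrected logic, and downstream switches over once
# the new job has caught up with the old one.
LOG = [10.0, 20.0, 30.0, 40.0]      # retained message history (e.g. Kafka)

class Job:
    def __init__(self, logic, start_offset=0):
        self.logic, self.offset, self.total = logic, start_offset, 0.0
    def poll(self):
        # Consume one message and update the running result.
        if self.offset < len(LOG):
            self.total += self.logic(LOG[self.offset])
            self.offset += 1

old_job = Job(logic=lambda x: x)             # old caliber, fully caught up
for _ in LOG:
    old_job.poll()

new_job = Job(logic=lambda x: x * 2)         # corrected caliber, replays from 0
while new_job.offset < old_job.offset:       # catch up to the old job's position
    new_job.poll()

active = new_job                             # switch downstream; stop old job
print(active.total)                          # 200.0 under the corrected logic
```

Once `active` points at the new job, the old job and its output table can be removed, which is the whole reprocessing story under Kappa.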

5. Stream-batch combined real-time data warehouse

With the development of real-time OLAP technology, current open-source OLAP engines such as Doris and Presto have improved greatly in performance and ease of use. Together with the rapid development of data lake technology, the stream-batch combined approach has become simple.

The figure below shows the stream-batch combined real-time data warehouse:

Data is collected uniformly from logs into the message queue and then into the real-time data warehouse, so the construction of the basic data flow is unified. Afterwards, log-type data for real-time features and real-time dashboard applications goes through streaming computation, while Binlog-type data for business analysis goes through real-time OLAP batch processing.

We can see that in the stream-batch combined architecture the storage has changed from Kafka to Iceberg. Iceberg is an intermediate layer between the upper computing engines and the underlying storage format; we can define it as a "data organization format", with the underlying storage still being HDFS. So why does this middle layer combine so well with stream-batch processing? Iceberg's ACID capability simplifies the design of the entire pipeline and reduces its latency, and its update and delete capability effectively reduces overhead and improves efficiency. Iceberg can support both high-throughput batch data scans and concurrent real-time stream processing at partition granularity.
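To illustrate why row-level updates and deletes reduce overhead, here is a toy merge-on-read sketch (plain Python; this is not the Iceberg API, just the underlying idea): immutable data files plus a small set of recorded deletes, merged at read time, so an update does not rewrite any data file.

```python
# Toy merge-on-read sketch of row-level deletes (not the Iceberg API):
# immutable data files plus recorded deletes, merged when the table is read,
# so an update rewrites nothing and only appends.
data_files = [
    [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],   # immutable file 1
    [{"id": 3, "v": "c"}],                         # immutable file 2
]
delete_ids = set()
appended = []

def delete_row(row_id):
    delete_ids.add(row_id)          # record a delete; touch no data file

def update_row(row_id, value):
    delete_row(row_id)              # update = delete old row + append new row
    appended.append({"id": row_id, "v": value})

def read_table():
    # Deletes apply to the base files; appended rows are newer than the deletes.
    base = [r for f in data_files for r in f if r["id"] not in delete_ids]
    return base + appended

update_row(2, "B")
print(read_table())   # [{'id': 1, 'v': 'a'}, {'id': 3, 'v': 'c'}, {'id': 2, 'v': 'B'}]
```

Because a change is just a small metadata-plus-append operation, a pipeline over such a table avoids the full-partition rewrites that make pure-Kafka or pure-Hive pipelines slow.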