This is Peng Wenhua’s 99th original article.
Several readers have messaged me in the backend asking how the major companies build their real-time data warehouses. In fact, a real-time data warehouse is modeled the same way as an offline one; what differs is the computing engine and the storage. Beyond that, we only need to solve a few problems specific to real-time computing scenarios. Today I would like to share the architecture of the real-time data warehouse with you.
Real-time computing architecture selection
Current real-time architecture approaches are Lambda and Kappa.
1. Lambda architecture
The Lambda architecture has three core layers: a batch processing layer, a stream processing layer, and a service layer. The batch layer computes over long spans of historical data, while the stream layer computes over recent, short-span real-time data. If a requirement needs a cumulative total up to the current moment, the service layer simply adds the two results together.
The Lambda architecture has to maintain two computing engines. If you want a real-time cumulative figure from the past up to now, you have to run the same calculation on both sides and then add the results, which is very cumbersome. Hence the Kappa architecture, which is very popular these days.
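To make the three layers concrete, here is a minimal sketch in Python. It is not any company's implementation: the function names, the shop keys, and the sample amounts are all made up for illustration, and plain lists stand in for the offline store and the message queue.

```python
from collections import Counter

def batch_view(historical_events):
    """Batch layer: aggregate the long-span history (hypothetical sample data)."""
    totals = Counter()
    for key, amount in historical_events:
        totals[key] += amount
    return totals

def speed_view(recent_events):
    """Stream layer: aggregate only the events that arrived after the last batch run."""
    totals = Counter()
    for key, amount in recent_events:
        totals[key] += amount
    return totals

def serve(key, batch_totals, speed_totals):
    """Service layer: cumulative value = batch result + real-time result."""
    return batch_totals.get(key, 0) + speed_totals.get(key, 0)

# Illustrative inputs only: (key, amount) pairs.
historical = [("shop_a", 100), ("shop_b", 40), ("shop_a", 60)]
recent = [("shop_a", 5), ("shop_c", 7)]

batch_totals = batch_view(historical)
speed_totals = speed_view(recent)
print(serve("shop_a", batch_totals, speed_totals))
```

Note that the same aggregation is written twice, once per layer. In a real system those two copies live in different engines, which is exactly the maintenance burden described above.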
2. Kappa architecture
The design of the Kappa architecture is interesting. Lambda separates offline from real-time anyway: you read from the offline store and from the real-time message queue, compute each separately, and add the results at the service layer.
Kappa’s design philosophy is: skip offline entirely and do everything as streaming. The data source for stream computing is the message queue, so just put all the data that needs computing into the message queue and let the stream computing engine process all of it.
Because all the data is kept in Kafka, the Flink job on top of it computes over the Kafka data and stores the results in service-layer table N. When the data has to be recomputed, Flink starts a new task that replays the Kafka data and writes its results to table N+1. Once table N+1 catches up with table N, the task feeding table N is stopped.
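The table N / table N+1 switch can be sketched roughly as follows. This is a toy model under my own assumptions, not Flink or Kafka code: a Python list stands in for the Kafka topic, and `StreamJob`, `poll`, and `caught_up` are hypothetical names invented for this example.

```python
class StreamJob:
    """Toy stand-in for a streaming task that reads a log and fills a result table."""

    def __init__(self, transform):
        self.transform = transform  # per-record computing logic
        self.offset = 0             # next position to read from the log
        self.table = {}             # service-layer result table (key -> total)

    def poll(self, log, max_records):
        """Consume up to max_records from the log and update the table."""
        batch = log[self.offset:self.offset + max_records]
        for key, value in batch:
            self.table[key] = self.table.get(key, 0) + self.transform(value)
        self.offset += len(batch)
        return len(batch)

    def caught_up(self, log):
        """True once the job has read everything currently in the log."""
        return self.offset >= len(log)

# The retained log plays the role of the Kafka topic (sample data only).
log = [("shop_a", 10), ("shop_b", 20), ("shop_a", 30)]

# Job N runs the old logic and serves queries from table N.
job_n = StreamJob(transform=lambda v: v)
while not job_n.caught_up(log):
    job_n.poll(log, max_records=2)
serving = job_n

# Logic changes: start job N+1 from offset 0 and let it replay the whole log.
job_n1 = StreamJob(transform=lambda v: v * 2)
while not job_n1.caught_up(log):
    job_n1.poll(log, max_records=2)

# Table N+1 has caught up: switch the service layer over and retire job N.
serving = job_n1
print(serving.table)
```

The key point is that recomputation needs no separate batch engine; it is the same streaming job started again from the beginning of the retained log.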
Finally, compare the advantages and disadvantages of the two architectures:
There is no best architecture, only the most appropriate one. Although the Kappa architecture, with unified stream and batch processing, is currently the newest and most popular pattern, most large companies still use the Lambda architecture, with separate batch and stream processing. The problem with Lambda is not only that two sets of code have to be maintained, but also that the data the two sets of code produce is not fully consistent. A small error rate multiplied by a large base makes a big difference. I have noticed that the real-time data in the WeChat official account backend differs from the offline results, which suggests it runs on a Lambda architecture.
This is not for lack of technical ability. Kappa has problems of its own, such as weak batch processing, laborious data backfilling, and a limited range of application scenarios. Even if Kappa fixes these problems, a full replacement would cost time and manpower.
So even where Flink is used, it serves either as the stream-processing branch of a Lambda architecture or as unified stream-batch processing for specific scenarios.
Real-time computing product selection
Whether Lambda or Kappa, real-time computing requires a data source, a data channel, a real-time computing engine, and a storage engine.
The data sources are mostly logs of various kinds; sometimes data also has to be read from offline storage, such as dimension information.
Data channels are basically message-oriented middleware like Kafka and RocketMQ.
The main real-time computing engines today are Storm, Spark Streaming, and Flink.
There are more storage options, which fall into several types: stores for various kinds of queries, OLAP engines for multidimensional analysis, and data lakes serving a large number of applications.
I have sorted out all the components for your reference:
I’m going to focus on storage. If the results feed dashboards (large screens) and similar applications directly, Redis is recommended. For fast lookups, write to HBase or ES. If further processing follows, write to Kafka. For structured queries, write to MySQL.
If you want to connect to OLAP for multidimensional analysis, pick from the OLAP products: for quasi-real-time multidimensional analysis there are Impala, GP (Greenplum), Presto, Doris, Kudu, and so on, while large wide tables are served by ES, CK (ClickHouse), and Druid.
If a large number of downstream applications need the data, use a data lake. At present there are basically three: Hudi, Iceberg, and Delta. ByteDance uses Hudi, while Tencent is promoting Iceberg.
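The storage recommendations above can be condensed into a simple lookup. This is only my summary as a sketch: the use-case keys and the `suggest_storage` helper are hypothetical names, and I am assuming ES, GP, and CK stand for Elasticsearch, Greenplum, and ClickHouse.

```python
# Use case -> storage candidates, as recommended in the text above.
# The dictionary keys are labels invented for this sketch.
STORAGE_BY_USE_CASE = {
    "dashboard": ["Redis"],
    "fast_lookup": ["HBase", "Elasticsearch"],
    "further_processing": ["Kafka"],
    "structured_query": ["MySQL"],
    "olap_multidimensional": ["Impala", "Greenplum", "Presto", "Doris", "Kudu"],
    "olap_wide_table": ["Elasticsearch", "ClickHouse", "Druid"],
    "data_lake": ["Hudi", "Iceberg", "Delta"],
}

def suggest_storage(use_case):
    """Return the storage candidates for a use case, or raise on an unknown one."""
    try:
        return STORAGE_BY_USE_CASE[use_case]
    except KeyError:
        known = ", ".join(sorted(STORAGE_BY_USE_CASE))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {known}")

print(suggest_storage("dashboard"))
```

A lookup like this is only a starting point; the final choice still depends on what your front-end application needs.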
OLAP also includes the well-known Kylin, but it is not suitable for a real-time data warehouse because it relies on precomputation.
In addition, the large companies develop some components in-house, such as Alibaba Hologres, Didi DDMQ, and Meituan Cellar and Mafka. I will not cover them here, since they offer little reference value.
Real-time data warehouse practice at major companies
Selection and layered architecture of the Meituan Delivery real-time data warehouse:
This Meituan diagram is very pleasant to look at. The data sources are various logs, which flow through message queues (Kafka and Mafka) into the real-time and quasi-real-time pipelines. Flink and Storm handle the real-time side, and the results finally land in Redis, Druid, and HBase. Details can be found in the attachment.
ByteDance real-time data warehouse selection and layered architecture:
In fact, every large company tries many technologies. ByteDance, for example, uses batch, micro-batch, and stream processing. Its real-time engine is Flink, and the overall architecture is Lambda.
Youzan real-time data warehouse selection and layered architecture:
Youzan uses a Kafka + Storm/Flink + Druid/Redis/HBase architecture. The article describes the iterative evolution of its real-time data warehouse in detail and is well worth referring to.
I won’t go through any more here. As for real-time data lakes, there is not much material available yet, so I will skip them for now. Those interested can download the documents and explore them on their own.
Conclusion
The modeling logic of a real-time data warehouse is the same as that of an ordinary data warehouse: divide the domains, split the subjects, separate the layers, build the tables, and do multidimensional analysis just as before.
What has changed is that the raw data is no longer landed in storage first, so we have to solve the various problems that come with streaming data.
There are not many options for data sources: basically Kafka, RocketMQ, and other message-oriented middleware.
For the computing engine, I suggest Spark Streaming or Flink; Storm is less friendly by comparison. If your company already runs Storm, consider flink-storm: 58 and ByteDance did some development work on it to achieve direct compatibility.
The storage engine is chosen according to the front-end application, as introduced earlier, so I will not repeat it here.
Finally, here is a summary of the real-time data warehouse architectures at the major companies:
The materials are ready for you: reply “real-time data warehouse” in the official account backend to download them all.