This article is shared by Ju Dasheng, researcher and real-time computing leader at Meituan, and introduces Meituan's practice of using Flink to support incremental production of its data warehouse. The contents include:

  1. Data warehouse incremental production
  2. Streaming data integration
  3. Streaming data processing
  4. Streaming OLAP applications
  5. Future planning

1. Data warehouse incremental production

1. Meituan data warehouse architecture

First, let me introduce the architecture of the Meituan data warehouse and its incremental production. As shown below, this is a simplified view of the Meituan data warehouse, which I describe as "three horizontals and four verticals". The first horizontal is metadata and lineage, which runs through the entire link of data integration, data processing, data consumption, and data application. Another horizontal that runs through the whole link is data security, including the restricted-domain authentication system, the permission system, and the overall audit system. Following the flow of data, we divide data processing into four vertical stages: data integration, data processing, data consumption, and data application.

In the data integration stage, we have corresponding integration systems for internal data such as user behavior data, log data, DB data, and file data, unifying them into our processing-side storage, such as Kafka. The data processing stage is divided into a streaming link, a batch link, and the data warehouse working platform built on top of them (the Vientiane platform). The produced data is imported into consumer-side storage through Datalink and is finally presented in different forms by the applications.

We currently use Flink in a wide range of scenarios, including moving data from Kafka to Hive, real-time processing, and data export. Today's talk focuses on these areas.

2. Application overview of Meituan Flink

Meituan's Flink deployment currently runs on about 6,000 physical machines supporting roughly 30,000 jobs. We consume around 50,000 topics, with a daily peak traffic of about 180 million messages per second.

3. Application scenarios of Meituan Flink

The main application scenarios of Meituan Flink fall into four major areas.

  • First, real-time data warehouse, operational analysis, and real-time marketing.
  • Second, recommendation and search.
  • Third, risk control and system monitoring.
  • Fourth, security audit.

4. Real-time data warehouse vs. incremental data warehouse production

Now I will introduce the concept of incremental production. It involves three dimensions: first, timeliness; second, quality, that is, the quality of the data produced; and third, cost.

Timeliness has two deeper meanings: real time and on time. Not all business requirements are real-time; often the requirement is punctuality. In business analysis, for example, what matters is getting yesterday's complete business data every day. A real-time data warehouse mostly addresses real-time requirements. For punctuality, however, an enterprise wants to trade off punctuality against cost. I therefore define incremental production as a trade-off between punctuality and cost for the offline data warehouse. In addition, incremental production improves quality: problems can be discovered in a timely manner.

5. Advantages of incremental production in the data warehouse

There are two advantages to incremental production in the data warehouse.

  • Data quality problems can be detected in a timely manner to avoid T+1 data repair.
  • Make full use of resources and advance the time of data production.

As shown in the figure below, what we expect to achieve is actually the second figure: we want to reduce the resource footprint of offline production while moving data production forward in time.

2. Streaming data integration

1. Data integration V1.0

Let's look at the first generation of data integration. When the amount of data and the number of databases were small, we simply built a batch transfer system: every morning, the corresponding DB data was reloaded and imported into the warehouse. The advantage of this architecture is that it is very simple and easy to maintain, but its disadvantage is equally obvious: for large databases or large volumes of data, the load can take 2 to 3 hours, which seriously delays the output of the offline data warehouse.

2. Data integration V2.0

On top of this architecture, we added a streaming transmission link: a streaming collection system gathers the corresponding binlog into Kafka, a Kafka2Hive program imports it as raw data, and a Merge layer produces the ODS data required downstream.

The advantage of data integration V2.0 is obvious: we move the data transfer to day T+0, so the next day only a merge is needed. The processing time drops from two or three hours to one hour, a significant saving.
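The merge semantics can be sketched in a few lines of Python (a toy model with hypothetical record shapes, not Meituan's actual Merge job): replay the day's binlog events on top of yesterday's snapshot, letting the latest change per primary key win and dropping deleted keys.

```python
def merge_snapshot_with_binlog(snapshot, binlog):
    """Merge yesterday's ODS snapshot with today's binlog events.

    snapshot: dict mapping primary key -> row
    binlog:   ordered list of (op, key, row) tuples, where op is
              'insert', 'update', or 'delete'
    Returns the new snapshot; the latest change per key wins.
    """
    merged = dict(snapshot)  # start from yesterday's full data
    for op, key, row in binlog:
        if op == 'delete':
            merged.pop(key, None)
        else:  # insert and update both upsert by primary key
            merged[key] = row
    return merged

yesterday = {1: {'name': 'a'}, 2: {'name': 'b'}}
events = [('update', 1, {'name': 'a2'}),
          ('insert', 3, {'name': 'c'}),
          ('delete', 2, None)]
today = merge_snapshot_with_binlog(yesterday, events)
# today == {1: {'name': 'a2'}, 3: {'name': 'c'}}
```

Because events are applied in binlog order, a key that is updated and then deleted within the same day correctly disappears from the result.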

3. Data integration V3.0

In form, the third-generation data integration architecture has not changed much, since V2.0 already streams the data. The key is the Merge that follows. Merging for an hour every morning still wastes time and resources, and it puts a lot of pressure on HDFS. In this context, we iterated to the HIDI architecture.

HIDI is something we built internally on top of HDFS.

4. HIDI

We designed HIDI around four core requirements. First, support reads and writes from the Flink engine. Second, support primary-key Upsert/Delete through MOR (merge-on-read) mode. Third, manage small files via Compaction. Fourth, support Table Schema.
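The MOR idea behind the second requirement can be illustrated with a toy sketch (plain Python, not the HIDI internals; class and method names are hypothetical): writes only append upsert/delete events keyed by primary key, reads merge those events over the base data, and compaction folds the pending events back into the base.

```python
class MorTable:
    """Toy merge-on-read table: base data plus an append-only delta log."""

    def __init__(self):
        self.base = {}    # compacted data, primary key -> row
        self.deltas = []  # pending (op, key, row) events

    def upsert(self, key, row):
        self.deltas.append(('upsert', key, row))  # cheap append on write

    def delete(self, key):
        self.deltas.append(('delete', key, None))

    def read(self):
        # the merge happens at read time: replay deltas over the base
        view = dict(self.base)
        for op, key, row in self.deltas:
            if op == 'delete':
                view.pop(key, None)
            else:
                view[key] = row
        return view

    def compact(self):
        # fold deltas into the base so future reads pay less
        # (the small-file management Compaction plays a similar role)
        self.base = self.read()
        self.deltas = []

t = MorTable()
t.upsert(1, 'a'); t.upsert(2, 'b'); t.delete(1); t.upsert(2, 'b2')
# t.read() == {2: 'b2'}, before and after t.compact()
```

The trade-off this models is the classic MOR one: writes stay fast because they never rewrite base data, at the cost of merge work on each read until a compaction runs.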

With these considerations in mind, let’s compare HIDI, Hudi and Iceberg.

HIDI’s advantages include:

  • Upsert/Delete based on primary keys is supported
  • Support for Flink integration
  • Small file management Compaction

Disadvantages include: incremental reads are not supported.

Hudi’s advantages include:

  • Upsert/Delete based on primary keys is supported
  • Small file management Compaction

Disadvantages include:

  • Writes are coupled to Spark/DeltaStreamer
  • Streaming reads and writes rely on Spark Streaming

The advantages of Iceberg include: integration with Flink is supported.

Disadvantages include:

  • Upsert/Delete is implemented via Join rather than primary keys
  • Streaming reads are not supported

5. Streaming data integration effect

As shown in the figure below, there are three stages: data generation, data integration, and ETL production. By completing streaming data integration at T+0, ETL production can start earlier, saving us cost.

3. Streaming data processing

1. ETL incremental production

Let's talk about the ETL incremental production process. Data comes in at the front, passes through Kafka, is processed by Flink in real time, goes back to Kafka, and then serves events and even analysis scenarios; this is the real-time link we built ourselves.

Below it is the batch link. Data is integrated into HDFS through Flink, offline production is done with Spark, and results are exported to OLAP applications through Flink. In this architecture, incremental production is the part marked green in the figure below; we expect to replace Spark with Flink's incremental production architecture.

2. SQL-ification is the first step in ETL incremental production

Such an architecture requires three core capabilities.

  • First, Flink’s SQL capabilities need to align with Spark.
  • Second, our Table Format layer needs to be able to support real-time operations of primary key updates such as Upsert/Delete.
  • Third, our Table Format needs to support both full and incremental reads.

Full reads serve querying and data repair, while incremental reads serve incremental production. SQL-ification is the first step of ETL incremental production, and today's talk mainly covers how our real-time data warehouse platform based on Flink SQL supports it.
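The difference between the two read modes can be sketched with a toy versioned table (plain Python; the class and method names are hypothetical, not any real table format API): a full read folds all committed changes into one snapshot, while an incremental read returns only the changes committed after a given version.

```python
class VersionedTable:
    """Toy table format that keeps per-version change sets, to contrast
    full reads (query/repair) with incremental reads (production)."""

    def __init__(self):
        self.versions = []  # list of {key: row-or-None} change sets

    def commit(self, changes):
        self.versions.append(dict(changes))
        return len(self.versions)  # version number of this commit

    def read_full(self):
        # full read: fold every committed version into one snapshot
        snap = {}
        for chg in self.versions:
            for k, v in chg.items():
                if v is None:      # None marks a delete
                    snap.pop(k, None)
                else:
                    snap[k] = v
        return snap

    def read_incremental(self, since_version):
        # incremental read: only changes committed after since_version
        out = {}
        for chg in self.versions[since_version:]:
            out.update(chg)
        return out

t = VersionedTable()
v1 = t.commit({1: 'a', 2: 'b'})
t.commit({2: 'b2', 3: 'c', 1: None})   # update, insert, delete
# t.read_full()           == {2: 'b2', 3: 'c'}
# t.read_incremental(v1)  == {2: 'b2', 3: 'c', 1: None}
```

A downstream incremental job would remember the last version it consumed and pull only the delta, while a repair job would rebuild from the full snapshot.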

3. Real-time data warehouse model

As shown in the figure below, this is a real-time data warehouse model; everyone in the industry has seen a model like this.

4. Real-time data warehouse platform architecture

The platform architecture of the real-time data warehouse is divided into a resource layer, storage layer, engine layer, SQL layer, platform layer, and application layer. Two points should be emphasized here.

  • The first is support for UDFs. UDFs are an important way to compensate for missing operator capabilities, and we hope the UDF support here strengthens the SQL capability.
  • Second, this framework only supports Flink Streaming and not Flink batch, because we envision that all future architectures will be based on streaming, which is consistent with the direction of the community.

5. Real-time data warehouse platform Web IDE

This is the Web IDE of our real-time data warehouse platform. In this IDE, we support an SQL-based modeling process and ETL development capabilities.

4. Streaming OLAP applications

1. Synchronize heterogeneous data sources

Let's look at streaming export and OLAP applications. The figure below shows the synchronization of heterogeneous data sources, something many open source products do. Data is constantly exchanged between different stores, and the idea is to build a middleware, or middle platform, like Datalink, abstracting N-to-N data exchange into N-to-1 data exchange.

2. DataX-based synchronization architecture

The first version of heterogeneous data source synchronization was built on DataX. This framework includes a tool platform layer, a scheduling layer, and an execution layer.

  • The tool platform layer is simple: it mainly faces users, handling the configuration of synchronization tasks, scheduling, and operations.
  • The scheduling layer is responsible for task scheduling, task state management, and management of the execution machines; much of this work had to be done by ourselves.

At the execution layer, synchronizing data from the source to the destination is actually performed by DataX processes, using a multithreaded Task model.

  • Within this architecture, we identified two core problems. The first is scalability: open source DataX is a standalone multithreaded model, and when a very large amount of data must be transferred, the scalability of a single machine becomes a serious bottleneck. The second lies in the scheduling layer: we had to manage machines, synchronization state, and synchronization tasks ourselves, which is very tedious, and when a scheduling machine failed we had to handle disaster recovery separately.
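The scalability contrast can be made concrete with a hypothetical key-range splitter: instead of one process fanning a table out across its own threads, the job is cut into independent shards that a distributed runtime can place on many machines (in the Flink version, across TaskManager slots).

```python
def split_key_range(min_key, max_key, parallelism):
    """Split a numeric primary-key range into roughly equal shards,
    so each shard can run as an independent subtask on any machine
    instead of as a thread inside one DataX process."""
    total = max_key - min_key + 1
    size = -(-total // parallelism)  # ceiling division
    shards = []
    lo = min_key
    while lo <= max_key:
        hi = min(lo + size - 1, max_key)
        shards.append((lo, hi))     # each shard reads [lo, hi]
        lo = hi + 1
    return shards

# keys 1..100 split across 4 subtasks
# -> [(1, 25), (26, 50), (51, 75), (76, 100)]
print(split_key_range(1, 100, 4))
```

Scaling up is then a matter of raising the parallelism, with no single machine holding all the threads, state, or failure domain.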

3. Synchronization architecture based on Flink

Based on this, we moved to a Flink-based synchronization architecture. The tool platform layer stays the same. The task scheduling and execution management that sat in our scheduling layer are handed over to Yarn, freeing us from that work. Second, the task state management in the scheduling layer migrates directly to the cluster.

The architectural advantages of Flink-based Datalink are clear.

  • First, the scalability problem is solved and the architecture is very simple. When we split a synchronization task, it can spread across TaskManagers in a distributed cluster.
  • Second, offline and real-time synchronization tasks are unified in the Flink framework. All of our synchronization Source and Sink components can be shared, which is a great advantage.

4. Key design of the Flink-based synchronization architecture

Let’s take a look at the key design of a Flink-based synchronization architecture, and there are four lessons learned here.

  • First, avoid shuffling across TaskManagers to avoid unnecessary serialization costs;
  • Second, design a dirty-data collection bypass and a failure feedback mechanism;
  • Third, use Flink Accumulators to design a graceful exit mechanism for batch tasks;
  • Fourth, use S3 for unified management of Reader/Writer plug-ins and distributed hot loading to improve deployment efficiency.
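The second and third lessons can be sketched together in a toy Python function (hypothetical names; in Flink this would be a side output plus Accumulators): unparseable records are diverted to a dirty-data collection instead of failing the task, and the job reports failure when the dirty ratio exceeds a threshold.

```python
def sync_with_dirty_bypass(records, parse, max_dirty_ratio=0.01):
    """Toy dirty-data bypass with failure feedback.

    Records that fail to parse go to a side collection instead of
    crashing the job; the job is judged failed if the share of dirty
    records exceeds max_dirty_ratio.
    """
    clean, dirty = [], []
    for rec in records:
        try:
            clean.append(parse(rec))
        except Exception:
            dirty.append(rec)  # bypass: collect instead of failing
    # failure feedback: too many dirty records means the sync failed
    ok = len(dirty) <= max_dirty_ratio * max(len(records), 1)
    return clean, dirty, ok

rows, bad, ok = sync_with_dirty_bypass(['1', '2', 'x', '4'], int, 0.5)
# rows == [1, 2, 4], bad == ['x'], ok is True
```

The same counters that drive the pass/fail decision can also drive a graceful exit for batch tasks: once every source split reports completion, the job shuts itself down instead of running forever as a stream.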

5. OLAP production platform based on Flink

Based on Flink, we built the Datalink data export platform, and on top of Datalink we built an OLAP production platform. Besides the underlying engine layer, we added a platform layer that manages resources, models, tasks, and permissions, making OLAP production very fast.

Here are two screenshots of our OLAP production. One is for the management of models in OLAP and the other is for the management of task configurations in OLAP.

5. Future planning

After the corresponding iterations, we have applied Flink to data integration, data processing, offline data export, and OLAP production. In the future, we expect stream and batch processing to be unified, and the data to be unified as well. We hope to use Flink for both the real-time link and the incremental link, achieving true stream-batch unification.