This article is shared by Zheng Zhisheng, head of the Bilibili big data real-time platform. It explains how Bilibili implemented a trillion-scale transmission and distribution architecture, and how it built a complete real-time pre-processing pipeline for AI on top of Flink. The sharing covers the following four aspects:

I. The past and present of real-time computing at Bilibili

II. Flink on YARN incremental pipeline scheme

III. Engineering practice of Flink in the AI direction

IV. Future development and thinking

I. The past and present of real-time computing at Bilibili

1. Scenario coverage across the big data ecosystem

When we talk about the future of real-time computing, the keyword is data timeliness. Looking at the core scenarios across the big data ecosystem: in the early days of big data, the core scenario was day-level offline computing. Data timeliness was mostly measured in days, and the main trade-off was between time and cost.

As data applications, data analysis and data warehouses became more widespread and mature, more and more people raised higher requirements on data timeliness. For example, when data needs to feed real-time recommendation, its freshness determines its value. This is the context in which real-time computing was born.

In practice, however, many scenarios do not actually require very high real-time guarantees. Between millisecond- or second-level real-time data and day-level offline data, new scenarios emerged for minute-granularity incremental computing. Offline computing focuses more on cost; real-time computing emphasizes the value of timeliness; incremental computing balances cost, value and time.

2. Timeliness at Bilibili

How does Bilibili divide its workloads across these three dimensions? Currently, about 75% of the data is supported by offline computing, roughly 20% of the scenarios are real-time computing, and 5% are incremental computing.

  • Real-time computing is mainly applied to real-time machine learning, real-time recommendation, advertising and search, data applications, real-time channel analysis, reporting, OLAP, monitoring, and so on;

  • Offline computing has the widest data coverage and is mainly centered on the data warehouse;

  • Incremental computing covers some new scenarios launched this year, such as incremental upsert on Binlog data.

3. ETL timeliness pain points

On the timeliness side, we ran into many pain points early on, mainly in three areas:

  • First, the pipeline lacked pre-processing capability. In the early scheme, data basically landed in ODS by day, and the DW layer scanned all of the previous day's ODS data only after midnight; in other words, the data could not be pre-cleaned inside the pipeline.
  • Second, jobs consuming large amounts of resources were concentrated after midnight, which put great pressure on resource scheduling;
  • Third, the gap between real-time and offline was hard to bridge: for most data, a purely real-time solution is too expensive, while a purely offline one is not timely enough. At the same time, the ingestion time of MySQL data was not good enough. For example, the volume of Bilibili's danmaku (bullet comment) data is enormous, and synchronizing such business tables often took more than ten hours and was very unstable.

4. Complexity of real-time AI engineering

Besides timeliness, real-time AI engineering also ran into complexity problems in the early stage:

  • First, the computational efficiency of the whole feature engineering process was a problem. For the same real-time feature, the data also needs to be backfilled in offline scenarios, so the computation logic was developed twice.
  • Second, the whole real-time link is long. A complete real-time recommendation link covers a dozen or more jobs, N real-time and M offline. When problems occur, the operations and maintenance cost of the whole link is very high.
  • Third, as more AI and algorithm engineers were brought in, experiment iteration was difficult to scale out horizontally.

5. Ecosystem practice with Flink

Against these key pain points, we focused our ecosystem practice on Flink, covering the real-time data warehouse, the incremental ETL pipeline, and AI-oriented machine-learning scenarios. This sharing focuses mainly on the incremental pipeline and on Flink plus AI. The figure below shows the overall scale: the total volume of transmission and computation is in the trillions of messages, with 30,000+ compute cores, 1,000+ jobs, and more than 100 users.

II. Flink on YARN incremental pipeline scheme

1. Early architecture

Let's take a look at the early architecture of the pipeline. As the diagram below shows, Kafka data was consumed mainly by Flume and written to HDFS. Flume used its transaction mechanism to ensure data consistency from Source to Channel to Sink. When the data finally landed on HDFS, the downstream scheduler scanned the directory for TMP files to determine whether the data was ready, and then scheduled and pulled up the downstream offline ETL jobs.

2. Pain points of the early architecture

We encountered many pain points early on:

  • The first is data quality.
    • MemoryChannel was used first, but it lost data; FileChannel was tried next, but its performance did not meet the bar. In addition, when HDFS was unstable, Flume's transaction mechanism caused data to roll back into the Channel, which led to continuous duplication to some extent. When HDFS was extremely unstable, the duplication rate could reach the percent level;
    • Lzo row storage: early on the whole transfer used delimiter-separated records. Such a delimiter-based schema is weak and does not support nested formats.
  • Second, data timeliness: minute-level queries could not be provided, because Flume has no checkpoint-based file cutting mechanism like Flink's and controls file closing only through an idle mechanism.
  • Third, downstream ETL linkage. As mentioned above, the scheduler mainly scans the TMP directory to see whether the data is ready, which means it issues a large number of Hadoop list API calls against the NameNode and puts great pressure on it.

3. Stability-related pain points

There were also many stability problems:

  • First, Flume is stateless: after a node crashes or restarts, TMP files cannot be closed properly.
  • Second, in the early days there was no unified big data environment; everything was physically deployed, which made it hard to scale resources elastically and kept the cost relatively high.
  • Third, there were communication problems between Flume and HDFS. For example, when HDFS writes were blocked, the blockage on one node back-pressured its Channel, which caused the Source to stop consuming from Kafka and stop advancing its offsets; Kafka would then rebalance at some point, and in the end the global offsets could not move forward, leading to data accumulation.

4. Trillion-level incremental pipeline: DAG view

Given the pain points above, the core of our solution was to build a trillion-level incremental pipeline based on Flink. The DAG view of the whole runtime is shown below.

First, with Flink's KafkaSource, the avalanche of rebalances is eliminated: even if one concurrent task in the DAG is blocked, it will not block all Kafka partitions globally. In addition, the essence of the scheme is to implement extensible nodes through a Transform module.

  • The first layer of nodes is the Parser, which mainly decompresses and deserializes data;
  • The second layer introduces customized ETL modules provided to users, so that data can be cleaned in a customized way inside the pipeline;
  • The third layer is the Exporter module, which supports exporting data to different storage media. For example, when writing to HDFS it exports Parquet, and when writing to Kafka it exports PB format. In addition, a ConfigBroadcast module is introduced into the DAG to solve real-time updating and hot loading of pipeline metadata. The whole link checkpoints every minute, appending the actual incremental data, so minute-level queries can be provided.
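As a rough illustration of the extensible Transform chain described above, here is a minimal plain-Python sketch of a Parser → custom ETL → Exporter pipeline. The class and function names are hypothetical and the logic is deliberately simplified; it only shows how the three layers plug together, not the actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Record:
    topic: str
    payload: bytes

class Parser:
    """First layer: decompress / deserialize the raw payload."""
    def apply(self, raw: bytes) -> Record:
        # The real pipeline would decompress and Protobuf-decode the event here.
        topic, _, body = raw.partition(b"|")
        return Record(topic=topic.decode(), payload=body)

class CustomEtl:
    """Second layer: user-provided cleaning logic, pluggable per pipeline."""
    def __init__(self, clean_fn: Callable[[Record], Record]):
        self.clean_fn = clean_fn
    def apply(self, record: Record) -> Record:
        return self.clean_fn(record)

class Exporter:
    """Third layer: route the record to a storage-specific writer (HDFS/Parquet, Kafka/PB, ...)."""
    def __init__(self, writers: dict):
        self.writers = writers
    def apply(self, record: Record) -> None:
        self.writers[record.topic](record)

def run_pipeline(raw_events: Iterable[bytes], stages: List) -> None:
    for raw in raw_events:
        value = raw
        for stage in stages[:-1]:
            value = stage.apply(value)
        stages[-1].apply(value)

if __name__ == "__main__":
    stages = [
        Parser(),
        CustomEtl(lambda r: Record(r.topic, r.payload.strip())),
        Exporter({"play": lambda r: print("write parquet:", r)}),
    ]
    run_pipeline([b"play| some event body "], stages)
```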

5. Trillion-level incremental pipeline: overall view

The overall Flink-on-YARN architecture shows that the pipeline view is divided by BU (business unit). Each Kafka topic represents the distribution of one type of data terminal, and a dedicated Flink job handles the writing for each type of terminal. As the view shows, the pipeline is also assembled for Binlog data, allowing it to be operated with multiple nodes.

6. Technical highlights

Let's look at some of the technical highlights at the core of this architecture. The first three are real-time functional features, and the last three are non-functional optimizations.

  • Data model: format convergence is achieved mainly through Parquet, using a Protobuf-to-Parquet mapping;
  • Partition notification: one pipeline actually processes multiple streams, so the core is a partition-ready notification mechanism across multiple streams;
  • CDC pipeline: Binlog plus Hudi are used to solve upsert problems;
  • Small files: file merging is solved mainly through the DAG topology at runtime;
  • HDFS communication: optimizations of several key problems at trillion scale;
  • Finally, some optimizations for partition fault tolerance.

6.1 Data Model

Business development originally reported data mainly by concatenating strings record by record. Later, development was organized through model definition and management: the platform provides an entry where users register each stream and each table together with its schema; the schema is then generated into a Protobuf file, and users can download the corresponding Protobuf model file from HDFS on the platform. In this way, client-side development is constrained by PB in a strong-schema mode.

At runtime, the Kafka Source consumes every RawEvent actually sent from upstream. A RawEvent contains a PBEvent object, which is one Protobuf record. The data flows from the Source to the Parser module, where it is parsed into PBEvents. The schema models that users enter on the platform are stored in the OSS object store, and the Exporter module dynamically loads model changes. The event object is then generated by reflection over the PB file and can finally be mapped into the Parquet format. A lot of reflection caching was done here, which improved the dynamic PB resolution performance by a factor of six. Finally, the data lands on HDFS in Parquet format.
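To illustrate why caching the reflective schema resolution helps, here is a small hypothetical sketch in plain Python. The schema registry, stream names and field layout are invented for illustration; the point is simply that the resolved descriptor is memoized per stream/version so reflection is not repeated for every record.

```python
import functools

# Hypothetical registry: (stream, version) -> already-compiled field layout.
# In the real pipeline this would be a parsed Protobuf descriptor loaded from OSS.
_SCHEMA_STORE = {
    ("play_event", 3): [("uid", int), ("avid", int), ("duration_ms", int)],
}

@functools.lru_cache(maxsize=1024)
def resolve_schema(stream: str, version: int):
    """Resolve (and cache) the descriptor for one stream/version.

    Caching the resolved descriptor avoids re-running reflection on every
    record, which is the idea behind the ~6x speedup mentioned above.
    """
    return _SCHEMA_STORE[(stream, version)]

def to_parquet_row(stream: str, version: int, pb_fields: dict) -> dict:
    schema = resolve_schema(stream, version)   # cache hit after the first record
    return {name: typ(pb_fields[name]) for name, typ in schema}

if __name__ == "__main__":
    row = to_parquet_row("play_event", 3, {"uid": "7", "avid": "42", "duration_ms": "900"})
    print(row)  # {'uid': 7, 'avid': 42, 'duration_ms': 900}
```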

6.2 Partition Notification Optimization

As mentioned above, one pipeline can process hundreds of streams. In the early Flume architecture, it was hard for a Flume node even to sense its own progress, let alone the global progress. With Flink, this can be done with watermarks.

A watermark is generated in the Source based on the event time in the message and is propagated through each processing layer down to the Sink. Finally, a Committer module aggregates the watermark progress of all streams in a single-threaded manner. When it sees that the global watermark has advanced into the next hour's partition, it sends a message to the Hive Metastore, or writes to Kafka, announcing that the previous hour's partition data is ready. This allows the downstream scheduler to pull up jobs faster in a message-driven way.
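The following is a minimal conceptual sketch of such a committer, with hypothetical names: it tracks per-stream watermarks, takes the minimum as the global watermark, and emits a partition-ready notification once that minimum crosses an hour boundary.

```python
import datetime as dt

class PartitionCommitter:
    """Sketch (hypothetical names): track per-stream watermarks and announce an
    hourly partition as ready once the global (minimum) watermark has moved
    past the end of that hour."""

    def __init__(self, notify_fn):
        self.watermarks = {}          # stream name -> latest watermark (datetime)
        self.committed_hour = None    # last hour partition announced as ready
        self.notify_fn = notify_fn    # e.g. write to Hive Metastore or a Kafka topic

    def on_watermark(self, stream: str, watermark: dt.datetime) -> None:
        self.watermarks[stream] = watermark
        global_wm = min(self.watermarks.values())
        ready_hour = global_wm.replace(minute=0, second=0, microsecond=0) - dt.timedelta(hours=1)
        if self.committed_hour is None or ready_hour > self.committed_hour:
            self.committed_hour = ready_hour
            self.notify_fn(ready_hour)   # "partition <hour> is ready" message

if __name__ == "__main__":
    committer = PartitionCommitter(lambda h: print("partition ready:", h))
    committer.on_watermark("danmaku", dt.datetime(2021, 1, 1, 10, 59))
    committer.on_watermark("play",    dt.datetime(2021, 1, 1, 11, 1))
    committer.on_watermark("danmaku", dt.datetime(2021, 1, 1, 11, 2))  # global wm crosses 11:00
```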

6.3 CDC pipeline optimization

The right side of the figure below shows the complete link of the CDC pipeline. To achieve a complete mapping from MySQL data to Hive data, both streaming and batch problems need to be solved.

First, DataX is used to fully synchronize the MySQL data to HDFS in one pass. A Spark job then initializes the data into Hudi's initial snapshot. Next, Canal pulls MySQL's Binlog into a Kafka topic, and a Flink job merges the initial snapshot with the incremental data to perform incremental updates, finally forming the Hudi table.

Across the whole link we need to guarantee no data loss and no duplication. The key point is that Canal writes to Kafka with the transaction mechanism enabled, ensuring that data landing on the Kafka topic is neither lost nor duplicated during transmission. In addition, data may still be duplicated or lost further upstream, which is handled mainly through a globally unique ID plus a millisecond timestamp. In the streaming job, data is deduplicated by the global ID and sorted by the millisecond timestamp, ensuring that it is applied to Hudi in order.
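A toy sketch of this dedup-and-reorder idea follows. The names are hypothetical and the bookkeeping is simplified; in the real job the seen-ID set and the reordering buffer would live in Flink state rather than local Python objects.

```python
import heapq

class BinlogDeduper:
    """Sketch (hypothetical names): deduplicate change records by a globally
    unique id and release them to the downstream upsert ordered by a
    millisecond timestamp, as described for the CDC pipeline."""

    def __init__(self, apply_fn, buffer_ms: int = 5000):
        self.apply_fn = apply_fn        # downstream upsert, e.g. write to Hudi
        self.buffer_ms = buffer_ms      # how long to hold records for reordering
        self.seen_ids = set()           # would be kept in Flink state
        self.heap = []                  # (ts_ms, global_id, row)

    def on_record(self, global_id: str, ts_ms: int, row: dict) -> None:
        if global_id in self.seen_ids:
            return                      # duplicate delivery, drop it
        self.seen_ids.add(global_id)
        heapq.heappush(self.heap, (ts_ms, global_id, row))

    def advance(self, now_ms: int) -> None:
        # Emit everything older than the reordering buffer, in timestamp order.
        while self.heap and self.heap[0][0] <= now_ms - self.buffer_ms:
            ts_ms, global_id, row = heapq.heappop(self.heap)
            self.apply_fn(ts_ms, row)

if __name__ == "__main__":
    d = BinlogDeduper(lambda ts, row: print("upsert", ts, row))
    d.on_record("tx-1", 1000, {"id": 1, "v": "a"})
    d.on_record("tx-1", 1000, {"id": 1, "v": "a"})   # duplicate, ignored
    d.on_record("tx-2", 900,  {"id": 1, "v": "b"})   # late but older record
    d.advance(now_ms=10_000)                          # emits tx-2 then tx-1
```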

A trace system based on ClickHouse is then used to store and count the number of records flowing in and out of each node, so that data can be compared accurately.

6.4 Stability – Merge small files

As mentioned above, after switching to Flink we took checkpoints every minute, and the number of files was greatly amplified. A Merge operator is therefore introduced into the DAG to merge files. The merging is horizontal, based on concurrency, with one writer corresponding to one merger; the 12 files accumulated per hour at the five-minute checkpoint granularity are merged, keeping the number of files within a reasonable range.

6.5 HDFS communication

In actual operation we often ran into serious job backlog, and analysis showed that it was mainly related to HDFS communication.

HDFS communication involves four key steps: initialize state, invoke, snapshot, and notify checkpoint complete.

The core problem occurs mainly in the invoke phase: when the file rolling condition is reached, a flush and close are triggered, and the close, which actually communicates with the NameNode, is often blocked.

The snapshot phase also has problems: once a snapshot is triggered for the hundreds of streams in one pipeline, serially flushing and closing them is also very slow.

Core optimization focuses on three aspects:

  • First, reduce the frequency of file cutting, i.e. of close. Files are not closed during the snapshot phase but are continued (appended); as a result, the initialize-state phase has to truncate the files in order to perform recovery.
  • Second, asynchronous close: the close action no longer blocks processing of the whole link. For the closes triggered in invoke and snapshot, the pending files are tracked in state, so that file recovery can be carried out when the state is initialized.
  • Third, snapshots process multiple streams in parallel. At each five-minute checkpoint, the multiple streams are actually multiple buckets that used to be processed in a serial loop; processing them in parallel reduces checkpoint timeouts.
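A rough sketch of the asynchronous-close idea in the second point is shown below, assuming a hypothetical closer component (invented names, not the actual implementation): pending files are handed to a background thread, and the set of files still pending is what a checkpoint would record for recovery.

```python
import queue
import threading

class AsyncCloser:
    """Sketch (hypothetical names): instead of closing HDFS files inline during
    invoke/snapshot, hand them to a background thread and remember the pending
    files so recovery can re-handle them."""

    def __init__(self, close_fn):
        self.close_fn = close_fn                  # the (possibly slow) NameNode call
        self.pending = set()                      # would be stored in Flink state
        self.q = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def request_close(self, path: str) -> None:
        self.pending.add(path)                    # recorded before the real close
        self.q.put(path)                          # record processing is not blocked

    def _worker(self) -> None:
        while True:
            path = self.q.get()
            self.close_fn(path)                   # slow RPC happens off the hot path
            self.pending.discard(path)

    def snapshot_state(self) -> set:
        # Files still pending are checkpointed; on restore they are truncated/closed.
        return set(self.pending)

if __name__ == "__main__":
    import time
    closer = AsyncCloser(lambda p: (time.sleep(0.1), print("closed", p)))
    closer.request_close("/bili/ods/play/part-0001.parquet")
    print("snapshot:", closer.snapshot_state())
    time.sleep(0.3)
```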

6.6 Some optimizations for partition fault tolerance

When a pipeline carries multiple streams, the data of some streams is not continuous in every hour.

This leaves empty partitions whose watermark cannot advance normally, causing missing-partition problems. We therefore introduced a PartitionRecover module into the pipeline runtime, which advances partition notifications according to the watermark. For streams whose watermark has not been updated within idleTimeout, the Recover module appends the partition: at the end of each partition, plus a delay, it scans the watermarks of all streams and backfills the idle ones.
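A minimal sketch of this idle-timeout recovery logic follows, with hypothetical names and simplified bookkeeping; it only illustrates how an idle stream's partition can be force-advanced so it does not block the global notification.

```python
import datetime as dt

class PartitionRecover:
    """Sketch (hypothetical names): at the end of each hourly partition plus a
    small delay, streams whose watermark has been idle for longer than
    idle_timeout get their partition force-advanced."""

    def __init__(self, idle_timeout: dt.timedelta, delay: dt.timedelta):
        self.idle_timeout = idle_timeout
        self.delay = delay
        self.last_update = {}     # stream -> wall-clock time of last watermark
        self.watermarks = {}      # stream -> event-time watermark

    def on_watermark(self, stream, wm, now):
        self.watermarks[stream] = wm
        self.last_update[stream] = now

    def recover(self, partition_end, now):
        if now < partition_end + self.delay:
            return
        for stream, updated_at in self.last_update.items():
            idle = now - updated_at
            if idle >= self.idle_timeout and self.watermarks[stream] < partition_end:
                self.watermarks[stream] = partition_end   # append/force the partition

if __name__ == "__main__":
    t0 = dt.datetime(2021, 1, 1, 11, 0)
    r = PartitionRecover(dt.timedelta(minutes=10), dt.timedelta(minutes=5))
    r.on_watermark("sparse_stream", dt.datetime(2021, 1, 1, 9, 50), now=t0)
    r.recover(partition_end=dt.datetime(2021, 1, 1, 11, 0), now=t0 + dt.timedelta(minutes=15))
    print(r.watermarks["sparse_stream"])   # forced to the partition boundary
```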

During transmission, a wave of zombie files appears whenever the Flink job restarts. The commit node in the DAG cleans up and deletes zombie files before the partition notification goes out, so that zombie files are cleaned across the whole partition. These are the non-functional optimizations.

III. Engineering practice of Flink in the AI direction

1. Architecture evolution timeline

Below is the complete timeline of the real-time architecture in the AI direction.

  • Back in 2018, algorithm engineers' development was more of a workshop operation. Each engineer developed different experimental projects in whatever language he or she was familiar with, such as Python, PHP or C++. Maintenance was very expensive and failures were frequent;
  • In the first half of 2019, based on the jar-package mode provided by Flink, we provided engineering support for the algorithms, mainly around stability and generality;
  • In the second half of 2019, the self-developed BSQL greatly lowered the bar for model training, and real-time label and instance generation were solved, improving the efficiency of experiment iteration;
  • In the first half of 2020, the work centered on feature computation: unifying stream and batch computation and improving the efficiency of feature engineering;
  • In the second half of 2020, the focus shifted to streamlining the experiment flow and introducing AIFlow to conveniently orchestrate stream-batch DAGs.

2. Review of AI engineering architecture

Looking back at the AI project, the early architecture diagram reflects the architectural view of AI at the beginning of 2019. Its essence was to support the whole model-training link with single tasks and compute nodes made up of a mix of languages. After the iterations of 2019, the whole near-line training was completely replaced with BSQL-based development and iteration.

3. Pain points of the status quo

At the end of 2019 we ran into some new problems, mainly in the functional and non-functional dimensions.

  • At the functional level:
    • Firstly, the whole link from the label and instance streams, through model training and online prediction, all the way to the real experimental effect is very long and complex;
    • Second, integrating real-time features, offline features and stream-batch computation involves a large number of jobs, which makes the link very complex. At the same time, feature computation has to serve both experiments and the online service, and inconsistent results lead to problems in the final effect. In addition, it is hard to find where a feature lives, and there is no way to trace it.

  • At the non-functional level, algorithm students often run into questions such as what a checkpoint is, whether to enable it, and what configuration is available. In addition, when something goes wrong online it is hard to troubleshoot, because the link is very long.
    • The third point is that a complete experiment involves many resources, but the algorithm engineer does not know what these resources are or how many are needed. All of this causes a great deal of confusion for the algorithm team.

4. Summary of the pain points

In the final analysis, the pain points boil down to three aspects:

  • The first is consistency. From data pre-processing, to model training, to prediction, each stage is disconnected, including inconsistencies in the data and inconsistencies in the computation logic;
  • Second, experiment iteration was very slow. For a complete experimental link, the algorithm student has to master a great deal, and the material behind one experiment cannot be shared with others; some features, for example, have to be re-developed for every experiment;
  • Third, the cost of operations, maintenance and control is relatively high.

A complete experimental link is composed of a real-time project plus an offline project, so problems online are hard to investigate.

5. Prototype of real-time AI engineering

Against these pain points, in 2020 we mainly built the prototype of real-time AI engineering, making breakthroughs in the following three aspects.

  • The first is the capabilities of BSQL: for the algorithm team, we hope to reduce engineering investment by letting them develop in SQL;
  • The second is feature engineering: by solving the core problems of feature computation, we can support the various feature requirements;
  • The third is collaboration across the whole experiment. The algorithm's goal is ultimately the experiment, so we hope to build end-to-end experiment collaboration and eventually achieve "one-click experiments" for the algorithm team.

6. Feature engineering – difficulties

We encountered several difficulties in feature engineering:

  • The first is real-time feature computation. Because the results feed the online prediction service, it has very high requirements on latency and stability;
  • Second, the real-time and offline computation logic must be consistent. We often encounter a real-time feature that needs to be backfilled over the offline data of the past 30 to 60 days; how can the computation logic of the real-time feature be reused in offline feature computation?
  • Third, it is difficult to unify the stream and batch sides of offline feature computation. The computation logic of real-time features often involves streaming concepts such as windows and timers, but offline data has no such semantics.

7. Real-time features

Here is a look at how we do real-time features, with some of the most typical scenarios on the right. For example, we want to count in real time how many times a user has played each uploader's (UP主) videos in the last 1 minute, 6 hours, 12 hours and 24 hours. For this scenario there are actually two points:

  • First, it requires a sliding window to compute over the user's recent history. In addition, during the sliding computation, the data also needs to be joined with the uploader's basic-information dimension table to obtain the uploader's videos and count their plays. This ran into two fairly big pain points:
    • using Flink's native sliding window, minute-level slides produce too many windows and the performance loss is large;
    • at the same time, fine-grained windows also create too many timers, and cleanup is inefficient.
  • The second is dimension-table lookup: multiple keys need to be looked up in HBase for their corresponding values, so array-level concurrent queries need to be supported.

Given these two pain points, the sliding window is transformed into a Group By pattern plus an AGG UDF, in which the window data for the whole 1 hour, 6 hours, 12 hours and 24 hours is stored in RocksDB. With this UDF mode, the data-triggering mechanism achieves record-level triggering on top of Group By, and both the semantics and the timeliness are greatly improved. Inside the AGG UDF, RocksDB serves as the state and the UDF maintains the data life cycle. In addition, the SQL is extended to implement array-level dimension-table queries. The overall effect is that, in the real-time feature direction, large windows of various sizes can be supported through this pattern.
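The following is a plain-Python sketch of the idea behind the AGG UDF. The names are hypothetical and the accumulator is held in memory for illustration; the real UDF is invoked from Flink SQL and keeps its accumulator in RocksDB state. One accumulator per key answers all trailing-window counts on every incoming record, which is what replaces the native minute-sliding windows.

```python
import bisect
import time

class MultiWindowCountUdf:
    """Sketch (hypothetical names): one accumulator per (user, uploader) key
    keeps a sorted list of play timestamps and answers counts for several
    trailing windows on every record (record-level triggering)."""

    WINDOWS = {"1m": 60, "6h": 6 * 3600, "12h": 12 * 3600, "24h": 24 * 3600}

    def __init__(self):
        self.acc = {}   # key -> sorted list of event timestamps (seconds)

    def accumulate(self, key, event_ts):
        ts_list = self.acc.setdefault(key, [])
        bisect.insort(ts_list, event_ts)
        # expire anything older than the largest window (data life cycle in the UDF)
        horizon = event_ts - max(self.WINDOWS.values())
        while ts_list and ts_list[0] < horizon:
            ts_list.pop(0)
        # record-level trigger: emit all window counts for this key right away
        return {name: len(ts_list) - bisect.bisect_left(ts_list, event_ts - size)
                for name, size in self.WINDOWS.items()}

if __name__ == "__main__":
    udf = MultiWindowCountUdf()
    now = int(time.time())
    udf.accumulate(("user_1", "up_9"), now - 7200)        # a play two hours ago
    print(udf.accumulate(("user_1", "up_9"), now))         # {'1m': 1, '6h': 2, '12h': 2, '24h': 2}
```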

8. Features – Offline

Next, let's look at the offline side. The upper part of the left view is the complete real-time feature computation link. To make the same SQL reusable for offline computation, the corresponding compute IO has to be made reusable as well. For example, streaming input uses Kafka while offline input uses HDFS, and streaming dimension tables are served by KV engines such as KFC or AVBase while offline they need to be served by the Hive engine. In the final analysis, three problems must be solved:

  • First, the ability to simulate streaming consumption is needed, so that HDFS data can be consumed in offline scenarios;
  • Second, HDFS data must be consumed with ordered partitions, similar to Kafka partition consumption;
  • Third, KV-engine dimension-table consumption must be simulated with Hive-based dimension-table consumption. There is one more problem to solve: each record pulled from HDFS carries a corresponding snapshot when it looks up the Hive dimension table, equivalent to the timestamp of that record, so the dimension partition corresponding to that timestamp must be consumed.

9. Optimization

9.1 Offline – ordered partitions

The ordered-partition scheme is mainly based on modifications made before the data lands on HDFS. Data is transmitted through Kafka before landing on HDFS. After the Flink job pulls data from Kafka, each Kafka Source subtask extracts the watermark from the event time and reports it to a GlobalWatermark module in the JobManager; GlobalAgg aggregates the watermark progress of each subtask to track the global watermark's progress. Based on the global progress, subtasks whose watermark is advancing too fast are identified, and GlobalAgg sends control information back to those Kafka Sources to slow their partition advancement down. In this way, in the HDFS Sink module, the event times of the data recorded within the same time slice are basically ordered, and when the data finally lands on HDFS the file names carry the corresponding partition and time-slice range. Finally, in the HDFS partition directories, ordered data partitions are achieved.
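A toy sketch of the global-watermark throttling idea is shown below, assuming a hypothetical aggregation module (names invented for illustration): every source subtask reports its watermark, and subtasks running too far ahead of the global minimum are told to pause so event times stay roughly ordered within each time slice.

```python
class GlobalWatermarkAgg:
    """Sketch (hypothetical names): subtasks report watermarks; those too far
    ahead of the global minimum get a PAUSE control message."""

    def __init__(self, max_skew_ms: int):
        self.max_skew_ms = max_skew_ms
        self.watermarks = {}            # subtask id -> reported watermark (ms)

    def report(self, subtask: int, watermark_ms: int) -> str:
        self.watermarks[subtask] = watermark_ms
        global_wm = min(self.watermarks.values())
        # control message sent back to the source: pause if too far ahead
        return "PAUSE" if watermark_ms - global_wm > self.max_skew_ms else "RUN"

if __name__ == "__main__":
    agg = GlobalWatermarkAgg(max_skew_ms=60_000)
    print(agg.report(0, 1_000_000))    # RUN (it is the global minimum)
    print(agg.report(1, 1_200_000))    # PAUSE (200s ahead of subtask 0)
    print(agg.report(0, 1_150_000))    # RUN  (skew is now only 50s)
```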

9.2 Offline – Partition Incremental Consumption

After the data is organized incrementally on HDFS, an HDFSStreamingSource is implemented. It partitions files into a Fetcher, with a Fetcher thread per file; each Fetcher thread tracks its file's offset, advances that cursor as its progress, and updates it into state following the checkpoint process.

This achieves ordered advancement of file consumption. When historical data is backfilled, the offline job needs to stop itself once it is finished. A partition end flag is therefore introduced into the FileFetcher module: each thread senses the end of its own partition, the end-of-partition states are collected by a cancellationManager, and the progress of all partitions is further reported to the JobManager. When the cursors of all global partitions reach the end, the whole Flink job is cancelled.
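A simplified sketch of this fetcher-plus-end-flag mechanism follows, with hypothetical names and in-memory lists standing in for HDFS files: each fetcher keeps an offset cursor that would be checkpointed, and the job stops once every partition has reported its end.

```python
class FileFetcher:
    """Sketch (hypothetical names): one fetcher per partition file, tracking a
    line-offset cursor that would be checkpointed into Flink state, plus an
    end-of-partition flag used to decide when the backfill job can stop."""

    def __init__(self, lines):
        self.lines = lines
        self.offset = 0          # restored from state after a failure
        self.finished = False

    def poll(self, max_records=2):
        batch = self.lines[self.offset:self.offset + max_records]
        self.offset += len(batch)
        if self.offset >= len(self.lines):
            self.finished = True  # partition end flag
        return batch

def run_backfill(fetchers):
    while not all(f.finished for f in fetchers):      # cancellationManager role
        for f in fetchers:
            for record in f.poll():
                print("emit", record)
    print("all partitions reached the end -> cancel the job")

if __name__ == "__main__":
    run_backfill([FileFetcher(["a1", "a2", "a3"]), FileFetcher(["b1"])])
```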

9.3 Offline – Snapshot Dimension table

As for the offline data itself, it is stored in Hive. A Hive table on HDFS carries a great deal of field information for the whole table, but the offline features only need a small subset, so the Hive fields have to be pruned first: an ODS table is cleaned into a DW table. The DW table is then served by a Flink job, in which a Reload scheduler periodically checks which partition the watermark of the current data has advanced to, pulls the partition information of the Hive table, downloads the data from the corresponding Hive directory on HDFS, and reloads it into a RocksDB file in memory. RocksDB is the component that actually provides KV lookups for the dimension table.

The component contains multiple RocksDB build processes, driven mainly by the event time of the data flow. If the event time is found to be approaching the end of the current hourly partition, the RocksDB instance for the next hour's partition is proactively built through lazy loading, and the read path is then switched to the new RocksDB.
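The double-buffered reload can be sketched roughly as follows. The names are hypothetical and a loader function stands in for downloading the Hive/RocksDB files from HDFS; the sketch only shows pre-building the next hour's snapshot near the partition boundary and switching reads to it.

```python
import datetime as dt

class SnapshotDimTable:
    """Sketch (hypothetical names): keep the current hour's KV snapshot, pre-build
    the next hour's snapshot when event time approaches the partition boundary,
    then switch reads to it."""

    def __init__(self, load_partition_fn, preload_margin=dt.timedelta(minutes=5)):
        self.load_partition_fn = load_partition_fn   # hour -> dict of key/value rows
        self.preload_margin = preload_margin
        self.current_hour = None
        self.current = {}
        self.next_ready = {}

    def lookup(self, key, event_time):
        hour = event_time.replace(minute=0, second=0, microsecond=0)
        if hour != self.current_hour:
            # switch to the pre-built snapshot (or load lazily if it is missing)
            self.current = self.next_ready or self.load_partition_fn(hour)
            self.current_hour, self.next_ready = hour, {}
        elif event_time >= hour + dt.timedelta(hours=1) - self.preload_margin and not self.next_ready:
            self.next_ready = self.load_partition_fn(hour + dt.timedelta(hours=1))
        return self.current.get(key)

if __name__ == "__main__":
    table = SnapshotDimTable(lambda h: {"up_9": f"profile@{h:%H}"})
    print(table.lookup("up_9", dt.datetime(2021, 1, 1, 10, 10)))   # profile@10
    table.lookup("up_9", dt.datetime(2021, 1, 1, 10, 58))          # triggers preload of the 11:00 snapshot
    print(table.lookup("up_9", dt.datetime(2021, 1, 1, 11, 1)))    # profile@11 (switched)
```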

10. Stream-batch integration for experiments

On the basis of the three optimizations above, namely ordered incremental partitions, Kafka-style partition Fetch consumption, and dimension-table snapshots, real-time features and offline features finally share one set of SQL, and feature computation is unified across stream and batch. Now look at a complete stream-batch integrated link for an experiment. In the figure, the top lane is the complete offline computation process and the second is the near-line process; the computation semantics used in the offline process are exactly the same as those used in the real-time consumption of the near-line process, and both are computed with SQL on Flink.

The label join uses a Kafka click stream and impression stream; on the offline link it uses an HDFS click directory and an HDFS impression directory. Feature data processing is the same: the real-time side uses Kafka play data and HBase video archive data, while the offline side uses Hive archive data and Hive play data. Besides unifying the offline and near-line stream-batch processing, the real-time effect data produced by the near-line link is aggregated into an OLAP engine, and Superset provides real-time metric visualization. As the figure shows, a complete stream-batch integrated computation link contains a very complex and large set of compute nodes.

11. Experiment collaboration – challenges

In the next stage, the challenge lies more in experiment collaboration. The figure below is the abstraction obtained by simplifying the previous link. As you can see, the three dashed boxes are one offline link plus two real-time links; together these three complete links constitute the stream-batch workflow of the job, which is the most basic form of a workflow. We need to complete the abstraction of the workflow, including the stream-batch event-driven mechanism, and in the AI field we particularly hope to let algorithm engineers use Python to define the complete flow. In addition, inputs, outputs, and the computation itself should be templated as much as possible, so that whole experiments can be cloned easily.

12. Introducing AIFlow

In the second half of the year, we worked with the community to introduce AIFlow as the workflow layer.

The right side is the DAG view of an AIFlow link. There is no restriction on the node types it supports: they can be streaming nodes or offline nodes. In addition, the node-to-node communication edges can be either data-driven or event-driven. The main benefit of introducing AIFlow is that it provides Python semantics to conveniently define a complete AIFlow workflow, as well as scheduling of the workflow's progress.

On the edges between nodes, it also supports a fully event-driven mechanism, compared with some of the industry's native Flow solutions. The benefit is that between two Flink jobs, the progress of the processed data partitions, tracked by watermarks in Flink, can be used to send an event-driven message that pulls up the next offline or streaming job.

It also supports surrounding services, including a message-module service for notifications, a metadata service, and the model-centric services of the AI field.

13. Defining the Flow in Python

Let's look at how an AIFlow-based workflow is ultimately defined in Python. The view on the right is the workflow definition of a complete online project. The first part is the definition of the Spark job, where dependencies are configured to describe the downstream dependency relationship: it sends an event-driven message that pulls up the Flink streaming job below it. The streaming job can, in turn, pull up the following Spark job in a message-driven manner. The whole semantic definition is very simple, only four steps: configure the config information of each node, define the operation behavior of each node, declare its dependencies, and finally run the topology view of the whole flow.
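To make the four steps concrete, here is a deliberately simplified, hypothetical mini-DSL in Python. It is not the real AIFlow API, only an illustration of configuring nodes, defining their actions, declaring dependencies, and running the flow.

```python
# Hypothetical mini-DSL (not the real AIFlow API) illustrating the four steps:
# per-node config, per-node action, dependencies, then run.

class Workflow:
    def __init__(self):
        self.nodes, self.deps = {}, []

    def job(self, name, config, action):
        self.nodes[name] = (config, action)                # step 1 + step 2
        return name

    def depends_on(self, downstream, upstream, trigger="event"):
        self.deps.append((upstream, downstream, trigger))  # step 3

    def run(self):                                         # step 4
        done = set()
        while len(done) < len(self.nodes):
            for name, (config, action) in self.nodes.items():
                ready = all(up in done for up, down, _ in self.deps if down == name)
                if name not in done and ready:
                    print(f"[{config.get('engine')}] start {name}")
                    action()
                    done.add(name)

if __name__ == "__main__":
    wf = Workflow()
    batch = wf.job("init_snapshot", {"engine": "spark"}, lambda: print("  spark batch job"))
    stream = wf.job("label_join",   {"engine": "flink"}, lambda: print("  flink streaming job"))
    wf.depends_on(stream, upstream=batch, trigger="event")   # batch finishes -> event pulls up the stream job
    wf.run()
```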

14. Event-driven stream-batch scheduling

Next, let's look at the complete driving mechanism of stream-batch scheduling. On the right of the figure below is the complete driving view of three working nodes. The first goes from Source to SQL to Sink. The yellow box is the extended Supervisor, introduced to collect global watermark progress; when the streaming job finds that the watermark can advance to the next partition, it sends a message to the NotifyService. The NotifyService forwards the message to the next job, into whose Flink DAG a flow operator has been introduced; that operator blocks the job until it receives the message sent by the previous job. Receiving the message means that the upstream partition for the previous hour has been completed, so the next flow node can be driven and run. Similarly, the next workflow node introduces a GlobalWatermark Collector module to aggregate its own processing progress; when its hourly partition is complete it also sends a message to the NotifyService, which then drives the AIScheduler module to pull up the Spark offline job and finish the offline computation. As you can see, the whole link supports batch-to-batch, batch-to-stream, stream-to-stream, and stream-to-batch scenarios.

15. Prototype of the real-time AI full link

On the basis of defining and scheduling the stream-batch Flow, the prototype of the real-time AI full link was initially built in 2020, centered on experiments. Algorithm students can develop each node based on SQL, Python can define the complete DAG workflow, and monitoring, alerting and operations are integrated.

At the same time, it supports the whole chain from offline to real-time, from data processing to model training, and from model training to experimental results, end to end. On the right is the near-line experiment link; below it is the service that provides material data from the experiment link to online prediction and training. The whole thing consists of three aspects:

  • First, some basic platform functions, including experiment management, model management, feature management and so on;
  • Second, the underlying services of AIFlow;
  • Then some platform-level metadata services.

IV. Future development and thinking

In the coming year, we will focus more on two areas of work.

  • First, in the data lake direction, we will focus on incremental computing scenarios from the ODS to the DW layer, as well as some breakthrough scenarios from the DW to the ADS layer. The core is to combine Flink with Iceberg and Hudi to land in this direction;
  • Second, on the real-time AI platform, we will go further and provide a real-time AI collaboration platform for experiments. The core is to build an efficient engineering platform that improves the efficiency of algorithm engineers and simplifies their work.