Author: Liu Kang
This article is based on a talk given at the Flink Meetup held in Shanghai on July 26. It is shared by Liu Kang, who works in the Big Data Platform department on the development of platforms for the model life cycle, and is mainly responsible for a Flink-based real-time model feature computing platform. He is familiar with distributed computing, has rich practical experience in and a deep understanding of model deployment and operations, and has some knowledge of model algorithms and training.
The main contents of this article are as follows:
- Based on the current state of real-time feature development in the company, the background, goals, and current status of the real-time feature platform
- The reasons for choosing Flink as the platform's computing engine
- Flink in practice: representative usage examples, development work to support Aerospike (the platform's storage medium), and pitfalls encountered
- Current effects & future planning
I. Background, goals, and current status of the real-time feature platform, based on the current state of real-time feature development in the company
1. Development and maintenance of the original real-time feature jobs
1.1 Selecting a real-time computing engine: choose one of the existing engines (Storm, Spark, Flink) according to the project's performance requirements (latency, throughput, etc.)
1.2 Main development and maintenance process:
- More than 80% of jobs need a message queue as the data source, but the messages in the queue are unstructured and there is no unified data dictionary, so we have to consume the corresponding topic, parse the messages, and determine what content is needed
- Design and develop the computation logic based on the scenario in the requirements
- Where real-time data alone cannot fully meet the data requirements, develop separate offline jobs and fusion logic.
  For example, in a scenario that requires 30 days of data, the message queue holds only seven days of data (Kafka's default message retention time), so the remaining 23 days must be supplemented with offline data.
- Design and develop data verification and error-correction logic.
  Message delivery depends on the network, and message loss and timeouts cannot be completely avoided, so verification and error-correction logic is needed.
- Test and go live
- Monitoring and alerting
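The fusion step described above (30 days of data = 7 days from the message queue + 23 days of offline data) can be sketched as follows. This is a minimal illustration only; the function name and the day-by-day merge strategy are assumptions, not the platform's actual implementation:

```python
from datetime import date, timedelta

def fuse_30d_count(realtime_counts, offline_counts, today):
    """Fuse a 30-day count feature from 7 days of real-time data
    (limited by message-queue retention) and 23 days of offline data.
    Both arguments map a date to that day's count."""
    total = 0
    for offset in range(30):
        day = today - timedelta(days=offset)
        if offset < 7:
            # the most recent 7 days come from the real-time job
            total += realtime_counts.get(day, 0)
        else:
            # the remaining 23 days are supplemented from offline data
            total += offline_counts.get(day, 0)
    return total
```

In a real job the per-day counts would come from the streaming aggregation and the offline batch output respectively; the point here is only the split between the two sources.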
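The verification and error-correction step can likewise be sketched as a simple comparison: since messages may be lost or delayed, the real-time value is checked against an offline recomputation, which is treated as the ground truth. This is an assumed, minimal policy for illustration; the article does not describe the actual correction logic:

```python
def verify_and_correct(realtime_value, offline_value, tolerance=0.0):
    """Compare a real-time feature value against its offline recomputation.
    Returns (value_to_use, was_corrected): if the two disagree beyond
    `tolerance`, the offline value is taken as the correction."""
    if abs(realtime_value - offline_value) <= tolerance:
        return realtime_value, False  # passes verification, keep real-time value
    return offline_value, True        # corrected using offline data
```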
2. Pain points of developing the original real-time feature jobs
- Message-queue data sources have no unified data dictionary
- Feature computation logic is highly customized, and the development and test cycle is long
- If real-time data cannot meet the requirements, custom offline jobs and fusion logic are needed
- Verification and error-correction schemes have not converged into best practices; their effectiveness depends largely on individual ability
- Monitoring and alerting schemes must be customized for each piece of business logic
3. Platform goals derived from the pain points above
- Real-time data dictionary: provide unified data-source registration and management for topics that carry a single structured message type as well as topics that carry multiple message types
- Logic abstraction: abstract the computation logic to SQL, reducing the workload and lowering the barrier to entry
- Feature fusion: provide a feature-fusion capability to solve cases where real-time data alone cannot fully meet the data requirements
- Data verification and error correction: provide the ability to verify and correct real-time features using offline data
- Real-time computation latency: millisecond level
- Real-time computation fault tolerance: end-to-end exactly-once
- Unified monitoring, alerting, and HA schemes
4. Feature platform system architecture
The current architecture is a standard Lambda architecture; the offline part consists of Spark SQL + DataX. Aerospike is used as the KV storage system. Its main difference from Redis is that it uses SSDs as the primary storage; in our load tests, read/write performance in most scenarios is on par with Redis.
Real-time part: Flink is used as the computing engine. The user workflow is as follows:
- Register data sources: the real-time data sources currently supported are mainly Kafka and Aerospike. Offline or real-time features already configured on the platform (stored in Aerospike) are registered automatically; a Kafka data source requires uploading the corresponding schema sample file
- Computation logic: expressed in SQL
- Define the output: define the Aerospike table for the output and, optionally, a Kafka topic to which the keys of updated or inserted records are pushed
After the user completes the steps above, the platform writes all of this information into a JSON configuration file. The platform then submits the configuration file together with a prebuilt FlinkTemplate.jar (containing all the Flink functions the platform requires) to YARN to start the Flink job.
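The JSON configuration file described above might look roughly like this. All field names and values here are hypothetical, for illustration only; the article does not specify the actual schema:

```json
{
  "jobName": "user_click_cnt_30d",
  "sources": [
    {"type": "kafka", "topic": "user_click_events", "schemaSample": "user_click_sample.json"},
    {"type": "aerospike", "namespace": "features", "set": "user_profile_offline"}
  ],
  "sql": "SELECT userId, COUNT(*) AS click_cnt FROM clicks GROUP BY userId",
  "sink": {
    "aerospike": {"namespace": "features", "set": "user_click_cnt"},
    "keyUpdateTopic": "feature_key_updates"
  }
}
```

The registered sources, the SQL logic, and the output definition map directly onto the three user steps above; the template jar reads this file at startup to assemble the job.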
5. Platform function display
1) Platform function display – data source registration
2) Real-time feature editing – basic information
3) Real-time feature editing – data source selection
4) Real-time feature editing – SQL calculation
5) Real-time feature editing – select output
II. The reasons for choosing Flink
Let's talk about why we chose Flink for this feature platform. We compared the engines along three dimensions: latency, fault tolerance, and SQL maturity.
- Latency: Storm and Flink are pure streaming engines, with minimum latency at the millisecond level. Spark's pure-streaming mechanism is its continuous processing mode, which also achieves millisecond-level minimum latency
- Fault tolerance: Storm uses XOR-based ACKing and supports at-least-once, but does not resolve message duplication. Spark achieves exactly-once through checkpointing and a write-ahead log. Flink achieves exactly-once using checkpoints and savepoints
- SQL maturity: in its current version, Storm SQL is still experimental and supports neither aggregation nor joins. Spark provides most SQL functionality, but does not support distinct, limit, or order by on aggregated results. Flink's community edition does not support distinct aggregates in SQL
III. Flink in practice
1. Practical examples
2. Compatibility development: Flink currently provides no read/write support for Aerospike, so we had to develop it ourselves
3. Pitfalls encountered
IV. Current effects of the platform & future planning
Current effects: the average time to bring a real-time feature online has been reduced from 3–5 days to a matter of hours.
Future planning:
- Enrich the platform's capabilities: fusion features, etc.
- Simplify the steps and improve the user experience
- Extend SQL functionality as requirements arise, e.g. support for window start-time offsets, count-triggered windows, etc.
The next step is to support describing model deployment and model training through SQL or a DSL as well.
For more information, please visit the Apache Flink Chinese community website