This article analyzes in depth, from two aspects, what exactly Alibaba has optimized in Flink.

Taken from open source, given back to open source

First, the SQL layer

In order to truly enable users to write one set of code for their business logic that can run in a variety of different scenarios, Flink first needs to provide users with a unified API. After some research, the Alibaba real-time computing team concluded that SQL was a very suitable choice. In the batch world, SQL has been tested for decades and is considered a classic. In the field of stream computing, theories such as stream-table duality and "a stream is the changelog of a table" have emerged in recent years. On the basis of these theories, Alibaba proposed the concept of dynamic tables, so that stream computing can be described in SQL just like batch processing, with the two being logically equivalent. Users can therefore describe their business logic in SQL, and the same query can be executed as a batch job or as a high-throughput, low-latency streaming job; it can even first use batch techniques to compute historical data and then automatically convert into a streaming job to process the latest real-time data. With this declarative API, the engine gains more options and more room for optimization. Next, we'll look at some of the more important optimizations.
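The stream-table duality mentioned above can be illustrated with a minimal sketch (plain Python for illustration only, not Flink's actual implementation): a changelog stream of insert and retraction events can always be materialized into a table, and the table can in turn be replayed as an insert-only stream.

```python
# Minimal sketch of stream-table duality (illustrative only, not Flink code).
# A changelog stream of (+/-, key, value) events materializes into a table,
# and the table's contents can be re-emitted as an insert-only stream.

def materialize(changelog):
    """Apply a changelog stream to an (initially empty) dynamic table."""
    table = {}
    for op, key, value in changelog:
        if op == "+":      # upsert
            table[key] = value
        elif op == "-":    # retraction
            table.pop(key, None)
    return table

def as_stream(table):
    """Replay a table as an insert-only changelog stream."""
    return [("+", k, v) for k, v in sorted(table.items())]

changelog = [("+", "a", 1), ("+", "b", 2), ("-", "a", None), ("+", "c", 3)]
print(materialize(changelog))  # {'b': 2, 'c': 3}
```

Materializing a stream and re-streaming the resulting table round-trips to the same table, which is exactly the equivalence that lets one SQL query describe both a batch and a streaming computation.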

The first step is to upgrade and replace the technical architecture of the SQL layer. Developers who have researched or used Flink should know that Flink has two basic APIs: DataStream and DataSet. The DataStream API serves stream-processing users, while the DataSet API serves batch-processing users. However, the execution paths of the two APIs are completely different, and they generate different tasks to execute. Flink's native SQL layer, after a series of optimizations, calls the DataSet or DataStream API depending on whether the user wants batch or stream processing. As a result, users often face two almost completely separate technology stacks in daily development and optimization, and many things have to be done twice. It also means that an optimization made on one side of the technology stack is not available on the other. Therefore, Alibaba proposed a new Query Processor in the SQL layer, consisting mainly of a Query Optimizer shared as much as possible between streaming and batch, and a Query Executor based on the same interface. In this way, more than 80% of the work can be reused, such as common optimization rules, basic data structures, and so on. At the same time, streaming and batch each retain their own unique optimizations and operators to accommodate their different job behaviors.

After the technical architecture of the SQL layer was unified, Alibaba began to seek a more efficient basic data structure to make Blink's execution in the SQL layer more efficient. Native Flink SQL uniformly uses a data structure called Row, composed entirely of Java objects, to represent a row in a relational database. If a row consists of an integer, a floating-point number, and a string, then the Row contains a Java Integer, Double, and String. As we all know, these Java objects incur a lot of overhead on the heap and introduce unnecessary boxing and unboxing when the data is accessed. To address these issues, Alibaba proposed a new data structure, BinaryRow, which represents a row in relational data just like the original Row, but stores the data entirely in binary form. In the above example, the three fields of different types are all laid out in a Java byte[]. This brings many benefits:

1. In terms of storage space, it removes a lot of unnecessary extra overhead, making object storage more compact.
2. When interacting with the network or the state store, it can also omit a lot of unnecessary serialization and deserialization overhead.
3. With all the unnecessary boxing and unboxing removed, the execution code as a whole is more GC-friendly.

By introducing such an efficient underlying data structure, the execution efficiency of the entire SQL layer was more than doubled.
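The idea behind BinaryRow can be sketched in a few lines (the layout and names below are illustrative, not Flink's actual binary format): instead of a row of boxed objects, the fields are packed into one contiguous byte buffer and decoded on demand, avoiding per-field object headers and boxing.

```python
import struct

# Illustrative sketch of a binary row layout (not Flink's actual BinaryRow
# format): an int, a double, and a UTF-8 string packed into one byte buffer.

def pack_row(i, d, s):
    encoded = s.encode("utf-8")
    # fixed-width fields first (int32, float64), then string length + bytes
    return struct.pack(f"<idI{len(encoded)}s", i, d, len(encoded), encoded)

def unpack_row(buf):
    i, d, n = struct.unpack_from("<idI", buf)
    offset = struct.calcsize("<idI")          # 16 bytes of fixed-width data
    s = buf[offset:offset + n].decode("utf-8")
    return i, d, s

row = pack_row(42, 3.14, "hello")
print(unpack_row(row))  # (42, 3.14, 'hello')
```

The whole row is a single flat buffer, so shipping it over the network or writing it to the state store needs no extra serialization step, which is the property the paragraph above describes.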

In the implementation of operators, Alibaba introduced code generation technology on a wider scale. Thanks to the unification of the technical architecture and the underlying data structures, many code generation techniques can be reused more broadly. At the same time, because of SQL's strong typing guarantees, the types of data an operator needs to handle are known in advance, so more targeted and efficient execution code can be generated. In native Flink SQL, code generation is applied only to simple expressions such as a > 2 or c + d. After Alibaba's optimization, some operators, such as sorting and aggregation, are generated in their entirety. This gives more flexibility in controlling operator logic and embeds the final running code directly into the class, eliminating the overhead of expensive function calls. Some basic data structures and algorithms built with code generation, such as sorting algorithms and a HashMap based on binary data, can also be shared and reused between streaming and batch operators, so users truly enjoy the benefits of the unified technical architecture. Stream computing performance can likewise benefit from data structures and algorithms optimized for certain batch-processing scenarios. Next, let's talk about what changes Alibaba has made to Flink at the Runtime level.
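Operator-level code generation can be sketched as follows (a toy illustration in Python; Flink's code generator emits Java): since the sort keys are known from SQL's type and plan information, a specialized key-extraction function is emitted as source text and compiled once, so the inner sorting loop carries no per-record type dispatch.

```python
# Toy sketch of schema-driven code generation (illustrative; Flink generates
# Java). Given sort keys known at plan time, emit a specialized key-extraction
# function as source text, compile it once, and reuse it in the tight loop.

def generate_key_extractor(sort_fields):
    body = "def extract(row):\n"
    body += "    return (" + ", ".join(f"row[{i}]" for i in sort_fields) + ",)\n"
    namespace = {}
    exec(compile(body, "<generated>", "exec"), namespace)
    return namespace["extract"]

rows = [(3, "c", 1.0), (1, "a", 2.0), (2, "b", 0.5)]
extract = generate_key_extractor([0])        # ORDER BY first column
print(sorted(rows, key=extract))
# [(1, 'a', 2.0), (2, 'b', 0.5), (3, 'c', 1.0)]
```

The generation step is paid once per query plan, while the generated function runs per record, which is where whole-operator generation beats interpreting the expression tree for every row.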

Second, the Runtime layer

The real-time computing team faced challenges in getting Flink to take root in Alibaba's large-scale production environment, starting with how to integrate Flink with other cluster management systems. At the time, Flink's native cluster management mode was not yet mature, nor could Flink directly use other, more mature cluster management systems. This raised a series of thorny questions: how to coordinate resources among multiple tenants? How to dynamically request and release resources? How to specify different resource types?

To solve these problems, the real-time computing team did a great deal of research and analysis. The final solution was to transform Flink's resource scheduling system so that Flink could run on a YARN cluster. In addition, the Master architecture was refactored so that each job has its own Master, and the Master is no longer a bottleneck for the cluster. Taking this opportunity, Alibaba and the community jointly launched the new FLIP-6 architecture, which turned Flink's resource management into a pluggable architecture and laid a solid foundation for Flink's sustainable development. The fact that Flink now runs seamlessly on YARN, Mesos, and Kubernetes is a testament to the importance of this architecture.

After solving large-scale Flink cluster deployment, the next step was reliability and stability. To ensure Flink's high availability in production, Alibaba focused on improving Flink's failover mechanism. First, Master failover: Flink's native Master failover restarts all jobs, whereas after the improvement a Master failover no longer affects the normal operation of jobs. Second, region-based Task failover was introduced to minimize the impact of any Task failure on users. With these improvements in place, a large number of Alibaba businesses began migrating their real-time computing to Flink.
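The region-based failover idea can be sketched conceptually (illustrative Python, not Flink's actual scheduler): tasks connected by pipelined edges must restart together and therefore form one failover region, while tasks separated by a blocking (materialized) intermediate result are isolated from the failure and keep running.

```python
# Conceptual sketch of region-based failover (not Flink's actual scheduler).
# Tasks connected by pipelined edges restart together; a blocking edge
# (materialized intermediate result) isolates downstream tasks from a failure.

def build_regions(tasks, pipelined_edges):
    """Group tasks into failover regions via union-find over pipelined edges."""
    parent = {t: t for t in tasks}
    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]   # path halving
            t = parent[t]
        return t
    for a, b in pipelined_edges:
        parent[find(a)] = find(b)
    regions = {}
    for t in tasks:
        regions.setdefault(find(t), set()).add(t)
    return list(regions.values())

def tasks_to_restart(regions, failed_task):
    """Only the region containing the failed task is restarted."""
    for region in regions:
        if failed_task in region:
            return region

tasks = ["src", "map", "sink"]
# src -> map is pipelined; map -> sink is blocking, so it is not listed here
regions = build_regions(tasks, [("src", "map")])
print(tasks_to_restart(regions, "map"))  # only 'src' and 'map'; 'sink' keeps running
```

The smaller the regions the job graph decomposes into, the fewer tasks any single failure disturbs, which is the improvement the paragraph above describes.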

Stateful streaming is Flink's biggest highlight, and its Checkpoint mechanism based on the Chandy-Lamport algorithm gives Flink exactly-once consistency. However, in early versions of Flink, Checkpoint performance suffered when the amount of state was large. Alibaba made many improvements to Checkpoint as well, such as:

1. Incremental Checkpoint mechanism: In production, Alibaba commonly encounters big jobs with dozens of terabytes of state. Taking a full checkpoint of such state is very expensive, so Alibaba developed an incremental Checkpoint mechanism.

2. Checkpoint small-file merging: As the number of Flink jobs in the cluster grew, so did the number of checkpoint files, until the HDFS NameNode was overwhelmed. By merging many small checkpoint files into one large file, Alibaba ultimately reduced the pressure on the NameNode by dozens of times.
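The incremental-checkpoint idea above can be sketched in a few lines (illustrative only; Flink's real mechanism builds on RocksDB's SST files): each checkpoint persists only the keys changed since the previous one, and recovery replays the deltas in order.

```python
# Illustrative sketch of incremental checkpointing (Flink's real mechanism is
# built on RocksDB SST files; this only shows the delta idea).

class IncrementalState:
    def __init__(self):
        self.state = {}
        self.dirty = set()       # keys modified since the last checkpoint
        self.checkpoints = []    # persisted deltas, in checkpoint order

    def put(self, key, value):
        self.state[key] = value
        self.dirty.add(key)

    def checkpoint(self):
        # persist only the changed keys, not the full (possibly huge) state
        delta = {k: self.state[k] for k in self.dirty}
        self.checkpoints.append(delta)
        self.dirty.clear()

    def restore(self):
        # replay all deltas in order to rebuild the full state
        restored = {}
        for delta in self.checkpoints:
            restored.update(delta)
        return restored

s = IncrementalState()
s.put("a", 1); s.put("b", 2); s.checkpoint()   # first delta: {a: 1, b: 2}
s.put("b", 3); s.checkpoint()                  # second delta: {b: 3} only
print(s.restore())  # {'a': 1, 'b': 3}
```

With dozens of terabytes of state but only a small changing working set, each checkpoint's cost is proportional to the change rate rather than the total state size.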

Although all data could be stored in Flink's State, for historical reasons users still need to keep some data in external KV stores such as HBase, and Flink jobs need to access this external data. However, since Flink has always been a single-threaded processing model, the latency of accessing external data becomes a bottleneck for the whole system. Asynchronous access is an obvious solution to this problem, but it is not easy for users to write multithreaded code in a UDF while maintaining exactly-once semantics. Alibaba therefore proposed AsyncOperator in Flink, which makes writing asynchronous calls in a Flink job as easy as writing "Hello World", and brought a great leap in Flink job throughput.
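The asynchronous I/O idea can be sketched with asyncio (a conceptual illustration only; Flink's actual async operator is Java and additionally integrates with the checkpoint mechanism): many external lookups are in flight at once, yet results are emitted in the original record order, so the operator's semantics stay deterministic.

```python
import asyncio

# Conceptual sketch of ordered async I/O (not Flink's actual AsyncOperator,
# which is Java and also participates in checkpointing).

async def lookup(key):
    """Stand-in for an external KV lookup (e.g. HBase) with some latency."""
    await asyncio.sleep(0.01)
    return f"value-of-{key}"

async def async_operator(records, capacity=10):
    results = []
    # issue up to `capacity` lookups concurrently, batch by batch, while
    # emitting results in the original record order
    for i in range(0, len(records), capacity):
        batch = records[i:i + capacity]
        results.extend(await asyncio.gather(*(lookup(r) for r in batch)))
    return results

out = asyncio.run(async_operator(["k1", "k2", "k3"]))
print(out)  # ['value-of-k1', 'value-of-k2', 'value-of-k3']
```

With `capacity` lookups overlapping, total latency approaches one round trip per batch instead of one per record, which is where the throughput leap comes from.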

Flink was designed as a unified batch and stream computing engine. After experiencing the lightning-fast stream computing, batch users also became interested in the Flink community. But batch computing brought new challenges. First, in task scheduling, Alibaba introduced a more flexible scheduling mechanism that can schedule tasks more efficiently according to the dependencies between them. Second is data shuffle: Flink's native Shuffle Service is bound to the TaskManager, so the TM cannot release resources after its tasks finish; moreover, the original batch shuffle does not merge files, so it cannot be used in production. Alibaba solved both problems by developing a YARN Shuffle Service. While developing it, Alibaba found that building a new Shuffle Service was very inconvenient, requiring intrusive changes in many parts of the Flink code. To make it easier for other developers to extend Flink with different shuffles, Alibaba also revamped the shuffle architecture to make it pluggable. At present, Alibaba's search business is already running Flink batch jobs in production.
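A pluggable shuffle architecture boils down to making the runtime depend on an interface rather than on one concrete implementation (the names below are hypothetical illustrations, not Flink's actual shuffle SPI): a TM-local shuffle and an external YARN shuffle service then become interchangeable plugins.

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of a pluggable shuffle interface (names are illustrative,
# not Flink's actual shuffle SPI). The job logic depends only on the interface,
# so different shuffle implementations can be swapped in without code changes.

class ShuffleService(ABC):
    @abstractmethod
    def write_partition(self, partition_id, records): ...
    @abstractmethod
    def read_partition(self, partition_id): ...

class InMemoryShuffle(ShuffleService):
    """TM-bound shuffle: partition data lives with the producing process."""
    def __init__(self):
        self._partitions = {}
    def write_partition(self, partition_id, records):
        self._partitions[partition_id] = list(records)
    def read_partition(self, partition_id):
        return self._partitions[partition_id]

def run_job(shuffle: ShuffleService):
    shuffle.write_partition("p0", [1, 2, 3])   # upstream task finishes
    return sum(shuffle.read_partition("p0"))   # downstream task consumes

print(run_job(InMemoryShuffle()))  # 6
```

An external shuffle service would implement the same interface but persist partitions outside the TaskManager, letting the TM release its resources as soon as its tasks finish.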

After more than three years of polishing, Blink has begun to thrive inside Alibaba, but the optimization and improvement of the Runtime never ends, and a new wave of improvements and optimizations is on the way.


For more information, please visit the Apache Flink Chinese community website