The 2019 Alibaba Cloud Summit · Shanghai Developer Conference was grandly opened on July 24. The summit shared with the developers of the future world the technical dry goods in the fields of open source big data, IT infrastructure cloud, database, cloud native, Internet of Things and so on, and jointly discussed the cutting-edge technology trends. This article is collated from the wonderful speech of Alibaba senior technical expert Yang Kert (Rooney) in the open Source Big Data special session, mainly explaining the past and present development of Apache Flink, and sharing the understanding of the future development direction of Apache Flink.
Apache Flink past, Present and Future PPT download
The following content is based on the video and PPT of the speech.
I. Flink’s past
1. The emergence of the Flink
Before the Apache Flink project was donated to Apache, it was initiated by a PhD student of Technical University of Berlin. At that time, the Flink system was still a batch engine based on streaming Runtime, which mainly solved the problem of batch processing. In 2014, Flink was donated to Apache and quickly became one of its top projects. In August 2014, Apache released the first version of Flink, Flink 0.6.0. With better streaming engine support, the value of streaming computing is also being explored and valued. In December of that year, Flink released version 0.7, which officially introduced the DataStream API, the most widely used API in Flink today.
2. Flink 0.9
Support and processing of State is unavoidable for Streaming computing systems. Early Streaming computing systems, such as Storm and Spark Streaming, handed State maintenance and management over to users. This approach brings two problems. On the one hand, it raises the bar for writing streaming computing systems. On the other hand, if users maintain State themselves, the cost of fault tolerance and the cost of the system providing Exactly Once semantics will increase. As a result, Flink version 0.9, released in June 2015, introduced built-in State support and supported multiple State types, such as ValueState, MapState, ListState, etc.
In order to support Exactly Once consistency semantics, local states need to be assembled into a global Checkpoint. The Global Checkpoint mechanism introduced in Flink 0.9 is an improvement based on the classic Chandy-Lamport algorithm. As shown in the figure, Flink inserts barriers periodically in the data source, and the framework takes a snapshot of the local State when it sees them before sending them downstream. We can approximate that Checkpoint barriers only lead to overhead of one message processing, which is negligible compared to normal message processing. After the introduction of Chandy-Lamport algorithm, It is no longer a tradeoff for Flink to provide high throughput and delay on the premise of ensuring Exactly Once, and it can guarantee both high throughput and low latency at the same time. While other systems often need to make a trade-off between throughput and delay when doing similar design. High consistency affects throughput, whereas consistency cannot be guaranteed under large throughput.
3. Cornerstone of Flink 1.0
Flink version 1.0 added event-based computing support and introduced Watermark, which effectively tolerated out-of-order data and late data. Flink 1.0 also has built-in support for various Windows, such as scrolling, sliding, session Windows out of the box, as well as the flexibility to customize Windows. Together with the State API and efficient Checkpoint support added to Flink 0.9, this is the cornerstone of Flink 1.0.
Ii. Alibaba and Flink
After 2015, Alibaba began to pay attention to Flink computing engine, and highly recognized the advanced design concept of Flink system, and was optimistic about its development prospects. Therefore, Alibaba began to use Flink in a large number of internal, but also made drastic improvements to Flink.
1. Reconstruct the distributed architecture
After working with the community, the first big step was to reconfigure the distributed architecture, given the huge amount of internal business data and online pressure. Early Flink no clear division between the characters, most of the responsibility to be in the same role, such as job scheduling and resource application, the allocation of Task, and this role will also need to manage all the homework in the cluster, within the work very big ali scenario, soon revealed such bottlenecks. In the process of reconstructing distributed architecture, Ali consciously separated the roles of scheduling jobs and applying for resources, and set two responsibilities: Job Manager and Resource Manager. After that, Resource Manager could be fully processed by plug-ins, facilitating the connection with various Resource scheduling systems. Such as YARN and Kubernetes. Take Docking Kubernetes as an example, just write a plug-in, all operations can smoothly operate in the entire environment, greatly simplifying the process. At the same time, the architecture also supports the use of independent Job Manager and Resource Manager for each Job, which greatly improves scalability. A cluster can easily support thousands of jobs.
2. The incremental Checkpoint
In order to solve the problem of tens of TERabytes of State data, Ali introduced the incremental Checkpoint mechanism in Flink. In earlier versions, Flink Checkpoint copies all the local State data of each Task to reliable storage. When the magnitude of State reaches TB, backing up the full amount of data every time is obviously an unacceptable scenario. The incremental Checkpoint mechanism is also easy to understand. Instead of flushing all State data to reliable storage at each Checkpoint, only the newly added State data is backed up. In the case of abnormal restart, the full amount of data is used for recovery. With this mechanism in place, Flink can easily process tens of terabytes of State data. This problem was also the biggest constraint on our internal machine learning system at the time, and after solving this problem, the application of Flink streaming became more extensive.
3. Flow control mechanism based on credit
Flink version 1.0 shares a TCP channel between multiple workers. If multiple operators are in the same Task Manager and the network connection between them is TCP shared, if one of them generates backpressure, the processing efficiency of other operators in the same process will be affected, resulting in unstable operation. Therefore, ali introduced the flow control mechanism based on credit at the network layer, and each Operator could not send data to the TCP channel without limit. Each Operator has its own credit, which needs to be reduced when it sends data to the downstream. When the downstream actually consumes data, the credit score is added back and the upstream can continue to send data to the virtual Channel. After Flink introduced a sophisticated flow control mechanism, the throughput or delay of the job became more stable, and the whole job would not be unstable because of the temporary jitter of one operator.
4. Streaming SQL
There are a lot of homework in Alibaba Group. As the platform maintainer, if there is any problem in the user’s homework, I need to check the user’s code to find out the problem in the first time. However, the amount of user code varies from tens of thousands of lines to hundreds of lines, making maintenance costs very high. So Ali chose unified Streaming SQL as the development language and could understand the user’s intention by looking at the user’s SQL. There are many other benefits to choosing SQL. For example, SQL integrates an optimizer, allowing the system and framework to help users optimize their jobs and improve their execution efficiency. The semantics of Streaming SQL need to be explained here, and this is also a typical problem for some users new to Streaming SQL. Simply put, Streaming SQL is semantically identical to traditional batch SQL, with only differences in execution mode and result output. For example, the following figure is a user’s score table, which needs to do a simple sum of scores and calculate the last update time of the results. In SQL statements, SUM(Score) calculates the Score and MAX (Time) is taken at the same Time. Unlike batch processing, the real-time nature of Streaming data makes it impossible for the Streaming SQL to see all the data at once when running, as in 12: At 01, the Streaming SQL will count an empty record, assuming that the system has not seen a single record. As the records flow in, the first result is output at 12:04, which is the calculation of the data recorded before 12:04. At 12:07, you can see all the data in the current table and do an updated output of the results. Assuming that the USER_SCORES table exists to begin with, the result of a batch run is the same as the result of a stream calculation, which illustrates the consistency of SQL semantics for a stream batch.
5. Flink’s service in Ali
On November 11, 2018, Alibaba’s service scale has exceeded 10,000 clusters. A single job has reached tens of terabytes of status data, and all the jobs added up to PB level. More than ten trillion events are processed every day. At the zero peak of Double 11, data processing had reached 1.7 billion bits per second.
In the past, Flink has largely focused on the areas of Continuous Processing and Streaming Analytics, including the DataStream API and later Streaming SQL. Flink has not only gained a foothold in Continuous Processing and Streaming Analytics, but has become the current leader in the field.
Three, Flink now
1. Architectural changes in Flink 1.9
The latest version of Flink is 1.9, and Flink has made major architectural changes with this release. First of all, the Table API and SQL API of previous versions of Flink were built on top of two underlying apis, namely DataStream API and DataSet API. After major architectural changes in Flink 1.9, the Table API and DataStream API have become peer apis. The difference is that DataStream PROVIDES an API that is more closely related to the physical execution plan. The engine can execute jobs based on the user’s description without too much optimization or intervention. The Table API and SQL are relational expression apis, which are used by users to describe what they want to do. After understanding the user’s intention, the framework and the optimizer translate it into efficient concrete execution diagrams. Flink will share a unified DAG layer and Stream Operator, while the Runtime layer will retain the distributed Streaming DataFlow.
2. Unify Operator abstraction
The change in Flink’s architecture raises the issue of uniform Operator abstraction because the original Operator abstraction only applies to Flink’s Streaming jobs and Flink’s DataSet API does not use the original Operator abstraction. Flink’s early code referenced the classical database approach, with all operators executed in pull mode. As shown in the figure below, Filter operator tries to pull data upstream, while HashJoin operator tries to pull data from both ends (Build end and Probe end) to Join. In the case of low latency and high throughput requirements, Flink’s Streaming job is performed in a push mode, and the framework pushes the data to all required operators after it has been read. In order to unify the Operator abstraction and enable the Streaming Operator to perform HashJoin operations, Ali extends the protocol to include semantics in which the Operator informs the framework of the desired input order. In the figure below, HashJoin tells the Framework to push Probe data to HashJoin first after HashJoin processes the Build side and builds the Hashtable. In the past, when developers support stream or batch processing, many operators need to write two sets of programs. After unifying the Operator abstraction, operators can be reused to help developers improve development efficiency and achieve twice the result with half the effort.
Table API & SQL 1.9 new features
- Table API & SQL 1.9 introduces a new type system for SQL. The type system of the previous Table layer reused the Runtime TypeInformation, but encountered many limitations in the actual operation process. The introduction of a new SQL type system can better align SQL semantics.
- Preliminary DDL support: Flink also introduced preliminary DDL support in this release, allowing users to define or Drop tables using simple syntax such as Create Table or Drop Table.
- Table API enhancements: The Table API was originally a relational expression API. Table API & SQL 1.9 now adds more flexible apis such as Map and FlatMap.
- Unified Catalog API: Table API & SQL 1.9 introduced a unified Catalog API, which can be easily interlinked with other catalogs. For example, Flink can directly read and process tables in Hive by implementing a plug-in that interacts with Hive. Metastore through a unified Catalog API.
- Blink Planner: The Table API adds support for Blink Planners because the upper layer requires the SQL planner to dock with the lower Runtime after making major changes to the lower Runtime. To ensure that users of the original Table API were as unaffected as possible, the community kept the original Flink Planner intact. But at the same time a new Blink Planner was introduced to interface with the new Runtime design.
Blink Planner Feature
Blink Planner has added a number of new features. First of all, Blink Planner binaries the data structure, adds richer built-in functions, introduces Minibatch optimization in aggregation, and adopts a variety of methods to solve hot data encountered in the process of aggregation, etc. In addition, dimension table association is widely used in flow calculation, and developers need to expand the data volume dimension of data flow, so Blink Planner also supports dimension table association. TopN is widely used in the field of e-commerce. The TopN function provided by Blink Planner can easily complete the function of counting the top merchants in transaction volume. After a simple extension to the TopN feature, Blink Planner also supports efficient streaming de-duplication. It is worth mentioning that Blink Planner has been able to fully support batch processing, and the current Ali internal version can run a complete set of standard Benchmark tests such as TPC-H and TPC-DS.
Batch optimization
Flink implements more optimization for batch processing at the Runtime layer. The classic problem in batch processing is error handling recovery. As shown in the figure below, Flink can flexibly adjust the transmission type of each side in the topology, directly connect between A and B, insert Cache layer between B and C, and output Cache data at the output end to reduce the cost of FailOver transmission. Suppose an error occurs at node D and the error is retraced from node D to the recalculated range. When retracing to the Cache level, if the result of B1 already exists in DFS or the Cache is somewhere else, the error retracing does not need to continue. In order to ensure consistency, it is necessary to go back to the Cache layer and simply restart the downstream jobs that have not been executed or half executed. If there is no Cache support, the nodes are all connected to each other. When an error occurs on node D, the error will spread to the whole graph. In the case of Cache support, only small subgraphs need to be restarted, which can greatly improve the recovery efficiency of Flink in the face of errors.
Plug-in Shuffle Manager: Flink 1.9 adds the Shuffle plug-in. Users can implement an intermediate Shuffle layer and receive intermediate data through a dedicated Service. The Shuffle Service based on Yarn can also be used.
5. Ecological
Flink version 1.9 has major ecological inputs, such as increased Hive compatibility. With the introduction of a unified Catelog API, Flink can now read Hive Metastore directly. Users can use Flink SQL to process Hive data and write the data back to the Hive table in a compatible data format. If there are follow-up Hive jobs, users can continue operations in the Hive table. In addition, Flink and Zeppelin have been integrated to provide a better development experience. Users can use Flink SQL directly in Notebook or write Flink jobs using the Python API.
6. Chinese community
The Flink community takes Chinese users very seriously. Support for Chinese documents has been added to the Flink community website. In addition, the community has opened a Flink Chinese user mailing list. After subscribing to the mailing list, users can describe their questions in Chinese, and there will be a lot of enthusiastic fans in the community to help answer their questions.
Flink is already a leader in real-time and streaming computing, with a focus on batch support. Both the introduction of a more powerful SQL execution engine and friendlier support for error recovery at the Runtime layer demonstrate the importance of batch processing in Flink 1.9, and this is just the beginning.
Iv. Future development direction of Flink
1. Micro Services case
As shown in the figure below, the e-commerce system includes order layer, order transaction system, inventory system, payment system and logistics system. First, Micro Services drive calls between systems in the form of events. The user triggers an order, the order system receives the order for calculation logic, and then invokes the inventory system. The above operation is a typical event-driven model. To ensure performance and stability, RPC Calls need to be used in different Micro Services. If synchronous RPC Calls are used, the problem of thread data ballooning needs to be solved. Therefore, Async Calls need to be introduced between Micro Services. Due to the limited processing capacity of each Micro Service, for example, when the RPC ratio of order to inventory is 1:10, we cannot send RPC calls to the downstream system without limit. Therefore, we need to introduce a set of flow control mechanism to appropriately slow down the amount of RPC sent. However, user traffic is difficult to predict. The best solution is that each Micro Service can be independently expanded or shrunk. Back to the order system, when the order system is under great pressure, the order layer can be expanded, or when the inventory flow is at a low peak, the service capacity can be reduced, all systems need data persistence, and behind the system can not leave DB support.
In summary, Micro Service requires several core elements. The first is event-driven, and the second is asynchronous transmission between systems. At the same time, it needs to have a good flow control mechanism to dynamically expand and shrink the capacity between and within nodes. Finally, it needs to have its own DB, which can be understood as Micro Service needs to support State and be able to store historical states.
It is not hard to see that Micro Service’s requirements are covered by Flink. Firstly, Flink is a message-driven system with a very fine flow control mechanism. Because of the natural decoupling between networks, Flink’s data transfers are asynchronous; In addition, Flink can add or subtract concurrency for each operator individually, built-in State support, and so on. Micro Services scenarios are far larger than stream computing and batch processing scenarios, and it is believed that in the near future, Flink community will also do more exploration and attempts in this direction, to achieve support for event-driven Application service scenarios.
Apache Flink’s first Geek Challenge
With the help of Jia Yangqing, the first Apache Flink Geek Challenge jointly organized by Ali Cloud Computing Platform Business Division, Tianchi Platform and Intel is coming!
Focus on machine learning and computing performance, two hot fields, enter the contest, make yourself a technical generalist, and have a chance to win $10W.
To understand competition details: tianchi.aliyun.com/markets/tia…