Abstract: Apache Flink is an open source platform for distributed stream and batch data processing, supporting both stream processing and batch processing on top of the same Flink runtime. Apache Flink 1.9.0 has been released. What are its major milestones, key changes, and new features? In this article, Alibaba senior technical expert Wu Chong introduces the Apache Flink 1.9.0 release.

This sharing is mainly divided into the following three aspects:

  1. Milestone significance of Flink 1.9.0
  2. Key changes and new features in Flink 1.9.0
  3. Summary

I. Milestone significance of Flink 1.9.0

The following figure shows two news articles published on the Ali Technology WeChat official account in 2019. One is "Ali officially contributes Blink code to Apache Flink", which reported that Blink was open-sourced and contributed to Apache Flink in January 2019. The other is "1.5 million lines of code were modified! Apache Flink 1.9.0 makes these major changes!", published in August 2019 to cover the first release after Blink was merged into Flink. I put these two stories together because the release of Flink 1.9.0 was a milestone for both Blink and Flink.



At the beginning of 2019, when Blink was open-sourced and contributed to Apache Flink, one key commitment was that Blink would be open-sourced as a branch of Flink and would merge its major optimizations back into Flink, making Flink better together with the community. Now, half a year later, with the release of Flink 1.9.0, Alibaba's Blink team can proudly declare that it has fulfilled that promise. Putting these two reports together, we can see that many of Blink's features are now available in Flink 1.9.0, and that the Flink community executes with very high efficiency.

II. Key changes and new features in Flink 1.9.0

This section introduces the key changes and new features of Flink 1.9.0.

Architecture upgrade

Overall, when a software system undergoes major changes, it is usually because of an architecture upgrade, and Flink is no exception. Flink's distributed streaming execution engine has two relatively independent APIs, the DataStream API and the DataSet API, responsible for stream processing and batch processing respectively. On top of these, Flink provides a unified Table API and SQL for both streams and batches: users can describe stream jobs and batch jobs with the same Table API or SQL query and only need to tell the Flink engine at runtime whether to run in streaming mode or batch mode; the Table layer then optimizes the job into a DataStream or DataSet job. However, the Flink 1.8 architecture has some drawbacks at the bottom: DataStream and DataSet do not share much code, and their APIs are completely different, so a large amount of work is duplicated in the upper layers, which makes the development and maintenance of Flink more and more expensive in the long term.



Based on the above problems, Blink made some new explorations in architecture and, after close discussion with the community, settled on Flink's future architectural direction. In future versions of Flink, the DataSet API will be completely removed, StreamTransformation will serve as the underlying API for describing both batch and stream jobs, and the Table API and SQL will translate both stream jobs and batch jobs into StreamTransformations. In Flink 1.9, in order not to affect users of previous versions, a solution was needed that allowed the old and new architectures to coexist. To this end, the Flink community split the API from the implementation modules and introduced a Planner interface that supports different Planner implementations, arriving at the Flink 1.9 architecture shown on the right. The Planner's job is to optimize jobs and translate them into a physical execution plan, which is what the Query Processor does. Flink's original implementation was moved into the Flink Query Processor, and all the functionality merged from Blink went into the Blink Query Processor. This kills two birds with one stone: not only does the Table module become much clearer after the split, but, more importantly, users of the old planner are unaffected while everyone can also enjoy Blink's new features and optimizations.
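Which Query Processor a job uses is selected when the table environment is created. The following is a minimal sketch under the Flink 1.9 API; later snippets in this article reuse the tEnv created here.

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class PlannerSelection {
        public static void main(String[] args) {
            // Old architecture: the original Flink Query Processor.
            EnvironmentSettings oldPlanner = EnvironmentSettings.newInstance()
                    .useOldPlanner()
                    .inStreamingMode()
                    .build();

            // New architecture: the Blink Query Processor merged from Blink.
            EnvironmentSettings blinkPlanner = EnvironmentSettings.newInstance()
                    .useBlinkPlanner()
                    .inStreamingMode()
                    .build();

            // Create a table environment backed by the chosen planner.
            TableEnvironment tEnv = TableEnvironment.create(blinkPlanner);
        }
    }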

Table API & SQL refactoring and new features

In the Table API & SQL area, Flink 1.9.0 also merged a number of SQL features contributed from Blink. These features were hammered out and battle-tested inside Alibaba, and I believe they will take Flink SQL a step further. Some of the more important results are highlighted here: support for SQL DDL, a refactored type system, efficient streaming TopN, efficient streaming deduplication, the long-awaited dimension table association, MiniBatch support together with several means of resolving data hotspots, complete batch support, the Python Table API, and Hive integration. A brief introduction to these new features follows.



SQL DDL: In the past, if you wanted to register a Table Source or Table Sink, you had to do so through Java or Scala code or through configuration files. In Flink 1.9, SQL DDL syntax is supported, so tables can be created or dropped directly with SQL statements.
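For illustration, here is a minimal sketch of registering a source table via DDL, reusing the tEnv from the planner sketch above. The table schema and connector properties are hypothetical examples; the exact property keys depend on the connector in use.

    // Register a source table directly with SQL DDL; no Java/Scala
    // TableSource registration is needed. In Flink 1.9, DDL statements
    // are executed with sqlUpdate(). Schema and properties are
    // illustrative only.
    tEnv.sqlUpdate(
        "CREATE TABLE orders (" +
        "  order_id BIGINT," +
        "  item STRING," +
        "  amount DOUBLE" +
        ") WITH (" +
        "  'connector.type' = 'kafka'," +
        "  'connector.topic' = 'orders'" +
        ")");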

Refactored type system: Flink 1.9 implements a new data type system that is fully aligned with the SQL standard and supports richer types. This new type system also lays a solid foundation for Flink SQL to support more complete functionality in the future.
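The new type system is exposed through the DataTypes class; below is a small sketch with made-up field names, showing SQL-standard types with explicit precision.

    import org.apache.flink.table.api.DataTypes;
    import org.apache.flink.table.types.DataType;

    // SQL-standard-aligned types, including precision-aware DECIMAL
    // and TIMESTAMP declarations.
    DataType row = DataTypes.ROW(
            DataTypes.FIELD("id", DataTypes.BIGINT()),
            DataTypes.FIELD("price", DataTypes.DECIMAL(10, 2)),
            DataTypes.FIELD("ts", DataTypes.TIMESTAMP(3)));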

TopN: Flink 1.9 provides powerful streaming TopN, a capability the community has long been waiting for. It enables real-time ranking calculations, such as computing the top-selling shops in real time, or filtering a real-time stream down to its top records, as sketched below.
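Here is a minimal sketch of a streaming TopN query under the Blink planner; the shop_sales table and its columns are assumptions, registered beforehand (for example via DDL as shown earlier).

    // Assumes tEnv and a registered table shop_sales(category, shop_id, sales).
    // The Blink planner recognizes this ROW_NUMBER() pattern as streaming
    // TopN: rank shops per category by sales, keep the top 3, and update
    // the result continuously as new data arrives.
    Table top3 = tEnv.sqlQuery(
        "SELECT category, shop_id, sales " +
        "FROM (" +
        "  SELECT category, shop_id, sales, " +
        "         ROW_NUMBER() OVER (" +
        "           PARTITION BY category ORDER BY sales DESC) AS row_num " +
        "  FROM shop_sales" +
        ") WHERE row_num <= 3");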

Efficient streaming deduplication: In real production systems, many ETL jobs cannot achieve end-to-end exactly-once consistency, which can leave duplicate records in the detail layer. When that data is then aggregated in the summary layer, the duplicates inflate the metrics and produce incorrect values. Therefore, deduplication is usually performed before data enters the summary layer. Flink 1.9 introduces a more efficient deduplication feature for stream processing, which can filter out duplicate data at a much lower cost.
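Under the Blink planner, deduplication is expressed with the same ROW_NUMBER() pattern, keeping only the first row seen per key; the events table and its proctime attribute are assumptions for this sketch.

    // Assumes tEnv and a registered table events(event_id, payload, proctime),
    // where proctime is a processing-time attribute.
    // Keeping only row_num = 1 per event_id deduplicates the stream; the
    // planner recognizes the pattern and runs it as a cheap deduplication
    // operator instead of a full ranking.
    Table deduped = tEnv.sqlQuery(
        "SELECT event_id, payload " +
        "FROM (" +
        "  SELECT event_id, payload, " +
        "         ROW_NUMBER() OVER (" +
        "           PARTITION BY event_id ORDER BY proctime ASC) AS row_num " +
        "  FROM events" +
        ") WHERE row_num = 1");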

Dimension table association: joins a real-time stream against dimension data stored in MySQL, HBase, or Hive tables, as sketched below.
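Dimension table joins use the FOR SYSTEM_TIME AS OF syntax; this sketch assumes an orders stream with a proctime attribute and a registered, lookup-capable dimension table user_dim (for example backed by MySQL or HBase).

    // Assumes tEnv, a stream table orders(user_id, amount, proctime), and a
    // dimension table user_dim(user_id, user_name) that supports lookups.
    // FOR SYSTEM_TIME AS OF o.proctime looks up the dimension row as of
    // each order's processing time.
    Table enriched = tEnv.sqlQuery(
        "SELECT o.user_id, o.amount, d.user_name " +
        "FROM orders AS o " +
        "JOIN user_dim FOR SYSTEM_TIME AS OF o.proctime AS d " +
        "ON o.user_id = d.user_id");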

MiniBatch and multiple means of resolving hotspots: In terms of performance optimization, Flink 1.9 also provides several knobs, such as MiniBatch (micro-batching inside streaming operators) to improve throughput, and multiple strategies for mitigating data-skew hotspots.
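These optimizations are enabled through the table configuration. The sketch below assumes the Blink planner; the option keys match the Blink planner's documented configuration, but treat them as illustrative.

    import org.apache.flink.configuration.Configuration;

    // Assumes tEnv was created with the Blink planner.
    Configuration conf = tEnv.getConfig().getConfiguration();
    // MiniBatch: buffer input records and fire aggregations in small
    // batches to reduce state-access overhead and raise throughput.
    conf.setString("table.exec.mini-batch.enabled", "true");
    conf.setString("table.exec.mini-batch.allow-latency", "2 s");
    conf.setString("table.exec.mini-batch.size", "5000");
    // One hotspot-mitigation strategy: split skewed aggregations into a
    // local pre-aggregation followed by a global aggregation.
    conf.setString("table.optimizer.agg-phase-strategy", "TWO_PHASE");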

Complete batch support: Flink 1.9 has complete support for batch processing, and work will continue in the next release toward out-of-the-box TPC-DS performance.

Python Table API: Flink 1.9 also introduces the Python Table API, a major step forward in Flink's multi-language support, which allows Python users to easily work with features such as Flink SQL.

Hive integration: Hive is an important force in the Hadoop ecosystem, so Hive integration is necessary to promote Flink's batch processing. Happily, two Hive PMC members are among the contributors to Flink 1.9, driving the integration effort. The first problem to solve is how Flink reads Hive data. So far, Flink has completed access to the Hive Metastore: Flink can directly read the metadata in the Hive Metastore, and it can also store the metadata of its own tables in the Hive Metastore for Hive to access. A Hive connector has also been added, currently supporting the CSV format, so users only need to configure the Hive Metastore to read Hive data directly from Flink. On top of this, Flink 1.9 adds compatibility with Hive user-defined functions, which can now run directly in Flink SQL.
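Here is a sketch of connecting to a Hive Metastore through the HiveCatalog added in 1.9; the catalog name, default database, configuration directory, and Hive version are all placeholders.

    import org.apache.flink.table.catalog.hive.HiveCatalog;

    // Assumes tEnv. All four constructor arguments are placeholders:
    // catalog name, default database, hive-site.xml directory, Hive version.
    HiveCatalog hive = new HiveCatalog(
            "myhive", "default", "/opt/hive/conf", "2.3.4");
    tEnv.registerCatalog("myhive", hive);
    tEnv.useCatalog("myhive");

    // Tables known to the Hive Metastore can now be queried directly.
    Table result = tEnv.sqlQuery("SELECT * FROM some_hive_table");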

Batch improvements: fine-grained batch job recovery (FLIP-1)

Flink 1.9 also offers a number of improvements in batch processing, the most notable being fine-grained batch job recovery. This optimization was proposed long ago, and version 1.9 finally finishes the remaining work. In Flink 1.9, when an error occurs in a batch job, Flink first computes the region affected by the error, called a Failover Region. In a batch job, some nodes transmit data downstream in a pipelined fashion, while other nodes persist their output in blocking mode so that downstream nodes read the stored data afterwards. If an operator's output has been completely persisted, there is no need to restart that operator, so error recovery can be confined to a relatively small region. Taken to the extreme of persisting data at every shuffle, this resembles the behavior of MapReduce's map stage, but Flink supports more advanced usage: users can control whether each shuffle uses direct network transfer or file-based shuffle. This is one of Flink's core differences.
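Fine-grained recovery is switched on through the cluster configuration; below is a minimal sketch of the relevant flink-conf.yaml entry in 1.9, treated here as illustrative.

    # flink-conf.yaml: restart only the failover region affected by an
    # error instead of restarting the whole batch job.
    jobmanager.execution.failover-strategy: region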



After file-based shuffle was implemented, people also wondered whether this function could be made pluggable, so that files could be shuffled to other places. The community is currently working in this direction as well: for example, Yarn could be used to implement the shuffle, or a standalone distributed service could be built to shuffle files. This architecture has already been implemented inside Alibaba, where a single job can shuffle on the order of 100 TB of data. Once Flink is equipped with such a plug-in mechanism, it can easily connect to more efficient and flexible shuffle implementations, so that shuffle, a long-standing problem in batch processing, can be better solved.

Stream processing improvements: State Processor API (FLIP-43)

Stream processing has always been at Flink's core, so Flink 1.9 also brings many improvements to stream processing, adding a very practical feature called the State Processor API, which lets users directly access the State stored in Flink. The API makes it very easy to read, modify, and even rebuild the entire State. The power of this feature lies in several aspects. The first is the flexibility to read external data, for example reading from a database to bootstrap a Savepoint autonomously, solving a job's cold-start problem so that you don't have to reprocess all the data from N days ago.

In addition, with the State Processor API, users can directly analyze the data in State. Previously, this data was a black box: there was no way to know whether what was stored was right or wrong, or whether there were anomalies. With the State Processor API, users can analyze State data like ordinary data to detect anomalies and analyze faults. The third point is the correction of dirty data: if a dirty record has polluted the State, the State Processor API can be used to repair and correct it. The last point is State migration: when users change the job logic but want to reuse most of the original job's State, or want to upgrade the State's structure, this API can do that too. Many common tasks and problems in stream processing can be solved with the State Processor API provided in Flink 1.9, so this API has very broad application prospects.
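Here is a minimal sketch of reading keyed state from an existing savepoint with the flink-state-processor-api module; the savepoint path, operator uid, and state name are placeholders.

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.runtime.state.memory.MemoryStateBackend;
    import org.apache.flink.state.api.ExistingSavepoint;
    import org.apache.flink.state.api.Savepoint;
    import org.apache.flink.state.api.functions.KeyedStateReaderFunction;
    import org.apache.flink.util.Collector;

    public class ReadSavepoint {
        public static void main(String[] args) throws Exception {
            // In 1.9 the State Processor API is built on the DataSet API.
            ExecutionEnvironment bEnv = ExecutionEnvironment.getExecutionEnvironment();

            // Savepoint path and state backend are placeholders.
            ExistingSavepoint savepoint = Savepoint.load(
                    bEnv, "hdfs:///savepoints/savepoint-123", new MemoryStateBackend());

            // Read the keyed state of the operator with uid "my-operator".
            DataSet<Long> counts = savepoint.readKeyedState("my-operator", new CountReader());
            counts.print();
        }

        // Reads a hypothetical ValueState<Long> named "count", keyed by String.
        static class CountReader extends KeyedStateReaderFunction<String, Long> {
            private transient ValueState<Long> countState;

            @Override
            public void open(Configuration parameters) {
                countState = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("count", Long.class));
            }

            @Override
            public void readKey(String key, Context ctx, Collector<Long> out) throws Exception {
                out.collect(countState.value());
            }
        }
    }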

Refactored Web UI

In addition to the improvements described above, Flink 1.9.0 also offers a refreshed Web UI, as shown below. The new front end was rebuilt by professional web front-end engineers using the latest version of Angular. As you can see, the new Web UI is fresh and modern, a breath of fresh air among the built-in UIs of Apache open source projects.



III. Summary

After this round of intensive development, Flink 1.9.0 not only attracted a large number of Chinese developers who contributed a great deal of code, but also brought many new users. As the chart below shows, Flink 1.9.0 exceeds the previous two versions combined in both the number of issues resolved and the number of code commits. With about 1.5 million lines of code changed, roughly six times as many as the previous version, Flink 1.9.0 is the most active release since Flink was open-sourced. The contributor count also shows that Flink is attracting more and more contributors, many of them from China. In addition, according to Apache's officially published activity metrics for open source projects, Flink's numbers are among the best.



From all of this we can see that Flink 1.9.0 is only a beginning: both Flink's functionality and its ecosystem will keep getting better. We sincerely hope that more developers will join the Flink community and make Flink better and better.

At the end of November this year, the world's largest official Apache Flink conference will be held in Beijing, with more than 2,000 developers expected to attend. I hope you will follow it.

In addition, the Apache Flink Geek Challenge is currently underway; those who are interested are welcome to take part.


This article is original content of the Yunqi Community (Alibaba Cloud) and may not be reproduced without permission.