This article is the first in a series on Flink SQL, covering the basics, practice, tuning, and internals of the new version of Apache Flink.
1. Development history
On August 22 of this year, Apache Flink released version 1.9.0 (hereinafter referred to as 1.9). In Flink 1.9, the Table module received a major upgrade to its core architecture and gained many features contributed by Alibaba's Blink team. This article sorts out the Table module architecture and introduces how to use the Blink Planner.
Flink's Table module includes the Table API and SQL. The Table API is a SQL-like API through which users can manipulate data as if operating on tables, which is intuitive and convenient. SQL is a declarative language with standard syntax and specifications: users can process data without caring about the underlying implementation, so it is very easy to get started. The Flink Table API and SQL share roughly 80% of their implementation code. As a unified stream and batch computing engine, Flink is unified at the Runtime layer, but before Flink 1.9 its API layer was split into the DataStream API and the DataSet API, with Table API & SQL sitting on top of both.
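As a quick illustration of the two styles, here is a minimal sketch in the Flink 1.9 Java API that expresses the same aggregation once with the Table API and once with SQL; the table name Orders and the fields userId and amount are hypothetical and assumed to be already registered in the table environment:

```java
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;

// Assumes tableEnv is a StreamTableEnvironment with a table "Orders"
// (fields: userId, amount) already registered.

// Table API style: compose the query with method calls.
Table apiResult = tableEnv
        .scan("Orders")
        .groupBy("userId")
        .select("userId, amount.sum as total");

// SQL style: the same query as a standard SQL string.
Table sqlResult = tableEnv.sqlQuery(
        "SELECT userId, SUM(amount) AS total FROM Orders GROUP BY userId");
```

Both calls produce a Table with the same semantics; which style to use is largely a matter of taste and of how dynamic the query needs to be.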
Flink 1.8 Table architecture
In the Flink 1.8 architecture, users who need both stream computing and batch processing have to maintain two sets of business code, and developers have to maintain two technology stacks, which is very inconvenient. The Flink community has long envisioned treating batch data as bounded stream data and batch processing as a special case of stream computing, thereby unifying stream and batch. Alibaba's Blink team did a lot of work in this direction and achieved stream-batch unification at the Table API & SQL layer. Fortunately, Alibaba open sourced Blink back to the Flink community. To unify stream and batch across the entire Flink system, community developers, drawing on the Blink team's experience, largely finalized Flink's future technical architecture after several rounds of discussion.
Flink Future architecture
In Flink's future architecture, the DataSet API will be removed, leaving users with only the DataStream API and Table API & SQL. At the implementation layer, the two APIs share the same technology stack: a unified DAG data structure describes jobs, a unified StreamOperator is used to write operator logic, and a unified streaming distributed execution engine runs them, completing the stream-batch unification. Both APIs provide stream computing and batch processing capabilities. The DataStream API offers a lower-level, more flexible programming interface with which users describe and orchestrate operators themselves, without much interference or optimization from the engine. Table API & SQL offers an intuitive Table API and standard SQL support, and the engine optimizes according to the user's intent and selects the best execution plan.
2. Flink 1.9 Table architecture
Since it was open sourced, the architecture of Blink's Table module has already been stream-batch unified; it represents the first step toward Flink's future architecture and was ahead of the Flink community. Therefore, when the Blink Table code was merged in Flink 1.9, to ensure that the existing Flink Table architecture and the Blink Table architecture could coexist and both evolve toward the future Flink architecture, community developers drew up FLIP-32 (FLIP stands for Flink Improvement Proposal, a proposal for a major change to Flink), titled "Restructure flink-table for future contributions". It can be said that Flink 1.9 is Flink's first step toward the complete unification of its future architecture.
Flink 1.9 Table architecture
In the new Flink Table architecture, there are two query processors: the Flink Query Processor and the Blink Query Processor, corresponding to two planners, the Old Planner and the Blink Planner respectively. A query processor is the concrete implementation of its planner: through parsing, optimization, code generation, and other steps, it translates Table API & SQL jobs into a Transformation DAG that the Flink Runtime recognizes, and the Flink Runtime then schedules and executes the job.
Flink's query processor has separate branches for stream computing and batch processing: the underlying API for stream computing is the DataStream API, while for batch processing it is the DataSet API. The Blink query processor, by contrast, unifies the stream and batch job interfaces, and its underlying API is Transformation.
3. Flink Planner and Blink Planner
The new Flink Table architecture makes the query processor pluggable: the community has fully retained the original Flink Planner (Old Planner) while introducing the new Blink Planner, and users can choose to use either one.
In terms of model, the Old Planner does not unify stream and batch jobs: stream and batch jobs are implemented differently and are translated into the DataStream API and the DataSet API respectively at the bottom. The Blink Planner, by contrast, treats batch data sets as bounded streams, so both stream computing jobs and batch jobs are ultimately translated into the Transformation API. In terms of architecture, the Blink Planner implements a BatchPlanner and a StreamPlanner for batch processing and stream computing respectively, which share most of their code and much of their optimization logic. The Old Planner implements the batch and stream code paths as two completely separate systems, with essentially no reuse of code or optimization logic.
In addition to its advantages in model and architecture, the Blink Planner has accumulated many practical features from the massive business scenarios within Alibaba Group, focused on three areas:
- The Blink Planner improves the code generation mechanism, optimizes some operators, and provides rich and practical new features such as dimension-table join, Top-N, MiniBatch, streaming deduplication, and data-skew optimization for aggregation scenarios (see the Top-N sketch after this list).
- The Blink Planner's optimization strategy is based on common subgraphs and includes both cost-based optimization (CBO) and rule-based optimization (RBO), which makes the optimization more comprehensive. The Blink Planner also supports reading data source statistics from the catalog, which is important for CBO.
- The Blink Planner offers more built-in functions and broader standard SQL support: TPC-H is fully supported in Flink 1.9, and support for the higher-order TPC-DS is planned for the next release.
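As a concrete taste of these features, the following minimal sketch shows a streaming Top-N query written in the standard ROW_NUMBER() pattern that the Blink Planner recognizes and optimizes; the table ShopSales and its fields product_id, category, and sales are hypothetical:

```java
import org.apache.flink.table.api.Table;

// Top 5 products per category by sales. The Blink Planner detects the
// ROW_NUMBER() + WHERE rownum <= N pattern and executes it as an
// efficient streaming Top-N operator instead of a full sort.
Table top5 = tableEnv.sqlQuery(
    "SELECT product_id, category, sales FROM (" +
    "  SELECT product_id, category, sales, " +
    "         ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales DESC) AS rownum " +
    "  FROM ShopSales)" +
    " WHERE rownum <= 5");
```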
Overall, the Blink query processor is more advanced architecturally and more complete functionally. For stability reasons, Flink 1.9 still uses the Flink Planner by default, but users can explicitly specify the Blink Planner in their jobs if they want to use it.
4. How to enable the Blink Planner
In an IDE environment, enabling the Blink Planner only requires adding two dependencies:
```xml
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table-api-scala-bridge_2.11</artifactId>
  <version>1.9.0</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table-planner-blink_2.11</artifactId>
  <version>1.9.0</version>
</dependency>
```
The configuration for stream and batch jobs is very similar: you only need to choose StreamingMode or BatchMode in EnvironmentSettings. The settings for a stream job are as follows:
```java
// **********************
// BLINK STREAMING QUERY
// **********************
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.java.StreamTableEnvironment;

StreamExecutionEnvironment bsEnv = StreamExecutionEnvironment.getExecutionEnvironment();
EnvironmentSettings bsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
StreamTableEnvironment bsTableEnv = StreamTableEnvironment.create(bsEnv, bsSettings);
// or: TableEnvironment bsTableEnv = TableEnvironment.create(bsSettings);

bsTableEnv.sqlUpdate(...); // register sources/sinks and submit SQL here
bsTableEnv.execute("blink streaming query"); // in 1.9, execute takes a job name
```
The batch job setup is as follows:
```java
// ******************
// BLINK BATCH QUERY
// ******************
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

EnvironmentSettings bbSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build();
TableEnvironment bbTableEnv = TableEnvironment.create(bbSettings);

bbTableEnv.sqlUpdate(...); // register sources/sinks and submit SQL here
bbTableEnv.execute("blink batch query"); // in 1.9, execute takes a job name
```
If the job needs to run in a cluster environment, set the scope of the Blink Planner dependencies to provided when packaging, to indicate that these dependencies are supplied by the cluster environment. Flink already bundles the Blink Planner dependencies when it is compiled and packaged, so there is no need to ship them again, which also avoids class conflicts.
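For instance, the planner dependency from the pom.xml snippet above would be declared like this when building a jar for cluster submission (a sketch following the Maven conventions used earlier):

```xml
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table-planner-blink_2.11</artifactId>
  <version>1.9.0</version>
  <!-- provided: the cluster's Flink distribution already ships this jar,
       so it is excluded from the user jar to avoid class conflicts -->
  <scope>provided</scope>
</dependency>
```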
5. Long-term community plans
At present, Table API & SQL has become a first-class citizen among Flink's APIs, and the community will invest more effort in this module. In the near future, once the Blink Planner stabilizes, it will become the default planner, and the Old Planner will be retired in due course. The community is also working on giving the DataStream API batch processing capability in order to unify the stream and batch technology stacks, after which the DataSet API will be retired as well.
Author: Ba Shu Zhen
This article is original content from the Alibaba Cloud Yunqi community and may not be reproduced without permission.