Flink 1.10 has been released to great anticipation and enthusiasm; it is the largest release in the Flink community's history to date. Here, I will introduce the dataflow programming model behind the upgraded Flink in eight main points.
1. Levels of abstraction
Flink provides four levels of abstraction to develop stream/batch applications.
The lowest level of abstraction offers simple stateful streaming. It is embedded in the DataStream API through the Process Function. It allows users to freely process events from one or more streams with consistent, fault-tolerant state. In addition, users can register event-time and processing-time callbacks, allowing programs to implement complex computations.
In practice, most applications do not need the low-level abstraction described above and instead program against the core APIs, such as the DataStream API (bounded/unbounded streams) and the DataSet API (bounded datasets). These APIs provide the common building blocks for data processing, such as various forms of user-specified transformations, joins, aggregations, windows, state, and so on. The data types handled by these APIs are represented as classes in the corresponding programming language.
The low-level Process Function is integrated with the DataStream API, so the low-level abstraction is only needed for certain operations. Other primitives for bounded datasets, such as loops/iterations, are provided by the DataSet API.
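To make the lowest level concrete, here is a minimal sketch of a keyed Process Function that keeps fault-tolerant state and registers a processing-time callback. The Event type, its fields, and the 60-second idle timeout are all invented for this illustration:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical event type, invented for this sketch.
class Event {
    public String key;
    public long timestampMillis;
}

public class IdleTimeoutFunction extends KeyedProcessFunction<String, Event, String> {

    private transient ValueState<Long> lastSeen; // fault-tolerant per-key state

    @Override
    public void open(Configuration parameters) {
        lastSeen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastSeen", Long.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<String> out)
            throws Exception {
        // Remember when this key was last seen and schedule a callback 60s later.
        long now = ctx.timerService().currentProcessingTime();
        lastSeen.update(now);
        ctx.timerService().registerProcessingTimeTimer(now + 60_000);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out)
            throws Exception {
        // Fire only if no newer event has refreshed the state since the timer was set.
        Long last = lastSeen.value();
        if (last != null && timestamp >= last + 60_000) {
            out.collect("key " + ctx.getCurrentKey() + " has been idle for 60 seconds");
        }
    }
}
```

Such a function would be applied to a keyed stream, for example via stream.keyBy(e -> e.key).process(new IdleTimeoutFunction()).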
The Table API is a declarative DSL centered around tables, which may be dynamically changing (when representing streams). The Table API follows the (extended) relational model: a table has a schema attached (similar to a table in a relational database), and the API offers comparable operations such as select, project, join, group-by, and aggregate. Table API programs declaratively define what logical operation should be performed rather than specifying exactly what the code for the operation looks like. Although the Table API can be extended with various types of user-defined functions, it is less expressive than the core APIs, but much more concise to use (less code to write). In addition, Table API programs go through an optimizer that applies optimization rules before execution. Tables can be seamlessly converted to and from DataStream/DataSet, allowing applications to mix the Table API with the DataStream and DataSet APIs.
The highest level of abstraction that Flink provides is SQL. This abstraction is similar to the Table API in semantics and expressiveness, but represents programs as SQL query expressions. The SQL abstraction interacts closely with the Table API, and SQL queries can be executed over tables defined in the Table API.
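As a small illustration of these two upper levels, here is a minimal sketch mixing the Table API with SQL, assuming a Flink 1.10-style TableEnvironment and an already-registered table named "Orders" (the table and its columns are invented for this example):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class TableAndSqlExample {
    public static void main(String[] args) {
        EnvironmentSettings settings = EnvironmentSettings.newInstance()
                .inStreamingMode().build();
        TableEnvironment tEnv = TableEnvironment.create(settings);

        // Declarative Table API: describe *what* to compute, not how.
        // Assumes a table "Orders" with an "amount" column has been registered.
        Table filtered = tEnv.from("Orders").filter("amount > 10");
        tEnv.createTemporaryView("BigOrders", filtered);

        // SQL query executed over a table defined via the Table API.
        Table result = tEnv.sqlQuery(
                "SELECT product, SUM(amount) AS total FROM BigOrders GROUP BY product");
    }
}
```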
2. Programs and data flows
The basic building blocks of a Flink program are streams and transformations. (Note that the DataSets used in Flink's DataSet API are also streams internally; more on that later.) Conceptually, a stream is a (potentially never-ending) flow of data records, whereas a transformation is an operation that takes one or more streams as input and produces one or more output streams as a result.
When executed, Flink programs are mapped to streaming dataflows consisting of streams and transformation operators. Each dataflow starts with one or more sources and ends with one or more sinks. The dataflows resemble arbitrary directed acyclic graphs (DAGs). Although special forms of cycles are permitted via iteration constructs, for the most part we will gloss over them for simplicity.
There is usually a one-to-one correspondence between the transformations in the program and the operators in the dataflow. Sometimes, however, one transformation may consist of multiple transformation operators. Sources and sinks are documented in the streaming connectors and batch connectors documentation; transformations are documented in the DataStream operators and DataSet transformations documentation.
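A minimal sketch of such a dataflow, with one source, two chained transformations, and one sink (the socket source and its host/port are stand-ins for a real source):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DataflowExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> lines =
                env.socketTextStream("localhost", 9999); // source

        lines.filter(line -> !line.isEmpty())            // transformation
             .map(String::toLowerCase)                   // transformation
             .print();                                   // sink

        env.execute("dataflow example");
    }
}
```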
3. Parallel dataflows
Programs in Flink are inherently parallel and distributed. During execution, a stream has one or more stream partitions, and each operator has one or more operator subtasks. The operator subtasks are independent of one another, execute in different threads, and may execute on different machines or containers. The number of operator subtasks is the parallelism of that particular operator. The parallelism of a stream is always that of its producing operator. Different operators in the same program may have different levels of parallelism.
Streams can transport data between two operators in a one-to-one (or forwarding) pattern or in a redistributing pattern:

One-to-one streams (for example, between a source and a map() operator) preserve the partitioning and ordering of elements. This means that subtask[1] of the map() operator will see the same elements, in the same order, as they were produced by subtask[1] of the source operator.

Redistributing streams (for example, between map() and keyBy/window, or between keyBy/window and a sink) change the partitioning of the stream. Each operator subtask sends data to different target subtasks, depending on the selected transformation. Examples are keyBy() (which re-partitions by hashing the key), broadcast(), and rebalance() (which re-partitions randomly). In a redistributing exchange, the ordering among elements is only preserved within each pair of sending and receiving subtasks (for example, subtask[1] of map() and subtask[2] of keyBy/window). Thus, in this example, the ordering within each key is preserved, but the parallelism does introduce non-determinism regarding the order in which the aggregated results for different keys arrive at the sink.
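As a small illustration of these exchange patterns, here is a minimal sketch, assuming a toy element set and an overall parallelism of 2 (both arbitrary choices):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelDataflow {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Operators default to 2 subtasks (the fromElements source itself
        // is inherently non-parallel and runs with parallelism 1).
        env.setParallelism(2);

        env.fromElements("a", "b", "a", "c")
           .map(String::toUpperCase)  // one-to-one (forwarding): preserves order
           .keyBy(value -> value)     // redistributing: re-partitions by hash of key
           .print()                   // sink
           .setParallelism(1);        // per-operator parallelism can differ

        env.execute("parallel dataflow");
    }
}
```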
4. Windows
Aggregating events (e.g., counts, sums) works differently on streams than in batch processing. For example, it is impossible to count all the elements in a stream, because streams are generally infinite (unbounded). Instead, aggregations on streams (counts, sums, etc.) are scoped by windows, such as "the count over the last 5 minutes" or "the sum of the last 100 elements."
Windows can be time-driven (e.g., every 30 seconds) or data-driven (e.g., every 100 elements). One typically distinguishes different types of windows, such as tumbling windows (no overlap), sliding windows (with overlap), and session windows (punctuated by gaps of inactivity).
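A minimal sketch of both window flavors on a keyed stream, using Flink 1.10-era processing-time window assigners (the input elements and window sizes are arbitrary choices):

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Integer>> counts = env
                .fromElements("a", "b", "a")
                .map(word -> Tuple2.of(word, 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT));

        // Time-driven tumbling window: emit one count per key every 30 seconds.
        counts.keyBy(t -> t.f0)
              .window(TumblingProcessingTimeWindows.of(Time.seconds(30)))
              .sum(1)
              .print();

        // Data-driven count window: emit a sum per key every 100 elements.
        counts.keyBy(t -> t.f0)
              .countWindow(100)
              .sum(1);

        env.execute("window example");
    }
}
```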
5. Time
When referring to time in a streaming program (for example, to define a window), different notions of time may be meant: Event time is the time when an event was created. It is usually described by a timestamp in the event, for example attached by the producing sensor or the producing service. Flink accesses event timestamps via timestamp assigners.
Ingestion time is the time when an event enters the Flink dataflow at the source operator.
Processing time is the local time of each operator that performs a time-based operation.
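To make event time concrete, here is a minimal sketch of a Flink 1.10-style timestamp assigner; the Event POJO and its timestampMillis field are made up for illustration, and the 5-second out-of-orderness bound is arbitrary:

```java
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeExample {
    // Hypothetical event type, invented for this sketch.
    public static class Event {
        public long timestampMillis; // when the event was created at the source
        public String payload;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        DataStream<Event> events = env.fromElements(new Event());

        // The timestamp assigner reads event time out of each record and emits
        // watermarks tolerating up to 5 seconds of out-of-order events.
        events.assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.seconds(5)) {
                    @Override
                    public long extractTimestamp(Event e) {
                        return e.timestampMillis;
                    }
                });
    }
}
```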
6. Stateful operations
While many operations in a dataflow simply look at one individual event at a time (for example, an event parser), some operations remember information across multiple events (for example, window operators). These operations are called stateful.
The state of stateful operations is maintained in what can be thought of as an embedded key/value store. The state is strictly partitioned and distributed together with the streams that are read by the stateful operators. Hence, key/value state is only accessible on keyed streams, after a keyBy() function, and is restricted to the values associated with the current event's key. Aligning the keys of streams and state ensures that all state updates are local operations, guaranteeing consistency without transaction overhead. This alignment also allows Flink to redistribute state and adjust stream partitioning transparently.
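A minimal sketch of keyed state: a RichFlatMapFunction that keeps one counter per key in ValueState (the tuple types and the counting logic are made up for illustration):

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class CountPerKey
        extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> count; // scoped to the current event's key

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out)
            throws Exception {
        Long current = count.value();
        long next = (current == null ? 0L : current) + 1;
        count.update(next);
        out.collect(Tuple2.of(in.f0, next));
    }
}
```

This function must run on a keyed stream, for example stream.keyBy(t -> t.f0).flatMap(new CountPerKey()); without the keyBy(), the keyed state would not be accessible.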
7. Checkpoints for fault tolerance
Flink implements fault tolerance by combining stream replay with checkpointing. A checkpoint relates a specific point in each input stream to the corresponding state of each operator. By restoring the operators' state and replaying the events from the point of the checkpoint, a streaming dataflow can be resumed from the checkpoint while maintaining consistency (exactly-once processing semantics).
The checkpoint interval is a means of trading off fault-tolerance overhead during execution against recovery time (the number of events that need to be replayed). The fault tolerance internals documentation provides more information about how Flink manages checkpoints and related topics. For details on enabling and configuring checkpointing, see the checkpointing API documentation.
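A minimal sketch of turning checkpointing on; the 10-second interval and exactly-once mode are illustrative choices, not recommendations:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Trade-off knob: a shorter interval means fewer events to replay on
        // recovery, but more checkpointing overhead during normal execution.
        env.enableCheckpointing(10_000); // checkpoint every 10 seconds

        env.getCheckpointConfig()
           .setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
    }
}
```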
8. Batch on streaming
Flink executes batch programs as a special case of streaming programs, where the streams are bounded (have a finite number of elements). A DataSet is treated internally as a stream of data. The concepts above thus apply to batch programs in the same way that they apply to streaming programs, with a few exceptions:
Fault tolerance for batch programs does not use checkpointing. Recovery happens by fully replaying the streams, which is possible because the inputs are bounded. This pushes the cost more towards recovery, but makes regular processing cheaper, because checkpoints are avoided. Stateful operations in the DataSet API use simplified in-memory/out-of-core data structures rather than key/value indexes.
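A minimal sketch of a bounded DataSet program, a classic word count, showing the same building blocks applied to a finite dataset (the input elements are made up):

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class BoundedWordCount {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> words = env.fromElements("a", "b", "a"); // bounded input

        words.map(word -> Tuple2.of(word, 1))
             .returns(Types.TUPLE(Types.STRING, Types.INT))
             .groupBy(0)  // group by the word
             .sum(1)      // count occurrences per word
             .print();    // print() triggers execution in the DataSet API
    }
}
```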