Apache Flink Zero Basics (I): Basic Concept Analysis

Authors: Chen Shouyuan, Dai Zili

1. Definition, architecture and principles of Apache Flink

Apache Flink is a distributed big data processing engine that performs stateful or stateless computations over bounded and unbounded data streams. It can be deployed in a variety of cluster environments and performs fast computations over data of any scale.

1. Flink Application

To understand Flink application development, it is necessary to first understand Flink's basic processing semantics, such as Streams, State and Time, as well as Flink's multi-layered API, which combines flexibility and convenience.

  • Streams: streams can be bounded or unbounded, i.e., finite or infinite data streams. The difference between the two is that the data in an unbounded stream keeps growing over time, so the computation runs continuously and never ends, whereas a bounded stream's data set is fixed in size, so the computation eventually completes and reaches an end state.

  • State refers to the data produced during computation, and it plays an important role in fault-tolerant recovery and Checkpoints. Stream computing is essentially incremental processing, so continuous queries need to maintain state. In addition, to guarantee exactly-once semantics, data must be written into state; persisting that state then guarantees exactly-once even when the whole distributed system fails or crashes, which is another value of state.

  • Time can be divided into Event Time, Ingestion Time and Processing Time. Because Flink processes infinite data streams continuously, time is an important basis for judging whether the business state is lagging behind and whether data is being processed promptly.

  • API: Flink's API is usually divided into three layers, from top to bottom the SQL/Table API, the DataStream API and ProcessFunction. The closer an API is to the SQL layer, the weaker its expressiveness but the stronger its abstraction; conversely, the ProcessFunction layer is highly expressive and supports all kinds of flexible and convenient operations, but offers relatively little abstraction.

2. Flink Architecture

On the architecture side, there are four main points:

First, Flink provides a unified framework for processing both bounded and unbounded data streams.

Second, flexible deployment. At the bottom layer, Flink supports a variety of resource schedulers, including YARN and Kubernetes, and Flink's own Standalone scheduler can also be deployed flexibly.

Third, extremely high scalability, which is very important for distributed systems. The Alibaba Double 11 real-time dashboard uses Flink to process massive amounts of data; during that event, Flink's throughput peaked at 1.7 billion records per second.

Fourth, extreme stream processing performance. Compared with Storm, Flink's biggest advantage is that it fully abstracts state semantics into the framework and supports reading state locally, which avoids a great deal of network I/O and greatly improves the performance of state access.

3. Flink Operations

There will be a dedicated course on this later. Here, I will briefly cover Flink's operations, maintenance and business monitoring:

  • Flink offers 7×24 high availability because it implements consistent Checkpoints. The Checkpoint is the core of Flink's fault-tolerance mechanism: it periodically records the state of each operator during computation and persists the resulting snapshots. When a Flink job fails, it can be recovered from a chosen Checkpoint to guarantee consistent computation.

  • Flink itself provides monitoring and operations-and-maintenance (O&M) functions and interfaces, including a built-in web user interface (WebUI) that shows the DAG and various metrics of running jobs, helping users manage job status.

4. Application scenarios of Flink

4.1 Application scenario of Flink: Data Pipeline

The core Data Pipeline scenario is similar to data movement, with some data cleaning or processing done along the way. On the left side of the business architecture diagram is periodic ETL; the Data Pipeline provides streaming or real-time ETL instead, which can subscribe to messages from a message queue, process them, and write the cleaned data to a downstream database or file system in real time. Scenario examples:

  • Real-time data warehouse

When the downstream is building a real-time data warehouse, the upstream may need real-time streaming ETL. In this process, the data is cleaned or enriched in real time, and after cleaning it is written into the downstream real-time data warehouse. The whole pipeline guarantees the timeliness of data queries, forming real-time data collection, real-time data processing and real-time querying downstream.

  • Search engine recommendations

Take the Taobao search engine as an example. When a seller puts a new product online, the backend generates a real-time message stream, which is processed and enriched by the Flink system. The processed and enriched data is then indexed and written into the search engine in real time, so that when Taobao sellers launch new products, those products become searchable within seconds or minutes.

4.2 Flink Application scenario: Data Analytics

As shown in the figure, Batch Analytics is on the left and Streaming Analytics on the right. Batch analytics means using traditional tools such as MapReduce, Hive and Spark Batch to analyze and process jobs and generate offline reports. Streaming analytics uses a streaming engine such as Storm or Flink to process and analyze data in real time, and applies to many scenarios such as real-time dashboards and real-time reports.

4.3 Flink Application scenario: Data Driven

To some extent, all real-time data processing or streaming data processing is data driven, and streaming computation is data-driven computation by nature. Take a risk-control system as an example: when the system needs to apply a variety of complex rules, the rules and logic are written with the DataStream API or the ProcessFunction API and then run inside the Flink engine. When external data or events flow in, the corresponding rules are triggered; this is the principle of data-driven processing. After a rule fires, the system processes the event or raises an alert, and these alerts are sent downstream to produce business notifications. This is the Data Driven application scenario, and it is mostly used for complex event processing within applications.
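The rule-triggering flow described above can be sketched in a few lines of plain Python. This is a toy model, not Flink code: the rule names, event fields and `run_rules` helper are all illustrative assumptions.

```python
# Hypothetical data-driven rule engine: rules are plain predicates
# evaluated against each incoming event; a match produces an alert
# that would be sent downstream as a business notification.

def make_rule(name, predicate):
    return {"name": name, "predicate": predicate}

def run_rules(events, rules):
    """Apply every rule to every event; collect triggered alerts."""
    alerts = []
    for event in events:
        for rule in rules:
            if rule["predicate"](event):
                alerts.append((rule["name"], event))
    return alerts

# Two illustrative risk-control rules.
rules = [
    make_rule("large_transfer", lambda e: e["amount"] > 10_000),
    make_rule("foreign_login", lambda e: e.get("country") not in ("CN", None)),
]

events = [
    {"user": "a", "amount": 50, "country": "CN"},
    {"user": "b", "amount": 25_000, "country": "US"},
]

alerts = run_rules(events, rules)  # only user "b" triggers rules
```

In Flink itself this logic would live inside a ProcessFunction so that rules are evaluated as each event arrives, rather than over a pre-collected list.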

2. Concept analysis of “stateful streaming processing”

1. Traditional batch processing

The traditional batch processing approach continuously collects data, divides it into batches by time, and then periodically runs batch computations. But suppose you need to count event conversions per hour: if a conversion spans the defined time partition, traditional batch processing has to carry the intermediate result into the next batch's computation. Likewise, when events arrive out of order, traditional batch processing still has to carry intermediate state into the next batch's result, which is unsatisfactory.
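A minimal sketch of the boundary problem above, in plain Python: events are cut into hour-sized batches, so a computation whose input straddles a batch boundary cannot be completed within one batch. The event shape and helper name are illustrative assumptions.

```python
# Toy model of time-partitioned batching: events are grouped into
# hourly batches by timestamp. A multi-event computation that spans
# the hour boundary (e.g. the three "a" events below) is split across
# two batches, so intermediate state must be carried between them.

from collections import defaultdict

HOUR = 3600

def hourly_batches(events):
    """Group (timestamp_seconds, key) events into hour-sized batches."""
    batches = defaultdict(list)
    for ts, key in events:
        batches[ts // HOUR].append((ts, key))
    return dict(batches)

# The last event falls just over the hour boundary into the next batch.
events = [(3500, "a"), (3590, "a"), (3700, "a")]
batches = hourly_batches(events)
counts_per_batch = {h: len(evs) for h, evs in batches.items()}
```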

2. Ideal method

First, state. The ideal approach requires the engine to be able to accumulate and maintain state, where the state represents all the events received in the past that affect the output.

Second, time. Time means the engine has a mechanism for reasoning about data completeness, so that it outputs results only once all the data has been received.

Third, the ideal model should produce results in real time; more importantly, it should be a new model that continuously processes data as it arrives, which best matches the continuous nature of the data.

3. Streaming processing

In simple terms, streaming processing means that an endless data source continuously collects data, while a piece of code implements the basic data-processing logic: data from the source is processed by the code to produce results, which are then output. That is the basic principle of streaming processing.
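That source-to-code-to-output loop can be illustrated with Python generators, which naturally model a record-at-a-time pipeline. This is a conceptual sketch, not Flink's API; the function names are made up.

```python
# Bare-bones streaming model: an (in principle endless) source yields
# records one at a time, and the processing code transforms each record
# as it arrives, emitting results continuously.

def source(records):
    """Stand-in for an unbounded data source."""
    for record in records:
        yield record

def process(stream):
    """The pipeline's processing logic: transform each record."""
    for record in stream:
        yield record.upper()

# With a real unbounded source this loop would never terminate;
# here we drain a small finite stream for demonstration.
results = list(process(source(["a", "b", "c"])))
```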

4. Distributed streaming processing

Suppose the input streams contain many users, each with their own ID, and we want to count how many times each user appears. Then the occurrence events for the same user must be streamed to the same piece of computation code; this is the same concept as GROUP BY in batch processing. So, just as with batch GROUP BY, the stream must be partitioned: choose a key, and route all records with the same key to the same computation instance for processing.
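Key-based routing can be sketched with a hash-modulo mapping, assuming a fixed number of parallel instances. This is only the general idea; Flink's actual `keyBy` uses its own key-group hashing scheme.

```python
# Toy key partitioning: every record's key deterministically maps to
# one of NUM_INSTANCES parallel computation instances, so records with
# the same key always land on the same instance and its local state.

from collections import Counter

NUM_INSTANCES = 3

def route(user_id):
    """Deterministically map a key to a parallel instance index."""
    return hash(user_id) % NUM_INSTANCES

# Per-instance counters: each instance counts only the users routed to it.
instance_counts = [Counter() for _ in range(NUM_INSTANCES)]
for user in ["u1", "u2", "u1", "u1", "u3"]:
    instance_counts[route(user)][user] += 1

# All three "u1" events were counted on a single instance.
u1_total = instance_counts[route("u1")]["u1"]
```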

5. Stateful distributed streaming processing

As shown in the figure, the code defines a variable X that is read and written while the data is processed, and the final output can be determined from X; that is, the state X affects the final result. The first key point of this process is co-partitioned key-by: the stream is partitioned by key, so every record with the same key flows to the same computation instance, just as in the user-count example. The state accumulated for a key therefore always lives in the same computation instance as the events with that key.

State that is partitioned the same way as the input stream is called co-partitioned state. The second key point is the embedded local state backend: when there are many keys, the state may exceed the memory capacity of a single node, so the state must be maintained by a state backend; under normal conditions, this backend can maintain the state in memory.

3. Advantages of Apache Flink

1. State fault tolerance

When we think about state fault tolerance, we inevitably think of exactly-once state fault tolerance: each input event is reflected in the state accumulated by the application, and it changes that state exactly once. If the state were modified more than once per event, the results produced by the engine would be unreliable.

  • How do we guarantee that state updates are exactly-once?

  • How do we create a globally consistent snapshot across multiple operators, each with local state, in a distributed setting?

  • More importantly, how do you create a snapshot without interrupting the computation?

1.1 Exactly-once fault tolerance in a simple scenario

Take the user-count example again: if the counts are not computed exactly once, the results cannot be used as a reference. Before considering fault-tolerance guarantees in general, first consider the simplest scenario: an unbounded stream of data feeds a single process, and each computation accumulates some state. To guarantee exactly-once state fault tolerance here, a snapshot is taken after each record changes the state, and the snapshot records the corresponding position in the input queue together with the state. Because the position and the state are saved together consistently, this constitutes a consistent snapshot, and exactness is guaranteed.
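The simple scenario can be sketched directly: pair each state update with a snapshot of (source position, state), and on failure replay from the saved position with the saved state. This is a toy model of the idea, not Flink's implementation, and all names are illustrative.

```python
# Toy exactly-once model for a single process: after each record is
# applied to the state, the (position, state) pair is snapshotted
# together, so restarting from a snapshot neither skips nor
# double-counts any record.

def run(records, start_index=0, state=0, snapshots=None):
    snapshots = snapshots if snapshots is not None else []
    for i in range(start_index, len(records)):
        state += records[i]               # apply the event to the state
        snapshots.append((i + 1, state))  # snapshot position and state together
    return state, snapshots

records = [1, 2, 3, 4]
state, snapshots = run(records)

# Simulate a crash after the second record, then recover from its snapshot:
pos, saved_state = snapshots[1]
recovered_state, _ = run(records, start_index=pos, state=saved_state)
```

Because the position and the state come from the same snapshot, the recovered run produces the same final state as the uninterrupted one.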

1.2 Distributed state fault tolerance

As a distributed processing engine, Flink runs many operators with local state across the cluster in a distributed scenario, yet must produce a single globally consistent snapshot. Generating such a snapshot without interrupting the computation is the problem of distributed state fault tolerance.

  • Global consistent snapshot

For a global consistent snapshot, when operators run on many nodes in a distributed environment, one way to produce the snapshot is to pick a consistent snapshot point in the stream of records. As that point flows through all the operators, and once it has passed each one, you can record each operator's state together with the corresponding position in the stream; together these form a consistent snapshot. The global consistent snapshot is thus an extension of the simple scenario above.

  • Fault-tolerant recovery

The state backend maintains the local state and ships it to a shared DFS each time a Checkpoint is generated. When any process fails, the state of all operators can be restored directly from the last completed Checkpoint and reset to the corresponding position in the stream. The existence of Checkpoints is what lets the whole process achieve exactly-once in a distributed environment.

1.3 Distributed Snapshots

How does Flink continuously generate global consistent snapshots without interrupting the computation? It uses an extension of the Chandy-Lamport algorithm. Flink continuously inserts Checkpoint barriers into the DataStream (barrier N-1, barrier N, and so on); Checkpoint barrier N indicates that all the data between barrier N-1 and barrier N belongs to Checkpoint N.

For example, when Checkpoint N is to be generated, the Job Manager triggers the Checkpoint, and Checkpoint barrier N is injected at the data source. When the job starts Checkpoint N, the table in the lower left corner of the figure begins to be filled in.

As shown in the figure, the events marked red, together with Checkpoint barrier N (also red), are the data and events that Checkpoint N is responsible for; the white data and events following Checkpoint barrier N do not belong to Checkpoint N.

When the data source receives Checkpoint barrier N, it first saves its own state. For example, if the data source is reading a Kafka partition, its state is its current read offset in that partition, and this state is written into the table above. As the barrier flows downstream, Operator 1 processes the data belonging to Checkpoint N; by the time Checkpoint barrier N reaches Operator 1, all the data belonging to Checkpoint N has already been reflected in Operator 1's state, so on receiving barrier N it takes a snapshot of its state for the Checkpoint.

After its snapshot completes, the barrier continues downstream. Operator 2 likewise receives all its data, and when it sees Checkpoint barrier N, it writes its own state directly into Checkpoint N. At this point, the table for Checkpoint barrier N is complete, and the completed table is what we call a distributed snapshot. Distributed snapshots are used for state fault tolerance: when a node fails, it can be recovered from its previous Checkpoint. Meanwhile, further Checkpoints, such as Checkpoint N+1 and Checkpoint N+2, can be in progress at the same time. With this mechanism, Checkpoints can be generated continuously without ever blocking the computation.
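The barrier flow above can be modeled in a few lines of Python: a marker value travels through a chain of operators in stream order, and each operator snapshots its state the moment the marker arrives. This is a deliberately simplified single-input chain (no barrier alignment across multiple inputs), not Flink's internals.

```python
# Toy model of checkpoint barriers: data elements and one BARRIER marker
# flow through a chain of operators in order. Each operator adds data
# elements to its state and snapshots its state when the barrier passes,
# producing a consistent snapshot while processing continues afterwards.

BARRIER = "BARRIER"

def run_chain(stream, num_operators=2):
    states = [0] * num_operators
    checkpoint = {}
    for element in stream:
        for op in range(num_operators):   # element passes each operator in turn
            if element == BARRIER:
                checkpoint[op] = states[op]  # snapshot on barrier arrival
            else:
                states[op] += element        # normal processing continues
    return states, checkpoint

# Records 1 and 2 precede the barrier; record 3 belongs to the next checkpoint.
states, checkpoint = run_chain([1, 2, BARRIER, 3])
```

Every operator snapshots the same logical point (after records 1 and 2), while record 3 is still processed normally, so the snapshot is consistent without stopping the stream.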

2. State maintenance

State maintenance means that the code maintains state values locally; when those state values are very large, a local state backend is needed to support them.

In Flink, state is registered through an API such as getRuntimeContext().getState(desc); once registered, the state is read and written through a state backend. Flink has two different kinds of state values and two different kinds of state backends:

  • The JVM heap state backend, which is suitable when the amount of state is small. With this backend, every read and write is a plain Java object read/write, with no serialization cost; only when a Checkpoint needs to include each operator's local state in the distributed snapshot does the state get serialized.

  • The RocksDB state backend, which is an out-of-core state backend. This local state backend lets the user keep state on disk at runtime, at the cost of serializing every write and deserializing every read. When a snapshot is required, the already-serialized state is transferred directly to the central shared DFS.

Flink currently supports these two kinds of state backends: one is a pure in-memory backend and the other uses the local disk. You can choose the appropriate backend based on how much state you need to maintain.
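The trade-off between the two backend styles can be sketched with two toy classes: a heap-style backend that stores live objects, and a RocksDB-style backend that serializes on write and deserializes on read. These are stand-ins for the concepts only, not the real backends.

```python
# Toy contrast of the two state-backend styles: the heap-style backend
# does plain object reads/writes, while the disk-style backend pays
# serialization on every write and deserialization on every read
# (in exchange for being able to spill state out of memory).

import pickle

class HeapBackend:
    """Heap-style: state lives as ordinary in-memory objects."""
    def __init__(self):
        self._state = {}
    def put(self, key, value):
        self._state[key] = value           # plain object write
    def get(self, key):
        return self._state[key]            # plain object read

class SerializingBackend:
    """RocksDB-style: state is stored in serialized form."""
    def __init__(self):
        self._state = {}
    def put(self, key, value):
        self._state[key] = pickle.dumps(value)   # serialize on write
    def get(self, key):
        return pickle.loads(self._state[key])    # deserialize on read

heap, disk_like = HeapBackend(), SerializingBackend()
heap.put("count", 7)
disk_like.put("count", 7)
```

A side benefit of the serializing style, mirrored in the real RocksDB backend, is that snapshot time is cheaper: the state is already in serialized form and can be shipped to the DFS directly.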

3. Event Time

3.1 Different time types

Before Flink and other advanced streaming engines, big data processing engines only supported processing-time processing. Suppose you define an hourly window and the window computation runs every hour. With processing time, the engine computes over the data it received between 3 p.m. and 4 p.m. But when producing reports or analysis, what we actually want are the results for the data generated between 3 p.m. and 4 p.m. in the real world, so we must use event time to obtain results based on when the data was actually produced.

As shown in the figure, event time means that each event carries a timestamp assigned when the data was generated at the source, and subsequent computations use that timestamp. Graphically, the incoming data is divided into hourly batches by those timestamps; this is what event-time processing does.

3.2 Event-time Processing

Event-time processing re-buckets data using the actual timestamp at which each event was generated: data whose event time falls between 3 p.m. and 4 p.m. is placed in the 3-4 p.m. bucket, and the bucket then produces the result. This is the essential contrast between event time and processing time.
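The re-bucketing step can be sketched directly: each event is assigned to the hourly bucket of its own timestamp, regardless of when it arrives. A plain-Python sketch, not Flink's window assigner; the helper name and event shape are assumptions.

```python
# Toy event-time bucketing: events carry the timestamp assigned at the
# source and are bucketed by that timestamp, so even a late or
# out-of-order event still lands in the hour it was actually generated.

from collections import defaultdict

HOUR = 3600

def bucket_by_event_time(events):
    """events: (event_time_seconds, payload) pairs, possibly out of order."""
    buckets = defaultdict(list)
    for event_time, payload in events:
        buckets[event_time // HOUR].append(payload)
    return dict(buckets)

# "late" was generated at 3:59 pm but arrives after a 4 pm event;
# event-time bucketing still places it in the 3-4 pm bucket.
events = [
    (15 * HOUR + 10, "x"),
    (16 * HOUR + 5, "y"),
    (15 * HOUR + 3590, "late"),
]
buckets = bucket_by_event_time(events)
```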

The importance of event time is that it lets the engine decide when to output a computed result. Simply put, a streaming engine runs 24 hours a day, continuously collecting data. Suppose there is a window operator in the pipeline that should produce a result every hour; the question is when the window knows that all its data has arrived so the result can be emitted. This is the essence of event-time processing: indicating that the expected data has been fully received.

3.3 Watermarks

Flink actually implements the event-time functionality with watermarks. A watermark is also a special event in Flink; its essence is that once an operator receives a watermark with timestamp "T", it will not receive any more data with event time earlier than T. The advantage of watermarks is that the engine can accurately estimate the cut-off point for receiving data. For example, suppose we expect data to arrive at most 5 minutes late relative to when results should be output. Then a window operator in Flink collecting the 3 p.m. to 4 p.m. data must, because of the delay, wait an extra 5 minutes until the 4:05 data has arrived; at that point, the 4 p.m. data collection is judged complete, and only then are the results for 3 p.m. to 4 p.m. produced. The results for that time period correspond to that watermark.
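The firing condition described above can be sketched with one common watermark scheme: the watermark trails the maximum event time seen by the allowed lateness, and a window fires once the watermark passes the window end. This is a conceptual sketch (in Flink, watermark generation is configured via a WatermarkStrategy); the constants are the 5-minute example from the text.

```python
# Toy watermark logic: with an allowed delay of 5 minutes, the
# 3-4 pm window fires only once events up to 4:05 have been seen,
# i.e. once the watermark reaches or passes the window end at 4:00.

HOUR, DELAY = 3600, 300            # seconds; DELAY = 5-minute lateness bound
WINDOW_END = 16 * HOUR             # the 3 pm - 4 pm window closes at 4 pm

def watermark(max_event_time_seen):
    """Watermark trails the largest event timestamp by the allowed delay."""
    return max_event_time_seen - DELAY

def window_fires(event_times):
    """The window fires when the watermark passes the window end."""
    return watermark(max(event_times)) >= WINDOW_END

# Seeing an event stamped exactly 4:00 only advances the watermark to 3:55.
not_yet = window_fires([15 * HOUR + 100, 16 * HOUR])
# Seeing an event stamped 4:05 advances the watermark to 4:00 -> fire.
fires = window_fires([15 * HOUR + 100, 16 * HOUR + DELAY])
```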

4. Saving and migrating state

Streaming applications run all the time, and there are several important operational considerations:

  • When the application logic changes or a bug is fixed, how do we migrate the state of the previous execution to the new execution?

  • How do we redefine the parallelism of a running job?

  • How do we upgrade the version of the computing cluster?

Checkpoints fit all of these requirements, but Flink has another term for this: Savepoint. A manually triggered Checkpoint is called a Savepoint. The difference between them is that Flink generates Checkpoints periodically for a stateful application using distributed snapshots, whereas a Savepoint is a Checkpoint generated manually; like a Checkpoint, a Savepoint records the state of all operators in the streaming application.

As shown in the figure with Savepoint A and Savepoint B, whether you are changing the underlying code logic, fixing a bug, upgrading the Flink version, or redefining the application's parallelism, the first thing to do is generate a Savepoint.

A Savepoint is generated by manually inserting Checkpoint barriers into all the pipelines to produce a distributed snapshot; that distributed snapshot is the Savepoint. A Savepoint can be stored anywhere, and once the changes are complete, the job can be resumed and executed directly from the Savepoint.

For example, Kafka keeps collecting data continuously, and the Savepoint stores the read positions in Kafka at the moment it was taken. When recovering from the Savepoint, the job therefore needs to catch up to the latest data, and event time ensures that the recomputed results are exactly the same as before.

Suppose the recomputation after recovery used processing time with a one-hour window: if the recomputation catches up within 10 minutes, all of its results would be lumped into a single window, which is wrong. With event time, it is like bucketing: no matter how many times the data is recomputed, the final results produced by the windows are guaranteed to be exactly the same.
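The Savepoint recovery property can be sketched by combining the earlier ideas: a savepoint records the source offset plus operator state, and event-time windows make the replayed results identical to the uninterrupted run, regardless of replay speed. A toy model under those assumptions, not Flink's savepoint format.

```python
# Toy savepoint model: a savepoint = (source offset, window state).
# Replaying from the savepoint with event-time hourly windows yields
# exactly the same per-window counts as the original uninterrupted run,
# no matter how fast the reprocessing catches up.

from collections import Counter

HOUR = 3600

def windowed_counts(events, start_offset=0, counts=None):
    """Count events per event-time hour, starting from a source offset."""
    counts = Counter(counts or {})
    for event_time in events[start_offset:]:
        counts[event_time // HOUR] += 1
    return counts

events = [15 * HOUR + 1, 15 * HOUR + 2, 16 * HOUR + 1, 16 * HOUR + 2]

full_run = windowed_counts(events)

# Savepoint taken after two events: (offset into the source, window state).
savepoint = (2, windowed_counts(events[:2]))
recovered = windowed_counts(events, start_offset=savepoint[0], counts=savepoint[1])
```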

4. Summary

This article started from the definition, architecture and basic principles of Apache Flink, analyzed the basic concepts of big data stream computing, and briefly reviewed the historical evolution of big data processing and the principles of stateful stream processing. Finally, starting from the current challenges of stateful stream processing, it analyzed Apache Flink's natural advantages as one of the best streaming engines in the industry. I hope it helps you clarify the basic concepts involved in big data streaming engines and makes it easier to use Flink.