Apache Flink is a distributed processing engine for stateful computations over unbounded and bounded data streams. Flink runs on a variety of common resource managers and performs fast, in-memory computation over data of any scale.

The data model

In Flink, all data is treated as a stream of events. Event streams are divided into bounded and unbounded streams.

  • Bounded stream: a stream with a defined start and end. Flink can ingest all of the data before starting computation; the data does not need to arrive in order, because it can be sorted after ingestion. This is batch processing in Flink.
  • Unbounded stream: a stream with a start but no end. Because it never ends, the data must be processed continuously as it arrives, typically in a specific order (such as the order in which events occurred). This is real-time stream processing in Flink.

Flink treats batch processing as a special case of stream processing, so batch and streaming jobs can share the same processing model.
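The idea that batch is a special case of streaming can be sketched outside of Flink itself. The stdlib Python sketch below (not Flink API) shows one incremental processing function serving both a finite collection and a never-ending generator:

```python
import itertools
from typing import Iterable, Iterator

def running_sum(events: Iterable[int]) -> Iterator[int]:
    """The same incremental logic works for bounded and unbounded input."""
    total = 0
    for value in events:
        total += value
        yield total

# Bounded stream ("batch"): a finite collection, fully available up front.
print(list(running_sum([1, 2, 3])))        # [1, 3, 6]

# Unbounded stream: an endless generator, processed incrementally.
endless = itertools.count(1)               # 1, 2, 3, ... never ends
print(list(itertools.islice(running_sum(endless), 3)))   # [1, 3, 6]
```

The bounded case simply drains the whole stream; the unbounded case consumes results as they are produced, without ever waiting for the input to finish.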

Time handling

Flink supports three notions of time:

  • Event time: the time at which the data was generated
  • Ingestion time: the time at which the data entered Flink
  • Processing time: the time on the Flink machine that processes the data

Flink supports event-time-based processing and uses watermarks to handle late data.
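The watermark idea can be sketched in a few lines of stdlib Python (this is a concept sketch, not the Flink API): the watermark trails the largest event time seen so far by an allowed out-of-orderness, and records that fall behind it are treated as late.

```python
def process_with_watermarks(events, max_out_of_orderness):
    """events: (event_time, value) pairs in arrival order.
    Returns (accepted, late): values ahead of / behind the watermark."""
    watermark = float("-inf")
    accepted, late = [], []
    for event_time, value in events:
        # Watermark = highest event time seen so far minus allowed lateness.
        watermark = max(watermark, event_time - max_out_of_orderness)
        if event_time >= watermark:
            accepted.append(value)
        else:
            late.append(value)   # too late: dropped (or sent to a side output)
    return accepted, late

events = [(1, "a"), (5, "b"), (3, "c"), (2, "d")]  # (event_time, value)
accepted, late = process_with_watermarks(events, max_out_of_orderness=2)
print(accepted, late)   # ['a', 'b', 'c'] ['d']
```

After "b" (event time 5) arrives, the watermark advances to 3, so "c" (time 3) is still accepted but "d" (time 2) is filtered as late.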

Flink APIs

Flink offers different levels of API:

  • Stateful Stream Processing: allows users to freely process stream events and provides consistent, fault-tolerant state. Complicated custom logic can be built at this level.
  • Core APIs: provide APIs for transformations, connections, aggregations, windows, and state.
  • Table API: similar to tables in a database; it abstracts away many of the APIs behind table operations.
  • SQL: Data can be processed using standard SQL.

Cluster architecture

  • Client: turns the user code into a dataflow graph and submits it to the JobManager.
  • JobManager: manages the cluster, allocates resources, and coordinates high availability.
  • TaskManager: runs the tasks.

Execution model

DataFlows

Source: Collections / Socket / File (HDFS, local) / GenerateSequence / ES / Kafka…

Transformation:

  • Single-record operations: filter/map/flatMap
  • Window-based:
    • Non-keyed stream: timeWindowAll/countWindowAll/windowAll
    • Keyed stream: timeWindow/countWindow/window
  • Merge multiple streams:
    • Non-keyed stream: union/join/connect/coGroup
    • Keyed stream: interval join
  • Split a stream: split/sideOutput
  • Physical grouping operations: control how records are distributed to downstream tasks
    • dataStream.broadcast(): sends every upstream record to all downstream tasks
    • dataStream.forward(): sends records one-to-one to the corresponding local downstream task
    • dataStream.rebalance(): distributes records round-robin across all downstream tasks
    • dataStream.rescale(): distributes records round-robin within a local group of downstream tasks
    • dataStream.partitionCustom(): sends records according to a user-defined partitioner
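The physical grouping operations above can be sketched as plain functions. This is an illustrative stdlib Python model of the distribution patterns, not the Flink DataStream API:

```python
def rebalance(records, num_tasks):
    """Round-robin: record i goes to downstream task i % num_tasks."""
    tasks = [[] for _ in range(num_tasks)]
    for i, record in enumerate(records):
        tasks[i % num_tasks].append(record)
    return tasks

def broadcast(records, num_tasks):
    """Every downstream task receives every record."""
    return [list(records) for _ in range(num_tasks)]

def partition_custom(records, num_tasks, partitioner):
    """A user-supplied function picks the downstream task per record."""
    tasks = [[] for _ in range(num_tasks)]
    for record in records:
        tasks[partitioner(record) % num_tasks].append(record)
    return tasks

records = ["a", "b", "c", "d"]
print(rebalance(records, 2))    # [['a', 'c'], ['b', 'd']]
print(broadcast(records, 2))    # [['a', 'b', 'c', 'd'], ['a', 'b', 'c', 'd']]
print(partition_custom(records, 2, lambda r: len(r)))  # all go to task 1
```

Rebalance spreads load evenly, broadcast duplicates data everywhere (useful for small lookup streams), and a custom partitioner lets the application decide placement per record.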

Sink: File/ES/Kafka…

The DataFlows model is event-driven. Within a partition, once a node has finished computing one event, it can send that event to the next node immediately, instead of waiting until the whole partition has been processed. By contrast, a Spark RDD partition can only be computed after all the data of the upstream RDD partition has been computed.
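The contrast can be sketched with Python generators (an illustrative analogy, not either engine's API): the pipelined version hands each event downstream as soon as it is produced, while the staged version materializes the whole upstream partition first.

```python
def source():
    # Emits events one at a time, like a streaming source.
    for i in range(3):
        yield i

def double(stream):
    # Downstream operator: processes each event as it arrives.
    for x in stream:
        yield x * 2

# Pipelined (Flink-style): each event flows through the whole chain
# as soon as it is produced; nothing waits for the full input.
pipelined = list(double(source()))

# Staged (Spark-RDD-style): the upstream partition is fully
# materialized before the downstream stage starts.
staged_input = list(source())          # compute ALL of the upstream first
staged = [x * 2 for x in staged_input]

print(pipelined, staged)   # [0, 2, 4] [0, 2, 4]
```

Both produce the same result; the difference is latency and buffering: the pipelined chain never holds the whole partition in memory between operators.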

Parallel computing

Stateful computation

Each parallel Flink task keeps its state data locally, reading and updating it during computation. Because state is always accessed locally, Flink can achieve high throughput and low latency. State can be kept on the JVM heap, or on disk if it is too large.
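A minimal stdlib Python sketch of the local-state idea (not Flink's state API): each parallel task owns the state for its key range in a local dict, and keying the stream guarantees every update for a given key lands on the same task.

```python
class KeyedCountTask:
    """One parallel task: holds only the state for its own keys,
    in a local dict -- no remote lookups on the hot path."""
    def __init__(self):
        self.state = {}          # local state: key -> running count

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1
        return self.state[key]

num_tasks = 2
tasks = [KeyedCountTask() for _ in range(num_tasks)]

for word in ["a", "b", "a", "c", "a"]:
    # Keying: the same key is always routed to the same task instance.
    task = tasks[hash(word) % num_tasks]
    task.process(word)

total_a = sum(t.state.get("a", 0) for t in tasks)
print(total_a)   # 3 -- every update for "a" hit one task's local state
```

Because no task ever needs another task's state, reads and writes stay in local memory, which is the source of the throughput and latency benefits described above.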

Snapshots

Flink provides fault tolerance and exactly-once semantics through a combination of state snapshots and stream replay. A snapshot captures the entire state of the distributed computation: the offsets into the input sources, and the state across the whole job graph that resulted from ingesting the data up to that point. In the event of a failure, the job can be restored from the snapshot and the stream replayed from the recorded offsets.
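A toy stdlib Python sketch of snapshot-and-replay recovery (a concept model, not Flink's checkpointing implementation): a snapshot is the pair (input offset, operator state), and recovery rolls both back together so every record is reflected in the state exactly once.

```python
import copy

class CheckpointedJob:
    """Toy exactly-once recovery: snapshot = (input offset, operator state)."""
    def __init__(self, records):
        self.records = records           # replayable input (like a Kafka topic)
        self.offset = 0
        self.state = {"sum": 0}
        self.snapshot = None

    def checkpoint(self):
        self.snapshot = (self.offset, copy.deepcopy(self.state))

    def restore(self):
        self.offset = self.snapshot[0]
        self.state = copy.deepcopy(self.snapshot[1])

    def run(self, n):
        for _ in range(n):
            self.state["sum"] += self.records[self.offset]
            self.offset += 1

job = CheckpointedJob([1, 2, 3, 4])
job.run(2)            # processed 1, 2 -> sum = 3
job.checkpoint()      # snapshot: offset = 2, sum = 3
job.run(1)            # processed 3 -> sum = 6 ... then the task "fails"
job.restore()         # roll back state AND offset together
job.run(2)            # replay from offset 2: 3 and 4 are counted exactly once
print(job.state["sum"], job.offset)   # 10 4
```

The crucial detail is that offset and state are snapshotted atomically: restoring one without the other would double-count or drop the records processed since the checkpoint.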

JVM-based memory management

  • Applications can work with data that exceeds the main memory limit, while garbage collection overhead is reduced
  • Objects are serialized and stored in a binary format
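The benefit of binary storage can be illustrated with Python's stdlib `struct` module (an analogy to, not an implementation of, Flink's managed memory segments): packing records into one flat buffer removes per-object headers and keeps them out of the garbage collector's way.

```python
import struct
import sys

# One heap object per record carries header overhead and GC pressure.
records = [(i, float(i)) for i in range(1000)]
object_overhead = sum(sys.getsizeof(r) for r in records)  # tuples alone

# The same records packed into one flat binary buffer:
# int32 + float64 per record, no per-record object at all.
fmt = "<id"                                  # little-endian: int32, float64
record_size = struct.calcsize(fmt)           # 12 bytes per record
buf = bytearray(record_size * len(records))
for i, (k, v) in enumerate(records):
    struct.pack_into(fmt, buf, i * record_size, k, v)

print(record_size)                           # 12
print(struct.unpack_from(fmt, buf, 0))       # (0, 0.0)
```

The buffer holds a fixed 12 bytes per record and can be spilled to disk wholesale, which is the same trick that lets Flink handle state larger than memory with little GC cost.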