What is Flink?

Apache Flink is an open-source computing platform for distributed stream and batch data processing. A single runtime supports both streaming and batch applications.

Flink characteristics

  • Existing open-source computing solutions treat stream processing and batch processing as two different application types: stream processing generally needs low latency and an exactly-once guarantee, while batch processing needs high throughput and efficient processing
  • Flink fully supports stream processing: the input stream of a streaming application is unbounded, while batch processing is treated as a special kind of streaming whose input stream happens to be bounded
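The "batch is a bounded stream" idea above can be sketched in plain Java (this is an illustrative model, not Flink API): the same per-element processing loop serves both modes, and only the termination of the source differs.

```java
import java.util.Iterator;
import java.util.List;

// Illustrative sketch: batch processing modeled as the bounded special
// case of stream processing. The processing logic is identical; a
// bounded source simply runs out of elements, while an unbounded source
// never would.
public class BoundedVsUnbounded {

    // One processing core shared by both modes: consume elements until
    // the iterator is exhausted (bounded input) or forever (unbounded input).
    static long sum(Iterator<Integer> stream) {
        long total = 0;
        while (stream.hasNext()) { // for an unbounded source this loop never ends
            total += stream.next();
        }
        return total;
    }

    public static void main(String[] args) {
        // "Batch" = the special case of a bounded input stream.
        System.out.println(sum(List.of(1, 2, 3, 4, 5).iterator())); // 15
    }
}
```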

Flink component stack

(figure: Flink component stack)

  • Deployment layer:

    Flink supports several deployment modes: local, cluster (standalone or YARN), and cloud (GCE/EC2)

  • The Runtime layer:

    The Runtime layer provides the core implementations that support Flink's computation, such as distributed stream processing, the mapping from JobGraph to ExecutionGraph, and scheduling, and it supplies the base services for the API layer above

  • The API layer:

    This layer implements the stream-processing and batch-processing APIs: the DataStream API for stream processing and the DataSet API for batch processing

  • Libraries layer:

    Application-specific computing frameworks built on top of the API layer, again split into stream-oriented and batch-oriented libraries

Flink's advantages

  • Supports high-throughput, low-latency, high-performance stream processing

    (For the same business logic, Flink runs faster than Spark Streaming and Storm, and it supports millisecond-level latency)

  • Supports highly flexible Window operations

  • Supports exactly-once semantics for stateful computation

|  | Apache Flink | Spark Streaming | Storm |
| --- | --- | --- | --- |
| Architecture | Sits between Spark and Storm: its master-slave structure resembles Spark Streaming, while its dataflow graph resembles Storm. The data stream is represented as a directed graph in which each vertex is a user-defined operation and each edge is a flow of data. | Depends on Spark; each batch is scheduled by the driver. Can be understood as a Spark DAG extended along the time dimension (micro-batch). | Master-slave mode, depends on ZooKeeper; little dependence on the master during processing. |
| Fault tolerance | Checkpoints based on Chandy-Lamport distributed snapshots. Medium overhead. | WAL and RDD lineage mechanism. High overhead. | Record-level ACKs. Medium overhead. |
| Processing model and latency | One event at a time; sub-second low latency. | All events in a time window; second-level high latency. | One event at a time; sub-second low latency. |
| Delivery guarantee | Exactly-once (implemented with the Chandy-Lamport algorithm, i.e. marker-based checkpoints). | Exactly-once. | At-least-once via record-level ACKs. |
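The marker-based (Chandy-Lamport style) checkpointing mentioned in the table can be sketched in plain Java. This is a simplified illustration, not Flink's implementation: an operator with two input channels snapshots its state only once it has seen the checkpoint barrier on both channels, buffering records from the already-aligned channel in the meantime so the snapshot stays consistent.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative sketch of barrier alignment for a two-input operator
// whose state is a running sum. Records that arrive on a channel after
// its barrier are held back until the other channel's barrier arrives,
// so the snapshot reflects exactly the records before the checkpoint.
public class BarrierAlignment {
    long state = 0;                            // running sum = operator state
    boolean[] aligned = {false, false};        // barrier seen per channel?
    Queue<Long> buffered = new ArrayDeque<>(); // post-barrier records held back
    Long snapshot = null;                      // state captured at the checkpoint

    void onRecord(int channel, long value) {
        if (aligned[channel]) {
            buffered.add(value);               // channel already aligned: hold back
        } else {
            state += value;                    // belongs before the checkpoint
        }
    }

    void onBarrier(int channel) {
        aligned[channel] = true;
        if (aligned[0] && aligned[1]) {        // barriers from all inputs: snapshot
            snapshot = state;
            while (!buffered.isEmpty()) {      // replay the held-back records
                state += buffered.poll();
            }
            aligned[0] = aligned[1] = false;   // ready for the next checkpoint
        }
    }
}
```

Failure recovery then means restoring `snapshot` and replaying the input from the barrier position, which is what yields the exactly-once state guarantee.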

Basic concepts of Flink & programming model

The basic concept

  • The basic building blocks of Flink programs are streams and transformations.
  • Each dataflow starts at one or more sources and ends at one or more sinks.
  • Aggregations on streams must be scoped by windows, such as "the sum over the last 5 minutes" or "the sum of the last 100 elements".
  • Windows come in several types: tumbling windows (no overlap), sliding windows (with overlap), and session windows (punctuated by gaps of inactivity).
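The difference between tumbling and sliding windows comes down to simple window-start arithmetic, sketched below in plain Java (an illustrative model assuming non-negative timestamps, not Flink API): a tumbling window assigns each element to exactly one window, while a sliding window of size `size` and slide `slide` assigns it to `size / slide` windows.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of window assignment for non-negative event
// timestamps. Window starts are aligned to multiples of the size
// (tumbling) or the slide (sliding).
public class WindowAssignment {

    // Start of the single tumbling window of length `size` containing `timestamp`.
    static long tumblingWindowStart(long timestamp, long size) {
        return timestamp - (timestamp % size);
    }

    // Start timestamps of all sliding windows containing `timestamp`,
    // newest first. A window [start, start + size) contains the element
    // iff start <= timestamp < start + size.
    static List<Long> slidingWindowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }
}
```

For example, with a window size of 10 and a slide of 5, an element at timestamp 17 belongs to the two overlapping windows starting at 15 and 10.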

Basic architecture

  • Flink follows a master-slave architecture
  • When a Flink cluster starts, one JobManager process and at least one TaskManager process are launched


TaskManager

  1. The workers that actually perform the computation; each executes a set of tasks of a Flink job.
  2. Manages resources such as memory, disk, and network, and reports its resource status to the JobManager on startup.

JobManager

  1. The coordinator of the Flink system: it receives Flink jobs and schedules the execution of the tasks that make up each job.
  2. Collects job status information and manages the TaskManager slave nodes in the Flink cluster.

Client

  1. When a user submits a Flink program, a Client is created first; it preprocesses the submitted program and submits it to the Flink cluster.
  2. The Client assembles the user's application into a JobGraph and submits that JobGraph to the JobManager.
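Conceptually, the JobGraph the Client builds is just the directed graph described earlier: vertices are operators and edges are data streams. A minimal sketch in plain Java (not Flink's actual JobGraph classes; the names here are illustrative):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch of a job graph: an adjacency map from each operator
// to its downstream operators. The Client derives such a graph from the
// user program's chain of transformations before submission.
public class MiniJobGraph {
    // operator name -> names of its downstream operators
    final Map<String, List<String>> edges = new LinkedHashMap<>();

    void addEdge(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    public static void main(String[] args) {
        MiniJobGraph graph = new MiniJobGraph();
        graph.addEdge("source", "map"); // source feeds a transformation
        graph.addEdge("map", "sink");   // transformation feeds a sink
        System.out.println(graph.edges);
    }
}
```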

Quick start

Generate a project skeleton with the Maven archetype (use `flink-quickstart-java` for a Java project or `flink-quickstart-scala` for Scala):

```shell
# Java project
mvn archetype:generate \
  -DarchetypeGroupId=org.apache.flink \
  -DarchetypeArtifactId=flink-quickstart-java \
  -DarchetypeVersion=1.7.2 \
  -DarchetypeCatalog=local

# Scala project
mvn archetype:generate \
  -DarchetypeGroupId=org.apache.flink \
  -DarchetypeArtifactId=flink-quickstart-scala \
  -DarchetypeVersion=1.7.2 \
  -DarchetypeCatalog=local
```