1. Flink basics

1. A brief introduction to Flink

Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink also provides core functions such as data distribution, fault tolerance, and resource management. Flink offers several high-level APIs that let users write distributed tasks:

The DataSet API performs batch operations on static data, abstracting static data into distributed data sets. Users can easily apply the various operators Flink provides to process these data sets. It supports Java, Scala, and Python.

The DataStream API abstracts streaming data into distributed data streams. It supports Java and Scala.

The Table API performs query operations on structured data, abstracting structured data into relational tables and querying those tables through an SQL-like DSL. It supports Java and Scala.

In addition, Flink provides libraries for specific application domains. For example, FlinkML, Flink's machine learning library, provides a machine learning Pipelines API and implements various machine learning algorithms; Gelly, Flink's graph computing library, provides graph computing APIs and implementations of various graph algorithms.
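To make the Table API concrete, here is a minimal sketch, assuming Flink 1.13+ with flink-table-api-java on the classpath; the table name and schema are made up for illustration:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import static org.apache.flink.table.api.Expressions.$;

public class TableApiSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // A throwaway source table backed by the built-in datagen connector.
        tEnv.executeSql(
            "CREATE TABLE orders (userId STRING, amount INT) WITH ('connector' = 'datagen')");

        // Structured data queried through the SQL-like DSL.
        Table totals = tEnv.from("orders")
            .groupBy($("userId"))
            .select($("userId"), $("amount").sum().as("total"));

        totals.execute().print();
    }
}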

2. What is the difference between Flink and traditional Spark Streaming?

This is a very broad question, because there are many differences between the two frameworks. But there is one important point to make in an interview: Flink is a standard real-time processing engine and is event-driven, while Spark Streaming is a micro-batch model.

Here are some key differences between the two frameworks:

  1. Architecture model: at runtime, Spark Streaming's main roles include Master, Worker, Driver, and Executor, while Flink's main roles include JobManager, TaskManager, and Slot.

  2. Task scheduling: Spark Streaming continuously generates tiny batches of data to build a directed acyclic graph (DAG), creating the DStreamGraph, JobGenerator, and JobScheduler in turn. Flink generates a StreamGraph from the user-submitted code, optimizes it into a JobGraph, and submits that to the JobManager, which turns it into an ExecutionGraph. The ExecutionGraph is the core data structure of Flink scheduling: the JobManager schedules jobs based on it.

  3. Time mechanism: Spark Streaming supports a limited time mechanism, only processing time. Flink supports three notions of time for stream programs: processing time, event time, and ingestion time. It also supports the watermark mechanism to handle late data (see the event-time sketch after this list).

  4. Fault tolerance: for Spark Streaming tasks, we can set a checkpoint, and after a failure and restart we can recover from the last checkpoint. However, this only keeps data from being lost; records may be reprocessed, so it does not achieve exactly-once processing semantics. Flink uses a two-phase commit protocol to solve this problem.
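As a minimal event-time sketch, assuming Flink 1.11+ where the WatermarkStrategy API is available; the (word, timestamp) records are made up:

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventTimeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (word, epoch-millis) pairs standing in for real events.
        DataStream<Tuple2<String, Long>> events =
            env.fromElements(Tuple2.of("a", 1_000L), Tuple2.of("b", 7_000L), Tuple2.of("a", 3_000L));

        // Event time + watermarks: timestamps come from the record itself, and
        // records may arrive up to 5 seconds out of order before counting as late.
        events.assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((event, previousTs) -> event.f1))
              .print();

        env.execute("event time sketch");
    }
}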

3. What is the component stack of Flink?

According to the Flink website, Flink is a system with a layered architecture. Each layer contains components that provide specific abstractions to serve the layers above it.

From bottom to top, the layers are as follows:

  • Deploy layer: this layer mainly concerns Flink's deployment modes. Flink supports a variety of deployment modes, including Local, Standalone, Cluster, and Cloud.

  • Runtime layer: the Runtime layer provides the core implementation that supports Flink computing, such as distributed stream processing, the JobGraph-to-ExecutionGraph mapping, and scheduling, and it provides basic services to the API layer above.

  • API layer: the API layer mainly implements the stream-oriented and batch-oriented processing APIs, where stream processing corresponds to the DataStream API and batch processing corresponds to the DataSet API. In later versions, Flink plans to unify the DataStream and DataSet APIs.

  • Libraries layer: this layer can be called the Flink application framework layer. Following the division of the API layer, the computing frameworks built on top of it also correspond to stream processing and batch processing respectively. Stream processing support: CEP (complex event processing) and SQL-like operations (Table-based relational operations). Batch processing support: FlinkML (machine learning library) and Gelly (graph processing).

4. Does Flink have to rely on Hadoop components to run?

Flink can run completely independently of Hadoop components. However, as big data infrastructure, the Hadoop ecosystem matters to almost every big data framework, and Flink can integrate with many Hadoop components, such as YARN, HBase, and HDFS. For example, Flink can use YARN to schedule resources, read from and write to HDFS, or use HDFS to store checkpoint data.
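As an example of the HDFS integration, here is a sketch of enabling checkpoints backed by HDFS, assuming Flink 1.13+ (where CheckpointConfig#setCheckpointStorage accepts a URI string); the namenode address and path are illustrative:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointToHdfs {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(60_000); // take a checkpoint every 60 seconds
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        // Persist checkpoint state to HDFS (path is illustrative).
        env.getCheckpointConfig().setCheckpointStorage("hdfs://namenode:8020/flink/checkpoints");
    }
}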

5. How big is your Flink cluster?

Note that this question seems to ask about the size of the Flink cluster in your application, but it hides another question: what cluster sizes can Flink support? To answer it, describe the cluster scale, node count, memory, and deployment mode (generally Flink on YARN) used in your production environment. Beyond that, Flink runs equally well on small clusters (fewer than five nodes) and on clusters of thousands of nodes processing terabytes of data.

6. Do you know the basic programming model of Flink?

Flink's website illustrates the programming model with a dataflow diagram. The basic construction of a Flink program is this: data input comes from a Source, which represents the input end of the data; it is transformed through Transformations; and it ends up in one or more Sink receivers. A stream is a set of records that never stops, and a transformation is an operation that takes one or more streams as input and produces one or more output streams. When executed, a Flink program is mapped to a Streaming Dataflow consisting of Streams and Transformation Operators.
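A minimal sketch of that Source → Transformation → Sink shape, assuming a local socket on port 9999 for illustration:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SourceTransformSink {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999) // Source: the input end of the data
           .map(String::toUpperCase)            // Transformation: one stream in, one stream out
           .print();                            // Sink: the receiving end

        env.execute("source-transform-sink");
    }
}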

7. What are the roles in the Flink cluster? What is the role of each?

A Flink program involves three roles: JobManager, TaskManager, and Client:

  • The JobManager plays the role of the Master, the manager of the cluster. It is the coordinator of the whole cluster, responsible for receiving Flink jobs, coordinating checkpoints, performing failover recovery, and so on, and it manages the TaskManagers on the slave nodes of the Flink cluster.

  • The TaskManager is the Worker actually responsible for computing; it executes the group of tasks that make up a Flink job. Each TaskManager manages the resources on its node, such as memory, disk, and network, and reports its resource status to the JobManager when it starts.

  • The Client is the client through which Flink programs are submitted. When a user submits a Flink program, a Client is created first. The Client preprocesses the user-submitted program and submits it to the Flink cluster for processing. To do that, the Client obtains the JobManager address from the submitted program's configuration, establishes a connection to the JobManager, and submits the Flink Job to it.

8. Describe the concept of Task Slot in Flink resource management

As mentioned under the Flink architecture roles, the TaskManager is the Worker actually responsible for performing computation. A TaskManager is a JVM process that executes one or more subtasks in separate threads. To control how many tasks a TaskManager can accept, Flink introduced the concept of the Task Slot. In simple terms, a TaskManager divides the resources it manages on its node into a number of slots, each a fixed-size subset of those resources. This prevents tasks of different jobs from competing with each other for memory. Note, however, that slots provide only memory isolation; CPU is not isolated.
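A small sketch of how slots and parallelism relate, with the slot count shown as an illustrative flink-conf.yaml comment:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SlotSketch {
    public static void main(String[] args) {
        // In flink-conf.yaml on each TaskManager (illustrative value):
        //   taskmanager.numberOfTaskSlots: 4   -> the node's managed memory is split into 4 subsets
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // With slot sharing (the default), a job needs roughly as many slots as its
        // highest operator parallelism, so parallelism 8 fits on two 4-slot TaskManagers.
        env.setParallelism(8);
    }
}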

9. What are the common operators of Flink?

The most common Flink operators include:

  • Map: DataStream → DataStream. Takes one element and produces one element; its function is to transform the input.

  • Filter: keeps only the elements that satisfy a predicate.

  • KeyBy: groups the stream according to the specified key.

  • Reduce: merges results incrementally.

  • Window: a window function that groups the data for each key according to some property (for example, the data that arrived within the last 5 s).
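A short sketch combining these operators, assuming a processing-time tumbling window of 5 seconds; the sample elements are made up:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class OperatorSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "bb", "a", "ccc")
           .map(s -> Tuple2.of(s, 1))                     // Map: one element in, one out
           .returns(Types.TUPLE(Types.STRING, Types.INT)) // lambdas need explicit type info
           .filter(t -> !t.f0.isEmpty())                  // Filter: keep matching elements
           .keyBy(t -> t.f0)                              // KeyBy: group by the word
           .window(TumblingProcessingTimeWindows.of(Time.seconds(5))) // Window: 5 s buckets
           .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))            // Reduce: merge counts
           .print();

        env.execute("operator sketch");
    }
}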

10. Do you know Flink's partitioning strategies?

First, understand what a partitioning strategy is: it determines how data is sent downstream. Flink currently implements eight partitioning strategies.

In the Flink source code, all eight partitioners extend the common StreamPartitioner base class. They are:

GlobalPartitioner: data is sent to the first instance of the downstream operator for processing.

ShufflePartitioner: data is randomly distributed among the instances of the downstream operator.

RebalancePartitioner: data is distributed round-robin across the downstream instances.

RescalePartitioner: distributes data round-robin to downstream instances based on the parallelism of the upstream and downstream operators. This is a little hard to grasp, so take an example: suppose the upstream parallelism is 2, with instances A and B, and the downstream parallelism is 4, with instances 1, 2, 3, 4. Then A cycles data to 1 and 2, and B cycles data to 3 and 4. Now suppose the upstream parallelism is 4, with instances A, B, C, D, and the downstream parallelism is 2, with instances 1, 2. Then A and B send data to 1, and C and D send data to 2.

BroadcastPartitioner: sends upstream data to every instance of the downstream operator. Suitable for joining a large data set with a small data set.

ForwardPartitioner: outputs records to the downstream operator instance running locally. It requires the upstream and downstream operators to have the same parallelism. In short, it is what you typically get, for example, when simply printing data to the console.

KeyGroupStreamPartitioner: hash partitioning. Records are output to downstream operator instances according to the hash value of the key.

CustomPartitionerWrapper: a user-defined partitioner. The user implements the Partitioner interface to define custom partitioning logic. For example:

// Implements org.apache.flink.api.common.functions.Partitioner.
static class CustomPartitioner implements Partitioner<String> {
   @Override
   public int partition(String key, int numPartitions) {
     switch (key) {
       case "1":
         return 1;
       case "2":
         return 2;
       case "3":
         return 3;
       default:
         // Returned indices must lie in [0, numPartitions).
         return 0;
     }
   }
}
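A usage sketch for the partitioner above, assuming CustomPartitioner is visible in scope; the key selector here just reuses the record itself as the key, and the other built-in strategies are shown via their standard DataStream methods:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PartitionerUsage {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4); // the custom partitioner returns indices up to 3

        DataStream<String> input = env.fromElements("1", "2", "3", "other");

        input.shuffle();   // ShufflePartitioner
        input.rebalance(); // RebalancePartitioner
        input.rescale();   // RescalePartitioner
        input.broadcast(); // BroadcastPartitioner
        input.global();    // GlobalPartitioner

        // The custom partitioner from above, keyed by the record itself.
        input.partitionCustom(new CustomPartitioner(), value -> value)
             .print();

        env.execute("partitioner usage sketch");
    }
}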