Introduction

Hello everyone, I am ChinaManor, which literally translates to Chinese code farmer. I hope I can become a pathfinder on the road of national rejuvenation, a ploughman in the field of big data, an ordinary person who is unwilling to be mediocre.

The Flink knowledge review paper is as follows:

Single choice

1. Which of the following is not a DataSet transformation operator? () A. readTextFile B. reduce C. distinct D. rebalance

A

2. Which of the following is not a Flink state type? () A. Keyed State B. Operator State C. Broadcast State D. Transform State

D

3. Regarding checkpoint state backends, which of the following is incorrect? () A. MongoDBStateBackend B. MemoryStateBackend C. FsStateBackend D. RocksDBStateBackend

A

4. Which of the following statements about Flink time windows is incorrect? () A. A time window defined based on EventTime forms an EventTimeWindow and requires the message itself to carry EventTime B. A time window defined based on IngestionTime forms an IngestionTimeWindow C. A time window defined based on ProcessingTime forms a ProcessingTimeWindow

D

5. Which of the following scenarios is Flink not suited for? () A. Real-time data pipelines / data extraction B. Real-time data warehouses and real-time ETL C. Event-driven scenarios such as alerting and monitoring D. Large volumes of offline (T+1) report calculation

D

Multiple choice

1. Flink stream processing features () A. Supports window operations with event time B. Supports exactly-once semantics for stateful computation C. Supports fault tolerance based on a lightweight distributed snapshot implementation D. Supports automatic program optimization: avoids expensive operations such as shuffle and sorting when possible, and caches intermediate results when necessary

ABCD

2. Which of the following are state backends provided by Flink? () A. IOStateBackend B. MemoryStateBackend C. FsStateBackend D. RocksDBStateBackend

BCD

3. Which two kinds of interfaces are provided by the Flink core? () A. Batch processing interface B. Stream processing interface C. Table processing interface D. Complex event processing interface

AB

4. Which two submission modes does Flink on YARN provide? () A. yarn-alone B. yarn-session C. yarn-cluster D. standalone

BC

5. Restart policies implemented by Flink include () A. Failure Rate Restart Strategy B. Fixed Delay Restart Strategy C. Fallback Restart Strategy D. No Restart Strategy

ABCD

True or false:

6. A Task Slot is the smallest unit of resource allocation within a TaskManager and represents a subset of resources that can be automatically resized according to resource requirements. ( )

F

7. The open method in Flink's Rich Function is executed once for every record of data. ( )

F

8. At the bottom layer, Flink's stream processing is implemented as batch processing; a stream is a special kind of batch operation. ( )

F

9. The high availability mode of Flink is used to prevent a single point of failure of the JobManager and ensure high availability of the cluster. ( )

T

10. Flink SQL Runtime is a unified engine for streams and batches, and Flink SQL can unify streams and batches at the API layer. ( )

The following is a mock interview. What would you say if the interviewer asked you about Flink?

1. A brief introduction to Flink

The core of Flink is a streaming dataflow execution engine that provides data distribution, data communication, and fault tolerance for distributed computation over data streams. On top of this stream execution engine, Flink provides several APIs at higher levels of abstraction for writing distributed tasks. The DataSet API performs batch operations on static data, abstracting static data into distributed data sets; users can easily process these data sets with the various operators Flink provides. It supports Java, Scala, and Python. The DataStream API performs stream operations on data streams, abstracting streaming data into distributed data streams on which users can conveniently perform various operations. It supports Java and Scala. The Table API performs query operations on structured data, abstracting structured data into relational tables and offering SQL-like DSL queries over those tables, with support for Java and Scala. In addition, Flink provides libraries for specific application domains, e.g. FlinkML, Flink's machine learning library, which provides a machine learning Pipelines API and implements various machine learning algorithms, and Gelly, Flink's graph computation library, which provides graph computation APIs and various graph algorithm implementations.

2. What is the difference between Flink and Spark Streaming?

The two differ mainly in the following aspects:

1) Architecture model: at runtime Spark Streaming relies on a driver, executors, and workers, and the driver and executors in turn rely on a cluster manager (Standalone, YARN, etc.). The main components of the Flink runtime are the JobManager, TaskManager, and TaskSlot.

2) Processing model: Spark Streaming is micro-batch processing: it runs a job at a specified interval and processes one batch of data at a time. Flink is event-driven, where an event can be understood as a message; an event-driven application is a stateful application that ingests events from one or more streams and reacts to them by triggering computations, updating state, or taking external actions.

3) Task scheduling: Spark Streaming's scheduling goes through DAG construction, stage division, TaskSet generation, and task scheduling. Flink first generates a StreamGraph and then a JobGraph; the JobGraph is submitted to the JobManager, which transforms it into an ExecutionGraph and then schedules and executes it.

4) Time mechanism: Flink supports three time semantics: event time, ingestion time, and processing time, plus a watermark mechanism for handling late data. Spark Streaming supports only processing time, while Structured Streaming supports event time and watermarks.

5) Fault tolerance: the two guarantee exactly-once in different ways. Spark Streaming saves offsets and events, whereas Flink uses a two-phase commit protocol.

3 What are the partitioning policies in Flink?

Partitioning strategies determine how data is sent downstream. Flink currently provides 8 partitioning strategy implementations.

1) GlobalPartitioner: data is sent to the first instance of the downstream operator for processing.

2) ShufflePartitioner: data is randomly distributed across the instances of the downstream operator.

3) RebalancePartitioner: data is distributed to the downstream instances in a round-robin fashion.

4) RescalePartitioner: data is distributed round-robin to downstream operator instances based on the parallelism of the upstream and downstream operators. This is a little hard to grasp, so an example: suppose the upstream parallelism is 2, numbered A and B, and the downstream parallelism is 4, numbered 1, 2, 3, 4. Then A cycles data to 1 and 2, and B cycles data to 3 and 4. Suppose instead the upstream parallelism is 4, numbered A, B, C, D, and the downstream parallelism is 2, numbered 1 and 2. Then A and B send data to 1, and C and D send data to 2.

5) BroadcastPartitioner: upstream data is output to every instance of the downstream operator. Suitable for join scenarios between a large data set and a small data set.

6) ForwardPartitioner: outputs records to the locally running downstream operator instance. It requires the upstream and downstream operators to have the same parallelism. In short, the ForwardPartitioner is what you typically see when printing data to the console.

7) KeyGroupStreamPartitioner: hash partitioning; records are sent to downstream operator instances according to the hash value of the key.

8) CustomPartitionerWrapper: user-defined partitioner. Users implement the Partitioner interface themselves to define their own partitioning logic.
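
To make the last item concrete, here is a minimal sketch, with an assumed data set and key selector, of plugging a user-defined Partitioner into a DataStream via partitionCustom:

```java
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CustomPartitionDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("spark", "flink", "flume", "hadoop")
           // User-defined partitioning logic: route each key to a downstream channel by hash
           .partitionCustom(new Partitioner<String>() {
               @Override
               public int partition(String key, int numPartitions) {
                   return Math.abs(key.hashCode()) % numPartitions;
               }
           }, word -> word)   // key selector: use the word itself as the key
           .print();

        env.execute("custom-partitioner-demo");
    }
}
```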

4. Do you know the parallelism of Flink? What do I need to pay attention to when setting parallelism in Flink?

A Flink program consists of multiple tasks (Source, Transformation, Sink). A task is split into parallel instances for execution, and each parallel instance processes a subset of the task's input data. The number of parallel instances of a task is called its parallelism. Parallelism in Flink can be set at several different levels: operator level, execution environment level, client level, and system level. In flink-conf.yaml, the configuration item parallelism.default specifies the system-level default parallelism for all execution environments. In the ExecutionEnvironment, setParallelism sets the default parallelism for operators, data sources, and data sinks; if an operator, data source, or data sink sets its own parallelism, it overrides the parallelism set on the ExecutionEnvironment. Note the priority: operator level > environment level > client level > system level.
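
A minimal sketch of the operator-level and environment-level settings (the host, port, and parallelism values are just assumptions for illustration):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);                              // execution-environment level default

        env.socketTextStream("node1", 9999)                 // socket source (its parallelism is always 1)
           .filter(line -> !line.isEmpty()).setParallelism(2)  // operator level, overrides the environment default
           .print();

        env.execute("parallelism-demo");
    }
}
```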

5 What restart policies does Flink support? How to configure them? Restart policy types:

Fixed Delay Restart Strategy, Failure Rate Restart Strategy, No Restart Strategy, and Fallback Restart Strategy.
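
Restart policies can be set globally in flink-conf.yaml or per job in code. A minimal sketch of the per-job route, with assumed attempt counts and delays:

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategyDemo {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Fixed-delay restart: up to 3 attempts, 10 seconds apart
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

        // Alternatives: failure-rate restart or no restart at all
        // env.setRestartStrategy(RestartStrategies.failureRateRestart(
        //         3, Time.of(5, TimeUnit.MINUTES), Time.of(10, TimeUnit.SECONDS)));
        // env.setRestartStrategy(RestartStrategies.noRestart());
    }
}
```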

6 What is the role of Flink’s distributed cache? How to use it?

Flink provides a distributed cache, similar to Hadoop's, that lets user functions running in parallel conveniently read local files. The cached files are placed on the TaskManager nodes so that tasks do not pull them repeatedly. It works as follows: the program registers a file or directory (on a local or remote file system such as HDFS or S3) through the ExecutionEnvironment and gives it a name. When the program runs, Flink automatically copies the file or directory to the local file system of every TaskManager node, only once. The user can then look the file or directory up by the given name and access it from the TaskManager node's local file system.
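
A minimal sketch of the two steps, registering and then reading the cached file (the HDFS path and the name "dict" are assumptions):

```java
import java.io.File;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class DistributedCacheDemo {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // 1. Register the file (local, HDFS, S3, ...) under a name
        env.registerCachedFile("hdfs://node1:8020/data/dict.txt", "dict");

        env.fromElements("spark", "flink")
           .map(new RichMapFunction<String, String>() {
               private File dictFile;

               @Override
               public void open(Configuration parameters) throws Exception {
                   // 2. Look the file up by name on the TaskManager's local file system
                   dictFile = getRuntimeContext().getDistributedCache().getFile("dict");
               }

               @Override
               public String map(String value) {
                   return value + " (dict size: " + dictFile.length() + " bytes)";
               }
           })
           .print();
    }
}
```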

7 Broadcast variables in Flink. What should I pay attention to when using broadcast variables?

In Flink, the same operator may have several different parallel instances, and the computation may not happen in the same Slot, especially across different operators. Therefore, the data computed by different operators cannot be accessed directly the way elements of a Java array can. Broadcast variables solve this: a broadcast variable can be understood as a common shared variable. We broadcast a data set, and then different tasks can obtain that data on their node, with exactly one copy of the data per node. The main thing to pay attention to is that the broadcast data set is held in memory on every node, so it should be kept small.
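
A minimal DataSet API sketch (the data and the name "numbers" are assumptions): broadcast a small data set and read it back in the open method of a RichMapFunction:

```java
import java.util.List;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class BroadcastVariableDemo {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Integer> toBroadcast = env.fromElements(1, 2, 3);

        env.fromElements("a", "b")
           .map(new RichMapFunction<String, String>() {
               private List<Integer> broadcastData;

               @Override
               public void open(Configuration parameters) {
                   // Each node holds exactly one copy of the broadcast data set
                   broadcastData = getRuntimeContext().getBroadcastVariable("numbers");
               }

               @Override
               public String map(String value) {
                   return value + broadcastData;
               }
           })
           .withBroadcastSet(toBroadcast, "numbers")   // register the broadcast data set under a name
           .print();
    }
}
```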

8. What kinds of Windows are supported in Flink? Talk about their usage scenarios

  1. Tumbling Time Window: if we need to count the total number of purchases users make in each minute, we need to cut the stream of user behavior events into one-minute segments; this segmentation is the tumbling time window (Tumbling Time Window). A tumbling window splits the data stream into non-overlapping windows, and each event belongs to exactly one window.
  2. Sliding Time Window: we can compute, every 30 seconds, the total number of items purchased by users in the last minute. This kind of window is called a sliding time window (Sliding Time Window). In sliding windows, one element can belong to multiple windows.
  3. Tumbling Count Window: when we want to produce a statistic for every 100 user purchase events, the window is evaluated each time it fills up with 100 elements; this is what we call a tumbling count window (Tumbling Count Window).
  4. Session Window: in a stream of user interaction events, a natural idea is to aggregate events into session windows (periods of continuous user activity), separated by inactive gaps. For example, we may need to calculate the total number of purchases each user made during an active period, and consider the session closed if the user is inactive for 30 seconds (assuming the raw data stream is the purchase activity stream of a single user). In general, a window defines a finite set of elements over an unbounded stream. The set can be based on time, element count, a combination of count and time, session gaps, or a custom definition. Flink's DataStream API provides concise operators for the common window operations, as well as a general window mechanism that lets users define their own window assignment logic.
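
A hedged sketch of how the four window types map onto the DataStream API (the input data, key, and window sizes are assumptions):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowTypesDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KeyedStream<Tuple2<String, Integer>, String> keyed = env
                .fromElements(Tuple2.of("user1", 1), Tuple2.of("user2", 1))
                .keyBy(t -> t.f0);

        // 1. Tumbling time window: non-overlapping 1-minute windows
        keyed.window(TumblingProcessingTimeWindows.of(Time.minutes(1))).sum(1).print();
        // 2. Sliding time window: 1-minute window, evaluated every 30 seconds
        keyed.window(SlidingProcessingTimeWindows.of(Time.minutes(1), Time.seconds(30))).sum(1).print();
        // 3. Tumbling count window: fires every 100 elements per key
        keyed.countWindow(100).sum(1).print();
        // 4. Session window: closes after 30 seconds of inactivity
        keyed.window(ProcessingTimeSessionWindows.withGap(Time.seconds(30))).sum(1).print();

        env.execute("window-types-demo");
    }
}
```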

9 Describe the Flink On Yarn mode

1. The Client uploads the jar packages and configuration files to the HDFS cluster.

2. The Client submits the task to the YARN ResourceManager and applies for resources.

3. The ResourceManager allocates Container resources and starts the ApplicationMaster. The AppMaster loads the Flink jar package and configuration, builds the environment, and starts the JobManager; the JobManager and ApplicationMaster run in the same Container. Once they have started successfully, the AppMaster knows the address of the JobManager (it is on its own machine). It generates a new Flink configuration file for the TaskManagers (so that they can connect to the JobManager); this configuration file is also uploaded to HDFS. In addition, the AppMaster container exposes Flink's web service interface. All ports allocated by YARN are temporary ports.

4. The ApplicationMaster applies to the ResourceManager for worker resources, and the NodeManager loads the Flink jar package and configuration, builds the environment, and starts the TaskManager.

5. After the TaskManager starts, it sends heartbeat packets to the JobManager and waits for the JobManager to assign tasks to it.

10. What are the time types in Flink? Introduce each one?

Time in Flink is not the same as time in the real world; in Flink it is divided into event time, ingestion time, and processing time. If the time window is defined based on EventTime, an EventTimeWindow is formed, which requires the message itself to carry its EventTime. If the time window is defined based on IngestionTime, an IngestionTimeWindow is formed, based on the source's system time. If the time window is defined based on ProcessingTime, a ProcessingTimeWindow is formed, using the operator's system time.
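
In code, the time semantics used by time windows is chosen on the execution environment (this is the pre-Flink-1.12 style API; EventTime is shown only as an example):

```java
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TimeCharacteristicDemo {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Choose EventTime (alternatives: IngestionTime, ProcessingTime)
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    }
}
```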

At this point the interviewer is satisfied with your grasp of the Flink basics, so the next step is to impress the interviewer:

11. What is WaterMark? What kind of problem is it to solve? How to create watermarks? What is the principle of watermarking?

Watermark is a mechanism proposed by Apache Flink for handling EventTime window computations; it is essentially a timestamp. Watermarks are used to handle out-of-order events, usually in combination with windows.
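
A minimal sketch of creating watermarks with the WatermarkStrategy API (the Event class, its timestamp field, and the 5-second out-of-orderness bound are assumptions):

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

public class WatermarkSketch {

    /** Assumed event type carrying its own event time in milliseconds. */
    public static class Event {
        public String key;
        public long eventTime;
    }

    public static DataStream<Event> withWatermarks(DataStream<Event> events) {
        return events.assignTimestampsAndWatermarks(
                WatermarkStrategy
                        // Tolerate events arriving up to 5 seconds out of order
                        .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        // Tell Flink where the event time lives in the record
                        .withTimestampAssigner((event, recordTimestamp) -> event.eventTime));
    }
}
```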

12 Flink’s fault tolerance mechanism

Flink implements fault tolerance based on distributed snapshots together with partially replayable data sources. You can configure the interval at which snapshots of the entire job are taken. If a job fails, Flink restores the whole job to the latest snapshot and replays data from the data source starting from that snapshot.
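
A minimal sketch of switching the snapshot (checkpoint) mechanism on; the interval and the other values are examples, not recommendations:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointDemo {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a distributed snapshot of the whole job every 5 seconds
        env.enableCheckpointing(5000);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        // Keep at most one checkpoint in flight and leave some breathing room between checkpoints
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
    }
}
```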

13 Data skew appears when Flink uses windows. What is your solution?

Note: The data skew generated by Windows refers to the difference in the amount of data accumulated in different Windows, which is mainly caused by the generation speed of source data. Core ideas: 1. Redesign key 2. Do pre-aggregation before window calculation

14 A Flink job has very high latency. What tuning strategies do you have?


First, determine the cause of the problem: find the most time-consuming part and locate the performance bottleneck. For example, if tasks frequently experience backpressure, find the backpressure point. The main approaches are resource tuning and job parameter tuning. Resource tuning means adjusting the operator parallelism, CPU cores, heap memory, and other resource parameters of the job. Job parameter tuning includes parallelism settings, state settings, and checkpoint settings.

15. How does Flink's memory management work?

Instead of storing a large number of objects on the heap, Flink serializes them all into a pre-allocated block of memory. In addition, Flink uses a lot of out-of-heap memory. If the data to be processed exceeds the memory limit, some data is stored on hard disks. Flink implemented its own serialization framework for directly manipulating binary data.

16. How does Flink support unified batch and stream processing?

The developers of Flink consider batch processing to be a special case of stream processing: batch processing is bounded stream processing. Flink uses one engine to support both the DataSet API and the DataStream API.

17. State storage in Flink

Flink often needs to store intermediate states during calculation to avoid data loss and state recovery. Different state storage policies can affect how state persistence interacts with checkpoint. Flink provides three state storage modes: MemoryStateBackend, FsStateBackend, and RocksDBStateBackend.
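
A minimal sketch of selecting one of the three backends (the HDFS path is an assumption; this is the pre-Flink-1.13 style API):

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendDemo {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Keep working state on the heap, persist checkpoints to a file system (e.g. HDFS)
        env.setStateBackend(new FsStateBackend("hdfs://node1:8020/flink/checkpoints"));
        // Alternatives: new MemoryStateBackend(), or RocksDBStateBackend from flink-statebackend-rocksdb
    }
}
```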

18. How does Flink guarantee Exactly-once semantics

Flink achieves end-to-end exactly-once semantics through a two-phase commit combined with state saving. The steps are as follows:

1) beginTransaction: create a temporary folder and write the data into it.

2) preCommit: flush the cached data to the file and close it.

3) commit: move the previously written temporary file into the target directory; this means the final data becomes visible with some delay.

4) abort: discard the temporary file.

If a failure occurs after the pre-commit succeeds but before the formal commit, the pre-committed data can be either committed or deleted based on the saved state.
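
A hedged skeleton of where those steps live when extending TwoPhaseCommitSinkFunction; the file-based transaction type and the serializers chosen here are assumptions for illustration, not Flink's built-in Kafka sink:

```java
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

import java.io.Serializable;

/** Hypothetical file-based transactional sink sketching the four two-phase-commit hooks. */
public class TwoPhaseCommitSketchSink
        extends TwoPhaseCommitSinkFunction<String, TwoPhaseCommitSketchSink.Txn, Void> {

    /** Assumed transaction handle: remembers the temporary path being written. */
    public static class Txn implements Serializable {
        public String tempPath;
    }

    public TwoPhaseCommitSketchSink() {
        super(new KryoSerializer<>(Txn.class, new ExecutionConfig()), VoidSerializer.INSTANCE);
    }

    @Override
    protected Txn beginTransaction() {
        Txn txn = new Txn();
        txn.tempPath = "/tmp/txn-" + System.nanoTime();  // beginTransaction: create a temporary folder/file
        return txn;
    }

    @Override
    protected void invoke(Txn txn, String value, Context context) {
        // write the incoming record into the transaction's temporary file
    }

    @Override
    protected void preCommit(Txn txn) {
        // preCommit: flush cached data and close the temporary file
    }

    @Override
    protected void commit(Txn txn) {
        // commit: atomically move the temporary file into the target directory
    }

    @Override
    protected void abort(Txn txn) {
        // abort: discard the temporary file
    }
}
```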

19. How does Flink deal with back pressure

Flink is based on the producer-consumer model for message delivery, and Flink’s backpressure design is also based on this model. Flink uses efficient bounded distributed blocking queues, like Java’s generic BlockingQueue. As downstream consumers slow down, upstream gets clogged.

Last but not least, an interview cannot go without coding questions:

Use Java or Scala to program a Flink WordCount word-count job.

A very classic WordCount question, similar to the handwritten WC in Scala, Spark, and MapReduce. Can you write it?

A TXT file under /export/server/data has the content: Spark Flink Flume hadoop Flink Spark Flume Hadoop

The following uses the Flink computing engine to implement streaming data processing: receive data from a Socket and compute real-time word-frequency statistics (WordCount).

Java version:

// 1. Prepare the environment -env
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

		// 2. Prepare data -source
		DataStreamSource<String> inputDataStream = env.socketTextStream("node1.itcast.cn", 9999);
		// Alternatively, read from a file instead of a socket:
		// DataStreamSource<String> inputDataStream = env.readTextFile("D:\\0615\\bigdata-flink\\datas\\wordcount.data");

		// 3. Transformation: streaming WordCount, basically the same idea as the batch version
		SingleOutputStreamOperator<Tuple2<String, Integer>> resultDataStream = inputDataStream
				// Split each line into words
				.flatMap(new FlatMapFunction<String, String>() {
					@Override
					public void flatMap(String line, Collector<String> out) throws Exception {
						for (String word : line.trim().split("\\s+")) {
							out.collect(word);
						}
					}
				})
				// Convert each word into a (word, 1) tuple
				.map(new MapFunction<String, Tuple2<String, Integer>>() {
					@Override
					public Tuple2<String, Integer> map(String word) throws Exception {
						return new Tuple2<>(word, 1);
					}
				})
				// Group by word and sum the counts
				.keyBy(0).sum(1);

		// 4. Output result -sink
		resultDataStream.print();

		// 5. -execute is triggered
		env.execute(_02StreamWordCount.class.getSimpleName());

Think about how you would write the Scala and Python versions.

How do you consume data from Kafka, keep only the records whose status is success, and write them back to Kafka?

{"user_id": "1", "page_id": "1", "status": "success"}

{"user_id": "1", "page_id": "1", "status": "success"}

{"user_id": "1", "page_id": "1", "status": "success"}

{"user_id": "1", "page_id": "1", "status": "success"}

{"user_id": "1", "page_id": "1", "status": "fail"}

Official documents:

Ci.apache.org/projects/fl…

Code implementation:

//1. Prepare the environment: the stream execution environment and the stream table environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

//2. Run SQL to create the input_kafka table
        TableResult inputTable = tEnv.executeSql(
                "CREATE TABLE input_kafka (\n" +
                        " `user_id` BIGINT,\n" +
                        " `page_id` BIGINT,\n" +
                        " `status` STRING\n" +
                        ") WITH (\n" +
                        " 'connector' = 'kafka',\n" +
                        " 'topic' = 'input_kafka',\n" +
                        " 'properties.bootstrap.servers' = 'node1:9092',\n" +
                        " 'properties.group.id' = 'testGroup',\n" +
                        " 'scan.startup.mode' = 'latest-offset',\n" +
                        " 'format' = 'json'\n" +
                        ")"
        );
// Run SQL to create the output_kafka table
        TableResult outputTable = tEnv.executeSql(
                "CREATE TABLE output_kafka (\n" +
                        " `user_id` BIGINT,\n" +
                        " `page_id` BIGINT,\n" +
                        " `status` STRING\n" +
                        ") WITH (\n" +
                        " 'connector' = 'kafka',\n" +
                        " 'topic' = 'output_kafka',\n" +
                        " 'properties.bootstrap.servers' = 'node1:9092',\n" +
                        " 'format' = 'json',\n" +
                        " 'sink.partitioner' = 'round-robin'\n" +
                        ")"
        );

// Filter values based on whether status is success
        String sql = "select " +
                "user_id," +
                "page_id," +
                "status " +
                "from input_kafka " +
                "where status = 'success'";

        Table ResultTable = tEnv.sqlQuery(sql);
	//3. Convert the result Table to a retract stream
        DataStream<Tuple2<Boolean, Row>> resultDS = tEnv.toRetractStream(ResultTable, Row.class);
	//4. Print the result stream
        resultDS.print();
	//5. Execute SQL to insert the filtered success records into output_kafka
	//   (Table#toString registers the table under a generated name, so it can be concatenated into SQL)
        tEnv.executeSql("insert into output_kafka select * from " + ResultTable);

	//6. Execute the job
        env.execute();

Let’s try one more question:

Use Java or Scala to consume data from Kafka and, in the data processing stage, filter out records whose countryCode is not CN, then print the output. Assumptions: the cluster hostname is node, the Kafka topic is data, and the Kafka consumer group is default group. Example data:

{"dt": "2020-08-05 10:11:09", "countryCode": "CN", "data": [{"type": "s1", "score": 0.8}, {"type": "s2", "score": 0.3}]}

{"dt": "2020-08-05 10:13:12", "countryCode": "KW", "data": [{"type": "s2", "score": 0.41}, {"type": "s1", "score": 0.3}]}

{"dt": "2020-08-05 10:12:15", "countryCode": "US", "data": [{"type": "s4", "score": 0.3}, {"type": "s2", "score": 0.5}]}

Conclusion

The above are the big data Flink interview questions — high-frequency Flink exam points, several thousand words in total. Part of the questions were collected from the internet, mainly to prepare for upcoming interviews and to help newcomers like the author review Flink.

If you feel like you haven’t learned Flink before, I have prepared the latest Flink series for you in 2021:

The latest and most complete Flink series tutorial notes in 2021 _Flink

The latest 2021 Flink mind map — made by a newbie (very detailed)

May you gain something of your own from reading this. If you did, why not give it a one-click triple (like, bookmark, and share)~