Preface

Search [Java3y] on WeChat to follow this simple guy. Your likes and follows are the biggest support for me!

This article has been uploaded to my GitHub: github.com/ZhongFuChen… , which has over 300 original articles; I'm currently serializing interview and project series!

A while ago I wrote an introduction to Storm, and many readers told me: "Sir, times have changed."


My company is taking its Storm cluster offline, so we have to migrate all our Storm jobs to Flink.

So I started learning Flink recently, and now I'd like to share what I've learned about it.

(By the time of publishing, a quarter has passed since I wrote the paragraph above. Sorry that this article got delayed by a whole quarter.)

I have to say, Flink has been really hot these past two years 🔥. This article mainly covers the points that may be confusing when getting started with Flink, or that the official introduction doesn't explain clearly (I won't go into the API in detail; it should become clear with use).

What is Flink?

On Flink's official website you can set the documentation language to Chinese, and the official introduction reads roughly like this:


We can understand every word in the introduction above, but not what they mean put together.

In any case, we can at least learn this: Flink is a distributed, real-time computing engine.

  • Distributed: its storage and computation are spread across multiple servers, with the results aggregated at the end.

  • Real-time: processing latency is at the millisecond or second level.

  • Computing: can be understood simply as processing data, such as cleansing it (organizing the data to extract the useful parts).


Based on the one-sentence introduction on the official website, we can think of a lot of things.

This article will give you a brief introduction to some of Flink's basic concepts, so you can get started faster when you actually need to use it. Storm has now been abandoned by many teams, so what advantages does Flink have over it? Let's take a look at Flink.

What are bounded and unbounded streams?

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.

The official docs do explain this, but it's not easy for beginners to understand, so let me explain it in kindergarten terms.

You may not have used Flink, but you've surely used message queues, right? How do you use one? The producer produces data and sends it to the broker, the consumer consumes it, and that's it.

When consuming, do we need to control when the producer sends messages? Not at all. Whenever a message arrives, we take it and process it. No problem.

A stream of messages consumed like this is, by default, unbounded.

Bounded is also easy to understand: take the unbounded stream and add a condition, and it becomes bounded. What kind of condition? For example, a time range: I only want to consume data from August 8 to August 9. That is bounded.


So when do we use unbounded, and when bounded? It depends on what you're doing. For data cleansing (take one record, process one record), unbounded is fine. For statistics, say PV (page views) per hour, I set a one-hour boundary and process one hour's worth of data at a time.

In Flink, the operation of setting a "boundary" is called opening a window (Window), and windows can be simply divided into two kinds:

  • Time window (TimeWindow): aggregate by time, for example processing one hour's worth of data at a time, as mentioned above.
  • Count window (CountWindow): aggregate by a specified number of records, say, every 10 records that come in.

Looks very considerate (Mom no longer has to worry about my aggregations)…
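The two window types above can be sketched with a toy simulation (plain Python, not the actual Flink API; the function names here are made up for illustration):

```python
from collections import defaultdict

def tumbling_time_window(records, window_size):
    # Group (timestamp, value) records into fixed, non-overlapping time windows.
    windows = defaultdict(list)
    for ts, value in records:
        window_start = ts - (ts % window_size)   # align to the window boundary
        windows[window_start].append(value)
    return dict(windows)

def count_window(values, size):
    # Group values into batches of `size` records each.
    return [values[i:i + size] for i in range(0, len(values), size)]

# One-hour (3600 s) windows over (epoch-seconds, page_view) events:
events = [(3600, 1), (3700, 1), (7200, 1), (7300, 1), (7400, 1)]
hourly = tumbling_time_window(events, 3600)
# two events fall into the first hour's window, three into the second's
```

In real Flink you would express the same idea on a keyed stream with something like `window(TumblingEventTimeWindows.of(...))` for the time window or `countWindow(...)` for the count window.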

In addition, Flink also considers the accuracy of the data when aggregating with windows. For example, suppose I have five records generated at 11:06 and four records generated at 11:07, and I aggregate per minute.

In theory, Flink should aggregate five records for minute 06 and four records for minute 07. However, due to network delay, three of the minute-06 records only arrive at Flink during minute 07. Without any special handling, minute 07 would end up processing 7 records.

For scenarios that require accurate results, that's clearly not acceptable. So Flink lets us specify the "time semantics": instead of the default Processing Time (the time data arrives at Flink), we can aggregate by Event Time (the time the event actually occurred).

The event time is the time actually recorded in the log, for example:

```
2020-11-22 00:00:02.552 INFO  [http-nio-7001-exec-28] c.m.t.rye.admin.web.aop.LogAspect
```
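To make "event time" concrete: it can be parsed straight out of the record itself, e.g. the timestamp at the front of a log line like the one above. A minimal sketch (plain Python; `extract_event_time` is a made-up helper, not a Flink API):

```python
from datetime import datetime

LOG_LINE = ("2020-11-22 00:00:02.552 INFO  "
            "[http-nio-7001-exec-28] c.m.t.rye.admin.web.aop.LogAspect")

def extract_event_time(line):
    # The first two space-separated fields form the timestamp.
    stamp = " ".join(line.split(" ", 2)[:2])     # "2020-11-22 00:00:02.552"
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")

event_time = extract_event_time(LOG_LINE)
# event_time is when the event actually happened, regardless of when
# the record eventually reaches Flink (which is the processing time)
```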

But even with "Event Time" specified, the out-of-order problem itself is still not solved (5 records were generated during minute 06, but only 3 of them arrived during minute 06; the other two arrived during minute 07). What should we do in this case? Aggregate at minute 06 anyway? And what about the two minute-06 records that arrive during minute 07?

For this, Flink lets us set WaterMarks. Flink's attitude is: I understand that network delay makes data arrive out of order. So, based on your own situation, you can set a "maximum delay"; I'll wait that long before doing the aggregation.

For example: I know my data is likely to be delayed by up to one minute, so I set the watermark delay to one minute.

Interpretation: because the time semantics are set to Event Time, Flink can read each record's occurrence time, and the watermark delay is one minute. When Flink sees a record with event time 07:59 arrive, it can be sure (because of the one-minute delay) that all of the minute-06 data has arrived, and only then does it aggregate the minute-06 window.
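The firing logic above can be sketched as a toy simulation (plain Python, not Flink's actual watermark implementation; it assumes timestamps in seconds and defines the watermark as the max event time seen minus the delay):

```python
def process_with_watermark(arrivals, delay, window_size):
    # arrivals: (event_time, value) pairs in ARRIVAL order.
    # A window [start, start + window_size) fires only once the
    # watermark (= max event time seen - delay) passes its end.
    buffers, fired = {}, {}
    max_ts = float("-inf")
    for ts, value in arrivals:
        window_start = ts - (ts % window_size)
        buffers.setdefault(window_start, []).append(value)
        max_ts = max(max_ts, ts)
        watermark = max_ts - delay
        for ws in sorted(buffers):               # fire every window the
            if ws + window_size <= watermark:    # watermark has passed
                fired[ws] = buffers.pop(ws)
    return fired

# a minute-06 record ("late") arrives AFTER minute-07 data, with a
# one-minute (60 s) watermark delay and one-minute windows:
result = process_with_watermark(
    [(360, "a"), (425, "x"), (380, "late"), (485, "y")],
    delay=60, window_size=60)
# result == {360: ["a", "late"]}: the late record still lands in the
# minute-06 window, because the watermark held that window open
```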


What is stateful?

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.

What is stateful and what is stateless?

Stateless can be simply understood as: each execution is independent of the results of the previous execution (or the previous N executions).

Stateful can be simply understood as: each execution depends on the results of the previous execution (or the previous N executions); processing depends on the results of earlier events.


For example, suppose we want to count an article's PV (page views): every time someone clicks an article, a message lands in Kafka. Now I count these on a stream processing platform. Is that stateful or stateless?

If we did this in Storm, we might write each processing result into an "external store" and then compute based on that external store (leaving Storm Trident aside here); Storm itself would be stateless.

For example, I keep the running total in Redis: every time a record arrives, I first read the current value from Redis, add to it, and write it back.

If we do this in Flink, Flink itself provides this capability: we can rely on Flink's own "storage" to hold each processing result and run the computation logic.


So you can simply think of it this way: Flink itself provides "storage", each execution can rely on that storage, and therefore Flink is stateful.
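The contrast can be sketched like this: a toy stand-in for Flink-style operator state (plain Python; in real Flink you would use something like `ValueState` inside a keyed function rather than a plain dict):

```python
class StatefulPvCounter:
    # Toy stand-in for a stateful operator: the running PV count lives
    # inside the operator itself (as in Flink), not in an external Redis.
    def __init__(self):
        self.state = {}                          # article_id -> running PV

    def process(self, article_id):
        # Each execution depends on the result of previous executions:
        # that is exactly what "stateful" means.
        self.state[article_id] = self.state.get(article_id, 0) + 1
        return self.state[article_id]

counter = StatefulPvCounter()
for click in ["a1", "a2", "a1", "a1"]:          # a stream of article clicks
    counter.process(click)
# counter.state now holds {"a1": 3, "a2": 1}
```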


So where does Flink store all this stateful data?

There are three main places:

  • Memory
  • File system (HDFS)
  • An embedded local database (RocksDB)


If Flink crashes, state in memory may be lost, while some may have made it to disk. When it restarts (and, say, re-pulls messages from the message queue), aren't we afraid of losing data or double-counting it?

At this point you may have come across another of Flink's famous features elsewhere: exactly-once.

(Simply put: when Flink crashes unexpectedly, what mechanism ensures, as far as possible, that data is neither duplicated nor lost?)

What is exactly-once?

As we all know, stream processing has three delivery semantics:

  • Exactly once: once and only once, no more, no less
  • At least once: at least once, possibly more
  • At most once: at most once, possibly none

So what does Flink's exactly-once actually mean?

Flink's exactly-once means that the state is persisted exactly once to the final storage medium (local database / HDFS…).


Say the source data stream contains the numbers 21, 13, 8, 5, 3, 2, 1, 1, and Flink needs to sum them.

Suppose we have already processed 2, 1, 1; the running total is 4, and Flink has persisted the state 4.

The program keeps going, processing 5 and 3; the total is now 12. But before Flink manages to persist 12 to the final medium, the system crashes.

When Flink restarts, it restores the state to the total 4, so 5 and 3 have to be computed again, and the program continues from there.

Some of you might object: doesn't exactly-once mean the code runs exactly once, no more, no less? Weren't 5 and 3 computed twice? How is that exactly once?

Obviously, it's impossible to guarantee that code executes only once. We have no control over which line of code the system dies on; if it dies before the current method finishes, that method has to be re-executed.

So: the state is persisted exactly once to the final storage medium (local database / HDFS). That is what exactly-once means in Flink (the data may be recomputed, which is unavoidable, but the state is stored on the medium only once).
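The crash-and-replay story above can be sketched as a toy simulation (plain Python; this is a big simplification of Flink's real checkpoint mechanism, and all the names are made up):

```python
def run_with_checkpoints(stream, checkpoint_every, crash_after=None):
    # Persist the running sum every `checkpoint_every` records; on a
    # crash, restore the last checkpoint and replay from that point.
    checkpoint = (0, 0)                  # (records covered, persisted sum)
    done, total = 0, 0
    while True:
        try:
            for i in range(done, len(stream)):
                if crash_after is not None and i == crash_after:
                    crash_after = None                 # crash only once
                    raise RuntimeError("node down")
                total += stream[i]
                if (i + 1) % checkpoint_every == 0:
                    checkpoint = (i + 1, total)        # state persisted ONCE
            return total
        except RuntimeError:
            done, total = checkpoint     # restore state, replay the rest

# checkpoint after 2,1,1 (total 4); crash after also processing 5 and 3
result = run_with_checkpoints([2, 1, 1, 5, 3, 8, 13, 21], 3, crash_after=5)
# result == 54: 5 and 3 are COMPUTED twice after the crash, yet each
# checkpointed state value is persisted exactly once, and the sum is right
```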

So how often does Flink persist the state? We configure that interval ourselves.


This is the so-called CheckPoint: Flink saves the state information at a configured interval. If Flink dies, it can fetch the last saved state, replay the data not yet covered by that state, and thereby achieve exactly-once.

So how does CheckPoint pull this off? Think about how we achieve "at least once" with Kafka in business code: we pull data from Kafka, and only after processing it do we manually commit the offset (telling Kafka we're done).


We commit the offset only after the business logic is complete, and CheckPoint works on the same principle.

Then again, how does a checkpoint know the data has been fully processed? Flink inserts barriers into the stream; each operator reports when the barrier reaches it, and once every sink has reported the barrier, that checkpoint is complete.
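A toy sketch of the barrier idea (plain Python; in real Flink, barriers flow through every operator and reports go to a checkpoint coordinator on the JobManager side):

```python
BARRIER = object()          # a marker record flowing WITH the data

def inject_barriers(records, every):
    # Toy version of Flink periodically inserting checkpoint barriers
    # into the data stream at the source.
    out = []
    for i, record in enumerate(records, 1):
        out.append(record)
        if i % every == 0:
            out.append(BARRIER)
    return out

def sink(stream):
    # A sink "reports" each barrier it reaches; every record that came
    # before that barrier is guaranteed processed by then.
    processed, completed_checkpoints = [], 0
    for item in stream:
        if item is BARRIER:
            completed_checkpoints += 1   # report to the coordinator
        else:
            processed.append(item)
    return processed, completed_checkpoints

stream = inject_barriers([1, 2, 3, 4, 5], every=2)
# sink(stream) processes all five records and reports two barriers,
# i.e. two completed checkpoints
```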


It is important to note that Flink's exactly-once only guarantees that the internal state is exact once. To achieve end-to-end exactly-once, both ends need to cooperate:

  • The data source must be replayable, so that unacknowledged data can be re-read after a failure
  • Flink needs to store its state on disk media (not only in memory) so it can be recovered after a failure
  • The sink must support transactions (the whole read-to-write path needs transactional support so that a mid-way failure doesn't leave inconsistent results)

Finally

This article has given a simple introduction to Flink, and I hope it helps you get started. Later I plan to write another Flink article that digs deeper into the CheckPoint mechanism; if you're interested, follow along so you get it as soon as possible.

3y has compiled all the [big-company interview knowledge points], [resume templates] and [original articles] into an e-book, 1263 pages in total! Click the link below to get it directly.

  • GitHub
  • Gitee access is faster

The content of the PDF is all typed by hand; if anything is unclear, you can ask me directly.