About the author: Wang Zhiwu, senior big data engineer. This article is excerpted from the Lagou Education column "42 Lessons to Easily Master Flink".
Hello, I'm Wang Zhiwu. Today's lesson covers the application scenarios of Flink.
Real-time computing at its best
Over the past decade, real-time computing has grown up alongside the big data era. First came Storm, the framework most of us met first; then Spark emerged and quickly took over much of the real-time computing space. Then, at the end of January 2019, Alibaba officially open-sourced Blink, its internal version of Flink, and the news immediately swept WeChat Moments. Spark had long been the leader in big data computing, and suddenly the field became a two-horse race.
Apache Flink (hereinafter referred to as Flink) has attracted much attention for its advanced design and powerful computing capabilities. How to adopt Flink quickly in production, integrate it with the existing big data ecosystem, and fully exploit the potential of data has become a difficult problem for many developers.
Flink application scenarios
Since Alibaba open-sourced its internal version in early 2019, Flink has quickly become a hot framework for real-time big data computing. As a main contributor to Flink, Alibaba took the lead in promoting and using it across the whole group. In addition, thanks to its native streaming design and more advanced architecture, Flink set off a wave of adoption at major companies as soon as it appeared.
Many Internet companies, such as Alibaba, Tencent, Baidu, ByteDance, Didi and Huawei, treat Flink as a key part of their future technology plans and are eager to upgrade to it and promote its use internally. Meanwhile, Flink has become one of the most active projects in the Apache Software Foundation and on GitHub.
Let’s take a look at the many application scenarios supported by Flink.
Real-time data computation
If you work with big data technology, the following requirements will sound familiar:
- Every year Alibaba shows a live big-screen dashboard on Singles' Day. How does that real-time monitoring screen work?
- The company wants to see the top five best-selling products during a big sales promotion.
- I work in operations at my company and want to receive server load figures in real time.
As we can see, data computing scenarios need to extract valuable information and metrics from raw data, such as the real-time sales, top-five products, and server load mentioned above.
Traditional analytics is usually done with batch queries: events are recorded (in production, usually as messages) and collected into bounded data sets (tables), and applications are built on top of those tables. To get up-to-date results, you must first write the new data into the table and re-run the SQL query, then write the results to a storage system such as MySQL and regenerate the reports.
Apache Flink supports both streaming and batch analytics applications, which is what we call unified batch and stream processing. In the scenarios above, Flink is responsible for ingesting the data in real time, computing on it in real time, and passing the results downstream.
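To make the contrast with batch re-querying concrete, here is a minimal plain-Python sketch of incremental computation: each incoming order updates the running metrics, so the latest total and top-N list are always available without re-scanning a table. It does not use the Flink API; the class `SalesDashboard`, its methods, and the sample orders are all hypothetical names and data chosen for illustration.

```python
from collections import Counter


class SalesDashboard:
    """Keeps running metrics that are updated once per incoming order event."""

    def __init__(self):
        self.total_sales = 0.0              # running revenue across all orders
        self.units_by_product = Counter()   # running units sold per product

    def on_order(self, product, quantity, unit_price):
        # Update the metrics incrementally instead of re-scanning a table.
        self.total_sales += quantity * unit_price
        self.units_by_product[product] += quantity

    def top_products(self, n=5):
        # most_common returns (product, units) pairs, highest count first.
        return self.units_by_product.most_common(n)


# Hypothetical order events: (product, quantity, unit_price)
orders = [
    ("phone", 2, 500.0),
    ("laptop", 1, 1200.0),
    ("phone", 1, 500.0),
    ("headset", 4, 50.0),
]

dashboard = SalesDashboard()
for product, quantity, unit_price in orders:
    dashboard.on_order(product, quantity, unit_price)

print(dashboard.total_sales)
print(dashboard.top_products(2))
```

In a real streaming job the loop would be replaced by an unbounded source such as a message queue, and the framework would distribute this state and make it fault tolerant.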
Real-time data warehouse and ETL
ETL (extract-transform-load) is the process of extracting data from business systems, cleaning and transforming it, and then loading it into a data warehouse.
A traditional offline data warehouse stores business data centrally and applies fixed computing logic on a schedule to perform ETL, followed by modeling and report output. It mainly produces T+1 offline data: scheduled tasks pull incremental data every day, build subject and dimension data for each business line, and expose T+1 data query interfaces to consumers.
The figure above shows the difference between offline data warehouse ETL and a real-time data warehouse. The offline data warehouse performs poorly in terms of computation latency and data freshness. The value of data gradually decays over time, so data must reach users as soon as possible after it is generated; this is why the demand for real-time data warehouses emerged.
Building a real-time data warehouse is an essential part of "data-intelligent BI", and it is also an unavoidable challenge in large-scale data applications.
Flink has natural advantages in real-time data warehouse and real-time ETL:
- State management: a real-time data warehouse performs many aggregations, which need to access and manage state, and Flink provides powerful state management;
- Rich APIs: Flink provides multi-level APIs, including the DataStream API, the Table API and Flink SQL;
- Mature ecosystem: real-time data warehouses serve many purposes, and Flink supports a variety of storage systems (HDFS, ES, etc.);
- Unified batch and streaming: Flink is already unifying its APIs for stream and batch computing.
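As a rough illustration of real-time ETL with per-key state, the plain-Python sketch below processes records one at a time: it extracts and cleans raw lines, transforms them while keeping a running total per key, and loads each result into a sink immediately. It is not Flink code; the record fields (`city`, `amount`), the function names, and the list used as a "warehouse" are assumptions made for the example.

```python
import json
from collections import defaultdict


def extract(raw_lines):
    """Extract: parse each raw JSON line, dropping records that cannot be parsed."""
    for line in raw_lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # cleaning step: skip malformed input


def transform(records):
    """Transform: normalise the key and keep a running total per city."""
    totals = defaultdict(float)  # aggregation state, keyed by city
    for record in records:
        city = record["city"].strip().lower()
        totals[city] += record["amount"]
        yield {"city": city, "running_total": totals[city]}


def load(rows, sink):
    """Load: append each row to the sink as soon as it is produced."""
    for row in rows:
        sink.append(row)


# Hypothetical raw input, including one malformed line.
raw = [
    '{"city": " Beijing", "amount": 10}',
    "not json",
    '{"city": "beijing", "amount": 5}',
    '{"city": "Hangzhou", "amount": 7}',
]

warehouse = []  # stands in for a real storage system
load(transform(extract(raw)), warehouse)
print(warehouse)
```

Because the stages are generators, each record flows through all three steps before the next one is read, which is the record-at-a-time behaviour that distinguishes streaming ETL from a scheduled T+1 batch.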
Event-driven applications
Do you have needs like these?
- Our company has tens of thousands of servers. We want to extract the CPU, MEM and LOAD information from the messages the servers report, analyze it, and trigger alarms based on user-defined rules.
- I work in security operations at my company and want to identify crawlers from the daily access logs and restrict their IP addresses.
Event-driven applications are a class of stateful applications that ingest data from one or more event streams and trigger computations, state updates, or other external actions based on incoming events.
In a traditional architecture, the application has to read from and write to a remote transactional database such as MySQL. In an event-driven application, data and computation are not separated: the application only needs to access its data locally (in memory or on disk), which yields higher throughput and lower latency.
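The server-alarm scenario can be sketched as an event handler that keeps its state locally instead of querying a remote database on every event. The following plain-Python example is only an illustration under assumed rules: the class `CpuAlarm`, the 90% threshold, and the "three consecutive reports" rule are hypothetical and not taken from Flink or from the column.

```python
from collections import defaultdict


class CpuAlarm:
    """Raises an alarm when a host reports high CPU several times in a row."""

    def __init__(self, threshold=90.0, consecutive=3):
        self.threshold = threshold
        self.consecutive = consecutive
        self.high_count = defaultdict(int)  # local state: streak length per host

    def on_event(self, host, cpu):
        # Update the per-host streak based on the incoming event only.
        if cpu > self.threshold:
            self.high_count[host] += 1
        else:
            self.high_count[host] = 0
        # Fire exactly once, at the moment the streak reaches the limit.
        if self.high_count[host] == self.consecutive:
            return host
        return None


# Hypothetical metric events: (host, cpu_percent)
events = [
    ("web-1", 95.0),
    ("web-2", 40.0),
    ("web-1", 97.0),
    ("web-1", 99.0),
    ("web-2", 95.0),
]

alarm = CpuAlarm()
alerts = []
for host, cpu in events:
    triggered = alarm.on_event(host, cpu)
    if triggered is not None:
        alerts.append(triggered)

print(alerts)
```

Every decision here is made from the incoming event plus state held next to the computation, which is the property that gives event-driven applications their throughput and latency advantage.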
The following features of Flink make it a very good fit for event-driven applications:
- Efficient state management: Flink's state backend can store intermediate state;
- Rich window support: Flink supports tumbling windows, sliding windows and other window types;
- Multiple time semantics: Flink supports Event Time, Processing Time, and Ingestion Time;
- Multiple fault-tolerance levels: Flink supports At-Least-Once and Exactly-Once guarantees.
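To show what tumbling and sliding windows mean under event-time semantics, here is a small plain-Python sketch that assigns a timestamp to its windows. It is a conceptual illustration, not Flink's API: the function names are hypothetical, timestamps are assumed to be non-negative integer seconds, and windows are half-open intervals `[start, end)` aligned to zero.

```python
from collections import Counter


def tumbling_window(timestamp, size):
    """Return the single [start, end) window that contains the timestamp."""
    start = timestamp - timestamp % size
    return (start, start + size)


def sliding_windows(timestamp, size, slide):
    """Return every [start, end) window that contains the timestamp."""
    windows = []
    start = timestamp - timestamp % slide  # latest window start <= timestamp
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= slide
    return windows


# Hypothetical events as (event_time_seconds, payload).
# The event with time 3 arrives after the event with time 7.
events = [(1, "a"), (7, "b"), (3, "c"), (12, "d")]

# Count events per 5-second tumbling window, keyed by event time.
counts = Counter(tumbling_window(ts, 5) for ts, _ in events)

print(dict(counts))
print(sliding_windows(7, 10, 5))
```

Because the window is derived from the timestamp carried by the event rather than from the arrival time, the late event with time 3 is still counted in the window `(0, 5)`. With processing-time semantics it would instead fall into whichever window was open when it arrived.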
Summary
Apache Flink is designed from the ground up to support application development for many different scenarios.
Key features of Flink include unified batch and stream processing, Exactly-Once semantics, powerful state management, and more. Flink also supports multiple resource management frameworks, including YARN, Mesos, and Kubernetes. Alibaba has taken the lead in promoting Flink across the whole group, and its experience shows that Flink can scale to thousands of cores with state at the terabyte level while still maintaining high throughput and low latency.
As a result, Flink has become our first choice in the field of real-time computing.
In addition, Flink supports stateful operators, fault tolerance, checkpoints, exactly-once semantics and other advanced features to meet users' needs in different business scenarios.
That’s all for this lesson. In the next lesson, I will introduce “Flink introduction and SQL form implementation”. See you next time.
Copyright statement: The copyright of this article belongs to Lagou Education and the columnist. No media outlet, website or individual may reproduce, link to, repost or otherwise copy and publish it without authorization; violators will be held liable.