Summary of the Window
Streaming is a data processing engine designed to process infinite data sets, which is an ever-growing and essentially infinite data set, while Window is a means of cutting infinite data into finite blocks for processing.
A Window is at the heart of infinite buckets. It separates an infinite stream into finite buckets on which we can perform computing operations.
The Window type
Windows can be divided into two categories:
-
CountWindow: Generates a Window based on the specified number of data bars, irrespective of time.
-
TimeWindow: Generates Windows based on time.
Timewindows can be classified into three categories: Tumbling Window, Sliding Window, and Session Window.
-
Tumbling Windows
Slice the data according to a fixed window length.
Features: Time alignment, fixed window length, no overlap.
Application scenario: It is suitable for BI statistics (aggregation calculation in each period).
The scrollwindow allocator allocates each element to a window of a specified window size. Scrollwindows have a fixed size and do not overlap. For example, if you specify a scroll window size of 5 minutes, the window will be created as shown below:
-
Sliding Windows
A sliding window is a more generalized form of a fixed window, which consists of a fixed window length and a fixed sliding interval.
Features: Time alignment, fixed window length, can have overlap.
Application scenario: Statistics of the latest period (the failure rate of an interface within the last 5 minutes is required to determine whether to alarm).
The sliding window allocator allocates elements to a fixed-length window. Similar to a rolling window, the size of the window is configured by the window size parameter, and another window sliding parameter controls how often the sliding window starts. Therefore, a sliding window can overlap if the sliding parameter is smaller than the window size, in which case the element is allocated to multiple Windows.
For example, if you have a 10-minute window and a 5-minute slide, then the 5-minute window in each window contains the data generated in the last 10 minutes, as shown in the figure below:
3. Session Windows
It consists of a series of events combined with a timeout interval of a specified length of time, similar to the session of a Web application, that is, if no new data is received for a period of time, a new window is generated.
Features: Time unaligned.
Application scenario: Online user behavior analysis.
The Session window allocator uses session activity to group elements. Compared with scrolling and sliding Windows, the session window does not have overlapping and fixed start and end times. On the contrary, when it no longer receives elements within a fixed period of time, the inactivity interval is generated. So this window will close. A session window is configured with a session interval, which defines the length of an inactive period. When this period occurs, the current session closes and subsequent elements are allocated to a new session window.
Window API
TimeWindow
TimeWindow is a window in which all the data in a specified time range is composed and all the data in a window is calculated at a time.
-
Rolling window The default Time window of Flink is divided according to the Processing Time, and the data obtained by Flink is divided into different Windows according to the Time of entering Flink.
val minTempPerWindow = dataStream .map(r => (r.id, r.temperature)) .keyBy(_._1) .timeWindow(Time.seconds(15)) .reduce((r1, r2) => (r1._1, r1._2.min(r2._2))) Copy the code
The interval can be specified using one of time.milliseconds (x), time.seconds (x), time.minutes (x), and so on.
-
SlidingEventTimeWindows
The name of the function for the sliding window and the scrolling window is the same, except that you need to pass two parameters, one is window_size and the other is sliding_size.
In the code below, sliding_size is set to 5s, which means that the output is computed every 5s, each time for all elements in the window range of 15s.
val minTempPerWindow: DataStream[(String.Double)] = dataStream .map(r => (r.id, r.temperature)) .keyBy(_._1) .timeWindow(Time.seconds(15), Time.seconds(5)) .reduce((r1, r2) => (r1._1, r1._2.min(r2._2))) // .window(SlidingEventTimeWindows.of(Time.seconds(15),Time.sec onds(5)) Copy the code
The interval can be specified using one of time.milliseconds (x), time.seconds (x), time.minutes (x), and so on.
CountWindow
CountWindow triggers execution based on the number of the same key elements in the window, and only counts the result of keys whose number of elements reaches the window size.
Note :CountWindow window_size refers to the number of elements with the same Key, not the total number of all input elements.
-
The default CountWindow is a scroll window. You only need to specify the window size. When the number of elements reaches the window size, the execution of the window will be triggered.
val minTempPerWindow: DataStream[(String.Double)] = dataStream .map(r => (r.id, r.temperature)) .keyBy(_._1) .countWindow(5) .reduce((r1, r2) => (r1._1, r1._2.max(r2._2))) Copy the code
-
The sliding window
The name of the function for the sliding window and the scrolling window is the same, except that you need to pass two parameters, one is window_size and the other is sliding_size.
In the code below, sliding_size is set to 2, which means that each time two of the same keys are received, the window range is 10 elements each time.
val keyedStream: KeyedStream[(String.Int), Tuple] = dataStream.map(r => (r.id, r.temperature)).keyBy(0) // When the number of a key reaches 2, the calculation is triggered to calculate the contents of the last 10 elements of the key val windowedStream: WindowedStream[(String.Int), Tuple.GlobalWindow] = keyedStream.countWindow(10.2) val sumDstream: DataStream[(String.Int)] = windowedStream.sum(1) Copy the code
window function
Window functions define computations to be performed on data collected in a Window. They can be divided into two main types:
- Incremental aggregation functions compute each piece of data as it arrives, keeping it in a simple state. Typical incremental aggregation functions include ReduceFunction and AggregateFunction.
- Full window functions collect all the data of the window first, and then iterate over all the data when calculating. ProcessWindowFunction is a full-window function.
Other optional apis
-
.trigger() — The trigger defines when the window closes, triggers the calculation and outputs the result
-
Evitor () — remover
Define the logic to remove some data
-
AllowedLateness () – Allows late data to be processed
-
SideOutputLateData () – Puts late data into the side output stream
-
GetSideOutput () — Gets the side output stream