Hive Window Functions (35)

Window function initialization

In SQL, there is a class of functions called aggregate functions, such as sum(), avg(), Max (), etc. These functions can aggregate multiple rows of data into a single row according to the rule. Generally, the number of rows after the aggregation is less than the number of rows before the aggregation. But sometimes we want to display both before and after the data is aggregated, so we introduce window functions. Window functions are the last function executed in the SQL statement and are only before the Order BY clause **, so we can think of the SQL result set as input data **

Actually the description above, still can’t let our good understanding what window function, but we know it can make the detail and aggregation results coexist, actually this a bit and a bit like the strakes, learn when we talk about the strakes, elevation is a record of input, then multiple data output, the scale of data and the original number n times, Matches multiple outputs

The window function is to perform summary calculation on specific data, and then output the original detailed data and its corresponding summary information in a line

Learn about window functions from examples

Let’s take a look at some of the most intuitive examples, and then dive into the cold grammar rules, principles and all that

To prepare data

Build table statements

create table ods_t_window(
	name string,
	orderdate string,
	cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ', ';
LOAD DATA LOCAL INPATH '/Users/liuwenqiang/workspace/hive/userorder.txt' OVERWRITE INTO TABLE ods_t_window;
Copy the code

data

Jack 2015-01-01,10 Tony 2015-01-02,15 Jack,2015-02-03,23 Tony 2015-01-04,29 Jack,2015-01-05,46 Jack,2015-04-06,42 Tony 2015-01-07,50 Jack 2015-01-08,55 mart,2015-04-08,62 mart,2015-04-09,68 Neil 2015-05-10,12 mart,2015-04-11,75 Neil, 2015-06-12, 80 mart, 2015-04-13, 94Copy the code

Total number of customers who purchased in April 2015

In fact, from the above definition of requirements, we find that it also needs customers who have purchased – detailed information, total number of people – summary information

select 
	name,count(*) over(a)from 
	ods_t_window
where 
	substring(orderdate,1.7) = 'the 2015-04'
;
Copy the code

According to the query result, there were 5 purchase records in April 2015,mart bought 4 times and Jack bought 1 time. In fact, most of the time, we only look at the results. There are two ways to do this

The first implementation is to use a distinct implementation directly

select 
	distinct name,count(*) over(a)from 
	ods_t_window
where
	substring(orderdate,1.7) = 'the 2015-04'
;
Copy the code

The second implementation is to use group by, which has better performance

select 
	name,count(*)
from 
	ods_t_window
where
	substring(orderdate,1.7) = 'the 2015-04'
group by
	name
;
Copy the code

After the heavy we can see the details after the heavy

Customer purchase details and total monthly purchase amount

select
    name,orderdate,cost,sum(cost) over(partition by month(orderdate))
from
    ods_t_window
;
Copy the code

Actually this time we can understand a little bit about the window this thing, first the calculation of window function, here is the sum, we know that the sum is an aggregate function, normally we need to do to a set of data calculation, tend to be a complete table or a complete partition, but how to have a over window function, The data set calculated by ods_T_window is the data in a window, so the window function divides the entire table, that is, ods_t_window, into many Windows according to partition by definition. All the data are grouped according to the boundary value, while the window calculation function before over is carried out within each group. If the grouping is exceeded, the function is recalculated.

The sum_WINDOW_0 field is the same as the sum_WINDOW_0 field because the January data is in the January window, and the same is true for February

Customer purchase details and accumulated monthly purchase amount

Before we calculated the total monthly purchase amount, we saw that in a window, for example, the total amount of January is calculated at one time, so the sum_WINDOW_0 field of the data of January is the same, that is 205

In addition, the purchase details in January are also out of order, not in ascending or descending order, because the data in January are all in the same window, so sometimes we want them to be in order in the window

select
    name,orderdate,cost,sum(cost) over(partition by month(orderdate) order by orderdate)
from
    ods_t_window
;
Copy the code

This shows that the data is ordered within the window (if the window itself is ordered, it is globally ordered), and that the sum_WINDOW_0 field is cumulative, for example, the sum value of 25 on line 2 is 10 on line 1 + 15 on line 2, and you can do the same thing

Customer’s purchase details and the total amount of the customer’s past 2 purchases

In fact, we can make a simple analysis of this demand first. Before, we calculated the total amount or the accumulated amount of the month. The object of calculation was the month, and all the Windows were opened for time. A window size is a partition by month(OrderDate). For the current requirement, we can find that our statistical information is targeted at customers, that is to say, our calculation object is customers, that is to say, we need to open the window to customers

We identified above the window object, then we analyse the past 3 times what to do, we studied the accumulation in front of the calculation, using the real order by statement to complete, it is the calculation of the current line in the window of all data (including the current line) accumulation, also is the window of the first data to the current line (including the current line)

+-------+-------------+-------+---------------+ | name | orderdate | cost | sum_window_0 | + -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + | jack | | | | 10 10 2015-01-01 current line the first line (10) summary values (10) | Tony | 2015-01-02 15 25 | | | current line the first line (10) + (15) summary values (25) | Tony 29 | | 2015-01-04 | | 54 current line the first line (10) + (15) + the third line (29) summary values (54) | jack  | 2015-01-05 | 46 | 100 | | tony | 2015-01-07 | 50 | 150 | | jack | 2015-01-08 | 55 | 205 | | jack | 2015-02-03 | 23 | 23 | start from here is a new window | jack 42 | | 2015-04-06 | | 42 | mart | 2015-04-08 | 62 | 104 | | mart | 2015-04-09 | 68 | 172 | | mart | 2015-04-11 | 75 | 247 | | mart | 2015-04-13 | 94 | 341 | | neil | 2015-05-10 | 12 | 12 | | neil | 2015-06-12 | 80 80 | | + -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- +Copy the code

Order by: patition by: patition by: patition by: patition by Order by is also part of a window function. A window can define which data we evaluate in the window. By default, we evaluate all data in the window

select
    name,orderdate,cost,
    sum(cost) over(partition by name order by orderdate rows between 2 PRECEDING and current row)
from
    ods_t_window
;
Copy the code

Between 2 PRECEDING and current row is the definition of the first two and current rows of the current row

+-------+-------------+-------+---------------+ | name | orderdate | cost | sum_window_0 | + -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + | jack | | | | 10 10 2015-01-01 the first line (10) summary values (10) no two rows in front of the | jack | 46 56 | | 2015-01-05 | the first line the (10) (46) summary values (56) only the front line | jack | 2015-01-08 | | 111 | 55 first line (10) the second line (46) the third line (55) summary values (111). Have two line | jack in front of the 23 | 124 | | 2015-02-03 | the second line (46) the third line (55) in the fourth row (23) summary values (124) with the front two lines | jack | 2015-04-06 | | 120 | 42 the third line (55) The fourth row (23), the fifth line (42) summary values (120) with the front two lines | mart | 2015-04-08 | 62 | 62 | | mart | 2015-04-09 | 68 | 130 | | mart | 75 | 2015-04-11  | 205 | | mart | 2015-04-13 | 94 | 237 | | neil | 2015-05-10 | 12 | 12 | | neil | 2015-06-12 | 80 | 92 | | tony | 2015-01-02 | 15 | 15 | | tony | 2015-01-04 | 29 | 44 | | tony | 2015-01-07 | 50 | 94 | +-------+-------------+-------+---------------+Copy the code

If the window clause is not specified, all data of the window will be calculated by default. If the order by clause is specified, all data of the window will be calculated. If the sequence between UNBOUNDED PRECEDING and current row is in order by, the default window clause is “rows between UNBOUNDED PRECEDING and current row”. Between the UNBOUNDED foregoing and the PRECEDING n lines of data

Now let’s take a look at all the data evaluated with order by specified

select
    name,orderdate,cost,
    sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING)
from
    ods_t_window
;
Copy the code

Can we manually specify the cumulative calculation method? That is, manually specify the default window clause of order BY, such as the purchase details of the second requirement customer and the accumulated monthly purchase amount

select
    name,orderdate,cost,sum(cost) over(partition by month(orderdate) order by orderdate)
from
    ods_t_window
;
Copy the code

Now let’s try to specify it manually

select
    name,orderdate,cost,sum(cost) over(partition by month(orderdate) order by orderdate rows between UNBOUNDED PRECEDING and current row)
from
    ods_t_window
;
Copy the code

Definition of window functions

Through the study of the previous example, we have the window function with the preliminary cognition, and also introduces the window function can make summary and detail information together on display, it can be convenient we do a lot of things, one thing need pay attention to and the difference between is window function objects and our regular summary functions of calculation of the object is not the same, A window function evaluates the data in a window. A regular function evaluates the entire table or partition

So we can think of the window function is composed of two parts, the first part is the calculation function, the second part is the definition of the window

Let’s take a look at the complete definition of window functions

The definition of a window function is shown in the screenshot above. Let’s take a look at these definitions one by one

Window computes function

The window calculation function determines what kind of calculation we perform on the data in the window. Can it be sum, count, min, Max, AVG, etc.

The window calculation function is used to calculate the data in the window, so we will introduce how to calculate the data in the window – what data belongs to a window

The window flag (defines)over

We can think of over as a flag for a window function. It marks us that the function is a window function and not a normal function, so the parameters related to the window definition are passed to over as arguments

In fact, it is more accurate to define the window definition, rather than the tag, which defines the window data partition method — partition by, the window data calculation method — order by, the window calculation boundary rows between

Window defines the boundary patition by

In front of us we introduced the window calculation function is to calculate the data in a window, that we are ze Yang to define a window? Or what kind of data do we think belongs to a window, that’s patition by that defines what data belongs to a window, if there’s no patition by statement, then the whole data source is in a window, right

Over (partition by month(orderDate) order by orderDate) defines data equal to month(orderDate) as belonging to a partition. Sum (cost) over(partition by name) defines that data with the same name belong to a partition

If we define the partition of Windows by patition, then our window evaluation function can be applied to each window, and actually here we see the window function that we’ve been talking about, it defines the window evaluation function, and then it applies to the list of Windows that it defines

Window calculation method order by

Order by looks like a sort, but if we just calculate the sum of the data in a window, it doesn’t matter whether the data in the window is in order or not, then why sort it? Actually, there is one kind of sum that does matter, and that is the cumulative sum of the data, So once the order by appears in our window definition, our calculation becomes cumulative. Whether we accumulate sum or count depends on whether our window calculation function is sum or count

Order BY indicates the use of cumulative computation, that is, the behavior of changing all the data in the window defined by our default calculation patition BY, so it introduces another calculation boundary based on the data in the window. The original boundary is all the data in the window, that is, the boundary of the window. Order by: order by: order by: order by:

The window calculates the boundary rows between

The data scope defined by “Rows Between” is also called the window clause

Order by affects the calculation boundary of the default window calculation function, and it has its own calculation boundary. In fact, its calculation boundary is limited by “Rows Between”. The default “rows between Rows between UNBOUNDED PRECEDING and current row” is provided, so we will not write the window clause

PRECEDING: go
Back the FOLLOWING:
CURRENT ROW: indicates the CURRENT ROW
To those who are UNBOUNDED.
UNBOUNDED PRECEDING represents the PRECEDING boundary
- N PRECEDING(how many PRECEDING lines, for the current row)
UNBOUNDED FOLLOWING: means to the back of the boundary
- N FOLLOWING(number of rows, for current row)

Here’s an example

select 
  name,orderdate,cost,
  sum(cost) over(a)as sample1,Add all the rows
  sum(cost) over(partition by name) as sample2,Group by name and add data within the group
  sum(cost) over(partition by name order by orderdate) as sample3,-- Group by name and add data within the group
  sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row )  as sample4 ,-- Same as sample3, aggregation from the start to the current row
  sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING   and current row) as sample5, -- Aggregate the current line and the previous line
  sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING   AND 1 FOLLOWING  ) as sample6,-- The current line and the preceding and following lines
  sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 -- The current line and all subsequent lines
from 
	ods_t_window;
Copy the code

conclusion

Window functions In addition to our common statistics function +over, we can call it a general window function, and Hive also combined with application scenarios to provide us with window functions for specific scenarios, such as row_number, we call it a special window function

Application scenarios: Used for partition sorting dynamic Group By, Top N, cumulative calculation, and hierarchical query. In Hive, row_number(),rank, and dense_rank() are used for Top N. We will describe all window functions separately later

In the window function, some clauses in the definition of the window can be omitted. If only the partition by clause is used and no order by is specified, our aggregation is within a group (within a specific window). If the partition by clause is not used, the default value of the window clause is to accumulate data from the starting point to the current row. If the window clause is not used, the default value of the window clause is to accumulate data from the starting point to the current row. If the partition by clause is not used, the default value of the window clause is to accumulate data from the current row. The entire input data set is computed in a single window