In SQL, there is a class of functions called aggregate functions, such as sum(), AVg (), Max (), etc. These functions can aggregate multiple rows of data into a single row according to the rule. Generally, the number of rows after the aggregation is less than the number of rows before the aggregation. But sometimes we want to display both the pre-aggregation and post-aggregation data, so we introduce window functions. Window functions are also called OLAP functions/analysis functions. Window functions have both grouping and sorting functions.

This article is divided into two parts: the first part is Hive window functions in detail, analyze a variety of window functions (almost all of the Hive window functions); The second part is the practical application of window functions. There are five examples in this part, which are all classic examples that are commonly used in work and must be asked in interviews.

This article was first published on the official account [Learning big Data in five minutes]

Hive Window function

The most important keywords for window functions are partition by and order by

XXX over (partition by XXX order by XXX)

Note: The partition by and order by in over() are optional. The partition by and order by in over() are optional.

I’ve divided the window functions into several categories, and we’ll go through them one by one.

SUM, AVG, MIN, MAX

To explain these window functions, let’s create a table to illustrate them with practical examples.

First create the user access page table user_pv

create table user_pv(
cookieid string,  Cookie of user login, i.e. user id
createtime string, The date -
pv int -- Page views
); 
Copy the code

Add the following data to the table above:

cookie1,2021- 05- 10.1
cookie1,2021- 05- 11.5
cookie1,2021- 05- 12.7
cookie1,2021- 05- 13.3
cookie1,2021- 05- 14.2
cookie1,2021- 05- 15.4
cookie1,2021- 05- 16.4
Copy the code

  • The SUM () is used

Run the following query statement:

select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime) as pv1 
from user_pv;
Copy the code

The result is as follows :(due to the command line, the field name and value in the following picture are not correct, please pay attention to distinguish!

Run the following query statement:

select cookieid,createtime,pv,
sum(pv) over(partition by cookieid ) as pv1 
from user_pv;
Copy the code

The results are as follows:

SQL > add order by to over(); SQL > add order by to over()

So beware:

  • Over () + order by = pv accumulation from the starting point to the current row, e.g., pv1 of no.11 = PV of no.10 + PV of No.11, no.12 = pv of no.10 + no.11 + no.12;

  • Over () without order by means that all values in the group are added.

AVG, MIN, MAX, AVG, MIN, MAX, AVG, MIN, MAX, AVG, MIN, MAX, AVG, MIN, MAX, AVG, MIN, MAX, AVG, MIN, MAX, AVG, MIN, MAX, AVG, MIN, MAX, AVG, MIN, MAX, AVG, MIN, MAX, AVG, MIN, MAX, AVG, MIN, MAX Not all tie values. Same thing for MIN and MAX.

2. ROW_NUMBER, RANK, DENSE_RANK, NTILE

Log in to user_pv with the same user as before, and replace the data in the log table with the following:

cookie1,2021- 05- 10.1
cookie1,2021- 05- 11.5
cookie1,2021- 05- 12.7
cookie1,2021- 05- 13.3
cookie1,2021- 05- 14.2
cookie1,2021- 05- 15.4
cookie1,2021- 05- 16.4
cookie2,2021- 05- 10.2
cookie2,2021- 05- 11.3
cookie2,2021- 05- 12.5
cookie2,2021- 05- 13.6
cookie2,2021- 05- 14.3
cookie2,2021- 05- 15.9
cookie2,2021- 05- 16.7
Copy the code

  • ROW_NUMBER () USES:

ROW_NUMBER() generates a sequence of records in a group, starting from 1.

SELECT 
cookieid,
createtime,
pv,
ROW_NUMBER(a)OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn 
FROM user_pv;
Copy the code

The results are as follows:


  • RANK and DENSE_RANK use:

RANK() generates the RANK of data items in a group. Equal ranks leave empty Spaces in the ranking.

DENSE_RANK() generates the rank of the data item in the group. Equal ranks will not leave empty places in the ranking.

SELECT 
cookieid,
createtime,
pv,
RANK(a)OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK(a)OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER(a)OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3 
FROM user_pv 
WHERE cookieid = 'cookie1';
Copy the code

The results are as follows:


  • NTILE usage:

Sometimes there is a need: if the data is sorted into three parts, and the business person only cares about one part, how do they get the middle third of the data out? The NTILE function will do that.

Ntiles can be thought of as dividing an ordered set of data evenly into a specified number of buckets, with bucket numbers assigned to each row. If the buckets cannot be evenly distributed, buckets with smaller numbers are allocated first, and the difference in the number of rows in each bucket is at most 1.

Then, you can select the first or last n parts of data based on the bucket number. The data is presented in its entirety, just the corresponding data is labeled; To get a specific number of data, you need to nest another layer according to the label.

SELECT 
cookieid,
createtime,
pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1,
NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2,
NTILE(4) OVER(ORDER BY createtime) AS rn3
FROM user_pv 
ORDER BY cookieid,createtime;
Copy the code

The results are as follows:

3. LAG, LEAD, FIRST_VALUE, LAST_VALUE

To illustrate these window functions, start by creating the user access page table: user_URL

CREATE TABLE user_url (
cookieid string,
createtime string,  -- Page access time
url string       -- Visited page
);
Copy the code

Add the following data to the table:

cookie1,2021- 06- 10 10:00:02,url2
cookie1,2021- 06- 10 10:00:00,url1
cookie1,2021- 06- 10 10:03:04.1url3
cookie1,2021- 06- 10 10:50:05,url6
cookie1,2021- 06- 10 11:00:00,url7
cookie1,2021- 06- 10 10:10:00,url4
cookie1,2021- 06- 10 10:50:01,url5
cookie2,2021- 06- 10 10:00:02,url22
cookie2,2021- 06- 10 10:00:00,url11
cookie2,2021- 06- 10 10:03:04.1url33
cookie2,2021- 06- 10 10:50:05,url66
cookie2,2021- 06- 10 11:00:00,url77
cookie2,2021- 06- 10 10:10:00,url44
cookie2,2021- 06- 10 10:50:01,url55
Copy the code

  • Use of LAG:

LAG(Col,n,DEFAULT) is the value in the NTH row up in the statistical window.

The first argument is the column name, the second argument is the NTH row up (optional, default is 1), and the third argument is the default (if the NTH row up is NULL, NULL if not specified).

SELECT cookieid,
createtime,
url,
ROW_NUMBER(a)OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1.'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time 
FROM user_url;
Copy the code

The results are as follows:

Explanation:

last_1_time: 1970-01-01 00:00:00 cookie1:1970-01-01 00:00:00 cookie1:1970-01-01 00:00:00 cookie1:1970-01-01 00:00:00 2021-06-10 10:00:02 cookie1 last_2_time: Specifies the up 2 the value of the line, to specify a default value cookie1 first line, the second line up 2 act NULL cookie1, 2 act NULL cookie1 up the fourth row, the second line up 2 behavior value, the 2021-06-10 10:00:02 cookie1 7 lines, Up 2 is the fifth row, 2021-06-10 10:50:01Copy the code

  • Use of LEAD:

In contrast to the LAG

LEAD(Col,n,DEFAULT) is used to count the values in the NTH row in the window.

The first argument is the column name, the second argument is the NTH row (optional, default is 1), and the third argument is the default (if the NTH row is NULL, NULL if not specified).

SELECT cookieid,
createtime,
url,
ROW_NUMBER(a)OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1.'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time 
FROM user_url;
Copy the code

The results are as follows:


  • Use FIRST_VALUE:

After taking the sort within the group, the end of the current row, the first value.

SELECT cookieid,
createtime,
url,
ROW_NUMBER(a)OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1 
FROM user_url;
Copy the code

The results are as follows:


  • Use of LAST_VALUE:

After taking the group sort, up to the current row, the last value.

SELECT cookieid,
createtime,
url,
ROW_NUMBER(a)OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1 
FROM user_url;
Copy the code

The results are as follows:

If you want to select the last value in a group, you need to work around it:

SELECT cookieid,
createtime,
url,
ROW_NUMBER(a)OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2 
FROM user_url 
ORDER BY cookieid,createtime;
Copy the code

FIRST_VALUE = FIRST_VALUE; FIRST_VALUE = FIRST_VALUE;

The results are as follows:

Pay special attention to order by here

If you do not specify ORDER BY, the sorting is chaotic and the result is wrong

SELECT cookieid,
createtime,
url,
FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2  
FROM user_url;
Copy the code

The results are as follows:

The createtime of urL2 and URL55 above is neither the first nor the last, so the result is chaotic.

4. CUME_DIST

Create an employee salary table: staff_salary

CREATE EXTERNAL TABLE staff_salary (
dept string,
userid string,
sal int
);
Copy the code

Add the following data to the table:

D1,user1,1000 D1,user2,2000 D1,user3,3000 D2,user4,4000 D2,user5,5000Copy the code

  • Use of CUME_DIST:

The result of this function depends on the sort order of order by.

CUME_DIST: the number of rows less than or equal to the current value/the total number of rows in a group. Order Default order: positive

For example, the percentage of the population that has less than or equal to their current salary.

SELECT 
dept,
userid,
sal,
CUME_DIST(a)OVER(ORDER BY sal) AS rn1,
CUME_DIST(a)OVER(PARTITION BY dept ORDER BY sal) AS rn2 
FROM staff_salary;
Copy the code

The results are as follows:

Explanation:

Rn1: there is no partition, all data is 1 group, total number of rows is 5, first row: number of rows less than or equal to 1000 is 1, therefore, 1/5=0.2, third row: number of rows less than or equal to 3000 is 3, therefore, 3/5=0.6 rn2: Grouped by department, the number of lines dPET =d1 is 3, and the number of lines less than or equal to 2000 is 2, so 2/3=0.666666666666Copy the code

GROUPING SETS, GROUPING__ID, CUBE, ROLLUP

These analysis functions are commonly used in OLAP, do not add up, and need to be counted according to the parameters of drilling and drilling in different dimensions, such as the number of UV hours, days, and months.

Again, create a user access table: user_date

CREATE TABLE user_date (
month STRING,
day STRING, 
cookieid STRING 
);
Copy the code

Add the following data to the table:

1, 2021-03202-03-10, cookie1 1, 2021-03202-03-10, cookie5 12, 2021-03202-1-03 - cookie7 12, 2021-04202-1-04 - cookie3 1, 2021-04202-04-13, cookie2 13, 2021-04202-04-1 cookie4 1, 2021-04202-04-16, cookie4 1, 2021-03202-03-10, cookie2 1, 2021-03202-03-10, cookie3 1-4, 2021-04202-12, cookie5 13, 2021-04202-04-1 cookie6 15, 2021-04202-04-1 cookie3 15, 2021-04202-04-1 cookie2 1, 2021-04202-04-16, cookie1Copy the code

  • Using GROUPING SETS:

Grouping sets is a convenient way of grouping multiple group BY logic into a SQL statement.

It is equivalent to UNION ALL of the GROUP BY result sets of different dimensions.

SELECT 
month.day.COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID 
FROM user_date 
GROUP BY month.day 
GROUPING SETS (month.day) 
ORDER BY GROUPING__ID;
Copy the code

GROUPING__ID = “month”, “day”, 1 = “month”, 2 = “day”. GROUPING__ID = “month”, 2 = “day”.

The results are as follows:

SQL is equivalent to:

SELECT month.NULL as day.COUNT(DISTINCT cookieid) AS uv,
1 AS GROUPING__ID 
FROM user_date 
GROUP BY month 

UNION ALL 

SELECT NULL as month.day.COUNT(DISTINCT cookieid) AS uv,
2 AS GROUPING__ID 
FROM user_date 
GROUP BY day;
Copy the code

  • Use of CUBE:

Aggregate based on all combinations of the dimensions of GROUP BY.

SELECT 
month.day.COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID 
FROM user_date 
GROUP BY month.day 
WITH CUBE 
ORDER BY GROUPING__ID;
Copy the code

The results are as follows:

SQL is equivalent to:

SELECT NULL.NULL.COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM user_date

UNION ALL 

SELECT month.NULL.COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM user_date GROUP BY month 

UNION ALL 

SELECT NULL.day.COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM user_date GROUP BY day

UNION ALL 

SELECT month.day.COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM user_date GROUP BY month.day;
Copy the code

  • ROLLUP usage:

Is a subset of CUBE, dominated by the leftmost dimension, from which hierarchical aggregation is performed.

For example, hierarchical aggregation with the month dimension:

SELECT 
month.day.COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID  
FROM user_date 
GROUP BY month.day
WITH ROLLUP 
ORDER BY GROUPING__ID;
Copy the code

The results are as follows:

By switching month and day, the day dimension is aggregated hierarchically:

SELECT 
day.month.COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID  
FROM user_date 
GROUP BY day.month 
WITH ROLLUP 
ORDER BY GROUPING__ID;
Copy the code

The results are as follows:

In this case, aggregating by day and month is the same as aggregating by day, because you have a parent-child relationship, and it would be different if you had some other combination of dimensions.

Window function practical application

1. Second highest salary

The difficulty is simple.

Write an SQL query to get the second highest Salary in the Employee table.

+----+--------+
| Id | Salary |
+----+--------+
| 1  | 100    |
| 2  | 200    |
| 3  | 300    |
+----+--------+
Copy the code

For example, the Employee table above, the SQL query should return 200 as the second highest salary. If no second highest salary exists, the query should return NULL.

+---------------------+
| SecondHighestSalary |
+---------------------+
| 200                 |
+---------------------+
Copy the code

This problem can be solved using the row_number function.

Reference code:

SELECT
  *
  FROM(
    SELECT Salary, row_number(a)over(order by Salary desc) rk 
    FROM Employee
  ) t WHERE t.rk = 2;
Copy the code

Simpler code:

SELECT DISTINCT Salary
FROM Employee
ORDER BY Salary DESC
LIMIT 1 OFFSET 1
Copy the code

OFFSET: indicates the OFFSET from which data is to be selected. 0 indicates the first data.

2. Score ranking

The difficulty is simple.

Write an SQL query to implement the score ranking.

If two scores are the same, they Rank the same. Note that the next place after the split should be the next consecutive integer value. In other words, there should be no “gaps” between rankings.

+----+-------+
| Id | Score |
+----+-------+
| 1  | 3.50  |
| 2  | 3.65  |
| 3  | 4.00  |
| 4  | 3.85  |
| 5  | 4.00  |
| 6  | 3.65  |
+----+-------+
Copy the code

For example, based on the given Scores table above, your query should return (in order of score from highest to lowest) :

+-------+------+
| Score | Rank |
+-------+------+
| 4.00  | 1    |
| 4.00  | 1    |
| 3.85  | 2    |
| 3.65  | 3    |
| 3.65  | 3    |
| 3.50  | 4    |
+-------+------+
Copy the code

Reference code:

SELECT Score,
dense_rank(a)over(order by Score desc) as `Rank`
FROM Scores;
Copy the code

3. Sequential numbers

The difficulty is medium.

Write an SQL query to find all numbers that appear at least three times in a row.

+----+-----+
| Id | Num |
+----+-----+
| 1  |  1  |
| 2  |  1  |
| 3  |  1  |
| 4  |  2  |
| 5  |  1  |
| 6  |  2  |
| 7  |  2  |
+----+-----+
Copy the code

For example, given the Logs table above, 1 is the only number that occurs at least three times in a row.

+-----------------+
| ConsecutiveNums |
+-----------------+
| 1               |
+-----------------+
Copy the code

Reference code:

SELECT DISTINCT `Num` as ConsecutiveNums
FROM
  (
    SELECT Num,
    lead(Num, 1.null) over(order by id) n2,
    lead(Num, 2.null) over(order by id) n3 
    FROM Logs
  ) t1
WHERE Num = n2 and Num = n3
Copy the code

4. Login for consecutive N days

The difficulty is difficult.

Write an SQL query to find the IDS and names of active users, defined as those who have logged into their accounts for at least five consecutive days, and return a table of results sorted by ID.

List the Accounts:

+----+-----------+
| id | name      |
+----+-----------+
| 1  | Winston   |
| 7  | Jonathan  |
+----+-----------+
Copy the code

Table Logins:

+----+-------------+
| id | login_date  |
+----+-------------+
| 7  | 2020- 05- 30  |
| 1  | 2020- 05- 30  |
| 7  | 2020- 05- 31  |
| 7  | 2020- 06- 01  |
| 7  | 2020- 06.  |
| 7  | 2020- 06.  |
| 7  | 2020- 06- 03  |
| 1  | 2020- 0607 -  |
| 7  | 2020- 06- 10  |
+----+-------------+
Copy the code

For example, given the Accounts and Logins tables above, a user with ID =7 has logged in to the account for at least 5 consecutive days

+----+-----------+
| id | name      |
+----+-----------+
| 7  | Jonathan  |
+----+-----------+
Copy the code

Ideas:

  1. Deweight: Since everyone may log in more than once a day, deweight is required
  2. Sort: Sort the login date for each ID
  3. Difference: Calculates the difference between login date and sort to find consecutive login records
  4. Select count(*) group by ID, count(*)
  5. Retrieve records logged in for more than 5 days
  6. Obtain the user name corresponding to the ID through table merge

Reference code:

SELECT DISTINCT b.id, name
FROM
  (SELECT id, login_date,
    DATE_SUB(login_date, ROW_NUMBER(a)OVER(PARTITION BY id ORDER BY login_date)) AS diff 
   FROM(SELECT DISTINCT id, login_date FROM Logins) a) b
INNER JOIN Accounts ac
ON b.id = ac.id
GROUP BY b.id, diff
HAVING COUNT(b.id) > = 5
Copy the code

Note:

  1. DATE_SUB (DATE, X); DATE_SUB (DATE, X);
  2. How to find consecutive dates: by the difference between the sort and the login date, because the sort is continuous, so if the login date is continuous, the difference is the same;
  3. GROUP BY and HAVING: Use COUNT to find ids with consecutive days greater than 5 BY GROUP BY id and difference. Note that COUNT does not have to appear after SELECT, but can be used in HAVING

5. Median frequency query for a given number

The difficulty is difficult.

The Numbers table holds the values of Numbers and their frequencies.

+----------+-------------+
|  Number  |  Frequency  |
+----------+-------------|
|  0       |  7          |
|  1       |  1          |
|  2       |  3          |
|  3       |  1          |
+----------+-------------+
Copy the code

In this table, the numbers are 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 3, so the median is (0 + 0) / 2 = 0.

+--------+
| median |
+--------|
| 0.0000 |
+--------+
Copy the code

Write a query to find the median of all numbers and name the result median.

Reference code:

select
avg(cast(number as float)) as median
from
  (
    select Number,
    Frequency,
    sum(Frequency) over(order by Number) - Frequency as prev_sum,
    sum(Frequency) over(order by Number) as curr_sum 
    from Numbers
  ) t1, (
    select sum(Frequency) as total_sum 
    from Numbers
  ) t2
where
t1.prev_sum < = (cast(t2.total_sum as float) / 2) 
and
t1.curr_sum > = (cast(t2.total_sum as float) / 2)
Copy the code

A full set of big data development interview question bank tutorials, including big data knowledge review, resume writing guidance