Hive Window Functions / Analytic Functions
In SQL there is a class of functions called aggregate functions, such as SUM(), AVG(), and MAX(). They aggregate multiple rows into one row, so the number of rows after aggregation is usually smaller than before. But sometimes we want to display both the pre-aggregation detail rows and the aggregated result at the same time, which is why window functions were introduced. Window functions, also called OLAP functions or analytic functions, combine grouping and sorting.
The most important keywords for window functions are Partition By and Order By.
OVER(PARTITION BY xxx ORDER BY xxx)
The SUM, AVG, MIN, MAX functions
Preparing the data
DDL and load:

create table bigdata_t1(
  cookieid string,
  createtime string,   -- day
  pv int
) row format delimited fields terminated by ',';

load data local inpath '/root/hivedata/bigdata_t1.dat' into table bigdata_t1;

Contents of bigdata_t1.dat:

cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4

Enable Hive's local mode:

set hive.exec.mode.local.auto=true;
Using SUM in conjunction with the OVER clause: the result depends on ORDER BY, and the sort is ascending by default.
--pv1
select cookieid, createtime, pv,
       sum(pv) over(partition by cookieid order by createtime) as pv1
from bigdata_t1;

--pv2
select cookieid, createtime, pv,
       sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2
from bigdata_t1;

--pv3
select cookieid, createtime, pv,
       sum(pv) over(partition by cookieid) as pv3
from bigdata_t1;

--pv4
select cookieid, createtime, pv,
       sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as pv4
from bigdata_t1;

--pv5
select cookieid, createtime, pv,
       sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5
from bigdata_t1;

--pv6
select cookieid, createtime, pv,
       sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as pv6
from bigdata_t1;

pv1: running total within the group, from the first row to the current row; e.g. pv1 on 04-11 = pv of 04-10 + pv of 04-11, and pv1 on 04-12 = pv of 04-10 + 04-11 + 04-12.
pv2: same as pv1 (UNBOUNDED PRECEDING to CURRENT ROW is the default frame).
pv3: total of all pv values in the group (cookie1).
pv4: current row plus the 3 preceding rows; e.g. 04-11 = 04-10 + 04-11, 04-12 = 04-10 + 04-11 + 04-12, 04-13 = 04-10 + 04-11 + 04-12 + 04-13, 04-14 = 04-11 + 04-12 + 04-13 + 04-14.
pv5: the 3 preceding rows + the current row + 1 following row; e.g. 04-14 = 04-11 + 04-12 + 04-13 + 04-14 + 04-15 = 5+7+3+2+4 = 21.
pv6: current row to the last row of the group; e.g. 04-13 = 3+2+4+4 = 13, and 04-14 = 2+4+4 = 10.
If ROWS BETWEEN is not specified, the default is from the starting point to the current row.
If you do not specify ORDER BY, all values in the group are added up;
The key is to understand what ROWS BETWEEN means, also known as the WINDOW clause:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: no bound (to the start or end of the partition)
UNBOUNDED PRECEDING: from the start of the partition
UNBOUNDED FOLLOWING: to the end of the partition
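As an extra illustration (a sketch, not one of the original queries), a centered 3-row moving sum over bigdata_t1 combines PRECEDING and FOLLOWING in one frame:

```sql
-- Moving sum over the previous row, the current row, and the next row
select cookieid, createtime, pv,
       sum(pv) over(partition by cookieid order by createtime
                    rows between 1 preceding and 1 following) as pv_moving
from bigdata_t1;
-- For the sample data, the 2018-04-11 row sums 04-10, 04-11 and 04-12: 1 + 5 + 7 = 13
```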
AVG, MIN, and MAX are used in the same way as SUM.
The ROW_NUMBER, RANK, DENSE_RANK, NTILE functions
Preparing the data
Contents of bigdata_t2.dat:

cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4
cookie2,2018-04-10,2
cookie2,2018-04-11,3
cookie2,2018-04-12,5
cookie2,2018-04-13,6
cookie2,2018-04-14,3
cookie2,2018-04-15,9
cookie2,2018-04-16,7

CREATE TABLE bigdata_t2 (
  cookieid string,
  createtime string,   -- day
  pv INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as textfile;

load data local inpath '/root/hivedata/bigdata_t2.dat' into table bigdata_t2;
- ROW_NUMBER
ROW_NUMBER() generates a consecutive sequence number for each record within the group, starting at 1, in the order specified.
SELECT
cookieid,
createtime,
pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn
FROM bigdata_t2;
- RANK and DENSE_RANK
RANK() generates the rank of each data item within the group; equal values receive the same rank and leave a gap in the ranks that follow.
DENSE_RANK() generates the rank of each data item within the group; equal values receive the same rank but leave no gap.
SELECT
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
FROM bigdata_t2
WHERE cookieid = 'cookie1';
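The difference between the three can be worked out by hand from the sample data above (cookie1's pv values are 1, 5, 7, 3, 2, 4, 4). This sketch shows the expected pattern in comments; note the tie between the two rows with pv = 4 makes ROW_NUMBER's order between them arbitrary:

```sql
-- Expected pattern for cookie1, ordered by pv desc:
-- pv   rn1 (RANK)   rn2 (DENSE_RANK)   rn3 (ROW_NUMBER)
-- 7    1            1                  1
-- 5    2            2                  2
-- 4    3            3                  3
-- 4    3            3                  4   <- tie: same rank, gap follows for RANK only
-- 3    5            4                  5
-- 2    6            5                  6
-- 1    7            6                  7
```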
- NTILE
Sometimes there is a requirement like this: the data is sorted and split into three parts, and the business only cares about one of them; how do you extract the middle third? The NTILE function does exactly this.
NTILE splits ordered data evenly into a specified number of buckets (num) and assigns the bucket number to each row. If the split cannot be even, lower-numbered buckets are filled first, and the number of rows in each bucket differs by at most one.
Based on the bucket number, you can then select the first or last n-th of the data. The query returns every row, merely tagged with its bucket number; to extract only a specific fraction, you need to nest another query that filters on the tag.
SELECT
cookieid,
createtime,
pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1,
NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2,
NTILE(4) OVER(ORDER BY createtime) AS rn3
FROM bigdata_t2
ORDER BY cookieid,createtime;
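To actually pull out just the middle third mentioned above, one way (a sketch) is to nest the NTILE query and filter on the bucket number in an outer query:

```sql
-- Keep only the middle bucket of 3 within each cookieid
select cookieid, createtime, pv
from (
  select cookieid, createtime, pv,
         ntile(3) over(partition by cookieid order by createtime) as bucket
  from bigdata_t2
) t
where bucket = 2;
```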
Some other window functions
The LAG, LEAD, FIRST_VALUE, LAST_VALUE functions
- LAG(col, n, DEFAULT): returns the value of col from the nth row above the current row within the window. The first parameter is the column name, the second is n (optional, default 1), and the third is the default value (used when the nth preceding row does not exist; NULL if not specified).
SELECT cookieid,
       createtime,
       url,
       ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
       LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
       LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM bigdata_t4;

last_1_time: the first row of each group has no preceding row, so it takes the specified default '1970-01-01 00:00:00'; every other row takes the createtime of the row directly above it (e.g. '2015-04-10 10:00:02' for cookie1).
last_2_time: the lag is 2 and no default is specified, so the first two rows of each group are NULL; the remaining rows take the createtime from two rows above (e.g. '2015-04-10 10:50:01' for cookie1).
- LEAD
LEAD(col, n, DEFAULT): returns the value of col from the nth row below the current row within the window. The first parameter is the column name, the second is n (optional, default 1), and the third is the default value (used when the nth following row does not exist; NULL if not specified).
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
FROM bigdata_t4;
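As a usage sketch (not in the original), LEAD is handy for computing the gap to the next event. This assumes createtime in bigdata_t4 is a 'yyyy-MM-dd HH:mm:ss' string:

```sql
-- Seconds until the next event for the same cookieid (NULL for the last row in each group)
select cookieid,
       createtime,
       lead(createtime,1) over(partition by cookieid order by createtime) as next_time,
       unix_timestamp(lead(createtime,1) over(partition by cookieid order by createtime), 'yyyy-MM-dd HH:mm:ss')
         - unix_timestamp(createtime, 'yyyy-MM-dd HH:mm:ss') as seconds_to_next
from bigdata_t4;
```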
- FIRST_VALUE
FIRST_VALUE: after sorting within the group, the first value from the start of the group up to the current row.
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM bigdata_t4;
- LAST_VALUE
LAST_VALUE: after sorting within the group, the last value from the start of the group up to the current row (with the default frame, this is simply the current row's own value).
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM bigdata_t4;
To get the last value of the whole group, you need a workaround: sort in descending order and take FIRST_VALUE:
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2
FROM bigdata_t4
ORDER BY cookieid,createtime;
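An alternative sketch: keep LAST_VALUE but widen the frame to the whole partition, so it really returns the group's last value rather than the current row's:

```sql
-- Full-partition frame makes LAST_VALUE return the true last url per cookieid
select cookieid, createtime, url,
       last_value(url) over(partition by cookieid order by createtime
                            rows between unbounded preceding and unbounded following) as last3
from bigdata_t4;
```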
Pay special attention to ORDER BY:
if it is not specified, the rows within each partition are in no defined order, and the result can be wrong.
SELECT cookieid,
createtime,
url,
FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2
FROM bigdata_t4;
The CUME_DIST, PERCENT_RANK functions
These two sequence analysis functions are not commonly used. Note that sequence functions do not support the WINDOW clause
- Data preparation
Contents of bigdata_t3.dat:

d1,user1,1000
d1,user2,2000
d1,user3,3000
d2,user4,4000
d2,user5,5000

CREATE EXTERNAL TABLE bigdata_t3 (
  dept STRING,
  userid STRING,
  sal INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as textfile;

load data local inpath '/root/hivedata/bigdata_t3.dat' into table bigdata_t3;
- CUME_DIST: the result depends on the ORDER BY sort order.
CUME_DIST returns the proportion of rows whose value is less than or equal to the current row's value, e.g. the fraction of people whose salary is less than or equal to the current salary.
SELECT dept, userid, sal,
       CUME_DIST() OVER(ORDER BY sal) AS rn1,
       CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM bigdata_t3;

rn1: no partition, so the total row count is 5. Row 1: 1 row has sal <= 1000, so 1/5 = 0.2. Row 3: 3 rows have sal <= 3000, so 3/5 = 0.6.
rn2: grouped by department; dept = d1 has 3 rows. Row 2: 2 rows have sal <= 2000, so 2/3 ≈ 0.6667.
- PERCENT_RANK
PERCENT_RANK: (the current row's rank within the group - 1) / (the total number of rows in the group - 1).
SELECT dept, userid, sal,
       PERCENT_RANK() OVER(ORDER BY sal) AS rn1,    -- rn1 = (rn11-1) / (rn12-1)
       RANK() OVER(ORDER BY sal) AS rn11,           -- rank across all rows
       SUM(1) OVER(PARTITION BY NULL) AS rn12,      -- total number of rows
       PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM bigdata_t3;

rn1: rn1 = (rn11-1)/(rn12-1). Row 1: (1-1)/(5-1) = 0; row 2: (2-1)/(5-1) = 0.25; row 4: (4-1)/(5-1) = 3/4 = 0.75.
rn2: within dept d1: row 1: (1-1)/(3-1) = 0; row 3: (3-1)/(3-1) = 1.
The GROUPING SETS, GROUPING__ID, CUBE, ROLLUP functions
These analytic functions are typically used in OLAP scenarios where counts cannot simply be accumulated and must instead be computed by rolling up and drilling down across different dimensions, e.g. the number of UVs per hour, day, and month.
- Data preparation
Contents of bigdata_t5.dat:

2018-03,2018-03-10,cookie1
2018-03,2018-03-10,cookie5
2018-03,2018-03-12,cookie7
2018-04,2018-04-12,cookie3
2018-04,2018-04-13,cookie2
2018-04,2018-04-13,cookie4
2018-04,2018-04-16,cookie4
2018-03,2018-03-10,cookie2
2018-03,2018-03-10,cookie3
2018-04,2018-04-12,cookie5
2018-04,2018-04-13,cookie6
2018-04,2018-04-15,cookie3

CREATE TABLE bigdata_t5 (
  month STRING,
  day STRING,
  cookieid STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as textfile;

load data local inpath '/root/hivedata/bigdata_t5.dat' into table bigdata_t5;
- GROUPING SETS
Grouping Sets is a convenient way to write group by logic in a single SQL statement.
This is equivalent to UNION ALL of the Group By result sets of different dimensions.
Grouping__id, which indicates which grouping set the result belongs to.
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID
FROM bigdata_t5
GROUP BY month, day
GROUPING SETS (month, day)
ORDER BY GROUPING__ID;

GROUPING__ID indicates which grouping set a result row belongs to; per GROUPING SETS (month, day), 1 corresponds to month and 2 to day. Equivalent to:

SELECT month, NULL, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM bigdata_t5 GROUP BY month
UNION ALL
SELECT NULL AS month, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID FROM bigdata_t5 GROUP BY day;
Another example:

SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID
FROM bigdata_t5
GROUP BY month, day
GROUPING SETS (month, day, (month, day))
ORDER BY GROUPING__ID;

Equivalent to:

SELECT month, NULL, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM bigdata_t5 GROUP BY month
UNION ALL
SELECT NULL, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID FROM bigdata_t5 GROUP BY day
UNION ALL
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, 3 AS GROUPING__ID FROM bigdata_t5 GROUP BY month, day;
- CUBE
Aggregate according to all combinations of the dimensions of GROUP BY.
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID
FROM bigdata_t5
GROUP BY month, day
WITH CUBE
ORDER BY GROUPING__ID;

Equivalent to:

SELECT NULL, NULL, COUNT(DISTINCT cookieid) AS uv, 0 AS GROUPING__ID FROM bigdata_t5
UNION ALL
SELECT month, NULL, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM bigdata_t5 GROUP BY month
UNION ALL
SELECT NULL, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID FROM bigdata_t5 GROUP BY day
UNION ALL
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, 3 AS GROUPING__ID FROM bigdata_t5 GROUP BY month, day;
- ROLLUP
ROLLUP is a subset of CUBE: it aggregates hierarchically, starting from the leftmost dimension.
For example, hierarchical aggregation from the month dimension:

SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID
FROM bigdata_t5
GROUP BY month, day
WITH ROLLUP
ORDER BY GROUPING__ID;

If the order of month and day is swapped, the hierarchy is aggregated from the day dimension instead:

SELECT day, month, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID
FROM bigdata_t5
GROUP BY day, month
WITH ROLLUP
ORDER BY GROUPING__ID;

(Here, aggregating by day and month gives the same result as aggregating by day alone, because day and month have a parent-child relationship; with other dimension combinations the results would differ.)