Window function exercises
In fact, the daily use of window functions is more, plus before we respectively introduced the various window functions, today we will practice and summarize, so as to better grasp the window function
The title
Topic 1: The maximum transaction amount of each user by the end of each month and the cumulative total transaction amount of that month
The data source format is as follows
The name of the table | Table annotation | field | Annotation fields | |
---|---|---|---|---|
ods_sales_orders | Order sheet | sales_order_key | Order is the primary key | An order represents the sale of a product |
ods_sales_orders | Order sheet | create_date | Order date | |
ods_sales_orders | Order sheet | customer_key | Customer number | |
ods_sales_orders | Order sheet | product_key | Product number | |
ods_sales_orders | Order sheet | english_product_name | Product name | |
ods_sales_orders | Order sheet | cpzl_zw | Product subtype | |
ods_sales_orders | Order sheet | cplb_zw | Product category | |
ods_sales_orders | Order sheet | unit_price | The product is monovalent |
The output data is in the following format
customer_key | Umonth (current month) | Ucount (Monthly Orders) | Current_max (Maximum transaction Amount) | Current_sum (Total transaction amount for the month) |
---|---|---|---|---|
11009 | 2018-12 | 1 | 53.99 | 53.99 |
1358999 | 2019-2 | 1 | 28.99 | 28.99 |
1358999 | 2019-4 | 1 | 69.99 | 98.98 |
1359000 | 2019-1 | 1 | 2294.99 | 2294.99 |
1359002 | The 2019-11 | 1 | 8.99 | 8.99 |
1359003 | 2020-1 | 1 | 1120.49 | 1120.49 |
1359005 | 2019-2 | 1 | 782.99 | 782.99 |
1359009 | 2019-1 | 1 | 2384.07 | 2384.07 |
1359014 | 2019-1 | 1 | 69.99 | 69.99 |
1359014 | 2019-2 | 1 | 69.99 | 69.99 |
Ideas:
- 1. Group the data according to customers and their year-month
- 2. Sum of monthly sales amount after grouping
- 3. Use window functions to calculate the maximum value (Max) and cumulative value (sum) for each customer in different month groups.
Answer:
select t.customer_key, t.umonth, t.ucount,
max(current_max) over(partition by t.customer_key order by umonth) as current_max,
sum(current_sum) over(partition by t.customer_key order by umonth) as current_sum
from
(
select
customer_key, substr(create_date,1.7) as umonth,
count(sales_order_key) as ucount,
max(unit_price) as current_max,
sum(unit_price) as current_sum
from
adventure_ods.ods_sales_orders
group by
customer_key, substr(create_date,1.7) ) tpm;Copy the code
Note: with TMP as () can also be used:
with tmp as (
select
customer_key, substr(create_date,1.7) as umonth,
count(sales_order_key) as ucount,
max(unit_price) as current_max,
sum(unit_price) as current_sum
from
adventure_ods.ods_sales_orders
group by
customer_key, substr(create_date,1.7))select
customer_key, umonth, ucount,
max(current_max) over(partition by customer_key order by umonth) as current_max,
sum(current_sum) over(partition by customer_key order by umonth) as current_sum
from tmp limit 10;
Copy the code
Comment on:
The above method is for the sake of window function of window function, but also calculate wrong you see the final result is cumulative and maximum is the same, the problem is there, 1 is a summary of the nature of the demand as a result, so we don’t have to use the window function after 2 sub queries each user will have only one data every month
select
customer_key, substr(create_date,1.7) as umonth,
count(sales_order_key) as ucount,
max(unit_price) as current_max,
sum(unit_price) as current_sum
from
adventure_ods.ods_sales_orders
group by
customer_key, substr(create_date,1.7)
Copy the code
That will do
Topic two: find the user number corresponding to different products
The data source format is as follows
The user no. | product | Buy time |
---|---|---|
1 | A | 2019-12-23 |
1 | B | 2019-12-23 |
2 | C | 2019-12-23 |
2 | A | 2019-12-24 |
2 | B | 2019-12-23 |
create table ods_user_product_log(
userid string,
product string,
ctm string
) row format delimited fields terminated by ',';
LOAD DATA LOCAL INPATH '/Users/liuwenqiang/workspace/hive/ods_user_product_log.txt' OVERWRITE INTO TABLE ods_user_product_log;
Copy the code
Required Output Example: User Number – Product 1- Product 2(the first two products)
** for example: ** 1-a -B (in chronological order, realize the same time without limiting the order)
Ideas:
- 1. Use window functions to group user numbers and sort products by time
- 2. Use left concatenation or other methods to join and then filter
- 3. Use concat or other functions to concatenate results
select
userid,product,row_number(a)over (partition by userid order by ctm) as rn
from
ods_user_product_log
;
Copy the code
Next we can use autocorrelation to get the next product to use, since it is autocorrelation we can write with as
with tmp as (
select
userid,product,row_number(a)over (partition by userid order by ctm) as rn
from
ods_user_product_log
)
select
a.userid,a.product,b.product
from
tmp a
inner join
tmp b
on
a.userid=b.userid
where
a.rn=1
and b.rn=2
;
Copy the code
Next you just use concat_ws
Lead,lag, can replace autocorrelation in many situations. Let’s see how lead can be used to fulfill the above requirements
select concat_ws('-', a.userid, a.product, a.next_product)
from (
select a.userid,
a.product,
lead(product, 1) over (partition by userid order by ctm) as next_product
from ods_user_product_log a
) a
where next_product is not null
;
Copy the code
Again, this is going to take one more step
However, it is important to note that lead and lag are good, but in this example, if I want to calculate not the first two products, but the whole product, then you have to use autocorrelation
Query the number of customers who bought in May
In fact, we see that this topic is a common aggregation operation, because this is the customer details, not the purchase details
select
a.customer_key,count(customer_key)
from
ods_sales_orders a
where
month(create_date)='05'
group by
a.customer_key
;
Copy the code
Of course, if you’re thinking of using Windows, that’s fine
select
customer_key,
count(*) over(a)from
ods_sales_orders
where
month(create_date)="5"group by
customer_key
;
Copy the code
Topic four: query customer purchase details and monthly purchase total
This is a very typical window function application, where detail and summary data are needed
select
* ,
sum(unit_price) over(partition by customer_key,substr(create_date,1.7))
from
ods_sales_orders
;
Copy the code
Question 5: Query customer purchase details and monthly cumulative purchase total and monthly purchase total
select
* ,
sum(unit_price) over(partition by customer_key,substr(create_date,1.7) sort by create_date rows between unbounded preceding and current row ) as sumcost
from
ods_sales_orders
Copy the code
Question 6: Query the customer’s last purchase time
Use the offset window function lag()
select
* ,
lag(create_date,1) over(partition by customer_key sort by create_date) as last_time
from
ods_sales_orders
Copy the code
Query the order information of the last 20%
Tip: Use ntile(x) : Divide X portions. X is an integer.
Use the ntile function to divide order times into 5 stacks in order
select * from (
select
*.ntile(5) over(sort by create_date asc) as five_num
from
ods_sales_orders
) t
where
five_num = 1
Copy the code
conclusion
Application scenarios of window functions
(1) Used for partition sorting
(2) dynamic Group By
(3)
(4) Cumulative calculation
(5) Hierarchical query
Common window functions
Summary function:
- Sum (col) over() : Sums col in groups. Syntax for over() is as follows
- Count (col) over() : Calculates the number of col groups. The syntax of over() is as follows
- Min (col) over() : Calculates the minimum value for col
- Max (col) over() : calculates the maximum value of col in groups
- Avg (col) over() : Group to find the average value of col columns
Get specific logging functions:
- First_value (col) over() : the first col value of a partition sorted
- Last_value (col) over() : indicates the last col value of a partition
- Lag (col,n,DEFAULT) : calculates the col value of the preceding n rows. N is optional and the DEFAULT value is 1. DEFAULT If the value of the col value in the preceding n row is NULL, the value is the DEFAULT value
- Lead (col,n,DEFAULT) : collects statistics for col values in the next n rows. N is optional and the DEFAULT value is 1. DEFAULT If the value of col values in the first row is NULL, the DEFAULT value is used
Sharding function:
- Ntile (n) : Used to divide grouped data into N slices in sequence and return the current slice value. Note: n must be of type int.
Ranking function:
- Row_number () over() : ranking function. It is not repeated and is suitable for generating primary keys or non-parallel ranking
- Rank () over() : a ranking function with parallel rankings. Such as: 1,1,3
- Dense_rank () over() : rank function with parallel ranking. Such as: 1,1,2
Calculate the percentage function:
- cume_dist
- percent_rank