@[TOC]

A Hive, SQL

1.1. Window functions

1.1.1. Calculate the cumulative access times of each user

There are the following users accessing data (T1)SQL is required to calculate the cumulative number of access times for each user, as shown in the following table: answer1. Create table


create table action (
	userId string,
	visitDate string,
	visitCount int
) row format delimited fields terminated by "\t";

Copy the code

2. Insert data

insert into table action values('u01'.'2021/1/21'.'5');
insert into table action values('u02'.'2021/1/23'.'6');
insert into table action values('u03'.'2021/1/22'.'8');
insert into table action values('u04'.'2021/1/20'.'3');
insert into table action values('u01'.'2021/1/23'.'6');
insert into table action values('u01'.'2021/2/21'.'8');
insert into table action values('U02'.'2021/1/23'.'6');
insert into table action values('U01'.'2021/2/22'.'4');
Copy the code

3. Convert the time format

select from_unixtime(unix_timestamp(visitDate,'yyyy/mm/dd'),'yyyy-mm') from action;
Copy the code

4. Count the access counts of each user in a month

select
        lower(userId) userId, -- Change case
        from_unixtime(unix_timestamp(visitDate,'yyyy/mm/dd'),'yyyy-mm') visitMonth,
        sum(visitCount) visit_month_count
from action
group by lower(userId),from_unixtime(unix_timestamp(visitDate,'yyyy/mm/dd'),'yyyy-mm')
Copy the code

5. Count the total number of visits

select 
    t1.userId userId,
    t1.visitMonth visitMonth,
    t1.visit_month_count visit_month_count,
    sum(t1.visit_month_count) over (partition by userId order by visitMonth rows between UNBOUNDED PRECEDING AND CURRENT ROW) sum_month
from 
(
    select
        lower(userId) userId,
        from_unixtime(unix_timestamp(visitDate,'yyyy/mm/dd'),'yyyy-mm') visitMonth,
        sum(visitCount) visit_month_count
    from action
    group by lower(userId),from_unixtime(unix_timestamp(visitDate,'yyyy/mm/dd'),'yyyy-mm')
) t1
Copy the code

Field to explain

Following the first row of the partition, we have an unbounded class. N is the offset relative to the current row following: preceding, and unbounded by the end of the partition. N: indicates the offset from the current row. Current Row: indicates the current row. The offset is 0Copy the code

For more information, please follow my official account [Big Data]