Lead and lag
First_value (last_value) and last_value (last_value) are all values up to the first and last rows of the current row
Lag and Lead analysis functions can fetch N rows of data after the current row in a single query. Although sorting is not necessary, it usually makes sense to fetch the first or last N rows of data only in sorting scenarios
This operation can replace self-join of tables, and LAG and LEAD are more efficient.
Lag/Lead(Col,n,DEFAULT) is used to calculate the value of the NTH row before or after the current row in a statistical window
- The first parameter is the column name,
- The second argument is the NTH line up (optional, default is 1),
- The third parameter is the default value (if the NTH value above is NULL, or NULL if not specified).
Lag takes the data before the current row, and lead takes the data after the current row
The test data
We have a text user_access_log.txt that records the user’s access to the page id,ctime, and URL representing the user ID, access time, and url of the page visited, respectively
Peter 2015-10-12 01:10:00 url1
Peter 2015-10-12 01:15:10 url2
Peter 2015-10-12 01:16:40 url3
Peter 2015-10-12 02:13:00 url4
Peter 2015-10-12 03:14:30 url5
Marry 2015-11-12 01:10:00 url1
Marry 2015-11-12 01:15:10 url2
Marry 2015-11-12 01:16:40 url3
Marry 2015-11-12 02:13:00 url4
Marry 2015-11-12 03:14:30 url5
Copy the code
Data description: Peter 2015-10-12 01:10:00 URL1, indicating that Peter entered the webpage URL1 on 2015-10-12 01:10:00, that is, the time of entering the webpage was recorded.
create table ods_user_log(
userid string,
ctime string,
url string
) row format delimited fields terminated by '\t';
LOAD DATA LOCAL INPATH '/Users/liuwenqiang/workspace/hive/user_access_log.txt' OVERWRITE INTO TABLE ods_user_log;
Copy the code
Learn lead and LAG from examples
Calculate the first access time of the current user and the last access time and next access time of the current access time
So we know from the description that the condition of the partition or the condition of the definition of the subwindow is the user itself, and we know from our previous study that the first access time we can calculate using first_value, so let’s see how we can implement this requirement
select
userid,url,ctime,
first_value(ctime) over(partition by userid order by ctime) as first_ctime,
lag(ctime,1) over(partition by userid order by ctime) as lag_ctime,
lead(ctime,1) over(partition by userid order by ctime) as lead_ctime
from
ods_user_log
;
Copy the code
Here we also briefly analyze the result. For the first row of data, there is no LAG_ctime, that is, there is no last access time, because it is the first access. For lead_ctime, it is the next access time 2015-11-12 01:15:10
Calculate how long a user spends on each page
The user Peter is browsing the web, and at one moment, Peter clicks into a page, and after a period of time, Peter goes to another page, and so on and so on. How can we count the time that Peter stays in a particular web page, or how can we count the total time that a web user stays in?
To calculate the time of Peter staying in urL1, it is necessary to subtract the time of entering URL1 from the time of entering URL1, that is, 2015-10-12 01:15:10. This time is not only the time of leaving urL1, but also the time of entering URL2. So we need to get the start and end times of a user’s stay on a page
select
userid,url,ctime as startTime,
lead(ctime,1) over(partition by userid order by ctime) as leaveTime
from
ods_user_log
;
Copy the code
With entry and exit times, it’s easy to calculate how long a user spends on a particular page by subtracting the entry time from the departure time
select
userid,url,ctime as startTime,
lead(ctime,1) over(partition by userid order by ctime) as leaveTime,
unix_timestamp(lead(ctime,1) over(partition by userid order by ctime) ) -unix_timestamp(ctime) as stayTime
from
ods_user_log
;
Copy the code
How can we implement this requirement without functions such as LAG and lead? We can implement this with autocorrelation, but there are many different ways to implement this autocorrelation. Here I implement one with row_number
with a as (
select
userid,url,ctime as startTime,
row_number(a)over (partition by userid order by ctime ) as rn
from
ods_user_log
)
select
a1.userid userid,a1.startTime startTime,a2.startTime leaveTime,unix_timestamp(a2.startTime )-unix_timestamp(a1.startTime ) as stayTime
from
a a1
inner join
a a2
on
a1.userid=a2.userid
and (a2.rn-a1.rn=1);Copy the code
Here we mainly use the condition (a2.rn-a1.rn=1) to find out when the user enters and leaves the page
Usage scenarios
In fact, there are many application scenarios of LEAD and LAG, especially in the scenario where calculation requires self-correlation. If the head is properly used, the performance of our program can be greatly improved and the writing method can be simplified. Let’s look at another example below
I have a log table that records all the information about the status of the merchant rate change user ID, rate, time, and now I have a requirement to take out the data row that has the rate change in chronological order, and output the data format merchant ID, current time, change time, original rate, changed rate
100,0.1,2016-03-02 100,0.1,2016-02-02 100,0.2,2016-03-05 100,0.2,2016-03-06 100,0.3,2016-03-07 100,0.1,2016-03-09 100,0.1,2016-03-10 100,0.1,2016-03-10 200,0.1,2016-02-02 200,0.1, 2016-03-05 200,0.2,2016-03-06 200,0.3, 2016-03-07 200,0.1, 2016-03-07 200,0.1, the,0.1 200 2016-03-10, 2016-03-10Copy the code
Now let’s start building the table
create table ods_user_rate_log(
userid string,
rate double,
ctime string
) row format delimited fields terminated by ',';
LOAD DATA LOCAL INPATH '/Users/liuwenqiang/workspace/hive/ods_user_rate_log.txt' OVERWRITE INTO TABLE ods_user_rate_log;
Copy the code
Now analyse, first of all, we are separate each merchants, so our child window defined condition is that a user ID, and then we track is the rate of change over time so we are sorted by time, finally, we need the rate of changed data, so we need to compare the original rate and rate next time
Step 1: Get the current rate and the next rate
select
userid ,ctime,rate , lead(rate,1) over (partition by userid order by ctime) new_rate
from
ods_user_rate_log
;
Copy the code
Step 2: find the record of the rate difference and return it. Note that we also need to obtain the time of the change
select
*
from (
select
userid,
ctime,
lead(ctime, 1) over (partition by userid order by ctime) new_date,
rate,
lead(rate, 1) over (partition by userid order by ctime) new_rate
from ods_user_rate_log
) tmp
where
rate! =new_rate
;
Copy the code
conclusion
-
Lag and lead are mainly used to calculate N rows before and after the current row. Normally, we would sort the data because the number of rows before and after the current row is meaningful only if it is ordered
-
Lag and lead can be used in place of autocorrelation in certain situations