Lead and lag

First_value (last_value) and last_value (last_value) are all values up to the first and last rows of the current row

Lag and Lead analysis functions can fetch N rows of data after the current row in a single query. Although sorting is not necessary, it usually makes sense to fetch the first or last N rows of data only in sorting scenarios

This operation can replace self-join of tables, and LAG and LEAD are more efficient.

Lag/Lead(Col,n,DEFAULT) is used to calculate the value of the NTH row before or after the current row in a statistical window

  • The first parameter is the column name,
  • The second argument is the NTH line up (optional, default is 1),
  • The third parameter is the default value (if the NTH value above is NULL, or NULL if not specified).

Lag takes the data before the current row, and lead takes the data after the current row

The test data

We have a text user_access_log.txt that records the user’s access to the page id,ctime, and URL representing the user ID, access time, and url of the page visited, respectively

Peter	2015-10-12 01:10:00	url1
Peter	2015-10-12 01:15:10	url2
Peter	2015-10-12 01:16:40	url3
Peter	2015-10-12 02:13:00	url4
Peter	2015-10-12 03:14:30	url5
Marry	2015-11-12 01:10:00	url1
Marry	2015-11-12 01:15:10	url2
Marry	2015-11-12 01:16:40	url3
Marry	2015-11-12 02:13:00	url4
Marry	2015-11-12 03:14:30	url5
Copy the code

Data description: Peter 2015-10-12 01:10:00 URL1, indicating that Peter entered the webpage URL1 on 2015-10-12 01:10:00, that is, the time of entering the webpage was recorded.

create table ods_user_log(
    userid string,
    ctime string,
    url string
) row format delimited fields terminated by '\t';
LOAD DATA LOCAL INPATH '/Users/liuwenqiang/workspace/hive/user_access_log.txt' OVERWRITE INTO TABLE ods_user_log;
Copy the code

Learn lead and LAG from examples

Calculate the first access time of the current user and the last access time and next access time of the current access time

So we know from the description that the condition of the partition or the condition of the definition of the subwindow is the user itself, and we know from our previous study that the first access time we can calculate using first_value, so let’s see how we can implement this requirement

select
    userid,url,ctime,
    first_value(ctime) over(partition by userid order by ctime) as first_ctime,
    lag(ctime,1) over(partition by userid order by ctime) as lag_ctime,
    lead(ctime,1) over(partition by userid order by ctime) as lead_ctime
from
    ods_user_log
;
Copy the code

Here we also briefly analyze the result. For the first row of data, there is no LAG_ctime, that is, there is no last access time, because it is the first access. For lead_ctime, it is the next access time 2015-11-12 01:15:10

Calculate how long a user spends on each page

The user Peter is browsing the web, and at one moment, Peter clicks into a page, and after a period of time, Peter goes to another page, and so on and so on. How can we count the time that Peter stays in a particular web page, or how can we count the total time that a web user stays in?

To calculate the time of Peter staying in urL1, it is necessary to subtract the time of entering URL1 from the time of entering URL1, that is, 2015-10-12 01:15:10. This time is not only the time of leaving urL1, but also the time of entering URL2. So we need to get the start and end times of a user’s stay on a page

select
    userid,url,ctime as startTime,
  	lead(ctime,1) over(partition by userid order by ctime) as leaveTime
from
    ods_user_log
;
Copy the code

With entry and exit times, it’s easy to calculate how long a user spends on a particular page by subtracting the entry time from the departure time

select
    userid,url,ctime as startTime,
    lead(ctime,1) over(partition by userid order by ctime) as leaveTime,
    unix_timestamp(lead(ctime,1) over(partition by userid order by ctime) ) -unix_timestamp(ctime) as stayTime
from
    ods_user_log
;
Copy the code

How can we implement this requirement without functions such as LAG and lead? We can implement this with autocorrelation, but there are many different ways to implement this autocorrelation. Here I implement one with row_number

with a as (
    select
        userid,url,ctime as startTime,
        row_number(a)over (partition by userid order by ctime ) as rn
    from
        ods_user_log
)
select
    a1.userid userid,a1.startTime startTime,a2.startTime leaveTime,unix_timestamp(a2.startTime )-unix_timestamp(a1.startTime ) as stayTime
from
    a  a1
inner join
    a  a2
on
    a1.userid=a2.userid
    and (a2.rn-a1.rn=1);Copy the code

Here we mainly use the condition (a2.rn-a1.rn=1) to find out when the user enters and leaves the page

Usage scenarios

In fact, there are many application scenarios of LEAD and LAG, especially in the scenario where calculation requires self-correlation. If the head is properly used, the performance of our program can be greatly improved and the writing method can be simplified. Let’s look at another example below

I have a log table that records all the information about the status of the merchant rate change user ID, rate, time, and now I have a requirement to take out the data row that has the rate change in chronological order, and output the data format merchant ID, current time, change time, original rate, changed rate

100,0.1,2016-03-02 100,0.1,2016-02-02 100,0.2,2016-03-05 100,0.2,2016-03-06 100,0.3,2016-03-07 100,0.1,2016-03-09 100,0.1,2016-03-10 100,0.1,2016-03-10 200,0.1,2016-02-02 200,0.1, 2016-03-05 200,0.2,2016-03-06 200,0.3, 2016-03-07 200,0.1, 2016-03-07 200,0.1, the,0.1 200 2016-03-10, 2016-03-10Copy the code

Now let’s start building the table

create table ods_user_rate_log(
    userid string,
    rate double,
    ctime string
) row format delimited fields terminated by ',';
LOAD DATA LOCAL INPATH '/Users/liuwenqiang/workspace/hive/ods_user_rate_log.txt' OVERWRITE INTO TABLE ods_user_rate_log;
Copy the code

Now analyse, first of all, we are separate each merchants, so our child window defined condition is that a user ID, and then we track is the rate of change over time so we are sorted by time, finally, we need the rate of changed data, so we need to compare the original rate and rate next time

Step 1: Get the current rate and the next rate

select
    userid ,ctime,rate , lead(rate,1) over (partition by userid order by ctime)  new_rate
from
     ods_user_rate_log
;
Copy the code

Step 2: find the record of the rate difference and return it. Note that we also need to obtain the time of the change

select 
    *
from (
         select
             userid,
             ctime,
             lead(ctime, 1) over (partition by userid order by ctime) new_date,
             rate,
             lead(rate, 1) over (partition by userid order by ctime) new_rate
         from ods_user_rate_log
) tmp
where
    rate! =new_rate
;
Copy the code

conclusion

  • Lag and lead are mainly used to calculate N rows before and after the current row. Normally, we would sort the data because the number of rows before and after the current row is meaningful only if it is ordered

  • Lag and lead can be used in place of autocorrelation in certain situations