Customer behavior record segmentation

Is mainly used in the calculation method we calculate the user on our web site or application on time, or the users at a certain time spent on the page, in fact, in the previous study Hive window function lead and lag window function of grammar when we also introduced how to calculate the user in a page on the residence time, It is the time for the user to enter the next page minus the time for the user to enter the current page. Because we mainly talked about the application of the window function in front, so we did not consider whether this calculation is appropriate. Today we will look at this problem

Background to session splitting

So let’s see why we want to do session splitting, and let’s say we want to calculate how long the user stays in our APP, and let’s say we don’t do session splitting, so our stay is going to be the user’s last visit minus the user’s first visit, which doesn’t make sense, Left us, because the user may be in the middle time of application, in fact, if this is really the case said, as long as we can capture the user leaves the event, and then calculate every time left and came in to the time difference between the last time, we can call this session, which will be together the multiple session length.

But many times we can’t catch the users leave events, or the user is not normal leave, such as the user directly to clean up the background in this case we need to build a session, thus calculated, and the user is a series of reports on the data session segmentation, thus the users on the platform of the stay time is calculated.

To build the core of the session is our user behavior can be divided according to certain time, and flink session window is a bit like, here is our rule is that if two behaviors between the more than 30 minutes, we think that the two data belong to two different sessions, we call this time interval session timeout time, Generally, the session timeout time of our APP is 30 minutes, and that of the Web terminal is 10 minutes

To prepare data

The first thing we need to do is prepare a bunch of user access data, which is a bunch of user behavior data, because we’re building sessions ourselves, so we don’t care what the events are, but of course if you can combine the behavior events of the data with the physical partitioning then the session will be more accurate, right

1 2020-11-20 10:01:51 url1 1 2020-11-20 10:05:51 url1 1 2020-11-20 10:43:51 url2 1 2020-11-20 10:46:51 url1 1 2020-11-20  10:49:51 url2 1 2020-11-20 10:50:51 url2Copy the code

You can import this data into Hive, or you can use it like this

/* User ID: uid url: URL access time: event_time */
WITH user_log AS(
    SELECT
        uid,
        datetime(event_time) event_time,
        event_key
    FROM
    (
        SELECT '1' AS uid, 'the 2020-11-20 10:01:51' AS event_time, 'url1' AS url
        UNION ALL
        SELECT '1' AS uid, 'the 2020-11-20 10:05:51' AS event_time, 'url1' AS url
        UNION ALL
        SELECT '1' AS uid, 'the 2020-11-20 10:43:51' AS event_time, 'url2' AS url
        UNION ALL
        SELECT '1' AS uid, 'the 2020-11-20 10:46:51' AS event_time, 'url1' AS url
        UNION ALL
        SELECT '1' AS uid, 'the 2020-11-20 10:49:51' AS event_time, 'url2' AS url
        UNION ALL
        SELECT '1' AS uid, 'the 2020-11-20 10:50:51' AS event_time, 'url2' AS url
    )tmp
)
SELECT * FROM user_log;
Copy the code

Build the session

Once we have the data, we can build a session. First, to calculate the time interval between two actions, we need to sort the user data by access time to calculate the time interval between two adjacent actions and determine whether they belong to different sessions

Sort and number by access time

It is important to note that we partition by user first and then sort by user, since sessions are for a single user

SELECT
  *,
  ROW_NUMBER() OVER(PARTITION BY uid ORDER BY event_time) AS act_id
FROM
  user_log;
Copy the code
uid	event_time	url	act_id
+----+-----------+----+-------+
1	2020-11-20 10:01:51	url1	1
1	2020-11-20 10:05:51	url1	2
1	2020-11-20 10:43:51	url2	3
1	2020-11-20 10:46:51	url1	4
1	2020-11-20 10:49:51	url2	5
1	2020-11-20 10:50:51	url2	6
Copy the code

Computing session window

Next we in according to our calculations on the basis of the above rules to compute if the current behavior on time and it’s a behavior of interval if we will be more than 30 minutes the current behavior can be seen as the beginning of the session, if the current behavior without a behavior that is the first article on data we also see it as the beginning of the session, it is important to note us

Sessions are for a single user

select
  a.uid,a.url,a.event_time,
  if(b.event_time is null or datediff(a.event_time,b.event_time,'mi')>10,a.act_id,null) as session_id
from
  order_data a
left join
  order_data b
on
  a.uid=b.uid and a.act_id=b.act_id+1
Copy the code

If session_id is null, the current event is the start of the session. If session_id is null, the interval between the current event and the last event is less than 30 minutes, so it is not the start of the session. So here we also filter out that the data whose session_id is not NULL is the start event of our session

select * from(
  select
    a.uid,a.act_id,a.event_time,
    if(b.event_time is null or datediff(a.event_time,b.event_time,'mi')>10,a.act_id,null) as session_id
  from
    order_data a
  left join
    order_data b
  on
    a.uid=b.uid and a.act_id=b.act_id+1
)
where
  session_id is not null
;
Copy the code

In this step, we can calculate the session_id from order_date, which is the sorted data. We can also use the window function that we learned before, so we don’t need to self-associate, but we still need to filter

select
  uid,act_id,event_time
  if(last_time is null or datediff(event_time,last_time,'mi')>10,act_id,null) as session_id
from(
  select
    uid,event_time,act_id,lag(event_time,1) over(partition by uid order by event_time ) as last_time
  from
    order_data
)
;
Copy the code

Partition each activity into a specific session

In the previous section, we numbered each activity. We called the first activity of the window session_id, and we calculated all the seeion_id. Session_id represents each session. It is also the first behavior data in each session.

So what we’re going to do is we’re going to partition each activity into a specific session, and before we start we need to know how to do that, we already know that session_id represents the minimum act_id in that session, Then we know that the act_ID of the other data in the session must be greater than or equal to the session session session_id, equal because we’re counting the behavior itself, because there may only be one data in a session.

For example, the act_ID of the data in the third session must be greater than the session_ID of the second session, but we know that the data belongs to the third session, not the second session. So we have the right constraint that the current session id is the maximum session ID less than the act_ID of the current session.

It might be a little hard to understand, so just go to the code

select
  a.*,b.session_id
from
  order_data a
left join
  session_data b
on
  a.uid=b.uid
  and a.act_id> =b.session_id
Copy the code

Act_id = 1; act_id=6; act_id= 1; act_id=6

In fact, we just need to sort the above calculated data by row_number in reverse order of session_id, and select the data with serial number 1

select
  *
from (
  select
    uid,url,event_time,act_id,session_id,session_start_time,
    row_number(a)over(partition by uid,act_id order by session_id desc ) as rn
  from (
    select
      a.*,b.session_id,b.event_time as session_start_time
    from
      order_data a
    left join
      session_data b
    on
      a.uid=b.uid
      and a.act_id> =b.session_id
  )
)
where
  rn=1
Copy the code

In fact, at this point we’re going to have all of our data and we’re going to have the session that it belongs to, the time that the session started, and then we’re going to be able to calculate how long the user has been on the platform which is the sum of sessions, In fact, we can see that the duration of a session is the time of the last record of the session minus the time of the first record

Calculates the user’s stay duration

We just need to calculate the duration of each session separately and add up the total session duration

select 
  uid,sum(session_time) as page_duration
from(
  select
    uid,session_id,max(event_time) as session_end_time,min(event_time) as session_start_time,datediff(max(event_time),min(event_time),'ss') as session_time
  from
    user_session
  group by
    uid,session_id
)
group by
    uid
Copy the code
uid	   page_duration
+----+--------------+
1     	660
Copy the code

A complete SQL

It should be noted that for the purposes of the demonstration, I have broken the SQL into sections according to our demo flow, so you can combine them as you use them

WITH user_log AS(
    SELECT
        uid,
        datetime(event_time) event_time,
        url
    FROM
    (
        SELECT '1' AS uid, 'the 2020-11-20 10:01:51' AS event_time, 'url1' AS url
        UNION ALL
        SELECT '1' AS uid, 'the 2020-11-20 10:05:51' AS event_time, 'url1' AS url
        UNION ALL
        SELECT '1' AS uid, 'the 2020-11-20 10:43:51' AS event_time, 'url2' AS url
        UNION ALL
        SELECT '1' AS uid, 'the 2020-11-20 10:46:51' AS event_time, 'url1' AS url
        UNION ALL
        SELECT '1' AS uid, 'the 2020-11-20 10:49:51' AS event_time, 'url2' AS url
        UNION ALL
        SELECT '1' AS uid, 'the 2020-11-20 10:50:51' AS event_time, 'url2' AS url
    )tmp
),
order_data as(
  SELECT
    *.ROW_NUMBER(a)OVER(PARTITION BY uid ORDER BY event_time) AS act_id
  FROM
    user_log
),
session_data as (
  select * from(
    select
      a.uid,a.act_id,a.event_time,
      if(b.event_time is null or datediff(a.event_time,b.event_time,'mi')>10,a.act_id,null) as session_id
    from
      order_data a
    left join
      order_data b
    on
      a.uid=b.uid and a.act_id=b.act_id+1
  )
  where
    session_id is not null
),
user_session as(
  select
    *
  from (
    select
      uid,url,event_time,act_id,session_id,session_start_time,
      row_number(a)over(partition by uid,act_id order by session_id desc ) as rn
    from (
      select
        a.*,b.session_id,b.event_time as session_start_time
      from
        order_data a
      left join
        session_data b
      on
        a.uid=b.uid
        and a.act_id> =b.session_id
    )
  )
  where
    rn=1
)

select
  uid,sum(session_time) as page_duration
from(
  select
    uid,session_id,max(event_time) as session_end_time,min(event_time) as session_start_time,datediff(max(event_time),min(event_time),'ss') as session_time
  from
    user_session
  group by
    uid,session_id
)
group by
    uid
;

Copy the code

Calculate the length of time spent on each page

In fact, to calculate the duration of each page we did this before, but the difference this time is that we introduced sessions, that is, we calculated the duration of each page based on the session, which is to calculate the duration of each page in each session, and then add them up to get the duration of each page, Take a look at our previous article on the window functions lead and LAG

Count the time the user left the current page

It is important to note that our calculation rules are based on a specific session

select
    uid,url,event_time,session_id,act_id,next_actid,next_time,datediff(next_time,event_time,'ss') as pageview_duration
from (
  select
    uid,url,event_time,session_id,act_id,
    lead(event_time,1) over(partition by uid,session_id order by act_id) as next_time,
    lead(act_id,1) over(partition by uid,session_id order by act_id) as next_actid
  from
    user_session
)
where
  next_time is not null
Copy the code

In this case, next_time is the time left from the current page. Here, we filter next_time not null because the last data in the session did not have next_time

Pageview_duration is the duration of a URL, but it is important to note that a URL may have more than one duration within a session

Calculate how long a user spends on each page

We just need to sum the pageView_duration above by user

select 
  uid,url,sum(pageview_duration) as pageview_duration
from (
  select
      uid,url,event_time,session_id,act_id,next_actid,next_time,datediff(next_time,event_time,'ss') as pageview_duration
  from (
    select
      uid,url,event_time,session_id,act_id,
      lead(event_time,1) over(partition by uid,session_id order by act_id) as next_time,
      lead(act_id,1) over(partition by uid,session_id order by act_id) as next_actid
    from
      user_session
  )
  where
    next_time is not null
)
group by 
  uid,url
Copy the code

A complete SQL

It should be noted that for the purposes of the demonstration, I have broken the SQL into sections according to our demo flow, so you can combine them as you use them

WITH user_log AS(
    SELECT
        uid,
        datetime(event_time) event_time,
        url
    FROM
    (
        SELECT '1' AS uid, 'the 2020-11-20 10:01:51' AS event_time, 'url1' AS url
        UNION ALL
        SELECT '1' AS uid, 'the 2020-11-20 10:05:51' AS event_time, 'url1' AS url
        UNION ALL
        SELECT '1' AS uid, 'the 2020-11-20 10:43:51' AS event_time, 'url2' AS url
        UNION ALL
        SELECT '1' AS uid, 'the 2020-11-20 10:46:51' AS event_time, 'url1' AS url
        UNION ALL
        SELECT '1' AS uid, 'the 2020-11-20 10:49:51' AS event_time, 'url2' AS url
        UNION ALL
        SELECT '1' AS uid, 'the 2020-11-20 10:50:51' AS event_time, 'url2' AS url
    )tmp
),
order_data as(
  SELECT
    *.ROW_NUMBER(a)OVER(PARTITION BY uid ORDER BY event_time) AS act_id
  FROM
    user_log
),
session_data as (
  select * from(
    select
      a.uid,a.act_id,a.event_time,
      if(b.event_time is null or datediff(a.event_time,b.event_time,'mi')>10,a.act_id,null) as session_id
    from
      order_data a
    left join
      order_data b
    on
      a.uid=b.uid and a.act_id=b.act_id+1
  )
  where
    session_id is not null
),
user_session as(
  select
    *
  from (
    select
      uid,url,event_time,act_id,session_id,session_start_time,
      row_number(a)over(partition by uid,act_id order by session_id desc ) as rn
    from (
      select
        a.*,b.session_id,b.event_time as session_start_time
      from
        order_data a
      left join
        session_data b
      on
        a.uid=b.uid
        and a.act_id> =b.session_id
    )
  )
  where
    rn=1
)
select
  uid,url,sum(pageview_duration) as pageview_duration
from (
  select
      uid,url,event_time,session_id,act_id,next_actid,next_time,datediff(next_time,event_time,'ss') as pageview_duration
  from (
    select
      uid,url,event_time,session_id,act_id,
      lead(event_time,1) over(partition by uid,session_id order by act_id) as next_time,
      lead(act_id,1) over(partition by uid,session_id order by act_id) as next_actid
    from
      user_session
  )
  where
    next_time is not null
)
group by
  uid,url
;
Copy the code

conclusion

Today we are going to focus on session splitting, the whole process is not complicated, but we need to understand the purpose of each step, and then we will start our calculation based on session