This article originally appeared on Walker AI

About user retention is an essential part of the major data analysis platform, enterprises generally use retention rates measure the activity of the user, and can direct response direct measure of the function value of the products, the retention rate is one of the most important indicators to measure the quality of the user, so calculating retention rate is the number of data analysis from the bottom of the fundamentals. So here are some real-world examples of retention analytics.

1. Prepare

Learn how to calculate retention by ClickHouse (Cond1, ConD2,…) Function calculates retention

Create table: user basic information table: login_event

CREATE TABLE login_event -- User login event
(
    `accountId` String COMMENT 'ID of account'.-- Unique ID of the user
    `ds` Date COMMENT 'date' -- User login date
)
ENGINE = MergeTree
PARTITION BY accountId
ORDER BY accountId
Copy the code

Derivative: Inserts user login data for August

-- Insert data
insert into login_event values (10001,toDate('2020-08-01'), (10001,toDate('2020-08-08')), (10001,toDate('2020-08-09')), (10001,toDate('2020-08-10')), (10001,toDate('2020-08-12')),
(10001,toDate('2020-08-13')), (10001,toDate('2020-08-14')), (10001,toDate('2020-08-15')), (10001,toDate('2020-08-16')), (10001,toDate('2020-08-17')), (10001,toDate('2020-08-18')),
(10001,toDate('2020-08-20')), (10001,toDate('2020-08-22')), (10001,toDate('2020-08-23')), (10001,toDate('2020-08-24')), (10002,toDate('2020-08-20')), (10002,toDate('2020-08-22')), (10002,toDate('2020-08-23')), (10002,toDate('2020-08-01')), (10002,toDate('2020-08-11')), (10002,toDate('2020-08-12')), (10002,toDate('2020-08-13')), (10002,toDate('2020-08-20')),
(10002,toDate('2020-08-15')), (10002,toDate('2020-08-30')), (10002,toDate('2020-08-20')), (10002,toDate('2020-08-01')), (10002,toDate('2020-08-06')), (10002,toDate('2020-08-24')), (10003,toDate('2020-08-05')), (10003,toDate('2020-08-08')), (10003,toDate('2020-08-09')), (10003,toDate('2020-08-10')), (10003,toDate('2020-08-11')), (10003,toDate('2020-08-13')),
(10003,toDate('2020-08-15')), (10003,toDate('2020-08-16')), (10003,toDate('2020-08-18')), (10003,toDate('2020-08-20')), (10003,toDate('2020-08-01')), (10003,toDate('2020-08-21')),
(10003,toDate('2020-08-22')), (10003,toDate('2020-08-24')), (10003,toDate('2020-08-26')), (10003,toDate('2020-08-25')), (10003,toDate('2020-08-27')), (10003,toDate('2020-08-28')),
(10003,toDate('2020-08-29')), (10003,toDate('2020-08-30')), (10004,toDate('2020-08-01')), (10004,toDate('2020-08-02')), (10004,toDate('2020-08-03')), (10004,toDate('2020-08-04')),
(10004,toDate('2020-08-05')), (10004,toDate('2020-08-08')), (10004,toDate('2020-08-09')), (10004,toDate('2020-08-10')), (10004,toDate('2020-08-11')), (10004,toDate('2020-08-14')),
(10004,toDate('2020-08-15')), (10004,toDate('2020-08-16')), (10004,toDate('2020-08-17')), (10004,toDate('2020-08-19')), (10004,toDate('2020-08-20')), (10004,toDate('2020-08-21')),
(10004,toDate('2020-08-22')), (10004,toDate('2020-08-23')), (10004,toDate('2020-08-24')), (10004,toDate('2020-08-23')), (10004,toDate('2020-08-23')), (10004,toDate('2020-08-25')),
(10004,toDate('2020-08-27')), (10004,toDate('2020-08-30'));
Copy the code

2

To calculate the number of times retained, 3 retained, 7 retained, 14 retained and 30 retained by a daily active user, we divided the problem into three steps:

  • Find a daily active user

  • Find the login information of an active user on the 2nd, 3rd, 6th, 13th and 29th

  • Calculate the number of daily active users who logged in on days 2, 3, 6, 13, and 29, and calculate the n-day retention rate

Solution 1:


-- Calculate the number of days 2, 3, 6, 13, 29 of 2020-08-01 active users retained, calculate the retention rate
SELECT
    ds,
    count(accountIdD0) AS activeAccountNum,
    count(accountIdD1) / count(accountIdD0) ASLeave a ` ` time,count(accountIdD3) / count(accountIdD0) AS `3Leave `,count(accountIdD7) / count(accountIdD0) AS `7Leave `,count(accountIdD14) / count(accountIdD0) AS `14Leave `,count(accountIdD30) / count(accountIdD0) AS `30Leave a `FROM
( -- Use LEFT JOIN to find the active user of 2020-08-01 on day 2, 3, 6, 13, 29
    SELECT DISTINCT
        a.ds AS ds,
        a.accountIdD0 AS accountIdD0,
        IF(b.accountId = ' '.NULL, b.accountId) AS accountIdD1,
        IF(c.accountId = ' '.NULL, c.accountId) AS accountIdD3,
        IF(d.accountId = ' '.NULL, d.accountId) AS accountIdD7,
        IF(e.accountId = ' '.NULL, e.accountId) AS accountIdD14,
        IF(f.accountId = ' '.NULL, f.accountId) AS accountIdD30
    FROM
    (-- Find active users on 2020-08-01
        SELECT DISTINCT
            ds,
            accountId AS accountIdD0
        FROM login_event
        WHERE ds = '2020-08-01'
        ORDER BY ds ASC
    ) AS a
    LEFT JOIN test.login3_event AS b ON (b.ds = addDays(a.ds, 1)) AND (a.accountIdD0 = b.accountId)
    LEFT JOIN test.login3_event AS c ON (c.ds = addDays(a.ds, 2)) AND (a.accountIdD0 = c.accountId)
    LEFT JOIN test.login3_event AS d ON (d.ds = addDays(a.ds, 6)) AND (a.accountIdD0 = d.accountId)
    LEFT JOIN test.login3_event AS e ON (e.ds = addDays(a.ds, 13)) AND (a.accountIdD0 = e.accountId)
    LEFT JOIN test.login3_event AS f ON (f.ds = addDays(a.ds, 29)) AND (a.accountIdD0 = f.accountId)
) AS temp
GROUP BYDs results:-----------------------------------------┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ds ┬ ─ activeAccountNum ─ ┬ leave ─ ─ time ┬ ─ ─3Leave ─ ┬ ─7Leave ─ ┬ ─14Leave ─ ┬ ─30Leave ─ ┐ │2020- 08- 0140.250.2500.50.75│ └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ┘1 rows in set. Elapsed: 0.022 sec.

Copy the code

Solution 2:


-- Determine the number of active users retained on days 2, 3, 6, 13, 29 on 2020-08-01, calculate the retention rate, calculate the retention rate
SELECT DISTINCT
    b.ds AS ds,
    ifnull(countDistinct(if(a.ds = b.ds, a.accountId, NULL)), 0) AS activeAccountNum,
    ifnull(countDistinct(if(a.ds = addDays(b.ds, 1), b.accountId, NULL)) / activeAccountNum, 0) ASIfnull (countDistinct(if(a.ds= addDays(b.ds, 2), b.accountId, NULL)) / activeAccountNum, 0) AS `3Leave `, ifnull (countDistinct (if (a. d. s= addDays(b.ds, 6), b.accountId, NULL)) / activeAccountNum, 0) AS `7Leave `, ifnull (countDistinct (if (a. d. s= addDays(b.ds, 13), b.accountId, NULL)) / activeAccountNum, 0) AS `14Leave `, ifnull (countDistinct (if (a. d. s= addDays(b.ds, 29), b.accountId, NULL)) / activeAccountNum, 0) AS `30Leave a `FROM
  -- Use INNER JOIN to find out the login status of the active user of 2020-08-01 in the following 1 to 30 days
(
    SELECT
        ds,
        accountId
    FROM login_event
    WHERE (ds < = addDays(toDate('2020-08-01'), 29)) AND (ds > = '2020-08-01'))AS a
INNER JOIN
-- Find active users on 2020-08-01
(
    SELECT DISTINCT
        accountId,
        ds
    FROM test.login3_event
    WHERE ds = '2020-08-01'
) AS b ON a.accountId = b.accountId
GROUP BYDs results:-----------------------------------------┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ds ┬ ─ activeAccountNum ─ ┬ leave ─ ─ time ┬ ─ ─3Leave ─ ┬ ─7Leave ─ ┬ ─14Leave ─ ┬ ─30Leave ─ ┐ │2020- 08- 0140.250.2500.50.75│ └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ┘1 rows in set. Elapsed: 0.019 sec.
Copy the code

Solution 3:


SUM(r[index]) = 2020-08-01 = 2, 3, 6, 13, 29
SELECT
    toDate('2020-08-01') AS ds,
    SUM(r[1]) AS activeAccountNum,
    SUM(r[2]) / SUM(r[1]) ASLeave a ` ` time,SUM(r[3]) / SUM(r[1]) AS `3Leave `,SUM(r[4]) / SUM(r[1]) AS `7Leave `,SUM(r[5]) / SUM(r[1]) AS `14Leave `,SUM(r[6]) / SUM(r[1]) AS `30Leave a `FROM
-- Find the logins of active users on days 2, 3, 6, 13, 29 on 2020-08-01, 1/0 => logged in/logged out
(
    WITH toDate('2020-08-01') AS tt   
SELECT
    accountId,
    retention(
      toDate(ds) = tt, 
      toDate(subtractDays(ds, 1)) = tt, 
      toDate(subtractDays(ds, 2)) = tt, 
      toDate(subtractDays(ds, 6)) = tt,
      toDate(subtractDays(ds, 13)) = tt,
      toDate(subtractDays(ds, 29)) = tt
    ) AS r
  -- Find the login data of 2020-08-01 active users in the following 1 to 30 days
FROM login_event
WHERE (ds > = '2020-08-01') AND (ds < = addDays(toDate('2020-08-01'), 29))
GROUP BY accountId
)
GROUP BYDs results:-----------------------------------------┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ds ┬ ─ activeAccountNum ─ ┬ leave ─ ─ time ┬ ─ ─3Leave ─ ┬ ─7Leave ─ ┬ ─14Leave ─ ┬ ─30Leave ─ ┐ │2020- 08- 0140.250.2500.50.75│ └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ┘1 rows in set. Elapsed: 0.009 sec.
Copy the code

3. Summary

  • Method one, use the traditional method of multi-table association, understand ClickHouse program ape all know, multi-table association is ClickHouse natural enemy, run relatively slow.

  • Method 2: Use a table association, judge the date difference by the IF function, and find the required date user data. Compared with method 1, multi-table association is reduced, and the running speed is improved.

  • Method 3: Use the built-in Retention function of ClickHouse. Retention function is an advanced aggregation function in ClickHouse. This function can accept multiple conditions, based on the result of the first condition. Finally, an array of 1s and 0s is returned. The retention rate can be calculated by counting the number of corresponding 1s in the array.

Compared with the three methods, using ClickHouse’s built-in Retention retention function is faster and more efficient for a large number of data sets. It improves the low speed and low efficiency of the calculation method of user retention rate in the existing technology, and then achieves the effect of improving the calculation speed and efficiency.


PS: more dry technology, pay attention to the public, | xingzhe_ai 】, and walker to discuss together!