This article is reprinted from itopic.org/liucunlv-sh…
Before counting retention, let’s understand the concept of retention, as described in Baidu Baike:
Users who start using the app for a certain period of time and continue using the app after a certain period of time are considered retention; The proportion of these users to the new users is the retention rate, which is calculated every 1 unit of time (day, week, month). As the name suggests, retention refers to “how many users stay.” Retention and retention reflect the quality of your app and your ability to retain users.
To put it simply, the first day 100 new users were added, the next day 50 of those 100 were logged in, and the third day 30 of those 100 were logged in… The next day retention rate is 50%, and the third day retention rate is 30%.
In the statistical system often need to statistics user retention rate, here sort out the user retention rate statistics of several ways to achieve.
1. Through the last login time
There is a unique table for new users that contains at least three fields: uid, reg_time, and last_visited_time. The user updates the last access time (last_visited_time) after each visit. Assume that 100 new users are registered on Day 3.6 and need to be counted the next day, the reg_time is 3.6 and the last_visited_time is 3.7 on the morning of day 3.8.
SELECT COUNT(*) FROM TBL_NAME WHERE DATE(reg_time) = '2014-03-06' AND DATE(last_visited_time) = '2014-03-07'
Copy the code
The implementation is simple, but the problem is also obvious, if these users happen to have access at 0 o ‘clock and update the access time first, the retention rate will not be recorded, this will not be too big for the overall results, ignore for now. A more obvious problem is the inability to repeat statistics, which cannot be done if the script fails or if a recount is required. Of course, there are advantages, is convenient statistics, but also convenient to add N days retention.
2, by creating a separate field implementation
Individual fields can be designed like this: UID,reg_time, DAY_2, DAY_3, DAY_4… Wait, when the user has access the next day, the day_2 field is updated with 1, and the day_3 field is updated with 1. This series of fields defaults to 0. If the same statistics are retained tomorrow, the SQL should look like this:
SELECT COUNT(*) FROM TBL_NAME WHERE DATE(reg_time) = '2014-03-06' AND day_2 = 1
Copy the code
This method can repeat statistics, but it is not convenient to extend, if the current 15-day process is not considered, you need to modify the table structure to add DAY_15.
3, through the bit operation
The values recorded in the data table above are many zeros and ones, which can be used as binary zeros and ones to indicate whether access was made that day, with 1 indicating yes and 0 indicating no access. The design table has these fields, uid,reg_time,retension, assuming retention records, then
On the first day, access 0 0 0 0 0 0 0 0 0 0 0 0 0 1 corresponds to the decimal 1, which is recorded as 1
On the second day, after access, retention is updated to 3
The fourth day of visit 0 0 0 0 0 0 0 0 0 1 0 11 No visit on the third day, after the visit, Rentention is updated to 11
And so on, the next step is to calculate the retention of that day, for example, the next day. Perform bitwise and operation on the next day’s data with the second bit being 1 and the other bits being 0
If the result is 2, it means that the next day was accessed. If the result is not 2, the result is 0, it means that the next day was not accessed. Mysql > select * from day 1; mysql > select * from day 2; mysql > select * from day 2;
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 to 1Copy the code
On the third day there was no visit, and on the fourth day there was:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 | 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 = 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1Copy the code
In SQL, it’s (N for NTH day access)
UPDATE TBL_NAME SET retention = retention | 2^(N-1) WHERE uid = 'XX'
Copy the code
And the update operation can be repeated operation, in the day because the bitwise or you just need to have a 1 to 2 days for the first time update 1 | 2 = 3, the second update 3 | 2 = 3. The visible values are the same. After hearing this scheme, I also doubt the efficiency problem. In 1000W data, the statistical speed is almost the same as the index time in reg_time, so there is no problem. An integer of 4 bytes and 32 bits can represent 32 different holders. an integer of 8 bytes can also be used. In general, the method is scalable and reproducible, so it works.
Bit operations have only been seen before in permissions, here is a good way to use, looking forward to more thinking, the following is the basic operation of bit operations:
Use Redis to measure retention
Reprinted from blog.csdn.net/agonie20121…
Let’s start with a scenario: a website that needs to count the number of consecutive users who log in within a week, and the number of users who log in within a month or retention rate.
This is difficult to do with a traditional database like Mysql. But if you do it in Redis, it’s pretty easy. Both Redis collection types and Bitmap types can be easily implemented. Today, we’re going to focus on using Bitmaps to count active users.
What is the Bitmaps
Bitmaps are not actual data types, but rather a collection of byte-oriented operations defined on strings. Because strings are binary-safe blocks, their maximum length is 512 megabytes, which is best set to 2^32 different bytes.
One of the biggest advantages of Bitmaps is that it is extremely space-efficient when storing information. For example, in a system where different users are identified by incremental user ids, a single bit of information for four billion users (to know, for example, whether the user wants to receive the latest email) is recorded using only 512 megabytes of memory.
In a computer system, the smallest unit of information is a byte, one byte equals eight bits, each of which can only be a 0 or a 1 (the only two numbers a computer recognizes). With Bitmaps, you can manipulate bits directly.
You can think of Bigmaps as an array, where each bit can only be a 0 or a 1, and the array’s subscripts are used as offsets.
Here are a few bitmaps-related commands:
You can see from the following results that Bitmaps actually store strings
127.0.0.1:6379> set hello big
OK
127.0.0.1:6379> getbit hello 0
(integer) 0
127.0.0.1:6379> getbit hello 1
(integer) 1
127.0.0.1:6379> getbit hello 2
(integer) 1
Copy the code
setbit
Setbit Key offset value: Sets the corresponding bit value
For example, if users 3, 8, 23, 32 visit the website today, then
setbit user:view:2020-5-17 3 1
setbit user:view:2020-5-17 8 1
setbit user:view:2020-5-17 23 1
setbit user:view:2020-5-17 32 1
Copy the code
Development tip: Many application ids do not start with 1. Many start with specified numbers, such as 1001 and 10001. For these, we can subtract the initial value when setting to avoid wasting space
getbit
Getbit key offset Obtains the value of the specified bit
If I want to know if users 8 and 45 have logged in today, then
127.0.0.1:6379> getbit user:view:2020-5-17 8
(integer) 1
127.0.0.1:6379> getbit user:view:2020-5-17 45
(integer) 0
Copy the code
You can see that user 8 has logged in today, but user 45 has not logged in today.
bitcount
Bitcount key [start] [end] Retrieves the number in the specified range of 1
I want to know how many users logged in today, then
127.0.0.1:6379> bitcount User :view:2020-5-17 (integer) 4
Operations between Bitmaps
bitop op destkey key [key ...]
The bitop command can do intersection (and), union (OR), not (NOT), xor (XOR) for multiple bitmaps and store the result in destkey.
If you want to know the number of users who logged in on three consecutive days, the number of users who logged in on May 17, 18, and 19
The landings during the three days were as follows:
-
On May 17, users 3, 8, 23 and 32 logged in
-
Users 3, 23, 43 and 54 logged in on May 18
-
Users 3, 5, 23, 32, 56 and 78 logged in on May 19
127.0.0.1:6379> bitop and three:and user:view:2020-5-17 user:view:2020-5-18 user:view:2020-5-19 127.0.0.1:6379> bitcount three:and (integer) 2Copy the code
If you want to know, how many users logged in in the last three days.
127.0.0.1:6379> bitop or three:or user:view:2020-5-17 user:view:2020-5-18 user:view:2020-5-19
(integer) 10
127.0.0.1:6379> bitcount three:or
(integer) 9
Copy the code
As you can see, a total of nine users logged in during these three days.
In actual combat
With the knowledge mentioned above, we can complete the desired requirements: need to count the number of consecutive login users in a week, and login users in a month.
Firstly, the pseudo-code is as follows:
for ($i = 0; $i < 20000; $i++) {
$userId = mt_rand(1, 10000);
$date = time() - 86400 * mt_rand(0, 30);
$key = 'userlogin_'.date('Ymd', $date);
$redis->setBit($key, $userId, 1);
}
Copy the code
Retrieve all logins within a week. We don’t want to fetch all logins at once. Instead, we want to paging a certain number of logins at once.
for ($i = 1; $i <= 7; $i ++) { $key = "userlogin_".date('Ymd', time() - (86400*$i)); if ($i == 1) { $redis->bitOp('and', 'week_logined', $key); } else { $redis->bitOp('and', 'week_logined', 'week_logined', $key); $userIds = []; $userIds = []; for ($i=1; $i<=10000; $i++) { $ret = $redis->getBit('week_logined', $i); $ret && $userIds[] = $i; if (count($userIds) >=50) break; }Copy the code
In bitOP, the first time, because week_logined does not exist, there is only one key for op. When we start the second time, we have 2 keys to do op.
Get the login user within a month, the idea is basically the same as above, but change and to OR
for ($i = 1; $i <= 3; $i ++) { $key = "userlogin_".date('Ymd', time() - (86400*$i)); $redis->bitOp('or', 'month_loginOnce', 'month_loginOnce', $key); $userIds = []; for ($i=1; $i<=10000; $i++) { $ret = $redis->getBit('month_loginOnce', $i); $ret && $userIds[] = $i; }Copy the code
As you can see, there are some differences between or and AND. Or, you don’t need to judge the first time. The reason why, you feel it yourself.