The “Playing with Redis” series focuses on the basics and advanced applications of Redis. This article is the first [9] in the “Play Redis” series. Please visit the public account “Zxiaofan” for the latest series of articles, or search “Play Redis zxiaofan” on Baidu.

Keywords: Playing Redis, Daily/monthly activity of Microblog, UV statistics, HyperLogLog;

The outline

  • What are the challenges of daily live data statistics
    • Can Bitmaps be used for daily activity statistics?
    • Characteristics of daily active data statistics
  • HyperLogLog introduction
    • HyperLogLog will know
    • Difference between HyperLogLog and Sets
  • How do I use HyperLogLog
    • HyperLogLog command comparison analysis
    • Description of the HyperLogLog command
    • Precautions for the HyperLogLog command
    • HyperLogLog command example
  • Application scenarios of HyperLogLog

Noun explanation

  • Daily Active User (DAU) The number of Daily Active users

It is often used to reflect the operation of websites, Internet applications or online games. DAU usually measures the number of users who log in or use a product within a day (counting the number of users who log in repeatedly).

  • Number of Monthly Active users (MAU)

Monthly Active Users Count the number of users who log in to or use a product within a month (statistics month) (excluding those who log in repeatedly).

  • Note: Daily and monthly activity reflect user activity, but not user stickiness.

1. What are the challenges of daily live data statistics

On February 26, 2020, Weibo released its financial results for the fourth quarter and the full year of 2019. By the end of 2019, Weibo had 516 million monthly active users, a net increase of about 54 million from the end of 2018, with mobile devices accounting for 94 percent of the total, data showed. In 2019, weibo’s annual revenue rose to 12.24 billion yuan, with advertising revenue reaching 10.6 billion yuan.

1.1. Can Bitmaps be used for daily activity statistics?

The application of Bitmaps in big data was mentioned in the previous article, so can Bitmaps be used to collect daily live data? Let’s do a calculation (using 100 million users as an example) :

Statistical methods Footprint calculation 100 million users (M)
MySQL 32bit int data type An int requires 4 bytes of storage and can store 32 bits 10^8 / (1024 * 1024 * 8/32) ≈ 381 M
Redis Bitmaps Bitmaps supports 512MB individually, unlike int which stores only 32 bits individually 10^8 / (1024 * 1024 * 8) ≈ 12M

Using Bitmaps to calculate daily and monthly activity:

  • Calculate daily activity: bitcount key obtains the number of keys 1.
  • Calculate monthly activity: Calculate OR of all bitmaps of 30 days, and then calculate bitcount;
  • Calculate the retention rate: Yesterday’s retention = the number of consecutive logins yesterday and today/the number of logins yesterday. That is, yesterday’s bitmap and today’s bitmap are calculated and divided by the number of bitcounts yesterday.

Based on the above calculations, we found that Bitmaps were already very space-efficient. Statistics on the daily activity of a website is no longer the case, but large Internet companies in addition to daily activity, UV, PV and so on need to statistics. In the face of thousands or even more statistical modules, one module needs 12M in a day, 12M * 365/1024 ≈ 4.3g in a year, and 12M * 365/1024/1024 ≈ 4.2T in a year in 1000 modules. So the revolution has not yet succeeded, and we still need to economize!

1.2. Characteristics of daily active data statistics

  • Data needs to be de-weighted;
  • Certain deviations are allowed in the data. There is not much difference between 101W and 102W.
  • Occupy as little space as possible;

2. HyperLogLog is introduced

2.1. HyperLogLog will know

HyperLogLog (HLL) is a probabilistic data structure used for cardinality calculation, which generally supports statistics of non-repeating elements in sets.

Regular cardinality calculations require a memory space to store elements that have already been counted, so that certain elements are not counted twice. Redis offers an algorithm that trades accuracy for memory space, with a standard error of less than 1%. It takes only 12K to complete the statistics (plus a few bytes for HLL itself), and if there are fewer elements in HyperLogLog, the memory space required is even smaller. The standard error for HyperLogLogs is 0.81%.

When the number or volume of input elements is very large, the space required for HLL is fixed and small. 12KB of memory computes cardinality of close to 2^64 different elements.

Although the technical implementation of HyperLogLog is a different data structure, the underlying layer is still Redis strings, so you can use the GET command to GET the serialized data, and use the SET command to deserialize the data and store it in Redis.

2.2. Differences between HyperLogLog and Sets

Comparison/data type Sets HyperLogLog
Whether statistical elements are actually stored storage No elements are stored, only existing tags
Add elements SADD PFADD
Number of statistical elements SCARD PFCOUNT
Remove elements SREM Deletion of elements is not supported

3. How do I use HyperLogLog

HyperLogLog core command: PFADD, PFCOUNT, PFMERGE;

3.1. Comparison and Analysis of HyperLogLog commands

The command function parameter
PFADD Add elements to the HLL data structure key element [element …]
PFCOUNT Returns the base value of HLL key [key …]
PFMERGE Merge multiple HLL data to destkey destkey sourcekey [sourcekey …]

PF in HLL operation command: acronym for Philippe Flajolet, inventor of the HyperLogLog data structure.

3.2. Description of the HyperLogLog command

3.3. Precautions of the HyperLogLog Command

  • PFADD stores only tags, not elements themselves;
  • PFCOUNT is actually a write command, which may be recalculated and stored;
  • If there are multiple keys, PFCOUNT is dynamically combined and the calculation results are not cached. Therefore, avoid multiple keys when executing PFCOUNT in the production environment.
  • If there are more than one key, the PFCOUNT is merged and then calculated. The result is the cardinality value (note: it is not the sum of the cardinality values) after the multiple objects are merged.
  • PFMERGE computes the union of sourcekeys.
  • If the destkey already exists, the result of the PFMERGE destkey is the union of dest and source.

3.4. HyperLogLog Command Example

// Pfadd and pfcount Example @zxiaofan 127.0.0.1:6379> pfAdd HLL 1 (integer) 1 127.0.0.1:6379> pfAdd HLL 1 (integer) 0 127.0.0.1:6379> pfadd HLL 2 3 4 (integer) 1 127.0.0.1:6379> pfcount HLL (integer) 4 127.0.0.1:6379> pfcount HLL :notexist (integer) 0 127.0.0.1:6379> pfAdd hll2 a B (integer) 1 127.0.0.1:6379> pfcount Hll2 (integer) 2 127.0.0.1:6379> pfcount HLL hll2 (integer) 6 127.0.0.1:6379> get HLL "HYLL \ x01 \ x00 \ x00 \ x00 \ x04 \ x00 \ x00 \ x00 \ x00 \ x00 \ x00 \ x00A \ xee \ x84 [v \ x80Mt \ x80Q, \ x8cC \ xf3" 127.0.0.1:6379 > set HLL: error Error666 OK 127.0.0.1:6379> pfcount HLL: Error (error) WRONGTYPE Key is not a valid HyperLogLog string value.Copy the code
// pfmerge example @zxiaofan 127.0.0.1:6379> pfadd hllM1 12 3 4 5 (integer) 1 127.0.0.1:6379> pfadd hllm2 5 6 7 8 (integer) 1 127.0.0.1:6379> pfmerge hllm3 hllM1 hllm2 OK 127.0.0.1:6379> pfcount hllm3 (integer) 8 127.0.0.1:6379> pfadd hllm4 7 8 9 10 11 12 14 14 (integer) 1 127.0.0.1:6379> pfmerge hllm4 hllM1 hllm2 OK 127.0.0.1:6379> pfcount hllm4 (integer) 13Copy the code

4. Application scenarios of HyperLogLog

4.1. The website lives day by day

Daily activity: one HLL per day. PFADD HLL20200719 userID when a user logs in. Monthly activity: Merge all daily activity data of the current month, PFMERGE HLL202007 HLL20200701 HLL20200702 HLL20200703…

4.2. UV web pages

UV (Unique Visitor) Within 1 day; Cookie is the identifier; Multiple visits from the same client count as only one visitor. For example, the boss wants to see in real time how many unique visitors some pages of the company’s website have been accessed from 0 PM today to now.

4.3. Other scenarios

  • Search engine keyword search volume;
  • The number of online users;
  • Data analysis scenarios based on cardinality counting.

【 Redis series articles @zxiaofan】

How to Play Redis- Jingdong Check-in And Jingdou

Redis- The Boss gives you an in-depth understanding of distributed locks

Playing with Redis- How to access the Massive data in Redis efficiently

Redis- Key Commands for Advanced Programmers

Redis- Connection commands that DEVELOPERS should also know

Redis-Redis Advanced Data Structures and Core Commands -ZSet

Redis-Redis Basic Data Structures and Core Commands

Redis-Redis Installation, Background startup and Uninstallation


Good luck!

Life is all about choices!

In the future, you will be grateful for your hard work now!

【CSDN】 【GitHub】 【OSCHINA】 【The Denver nuggets】 【Language finches】 【Wechat official account】