GaussDB(for Redis): the best choice for counting!

This article is from the Huawei Cloud community post "Huawei Cloud PB-level database GaussDB(for Redis) revealed, issue 8: counting with Gauss Redis"; original author: Yan Fat.

1. Background

When we open our phones to check Weibo, we immediately start dealing with all kinds of counters. When we register an account, Weibo records a set of numbers for us: the number of followers, the number of accounts we follow, the number of posts, and so on. When we browse posts, we check the daily hot-search rankings, and Weibo needs to record a search volume for each hot search. Behind all of these numbers are counters at work.

Counters fall into two categories: conventional counters and cardinality counters. A conventional counter only needs simple increments and decrements. A cardinality counter must deduplicate elements; for example, when counting searches, each user's repeated searches should be counted only once. Redis provides data types for both kinds of statistics. However, open source Redis is a weakly consistent database, and in certain scenarios weakly consistent counting cannot meet business requirements, so a strongly consistent database is needed for counting.

GaussDB(for Redis) (Gauss Redis for short) is a strongly consistent, persistent NoSQL database developed by Huawei and compatible with Redis 5.0. This article introduces the application scenarios of conventional counters and cardinality counters, and how to count with Gauss Redis.

2. Conventional counters

2.1 How to use Redis for general counting

Redis implements conventional counters with two data types: String and Hash.

2.1.1 Using String Counts

When we need to maintain a small number of counters, such as the number of registered users on a website, a String counter is appropriate. The INCR and DECR commands provided by Redis increment and decrement a String key respectively:

127.0.0.1:6379> SET counter 100
OK
127.0.0.1:6379> INCR counter
(integer) 101
127.0.0.1:6379> DECR counter
(integer) 100

In addition to INCR and DECR, the Redis String type also provides the INCRBY and DECRBY commands. The syntax is:

  • Incrby: INCRBY key count

Adds count to the value of key; count can be positive or negative:

127.0.0.1:6379> INCRBY counter 10
(integer) 10
127.0.0.1:6379> INCRBY counter -20
(integer) -10
  • Decrby: DECRBY key count

Subtracts count from the value of key; count can be positive or negative:

127.0.0.1:6379> DECRBY counter 10
(integer) -10
127.0.0.1:6379> DECRBY counter -20
(integer) 10

2.1.2 Using Hash Counting

When you need to maintain multiple closely related counters, you can use a Hash structure. For example, when we register a Weibo account, Weibo records a set of per-user data, such as the number of followers and the number of accounts followed, all bound to that user. This group of counters can therefore be kept in a single Hash key and updated with the HINCRBY command. The syntax is:

  • Hincrby: HINCRBY key field count

Increments the given field of the Hash key by count, which can be positive or negative, and returns the new value of that field:

127.0.0.1:6379> HGET userid field
(nil)
127.0.0.1:6379> HINCRBY userid field 1
(integer) 1
127.0.0.1:6379> HINCRBY userid field -1
(integer) 0
127.0.0.1:6379> HGET userid field
"0"

2.2 Usage scenarios for conventional counters

Conventional counters are used very widely. For social products: the number of a user's followers, followed accounts, post likes, and favorites. For video websites: the number of times a video is played (PV, Page View, statistics). For e-commerce flash sales: counting stock and limiting traffic. Redis has a significant performance advantage under high concurrency, making it ideal for these scenarios.

For example, to handle highly concurrent reads and writes, Redis is usually deployed as a cache in front of MySQL. To withstand heavy traffic, a counter can be used for rate limiting. Say we want to allow 10,000 requests per second: we initialize counter=10000, decrement it by one for each incoming request, and block subsequent requests once it reaches zero. At a fixed interval we reset counter=10000, ensuring heavy traffic does not flood the MySQL layer underneath.
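The fixed-window idea above can be sketched in a few lines of Python. This is an illustrative stand-in, not the article's implementation: the class name and parameters are invented here, and a real deployment would keep the counter in Redis (DECR on an expiring key) so that all application servers share one limit.

```python
import time

class FixedWindowLimiter:
    """Minimal fixed-window rate limiter mirroring the Redis counter idea:
    reset the counter every `window` seconds, reject once it hits zero."""

    def __init__(self, limit=10000, window=1.0):
        self.limit = limit
        self.window = window
        self.counter = limit
        self.window_start = time.monotonic()

    def allow(self):
        now = time.monotonic()
        if now - self.window_start >= self.window:
            # A new window begins: reset, like SET counter 10000
            self.counter = self.limit
            self.window_start = now
        if self.counter <= 0:
            return False      # over the limit: block the request
        self.counter -= 1     # like DECR counter
        return True

limiter = FixedWindowLimiter(limit=3, window=60.0)
results = [limiter.allow() for _ in range(5)]
print(results)  # first 3 requests allowed, the rest blocked in this window
```

Note the known weakness of this scheme: up to 2x the limit can pass around a window boundary, which is usually acceptable for protecting a backing database.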

3. Cardinality counting: the principle and use of HyperLogLog

Cardinality counting means counting the number of distinct elements in a data set. It is a common requirement in practice, for example the number of users who visit a website over a period of time, or the daily active users of an online game.

When the amount of data is small, we can simply store all the data and recount it. In Redis, a Set or Zset can hold the elements, and the size of the set is the answer. However, when the amount of data is large, this approach consumes a lot of storage, so other algorithms are needed.
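For intuition, exact deduplicated counting is just set membership; its memory cost grows with the number of distinct elements, which is exactly what HyperLogLog avoids. A plain-Python illustration (not tied to any Redis API):

```python
# Exact cardinality counting: keep every distinct element, as a Redis Set would.
searches = ["u1", "u2", "u1", "u3", "u2", "u1"]  # user id of each search
unique_users = set(searches)

print(len(unique_users))  # 3 -- each user counted once despite repeat searches
```

The set stores every distinct id, so memory is O(cardinality); HyperLogLog answers the same question in roughly 12 KB regardless of cardinality, at the cost of a small error.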

Consider a situation: every time we log in to Weibo, it records the login and counts how many active users there are each day. Clearly we do not need to (and should not) record the active users' IDs, and a small error does not affect how the active-user statistic is used. In such cases we can count with HyperLogLog, an algorithm that achieves huge-scale statistics with very little memory. It is well suited to cardinality estimation in big-data scenarios and is implemented as a data type in Redis.

3.1 Principles of HyperLogLog

3.1.1 From Bernoulli trials to cardinality counting

HyperLogLog is a cardinality estimation algorithm whose idea comes from the Bernoulli process.

The Bernoulli process is simply coin flipping: each flip lands heads or tails with probability 1/2. Write heads as 1 and tails as 0, and flip repeatedly until the first heads appears; that is one trial, and it yields a flip sequence. A sequence such as 001 occurs with probability (1/2)^3 = 1/8.

Conversely, if we keep running trials and the sequence "001" has just appeared for the first time, we can roughly estimate that about 8 trials have been performed (this is in fact a maximum likelihood estimate).

The principle of HyperLogLog is to treat each element as one coin-flip trial and estimate the number of elements from the maximum flip count observed. When we insert an element into the set, it is regarded as one trial, and identical elements must map to the same flip sequence. To turn each element into a "01" sequence, we can run it through a hash function.

This already gives a crude estimation algorithm: record only the maximum position at which the first "1" appears across all hash results, and estimate the cardinality as 2 raised to that position. Obviously, when the amount of data is small such an estimate has a large error, and a single element can swing the estimate sharply.
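The crude single-estimator scheme can be simulated as follows. The helper names (`h32`, `rho`, `crude_estimate`) and the 32-bit hash width are invented for this sketch; note that the result is always a power of two, which shows how coarse the estimate is before bucketing is added.

```python
import hashlib

def h32(element: str) -> int:
    """Hash an element to a 32-bit integer whose bits act as coin flips."""
    return int.from_bytes(hashlib.sha1(element.encode()).digest()[:4], "big")

def rho(x: int) -> int:
    """1-based position of the lowest set bit: flips until the first heads."""
    return (x & -x).bit_length() if x else 33

def crude_estimate(elements) -> int:
    """Keep only the maximum rho over all elements; estimate 2**max_rho."""
    max_rho = max((rho(h32(e)) for e in elements), default=0)
    return 2 ** max_rho

est = crude_estimate(f"user{i}" for i in range(1000))
print(est)  # a power-of-two guess at 1000; one outlier hash can dominate it
```

Duplicates automatically have no effect, since the same element always hashes to the same rho value, which is the property that makes this a cardinality (deduplicated) counter.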

3.1.2 Bucketing and averaging to reduce error

To reduce the influence of any single estimator, HyperLogLog runs many bucketed trials and averages them. The method is to use the first few bits of the hash value as the bucket index and the remaining bits as the trial result.

The per-bucket results are then combined with a harmonic mean to obtain the cardinality estimate (compared with the arithmetic mean, the harmonic mean effectively reduces the excessive influence of extreme values when the cardinality is small).
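Concretely, with m buckets where bucket j stores M[j], the maximum first-"1" position seen in that bucket, the raw estimate is E = alpha_m * m^2 / sum_j 2^(-M[j]), with alpha_m ~ 0.7213 / (1 + 1.079/m) for large m. Below is a minimal sketch of that raw estimator; the bucket-bit count, 32-bit hash, and helper names are choices made for this illustration, and production implementations (including Redis) additionally apply small- and large-range corrections.

```python
import hashlib

B = 10                             # bits used for the bucket index
M = 1 << B                         # m = 1024 buckets
ALPHA = 0.7213 / (1 + 1.079 / M)   # bias-correction constant for large m

def h32(element: str) -> int:
    """Hash an element to a 32-bit integer."""
    return int.from_bytes(hashlib.sha1(element.encode()).digest()[:4], "big")

def rho(x: int) -> int:
    """1-based position of the lowest set bit; sentinel when all bits are 0."""
    return (x & -x).bit_length() if x else 33 - B

def hll_estimate(elements) -> float:
    buckets = [0] * M
    for e in elements:
        h = h32(e)
        j = h & (M - 1)                      # low B bits pick the bucket
        w = h >> B                           # remaining bits are the coin flips
        buckets[j] = max(buckets[j], rho(w)) # keep the max first-1 position
    # Harmonic-mean combination: E = alpha * m^2 / sum(2 ** -M[j])
    return ALPHA * M * M / sum(2.0 ** -b for b in buckets)

n = 100_000
est = hll_estimate(f"user{i}" for i in range(n))
print(f"true={n} estimate={est:.0f} error={abs(est - n) / n:.2%}")
```

With m = 1024 buckets the expected standard error is about 1.04 / sqrt(1024), roughly 3%, using only 1024 small registers of memory.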

3.2 HyperLogLog in Redis

Redis implements HyperLogLog as a data type. Each element is hashed to a 64-bit value; the lower 14 bits give the bucket index (so there are 1<<14 = 16384 buckets) and the remaining 50 bits simulate the Bernoulli trial. Each bucket needs 6 bits, so a HyperLogLog occupies about 16384 x 6 bits, roughly 12 KB of memory, can count up to 2^64 elements, and has a standard error of 0.81%.

Redis supports only three HyperLogLog commands: PFADD, PFCOUNT, and PFMERGE. The syntax is as follows:

  • Pfadd: Adds all element parameters to the HyperLogLog data structure

PFADD key element1 [element2…]

Returns 1 if the approximate cardinality changed as a result of the call, 0 otherwise

If no element is specified, the HyperLogLog key is simply created if it does not exist

127.0.0.1:6379> pfadd key1 ele1 ele2
(integer) 1
127.0.0.1:6379> pfadd key1
(integer) 0
127.0.0.1:6379> pfadd key2
(integer) 1
  • Pfcount: Returns the cardinality estimate for the given HyperLogLog

Syntax: PFCOUNT key1 [key2…]

Returns the cardinality of the given HyperLogLog. If multiple keys are given, returns the cardinality of their union.

127.0.0.1:6379> pfcount key1
(integer) 0
127.0.0.1:6379> pfadd key1 ele1 ele2
(integer) 1
127.0.0.1:6379> pfadd key2 ele1 ele3
(integer) 1
127.0.0.1:6379> pfcount key1
(integer) 2
127.0.0.1:6379> pfcount key1 key2
(integer) 3
  • Pfmerge: Merges multiple HyperLogLogs into one

PFMERGE destkey sourcekey1 [sourcekey2…]

Merges the given source keys into destkey. If destkey does not exist, it is created.

Returns OK

127.0.0.1:6379> pfadd key1 ele1 ele2
(integer) 1
127.0.0.1:6379> pfadd key2 ele1 ele3
(integer) 1
127.0.0.1:6379> pfcount key3
(integer) 0
127.0.0.1:6379> pfmerge key3 key1 key2
OK
127.0.0.1:6379> pfcount key3
(integer) 3

3.3 Application Scenarios of HyperLogLog

As a cardinality estimation algorithm for massive data, HyperLogLog shines in scenarios such as counting registered users, daily visiting IP addresses, and real-time online users.

  • Counting website UV (Unique Visitors)

For a web page, we want to know how much attention it gets, so we count how many distinct users (IPs) visit it. To do this, we keep one record per time period; for example, 127.0.0.1 is an IP that visited the page at 1:00 on January 1, 2021:

pfadd key_prefix_2021010101 127.0.0.1

To count how many IP addresses visited this page in the hour from 0:00 to 1:00 that day:

pfcount key_prefix_2021010101

To count page visits from 8:00 to 12:00:

pfcount key_prefix_2021010109 ... key_prefix_2021010112

At the end of the day, to tally and save the whole day's visits:

pfmerge key_prefix_2021010101 ...... key_prefix_2021010124

For a popular page, this counting method clearly saves a great deal of storage space.
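To make the PFADD/PFCOUNT semantics concrete, the per-hour scheme above can be mimicked with exact Python sets: set union corresponds to PFCOUNT over several keys. The helper names are invented for this sketch; a real deployment would use HyperLogLogs to get the same answers approximately in fixed memory.

```python
from collections import defaultdict

# One exact set per hour key, standing in for the per-hour HyperLogLogs.
hourly = defaultdict(set)

def pfadd(hour_key: str, ip: str) -> None:
    """Like PFADD key ip: record a visit in that hour's structure."""
    hourly[hour_key].add(ip)

def pfcount(*hour_keys: str) -> int:
    """Like PFCOUNT key1 key2 ...: cardinality of the union of the keys."""
    return len(set().union(*(hourly[k] for k in hour_keys)))

pfadd("key_prefix_2021010101", "127.0.0.1")
pfadd("key_prefix_2021010101", "10.0.0.2")
pfadd("key_prefix_2021010102", "127.0.0.1")  # same visitor, next hour

print(pfcount("key_prefix_2021010101"))                           # 2
print(pfcount("key_prefix_2021010101", "key_prefix_2021010102"))  # 2, deduped
```

The second query shows why PFCOUNT over multiple keys (or a PFMERGE first) is the right tool for multi-hour UV: a visitor active in several hours is still counted once.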

  • User portrait

A user portrait attaches a series of labels to a user, such as gender, age, and hobbies, based on the traces the user leaves on the Internet. During data analysis, HyperLogLog can be used to store and analyse this data.

1. For each tag value, create a HyperLogLog key to hold its data, e.g. man, woman, basketball… One key is created per value that needs to be recorded.

2. For each new user, use PFADD to add the user to the keys of all tags that apply.

3. During data analysis, use PFCOUNT to compute the statistics to be analysed.
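The three steps above can be sketched with exact sets standing in for the per-tag HyperLogLog keys (the tag values and helper names here are invented for illustration):

```python
# Step 1: one dedup structure per tag value (a HyperLogLog key in Redis).
tags = {"man": set(), "woman": set(), "basketball": set()}

def tag_user(user_id: str, user_tags: list) -> None:
    # Step 2: add the user to every applicable tag key (like PFADD tag user_id).
    for t in user_tags:
        tags[t].add(user_id)

tag_user("u1", ["man", "basketball"])
tag_user("u2", ["woman"])
tag_user("u3", ["man"])

# Step 3: analyse with counts (like PFCOUNT man).
print(len(tags["man"]))  # 2 users carry the "man" tag
```

Because each tag key deduplicates internally, re-tagging the same user is harmless, and per-tag counts stay correct as data is replayed or updated.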

4. Gauss Redis's advantages in counting

4.1 Problems with open source Redis

In production environments, to avoid single points of failure and improve availability, Redis usually replicates data across servers. Under heavy concurrency, a write-to-primary / read-from-replica pattern maximizes use of both primary and replica resources. But because Redis primary/replica synchronization is asynchronous, a replica is not guaranteed to have the primary's latest writes; a read at that moment returns stale data, i.e. the data is inconsistent.

When the primary node crashes, the inconsistency gets worse. After the primary fails, the sentinel promotes a replica to primary, and the writes still buffered on the old primary are lost entirely. In an e-commerce flash-sale service, if the primary's replication buffer has backed up, the replica's counter will be far larger than the primary's. Once the primary crashes and a failover happens, the traffic let through can easily exceed the threshold, and the flood of requests may crush MySQL and bring the system down.

4.2 How Gauss Redis solves these problems

Gauss Redis adopts the GaussDB family's "storage and compute separation" architecture: the full data set sinks into a strongly consistent storage layer (DFV Pool), completely abandoning open source Redis's asynchronous replication. The compute layer shards the massive data and automatically takes over on failure, achieving high service availability.

DFV Pool is a company-level data lake developed by Huawei, an advanced architecture that is distributed, strongly consistent, and high-performance. At the bottom layer, three-replica strong consistency guarantees that data is consistent at every point in time and is not lost on failure, meeting the absolute-accuracy requirement of counting for services such as flash sales. Thanks to the storage/compute separation, Gauss Redis also offers low cost, large capacity, and instant scaling.

5. Conclusion

Gauss Redis builds on community Redis and Huawei's self-developed DFV Pool to deliver strong consistency, second-level scaling, high availability, and low cost, ensuring that counting is accurate and reliable.

Author: Huawei Cloud Gauss Redis team.

For more technical articles, follow the Gauss Redis official blog: bbs.huaweicloud.com/community/u…

6. References

1. Redis Application Scenarios – Counters
blog.csdn.net/nklinsirui/…

2. Explanation of the HyperLogLog Algorithm and How Redis Applies It
juejin.cn/post/684490…

3. Experimental Results and Practical Suggestions for Five Common Cardinality Estimation Algorithms
blog.codinglabs.org/articles/ca…

4. From Acquaintance to Mutual Pity: Redis and the Compute and Storage Separation Tetralogy
bbs.huaweicloud.com/blogs/25304…

5. Huawei Cloud PB-level Database GaussDB(for Redis) Revealed, Issue 7: Gauss Redis and Strong Consistency
bbs.huaweicloud.com/blogs/25688…
