I. Business background

Recently we received a statistics request: record the UV (unique visitors) of every video view, so that we can measure how each video performs. When the volume was small there was no problem, but as the user base grew, the space cost became an issue. Suppose there are 10 million users and we record UV with a SET structure, storing only an int user ID: 10,000,000 × 4 bytes / 1024 / 1024 ≈ 38.14 MB, and that is the data for a single video for a single day. Multiplied across a large video catalog, the storage cost is clearly too high, so we had to find a solution that meets the requirement while occupying very little space.

II. Technology selection

After some investigation, we found the following candidate solutions and analyzed them one by one.

1. Deduplication with Flink state

MapState is one of the keyed state types in Flink. Deduplicating this way gives an exact result: the device ID is saved in MapState, and the state is stored directly in RocksDB. RocksDB is a KV store similar to HBase; with its good local performance, maintaining a large state set is not a problem.

The advantage of this approach is very high accuracy. The drawback is that as the state grows, checkpoint efficiency and success rate suffer. In addition, the state needs to be cleaned up regularly to remove data that has been kept for a long time.
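
As a minimal sketch of this approach (my illustration, not code from the original post): assuming a stream of (videoId, userId) tuples already keyed by videoId, a RichFlatMapFunction with MapState passes through only first-time views, and counting those downstream yields the exact UV.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Keyed by videoId; emits each (videoId, userId) only the first time it is seen.
public class FirstSeenFilter extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {
    // userId -> seen flag; lives in the configured state backend (e.g. RocksDB)
    private transient MapState<Long, Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("seen-users", Long.class, Boolean.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> view, Collector<Tuple2<String, Long>> out) throws Exception {
        if (!seen.contains(view.f1)) { // exact check: has this user viewed this video before?
            seen.put(view.f1, true);
            out.collect(view);         // only first views pass through; counting them gives exact UV
        }
    }
}

Usage would be along the lines of stream.keyBy(v -> v.f0).flatMap(new FirstSeenFilter()), with a downstream counter per video.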

2. Deduplication with a bitmap

The idea behind a bitmap is to use a single bit to record a state, which suits scenarios with a massive number of elements but very little state per element. Suppose we use a Redis bitmap: we first map each user ID to an integer offset through some algorithm, each ID occupies one bit at that offset, and the bit is set to 1 once the user has visited.

The advantages of this scheme are a small memory footprint, convenient queries, high efficiency, and exact deduplication. The disadvantage is that an offset must be computed for every ID, and if the offsets are very large or sparse, a lot of space is wasted.
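
A minimal sketch of this scheme with Jedis (the key name and IDs are made up for illustration), using SETBIT and BITCOUNT:

import redis.clients.jedis.Jedis;

// Bitmap dedup: one bit per user ID; assumes IDs are small non-negative integers.
public class BitmapUv {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost")) {
            long[] userIds = {1, 42, 42, 7, 1};           // duplicates on purpose
            for (long uid : userIds) {
                jedis.setbit("video:1001:uv", uid, true); // set the bit at offset uid
            }
            // BITCOUNT returns the number of set bits, i.e. the exact UV
            System.out.println(jedis.bitcount("video:1001:uv")); // 3
        }
    }
}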

3. Bloom filter

The Bloom filter consists of a long binary vector and a series of random mapping (hash) functions. It suits scenarios where you mainly need to rule elements out: if it judges an element absent, the element is definitely absent; if it judges an element present, there is some probability that the element is actually absent (a false positive). When computing UV, a separate variable holds the count, and the Bloom filter decides whether to increment it.

The advantages of this scheme are high space efficiency and short query time. The disadvantages are the false-positive rate, and that when counting along multiple dimensions, each dimension must be allocated the same initial space up front, which adds up to a lot of memory.
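
A minimal sketch of this counting pattern, here using Guava's BloomFilter as one possible implementation (the post does not prescribe a library):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

// Approximate UV: a counter guarded by a Bloom filter membership check.
public class BloomUv {
    public static void main(String[] args) {
        BloomFilter<Long> seen = BloomFilter.create(
                Funnels.longFunnel(), 10_000_000, 0.01); // 10M expected users, 1% false-positive rate
        long uv = 0;
        long[] userIds = {1L, 42L, 42L, 7L, 1L};
        for (long uid : userIds) {
            if (!seen.mightContain(uid)) { // "absent" is always correct; "present" may be wrong,
                seen.put(uid);             // so the UV can only be slightly under-counted
                uv++;
            }
        }
        System.out.println(uv); // 3 (with high probability)
    }
}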

4. HyperLogLog

HyperLogLog is an algorithm designed for cardinality statistics: it estimates the number of distinct elements in a set. It does not store the original data, only the count, and it supports very large input volumes. Redis ships a HyperLogLog type that uses 12 KB of memory to count up to 2^64 elements with a standard error of 0.81%.

The advantage of this scheme is that it takes up little space, and the space does not grow as the cardinality grows. The disadvantage is the standard error of 0.81%, and that it can only count; it cannot tell you whether a specific element is present.

Comparing the schemes above, scheme 4 best fits our scenario: 10 million users, hundreds of millions of videos to count, and a certain error rate is acceptable. Memory stays within 12 KB per counter, and query speed also meets the requirement.

III. HyperLogLog principle analysis

1. Principle description

The HyperLogLog algorithm comes from the paper "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm". To understand how HLL works, start from the Bernoulli experiment, which here means flipping a fair coin repeatedly until the first head appears: one such run, no matter how many flips it takes, counts as one Bernoulli experiment.

Let k be the number of flips in one experiment (until the first head), n the number of experiments, and k_max the largest k observed across all experiments. It turns out that n and k_max are related by n ≈ 2^(k_max). But there is a problem: when the number of trials is small, the error of this estimate is large. For example, suppose we run the following three experiments:

  • Experiment 1: the first head appears on flip 3, so k=3, n=1;
  • Experiment 2: the first head appears on flip 2, so k=2, n=2;
  • Experiment 3: the first head appears on flip 6, so k=6, n=3.

For these three experiments, k_max=6 and n=3, but plugging k_max into the estimation formula gives 2^6 = 64, and clearly 3 ≠ 2^6. To solve this problem, HLL introduces a bucketing scheme and the harmonic mean to bring the estimate closer to the true value. A toy simulation of this single-estimator noise is sketched below.
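
The simulation (my illustration) runs n Bernoulli experiments, records k_max, and compares n with 2^k_max; the scatter in the output is exactly what bucketing and the harmonic mean are meant to smooth out:

import java.util.Random;

// Run n "flip until first head" experiments and compare n with 2^k_max.
public class BernoulliSim {
    public static void main(String[] args) {
        Random rnd = new Random();
        for (int n : new int[]{10, 1_000, 100_000}) {
            int kMax = 0;
            for (int i = 0; i < n; i++) {
                int k = 1;                      // at least one flip
                while (!rnd.nextBoolean()) k++; // tails: flip again until the first head
                kMax = Math.max(kMax, k);
            }
            System.out.printf("n=%d  k_max=%d  2^k_max=%d%n", n, kMax, 1L << kMax);
        }
    }
}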

The bucketing scheme means the incoming data is spread evenly across m buckets, an estimate is computed per bucket, the bucket estimates are averaged, and the average is multiplied by m. This reduces the error caused by chance and improves accuracy. In simple terms, one data set is split into many parts, and one round of estimation into many rounds.

The harmonic mean refers to averaging via reciprocals instead of taking the arithmetic mean directly, which dampens the influence of outliers.

For example, Xiao Ming’s monthly salary is 1,000 yuan, while Xiao Wang’s is 100,000 yuan. The arithmetic mean is (1,000 + 100,000)/2 = 50,500 yuan, which is obviously not representative. The harmonic mean gives 2/(1/1000 + 1/100000) ≈ 1,980 yuan, which is much closer to the typical value.
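
The salary example can be checked in a couple of lines (my illustration):

public class HarmonicMean {
    public static void main(String[] args) {
        double a = 1_000, b = 100_000;
        System.out.println((a + b) / 2);         // arithmetic mean: 50500.0
        System.out.println(2 / (1 / a + 1 / b)); // harmonic mean: ~1980.2
    }
}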

2. The specific calculation process

There is an interactive demo of this calculation online: content.research.neustar.biz/blog/hll.html

  • Step 1: take the input value 3157369, hash it to 2297270962, and write the hash in binary: 10001000111011011001001010110010.
  • Step 2: use the last 6 bits of the binary number as the bucket index (the demo draws the 64 buckets as a 16×4 grid). Why 6 bits? Because 2^6 = 64, the number of buckets. Here the index is 110010 = 50, and the red square in the demo marks that bucket.
  • Step 3: in the remaining bits, find the first 1 from right to left; the trailing bits are 10, so its position is 2.
  • Step 4: many different values are scattered across the buckets this way, and each bucket keeps its own k_max. The count is then an estimate: combining the k_max of all buckets and substituting into the estimation formula yields the estimated cardinality. (The extraction steps are reproduced in the sketch below.)
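
The three extraction steps above can be reproduced in a few lines of Java (my illustration; the demo's 64-bucket layout is assumed):

public class HllSteps {
    public static void main(String[] args) {
        long hash = 2297270962L;
        // Step 1: the hash in binary
        System.out.println(Long.toBinaryString(hash)); // 10001000111011011001001010110010
        // Step 2: the low 6 bits select one of 2^6 = 64 buckets
        long index = hash & 0x3F;
        System.out.println(index);                     // 50
        // Step 3: position of the first 1 (from the right) in the remaining bits
        long rest = hash >>> 6;
        int k = Long.numberOfTrailingZeros(rest) + 1;
        System.out.println(k);                         // 2
    }
}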

IV. HyperLogLog in practice with Redis

The Redis HyperLogLog implementation uses 16384 buckets, that is 2^14. Each bucket stores its maximum in 6 bits (so maxbits is at most 63), giving a total memory footprint of 2^14 × 6 / 8 = 12 KB.

HyperLogLog provides two commands, PFADD and PFCOUNT, whose names say it all: one adds to the count, the other retrieves it. PFADD is used in the same way as SADD on a set: when a user ID arrives, insert it. PFCOUNT is used in the same way as SCARD.

127.0.0.1:6379> PFADD runoobkey "redis"

(integer) 1

127.0.0.1:6379> PFADD runoobkey "mongodb"

(integer) 1

127.0.0.1:6379> PFADD runoobkey "mysql"

(integer) 1

127.0.0.1:6379> PFCOUNT runoobkey

(integer) 3

We then increased the volume to 100,000 entries to see how large the total error would be.

import redis.clients.jedis.Jedis;

public class Test {
    public static void main(String[] args) {
        // Connect to local Redis
        Jedis jedis = new Jedis("localhost");
        System.out.println(jedis.getClient().getPort());
        System.out.println("Local Redis server connected successfully");
        // Insert 100,000 distinct user IDs into one HyperLogLog key
        for (int i = 0; i < 100000; i++) {
            jedis.pfadd("my-user-uv", "user" + i);
        }
        // Compare the estimated count with the true cardinality
        long total = jedis.pfcount("my-user-uv");
        System.out.printf("%d %d%n", 100000, total);
        jedis.close();
    }
}

The output was 99725, a difference of 275, or 0.275%, which is not a high error rate for the UV statistics requirement above. We then ran the program again, which is equivalent to inserting the same data a second time, and looked at the output: the PFCOUNT result did not change at all, still 99725, showing that the deduplication really works.
