Author: Lin Guanhong/The Ghost at my Fingertips

Juejin: juejin.cn/user/178526…

Blog: www.cnblogs.com/linguanh/

GitHub: github.com/af913337456…

Tencent cloud column: cloud.tencent.com/developer/u…

Chongdong (Wormhole) blockchain community column: www.chongdongshequ.com/article/153…


Contents

  • The prototype problem
  • Choosing a solution
  • HyperLogLog
  • Bernoulli trials
  • Optimizing the estimate
  • Making the connection
    • Bit strings
    • Bucketing
    • Mapping
  • Application of HyperLogLog in Redis
    • The HyperLogLog principle in Redis
  • Bias correction
  • Standing on the shoulders of giants

PS: My technical book, “Blockchain Ethereum DApp Development in Action”, has been published and is available online.

The prototype problem

If you want to implement such a function:

Count how many distinct users click on a given page of an app or website each day; repeated clicks by the same user count only once.

You might immediately think that a data structure like HashMap would suffice for de-duplication. Yes, that is one solution, but there are others.

The problem is not difficult, but once the quantities involved reach a certain order of magnitude, even a simple problem becomes hard. Assuming the app has a million or more active users, a HashMap will consume a large amount of the application's memory.

Let's estimate the memory footprint of a HashMap for this problem. Suppose the key is a string and the value a bool: the key holds the user id, and the value records whether the user clicked. Clearly, with millions of distinct visitors, the HashMap's memory footprint is roughly 1 million * (size of a string + size of a bool).
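A minimal sketch of this HashMap approach (the class and method names here are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Naive exact counting: one HashMap entry per distinct user id.
// Memory grows linearly with the number of distinct users.
class NaiveUvCounter {
    private final Map<String, Boolean> clicked = new HashMap<>();

    // Record one click; repeated clicks by the same user count once.
    void recordClick(String userId) {
        clicked.put(userId, Boolean.TRUE);
    }

    int uniqueClicks() {
        return clicked.size();
    }

    public static void main(String[] args) {
        NaiveUvCounter c = new NaiveUvCounter();
        c.recordClick("user-1");
        c.recordClick("user-2");
        c.recordClick("user-1"); // duplicate, still counts once
        System.out.println(c.uniqueClicks()); // prints 2
    }
}
```

Simple and exact, but every distinct user id is held in memory, which is exactly the cost the rest of this article works to avoid.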

Choosing a solution

Arguably, among the existing solutions to this problem, HashMap is the most memory-intensive. If the volume of data to count is small, this approach solves the problem and is easy to implement.

Beyond that, there are also B+ trees, bitmaps, and the HyperLogLog algorithm that this article focuses on.

Under certain conditions — if, in the face of a huge amount of data, an error rate within an acceptable range is allowed, say 10 million page views where the final count may be off by ten or twenty thousand — then you can use the HyperLogLog algorithm to solve counting problems like the one above.

HyperLogLog

HyperLogLog, hereafter HLL, is an upgraded version of the LogLog algorithm. It provides approximate de-duplicated counting, with the following characteristics:

  • The code is difficult to implement.
  • It can count huge amounts of data with very little memory: the HyperLogLog implementation in Redis needs only 12 KB of memory to count up to 2^64 values.
  • The count carries some error, but the overall error rate is low: the standard error is 0.81%.
  • The error can be reduced by setting auxiliary computing factors.

Anyone with a rough idea of the memory footprint of basic data types will be surprised that it can count 2^64 values with only 12 KB of memory. Why? Here's a comparison:

In Java, a long occupies 8 bytes, and 1 byte = 8 bits, so the largest value a long can represent is 2^63 - 1. To match the 2^64 above, suppose we stored all the numbers from 0 to 2^63 - 1 as longs. With 1 KB = 1024 bytes, the total memory would be 2^63 * 8 / 1024 KB — an enormous amount of storage, far beyond 12 KB. HyperLogLog, on the other hand, does it in 12 KB.

Bernoulli trials

Before we can understand how HyperLogLog counts huge amounts of data with so little memory, we need to look at Bernoulli trials.

The Bernoulli trial comes from mathematical probability theory, and its classic illustration is flipping a coin.

A coin has two sides, so each flip has a 50% chance of heads and 50% of tails. Suppose we keep flipping a coin until it comes up heads, and record that as one complete trial: it might take a single flip, or it might take four. However many flips it takes, once heads appears the trial is complete. This is called a Bernoulli trial.

Now suppose we perform multiple Bernoulli trials — say n of them — meaning we got heads n times. Let the number of flips in each trial be k: k_1 for the first trial, and so on up to k_n for the nth.

Among these n trials there must be a maximum number of flips k — say, 12 flips before heads appeared — which we call k_max.

Two conclusions follow easily from the Bernoulli trials:

  1. No trial among the n takes more than k_max flips.
  2. At least one of the n trials takes exactly k_max flips.

Finally, combined with the method of maximum likelihood estimation, we find an estimation relationship between n and k_max: n = 2^(k_max). Predicting a property of the whole data stream from such local information seems beyond basic intuition; it takes probability and statistics to derive and verify the correlation.
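A quick simulation of this relationship, using java.util.Random as the fair coin (class and method names are illustrative). With many trials, 2^k_max lands in the broad vicinity of n, though any single run can be far off:

```java
import java.util.Random;

// Simulate n Bernoulli trials (flip until heads) and track k_max,
// the largest number of flips any single trial needed.
class BernoulliTrials {
    // One trial: flips until the first heads (nextBoolean() == true).
    static int flipsUntilHeads(Random rnd) {
        int k = 1;
        while (!rnd.nextBoolean()) k++;
        return k;
    }

    // Run n trials with a fixed seed and return the largest flip count seen.
    static int kMax(int n, long seed) {
        Random rnd = new Random(seed);
        int kMax = 0;
        for (int i = 0; i < n; i++) kMax = Math.max(kMax, flipsUntilHeads(rnd));
        return kMax;
    }

    public static void main(String[] args) {
        int n = 1 << 16; // 65536 trials
        int kMax = kMax(n, 42L);
        // 2^k_max is a rough estimate of n
        System.out.println("n=" + n + " k_max=" + kMax + " estimate=" + (1L << kMax));
    }
}
```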

For example:

Trial 1: 3 flips before heads, so k_1 = 3, n = 1
Trial 2: 2 flips before heads, so k_2 = 2, n = 2
Trial 3: 6 flips before heads, so k_3 = 6, n = 3
...
Trial n: 12 flips before heads, so n is estimated as 2^12

Assuming the number of trials in the example above is 3 (the first three), then k_max = 6 while the true n = 3. Plugging into the estimation formula: clearly, 3 ≠ 2^6. In other words, when the number of trials is small, the error of this estimation method is very large.

Optimizing the estimate

In the example above, we call such a group of trials one round of estimation. If you do only one round, then when n is large enough the estimate's error rate is relatively small — but still not small enough.

Can we do multiple rounds? For example, run 100 rounds or more, take the k_max of each round, and average them: R̄ = (k_max_1 + … + k_max_m) / m. Then estimate n from that average. This gives the LogLog estimation formula:

DV_LL = constant * m * 2^R̄

Here DV_LL is the estimate of n; constant is a correction factor whose exact value is not fixed and can be chosen case by case; m is the number of rounds; and R̄ (R with a bar) is the average of the per-round k_max values.

This optimization — increasing the number of rounds and then averaging the k_max values — is LogLog's approach. HyperLogLog differs from LogLog in that it uses the harmonic mean rather than the arithmetic mean. The harmonic mean's advantage is that it is less susceptible to large outliers. An example:

Computing an average salary:

A earns 1,000/month and B earns 30,000/month. The arithmetic mean is (1000 + 30000) / 2 = 15500.

The harmonic mean is 2 / (1/1000 + 1/30000) ≈ 1935.484.

Obviously, the harmonic mean handles the outlier better than the arithmetic mean. In general, the harmonic mean of x_1 … x_m is m / Σ(1/x_i), that is, m divided by the sum of the reciprocals.
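The salary comparison above can be checked in code (the class and method names are just for this example):

```java
// Arithmetic vs harmonic mean for the salary example: the harmonic mean
// m / (sum of 1/x_i) is far less sensitive to the large outlier.
class Means {
    static double arithmetic(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    static double harmonic(double[] xs) {
        double recipSum = 0;
        for (double x : xs) recipSum += 1.0 / x;
        return xs.length / recipSum;
    }

    public static void main(String[] args) {
        double[] salaries = {1000, 30000};
        System.out.println(arithmetic(salaries)); // prints 15500.0
        System.out.println(harmonic(salaries));   // ~1935.48
    }
}
```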

Making the connection

We already know that, in the coin-toss setting, n can be estimated from the k_max observed across the Bernoulli trials.

So how does this estimation relate to the following question?

Count how many distinct users click on a given page of an app or website each day; repeated clicks by the same user count only once.

HyperLogLog does exactly this. For each piece of input data, it performs the following steps:

1. Bit strings

The data is converted into a bit string by a hash function. For example, the input 5 becomes 101. Why do this?

Because it corresponds to a coin flip: in the bit string, 0 is tails and 1 is heads. If a value is hashed to, say, 10010000, then reading from right to left — from the low bit to the high bit — the position of the first 1 marks the first time heads appears.

Therefore, based on the estimation conclusion above: just as the maximum number of flips before heads estimates the total number of coin-flip trials, the maximum position k_max at which the first 1 appears across the hashed values estimates how many distinct values were stored.
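Reading the first 1 from the low end can be done with Long.numberOfTrailingZeros; a small sketch (firstOnePos is a name made up for this example):

```java
// Position (1-based, counted from the low end) of the first 1 bit in a
// hashed value — the "number of flips until heads" for that bit string.
class FirstOne {
    static int firstOnePos(long hash) {
        if (hash == 0) return 0; // no 1 bit at all
        return Long.numberOfTrailingZeros(hash) + 1;
    }

    public static void main(String[] args) {
        System.out.println(firstOnePos(0b101L));      // prints 1
        System.out.println(firstOnePos(0b10010000L)); // prints 5
    }
}
```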

2. Bucketing

Bucketing decides how many rounds there are. Abstracted into computer storage: take a large bit array S of length L (in bits), divide S evenly into m groups — this m is the number of rounds — and let each group occupy the same number of bits, p. The following relations are easy to see:

  • L = S.length
  • L = m * p
  • The memory occupied by S, in KB, is L / 8 / 1024

In Redis, HyperLogLog uses m = 16384, p = 6, L = 16384 * 6. The memory used is 16384 * 6 / 8 / 1024 = 12 KB.

Visualize as:

Group 0    Group 1    ....    Group 16383
[000 000]  [000 000]  ....    [000 000]
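The three relations above can be written down directly (the class and method names are illustrative):

```java
// The bucket-layout relations from the text: L = m * p bits in total,
// and the memory footprint is L / 8 / 1024 KB.
class BucketLayout {
    static int totalBits(int m, int p) {
        return m * p; // L = m * p
    }

    static double memoryKB(int m, int p) {
        return totalBits(m, p) / 8.0 / 1024.0;
    }

    public static void main(String[] args) {
        // Redis: m = 16384 buckets of p = 6 bits each.
        System.out.println(memoryKB(16384, 6)); // prints 12.0
    }
}
```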

3. Mapping

Now back to the question of counting users on our original APP page.

  • Let the key of the APP home page be main
  • User ids are id_n, n = 0, 1, 2, 3, …

In this statistical problem, each distinct user id identifies one user, so we use the user id as the input to be hashed. That is:

hash(id) = bit string

Different user ids yield different bit strings, and every bit string must contain at least one 1. We treat each bit string as one Bernoulli trial.

Now we split the trials into rounds, which means dividing them into buckets. We decide that some number of low-order bits of each bit string, read as a base-10 number, gives the bucket index. Suppose the lowest two bits select the bucket, and a user id hashes to the bit string 1001011000011. Its bucket index is 11(binary) = 1*2^1 + 1*2^0 = 3: it goes into bucket number 3, i.e., round number 3.

In the example above, after removing the bucket bits, the remaining bit string is 10010110000. From low to high, the first 1 appears at position 5. In other words, in bucket number 3, k_max = 5 so far. The binary for 5 is 101, and since each bucket has p bits, any p >= 3 can store it.
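The two-step split described above — the low bits pick the bucket, the first 1 of the rest gives the trial's k — can be sketched as follows, using the example bit string from the text (class and method names are illustrative):

```java
// Split a hashed bit string into a bucket index (low idxBits bits) and a
// rank (1-based position of the first 1 in the remaining bits).
class BucketMapping {
    static int bucketIndex(long bits, int idxBits) {
        return (int) (bits & ((1L << idxBits) - 1));
    }

    static int rank(long bits, int idxBits) {
        long rest = bits >>> idxBits;
        return rest == 0 ? 0 : Long.numberOfTrailingZeros(rest) + 1;
    }

    public static void main(String[] args) {
        long bits = 0b1001011000011L; // the example bit string from the text
        System.out.println(bucketIndex(bits, 2)); // prints 3
        System.out.println(rank(bits, 2));        // prints 5
    }
}
```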

Following this process, many user ids are split across the different buckets, and each bucket maintains its own k_max. Then, when it comes to figuring out how many users visited the main page, it is an estimation: the k_max values of all buckets are plugged into the estimation formula to produce the estimate.

Here is HyperLogLog's estimation formula using the harmonic mean (the variables are as in the LogLog formula):

DV_HLL = constant * m^2 / (2^(-k_max_1) + … + 2^(-k_max_m))

Application of HyperLogLog in Redis

First, HyperLogLog is one of Redis's advanced data structures. It provides the following commands, among others:

  • pfadd key value: records a value under the key
  • pfcount key: returns the (approximate) number of distinct values recorded under the key

Recall the original page-counting problem: let key be the page name and value be the user id, and the problem maps onto these commands perfectly.

The HyperLogLog principle in Redis

We already know that in its implementation there are 16384 buckets (2^14 = 16384), each of 6 bits. The largest number a bucket can hold is 2^5 + 2^4 + … + 2^0 = 63, binary 111 111.

For the pfadd key value command:

When a value is added, it is hashed into a 64-bit bit string. The first 14 bits are used to select the bucket: converted to base 10, they give the bucket index.

14 bits are used because there are 16384 buckets and 2^14 = 16384, which addresses every bucket exactly with none wasted. Suppose the first 14 bits of a value's bit string are 00 0000 0000 0010, decimal 2: that value's index will be recorded in bucket number 2.

The index calculation rule:

The full bit string is 64 bits; subtracting the 14 bucket bits leaves 50, so in the extreme case the first 1 sits at position 50 and index = 50. In binary, 50 is 110010.

Since each of the 16384 buckets is 6 bits, 110010 fits, just as it was set in bucket number 2 above. Note that 50 is already the worst case and it fits, so every smaller index fits as well.

A key can receive multiple values via pfadd. For example:

pfadd lgh golang
pfadd lgh python
pfadd lgh java

As above, different values land in different buckets. If two values land in the same bucket — their first 14 bits are identical — the positions of the first 1 in their remaining bits may differ. So the stored index is compared with the new one: if the new index is larger, it replaces the old one; otherwise nothing changes.

Finally, the 16384 buckets of a key have received many values, and each bucket holds one k_max. When pfcount is called, the estimation method described earlier computes approximately how many distinct values were added to the key.
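The bucket-update procedure described above can be sketched as a toy in code. This is illustrative only, not Redis's implementation: the hash is a splitmix64-style mixer chosen for the example, and the low 14 bits serve as the bucket index (the "first" bits when reading the bit string from low to high, as this article does):

```java
// Toy sketch of pfadd-style register updates (not Redis's actual hash or
// encoding): 14 bits pick one of 16384 buckets, and the bucket keeps the
// largest first-1 position seen in the remaining 50 bits.
class ToyHll {
    static final int P = 14, M = 1 << P; // 16384 buckets
    final int[] regs = new int[M];

    // Illustrative 64-bit mixer (splitmix64-style), NOT Redis's hash.
    static long hash64(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    void add(long id) {
        long h = hash64(id);
        int bucket = (int) (h & (M - 1)); // 14 bits -> bucket index
        long rest = h >>> P;              // remaining 50 bits
        // First-1 position; rest == 0 practically never happens, clamp to max.
        int rank = rest == 0 ? 51 : Long.numberOfTrailingZeros(rest) + 1;
        if (rank > regs[bucket]) regs[bucket] = rank; // keep only the max
    }

    public static void main(String[] args) {
        ToyHll t = new ToyHll();
        t.add(1L);
        t.add(1L); // duplicate id -> same bucket, same rank, no change
        int nonZero = 0;
        for (int r : t.regs) if (r > 0) nonZero++;
        System.out.println(nonZero); // prints 1
    }
}
```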

Each value is converted to a 64-bit bit string and recorded into a bucket as described above. 64 bits can represent 2^64 values, so HyperLogLog can count up to 2^64 values using only 16384 * 6 / 8 / 1024 = 12 KB of storage.

Bias correction

The constant in the estimation formula is not a fixed value; it is chosen case by case, as below.

Suppose: m is the number of buckets, and P is the logarithm base 2 of m.

// m is the number of buckets, p = log2(m)
switch (p) {
   case 4:
       constant = 0.673 * m * m;
       break;
   case 5:
       constant = 0.697 * m * m;
       break;
   case 6:
       constant = 0.709 * m * m;
       break;
   default:
       constant = (0.7213 / (1 + 1.079 / m)) * m * m;
}
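Putting the constant together with the harmonic mean gives a toy estimator. This is a sketch of the raw HLL formula only — real implementations, including Redis's, add further small- and large-range corrections on top (class and method names are illustrative):

```java
// Toy raw HLL estimate: constant * m^2 / sum(2^-register), i.e. the
// corrected constant times the harmonic mean of the 2^k_max values.
class ToyEstimator {
    static double estimate(int[] regs) {
        int m = regs.length;
        double sum = 0;
        for (int r : regs) sum += Math.pow(2, -r);
        double constant = 0.7213 / (1 + 1.079 / m); // default branch above
        return constant * m * m / sum;
    }

    public static void main(String[] args) {
        int[] regs = new int[16384]; // all-zero registers
        // Raw formula gives ~constant * m here; real implementations switch
        // to linear counting in this small range instead.
        System.out.println(estimate(regs));
    }
}
```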

Standing on the shoulders of giants

Mathematics is powerful indeed: a simple coin toss leads to such an impressive algorithm.

Thanks for the following two posts:

All images in this article are from:

www.jianshu.com/p/55defda6d…

The content of this paper is referenced to:

www.rainybowe.com/blog/2017/0…

Site for interactively visualizing how LogLog and HyperLogLog evolve:

content.research.neustar.biz/blog/hll.ht…