Preface

Good brothers, the previous chapter was "Do you know how to Redis Bitmaps". Have you gotten to know Bitmaps yet? What, you haven't read it? Well, don't just stand there, go take a look, and be sure to like and follow. To be fair, Bitmaps are mostly used in large-data scenarios (I shed tears of regret for never having worked with large data volumes). Today's topic, HyperLogLog, is also commonly used for cardinality statistics over large data sets. I haven't used it myself yet, so I'll look for a chance to use it in my current project and dig myself a pit along the way (doge, just kidding).

Overview

First, HyperLogLog is not a data structure but a cardinality[1] estimation algorithm. It needs only a small amount of memory to count distinct elements; the data set can be IPs, emails, IDs, and so on. Because HyperLogLog only estimates cardinality from the input elements and does not store the elements themselves, it cannot return individual elements the way a set can.

How it works

As mentioned above, HyperLogLog is a probabilistic algorithm: it estimates the number of elements from their hash values instead of storing the elements. Here's an example:

One day Xiao Ming and Xiao Hong were playing happily on the playground. Blushing, Xiao Ming suddenly said to Xiao Hong: "Let's play a coin-flipping game. If I win, you'll be my girlfriend; if I lose, I'll be your boyfriend. The rules: I flip the coin, and a round ends as soon as the national emblem side comes up. I decide how many rounds to play. At the end I'll tell you the largest number of flips any single round took, and you have to guess how many rounds I played." Xiao Hong, blushing too, said: "Fine, but that's not easy to guess. You go ahead and flip; I need to work out the probability." And she quickly sketched a picture in her head.

Let k be the number of flips a round needs before the first 1 appears (1 is the national emblem side, 0 is the number side). The only thing we know is the largest k, kmax. Because each flip has only two outcomes, 0 and 1, the probability that a given round takes kmax flips is $(1/2)^{k_{max}}$, so we can infer that n = $2^{k_{max}}$. Probability theory calls this kind of trial a Bernoulli experiment[2].

Xiao Ming then finished his n rounds and told Xiao Hong that the longest round took 3 flips. Xiao Hong had her plan ready and immediately answered 8. The result: Xiao Ming had played only one round. Having lost, Xiao Hong said angrily: "You won't even let your girlfriend win a game. You're a cheat who toys with a girl's feelings. Go away, we're done." (Didn't see that coming, ha ha.)
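Xiao Hong's strategy can be replayed in a few lines of Python. This is only a sketch of the game as described above: the function names and the random seed are my own choices, and `random.random() < 0.5` stands in for a fair coin. It plays the game several times for a few true values of n and prints the guesses 2 ** kmax, which makes it easy to see how widely a single kmax swings.

```python
import random

def flips_until_heads(rng):
    """Flip a fair coin until heads (1) appears; return the number of flips used."""
    k = 1
    while rng.random() < 0.5:  # tails with probability 1/2, so flip again
        k += 1
    return k

def estimate_rounds(n_rounds, rng):
    """Play n_rounds rounds and guess the round count as 2 ** kmax."""
    kmax = max(flips_until_heads(rng) for _ in range(n_rounds))
    return kmax, 2 ** kmax

rng = random.Random(42)  # fixed seed, chosen arbitrarily for repeatability
for n in (1, 8, 1000):
    guesses = [estimate_rounds(n, rng)[1] for _ in range(5)]
    print(f"true n = {n:4d}, guesses from 5 independent games: {guesses}")
```

Every guess is a power of two, and one lucky or unlucky round is enough to double or halve it, which is why a single kmax is not a reliable estimator.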

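Careful brothers will have noticed that the estimate above has a problem (it made Xiao Hong lose): a single kmax is far too noisy. To address this, Professor Philippe Flajolet introduced the concept of buckets: spread the rounds over m buckets, record kmax for each bucket, and average across the buckets (HyperLogLog uses a harmonic mean, plus some further corrections) to get a much more accurate answer. The final formula is

$$\hat{n} = \alpha_m \, m^2 \left( \sum_{j=1}^{m} 2^{-k_{max,j}} \right)^{-1}$$

where $k_{max,j}$ is the kmax recorded in bucket j and $\alpha_m$ is a bias-correction constant.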

Back to Redis HyperLogLog: for a newly inserted string, Redis first computes a 64-bit hash value and uses the first 14 bits to locate the bucket ($2^{14}$, that is, 16,384 buckets). The next 50 bits play the role of the Bernoulli process. Each bucket has 6 bits and records count, the position at which the first 1 appears. If count > oldcount, count replaces oldcount.
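The bucket scheme described above can be sketched in Python. Treat this as an illustration under stated assumptions, not as Redis's actual code: SHA-1 is only a stand-in for the hash function, the helper names (`hash64`, `bucket_and_count`, `add`, `estimate`) are made up for this example, and the estimator is the textbook HyperLogLog formula with its small-range correction rather than whatever Redis computes internally.

```python
import hashlib
import math

P = 14          # bits that select the bucket
M = 1 << P      # 2**14 = 16,384 buckets
Q = 64 - P      # the remaining 50 bits act as the coin flips

def hash64(value: str) -> int:
    """Return a 64-bit hash. SHA-1 is only a stand-in for illustration."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")

def bucket_and_count(value: str) -> tuple[int, int]:
    """Split the hash into (bucket index, position of the first 1 bit)."""
    h = hash64(value)
    index = h >> Q                      # first 14 bits
    rest = h & ((1 << Q) - 1)           # next 50 bits
    count = Q - rest.bit_length() + 1   # 1 if the first bit is 1; 51 if all 50 bits are 0
    return index, count

def add(registers: list[int], value: str) -> int:
    """Keep the largest count per bucket; return 1 if a bucket changed, else 0."""
    index, count = bucket_and_count(value)
    if count > registers[index]:
        registers[index] = count
        return 1
    return 0

def estimate(registers: list[int]) -> float:
    """Textbook HyperLogLog estimate with the small-range (linear counting) correction."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)
    return raw

registers = [0] * M
for i in range(1, 100_001):
    add(registers, f"uuid-{i}")
print(round(estimate(registers)))  # an approximation of 100000
```

The largest possible count is 51, which is why 6 bits per bucket (values up to 63) are enough.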

Commands

Redis provides only three commands for working with HyperLogLog

1 Add

## Format; key: the key, element: the element(s) to add
pfadd key element [element ...]
## Add elements; returns 1 if the HyperLogLog was modified
127.0.0.1:6379> pfadd 2020-12-14:unique:ids "uuid-1" "uuid-2" "uuid-3" "uuid-4"
(integer) 1

2 Count cardinality

pfcount computes the cardinality (the number of distinct elements) of one or more HyperLogLogs

## Format; key: the key(s) to count
pfcount key [key ...]
## Returns the estimated number of distinct elements
127.0.0.1:6379> pfcount 2020-12-14:unique:ids
(integer) 4

3 Merge

pfmerge computes the union of multiple HyperLogLogs and stores the result in destkey

## Format; destkey: key for the result, sourcekey: key(s) to merge
pfmerge destkey sourcekey [sourcekey ...]
## Add elements for 2020-12-13
127.0.0.1:6379> pfadd 2020-12-13:unique:ids "uuid-4" "uuid-5" "uuid-6" "uuid-7"
(integer) 1
## Compute the cardinality of the union of 2020-12-13 and 2020-12-14
127.0.0.1:6379> pfmerge 2020-12_13_14:unique:ids 2020-12-13:unique:ids 2020-12-14:unique:ids
OK
127.0.0.1:6379> pfcount 2020-12_13_14:unique:ids
(integer) 7
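Why does the merged count come out as 7 rather than 8, even though "uuid-4" was added on both days? Conceptually, merging HyperLogLogs keeps the larger value of each bucket, so an element seen on both sides lands in the same bucket with the same count and changes nothing. The sketch below shows that idea on tiny hand-made bucket lists; the `merge` helper and the four-bucket lists are made up for illustration and are not Redis internals.

```python
def merge(*register_sets):
    """Union of HyperLogLogs: keep the largest count seen for each bucket."""
    return [max(column) for column in zip(*register_sets)]

day_13 = [0, 3, 1, 0]  # hypothetical bucket contents for one key
day_14 = [2, 1, 1, 0]  # hypothetical bucket contents for another key

print(merge(day_13, day_14))  # [2, 3, 1, 0]
```

Because taking a maximum is idempotent, merging a key with itself, or merging the same keys twice, gives the same result.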

Memory usage

1 Initial memory statistics

127.0.0.1:6379> info memory
# memory statistics
used_memory:835144
used_memory_human:815.57K

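2 Insert batch data

# Insert 1,000,000 distinct elements, 1,000 per pfadd call
elements=""
key="2020-12-14:unique:ids"
for i in $(seq 1 1000000)
do
    elements="${elements} uuid-${i}"
    if (( i % 1000 == 0 ))
    then
        # ${elements} is left unquoted on purpose so each element becomes its own argument
        redis-cli pfadd "${key}" ${elements}
        elements=""
    fi
done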

3 Check memory usage

Memory grew by only about 15K after the one million elements were added

127.0.0.1:6379> info memory
# memory statistics
used_memory:850616
used_memory_human:830.68K
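The "about 15K" figure can be checked against the bucket layout described earlier (16,384 buckets of 6 bits each). The short calculation below uses only the numbers quoted in this article; the variable names are mine, and the explanation for the gap between the two results is my assumption, not something measured here.

```python
BUCKETS = 2 ** 14        # 16,384 buckets
BITS_PER_BUCKET = 6

register_bytes = BUCKETS * BITS_PER_BUCKET // 8
print(register_bytes)            # 12288 bytes, i.e. 12 KB for the buckets alone

before, after = 835144, 850616   # used_memory before and after the inserts
delta = after - before
print(delta)                     # 15472
print(round(delta / 1024, 2))    # 15.11
```

The roughly 3 KB beyond the 12 KB of buckets is presumably the key name, object headers, and allocator overhead.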

4 Accuracy Analysis

pfcount does not return exactly one million

127.0.0.1:6379> pfcount 2020-12-14:unique:ids
(integer) 1009838
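To put that result in perspective, the relative error can be compared with the standard error usually quoted for HyperLogLog, 1.04 / sqrt(m), which is about 0.81% for m = 16,384 buckets. The snippet below is plain arithmetic on the numbers shown above; the variable names are mine.

```python
import math

expected = 1_000_000
reported = 1009838

relative_error = (reported - expected) / expected
std_error = 1.04 / math.sqrt(2 ** 14)

print(f"{relative_error:.4%}")  # 0.9838%
print(f"{std_error:.4%}")       # 0.8125%
```

An error of just under 1% is a little more than one standard error, so this run is within the range to be expected.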

Usage scenarios

HyperLogLog has a very small memory footprint, but it comes with an error rate. So before using it, make sure both of the following hold

  1. You only need the distinct count, not the individual elements. As stated above, only what is needed to compute the cardinality is stored, not the data itself.
  2. You can tolerate a certain error rate, as shown in the accuracy analysis above.

Conclusion

Understanding HyperLogLog requires a certain amount of algorithmic knowledge, which gives me a headache too. But after this article, good brothers should have a basic understanding of HyperLogLog. I won't go deeper into the algorithm here (it makes my head spin); I'll leave that important task to you good brothers. Go for it, good brothers! And don't forget to share (doge, just kidding).

That's the end of this issue. Feel free to leave your thoughts in the comments section. Follows and likes are appreciated.

Do you know how to Redis Bitmaps


  1. The cardinality is a non-negative integer that represents the number of distinct elements in a set. For example, the dataset {1, 3, 5, 7, 5, 7, 8} has the distinct elements {1, 3, 5, 7, 8}, so its cardinality is 5.
  2. A Bernoulli experiment (Bernoulli trial) is a random experiment with exactly two possible outcomes, such as a coin flip.