I. Introduction to HyperLogLog

HyperLogLog is a near-optimal algorithm for cardinality estimation, first proposed by Flajolet and colleagues in 2007. Interestingly, although the original paper presents it as an algorithm, many sources, including the Redis author, call it a new data structure.

(Implementing the algorithm does require a specific underlying data structure, so both descriptions are fair.)


About cardinality statistics

Cardinality counting is used to count the number of unique elements in a set.

Consider this scenario: you are responsible for the development and maintenance of a large website. One day the boss asks the product manager for the UV of every page on the site (unique visitors; each user is counted only once per day), and then asks you to develop the statistics module. How would you do it?

Counting PV (page views) is very easy: configure a separate Redis counter for each page, appending the current date as a suffix to the counter's key. Execute INCRBY once for each request, and all the PV data can be tallied up.
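
To make this concrete, here is a minimal sketch of such a PV counter using the Jedis client (the key naming scheme is my own illustration, not a fixed convention):

import redis.clients.jedis.Jedis;
import java.time.LocalDate;

public class PvCounter {
    public static void main(String[] args) {
        Jedis jedis = new Jedis(); // assumes a Redis instance on localhost:6379
        // one counter per page per day, e.g. "pv:home:2020-03-02"
        String key = "pv:home:" + LocalDate.now();
        jedis.incrBy(key, 1); // executed once per page request
        System.out.println("PV today: " + jedis.get(key));
        jedis.close();
    }
}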

But UV is different: it must be de-duplicated, so multiple visits from the same user in one day count only once. This requires every web request to carry a user ID; whether the user is logged in or not, there must be a unique ID to identify them.

You might immediately think of a simple solution: set up a separate set for each page to store all the user IDs that accessed the page that day (a sketch follows the problem list below). But this approach has two problems:

  1. Huge storage space: if your site has many visitors, the sets you need to store will be very large. Spending that many resources on a mere de-duplication feature, the boss would beat you to death;
  2. Statistical complexity: aggregating statistics over so many sets is a complicated affair;
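
For concreteness, here is a minimal sketch of the naive set-based approach just described, again with the Jedis client (the key scheme is my own illustration):

import redis.clients.jedis.Jedis;
import java.time.LocalDate;

public class NaiveUvCounter {
    public static void main(String[] args) {
        Jedis jedis = new Jedis(); // assumes a Redis instance on localhost:6379
        String key = "uv:home:" + LocalDate.now(); // one set per page per day
        // SADD de-duplicates automatically: repeat visits by the same user count once
        jedis.sadd(key, "user1");
        jedis.sadd(key, "user2");
        jedis.sadd(key, "user1"); // duplicate, ignored
        System.out.println("UV today: " + jedis.scard(key)); // prints 2
        jedis.close();
    }
}

It works, but every distinct user ID is stored in full, which is exactly where the two problems above come from.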

Common methods of cardinality statistics

For problems like this that require cardinality statistics, there are generally two solutions better than sets:

First: B-trees

The biggest advantage of a B-tree is its high efficiency of insertion and lookup. If a B-tree is used to store the data being counted, it can quickly determine whether a new datum already exists and insert it quickly. To compute the cardinality, you just count the number of nodes in the B-tree.

However, while maintaining a B-tree structure in memory solves the statistics and computation problem, it does not save memory.

Second: bitmap

A bitmap can be understood as a data structure that stores data in an array of bits, each bit carrying information independently. Since a bit is the smallest storage unit, this can save a great deal of space. You define one large bit array and map each element being counted to a single bit in the array; for example:
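
Here is a minimal in-memory sketch using java.util.BitSet (assuming each user has already been mapped to a small integer ID, which a real system would need to arrange):

import java.util.BitSet;

public class BitmapUv {
    public static void main(String[] args) {
        BitSet page = new BitSet();
        // each visiting user flips exactly one bit, so duplicates cost nothing extra
        int[] visits = {3, 7, 3, 42, 7};
        for (int userId : visits) {
            page.set(userId);
        }
        System.out.println("UV: " + page.cardinality()); // prints 3
    }
}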

Bitmaps also have the obvious advantage that multiple statistical results can be merged easily, requiring only a bitwise OR of the results, and storage is greatly reduced. To count the cardinality of 100 million data points, the memory required is about 100,000,000 / 8 / 1024 / 1024 ≈ 12 MB; if instead each datum were represented by a 32-bit int, the memory required would be about 32 × 100,000,000 / 8 / 1024 / 1024 ≈ 381 MB.

As you can see, the memory saved by the bitmap is obvious, but still not enough. It takes 12 MB to count the cardinality of one object; counting 10,000 objects needs close to 120 GB, which is still impractical for big-data scenarios.

Probabilistic algorithms

In fact, no efficient algorithm has been found that exactly computes cardinality in big-data scenarios, so when absolute accuracy is not required, probabilistic algorithms are a good solution.

A probabilistic algorithm does not store the data set itself; instead it estimates the cardinality through probabilistic and statistical methods, which greatly saves memory while keeping the error within a controlled range. Current probabilistic algorithms for cardinality counting include:

  • Linear Counting (LC): an early cardinality estimation algorithm. LC is not outstanding in terms of space complexity; in fact, its space complexity is of the same order as the simple bitmap method above, O(N_max), only with a smaller constant factor;
  • LogLog Counting (LLC): LLC saves far more memory than LC, with a space complexity of only O(log2(log2(N_max)));
  • HyperLogLog Counting (HLL): HLL is an optimization and improvement of LLC; with the same space complexity, its cardinality estimation error is smaller than LLC's.

Among these, HyperLogLog's performance is amazing. We just calculated that storing 100 million statistics with a bitmap takes about 12 MB, while HyperLogLog needs less than 1 KB! The HyperLogLog implemented in Redis requires only 12 KB of memory and can count up to 2^64 elements with a standard error of 0.81%!

How is this possible?! Let's find out.

II. The HyperLogLog principle

Consider a coin-toss game: you toss a coin some number of times in a row and tell me the maximum number of consecutive heads you got, and I guess how many times you tossed the coin.

This is easy to understand intuitively. For example, if you tell me your longest streak this round was two heads in a row, I know you could not have flipped very many times, so I would probably guess 5 or some other small number. But if you tell me you got up to 20 heads in a row, I would hardly believe it; still, I would know you must have put in an awful lot of flips, so I would just say: get lost…

During the game I might also ask you to repeat the experiment; with more data I can make a better estimate. Let's rephrase the game:

The figure means: we generate a series of random integers and record the maximum length K of the run of consecutive zeros in the low bits, the maxbit in the figure. From this value K we can estimate the count N of random numbers generated.

Code experiment

We can simply write code to do an experiment to explore the relationship between K and N:

import java.util.concurrent.ThreadLocalRandom;

public class PfTest {

    static class BitKeeper {

        private int maxbit;

        public void random() {
            long value = ThreadLocalRandom.current().nextLong(2L << 32);
            int bit = lowZeros(value);
            if (bit > this.maxbit) {
                this.maxbit = bit;
            }
        }

        private int lowZeros(long value) {
            int i = 0;
            for (; i < 32; i++) {
                if ((value >> i << i) != value) {
                    break;
                }
            }
            return i - 1;
        }
    }

    static class Experiment {

        private int n;
        private BitKeeper keeper;

        public Experiment(int n) {
            this.n = n;
            this.keeper = new BitKeeper();
        }

        public void work() {
            for (int i = 0; i < n; i++) {
                this.keeper.random();
            }
        }

        public void debug() {
            System.out
                .printf("%d %.2f %d\n", this.n, Math.log(this.n) / Math.log(2), this.keeper.maxbit);
        }
    }

    public static void main(String[] args) {
        for (int i = 1000; i < 100000; i += 100) {
            Experiment exp = new Experiment(i);
            exp.work();
            exp.debug();
        }
    }
}

This replicates the process shown in the figure. Why the name PfTest? The corresponding Redis commands also carry a PF prefix; remember it, because PF are the initials of Philippe Flajolet, the inventor of HyperLogLog mentioned above.

Here is part of the output:

// n  log2(n)  maxbit
34000 15.05 13
35000 15.10 13
36000 15.14 16
37000 15.18 17
38000 15.21 14
39000 15.25 16
40000 15.29 14
41000 15.32 16
42000 15.36 18

You will find a significant linear correlation between K and the logarithm of N: N is roughly 2^K.

One step further: bucket averaging

If N is anywhere between 2^K and 2^(K+1), this method always estimates 2^K, which is obviously crude. We can therefore use multiple BitKeepers and average their estimates to get a more accurate value:

import java.util.concurrent.ThreadLocalRandom;

public class PfTest {

    static class BitKeeper {
        // unchanged from the previous example
        private int maxbit;

        public void random() {
            long value = ThreadLocalRandom.current().nextLong(2L << 32);
            int bit = lowZeros(value);
            if (bit > this.maxbit) {
                this.maxbit = bit;
            }
        }

        private int lowZeros(long value) {
            int i = 0;
            for (; i < 32; i++) {
                if ((value >> i << i) != value) {
                    break;
                }
            }
            return i - 1;
        }
    }

    static class Experiment {

        private int n;
        private int k;
        private BitKeeper[] keepers;

        public Experiment(int n) {
            this(n, 1024);
        }

        public Experiment(int n, int k) {
            this.n = n;
            this.k = k;
            this.keepers = new BitKeeper[k];
            for (int i = 0; i < k; i++) {
                this.keepers[i] = new BitKeeper();
            }
        }

        public void work() {
            for (int i = 0; i < this.n; i++) {
                long m = ThreadLocalRandom.current().nextLong(1L << 32);
                // bits 16..27 of m decide which keeper (bucket) records this element
                BitKeeper keeper = keepers[(int) (((m & 0xfff0000) >> 16) % keepers.length)];
                keeper.random();
            }
        }

        public double estimate() {
            // harmonic mean of maxbit over all keepers
            double sumbitsInverse = 0.0;
            for (BitKeeper keeper : keepers) {
                sumbitsInverse += 1.0 / (float) keeper.maxbit;
            }
            double avgBits = (float) keepers.length / sumbitsInverse;
            return Math.pow(2, avgBits) * this.k;
        }
    }

    public static void main(String[] args) {
        for (int i = 100000; i < 1000000; i += 100000) {
            Experiment exp = new Experiment(i);
            exp.work();
            double est = exp.estimate();
            System.out.printf("%d %.2f %.2f\n", i, est, Math.abs(est - i) / i);
        }
    }
}

The process is a bit like the judging in a talent show: a panel of professional judges scores each act, but some judges score too high out of personal fondness and some score too low, so the usual practice is to drop the highest and lowest scores and average the rest, making the final score fair.

The code above uses 1024 "judges", and instead of the plain average it uses the harmonic mean (the reciprocal of the mean of reciprocals), which effectively smooths out the influence of outliers:

avg = (3 + 4 + 5 + 104) / 4 = 29
avg_harmonic = 4 / (1/3 + 1/4 + 1/5 + 1/104) ≈ 5.044

Observe the script's output: the relative error stays within single-digit percentages:

100000 94274.94 0.06
200000 194092.62 0.03
300000 277329.92 0.08
400000 373281.66 0.07
500000 501551.60 0.00
600000 596078.40 0.01
700000 687265.72 0.02
800000 828778.96 0.04
900000 944683.53 0.05

The real HyperLogLog is more complex and more accurate than the sample code above. Also note that the algorithm above divides by zero when the number of random values is very small, because maxbit = 0 has no reciprocal.

The real HyperLogLog

There is a wonderful website that lets you watch, step by step, how the HyperLogLog algorithm works: content.research.neustar.biz/blog/hll.ht…

Some of its concepts are explained here so that you can click "step" and observe for yourself:

  • m indicates the number of buckets: as you can see, the demo divides the data into 64 buckets;
  • The blue bits indicate the position in the bucket: for example, 101110 in the figure is binary for 46, so the element is counted in bucket 46 of the large Register Values table in the middle, marked in red;
  • The green bits indicate where the first 1 occurs: in the figure, the first 1 counting from right to left is at position 1, so a 1 is written into bucket 46 of Register Values;
  • The red bits represent the value recorded in the bucket: the next element that lands in bucket 46 will update the value stored there;

Why count the position of the first 1 in the hash value?

Because the position of the first 1 corresponds to the number of tosses before the first heads in our coin-toss game. According to the conclusion of the coin-toss experiment above, by recording for each datum the position K of its first 1 and taking the maximum Kmax, we can deduce the cardinality of the data set: N = 2^Kmax.
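
To make this concrete, here is a tiny sketch (my own illustration, not Redis's actual code) of mapping one element's hash to a bucket and a first-1 position, assuming, as Redis does, that the low 14 bits of a 64-bit hash select the bucket:

import java.util.concurrent.ThreadLocalRandom;

public class FirstOneDemo {
    public static void main(String[] args) {
        long hash = ThreadLocalRandom.current().nextLong(); // stand-in for a real element hash
        int bucket = (int) (hash & 0x3FFF); // low 14 bits pick one of 16384 buckets
        long rest = hash >>> 14; // the remaining bits are the "coin flips"
        // position of the first 1 from the right, counting from 1 (rest == 0 is a corner case)
        int firstOne = Long.numberOfTrailingZeros(rest) + 1;
        System.out.printf("bucket=%d firstOne=%d%n", bucket, firstOne);
        // each bucket keeps only the maximum firstOne it has ever seen
    }
}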

Why is the PF memory footprint 12 KB?

We used 1024 buckets in the algorithm above, and the demo uses only 64, but Redis's HyperLogLog implementation uses 16,384 buckets, i.e. 2^14; in other words, the big Register Values table in the middle of the demo site above would have 16,384 rows.
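
Incidentally, this bucket count is also where the 0.81% figure quoted earlier comes from: HyperLogLog's standard error is approximately 1.04/√m for m buckets, so with m = 16384:

1.04 / sqrt(16384) = 1.04 / 128 ≈ 0.81%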

However, Redis can count up to 2^64 elements, so the maxbit of each bucket needs 6 bits of storage, the maximum value being maxbit = 63. The total memory occupied is therefore 2^14 × 6 / 8 bytes

(each bucket takes 6 bits, so 16,384 buckets take 16,384 × 6 = 98,304 bits; divide by 8 to get 12,288 bytes)

and the result is exactly 12 KB.

III. Redis HyperLogLog implementation

From the above we have gained some understanding of HyperLogLog's algorithm and ideas, and we know that a HyperLogLog actually occupies about 12 KB. But Redis is fanatical about memory optimization: when the count is small, most bucket counts are zero, so Redis switches to a space-saving sparse storage mode; the normal mode, which always occupies a fixed 12 KB, is called dense storage.

Dense storage structure

The dense storage structure is very simple: a string bitmap of 16,384 consecutive 6-bit registers:

As we all know, a byte consists of 8 bits, so packing 6-bit counts causes some buckets to straddle a byte boundary, and we need appropriate shifting and splicing of one or two bytes to extract a specific count.

If the bucket number is index, the starting byte offset of its 6-bit count value is offset_bytes, and the bit offset within that byte is offset_bits, then we have:

offset_bytes = (index * 6) / 8
offset_bits = (index * 6) % 8

The former is the quotient and the latter the remainder. For example, bucket 2 has byte offset 1, i.e. the second byte, and bit offset 4, meaning the count of bucket 2 starts at the fifth bit of the second byte. Note that the byte order here is low on the left and high on the right, whereas we normally write bytes high on the left and low on the right.

If offset_bits is less than or equal to 2, the six bits are inside a byte. We can use the following expression to get the count val:

val = buffer[offset_bytes] >> offset_bits  # shift to the right

If offset_bits is greater than 2, then it involves crossing byte boundaries and we need to concatenate bits of two bytes:

# high value (taken from the next byte)
high_val = buffer[offset_bytes + 1]
# low value
low_val = buffer[offset_bytes] >> offset_bits
# number of low bits
low_bits = 8 - offset_bits
# splice, keep the lower 6 bits
val = (high_val << low_bits | low_val) & 0b111111

The Redis source code is a little more obscure: it appears to handle only the cross-byte-boundary case. That is because when the 6 bits lie within a single byte, the high_val contribution in the code above is simply zero, so the same code handles both the single-byte and the double-byte case:

// Get the count of the specified bucket
#define HLL_DENSE_GET_REGISTER(target,p,regnum) do { \
    uint8_t *_p = (uint8_t*) p; \
    unsigned long _byte = regnum*HLL_BITS/8; \
    unsigned long _fb = regnum*HLL_BITS&7; /* %8 is the same as &7 */ \
    unsigned long _fb8 = 8 - _fb; \
    unsigned long b0 = _p[_byte]; \
    unsigned long b1 = _p[_byte+1]; \
    target = ((b0 >> _fb) | (b1 << _fb8)) & HLL_REGISTER_MAX; \
} while(0)

// Set the count of the specified bucket
#define HLL_DENSE_SET_REGISTER(p,regnum,val) do { \
    uint8_t *_p = (uint8_t*) p; \
    unsigned long _byte = regnum*HLL_BITS/8; \
    unsigned long _fb = regnum*HLL_BITS&7; \
    unsigned long _fb8 = 8 - _fb; \
    unsigned long _v = val; \
    _p[_byte] &= ~(HLL_REGISTER_MAX << _fb); \
    _p[_byte] |= _v << _fb; \
    _p[_byte+1] &= ~(HLL_REGISTER_MAX >> _fb8); \
    _p[_byte+1] |= _v >> _fb8; \
} while(0)

Sparse storage structure

Sparse storage is suitable when many counts are zero. The following figure shows the general state of a sparsely stored count:

Redis provides several different byte encodings, mainly to compress runs of zero-count buckets:

  • 00xxxxxx: the prefix of two zeros indicates a run of empty buckets; the following 6-bit integer value plus one gives the run length (plus one because a run of zero would be meaningless). For example, 00010101 means 22 consecutive zero counters;
  • 01xxxxxx yyyyyyyy: 6 bits can express a run of at most 64 zero counters, so this extended 14-bit form can express up to 16384 consecutive zero counters. This means the initial state of a HyperLogLog's 16,384 buckets, all counters at zero, can be represented with just 2 bytes;
  • 1vvvvvxx: the middle 5 bits hold the count value and the trailing 2 bits the run length, meaning (xx + 1) consecutive buckets all hold the count (vvvvv + 1). For example, 10101011 means 4 consecutive buckets with count 11.

Pay attention

The third form above can express a count of at most 32, while HyperLogLog's dense storage uses 6 bits per count and can express up to 63. When a sparsely stored count needs to be raised above 32, Redis immediately converts the HyperLogLog's storage structure from sparse to dense (a small decoding sketch for these byte forms follows below).
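
To internalize the three encodings, here is a small decoding sketch (my own illustration, not Redis source code) that expands a sparse byte stream into per-bucket counts; the opcode names ZERO, XZERO and VAL follow the Redis source:

public class SparseDecodeDemo {

    // Expand a sparse HLL byte stream into an array of bucket values.
    // Opcodes: ZERO 00xxxxxx, XZERO 01xxxxxx yyyyyyyy, VAL 1vvvvvxx.
    static int[] decode(byte[] stream) {
        int[] registers = new int[16384];
        int bucket = 0;
        int i = 0;
        while (i < stream.length) {
            int b = stream[i] & 0xFF;
            if ((b & 0xC0) == 0x00) { // ZERO: (xxxxxx + 1) consecutive empty buckets
                bucket += (b & 0x3F) + 1;
                i += 1;
            } else if ((b & 0xC0) == 0x40) { // XZERO: 14-bit run of empty buckets
                bucket += (((b & 0x3F) << 8) | (stream[i + 1] & 0xFF)) + 1;
                i += 2;
            } else { // VAL: (xx + 1) consecutive buckets, each holding (vvvvv + 1)
                int val = ((b >> 2) & 0x1F) + 1;
                int run = (b & 0x03) + 1;
                for (int j = 0; j < run; j++) {
                    registers[bucket++] = val;
                }
                i += 1;
            }
        }
        return registers;
    }

    public static void main(String[] args) {
        // 10101011 = VAL: 4 consecutive buckets with count 11, matching the example above
        int[] regs = decode(new byte[]{(byte) 0b10101011});
        System.out.println(regs[0] + " " + regs[1] + " " + regs[2] + " " + regs[3]); // 11 11 11 11
    }
}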

Object header

In addition to the 16,384 bucket counts, a HyperLogLog has a few additional fields to store, such as the cached total count and the storage type, so it uses an extra object header to represent them:

struct hllhdr {
    char magic[4];       /* Magic string "HYLL" */
    uint8_t encoding;    /* HLL_DENSE or HLL_SPARSE */
    uint8_t notused[3];  /* Reserved; may be used in the future */
    uint8_t card[8];     /* Cached total count */
    uint8_t registers[]; /* Counts of all the buckets */
};

So the entire internal structure of a HyperLogLog is the hllhdr object header plus the 16,384-bucket counting bitmap. Internally, Redis represents it as a string bitmap, so you can read a HyperLogLog object as an ordinary string:

> PFADD codehole python java golang
(integer) 1
> GET codehole
"HYLL\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80C\x03\x84MK\x80P\xb8\x80^\xf3"

However, you cannot use HyperLogLog commands to manipulate an ordinary string, because they first check whether the object header's magic string is "HYLL".

IV. Using HyperLogLog

HyperLogLog provides two main commands, PFADD and PFCOUNT; literally, one adds elements and the other gets the count. PFADD is used the same way as SADD for sets: to count a user ID, just add it:

> PFADD codehole user1
(integer) 1
> PFCOUNT codehole
(integer) 1
> PFADD codehole user2
(integer) 1
> PFCOUNT codehole
(integer) 2
> PFADD codehole user3
(integer) 1
> PFCOUNT codehole
(integer) 3
> PFADD codehole user4 user5
(integer) 1
> PFCOUNT codehole
(integer) 5

We can write a script in Java to see how accurate HyperLogLog really is:

import redis.clients.jedis.Jedis;

public class JedisTest {
  public static void main(String[] args) {
    Jedis jedis = new Jedis(); // assumes a Redis instance on localhost:6379
    for (int i = 0; i < 100000; i++) {
      jedis.pfadd("codehole", "user" + i);
    }
    long total = jedis.pfcount("codehole");
    System.out.printf("%d %d\n", 100000, total);
    jedis.close();
  }
}

The output is as follows:

100000 99723

With 100,000 entries the count is off by only 277, a relative error of 0.277%, which is really not bad for the massive-scale UV requirement above.

Of course, in addition to PFADD and PFCOUNT above, a third command, PFMERGE, is provided to merge multiple PF values into a new one:

> PFADD  nosql  "Redis"  "MongoDB"  "Memcached"
(integer) 1

> PFADD  RDBMS  "MySQL" "MSSQL" "PostgreSQL"
(integer) 1

> PFMERGE  databases  nosql  RDBMS
OK

> PFCOUNT  databases
(integer) 6

Related reading

  1. Redis (1) – The 5 basic data structures – www.wmyskxz.com/2020/02/28/…
  2. Redis (2) – The skip list – www.wmyskxz.com/2020/02/29/…
  3. Redis (3) – A deep dive into distributed locks – www.wmyskxz.com/2020/03/01/…

Further reading

  1. The original HyperLogLog paper: HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm – algo.inria.fr/flajolet/Pu…

References

  1. Redis new data structure: the HyperLogLog – antirez.com/news/75
  2. The magical HyperLogLog algorithm – www.rainybowe.com/blog/2017/0…
  3. A deep exploration of Redis HyperLogLog's internal data structure – zhuanlan.zhihu.com/p/43426875
  4. Redis Deep Adventure, by Qian Wenpin
  • This article has been included in my GitHub programmer growth series 【More Than Java】, learning more than code. Stars are welcome: Github.com/wmyskxz/Mor…
  • Personal public account: wmyskxz.
    Personal blog: wmyskxz.com. I stick to original output; scan the QR code below to follow, and let's grow together in 2020!

Thank you very much for reading this far. If you think this article is well written and that "I Don't Have Three Hearts" has something going for it, please like, follow, share, and leave a comment!

Creation is not easy. Your support and recognition are the biggest motivation for my writing. See you in the next article!