1. Cardinality counting

Before getting to know HyperLogLog, let’s take a look at cardinality counting.

1.1 Concept

Cardinality counting means counting the number of distinct (non-repeating) elements in a set. It shows up in everyday requirements such as page UV statistics, the number of online users, or the number of distinct registered IPs.

If you were asked to implement this requirement, how would you approach it? The simplest way is to keep a set S of all distinct elements seen so far: for each new element x, first check whether x is in S; if not, add x to S, otherwise skip it. The usual SET data structure implements exactly this.
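As a minimal sketch (the function name is my own, for illustration), the naive set-based approach looks like this:

```python
# Naive cardinality counting: keep every distinct element in a set S.
def naive_count(stream):
    seen = set()
    for x in stream:
        if x not in seen:   # membership check; the set only grows
            seen.add(x)
    return len(seen)

print(naive_count([1, 2, 2, 3, 3, 3]))  # 3
```

This is exact, but it stores every distinct element, which leads directly to the two problems below.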

But with this implementation, what problems appear as the data volume keeps growing?

  • As the amount of data increases, the memory required grows linearly with it.
  • As set S gets bigger, the cost of checking whether x is already in S grows too.

Is there another solution that mitigates both of these problems? There is; a brief introduction follows.

1.2 Methods

There are three common approaches to cardinality counting: B+ tree, bitmap, and probabilistic algorithms.

  • B+ tree. Insertion and lookup in a B+ tree are efficient: you can quickly check whether an element exists and insert it if not. To get the cardinality (the number of distinct elements), you only need the number of nodes in the tree. But it still does not save memory.
  • Bitmap. A bitmap is a data structure that stores data in an array of bits. For cardinality counting, each element maps to one bit; for example 010010101 (positions numbered from 0, left to right) represents [1,4,6,8]. Adding a new element is just OR-ing its bit into the existing bit array. This greatly reduces memory: storing 100 million values takes only about 100000000/8/1024/1024 ≈ 12 MB. Compared with a B+ tree that is a big saving, but in very large scenarios it still adds up: 10,000 objects each tracking 100 million values would need about 120 GB, so the memory consumption in certain scenarios is still considerable.
  • Probabilistic algorithm. A probabilistic algorithm trades accuracy for space. For scenarios that do not require absolute accuracy, it is a good choice: instead of storing the data set itself, it estimates the cardinality using probabilistic and statistical methods while keeping the error within a bounded range, which greatly reduces memory. HyperLogLog is one implementation of such an algorithm and is the focus of the rest of this article.
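To make the bitmap bullet concrete, here is a minimal bitmap sketch (class and method names are my own, for illustration):

```python
# Minimal bitmap: each integer element maps to one bit in a byte array.
class Bitmap:
    def __init__(self, size):
        self.bits = bytearray((size + 7) // 8)

    def add(self, x):
        self.bits[x // 8] |= 1 << (x % 8)   # OR the element's bit in

    def count(self):
        # Cardinality = number of set bits.
        return sum(bin(b).count("1") for b in self.bits)

bm = Bitmap(16)
for x in [1, 4, 6, 8, 4, 1]:   # duplicates set the same bit again
    bm.add(x)
print(bm.count())  # 4
```

Note that memory depends on the range of possible values (`size`), not on how many elements are actually added, which is exactly why 10,000 such structures get expensive.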

2. HyperLogLog

2.1 Principle

The idea behind HyperLogLog is this: given a set of N elements, hash each element to a bit string and record K, the maximum over all elements of the position of the first 1 in the bit string. This can also be understood as the maximum run of consecutive zeros at the low end of the binary representation. The number of distinct elements m is then estimated from K as approximately 2^K.

The following figure is taken from the web. Given a certain number of users, hashing generates a set of bit strings whose maximum count of consecutive low-order zeros is 4, so the number of distinct users is estimated as 2^4 = 16.

The following code demonstrates this.

2.2 Code Demonstration

The code below is adapted from: https://kuaibao.qq.com/s/20180917G0N2C300?refer=cp_1026

# content of hyperloglog_test.py
import random


class BitsBucket(object):
    def __init__(self):
        self.maxbit = 0

    @staticmethod
    def get_zeros(value):
        # Count the consecutive zeros at the low end of value.
        for i in range(31):
            if (value >> i) & 1:
                break
        return i

    def add(self, m):
        self.maxbit = max(self.maxbit, self.get_zeros(m))


class HyperLogLogTest(object):
    def __init__(self, n, bucket_cnt=1024):
        self.n = n
        self.bucket_cnt = bucket_cnt
        self.bits_bucket = [BitsBucket() for i in range(bucket_cnt)]

    @staticmethod
    def generate_value():
        return random.randint(1, 2**32 - 1)

    def pfadd(self):
        for i in range(self.n):
            value = self.generate_value()
            # Choose a bucket from the value's higher-order bits.
            bucket = self.bits_bucket[((value & 0xfff0000) >> 16) % self.bucket_cnt]
            bucket.add(value)

    def pfcount(self):
        # Harmonic mean of the per-bucket maxbit values.
        sumbits_inverse = 0
        for bucket in self.bits_bucket:
            if bucket.maxbit == 0:
                continue
            sumbits_inverse += 1.0 / float(bucket.maxbit)
        avgbits = float(self.bucket_cnt) / sumbits_inverse
        return 2**avgbits * self.bucket_cnt

The BitsBucket class tracks the maximum run of consecutive low-order zeros within one bucket. HyperLogLogTest implements two methods: pfadd generates N random elements and adds each to one of the buckets; pfcount computes the averaged cardinality estimate across the bucket_cnt buckets.

The reason for using bucket_cnt buckets is the randomness of the algorithm: with a single bucket the error rate is very high. Bucket averaging addresses this by splitting the statistics across m buckets, letting each bucket compute its own estimate, and then averaging those estimates (here with a harmonic mean) to get the overall cardinality estimate.
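As a numeric illustration of why pfcount above uses a harmonic rather than arithmetic mean (the maxbit values here are made up):

```python
# One outlier bucket barely moves a harmonic mean, but drags an
# arithmetic mean up badly.
maxbits = [3, 4, 3, 20]            # hypothetical per-bucket maxbit values
m = len(maxbits)
harmonic_avg = m / sum(1.0 / k for k in maxbits)
arithmetic_avg = sum(maxbits) / m
print(round(harmonic_avg, 2))      # 4.14, close to the typical buckets
print(round(arithmetic_avg, 2))    # 7.5, dragged up by the outlier
```

Since the final estimate is exponential in the averaged maxbit (2^avgbits), suppressing such outliers matters a lot.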

Now test it out:

# content of hyperloglog_test.py
def main(bucket_cnt=1024):
    print("bucket cnt: {}, start".format(bucket_cnt))
    for i in range(100000, 1000000, 100000):
        hyperloglog = HyperLogLogTest(i, bucket_cnt)
        hyperloglog.pfadd()
        pfcount = hyperloglog.pfcount()
        print("original count: {} ".format(i),
              "pfcount: {}".format('%.2f' % pfcount), "error rate: {}%".format(
                  '%.2f' % (abs(pfcount - i) / i * 100)))
    print("bucket cnt: {}, end \n\n".format(bucket_cnt))


buckets = [1, 1024]
for cnt in buckets:
    main(cnt)

Both bucket_cnt = 1 and bucket_cnt = 1024 are tested, with the following results:

➜ HyperLogLog git:(master) python3 hyperloglog_test.py
bucket cnt: 1, start
original count: 100000  pfcount: 65536.00 error rate: 34.46%
original count: 200000  pfcount: 131072.00 error rate: 34.46%
original count: 300000  pfcount: 131072.00 error rate: 56.31%
original count: 400000  pfcount: 524288.00 error rate: 31.07%
original count: 500000  pfcount: 1048576.00 error rate: 109.72%
original count: 600000  pfcount: 2097152.00 error rate: 249.53%
original count: 700000  pfcount: 262144.00 error rate: 62.55%
original count: 800000  pfcount: 1048576.00 error rate: 31.07%
original count: 900000  pfcount: 262144.00 error rate: 70.87%
bucket cnt: 1, end

bucket cnt: 1024, start
original count: 100000  pfcount: 97397.13 error rate: 2.60%
original count: 200000  pfcount: 192659.65 error rate: 3.67%
original count: 300000  pfcount: 287909.86 error rate: 4.03%
original count: 400000  pfcount: 399678.34 error rate: 0.08%
original count: 500000  pfcount: 515970.76 error rate: 3.19%
original count: 600000  pfcount: 615906.34 error rate: 2.65%
original count: 700000  pfcount: 735321.47 error rate: 5.05%
original count: 800000  pfcount: 808206.55 error rate: 1.03%
original count: 900000  pfcount: 950692.17 error rate: 5.63%
bucket cnt: 1024, end

As you can see, with bucket_cnt = 1 the error is very large; at 1024 the algorithm becomes basically usable. The HyperLogLog implemented in Redis is more sophisticated still and keeps the standard error at 0.81%. The rest of this article focuses on HyperLogLog in Redis.

3. Redis HyperLogLog implementation

HyperLogLog has been available in Redis since version 2.8.9. For details, see the blog post "Redis new data structure: the HyperLogLog" by antirez.

3.1 Usage

Usage involves three commands:

  • PFADD adds elements to the key
  • PFCOUNT counts the number of distinct elements in the key
  • PFMERGE merges the elements of multiple keys
127.0.0.1:6379> PFADD pf_tc tc01
(integer) 1
127.0.0.1:6379> PFADD pf_tc tc02
(integer) 1
127.0.0.1:6379> PFADD pf_tc tc03
(integer) 1
127.0.0.1:6379> PFADD pf_tc tc04 tc05 tc06
(integer) 1
127.0.0.1:6379> PFCOUNT pf_tc
(integer) 6
127.0.0.1:6379> PFADD pf_tc tc04 tc05 tc06
(integer) 0
127.0.0.1:6379> PFCOUNT pf_tc
(integer) 6

127.0.0.1:6379> PFADD pf_tc01 tc07 tc08 tc09 tc10 tc01 tc02 tc03
(integer) 1
127.0.0.1:6379> PFCOUNT pf_tc01
(integer) 7
127.0.0.1:6379> PFMERGE pf_tc pf_tc01
OK
127.0.0.1:6379> PFCOUNT pf_tc
(integer) 10
127.0.0.1:6379> PFCOUNT pf_tc01
(integer) 7

It looks quite accurate. Next, let’s write a script to test it.

3.2 Error analysis

Let’s write some Python code to measure the error:

# content of hyperloglog_redis.py
import random

import redis


class HyperLogLogRedis(object):
    def __init__(self, n):
        self.n = n
        self.redis_client = redis.StrictRedis()
        self.key = "pftest:{}".format(n)

    @staticmethod
    def generate_value():
        return random.randint(1, 2**32 - 1)

    def pfadd(self):
        for i in range(self.n):
            value = self.generate_value()
            self.redis_client.pfadd(self.key, value)

    def pfcount(self):
        return self.redis_client.pfcount(self.key)


def main():
    for i in range(100000, 1000000, 100000):
        hyperloglog = HyperLogLogRedis(i)
        hyperloglog.pfadd()
        pfcount = hyperloglog.pfcount()
        print("original count: {} ".format(i),
              "pfcount: {}".format('%.2f' % pfcount), "error rate: {}%".format(
                  '%.2f' % (abs(pfcount - i) / i * 100)))


main()

The code is only slightly changed from section 2.2, replacing our hand-rolled implementation with Redis’s HyperLogLog commands.

The test results are as follows:

➜ HyperLogLog git:(master) python3 hyperloglog_redis.py
original count: 100000  pfcount: 99763.00 error rate: 0.24%
original count: 200000  pfcount: 200154.00 error rate: 0.08%
original count: 300000  pfcount: 298060.00 error rate: 0.65%
original count: 400000  pfcount: 394419.00 error rate: 1.40%
original count: 500000  pfcount: 496263.00 error rate: 0.75%
original count: 600000  pfcount: 595397.00 error rate: 0.77%
original count: 700000  pfcount: 712731.00 error rate: 1.82%
original count: 800000  pfcount: 793678.00 error rate: 0.79%
original count: 900000  pfcount: 899268.00 error rate: 0.08%

The error is basically around 0.81%. Why is the standard error 0.81%? Redis uses 16384 buckets, and the standard error formula for HyperLogLog is 1.04/sqrt(m), where m is the number of buckets. With m = 16384, that gives 1.04/128 ≈ 0.81%.
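The formula can be checked directly (a quick sketch of the 1.04/sqrt(m) standard-error calculation for the two bucket counts used in this article):

```python
import math

# Standard error of HyperLogLog: 1.04 / sqrt(m), m = number of buckets.
for m in (1024, 16384):
    err = 1.04 / math.sqrt(m)
    print(m, "{:.2%}".format(err))
# 1024  -> 3.25%
# 16384 -> 0.81%
```

This also explains why our 1024-bucket toy implementation saw errors of a few percent while Redis stays under 1%.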

3.3 Memory Analysis

Redis uses 16,384 buckets to store the HyperLogLog registers; how much memory does that take? Redis can count up to 2^64 elements, which means each bucket’s maxbit can be at most 64, so it needs 6 bits of storage (2^6 = 64). That gives 16384 × 6 / 8 = 12 KB of memory.

The first section mentioned that a bitmap needs 12 MB for 100 million values; for 2^64 values a rough calculation gives on the order of exabytes (2^64 bits ≈ 2 EB, i.e. over two million TB), while HyperLogLog needs only 12 KB. You can imagine the power of HyperLogLog. This is not to say that bitmaps are bad: every data structure has the scenarios it suits best. It is only to say that for cardinality statistics, HyperLogLog is a very powerful algorithm.

When the number of elements is small, Redis uses a sparse storage representation whose size is less than 12 KB; otherwise it uses the dense representation with a fixed size of 12 KB. The dense storage is implemented on top of a Redis string bitmap: 16,384 consecutive buckets, each occupying 6 bits.

Read the Redis source for more details: github.com/antirez/red…

Related articles:

  • Redis Advanced Theme BloomFilter

The code is available at github.com/fuzctc/tc-r…

For more Redis-related articles and discussion, follow the public account: “Tiancheng Technology Talk”