HyperLogLog is a probabilistic data structure used to estimate the cardinality of a data set. The data set might be the IP addresses, e-mail addresses, or user IDs of a site's visitors.

Cardinality is the number of distinct values in a set. For example, the cardinality of {a, b, c, d} is 4, and the cardinality of {a, b, c, d, a} is still 4: even though a appears twice, it is only counted once.
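In code terms, exact cardinality is just the size of a deduplicated set, which is also why the exact approach needs memory proportional to the number of distinct values. A quick Python illustration:

```python
# Exact cardinality: store every distinct value, then count them.
values = ["a", "b", "c", "d", "a"]  # "a" appears twice
cardinality = len(set(values))      # the set collapses duplicates
print(cardinality)  # -> 4
```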

Calculating the cardinality of a data set exactly consumes a large amount of memory, because the data set itself must be stored: while traversing it, the only way to tell whether the current value has already been seen is to compare it with the values traversed so far. As the data set grows, this memory consumption becomes impossible to ignore and eventually critical.

There are generally three ways to count set cardinality in Redis: a HashMap, a BitMap, or HyperLogLog. The first two consume significantly more memory as the set grows by orders of magnitude, but HyperLogLog does not.

Redis HyperLogLog trades accuracy for memory: with only 12 KB of memory it can count up to 2^64 elements with a standard error of 0.81%. This makes HyperLogLog a good fit for scenarios where perfect accuracy is not required, such as daily or monthly visitor statistics.

Recording such a large cardinality in such a small amount of memory is a remarkable result. Below we take an in-depth look at how HyperLogLog is used, its basic principles, its source-code implementation, and concrete test data.

Use of HyperLogLog in Redis

Redis provides PFADD, PFCOUNT, and PFMERGE commands to use HyperLogLog.

PFADD is used to add elements to HyperLogLog.

> PFADD visitors alice bob carol
(integer) 1
> PFCOUNT visitors
(integer) 3

If the approximate cardinality estimated by HyperLogLog changes as a result of the PFADD command, the command returns 1; otherwise it returns 0. If the given key does not exist, an empty HyperLogLog structure is created before the command is executed.

The PFCOUNT command returns the approximate cardinality of the HyperLogLog. After computing the cardinality, PFCOUNT caches the value inside the HyperLogLog structure, so it does not need to be recomputed until the next successful PFADD.

PFMERGE merges multiple HyperLogLogs into one. The cardinality of the merged HyperLogLog is close to the cardinality of the union of all the input HyperLogLogs.

> PFADD customers alice dan
(integer) 1
> PFMERGE everyone visitors customers
OK
> PFCOUNT everyone
(integer) 4

Memory consumption comparison experiment

Let’s compare the memory consumption of the following three data structures: HashMap, BitMap and HyperLogLog.

We first use a Lua script to insert a certain number of numbers into each Redis data structure, then execute the bgsave command, and finally use the rdb command of redis-rdb-tools to check the memory size of each key.

The following is the Lua script. If you are not familiar with the implementation of Lua scripts in Redis, you can take a look at my previous article “Distributed traffic limiting based on Redis and Lua”.

local key = KEYS[1]
local size = tonumber(ARGV[1])
local method = tonumber(ARGV[2])

for i=1,size,1 do
  if (method == 0)
  then
    redis.call('hset',key,i,1)
  elseif (method == 1)
  then
    redis.call('pfadd',key, i)
  else
    redis.call('setbit', key, i, 1)
  end
end

We load the Lua script into Redis with redis-cli's script load command, then use the evalsha command to insert ten million numbers into each of the three data structures (HashMap, HyperLogLog, and BitMap), and finally use the rdb command to view the memory consumed by each structure.

[root@VM_0_11_centos ~]# redis-cli -a 082203 script load "$(cat HyperLogLog.lua)"
"6255c6d0a1f32349f59fd2c8711e93f2fbc7ecf8"
[root@VM_0_11_centos ~]# redis-cli -a 082203 evalsha 6255c6d0a1f32349f59fd2c8711e93f2fbc7ecf8 1 hashmap 10000000 0
(nil)
[root@VM_0_11_centos ~]# redis-cli -a 082203 evalsha 6255c6d0a1f32349f59fd2c8711e93f2fbc7ecf8 1 hyperloglog 10000000 1
(nil)
[root@VM_0_11_centos ~]# redis-cli -a 082203 evalsha 6255c6d0a1f32349f59fd2c8711e93f2fbc7ecf8 1 bitmap 10000000 2
(nil)


[root@VM_0_11_centos ~]# rdb -c memory dump.rdb 
database,type,key,size_in_bytes,encoding,num_elements,len_largest_element,expiry
0,string,bitmap,1310768,string,1250001,1250001,
0,string,hyperloglog,14392,string,12304,12304,
0,hash,hashmap,441326740,hashtable,10000000,8,

We conducted two rounds of experiments, inserting 10,000 numbers and 10 million numbers respectively. The memory consumption of the three data structures is shown below.

As the table shows, BitMap consumes the least memory at the 10,000 scale, while HyperLogLog consumes the least at the 10 million scale. More importantly, HyperLogLog's consumption stays at roughly 14,392 bytes regardless of scale, which makes it unique in terms of memory consumption.

Basic principles

HyperLogLog is a probabilistic data structure that counts the approximate cardinality of a set using a probabilistic algorithm. Its algorithm originates from the Bernoulli process.

A Bernoulli process here is a coin-flipping experiment: flip a fair coin, which lands heads or tails with probability 1/2 each, and keep flipping until it lands heads, counting the number of flips k. For example, if the first flip is heads, k is 1; if the first flip is tails and heads first appears on the third flip, k is 3.

For n Bernoulli processes, we get n values k_1, k_2, ..., k_n, and the maximum among them is k_max.

Through a series of mathematical derivations, we can draw the conclusion that n ≈ 2^(k_max), and use this as an estimate of n. That is, we can approximate how many Bernoulli processes were performed from the maximum number of flips observed.
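This estimate is easy to check with a quick simulation. The sketch below is illustrative Python, not Redis code (the helper name `bernoulli_k` is made up here): it runs n Bernoulli processes and estimates n as 2^k_max. The single-run estimate is very rough, which is exactly the error problem that bucketing addresses later.

```python
import random

def bernoulli_k(rng):
    """Flip a fair coin until heads; return the number of flips k."""
    k = 1
    while rng.random() < 0.5:  # tails: keep flipping
        k += 1
    return k

rng = random.Random(42)  # fixed seed for reproducibility
n = 100_000              # number of Bernoulli processes
k_max = max(bernoulli_k(rng) for _ in range(n))
estimate = 2 ** k_max    # n is approximated by 2^k_max
print(n, k_max, estimate)
```

With n = 100,000 we expect k_max to land near log2(100,000) ≈ 17, so the estimate has the right order of magnitude but can easily be off by a factor of 2 or more.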

Below, we will explain how HyperLogLog simulates Bernoulli’s process and finally calculates the cardinality of sets.

When HyperLogLog adds an element, it first converts the element into a 64-bit bit string with a hash function. For example, the input 5 becomes 101 (leading zeros omitted, likewise below). This bit string is then treated as a Bernoulli process of coin flips: a 0 bit represents tails and a 1 bit represents heads. If an element hashes to 10010000, then reading from the low bit toward the high bit, the first 1 appears at position 5, which corresponds to a Bernoulli process in which heads first appeared on the 5th flip.
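The rule "position of the first 1, counting from the low bit" can be written as a small helper (a hypothetical Python function, not from the Redis source; it assumes the input is non-zero):

```python
def first_one_position(bits: int) -> int:
    """1-based position of the lowest set bit (bits must be non-zero)."""
    pos = 1
    while bits & 1 == 0:
        bits >>= 1
        pos += 1
    return pos

print(first_one_position(0b10010000))  # -> 5, matching the example above
```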

Therefore, the basic idea of HyperLogLog is to estimate the overall cardinality from the maximum position of the first 1 across the bit strings of the elements in the set. A single such estimate has a large error, so to reduce it HyperLogLog introduces bucket averaging: it splits the values into m buckets and computes the harmonic mean across them.

HyperLogLog in Redis uses 2^14 buckets, that is, 16,384 buckets. Each bucket is a 6-bit register, as shown in the figure below.

HyperLogLog takes the low 14 bits of the 64-bit string described above as the bucket number, then finds the position of the first 1 in the remaining 50 bits and stores that position in the bucket. Since the first 1 can be at position 50 at most, the value always fits in the bucket's 6-bit register (which can hold values up to 63).

Before storing the value, HyperLogLog compares it with the old value already in the bucket, and only writes it if it is greater than the old value. An example is shown in the figure below.
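Putting the last few paragraphs together, the update step can be sketched as follows. This is an illustrative Python sketch, not the Redis implementation: Redis uses its own 64-bit hash function, so SHA-1 stands in here purely to produce a deterministic 64-bit value.

```python
import hashlib

def hll_update(registers, value: str):
    """Sketch of a HyperLogLog update: the low 14 bits of the 64-bit hash
    pick one of 16384 buckets; the position of the first 1 in the remaining
    50 bits is the candidate value; the bucket keeps the maximum seen."""
    h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "little")
    bucket = h & 0x3FFF  # low 14 bits -> bucket number 0..16383
    rest = h >> 14       # the remaining 50 bits
    pos = 1
    while pos <= 50 and rest & 1 == 0:
        rest >>= 1
        pos += 1
    if pos > registers[bucket]:  # only overwrite with a larger value
        registers[bucket] = pos

registers = [0] * 16384
hll_update(registers, "alice")
hll_update(registers, "alice")  # duplicate element: registers do not change
```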

At this point, for performance reasons, the current cardinality is not recomputed. Instead, a flag bit in the card attribute of the HyperLogLog header is set, marking the cached value as invalid. During the next PFCOUNT, if this flag shows the cache is invalid, the cardinality is recalculated and stored back into the cache.

When computing the approximate cardinality, the value in each bucket is read and plugged into the estimation formula, which performs harmonic averaging and result correction to obtain the estimated cardinality.
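For reference, the estimation formula from the original HyperLogLog paper is E = α_m · m² · (Σ_j 2^(−M[j]))^(−1), where M[j] is the value in bucket j and α_m is a bias-correction constant. A minimal Python sketch, leaving out the small- and large-range corrections that Redis also applies:

```python
def hll_estimate(registers):
    """Harmonic-mean cardinality estimate: alpha_m * m^2 / sum(2^-M[j]).
    alpha_m ~= 0.7213 / (1 + 1.079 / m) for large m (m >= 128)."""
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)
    harmonic = sum(2.0 ** -r for r in registers)  # sum of 2^-M[j]
    return alpha * m * m / harmonic
```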

Redis source code analysis

Let’s first look at the definition of a HyperLogLog object

struct hllhdr {
    char magic[4];       /* "HYLL" magic string. */
    uint8_t encoding;    /* HLL_DENSE or HLL_SPARSE. */
    uint8_t notused[3];  /* Reserved bits, all 0. */
    uint8_t card[8];     /* Cached cardinality. */
    uint8_t registers[]; /* Data byte array. */
};

The registers array in the HyperLogLog object holds the buckets, and it has two storage layouts: dense and sparse. The two differ only in how the buckets are represented, and they show Redis's extreme pursuit of memory savings.

Let's first look at the dense layout, which is simple and direct: since there are 2^14 buckets of 6 bits each, Redis allocates just enough uint8_t bytes to hold them all. This requires converting between byte positions and bucket positions, because a byte has 8 bits while a bucket needs only 6.

So we need to convert a bucket's number into its byte offset offset_bytes and the bit offset offset_bits within that byte. Note the little-endian order, with the high bits on the right, which need to be reversed when read.

If offset_bits is less than or equal to 2, the whole bucket lies within a single byte, and we only need to extract those 6 bits from that byte to get the bucket value.

If offset_bits is greater than 2, the bucket spans two bytes. In that case, take the relevant bits from both bytes and splice them together to obtain the bucket value, as shown in the following figure.
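The two cases can be sketched in Python, mirroring the logic of Redis's HLL_DENSE_GET_REGISTER macro (the function names here are made up for illustration):

```python
def locate_register(regnum: int):
    """Map a 6-bit register number to (byte offset, bit offset)."""
    return regnum * 6 // 8, regnum * 6 % 8

def get_register(registers: bytes, regnum: int) -> int:
    """Read one 6-bit register from the packed byte array."""
    b, fb = locate_register(regnum)
    if fb <= 2:
        # the whole register fits inside this one byte
        return (registers[b] >> fb) & 0x3F
    # the register spans two bytes: low part here, high part in the next byte
    low = registers[b] >> fb
    high = registers[b + 1] << (8 - fb)
    return (low | high) & 0x3F
```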

HyperLogLog's sparse layout exists to save even more memory. Instead of actually allocating bytes for all 2^14 buckets, it uses a special run-length byte encoding.

To represent the sparse layout, Redis defines three byte-level instructions:

  • ZERO: one byte. The first two bits are 00, and the remaining six bits encode the length of a run of consecutive buckets whose value is zero (up to 64 buckets).
  • XZERO: two bytes. The first two bits are 01, and the remaining 14 bits encode the length of a run of consecutive zero buckets, up to 16,384.
  • VAL: one byte. The first bit is 1, the next five bits hold the bucket value (so the maximum value it can represent is 32), and the last two bits encode how many consecutive buckets hold that value.
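A decoder for these three opcodes can be sketched as follows (illustrative Python; Redis works directly on the byte stream instead of materializing all buckets):

```python
def decode_sparse(data: bytes):
    """Expand a sparse HyperLogLog byte stream into a list of bucket values.
    Opcodes: 00xxxxxx          ZERO  -> x+1 consecutive zero buckets
             01xxxxxx xxxxxxxx XZERO -> 14-bit run of zero buckets
             1vvvvvxx          VAL   -> x+1 consecutive buckets of value v+1"""
    registers, i = [], 0
    while i < len(data):
        b = data[i]
        if b & 0x80:                                         # VAL
            value = ((b >> 2) & 0x1F) + 1
            registers += [value] * ((b & 0x03) + 1)
            i += 1
        elif b & 0x40:                                       # XZERO (two bytes)
            registers += [0] * ((((b & 0x3F) << 8) | data[i + 1]) + 1)
            i += 2
        else:                                                # ZERO
            registers += [0] * ((b & 0x3F) + 1)
            i += 1
    return registers

# An empty HyperLogLog is a single XZERO covering all 16384 buckets.
print(len(decode_sparse(bytes([0x7F, 0xFF]))))  # -> 16384
```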

So an initial HyperLogLog object needs only 2 bytes (a single XZERO) to store its data, rather than consuming 12 KB of memory. While HyperLogLog holds only a few elements, it can be represented with a handful of XZERO, VAL, and ZERO instructions, as shown below.

Redis switches a HyperLogLog from sparse to dense storage when either of the following conditions is met:

  • Any bucket's value changes from 32 to 33, because a VAL instruction can represent at most 32.
  • The total number of bytes consumed by the sparse representation exceeds 3000. This threshold can be adjusted with the hll-sparse-max-bytes configuration parameter.

The rest of the Redis HyperLogLog source code involves a lot of bit manipulation, so we won't walk through all of it here. If you have a better understanding of HyperLogLog, or any questions, please leave a comment.
