Deep Redis HyperLogLog internal data structure analysis

HyperLogLog algorithm is a very clever algorithm to approximate the number of massive deweighting elements. It internally maintains 16,384 buckets to record the number of elements in each bucket. When an element arrives, it hashes to one of the buckets, affecting the bucket count with some probability. Since it is a probabilistic algorithm, the individual bucket count is not accurate, but the average sum of all the bucket count values will be very close to the real count.

In order to understand the HyperLogLog algorithm, we first simplify its counting logic. Because it is a de-count, if it is an accurate de-count, it must use the set set, use the set to record all the elements, and then use the scard instruction to get the size of the set to get the total count. Because there are many elements, a single set will be very large, so the set is broken into 16,384 small sets. When an element arrives, the hash algorithm dispatches the element to one of the small collection stores, and the same element is always hashed to the same small collection. So the total count is just the sum of the sizes of all the small sets. Counting precisely in this way allows you to subtract elements as well as add them.

Described in Python code as follows

# coding:utf-8
import hashlib

class ExactlyCounter:

    def __init__(self):
        Allocate 16384 empty sets first
        self.buckets = []
        for i in range(16384):
            self.buckets.append(set([]))
        # use md5 hash algorithm
        self.hash = lambda x: int(hashlib.md5(x).hexdigest(), 16)
        self.count = 0

    def add(self, element):
        h = self.hash(element)
        idx = h % len(self.buckets)
        bucket = self.buckets[idx]
        old_len = len(bucket)
        bucket.add(element)
        if len(bucket) > old_len:
            # If the quantity changes, the total is +1
            self.count += 1

    def remove(self, element):
        h = self.hash(element)
        idx = h % len(self.buckets)
        bucket = self.buckets[idx]
        old_len = len(bucket)
        bucket.remove(element)
        if len(bucket) < old_len:
            # If the quantity changes, the total is -1
            self.count -= 1


if __name__ == '__main__':
    c = ExactlyCounter()
    for i in range(100000):
        c.add("element_%d" % i)
    print c.count
    for i in range(100000):
        c.remove("element_%d" % i)
    print c.count
Copy the code

There is no obvious benefit to collection shredding because the total memory footprint is not reduced. HyperLogLog is definitely not this algorithm, it needs to optimize this small set, compress its storage space, make its memory very small. In HyperLogLog algorithm, the space occupied by each bucket is actually only 6 bits, which naturally cannot contain all the elements in the bucket. It records the log value of the number of elements in the bucket.

To illustrate what this log is, let’s first consider a small problem. A random integer value, there’s a 50% chance that the integer ends with a 0, either a 0 or a 1. Similarly, there is a 25% chance of having two zeros at the end, 12.5% chance of having three zeros, and so on, 2 to the minus k. If we randomly pick a lot of integers, the number of integers we don’t know, but we’re recording the maximum number K of consecutive zeros at the end of integers. We can use this K to approximate the number of integers, which is 2 to the K.

And, of course, it’s going to be very inaccurate, because maybe then you randomly pick a lot of integers, but the maximum number of consecutive zeros at the end K doesn’t change, but the estimate is still 2 to the K. You might think that if this K was a floating-point number, every time you randomly pick up a new element, it would go up a little bit, and you’d get a much better estimate.

HyperLogLog allocates 16384 buckets and then mediates the maximum number K of all buckets to get an average of the trailing zero maximum number K#. K# is a floating point number, and using the average of 2^K# to estimate the total number of elements is relatively accurate. But this is just a simplification of the algorithm, the real algorithm has a lot of correction factors, because there are too many mathematical theories involved to describe exactly here.

Let’s look at the concrete implementation of Redis HyperLogLog algorithm. We know that the actual space taken up by a HyperLogLog is about 13684 * 6bit / 8 = 12K bytes. But when the count is small, most buckets count to zero. If too many of the 12K bytes are zero, this space can be saved. Redis uses sparse storage in the case of relatively small count, and the space occupation of sparse storage is far less than 12K bytes. The opposite of sparse storage is dense storage, which occupies a constant 12K bytes.

Dense storage structure

In both sparse and dense storage, Redis uses string bitmaps to store the count values of all HyperLogLog buckets. The structure of dense storage is very simple, it is a string bitmap of 16384 consecutive 6 bits.

Given a bucket number, how do I get its 6bit count? These 6 bits can be inside a byte, or they can span byte boundaries. We need to do a proper shift concatenation of this one or two bytes to get the count.

If the bucket number is IDX, the start byte offset of this 6-bit number is offset_bytes, and the start bit offset of this byte is offset_bits. We have a

offset_bytes = (idx * 6) / 8
offset_bits = (idx * 6) % 8
Copy the code

The former is the quotient and the latter is the remainder. For example, the byte offset of bucket 2 is 1, which is the second byte. Its bit offset is 4, where the fifth bit of the second byte starts with the count of bucket 2. It is important to note that the byte order is left low and right high, whereas most bytes we use are left high and right low, so we need to invert it in our mind.

If offset_bits is less than or equal to 2, then the 6 bits are inside a byte, and we can use the following expression directly to get the count val

val = buffer[offset_bytes] >> offset_bits  # shift to the right
Copy the code

If offset_bits is greater than 2, then the byte boundary is crossed and a two-byte bit fragment needs to be concatenated.

# low value
low_val = buffer[offset_bytes] >> offset_bits
# number of low values
low_bits = 8 - offset_bits
# splice, keep the lower 6 bits
val = (high_val << low_bits | low_val) & 0b111111
Copy the code

The Redis source code is a little more obscure, however, and looks like it only considers the case of crossing byte boundaries. This is because if 6bits are in a single byte, the high_val value in the code above is zero, so this code can take care of both single-byte and double-byte.

// Get the count of the specified bucket
#define HLL_DENSE_GET_REGISTER(target,p,regnum) do { \
    uint8_t *_p = (uint8_t*) p; \
    unsigned long _byte = regnum*HLL_BITS/8; \ 
    unsigned long _fb = regnum*HLL_BITS&7; \  # %8 = &7
    unsigned long _fb8 = 8 - _fb; \
    unsigned long b0 = _p[_byte]; \
    unsigned long b1 = _p[_byte+1]; \ target = ((b0 >> _fb) | (b1 << _fb8)) & HLL_REGISTER_MAX; The \}while(0)

// Set the bucket count
#define HLL_DENSE_SET_REGISTER(p,regnum,val) do { \
    uint8_t *_p = (uint8_t*) p; \
    unsigned long _byte = regnum*HLL_BITS/8; \
    unsigned long _fb = regnum*HLL_BITS&7; \
    unsigned long _fb8 = 8 - _fb; \
    unsigned long _v = val; \
    _p[_byte] &= ~(HLL_REGISTER_MAX << _fb); \
    _p[_byte] |= _v << _fb; \
    _p[_byte+1] &= ~(HLL_REGISTER_MAX >> _fb8); \
    _p[_byte+1] |= _v >> _fb8; \
} while(0)
Copy the code

Sparse storage structure

Sparse storage is suitable for cases where many counts are zero. The following figure shows the state of a general sparse storage count.

When multiple consecutive buckets count to zero, Redis uses a byte to indicate how many buckets will count to zero: 00xxxxxx. The prefix of two zeros indicates that the next 6bit integer value plus one is the number of zero counters. Note that this is added because the number is meaningless if it is zero. For example, 00010101 represents 22 consecutive zero-value counters. 6bit can only represent 64 consecutive zero-value counters, so Redis also designed a continuous number of more than 64 consecutive zero-value counters, it uses two bytes to represent: 01xxxxxx yyyyyyyy, the following 14bit can represent a maximum of 16384 consecutive zero-value counters. This means that the initial state of the 16,384 buckets in the HyperLogLog data structure, all of which have zero counters, can be represented directly as 2 bytes.

If the count for consecutive buckets is non-zero, a byte of the form 1VVVVvxx is used. The middle 5bit represents the count value, and the tail 2bit represents several consecutive buckets. It means that consecutive (xx +1) counts are (VVVVV +1). For example, 10101011 means four consecutive numbers are 11. Note that both values need to be incremented by 1, because a zero in either case means that the counter is zero, so you should use the zero counter. Note that count values can only be represented up to 32, whereas HyperLogLog’s dense storage single count values can be represented up to 63 with 6 bits. When a count of the sparse storage needs to be adjusted to a value greater than 32, Redis will immediately transform the storage structure of HyperLogLog to convert the sparse storage into dense storage.

To facilitate the representation of sparse storage, Redis assigns an instruction to each of the above three byte representations.

ZERO:len A single byte represents 00[LEN-1], and a maximum of 64 consecutive zeros can be counted
VAL:value,len A single byte represents 1[value-1][len-1], and len consecutive values are the count of value
XZERO:len A double byte indicates 01[LEN-1], which contains a maximum of 16384 consecutive zero values

#define HLL_SPARSE_XZERO_BIT 0x40 /* 01xxxxxx */
#define HLL_SPARSE_VAL_BIT 0x80 /* 1vvvvvxx */
#define HLL_SPARSE_IS_ZERO(p) (((*(p)) & 0xc0) == 0) /* 00xxxxxx */
#define HLL_SPARSE_IS_XZERO(p) (((*(p)) & 0xc0) == HLL_SPARSE_XZERO_BIT)
#define HLL_SPARSE_IS_VAL(p) ((*(p)) & HLL_SPARSE_VAL_BIT)
#define HLL_SPARSE_ZERO_LEN(p) (((*(p)) & 0x3f)+1)
#define HLL_SPARSE_XZERO_LEN(p) (((((*(p)) & 0x3f) << 8) | (*((p)+1)))+1)
#define HLL_SPARSE_VAL_VALUE(p) ((((*(p)) >> 2) & 0x1f)+1)
#define HLL_SPARSE_VAL_LEN(p) (((*(p)) & 0x3)+1)
#define HLL_SPARSE_VAL_MAX_VALUE 32
#define HLL_SPARSE_VAL_MAX_LEN 4
#define HLL_SPARSE_ZERO_MAX_LEN 64
#define HLL_SPARSE_XZERO_MAX_LEN 16384
Copy the code

The figure above can be expressed in instruction form as follows

Storage transformation

When the count reaches a certain level, the sparse storage will be irreversibly converted to dense storage at one time. The conversion takes place immediately if either one of the two conditions is met, that is, if any count is changed from 32 to 33. Since the VAL instruction can no longer hold it, the maximum number of counts it can represent is 32. The total number of bytes used for sparse storage is over 3000 bytes. This threshold can be adjusted with the hll_SPARse_max_bytes parameter.

Count the cache

The total value of the HyperLogLog mentioned above is calculated by harmonic average of the count values of 16384 barrels based on a factor correction formula. It has to iterate through all the buckets to get to this value, and there’s a lot of floating point arithmetic involved. This is a relatively large computation.

So Redis uses an extra field to cache the total value. This field is 64-bit, with the highest bit being 1 to indicate whether the value is out of date, and 0 to indicate the remaining 63 bits.

When the count of any bucket in the HyperLogLog changes, the count cache is set to expire, but calculation is not triggered immediately. Instead, the recalculation flusher cache is triggered when the user displays the invocation of the pfcount directive. In the case of dense storage, the cache refresh needs to traverse the count values of 16384 buckets for harmonic averaging, but in the case of sparse storage, there is not such a large amount of calculation. That is to say, only when the count value is relatively large can it produce a large amount of computation. On the other hand, if the count is large, most PFADD operations will not change the count in the bucket at all.

This means that frequent calls to the PFcount instruction in a highly variable HLL counter can be a minor performance issue. This performance concern was also addressed in a blog by Redis author Antirez. However, the authors did a careful stress test and found that there was no need to worry. The average time complexity of pfcount instructions was O(1).

After this change even trying to add elements at maximum speed using a pipeline of 32 elements with 50 simultaneous clients, PFCOUNT was able to perform as well as any other O(1) command with very small constant times.

Object head

In addition to the 16,384 bucket count, HyperLogLog also has some additional fields to store, such as total count cache and storage type. So it uses an extra object header to represent it.

struct hllhdr {
    char magic[4];      /* Magic string "HYLL" */
    uint8_t encoding;   /* HLL_DENSE or HLL_SPARSE. */
    uint8_t notused[3]; /* Reserve three bytes for possible future use of */
    uint8_t card[8];    /* Total cache */
    uint8_t registers[]; /* Counters for all buckets */
};
Copy the code

So the entire internal structure of HyperLogLog is the HLL object header plus the counting bitmap of 16,384 buckets. It is represented internally in Redis as a string bitmap. You can treat HyperLogLog objects as normal strings.

127.0.0.1:6379> pfadd codehole python java golang
(integer) 1
127.0.0.1:6379> get codehole
"HYLL\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80C\x03\x84MK\x80P\xb8\x80^\xf3"
Copy the code

However, you cannot use the HyperLogLog directive to manipulate ordinary strings because it needs to check if the object header magic string is “HYLL”.

127.0.0.1:6379 >setCodehole Python OK 127.0.0.1:6379> pfadd codehole Java Golang (error) WRONGTYPE Key is not a valid HyperLogLog string value.Copy the code

But if the string starts with “HYLL\x00” or “HYLL\x01”, the HyperLogLog command can be used.

127.0.0.1:6379 >set codehole "HYLL\x01whatmagicthing"OK 127.0.0.1:6379 > get codehole"HYLL\x01whatmagicthing"
127.0.0.1:6379> pfadd codehole python java golang
(integer1)Copy the code

This may sound strange to you because HyperLogLog formats the content before executing the command, This check is to see if the object header’s Magic string is “HYLL” and the Encoding field is HLL_SPARSE=0 or HLL_DENSE=1 to determine if the current string is a HyperLogLog counter. In the case of dense storage, you also need to determine whether the length of the string is exactly equal to the length of the dense counter store.

int isHLLObjectOrReply(client *c, robj *o) {.../* Magic should be "HYLL". */
    if (hdr->magic[0] != 'H' || hdr->magic[1] != 'Y' ||
        hdr->magic[2] != 'L' || hdr->magic[3] != 'L') goto invalid;

    if (hdr->encoding > HLL_MAX_ENCODING) goto invalid;

    if(hdr->encoding == HLL_DENSE && stringObjectLen(o) ! = HLL_DENSE_SIZE)goto invalid;

    return C_OK;

invalid:
    addReplySds(c,
        sdsnew("-WRONGTYPE Key is not a valid "
               "HyperLogLog string value.\r\n"));
    return C_ERR;
}
Copy the code

HyperLogLog is to strings what Geo is to Zset. You can also use any zset instruction to access the Geo data structure, because the Geo internal store uses a pure Zset to record the geographical location of elements.

This article is excerpted from Redis Deep Adventures, an online technology booklet. Start reading Redis Deep Adventures now!

Deep Redis HyperLogLog internal data structure analysis

Dense storage structure

Sparse storage structure

Storage transformation

Count the cache

Object head

Related Posts

Are you sure you really understand Java’s four types of references (strong, weak, soft, and virtual)?

TCP’s three handshakes and four waves

The most complete guide to Java generics