Redis source code reading notes

2021/3/17 Blue jade white dew

Preface

Suggestions for reading the source code:

Open the Redis source code in VS Code, alongside Huang Jianhong's "Redis Design and Implementation": understand the concepts first, then return to the code.

When reading the database implementation, it is recommended to compile and run Redis on a Linux system; this makes everything much more intuitive.

Read the source code in order:

  1. Read the source code of the data structure
  2. Read the memory encoded data structure implementation
  3. Read the implementation of the data type
  4. Read the code related to the standalone database
  5. Read client and server code
  6. Read the multi-machine (cluster) implementation code

Redis Design and Implementation by Huang Jianhong (2nd Edition)

https://www.w3cschool.cn/hdclil/

Some material and images are taken from the Internet; reference links are given. (Images have been removed.)

Basic data structure

I recommend learning the Redis source code from the bottom up: start with the basic data structures and work upward, step by step, to appreciate the clever design hidden in the source.

Dynamic string SDS

Source files: sds.h and sds.c.

Redis builds an abstract type called Simple Dynamic String (SDS) and uses it as the default string representation.

PS: Traditional C strings are used in Redis only as string literals (constants), for example in log messages.

Structure definition

Let’s first look at the definition of SDS structure:

// Type alias, used to point to the buf attribute of sdshdr
typedef char *sds;

struct sdshdr {

    // The length of the used space in buf (the string length)
    int len;

    // The length of the free space in buf
    int free;

    // Data space
    char buf[];
};

As you can see, SDS is essentially a wrapper around the traditional C character array, designed to make string operations safer and more convenient (we will see why shortly). An example of an SDS looks like this:

Note that SDS still terminates the buffer with a null character, which is not counted in len; this lets SDS reuse some of the C string library functions directly.

Such as:

printf("%s", s->buf);

Compare C strings

Then let’s consider the question: what good is it to encapsulate C strings like this?

**First, constant-time string length.** The len attribute gives the string length in O(1) time, whereas a C string requires O(N).

**Second, buffer-overflow prevention.** A C string does not record its own length, which can cause a buffer overflow during operations such as strcat, as shown in the following figure:

After performing the concatenation on s1:

By contrast, when an SDS is modified, the SDS API first checks whether its space is sufficient; if not, it expands the space before performing the operation.

**Third, fewer memory reallocations when modifying strings.** For a C string of N characters, the underlying array always holds N+1 characters (the extra one stores the terminating null character).

If a concatenate operation (append) is performed, the program will need to expand the size of the underlying array by reallocating memory before performing this operation — forgetting this step will result in a buffer overflow.

If truncation (trim) is performed, then after this operation, the program needs to reallocate memory to free up the portion of the string that is no longer used — forgetting this step can result in a memory leak.

Memory reallocation involves system calls. Ordinary programs rarely change string lengths, but Redis, as a database, modifies strings constantly. To avoid this cost, SDS decouples the string length from the length of the underlying array by keeping unused space: in SDS, the length of the buf array is not necessarily the character count plus one; the array may contain unused bytes, and their number is recorded in the free attribute.

Through this unused space, SDS implements two optimization strategies: space pre-allocation and lazy space release.

**Fourth, the space pre-allocation strategy.** When the API modifies an SDS, it allocates not only the necessary space but extra space as well, according to these rules:

If the modified SDS length is less than 1 MB (SDS_MAX_PREALLOC), an additional len bytes are allocated (the buffer is doubled).

If it is larger than 1 MB, an additional 1 MB is allocated.

This pre-allocation strategy reduces the number of memory reallocations required by N consecutive string-growth operations from exactly N to at most N.

**Fifth, lazy space release.** When the SDS API shortens the string held by an SDS, the program does not immediately reallocate memory to reclaim the freed bytes; instead it records their number in the free attribute and keeps them for future use.

**Sixth, binary safety.** Because C strings end at the first '\0', they can only hold text, not binary data such as images, audio, video, or compressed files. Redis needs this capability, and SDS provides it.

It is these small improvements that make Redis efficient and stable.

To sum up:

Interface API

Overview

| Function | Role | Time complexity |
| --- | --- | --- |
| sdsnew | Create an SDS holding the given C string. | O(N), N is the length of the given C string. |
| sdsempty | Create an empty SDS holding nothing. | O(1) |
| sdsfree | Release the given SDS. | O(N), N is the length of the freed SDS. |
| sdslen | Return the number of used bytes in the SDS. | O(1), read directly from the len attribute. |
| sdsavail | Return the number of unused bytes in the SDS. | O(1), read directly from the free attribute. |
| sdsdup | Create a copy of the given SDS. | O(N), N is the length of the given SDS. |
| sdsclear | Clear the string held by the SDS. | O(1), thanks to the lazy space release strategy. |
| sdscat | Concatenate the given C string onto the end of the SDS string. | O(N), N is the length of the appended C string. |
| sdscatsds | Concatenate the given SDS string onto the end of another SDS string. | O(N), N is the length of the appended SDS. |
| sdscpy | Copy the given C string into the SDS, overwriting the original SDS string. | O(N), N is the length of the copied C string. |
| sdsgrowzero | Grow the SDS to the given length, padding with null bytes. | O(N), N is the number of bytes added. |
| sdsrange | Keep only the data inside the given interval; data outside it is discarded. | O(N), N is the number of bytes kept. |
| sdstrim | Given an SDS and a C string, remove from both ends of the SDS every character that appears in the C string. | O(M*N), M is the length of the SDS, N is the length of the given C string. |
| sdscmp | Compare two SDS strings for equality. | O(N), N is the length of the shorter of the two. |
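Before diving into the implementation, here is a minimal usage sketch of a few of these APIs (assuming sds.h and sds.c from the Redis 3.0 source tree, compiled together with zmalloc.c):

#include <stdio.h>
#include "sds.h"

int main(void) {
    sds s = sdsnew("hello");

    // sdscat may reallocate the buffer, so always reassign the handle
    s = sdscat(s, " redis");

    printf("%s (len=%zu, free=%zu)\n", s, sdslen(s), sdsavail(s));

    sdsfree(s);
    return 0;
}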

Let's pick a few APIs and look at the source code.

sdslen

/*
 * Return the actual length of the string saved by the SDS.
 *
 * T = O(1)
 */
static inline size_t sdslen(const sds s) {
    struct sdshdr *sh = (void*)(s-(sizeof(struct sdshdr)));
    return sh->len;
}

Given the structure definition of SDS, why does this pointer arithmetic recover the header and its len field?

Inside a structure, char buf[] (and the GCC extension char buf[0]) declares a flexible array member: it occupies no space in sizeof(struct sdshdr), so the structure ends exactly where buf begins. (A char *buf member, by contrast, would store a pointer and occupy extra bytes.)

Refer to the link: www.cnblogs.com/hpplinux/ar…

So (void*)(s - sizeof(struct sdshdr)) moves the pointer back from buf to the start of the sdshdr structure, from which the member variables can be accessed directly.
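A minimal sketch of this property (hdr_flex is a hypothetical stand-in for sdshdr):

#include <stdio.h>

struct hdr_flex {
    int len;
    int free;
    char buf[];   // flexible array member: contributes nothing to sizeof
};

int main(void) {
    // Prints 8 on typical platforms: only len and free are counted,
    // so buf starts exactly sizeof(struct hdr_flex) bytes after the
    // start of the structure.
    printf("%zu\n", sizeof(struct hdr_flex));
    return 0;
}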

sdsnewlen: initialize a string

sds sdsnewlen(const void *init, size_t initlen) {
    struct sdshdr *sh;

    // Choose zmalloc or zcalloc depending on whether there is
    // initialization content
    // T = O(N)
    if (init) {
        sh = zmalloc(sizeof(struct sdshdr)+initlen+1);
    } else {
        sh = zcalloc(sizeof(struct sdshdr)+initlen+1);
    }

    // Memory allocation failed, return
    if (sh == NULL) return NULL;

    // Set the initial length
    sh->len = initlen;
    // A new SDS reserves no free space
    sh->free = 0;

    // If there is initialization content, copy it into buf
    // T = O(N)
    if (initlen && init)
        memcpy(sh->buf, init, initlen);

    // Terminate with '\0'
    sh->buf[initlen] = '\0';

    // Return buf as the sds handle, not the whole header
    return (char*)sh->buf;
}

sdsclear: lazy space release

void sdsclear(sds s) {

    // Locate the header
    struct sdshdr *sh = (void*) (s-(sizeof(struct sdshdr)));

    // Recalculate the attributes: all used space becomes free space
    sh->free += sh->len;
    sh->len = 0;

    // Logically clear the string; the memory is kept for reuse
    sh->buf[0] = '\0';
}

sdsMakeRoomFor

This function grows the free space of an SDS: the required length is either doubled or, past 1 MB, increased by an extra 1 MB.

sds sdsMakeRoomFor(sds s, size_t addlen) {
    struct sdshdr *sh, *newsh;

    // Current free space
    size_t free = sdsavail(s);
    size_t len, newlen;

    // If there is already enough free space, return directly
    if (free >= addlen) return s;

    // Current length and header
    len = sdslen(s);
    sh = (void*) (s-(sizeof(struct sdshdr)));

    // Minimum required length
    newlen = (len+addlen);

    // If the new length is below SDS_MAX_PREALLOC (1 MB),
    // allocate twice the required space;
    // otherwise add SDS_MAX_PREALLOC on top of it
    if (newlen < SDS_MAX_PREALLOC)
        newlen *= 2;
    else
        newlen += SDS_MAX_PREALLOC;

    // T = O(N)
    newsh = zrealloc(sh, sizeof(struct sdshdr)+newlen+1);

    // Allocation failed
    if (newsh == NULL) return NULL;

    // Update the free space
    newsh->free = newlen - len;

    // Return the sds
    return newsh->buf;
}
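As a usage example, this is how sdscatlen, the workhorse behind sdscat, relies on sdsMakeRoomFor (lightly abridged from sds.c):

sds sdscatlen(sds s, const void *t, size_t len) {
    struct sdshdr *sh;
    size_t curlen = sdslen(s);

    // Make sure there is room for len extra bytes
    s = sdsMakeRoomFor(s, len);
    if (s == NULL) return NULL;

    // Append the new data and fix up the header
    sh = (void*) (s-(sizeof(struct sdshdr)));
    memcpy(s+curlen, t, len);
    sh->len = curlen+len;
    sh->free = sh->free-len;

    // Keep the terminating null byte
    s[curlen+len] = '\0';
    return s;
}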

Linked list (list)

Source files: adlist.h and adlist.c.

Redis implements a double-ended (doubly linked) list, which is used widely to implement Redis features such as list keys and publish/subscribe. Because node values are stored as void *, Redis lists can hold values of any type.

Structure definition

typedef struct list {

    // Head node
    listNode *head;

    // Tail node
    listNode *tail;

    // Node value duplication function
    void *(*dup)(void *ptr);

    // Node value release function
    void (*free)(void *ptr);

    // Node value comparison function
    int (*match)(void *ptr, void *key);

    // Number of nodes in the list
    unsigned long len;

} list;

The linked list node is defined as follows:

typedef struct listNode {

    // Previous node
    struct listNode *prev;

    // Next node
    struct listNode *next;

    // Node value
    void *value;

} listNode;

One thing worth noting here is the iterator, which is used to traverse the double-ended list: each call to listNext returns the next node of the traversal.

typedef struct listIter {

    // The next node to be visited
    listNode *next;

    // Iteration direction (forward or backward)
    int direction;

} listIter;

An example of a linked list would look like this:

Interface API

Overview:

| Function | Role | Time complexity |
| --- | --- | --- |
| listSetDupMethod | Set the given function as the node value copy function of the list. | O(1) |
| listGetDupMethod | Return the node value copy function currently used by the list. | O(1), read directly from the dup attribute. |
| listSetFreeMethod | Set the given function as the node value release function of the list. | O(1) |
| listGetFree | Return the node value release function currently used by the list. | O(1), read directly from the free attribute. |
| listSetMatchMethod | Set the given function as the node value comparison function of the list. | O(1) |
| listGetMatchMethod | Return the node value comparison function currently used by the list. | O(1), read directly from the match attribute. |
| listLength | Return the length of the list (how many nodes it contains). | O(1), read directly from the len attribute. |
| listFirst | Return the head node of the list. | O(1), read directly from the head attribute. |
| listLast | Return the tail node of the list. | O(1), read directly from the tail attribute. |
| listPrevNode | Return the previous node of the given node. | O(1), read directly from the node's prev attribute. |
| listNextNode | Return the next node of the given node. | O(1), read directly from the node's next attribute. |
| listNodeValue | Return the value currently held by the given node. | O(1), read directly from the node's value attribute. |
| listCreate | Create a new list containing no nodes. | O(1) |
| listAddNodeHead | Add a new node holding the given value to the head of the list. | O(1) |
| listAddNodeTail | Add a new node holding the given value to the tail of the list. | O(1) |
| listInsertNode | Add a new node holding the given value before or after the given node. | O(1) |
| listSearchKey | Find and return the node holding the given value. | O(N), N is the length of the list. |
| listIndex | Return the node at the given index. | O(N), N is the length of the list. |
| listDelNode | Remove the given node from the list. | O(1) |
| listRotate | Pop the tail node of the list and insert it at the head as the new head node. | O(1) |
| listDup | Duplicate the given list. | O(N), N is the length of the list. |
| listRelease | Release the given list and all nodes in it. | O(N), N is the length of the list. |

Let's pick a few APIs and see how they are implemented:

**listAddNodeHead:** insert a node at the head

list *listAddNodeHead(list *list, void *value) {
    listNode *node;

    // Allocate memory for the new node
    if ((node = zmalloc(sizeof(*node))) == NULL)
        return NULL;

    // Save the value
    node->value = value;

    // Add the node to an empty list
    if (list->len == 0) {
        list->head = list->tail = node;
        node->prev = node->next = NULL;

    // Add the node to a non-empty list
    } else {
        node->prev = NULL;
        node->next = list->head;
        list->head->prev = node;
        list->head = node;
    }

    // Update the node count
    list->len++;

    return list;
}

**listGetIterator:** create an iterator

/*
 * Create an iterator for the given list; each subsequent call to
 * listNext returns the next node of the iteration.
 *
 * direction decides the iteration direction:
 *   AL_START_HEAD: iterate from head to tail
 *   AL_START_TAIL: iterate from tail to head
 *
 * T = O(1)
 */
listIter *listGetIterator(list *list, int direction) {
    // Allocate memory for the iterator
    listIter *iter;
    if ((iter = zmalloc(sizeof(*iter))) == NULL) return NULL;

    // Set the starting node according to the direction
    if (direction == AL_START_HEAD)
        iter->next = list->head;
    else
        iter->next = list->tail;

    // Record the direction
    iter->direction = direction;

    return iter;
}

**listSearchKey:** Finds the node with the specified value in the linked list

listNode *listSearchKey(list *list, void *key) {
    listIter *iter;
    listNode *node;

    // Iterate over the whole list
    iter = listGetIterator(list, AL_START_HEAD);
    while ((node = listNext(iter)) != NULL) {

        // Compare with the match function if one is set
        if (list->match) {
            if (list->match(node->value, key)) {
                listReleaseIterator(iter);
                // Found
                return node;
            }
        // Otherwise compare the pointers directly
        } else {
            if (key == node->value) {
                listReleaseIterator(iter);
                // Found
                return node;
            }
        }
    }

    listReleaseIterator(iter);

    // Not found
    return NULL;
}
  • So what’s the use of a list iterator?

From the source, the list iterator mainly makes bidirectional traversal convenient to implement. Iterators are a design pattern that lets the list adapt to different algorithms.
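As a usage sketch (all names are from adlist.h), a typical traversal looks like this:

#include "adlist.h"

void process_all(list *l) {
    // Create a head-to-tail iterator
    listIter *iter = listGetIterator(l, AL_START_HEAD);
    listNode *node;

    while ((node = listNext(iter)) != NULL) {
        void *value = listNodeValue(node);  // macro: node->value
        // ... process value ...
        (void)value;
    }

    // Free the iterator when done
    listReleaseIterator(iter);
}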

Dictionary (dict)

Source files: dict.h and dict.c.

Dictionaries in Redis are implemented using hash tables. Also, for progressive Rehash operations, each dictionary has two hash tables (new/old).

Structure definition


typedef struct dict {

    // Type-specific functions
    dictType *type;

    // Private data: holds optional arguments that are passed to the
    // type-specific functions above
    void *privdata;

    // Two hash tables (the second is only used during rehash)
    dictht ht[2];

    // Rehash index:
    // -1 when no rehash is in progress,
    // otherwise a value between 0 and size
    int rehashidx;

    // Number of safe iterators currently running
    int iterators;

} dict;

The hash table is defined as follows:

// Each dictionary uses two of these hash tables to implement progressive rehash
typedef struct dictht {

    // Array of buckets; each bucket points to a chain of dictEntry nodes
    dictEntry **table;

    // Hash table size
    unsigned long size;

    // Index mask, always equal to size - 1
    unsigned long sizemask;

    // Number of nodes currently stored in the hash table
    unsigned long used;

} dictht;

The hash table entry is defined as follows; Redis resolves hash collisions with separate chaining (linked lists of entries):

// Hash table entry
typedef struct dictEntry {

    // Key
    void *key;

    // Value
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
    } v;

    // Next entry in the chain (separate chaining)
    struct dictEntry *next;

} dictEntry;

An example of a hash table is shown below:

When a hash collision occurs, the new key-value pair is inserted at the head of the chain, in front of the old one, in O(1); inserting at the tail instead would cost O(N) to traverse the chain.

An example of a normal dictionary is shown below:

Let's look at some of the important APIs, such as computing hash values, handling hash collisions, and rehashing.

Interface API

Overview

| Function | Role | Time complexity |
| --- | --- | --- |
| dictCreate | Create a new dictionary. | O(1) |
| dictAdd | Add the given key-value pair to the dictionary. | O(1) |
| dictReplace | Add the given key-value pair to the dictionary; if the key already exists, replace the original value with the new one. | O(1) |
| dictFetchValue | Return the value of the given key. | O(1) |
| dictGetRandomKey | Return a random key-value pair from the dictionary. | O(1) |
| dictDelete | Remove the key-value pair with the given key from the dictionary. | O(1) |
| dictRelease | Release the given dictionary and all key-value pairs it contains. | O(N), N is the number of key-value pairs in the dictionary. |

**dictAddRaw:** add a new key-value pair

dictEntry *dictAddRaw(dict *d, void *key) {
    int index;
    dictEntry *entry;
    dictht *ht;

    // Perform a single step of rehash if one is in progress
    // T = O(1)
    if (dictIsRehashing(d)) _dictRehashStep(d);

    // Compute the index of the key; -1 means the key already exists
    if ((index = _dictKeyIndex(d, key)) == -1)
        return NULL;

    // If the dictionary is rehashing, add the new key to hash table 1;
    // otherwise add it to hash table 0
    ht = dictIsRehashing(d) ? &d->ht[1] : &d->ht[0];

    // Allocate space for the new node
    entry = zmalloc(sizeof(*entry));

    // Insert the new node at the head of the chain
    entry->next = ht->table[index];
    ht->table[index] = entry;

    // Update the node counter
    ht->used++;

    // Set the key of the new node
    // T = O(1)
    dictSetKey(d, entry, key);

    return entry;
}

Note the helper _dictKeyIndex(), which computes the index at which the given key can be inserted (returning -1 if the key already exists). On the way it may expand the table: if dict_can_resize is set (no BGSAVE/BGREWRITEAOF child running), expansion is triggered once the load factor, used divided by size, reaches 1; if a child process is running, expansion is forced only when the load factor exceeds 5.
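The expansion check itself lives in _dictExpandIfNeeded() in dict.c; slightly abridged:

static int _dictExpandIfNeeded(dict *d)
{
    // An expansion (rehash) is already in progress
    if (dictIsRehashing(d)) return DICT_OK;

    // The hash table is empty: create it at the initial size
    if (d->ht[0].size == 0) return dictExpand(d, DICT_HT_INITIAL_SIZE);

    // Expand when the load factor reaches 1 (if resizing is allowed),
    // or when it exceeds dict_force_resize_ratio (5) regardless
    if (d->ht[0].used >= d->ht[0].size &&
        (dict_can_resize ||
         d->ht[0].used/d->ht[0].size > dict_force_resize_ratio))
    {
        return dictExpand(d, d->ht[0].used*2);
    }
    return DICT_OK;
}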

The bucket index is then computed from the hash value like this:

index = hash & dict->ht[0].sizemask
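This works because the table size is always a power of two, so sizemask = size - 1 turns an expensive modulo into a bitwise AND. A standalone sketch:

#include <stdio.h>

int main(void) {
    unsigned long hash = 0x9e3779b9UL;  // an arbitrary hash value
    unsigned long size = 16;            // table size: a power of two
    unsigned long sizemask = size - 1;  // 0b1111

    // Both lines print the same index
    printf("%lu\n", hash & sizemask);
    printf("%lu\n", hash % size);
    return 0;
}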

The hash value itself is produced by MurmurHash2 (a MurmurHash3 now exists, but Redis does not use it). The advantage of this algorithm is that it gives a good random distribution even when the input keys are regular, and it is very fast.

  • Why are the load factors 1 and 5?

The load factor required for expansion differs depending on whether a BGSAVE or BGREWRITEAOF command is executing. While one of these commands runs, Redis forks a child of the server process, and most operating systems implement fork with copy-on-write. The server therefore raises the load factor required for expansion while the child process exists, **avoiding hash table expansion during that period as much as possible.** This avoids unnecessary memory writes and saves as much memory as possible.

// Indicates whether the dictionary is currently allowed to resize
static int dict_can_resize = 1;

// Load factor above which a resize is forced even when
// dict_can_resize is 0
static unsigned int dict_force_resize_ratio = 5;

See the following API for details of how rehash works.

Rehash

int dictRehash(dict *d, int n) {

    // Can only be called while a rehash is in progress
    if (!dictIsRehashing(d)) return 0;

    // Migrate n buckets
    // T = O(N)
    while (n--) {
        dictEntry *de, *nextde;

        // If ht[0] is empty, the rehash is finished
        if (d->ht[0].used == 0) {

            // Free the old hash table
            zfree(d->ht[0].table);

            // Replace the old table with the new one
            d->ht[0] = d->ht[1];

            // Reset the new hash table
            _dictReset(&d->ht[1]);

            // Mark the rehash as finished
            d->rehashidx = -1;

            // Tell the caller the rehash is complete
            return 0;
        }

        // Make sure rehashidx does not overflow
        assert(d->ht[0].size > (unsigned)d->rehashidx);

        // Skip empty buckets
        while (d->ht[0].table[d->rehashidx] == NULL) d->rehashidx++;

        // Head of the chain in the current bucket
        de = d->ht[0].table[d->rehashidx];

        // Move every node of the chain into the new hash table
        // T = O(1)
        while (de) {
            unsigned int h;

            // Save the next node
            nextde = de->next;

            // Compute the index in the new hash table
            h = dictHashKey(d, de->key) & d->ht[1].sizemask;

            // Insert at the head of the chain in the new table
            de->next = d->ht[1].table[h];
            d->ht[1].table[h] = de;

            // Update the counters
            d->ht[0].used--;
            d->ht[1].used++;

            // Continue with the next node
            de = nextde;
        }

        // Mark the migrated bucket as empty
        d->ht[0].table[d->rehashidx] = NULL;

        // Advance the rehash index
        d->rehashidx++;
    }

    return 1;
}

From the source above, we can conclude:

  • Redis hash tables rehash progressively, one step at a time (see the sketch after this list).
  • While a dictionary is rehashing, both the old and the new hash tables must be considered: deletes, finds, and updates are performed on both tables, but inserts go only to the new table; the old table is reset once the rehash finishes.
  • When expanding, ht[1] is sized to the first power of two greater than or equal to ht[0].used * 2; when shrinking (load factor below 0.1), ht[1] is sized to the first power of two greater than or equal to ht[0].used.
  • The server also calls int dictRehashMilliseconds(dict *d, int ms), which keeps rehashing in batches until roughly the given number of milliseconds has elapsed.
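Between such timed batches, every ordinary dictionary operation (add, find, delete, ...) also advances the rehash by exactly one bucket, via _dictRehashStep() in dict.c:

static void _dictRehashStep(dict *d) {
    // Only advance while no safe iterators are running, so that
    // entries are not moved out from under an iterator
    if (d->iterators == 0) dictRehash(d, 1);
}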

The iterator

Like linked lists, Redis dictionaries have iterators, mainly for traversing a dictionary.

The two kinds of iterator are created as follows:

Safe iterator

dictIterator *dictGetSafeIterator(dict *d) {
    dictIterator *i = dictGetIterator(d);

    // Set the safe-iterator flag
    i->safe = 1;

    return i;
}

Unsafe iterator

dictIterator *dictGetIterator(dict *d)
{
    dictIterator *iter = zmalloc(sizeof(*iter));

    iter->d = d;
    iter->table = 0;
    iter->index = -1;
    iter->safe = 0;
    iter->entry = NULL;
    iter->nextEntry = NULL;
    return iter;
}

Both iterators share the same function interface:

dictEntry *dictNext(dictIterator *iter)

This interface returns the node the iterator currently points to. If it is called on a safe iterator, the dictionary's iterators counter is updated (which pauses the progressive rehash); if it is called on an unsafe iterator, a fingerprint of the dictionary is computed so that any forbidden modification during iteration can be detected.

The key function for iteration is **dictScan()**. See the notes in the source.

Skip list (skiplist)

Source: all functions starting with zsl in t_zset.c.

A skip list is a data structure that supports approximately binary search over an ordered linked list. Each node maintains multiple pointers to other nodes, which allows nodes to be reached quickly.

Skip lists support node lookup with average O(logN) and worst O(N) complexity, and batch processing of nodes through sequential operations.

Redis uses skip lists in two places: to implement sorted sets, and as an internal data structure in cluster nodes (see the multi-machine part of Redis later).

In most cases, skip lists can be as efficient as balanced trees, and because skip lists are much easier to implement than balanced trees, many programs use skip lists instead of balanced trees.

Take a look at the structure of a skip list:

In essence, a skip list trades space for time: index levels are built over the linked list at intervals.

Its query process is as follows:

That is, search within a level first, then drop to the next level down, until the node is finally found; the process resembles binary search.

In addition, to insert into or delete from a skip list, we first need to find the insertion or deletion position, then perform the operation. We already know that lookup costs O(log n); once the position is found, the insert or delete itself is cheap, O(1). The overall time complexity of insertion and deletion is therefore also O(log n).
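The level-by-level search loop can be sketched as follows (zslSearchSketch is a hypothetical name; the field names follow the Redis structures defined below, and the same traversal pattern appears in full in zslInsert later; the real code also compares member objects when scores tie):

zskiplistNode *zslSearchSketch(zskiplist *zsl, double target) {
    zskiplistNode *x = zsl->header;

    // Start at the highest level and work downwards
    for (int i = zsl->level - 1; i >= 0; i--) {
        // Move right while the next node's score is still smaller
        while (x->level[i].forward && x->level[i].forward->score < target)
            x = x->level[i].forward;
        // Then drop one level and continue
    }

    // The candidate is the successor at level 0
    x = x->level[0].forward;
    return (x && x->score == target) ? x : NULL;
}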

For further information, please refer to:

zhuanlan.zhihu.com/p/68516038

Structure definition

The skip list contains the head and tail nodes, the number of nodes, and the maximum number of layers, as defined below:

typedef struct zskiplist {

    // Head and tail nodes
    struct zskiplistNode *header, *tail;

    // Number of nodes in the list
    unsigned long length;

    // Highest level among all nodes in the list
    int level;

} zskiplist;

Note that the header node is not counted in level or length.

The skip list nodes are defined as follows:

typedef struct zskiplistNode {

    // Member object (the stored object type, see below)
    robj *obj;

    // Score
    double score;

    // Backward pointer
    struct zskiplistNode *backward;

    // Levels
    struct zskiplistLevel {

        // Forward pointer
        struct zskiplistNode *forward;

        // Span: distance to the node pointed to by forward
        unsigned int span;

    } level[];

} zskiplistNode;

Each skip list node has multiple levels, and each level has two attributes: a forward pointer and a span. The forward pointer points to a later node (one with a larger score), and the span records the distance between the current node and that node.

The level of each node is random (a number between 1 and 32 is generated when the node is inserted). In general, more levels mean faster access.

In a skip list, the nodes are arranged in ascending order of the points they hold.

obj points to a string object, which holds an SDS value.

Note: within one skip list, every node must hold a unique member object, but multiple nodes may hold the same score. Nodes with the same score are ordered by their member objects: nodes with smaller member objects come first (towards the head) and nodes with larger member objects come later (towards the tail).

The Redis member object is defined as follows:

typedef struct redisObject {

    // Type
    unsigned type:4;

    // Encoding
    unsigned encoding:4;

    // Time when the object was last accessed
    unsigned lru:REDIS_LRU_BITS; /* lru time (relative to server.lruclock) */

    // Reference count
    int refcount;

    // Pointer to the actual value
    void *ptr;

} robj;

Note: for more on member objects, see Redis Source Code Reading (3).

Find, delete, and add nodes

So how do node insertion, deletion, and lookup actually work?

Let’s use a practical example to show how nodes can be found, deleted, and added to a skip list.

  • Add a node

If a skip list is initially empty, it is simply an empty linked list:

· Insert key = 3 and random level = 3 as follows:

· Insert key = 2 and random level = 1 as follows:

· Insert key = 100 and random level=2 as follows:

Okay, now that we're done inserting into the skip list, what do we do if we want to find 66?

A skip list query starts from the highest level. The search is binary-search-like: if the target **cannot be found by moving along the current level,** the search drops to the next level down and continues, until the corresponding node is found.

  • The search process is as follows:

At level 3: 1 is smaller than 66, so move right to 5; 5 is also smaller than 66, but its right neighbor is the end of the level, so drop from level 3 at node 5 down to level 2.

At level 2: the node to the right of 5 is 100, which is larger than 66, so drop from level 2 at node 5 down to level 1.

At level 1: the node to the right of 5 is exactly 66, so return this node.

  • What about delete?

Deletion is similar to search: after locating the node, remove it as in an ordinary linked list, except that it must be unlinked at every level it appears in.

The above reference links:

Blog.csdn.net/weixin_4162…

Interface API

Overview

| Function | Role | Time complexity |
| --- | --- | --- |
| zslCreate | Create a new skip list. | O(1) |
| zslFree | Release the given skip list and all nodes it contains. | O(N), N is the length of the skip list. |
| zslInsert | Add a node with the given member and score to the skip list. | O(log N) on average, O(N) in the worst case; N is the length of the skip list. |
| zslDelete | Delete the node with the given member and score from the skip list. | O(log N) average, O(N) worst case. |
| zslGetRank | Return the rank of the node with the given member and score. | O(log N) average, O(N) worst case. |
| zslGetElementByRank | Return the node at the given rank. | O(log N) average, O(N) worst case. |
| zslIsInRange | Given a score range, e.g. 0 to 15 or 20 to 28, return 1 if the range overlaps the skip list's score range, 0 otherwise. | O(1), using only the head and tail nodes. |
| zslFirstInRange | Given a score range, return the first node that falls within it. | O(log N) average, O(N) worst case. |
| zslLastInRange | Given a score range, return the last node that falls within it. | O(log N) average, O(N) worst case. |
| zslDeleteRangeByScore | Delete every node whose score falls within the given range. | O(N), N is the number of deleted nodes. |
| zslDeleteRangeByRank | Delete every node whose rank falls within the given range. | O(N), N is the number of deleted nodes. |

The maximum level and the promotion probability of the Redis skip list are defined as follows:

#define ZSKIPLIST_MAXLEVEL 32 /* Should be enough for 2^32 elements */
#define ZSKIPLIST_P 0.25      /* Skiplist P = 1/4 */

Create a new skip table as follows:

zskiplist *zslCreate(void) {
    int j;
    zskiplist *zsl;

    // Allocate space
    zsl = zmalloc(sizeof(*zsl));

    // Set the level and length
    zsl->level = 1;
    zsl->length = 0;

    // Initialize the header node
    // T = O(1)
    zsl->header = zslCreateNode(ZSKIPLIST_MAXLEVEL,0,NULL);
    for (j = 0; j < ZSKIPLIST_MAXLEVEL; j++) {
        zsl->header->level[j].forward = NULL;
        zsl->header->level[j].span = 0;
    }
    zsl->header->backward = NULL;

    // Set the tail
    zsl->tail = NULL;

    return zsl;
}

As the code above shows, the header node is allocated just like any other skip list node, except that its backward pointer, score, and member object are ignored.

Inserting a new node works as follows (the implementation rewards careful reading):

/*
 * Create a new node with member obj and score score,
 * and insert it into the skip list zsl.
 *
 * The function returns the new node.
 *
 * T_worst = O(N^2), T_avg = O(N log N)
 */
zskiplistNode *zslInsert(zskiplist *zsl, double score, robj *obj) {
    zskiplistNode *update[ZSKIPLIST_MAXLEVEL], *x;
    unsigned int rank[ZSKIPLIST_MAXLEVEL];
    int i, level;

    redisAssert(!isnan(score));

    // Find the insertion position at every level
    // T_worst = O(N^2), T_avg = O(N log N)
    x = zsl->header;
    for (i = zsl->level-1; i >= 0; i--) {

        /* store rank that is crossed to reach the insert position */
        // If i is not zsl->level-1, the rank of level i starts from
        // the rank of level i+1; the ranks accumulate across levels,
        // and rank[0]+1 is finally the rank of the new node
        rank[i] = i == (zsl->level-1) ? 0 : rank[i+1];

        // Move forward along this level
        while (x->level[i].forward &&
            (x->level[i].forward->score < score ||
                // Same score: compare member objects
                // T = O(N)
                (x->level[i].forward->score == score &&
                compareStringObjects(x->level[i].forward->obj,obj) < 0))) {

            // Accumulate the number of nodes crossed
            rank[i] += x->level[i].span;

            // Move to the next node
            x = x->level[i].forward;
        }

        // Record the node that will precede the new node at this level
        update[i] = x;
    }

    /* The caller of zslInsert() guarantees that elements with the same
     * score and the same member will not appear, so no further check is
     * needed; the new element can be created directly. */

    // Get a random level for the new node
    // T = O(N)
    level = zslRandomLevel();

    // If the new node's level is higher than every existing node's,
    // initialize the unused levels of the header node and record them
    // in update as well
    if (level > zsl->level) {

        // Initialize the unused levels
        // T = O(1)
        for (i = zsl->level; i < level; i++) {
            rank[i] = 0;
            update[i] = zsl->header;
            update[i]->level[i].span = zsl->length;
        }

        // Update the skip list's level
        zsl->level = level;
    }

    // Create the new node
    x = zslCreateNode(level,score,obj);

    // Wire the new node into each level
    // T = O(1)
    for (i = 0; i < level; i++) {

        // Set the forward pointer of the new node
        x->level[i].forward = update[i]->level[i].forward;

        // Point the preceding node's forward pointer to the new node
        update[i]->level[i].forward = x;

        /* update span covered by update[i] as x is inserted here */
        // Compute the span of the new node
        x->level[i].span = update[i]->level[i].span - (rank[0] - rank[i]);

        // Update the span of the preceding node
        update[i]->level[i].span = (rank[0] - rank[i]) + 1;
    }

    /* increment span for untouched levels */
    // The untouched levels now pass over one more node
    for (i = level; i < zsl->level; i++) {
        update[i]->level[i].span++;
    }

    // Set the backward pointer of the new node
    x->backward = (update[0] == zsl->header) ? NULL : update[0];
    if (x->level[0].forward)
        x->level[0].forward->backward = x;
    else
        zsl->tail = x;

    // Update the node count
    zsl->length++;

    return x;
}

The function that computes a random level for a node is as follows:

int zslRandomLevel(void) {
    int level = 1;

    // With ZSKIPLIST_P = 0.25, the level increases with probability 1/4
    while ((random()&0xFFFF) < (ZSKIPLIST_P * 0xFFFF))
        level += 1;

    return (level<ZSKIPLIST_MAXLEVEL) ? level : ZSKIPLIST_MAXLEVEL;
}

The return value is between 1 and ZSKIPLIST_MAXLEVEL inclusive. According to the power law used by the random algorithm, larger values are less likely to be generated.
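A quick way to convince yourself of this distribution is to simulate it. A minimal sketch reusing the function above:

#include <stdio.h>
#include <stdlib.h>

#define ZSKIPLIST_MAXLEVEL 32
#define ZSKIPLIST_P 0.25

static int zslRandomLevel(void) {
    int level = 1;
    while ((random()&0xFFFF) < (ZSKIPLIST_P * 0xFFFF))
        level += 1;
    return (level<ZSKIPLIST_MAXLEVEL) ? level : ZSKIPLIST_MAXLEVEL;
}

int main(void) {
    // Expect roughly 75% level 1, 18.75% level 2, 4.7% level 3, ...
    int counts[ZSKIPLIST_MAXLEVEL+1] = {0};
    for (int i = 0; i < 1000000; i++)
        counts[zslRandomLevel()]++;
    for (int l = 1; l <= 5; l++)
        printf("level %d: %d\n", l, counts[l]);
    return 0;
}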

Addendum: sorted sets (zset)

Redis's sorted set data structure combines a dictionary and a skip list. It is defined as follows:

// Sorted set
typedef struct zset {

    // Dictionary: maps member objects to scores
    dict *dict;

    // Skip list: keeps members ordered by score
    zskiplist *zsl;

} zset;

A zset holds each element in both data structures. The dictionary maps member objects to their scores, giving O(1) score lookup; the skip list maps scores to member objects and keeps them ordered, giving O(log N) insertion, removal, and range operations. From the skip list's perspective, the Redis objects are sorted by score.
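As an illustration, a score lookup only touches the dictionary. A minimal sketch modeled on the lookup path in t_zset.c (zsetScoreSketch is a hypothetical name; dictFind and dictGetVal are the real dict.h API):

// O(1) score lookup for a skiplist-encoded zset
int zsetScoreSketch(zset *zs, robj *member, double *score) {
    dictEntry *de = dictFind(zs->dict, member);
    if (de == NULL) return 0;           // member not in the set
    *score = *(double*)dictGetVal(de);  // the value points at the score
    return 1;
}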

Zset is further explained in Redis source Code Reading (3).

HyperLogLog

Source file: hyperloglog.c.

Redis added the HyperLogLog structure in version 2.8.9. HyperLogLog is an algorithm for cardinality statistics. The advantage of HyperLogLog is that the space required to calculate the cardinality is always fixed and small when the number or volume of input elements is very, very large.

In Redis, each HyperLogLog key costs only 12 KB of memory to calculate the cardinality of close to 2^64 different elements. This is in stark contrast to collections where the more elements you have, the more memory you consume when calculating cardinality.

However, because HyperLogLog only calculates the cardinality from the input elements and does not store the input elements themselves, HyperLogLog cannot return the individual elements of the input, as collections do.

PS: What is cardinality?

For example, the dataset {1, 3, 5, 7, 5, 7, 8} has the distinct-element set {1, 3, 5, 7, 8}, so its cardinality (the number of distinct elements) is 5. Cardinality estimation means computing the cardinality quickly, within an acceptable error margin.

Algorithm details

It internally maintains 16,384 buckets to record the number of elements in each bucket. When an element arrives, it hashes to one of the buckets, affecting the bucket count with some probability. Since it is a probabilistic algorithm, the individual bucket count is not accurate, but the average sum of all the bucket count values will be very close to the real count.

To understand the HyperLogLog algorithm, first simplify its counting logic. Since this is distinct counting, exact counting would require a set: record every element in the set, and the set's size gives the total count. With many elements a single set becomes very large, so split it into 16384 small sets. When an element arrives, a hash function dispatches it to one of the small sets, and the same element always hashes to the same small set; the total count is then just the sum of the sizes of all the small sets. Counting exactly this way also allows elements to be removed as well as added.

However, splitting the set brings no obvious benefit by itself, because the total memory footprint does not shrink. In the HyperLogLog algorithm, each bucket actually occupies only 6 bits, which naturally cannot hold all the elements of the bucket; instead it records a logarithm of the number of elements in the bucket.

Log? Why the logarithm all of a sudden? Wait… Hyperloglog, log… is logarithm the soul of this data structure? !

Consider this for a moment: in binary, a random integer has a 50% chance of ending in 0 (the last bit is either 0 or 1). Similarly, there is a 25% chance of two trailing zeros, a 12.5% chance of three, and in general a 2^-k chance of k trailing zeros. If we draw many random integers, without knowing how many, but record the maximum number K of consecutive trailing zeros seen so far, we can use K to estimate the count: it is approximately 2^K!
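A sketch of the trailing-zero observation (count_trailing_zeros is a hypothetical helper; Redis extracts these bits from a 64-bit hash of the element):

#include <stdint.h>

// Number of consecutive 0 bits at the low end of x.
// Tracking the maximum K over many random values gives
// the estimate: count is roughly 2^K.
static int count_trailing_zeros(uint64_t x) {
    if (x == 0) return 64;
    int k = 0;
    while ((x & 1) == 0) {
        x >>= 1;
        k++;
    }
    return k;
}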

Of course, this is very inaccurate: you might draw many more integers while the maximum trailing-zero count K stays the same, yet the estimate remains 2^K. You might think that if K were a floating-point number that crept up a little with every new element, the estimate would be much better.

HyperLogLog allocates 16384 buckets and averages the per-bucket maxima K to obtain an averaged trailing-zero maximum K#. K# is a floating-point number, and 2^K#, scaled by the number of buckets, estimates the total element count fairly accurately. This is only a simplified account; the real algorithm involves many correction factors whose mathematical derivation is too involved to describe precisely here.
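As a toy illustration of this bucket-averaging idea (not the exact Redis formula, though the alpha constant below is the standard HyperLogLog bias-correction factor for large bucket counts):

#include <stdint.h>

// Toy HLL estimate from m buckets, where K[i] is the maximum
// trailing-zero count observed in bucket i.
static double hll_estimate(int m, const uint8_t K[]) {
    double sum = 0;
    for (int i = 0; i < m; i++)
        sum += 1.0 / (double)(1ULL << K[i]); // harmonic mean of the 2^K[i]
    double alpha = 0.7213 / (1.0 + 1.079 / m);
    return alpha * m * m / sum;
}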

Now let's look at the concrete implementation of the Redis HyperLogLog. We know a HyperLogLog actually occupies about 16384 * 6 bits / 8 = 12 KB. But when the count is small, most buckets are zero, and it would be wasteful for the 12 KB to be mostly zeros. So Redis uses sparse storage when the count is relatively small, which occupies far less than 12 KB; the opposite is dense storage, which occupies a fixed 12 KB.

In both sparse and dense storage, Redis uses string bitmaps to store the count values of all HyperLogLog buckets.

Dense storage

The dense layout is very simple: a contiguous string bitmap of 16384 six-bit counters.

Given a bucket number, how do I get its 6bit count? These 6 bits can be inside a byte, or they can span byte boundaries. We need to do a proper shift concatenation of this one or two bytes to get the count.

If the bucket number is idx, the starting byte offset of its 6-bit counter is offset_bytes, and the starting bit offset within that byte is offset_bits. We have:

offset_bytes = (idx * 6) / 8
offset_bits = (idx * 6) % 8

It is important to note that within each byte, the counter bits are stored low-first (left low, right high), whereas we usually draw bytes left high, right low, so you have to flip the picture in your head.
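Putting the two formulas together, reading one bucket looks roughly like this (a sketch; the real code is the HLL_DENSE_GET_REGISTER macro in hyperloglog.c):

#include <stdint.h>

// Read the 6-bit counter of bucket idx from a dense register bitmap.
static int hll_dense_get(const uint8_t *regs, int idx) {
    int offset_bytes = (idx * 6) / 8;
    int offset_bits  = (idx * 6) % 8;

    // The 6 bits may span two bytes; low bits come first.
    // For the last bucket, b1 reads one byte past the registers;
    // Redis is safe here because sds buffers carry a trailing '\0'.
    unsigned int b0 = regs[offset_bytes];
    unsigned int b1 = regs[offset_bytes + 1];

    return ((b0 >> offset_bits) | (b1 << (8 - offset_bits))) & 0x3F;
}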

Sparse storage

Sparse storage is suitable for cases where many counts are zero. The following figure shows the state of a general sparse storage count.

When several consecutive buckets are zero, Redis uses a single byte of the form 00xxxxxx to say how many: the 00 prefix marks the opcode, and the following 6-bit integer plus one gives the number of zero-valued counters (plus one, because a run length of zero would be meaningless). For example, 00010101 represents 22 consecutive zero counters. Six bits can express runs of at most 64 zero counters, so Redis also designed a two-byte opcode for longer runs, 01xxxxxx yyyyyyyy, whose 14-bit payload represents up to 16384 consecutive zero counters. That means the initial state of a HyperLogLog, 16384 buckets all at zero, can be represented by just 2 bytes.
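For reference, the three sparse opcodes defined in hyperloglog.c are (the third one explains the value-32 limit mentioned below):

// ZERO:  00xxxxxx           -> (xxxxxx+1) zero buckets (runs of 1..64)
// XZERO: 01xxxxxx yyyyyyyy  -> 14-bit run length: 1..16384 zero buckets
// VAL:   1vvvvvxx           -> value (vvvvv+1) repeated (xx+1) times,
//                              so sparse form only stores values 1..32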

The general state of sparse storage is as follows:

To answer the question above, when do you switch from sparse to non-sparse?

When a count of the sparse storage needs to be adjusted to a value greater than 32 (such as 33), Redis will immediately transform the storage structure of HyperLogLog to convert the sparse storage into dense storage.

Alternatively, the conversion also happens when the total bytes occupied by the sparse representation exceed 3000; this threshold can be adjusted with the hll-sparse-max-bytes configuration parameter.

Once either condition is met, the conversion from sparse to dense is triggered, and it is irreversible.

Count caching

As mentioned above, the HyperLogLog total is computed from the harmonic mean of the 16384 bucket counts, adjusted by correction factors. Producing this value requires iterating over all the buckets with plenty of floating-point arithmetic, which is a relatively large computation.

So Redis caches the total in an extra field. The field is 64 bits: when the highest bit is 1, the cached value is stale; when it is 0, the remaining 63 bits hold the cached count.
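A sketch of the staleness check, assuming the cached value is held in a 64-bit integer as described (Redis actually stores it as an 8-byte card field in the HLL header and tests it with the HLL_VALID_CACHE macro):

#include <stdint.h>

// The cache is valid while the most significant bit is 0;
// the low 63 bits then hold the cached cardinality.
static int hll_cache_valid(uint64_t card) {
    return (card & (UINT64_C(1) << 63)) == 0;
}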

Whenever the count of any bucket in the HyperLogLog changes, the cache is marked stale, but it is not recomputed immediately (lazy computation); the recomputation and cache refresh are triggered only when the user explicitly calls the PFCOUNT command. With dense storage, refreshing the cache means harmonic-averaging all 16384 bucket counts; with sparse storage the computation is much smaller. In other words, only large counts incur the big computation, and conversely, when the count is large, most PFADD operations do not change any bucket at all.

This suggests that frequently calling PFCOUNT on a rapidly changing HLL counter could be a minor performance problem. Redis author antirez discussed this concern in a blog post: after careful stress testing he found there is nothing to worry about, since the amortized cost of PFCOUNT is O(1).

API interface

For now it is enough to understand the principles of the HLL data structure; the code analysis will wait until the memory-encoding source reading. (This part was tiring; I will come back after reading more code.)

Article links

Redis source code, interview guide (2): integer set, compressed list

juejin.cn/post/694955…

Redis source code, interview guide (3): object system, reference counting, ordered collection

juejin.cn/post/694981…

Reference links:

Basic introduction

www.runoob.com/redis/redis…

The specific algorithm details are reprinted from (highly recommended)

cloud.tencent.com/developer/a…