Hello, I’m Xiao Lin.
Why is Redis so fast?
Besides the fact that it is an in-memory database, so all operations are carried out in memory, there is another important factor: its carefully designed data structures, which let Redis handle additions, deletions, lookups, and modifications efficiently.
So this time we’re going to talk about Redis data structures, which are often asked in interviews.
Note that Redis data structures do not refer to String objects, List objects, Hash objects, Set objects, and Zset objects, because these are the data types of the values in Redis key-value pairs, that is, how data is stored. The underlying implementation of these objects is what uses the data structures.
I drew a diagram of Redis data types (also known as Redis objects) and their underlying data structures. On the left is Redis 3.0, which is what the book Redis Design and Implementation explains; it is a little out of date by now. On the right is the latest Redis code on GitHub (not yet released).
As you can see, the underlying data structure of the Redis data type varies from version to version, such as:
- In Redis 3.0, the underlying data structure of the List object was implemented with a “doubly linked list” or a “compressed list”, but since 3.2 the underlying data structure of the List type has been implemented with quicklist;
- In the latest Redis code (which has not yet been released), the compressed list data structure has been abandoned and replaced by the listpack data structure.
This time, Xiao Lin has redrawn the diagram of the data structures across the old and new versions, covering the following data structures: SDS, doubly linked list, compressed list, hash table, skip list, integer set, quicklist, and listpack.
No more rambling, let’s dive right in!
How is a key-value database implemented?
Before we get into data structures, let’s take a look at how Redis implements key-value databases.
The key in Redis’s key-value pair is a string object, and the value can be either a string object or an object of collection data types, such as List, Hash, Set, and Zset.
As an example, here are several commands for Redis to add key-value pairs:
```
> SET name "xiaolincoding"
OK
> HSET person name "xiaolincoding" age 18
(integer) 0
> RPUSH stu "xiaolin" "xiaomei"
(integer) 4
```
These commands stand for:
- The first command: name is a string key because the value of the key is a string object;
- The second command: person is a hash key because the value of the key is a hash object containing two key-value pairs;
- The third command: stu is a list key because the value of the key is a list object containing two elements.
How are these key-value pairs stored in Redis?
Redis uses a “hash table” to store all key-value pairs. The biggest advantage of the hash table is that we can quickly find key-value pairs with O(1) time complexity. A hash table is just an array, and the elements in the array are called hash buckets.
How do Redis’s hash buckets hold key-value pair data?
Hash buckets store pointers to key-value pairs (dictEntry*), so the key-value pairs can be found through those pointers. Since the value of a key-value pair can be a string object or a collection object, the key-value pair structure does not store the value directly. Instead, it holds a void *key and a void *value pointer, which point to the actual key object and value object respectively, so that even collection data can be found through the void *value pointer.
Here I have drawn a picture of the data structure involved in Redis saving key-value pairs.
I’m not going to go into all the internal details of these data structures, but I’ll talk about them in more detail when I talk about hash table data structures, because they’re the same data structures. Here are the names and purposes of the data structures involved in the figure below:
- redisDb structure, representing the Redis database; it stores a pointer to the dict structure;
- dict structure, which stores two hash tables: normally “hash table 1” is used, and “hash table 2” is used only during rehash;
- dictht structure, representing a hash table; it stores the hash table array, where each element in the array is a pointer to a hash table node structure (dictEntry);
- dictEntry structure, representing a hash table node; it holds void *key and void *value pointers, where key points to a String object and value can point to a String object or to collection-type objects such as List, Hash, Set, and Zset.
The void *key and void *value pointers point to Redis objects. Every object in Redis is represented by a redisObject structure, as shown in the following figure:
Object structure contains member variables:
- Type, which identifies what type of object it is (String, List, Hash, Set, and Zset);
- Encoding, which identifies the underlying data structure used by the object;
- ptr, a pointer to the underlying data structure.
I’ve drawn a picture of the Redis key-value database so you can see how Redis objects relate to data structures:
Next, let’s talk about the underlying data structure!
SDS
Strings are very common in Redis. Keys in key-value pairs are strings, and values are sometimes strings.
Redis is implemented in C, but instead of using C’s char* character array directly to implement strings, it encapsulates its own data structure called Simple Dynamic String (SDS) to represent strings. The underlying data structure of the Redis String data type is SDS.
Since Redis went to the trouble of designing the SDS structure to represent strings, C’s char* character arrays must have some shortcomings.
To understand this, take a look at the structure of the char* character array.
C language string defects
A C string is actually an array of characters, that is, each element in the array is a character in the string.
For example, here is the char* array for the string “xiaolin” :
For those of you who haven’t studied C, you might wonder why the last character is “\0”.
In C, the char * pointer to a string refers only to the beginning of the array of characters. The end of the array is denoted by “\0”, which means the end of the string.
Therefore, the string manipulation function in the C language standard library decides whether to stop the operation by checking whether the character is “\0”. If the current character is not “\0”, the string is not finished and the operation can be continued. If the current character is “\0”, the string is finished and the operation should be stopped.
For example, the function strlen of C language to obtain the length of a string is to count every character in the character array. When the character is “\0”, it will stop traversing and return the number of characters counted, that is, the length of the string. The following diagram shows the strlen function execution flow:
Obviously, getting the length of a C string takes O(N) time (an area that could be improved).
Using the “\0” character as the terminator of a C string has another flaw: if the string itself contains a “\0” character, operations on the string end prematurely. For example, the length of “xiao\0lin” is calculated as 4, as shown in the following figure:
Therefore, apart from the terminator at the end, a C string cannot contain “\0” characters; otherwise the first “\0” the program reads will be mistaken for the end of the string. This limitation means C strings can only hold text data, and cannot hold binary data such as images, audio, or video files (another area that could be improved).
In addition, the string manipulation functions in the C standard library are very insecure and unfriendly to programmers, and can cause buffer overflows if you are not careful.
For example, the strcat function concatenates two strings together.
```c
// Concatenate the src string onto the end of the dest string
char *strcat(char *dest, const char *src);
```
C strings do not record their own buffer size, so strcat assumes that the programmer has already allocated enough memory for dest to hold everything in src before calling it. If that assumption does not hold, a buffer overflow occurs, which may cause the running program to terminate (another area that could be improved).
In addition, strcat suffers from the same time complexity problem as strlen: it must first traverse dest to find the end of the target string, and then traverse src to complete the append, which is inefficient.
Well, through the above analysis, we can know the deficiencies of C language string and can be improved:
- The time complexity of obtaining the string length is O(N);
- The end of a string is identified by a “\0” character. The string cannot contain a “\0” character, so binary data cannot be saved.
- String manipulation functions are inefficient and unsafe, such as the risk of buffer overflow, which may cause the program to terminate;
The STRUCTURE of SDS implemented by Redis solves these problems. Let’s take a look at how Redis solves them.
SDS structure design
Redis 5.0 SDS data structure:
Each member variable in the structure is described separately:
- len, which records the length of the string. To get the string length, SDS just returns the value of this member, so the time complexity is O(1).
- alloc, the length of the space allocated to the character array. When modifying the string, the remaining space can be computed as `alloc - len` and used to judge whether the space meets the modification’s requirements. If not, the SDS space is automatically expanded to the size the modification requires before the actual modification is performed. So with SDS there is no need to manually resize the buffer, and the buffer overflow problem mentioned earlier does not occur.
- flags, which indicates the SDS type. Five types are designed, namely sdshdr5, sdshdr8, sdshdr16, sdshdr32, and sdshdr64; the differences are explained later.
- buf[], a character array used to hold the actual data. It can store not only strings but also binary data.
In general, Redis’s SDS adds three pieces of metadata on top of the original character array, namely len, alloc, and flags, to address the shortcomings of C strings.
O(1) complexity to get the string length
To get the length of a C string, the strlen function has to traverse the character array and count the characters, with O(N) time complexity.
Since the Redis SDS structure has a len member variable, getting the length of the string simply returns the value of this member, so the complexity is only O(1).
Binary security
Because SDS does not need the “\0” character to mark the end of a string, but instead has a dedicated len member to record the length, it can store data that contains “\0”. (To stay compatible with some functions of the C standard library, SDS strings still end with a “\0” character.)
Therefore, the SDS API processes the data stored in buf[] in a binary-safe way: the program places no restrictions on the data in buf[], and the data is read out exactly as it was written.
By using binary-safe SDS instead of C strings, Redis can store binary data in arbitrary formats as well as text data.
Buffer overflows do not occur
Most of the string manipulation functions provided by the C standard string library, such as strcat’s string append function, are unsafe because they leave the buffer size to the developer and do not determine whether the buffer size is sufficient. A buffer overflow can cause a program to end abnormally.
Therefore, Redis adds the alloc and len member variables to the SDS structure, so the SDS API can compute the remaining available space as alloc - len, and the program can judge whether the buffer is large enough when modifying the string.
In addition, when Redis determines that the space is insufficient, it automatically expands the SDS (if the required size is less than 1MB, the new space is doubled; if it is larger than 1MB, an extra 1MB is added) to meet the size required by the modification.
Before expanding the SDS space, the SDS API first checks to see if the unused space is sufficient, and if not, the API allocates not only the space necessary to modify the SDS, but also additional “unused space” to the SDS.
The advantage of this is that the next time an SDS operation is performed, if the SDS space is sufficient, the API will directly use the “unused space” without performing a memory allocation, effectively reducing the number of memory allocations.
Therefore, with SDS, there is no need to manually modify the size of SDS and no buffer overflow problems.
Save memory space
The SDS structure has a FLAGS member variable that represents the SDS type.
Redis designs five types, namely sdshdr5, sdshdr8, sdshdr16, sdshdr32, and sdshdr64.
The main difference between these five types is that the len and alloc member variables in their data structures have different data types.
For example, sdshdr16 and sdshdr32 are defined as follows:

```c
struct __attribute__ ((__packed__)) sdshdr16 {
    uint16_t len;
    uint16_t alloc;
    unsigned char flags;
    char buf[];
};

struct __attribute__ ((__packed__)) sdshdr32 {
    uint32_t len;
    uint32_t alloc;
    unsigned char flags;
    char buf[];
};
```
You can see:
- In sdshdr16, len and alloc are both of type uint16_t, meaning the length of the character array and the size of the allocated space cannot exceed 2^16;
- In sdshdr32, they are both uint32_t, meaning the length of the character array and the size of the allocated space cannot exceed 2^32.
The reason why SDS design different types of structures is to flexibly store different sizes of strings, so as to effectively save memory space. For example, structure headers take up less space when saving small strings.
Beyond designing different structure types, Redis also uses a compiler directive to save memory: the structs are declared with __attribute__ ((packed)), which tells the compiler to disable its default alignment padding during compilation and lay the structure out according to the actual number of bytes used.
Without packing, the compiler inserts padding bytes so that each member sits on its natural alignment boundary, which means a structure can occupy more memory than its members actually need.
For example, consider the following structure, which has two member variables of type char and int, as follows:
```c
#include <stdio.h>

struct test1 {
    char a;
    int b;
} test1;

int main(void) {
    printf("%zu\n", sizeof(test1));
    return 0;
}
```
Guess what the size of this structure is? I’m going to jump right into the answer, but this structure has a size of 8.
This is because, by default, the compiler allocates memory with “byte alignment”: char a occupies only 1 byte, but because the next member int b must start on a 4-byte boundary, the compiler inserts 3 bytes of padding after a. Those 3 bytes are allocated purely for alignment, which means 3 bytes are wasted.
If you don’t want the compiler to allocate memory using byte alignment, you can use the __attribute__ ((Packed)) attribute to define the structure so that the compiler allocates as much memory as the structure actually occupies.
For example, I use the __attribute__ ((Packed)) attribute to define the following structure, which also contains char and int members as follows:
```c
#include <stdio.h>

struct __attribute__((packed)) test2 {
    char a;
    int b;
} test2;

int main(void) {
    printf("%zu\n", sizeof(test2));
    return 0;
}
```
The printed result is 5 (1 byte char + 4 byte int).
As you can see, this is allocated according to the actual number of bytes, which can save memory space.
Linked list
Next to arrays, the most familiar data structures are, I believe, linked lists.
One of the underlying implementations of Redis’s List object is the linked list. C itself does not provide a linked list data structure, so Redis designed its own.
Structure design of linked list node
Let’s take a look at what a “linked list node” structure looks like:
```c
typedef struct listNode {
    // Previous node
    struct listNode *prev;
    // Next node
    struct listNode *next;
    // Node value
    void *value;
} listNode;
```
With both a prev and a next pointer in each node, you can see that this is a doubly linked list.
Linked list structure design
Redis wraps the listNode nodes in a list structure, which makes the linked list more convenient to operate. The structure is as follows:
```c
typedef struct list {
    // List head node
    listNode *head;
    // List tail node
    listNode *tail;
    // Node value copy function
    void *(*dup)(void *ptr);
    // Node value free function
    void (*free)(void *ptr);
    // Node value comparison function
    int (*match)(void *ptr, void *key);
    // Number of nodes in the list
    unsigned long len;
} list;
```
The list structure provides the linked list with a head pointer, a tail pointer, a node count len, and customizable dup, free, and match functions.
As an example, here is a linked list consisting of a list structure and three listNode structures.
Advantages and disadvantages of linked lists
Redis’s linked list implementation has the following advantages:
- Each listNode node has prev and next pointers, so getting a node’s predecessor or successor takes only O(1) time, and both pointers can point to NULL, so the linked list is acyclic;
- Since the list structure provides a head pointer and a tail pointer, getting the head or tail node of the list takes only O(1) time;
- The list structure provides len, so getting the number of nodes in the list takes only O(1) time;
- listNode nodes store their values through a void* pointer, and node-specific functions can be set via the dup, free, and match pointers of the list structure, so list nodes can hold values of various types.
Linked lists also have drawbacks:
- The memory between linked list nodes is not contiguous, which means the CPU cache cannot be exploited well. A data structure that does exploit the CPU cache is the array, because its memory is contiguous, so the CPU cache can speed up access to it.
- Saving the value of a linked list node also requires allocating a list node structure header, so the memory overhead is considerable.
Therefore, when the amount of data is relatively small, the List object in Redis 3.0 uses a “compressed list” as its underlying data structure, whose advantage is saving memory: it is a memory-compact data structure.
However, compressed lists have performance problems (more on that below), so in version 3.2 Redis designed a new quicklist data structure and made it the underlying data structure of the List object.
Then, in Redis 5.0, a new listpack data structure was designed, again adopting the compact memory layout of the compressed list. Finally, in the latest Redis code (not yet released), the compressed list was dropped as an underlying implementation of Hash and Zset objects and replaced by listpack.
Compressed list
The biggest feature of compressed list is that it is designed as a data structure with compact memory, occupying a piece of continuous memory space. It can not only make use of CPU cache, but also encode data with different lengths. This method can effectively save memory overhead.
However, there are drawbacks to compressed lists:
- Do not save too many elements, otherwise the query efficiency will be reduced;
- When an element is added or modified, the memory space occupied by the compressed list needs to be reallocated and can even cause the problem of chained updates.
Therefore, Redis objects (List, Hash, and Zset objects) use a compressed list as the underlying data structure only when they contain a small number of elements and each element is small.
Next, I’ll talk to you about the compressed list in detail.
Compressed list structure design
Compressed lists, developed by Redis to save memory, are sequential data structures composed of contiguous memory blocks, somewhat similar to arrays.
The compressed list has three fields in the header:
- zlbytes, which records the number of bytes of memory the compressed list occupies;
- zltail, which records how many bytes the “tail” node is from the start address, that is, the offset of the tail of the list;
- zllen, which records the number of nodes the compressed list contains;
- zlend, which marks the end of the compressed list with the fixed value 0xFF (decimal 255).
In a compressed list, locating the first or the last element is O(1): they can be found directly from the lengths recorded in the three header fields. Finding any other element is not as efficient; the list must be scanned node by node, which is O(N). So a compressed list should not hold too many elements.
In addition, the composition of compressed list entries is as follows:
The compressed list node contains three parts:
- prevlen, records the length of “previous node”;
- encoding, records the actual data type and length of the current node;
- data, records the actual data of the current node;
When inserting data, the compressed list allocates space for prevlen and encoding differently depending on whether the data is a string or an integer, and on its size. This is the compressed list’s way of saving memory.
Let’s look separately at how prevlen and encoding allocate space according to the data’s size and type.
The prevlen property of each node in the compressed list records the “length of the previous node”, and the prevlen property size is related to the length of the previous node, for example:
- If the length of the previous node is less than 254 bytes, the prevlen attribute takes 1 byte to hold the length value.
- If the length of the previous node is greater than or equal to 254 bytes, the prevlen attribute takes 5 bytes to hold the length value.
The size of the encoding property depends on whether the data is a string or an integer, and the length of the string:
- If the data of the current node is an integer, the encoding uses 1 byte space to encode.
- If the current node’s data is a string, encoding uses 1, 2, or 5 bytes of space, depending on the length of the string.
Chain update
There is another problem with compressed lists besides finding complexity.
When an element is added or modified to a compressed list, the memory space occupied by the compressed list needs to be reallocated if there is not enough space. However, when the newly inserted element is large, it can cause the prevlen space of subsequent elements to change, leading to the “chained update” problem, causing space for each element to be reallocated, resulting in poor performance in accessing the compressed list.
As mentioned earlier, the prevlen property of the compressed list node allocates space differently depending on the length of the previous node:
- If the length of the previous node is less than 254 bytes, the prevlen attribute takes 1 byte to hold the length value.
- If the length of the previous node is greater than or equal to 254 bytes, the prevlen attribute takes 5 bytes to hold the length value.
Now suppose a compressed list has multiple consecutive nodes between 250 and 253 in length, as shown below:
Since these node length values are less than 254 bytes, the Prevlen property needs 1 byte of space to hold the length value.
If a new node with a length greater than or equal to 254 bytes is added to the head node of the compressed list, the new node will become the front node of E1, as shown in the following figure:
Since the prevlen property of the E1 node is only 1 byte and cannot hold the length of the new node, it is necessary to redistribute the space of the compressed list and expand the prevlen property of the E1 node from the original 1 byte to 5 bytes.
The domino effect began.
e1 originally had a length between 250 and 253 bytes; after its prevlen field grows from 1 byte to 5 bytes, e1’s total length becomes greater than or equal to 254 bytes. Therefore e2’s prevlen, which records e1’s length, must also be expanded from 1 byte to 5 bytes.
Just as the new node triggered an expansion of e1, expanding e1 triggers an expansion of e2, which triggers an expansion of e3, and so on, all the way to the end of the list.
This cascading space expansion in special cases is called a “chained update”: like the domino effect, the first tile falls and pushes over the second, the second pushes over the third, and so on.
Defects in compressed lists
Space expansion operation is also a reallocation of memory, so once the chain update occurs, the memory space occupied by the compressed list will be re-allocated several times, which will directly affect the access performance of the compressed list.
So, although the compact memory layout of a compressed list can save memory, if the number of stored elements increases, or if the elements become larger, memory can be reallocated, and worst of all, “chained updates” can occur.
Therefore, the compressed list will only be used for scenarios with a small number of nodes, as long as the number of nodes is small enough that cascading updates are acceptable.
However, Redis addressed the design limitations of compressed lists by introducing two new data structures in later versions: QuickList (introduced in Redis 3.2) and ListPack (introduced in Redis 5.0). The design goal of these two data structures is to maintain the memory saving advantage of compressed lists as much as possible, while at the same time solving the “cascading update” problem of compressed lists.
Hash table
A hash table is a data structure that holds key-value pairs.
Each key in the hash table is unique; through a key, a program can find the associated value, update the value, delete the entire key-value pair, and so on.
When I talked about compressed lists, I mentioned that one of the underlying implementations of Redis Hash objects is compressed lists (the latest Redis code has replaced compressed lists with Listpacks). Another underlying implementation of Hash objects is a Hash table.
The advantage of a hash table is that it can query data quickly with O(1) complexity. How is that achieved? The key is passed through a hash function to compute the data’s position in the table; since the hash table is actually an array, the data can be located quickly through the index.
However, there are risks. In the case that the size of the hash table is fixed, the possibility of hash conflicts will increase as the data increases.
There are many ways to resolve hash conflicts.
Redis uses “chained hashing” to resolve them: without expanding the hash table, the entries with the same hash value are connected with a linked list, so the data can still be found in the table.
Now, let’s talk more about hash tables.
Hash table structure design
The hash table structure of Redis is as follows:
```c
typedef struct dictht {
    // Hash table array
    dictEntry **table;
    // Hash table size
    unsigned long size;
    // Hash table size mask, used to calculate index values
    unsigned long sizemask;
    // Number of existing nodes in the hash table
    unsigned long used;
} dictht;
```
As you can see, the hash table is an array (dictEntry **table), and each element of the array is a pointer to a “dictEntry” node.
Hash table node structure is as follows:
```c
typedef struct dictEntry {
    // The key in the key-value pair
    void *key;
    // The value in the key-value pair
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
        double d;
    } v;
    // Points to the next hash table node, forming a linked list
    struct dictEntry *next;
} dictEntry;
```
The dictEntry structure contains not only pointers to the key and value, but also a pointer to the next hash table node. This pointer links multiple key-value pairs with the same hash value together to resolve hash conflicts; this is chained hashing.
Also, the value in the dictEntry structure is defined as a union v, so it can be a pointer (val) to the actual value, an unsigned 64-bit integer, a signed 64-bit integer, or a double. The benefit is saving memory: when the value is an integer or a floating-point number, it can be embedded directly in the dictEntry structure, with no need for a pointer to a separately allocated value.
Hash conflict
A hash table is actually an array, and each element in the array is a hash bucket.
After the key of a key-value pair is run through the hash function to get a hash value, taking the modulo (hash value % hash table size) gives the position of the corresponding array element, that is, the hash bucket number.
What is hash conflict?
For example, there is a hash table that can hold eight hash buckets. After the hash function is used to calculate key1, the modulus of “hash value % 8” is calculated. If the value is 1, then it corresponds to hash bucket 1. Similarly, key9 and key10 correspond to hash bucket 1 and bucket 6 respectively.
At this point, key1 and key9 correspond to the same hash bucket, resulting in a hash conflict.
Therefore, when two or more keys are assigned to the same hash bucket in the hash table, those keys are said to have a hash conflict.
Chain hash
Redis uses a “chain hash” approach to resolve hash conflicts.
How is chain hashing implemented?
The implementation is that each hash table node has a next pointer pointing to the next hash table node, so multiple nodes can be linked through their next pointers into a singly linked list. The nodes assigned to the same hash bucket are joined by this list, which resolves the hash conflict.
For example, key1 and key9 hash into the same bucket; their nodes are then linked through the next pointer, forming a singly linked list.
However, the limitation of chained hashing is obvious. As the length of the list increases, the time to query the data at this location increases. After all, the time complexity of the list query is O(n).
To solve this problem, you need to rehash, which is to expand the size of the hash table.
Next, take a look at how Redis implements rehash.
rehash
In the hash table structure design section above, I introduced the dictht structure Redis uses to represent a hash table. But when actually using hash tables, Redis defines a dict structure, and this structure contains two hash tables (ht[2]).
```c
typedef struct dict {
    ...
    // Two hash tables, used alternately during rehash
    dictht ht[2];
    ...
} dict;
```
The reason we have 2 hash tables is because when we rehash, we need 2 hash tables.
During the normal service request phase, inserted data is written to “hash table 1”, where “hash table 2” is not allocated.
As the data grows, the rehash operation is triggered, and the process is divided into three steps:
- The space allocated to “hash table 2” is usually twice as large as that of “hash table 1”.
- Migrate the data from hash table 1 to hash Table 2.
- After the migration, the space of Hash table 1 is released, hash table 2 is set to Hash table 1, and a blank hash table is created in Hash table 2 to prepare for the next rehash.
For your convenience, I’ve drawn the three rehash processes in the following diagram:
This process seems simple, but the second step is problematic. If the volume of “hash table 1” is very large, migrating to “hash table 2” may block Redis from serving other requests because of the large number of copies involved.
Progressive rehash
In order to avoid the impact of rehash on Redis performance due to the time-consuming data copy during rehash data migration, Redis uses progressive Rehash, that is, data migration is performed multiple times instead of once.
The progressive rehash steps are as follows:
- Allocate space to “hash table 2”;
- During rehash, every time Redis adds, deletes, looks up, or updates a hash table element, it performs the requested operation and, in addition, migrates all the key-value pairs at one index position of “hash table 1” over to “hash table 2”;
- As more and more hash table requests arrive from clients, eventually, at some point, every key-value pair in “hash table 1” will have been migrated to “hash table 2”, completing the rehash.
This cleverly spreads the cost of a large one-time data migration effort over multiple requests, avoiding the time-consuming rehash operation.
During progressive rehash there are two hash tables, so additions, deletions, lookups, and updates of hash table elements are performed against both tables.
For example, if the value of a key is searched in hash table 1, it will be searched in hash table 2 if not found.
In addition, during progressive rehash, newly added key-value pairs are saved to “hash table 2” only; nothing new is written to “hash table 1”. This guarantees that the number of key-value pairs in “hash table 1” only ever decreases, so by the time the rehash completes, “hash table 1” is empty.
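A toy model of these two rules might look as follows — `toydict`, `toyhash`, and friends are all illustrative names, and the hashing and collision handling are deliberately simplistic:

```c
#include <string.h>
#include <stddef.h>

#define BUCKETS 8

// Toy two-table dict; only what is needed to show the lookup/insert rules.
typedef struct {
    const char *keys[BUCKETS];
    int vals[BUCKETS];
} toyht;

typedef struct {
    toyht ht[2];
    int rehashing;   // non-zero while a progressive rehash is in flight
} toydict;

static unsigned long toyhash(const char *k) {
    unsigned long h = 5381;
    while (*k) h = h * 33 + (unsigned char)*k++;
    return h % BUCKETS;
}

// Lookup checks "hash table 1" first; during a rehash it falls back to
// "hash table 2" when the key is not found, as described above.
int toydict_find(const toydict *d, const char *key, int *out) {
    int tables = d->rehashing ? 2 : 1;
    for (int t = 0; t < tables; t++) {
        unsigned long i = toyhash(key);
        if (d->ht[t].keys[i] && strcmp(d->ht[t].keys[i], key) == 0) {
            *out = d->ht[t].vals[i];
            return 1;
        }
    }
    return 0;
}

// During a rehash, new keys go straight to "hash table 2",
// so "hash table 1" can only shrink.
void toydict_add(toydict *d, const char *key, int val) {
    toyht *t = d->rehashing ? &d->ht[1] : &d->ht[0];
    unsigned long i = toyhash(key);
    t->keys[i] = key;
    t->vals[i] = val;
}

// Demo: a key added mid-rehash lands in table 2 but is still found.
int demo_progressive(void) {
    toydict d;
    memset(&d, 0, sizeof(d));
    d.rehashing = 1;
    toydict_add(&d, "name", 42);
    int v = 0;
    return toydict_find(&d, "name", &v) && v == 42
        && d.ht[0].keys[toyhash("name")] == NULL;
}
```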
Rehash trigger condition
After all this talk about rehash, what are the circumstances that trigger a rehash operation?
The trigger condition for rehash is related to the load factor.
The load factor is calculated with the following formula:

load factor = number of nodes saved in the hash table / hash table size
There are two main conditions that trigger a rehash operation:
- When the load factor is greater than or equal to 1, and Redis is not executing bgsave or bgrewriteaof — that is, no RDB snapshot or AOF rewrite is in progress — a rehash is performed.
- When the load factor is greater than or equal to 5, hash conflicts have become severe, and a rehash is forced regardless of whether an RDB snapshot or AOF rewrite is in progress.
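A minimal sketch of the two trigger conditions (the function names are mine; the real checks live in Redis’s dict.c and include details omitted here):

```c
// Load factor = number of saved nodes / hash table size
// (used and size mirror the dictht fields of the same names).
double load_factor(unsigned long used, unsigned long size) {
    return (double)used / (double)size;
}

// The two trigger conditions described above: expand at load factor >= 1
// when no bgsave/bgrewriteaof child is running, or force it at >= 5.
int should_rehash(unsigned long used, unsigned long size, int bg_child_running) {
    double lf = load_factor(used, size);
    if (lf >= 5.0) return 1;                        // conflicts too severe: force it
    if (lf >= 1.0 && !bg_child_running) return 1;   // normal expansion
    return 0;
}
```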
The integer set
The integer set is one of the underlying implementations of the Set object. When a Set contains only integer elements and the number of elements is small, the integer set data structure is used as the underlying implementation.
Integer set structure design
A set of integers is essentially a contiguous memory space, and its structure is defined as follows:
typedef struct intset {
// Encoding mode
uint32_t encoding;
// The number of elements the collection contains
uint32_t length;
// Hold an array of elements
int8_t contents[];
} intset;
As you can see, the container that holds the elements is the contents array. Although contents is declared as an int8_t array, it doesn’t actually hold any int8_t elements; the true type of the contents array depends on the value of the encoding attribute in the intset structure. For example:
- If the encoding attribute is INTSET_ENC_INT16, contents is an array of int16_t, and every element in it is of type int16_t;
- If the encoding attribute is INTSET_ENC_INT32, contents is an array of int32_t, and every element in it is of type int32_t;
- If the encoding attribute is INTSET_ENC_INT64, contents is an array of int64_t, and every element in it is of type int64_t.
Different types of contents arrays, that means the size of the array is going to be different.
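To make the idea concrete, here is a hedged sketch of how an element can be read out of the raw contents bytes according to the encoding (the constants and the `intset_get` name are illustrative, not the actual intset.c API):

```c
#include <stdint.h>
#include <string.h>

// Illustrative encoding tags; in intset.c these are derived from
// sizeof(int16_t), sizeof(int32_t), sizeof(int64_t).
#define ENC_INT16 2
#define ENC_INT32 4
#define ENC_INT64 8

// Read element `pos` from the raw contents bytes according to the current
// encoding; the int8_t declaration is just a way to address raw bytes.
int64_t intset_get(const void *contents, int encoding, int pos) {
    if (encoding == ENC_INT16) {
        int16_t v;
        memcpy(&v, (const char *)contents + pos * sizeof(v), sizeof(v));
        return v;
    } else if (encoding == ENC_INT32) {
        int32_t v;
        memcpy(&v, (const char *)contents + pos * sizeof(v), sizeof(v));
        return v;
    } else {
        int64_t v;
        memcpy(&v, (const char *)contents + pos * sizeof(v), sizeof(v));
        return v;
    }
}

// Demo: the same style of buffer read under two different encodings.
int demo_intset_get(void) {
    int16_t a16[3] = { 1, -2, 300 };
    int32_t a32[2] = { 70000, -70000 };
    return intset_get(a16, ENC_INT16, 2) == 300
        && intset_get(a32, ENC_INT32, 1) == -70000;
}
```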
The upgrade operation of an integer collection
The integer set has an upgrade rule: when we add a new element whose type (say int32_t) is longer than the type of every existing element in the set (say int16_t), the integer set must be upgraded first. That is, the contents array is expanded according to the new element’s type (int32_t) before the new element is added, and the order of the set is preserved throughout the upgrade.
The upgrade does not reallocate an array of the new type. Instead, it expands the original array in place and then re-lays out each element at intervals of the original type’s size — if the encoding attribute is INTSET_ENC_INT16, the interval is 16 bits.
For example, suppose there is an integer set with three elements of type int16_t.
Now add a new element, 65535, to the set. Since 65535 requires int32_t to store, the integer set must be upgraded: the contents array is expanded by 80 bits (4×32 − 3×16 = 80) so that it can hold four int32_t elements.
After the expansion, the three existing elements are converted to int32_t one by one and placed at their correct positions, keeping the order of the underlying array unchanged. The whole conversion process looks like this:
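Here is a minimal sketch of such an in-place upgrade from int16_t to int32_t — `upgrade16to32` is my own helper, not Redis’s actual upgrade routine, but it shows why converting from the back keeps elements from being overwritten before they are read:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Expand an int16_t array in place to hold (n + 1) int32_t values, then
// convert the existing n elements from the back, so each old 16-bit slot
// is read before its bytes can be clobbered by a new 32-bit slot.
int32_t *upgrade16to32(int16_t *old, int n) {
    int32_t *arr = realloc(old, (n + 1) * sizeof(int32_t));
    for (int i = n - 1; i >= 0; i--) {
        int16_t v;
        memcpy(&v, (int16_t *)arr + i, sizeof(v));  // old 16-bit slot i
        arr[i] = v;                                  // new 32-bit slot i
    }
    return arr;
}

// Demo: upgrade {1, 2, 3} and append 65535 (which needs int32_t) at the end.
int demo_upgrade(void) {
    int16_t *a = malloc(3 * sizeof(int16_t));
    a[0] = 1; a[1] = 2; a[2] = 3;
    int32_t *b = upgrade16to32(a, 3);
    b[3] = 65535;   // largest value, so order stays ascending
    int ok = b[0] == 1 && b[1] == 2 && b[2] == 3 && b[3] == 65535;
    free(b);
    return ok;
}
```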
What are the benefits of integer collection upgrades?
If you want a single array to hold elements of type int16_t, int32_t, and int64_t, the simplest approach is to just use an int64_t array. However, when all the elements happen to be int16_t, that wastes memory.
With the upgrade mechanism, as long as we keep adding int16_t elements to the integer set, its underlying implementation stays an int16_t array; the array is upgraded only when we add an element of type int32_t or int64_t.
Therefore, the benefit of the integer collection upgrade is to save memory resources.
Do integer collections support degraded operations?
Demotion is not supported: once an array has been upgraded, it stays upgraded. For example, in the upgrade example above, even if the element 65535 is later deleted, the underlying array remains int32_t and is not downgraded back to int16_t.
Skip list
Redis uses the skip list only in the underlying implementation of the Zset object. The advantage of a skip list is that it supports node lookup with an average complexity of O(logN).
The Zset object is the only Redis object that uses two data structures at once: a skip list and a hash table. The benefit is that it supports both efficient range queries and efficient single-point queries.
typedef struct zset {
dict *dict;
zskiplist *zsl;
} zset;
Zset supports range queries (such as the ZRANGEBYSCORE operation) because its skip list keeps elements ordered, and it can fetch an element’s weight with constant complexity (such as the ZSCORE operation) because its hash table indexes elements directly.
Next, let’s talk about the skip list in detail.
Skip list structure design
When searching a linked list, elements must be examined one by one, so queries are very inefficient — the time complexity is O(N) — and that is why the skip list appeared. A skip list improves on the linked list by building a “multi-level” ordered linked list, whose advantage is that it can locate data quickly.
What does a skip list look like? As an example, the figure below shows a three-level skip list.
The head node in the figure has three head pointers, L0~L2, each pointing to nodes on a different level, and the nodes on each level are connected by pointers:
- L0 level has 5 nodes, namely nodes 1, 2, 3, 4 and 5.
- L1 has three nodes, namely nodes 2, 3 and 5.
- Level L2 has only one node, which is node 3.
To find node 4 in an ordinary linked list, we would have to traverse from the head, which takes 4 lookups. With the skip list, only 2 lookups are needed: from the head node we jump along L2 straight to node 3, then traverse one more step to node 4.
As you can see, the search hops up and down across levels to locate the element. When there is a large amount of data, the lookup complexity of a skip list is O(logN).
So how does a skip list node get multiple levels? For that we need to look at the data structure of a skip list node, as follows:
typedef struct zskiplistNode {
// The element value of Zset
sds ele;
// Element weight value
double score;
// back pointer
struct zskiplistNode *backward;
// Level array of the node, holding a forward pointer and span for each level
struct zskiplistLevel {
struct zskiplistNode *forward;
unsigned long span;
} level[];
} zskiplistNode;
The Zset object stores both elements and their weights, which correspond to the sds ele and double score fields in the skip list node structure. Each skip list node also has a backward pointer to the previous node, which makes it convenient to walk the list from the tail and find nodes in reverse order.
A skip list is a linked list with levels, and each level can contain multiple nodes connected by pointers. This is implemented through the level array of the zskiplistLevel structure inside each node.
Each element of the level array represents one level of the skip list, described by the zskiplistLevel structure; for example, level[0] is the first level and level[1] is the second. The zskiplistLevel structure defines a forward pointer to the next skip list node on that level and a span, which records the distance between the two nodes.
For example, the graph below shows the span of each node.
At first glance the span looks like something used during traversal, but in fact it has nothing to do with traversal, which only needs the forward pointers.
What the span actually does is compute a node’s rank in the skip list. How? Since the nodes in a skip list are all ordered, to compute the rank of a node you add up the spans of every level passed along the search path from the head node to that node; the sum is the target node’s rank in the skip list.
For example, to find the rank of node 3 in the skip list in the figure, the search starts from the head node and passes through only one level (L2), whose span is 3, so node 3’s rank in the skip list is 3.
In addition, the head node in the figure is itself a zskiplistNode, but the head node’s backward pointer, weight, and element value are not used, so that part is omitted from the figure.
So which structure records where the head node is? That brings us to the skip list structure itself, as shown below:
typedef struct zskiplist {
struct zskiplistNode *header, *tail;
unsigned long length;
int level;
} zskiplist;
The skip list structure contains:
- The head and tail pointers of the skip list, making it possible to access the head and tail nodes in O(1) time;
- The length of the skip list, making it possible to get the number of skip list nodes in O(1) time;
- The maximum level of the skip list, making it possible to get the height of the tallest node in O(1) time.
Querying a skip list node
When looking up a skip list node, the search traverses level by level, starting at the highest level of the head node. While traversing the nodes of a level, both the SDS element and the element weight are compared. There are two conditions for advancing along the current level:
- If the current node’s weight is “less than” the weight being looked up, the search visits the next node on that level;
- If the current node’s weight is “equal to” the weight being looked up, but the current node’s SDS data is “less than” the data being looked up, the search also visits the next node on that level.
If neither condition holds, or the next node is empty, the search uses the next entry in the level array of the node reached so far and continues along that level’s forward pointer — in other words, it drops down one level and keeps searching.
For example, the figure below shows a three-level skip list.
If you want to find a node with “element: abcd, weight: 4”, the process looks like this:
- Starting at the highest level of the head node, the L2 pointer leads to the “element: abc, weight: 3” node, whose weight is lower than the weight being searched for, so the search visits the next node on that level;
- However, the next node on that level is empty, so the search drops down to the next level of the “element: abc, weight: 3” node, that is, level[1];
- On level[1], the next pointer of the “element: abc, weight: 3” node points to the “element: abcde, weight: 4” node. Its weight equals the weight being searched for, but its SDS data is “greater than” the data being searched for, so the search drops down again, to level[0] of the “element: abc, weight: 3” node;
- On level[0], the next pointer of the “element: abc, weight: 3” node points to the “element: abcd, weight: 4” node, which is exactly the node being searched for. The query ends.
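The search above can be sketched with a stripped-down node (no span or backward pointer, which the search itself doesn’t need); `slnode` and `sl_find` are illustrative names, not Redis’s zslGetRank/zslInsert API:

```c
#include <string.h>
#include <stddef.h>

#define MAXLEVEL 3

// Simplified skip list node: just the element, its weight, and the
// forward pointer for each level.
typedef struct slnode {
    const char *ele;
    double score;
    struct slnode *forward[MAXLEVEL];
} slnode;

// The search rule described above: on each level, advance while the next
// node's (score, ele) pair is smaller than the target; otherwise drop down.
slnode *sl_find(slnode *head, int level, double score, const char *ele) {
    slnode *x = head;
    for (int i = level - 1; i >= 0; i--) {
        while (x->forward[i] &&
               (x->forward[i]->score < score ||
                (x->forward[i]->score == score &&
                 strcmp(x->forward[i]->ele, ele) < 0))) {
            x = x->forward[i];
        }
    }
    x = x->forward[0];
    if (x && x->score == score && strcmp(x->ele, ele) == 0)
        return x;
    return NULL;
}

// Demo: rebuild the example above (abc:3, abcd:4, abcde:4) and search.
int demo_sl_find(void) {
    slnode head = { NULL,    0, { NULL } };
    slnode n1   = { "abc",   3, { NULL } };
    slnode n2   = { "abcd",  4, { NULL } };
    slnode n3   = { "abcde", 4, { NULL } };
    head.forward[0] = &n1; n1.forward[0] = &n2; n2.forward[0] = &n3;  // L0
    head.forward[1] = &n1; n1.forward[1] = &n3;                      // L1
    head.forward[2] = &n1;                                           // L2
    return sl_find(&head, MAXLEVEL, 4, "abcd") == &n2
        && sl_find(&head, MAXLEVEL, 5, "zzz") == NULL;
}
```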
The number of levels of a skip list node
The ratio between the numbers of nodes on two adjacent levels of a skip list affects its query performance.
For example, in the skip list below, the second level has only 1 node while the first level has 6 nodes.
In this case, querying node 6 costs basically the same as querying a plain linked list: the first-level nodes must be scanned in order, so the complexity is O(N). Therefore, to keep query complexity down, the relationship between the nodes of adjacent levels needs to be maintained.
The ideal ratio of nodes on two adjacent levels is 2:1, which brings the search complexity down to O(logN).
In the skip list below, the ratio of nodes on adjacent levels is 2:1.
So how do we keep the ratio of nodes on adjacent levels at 2:1?
If the skip list rebalanced its nodes on every insertion or deletion to maintain the ratio, that would add extra overhead.
Redis takes a clever approach instead: when a node is created, the skip list randomly generates the node’s number of levels, without strictly maintaining the 2:1 ratio between adjacent levels.
Specifically, when creating a node, the skip list generates a random number in the range [0, 1); if the number is less than 0.25 (a 25% probability), the level count is incremented by 1, and another random number is generated — this repeats until a random number comes up that is not less than 0.25, and the count at that point becomes the node’s number of levels.
This way, each additional level is gained with at most a 25% probability: the higher the level, the lower the probability, and the height is capped at 64.
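A sketch of that level-generation scheme (the constants match the values the article describes; Redis’s actual zslRandomLevel uses integer arithmetic rather than floating point):

```c
#include <stdlib.h>

#define SKIPLIST_MAXLEVEL 64
#define SKIPLIST_P 0.25

// Keep adding a level with probability 0.25, capped at 64.
int random_level(void) {
    int level = 1;
    while ((double)rand() / RAND_MAX < SKIPLIST_P &&
           level < SKIPLIST_MAXLEVEL) {
        level++;
    }
    return level;
}

// Demo: every generated level stays within [1, 64].
int demo_levels_in_range(void) {
    for (int i = 0; i < 10000; i++) {
        int l = random_level();
        if (l < 1 || l > SKIPLIST_MAXLEVEL) return 0;
    }
    return 1;
}
```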
quicklist
Before Redis 3.2, the underlying data structure of a List object was a doubly linked list or a compressed list. Starting with Redis 3.2, the List object is implemented by the quicklist data structure.
A quicklist is a combination of a doubly linked list and compressed lists: the quicklist itself is a linked list, and each element of that linked list is a compressed list.
When introducing the compressed list earlier, I also mentioned its drawback: a compressed list saves memory through its compact layout, but because of its structural design, if the number of elements grows or an element gets larger, there is a risk of “cascading updates”, which, once triggered, cause performance to degrade.
The quicklist works around this by limiting the size or number of elements of the compressed list in each linked-list node: the fewer and smaller the elements in a compressed list, the smaller the impact of a cascading update, which gives better access performance.
Quicklist structure design
The quicklist structure is similar to a linked list structure in that it has a head and a tail; the difference is that its nodes are quicklistNodes.
typedef struct quicklist {
    // Head of the quicklist
    quicklistNode *head;
    // Tail of the quicklist
    quicklistNode *tail;
    // Total number of elements across all compressed lists
    unsigned long count;
    // Number of quicklistNodes
    unsigned long len;
    …
} quicklist;
Let’s look at the structure definition of a quicklistNode:
typedef struct quicklistNode {
    // Previous quicklistNode
    struct quicklistNode *prev;
    // Next quicklistNode
    struct quicklistNode *next;
    // Compressed list pointed to by this quicklistNode
    unsigned char *zl;
    // Size of the compressed list in bytes
    unsigned int sz;
    // Number of elements in the compressed list
    unsigned int count : 16;
    …
} quicklistNode;
As you can see, the quicklistNode structure contains pointers to the previous node and the next node, so the quicklistNodes form a doubly linked list. But instead of holding plain element values, each list node holds a compressed list, which is why the quicklistNode structure has a pointer to a compressed list.
I’ve drawn a diagram to help you understand the QuickList data structure.
When an element is added to a quicklist, it is not always saved into a newly created linked-list node the way a regular linked list would do it. Instead, the quicklist first checks whether the compressed list at the insertion position can hold the element: if it can, the element is saved directly into that quicklistNode’s compressed list; if it can’t, a new quicklistNode is created.
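A hedged sketch of that insertion rule, with a fixed-size array standing in for the compressed list and an illustrative NODE_CAP in place of the real per-node limit config:

```c
#include <stdlib.h>
#include <stddef.h>

#define NODE_CAP 128   // illustrative per-node element cap, not Redis's
                       // actual configurable limit

// Toy quicklist node: a long array stands in for the compressed list bytes.
typedef struct qnode {
    struct qnode *prev, *next;
    int count;              // elements in this node's "compressed list"
    long items[NODE_CAP];
} qnode;

// Append at the tail: reuse the tail node's compressed list while it has
// room, otherwise create a new quicklistNode -- the insertion rule above.
qnode *quicklist_push_tail(qnode *tail, long value) {
    if (tail && tail->count < NODE_CAP) {
        tail->items[tail->count++] = value;
        return tail;
    }
    qnode *n = calloc(1, sizeof(qnode));
    n->prev = tail;
    if (tail) tail->next = n;
    n->items[n->count++] = value;
    return n;
}

// Demo: pushing NODE_CAP + 1 values fills one node and spills into a second.
int demo_spill(void) {
    qnode *tail = NULL;
    for (long i = 0; i < NODE_CAP + 1; i++)
        tail = quicklist_push_tail(tail, i);
    int ok = tail->count == 1 && tail->prev != NULL
          && tail->prev->count == NODE_CAP;
    free(tail->prev);
    free(tail);
    return ok;
}
```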
By limiting the size or number of elements of the compressed list in each quicklistNode, quicklist reduces the risk of cascading updates, but it does not solve the problem completely.
listpack
Quicklist reduces the performance impact of cascading updates by limiting the size or number of elements of the compressed list in each quicklistNode, but it does not solve the problem of cascading updates at the root.
Because quicklistNode still uses a compressed list to store elements, and the cascading-update problem stems from the compressed list’s own structural design, a brand-new data structure is needed to solve the problem completely.
Therefore, in version 5.0 Redis designed a new data structure named listpack to replace the compressed list. Its biggest feature is that a listpack node no longer stores the length of the previous node — it is precisely because every compressed-list node must store the previous node’s length that the hidden danger of cascading updates exists.
I checked the Redis GitHub repository: in the latest 6.2 release, the compressed list has not yet been replaced by listpack in the underlying data structures of the Hash and Set objects, but in the latest (not yet released) Redis code, every Redis object that used the compressed list as an underlying data structure has been switched to listpack. A release that fully replaces the compressed list with listpack can be expected in the near future.
Listpack structure design
Listpack keeps many of the compressed list’s good designs, such as storing data compactly in a contiguous memory space, and, to save memory, a listpack node uses different encodings for data of different sizes.
Let’s start with the ListPack structure:
The listpack header contains two attributes, recording the listpack’s total number of bytes and its number of elements, and the listpack ends with a trailing end marker. The “listpack entry” in the figure is a listpack node.
Each ListPack node is structured as follows:
It mainly includes three aspects:
- encoding: defines the encoding type of the element, with different encodings for integers and strings of different lengths;
- data: the actual data stored;
- len: the total length of encoding + data.
As you can see, a listpack node does not record the length of the previous node in the list; it only records the length of the current node. So when we add a new element to a listpack, the lengths of the other nodes are unaffected.
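As a sketch of why this works, here is the byte accounting for a single entry — `backlen_bytes` mirrors the 7-bits-per-byte scheme listpack uses for its trailing length field, though the function names here are my own:

```c
#include <stdint.h>

// A listpack entry's trailing len field stores the entry's own length
// using 7 bits per byte, so it occupies 1 to 5 bytes.
int backlen_bytes(uint32_t entry_len) {
    if (entry_len < (1u << 7))  return 1;
    if (entry_len < (1u << 14)) return 2;
    if (entry_len < (1u << 21)) return 3;
    if (entry_len < (1u << 28)) return 4;
    return 5;
}

// Total bytes of one entry: encoding + data + its own trailing len.
// Since len describes only the current entry, growing one entry never
// changes a neighbour's layout -- which is why cascading updates vanish.
int entry_total_bytes(int encoding_bytes, int data_bytes) {
    int body = encoding_bytes + data_bytes;
    return body + backlen_bytes((uint32_t)body);
}
```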
References:
- Redis Design and Implementation
- Redis source code analysis and combat
conclusion
It’s a relief to finally be finished.
I haven’t written such a long illustrated technical article in a while. This one came to 15,000+ words and more than 40 diagrams, and took a lot of time reading books and source code.
Hopefully this article helps you cut through the fog around Redis data structures!
I’m Xiao Lin. See you next time!
Wait — friends who’ve scrolled all the way to the end, leave a like, a share, and a comment before you go!