1. Introduction

In an interview, you may be asked how Redis's underlying data structures (String, List, Hash, Set, SortedSet) are implemented. Many of you may object that you are not a Redis developer: it is enough to use these five data types in the right business scenarios, so why understand the underlying data structures? In fact, once you understand how these data structures are implemented, you can largely explain why Redis is so fast.

2. Five commonly used data types

In Redis, the five common data types and common scenarios are as follows:

  • String: caches, counters, distributed locks, etc.
  • List: linked lists, queues, timelines (such as a microblog feed)
  • Set: likes and dislikes, mutual friends
  • Hash: user information, hash tables
  • Zset: ranking features, such as visit rankings and click rankings

2.1 Relationship with six underlying data structures

  • String: uses int encoding for numbers and raw encoding for non-numeric values.
  • List: zipList is used when every element string is shorter than 64 bytes and the number of elements is less than 512; otherwise linkedList is used. These thresholds are set in the Redis configuration file redis.conf as follows:
    • list-max-ziplist-entries 512
    • list-max-ziplist-value 64
  • Hash: the underlying data structure can be a zipList or a hashtable. ZipList is used when every hash key and value is shorter than 64 bytes and the number of key-value pairs is less than 512; otherwise a hashtable is used.
  • Set: the encoding of a set object can be an intset or a hashtable. If all stored elements are integers and the number of elements is within the configured limit, an intset is used; otherwise a hashtable is used.
  • SortedSet: the underlying data structure can be a zipList or a skipList. ZipList is used when the number of elements is less than 128 and every member is shorter than 64 bytes; otherwise skipList is used. These thresholds are set in redis.conf as:
    • zset-max-ziplist-entries 128
    • zset-max-ziplist-value 64
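These thresholds come from redis.conf. As a rough sketch of the kind of check they imply (the constant and function names here are hypothetical, not the actual Redis source; the real code reads the configured values rather than hard-coded constants), deciding whether a list may keep the compact zipList encoding looks like:

#define LIST_MAX_ZIPLIST_ENTRIES 512   /* list-max-ziplist-entries */
#define LIST_MAX_ZIPLIST_VALUE   64    /* list-max-ziplist-value, in bytes */

/* Hypothetical sketch: a list can stay ziplist-encoded only while it is small. */
int list_can_use_ziplist(unsigned long num_entries, unsigned long longest_value_len) {
    return num_entries < LIST_MAX_ZIPLIST_ENTRIES &&
           longest_value_len < LIST_MAX_ZIPLIST_VALUE;
}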

3. Implementation of six underlying data structures

3.1 SDS – Simple dynamic String

Instead of using C strings directly (null-terminated character arrays), Redis builds its own abstract type called simple dynamic string (SDS) and uses it as the default string representation. Besides holding strings in the database, SDS is also used for buffers, such as the AOF buffer and client input buffers. SDS is defined as follows:

struct sdshdr {
    int len;     // number of bytes used in the buf array
    int free;    // number of unused bytes in the buf array
    char buf[];  // byte array that holds the string
};
  • len: the length of the string (the number of bytes used)
  • free: the number of unused bytes in the character array
  • buf: the byte array that holds the string
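As a minimal sketch (using the sdshdr layout above, not the exact Redis source), this is why reading the length or the remaining space of an SDS is a single O(1) field access:

static inline int sdslen(const struct sdshdr *sh)   { return sh->len;  }  // O(1) string length
static inline int sdsavail(const struct sdshdr *sh) { return sh->free; }  // O(1) unused space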

3.1.1 Space preallocation

Space preallocation is used to optimize string growth operations. When the SDS API modifies an SDS and needs to expand its space, Redis allocates not only the space required for the modification but also additional unused space. The allocation rules are as follows:

  • If the SDS length after modification is less than 1MB, Redis allocates the same amount of unused space as len, i.e. free equals len and the size of buf becomes len*2+1; the extra byte holds the terminating null character.
  • If the SDS length after modification is greater than or equal to 1MB, Redis allocates 1MB of unused space, i.e. free is 1MB and the size of buf becomes len+1MB+1.
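A simplified sketch of this preallocation rule under the sdshdr layout above (the real logic lives in sdsMakeRoomFor in the Redis source; names and error handling are simplified here):

#include <stdlib.h>

#define SDS_MAX_PREALLOC (1024 * 1024)             /* 1MB */

struct sdshdr { int len; int free; char buf[]; };  /* same layout as above */

/* Make room for addlen extra bytes: roughly double below 1MB, add a flat 1MB above. */
struct sdshdr *sds_make_room(struct sdshdr *sh, size_t addlen) {
    if ((size_t)sh->free >= addlen) return sh;     /* already enough free space */

    size_t newlen = sh->len + addlen;
    if (newlen < SDS_MAX_PREALLOC) newlen *= 2;    /* free ends up roughly the size of len */
    else newlen += SDS_MAX_PREALLOC;               /* add a fixed 1MB of free space */

    struct sdshdr *newsh = realloc(sh, sizeof(struct sdshdr) + newlen + 1);  /* +1 for '\0' */
    if (newsh == NULL) return NULL;
    newsh->free = (int)(newlen - newsh->len);
    return newsh;
}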

3.1.2 Lazy space release

When an SDS is shortened, the program does not reclaim the excess memory; instead, it uses the free field to record the number of unused bytes and keeps them around. If an append operation is needed later, the unused space recorded in free can be used directly, reducing memory allocations.

3.1.3 Summary

  • With SDS, the time Redis needs to obtain the length of a string drops from O(n) to O(1), because C has to traverse the entire character array until the terminating null to count the length. Even calling STRLEN repeatedly on very long strings is not a performance bottleneck, because STRLEN is O(1).
  • Buffer overflows are avoided when manipulating strings. Before operating on a string, SDS checks in advance whether there is enough space; if not, functions such as sdscat expand the SDS first and then perform the concatenation, avoiding the repeated manual memory management C requires when changing a string's length.

3.2 zipList – Compressed list

ZipList is a sequential data structure made up of contiguous memory blocks. A zipList contains multiple entry nodes, and each entry can store an integer or a string.

struct ziplist<T> {
    int32 zlbytes;        // total number of bytes occupied by the ziplist
    int32 zltail_offset;  // offset from the start to the last entry, for fast access to the tail
    int16 zllength;       // number of entries
    T[] entries;          // element contents, stored compactly one after another
    int8 zlend;           // marks the end of the ziplist, constant 0xFF
}

3.3 linkedList – Doubly linked list

The linked list provides efficient node reordering and sequential access, and its length can be flexibly adjusted by adding and deleting nodes. Linked lists and linked list nodes are implemented as follows:

typedef struct listNode {
    struct listNode *prev;               // previous node
    struct listNode *next;               // next node
    void *value;                         // node value
} listNode;

typedef struct list {
    listNode *head;                      // head node
    listNode *tail;                      // tail node
    unsigned long len;                   // number of nodes in the list
    void *(*dup)(void *ptr);             // node value copy function
    void (*free)(void *ptr);             // node value free function
    int (*match)(void *ptr, void *key);  // node value comparison function
} list;

3.3.1 Summary of doubly linked list features

  • Double-ended: each node has prev and next pointers, so obtaining a node's predecessor or successor is O(1).
  • Acyclic: the prev pointer of the head node and the next pointer of the tail node both point to NULL, so traversal of the list ends at NULL.
  • Head and tail pointers: through the head and tail pointers of the list structure, obtaining the head or tail node is O(1).
  • Length counter: the len attribute of the list structure counts the nodes the list holds, so obtaining the number of nodes is O(1).

3.4 skipList – Skip list

A skip list is an ordered data structure that keeps multiple pointers at each node for fast access. Within Redis, skip lists are used in only two places:

  • Implementing the sorted set (sortedSet) when the set has many elements or its members are long strings.
  • As an internal data structure of cluster nodes.

Redis skip lists are defined by the zskiplistNode and zskiplist structures. zskiplistNode represents a skip list node, while zskiplist stores information about the skip list as a whole, such as the number of nodes and pointers to the head and tail of the list.

3.4.1 Skip list node

Skip list nodes are defined as follows:

typedef struct zskiplistNode {
    struct zskiplistNode *backward;     // backward pointer
    double score;                       // score
    robj *obj;                          // member object
    // layer array
    struct zskiplistLevel {
        struct zskiplistNode *forward;  // forward pointer
        unsigned int span;              // span
    } level[];
} zskiplistNode;
  • Layer array: layers are the means by which Redis quickly reaches other skip list nodes; the more layers a node has, the faster other nodes can be reached. When a skip list node is created, a power-law distribution randomly generates a value between 1 and 32 as the size of the level array, which becomes the node's height (see the sketch after this list).
  • Forward pointer: points toward the tail of the list; starting from the current node, other nodes can be reached by following forward pointers. The dotted lines in the figure mark the path traversed from head to tail.
  • Span: used to calculate the distance between two nodes; the larger a node's span, the farther apart the two nodes are.
  • Rank: obtained by summing the spans of all the layers visited on the way to a node.
  • Backward pointer: used to walk from the tail of the list toward the head. Unlike forward pointers, each node has only one backward pointer.
  • Score: a floating-point number of type double; all nodes in the skip list are sorted by score.
  • Member object: points to a string object (SDS).
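A sketch of the power-law layer generation mentioned above; it mirrors the zslRandomLevel() function in the Redis source (t_zset.c), where each additional layer is granted with probability 0.25 up to a maximum of 32:

#include <stdlib.h>

#define ZSKIPLIST_MAXLEVEL 32   /* maximum node height */
#define ZSKIPLIST_P 0.25        /* probability of gaining one more layer */

int zsl_random_level(void) {
    int level = 1;
    while ((random() & 0xFFFF) < (ZSKIPLIST_P * 0xFFFF))
        level += 1;
    return (level < ZSKIPLIST_MAXLEVEL) ? level : ZSKIPLIST_MAXLEVEL;
}

Higher layers therefore become exponentially rarer, which is what gives the skip list its O(log N) expected lookup cost.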

3.4.2 Skip list

Redis represents the skip list itself with the zskiplist structure, which holds not only the skip list nodes but also pointers to the head and tail nodes and the number of nodes. The zskiplist structure is defined as follows:

typedef struct zskiplist {
    struct zskiplistNode *header, *tail;  // head and tail nodes
    unsigned long length;                 // number of nodes in the list
    int level;                            // highest level among all nodes
} zskiplist;
  • The header and tail pointers point to the head and tail nodes of the skip list, so locating them is O(1).
  • length records the number of skip list nodes, so Redis can return the skip list length in O(1).

3.4.3 Traversal of skip lists

  • Head node: the head node has the same structure as any other skip list node, but its score and backward pointer are not used, and its member object is not shown.
  • Skip list nodes: the four nodes on the right of the figure each have a level array (L1, L2, ..., Ln) with a span and forward pointer per layer, plus a backward pointer, a score, and a member object mem.
  • Skip list: the leftmost structure in the figure. It consists of four core parts: the head pointer, the tail pointer, the skip list length, and the skip list level.
  • Traversal from head to tail proceeds as follows:
    • The program first visits the head node of the skip list (the header), then follows the layer-4 (L4) forward pointer to the second node in the list.
    • At the second node, the program follows the layer-2 forward pointer to the third node.
    • At the third node, the program again follows the layer-2 forward pointer to the fourth node.
    • When the program follows the fourth node's forward pointer, it hits NULL, so it knows it has reached the end of the skip list and the traversal ends.
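The same level-by-level movement is what makes lookups fast. A minimal sketch of the search pattern, assuming the zskiplistNode and zskiplist structures shown earlier (simplified from the traversal logic in the Redis source; it compares scores only and ignores member comparison):

/* Return the first node whose score is >= the target score, or NULL if none. */
zskiplistNode *zsl_find_ge(zskiplist *zsl, double score) {
    zskiplistNode *x = zsl->header;
    /* start at the highest layer and walk right while the next node is still too small */
    for (int i = zsl->level - 1; i >= 0; i--) {
        while (x->level[i].forward && x->level[i].forward->score < score)
            x = x->level[i].forward;
    }
    /* x is now the last node with score < target; its layer-1 successor is the answer */
    return x->level[0].forward;
}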

3.5 Integer Set

The integer set (intset) is one of the underlying implementations of the set key. Redis uses an integer set as the underlying implementation of a set key when the set contains only integer elements and the number of elements is small. Its structure is as follows:

typedef struct intset {
    uint32_t encoding;   // encoding of the elements
    uint32_t length;     // number of elements
    int8_t contents[];   // array that holds the elements
} intset;
  • length: records the number of elements in the integer set, i.e. the length of the contents array.
  • contents: the array that holds the set's elements; the data is sorted from smallest to largest and contains no duplicates.
  • encoding: specifies the encoding of the integer data, i.e. the byte width of each item in the array. The value can be:
    • INTSET_ENC_INT16: each item in the array is an int16_t integer (minimum -32768, maximum 32767)
    • INTSET_ENC_INT32: each item in the array is an int32_t integer (minimum -2147483648, maximum 2147483647)
    • INTSET_ENC_INT64: each item in the array is an int64_t integer (minimum -9223372036854775808, maximum 9223372036854775807)
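How the required encoding for a given value is chosen can be sketched as follows; it mirrors the _intsetValueEncoding() helper in the Redis source (intset.c):

#include <stdint.h>

#define INTSET_ENC_INT16 (sizeof(int16_t))
#define INTSET_ENC_INT32 (sizeof(int32_t))
#define INTSET_ENC_INT64 (sizeof(int64_t))

/* Pick the narrowest encoding that can hold v. */
static uint8_t intset_value_encoding(int64_t v) {
    if (v < INT32_MIN || v > INT32_MAX) return INTSET_ENC_INT64;
    if (v < INT16_MIN || v > INT16_MAX) return INTSET_ENC_INT32;
    return INTSET_ENC_INT16;
}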

3.5.1 Upgrading integer Sets

Whenever a new element is added to the integer set and the new element's type is wider than that of every existing element, the integer set must be upgraded before the new element can be added. The upgrade process is roughly divided into three steps:

  • Expand the underlying array according to the type of the new element and allocate space for the new element.
  • Convert all existing elements to the same type as the new element and place them at the correct (sorted) positions in the expanded array.
  • Add the new element to the underlying array.

3.5.2 Summary

  • The integer set is one of the underlying implementations of the set key.
  • The underlying implementation of the integer set is an array that holds the set elements in an ordered, duplicate-free way. When a wider element is added to the set, the data type of the array is changed; this is called upgrading the integer set.
  • Integer sets only support upgrading; there is no downgrade operation.

3.6 Hash table

The hash table used by the Redis dictionary is organized as follows:

A hash table has a table array, and each element of the array is a dictEntry; each dictEntry holds one key-value pair. The size attribute records the size of the hash table, i.e. the size of the table array, while the used attribute records the number of nodes (key-value pairs) the hash table currently holds. The sizemask attribute is always equal to size-1; together with the hash value, it determines which index of the table array a key is placed at.
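A simplified sketch of these structures, based on the dict.h of older Redis versions (before the dictionary layout was reworked in Redis 7):

#include <stdint.h>

typedef struct dictEntry {
    void *key;                // key
    union {                   // value
        void *val;
        uint64_t u64;
        int64_t s64;
        double d;
    } v;
    struct dictEntry *next;   // next entry in the same bucket (chaining)
} dictEntry;

typedef struct dictht {
    dictEntry **table;        // bucket array; each element is the head of a dictEntry list
    unsigned long size;       // size of the table array
    unsigned long sizemask;   // always size-1, used when computing a bucket index
    unsigned long used;       // number of key-value pairs currently stored
} dictht;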

3.6.1 Dictionary

The dictionary has two core components: the ht array and rehashidx. Other components, such as type-specific functions, are beyond the scope of this article.

  • ht array: ht has a fixed size of 2, and each element of the array is a hash table. Normally the dictionary only uses the ht[0] hash table.
  • rehashidx: records the current progress of rehashing; its value is -1 if no rehash is in progress.
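A corresponding sketch of the dictionary structure itself (again based on older Redis versions; the type-specific function pointers are omitted, as noted above):

typedef struct dict {
    dictht ht[2];     // two hash tables: ht[0] is used normally, ht[1] only during a rehash
    long rehashidx;   // rehash progress; -1 when no rehash is in progress
} dict;
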
3.6.2 How the dictionary locates an element

To add a new key-value pair to the dictionary, the program first computes the hash and index values from the key; then, based on the index value, the hash table node containing the new key-value pair is placed at the specified index of the hash table array.

  • Compute the hash value: a hash function selected according to the dictionary type computes the hash of the key.
  • Compute the index: index = hash & dict->ht[x].sizemask, where ht[x] can be ht[0] or ht[1].
  • Resolve hash collisions: collisions are resolved by chaining. Note that since dictEntry nodes have no pointer to the end of the list, new nodes are always added at the head of the list for speed.
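A sketch of the index computation and head insertion described above, using the simplified dictEntry and dictht structures from the earlier sketch:

/* index = hash & (size - 1), since sizemask is always size - 1 */
unsigned long dict_bucket_index(dictht *ht, uint64_t hash) {
    return hash & ht->sizemask;
}

/* Resolve collisions by chaining: new nodes go to the head of the bucket's list, O(1). */
void dictht_add(dictht *ht, uint64_t hash, dictEntry *entry) {
    unsigned long idx = dict_bucket_index(ht, hash);
    entry->next = ht->table[idx];
    ht->table[idx] = entry;
    ht->used++;
}
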
3.6.3 Rehash

Expanding and shrinking a hash table is done by performing a rehash operation. Redis rehashes the dictionary's hash tables as follows:

  • Allocate space for the dictionary's ht[1] hash table. The amount of space depends on the operation to be performed and the current size of ht[0]:
    • For an expansion, the size of ht[1] is the first power of 2 that is greater than or equal to ht[0].used * 2.
    • For a shrink, the size of ht[1] is the first power of 2 that is greater than or equal to ht[0].used.
  • Rehash all key-value pairs stored in ht[0] onto ht[1]: recompute each key's hash and index values, then place the key-value pair at the computed position in ht[1].
  • When all key-value pairs in ht[0] have been migrated to ht[1] (ht[0] becomes empty), release ht[0], make ht[1] the new ht[0], and create a new blank hash table at ht[1] in preparation for the next rehash.
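A sketch of how the new table size is chosen; it mirrors the _dictNextPower() helper in the Redis source, which starts from the initial table size of 4 and doubles until the target is reached:

/* Smallest power of 2 that is >= target. For an expansion the target is
 * ht[0].used * 2; for a shrink it is ht[0].used. */
unsigned long dict_next_power(unsigned long target) {
    unsigned long size = 4;   /* DICT_HT_INITIAL_SIZE */
    while (size < target)
        size *= 2;
    return size;
}
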
3.6.3.1 Rehash triggering conditions
  • The server is not currently executing BGSAVE or BGREWRITEAOF, and the hash table's load factor is greater than or equal to 1.
  • The server is currently executing BGSAVE or BGREWRITEAOF, and the hash table's load factor is greater than or equal to 5.
  • When the hash table's load factor is less than 0.1, the program automatically starts shrinking the hash table.

Note: load_factor = ht[0].used / ht[0].size

3.6.4 Progressive Rehash

Progressive rehash addresses the following problem: when copying data from ht[0] to ht[1] during a rehash, if there are hundreds of millions of entries, rehashing all of those key-value pairs into ht[1] in one go could cause the server to stop serving requests for a while because of the huge amount of computation. Therefore Redis does not copy the data from ht[0] to ht[1] all at once; it copies it over in small steps, avoiding the impact a one-shot rehash would have on server performance.

The progressive rehash steps are as follows:

  • Set the dictionary's rehashidx variable to 0, indicating that the rehash has officially started. At this point the dictionary holds both ht[0] and ht[1].
  • Whenever the dictionary is used for an add, delete, update, or lookup, the corresponding operation is performed, and in addition one step of the rehash is carried out, moving the entries in ht[0]'s bucket at rehashidx over to ht[1]. After the move, rehashidx is incremented by one.
  • As the operations proceed, ht[0] eventually holds zero elements. At this point the progressive rehash is complete, and rehashidx is set back to -1 to indicate that the rehash has finished.

Note: during a progressive rehash, since the dictionary holds two hash tables, add, delete, update, and lookup operations on the dictionary work against both tables. However, while the rehash is in progress, newly added entries are placed directly into ht[1]; this guarantees that the number of elements in ht[0] only ever decreases, so it eventually becomes an empty hash table.
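A sketch of a single progressive rehash step, using the simplified structures from the earlier sketches (compare dictRehash()/_dictRehashStep() in the Redis source; hash_key here stands in for the dictionary's hash function and is hypothetical):

#include <stdint.h>
#include <stdlib.h>

uint64_t hash_key(const void *key);   /* hypothetical: the dictionary's hash function */

/* Move one non-empty bucket from ht[0] to ht[1], then advance rehashidx. */
void dict_rehash_step(dict *d) {
    if (d->rehashidx == -1) return;                      /* no rehash in progress */

    /* skip empty buckets */
    while ((unsigned long)d->rehashidx < d->ht[0].size &&
           d->ht[0].table[d->rehashidx] == NULL)
        d->rehashidx++;

    if ((unsigned long)d->rehashidx >= d->ht[0].size) {  /* ht[0] is fully drained */
        free(d->ht[0].table);
        d->ht[0] = d->ht[1];
        d->ht[1] = (dictht){0};
        d->rehashidx = -1;                               /* rehash finished */
        return;
    }

    /* move every entry in this bucket to its new position in ht[1] */
    dictEntry *entry = d->ht[0].table[d->rehashidx];
    while (entry) {
        dictEntry *next = entry->next;
        unsigned long idx = hash_key(entry->key) & d->ht[1].sizemask;
        entry->next = d->ht[1].table[idx];
        d->ht[1].table[idx] = entry;
        d->ht[0].used--;
        d->ht[1].used++;
        entry = next;
    }
    d->ht[0].table[d->rehashidx] = NULL;
    d->rehashidx++;
}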