Read the underlying data structure of Redis common object types

Redis is an in-memory data structure storage system that can be used as a database, cache, and messaging middleware. Redis supports five common object types: String, Hash, List, Set, and Zset, and we use them a lot in our daily work. To know why, this article will take you through the underlying data structures of these five common object types.

The main content of this article is from Redis Design and Implementation.

Object type and encoding

Redis uses objects to store keys and values. In Redis, each object is represented by the redisObject structure. The redisObject structure consists of three main attributes: Type, Encoding, and PTR.

typedef struct redisObject {
    / / type
    unsigned type:4;
    / / code
    unsigned encoding:4;
    // A pointer to the underlying data structure
    void *ptr;
} robj;
Copy the code

The type attribute records the type of the object. For Redis, the key object is always a string and the value object can be any supported type. Therefore, when we say which object type is used for the Redis key, we mean which object type is used for the corresponding value.

Type constants	Object type name
REDIS_STRING	String object
REDIS_LIST	A list of objects
REDIS_HASH	The hash object
REDIS_SET	A collection of objects
REDIS_ZSET	Ordered set object

* The PTR attribute points to the underlying data structures of the object, which are determined by the Encoding attribute.

Code constants	Encode the corresponding underlying data structure
REDIS_ENCODING_INT	An integer of type long
REDIS_ENCODING_EMBSTR	A simple dynamic string encoded by emstr
REDIS_ENCODING_RAW	Simple dynamic string
REDIS_ENCODING_HT	The dictionary
REDIS_ENCODING_LINKEDLIST	Double side chain table
REDIS_ENCODING_ZIPLIST	List of compression
REDIS_ENCODING_INTSET	The integer set
REDIS_ENCODING_SKIPLIST	Skip lists and dictionaries

The encoding property determines the underlying data structure of the object in order to implement the same object type, supporting different underlying implementations. This allows different underlying data structures to be used in different scenarios, greatly increasing the flexibility and efficiency of Redis.

The underlying data structure will be explained later, so take a quick look here.

String object

String is the object type we use most in our daily work, and its encoding can be int, RAW, or embstr. For details about the command for the string object, see Redis command-strings.

If a string object is saved no more thanlongType, in which case the encoding type isint, its underlying data structure is directlylongType. For example, to performset number 10086, will createintEncoding the string object asnumberThe value of the key.

If the string object stores a string longer than 39 bytes, the encoding type is RAW and the underlying data structure is simple dynamic string (SDS). If the length is less than or equal to 39 bytes, the encoding type is EMbSTR and the underlying data structure is EMBSTR encoding SDS. Let’s take a closer look at what a simple dynamic string is.

Simple dynamic string

Definition of SDS

In Redis, SDS is represented using the SDSHDR data structure:

struct sdshdr {
    // String length
    int len;
    // The number of unused bytes in the buf array
    int free;
    // An array of bytes to hold strings
    char buf[];
};
Copy the code

SDSFollowing the convention that C strings end with a null character, 1 byte of the saved null character is not counted inlenAttributes inside. For example,RedisThis string is inSDSThe data could be in the following form:

SDS and C strings

C uses an array of characters of length N+1 to represent strings of length N, and the last element of the string is the null character \0. Redis uses SDS to have the following advantages over C strings:

Constant complexity gets the length of the string
Prevent buffer overflow
Reduce the number of memory reallocations when modifying strings
Binary security

Constant complexity gets the length of the string

Because the C string does not record its own length information, in order to obtain the length of the string, the entire string must be traversed in O(N) time. SDS uses the len attribute to record the length of the string, so the time complexity of obtaining the length of the string in SDS is O(1).

Prevent buffer overflow

Another problem with C strings not recording their length is that they can easily overflow the cache. For example, using the string concatenation function (Stract), it is easy to overwrite the original data of the character array. Unlike C strings, SDS’s space allocation strategy completely eliminates the possibility of a cache overflow. When SDS performs string augmentation, it first checks whether the current byte array length is sufficient. If not, it performs automatic augmentation before performing string operations.

Reduce the number of memory reallocations when modifying strings

Because the length of the C string is closely related to the underlying data, the program reallocates the array each time it grows or shortens a string:

If the string operation is to grow, the underlying array size needs to be expanded by memory reallocation. Otherwise, the cache will overflow.
If you shorten the string operation, you need to reclaim unused space by memory reallocation first. Failure to do so will result in a memory leak.

Because memory reallocation involves complex algorithms and may require system calls, it is usually a time-consuming operation. For Redis, string modification is a very frequent operation, and if you reallocate memory every time like C strings, the performance impact is too great to be acceptable.

SDS disassociates the string length from the underlying data through free space. In SDS, arrays can contain unused bytes, the number of bytes recorded by the free property. SDS realizes two optimization strategies of space pre-allocation and inert space release through free space.

Space preallocation Space preallocation is used for optimizationSDSThe simple idea behind string growth is that when the byte array runs out of space to trigger a reallocation, some free space is always reserved. This reduces the number of memory reallocations required for successive string growth operations. There are two strategies for pre-allocation:
1. lenLess than1MBWhen: Each reassignment will allocate the same amount of free space;
2. lenGreater than or equal to1MBWhen: 1MB more free space is allocated for each reallocation.
Inert space releaseLazy space release is used for optimizationSDSThe short string operation simply means that when the string is shortened, it does not immediately use memory reallocation to reclaim the extra bytes. Instead, it uses the free property to record the bytes and wait for future use.SDSAlso provides direct free up of unused spaceAPIWhen needed, it can really free up extra space.

Binary security

Characters in a C string must conform to some encoding, and no null characters are allowed except at the end of the string. These restrictions restrict C strings to holding only text data. But for Redis, you need to save not just text, but binary data. To achieve this goal, all SDS apis are binary-safe.

SDS difference between RAW and EMBSTR codes

As we said earlier, strings larger than 39 bytes are encoded asraw, the underlying data structure is a simple dynamic string (SDS). And this is easy to understand, for example, when we executeset story "Long, long, long ago there lived a king ..."(length greater than 39),RedisI’m going to create onerawThe codingStringObject. The data structure is as follows:

The value is a string of less than 39 bytes and the encoding type isembstr, the underlying data structure isembstrcodingSDS.embstrThe encoding is specifically designed to hold short strings, and it israwThe biggest coding differences are:rawEncoding calls two memory allocations created separatelyredisObjectThe structure andsdshdrStructure, andembstrEncoding calls memory allocation only once, including it simultaneously on a contiguous block of spaceredisObjectThe structure andsdshdrstructure.

Code conversion

String objects encoded by int and EMbstr are automatically converted to raw string objects if the conditions are met. For int encoding, when we change the string to no longer be an integer value, the encoding of the string object changes from int to RAW; For embSTR encoding, whenever we change the value of the string, the encoding of the string object changes from EMbstr to RAW.

Embstr encoded string objects can be considered read-only because Redis writes any modifications to them. When we want to modify the embSTR encoding string, we will first convert to raw encoding, and then modify.

A list of objects

List objects can be encoded as linkedList or Ziplist, and the underlying data structures are linked lists and compressed lists. For details about commands related to List objects, see Redis command -list. By default, list objects use ziplist encoding when all string elements are less than 64 bytes long and the number of elements is less than 512, otherwise linkedList encoding is used.

This upper limit can be modified through the configuration file.

The list

Linked list is a very common data structure, which provides efficient node rearrangement and sequential node access. In Redis, each linked listNode is represented using the listNode structure:

typedef struct listNode {
    // The front node
    struct listNode *prev;
    // back node
    struct listNode *next;
    / / the node values
    void *value;
} listNode
Copy the code

multiplelistNodethroughprevandnextPointers form a double-ended linked list, as shown below:

For ease of operation, Redis uses the List structure to hold linked lists.

typedef struct list {
    // Table header node
    listNode *head;
    // Table end node
    listNode *tail;
    // The number of nodes in the list
    unsigned long len;
    // Node copy function
    void *(*dup)(void *ptr);
    // Node release function
    void (*free) (void *ptr);
    // node comparison function
    int (*match)(void *ptr, void *key);
} list;
Copy the code

The list structure provides a header pointer to a linked listheadPointer to the end of the tabletail, and the linked list length counterlenAnd thedup,freeandmatchMembers are specific functions that implement the types of polymorphic linked lists.

The characteristics of Redis linked list implementation are summarized as follows:

Double side: Linked list nodes withprevandnextPointer, the complexity of obtaining the pre-node and post-node of a node isO(n).
acyclic: of the header nodeprevPointer and end-of-table nodenextAll the Pointers point toNULLTo access the linked listNULLTo the end.
With header pointer and tail pointerThrough:listThe structure of theheadPointers andtailPointer, the program obtains the complexity of the head node and the tail node of the linked listO(1).
Length counter with linked list: Program uselistThe structure of thelenAttribute to thelistThe complexity of obtaining the number of nodes in the linked list by the program isO(1).
polymorphism: Used by linked list nodesvoid*Pointer to hold node values, which can hold various types of values.

List of compression

Compressed list (ziplist) is one of the underlying implementations of list and hash keys. The compressed list is a sequential data structure composed of a series of contiguous memory blocks with special encoding. A compressed list can contain any number of nodes, and each node can hold either a byte array or an integer value.As shown in the figure above, the compressed list records the type, length, and purpose of each component.

attribute	type	The length of the	use
zlbytes	uint_32_t	4 bytes	Records the number of bytes of memory used for the entire compressed list
zltail	uint_32_t	4 bytes	Records the number of bytes from the start address of the end of the compressed table. This offset allows the program to determine the end address without traversing the entire compressed table
zlen	uint_16_t	2 –	Records the number of nodes in the compression list
entryX	List node	indefinite	Compress each node of the list, the length of which is determined by the content saved
zlend	uint_8_t	1 byte	Special value (`0xFFF`) to mark the end of the compressed list

The hash object

Hash objects can be encoded ziplist or Hashtable.

hash-ziplist

Ziplist uses a compressed list at the bottom. The implementation principle of the compressed list has been described in detail above. Each time a new key-value pair is added to the hash object, the node holding the key is pushed to the end of the compressed column table, and then the node holding the value is pushed to the end of the compressed column table. For example, we execute the following three HSET commands:

HSET profile name "tom"
HSET profile age 25
HSET profile career "Programmer"
Copy the code

If you useziplistCode, then theHashThe structure of the object in memory is as follows:

hash-hashtable

Hashtable encoded hash objects use dictionaries as the underlying implementation. A dictionary is a data structure used to store key-value pairs. The dictionary of Redis uses hash table as the underlying implementation. A hash table can have multiple hash table nodes, and each hash table node stores a key-value pair.

Hash table

The hash table used by Redis is defined by the dictht construct:

typedef struct dictht{
    // Hash table array
    dictEntry **table;

    // Hash table size
    unsigned long size;

    // Hash table size mask, used to calculate index values
    // always equal to size-1
    unsigned long sizemask;

    // The number of existing nodes in the hash table
    unsigned long used;
} dictht
Copy the code

tableProperty is an array, and each element in the array is a pointer todictEntryStructure pointer to eachdictEntryThe structure holds a key-value pair.sizeProperty records the size of the hash table, i.etableThe size of the array.usedProperty records the number of existing nodes in the hash table.sizemaskAlways equal to thesize-1, this value is used primarily for array indexes. For example, the following figure shows an empty hash table of size 4.

Hash table node

Hash table nodes are represented by dictEntry structures, each of which holds a key-value pair:

typedef struct dictEntry {
    / / key
    void *key;

    / / value
    union {
        void *val;
        unit64_t u64;
        nit64_t s64;
    } v;

    // Point to the next hash table node to form a linked list
    struct dictEntry *next;
} dictEntry;
Copy the code

The key property holds the keys in the key-value pair, while the V property holds the values in the key-value pair. The value can be a pointer, a uint64_t integer or an int64_t integer. The next attribute points to another dictEntry node that resolves key collisions by concatenating multiple dictEntry nodes into a linked list with the same bucket bits. (Chain address method)

The dictionary

Redis dictionaries are represented by dict constructs:

typedef struct dict {

    // Type specific functions
    dictType *type;

    // Private data
    void *privdata;

    / / a hash table
    dictht ht[2];

    / / rehash index
    // When rehash is not in progress, the value is -1
    int rehashidx;
}
Copy the code

htIs of size 2, and each element points todicthtHash table. In general, a dictionary will only be usedht[0]Hash table,ht[1]The hash table will only work in pairsht[0]Hash tablerehashWhen use.rehashidxRecord therehashIf no rehash is currently performed, the value is -1.

rehash

To keep the hash table load factor (HT [0]).used/ HT [0]).size) within a reasonable range, the program needs to expand and shrink the hash table when too many or too few elements are stored. The rehash operation is used to expand and shrink the hash table. The steps for rehash are as follows:

forht[1]Hash table allocates space
1. If it is an extension operation, thenht[1]The size of the first is greater thanht[0].used*22 ^ n. Such asht[0].used=5, then at this timeht[1]The size of PI is 16. (The first 2 to the n greater than 10 is 16.)
2. If it’s a shrink operation, thenht[1]The size of the first is greater thanht[0].used2 ^ n. Such asht[0].used=5, then at this timeht[1]The magnitude of theta is 8. (The first 2 to the n greater than 5 is 8.)
Will be stored in theht[0]Rehash to all key value pairs inht[1]In the.
When the migration is complete, it is releasedht[0]And will nowht[1]Set toht[0]In theht[1]Create a new blank hash table in preparation for the next rehash.

When to expand and shrink a hash table:

When the server does not executeBGSAVEorBGREWRITEAOFIf the load factor is greater than or equal to 1, the hash table expansion operation is triggered.
When the server is executingBGSAVEorBGREWRITEAOFIf the load factor is greater than or equal to 5, the extension operation of the hash table is triggered.
When the load factor of the hash table is less than 0.1, shrink the hash table is triggered.

Progressive rehash

As mentioned earlier, to expand or shrink ht[0] requires rehashing all elements into HT [1]. If HT [0] has a large number of elements, the cost of rehashing is obviously very high, affecting Redis performance. To solve the above problems, Redis uses a progressive rehash technique, in which elements in HT [0] are slowly rehashed into HT [1] in multiple increments. Here are the step-by-step steps for progressive Rehash:

forht[1]Allocate space.
Maintains an index counter variable in the dictionaryrehashidxAnd sets its value to 0 to indicate that rehash has officially begun.
During rehash, every time a dictionary is added, deleted, found, or updated, in addition to the corresponding operation, it will be added to the dictionaryht[0]inrehashidxAll key value pairs on the index bit rehash toht[1]After rehash is complete,rehashidxValue plus one.
As the dictionary operation continues, eventually the migration will be completed at some pointrehashidxIf the value is set to -1, rehash is complete.

Progressive Rehash migrates all data on a bucket at a time and is designed to be dial-and-conquer, dispersing the centralized operations over each add, delete, find, and update operation to avoid the massive calculations that centralized Rehash brings.

Because the dictionary uses both HT [0] and HT [1] in progressive Rehash, dictionary deletions, look-up, and updates may be performed in both hash tables. For example, if you want to look for a key, look in HT [0] first, and continue to look in HT [1] if no key is found.

Hashtable in a hash object

HSET profile name "tom"
HSET profile age 25
HSET profile career "Programmer"
Copy the code

Again, the above three commands, save the data toRedisHash object, if adoptedhashtableIf the code is saved, then theHashThe structure of the object in memory is as follows:

Ziplist encoding is used when the key and value strings of all key-value pairs held by the hash object are less than 64 bytes long and the number is less than 512, otherwise HashTable encoding is used.

This upper limit can be modified through the configuration file.

A collection of objects

The encoding of a collection object can be intSet or HashTable. The intSet encoding is used when the collection object holds integers and the number of elements does not exceed 512; otherwise, the HashTable encoding is used.

set-intset

Intset-encoded collection objects are implemented using integer collections at the bottom.

Intset is an abstract data structure Redis uses to store integer values. It can store integer values of type INT16_t, INT32_t, or INT64_t without duplicating the data in the set. Redis uses the intset structure to represent a collection of integers.

typedef struct intset {
    // Encoding mode
    uint32_t encoding;
    // The number of elements the collection contains
    uint32_t length;
    // Hold an array of elements
    int8_t contents[];
} intset;
Copy the code

The contents array is a low-level implementation of the contents array: each element of the integer array is an element of the contents array, and each item is ordered in order of value from smallest to largest in the array, and the array contains no duplicates. Although the contents property is declared as an array of type INT8_T, the contents array does not actually hold any values of type INT8_T, and the actual type of value held in the array depends on encoding. If the encoding property is INTSET_ENC_INT16, then the contents array is an array of type INT16_T, and so on.

When the type of the newly inserted element is greater than that of the existing type element in the integer collection, the integer collection must be upgraded before the new element can be added. This process takes place in three steps.

Expands the size of the underlying array space of the integer collection based on the new element type.
Converts all existing elements of the underlying array to the same type as the new element, and maintains the ordering of the underlying array.
Adds a new element to the underlying array.

It is also important to note that integer collections do not support degradation, and once an array has been upgraded, the encoding remains the same.

For example, when we executeSADD numbers 1 3 5When data is inserted into a collection object, the collection object has the following structure in memory:

set-hashtable

hashtableThe encoded collection object uses a dictionary as the underlying implementation, each key of the dictionary is a string object, each string object corresponds to a collection element, and the dictionary value isNULL. When we executeSADD fruits "apple" "banana" "cherry"When data is inserted into a collection object, the collection object has the following structure in memory:

Ordered set object

The encoding for ordered collections can be ziplist or Skiplist. Ziplist encoding is used when the ordered set holds less than 128 elements and all element members are less than 64 bytes in length; otherwise, skiplist encoding is used.

zset-ziplist

Ziplist-encoded ordered collections use compressed lists as the underlying implementation. Each collection element is represented by two compressed list nodes next to each other, the first node holding the element’s members (member) and the second node holding the element’s score (score).

The collection elements in the compressed list are arranged in ascending order of magnitude. If we implementZADD price 8.5 apple 5.0 banana 6.0 cherryCommand to insert elements into an ordered collection with the following structure in memory:

zset-skiplist

Skiplist-encoded ordered collection objects use as their underlying implementation a zset structure that contains both a dictionary and a skiplist.

typedef struct zset {
    zskiplist *zs1;
    dict *dict;
}
Copy the code

Before we move on, let’s take a look at what skip lists are.

Jump table

A skiplist is an ordered data structure that enables fast access to nodes by maintaining multiple Pointers to other nodes in each node. The Redis skiplist is defined by two structures, zskiplistNode and Zskiplist. ZskiplistNode represents the skiplist node, and Zskiplist stores the information about the skiplist node, such as the number of nodes, Pointers to the head and tail of the table, etc.

Skip list node zskiplistNode

The zskiplistNode structure is defined as follows:

typedef struct zskiplistNode {
    // Back up the pointer
    struct zskiplistNode *backward;
    / / score
    double score;
    // Member objects
    robj *obj;
    / / layer
    struct zskiplistLevel {
        // Forward pointer
        struct zskiplistNode *forward;
        / / span
        unsigned int span;
    } level[];
} zskiplistNode;
Copy the code

The following figure shows a skip table consisting of 4 skip table nodes (1 table head node and 3 data nodes) with a height of 5:

layerEach time a new skip list node is created, one is randomly generated according to the power law (larger numbers are less likely to occur)1-32Is the height of the current node.. Each layer element contains two pieces of data, a forward pointer and a span.
1. Forward pointer Each level has a forward pointer to the end of the table, which is used to access nodes from the head to the end of the table.
2. The span of span layer is used to record the distance between two nodes.
Back pointer (BW) The back pointer of a node is used to access the node from the end of the table to the head. There is only one back pointer per node, so only one node can be backed at a time.
The score and the score of the member node are the samedoubleType of a floating-point number in which all nodes in the skip list are arranged in ascending order of point value. The node member (obj) is a pointer to a string object. In a skip list, the member objects held by each node must be unique, but multiple nodes do have the same score.

Note that the table header node does not store real data, and the height is fixed at 32, from the first table header node is not NULL at the highest level, can achieve fast lookup.

Jump table zskiplist

In fact, a skiplist can be made up of just a few skiplist nodes, but Redis uses the Zskiplist structure to hold these nodes, making it easier to manipulate the entire skiplist. For example, fast access to the top and bottom of the table, the number of skip table nodes, and so on. The zskiplist structure is defined as follows:

typedef struct zskiplist {

    // Table header and table tail nodes
    struct skiplistNode *header, *tail;
    // Number of nodes
    unsigned long length;
    // Maximum number of layers
    int level;
} zskiplist;
Copy the code

Here is an example of a complete skip list structure:

Of an ordered collection object`skiplist`implementation

As mentioned earlier, skiplist-encoded ordered collection objects use a zset structure as their underlying implementation, a zset structure that contains both a dictionary and a skiplist.

typedef struct zset {
    zskiplist *zs1;
    dict *dict;
}
Copy the code

The ZS1 skip list in the ZSET structure holds all the set elements in ascending order of points, and each skip list node holds one set element. Skip lists allow for quick score-based range lookups of ordered collections. The dict dictionary in the Zset structure creates a mapping for ordered collections from members to score values, where the keys hold the members and the values hold the score values. With a dictionary, you can look up the score of a given member with O(1) complexity.

If it’s still executedZADD price 8.5 apple 5.0 banana 6.0 cherryThe command tozsetSave data if usedskiplistThe structure of the ordered set in memory is as follows:

conclusion

Redis, on the whole, the underlying data structure mainly includes the simple dynamic string (SDS), linked lists, dictionaries, jump table set, integer and compression list six types, and based on these basic data structure to achieve the string object, the list of object, the hash object, collection objects, as well as an ordered set of five kinds of common object type. Each object type uses at least two data encodings, with different encodings using different underlying data structures.

Original is not easy, feel that the article is written well small partners, a praise to encourage it ~

Welcome to my open source project: a lightweight HTTP invocation framework for SpringBoot