Redis five basic data structures

When we refer to Redis’s “data structures” in this article, we may be talking about them on two different levels. The first level is the user’s perspective, such as:

  • string
  • list
  • hash
  • set
  • sorted set

This layer is also the calling interface that Redis exposes to the outside world; these are the Redis data types. The second layer is the internal data structures Redis uses to implement the types above, such as:

  • dict
  • sds
  • ziplist
  • quicklist
  • skiplist

This article walks through each data type from the user’s perspective and introduces the internal data structures behind it. For example, string is built on SDS, and list is built on quicklist and ziplist. The mapping from Redis data types to underlying data structures is as follows:

Redis object

Redis uses objects to represent keys and values in a database, where key objects are typically string objects and value objects can point to different types of data.

The type field records the object type. Its value can be REDIS_STRING, REDIS_LIST, REDIS_HASH, REDIS_SET, or REDIS_ZSET, indicating a string, list, hash, set, or zset object respectively.

The encoding field indicates the underlying encoding, i.e. how the object is stored. The mapping between encoding values and underlying data structures is as follows:

| Encoding constant | Underlying data structure |
| --- | --- |
| REDIS_ENCODING_INT | integer stored as a long |
| REDIS_ENCODING_EMBSTR | embstr-encoded simple dynamic string (SDS) |
| REDIS_ENCODING_RAW | simple dynamic string (SDS) |
| REDIS_ENCODING_HT | dict (hash table) |
| REDIS_ENCODING_QUICKLIST | quicklist |
| REDIS_ENCODING_ZIPLIST | ziplist |
| REDIS_ENCODING_INTSET | intset |
| REDIS_ENCODING_SKIPLIST | skiplist + dict |

The ptr field points to the concrete in-memory data structure. Since Redis has multiple data types, each of which can be implemented by several data structures, ptr is declared as void*.

string

When a string holds an integer value that fits in a long, the value is stored directly in ptr (the void* is cast to long) and the encoding is int. When an integer value is too large to fit in a long, it is kept as a string in an SDS and encoded as embstr or raw.

When a string value is no longer than 39 bytes (44 bytes since Redis 3.2, where the SDS header shrank), the embstr encoding is used. In this case the SDS and the redisObject are laid out next to each other in memory, so only one allocation is needed.

When the string is longer than that, the raw encoding is used. Creating it requires two allocations, one for the redisObject and one for the SDS structure; freeing it likewise requires two frees.
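The selection logic above can be sketched as follows. The function and macro names here are hypothetical, not from the Redis source, and the 39-byte limit follows the text (Redis 3.2+ uses 44):

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the encoding decision for string objects. */
typedef enum { ENC_INT, ENC_EMBSTR, ENC_RAW } str_encoding;

#define EMBSTR_SIZE_LIMIT 39  /* 44 in Redis >= 3.2 */

static str_encoding choose_string_encoding(const char *value) {
    char *end;
    errno = 0;
    long v = strtol(value, &end, 10);
    (void)v;
    /* The whole string parses as a long with no overflow -> int encoding. */
    if (errno == 0 && *end == '\0' && end != value)
        return ENC_INT;
    /* Short strings share one allocation with the object header. */
    if (strlen(value) <= EMBSTR_SIZE_LIMIT)
        return ENC_EMBSTR;
    return ENC_RAW;
}
```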

SDS data structure

In Redis 3.2, SDS is defined as follows:

struct __attribute__ ((__packed__)) sdshdr8 {
    uint8_t len;         /* used */
    uint8_t alloc;       /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr16 {
    uint16_t len;        /* used */
    uint16_t alloc;      /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr32 {
    uint32_t len;        /* used */
    uint32_t alloc;      /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr64 {
    uint64_t len;        /* used */
    uint64_t alloc;      /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};

The meanings of each field are as follows:

  • len: records the number of bytes currently in use (excluding ‘\0’), so getting the length of an SDS is O(1)
  • alloc: records the total number of bytes allocated for the byte array (excluding ‘\0’)
  • flags: identifies the header type of the byte array, i.e. whether it is an sdshdr8, sdshdr16, and so on
  • buf: the byte array holding the string, including the terminating ‘\0’

Using different types of len and alloc for different string lengths is a memory-saving Redis optimization.
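As a sketch, this header selection works like Redis’s sdsReqType: pick the smallest header whose len/alloc fields can represent the string length. The constants come from the Redis source; the function body here is a simplified reimplementation:

```c
#include <stddef.h>

/* Simplified sketch of sdsReqType: choose the smallest SDS header type
 * whose len/alloc fields can hold the required length. */
#define SDS_TYPE_5  0
#define SDS_TYPE_8  1
#define SDS_TYPE_16 2
#define SDS_TYPE_32 3
#define SDS_TYPE_64 4

static char sds_req_type(size_t string_size) {
    if (string_size < 1 << 5)    return SDS_TYPE_5;   /* fits in 5 bits  */
    if (string_size < 1 << 8)    return SDS_TYPE_8;   /* fits in 8 bits  */
    if (string_size < 1 << 16)   return SDS_TYPE_16;  /* fits in 16 bits */
    if (string_size < 1ll << 32) return SDS_TYPE_32;  /* fits in 32 bits */
    return SDS_TYPE_64;
}
```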

Capacity expansion

If the new string length is less than 1 MB, the new allocation is twice the new length. If the new length is greater than 1 MB, only an extra 1 MB is allocated each time.

sds sdsMakeRoomFor(sds s, size_t addlen) {
    void *sh, *newsh;
    size_t avail = sdsavail(s); /* remaining free space, i.e. s.alloc - s.len */
    size_t len, newlen;
    char type, oldtype = s[-1] & SDS_TYPE_MASK;
    int hdrlen;

    /* Return as-is if there is already enough space. */
    if (avail >= addlen) return s;

    len = sdslen(s);                     /* current string length */
    sh = (char*)s - sdsHdrSize(oldtype); /* start of the sdshdr */
    newlen = (len + addlen);             /* required length */
    if (newlen < SDS_MAX_PREALLOC)
        newlen *= 2;
    else
        newlen += SDS_MAX_PREALLOC;

    type = sdsReqType(newlen); /* smallest SDS type that fits the new length */
    /* SDS_TYPE_5 cannot record free space, so upgrade it to SDS_TYPE_8. */
    if (type == SDS_TYPE_5) type = SDS_TYPE_8;

    hdrlen = sdsHdrSize(type); /* size of the new sdshdr */
    if (oldtype == type) {
        newsh = s_realloc(sh, hdrlen + newlen + 1);
        if (newsh == NULL) return NULL;
        s = (char*)newsh + hdrlen;
    } else {
        /* The header size changed: allocate new memory, copy the original
         * string, free the old SDS, then set flags and len. */
        newsh = s_malloc(hdrlen + newlen + 1);
        if (newsh == NULL) return NULL;
        memcpy((char*)newsh + hdrlen, s, len + 1);
        s_free(sh);
        s = (char*)newsh + hdrlen;
        s[-1] = type;
        sdssetlen(s, len);
    }
    sdssetalloc(s, newlen);
    return s;
}

Shrinking and lazy space release

Operations that shrink the string only update the len field; the memory itself is not freed. For example, sdsclear:

/* Set len to 0, but do not free the buffer. */
void sdsclear(sds s) {
    sdssetlen(s, 0);
    s[0] = '\0';
}

Of course, if necessary, you can also call sdsRemoveFreeSpace to actively release the unused space.

C strings compared with SDS

  • The structure stores the string length, which can be read in constant time, while a native C string must be traversed to find it.
  • Binary safety: SDS can store arbitrary byte data because the length is recorded, so an embedded \0 byte does not terminate the string early.
  • No buffer overflows: C strings do not record the array size, so append operations with insufficient memory easily overflow; SDS records alloc and checks whether the space is sufficient before modifying.
  • Memory pre-allocation and lazy release reduce the number of memory reallocations.
  • SDS remains compatible with some of the string functions in the C standard library.

list

On a 64-bit system, the prev and next pointers of a doubly linked list take 16 bytes per node, a relatively high space overhead; each node is also allocated separately, which causes memory fragmentation. For these reasons, list objects are stored with a quicklist at the bottom layer (since Redis 3.2). A quicklist is a linked list of ziplists: each ziplist stores multiple list elements contiguously, one element per entry, so a quicklist greatly reduces the pointer overhead compared with a plain linked-list implementation.
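A simplified sketch of a quicklist node might look like this. Field names follow Redis’s quicklist.h, but several bookkeeping fields are omitted:

```c
/* Simplified sketch of a quicklist node; several fields (fill, encoding,
 * recompress, etc.) are omitted here. */
typedef struct quicklistNode {
    struct quicklistNode *prev;  /* 8 bytes on a 64-bit system */
    struct quicklistNode *next;  /* 8 bytes on a 64-bit system */
    unsigned char *zl;           /* pointer to this node's ziplist */
    unsigned int sz;             /* total bytes the ziplist occupies */
    unsigned int count;          /* number of entries in the ziplist */
} quicklistNode;

/* Tiny self-check: two nodes linked into a doubly linked list. */
static int link_ok(void) {
    quicklistNode a = {0}, b = {0};
    a.next = &b;
    b.prev = &a;
    return a.next->prev == &a && b.prev->next == &b;
}
```

Because each node aggregates many elements in one ziplist, the per-element share of the 16 pointer bytes shrinks accordingly.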

ziplist

Ziplist’s data structure:

struct ziplist<T> {
    int32 zlbytes;       /* total bytes occupied by the whole ziplist */
    int32 zltail_offset; /* offset of the last entry, used to quickly locate the last node */
    int16 zllength;      /* number of entries */
    T[] entries;         /* entry contents */
    int8 zlend;          /* end marker, 0xFF */
}

The pop operation removes the head or tail entry from the corresponding ziplist; if that leaves the ziplist empty, the quicklist node holding it is deleted as well.

Quicklist supports inserting not only at the head or tail but also at any specified position: quicklistInsertAfter and quicklistInsertBefore insert data items after and before the specified position, respectively.

  • If the ziplist at the insertion position does not exceed the size limit, simply insert into it directly.
  • If that ziplist exceeds the limit, but the element is being inserted at one of its ends and the adjacent quicklist node’s ziplist does not exceed the limit, insert into the adjacent node’s ziplist instead.
  • If the ziplist at the insertion position exceeds the limit and the adjacent quicklist nodes’ ziplists also exceed the limit, create a new quicklist node for the insert.
  • In the remaining case, where an oversized ziplist must take an insert in its middle, split the current ziplist into two nodes and then insert on one of them.

Each entry is composed of the following fields:

prevlen: the total number of bytes occupied by the previous entry. This field lets the ziplist be traversed from back to front (from one entry’s position, move prevlen bytes backward to find the previous one). It is variable-length encoded, taking either 1 byte or 5 bytes:

  • If the previous entry occupies fewer than 254 bytes, prevlen uses a single byte whose value is that length.
  • If the previous entry occupies 254 bytes or more, prevlen uses five bytes: the first byte is 254 (acting as a marker) and the next four bytes form an integer holding the actual length. (255 is not used because it is already defined as the ziplist end marker.)
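The two prevlen layouts can be sketched as an encode/decode pair. The function names here are hypothetical, not from the Redis source:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of prevlen encoding: lengths < 254 fit in one byte; otherwise a
 * 254 marker byte is followed by a 4-byte length (255 is the end marker). */
static size_t prevlen_encode(uint32_t prev_entry_len, unsigned char *buf) {
    if (prev_entry_len < 254) {
        buf[0] = (unsigned char)prev_entry_len;
        return 1;
    }
    buf[0] = 254;
    memcpy(buf + 1, &prev_entry_len, sizeof(prev_entry_len));
    return 5;
}

static uint32_t prevlen_decode(const unsigned char *buf) {
    if (buf[0] < 254) return buf[0];
    uint32_t len;
    memcpy(&len, buf + 1, sizeof(len));
    return len;
}

/* Encode then decode a length, to check the two layouts round-trip. */
static uint32_t prevlen_roundtrip(uint32_t len) {
    unsigned char buf[5];
    prevlen_encode(len, buf);
    return prevlen_decode(buf);
}
```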

encoding: there are 9 encodings:

  • |00pppppp| – 1 byte. A string value; the remaining 6 bits give the length, up to 63 (2^6 - 1).
  • |01pppppp|qqqqqqqq| – 2 bytes. A string value; the remaining 14 bits give the length, up to 16383 (2^14 - 1).
  • |10000000|qqqqqqqq|rrrrrrrr|ssssssss|tttttttt| – 5 bytes. The first byte is 10000000 and the remaining four bytes give the length.
  • |11000000| – 3 bytes. The encoding byte has the value 0xC0 and is followed by a two-byte int16_t.
  • |11010000| – 5 bytes. The encoding byte has the value 0xD0 and is followed by a four-byte int32_t.
  • |11100000| – 9 bytes. The encoding byte has the value 0xE0 and is followed by an eight-byte int64_t.
  • |11110000| – 4 bytes. The encoding byte has the value 0xF0 and is followed by a three-byte integer.
  • |11111110| – 2 bytes. The encoding byte has the value 0xFE and is followed by a one-byte integer.
  • |1111xxxx| – xxxx between 0001 and 1101. A special case: the 13 possible values of xxxx represent the actual data (the integers 0 through 12), not a length. In this case no separate data field is needed; the value is merged into the encoding byte.
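As a rough sketch, the integer encodings above can be recognized from the first byte alone. This is a hypothetical helper, not Redis’s actual decoder:

```c
/* Sketch: classify a ziplist entry by its first encoding byte, returning
 * the number of bytes of integer payload that follow, or -1 for string
 * encodings (00/01/10 prefixes) and the 0xFF end marker. */
static int ziplist_int_payload_bytes(unsigned char enc) {
    if ((enc & 0xC0) != 0xC0) return -1;      /* 00/01/10 prefixes: strings */
    if (enc == 0xC0) return 2;                /* int16 */
    if (enc == 0xD0) return 4;                /* int32 */
    if (enc == 0xE0) return 8;                /* int64 */
    if (enc == 0xF0) return 3;                /* 24-bit integer */
    if (enc == 0xFE) return 1;                /* 8-bit integer */
    if (enc >= 0xF1 && enc <= 0xFD) return 0; /* immediate 0..12, no payload */
    return -1;                                /* 0xFF is the end marker */
}
```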

hash

When the number of elements is small and each element is short, a hash object is stored as a ziplist, with each field and its value held in two adjacent entries.

When there are many elements, or elements are long, the hash object is converted to a dict. The thresholds for element count and length are configurable via the hash-max-ziplist-entries and hash-max-ziplist-value parameters, whose defaults are 512 and 64 respectively.
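The conversion condition can be sketched as follows, using the default thresholds. The function name is hypothetical; in Redis the check runs as commands modify the hash:

```c
#include <stddef.h>

/* Sketch of the ziplist -> dict conversion check for hash objects.
 * The macros mirror the hash-max-ziplist-entries / -value defaults. */
#define HASH_MAX_ZIPLIST_ENTRIES 512
#define HASH_MAX_ZIPLIST_VALUE   64

static int hash_needs_dict(size_t n_fields, size_t longest_elem) {
    return n_fields > HASH_MAX_ZIPLIST_ENTRIES ||
           longest_elem > HASH_MAX_ZIPLIST_VALUE;
}
```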

dict

A dict consists of the following items:

  • A type pointer to a dictType structure. It lets dict keys and values store any type of data in a caller-defined way.
  • A private data pointer (privdata), passed in by the caller when the dict is created.
  • Two hash tables (ht[2]). Both ht[0] and ht[1] are valid only during rehashing; normally only ht[0] is used and ht[1] holds no data.
  • The current rehash index (rehashidx). If rehashidx == -1, no rehash is in progress; otherwise a rehash is underway and the value records its progress.

The dictType structure contains several function pointers that let dict callers customize the operations involving keys and values:

  • hashFunction, the hash algorithm applied to keys.
  • keyDup and valDup, copy functions for keys and values, used to make deep copies when needed rather than just passing object pointers.
  • keyCompare, which defines how two keys are compared, used when looking up by key.
  • keyDestructor and valDestructor, destructors for keys and values respectively.

The privdata pointer is passed back to the caller when certain dictType functions are invoked.

The dictht structure deserves a closer look. It defines a hash table, consisting of the following fields:

  • An array of dictEntry pointers (table). The hash of a key ultimately maps to a position in this array (a bucket). If multiple keys map to the same position, a conflict occurs and a dictEntry linked list is chained from the bucket.
  • size: the length of the dictEntry pointer array. It is always a power of 2.
  • sizemask: the mask used to map hash values to table positions. Its value is size - 1 (e.g. 7, 15, 31, 63), whose binary representation is all 1s. Each key is hashed with hashFunction, then (hash & sizemask) gives its position in the table, which is equivalent to computing the remainder (hash % size).
  • used: the number of entries in the dict. Its ratio to size is the load factor.
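Because size is always a power of 2, masking and modulo agree; a minimal sketch of the bucket-index computation:

```c
#include <stdint.h>

/* Since size is a power of 2, hash & (size - 1) equals hash % size. */
static uint64_t dict_bucket_index(uint64_t hash, uint64_t size) {
    uint64_t sizemask = size - 1;  /* e.g. size 16 -> mask 0b1111 */
    return hash & sizemask;
}
```

The AND form avoids an integer division on every lookup, which is the design motivation for the power-of-2 size.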

Expansion and reduction of Dictht

The hash table in Redis decides whether to expand or shrink based on the load factor, computed as load_factor = ht[0].used / ht[0].size.

Expansion

Expansion considers whether the server is currently running BGSAVE or BGREWRITEAOF. If one of them is running, expansion is avoided and only triggered once load_factor reaches 5; otherwise expansion starts when load_factor reaches 1. The reason is that bgsave or bgrewriteaof forks the process: memory pages are shared with the parent as long as nothing changes, and a page is copied only when its memory is modified (copy-on-write). Modifying memory therefore not only consumes CPU to copy pages but also consumes extra memory. Since expansion rehashes the old table’s elements into a new table, changing a lot of memory, it is avoided while these operations run.

To make rehashing convenient, the array is expanded to a power of 2 (the first power of 2 greater than or equal to twice the number of elements).
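Modeled on Redis’s _dictNextPower (initial size 4, doubling until the requested size fits); expansion requests 2 * used as the target size:

```c
/* Modeled on _dictNextPower: the new table size is the smallest power of
 * two >= the requested size, starting from DICT_HT_INITIAL_SIZE (4). */
static unsigned long dict_next_power(unsigned long size) {
    unsigned long i = 4;
    while (i < size) i *= 2;
    return i;
}
```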

Shrinking

Shrinking does not need to consider BGSAVE or BGREWRITEAOF; it is triggered when the load factor falls below 0.1. Shrinking also requires a rehash, reducing the array to a power of 2 (the first power of 2 greater than or equal to the number of elements in the hash table). The new index is still computed with hash(key) & sizemask.

Progressive rehash

If a dictionary has too many elements, rehashing in one go could block the server for a long time. Redis therefore adopts a progressive rehash strategy, migrating buckets during lookup, insert, and delete operations. If the dict is not being accessed, Redis also rehashes it in a scheduled task.

Progressive rehash advances at least n steps at a time (unless the whole rehash finishes in fewer). Each step moves every dictEntry in one bucket of ht[0] (i.e. one dictEntry linked list) to ht[1], recomputing each entry’s position from ht[1]’s sizemask. rehashidx records the first bucket of ht[0] that has not yet been migrated.

If a rehash step lands on a bucket with no dictEntry, there is nothing to migrate, so it scans forward through the ht[0].table array for the next non-empty bucket. This scan is bounded: after visiting at most n*10 empty buckets, the rehash step ends for now.

Finally, when all data in ht[0] has been migrated to ht[1] (i.e. d->ht[0].used == 0), the whole rehash is finished: ht[0] takes over the contents of ht[1], and ht[1] is reset to empty.
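The migration of one bucket can be sketched with a toy model. All names and structures here are simplified and hypothetical; the real dictRehash additionally bounds the empty-bucket scan by n*10:

```c
#include <stddef.h>

/* Toy model of one progressive-rehash step: move every entry in the
 * bucket at rehashidx from table 0 to table 1. */
typedef struct entry { unsigned long hash; struct entry *next; } entry;
typedef struct {
    entry **table[2];
    unsigned long size[2], used[2];
    long rehashidx;                  /* -1 when not rehashing */
} toydict;

static void rehash_step(toydict *d) {
    if (d->rehashidx < 0) return;
    /* Skip empty buckets (Redis bounds this scan). */
    while ((unsigned long)d->rehashidx < d->size[0] &&
           d->table[0][d->rehashidx] == NULL)
        d->rehashidx++;
    if ((unsigned long)d->rehashidx >= d->size[0]) { d->rehashidx = -1; return; }
    entry *e = d->table[0][d->rehashidx];
    while (e) {
        entry *next = e->next;
        unsigned long idx = e->hash & (d->size[1] - 1); /* new sizemask */
        e->next = d->table[1][idx];   /* re-link into the new table */
        d->table[1][idx] = e;
        d->used[0]--; d->used[1]++;
        e = next;
    }
    d->table[0][d->rehashidx++] = NULL;
    if (d->used[0] == 0) d->rehashidx = -1;  /* everything migrated */
}

/* Build a 2-entry dict (size 4 -> 8) and rehash it step by step. */
static unsigned long demo_migrate(void) {
    entry e1 = {5, NULL}, e2 = {13, NULL};
    entry *t0[4] = {0}, *t1[8] = {0};
    toydict d = {{t0, t1}, {4, 8}, {2, 0}, 0};
    e1.next = &e2;
    t0[5 & 3] = &e1;                 /* both entries hash to bucket 1 */
    while (d.rehashidx != -1) rehash_step(&d);
    return d.used[1];
}
```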

set

When the set’s elements are all integers that fit in a long and there are no more than 512 of them (configurable via set-max-intset-entries), the set is stored as an intset, in which elements are kept in ascending order. Otherwise, the set is stored as a dict, where each key is a set element and the corresponding value is null.
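Since the intset keeps its elements sorted, membership can be tested with binary search; a minimal sketch (not Redis’s actual intsetSearch, which also handles multiple integer widths):

```c
/* Binary search over a sorted array of longs, as an intset lookup sketch. */
static int intset_contains(const long *elems, int len, long value) {
    int lo = 0, hi = len - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (elems[mid] == value) return 1;
        if (elems[mid] < value) lo = mid + 1;
        else hi = mid - 1;
    }
    return 0;
}
```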

zset

By default, when a zset has no more than 128 elements (controlled by zset-max-ziplist-entries) and every element is no longer than 64 bytes (controlled by zset-max-ziplist-value), the zset is stored as a ziplist, with each pair of adjacent entries holding a value and its score. If either condition is violated, the zset is stored as a hashtable plus a skiplist: the hashtable maps each value to its score, and the skiplist stores the values in score order.

Zsets use hashtable + skiplist to combine the advantages of both structures. First, a zset is sorted by score, and the skiplist’s ordering fits this requirement. Second, the hashtable keeps the value-to-score mapping, so a score can be fetched by value in O(1) time; without it, the score would have to be found by searching the skiplist in O(log n) time. Although both structures are used to store the zset, they share the same value and score objects, so no space is wasted.

Why does zset use a skiplist for ordering instead of a red-black tree? I see two main reasons:

  • A skiplist is simpler to implement and easier to maintain than a red-black tree;
  • Skiplist range queries are more efficient: the start of a range can be located in O(log n) time and the range then traversed sequentially.

For more information about skiplists, see: Redis Internal Data Structure Details (6) — Skiplist

Reference links

  • Wzmmmmj.com/2020/07/12/…
  • www.wmyskxz.com/2020/02/28/…
  • www.jianshu.com/p/98f3d4ae3…
  • www.jianshu.com/p/fa75ee891…
  • Zhangtielei.com/posts/blog-…
  • Zhangtielei.com/posts/blog-…
  • Zhangtielei.com/posts/blog-…
  • Zhangtielei.com/posts/blog-…