preface

We looked at the underlying data structure implementations of String, List, and Hash in the previous article, so let’s take a look at the underlying data structure implementations of Set and ZSet. Cut the crap and get to work.

1. Set the set

1.1 Low-level implementation of SET

Let’s take a look at the underlying data structure of set from the Redis client:

When the number of sets corresponding to our key is small, the underlying data structure is intSet. When the number of sets corresponding to our key exceeds a certain value, the underlying data structure becomes HashTable. We also said that HashTable is the underlying implementation of dictionaries, so we use dictionaries instead of HashTable.

In the diagram above, I only have 10 values for keys. The underlying data structure is still intset, but when I add another string, the underlying data structure changes from intset to dict.

An intset can only store numbers of the type int. The largest integer that an intset can represent is -2^64 to 2^64-1. If an integer is added beyond this range, an Intset will become a dict. So we can sum up as follows:

  1. The value of the key in the set is a number, the number ranges from -2^64 to 2^64-1, and the number of elements added to the set does not exceed the specified valueset-max-intset-entriesThe underlying data structure of Set key Set is INTSet.
  2. In other cases, the underlying data structure is dict.

1.2 Integer set Intset

We know that the underlying implementation of Redis’s Set key has two data structures: dict and integer Set intSet. We talked about dict dict implementations in the previous article, so let’s focus on intset.

The data structure of intset is defined as follows (code from intset.h) :

typedef struct intset {
    uint32_t encoding;
    uint32_t length;
    int8_t contents[];
} intset;

Copy the code
  • Encoding: Data encoding format, indicating that each element in intset is stored in bytes.

    1. INTSET_ENC_INT16 indicates that each element is stored in 2 bytes.
    2. INTSET_ENC_INT32 indicates that each element is stored in 4 bytes
    3. INTSET_ENC_INT64 indicates that each element is stored in 8 bytes

    Therefore, the integer stored in intset can occupy at most 64bit.

  • Length: indicates the number of elements in intset.

  • Contents: This is a dynamic array that actually stores data. The total length of this array is equal to Encoding * length. It is important to note that the encoding format of intSet is adjusted as new elements are added, and the data encoding is upgraded depending on the size of the element.

Intset memory is continuous, all data add, delete, change and check operations are based on memory address offset, and the integer storage is also in order.

1.2 Why design intSet of integers

The main reason to use intset storage over dict is to save memory. Especially when a key is stored with a small number of elements, dict requires much more memory overhead than dict because dictionaries contain two hash tables, linked list Pointers, and a lot of other metadata. So, when storing a large number of small collections whose elements are all numbers, intset can save a considerable amount of memory.

Dict query time is O(1), which is faster on average than intset query time. Dict is better for Redis, which pursues high performance. I want to remind you that intset is a sorted set, and we can use dichotomy to find the data, which is order logn, and because we’re using intset, we don’t have a lot of elements, so it doesn’t matter.

2. Ordered set zset

2.1 Low-level implementation of Zset

Ordered set on the basis of the set extension out of ordered set, zset implementation of the underlying data structure has two kinds: ziplist compressed list and skip table Zskiplist.

When zset uses Ziplist as the underlying implementation, the following two conditions must be met:

  • An ordered set holds less than 128 elements
  • Ordered collections hold all elements less than 64 bytes in length

We have analyzed the ziplist data structure before, so we won’t repeat it here. If you forget, you can read an article. If the underlying zset data structure uses Ziplist, each element is stored with two compressed list nodes next to each other, the first node holding the element’s members and the second node holding the element’s score. And they are arranged in the order of highest score. The diagram below:

2.2 skiplist jump table

2.2.1 What is a skip table

A skip list, full name for a skip list, allows you to quickly query, insert, and delete a linked list of ordered contiguous elements. The average search and insert time complexity of jump lists is O(logn).

Jump table is on the basis of singly linked lists to build, we know the singly linked list data structure advantage lies in the insertion and deletion time complexity is O (1), but singly linked lists of the query efficiency is very low, the time complexity is O (n), even if data is ordered, we also need to traverse from the start to get we want to obtain the list of nodes. Is there any way to improve the efficiency of singly linked list queries? We know that the time of an array query is O(1), because arrays can be accessed directly by subscripts, which we can also think of as indexes. Is it also possible to add indexes to a single linked list to improve the efficiency of data query?

Is there any way to improve the efficiency of linked list queries? We can learn from the way of database, by adding indexes or directories to improve the efficiency of linked list query. On the basis of the original linked list, we can extract one node from every two nodes to build an index. We call the extracted nodes index layer or index

Suppose we add a pointer to each of the two adjacent nodes, so that the pointer points to the next node, as shown in the figure below:

When we set up the pointer this way, when we visit the red node again, we only need to go through three nodes directly, which saves us half the time compared to before.

In the same way, we can create a third chain by adding a pointer to each of the two adjacent nodes on the newly generated list at the upper level

If we also look for the red nodes, we only need to access two of them. As you can imagine, when the list is long enough, this kind of multi-layer list lookup allows us to skip many nodes and greatly speed up the lookup. In this way, the number of nodes in each layer of the linked list is half that of the nodes in the next layer, so that the search process is very similar to a binary search, so that the search time can be reduced to order (log n). But this way will make the insert data when there will be a big problem, when the new insert or delete nodes will disturb linked list node number than, if you want to maintain this correspondence, you must put behind all of the newly inserted node (including the newly inserted node) to adjust, so let time complexity O (n) again.

To avoid this problem, skiplist does not require a strict relationship between the number of nodes in the next two linked lists. It uses randomization techniques to determine which nodes in the list should add forward Pointers and how many Pointers should be added to that node, which is what we understand as the number of layers. The head node of the hop table structure needs to have enough pointer fields (the maximum number of layers) to satisfy the need for constructing the largest possible series, while the tail node does not need pointer fields. The diagram below:

Because the number of layers of each node is randomly selected, adding or deleting a node will not affect the number of layers of other nodes. Therefore, the insertion operation only needs to modify the pointer before and after the insertion node. The same is true for deleting nodes. You only need to adjust the Pointers before and after deleting nodes, which reduces the operation complexity of inserting and deleting nodes.

A skip table is called a skip table because some nodes can be skipped by Pointers, and the larger the number of layers, the more nodes are skipped. This allows us to search for data in the upper echelon of the list, and then move down to the first level of the list to precisely locate the data.

2.2.2 Comparison of hop table, hash table and binary balance tree

  • Hash tables are not ordered, and hash tables are only good for individual data lookups, not range lookups.
  • Although the elements of binary balanced trees (AVL books, red-black trees, etc.) are ordered, the operation of a range query balanced tree is much more complex than a hop table.
  • Balancing the insertion and deletion of elements in a tree can cause tree readjusting, and the logic is complex compared to that of a skip table. Skiplist inserts and deletes are simple and fast, requiring only pointer changes to adjacent nodes.
  • Skiplist is much easier to implement than a balanced tree.

2.2.3 Implementation of Skiplist in Redis

Skiplist is just one of the data structures exposed by Zset. The underlying zset uses not only Skiplist but also dict.

Skiplist is defined as Redis/SRC /server.h


#define ZSKIPLIST_MAXLEVEL 32
#define ZSKIPLIST_P 0.25

typedef struct zskiplistNode {
    sds ele;      
    double score; 
    struct zskiplistNode *backward;
    struct zskiplistLevel {
        struct zskiplistNode *forward;
        unsigned long span;
    } level[];
} zskiplistNode;

typedef struct zskiplist {
    struct zskiplistNode *header, *tail;
    unsigned long length;
    int level;
} zskiplist;
Copy the code

Analyze the above source code:

  • Zskiplist defines the data structure of the skip table, which contains:
    1. The head Pointers head and tail Pointers tail
    2. Length represents the length of the linked list, that is, the number of nodes in the list.
    3. Level represents the total number of skiplist layers, which is the maximum number of layers on all nodes
  • ZskiplistNode defines the contents of each linked list node.
    1. sds ele: is the real data field that stores data, using the dynamic string SDS described earlier to store data.
    2. Double score: the number of points by which sorts are performed.
    3. zskiplistNode *backward: a pointer to a precursor node of the current node.
    4. zskiplistLevel: Level of each node. The maximum value is 32, defined by ZSKIPLIST_MAXLEVEL
      • zskiplistNode *forward: a pointer to the next node at a certain level.
      • zskiplistNode *forward: span, the span between one layer and the next node

The following diagram shows the possible structure of Skiplist in Redis:

2.2.4 Low-level implementation of sorted set in Redis

Previously, we just said that Redis implements skiplist, but the underlying implementation of Redis implements sorted set using dict+skiplist instead of just using skiplist. The definition of Redis for zset is as follows:

typedef struct zset {
    dict *dict;
    zskiplist *zsl;
} zset;
Copy the code

Why did Redis design this way? It needed an extra dict. Skiplist does range queries more efficiently, but it’s not as efficient as dict for single-value lookups, so Redis uses dict data structures to implement fast single-finger lookups. For example, if we search for zscore key members of a key, if we use a skip table to search, we need to traverse each member in the linked list one by one until the member you need to find is matched, so the query efficiency is very low.

So Redis uses an additional dict here to store members and points. Dict keys are members and values are points. Skiplist stores each linked list node in order of points and members. The structure of the linked list node may be the structure shown in the figure above. However, in the sorted set data structure, only the zscore command uses dict, and the rest commands use the Zskiplist data structure to provide operations.

This is what zset data structures look like:

3. Redis data storage analysis

3.1 Redis storage structure

Redis is a typical high performance key-value database, and it supports rich data types. The Key can only be string type, and the Value supports the following data types: String, List, Hash, Set Set, sorted Set Sort Set, how does Redis support multiple data structures? From the point of view of the internal implementation of Redis, in fact Redis is equivalent to an object system. Redis encapsulates every data type into a redisObject object, and each object is realized by at least one of the data structures mentioned before. We are through Redis source code, look at Redis is how to store data at the bottom, we have a simple understanding of the overall situation of Redis through the following figure.

A redisServer have multiple redisDb redisDb equivalent to the concept of library in Mysql, Redis default in a server has 16 redisDb, can through the Redis configuration file to modify the parameters, in which each redisDb dict is a dictionary, Each dict stores key-value data, and each key and value in Redis is composed of redisObject. Each redisObject corresponds to the five basic data types provided by Redis, which will be discussed in detail below.

Here I believe how much of Redis data storage structure has a certain understanding, next we look at the details of the implementation of Redis through the source code, but also a kind of verification of the content we said before, this is not what I said casually, this is written in Redis source code.

RedisDb: redisDb: redisDb: redisDb: redis/ SRC /server.h Here will not say the code of redisServer, if you are interested in it, please click here: redisServer source code

typedef struct redisDb {
    // Key-value pair dictionary, store all key-value pairs in the database
    dict *dict;                 /* The keyspace for this DB */
    // The expiration dictionary holds the keys that are set to expire and the expiration time of the keys
    dict *expires;              /* Timeout of keys with a timeout set */
     // Holds all keys that block the client
    dict *blocking_keys;        /* Keys with clients waiting for data (BLPOP)*/
     // Hold the blocked key with value NULL
    dict *ready_keys;           /* Blocked keys that received a PUSH */
    // The transaction module saves the keys monitored by the WATCH command
    dict *watched_keys;         /* WATCHED keys for MULTI/EXEC CAS */
    // Database ID
    int id;                     /* Database ID */
    // The average expiration time of a key
    long long avg_ttl;          /* Average TTL, just for stats */
    unsigned long expires_cursor; /* Cursor of the active expire cycle. */
    list *defrag_later;         /* List of key names to attempt to defrag one by one, gradually. */
} redisDb;
Copy the code

We found that there are five dicts in redisDb, of which dict *dict and dict * Expires are closely related to the user. The first dictionary is used to store all key-value pair data, and the second one stores the key whose expiration date is set in the dictionary table. Key is our key and value is a timestamp.

3.2 redisObject analysis

Redis/SRC /server.h 606: redisObject: redisObject: redisObject: redisObject: redisObject: redisObject: redisObject: redisObject: redisObject: redisObject

typedef struct redisObject {
   
    unsigned type:4;
    unsigned encoding:4; 
 
    unsigned lru:LRU_BITS; /* lru time (relative to server.lruclock) */
    // Reference count
    int refcount;
    // A pointer to the underlying data implementation
    void *ptr;
} robj;
Copy the code

RedisObject The meanings of each parameter:

  1. unsigned type:4: data type of the object, which is 4bits:
    • OBJ_STRING: string object
    • OBJ_LIST: list object
    • OBJ_SET: collection object
    • OBJ_ZSET: Ordered collection object
    • OBJ_HASH: Hash object
  2. unsigned encoding:4: specifies the code of an object. The value is 4bits and contains 10 types
    • OBJ_ENCODING_RAW: Raw representation, string objects are simple dynamic strings
    • OBJ_ENCODING_INT: integer of type long
    • OBJ_ENCODING_HT: dictionary
    • OBJ_ENCODING_ZIPMAP: deprecated and not used
    • OBJ_ENCODING_LINKEDLIST: double-ended linked list
    • OBJ_ENCODING_ZIPLIST: compressed list
    • OBJ_ENCODING_INTSET: set of integers
    • OBJ_ENCODING_SKIPLIST: Skip lists and dictionaries, corresponding to the zset we analyzed earlier.
    • OBJ_ENCODING_EMBSTR: simple dynamic string encoded by embstr
    • OBJ_ENCODING_QUICKLIST: quick list
  3. unsigned lru:LRU_BITSUsed for LRU algorithm to calculate LRU time relative to Server. lRUClock
  4. int refcount: Reference count.
  5. void *ptr: a pointer to the underlying data implementation.

There are two commands in Redis. One is the type command, which returns the data type of the value. I think type should return the unsigned type field in redisObject. There is also another object Encoding command that returns the encoding of the data type, which I think returns the unsigned Encoding field in the redisObject. Look at the source code and guess the same.

char* getObjectTypeName(robj *o) {
    char* type;
    if (o == NULL) {
        type = "none";
    } else {
        switch(o->type) {
        case OBJ_STRING: type = "string"; break;
        case OBJ_LIST: type = "list"; break;
        case OBJ_SET: type = "set"; break;
        case OBJ_ZSET: type = "zset"; break;
        case OBJ_HASH: type = "hash"; break;
        case OBJ_STREAM: type = "stream"; break;
        case OBJ_MODULE: {
            moduleValue *mv = o->ptr;
            type = mv->type->name;
        }; break;
        default: type = "unknown"; break; }}return type;
}

Copy the code

See if you have a general understanding of Redis database here, let’s use the following figure to consolidate:

4. Redis data search source code analysis

Here we have all the data structures in Redis have been over, but also the whole Redis data storage has a certain understanding. Finally, I want to analyze how Redis conducts data search through source code, so that we can have some understanding of the whole data search process of Redis. Here, I take the sorted set zscore command as an example.

First of all, Redis obtains the command sent by the client through the request. If the function is called, the server has already read in a set of command parameters, which are saved in the parameter list. At this point you enter the processCommand method. Source: processCommand() method in server.c

int processCommand(client *c) {
    if(! strcasecmp(c->argv[0]->ptr,"quit")) {
        addReply(c,shared.ok);
        c->flags |= CLIENT_CLOSE_AFTER_REPLY;
        return C_ERR;
    }
  // This encapsulates the command sent by the client. C ->argv[0]-> PTR returns the first argument to the command we entered.
  Argv [0]-> PTR fruits
    c->cmd = c->lastcmd = lookupCommand(c->argv[0]->ptr);
    /** return the corresponding command based on name. Struct redisCommand *lookupCommand(SDS name) {return dictFetchValue(server.commands, name); struct redisCommand *lookupCommand(SDS name) {return dictFetchValue(server.commands, name); } * /
    /**redisCommand specifies the corresponding function of the command. Struct redisCommand redisCommandTable[] = {{"zscore",zscoreCommand,3,"rF",0,NULL,1,1,1,0,0}} */ struct redisCommand redisCommandTable[] = {{"zscore",zscoreCommand,3,"rF",0,NULL,1,1,1,0}} */./** here omitted some code, need to see the source code */.// Do a lot of verification, judge the command
     
    // Execute common commands
     call(c,CMD_CALL_FULL);
}

Copy the code

C -> CMD ->proc(c) Here is the execute command, the final method called here is zscoreCommand (). Source from t_zset.c line 3074

void zscoreCommand(client *c) {
    robj *key = c->argv[1];Argv [1] returns fruits, the second parameter of the command we entered.
    robj *zobj;
    double score;
    // Fetch the ordered collection object with a read operation
    if ((zobj = lookupKeyReadOrReply(c,key,shared.nullbulk)) == NULL ||
        checkType(c,zobj,OBJ_ZSET)) return;
     /** Retrieve the value object of key with read operation, if key does not exist, send reply message, C robj *lookupKeyReadOrReply(client *c, robj *key, robj *reply) {robj *o = lookupKeyRead(c->db, key); // Select value if (! o) addReply(c,reply); return o; } * /
    // The zsetScore method returns the corresponding score from ob objects and members we have obtained.
    if (zsetScore(zobj,c->argv[2],&score) == C_ERR) {
        // Return null on failure
        addReply(c,shared.nullbulk);
    } else {
        addReplyDouble(c,score);    // Send the score}}Copy the code

To the ground we put Redis the whole search process roughly over again, if you want to in-depth understanding of the students, go to the open source look, basically the process is like this.

The last

From the last write Redis data structure analysis of the upper part of the article for almost a month now, I also want to hurry up finish, but between work and personal reasons now, actually write this special effort, itself Redis bottom is the C language to write, and C language and is not very good, I look somewhat difficult, coupled with the drawing, Sometimes it takes an hour to return a map. I hope it is not too late. I hope this article can bring you a little harvest and make you want to have a deep understanding of the underlying data structure and implementation of Redis.

Every time I feel that I write things very carefully, but every time the people who like are not many, can also let my article last recommended ┭┮﹏┭┮. If there is help to you, I hope quality three even once, thank you old iron!