When introducing ziplist, we mentioned that there are two storage structures inside Zset, one is Ziplist and the other is skiplist. To fully understand zset’s internals, let’s revisit Skiplist.

Skiplist is introduced

As the name implies, Skiplist is essentially an ordered multi-dimensional list. Let’s start by reviewing how one-dimensional lists do lookups.

As shown in the figure above, to find an element, we need to traverse from the beginning node until we find the corresponding node or the first node larger than the element we are looking for (not found). Time complexity is O(N).

This lookup is inefficient if we elevate some of the nodes in the list by one level, for example, making one in every two nodes two levels. Then the nodes in the second layer are only half of the nodes in the first layer, and the search efficiency will be improved.

The search process starts at the top of the initial node. When the first node is larger than the specified element, go back to the previous node and continue the search at the next level.

For example, we want to query 16 in the list above.

  • Start at the top of the primary node and go to node 7.
  • The next node of 7 is 39, greater than 16, so we go back to 7
  • If you start at 7 and continue at the next level, you’ll find 16.

There are no fewer nodes to traverse in this example than in a one-dimensional list, but the advantages of this approach come into play when there are more nodes and larger numbers to look for. Again, if we were looking for 39, we would only need to access two nodes (7, 39) to find it. That’s half the number of one-dimensional lists.

To avoid the O(N) time complexity of insert operations, skiplist does not have a strict 2:1 ratio for each layer, but rather a random number of layers for each element to be inserted.

The calculation process of random layers is as follows:

  • Each node has a first layer
  • So the probability that it has a second layer is p, and the probability that it has a third layer is P times p
  • The maximum number of layers cannot be exceeded

The implementation in Redis is

/* Returns a random level for the new skiplist node we are going to create. * The return value of this function is between 1 and ZSKIPLIST_MAXLEVEL * (both inclusive), with a powerlaw-alike distribution where higher * levels are less likely to be returned. */
int zslRandomLevel(void) {
    int level = 1;
    while ((random()&0xFFFF) < (ZSKIPLIST_P * 0xFFFF))
        level += 1;
    return (level<ZSKIPLIST_MAXLEVEL) ? level : ZSKIPLIST_MAXLEVEL;
}
Copy the code

Where ZSKIPLIST_P has a value of 0.25, the probability of the existence of an upper layer is 1/4, which means it is flatter than our example above. The value of ZSKIPLIST_MAXLEVEL is 64, which is the maximum allowed 64 layers.

The Redis skiplist

Skiplist in Redis is an internal storage structure for Zset

/* ZSETs use a specialized version of Skiplists */
typedef struct zskiplistNode {
    sds ele;
    double score;
    struct zskiplistNode *backward;
    struct zskiplistLevel {
        struct zskiplistNode *forward;
        unsigned long span;
    } level[];
} zskiplistNode;

typedef struct zskiplist {
    struct zskiplistNode *header, *tail;
    unsigned long length;
    int level;
} zskiplist;

typedef struct zset {
    dict *dict;
    zskiplist *zsl;
} zset;
Copy the code

You can see that zset consists of a hash and a skiplist.

Skiplist’s structure includes head and tail Pointers, length, and the number of layers in the current jump list.

In a zskiplistNode, the node of the jump list

  • Ele stands for data stored on a node
  • Score of a node
  • A backtrace pointer is a pointer to the previous node at the first level, which means Redis skiplist at the first level is a bidirectional list
  • Pointer level[] for each level of a node, each level corresponds to a pointer forward, and how many nodes span this pointer spans. Span is used to calculate the ranking of elements

Now that you know the structure of Zset and Skiplist, let’s look at the implementation of the basic operations of Zset.

Insertion process

In the zsetAdd function, Redis determines the encoding mode of zset, processing skiplist and ziplist respectively. The Ziplist section was covered earlier, so today we’ll take a look at the Skiplist section.

if (zobj->encoding == OBJ_ENCODING_SKIPLIST) {
    zset *zs = zobj->ptr;
    zskiplistNode *znode;
    dictEntry *de;

    de = dictFind(zs->dict,ele);
    if(de ! =NULL) {
        /* NX? Return, same element already exists. */
        if (nx) {
            *flags |= ZADD_NOP;
            return 1;
        }
        curscore = *(double*)dictGetVal(de);

        /* Prepare the score for the increment if needed. */
        if (incr) {
            score += curscore;
            if (isnan(score)) {
                *flags |= ZADD_NAN;
                return 0;
            }
            if (newscore) *newscore = score;
        }

        /* Remove and re-insert when score changes. */
        if(score ! = curscore) { znode = zslUpdateScore(zs->zsl,curscore,ele,score);/* Note that we did not removed the original element from * the hash table representing the sorted set, so we just * update the score. */
            dictGetVal(de) = &znode->score; /* Update score ptr. */
            *flags |= ZADD_UPDATED;
        }
        return 1;
    } else if(! xx) { ele = sdsdup(ele); znode = zslInsert(zs->zsl,score,ele); serverAssert(dictAdd(zs->dict,ele,&znode->score) == DICT_OK); *flags |= ZADD_ADDED;if (newscore) *newscore = score;
        return 1;
    } else {
        *flags |= ZADD_NOP;
        return 1; }}Copy the code

The first is to find if the corresponding element exists, if there is and there is no parameter NX, the current score of the element is recorded. Here you can see that the hash dictionary in zset is used to get scores based on elements.

Increment command is then used to determine whether the increment command is executed. If so, add the specified score to the current score to get a newscore, newscore. If the score changes, the zslUpdateScore function is called to update the nodes in skiplist, plus one more step to update the score in the Hash dictionary.

If the element to be inserted does not exist, the zslInsert function is called directly.

zskiplistNode *zslInsert(zskiplist *zsl, double score, sds ele) {
    zskiplistNode *update[ZSKIPLIST_MAXLEVEL], *x;
    unsigned int rank[ZSKIPLIST_MAXLEVEL];
    inti, level; serverAssert(! isnan(score)); x = zsl->header;for (i = zsl->level- 1; i >= 0; i--) {
        /* store rank that is crossed to reach the insert position */
        rank[i] = i == (zsl->level- 1)?0 : rank[i+1];
        while (x->level[i].forward &&
                (x->level[i].forward->score < score ||
                    (x->level[i].forward->score == score &&
                    sdscmp(x->level[i].forward->ele,ele) < 0)))
        {
            rank[i] += x->level[i].span;
            x = x->level[i].forward;
        }
        update[i] = x;
    }
    /* we assume the element is not already inside, since we allow duplicated * scores, reinserting the same element should never happen since the * caller of zslInsert() should test in the hash table if the element is * already inside or not. */
    level = zslRandomLevel();
    if (level > zsl->level) {
        for (i = zsl->level; i < level; i++) {
            rank[i] = 0;
            update[i] = zsl->header;
            update[i]->level[i].span = zsl->length;
        }
        zsl->level = level;
    }
    x = zslCreateNode(level,score,ele);
    for (i = 0; i < level; i++) {
        x->level[i].forward = update[i]->level[i].forward;
        update[i]->level[i].forward = x;

        /* update span covered by update[i] as x is inserted here */
        x->level[i].span = update[i]->level[i].span - (rank[0] - rank[i]);
        update[i]->level[i].span = (rank[0] - rank[i]) + 1;
    }

    /* increment span for untouched levels */
    for (i = level; i < zsl->level; i++) {
        update[i]->level[i].span++;
    }

    x->backward = (update[0] == zsl->header) ? NULL : update[0];
    if (x->level[0].forward)
        x->level[0].forward->backward = x;
    else
        zsl->tail = x;
    zsl->length++;
    return x;
}
Copy the code

The function begins by defining two arrays, the update array to store the search path and the rank array to store the node span.

The first step is to find the search path for the node to be inserted and record the number of node spans.

And then I’m going to insert a random number of layers. If the number of randomly generated layers is greater than the current number, you need to continue populating the UPDATE and Rank arrays and update the maximum number of skiplist layers.

The zslCreateNode function is then called to create a new node.

Once the nodes are created, they are inserted layer by layer (updating Pointers) and span values for other nodes, starting at the first level, based on the location provided by the search path data.

Finally, update the traceback node and increment the skiplist length by one.

That’s the whole process of inserting a new element.

The update process

Now that we know about the insertion process, let’s go back to the update process

zskiplistNode *zslUpdateScore(zskiplist *zsl, double curscore, sds ele, double newscore) {
    zskiplistNode *update[ZSKIPLIST_MAXLEVEL], *x;
    int i;

    /* We need to seek to element to update to start: this is useful anyway, * we'll have to update or remove it. */
    x = zsl->header;
    for (i = zsl->level- 1; i >= 0; i--) {
        while (x->level[i].forward &&
                (x->level[i].forward->score < curscore ||
                    (x->level[i].forward->score == curscore &&
                     sdscmp(x->level[i].forward->ele,ele) < 0)))
        {
            x = x->level[i].forward;
        }
        update[i] = x;
    }

    /* Jump to our element: note that this function assumes that the * element with the matching score exists. */
    x = x->level[0].forward;
    serverAssert(x && curscore == x->score && sdscmp(x->ele,ele) == 0);

    /* If the node, after the score update, would be still exactly * at the same position, we can just update the score without * actually removing and re-inserting the element in the skiplist. */
    if ((x->backward == NULL || x->backward->score < newscore) &&
        (x->level[0].forward == NULL || x->level[0].forward->score > newscore))
    {
        x->score = newscore;
        return x;
    }

    /* No way to reuse the old node: we need to remove and insert a new * one at a different place. */
    zslDeleteNode(zsl, x, update);
    zskiplistNode *newnode = zslInsert(zsl,newscore,x->ele);
    /* We reused the old node x->ele SDS string, free the node now * since zslInsert created a new one. */
    x->ele = NULL;
    zslFreeNode(x);
    return newnode;
}
Copy the code

As with the insert process, the search path is saved first. Locate the node to be updated. If the position of the node remains unchanged after the update, the node is directly returned. Otherwise, the zslDeleteNode function is called to delete the node and then insert a new one.

Removal process

The skiplist update process in Redis is fairly straightforward: delete before insert, so let’s take a look at how it removes nodes.

void zslDeleteNode(zskiplist *zsl, zskiplistNode *x, zskiplistNode **update) {
    int i;
    for (i = 0; i < zsl->level; i++) {
        if (update[i]->level[i].forward == x) {
            update[i]->level[i].span += x->level[i].span - 1;
            update[i]->level[i].forward = x->level[i].forward;
        } else {
            update[i]->level[i].span -= 1; }}if (x->level[0].forward) {
        x->level[0].forward->backward = x->backward;
    } else {
        zsl->tail = x->backward;
    }
    while(zsl->level > 1 && zsl->header->level[zsl->level- 1].forward == NULL)
        zsl->level--;
    zsl->length--;
}
Copy the code

The code for the deletion process is also easier to understand, with the forward pointer being updated layer by layer, following the search path from bottom to top. Then update the backtrace pointer. You also need to update the Level field of skiplist if the number of layers to remove the node is the maximum number of layers. And then the length goes down by one.

conclusion

Skiplist is a hierarchical list of nodes that can be searched across multiple nodes to save time.

Redis’s zset consists of a Hash dictionary, which maps data to scores, and a Skiplist, which looks up data based on scores.

Skiplist insert and delete operations in Redis depend on the search path, while update operations delete before insert.

Recommended reading

Skip Lists: A Probabilistic Alternative to Balanced Trees

Redis Deep Adventures: Core Principles and Practical Applications