Redis Zset data structure :SkipList principle and implementation

This is the 10th day of my participation in the First Challenge 2022, for more details: First Challenge 2022

1. Data structure

1.1 introduction

Skiplist is a query/insert/delete complexity o(LGN) data structure. The query is consistent with the complexity of the balanced tree, so it is an alternative to the balanced tree. Leveldb is available in Redis zset. I realized that this algorithm solved a problem for me, too. Various balance trees are widely used in the industry to help us quickly find and insert data, but the complexity of their ideas makes it not so easy to access. At that time, if you wanted to do binary search for ordered lists directly, it would be easy to achieve the same complexity and easy to understand, but linked lists can not be randomly accessed like arrays, you can only go through them one by one. A hop table sets Pointers for quick access to nodes, which can be accessed across nodes instead of traversing them one by one, which is what the name of the hop table means.

This is the jump-table introduction, and its data structure is shown below

1.2 How to Build a hop table

When inserting a piece of data, get the height of the node randomly, yes, randomly! The probability of each layer rising is p, which is generally set to 0.25 or 0.5, so that the higher the number of layers, the fewer nodes (this structure is similar to a balanced tree).

1.3 How Can I Search?

As shown in the figure above, we retrieve the value 19, traversing the path shown in the figure below

You can see that the higher-level nodes act as a fast track, allowing the search to skip nodes rather than traverse them one by one.

2. Coding implementation

Based on Java, I realized a jump table with add delete check: github.com/yangzhenkun…

2.1 the search

The search implementation follows the same idea, traversing from the top to the bottom, with ordinary list traversal between each layer.

public T get(Integer key) {

    SkipNode cur = head;

    for (int l = maxLevel - 1; l >= 0; l--) {

        if (cur.forward[l] == null) {
            continue;
        }

        while(cur.forward[l] ! =null && key > cur.forward[l].key) {
            cur = cur.forward[l];
        }

        if(cur.forward[l] ! =null && key == cur.forward[l].key) {
            return(T) cur.forward[l].value; }}return null;
}
Copy the code

2.2 insert

The idea of insertion is to find the insertion point, and while traversing, record where the fast channel pointer of the number of layers needs to be updated, and then process it at the end.

For example, insert 17 and get a random number of layers 2 at node 17. Then layer 2 of node 9 (from bottom up) needs to point to the new node 17 and layer 1 of node 12 also needs to point to 17. It can be seen that what needs to be updated is the node last traversed by each layer.

    public boolean set(int key, T value) {
        SkipNode newNode = initNode(key, value);
        int newNodeMaxLevel = newNode.forward.length;
        SkipNode cur = head;
        SkipNode[] update = new SkipNode[newNodeMaxLevel];// Record the updated node

        /** * iterate over the inserted node and record the node where the layer pointer needs to be updated */
        for (int l = maxLevel - 1; l >= 0; l--) {

            if (cur.forward[l] == null) {
                if (l < newNodeMaxLevel) {
                    update[l] = cur;
                }
                continue;
            }

            while(cur.forward[l] ! =null && key > cur.forward[l].key) {
                cur = cur.forward[l];
            }

            if(cur.forward[l] ! =null && key == cur.forward[l].key) {
                return false;
            }
            if(l < newNodeMaxLevel) { update[l] = cur; }}/** * Perform insert update */
        for (int l = newNodeMaxLevel - 1; l >= 0; l--) {
            newNode.forward[l] = update[l].forward[l];
            update[l].forward[l] = newNode;
        }


        return true;
    }
Copy the code

2.3 delete

Once you know the GET and set methods, there is no pressure to understand the implementation of the delete method. Similarly, the nodes last traversed by each layer in the search process need to be recorded. After the deleted node is found, the pointer to the deleted node in each layer points to the subsequent pointer to the deleted node in each layer.

 public boolean remove(int key) {
        SkipNode cur = head;
        SkipNode[] update = new SkipNode[maxLevel];// We do not know how many layers the node is deleted, so we need to record the total number of layers
        SkipNode delete = null;

        /** * iterate to find the node to delete and record the node to update the layer pointer */
        for (int l = maxLevel - 1; l >= 0; l--) {

            if (cur.forward[l] == null) {
                continue;
            }

            while(cur.forward[l] ! =null && key > cur.forward[l].key) {
                cur = cur.forward[l];
            }

            if(cur.forward[l] ! =null&& key == cur.forward[l].key) { delete = cur.forward[l]; update[l] = cur; }}if (delete == null) {
            return false;
        }

        /** * update pointer */
        for (int l = delete.forward.length - 1; l >= 0; l--) {
            update[l].forward[l] = delete.forward[l];
            delete.forward[l] = null;//help gc
        }

        return true;
    }
Copy the code

3. Proof of principle

3.1 Maximum level proof

Call the lowest level of the list level0 and the maximum number of levels includes the lowest level

The maximum level mentioned in the data Structure section is

$MaxLevel =\log_\frac{1}{p} n$

The proof method is also mentioned in the paper

There are 16 nodes at Level1, level2, level3, level14, and level1. If we start searching at level 14, there is a lot of inefficiency because the top layer is too sparse. It is mentioned that the top layer should have 1/ P nodes.

It’s easy to get the number of nodes at each level

$L(k)=n*p^{k-1}$

So that gives us the derivation

$1/p = n*p^{k-1}$

$n=(1/p)^{k}$

$k=log_\frac{1}{p} n$

That gives you the maximum number of layers

3.2 Complexity Proof

The complexity of the search has also been proved in this paper. I have captured part of the graph, which can be viewed in the original text

So this is a proof using the idea of recursion.

In this paper, the method of searching path backtracking is proved, traversing up and left. Suppose the current location is layer I, the value to be accessed is x, and the number of layers between I and x is layer K. C (k) table traverses the number of steps accessed from layer K to X. The movement of any of the nodes backtracked is shown below. When we’re in scenario A, our next step must be scenario B or C. Why is probability of c p? Because based on the current layer, the probability that the next layer has this node is p. So b is going to be 1 minus p. So that gives us the recursive formula

C(k)=(1-p)(1+C(k))+p(1+C(k-1))

At the same time C (0) = 0

The number of steps needed to search k layer can be concluded as c(k)= K/P

We go through the maximum number of levels at a time, which we derived above, and plug in the step number formula to get the number of steps in a search

$c=\frac{1}{p}log_\frac{1}{p} n$

And since p is a probability constant, usually 1/4 or 1/2, it’s order LGN

The screenshot of the data structure is also from the paper. The paper address: ftp://ftp.cs.umd.edu/pub/skipLists/skiplists.pdf