0x00. Preface

As we all know, data structures and algorithms are a major focus of technical interviews. It pays to keep working at them steadily, or we lose out in the end; a little less gaming and scrolling can genuinely improve our lot.

                  

However, this article is not about data structures and algorithms, but about another interview staple: Redis. Redis is a technology shared across language stacks; whether you write Java or C++, you will be asked about it, so it is a high-frequency interview topic.

I first came into contact with Redis in 2017. Back then, deploying standalone instances and clusters involved a lot of hands-on work. Nowadays operations at most companies are fully platform-managed: users simply request an instance through a web page, and many details are hidden away. That is certainly convenient, but we should still strive for a deep understanding of what happens underneath.

During my years of work I have been exposed to several NoSQL systems: Redis, the Redis-like SSDB and Pika, Google's key-value storage engine LevelDB, Facebook's key-value storage engine RocksDB, and so on. Redis is developed in standard C and is an excellent open source project for both engineering and learning.

I have written several articles about Redis before, but the knowledge points were scattered and not conducive to reading, so this time I have consolidated and supplemented the previous articles into a complete set, hoping it will be helpful to the readers who follow me.

The article covers many interview points and runs to more than 30,000 words, so readers are advised to bookmark it for later reference. Through this article you will learn about the following:

  • A brief history of Redis and its author
  • Redis's commonly used data structures and their implementations
  • Principles of Redis SDS and comparison with C strings
  • The underlying design and implementation of the Redis sorted set ZSet
  • Questions about the Redis sorted set ZSet and skip lists
  • The Redis dictionary implementation and the progressive rehash process
  • Basic principles and flow of Redis's single-threaded mode
  • Principles and design of the Redis Reactor model
  • Redis persistence schemes and their basic principles
  • Clustered Redis and the Gossip protocol
  • The Redis memory reclamation mechanism and its basic principles
  • The Redis data synchronization mechanism and its basic principles

Without further ado, Big White is accelerating to 400 kilometers an hour!



I have tried to explain each question thoroughly, aiming for understanding rather than rote memorization. There will inevitably be some shortcomings; if you have questions, feel free to send them to me.

0x01. What is Redis and what makes it important?

Redis is an open source, network-based, memory-based, optionally persistent, high-performance key-value database written in ANSI C.

The father of Redis is Salvatore Sanfilippo from Sicily, Italy, who goes by the GitHub handle antirez. I found some brief information about him and translated it, as shown here:



Ten years after its first release in 2009, Redis is still one of the most popular key-value in-memory databases.

A good open source project cannot thrive without the support of large companies. Prior to May 2013 its development was sponsored by VMware; from May 2013 to June 2015 by Pivotal; and since June 2015 Redis development has been sponsored by Redis Labs.



I have also used some other NoSQL stores. Some of them support only a single value type, so many operations must be implemented on the client side: if the value is structured data and you want to modify one field, you must read the whole value out, modify it, and write it back, which is cumbersome. Redis values support multiple types, and many operations can be completed on the server side, which is very convenient for the client.

Of course, Redis is a memory-based database, so its storage capacity is limited and distributed clusters are expensive. Many companies have therefore developed SSD-backed Redis-like systems, such as SSDB and 360's Pika. Still, going from 0 to 1 is harder than going from 1 to 2: Redis is undoubtedly an important member of the NoSQL family and well worth studying and using.

Redis provides clients for Java, C/C++, C#, PHP, JavaScript, Perl, Objective-C, Python, Ruby, Erlang, Golang, and other mainstream languages, so users will find a client no matter what their language stack is. The audience is very wide.

I checked datanyze.com for the latest market share and ranking of Redis versus MySQL, along with their deployment counts among the world's top sites (data from December 2019):





As you can see, Redis ranks 9th in overall share and is roughly on par with MySQL in the number of deployments among the global Top 100 sites, so Redis clearly holds its own.

0x02. Describe the data structures commonly used in Redis and how they are implemented.

The five commonly used data types supported by Redis refer to the type of the value: String, List, Hash, Set, and ZSet. Redis later added several more types, namely Bitmaps, HyperLogLogs, and GEO.

Since Redis is written in standard C, which provides only the most basic data types, Redis developed its own set of underlying data structures and used them to implement the five externally visible data types.

The underlying data structures of Redis include: the simple dynamic string (SDS), linked list, dictionary, skip list, integer set, compressed list, and object.

To balance space and time efficiency, Redis uses different underlying data structures for specific value types; the hash table and the compressed list are the most heavily reused among them. The following figure shows the mapping between external data types and underlying data structures:





As the figure shows, the compressed list ziplist can serve as the underlying implementation of three data types: ZSet, Set, and List. That looks quite powerful: a ziplist is a specially encoded, sequential block of memory developed to save space, and its internal structure is fairly intricate.

0x03. What are the advantages of Redis SDS over C strings?

In C, a string is represented as a character array of length N+1 with '\0' as the terminator. This implementation cannot satisfy Redis's requirements for safety, efficiency, and rich functionality, so Redis encapsulates its own simple dynamic string (SDS) structure.

Before looking at the advantages of SDS, you need to see how SDS is implemented. Check the latest src/sds.h definition on GitHub:

typedef char *sds; /* Note: sds is just a char pointer; the header sits right before it */

/* Header widths are 5, 8, 16, 32, 64. The SDS_TYPE_x macros below define only
 * five types, so 3 bits of flags are enough to encode them; in sdshdr5 the
 * remaining 5 bits store the string length directly. buf is a flexible array
 * member, just like a C character array, which is why sds can simply be
 * typedef'd as char*. The maximum length of buf depends on the header type:
 * e.g. with sdshdr16, buf can hold at most 2^16 bytes. */
struct __attribute__ ((__packed__)) sdshdr5 {
    unsigned char flags; /* 3 lsb of type, and 5 msb of string length */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr8 {
    uint8_t len; /* used */
    uint8_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr16 {
    uint16_t len; /* used */
    uint16_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr32 {
    uint32_t len; /* used */
    uint32_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr64 {
    uint64_t len; /* used */
    uint64_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};

#define SDS_TYPE_5  0
#define SDS_TYPE_8  1
#define SDS_TYPE_16 2
#define SDS_TYPE_32 3
#define SDS_TYPE_64 4
#define SDS_TYPE_MASK 7
#define SDS_TYPE_BITS 3

Looking at the previous definition, the author drew a picture:



From the figure we can see that SDS is divided into three parts: header, buf, and the null terminator. The header can be regarded as the guide portion of the whole SDS, recording the used length, the allocated capacity, and other information. The following figure found online clearly illustrates an sdshdr8 example:

                        

You can see the full implementation details of SDS in the sds.h/sds.c sources; I won't walk through them all here, since that would make the article far too long. Let's get to the point and talk about the advantages of SDS:

  • O(1) length lookup: a C string must be traversed to find its length, while SDS stores len and returns it directly (a minimal sketch follows this list);

  • Preventing buffer overflows: when SDS needs to modify a string, it first checks via len and alloc whether there is enough space; if not, it expands automatically, avoiding the overflow situations that C string operations can run into;

  • Fewer memory reallocations: every append or trim on a C string resizes the underlying array and forces a reallocation. SDS uses space pre-allocation and lazy space release: each expansion over-allocates, and shrinking does not immediately return memory to the OS. Both mechanisms are easy to understand;

  • Binary safety: a C string cannot contain an embedded '\0' byte, so binary data such as images or audio cannot be stored. SDS is binary safe: what you write is exactly what you read, with no filtering or restrictions;
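To make the O(1) length point concrete, here is a minimal sketch of the idea behind sdslen() in sds.h. The real function handles all five header types; this hypothetical sds_len_sketch() covers only sdshdr8:

#include <stdint.h>
#include <stddef.h>

#define SDS_TYPE_8    1
#define SDS_TYPE_MASK 7

struct __attribute__ ((__packed__)) sdshdr8 {
    uint8_t len;
    uint8_t alloc;
    unsigned char flags;
    char buf[];
};

typedef char *sds;

/* The header sits immediately before buf, so s[-1] is the flags byte. */
size_t sds_len_sketch(const sds s) {
    unsigned char flags = s[-1];
    if ((flags & SDS_TYPE_MASK) == SDS_TYPE_8) {
        struct sdshdr8 *sh = (struct sdshdr8 *)(s - sizeof(struct sdshdr8));
        return sh->len;   /* O(1): no strlen() traversal needed */
    }
    return 0;             /* other header types omitted in this sketch */
}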

Finally, a summary picture by Huang Jianhong:



0x04. How is Redis’ dictionary implemented? Describe the progressive rehash process

Dictionaries are one of the most commonly used data types in Redis. As mentioned earlier, a dictionary can be implemented on top of a ziplist or a hashtable.

A dictionary is a clearly layered data structure, as shown here:



With that picture in mind, let's look at the latest src/dict.h source definitions:

/* Hash table node */
typedef struct dictEntry {
    void *key;
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
        double d;
    } v;
    struct dictEntry *next;
} dictEntry;

/* Type-specific functions that operate on the dictionary */
typedef struct dictType {
    uint64_t (*hashFunction)(const void *key);
    void *(*keyDup)(void *privdata, const void *key);
    void *(*valDup)(void *privdata, const void *obj);
    int (*keyCompare)(void *privdata, const void *key1, const void *key2);
    void (*keyDestructor)(void *privdata, void *key);
    void (*valDestructor)(void *privdata, void *obj);
} dictType;

/* This is our hash table structure. Every dictionary has two of this as we
 * implement incremental rehashing, for the old to the new table. */
typedef struct dictht {
    dictEntry **table;
    unsigned long size;
    unsigned long sizemask;
    unsigned long used;
} dictht;

/* The dictionary itself */
typedef struct dict {
    dictType *type;
    void *privdata;
    dictht ht[2];
    long rehashidx; /* rehashing not in progress if rehashidx == -1 */
    unsigned long iterators; /* number of iterators currently running */
} dict;

A nice property of C is that definitions are built from the inside out, so the hierarchy is visible in the code. The author drew a diagram to show the specific layers:



  • About dictEntry

A dictEntry is a hash table node; this is where the data lives. Its members are key, v, and the next pointer. key holds the key of the key-value pair, and v holds the value, which can be a pointer, a uint64_t, or an int64_t. next points to another hash table node, chaining key-value pairs with the same hash value together to resolve hash collisions.

The following figure shows the connection between two conflicting hash nodes:

                     

  • About dictht

The hash table dictht contains the members table, size, used, and sizemask. table is an array whose elements are pointers to dictEntry structures, each holding a key-value pair; size records the size of the table; used records the number of nodes currently in the hash table; and sizemask equals size-1, used together with a key's hash value to compute its index in the table array.



The figure above shows the nodes in a table of size 4, where k1 and k0 collide at index=2 and are chained together (separate chaining). k0 was stored first, and k1 is inserted directly at the head of the collision chain, which keeps insertion efficient since the list has no tail pointer.

  • About dict

The dict structure is the dictionary itself. It contains the members type, privdata, ht, and rehashidx. type points to a dictType, i.e. the API for operating on the dictionary; ht is an array of 2 dictht, meaning a dictionary contains two hash tables; rehashidx is a variable used during rehashing; and privdata is passed as an argument to the functions that dictType points to. The diagram gives a glimpse of the dictionary's members.



Dictionary hashing algorithm

/* Pseudo-code: use the hash function to compute the hash value of the key */
hash = dict->type->hashFunction(key);
/* Pseudo-code: use sizemask and the hash value to compute the index in ht[0] or ht[1] */
index = hash & dict->ht[x].sizemask;
/* Source definition */
#define dictHashKey(d, key) (d)->type->hashFunction(key)

Redis uses the MurmurHash algorithm to compute hash values. MurmurHash was invented by Austin Appleby in 2008; it produces well-distributed hash values regardless of the input pattern and is very fast. There are MurmurHash2 and MurmurHash3 versions.

  • Normal rehash

The number of key-value pairs stored in a hash table changes dynamically. To keep the load factor of the hash table within a reasonable range, the hash table must be expanded or shrunk.

Scaling is done by performing a rehash. The basic steps of a normal rehash of a dictionary hash table are: allocate space -> migrate entries one by one -> swap the hash tables. The detailed process is as follows:

  1. Allocate space for the dictionary's ht[1] hash table. The allocated size depends on the operation and on the number of key-value pairs currently in ht[0]: for an expansion, the size of ht[1] is the first 2^n greater than or equal to ht[0].used*2; for a shrink, the size of ht[1] is the first 2^n greater than or equal to ht[0].used (a sketch of this sizing rule follows this list);

    For example, if ht[0].used=200, we pick the first power of 2 greater than or equal to 400, i.e. 2^9=512.

  2. Rehash all key-value pairs stored in ht[0] into ht[1] by recomputing each key's hash and index value;

  3. Repeat until all key-value pairs in ht[0] have been migrated to ht[1]; then set ht[1] as ht[0] and create a new blank hash table in ht[1], ready for the next rehash.
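A minimal sketch of how that power-of-two size is chosen, mirroring the logic of _dictNextPower() in dict.c (whose initial table size is 4); the helper name here is mine:

static unsigned long next_power(unsigned long target) {
    unsigned long size = 4;              /* DICT_HT_INITIAL_SIZE */
    while (size < target) size <<= 1;    /* smallest power of two >= target */
    return size;
}
/* Expansion example: ht[0].used = 200  ->  next_power(200 * 2) == 512 */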

  • Progressive Rehash process

Redis does not perform the rehash in one go; it performs it in several steps, progressively. The reason: when the hash table holds a large number of key-value pairs, rehashing them all into ht[1] at once could make the server stop serving for a while, which is unacceptable.

For this reason Redis uses progressive rehash. The detailed steps are:

  1. Allocating space for ht[1] is no different from a normal rehash;

  2. Set rehashidx to 0 to formally start the rehash work. rehashidx increments monotonically, starting from 0, so rehashing begins at the first bucket of the array;

  3. During the rehash, every add, delete, update, or lookup on the dictionary additionally rehashes the key-value pairs sitting at index rehashidx of ht[0] over to ht[1]; once that bucket is done, rehashidx is incremented by 1 to point at the next bucket to rehash;

  4. As dictionary operations keep coming, eventually all key-value pairs of ht[0] are rehashed into ht[1]; rehashidx is then set to -1 to indicate that the rehash is complete.

The idea of progressive rehash is to spread the computation required for rehash key-value pairs over each addition, deletion, lookup, and update to the dictionary, thus avoiding the blocking problem associated with centralized rehash.
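The core of one such migration step looks roughly like this simplified sketch of dictRehash() in dict.c, using the structures defined above (error handling and the empty-bucket visit limit of the real code are omitted):

/* Simplified sketch: migrate one bucket from ht[0] to ht[1]. */
static void rehash_one_bucket(dict *d) {
    if (d->rehashidx == -1) return;                  /* not rehashing */

    /* Skip empty buckets to reach the next occupied one. */
    while (d->ht[0].table[d->rehashidx] == NULL) d->rehashidx++;

    dictEntry *de = d->ht[0].table[d->rehashidx];
    while (de) {                                     /* move the whole chain */
        dictEntry *nextde = de->next;
        uint64_t h = dictHashKey(d, de->key) & d->ht[1].sizemask;
        de->next = d->ht[1].table[h];                /* head-insert into ht[1] */
        d->ht[1].table[h] = de;
        d->ht[0].used--;
        d->ht[1].used++;
        de = nextde;
    }
    d->ht[0].table[d->rehashidx] = NULL;
    d->rehashidx++;

    if (d->ht[0].used == 0) {                        /* migration finished */
        d->ht[0] = d->ht[1];                         /* swap tables (frees omitted) */
        d->rehashidx = -1;
    }
}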

You can't help but wonder whether this piggy-backed rehash could stretch the whole process out for a very long time. If a dictionary is never operated on again, an unfinished expansion has little impact since the table is barely used, but an unfinished shrink could waste memory. I haven't had time to study the specifics, so let's bury that question here for now!

0x05. How did Redis run in single-threaded mode before 4.0?

Strictly speaking, Redis is not a purely single-threaded service model: some auxiliary tasks, such as persistence flushing and lazy deletion, are done by BIO threads.
It is not clear how antirez reasoned about the single-threaded design at the time; we can only analyze from several angles why the single-threaded model was likely chosen.

5.1 Considerations for single-threaded mode

CPU is not the bottleneck: the multi-threaded model exists to make full use of multi-core CPUs, suspending a thread during blocking I/O so the CPU can serve other threads, thereby raising CPU utilization. This scenario is not pronounced in Redis: the CPU is not Redis's bottleneck, all operations work on memory, and event processing is extremely fast, so the need to use multiple threads (and pay for context switches) to improve CPU utilization is weak;

Memory is the bottleneck: A single Redis instance makes good use of a single core, but the bottleneck of Redis is memory. Imagine a 64-core machine with only 16GB of memory. What’s the use of multi-threaded Redis?

Complex value types: Redis has rich data structures, not simple key-value NoSQL, which is one reason it is so popular. When the value of a common Hash, ZSet, List, etc. is large, CRUD operations become very complex. In a multi-threaded mode, operations on the same key would need locks for synchronization, which could even cause deadlocks.

At this point you might ask: couldn't Redis shard keys by hash and load-balance them across threads, so each thread owns its own keys and no locking is needed? Antirez didn't do that within a single process, and there must have been a reason. Sure enough, we eventually got clustered Redis;


Cluster extension: today's machines are multi-core, but memory of 64GB/128GB is the common case, and Redis above roughly 60% memory usage is noticeably less stable than at 50% (in my experience with clustered Redis, above 70% usage the frequency of cluster failovers rises). So when the data volume is large and Redis is used as primary storage, multiple machines are needed to build a clustered Redis database system; the single-threaded model is thus scaled out through clustering.

From a software engineering perspective: single-threaded code is much easier to develop and maintain than multi-threaded code, and it also improves service stability. Lock-free processing gives single-threaded Redis a considerable advantage in development and maintenance.

Redis-like systems: Redis's design adheres to pragmatism and engineering. Many designs that are excellent in theory are not necessarily a good fit in practice; software design is a process of trade-offs. There are many Redis-like NoSQL systems in industry, such as Pika, developed by 360's infrastructure group, which is based on SSDs and the RocksDB storage engine, with a protocol-translation layer on top that emulates the full Redis feature set. Interested readers can study and use it.

5.2 File events and time events in Redis

Even as a single-threaded service, Redis has a great deal of work to handle. Redis is an event-driven server; the main event types are file events and time events, and time events are the key to understanding the single-threaded logical model.
  • Time event

Redis time events fall into two categories:


  1. Timed event: the task executes once after a specified delay; after completion it is not rescheduled, i.e. it triggers only once.

  2. Periodic event: the task executes at a fixed interval; after completion the next run is scheduled, i.e. it triggers periodically.

  • Periodic time event

Most time events in Redis are periodic events, used mainly by the server to periodically inspect and adjust its own running state to ensure stability. This work is chiefly done by the serverCron function. Periodic events mainly cover:


  1. Deleting expired keys from the database

  2. Triggering RDB and AOF persistence

  3. Master-slave synchronization

  4. Cluster keepalive checks

  5. Closing and cleaning up dead client connections

  6. Updating server statistics such as memory usage and key counts

It can be seen that the periodic events of Redis mainly deal with auxiliary tasks, but play a vital role in the stable operation of the whole service.
  • An unordered list of time events

Each time event in Redis consists of three parts:
  1. An event id, globally unique and monotonically increasing

  2. A trigger timestamp with millisecond precision

  3. An event handler, i.e. the event's callback function

The time event structure:

                          

Redis stores time events in a linked list ordered by id: new events go at the head and old events sit at the tail, so the list is not ordered by when the events will fire.


In other words, the first element might fire after 50ms while the third fires after 30ms, so each time Redis fetches the next event due for execution it must do an O(N) traversal of the list. That is obviously not optimal in theory; the ideal would be something along the lines of a min-heap. Antirez, however, chose the unordered list.




The choice of an unordered list also suits Redis's scenario: there are few time events, so even an O(N) traversal costs very little, and there is no need to re-sort the list every time a new event is inserted.
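For reference, that O(N) lookup is tiny; the search in ae.c looks essentially like this (lightly trimmed):

/* Scan the whole unordered list for the time event that fires soonest. */
static aeTimeEvent *aeSearchNearestTimer(aeEventLoop *eventLoop)
{
    aeTimeEvent *te = eventLoop->timeEventHead;
    aeTimeEvent *nearest = NULL;

    while (te) {
        if (!nearest || te->when_sec < nearest->when_sec ||
                (te->when_sec == nearest->when_sec &&
                 te->when_ms < nearest->when_ms))
            nearest = te;
        te = te->next;
    }
    return nearest; /* O(N), acceptable because time events are few */
}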


The unordered list of time events stored by Redis is shown below:



5.3 Event scheduling and execution in single-threaded mode

The presence of both time events and file events makes the Redis service more intricate: the server must decide when to process file events, when to process time events, and how long to spend on each, so event scheduling and execution become the focal point.


The Redis server processes file events and time events in turn; both are handled synchronously, sequentially, and atomically, and the server never terminates or preempts an event that is being executed.
  • Event execution scheduling rules

File events arrive at random. If the current batch of file events has been processed and no new file events have arrived, the server keeps waiting. As file events are processed one after another, time creeps toward the scheduled moment of the earliest time event; once that moment arrives (or nearly so), the server can start processing the due time event.


Because time events run only after file events, and no event preempts another, the actual processing time of a time event is generally a little later than scheduled.
  • Code implementation of event execution scheduling

The detailed event scheduling and execution flow in the Redis source ae.c is implemented in aeProcessEvents; the code is as follows:

int aeProcessEvents(aeEventLoop *eventLoop, int flags)
{
    int processed = 0, numevents;

    if (!(flags & AE_TIME_EVENTS) && !(flags & AE_FILE_EVENTS)) return 0;

    if (eventLoop->maxfd != -1 ||
        ((flags & AE_TIME_EVENTS) && !(flags & AE_DONT_WAIT))) {
        int j;
        aeTimeEvent *shortest = NULL;
        struct timeval tv, *tvp;

        if (flags & AE_TIME_EVENTS && !(flags & AE_DONT_WAIT))
            shortest = aeSearchNearestTimer(eventLoop);
        if (shortest) {
            long now_sec, now_ms;
            aeGetTime(&now_sec, &now_ms);
            tvp = &tv;

            long long ms =
                (shortest->when_sec - now_sec)*1000 +
                shortest->when_ms - now_ms;

            if (ms > 0) {
                tvp->tv_sec = ms/1000;
                tvp->tv_usec = (ms % 1000)*1000;
            } else {
                tvp->tv_sec = 0;
                tvp->tv_usec = 0;
            }
        } else {
            if (flags & AE_DONT_WAIT) {
                tv.tv_sec = tv.tv_usec = 0;
                tvp = &tv;
            } else {
                tvp = NULL; /* wait forever */
            }
        }

        numevents = aeApiPoll(eventLoop, tvp);

        if (eventLoop->aftersleep != NULL && flags & AE_CALL_AFTER_SLEEP)
            eventLoop->aftersleep(eventLoop);

        for (j = 0; j < numevents; j++) {
            aeFileEvent *fe = &eventLoop->events[eventLoop->fired[j].fd];
            int mask = eventLoop->fired[j].mask;
            int fd = eventLoop->fired[j].fd;
            int fired = 0;
            int invert = fe->mask & AE_BARRIER;

            if (!invert && fe->mask & mask & AE_READABLE) {
                fe->rfileProc(eventLoop,fd,fe->clientData,mask);
                fired++;
            }
            if (fe->mask & mask & AE_WRITABLE) {
                if (!fired || fe->wfileProc != fe->rfileProc) {
                    fe->wfileProc(eventLoop,fd,fe->clientData,mask);
                    fired++;
                }
            }
            if (invert && fe->mask & mask & AE_READABLE) {
                if (!fired || fe->wfileProc != fe->rfileProc) {
                    fe->rfileProc(eventLoop,fd,fe->clientData,mask);
                    fired++;
                }
            }
            processed++;
        }
    }
    /* Check time events */
    if (flags & AE_TIME_EVENTS)
        processed += processTimeEvents(eventLoop);

    return processed;
}

  • Pseudo code for event execution and scheduling

The above source code may not be intuitive to read, but the pseudo-code implementation is given in Redis Design and Implementation:

def aeProcessEvents():
    # Get the nearest time event to be executed
    time_event = aeGetNearestTimer()
    # Compute the gap between that event's scheduled time and now
    remain_gap_time = time_event.when - unix_time_now()
    # If the time event is already overdue, it should run as soon as possible
    if remain_gap_time < 0:
        remain_gap_time = 0
    # Block waiting for file events for at most remain_gap_time
    # (remain_gap_time == 0 means return immediately without blocking)
    aeApiPoll(remain_gap_time)
    # Handle all file events
    ProcessAllFileEvent()
    # Handle all time events
    ProcessAllTimeEvent()

As you can see, the Redis server blocks while it waits, and the maximum blocking time is determined by the nearest pending time event. When the minimal blocking wait returns, the event tasks are processed: file events first, then time events. Even a time event that is already due must wait for the file events to finish, so it runs a little later than expected.
  • Event scheduling and execution processes

     

0x06. Talk about the Redis Reactor model

Redis built its own network model on the Reactor pattern, forming a complete event-driven server on top of I/O multiplexing. A few questions naturally come to mind:
  1. Why use the Reactor model?

  2. How does Redis implement its Reactor model?

6.1 Reactor model

Pure epoll/kqueue can support tens of thousands of concurrent connections on a single server, which is fine from a raw performance perspective, but there are differences when it comes to technical implementation and software design.
Consider this scenario:
  • Epoll/kqueue puts all the collected read and write events into one queue for business threads to process; worker threads in a thread pool then pull tasks and handle them. In real scenarios there may be many types of requests

  • Worker threads must care about all the different request types and choose different processing for each, so a growing number of request types makes the worker threads more complex and harder to maintain

The scenario above actually resembles general high-concurrency network models. If, on top of epoll/kqueue, we distinguish tasks by business type, register a handler function for each type, and identify and dispatch every incoming task so that each handler deals with only one kind of business, the model fits object-oriented design much better. That is essentially how the Reactor pattern came to be designed.
Reactor is an object-behavior design pattern built around synchronous I/O; its asynchronous-I/O counterpart is the Proactor pattern, which I won't detail here. The main difference is that Reactor uses synchronous I/O while Proactor uses asynchronous I/O, so in theory Proactor is more efficient, but it requires kernel-level asynchronous I/O support from the OS. Boost.Asio is a representative of the Proactor pattern, and Windows has IOCP.


A classic diagram of the Reactor model from the web:



The figure shows five components:


  1. Handle: the read/write events that can be registered with the Reactor for monitoring

  2. Sync Event Demultiplexer: the collector of I/O events, such as epoll/kqueue/select

  3. Dispatcher: the event dispatcher, responsible for registering/removing events and distributing them

  4. Event Handler: the event handler whose callbacks for specific events are invoked by the Dispatcher

  5. Concrete Event Handler: the handler for a specific request

A more concise process is as follows:



Before the loop starts, the events to monitor are registered. When a monitored socket read/write event occurs, an I/O multiplexing facility such as epoll (the event collector) detects it and returns it to the event dispatcher. The dispatcher routes the event, according to its read/write/exception status, to the matching event handler, which in turn calls the concrete implementation function for that event type to complete the task.

6.2 Implementation of the Reactor model in Redis

The basic flow of how Redis handles client business (file events):



Redis's I/O multiplexing selection:

#ifdef HAVE_EVPORT
#include "ae_evport.c"
#else
    #ifdef HAVE_EPOLL
    #include "ae_epoll.c"
    #else
        #ifdef HAVE_KQUEUE
        #include "ae_kqueue.c"
        #else
        #include "ae_select.c"
        #endif
    #endif
#endif

Redis supports several I/O multiplexing mechanisms; the source selects among them with macro definitions, so at compile time the best I/O multiplexing function available on the current system is chosen, giving Redis excellent portability.




  • Task event queue for Redis

Since Redis processes business in a single thread, the I/O multiplexing program puts ready read/write events into the queue one at a time; if the queue is full, a new event can only enter when another leaves. Redis normally processes events quickly, so a backed-up queue is unlikely; however, certain blocking operations, when executed, cause long stalls and prevent new tasks from being processed.
  • Redis event dispatcher

From the server's point of view, events are reads and writes. The event types the dispatcher sees include:
  1. AE_READABLE: the client writes data, closes the connection, or a new connection arrives

  2. AE_WRITEABLE: the client reads data

In particular, when a socket connection is both readable and writable, the server processes the read event first and the write event afterwards.

  • Redis event handler

Redis categorizes file events and has written several event handler functions, including:

  1. Connection reply handler: establishes new connections

  2. Command request handler: processes new commands from the client

  3. Command reply handler: returns the results of the client's requests

  4. Replication handler: implements data replication between master and slave servers

  • A complete Redis client/server interaction

The main thread of the Redis server runs in a loop. When a client initiates a connection to the Redis server (say on port 6379), the listening socket produces an AE_READABLE event, which the I/O multiplexing facility detects and puts into the TaskQueue. The event dispatcher picks up the read event, determines that it is a new connection request, and passes it to the connection reply handler to establish the connection;


Once the connection is established, the client sends, say, a GET command to the server. The command is again detected by I/O multiplexing, queued, handed by the event dispatcher to the command request handler, and the corresponding routine is invoked to execute it;


The server then associates the socket's AE_WRITEABLE event with the command reply handler. When the client tries to read the result, a writable event fires; the server triggers the command reply and writes the result data back to the socket.

                  

0x07. How does Redis implement persistence, and what are its basic principles?

In general, persistence is the writing of data from memory to non-volatile media, such as mechanical disks and SSDS.

As an in-memory database, Redis would lose all its data in a server outage, so it provides two powerful persistence tools: RDB and AOF.

  1. RDB saves a snapshot of the database to disk in binary form;

  2. AOF records every write command and its arguments in protocol text to the AOF file, thereby recording the database state.

  • View the RDB configuration

[redis@abc]$ cat /abc/redis/conf/redis.conf   
save 900 1  
save 300 10  
save 60 10000  
dbfilename "dump.rdb" 
dir "/data/dbs/redis/rdbstro" 

The first three lines are the conditions that trigger an RDB save. For example, the first line means: if at least 1 key changes within 900 seconds, trigger an RDB save, and so on for the others; as soon as any one condition is satisfied, RDB persistence runs;

The fourth line, dbfilename, specifies the file name the in-memory database is written to; it is a compressed binary file.

The fifth line, dir, specifies the directory where the RDB binary file is stored.

  • Modify the RDB configuration

Edits to the configuration file take effect only after a server restart, but the values can also be adjusted at runtime with CONFIG SET:

[redis@abc]$ bin/redis-cli
127.0.0.1:6379> CONFIG GET save
1) "save"
2) "900 1 300 10 60 10000"
127.0.0.1:6379> CONFIG SET save "21600 1000"
OK

7.1 RDB SAVE and BGSAVE

RDB files are suitable for data disaster recovery, backup, and recovery. It takes less time to recover database data using RDB files.


RDB persistence stores data only periodically. Incremental data will be lost if the service breaks down before the next storage is triggered. For large data volumes, the fork operation is CPU intensive and can result in second-long blocks.


SAVE is a blocking persistence. When a command is executed, the Redis main process writes memory data to the RDB file until it is created, during which time Redis cannot process any command.


BGSAVE is a non-blocking persistence that creates a child process to write data from memory to an RDB file while the main process processes command requests.

The figure shows a simple BGSAVE flow:



  • BGSAVE implementation details

When the conditions are met, Redis automatically snapshots the in-memory data to disk. Taking BGSAVE as an example, a complete data snapshot proceeds as follows:
  1. Redis uses fork to create a child process;

  2. The parent process continues to receive and process command requests, and the child process writes memory data to temporary files;

  3. After all data is written, the child process replaces the old RDB file with a temporary file.

When fork is executed, the OS uses a copy-on-write policy to optimize the snapshot process for the child process.
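A minimal sketch of this fork-plus-COW pattern (not Redis source; write_snapshot is a hypothetical stand-in for the real RDB serialization):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

/* Stand-in for the real RDB serialization: dump the dataset to a file. */
static void write_snapshot(const char *path) {
    FILE *fp = fopen(path, "wb");
    if (!fp) _exit(1);
    /* ... walk the in-memory dataset and write it out here ... */
    fclose(fp);
}

int bgsave_sketch(void) {
    pid_t pid = fork();                       /* child gets a COW view of parent memory */
    if (pid == 0) {
        write_snapshot("temp-dump.rdb");      /* child: write the snapshot */
        rename("temp-dump.rdb", "dump.rdb");  /* atomically replace the old RDB */
        _exit(0);
    } else if (pid > 0) {
        return 0;                             /* parent: keep serving commands */
    }
    return -1;                                /* fork failed */
}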

Redis does not modify the RDB file during the snapshot; it replaces the old file with the new one only after the snapshot completes, so the RDB file is complete at any moment.

We can implement Redis database backup by periodically backing up RDB files. RDB files are compressed and occupy less space than the data size in memory.

Besides automatic snapshots, you can also send the SAVE or BGSAVE command to Redis manually to take a snapshot. Because of RDB's frequency limits, if the data is important, consider also using AOF for persistence.

7.2 AOF

With AOF persistence enabled, Redis appends every write command it receives to the AOF file via the write function, similar in spirit to MySQL's binlog. In other words, AOF records the database state by saving the write commands executed against the Redis server.


AOF files have their own storage protocol format:

[redis@abc]$ more appendonly.aof 
*2     # 2 arguments
$6     # first argument is 6 bytes long
SELECT # first argument
$1     # second argument is 1 byte long
8      # second argument
*3     # 3 arguments
$3     # first argument is 3 bytes long
SET    # first argument
$4     # second argument is 4 bytes long
name   # second argument
$4     # third argument is 4 bytes long
Jhon   # third argument

AOF configuration:

[redis@abc]$ more ~/redis/conf/redis.conf
dir "/data/dbs/redis/abcd"           # AOF file storage directory
appendonly yes                       # enable AOF persistence (disabled by default)
appendfilename "appendonly.aof"      # AOF file name (default)
appendfsync no                       # AOF persistence (fsync) policy
auto-aof-rewrite-percentage 100      # condition that triggers AOF rewrite (default)
auto-aof-rewrite-min-size 64mb       # condition that triggers AOF rewrite (default)

With AOF enabled, every time the server executes a write command it appends the command to the end of the AOF buffer, and the buffer is then written to the AOF file. When the file is actually synced to disk depends on the appendfsync configuration:


appendfsync has three options: always, everysec, and no:
  • always: after every event the server forces the contents of the AOF buffer into the AOF file on disk, which best guarantees the integrity of persisted data; it is the slowest but safest;

  • everysec: the server syncs roughly once per second, writing the buffered AOF data in memory to the AOF file; this balances efficiency and integrity, and in the extreme case a crash loses at most the last second of writes to the Redis database;

  • no: leave flushing of the system cache to disk up to the OS; data safety and integrity are the weakest.

AOF files are larger than RDB files and grow quickly as commands accumulate. To compact the AOF persistence file, Redis provides the rewrite mechanism to keep AOF file growth under control.


The rationale of the AOF rewrite can be seen from an example:
  1. Execute set hello world 50 times

  2. Finally execute set hello china once

  3. For restoring the final state, the first 50 commands in the AOF file are meaningless; the AOF rewrite keeps only the final state of each key.

  • Data consistency during the rewrite

During the AOF rewrite, the main process needs to continue processing commands, and the new commands may modify existing data, resulting in data inconsistency between the database and the rewritten AOF file.


Therefore, Redis added an AOF rewrite cache, which is enabled after the fork of the child process. When the Redis main process receives a new write command, it appends the protocol content of the write command to this cache in addition to the existing AOF file.

        

When the child process completes the AOF rewrite, it sends a signal to the parent process. On receiving the signal, the parent calls a signal handler that performs the following tasks:
  1. Writes the entire contents of the AOF rewrite cache into the new AOF file

  2. Renames the new AOF file, overwriting the original AOF file

  • Blocking during the AOF rewrite

During the entire AOF rewrite, only the final cache flush and the rename block the main process; at all other times the rewrite does not block it, which minimizes the impact of AOF rewrites on performance.


An AOF rewrite can also be triggered manually by calling BGREWRITEAOF.
When the AOF feature is enabled, the server maintains the following three variables:
  1. The current AOF file size

  2. The AOF file size after the last rewrite

  3. The AOF file size growth percentage

Each time the serverCron function executes, it checks whether all of the following conditions are met; if so, an automatic AOF rewrite is triggered:
  1. No BGSAVE command is running, to avoid conflicting with RDB

  2. No BGREWRITEAOF is running, to avoid conflicting with a manual rewrite

  3. The current AOF file size is at least the configured minimum

  4. The ratio of the current AOF file size to the size after the last rewrite is greater than or equal to the configured growth percentage (a condensed sketch of this check follows this list)
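A condensed sketch of that check, showing the shape of what serverCron does; the function and parameter names here are mine, not Redis's (growth_pct maps to auto-aof-rewrite-percentage, min_size to auto-aof-rewrite-min-size):

int should_auto_rewrite(long long aof_current_size,
                        long long size_after_last_rewrite,
                        long long min_size, long long growth_pct,
                        int rdb_child_running, int aof_child_running) {
    if (rdb_child_running || aof_child_running) return 0; /* conditions 1 & 2 */
    if (aof_current_size < min_size) return 0;            /* condition 3 */
    long long base = size_after_last_rewrite ? size_after_last_rewrite : 1;
    long long growth = (aof_current_size * 100 / base) - 100;
    return growth >= growth_pct;                          /* condition 4 */
}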

7.3 Data restoration in Redis

Redis data recovery priority:

  1. If only AOF is configured, the AOF file is loaded during the restart to restore data.

  2. If both RDB and AOF are configured, only the AOF file is loaded to restore data.

  3. If only RDB is configured, the dump file is loaded to restore data.

AOF data recovery process: copy the AOF file to the Redis data directory and start the redis-server. Redis creates a fake (virtual) client, reads the commands and arguments from the AOF file, and executes them to restore the data. These steps are mainly implemented in loadAppendOnlyFile().


RDB data recovery process: copy the RDB file to the Redis data directory and start the redis-server. This is more direct, because the RDB file stores the real data rather than command states and arguments.

  • New hybrid persistence
Both RDB and AOF have their disadvantages:
  1. RDB is persisted at intervals. When a failure occurs, data between the time of outage and the last persistence will be lost. Data integrity cannot be guaranteed

  2. AOF stores sequences of instructions, which can take a long time to recover and replay and have a larger file

Redis 4.0 offers a better option, hybrid persistence: an AOF file that contains both RDB and AOF data. The RDB data sits at the beginning of the AOF file and stores the database state at the moment the rewrite started; after that point, new writes continue to be appended in AOF format at the end of the file, behind the RDB data.
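If memory serves, this behavior is controlled by the aof-use-rdb-preamble directive introduced in Redis 4.0 (off by default in 4.0 and on by default since 5.0):

# redis.conf (Redis 4.0+): start the rewritten AOF file with an RDB preamble
aof-use-rdb-preamble yes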

                 


Persistence practice

In practice, the persistence scheme should be chosen based on whether Redis serves as primary storage or as a cache, the requirements on data integrity and tolerable loss, CPU and memory conditions, and other factors. Generally safe practices include:
  1. The safest approach is to use RDB together with AOF: even if the AOF file is damaged beyond repair, the RDB can still recover the data, though persistence does carry a performance cost.

  2. If Redis is a plain cache and losing the cache cannot cause a cache avalanche, use RDB alone.

  3. Using AOF alone is not recommended, because AOF is slower than RDB at loading and restoring data; when using AOF it is better to also keep RDB as a backup.

  4. Use the new hybrid persistence scheme of Redis 4.0.

0x08. Talk about the underlying design and implementation of Redis's ziplist

Setting Redis's actual ziplist implementation aside for a moment, let's think about how we would design such a data structure ourselves. Thinking it through yields a bigger harvest!

  • Consideration 1: the double-edged nature of contiguous memory

Contiguous memory reduces fragmentation, but a large contiguous block is not easy to come by. Think of the subway: sitting shoulder to shoulder wastes no space, yet many people ride alone and would rather not squeeze together, so a pair of seats or a single seat is easy to find while three seats in a row are genuinely hard to get.



  • Consideration 2: the diversity of elements a compressed list must hold

This differs from designing an array: an array enforces a type, so the offset of an index can be computed from the element type and count, but the compressed list places no constraint on element types, which means we don't know each element's data type or length in advance. This resembles TCP's sticky/split packet problem: we must record a terminator or the length of each stored element, otherwise the data runs together.

  • Consideration 3: constant-time access to overall properties

Even with the first two points solved, the compressed list as a whole should provide some general information at constant cost, such as its total length, the number of stored elements, and the position of the tail node (for fast insertion and removal at the tail). These matter a great deal when manipulating the compressed list.

  • Consideration 4: support for insertion and deletion

In theory the data structure we design must support insertion and deletion. There are trade-offs, of course; no data structure is perfect, so we adjust as we go.

  • Consideration 5: how to save memory

If we want to save memory, we need variable-length encodings for special cases: rather than the two prev and next pointers of a doubly linked list, which consume a lot of space, we likely need variable-length fields.

The overall structure of ziplist

With these considerations in mind, let's look at what Redis actually did. Here is the overview:



From the figure we can make out the main parts: zlbytes, zltail, zllen, zlentry, and zlend.

To explain what each of these attributes means, take a look at a very nice diagram from the web, where the red line validates our consideration 2 and the green line validates our consideration 3:



Look at the creation and resize operations in ziplist.c to deepen the understanding of these attributes:

/* Create a new empty ziplist. */
unsigned char *ziplistNew(void) {
    unsigned int bytes = ZIPLIST_HEADER_SIZE+ZIPLIST_END_SIZE;
    unsigned char *zl = zmalloc(bytes);
    ZIPLIST_BYTES(zl) = intrev32ifbe(bytes);
    ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(ZIPLIST_HEADER_SIZE);
    ZIPLIST_LENGTH(zl) = 0;
    zl[bytes-1] = ZIP_END;
    return zl;
}

/* Resize the ziplist. */
unsigned char *ziplistResize(unsigned char *zl, unsigned int len) {
    zl = zrealloc(zl,len);
    ZIPLIST_BYTES(zl) = intrev32ifbe(len);
    zl[len-1] = ZIP_END;
    return zl;
}

The implementation of zlentry
  • encoding and content storage

When content is a string, the top 2 bits of the first byte of encoding indicate the string form; when content is an integer, the top 2 bits of the first byte of encoding are fixed to 11.

/*########### String encodings ###############
 #### encoding comes in three sizes: 1 byte, 2 bytes, or 5 bytes ####
 #### The top 2 bits mark the string form: 00, 01, 10 correspond to the 1-byte, 2-byte, 5-byte encodings ####

 #### top 2 bits = 00: encoding is 1 byte, remaining 6 bits give lengths 0-63 (2^6=64) ####
 #### top 2 bits = 01: encoding is 2 bytes, remaining 14 bits give lengths 0-16383 (2^14=16384) ####
 #### top 2 bits = 10: encoding is 5 bytes, the 4 bytes after the first hold the length, up to 2^32-1 (~4.2 billion) ####
 * |00pppppp| - 1 byte
 *      String value with length less than or equal to 63 bytes (6 bits).
 *      "pppppp" represents the unsigned 6 bit length.
 * |01pppppp|qqqqqqqq| - 2 bytes
 *      String value with length less than or equal to 16383 bytes (14 bits).
 *      IMPORTANT: The 14 bit number is stored in big endian.
 * |10000000|qqqqqqqq|rrrrrrrr|ssssssss|tttttttt| - 5 bytes
 *      String value with length greater than or equal to 16384 bytes.
 *      Only the 4 bytes following the first byte represents the length
 *      up to 2^32-1. The 6 lower bits of the first byte are not used and
 *      are set to zero.
 *      IMPORTANT: The 32 bit number is stored in big endian.

 *#################### dividing line between string and integer storage ####################
 *#### For integers the top 2 bits are fixed to 11; the following bits select the integer type ####
 * |11000000| - 3 bytes
 *      Integer encoded as int16_t (2 bytes).
 * |11010000| - 5 bytes
 *      Integer encoded as int32_t (4 bytes).
 * |11100000| - 9 bytes
 *      Integer encoded as int64_t (8 bytes).
 * |11110000| - 4 bytes
 *      Integer encoded as 24 bit signed (3 bytes).
 * |11111110| - 2 bytes
 *      Integer encoded as 8 bit signed (1 byte).
 * |1111xxxx| - (with xxxx between 0000 and 1101) immediate 4 bit integer.
 *      Unsigned integer from 0 to 12. The encoded value is actually from
 *      1 to 13 because 0000 and 1111 can not be used, so 1 should be
 *      subtracted from the encoded 4 bit value to obtain the right value.
 * |11111111| - End of ziplist special entry.
 */

content holds the node's value, which can be a byte array or an integer of various types; the node's type and length determine the encoding.

        

Storing a byte array: the top two bits of the encoding, 00, indicate that the node holds a byte array; the remaining six bits, 001011, record the array's length, 11; and the content attribute holds the node's value, "hello world".

        

Storing an integer: the encoding 11000000 indicates that the node stores an int16_t integer, and the content attribute holds the node's value, 10086.

  • Prevlen properties

Finally we come to the prevlen attribute, which is the key one. As said before, the compressed list is designed to save memory, and prevlen serves exactly that purpose. Recall that a doubly linked list reaches the previous node through a pointer; since the diversity of its elements prevents the compressed list from being laid out like an array, it records the previous node's length in prevlen to serve as the "pointer" to the previous entry.

The prevlen attribute records the length, in bytes, of the previous node in the compressed list. It can be 1 byte or 5 bytes long:

  1. If the previous node is shorter than 254 bytes, prevlen is 1 byte long and stores the previous node's length in that byte.

  2. If the previous node is 254 bytes or longer, prevlen is 5 bytes long: the first byte is set to 0xFE, and the following four bytes hold the previous node's length.

Think: notice that the first byte here is set to 0xFE rather than 0xFF. Why might that be?

That's right! As mentioned earlier, zlend is the special value 0xFF that marks the end of the compressed list, so 0xFE must be used instead. There was once an issue on the Redis tracker claiming that the ziplist comments about this were wrong; in the end antirez found the comment had indeed been written incorrectly and happily fixed it. Ha!
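A small sketch of how such a prevlen field could be written, following the rule above (the corresponding Redis routine in ziplist.c additionally handles byte order; the function name here is mine):

#include <stdint.h>
#include <string.h>

#define ZIP_BIG_PREVLEN 254   /* 0xFE marks the 5-byte form; 0xFF is reserved for zlend */

/* Write prevlen at p; return how many bytes the field occupies (1 or 5). */
static unsigned int prevlen_encode(unsigned char *p, uint32_t prevlen) {
    if (prevlen < ZIP_BIG_PREVLEN) {
        p[0] = (unsigned char)prevlen;          /* 1-byte form */
        return 1;
    }
    p[0] = ZIP_BIG_PREVLEN;                     /* 5-byte form marker */
    memcpy(p + 1, &prevlen, sizeof(prevlen));   /* 4-byte length (host order here) */
    return 5;
}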



One more question worth considering: why is prevlen either 1 byte or 5 bytes? Why are there no intermediate 2-, 3-, or 4-byte forms? Answering this leads to one of today's key topics: chained updates.

Chain update problem

Consider this node-insertion scenario:

If a new node longer than 254 bytes is inserted at the head of the compressed list, the prevlen of the next node must become 5 bytes; but before the insertion that prevlen was 1 byte, so the node must be extended. In the extreme case, each extension forces the next, causing a chain reaction:

              

Consider another scenario, deleting a node:

If the node to be deleted is a small node that sits behind a large node, then after the small node is deleted, the node following it must record the length of the large node now in front of it, and it may face the same expansion problem:



With chained updates understood, let's return to the 1-byte-or-5-bytes question: with 2-4 byte forms the probability of triggering a chain reaction would be higher; jumping straight to the maximum of 5 bytes greatly reduces the probability of chained updates, so the author evidently judged the small memory waste to be worth it.

Ziplist is thus not designed to be good at modification: frequent changes cause memory copies, and once the amount of stored data exceeds a certain threshold the traversal cost rises as well.

0x09. Talk about Redis Zset and skip lists

The ZSet structure contains both a dictionary and a skip list. The skip list holds all the collection elements ordered by score, while the dictionary holds the mapping from member to score. The two structures share the members and scores of the same elements through pointers, so no extra memory is wasted.

typedef struct zset {
    dict *dict;
    zskiplist *zsl;
} zset;

The layout of the dictionary and the skip list inside a ZSet:



9.1 Implementation details of skip lists in ZSet

  • How the random level is generated

A skip list is a probabilistic data structure in which the number of levels for an inserted element is chosen at random. In his paper, William Pugh describes the computation as follows:
  1. Define the node's maximum level MaxLevel and a probability p; the default level lvl is 1

  2. Generate a random number r between 0 and 1; if r<p and lvl<MaxLevel, then lvl++

  3. Repeat step 2 until the generated r>p; lvl is then the number of levels at which to insert.

The paper gives this pseudo-code for generating a random level:

           

Redis's skip list implementation basically follows this idea, with a slight difference. Look at the random level code in the Redis source src/t_zset.c:

/* Returns a random level for the new skiplist node we are going to create.
 * The return value of this function is between 1 and ZSKIPLIST_MAXLEVEL
 * (both inclusive), with a powerlaw-alike distribution where higher
 * levels are less likely to be returned. */
int zslRandomLevel(void) {
    int level = 1;
    while ((random()&0xFFFF) < (ZSKIPLIST_P * 0xFFFF))
        level += 1;
    return (level<ZSKIPLIST_MAXLEVEL) ? level : ZSKIPLIST_MAXLEVEL;
}

Two of the macros used here are defined in redis.h:

#define ZSKIPLIST_MAXLEVEL 32 /* Should be enough for 2^32 elements */
#define ZSKIPLIST_P 0.25 /* Skiplist P = 1/4 */

The interesting part is the condition inside the while:

(random()&0xFFFF) < (ZSKIPLIST_P*0xFFFF)

When I first saw this formula I was a little surprised, because of the bit manipulation, and I wanted to work out why antirez wrote it this way.

                                             

My first assumption was that random() returns a floating-point number in [0,1], and on that assumption the formula made no sense to me.



For example, 0.25 stored as a 32-bit float is 0x3E800000 in hexadecimal; AND-ing that with 0xFFFF gives 0, which cannot be what the formula intends.



Likewise, 0.5 converted to a 32-bit float is 0x3F000000 in hexadecimal; AND-ed with 0xFFFF the result is again 0, not what we would expect.


I recalled that the C math library does not seem to have a random function that directly returns floats, so I went back to the Redis source (I downloaded version 3.2) and, while I did not find an implementation of random() itself, I found several other places where it is used:

The use of random() in dict.c



Use of random() in cluster.c



So random() does not return a floating-point number in [0,1] after all; it returns an integer (effectively a uint32-sized value), and with that antirez's formula becomes easy to understand.

ZSKIPLIST_P*0xFFFF

Since ZSKIPLIST_P=0.25, ZSKIPLIST_P*0xFFFF is equivalent to 0xFFFF shifted right by 2 bits, i.e. 0x3FFF. Assuming random() is reasonably uniform, after masking off the high bits with 0xFFFF, the low 16 bits fall uniformly in 0x0000-0xFFFF, so the while condition holds with probability only 1/4; more generally, it holds with probability ZSKIPLIST_P.
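A quick empirical check makes the distribution visible; this assumes the zslRandomLevel() and the two macros shown above are compiled into the same file:

#include <stdio.h>
#include <stdlib.h>   /* random() */

int main(void) {
    int counts[ZSKIPLIST_MAXLEVEL + 1] = {0};
    for (int i = 0; i < 1000000; i++) counts[zslRandomLevel()]++;
    for (int lvl = 1; lvl <= 6; lvl++)
        printf("level %d: %.4f\n", lvl, counts[lvl] / 1000000.0);
    /* expected frequency of level k: (1-p) * p^(k-1) with p = 0.25 */
    return 0;
}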


Implementations of the random level vary, but the crux is the random number generation. Here is LevelDB's skip list level generation code:

template <typename Key, typename Value>
int SkipList<Key, Value>::randomLevel() {

  static const unsigned int kBranching = 4;
  int height = 1;
  while (height < kMaxLevel && ((::Next(rnd_) % kBranching) == 0)) {
    height++;
  }
  assert(height > 0);
  assert(height <= kMaxLevel);
  return height;
}

// Park-Miller "minimal standard" LCG: seed = seed * 16807 mod (2^31 - 1)
uint32_t Next( uint32_t& seed) {
  seed = seed & 0x7fffffffu;  // keep seed within [0, 2^31 - 1]

  if (seed == 0 || seed == 2147483647L) { 
    seed = 1;
  }
  static const uint32_t M = 2147483647L;
  static const uint64_t A = 16807;
  uint64_t product = seed * A;
  seed = static_cast<uint32_t>((product >> 31) + (product & M));
  if (seed > M) {
    seed -= M;
  }
  return seed;
}

As you can see, LevelDB draws a random number modulo kBranching (4) and adds a level only when the result is 0, i.e. with probability 1/4, achieving the same probability balance without using floating-point numbers.
  • Average number of levels of a skip list node

It is easy to see that the higher a node's level, the lower its probability of occurrence. The level distribution follows a power-law-like pattern: the larger the value, the smaller its probability.

If the frequency of something is a power relation to one of its attributes, the frequency is said to follow a power law. Its hallmark is that a few events account for a large share of the total frequency, while the remaining majority of events account for only a small share.

Applied to the skip list's random levels, this means the vast majority of nodes have only one or a few levels, while tall nodes are rare and occur with very low probability.




Quantitative analysis is as follows:
  1. A node's level is at least 1, and levels greater than 1 follow a probability distribution.

  2. The probability that the level is exactly 1 is p^0 * (1-p).

  3. The probability that the level is exactly 2 is p^1 * (1-p).

  4. The probability that the level is exactly 3 is p^2 * (1-p).

  5. The probability that the level is exactly 4 is p^3 * (1-p).

  6. In general, the probability that the level is exactly k is p^(k-1) * (1-p).



Finding the average number of levels then becomes a matter of computing the expectation of this probability distribution.



Listing every possible value V together with its probability P gives the full distribution, from which the expectation can be computed directly.


The sum involved is just the geometric-style series from high school, handled with the usual shift-and-subtract technique. From it you can see that the expected node level is inversely proportional to 1-p. For Redis, with p = 0.25, the expected node level is about 1.33.
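Written out explicitly (a standard geometric-series derivation; this is the same calculation the shift-and-subtract trick performs):

E[L] = \sum_{k=1}^{\infty} k \, p^{k-1}(1-p)
     = (1-p) \sum_{k=1}^{\infty} k \, p^{k-1}
     = (1-p) \cdot \frac{1}{(1-p)^{2}}
     = \frac{1}{1-p}

For p = 0.25 this gives E[L] = 1/0.75 = 4/3 ≈ 1.33, the figure quoted above.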


The Redis source also shows in detail how the skip list is adjusted on insert and delete; this article will not expand on that. The code is not hard to follow: it is plain C without fancy tricks, so read it boldly.

0x0A. Talk about the Clustered Redis and Gossip protocols

The cluster version of Redis sounds grand, and it is indeed a lot more complex than the single-instance one-master-one-slave or one-master-many-slaves models. Internet architectures always evolve along with the business.

A.1 Basic Information about Clusters

  • Single instance Redis architecture

With one master, N slaves, and read/write separation, a single-instance Redis works well as a cache, and the Sentinel mechanism provides master/slave failover.


Single instance with one master and two slaves + read/write separation structure:



In the single-instance case, there is essentially only one Master doing the storage. Even on a machine with 128GB of memory, it is generally recommended to keep utilization below 70%-80%, so at most about 100GB can be used, and staying around 50% is wise in practice: a large data set means a high persistence cost, which can severely block the service and even end in a master switch.


If the single instance serves only as a cache, then when the service fails or blocks, besides the cache-breakdown problem, the flood of requests falling through may take MySQL down with it.




If the single instance serves as primary storage, the problem is bigger: because persistence is mandatory for primary storage, both bgsave and AOF cause disk-flushing blockage, degrading the request success rate, and that cannot be solved within a single instance.


Therefore, we want a multi-master, multi-slave Redis system, which improves both capacity and stability whether used as primary storage or as a cache. Nevertheless, my advice is still: try not to use Redis as primary storage!


  • Clustering and sharding

To support clustering, the first thing to overcome is the sharding problem, also known as the consistent hashing problem. There are three common solutions:


Client-side sharding: similar to hashing keys and taking a modulus; it works when the client has complete control over the number of servers (see the sketch after this list).

Middle layer sharding: In this case, a middle layer is added between the client and the server, acting as the administrator and scheduler. The client requests are sent to the middle layer, and the middle layer realizes the forwarding and recycling of requests. Of course, the most important role of the middle layer is the dynamic management of multiple servers.

Server-side sharding: no middle layer; this is the decentralized management mode. The client sends its request directly to any node; if the requested node does not hold the required data, it replies with a redirect telling the client where the data actually lives. Client and server thus cooperate, completing the request through redirection.
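A minimal C sketch of the first approach, client-side sharding (the hash function and names are my own illustration, not from any particular Redis client): the client alone decides which of the N servers owns a key. Its weakness is also visible immediately: changing num_servers remaps almost every key, which is exactly the problem consistent hashing tries to soften.

#include <stdio.h>

/* djb2-style string hash, used purely for illustration. */
static unsigned long hash_key(const char *key) {
    unsigned long h = 5381;
    while (*key) h = h * 33 + (unsigned char)*key++;
    return h;
}

/* Client-side sharding: the client picks the server by itself. */
static int pick_server(const char *key, int num_servers) {
    return (int)(hash_key(key) % num_servers);
}

int main(void) {
    printf("user:1000 -> server %d\n", pick_server("user:1000", 4));
    return 0;
}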

  • Cluster version of Redis for mid-tier sharding

As mentioned earlier, moving to N masters and N slaves improves processing power and stability, but it then faces the consistent hashing problem: what happens to the data when capacity is dynamically expanded.


Before the official cluster version of Redis was released, the industry could not wait and rolled out self-developed Redis cluster solutions, including Wandoujia's Codis in China and Twitter's Twemproxy abroad.


The core idea is to add a sharding layer between the Redis servers and the clients, with the sharding layer performing the consistent hashing and data sharding. Each company's practice differs in the details, but the core problems to solve are the same: capacity expansion, failover, data integrity, data consistency, and request-handling latency in a multi-Redis scenario.



Many industry solutions, such as Codis combined with LVS, have been applied in production environments with good performance. The official cluster version only arrived in Redis 3.0, and many companies were unwilling to be guinea pigs for its stability; but it has since iterated to Redis 5.x, and the official cluster version is actually pretty good, at least in my opinion.
  • Official cluster version: server-side sharding

The official version differs from Codis and Twemproxy above in that it implements sharding at the server layer. In other words, there is no middle layer: the service nodes themselves implement sharding. You can also see it as the sharding functionality being integrated into the Redis service itself, with no separate sharding module.


As mentioned earlier, the official cluster introduced the concept of the slot for data sharding: the slots are allocated across multiple Master nodes, each Master is then configured with N slave nodes, and that forms the official multi-instance sharded cluster architecture.
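Per the Redis Cluster specification, a key maps to one of the 16384 slots via CRC16; here is a simplified sketch. The real keyHashSlot() in src/cluster.c additionally honors {hash tags}, omitted here, and crc16() is the CRC16-CCITT routine shipped in Redis' src/crc16.c:

#include <stdint.h>

uint16_t crc16(const char *buf, int len); /* provided by Redis' src/crc16.c */

#define CLUSTER_SLOTS 16384

/* Simplified slot mapping: slot = CRC16(key) mod 16384. */
unsigned int keyHashSlotSimplified(const char *key, int keylen) {
    return crc16(key, keylen) % CLUSTER_SLOTS;
}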


Redis Cluster is a distributed cluster that shares data among multiple Redis nodes. On the server side, the nodes communicate through a special protocol, which plays the role the middle layer's management channel would otherwise play; this protocol is called the Gossip protocol.


The Gossip protocol is a distributed-system consistency protocol designed to solve the problem of state notification among multiple nodes in a cluster; it is the basis of cluster management. The figure shows the official cluster architecture based on the Gossip protocol:



A.2 Basic operating principles of the Redis Cluster

  • Node state information structure

Each node in a Cluster maintains a copy of the current state of the entire Cluster in its view, including:
  1. Current Cluster Status

  2. Information about slots that each node in the cluster is responsible for, and their migrate status

  3. The master/slave role of each node in the cluster

  4. The survival status and unreachable vote of each node in the cluster

In other words, this information is the content and subject matter of the gossip spread among cluster nodes, and it is fairly comprehensive. Each node passes on both its own information and what it knows about others, so eventually everyone's picture is complete and accurate; unlike the Byzantine Generals problem, nodes are assumed honest, so the information has high credibility.




Based on the Gossip protocol, when the cluster state changes, e.g. a new node joins, slots migrate, a node goes down, or a slave is promoted to new master, we want the change discovered as soon as possible and propagated to every node in the cluster until agreement is reached. The heartbeats between nodes (PING, PONG, MEET) and the data they carry are the most important vehicle of cluster state propagation.
  • Concept of the Gossip protocol

The Gossip Protocol, also known as the Epidemic Protocol, is a protocol for exchanging information between nodes or processes based on how an epidemic spreads.

It is widely used in distributed systems. For example, we can use the Gossip protocol to ensure that all nodes in the network have the same data.

The Gossip Protocol was originally created in 1987 by Alan Demers, a researcher at Xerox’s Palo Alto Research Center.

https://www.iteblog.com/archives/2505.html
The Gossip protocol is a mature protocol in P2P networks. The great benefit of The Gossip protocol is that even if the number of nodes in the cluster increases, the load on each node does not increase much, and is almost constant. This allows the cluster that Consul manages to scale out to thousands of nodes.

The Gossip algorithm is also known as anti-entropy. Entropy is a physical concept that represents chaos, and anti-entropy is the search for consistency in chaos. This fully illustrates the characteristics of Gossip: In a bounded network, each node communicates randomly with other nodes, and after some chaotic communication, the state of all nodes eventually reaches an agreement. Each node may know all other nodes or only a few neighbor nodes. As long as these nodes can be connected through the network, their status is consistent eventually. Of course, this is also the characteristic of epidemic transmission.

https://www.backendcloud.cn/2017/11/12/raft-gossip/
The above descriptions are rather academic, but the Gossip protocol is not unfamiliar to those of us who enjoy a bit of gossip: it is literally the "rumor protocol". Simply put, rumors spread widely and fast, as you can experience for yourself. Many algorithms in computing are derived from life, yet stand above it.


  • The use of the Gossip protocol

Redis clusters are decentralized and communicate with each other using the Gossip protocol. Messages in the cluster can be of the following types:
  1. Meet: via the CLUSTER MEET ip port command, a node in an existing cluster invites a new node to join the cluster.

  2. Ping: every second, a node sends ping messages to other nodes in the cluster, carrying its own state and the cluster metadata it knows, such as node addresses, slot information, node status, and last communication time.

  3. Pong: after receiving a ping or meet message, a node replies with a pong message containing its own known node information.

  4. Fail: after a node judges that another node is down, it broadcasts the fail message to the whole cluster; nodes receiving it mark that node offline.

Due to decentralization and this communication mechanism, Redis Cluster chose eventual consistency and basic availability. For example, when a new node joins (meet), only the inviting node and the invited node know about it at first; the rest of the cluster learns of it as ping messages spread layer by layer.


Except for fail, which is broadcast to the whole network immediately, other events, such as new nodes joining, nodes coming back online, a slave being elected master, or slot changes, must wait to be propagated. In other words, the Gossip protocol is an eventual consistency protocol.


The Gossip protocol places high demands on server clocks: inaccurate timestamps can invalidate a node's judgments about messages. As the node count grows, the network overhead also puts pressure on the servers, and more nodes means a longer time to reach eventual consistency. The official recommendation is therefore a maximum of about 1000 nodes.


The figure shows the communication interaction diagram when the node server is newly added:



The Redis official cluster is a decentralized peer-to-peer (P2P) network, a model that was very popular in the early years of eDonkey, BitTorrent, and the like.


In the Redis cluster, the Gossip protocol plays the role of the decentralized communication protocol: following fixed communication rules, the whole cluster behaves autonomously without any central management node.
  • Fault detection based on Gossip protocol

Each node in the cluster periodically sends PING messages to other nodes in the cluster to exchange status information of each node and check the status of each node: online, suspected offline PFAIL, offline FAIL.


Recording locally: when master node A receives a message from master node B that master node D is in the PFAIL state, A finds the clusterNode structure for D in its clusterState.nodes dictionary, adds B's offline report to that structure's fail_reports linked list, and passes D's suspected-offline status on to other nodes via the Gossip protocol.


Deciding together: if more than half of the masters in the cluster report that master D is suspected offline, D is marked offline (FAIL); the node that marks it broadcasts D's FAIL message to the cluster, and every node that receives it immediately updates D's status in its nodes dictionary, marking it offline.


The final verdict: two conditions must both hold for a node to be marked FAIL:
  1. More than half of the primary nodes have Node in the PFAIL state.

  2. The current node also marks node as a PFAIL state.

That is to say, when the current node finds that another node seems to be down, it writes this in its own little notebook and tells its buddies to take a look too; only when half or more of the buddies agree the node is down, and the current node itself also thinks so, is the node really pronounced dead. Careful stuff.

0x0B. Talk about understanding Redis’ memory reclaim mechanism

Redis is an in-memory database: if we only ever write in without deleting, memory will burst sooner or later. Many of the folks who use Redis as primary storage have tasted that bitter fruit, unless your company is rich enough for terabytes of memory, or your data stops growing past a certain point, which is another matter.

For Redis to run safely and stably, keeping memory usage within a certain threshold is necessary: delete what should be deleted, clean what should be cleaned, and leave the memory to the keys that really need it. Think of a great river with several warning levels to ensure it neither bursts its banks nor dries up; Redis is the same, except Redis only worries about bursting the levees. Here's a picture:



In the figure, the machine has 128GB of memory; occupying 64GB is a fairly safe level. If usage approaches 80%, about 100GB, Redis is considered to be carrying a heavy load. The exact proportions can be set according to your company's and your own operational experience.

I just want to say: for the sake of safety and stability, do not assume that 128GB of memory means 128GB of data can be stored; it is always discounted.

B.1 Where does the Reclaimed Memory Come from

Redis’ memory usage is divided into two parts: the storage of key-value pairs and the running of Redis itself. Obviously, we can’t retrieve the latter, so we have to start with key-value pairs, which can be classified into several categories: expired, non-expired, hot, and cold data. Expired key values need to be deleted. What if memory is still out after all expired key pairs are deleted? Well, that’s only gonna kick out some of the data.



B.2 How Can I Delete Expired Key And Value Pairs

To implement key-value pair deletion we need to understand the following:

  • Where are key-value pairs with expiration timeouts stored?

  • How do I determine if a key-value pair with a timeout can be deleted?

  • What are the removal mechanisms and how to select them?

1. Storage of key and value pairs

src/server.h gives the answer in the redisDb structure:

typedef struct redisDb {
    dict *dict;                 /* The keyspace for this DB */
    dict *expires;              /* Timeout of keys with a timeout set */
    dict *blocking_keys;        /* Keys with clients waiting for data (BLPOP)*/
    dict *ready_keys;           /* Blocked keys that received a PUSH */
    dict *watched_keys;         /* WATCHED keys for MULTI/EXEC CAS */
    int id;                     /* Database ID */
    long long avg_ttl;          /* Average TTL, just for stats */
    unsigned long expires_cursor; /* Cursor of the active expire cycle. */
    list *defrag_later;         /* List of key names to attempt to defrag one by one, gradually. */
} redisDb;

A Redis instance is essentially one big key-value store: keys are strings, and values can be objects of several types: strings, lists, sorted sets, sets, hashes, and so on. These key-value pairs are stored in the dict member of redisDb.



Redis has its own rules for organizing the key-value pairs that carry an expiration, which the next steps cover; deleting a key simply means removing the target key-value pair from this dict.

The expires member of redisDb is also a dict, just like the key space. Essentially, expires is a subset of dict: it holds all the key-value pairs that have an expiration time set.

For keys, we can set the absolute and relative expiration times and view the remaining time:

  1. EXPIRE and PEXPIRE set a key's time to live in seconds and milliseconds respectively; these are relative-duration expirations

  2. EXPIREAT and PEXPIREAT set expiration at a second-level and millisecond-level Unix timestamp respectively; these are absolute expirations

  3. TTL and PTTL view the remaining time to live of a key that has an expiration set

The above three sets of commands are useful when designing a cache, and should be noted by interested readers.

The expires dictionary does not store exactly the same content as the dict key space: its key is a pointer to the corresponding key object (shared, so no extra memory is wasted), and its value is a long long Unix timestamp in milliseconds. The relative durations given to EXPIRE and PEXPIRE are eventually converted to absolute timestamps too.



2. Determine whether the key-value pair is expired

To judge whether a key is expired and deletable, first check whether it exists in the expires dictionary; if so, compare the stored expiration timestamp with the current timestamp to decide whether to delete. The simple flow is shown in the figure:

                             
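A simplified C sketch of that check (it assumes Redis' internal headers and is modeled on the logic in db.c, not copied from it):

/* A key is expired iff it appears in db->expires and the stored
 * millisecond timestamp is behind the current time. */
int keyIsExpiredSketch(redisDb *db, robj *key) {
    dictEntry *de = dictFind(db->expires, key->ptr);
    if (de == NULL) return 0;                 /* no TTL set: never expires */
    long long when = dictGetSignedIntegerVal(de);
    return mstime() > when;                   /* past the stored timestamp */
}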

3. The policy for deleting key-value pairs

After the previous steps, we know Redis' two storage locations, the key space and the expires dictionary, the structure of the expires dictionary, and how to judge expiry. So how is the deletion itself implemented?

Instead of Redis, consider several possible deletion strategies:

  • Timed deletion: when a key's expiration is set, a timer is created; the moment the expiration arrives, the timer deletes the key-value pair immediately;

  • Periodic deletion: scan the database at fixed intervals, detecting and deleting expired key-value pairs;

  • Lazy deletion: an expired key-value pair is not deleted right away; deletion is tied to usage. When the key is accessed, first check whether it has expired: delete it if so, keep it otherwise;

Of these three, timed deletion and periodic deletion are active deletion at different time granularities, while lazy deletion is passive.

Each has pros and cons: timed deletion is best for memory but unfriendly to the CPU; lazy deletion is unfriendly to memory, since key-value pairs that are never touched again waste space; periodic deletion is the compromise between the two.

Redis adopts a combination of lazy deletion and periodic deletion. Timed deletion is usually implemented with a min-heap timer, but Redis' event loop, given the limited number and types of time events, stores time events in an unordered linked list; implementing timed deletion on that basis would mean an O(N) traversal to find the data that must be deleted soonest.

However, I think that if Antirez had wanted timed deletion, he would certainly not have been stopped by the unordered-list mechanism; so I personally do not believe the existing unordered list is the fundamental reason Redis avoids timed deletion. I venture to guess the only real reason is that Antirez simply does not consider timed deletion necessary.

4. Implementation details of periodic deletion

Periodic deletion sounds simple, but how are the execution frequency and the execution duration controlled?

If it runs too rarely, it degenerates into lazy deletion; if it runs too long, it approaches timed deletion. The trigger timing also needs thought. So let's see how Redis implements periodic deletion. In src/expire.c I found the activeExpireCycle function, which is the implementation of periodic deletion. Antirez commented the code in detail, all in English; reading it carefully yields at least a rough outline, so learning English and reading first-hand material is an important way to learn.

The core part of the code is about 210 lines, including comments. Here are the details:

#define ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP 20 /* Keys for each DB loop. */
#define ACTIVE_EXPIRE_CYCLE_FAST_DURATION 1000 /* Microseconds. */
#define ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC 25 /* Max % of CPU to use. */
#define ACTIVE_EXPIRE_CYCLE_ACCEPTABLE_STALE 10 /* % of stale keys after which
                                                   we do extra efforts. */

void activeExpireCycle(int type) {
    /* Adjust the running parameters according to the configured expire
     * effort. The default effort is 1, and the maximum configurable effort
     * is 10. */
    unsigned long
    effort = server.active_expire_effort-1, /* Rescale from 0 to 9. */
    config_keys_per_loop = ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP +
                           ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP/4*effort,
    config_cycle_fast_duration = ACTIVE_EXPIRE_CYCLE_FAST_DURATION +
                                 ACTIVE_EXPIRE_CYCLE_FAST_DURATION/4*effort,
    config_cycle_slow_time_perc = ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC +
                                  2*effort,
    config_cycle_acceptable_stale = ACTIVE_EXPIRE_CYCLE_ACCEPTABLE_STALE-
                                    effort;

    /* This function has some global state in order to continue the work
     * incrementally across calls. */
    static unsigned int current_db = 0; /* Last DB tested. */
    static int timelimit_exit = 0;      /* Time limit hit in previous call? */
    static long long last_fast_cycle = 0; /* When last fast cycle ran. */

    int j, iteration = 0;
    int dbs_per_call = CRON_DBS_PER_CALL;
    long long start = ustime(), timelimit, elapsed;

    /* When clients are paused the dataset should be static not just from the
     * POV of clients not being able to write, but also from the POV of
     * expires and evictions of keys not being performed. */
    if (clientsArePaused()) return;

    if (type == ACTIVE_EXPIRE_CYCLE_FAST) {
        /* Don't start a fast cycle if the previous cycle did not exit
         * for time limit, unless the percentage of estimated stale keys is
         * too high. Also never repeat a fast cycle for the same period
         * as the fast cycle total duration itself. */
        if (!timelimit_exit &&
            server.stat_expired_stale_perc < config_cycle_acceptable_stale)
            return;

        if (start < last_fast_cycle + (long long)config_cycle_fast_duration*2)
            return;

        last_fast_cycle = start;
    }

    /* We usually should test CRON_DBS_PER_CALL per iteration, with
     * two exceptions:
     *
     * 1) Don't test more DBs than we have.
     * 2) If last time we hit the time limit, we want to scan all DBs
     * in this iteration, as there is work to do in some DB and we don't want
     * expired keys to use memory for too much time. */
    if (dbs_per_call > server.dbnum || timelimit_exit)
        dbs_per_call = server.dbnum;

    /* We can use at max 'config_cycle_slow_time_perc' percentage of CPU
     * time per iteration. Since this function gets called with a frequency of
     * server.hz times per second, the following is the max amount of
     * microseconds we can spend in this function. */
    timelimit = config_cycle_slow_time_perc*1000000/server.hz/100;
    timelimit_exit = 0;
    if (timelimit <= 0) timelimit = 1;

    if (type == ACTIVE_EXPIRE_CYCLE_FAST)
        timelimit = config_cycle_fast_duration; /* in microseconds. */

    /* Accumulate some global stats as we expire keys, to have some idea
     * about the number of keys that are already logically expired, but still
     * existing inside the database. */
    long total_sampled = 0;
    long total_expired = 0;

    for (j = 0; j < dbs_per_call && timelimit_exit == 0; j++) {
        /* Expired and checked in a single loop. */
        unsigned long expired, sampled;

        redisDb *db = server.db+(current_db % server.dbnum);

        /* Increment the DB now so we are sure if we run out of time
         * in the current DB we'll restart from the next. This allows to
         * distribute the time evenly across DBs. */
        current_db++;

        /* Continue to expire if at the end of the cycle more than 25%
         * of the keys were expired. */
        do {
            unsigned long num, slots;
            long long now, ttl_sum;
            int ttl_samples;
            iteration++;

            /* If there is nothing to expire try next DB ASAP. */
            if ((num = dictSize(db->expires)) == 0) {
                db->avg_ttl = 0;
                break;
            }
            slots = dictSlots(db->expires);
            now = mstime();

            /* When there are less than 1% filled slots, sampling the key
             * space is expensive, so stop here waiting for better times...
             * The dictionary will be resized asap. */
            if (num && slots > DICT_HT_INITIAL_SIZE &&
                (num*100/slots < 1)) break;

            /* The main collection cycle. Sample random keys among keys
             * with an expire set, checking for expired ones. */
            expired = 0;
            sampled = 0;
            ttl_sum = 0;
            ttl_samples = 0;

            if (num > config_keys_per_loop)
                num = config_keys_per_loop;

            /* Here we access the low level representation of the hash table
             * for speed concerns: this makes this code coupled with dict.c,
             * but it hardly changed in ten years.
             *
             * Note that certain places of the hash table may be empty,
             * so we want also a stop condition about the number of
             * buckets that we scanned. However scanning for free buckets
             * is very fast: we are in the cache line scanning a sequential
             * array of NULL pointers, so we can scan a lot more buckets
             * than keys in the same time. */
            long max_buckets = num*20;
            long checked_buckets = 0;

            while (sampled < num && checked_buckets < max_buckets) {
                for (int table = 0; table < 2; table++) {
                    if (table == 1 && !dictIsRehashing(db->expires)) break;

                    unsigned long idx = db->expires_cursor;
                    idx &= db->expires->ht[table].sizemask;
                    dictEntry *de = db->expires->ht[table].table[idx];
                    long long ttl;

                    /* Scan the current bucket of the current table. */
                    checked_buckets++;
                    while(de) {
                        /* Get the next entry now since this entry may get
                         * deleted. */
                        dictEntry *e = de;
                        de = de->next;

                        ttl = dictGetSignedIntegerVal(e)-now;
                        if (activeExpireCycleTryExpire(db,e,now)) expired++;
                        if (ttl > 0) {
                            /* We want the average TTL of keys yet
                             * not expired. */
                            ttl_sum += ttl;
                            ttl_samples++;
                        }
                        sampled++;
                    }
                }
                db->expires_cursor++;
            }
            total_expired += expired;
            total_sampled += sampled;

            /* Update the average TTL stats for this database. */
            if (ttl_samples) {
                long long avg_ttl = ttl_sum/ttl_samples;

                /* Do a simple running average with a few samples.
                 * We just use the current estimate with a weight of 2%
                 * and the previous estimate with a weight of 98%. */
                if (db->avg_ttl == 0) db->avg_ttl = avg_ttl;
                db->avg_ttl = (db->avg_ttl/50)*49 + (avg_ttl/50);
            }

            /* We can't block forever here even if there are many keys to
             * expire. So after a given amount of milliseconds return to the
             * caller waiting for the other active expire cycle. */
            if ((iteration & 0xf) == 0) { /* check once every 16 iterations. */
                elapsed = ustime()-start;
                if (elapsed > timelimit) {
                    timelimit_exit = 1;
                    server.stat_expired_time_cap_reached_count++;
                    break;
                }
            }
            /* We don't repeat the cycle for the current database if there are
             * an acceptable amount of stale keys (logically expired but yet
             * not reclaimed). */
        } while ((expired*100/sampled) > config_cycle_acceptable_stale);
    }

    elapsed = ustime()-start;
    server.stat_expire_cycle_time_used += elapsed;
    latencyAddSampleIfNeeded("expire-cycle",elapsed/1000);

    /* Update our estimate of keys existing but yet to be expired.
     * Running average with this sample accounting for 5%. */
    double current_perc;
    if (total_sampled) {
        current_perc = (double)total_expired/total_sampled;
    } else
        current_perc = 0;

    /* Running average with this sample accounting for 5%. */
    server.stat_expired_stale_perc = (current_perc*0.05)+
                                     (server.stat_expired_stale_perc*0.95);
}

To tell the truth, there are a lot of details in this code. My understanding of the Redis source is limited, so I can only offer a rough interpretation, which inevitably has problems; readers with the time are advised to read the source themselves. Here is my rough version:

  • The algorithm is an adaptive process. When there are fewer expired keys, it takes less CPU time to process them. If there are many expired keys, it takes a radical way to process them, so as to avoid a large amount of memory consumption.

  • Since Redis has many databases (DBs), the algorithm scans them one by one; the next invocation continues from the DB where the last one stopped, a closed-loop process.

  • Periodic deletion has two modes, fast cycle and slow cycle, with the slow cycle as the main mode. The cycle frequency depends on server.hz, usually set to 10.

  • The slow cycle runs longer and may exit on its time limit; the fast cycle runs within about 1ms, i.e. shorter but more frequent runs. During execution, a DB is skipped once the proportion of expired keys in its sample falls below the acceptable-stale threshold.

Periodic deletion is thus an adaptive, closed-loop, probabilistic sampling scan, bounded by execution time and a CPU budget; it stops whenever a threshold is hit, quietly getting the job done without hurting responses to clients.

5. Deleting key-value pairs with DEL

Before Redis 4.0, DEL on a large key could block the server. Newer releases introduce BIO background threads and new commands such as UNLINK to implement lazy deletion: the key is removed from the key space immediately, and a BIO thread reclaims the memory asynchronously afterwards.

B.3 Memory Elimination Mechanism

To keep Redis running safely and stably, a maxmemory threshold is set. When memory usage reaches the threshold, new key-value pairs can no longer be written, and a memory eviction mechanism is needed. The Redis configuration offers several eviction policies to choose from:

  • noeviction: when memory is insufficient for new data, new write operations report an error;

  • allkeys-lru: when memory is insufficient for new writes, remove the least recently used keys from the whole key space;

  • allkeys-random: when memory is insufficient for new data, remove random keys from the whole key space;

  • volatile-lru: when memory is insufficient for new writes, remove the least recently used keys among those with an expiration set;

  • volatile-random: when memory is insufficient for new data, remove random keys among those with an expiration set;

  • volatile-ttl: when memory is insufficient for new data, among keys with an expiration set, remove those expiring soonest first.

The last three policies operate only on keys that have an expiration set; when the expires dictionary is empty they degenerate into noeviction and writes fail. Purely random eviction is rarely desirable, so the second option, allkeys-lru, evicting by LRU across the whole key space, is the usual choice.
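In redis.conf that choice looks roughly like this (a hedged sketch; the values are illustrative, not recommendations):

maxmemory 100gb
maxmemory-policy allkeys-lru
# keys sampled per eviction; larger values approximate exact LRU more closely
maxmemory-samples 5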

Personally, I think Antirez has always had an engineering mindset and is good at using probabilistic designs for approximate implementations, and LRU is no exception: Redis implements an approximate LRU algorithm that, after several versions of iteration, comes quite close to the theoretical LRU. This is good material in its own right; I plan to discuss the LRU algorithm separately later.

The expired-key deletion policy emphasizes operations on expired keys: when expired keys exist and memory is still sufficient, Redis does not invoke the eviction mechanism to free space; it simply deletes expired keys according to the deletion policy.

The memory eviction mechanism emphasizes evicting data when memory is insufficient: even keys that have not reached their expiration time, or that have no expiration set at all, may be deleted according to the configured policy, to guarantee that new data can still be written.

0x0C. Talk about understanding Redis data synchronization mechanism and principle

To understand the relationship between persistence and data synchronization, you need to analyze it from the perspective of single point of failure and high availability:

C.1 Single Point Of Failure

If we have only a single Redis machine serving as a cache, with hotspot data persisted to disk, then at some point this single machine will fail and go down. During that window the cache is gone, all requests crush the primary storage service, and the monitor pulls the downed Redis back up.

After the restart, the machine can Load disk RDB data for quick recovery. The recovery time depends on the amount of data, which generally ranges from seconds to minutes. After the recovery is complete, the hot data is still in place.

The persistence mechanism is thus very useful for single-point Redis. Text alone puts people to sleep, so here's a picture:



In a high-availability system, downtime of a single cache point is not allowed, hence the master-slave structure: the master's data is backed up to multiple slaves. If the master goes down, the best-conditioned slave can immediately be promoted to the new master to take over write services, and the other slaves then synchronize data from the new master, keeping the whole Redis cache system highly available.

The figure shows the failover of the master node in a Redis system with one master, two slaves, and read/write separation. The whole process proceeds without interrupting normal service, greatly improving the system's availability:



From these two points of analysis, a small conclusion can be drawn: persistence makes single points of failure no longer scary, and data synchronization gives high availability wings.

We understand the important role of data synchronization to Redis, then continue to see the implementation principle and process of data synchronization, key difficulties and other details of it!

C.2 CAP theory in Redis system

Readers with a background in distributed storage surely know the CAP theory. To my shame, I only came across it in March 2018, when I changed jobs and interviewed for a back-end position at Face++ (Megvii). Apart from the CAP question, the other questions were all within range, so I got the offer in the end.

In theoretical computer science, the CAP theorem, also known as Brewer's theorem, stems from a conjecture made by Eric Brewer, a computer scientist at the University of California, Berkeley, at the 2000 Symposium on Principles of Distributed Computing (PODC).

In 2002, Seth Gilbert and Nancy Lynch of the Massachusetts Institute of Technology published a proof of Brewer's conjecture, making it a theorem. It states that a distributed computing system cannot simultaneously satisfy all three of the following:

  • C Consistency

  • A Availability

  • P Partition tolerance

Take a look at a picture drawn by Ruan Yifeng:

                                    

The key to understanding CP and AP is partition tolerance. Network partitions are common in distributed storage. Even if the machines are in the same room, they are not all in one rack or one switch.

So even a LAN sees network jitter. I did DPI work for over a year, and the three terms that left the deepest impression about network transmission are: packet loss, out-of-order delivery, and retransmission. The network that looks calm to us may be stormy from the server's point of view; one misstep and it breaks. And when the network is cut, a network partition appears.

For Redis data synchronization, suppose the slave and master sit on two racks and at some moment the network between them breaks. With read/write separation in effect, the slave can no longer synchronize data from the master: if reads from the slave continue, the data becomes inconsistent; if consistency is enforced, the slave cannot serve and becomes unavailable. So under the influence of P, C and A cannot both be kept.

The other cases follow similarly, and the conclusion is: when multiple Redis machines sit across different networks, once a network failure occurs, data consistency and service availability cannot both be satisfied; the Redis system must choose. In fact, Redis chose availability, or rather, Redis chose another kind of consistency: eventual consistency.

C.3 Final Consistency and replication of Redis

Redis chose eventual consistency: the data is not guaranteed to be consistent at every moment, and Redis master-slave synchronization is asynchronous by default.

Let me explain synchronous replication and asynchronous replication (note: I deliberately write "synchronous replication" rather than "synchronous synchronization", to spare the reader a tongue-twister):



A picture is worth a thousand words. Look at the red numbers to know the difference between synchronous and asynchronous replication:

  • Asynchronous replication: When the client writes hello World to the master node, the master node sends OK back to the client, and the interaction between the master node and the client is complete. The master node synchronizes hello World to the slave node, and the slave node sends OK back to the master node. The client does not need to wait for the synchronization to complete. So the whole process is implemented asynchronously.

  • Synchronous replication: When the client writes hello World to the master node, the master node synchronizes hello World to the slave node, the slave node replies OK to the master node, and then the master node replies OK to the client. The client needs to wait for the synchronization to complete, so the whole process is implemented synchronously.

Redis chose asynchronous replication by default to avoid making clients wait, which better fits real-world requirements; this behavior can be adjusted according to your needs.
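For example, when one particular write needs a stronger guarantee, the WAIT command (available since Redis 3.0) blocks the client until that write has been acknowledged by the given number of replicas; this is not true synchronous replication, but a per-write safety knob. A hedged sketch:

SET order:1 "hello world"
WAIT 1 1000   # block until at least 1 replica acks, or 1000 ms pass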

1. Master-slave and slave-slave replication: for high availability, consider one master with four slaves; if all four slaves synchronize from the master at the same time, the pressure on the master is considerable. Since Redis is eventually consistent anyway, it later introduced slave-of-slave replication, evolving the single-layer replication structure into a multi-layer one. I drew a diagram:



2. Full replication and incremental replication

Full replication is the data replication in the initialization phase that occurs when a node recovers from a failure or when a new slave node is added. This replication is completed by synchronizing all the data of the master node to the slave node, so the cost is large but inevitable.

Incremental replication is a data replication mode that is carried out at every moment after the master node and slave node work properly. The data is synchronized in a trickle stream. This synchronization mode is light and fast, and has many advantages.



3. Analyze the full replication process

Redis’ full replication process is divided into three main stages:

  • Snapshot phase: the slave node initiates full replication to the master with SYNC; the master runs bgsave to snapshot all the data in memory and sends it to the slave; the slave releases its old memory, then loads and parses the new snapshot.

  • Buffer phase: the master synchronizes to the slave the write commands stored in the replication buffer, the commands that arrived between the start of bgsave and the completion of snapshot loading on the slave; they must be recorded, or data would be lost.

  • Incremental phase: once the buffer is drained, the master simply synchronizes new write commands to the slave as they arrive, keeping master and slave in step.

Take a diagram from Reference 1, which is very well written:



Consider a multi-slave concurrent full replication problem: what if there are multiple slave nodes simultaneously making full synchronization requests to the master node?

The Redis master node is smart and honest. For example, there are 3 slave nodes A/B/C making full SYNC requests to the Redis master node.

  • If B's and C's SYNC commands arrive while the master is still running bgsave for A, the master can synchronize the one snapshot plus its buffer to A, B, and C at the same time, improving efficiency while preserving correctness.

  • If the master has already finished A's snapshot and is syncing A's buffer when B and C arrive, it can only repeat for B and C, after finishing with A, the same procedure it ran for A. The master doesn't get lazy: it redoes the whole process, cumbersome but correct.

Consider the full-replication loop problem: bgsave on the master is time-consuming and memory-hungry, and during loading the slave goes through dropping old data -> freeing memory -> loading new data, a dynamic process in which memory falls and then rises. Whether it is the master taking the snapshot or the slave loading the data, both take time and resources.

Setting performance aside, imagine the master's snapshot takes 1 minute, during which 10k new write commands arrive and are appended to the buffer. If the buffer is small and holds only 8k commands, then when the snapshot completes, 2k commands have been dropped, so the slave's full synchronization is missing data: a broken full sync.

In that case, the slave initiates SYNC again and gets stuck in a loop of full synchronizations. Setting the buffer size appropriately is therefore very important.
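The buffer in question is the replica client output buffer; a hedged redis.conf sketch (the class is named slave before Redis 5.0 and replica afterwards; the values are illustrative):

# disconnect a replica if its output buffer exceeds 512mb outright,
# or stays above 128mb for 120 seconds straight
client-output-buffer-limit replica 512mb 128mb 120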



4. Analyze the incremental replication process

The incremental replication process is somewhat simpler, but very useful. Imagine a complex network environment in which not every disconnection is unrecoverable; if every reconnection triggered a full copy, the master would be worked to death. Incremental replication is thus an optimization of the replication process for messy networks: it lets the slave fall behind for a while and catch up in the end.

Incremental replication is a typical producer-consumer model, implemented with a fixed-length circular array (a ring queue). When the buffer is full, new data overwrites the old, so as the slave copies it feeds its offset back to the master, ensuring nothing goes missing unnoticed.

This process is easy to understand; MQs like Kafka behave similarly. With a reasonably sized buffer, the slave's consumption capacity should in theory exceed the master's production rate, so buffer entries mostly get overwritten only when the network stays down too long and the slave's consumption lags badly; only then is a full copy needed again.
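The ring buffer described above is the replication backlog, sized in redis.conf (a hedged sketch; the value is illustrative):

# a larger backlog lets a replica survive a longer disconnection
# with a partial resync instead of a full copy
repl-backlog-size 64mb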



5. Diskless replication

Before diving in: what is diskless replication? The "disk" is the disk, whether a mechanical drive or an SSD; either way it is much slower than memory, and we all know I/O tends to dominate server-side time cost. So a high-I/O operation like full replication, especially under high concurrency, greatly affects both the Redis service itself and other programs on the machine. The old model looked like this:



Since Redis 2.8.18, diskless replication has been available: it skips generating the RDB file on disk before the network transfer; instead, the master traverses its in-memory data and serializes it directly to the slave's socket. The slave side is unchanged: it still writes the received data to its local disk first, and loads it into memory once the transfer completes. Diskless replication is therefore much friendlier to the master's I/O.
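A hedged redis.conf sketch enabling it (the values are illustrative):

# serialize the RDB straight to replica sockets, skipping the master's disk
repl-diskless-sync yes
# wait a few seconds so several replicas can share one transfer
repl-diskless-sync-delay 5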

0x0D. Talk about distributed locking based on Redis and Redlock algorithm

D.1 Introduction to Redis-based distributed Locking

Initially, distributed locks were built with the setnx and expire commands, but these two commands are not atomic: if the client dies after setnx succeeds but before expire is set, the lock can never be released. Starting with Redis 2.6.12, the SET command accepts NX and PX options, making set-and-expire a single atomic operation.



In a single master-slave Redis setup, the client normally acquires the lock on the Master, which then synchronizes it to the Slave. If the Master goes down right after the client acquired the lock, before the lock was synchronized to the Slave, then Sentinel promotes the Slave to Master, and the new Master has no record of the unsynchronized lock. If a new client now asks the new Master for the lock, two clients end up holding the same lock. Look at the diagram and walk through the process:

  

To guarantee that a lock can only be released by its owner, a unique verification value must be attached. In summary, the simple flow of acquiring and releasing a lock on a single Redis node is:

SET resource_name unique_value NX PX 30000

-- Release the lock: compare unique_value first, to avoid releasing
-- a lock now held by someone else
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
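A hedged usage sketch (the key name and token are my own illustration): the same unique value travels with both the acquire and the release:

SET order:42:lock token-client1 NX PX 30000
EVAL "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end" 1 order:42:lock token-client1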

These are a few points of distributed locking based on single Redis.

D.2 Basic procedure of Redlock algorithm

The Redlock algorithm is the high-availability mode Antirez introduced on top of the single-node scheme: in a distributed Redis environment there are N completely independent Redis nodes, and on each of the N instances the lock is acquired and released with the same method as in the single-instance case.

Now assume there are 5 Redis master nodes (an odd number greater than 3), so that they will not all go down at the same time. During lock acquisition and release, the client does the following:

  • Gets the current Unix time, in milliseconds

  • When requesting a lock from Redis, the client sets a connection and response timeout much smaller than the lock's expiry time, to avoid dead-waiting on a Redis node that is down

  • The client computes the time spent acquiring the lock by subtracting the start timestamp from the current time. The lock is successfully acquired if and only if it was obtained on more than half of the Redis nodes and the total time spent is less than the lock's validity time

  • If a lock is acquired, the true valid time of the key is equal to the valid time minus the time it took to acquire the lock. This is important

  • If the lock acquisition failed for some reason (locked on fewer than half plus one instances, or the acquisition time exceeded the validity time), the client should unlock on all Redis instances, even those it believes it failed to lock, because the server's success response may have been lost in transit; releasing once more costs nothing

These five steps are the core of the Redlock algorithm and a hot interview topic; interested readers should note them down!

D.3 Debate on whether the Redlock algorithm is safe

1. About Dr. Martin Kleipman

On February 8, 2016, Dr. Martin Kleppmann, an expert on distributed systems, laid out some principles of distributed lock design in the article "How to do distributed locking" and raised doubts about Antirez's Redlock algorithm. I found Dr. Kleppmann's personal website and a brief introduction.



Check it out with Sogou translator:

1. I am a Senior Research Associate and Affiliated Lecturer in the Department of Computer Science and Technology, University of Cambridge, funded by a Leverhulme Trust Early Career Fellowship and the Isaac Newton Trust. I work on local-first collaborative software and distributed systems security. 2. I am also a researcher and Director of Studies in Computer Science at Corpus Christi College, Cambridge, where I do undergraduate teaching. 3. In 2017, I published a book for O'Reilly called Designing Data-Intensive Applications. It covers the architecture of a broad range of databases and distributed data processing systems and is one of the publisher's best sellers. 4. I speak regularly at conferences, and recordings of my talks have been viewed more than 150,000 times. 5. I have worked on various open source projects, including Automerge, Apache Avro, and Apache Samza. 6. From 2007 to 2014, I worked as an industrial software engineer and entrepreneur. I co-founded Rapportive (acquired by LinkedIn in 2012) and Go Test It (acquired by Red Gate Software in 2009). 7. I have composed several musical works, including a German-language musical theatre adaptation of Tonke Dragt's book The Towers of February, which premiered in 2007 with a cast of 150.

Excellent people excel at everything: teaching, publishing books, writing open source software, founding companies, even composing musicals.

                                

2. The main points of Dr. Martin’s article

In the article, Martin Kleppmann discusses many fundamental problems of distributed systems, especially the asynchronous model of distributed computing. The article is in two parts: the first half describes some principles of distributed locking, and the second half raises specific concerns about Redlock:

  • Martin points out that even a perfectly implemented distributed lock cannot provide sufficient safety unless the shared resource itself participates by providing some kind of fencing mechanism

  • Martin points out that since Redlock is essentially built on a synchronous model, it places strong timing requirements on the system, so its own safety is insufficient

Martin gives a timing diagram for the Fencing mechanism:



The client that acquired the lock may be suspended for an extended period of time. The lock has a timeout precisely so that a crashed client cannot hold it forever and never release it; but if the client's pause lasts longer than the lock's expiry time, and the client does not realize the lock has expired, it may go ahead and make unsafe changes. In other words, a lock held by a blocked client expires without the client knowing it.

In this case, Martin pointed out that the fencing mechanism should be added, specifically the fencing token isolation mechanism, and also gave a timing diagram:



Client 1 acquires the lock and gets the token with serial number 33, then goes into a long pause until the lock expires. Client 2 then acquires the lock, gets the token with serial number 34, and writes to the storage service. When client 1 revives and sends its write, the storage server remembers it has already processed a write with the higher token 34, so it rejects the request carrying token 33.

The Redlock algorithm does not have such a unique and increasing fencing token generation mechanism, which means that the Redlock algorithm cannot avoid the operation problems after the lock expires due to client blocking, so it is not safe.

Personally, I don't think this idea completely solves the problem: if client 1's write was supposed to succeed but, because of the blocking timeout, can no longer be applied, that rejection itself produces an erroneous result; client 2 then operates on top of this erroneous state, so everything that follows is doomed to be wrong.

3. Dr. Martin’s doubts about Redlock

Martin Kleppmann points out that Redlock is a heavily time-dependent algorithm, which can lead to many inconsistencies. Here's an example:

Suppose A multi-node Redis system has five nodes A/B/C/D/E and two clients C1 and C2. What happens if the clock on one of the Redis nodes jumps forward?

  • Client C1 acquires the lock on nodes A, B, and C; due to network problems, it never reaches nodes D and E
  • The clock on node C jumps forward, causing the lock to expire in advance
  • Client C2 acquires the lock on nodes C, D, and E; due to network problems, it cannot reach A and B
  • Both client C1 and client C2 now think they own the lock

The above situation is possible because Redlock's safety depends strongly on the system clocks of the Redis nodes: once a clock becomes inaccurate, the algorithm's safety can no longer be guaranteed.
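To see the quorum arithmetic of this scenario concretely, here is a tiny Python illustration (node names follow the example above; this is just the counting argument, not Redlock code):

    # Redlock considers the lock acquired once a majority of the N nodes grant it.
    N = 5
    quorum = N // 2 + 1                  # 3 of 5

    c1_holds = {"A", "B", "C"}           # client C1's successful acquisitions
    # The clock on node C jumps forward, so C1's lock on C expires early...
    c2_holds = {"C", "D", "E"}           # ...and client C2 also reaches a quorum

    assert len(c1_holds) >= quorum and len(c2_holds) >= quorum
    print("both clients believe they hold the lock")   # mutual exclusion is broken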

Martin is actually pointing out some fundamental problems in the study of distributed algorithms. A good distributed algorithm should be based on an asynchronous model, and its security should not depend on any timing assumptions.

In the distributed asynchronous model, processes and messages can be delayed for arbitrarily long, and the system clock can fail in arbitrary ways. These factors should affect only an algorithm's liveness, never its safety: in extreme cases the algorithm may fail to produce a result within bounded time, but it must never produce an incorrect result. Algorithms that meet this standard exist in practice, such as Paxos and Raft; measured against it, Redlock's level of safety is unattainable.

4. Dr. Martin’s conclusions and basic views

Martin then gives his own view, dividing the uses of locks into two categories:

  • For efficiency

    Distributed locks merely coordinate simple work among multiple clients, and an occasional lock failure has only mild adverse consequences.
    It is about as innocuous as accidentally sending the same email twice
  • For correctness

    Using distributed locks requires that lock failure never occur under any circumstances, because a failure could mean data inconsistency, data loss, file corruption, or other serious problems.
    Like giving a patient a repeated dose of a drug, the consequences can be severe

Martin concluded as follows:

  • Use distributed locks for efficiency

    A single-node Redis locking scheme is sufficient; Redlock is a heavyweight and expensive design (a minimal sketch of the single-node scheme follows this list)
  • Use distributed locks for correctness

    Redlock is not a strong enough algorithm: it is not built on the asynchronous model, and its assumptions about the system model contain many dangerous elements
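For reference, the single-node scheme Martin deems sufficient for efficiency purposes is the documented SET ... NX PX pattern. A minimal sketch using the redis-py client (the host, key name, and TTL are assumptions for illustration):

    import secrets
    import redis

    r = redis.Redis(host="localhost", port=6379)   # assumed local Redis instance

    token = secrets.token_hex(16)                  # random value identifying this holder

    # SET key value NX PX 30000: acquire only if the key does not already exist,
    # with a 30-second expiry so a crashed client cannot hold the lock forever.
    acquired = r.set("lock:resource", token, nx=True, px=30000)
    if acquired:
        print("lock held; do the protected work here")
        # Releasing safely needs a token check (see the Lua sketch further below);
        # a plain DEL could delete a lock that expired and was re-acquired by another client.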

Martin thinks the Redlock algorithm is a poor choice because it is neither fish nor fowl: too heavyweight and expensive for efficiency purposes, yet not safe enough for correctness purposes.

Martin's article was published on February 8, 2016, and Antirez was quick to respond, publishing "Is Redlock Safe?" at the following address:

http://antirez.com/news/101

Antirez believes that Martin’s criticism of Redlock can be summarized in two aspects:

  • A distributed lock with automatic expiration must provide a fencing mechanism to ensure true mutual exclusion over the shared resource; the Redlock algorithm provides no such mechanism
  • Redlock is built on an unsafe system model: it makes strong timing assumptions that real systems cannot guarantee

Antirez argues against both points in detail.

About the fencing mechanism

Antirez asks: if a fencing mechanism already exists that keeps access to the shared resource exclusive even when the lock fails, why use a distributed lock with such strong safety guarantees at all?

Taking a step back: although Redlock does not provide an increasing fencing token, the random string it generates can achieve a similar effect. That token is not monotonically increasing, but it is unique per lock holder.
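This unique random value is most familiar from the documented safe-release pattern, which uses the same compare-the-token idea: delete the lock only if it still holds your token. A minimal sketch with redis-py (the key name and connection parameters are assumptions):

    import redis

    r = redis.Redis(host="localhost", port=6379)   # assumed local Redis instance

    # Atomically delete the lock only if it still holds our token (ARGV[1]).
    # Without the script, a GET followed by a DEL could delete someone else's lock.
    RELEASE_SCRIPT = """
    if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
    else
        return 0
    end
    """

    def release(token):
        # True if we released our own lock, False if it was no longer ours.
        return r.eval(RELEASE_SCRIPT, 1, "lock:resource", token) == 1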

About the timing assumptions

Antirez also rebutted the criticism of the algorithm's timing-model assumptions. Martin had identified three main failure cases for Redlock:

  • Clock jumps
  • Long GC pauses
  • Long network delays

The latter two cases were considered in Redlock's design from the beginning, and the algorithm is somewhat resistant to their consequences. Clock jumps, however, have a great impact: once one happens, Redlock no longer works correctly. Antirez pointed out that Redlock does not require the system clock to be perfectly accurate, only that its error stay within a certain range, which is entirely reasonable in real environments, and that large clock jumps can be avoided through proper operations practice (as the validity-time arithmetic sketched below shows, the algorithm explicitly budgets for small drift).
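For context, the Redlock specification itself budgets for small clock drift: a lock's remaining validity is its TTL minus the time spent acquiring it minus a drift allowance. A sketch of that arithmetic (the 1% factor and 2 ms offset are conventions used by common Redlock implementations, not fixed constants):

    def lock_validity_ms(ttl_ms, elapsed_ms, drift_factor=0.01, drift_offset_ms=2.0):
        # The drift allowance covers the *small* clock error Antirez considers
        # achievable with good operations; a large clock jump blows this budget.
        drift_ms = ttl_ms * drift_factor + drift_offset_ms
        return ttl_ms - elapsed_ms - drift_ms

    # Example: a 10 s TTL acquired in 120 ms leaves 10000 - 120 - 102 = 9778 ms.
    print(lock_validity_ms(10000, 120))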

Distributed systems are inherently complex, and claims about mechanisms and theory need to rest on careful mathematical reasoning. Martin and Antirez are both experts in this field; each has his own views and reflections on these problems, and in many cases there is simply no perfect solution.

The debate was a very good clash of ideas in the field of distributed systems. Many people posted their own views and opinions, and some time after Antirez's response, Dr. Martin shared his thoughts once more:

For me, this is the most important point: I don’t care who is right or wrong in this debate — I care about learning from others’ work, so that we can avoid repeating old mistakes, and make things better in future. So much great work has already been done for us: by standing on the shoulders of giants, we can build better software.

By all means, test ideas by arguing them and checking whether they stand up to scrutiny by others. That’s part of the learning process. But the goal should be to learn, not to convince others that you are right. Sometimes that just means to stop and think for a while.

For Martin, who was right or wrong mattered less than learning from the work of others, so that old mistakes are not repeated and things can be made better in the future.

Standing on the shoulders of giants, we can build better software.

Moreover, only by arguing our ideas and letting others scrutinize them can we make them stand the test. The goal is to learn, not to convince others that you are right; no one sees every side alone, and thinking and debating bring us closer to the truth.

After Antirez’s article was published, distributed system experts and enthusiasts all over the world actively expressed their opinions, and I found a familiar name in the comments:

0xFF. About me