0x00. Preface
As we all know, data structures and algorithms are a key focus of technical interviews. It pays to keep working at them steadily, or we'll lose out in the end; a little less gaming and tweeting really can change our standard of living.
This article, however, is not about data structures and algorithms but about another interview focus: Redis. Redis is a technology that cuts across languages; whether you write Java or C++ you will be asked about it, so it is a high-frequency interview topic.
I first came into contact with Redis in 2017. Back then, running standalone instances and clusters took considerable operational effort; nowadays companies offer fully managed services, and users only need to fill in an application on a web page, with many operational details hidden away. That is certainly convenient, but we should still understand Redis in depth.
During my years of work I have been exposed to several NoSQL systems: Redis, the Redis-like SSDB and Pika, Google's key-value storage engine LevelDB, Facebook's key-value storage engine RocksDB, and so on. Redis is developed in standard C and is an excellent open source project both for engineering and for learning.
I have written several articles about Redis before, but the knowledge points were scattered and hard to read as a whole, so this time I have consolidated and supplemented the earlier articles into a complete set, hoping it will be helpful to the readers who follow me.
The interview points covered here are numerous, adding up to more than 30,000 words, so readers are advised to bookmark this article for when it's needed. Through it you will come to understand the following:
- A brief history of Redis and its author
- Redis' commonly used data structures and their implementation
- Principles of Redis SDS and comparison with C strings
- The underlying design and implementation of the Redis sorted set ZSet
- The Redis sorted set ZSet and skip list questions
- The Redis dictionary implementation and the progressive rehash process
- Basic principles and flow of Redis' single-threaded running mode
- Principles and design of the Redis Reactor model
- Redis persistence schemes and their basic principles
- Cluster-mode Redis and the Gossip protocol
- The Redis memory reclaim mechanism and its basic principles
- The Redis data synchronization mechanism and its basic principles
Without further ado, the Great White is accelerating at 400 kilometers an hour!
I have tried to explain each question thoroughly, so that it can be understood rather than recited. There will inevitably be some shortcomings; if you have questions, feel free to send them to me.
0x01. What is Redis, and why is it important?
Redis is an open source, network-capable, memory-based, optionally persistent, high-performance key-value database written in ANSI C.
The father of Redis is Salvatore Sanfilippo from Sicily, Italy, who goes by the Github name Antirez. I found some brief information about the author and translated it, as shown here:
Ten years after its first release in 2009, Redis is still one of the most popular key-value in-memory databases.
A good open source project cannot thrive without the backing of large companies. Before May 2013 its development was sponsored by VMware; from May 2013 to June 2015 by Pivotal; and since June 2015 Redis development has been sponsored by Redis Labs.
I have also used some other NoSQL stores, some of which support only a single value type, so many operations have to be implemented on the client side. For example, if the value is structured data and one field needs modifying, the whole value must be read out, modified, and written back, which is cumbersome. Redis values support multiple types, so many operations can be done on the server side, which is very convenient for the client.
Of course, as a memory-based database Redis has limited storage capacity, and distributed clusters are costly. Many companies have therefore developed SSD-backed Redis-like systems, such as SSDB and 360's Pika. Still, going from 0 to 1 is harder than going from 1 to 2: Redis is without doubt an important member of the NoSQL family and is well worth studying and using.
Redis provides clients for Java, C/C++, C#, PHP, JavaScript, Perl, Objective-C, Python, Ruby, Erlang, Golang and other mainstream languages, so whatever their language stack, users will always find a client; the audience is very wide.
I checked datanyze.com for the latest market share and ranking of Redis versus MySQL, as well as their deployment counts among the world's top sites (data from 2019.12):
As you can see, Redis ranks 9th in overall share and is roughly on par with MySQL in the number of deployments among the global Top 100 sites, so Redis clearly holds a solid position.
0x02. Describe the data structures commonly used in Redis and how they are implemented.
The five commonly used data types supported by Redis refer to the type of the value; they are String, List, Hash, Set, and ZSet. Redis later added several richer data types: Bitmaps, HyperLogLogs, and GEO.
Since Redis is written in standard C, which provides only the most basic data types, Redis built its own set of underlying data structures and uses them to implement the data types it exposes externally.
The underlying data structures of Redis include: simple dynamic array SDS, linked list, dictionary, skip list, integer set, compressed list, object.
To balance space and time efficiency, Redis uses different underlying data structures for specific value types; among them, the hash table and the compressed list are reused the most. The following figure shows the mapping between the external data types and the underlying data structures:
As the figure shows, the compressed list ziplist can serve as the underlying implementation of three data types: ZSet, Hash, and List. It looks very powerful. A ziplist is a specially encoded, sequential block of contiguous memory developed to save space; its underlying structure is relatively complex.
0x03. What advantages does Redis SDS have over C strings?
C represents a string as a character array of length N+1, using '\0' as the terminator. Such an implementation cannot satisfy Redis' requirements for safety, efficiency, and rich functionality, so Redis encapsulates its own simple dynamic string structure, SDS.
Before understanding the advantages of SDS, we need to look at the details of its implementation. Check the latest src/sds.h definition on GitHub:
typedef char *sds; /* Do not overlook this line */

struct __attribute__ ((__packed__)) sdshdr5 {
    unsigned char flags; /* 3 lsb of type, and 5 msb of string length */
    char buf[];
};
/* Header types 8, 16, 32, 64: the SDS_TYPE_x macros below define only five
 * types, so 3 bits in flags are sufficient to encode the type (the other
 * 5 bits are used for the length only by sdshdr5). buf is a plain C
 * character array, which is why "typedef char *sds" works: an sds points
 * straight at buf. The maximum length of buf is 2^n where n depends on the
 * chosen sdshdr type; e.g. with sdshdr16, buf_max = 2^16. */
struct __attribute__ ((__packed__)) sdshdr8 {
    uint8_t len; /* used */
    uint8_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr16 {
    uint16_t len; /* used */
    uint16_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr32 {
    uint32_t len; /* used */
    uint32_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr64 {
    uint64_t len; /* used */
    uint64_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};

#define SDS_TYPE_5  0
#define SDS_TYPE_8  1
#define SDS_TYPE_16 2
#define SDS_TYPE_32 3
#define SDS_TYPE_64 4
#define SDS_TYPE_MASK 7
#define SDS_TYPE_BITS 3
Based on the definitions above, I drew a diagram:
From the figure we can see that an SDS is divided into three parts: the header, buf, and the null terminator. The header can be regarded as the guide to the whole SDS, giving the used space, the maximum allocated size, and other information. The following diagram (from the Internet) shows an sdshdr8 example clearly:
You can see the full implementation details of SDS in the sds.h/sds.c source; I won't go through all of it here because it would make the article too long. Let's get to the point and talk about the advantages of SDS:
- O(1) length lookup: a C string must be traversed to find its length, while SDS stores len and returns it directly;
- No buffer overflows: when SDS needs to modify a string, it first checks via len and alloc whether there is enough space, automatically expanding if not, which avoids the overwrite problems of C string operations;
- Fewer memory reallocations: appending to or trimming a C string resizes the underlying array and forces a reallocation every time. SDS uses space pre-allocation and lazy space release: each expansion allocates extra space ahead of time, and shrinking does not immediately return memory to the OS. Both mechanisms are easy to understand;
- Binary safety: a C string is terminated by the first '\0' byte, so it cannot hold arbitrary binary data such as images or audio. SDS is binary safe: what is written is exactly what is read, with no filtering or restrictions;
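To make the O(1) length lookup concrete, here is a simplified sketch in the spirit of sds.h; the real header defines a few more helper macros, so treat this as an illustration rather than the verbatim source:

```c
/* The flags byte sits immediately before buf, so s[-1] identifies the
 * header type; the header itself starts sizeof(header) bytes before the
 * data pointer. */
#define SDS_HDR(T, s) ((struct sdshdr##T *)((s) - sizeof(struct sdshdr##T)))

static inline size_t sdslen(const sds s) {
    unsigned char flags = s[-1];
    switch (flags & SDS_TYPE_MASK) {
        /* sdshdr5 packs the length into the top 5 bits of flags. */
        case SDS_TYPE_5:  return flags >> SDS_TYPE_BITS;
        case SDS_TYPE_8:  return SDS_HDR(8, s)->len;
        case SDS_TYPE_16: return SDS_HDR(16, s)->len;
        case SDS_TYPE_32: return SDS_HDR(32, s)->len;
        case SDS_TYPE_64: return SDS_HDR(64, s)->len;
    }
    return 0;
}
```

No traversal happens anywhere: the length is read straight out of the header, which is exactly where the O(1) advantage comes from.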
Finally, a summary diagram by Huang Jianhong:
0x04. How is Redis’ dictionary implemented? Describe the progressive rehash process
The dictionary is one of the most frequently used data structures in Redis. As mentioned earlier, the Hash type can be implemented on top of either a ziplist or a hashtable.
A dictionary is a clearly layered data structure, as shown here:
With that picture in mind, let's look at the latest src/dict.h source definition:
// Hash table node
typedef struct dictEntry {
    void *key;
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
        double d;
    } v;
    struct dictEntry *next;
} dictEntry;

// The set of callbacks that customize how a dictionary handles keys/values
typedef struct dictType {
    uint64_t (*hashFunction)(const void *key);
    void *(*keyDup)(void *privdata, const void *key);
    void *(*valDup)(void *privdata, const void *obj);
    int (*keyCompare)(void *privdata, const void *key1, const void *key2);
    void (*keyDestructor)(void *privdata, void *key);
    void (*valDestructor)(void *privdata, void *obj);
} dictType;

/* This is our hash table structure. Every dictionary has two of this as we
 * implement incremental rehashing, for the old to the new table. */
typedef struct dictht {
    dictEntry **table;
    unsigned long size;
    unsigned long sizemask;
    unsigned long used;
} dictht;

// The dictionary itself
typedef struct dict {
    dictType *type;
    void *privdata;
    dictht ht[2];
    long rehashidx; /* rehashing not in progress if rehashidx == -1 */
    unsigned long iterators; /* number of iterators currently running */
} dict;
C definitions are necessarily built from the inside out, so the levels show clearly in the code. I drew a diagram to show the specific hierarchy:
- About dictEntry
A dictEntry is a hash table node; this is where the data lives. Its members are key, v, and the next pointer. key holds the key of the key-value pair, and v holds the value, which can be a pointer, a uint64_t, an int64_t, or a double. next is a pointer to another hash table node, chaining together key-value pairs whose hash values collide, which resolves hash conflicts.
The following figure shows the connection between two conflicting hash nodes:
- About dictht
A hash table contains the members table, size, used, and sizemask. table is an array whose elements point to dictEntry structures, each holding a key-value pair; size records the size of table, used records the number of nodes the hash table currently contains, and sizemask equals size-1 and is used together with a key's hash value to compute its index in the table array.
The figure above shows a table of size 4: k1 and k0 collide at index=2 and are chained together (separate chaining). In fact k0 was stored first, and k1 is inserted directly at the head of the colliding chain; this keeps insertion efficient, because the chain has no tail pointer.
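A minimal sketch of that head insertion, using the dictht/dictEntry definitions above; it follows the spirit of dict.c's dictAddRaw but is simplified for illustration:

```c
/* Insert an entry at the head of its bucket's chain: O(1), no tail
 * pointer needed, and recently inserted keys are found first. */
void bucketHeadInsert(dictht *ht, uint64_t hash, dictEntry *entry) {
    unsigned long idx = hash & ht->sizemask;  /* index via sizemask */
    entry->next = ht->table[idx];             /* new node points at old head */
    ht->table[idx] = entry;                   /* new node becomes the head */
    ht->used++;
}
```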
- About dict
The dict structure defines the dictionary. It contains the members type, privdata, ht, and rehashidx. type points to a dictType, the set of functions that operate on the dictionary; privdata is passed as an argument to those functions; ht is an array of 2 dictht, meaning the dictionary contains two hash tables; rehashidx is the variable used for rehashing. This gives a glimpse of all the dictionary's members.
Dictionary hashing algorithm
// Pseudo-code: use the hash function to compute the key's hash value
hash = dict->type->hashFunction(key);
// Pseudo-code: use the table's sizemask and the hash to compute the index in ht[0] or ht[1]
index = hash & dict->ht[x].sizemask;
// Source definition
#define dictHashKey(d, key) (d)->type->hashFunction(key)
Redis uses the MurmurHash algorithm to compute hash values; it was invented by Austin Appleby in 2008. MurmurHash produces well-distributed hash values whatever the input, and it computes very quickly. There are MurmurHash2 and MurmurHash3 versions, among others.
- Normal rehash
The number of key-value pairs stored in the hash table changes dynamically. In order to keep the load factor of the hash table within a reasonable range, it is necessary to expand and shrink the hash table.
Scaling is done by performing a rehash. The basic steps of a normal rehash of the dictionary's hash tables are: allocate space -> migrate entries one by one -> swap the hash tables. The detailed process is as follows:
- Allocate space for the dictionary's ht[1] hash table. The size depends on the operation to be performed and the number of key-value pairs ht[0] currently holds: for an expansion, the size of ht[1] is the first 2^n greater than or equal to ht[0].used*2; for a shrink, the size of ht[1] is the first 2^n greater than or equal to ht[0].used. For example, if ht[0].used=200, an expansion picks the first power of 2 greater than or equal to 400, i.e. 2^9=512 (see the sketch after this list);
- Rehash all key-value pairs stored in ht[0] to ht[1] by recomputing each key's hash and index;
- Repeat until every key-value pair in ht[0] has been migrated to ht[1]; then make ht[1] the new ht[0] and create a blank hash table in ht[1], ready for the next rehash.
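The sizing sketch promised above, in the spirit of dict.c's _dictNextPower; the initial size constant is an assumption for illustration:

```c
#define DICT_HT_INITIAL_SIZE 4  /* assumed initial table size */

/* Return the first power of two >= size, used when sizing ht[1];
 * for an expansion the caller passes ht[0].used * 2. */
static unsigned long dictNextPower(unsigned long size) {
    unsigned long i = DICT_HT_INITIAL_SIZE;
    while (i < size) i *= 2;
    return i;  /* e.g. size = 400 -> 512 (2^9) */
}
```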
- The progressive rehash process
Redis does not perform the rehash in a single pass; it is done in multiple steps, progressively. The reason is that when the hash tables hold huge numbers of key-value pairs, rehashing them all to ht[1] at once could stall the server for a stretch of time, which is unacceptable.
In this case, Redis uses progressive rehash. The detailed steps of the process are:
- Allocate space for ht[1], exactly as in a normal rehash;
- Set rehashidx to 0 to officially start the rehash work; rehashidx then increments, starting the rehash from the first slot of the array;
- During the rehash, every add, delete, update, or lookup on the dictionary additionally rehashes the key-value pairs at index rehashidx of ht[0] over to ht[1]; when that slot is done, rehashidx is incremented by 1 to point at the next slot to migrate;
- As dictionary operations continue, eventually all key-value pairs of ht[0] are rehashed to ht[1]; rehashidx is then set to -1 to indicate that the rehash is complete.
The idea of progressive rehash is to spread the computation needed to rehash the key-value pairs across every add, delete, lookup, and update performed on the dictionary, thereby avoiding the blocking that a centralized rehash would cause.
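A simplified sketch of a single progressive rehash step, modeled on dict.c's dictRehash; empty-bucket visit limits and safety asserts are omitted, and memory is released with plain free() here rather than Redis' zfree():

```c
#include <stdlib.h>
#include <string.h>

/* Migrate the bucket at rehashidx from ht[0] to ht[1]; piggybacked on
 * normal dictionary operations while a rehash is in progress. */
void dictRehashStep(dict *d) {
    if (d->rehashidx == -1) return;                 /* not rehashing */
    /* Skip empty buckets until we find one with entries. */
    while (d->ht[0].table[d->rehashidx] == NULL) d->rehashidx++;
    dictEntry *de = d->ht[0].table[d->rehashidx];
    while (de) {                                    /* move the whole chain */
        dictEntry *next = de->next;
        uint64_t h = dictHashKey(d, de->key) & d->ht[1].sizemask;
        de->next = d->ht[1].table[h];               /* head-insert into ht[1] */
        d->ht[1].table[h] = de;
        d->ht[0].used--;
        d->ht[1].used++;
        de = next;
    }
    d->ht[0].table[d->rehashidx] = NULL;
    d->rehashidx++;
    if (d->ht[0].used == 0) {                       /* migration finished */
        free(d->ht[0].table);
        d->ht[0] = d->ht[1];
        memset(&d->ht[1], 0, sizeof(d->ht[1]));
        d->rehashidx = -1;                          /* rehash complete */
    }
}
```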
You can't help but wonder whether this piggy-backed rehash makes the whole process very long. If a dictionary is never operated on again: an unfinished expansion has little impact, since the dictionary isn't being used anyway; an unfinished shrink, however, may waste memory. I haven't had time to study the specifics, so let me bury a question here!
0x05. How did Redis run in single-threaded mode before 4.0?
5.1 Considerations for single-threaded mode
CPU is not the bottleneck: multi-threading exists to make full use of multi-core CPUs, suspending threads blocked on I/O so the CPU can serve other threads and raising utilization. That scenario barely applies to Redis: CPU is not Redis' bottleneck, because all operations are in memory and event handling is extremely fast, so there is little need for multiple threads and thread switching to raise CPU utilization;
Memory is the bottleneck: a single Redis instance uses a single core quite well, but Redis' bottleneck is memory. Imagine a 64-core machine with only 16GB of RAM: what use would a multi-threaded Redis be?
Complex value types: Redis has rich data structures rather than simple key-value pairs, which is also part of why it is popular. When the value of a Hash, ZSet, List, etc. is large, CRUD operations become complex; in a multi-threaded model, operations on the same key would need locks for synchronization, which can even lead to deadlocks.
Cluster extension: machines today are multi-core, but 128GB/64GB of memory is typical, and Redis above roughly 60% memory usage is noticeably less stable than at 50% (at least in my experience, cluster failover becomes more frequent above 70%). So when the data is large and Redis is used as primary storage, multiple machines must be combined into a clustered Redis system; the single-threaded mode is scaled out by clustering.
From a software-engineering perspective: single-threaded code is much easier to develop and maintain than multi-threaded code, and it improves service stability; lock-free processing gives single-threaded Redis considerable advantages in development and maintenance.
Redis-like systems: Redis' design puts practicality and engineering first. However excellent some designs are in theory, they are not necessarily a good fit; software design is a process of trade-offs. There are many Redis-like NoSQL systems in industry, such as Pika, developed by 360's infrastructure group, which is based on SSDs and the RocksDB storage engine with a protocol-translation layer on top that emulates all of Redis' functionality. Interested readers can study and use it.
5.2 File and Time events of Redis
- Time events come in two kinds:
  - Timed events: the task executes once after a specified delay; after execution it is not rescheduled, triggering only once;
  - Periodic events: the task executes at a fixed interval; after each execution the next run is scheduled, triggering periodically.
- Work done in the periodic time event includes:
  - Deleting expired database keys
  - Triggering RDB and AOF persistence
  - Master-slave synchronization
  - Cluster keepalive
  - Closing and cleaning up dead client connections
  - Updating the server's memory and key-count statistics
- Time events are kept in an unordered linked list; each event carries:
  - A globally unique, ascending event id
  - A trigger timestamp with millisecond precision
  - An event handler (the callback function)
5.3 Event scheduling and execution in single-threaded mode
- Event execution scheduling rules
- Code implementation of event execution scheduling
int aeProcessEvents(aeEventLoop *eventLoop, int flags)
{
    int processed = 0, numevents;

    /* Nothing to do? return ASAP */
    if (!(flags & AE_TIME_EVENTS) && !(flags & AE_FILE_EVENTS)) return 0;

    if (eventLoop->maxfd != -1 ||
        ((flags & AE_TIME_EVENTS) && !(flags & AE_DONT_WAIT))) {
        int j;
        aeTimeEvent *shortest = NULL;
        struct timeval tv, *tvp;

        if (flags & AE_TIME_EVENTS && !(flags & AE_DONT_WAIT))
            shortest = aeSearchNearestTimer(eventLoop);
        if (shortest) {
            long now_sec, now_ms;

            aeGetTime(&now_sec, &now_ms);
            tvp = &tv;

            /* How many milliseconds do we need to wait for the next
             * time event to fire? */
            long long ms =
                (shortest->when_sec - now_sec)*1000 +
                shortest->when_ms - now_ms;

            if (ms > 0) {
                tvp->tv_sec = ms/1000;
                tvp->tv_usec = (ms % 1000)*1000;
            } else {
                tvp->tv_sec = 0;
                tvp->tv_usec = 0;
            }
        } else {
            if (flags & AE_DONT_WAIT) {
                tv.tv_sec = tv.tv_usec = 0;
                tvp = &tv;
            } else {
                tvp = NULL; /* wait forever */
            }
        }

        numevents = aeApiPoll(eventLoop, tvp);

        /* After sleep callback. */
        if (eventLoop->aftersleep != NULL && flags & AE_CALL_AFTER_SLEEP)
            eventLoop->aftersleep(eventLoop);

        for (j = 0; j < numevents; j++) {
            aeFileEvent *fe = &eventLoop->events[eventLoop->fired[j].fd];
            int mask = eventLoop->fired[j].mask;
            int fd = eventLoop->fired[j].fd;
            int fired = 0; /* Number of events fired for current fd. */

            /* Normally the readable event fires first and the writable
             * event after; AE_BARRIER inverts this order. */
            int invert = fe->mask & AE_BARRIER;

            if (!invert && fe->mask & mask & AE_READABLE) {
                fe->rfileProc(eventLoop,fd,fe->clientData,mask);
                fired++;
            }

            /* Fire the writable event. */
            if (fe->mask & mask & AE_WRITABLE) {
                if (!fired || fe->wfileProc != fe->rfileProc) {
                    fe->wfileProc(eventLoop,fd,fe->clientData,mask);
                    fired++;
                }
            }

            /* If we have to invert the call, fire the readable event
             * after the writable one. */
            if (invert && fe->mask & mask & AE_READABLE) {
                if (!fired || fe->wfileProc != fe->rfileProc) {
                    fe->rfileProc(eventLoop,fd,fe->clientData,mask);
                    fired++;
                }
            }

            processed++;
        }
    }
    /* Check time events */
    if (flags & AE_TIME_EVENTS)
        processed += processTimeEvents(eventLoop);

    return processed;
}
- Pseudo-code for event execution and scheduling
def aeProcessEvents():
    # Get the nearest time event that needs to run
    time_event = aeGetNearestTimer()
    # Compute the gap between that event's deadline and now
    remain_gap_time = time_event.when - unix_time_now()
    # If the time event is already due, don't block at all
    if remain_gap_time < 0:
        remain_gap_time = 0
    # The blocking wait time is determined by remain_gap_time;
    # with remain_gap_time == 0, aeApiPoll returns immediately without blocking
    aeApiPoll(remain_gap_time)
    # Handle all file events
    ProcessAllFileEvent()
    # Handle all time events
    ProcessAllTimeEvent()
- The event scheduling and execution flow
0x06. Talk about the Redis Reactor model
- Why use the Reactor model?
- How does Redis implement its own Reactor model?
6.1 Reactor model
- epoll/kqueue collect all read/write events into a queue for business threads to process; worker threads from a pool then fetch tasks from that queue. In real scenarios there may be many types of requests;
- Worker threads must pay attention to the different request types and choose a different treatment for each, so a growing number of request types makes the worker threads complex and hard to maintain.

The classic Reactor model consists of these parts:

- Handle: the read/write events that can be registered with the Reactor for monitoring;
- Sync Event Demultiplexer: the collector of I/O events, such as epoll/kqueue/select;
- Dispatcher: the event dispatcher, responsible for registering/removing events and distributing them;
- Event Handler: the callback interface for specific events, invoked by the Dispatcher;
- Concrete Event Handler: the handler for a specific request.
6.2 Implementation of the Reactor model in Redis
Redis' choice of I/O multiplexing implementation:
#ifdef HAVE_EVPORT
#include "ae_evport.c"
#else
#ifdef HAVE_EPOLL
#include "ae_epoll.c"
#else
#ifdef HAVE_KQUEUE
#include "ae_kqueue.c"
#else
#include "ae_select.c"
#endif
#endif
#endif
- Redis' task event queue
- Redis' event dispatcher
  - AE_READABLE: the client writes data, closes the connection, or a new connection arrives
  - AE_WRITABLE: the client reads data
In particular, when a socket connection is both readable and writable, the server processes the read event first and the write event afterwards.
- Redis' event handlers
  - Connection accept handler: establishes new connections
  - Command request handler: processes new commands arriving from clients
  - Command reply handler: returns the results of client requests
  - Replication handler: implements data replication between master and slave servers
- A complete Redis client/server interaction
0x07. How does Redis persist data, and what are the basic principles?
In general, persistence is the writing of data from memory to non-volatile media, such as mechanical disks and SSDS.
As an in-memory database, Redis would lose all of its data in a server outage, so it provides two powerful persistence tools: RDB and AOF.
- RDB saves a snapshot of the database to disk in binary form;
- AOF records, in protocol text, every command and argument that has written to the database, thereby recording the database state.
- Viewing the RDB configuration:
[redis@abc]$ cat /abc/redis/conf/redis.conf
save 900 1
save 300 10
save 60 10000
dbfilename "dump.rdb"
dir "/data/dbs/redis/rdbstro"
The first three lines are conditions that trigger an RDB snapshot. The first line, for example, means: if at least 1 change occurs within 900 seconds, trigger an RDB; and so on for the others. As soon as any one condition is satisfied, RDB persistence runs.
The fourth line, dbfilename, names the file the in-memory database is written to; it is a compressed binary file.
The fifth line, dir, specifies the directory where the RDB binary file is stored.
- Modifying the RDB configuration:
[redis@abc]$ bin/redis-cli
127.0.0.1:6379> CONFIG GET save
1) "save"
2) "900 1 300 10 60 10000"
127.0.0.1:6379> CONFIG SET save "21600 1000"
OK
7.1 RDB SAVE and BGSAVE
The figure shows a simple BGSAVE flow:
- BGSAVE implementation details:
  - Redis calls fork to create a child process;
  - The parent process continues to receive and handle command requests while the child writes the in-memory data to a temporary file;
  - When all the data has been written, the child replaces the old RDB file with the temporary file.
When fork executes, the OS uses a copy-on-write strategy, which optimizes the snapshot process for the child process.
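A minimal sketch of that flow using POSIX fork semantics; writeDatasetToFile is a hypothetical stand-in for Redis' actual rdbSave logic:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

extern void writeDatasetToFile(const char *path); /* hypothetical */

/* Fork-based snapshot sketch: thanks to copy-on-write, the child sees a
 * frozen view of memory while the parent keeps serving writes. */
void bgsaveSketch(void) {
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: dump the snapshot to a temp file, then atomically
         * replace the old RDB file. */
        writeDatasetToFile("temp-dump.rdb");
        rename("temp-dump.rdb", "dump.rdb");
        exit(0);
    } else if (pid < 0) {
        perror("fork failed");
    }
    /* Parent (pid > 0): continue handling commands; reap the child later. */
}
```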
We can back up the Redis database by periodically backing up the RDB file. RDB files are compressed, so they take up less space than the data does in memory.
In addition to automatic snapshots, you can also manually send the SAVE or BGSAVE command to Redis to take a snapshot. Because of RDB's frequency limits, if the data is important, consider using AOF for persistence as well.
7.2 AOF
[redis@abc]$ more appendonly.aof
*2      # 2 arguments
$6      # first argument is 6 bytes long
SELECT  # first argument
$1      # second argument is 1 byte long
8       # second argument
*3      # 3 arguments
$3      # first argument is 3 bytes long
SET     # first argument
$4      # second argument is 4 bytes long
name    # second argument
$4      # third argument is 4 bytes long
Jhon    # third argument
AOF configuration:
[redis@abc]$ more ~/redis/conf/redis.conf
dir "/data/dbs/redis/abcd" #AOF file storage directory
appendonly yes # Enable AOF persistence. It is disabled by default
appendfilename "appendonly.aof" AOF file name (default)
appendfsync no #AOF persistence policy
auto-aof-rewrite-percentage 100 Condition that triggers AOF overwrite (default)
auto-aof-rewrite-min-size 64mb Condition that triggers AOF overwrite (default)
Copy the code
- always: the server forces the contents of the AOF buffer onto the AOF file on disk after every event, guaranteeing durability; it is the slowest but safest option;
- everysec: the server performs a file sync once per second, writing the AOF data buffered in memory to the AOF file; this balances efficiency and safety, and in the extreme case an outage loses only the last second of writes to the database;
- no: writes go to the system cache and the OS decides when to flush to disk; data safety and integrity are the weakest.
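A hypothetical sketch of the three policies' behavior; this is not Redis' actual flushAppendOnlyFile, and in real Redis the everysec fsync happens on a background thread:

```c
#include <time.h>
#include <unistd.h>

enum fsyncPolicy { FSYNC_ALWAYS, FSYNC_EVERYSEC, FSYNC_NO };

/* Append a command's protocol text to the AOF and apply the fsync policy. */
void aofFlushSketch(int aof_fd, const char *buf, size_t len,
                    enum fsyncPolicy policy, time_t *last_fsync) {
    write(aof_fd, buf, len);              /* always append the data */
    if (policy == FSYNC_ALWAYS) {
        fsync(aof_fd);                    /* safest: sync after every write */
    } else if (policy == FSYNC_EVERYSEC) {
        time_t now = time(NULL);
        if (now - *last_fsync >= 1) {     /* at most one fsync per second */
            fsync(aof_fd);
            *last_fsync = now;
        }
    }
    /* FSYNC_NO: leave flushing entirely to the OS page cache. */
}
```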
- Execute set hello world 50 times;
- Then execute set hello china once;
- For the final state, the first 50 commands are meaningless to the AOF file; an AOF rewrite keeps only the final state of each key.
- Data consistency during a rewrite:
  - The entire contents of the AOF rewrite buffer are written to the new AOF file;
  - The new AOF file is renamed over the original AOF file, replacing it.
- Blocking during an AOF rewrite.
- Variables tracked for automatic rewriting:
  - The current AOF file size
  - The AOF file size after the last rewrite
  - The AOF file growth percentage
- Conditions for triggering an automatic rewrite:
  - No BGSAVE command is running, to prevent conflicts with RDB;
  - No BGREWRITEAOF is running, to prevent conflicts with a manual AOF rewrite;
  - The current AOF file size is at least the configured minimum;
  - The ratio of the current AOF file size to the size after the last rewrite is at least the configured growth percentage.
7.3 Data Restoration of Redis
Redis data recovery priority
- If only AOF is configured, the AOF file is loaded on restart to restore the data;
- If both RDB and AOF are configured, only the AOF file is loaded to restore the data;
- If only RDB is configured, the dump file is loaded to restore the data.

Trade-offs of the two mechanisms:

- RDB persists at intervals; when a failure occurs, the data between the outage and the last snapshot is lost, so data integrity cannot be guaranteed;
- AOF stores command sequences, so recovery must replay them, which can take a long time, and the file is larger.
7.4 Persistence practice
- The safest approach is to use RDB together with AOF: even if the AOF file is damaged beyond repair, RDB can still recover the data. Of course, persistence does have a performance cost;
- If Redis is only a simple cache, and a cache miss cannot cause a cache avalanche, use RDB alone;
- Using AOF alone is not recommended, because AOF loads data more slowly than RDB during recovery; when using AOF, it is best to keep RDB as a backup as well;
- Use the persistence scheme of the newer Redis 4.0 (which can mix RDB and AOF).
0x08. Talk about the underlying design and implementation of Redis' ziplist
Before looking at Redis' ziplist implementation, let's think about what we would need to do to design such a data structure ourselves. Thinking it through yields more than memorizing!
- Consideration 1: the two sides of contiguous memory
  Contiguous memory reduces fragmentation, but large contiguous blocks are hard to satisfy. It's like the subway: sitting shoulder to shoulder wastes no space, yet many people travel alone and don't want to squeeze together, so pairs of seats and single seats are easy to find while three consecutive seats are genuinely rare.
- Consideration 2: the diversity of elements a compressed list holds
  This structure differs from an array, which enforces one element type so the offset of an index can be computed from the type and count. A compressed list places no constraint on element types, which means we cannot know each element's type or length in advance. This is a bit like TCP's sticky-packet problem: we must record a terminator or the length of each stored element, otherwise the data runs together.
- Consideration 3: constant-time access to overall properties
  Suppose the first two considerations are solved; as a whole, the compressed list still needs to provide some general information at constant cost, such as the total length, the number of stored elements, and the position of the tail node (for fast insertion and removal at the tail). These matter a great deal when operating on the list.
- Consideration 4: support for insertion and deletion
  Naturally the data structure we design must support inserts and deletes. There are trade-offs here; no data structure is perfect, so we adjust as we go.
- Consideration 5: how to save memory
  To save memory we must handle variable lengths specially: rather than spending two pointers, prev and next, per node as a doubly linked list does (which costs more space), we need variable-length encoding.
Ziplist overall structure
With those considerations in mind, let's look at how Redis approached it. Here's an overview:
From the figure we can see the main parts: zlbytes, zltail, zllen, zlentry, and zlend.
To explain what each attribute means, look at a very nice diagram from the Internet; the red line validates our consideration 2 and the green line our consideration 3:
Look at ziplist.c's creation and resize operations to deepen understanding of these attributes:
/* Create a new empty ziplist. */
unsigned char *ziplistNew(void) {
unsigned int bytes = ZIPLIST_HEADER_SIZE+ZIPLIST_END_SIZE;
unsigned char *zl = zmalloc(bytes);
ZIPLIST_BYTES(zl) = intrev32ifbe(bytes);
ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(ZIPLIST_HEADER_SIZE);
ZIPLIST_LENGTH(zl) = 0;
zl[bytes-1] = ZIP_END;
return zl;
}
/* Resize the ziplist. */
unsigned char *ziplistResize(unsigned char *zl, unsigned int len) {
zl = zrealloc(zl,len);
ZIPLIST_BYTES(zl) = intrev32ifbe(len);
zl[len-1] = ZIP_END;
return zl;
}
The implementation of zlentry
- Encoding and content storage
When content is a string, the high 2 bits of the first encoding byte indicate the string type; when content is an integer, the high 2 bits of the first encoding byte are fixed at 11.
/*########### String encoding description ###############
#### Encoding comes in three sizes: 1 byte, 2 bytes, or 5 bytes ####
#### The high 2 bits select the string length encoding: 00, 01, and 10 correspond to 1 byte, 2 bytes, and 5 bytes ####
#### When the high 2 bits are 00, encoding = 1 byte; the remaining 6 bits give 2^6 = 64, i.e. lengths 0-63 ####
#### When the high 2 bits are 01, encoding = 2 bytes; the remaining 14 bits give 2^14 = 16384, i.e. lengths 0-16383 ####
#### When the high 2 bits are 10, encoding = 5 bytes; the 4 bytes after the first hold the length, 2^32 ≈ 4.2 billion ####
 * |00pppppp| - 1 byte
 *      String value with length less than or equal to 63 bytes (6 bits).
 *      "pppppp" represents the unsigned 6 bit length.
 * |01pppppp|qqqqqqqq| - 2 bytes
 *      String value with length less than or equal to 16383 bytes (14 bits).
 *      IMPORTANT: The 14 bit number is stored in big endian.
 * |10000000|qqqqqqqq|rrrrrrrr|ssssssss|tttttttt| - 5 bytes
 *      String value with length greater than or equal to 16384 bytes.
 *      Only the 4 bytes following the first byte represent the length
 *      up to 2^32-1. The 6 lower bits of the first byte are not used and
 *      are set to zero.
 *      IMPORTANT: The 32 bit number is stored in big endian.
 *#################### dividing line between string and integer storage ####################
 *#### For integers the high 2 bits are fixed at 11; the following bits select the integer type ####
 * |11000000| - 3 bytes
 *      Integer encoded as int16_t (2 bytes).
 * |11010000| - 5 bytes
 *      Integer encoded as int32_t (4 bytes).
 * |11100000| - 9 bytes
 *      Integer encoded as int64_t (8 bytes).
 * |11110000| - 4 bytes
 *      Integer encoded as 24 bit signed (3 bytes).
 * |11111110| - 2 bytes
 *      Integer encoded as 8 bit signed (1 byte).
 * |1111xxxx| - (with xxxx between 0001 and 1101) immediate 4 bit integer.
 *      Unsigned integer from 0 to 12. The encoded value is actually from
 *      1 to 13 because 0000 and 1111 can not be used, so 1 should be
 *      subtracted from the encoded 4 bit value to obtain the right value.
 * |11111111| - End of ziplist special entry.
 */
content holds the node's value, which can be a byte array or one of several integer types; the value's type and length determine the encoding.
Storing a byte array: the top two bits of the encoding, 00, indicate that the node holds a byte array; the last six bits, 001011, record the array's length, 11; the content attribute holds the node's value, "hello world".
Storing an integer: the encoding 11000000 indicates that the node stores an int16_t integer, and the content attribute holds the node's value, 10086.
- The prevlen attribute
Finally there is the prevlen attribute, which is key. As noted, the compressed list is designed to save memory, and prevlen serves exactly that purpose. Recall that a linked list reaches the previous node through a pointer; because of the diversity of its elements, a compressed list cannot be indexed like an array, so it instead records the size of the previous node in prevlen in order to locate it.
The prevlen attribute specifies the length of the previous node in the compressed list, in bytes. It can be 1 byte or 5 bytes long:
- If the previous node is shorter than 254 bytes, prevlen is 1 byte long, and that byte stores the previous node's length;
- If the previous node is 254 bytes or longer, prevlen is 5 bytes long: the first byte is set to 0xFE, and the following four bytes hold the previous node's length.
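A minimal sketch of that encoding rule in the spirit of ziplist.c; the real zipStorePrevEntryLength also supports a query mode and handles endianness explicitly:

```c
#include <stdint.h>
#include <string.h>

/* Write the previous entry's length at p; return how many bytes the
 * prevlen field occupies (1 or 5). */
unsigned int storePrevEntryLength(unsigned char *p, uint32_t len) {
    if (len < 254) {
        p[0] = (unsigned char)len;    /* short form: 1 byte */
        return 1;
    }
    p[0] = 0xFE;                      /* marker byte for the 5-byte form */
    memcpy(p + 1, &len, sizeof(len)); /* 4 bytes holding the real length */
    return 5;
}
```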
Think about it: notice that the first byte here is set to 0xFE rather than 0xFF. Why might that be?
Right! As mentioned earlier, zlend is the special value 0xFF marking the end of the compressed list, so prevlen's first byte cannot be 0xFF. There was once an issue about this on the Redis repo, where someone pointed out that a comment about it in ziplist was wrong; antirez confirmed the comment was indeed incorrect and happily fixed it, ha ha!
Another question worth considering: why is prevlen either 1 byte or 5 bytes long? Why are there no intermediate 2-, 3-, or 4-byte forms? Answering this leads to one of today's key topics: chained updates.
Chain update problem
Consider this scenario of adding nodes:
If a new node longer than 254 bytes is inserted at the head of the compressed list, the next node's prevlen must become 5 bytes, though it was 1 byte before the insertion, so that node must be expanded. In the extreme case, if each expansion pushes the next node over the threshold in turn, the expansion keeps propagating and causes a chain reaction:
Consider another scenario of deleting a node:
If the node being deleted is small and the node before it is large, then after the deletion the following node must record the large preceding node's length, which again raises the expansion problem:
Having understood chained updates, revisit the 1-byte-or-5-bytes question: 2-4 byte forms would make chain reactions considerably more likely, whereas jumping straight to the 5-byte maximum greatly reduces the probability of chained updates, so the author evidently judged the small memory waste worthwhile.
By design, ziplist is not good at modification: once the amount of stored data exceeds a certain threshold, changes cause memory copies, and traversal costs rise as well.
0x09. Talk about Redis Zset and skip lists
typedef struct zset {
dict *dict;
zskiplist *zsl;
} zset;
The layout of the dictionary and the skip list inside a ZSet:
9.1 Implementation details of skip lists in ZSet
- The principle of random level generation:
  - Specify the node's maximum level MaxLevel and a probability p; the default level lvl is 1;
  - Generate a random number r in [0,1); if r < p and lvl < MaxLevel, then lvl++;
  - Repeat the previous step until the generated r > p; lvl is then the level at which to insert.
Redis' skip list implementation basically follows this idea, with a slight difference. Look at the random level source for skip lists in Redis' src/t_zset.c:
/* Returns a random level for the new skiplist node we are going to create.
* The return value of this function is between 1 and ZSKIPLIST_MAXLEVEL
* (both inclusive), with a powerlaw-alike distribution where higher
* levels are less likely to be returned. */
int zslRandomLevel(void) {
int level = 1;
while ((random()&0xFFFF) < (ZSKIPLIST_P * 0xFFFF))
level += 1;
return (level<ZSKIPLIST_MAXLEVEL) ? level : ZSKIPLIST_MAXLEVEL;
}
Two of these macros are defined in redis.h:
#define ZSKIPLIST_MAXLEVEL 32 /* Should be enough for 2^32 elements */
#define ZSKIPLIST_P 0.25 /* Skiplist P = 1/4 */
Look at the condition in the while loop:
(random()&0xFFFF) < (ZSKIPLIST_P*0xFFFF)
When I first saw this formula I was a bit thrown, because it involves bit operations, and I wanted to understand why antirez wrote it this way.
I had assumed random() returned a floating-point number in [0,1], in which case masking it with 0xFFFF would make little sense. Looking at how random() is used elsewhere in the code base clears this up:
- the use of random() in dict.c
- the use of random() in cluster.c
random() actually returns an integer (effectively a uint32-range value), so antirez's formula is easy to understand: random()&0xFFFF keeps the low 16 bits, a uniform value in [0, 0xFFFF], and the comparison against ZSKIPLIST_P*0xFFFF succeeds with probability ZSKIPLIST_P = 0.25.
For comparison, LevelDB's skip list implements the same idea with a modulus rather than a mask (kBranching = 4 corresponds to p = 1/4):
template <typename Key, typename Value>
int SkipList<Key, Value>::randomLevel() {
static const unsigned int kBranching = 4;
int height = 1;
while (height < kMaxLevel && ((::Next(rnd_) % kBranching) == 0)) {
height++;
}
assert(height > 0);
assert(height <= kMaxLevel);
return height;
}
uint32_t Next( uint32_t& seed) {
seed = seed & 0x7fffffffu;
if (seed == 0 || seed == 2147483647L) {
seed = 1;
}
static const uint32_t M = 2147483647L;
static const uint64_t A = 16807;
uint64_t product = seed * A;
seed = static_cast<uint32_t>((product >> 31) + (product & M));
if (seed > M) {
seed -= M;
}
return seed;
}
- Average number of levels of a skip list node

If the frequency of something has a power-law relationship with one of its attributes, the frequencies are said to follow a power law. Its hallmark is that a few events account for a large share of the total frequency, while the remaining majority of events account for only a small share.
- The number of levels of a node is at least 1; the level count follows a probability distribution:
  - The probability that a node has exactly 1 level is p^0(1-p);
  - The probability that a node has exactly 2 levels is p^1(1-p);
  - The probability that a node has exactly 3 levels is p^2(1-p);
  - The probability that a node has exactly 4 levels is p^3(1-p);
  - In general, the probability that a node has exactly k levels is p^(k-1)(1-p).
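Summing that geometric distribution gives the expected number of levels per node; a quick derivation (ignoring the MaxLevel cap, which changes the result only negligibly):

$$E[\text{level}] = \sum_{k=1}^{\infty} k \, p^{k-1}(1-p) = \frac{1}{1-p}$$

With Redis' p = 0.25 this is 1/(1-0.25) = 4/3 ≈ 1.33, so on average each node carries only about 1.33 forward pointers, which is how the skip list stays memory-cheap.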
0x0A. Talk about the Clustered Redis and Gossip protocols
A.1 Basic Information about Clusters
- Single-instance Redis architecture
- Clustering and sharding
Client-side sharding: the client hashes keys and takes a modulus over the server list; this is easy to use when the client fully controls the number of servers.
Middle-layer sharding: a middle layer is added between client and server, acting as administrator and scheduler. Client requests go to the middle layer, which forwards requests and collects replies; most importantly, the middle layer dynamically manages the set of servers.
Server-side sharding: a decentralized mode without an intermediary. The client sends requests directly to any node; if the requested node does not hold the data, it replies with a redirect telling the client where the data lives, and the client re-issues the request there. In effect, client and server cooperate to complete the request through redirection.
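For reference, Redis Cluster's server-side sharding maps every key to one of 16384 hash slots using CRC16. A minimal sketch follows; the real keyHashSlot in cluster.c additionally honors {hash tags}, which is omitted here:

```c
#include <stdint.h>

extern uint16_t crc16(const char *buf, int len); /* Redis ships one in crc16.c */

/* Map a key to one of 16384 cluster slots (0..16383). */
unsigned int keyHashSlotSketch(const char *key, int keylen) {
    return crc16(key, keylen) & 16383;
}
```

Each node owns a subset of the 16384 slots, so redirecting a client is just a matter of telling it which node currently serves the key's slot.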
- Cluster-mode Redis with middle-layer sharding
- The official cluster mode with server-side sharding
A.2 Basic operating principles of the Redis Cluster
- The node state information structure, which includes:
  - The current cluster state
  - The slots each node in the cluster is responsible for, and their migrate status
  - The master/slave status of each node in the cluster
  - The liveness status of each node and its unreachable votes
- The concept of the Gossip protocol
The Gossip Protocol, also known as the Epidemic Protocol, is a protocol for exchanging information between nodes or processes based on how an epidemic spreads.
It is widely used in distributed systems. For example, we can use the Gossip protocol to ensure that all nodes in the network have the same data.
The Gossip Protocol was originally created in 1987 by Alan Demers, a researcher at Xerox’s Palo Alto Research Center.
https://www.iteblog.com/archives/2505.html
The Gossip algorithm is also known as anti-entropy. Entropy is a physical concept representing disorder, and anti-entropy means seeking consistency within disorder. This captures the character of Gossip: in a bounded network, each node communicates randomly with other nodes, and after a period of seemingly chaotic communication the state of all nodes eventually reaches agreement. Each node may know every other node or only a few neighbors; as long as the nodes are connected through the network, their states eventually become consistent, which is of course also how an epidemic spreads.
https://www.backendcloud.cn/2017/11/12/raft-gossip/
- Messages used by the Gossip protocol:
  - Meet: via the CLUSTER MEET ip port command, a node in an existing cluster invites a new node to join the cluster;
  - Ping: a node sends a ping message to other cluster nodes every second, carrying its known address, slot, and state information and the time of last communication;
  - Pong: on receiving a ping message, a node replies with a pong containing its own known node information;
  - Fail: after failing to ping a node, a node broadcasts a message to the whole cluster that the node is down; nodes receiving the message mark that node as offline.
- Fault detection based on the Gossip protocol: a node is marked as failed when
  - More than half of the master nodes have the node in the PFAIL state;
  - The current node has also marked the node as PFAIL.
0x0B. Talk about understanding Redis’ memory reclaim mechanism
As an in-memory database, if Redis only ever takes data in, it will sooner or later blow up its memory. Plenty of folks who treat Redis as a primary database have tasted that bitter fruit, unless of course your company is rich enough that terabytes of RAM are a drizzle, or the data stops growing past a certain point, in which case it's a different story.
For Redis to run safely and stably, keeping memory usage within a certain threshold is necessary: delete what needs deleting, clean what needs cleaning, and keep in memory only the keys that are needed. Think of a great river that needs several warning water levels so it neither bursts its banks nor runs dry; Redis is the same, except Redis only worries about the banks bursting. Here's a picture:
In the figure the machine memory is 128GB; occupying 64GB is a fairly safe level. If usage approaches 80%, around 100GB, Redis is considered to be under a fairly heavy load. The exact proportions can be set according to the business experience of each company and individual.
I simply want to say: for safety and stability, don't assume that 128GB of memory means you can store 128GB of data; it has to be discounted.
B.1 Where does the Reclaimed Memory Come from
Redis' memory usage is divided into two parts: storage of key-value pairs, and the memory Redis itself needs to run. Obviously we cannot reclaim the latter, so we have to start with the key-value pairs. Key-value pairs can be divided into several kinds: expired, not expired, hot data, and cold data. Expired key-value pairs need to be deleted; and if memory is still insufficient after all expired pairs are deleted, well, then some data has to be evicted.
B.2 How expired key-value pairs are deleted
To implement key-value pair deletion we need to understand the following:
- Where are key-value pairs with an expiration time stored?
- How do we determine whether a key-value pair with an expiration time can be deleted?
- What deletion mechanisms exist, and how should we choose among them?
1. Storage of key and value pairs
The redisDb structure in src/server.h gives the answer:
typedef struct redisDb {
dict *dict; /* The keyspace for this DB */
dict *expires; /* Timeout of keys with a timeout set */
dict *blocking_keys; /* Keys with clients waiting for data (BLPOP)*/
dict *ready_keys; /* Blocked keys that received a PUSH */
dict *watched_keys; /* WATCHED keys for MULTI/EXEC CAS */
int id; /* Database ID */
long long avg_ttl; /* Average TTL, just for stats */
unsigned long expires_cursor; /* Cursor of the active expire cycle. */
list *defrag_later; /* List of key names to attempt to defrag one by one, gradually. */
} redisDb;
Redis is essentially a big key-value store: keys are strings, and values can be one of several objects: strings, lists, sorted sets, sets, hashes, and so on. All these key-value pairs are stored in the dict of redisDb.
Redis organizes expired key-value pairs according to its own rules; in the end, deletion just means removing the target key-value pair from this dict. Let's move on to the next step.
The expires member of redisDb is also of type dict, just like the key-value space. In essence, the keys of expires are a subset of the keys in dict: expires holds all the key-value pairs that have an expiration set.
For keys, we can set the absolute and relative expiration times and view the remaining time:
- EXPIRE and PEXPIRE set a key-value pair's time to live at second and millisecond granularity; these are relative-duration expiration settings;
- EXPIREAT and PEXPIREAT set expiration at a second- or millisecond-level timestamp; these are absolute expiration settings;
- TTL and PTTL view the remaining time to live of a key-value pair that has one.
The above three sets of commands are useful when designing a cache, and should be noted by interested readers.
What the expires dictionary stores is not exactly the same as the key-value space: the key of an expires entry is a pointer to the corresponding key object, and its value is a long long Unix timestamp. The relative durations passed to EXPIRE and PEXPIRE are ultimately converted into such timestamps.
2. Determine whether the key-value pair is expired
To determine whether a key has expired and can be deleted, first check whether it exists in the expires dictionary; if it does, compare the expiration timestamp with the current timestamp and decide whether to delete. The simple flow is shown in the figure:
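A minimal sketch of that check in the spirit of db.c; the real code also accounts for replication and lazy-expiration semantics:

```c
/* Return 1 if key has an expiration set and it lies in the past.
 * mstime() returns the current Unix time in milliseconds, as in the
 * Redis source. */
int keyIsExpiredSketch(redisDb *db, robj *key) {
    dictEntry *de = dictFind(db->expires, key->ptr);
    if (de == NULL) return 0;                     /* no expiration set */
    long long when = dictGetSignedIntegerVal(de); /* absolute ms timestamp */
    return mstime() > when;
}
```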
3. The policy for deleting key-value pairs
From the previous steps we know Redis' two storage locations, the key space and the expires dictionary, the structure of the expires dictionary, and how to judge expiration. So how is deletion actually implemented?
Instead of Redis, consider several possible deletion strategies:
- Timed deletion: when a key's expiration is set, create a timer; the instant the expiration arrives, the timer deletes the key-value pair immediately;
- Periodic deletion: scan the database at a fixed interval, detecting and deleting expired key-value pairs;
- Lazy deletion: an expired key-value pair is not deleted right away; deletion depends on usage. When a key is accessed, first check whether it has expired: delete it if so, keep it otherwise.
Of the three strategies above, timed deletion and periodic deletion are both active deletion at different time granularities, while lazy deletion is passive deletion.
The three strategies each have pros and cons: timed deletion is best for memory usage but unfriendly to the CPU; lazy deletion is unfriendly to memory, since key-value pairs that are never accessed again waste a certain amount of memory; periodic deletion is a compromise between the two.
Redis adopts a combination of lazy deletion and periodic deletion. Timed deletion could normally be implemented with a min-heap timer, but Redis, considering the limited number and kinds of its time events, stores time events in an unordered linked list; implementing timed deletion on top of that would mean an O(N) traversal to find the data that needs deleting next.
However, I think that if antirez had wanted timed deletion, he certainly would not have kept the primitive unordered-list mechanism; so personally I don't believe the existing unordered list is the fundamental reason Redis avoids timed deletion. My bold guess is simply that antirez did not consider timed deletion necessary.
4. Implementation details that are periodically deleted
Periodic deletion sounds simple, but how are the execution frequency and duration controlled?
If it runs too infrequently it degenerates into lazy deletion; if it runs for too long it approaches timed deletion. The timing of runs also needs thought. So let's look at how Redis implements periodic deletion. In the src/expire.c file I found the activeExpireCycle function, which is the implementation of periodic deletion. antirez commented the code in detail, albeit entirely in English; reading it carefully yields a rough outline, which is why learning English and reading first-hand material is an important way to learn.
The core part of the code is about 210 lines, including comments. Here are the details:
#define ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP 20 /* Keys for each DB loop. */
#define ACTIVE_EXPIRE_CYCLE_FAST_DURATION 1000 /* Microseconds. */
#define ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC 25 /* Max % of CPU to use. */
#define ACTIVE_EXPIRE_CYCLE_ACCEPTABLE_STALE 10 /* % of stale keys after which
we do extra efforts. */
void activeExpireCycle(int type) {
/* Adjust the running parameters according to the configured expire
* effort. The default effort is 1, and the maximum configurable effort
* is 10. */
unsigned long
effort = server.active_expire_effort-1, /* Rescale from 0 to 9. */
config_keys_per_loop = ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP +
ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP/4*effort,
config_cycle_fast_duration = ACTIVE_EXPIRE_CYCLE_FAST_DURATION +
ACTIVE_EXPIRE_CYCLE_FAST_DURATION/4*effort,
config_cycle_slow_time_perc = ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC +
2*effort,
config_cycle_acceptable_stale = ACTIVE_EXPIRE_CYCLE_ACCEPTABLE_STALE-
effort;
/* This function has some global state in order to continue the work
* incrementally across calls. */
static unsigned int current_db = 0; /* Last DB tested. */
static int timelimit_exit = 0; /* Time limit hit in previous call? */
static long long last_fast_cycle = 0; /* When last fast cycle ran. */
int j, iteration = 0;
int dbs_per_call = CRON_DBS_PER_CALL;
long long start = ustime(), timelimit, elapsed;
/* When clients are paused the dataset should be static not just from the
* POV of clients not being able to write, but also from the POV of
* expires and evictions of keys not being performed. */
if (clientsArePaused()) return;
if (type == ACTIVE_EXPIRE_CYCLE_FAST) {
        /* Don't start a fast cycle if the previous cycle did not exit
         * for time limit, unless the percentage of estimated stale keys is
         * too high. Also never repeat a fast cycle for the same period
         * as the fast cycle total duration itself. */
        if (!timelimit_exit &&
            server.stat_expired_stale_perc < config_cycle_acceptable_stale)
            return;

        if (start < last_fast_cycle + (long long)config_cycle_fast_duration*2)
            return;

        last_fast_cycle = start;
    }

    /* We usually should test CRON_DBS_PER_CALL per iteration, with
     * two exceptions:
     *
     * 1) Don't test more DBs than we have.
     * 2) If last time we hit the time limit, we want to scan all DBs
     * in this iteration, as there is work to do in some DB and we don't want
     * expired keys to use memory for too much time. */
    if (dbs_per_call > server.dbnum || timelimit_exit)
        dbs_per_call = server.dbnum;

    /* We can use at max 'config_cycle_slow_time_perc' percentage of CPU
     * time per iteration. Since this function gets called with a frequency of
     * server.hz times per second, the following is the max amount of
     * microseconds we can spend in this function. */
    timelimit = config_cycle_slow_time_perc*1000000/server.hz/100;
    timelimit_exit = 0;
    if (timelimit <= 0) timelimit = 1;

    if (type == ACTIVE_EXPIRE_CYCLE_FAST)
        timelimit = config_cycle_fast_duration; /* in microseconds. */

    /* Accumulate some global stats as we expire keys, to have some idea
     * about the number of keys that are already logically expired, but still
     * existing inside the database. */
    long total_sampled = 0;
    long total_expired = 0;

    for (j = 0; j < dbs_per_call && timelimit_exit == 0; j++) {
        /* Expired and checked in a single loop. */
        unsigned long expired, sampled;

        redisDb *db = server.db+(current_db % server.dbnum);

        /* Increment the DB now so we are sure if we run out of time
         * in the current DB we'll restart from the next. This allows to
         * distribute the time evenly across DBs. */
        current_db++;
/* Continue to expire if at the end of the cycle more than 25%
* of the keys were expired. */
do {
unsigned long num, slots;
long long now, ttl_sum;
int ttl_samples;
iteration++;
/* If there is nothing to expire try next DB ASAP. */
if ((num = dictSize(db->expires)) == 0) {
db->avg_ttl = 0;
break;
}
slots = dictSlots(db->expires);
now = mstime();
/* When there are less than 1% filled slots, sampling the key
* space is expensive, so stop here waiting for better times...
* The dictionary will be resized asap. */
if (num && slots > DICT_HT_INITIAL_SIZE &&
(num*100/slots < 1)) break;
/* The main collection cycle. Sample random keys among keys
* with an expire set, checking for expired ones. */
expired = 0;
sampled = 0;
ttl_sum = 0;
ttl_samples = 0;
if (num > config_keys_per_loop)
num = config_keys_per_loop;
/* Here we access the low level representation of the hash table
* for speed concerns: this makes this code coupled with dict.c,
* but it hardly changed in ten years.
*
* Note that certain places of the hash table may be empty,
* so we want also a stop condition about the number of
* buckets that we scanned. However scanning for free buckets
* is very fast: we are in the cache line scanning a sequential
* array of NULL pointers, so we can scan a lot more buckets
* than keys in the same time. */
long max_buckets = num*20;
long checked_buckets = 0;
while (sampled < num && checked_buckets < max_buckets) {
for (int table = 0; table < 2; table++) {
if (table == 1 && !dictIsRehashing(db->expires)) break;
unsigned long idx = db->expires_cursor;
idx &= db->expires->ht[table].sizemask;
dictEntry *de = db->expires->ht[table].table[idx];
long long ttl;
/* Scan the current bucket of the current table. */
checked_buckets++;
while(de) {
/* Get the next entry now since this entry may get
* deleted. */
dictEntry *e = de;
de = de->next;
ttl = dictGetSignedIntegerVal(e)-now;
if (activeExpireCycleTryExpire(db,e,now)) expired++;
if (ttl > 0) {
/* We want the average TTL of keys yet
* not expired. */
ttl_sum += ttl;
ttl_samples++;
}
sampled++;
}
}
db->expires_cursor++;
}
total_expired += expired;
total_sampled += sampled;
/* Update the average TTL stats for this database. */
if (ttl_samples) {
long long avg_ttl = ttl_sum/ttl_samples;
/* Do a simple running average with a few samples.
* We just use the current estimate with a weight of 2%
* and the previous estimate with a weight of 98%. */
if (db->avg_ttl == 0) db->avg_ttl = avg_ttl;
db->avg_ttl = (db->avg_ttl/50)*49 + (avg_ttl/50);
}
/* We can't block forever here even if there are many keys to
* expire. So after a given amount of milliseconds return to the
* caller waiting for the other active expire cycle. */
if ((iteration & 0xf) == 0) { /* check once every 16 iterations. */
elapsed = ustime()-start;
if (elapsed > timelimit) {
timelimit_exit = 1;
server.stat_expired_time_cap_reached_count++;
break;
}
}
/* We don't repeat the cycle for the current database if there are
* an acceptable amount of stale keys (logically expired but yet
* not reclaimed). */
} while (sampled == 0 ||
(expired*100/sampled) > config_cycle_acceptable_stale);
}
elapsed = ustime()-start;
server.stat_expire_cycle_time_used += elapsed;
latencyAddSampleIfNeeded("expire-cycle",elapsed/1000);
/* Update our estimate of keys existing but yet to be expired.
* Running average with this sample accounting for 5%. */
double current_perc;
if (total_sampled) {
current_perc = (double)total_expired/total_sampled;
} else
current_perc = 0;
server.stat_expired_stale_perc = (current_perc*0.05)+
(server.stat_expired_stale_perc*0.95);
}
```
To be honest, there is a lot of detail in this code. My understanding of the Redis source is limited, so I can only offer a rough interpretation, and mistakes are inevitable; readers who can are encouraged to read the source directly. Here is my rough reading:
-
The algorithm is adaptive. When there are few expired keys, it spends little CPU time on them; when there are many, it works more aggressively, so that expired data never occupies memory for long.
-
Redis has multiple databases (DBs), and the algorithm scans them one by one. When a cycle ends, the next cycle picks up from the DB where the previous one stopped, so the scan forms a closed loop.
-
Periodic deletion has two modes: a fast cycle and a slow cycle. The slow cycle is the main mode; its frequency depends on server.hz, which is usually set to 10.
-
The slow cycle runs longer and may hit its time limit; the fast cycle runs for at most about 1ms, that is, each run is shorter but runs happen more often. When a fast cycle starts, it is skipped if the estimated proportion of stale keys is below the acceptable threshold (config_cycle_acceptable_stale).
Periodic deletion is thus an adaptive, closed-loop, probabilistic sampling scan, bounded by execution time and CPU budget: once a threshold is hit, it stops. In other words, it does its work quietly without hurting responsiveness to clients.
5. DEL Deletes key and value pairs
Before Redis 4.0, running DEL on a key with a very large value could block the server. Newer releases introduced BIO background threads and new commands such as UNLINK to implement lazy deletion: the key is removed from the keyspace immediately, and a BIO thread reclaims the memory afterwards.
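As a minimal sketch with the redis-py client (assuming a local Redis 4.0+; the key name is illustrative):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# DEL frees the value synchronously, so a huge key can stall the event loop.
# UNLINK removes the key from the keyspace immediately and hands the actual
# memory reclamation to a background (BIO) thread.
r.lpush("big:list", *range(100000))  # stand-in for a genuinely large key
r.unlink("big:list")                 # returns the number of keys unlinked
```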
B.2 Memory Eviction Mechanism
To keep Redis running safely and stably, a maxmemory threshold is set. When memory usage reaches the threshold, new key-value pairs can no longer be written and a memory eviction mechanism is needed. The Redis configuration offers several eviction policies to choose from:
-
noeviction: when memory cannot hold new data, new write operations report an error;
-
allkeys-lru: when memory cannot hold new data, remove the least recently used key from the whole keyspace;
-
allkeys-random: when memory cannot hold new data, remove a random key from the whole keyspace;
-
volatile-lru: when memory cannot hold new data, remove the least recently used key from among keys with an expiry set;
-
volatile-random: when memory cannot hold new data, remove a random key from among keys with an expiry set;
-
volatile-ttl: when memory cannot hold new data, among keys with an expiry set, remove the keys closest to expiring first.
The last three policies all operate on the expired dictionary; when it is empty they degenerate to noeviction and writes fail. Random removal without any strategy is rarely desirable, so the second option, allkeys-lru, eviction based on the LRU policy, is generally the choice.
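As a hedged sketch of selecting a policy at runtime with redis-py (the same settings can live in redis.conf; the values are illustrative):

```python
import redis

r = redis.Redis()
r.config_set("maxmemory", "100mb")               # eviction threshold
r.config_set("maxmemory-policy", "allkeys-lru")  # approximate-LRU eviction
print(r.config_get("maxmemory-policy"))
```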
Personally, I think Antirez has always had an engineering mindset and is good at using probabilistic designs for approximate implementations, and the LRU algorithm is no exception. Redis implements an approximate LRU algorithm, and after several versions of iteration its behavior is quite close to theoretical LRU. This is a rich topic in its own right, and I plan to discuss the LRU algorithm separately later.
The expired-key deletion policy emphasizes operating on expired keys: as long as memory is sufficient, Redis does not resort to eviction to free space; it simply deletes keys that are due to expire.
The memory eviction mechanism emphasizes reclaiming memory from data: when memory is insufficient, even keys that have not reached their expiry time, or have no expiry set at all, must be removed according to some policy so that new data has room to be written.
Talk about understanding Redis data synchronization mechanism and principle
To understand the relationship between persistence and data synchronization, you need to analyze it from the perspective of single point of failure and high availability:
C.1 Single Point Of Failure
Suppose we have only a single Redis machine serving as a cache, persisting hot data to disk. At some point this single machine fails and goes down; during that window the cache is gone and all requests crush the primary storage service, while monitoring pulls the downed Redis back up.
After the restart, the machine can load the RDB file on disk for a quick recovery; how long that takes depends on the data volume, usually seconds to minutes. Once recovery finishes, the hot data is back in place.
So the persistence mechanism clearly earns its keep in single-point Redis. Text alone puts people to sleep, so here's a picture:
In a high-availability system, a single-point cache going down is not acceptable, so a master-slave structure is used: the master node's data is replicated to multiple slaves. If the master goes down, the best-placed slave can be promoted to master immediately to take over write traffic, while the other slaves resynchronize from the new master, keeping the whole Redis caching layer highly available.
The figure shows the master-failover process of a Redis system with one master and two slaves under read/write splitting; the system keeps working normally throughout, which greatly improves its availability:
From these two points of analysis, a small conclusion can be drawn: persistence makes single points of failure no longer scary, and data synchronization gives high availability wings.
We understand the important role of data synchronization to Redis, then continue to see the implementation principle and process of data synchronization, key difficulties and other details of it!
C.2 CAP theory in Redis system
Readers with a background in distributed storage will know the CAP theory. Embarrassingly, I first ran into it in March 2018, when I changed jobs and interviewed for a back-end position at Face++ (Megvii). Apart from the CAP question, everything else was within my range, and I ended up taking the offer.
In theoretical computer science, the CAP theorem, also known as Brewer's theorem, stems from a conjecture put forward by Eric Brewer, a computer scientist at the University of California, Berkeley, at the 2000 Symposium on Principles of Distributed Computing (PODC).
In 2002, Seth Gilbert and Nancy Lynch of the Massachusetts Institute of Technology published a proof of Brewer's conjecture, making it a theorem. It states that a distributed computing system cannot simultaneously satisfy all three of the following guarantees:
-
C Consistency
-
A Availability
-
P Partition tolerance
Take a look at a picture drawn by Ruan Yifeng:
The key to understanding CP and AP is partition tolerance. Network partitions are common in distributed storage: even machines in the same data center are not all on one rack or behind one switch.
So network jitter occurs even inside a LAN. I did DPI work for over a year, and the three terms that left the deepest impression on me about network transmission are: packet loss, reordering, and retransmission. A network that looks calm to us may be a raging storm from the server's point of view; one slip and things break. And once the network is cut, a network partition appears.
Back to Redis data synchronization: suppose a slave and its master sit on two racks, and the network between them is cut at some moment. With read/write splitting in place, the slave can no longer synchronize from the master. If reads continue to be served from the slave, data inconsistency appears; if strong consistency is enforced instead, the slave can no longer serve reads, which means unavailability. So under the influence of P, C and A cannot both be kept.
I won't go through the other cases; the conclusion is this: when multiple Redis machines sit on different networks and a network failure occurs, data consistency and service availability cannot both be preserved, and the Redis system must make a choice. In fact, Redis chose availability, or more precisely, Redis chose another kind of consistency: eventual consistency.
C.3 Final Consistency and replication of Redis
Redis chose eventual consistency, meaning the data is not guaranteed to be consistent at every moment, and Redis master-slave synchronization is asynchronous by default. Mark this down!
Let me make synchronous replication and asynchronous replication clear (note: I deliberately write "replication" rather than "synchronization" to keep the terms apart):
A picture is worth a thousand words. Look at the red numbers to know the difference between synchronous and asynchronous replication:
-
Asynchronous replication: When the client writes hello World to the master node, the master node sends OK back to the client, and the interaction between the master node and the client is complete. The master node synchronizes hello World to the slave node, and the slave node sends OK back to the master node. The client does not need to wait for the synchronization to complete. So the whole process is implemented asynchronously.
-
Synchronous replication: When the client writes hello World to the master node, the master node synchronizes hello World to the slave node, the slave node replies OK to the master node, and then the master node replies OK to the client. The client needs to wait for the synchronization to complete, so the whole process is implemented synchronously.
Redis chose asynchronous replication so the client doesn't have to wait, which fits real-world needs better; still, the behavior can be adjusted to your own requirements.
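For example, the WAIT command gives a client an opt-in, synchronous-style barrier on top of the asynchronous default: it blocks until the given number of replicas acknowledge all previous writes, or the timeout fires. A minimal redis-py sketch, assuming one reachable replica:

```python
import redis

r = redis.Redis(host="localhost", port=6379)
r.set("greeting", "hello world")  # returns as soon as the master has it

# Block until at least 1 replica has acknowledged all prior writes,
# or until 1000 ms pass; returns the number of replicas that acked.
acks = r.execute_command("WAIT", 1, 1000)
print(acks)
```

Note that WAIT does not turn Redis into a strongly consistent store; it only bounds the replication lag for this client.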
1. Slave-slave replication. Consider a highly available Redis setup with one master and four slaves: if all four slaves synchronize from the master at the same time, the pressure on the master becomes significant. Since Redis pursues eventual consistency anyway, it later introduced slave-slave replication, evolving the single-layer replication topology into a multi-layer one. I drew a diagram of it:
2. Full replication and incremental replication
Full replication is the initialization-phase replication that happens when a node recovers from a failure or a new slave node is added. It synchronizes all of the master's data to the slave, so its cost is high but unavoidable.
Incremental replication is the moment-to-moment replication that runs once master and slave are working normally; the data trickles across in a steady stream. This mode is light and fast, and its advantages are many.
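Before drilling into the two modes, here is a minimal sketch of attaching a slave with redis-py (addresses are illustrative; on versions before 5.0 the command is SLAVEOF):

```python
import redis

replica = redis.Redis(port=6380)
replica.execute_command("REPLICAOF", "127.0.0.1", "6379")
print(replica.info("replication")["role"])  # -> 'slave'
```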
3. Analyze the full replication process
Redis’ full replication process is divided into three main stages:
-
Snapshot phase: the slave sends the SYNC command to the master to request full replication; the master runs bgsave to snapshot all of its in-memory data and sends the snapshot to the slave; the slave frees its old memory, then loads and parses the new snapshot.
-
Buffer phase: the master synchronizes to the slave the operation commands held in the replication buffer; these are the new writes that arrived between the start of bgsave and the completion of snapshot loading on the slave, and they must be recorded or data would be lost.
-
Incremental phase: once the buffer has been synchronized, the master streams new operation commands to the slave as usual, so master and slave stay in step.
Take a diagram from Reference 1, which is very well written:
Consider a multi-slave concurrent full replication problem: what if there are multiple slave nodes simultaneously making full synchronization requests to the master node?
The Redis master is both smart and honest here. Suppose three slave nodes A/B/C send full SYNC requests to the master:
-
If the master is still running bgsave for A when B's and C's SYNC commands arrive, it can send the same snapshot data and buffer data to A, B and C at once, which improves efficiency while preserving correctness.
-
If the master has already completed A's snapshot and is syncing A's buffer, it can only repeat for B and C, after finishing with A, the same procedure it ran for A, to fully synchronize the new nodes. The master does not cut corners: it repeats the whole process, which is tedious but correct.
Consider the snapshot-replication loop problem: running bgsave on the master is time-consuming and memory-hungry, and during loading the slave goes through freeing old memory and then loading new data, a process in which memory usage drops and then climbs back up. Whether the master is taking the snapshot or the slave is loading it, both cost time and resources.
Performance aside, imagine the master's snapshot takes 1 minute, during which 10k new write commands arrive and are appended to the replication buffer. If the buffer is small and holds only 8k commands, then when the snapshot completes the buffer has kept 8k commands and lost 2k, so the full synchronization delivered to the slave is missing data: the full sync is wrong.
The slave then starts SYNC all over again and can get stuck in this loop, which is why sizing the buffer correctly matters so much.
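A hedged sketch of raising the replica output-buffer limit with redis-py so a long bgsave doesn't overflow it (values are illustrative; the class name is "slave" in older versions, "replica" in 5.0+):

```python
import redis

r = redis.Redis()
# hard limit 512mb; soft limit 128mb sustained for 60 seconds
r.config_set("client-output-buffer-limit", "slave 512mb 128mb 60")
```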
4. Analyze the incremental replication process
The incremental replication process is simpler, but very useful. Imagine a messy network environment where not every disconnection is unrecoverable: if every reconnect forced a full resync, the master would be worked to death. Incremental replication is an optimization of data replication for complex networks; it allows a slave to fall behind for a while and then catch up.
Incremental replication is a typical producer-consumer model, implemented with a fixed-length circular array (ring buffer). When the buffer is full, new data overwrites old data, so along the way the slave reports its replication offset back to the master to make sure nothing is missed.
This process is easy to understand; Kafka-like MQs behave the same way. With a reasonably sized buffer, the slave's consumption capacity in theory exceeds the master's production rate, so overwriting mostly happens only when a disconnection lasts so long that the lagging slave's data gets overrun; only then is a full resync required.
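Here is a toy model of the idea, not Redis's actual implementation: a fixed-size ring buffer the master keeps writing into, with a replica asking to resume from its own offset:

```python
class Backlog:
    """Toy replication backlog: fixed-size ring buffer plus a global offset."""
    def __init__(self, size: int):
        self.size = size
        self.buf = bytearray(size)
        self.master_offset = 0                # total bytes ever produced

    def feed(self, data: bytes) -> None:
        for b in data:                        # new data overwrites the oldest
            self.buf[self.master_offset % self.size] = b
            self.master_offset += 1

    def read_from(self, replica_offset: int):
        lag = self.master_offset - replica_offset
        if lag > self.size:
            return None                       # overwritten: full resync needed
        start = replica_offset % self.size
        return bytes(self.buf[(start + i) % self.size] for i in range(lag))

log = Backlog(8)
log.feed(b"abcdef")
print(log.read_from(2))   # b'cdef' -> incremental sync is possible
log.feed(b"ghijkl")
print(log.read_from(2))   # None   -> lag exceeds the buffer, full resync
```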
5. Diskless replication
Before understanding diskless replication, ask: what is the "disk" here? It may be a mechanical drive or an SSD, but either is slow compared with memory, and on the server side IO often dominates the cost. For a high-IO operation like full replication, especially under heavy concurrency, writing the RDB to disk first noticeably affects both Redis itself and other programs on the machine. The older model looked like this:
Starting with Redis 2.8.18, diskless replication was developed to skip generating the RDB file on disk before transfer: the master traverses its in-memory data and serializes it to the slave directly over the network. The slave side is unchanged: it still writes the received data to its local disk first, then loads it into memory once the transfer completes. So diskless replication is IO-friendlier for the master.
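A hedged configuration sketch via redis-py (also settable in redis.conf); the delay batches slaves that arrive close together into one in-memory transfer:

```python
import redis

r = redis.Redis()
r.config_set("repl-diskless-sync", "yes")
r.config_set("repl-diskless-sync-delay", "5")  # seconds to wait for more slaves
```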
Talk about distributed locking based on Redis and Redlock algorithm
D.1 Introduction to Redis-based distributed Locking
Initially, distributed locks were built with the setnx and expire commands, but the pair is not atomic: if the client dies after setnx succeeds but before expire runs, the lock can never be released. Antirez later extended SET with the NX/PX options (available since Redis 2.6.12), making set-and-expire a single atomic command.
In a single master-slave Redis system, the client normally acquires the lock on the master, which then propagates it to the slave. If the master crashes right after the client acquires the lock, before the lock is synchronized to the slave, Sentinel will promote the slave to master, and the new master knows nothing about the unsynchronized lock. If a new client now asks the new master for the lock, two clients can end up holding the same lock. Look at the diagram and think this process through:
To ensure that a lock can only be released by its owner, a unique value must be attached for verification. In summary, the simple flow of acquiring and releasing a lock on a single Redis node is as follows:
```
# Acquire the lock: NX = only set if the key does not exist,
# PX = expire after 30000 milliseconds
SET resource_name unique_value NX PX 30000
```

```lua
-- Release the lock: check unique_value first, so we never release
-- a lock held by someone else
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
```
Those are the key points of distributed locking on a single Redis node.
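A minimal client-side sketch with redis-py; the key name and TTL are illustrative, and production code should prefer a cryptographically random value:

```python
import uuid
import redis

r = redis.Redis()

RELEASE_LUA = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""
release = r.register_script(RELEASE_LUA)

def acquire(resource: str, ttl_ms: int = 30000):
    token = str(uuid.uuid4())
    # SET ... NX PX: create the key only if absent, with an expiry, atomically
    return token if r.set(resource, token, nx=True, px=ttl_ms) else None

token = acquire("resource_name")
if token:
    try:
        ...  # do the work the lock protects
    finally:
        release(keys=["resource_name"], args=[token])  # releases only our lock
```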
D.2 Basic procedure of Redlock algorithm
The Redlock algorithm is the high-availability scheme Antirez introduced on top of the single-node approach. In a distributed Redis environment, we assume N completely independent Redis nodes and use the same method as on a single instance to acquire and release the lock on each of the N instances.
Now assume there are 5 Redis master nodes (an odd number, no fewer than 3), so that they are unlikely to all go down at once. To acquire and release a lock, the client performs the following steps:
-
Gets the current Unix time, in milliseconds
-
When requesting the lock from each Redis node, the client sets a connection and response timeout much shorter than the lock's expiry time, so it doesn't waste time talking to a Redis node that is already down
-
The client subtracts the start timestamp from the current time to get how long acquiring the lock took. The lock is successfully acquired if and only if it was obtained on more than half of the Redis nodes and the time taken is less than the lock's validity time
-
If a lock is acquired, the true valid time of the key is equal to the valid time minus the time it took to acquire the lock. This is important
-
If the lock acquisition failed for any reason (locking fewer than half of the instances, or taking longer than the validity time), the client should release the lock on all Redis instances, including those it believes it failed to lock, because the server's success response may simply have been lost on the way back; releasing one extra time does no harm
These 5 steps are the core of the Redlock algorithm and a hot interview topic; interested readers should note them down! A rough sketch of the flow follows.
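Here is that sketch, assuming five independent Redis instances; hosts, ports and timings are illustrative, and real projects should prefer a maintained Redlock library:

```python
import time
import uuid
import redis

NODES = [redis.Redis(port=p, socket_timeout=0.05)   # short per-node timeout
         for p in (6379, 6380, 6381, 6382, 6383)]

def redlock_release(resource: str, token: str) -> None:
    for node in NODES:                               # step 5: undo everywhere
        try:
            # A real implementation does this atomically with a Lua script.
            if node.get(resource) == token.encode():
                node.delete(resource)
        except redis.RedisError:
            pass

def redlock_acquire(resource: str, ttl_ms: int = 10000):
    token = str(uuid.uuid4())
    start = time.monotonic()                         # step 1: current time
    acquired = 0
    for node in NODES:                               # step 2: try each node
        try:
            if node.set(resource, token, nx=True, px=ttl_ms):
                acquired += 1
        except redis.RedisError:
            pass                                     # a dead node is a miss
    elapsed_ms = (time.monotonic() - start) * 1000   # step 3: time spent
    validity_ms = ttl_ms - elapsed_ms                # step 4: what's left
    if acquired >= len(NODES) // 2 + 1 and validity_ms > 0:
        return token, validity_ms
    redlock_release(resource, token)                 # step 5 on failure
    return None, 0
```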
D.3 Debate on whether The Redlock algorithm is safe
1. About Dr. Martin Kleipman
On February 8, 2016, Dr. Martin Kleppmann, an expert on distributed systems, laid out some principles of distributed lock design in an article titled "How to do distributed locking" and raised doubts about Antirez's Redlock algorithm. I found Dr. Kleppmann's personal website and a brief introduction.
Check it out with Sogou translator:
1. I am a senior research associate and affiliated lecturer in the Department of Computer Science and Technology, University of Cambridge, funded by a Leverhulme Trust Early Career Fellowship and the Isaac Newton Trust. I work on local-first collaborative software and distributed systems security.
2. I am also a research fellow and director of studies in computer science at Corpus Christi College, Cambridge, where I do undergraduate teaching.
3. In 2017, I published a book for O'Reilly called Designing Data-Intensive Applications. It covers the architecture of a broad range of databases and distributed data processing systems and is one of the publisher's best sellers.
4. I speak regularly at conferences, and recordings of my talks have been viewed more than 150,000 times.
5. I have worked on various open source projects, including Automerge, Apache Avro and Apache Samza.
6. From 2007 to 2014, I worked as an industrial software engineer and entrepreneur. I co-founded Rapportive (acquired by LinkedIn in 2012) and Go Test It (acquired by Red Gate Software in 2009).
7. I have composed several musical works, including Die Türme des Februar (in German), a musical theatre adaptation of the book by Tonke Dragt, which premiered in 2007 with a cast of 150.
Truly a case of "excellent people are excellent at everything": teaching, publishing books, writing open source software, founding startups, even composing musicals.
2. The main points of Dr. Martin’s article
In the article, Martin Kleppmann touches on many fundamental problems of distributed systems, especially the asynchronous model of distributed computing. The article has two parts: the first half lays out some principles of distributed locking, and the second half raises specific concerns about Redlock:
-
Martin points out that even a perfectly implemented distributed lock cannot provide sufficient safety unless the shared resource itself participates, by providing some kind of fencing mechanism
-
Martin points out that since Redlock is essentially built on a synchronous system model, it makes strong demands on system timing, so its safety is not sufficient on its own
Martin gives a timing diagram for the Fencing mechanism:
A client may be suspended for an extended period of time. The lock has a timeout precisely so that a crashed client cannot hold it forever, but if the pause lasts longer than the lock's expiry time and the client does not realize its lock has expired, it may go on to make unsafe changes. In other words, a blocked client's lock can expire without the client knowing it.
In this case, Martin pointed out that the fencing mechanism should be added, specifically the fencing token isolation mechanism, and also gave a timing diagram:
Client 1 acquires the lock and gets the token numbered 33, then goes into a long pause until the lock expires. Client 2 then acquires the lock, gets token 34, and sends its write to the storage service. When client 1 later comes back to life and sends its write, the storage service remembers it has already processed a write with the higher token 34, so it rejects the request carrying token 33.
The Redlock algorithm has no such mechanism for generating unique, monotonically increasing fencing tokens, which means it cannot prevent a client from operating after its lock has expired while it was blocked; therefore, Martin argues, it is not safe.
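To make the fencing idea concrete, here is a toy storage service that rejects writes carrying a token lower than the highest one it has seen; all names are illustrative:

```python
class FencedStorage:
    def __init__(self):
        self.max_token = -1
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.max_token:
            return False                      # stale lock holder: reject
        self.max_token = token
        self.data[key] = value
        return True

store = FencedStorage()
store.write(33, "k", "from client 1")                    # accepted
store.write(34, "k", "from client 2")                    # accepted
print(store.write(33, "k", "late write from client 1"))  # False: rejected
```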
In my view this idea does not completely solve the problem either: if client 1's write was meant to succeed but can no longer be applied because of the blocking timeout, that already produces a wrong outcome, and client 2 then builds on that wrong result, so every subsequent operation is doomed to be wrong as well.
3. Dr. Martin’s doubts about Redlock
Martin Kleppmann points out that Redlock depends heavily on time, which can lead to many inconsistent situations. Here is one of his examples:
Suppose A multi-node Redis system has five nodes A/B/C/D/E and two clients C1 and C2. What happens if the clock on one of the Redis nodes jumps forward?
-
Client C1 obtains the lock on nodes A, B and C, and due to network problems cannot reach nodes D and E
-
The clock on node C jumps forward, causing the lock to expire in advance
-
Client C2 is locked on nodes C, D, and E and cannot reach A and B due to network problems
-
Both client C1 and client C2 now think they own the lock
The above can happen because Redlock's safety depends strongly on the system clocks of the Redis nodes; once a clock becomes inaccurate, the algorithm's safety guarantee is gone.
Martin is really pointing at some fundamental issues in the study of distributed algorithms: a good distributed algorithm should be based on the asynchronous model, and its safety should not depend on any timing assumption.
In the asynchronous model, processes and messages can be delayed arbitrarily long and system clocks can fail in arbitrary ways; these factors should affect only an algorithm's liveness, never its safety. Even in extreme cases, a good algorithm may fail to deliver a result within bounded time, but it should never deliver a wrong result. Such algorithms exist, Paxos and Raft among them; by this standard, Redlock's level of safety is unattainable.
4. Dr. Martin’s conclusions and basic views
Martin expressed his views, dividing locks into two uses:
-
The first: efficiency
A distributed lock here merely coordinates multiple clients so the same simple work isn't done twice; an occasional lock failure only has mildly adverse consequences.
It's about as harmless as accidentally sending the same email twice
-
The second: correctness
Using distributed locks requires that under no circumstances should lock failure occur, which could mean data inconsistency, data loss, file corruption, or other serious problems.
Like giving patients repeated doses of a drug, the consequences can be severe
Martin concluded as follows:
-
Use distributed locks for efficiency
A single Redis node locking scheme is sufficient; Redlock is a heavy and expensive design
-
Use distributed locks for correctness
Redlock is not built on the asynchronous model and is not strong enough for this purpose; its assumptions about the system model contain many dangerous elements
Martin considers Redlock a bad choice because it is neither fish nor fowl: for efficiency it is too heavy and expensive, and for correctness it is not safe enough.
Antirez responded quickly to Martin's article, publishing "Is Redlock safe?" at the following address:
http://antirez.com/news/101
Antirez believes that Martin’s criticism of Redlock can be summarized in two aspects:
-
A distributed lock with automatic expiration must provide a fencing mechanism to ensure true mutual exclusion of shared resources. The Redlock algorithm does not provide such a mechanism
-
Redlock algorithm is built on an insecure system model. It has strong requirements for the timing assumptions of the system, which cannot be guaranteed in the real system
Antirez argued against both points in detail.
About Fencing Mechanism
Antirez asked: since a fencing mechanism exists to keep access to the shared resource mutually exclusive even after a lock fails, why use a distributed lock with such strong safety requirements at all?
Taking a step back, although Redlock provides no monotonically increasing fencing token, the random string it generates can achieve the same effect: it is not increasing, but it is unique.
The timing hypothesis
On the timing assumptions, Antirez pushed back against the claims about the algorithm's system model. In Martin's view, Redlock fails mainly in three situations:
-
1. The clock jumps
-
2. Long GC pause
-
3. Long network latency
Redlock was designed from the start with the latter two cases in mind and has some resistance to their consequences. Clock jumps, however, hit Redlock hard: once one happens, Redlock may not work correctly. Antirez's answer is that Redlock does not require perfectly accurate clocks, only clocks whose error stays within a certain range, which is entirely reasonable in real environments; large clock jumps can be avoided through proper operations practices.
Distributed systems are genuinely complex, and the effectiveness of any mechanism or theory needs mathematical grounding. Martin and Antirez are both experts in this field, each with his own views and trade-offs; more to the point, many of these problems simply have no perfect solution.
The debate was a very good clash of ideas in the field of distributed systems. Many people posted their views and opinions, and some time after Antirez's response, Dr. Martin shared his thoughts once more:
For me, this is the most important point: I don’t care who is right or wrong in this debate — I care about learning from others’ work, so that we can avoid repeating old mistakes, and make things better in future. So much great work has already been done for us: by standing on the shoulders of giants, we can build better software.
By all means, test ideas by arguing them and checking whether they stand up to scrutiny by others. That’s part of the learning process. But the goal should be to learn, not to convince others that you are right. Sometimes that just means to stop and think for a while.
For Martin, who was right or wrong mattered less than learning from others' work, so that old mistakes are not repeated and things get better in the future.
Standing on the shoulders of giants, we can build better software.
Moreover, only by arguing ideas out and letting others scrutinize them can we make our own thinking stand the test; the goal is to learn, not to convince others that you are right. As the saying goes, one mind alone falls short: thinking and rebutting bring us closer to the truth.