
Many people believe that the relationship between Redis and the CPU is simple: Redis threads run on the CPU, so the faster the CPU, the faster Redis processes requests.

This view is actually one-sided. The multi-core CPU architecture and the multi-CPU architecture also affect Redis performance. If you do not understand how the CPU affects Redis, you may miss some tuning methods when optimizing Redis performance and fail to get the most out of Redis.

Today, we will learn about the CPU architecture of current mainstream servers, and how to optimize Redis performance on multi-core CPU and multi-CPU architectures.

The dominant CPU architecture

To understand exactly how the CPU affects Redis, we need to take a look at the CPU architecture.

A CPU processor generally has more than one running core. We call one running core a physical core, and each physical core can run applications. Each physical core has a private Level 1 cache (L1 cache), consisting of a Level 1 instruction cache and a Level 1 data cache, as well as a private Level 2 cache (L2 cache).

One concept mentioned here is the private cache of the physical core. It means that the cache space can only be used by the current physical core, and no other physical core can access the cache space. Let’s look at the architecture of the CPU physical core.

Because L1 and L2 caches are private to each physical core, when data or instructions are stored in L1 and L2 caches, the physical core accesses them with a latency of no more than 10 nanoseconds, which is very fast. Redis can access instructions and data at high speed if it stores them in L1 and L2 caches.

However, the size of these L1 and L2 caches is limited by the processor's manufacturing technology and is usually only on the KB scale, so they cannot hold much data. If the required data is not in the L1 or L2 cache, the application has to go to memory to retrieve it. Memory access latency is generally at the hundred-nanosecond level, nearly 10 times that of the L1 and L2 caches, which inevitably affects performance.

Therefore, the different physical cores also share a common Level 3 cache (L3 cache for short). The L3 cache can use more storage resources, so it is generally large, ranging from a few megabytes to tens of megabytes, which allows applications to cache more data. When the required data is not in the L1 or L2 cache, the application can access the L3 cache and avoid going to memory.

In addition, in today’s mainstream CPU processors, each physical core typically runs two hyperthreads, also known as logical cores. Logical cores on the same physical core share the L1 and L2 caches.

To make this easier to follow, here is a diagram showing the relationship between physical and logical cores and the Level 1 and Level 2 caches.

On mainstream servers, a CPU processor generally has 10 to 20 physical cores. At the same time, to improve processing capacity, a server usually has multiple CPU processors (also called multiple CPU Sockets). Each processor has its own physical cores (each with L1 and L2 caches), an L3 cache, and directly connected memory, and the processors are connected to each other through a bus.

The following figure shows a multi-CPU Socket architecture with two sockets, each with two physical cores.

On a multi-CPU architecture, applications can run on different processors. In the diagram above, Redis can run on Socket 1 for a period of time and then be scheduled to run on Socket 2.

However, if an application runs on one Socket and saves data to memory, and is then scheduled to run on another Socket, it needs to access the memory connected to the previous Socket. This is called remote memory access. Compared with accessing memory directly connected to the current Socket, remote memory access increases application latency.

In a multi-CPU architecture, the latency of accessing a Socket's local memory differs from the latency of accessing remote memory, so this architecture is also called the Non-Uniform Memory Access (NUMA) architecture.

Now that we know the major CPU multi-core architectures and multi-CPU architectures, let’s briefly summarize the impact of CPU architectures on application execution.

  • Instructions and data in the L1 and L2 caches can be accessed very quickly, so making full use of the L1 and L2 caches can effectively shorten application execution time.
  • Under the NUMA architecture, if an application is scheduled from one Socket to another, remote memory access may occur, which directly increases execution time.

Next, let’s take a look at how multiple CPU cores affect Redis performance.

The impact of multiple CPU cores on Redis performance

When running on a CPU core, an application needs to record information about the software and hardware resources it uses (such as stack pointers and the register values of the CPU core), which we call runtime information. At the same time, the most frequently accessed instructions and data are cached in the L1 and L2 caches to improve execution speed.

However, in a multi-core CPU scenario, whenever the application is moved to a new CPU core, its runtime information must be reloaded onto that core. The L1 and L2 caches of the new core must also be refilled with data and instructions, which increases the program's run time.

Speaking of which, I’d like to share an example of how I tuned Redis performance in a multi-core CPU environment. I hope this case helps you fully understand the impact of multi-core CPUs on Redis performance.

At that time, our project requirement was to optimize the 99% tail latency of Redis, with a GET tail latency below 300 microseconds and a PUT tail latency below 500 microseconds.

For those of you who don’t know what 99% tail latency is, let me explain. We sort all requests by processing latency from smallest to largest; the value below which 99% of the request latencies fall is the 99% tail latency. For example, with 1000 requests, if the 991st request has a latency of 1ms and the first 990 requests all have latencies below 1ms, then the 99% tail latency here is 1ms.

When we started, we used the String type with O(1) GET/PUT for data access, with both RDB and AOF turned off. Moreover, no collection-type data was stored in the Redis instance, so there were no bigkey operations, which avoided many of the situations that could increase latency.

But even then, when we ran the Redis instance on a server with 24 CPU cores, the 99% tail latencies of GET and PUT were 504 microseconds and 1175 microseconds respectively, significantly higher than our goals.

Later, we carefully examined the CPU status indicators of the server while the Redis instance was running, and found that the number of CPU context switches was quite high.

Context switching here refers to thread context switching, where the context is the thread's runtime information. In a multi-core environment, a context switch occurs when a program that was running on one CPU core is switched to another CPU core.

When the context switch occurs, the runtime information of the Redis main thread needs to be reloaded to the other CPU core. In addition, the L1 and L2 cache of the other CPU core does not contain the instructions and data frequently accessed by the previous Redis instance. These instructions and data need to be reloaded from L3 cache, or even from memory. This reloading process takes time. Also, Redis instances need to wait for this reload process to complete before they can start processing requests, so this can cause some requests to take longer to process.

If a Redis instance is frequently scheduled to run on different CPU cores in a multi-core scenario, the impact on request processing time is even greater. Every time the instance is rescheduled, some requests are affected by the reloading of runtime information, instructions, and data, so their latency is significantly higher than that of other requests. At this point, we know why the 99% tail latency in the previous example would never come down.

Therefore, we wanted to avoid Redis being scheduled back and forth between different CPU cores. We tried binding the Redis instance to a CPU core so that it always runs on the same core. The taskset command can be used to bind a program to a core.

For example, we can bind the Redis instance to core 0 by executing the following command, where the -c option sets the number of the core to bind to.

```shell
taskset -c 0 ./redis-server
```

After binding, we tested again and found that the 99% tail latencies of GET and PUT for the Redis instance dropped to 260 microseconds and 482 microseconds respectively, exactly where we wanted them.

Let’s look at the 99% tail delay of Redis before and after binding.

As you can see, in a multi-core environment, binding the Redis instance to a CPU core effectively reduces Redis tail latency. Of course, core binding not only helps reduce tail latency, but also reduces average latency and improves throughput, thereby improving Redis performance.

Next, let’s look at how the multi-CPU architecture, also known as NUMA architecture, affects Redis performance.

The impact of CPU NUMA architecture on Redis performance

When using Redis in practice, I often see a technique for improving Redis network performance: binding the operating system's network interrupt handler to a CPU core. This avoids the network interrupt handler being scheduled across different cores and can effectively improve Redis's network processing performance.

However, the network interrupt handler needs to exchange network data with the Redis instance. Once the network interrupt handler is bound to a core, we need to pay attention to which core the Redis instance is bound to, because this affects how efficiently Redis accesses the network data.

Let’s start by looking at the data interaction between a Redis instance and the network interrupt handler: the network interrupt handler reads data from the NIC hardware and writes it into a memory buffer maintained by the operating system kernel. The kernel triggers an event through the epoll mechanism to notify the Redis instance, which then copies the data from the kernel buffer into its own memory space, as shown in the following figure:

Therefore, under the CPU's NUMA architecture, there is a potential risk when the network interrupt handler and the Redis instance are each bound to a CPU core: if they are bound to cores on different CPU Sockets, the Redis instance has to access memory across Sockets to read the network data, which takes longer.

This might be a little abstract, but let me illustrate it with another picture.

As you can see, the network interrupt handler in the figure is bound to a core on CPU Socket 1, while the Redis instance is bound to Socket 2. At this point, the network data read by the interrupt handler is stored in Socket 1's local memory. When the Redis instance wants to access it, Socket 2 must send a memory access request to Socket 1 over the bus, which is a costly remote access.

In our tests, memory access across CPU Sockets had 18% higher latency than access to a Socket's local memory, which naturally increases the latency of Redis request processing.

Therefore, to prevent Redis from accessing network data across CPU sockets, it is best to bind the network interrupt and Redis instance to the same CPU Socket, so that the Redis instance can read network data directly from local memory, as shown in the following figure:

However, it is important to note that under the NUMA architecture, the numbering rule for CPU cores is not that all logical cores in one CPU Socket are numbered before those in the next Socket. Instead, the first logical core of each physical core in each CPU Socket is numbered first, and then the second logical core of each physical core in each Socket is numbered.

Let me give you an example. Assume that there are two CPU sockets with six physical cores on each Socket, and two logical cores on each physical core, for a total of 24 logical cores. We can run the lscpu command to see the numbers of these cores:

```shell
lscpu
Architecture:        x86_64
...
NUMA node0 CPU(s):   0-5,12-17
NUMA node1 CPU(s):   6-11,18-23
...
```

As you can see, the CPU core numbers for NUMA node0 are 0 to 5 and 12 to 17, where 0 to 5 are the first logical cores of the six physical cores on node0, and 12 to 17 are the second logical cores of those same physical cores. NUMA node1 follows the same numbering rule.

We must not assume that the first 12 logical cores are all on the first Socket and numbered 0 to 11. Otherwise, the network interrupt handler and the Redis instance might end up bound to different CPU Sockets.

For example, if we bind the network interrupt handler and the Redis instance to the CPU cores numbered 1 and 7 respectively, they will be on two different CPU Sockets, and the Redis instance will still have to read network data across Sockets.

Therefore, it is important that you pay attention to the numbering of CPU cores in NUMA so that you do not bind the wrong cores.

Let’s briefly summarize what we’ve just learned. In a multi-core CPU scenario, using taskset to bind a Redis instance to a single core reduces the overhead of the instance being scheduled across cores and avoids high tail latency. Under the multi-CPU NUMA architecture, it is recommended to bind the Redis instance and the network interrupt handler to different cores on the same CPU Socket, to avoid the cost of Redis reading network data from memory across Sockets.

However, “every coin has two sides,” and core binding also carries certain risks. Next, let’s take a look at its potential risks and their solutions.

The risks of core binding and their solutions

In addition to the main thread, Redis also has child processes for RDB generation and AOF rewriting (recall Lecture 4 and Lecture 5). We also learned about Redis's background threads in Lecture 16.

When a Redis instance is bound to a single CPU core, the child processes, background threads, and the Redis main thread compete for CPU resources. Once a child process or background thread occupies the CPU, the main thread is blocked, increasing Redis request latency.

For this situation, I will introduce two solutions: binding each Redis instance to one physical core, and optimizing the Redis source code.

Scheme 1: One Redis instance corresponds to one physical core

When binding a Redis instance to a core, instead of binding the instance to a single logical core, we bind it to a physical core, that is, we let it use both logical cores of that physical core.

Take the NUMA architecture above as an example: NUMA node0's CPU cores are numbered 0 to 5 and 12 to 17, where 0 and 12, 1 and 13, 2 and 14, and so on are the two logical cores of the same physical core. So when binding, we use two logical cores belonging to the same physical core. For example, the following command binds the Redis instance to logical cores 0 and 12, which both belong to the first physical core.

```shell
taskset -c 0,12 ./redis-server
```

Compared with binding to a single logical core, binding the Redis instance to a physical core lets the main thread, child processes, and background threads share two logical cores, which alleviates CPU resource competition to some extent. However, with only two logical cores, there is still competition among them. If you want to further reduce CPU contention, here is another solution.

Scheme 2: Optimize the Redis source code

This solution is to modify the Redis source code so that the child processes and background threads are bound to different CPU cores from the main thread.

If you’re not familiar with the Redis source code, that’s OK, because this is a general method of binding cores programmatically. Once you learn it, you can apply it after becoming familiar with the source code, or in other scenarios that require core binding.

Next, I’ll describe the general approach, and then explain where in the Redis source code you can apply it.

Core binding is implemented programmatically with one operating system data structure, cpu_set_t, and three functions: CPU_ZERO, CPU_SET, and sched_setaffinity. Let me explain them first.

  • The cpu_set_t data structure: a bitmap in which each bit represents one logical CPU core on the server.
  • The CPU_ZERO function: takes a cpu_set_t bitmap as its parameter and sets all bits in the bitmap to 0.
  • The CPU_SET function: takes a CPU core number and a cpu_set_t bitmap as parameters, and sets the bit corresponding to that core number to 1.
  • The sched_setaffinity function: takes a process/thread ID and a cpu_set_t as parameters, and binds the process/thread identified by the ID to the cores whose bits are set to 1 in the cpu_set_t.

So how do we combine these functions in a program to bind a core? It’s easy; it takes just four steps.

  • Step 1: Create a bitmap variable of type cpu_set_t;
  • Step 2: Use the CPU_ZERO function to set all bits of the cpu_set_t bitmap to 0;
  • Step 3: Use the CPU_SET function to set the bit for the logical core you want to bind to 1;
  • Step 4: Use the sched_setaffinity function to bind the program to the logical cores whose bits are 1 in the cpu_set_t bitmap.

Below, I will describe how to bind background threads and child processes to different cores, respectively.

Let’s start with background threads. To help you understand programmatic core binding, take a look at this sample code, which binds a thread to a core:

```c
void *worker(void *arg) {
    int bind_cpu = (int)(long)arg;
    cpu_set_t cpuset;             // Create the bitmap variable
    CPU_ZERO(&cpuset);            // Set all bits of the bitmap to 0
    CPU_SET(bind_cpu, &cpuset);   // Set the bit for bind_cpu to 1
    // Bind the calling thread (ID 0 means "current") to the cores in cpuset
    sched_setaffinity(0, sizeof(cpuset), &cpuset);
    // ... actual work of the thread ...
    return NULL;
}

int main() {
    pthread_t pthread1;
    // Bind the created pthread1 to the logical core numbered 3
    pthread_create(&pthread1, NULL, worker, (void *)3);
    ...
}
```

For Redis, background threads are created in the bioProcessBackgroundJobs function in the bio.c file. This function plays a role similar to the worker function in the example above, so you can perform the four core-binding steps in it to bind the background threads to different cores from the main thread.

Similarly, when we use fork to create a child process, we can implement the four binding steps in the child-process code, as follows:

```c
int main() {
    pid_t p = fork();
    if (p < 0) {
        printf("fork error\n");
    }
    // Child process code
    else if (!p) {
        cpu_set_t cpuset;         // Create the bitmap variable
        CPU_ZERO(&cpuset);        // Set all bits of the bitmap to 0
        CPU_SET(3, &cpuset);      // Set the bit for core 3 to 1
        // Bind the child process to core 3
        sched_setaffinity(0, sizeof(cpuset), &cpuset);
        // ... actual work of the child process ...
        exit(0);
    }
    ...
}
```

For Redis, the child processes that generate the RDB file and rewrite the AOF log are created in functions in the following two files:

  • rdb.c file: the rdbSaveBackground function;
  • aof.c file: the rewriteAppendOnlyFileBackground function.

Both of these functions call fork to create the child process, so we can add the four core-binding steps to the child-process code.

With the source-code optimization scheme, we can not only bind the Redis instance to a core to avoid the performance impact of switching cores, but also have the child processes, background threads, and main thread run on different cores, avoiding CPU resource competition among them. Compared with using taskset alone, this approach further reduces the risks of core binding.

Summary

In this lesson, we learned about the impact of the CPU architecture on Redis performance. First, we looked at the current mainstream multi-core CPU architectures, as well as NUMA architectures.

On a multi-core CPU architecture, if Redis runs on different cores at different times, it needs frequent context switches, which increases execution time and causes clients to observe high tail latency. Therefore, it is recommended to bind the Redis instance to a core when it runs; this lets it reuse the L1 and L2 caches on that core and reduces response latency.

To improve Redis network performance, we sometimes bind the network interrupt handler to a CPU core. In this case, if the server uses the NUMA architecture and the Redis instance is not on the same CPU Socket as the interrupt handler, Redis has to access the network data across Sockets, which degrades performance. Therefore, I recommend binding the Redis instance and the network interrupt handler to different cores on the same CPU Socket to improve Redis performance.

While core binding helps Redis reduce request execution time, remember that in addition to the main thread, Redis has child processes for RDB generation and AOF rewriting, as well as the background threads for lazy deletion introduced in version 4.0. When a Redis instance is bound to a single logical core, these child processes and background threads compete with the main thread for CPU resources, which also affects Redis performance. So, I have two suggestions for you:

  • If you don’t want to modify the Redis code, you can bind each Redis instance to one physical core, so that the main thread, child processes, and background threads share the two logical cores of that physical core.
  • If you are familiar with the Redis source code, you can add core-binding operations to the source code to bind the child processes and background threads to different cores from the main thread, so that they do not compete with the main thread for CPU resources. If you are not familiar with the source code, don’t worry too much: Redis 6.0 supports CPU core binding through configuration. I will introduce the latest features of Redis 6.0 in Lecture 38.
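For reference, here is a sketch of what that configuration looks like in Redis 6.0's redis.conf. The option names come from the 6.0 configuration file, but the core lists below are purely illustrative and should be chosen to match your own NUMA layout:

```conf
# Bind the main server/I/O threads to cores 0-7
server_cpulist 0-7
# Bind the background (bio) threads to cores 1 and 3
bio_cpulist 1,3
# Bind the AOF rewrite child process to cores 8-11
aof_rewrite_cpulist 8-11
# Bind the RDB bgsave child process to cores 1, 10 and 11
bgsave_cpulist 1,10-11
```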

Low latency is our eternal goal for Redis, and multi-core CPUs and the NUMA architecture have become mainstream server configurations, so I hope you can master these core-binding optimizations and put them into practice.

Question of the lesson

Well, as usual, let me give you a quick question.

On a server with 2 CPU Sockets (each Socket has 8 physical cores), we deployed a Redis sliced cluster with 8 instances (all 8 instances are primary nodes, with no primary/replica relationship). Now we have two options:

  1. Run 8 instances on the same CPU Socket and bind to 8 CPU cores;
  2. Run four instances on each of the two CPU sockets and bind to the cores on the corresponding sockets.

If you don’t consider the impact of network data reading, which solution would you choose?

Feel free to leave your thoughts and answers in the comments section, and if you feel you have learned something, you are also welcome to help me share today’s content with your friends. I’ll see you next time.