We recently discovered an interesting problem in our Redis clusters. After a lot of debugging and testing, we were able to reduce Redis memory usage in some clusters by 25% by changing how keys expire.
Twitter runs several caching services internally. One of them is backed by Redis. Our Redis clusters store data for some of Twitter’s most important use cases, such as impression and engagement data, ad spend counts, and Direct Messages.
The problem background
Back in early 2016, Twitter’s Cache team made extensive updates to the Redis cluster architecture. Several things changed, including an upgrade from Redis 2.4 to 3.2. After this update, several issues arose: users began seeing memory usage that did not match what they expected or had provisioned for, latency increased, and keys were being evicted. Evictions are a big problem, because data that should have been retained gets removed, and requests fall through to the original data store instead.
A preliminary investigation
The affected teams and the caching team began a preliminary investigation. We found that the latency increase was related to the evictions that were now taking place: when Redis receives a write but has no memory left to hold it, it pauses the operation in progress, evicts a key, and then stores the new key. We still needed to figure out what was causing the increase in memory usage that led to these evictions.
We suspected that memory was filling up with keys that had expired but had not yet been deleted. Someone suggested a scan, which reads every key and thereby removes the expired ones.
In Redis, keys have two expiration modes: active expiration and passive expiration. A scan triggers passive expiration of keys: when a key is read, its TTL is checked, and if the TTL has expired the key is deleted and nothing is returned. Active expiration in version 3.2 is described in the Redis documentation. It begins with a function called activeExpireCycle, which runs several times per second on an internal timer called cron. The activeExpireCycle function iterates through each keyspace and checks a random sample of keys that have a TTL set. If the percentage of expired keys in the sample crosses a threshold, the process repeats until a time limit is reached.
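To make that loop concrete, here is a simplified, self-contained simulation of the sampling behavior described above. It is not the real Redis code (the real logic lives in activeExpireCycle and handles many more details), and the constants and names are illustrative only:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LOOKUPS_PER_LOOP 20   /* mirrors ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP, default 20 */
#define NUM_KEYS 10000        /* toy keyspace size, just for this simulation */

typedef struct { int has_ttl; time_t expire_at; } toy_key;
static toy_key keys[NUM_KEYS];

/* One run of the cycle: sample random keys that have a TTL, delete the expired
 * ones, and repeat while more than 25% of the sample was expired and the time
 * budget has not been used up. */
static void active_expire_cycle(time_t now, double time_limit_sec) {
    clock_t start = clock();
    for (;;) {
        int sampled = 0, expired = 0, attempts = 0;
        while (sampled < LOOKUPS_PER_LOOP && attempts++ < NUM_KEYS) {
            toy_key *k = &keys[rand() % NUM_KEYS];
            if (!k->has_ttl) continue;          /* only keys with a TTL are candidates */
            sampled++;
            if (k->expire_at <= now) {          /* TTL already passed: delete the key */
                k->has_ttl = 0;
                expired++;
            }
        }
        if (sampled == 0) break;                /* nothing left to sample */
        if (expired * 4 <= sampled) break;      /* under 25% of the sample expired: stop */
        if ((double)(clock() - start) / CLOCKS_PER_SEC > time_limit_sec) break;
    }
}

int main(void) {
    time_t now = time(NULL);
    srand((unsigned)now);
    for (int i = 0; i < NUM_KEYS; i++) {
        keys[i].has_ttl = 1;
        keys[i].expire_at = now + (rand() % 20) - 10;   /* about half already expired */
    }
    active_expire_cycle(now, 0.001);                    /* ~1 ms budget per cron tick */
    int live = 0;
    for (int i = 0; i < NUM_KEYS; i++) live += keys[i].has_ttl;
    printf("keys still holding a TTL after one cycle: %d\n", live);
    return 0;
}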
This method of scanning all keys worked, and memory usage dropped once the scan completed. It seemed that Redis was no longer expiring keys effectively. However, the solution at the time was to increase the size of the cluster and add more hardware, so that keys would be spread more widely and more memory would be available. This was disappointing, because the aforementioned project to upgrade Redis had reduced the size and cost of running these clusters by making them more efficient.
Redis versions: what changed?
The implementation of activeExpireCycle changed between Redis 2.4 and 3.2. In Redis 2.4, every database was checked on every run; in Redis 3.2, there is now a maximum number of databases that can be checked per call. Version 3.2 also introduced a fast option for the cycle: “slow” runs on the timer, and “fast” runs before checking for events on the event loop. The fast expiration cycle returns early under certain conditions, and it also has a lower threshold before the function hits its time limit and exits. The time limit is also checked more frequently. In total, 100 lines of code were added to this function.
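Paraphrasing rather than quoting the source, the shape of the 3.2 cycle looks roughly like the sketch below. The constants and helper names are our own simplifications, not the actual code, but they show where the per-call database cap and the fast/slow time budgets come in:

#include <stdio.h>

#define CRON_DBS_PER_CALL 16                    /* default cap on databases per call */
#define ACTIVE_EXPIRE_CYCLE_SLOW 0
#define ACTIVE_EXPIRE_CYCLE_FAST 1
#define ACTIVE_EXPIRE_CYCLE_FAST_DURATION 1000  /* fast-cycle budget, in microseconds */

static struct { int dbnum; int hz; } server = { 1, 10 };  /* our test instances: one database */
static int timelimit_exit = 0;                  /* did the previous run hit its time limit? */

/* Rough paraphrase of the Redis 3.2 activeExpireCycle structure, not verbatim source. */
static void active_expire_cycle_sketch(int type) {
    int dbs_per_call = CRON_DBS_PER_CALL;

    /* The line we later experiment with: with a single database (server.dbnum == 1)
     * this always lowers dbs_per_call to 1, so commenting it out should not change
     * which databases get visited. */
    if (dbs_per_call > server.dbnum || timelimit_exit)
        dbs_per_call = server.dbnum;

    /* Slow cycles get a percentage of CPU time per second; fast cycles get a fixed
     * budget of roughly a millisecond and return early. */
    long long timelimit_us = 1000000LL * 25 / server.hz / 100;
    if (type == ACTIVE_EXPIRE_CYCLE_FAST)
        timelimit_us = ACTIVE_EXPIRE_CYCLE_FAST_DURATION;

    for (int j = 0; j < dbs_per_call; j++) {
        /* ...sample keys with a TTL in database j and delete expired ones,
         * checking the time limit frequently, as described above... */
    }

    printf("type=%d dbs_per_call=%d time budget=%lld us\n", type, dbs_per_call, timelimit_us);
}

int main(void) {
    active_expire_cycle_sketch(ACTIVE_EXPIRE_CYCLE_SLOW);
    active_expire_cycle_sketch(ACTIVE_EXPIRE_CYCLE_FAST);
    return 0;
}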
Further investigation
Recently we had time to go back and revisit this memory usage issue. We wanted to explore why the regression happened and then see how we could improve key expiration. Our first thought was that with so many keys in Redis, sampling only 20 was not enough. The other thing we wanted to investigate was the impact of the per-call database limit introduced in Redis 3.2.
The way Redis is scaled and sharded at Twitter makes running it here unique. We have keyspaces with millions of keys, which is not common among Redis users. Shards are represented by keyspaces, so each Redis instance can have multiple shards, and our Redis instances have many keyspaces. Sharding combined with Twitter’s scale creates dense back ends with lots of keys and databases.
Improving expiration testing
The number of keys sampled on each loop is determined by the ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP define. I decided to test three values, run them on one of the problematic clusters, then run a scan and measure the before-and-after difference in memory usage. A large difference between before and after means a large amount of expired data was waiting to be collected. The tests initially showed positive results for memory usage.
The test had one control and three test instances that sampled more keys: 500 and 200 were chosen arbitrarily, while 300 came from a statistical sample-size calculator with the total number of keys as the population size (a sketch of that calculation follows this paragraph). Looking only at the starting numbers of the test instances in the chart above, it is clear they performed better. The difference, compared with the percentages obtained by running a scan, indicates an overhead of roughly 25% from expired keys.
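For readers curious where a number like 300 can come from, the sketch below shows the standard finite-population sample-size formula such calculators use. The confidence level and margin of error here are illustrative assumptions, not the exact values used in our test:

#include <stdio.h>

/* Cochran's sample-size formula with the finite-population correction.
 * z is the z-score for the confidence level, p the expected proportion
 * (0.5 is the conservative choice), and margin the acceptable error. */
static double sample_size(double population, double z, double p, double margin) {
    double n0 = z * z * p * (1.0 - p) / (margin * margin);  /* infinite-population size */
    return n0 / (1.0 + (n0 - 1.0) / population);            /* finite-population correction */
}

int main(void) {
    /* Illustrative assumptions: 95% confidence (z = 1.96), p = 0.5, and a margin of
     * error chosen so the result lands near 300 for a keyspace of a million keys. */
    double n = sample_size(1e6, 1.96, 0.5, 0.0566);
    printf("suggested sample size: %.0f keys\n", n);
    return 0;
}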
Although sampling more keys helped us find more expired ones, the negative impact on latency was more than we could afford.
The figure above shows p99.9 latency in milliseconds. It indicates that latency correlates with the number of keys sampled: orange is 500, green is 300, blue is 200, and yellow is the control. These colors match the lines in the chart above.
After seeing that latency was affected by sample size, I wondered whether the sample size could be adjusted automatically based on how many expired keys there are: when lots of keys are expired, latency takes a hit, but when there is little work to do, we sample fewer keys and run faster.
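As a rough sketch of that idea (our own illustration, not the code we actually tested), the adjustment could look like this: grow the sample size while a loop keeps finding lots of expired keys, and shrink it again when there is little to do:

#include <stdio.h>

#define MIN_LOOKUPS_PER_LOOP 20    /* never sample fewer than the old default */
#define MAX_LOOKUPS_PER_LOOP 500   /* cap to protect latency */

static int lookups_per_loop = MIN_LOOKUPS_PER_LOOP;

/* Called after each sampling loop with the counts from that loop: grow the
 * sample while lots of expired keys are being found, shrink it otherwise. */
static void adjust_lookups(int sampled, int expired) {
    if (sampled == 0) return;
    if (expired * 4 > sampled) {
        /* More than 25% of the sample was expired: there is work to do. */
        lookups_per_loop *= 2;
        if (lookups_per_loop > MAX_LOOKUPS_PER_LOOP)
            lookups_per_loop = MAX_LOOKUPS_PER_LOOP;
    } else {
        /* Little expired data found: back off to keep latency low. */
        lookups_per_loop /= 2;
        if (lookups_per_loop < MIN_LOOKUPS_PER_LOOP)
            lookups_per_loop = MIN_LOOKUPS_PER_LOOP;
    }
}

int main(void) {
    adjust_lookups(20, 10);   /* half the sample expired: sample size doubles */
    adjust_lookups(40, 2);    /* little expired data: sample size shrinks again */
    printf("current sample size: %d keys per loop\n", lookups_per_loop);
    return 0;
}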
The idea basically works: memory usage is lower, latency is unaffected, and a metric tracking the sample size shows it rising and falling over time. However, we did not adopt this solution. It introduced some latency spikes that were not present in our control instances, the code was somewhat complex, hard to explain, and not intuitive, and we would have had to tune it for every cluster, which is not ideal since we want to avoid adding operational complexity.
Investigating the regression between versions
We also wanted to investigate the changes between Redis versions. The newer version introduced a variable called CRON_DBS_PER_CALL, which sets the maximum number of databases checked on each run of this cron. To test the effect of this variable, we simply commented out the following line:
// if (dbs_per_call > server.dbnum || timelimit_exit) dbs_per_call = server.dbnum;
This compares checking all databases on each run against checking a limited number. Our benchmark results were very exciting. However, our test instance had only one database, so logically this line of code should make no difference between the modified and unmodified versions: the variable always ends up with the same value.
The p99.9 latency, measured in microseconds. The unmodified Redis is on top, and the modified Redis is below.
We set out to discover why commenting out this line made such a huge difference. Since it is an if statement, our first suspicion was branch prediction. We used GCC’s __builtin_expect to change how the code was compiled, but it had no effect on performance.
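For reference, __builtin_expect is GCC’s way of hinting which outcome of a condition is expected, which can change how the branch is laid out in the generated code. A minimal usage example (our own illustration, not Redis code):

#include <stdio.h>

/* Common convenience macros around GCC's __builtin_expect: they tell the
 * compiler which outcome of a condition is expected, which can influence
 * how the branch is laid out in the generated code. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int main(void) {
    int dbnum = 1, dbs_per_call = 16;

    /* Marking the branch as expected-taken (or not) changes code layout,
     * but in our tests it made no measurable difference to latency. */
    if (unlikely(dbs_per_call > dbnum))
        dbs_per_call = dbnum;

    printf("dbs_per_call = %d\n", dbs_per_call);
    return 0;
}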
Next, we looked at the generated assembly to see what was going on.
The if statement compiled down to three important instructions: mov, cmp, and jg. mov loads some memory into a register, cmp compares two registers and sets processor flags based on the result, and jg performs a conditional jump based on those flags. The code jumped to is either the body of the if block or the else branch. I took out the if statement and put the compiled assembly into Redis, then tested the effect of each instruction by commenting out different lines. I tested the mov instruction to see if there was a performance issue with loading memory or with the CPU cache, but found no difference. I tested the cmp instruction and found no difference. When I ran the test with the jg instruction included, latency went back up to the unmodified level. Having found this, I tested whether it was jumps in general or the specific jg instruction. I added an unconditional jump, jmp, that jumped and then jumped back to continue execution, and there was no performance loss.
We spent some time looking at different performance metrics and tried out some of the custom counters listed in the CPU manual. We did not reach a conclusion as to why one instruction would cause such a performance problem. We have some ideas related to instruction caching and CPU behavior when executing jumps, but we ran out of time and may return to this in the future.
The solution
Now that we had a good understanding of the causes of the problem, we needed to choose a solution. Our decision was to make a simple change: make the sample size configurable in the startup options. This lets us find a good balance between latency and memory usage. Even though removing the if statement led to such a dramatic improvement, it is hard to make a change we cannot explain.
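The patch itself isn’t shown in this post; as a purely hypothetical sketch of the shape of the change (the option name and parsing below are illustrative, not Twitter’s actual code), the idea is to replace the compile-time constant with a value read from a startup option:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DEFAULT_LOOKUPS_PER_LOOP 20   /* the old compile-time default */

static int lookups_per_loop = DEFAULT_LOOKUPS_PER_LOOP;

/* Hypothetical startup-option parsing: a per-cluster knob for the number of
 * keys sampled per expiration loop. The flag name is illustrative only. */
static void parse_startup_options(int argc, char **argv) {
    for (int i = 1; i + 1 < argc; i++) {
        if (strcmp(argv[i], "--active-expire-lookups-per-loop") == 0) {
            int v = atoi(argv[i + 1]);
            if (v > 0) lookups_per_loop = v;
        }
    }
}

int main(int argc, char **argv) {
    parse_startup_options(argc, argv);
    printf("sampling %d keys per expiration loop\n", lookups_per_loop);
    return 0;
}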
This graph shows memory usage for the first cluster the change was deployed to. The top (pink) line, hidden behind the orange one, is the cluster’s median memory usage. The orange line on top is a control instance. The middle of the chart shows the trend after the new change. The third section shows a control instance being restarted for comparison with the light yellow line; after the restart, the control’s memory usage grows quickly.
This was a fairly large investigation involving engineers across multiple teams, and reducing cluster size by 25% is a very good result; we learned a lot from it! We would like to look at this code again and see what further optimizations we can make with the help of other teams focused on performance and tuning.
Other engineers who made significant contributions to this investigation include Mike Barry, Rashmi Ramesh and Bart Robinson.
– end –
By Matthew Tejo
Translation: Xu Ye
Improving Key Expiration in Redis