AOF log

Imagine that Redis appends each write command it executes to a file. When Redis restarts, it can read the commands back from that file and execute them again, which restores the data.

This is what the AOF (Append Only File) persistence feature in Redis does: it saves write commands to a log. Note that only write commands are recorded. Read commands are not recorded, because they do not change any data and logging them would serve no purpose.

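In Redis, AOF persistence is disabled by default. To enable it, modify the following parameters in redis.conf: set appendonly to yes to turn AOF on, and use appendfilename to set the name of the AOF file (appendonly.aof by default).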

AOF log files are plain text that can be viewed by using the cat command, but can be difficult to read without knowing certain rules.

Here I use the “set name xiaolin” command as an example. After executing this command, Redis will record the following content in the AOF log (one item per line):

    *3
    $3
    set
    $4
    name
    $7
    xiaolin

Let me explain it to you.

“*3” indicates that the current command has three parts. Each part starts with “$” followed by a number, and the next line holds the actual command, key, or value. The number is the length in bytes of that command, key, or value. For example, “$3” followed by “set” means that this part has 3 bytes, which is the length of the string “set”.
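The format described above can be reproduced in a few lines. Here is a minimal Python sketch; the function name encode_command is my own, and the sketch assumes each line in the log ends with "\r\n", as in the Redis protocol.

```python
def encode_command(*parts: str) -> bytes:
    """Encode a command as '*<part count>' followed by '$<byte length>' + value for each part."""
    out = [f"*{len(parts)}\r\n".encode()]
    for part in parts:
        raw = part.encode("utf-8")
        out.append(f"${len(raw)}\r\n".encode())  # length is counted in bytes, not characters
        out.append(raw + b"\r\n")
    return b"".join(out)
```

Calling encode_command("set", "name", "xiaolin") yields the seven lines shown for the example command: *3, $3, set, $4, name, $7, xiaolin.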

You may have noticed that Redis executes a write command first and only then logs it to the AOF. This order has two advantages.

The first benefit is to avoid additional inspection overhead.

Suppose the write command were recorded in the AOF log before it was executed. Redis would not yet have checked the command's syntax, so an invalid command could end up in the log, and Redis might then fail when it uses the log to recover data.

If the write operation command is executed first and then logs are recorded, the command is recorded in the AOF log only after the command is successfully executed. In this way, there is no extra check cost and the commands recorded in the AOF log are executable and correct.

Second, it does not block the execution of the current write command, because the command will not be logged to the AOF until the write command is successfully executed.

Of course, AOF persistence is not without potential risks.

The first risk is that executing the write command and logging it are two separate steps. If the server goes down after the command has been executed but before it has been written to the hard disk, that data is lost.

The second risk, as mentioned above, is that the AOF log is logged after the write command is successfully executed. Therefore, the execution of the current write command is not blocked, but the next command may be blocked.

Writing the command to the log is done in the main process, just like executing the command, so the two operations run synchronously, one after the other.

If the I/O pressure on the server's hard disk is too high when the log content is written, the write becomes slow and blocks, and subsequent commands cannot be executed until it finishes.

Looking closely, these two risks have something in common: both depend on when the AOF log is written back to disk.

Three write back strategies

Redis writes to the AOF log in the following steps:

  1. After executing a write command, Redis appends the command to the server.aof_buf buffer.
  2. Redis then writes the data in the aof_buf buffer to the AOF file through the write() system call. At this point the data has not reached the disk. It has only been copied to the kernel buffer (the page cache), where it waits for the kernel to write it to the disk.
  3. The kernel decides when the data in the kernel buffer is actually written to the disk.

Redis provides three write back strategies that control the process described in step 3 above.

The appendfsync configuration item in the redis.conf configuration file accepts the following three values (written in lowercase in the file: always, everysec, no):

  • Always: after each write command is executed, the AOF log data is synchronously written back to the hard disk.
  • Everysec: after each write command is executed, the command is first written to the kernel buffer of the AOF file, and the contents of the buffer are then written back to the hard disk once every second.
  • No: Redis does not control the write back time and leaves it to the operating system. That is, after each write command is executed, the command is first written to the kernel buffer of the AOF file, and the operating system decides when to write the buffer contents back to the disk.

None of the three write back strategies perfectly solves both “blocking the main process” and “reducing data loss”, because the two goals are at odds: favoring one sacrifices the other, for the following reasons:

  • The Always policy protects against data loss to the greatest extent. However, because it synchronously writes the AOF content back to the hard disk every time a write command is executed, it inevitably affects the performance of the main process.
  • The No policy, in which the operating system decides when to write the AOF log back to the hard disk, performs better than the Always policy. However, the timing of the write back is unpredictable. If the server goes down before the AOF log has been written back, an uncertain amount of data is lost.
  • The Everysec policy is a compromise. It avoids the performance overhead of the Always policy and loses less data than the No policy. Of course, if the server goes down before the last second of log has been written back to the disk, the data from that second is lost.

Choose according to your business scenario:

  • If you want high performance, choose the No policy.
  • If you want high reliability, choose Always.
  • If you allow a bit of data loss, but want high performance, choose the Everysec policy.

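I also summarized the pros and cons of these three strategies into a table:

  Policy     Write back timing                 Pros                                  Cons
  Always     Synchronously, on every write     High reliability, least data loss     Highest performance cost per write command
  Everysec   Once every second                 Moderate performance                  Up to the last second of data can be lost
  No         Decided by the operating system   High performance                      Uncertain amount of data can be lost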

Do you know how these three strategies work?

Digging into the source code, you’ll see that these three strategies simply control when the fsync() function is called.

When an application writes data to a file, the kernel typically copies the data into a kernel buffer, then queues it, and the kernel decides when to write to disk.

If you want your application to synchronize data to disk immediately after writing to a file, you can call fsync(), which will write the kernel buffer directly to disk and return only after the disk write is complete.

  • Always calls fsync() every time data is written to the AOF file.
  • Everysec creates an asynchronous task that calls fsync() once every second.
  • No never calls fsync(); the operating system decides when to write the data back.
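To make the difference concrete, here is a minimal Python sketch of a toy AOF writer in which the policy only decides when fsync() is called. The class AofWriter and its fields are hypothetical names of my own, not Redis internals. It also simplifies Everysec: instead of an asynchronous task, it checks inline whether one second has passed since the last fsync().

```python
import os
import time


class AofWriter:
    """Toy AOF writer: buffer -> write() -> optional fsync(), depending on the policy."""

    def __init__(self, path, policy="everysec", clock=time.monotonic):
        if policy not in ("always", "everysec", "no"):
            raise ValueError(f"unknown policy: {policy}")
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.policy = policy
        self.clock = clock
        self.buf = bytearray()      # plays the role of the aof_buf buffer
        self.last_fsync = clock()
        self.fsync_calls = 0

    def append(self, command: bytes) -> None:
        self.buf += command         # step 1: append to the user-space buffer

    def flush(self) -> None:
        data = bytes(self.buf)
        while data:                 # step 2: write() copies the data into the kernel buffer
            written = os.write(self.fd, data)
            data = data[written:]
        self.buf.clear()
        now = self.clock()
        due = self.policy == "everysec" and now - self.last_fsync >= 1.0
        if self.policy == "always" or due:
            os.fsync(self.fd)       # returns only after the kernel buffer reaches the disk
            self.last_fsync = now
            self.fsync_calls += 1
        # policy "no": never call fsync() here; the operating system decides

    def close(self) -> None:
        os.close(self.fd)
```

With "always", every flush() pays for an fsync(); with "no", flush() returns as soon as write() has copied the data to the kernel buffer. This is the same trade-off between performance and data loss described above.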

AOF rewrite mechanism

An AOF log is a file that grows in size as more write commands are executed.

If the AOF log file is too large, performance problems will occur. For example, after Redis is restarted, the contents of the AOF file need to be read to recover data. If the file is too large, the whole recovery process will be slow.
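To see why recovery time grows with the file, here is a minimal Python sketch of a replay. It assumes the log format shown earlier ("*" part count, "$" byte length, lines ending in "\r\n") and only understands the set command. The function names are my own, and real recovery handles many more commands.

```python
def parse_commands(data: bytes):
    """Parse a log made of '*<n>' / '$<len>' records into a list of commands."""
    commands, pos = [], 0

    def read_line() -> bytes:
        nonlocal pos
        end = data.index(b"\r\n", pos)
        line = data[pos:end]
        pos = end + 2
        return line

    while pos < len(data):
        header = read_line()
        if not header.startswith(b"*"):
            raise ValueError("expected '*<part count>'")
        parts = []
        for _ in range(int(header[1:])):
            size_line = read_line()
            if not size_line.startswith(b"$"):
                raise ValueError("expected '$<byte length>'")
            size = int(size_line[1:])
            parts.append(data[pos:pos + size].decode("utf-8"))
            pos += size + 2         # skip the value and its trailing "\r\n"
        commands.append(parts)
    return commands


def replay(data: bytes) -> dict:
    """Rebuild the key/value state by executing every logged set command in order."""
    db = {}
    for parts in parse_commands(data):
        if len(parts) == 3 and parts[0].lower() == "set":
            db[parts[1]] = parts[2]
    return db
```

Every command in the file is parsed and executed once, so the work is proportional to the number of logged commands, not to the number of keys that survive.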

Therefore, Redis provides the AOF rewrite mechanism to keep the AOF file from growing too large. When the size of the AOF file exceeds the configured threshold, Redis triggers an AOF rewrite to compress the file.
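As a rough illustration of the threshold check, here is a Python sketch. It assumes the two redis.conf settings auto-aof-rewrite-min-size (64mb by default) and auto-aof-rewrite-percentage (100 by default). The function name and the exact arithmetic are my own approximation, not code taken from Redis.

```python
MB = 1024 * 1024


def should_rewrite(current_size: int, base_size: int,
                   min_size: int = 64 * MB, percentage: int = 100) -> bool:
    """Return True when the AOF is big enough and has grown enough since the last rewrite."""
    if percentage == 0 or current_size <= min_size:
        return False                        # rewrite disabled, or file still small
    base = base_size if base_size > 0 else 1
    growth = current_size * 100 // base - 100   # growth over the base size, in percent
    return growth >= percentage
```

Under these assumptions, a 130 MB file whose size after the previous rewrite was 64 MB has grown by more than 100 percent and would be rewritten, while a 100 MB file with the same base would not.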

During a rewrite, Redis reads all key/value pairs in the current database and records each key/value pair to a “new AOF file” with a single command. When all pairs have been recorded, the new AOF file replaces the existing AOF file.

For example, if “set name xiaolin” and “set name xiaolincoding” are executed before the rewrite mechanism is used, both commands are recorded in the AOF file.

After a rewrite, however, Redis reads the latest value of name and records it to the new AOF file with a single “set name xiaolincoding” command. The first command does not need to be recorded, because it is a “history” command that no longer has any effect. In this way, each key-value pair needs only one command in the rewrite log.

When the rewrite is complete, the new AOF file will overwrite the existing AOF file, which is equivalent to compressing the AOF file, making the AOF file smaller.

This command is then used to write the key/value pair directly when recovering data from the AOF log.

So the beauty of the rewrite mechanism is this: even if a key/value pair has been modified repeatedly by many write commands, Redis only needs the latest state of that key/value pair and records it with one command, instead of keeping all the earlier commands for it. This reduces the number of commands in the AOF file. When the rewrite is complete, the new AOF file replaces the existing one.
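The idea can be sketched in a few lines of Python. The database is modeled as a plain dict and only set is supported. Both functions are hypothetical helpers for illustration.

```python
def apply(db: dict, command: tuple) -> None:
    """Execute a ('set', key, value) command against the in-memory database."""
    op, key, value = command
    if op != "set":
        raise ValueError(f"unsupported command in this sketch: {op}")
    db[key] = value


def rewrite(db: dict) -> list:
    """Emit exactly one set command per key, based on the current state only."""
    return [("set", key, value) for key, value in db.items()]
```

After applying “set name xiaolin” and then “set name xiaolincoding”, the old log holds two commands, but rewrite(db) returns only ("set", "name", "xiaolincoding").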

Why does the AOF rewrite write to a new AOF file and then replace the old one, instead of reusing the existing AOF file?

Because if the AOF rewrite failed partway through, the existing AOF file would be contaminated and might no longer be usable for recovery.

So the rewrite writes to a new AOF file first. If the rewrite fails, Redis simply deletes the new file, and the existing AOF file is unaffected.
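The same “write a new file, then swap” pattern looks like this in a minimal Python sketch. The function name and the ".rewrite.tmp" suffix are made up for illustration; the sketch relies on os.replace(), which renames the new file over the old one in a single step.

```python
import os


def replace_aof(aof_path: str, commands) -> None:
    """Write `commands` (an iterable of bytes) to a temp file, then swap it in."""
    tmp_path = aof_path + ".rewrite.tmp"    # hypothetical temp-file name
    try:
        with open(tmp_path, "wb") as f:
            for command in commands:
                f.write(command)
            f.flush()
            os.fsync(f.fileno())            # make sure the new file is on disk first
        os.replace(tmp_path, aof_path)      # the old file is replaced only on success
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)             # failed rewrite: delete the new file
        raise
```

If anything goes wrong while the new file is being written, only the temporary file is deleted and the existing AOF file stays exactly as it was.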

AOF background rewrite

Although writing to the AOF log is done in the main process, it generally does not affect command operations because it does not write much.

However, when an AOF rewrite is triggered, for example when the AOF file grows larger than 64 MB, the whole AOF file is rewritten. Redis has to read all the key/value data in memory, generate a command for each key/value pair, write these commands to the new AOF file, and finally replace the current AOF file.

This process is time-consuming, so the rewrite must not run in the main process.

Therefore, Redis performs the AOF rewrite in a background child process, bgrewriteaof, which brings two benefits:

  • During the child process’s AOF rewrite, the main process can continue processing command requests, thus avoiding blocking the main process.
  • The child process holds its own copy of the main process’s data. A child process is used rather than a thread because threads share memory, so modifying shared data would require locks to keep it safe, which would degrade performance. With a child process, parent and child also share memory at first, but only in read-only mode. When either one modifies the shared memory, copy-on-write occurs, so parent and child end up with independent copies of the data and no locks are needed.

How can a child process have the same copy of data as the main process?

The main process creates the bgrewriteaof child process through the fork system call. The operating system copies only the main process’s “page table” to the child. The page table records the mapping between virtual addresses and physical addresses, and the physical memory itself is not copied. In other words, the two processes have separate virtual address spaces that map to the same physical memory.

In this way, the child process shares the physical memory of the parent process, which saves physical memory. The page table entries of both processes mark this shared physical memory as read-only.

However, when the parent or the child writes to this memory, the CPU triggers a page fault, because the write violates the read-only permission. In the page fault handler, the operating system copies the physical memory, resets the memory mapping, and sets the memory permission of both parent and child to readable and writable. Only then is the write performed. This process is known as “copy-on-write”.

As the name implies, copy-on-write means that the operating system copies physical memory only when a write occurs. This prevents the parent process from being blocked for a long time by copying all of its physical memory when the child process is created by fork.
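The visible effect of fork can be observed with a small Python sketch (POSIX only, since it relies on os.fork). The function name is my own. Note that it only demonstrates the outcome, namely that the child keeps the data as it was at fork time while the parent keeps writing. It does not show the page-level copying, which the kernel does behind the scenes.

```python
import json
import os


def child_view_after_parent_write(data: dict, key, new_value) -> dict:
    """Fork, let the parent modify `data`, and return what the child still sees."""
    go_r, go_w = os.pipe()      # parent -> child: "I have modified my copy"
    out_r, out_w = os.pipe()    # child -> parent: the child's view of the data
    pid = os.fork()
    if pid == 0:                # child: its memory is a snapshot taken at fork()
        try:
            os.close(go_w)
            os.close(out_r)
            os.read(go_r, 1)    # wait until the parent has written
            os.write(out_w, json.dumps(data).encode())
        finally:
            os._exit(0)         # never return into the parent's code path
    os.close(go_r)
    os.close(out_w)
    data[key] = new_value       # parent write: only the parent's copy changes
    os.write(go_w, b"x")
    os.close(go_w)
    chunks = []
    while True:
        chunk = os.read(out_r, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(out_r)
    os.waitpid(pid, 0)
    return json.loads(b"".join(chunks))
```

If the parent holds {"name": "xiaolin"} and changes name to "xiaolincoding" after the fork, the child still reports "xiaolin". This is why the rewrite child can scan the data without any locks.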

Of course, when the operating system copies the parent page table, the parent process is blocked, but the size of the page table is much smaller than the actual physical memory, so copying the page table is usually faster.

However, if the parent’s memory data is very large, the page table will naturally be large, and the parent will block for longer while forking.

So, there are two phases that cause the parent process to block:

  • In the process of creating a child process, because the data structure such as the page table of the parent process needs to be copied, the blocking time depends on the size of the page table. The larger the page table, the longer the blocking time.
  • After the child process is created, if the child or parent process modifies the shared data, copy-on-write occurs and physical memory is copied. The larger the memory being copied, the longer the blocking time.

When the rewrite is triggered, the main process creates a child process to rewrite the AOF. At this point, parent and child share the physical memory, and the rewrite child only reads it. The child reads all the data in the database, converts the in-memory key-value pairs into commands one by one, and logs those commands to the rewrite log (the new AOF file).

However, during the child process rewrite, the main process can still process the command normally.

If the main process modifies an existing key-value pair, copy-on-write occurs. Note that only the physical memory modified by the main process is copied. The physical memory that is not modified remains shared with the child process.

Therefore, if you modify a bigkey, that is, a key-value with a large amount of data, the process of copying physical memory data will be time-consuming and may block the main process.

If the main process changes an existing key-value pair during the rewrite, the child process’s memory data becomes inconsistent with the main process’s memory data.

To resolve this data inconsistency, Redis sets up an AOF rewrite buffer, which comes into use once the bgrewriteaof child process has been created.

During an AOF rewrite, when Redis executes a write command, it writes the command to both the AOF buffer and the AOF rewrite buffer.

That is, while the bgrewriteaof child process is rewriting the AOF, the main process needs to do the following three things:

  • Execute the command sent by the client;
  • Append the executed write command to the “AOF buffer”;
  • Append the executed write command to the “AOF rewrite buffer”.

When the child process completes the AOF rewrite (scanning all the data in the database, converting the key-value pairs of the in-memory data into a command one by one, and recording the command in the rewrite log), it sends a signal to the main process, which is an asynchronous method of communication between processes.

When the main process receives the signal, it calls a signal handler function that does the following:

  • Append all contents of the AOF rewrite buffer to the new AOF file, so that the database states stored in the old and new AOF files are consistent;
  • Rename the new AOF file so that it replaces the existing AOF file.
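The role of the AOF rewrite buffer can be simulated with a small Python sketch. It uses no real fork or files: the child's view of the data is modeled by copying the dict when the rewrite starts. All class, method, and field names are hypothetical.

```python
class Server:
    """Toy model of the main process during an AOF background rewrite."""

    def __init__(self):
        self.db = {}
        self.aof_buf = []          # the normal AOF buffer
        self.rewrite_buf = None    # the AOF rewrite buffer; None when no rewrite is running
        self.snapshot = None       # stands in for the child's view of the data at fork time

    def execute(self, key, value):
        self.db[key] = value                   # 1. execute the command
        command = ("set", key, value)
        self.aof_buf.append(command)           # 2. append to the AOF buffer
        if self.rewrite_buf is not None:
            self.rewrite_buf.append(command)   # 3. append to the AOF rewrite buffer

    def start_rewrite(self):
        self.snapshot = dict(self.db)
        self.rewrite_buf = []

    def finish_rewrite(self):
        """Play the signal handler: the child's output plus the commands it missed."""
        new_aof = [("set", k, v) for k, v in self.snapshot.items()]
        new_aof.extend(self.rewrite_buf)
        self.snapshot = None
        self.rewrite_buf = None
        return new_aof


def replay(commands):
    """Rebuild the database state from a list of set commands."""
    db = {}
    for _, key, value in commands:
        db[key] = value
    return db
```

If a key is changed after start_rewrite(), the child's part of the new AOF still holds the old value, but the appended rewrite buffer carries the newer command. Replaying the new AOF therefore gives the same state as the main process.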

After the signal handler completes, the main process can continue processing commands as usual.

During the AOF background rewrite, once the child process has been created, the main process is blocked only in two cases: when copy-on-write occurs, and while the signal handler runs. At all other times, the AOF background rewrite does not block the main process.

Conclusion

This article introduced AOF, one of the Redis persistence techniques. Each time a write command is executed, Redis appends the command to the AOF file. During recovery, it executes the commands in the file in order to restore the data.

Redis provides three policies for writing AOF logs back to disk, Always, Everysec, and No, which are high to low in reliability and low to high in performance.

As more commands are executed, the AOF file naturally grows. To keep the log file from becoming too large, Redis provides the AOF rewrite mechanism, which scans all key/value pairs in the database, generates one write command for each key/value pair, and writes these commands to a new AOF file. When the rewrite is complete, the new file replaces the existing AOF log. The rewrite is done by a background child process, so the main process can continue processing commands as normal.

Recovering data from the AOF log is slow, because Redis replays the logged commands one by one in a single thread. If the AOF log is large, the “replay” process can take a long time.

