redo log
Transaction support is one of the key features that distinguish a database from a file system. A transaction has four properties (ACID):
- Atomicity: All operations in a transaction either complete or none of them do; the transaction is indivisible.
- Consistency: When the database moves from one state to another, the result is consistent. For example, when A transfers 500 to B, A ends up with 500 less and B with 500 more, but the value of A+B remains unchanged.
- Isolation: Concurrent transactions are isolated from each other.
- Durability (persistence): Once a transaction is committed, its changes to data are permanent.
This article focuses on durability.
When we update a record in a transaction, for example:
update user set age=11 where user_id=1;
The process goes roughly like this:
- Check whether the page containing user_id=1 is in memory. If not, read it from disk and load it into memory
- Change age to 11 in the in-memory page
- Write the redo log, which is now in the prepare state
- Write the binlog
- Commit the transaction, and the redo log changes to the commit state
This raises a few key questions: What is the redo log? Why do we need it? What is a redo log in the prepare state? Can we keep only one of the redo log and the binlog? With these questions in mind, let's demystify the redo log.
Why update memory data first instead of disk data directly?
Why not update the data on disk directly every time we change something? First, disk IO is slow and memory is fast; the two are not in the same order of magnitude. To cope with slow disk IO there are indexes: even with millions of rows, an index lets us locate our data on disk very quickly. However, an index is not static and has to be maintained. When we insert a new row A that must be placed in front of an existing row B, we have to move B to make room for A, which has some overhead. Worse, if the page is already full, a new page must be allocated and some of the data moved over. This is called a page split, and it is even more expensive.
If every SQL change modified the data on disk directly and ran into these problems, it would be very inefficient and could even cause timeouts. That is why the update process above first loads the data page into memory and updates it there. In MySQL, all changes are first made to data in the buffer pool, and dirty pages in the buffer pool are then flushed to disk at a certain frequency (the checkpoint mechanism). The buffer pool bridges the speed gap between CPU and disk so that overall performance does not degrade too much.
Why do I need redo logs?
The buffer pool helps bridge the gap between CPU and disk, and the checkpoint mechanism ensures that data eventually reaches disk. However, a checkpoint is not triggered on every change; the master thread runs it at intervals. So the worst case is that the database goes down right after a change is written to the buffer pool, and the data is lost and cannot be recovered. To solve this durability problem, the InnoDB engine uses Write-Ahead Logging (WAL): write the log first, then write the data pages to disk. A transaction is considered committed only once its log has been written successfully. This log is the redo log. When a crash occurs before the data has been flushed to disk, the redo log can be used to recover it, which guarantees the D in ACID. This is what the redo log is for.
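The write-ahead idea can be sketched with a toy model. This is a minimal illustration, not InnoDB's implementation: the class TinyWalStore and its method names are hypothetical, and plain Python containers stand in for the redo log file, the data files, and the buffer pool.

```python
class TinyWalStore:
    """Toy model of write-ahead logging: the log is durable, pages are flushed lazily."""

    def __init__(self):
        self.disk_pages = {}    # stands in for data files (survive a crash)
        self.redo_log = []      # stands in for the redo log file (survives a crash)
        self.buffer_pool = {}   # in-memory pages (lost on a crash)

    def update(self, key, value):
        # WAL rule: record the change in the log before the commit is acknowledged.
        self.redo_log.append((key, value))
        # The in-memory page is now dirty; it has not reached disk_pages yet.
        self.buffer_pool[key] = value

    def checkpoint(self):
        # Flush dirty pages; the redo records they cover are no longer needed.
        self.disk_pages.update(self.buffer_pool)
        self.redo_log.clear()

    def crash(self):
        # Everything held only in memory disappears.
        self.buffer_pool = {}

    def recover(self):
        # Replay the surviving redo records on top of the data files.
        for key, value in self.redo_log:
            self.disk_pages[key] = value
        self.buffer_pool = dict(self.disk_pages)


store = TinyWalStore()
store.update("user:1:age", 11)
store.crash()            # the dirty page never reached disk
store.recover()
print(store.disk_pages)  # {'user:1:age': 11}
```

Without the redo_log list, the update would be gone after crash(); with it, the committed change can be replayed.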
How is redo log implemented?
The redo log is not written directly to disk either. It has its own in-memory buffer, the redo log buffer. InnoDB first writes redo records into the redo log buffer, and the buffer is then flushed to the redo log file. The redo log buffer does not need to be large, because it is only a temporary container: the master thread flushes it to the redo log file every second, so it only has to hold about 1s of transaction changes. The default size is 16MB (MySQL 5.7.23, for example).
mysql> show variables like '%innodb_log_buffer_size%';
+------------------------+----------+
| Variable_name          | Value    |
+------------------------+----------+
| innodb_log_buffer_size | 16777216 |
+------------------------+----------+
A 16MB buffer is sufficient for most applications. The buffer is flushed to the redo log file in the following cases:
- The master thread flushes the buffer to the redo log file every second
- The buffer is flushed to the redo log file when a transaction commits
- When less than half of the buffer space is free, the buffer is flushed to the redo log file
Note that a flush of the redo log buffer does not necessarily reach the disk right away. It may only reach the OS cache, an optimization modern operating systems use to improve file write efficiency; the actual write to disk is then left to the operating system (for example, when the OS cache has grown large enough). This is a problem for InnoDB: if the fsync is left to the operating system and the system goes down, the data is lost (although the probability of the whole system going down is relatively small). For this case, InnoDB provides the innodb_flush_log_at_trx_commit setting and lets the user decide which strategy to use.
mysql> show variables like 'innodb_flush_log_at_trx_commit';
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| innodb_flush_log_at_trx_commit | 1     |
+--------------------------------+-------+
- 0: nothing is written or fsynced when a transaction commits. Instead, the master thread writes and fsyncs the redo log every 1s
- 1: the default value. The redo log is written and fsynced every time a transaction commits
- 2: the redo log is written to the OS cache when a transaction commits, and the fsync to disk happens later, about once per second
Comparing the three flush strategies:
0 does the least work at commit time, so it is the most efficient, but up to about 1s of committed transactions may still be sitting in the redo log buffer. If either the mysqld process or the operating system goes down, they are lost, so the D in ACID cannot be guaranteed.
2 is a compromise. In theory its IO efficiency is higher than 1 and no higher than 0, and its data safety is lower than 1 and higher than 0. The log is already in the OS cache at commit, so a crash of the mysqld process alone loses nothing, but whenever the operating system goes down, the data in the OS cache is lost. This strategy cannot guarantee D either.
1 is the default value. It guarantees D, so committed data is not lost, but it is the least efficient. I recommend keeping the default: although the operating system is theoretically less likely to go down than the database, if you are using transactions at all, data safety should matter more.
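The difference between the three values can be made concrete with a small simulation. This is a simplified sketch under stated assumptions: the function surviving_commits is hypothetical, the crash is assumed to happen before the once-per-second background flush runs, and three counters stand in for the redo log buffer, the OS cache, and the disk.

```python
def surviving_commits(policy, crash, commits=3):
    """Count committed transactions whose redo survives a crash that
    happens before the once-per-second background flush runs."""
    log_buffer, os_cache, disk = 0, 0, 0
    for _ in range(commits):
        if policy == 0:
            log_buffer += 1   # nothing is written at commit time
        elif policy == 1:
            disk += 1         # write + fsync at every commit
        elif policy == 2:
            os_cache += 1     # written to the OS cache, no fsync yet
        else:
            raise ValueError(f"unknown policy: {policy}")
    if crash == "mysqld":
        # Only process memory (the log buffer) is lost; the OS is still
        # running and will write its cache to disk later.
        return disk + os_cache
    if crash == "os":
        # Process memory and the OS cache are both lost.
        return disk
    raise ValueError(f"unknown crash type: {crash}")


results = {
    policy: (surviving_commits(policy, "mysqld"), surviving_commits(policy, "os"))
    for policy in (0, 1, 2)
}
print(results)  # {0: (0, 0), 1: (3, 3), 2: (3, 0)}
```

Each tuple is (survivors after a mysqld crash, survivors after an OS crash). Only the value 1 keeps all three commits in both cases.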
The redo log records physical modifications to pages, in the form "at offset x of page X, the value was changed to xx", for example:
page(2,4), offset 64, value 2
In InnoDB, the redo log is stored in 512-byte blocks, each called a redo log block. If the redo log generated for a page is larger than one block can hold, it is logically split across multiple blocks.
A redo log block consists of a log header, a log body, and a log tail. The log header takes up 12 bytes and the log tail takes up 8 bytes, so a block can actually store 512-12-8=492 bytes of data. Redo log blocks make up the redo log.
The default size of each redo log file is 48MB:
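As a rough illustration of the block layout, the sketch below splits a stream of redo bytes into 512-byte blocks. The header and tail sizes are the ones stated in this article; the function split_into_blocks and the placeholder header/tail contents are hypothetical and do not reflect the real on-disk format.

```python
BLOCK_SIZE = 512
HEADER_SIZE = 12   # log header size as stated in the article
TAIL_SIZE = 8      # log tail size as stated in the article
PAYLOAD_SIZE = BLOCK_SIZE - HEADER_SIZE - TAIL_SIZE  # 492 bytes of log body


def split_into_blocks(redo_bytes):
    """Cut redo data into fixed-size blocks, padding the last body with zeros."""
    blocks = []
    for start in range(0, len(redo_bytes), PAYLOAD_SIZE):
        body = redo_bytes[start:start + PAYLOAD_SIZE].ljust(PAYLOAD_SIZE, b"\x00")
        # "H" and "T" bytes are placeholders for the real header and tail.
        blocks.append(b"H" * HEADER_SIZE + body + b"T" * TAIL_SIZE)
    return blocks


blocks = split_into_blocks(b"x" * 1000)
print(PAYLOAD_SIZE, len(blocks), {len(b) for b in blocks})  # 492 3 {512}
```

1000 bytes of redo data need three blocks, because each block carries only 492 bytes of body.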
mysql> show variables like 'innodb_log_file_size';
+----------------------+----------+
| Variable_name        | Value    |
+----------------------+----------+
| innodb_log_file_size | 50331648 |
+----------------------+----------+
InnoDB has 2 redo log files by default. Together they form the redo log group, and all redo log writes go to this group.
mysql> show variables like 'innodb_log_files_in_group';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| innodb_log_files_in_group | 2     |
+---------------------------+-------+
#ib_logfile0
#ib_logfile1
When ib_logfile0 is full, writing moves on to ib_logfile1. When ib_logfile1 is full, writing wraps around to ib_logfile0 again, and so on.
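The wrap-around between the files in the group can be sketched as follows. This is a simplified model: the class RedoLogGroup is hypothetical, a write that does not fit simply moves to the next file instead of being split across files, and the interaction with checkpoints is not modeled.

```python
class RedoLogGroup:
    """Toy model of a redo log group that is written in a circle."""

    def __init__(self, file_count=2, file_size=48 * 1024 * 1024):
        self.file_count = file_count
        self.file_size = file_size
        self.current = 0   # index of the file currently being written
        self.offset = 0    # bytes already used in that file

    def append(self, nbytes):
        """Reserve nbytes and return (file name, start offset) of the write."""
        if nbytes > self.file_size:
            raise ValueError("write larger than a whole log file")
        if self.offset + nbytes > self.file_size:
            # Current file is full: move to the next one, wrapping to file 0.
            self.current = (self.current + 1) % self.file_count
            self.offset = 0
        start = self.offset
        self.offset += nbytes
        return f"ib_logfile{self.current}", start


group = RedoLogGroup(file_count=2, file_size=100)  # tiny files for the demo
names = [group.append(60)[0] for _ in range(3)]
print(names)  # ['ib_logfile0', 'ib_logfile1', 'ib_logfile0']
```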
Why is a block designed to be 512 bytes?
It has to do with disk sectors. A sector on a mechanical disk is 512 bytes by default. Writing more than 512 bytes means writing more than one sector, which involves the platter rotating to find the next sector. Suppose two sectors A and B need to be written: if sector A is written successfully and sector B fails, the write is not atomic. If each write is exactly one 512-byte sector, each write is atomic.
Why two-phase commit?
From the flow above, a transaction writes the redo log (prepare), then the binlog, and then commits the redo log (commit). Why is there a prepare step? Why not commit the redo log directly? Suppose the redo log were committed directly and a crash occurred before the binlog was written. The binlog would then lack the corresponding data, so every slave that relies on the binlog to replicate would lack it as well, leaving master and slave inconsistent. A two-phase commit (2PC) is therefore needed to keep the redo log and the binlog consistent. The actual flow is: write the redo log in the prepare state and record the XID of the 2PC in it; write the binlog and record the same XID in it; then write a commit mark in the redo log.
Can we keep only one of the redo log and the binlog?
No. The redo log has a fixed size. When it is full, writing wraps around and overwrites old data. Because the redo log cannot hold all the data, it cannot be used to synchronize data to a slave in master/slave mode. The binlog can: when one binlog file is full, writing continues in a new binlog file.
So can we keep only the binlog and drop the redo log? No again. The redo log is what provides crash safety. Data changes are made to pages in the buffer pool first, which has the advantage that the master thread can write multiple changes to disk at once. But suppose the database crashes after a commit, while the data is still in dirty pages in the buffer pool and has not been flushed to disk. Without a redo log, the committed data could not be restored after the database restarts, which does not satisfy the D of ACID. With a redo log, a transaction counts as committed only once the redo log has been written successfully, and redo log writes are mostly sequential disk IO, so they are much more efficient than flushing pages. After a post-commit crash, the redo log can be used to recover the data, which is why it is needed.
But committing a transaction also requires the binlog write to succeed, so why not use the binlog to recover data that was not flushed? Because the binlog does not know which data has reached the disk, so it does not know which data needs to be recovered. With the redo log, records are discarded once the corresponding data has reached the disk, so after a restart it is enough to replay everything still in the redo log.
How to recover from a crash?
As the two-phase commit shows, the redo log marks each transaction as prepare or commit, and both the redo log and the binlog record the transaction's XID. When the database restarts, it first checks all transactions in the redo log. If a transaction is in the commit state in the redo log, the crash occurred after the commit, and its data just needs to be recovered from the redo log. If a transaction is in the prepare state, the crash occurred before the commit, and the binlog decides what happens to it. If the binlog contains the corresponding XID, the binlog was written successfully and only the commit was missing, so the commit is executed again. If the binlog does not contain the XID, the crash occurred before the binlog was written, and the transaction should be rolled back.
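The recovery rule (commit stays committed; prepare is committed only if its XID made it into the binlog, otherwise rolled back) can be written as a small decision function. This is an illustrative sketch: the function recover_transactions and its input shapes are hypothetical, not a MySQL API.

```python
def recover_transactions(redo_states, binlog_xids):
    """Decide the fate of each transaction found in the redo log after a crash.

    redo_states: dict mapping XID -> "prepare" or "commit"
    binlog_xids: set of XIDs that were fully written to the binlog
    """
    decisions = {}
    for xid, state in redo_states.items():
        if state == "commit":
            decisions[xid] = "commit"     # crashed after commit: just replay
        elif state == "prepare" and xid in binlog_xids:
            decisions[xid] = "commit"     # binlog is complete: finish the commit
        else:
            decisions[xid] = "rollback"   # binlog never saw it: roll back
    return decisions


print(recover_transactions({101: "commit", 102: "prepare", 103: "prepare"}, {101, 102}))
# {101: 'commit', 102: 'commit', 103: 'rollback'}
```

Transaction 103 is rolled back because no slave could ever have received it through the binlog, so keeping it would make master and slave diverge.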
undo log
The redo log guarantees transaction durability; the undo log guarantees transaction atomicity. Before a transaction updates data, it first writes an undo log record, so in the update flow described earlier, writing the undo log comes before the data is changed.
When is an undo log generated?
The undo log is used for MVCC (multi-version concurrency control) and for rollback. Here we focus on rollback. When we insert, update, or delete data in a transaction, a corresponding undo log is generated. With the undo log, a rollback can return the data to its state at the start of the transaction. Note that a rollback does not physically restore the page; it logically undoes the changes. For example, suppose a value is A and your transaction changes it to B, while another transaction changes other data on the same page to C. If the rollback simply restored the whole data page to its old contents, C would be overwritten.
For the InnoDB engine, each row record has several hidden columns in addition to its own data:
- DB_ROW_ID: If the table has no explicit primary key and no unique index, InnoDB automatically adds a hidden column, row_id, as the primary key.
- DB_TRX_ID: Each transaction is assigned a transaction ID, and when changes are made to a record, the transaction ID of that transaction is written to trx_id.
- DB_ROLL_PTR: Rollback pointer, which is essentially a pointer to undo log.
When we perform INSERT:
begin;
INSERT INTO user (name) VALUES ("tom");
Inserting data generates an insert undo log, and the row's rollback pointer points to it. The undo log records the undo log sequence number (undo no), the primary key column, and its value... During rollback, the corresponding row is deleted by its primary key.
An update generates an update undo log. There are two cases: updates that do not change the primary key and updates that do. Take an update that does not change the primary key first:
UPDATE user SET name="Sun" WHERE id=1;
The old record is written to a new undo log whose undo no is 1. The row's rollback pointer now points to this new undo log, which in turn points to the old undo log (undo no=0).
Suppose we now execute:
UPDATE user SET id=2 WHERE id=1;
When the primary key is updated, the original row is first marked as deleted (its delete mark is set). The row is not really deleted at this point; the actual deletion is left to the purge thread to decide. Then a new row is inserted. The new row also generates an undo log, and the undo log sequence number keeps increasing.
As you can see, every change to data generates an undo log. When a record is changed many times, many undo logs are generated. Each undo log records the state before the change, and the sequence numbers keep increasing, so to roll back we just walk backwards through the sequence numbers until we reach the original data.
How is undo log rolled back?
In the example above, assuming a ROLLBACK, the corresponding process would look like this:
- Run the undo no=3 log to delete the row whose id is 2
- Run the undo no=2 log to reset the delete mark of the row whose id is 1 to 0
- Run the undo no=1 log to restore the name of the row whose id is 1 from Sun to tom
- Run the undo no=0 log to delete the row whose id is 1
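The rollback sequence can be replayed with a toy model in which each change pushes the inverse operation onto a list and rollback pops them in reverse order. Everything here is a hypothetical simplification: a dict stands in for the table, the list index plays the role of undo no, and the function names are made up for the example.

```python
table = {}      # row id -> {"name": ..., "deleted": ...}
undo_log = []   # the list index plays the role of undo no


def insert(row_id, name):
    table[row_id] = {"name": name, "deleted": False}
    undo_log.append(("delete_row", row_id))


def update_name(row_id, new_name):
    # Save the old value before overwriting it.
    undo_log.append(("restore_name", row_id, table[row_id]["name"]))
    table[row_id]["name"] = new_name


def update_primary_key(old_id, new_id):
    # Delete-mark the old row, then insert a new row under the new key.
    undo_log.append(("clear_delete_mark", old_id))
    table[old_id]["deleted"] = True
    insert(new_id, table[old_id]["name"])


def rollback():
    # Apply the undo records from the highest undo no down to 0.
    while undo_log:
        record = undo_log.pop()
        action, row_id = record[0], record[1]
        if action == "delete_row":
            del table[row_id]
        elif action == "restore_name":
            table[row_id]["name"] = record[2]
        elif action == "clear_delete_mark":
            table[row_id]["deleted"] = False


insert(1, "tom")           # undo no=0
update_name(1, "Sun")      # undo no=1
update_primary_key(1, 2)   # undo no=2 (delete mark) and undo no=3 (insert)
rollback()
print(table)               # {}
```

After the rollback the table is empty again, which is the state before the transaction began.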
Where are undo logs stored?
InnoDB manages undo logs in segments, called rollback segments. Each rollback segment records 1024 undo log segments. InnoDB supports 128 rollback segments by default:
mysql> show variables like 'innodb_undo_logs';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| innodb_undo_logs | 128   |
+------------------+-------+
So the maximum number of concurrent transactions that can be supported is 128*1024. Each rollback segment is like an array with 1024 elements.
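The "array with 1024 elements" picture can be sketched directly. This is only the article's simplified model: the names below are hypothetical, and scanning for the first free slot is an assumption made for illustration, not how InnoDB actually assigns rollback segments.

```python
ROLLBACK_SEGMENTS = 128    # innodb_undo_logs shown above
SLOTS_PER_SEGMENT = 1024   # undo log segments per rollback segment

# Each rollback segment is modeled as a list of slots; None means free.
segments = [[None] * SLOTS_PER_SEGMENT for _ in range(ROLLBACK_SEGMENTS)]


def find_free_slot(segs):
    """Return (segment number, slot number) of the first free slot, or None."""
    for seg_no, slots in enumerate(segs):
        for slot_no, owner in enumerate(slots):
            if owner is None:
                return seg_no, slot_no
    return None


print(ROLLBACK_SEGMENTS * SLOTS_PER_SEGMENT)  # 131072
seg_no, slot_no = find_free_slot(segments)
segments[seg_no][slot_no] = "trx-1"           # the slot is now taken
print(find_free_slot(segments))               # (0, 1)
```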
When we start a transaction and need to write an undo log, we first have to find a free slot in a rollback segment. Once a free slot is found, an undo page is requested, and the undo log is written into that page. We know that MySQL's default page size is 16KB:
mysql> show variables like '%innodb_page_size%';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| innodb_page_size | 16384 |
+------------------+-------+
Allocating a whole page per transaction is wasteful (unless your transactions are very long). Suppose your application's TPS is 1000: 1s then requires 1000 pages, about 16MB of storage, or about 1GB per minute... At this rate, unless MySQL cleans up very diligently, disk usage would grow very quickly and a lot of space would be wasted. So undo pages are designed to be reused. When a transaction commits, its undo page is not deleted immediately. Because of reuse, an undo page may hold the undo logs of other transactions mixed in. After a transaction commits, its undo log is put on a linked list, and InnoDB checks whether the used space of the undo page is less than 3/4. If so, the undo page can be reused and is not reclaimed, and the undo logs of other transactions can be written after the existing content. Because undo logs are scattered like this, reclaiming the disk space is not very efficient.
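The storage estimate and the 3/4 reuse rule can be checked with a few lines of arithmetic. The helper names are hypothetical, and the reuse check is only the threshold described above, not InnoDB's actual bookkeeping.

```python
PAGE_SIZE = 16 * 1024  # default innodb_page_size in bytes


def undo_bytes_without_reuse(tps, seconds):
    """Space needed if every transaction got its own fresh undo page."""
    return tps * seconds * PAGE_SIZE


def can_reuse(used_bytes, page_size=PAGE_SIZE):
    """An undo page is reusable when less than 3/4 of it is in use."""
    return used_bytes < page_size * 3 / 4


per_second = undo_bytes_without_reuse(1000, 1)
per_minute = undo_bytes_without_reuse(1000, 60)
print(per_second / 2**20)            # 15.625 (MiB per second)
print(round(per_minute / 2**30, 2))  # 0.92 (GiB per minute)
print(can_reuse(4 * 1024), can_reuse(12 * 1024))  # True False
```

The exact figures, about 15.6MiB per second and 0.92GiB per minute, match the rounded "16MB" and "1GB" in the text.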