Author: Lab Chen/Big Data Open Lab

For a transactional database, the most critical function is to guarantee the ACID properties of transactions, where atomicity and durability are the responsibility of the recovery subsystem. If a transaction cannot continue, the recovery subsystem is used to roll it back; likewise, after a system crash, the recovery subsystem restores the database to its pre-crash state. In this column, we focus on logging protocols and recovery algorithms, the two key parts of a transactional database's recovery subsystem.

– Logging Schemes –

The core of the recovery subsystem is the recovery algorithm, which covers two processes. The first is preparing for recovery while transactions execute. Most systems do this with logging: although recording update logs during transaction execution adds overhead, without logs the system cannot recover after a crash or roll back unfinished transactions. An alternative is Shadow Paging, in which every modification is made in copy-on-write fashion: when data is updated, a copy of the original is made, the update is applied to the copy, and the operation completes by swapping the copy in for the original. Shadow Paging is expensive and generally suits scenarios with infrequent updates, such as text editors, so most transactional database systems use log-based schemes. The second process is using the recorded log information, with appropriate policies, to restore the database to a correct state after a system failure or a transaction rollback.
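The copy-on-write step of Shadow Paging can be sketched as follows. This is a minimal illustration, not a real implementation; the page table and helper names are assumptions:

```python
# Toy shadow-paging update: never modify a page in place. Copy it,
# edit the copy, then commit by swapping the page-table entry.
def shadow_update(page_table, page_id, mutate):
    shadow = bytearray(page_table[page_id])  # copy the original page
    mutate(shadow)                           # apply the update to the copy
    page_table[page_id] = bytes(shadow)      # swap the copy in to commit
```

If the system crashes before the final swap, the original page is untouched, which is why no log is needed, and also why every update pays the cost of copying a whole page.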

  • Physical Logging & Logical Logging

Logging falls into two types: Physical Logging and Logical Logging. Physical Logging records changes to data items, e.g., that the value of data item A was 90 before a change and 100 after it. In a database system, Physical Logging may take the form of Value Logging, which records the data item's identifier and its attribute values before and after the modification; or it may be truly physical, recording the disk page's PageID, the offset and length within the page, and the bytes before and after the modification.
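A physical log record of the second kind, and how it drives redo and undo, might look like the following sketch; all field names are illustrative, not from any particular system:

```python
from dataclasses import dataclass

@dataclass
class PhysicalLogRecord:
    lsn: int          # log sequence number
    txn_id: int       # transaction that made the change
    page_id: int      # disk page holding the data item
    offset: int       # byte offset within the page
    length: int       # number of bytes changed
    before: bytes     # value before the change (used for undo)
    after: bytes      # value after the change (used for redo)

def redo(page: bytearray, rec: PhysicalLogRecord) -> None:
    """Replay the change by writing the after-image."""
    page[rec.offset:rec.offset + rec.length] = rec.after

def undo(page: bytearray, rec: PhysicalLogRecord) -> None:
    """Roll back the change by restoring the before-image."""
    page[rec.offset:rec.offset + rec.length] = rec.before
```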

The other type is Logical Logging, which records not execution results but the data-modification operations themselves, such as INSERT/DELETE/UPDATE. Whereas Physical Logging restores or replays using the values before and after the change, Logical Logging must re-execute the logged operations during replay, and execute their inverse operations during rollback, e.g., issuing a DELETE to undo a logged INSERT.
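The inverse-operation idea can be sketched with a toy key-value "table"; the operation names and layout are illustrative:

```python
# A logical log stores operations, not values; rollback replays each
# operation's inverse.
INVERSE = {"INSERT": "DELETE", "DELETE": "INSERT"}

def apply_op(table, op, key, value=None):
    if op == "INSERT":
        table[key] = value
    elif op == "DELETE":
        del table[key]

def undo_op(table, op, key, old_value=None):
    # Note: the inverse of DELETE needs the deleted value, so a logical
    # log must record enough context to reconstruct it.
    apply_op(table, INVERSE[op], key, old_value)
```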

  • Physical Logging VS Logical Logging

Both types have advantages and disadvantages. Logical Logging writes less content: an update operation needs only one UPDATE statement in the log, so logging overhead is low. Its disadvantage is that it is hard to implement under concurrency. When multiple transactions generate update operations at the same time, the database schedules those operations into some serial order, so a mechanism is needed to guarantee that replay re-executes the operations in exactly the order the scheduler produced. Most database systems therefore use Physical Logging to ensure consistent recovery: the execution order produced by the transaction manager (the concurrency-control subsystem) is captured in the log, and the recovery subsystem follows that order so that rollback and replay of every data-item modification proceed strictly in sequence. Some systems, however, still use Logical Logging, such as the in-memory database engine VoltDB. VoltDB is designed without concurrency control: each CPU core executes its operations serially, so Logical Logging can simply replay them in order.

For a database management system, the recovery subsystem is indispensable: it guarantees durability and correctness in the face of failures, and rolling back transactions during execution preserves atomicity. At the same time, the recovery subsystem carries a performance cost, because a log record is not truly durable until it has been flushed to disk. Even a transaction that has finished all its work must wait for its log to reach disk before responding to the client, so logging often becomes the performance bottleneck of the entire system.

– Recovery System Optimization –

There are two main types of techniques for tuning logging or recovery subsystems: Group Commit and Early Lock Release.

  • Group Commit

Group Commit flushes the logs of a set of concurrently executing transactions to disk together rather than one at a time. The log has its own log buffer: all transactions first write their log records into this buffer, and a dedicated thread periodically flushes the buffer's contents to disk, or flushes immediately when the buffer fills up.
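A minimal group-commit sketch, under the assumption of a shared in-memory buffer and one background flusher thread (the class and a simulated "disk" list are illustrative):

```python
import threading
import time

class GroupCommitLog:
    def __init__(self, flush_interval=0.01):
        self.buffer = []            # shared log buffer
        self.flushed = []           # stands in for the on-disk log
        self.lock = threading.Condition()
        self.flush_interval = flush_interval
        threading.Thread(target=self._flusher, daemon=True).start()

    def append_and_wait(self, record):
        """A committing transaction appends its record, then blocks
        until the flusher has made it durable."""
        with self.lock:
            self.buffer.append(record)
            while record not in self.flushed:
                self.lock.wait()

    def _flusher(self):
        while True:
            time.sleep(self.flush_interval)     # group window
            with self.lock:
                if self.buffer:
                    # One (simulated) disk write covers every waiting
                    # transaction in the batch.
                    self.flushed.extend(self.buffer)
                    self.buffer.clear()
                    self.lock.notify_all()
```

The key point is that many transactions' commit waits are satisfied by one flush, trading a small added latency (the group window) for far fewer disk writes.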

The operating system provides several ways to write to disk, such as sync, fsync, and fdatasync. sync considers the write finished once the data has reached the operating system's file buffers and leaves a background OS process to write the buffers to disk, so data can be lost if the system fails after a sync. Database systems therefore typically persist logs with fsync, which returns only after the data is actually on disk. fsync writes both the file data and the file metadata, such as the modification time and file length; fdatasync differs in that it writes only the data, not the metadata.

Some DBMSs mix fsync and fdatasync: when a metadata change does not affect logging, for example when only the file modification time changed, fdatasync is used; but if an operation changes the file length, fdatasync cannot be used, because it does not persist the metadata change and content would be lost during recovery. Since many DBMSs do not grow the log file incrementally while writing but instead allocate sufficient space up front, the log file's length stays constant during subsequent writes and fdatasync can safely persist log records. In short, Group Commit writes a batch of transactions' logs to disk with a single system call, merging many transactions' I/O and reducing the system's overall I/O.

  • Early Lock Release

In lock-based concurrency control, a later transaction can only wait for a lock until the earlier transaction releases it. In the figure, the black segment indicates the transaction's in-progress operations and the gray segment the time spent flushing its log to disk. Although no data is modified during the flush, the lock can normally be released only after the log reaches disk. Early Lock Release is an optimization for this scenario: release the locks as soon as the transaction's operations are finished, and flush the log afterward, shortening the time the next transaction waits for the lock and improving concurrency.


This method has a drawback, however: the first transaction may have released its locks, but if its log flush fails it must roll back, and since the locks have already been acquired by the next transaction, that transaction must be rolled back together with it. In practice, early release of locks is widely used, especially for index structures. Locking an index node has a large impact, because an index leaf node typically covers a long series of data records; if a leaf node is locked, all of those records are effectively locked. So when operating on indexes, Early Lock Release is usually adopted instead of the two-phase locking protocol, to shorten the time data records stay locked.
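The ordering difference between the two commit disciplines can be sketched with a toy event trace (everything here, classes included, is illustrative):

```python
# With strict locking the successor waits through the log flush; with
# Early Lock Release the lock is freed before the flush begins.
events = []

class TxnLock:
    def release(self):
        events.append("lock released")

class TxnLog:
    def flush(self):
        events.append("log flushed")

def commit_strict(lock, log):
    log.flush()       # successor blocked for the whole flush
    lock.release()

def commit_elr(lock, log):
    lock.release()    # successor may proceed during the flush
    log.flush()
```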

– ARIES Algorithm –

In disk-based database systems, recovery subsystems are mostly implemented based on the ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) algorithm. ARIES manages the data and log buffers with a Steal + No-Force policy, described in detail in the earlier article on in-memory database parsing and mainstream products. In ARIES, every log record has a Log Sequence Number (LSN). In the following figure, the record with LSN 10 is transaction T1's update writing Page 5, and LSN 20 is transaction T2's update writing Page 3. Note that the log keeps a transaction end record: it indicates that the transaction committed, the response was returned to the client, and all of the transaction's work is finished. If the log contains a commit but no end, the transaction may have completed, but the client may not have received the response.


  • ARIES Three-stage recovery

The ARIES recovery algorithm is divided into three phases: Analysis, Redo, and Undo. Details of each phase will be described later.

  1. **Analysis:** After a crash restart, the system first reads the log from disk and analyzes its contents to determine which transactions were active at the time of the crash and which pages had been modified.

  2. **Redo:** In the Redo phase, the system reconstructs the state at the time of the failure from the logs, restoring the dirty pages in memory to their pre-crash state. It repeats history, re-executing every log record, including those of uncommitted transactions.

  3. **Undo:** In the Undo phase, the system rolls back the unfinished transactions. The figure above shows a simple logging example: the system crashed after LSN 60. The log contains an end record for transaction T2, so T2 had committed, while transactions T1 and T3 had not yet completed.

  • Data structures for logging

The ARIES recovery process is driven by the information stored in the log, which consists of multiple log records. One log record contains the transaction ID, the Page ID plus offset and length, the values before and after the change, and additional control information.

ARIES log record types include Update, Commit, Abort, End, and the Compensation Log Record (CLR). The CLR prevents errors caused by transaction rollback: when a transaction rolls back, a CLR is written for each undone log record, so the system can tell from the CLRs which operations have already been rolled back.

ARIES logs carry both redo and undo information, i.e., the values before and after each change. Logs are written to disk sequentially, so the database typically dedicates a separate disk to the log service, distinct from the disks storing data records, to improve log-write performance.

In the figure, the sequence on the right represents the whole log: the blue part is the portion already flushed to disk, and the orange part is still in the log buffer. ARIES tracks a Flushed LSN, the largest LSN that has been flushed to disk. In addition, every disk page that stores data carries a Page LSN, the largest LSN among all operations that modified the page (that is, the LSN of the last operation that modified it). Before flushing a page from the data buffer to disk, the system compares its Page LSN with the Flushed LSN: if the Page LSN is less than or equal to the Flushed LSN, every log record that modified the page is already on disk, so the page itself may be flushed. This is the write-ahead logging (WAL) rule: the log always reaches disk before the data.
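The page-flush test above reduces to a one-line comparison; a sketch, with an assumed mapping from page IDs to Page LSNs:

```python
# WAL rule: a dirty page may be written out only if every log record
# that modified it (up to its Page LSN) is already durable.
def flushable_pages(page_lsns, flushed_lsn):
    """page_lsns maps page_id -> Page LSN of the page in the buffer."""
    return sorted(pid for pid, page_lsn in page_lsns.items()
                  if page_lsn <= flushed_lsn)
```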

In addition, each log record stores a Prev LSN, the LSN of the previous record written by the same transaction. Because all transactions share one log buffer, their records are interleaved in the log; chaining the LSNs of one transaction through Prev LSN lets the system find all of that transaction's log records.
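Walking one transaction's chain out of the interleaved log can be sketched as follows (the log is modeled as a dict of LSN to record; field names are illustrative):

```python
# Follow prev_lsn pointers backward from a transaction's last record to
# recover its full history, newest first.
def txn_history(log, last_lsn):
    lsns = []
    lsn = last_lsn
    while lsn is not None:
        lsns.append(lsn)
        lsn = log[lsn]["prev_lsn"]  # None marks the first record
    return lsns
```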


The recovery subsystem maintains two tables. The Xact Table records the status of every active transaction (active, commit, abort, or end) together with the transaction's most recent log number, the Last LSN. The Dirty Page Table records which pages have been modified since being loaded from disk into the buffer, along with each page's Rec LSN, the log number of the first modification made after the page was loaded.
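Illustrative shapes for the two tables, with assumed field names:

```python
from dataclasses import dataclass

@dataclass
class XactEntry:
    status: str    # 'active', 'commit', 'abort', or 'end'
    last_lsn: int  # most recent LSN written by this transaction

@dataclass
class DirtyPageEntry:
    rec_lsn: int   # LSN of the first modification since the page loaded

xact_table = {}        # txn_id -> XactEntry
dirty_page_table = {}  # page_id -> DirtyPageEntry
```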

Besides the information in the log, the database system records the LSN of the latest Checkpoint in a Master Record, so that recovery can start from that Checkpoint. A conventional checkpoint would require stopping the system (no transactions may execute during it), which is unacceptable to users, so ARIES uses a Fuzzy Checkpoint that lets transactions keep running while the checkpoint is taken. A Fuzzy Checkpoint produces two log records: Begin_Checkpoint records when the checkpoint started, and End_Checkpoint records the Xact Table and Dirty Page Table. The Checkpoint's LSN is written to the Master Record on disk for persistence. These are all the data structures recovery requires; the following table collates and summarizes the various LSNs.
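The fuzzy-checkpoint steps can be sketched as follows; the tiny Log class, the toy LSN assignment, and all names are assumptions for illustration:

```python
class Log:
    def __init__(self):
        self.records = []
    def append(self, rec):
        rec["lsn"] = len(self.records) * 10  # toy LSN assignment
        self.records.append(rec)
        return rec["lsn"]

def fuzzy_checkpoint(log, master_record, xact_table, dirty_page_table):
    begin_lsn = log.append({"type": "begin_checkpoint"})
    # ... transactions keep running here; the tables are snapshotted ...
    log.append({"type": "end_checkpoint",
                "xact_table": dict(xact_table),
                "dirty_page_table": dict(dirty_page_table)})
    master_record["checkpoint_lsn"] = begin_lsn  # persisted Master Record
    return begin_lsn
```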

– Transaction recovery of the database system –

  • Simple transaction recovery

For simple transaction rollback (the system has not failed, but a transaction cannot continue and must be rolled back), the system first finds the transaction's latest LSN in the Xact Table and undoes it, then follows each record's Prev LSN to the previous record and undoes that in turn, until the whole transaction is restored to its original state. As in normal transaction processing, the data being undone is locked, and a compensation log record (CLR) is written before each change is rolled back. The CLR carries an Undo Next LSN pointing to the next LSN that needs to be undone. Finally, Transaction Abort and Transaction End records are written to mark the rollback as complete.
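The rollback walk above can be sketched as follows; the log is a dict of LSN to record, and the record layout is illustrative:

```python
# Walk backward through one transaction's chain, undoing each update
# and emitting a CLR per undone record, then an End record.
def rollback(log, last_lsn, apply_undo):
    clrs = []
    lsn = last_lsn
    while lsn is not None:
        rec = log[lsn]
        if rec["type"] == "update":
            apply_undo(rec)                       # restore before-image
            clrs.append({"type": "CLR",
                         "undo_next": rec["prev_lsn"]})
        lsn = rec["prev_lsn"]
    clrs.append({"type": "end"})                  # rollback complete
    return clrs
```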

  • Failed transaction recovery

As mentioned earlier, ARIES recovers from failures in three phases. The execution details of each phase are described below.

  1. The Analysis phase

    In the Analysis phase, the system reads the location of the last Checkpoint from the Master Record on disk and reconstructs the Xact Table and Dirty Page Table. Starting from the Begin_Checkpoint record, it processes the subsequent log records in order. When it encounters a transaction's end record, it removes that transaction from the Xact Table. When it encounters a commit record, it updates the transaction's status in the Xact Table. For any other record, it checks whether the transaction is in the Xact Table, adds it if not, and updates the transaction's Last LSN to the current record's LSN. The system also checks whether the updated page is in the Dirty Page Table; if not, it adds the page and sets its Rec LSN to the current LSN.
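The analysis pass can be sketched as a single scan; the record layout (dicts with `type`, `txn`, `page`, `lsn` fields) is an assumption for illustration:

```python
# Rebuild the Xact Table and Dirty Page Table by scanning log records
# forward from the checkpoint.
def analysis(log_records):
    xact, dpt = {}, {}
    for rec in log_records:
        t = rec["txn"]
        if rec["type"] == "end":
            xact.pop(t, None)          # finished: drop from Xact Table
            continue
        entry = xact.setdefault(t, {"status": "active", "last_lsn": None})
        entry["last_lsn"] = rec["lsn"]
        if rec["type"] == "commit":
            entry["status"] = "commit"
        if rec["type"] == "update" and rec["page"] not in dpt:
            dpt[rec["page"]] = rec["lsn"]   # Rec LSN = first dirtier
    return xact, dpt
```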

  2. Redo phase

    In the Redo phase, the system starts redoing from the smallest Rec LSN in the Dirty Page Table, because the changes in earlier log records have already been flushed to disk and those pages no longer appear in the table. From that position, it replays every subsequent update record, including CLRs. A record can be skipped, with no redo needed, if its page is not in the Dirty Page Table, or if the page's Rec LSN in the Dirty Page Table is larger than the record's LSN, or if the Page LSN on disk is greater than or equal to the record's LSN; in each case, the record's change is already on disk. Redo itself writes no log records, since it merely reconstructs the in-memory state; if the system fails again during redo, it simply repeats the same process.
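The three skip tests can be sketched as one predicate; `disk_page_lsn` is assumed to be a callable returning the Page LSN stored on disk, and the record layout is illustrative:

```python
# Decide whether an update record must be replayed during redo.
def needs_redo(rec, dirty_page_table, disk_page_lsn):
    pid = rec["page"]
    if pid not in dirty_page_table:            # page was flushed clean
        return False
    if dirty_page_table[pid] > rec["lsn"]:     # dirtied only later
        return False
    if disk_page_lsn(pid) >= rec["lsn"]:       # change already on disk
        return False
    return True
```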

  3. The Undo phase

    The Undo phase rolls back the transactions that were unfinished when the system failed. The system first builds a set of LSNs to undo, seeded with the last log number of every transaction to be rolled back, and then loops: it picks the largest LSN in the set, i.e., the most recent record, and processes it. If the record is a CLR whose Undo Next is empty, the transaction has finished undoing, and an End record can be written for it; if the CLR's Undo Next is not empty, another record still needs undoing, so that LSN is added to the set. If the record is an update, it is rolled back, a CLR is written, and the record's Prev LSN is added to the set. The system repeats this until the set is empty.
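The undo loop can be sketched as follows; the log is a dict of LSN to record, `emit` receives newly written log records, and all names are illustrative:

```python
# Undo all loser transactions together, always processing the largest
# outstanding LSN, and writing a CLR per undone update.
def undo_phase(log, loser_last_lsns, apply_undo, emit):
    to_undo = set(loser_last_lsns)
    while to_undo:
        lsn = max(to_undo)
        to_undo.remove(lsn)
        rec = log[lsn]
        if rec["type"] == "CLR":
            if rec["undo_next"] is None:       # transaction fully undone
                emit({"type": "end", "txn": rec["txn"]})
            else:
                to_undo.add(rec["undo_next"])  # resume where undo left off
        elif rec["type"] == "update":
            apply_undo(rec)
            emit({"type": "CLR", "txn": rec["txn"],
                  "undo_next": rec["prev_lsn"]})
            if rec["prev_lsn"] is None:        # that was the first record
                emit({"type": "end", "txn": rec["txn"]})
            else:
                to_undo.add(rec["prev_lsn"])
```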

    Let's walk through the whole process with an example. The system first takes a Fuzzy Checkpoint, after which there are two updates: T1 modifies P5 and T2 modifies P3. T1 then aborts; the record at LSN 40 is the compensation for rolling back LSN 10, followed by T1's End. Other transactions continue: T3 modifies P1 and T2 modifies P5, and then the system crashes. How does recovery proceed? After Analysis, the Xact Table contains only T2 and T3, and the Dirty Page Table contains P1, P3, and P5. Redo then restores the state at the failure, and Undo rolls back T2 and T3, writing a CLR for each undone record, until each transaction's first record has been undone.

If another crash occurs during recovery (as shown in the figure below), the undo work already performed is redone from its recorded CLRs, and the new undo pass does not roll those operations back again. Recovery continues from where it left off until every transaction's undo is complete.

  • ARIES summary

ARIES is a maturely designed recovery system that guarantees the atomicity and durability of transactions, using WAL together with a Steal + No-Force buffer-management policy without compromising correctness. The LSN in ARIES is a monotonically increasing unique log identifier, and Prev LSN chains link together all the log records of one transaction. The Page LSN records the log number of the last modification to each page. The system uses Checkpoints to reduce the cost of recovery. Recovery proceeds in three steps: Analysis determines which transactions need redo or undo, which pages were modified, and whether the modifications reached disk; Redo then restores the state at the failure; and Undo rolls back the transactions that need to be undone.

– Summary –

This article introduced the basic concepts of logging and recovery and discussed the technical principles of the ARIES recovery subsystem used in traditional disk-based database management systems. The next article will continue exploring the database recovery subsystem, discussing Early Lock Release and Logical Undo in DBMS recovery, and introducing two database recovery techniques as well as recovery methods for in-memory databases.

References:

1. C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwarz. 1992. ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. ACM Trans. Database Syst. 17, 1 (March 1992), 94-162.

2. Antonopoulos, P., Byrne, P., Chen, W., Diaconu, C., Kodandaramaih, R. T., Kodavalla, H., … & Venkataramanappa, G. M. (2019). Constant time recovery in Azure SQL database. Proceedings of the VLDB Endowment, 12(12), 2143-2154.

3. Zheng, W., Tu, S., Kohler, E., & Liskov, B. (2014). Fast databases with fast durability and recovery through multicore parallelism. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (pp. 465-477).

4. Ren, K., Diamond, T., Abadi, D. J., & Thomson, A. (2016, June). Low-overhead asynchronous checkpointing in main-memory database systems. In Proceedings of the 2016 International Conference on Management of Data (pp. 1539-1551).

5. Kemper, A., & Neumann, T. (2011, April). HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In 2011 IEEE 27th International Conference on Data Engineering (pp. 195-206). IEEE.
