Preface

Following the walkthrough of the SQL statement execution process, we have started to touch on the storage engine and have understood its role: it is the underlying support for data storage and queries.

We know that MySQL's storage engine layer is designed as a pluggable architecture that supports many different implementations, such as InnoDB, MyISAM, CSV, and Memory; their advantages, disadvantages, and application scenarios were introduced in the previous article.

InnoDB

Although there are many types of storage engines, we do not need to know all of them in depth. We will first pick the key, highest-priority one, so next we will take an in-depth look at the architecture and principles of the InnoDB storage engine.

So why pick it first?

Because it has been the default storage engine since MySQL 5.5.8, and compared with the alternatives it is a more reliable, high-performance, general-purpose storage engine. In practice, most of our business tables are designed with it as the underlying data support, so to tune MySQL's performance it is not enough to merely know of it; we must truly understand it.

Important features

Before we dive into InnoDB, we need to know which storage engine features it implements, so that we can understand the purpose behind its design. The following is an excerpt from the official documentation (MySQL 5.7):

| Feature | Supported |
| --- | --- |
| B-tree indexes | Yes |
| Backup/point-in-time recovery (implemented in the server, not in the storage engine) | Yes |
| Cluster database support | No |
| Clustered indexes | Yes |
| Compressed data | Yes |
| Data caches | Yes |
| Encrypted data | Yes (implemented in the server via encryption functions; data-at-rest encryption is supported in MySQL 5.7 and later) |
| Foreign key support | Yes |
| Full-text search indexes | Yes (MySQL 5.6 and later) |
| Geospatial data type support | Yes |
| Geospatial indexing support | Yes (MySQL 5.7 and later) |
| Hash indexes | No (InnoDB uses hash indexes internally for its adaptive hash index feature) |
| Index caches | Yes |
| Locking granularity | Row |
| MVCC | Yes |
| Replication support (implemented in the server, not in the storage engine) | Yes |
| Storage limits | 64TB |
| T-tree indexes | No |
| Transactions | Yes |

In addition, other features have been added, deprecated, or removed across release iterations; they are not listed here but can be found in the official documentation.

We have all touched or used these features at some point in actual development. However, when we classify and summarize them, the points of most direct concern are the following:

  • Transactions
  • Indexes
  • Data replication
  • Data backup

The principles behind these features will be covered later, but before we can do that, we must first understand InnoDB's architectural design.

Architecture design

Let’s take a look at the architecture diagram of InnoDB, as follows:

Even without yet understanding its architecture and execution flow in detail, the most intuitive takeaway is that data travels from the in-memory structures, through the operating system cache (switching from user mode to kernel mode), and is then written to the on-disk structures.

Memory structure

As you can see from the architecture diagram, InnoDB's in-memory structures are used to manipulate data before it is flushed to disk, and they contain several important components.

**Buffer Pool**: the buffer pool. InnoDB stores data on disk in pages, its smallest unit, including index pages and data pages. Operating directly on the disk for every read and write is too slow, so the buffer pool is used to improve performance. During a read, the page read from disk is cached in the buffer pool; if the requested page is already there, it is returned directly without touching the disk. During a write, the page is modified in the buffer pool first (at which point the buffer pool and the disk are inconsistent; such pages are called dirty pages). InnoDB then relies on dedicated background threads (the checkpoint mechanism) to write the dirty pages back to disk at appropriate times (flushing).
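As a quick, hedged illustration of how the buffer pool can be inspected and sized (the 512M value below is only an example, not a recommendation):

```sql
-- Check the configured buffer pool size (in bytes).
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Inspect runtime statistics: total pages, free pages, and dirty pages.
SELECT POOL_SIZE, FREE_BUFFERS, DATABASE_PAGES, MODIFIED_DATABASE_PAGES
FROM information_schema.INNODB_BUFFER_POOL_STATS;

-- Resize the buffer pool online (supported in MySQL 5.7 and later).
SET GLOBAL innodb_buffer_pool_size = 512 * 1024 * 1024;
```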

The types of pages cached in the buffer pool include index pages, data pages, undo pages, the adaptive hash index, the change buffer, lock information, and so on, with index pages and data pages taking up most of the space. Let's start with the adaptive hash index and the change buffer.

  • Adaptive Hash Index: we know that looking data up through a hash index is very fast, because its time complexity is O(1) and only one lookup is needed. So what does "adaptive" mean? InnoDB automatically builds hash indexes over pages based on how frequently and in what pattern they are accessed, provided the access pattern (query condition) for a page stays the same, and it only works for equality queries. Its purpose is to hash-index hot pages automatically according to the frequency and conditions of SQL execution, speeding up page lookups in the buffer pool.

  • Change Buffer: before MySQL 5.5 it was called the insert buffer; it now also covers UPDATE and DELETE operations, comprising an insert buffer, an update buffer, and a delete buffer, hence the name change (write) buffer. It only works on non-unique secondary indexes, that is, it is an optimization for data with such indexes. When a table with many secondary indexes undergoes frequent insert, update, and delete operations, writing every secondary index change straight to disk would consume a large amount of I/O; instead, the changes to the secondary indexes are cached and recorded in memory, and the master thread later merges them to disk asynchronously, saving a lot of extra overhead (a configuration sketch for the adaptive hash index and the change buffer follows this list).

    The merge to disk is triggered in any of the following cases: 1. the background thread flushes periodically; 2. the database is shut down; 3. the redo log becomes full.
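A minimal sketch of how these two optimizations can be inspected and tuned (the values shown are illustrative defaults, not recommendations):

```sql
-- Adaptive hash index: enabled by default; it can be switched off if the
-- workload does not benefit from equality-lookup hashing.
SHOW VARIABLES LIKE 'innodb_adaptive_hash_index';
SET GLOBAL innodb_adaptive_hash_index = OFF;

-- Change buffer: control which operations are buffered and what share of
-- the buffer pool it may occupy (a percentage, 25 by default).
SHOW VARIABLES LIKE 'innodb_change_buffering';
SET GLOBAL innodb_change_buffering = 'all';
SET GLOBAL innodb_change_buffer_max_size = 25;
```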

**Log Buffer**: the log buffer, also called the redo log buffer. We have already seen several different buffers, and the same idea applies here: they all avoid the large I/O overhead of frequent direct disk operations, which would drag down the efficiency of data operations, and thus significantly improve throughput. The log buffer is an optimization for redo log writes. So what is a redo log? We will find out when we get to the redo log structure on disk.

The log buffer's flushing behavior can be set to three different mechanisms via innodb_flush_log_at_trx_commit:

  1. When set to 1 (the default), logs are written to and flushed to disk each time a transaction commits, ensuring the durability required by ACID.
  2. When set to 0, logs are written and flushed to disk once per second.
  3. When the value is set to 2, logs are written to the operating system cache after each transaction commit and flushed to disk once per second.

If the value is set to 0 or 2, unflushed logs may be lost when the database crashes. However, with a value of 2, as long as the server itself does not go down, the logs already in the operating system cache can still be used to recover the data. How much can be lost is bounded by the log flush interval innodb_flush_log_at_timeout.
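As a hedged example, the flush policy and its companion interval can be viewed and changed at runtime:

```sql
-- 1 (the default) flushes the redo log at every commit and is the safest;
-- 0 and 2 trade durability for throughput.
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
SET GLOBAL innodb_flush_log_at_trx_commit = 2;

-- With 0 or 2, the amount of work that can be lost is bounded by this
-- flush interval (in seconds).
SHOW VARIABLES LIKE 'innodb_flush_log_at_timeout';
```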

In summary, these structures all live in memory, so in some scenarios they are inevitably limited by the amount of memory allocated to them. If we need to optimize, we can adjust the sizes of the different buffers through configuration parameters to save disk I/O. In addition, to keep the cache effective, pages are evicted when memory runs short. There are many eviction strategies, such as random, LRU, and LFU; MySQL uses a variant of the LRU algorithm. Flushing is also performed asynchronously by background threads, so adjusting the flush strategy to improve throughput is another optimization lever, depending on what each scenario can tolerate. For the supported parameters, see the official documentation.
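For illustration only, these are some of the knobs behind the eviction and flushing behavior just described:

```sql
-- InnoDB's LRU variant splits the list into a "young" and an "old" sublist;
-- these parameters control the split point and the promotion delay.
SHOW VARIABLES LIKE 'innodb_old_blocks_pct';
SHOW VARIABLES LIKE 'innodb_old_blocks_time';

-- Flushing aggressiveness: background flushing speeds up as the share of
-- dirty pages approaches this percentage.
SHOW VARIABLES LIKE 'innodb_max_dirty_pages_pct';

-- Buffer sizes are usually set in my.cnf and take effect after a restart,
-- e.g. innodb_buffer_pool_size = 2G, innodb_log_buffer_size = 16M.
SHOW VARIABLES LIKE 'innodb_log_buffer_size';
```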

Disk structure

Having covered the memory structures, let's look at InnoDB's disk structures. Since they live on disk, these structures necessarily exist in the form of files. From the InnoDB architecture diagram we can see two main categories: tablespaces and the redo log.

**Redo log**: let's start with the redo log. From the earlier discussion of the buffer pool we already know its role and principle, so consider this situation: dirty pages cached in the buffer pool have not yet been flushed to disk by the background threads (or have only been half written) when the database goes down. Since that data exists only in memory at that moment, wouldn't it be lost?

To avoid this problem, InnoDB writes every page modification to a separate log file first, so that when the database restarts it can recover by replaying that log. This is how InnoDB implements its crash-safe mechanism and guarantees the durability of transactions.

This is MySQL's WAL (Write-Ahead Logging): the log is written before the data. So why not write the db files first? Because writing db files is random I/O, while appending to a log file is sequential I/O. This design both avoids the drop in throughput caused by writing db files directly and provides a recoverable copy of the data.

If you open the corresponding /mysql data directory, you can see that the ib_logfile0 file is our redo log. By default each file is 48 MB, and the next ib_logfile file is created in sequence.
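The redo log layout can be confirmed with the following read-only variables (48 MB per file and two files are the MySQL 5.7 defaults; changing them means editing my.cnf and restarting):

```sql
SHOW VARIABLES LIKE 'innodb_log_file_size';      -- size of each ib_logfile
SHOW VARIABLES LIKE 'innodb_log_files_in_group'; -- number of ib_logfile files
SHOW VARIABLES LIKE 'innodb_log_group_home_dir'; -- where they are stored
```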

**Tablespaces**: tablespaces are equally important to the InnoDB storage engine; they are where all the data is ultimately stored. As the diagram shows, they are divided into several kinds: the system tablespace, exclusive (file-per-table) tablespaces, general tablespaces, temporary tablespaces, and undo log tablespaces. Their meanings are easy to understand.

  • System tablespace: the shared tablespace created by default; the corresponding file is `/mysql/ibdata1`. It is also the storage area for the data dictionary, the doublewrite buffer, the change buffer, and undo logs. It is the foundation of the other tablespaces and holds important basic information about the data.
  • Exclusive (file-per-table) tablespace: this is the default tablespace type, which can be switched on or off with the parameter `innodb_file_per_table`; each table uses it implicitly when created. Every table gets its own independent tablespace instead of sharing one. Taking a user table as an example, a file such as `user_innodb.ibd` is generated, and that table's index and data are stored in it; other data such as undo logs, the doublewrite buffer, and the change buffer still live in the system tablespace.
  • General tablespace: similar to the system tablespace, it is a shared tablespace, but you can explicitly create one and let several tables share it, so their data is stored in the same `.ibd` file; table data can also be moved between tablespaces (a small creation example follows this list).
  • Temporary tablespace: used to store on-disk temporary table data created by users and by the server; the generated file names start with `ibtmp`. The temporary tablespace is removed when MySQL shuts down normally and is recreated each time the server starts.
  • Undo log tablespace: used to store undo log data. By default undo logs live in the system tablespace, but because the system tablespace can never shrink (freed space is only marked as reusable, not reclaimed), we can use `innodb_undo_directory` and related settings to place undo logs in one or more separate undo tablespace files, named like `undo_001`.
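A hedged sketch of the tablespace settings described above; the names `ts_shared` and `t1` are made up for illustration:

```sql
-- File-per-table tablespaces (ON by default since MySQL 5.6).
SHOW VARIABLES LIKE 'innodb_file_per_table';

-- A general tablespace shared by several tables (MySQL 5.7+ syntax).
CREATE TABLESPACE ts_shared ADD DATAFILE 'ts_shared.ibd' ENGINE = INNODB;
CREATE TABLE t1 (id INT PRIMARY KEY) TABLESPACE ts_shared;

-- Undo tablespaces split out of the system tablespace (configured at
-- initialization via innodb_undo_directory / innodb_undo_tablespaces).
SHOW VARIABLES LIKE 'innodb_undo_directory';
SHOW VARIABLES LIKE 'innodb_undo_tablespaces';
```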

We now know that these different tablespaces are essentially areas or files storing different kinds of data, but what does the internal logical storage structure of a tablespace look like?

A tablespace is organized into four levels: segments, extents, pages, and rows. The following figure shows how they fit together:

  • Segment: a tablespace is composed of various segments, such as data segments, index segments, and rollback segments. A data segment holds the leaf nodes of a B+ tree index, an index segment holds the non-leaf nodes (this ties into InnoDB's index structure, whose principles we will explore later), and a rollback segment is the storage area for undo logs.
  • Extent: InnoDB allocates space to a segment up to four extents at a time. Each extent is 1 MB by default; its exact size depends on the page size.
  • Page: the smallest unit of disk storage in InnoDB. Each page is 16 KB by default, so an extent holds 64 pages by default. The page size can be reduced to 8 KB or 4 KB via `innodb_page_size`, in which case an extent corresponds to 128 or 256 pages.
  • Row: the actual data is finally stored in rows, and the number of rows per page is limited by the page size. Since we define a data type for every field when designing a table, we can add up the byte sizes of the fields and roughly calculate how many rows one page can hold at most.
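For illustration, the page size can be checked directly, and a rough rows-per-page estimate can be made from the average row length (`user` here is just a hypothetical table):

```sql
-- Default 16384 bytes (16 KB); an extent is 1 MB, i.e. 64 such pages.
SHOW VARIABLES LIKE 'innodb_page_size';

-- Avg_row_length gives an approximate per-row size in bytes, from which
-- the number of rows one page can hold can be estimated.
SHOW TABLE STATUS LIKE 'user';
```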

Besides tablespaces and the redo log, the architecture diagram also shows several other structures stored in the system tablespace: the data dictionary, the doublewrite buffer, the change buffer, and the undo log.

  • Data dictionary: internal system tables that store metadata, including information about user-defined tables, columns, indexes, and so on.
  • Doublewrite buffer: together with the redo log, this buffer guarantees data recovery and durability. We already know what the redo log does, so why do we also need a doublewrite buffer? Because InnoDB's unit of disk storage is the page (16 KB by default), while the operating system manages file pages in smaller units (typically 4 KB), one InnoDB page is written by the operating system in several pieces. If the system crashes after only 4 KB of a page has been written, that on-disk page is torn and invalid, and the redo log cannot repair it, because redo recovery assumes intact pages. The doublewrite buffer solves this by writing a copy of the page first; during recovery, the copy plus the redo log are used to restore the page and preserve durability. The mechanism is enabled by default in InnoDB. You might wonder whether writing every page twice doubles the I/O and significantly reduces throughput. In practice it does not: like the log buffer, part of it is in memory and part on disk, and the on-disk part is written sequentially, so the extra I/O overhead is modest.
  • Change Buffer: This has been explained in detail above.
  • Undo log: the rollback log, which records the state of the data before a transaction commits. If something goes wrong before the commit (whether a business-layer exception or a server crash), the undo log can be used to roll the data back to its pre-transaction state, guaranteeing atomicity. So it is really about transactions, and we will return to it when we get to transaction principles.
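As a small, illustrative check, the doublewrite buffer's status can be confirmed through a variable and two counters:

```sql
SHOW VARIABLES LIKE 'innodb_doublewrite';             -- ON by default
SHOW GLOBAL STATUS LIKE 'Innodb_dblwr_pages_written'; -- pages passed through it
SHOW GLOBAL STATUS LIKE 'Innodb_dblwr_writes';        -- batched doublewrite ops
```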

Update SQL execution

Through the in-depth analysis above, we should now have a clear picture of InnoDB's internal architecture; it is no longer mysterious. Still, we may be a little vague about how these components are strung together when SQL actually executes. In the previous article we used a query as the example for the SQL execution process without going into the storage engine, so next we will use an UPDATE statement as the example to tie InnoDB's architecture together.

As we know, an UPDATE goes through the same path as a query: it is parsed by the parser, optimized by the optimizer into an execution plan, and finally handed to the query execution engine, which calls the storage engine API. The difference lies in what the storage engine does.

Suppose we now need to execute the following UPDATE statement:

```sql
update user set name = 'Wind grass' where id = 1;
```

So what exactly is its execution process? Let's look at the diagram first.

The general process is described as follows:

  1. The MySQL server layer implicitly calls the storage engine API to start a transaction (an explicit version of this flow is sketched after the list).
  2. The server then calls the storage engine API to execute the SQL. InnoDB first tries to locate the data to be modified in the buffer pool via the adaptive hash index; if that misses, it searches the cached data pages; if that still misses, it reads the page from the tablespace in the disk structure and caches it in the buffer pool's data pages.
  3. After the data is modified on the page in the buffer pool, the operation is recorded in the log buffer, whose contents are flushed asynchronously to the redo log file on disk. At this point the redo record is in the prepare state.
  4. When the update is complete, the server layer is notified; it writes the change to the binlog and then calls the storage engine API to commit the transaction.
  5. After the commit, a record is written to the log buffer again and asynchronously flushed to the redo log file on disk; the redo record's state becomes commit.
  6. Finally, with the transaction complete, the SQL update is finished and the server layer returns a success message to the client.
  7. The data pages, index pages, and change buffer held in the buffer pool are flushed to the tablespaces on disk periodically and asynchronously by dedicated background threads.
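From the client's point of view the same flow can be made explicit; normally autocommit wraps a single UPDATE in an implicit transaction, which is what step 1 refers to:

```sql
START TRANSACTION;
update user set name = 'Wind grass' where id = 1;
COMMIT;
```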

The server layer involves another log file here: the binlog, which records all DDL and DML statements in the form of events. With binary logging enabled, the binlog can be used for data recovery, and master/slave replication also depends on it.
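For illustration, the binlog-related settings at the server layer can be inspected like this:

```sql
SHOW VARIABLES LIKE 'log_bin';        -- whether binary logging is enabled
SHOW VARIABLES LIKE 'binlog_format';  -- STATEMENT, ROW, or MIXED
SHOW VARIABLES LIKE 'sync_binlog';    -- binlog flush policy, the counterpart
                                      -- of innodb_flush_log_at_trx_commit
SHOW BINARY LOGS;                     -- list the binlog files on disk
```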

Conclusion

These are not all of InnoDB's principles; the implementation of many features has not yet been covered in depth. But through the discussion above we have grasped the fundamentals of InnoDB's architecture, and with that foundation it will be easier to take a more comprehensive look at the features built and extended on top of it.

If anything in this article is wrong, corrections and criticism are very welcome. Thank you very much.


To do one thing extremely well is talent!