InnoDB is a storage engine that supports transaction security and is the default storage engine for mysql. This paper mainly introduces InnoDB row record format and data page realization principle in detail from the perspective of data structure, and see InnoDB storage engine clearly from the bottom.
The main content of this article is based on the nuggets pamphlet “from the root to understand MySQL” collation. If you want to understand in detail, it is recommended to buy the nuggets brochure to read.
InnoDB profile
It is well known that data in mysql is stored on physical disk, and the real data processing is performed in memory. The disk reads and writes slowly. If the disk reads and writes frequently for each operation, the disk performance will be poor. To address these issues, InnoDB divides data into several pages, using the page as the basic unit of disk and memory interaction. Typically, the page size is 16KB. In this case, at least one page of data is read to memory or written to disk at a time. Improves performance by reducing the number of interactions between memory and disk.
In fact, this is essentially a typical cache design idea. Generally, the design of cache is basically considered from the time dimension or space dimension:
Time dimension
: If a piece of data is being used, there is a high probability that it will be used again for some time to come. You can thinkHot data cache
Are all implementations of this idea.Spatial dimensions
: If a piece of data is in use, the data stored near it will most likely be used soon.InnoDB data page
andPage caching for the operating system
Is the embodiment of this idea.
InnoDB row format
Mysql inserts data into a table in the form of a record (row). These records are stored on disk in a format called row format. Mysql supports four different types of row formats: Compact, Redundant (older than this article), Dynamic, and Compressed. We can specify the row format in the statement creating or modifying the table:
CREATE TABLETable name (column information) ROW_FORMAT=Line format nameALTER TABLEThe table name ROW_FORMAT=Line format nameCopy the code
Record_format_demo; record_format_demo; record_format_demo;
mysql> CREATE TABLE record_format_demo (
-> c1 VARCHAR(10),
-> c2 VARCHAR(10) NOT NULL.-> c3 CHAR(10),
-> c4 VARCHAR(10)
-> ) CHARSET=ascii ROW_FORMAT=COMPACT;
Query OK, 0 rows affected (0.03 sec)
Copy the code
Suppose we insert 2 rows of data into the record_format_demo table:
mysql> SELECT * FROM record_format_demo;+------+-----+------+------+ | c1 | c2 | c3 | c4 | +------+-----+------+------+ | aaaa | bbb | cc | d | | eeee | fff | NULL | NULL | + -- -- -- -- -- - + + -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- - + 2 rows in the set (0.00 SEC)Copy the code
COMPACT row format
As can be seen from the figure above, a complete record contains two major parts: the additional information recorded and the real data recorded.
Additional information recorded
There are three main types of additional information for records: variable-length field lists, NULL value lists, and record header information.
Variable-length field list
Mysql supports variable length data types (such as VARCHAR(M) and TEXT). The storage space occupied by these data types is not fixed, but changes with the storage content. To accurately describe this data, the storage space occupied by this variable length field should include both:
- Real data content
- Number of bytes occupied
In the Compact row format, the actual data length of all variable-length fields is stored at the beginning of the record, resulting in a variable-length field length list with the number of bytes occupied by variable-length field data in reverse column order.
We take therecord_format_demo
Take the first row of data as an example. Due to thec1
,c2
andc4
Are converted to data types (VARCHAR(10)
), so save the three column values at the beginning of the record.
The other thing to note is,The variable-length field length list stores only the length of columns with non-NULL values, not the length of columns with NULL values. That is, for the second record, since the value of c4 column is NULL, the variable length field length list for the second record only needs to store the length of c1 and C2 columns.
A NULL value list
For nullable columns, mysql does not store NULL values in the real data portion of the record to save storage space. Instead, it is stored in a list of NULL values in the extra information of the record.
Each null-allowed column has a binary bit (1: the value is NULL, 0: the value is not NULL) to indicate whether or not to store NULL values, and is sorted in reverse order. MySQL specifies that the list of NULL values must be represented as integer bits. If the number of binary bits used is not integer bytes, 0 will be added to the high part of the byte. C1, C3, and C4 are allowed to store NULL values in the record_format_demo table. The first two records look like this after filling in the NULL value list:
Record header information
The record header consists of a fixed set of 5 bytes (40 bits), each of which represents a different meaning:I won’t go into all the details.
Recorded real data
In addition to the column specific data, some hidden column data is automatically added to the recorded real data.
The column name | Whether must | Take up the space | describe |
---|---|---|---|
row_id | no | 6 bytes | A row ID that uniquely identifies a record |
transaction_id | is | 6 bytes | Transaction ID |
roll_pointer | is | 7 bytes | Rollback Pointers |
The actual columns are DB_ROW_ID, DB_TRX_ID, and DB_ROLL_PTR (row_id, transaction_id, and roll_pointer).
Only if the database is not definedA primary key
orThe only key
When, hide the columnrow_id
Exists and uses it as a data tableA primary key
. Because the tablerecord_format_demo
There is no primary key defined, so the MySQL server adds the above three columns to each record. Now let’s seeRecorded real data
Data structures for two records of:
CHAR(M) column storage format
For CHAR(M) columns, the number of bytes used by the column in a fixed-length character set is not added to the variant-length field length list, as is the number of bytes used by the column in a variant-length character set. It is also important to note that the CHAR(M) columns of variable length character sets require at least M bytes, whereas VARCHAR(M) does not. For example, for a CHAR(10) column using the UTF8 character set, the size of the data stored in the column ranges from 10 to 30 bytes, even if we stored an empty string into the column, it would take 10 bytes.
Row overflow data
VARCHAR(M) Maximum number of data that can be stored
MySQL has a limit on the maximum storage space a record can occupy. All columns (excluding hidden columns and record headers), except BLOB or TEXT columns, can occupy a maximum of 65535 bytes. It can be loosely considered that the storage space occupied by a mysql row cannot exceed 65535 bytes. Storage overhead: VARCHAR(M) type VARCHAR(M) type VARCHAR(M) type VARCHAR(M)
- Real data
- The length of bytes occupied by real data
- NULL indicates that the column may NOT have the space if it has the NOT NULL attribute
If varchar_size_demo has only one field of type VARCHAR, the field occupies a maximum of 65532 bytes. Because the length of real data may be 2 bytes, a NULL value indicates that 1 byte is required. If the VARCHAR column does NOT have a NOT NULL attribute, it can only store 65532 bytes of data at most. If the column is an ASCII character set, the maximum number of characters is 65532. If the character set is UTF8, the maximum number of characters is 21844.
Overflow caused by too much data in the record
Let’s use the varchar_size_demo table in the ASCII character set as an example and insert a record:
mysql> CREATE TABLE varchar_size_demo(
-> c VARCHAR(65532)
-> ) CHARSET=ascii ROW_FORMAT=Compact;
Query OK, 0 rows affected (0.01 sec)
mysql> INSERT INTO varchar_size_demo(c) VALUES(REPEAT('a'.65532));
Query OK, 1 row affected (0.00 sec)
Copy the code
The basic unit of disk-memory interaction in mysql is a page, which is usually 16KB, 16,384 bytes, and a single record can occupy the maximum65535
Bytes, and that createsMore than one row of data can be stored on a page. In Compact and Redundant line format, to take up the storage space is very large, in the record of the real data will only be stored in the column part of the data, to spread out the rest of the data stored in a few other pages, and then record the real data in the address of a storage point to these pages with 20 bytes, which can find the remaining data in the page, As shown in the figure:This kind ofOnly the first 768 bytes of the column and an address pointing to another page are stored in the actual data of the record, and then the remaining data is stored in another pageLine overflow
Pages that store more than 768 bytes are also calledOverflow page
.
Critical point of row overflow
MySQL specifies that a page contains at least two rows of records. For example, the varchar_size_demo table has only one column c. If we insert two records into the table, how many bytes must be inserted into each record before the row overflow occurs? We need to analyze how the space on the page is being used.
- In addition to storing our records, each page also stores some additional information, about 132 bytes.
- Each record requires 27 bytes of additional information.
Assume that the number of data bytes stored in a column is N. To ensure that the column does not overflow, the following requirements must be met:
132 + 2X (27 + n) < 16384
Copy the code
So n is less than 8099. That is, if a column stores less than 8099 bytes of data, it is not an overflow column. This value is even smaller if there are multiple columns in the table.
Dynamic and Compressed row format
The default row format in mysql isDynamic
.Dynamic
andCompressed
Row format andCompact
The line format is very similar, just processingLine overflow
There is a discrepancy in the data.Dynamic
andCompressed
The line format will not be inRecorded real data
Instead of storing the first 768 bytes, you store all the bytes on other pages.Compressed
The line format uses a compression algorithm to compress the page to save space.
InnoDB data page structure
We already know that pages are the basic unit of storage InnoDB manages, and the size of a page is typically 16KB. InnoDB has many different types of pages designed for different purposes, but here we focus on pages that store data records, officially called index pages. Since we haven’t introduced indexes yet, let’s call them data pages.
A quick look at the data page structure
The data page can be structurally divided into multiple parts, and different parts have different functions, as shown below:
An InnoDB data page is divided into 7 sections, which are roughly described below.
The name of the | Chinese name | Space occupied | A brief description |
---|---|---|---|
File Header | The file header | 38 byte | Page for some general information |
Page Header | Page header | 56 bytes | Some information unique to the data page |
Infimum + Supremum | Minimum record and maximum record | 26 bytes | Two virtual row records |
User Records | User record | Not sure | The contents of the row records that are actually stored |
Free Space | Free space | Not sure | Unused space in a page |
Page Directory | Page directory | Not sure | The relative positions of some records in a page |
File Trailer | File the tail | 8 bytes | Verify page integrity |
Storage of records in a page
The user’s own stored data will follow the correspondingRow format
There areUser Records
In the. In fact, the newly generated page does notUser Records
Only when we insert data for the first time will theFree Space
Mark a space for the record sizeUser Records
. whenFree Space
When this is done, it means that the current data page is also used.In order to be able toUser Records
To be clear, we have to understand what was mentioned earlierRecord header information
.
Understand the record header information
Firstly, the description of each attribute of record header information is briefly introduced:
The name of the | Size (unit: bit) | describe |
---|---|---|
Set aside a 1 | 1 | Don’t use |
Set aside a 2 | 1 | Don’t use |
delete_mask | 1 | Flags whether the record has been deleted |
min_rec_mask | 1 | This tag is added to the smallest record in each layer of the B+ tree’s non-leaf nodes |
n_owned | 4 | Indicates the number of records owned by the current record |
heap_no | 13 | Represents the current record location information in the record heap |
record_type | 3 | Indicates the type of the current record. 0 indicates the common record, 1 indicates the B+ tree non-leaf node record, 2 indicates the minimum record, and 3 indicates the maximum record |
next_record | 16 | Represents the relative position of the next record |
Next, take the page_demo table as an example and insert some data to detail the record header information.
mysql> CREATE TABLE page_demo(
-> c1 INT.-> c2 INT.-> c3 VARCHAR(10000),
-> PRIMARY KEY (c1)
-> ) CHARSET=ascii ROW_FORMAT=Compact;
Query OK, 0 rows affected (0.03 sec)
mysql> INSERT INTO page_demo VALUES(1.100.'aaaa'), (2.200.'bbbb'), (3.300.'cccc'), (4.400.'dddd');
Query OK, 4 rows affected (0.00 sec)
Records: 4 Duplicates: 0 Warnings: 0
Copy the code
These 4 rows are recorded in InnoDB in the following format (only the header and real data are displayed), and the data in the columns are expressed in decimal:Let’s highlight the details of a few properties against this diagram:
delete_mask
: indicates whether the current record is deleted. 0 indicates that the record is not deleted and 1 indicates that the record is deleted. Undeleted records are not immediately removed from the disk. Instead, they are marked for deletion. All deleted records form oneWaste list
. Newly inserted records may later be reusedWaste list
Space occupied, so the storage space occupied by garbage lists is also calledReusable space
.heap_no
: indicates the positions of the current records on the page. For example, the positions of the previous four records on the page areTwo, three, four, five
. In fact, InnoDB automatically adds two virtual records per page, one of which isMinimum record
And the other one isThe biggest record
. The structure of these two records is very simple5 bytes of record header information
andFixed portion of 8 bytes
(infimum or supremum). The two records are kept separatelyInfimum + Supremum
Part.We can see from the figure that the minimum record and the maximum recordheap_no
The values are 0 and 1, which means they’re at the front.next_record
From:Address offset from the real data in the current record to the real data in the next record. It can be simply understood as a one-way linked list. The next smallest record is the first record, and the next last record is the largest record. To visualize this, we can replace the address offset in next_record with an arrow:As you can see from the picture,User records are actually a one-way linked list, sorted positively by primary key size. If one entry is deleted from the list, the linked list will also change. For example, if we delete the second entry:- The second record is not removed from the storage space
delete_mask
The value is set to 1. - Article 2
next_record
The value changes to 0, meaning that there is no next record for this record. - Article 1
next_record
This points to record 3.
- The second record is not removed from the storage space
Page Directory
We already know that records are concatenated in a single linked list in positive order of primary key size in a page. If we want to find a specific record based on the primary key, the simple way is to traverse the list. However, in the case of large amount of data, this method is obviously inefficient. So mysql uses Page Directory to solve this problem. The principle of Page Directory is as follows:
- Divide all normal records (including maximum and minimum records, excluding those marked as deleted) into groups. We don’t care how we divide it.
- The n_owned attribute in the header of the last record in each group (that is, the largest record in the group) indicates how many records there are in the group.
- The address offset of the last record of each group is separately extracted and sequentially stored near the end of the page, which is called
Page Directory
.
Mysql allows only one record in the smallest group, only 1-8 records in the largest group, and only 4-8 records in the remaining groups. Like nowpage_demo
There are 18 normal records in the table, and InnoDB divides them into 5 groups. The first group has only one minimum record, as shown below:
The process of finding a record with a specified primary key value in a data Page through Page Directory is two-step:
- Determine the slot where the record resides by dichotomy and find the record with the smallest primary key value in the group where the record resides.
- recorded
next_record
Property traverses the records in the group that the slot is in.
The optimization of query performance of linked list is basically realized by dichotomy. This is true of Page Directory, skip lists, and lookup trees described above.
Page Header
Page headers are specifically used to store various state information related to the data Page, such as how many records have been stored in the Page, what is the address of the first record, how many slots have been stored in the Page directory, and so on. It occupies 56 bytes. The attributes of each part of the bytes are as follows:
The name of the | Space occupied | describe |
---|---|---|
PAGE_N_DIR_SLOTS | 2 – | Number of slots in the page directory |
PAGE_HEAP_TOP | 2 – | The minimum address of the space not yet used, that is, after that addressFree Space |
PAGE_N_HEAP | 2 – | Number of records on this page (including minimum and maximum records and records marked for deletion) |
PAGE_FREE | 2 – | The address of the first record that has been marked for deletion (each deleted record also forms a single linked list through next_record, which can be reused) |
PAGE_GARBAGE | 2 – | Number of bytes occupied by deleted records |
PAGE_LAST_INSERT | 2 – | The position where the record was last inserted |
PAGE_DIRECTION | 2 – | The insertion direction of the last entry |
PAGE_N_DIRECTION | 2 – | The number of consecutive insertions in one direction. If the insertion direction of the last record changes, the value of this state will be reset. |
PAGE_N_RECS | 2 – | The number of records on the page (excluding minimum and maximum records and records marked for deletion) |
PAGE_MAX_TRX_ID | 8 bytes | Modifies the maximum transaction ID of the current page, which is defined only in the secondary index |
PAGE_LEVEL | 2 – | The level of the current page in the B+ tree |
PAGE_INDEX_ID | 8 bytes | Index ID, indicating which index the current page belongs to |
PAGE_BTR_SEG_LEAF | 10 bytes | Header information for the leaf segment of the B+ tree, defined only on the Root page of the B+ tree |
PAGE_BTR_SEG_TOP | 10 bytes | The header information of the non-leaf segment of the B+ tree, defined only on the Root page of the B+ tree |
Here is just listed, do not need to understand all temporarily.
File Header
The File Header is used to describe general information that applies to various pages and consists of the following:
The name of the | Space occupied | describe |
---|---|---|
FIL_PAGE_SPACE_OR_CHKSUM | 4 bytes | Page checksum |
FIL_PAGE_OFFSET | 4 bytes | Page number |
FIL_PAGE_PREV | 4 bytes | The page number of the previous page |
FIL_PAGE_NEXT | 4 bytes | Page number of the next page |
FIL_PAGE_LSN | 8 bytes | The position of the Log Sequence corresponding to the last modification of the page |
FIL_PAGE_TYPE | 2 – | The type of the page |
FIL_PAGE_FILE_FLUSH_LSN | 8 bytes | Defined only on one page of the system tablespace, representing that the file has been flushed to at least the corresponding LSN value |
FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID | 4 bytes | Which table space the page belongs to |
Here is just listed, do not need to understand all temporarily. Let’s focus on a few attributes:
FIL_PAGE_SPACE_OR_CHKSUM
The checksum of the current page. For a long string of bytes, we can use some algorithm to calculate a short value to represent the long string of bytes. The short value is called the checksum. The efficiency of string equivalence comparison can be greatly improved by checksum. 2. FIL_PAGE_OFFSET Each page has a unique page number that InnoDB uses to locate a page. 3. FIL_PAGE_TYPE represents the current page type. As we mentioned earlier, InnoDB divides pages into different types for different purposes.
Type the name | hexadecimal | describe |
---|---|---|
FIL_PAGE_TYPE_ALLOCATED | 0x0000 | New assignment, not used yet |
FIL_PAGE_TYPE_ALLOCATED | 0x0000 | New assignment, not used yet |
FIL_PAGE_UNDO_LOG | 0x0002 | The Undo log page |
FIL_PAGE_INODE | 0x0003 | Segment information node |
FIL_PAGE_IBUF_FREE_LIST | 0x0004 | Insert Buffer free list |
FIL_PAGE_IBUF_BITMAP | 0x0005 | The Insert Buffer bitmap |
FIL_PAGE_TYPE_SYS | 0x0006 | The system page |
FIL_PAGE_TYPE_TRX_SYS | 0x0007 | Transaction system data |
FIL_PAGE_TYPE_FSP_HDR | 0x0008 | Table space header information |
FIL_PAGE_TYPE_XDES | 0x0009 | Extended Description page |
FIL_PAGE_TYPE_BLOB | 0x000A | Overflow page |
FIL_PAGE_INDEX | 0x45BF | Index pages are what we call data pages |
4. FIL_PAGE_PREV And FIL_PAGE_NEXT |
||
Indicates the number of the previous and next pages of the page. Each page passesFIL_PAGE_PREV And FIL_PAGE_NEXT Form a bidirectional linked list. |
||
File Trailer
The basic unit of memory and disk interaction in mysql is the page. If an in-memory page is modified, it must be synchronized to disk at some point. If there is a problem with the system during the synchronization process, the page data on the disk may not be fully synchronized, that is, dirty pages may occur. To avoid this problem, mysql adds File Trailer to the end of each page to verify page integrity. File Trailer consists of 8 bytes:
- The first four bytes represent the checksum of the page and this part corresponds to the checksum in the File Header. That’s the simple answer
File Header
andFile Trailer
Both have checksums, and if they agree, the data page is complete. Otherwise, the data page isDirty pages
. - The last four bytes represent the log sequence position (LSN) corresponding to the last modification of the page.
Original is not easy, feel that the article is written well small partners, a praise 👍 to encourage it ~
Welcome to my open source project: a lightweight HTTP invocation framework for SpringBoot