MySQL supports multiple storage engines, such as InnoDB, MyISAM, and Memory. Different storage engines are generally developed by different teams to provide different features. Today, OLTP services on MySQL generally use the InnoDB engine, which is also the default table engine.

To explain how the InnoDB engine works, we must first understand InnoDB's storage structure, on top of which InnoDB transactions are implemented.

Let's start by looking at how a row of an InnoDB table is stored. InnoDB is a persistent storage engine, meaning data lives on disk. But reading, writing, and processing data happens in memory, so data must first be read from disk into memory. How should these reads work? Reading one record at a time would be inefficient, because each read costs a disk seek. To reduce the number of seeks, we read data in chunks, so the data needed by the next request is likely already in memory rather than requiring another disk read. Based on this idea, InnoDB divides a table's data into pages, which are linked together by B-tree indexes. The default page size is 16384 bytes, i.e. 16KB (innodb_page_size).

This B-tree index is a clustered index: if a table has a primary key, the primary key index is the clustered index. As described above, the nodes of this index contain entire rows, all columns included (the pages we just mentioned). The nodes of secondary indexes simply hold primary key values as pointers to the rows.

For large fields, such as those of TEXT type, storing the data directly on the clustered index would make the nodes too large: a single row would force many pages to be read at once, which is inefficient (for example, when the query doesn't even touch the TEXT column). So InnoDB generally prefers to store long variable-length fields elsewhere. Exactly how they are stored depends on the InnoDB row format. There are four row formats: Compact, Redundant, Dynamic, and Compressed.

We can specify the row format in the statement creating or modifying the table:

CREATE TABLE table_name (...) ROW_FORMAT=<row format>;
ALTER TABLE table_name ROW_FORMAT=<row format>;

Compact row format storage

Create a table whose columns cover almost all of the basic data types. Types such as geometry and timestamp are based on double and bigint underneath, and text, JSON, and blob are not stored together with the row data, so they are left out here.

create table record_test_1 (
	id bigint,
	score double,
	name char(4),
	content varchar(8),
	extra varchar(16)
)row_format=compact;

Insert the following records:

INSERT INTO `record_test_1`(`id`, `score`, `name`, `content`, `extra`) VALUES (1, 78.5, 'hash', 'wodetian', 'nidetiantadetian');
INSERT INTO `record_test_1`(`id`, `score`, `name`, `content`, `extra`) VALUES (65536, 17983.9812, 'zhx', 'shin', 'nosuke');
INSERT INTO `record_test_1`(`id`, `score`, `name`, `content`, `extra`) VALUES (NULL, -669.996, 'aa', NULL, NULL);
INSERT INTO `record_test_1`(`id`, `score`, `name`, `content`, `extra`) VALUES (2048, NULL, NULL, 'c', 'jun');


The table now contains:

+-------+------------+------+----------+------------------+
| id    | score      | name | content  | extra            |
+-------+------------+------+----------+------------------+
| 1     | 78.5       | hash | wodetian | nidetiantadetian |
| 65536 | 17983.9812 | zhx  | shin     | nosuke           |
| NULL  | -669.996   | aa   | NULL     | NULL             |
| 2048  | NULL       | NULL | c        | jun              |
+-------+------------+------+----------+------------------+

Look at the underlying storage file, record_test_1.ibd, opened with a hexadecimal editor. I'm using Notepad++ and its HEX-Editor plugin. Searching for the column values we inserted locates the row data within the file; the four records are laid out one after another in the same format.

Here we directly give the meaning of these bytes, so you can get an intuitive feel for the layout:

First row:
Variable-length field length list: 10 08
NULL value list: 00
Record header: 00 00 10 00 47
Hidden column DB_ROW_ID: 00 00 00 00 08 0C
Hidden column DB_TRX_ID: 00 00 00 03 C9 4D
Hidden column DB_ROLL_PTR: B9 00 00 01 2D 01 10
Column id(1): 80 00 00 00 00 00 00 01
Column score(78.5): 00 00 00 00 00 A0 53 40
Column name(hash): 68 61 73 68
Column content(wodetian): 77 6F 64 65 74 69 61 6E
Column extra(nidetiantadetian): 6E 69 64 65 74 69 61 6E 74 61 64 65 74 69 61 6E

Second row:
Variable-length field length list: 06 04
NULL value list: 00
Record header: 00 00 18 00 37
Hidden column DB_ROW_ID: 00 00 00 00 08 0D
Hidden column DB_TRX_ID: 00 00 00 03 C9 4E
Hidden column DB_ROLL_PTR: BA 00 00 01 2F 01 10
Column id(65536): 80 00 00 00 00 01 00 00
Column score(17983.9812): B5 15 FB CB FE 8F D1 40
Column name(zhx): 7A 68 78 20
Column content(shin): 73 68 69 6E
Column extra(nosuke): 6E 6F 73 75 6B 65

Third row (both variable-length columns are NULL, so there is no variable-length field length list):
NULL value list: 19
Record header: 00 00 20 00 27
Hidden column DB_ROW_ID: 00 00 00 00 08 0E
Hidden column DB_TRX_ID: 00 00 00 03 C9 51
Hidden column DB_ROLL_PTR: BC 00 00 01 33 01 10
Column score(-669.996): 87 16 D9 CE F7 EF 84 C0
Column name(aa): 61 61 20 20

Fourth row:
Variable-length field length list: 03 01
NULL value list: 06
Record header: 00 00 28 FF 4B
Hidden column DB_ROW_ID: 00 00 00 00 08 0F
Hidden column DB_TRX_ID: 00 00 00 03 C9 54
Hidden column DB_ROLL_PTR: BE 00 00 01 3D 01 10
Column id(2048): 80 00 00 00 00 00 08 00
Column content(c): 63
Column extra(jun): 6A 75 6E

As you can see, in the Compact row format, an InnoDB record consists of a variable-length field length list, a NULL value list, a record header, hidden columns, and the column data. Let's go through each part.

Compact Row format storage – Variable-length field length list

For variable-length fields such as varchar, varbinary, text, blob, and JSON, the exact number of bytes each value occupies needs to be stored. This eliminates the need for delimiters between column values: MySQL can tell exactly where one column's data ends and the next begins. A Compact-format record therefore begins with a variable-length field length list, which contains the length of every variable-length field whose value is not NULL, in reverse column order.

For example, the first data above:

+----+-------+------+----------+------------------+
| id | score | name | content  | extra            |
+----+-------+------+----------+------------------+
| 1  | 78.5  | hash | wodetian | nidetiantadetian |
+----+-------+------+----------+------------------+

There are two non-NULL variable-length fields, content and extra, with lengths 8 and 16, which in hexadecimal are 0x08 and 0x10. In reverse order: 10 08.

That covers the short case, where the length fits in one byte. If a variable-length column holds a large number of bytes, its length may need two bytes. So when is one byte used, and when two?

Let’s add a column to this table to test it:

alter table `record_test_1` 
add column `large_content` varchar(1024) null after `extra`;

There is no change in the row data.

  • If (the maximum length in bytes of a character in the column's character set) * (the maximum number of characters of the field, i.e. the n in varchar(n), which for large_content here is 1024) < 255, one byte is always used. The character set here is latin1, so a character takes at most 1 byte; for large_content, 1 * 1024 exceeds 255.
  • If it exceeds 255, then:
    • If the value actually occupies fewer than 128 bytes, one byte is used
    • If the value actually occupies 128 bytes or more, two bytes are used

Question 1: why is 128 the dividing line? A single byte can represent lengths up to 255, but MySQL needs a way to distinguish whether one byte or two hold the length, so it reserves the highest bit as a flag: if that bit is 1, the length occupies two bytes; otherwise, one byte. For example, 01111111 means a length of 127, while a length of 128 needs two bytes. Note that MySQL stores the length in little-endian fashion, lowest byte first: 128 is 10000000 10000000, and 129 is 10000001 10000000. With one bit used as the flag, the maximum length this scheme can represent is 2^15 - 1 = 32767 bytes, roughly 32KB.
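The length encoding just described can be sketched in a few lines of Java. This is a simplified illustration of the scheme above, not InnoDB's actual source (the real format also reserves a bit to flag columns stored in overflow pages); the class and method names are made up for the example.

```java
public class VarlenLength {
    // Encode a column's byte length as it appears in the row (lowest byte
    // first). One byte for lengths up to 127; two bytes, with the high bit
    // of the flag byte set, otherwise.
    static int[] encode(int len) {
        if (len < 128) {
            return new int[]{len};
        }
        int low = len & 0xFF;          // lowest 8 bits, stored first
        int high = 0x80 | (len >> 8);  // remaining bits, high bit flags "two bytes"
        return new int[]{low, high};
    }

    public static void main(String[] args) {
        // 127 fits in one byte; 128 and 130 need two.
        System.out.println(java.util.Arrays.toString(encode(127))); // [127]
        System.out.println(java.util.Arrays.toString(encode(128))); // [128, 128]
        System.out.println(java.util.Arrays.toString(encode(130))); // [130, 128]
    }
}
```

Encoding 130 gives the bytes 0x82 0x80, exactly the pair we will see at the head of a length list later on.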

Question 2: what if two bytes are not enough to represent the length? InnoDB's default page size is 16KB. For a field holding a very large number of bytes, such that the record cannot fit in a single page, InnoDB stores part of the data in so-called overflow pages, and the variable-length field length list only records the length that remains in the page, which always fits in two bytes. We'll come back to this overflow mechanism when discussing the TEXT type.

Next, populate the large_content field for the first row, and set the new field to an empty string for the second row:

update `record_test_1` set `large_content` = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz' where id = 1;
update `record_test_1` set `large_content` = '' where id = 65536;

View data:

We find that in the Compact row format, updating a variable-length field invalidates the original record and writes a new record at the end.

The first row's original record is discarded, and its record header changes, chiefly the deleted flag being set, which we'll come back to later. The new first row:

Variable-length field length list: 82 80 10 08
NULL value list: 00
Record header: 00 00 30 01 04
Hidden column DB_ROW_ID: 00 00 00 00 08 0C
Hidden column DB_TRX_ID: 00 00 00 03 C9 6E
Hidden column DB_ROLL_PTR: 4F 00 00 01 89 1C
Column id(1): 80 00 00 00 00 00 00 01
Column score(78.5): 00 00 00 00 00 A0 53 40
Column name(hash): 68 61 73 68
Column content(wodetian): 77 6F 64 65 74 69 61 6E
Column extra(nidetiantadetian): 6E 69 64 65 74 69 61 6E 74 61 64 65 74 69 61 6E
Column large_content(abcdefghijklmnopqrstuvwxyz repeated five times): 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F 70 71 72 73 74 75 76 77 78 79 7A, repeated five times (130 bytes)

As you can see, the variable-length field length list is 82 80 10 08. large_content's character set uses 1 byte per character and the field allows up to 1024 characters, so its length always takes two bytes here. The first row's value has 130 characters, and 130 * 1 = 130 is 0x82 in hexadecimal. The two-byte encoding is the low byte 0x82 followed by the flag byte 0x80 (highest bit set); stored lowest byte first, that is exactly the 82 80 at the start of our variable-length field length list.

In the new second row, the variable-length field length list becomes 00 06 04, because large_content now actually occupies 0 bytes.

Compact row format storage – NULL value list

Some fields may be NULL, and storing the NULLs themselves would waste space, running counter to the idea of the Compact row format. Borrowing the idea of a bitmap, these fields can simply be marked, saving space. The NULL value list is exactly such a bitmap.

The NULL value list covers only nullable fields: a field declared NOT NULL gets no bit in the bitmap.

How many bytes does the NULL value list occupy? Each nullable field takes one bit, and the list grows a whole byte at a time: up to eight nullable fields need one byte, nine to sixteen need two, and so on. If every field in the table is declared NOT NULL, there is no NULL value list at all, occupying 0 bytes. Like the variable-length field length list, the bits are arranged in reverse column order.

+-------+------------+------+----------+------------------+------------------------------------------------------------------------------------------------------------------------------------+
| id    | score      | name | content  | extra            | large_content                                                                                                                      |
+-------+------------+------+----------+------------------+------------------------------------------------------------------------------------------------------------------------------------+
| 1     | 78.5       | hash | wodetian | nidetiantadetian | abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz |
| 65536 | 17983.9812 | zhx  | shin     | nosuke           |                                                                                                                                    |
| NULL  | -669.996   | aa   | NULL     | NULL             | NULL                                                                                                                               |
| 2048  | NULL       | NULL | c        | jun              | NULL                                                                                                                               |
+-------+------------+------+----------+------------------+------------------------------------------------------------------------------------------------------------------------------------+

For the first and second rows, no fields are NULL, so their NULL value list is 00. For the third row, the NULL fields are id, content, extra, and large_content, which are the first, fourth, fifth, and sixth columns respectively, so the NULL value list is 00111001 (0x39). Before the new column was added, its NULL columns id, content, and extra were the first, fourth, and fifth columns, making the list 00011001, i.e. the 0x19 seen in the dump. For the fourth row, the NULL fields are score, name, and large_content, the second, third, and sixth columns respectively, so the NULL value list is 00100110, i.e. 0x26; before the new column was added, the NULL fields were score and name, the second and third columns, giving 00000110, i.e. 0x06.
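As a sanity check, the bitmap values above can be reproduced with a small Java sketch. The class and method names are illustrative, not InnoDB internals; bit i corresponds to the i-th nullable column, so reading the byte left to right shows the columns in reverse order.

```java
public class NullList {
    // Build the NULL value list byte for a row; the array holds, in column
    // order, whether each nullable column's value is NULL.
    static int build(boolean[] nullableColumnIsNull) {
        int bitmap = 0;
        for (int i = 0; i < nullableColumnIsNull.length; i++) {
            if (nullableColumnIsNull[i]) {
                bitmap |= (1 << i);
            }
        }
        return bitmap;
    }

    public static void main(String[] args) {
        // Third row: id, content, extra, large_content are NULL
        // (columns 1, 4, 5, 6 of id, score, name, content, extra, large_content).
        boolean[] row3 = {true, false, false, true, true, true};
        System.out.println(Integer.toHexString(build(row3))); // 39

        // Fourth row: score, name, large_content are NULL (columns 2, 3, 6).
        boolean[] row4 = {false, true, true, false, false, true};
        System.out.println(Integer.toHexString(build(row4))); // 26
    }
}
```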

Compact row format storage – Record header information

For Compact row format storage, the record header is fixed at 5 bytes:

Name          Size (bits)  Description
(unused)      2            Not currently used
deleted_flag  1            Whether the record is deleted
min_rec_flag  1            Marks the minimum record of a non-leaf level of the B+ tree
n_owned       4            The number of records owned by this record's page directory slot
heap_no       13           The sequence number of the record in the page heap, i.e. its position
record_type   3            000 for an ordinary data record, 001 for a node pointer (non-leaf), 010 for the infimum pseudo-record, 011 for the supremum pseudo-record; 1xx are reserved
next_record   16           The relative position of the next record in the page

For the first and second rows before the update:

First row record header: 00 00 10 00 47
In binary: 00000000 00000000 00010000 00000000 01000111
Unused bits: 00, deleted_flag: 0, min_rec_flag: 0, n_owned: 0000, heap_no: 0000000000010 (2), record_type: 000, next_record: 00000000 01000111

Second row record header: 00 00 18 00 37
In binary: 00000000 00000000 00011000 00000000 00110111
Unused bits: 00, deleted_flag: 0, min_rec_flag: 0, n_owned: 0000, heap_no: 0000000000011 (3), record_type: 000, next_record: 00000000 00110111

For the original first and second rows after the update:

First row record header: 20 00 10 00 47
In binary: 00100000 00000000 00010000 00000000 01000111
Unused bits: 00, deleted_flag: 1, min_rec_flag: 0, n_owned: 0000, heap_no: 0000000000010 (2), record_type: 000, next_record: 00000000 01000111

Second row record header: 20 00 18 00 37
In binary: 00100000 00000000 00011000 00000000 00110111
Unused bits: 00, deleted_flag: 1, min_rec_flag: 0, n_owned: 0000, heap_no: 0000000000011 (3), record_type: 000, next_record: 00000000 00110111

As you can see, deleted_flag in the original records changes to 1, indicating the records have been deleted.

For the new first and second rows after the update:

New first row record header: 00 00 30 00 CA
In binary: 00000000 00000000 00110000 00000000 11001010
Unused bits: 00, deleted_flag: 0, min_rec_flag: 0, n_owned: 0000, heap_no: 0000000000110 (6), record_type: 000, next_record: 00000000 11001010

New second row record header: 00 00 38 FE E6
In binary: 00000000 00000000 00111000 11111110 11100110
Unused bits: 00, deleted_flag: 0, min_rec_flag: 0, n_owned: 0000, heap_no: 0000000000111 (7), record_type: 000, next_record: 11111110 11100110
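The bit-slicing above can be captured in a small Java sketch that splits a 5-byte Compact record header into its fields, following the layout table earlier. The class and field names are hypothetical, not InnoDB's.

```java
public class RecordHeader {
    final int deletedFlag, minRecFlag, nOwned, heapNo, recordType, nextRecord;

    // b holds the five header bytes, each as an unsigned value 0..255.
    RecordHeader(int[] b) {
        deletedFlag = (b[0] >> 5) & 0x1;          // after the 2 unused bits
        minRecFlag  = (b[0] >> 4) & 0x1;
        nOwned      = b[0] & 0xF;                 // low 4 bits of byte 0
        heapNo      = (b[1] << 5) | (b[2] >> 3);  // 13 bits across bytes 1-2
        recordType  = b[2] & 0x7;                 // low 3 bits of byte 2
        nextRecord  = (b[3] << 8) | b[4];         // 16-bit relative offset
    }

    public static void main(String[] args) {
        // First row's header before the update: 00 00 10 00 47
        RecordHeader h = new RecordHeader(new int[]{0x00, 0x00, 0x10, 0x00, 0x47});
        System.out.println(h.deletedFlag + " " + h.heapNo + " " + h.nextRecord);
        // prints: 0 2 71  (deleted_flag 0, heap_no 2, next_record 0x47)
    }
}
```

Feeding in the post-update header 20 00 10 00 47 yields deletedFlag = 1, matching the dump.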

The other fields of the record header will be explained in more detail when we encounter them later.

Compact row format storage – Hidden columns

There are three hidden columns:

Column name  Size (bytes)  Description
DB_ROW_ID    6             Row ID. This column is not always generated: InnoDB prefers a user-defined primary key; if there is none, it uses the first non-NULL UNIQUE key; only when neither exists does it add the hidden DB_ROW_ID column as the primary key
DB_TRX_ID    6             The ID of the transaction that last modified the record. When a new transaction starts, the system version number increments, and that number becomes the transaction ID; DB_TRX_ID is updated when the transaction writes the record
DB_ROLL_PTR  7             The undo log pointer, pointing to the undo log record for the current row; used to find the previous version of the data. If the transaction rolls back, the original value is read from the undo log and restored to the record

We won't expand on these columns in detail here; they will be covered in the sections on the clustered index and multi-version concurrency control.

Compact row format storage – Data column BIGINT storage

For bigint, a non-NULL value occupies 8 bytes: the first bit is the sign bit, and the remaining bits store the value, giving a range of -2^63 to 2^63 - 1, i.e. -9223372036854775808 to 9223372036854775807. A NULL value occupies no storage space.

A positive number is stored with its first bit set to 1; a negative number is stored as its two's complement with the first bit set to 0. Flipping the sign bit like this means values can be compared directly as unsigned byte strings.

For our four rows of data:

id(1): 80 00 00 00 00 00 00 01
id(65536): 80 00 00 00 00 01 00 00
id(NULL): not stored
id(2048): 80 00 00 00 00 00 08 00
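The sign-bit flip can be sketched in Java: a signed 64-bit value is written big-endian with its top bit inverted. This is a simplified sketch of the on-disk encoding described above, with an illustrative class name.

```java
import java.nio.ByteBuffer;

public class BigintStorage {
    // XOR with Long.MIN_VALUE flips only the sign bit; the result is then
    // written big-endian, as in the dumps above.
    static byte[] encode(long v) {
        return ByteBuffer.allocate(8).putLong(v ^ Long.MIN_VALUE).array();
    }

    public static void main(String[] args) {
        // id = 2048 should come out as 80 00 00 00 00 00 08 00
        for (byte b : encode(2048)) System.out.printf("%02X ", b);
        System.out.println();
    }
}
```

With this encoding, a negative value's first byte has its high bit clear, so all negatives sort before all positives under plain byte comparison.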

Other integer types are stored similarly: tinyint (1 byte), smallint (2 bytes), mediumint (3 bytes), int (4 bytes), and so on. The corresponding unsigned types (tinyint unsigned, smallint unsigned, mediumint unsigned, int unsigned, bigint unsigned) differ only in lacking the sign bit.

Also, a note on the role of the 20 in bigint(20): it only affects display and has nothing to do with the underlying storage. Integer fields have a zerofill attribute; when it is set (e.g. bigint(20) zerofill), numbers with fewer than 20 digits are left-padded with zeros to reach the specified length. The 20 is the display width.

Compact row format storage – Data column double storage

For columns that are not NULL, the IEEE 754 double-precision ("double format") bit layout is used:

  • The highest bit represents the sign bit (0x8000000000000000)
  • Bits 2 through 12 represent exponents (0x7FF0000000000000)
  • The remaining bits represent the fraction (mantissa) of the floating-point number (0x000FFFFFFFFFFFFF)

Meanwhile, InnoDB stores data files in little-endian format, so the bytes must be reversed to get a field's true value. As with integers, a NULL occupies no space.

Such as:

score(78.5): 00 00 00 00 00 A0 53 40
Reversed to big endian: 40 53 A0 00 00 00 00 00
In binary: 01000000 01010011 10100000 00000000 00000000 00000000 00000000 00000000
Sign bit: 0
Exponent: 10000000101 = 1029; subtracting the bias 1023 gives an actual exponent of 6
Fraction: 0.0011101000000000000000000000000000000000000000000000, which in decimal is 0.125 + 0.0625 + 0.03125 + 0.0078125 = 0.2265625; adding the implied leading 1 gives 1.2265625, and multiplying by 2^6 gives 1.2265625 * 64 = 78.5

The calculation is fairly involved; we can verify it with Java's Double.longBitsToDouble():

public static void main(String[] args) {
    System.out.println(Double.longBitsToDouble(0x4053a00000000000L));
}

The output is 78.5.

The similar type float has the same format at half the length.
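We can confirm this for float the same way, mirroring the double example above: 78.5 in single precision is sign 0, biased exponent 10000101 (133 - 127 = 6), and the same fraction 0011101, giving the 4-byte pattern 0x429D0000.

```java
public class FloatBits {
    public static void main(String[] args) {
        // Round-trip 78.5 through its single-precision bit pattern.
        System.out.println(Integer.toHexString(Float.floatToIntBits(78.5f))); // 429d0000
        System.out.println(Float.intBitsToFloat(0x429D0000));                 // 78.5
    }
}
```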

Compact row format storage – Data column CHAR storage

For fixed-length fields, the data is stored directly, with no length information needed; values shorter than the column length are padded. For char, the padding byte is 0x20, which is a space.

Such as:

First row name(hash): 68 61 73 68
Second row name(zhx): 7A 68 78 20
Third row name(aa): 61 61 20 20
Fourth row name(NULL): not stored

Similar binary types are padded with 0x00 instead.
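The padding rule can be sketched in Java with a hypothetical helper, assuming a single-byte character set such as latin1 (for multi-byte character sets the byte length per character differs):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharPad {
    // Pad a CHAR(len) value with 0x20 (space) up to the fixed byte length,
    // as the dumps above show for 'zhx' and 'aa'.
    static byte[] pad(String s, int len) {
        byte[] out = new byte[len];
        Arrays.fill(out, (byte) 0x20);
        byte[] raw = s.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(raw, 0, out, 0, raw.length);
        return out;
    }

    public static void main(String[] args) {
        // 'aa' in a CHAR(4) column: 61 61 20 20
        for (byte b : pad("aa", 4)) System.out.printf("%02X ", b);
        System.out.println();
    }
}
```

For a binary(4) column the same idea applies with 0x00 as the fill byte.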

Compact Row format storage – Data column VARCHAR storage

Because the variable-length field length list at the start of the record already records each value's length, varchar stores only the actual data, with no padding.

Because of this, updating a variable-length field typically marks the old record as deleted and appends a new record containing the updated row. This makes updates fast but increases storage fragmentation.