preface

To cut to the chase, how would you answer such a question?

10 million, 20 million, or billions of pieces of data? The exact answer is not important, and certainly not a fixed number, so today we are going to explore that question.

InnoDB is a universal storage engine that combines high reliability and high performance. It has many functions and features, and its architecture and working principle are complicated. To be clear, it’s not something that can be achieved in one or two blog posts, and it’s not the point of today.

Therefore, this article does not involve too much knowledge of the original reason, we put forward the problem at the beginning, through familiar with some basic concepts and the use of tools to verify, to do a good understanding of this problem.

File structure

As we know, the InnoDB engine supports transactions, so the data in tables must be stored on disk. If you create two tables under the test database, T1 and T2, you will find two files in the corresponding data directories.

[root@localhost test]# ls
db.opt  t1.frm  t1.ibd  t2.frm  t2.ibd
[root@localhost test]# pwd
/var/lib/mysql/test
Copy the code

FRM files are the table structure information, and IBD files are the data in the table.

Table structure information contains the metadata of the MySQL table (e.g., table definition), such as the name of the table, the number of columns in the table, and the data type of the column.

Ibd files store data in tables, such as rows and indexes. This is an important file, and it’s the focus of our study today.

Let’s say that the data in a MySQL table is stored on disk. The smallest unit on a disk is a sector, each of which can hold 512 bytes of data. The smallest unit in an operating system is a block. The smallest unit is 4kb.

On Windows, you can check this out by using fsutil fsinfo ntfsinfo C:.

C:\Windows\ System32 >fsutil fsINFO ntFSInfo C: NTFS Volume sn: 0x78F40B2CF40aEC66 NTFS version: 3.1 LFS version: 2.0 Number of sectors: 0x000000001BCB6FFF Total Number of Clusters: 0x0000000003796DFF Available Clusters: 0x0000000000A63A03 Total Number of Reserved clusters: 0x00000000000017C3 Number of bytes per sector: 512 Number of bytes per physical sector: 4096 Number of bytes per file cluster: 4096 Number of bytes per FileRecord segment: 1024 Number of file clusters per FileRecord segment: 0Copy the code

Depending on the format of the file system, you can view it using the following two commands.

xfs_growfs /dev/mapper/centos-root | grep bsize
tune2fs -l /dev/mapper/centos-root | grep Block
Copy the code

Let’s go back to the MySQL, InnoDB storage engine which also has a minimum unit of storage, called a Page, and the default size is 16kb.

Let’s create a new table T3 with no data in it and look at its IBD file.

1 mysql mysql 67 11月 30 20:59 db.opt -rw-r-----. 1 mysql mysql 12756 Frm-rw-r -----. 1 mysql mysql 13077839872 12月 7 21:37 t1. ibd-rw-r -----. 1 mysql mysql 8608 12月 7 21:43 T2.frm-rw-r -----. 1 mysql mysql 5947523072 12月 7 21:52 t2.ibd-rw-r -----. 1 mysql mysql 12756 12月 8 21:02 t3.frm -rw-r-----. 1 mysql mysql 98304 12月 8 21:02t3.ibdCopy the code

Not only t3, we see that the IBD file size of any table is always an integer multiple of 16K. It’s important to understand that MySQL loads data from disk and reads it by page. Even if you query for a single piece of data, it will read a page of 16K of data.

Clustering index

The data in a database table is stored in a page. How many records can be stored in a page?

Depending on how big the row is, if the row is 1K, you could theoretically fit 16 rows on a page.

Of course, when querying data, MySQL cannot traverse all pages, so there is index. InnoDB storage engine uses B+ tree to build index.

Clustered index is to construct a B+ tree according to the primary key of each table. The leaf node stores the whole row of record data, and the non-leaf node stores the key values and Pointers to data pages. At the same time, each data page is linked through a two-way linked list.

The diagram above shows the rough structure of a clustered index tree. It starts by sorting the data records by primary key and placing them in a separate page, with the data page in the next row. The non-leaf node above holds the primary key and a pointer to the page.

When we look up data by primary key, such as id=6, we look up data by B+ tree. It first finds the root page (page offset=3) and then, through binary search, locates the data with ID =6 on the page with pointer 5. Then load the data from the page offset=5.

Here, we need to understand two things:

In the figure above, the root node of the B+ tree (page offset=3) is fixed and does not change. As soon as the table creates a clustered index, its root node page number is recorded somewhere. In addition, the B+ tree index itself cannot directly find a specific record, but only know which page the record is on. The database will load the page into memory, and then locate the specific record through binary search.

Now we know that the smallest storage unit of InnoDB storage engine is pages. In the B+ tree index structure, pages can hold row by row of data (leaf nodes) or primary key + pointer (non-leaf nodes).

As mentioned above, if one row of data is 1K, then theoretically 16 data can fit on one page. How many primary keys and Pointers can fit on that page?

Let’s say our primary key ID is bigint, 8 bytes long, and the pointer size is set to 6 bytes in the InnoDB source code. This works out to 16384/14 = 1170, which means there are 1170 Pointers on a page.

A pointer to a record page, a page can hold 16 data, so a height of 2 B+ tree can hold 1170 * 16=18720 data. Similarly, a B+ tree of height 3 can store 1170 * 1170 * 16 = 21902400 records.

In theory, this is the case. In InnoDB storage engine, the height of B+ tree is usually 2-4 layers, which can store tens of millions of data. When looking up data, a page lookup represents one IO, so when we look up data through the primary key index, we actually need at most 2-4 IO.

So, is this really the case? Let’s move on.

The type of page

Before we start validating, we need to know not only about pages, but also that there is not just one type of page in the InnoDB engine. Common page types are:

  • Data page, B-tree Node;
  • Undo Page, undo Log Page;
  • System Page;
  • Transaction System Page;
  • Insert Buffer Bitmap, Insert Buffer Bitmap;
  • Insert Buffer Free List, Insert Buffer Free List;
  • Uncompressed BLOB Page;

Here we focus on b-Tree nodes, where our indexes and data reside. Since there are different page types, how do we know what the current page belongs to?

The data page is made up of seven parts, each with a different meaning.

  • File Header, fixed 38 bytes;
  • Page Header, fixed 56 bytes;
  • Infimum + supremum, fixed 26 bytes;
  • User Records, that is, row Records, size is not fixed;
  • Free Space, size is not fixed;
  • Page Directort, size is not fixed;
  • File Trailer, end of File information, fixed 8 bytes.

File Header is used to record some Header information of the page, occupying 38 bytes in total. In this header, we can get the offset value of the page in the table space and the type of the data page.

Next comes the Page Header, which records the status of the data Page and takes up 56 bytes. In this section, we can get two important pieces of information: the number of records in the page and the level of the current page in the index tree, where 0x00 represents the leaf node. For example, the leaf node in the clustered index holds the entire row of data, which is always at level 0.

validation

As we said earlier, ibD files are table data files. The file will grow as the data in the database table grows, but it will always be an integer multiple of 16K. The Page Level of the b-tree Node is +1, which represents the height of the current B+ tree.

To create a new table with a row size of 1K, set a few char(255) fields.

CREATE TABLE `t5` (
  `id` bigint(8) NOT NULL,
  `c1` char(245) NOT NULL DEFAULT '1',
  `c2` char(255) NOT NULL DEFAULT '1',
  `c3` char(255) NOT NULL DEFAULT '1',
  `c4` char(255) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Copy the code

Innodb_flush_log_at_trx_commit =0; innodb_flush_log_data =0;

BEGIN
    DECLARE i int DEFAULT 0;
    select ifnull(max(id),0) into i from t5;
    set i = i+1;
    WHILE i <= 100000 DO
        insert into t5(id)value(i);
        set i = i+1;
    END WHILE;
END
Copy the code

Innodbpageinfo.jar innodbPageInfo.jar is a utility class written by the author in Java code, which is used to output ibD file, page information.

-path Indicates the file path. -v Indicates whether to display page details. 0 yes 1 No.

Above we created the table t5. In the absence of a single piece of data, let’s look at the ibD file information.

[root@localhost innodbInfo]# java -jar innodbPageInfo.jar -path /var/lib/mysql/test/t5.ibd -v 0 page offset 00000000,page type <File Space Header> page offset 00000001,page type <Insert Buffer Bitmap> page offset 00000002,page type <File Segment inode> page offset 00000003,page type <B-tree Node>,page level <0000> page offset 00000000,page type <Freshly Allocated Page> Page offset 00000000, Page type <Freshly Allocated Page> Total number of records on the data Page :0 Total number of Page: 6 Insert Buffer Bitmap: 1 File Segment inode: 1 B-tree Node: 1 File Space Header: 1 Freshly Allocated Page: 2 [root@localhost innodbInfo]#Copy the code

T5 shows no data and its IBD file size is 98304, which means there are 6 pages in total. The fourth page (Page offset 3) is the data page, and page level equals 0, indicating that the page is a leaf node. Since there is no data yet, you can assume that the index of the B+ tree is only 1 level.

We then insert 10 pieces of data. The page level is still 0 and the height of the B+ tree is still 1, because a page can hold about 16 pieces of data of 1K size.

Page offset 00000003, Page type <B-tree Node>, Page level <0000> Total number of page records :10 Total number of Page: 6Copy the code

When we inserted 15 pieces of data, one Page would not be fit. The Freshly Allocated Page would become the data Page, and the original root Page (Page offset=3) would be upgraded to the one storing the directory items. Offset 04 and 05 become the data pages of the leaf node, so the entire B+ tree is now 2 in height.

page offset 00000003,page type <B-tree Node>,page level <0001> page offset 00000004,page type <B-tree Node>,page level <0000> Page offset 00000005, Page type <B-tree Node>, Page level <0000> Total number of pages :15 Total number of pages: 6Copy the code

Let’s go ahead and insert 10,000, and let’s look at the B+ tree height. Of course, now that we have more information, let’s write the output to a file.

java -jar innodbPageInfo.jar -path /var/lib/mysql/test/t5.ibd -v 0 > t5.txt

The partial results are as follows:

[root@localhost innodbInfo]# vim t5.txt page offset 00000003,page type <B-tree Node>,page level <0001> page offset 00000004,page type <B-tree Node>,page level <0000> page offset 00000005,page type <B-tree Node>,page level <0000> page Offset 00000000, Page type <Freshly Allocated Page> Total number of data page records :10000 Total number of Page: 1216 B-tree Node: 716Copy the code

As you can see, there are 716 data pages for 10,000 1K records, and the tree height displayed on the root page is still 2 levels.

Previously, we calculated that a two-layer B+ tree could theoretically store about 18,000 pieces of data. The author tested about 13,000 pieces of data, and the B+ tree would become a three-layer tree.

Page offset 00000003, Page Type <B-tree Node>, Page level <0002> Total number of data pages :13000 Total number of Page: 1472 B-tree Node: 933Copy the code

It’s not hard to see why, because you can’t just put data on each page.

First, each page has some fixed formats, such as file header, page header, file tail, etc. Our data is placed in the user record part;

Second, user records don’t just contain rows of data. Each row has other markers, such as delete, minimum record, number of records, position in the heap, type of record, relative position of the next record, and so on.

InnoDB will keep 1/16 of the page free for future index inserts or updates. If the primary key ids are not inserted sequentially, it may not be 1/16 and will occupy more free space.

In summary, we understand that one page will not contain all data. Therefore, it is perfectly normal for the actual measurement to be inconsistent with the theory, since the above theory does not exclude these terms.

Next, we insert another 10 million pieces of data, and now the IBD file is 11GB.

Offset 00000003, Page Type <B-tree Node>, Page level <0002> 10000000 Total number of Page: 725760 B-tree Node: 715059Copy the code

We can see that there are 710,000 pages of 10 million pieces of data, and the height of the B+ tree is still 3 layers, which means that the query efficiency of tens of thousands of pieces of data is basically the same as that of 10 million pieces of data. Select * from t5 where ID = 6548215; select * from t5 where ID = 6548215; , the query time is displayed in 0.010 seconds.

When will it be on the fourth floor? At around 13 million, a B+ tree increases its height to four stories.

When will it be on the fifth floor? The author did not test it out, because when 50 million pieces of data were inserted, the IBD data file size was already 55 GB, and the VIRTUAL machine was running out of space.

Page offset 00000003, Page Type <B-tree Node>, Page level <0003> Total number of records on data pages: 50,000 million B-tree Node: 3575286Copy the code

Even for 50 million pieces of data, the query time is millisecond by primary key ID.

In theory, a tree would have to be five stories tall to reach a billion to a billion lines of data. If you use this method to test the data of 5 stories, please leave a comment in the comments section, let me have a look.

Also, are your friends aware of a problem? In fact, the height of the B+ tree is not only affected by the data row, but also the length of the primary key ID. In our test above, the ID was of type Bigint (8). If we changed the ID to type int(4) without changing the length of other fields, the same tree height will hold more data because it can hold more Pointers on a single page.

CREATE TABLE `t6` (
  `id` int(4) NOT NULL,
  `c1` char(245) NOT NULL DEFAULT '1',
  `c2` char(255) NOT NULL DEFAULT '1',
  `c3` char(255) NOT NULL DEFAULT '1',
  `c4` char(255) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Copy the code

For the table t6, we insert 16,000 entries and output the page information.

Page offset 00000003, Page Type <B-tree Node>, Page level <0001> Total number of records on data pages :16000 B-tree Node: 1145Copy the code

Bigint (8) = 1; bigint(4) = 2; bigint(8) = 1; bigint(8) = 1; Although the number of B-tree nodes is still the same and does not change much, it does not affect the tree height.

Ok, see here, I believe friends have their own answers to the questions raised at the beginning, if you also try again, you may understand more deeply.

Why does MySQL index use B+ trees instead of other tree structures? Like a B tree?

In a nutshell, one of the reasons is that the height of a B+ tree is relatively stable, because its non-leaf nodes do not hold data, only key values and Pointers, a page can hold a lot of data. The size of the same line of data is 1KB, so it can only store 16 Pointers on a page at most. In the case of a large amount of data, the tree height will expand, resulting in a large number of IO times, and the query will become slow.

The source address

This article innodbPageInfo.jar code is the author reference MySQL technology insider (InnoDB storage engine) book toolkit, book author is written in Python, so I use Java to implement again.

Java version of the source CODE I put on GitHub: github.com/taoxun/inno…

Have finished packing the Jar version, can also download: pan.baidu.com/s/1IZVJRNUk… Extraction code: 5RNZ.

You can use this tool to take a look, you think the larger table, its B+ tree index exactly how many levels?

References:

Jiang Chengyao: Inside MySQL Technology :InnoDB Storage Engine

Tianya lacrimal wu: tianyalei.blog.csdn.net/article/det…

Fluttering red scarf: www.cnblogs.com/leefreeman/…

Official reference manual: MySQL dev.mysql.com/doc/refman/…