Source: Original contribution

Author: Hua Jia She

Profile: Database technology enthusiast.

  • GreatSQL community original content may not be reproduced without authorization; for reprint permission, please contact the editors and cite the source.

Take a quick look at data file parsing

This is a brief introduction to parsing MySQL 8.0 data files (*.ibd) with Golang. Parsing is a process of deserialization, or decoding.

1. Why parse

There is already plenty of open source code implementing this decoding process. For example, undrop-for-innodb [1], written in C, supports MySQL 5.7 but has not been updated since.

py_innodb_page_info.py can also analyze data files (it reports page types), but it does not yet appear to be compatible with MySQL 8.0.

Alibaba has also implemented data file parsing in Java [2], with partial MySQL 8.0 compatibility. Nevertheless, working through the parsing process myself gave me a deeper understanding of data file encoding/decoding, recovery of deleted data, and so on.

The parsing process is tedious, but there is also something rewarding about it: taking the page structures described in the documentation or source code and building corresponding structs to parse them. The difficulty is not high, but building a more three-dimensional, comprehensive understanding of the storage layer, and filling in this piece of the MySQL knowledge puzzle, is very meaningful.

2. Use Golang to parse data files

Golang was chosen for this because it is much simpler than C++ and easy to use: there is little need to worry about garbage collection, the code is concise, and there are plenty of packages to draw on, so not much has to be implemented from scratch.

Of course, Golang has more attractive features, such as goroutines and channels, and excellent support for gRPC and Protocol Buffers, but none of these are needed when parsing MySQL data files in this article.

3. The file

3.1 Linux files

A file is an abstraction of an I/O device. For example, we do not need to know disk technology to process the contents of disk files by manipulating the file abstraction.

In the third edition of CSAPP (Computer Systems: A Programmer's Perspective), a file is nicely defined as a sequence of bytes, nothing more. Parsing a MySQL data file can therefore be regarded as parsing a byte sequence, and the parsing process can be viewed simply as the reverse of the encoding rules, namely decoding.

3.2 MySQL 8.0 Data files

MySQL reworked the data dictionary in version 8.0 to store metadata in InnoDB data dictionary tables. Table structure information is no longer kept redundantly at the Server layer: it is no longer stored in ibdata, and FRM files are gone.

Metadata is not only stored in the data dictionary tables but is also kept redundantly as Serialized Dictionary Information (SDI). For InnoDB, the SDI data is stored directly in the .ibd data file, which has pages of the SDI type for this purpose. Because metadata is stored in InnoDB engine tables, it is crash-safe, and combined with the DDL_LOG mechanism this achieves DDL atomicity.

Near the beginning of the file (around the fourth page, page number = 3, which in previous versions was the root page), the table structure information is stored redundantly. This is very similar to the Avro encoding format: a large file containing millions of records carries the schema (a description of the data's composition, including its elements, data types, and so on) at the beginning, and each subsequent record then carries no redundant information, saving storage and transmission costs.

When parsing data, the schema, i.e. the table structure information, must be obtained first to drive the subsequent decoding. So the first step in parsing the data file is to obtain the table structure (schema) from the SDI page of the data file.

To simplify the parsing process, the database defaults to the UTF-8 character set, that is, data files are encoded in UTF-8. The innodb_file_per_table parameter is enabled, so each table is stored in its own tablespace (*.ibd).

4. Parse the file

Parsing is not a very complicated process. From the program’s point of view, first open the file, then parse the file, and finally close the file.

However, the process can run into other problems, such as different encoding formats, complex data types, and so on, which this article simply chooses to avoid.

Golang provides os.OpenFile() to open the file (read-only mode is recommended) and File.Close() to close it, leaving the parsing in between to be implemented ourselves.

4.1 Obtaining the table structure

Self-describing data such as JSON or XML carries redundant self-description, and in a large file too much redundancy is unwise. In MySQL data files, data growth is significant, so the self-describing part must be minimal: it cannot be that every row written also carries the table name and column names.

So MySQL only stores table structure information at the beginning of the file (in earlier versions, table structure information was stored in FRM files and redundant in the data dictionary).

This encoding saves storage space, but when a data file is opened for parsing, the output is a byte sequence that is obviously not human-readable. The schema (table structure information) is needed to interpret it into a human-readable form and achieve the goal of parsing.

If the database is available, the table structure can be obtained from it directly; with only a data file, the table structure has to be extracted from the SDI, which is a cumbersome process.

Fortunately, MySQL officially provides a dedicated tool: ibd2sdi. By default, ibd2sdi retrieves all column and index information and outputs it in JSON format.

4.2 Structures and sections

MySQL data files are stored in pages (16K by default). Using the B-tree's lookup properties, an index can quickly locate the right page; within a page, records are located in a way very similar to binary search.

Data files are also parsed page by page. InnoDB has about 31 page types; the related definitions can be found in the source file storage/innobase/include/fil0fil.h.

The pages that actually store user data are of type FIL_PAGE_INDEX (numeric code 17855, so they parse as page_type=17855). That does not make other page types useless: describing page usage within the file (FIL_PAGE_TYPE_FSP_HDR), allocating empty pages (FIL_PAGE_TYPE_ALLOCATED), the insert buffer (FIL_PAGE_IBUF_FREE_LIST), compressed pages (FIL_PAGE_COMPRESSED), and so on all need their own page types to store different information.

During parsing, different pages are abstracted with Golang structs, which resemble Java classes as abstractions of concrete things and are very similar to C structures. Each kind of data page is parsed with a corresponding struct, and only about six page types are parsed here to keep the process simple (it is, after all, a learning exercise).

Structs are used not only for whole pages but also for parts within a page: for example, the 38-byte File Header [3] at the beginning of each data page is likewise parsed through a struct.

Section 3 said a file is a sequence of bytes. An InnoDB data file is large; as a whole it is one long byte array that is inconvenient to parse at once, and the data file may even be larger than memory and impossible to load fully. So it must be read step by step, that is, cut up. The cutting follows the default page size: the data file is sliced into 16K fragments (a data file is always an integer multiple of the page size).

In Golang, the data structure for such an array fragment is called a slice. It is reminiscent of breakfast bread: a whole loaf is hard to spread jam on, so it is cut into slices that can be handled one at a time. The parsing code frequently uses these two data structures, structs and slices, to parse byte sequences.

4.3 Pages

4.3.1 Page structure

In data files, pages are stored as byte sequences of 16K, i.e. 16384 bytes (the default, controlled by the innodb_page_size parameter). A data file contains many data pages. In memory they are organized as a B-tree: each node of the B-tree is a page, and non-leaf nodes point to the nodes (pages) at the next level through pointers.

When a transaction places a logical lock on a record, a latch is also taken on the B-tree to prevent the page (and possibly its parent, during page splits and merges) from being modified by other threads. A standard INDEX page contains seven parts, structured as in the following table [4].

Name Size Remarks
File Header 38 General information about the page
Page Header 56 Information specific to this type of data page
Infimum/Supremum 26 Two virtual row records, the minimum and the maximum
User Records (variable) The row records actually stored
Free Space (variable) Unused space in the page
Page Directory (variable) Relative positions of some records in the page
File Trailer 8 A 4-byte checksum and the 4-byte low 32 bits of the LSN

Of these, the 38-byte File Header is present in almost all types of pages. Its structure consists of eight parts, which are as follows:

Name Size Remarks
FIL_PAGE_SPACE 4 ID of the space the page belongs to
FIL_PAGE_OFFSET 4 ordinal page number from start of space
FIL_PAGE_PREV 4 offset of previous page in key order
FIL_PAGE_NEXT 4 offset of next page in key order
FIL_PAGE_LSN 8 log serial number of page’s latest log record
FIL_PAGE_TYPE 2 current defined page type
FIL_PAGE_FILE_FLUSH_LSN 8 log serial number
FIL_PAGE_ARCH_LOG_NO 4 archived log file number

The 56-byte Page Header contains 14 parts, which are structured as follows:

Name Size Remarks
PAGE_N_DIR_SLOTS 2 number of directory slots
PAGE_HEAP_TOP 2 record pointer to first record in heap
PAGE_N_HEAP 2 number of heap records; initial value = 2
PAGE_FREE 2 record pointer to first free record
PAGE_GARBAGE 2 number of bytes in deleted records
PAGE_LAST_INSERT 2 record pointer to the last inserted record
PAGE_DIRECTION 2 either PAGE_LEFT, PAGE_RIGHT, or PAGE_NO_DIRECTION
PAGE_N_DIRECTION 2 number of consecutive inserts in the same direction
PAGE_N_RECS 2 number of user records
PAGE_MAX_TRX_ID 8 the highest ID of a transaction
PAGE_LEVEL 2 level within the index (0 for a leaf page)
PAGE_INDEX_ID 8 identifier of the index the page belongs to
PAGE_BTR_SEG_LEAF 10 file segment header for the leaf pages in a B-tree
PAGE_BTR_SEG_TOP 10 file segment header for the non-leaf pages in a B-tree

The above describes the structure of INDEX-type data pages, not the other page types. Parts such as the File Header and Page Header, as well as the other page types, have structs in the code describing their data structures.

Besides the official documentation, InnoDB Diagrams, a project maintained by Jeremy Cole on GitHub, provides more information on data page structure [5]. It may be based more on MySQL 5.7, but it is still a good reference.

4.3.2 Page types

InnoDB has about 30 page types, and only 6 of them are parsed here; the main goal is to parse the data in FIL_PAGE_INDEX pages. Each page type is parsed differently.

  • FIL_PAGE_TYPE_FSP_HDR

Used to allocate and manage extents and pages, including how many pages in the tablespace are currently used and allocated. It also stores 256 XDES entries, one per extent; when those 256 extents are used up, pages of type FIL_PAGE_TYPE_XDES are appended

  • FIL_PAGE_IBUF_BITMAP

Tracks the change buffer for the pages that follow, using four bits to describe the change buffer state of each page

  • FIL_PAGE_INODE

Segment inodes manage the segments in the data file; each index occupies 2 segments, for the leaf and non-leaf nodes respectively. Each inode page can store FSP_SEG_INODES_PER_PAGE (default 85) inode entries (segments). FIL_PAGE_INODE pages manage segments, and segment information in turn manages extent information

  • FIL_PAGE_SDI

Stores Serialized Dictionary Information (SDI), i.e. some data dictionary information for this tablespace

  • FIL_PAGE_INDEX

Pages of this type store the data needed to construct the B-tree in memory; each node of the B-tree corresponds to such a page. Its structure, including the File Header and Page Header sections described above, is documented in detail.

  • FIL_PAGE_TYPE_ALLOCATED

Pages that have been allocated but not yet used are empty pages.

4.4 Decoding

4.4.1 Record parsing

Encoding and decoding are often combined, but since we have the data file by default, we won’t cover the encoding process.

Decoding in this article means turning a byte sequence into a human-readable string, i.e. back into the data as it was stored in the database and written by the business.

Programming languages usually provide encode and decode methods for these operations. In this analysis, the decoding process is implemented by hand after consulting the MySQL source code.

Essentially, based on the byte length of the stored value, each byte is shifted left (shifting left by n bits multiplies by 2 to the n, with the shift amount determined by the byte's position), and the bytes are then OR-ed together.

unittest/gunit/innodb/lob/mach0data.h shows the process of writing and reading four consecutive bytes, i.e. mach_write_to_4() and mach_read_from_4(); the bit-shift operations are visible there.

Note: When the source code reads byte data, it converts the data to type ULint, which is an unsigned type.

inline void mach_write_to_4(byte *b, ulint n) {
  b[0] = (byte)(n >> 24);
  b[1] = (byte)(n >> 16);
  b[2] = (byte)(n >> 8);
  b[3] = (byte)n;
}

/** The following function is used to fetch data from 4 consecutive
bytes. The most significant byte is at the lowest address.
@param[in] b pointer to four bytes.
@return ulint integer */
inline ulint mach_read_from_4(const byte *b) {
  return (((ulint)(b[0]) << 24) | ((ulint)(b[1]) << 16) | ((ulint)(b[2]) << 8) |
          (ulint)(b[3]));
}

Each byte is shifted right when writing and left when reading. This touches on endianness (byte order); interested readers can study it further, and it will not be covered in depth here.

4.4.2 Row record format

The previous section describes the byte resolution process for specific records (corresponding to specific values of columns in the database).

How records are located (in the User Records section of an INDEX data page) depends not only on the page that holds the data, but also on the setting of the innodb_default_row_format parameter.

Since MySQL 5.7.9, the default value of this parameter is DYNAMIC.

Before that, the default row format was COMPACT. InnoDB uses the flags (4 bytes long) of the FSP page (the first page of the data file, FIL_PAGE_TYPE_FSP_HDR) to determine various attributes of the tablespace, such as whether it is compressed, encrypted, shared, temporary, and so on.

The row format type is determined by the FSP_FLAGS_WIDTH_ATOMIC_BLOBS attribute within the flags, which is the sixth of the 32 bits in the 4-byte flags value.

To reduce program complexity, the default row format DYNAMIC is assumed (see storage/innobase/include/fsp0types.h). The COMPACT row format is laid out as follows:

variable-length field length list | NULL value flags | record header | col1 data | col2 data | ... | coln data

The first three parts form the record's extra data; the column values form the record's actual data.

In the record header, 5 bytes mark the record, including a delete_mask flag that marks whether the record has been deleted. For a more detailed explanation of the row record format, see the Alibaba kernel monthly report: https://www.bookstack.cn/read/aliyun-rds-core/24bcb047feb68f13.md. Every record stores data according to the format above.

DYNAMIC is similar to COMPACT; the difference lies in the handling of row overflow. With the DYNAMIC and COMPRESSED formats, an overly large field value is stored entirely in overflow pages, with a 20-byte pointer kept in the original page, instead of storing a 768-byte prefix of the field value inline as COMPACT does.

Large field types such as BLOB are not involved in this parse, so the parsing here effectively follows the COMPACT layout.

4.5 Writing Code

My coding skill is limited, and so is this introduction to the code; the full code address is given at the end of the article. Here is a brief walkthrough of the key points.

4.5.1 Opening the File

It is advisable to open files in read-only mode and to perform simple validation, such as checking the file size. Open the file:

file, err := os.OpenFile(filePath, os.O_RDONLY, os.ModePerm)

4.5.2 Read by page

Since data files are byte sequences, any number of bytes could be read into memory at a time (memory permitting), but MySQL data files are managed in pages, so reads into memory are done page by page.

In a MySQL instance, the root page of the B-tree and its second or even third level are resident in memory most of the time. With a bigint primary key, each directory entry in a non-leaf node is an 8-byte key plus a 6-byte pointer, so each page can hold 16384/(8+6) ≈ 1170 entries, i.e. a fan-out of 1170 per node. The sizes of the first few B-tree levels are then: level 0: 16,384 bytes; level 1: 16384 × 1170 = 19,169,280 bytes; level 2: 16384 × 1170 × 1170 = 22,428,057,600 bytes.

So the first three levels hold about 20GB, which is not difficult to keep in memory, and the first two levels are only tens of MB. A B-tree four levels high can store tens of millions of rows.

The reads that actually need to go to disk are mostly for pages at level 2 and below; since the in-memory intermediate pages are traversed quickly, disk operations are reduced and lookups are efficient.

In this case, pages are read sequentially; they could of course be read concurrently, in which case the order of the parsed output would not be guaranteed.

// on each page read, j increments by 16384
_, err := file.Seek(j, io.SeekStart)
// mysql_def.UNIV_PAGE_SIZE = 16384; allocate a buffer for one page
b := make([]byte, mysql_def.UNIV_PAGE_SIZE)
n, err := file.Read(b)

4.5.3 Structs and slices

The data read is placed into an array. In Golang, slices are built on top of dynamic arrays, and the parsing position changes as parsing progresses, so a corresponding struct is constructed:

type Input struct {
 Data     []byte
 Position int64
}
var IN Input

The Data element of the Input structure is used to hold the sequence of bytes read from the file, with a fixed length of 16384.

The Position element marks the parsing position and increases as parsing progresses. After reading from the file, the byte sequence is handed to IN, the declared variable of type Input.

// the parsing position starts at 0
slice.IN.Position = 0
// the array holding the data page's byte sequence is assigned to the struct's Data element
slice.IN.Data = b

4.5.4 Construct page structures and parse data

MySQL data files contain different types of pages that contain different parts, such as File Header, File Trailer, Page Header, etc.

Creating a corresponding struct for each structure is somewhat similar to Java's object-oriented thinking, so that the code can be reused later whenever data at the same structural location is parsed.

Here, File Header is taken as an example and its composition is introduced in Section 4.3.1. It contains eight parts with a total length of 38 bytes. The corresponding structure is as follows:

type FilHeader struct {
 PageSpace        int64
 PageOffset       int64
 PagePrev         int64
 PageNext         int64
 PageLsn          int64
 PageType         string
 PageFileFlushLsn int64
 PageArchLogNo    int64
}

Because every page begins with a File Header, its 38 bytes are always parsed from byte 0 through position 37. The FileHeader parsing looks like this:

func FileHeaderReadSlice() FilHeader {
	var fh FilHeader
	fh.PageSpace = slice.ReadUnsignedInt()
	fh.PageOffset = slice.ReadUnsignedInt()
	fh.PagePrev = slice.IfUndefined(slice.ReadUnsignedInt())
	fh.PageNext = slice.IfUndefined(slice.ReadUnsignedInt())
	fh.PageLsn = slice.ReadLongInt64()
	fh.PageType = mysql_def.ParsePageType(slice.ReadShortInt16())
	fh.PageFileFlushLsn = slice.ReadLongInt64()
	fh.PageArchLogNo = slice.ReadUnsignedInt()
	return fh
}

In the code

fh.PageSpace = slice.ReadUnsignedInt()

reads four bytes from the byte sequence and assigns the parsed value to PageSpace (the space ID of the page). After parsing, Position is incremented by 4 (an int is stored in 4 bytes).

The parsing process is a reference to the MySQL source code, as described in section 4.4.1. For example, 4 bytes Int is parsed as follows:

func decodeInt32(in Input) int64 {
 data := in.Data
 index := in.Position
 checkPositionIndexes(index, index+mysql_def.SIZE_OF_INT)
 return (int64(int8(data[index+3])) & 0xff) | (int64(int8(data[index+2]))&0xff)<<8 | (int64(int8(data[index+1]))&0xff)<<16 | (int64(int8(data[index]))&0xff)<<24
}

In this order, the eight parts of the File Header are parsed and the necessary information (for example, PageType) is retrieved. Subsequent parsing dispatches to different structs based on the PageType to complete the data parsing.

What differs is only the byte lengths being parsed; the lengths of the various page structures, and of the structures within pages, are given in the MySQL documentation or source code. The User Records part of an index page is special in that it is controlled by the row_format parameter. Space is limited, so only the File Header parsing process is introduced here.

The code currently completes page-type parsing; record parsing within index pages is still in progress, and interested readers can stay tuned.

5. Additional notes

Admittedly, the parsing process is simplified in many places: only a limited number of page types are parsed, the encoding is limited to UTF-8, and only the int and varchar data types are handled.

There is no validation of the file; it is simply assumed to be a normal data file. This is a brief piece of parsing code that will continue to be refined.

Also, concurrency is not used in the parsing process; it could be introduced later to speed things up. For example, where the sliced data file is currently parsed sequentially, Golang's goroutines could accelerate parsing by scheduling different pages onto different goroutines.

Since most pages in a data file are likely index pages, concurrency by page type alone may not be enough, so a worker pool could be introduced to schedule batches of index pages onto different workers. In addition, code to parse deleted data will be added later. When data is deleted by mistake, restoring it with backups plus binlog takes time and effort.

There is an urgent need for a flashback tool, similar to Oracle's, to quickly flash data back to its state before modification.

Among the existing logs, update events in the binlog record both the before and after images, which makes it possible to parse binlog events and reverse the operations. Based on this feature of the binlog, the open source community has implemented flashback tools, such as binlog2sql, implemented in Python, and MyFlash, developed by the Meituan DBA team, which can flash back DML as well as some DDL.

Besides the binlog, the redo log also records data changes, but because redo records change vectors, it may contain uncommitted transactions; and the undo log cannot be relied on either, because background threads purge it. So flashback cannot be done with redo alone.

To restore deleted data, the delete-mark mechanism in the data file can be exploited. The drawback is that when data is deleted by mistake, the table must be locked immediately to prevent further operations from reusing the space marked for deletion.

6. Afterword.

Code address: https://gitee.com/huajiashe_byte/ibdreader. I hope this article promotes the exchange of knowledge and improves understanding of database file storage. Given my limited level, errors are inevitable; corrections are welcome.

Annotations

Note 1: undrop-for-innodb:

https://github.com/twindb/undrop-for-innodb

Alibaba has a dedicated article about this tool:

https://www.bookstack.cn/read/aliyun-rds-core/2dd363c8cc64171b.md

Note 2:

https://github.com/alibaba/innodb-java-reader

Note 3: Reference:

https://dev.mysql.com/doc/internals/en/innodb-fil-header.html

Note 4: Details of this section can be found on the official website:

https://dev.mysql.com/doc/internals/en/innodb-page-overview.html

Note 5: Address is:

https://github.com/jeremycole/innodb_diagrams

Enjoy GreatSQL 🙂
