Basic concept

concept

An Index is logically a B+Tree. Everything look: www.slideshare.net/ovaistariq/…

Root Page

An index tree starts at a “root” page, whose location is fixed (and permanently stored in the InnoDB’s data dictionary) as a starting point for accessing the tree. The tree may be as small as the single root page, or as large as many millions of pages in a multi-level tree.

Since the root page is allocated when the index is first created, and that page number is stored in the data dictionary, the root page can never relocated or removed. Once the root page fills up, it will need to be split, forming a small tree of a root page plus two leaf pages.

However, the root page itself can’t actually be split, since it cannot be re-erected. Instead, a new, Vacant Page is allocated, the records in the root are moved there (the root is “raised” a level), and that new page is split into two. The root page then does not need to be split again until the level immediately Below it has enough pages that the root becomes full of child page Pointers (called “node Pointers”), which in practice often means several hundred to more than a thousand.

Leaf Page Non-Leaf Page

Pages are referred to as being “leaf” Pages or “non-leaf” Pages (also called “internal” or “node” Pages in some) contexts). Leaf pages contain actual row data. Non-leaf pages contain only pointers to other non-leaf pages, or to leaf pages. The tree is balanced, so all branches of the tree have the same depth.

Level

InnoDB Assigns each Page in the tree a “level” : leaf pages are assigned level 0, and the level increments going up the tree. The root page level is based on the depth of the tree. All pages that are Neither leaf pages nor the root page can also be called “internal” pages, if a distinction is important.

The sample

The data dictionary

FSP_DICT_HDR_PAGE_NO The 8th page of ibdata, which is used to store information about the data dictionary table. The Dict_Hdr Page structure is shown in the following table:

Data dictionary and loading

Blog.51cto.com/yanzongshua… Blog.51cto.com/yanzongshua…

dict_table_t ----dict_index_t --------unsigned space:32; / *! < space where the index tree is placed */ --------unsigned page:32; / *! < index tree root page number */Copy the code

If you get the Root Page, you can iterate through the Index.

The Root Page format is the same as the normal Page format, as shown in the example above.

Data format of non-LEAF nodes:

< Key, Pointer > inZhongmingmao. Me / 2017/05/12 /…Let’s take the first example 80 00 00 0A indicates 10(the highest bit is a sign bit). 00 00 00 04 indicates Page 4

80 00 00 1E indicates 30(the highest bit is a sign bit). 00 00 00 05 indicates Page 5

Between the pages to find

Physical organization

The next level of the tablespace is called Segment. Segments map to indexes in the database. In Innodb engine, each index (including clustered index) has two segments: the Segment for managing leaf nodes and the Segment for managing non-leaf nodes. Innodb internally uses inodes to describe segments (stored in Inode pages, the first Inode page in IBD is the third page of the IBD file).

The logical organization

Logically, an index is a B+ tree

B tree characteristics

  • All leaf nodes appear in the same layer.
  • The records inside the leaf node also form a one-way ordered linked list.
  • Pages of the same height are connected to a bidirectional linked list.
  • The key of a non-leaf node is the smallest key in the page to which its value points.
  • The root page information is stored in the data dictionary.

Pages to find

Record the form of organization in the page

  • The records in the page are connected to a one-way linked list in ascending order. The records in the data page are logically adjacent, but not necessarily physically adjacent.
  • Infimum Record (lower bound)/supremum Record (upper bound) : Two system records (with fixed in-page offsets) that are the top/bottom nodes of the in-page Record list.

directory slots

To improve the efficiency of retrieving data within a page, Innodb uses slot to manage records.

  • Slot refers to a Record in the page, which is the last Record managed by that slot.
  • Each slot occupies two bytes and stores the in-page offset of the Record.
  • Slots are stored in reverse order, which means that Infimum slots are always on the last two bytes, and the rest are stored in reverse order.
  • Owned in the record pointed to by slot indicates how many forward RECs belong to the slot.
  • N_owned for Infimum is always 1, n_owned for Supremum is in the range of [1,8], and n_owned for User Record is in the range of [4,8].

The order of the page lookups

The in-page search is divided into two steps

  • First determine the slot in which the record to be found resides
  • Determine the location of the record in slot

Slot lookup

Binary lookup is used for the search between slots. If the intermediate records in binary lookup are different from the records to be searched, the logic of binary lookup is simply used to move up and low. If the intermediate record of binary search is the same as the record to be searched, then determine how to move up and low according to the search mode, judge

// Get the boundary value of binary search low = 0; up = page_dir_get_n_slots(page) - 1; /* Perform binary search until the lower and upper limit directory slots come to the distance 1 of each other */ while (up - low > 1) { mid = (low + up) / 2; slot = page_dir_get_nth_slot(page, mid); Mid_rec = page_dir_SLOt_get_rec (slot); mid_rec = page_dir_slot_get_rec(slot); ut_pair_min(&cur_matched_fields, &cur_matched_bytes, low_matched_fields, low_matched_bytes, up_matched_fields, up_matched_bytes); Offsets = rec_get_offsets(mid_rec, index, offsets_, dtuple_get_n_fields_cmp(tuple), &heap); cmp = cmp_dtuple_rec_with_match_bytes( tuple, mid_rec, index, offsets, &cur_matched_fields, &cur_matched_bytes); /*@return the comparison result of dtuple and rec @retval 0 if dtuple is equal to rec @retval negative if dtuple is less  than rec @retval positive if dtuple is greater than rec */ if (cmp > 0) { low_slot_match: low = mid; low_matched_fields = cur_matched_fields; low_matched_bytes = cur_matched_bytes; } else if (cmp) { up_slot_match: up = mid; up_matched_fields = cur_matched_fields; up_matched_bytes = cur_matched_bytes; } else if (mode == PAGE_CUR_G || mode == PAGE_CUR_LE ) { goto low_slot_match; } else { goto up_slot_match; }}Copy the code

Slot internal lookup

  • Instead of binary search, the slot is traversed. First, the records inside the slot are organized by Pointers, and there are not many entries inside the slot.
  • The slot is the last record maintained by the slot, so the slot is traversed backwards from the previous slot.
  • The duplicate values are filtered out through two rules of access pattern and the boundary is finally found.
slot = page_dir_get_nth_slot(page, low); low_rec = page_dir_slot_get_rec(slot); slot = page_dir_get_nth_slot(page, up); up_rec = page_dir_slot_get_rec(slot); /* Perform linear search until the upper and lower records come to distance 1 of each other. */ while (page_rec_get_next_const(low_rec) ! = up_rec) { mid_rec = page_rec_get_next_const(low_rec); offsets = rec_get_offsets( mid_rec, index, offsets_, dtuple_get_n_fields_cmp(tuple), &heap); /*@return the comparison result of dtuple and rec @retval 0 if dtuple is equal to rec @retval negative if dtuple is less  than rec @retval positive if dtuple is greater than rec */ cmp = cmp_dtuple_rec_with_match_bytes( tuple, mid_rec, index, offsets, &cur_matched_fields, &cur_matched_bytes); if (cmp > 0) { low_rec_match: low_rec = mid_rec; low_matched_fields = cur_matched_fields; low_matched_bytes = cur_matched_bytes; } else if (cmp) { up_rec_match: up_rec = mid_rec; up_matched_fields = cur_matched_fields; up_matched_bytes = cur_matched_bytes; } else if (mode == PAGE_CUR_G || mode == PAGE_CUR_LE ) { if (! cmp && ! cur_matched_fields) { cur_matched_fields = dtuple_get_n_fields_cmp(tuple); } goto low_rec_match; } else { goto up_rec_match; } } /* Page cursor search modes; the values must be in this order! */ /* enum page_cur_mode_t { PAGE_CUR_UNSUPP = 0, PAGE_CUR_G = 1, PAGE_CUR_GE = 2, PAGE_CUR_L = 3, PAGE_CUR_LE = 4, PAGE_CUR_LE_OR_EXTENDS = 5, This is a search mode used in "column LIKE 'abc%' ORDER BY column DESC"; we have to find strings which are <= 'abc' or which extend it // These search mode is for search R-tree index. PAGE_CUR_CONTAIN = 7, PAGE_CUR_INTERSECT = 8, PAGE_CUR_WITHIN = 9, PAGE_CUR_DISJOINT = 10, PAGE_CUR_MBR_EQUAL = 11, PAGE_CUR_RTREE_INSERT = 12, PAGE_CUR_RTREE_LOCATE = 13, PAGE_CUR_RTREE_GET_FATHER = 14 }; */ if (mode <= PAGE_CUR_GE) { page_cur_position(up_rec, block, cursor); } else { page_cur_position(low_rec, block, cursor); } *iup_matched_fields = up_matched_fields; *iup_matched_bytes = up_matched_bytes; *ilow_matched_fields = low_matched_fields; *ilow_matched_bytes = low_matched_bytes;Copy the code

search mode

There are four search modes for a search, and these four search modes determine which record is finally located (see the page_cur_search_WITH_match function).

  • PAGE_CUR_G (>, >) : SELECT * WHERE column > 1
  • PAGE_CUR_GE (>=, greater than or equal to) : UPDATE/DELETE/SELECT * WHERE column = 1
  • PAGE_CUR_L (<, <) : SELECT * WHERE column < 1
  • PAGE_CUR_LE (<=, < or equal to) : INSERT…

Four search modes in various cases, finally find the effect of the record.

  • For an insert operation, records are always looked up through the pattern PAGE_CUR_LE, which results in the previous record to be inserted.
  • Query primary keys and unique indexes with the mode of choice PAGE_CUR_GE, not PAGE_CUR_LE.
  • For non-leaf nodes, the query mode is always less than or equal to the smallest key in the key child of the non-leaf node.
	switch (mode) {
	case PAGE_CUR_GE:
		page_mode = PAGE_CUR_L;
		break;
	case PAGE_CUR_G:
		page_mode = PAGE_CUR_LE;
		break;
	default:
		page_mode = mode;
		break;
	}
Copy the code

Legacy issues: Page reverse good trouble, record why not two-way list?

Concurrency control

Memory concurrency control

To maintain the consistency of memory structures, such as Dictionary Cache, Sync Array, TRX System, etc. InnoDB does not use the libraries provided by Glibc directly, but encapsulates two classes of its own:

  1. One is mutex, which implements serialized access to memory structures.
  2. One type is RW lock, which implements read/write blocking, read/write lock for concurrent access.

B+ tree concurrency control

The concurrency control of B+ tree is mainly divided into two aspects, one is the concurrency control of content in nodes, the other is the concurrency control of tree structure, such as the change of tree height, page splitting and so on.

In order to achieve the above two aspects of concurrency control.

Version 5.6

InnoDB maintains two types of RW locks for each index:

  1. Index level, used to protect the B-tree from damage.
  2. Block level: protects the internal structure of the block from being damaged. It applies only to leaf nodes. Non-leaf nodes are not locked.

Rw lock can be divided into S and X modes. If SMO is designed, add X lock to INDEX level RW lock. The advantage of this implementation is that the code implementation is very simple, but the disadvantage is also obvious because the SMO operation cannot be read, and the SMO operation may have IO operation, The jitter performance is very obvious Specific reference mysql.taobao.org/monthly/202…

Version 5.7

There are two major changes

  1. Rw-lock introduces sX mode to optimize blocking read problems.
  2. Block level RW-lock, non-leaf points are also locked.

Compatibility relationships among S, X, and SX modes are as follows:

/*
 LOCK COMPATIBILITY MATRIX
    S SX  X
 S  +  +  -
 SX +  -  -
 X  -  -  -
 */
Copy the code

The specific process of locking:

  1. If it is a query request
  • Btree index->lock S lock
  • Add S LOCK to all non-Leaf node pages along the btree path
  • LOCK leaft node page until the leaf node is found, and then open index-> LOCK

2. See the process for modifying a requestMysql.taobao.org/monthly/202…Does this page change the btree?

  • If not, then fine, after performing an X LOCK on a Leaf node, the data will be returned after modification
  • If yes, perform pessimistic insert and iterate over btree. In this case, add SX lock to index->lock.

Since sX lock has been added to btree, no lock is required for all btree pages in the search path, but the pages in the search process need to be saved. In the final stage, x lock is added for pages that may have structural changes on the search path. This ensures that the impact on the read operation is minimized during the search. Only in the final stage to determine the scope of the btree structure modification, the possible structural changes in the page and X lock, will have an impact. ## Code to achieve B tree search entry function is btr_cur_search_to_nth_level, its parameter latch_mode is divided into two parts, low bytes are the following basic operation mode:

/** Latching modes for btr_cur_search_to_nth_level(). */
enum btr_latch_mode {
	/** Search a record on a leaf page and S-latch it. */
	BTR_SEARCH_LEAF = RW_S_LATCH,
	/** (Prepare to) modify a record on a leaf page and X-latch it. */
	BTR_MODIFY_LEAF	= RW_X_LATCH,
	/** Obtain no latches. */
	BTR_NO_LATCHES = RW_NO_LATCH,
	/** Start modifying the entire B-tree. */
	BTR_MODIFY_TREE = 33,
	/** Continue modifying the entire B-tree. */
	BTR_CONT_MODIFY_TREE = 34,
	/** Search the previous record. */
	BTR_SEARCH_PREV = 35,
	/** Modify the previous record. */
	BTR_MODIFY_PREV = 36,
	/** Start searching the entire B-tree. */
	BTR_SEARCH_TREE = 37,
	/** Continue searching the entire B-tree. */
	BTR_CONT_SEARCH_TREE = 38
};
Copy the code

Each mode locking process can reference: zhuanlan.zhihu.com/p/164705538

Pessimism is written

Pessimistic writes need to iterate over B Tree locks because SMO is involved

Row_ins_clust_index_entry // here is BTR_MODIFY_LEAF ----row_ins_clust_index_entry_low(flags, BTR_MODIFY_LEAF, index, n_uniq, Entry, n_ext, THR, dup_chk_only) --------btr_pcur_open call btr_cur_search_to_nth_level query index tree, Move the cursor to the position of the record to be inserted ------------btr_cur_optimistic_insert // Optimistic insert // If the insert fails, traverse the B tree, BTR_MODIFY_TREE ----row_ins_clust_index_entry_low(FLAGS, BTR_MODIFY_TREE, index, n_UNIq, entry, n_ext, THR, Dup_chk_only) --------btr_pcur_open // Call btr_cur_search_to_nth_level to query index tree Move the cursor to the position where records to be inserted ----------- btr_cur_optimistic_INSERT // Optimistic insert ----------- btr_cur_pessimism insert //Copy the code

MTR

InnoDB’s Mini-Transaction (MTR for short) is the unit that guarantees atomic changes for several pages. An MTR contains several log records — mlogs, and each log record is for a page — mblock; Liuyangming. Tech / 05-2019 / Inn…

Why do YOU need MTR to guarantee atomic Page change?

First, Redolog is not idempotent, so operations on Page must be guaranteed to all succeed or all fail.

A single Page can have multiple RedoLOG’s to execute, let alone multiple pages.

How do you tell if a physical transaction is complete?

The problem is that there is a clever way to ensure that before committing, if the physical transaction is found to have a log, some special logs will be written at the end of the log. These special logs are a sign of the end of the physical transaction, so the special logs will be written at the end of the commit. If this flag exists at the end of the current batch of logs during redo, the logs are complete; otherwise, the logs are incomplete and will not be redone. Cloud.tencent.com/developer/a…

General implementation process of MTR

MMTR. Start () starts a mini transaction

Mtr_x_lock () creates an X lock on the space->latch. 2. Place space->latch into mtr_T ::m_impl:: Memo (so that the lock added before MTR can be released after mtr.mit ())

Mlog_write_ull writes data. This operation also has two steps: 1. Modify data directly on the page; 2. Write the redo log to MTR ::m_impl::m_log

Mtr.com MIT () to write the redo log + put locks, this operation will step on the contents of the m_log written to redo log file, and put in the lock zhuanlan.zhihu.com/p/56188735?…

After mtr_start, there is only one mtr_commit operation; Mtr_commit copies Mlogs and Mblocks (dirty pages) in MTR to logBuffer and flushlist, respectively. When a real transaction commits, all mlogs involved in that transaction are flushed so that individual atomic changes are persisted. Call the corresponding callback function according to the type (mtr_type_T) to restore the page.

Split and merge

TODO

The static. Kancloud. Cn/taobaomysql… Mysql.taobao.org/monthly/202… Zhuanlan.zhihu.com/p/164705538 liuyangming. Tech / 07-2019 / Inn… Mysql.taobao.org/monthly/201… Mysql.taobao.org/monthly/201… Mysql.taobao.org/monthly/201…

zhuanlan.zhihu.com/p/164728032

Liuyangming. Tech / 07-2019 / Inn…

Blog.csdn.net/yuanrxdu/ar…

www.jianshu.com/p/0cdd573a8…

zhuanlan.zhihu.com/p/265834112

zhuanlan.zhihu.com/p/164705538

Mysql.taobao.org/monthly/202…