MySQL data storage structure

InnoDB divides data into pages and uses the page as the basic unit of interaction between disk and memory. The default page size in InnoDB is 16 KB; that is, at least 16 KB is read from disk into memory at a time, and at least 16 KB is flushed to disk at a time. When a record holds too much data to fit on its page, the excess is stored on other pages, a phenomenon called row overflow.
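
The page size can be checked directly. A minimal sketch: the 16 KB figure is the default and can only be changed when the data directory is initialized, via innodb_page_size; the table below is purely hypothetical.

-- Check the InnoDB page size (16384 bytes = 16 KB by default).
SHOW VARIABLES LIKE 'innodb_page_size';

-- Hypothetical wide table: a row whose TEXT data is too large to fit in
-- its 16 KB page is subject to row overflow, with the excess stored on
-- separate overflow pages.
CREATE TABLE overflow_demo (
  id      INT PRIMARY KEY,
  big_col TEXT
) ENGINE = InnoDB;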

InnoDB has different types of pages designed for different purposes. The pages that store records are called data pages. The structure of a data page and the data writing process are shown below.

Buffer Pool

When the MySQL server starts up, it requests a large contiguous chunk of memory from the operating system, called the Buffer Pool. The control blocks corresponding to free cache pages are linked together as nodes in a linked list called the free list.

InnoDB builds a hash table with the tablespace ID + page number as the key and the cached page as the value. When a page needs to be accessed, the hash table is first checked by tablespace ID + page number: if the page is already cached, it is used directly; if not, a free page is taken from the free list and the corresponding page on disk is loaded into that cached page.
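
The Buffer Pool size and its free and dirty page counts can be inspected from SQL. A minimal sketch, assuming a MySQL version where information_schema.INNODB_BUFFER_POOL_STATS is available (5.6+):

-- Total Buffer Pool size in bytes (128 MB by default).
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Per-instance page counts: total pages, pages still on the free list,
-- data pages, and dirty (modified) pages.
SELECT POOL_ID,
       POOL_SIZE,
       FREE_BUFFERS,
       DATABASE_PAGES,
       MODIFIED_DATABASE_PAGES AS dirty_pages
FROM information_schema.INNODB_BUFFER_POOL_STATS;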

If we modify the data of a cached page in the Buffer Pool, it becomes inconsistent with the page on disk; such cached pages are called dirty pages. InnoDB maintains a linked list of dirty pages: the control block of every modified cached page is added to this list as a node. It is called the flush list because the cached pages on it will eventually be flushed to disk. So changes to a cached page are not synchronized to disk immediately, but at some point in the future.

The size of the Buffer Pool is limited. If the pages loaded into it exceed its capacity, the free list eventually has no free pages left, and some cached pages must be evicted.

For this purpose InnoDB maintains another linked list, the LRU list (Least Recently Used), which is used to evict cached pages according to the least-recently-used principle.

When are dirty pages flushed to disk?

A special background thread is responsible for flushing dirty pages to disk at regular intervals, so that user threads are not disturbed while handling normal requests. There are two main flush paths:

  • Flush some pages to disk from the cold data at the tail of the LRU list.

    The background thread periodically scans some pages from the tail of the LRU list; the number of pages to scan is specified by the system variable innodb_lru_scan_depth (see the sketch after this list). Any dirty pages found among them are flushed to disk. This way of flushing pages is called BUF_FLUSH_LRU.

  • Flush some pages from the Flush list to disk.

    Background threads also periodically flush pages from the Flush list to disk at a rate that depends on whether the system is busy at the time. This way of flushing the page is called BUF_FLUSH_LIST.

  • Sometimes the background thread is slow to flush dirty pages, so that when a user thread is ready to load a disk page into the Buffer Pool there are no available cached pages. In this case the user thread first checks whether there are unmodified pages at the tail of the LRU list that can be freed directly; if not, a dirty page at the tail of the LRU list has to be flushed to disk synchronously (interacting with disk is slow, which slows down the handling of the user request). Flushing a single page to disk like this is called BUF_FLUSH_SINGLE_PAGE.
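
The knobs mentioned above can be inspected and tuned from SQL. A minimal sketch; the variable names are real MySQL system variables, while the value assigned at the end is just an illustrative choice:

-- How many pages the background thread scans from the tail of the LRU list.
SHOW VARIABLES LIKE 'innodb_lru_scan_depth';

-- Rough ceiling on background flushing I/O per second.
SHOW VARIABLES LIKE 'innodb_io_capacity';

-- Example: allow deeper LRU scans on a server with fast storage.
SET GLOBAL innodb_lru_scan_depth = 2048;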

How do I determine if a page is fully flushed to disk? What if the power goes out in the middle?

To ensure the integrity of a page synchronized from memory to disk, a checksum and the LSN corresponding to the page's last modification are stored at both the head and the tail of the page. If the values at the head and tail do not match during verification, the synchronization process was interrupted.

Is deleted data removed directly?

Deleted records are not immediately removed from disk, because rearranging the remaining records on disk would incur a performance overhead. Instead, the record is simply marked with a delete flag. All deleted records form a so-called garbage list, and the space they occupy is called reusable space; if new records are later inserted into the table, they may reuse the space occupied by these deleted records.

The index

  • Each index corresponds to a B+ tree. The B+ tree has several levels: the lowest level contains the leaf nodes and the rest are inner nodes. All user records are stored in the leaf nodes of the B+ tree, and all directory entry records are stored in the inner nodes.
  • The InnoDB storage engine automatically builds a clustered index on the primary key (and adds a hidden one for us if the table has no primary key). The leaf nodes of the clustered index contain the complete user records.
  • We can create secondary indexes on columns we are interested in (see the sketch after this list). The leaf nodes of a secondary index contain records made up of the index columns + the primary key, so to obtain the complete user record through a secondary index a back-to-table operation is needed: after the primary key is found via the secondary index, the clustered index is used to locate the complete user record.
  • The nodes at each level of the B+ tree form a doubly linked list ordered by index column values (the File Header section of each data page stores the previous and next page numbers). Moreover, the records within each page (whether user records or directory entry records) form a singly linked list in ascending order of the index column values (each record has a next_record attribute in its header). For a joint index, pages and records are sorted first by the value of the leading column; when that value is equal, they are sorted by the following columns.
  • Looking up a record by index starts at the root of the B+ tree and works downward, level by level. Because every page contains a Page Directory ordered by index column values, lookups within these pages are very fast.
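
A minimal sketch of these structures, using a hypothetical table t: the primary key forms the clustered index, idx_name is a secondary index, and the second query needs a back-to-table lookup because it selects a column that idx_name does not contain.

CREATE TABLE t (
  id   INT PRIMARY KEY,        -- clustered index: leaf nodes hold the full rows
  name VARCHAR(50),
  age  INT,
  KEY idx_name (name)          -- secondary index: leaf nodes hold (name, id)
) ENGINE = InnoDB;

-- Answered entirely from idx_name (index column + primary key): no back-to-table.
SELECT id, name FROM t WHERE name = 'foo';

-- Uses idx_name to find matching primary keys, then goes back to the
-- clustered index (back-to-table) to fetch the age column.
SELECT id, name, age FROM t WHERE name = 'foo';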

Process for finding records within a page:

InnoDB divides the records in a page into several groups. The address offset of the last record in each group is stored as a slot in the Page Directory, so finding a record in a page by primary key is fast and takes two steps:

  1. Use binary search over the slots to determine the slot the record belongs to.
  2. Traverse the records in that slot's group via each record's next_record attribute.

Clustered index

  1. Records and pages are sorted by primary key value, which has three implications:

    • The records within a page are arranged in a singly linked list ordered by primary key value.
    • The pages storing user records are arranged in a doubly linked list ordered by the primary key values of the user records they contain.
    • The pages storing directory entry records are divided into levels; pages at the same level are also arranged in a doubly linked list ordered by the primary key values of the directory entry records they contain.
  2. The leaves of the B+ tree store the complete user record.

    A complete user record is one in which the values of all columns (including hidden columns) are stored.

Joint index

Create a joint index on the c2 and c3 columns (a sketch follows the list below):

  • Records and pages are first sorted by the c2 column.
  • Records with the same c2 value are then sorted by the c3 column.
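
A minimal sketch, assuming a hypothetical table t_joint with columns c1, c2, and c3:

-- Joint index: entries are sorted by c2 first, and by c3 among equal c2 values.
ALTER TABLE t_joint ADD INDEX idx_c2_c3 (c2, c3);

-- Can use idx_c2_c3: the condition filters on the leading column c2.
SELECT * FROM t_joint WHERE c2 = 5 AND c3 > 10;

-- Cannot use idx_c2_c3 effectively: c3 alone is not a leftmost prefix.
SELECT * FROM t_joint WHERE c3 = 10;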

Root pages never move

When we introduced B+ tree indexes earlier, for ease of understanding we drew the leaf nodes that store user records first and then the internal nodes that store directory entries. In reality, a B+ tree forms like this:

  • Whenever a B+ tree index is created for a table (the clustered index is not created explicitly; it is created by default), a root node page is created for it. At first, when the table contains no data, the root node of each B+ tree index contains neither user records nor directory entry records.
  • When user records are subsequently inserted into the table, they are first stored in this root node.
  • When the available space in the root node runs out and records keep being inserted, all records in the root node are copied into a newly allocated page, say page a, and page a then undergoes a page split, producing another new page, say page b. The newly inserted record is placed into page a or page b according to its key value (the primary key value for a clustered index, the index column values for a secondary index), and the root node is upgraded to a page that stores directory entry records.

One thing deserves special attention in this process: the root node of a B+ tree index never moves after it is created. So whenever an index is created for a table, the page number of its root node is recorded in a fixed place, and whenever the InnoDB storage engine needs to use the index, it retrieves the root node's page number from that fixed place to access the index.

Precautions for using indexes

  1. B+ tree indexes are expensive in space and time, so don’t build indexes when you don’t have to.

  2. B+ tree indexes apply to the following situations:

    • Full value matches
    • Matching the leftmost columns
    • Matching a range of values
    • An exact match on one column and a range match on another
    • Used for sorting
    • Used for grouping
  3. Here are some things to consider when using indexes:

    • Create indexes only for columns used for searching, sorting, or grouping

    • Create indexes on columns with high cardinality

    • Index column types should be as small as possible

    • You can index only the prefixes of string values

    • An index can be used only if the index column appears by itself, not inside an expression, in a comparison

    • It is recommended that the primary key have the AUTO_INCREMENT attribute, to minimize page splits and record shifts in the clustered index.

    • Locate and delete duplicate and redundant indexes in a table

    • Try to use covering indexes for queries, to avoid the performance loss caused by back-to-table lookups (sketched below).
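
Two of the tips above, sketched with a hypothetical person table:

-- Prefix index: index only the first 10 characters of a long string column.
ALTER TABLE person ADD INDEX idx_name_prefix (name(10));

-- Covering index: idx_name_birthday contains every column this query needs,
-- so it can be answered from the secondary index alone, with no back-to-table
-- lookups.
ALTER TABLE person ADD INDEX idx_name_birthday (name, birthday);
SELECT name, birthday FROM person WHERE name > 'Asa' AND name < 'Barlow';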

The optimizer

Explain access methods

The way MySQL executes queries is called an access method or access type

  • const: a single record is located by comparing the primary key or a unique secondary index column with a constant for equality.
  • ref: the query is executed through a secondary index whose column is compared with a constant for equality. Neither ordinary nor unique secondary indexes limit how many NULL values the index column may contain, so a search condition such as key IS NULL can use at most the ref access method, never const.
  • ref_or_null: like ref, but records whose index column value is NULL are retrieved as well.
  • index: the query is executed by scanning all the records of a secondary index.
  • all: the query is executed with a full table scan.

Let’s outline the roles of each column in the output of the EXPLAIN statement:

  • id: each SELECT keyword in a large query statement corresponds to a unique id
  • select_type: the type of query the SELECT keyword corresponds to
  • table: the name of the table
  • partitions: matched partition information
  • type: the access method for the single table
  • possible_keys: the indexes that might be used
  • key: the index actually used
  • key_len: the actual length of the index used
  • ref: when an equality comparison against an index column is used, information about what the index column is matched with
  • rows: the estimated number of records to read
  • filtered: the percentage of records remaining after the table is filtered by the search criteria
  • Extra: some additional information
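
A minimal sketch of how the type column reflects the access method; the table and index names are hypothetical, and the plan actually chosen depends on the data and the statistics:

-- const: equality on the primary key locates at most one record.
EXPLAIN SELECT * FROM t WHERE id = 3;

-- ref: equality on an ordinary secondary index column.
EXPLAIN SELECT * FROM t WHERE name = 'foo';

-- ALL: the column is wrapped in an expression and no index applies, so a
-- full table scan is used.
EXPLAIN SELECT * FROM t WHERE age + 1 = 30;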

It should be noted that large

Cost-based optimization

I/O costs

The MyISAM and InnoDB storage engines both store table data and indexes on disk. When we query the records in a table, the data or indexes must first be loaded into memory; the time spent loading from disk into memory is called the I/O cost.

CPU cost

The time spent reading and checking whether records meet the corresponding search criteria, sorting result sets, and so on is called CPU cost.

Before actually executing a single-table query, the MySQL query optimizer finds all the possible plans for executing the statement, compares their costs, and picks the plan with the lowest cost; this lowest-cost plan is the execution plan, which is then carried out by calling the interfaces provided by the storage engine. The process can be summarized as follows:

  1. Find all possible indexes based on the search criteria
  2. Calculate the cost of a full table scan
  3. Calculate the cost of executing the query with each candidate index
  4. Compare the costs of the candidate plans and pick the one with the lowest cost

Calculating the cost of a full table scan requires two pieces of information:

  • The number of pages occupied by the clustered index
  • Number of records in this table

Where do these two pieces of information come from? MySQL maintains a set of statistics for each table. We'll talk later about how these statistics are collected; for now, let's look at how to view them. MySQL provides the SHOW TABLE STATUS statement for viewing a table's statistics:

SHOW TABLE STATUS LIKE 'TABLE_83'
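
A minimal sketch of the full table scan cost, assuming MySQL 5.7's default cost constants (roughly 1.0 per page read and 0.2 per row evaluation, plus small fixed adjustments) and hypothetical statistics Rows = 9693 and Data_length = 1589248 taken from SHOW TABLE STATUS:

SELECT 1589248 / 16384                  AS clustered_index_pages, -- 97 pages
       (1589248 / 16384) * 1.0 + 1.1    AS io_cost,               -- 98.1
       9693 * 0.2 + 1.0                 AS cpu_cost,              -- 1939.6
       (1589248 / 16384) * 1.0 + 1.1
         + 9693 * 0.2 + 1.0             AS full_table_scan_cost;  -- 2037.7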

The optimizer also needs to estimate how many records a secondary index range contains. Here, the number of records that idx_key2 contains in the range (10, 1000) is estimated like this:

  • Step 1: Using the condition key2 > 10, access the B+ tree index corresponding to idx_key2 and find the first record satisfying key2 > 10; call it the leftmost record of the interval. As mentioned before, locating a record in a B+ tree is fast and constant-time, so the cost of this step is negligible.
  • Step 2: Using the condition key2 < 1000, find the last record in the idx_key2 B+ tree index satisfying that condition; call it the rightmost record of the interval. The cost of this step is also negligible.
  • Step 3: If the leftmost record and the rightmost record of the interval are not too far apart (in MySQL 5.7.21, no more than 10 pages apart), the number of secondary index records satisfying key2 > 10 AND key2 < 1000 can be counted exactly. Otherwise, read 10 pages to the right starting from the leftmost record, compute the average number of records per page, and multiply that average by the number of pages between the leftmost and rightmost records of the interval.

After the back-to-table operation obtains the complete user records, the remaining search criteria are then checked against them.

The transaction

The concept of transactions

We call one or more database operations that need atomicity, consistency, isolation, and durability a transaction. Ultimately this is about solving the problems caused by concurrently running transactions: dirty writes, dirty reads, non-repeatable reads, and phantom reads. Each storage engine implements transactions differently; InnoDB uses MVCC and locks.

State of a transaction

Transaction durability – redo log

If we modify a page only in the in-memory Buffer Pool and a failure then occurs after the transaction has committed, all the data in memory is lost, and with it the changes the committed transaction made to the database, which we cannot afford.

How about flushing all the modified pages in memory when the transaction commits?

Performance would be poor. Compared with flushing all modified in-memory pages to disk when a transaction commits, flushing only the redo logs generated during the transaction has the following advantages:

  • Redo logs take up very little space

    The amount of storage required to store tablespace ids, page numbers, offsets, and values that need to be updated is small. We’ll talk more about redo log formats later, but for now it’s good to know that a redo log doesn’t take up a lot of space.

  • Redo logs are written sequentially to disk

    During a transaction, several redo logs may be generated for each statement executed, and these logs are written to disk in the order in which they were created, using sequential IO.

mtr

InnoDB's designers regard operations such as inserting a record into the B+ tree of an index as atomic: they cannot stop halfway. So the redo logs generated while performing such an atomic operation are stored as a group (a mini-transaction, mtr), and during crash recovery the redo logs of a group are either replayed completely or not replayed at all.

log buffer

Redo logs are not written directly to disk. When the server starts up, it requests a large contiguous chunk of memory from the operating system called the redo log buffer (log buffer).

However, it is not an option for these logs to remain in memory. They can be flushed to disk in some cases, such as:

  • The log buffer space is insufficient

    The size of the log buffer is finite (specified by the system variable innodb_log_buffer_size), and if you keep filling this finite size with logs, it will soon fill up. InnoDB was designed to flush logs to disk if the number of redo logs currently written to the log buffer has reached about half of the total log buffer capacity.

  • When a transaction commits

    As mentioned earlier, the main reason for using redo logs is that they take up less space and are written sequentially. Modified Buffer Pool pages are not flushed to disk during transaction commits, but redo logs must be flushed to disk for durability.

  • Background threads flush periodically

    There is a background thread that flushes the redo logs in the log buffer to disk roughly once per second.

  • Shut down the server properly

  • When you do something called a checkpoint (we haven’t talked about checkpoint right now, but we’ll talk about it in detail later)

  • Other cases…
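
Two of the related knobs, as a minimal sketch:

-- Size of the in-memory log buffer.
SHOW VARIABLES LIKE 'innodb_log_buffer_size';

-- Controls flushing at transaction commit:
--   1 (default): write and fsync the redo log at every commit (full durability)
--   0          : write and flush only about once per second
--   2          : write to the OS cache at commit, fsync about once per second
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';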

Redo logs are first written to the log buffer and later flushed to the redo log files on disk. InnoDB therefore maintains a global variable called buf_next_to_write, which marks how far the logs in the current log buffer have already been flushed to disk.

checkpoint

The redo log exists only to restore dirty pages after a system crash. Once a dirty page has been flushed to disk, its redo log is no longer needed after a restart, and the disk space it occupies can be reused by later redo logs. In other words, InnoDB uses a global variable checkpoint_lsn to represent how far the redo log may be overwritten. For example, once page a has been flushed to disk, the redo log generated by mtr_1 may be overwritten, so checkpoint_lsn can be advanced; this operation is called a checkpoint. Making a checkpoint can be divided into two steps. Step 1: Compute the maximum LSN of the redo logs that may currently be overwritten.

Step 2: Write checkpoint_lsn, the corresponding offset within the redo log file group, and the checkpoint number into the log file's management information (checkpoint1 or checkpoint2).

How does crash recovery work?

Identify the starting point for recovery

As mentioned earlier, redo logs before checkpoint_lsn may all be overwritten, which means the dirty pages they correspond to have already been flushed to disk, so they do not need to be restored. For redo logs at or after checkpoint_lsn, the corresponding dirty pages may or may not have been flushed, so recovery reads the redo logs starting from checkpoint_lsn to restore those pages.

Determine the endpoint of recovery

The starting point for redo log recovery is now defined; what is the end point? This has to do with the structure of a log block. Redo logs are written sequentially: one block is filled up before the next one is written. The log block header of an ordinary block has an attribute called LOG_BLOCK_HDR_DATA_LEN, which records how many bytes of the current block are in use. For a full block the value is always 512. If the value is not 512, that block is the last one to be scanned during crash recovery.

How to restore

Use hash tables

Redo logs with the same space ID and page number are hashed into the same slot; if there are several of them, they are linked together in a list in the order they were generated, as shown in the figure:

You can then iterate through the hash table, and because the redo logs that made changes to the same page are placed in a single slot, you can repair one page at a time (avoiding a lot of random IO reading the page), which makes recovery faster.

  • Skip pages that have been flushed to disk

How do we know during recovery whether the dirty page for a redo log had already been flushed to disk when the crash happened? Every page records the LSN of its last modification in FIL_PAGE_LSN. If a dirty page was flushed to disk after a checkpoint, its FIL_PAGE_LSN must be greater than checkpoint_lsn. For such pages, redo logs whose LSN values are not greater than FIL_PAGE_LSN do not need to be replayed, which further speeds up crash recovery.

undo log

Executing a transaction may change many things. To guarantee the transaction's atomicity, we need to be able to change everything back to the way it was; this process is called rollback. This creates the illusion that the transaction did nothing at all, satisfying the atomicity requirement.

Transaction id:

A read-only transaction is assigned a transaction ID only when it first adds, deletes, or modifies a record in a user-created temporary table; otherwise no transaction ID is assigned. A read/write transaction is assigned a transaction ID only when it first adds, deletes, or modifies a record in some table (including user-created temporary tables); otherwise no transaction ID is assigned.

Hidden columns: in addition to the complete user data, records in the clustered index automatically get hidden columns named trx_id and roll_pointer, plus a hidden column named row_id if the user did not define a primary key or a UNIQUE key on the table.

MVCC principle

Version chain

roll_pointer: every time a clustered index record is changed, the old version is written to the undo log, and this hidden column then acts as a pointer to that previous version. As the record is updated again and again, all of its versions are linked together by the roll_pointer attribute into a list called the version chain; the head of the version chain is the latest version of the record. In addition, each version also contains the trx_id of the transaction that generated it.

ReadView

For transactions at the READ UNCOMMITTED isolation level, records modified by uncommitted transactions may be read, so it is enough to read the latest version of a record. For transactions at the SERIALIZABLE isolation level, InnoDB uses locks to access records. For transactions at the READ COMMITTED and REPEATABLE READ isolation levels, it must be guaranteed that only changes made by committed transactions are read, so the core problem is deciding which version in a record's version chain is visible to the current transaction. For this, InnoDB introduced the concept of a ReadView, which contains four important pieces of information:

  • m_ids: the list of transaction IDs of the read/write transactions that are active in the system at the moment the ReadView is generated.

  • min_trx_id: the smallest transaction ID among the active read/write transactions at the moment the ReadView is generated, i.e. the minimum value in m_ids.

  • max_trx_id: the ID that the system should assign to the next transaction at the moment the ReadView is generated.

    Tip: Note that max_trx_id is not the maximum value in m_ids; transaction IDs are assigned incrementally. Say there are three transactions with IDs 1, 2, and 3, and the transaction with ID 3 commits. When a new read transaction then generates a ReadView, m_ids will include 1 and 2, min_trx_id will be 1, and max_trx_id will be 4.

  • creator_trx_id: the transaction ID of the transaction that generated this ReadView.

    Tip: As mentioned earlier, a transaction is assigned a transaction ID only when it changes a table record (INSERT, DELETE, UPDATE, etc.); in a read-only transaction, creator_trx_id therefore defaults to 0.

With this ReadView, when accessing a record, you only need to follow these steps to determine if a version of the record is visible:

  • If the trx_id attribute of the accessed version equals the creator_trx_id value in the ReadView, the current transaction is accessing a record it modified itself, so this version is visible to the current transaction.
  • If the trx_id attribute of the accessed version is less than the min_trx_id value in the ReadView, the transaction that generated this version committed before the current transaction generated its ReadView, so this version is visible to the current transaction.
  • If the trx_id attribute of the accessed version is greater than or equal to the max_trx_id value in the ReadView, the transaction that generated this version started after the current transaction generated its ReadView, so this version is not visible to the current transaction.
  • If the trx_id attribute of the accessed version lies between min_trx_id and max_trx_id, check whether trx_id appears in the m_ids list. If it does, the transaction that generated this version was still active when the ReadView was created, so the version is not visible; if it does not, that transaction had already committed when the ReadView was created, so the version is visible.

If one version of the data is not visible to the current transaction, go down the version chain to the next version of the data, continue the above steps to determine the visibility, and so on, until the last version in the chain. If the last version is also not visible, it means that the record is not visible to the transaction at all and is not included in the query result.

One big difference between READ COMMITTED and REPEATABLE READ isolation levels in MySQL is when they generate readViews.

READ COMMITTED – A ReadView is generated every time data is READ

REPEATABLE READ – Generates a ReadView when reading data for the first time
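
A minimal two-session sketch of that difference, using the hero table that appears in the locking examples later; the updated name value is just an illustration:

-- Session A (REPEATABLE READ):
BEGIN;
SELECT name FROM hero WHERE number = 1;   -- first read generates the ReadView

-- Session B:
UPDATE hero SET name = 'zhang fei' WHERE number = 1;
COMMIT;

-- Session A again:
SELECT name FROM hero WHERE number = 1;
-- Under REPEATABLE READ the first ReadView is reused, so the old name is returned.
-- Under READ COMMITTED a fresh ReadView is generated, so 'zhang fei' is returned.
COMMIT;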

MVCC (Multi-Version Concurrency Control) refers to the process by which ordinary SELECT operations in transactions at the READ COMMITTED and REPEATABLE READ isolation levels access the version chains of records. It allows read-write and read-read operations of different transactions to execute concurrently, which improves system performance. Unlike the SQL standard, MySQL actually solves the phantom read problem at the REPEATABLE READ isolation level.

The lock

How to solve the problem of dirty read, unrepeatable read, phantom read? There are actually two possible solutions:

  • Scheme 1: read operations use multi-version concurrency control (MVCC), and write operations take locks.

The MVCC described in detail in the previous chapter works by generating a ReadView and then using it to find the qualifying versions of records (historical versions are built from undo logs). Generating the ReadView is like freezing time at that moment (like taking a snapshot with a camera): the query can only see changes committed before the ReadView was generated, not changes that were uncommitted at that moment or made by transactions started later. Write operations, however, always apply to the latest version of a record. Reading a historical version does not conflict with changing the latest version, so under MVCC read and write operations do not conflict.

  • Scheme 2: both read and write operations take locks.

Some business scenarios do not allow reading an old version of a record; the latest version must be read every time. For example, in a bank deposit transaction the account balance must be read, the deposit amount added, and the result written back, and while this is in progress other transactions must not access the balance. In that case the record is locked even when it is read, which means read and write operations queue up just like write-write operations. Clearly, under MVCC reads and writes do not conflict and performance is higher, while with locking they must wait for each other and performance suffers. In general we prefer MVCC to handle concurrent reads and writes, but in the special cases where the business requires it, locking must be used.

We say a phantom read occurs when the current transaction reads a range of records, another transaction inserts new records into that range, and the current transaction reads the range again and sees the newly inserted records; these newly inserted records are called phantom records. Solving phantom reads with locking is a bit awkward, because when the current transaction first reads the range the phantom records do not yet exist, so there is nothing to put a lock on. Don't worry though: the InnoDB designers have an answer for this, which we will see shortly.

Classification of locks

Shared and exclusive locks

MySQL divides locks on records into two classes:

  • Shared Locks, S locks for short. Before a transaction reads a record, it first needs to acquire the record's S lock.

  • Exclusive Locks, X locks for short. Before a transaction changes a record, it first needs to acquire the record's X lock.

  • Add an S lock to the records being read:

SELECT ... LOCK IN SHARE MODE;
  • Add an X lock to the records being read:
SELECT ... FOR UPDATE;

The locks mentioned so far apply to individual records and can be called row-level locks or row locks; locking a record affects only that record. Locks can also be placed on tables, and table locks are likewise divided into shared (S) locks and exclusive (X) locks. But when locking a whole table, how do we know whether some record in it is already locked by a row lock? It is like locking an entire classroom building: checking whether every classroom door inside is locked would be far too slow, and traversing all the records is out of the question. So the InnoDB designers came up with Intention Locks:

  • Intention Shared Lock, IS lock for short. Before a transaction adds an S lock to a record, it first needs to add an IS lock at the table level.
  • Intention Exclusive Lock, IX lock for short. Before a transaction adds an X lock to a record, it first needs to add an IX lock at the table level (a sketch of observing these locks follows).
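
A minimal sketch of observing these locks, assuming MySQL 8.0 where performance_schema.data_locks is available:

-- Session A: lock one row.
BEGIN;
SELECT * FROM hero WHERE number = 8 FOR UPDATE;

-- Session B: inspect the locks currently held.
SELECT OBJECT_NAME, LOCK_TYPE, LOCK_MODE, LOCK_DATA
FROM performance_schema.data_locks;
-- Typically shows a table-level IX lock on hero plus a record-level X lock
-- on the row whose number value is 8.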

Table level locking in InnoDB

  • Table level S locks, X locks

    InnoDB storage engine does not add table level S or X locks to a table when performing SELECT, INSERT, DELETE, UPDATE statements on the table.

    In addition, when DDL statements such as ALTER TABLE and DROP TABLE are executed on a table, SELECT, INSERT, DELETE, and UPDATE statements run concurrently on that table by other sessions will block; likewise, SELECT, INSERT, DELETE, and UPDATE statements on a table in one transaction will block DDL statements on that table in other sessions. This is implemented with server-level Metadata Locks (MDL), not with the table-level S and X locks provided by the InnoDB storage engine itself.

  • AUTO-INC locks

    The system assigns values for AUTO_INCREMENT columns mainly in two ways:

    • Using a table-level AUTO-INC lock: when an INSERT statement executes, an AUTO-INC lock is taken at the table level, increments are allocated for the AUTO_INCREMENT column, and the lock is released when the INSERT statement finishes. While one transaction holds the AUTO-INC lock, inserts from other transactions are blocked, which guarantees that the values allocated within one statement are consecutive.
    • Using a lightweight lock: the lock is held only while the AUTO_INCREMENT value is generated and is released immediately, before the whole INSERT statement finishes, which allows better insert concurrency.

Row-level locking in InnoDB

Let’s take a look at the common row locking types.

  • Record Locks:

    The locks on individual records mentioned earlier are of this type: a record lock locks just one record, and it comes in S and X variants.

  • Gap Locks:

    As shown in the figure, a gap lock is added to the record whose number value is 8, which means other transactions are not allowed to insert new records into the gap before that record; in effect, new records whose number value falls in the interval (3, 8) cannot be inserted for the time being. For example, if another transaction wants to insert a new record with number = 4, it locates the next record, the one with number = 8, finds a gap lock on it, and the insert blocks until the transaction holding the gap lock commits. Only then can new records with number values in the interval (3, 8) be inserted.

Gap locks exist only to prevent phantom records from being inserted.

  • Next-Key Locks:

Sometimes we want to both lock a record and prevent other transactions from inserting new records into the gap in front of it. A next-key lock is essentially the combination of a record lock and a gap lock: it protects the record itself and also prevents other transactions from inserting new records into the gap in front of it.

So when is a record lock used, and when is a next-key lock used?

The lock required by a statement is subject to a number of conditions, such as:

  • The isolation level of the transaction
  • Indexes used for statement execution (such as clustered indexes, unique secondary indexes, plain secondary indexes)
  • The query conditions (for example =, <=, >=, etc.)
  • The type of statement to execute

For example, gap locks and next-key locks exist to solve the phantom read problem, so they are used at the REPEATABLE READ and SERIALIZABLE isolation levels but not at READ COMMITTED.

For range queries using primary keys

SELECT * FROM hero WHERE number >= 8 LOCK IN SHARE MODE;

Add an S-type next-key lock to all clustered index records whose number value is greater than 8

For equivalent queries using primary keys or unique indexes

SELECT * FROM hero WHERE number = 8 LOCK IN SHARE MODE;

Only a record lock is needed here, because no phantom read is possible: every execution of the query matches just this one record.

In the case of equivalent queries using ordinary secondary indexes

SELECT * FROM hero WHERE name = 'cao cao' LOCK IN SHARE MODE;

  • Since an ordinary secondary index is not unique, after the above statement executes the transaction must prevent other transactions from inserting new records whose name value is 'cao cao'. InnoDB locks as follows:
    • Add an S-type next-key lock to every secondary index record whose name value is 'cao cao'.

    • Add a gap lock to the first secondary index record that follows those records.

Full table scan

For example:
SELECT * FROM hero WHERE country  = '魏' LOCK IN SHARE MODE;

Since there is no index on the country column, the query can only be executed with a full table scan. The storage engine adds an S-type next-key lock to each clustered index record and hands the record to the server layer. If country = '魏' holds, the server layer sends the record to the client; otherwise it tells the InnoDB storage engine to release the lock on that record. However, at the REPEATABLE READ isolation level InnoDB does not actually release the lock, so every record in the clustered index stays locked until the transaction commits. Locking with SELECT … FOR UPDATE is similar, except that X-type next-key locks are used.

UPDATE statements lock in much the same way as SELECT … FOR UPDATE; in addition, if the updated columns belong to other secondary indexes, the corresponding secondary index records are locked as well.

DELETE statements also lock in much the same way as SELECT … FOR UPDATE; if the table has other secondary indexes, the corresponding secondary index records are locked as well.

The INSERT statement

An INSERT statement normally does not take locks itself, but before inserting, the current transaction must locate the record's position in the B+ tree. If the next record at that position holds a gap lock (a next-key lock also includes a gap lock), the current transaction places an insert intention lock on that record and enters the waiting state.
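
A minimal two-session sketch of an INSERT blocked by a gap lock, using the hero table at the REPEATABLE READ isolation level:

-- Session A: a locking range read takes next-key/gap locks covering the gap (3, 8).
BEGIN;
SELECT * FROM hero WHERE number > 3 AND number < 8 FOR UPDATE;

-- Session B: the insert must first obtain an insert intention lock in that gap,
-- so it blocks until Session A commits or rolls back.
INSERT INTO hero (number, name) VALUES (5, 'new hero');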


The log

Binlog and redo log write order


How a SQL statement executes in MySQL
