1 Innodb MVCC composition

1. Why Do I Use MVCC to Read Snapshots

Innodb storage engine’s snapshot reads are based on multi-version concurrent control MVCC and Undolog. The MVCC mechanism improves system read and write concurrent performance. lock in share mode, select … for update

  • Ideas for improving concurrency

Concurrent data consistency is generally implemented as follows: Lock + multiple data versions To improve concurrent read performance: Use shared locks to ensure concurrent read performance, and use exclusive locks to achieve mutual read/write exclusion to improve concurrent read/write performance

1.2 MVCC related Concepts

InnoDB’s MVCC implementation is based on undo log, using rollback segments to save undo log version snapshot data, using the readView mechanism to determine data visibility, and using the Purge thread to purge old data versions

1.2.1 undolog
  • Before the mysql transaction is committed, the old version data before the transaction modification will be stored in the Undo log for the impact of transaction rollback or database crash on the database data to ensure the atomicity of data.
  • Undo logs are stored inRollback segmentsIn, divided into: insert undo logupdate undo log
1.2.2 readview
  • Readview is primarily used for visibility judgment
  • At repeatable Read isolation level, snapshot generation only creates readView snapshot after the first SELECT read operation after the transaction is started
  • At the Read Committed isolation level, a Read View snapshot is created for each SELECT read in a transaction
1.2.3 Three implicit fields
field meaning Storage location The size of the
DB_TRX_ID Last updated transaction ID (UPDATE,delete, INSERT) Table data row and cluster index Six byte
DB_ROLL_PTR Roll back the pointer to the previous undolog record to form the undo list Table data row and cluster index 7 bytes
DB_ROW_ID Data row ID, monotonically increasing Table data row and cluster index Six byte

2 Innodb MVCC implementation (5.6.x)

2.1 Multi-version implementation

2.1.1 Transaction snapshot update process

Assume that the field ID in the user table is a clustered index and the field name is a non-clustered index

  • Step 1 Create a record
INSERT INTO `user`(`id`, `name`, `score`) VALUES (9.'zhang3'.60)
Copy the code

As shown in the figure:

The data row generated after insertion has three implicit fields: the row ID identifier for the should row, the latest transaction ID identifier, and the rollback pointer, along with the corresponding business property data

  • Step 2 Update the cluster index
 update table `user` set `score`= 70 where id=9
Copy the code

As shown in the figure:

Alter table undo log undo log alter table undo log alter table undo log Update DB_TRX_ID and DB_ROLL_PTR in the row where the cluster index is located. 5 Commit a transaction to release the lock

  • Step 3 Update the non-clustered index

     update table `user` set `score`= 80 where `name` = 'zhang3'
    Copy the code

    As shown in the figure:

Secondary index update

Alter table undo log undo log alter table undo log alter table undo log Change DB_TRX_ID to 668. 5 Insert a new index row and update the maximum trx_id on the Page. 6 Update the DB_TRX_ID and DB_TRX_ID on the cluster index row DB_ROLL_PTR 7 Commit a transaction to release the lock

  • conclusion

    Through multiple updates, old version snapshots are concatenated into a linked list by rolling back Pointers

    The Undolog log is also deleted by the Purge thread looking for the rollback log of the oldest transaction ID that is currently active

2.2 ReadView visibility judgment

2.2.1 Visibility judgment process
  • The readView structure contains the following properties
field meaning
creator_trx_id The transaction ID for creating the view
trx_ids An array of active read/write transaction ids is stored in order when a ReadView is created
low_limit_id Set to the current maximum transaction ID
up_limit_id The minimum value in the m_IDS collection
  • Visibility algorithm

When a transaction is started at the RC or RR level, a readView snapshot is generated. When a data is queried by the SELECT, the trx_ID of the data is compared with the readVEiW data. The visibility algorithm is as follows:

  • If you record a rowtrx_idLess thanread_view_t::up_limit_id, the transaction was committed when the ReadView was created and must be visible
  • If you record a rowtrx_idGreater than or equal toread_view_t::m_low_limit_id, the transaction was started after the readView was created and must not be visible
  • whentrx_idinup_limit_idandlow_limit_idBetween, and inread_view_t::trx_idsArray, the transaction was active when the readView was created and its data changes are not visible to the current view, if not in the active transaction listtrx_idsIn thetrx_idChanges to visible
2.2.2 RRR Repeatable Read process
  • Data preparation
CREATE TABLE `user` (
  `id` int(10) NOT NULL AUTO_INCREMENT,
  `name` varchar(100) COLLATE utf8mb4_bin NOT NULL,
  `scores` int(10) NOT NULL,
  `status` bigint(20) NOT NULL.PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=11 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;


BEGIN;
INSERT INTO `user` VALUES (1.'wang5'.33.2);
INSERT INTO `user` VALUES (2.'zhao6'.23.2);
INSERT INTO `user` VALUES (9.'zhang3'.12.1);
COMMIT;

Copy the code
  • Procedure Step 1 Start two new transactions AB

    -A transactionset session transaction isolation level repeatable read
    BEGIN
    SELECT * from user where id = 1Query the transaction ID of the current sessionselect TRX_ID from INFORMATION_SCHEMA.INNODB_TRX where TRX_MYSQL_THREAD_ID = CONNECTION_ID();
    
    -Transaction Bset session transaction isolation level repeatable read
    BEGIN
    SELECT * from user where id = 9Query the transaction ID of the current sessionselect TRX_ID from INFORMATION_SCHEMA.INNODB_TRX where TRX_MYSQL_THREAD_ID = CONNECTION_ID();
    Copy the code

    After transaction A initiates transaction B, the current session transaction IDS are 710 and 711, respectively. According to the readView mechanism, the readView snapshot of transaction B is as follows

  • Step 2 Transaction A updates the submission score data

    -Transaction A UPDATEuser set scores = 112 where id = 1
    commit
    Copy the code

    At this time, transaction A will generate the old version of Undolog snapshot, transaction B’s readView snapshot is still as follows:

    SELECT * from user where id = 1 and trx_ids are [710, 711]

    According to the visibility algorithm, the current transaction ID is found in the transaction active list, so we need to go to undolog to query the version. Trx_id = 709; up_limit_ID = up_limit_id

  • Step 3 Transaction C inserts data

    -Transaction Cset session transaction isolation level repeatable read
    BEGIN
    INSERT INTO `user`(`name`, `scores`, `status`) VALUES ('li4'.55.1SQL > query transaction id of current sessionselect TRX_ID from INFORMATION_SCHEMA.INNODB_TRX where TRX_MYSQL_THREAD_ID = CONNECTION_ID();
    COMMIT
    Copy the code

    SELECT * from user where id > 8 and trx_id = 712

    Two records with id 9 and 10 are queried. However, the record with ID 10 trx_id = 712 exceeds the maximum value of readView snapshot, so it is not visible. The undo log also does not have the old version record, so only the record with ID 9 can be queried

2.3 Snapshot Read Source Code Parsing

2.3.1 InnoDB three hidden field source code
  • The dict0dict.cc#dict_table_add_system_columns method is added to the table data row

    /**
     * 添加数据行的隐藏字段
     * @param table  
     * @param heap 
     */
    void dict_table_add_system_columns(
    	dict_table_t*	table,
    	mem_heap_t*	heap)	
    {
    
      // Add row ID, transaction ID, and rollback pointer in sequence
    	dict_mem_table_add_col(table, heap, "DB_ROW_ID", DATA_SYS,
    			       DATA_ROW_ID | DATA_NOT_NULL,
    			       DATA_ROW_ID_LEN);
    
    	dict_mem_table_add_col(table, heap, "DB_TRX_ID", DATA_SYS,
    			       DATA_TRX_ID | DATA_NOT_NULL,
    			       DATA_TRX_ID_LEN);
    
    	dict_mem_table_add_col(table, heap, "DB_ROLL_PTR", DATA_SYS,
    			       DATA_ROLL_PTR | DATA_NOT_NULL,
    			       DATA_ROLL_PTR_LEN);
    
    }
    Copy the code
  • The cluster index adds the row0upD.cc # ROW_upD_index_entry_sys_field method

/** * Set the transaction ID or roll back the pointer to the record row corresponding to the cluster index * @param Entry Data row record * @param index Cluster index * @param Type Type DATA_TRX_ID or DATA_ROLL_PTR * @param val Updated value */
void row_upd_index_entry_sys_field(
	dtuple_t*	entry,	
	dict_index_t*	index,
	ulint		type,
	ib_uint64_t	val)	
{
	dfield_t*	dfield;
	byte*		field;
	ulint		pos;
    // Query the initial position of the index hidden field
	pos = dict_index_get_sys_col_pos(index, type);
    // Get a pointer to the corresponding data record row
	dfield = dtuple_get_nth_field(entry, pos);
	field = static_cast<byte*>(dfield_get_data(dfield));
    // Add type
	if (type == DATA_TRX_ID) {
		trx_write_trx_id(field, val);
	} else {
		ut_ad(type == DATA_ROLL_PTR);
		trx_write_roll_ptr(field, val); }}Copy the code

Additional information is stored on the cluster index, the 6-byte DB_TRX_ID field, which represents the transaction ID of the record that was last inserted or updated. The 7-byte DB_ROLL_PTR field points to the undo log of the rollback segment. The 6-byte DB_ROW_ID is automatically incremented when new data is inserted. InnoDB automatically generates a clustered index containing the DB_ROW_ID field if the table does not have a primary key.

2.3.2 ReadView execution process
  • Readview data structure

    struct read_view_t{
      
      / / equal low_limit_id
    	trx_id_t	low_limit_no;
    			
      // Maximum transaction ID
    	trx_id_t	low_limit_id;
    			
      // Minimum transaction ID
    	trx_id_t	up_limit_id;
    			
      // Number of active transactions
    	ulint		n_trx_ids;
    				
      // Active transaction
    	trx_id_t*	trx_ids;
      
      // The transaction ID of the snapshot created
    	trx_id_t	creator_trx_id;
    			
    	UT_LIST_NODE_T(read_view_t) view_list;
    };
    Copy the code
  • The row0sel.cc#row_search_for_mysql method is used to determine the current read or snapshot read judgment process

      // If locked, the current read
    	if(prebuilt->select_lock_type ! = LOCK_NONE) {/ / to omit
      }else {
        // Snapshot read without lock
    		if (trx->isolation_level == TRX_ISO_READ_UNCOMMITTED) {
        // Read the uncommitted direct latest record row
    		} else if (index == clust_index) {
        // If the current index is clustered index
    		
          // Direct visibility judgment
    			if (UNIV_LIKELY(srv_force_recovery < 5) &&!lock_clust_rec_cons_read_sees(
    				    rec, index, offsets, trx->read_view)) {
    
    				rec_t*	old_vers;
            // If not, use undo log to get the version on the record line
    				err = row_sel_build_prev_vers_for_mysql(
    					trx->read_view, clust_index,
    					prebuilt, rec, &offsets, &heap,
    					&old_vers, &mtr);
    
    				if(err ! = DB_SUCCESS) {goto lock_wait_or_error;
    				}
    
    				if (old_vers == NULL) {
    					goto next_rec;
    				}
            // Assign to the old versionrec = old_vers; }}else {
          // Non-clustered index visibility judgment
    
    			ut_ad(!dict_index_is_clust(index));
    			DB_TRX_ID = DB_TRX_ID = DB_TRX_ID = DB_TRX_ID = DB_TRX_ID = DB_TRX_ID
          // Secondary index does not contain hidden columns, only the maximum trx_id of the page
    			if (!lock_sec_rec_cons_read_sees(
    				    rec, trx->read_view)) {
    				switch (row_search_idx_cond_check(
    						buf, prebuilt, rec, offsets)) {
            // If the ICP does not meet the conditions and the ultra-scan range is not reached, the next record is obtained to continue the search
    				case ICP_NO_MATCH:
    					goto next_rec;
    				case ICP_OUT_OF_RANGE:
            // If the condition is not met and the hyperscan scope is not met
    					err = DB_RECORD_NOT_FOUND;
    					goto idx_cond_failed;
    				case ICP_MATCH:
            // If the ICP matches the record, the cluster index is checked to determine the visibility
    					gotorequires_clust_rec; } ut_error; }}}Copy the code

For non-clustered indexes, because there are no hidden columns in the clustered index, when snapshot reads match the secondary index, lock_sec_rec_cons_read_sees is called to determine whether the trx_id of the last change recorded on the page is smaller than up_limit_id. If the value is smaller than the specified value, the data on the page is visible. Otherwise, the system checks the row in the cluster index and determines the visibility based on the DB_TRX_ID value to find the correct visible version data

  • Readview start row0sel.cc#row_search_for_mysql&rarr; Trx_assign_read_view ` method
trx_assign_read_view(
	trx_t*	trx)	
  
{
	ut_ad(trx->state == TRX_STATE_ACTIVE);
  // Return if there is one
	if(trx->read_view ! =NULL) {
		return(trx->read_view);
	}
  // If no, create one
	if(! trx->read_view) { trx->read_view =read_view_open_now(
			trx->id, trx->global_read_view_heap);

		trx->global_read_view = trx->read_view;
	}
  // Returns a snapshot
	return(trx->read_view);

}

Copy the code

Trx_assign_read_view → read_view_open_now → read_view_open_now_low

  • The closing of the readviewha_innodb.cc#ha_innobase::external_lockmethods
	if (trx->n_mysql_tables_in_use == 0) {

		trx->mysql_n_tables_locked = 0;
		prebuilt->used_in_HANDLER = FALSE;
    // autocommit=1, commit transaction
		if (!thd_test_options(
				thd, OPTION_NOT_AUTOCOMMIT | OPTION_BEGIN)) {
      Commit the transaction or mark the end of the SQL statement
			if (trx_is_started(trx)) {
				innobase_commit(ht, thd, TRUE);
			}
    // The transaction isolation level is lower than or equal to the RC level
		} else if (trx->isolation_level <= TRX_ISO_READ_COMMITTED
			   && trx->global_read_view) {
      // Close the snapshot
			read_view_close_for_mysql(trx); }}/** * transaction isolation level less than or equal to RC level call * @param TRX */
void read_view_close_for_mysql(
	trx_t*		trx)	
{
	ut_a(trx->global_read_view);

	read_view_remove(trx->global_read_view, false);

	mem_heap_empty(trx->global_read_view_heap);
  // Snapshot is set to NULL
	trx->read_view = NULL;
	trx->global_read_view = NULL;
}
Copy the code

RR isolation level: Snapshots are created only after the first SELECT read after the transaction is started and are not deleted until the transaction is committed. For RC isolation level, readView snapshots are created for each SELECT read in the transaction

  • Transaction visibility judgmentlock_clust_rec_cons_read_seesread_view_sees_trx_id
/** * visibility check process * @param view current transaction readView snapshot * @param trx_id Data row corresponding transaction ID * @return */
bool read_view_sees_trx_id(
	const read_view_t*	view,	
	trx_id_t		trx_id)
{
    // If less than the minimum id of the current transaction
	if (trx_id < view->up_limit_id) {

		return(true);
  // If greater than or equal to the maximum ID of the current transaction snapshot
	} else if (trx_id >= view->low_limit_id) {

		return(false);
	} else {
    // If in between
		ulint	lower = 0;
		ulint	upper = view->n_trx_ids - 1;

		ut_a(view->n_trx_ids > 0);
    // Based on the currently active transaction array, find and compare whether trx_id exists in it by dichotomy
		do {
			ulint		mid	= (lower + upper) >> 1;
			trx_id_t	mid_id	= view->trx_ids[mid];

			if (mid_id == trx_id) {
				return(FALSE);
			} else if (mid_id < trx_id) {
				if (mid > 0) {
					upper = mid - 1;
				} else {
					break; }}else {
				lower = mid + 1; }}while (lower <= upper);
	}
    // The current row is visible if it is not in an active transaction
	return(true);
}
Copy the code

3 Reference Materials

Database multi-version implementation insider

Engine features · InnoDB MVCC related implementation

Recommended reading

Guava Cache actual Combat – From scenario to principle analysis

Details of HTTP2.0 and HTTPS protocols

, recruiting

Zhengcaiyun Technology team (Zero) is a passionate, creative and executive team based in picturesque Hangzhou. The team has more than 300 r&d partners, including “old” soldiers from Alibaba, Huawei and NetEase, as well as newcomers from Zhejiang University, University of Science and Technology of China, Hangzhou Electric And other universities. Team in the day-to-day business development, but also in cloud native, chain blocks, artificial intelligence, low code platform system, middleware, data, material, engineering platform, the performance experience, visualization technology areas such as exploration and practice, to promote and fell to the ground a series of internal technical products, continue to explore new frontiers of technology. In addition, the team is involved in community building, Currently, There are Google Flutter, SciKit-Learn, Apache Dubbo, Apache Rocketmq, Apache Pulsar, CNCF Dapr, Apache DolphinScheduler, and Alibaba Seata and many other contributors to the excellent open source community. If you want to change something that’s been bothering you, want to start bothering you. If you want to change, you’ve been told you need more ideas, but you don’t have a solution. If you want change, you have the power to make it happen, but you don’t need it. If you want to change what you want to accomplish, you need a team to support you, but you don’t have the position to lead people. If you want to change the original savvy is good, but there is always a layer of fuzzy window…… If you believe in the power of believing, believing that ordinary people can achieve extraordinary things, believing that you can meet a better version of yourself. If you want to be a part of the process of growing a technology team with deep business understanding, sound technology systems, technology value creation, and impact spillover as your business takes off, I think we should talk. Any time, waiting for you to write something and send it to [email protected]

Wechat official account

The article is published synchronously, the public number of political cloud technology team, welcome to pay attention to