1 Concept Differentiation

  • Normal and unique indexes

A normal index allows duplicate values, whereas a unique index, like a primary key, does not. A unique index can therefore serve as a data-integrity constraint. For example, for the ID-card number field of a student table we stipulate that values must never repeat, so a unique index is used (the student number field is usually made the primary key).

  • Primary key and unique index

The primary key guarantees that each row in a table is unique; for example, an ID-card number or a student number must not repeat within the table. A unique index serves the same purpose. The differences are: a table can have only one primary key, and the primary key cannot be NULL; a table can have multiple unique indexes, and a unique-index column may contain NULL, while every non-NULL value must differ from all the others. For example, in a school's student table the student number is usually the primary key and the ID-card number is a unique index; at the education bureau, the ID-card number might instead be made the primary key and the student number a unique index. Which column becomes the primary key depends on the application, but the primary key can never be NULL.
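As an illustrative sketch of the difference (the table, column names, and inserted values are made up for this example):

create table student (
  student_no bigint      not null,
  id_card    varchar(18) null,
  name       varchar(32) not null,
  primary key (student_no),           -- exactly one primary key per table, never NULL
  unique key uk_id_card (id_card)     -- one of possibly several unique indexes, may hold NULL
) engine = InnoDB;

insert into student values (1, '11010119900101001X', 'Alice');
insert into student values (2, null, 'Bob');                       -- accepted: NULL is allowed in a unique index
-- insert into student values (3, '11010119900101001X', 'Carol');  -- would be rejected: duplicate id_card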

2 Case Introduction

Consider a resident-management system in which everyone has a unique ID-card number. If the system needs to look up a name by ID-card number, it executes SQL similar to the following:

select name from CUser where id_card = 'ooxx';

You would naturally build an index on the id_card field. However, id_card values are long, so the field is not well suited to being the primary key. That leaves two options (sketched as DDL below):

  1. Create a unique index on the id_card field
  2. Create a normal index on the id_card field
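For reference, the two options could be written roughly as the following DDL (the table name comes from the query above; the index names are illustrative):

-- Option 1: unique index (the database itself rejects duplicate ID-card numbers)
alter table CUser add unique index uk_id_card (id_card);

-- Option 2: normal index (uniqueness is guaranteed only by the business code)
alter table CUser add index idx_id_card (id_card);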

Assuming the business code guarantees that duplicate ID-card numbers are never written, both choices are logically correct. But from a performance perspective, should you choose a unique index or a normal index?

Consider the following example: a table T with an indexed field k, where none of the values of k are duplicated.

  • InnoDB index structure:

Next, let's analyze the performance of the two, for queries and for updates.

3 Querying Performance

select id from T where k = 4;

The query walks the B+ tree from the root down to a leaf node and then locates the record within the data page, which can be treated as a binary search inside the page.

  • Normal index: after the first record satisfying the condition, (4,400), is found, the next record is fetched and checked, and scanning continues until the first record that does not satisfy k = 4 is encountered.
  • Unique index: because the indexed values are unique, the search stops as soon as the first matching record is found.

It looks like the performance difference is tiny.

InnoDB reads and writes data in units of data pages (16KB by default). That is, when a single record needs to be read, the record is not fetched from disk by itself; the entire page containing it is read into memory.

Therefore, the normal index only costs one extra "find and judge the next record" operation: one pointer move and one comparison. If k = 4 happens to be the last record on its data page, fetching the next record does require reading the next data page, which is slightly more expensive. However, for an integer field a 16KB data page can hold nearly a thousand keys, so this case is rare, and averaged over many queries the extra cost is negligible for today's CPUs.
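You can check the page size on your own instance; a minimal sketch (innodb_page_size is a standard read-only InnoDB system variable):

show variables like 'innodb_page_size';
-- 16384 bytes by default; with a 4-byte integer key plus per-record overhead,
-- roughly a thousand index entries fit on one page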

4 Change Buffer

We know that MySQL has a change buffer. When a data page needs to be updated, InnoDB updates it directly if the page is already in memory. If the page is not in memory, and the change does not need to be checked against existing data, InnoDB can record the change in the change buffer instead of reading the page from disk. When the page is later read into memory, the buffered changes are applied to it; this operation is called merge. The change buffer occupies part of the buffer pool in memory and is also persisted to disk in the system tablespace. Because a unique index must verify that the new value does not already exist, and that check requires the target page to be in memory, unique index updates cannot use the change buffer; only normal indexes can.
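The change buffer is configurable. A minimal sketch for inspecting the relevant settings, assuming a reasonably recent MySQL version:

show variables like 'innodb_change_buffer%';
-- innodb_change_buffering: which operations are buffered (all, none, inserts, deletes, changes, purges)
-- innodb_change_buffer_max_size: maximum share of the buffer pool the change buffer may use,
-- as a percentage (default 25, upper limit 50)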

5 Updating Performance

Now, what does InnoDB do when a new record (4,400) is inserted into this table?

We need to distinguish whether the target data page of the record to be updated is already in memory:

5.1 Target Page in Memory

  • Unique index

Find the position between 3 and 5, check that there is no conflicting value, insert the record, and the statement finishes.

  • Normal index

Find the position between 3 and 5, insert the record, and the statement finishes.

In this case, the only difference between a normal index and a unique index for an update statement is one uniqueness check, which costs a tiny amount of CPU time.

5.2 Target Page Not in Memory

  • Unique index

The data page must be read from disk into memory so that uniqueness can be checked; the record is then inserted and the statement finishes.

  • Normal index

The update is logged in the change buffer, and the statement completes.

Reading a data page from disk into memory involves random I/O, one of the most expensive operations in a database. Because the change buffer avoids this random read, it improves update performance significantly.
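As a concrete sketch of the scenario above (the table definition is assumed to match the examples in this article: id is the primary key and k carries the secondary index):

create table T (
  id int not null,
  k  int not null,
  primary key (id),
  index idx_k (k)          -- a normal index: the insert below can be absorbed by the change buffer
  -- unique key uk_k (k)   -- with a unique index instead, the target page would have to be read first
) engine = InnoDB;

insert into T (id, k) values (400, 4);  -- the record (4,400) discussed above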

6 Index Selection in Practice

So how should you choose between a normal index and a unique index? The two are essentially equivalent in query performance; the main consideration is their impact on update performance. Therefore, it is recommended to choose a normal index whenever possible.

If every update is immediately followed by a query of the same record, the target page is read into memory right away and the buffered change must be merged at once; the change buffer then only adds maintenance overhead, so it is worth turning it off. In all other cases the change buffer improves update performance. The combination of normal indexes and the change buffer is very effective for optimizing updates to large tables.

The change buffer is especially effective on mechanical hard disks, where random I/O is slow. So if you keep a library of historical data on mechanical disks for cost reasons, pay special attention to the indexes on those tables: use normal indexes wherever possible and enlarge the change buffer (for example via the settings sketched below) so that writes to the historical tables stay fast.
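A hedged sketch of the tuning knobs mentioned above (both are dynamic global system variables; adjust them only after measuring your own workload):

-- if every update is immediately followed by a read of the same record, buffering only adds overhead
set global innodb_change_buffering = 'none';

-- for write-heavy history tables on mechanical disks, keep buffering on and give it more room
set global innodb_change_buffering = 'all';
set global innodb_change_buffer_max_size = 50;  -- percentage of the buffer pool; the default is 25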

7 Change Buffer and Redo Log

The redo log (WAL) also improves performance primarily by reducing random disk I/O, so it is easy to confuse with the change buffer. To make the difference clear, let's put both in the same update flow.

7.1 Insertion Process

insert into t(id,k) values(id1,k1),(id2,k2);

Assume that, in the current state of the k index tree, after the insert positions are located, the data page for k1 (Page1) is in memory (in the InnoDB buffer pool) while the data page for k2 (Page2) is not.

  • Update flow with the change buffer. The two arrows in the figure represent background operations and do not affect the update response time.

This update does the following:

  1. Page1 is in memory, so it is updated directly in memory.
  2. Page2 is not in memory, so the message "insert a row into Page2" is recorded in the change buffer.
  3. Both of the above actions are then recorded in the redo log.

Then the transaction completes. The update statement is cheap to execute: it performs two memory writes and one disk write (the two actions above share a single redo log write), and that disk write is sequential.

7.2 How Are Subsequent Read Requests Handled?

select * from t where k in (k1, k2);

Assume this read statement runs shortly after the update, while the data is still in memory. The reads do not involve the system tablespace or the redo log, so those two are not drawn in the figure below.

  • Read process with change buffer

When Page1 is read, the result is returned directly from memory. A common question about WAL is whether the data must first be read back from disk, or the page must be brought up to date from the redo log, before it can be returned. Neither is needed: as the figure shows, even though the on-disk version of the page is still old, the in-memory page is already up to date, and the result returned from memory is correct.

To read Page2, the page is first read from disk into memory, and the operations recorded in the change buffer are then applied to produce the correct version, from which the result is returned. Notice that the data page is not read into memory until Page2 actually needs to be read.

So, comparing the contribution of these two mechanisms to update performance:

  • The redo log mainly saves the I/O cost of random disk writes (it turns them into sequential writes).
  • The change buffer mainly saves the I/O cost of random disk reads.

8 Summary

Because unique indexes cannot use the change buffer optimization, from a performance perspective it is recommended to prefer non-unique (normal) indexes whenever the business allows it.

8.1 On Whether to Use a Unique Index

The main dilemma is that "the business code may not be trustworthy". This article discusses performance under the premise that the business code is guaranteed not to write duplicate data.

  • If the business cannot guarantee uniqueness, or simply requires the database to enforce the constraint, then there is no choice: a unique index must be created. In that case, the value of this article is to offer a troubleshooting direction when you run into a large number of slow inserts and a low memory hit ratio.
  • In some "archive repository" scenarios, you can consider using normal indexes. For example, online data is kept for only half a year and historical data is moved into an archive. Since the archiving process already guarantees there are no unique-key conflicts, you can change the archive table's unique indexes into normal indexes to improve archiving efficiency, as sketched below.
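A hedged sketch of that last point (the table and index names are illustrative; run this only once the archiving process itself guarantees uniqueness):

alter table t_archive
  drop index uk_id_card,
  add index idx_id_card (id_card);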

8.2 Will Change Buffer Data Be Lost if the Host Restarts Abnormally After a Write to the Change Buffer?

No. Although only memory was updated, the change buffer operation was also recorded in the redo log when the transaction committed, so the change buffer contents can be recovered during crash recovery.

8.3 Is Data Written Directly Back to Disk During a Merge?

The merge process is as follows:

  1. Read the data page from disk into memory (this is the old version of the page).
  2. Find all change buffer records for this data page (there may be several) and apply them in order to produce the new version of the page.
  3. Write the redo log.

This redo log entry covers both the change to the data page and the change to the change buffer.

At this point the merge ends. Neither the in-memory data page nor the change buffer has yet been written back to its corresponding location on disk, so both are dirty pages. They will be flushed back to their own physical storage later, which is a separate process.
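If you want to observe change buffer merge activity, one hedged way is the insert buffer section of the InnoDB status output (the change buffer grew out of the older insert buffer, so the section keeps that name):

show engine innodb status\G
-- check the "INSERT BUFFER AND ADAPTIVE HASH INDEX" section for buffered and merged operation counts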

Thinking Question

While constructing the first example, with session A holding a transaction open, session B deleted all the data and then re-inserted it with call idata(), after which the rows field in the explain result went from 10001 to about 37000. If you run delete from t, call idata(), and explain on their own, without session A, the rows field stays around 10000. Why?

If you cannot reproduce this, check:

  • Whether the isolation level is RR (Repeatable Read).
  • Whether the table t was created with the InnoDB engine.

Why does this sequence of operations make the explain estimate inaccurate? The delete statement removes all the data, and call idata() then inserts 100,000 new rows, which seems to simply replace the original 100,000 rows. However, because session A's transaction is open and uncommitted, the old rows it could still see cannot actually be purged. As a result, each of the previous rows now exists in two versions: the old version is the data from before the delete, and the new version is the data marked as deleted. So index a effectively holds two copies of the data.

You might object: "But the rows on the primary key cannot be purged either. Why does explain still estimate about 100,000 scanned rows for the plan that does not use force index?" The reason is that, for the primary key, the optimizer estimates directly from the total number of rows in the table, and for that number it simply uses the value reported by show table status. Incidentally, if your machine's I/O capability is limited when reproducing this, you can set innodb_flush_log_at_trx_commit and sync_binlog to 0.
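A sketch of that last tip, intended only for a throwaway test instance since both settings trade crash safety for speed:

set global innodb_flush_log_at_trx_commit = 0;  -- do not force a redo log flush on every commit
set global sync_binlog = 0;                     -- let the OS decide when to sync the binlog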
