Indexing is an essential part of your interview, so let’s take a look at the world of MySQL indexing.

preface

Have you always been a hodgepodge of information about MySQL indexes, as if you know everything, and if you dig deep you may not be able to answer any of them.

If you go to an interview and the interviewer asks you to talk about your understanding of indexes, and your understanding of indexes is limited to the fact that retrieving data is fast, it’s a data structure, you go home and wait to be told.

In order to avoid such embarrassing things, Kaka spent two days to sort out the contents of the index within the scope of his own understanding. If the sorting is not comprehensive, he can add and make suggestions in the comments section.

What is an index for MySQL

I believe that most partners have bought technical books, did not finish to see do not know, but the number of directories to see for sure the most.

See if the directory has its current pain point, if there is, according to the corresponding page number of the directory with the fastest speed to read the corresponding content location.

The same is true in MySQL, where an index is a data structure used by the storage engine to quickly find data

MySQL also divides into several types of index, respectively for b-tree index, hash index, spatial index, full text index.

Everything below is discussed on the basis of Innodb.

Why use indexes

Indexes speed up data retrieval, which is the main reason for using indexes.

The index itself is sequential, so that when a range query is performed, the retrieved data is already sorted, avoiding the problems of the server reordering and creating temporary tables.

The underlying implementation of indexing is itself sequential, with disk prefetch allowing access to data on disk to be addressed roughly sequentially, that is, converting random I/ OS into sequential I/ OS.

These points do not understand temporarily put, continue to see below can, will give you a satisfactory explanation.

There are always two sides to everything, and if you can provide performance improvements, there will be additional costs in other areas.

Indexes coexist with data and therefore take up extra storage space.

Index creation and maintenance require time costs, which increase with the volume of data.

Index creation degrades the performance of adding, deleting, and modifying data because the index data needs to be modified at the same time as the data is modified.

Why Innodb uses B+Tree instead of BTree

BTree, BTree, BTree, BTree, BTree, BTree, BTree

1. The Btree parsing

First look at the data structure is like BTree, kaka here to provide a website address, https://www.cs.usfca.edu/~galles/visualization/Algorithms.html, you can see some of the implementation process about data structure.

Let’s take a look at the BTree data structure. Here’s how Kaka has filled in the data.

There is an unfamiliar area about Max. Degree, which you can understand as Degree or Degree.

For example, if the current value is set to 4, a maximum of three records can be stored in a node. If set to 5, a maximum of four records can be stored in a node.

You can now see that only three data have been inserted so far.

So if you add one more piece of data, the node will split, and this verifies that when the order is set to n, a node can hold n-1 pieces of data.

So let’s insert some more data.

To achieve fast data retrieval, it needs to meet two characteristics, one is order, the other is balance.

As can be seen from the following figure, BTree has a certain order, and its balance is more satisfied. You can see the first figure generated in the article.

So how to find a value in BTree!

Let’s say I’m looking for a value of 9, so let’s look at the process.

The first number you see is 4, and 9 is greater than 4, so you look to the right of 4.

I’m going to continue to find nodes in the range 6 to 8, 9 is greater than 8, so I’m going to go to the right node.

The most important step is to find data 9. This process is the execution of the BTree data structure to find data.

Now that we know about the BTree data structure, let’s take a look at how BTree is stored in MySQL.

In the figure below, P represents a pointer to the next disk block.

16 and 24 in the first node represent what our key is.

Date is what the row of records corresponding to this key value is.

If you want to find the record with key 33, what should you do?

33 is between 16 and 34, so it’s going to look for disk 3.

Check disk 3. The pointer points to disk 8.

Data 33 can be retrieved on disk 8 and returned.

So how many pieces of data are read in the process?

There are a few things you need to know before you do that.

Starting with MySQL5.7, the storage engine defaults to InnoDB, and the smallest disk unit innoDB storage engine uses to manage data is pages.

There are several types of this page, including data page, Undo page, system page and transaction data page.

Generally speaking pages are data pages. The default page size is 16kb, and each page stores at least two or more row records.

Therefore, according to the BTree data search process, we can know that a total of three disks are read, so the size of each disk is 16kb.

However, the current case looks for three layers, so the data stored in the three layers is 16KB * 16KB * 16KB = 4096KB.

If the memory required is 1KB per record, then the three-tier BTree can store 4096 records.

The data of each database is less than millions, more than tens of millions of data, then the BTree level will be deeper and deeper, the relative query efficiency will be slower and slower.

Is it time to consider a question, that is why in Btree 48KB memory can only store 4000 records

When calculating the size of the data, the memory of the pointer address and the key are not included, only the memory of the data is calculated.

In the BTree structure, nodes not only store keys, pointer addresses, and corresponding data, so the data stored in a single disk is relatively small.

In order to solve the problem of the small amount of data stored by a single node, another structure has been evolved, namely B+Tree, which is mentioned below

2. The B + Tree parsing

Again, look at the B+Tree data structure.

To facilitate comparison, the data structures of BTree and B+Tree are grouped together.

So it can be seen that in B+Tree, the leaf node stores the full amount of data, while the non-leaf node only stores the key value.

Yi! Isn’t this a good solution to the problem caused by BTree? You can make each node store more data. The more data each node stores, the relative depth of the tree will not be too deep.

Now that we know about the data structure of B+Tree, let’s take a look at how B+Tree is stored in MySQL.

It is obvious from the figure above that the two points are different.

First point: all data of B+Tree is stored on leaf nodes.

Second point: B+Tree is a chain ring structure between all leaf nodes

So how many pieces of data are read in the process?

If the data read depth of B+Tree is the same as that of b-tree, the size of each disk is 16kb.

How much data can be stored in a non-leaf node of a B+Tree? Typically we have a primary key for each table.

The first and second layers store key values, which are the primary key values.

The first layer node can store 16 * 1000/10 = 1600. The first layer node can store 16 * 1000/10 = 1600.

Similarly, each layer 2 node can store 1600 keys.

The third layer is leaf nodes. The storage size of each disk is the same as that of BTree. Each piece of data occupies 1KB.

In B+Tree, 1600 * 1600 * 16 = 40960000 data can be stored at three layers

From this point of view, the data stored by B+Tree is not at the same level as the data stored by BTree.

So the conclusion can be drawn:

B+Tree ensures that the amount of data retrieved and stored is the largest compared with BTree

B+Tree Select an index of a type that occupies a small memory space, such as an int.

The smaller the memory size of the key, the larger the storage area in the node.

3. A Hash index

Alter table user add index hash_gender using hash(gender);

The storage engine uses InnoDB.

Innodb hash index (Btree) innoDB hash index (Btree) innoDB hash index (Btree)

There is a special feature in the Innodb storage engine called adaptive hash index. When the index value is used very frequently, it will create a hash index in memory based on the BTree index, so it has some features of hash index, such as fast lookup

Hash index is realized based on hash table. Assuming that a hash index is established for name, the search process is shown in the figure below. Hash table is a data structure accessed according to key-value pairs, and it allows the retrieved data to be mapped to the corresponding position in the hash table through hash function, with very high search efficiency.

Hash indexes store hash values and row Pointers, no storage of key values, field values, but hash indexes are mostly completed in memory, data retrieval is very fast, so it does not affect performance.

Hash indexes are not sorted by index value and therefore cannot be sorted.

MySQL > alter table hash index =, in, <>

Hash indexes cannot avoid table scans at any time

Hash index In the event of a large number of hash conflicts, the storage engine must traverse all row Pointers of the linked list, comparing them row by row.

4. B+Tree is different from BTree

After a very long calculation, drawing now the basic difference between the two have a certain understanding of it!

Here’s a summary.

  • B+Tree Leaf nodes store full data (keys +data), whereas non-leaf nodes store only keys
  • B+Tree stores much more data at the same depth than BTree.
  • B+Tree Each leaf node has a link to the next leaf. The advantage of this is that we can start at any leaf node and get all the following data.

5. B+Tree is suitable for indexing

B+Tree non-leaf nodes only store key values. Therefore, compared with BTree nodes, they can store more data and read more key values into the memory each time, thus reducing the I/O

The search efficiency of B+Tree is stable. Any data must be searched from leaf nodes to non-leaf nodes. Therefore, the search efficiency of each data is almost the same.

The leaf nodes of B+Tree store the full amount of data and are ordered. Therefore, all keys can be scanned only by traversing the leaf nodes, which is more efficient in range search.

Innodb storage engine uses B+Tree as index.

4. Differences between clustered index and non-clustered index

Clustered indexes and non-clustered indexes are also called primary indexes and secondary indexes.

How to distinguish clustered indexes from non-clustered indexes?

First look at the Innodb engine, create table generated files, you can see that there are two IBD files.

See here do not know if you have any questions, why see some articles will also have FRM files! But not here!

After MySQL8.0, the source data is stored in the tablespace, so there are no FRM files.

This IDB file is known to store data information and index information.

Then take a look at the Myisam storage engine create table production file.

You can see that creating a table generates three files with extensions MYD, MYI, and sdi.

MYD: is a table data file (a file that holds data)

MYI: is a table index file (the file that holds the index)

One conclusion can then be drawn

As long as the data and index are stored in the same file, it is a clustered index, otherwise it is a non-clustered index.

Select * from idB; select * from IDB; select * from IDB;

Remember this: In Innodb, data inserts must be bound to an index value. If there is no primary key, select a unique index. If there is no unique index, select a 6Byte ROWId.

If there are multiple indexes in the table, how is the data stored

Innodb storage engine can generate multiple IDB files if there are multiple indexes.

In Innodb, only one copy of data is saved, and if there are multiple indexes, multiple B+ trees are maintained

For example, table fields ID, name, age, sex.

Id set as primary key index (cluster index), name set as normal index, so how many copies of the data will be stored!

No matter how many indexes are set in a table, only one copy of the data is stored, but the table maintains multiple B+ trees.

In this case, if id is the primary key index and name is the normal index, two B+ trees will be maintained in this table.

The id primary key index is stored with the data. The leaf node of the B+Tree where the name primary key index is stored stores the primary key ID value.

The corresponding graph is the following two pictures, you can have a good look.

One final point: In Innodb, there must be clustered indexes, other indexes are non-clustered indexes.

It is briefly mentioned that myISam only has non-clustered indexes.

Several technical terms of index

In the interview will often ask these keywords, respectively back table, cover index, the most left principle, index push down, must know ha!

1. Back to the table

There are all kinds of explanations for back tables on the Internet, and they can be easily understood, but only if you have to distinguish between clustered indexes and non-clustered indexes.

Again, use the above example where id is the primary key index and name is the normal index.

Select id,name,age from table where name = ‘kaka’

Select * from B+Tree; select * from B+Tree; select * from B+Tree;

The process of jumping from a non-clustered index to a clustered index is called a back table. That is, if the non-clustered index does not contain all the fields to be queried, the back table is called a back table.

In this case, the leaf node with the non-clustered index name only has id but no age, so it jumps to the clustered index and returns the required field data when querying the whole record according to the ID.

2. Overwrite indexes

An overwrite index is created for all fields in the query.

Select id,name from table where name = ‘kaka’

So this statement is used to cover index, because the id and name fields for indexing query fields is also the two fields, so called index.

That is, a non-overwritten index is called an overwritten index when the leaf node contains the field to be queried

3. Left-most match

The left-most matching principle exists in composite indexes.

Table id, name, age, sex;

Set name and age to composite indexes.

Which of the following statements does not comply with the leftmost rule.

select * from table where name = ? and age = ?

select * from table where name = ?

select * from table where age = ?

select * from table where age= ? and name= ?

You can take a quiz by yourself! Only the third statement does not use the index. The other three statements follow the leftmost rule.

About this leftmost principle is far more than so simple, a try is a pit, about this part of the content of kakaka later will be mentioned in the optimization article.

4. Index push down

Use the same SQL statement.

select * from table where name = ? and age = ?

Index push-downs occur in MySQL5.6 and later versions.

The previous query process was to get the data in the storage engine based on name, and then filter it at the server layer based on age.

With index push-down, the query process is to obtain data from the storage engine based on name and age and return the corresponding data, without filtering the data from the server layer.

If a Using index condition occurs when you analyze SQL statements Using Explain, you are Using index push-downs. Push-downs are most likely to occur when combining indexes.

Where is the index stored

Indexed data files are stored on disk and also need to be persisted.

However, when indexes are used, data is read from disk into memory in blocks.

This is where the concept of an operating system comes into play, the operating system is fetching data from disk, let’s say it’s fetching 1KB of data, but instead of just fetching the 1KB you need, the operating system is fetching 4KB of data.

Why is it 4KB? Because on an operating system, a page of data is 4KB.

Why do you need 1KB to fetch a whole page of data?

That brings up another concept, the principle of locality: data and programs tend to cluster together, and once they access one piece of data, they are more likely to access that piece of data and its neighbors again.

So MySQL’s Innodb storage engine also uses this locality principle when it reads data, which is 16kb at a time.

The default page size in the Innodb storage engine is 16KB. This parameter can be adjusted as well. The parameter is innodb_page_size.

One final point:

Index data is stored on disk and is read from disk to memory on a page-by-page basis.

Why not just store it in memory! Do you have this doubt!

If the index data is only stored in memory, then when the computer shuts down and the server goes down, it needs to regenerate the index, which is very inefficient.

Eight, summary

The above is kaka’s understanding of the index, as far as possible in the knowledge point comprehensive.

If there are any omissions or mistakes in the article, please give your suggestions.