define

What is an index? Before you learn a thing, know what you are learning.

In a relational database, an index is a storage structure that sorts the values of one or more columns in a database table. It is a set of values of one or more columns in a table and the corresponding logical pointer list to the data page in the table that physically identifies these values. Index function is equivalent to a book catalog, you can quickly find the required content according to the page number in the catalog.

Let’s chew through these bold keywords. When we don’t have indexes in our query conditions,mysql can only do sequential IO to compare the items, just like you can only go through the pages of the page one by one. If there are 1000 items of data, where you query the data, it has to do IO many times, maybe 1000 times. We know that disk IO efficiency is very slow. How to solve, index, index is essentially a data structure, the purpose of its existence is to let us avoid a large number of sequential IO, with specific data structure and search methods to reduce the I/O times as much as possible.

Indexes are not unique to mysql. They are available in any file or data search system, such as Windows operating system, to help us find corresponding files more quickly.

To speed things up, trees are a good data structure. Mysql in index design is to use B+ tree structure to sort and store data, whether clustered index auxiliary index joint index, in essence is to search on the structure of B+ tree.

Now that WE’re talking about B+ trees, we need to expand on why B+ trees are the dominant data structure. There are so many types of trees, why use B+ trees? Here are some common 🌲 trees 🌲.

The data structure

Ordinary binary tree

It is an ordinary tree, characteristic

1: The left byte point must be smaller than the right child node.

2: Each node has a maximum of two leaf nodes.

With this structure, you don’t need sequential IO anymore, you can passBinary searchTo find the data we want, the insertion and lookup time is order log(N), which does slow down the I/O count somewhat. Cons: ButEasy to imbalances, like this. Dynamic chart:

As the amount of data becomes larger and larger, the tree level becomes higher and higher, and the number of queries and IO increases significantly. So this definitely doesn’t work for mysql. Then you need to optimize it so that it needs to auto-balance, and determine that the left and right nodes can’t be more than one level apart each time you write, otherwise it will auto-balance left-handed or right-handed.

Balanced binary trees

The figure below Dynamic chart It looks like it’s balanced, but if you increase the number of nodes, the tree level will increase, and the I/O count will go up again, so it’s not suitable for mysql either.

Disadvantages: tree rank or high.

B tree

Further optimization based on the balanced binary tree, so that each node can put more leaf nodes, tree height is reduced, IO is also reduced. This is where B trees (also known as balanced multipath lookup trees) are needed, as shown below In the GIF, write:

GIF query:

It looks like the tree is balanced, and multiple paths can also reduce the height of the tree. You may have the audacious idea of storing all your data on a single leaf node, or 100 million data in IO three times at most.

How mysql interacts with disk to read data.

The storage engine mentioned above (InnoDB, myIsam), its base storage unit is pages. When we send a query,mysql will read data from disk, but disk I/O efficiency is very poor, so there are some small optimization measures,mysql will do a read in addition to loading the data to read, but also load the data near the address. This action is called prefetch. Mysql can load 16KB** of data at one time, so the size of a page is 16K(note that each square represents a page, and the content of a page is one IO read), so you can calculate how much data a node can hold based on the size of each key. However, in addition to the key, the row data of the B tree also exists on each node. By looking at the size of a row data, we can calculate how big the data of a node is. If the data is big, the data that can be stored on that page will be less, and there will be more nodes.

B + tree

So in order to get as much data on each node as possible, you need to tighten your belt. B+ trees continue to optimize the data store of nodes, the only difference is that B+ trees no longer store data on non-leaf nodes, just pointer addresses. The following figure

In the GIF, write:

As can be seen from the structure diagram of B tree and B+ tree, there is little difference between them in tree structure. They are both multi-path balanced search trees. The differences lie in the following two points:

1: there is only key on all non-leaf nodes of B+ tree, while data is only stored in leaf nodes. The space occupied by data on each non-page sub-node is less. Let’s calculate how much data can be stored in 16KB. (B+ tree height of three levels can store 20 million data, if each data is about 1KB).

First we figure out how many Pointers we can hold on non-leaf nodes. Each pointer is 6 bits If the id is an integer, there are 14 bits in total, and a page is 16 KB, that is, 16,384 bits. Therefore, a layer of non-leaf nodes can have a maximum of 16384/14=1170 Pointers. A layer of third-order tree with 16 leaf nodes can store 1770 * 16 Pointers, each pointer pointing to a page, just figured out 1170 Pointers per page, so multiply that by 1770, so you end up with 1170 * 1170 * 16= 21902,400 records. That is to say, in a third-order B+ tree can put 20 million data magnitude,3 IO can be queried. Same thing with B tree, except that the non-leaf nodes of B tree don’t just have ids, they also have data, so instead of 8+6=14 on a page, it might be bigger, so there’s less stuff in the same tree.

Tiling: if more than this number, number of IO will increase, so we need to know when design table how much each line of data, to ensure that each table in the case of three layers of tree height up to how much data can be stored, more than the order of magnitude, may need to introduce some work table, or data migration, it is also a big table to optimize some means.

2: B tree there is a fault at the same time, the scope of the query efficiency is not high, because when you look up to the child node found no way, the need to return to the root node traverse again, in the B + tree, each leaf node connection with two-way pointer, so when the range queries, you can through the leaf nodes of the two-way pointer for fast lookup.

tiling: Why do we usually have toThe primary key is required to increment? (both database primary key increment and distributed algorithm increment). It also has to do with the structure of the tree. Take a look at the GIF below.

Of the B + tree data in order to write, but skip the 4, if inserted a 4 key, behind the tree in order to maintain the balance of nodes, will be for left-handed or right-handed, in the case of a large amount of data or writing intensive, since the balance of movement is very resource-intensive, so to avoid this situation, we will let write key sequentially from increase! (Know what it is, know why it is)

The index infrastructure B+ tree and the benefits of B+ trees have been introduced, and the reason for the fast index is generally clear. Let’s look at the specific index types.

In addition to the tree structure, indexes are commonly used in hash table structures. Similar to Java hashMap, indexes store data in key-value pairs. Key stores index fields, and value stores row data or addresses. The hash structure has a very high query efficiency and time complexity of O(1), but it does not support range query, so it is not applicable to the scenario, it is good to know.

The index type

First, the index is divided into two blocks

  • Clustered index (also called primary key index or clustered index)
  • Non-clustered indexes (secondary indexes, union indexes, full-text indexes, etc are all non-clustered indexes)

Why is it so divided, and read on

The index of the innodb

Clustered index:

Innodb automatically creates a rowId to build a clustered index. If there is no primary key, innoDB will automatically create a rowId to build a clustered index. Innodb has a clustered index first (MyISam doesn’t really have a clustered index, more on that later). Look at this picture.

When we want to query 36, we first conduct IO twice and find the location of the ID. The leaf node of B+ tree can store data directly, so we directly locate the row data corresponding to the ID.

Total I/O count: 3

Secondary index

Select * from primary key; select * from primary key; select * from primary key; select * from primary key; IO count: a maximum of six times

Extension: This requery of row data based on the primary key ID is called a table-back query. So we know why auxiliary is called, auxiliary is to find the eldest brother (primary key), and then according to the primary key to find the final data you want. No head snatching. And of course there’s an optimization for this, which WE’ll talk about later.

Speaking of here can point to the question again, the reason for such points above. Innodb’s primary key index can fetch rows directly, while non-primary key indexes need to return to the table, which increases Io times. This explains why primary key indexes are the fastest.

Joint index

Join index (s) : join index (s) : join index (s) : join index (s) : join index (s)

IO count: a maximum of six times.

Extension point: leftmost matching principle. How do you sort a joint index? Based on the number of joint indexes you have, like ABC, sort by A first, Again in a equal conditions, according to the b sort, under the condition of a and b are equal, according to c, when you set up joint index (a, b, c), at the same time, equivalent to set up (a) (a, b) (a, b, c), three indexes, so when the where the back of the conditions to make a joint effect index, conditions must be according to the order, namely, there must be a, You have b, you have C, and if one of the indexes is missing, all the other conditions are invalid. Why doesn’t a b or C work? You see, my sorting is to start from a, only to ensure the order of A can check B, only to see the order of B is chaotic, then you directly give me a plug WHERE I should go? Only full table scan ah. So the index is dead. Select * from table where a=xx and b=xx or select * from table where b=xx and A =xx, that’s fine,mysql isn’t stupid enough to recognize it, and the query optimizer will adjust it for you.

Above, the basic content of the index is introduced, and the index has a close relationship with the following content.

The index of the myisam

Innodb and myiSam were mentioned when talking about primary key indexes. The index difference is that myIsam can be understood as only a secondary index, because its leaf node does not hold data, but only Pointers to data. Look at the picture:IO count :4. There is a difference between innoDB’s secondary index and this one!It gets the pointer and goes straight to disk to get the corresponding data, no return tableBut still one more IO. So the efficiency of the query will be relatively low. (But they are both B+ trees, and some people get confused and think myiSam is not a B+ tree without data, depending on whether your leaf node only holds Pointers).

Its other indexes will not be drawn, because either primary key index or secondary index. The query path is the same as this one, which is to find the address and then look again.

Just because MyISam doesn’t support primary key indexes doesn’t mean it can’t set primary key indexes! Myisam also has a primary key index, but it is mainly used to indicate that the field is unique. As mentioned earlier, it does not really have a clustered index, because it is unique except the query path and structure is basically the same, do not say that it has no primary key.

The storage engine

MyIsam and InnoDB are two database engines provided by mysql and specified when building tables. They have different file structures and can be used in different scenarios, so let’s take a look at the differences

innodb myisam
Storage file distinction Storage file distinction
Support transactions Transactions not supported
Support foreign keys Foreign keys are not supported
Do not save the exact number of rows of the table (select count requires full table scan) Holds the number of rows for the entire table
Full text indexing not supported (supported after version 5.8) support
Table and row (default) level locking is supported Only the table lock

In this installment we will focus only on the storage file differences, which relate to their index differences as we mentioned above.

Let’s start with the differences between myIsam and InnoDB index files. MyIsam’s table has three files:

  • Tablename. FRM: table structure file
  • Tablename.myd: Data file (MyISAM Data)
  • Tablename. MYI: Index file (MyISAM Index)

Innodb tables have two files:

  • Tablename. FRM: table structure file
  • Tablename.ibd: Index and Data file (InnoDB Data).

MyIsam does not have a primary key index because its index is separated from the physical data file. To get the address, you must go to the data file to obtain the specific row data content.

Optimization of indexes

The above index is basically introduced, it is not brainless index can be, everything is fine. Absolutely not! First of all, indexes bring two drawbacks: 1. The index files of different storage engines are different, but they all need to occupy physical disk space. Each increase of data adds an index, which leads to excessive disk space occupation. 2: Because each write and delete will adjust the index, the throughput will be affected when the write is frequently deleted and the db load will be increased.

The first optimization recommendation is 1: There is no need to index fields that are heavily duplicated (gender, status). 2: Create indexes for frequently queried hot fields. If there are multiple query conditions, try to create composite indexes to save space costs.

Good steel is used for the cutting edge

Mysql > alter table select * from table where index (select * from table); mysql > alter table select * from table where index (select * from table); mysql > select * from table where index (select * from table);

Note that this scenario is not commonly used. In many query scenarios, we actually need many fields. It is impossible to set indexes for all fields, so we still need to separate scenarios. This setting is possible only if the required fields are few and hot.

Secondly: there will be a lot of indexes created after the invalid, in addition to the left matching mentioned above, and such as the following

  • Implicit conversion – The data type of the query condition and the data type of the field do not match.
  • The search conditions can be “IS NULL” or “Is not NULL”.
  • Query criteria are used by functions or calculation operations. For example concat(‘ jingxi ‘,1) does not run the index;
  • An OR condition, any condition of an OR join without an index, is invalidated.
  • The cost calculation of the query optimizer causes the index not to walk.
  • Some of these reasons should be inferred from the above analysis, but I won’t go into details here. You can use explan to see if and what indexes this SQL query uses.

You need to be aware of these pitfalls in order to avoid writing risky SQL. The stress of db failure is obvious, so don’t ignore it.

Conclusion:

This article mainly focuses on index, which is introduced by index. The essence of index is a kind of data structure,B+ tree, based on the multiple index types of B+ tree, clustered index, auxiliary index, joint index and their concepts and differences. By the way, it brings about the interaction between mysql and disk. Then we introduce innoDB and myIsam storage engines based on clustered indexes. The difference between innoDB and myIsam storage engines is the difference between innoDB and myIsam storage engines, and the difference between innoDB and myIsam storage engines is the difference between innoDB and myIsam storage engines. Here are the index differences:

The storage engine The data structure Support the index
innodb B+ tree (leaf nodes can store data and IDS) Primary key (required), secondary index, joint index
myisam B+ tree (leaves only have Pointers) Primary key (optional, no difference except unique), secondary index, union index

SQL does not optimize, the boss optimize you).

But some of the concepts mentioned above can be extended, for example

  • How pages are assigned (if you can go back and see that my page number is randomly assigned, it is randomly assigned, very fine).

  • The above mentioned to all kinds of trees can be understood again, what tree order traversal ah, red black tree ah.

  • SQL optimization details, query optimizer decision logic, how to do large table pagination,explain various field meanings, mysql5.7 after optimization, pull down technology (interview frequency). More differences between the two data engines, usage scenarios.

Just look it up and learn how to use it through some mature business tools. Learning is constantly expanding and forming their own sea of knowledge.

The next article will talk about mysql lock, also a bunch of concepts, online easy to cause problems, the interviewer love disk, continue to disassemble it!

Code calligraphy and painting is not easy, if you feel that there is a harvest or wrong place please leave a message to exchange yo. I am Fan Fan, who has been working in the orange factory for 4 years and is now working as a service developer. I hope to bring you different ideas by settling down what I see and think! See you next time, go