Background

I believe that when it comes to database optimization, everyone brings up indexes, and I am no exception. Most of us can rattle off points one, two, three, and the data structure optimizations and the page cache can each be covered in a few words. But once, in an interview, an Alibaba P9 asked me: can you walk through the process of loading index data, starting from the computer level? (He just wanted me to talk about IO.)

I died on the spot... The basics of computer networks and operating systems really are my blind spot, but I have since made up for it. Without further ado, let’s start from how a computer loads data and talk about indexes from a different angle.

Main text

An index in MySQL is essentially a data structure.

Let’s first take a look at how a computer loads data.

Disk IO and prefetch:

Let’s talk about disk IO first. A disk reads data by mechanical movement. The time for each read has three parts: seek time, rotational latency, and transfer time.

The seek time is the time the disk arm needs to move to the specified track, usually less than 5ms.

The rotational latency depends on the rotation speed. On average it is the time for half a turn, so for a 7200 RPM disk the average rotational latency is 60*1000/7200/2=4.17ms.

The transfer time (copying the data to memory) is short and negligible compared with the other two, so an average IO takes around 9ms. That sounds fast, but in a table with a million rows, one IO per row adds up to 9,000 seconds, which is a disaster.
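To make the arithmetic concrete, here is a small back-of-the-envelope sketch in Python. The 5ms seek time is the upper bound quoted above, and "one random IO per row, no caching" is my own simplifying assumption, not a claim about how MySQL actually reads data.

```python
# Rough estimate of one random disk read, using the figures from the text.
SEEK_MS = 5.0   # assumption: take the "less than 5ms" upper bound as the seek time
RPM = 7200      # rotation speed from the example

ms_per_revolution = 60 * 1000 / RPM              # one minute is 60,000 ms
rotational_latency_ms = ms_per_revolution / 2    # on average, half a turn
io_ms = SEEK_MS + rotational_latency_ms          # transfer time ignored as negligible

ROWS = 1_000_000                                 # assumption: one random IO per row
total_seconds = ROWS * round(io_ms) / 1000       # round to the ~9ms used in the text

print(f"rotational latency: {rotational_latency_ms:.2f} ms")
print(f"one IO: about {io_ms:.2f} ms")
print(f"{ROWS} random IOs: about {total_seconds:.0f} s")
```

With these inputs the rotational latency comes out at about 4.17ms and a million random IOs at about 9,000 seconds, which matches the numbers in the text.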

Because disk I/O is so expensive, the operating system optimizes it with read-ahead (prefetch): on each I/O it reads into the memory buffer not only the data at the requested disk address but also the adjacent data. This works because of locality: when a program accesses the data at one address, the adjacent data is likely to be accessed soon as well.

The data read by each IO is called a page. The size of a page depends on the operating system and is usually 4KB or 8KB. In other words, when we read any data within one page, only one IO actually happens.

(This suddenly reminds me of a question I was asked just after graduation: on a 64-bit operating system, how many bytes is an int in Java? What is its maximum value? Why?)

So if we want to optimize database queries, we want to keep the number of disk I/O operations as low as possible, and that’s where the index comes in.
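A tiny sketch of why pages matter: count how many distinct pages a set of byte offsets falls into. The 4KB page size, the 16-byte record size, and the helper names `page_of` and `pages_touched` are all assumptions I made up for illustration.

```python
PAGE_SIZE = 4 * 1024  # assumption: 4KB pages


def page_of(offset: int) -> int:
    """Return the index of the page that contains this byte offset."""
    return offset // PAGE_SIZE


def pages_touched(offsets):
    """Return the set of pages that must be read to serve these offsets."""
    return {page_of(o) for o in offsets}


# 100 adjacent 16-byte records: they all sit inside the first page.
adjacent = [i * 16 for i in range(100)]
# 100 records scattered so that each one lands in a different page.
scattered = [i * PAGE_SIZE for i in range(100)]

print(len(pages_touched(adjacent)))   # 1
print(len(pages_touched(scattered)))  # 100
```

Under this simplified model, the adjacent records cost one page read while the scattered ones cost a hundred, which is the whole point of keeping related data close together.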

What is an index?

An index is a data structure that helps MySQL retrieve data efficiently.

There are two physical types of indexes commonly used in MySQL: B-tree index and hash index.

This topic focuses on the BTree index.

BTree index

A BTree is also called a multi-way balanced search tree. A BTree of order m has the following properties:

  • Each node in the tree has at most m children.
  • Except for the root and the leaf nodes, each node has at least ceil(m/2) children (ceil() rounds up).
  • If the root node is not a leaf node, it has at least two children.
  • All the leaf nodes are on the same level.
  • Each non-leaf node consists of n keys and n+1 pointers, where ceil(m/2)-1 <= n <= m-1.
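The bounds in the list above are easy to compute for any order m. This is just a sketch; `btree_bounds` is a helper name I made up.

```python
import math


def btree_bounds(m: int) -> dict:
    """Child and key limits for a non-root internal node of an order-m BTree."""
    return {
        "max_children": m,
        "min_children": math.ceil(m / 2),
        "min_keys": math.ceil(m / 2) - 1,
        "max_keys": m - 1,
    }


print(btree_bounds(3))
print(btree_bounds(1024))
```

For m = 3 this gives 1 to 2 keys and 2 to 3 children per internal node.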

The figure shows a BTree of order 3 (just as an example; a real tree has a much larger fan-out). Each square is what we call a disk block, which is what the operating system reads into memory in one IO; one block corresponds to four sectors. Purple represents the keys stored in the disk block, yellow represents the data that belongs to each key, and blue represents the pointers P, which point to the location of the next disk block.

Let’s simulate the process of finding the data with key 29:

1. Following the root pointer, read the root disk block 1 of the file directory. [1st disk I/O]

2. Disk block 1 stores the keys 17 and 35 and three pointers. Since 17<29<35, we take pointer P2.

3. Following pointer P2, we locate and read disk block 3. [2nd disk I/O]

4. Disk block 3 stores the keys 26 and 30 and three pointers. Since 26<29<30, we take pointer P2.

5. Following pointer P2, we locate and read disk block 8. [3rd disk I/O]

6. Disk block 8 stores the keys 28 and 29. We find 29 and obtain its corresponding data.
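The six steps above can be sketched in a few lines of Python. I only know the contents of the three blocks on the search path from the walkthrough, so every other child pointer is left as `None` instead of being guessed. The `DISK` dictionary and the `search` function are made up for illustration, and a dictionary lookup stands in for one disk I/O.

```python
from bisect import bisect_left

# Only blocks 1, 3 and 8 are described in the walkthrough; all other
# children are unknown here and left as None on purpose.
DISK = {
    1: {"keys": [17, 35], "children": [None, 3, None]},
    3: {"keys": [26, 30], "children": [None, 8, None]},
    8: {"keys": [28, 29], "children": [None, None, None]},
}


def search(key, root=1):
    """Return (found, blocks_read); each block read models one disk I/O."""
    reads = []
    block_id = root
    while block_id is not None:
        block = DISK[block_id]          # stands in for one disk I/O
        reads.append(block_id)
        keys = block["keys"]
        i = bisect_left(keys, key)      # binary search inside the block
        if i < len(keys) and keys[i] == key:
            return True, reads          # key (and its data) live in this block
        block_id = block["children"][i] # follow pointer P(i+1)
    return False, reads


print(search(29))  # (True, [1, 3, 8])
```

The key 29 is found after reading blocks 1, 3 and 8, that is, three disk I/Os. Searching for keys that live in the omitted blocks will report "not found" in this sketch simply because those blocks are not modeled.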

So with a BTree index, the data that each disk I/O brings into memory is actually put to use, which improves query efficiency.

But is there anything that can be optimized?

As we can see from the figure, each node contains not only the keys but also the data values. The storage space of each page is limited, so if the data on each node (that is, a page) is large, the number of keys the page can hold is small. When the table also holds a lot of data, the BTree becomes deep, which increases the number of disk I/Os and hurts query efficiency.

B+Tree index

B+Tree is an optimization of BTree that makes it better suited to implementing index structures on external storage. In a B+Tree, all data records are stored in key order on leaf nodes at the same level, and non-leaf nodes store only keys. This greatly increases the number of keys each node can hold and reduces the height of the B+Tree.

B+Tree differs from BTree in the following aspect:

Non-leaf nodes store only keys, and data records are stored in leaf nodes. This optimizes the BTree from the previous section: since the non-leaf nodes of a B+Tree store only keys, the height of the B+Tree can be kept very low.

The specific data are as follows:

The InnoDB storage engine uses a page size of 16KB. A typical table’s primary key is an INT (4 bytes) or a BIGINT (8 bytes), and a pointer is also generally 4 or 8 bytes. That is, one page (one node in the B+Tree) stores about 16KB/(8B+8B)=1K keys (taking K as 10^3 to keep the estimate simple).

This means that a B+Tree index of depth 3 can maintain 10^3 * 10^3 * 10^3 = 1 billion records. (This estimate is not exact: the leaf level is not counted, and if it is counted the depth is actually 4.)

We need only three IO operations to find the data we want among a billion records. Compared with the 9,000 seconds for a million records earlier, that is better by I don’t know how many orders of magnitude.
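The estimate above can be reproduced with exact numbers instead of the rounded K = 10^3. This sketch assumes 8-byte keys and 8-byte pointers and ignores page headers and other per-page overhead, so treat it as an upper-bound illustration rather than InnoDB’s real capacity. The function name `capacity` is made up.

```python
PAGE_SIZE = 16 * 1024   # InnoDB page size from the text
KEY_BYTES = 8           # assumption: BIGINT primary key
POINTER_BYTES = 8       # assumption: 8-byte pointer

# Keys (and child pointers) that fit in one non-leaf page, ignoring overhead.
fanout = PAGE_SIZE // (KEY_BYTES + POINTER_BYTES)


def capacity(levels: int) -> int:
    """Number of entries reachable through `levels` levels of non-leaf pages."""
    return fanout ** levels


for levels in (1, 2, 3):
    print(levels, capacity(levels))
```

With a fan-out of 1024, three levels reach 1024^3 = 1,073,741,824 entries, which is where the "about 1 billion" figure comes from.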

In addition, a B+Tree usually has two head pointers, one pointing to the root node and the other to the leaf node with the smallest key, and all the leaf nodes (i.e. the data nodes) are linked together in a chain-ring structure. So a B+Tree supports two kinds of lookup: range lookups and paging lookups on the primary key that walk along the leaf chain, and random lookups that start from the root node.
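Here is a minimal sketch of a range lookup that walks the leaf chain from the leaf with the smallest key. For simplicity I model the chain as a singly linked list with only keys in it; the `Leaf` class and both helper functions are hypothetical names, not anything from MySQL.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Leaf:
    keys: List[int]
    next: Optional["Leaf"] = None   # link to the next leaf in key order


def build_leaf_chain(sorted_keys: List[int], per_leaf: int) -> Optional[Leaf]:
    """Split sorted keys into leaves and link them; return the leftmost leaf."""
    head = None
    prev = None
    for i in range(0, len(sorted_keys), per_leaf):
        leaf = Leaf(sorted_keys[i:i + per_leaf])
        if prev is None:
            head = leaf
        else:
            prev.next = leaf
        prev = leaf
    return head


def range_scan(head: Optional[Leaf], low: int, high: int) -> List[int]:
    """Collect keys in [low, high] by walking the chain from the leftmost leaf."""
    result = []
    leaf = head
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return result       # keys are sorted, so we can stop early
            if k >= low:
                result.append(k)
        leaf = leaf.next
    return result


head = build_leaf_chain(list(range(1, 21)), per_leaf=4)
print(range_scan(head, 7, 13))  # [7, 8, 9, 10, 11, 12, 13]
```

A real engine would normally descend from the root to the first matching leaf instead of starting at the leftmost one; starting from the head pointer just keeps the sketch short.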

The B+Tree indexes in the database can be divided into clustered index and secondary index.

The B+Tree in the figure above is implemented in the database as the clustered index: the leaf nodes of the clustered index’s B+Tree store the row data of the entire table. A secondary index differs from the clustered index in that its leaf nodes do not contain the full row data; instead they store the clustered index key of the corresponding row, that is, the primary key.

When querying data through a secondary index, the InnoDB storage engine traverses the secondary index to find the primary key, and then uses the primary key to find the full row record in the clustered index.
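The two-step lookup can be sketched with two plain dictionaries standing in for the two B+Trees. The table contents, the column names, and the assumption that `name` is unique are all made up for illustration.

```python
# Clustered index: primary key -> full row (hypothetical table).
clustered = {
    1: {"id": 1, "name": "alice", "age": 30},
    2: {"id": 2, "name": "bob", "age": 25},
    3: {"id": 3, "name": "carol", "age": 41},
}

# Secondary index on `name`: indexed value -> primary key only.
# Assumption: names are unique, so each value maps to a single key.
secondary_name = {row["name"]: pk for pk, row in clustered.items()}


def find_by_name(name):
    """Look up a row by name through the secondary index."""
    pk = secondary_name.get(name)   # step 1: secondary index gives the primary key
    if pk is None:
        return None
    return clustered[pk]            # step 2: clustered index gives the full row


print(find_by_name("bob"))
print(find_by_name("nobody"))
```

The second step is the extra trip to the clustered index that a secondary index lookup has to pay for.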

However, while indexes can speed up queries and improve MySQL’s processing performance, excessive use of indexes can also cause the following disadvantages:

  • Creating and maintaining indexes takes time, which increases with the volume of data.
  • In addition to the data table space, each index takes up a certain amount of physical space. If a clustered index is built, even more space is required.
  • Indexes are maintained dynamically as data in a table is added, deleted, or modified, which slows down data maintenance.

Note: Indexes can speed up queries in some cases, but reduce efficiency in others.

Indexes are only one factor in improving efficiency, so the following principles should be followed when creating indexes:

  • Create indexes on columns that need to be searched frequently, to speed up searches.
  • Create an index on the column that serves as the primary key, to enforce the uniqueness of that column and organize the arrangement of data in the table.
  • Create indexes on columns that are frequently used in table joins, mainly foreign keys, to speed up the joins.
  • Create indexes on columns that often need to be searched by range: because the index is already sorted, the specified range is contiguous.
  • Create indexes on columns that often need to be sorted: because the index is already sorted, queries can use that order to speed up sorting.
  • Create indexes on columns that frequently appear in WHERE clauses, to speed up the evaluation of conditions.

The index structure is designed to minimize the number of database I/Os. After all, a single I/O takes far too long.

Conclusion

A lot of knowledge is easy to pick up if the goal is just to pass interviews, but if the goal is to really learn, you will find that many things only make sense once you dig down into computer fundamentals. Many people ask me how I remember so many things. Actually, learning itself can be a pretty helpless thing, but since we have to learn anyway, why not learn well? Why not learn to enjoy it? Recently I have also been catching up on the basics, and I will start posting about computer fundamentals and networking later.

I’m Aobing, and the more you know, the more you don’t know, and I’ll see you next time!

Your likes, favorites, and shares are the biggest motivation for Aobing to keep writing. If you find any mistakes in this blog or have suggestions, you are welcome to leave a comment!

