With tens of millions of data, understanding the underlying principles of Mysql indexing would save the world

  • The nature of Mysql indexes
  • The underlying principles of Mysql indexing
  • Mysql index in action

The interview

Q: What are the most common query optimizations in a database?

Answer: Add index

Q: Why does indexing optimize queries?

Answer 1:… I don’t know

A2: Because an index is actually a data structure to optimize the query. For example, the index in Mysql is implemented using B+ tree, and B+ tree is a data structure to optimize the query speed. You can use the index to quickly find data, so it can optimize the query.

Q: Do you know which data structures can speed up queries? (There is a hole in this question…)

A: Hash tables, fully balanced binary trees, B trees, B+ trees, etc

Q: Since all of these data structures can optimize query speed, why do Mysql use B+ trees?

A:… I don’t know

Ask questions

SHOWINDEX FROM employees.titles;

There is a titles table with primary keys consisting of emp_NO, title, and from_date fields.

Do the following statements use indexes:

  1. select * from employees.titles where emp_no = 1
  2. select * from employees.titles where title = ‘1’
  3. select * from employees.titles where title = ‘1’ and emp_no = 1;

The Index (Index)

  • What is an Index, anyway?
  • An index is like a table of contents for a book
  • Indexes are used to find rows with specific column values quickly
  • I define it this way: An index is a data structure that optimizes a query

Why hash tables, fully balanced binary trees, B trees, B+ trees can optimize queries, why Mysql only likes B+ trees?

What is a hash table?

A Hash table (also called a Hash table) is a data structure that can be accessed directly based on Key values. That is, it accesses records by mapping key values to a location in the table to speed up lookups. This mapping function is called a hash function, and the array of records is called a hash table.

In fact, hash table is very simple, is the Key through a fixed algorithm function called the hash function into an integer number, and then the number mod array length, the result of the remainder as the array subscript, the value is stored in the array space with the number as the subscript. When the hash table is used for query, the hash function is used again to convert the key into the corresponding array subscript and locate the space to obtain value. In this way, the array positioning performance can be fully utilized for data positioning.

What are the characteristics of a hash table?

If we had a sanguo:

Now create hash index on name field:

Note that the array subscripts corresponding to the field values are randomly calculated by the hash algorithm, so hash conflicts may occur.

So for such an index structure, now execute the following SQL statement:

Select * from sanguo where name =’ sanguo ‘

You can directly calculate an array subscript according to the hash algorithm of Zhou Yu, and then you can directly take out the data from the data and get the address of the corresponding line of data, and then query that line of data.

So if we now execute the following SQL statement:

Select * from sanguo where name >’

Can’t do anything about it, because the nature of a hash table is that it can be queried quickly and accurately, but it does not support range queries.

What if I use a perfectly balanced binary tree?

Again, the above data is represented as a fully balanced binary tree as shown in the figure below (for simplicity, the corresponding addresses of the data are not drawn in the figure). :

Each node in the diagram should actually have four parts:

  1. Left – handed pointer refers to left – handed subtree
  2. The key value
  3. The storage address of the data corresponding to the key value
  4. Right pointer to the right subtree

The other thing to remember is that binary trees are sequential, which simply means “the ones on the left are less than the ones on the right”

If we now search for ‘Zhou Yu’, we need to search twice (the first time for Cao Cao, the second time for Zhou Yu), which is more than the hash table. And because fully balanced binary trees are ordered, range lookups are also supported.

What if I use B trees?

Again, the above table data is represented as a B-tree in the figure below (for simplicity, the corresponding address of the data is not drawn in the figure). :

We can see that the representation of the same element in a B tree is shorter than a fully balanced binary tree, “because a node in a B tree can store multiple elements.

What if I use B plus trees?

Again, the above data is represented as a B+ tree (for simplicity, the corresponding address of the data is not drawn in the figure). :

We can find that the representation of the same element in B+ tree is “fatter” than that in B+ tree, because there is a redundant copy of non-leaf nodes in B+ tree in leaf nodes, and leaf nodes are connected by Pointers.

So what’s so good about B+ trees?

Here we use “proof by contradiction”, if we now use fully balanced binary tree as the index data structure, let’s see what’s wrong with it.

In fact, the index also is very “big”, because the index also storage elements, the more our data rows of a table, then the corresponding index files in fact is very big, actually is also a need to be stored in the disk, but not all in memory, so when we consider to choose what kind of data structure, we can change an Angle to think, Which data structure is better for reading data from disk, or which data structure improves DISK I/O efficiency.

Looking back at fully balanced binary trees, when we need to query “zhang Fei”, we need to do the following steps

  1. Fetch cao “from disk to memory, CPU fetch data from memory for notes,” Zhang Fei “‘ Cao “, fetch the left subtree ((generated a disk IO)
  2. Fetch “zhou Yu” from disk to memory, CPU fetch data from memory for notes, “Zhang Fei” “Zhou Yu”, fetch the right subtree (generated a disk IO)
  3. Fetch “sun Quan” from disk to memory, CPU fetch data from memory for note, “Zhang Fei” “Sun Quan”, fetch the right subtree (generated a disk IO)
  4. Fetch “huang Zhong” from disk to memory, CPU fetch data from memory for note, “Zhang Fei “=” Zhang Fei “, find result (generated a disk IO)

Similarly, if we look back at the B-tree, we can find the “zhang Fei” after sending only three disk I/OS. This is the advantage of the B-tree: a node can store multiple elements, so the height of the whole tree is reduced compared to the fully balanced binary tree, and the disk I/O efficiency is improved.

However, THE B+ tree is an upgraded version of the B tree, except that the non-leaf nodes are redundant. The advantage of doing this is to improve the efficiency of range search.

Therefore, we can conclude that Mysql uses B+ tree as the data structure as index, which can improve the disk I/O efficiency of index query and the efficiency of range query. Moreover, the elements in B+ tree are also ordered.