directory

  • preface
  • The index
  • Index common models
    • Hash table
    • Orderly array
    • Balanced binary trees
  • InnoDB index model
  • Primary and normal indexes
  • Page split and page merge
  • Primary Key Why you are advised to select auto-increment primary key?

preface

Data structure (B+ tree), index, transaction, lock, log, etc. Today I will talk about index. I will divide index into two parts to explain.

Mysql > create index in B+ tree mysql > create index in B+ tree

  • Why index using B+ tree instead of B tree, binary search tree?
  • Which data structures can be used as indexes? What are their strengths and weaknesses?
  • What is the difference between a normal index and a primary key index?
  • What is the difference between a clustered index and a non-clustered index?
  • When does an index fail, and what is the leftmost matching rule?
  • Index push down, okay?
  • Talk about your understanding of the return table, can you give an example?
  • Primary key Why is an autoincrement primary key recommended?
  • Page split and page merge

These interview questions will be revealed in the index two episodes. The first interview question, I answered in the last article (mp.weixin.qq.com/s/1sy-jLCuW…)

The index

In a word, the index is to speed up the database query efficiency, like a book table of contents.

Take life as an example. There is a 500-page book in front of you. If you want to find a certain knowledge point, you may have to turn the page to find it without the contents. If this data has a directory function, you only need to find the corresponding page of knowledge points on the directory, efficiency can be greatly improved. The same is true for mysql. The index is its “directory”.

Index common models

The index appears to speed up the query efficiency of database, but there are many ways to realize the index, such as hash table, ordered array, search tree, B+ tree. These data structures are relatively simple but are often asked questions in interviews. I hope to focus on them. I will introduce these data structures and their advantages and disadvantages.

Hash table

Hash table is a data structure stored by key-value pairs. It is relatively simple to use. We only need to input a specific key value to find the corresponding value. The implementation is also relatively simple, through the hash function to convert the key value into a fixed value, as a bucket, and then store the value in the bucket.

However, when using hash table as index data structure, there is a problem that multiple keys will have the same value after conversion of hash function. To handle this situation, it is implemented by pulling the linked list from a bucket as follows:

Therefore, the hash table is used as the index storage structure, which is suitable for the scenario with equivalent query only. Such as Memcached and other NoSQL engines.

Orderly array

Compared with hash tables, ordered arrays perform very well in equivalent query and interval query scenarios. Because the array is ordered, it is very efficient to locate the start value and the end value in interval query. Equivalent query is similar to hash table. It finds the corresponding value by key value.

However, if the ordered array is used as the index data structure, if you want to insert a record into the ordered array, you need to move all the records behind the record, which costs a lot.

Therefore, ordered array indexes are only good for static storage engines, such as if you want to store all the population information of a city in 2017, which will not be modified.

Balanced binary trees

Balanced binary trees are relatively common and commonly used data structures, and their characteristics are as follows:

  • Balanced binary trees are also known as AVL trees
  • A balanced binary tree is an empty tree or the absolute value of the difference in height between the left and right subtrees is no more than 1, and the left and right subtrees are also balanced trees
  • The value of the non-leaf node is greater than the value of the left child and less than the value of the right child
  • A non-leaf node can have up to two child nodes

Using balanced binary trees as indexed data structures has several disadvantages:

Given the extreme case where each insert is larger than the last, the balanced binary tree stores the data in a linear fashion with O(n) time. It is common for mysql to store millions of pieces of data in a single table when there is a large amount of data. This will cause the tree to become deeper and consume a lot of IO when mysql reads.

When mysql reads data from disk, it reads data in the unit of pages. Each node represents a page. However, the balanced binary tree stores one keyword per node, resulting in a waste of storage space.

InnoDB index model

In InnoDB, tables are stored as indexes according to the order of primary keys, which are called indexed organized tables. InnoDB engine uses B+ tree as index data structure by default, all data is stored in B+ tree.

Each index in InnoDB corresponds to a B+ tree.

B + tree

  • Non-leaf nodes do not store data, only indexes, and can store more indexes
  • The leaf node contains all index fields
  • Leaf nodes contain data (key and data fields) and Pointers
  • Leaf nodes are connected by Pointers to improve the performance of interval access

Image:

Primary and normal indexes

In the InnoDB engine, index types can be divided into primary key index and normal index.

Suppose we have a table with a primary key column ID, a field K in the table, and an index on k.

The construction sentence of this table is:

mysql> create table T(
id int primary key, 
k int not null, 
name varchar(16),
index (k))engine=InnoDB;
Copy the code

In the table, the (ID,k) values of R1~R5 are (100,1), (200,2), (300,3), (500,5) and (600,6) respectively. The schematic diagram of the two trees is as follows

It is not difficult to see from the figure that, according to the content of the leaf node, the index type is divided into primary key index and non-primary key index.

Leaf nodes indexed by primary keys hold entire rows of data. In InnoDB, primary key indexes are also called clustered indexes.

Leaf node contents that are not primary key indexes are primary key values. In InnoDB, non-primary key indexes are also called secondary indexes.

Based on the above index structure description, let’s discuss a question: what is the difference between a query based on a primary key index and a normal index? Select * from T where ID=500; select * from T where ID=500; If the statement is select * from T where k=5, that is, the common index query mode, search the K index tree first, obtain the ID value 500, and then search the ID index tree. This process is called table back. That is, a query based on a non-primary key index requires one more index tree to be scanned. Therefore, we should try to use primary key queries in our applications.

Page split and page merge

B+ trees perform necessary maintenance when inserting new values in order to maintain index orderliness. For example, if you want to insert a record with an ID of 700, you just need to add it after the record with an ID of 600. But if you add a record with an ID of 400, it’s a bit more cumbersome, and you need to logically move the data behind it to make room.

Even worse, if the data page R5 is on is already full, according to the B+ tree algorithm, it needs to apply for a new data page and move some data there. This process is called page splitting. In this case, performance naturally suffers.

In addition to performance, page splitting also affects data page utilization. Data originally placed on one page is now divided into two pages, reducing overall space utilization by about 50%.

Of course there is fragmentation and there is consolidation. When two adjacent pages have low utilization due to data deletion, the data pages are merged. The process of merging can be regarded as the reverse of the process of splitting.

Primary Key Why you are advised to select auto-increment primary key?

If you use auto-increment primary keys, each time new data is added, it is stored in the form of append, so that there is no moving data operation, and there is no page splitting problem mentioned above.

If a service field is used as the primary key, the index is maintained every time new data is added to maintain the order of the index, resulting in high data write costs.

In addition to the performance differences, consider storage. If you do have a unique field in your table, such as a string id number, should you use id number as the primary key or increment field as the primary key?

Alter table alter table alter table alter table alter table alter table alter table alter table alter table alter table alter table alter table alter table alter table alter table alter table alter table If the id number is used as the primary key, it takes 20 bytes.

Therefore, the smaller the primary key length, the smaller the leaf nodes of the normal index, the smaller the space occupied by the normal index, and the more indexes the page stores.

Based on the above analysis of the advantages and disadvantages of using the auto-add primary key and using service fields as the primary key, it is concluded that the auto-add primary key is a reasonable solution.

The article will be updated continuously. You can search “Maimo Coding” on wechat to read it for the first time. Every day to share quality articles, large factory experience, school recruitment experience, help the school recruitment interview, is worth paying attention to every programmer platform.