A clustered index is an algorithm that reorganizes the actual data on disk to sort by the value of one or more specified columns. The data is stored in the same order as the index. In general, the primary key creates a clustered index by default, and only one clustered index is allowed per table.

The book Principles of Databases explains the difference between clustered and non-clustered indexes in this way: the leaf nodes of clustered indexes are data nodes, while the leaf nodes of non-clustered indexes are still index nodes with Pointers to corresponding data blocks.

** Clustered indexes: ** Stores data together with indexes. The leaf nodes of the index structure hold row data.

** Non-clustered indexes: ** Stores the data separately from the index, and the leaf nodes of the index structure point to the corresponding location of the data.

In InnoDB, indexes created on top of clustered indexes are called secondary indexes. Non-clustered indexes are secondary indexes, such as composite indexes, prefix indexes, and unique indexes. Secondary index leaf nodes no longer store the physical location of rows, but the primary key value, and secondary index access to data always requires a secondary lookup.

  1. In InnoDB, the clustered index organizes the primary key into a B+ tree, and the row data is stored in the leaf node. If the primary key is found using the condition “where ID = 14”, the corresponding leaf node can be found according to the B+ tree search algorithm, and then the row data can be obtained.
  2. If a conditional search is performed on the Name column, two steps are required: the first step is to retrieve the Name in the secondary index B+ tree and reach its leaf node to obtain the corresponding primary key. In the second step, the primary key is used to perform another B+ tree retrieval operation in the primary index B+ tree species. Finally, the whole row of data can be obtained when the leaf node is reached. (The emphasis is on secondary indexes needed by other keys)

The clustering index is unique, because the clustering index puts the data together with the index structure, so there is only one clustering index for a table.

The physical order of rows in a table is the same as the physical order of rows in an index. A clustered index is created before any non-clustered indexes are created. This is because a clustered index changes the physical order of rows in a table.

By default, a clustered index is a primary key. If a primary key is not defined in a table, InnoDB selects a unique and non-empty index instead. If there is no such index, InnoDB will ** implicitly define a primary key (similar to Oracle RowId) ** to act as a cluster index. If you already set the primary key as the cluster index and want to set the cluster index separately, you must first delete the primary key, then add the desired cluster index, and finally restore the primary key setting.

The clustered/non-clustered B+ trees in MyISAM look the same. The structure of the nodes is exactly the same except for the different contents. The nodes of the primary key index B+ tree store the primary keys, and the secondary key index B+ tree store the secondary keys. Table data is stored in separate places, and the leaves of both B+ trees use one address to point to real table data. There is no difference between the two keys for table data. Because the index tree is independent, secondary key retrieval does not require access to the index tree of the primary key.

Advantages of using clustered indexes:

** Every secondary index search requires two B+ tree lookups. ** It seems that clustered indexes are significantly less efficient than non-clustered indexes. What are the advantages of clustered indexes?

1. Since row data and clustered index leaf nodes are stored together, there may be multiple rows of data in the same page. When accessing different row records of the same data page, the page has been loaded into Buffer (cache). This way the primary key and row data are loaded together, and the row data can be returned immediately if the leaf node is found. If the data is organized by the primary key Id, the data can be retrieved faster.

2. Leaf nodes for secondary indexes that store primary key values rather than data storage addresses. The advantage is that when the row data changes, the nodes in the index tree also need to split. Or the data we need to find is not in the cache of the last IO read and write. When a new IO operation needs to occur, we can avoid the maintenance of secondary indexes and only need to maintain the clustered index tree. Another benefit is that secondary indexes reduce the amount of storage they take up because they hold primary keys.

Note: We know that 16K resources can be obtained in one IO read and write. We call the read data area Page. In our B tree, the index structure of the B+ tree, many keywords (index values) and corresponding data stored on the leaf node will be read into the cache in one IO operation, so when accessing different records in the same page, the operation will be performed in memory, instead of IO operation again. A new I/O operation is triggered only when a page split occurs, that is, the row to be queried is not in the same row as the last I/O operation.

3. Because the primary index of MyISAM is not a clustered index, the physical address of his data must be messy, get these physical addresses, according to the appropriate algorithm for I/O reading, and then start to keep seeking and keep rotating. Clustered indexes require only one I/O. (Sharp contrast)

4. However, MyISAM is more advantageous when it comes to sorting large volumes of data, full table scanning, count, etc., because indexes take up less space and these operations need to be done in memory.

Clustering index needs attention

When a primary key is used as a clustered index, it is recommended not to use a UUID for the primary key because uUID values are too discrete to be sorted and uuid values that may be added to the index tree will be inserted in the middle of the index tree, which increases the complexity of index tree adjustment and consumes more time and resources.

You are advised to use the increment of the int type to facilitate sorting. The primary key value is added to the end of the index tree by default, which has minimal impact on the index tree structure. In addition, the larger the storage space occupied by primary keys, the larger the primary keys stored in secondary indexes, occupying storage space and affecting the amount of data read by I/O operations.

Why do primary keys usually recommend auto-increment ids

The physical storage order of data in a cluster index is the same as the index order, that is, as long as the indexes are adjacent, the corresponding data must also be stored adjacent to each other on disk. If the primary key is not an auto-increment ID, you can imagine what it would do, constantly adjusting the physical address of the data, paging, and of course other measures to reduce these operations, but they cannot be completely avoided. However, if it is increment, it is simple, it only needs to write page by page, index structure is relatively compact, disk fragmentation is less, and high efficiency.

end