Database indexing and SQL performance optimization

1. Basic concepts of indexes

The index is similar to the contents of a Chinese dictionary, which can quickly locate the page number of Chinese characters to be searched according to the radicals.

1.1 Index Definition

An index is a secondary storage structure defined on the basis of a storage table to help quickly locate the required records without having to examine all the records.

An index entry consists of two parts:

The index field: Certain columns in a table, such as ids;Similar to an entry in a dictionary
Pointer to the index: points to the disk location where records in the table containing index field values are stored.Similar to the page number of an entry in a dictionary

1.2 Index Classification

Different database systems provide different index types.

By physical storage structure:

Clustered index:The physical order of data in a database table row is the same as the logical order of key values (indexes).Because a table can be stored in only one physical order, a table can have only one clustered index.
Nonclustered index: As opposed to a clustered index,The logical order of the indexes is different from that of the physical storage.

By field features:

The only index
Normal index

By number of fields:

Single index
Composite index

2. Storage structure of index file itself

Each index corresponds to an index file. MSSQL’s index file storage structure is B+ tree.

Why to use this structure, we need to start from the evolution of the data structure of B+ tree, binary search tree — balanced binary tree — B tree — B+ tree

2.1 Balanced binary tree

Balanced binary tree is a data structure based on dichotomy strategy to improve the search speed of data.

The figure above shows a simple binary tree simulation, with each node having the following characteristics:

Each node has a maximum of two subtrees (tree branches)
The left and right subtrees are ordered, and the value on the right of adjacent nodes in the same hierarchy is larger than that on the left.
Even if a node has only one subtree, distinguish between the left and right subtrees.

Balanced binary tree Based on dichotomy strategy, balanced binary tree query performance is inversely proportional to the tree level (H height). The higher the tree structure level, the more search times, the more DISK I/O times, and the slower the query efficiency.

Assuming that there are 100 values to be stored, a balanced binary tree with 100 nodes and 6 levels of tree height is produced. If there is a way for a node to store more than one element (value), you can reduce the number of nodes in the tree, thereby reducing the overall level of the tree, hence the b-tree.

2.2 B tree

The difference between B tree and balanced binary tree is that B tree belongs to multi-fork tree, also known as balanced multi-path search tree (search path is not only two), and a single node can store multiple data.

B tree rule:

Sorting method: all node keywords are in ascending order, and follow the principle of smaller on the left and larger on the right;
Number of child nodes: the number of child nodes of non-leaf nodes is >1, and <=M, and M>=2, except for empty trees (note: order M represents the maximum number of search paths of a tree node,M= M, when M=2, it is a 2-fork tree, and M=3, it is a 3-fork tree).
Key words: the number of keywords in the branches is greater than or equal to CeiL (m/2)-1 and less than or equal to m-1 (note: Ceil () is a function rounded in the direction of infinity, such as Ceil (1.1), the result is 2);
All leaf nodes are in the same layer. In addition to the pointer containing the keyword and the keyword record, leaf nodes also have Pointers to their children except that their pointer addresses are all null corresponding to the space child of the node at the last layer below.

Features:

The difference between B tree and balanced binary tree is that each node contains more keywords, especially when B tree is applied to the database, which makes full use of the principle of disk blocks (Disk data storage is stored in the form of blocks, each block size is 4K, every time I/O data is read, The data of the same disk block can be read at one time) limit the node size and make full use of the fast disk size range; After increasing the number of node keywords in the tree, the tree level is less than the original binary tree, which reduces the number and complexity of data search.

2.3 B + tree

B+ tree is an evolution of B tree. The main difference between B+ tree and B tree is that the non-leaf nodes of B+ tree do not store data, but only store key values. Data is stored in the leaf nodes of the same layer. Each non-leaf nodes can store more data, and reduces the level of the tree height, and all of the leaves section like a chain table, pointing to the right of the leaf nodes, thus can effectively accelerate the retrieval efficiency, if need to traverse all of the data, you just need to traverse the leaf node chain structure, convenient and efficient.

2.4 Storage structure of clustered indexes

A B+ tree index is created based on the field where the clustered index is created. Data in the table is sorted on disk based on the field where the clustered index is created. Each table can have only one clustered index (and only one sort).

Features:

The root and middle nodes of the clustered index are index pages, both containing only entry Pointers and entry values for the next level (the first primary key at the storage location);
The leaf node of the clustered index is the data page. It stores all the fields in the table

A clustered index is a way of reorganizing the actual data on disk to sort by a specified column or column value. Because under clustered indexes, the data is physically sorted on the data page and the duplicates are grouped together, a query containing bentween,<,><=,>= or a query using group by or order by will be joined once the first row of the key value is found. There is no need to search further, avoid a wide range of scans, can greatly improve the speed of the query.

2.5 Storage structure of non-clustered Indexes

Non-clustered indexes do not reorganize the data in a table, and there can be multiple non-clustered indexes in a table.

The specific structure is as follows:

The root and middle nodes of a non-clustered index are index pages, both containing only entry Pointers and entry values (the first key value at the storage location) at the next level;
Leaf nodes of non-clustered indexes are also index pages and store key values of clustered indexes and non-clustered indexes.
Each index row in a non-clustered index (whether root node, intermediate node, or leaf node) contains a non-clustered key and a row locator that points to a data row in a clustered index or heap (a table without a clustered index) that contains that key.

The conceptual diagram of non-clustered indexes is as follows:

3, how to build index

3.1 creating indexes on the sqlserver

Create an index using SQL:

CREATE [UNIQUE] [CLUSTERED|NONCLUSTERED] -- UNIQUE specifies the UNIQUE index. CLUSTERED and NONCLUSTERED specifies CLUSTERED or non-clustered indexes

    INDEX   index_name  -- Specify the index name

     ONTable_name, column_name...).Table name, column name
Copy the code

Create using a graphical interface:

3.2 General principles for indexing

The application of index technology can greatly improve the retrieval efficiency, but at the same time, it increases the storage space, makes the maintenance burden heavier (not only to maintain the main file, but also to maintain the index file), and the clustered index can only have one table, so what field to establish the index and what kind of index is particularly important.

Principle:

Index fields that often need sorting, grouping, and union operations;
Index fields that are commonly used as query criteria;
Minimize index page size by using columns of small data types as key indexes. (Int keys as indexes are the fastest)
Try to select highly differentiated columns as indexes. If the field data is not too different, there is no need to build indexes
Do not create indexes for tables with few queries and many operations, such as adding, deleting, or modifying.
Limit the number of indexes. More is not always better
Try to avoid uUID as index; Because the data generation is irregular, it cannot be stored in sequence and occupies a large space.

Myth:

** Automatically sets the primary key to the clustered index **. Usually, we will create an id in each table as the primary key, sqlserver will default to create a clustered index, but sometimes this field is meaningless automatic increment field, lost the significance of clustered index, waste the precious resources of clustered index.

4, SQL writing some matters needing attention

4.1 Pre-concept – Query optimizer

The database system has an internal query optimizer. When we execute a SQL statement, the query optimizer optimizes the SQL statement and selects the corresponding index according to the minimum IO/CPU principle.

4.2 SQL Precautions

Index left-most matching principle

An index that refers to multiple columns in a certain order is called a federated index. The left-most matching principle states that a query can be used if the query condition exactly matches one or more consecutive columns to the left of the index. As follows:

-- user (name,city) forms a joint index
select * from user where name=xx and city=xx ; Indexes can be hit
select * from user where name=xx ; Indexes can be hit
select * from user where city=xx ; Index could not be hit
Copy the code

Avoid null values for fields in the WHERE clause. Null values cause the database to abandon indexes and perform a full table scan instead
Use in where clauses should be avoided! =, <>, in, or the database will abandon the use of the index for a full table scan.

Avoid calculations on indexed columns, or the database will abandon the index for a full table scan.

select id from t where num/2=100 
-- should be changed to:
select id from t where num=100*2
Copy the code

Be careful with the OR operation. Any clause in the AND operation that can use the index improves query performance, but any clause in the OR condition that cannot use the index degrades query performance.

SQL performance optimization tool – Execution plan

An execution plan is how the database executes an Sql statement, including the order of Sql queries, whether or not indexes are used, and the index information used.

In SQLSERVER, where to open the execution plan graphical interface:

Above is an SQL statement execution plan (the execution plan is viewed from right to left), from which information can be obtained:

Which implementations cost more — overhead
SQL Server’s execution plan for the amount of data generated by each step is expressed as line thickness
What actions are performed in each step

By analyzing the execution plan, you can see which steps are more expensive and, in turn, try some improvements.

reference

Balanced binary trees, B trees, B+ trees, B* trees and you understand one of them

SQLServer index tuning practices

Explain an SQL execution process

To query data through a non-clustered index, you must first find the clustered index, and then obtain the actual data through the clustered index. Twice as many IO as clustered indexes.