Writing in the front
- The article is summarized on the basis of predecessors plus their own understanding, only as their own learning record, not for any commercial use!
- If you find any mistakes or infringement problems in this article, please point them out. Thank you!
MySQL architecture
- MySQL uses
C/S architecture, namely Client and Server architecture
When we use MySQL, we send a request to connect to the MySQL daemon running the server as the client. The MySQL server processes our request and returns the result after processing to us - Compared to other databases, the MySQL architecture can be used in many different scenarios and works well. This is mainly reflected in the architecture of the storage engine.
Plug-in storage engine architecture separates query processing from other system tasks and data storage and extraction
. In this architecture, you can select an appropriate storage engine based on service requirements - Architecture diagram
- It can be seen from the above figure that MySQL is mainly divided into connection layer, service layer, engine layer and storage layer
- Connection layer: mainly connects with the client, and is responsible for connection and authorization
- Service layer: mainly completes most of the core service functions, including query caching, SQL parsing, optimization and other operations. The interaction between the service layer and the storage engine is called through a unified API
- Engine layer: Storage engine is really responsible for storage and extraction of data in MySQL. Server communicates with storage engine through API. Different storage engines have different functions, so you can select them according to your actual needs
- Storage layer: Mainly stores data on the file system running on the device and interacts with the storage engine
Connection manager
- The connection manager manages and maintains all connections requested by MySQL clients. When we make a request to MySQL, the connection manager is responsible for creating a connection and verifying user permissions
The query cache
- When we establish a connection, if we perform a SELECT operation, the connector will first check the query cache to see if the statement was previously executed
- If executed, the cached result is returned
- If not, the operation continues
- The query cache is the
The query statement is used as a key, and the query result is used as a value
, the established key-value cache structure
- Query caching is not recommended
- As long as the database table is updated, all the query cache will be cleared, so it is often not able to hit the query cache, the existence of little significance
The parser
- What does this SQL do
- When there is no command query in the query cache, you need to actually execute the statement, leaving it to the parser for lexical and grammatical analysis
The optimizer
- How do I do this SQL
- The MySQL server already knows what the SQL statement does after lexical and syntax analysis by the parser, but it still needs to be optimized, such as
- When multiple indexes are involved, decide which index to use
- When multiple tables are associated, the join order is determined
Select c from t1=10D in table T2=20The ID value of theselect * from t1 join t2 using(ID) where t1.c=10 and t2.d=20; Copy the code
actuator
- Actual execution steps
- The SQL statement is optimized by the query optimizer and then sent to the executor to start execution, but before execution, the executor determines whether the user has permissions on the corresponding table
- If you have the permission, open the table and use the interface provided by the engine to send corresponding instructions to the underlying storage according to the engine definition of the table. The storage engine is responsible for the specific execution, and informs the executor of the execution result, and then returns it to the client
The storage engine
- Later on
Related interview questions
- Q: What is the database architecture diagram?
- In general is divided into four layers, and the connection layer used to connect the client authorized certification, and the service layer is the core of the main parts, includes the query cache, the parser, the optimizer, actuators and other parts, engine layer is MySQL really responsible for data access, different engines have different functions, storage layer is to store the data in a file system and engine layer to interact
- Q: How is the SQL statement of a query executed?
- This is the process described in each of the processes shown above
The storage engine
- MySQL supports nine storage engines
Storage engines, also known as table types, determine how a table is handled and stored
MySQL supports many different storage engines, and the storage engine is designed to be pluggable so that different tables in the same database can use different storage engines
Querying storage Engines
- Example Query storage engines supported by MySQL
mysql > show engines Copy the code
+--------------------+---------+----------------------------------------------------------------+--------------+------+- -----------+ | Engine | Support | Comment | Transactions | XA | Savepoints | +--------------------+---------+----------------------------------------------------------------+--------------+------+- -----------+ | InnoDB | DEFAULT | Supports transactions, row-level locking, and foreign keys | YES | YES | YES | | MRG_MYISAM | YES | Collection of identical MyISAM tables | NO | NO | NO | | MEMORY | YES | Hash based, storedin memory, useful fortemporary tables | NO | NO | NO | | BLACKHOLE | YES | /dev/null storage engine (anything you write to it disappears) | NO | NO | NO | | MyISAM | YES | MyISAM storage engine | NO | NO | NO | | CSV | YES | CSV storage engine | NO | NO | NO | | ARCHIVE | YES | Archive storage engine | NO | NO | NO | | PERFORMANCE_SCHEMA | YES | Performance Schema | NO | NO | NO | | FEDERATED | NO | Federated MySQL storage engine | NULL | NULL | NULL | +--------------------+---------+----------------------------------------------------------------+--------------+------+- -----------+ 9 rowsin set (0.00 sec) Copy the code
- A storage engine that queries tables
-- Check the storage engine used by a particular table. The default storage engine has been changed! Show create table tablename -- Check exactly which storage engine is used by a table in a database show table status like'tablename' show table status from database where name="tablename" Copy the code
Specifying a Storage engine
- Innodb is the default storage engine for MySQL
- Specify the storage engine for the user table
CREATE TABLE users( uid int not null, username varchar(32) not null, email varchar(64) not null, gender tinyint not null.primary key(uid) ) engine=MyISAM; Copy the code
Modifying a Storage Engine
- The storage engine of the database table can also be modified. Change the storage engine of the user table
ALTER TABLE users ENGINE=InnoDB; Copy the code
Storage Engine Comparison
-
Mainly refers to InnoDB and MyISAM two storage engine comparison
- Physical file storage structure comparison
-
MyISAM
- .frM file: Metadata related to tables are stored in FRM files, including the definition of table structures
- MYD (MYData) file: the MyISAM storage engine is dedicated to storing data of MyISAM tables
- MYI (MYIndex) file: the MyISAM storage engine is used to store index information of MyISAM tables
-
InnoDB
- .frM file: Metadata related to tables are stored in FRM files, including the definition of table structures
- Ibd files or.ibData files: Both of these files are used to store InnoDB data
There are two types of files to store InnoDB data because InnoDB can be configured to use a shared tablespace to store data or a private tablespace to store data
Ibd files are used for each table. Ibd files are shared for each table. Ibdata files are used for all tables
-
- Physical file storage structure comparison
-
Compared to other
MyISAM | InnoDB | |
---|---|---|
The storage space | MyISAM can be compressed and has less storage space | InnoDB tables require more memory and storage because it creates its own special buffer pool in main memory for caching data and indexes |
A foreign key | Does not support | support |
The lock range | Table locks, in which an operation on one piece of data locks the entire table, are not suitable for high-concurrency operations | Row-level locking, in which only one row of data is locked, is suitable for high-concurrency operations |
The cache | Only indexes are cached, not real data | Cache not only indexes but also real data, so memory requirements are high, and memory size can have a decisive impact on performance |
AUTO_INCREMENT | MyISAM tables can be indexed jointly with other fields | InnoDB must contain an index with only this field |
SELECT | MyISAM better | |
INSERT | InnoDB better | |
UPDATE | InnoDB better | |
DELETE | InnoDB better | |
COUNT without WHERE | MyISAM is superior because MyISAM stores the exact number of rows of the table | InnoDB does not store the exact number of rows of a table, so it is slow to scan statistics line by line |
COUNT with WHERE | Same thing, they lock the whole table | Same thing, they lock the whole table |
FULLTEXT FULLTEXT index | support | Full text indexes can be obtained from InnoDB using Sphinx, which is slower |
Related interview questions
-
Q: What are the differences between MyISAM and InnoDB?
- InnoDB supports transactions, while MyISAM does not. This is one of the reasons MySQL changed the default database engine to InnoDB
- InnoDB minimum-grained locks are row-level locks, while MyISAM minimum-grained locks are table-level locks. Executing an update statement will lock the entire table, causing other queries and updates to block, which is not suitable for high-concurrency scenarios. This is one of the reasons MySQL changed the default database engine to InnoDB
- InnoDB supports foreign keys, while MyISAM does not. Converting an InnoDB table with foreign keys to MYISAM will fail
- InnoDB is
Clustering index
MyISAM isNon-clustered index
(Explained in index classification below) - InnoDB does not store the exact number of rows of a table, execute
select count( * ) from table
MyISAM uses a variable to store the number of rows in the entire table. When executing the above statement, you only need to read the variable, which is very fast
-
Select * from table_name where ID = 15; select * from table_name where ID = 15; select * from table_name where ID = 15;
- If the table type is MyISAM, it is 18. Because MyISAM table will increment the primary key
The maximum ID is recorded in the data file
To restart the database, the maximum ID of the self-added primary key will not be lost - If the table type is InnoDB, it is 15. Because InnoDB tables only add primary keys
The maximum ID is recorded in memory
, so restarting the database or performing the OPTION operation on the table will cause the maximum ID to be lost
- If the table type is MyISAM, it is 18. Because MyISAM table will increment the primary key
-
Q: Which storage engine executes select count(*) faster and why?
- MyISAM is faster because MyISAM internally maintains a counter that stores the total number of rows of the table on disk when executed
select count( * ) from t
, directly returns the total data - In The InnoDB storage engine, unlike MyISAM, the total number of rows is not stored on disk when executed
select count( * ) from t
, the data will be read out first, add up line by line, and finally return the total number
- MyISAM is faster because MyISAM internally maintains a counter that stores the total number of rows of the table on disk when executed
-
Q: Why MyISAM reads faster and writes slower than InnoDB?
- Read operations are faster because MyISAM is a non-clustered index and InnoDB is a clustered index. INNODB has a lot more to maintain when doing SELECT than MYISAM
- Write operations are slower because MyISAM is a table-level lock and updates lock the entire table, while InnoDB is a row-level lock
-
Q: What is the difference in primary key support between MyISAM and InnoDB?
- MyISAM tables allow no primary key and other indexes, whereas InnoDB tables without primary keys generate a 6-byte primary key that is invisible to the user
The data type
- It mainly includes five categories
- Integer types: BIT, BOOL, TINY INT, SMALL INT, MEDIUM INT, INT, and BIG INT
- Floating point types: FLOAT, DOUBLE, DECIMAL
- The value can be CHAR, VARCHAR, TINY TEXT, TEXT, MEDIUM TEXT, LONGTEXT, TINY BLOB, BLOB, MEDIUM BLOB, or LONG BLOB
- Date type: Date, DateTime, TimeStamp, Time, Year
- Other data types: BINARY, VARBINARY, ENUM, SET, Geometry, Point, MultiPoint, LineString, MultiLineString, Polygon, GeometryCollection, etc
Related interview questions
- Q: Difference between CHAR and VARCHAR
- The same
- CHAR(n), VARCHAR(n) n represents the number of characters
- CHAR, VARCHAR the string is truncated after the maximum length n is specified
- The difference between
- CHAR takes up n characters regardless of the actual number of characters stored, whereas VARCHAR takes up only the size of the actual character in bytes plus 1 (actual length, 0<=length<255) or 2 (length>255)
- Reason: VARCHAR saves data by adding one byte to the string (two bytes if column declaration length is greater than 255)
- The maximum storage space for a CHAR is 255 bytes
- CHAR truncates trailing Spaces when stored, whereas VARCHAR does not
- CHAR takes up n characters regardless of the actual number of characters stored, whereas VARCHAR takes up only the size of the actual character in bytes plus 1 (actual length, 0<=length<255) or 2 (length>255)
- Use policies
- CHAR is better than VARCHAR for frequently changing data because CHAR is less prone to fragmentation.
- For very short columns, CHAR is more storage efficient than VARCHAR
- The same
- Q: What’s the difference between a BLOB and a TEXT?
- BLOB holds binary data, TEXT holds character data
The index basis
What is the index
Index is a data structure used to improve the efficiency of database query
Syntax of indexes
-
Create indexes
CREATE INDEX nameIndex ON test(name) CREATE UNIQUE INDEX idIndex ON test(id) Copy the code
CREATE TABLE mytable ( ID INT NOT NULL, username VARCHAR ( 16 ) NOT NULL, INDEX [ indexName ] ( username ( length )) ); Copy the code
-
Remove the index
-- DROP INDEX [indexName] ON mytable DROP INDEX idIndex ON test Copy the code
-
According to the index
SHOW INDEX FROM test Copy the code
Advantages and disadvantages of indexes
- advantages
- Improve data retrieval efficiency and reduce database IO costs
- Reduce the cost of data sorting and reduce CPU consumption
- disadvantages
- An index is also a table that holds the primary key and index fields and points to records in the entity table, so it is also memory intensive
- While indexes greatly speed up queries, they slow down the speed of updating tables, such as INSERTS, UPDATES, and DELETES. MySQL will update the index file every time it updates a column that has been added to the index
Classification of indexes
A Hash index
-
Hash indexes are implemented based on Hash tables. Only queries that exactly match all columns of the index are valid. The Memory engine uses this kind of index by default
-
The storage engine computes a HashCode for all the separate columns, stores the HashCode in the index, and holds a pointer to each data row in the HashCode. This is very fast for this kind of index lookup. In case of hash collisions, the index stores multiple pointer records in a linked list into the same hash entry
-
The retrieval algorithm
- When retrieving the query, it performs the same Hash algorithm again for the search keyword to obtain HashCode and fetch data from the corresponding position of the Hash table. If a Hash collision occurs, it needs to filter the value
-
For example,
name age Jane 28 Peter 20 Lily 30 - Suppose the hypothetical hash function f() is used to generate the corresponding hypothetical value:
- f(‘Jane’) = 2323
- f(‘Peter’) = 2456
- f(‘David’) = 2400
- The data structure of the Hash index is as follows.
Groove (slot) Value (value) 2323 Pointer to row 1 2400 Pointer to row 3 2456 Pointer to line 2 - for
select * from user where
name= 'Jane'
Jane’s HashCode = 2323 (HashCode = 2323, HashCode = 2323, HashCode = 2323
- Suppose the hypothetical hash function f() is used to generate the corresponding hypothetical value:
-
MySQL does not explicitly support Hash indexes, but rather as an internal optimization. Specifically, Innodb storage engine monitors the search of secondary indexes on tables. If a secondary index is frequently accessed and becomes hot data, a Hash index is created for it. Therefore, in MySQL’s Innodb, Hash indexes are automatically generated for hot data. This kind of Hash index is also called adaptive Hash index based on the characteristics of the scenario used
-
disadvantages
You cannot avoid reading rows
The: hash index contains only hash values and row Pointers, not field values, so you cannot use values in the index to avoid reading rows. However, access to rows in memory is fast, so in most cases this doesn’t have a noticeable effect on performanceCan't be used for sorting
Hash index data is not stored in index order and therefore cannot be used for sortingOnly equivalent lookup is supported
: Hash indexes only support equivalent comparison queries, including =, IN(), <=> (note that <> and <=> are different operations). No range queries, such as WHERE price>100, are supportedHash conflict exists
When a hash conflict occurs, the storage engine must traverse all row Pointers in the linked list, comparing them row by row until it finds all rows that match the criteriaPartial column match lookup is not supported
If (A,B) = (A,B) = (B) = (A,B) = (A,B) = (A,B) = (A,B) = (A,B) = (A,B)
B + Tree index
- This is the basic implementation of MySQL indexes. In addition to full-text index, Hash index, Innodb, MyISAM index are through B+Tree implementation
- This part has many contents and will be narrated from the following aspects:
- Why choose B+Tree instead of binary Tree
- Disk access
- B-tree data structure
- B+Tree data structure
-
Why choose B+Tree instead of binary Tree
-
Binary search tree
- Binary search trees have the following properties: the value of the left subtree is less than the value of the root, and the value of the right subtree is greater than the value of the root
- The search for the nodes of the binary tree shows that the search times of the node with depth of 1 is 1, the search times of the node with depth of 2 is 2, and the search times of the node with depth of N is N, so the average search times are (1+2+2+3+3+3+3) / 7 = 2.4 times
- But when the binary search tree is constructed in the following form
- The average number of lookups is (1+2+3+4+5+6+6) / 7 = 4
- Disadvantages of binary search tree:
If the tree is too deep (too high), the binary tree is inefficient to query
. Therefore, if the query efficiency of binary tree is to be as high as possible, the binary tree needs to be balanced, which leads to a new definition —Balanced binary trees, or AVLTree
-
Balanced binary trees
- A balanced binary tree (AVL tree) satisfies the requirement of a binary search tree that the maximum difference in height between two subtrees of any node is 1 (image from network)
- If nodes are inserted or deleted in AVL tree, the AVL tree may lose balance. Such unbalanced binary tree can be summarized as four postures:
- LL: That’s right. After a node is inserted or deleted, the Left Child of the root node also has non-empty nodes, resulting in the height of the Left subtree of the root node being 2 higher than the height of the right subtree, and the AVL tree losing balance
- RR: RightRight. After a node is inserted or deleted, the Right Child of the root node also has non-empty nodes, resulting in the height of the Right subtree of the root node being 2 higher than the height of the left subtree, and the AVL tree losing balance
- LR: LeftRight, also known as left and right. After a node is inserted or deleted, the Left Child of the root node and the Right Child of the root node also have non-empty nodes, resulting in the height of the Left subtree of the root node being 2 higher than the height of the Right subtree, and the AVL tree losing balance
- RL: RightLeft, also known as “RightLeft”. After a node is inserted or deleted, the Left Child of the Right Child of the root node also has non-empty nodes, resulting in the height of the Right subtree of the root node being 2 higher than the height of the Left subtree, and the AVL tree losing balance
- Once an AVL tree is out of balance, it can be rotated back into balance
- Data structures that use balanced binary trees as indexes
- Consider each node as a disk. The disk structure is shown in the following figure:
- Keyword: that is, the corresponding value of the key field that we build the index.
- Data area: indicates the location of the disk corresponding to the keyword. I/O operations are performed on the disk corresponding to the keyword to obtain data
- Node reference: refers to the disk location of the child node
- When we need to query data with ID=8, we will first get root node 10 and load it into memory, compare the data size and find that it is smaller than 10, then search left node 5 and find that it is larger than 5, search right node 5 and find that it is hit, and then perform IO read and write operations according to the data area address
- AVLTree faults:
If the data is too deep (or too high), it can lead to frequent IO operations, which are time-consuming
, hence the new definition —B-Tree
-
-
Disk access
- Locality principle and disk prefetch
- Due to the nature of the storage medium, disk access is much slower than main memory, so to improve efficiency, minimize disk I/O
- To achieve this goal, the disk does not read data strictly on demand. Instead, the disk prereads each time. Even if only one byte is required, the disk reads a certain length of data backward from this position into memory
- Prefetch improves I/O efficiency. The length of prefetch is typically an integer multiple of a page (page: a logical block of computer-managed memory – usually 4k).
Main memory and disk exchange data on a page basis
- When the data the program is trying to read is not in main memory, a page missing exception is triggered. In this case, the system sends a disk read signal to the disk. The disk finds the starting location of the data and reads one or more pages back to memory
- When the system reads data from disk to memory, the basic unit is disk block. The data in the same disk block is read at one time, rather than what is needed
- The default size of InnoDB storage engine is 16K per page, but the storage space of a disk block in the system is usually not that large. Therefore, InnoDB will use several consecutive disk blocks to achieve the page size of 16KB each time it applies for disk space
InnoDB reads disk data to disk on a page basis
If each piece of data in a page can help locate the data record during data query, it will reduce disk I/O times and improve query efficiency.Data in the B-tree structure enables the system to efficiently locate the disk block where the data resides
- Locality principle and disk prefetch
-
B-tree data structure
- First define a record as a binary [key, data],
Key is the key of a record, corresponding to the primary key in the table. Data is the data in a row except the primary key
To describe the b-tree
- An M-level B-tree has the following characteristics:
- Each node has a maximum of M children
- Each node except the root node and leaf node has at least Ceil(m/2) children
- If the root node is not a leaf node, there are at least 2 children
- All leaf nodes are in the same layer and contain no other keyword information
- Each non-terminal node contains n keywords (P0,P1… Pn, k1,… Kn)
- The number of keywords n is ceiL (m/2)-1 <= N <= M-1
- Ki (I = 1,… N) indicates the keyword in ascending order.
- Pi (I = 1,… N) is the pointer to the sub-root node, and the keywords of all nodes of the sub-tree pointed to by P(i-1) are smaller than ki, but larger than K (i-1).
- A third-order B-tree structure is shown in the following figure
- Each node occupies the disk space of one disk block.
A node has two ascending ordered keywords and three Pointers to child root nodes
, the pointer stores the address of the disk block where the child node resides. The three scope fields divided into two keywords correspond to the scope fields of the data of the subtree pointed to by the three Pointers. For the root node, the keywords are 17 and 35. The data range of the subtree to which the P1 pointer points is less than 17, the data range of the subtree to which the P2 pointer points is 17-35, and the data range of the subtree to which the P3 pointer points is greater than 35. Simulate the process of finding keyword 29:- Locate disk block 1 based on the root node and read it into memory. [Disk I/O operation 1]
- Compare keyword 29 in the interval (17,35) to find pointer P2 to disk block 1
- Locate disk block 3 according to P2 pointer and read into memory. [Disk I/O operation 2nd]
- Compare keyword 29 in interval (26,30) to find pointer P2 to disk block 3
- Locate disk block 8 according to P2 pointer and read into memory. [Disk I/O operation 3]
- Find keyword 29 in the keyword list in disk block 8
- Analyzing the above procedure, we found that three disk I/O operations and three memory lookups were required
Since the keyword in memory is an ordered table structure, dichotomy search can be used to improve the efficiency, and three disk I/O operations are the determining factor affecting the efficiency of the entire B-tree search
- Compared with AVLTree, b-tree reduces the number of nodes (because AVLTree has only one key word and two Pointers), so that each disk I/O takes effect, thus improving the query efficiency
- Disadvantages of b-tree: In the B-tree structure diagram, you can see that each node contains not only the key value but also the data value. And the storage space of each page is limited,
Large data results in a small number of keys per node (i.e. a page)
When a large amount of data is stored, the depth of the B-tree is large, which increases the disk I/O times and affects the query efficiency. Therefore, a new definition — is introducedB+Tree
- First define a record as a binary [key, data],
-
B+Tree
- B+Tree is an optimization based on B-Tree to make it more suitable for external storage index structure. InnoDB storage engine uses B+Tree to achieve its index structure
- In a B+Tree, all data record nodes are stored on the leaf nodes at the same layer according to the order of key values, instead of the non-leaf nodes only storing key values. This greatly increases the number of key values stored on each node and reduces the height of the B+Tree
- B+Tree is different from B-tree in the following aspects:
- A closed range is used for keyword search of B+Tree nodes
- Non-leaf nodes of B+Tree do not store data information, but only the references of keywords and child nodes
- The data corresponding to the B+Tree keyword is saved in the leaf node
- B+Tree leaf nodes are ordered, and adjacent nodes have a sequential reference relationship
- The upper limit of pointer to each node is 2D instead of 2d+1.
- Since non-leaf nodes of B+Tree store only key value information instead of data data, all data data are stored in leaf nodes, which makes the degree D value of non-leaf nodes larger
- Assuming that each disk block can store four key values and pointer information, the structure of the B+Tree is shown in the following figure
- There are usually two head Pointers on a B+Tree, one to the root node and the other to the leaf node with the smallest keyword, and there is a chain-ring structure between all the leaf nodes (that is, data nodes). So we can do two lookup operations on B+Tree:
- One is range lookup and paging lookup for primary keys
- One is to start at the root node and do a random lookup
- There are only 22 data records in the above example, so we can not see the advantages of B+Tree. Here is a calculation:
- InnoDB storage engine page size is 16KB, the typical table primary key type is INT (occupies 4 bytes) or BIGINT (occupies 8 bytes), pointer type is also generally 4 or 8 bytes, i.e
A page (i.e. a node in B+Tree) stores approximately 16KB/(8B+8B)=1K keys
(K is 10^3 for the sake of calculation). This means that a B+Tree index with a depth of 3 can maintain (10^3 * 10^3 * 10^3 = 1 billion) records
- InnoDB storage engine page size is 16KB, the typical table primary key type is INT (occupies 4 bytes) or BIGINT (occupies 8 bytes), pointer type is also generally 4 or 8 bytes, i.e
- In practice, each node may not be fully filled, so in the database, the height of B+Tree is usually 2-4 levels. MySQL’s InnoDB storage engine is designed to keep the root node resident in memory,
This means that it takes at most 1 to 3 disk I/O operations to find a row record for a particular key value
-
Performance analysis of b-tree /B+Tree indexes
- Precontents:
- H: The degree of the exponent
- Degree: The degree of a node in a binary tree refers to the number of subtrees that the node has, while the degree of the whole tree refers to the maximum degree of the node in the tree (the degree D represents the degree of the whole tree).
- The disk I/O count is generally used to evaluate the index structure
- Analyze from b-tree: According to the definition of b-tree, a maximum of H (height) nodes need to be accessed once. The designers of the database system took advantage of disk prefetch by setting the size of a node equal to one page so that each node could be fully loaded with only one I/O. In order to achieve this, the following techniques are also needed to implement the B-tree in practice:
- Each time you create a node, apply for a page of space directly
(corresponding to the previous page is a node of B+Tree)
This ensures that a node is physically stored on a page, and computer storage allocation is page-aligned, enabling a node to require only one I/O - A search in b-tree requires h-1 I/ OS at most (because the root node is resident in memory), and the progressive complexity is O(h)=O(logdN). In general practice, the output d is a very large number, usually over 100, so h is very small (usually no more than 3)
- In summary, using B-tree as index structure is very efficient
- Each time you create a node, apply for a page of space directly
- B+Tree: B+Tree is more suitable for external storage index, and the reason is related to the degree d of inner node. From the above analysis, it can be seen that the larger d is, the better the index performance is, and the upper limit of the outgo degree depends on the size of key value and data value in the node:
- Dmax = floor(pagesize/(keysize + datasize + pointsize)) (pagesize — dmax >= pointsize) or
- Dmax = floor(pagesize/(keysize + datasize + pointsize)) — 1 (pagesize — dmax < pointsize)
- Therefore, B+Tree with data field deleted from the non-leaf node obviously has a larger degree D value
(1K based on the rough estimate above)
, so more suitable for indexing
- Precontents:
-
Why InnoDB uses B+Tree instead of Hash index, red black Tree or B-tree?
- B+Tree = B+Tree Since the non-leaf nodes of B+Tree abandon the Data field, B+Tree can store more Pointers in a single node, that is to say, there are more child nodes pointing to, so there are more nodes queried. Meanwhile, it means that the height of the Tree h is not high (generally 2~4 layers), which can reduce the I/O times. Finally, all leaf nodes of B+tree are connected by a single linked list, which is suitable for range-based sequential search
- B+tree and red black tree For B+tree with N leaf nodes, the search complexity is O(logdN) and the D value is generally greater than 100. Even when the data volume reaches tens of millions, the height of B+tree remains around 3-4, ensuring that the target data can be queried after 3-4 disk I/O operations. However, a red-black tree is a binary tree with two node-point nodes, which means its search complexity is O(logN) and the height of the tree is much higher than that of B+ Tree. Therefore, a red-black tree requires more disk I/O times to retrieve target data
- Finally, compare B+Tree and Hash indexes: Unlike a B-tree index, which has to access the page node from the root node to the branch node, the search efficiency of a Hash index is much higher than that of a B-tree index. However, Hash indexes have many disadvantages. For example: Hash conflicts, support only equivalent lookup, can’t avoid reading rows…
- The following categories of indexes can be understood after looking at the implementation section of indexes
Primary and secondary indexes
- MyISAM primary and secondary indexes
- The index file and data file of MyISAM engine are separated, and the data field of leaf node in B+Tree structure stores the memory address of data record
- Index files are separated from data files, and such indexes are called “non-clustered indexes.”
- MyISAM’s primary and secondary indexes are not very different,
Only the primary key index cannot have duplicate keys
- InnoDB primary key index and secondary index
- InnoDB does not separate index files from data files, and the data field of leaf nodes in its B+Tree structure stores the actual data records (for the actual data records of primary key index, for secondary indexes, the values of primary key index are stored).
- Index files are not separated from data files. Such indexes are called “clustered indexes”, and only one clustered index can exist for a table
Single-value index and union index
-
An index can contain only one field or multiple fields at once. An index consisting of a single field can be called a single-valued index, otherwise it is called a joint index, also known as a composite index or a multi-valued index
-
If we have the following table structure: the id field is the primary key index and the username field is the secondary index, then (id, username) is the joint index
CREATE TABLE `user_table` ( `id` int(11) unsigned NOT NULL AUTO_INCREMENT, `username` varchar(255) NOT NULL, `password` varchar(255) DEFAULT NULL, `age` int(11) unsigned Not NULL.PRIMARY KEY (`id`), key (`username`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 Copy the code
- Note the InnoDB storage engine
-
The data order of an index in a joint index is related to the order of the fields. If the value of the front field in an index with multiple values is the same, the index will be sorted according to the value of the next field
-
InnoDB storage engine B+Tree follows the left-most prefix principle. Let’s start with an SQL statement to explain what the left-most prefix principle is.
select password from user_table where username = 'xdh' Copy the code
- We added a joint index (username,password), pay special attention to the order of the joint index, if we reverse the order to (password,username), so that the query can use this index? The answer is no!
- This is the first meaning of the leftmost prefix:
A query can use a federated index only when the query condition is one of the fields in the federated index
- We now have the following three query scenarios:
- The first word of the user name is the password of the user whose name starts with “Zhang”, that is, the search condition clause is
Where username = '%';
- Investigate the password of the user name containing the word “Zhang”, that is, the query condition clause is
"Where username = '% % %'"
- Find the password of the user whose user name ends with “Zhang”, that is, the search condition clause is
"Where username = '% '"
- The first word of the user name is the password of the user whose name starts with “Zhang”, that is, the search condition clause is
- So actually this joint index only works in the first scenario, why?
- This is the second meaning of the leftmost prefix:
Index can be used to query fuzzy query based on the leftmost characters of the field value when the condition field is the index field
-
extension
- For example, if you need to find the age of a username, there are three ways to do this:
- Select * from age; select * from age; select * from age
- Query by creating a federated index (username, age)
- We can create (username, password, age) federated index based on (username, password), so that we can
The number of indexes to be maintained remains the same
- When creating an index, use less space to create an index. For example, you often need to query age by username or username by age. There are two methods:
- (username, age) Joint index + age Single-field index
- (age, username) Combined index + USERNAME Single-field index
- In general, the username field is much larger than the age field, so we should choose the first one, which takes up less space. This is the least space rule
- For example, if you need to find the age of a username, there are three ways to do this:
Cover index
-
Is the best case for joint and index queries, do not perform back table queries
-
Assume that the following two SQL statements are executed
select id from user_table where username = 'xdh' select password from user_table where username = 'xdh' Copy the code
-
The first statement is executed like this: Select * from B+Tree; select * from B+Tree; select * from B+Tree; select * from B+Tree; select * from B+Tree; There is no need to find additional data through the primary key index
-
The second statement is executed like this: Select * from B+Tree; select * from B+Tree; select * from B+Tree; Select * from user where username > id > password; select * from user where id > password;
-
An overwrite index is used when both the query fields (SELECT column) and the query condition fields (WHERE clause) of an SQL statement are contained in an index, and the index query can be used directly without returning to the table
-
By using overwrite index, can reduce the number of search tree, is a common means of performance optimization, for example, the second SQL statement above can establish a joint index (username, password) to achieve the overwrite index, to achieve the purpose of optimization
-
See if the federated index was used successfully
EXPLAIN SELECT id FROM user_table WHERE username = 'xdh'; EXPLAIN SELECT password FROM user_table WHERE username = 'xdh'; Copy the code
-
You can see if the override index was successfully used from the Extra column
An index pushdown
- A MySQL database optimization strategy for federated indexes, described later in performance optimization
Clustering index
- Innodb primary key index, non-leaf node stores index pointer, leaf node stores both index and data, is a typical clustered index (it can be found here that the storage order of index and data is strongly correlated, so it is a typical clustered index)
Secondary indexes in Inndob are also non-clustered indexes
Non-clustered index
- MyISAM index and data files in separate storage, B + Tree leaf nodes stored is the address of data storage, rather than the specific data, is typical of the clustering index, in other words, data can be casually find a place to save on the disk, the index also can casually find a place to save on disk, as long as a leaf node records on the corresponding relationship. () Index storage order has nothing to do with data storage relationship, it is a typical non-clustered index.)
Secondary indexes in Inndob are also non-clustered indexes
The only index
- The system checks whether the index has the same key value when creating the index, and checks each time the INSERT or UPDATE statement is used to add data. If the index has the same value, the operation fails and an exception is thrown
- Note that:
A primary key index must be a unique index, and a unique index is not necessarily a primary key index
The full text indexing
- Prior to MySQL5.6 only the MyISAM storage engine supported the full-text engine
- InnoDB storage engine added full text index support in MySQL5.6, but does not support Chinese full text index
- InnoDB storage engine supports Chinese full-text indexing after MySQL5.6
Index implementation
- The index is realized by the storage engine. The two main engines in MYSQL are Myisam and InnoDB. The storage engine is built on the table, and the storage engine can be specified when the table is established
CREATE TABLE `user` { `id` int(11) NOT NULL, `name` varchar(255) NOT NULL.PRIMARY KEY(`id`) } ENGINE=InnoDB DEFAULT CHARSET=UTF8 Copy the code
CREATE TABLE `user` { `id` int(11) NOT NULL, `name` varchar(255) NOT NULL.PRIMARY KEY(`id`) } ENGINE=MYISAM DEFAULT CHARSET=UTF8 Copy the code
Implementation of B+Tree in MyISM
- FRM (table structure file), table_name.MYD(data save file), table_name.MYI(index save file)
(Photo from Internet)
- For example, in the appellate teacher table, two files save data and index respectively. Since only leaf nodes in B+Tree save data area, in MyISAM,
The data section stores the reference address of the data
For example, if the data with ID=101 is saved to the physical disk address 0x123456, the node data in the index is saved to the disk address pointer. When this pointer is scanned, data can be loaded from the disk pointer - In MyISAM B+Tree implementation, for example, now instead of ID as index, use name, then his representation is what? In fact, it is the same as ID as the index, but also saves the pointer to the specified disk location, as shown in the following picture (the picture is from the network)
B+Tree in InnoDB
- FRM (table structure file), table_name. Idb (data and index save file)
- In InnoDB, because primary keys were designed to be very important. The storage engine will implicitly create a 6-digit primary key index for us to organize the data store by the primary key index.
On the leaf node, the Data section holds all information about the data
- If the name field is indexed at this time:
A secondary index with the name field as the index will be generated, and the data stored on the leaf node at the moment is the keyword value of the clustered index (ID index), and the value of the ID index will be found based on the secondary index, and then the final data will be obtained through the ID index area
That’s what it’s called“Back table query”
The index strategy
Cover index
Left-most prefix policy
An index pushdown
-
Index Condition Pushdown (ICP) is a query optimization strategy introduced in MySQL 5.6. It pushes the Index Condition check originally done by the Server layer down to the storage engine layer to reduce The Times of returning to the table and accessing the storage engine and improve the query efficiency
-
The principle of
- What is the query process for federated indexes without ICP
- The storage engine reads the index record
- Locate and read the complete row record based on the primary key value in the index
- The storage engine passes the record to the Server layer to check whether the record is satisfied
WHERE
conditions
- In the case of ICP, what is the query process of the joint index
- Read index records (not full row records, i.e. records for secondary indexes)
- judge
WHERE
Whether the condition part can be checked by the column in the index, if the condition is not met, the next row of index records will be processed; , - If the condition is met, the primary key value in the secondary index is used to locate and read the complete row record (so-called back table);
- The storage engine passes the record to the Server layer, which checks if the record is full
WHERE
The rest of the condition
- What is the query process for federated indexes without ICP
-
case
- Create a user table and insert data
CREATE TABLE USER ( id INT ( 11 ) NOT NULLAUTO_INCREMENT COMMENT "Primary key ", nameVARCHAR ( 32COMMENT "name ", cityVARCHAR ( 32COMMENT "city ", ageINT ( 11COMMENT "age ",PRIMARY KEY ( id ), KEY idx_name_city ( NAME, city ) ) ENGINE = INNODB DEFAULT CHARSEt = UTF8; INSERT INTO USER2 (`name`, city, age ) VALUES ( "ZhaoDa", "BeiJing", 20 ), ( "QianEr", "ShangHai", 21 ), ( "SunSan", "GuanZhou", 22 ), ( "LiSi", "ShenZhen", 24 ), ( "ZhouWu", "NingBo", 25 ), ( "WuLiu", "HangZhou", 26 ), ( "ZhengQi", "NanNing", 27 ), ( "WangBa", "YinChuan", 28 ), ( "LiSi", "TianJin", 29 ), ( "ZhangSan", "NanJing", 30 ), ( "CuiShi", "ZhengZhou", 65 ), ( "LiSi", "KunMing", 29 ), ( "LiSi", "ZhengZhou", 30 ); Copy the code
- Table record
- View the index in the table
- Execute SQL query statements
SELECT * FROM user2 WHERE name = "LiSi" AND city LIKE "%Z%" AND age > 25; Copy the code
-
Case analysis
- Push under index conditions is enabled by default and system parameters can be used
optimizer_switch
To check whether the controller is enabled, you can use the following command to control whether to start index push down- set optimizer_switch=”index_condition_pushdown=off”;
- set optimizer_switch=”index_condition_pushdown=on”;
- Push under index conditions is enabled by default and system parameters can be used
-
Do not use index push-down: according to the “left-most match” principle of the joint index, only name column can be used in the index, city column is fuzzy match, can not use the index, in this case the execution process is like this
- Storage engine based
(name, city)
Joint index, findname
The value is LiSi. There are four records - Then, according to the ID values of the four records, the table is scanned one by one to fetch the complete row records from the main index and return these records to the Server layer
- The Server layer receives these records and conditions them
name="LiSi" and city like "%Z%" and age > 25
Filter it and leave it behind("LiSi", "ZhengZhou", 30)
The record
- Storage engine based
-
Push down using an index: This is how it works
- Storage engine based
(name, city)
Select * from LiSi where name=’LiSi’ - Because the federated index contains
city
Columns that the storage engine presses directly in the federated indexcity like "%Z%"
After filtering, two records are left - According to the id value of the filtered records, the table is scanned one by one, the complete row records are removed from the main index, and these records are returned to the Server layer
- According to the Server layer
WHERE
Other conditions on the statementage > 25
, the row records are filtered again, and only the row records are left("LiSi", "ZhengZhou", 30)
The record
- Storage engine based
-
Determine whether to use index push-down from the execution plan (Extra)
-
Usage scenarios
- Can only be used for
Range, ref, eq_ref, ref_OR_NULL
Access methods - Can only be used with InnoDB and MyISAM storage engines and their partitioned tables
- For the InnoDB storage engine, index push-downs are only available for secondary indexes, since the primary key index leaves contain entire rows, and no index push-downs are used
- Can only be used for
Application/Inapplication scenarios of indexes
Applicable scenario
- The primary key automatically creates a unique index
- Fields frequently used as query criteria
- The foreign key relationship is used to index the fields associated with other tables in the query
- Single key/composite index selection problem, high concurrency tends to create composite index
- A sorted field in a query that is significantly faster through index access
- Statistics or grouping fields in the query
Not applicable Scenario
- Too few table records
- A watch that is often added, deleted, or modified
- Table fields with repetitive, evenly distributed data should only be indexed for the columns that are most frequently queried and sorted (indexing doesn’t make much sense if a data class contains too many duplicates).
- Frequently updated fields are not suitable for index creation (add IO burden)
- Indexes are not created for fields that are not needed in the WHERE condition
Index relevant interview questions
- Q: What do you understand about MySQL indexes?
- Q: Database index principle, why use B+Tree, why not binary Tree, red black Tree or b-tree?
- An index is a data structure used in a storage engine to quickly find records. Its essence is to filter out the final results by constantly narrowing the range of data you want. In the database, it is through B+Tree to achieve the index
- B+Tree B+Tree B+Tree B+Tree B+Tree B+Tree B+Tree If the data value stored in a node is too large and the space of a node is limited, the key value will be reduced. When a large amount of data is stored, the height of the B-tree will be increased, resulting in the increase of I/O operations and ultimately affecting the query efficiency
- B+Tree B+Tree B+Tree B+Tree
- Why not use a red-black tree or Hash index
Performance analysis of b-tree /B+Tree indexes
Answer the content that has been answered
- Q: What is the difference between a clustered index and a non-clustered index?
- The difference is whether the storage order of the index and the storage order of the data is relational, relevant is clustered index, irrelevant is non-clustered index
- Q: Do leaf nodes store data or memory addresses pointing to data?
- If the storage engine of this table is MyISAM, the data section of its leaf node holds the memory address
- If the storage engine of the table is InnoDB and the table is a B+Tree created by the primary key index, the data field of the leaf node of the table stores the data. If the table is a B+Tree created by the non-primary key index, the leaf node stores the values of the primary key index
- Q: InnoDB engine index policy, know?
- Cover index
- Left-most prefix rule
- An index pushdown
- Q: What are the ways to create an index?
- Created when the table is created
CREATE TABLE `user_table` ( `id` INT ( 11 ) UNSIGNED NOT NULL AUTO_INCREMENT, `username` VARCHAR ( 255 ) NOT NULL, `password` VARCHAR ( 255 ) DEFAULT NULL, `age` INT ( 11 ) UNSIGNED NOT NULL.PRIMARY KEY ( `id` ), KEY ( `username` ) ) ENGINE = INNODB DEFAULT CHARSET = utf8 Copy the code
- Create a table after it is created
CREATE UNIQUE INDEX idIndex ON test(id) Copy the code
- View table index
SHOW INDEX FROM TABLE_NAME Copy the code
- Created when the table is created
- Q: What should I pay attention to when using indexes?
- The storage engine uses this field to create key values for nodes in a B+Tree. If the key value is too large, the number of leaf nodes on each node decreases, which affects the height of the Tree, increases I/O operations, and affects the query speed
- Q: Why is it recommended to increment the primary key by integer instead of UUID?
- UUID is a string that consumes more storage space than an integer
- Searching a B+Tree requires comparing the size of the node to the value of the node. Integer data is faster than string data
- The incremented integer index is stored consecutively on disk, and consecutively when a page of data is read. The UUID is generated randomly, and the upper and lower rows of data that are read are stored separately and are not suitable for execution
where id > 5 && id < 20
The conditional query statement of - When data is inserted or deleted, the integer increment primary key creates a new leaf node at the end of the leaf node without breaking the structure of the left subtree. The UUID primary key is prone to such a situation. In order to maintain its own characteristics, B+Tree may undergo structural reconstruction and consume more time
- Q: Why do non-primary key index leaves store primary keys?
- Ensure data consistency and save storage space
- Q: When does an index fail?
- If the query condition contains OR, the index may become invalid
- If the field type is a string, where must be quoted, otherwise the index will be invalid
The like wildcard may invalidate the index (wildcard is used at the far left)
Union index, the query condition column is not the first column in the union index, index failure
- Index invalid when using mysql’s built-in functions on index columns
- Index column operations (e.g., +, -, *, /) invalidate the index
- Index fields using (! = or < >, not in) may cause index invalidation
- If is null is not null is used on an index field, the index may become invalid
- Left-link query or right-link query The encoding format of the field associated with the query is different, which may cause index failure
- Mysql estimates that a full table scan is faster than an index, so no index is used
The SQL statement
- The difference between count(*) and count(1) and count(column name)?
-
Count (expr) function
- It means
- COUNT(expr), which returns the number of rows retrieved by the SELECT statement whose expr value is not NULL. The result is a BIGINT value
- If the query result does not match any records, 0 is returned
- The query result for COUNT(*) contains the number of rows with a value of NULL
- Count (*) is a standard syntax for counting rows specified in SQL92, regardless of the database
- It means
-
COUNT(column name), COUNT(constant), and COUNT(*)
- According to the count function, expr corresponds to column name, constant, and * respectively
- The difference between COUNT(*) and COUNT(constant) and COUNT(column name) in query results
- COUNT(constant) and COUNT(*) are direct
Query the number of rows in the eligible database table
- COUNT(column name) is
Query the number of eligible columns whose values are not NULL
- COUNT(constant) and COUNT(*) are direct
- Is there a difference between COUNT(*) and COUNT(1)? Here are two theories:
- Some say COUNT(*) is converted to COUNT(1) when executed, so COUNT(1) has fewer conversion steps, so it is faster
- MySQL is optimized for COUNT(*)
Later can speak
, so COUNT(*) is faster
- In fact, the official document says:
InnoDB handles SELECT COUNT(*) and SELECT COUNT(1) operations in the same way. There is no performance difference. Copy the code
- MySQL optimizations are exactly the same for COUNT(1) and COUNT(*), there is no one faster than the other!
- COUNT (column name)
- The query for COUNT(column name) performs a full table scan and determines whether the value of the specified column is NULL
- COUNT(column name) has one more step than COUNT(*) to determine whether the queried field is NULL, so its performance is slower than COUNT(*)
- Optimization of COUNT(*)
- MyISAM storage engine will directly record the total number of rows of the table separately for COUNT(*) query
- The InnoDB storage engine selects the smallest index when scanning tables to reduce costs
Note: All of these optimizations assume that there are no conditional queries for WHERE and group
-
conclusion
- COUNT(*) = COUNT(1) > COUNT(*)
- What is the difference between in and exists in MySQL?
-
First, a brief introduction to EXISTS and IN
-
EXIST: specifies a subquery that detects the existence of rows
- A subquery after exists() is called a correlation subquery and does not return the value of a list.
It simply returns a true or false result
, its operation mode is to run the primary query once, and then go to the sub-query query and its corresponding results. If it is true, it prints; otherwise, it does not - Query subqueries based on each row in the main query (a loop structure)
- As follows:
select * from user where exists (select 1); Copy the code
- Select 1 from user; select 1 from user; select 1 from user; select 1 from user
select * from user;
Is the same
- Select 1 from user; select 1 from user; select 1 from user; select 1 from user
- And is as follows:
select * from user where exists (select * from user where user_id = 0); Copy the code
- You can see that when you loop on the user table, check the conditional statement
(select * from user where user_id = 0)
Since user_id is never 0, the conditional statement always returns an empty set. The condition is always false, and all records in the user table are discarded
- You can see that when you loop on the user table, check the conditional statement
- A subquery after exists() is called a correlation subquery and does not return the value of a list.
-
IN: Determines whether a given value matches a value IN a subquery or list
- In behind the ()
A subquery returns a result set
In other words, the order of execution is different from exists(). The subquery produces the result set, and then the main query goes to the result set to find a list of fields that match the requirements. Meet the requirements of the output, otherwise not output - Note: the select statement corresponding to in must return a column! It can be multiple lines
- In behind the ()
-
Relationship between
- SQL has been changed so that both can achieve the same goal
SELECT * FROM p_user_2 WHERE id IN ( SELECT id FROM p_user ); SELECT * FROM p_user_2 WHERE EXISTS ( SELECT id FROM p_user WHERE id = p_user_2.id ); Copy the code
-
contrast
- In is
Hash join outer and inner tables
And the existsLoop the exterior
The inner table is queried again each time through the loop - So in is actually similar to equal, so in(1,2) is just a simple way of writing equal to 1 or equal to 2, so
Use in when there are few elements and exists if there are many
- Exists generally needs to be associated with sub-tables, and indexes are required to be associated to speed things up
- In can be used with subqueries or directly in (a, b…).
- Exist uses indexes for tables in subqueries
- Not Exist uses indexes for both primary and subqueries
- When in is used with subqueries, indexes can only be used for the primary query
- Not in does not use any indexes
Note: It has always been inaccurate to say that EXISTS is more efficient than IN
- In is
-
case
- Case a
SELECT * FROM A WHERE id IN ( SELECT id FROM B ) Copy the code
- The above query uses the in statement. In () is executed only once
cached
. Then check whether the IDS in table A are the same as those in table B. If they are equal, the records in table A are added to the result set until all records in table A are iterated. Its query process is similar to the following
List resultSet=[]; Array A=(select * from A); Array B=(select id from B); for(int i=0; i<A.length; i++) {for(int j=0; j<B.length; j++) {if(A[i].id==B[j].id) { resultSet.add(A[i]); break; }}}return resultSet; Copy the code
- The above query uses the in statement. In () is executed only once
- Conclusion a
- As you can see, it is not appropriate to use in() when table B is large, because it traverses all table B data once
- For example, if table A has 10,000 records and table B has 1,000,000 records, the maximum number of traversal times may be 10,000 x 1,000,000, which is inefficient
- If table A has 10000 records and table B has 100 records, the maximum number of traversals is 10000 * 100, and the number of traversals is greatly reduced
- In () applies when table B is smaller than table A
- As you can see, it is not appropriate to use in() when table B is large, because it traverses all table B data once
- Case 2
SELECT a.* FROM A a WHERE EXISTS ( SELECT 1 FROM B b WHERE a.id = b.id ) Copy the code
- The above query uses the EXISTS statement. Exists () is executed a.length times. It does not cache the exists() result set. Because exists() the contents of the result set are not important, what is important is whether there are records in the result set. Return true if exists, false if there are no records in the result set. Its query process is similar to the following
List resultSet=[]; Array A=(select * from A) for(int i=0; i<A.length; i++) {if(exists(A[i].id) { // SELECT 1 FROM B B WHERE a.id = b.idresultSet.add(A[i]); }}return resultSet; Copy the code
- Conclusion two
- Use exists() when table B is larger than table A because it does not iterate and only needs to perform one more query
- Exists () applies to the case where the data in table B is larger than that in table A
- Case a
-
conclusion
- Generally, the efficiency of using EXISTS is higher than that of using IN, because IN does not go through the index. However, it depends on the actual situation.
- IN is suitable for large on the outside and small on the inside
- EXISTS applies to a situation where the outer surface is small and the inner surface is large
- If the inner table data is as large as the outer table data, the efficiency of IN and EXISTS is similar. You can use either of them
- Generally, the efficiency of using EXISTS is higher than that of using IN, because IN does not go through the index. However, it depends on the actual situation.
SQL query execution sequence
- For query processing, it can be divided intoLogical query processingandPhysical Query processing
- Logical query processing means what should be the result of executing the query
- How does the physical query represent the MySQL database getting this result
- The methods of the two queries may be completely different, but the results must be the same
-
SQL Query case
- Let’s start with a query statement to explain the function of the related SQL commands
- Create a table structure and insert data
Create the Customers table CREATE TABLE customers ( customer_id VARCHAR(10), city VARCHAR(10) NOT NULL.PRIMARY KEY (customer_id) ) ENGINE = InnoDB; Insert data into the CUSTOMERS table INSERT INTO customers VALUES ('163'.'HangZhou'), ('9you'.'ShangHai'), ('TX'.'HangZhou'), ('baidu'.'HangZhou'); Create table Orders CREATE TABLE orders2 ( order_id INT AUTO_INCREMENT, customer_id VARCHAR(10), PRIMARY KEY (order_id) ) ENGINE = InnoDB; Insert data into the Orders table INSERT INTO orders2 VALUES (1.'163'), (2.'163'), (3.'9you'), (4.'9you'), (5.'9you'), (6.'TX'), (7.NULL); Copy the code
- Table record
- Execute query statement:
[Query customers from Hangzhou with order number less than 2, and query their order quantity, the query results are sorted according to order number from small to large]
SELECT c.customer_id, count( o.order_id ) AS total_orders FROM customers AS c LEFT JOIN orders2 AS o ON c.customer_id = o.customer_id WHERE c.city = 'HangZhou' GROUP BY c.customer_id HAVING count( o.order_id ) < 2 ORDER BY total_orders DESC; Copy the code
- The query results
-
Logical query order
- The general structure of a query statement is as follows
(8)SELECT (9)DISTINCT <select_list> (1)FROM <left_table> (3)<join_type>JOIN <right_table> (2)ON<join_condition> (4)WHERE<where_condition> (5)GROUP BY<group_by_list> (6)WITH {CUBE|ROLLUP} (7)HAVING<having_condition> (10)ORDER BY<order_by_list> (11)LIMIT<limit_number> Copy the code
- The sequence number of the query statement is the processing order of the query statement. It can be seen that there are a total of 11 steps. FROM operation is performed first, and LIMIT operation is performed last
- Each operation produces a virtual table that serves as input to a process. These virtual tables are transparent to the user, and only the virtual tables generated in the last step are returned to the user
- Analyze each stage in detail
- FROM: execute Cartesianproduct on left
and right
in FROM clause to produce virtual table VT1
- ON: Applies ON filtering to virtual table VT1. Only those rows that match < JOIN_condition > are inserted into virtual table VT2
- JOIN: If OUTER JOIN (such as LEFT OUTER JOIN and RIGHT OUTER JOIN) is specified, the unmatched rows in the table are added to virtual table VT2 as external rows, resulting in virtual table VT3. If the FROM clause contains more than two tables, repeat steps (1) through (3) for the result table VT3 generated by the previous join and for the next table until all tables are processed
- WHERE: Applies the WHERE filtering condition to virtual table VT3. Only records that match where_condition are inserted into virtual table VT4
- GROUP BY: VT5 is generated BY grouping the records in VT4 according to the columns in the GROUP BY clause
- CUBE | ROLLUP: CUBE or a ROLLUP operation on table VT5 VT6 produce table
- HAVING: Apply a filter to virtual table VT6. Only records that match
are inserted into virtual table VT7
- SELECT: Perform the SELECT operation for the second time to SELECT the specified column and insert it into virtual table VT8
- DISTINCT: Removes duplicate data and generates virtual table VT9
- ORDER BY: ORDER the records in virtual table VT9 according to
to generate virtual table VT10
- LIMIT: Retrieves the records of the specified row, generates virtual table VT11, and returns it to the query user
- FROM: execute Cartesianproduct on left
-
Analyze the execution of the query SQL above
-
Select * FROM table VT1; select * FROM table VT1; select * FROM table VT1; select * FROM table VT1
- If the table before the FROM clause contains row A and the table after the FROM clause contains row B, then virtual table VT1 contains row A * B, resulting in the result set with the columns of the former table first and the columns of the latter table last
FROM customers as c ....... JOIN orders as o Copy the code
-
Apply ON filter: There are three filter procedures for SELECT query: ON, WHERE, and HAVING. ON is the filtering process performed first. Based on the virtual table VT1 generated in the previous section, the filtering conditions are
ON c.customer_id = o.customer_id Copy the code
-
Add external rows: This step only happens if the JOIN type is OUTER JOIN
- LEFT OUTERJOIN, RIGHT OUTERJOIN, FULL OUTERJOIN
You can omit the OUTER keyword, but OUTER represents the OUTER row
. - Select * from LEFT OUTER JOIN where LEFT OUTER JOIN = reserved; select * from RIGHT OUTER JOIN where RIGHT OUTER JOIN = reserved; select * from RIGHT OUTER JOIN where LEFT OUTER JOIN = reserved
- External rows are added to the VT2 table to add the data filtered out by the filtering conditions in the reserved table. The data in the non-reserved table is assigned a NULL value. Finally, virtual table VT3 is generated
In this example, the reserved table is customers, and the customer baidu is filtered in the VT2 table because there is no order, so Baidu is added to the virtual table VT2 as an external row, assigning NULL to the data in the non-reserved table
- If more than two tables need to be connected, repeat steps (1) to (3) at the beginning of this section for virtual table VT3. The last virtual table is used as the output of the next step
customers as c LEFT JOIN orders as o Copy the code
- LEFT OUTERJOIN, RIGHT OUTERJOIN, FULL OUTERJOIN
-
Apply the WHERE filter: Filter the virtual table VT3 generated in the previous step WHERE criteria. Only records that match where_condition are output to the virtual table VT4
WHERE c.city='HangZhou' Copy the code
- Pay attention to: Two types of filtering are not allowed when the WHERE filter is currently applied
- The data hasn’t been grouped yet, so it’s still
Statistical filtering cannot be used in the WHERE filter
, such asSELECT customer_id, count(customer_id) FROM ordersWHERE COUNT(customer_id) < 2; Copy the code
- Because no column selection operation is performed, therefore
Using a column alias in a SELECT is also not allowed
, such asSELECT order_id as o, customer_id as cFROM ordersWHERE c = '163'; Copy the code
- The data hasn’t been grouped yet, so it’s still
- Filtering in a WHERE filter is different from filtering in an ON filter
- For filtering in OUTER JOIN, records that are filtered out of the reservation table by the ON condition are added after the ON filter is finished
- Records that are filtered out of the WHERE condition are permanently filtered
- Pay attention to: Two types of filtering are not allowed when the WHERE filter is currently applied
-
Group: In this step, group the virtual tables generated in the previous step based on the specified columns to obtain virtual table VT5
GROUP BY c.customer_id SELECT * FROM customers AS c LEFT JOIN orders2 AS o ON c.customer_id = o.customer_id WHERE c.city = 'HangZhou' GROUP BY c.customer_id; Copy the code
-
Apply ROLLUP or CUBE: If the ROLLUP option is specified, an additional record is created to add to the end of virtual table VT5 and virtual table VT6 is generated
- Because our query does not use ROLLUP, this step will be skipped
-
Applying the HAVING filter: This is the last conditional filter after applying the ON and WHERE filters, respectively
- In this step, apply the HAVING filter to the virtual table generated in the previous step. HAVING is a filter to filter the grouping criteria
- For the query statement in the example, after grouping conditions are met, the order customer_ID 163 is deleted from the virtual table VT6 virtual table VT6 is generated
HAVING count(o.order_id < 2) SELECT * FROM customers AS c LEFT JOIN orders2 AS o ON c.customer_id = o.customer_id WHERE c.city = 'HangZhou' GROUP BY c.customer_id HAVING count(o.order_id) < 2; Copy the code
-
Process the SELECT list: Although SELECT is the first part of the query to be specified, it is not actually processed until Step 8). In this step, SELECT the column specified in SELECT from the virtual table generated in the previous step as SELECT part:
SELECT c.customer_id,count(o.order_id) AS total_orders SELECT c.customer_id, count(o.order_id) as total_orders FROM customers AS c LEFT JOIN orders2 AS o ON c.customer_id = o.customer_id WHERE c.city = 'HangZhou' GROUP BY c.customer_id HAVING count(o.order_id) < 2; Copy the code
-
Apply a DISTINCT clause: If a DISTINCT clause is specified in a query, a temporary table in memory is created (if it does not fit in memory, it is put on disk)
- The table structure of this in-memory temporary table is the same as the virtual table generated in the previous step, except that columns with DISTINCT operations are added with a unique index to remove duplicate data
- For queries that use GROUP BY, the use of DISTINCT is redundant because no rows are removed because the GROUP is already in place
- Because DISTINCT is not specified in this SQL query, skip this step
-
Apply the ORDER BY clause: Returns a new virtual table BY ordering the virtual tables output in the previous step according to the columns specified in the ORDER BY clause
- You can also specify the serial number of the columns in the SELECT list in the ORDER BY clause, as in statement 2,
In general, sorting in this way is not recommended because a programmer might change the columns in the SELECT list and forget to change the list in the ORDER BY
ORDER BY total_orders DESC SELECT c.customer_id, count(o.order_id) as total_orders FROM customers AS c LEFT JOIN orders2 AS o ON c.customer_id = o.customer_id WHERE c.city = 'HangZhou' GROUP BY c.customer_id HAVING count(o.order_id) < 2 ORDER BY total_orders DESC; SELECT c.customer_id, count(o.order_id) as total_orders FROM customers AS c LEFT JOIN orders2 AS o ON c.customer_id = o.customer_id WHERE c.city = 'HangZhou' GROUP BY c.customer_id HAVING count(o.order_id) < 2 ORDER BY 2.1; Copy the code
- Pay attention toHere are two common mistakes
- In MySQL databases, NULL values are always selected first in the ascending process, i.e. NULL values are treated as minimum values in the ORDER BY clause
- Many developers mistakenly believe that when selecting data from a table, records are fetched in ORDER of the size of the primary key in the table, resulting in an ORDER BY, which is not the case
- extension
- Relational database is developed on the basis of mathematics, and relation corresponds to the concept of set in mathematics. The common query operations in the database actually correspond to some operations of the collection: select, projection, join, union, cross, difference, division. Although the final result is presented to the user in the form of a two-dimensional table, it is a series of set operations from the inside of the database. Therefore, users need to understand the records in a table in terms of collections
- Because the data in a table is a member of a collection, and the collection is unordered. Therefore, for SQL statements without the ORDER BY clause, the parse result should be: select the desired subcollection from the collection, indicating that the results do not have to be ordered
- You can also specify the serial number of the columns in the SELECT list in the ORDER BY clause, as in statement 2,
-
LIMIT clause: Selects the specified row starting at the specified location from the virtual table in the previous step
- For a LIMIT clause that does not apply ORDER BY, the result is equally likely to be unordered, so the LIMIT clause is often used with the ORDER BY clause
- Because this SQL statement has no LIMIT clause, it will be skipped
-
-
Physical Query processing
The database may not perform queries in exactly the same manner as logical query processing
- In the MySQL database service layer, there are two components, the parser and the optimizer. The parser’s job is to analyze the SQL statement, and the optimizer’s job is to optimize the SQL statement, choose an optimal path to select data, but must ensure that the final result of the physical query processing is equal to the logical query processing
-
conclusion
JOIN the connection
- Common Joins can be divided into the following seven types
-
These seven joins are explained by creating two tables
CREATE TABLE customers ( customer_id VARCHAR(10), city VARCHAR(10) NOT NULL.PRIMARY KEY (customer_id) ) ENGINE = InnoDB; CREATE TABLE orders2 ( order_id INT AUTO_INCREMENT, customer_id VARCHAR(10), PRIMARY KEY (order_id) ) Copy the code
-
LEFT JOIN
- Returns records that include all records in left table (A) + join fields equal in right table (B)
- The records in the left table (A) will be all represented, while the right table (B) will show only the records that match the search criteria (in our example: Customers. Customer_id = orders2.customer_id)
Table B is NULL where insufficient records are recorded
SELECT * FROM customers LEFT JOIN orders2 ON customers.customer_id = orders2.customer_id; Copy the code
-
RIGHT JOIN
- Returns records that include all records in right table (B) + join fields equal in left table (A)
- The records in the right table (B) will all be represented, while the left table (A) will show only the records that match the search criteria (in the example: Customers. Customer_id = orders2.customer_id)
Table A is NULL where insufficient records are recorded
SELECT * FROM customers RIGHT JOIN orders2 ON customers.customer_id = orders2.customer_id; Copy the code
-
INNER JOIN
- Only records with equal join fields in two tables are returned
SELECT * FROM customers INNER JOIN orders2 ON customers.customer_id = orders2.customer_id; Copy the code
-
LEFT JOIN without the contents of the right table
- Means that we only return records whose right table is NULL
SELECT * FROM customers LEFT JOIN orders2 ON customers.customer_id = orders2.customer_id WHERE orders2.customer_id IS NULL; Copy the code
- RIGHT JOIN and do not contain the contents of the left table
- Means that we only return records whose left table is NULL
SELECT * FROM customers RIGHT JOIN orders2 ON customers.customer_id = orders2.customer_id WHERE customers.customer_id IS NULL; Copy the code
-
FULL JOIN
- All content on both sides should be present, and NULL should be added to those that can’t
- MySQL syntax does not support FULL OUTER JOIN, so we use UNION instead
SELECT * FROM customers LEFT JOIN orders2 ON customers.customer_id = orders2.customer_id UNION SELECT * FROM customers RIGHT JOIN orders2 ON customers.customer_id = orders2.customer_id WHERE customers.customer_id IS NULL; Copy the code
-
FULL JOIN with no intersection
SELECT * FROM customers LEFT JOIN orders2 ON customers.customer_id = orders2.customer_id WHERE orders2.customer_id IS NULL UNION SELECT * FROM customers RIGHT JOIN orders2 ON customers.customer_id = orders2.customer_id WHERE customers.customer_id IS NULL; Copy the code
Related interview questions
- Q: Count (*) differs from count(1) and count(column name).
- Q: What is the difference between in and exists in MySQL?
- Q: What is left join and right join, and what is the difference?
Subsequent content
- Interview questions 2: MySQL and related topics
Reference and thanks
- MySQL 30 thousand words summary + interview 100 questions with the interviewer
- MySQL > select * from ‘MySQL’;
- Talk about the MySQL architecture
- A Hash index
- Mysql > create index (B+Tree)
- Why B+Tree has a unique advantage in database indexes (why it is more suitable than red-black trees)
- Mysql index articles cover index, union index, index push down
- MySQL index classification, 90% of developers do not know
- Is count(1) more efficient than count(*)?
- The difference between IN and EXISTS usage IN SQL
- Description of the execution sequence of SQL query statements
- Interviewer: Do you know about InnoDB’s solution to magic reading?
- [Database lock mechanism] In-depth understanding of optimistic lock, pessimistic lock and CAS optimistic lock implementation mechanism principle analysis
- Full understanding of mysql lock mechanism (InnoDB) and troubleshooting
- MySQL index push down (ICP)
- SQL Optimization 2020 Most dry goods summary –MySQL
- Explore the principle of MySQL master-slave replication in depth