This article takes about 6.6 minutes to read.
SELECT COUNT(*) FROM t is a very common SQL requirement. In MySQL usage specification, we generally use the transaction engine InnoDB as the storage engine of (general business) tables. Under this premise, the time complexity of COUNT(*) operation is O(N), where N is the number of rows of the table.
The MyISAM table can quickly fetch the number of rows in the table. What is the mechanism behind these practical experiences, and why this is needed/can be, is what this article wants to explore.
MySQL COUNT(*) has some problems with two storage engines:
With these problems, InnoDB storage engine will be the main discussion.
InnoDB COUNT(*)
Main issues:
-
What is the implementation process like?
-
How do I calculate count? What factors influence count results?
-
Where does the count value live? What are the data structures involved?
-
Why InnoDB can only implement count(*) by scanning tables? (See question at the end of this article)
-
What is the risk of the full table COUNT(*) being a case of a table scan operation?
-
Is the COUNT(*) operation as likely as SELECT * to read overflow pages involving large fields?
1. Execution framework – Loop: read + count
1.1 Basic Conclusions
-
Full table scan, one loop to solve the problem.
-
Inside loop: Reads a row and decides if it counts.
-
Inside the loop, the counting is done line by line.
1.2 illustrates
Simple selelCt-SQL execution framework, analogous to INSERT INTO… SELECT is the same process.
The following steps will detail how to read and count (count++).
2. Execution process
Quote: The implementation process is divided into four parts:
-
COUNT(*) process: Before issuing SQL statements from the Client and executing SELECT from the mysql-server, prepare for the following elaboration.
-
COUNT(*) Flow: gives a brief overview of the code level flow framework and the key call stack portion of the 2 core steps.
-
Read a line: Visibility and the ROW_search_mVCC function that shows how visibility affects COUNT(*) results.
-
Evaluate_join_record and whether the column is empty describes how the counting process affects the COUNT(*) result.
If the reader wants to go straight to counting (*), he can ignore (1) and skip to (2).
2.1 COUNT(*) Process recall – Send SQL from Client to sub_SELECT function
To make the call less jarring, let’s recall how the sub_select function was executed:
-
The mysql-client sends SQL statements based on the MySQL communication protocol packet.
-
The mysql-server receives the data packet and parses the command type (QUERY) and SQL statement (string) through the protocol.
-
The SQL statement is parsed by the parser and output as an object of the JOIN class, which is used to express the SQL statement structurally.
PS: The JOIN structure here is not only a pure syntax structure, but has been semantic processing, roughly speaking, summarized the table list (table_list), the target column list (target_list), WHERE conditions, sub-query and other syntax structures.
In full table COUNT(*)-case, table_list = [table “t” (alias also “t”)], target_list = [target column object (column name “COUNT(*)”)], of course, there is no WHERE condition, subquery structure, etc.
-
The JOIN object has two important methods: JOIN::optimize() and JOIN::exec(), which are used for query optimization and query execution respectively.
Join ->optimize(), the optimization stage (myISam full table count(*) operation will cover a bit of this later). Join ->exec(), which contains InnoDB count(*) operation execution flow.
-
Join ->exec() calls the sub_SELECT function to execute simple SQL after several calls, including COUNT(*).
-
END of sub_select.
2.2 COUNT(*) flow (in sub_select function)
The upper level process and code is relatively simple, concentrated in the sub_select function, where the two classes of functions correspond to the two steps – read and count – described in the “Execution framework” section above. The conclusions are as follows:
-
Read a line: After a call from the sub_SELECT function at the relative top level, all branches are called into the row_search_MVCC function, This function is used to read a row from the B+-tree structure stored by InnoDB storage engine into a buF (uchar *) in memory for later processing. There are issues of row lock acquisition, MVCC, and row visibility. Of course, for snapshot reads like SELECT COUNT(*), only MVCC and its visibility are involved, not row locking. You can skip to the “Visibility and ROW_search_MVCC Functions” section for details.
-
Count one line: At the code level, the read line is evaluated in the evaluate_join_record function to see if it should be counted (that is, count++).
COUNT(arg) itself is a MySQL function. COUNT ++ is counted if the arg (column or entire row) in parentheses is not NULL, otherwise the line is not counted. For details, go to the “Evaluate_join_record and column is empty” section.
The effects of these two stages on COUNT(*) results are as follows:
SQL layer flow framework related code abstract is as follows:
1210 enum_nested_loop_state1211 sub_select(JOIN *join, QEP_TAB *const qep_tab,bool end_of_records)1212 {1213 DBUG_ENTER("sub_select"); . . // Omit 1000 words 1265 herewhile(rc == NESTED_LOOP_OK && join->return_tab >= qep_tab_idx)1266 {1267 int error; // First, get a row from the storage engine; 1268if (in_first_read)1269 {1270 in_first_read= false; // The first step is to scan the first record that meets the condition. SELECT * FROM t LIMIT 1; SELECT * FROM t LIMIT 1; 1271 error= (*qep_tab->read_first_record)(qep_tab); 1272} 1273else// The first step is to continue to read, continue to traverse the position of the previous scan, find a record that meets the condition; SELECT id FROM t WHERE id >$last_idLIMIT 1; 1274 error= info->read_record(info); . . Evaluate_join_record (join, qep_tab); evaluate_record (join, qep_tab); . . // omit 1000 words 1303 DBUG_RETURN(rc); 1304}Copy the code
Q: At the code level, the first step (reading a line) has 2 branches. Why?
A: From the perspective of InnoDB interface, there are two different execution processes: “Read the first row” and “read the next row”. Reading the first row requires finding A cursor position and doing some initialization to make the subsequent process recursive.
As if we use a script/program to scan the table row by row, implementation will involve the following two SQL:
// SELECT id FROM t LIMIT 1; OR SELECT MIN(id)-1 FROM t; -> $last_id// SELECT id FROM t WHERE id > $last_id LIMIT 1;Copy the code
Specifically involving the code of this example, the call relationship between SQL layer and storage engine layer, the call stack in the reading stage is as follows :(for reference)
InnoDB = InnoDB; sub_select = InnoDB; (the same color, and the indentation for the same layer) Ø (* qep_tab - > read_first_record) () | -- - > join_read_first (TAB) | -- - > tab->read_record.read_record=join_read_next; | -- > table->file->ha_index_init() | -- > handler::ha_index_init(uint idx, bool sorted) | -- > ha_innobase::index_init() | -- > table->file->ha_index_first() | -- > handler::ha_index_first(uint idx, bool sorted) | -- > ha_innobase::index_first() | -- > ha_innobase::index_read() | -- > row_search_mvcc() Initialize cursor and place it in a valid initial position; Ø info - > read_record (info) | -- - > join_read_next (info) | -- - > info - > table - > file - > ha_index_next (info - > record)) | -- - > handler::ha_index_next(uchar * buf) | -- > ha_innobase::index_next(uchar * buf) | -- > general_fetch(buf, ROW_SEL_NEXT, 0) | -- - > row_search_mvcc () "forward" move a cursor;Copy the code
As you can see, the row_search_MVCC function ends up with the same read from either branch.
With a brief explanation of the code in the LOOP, let’s see how Row_search_mvCC and evaluate_JOIN_RECORD output the final count.
2.3 Row visibility and row_search_MVCC function
Here we look at the impact of row visibility on COUNT(*) through a set of cases and a few questions.
Q:SELECT COUNT( * ) FROM t
orSELECT MIN(id) FROM t
Does the first row read read the smallest record in table T (in the left-most node page of the tree)? (ha_index_first
Why also callrow_search_mvcc
To get the minimum key value?
A: Not necessarily. Even MIN (id) does not necessarily read the row with the smallest ID, because there are also row visibility issues. Index_read actually takes the smallest index record visible to the statement in the current transaction. This also reflects the fact that join_READ_FIRST and join_READ_NEXT have the same path to row_search_MVCC.
Q: For the last question in the figure, if transaction X isRU ( Read-Uncommitted )
Isolation level, andC-Insert ( 100 )
The completion of theX-count( * )
During execution (only scan the record 5 or 10), thenX-count( * )
In the transactionC-Insert ( 100 )
Once done, can I see the 100 record in subsequent reads?
A: MySQL adopts A “what you read is what you read” policy, i.e. X-count(*) is followed by 100.
2.4 evaluate_JOIN_RECORD and whether the column is empty
Q: How does a row count?
A: Two cases count rows read:
1. If COUNT is a column, then whether the Nullable definition and the value of the column are NULL is checked. If both are true, it will not count, otherwise it will count.
-
e.g.
SELECT COUNT(col_name) FROM t
-
Col_name can be a primary key, a unique key, a non-unique key, or a non-index field
If COUNT is NULL, COUNT ++ is ignored. If COUNT is NULL, COUNT ++ is ignored.
-
e.g-1.
SELECT COUNT(*) FROM t
-
e.g-2.
SELECT COUNT(B.*) FROM A LEFT JOIN B ON A.id = B.id
Q: Specifically, forSELECT COUNT(id) FROM t
Where id is the primary key of table T.
A: Effectively equivalent to COUNT(*). Because both COUNT(*) and COUNT(pk_col) have primary keys that are sufficient to determine that the request data is not NULL, this type of COUNT expression can be used to obtain the number of currently visible table rows.
Q: Yes at the user levelInnoDB COUNT( * )
Optimization operation problem of
A: This problem is A familiar problem in the industry. The number of rows in A table can be obtained by scanning non-empty unique keys, but the number of bytes involved may be much smaller (when the length of the table differs from the length of the primary key and unique key), and the relative IO cost is much smaller.
The relevant call stack reference is as follows:
Reference 1: evaluate_join_record () | -- - > rc = (* qep_tab - > next_select) (join, qep_tab + 1, 0); | -- > end_send_group(...) | -- > init_sum_functions(join->sum_funcs, join->sum_funcs_end[idx+1])) | -- > (*func_ptr)->reset_and_add() | -- > Item_sum::aggregator_clear() | -- > Item_sum::aggregator_add() | -- > update_sum_func(Item_sum **func_ptr) | -- > (*func_ptr)->add() | -- > Item_sum::aggregator_add() (Item_sum::aggregator_add)((Item_sum *) (*func_ptr))->aggregator_add()| -- > (Item_sum *)this->aggr->add() | -- > ((Aggregator_simple *) aggr)->item_sum->add() | -- >if (! aggr->arg_is_null(false)) | ------ > ((Item_sum_count *)aggr->item_sum)->count++;Copy the code
Ii. Data Structure:
Q: Which memory variable is the count value stored in?
(Item_sum_count*)item_sum)-> COUNT(Item_sum_count*) -> COUNT(Item_sum_count*) -> COUNT(Item_sum_count*) -> COUNT(Item_sum_count*
Recall the JOIN structure we mentioned earlier in the “COUNT(*) pre-flow” section as shown below.
That is, the SQL parser structures each SQL statement and expresses it in a JOIN object. A list of result_field_list is created and populated in this object to hold the result column, and each element in the list is an (Item_result_field*) object (pointer) of the result column.
In COUNT(*)-case, the resulting column list contains only one element, (Item_sum_count: Public Item_result_field) type object (name = “COUNT(*)”), where the specific member variable COUNT is the desired.
Select COUNT(*) from MyISAM;
As the MyISAM engine is not often used in real business, it is briefly described as follows:
-
The myisam-count (*) operation is O(1) time complexity operation.
-
Each MyISAM table stores a meta information -count value in memory and file respectively. The value of count variable in memory is initialized by reading the value of count in file.
-
SELECT COUNT(*) FROM t; SELECT COUNT(*) FROM t;
-
The count value in memory and the count value in file are updated by write operations, and the consistency is guaranteed by table-level locks.
-
Serialization of writes guaranteed by table-level locks results in all reads of user threads being either locked or seeing only one data state at the same time.
Four, a few questions
Q:MyISAM
与 InnoDB in COUNT(*)
Where does the execution of an operation start to diverge?
-
Commonality: Commonality exists in THE SQL layer, that is, the data structure after SQL parsing is consistent, and the count variable exists in the Item_sum_count object as the result column; The return to the client is similar – the count variable is assigned and returned to the client via the MySQL communication protocol.
-
Difference: InnoDB count is calculated during SQL execution; The MyISAM table itself contains meta information in memory containing the row_count value of the table. In the SQL optimization stage, the memory engine gives the optimizer a hint, indicating that the storage engine used by the table stores the exact number of rows, which can be directly obtained without entering the actuator.
Q: Why can’t InnoDB maintain a row_count variable like MyISAM?
A: The reason can be found in the MVCC mechanism and row visibility issues. Each transaction may see A different row and its count(*) result may be different. Conversely, mysql-server cannot provide a uniform read view for all user threads at the same time, and therefore cannot provide a uniform count value.
For multiple user threads accessing MySQL (COUNT(*)), there are several factors that determine their respective results:
-
The data state before a set of transactions is executed (the initial data state).
-
The execution sequence of overlapping transactions (operation timing, transaction theory states that serializability of concurrent transaction operations is a necessary condition for correctness).
-
The isolation level of transactions (input for each operation).
Among them, 1 and 2 are global or controllable to the Server, and only 3 is the unique attribute of each user thread, which is uncontrollable by the Server side, so the Server side cannot control each COUNT(*) result.
Q:InnoDB-COUNT( * )
属 table scan
Operation, whether will be existingBuffer Pool
Other user threads require hot pages fromLRU-list
So other user threads also need to be removed from diskload
Once, a sudden increase in IO consumption might block existing requests?
A: MySQL has an optimization policy that places the page loaded by the scan operation at the oung/old junction of the LRU-list (about 3/8 of the end of the LRU). In this way, the hot pages needed by the user thread are still in the Lru-list-young area, while the pages continuously loaded by the table sweep operation will continuously flush the pages in the old area. The pages in this part are considered to be non-hot pages, so it is relatively logical.
PS: I think there is a similar optimization idea, which is to limit the size of the Buffer Pool used by the scan operation to O(1) level, but this needs to pay extra memory management costs.
Q:InnoDB-COUNT( * )
Whether will be likeSELECT * FROM t
So read the overflow page that stores the large field (if present)?
A: no. Because Innodb-count (*) only counts rows and the primary key for each row is definitely not NULL, it only reads rows within the primary key index page without reading additional overflow pages.
It is said that handsome people have set this public number as a star
, END,
The growth path of programmers
Though the road is long, the journey is sure to come
WeChat ID: cxydczzl
Highlights from the past
7 big platform tools for programmers to connect private work
The growth path of Java programmers
Why does TCP require a three-way handshake
50 Details of Java Performance Optimization (Collector’s Edition)
Design e-commerce platform coupon system
A conversation that tells you what an architect does?
Teach you a way to use IDE programming to improve efficiency SAO operation!
The classic ebook package for programmers