preface
Hello, everyone. I am a little boy picking up field snails.
In daily development, we often use Group by. Dear friends, do you know how Group by works? What’s the difference between group by and having? What is the optimization idea of Group BY? What are some concerns about using Group by? In this paper, we will learn to conquer group by~
- A simple example using Group by
- Working principle of Group by
- Group by + WHERE and Group by + having
- Group by optimization idea
- Group by use note points
- How to optimize a production slow SQL
Public number: a boy picking up snails
1. A simple example using group by
Group by is generally used for group statistics, which expresses the logic of grouping according to certain rules. Let’s start with a simple example and review it.
Assume a table of employees with the following structure:
CREATE TABLE 'staff' (' id' bigint(11) NOT NULL AUTO_INCREMENT COMMENT '主 库 iD ', 'id_card' varchar(20) NOT NULL COMMENT 'iD ',' id 'varchar(64) NOT NULL COMMENT' iD ', 'age' int(4) NOT NULL COMMENT 'age ',' city 'varchar(64) NOT NULL COMMENT' city ', PRIMARY KEY (' id ') ENGINE=InnoDB AUTO_INCREMENT=15 DEFAULT CHARSET=utf8 COMMENT=' 表';Copy the code
Table size data is as follows:
We now have a requirement to count the number of employees in each city. The corresponding SQL statement can be written like this:
select city ,count(*) as num from staff group by city;
Copy the code
The result is as follows:
The logic of this SQL statement is clear, but what is the underlying execution flow?
2. Group by principle analysis
2.1 the explain analysis
Let’s first look at the execution plan using Explain
explain select city ,count(*) as num from staff group by city;
Copy the code
- “Extra”
Using temporary
Indicates that it is used when performing groupingA temporary table - “Extra”
Using filesort
To indicate the use ofThe sorting
Why does group by use temporary tables and sorting? Let’s take a look at the execution flow of this SQL
2.2 Simple execution process of Group by
explain select city ,count(*) as num from staff group by city;
Copy the code
Let’s take a look at the execution process of this SQL
- Create temporary table in memory with two fields
city
andnum
; - A full table scan
staff
City = ‘X’
- Insert a row (X,1) where city=’X’;
- If there are rows in the temporary table where city=’X’, increment the num value of X by 1;
- After the traversal is complete, the field is followed
city
doThe sortingThe result set is returned to the client.
An execution diagram of this process is shown below:
What is the sort of temporary table?
Sort the fields into sort Buffer and return them as they are sorted. Note that there are full field sort and Rowid sort
- If it is
Full field sort
, and the fields to be returned are addedsort buffer
, according to theSort fieldWhen you’re done, go straight back- If it is
The rowid sorting
, just need to sort the fields put insort buffer
And then one more timeBack to the tableOperation and return.- How do I determine if I’m going to go full-field sort or ROwid sort? Controlled by a database parameter,
max_length_for_sort_data
If you are interested in sorting, you can read this article.
- Order by in detail
3. Where and having
- Group by + WHERE execution flow
- Group by + Having execution flow
- There are also execution orders for WHERE, Group by, and HAVING
3.1 Execution process of Group by + WHERE
SQL > alter table SQL > alter table SQL > alter table SQL > alter table SQL > alter table SQL > alter table SQL
Ok, let’s add a condition to it and an index of idx_age as follows:
select city ,count(*) as num from staff where age> 30 group by city; Alter table staff add index idx_age (age);Copy the code
Expain:
explain select city ,count(*) as num from staff where age> 30 group by city;
Copy the code
From the explain execution plan results, you can see that the query criteria hit the index of idx_age and used temporary tables and sorts
Using index condition: indicates that the index is pushed down to filter as much data as possible according to the index, and then it is returned to the server layer for filtering according to where other conditions. Why is there an index push down here for a single index? The presence of Explain does not necessarily mean that index push-downs are used, only that it is available but not necessarily used. If you have any thoughts or questions, please add me to discuss it on wechat.
The execution process is as follows:
- Create temporary table in memory with two fields
city
andnum
; - Scan index tree
idx_age
, find the primary key ID greater than age greater than 30 - Select * from primary key where city = ‘X’
- Insert a row (X,1) where city=’X’;
- If there are rows in the temporary table where city=’X’, increment the num value of X by 1;
- Continue to repeat step 2 and 3 to find all the data that meets the criteria,
- And finally according to the field
city
doThe sortingThe result set is returned to the client.
3.2 Execution of group by + having
If you want to query the number of employees in each city, having at least 3 employees in each city can solve your problem.
select city ,count(*) as num from staff group by city having num >= 3;
Copy the code
The query results are as follows:
having
Called a grouping filter condition, it operates on the returned result set.
3.3 There is an execution order of WHERE, Group by, and HAVING
If an SQL statement contains clauses where, Group by, and HAVING, what is the order of execution?
Such as this SQL:
select city ,count(*) as num from staff where age> 19 group by city having num >= 3;
Copy the code
- perform
where
Clause to find data that matches employees older than 19 group by
Clause to employee data, grouped by city.- right
group by
Run the aggregation function to calculate the number of employees in each group; - In the end,
having
Clause selects the city group with 3 or more employees.
3.4 Where + having difference summary
having
Clause is used toPost group screeningThe where clause is used forlineConditions for screeninghaving
It’s usually a matchgroup by
With aggregate functions such as (count(),sum(),avg(),max(),min()
)where
Aggregate functions cannot be used in conditional clauses, whilehaving
Clauses will do.having
Can only be used after group by, where is executed before group by
4. Use group by for attention
There are several main points to note when using group by:
group by
Must it be used with aggregate functions?group by
The field must appear in selectgroup by
Cause slow SQL problems
4.1 Must group by be used with aggregate functions?
Group by means of group statistics, usually used in conjunction with aggregate functions such as count(),sum(),avg(), Max (),min()).
- The count ()
- The sum () combined
- Avg () average
- Max () the maximum value
- The min () the minimum value
Is it ok if it is not used with the aggregate function?
I use Mysql 5.7 and it works. It doesn’t get an error and it returns the first row of grouped data.
Such as this SQL:
select city,id_card,age from staff group by city;
Copy the code
The query result is
And if you look at that, what’s returned is the first item in each group
Of course, group by is used in conjunction with aggregate functions, except in special cases where you want to have a distinct, but you can reuse distinct.
4.2 Group by must appear in select.
Not necessarily, for example:
select max(age) from staff group by city;
Copy the code
The result is as follows:
The grouping field city does not follow select and does not generate an error. Of course, this may be related to different databases, different versions. You can verify it before you use it. There is a saying called, the paper come zhongjue shallow, must know this to practice.
4.3 group by
Cause slow SQL problems
The most important thing to note is that group by can easily cause slow SQL problems when used incorrectly. Because it uses both temporary tables and sort by default. It is also possible to use disk temporary tables.
- If you are executing, you will find that the memory temporary table size has reachedceiling(The parameter that controls this upper limit is
tmp_table_size
), will be putThe memory temporary table is converted to the disk temporary table.- If the data volume is large, it is likely that the query will require a temporary table on disk, which will take up a lot of disk space.
These are all X factors that cause slow SQL. Let’s explore optimization solutions together.
5. Some optimization schemes of Group by
In what directions should we optimize?
- Direction 1: since it is sorted by default, we don’t need to sort it.
- Direction 2: Since temporary tables are X factors that affect group by performance, can we not use temporary tables?
Why do we need a temporary table to execute a group by statement? The semantic logic of group by is to count the number of different values that occur. If these values were ordered to begin with, wouldn’t we just scan down and count them instead of using a temporary table to record and count them?
- Add index to group by
- Order by NULL is not sorted
- Try to use only in-memory temporary tables
- Using SQL_BIG_RESULT
5.1 Add indexes to the fields following group by
How do I guarantee that the values after group by are ordered to begin with? Index, of course.
Let’s go back to this SQL
select city ,count(*) as num from staff where age= 19 group by city;
Copy the code
Its execution plan
If we give it a federated index idx_age_city (age,city)
alter table staff add index idx_age_city(age,city);
Copy the code
When we look at the execution plan, we find that neither sorting nor temporary tables are needed.
Adding appropriate indexes is the simplest and most effective way to optimize group by.
5.2 Order by null
Not all scenarios are suitable for indexing, so how can we optimize the scenarios that are not suitable for indexing?
If your requirement does not require sorting the result set, use order by NULL.
select city ,count(*) as num from staff group by city order by null
Copy the code
The execution plan is as follows. Filesort is no longer available
5.3 Try to use only temporary tables in memory
If group by requires a small number of statistics, we can try to use only in-memory temporary tables; The group by process is time-consuming if the disk temporary table is used because the data cannot be stored. Therefore, the tmp_table_size parameter can be appropriately increased to avoid using disk temporary tables.
5.4 Optimization using SQL_BIG_RESULT
What if there’s too much data? You can’t increase tmp_table_size infinitely, can you? But don’t let the data be put into a temporary table in memory, and then converted to a temporary table on disk as the data insertion reaches its limit? That’s kind of not smart.
Therefore, if the estimated amount of data is high, we use the SQL_BIG_RESULT prompt to directly use disk temporary tables. The MySQl optimizer found that disk temporary tables are B+ tree storage, which is not as efficient as arrays. So I’m just going to store it in arrays
Example SQl is as follows:
select SQL_BIG_RESULT city ,count(*) as num from staff group by city;
Copy the code
executive-plannedExtra
As you can see, the execution no longer uses temporary tables, but only sorts
The execution process is as follows:
- Initialize sort_buffer and place it in the city field;
- Scan table staff, take out city values in turn and store them in sort_buffer;
- After scanning, sort the city field of sort_buffer
- Once you’ve sorted, you’ve got an ordered array.
- Count the number of occurrences of each value based on the ordered array.
6. How to optimize a slow production SQL
SQL group by SQL group by SQL group by SQL
The table structure is as follows:
CREATE TABLE 'staff' (' id' bigint(11) NOT NULL AUTO_INCREMENT COMMENT '主 库 iD ', 'id_card' varchar(20) NOT NULL COMMENT 'iD ',' id 'varchar(64) NOT NULL COMMENT' iD ', 'status' varchar(64) NOT NULL COMMENT 'Y- varchar ',' age 'int(4) NOT NULL COMMENT' age ', 'age' int(4) NOT NULL COMMENT 'age ', 'city' varchar(64) NOT NULL COMMENT 'city ',' enterprise_no 'varchar(64) NOT NULL COMMENT' enterprise ', 'legal_cert_no' varchar(64) NOT NULL COMMENT 'iD ', PRIMARY KEY (' id ') ENGINE=InnoDB AUTO_INCREMENT=15 DEFAULT CHARSET=utf8 COMMENT=' 表';Copy the code
The SQL for the query looks like this:
select * from t1 where status = #{status} group by #{legal_cert_no}
Copy the code
Let’s not discuss whether this SQL = is reasonable. If this is the SQL, how do you optimize? If you have an idea, you can leave a message to discuss it, or add me to the wechat group to discuss it. If you feel that the article is written wrong, you can also put forward ha, progress together, refueling ah
Reference and thanks
- MySQL 45 speak