Group by

preface

Hello, everyone. I am a little boy picking up field snails.

In daily development, we often use Group by. Dear friends, do you know how Group by works? What’s the difference between group by and having? What is the optimization idea of Group BY? What are some concerns about using Group by? In this paper, we will learn to conquer group by~

A simple example using Group by
Working principle of Group by
Group by + WHERE and Group by + having
Group by optimization idea
Group by use note points
How to optimize a production slow SQL

Public number: a boy picking up snails

1. A simple example using group by

Group by is generally used for group statistics, which expresses the logic of grouping according to certain rules. Let’s start with a simple example and review it.

Assume a table of employees with the following structure:

CREATE TABLE 'staff' (' id' bigint(11) NOT NULL AUTO_INCREMENT COMMENT '主 库 iD ', 'id_card' varchar(20) NOT NULL COMMENT 'iD ',' id 'varchar(64) NOT NULL COMMENT' iD ', 'age' int(4) NOT NULL COMMENT 'age ',' city 'varchar(64) NOT NULL COMMENT' city ', PRIMARY KEY (' id ') ENGINE=InnoDB AUTO_INCREMENT=15 DEFAULT CHARSET=utf8 COMMENT=' 表';Copy the code

Table size data is as follows:

We now have a requirement to count the number of employees in each city. The corresponding SQL statement can be written like this:

select city ,count(*) as num from staff group by city;
Copy the code

The result is as follows:

The logic of this SQL statement is clear, but what is the underlying execution flow?

2. Group by principle analysis

2.1 the explain analysis

Let’s first look at the execution plan using Explain

explain select city ,count(*) as num from staff group by city;
Copy the code

“Extra”Using temporaryIndicates that it is used when performing groupingA temporary table
“Extra”Using filesortTo indicate the use ofThe sorting

Why does group by use temporary tables and sorting? Let’s take a look at the execution flow of this SQL

2.2 Simple execution process of Group by

explain select city ,count(*) as num from staff group by city;
Copy the code

Let’s take a look at the execution process of this SQL

Create temporary table in memory with two fieldscityandnum;
A full table scanstaffCity = ‘X’

Insert a row (X,1) where city=’X’;
If there are rows in the temporary table where city=’X’, increment the num value of X by 1;

After the traversal is complete, the field is followedcitydoThe sortingThe result set is returned to the client.

An execution diagram of this process is shown below:

What is the sort of temporary table?

Sort the fields into sort Buffer and return them as they are sorted. Note that there are full field sort and Rowid sort

If it isFull field sort, and the fields to be returned are addedsort buffer, according to theSort fieldWhen you’re done, go straight back

If it isThe rowid sorting, just need to sort the fields put insort bufferAnd then one more timeBack to the tableOperation and return.

How do I determine if I’m going to go full-field sort or ROwid sort? Controlled by a database parameter,max_length_for_sort_data

If you are interested in sorting, you can read this article.

Order by in detail

3. Where and having

Group by + WHERE execution flow
Group by + Having execution flow
There are also execution orders for WHERE, Group by, and HAVING

3.1 Execution process of Group by + WHERE

SQL > alter table SQL > alter table SQL > alter table SQL > alter table SQL > alter table SQL > alter table SQL

Ok, let’s add a condition to it and an index of idx_age as follows:

select city ,count(*) as num from staff where age> 30 group by city; Alter table staff add index idx_age (age);Copy the code

Expain:

explain select city ,count(*) as num from staff where age> 30 group by city;
Copy the code

From the explain execution plan results, you can see that the query criteria hit the index of idx_age and used temporary tables and sorts

Using index condition: indicates that the index is pushed down to filter as much data as possible according to the index, and then it is returned to the server layer for filtering according to where other conditions. Why is there an index push down here for a single index? The presence of Explain does not necessarily mean that index push-downs are used, only that it is available but not necessarily used. If you have any thoughts or questions, please add me to discuss it on wechat.

The execution process is as follows:

Create temporary table in memory with two fieldscityandnum;
Scan index treeidx_age, find the primary key ID greater than age greater than 30
Select * from primary key where city = ‘X’

Insert a row (X,1) where city=’X’;
If there are rows in the temporary table where city=’X’, increment the num value of X by 1;

Continue to repeat step 2 and 3 to find all the data that meets the criteria,
And finally according to the fieldcitydoThe sortingThe result set is returned to the client.

3.2 Execution of group by + having

If you want to query the number of employees in each city, having at least 3 employees in each city can solve your problem.

select city ,count(*) as num from staff  group by city having num >= 3;
Copy the code

The query results are as follows:

havingCalled a grouping filter condition, it operates on the returned result set.

3.3 There is an execution order of WHERE, Group by, and HAVING

If an SQL statement contains clauses where, Group by, and HAVING, what is the order of execution?

Such as this SQL:

select city ,count(*) as num from staff  where age> 19 group by city having num >= 3;
Copy the code

performwhereClause to find data that matches employees older than 19
group byClause to employee data, grouped by city.
rightgroup byRun the aggregation function to calculate the number of employees in each group;
In the end,havingClause selects the city group with 3 or more employees.

3.4 Where + having difference summary

havingClause is used toPost group screeningThe where clause is used forlineConditions for screening
havingIt’s usually a matchgroup byWith aggregate functions such as (count(),sum(),avg(),max(),min())
whereAggregate functions cannot be used in conditional clauses, whilehavingClauses will do.
havingCan only be used after group by, where is executed before group by

4. Use group by for attention

There are several main points to note when using group by:

group byMust it be used with aggregate functions?
group byThe field must appear in select
group byCause slow SQL problems

4.1 Must group by be used with aggregate functions?

Group by means of group statistics, usually used in conjunction with aggregate functions such as count(),sum(),avg(), Max (),min()).

The count ()
The sum () combined
Avg () average
Max () the maximum value
The min () the minimum value

Is it ok if it is not used with the aggregate function?

I use Mysql 5.7 and it works. It doesn’t get an error and it returns the first row of grouped data.

Such as this SQL:

select city,id_card,age from staff group by  city;
Copy the code

The query result is

And if you look at that, what’s returned is the first item in each group

Of course, group by is used in conjunction with aggregate functions, except in special cases where you want to have a distinct, but you can reuse distinct.

4.2 Group by must appear in select.

Not necessarily, for example:

select max(age)  from staff group by city;
Copy the code

The result is as follows:

The grouping field city does not follow select and does not generate an error. Of course, this may be related to different databases, different versions. You can verify it before you use it. There is a saying called, the paper come zhongjue shallow, must know this to practice.

4.3 `group by`Cause slow SQL problems

The most important thing to note is that group by can easily cause slow SQL problems when used incorrectly. Because it uses both temporary tables and sort by default. It is also possible to use disk temporary tables.

If you are executing, you will find that the memory temporary table size has reachedceiling(The parameter that controls this upper limit istmp_table_size), will be putThe memory temporary table is converted to the disk temporary table.

If the data volume is large, it is likely that the query will require a temporary table on disk, which will take up a lot of disk space.

These are all X factors that cause slow SQL. Let’s explore optimization solutions together.

5. Some optimization schemes of Group by

In what directions should we optimize?

Direction 1: since it is sorted by default, we don’t need to sort it.
Direction 2: Since temporary tables are X factors that affect group by performance, can we not use temporary tables?

Why do we need a temporary table to execute a group by statement? The semantic logic of group by is to count the number of different values that occur. If these values were ordered to begin with, wouldn’t we just scan down and count them instead of using a temporary table to record and count them?

Add index to group by
Order by NULL is not sorted
Try to use only in-memory temporary tables
Using SQL_BIG_RESULT

5.1 Add indexes to the fields following group by

How do I guarantee that the values after group by are ordered to begin with? Index, of course.

Let’s go back to this SQL

select city ,count(*) as num from staff where age= 19 group by city;
Copy the code

Its execution plan

If we give it a federated index idx_age_city (age,city)

alter table staff add index idx_age_city(age,city);
Copy the code

When we look at the execution plan, we find that neither sorting nor temporary tables are needed.

Adding appropriate indexes is the simplest and most effective way to optimize group by.

5.2 Order by null

Not all scenarios are suitable for indexing, so how can we optimize the scenarios that are not suitable for indexing?

If your requirement does not require sorting the result set, use order by NULL.

select city ,count(*) as num from staff group by city order by null
Copy the code

The execution plan is as follows. Filesort is no longer available

5.3 Try to use only temporary tables in memory

If group by requires a small number of statistics, we can try to use only in-memory temporary tables; The group by process is time-consuming if the disk temporary table is used because the data cannot be stored. Therefore, the tmp_table_size parameter can be appropriately increased to avoid using disk temporary tables.

5.4 Optimization using SQL_BIG_RESULT

What if there’s too much data? You can’t increase tmp_table_size infinitely, can you? But don’t let the data be put into a temporary table in memory, and then converted to a temporary table on disk as the data insertion reaches its limit? That’s kind of not smart.

Therefore, if the estimated amount of data is high, we use the SQL_BIG_RESULT prompt to directly use disk temporary tables. The MySQl optimizer found that disk temporary tables are B+ tree storage, which is not as efficient as arrays. So I’m just going to store it in arrays

Example SQl is as follows:

select SQL_BIG_RESULT city ,count(*) as num from staff group by city;
Copy the code

executive-plannedExtraAs you can see, the execution no longer uses temporary tables, but only sorts

The execution process is as follows:

Initialize sort_buffer and place it in the city field;
Scan table staff, take out city values in turn and store them in sort_buffer;
After scanning, sort the city field of sort_buffer
Once you’ve sorted, you’ve got an ordered array.
Count the number of occurrences of each value based on the ordered array.

6. How to optimize a slow production SQL

SQL group by SQL group by SQL group by SQL

The table structure is as follows:

CREATE TABLE 'staff' (' id' bigint(11) NOT NULL AUTO_INCREMENT COMMENT '主 库 iD ', 'id_card' varchar(20) NOT NULL COMMENT 'iD ',' id 'varchar(64) NOT NULL COMMENT' iD ', 'status' varchar(64) NOT NULL COMMENT 'Y- varchar ',' age 'int(4) NOT NULL COMMENT' age ', 'age' int(4) NOT NULL COMMENT 'age ', 'city' varchar(64) NOT NULL COMMENT 'city ',' enterprise_no 'varchar(64) NOT NULL COMMENT' enterprise ', 'legal_cert_no' varchar(64) NOT NULL COMMENT 'iD ', PRIMARY KEY (' id ') ENGINE=InnoDB AUTO_INCREMENT=15 DEFAULT CHARSET=utf8 COMMENT=' 表';Copy the code

The SQL for the query looks like this:

select * from t1 where status = #{status} group by #{legal_cert_no}
Copy the code

Let’s not discuss whether this SQL = is reasonable. If this is the SQL, how do you optimize? If you have an idea, you can leave a message to discuss it, or add me to the wechat group to discuss it. If you feel that the article is written wrong, you can also put forward ha, progress together, refueling ah

Reference and thanks

MySQL 45 speak