SQL query don’t start with SELECT
By Julia Evans (authorized)
HelloGitHub- little bear & bittern egg
In everyday use, writing SQL queries starts with SELECT (note: this article covers only SELECT queries, not INSERTS or other SQL commands).
A question came to my mind yesterday: Can I use WHERE, HAVING, or any other method to filter the results of window function execution?
After some exploration, MY final conclusion is no, because window functions must run after WHERE and GROUP BY. However, this also extends to the larger question of what order SQL queries should be executed.
The order in which SQL queries are executed that you do not know
I specifically checked the document SQL query execution order is as follows:
If you don’t like the colorful photo format above, you can also see the following text:
FROM/JOIN/ON
WHERE
GROUP BY
HAVING
SELECT
(The window function is executed in this step)ORDER BY
LIMIT
The picture above should answer your questions:
The figure above shows the semantics of an SQL query. By understanding this graph, you can quickly determine what results a given SQL query will return, and you can easily answer questions like:
- Can be
GROUP BY
Result ofWHERE
Screening? No! Because WHERE is executed before GROUP BY.) - Is it possible to filter the execution results of window functions? No! Because the window function executes in the SELECT step, which is after WHERE and GROUP BY.)
- Can be
GROUP BY
Is executedORDER BY
Operation? (you can!ORDER BY
This is basically the last step, so it can be performed on the result of any operationORDER BY
) LIMIT
In which step? (The last step!)
However, the database engines do not actually run queries in this exact order, as they also perform a series of optimizations to speed up queries.
So:
-
When you want to understand the validity of a query or why a query result is returned, you can try this diagram.
-
However, it is not possible to use this graph to explain query performance or index-related issues, which involve more variables and are therefore more complex.
The most confusing: Column aliases
For example, associate first and last names and group them. SQL syntax is allowed to write:
SELECT CONCAT(first_name, ' ', last_name) AS full_name, count(*)
FROM table
GROUP BY full_name
Copy the code
The query above looks like GROUP BY is executed after SELECT, but GROUP BY is actually executed first because GROUP BY refers to the alias in SELECT.
The database engine can rewrite the query as:
SELECT CONCAT(first_name, ' ', last_name) AS full_name, count(*)
FROM table
GROUP BY CONCAT(first_name, ' ', last_name)
Copy the code
Next, execute the statement in GROUP BY and then SELECT, so that’s ok.
In addition, the database engine will certainly perform a series of checks to ensure that the contents of SELECT and GROUP BY match before the query starts running, so it must examine the query as a whole before formulating an execution plan.
2. Queries are not run strictly in this order (optimization)
In fact, the database engine does not run queries through joins, filters, and groupings, because it implements a number of optimizations to speed up queries, such as reordering (as long as it does not affect the final returned results).
Here is a simple example of how query execution order can affect query performance.
SELECT * FROM
owners LEFT JOIN cats ON owners.id = cats.owner
WHERE cats.name = 'mr darcy'
Copy the code
If you only need to find three cats named “Mr Darcy”, it is slow to perform the entire left join and match all the rows in both tables. Instead, it is much faster if the cat named “Mr Darcy” is screened before the connection is made. In this case, filtering first does not change the results of the query!
In fact, there are many other optimizations implemented by the database engine that allow queries to be executed in a different order than the tables.
3. Different query syntax
LINQ (query syntax in C# and VB.NET) is based on FROM… WHERE … SELECT order to execute the query. Here is an example of a LINQ query:
var teenAgerStudent = from s in studentList
where s.Age > 12 && s.Age < 20
select s;
Copy the code
Pandas basically works the same way, although sometimes it is not necessary to write code in the following order, but it is a good habit to do so:
df = thing1.join(thing2) # like a JOIN
df = df[df.created_at > 1000] # like a WHERE
df = df.groupby('something', num_yes = ('yes'.'sum')) # like a GROUP BY
df = df[df.num_yes > 2] # like a HAVING, filtering on the result of a GROUP BY
df = df[['num_yes'.'something1'.'something']] # pick the columns I want to display, like a SELECT
df.sort_values('sometthing', ascending=True)[:30] # ORDER BY and LIMIT
df[:30]
Copy the code
This is not because Pandas mandates that code be written in the JOIN/WHERE/GROUP BY/HAVING order to help understand the underlying logic. (It is worth mentioning that you can perform WHERE before JOIN to improve performance, and this is how most database engines perform in practice)
Dplyr in R also allows you to query different SQL databases with different syntax, such as Postgres, MySQL and SQLite.
The last
I was actually quite surprised when I discovered the order in which SQL queries were executed. By exploring the order in which SQL queries are executed, I clarified the problem I had previously encountered. Hopefully, this article will help more people understand the order of SQL execution and how to write SQL queries correctly.
Follow the HelloGitHub public account to receive the first updates.
There are many more open source projects and treasure trove projects to be discovered.