Problem scenario
Given a SQLAlchemy model Account, consider the query:
query = Account.query.filter(Account.id.in_(account_ids)).order_by(Account.date_created.desc())
If account_ids is an empty list, executing the query emits the following warning:
/usr/local/lib/python2.7/site-packages/sqlalchemy/sql/default_comparator.py:35: SAWarning: The IN-predicate on "account.id" was invoked with an empty sequence. This results in a contradiction, which nonetheless can be expensive to evaluate. Consider alternative strategies for improved performance.
  return o[0](self, self.expr, op, *(other + o[1:]), **kwargs)
The hint is that the query produced by an empty list takes a long time to evaluate, and should be rewritten for better performance.
Why does this warning exist? How can an empty list possibly hurt performance?
Printing the query yields the following SQL statement:
SELECT * FROM account WHERE account.id != account.id ORDER BY account.date_created DESC
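The rendered SQL can be reproduced directly from the query object; a minimal sketch, assuming the Account model and the empty account_ids from above:

account_ids = []
query = Account.query.filter(Account.id.in_(account_ids)).order_by(Account.date_created.desc())
print(str(query))  # prints the SELECT with the contradictory WHERE clause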
The filter condition in the generated statement is WHERE account.id != account.id. To see what this costs, use PostgreSQL's EXPLAIN ANALYZE command, where:
- EXPLAIN: shows the execution plan that the PostgreSQL planner generates for the supplied statement.
- ANALYZE: actually executes the statement, so the plan shows real run times and row counts alongside the estimates.
The query cost analysis results are as follows:
postgres=> EXPLAIN ANALYZE SELECT * FROM account WHERE account.id != account.id ORDER BY account.date_created DESC;
                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Sort  (cost=797159.14.. rows=4471702 width=29) (actual time=574.002..574.002 rows=0 loops=1)
   Sort Key: date_created DESC
   Sort Method: quicksort  Memory: 25kB
   ->  Seq Scan on account  (cost=0.00.. rows=4471702 width=29) (actual time=573.991..573.991 rows=0 loops=1)
         Filter: (id <> id)
         Rows Removed by Filter: 4494173
 Execution time: 574.052 ms
(8 rows)
Looking at the execution plan, we can see that even though the query returns no rows at all, its cost is extremely high: PostgreSQL still sequentially scans the whole table, evaluating the contradictory filter (id <> id) against every row, and only then sorts the empty result. Planning time is negligible compared to execution time, and virtually all of the execution time goes into that scan.
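If you would rather collect such plans from Python than from psql, a small helper is one option. This is a sketch, not part of the original code: explain_analyze is a hypothetical name, session is assumed to be a SQLAlchemy session (e.g. db.session), and it assumes every bind parameter can be rendered as a literal:

from sqlalchemy import text

def explain_analyze(session, query):
    # Inline the bind parameters so EXPLAIN ANALYZE sees a complete statement.
    sql = str(query.statement.compile(compile_kwargs={"literal_binds": True}))
    for (line,) in session.execute(text("EXPLAIN ANALYZE " + sql)):
        print(line)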
Following this line of thought, there are two ways to handle an empty list:
Scenario 1: if account_ids is empty, skip the database query entirely and return an empty result; the code becomes:
if account_ids:
query = Account.query.filter(Account.id.in_(account_ids)).order_by(Account.date_created.desc())
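Spelled out as a complete helper (a sketch; the function name recent_accounts and the eager .all() call are my additions, not from the original):

def recent_accounts(account_ids):
    # No ids to look up: skip the database round-trip entirely.
    if not account_ids:
        return []
    return (Account.query
            .filter(Account.id.in_(account_ids))
            .order_by(Account.date_created.desc())
            .all())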
Scenario 2: if account_ids is empty, add an always-false filter to the query:
query = Account.query
if account_ids:
query = query.filter(Account.id.in_(account_ids))
else:
query = query.filter(False)
query = query.order_by(Account.date_created.desc())
When account_ids is empty, the generated SQL statement is:
SELECT *
FROM account
WHERE 0 = 1 ORDER BY account.date_created DESC
The analysis results are as follows:
postgres=> EXPLAIN ANALYZE SELECT * FROM account WHERE 0 = 1 ORDER BY account.date_created DESC;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Sort  (cost=77987.74..77987.75 rows=1 width=29) (actual time=0.011..0.011 rows=0 loops=1)
   Sort Key: date_created DESC
   Sort Method: quicksort  Memory: 25kB
   ->  Result  (cost=0.00..77987.73 rows=1 width=29) (actual time=0.001..0.001 rows=0 loops=1)
         One-Time Filter: false
         ->  Seq Scan on account  (cost=0.00..77987.73 rows=1 width=29) (never executed)
 Planning time: 0.197 ms
 Execution time: 0.06 ms
(8 rows)
As you can see, planning and execution times improve dramatically: the One-Time Filter lets PostgreSQL skip the sequential scan entirely (it is marked "never executed").
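Incidentally, scenario 2 can be written a little more explicitly with SQLAlchemy's false() construct instead of a bare Python False; SQLAlchemy coerces both to the same predicate, so this sketch is equivalent:

from sqlalchemy import false

query = Account.query
if account_ids:
    query = query.filter(Account.id.in_(account_ids))
else:
    # Renders a constant-false predicate, which PostgreSQL evaluates
    # once as a One-Time Filter without touching the table.
    query = query.filter(false())
query = query.order_by(Account.date_created.desc())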
A test
What if we simply remove the ORDER BY from the contradictory query in the problem scenario? Analyze the cost with EXPLAIN ANALYZE again:
postgres=> EXPLAIN ANALYZE SELECT * FROM account WHERE account.id != account.id;
                                          QUERY PLAN
------------------------------------------------------------------------------------------------
 Seq Scan on account  (cost=0.00.. rows=4471702 width=29) (actual time=550.999..550.999 rows=0 loops=1)
   Filter: (id <> id)
   Rows Removed by Filter: 4494173
 Planning time: 0.134 ms
 Execution time: 551.041 ms
As you can see, the execution time (551 ms) is not far from that of the version with the sort (574 ms), which confirms that the cost lies in scanning and filtering all 4,494,173 rows, not in the sorting.
How to calculate query costs
Run EXPLAIN on a simple filter query; the results are as follows:
postgres=> explain select * from account where date_created = '2016-04-07 18:51:30.371495+08';
                                       QUERY PLAN
--------------------------------------------------------------------------------------
 Seq Scan on account  (cost=0.00..127716.33 rows=1 width=211)
   Filter: (date_created = '2016-04-07 18:51:30.371495+08'::timestamp with time zone)
(2 rows)
The numbers reported by EXPLAIN are:
- 0.00: estimated start-up cost (cost incurred before output can begin, e.g. the time spent sorting in a sort node).
- 127716.33: estimated total cost.
- rows=1: estimated number of rows output by this plan node.
- width=211: estimated average width, in bytes, of the rows output by this plan node.
Here the unit of cost is the disk page fetch: 1.0 represents one sequential disk page read. The cost of a parent node includes the cost of all of its child nodes. The rows value is not the number of rows processed or scanned by the plan node, but the number it is estimated to emit, which is usually lower; at the top level, the estimate should be close to the number of rows the query actually returns. For this query the estimated total cost is 127716.33; the following steps work out where that number comes from.
To calculate this cost, PostgreSQL first looks at the size of the table in bytes. The size of the account table is:
postgres=> select pg_relation_size('account');
pg_relation_size
------------------
737673216
(1 row)
Check the size of the block
PostgreSQL charges one cost point for each sequential block read. Use SHOW block_size to see the block size:
postgres=> show block_size;
block_size
------------
8192
(1 row)
Count the number of blocks
Each block is 8 KB, so the number of blocks read in a sequential scan of the table is:
blocks = pg_relation_size / block_size = 737673216 / 8192 = 90048
That is, the account table occupies 90048 blocks.
Look at the cost of each block
postgres=> show seq_page_cost;
seq_page_cost
---------------
1
(1 row)
Since PostgreSQL allocates one cost point per sequentially read block, the scan in the query above starts from 90048 cost points.
CPU cost of processing each row
- cpu_tuple_cost: the CPU cost of processing each record (a tuple is one row of a relation).
- cpu_operator_cost: the CPU cost of evaluating an operator or function.
postgres=> show cpu_operator_cost;
 cpu_operator_cost
-------------------
 0.0025
(1 row)

postgres=> show cpu_tuple_cost;
 cpu_tuple_cost
----------------
 0.01
(1 row)
Do the calculation
The formula for the cost of a sequential scan is:
cost = blocks * seq_page_cost + number_of_records * cpu_tuple_cost + number_of_records * cpu_operator_cost
Now plug all the values in and reproduce the number shown in the EXPLAIN output:
number_of_records = 3013466    # row count
block_size = 8192              # block size in bytes
pg_relation_size = 737673216   # table size in bytes
blocks = pg_relation_size / block_size = 90048
seq_page_cost = 1
cpu_tuple_cost = 0.01
cpu_operator_cost = 0.0025
cost = blocks * seq_page_cost + number_of_records * cpu_tuple_cost + number_of_records * cpu_operator_cost
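A quick check in Python confirms that these inputs reproduce the EXPLAIN estimate (a sketch; all values are the ones listed above):

number_of_records = 3013466    # rows in the account table
pg_relation_size = 737673216   # table size in bytes
block_size = 8192              # bytes per block
seq_page_cost = 1.0
cpu_tuple_cost = 0.01
cpu_operator_cost = 0.0025

blocks = pg_relation_size / block_size          # 90048 blocks
cost = (blocks * seq_page_cost
        + number_of_records * cpu_tuple_cost
        + number_of_records * cpu_operator_cost)
print(cost)  # 127716.325, which EXPLAIN rounds to 127716.33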
How to reduce query costs?
The direct answer: use an index.
postgres=> explain select * from account where id=20039;
                                   QUERY PLAN
----------------------------------------------------------------------------------------
 Index Scan using account_pkey on account  (cost=0.43..)
   Index Cond: (id = 20039)
(2 rows)
As you can see, the query cost drops dramatically when the filter is on an indexed column.
The cost calculation for an index scan is a little more complex than for a sequential scan: it runs in two phases, and PostgreSQL additionally takes the random_page_cost and cpu_index_tuple_cost parameters into account, together with the height of the index tree.
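These parameters can be inspected the same way as seq_page_cost above; a sketch using a plain SQLAlchemy connection (the connection URL is a placeholder):

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@localhost/postgres")  # placeholder DSN
with engine.connect() as conn:
    for param in ("random_page_cost", "cpu_index_tuple_cost"):
        value = conn.execute(text("SHOW " + param)).scalar()
        print("%s = %s" % (param, value))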
References
- sqlalchemy-and-empty-in-clause
- PostgreSQL query performance analysis and optimization
- PostgreSQL Learning Manual (Performance Improvement Tips)
- PostgreSQL queries the cost model
- PostgreSQL query schedule time calculation details