Set-based versus procedural approaches to querying

Implicit in the anti-patterns above is the fact that they essentially boil down to the difference between set-based and procedural approaches to building queries.

The procedural approach to queries is much like ordinary programming: you tell the system what needs to be done and how to do it. As in the example from the previous article, you query the database by executing one function and then calling another, or you obtain the final result using logic that includes loops, conditions, and user-defined functions (UDFs). You'll find that, working this way, you keep requesting subsets of data from one layer to the next. This approach is also often referred to as step-by-step or row-by-row querying.
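As a concrete sketch of the procedural style, here is a row-by-row version in PL/pgSQL. The orders table and its amount column are hypothetical, chosen only for illustration, since the previous article's example isn't reproduced here:

DO $$
DECLARE
    rec RECORD;
    total numeric := 0;
BEGIN
    -- Fetch one row at a time and apply the condition in procedural code
    FOR rec IN SELECT amount FROM orders LOOP
        IF rec.amount > 100 THEN
            total := total + rec.amount;
        END IF;
    END LOOP;
    RAISE NOTICE 'total: %', total;
END;
$$;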

The other is the set-based approach, in which you simply specify the operations to be performed: all you do is state the conditions and requirements for the result set you want from the query. When retrieving the data, you don't need to concern yourself with the internal mechanics of how the query is carried out; the database engine determines the best algorithms and logic to execute it.
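The set-based equivalent of the sketch above hands the same requirement to the engine in a single declarative statement (same hypothetical orders table):

-- What to compute, not how to loop over the rows:
-- the engine decides how to scan, filter, and aggregate
SELECT SUM(amount) AS total
FROM orders
WHERE amount > 100;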

Because SQL works on sets, this approach is usually more efficient than the procedural one, which explains why SQL can, in some cases, work faster than procedural code.

The set-based approach to querying is also a skill the data science industry expects you to master, because you need to be adept at switching between the two approaches. If you spot procedural logic in your query, you should consider whether that section needs to be rewritten.

From query to execution plan

Anti-patterns are not static: they evolve as you grow as a SQL developer, so avoiding query anti-patterns and rewriting queries can be a difficult task. Any help is welcome, which is why tools that let you optimize your queries in a more structured way often come in handy.

Thinking about performance requires not only a more structured approach, but also a more in-depth approach.

This structured and in-depth approach is primarily based on the query plan: the query is first parsed into a "parse tree", and the plan then defines exactly which algorithm is used for each operation and how the execution of the operations is coordinated.

Query optimization

When tuning a query, you will most likely need to manually inspect the plan that the optimizer produces. In that case, you will need to analyze your query again by looking at the query plan.

To get at such a query plan, you can use some of the tools provided by your database management system. Here are some tools you can use:

Some packages feature tools that can generate a graphical representation of the query plan.

Other tools can provide you with a text description of the query plan.

Note that if you are working with PostgreSQL, you can distinguish between plain EXPLAIN, which simply gives you a description of how the planner intends to execute the query without running it, and EXPLAIN ANALYZE, which actually executes the query and returns an analysis comparing the estimated query plan against the actual one. Generally speaking, an actual execution plan is one for which the query really is run, whereas an estimated execution plan works out what the planner would do without executing the query. Although logically equivalent, the actual execution plan is more useful, because it contains additional details and statistics about what really happened when the query was executed.

Next you'll learn more about EXPLAIN and ANALYZE, and how to use them to better understand your query plan and query performance. To do this, you'll work through some examples with two tables: one_million and half_million.
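The article doesn't show how these tables were built, but a plausible PostgreSQL setup, assuming a counter column plus some filler text to give the rows realistic width, could look like this:

-- Assumed setup (not shown in the original article)
CREATE TABLE one_million AS
SELECT n AS counter, md5(n::text) AS filler
FROM generate_series(1, 1000000) AS n;

CREATE TABLE half_million AS
SELECT n AS counter
FROM generate_series(1, 500000) AS n;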

You can retrieve the query plan for the one_million table by using EXPLAIN; make sure to put it right before the query you want to run, and when the query completes, the query plan is returned:

EXPLAIN
SELECT *
FROM one_million;

                          QUERY PLAN
____________________________________________________
 Seq Scan on one_million  (cost=0.00..18584.82 rows=1025082 width=36)
(1 row)
In the example above, we see that the cost of the query is 0.00..18584.82, the estimated number of rows is 1025082, and the estimated average row width is 36 bytes.

You can then update the statistics with the help of ANALYZE:

ANALYZE one_million;

EXPLAIN
SELECT *
FROM one_million;

                          QUERY PLAN
____________________________________________________
 Seq Scan on one_million  (cost=0.00..18334.00 rows=1000000 width=37)
(1 row)
In addition to EXPLAIN and ANALYZE, you can also use EXPLAIN ANALYZE to retrieve the actual execution time:

EXPLAIN ANALYZE
SELECT *
FROM one_million;

                          QUERY PLAN
____________________________________________________
 Seq Scan on one_million  (cost=0.00..18334.00 rows=1000000 width=37) (actual time=0.015.. rows=1000000 loops=1)
 Total runtime: ... ms
(2 rows)
The disadvantage of EXPLAIN ANALYZE is that it actually executes the query, which is worth keeping in mind!
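For data-modifying statements, one common way to limit that risk is to run EXPLAIN ANALYZE inside a transaction and roll it back, for example:

-- The statement is analyzed and really executed,
-- but the rollback discards its effects
BEGIN;
EXPLAIN ANALYZE DELETE FROM one_million WHERE counter < 100;
ROLLBACK;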

All the algorithms we've seen so far have been sequential scans (full table scans): a scan method in which each row of the table is read in sequential (serial) order, and each column is checked against the query's conditions. Performance-wise, a sequential scan is not the best execution plan, because you end up scanning the whole table; however, sequential reads are fast even on slow disks.
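If you want to check whether the planner has any alternative to a sequential scan, one session-level experiment (for exploration only, not for production) is to discourage sequential scans; if no usable index exists, the plan will still fall back to a heavily penalized sequential scan:

-- enable_seqscan is a real PostgreSQL planner parameter
SET enable_seqscan = off;
EXPLAIN SELECT * FROM one_million;
SET enable_seqscan = on;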

Here are examples of other algorithms:

EXPLAIN ANALYZE
SELECT *
FROM one_million JOIN half_million
ON (one_million.counter = half_million.counter);

                          QUERY PLAN
_____________________________________________________________
 Hash Join  (cost=15417.00..68831.00 rows=500000 width=42) (actual time=1241.471..5912.553 rows=500000 loops=1)
   Hash Cond: (one_million.counter = half_million.counter)
   ->  Seq Scan on one_million  (cost=0.00.. rows=1000000 width=37) (actual time=0.007..1254.027 rows=1000000 loops=1)
   ->  Hash  (cost=10013.00.. rows=500000 width=5) (actual time=1241.251.. rows=500000 loops=1)
         Buckets: 4096  Batches: 16  Memory Usage: 300KB
         ->  Seq Scan on half_million  (cost=0.00..7213.00 rows=500000 width=5) (actual time=0.008..601.128 rows=500000 loops=1)
 Total runtime: ... ms
We can see that the query optimizer chose a Hash Join here. Keep this in mind, because we'll need it when evaluating the time complexity of the query. Notice that there is no index on half_million.counter in the example above; we add one in the next example:

CREATE INDEX ON half_million(counter);

EXPLAIN ANALYZE
SELECT *
FROM one_million JOIN half_million
ON (one_million.counter = half_million.counter);

                          QUERY PLAN
______________________________________________________________
 Merge Join  (cost=4.12..37650.65 rows=500000 width=42) (actual time=0.033..3272.940 rows=500000 loops=1)
   Merge Cond: (one_million.counter = half_million.counter)
   ->  Index Scan using one_million_counter_idx on one_million  (cost=0.00.. rows=1000000 width=37) (actual time=0.011..694.466 rows=500001 loops=1)
   ->  Index Scan using half_million_counter_idx on half_million  (cost=0.00..14120.29 rows=500000 width=5) (actual time=0.010.. rows=500000 loops=1)
 Total runtime: ... ms
(5 rows)
By creating the index, the query optimizer has now determined that a Merge Join, in which index scans take place, is the way to go.

Note the difference between an index scan and a full table scan (sequential scan): the latter, also known as a "table scan", reads every row in the table to find suitable results, while the former uses the index to locate the matching rows and reads only those.
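You can observe this difference directly on the half_million table now that its counter column is indexed; for example:

-- A selective predicate on the indexed column typically
-- produces an index scan over just the matching rows
EXPLAIN
SELECT *
FROM half_million
WHERE counter = 42;

-- An unfiltered query over the same table still reads every row
EXPLAIN
SELECT *
FROM half_million;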