“This is the third day of my participation in the Gwen Challenge in November. Check out the details: The last Gwen Challenge in 2021.”
Access Methods
The method by which a DBMS accesses data in a table (scan operator). There are three methods: sequential scan, index scan, and multi-index scan (also called bitmap scan).
Sequential Scan
Each tuple in each page of the table is traversed sequentially. Determine whether the tuple is needed based on the predicate. When each page is accessed, the buffer pool is first retrieved to see if there is a cache. A DBMS needs to maintain a cursor to mark the page/slot currently accessed.
Almost the least efficient way to read.
Some optimization methods:
- Prefetching: Preloads subsequent Pages to prevent I/O blocking
- Parallelization: Perform scan operations in parallel
- Buffer pool bypass: Stores the page in the local memory instead of the buffer pool to avoid sequential flooding
- Zone map: Computes the aggregated results of each page to determine coarse-grained whether the page is likely to contain the required data:
- Late Materialization: Each operator passes only the smallest piece of information to the next (such as the record ID). Valid only in column storage systems.
- Heap clustering:
Index Scan
Select an index to find the data. Index selection based on:
- The attributes contained in the index
- Query Specifies the required attributes
- The range of the property
- Predicates combination
- Whether the index contains non-unique keys
multi-index scan (bitmap scan)
A query with multiple indexes can be used in two steps:
- Find the ID set from each matched index
- Union or INTERSECT are evaluated for these sets according to predicate requirements
(Called bitmap Scan in Postgres)
To calculate the intersection, you can use bitmap, hash table, Bloom filter, etc.
Obviously, we want to hit the index as much as possible to retrieve the data through the index.
index scan page sorting
Since the pages distributed by tuples retrieved from the non-clustered index (only the location where the tuple is stored, not the data directly) is chaotic, we can first sort these tuples according to the pages they belong to:
conclusion
This section gave me a scare at the beginning, after reading it, I found that it is actually talking about the role of indexes, maybe this is usually familiar with a little, so many think it is not worth pointing out. Indexing is a good idea. We expect the query to hit the index as closely as possible, so the primary key must be indexed. Other attributes are up to the administrator. Of course, more indexes are not always better, because more indexes mean higher maintenance costs when the data in the table changes. Trade-off is a concept that exists all the time, “balance exists between all things”.
Expression Evaluation
As mentioned above, SQL statements are parsed into a tree: operators are intermediate nodes and operands are leaf nodes.
Predicate evaluation after the expression tree is built is slow because the DBMS must traverse the nodes in the tree and consider what each node does.
So sometimes it’s better to just judge the expression itself.
JIT
Just-in time: after the program is started, the code is compiled on the fly step by step as the program runs. In contrast, traditional compilers translate all the code into machine language before running the program.