Original author: Christophe Kalenzaga. Translated by Panblack (Bole Online), proofread by Li-min Huang.

When it comes to relational databases, I can't help thinking that something is missing. They are everywhere and come in many varieties, from the small and useful SQLite to the powerful Teradata. Yet very few articles explain how a database actually works. You can google "relational database principles" yourself and see how few results there are, and the ones I found were short. If you search for the latest trendy technologies instead (big data, NoSQL, or JavaScript), you'll find plenty of articles that dig into how they work.

Are relational databases so old and boring that no one wants to talk about them except in college textbooks, research literature and books?


As a developer, I don't like using things I don't understand. Besides, databases have been around for 40 years, so there must be a reason for that. Over the years, I've spent hundreds of hours really understanding these weird black boxes that I use every day. Relational databases are interesting because they are based on practical and reusable concepts. If you're interested in understanding a database but have never had the time or inclination to dig into this broad subject, you should enjoy this article.

Although the title of this article is explicit, my goal is not to show you how to use a database. So you should already know how to write a simple join query and basic CRUD operations (create, read, update, delete); otherwise you may not understand this article. That's the only thing you need to know; I'll explain the rest.

I’ll start with a little bit of computer science, like time complexity. I know some people hate this concept, but you can’t understand the subtleties inside a database without it. Since this is a big topic, I’ll focus on what I think is necessary: the way databases handle SQL queries. I’m just going to cover the basic concepts behind the database so that by the end of this article you have a good idea of what’s going on underneath.

In computer science, the time complexity of an algorithm is a function that quantitatively describes how the running time grows with the amount of data. If you are not familiar with this concept, read the Wikipedia (or Baidu Encyclopedia) article on it first; it will help you understand the rest of this article.

Since this is a long technical article that involves a lot of algorithms and data structures, you can take your time. Some concepts are hard to understand, so you can skip them without affecting your overall understanding.

This article is roughly divided into three parts:

  • An overview of the low-level and high-level database components
  • An overview of the query optimization process
  • An overview of transaction and buffer pool management

Back to basics

A long time ago (in a galaxy far, far away…), developers had to know exactly how many operations their code would take. They kept their algorithms and data structures in mind, because their computers were slow and they couldn't afford to waste CPU and memory.

In this section, I will remind you of some of these concepts, because they are essential to understanding databases. I will also introduce the concept of database indexes.

O(1) vs O(n^2)

Many developers today don't care about time complexity… and they're right!

But when you're dealing with a large amount of data (and I'm not talking about mere tens of thousands) or when you're fighting for milliseconds, it becomes critical to understand this concept. And guess what: databases have to deal with both situations! I won't keep you long; if you only take away one concept, let it be this one. It will help us later to understand the idea of cost-based optimization.

The concept

Time complexity is used to measure how long an algorithm takes for a given amount of data. To describe this complexity, computer scientists use the mathematical big O notation. This notation uses a function that describes how many operations an algorithm needs to handle a given amount of data.

For example, when I say "this algorithm is in O(some_function())", I mean that for a certain amount of data, the algorithm needs some_function(amount_of_data) operations to do its job.

What matters is not the amount of data itself, but the way the number of operations increases when the amount of data increases. Time complexity doesn't give the exact number of operations, but it's a good indication.

[Figure: number of operations versus amount of data for O(1), O(log(n)), O(n), O(n*log(n)) and O(n^2), on logarithmic scales]

The figure shows how the different types of complexity evolve; I used a logarithmic scale to build it, so the amount of data rapidly goes from 1 to 1 billion. We can see that:

  • Green: O(1), or constant complexity, stays constant (otherwise it wouldn't be called constant complexity).
  • Red: O(log(n)), logarithmic complexity, stays low even with billions of data items.
  • Pink: the worst complexity, O(n^2), quadratic complexity, where the number of operations explodes quickly.
  • Black and blue: the other two complexities grow quickly.

Example

With a small amount of data, the difference between O(1) and O(n^2) is negligible. For example, let's say you have an algorithm that needs to process 2,000 elements.

  • An O(1) algorithm costs you 1 operation
  • An O(log(n)) algorithm costs you 7 operations
  • An O(n) algorithm costs you 2,000 operations
  • An O(n*log(n)) algorithm costs you 14,000 operations
  • An O(n^2) algorithm costs you 4,000,000 operations

The difference between O(1) and O(n^2) seems huge (4 million), but you lose at most 2 milliseconds, just the blink of an eye. Indeed, today’s processors can process hundreds of millions of operations per second. This is why performance and optimization are not an issue in many IT projects.

As I said, it's still important to know this concept when dealing with a huge amount of data. If the algorithm now needs to process 1,000,000 elements (which is not that big for a database):

  • An O(1) algorithm costs you 1 operation
  • An O(log(n)) algorithm costs you 14 operations
  • An O(n) algorithm costs you 1,000,000 operations
  • An O(n*log(n)) algorithm costs you 14,000,000 operations
  • An O(n^2) algorithm costs you 1,000,000,000,000 operations

I haven't done the math, but I'd say that with the O(n^2) algorithm you have time to grab a coffee (or even a refill!). If you add another zero to the amount of data, you'll have time to take a long nap.

Going deeper

To give you an idea:

  • Searching in a good hash table gives you an element in O(1)
  • Searching in a well-balanced tree gives you a result in O(log(n))
  • Searching in an array gives you a result in O(n)
  • The best sorting algorithms have O(n*log(n)) complexity
  • A bad sorting algorithm has O(n^2) complexity

Note: We will examine these algorithms and data structures in the following sections.

There are several types of time complexity:

  • the average case
  • the best case
  • the worst case

Time complexity is often the worst-case scenario.

I'm only talking about time complexity here, but complexity also applies to:

  • the memory consumption of an algorithm
  • the disk I/O consumption of an algorithm

There are worse complexities than n^2, such as:

  • n^4: that sucks! Some of the algorithms I'll mention have this complexity.
  • 3^n: even worse! One of the algorithms we'll see in the middle of this article has this complexity (and it's really used in many databases).
  • n! (factorial): you'll never get your results, even with a small amount of data.
  • n^n: if you end up with this complexity, you should ask yourself whether IT is really your field.

Note: I did not give the actual definition of “big O notation”, but just used the concept. Check out this article on Wikipedia.

Merge sort

What do you do when you want to sort a set? What? Call sort()… Well, you’re right… But for databases, you need to understand how the sort() function works.

There are several excellent sorting algorithms, but I’ll focus on the most important one: merge sort. You may not know what data sorting is for yet, but you will after reading the query optimization section. Furthermore, merge sort will help us understand a common database join operation, namely merge join.

merge

Like many useful algorithms, merge sort is based on the trick that it takes only N operations to merge two sorted sequences of size N/2 into a sorted sequence of N elements. This method is called merging.

Let’s use a simple example to see what this means:

[Figure: merging two sorted 4-element sequences into one sorted 8-element sequence]

As you can see in this figure, to build the final sorted 8-element sequence you only need to iterate once through the two 4-element sequences, because both 4-element sequences are already sorted:

  • 1) in both sequences, compare the current elements (current = first, the first time)
  • 2) take the smaller of the two and put it into the 8-element sequence
  • 3) move to the next element in the sequence you took the smaller element from
  • 4) repeat steps 1, 2 and 3 until you reach the last element of one of the sequences
  • 5) then take the rest of the other sequence and append it to the 8-element sequence.

This works because both 4-element sequences are already sorted and you don’t need to “go back” to the sequence to look for comparisons.
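Here is a minimal C sketch of that merge step, assuming two already-sorted integer arrays (the function name and the sample values are just for illustration):

C

#include <stdio.h>

/* Merge two sorted arrays a (length n) and b (length m) into out.
   Cost: about n + m operations, which is the trick merge sort relies on. */
void merge(const int *a, int n, const int *b, int m, int *out)
{
    int i = 0, j = 0, k = 0;
    while (i < n && j < m)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];   /* take the smaller current element */
    while (i < n) out[k++] = a[i++];   /* copy the tail of a */
    while (j < m) out[k++] = b[j++];   /* copy the tail of b */
}

int main(void)
{
    int left[]  = {1, 3, 6, 8};
    int right[] = {2, 4, 5, 7};
    int result[8];
    merge(left, 4, right, 4, result);
    for (int k = 0; k < 8; k++) printf("%d ", result[k]);   /* prints: 1 2 3 4 5 6 7 8 */
    printf("\n");
    return 0;
}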

[Translator's note: the original article includes a GIF (too long to reproduce here, so I removed it) that clearly demonstrates the merge process described above; the prose description alone is less clear.]


Now that we understand the trick, here’s my merge sort pseudocode.

C

array mergeSort(array a)
   if(length(a)==1)
      return a[0];
   end if

   //recursive calls
   [left_array right_array] := split_into_2_equally_sized_arrays(a);
   array new_left_array := mergeSort(left_array);
   array new_right_array := mergeSort(right_array);

   //merging the 2 small ordered arrays into a big one
   array result := merge(new_left_array,new_right_array);
   return result;

Merge sort breaks the problem into smaller problems and solves the original problem by solving the smaller ones (this type of algorithm is called divide and conquer). If you don't get it, don't worry; I didn't understand it the first time I saw it either. If it helps, you can see this algorithm as a two-phase algorithm:

  • The split phase, in which the sequence is divided into smaller sequences
  • In the sorting phase, small sequences are joined together (using merge algorithms) to form larger sequences

Split phase

[Figure: division phase: the 8-element sequence is split down to unary sequences in 3 steps]

During the division phase, the sequence is divided into unary sequences in 3 steps. The formal number of steps is log(N) (since N=8, log(N)=3). [Translator's note: the logarithm here is base 2.]

How do I know that?

I'm a genius! In one word: mathematics. The idea is that each step divides the length of the initial sequence by 2, so the number of steps is the number of times you can divide the length of the initial sequence by 2. That's exactly the definition of the base-2 logarithm.

Sorting phase

[Figure: sorting phase: pairs of sorted sequences are merged step by step into the final 8-element sorted sequence]

In the sorting phase, you start with unary sequences. In each step, you apply multiple merge operations at a total cost of N=8 operations.

  • The first step is 4 merges, each costing 2 operations.
  • Step two, two merges, each costing four operations.
  • Step 3, 1 merge, cost 8 operations.

Since there are log(N) steps and each step costs N operations, the total cost is N*log(N) operations.

[Translator’s note: This full GIF shows the whole process of splitting and sorting.]


The power of merge sort

Why is this algorithm so powerful?

Because:

  • You can change the algorithm to save memory space by modifying the input sequence rather than creating a new sequence.

Note: This algorithm is called an in-place algorithm.

  • You can change the algorithm to use both disk and a small amount of memory at the same time while avoiding a huge disk I/O penalty: the idea is to load into memory only the parts currently being processed. This is an important technique when you need to sort a multi-gigabyte table with only a 100-megabyte memory buffer.

Note: This is called “external sorting”.

  • You can change the algorithm to run on multiple processors/threads/servers.

For example, distributed merge sorting is one of the key components of Hadoop, the famous big data framework.

  • This algorithm turns everything into gold (it does!).

This sorting algorithm is used in most (if not all) databases, but it is not the only one. If you want to learn more, you can check out this paper, which explores the strengths and weaknesses of sorting algorithms commonly used in databases.

Arrays, trees, and hash tables

Now that we know the idea behind time complexity and sorting, I must introduce you to three data structures. This is important because they are the backbone of modern databases. I will also introduce the concept of database indexes.

Array

A two-dimensional array is the simplest data structure. A table can be thought of as an array, for example:

[Figure: a table seen as a two-dimensional array of rows and columns]

This two-dimensional array is a table with rows and columns:

  • Each row represents a principal
  • Columns are used to describe the characteristics of the subject
  • Each column stores a certain type of data (integer, string, date…)

While this approach is great for storing and visualizing data, it's terrible when you need to look for a specific value. For example, if you want to find all the people who work in the UK, you have to look at every row to check whether it belongs to the UK. This costs N operations (N being the number of rows), which isn't bad, but could there be a faster way? This is where trees come into play.

Trees and database indexes

A binary search tree is a binary tree with a special property: the key of each node must be:

  • Larger than any key stored in the left subtree
  • Smaller than any key stored in the right subtree

[Translator's note: a binary search tree is also known as a binary sort tree.]

Concept

[Figure: a binary search tree with 15 elements, rooted at 136]

This tree has N=15 elements. Let's say I'm looking for 208:

  • I start with the root, whose key is 136. Since 136 < 208, I look at the right subtree of node 136.
  • 398 > 208, so I look at the left subtree of node 398
  • 250 > 208, so I look at the left subtree of node 250
  • 200 < 208, so I look at the right subtree of node 200. But 200 has no right subtree, so the value doesn't exist (if it did exist, it would be in the right subtree of 200).

Now let’s say I’m looking for 40

  • I start with the root, whose key is 136. Since 136 > 40, I look at the left subtree of node 136.
  • 80 > 40, so I look at the left subtree of node 80
  • 40 = 40: the node exists. I extract the row id stored inside the node (it isn't drawn in the figure) and look up the corresponding row in the table.
  • Knowing the row id, I know exactly where the data is in the table, so I can fetch it immediately.

In the end, both searches cost the number of levels inside the tree. If you read the merge sort part carefully, you should see that there are log(N) levels, so the cost of the search is log(N). Not bad!
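To make this concrete, here is a small C sketch of the search described above, assuming each node stores an integer key plus a row id (the struct layout and names are illustrative):

C

#include <stdio.h>

typedef struct Node {
    int key;                     /* the indexed value */
    long row_id;                 /* where the corresponding row lives in the table */
    struct Node *left, *right;
} Node;

/* Returns the node holding 'key', or NULL if the key is not in the tree.
   Cost: one comparison per level, i.e. O(log(N)) for a balanced tree. */
const Node *bst_search(const Node *node, int key)
{
    while (node != NULL) {
        if (key == node->key)
            return node;
        node = (key < node->key) ? node->left : node->right;
    }
    return NULL;
}

int main(void)
{
    /* a tiny hard-coded tree:   136
                                /    \
                              80     398   */
    Node n80  = {80,  2, NULL, NULL};
    Node n398 = {398, 3, NULL, NULL};
    Node root = {136, 1, &n80, &n398};

    const Node *hit = bst_search(&root, 80);
    if (hit != NULL)
        printf("found key %d at row %ld\n", hit->key, hit->row_id);
    return 0;
}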

Back to our question

That was pretty abstract, so let's go back to our problem. Instead of silly integers, imagine the strings that represent the country of each person in the previous table. Suppose you have a tree that contains the column "country" of the table:

  • If you want to know who works in the UK
  • you look up the node in the tree that represents the UK
  • inside the "UK node" you find the locations of the rows of the people working in the UK

This search only takes log(N) operations, as opposed to N operations if you used the array directly. What you just imagined was a database index.

B+ tree index

This tree works well when you're looking for a specific value, but you run into big trouble when you need to get multiple elements between two values. It would cost O(N), because you'd have to look at every node in the tree and check whether it lies between the two values (for example, with an in-order traversal of the tree). Moreover, this operation is not disk-I/O friendly, since you'd have to read the full tree. We need a way to perform a range query efficiently. To solve this problem, modern databases use a modified version of the tree called a B+ tree. In a B+ tree:

  • only the lowest nodes (the leaves) store information (the locations of the rows in the associated table)
  • the other nodes are just there to route the search to the right node.

[Figure: a B+ tree index on the same data, compare with the binary tree above (diagram adapted from Wikipedia)]

As you can see, there are more nodes (twice as many). Indeed, you have additional "decision nodes" that help you find the right leaf node (the one that stores the location of the rows in the associated table). But the search complexity is still O(log(N)); there's just one more level. The big difference is that the lowest nodes are linked to their successors.

Using this B+ tree, suppose you want to find values between 40 and 100:

  • You just have to look for 40 (or the closest value above 40 if 40 doesn't exist), the same way you did with the previous tree.
  • Then follow the links between the leaves to gather the successors of 40, until you reach 100.

Let's say you find M successors and the tree has N nodes. Searching for a given node costs log(N), like the previous tree. But once you've found that node, you gather the M successors by following the links between the leaves, which costs M operations. So this search only costs M + log(N) operations, versus N operations with the previous tree. Moreover, you don't need to read the full tree (only M + log(N) nodes), which means less disk usage. If M is small (say, 200 rows) and N is large (1,000,000 rows), the difference is huge.
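To see why the linked leaves matter, here is a rough C sketch of the leaf level of a B+ tree (everything here is simplified for illustration; a real leaf maps to a disk page and holds many more keys):

C

#include <stdio.h>

#define LEAF_CAPACITY 4

typedef struct Leaf {
    int  nkeys;
    int  keys[LEAF_CAPACITY];      /* sorted keys stored in this leaf */
    long row_ids[LEAF_CAPACITY];   /* location of each row in the table */
    struct Leaf *next;             /* link to the next leaf: the B+ tree trick */
} Leaf;

/* Range scan: 'start_leaf' is the leaf found by descending the tree for 'low'
   (that descent costs log(N)); following the leaf links then costs M, the
   number of matching keys. Total: M + log(N). */
void range_scan(const Leaf *start_leaf, int low, int high)
{
    for (const Leaf *leaf = start_leaf; leaf != NULL; leaf = leaf->next) {
        for (int i = 0; i < leaf->nkeys; i++) {
            if (leaf->keys[i] < low)  continue;   /* before the range */
            if (leaf->keys[i] > high) return;     /* past the range: stop */
            printf("key %d -> row %ld\n", leaf->keys[i], leaf->row_ids[i]);
        }
    }
}

int main(void)
{
    Leaf leaf2 = { 3, {89, 97, 100, 0}, {8, 9, 10, 0}, NULL };
    Leaf leaf1 = { 4, {25, 40, 60, 78}, {1, 2, 3, 4}, &leaf2 };
    range_scan(&leaf1, 40, 100);   /* pretend the descent for key 40 landed on leaf1 */
    return 0;
}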

But this brings new problems (again!). If you add or remove a row in the database (and therefore in the associated B+ tree index):

  • You have to keep the order between the nodes in the B+ tree, otherwise the nodes will get messy and you won’t be able to find the nodes you want.
  • You have to keep the number of levels in the B+ tree as low as possible, otherwise O(log(N)) turns into O(N).

In other words, B+ trees need to self-organize and self-balance. Thank goodness we have smart delete and insert. But there is a cost: in B+ trees, insert and delete operations are order log(N) complexity. So some people have heard that using too many indexes is a bad idea. Yes, you slow down the quick insert/update/delete of a row in the table because the database needs to update the index of the table with costly O(log(N)) operations per index. Furthermore, adding indexes means adding more workload to the transaction manager (which we’ll explore at the end of this article).

For more details, check out the Wikipedia article on B+ trees. If you want an example of a B+ tree implementation in a database, check out this article and this article by the MySQL core developers. Both articles focus on how InnoDB (the MySQL engine) handles indexes.

Hash table

Our last important data structure is the hash table. Hash tables are very useful when you want to find values quickly. Furthermore, understanding hash tables will help us understand a common database join operation called “hash join.” This data structure is also used by the database to hold internal things (such as lock tables or buffer pools, which we will examine later).

A hash table is a data structure that quickly finds an element by its key. To build a hash table, you need to define:

  • a key for your elements
  • a hash function for the keys. The computed hash of a key gives the location of the element (called a hash bucket).
  • a comparison function for the keys. Once you've found the right hash bucket, you use the comparison function to find the element you're looking for inside the bucket.

A simple example

Let’s look at a visual example:

[Figure: a hash table with 10 buckets; the hash function is the key modulo 10]

This hash table has 10 hash buckets. Since I'm lazy, I only drew 5 of them, but I know you're smart, so I'll let you imagine the other 5. The hash function I use is the key modulo 10; in other words, I keep only the last digit of an element's key to find its hash bucket:

  • If the last digit of the element is 0, it goes into hash bucket 0,
  • If the last digit of the element is 1, it goes into hash bucket 1,
  • If the last digit of the element is 2, it goes into hash bucket 2,
  • …

The comparison function I use simply checks whether two integers are equal.


Let’s say you’re looking for element 78:

  • The hash table computes the hash code of 78, which is equal to 8.
  • Look for hash bucket 8 and the first element you find is 78.
  • Return element 78.
  • The query took only two operations (one to compute the hash value and one to look up elements in the hash bucket).

Now, let’s say you’re looking for element 59:

  • The hash table computes the hash code of 59, which is equal to 9.
  • Look in hash bucket 9; the first element found is 99. Since 99 is not equal to 59, element 99 is not the right one.
  • With the same logic, it checks the second element (9), the third (79)…, and the last one (29).
  • The element does not exist.
  • The search took seven operations.
A good hash function

As you can see, the cost is not the same depending on the value you’re looking for.

If I change the hash function to the key modulo 1,000,000 (in other words, keeping the last 6 digits), the second search only costs 1 operation, because there are no elements in hash bucket 000059. The real challenge is to find a good hash function that creates buckets containing very few elements.

In my example, finding a good hash function is easy, but this is a simple example. Finding a good hash function is harder when the key is:

  • 1 string (such as a person’s last name)
  • 2 strings (such as a person’s first and last name)
  • 2 strings and a date (such as a person’s first and last name, and date of birth)

If you have a good hash function, searching in a hash table is in O(1).
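Here is a small C sketch of the bucket lookup described above, using the key-modulo-10 hash function and integer equality as the comparison function (the bucket contents simply mirror the example):

C

#include <stdio.h>

#define NUM_BUCKETS 10
#define BUCKET_SIZE 8

typedef struct {
    int count;
    int elements[BUCKET_SIZE];
} Bucket;

/* hash function: keep the last digit of the key */
static int hash(int key) { return key % NUM_BUCKETS; }

/* Returns 1 if 'key' is in the table, 0 otherwise.
   Cost: 1 hash computation + as many comparisons as the bucket holds. */
int lookup(const Bucket buckets[NUM_BUCKETS], int key)
{
    const Bucket *b = &buckets[hash(key)];
    for (int i = 0; i < b->count; i++)
        if (b->elements[i] == key)   /* comparison function: integer equality */
            return 1;
    return 0;
}

int main(void)
{
    Bucket buckets[NUM_BUCKETS] = {0};
    buckets[8] = (Bucket){ .count = 1, .elements = {78} };
    buckets[9] = (Bucket){ .count = 4, .elements = {99, 9, 79, 29} };

    printf("78 -> %s\n", lookup(buckets, 78) ? "found" : "not found");
    printf("59 -> %s\n", lookup(buckets, 59) ? "found" : "not found");
    return 0;
}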

Array vs. hash table

Why not use an array?

Well, that’s a good question.

  • A hash table can be half loaded into memory while the remaining buckets stay on disk.
  • With arrays, you need a contiguous memory space. If you load a large table, it is difficult to allocate enough contiguous memory space.
  • With hash tables, you can select the keywords you want (for example, a person’s country and last name).

For more detailed information, you can read my article on efficient hash table implementations on Java HashMap. You don’t need to know Java to understand the concepts in this article.

A global overview

Now that we’ve looked at the basic components inside the database, we need to get back to the big picture.

A database is a collection of information that can easily be accessed and modified. But a simple bunch of files could do the same job. In fact, the simplest databases, like SQLite, are just a bunch of files. But SQLite is a well-crafted bunch of files, because it allows you to:

  • Use transactions to ensure data security and consistency
  • Fast processing of over a million pieces of data

Databases can generally be understood as follows:

[Figure: a global overview of the components of a database]

Before writing this part, I read many books and papers that each described databases in their own way. So don't focus too much on how I organized this database or how I named the processes; I chose my own way of describing these concepts to fit this article. What matters is the general idea: a database is made of a variety of components that interact with each other.

Core components:

  • Process Manager: Many databases have a pool of processes/threads that need to be managed properly. Furthermore, to achieve nanosecond operations, some modern databases use their own threads rather than operating system threads.
  • Network Manager: Network I/O is a big problem, especially for distributed databases. So some databases have their own network manager.
  • File System Manager: Disk I/O is the primary bottleneck for databases. It is very important to have a file system manager that can perfectly handle and even replace the OS file system.
  • Memory Manager: To avoid the performance penalty of disk I/O, a large amount of memory is required. But if you’re dealing with a lot of memory you need an efficient memory manager, especially if you have a lot of queries using memory at the same time.
  • Security Manager: Used for user authentication and authorization.
  • Client Manager: Used to manage Client connections.

Tools:

  • Backup Manager: Used to save and restore data.
  • Recovery Manager: Used to restart the database to a consistent state after a crash.
  • Monitor Manager: A tool for logging database activity information and providing monitoring of databases.
  • Administration manager: stores metadata (such as the names and structures of tables) and provides tools to manage databases, schemas, tablespaces…

Query Manager:

  • Query Parser: Checks whether a Query is valid
  • Query rewriter: Used to pre-optimize queries
  • Query Optimizer: Used to optimize queries
  • Query Executor: Used to compile and execute queries

Data manager:

  • Transaction Manager: Used to process transactions
  • Cache Manager: Stores data in memory before it is used or written to disk
  • Data Access Manager: Accesses Data on disks

For the rest of this article, I’ll focus on how databases manage SQL queries through the following processes:

  • Client manager
  • Query manager
  • Data manager (including recovery manager)

Client manager


The client manager handles client communication. A client can be a server, an end user or an end application. The client manager provides different ways to access the database through a set of well-known APIs (JDBC, ODBC, OLE-DB…).

The client manager also provides a proprietary database access API.

When you connect to a database:

  • The manager first checks your authentication information (username and password) and then checks if you are authorized to access the database. These permissions are assigned by the DBA.
  • The manager then checks to see if there are any free processes (or threads) to process your query.
  • The manager also checks to see if the database is heavily loaded.
  • The manager may wait a while to get the required resources. If the wait time reaches the timeout, it closes the connection with a readable error message.
  • The manager then sends your query to the query manager for processing.
  • Because the query process is not “all or nothing,” once it gets the data from the query manager, it saves some of the results into a buffer and starts sending them to you.
  • If it encounters a problem, the manager closes the connection, sends you a readable explanation, and then frees the resource.

Query manager


This part is where the power of a database lies: a poorly written query is turned into fast executable code, and the result of that code is sent back to the client manager. This is a multi-step operation:

  • The query is first parsed and judged to be valid
  • It is then rewritten to remove useless operations and add pre-optimizations
  • It is then optimized to improve performance and converted into executable code and data access plans.
  • The plan is then compiled
  • and finally executed

I won’t talk too much about the last two steps here, because they’re not very important.

After reading this section, if you need more in-depth knowledge, I suggest you read:

  • The seminal research paper on cost-based optimization (1979): Access Path Selection in a Relational Database Management System. It's only 12 pages long and understandable with an average level in computer science.
  • Very good, very in-depth introduction to how to optimize queries in DB2 9.x
  • A very good introduction to PostgreSQL how to optimize queries. This is one of the most accessible documents because it says “Let’s see what query plan PostgreSQL provides in this case” rather than “Let’s see what algorithm PostgreSQL uses”.
  • Official SQLite optimization documentation. “Easy” to read because SQLite uses simple rules. Again, this is the only official document that really explains how SQLite works.
  • How SQL Server 2005 optimizes queries
  • Oracle 12C Optimization Whitepaper
  • Two query optimization tutorials, one and two. This tutorial is from the author of Database System Concepts, a good read that focuses on disk I/O, but requires a good computer science level.
  • Another principle tutorial, this one I find easier to understand, focuses only on join operators and disk I/O.

Query parser

Every SQL statement is sent to the parser, which checks the syntax of the query. If you made a mistake in your query, the parser rejects it. For example, if you wrote "SLECT …" instead of "SELECT …", the story ends there.

But that's not all. The parser also checks that the keywords are used in the right order; for example, a WHERE before a SELECT would be rejected.

The parser then parses the tables and fields in the query, using database metadata to check:

  • whether the tables exist
  • whether the table columns exist
  • whether the operations are possible for the column types (for example, you can't compare an integer with a string, and you can't use a substring() function on an integer)

Next, the parser checks whether you have permission to read (or write) the table in the query. Again: these permissions are assigned by the DBA.

During parsing, the SQL query is transformed into an internal representation (usually a tree).

If all is well, the internal representation is sent to the query rewriter.

Query rewriter

At this step, we have an internal representation of the query, and the goal of the rewrite is:

  • Pre-optimize the query
  • Avoid unnecessary operations
  • Help the optimizer find a reasonably good solution

The rewriter checks the query according to a set of known rules. If the query matches a pattern rule, the query is rewritten to follow that rule. Here is a non-exhaustive list of (optional) rules:

  • View merge: If you use a view in a query, the view is converted to its SQL code.
  • Subquery flattening: subqueries are hard to optimize, so the rewriter tries to remove them

Such as:

MySQL

SELECT PERSON.*
FROM PERSON
WHERE PERSON.person_key IN
(SELECT MAILS.person_key
FROM MAILS
WHERE MAILS.mail LIKE 'christophe%');
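If the rewriter flattens the subquery, the statement can be roughly turned into a join like this (an illustrative rewrite; a real rewriter would also take care of duplicates, for example with a semi-join):

MySQL

SELECT PERSON.*
FROM PERSON, MAILS
WHERE PERSON.person_key = MAILS.person_key
AND MAILS.mail LIKE 'christophe%';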

  • Remove unnecessary operators: For example, if you use DISTINCT and you have a UNIQUE constraint (which in itself prevents data duplication), the DISTINCT keyword is removed.
  • Eliminate redundant joins: If the same JOIN condition appears twice, such as a hidden JOIN condition in the view, or useless JOIN due to transitivity, it will be eliminated.
  • Constant arithmetic evaluation: if your query requires a calculation, it is computed once during rewriting. For example, WHERE AGE > 10+2 is transformed into WHERE AGE > 12, and TODATE("some date string") is transformed into the date in datetime format.
  • (Advanced) Partition Pruning: If you are using a Partition table, the rewriter can find the Partition to use.
  • (Advanced) Materialized view rewrite: If you have a Materialized view that matches a subset of query predicates, the rewriter checks if the view is up to date and modifies the query to use the Materialized view instead of the original table.
  • (Advanced) Custom rules: If you have custom rules to modify a query (like Oracle Policy), the rewrite will enforce those rules.
  • (advanced) OLAP transformations: parse/windowing functions, star joins, ROLLUP functions… Both transformations take place (but I’m not sure if this is done by a rewrite or optimizer, because the two processes are so closely related that it depends on the database).

[Translator's note: a predicate is a conditional expression that evaluates to true or false.]

The rewritten query is then sent to the optimizer, where the fun begins.

Statistics

Before we look at how databases optimize queries, we need to talk about statistics, because a database without statistics is stupid. The database does not analyze its own data unless you explicitly tell it to, and without analysis it will make (very) bad assumptions.

But what kind of information does a database need?

I must talk (briefly) about how databases and operating systems hold data. The smallest units used by both are called pages or blocks (4 or 8 KB by default). This means that if you only need 1KB, it will take up a page. If the page size is 8KB, you’re wasting 7KB.

Back to statistics! When you ask the database to collect statistics, the database calculates the following values:

  • the number of rows and pages in the table
  • for each column of the table: the distinct data values, the length of the data values (min, max, average), the range of the data values (min, max, average)
  • information about the table's indexes

These statistics help the optimizer estimate the disk I/O, CPU and memory usage required by a query.

The statistics on each column are very important. For example, suppose the table PERSON needs to be joined on two columns: LAST_NAME and FIRST_NAME. From the statistics, the database knows that FIRST_NAME has only 1,000 distinct values while LAST_NAME has 1,000,000 distinct values. Therefore, the database joins on LAST_NAME, FIRST_NAME rather than FIRST_NAME, LAST_NAME: since last names are unlikely to be equal, comparing the first 2 or 3 characters of LAST_NAME is enough most of the time, which greatly reduces the number of comparisons.

But these are just basic statistics. You can also ask the database to compute advanced statistics called histograms. A histogram describes the distribution of values inside a column, for example:

  • the most frequent values
  • the quantiles

These extra statistics help the database find an even better query plan, especially for equality predicates (e.g. WHERE AGE = 18) or range predicates (e.g. WHERE AGE > 10 AND AGE < 40), because the database gets a better idea of the number of rows concerned by these predicates (note: the technical term for this concept is selectivity).

Statistics are stored in database metadata, such as the location of statistics for (non-partitioned) tables:

  • Oracle: USER/ALL/DBA_TABLES and USER/ALL/DBA_TAB_COLUMNS
  • DB2: syscat.tables and syscat.columns

Statistics must be up to date. There is nothing worse than a database thinking a table has only 500 rows when it really has 1,000,000. The only downside of statistics is that they take time to compute, which is why most databases don't compute them automatically by default. When computing statistics on millions of rows becomes difficult, you can choose to compute only the basic statistics, or to compute them on a sample of the data.
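For example, here is how you can explicitly ask a few databases to compute statistics on a PERSON table (the schema name and the 10% sample are illustrative; the sampling option echoes the story below):

SQL

-- MySQL / MariaDB
ANALYZE TABLE PERSON;

-- PostgreSQL
ANALYZE PERSON;

-- Oracle (PL/SQL), sampling roughly 10% of the rows
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'MY_SCHEMA', tabname => 'PERSON', estimate_percent => 10);
END;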

For example, I once worked on a project that had to handle a database with hundreds of millions of rows in each table. I chose to compute statistics on only 10% of the data, which saved a huge amount of time. It turned out to be a bad decision, because occasionally the 10% chosen by Oracle 10g for a given column of a given table was very different from the full 100% (which is very unlikely for a table with 100 million rows). That wrong statistic turned a query that normally took 30 seconds into one that occasionally took 8 hours, and finding the root cause was a nightmare. This example shows how important statistics are.

Note: Of course, each database also has its own specific higher level statistics. If you want to learn more, read the documentation for the database. Having said that, I’ve done my best to understand how statistics are used, and the best official documentation I’ve found is from PostgreSQL.

Query optimizer


All modern databases use cost-based optimization, or CBO, to optimize queries. The idea is to set a cost for each operation and find the best way to reduce query costs by applying the cheapest series of operations.

To understand how the cost optimizer works, I thought it best to “get a feel” for the complexity behind the task with an example. Here I’ll show you three ways to join two tables, and we’ll quickly see that even a simple join query is a nightmare for the optimizer. Later, we’ll see how the real optimizer works.

For these join operations, I will focus on their time complexity, but a real database optimizer computes their CPU cost, disk I/O cost and memory requirements. The difference between time complexity and CPU cost is that time complexity is very approximate (it's for lazy guys like me), whereas CPU cost would have to count every operation: an addition, an "if statement", a multiplication, an iteration… Moreover:

  • each high-level code operation translates into a specific number of low-level CPU operations.
  • the cost of a CPU operation (in terms of CPU cycles) is different for an Intel Core i7, an Intel Pentium 4, an AMD Opteron…; in other words, it depends on the CPU architecture.

It was much easier (at least for me) to use time complexity, which allowed me to understand the concept of CBO. Because disk I/O is an important concept, I mention it occasionally. Keep in mind that most of the time the bottleneck is disk I/O, not CPU usage.

Indexes

We already saw B+ tree indexes when we talked about trees; just remember that an index is already sorted.

FYI: there are other types of indexes, such as bitmap indexes, which don't have the same cost as B+ tree indexes in terms of CPU, disk I/O and memory.

In addition, many modern databases can dynamically generate temporary indexes only for current queries in order to improve the cost of execution planning.

Access path

Before you can apply the Join operators, you first need to get the data. Here’s how to get the data.

Note: Since the real problem with all access paths is disk I/O, I won’t talk too much about time complexity.

[Translator's note: see also Oracle's four types of index scans.]

Full scan

If you've read an execution plan, you've probably seen the words "full scan" (or just "scan"). Simply put, a full scan is the database reading a table or an index entirely. In terms of disk I/O, a full table scan is obviously more expensive than a full index scan.

Range scan

Another type of scan is the index range scan, which is used, for example, when you have a predicate like "WHERE AGE > 20 AND AGE < 40".

Of course, you need an index on the AGE field to use an index range scan.

As we saw in the first part, the time cost of a range scan is approximately log(N) + M, where N is the number of keys in the index and M is the estimated number of rows inside the range. Thanks to statistics, we know the values of N and M. Moreover, with a range scan you don't need to read the full index, so it's less expensive in terms of disk I/O than a full scan.

Unique scan

If you only need one value from the index, you can use a unique scan.

Access by ROW ID

Most of the time, if the database uses an index, it then has to look up the rows associated with the index entries; to do this, it uses access by row id.

For example, suppose you run a query along these lines (the table and column names are illustrative):
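MySQL

-- illustrative: AGE is indexed, but the names are only stored in the table
SELECT FIRSTNAME, LASTNAME FROM PERSON WHERE AGE = 28;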

If the age column of the Person table has an index, the optimizer will use the index to find all the people aged 28, and then it will go to the table and read the relevant rows, because the only information in the index is age and you want the first and last name.

But suppose instead you run a query along these lines (again, the names are illustrative):
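MySQL

-- illustrative: nothing from PERSON is selected, only its indexed column is used for the join
SELECT TYPE_PERSON.CATEGORY
FROM PERSON, TYPE_PERSON
WHERE PERSON.AGE = TYPE_PERSON.AGE;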

The index on the PERSON table is used for the join with TYPE_PERSON, but the PERSON table itself is not accessed by row id, since you're not asking for any information stored in that table.

While this approach works well for a small number of accesses, the real issue with it is disk I/O. If you need too many accesses by row id, the database might choose a full scan instead.

Other paths

I haven't listed every access path; if you want to know more, you can read the Oracle documentation. The names may differ in other databases, but the ideas behind them are the same.

Join operators

So, we know how to get our data; now let's join them!

I'll present the three common join operators: merge join, hash join and nested loop join. But before that, I need to introduce some new vocabulary: inner relation and outer relation. [Translator's note: don't confuse these with INNER JOIN and OUTER JOIN; in relational databases, a "relation" is simply a data set, often a table.] A relation can be:

  • A table
  • An index
  • The intermediate result of a previous operation (such as the result of a previous join operation)

When you join two relationships, the join algorithm treats the two relationships differently. For the rest of this article, I will assume:

  • The outer relation is the left-hand data set
  • The inner relation is the right-hand data set

For example, A JOIN B is the join between A and B, where A is the outer relation and B is the inner relation.

In most cases, the cost of A JOIN B is different from that of B JOIN A.

In this section, I will also assume that the outer relation has N elements and the inner relation has M elements. Keep in mind that real optimizers know the values of N and M by statistics.

Note: N and M are the cardinalities of the relations.

Nested loop joins

Nested loop joins are the simplest.

[Figure: nested loop join]

Its principle is simple:

  • for each row of the outer relation,
  • you look at all the rows of the inner relation to see whether any of them match.

Here is a sketch of the pseudocode (in the same style as the earlier listing; the names are mine):
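C

// a sketch: 'outer' and 'inner' are the two relations being joined
nested_loop_join(relation outer, relation inner)
   for each row o in outer
      for each row i in inner
         if (match_join_condition(o, i))
            write_result_in_output(o, i)
         end if
      end for
   end for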

Since this is a double iteration, the time complexity is O(N*M).

In terms of disk I/O, for each of the N rows of the outer relation, the inner loop needs to read M rows of the inner relation, so this algorithm needs to read N + N*M rows from disk. But if the inner relation is small enough to fit entirely in memory, you end up with only N + M reads. With this modification, the inner relation must be the smaller one, since it then has a better chance of fitting in memory.

Note: this changes nothing in terms of CPU cost, but in terms of disk I/O it's much better that each relation is read only once.

Of course, inner relationships can be replaced by indexes, which is better for disk I/O.

Since this algorithm is very simple, here is another version that is friendlier to disk I/O when the inner relation is too big to fit in memory. The idea is:

  • To avoid reading both relationships line by line,
  • You can read in clusters, store two clusters of rows in memory,
  • Compare the two clusters of data, keep the matched ones,
  • A new data cluster is then loaded from disk to continue the comparison
  • Until all the data is loaded.

A possible version of the algorithm looks like this:
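C

// a sketch: rows are read cluster by cluster to reduce the number of disk accesses
nested_loop_join_v2(file outer, file inner)
   for each cluster co in outer
      // co is now in memory
      for each cluster ci in inner
         // ci is now in memory
         for each row o in co
            for each row i in ci
               if (match_join_condition(o, i))
                  write_result_in_output(o, i)
               end if
            end for
         end for
      end for
   end for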

With this version, time complexity is the same, but disk access is reduced:

  • With the previous version, the algorithm needs N + N*M accesses (each access reads one row).
  • With the new version, the number of disk accesses becomes: (number of data clusters of the outer relation) + (number of data clusters of the outer relation) * (number of data clusters of the inner relation).
  • If you increase the size of a data cluster (i.e. read more rows per access), you reduce the number of disk accesses.
Hash join

Hash joins are more complex, but in many cases cost less than nested loop joins.

[Figure: hash join]

The logic of hash joins is as follows:

  • 1) Get all the elements of the inner relation
  • 2) Build an in-memory hash table from them
  • 3) Get the elements of the outer relation one by one
  • 4) Compute the hash of each element (with the hash function of the hash table) to find the associated bucket of the inner relation
  • 5) Check whether the elements in that bucket match the element of the outer relation.

I need to make some assumptions about time complexity to simplify things:

  • the inner relation is divided into X hash buckets
  • the hash function distributes hash values almost uniformly for both relations; in other words, the buckets are equally sized
  • matching an element of the outer relation against all the elements inside a bucket costs the number of elements inside the bucket.

The time complexity is (M/X) * N + cost_to_create_the_hash_table(M) + cost_of_the_hash_function * N. If the hash function creates buckets that are small enough, the complexity is O(M+N).
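To make the idea concrete, here is a compact C sketch of the in-memory hash join on two integer arrays (the bucket count and the data are arbitrary choices for the example):

C

#include <stdio.h>
#include <stdlib.h>

#define NUM_BUCKETS 16

typedef struct Entry {
    int value;
    struct Entry *next;
} Entry;

static int hash(int key) { return (unsigned)key % NUM_BUCKETS; }

/* Equi-join: build a hash table on the inner relation, probe it with the outer one. */
void hash_join(const int *outer, int n, const int *inner, int m)
{
    Entry *buckets[NUM_BUCKETS] = {0};

    /* steps 1 and 2: read the inner relation and build the in-memory hash table (cost ~ M) */
    for (int j = 0; j < m; j++) {
        Entry *e = malloc(sizeof *e);
        e->value = inner[j];
        e->next = buckets[hash(inner[j])];
        buckets[hash(inner[j])] = e;
    }

    /* steps 3, 4 and 5: probe the table with each element of the outer relation (cost ~ N * M/X) */
    for (int i = 0; i < n; i++)
        for (const Entry *e = buckets[hash(outer[i])]; e != NULL; e = e->next)
            if (e->value == outer[i])
                printf("match: %d\n", outer[i]);

    for (int b = 0; b < NUM_BUCKETS; b++)          /* free the hash table */
        while (buckets[b]) { Entry *e = buckets[b]; buckets[b] = e->next; free(e); }
}

int main(void)
{
    int outer[] = {1, 5, 7, 10};
    int inner[] = {7, 2, 10, 42};
    hash_join(outer, 4, inner, 4);   /* prints: match: 7, match: 10 */
    return 0;
}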

There is another version of the hash join that is more memory-friendly but less friendly in terms of disk I/O. This time:

  • 1) Compute hash tables for both inner and outer relations
  • 2) Save the hash table to disk
  • 3) Then compare hash buckets one by one (one reads into memory, the other reads line by line).
Merge join

The merge join is the only join that produces a sorted result.

Note: in this simplified merge join there is no distinction between the inner and outer tables; both play the same role. Real implementations make a difference, for example when dealing with duplicate values.

1. (Optional) Sort join operation: Both input sources are sorted by join key.

2. Merge join operation: merge sorted input sources together.

Sorting

We’ve already talked about merge sort, which is a good algorithm here (but not the best, hash joins are better if you have enough memory).

However, sometimes the dataset is already sorted, for example:

  • if the table is natively ordered, for example an index-organized table on the join condition
  • if the relation is an index on the join condition
  • if the join is applied on an intermediate result that was already sorted earlier in the query
Merge join

[Figure: merge join: merging two sorted relations]

This part is very similar to the merge operation of the merge sort we saw earlier. But this time, instead of picking every element from both relations, we only pick the elements that are equal in both relations. The idea is:

  • 1) compare the current elements of both relations (current = first, the first time)
  • 2) if they are equal, put both elements in the result and move to the next element in both relations
  • 3) if they are not equal, move to the next element in the relation with the smaller current element (because that next element might match)
  • 4) repeat steps 1, 2 and 3 until you reach the last element of one of the relations.

This works because both relationships are sorted and you don’t have to “go back”.

This algorithm is a simplified version, because it doesn't handle the case where the same value appears multiple times in both inputs (in other words, multiple matches). The real version is more complex just for that case, which is why I chose a simplified version.

If both relations are already sorted, the time complexity is O(N+M).

If both relations need to be sorted first, the time complexity is the cost of sorting them: O(N*log(N) + M*log(M)).

For computer geeks, here is a possible algorithm that handles multiple matches (note: I'm not 100% sure about this algorithm):

C

mergeJoin(relation a, relation b)
   relation output
   integer a_key:=0;
   integer b_key:=0;

   while (a[a_key]!=null and b[b_key]!=null)
      if (a[a_key] < b[b_key])
         a_key++;
      else if (a[a_key] > b[b_key])
         b_key++;
      else //Join predicate satisfied
         write_result_in_output(a[a_key],b[b_key])
         //We need to be careful when we increase the pointers
         if (a[a_key+1] != b[b_key])
            b_key++;
         end if
         if (b[b_key+1] != a[a_key])
            a_key++;
         end if
         if (b[b_key+1] == a[a_key] and b[b_key] == a[a_key+1])
            b_key++;
            a_key++;
         end if
      end if
   end while

Which algorithm is best?

If there were a best one, there wouldn't be so many types. This question is very difficult because many factors come into play, such as:

  • Free memory: If you don’t have enough memory, say goodbye to powerful hash joins (at least fully in-memory hash joins).
  • The size of two data sets. For example, if a large table joins a small table, a nested loop join is faster than a hash join because of the high cost of creating a hash. If both tables are very large, then the CPU cost of nested loop joins is very high.
  • The presence of indexes: with two B+ tree indexes, the smart choice seems to be the merge join.
  • Whether the result needs to be sorted: even if you're working with unsorted data sets, you may want to use a costly merge join (which sorts), because having a sorted result lets you chain it with another merge join (or because the query implicitly/explicitly asks for a sorted result via an ORDER BY/GROUP BY/DISTINCT operation).
  • Whether the relations are already sorted: in that case the merge join is the best candidate.
  • The type of join: is it an equijoin (i.e. tableA.col1 = tableB.col2)? Is it an inner join, an outer join, a cartesian product or a self-join? Some joins can't work in certain situations.
  • The distribution of the data: if the data on the join condition is skewed (for example, joining people on their last name when many people share the same last name), using a hash join would be a disaster because the hash function would create very unevenly filled buckets.
  • Whether you want the join to be executed by multiple threads/processes.

For more detailed information, read the documentation for DB2, ORACLE, or SQL Server.

Simplified example

We have examined three types of join operations.

Now, let’s say we want to join five tables to get all the information about a person. A person can have:

  • Multiple mobile numbers
  • A number of E-mails
  • Multiple addresses (ADRESSES)
  • Multiple bank accounts (BANK_ACCOUNTS)

In other words, we need the following kind of query to get our answer quickly (the table and column names below are illustrative):
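MySQL

-- illustrative schema: every table references PERSON through a PERSON_ID column
SELECT * FROM PERSON, MOBILES, MAILS, ADRESSES, BANK_ACCOUNTS
WHERE PERSON.PERSON_ID = MOBILES.PERSON_ID
AND PERSON.PERSON_ID = MAILS.PERSON_ID
AND PERSON.PERSON_ID = ADRESSES.PERSON_ID
AND PERSON.PERSON_ID = BANK_ACCOUNTS.PERSON_ID;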

As a query optimizer, I have to find the best way to handle the data. But there are two problems:

  • Which join type should I use for each join? I have 3 possible joins (hash, merge, nested loop), each of which may use 0, 1 or 2 indexes (not to mention that there are multiple types of indexes).
  • In what order should the joins be performed? For example, the following figure shows the possible execution plans for just 3 joins on 4 tables:

[Figure: possible execution plans for 3 joins on 4 tables]

So here’s what I might do:

  • 1) I can use a brute-force approach: using the database statistics, I compute the cost of every possible plan and keep the best one. But there are a lot of possibilities. For a given order of joins, each join can be a hash join, a merge join or a nested loop join, which gives 3^4 possibilities for that order. Determining the join order is a permutation problem on a binary tree, with (2*4)!/(4+1)! possible orders. For this rather simplified problem, I therefore end up with 3^4 * (2*4)!/(4+1)! possibilities. In non-geek terms, that's 27,216 possible plans. If I now allow the merge joins to use 0, 1 or 2 B+ tree indexes, the number of possible plans rises to 210,000. Did I mention that this query is actually quite simple?
  • 2) I can yell and quit this job, but then you wouldn't get your results, and I need money to pay the bills.
  • 3) I try only a few execution plans and pick the one with the lowest cost. Not being Superman, I can’t figure out the cost of all the plans. Instead, I can arbitrarily choose a subset of all possible plans, calculate their costs and give you the best plan.
  • 4) I can apply smart rules to reduce the number of possibilities. There are two kinds of rules. I can use "logical" rules that remove useless possibilities, but they won't filter out a large number of plans (for example: "the inner relation of a nested loop join must be the smallest data set"). Or I can accept not finding the best solution and apply more aggressive rules that drastically reduce the number of possibilities (for example: "if a relation is small, use a nested loop join and never a merge or hash join").

In this simple example, I ended up with a lot of possibilities. Real world queries also have relational operators like OUTER JOIN, CROSS JOIN, GROUP BY, ORDER BY, PROJECTION, UNION, INTERSECT, DISTINCT… That means more possibilities.

So how does the database work?

Dynamic programming, greedy algorithms and heuristic algorithms

Relational databases try many of the approaches I just mentioned, but the optimizer’s real job is to find a good solution in limited time.

Most of the time, the optimizer doesn't find the best solution but a "good" one.

For small queries, a brute-force approach is possible. And there is a way to avoid unnecessary computations so that even medium-sized queries can use the brute-force approach: dynamic programming.

Dynamic programming

The idea behind these words is that many execution plans are very similar. Take a look at these plans below:

[Figure: several execution plans sharing the same (A JOIN B) subtree]

They all share the same (A JOIN B) subtree. So, instead of computing the cost of this subtree in every plan, we can compute it once, save the result, and reuse it when we encounter the subtree again. More formally, we face an overlapping-subproblems structure, and to avoid recomputing partial results we use memoization.

For computer geeks, here is a sketch of the kind of algorithm given in the tutorial I mentioned earlier. I won't explain it, so read it only if you already know dynamic programming or are comfortable with algorithms (you've been warned):
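C

// A sketch of the classic memoized join-ordering search.
// bestplan is a memoization table indexed by a set of relations S.
procedure findbestplan(S)
   if (bestplan[S].cost is not infinite)   // already computed earlier
      return bestplan[S]
   end if
   if (S contains only 1 relation)
      set bestplan[S].plan and bestplan[S].cost based on the best way
      of accessing S  // using selections on S and indexes on S
   else
      for each non-empty subset S1 of S such that S1 != S
         P1 := findbestplan(S1)
         P2 := findbestplan(S - S1)
         A := best algorithm for joining the results of P1 and P2
         cost := P1.cost + P2.cost + cost of A
         if (cost < bestplan[S].cost)
            bestplan[S].cost := cost
            bestplan[S].plan := "execute P1.plan; execute P2.plan;
                                 join the results of P1 and P2 using A"
         end if
      end for
   end if
   return bestplan[S]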

For large queries, a dynamic programming approach can still be used, but with extra rules (or heuristics) to remove possibilities:

  • If we only analyze a certain type of plan (for example, left-deep trees, shown in the figure below), we end up with n*2^n possibilities instead of 3^n.

[Figure: a left-deep join tree]

  • If we add logical rules to avoid plans for certain patterns (like "if a table has an index for the given predicate, don't try a merge join on the table but only on the index"), we reduce the number of possibilities without hurting the best possible solution too much.
  • If we add rules to the process (like “join operations precede all other relational operations”), we can also reduce a lot of possibilities.
Greedy algorithm

But when the optimizer has to deal with a very large query, or in order to get an answer very quickly (even if the resulting query is not very fast), another type of algorithm is applied: the greedy algorithm.

The idea is to build the query plan incrementally, following a rule (or heuristic). With this rule, a greedy algorithm finds the best solution for one join first, then adds a new join at each step following the same rule.

Let's look at a simple example. Take a query with 4 joins on 5 tables (A, B, C, D and E). To simplify, we'll only consider nested loop joins, and we'll follow the rule "use the join with the lowest cost".

  • arbitrarily start with one of the 5 tables (say, A)
  • compute the cost of every join with A (with A as either the inner or the outer relation)
  • find that A JOIN B has the lowest cost
  • then compute the cost of every join with the result of "A JOIN B" (with it as either the inner or the outer relation)
  • find that "(A JOIN B) JOIN C" has the lowest cost
  • then compute the cost of every join with the result of "(A JOIN B) JOIN C"…
  • finally end up with the plan "(((A JOIN B) JOIN C) JOIN D) JOIN E"

Since we start arbitrarily with table A, we can apply the same algorithm to B, then C, then D, then E. Finally, retain the lowest cost execution plan.
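Here is a toy C sketch of this greedy search, with a deliberately naive cost model (the table cardinalities, the cost function and the size of the intermediate result are all made-up assumptions for illustration):

C

#include <stdio.h>

#define NUM_TABLES 5

/* toy cost model: a nested loop join of two results of n and m rows costs n*m */
static double join_cost(double n, double m) { return n * m; }

int main(void)
{
    const char *names[NUM_TABLES] = {"A", "B", "C", "D", "E"};
    double      rows[NUM_TABLES]  = {1000, 10, 500, 200, 50};   /* made-up cardinalities */
    int         joined[NUM_TABLES] = {0};

    int start = 0;                       /* arbitrarily start with table A */
    joined[start] = 1;
    double current_rows = rows[start];
    printf("%s", names[start]);

    /* greedy rule: at each step, add the table whose join with the current result is cheapest */
    for (int step = 1; step < NUM_TABLES; step++) {
        int    best = -1;
        double best_cost = 0;
        for (int t = 0; t < NUM_TABLES; t++) {
            if (joined[t]) continue;
            double c = join_cost(current_rows, rows[t]);
            if (best == -1 || c < best_cost) { best = t; best_cost = c; }
        }
        joined[best] = 1;
        printf(" JOIN %s", names[best]);
        /* crude assumption: the intermediate result keeps the size of its bigger input */
        current_rows = (current_rows > rows[best]) ? current_rows : rows[best];
    }
    printf("\n");   /* one possible output: A JOIN B JOIN E JOIN D JOIN C */
    return 0;
}

To mimic the paragraph above, you would rerun the same loop starting from each of the 5 tables and keep the cheapest overall plan.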

This algorithm, by the way, has a name: the nearest neighbor algorithm.

I won't go into the details, but with good modeling and an N*log(N) sort, this problem is easily solved. The complexity of this algorithm is O(N*log(N)), versus O(3^N) for the full dynamic programming version. If you have a big query with 20 joins, that means 26 versus 3,486,784,401, a big difference!

The problem with this algorithm is that it assumes that finding the best join between two tables, keeping this result, and then joining it with the next table will give the lowest overall cost. But:

  • even if A JOIN B has the lowest cost among A, B and C,
  • (A JOIN C) JOIN B might still have a better cost than (A JOIN B) JOIN C.

To improve this situation, you can use greedy algorithms based on different rules multiple times and keep the best execution plan.

Other algorithms

[If you’ve had enough of algorithms, skip to the next section. It’s not important for the rest of the article.] [Translator’s note: I want to skip this part too.]

Many computer science researchers are interested in finding the best execution plan. They often seek better solutions to specific problems or patterns, such as:

  • if the query is a star join (a certain type of multiple-join query), some databases use a specific algorithm.
  • if the query is a parallel query, some databases use a specific algorithm, and so on…

Other algorithms are also being studied to replace dynamic programming for large queries. Greedy algorithms belong to a larger family called heuristic algorithms: a greedy algorithm follows a rule (or heuristic), keeps the solution it found at the previous step and "appends" it to the solution of the current step. Some algorithms follow a rule step by step but don't always keep the best solution found at the previous step; together they are called heuristic algorithms.

Genetic algorithms, for example, work as follows:

  • a solution represents a possible full query plan
  • instead of one solution (i.e. plan), P solutions are kept at each step.
  • 0) P query plans are randomly created
  • 1) only the plans with the lowest costs are kept
  • 2) these best plans are mixed together to produce P new plans
  • 3) some of the new plans are randomly modified
  • 4) steps 1, 2 and 3 are repeated T times
  • 5) after the last loop, the best plan among the P plans is kept.

The more loops, the better the plan.
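
Below is a schematic Python sketch of the loop just described. The plan representation (a join order), the toy plan_cost fitness function and the crossover/mutation operators are all invented for illustration; a real optimizer would score complete plans with its own cost model.

    # Schematic sketch of the genetic loop above. A "plan" is just a join order
    # (a permutation of table names); plan_cost, crossover and mutate are toy
    # stand-ins for what a real optimizer would do.

    import random

    TABLES = ["A", "B", "C", "D", "E"]

    def plan_cost(plan):
        # Toy fitness function: pretend earlier positions should hold "smaller" tables.
        return sum(i * (ord(t) - ord("A") + 1) for i, t in enumerate(plan))

    def crossover(p1, p2):
        # Keep a prefix of p1 and complete it with p2's order.
        cut = random.randint(1, len(p1) - 1)
        head = p1[:cut]
        return head + [t for t in p2 if t not in head]

    def mutate(plan):
        i, j = random.sample(range(len(plan)), 2)
        plan[i], plan[j] = plan[j], plan[i]

    P, T = 8, 20
    population = [random.sample(TABLES, len(TABLES)) for _ in range(P)]   # step 0
    for _ in range(T):                                                    # step 4
        best = sorted(population, key=plan_cost)[: P // 2]                # step 1
        children = [crossover(random.choice(best), random.choice(best))
                    for _ in range(P - len(best))]                        # step 2
        for child in children:
            if random.random() < 0.2:
                mutate(child)                                             # step 3
        population = best + children
    print(min(population, key=plan_cost))                                 # step 5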

Is this magic? No, it’s the law of nature: survival of the fittest!

PostgreSQL implements genetic algorithms, but I can’t find out if it uses them by default.

Other heuristic algorithms are used in databases, such as Simulated Annealing, Iterative Improvement and Two-Phase Optimization. However, I don’t know whether these algorithms are currently used in enterprise databases or only in research databases.

If you want to learn more, you can read this research paper, which presents two more possible algorithms: “A survey of algorithms for the join ordering problem in database query optimization”.

Real optimizer

[This paragraph is not important and can be skipped]

All of the above is very theoretical. I’m a developer, not a researcher, and I like concrete examples.

Let’s take a look at how the SQLite optimizer works. This is a lightweight database that uses a simple optimizer, based on greedy algorithms with additional rules, to limit the number of possibilities.

  • SQLite never reorders tables in a CROSS JOIN
  • It uses nested loop joins
  • Outer joins are always evaluated in the order in which they are written
  • Versions prior to 3.8.0 used the “nearest neighbor” greedy algorithm to find the best query plan. We’ve seen this algorithm before — what a coincidence!
  • Since version 3.8.0 (released in 2013), SQLite uses the “N nearest neighbors” greedy algorithm to search for the best query plan

Let’s look at another optimizer in action. IBM DB2 is similar to all enterprise databases, and I discuss it because it was the last database I really used before switching to big data.

After reviewing the official documentation, we learned that the DB2 optimizer allows you to use seven levels of optimization:

  • Use greedy algorithms for the joins:
    • 0 — minimal optimization: index scans and nested loop joins, avoiding some query rewriting
    • 1 — low optimization
    • 2 — full optimization
  • Use dynamic programming for the joins:
    • 3 — moderate optimization and rough approximation
    • 5 — full optimization, using all techniques with heuristics
    • 7 — full optimization, similar to level 5 but without heuristics
    • 9 — maximal optimization, regardless of overhead, considering all possible join orders including Cartesian products

You can see that DB2 uses greedy algorithms and dynamic programming algorithms. Of course, they don’t share their heuristic algorithms, because the query optimizer is the database’s bread and butter.

DB2’s default level is 5, and the optimizer uses the following features:

  • Uses all available statistics, including frequent-value and quantile statistics
  • Uses all query rewrite rules (including materialized query table routing), except computationally intensive rules that apply only in rare cases
  • Uses dynamic programming for join enumeration, with:
    • limited use of composite inner relations
    • limited use of Cartesian products for star schemas involving lookup tables
  • Considers a wide range of access methods, including list prefetch, index ANDing, and materialized query table routing

By default, DB2 uses a dynamic programming algorithm constrained by heuristics for join permutations.

The other operations (GROUP BY, DISTINCT…) are handled by simple rules.

Query plan cache

Because creating a query plan takes time, most databases keep the plan in a query plan cache to avoid recomputing it. This is quite a big topic in itself, because the database needs to know when a cached plan has become outdated. The idea is to set a threshold: if the statistics of a table change beyond that threshold, the query plans involving that table are purged from the cache.
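
As an illustration of the idea only (not of any particular database’s implementation), here is a toy Python sketch of a plan cache that throws a plan away once a table’s row count has drifted past a threshold; the 10% threshold, the statistics format and the plan objects are all made up.

    # Toy sketch of a query plan cache invalidated by statistics drift.
    # The 10% threshold, the stats and the plan objects are invented for illustration.

    class PlanCache:
        def __init__(self, drift_threshold=0.10):
            self.threshold = drift_threshold
            self.plans = {}    # query text -> (plan, {table: row_count at plan time})

        def put(self, query, plan, table_stats):
            self.plans[query] = (plan, dict(table_stats))

        def get(self, query, current_stats):
            entry = self.plans.get(query)
            if entry is None:
                return None
            plan, old_stats = entry
            for table, old_rows in old_stats.items():
                drift = abs(current_stats[table] - old_rows) / max(old_rows, 1)
                if drift > self.threshold:      # stats changed too much:
                    del self.plans[query]       # the plan is considered stale
                    return None
            return plan

    cache = PlanCache()
    cache.put("SELECT * FROM PERSON", "index scan on PERSON", {"PERSON": 100_000})
    print(cache.get("SELECT * FROM PERSON", {"PERSON": 102_000}))  # still valid
    print(cache.get("SELECT * FROM PERSON", {"PERSON": 500_000}))  # None: purged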

Query executor

At this stage, we have an optimized execution plan, which is compiled into executable code. Then, if there are enough resources (memory, CPU), the query executor runs it. The operators in the plan (JOIN, SORT BY…) can be executed sequentially or in parallel; that is up to the executor. To read and write data, the query executor interacts with the data manager, which is the topic of the next part of this article.

Data manager


In this step, the query manager executes the query and needs data from the tables and indexes, so it asks the data manager for that data. But there are two problems:

  • Relational databases use a transactional model, so you can’t get data that someone else is using or modifying at that very moment.
  • Data retrieval is the slowest operation in a database, so the data manager needs to be smart enough to fetch data and keep it in memory buffers.

In this part, we’ll see how relational databases handle these two problems. I won’t go into how the data manager actually fetches its data, because that is not the most important part (and this article is already long enough!).

Cache manager

As I said, the main bottleneck for databases is disk I/O. To improve performance, modern databases use a cache manager.


The query executor does not take data directly from the file system, but asks for it from the cache manager. The cache manager has an in-memory cache, called the buffer pool, from which data can be read to significantly improve database performance. It’s hard to put an order of magnitude on this, because it depends on what kind of operation you want:

  • Sequential access (e.g., full scan) vs. random access (e.g., access by row ID)
  • To read or write

And the disk type used by the database:

  • 7.2K / 10K / 15K RPM disk
  • SSD
  • RAID 1/5 /…

I’d say memory is a hundred to a hundred thousand times faster than disk.

However, this leads to another problem (databases always do…). The cache manager needs to get the data before the query executor uses it, otherwise the query manager has to wait for the data to be read from a slow disk.

Prefetching

This problem is called prefetching. The query executor knows what data it will need, because it knows the whole flow of the query and it also knows the data on disk through the statistics. Here is how it works:

  • The query executor processes its first batch of data
  • It asks the cache manager to pre-load the second batch
  • When it starts processing the second batch
  • It tells the cache manager to pre-load the third batch, and that the first batch can be purged from the cache.

The cache manager stores all of this data in the buffer pool. To determine whether a piece of data is useful, the cache manager adds additional information (called latches) to the cached data.

Sometimes the query executor does not know what data it needs, and some databases do not provide this capability. Instead, they use either a speculative prefetch (for example, if the query executor wants data 1, 3, 5, it is likely to ask for 7, 9, 11 soon) or a sequential prefetch (where the cache manager simply reads a batch of data and then loads the next batch of continuous data from disk).
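
Here is a deliberately simplified, synchronous Python sketch of sequential prefetching: while the executor works on batch N, batch N+1 is already in the buffer pool. The batch size and the simulated disk read are invented; a real database would do the read-ahead asynchronously.

    # Minimal sketch of sequential prefetching: while the executor works on batch N,
    # the next batch is already loaded into the buffer pool. Disk reads, batch size
    # and "processing" are simulated.

    from collections import deque

    def read_batch_from_disk(batch_id):
        # Simulated slow disk read.
        return [f"row-{batch_id}-{i}" for i in range(4)]

    def scan(nb_batches):
        buffer_pool = deque()
        buffer_pool.append(read_batch_from_disk(0))        # load the first batch
        for batch_id in range(nb_batches):
            if batch_id + 1 < nb_batches:                  # pre-load the next batch
                buffer_pool.append(read_batch_from_disk(batch_id + 1))
            current = buffer_pool.popleft()                # the previous batch can be purged
            for row in current:                            # the executor processes the batch
                yield row

    for row in scan(3):
        print(row)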

To monitor prefetch performance, modern databases introduce a measure called buffer/cache hit ratio, which shows how often requested data is found in the cache rather than read from disk.

Note: A poor cache hit ratio does not always mean that the cache is working badly. Read the Oracle documentation for more information.

A buffer is only a limited amount of memory, so it needs to evict some data in order to load new data. Loading and purging the cache has a cost in disk and network I/O. If a query is executed frequently, it would not be efficient to load and purge the data it uses every time. To handle this, modern databases use a buffer replacement strategy.

Buffer replacement strategy

Most modern databases (at least SQL Server, MySQL, Oracle, and DB2) use the LRU algorithm.

LRU

LRU stands for Least Recently Used. The idea behind LRU is to keep in the cache the data that has been used recently, because it is more likely to be used again.

Illustration:


For better understanding, I’ll assume that the data in the buffer is not locked by the latch (that is, can be removed). In this simple example, the buffer can hold three elements:

  • 1: The cache manager (CM for short) fetches data 1 and puts it into the empty buffer
  • 2: CM fetches data 4 and puts it into the half-filled buffer
  • 3: CM fetches data 3 and puts it into the half-filled buffer
  • 4: CM uses data 9. The buffer is full, so data 1 is evicted because it is the least recently used; data 9 is added to the buffer
  • 5: CM uses data 4. Data 4 is already in the buffer, so it becomes the most recently used again
  • 6: CM uses data 1. The buffer is full, so data 3 is evicted because it is now the least recently used; data 1 is added to the buffer
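
The walkthrough above maps to a very small amount of code. Here is a minimal Python sketch of an LRU buffer with a capacity of 3, replaying the same access sequence; a real buffer pool obviously stores pages and handles latches, not toy strings.

    # A minimal LRU buffer with capacity 3, mirroring the walkthrough above.
    # OrderedDict keeps insertion order; move_to_end marks a page as recently used.

    from collections import OrderedDict

    class LRUBuffer:
        def __init__(self, capacity=3):
            self.capacity = capacity
            self.pages = OrderedDict()

        def use(self, page_id):
            if page_id in self.pages:
                self.pages.move_to_end(page_id)              # already cached: most recently used
                return
            if len(self.pages) >= self.capacity:
                evicted, _ = self.pages.popitem(last=False)  # evict the least recently used
                print(f"evict page {evicted}")
            self.pages[page_id] = "data"

    buf = LRUBuffer()
    for page in (1, 4, 3, 9, 4, 1):   # same access sequence as the example
        buf.use(page)
        print(list(buf.pages))        # the most recently used page is on the right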

This algorithm works well, but it has some limitations. What if a full scan is performed on a large table? In other words, what happens when the size of the table/index exceeds the size of the buffer? With plain LRU, the scan would evict everything previously in the cache, even though the scanned data is likely to be used only once.

Improvements

To prevent this, some databases add special rules, such as those described in the Oracle documentation:

“For very large tables, the database typically uses a direct path read, which loads blocks directly […], to avoid populating the buffer cache. For medium-sized tables, the database may choose a direct read or a cache read. If it decides on a cache read, the database places the blocks at the end of the LRU list to prevent the scan from effectively cleaning out the buffer cache.”

There are other possibilities, such as an advanced version of LRU called LRU-K. For example, SQL Server uses LRU-K with K = 2.

The idea is to take more of the history into account. With simple LRU (i.e., LRU-1), only the last use of a piece of data is considered. With LRU-K:

  • It takes into account the K most recent uses of a piece of data
  • A weight is derived from the number of times the data has been used
  • When a batch of new data is loaded into the cache, old but frequently used data is not evicted (because its weight is higher)
  • But the algorithm cannot keep data in the cache if it is no longer used
  • So the weight decreases over time when the data is not used anymore

Computing the weight is costly, which is why SQL Server only uses K = 2: this value performs well for an acceptable overhead.

For more in-depth knowledge of LRU-K, you can read the original research paper (1993): “The LRU-K Page Replacement Algorithm for Database Disk Buffering”.
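
Here is a simplified Python sketch of the LRU-2 idea only: each page remembers its last two access times, and the victim is the page whose second-to-last access is the oldest (pages touched only once, like a scan, are evicted first). It deliberately ignores the weighting and aging details mentioned above.

    # Simplified sketch of LRU-2: remember the last 2 access times of each page and
    # evict the page whose 2nd most recent access is the oldest.

    import math

    class LRU2Buffer:
        def __init__(self, capacity):
            self.capacity = capacity
            self.history = {}   # page_id -> (time of last access, time of the access before that)
            self.clock = 0

        def use(self, page_id):
            self.clock += 1
            if page_id in self.history:
                last, _ = self.history[page_id]
                self.history[page_id] = (self.clock, last)
                return
            if len(self.history) >= self.capacity:
                # Evict the page with the oldest 2nd most recent access
                # (pages accessed only once count as -infinity, so scans go first).
                victim = min(self.history, key=lambda p: self.history[p][1])
                del self.history[victim]
            self.history[page_id] = (self.clock, -math.inf)

    buf = LRU2Buffer(capacity=3)
    for page in (1, 1, 2, 2, 3, 4):   # pages 3 and 4 are touched only once (like a scan)
        buf.use(page)
    print(sorted(buf.history))        # the frequently used pages 1 and 2 survive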

Other algorithms

There are other algorithms for managing caches, such as:

  • 2Q (an LRU-K-like algorithm)
  • CLOCK (an LRU-K-like algorithm)
  • MRU (Most Recently Used: the same logic as LRU but with a different rule)
  • LRFU (Least Recently and Frequently Used)

Write buffer

So far I have only talked about read buffers, which load data before it is used. But databases also have write buffers, which store data and flush it to disk in batches, instead of writing each piece of data one by one and producing many single disk accesses.

Keep in mind that buffers hold pages (the smallest unit of data), not rows (which is the logical, human way of looking at data). A page in the buffer pool that has been modified but not yet written to disk is called a dirty page. There are many algorithms for deciding the best time to write dirty pages to disk, but this question is closely related to the notion of transactions, which we’ll talk about next.

Transaction manager

Last but not least: the transaction manager. We’ll see how this process ensures that each query is executed in its own transaction. But before we begin, we need to understand the concept of an ACID transaction.

“I’m on acid”

An ACID transaction is a unit of work that guarantees four properties:

  • Atomicity: A transaction is “either all done or all cancelled”, even if it runs for 10 hours. If the transaction crashes, the state goes back to before the transaction (transaction rollback).
  • Isolation: If two transactions A and B are running at the same time, the final result of transaction A and B is the same regardless of whether A ends before/after/during B.
  • Durability: Once a transaction commits (i.e., executes successfully), data is saved in the database regardless of what happens (crashes or errors).
  • Consistency: Only valid data (according to relational and function constraints) can be written to the database. Consistency is related to atomicity and isolation.


You can run multiple SQL queries that read, create, update and delete data within the same transaction. The trouble starts when two transactions use the same data. The classic example is a money transfer from account A to account B. Suppose there are two transactions:

  • Transaction 1 (T1) takes $100 from account A and gives it to account B
  • Transaction 2 (T2) takes $50 from account A and gives it to account B

Let’s go back to the ACID property:

  • Atomicity ensures that no matter what happens during T1 (a server crash, a network outage…), you cannot end up in a state where $100 has been withdrawn from account A but not given to account B (that would be an inconsistent state).
  • Isolation ensures that if T1 and T2 run at the same time, in the end A will have lost $150 and B will have received $150, and not, for instance, A losing $150 while B receives only $50 because T2 partially erased the actions of T1 (which would also be an inconsistent state).
  • Durability ensures that T1 does not vanish into thin air if the database crashes right after T1 commits.
  • Consistency ensures that no money is created or destroyed in the system.

[The following parts are not important and can be skipped]

Modern databases do not use pure isolation as the default mode, because it comes with a huge performance penalty. The SQL standard defines four levels of isolation:

  • Serializable (the default mode of SQLite): the highest level of isolation. Two simultaneous transactions are 100% isolated; each transaction has its own “world”.
  • Repeatable read (the default mode of MySQL): each transaction has its own “world”, except in one situation. If a transaction ends successfully after adding new data, that data is visible to the other transactions that are still running. But if a transaction successfully modifies existing data, the modification is not visible to the transactions that are still running. So the breach of isolation between transactions only concerns new data; existing data stays isolated. For example, if transaction A runs “SELECT count(1) from TABLE_X” and then transaction B adds a new row to TABLE_X and commits, transaction A will get a different value when it runs count(1) again. This is called a phantom read.
  • Read committed (the default mode of Oracle, PostgreSQL and SQL Server): repeatable read plus a new breach of isolation. If transaction A reads data D, and D is then modified (or deleted) and committed by transaction B, then when A reads D again it sees the modification (or the deletion). This is called a non-repeatable read.
  • Read uncommitted: the lowest level of isolation, read committed plus a new breach of isolation. If transaction A reads data D, and D is then modified by a transaction B that has not committed and is still running, then when A reads D again it sees the modified value. If B rolls back, the value A read the second time makes no sense, since it was produced by a modification that never actually happened (B was rolled back). This is called a dirty read.

Most databases add custom isolation levels (such as PostgreSQL, Oracle, SQL Server snapshot isolation) and do not implement all the levels in the SQL specification (especially the read uncommitted level).

The default isolation level can be overridden by the user/developer when establishing a connection (with a simple addition of one line of code).
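
For example, with SQLite (via Python’s standard library) a single extra line when the connection is opened changes the isolation behaviour of that connection. On other databases, the standard statement is SET TRANSACTION ISOLATION LEVEL …, but the accepted level names and exact syntax vary, so check the documentation of your database.

    # A concrete example with SQLite, from Python's standard library. One line at
    # connection time changes the isolation behaviour of this connection. Other
    # databases use the standard statement
    #   SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
    # (or a session-level variant) -- check your database's documentation.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA read_uncommitted = true;")   # ask SQLite for read-uncommitted reads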

Concurrency control

The real issue when ensuring isolation, consistency and atomicity is concurrent writes to the same data (inserts, updates, deletes):

  • If all transactions just read data, they can work simultaneously without changing the behavior of the other transaction.
  • If (at least) one transaction is modifying data read by other transactions, the database needs to find a way to hide the changes from the other transactions. Furthermore, it needs to ensure that this modification operation is not erased by another transaction that cannot see these data modifications.

This problem is called concurrency control.

The simplest solution would be to run the transactions one after another (i.e., serially), but that does not scale at all: only one core would be working on a multi-processor/multi-core server. Not very efficient.

Ideally, each time a transaction is created or cancelled:

  • Monitor all operations for all transactions
  • Check if parts of two (or more) transactions conflict because they read/modify the same data
  • Rearrange operations in conflicting transactions to reduce conflicting portions
  • Execute the conflicting parts in a certain order (while non-conflicting transactions are still running concurrently)
  • Consider the possibility that the transaction will be cancelled

More formally, this is a scheduling problem with conflicting schedules. Concretely, it is a very difficult and CPU-expensive optimization problem. Enterprise databases cannot afford to spend hours finding the best schedule for every new transaction, so they use less-than-ideal approaches, at the price of some extra time lost between conflicting transactions.

Lock manager

To solve this problem, most databases use locking and/or data versioning. This is a big topic, and I’m going to focus on locking, and a little bit of data versioning.

Pessimistic locking

The principle is:

  • If a transaction needs a piece of data
  • It locks the data
  • If another transaction also needs this data
  • It has to wait until the first transaction releases the data. This kind of lock is called an exclusive lock.

But using an exclusive lock on a transaction that only reads data is expensive because it forces other transactions that only need to read the same data to wait. Hence another type of lock, the shared lock.

A shared lock looks like this:

  • If a transaction only needs to read data A
  • It places a “shared lock” on data A and reads it
  • If a second transaction also only needs to read data A
  • It places a “shared lock” on data A and reads it as well
  • If a third transaction needs to modify data A
  • It places an “exclusive lock” on data A, but must wait until the other two transactions release their shared locks.

Similarly, if an exclusive lock is placed on a block of data, a transaction that only needs to read the data must wait for the exclusive lock to be released before a shared lock can be placed on the data.
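
To make the shared/exclusive rules concrete, here is a toy Python lock table: any number of shared locks can coexist on a piece of data, while an exclusive lock is incompatible with everything else. Real lock managers also queue the waiting transactions, upgrade locks and handle different granularities.

    # Toy lock table implementing the compatibility rules above: shared locks can
    # coexist, an exclusive lock is incompatible with any other lock.

    class LockManager:
        def __init__(self):
            self.locks = {}   # data_id -> {"mode": "S" or "X", "owners": set of transaction ids}

        def acquire(self, txn, data_id, mode):
            """Return True if the lock is granted, False if the transaction must wait."""
            entry = self.locks.get(data_id)
            if entry is None:
                self.locks[data_id] = {"mode": mode, "owners": {txn}}
                return True
            if mode == "S" and entry["mode"] == "S":     # shared + shared: compatible
                entry["owners"].add(txn)
                return True
            return False                                 # anything involving "X": must wait

        def release(self, txn, data_id):
            entry = self.locks.get(data_id)
            if entry and txn in entry["owners"]:
                entry["owners"].discard(txn)
                if not entry["owners"]:
                    del self.locks[data_id]

    lm = LockManager()
    print(lm.acquire("T1", "row-42", "S"))   # True: first reader
    print(lm.acquire("T2", "row-42", "S"))   # True: shared locks coexist
    print(lm.acquire("T3", "row-42", "X"))   # False: must wait for the readers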


The lock manager is the process that grants and releases locks. Internally, it stores the locks in a hash table (where the key is the piece of data being locked), and for each piece of data it knows:

  • which transactions are holding a lock on it
  • which transactions are waiting for it

Deadlock

But using locks can lead to a situation where two transactions are forever waiting for a piece of data:

[Figure: two transactions, each holding a lock and waiting for the other’s]

In this figure:

  • Transaction A holds an exclusive lock on data 1 and is waiting for data 2
  • Transaction B holds an exclusive lock on data 2 and is waiting for data 1

This is called a deadlock.

When a deadlock occurs, the lock manager chooses which transaction to cancel (roll back) in order to break the deadlock. This is not an easy decision:

  • Kill the transaction that has modified the least data (this produces the cheapest rollback)?
  • Kill the youngest transaction, because the users of the other transactions have waited longer?
  • Kill the transaction that needs the least time to finish (and avoid a possible starvation)?
  • In case of rollback, how many transactions will be impacted by the rollback?

Before making a selection, the lock manager needs to check for deadlocks.

The hash table can be seen as a graph (as in the figure above), and a cycle in that graph indicates a deadlock. Since checking for cycles is expensive (the graph with all the locks is quite big), a simpler approach is often used: a timeout. If a lock cannot be acquired within the timeout, the transaction is considered to be in a deadlock state.

The lock manager can also check, before granting a lock, whether that lock would create a deadlock. But doing this perfectly is computationally expensive, so these pre-checks are usually based on a set of basic rules.
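
Here is a small Python sketch of the cycle check: the “who waits for whom” information is viewed as a directed graph, and a cycle in that graph means a deadlock. The example graph encodes the situation in the figure (A waits for B and B waits for A).

    # Sketch of deadlock detection: a cycle in the "waits for" graph means a deadlock.

    def has_cycle(waits_for):
        """Depth-first search for a cycle in a 'transaction -> transactions it waits for' graph."""
        visiting, done = set(), set()

        def visit(txn):
            if txn in visiting:
                return True              # we came back to a transaction on the current path
            if txn in done:
                return False
            visiting.add(txn)
            for other in waits_for.get(txn, ()):
                if visit(other):
                    return True
            visiting.remove(txn)
            done.add(txn)
            return False

        return any(visit(t) for t in waits_for)

    print(has_cycle({"A": ["B"], "B": ["A"]}))   # True: deadlock
    print(has_cycle({"A": ["B"], "B": []}))      # False: B will finish, then A can proceed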

Two-phase locking

The simplest way to achieve pure isolation is to acquire all the locks at the beginning of a transaction and release them at the end: a transaction has to wait until it has obtained all of its locks before it starts, and it releases the locks it holds when it ends. This works, but a lot of time is wasted waiting for all the locks.

A faster approach is the two-phase locking protocol (used by DB2 and SQL Server), where a transaction is divided into two phases:

  • Growth phase: transactions can acquire locks, but cannot release them.
  • Shrink phase: Transactions can release locks (for data that has already been processed and will not be processed again), but cannot acquire new locks.


The principle behind these two simple rules is:

  • Release locks that are no longer in use to reduce wait time for other transactions
  • Prevent the occurrence of inconsistencies when the data originally acquired by a transaction is modified after the transaction has started.

This rule works fine, with one exception: if a transaction is cancelled (rolled back) after it has modified a piece of data and released the corresponding lock, another transaction may have already read a value that is then rolled back. To avoid this problem, all exclusive locks must be held until the end of the transaction.

A few more words

Of course, real databases use more sophisticated systems, involving more types of locks (such as intention locks) and more granularities (row-level locks, page-level locks, partition locks, table locks, tablespace locks), but the principle remains the same.

I will only explore purely lock-based approaches; data versioning is another approach to this problem.

Version control looks like this:

  • Each transaction can modify the same data at the same time
  • Each transaction has its own copy (or version) of the data
  • If two transactions modify the same data and only one modification is accepted, the other will be rejected and the related transaction will be rolled back (or re-run)

This will improve performance because:

  • Read transactions do not block write transactions
  • Write transactions do not block reads
  • No extra overhead of a “bloated and slow” lock manager

Data versioning performs better than locking in every respect, except when two transactions write the same data. Moreover, you can quickly end up with a huge disk space overhead.

Data versioning and locking are two different visions of the same problem: optimistic locking versus pessimistic locking. Both have pros and cons; it all depends on the use case (more reads or more writes). For data versioning, I recommend this very good article on how PostgreSQL implements multiversion concurrency control.
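
To illustrate the principle only (not PostgreSQL’s actual implementation), here is a toy Python sketch of versioning with a “first committer wins” rule: readers never block, and the second of two concurrent writers to commit is rejected and must roll back or retry.

    # Toy sketch of data versioning: writers create new versions instead of locking,
    # readers never block, and when two transactions try to overwrite the same version
    # the second committer is rejected ("first committer wins").

    class VersionedValue:
        def __init__(self, value):
            self.version = 1
            self.value = value

        def read(self):
            # Readers just take the current committed version: no lock, no waiting.
            return self.version, self.value

        def commit_write(self, based_on_version, new_value):
            if based_on_version != self.version:
                return False          # someone committed in between: roll back / retry
            self.version += 1
            self.value = new_value
            return True

    balance = VersionedValue(100)

    v_t1, val_t1 = balance.read()     # T1 reads version 1
    v_t2, val_t2 = balance.read()     # T2 reads version 1 concurrently

    print(balance.commit_write(v_t1, val_t1 - 100))  # True:  T1 commits, version becomes 2
    print(balance.commit_write(v_t2, val_t2 - 50))   # False: T2 must roll back and retry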

Some databases, such as DB2 (up to version 9.7) and SQL Server (except for snapshot isolation), use only locks. Others, like PostgreSQL, MySQL and Oracle, use a hybrid approach involving both locks and data versioning. I’m not aware of a database that uses versioning only (let me know if you are).

A reader told me:

Firebird and InterBase use versioning without record locking.

Versioning has an interesting effect on indexes: sometimes a unique index contains duplicates, the index can have more entries than the table has rows, and so on.

If you have read the section on different levels of isolation, you know that increasing the isolation level increases the number of locks and the time a transaction waits to be locked. This is why most databases do not use the highest level of isolation (serialization) by default.

Of course, you can always look it up in the documentation of a major database (like MySQL, PostgreSQL or Oracle).

Log manager

We already know that to improve performance, databases store data in memory buffers. But if the server crashes when the transaction commits, the data that is still in memory at the time of the crash is lost, which breaks the persistence of the transaction.

You can write all the data to disk, but if the server crashes, only part of the data may end up being written to disk, breaking the atomicity of the transaction.

Any modification made by a transaction must be either fully undone or fully completed.

There are two ways to solve this problem:

  • Shadow copies/pages: each transaction creates its own copy of the database (or of a part of it) and works on that copy. In case of error, the copy is discarded; in case of success, the database instantly swaps the copy in with a filesystem trick, then throws away the “old” data.
  • Transaction log: a transaction log is a storage space; before each write to disk, the database writes information in the transaction log, so that in case of a crash or a rollback it knows how to remove or finish the unfinished transaction.

WAL (write-ahead logging)

Shadow copies/pages create a huge disk overhead when used on large databases with many transactions, which is why modern databases use a transaction log. The transaction log must be stored on stable storage. I won’t dig into storage technologies, but using (at least) RAID disks is mandatory to protect against disk failure.

Most databases (at least Oracle, SQL Server, DB2, PostgreSQL, MySQL, and SQLite) handle the transaction log using the Write-Ahead Logging protocol (WAL). The WAL protocol has three rules:

  • 1) Each modification to the database generates a log record, which must be written to the transaction log before the data is written to disk (a minimal sketch of this rule follows the list).
  • 2) Log records must be written in order; if record A happens before record B, A must be written before B.
  • 3) When a transaction is committed, the commit record must be written to the transaction log before the commit is considered successful.
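
Here is a minimal Python sketch of rule 1 only: the log record describing a change is appended and flushed to the log before the modified data may reach the data file. The file names and the record format are invented for the example.

    # Minimal sketch of WAL rule 1: the log record describing a modification is
    # appended to the transaction log (and flushed) BEFORE the modified data may
    # be written to the data file. File names and record format are invented.

    import json

    LOG_FILE, DATA_FILE = "wal.log", "data.pages"

    def write_page(page_id, new_content, lsn):
        # Rule 1: log first...
        with open(LOG_FILE, "a") as log:
            log.write(json.dumps({"lsn": lsn, "page": page_id, "redo": new_content}) + "\n")
            log.flush()
        # ...only then is the data allowed to reach the data file.
        with open(DATA_FILE, "a") as data:
            data.write(f"{page_id}:{new_content}\n")

    write_page(page_id=42, new_content="AGE=18", lsn=1)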


This is the job of the log manager. A simple way to picture it is that the log manager sits between the cache manager and the data access manager (which writes data to disk): every UPDATE/DELETE/CREATE/COMMIT/ROLLBACK operation is written to the transaction log before it is written to disk. Easy, right?

Wrong answer! After everything we have covered, you should know by now that everything related to databases is cursed by the “database effect”. Seriously, the problem is finding a way to write the log while keeping good performance: if writing the transaction log is too slow, it slows everything down.

ARIES

In 1992, IBM researchers “invented” an enhanced version of WAL called ARIES. ARIES is used more or less in most modern databases; the exact logic may differ, but the concepts behind ARIES are everywhere. I put “invented” in quotes because, according to an MIT course, the IBM researchers did little more than write down the good practices of transaction recovery. Since the ARIES paper was published when I was 5 years old, I don’t care about this old gossip from grumpy researchers. In fact, I only mention this anecdote to lighten the mood before we tackle the last technical section. I have read a large part of the ARIES paper and found it very interesting. I’ll only give a brief overview of ARIES here; if you want real knowledge, I strongly recommend reading the paper.

ARIES stands for Algorithms for Recovery and Isolation Exploiting Semantics.

The technology has a twofold goal:

  • 1) Log while maintaining good performance
  • 2) Fast and reliable data recovery

There are several reasons why a database has to roll back a transaction:

  • Because the user cancels
  • Server or network failure
  • Because transactions violate database integrity (such as a column with a unique constraint and a transaction adding duplicate values)
  • Because of a deadlock

In some cases (such as a network failure), the database can resume transactions.

How is that possible? To answer this question, we need to understand the information stored in logs.

The log

Each transaction operation (add/delete/change) generates a log, which consists of the following contents:

  • LSN: a unique Log Sequence Number. The LSN is assigned in chronological order *, which means that if operation A precedes operation B, the LSN of log A is smaller than the LSN of log B.
  • TransID: ID of the transaction that generates the operation.
  • PageID: indicates the position of the modified data on the disk. The smallest unit of disk data is a page, so the location of the data is the location of the page on which it is located.
  • PrevLSN: Link to the last log record generated by the same transaction.
  • UNDO: a way to cancel the operation. For example, if the operation is an update, the UNDO stores either the value/state of the updated element before the update (physical UNDO) or the reverse operation needed to return to the previous state (logical UNDO) **.
  • REDO: a way to replay the operation. Likewise, there are two options: either store the value/state of the element after the operation, or store the operation itself so it can be replayed.
  • … (for your information, an ARIES log record also has two other fields: UndoNxtLSN and Type).

Further, each page on disk (the one that holds data, not the one that holds logs) records the last LSN to modify that data.

*LSN allocation is actually more complicated because it relates to how logs are stored. But the idea is the same.

** ARIES only uses logical UNDO because dealing with physical UNDO is too messy.

Note: As far as I know, PostgreSQL is the only one that doesn’t use UNDO and instead uses a garbage collection service to delete older versions of data. This has to do with PostgreSQL’s implementation of data versioning.
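
The fields above translate naturally into a small data structure. Here is an illustrative Python version, using simple “value before”/“value after” strings as the UNDO/REDO payloads and mirroring the UPDATE example that follows; ARIES itself would use logical UNDO information.

    # The log record fields described above, as a small data structure.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class LogRecord:
        lsn: int                 # unique, monotonically increasing log sequence number
        trans_id: int            # transaction that produced the operation
        page_id: int             # page on disk that was modified
        prev_lsn: Optional[int]  # previous log record of the same transaction (None if first)
        undo: str                # how to cancel the operation (value/state before)
        redo: str                # how to replay the operation (value/state after)

    # Two records produced by the same transaction, chained through prev_lsn:
    r1 = LogRecord(lsn=1, trans_id=18, page_id=40, prev_lsn=None, undo="AGE=23", redo="AGE=18")
    r2 = LogRecord(lsn=2, trans_id=18, page_id=41, prev_lsn=1,    undo="AGE=34", redo="AGE=18")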

To better illustrate this, here is a simple, visual example of the log records generated by the query “UPDATE FROM PERSON SET AGE = 18;”. We assume this query is executed by transaction 18.

[Figure: the log records generated by the query]

Each log has a unique LSN, and logs linked together belong to the same transaction. Logs are linked in chronological order (the last log in the linked list is generated by the last operation).

Log buffer

To prevent logging from becoming a major bottleneck, the database uses log buffers.


When the query executor asks for a change:

  • 1) The cache manager stores the changes in its own buffer;
  • 2) The log manager stores the related logs in its own buffer;
  • 3) At this point, the query executor considers the operation complete (so it can request another change);
  • 4) Then (soon) the log manager writes the log to the transaction log, and an algorithm decides when.
  • 5) Then (soon) the cache manager writes the changes to disk, and an algorithm decides when.

When a transaction is committed, it means that steps 1, 2, 3, 4 and 5 have been completed for every operation of the transaction. Writing to the transaction log is fast, since it is just “adding a record somewhere in the transaction log”, whereas writing data to disk is more complicated, because the data has to be written “in a way that makes it fast to read”.

STEAL and FORCE policies

For performance reasons, step 5 might be done after the commit, because in case of a crash it is still possible to recover the transaction with the REDO logs. This is called a NO-FORCE policy.

The database can instead choose a FORCE policy (i.e., step 5 must be completed before the commit) to lower the workload during recovery.

Another issue is whether the data is written to the data files step by step (STEAL policy) or whether the buffer manager waits until the commit to write everything at once (NO-STEAL policy). The choice between STEAL and NO-STEAL depends on what you want: fast writes with a slower recovery using UNDO logs, or a fast recovery?

To summarize the impact of these strategies on recovery:

  • STEAL/NO-FORCE requires UNDO and REDO: the best performance, but more complex logging and recovery processes (like ARIES). This is what most databases choose. Note: I have seen this in several academic papers and tutorials, but not stated explicitly in official documentation.
  • STEAL/FORCE only requires UNDO.
  • NO-STEAL/NO-FORCE only requires REDO.
  • NO-STEAL/FORCE requires nothing: the worst performance, and a huge amount of memory is needed.

Recovery

Ok, we have nice logs, let’s use them!

Let’s say the new intern crashes the database (first rule: It’s always the intern’s fault). You restart the database and the recovery process begins.

ARIES has three phases for recovering from a crash:

  • 1) Analysis phase: the recovery process reads the entire transaction log to reconstruct the timeline of what happened during the crash, to determine which transactions to roll back (all uncommitted transactions are rolled back) and which data needed to be written to disk at the time of the crash.
  • 2) Redo phase: this phase starts from a log record determined during the analysis phase and replays the REDO operations to bring the database back to the state it was in before the crash.

During the REDO phase, the REDO log records are processed in chronological order (by LSN).

For each log record, the recovery process reads the LSN of the disk page that contains the data to modify.

If LSN(page on disk) >= LSN(log record), the data was already written to disk before the crash (and the value on disk was overwritten by an operation that came after this log record but before the crash), so nothing needs to be done.

If LSN(page on disk) < LSN(log record), the page on disk is updated.

The REDO is done even for transactions that are going to be rolled back, because it simplifies the recovery process (although I’m sure modern databases don’t do this). A small sketch of this LSN comparison follows the three phases.

  • 3) Undo phase: this phase rolls back all transactions that were incomplete at the time of the crash. The rollback starts from the last log record of each transaction, and the UNDO log records are processed in reverse chronological order (using the PrevLSN of the log records).
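
Here is the small sketch of the REDO-phase decision announced above: a log record is replayed only when the page on disk is older than the record (page LSN < record LSN). The page and log structures are simplified stand-ins.

    # Sketch of the REDO-phase decision: a log record is replayed only if the page
    # on disk is older than the record (page LSN < record LSN).

    def redo_pass(log_records, pages):
        """log_records: list of dicts sorted by LSN; pages: page_id -> {"lsn": ..., "content": ...}."""
        for record in sorted(log_records, key=lambda r: r["lsn"]):
            page = pages[record["page_id"]]
            if page["lsn"] >= record["lsn"]:
                continue                      # the change already reached the disk: nothing to do
            page["content"] = record["redo"]  # replay the change
            page["lsn"] = record["lsn"]

    pages = {40: {"lsn": 1, "content": "AGE=23"}, 41: {"lsn": 0, "content": "AGE=34"}}
    log = [{"lsn": 1, "page_id": 40, "redo": "AGE=18"},   # already on disk: skipped
           {"lsn": 2, "page_id": 41, "redo": "AGE=18"}]   # not on disk yet: replayed
    redo_pass(log, pages)
    print(pages)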

 

During recovery, the transaction log must be made aware of the actions of the recovery process, so that the data written to disk stays consistent with the transaction log. One solution would be to remove the log records generated by the cancelled transactions, but that is very difficult. Instead, ARIES writes compensation log records into the transaction log, which logically remove the log records of the cancelled transactions.

When a transaction is cancelled “manually”, or by the lock manager (to remove a deadlock), or simply because of a network failure, the analysis phase is not needed. Indeed, the information about what needs to be redone and what needs to be undone is kept in two in-memory tables:

  • Transaction table (holds the state of all current transactions)
  • Dirty page table (holds what data needs to be written to disk)

These two tables are updated by the cache manager and transaction manager when new transactions occur. Because they are in memory, they are destroyed when the database crashes.

The task of the analysis phase is to rebuild these two tables from the information in the transaction log after a crash. To speed up the analysis phase, ARIES introduces the notion of a checkpoint: the contents of the transaction table and the dirty page table, along with the last LSN at that moment, are written to disk from time to time, so that during the analysis phase only the log records after that LSN need to be analyzed.

Conclusion

Before writing this article, I knew how big the topic was and how time consuming writing such an in-depth article would be. I turned out to be overly optimistic, and it actually took twice as long as I expected, but I learned a lot.

If you want a good understanding of databases, I recommend the research paper “Architecture of a Database System”, which gives a good introduction to databases (about 110 pages) and can be read by people who are not computer scientists. This paper helped me plan the structure of this article; it focuses less on data structures and algorithms than this article does, and more on architectural concepts.

If you read this article carefully, you should now know how powerful a database can be. Given the length of the article, let me remind you of what we learned:

  • an overview of B+Tree indexes
  • a global overview of a database
  • an overview of cost-based optimization, with a strong focus on join operations
  • an overview of buffer pool management
  • an overview of transaction management

But databases contain more clever tricks. For example, I did not address the following thorny issues:

  • how to manage clustered databases and global transactions
  • how to take a snapshot while the database is running
  • how to efficiently store (and compress) data
  • how to manage memory

So, when you have to choose between a problematic NoSQL database and a rock-solid relational database, think twice. Don’t get me wrong, some NoSQL databases are great, but they are still young and address specific concerns of a small number of applications.

Finally, if someone asks you how databases work, you no longer have to run away. Now you can give an answer yourself, or simply let him or her read this article.