
OLAP, OLTP, and HTAP are all different types of database workloads.

How does the DBMS manage its memory and move data back-and-forth from disk?

  • Space: keep pages that are likely to be used together physically close to each other on disk
  • Time: decide when to read pages into memory and when to write them back to disk, so as to minimize the number of disk reads

1. locks & latches

lock

  • Protects the logical contents of the database (e.g., tables, tuples) on behalf of a transaction
  • Held for the duration of the transaction
  • Must be able to roll back changes

latch

  • Protects in-memory data structures, giving mutually exclusive access
  • Held only for the duration of the operation
  • No rollback required

Latches are lightweight locks that protect low-level resources, such as a region of memory. They work only in memory, lock a resource quickly and for a short time, and prevent multiple threads from accessing a shared resource simultaneously. A latch is visible only to the current instance and is held momentarily; there is no queueing and no deadlock. It is comparable to a mutex in the OS.

A lock is a higher-level construct that locks data in the DB directly, such as a table or a tuple. It is not released until the transaction ends correctly, and the changes it protects may need to be rolled back. Lock requests queue up, so deadlocks are possible and must be handled.
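To make the distinction concrete, here is a toy sketch in Python (not any real DBMS's implementation): the latch is just a mutex held for one short operation on an in-memory frame, with nothing to undo afterwards.

import threading

class Frame:
    """One in-memory buffer frame; the latch protects its contents."""
    def __init__(self):
        self.latch = threading.Lock()   # latch: short-lived, in-memory only
        self.data = b""

    def overwrite(self, new_data: bytes) -> None:
        # Held only for the duration of this one operation, then released
        # immediately; no queueing, no deadlock detection, no rollback.
        with self.latch:
            self.data = new_data

A transaction lock, by contrast, would be requested from a lock manager, held until the transaction commits or aborts, and released together with any rollback work.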

2. allocation policy

global policy

Makes allocation decisions for the benefit of all active txns (transactions) as a whole.

local policy

Optimizes for one particular txn, considering only that txn and none of the other txns running in parallel.

3. Buffer pool

Buffer pool organization

Memory is likewise divided into fixed-size blocks called frames, each the same size as a page in external storage.

When a page is read into memory, a copy of it is placed in a frame in physical memory, and the page table is used to index where each page lives.

Similar to memory management in the OS.

Buffer pool meta-data

The page table records which pages are currently in memory and is used to find each page's actual location in the buffer pool.

The Buffer pool stores metadata:

  • The page table records the mapping from page ID to the page's location (frame) in the buffer pool.

    • Note the contrast with the page directory introduced earlier, which maps page IDs to page locations in the DB file on disk.

Each page stores its own metadata:

  • Dirty flag: set when the page has been modified in memory; a dirty page must be written back to disk before it can be evicted.
  • Pin/reference counter: the number of threads currently accessing the page; a pinned page must not be evicted.
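A minimal sketch of this metadata in Python (names such as PAGE_SIZE, FrameMeta, and BufferPool are illustrative, not from any real system):

PAGE_SIZE = 4096  # frames have the same size as on-disk pages

class FrameMeta:
    def __init__(self, page_id: int):
        self.page_id = page_id
        self.dirty = False   # page was modified; write back before eviction
        self.pin_count = 0   # threads currently using the page; pinned pages stay

class BufferPool:
    def __init__(self, num_frames: int):
        self.frames = [bytearray(PAGE_SIZE) for _ in range(num_frames)]
        self.meta = [None] * num_frames
        # Page table: page ID -> frame index inside the buffer pool.
        # (The page directory, by contrast, maps page ID -> offset in the DB file.)
        self.page_table = {}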

Buffer pool optimizations

multiple buffer pools

It is common to use multiple buffer pool instances, for example one per database or one per page type.

This reduces latch contention: multiple threads accessing different tables can proceed at the same time without competing for the same latch. It also improves locality.

The problem is that we need to know which buffer pool manages which data. Two approaches:

  • Embed a buffer pool identifier in the record ID (object ID, page ID, slot number).
  • Maintain a hash table that maps page IDs to buffer pools, as sketched below.
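A sketch of the second approach, reusing the BufferPool sketch above (the modulo hashing here is illustrative):

class MultiBufferPool:
    def __init__(self, num_pools: int, frames_per_pool: int):
        self.pools = [BufferPool(frames_per_pool) for _ in range(num_pools)]

    def pool_for(self, page_id: int) -> "BufferPool":
        # Hash the page ID to pick an instance; threads touching different
        # pages usually land in different pools and never contend on a latch.
        return self.pools[hash(page_id) % len(self.pools)]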

pre-fetching

Prefetching reads the pages that follow the page currently being read and loads them into memory ahead of time. It benefits queries with good locality and operations that sequentially scan many pages.

This can be achieved by relying on the Query Plan.

  • For example, if the query plan shows that pages 0-4 will be read, then once page 0 is in memory and being processed, the remaining pages can be prefetched in the background; when page 0 is done, page 1 is already in memory and there is no wait for disk I/O.
  • In fact the operating system can do this job too: mmap will read ahead and preload the rest. But only because this is the simple case, where the pages this query needs are contiguous, so the OS can optimize without knowing anything about the query.

Things change if the query needs a specific value range, as in the query below, and we know the key range stored in each index page. Walking the tree, after reading page 0 and then page 1 we already know that page 3 and page 5 hold the data needed next, so the DBMS can prefetch page 3 and page 5 right away. The OS knows nothing about the range covered by each page, so after reading page 1 it will (naturally) prefetch page 2 and page 3, and loading page 2 is useless, wasted work.

So this is an example of something that a DBMS can do, but an OS can’t do.

OS is not our friend.

select * from A

where val between 100 and 250
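To make the idea concrete, here is a sketch of plan-driven prefetching in Python, assuming the plan (or index) already yields the ordered list of page IDs to read; read_page_from_disk and process are hypothetical stand-ins:

import threading
from queue import Queue

def scan_with_prefetch(page_ids, read_page_from_disk, process):
    """Load the next page in the background while the current one is processed."""
    prefetched = Queue(maxsize=2)   # small look-ahead window

    def loader():
        for pid in page_ids:        # e.g. pages 0-4, or pages 0, 1, 3, 5 via the index
            prefetched.put(read_page_from_disk(pid))
        prefetched.put(None)        # end-of-scan marker

    threading.Thread(target=loader, daemon=True).start()
    while (page := prefetched.get()) is not None:
        process(page)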

scan sharing

Different queries can share the intermediate results of certain scans or other operations.

For example, queries A and B both scan table T1. If B arrives while A is scanning page 5, B attaches to A's cursor and starts from page 5, recording results as the shared scan moves forward. After A completes, B returns to the head of the table and reads pages 0-4 on its own.
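A rough sketch of that wrap-around scan (purely illustrative; real systems attach queries to the scan operator's cursor):

def shared_scan(table_pages, start: int):
    """Scan every page exactly once, starting wherever the ongoing scan is."""
    n = len(table_pages)
    for i in range(n):
        yield table_pages[(start + i) % n]

# Query B arrives while query A is on page 5: B attaches there, reads
# pages 5-9 together with A, then circles back for pages 0-4.
for page in shared_scan([f"page{i}" for i in range(10)], start=5):
    pass  # consume tuples from `page`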

Buffer pool bypass

In a sequential scan, each page is usually used only once and will not be reused in the near term. When such a query shows up, the DBMS allocates a small, separate buffer area for it and does not keep pages after they have been used. This prevents the whole memory from being filled up by one sequential scan, which would cause a wave of evictions afterwards.

OS page cache

Most disk operations rely on OS APIs.

Most DBMSs use direct I/O to bypass the OS page cache, avoiding redundant copies of pages in the OS cache and avoiding the OS applying a different eviction policy to the data.
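A Linux-only sketch of direct I/O in Python (O_DIRECT demands block-aligned buffers, which is why the read below goes through a page-aligned anonymous mmap; test.db is a placeholder path):

import mmap
import os

fd = os.open("test.db", os.O_RDONLY | os.O_DIRECT)  # bypass the OS page cache
buf = mmap.mmap(-1, 4096)   # anonymous mmap is page-aligned, as O_DIRECT requires
os.preadv(fd, [buf], 0)     # read the first 4 KB page straight from disk
os.close(fd)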

4. Buffer replacement policies

Replacement goals: correctness, accuracy, speed, and low metadata overhead.

  • LRU: evict the page that was least recently used.
  • Clock: an approximation of LRU. Each page has a reference bit that is set to 1 when the page is accessed; a clock hand sweeps over the pages, resetting each 1 to 0 and evicting the first page whose bit is already 0 (see the sketch below).
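A single-threaded sketch of the clock policy just described:

class ClockReplacer:
    def __init__(self, num_frames: int):
        self.ref_bits = [0] * num_frames
        self.hand = 0

    def record_access(self, frame: int) -> None:
        self.ref_bits[frame] = 1          # set the reference bit on every access

    def pick_victim(self) -> int:
        while True:
            if self.ref_bits[self.hand]:  # 1 -> clear it, give a second chance
                self.ref_bits[self.hand] = 0
                self.hand = (self.hand + 1) % len(self.ref_bits)
            else:                         # 0 -> evict this frame
                victim = self.hand
                self.hand = (self.hand + 1) % len(self.ref_bits)
                return victim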

These two policies are commonly used in the OS, but they behave very badly under sequential scans, because the page that has just been read is usually the one least likely to be needed again.

This sequential flooding problem is exactly the situation mentioned earlier in the buffer pool bypass section.

LRU-K

Retain the timestamps of the last K references to each page, and use them to estimate when the page will be accessed next; the victim is the page whose K-th most recent access is furthest in the past.
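A minimal LRU-K sketch (K = 2 by default; a page with fewer than K recorded accesses is treated as infinitely old, so it is evicted first):

import time
from collections import defaultdict, deque

class LRUKReplacer:
    def __init__(self, k: int = 2):
        self.k = k
        self.history = defaultdict(deque)  # page ID -> last k access timestamps

    def record_access(self, page_id: int) -> None:
        stamps = self.history[page_id]
        stamps.append(time.monotonic())
        if len(stamps) > self.k:
            stamps.popleft()               # keep only the most recent k

    def pick_victim(self) -> int:
        def kth_recent(pid):
            stamps = self.history[pid]
            # Fewer than k accesses: the k-th access is treated as infinitely old.
            return stamps[0] if len(stamps) == self.k else float("-inf")
        victim = min(self.history, key=kth_recent)
        del self.history[victim]
        return victim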

Localization


The DBMS chooses which pages to evict on a per txn/query basis. This minimizes the pollution of the buffer pool from each query.

This requires maintaining a list of the pages that each query has accessed.

Priority hints

The DBMS knows the context of each page and can use hints to set each page's priority and importance. For example, the pages near the root of an index are touched by almost every query on that index, so they are worth keeping in memory.

dirty page

  • Non-dirty pages in memory can be discarded without being written back to disk, so they are cheap to evict.
  • Evicting a dirty page requires writing it back to disk first, which is slower; see the sketch below.
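A sketch of the resulting eviction path, reusing the BufferPool sketch above (write_page_to_disk is a hypothetical stand-in):

def evict(pool: "BufferPool", frame_idx: int, write_page_to_disk) -> None:
    meta = pool.meta[frame_idx]
    if meta.dirty:
        # Slow path: the page was modified and must be persisted first.
        write_page_to_disk(meta.page_id, pool.frames[frame_idx])
    # Clean pages are simply dropped; no disk write is needed.
    del pool.page_table[meta.page_id]
    pool.meta[frame_idx] = None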

5. To summarize

This section used some concrete scenarios to show why a DBMS can manage memory and data better than the operating system can.

Accessing memory is much faster than accessing disk, but memory is limited, so we want to make good use of it. Hence the optimizations:

  • Multiple buffer pools, so that concurrent accesses to different tables do not have to contend for the same latch, reducing time spent waiting;
  • Prefetching: while one page is being read from disk, the DBMS uses its “wisdom” to work out which pages will be needed next but are not yet in memory. Why does the DBMS have this wisdom? Because it knows what data each page stores, and it knows from the query plan what data will be needed. What can the OS do? The OS can only look on.
  • Shared results: if multiple queries scan the same table, a later query can reuse the scan results of an earlier one.
  • Sometimes there is locality, sometimes there is not. A sequential table scan obviously uses each page only once, so there is no point keeping scanned pages in memory. The buffer pool therefore gives such a scan only a small area and throws each page away as soon as it is used, preventing one scan from occupying the entire memory.
  • Since the OS is no help here, its automatic caching of disk data in memory is no longer needed; the DBMS bypasses it with direct I/O.

The DBMS can manage that sweet, sweet memory better than the OS.

–by leveraging semantic knowledge of the query plan to make better decisions about evictions, allocations, and pre-fetching.