Introduction: This article walks through the source code of the PFS engine's memory management, explains how PFS allocates and releases memory, analyzes some of the existing problems in depth, and offers some ideas for improvement. The source code analysis is based on MySQL 8.0.24.

Source | Alibaba Tech official account

I. Introduction

MySQL Performance Schema (PFS) is a powerful performance monitoring and diagnostic tool provided by MySQL. It offers a way to inspect the internal execution of the server at runtime. PFS collects information by monitoring events registered inside the server. An event can in principle be any execution activity or resource usage inside the server, such as a function call, an operating-system wait, a parsing or sorting stage within an SQL statement, or memory resource usage.

PFS stores the collected performance data in the performance_schema storage engine, an in-memory table engine, which means all collected diagnostic information is kept in memory. Collecting and storing this diagnostic information adds extra overhead, so the performance and memory management of PFS itself matter a great deal if the impact on the business workload is to be kept small.


II. Memory management model

PFS memory management has several key features:

  • Memory is allocated in units of pages, and one page can hold multiple records
  • Some pages are pre-allocated at system startup, and more are added on demand at runtime, but pages are only ever added and never reclaimed
  • Allocating and releasing a record is lock-free

1 Core data structure

PFS_buffer_scalable_container is the core data structure of PFS memory management. The overall structure is shown as follows:

A container holds multiple pages, and each page holds a fixed number of records. Each record corresponds to one event object, such as a PFS_thread. The number of records per page is fixed, while the number of pages grows as the load increases.
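As a rough, self-contained sketch of this layout (the names Record, Page, and Container here are illustrative, not the actual MySQL declarations):

#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// One event object, e.g. a PFS_thread; m_lock packs its state and version.
struct Record {
  std::atomic<uint32_t> m_lock{0};
};

// A page holds a fixed number of records.
template <std::size_t RecordsPerPage>
struct Page {
  std::array<Record, RecordsPerPage> records;
};

// A container owns an array of page pointers; pages are created on demand
// as the load grows and are never reclaimed.
template <std::size_t RecordsPerPage, std::size_t MaxPages>
struct Container {
  std::atomic<Page<RecordsPerPage> *> pages[MaxPages]{};
};

The real container additionally tracks how many pages currently exist and protects page creation with a mutex, as shown in the structures below.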

2 Page selection policy during allocation


Key data structures involved in memory allocation are as follows:

PFS_PAGE_SIZE   // size of each page; the default for global_thread_container is 256
PFS_PAGE_COUNT  // maximum number of pages; the default for global_thread_container is 256

class PFS_buffer_scalable_container {
  PFS_cacheline_atomic_size_t m_monotonic;       // monotonically increasing atomic, used for lock-free page selection
  PFS_cacheline_atomic_size_t m_max_page_index;  // index of the largest page created so far
  size_t m_max_page_count;                       // upper limit on the number of pages
  std::atomic<array_type *> m_pages[PFS_PAGE_COUNT];  // page array
  native_mutex_t m_critical_section;             // lock protecting the creation of a new page
};

m_pages is an array of pages; any page may still have free records, or may be completely busy. MySQL adopts a fairly simple strategy: poll the pages one by one and try each in turn until an allocation succeeds. If allocation fails on every existing page, a new page is created to expand the pool, up to the maximum page count.

The scan does not start from the first page every time. Instead it starts from the position recorded in the atomic variable m_monotonic, which is incremented by 1 every time an allocation attempt on a page fails.

The core simplified code is as follows:

value_type *allocate(pfs_dirty_state *dirty_state) {
  current_page_count = m_max_page_index.m_size_t.load();
  monotonic = m_monotonic.m_size_t.load();
  monotonic_max = monotonic + current_page_count;

  while (monotonic < monotonic_max) {
    index = monotonic % current_page_count;
    array = m_pages[index].load();
    pfs = array->allocate(dirty_state);
    if (pfs) {
      // Allocation succeeded
      return pfs;
    } else {
      // Allocation failed, try the next page.
      // Because m_monotonic is incremented concurrently by other threads, the
      // local monotonic variable does not necessarily grow by 1: it may jump
      // straight from 1 to 3 or beyond. So this while loop is not a strict
      // round-robin over all pages; the threads together cover the pages in a
      // skipping fashion.
      // This is where the algorithm is problematic: some pages get skipped and
      // ignored, which increases the probability of expanding with a new page.
      monotonic = m_monotonic.m_size_t++;
    }
  }

  // All existing pages failed; try to create new pages until the limit is reached
  while (current_page_count < m_max_page_count) {
    // The lock serializes page creation across threads
    native_mutex_lock(&m_critical_section);
    // Re-check after taking the lock: another thread may have created the page already
    array = m_pages[current_page_count].load();
    if (array == nullptr) {
      m_allocator->alloc_array(array);
      m_pages[current_page_count].store(array);
      ++m_max_page_index.m_size_t;
    }
    native_mutex_unlock(&m_critical_section);

    // Try allocating again from the new page
    pfs = array->allocate(dirty_state);
    if (pfs) {
      // Allocation succeeded
      return pfs;
    }
    // Allocation failed, keep creating new pages up to the upper limit
    current_page_count++;
  }
}

Let's look at the page selection strategy in more detail. Because the increment of the m_monotonic atomic variable happens concurrently, some pages can be skipped, which increases the probability of expanding with a new page.

An extreme example makes the problem easier to see. Suppose there are currently 4 pages in total: the 1st and 4th pages are full with no available records, while the 2nd and 3rd pages still have available records.

When 4 threads issue an Allocate request at the same time, they all read m_monotonic = 0 simultaneously:

monotonic = m_monotonic.m_size_t.load();

At this point all threads fail to allocate a record from the first page (because the first page has no available records), and each then increments the atomic variable to try the next page:

monotonic = m_monotonic.m_size_t++;

This is where the problem arises. Because the atomic ++ returns the latest value and the four threads complete the increment one after another, the first thread to increment sees monotonic = 2, the second sees 3, and so on. As a result, threads 3 and 4 skip page2 and page3, finish the scan without success, and go on to create a new page, even though page2 and page3 still have free records available.

The example above is extreme, but under concurrent access in MySQL, with many threads requesting PFS memory at the same time, it is quite easy for a portion of the pages to be skipped.

3 Record selection policy within a page

PFS_buffer_default_array is the management class for a single page; each page maintains its own set of records.

Key data structures are as follows:

class PFS_buffer_default_array {
  PFS_cacheline_atomic_size_t m_monotonic;  // monotonically increasing atomic, used to pick a free record
  size_t m_max;                             // maximum number of records in the page
  T *m_ptr;                                 // the PFS objects backing the records, e.g. PFS_thread
};

Each page is effectively a fixed-length array, and each record object has three states: FREE, DIRTY, and ALLOCATED. FREE means the record is free and available; ALLOCATED means it has been successfully allocated; DIRTY is an intermediate state meaning the record has been claimed but the allocation is not yet complete.
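Putting these together, the lifecycle of a record can be summarized as follows (the transition names are the pfs_lock methods shown later in this article):

// Record lifecycle, using the pfs_lock transition methods described below:
//
//   FREE --free_to_dirty()--> DIRTY --dirty_to_allocated()--> ALLOCATED
//                                                                 |
//   FREE <-------------------- allocated_to_free() ---------------+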

Record selection is essentially the process of finding and claiming a record in the FREE state.

The core simplified code is as follows:

value_type *allocate(pfs_dirty_state *dirty_state) {
  // Start scanning from the current value of the monotonic cursor
  monotonic = m_monotonic.m_size_t++;
  monotonic_max = monotonic + m_max;

  while (monotonic < monotonic_max) {
    index = monotonic % m_max;
    pfs = m_ptr + index;
    // m_lock is a pfs_lock structure; the Free/Dirty/Allocated state is
    // maintained by it. How the state transition is done atomically is
    // explained below.
    if (pfs->m_lock.free_to_dirty(dirty_state)) {
      return pfs;
    }
    // The current record is not free; increment the atomic variable and try the next one
    monotonic = m_monotonic.m_size_t++;
  }
}

The overall record selection process is very similar to page selection; the difference is that the number of records in a page is fixed, so there is no expansion logic.

Of course, the same selection strategy has the same problem: the m_monotonic atomic ++ here is also performed concurrently by multiple threads, so under heavy concurrency some records get skipped, and a page may fail to hand out a record even though it still has free ones.

So even when page selection does not skip a page, the records inside the page can still be skipped and missed, which makes things worse and further aggravates memory growth.

4 pfs_lock

Each record has a pfs_lock that maintains its allocation state (free/dirty/allocated) within the page, as well as version information.

Key data structures:

struct pfs_lock {
  std::atomic<uint32> m_version_state;
};

pfs_lock uses a 32-bit unsigned integer to hold the version and state information, laid out as follows:

state: the low 2 bits hold the allocation state.

state PFS_LOCK_FREE = 0x00

state PFS_LOCK_DIRTY = 0x01

state PFS_LOCK_ALLOCATED = 0x11

version: the high 30 bits hold the version. The version starts at 0 and is incremented by 1 on every successful allocation, so it records how many times the record has been successfully allocated.

// The following three macros are used for bit operations to extract the state or the version
#define VERSION_MASK 0xFFFFFFFC
#define STATE_MASK 0x00000003
#define VERSION_INC 4

bool free_to_dirty(pfs_dirty_state *copy_ptr) {
  uint32 old_val = m_version_state.load();

  // If the record is not in the FREE state, it cannot be claimed
  if ((old_val & STATE_MASK) != PFS_LOCK_FREE) {
    return false;
  }

  uint32 new_val = (old_val & VERSION_MASK) + PFS_LOCK_DIRTY;

  // atomic_compare_exchange_strong is an optimistic lock: multiple threads may
  // try to modify the atomic variable at the same time, but only one succeeds
  bool pass =
      atomic_compare_exchange_strong(&m_version_state, &old_val, new_val);

  if (pass) {
    // The CAS succeeded: this record now belongs to the current thread
    copy_ptr->m_version_state = new_val;
  }

  return pass;
}

void dirty_to_allocated(const pfs_dirty_state *copy) {
  /* Make sure the record was DIRTY. */
  assert((copy->m_version_state & STATE_MASK) == PFS_LOCK_DIRTY);
  /* Increment the version, set the ALLOCATED state */
  uint32 new_val =
      (copy->m_version_state & VERSION_MASK) + VERSION_INC + PFS_LOCK_ALLOCATED;
  m_version_state.store(new_val);
}

The state transition process is fairly easy to follow. The logic of dirty_to_allocated and allocated_to_free is simpler than free_to_dirty, because the concurrent-write problem only exists when a record is transitioned out of the FREE state: once the state becomes DIRTY, the record is already owned by one thread, and no other thread will try to operate on it.

The version is only incremented when the state becomes PFS_LOCK_ALLOCATED, i.e. on a successful allocation.
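As a small, self-contained illustration of this layout (not MySQL code; it only reuses the mask and state values shown above):

#include <cassert>
#include <cstdint>

int main() {
  const uint32_t VERSION_MASK = 0xFFFFFFFC;   // high 30 bits: version
  const uint32_t STATE_MASK = 0x00000003;     // low 2 bits: state
  const uint32_t VERSION_INC = 4;             // adding 4 bumps the version by 1
  const uint32_t PFS_LOCK_FREE = 0x00;
  const uint32_t PFS_LOCK_DIRTY = 0x01;

  // A record that has already been allocated 3 times and is currently FREE
  uint32_t word = 3 * VERSION_INC + PFS_LOCK_FREE;
  assert((word & STATE_MASK) == PFS_LOCK_FREE);   // state lives in the low 2 bits
  assert(((word & VERSION_MASK) >> 2) == 3);      // version lives in the high 30 bits

  // free_to_dirty keeps the version and only sets the DIRTY state bits
  word = (word & VERSION_MASK) + PFS_LOCK_DIRTY;
  assert((word & STATE_MASK) == PFS_LOCK_DIRTY);
  assert(((word & VERSION_MASK) >> 2) == 3);      // version is bumped later, in dirty_to_allocated

  return 0;
}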

5 Releasing PFS memory

Releasing PFS memory is relatively simple: each record knows which container and page it belongs to, so releasing it is just a call to the corresponding deallocate interface, which ultimately sets the record's state back to free.

At the lowest level, pfs_lock updates the state:

struct pfs_lock {
  void allocated_to_free(void) {
    /*
      If this record is not in the ALLOCATED state and the caller is trying
      to free it, this is a bug: the caller is confused,
      and potentially damaging data owned by another thread or object.
    */
    uint32 copy = copy_version_state();
    /* Make sure the record was ALLOCATED. */
    assert(((copy & STATE_MASK) == PFS_LOCK_ALLOCATED));
    /* Keep the same version, set the FREE state */
    uint32 new_val = (copy & VERSION_MASK) + PFS_LOCK_FREE;
    m_version_state.store(new_val);
  }
};

III. Optimization of memory allocation

As analyzed above, both page selection and record selection can skip members during the scan, so an allocation may fail even though free members still exist in the cache, causing more pages to be created and more memory to be used. The core problem is that, once allocated, this memory is never freed.

To improve the PFS memory hit rate and avoid the problems above, one idea is to record the index of the page where the last allocation succeeded, and to scan strictly by advancing only a local cursor, so that no page is skipped:

while (monotonic < monotonic_max) {
  index = monotonic % current_page_count;
  array = m_pages[index].load();
  pfs = array->allocate(dirty_state);
  if (pfs) {
    // Remember the index of the page where allocation succeeded,
    // so that the next allocation starts from this page
    m_monotonic.m_size_t.store(index);
    return pfs;
  } else {
    // Only the local variable is incremented, so every page is strictly scanned once
    monotonic++;
  }
}

Another point: if every lookup starts from the position of the last successful allocation, all threads start from the same place, which inevitably causes concurrent access conflicts. The starting position of the lookup should therefore carry some randomness to avoid a large number of conflict retries.

The summary is as follows:

  1. Each Allocate starts from the index where the last allocation succeeded, or from a random position
  2. Each Allocate strictly scans all pages or records exactly once (see the sketch below)
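The following is a minimal, self-contained sketch of these two points (illustrative types and names, not the actual MySQL implementation): the container remembers where the last allocation succeeded, and the scan advances only a local cursor so that every page is visited exactly once.

#include <atomic>
#include <cstddef>

struct PageSketch {
  std::atomic<int> free_records{4};  // stand-in for the page's free/busy records

  // Claim one record if any is left, lock-free
  bool try_take() {
    int n = free_records.load();
    while (n > 0) {
      if (free_records.compare_exchange_weak(n, n - 1)) {
        return true;
      }
    }
    return false;
  }
};

struct ContainerSketch {
  static constexpr std::size_t kPages = 4;
  PageSketch pages[kPages];
  std::atomic<std::size_t> last_hit{0};  // index of the page that last served an allocation

  // Returns the page a record was taken from, or nullptr if every page is full
  PageSketch *allocate() {
    std::size_t start = last_hit.load();        // could also be a per-thread random offset
    for (std::size_t i = 0; i < kPages; i++) {  // local cursor: each page visited exactly once
      std::size_t index = (start + i) % kPages;
      if (pages[index].try_take()) {
        last_hit.store(index);  // the next allocation starts from the page that just worked
        return &pages[index];
      }
    }
    return nullptr;  // all pages full: the caller would go on to create a new page
  }
};

With a local cursor the worst case is still one full scan per allocation, but no page with free records is ever missed, so expansion only happens when every existing page is genuinely full.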

IV. Optimization of memory release

The biggest problem with PFS memory release is that, once created, memory is never released until shutdown. For workloads with pronounced peaks, a large amount of page memory is allocated at peak load, but it is still not released once the load drops.

It is not easy to implement a lock-free mechanism that periodically detects and reclaims idle memory without hurting the efficiency of memory allocation.

There are mainly the following points to consider:

  1. Release has to happen at page granularity: every record in a page being released must be free, and a page being released must not be handed out for new allocations
  2. Allocation is spread more or less randomly across pages; even when memory could be reclaimed overall, every page may still hold a few busy records, and this situation needs to be handled well
  3. How to set the release threshold so as to avoid frequent allocate/release churn

Regarding optimization of PFS memory release, PolarDB has developed a periodic PFS memory reclamation feature, which will be covered in a later article due to the space limitations here.

V. About Us

PolarDB is a cloud-native distributed relational database developed independently by Alibaba. It entered the Leader quadrant of Gartner's global database report in 2020 and won the first prize for Scientific and Technological Progress awarded by the Chinese Institute of Electronics in 2020. Built on a cloud-native distributed database architecture, PolarDB provides large-scale online transaction processing capability as well as parallel processing capability for complex queries, has reached an internationally leading level in the field of cloud-native distributed databases, and has been widely recognized by the market. Within Alibaba Group's own best practices, PolarDB fully supported the 2020 Tmall Double 11 and set a new record for peak database throughput, as high as 140 million TPS. We look forward to working with you to build a world-class next-generation cloud-native distributed relational database.

Reference:

[1] MySQL Performance Schema dev.mysql.com/doc/refman/…

[2] MySQL · Best Practices · Are you parallel today? – insight PolarDB 8.0 of parallel query mysql.taobao.org/monthly/201…

[3] Source code mysql/ mysql-server 8.0.24 github.com/mysql/mysql…


This article is the original content of Aliyun and shall not be reproduced without permission.