Introduction | As the number of tables in native MongoDB grows sharply, resource consumption rises substantially and performance drops steeply. After a round of performance analysis, the MongoDB team of Tencent's database R&D center applied the idea of shared table space, so that the massive number of tables created by users share a single table space in the underlying WT engine. The number of tables maintained by the WT engine no longer grows with the tables and indexes the user creates and always stays in the single digits. In the million-table scenario, the optimized architecture improves read and write performance by 1-2 orders of magnitude, reduces memory consumption, and shortens startup time from the hour level to less than 1 minute.
I. Background: finding the problem, setting up the project, and tackling it
The MongoDB team of Tencent's database R&D center (hereinafter CMongo) found that many businesses need to create a large number of tables during operation. As the number of databases and tables grew, customers reported slow queries lasting from several seconds to tens of seconds, accompanied by node unavailability, which seriously affected their normal business requests.
Monitoring showed that when the number of tables and indexes in native MongoDB reaches the million level, a MongoDB instance may suffer operation delays and performance degradation even though CPU and disk resources are far from their bottlenecks. From our operational observations, there are at least three serious issues:
- Performance deteriorates and slow queries become more frequent
- Memory consumption increases and OOM appears frequently
- Instance startup times become significantly longer, possibly reaching the hour level
To address these problems, the MongoDB team of Tencent's database R&D center analyzed the performance of the native million-table scenario based on version 4.0 and optimized the architecture with reference to industry solutions, ultimately achieving very good results: in the million-table scenario, the optimized architecture improves read/write performance by 1-2 orders of magnitude, reduces memory consumption, and shortens startup time from the hour level to less than 1 minute. This article introduces the analysis process and the architecture optimization scheme in light of MongoDB's internals.
II. Performance analysis of the native multi-table scenario (know yourself and know the enemy)
Starting from version 3.2, the MongoDB kernel has adopted a typical plug-in architecture, which can be roughly divided into a server layer and a storage engine layer. Through instrumentation and log debugging, we found that all the problems pointed to the storage engine layer, so analyzing the WiredTiger engine became the focus of our subsequent work.
(1) Introduction to WiredTiger storage engine
MongoDB uses WiredTiger (WT for short) as the default storage engine, and the overall architecture is shown in the figure below:
(Overall architecture of WiredTiger Storage engine)
Each table and index created by the user in the MongoDB layer corresponds to its own independent WT table.
Data reads and writes go through the following three layers:
- WT Cache: caches uncompressed table data in B+ trees and uses its own eviction algorithm to keep memory usage within a reasonable range.
- OS Cache: Managed by the operating system, the OS caches compressed database table data.
- Database file: stores the compressed table data. Each WT table corresponds to a separate disk file. A disk file is divided into 4KB-aligned extents (offset + length), which are managed by three linked lists: the available list (extents that can be allocated), the discard list (freed extents that cannot be reused immediately because they may still be referenced by other checkpoints), and the allocate list (extents that have been allocated).
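The extent bookkeeping above can be pictured with a minimal sketch. The types below are hypothetical and only illustrate the idea of describing a file as (offset, length) extents tracked by three lists; they are not WiredTiger's actual structures.

```cpp
#include <cstdint>
#include <list>

// Hypothetical illustration of the block-manager bookkeeping described above:
// a table file is carved into (offset, length) extents, and three lists track
// their state. Not WiredTiger's actual data structures.
struct Extent {
    uint64_t offset;  // byte offset within the table file
    uint64_t length;  // extent size, a multiple of the 4KB allocation unit
};

struct BlockLists {
    std::list<Extent> avail;    // extents that can be handed out for new writes
    std::list<Extent> discard;  // freed extents still referenced by older checkpoints
    std::list<Extent> alloc;    // extents allocated since the last checkpoint
};
```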
(2) Memory consumption analysis
Consideration: if the user does not access all tables within a short period, some tables must remain idle for a long time. Why, then, does the data of these inactive tables stay in memory for so long?
Assumption: if the memory used by inactive tables were released in a timely manner, a cluster of normal size could effectively support far more tables and avoid frequent OOM.
Exploration: we created a 2-core 4GB replica set on the cloud and kept creating tables (2 indexes per table); after its data was inserted, each table was never accessed again. Monitoring showed that the "connection data handles currently active" metric kept rising while the "connection sweep dhandles closed" metric stayed flat. The instance's memory footprint kept climbing, and OOM was triggered before 10,000 tables had been created.
Data handle & sweep thread

A data handle (dhandle) can be simply understood as WiredTiger's dedicated handle to a resource, similar to a file descriptor (fd) in an operating system.
The global dhandle list is maintained in the global WT_CONNECTION object, and each WT_SESSION object maintains a dhandle cache that points into this list. When a WT table is accessed for the first time and no corresponding dhandle exists in either the session dhandle cache or the global dhandle list, a dhandle is created for the table and added to the global list. A background sweep thread scans the dhandle list in the WT_CONNECTION every 10 seconds and marks the dhandles that are not currently in use. If the number of open btrees exceeds close_handle_minimum (default 250), it checks which dhandles have been idle for longer than close_idle_time, closes the btrees associated with them to release part (not all) of their resources, and marks those dhandles dead so that sessions referencing them can see they are no longer usable. Once no session references a dead dhandle, it is removed from the global list.
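The sweep policy described above can be summarized in a short sketch. The types and function below are hypothetical stand-ins, not the actual WiredTiger sweep-server code; they only show the decisions made on each scan.

```cpp
#include <chrono>
#include <cstddef>
#include <list>

// Hypothetical sketch of one sweep pass: close the btrees of handles that have
// been idle longer than close_idle_time (only once more than
// close_handle_minimum btrees are open), and drop dead handles that no session
// references any more. Not the actual WiredTiger code.
struct DHandle {
    bool in_use = false;   // currently referenced by an operation
    bool dead = false;     // marked dead, waiting for the last reference to drop
    int session_refs = 0;  // sessions still caching this dhandle
    std::chrono::steady_clock::time_point last_used{};
    void close_btree() { /* release file/btree resources (partial release) */ }
};

void sweep_once(std::list<DHandle>& global_list,
                std::chrono::seconds close_idle_time,
                std::size_t close_handle_minimum) {
    const auto now = std::chrono::steady_clock::now();
    const std::size_t open_handles = global_list.size();

    for (auto it = global_list.begin(); it != global_list.end();) {
        DHandle& dh = *it;
        if (!dh.in_use && !dh.dead && open_handles > close_handle_minimum &&
            now - dh.last_used > close_idle_time) {
            dh.close_btree();  // frees part of the memory
            dh.dead = true;    // sessions holding references drop them lazily
        }
        if (dh.dead && dh.session_refs == 0)
            it = global_list.erase(it);  // nothing references it: leave the global list
        else
            ++it;
    }
}
```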
Based on this analysis, we had reason to suspect that the cleanup is not aggressive enough. When the WiredTiger engine is initialized, close_idle_time is set to 100000 seconds (about 28 hours). As a result, sweeping is not timely, and tables that are no longer accessed keep occupying memory for a long time. Interested readers can join the discussion on the related JIRA.
Verification: to quickly verify this analysis, we lowered close_idle_time to 600 seconds in the code and ran the same test program. The node no longer hit OOM, all 10,000 collections were inserted successfully, and memory usage stayed low.
(Memory consumption for different close_idle_time values)
The "connection data handles currently active" metric no longer rises in a straight line; it fluctuates up and down and finally drops back to 250 after the test program finishes, as expected. Therefore, for small-specification clusters with many business tables and frequent OOM, this configuration can be adjusted temporarily as a mitigation. However, doing so leads to a very low dhandle cache hit ratio and thus serious performance degradation.
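For reference, the sweep parameters can be adjusted without recompiling. The sketch below sets the public WiredTiger file_manager options through wiredtiger_open; MongoDB exposes the same configuration string via storage.wiredTiger.engineConfig.configString, which should be verified against your specific version before being relied on as a temporary mitigation.

```cpp
#include <wiredtiger.h>

// Sketch: open a WT connection with a more aggressive sweep than MongoDB's
// default close_idle_time of 100000 seconds. The values here are illustrative.
int open_with_aggressive_sweep(const char* home, WT_CONNECTION** connp) {
    return wiredtiger_open(
        home, nullptr,
        "create,"
        "file_manager=(close_idle_time=600,"
        "close_handle_minimum=250,"
        "close_scan_interval=10)",
        connp);
}
```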
Therefore, reducing the number of dhandles to cut memory consumption, while keeping the dhandle cache hit ratio high enough to avoid performance degradation, became one of our key optimization objectives.
(3) Read and write performance analysis
Exploration process:
Scenario 1: Performance analysis before preheating
Preheating means sequentially reading at least one document from each collection, so that the cursor of every collection enters the session cursor cache.
During the test, each thread randomly selected a collection to operate on, and persistent slow queries were observed. As shown in the monitoring below, read latency soared as the schema lock wait time in the wait-time window went up. We therefore needed to analyze how the schema lock is used.
(Monitoring of random queries before preheating)
Why do simple CRUD requests depend, directly or indirectly, on the schema lock?
We generated a flame graph during the test; the function call stack is as follows:
(Flame graph of single-point random queries before preheating)
Above "all" there are many flame columns like the "conn10091" thread, indicating that the client-handling threads are all contending for the schema lock, a global lock belonging to the WT_CONNECTION object. This explains the large schema lock wait.
So why are the threads executing open_cursor?
To operate on a table, a MongoServer thread needs to select a WT_SESSION and obtain a WT_CURSOR from that WT_SESSION, and then performs findRecord, insertRecord, updateRecord, and so on.
To ensure performance, WT_CURSORs are cached both at the MongoServer layer and inside the WT engine. However, when a table is accessed for the first time, its dhandle and WT_CURSOR do not exist yet, and creating them is expensive: the file has to be opened and the btree structure established.
Combining the code with the flame graph, open_cursor spends a lot of CPU time in the get-dhandle phase after acquiring the schema lock. Before the system is warmed up there are naturally many open_cursor calls, leading to long spinlock waits. The more dhandles there are, the fiercer the contention; reducing the number of dhandles reduces the spinlock contention and greatly speeds up warm-up. That is one of the points we focus on.
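The following sketch restates that path with hypothetical types (it is not MongoDB or WiredTiger source code): a hit in the cursor cache is cheap, while a miss on a cold table falls through to opening the dhandle under the global schema lock.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical sketch of the cursor-acquisition path discussed above.
struct DHandle { /* btree, file handle, ... */ };
struct WtCursor { DHandle* dhandle = nullptr; };

struct Session {
    std::unordered_map<std::string, WtCursor> cursor_cache;   // per-session cursor cache
    std::unordered_map<std::string, DHandle*> dhandle_cache;  // per-session dhandle cache

    WtCursor* get_cursor(const std::string& uri) {
        if (auto it = cursor_cache.find(uri); it != cursor_cache.end())
            return &it->second;  // fast path: no locks, no dhandle work

        DHandle* dh = nullptr;
        if (auto it = dhandle_cache.find(uri); it != dhandle_cache.end()) {
            dh = it->second;     // dhandle already known, only a cursor is built
        } else {
            // Cold table: find or create the dhandle in the global list under the
            // schema lock (plus handle-list spinlocks), opening the file and
            // building the btree if needed. With millions of tables, this is
            // where threads pile up in the flame graph.
            dh = open_dhandle_under_schema_lock(uri);
            dhandle_cache.emplace(uri, dh);
        }
        auto [pos, inserted] = cursor_cache.emplace(uri, WtCursor{dh});
        (void)inserted;
        return &pos->second;
    }

    DHandle* open_dhandle_under_schema_lock(const std::string& uri) {
        (void)uri;
        return new DHandle{};  // stand-in for the expensive global-lock path
    }
};
```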
Scenario 2: Performance analysis after preheating
In a continuous read/write scenario, after the "data warm-up" phase, the data handle of every WT table has already been opened, so the schema lock described above is no longer the bottleneck. But we found that performance was still relatively poor compared with scenarios that have fewer tables.
To locate the bottleneck further, we collected per-stage statistics for read and write requests along the whole path and found that the data handle cache access phase took a long time.
Both WT_CONNECTION and WT_SESSION cache data handles and use hashed linked lists to speed up lookups, with 512 hash buckets by default. In the million-table scenario each list becomes very long and lookup efficiency drops dramatically, which slows down read and write requests.
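A rough sketch of that lookup, with hypothetical types, makes the arithmetic clear: with N data handles spread over B buckets, each cache access scans about N/B list nodes, so a million handles over the default 512 buckets means roughly 2,000 comparisons per lookup.

```cpp
#include <cstddef>
#include <forward_list>
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch of a dhandle cache backed by hashed linked lists.
struct DHandle { std::string uri; };

struct DHandleCache {
    std::vector<std::forward_list<DHandle*>> buckets;

    explicit DHandleCache(std::size_t nbuckets) : buckets(nbuckets) {}

    std::size_t bucket_of(const std::string& uri) const {
        return std::hash<std::string>{}(uri) % buckets.size();
    }
    void insert(DHandle* dh) { buckets[bucket_of(dh->uri)].push_front(dh); }

    DHandle* find(const std::string& uri) const {
        // Cost is proportional to the bucket's list length, i.e. ~N / nbuckets.
        for (DHandle* dh : buckets[bucket_of(uri)])
            if (dh->uri == uri) return dh;
        return nullptr;
    }
};
```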
After adding access performance statistics for the data handle cache in the WT code, we found that the number of slow requests on the user side correlates with the number of slow data handle operations, and the latency distributions are basically the same, as shown below:
Latency range of slow operations | Number of slow requests on the MongoDB user side | Number of slow operations traversing the dhandle hash list |
---|---|---|
10-50ms | 13432 | 19684 |
50-100ms | 81840 | 75371 |
>100ms | 12473 | 6905 |
To quickly verify this analysis, we increased the number of buckets 100-fold during the stress test so that each hash list became significantly shorter, and found that performance improved by several times, even orders of magnitude.
The above analysis suggests that if MongoDB tables could share a small number of WT table spaces, the number of data handles would drop, the data handle cache would be accessed more efficiently, and performance would improve.
(4) Startup speed analysis
In the million-level database table scenario of native MongoDB, it takes tens of minutes or even more than an hour to start a Mongod instance. If there are multiple nodes that can’t be pulled up quickly after OOM, overall service availability will suffer.
Exploration process: In order to observe the process and time distribution of MongoServer and WT engine during startup, we analyzed the logs of MongoDB and WT engine.
According to log analysis, the reconfig phase of the WT table takes the longest time.
Specifically, when mongod starts, the initialization phase of mongoDbMain initializes the storage engine and performs loadCatalog, which initializes a WiredTigerRecordStore for every table. When a WiredTigerRecordStore is constructed, setTableLogging is called to configure, based on the table's URI, whether WAL is enabled for the underlying WT table; this in turn invokes the schema alter path of the underlying WT engine, which involves multiple IO operations (reading the old configuration, comparing it with the new one, and possibly updating it). mongoDbMain also initializes the indexes (WiredTigerIndex) of all tables and performs the corresponding setTableLogging operations.
Moreover, all of the above is done serially, so the total time grows linearly with the number of tables and indexes. In the million-table scenario, more than 99% of the startup time is spent in this stage, making it the bottleneck of startup speed.
Why reconfig the WT table at startup?
MongoDB decides whether to enable WAL for a table based on the table's purpose and the current configuration; see the implementation of WiredTigerUtil::useTableLogging and WiredTigerUtil::setTableLogging for details. This is done both when the mongod instance starts and when the user creates a table. A closer look at this logic reveals a pattern: once a table has been created and its WT URI is determined, whether WAL is enabled for that table is fixed and does not change while the instance runs.
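A simplified sketch of that startup path, with hypothetical helpers standing in for the mongod code, shows why the cost is linear in the number of tables and indexes:

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: loadCatalog walks every _mdb_catalog entry and calls
// setTableLogging once per table and once per index. Each call becomes a WT
// schema alter (read config, compare, maybe rewrite), so startup time grows
// linearly with the number of tables and indexes.
struct CatalogEntry {
    std::string tableUri;                // e.g. "table:collection-..."
    std::vector<std::string> indexUris;  // one WT URI per index
};

// Stand-ins; the real decision and work live in WiredTigerUtil::useTableLogging
// and WiredTigerUtil::setTableLogging.
bool useTableLogging(const std::string& uri) { (void)uri; return true; }
void setTableLogging(const std::string& uri, bool enable) {
    (void)uri; (void)enable;  // WT alter: read old config, compare, possibly update
}

void loadCatalog(const std::vector<CatalogEntry>& catalog) {
    for (const auto& entry : catalog) {          // serial loop over all tables
        setTableLogging(entry.tableUri, useTableLogging(entry.tableUri));
        for (const auto& idx : entry.indexUris)  // and over every index
            setTableLogging(idx, useTableLogging(idx));
    }
}
```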
The above analysis yields two insights:
- If all MongoDB tables with the same WAL setting are shared into a small number of WT table spaces, the number of setTableLogging operations drops from millions to single digits, which greatly speeds up initialization. Our architecture optimization and testing later confirmed that the hour-level startup time can be reduced to less than 1 minute.
- Fewer setTableLogging operations also means the WT engine rarely needs to acquire the global schema lock for schema alter operations, which leaves room for optimization in MongoServer's upper-layer logic: the serial initialization can be turned into concurrent initialization of tables and indexes by multiple threads, accelerating startup even further.
(5) Summary of performance analysis
In million-table scenarios, the WT engine creates a huge number of data handles, which leads to performance degradation: lock contention, low hit ratios in the caches of key data structures, and the amplification of serial processing time as the number of tables grows.
(6) Reflection
Combining the analysis above: if the MongoServer layer could share table spaces, so that only a small number of physical tables exist in the WT engine, would the performance bottlenecks above be avoided?
III. CMongo architecture optimization for million-level tables: building on the original and surpassing it
(1) Scheme selection
In MySQL, InnoDB shares a table space by assigning a space ID to each table. In MongoRocks, each table is assigned a unique prefix, and every data key carries that prefix.
The native MongoDB code also contains some groundwork following this "prefix mapping" idea; see the definition of KVPrefix and the code and comments around the groupCollections option. However, the feature is not fully implemented (only the basic data structures are defined), so it cannot be used directly. We contacted the author by email, and the community has no further plans to implement it. The related MongoDB JIRA issues are as follows:
Issue | Description | Status |
---|---|---|
"Make WT cursors support groupCollections" -Jira.mongodb.org/browse/SERV… | When groupCollections is enabled, the key_format of data and index tables is changed to qq and qu respectively, and cursors check whether the prefix matches during iteration | fixed |
"Compatible sampling cursor" -Jira.mongodb.org/browse/WT-3… | With groupCollections supported, WT_CURSOR needs a new range() method to support random cursors | open |
"Compatible sampling cursor" -Jira.mongodb.org/browse/SERV… | Related to the issue above: the WT layer needs random cursors that correctly return only data matching the prefix, used when choosing the truncation point of the oplog; according to the original author, more work is needed | closed |
"Official million-table test" -Jira.mongodb.org/browse/WT-3… | POC program to test the groupCollections feature | closed |
"Official groupCollections design document" -Jira.mongodb.org/browse/SERV… | A lot of work is required, including session sharing | – |
"Create only if the underlying table does not exist" -Jira.mongodb.org/browse/SERV… | With groupCollections enabled, first look for a compatible underlying table and create one only if none exists; do not unconditionally delete the underlying table; the oplog keeps its own table | closed |
Based on our testing and analysis of native multi-table scenarios, we believe that sharing table space can solve the above performance bottleneck and improve performance. Therefore, we decided to optimize and validate the shared table space architecture based on the “prefix mapping” approach.
(2) Architecture design
At the beginning of the scheme design, we made the following comparison on which module to implement the shared table space:
- Modify the internal logic of the WT storage engine: multiple logical WT tables are distinguished by prefix and share the same physical WT table space, data handle, btree, block manager and other resources. However, this approach involves a huge amount of code change and a long development cycle.
- Modify the KVEngine abstraction layer in MongoDB: above the storage engine, multiple MongoDB tables share the same WT table space by means of prefix mapping. This mainly involves changing the logic of the KVEngine storage abstraction layer and working around the issues caused by the native WT engine's incomplete support for prefix operations. It requires relatively few code changes and no changes to the WT engine's internal architecture, so both stability and the development cycle are more controllable.
Therefore, the optimization work focused on the KVEngine abstraction layer: multiple MongoDB user tables are mapped into one WT table by prefix; when the MongoServer layer performs a CRUD operation on a given table, the key is converted via key -> prefix+key before the data operation is handed to the WT engine. These mappings are persisted in __mdb_catalog.
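The sketch below illustrates the idea with hypothetical names (it is not the actual CMongo code): each namespace gets a unique int64 prefix recorded in the catalog, and every key handed to the shared WT table is the pair (prefix, original key).

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical sketch of the prefix mapping kept in _mdb_catalog and the
// key -> prefix+key conversion performed on every CRUD operation.
struct PrefixedKey {
    int64_t prefix;    // identifies the logical MongoDB table inside the shared WT table
    int64_t recordId;  // the original RecordId used by the MongoServer layer
};

class PrefixCatalog {
public:
    // Creating a collection no longer creates a WT table; it only assigns a
    // unique prefix (persisted in _mdb_catalog in the real system).
    int64_t prefixFor(const std::string& ns) {
        auto it = _map.find(ns);
        if (it != _map.end()) return it->second;
        int64_t p = _next++;
        _map.emplace(ns, p);
        return p;
    }

    // Every read/write wraps the original key with the table's prefix before
    // the operation is handed to the WT engine.
    PrefixedKey wrap(const std::string& ns, int64_t recordId) {
        return PrefixedKey{prefixFor(ns), recordId};
    }

private:
    std::unordered_map<std::string, int64_t> _map;
    int64_t _next = 1;
};
```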
The overall architecture is shown as follows:
(Optimized overall structure)
Once the mapping is established, no matter how many tables and indexes the user creates at the upper layer, there are only 9 WT tables at the WT engine layer:
- WiredTiger.wt: stores checkpoint metadata
- WiredTigerLAS.wt: stores paged-out LAS data
- _mdb_catalog.wt: stores the mapping between upper-layer MongoDB tables and underlying WT tables
- sizeStorer.wt: stores count data
- oplog-collection.wt: stores oplog data
- collection.wt: stores the data of all non-local MongoDB tables
- index.wt: stores the data of all non-local MongoDB indexes
- local-collection.wt: stores the data of MongoDB local tables
- local-index.wt: stores the data of MongoDB local indexes
Why are tables and indexes stored separately?
For example, the schema of a collection table is qq -> u, while the schema of an index table is qu -> q (q denotes int64, u denotes a variable-length byte string). It would also be possible to unify collections and indexes under a single schema of the form qu -> u, but that would add extra type conversions when reading and writing data.
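At the WiredTiger level, the two shared tables could be declared and used roughly as in the sketch below, which uses the public WT C API; the table names and formats follow the description above, and error handling is omitted for brevity.

```cpp
#include <cstddef>
#include <cstdint>
#include <wiredtiger.h>

// Sketch: shared collection and index tables with the formats described above
// ('q' = int64, 'u' = raw byte item); the leading 'q' in each key is the
// table prefix. Error handling omitted.
void create_shared_tables(WT_SESSION* session) {
    // collection.wt: key = (prefix, RecordId), value = document bytes
    session->create(session, "table:collection", "key_format=qq,value_format=u");
    // index.wt: key = (prefix, index key bytes), value = RecordId
    session->create(session, "table:index", "key_format=qu,value_format=q");
}

void insert_document(WT_SESSION* session, int64_t prefix, int64_t recordId,
                     const void* doc, size_t docLen) {
    WT_CURSOR* cursor = nullptr;
    session->open_cursor(session, "table:collection", nullptr, nullptr, &cursor);

    WT_ITEM value = {};
    value.data = doc;
    value.size = docLen;

    cursor->set_key(cursor, prefix, recordId);  // the prefix is prepended to every key
    cursor->set_value(cursor, &value);
    cursor->insert(cursor);
    cursor->close(cursor);
}
```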
Each table corresponds to one prefix, and the data of multiple tables shares one file through the ternary relationship (ns, prefix, ident). The shared data file looks like this:
(Shared data file)
As a result of the architecture changes, some critical paths have changed:
Path change 1: creating a table or index only assigns a unique prefix and records it in _mdb_catalog; no table is created in the WT engine (the shared WT tables are checked and created the first time the instance starts).
Path change 2: dropping a table or index deletes the record in _mdb_catalog and then deletes the data, but does not delete any WT table file directly.
Path change 3: data read/write operations prepend the prefix to the accessed key before the operation is performed in WT.
With the prefix mapping in place, the number of underlying WT tables no longer grows no matter how many MongoDB tables are created at the upper layer. Consequently, the problems discussed above, such as high memory usage from too many data handles, lock contention during data handle opens, low data handle cache efficiency, and slow instance startup caused by too many WT tables, no longer occur, which greatly improves overall performance.
Of course, sharing table space through prefixes also introduces some new restrictions:
- After a user drops a table, its space is not released to the file system, but it can be reused
- The semantics of some table operations change. For example, compact and validate would have to be implemented globally and are not currently supported
- Some statistics change. For example, the storageSize of a table or index can no longer be reported; only the logical size (the size before compression) can be. Since the native sizeStorer.wt records only the logical size of tables, we had to implement our own statistics for the logical size of indexes. As a side effect, the show dbs (listDatabases) command sped up from 11s to 0.8s in our tests, because statistics no longer have to be gathered from a large number of index files.
(3) Optimization effect
The test environment we set up on Tencent Cloud is as follows:
- Data scale: collection * recordNum * fieldLength
- The total number of load threads was 200
- Replica set configuration: 8 cores, 16 GB memory, 2 TB disk
- Mongo Driver version: go-driver [v1.5.2]
- The connection pool size of the driver is 200
- The test tool is on the same machine as the primary, directly connected to the primary
Test results
Based on actual business usage, the MongoDB team of Tencent's database R&D center selected two scenarios for testing: 500,000 tables and 1,000,000 tables. During testing, the cluster before the modification could not finish writing data in the 1,000,000-table scenario, so the results below are for the 500,000-table scenario. The general test steps were:
- Create 10 databases by default;
- Create 50,000 empty collections in each database;
- Write a fixed amount of random data into each collection;
- Perform CRUD operations.
The following graph compares percentile latencies (because the values differ widely, log10() was applied to all data for display). As the P99 figure shows, CRUD performance after the modification is better than before. The next figure compares QPS:
(QPS comparison before and after modification)
QPS after the modification is also much higher than before. After the modification, all data can essentially be regarded as living in one table, so performance does not change much as the number of tables grows; before the modification, lock contention and data handle access time both increase with the number of tables, resulting in low performance.
The following figure compares the query QPS of the multi-table scenario, before and after the modification, against a native single table. After the modification, QPS drops by at most 7% relative to the native single table; before the modification, the drop exceeds 90%.
(QPS change map)
Online results
As the number of tables in native MongoDB grows sharply, resource consumption increases significantly and performance drops steeply. After the performance analysis, the MongoDB team of Tencent's database R&D center used the idea of shared table space to let the massive number of user-created tables share a single table space in the underlying WT engine, so that the number of tables maintained by the WT engine stays in the single digits regardless of how many tables and indexes the user creates.
In the million-table scenario, the optimized architecture improves read and write performance by 1-2 orders of magnitude, reduces memory consumption, and shortens startup time from the hour level to less than 1 minute. The optimization has passed functional tests and performance stress tests and has been launched. After an instant messaging (IM) service on the cloud switched to the million-table version, the slow CRUD operations caused by the customer's large number of tables were effectively resolved, and system memory consumption dropped significantly. This kernel feature will later be fully opened up on the cloud; you are welcome to try it.
IV. Never stopping, always on the road
Tencent Cloud MongoDB (TencentDB for MongoDB) is a high-performance NoSQL database built by Tencent on top of the globally popular document database MongoDB and is 100% compatible with the MongoDB protocol. Compared with the native version, the MongoDB team of Tencent's database R&D center has made many deep kernel-level customizations and optimizations, including million-level TPS, physical backup, lossless addition and removal of nodes, and RocksDB engine optimization, and has also developed enterprise features such as flow control, auditing, and encryption, providing strong support for the core businesses of customers inside and outside the company.
At present, Tencent MongoDB has made significant breakthroughs in stability, security, performance and enterprise-level features. In the future, we will continue to refine and improve the performance, new features, cost and other aspects, and strive to provide better cloud MongoDB services for users inside and outside the company.
Produced by | TEG MongoDB database R&D team, Tencent R&D center · Planning & visual design | Quality Service Management Group