This article is from OPPO’s Internet Technology team. You are also welcome to follow the OPPO Internet Technology team’s official account, OPPO_tech, where we share OPPO’s cutting-edge Internet technology and activities.

1. Background

The peak TPS of an online cluster exceeds 1 million/s (mostly write traffic, with very little read traffic). This peak has almost reached the upper limit of the cluster, and the average latency exceeds 100ms; as read/write traffic grows further, latency jitter seriously affects service availability. The cluster uses MongoDB’s native sharded architecture, and data is evenly distributed across the shards: after choosing a shard key and enabling sharding, load balancing works well. The following figure shows the traffic monitoring of each node in the cluster:

As can be seen from the figure above, cluster traffic is high, peaking at over 1.2 million/s. This total does not include the delete traffic generated by TTL expiry (deletes are triggered on the primary, but they are not shown there; they only appear when the secondaries pull and replay the oplog). If the primary’s TTL delete traffic is included, total TPS exceeds 1.5 million/s.

2. Software optimization

Without adding any server resources, we first applied the following software optimizations and obtained a performance improvement of several times:

  1. Business-level optimization

  2. MongoDB configuration optimization

  3. Storage engine optimization

2.1 Business-level optimization

The cluster contains nearly 10 billion documents, each retained for three days by default; the business randomly hashes writes so that documents expire at arbitrary points three days later. Because of the huge document count, monitoring during daytime peak hours frequently showed a large number of delete operations on the secondary nodes, and at some points the delete rate even exceeded the business read/write traffic. We therefore decided to shift the TTL deletes to nighttime. The expiry index is added as follows:

db.collection.createIndex( { "expireAt": 1 }, { expireAfterSeconds: 0 } )

With expireAfterSeconds set to 0, a document expires at the absolute point in time stored in its expireAt field, so the business side simply sets expireAt to a time in the early morning three days later:

db.collection.insert( {
  "expireAt": new Date('July 22, 2019 01:00:00'),   // the document will expire at 01:00 in the early morning
  "logEvent": 2,
  "logMessage": "Success!"
} )

By randomly hashing expireAt to a time in the early morning three days later, the burst of TTL-driven deletes during daytime peak hours is avoided, which lowers the cluster load at peak time and ultimately reduces the average latency and jitter seen by the business.
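For reference, a minimal sketch in the mongo shell of how the write path could randomize expireAt to an early-morning slot three days out (the 01:00–05:00 window, the helper function name, and the field values other than expireAt are illustrative assumptions, not the production code):

// Sketch: pick a random time between 01:00 and 04:59, three days from now,
// and store it in expireAt so TTL deletes fire at night instead of at peak.
function randomNightExpiry() {
    var d = new Date();
    d.setDate(d.getDate() + 3);                      // three days later
    d.setHours(1 + Math.floor(Math.random() * 4));   // random hour in 01:00-04:59
    d.setMinutes(Math.floor(Math.random() * 60));
    d.setSeconds(0, 0);
    return d;
}

db.collection.insert( {
    "expireAt": randomNightExpiry(),
    "logEvent": 2,
    "logMessage": "Success!"
} )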

TTL Tip 1: the meaning of expireAfterSeconds

  1. When expireAfterSeconds is 0, the document expires at the absolute point in time stored in expireAt, here 02:01 on December 22:
db.collection.createIndex( { "expireAt": 1 }, { expireAfterSeconds: 0 } )
db.log_events.insert( { "expireAt": new Date('Dec 22, 2019 02:01:00'), "logEvent": 2, "logMessage": "Success!" } )
  2. When expireAfterSeconds is greater than 0, the document expires expireAfterSeconds seconds after the time stored in the indexed field, here 60 seconds after createdAt:
db.log_events.insert( { "createdAt": new Date(), "logEvent": 2, "logMessage": "Success!" } )
db.log_events.createIndex( { "createdAt": 1 }, { expireAfterSeconds: 60 } )

TTL Tip 2: why does mongostat show the TTL delete operations only on the secondary nodes and not on the primary?

The reason is that TTL expiry is triggered only on the primary node. When it fires, the primary deletes the expired documents directly by calling the corresponding WiredTiger storage engine interface, bypassing the normal client command path, so the deletes do not show up in the primary’s delete statistics.

When the primary deletes expired documents, it writes corresponding delete entries to the oplog. The secondaries pull the primary’s oplog and replay it as a simulated client, so the same data is deleted on the secondaries and eventual consistency is guaranteed. Because this replay on the secondary goes through the normal client path, the deletes are counted in its delete statistics.
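As a quick way to observe this on a replica set member, the replicated TTL deletes appear as op: "d" entries in the oplog; a sketch in the mongo shell (the namespace mydb.mycollection is a placeholder):

// Sketch: list the most recent delete entries replicated through the oplog.
db.getSiblingDB("local").oplog.rs
  .find( { op: "d", ns: "mydb.mycollection" } )
  .sort( { $natural: -1 } )
  .limit(5)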

The official reference is as follows: docs.mongodb.com/manual/tuto…

2.2 MongoDB configuration optimization (network I/O multiplexing, separation of network I/O and disk I/O)

The cluster’s TPS is high, with large bursts of pushes on the hour, so concurrency spikes at those times. MongoDB’s default one-thread-per-request model puts a heavy strain on the system load under such conditions and is not suited to highly concurrent read/write scenarios. The official description is as follows:

2.2.1 Implementation principle of Mongodb internal network thread model

In MongoDB’s default network model, one thread is created for each client connection, and that thread handles all read/write requests and disk I/O operations on that connection.

MongoDB’s default network threading model is not suitable for highly concurrent reads and writes, for the following reasons:

  1. Under high concurrency, a large number of threads are created in an instant. In this online cluster, for example, the connection count can jump to about 10,000 almost instantly, which means the operating system must create roughly 10,000 threads at once, and the system load rises sharply.

  2. When the requests have been processed and traffic enters an off-peak period, the client connection pools reclaim connections, and the mongodb server must then destroy threads, which again increases the system load and aggravates database jitter. This is especially pronounced for short-lived connection workloads such as PHP, where threads are frequently created and destroyed, keeping the system load high.

  3. With one thread per connection, the same thread is responsible both for sending and receiving network data and for writing data into the storage engine. Handling all network I/O and disk I/O in a single thread is a flaw of the architecture itself.

2.2.2 Optimization method of network thread model

To adapt to highly concurrent read/write scenarios, mongodb-3.6 introduced the serviceExecutor: adaptive configuration. With it, the number of network threads is adjusted dynamically according to the request load, and network I/O is multiplexed as much as possible, reducing the system load caused by thread creation and destruction.

In addition, with serviceExecutor: adaptive, the boost::asio network module is used to multiplex network I/O and separate it from disk I/O. Under high concurrency, network connections share multiplexed I/O threads, while MongoDB’s lock operations bound the number of threads doing disk I/O, which ultimately reduces the high system load caused by creating and destroying large numbers of threads and improves high-concurrency read/write performance.
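For reference, a minimal mongod.conf sketch of enabling this option (assuming a MongoDB version in the 3.6+ range that still supports the adaptive service executor; the port value is just a placeholder):

# mongod.conf (sketch): switch from the default one-thread-per-connection
# model ("synchronous") to the pooled, multiplexed adaptive executor.
net:
  port: 27017
  serviceExecutor: adaptive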

2.2.3 Performance comparison of network thread model before and after optimization

After adding serviceExecutor: adaptive to this high-traffic cluster to enable network I/O multiplexing and separate network I/O from disk I/O, cluster latency, system load, and slow logs were all greatly reduced. Details follow:

2.2.3.1 System Load Comparison before and after Optimization

Verification mode:

The cluster has multiple shards. The load of a primary node running the optimized configuration in one shard is compared with that of an unoptimized primary node over the same period:

Load with the unoptimized configuration:

Load with the optimized configuration:

2.2.3.2 Slow Log Comparison before and after optimization

Verification mode:

The cluster has multiple shards. The number of slow log entries on a primary node running the optimized configuration in one shard is compared with that of an unoptimized primary node over the same period:

Statistics on the number of slow logs at the same time:

Number of slow log entries without the optimization (19621):

Number of slow log entries with the optimization (5222):

2.2.3.3 Comparison of average delay before and after optimization

Verification mode:

The average latency of all nodes in the cluster with network I/O multiplexing configured is compared against the default configuration.

As can be seen from the figure above, network I/O multiplexing reduces latency by a factor of one to two.

2.3 WiredTiger Storage engine optimization

As the previous section showed, the average latency has dropped from 200ms to about 80ms, which is clearly still high. How can performance be improved further and latency reduced? Continuing to analyze the cluster, we found that disk I/O would sit at 0 for a while and then at 100%, repeatedly dropping to 0, as shown below:

As can be seen from the figure, a burst of roughly 2 GB is written at once, after which I/O blocks for the next several seconds: read/write I/O drops completely to 0, avgqu-sz and await are huge, and util stays at 100%. While I/O is stuck at 0, the TPS observed by the business side also drops to 0.

In addition, after a large burst of I/O writes, util then stays at 0% for a long time.

The overall IO load curve is as follows:

Matching the time points at which I/O util reaches 100% against the mongostat traffic monitoring gives the following:

The mongostat output shows that the moments when I/O util hits 100% are exactly the moments when the TPS seen on the MongoDB side drops sharply.

Given the above symptoms, we can confirm that the problem is caused by disk I/O failing to keep up with the client write rate. Chapter 2 covered the optimization of the mongodb service layer; we now turn to optimizing the wiredtiger storage engine layer, mainly in the following aspects:

  1. cacheSizeGB adjustment

  2. Dirty data eviction ratio adjustment

  3. Checkpoint optimization

2.3.1 Tuning cacheSizeGB (why a larger cache size can mean worse performance)

As the I/O analysis above shows, the timeout points coincide exactly with the periods when I/O blocks and drops to 0, so resolving the I/O drop-to-0 problem is the key.

Checking the per-node TPS during a cluster peak period (total TPS around 500,000/s) showed that it was not especially high, roughly 30,000-40,000 per shard. Why, then, were there massive flushes that could momentarily reach 10 GB/s, causing I/O util to repeatedly get stuck and drop to 0 (because the disk could not keep up with the write rate)? WiredTiger is a B+ tree storage engine: MongoDB documents are first converted to KV pairs and written into the WiredTiger cache, so memory usage keeps growing, and flushing only starts when the checkpoint threshold is reached, triggering a large write to disk. Checking a mongod node showed its memory consumption had reached 110 GB, as shown below:

The cacheSizeGB in the conf configuration file was set to 110 GB, and as can be seen, the total KV data held in the storage engine cache had almost reached 110 GB. The larger cacheSizeGB is set, the more dirty data the storage engine can accumulate and the larger the one-off flush, which is what causes I/O to drop to 0.

Looking at the machine’s memory, the total is 190 GB, of which 110 GB was in use, almost entirely by mongod’s cache. This shrinks the kernel page cache, so under heavy writes the insufficient kernel cache leads to a large number of page faults hitting the disk.

Solution: from the analysis above, this is a write-heavy scenario in which too much dirty data easily leads to a single massive I/O flush, so we lowered the storage engine cacheSizeGB from 110 GB to 50 GB to reduce the amount of data written in each flush, avoiding the I/O blocking caused by flushing a huge amount of data to disk at once during peak hours.
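A minimal mongod.conf sketch of the change, assuming the cache sizes discussed above (the dbPath is a placeholder):

# mongod.conf (sketch): cap the WiredTiger cache at 50GB (previously 110GB)
# so dirty data is flushed in smaller batches rather than one huge burst.
storage:
  dbPath: /data/mongodb
  wiredTiger:
    engineConfig:
      cacheSizeGB: 50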

2.3.2 Storage engine dirty data eviction optimization

Adjusting cacheSizeGB resolved the 5s request timeouts and the corresponding alarms disappeared, but the problem was not completely gone: 1s timeouts still occurred occasionally.

How to tune the cache so as to avoid massive one-off I/O writes, in other words how to balance memory against I/O, therefore became the key question. Looking further into the storage engine, WiredTiger’s cache eviction is governed by the following configuration (defaults shown):

WiredTiger eviction-related configuration, with default values and behavior:

eviction_target (default 80): when cache usage exceeds this percentage of the cache, background evict threads start evicting pages
eviction_trigger (default 95): when cache usage exceeds this percentage of the cache, user threads also start evicting
eviction_dirty_target (default 5): when the dirty-data share of the cache exceeds this percentage, background evict threads start evicting dirty pages
eviction_dirty_trigger (default 20): when the dirty-data share of the cache exceeds this percentage, user threads also start evicting
evict.threads_min (default 4): minimum number of background evict threads
evict.threads_max (default 4): maximum number of background evict threads

After cacheSizeGB was reduced from 110 GB to 50 GB, once the dirty-data ratio reaches 5%, the eviction speed may still fail to keep up with the client write rate in extreme cases, which can again cause an I/O bottleneck and blocking.

Solution: how to further reduce sustained bursts of I/O writes, that is, how to balance cache memory against disk I/O, became the key question. As the table above shows, when the dirty-data ratio or total cache usage reaches a certain level, background threads start picking pages to write out to disk; if the ratios rise further, user threads start doing page eviction themselves, which is a dangerous blocking process that stalls user requests. To balance cache and I/O, we adjusted the eviction policy so that background threads evict data as early as possible (avoiding a massive flush), while raising the user-thread thresholds so that user threads are less likely to evict pages and block. The storage engine configuration was adjusted as follows:

eviction_target: 75%

eviction_trigger: 97%

eviction_dirty_target: 3%

eviction_dirty_trigger: 25%

evict.threads_min: 8

evict.threads_max: 12

The general idea is to have the background evict threads write dirty pages out to disk as early as possible, and to increase the number of evict threads to speed up dirty data eviction. After the adjustment, both the mongostat readings and the client timeouts improved further.
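These eviction parameters are WiredTiger engine settings rather than top-level mongod options; one way to apply them, shown here only as a sketch with the values chosen above, is through the wiredTigerEngineRuntimeConfig parameter on a running mongod (they can also be placed in storage.wiredTiger.engineConfig.configString at startup):

// Sketch: push the tuned eviction settings into WiredTiger at runtime.
db.adminCommand( {
    setParameter: 1,
    wiredTigerEngineRuntimeConfig:
        "eviction_target=75,eviction_trigger=97," +
        "eviction_dirty_target=3,eviction_dirty_trigger=25," +
        "eviction=(threads_min=8,threads_max=12)"
} )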

2.3.3 Checkpoint optimization of the storage engine

The storage engine uses checkpoints to take snapshots and persist all dirty data currently in the storage engine to disk. By default, a checkpoint is triggered by either of the following two conditions:

  1. A checkpoint snapshot is taken every 60 seconds
  2. The incremental redo log (i.e. the journal log) reaches 2 GB

In other words, wiredtiger triggers a checkpoint either when the journal reaches 2 GB or when 60 seconds have passed since the last checkpoint. Between checkpoints, dirty data keeps accumulating, so by the time a checkpoint runs there is a large backlog of dirty data to flush, producing a burst of I/O writes during the checkpoint.

If we shorten the checkpoint interval, less dirty data accumulates between two checkpoints, and the periods of 100% disk I/O become shorter.

The adjusted value of checkpoint is as follows:

checkpoint=(wait=25,log_size=1GB)
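Like the eviction settings, the checkpoint parameters go into the WiredTiger engine configuration; a sketch of setting them in mongod.conf (the same string could instead be applied at runtime through wiredTigerEngineRuntimeConfig):

# mongod.conf (sketch): checkpoint every 25s or every 1GB of journal,
# instead of the default 60s / 2GB.
storage:
  wiredTiger:
    engineConfig:
      configString: "checkpoint=(wait=25,log_size=1GB)"

Note that configString accepts a single comma-separated string, so if the eviction settings above are also applied this way, they would need to be merged into the same string.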
2.3.4 I/O Comparison before and after Storage Engine Optimization

After optimizing the storage engine in the three aspects above, disk I/O is spread much more evenly across time. The iostat view of the optimized I/O load is as follows:

As the I/O load graph above shows, the earlier pattern of I/O sitting at 0% one moment and 100% the next has been alleviated, as summarized in the following figure:

2.3.5 Latency Comparison before and after storage Engine Optimization

The latency comparison before and after the optimization is as follows (note: the cluster is shared by several services at the same time):

As can be seen from the figure above, after the storage engine optimization the latency is further reduced and more stable, dropping from an average of 80ms to around 20ms; but it is still not ideal, and jitter remains.

3. Rectifying the server’s disk I/O problem

3.1 Background of the server I/O hardware problem

As described in the previous chapter, during the wiredtiger tuning we observed that whenever disk writes exceeded about 500 MB/s, util stayed at 100% for the next few seconds while w/s dropped to almost 0, and we increasingly suspected a defect related to the disk hardware.

As shown in the preceding figure, the disk is an NVMe SSD. According to its specifications the disk has excellent I/O performance, supporting 2 GB/s of writes with an IOPS of about 25,000, yet the online disk could only sustain writes of about 500 MB/s.

3.2 Performance comparison after the server I/O hardware problem is resolved

Therefore we migrated all the primary nodes of the sharded cluster to servers of another type, also with SSDs but with I/O performance that reaches 2 GB/s (note: only the primaries were migrated; the secondaries remained on the 500 MB/s servers). After the migration, performance improved further and latency dropped to 2-4ms. The following figure shows latency monitoring at three different service layers:

As can be seen from the figure above, after the primary nodes are migrated to machines with better I/O capability, latency drops further, to an average of 2-4ms.

Although latency has been reduced to an average of 2-4ms, there are still many spikes of tens of ms. For reasons of length, I will share the causes in the next installment; in the end, all latencies were kept within 5ms and the tens-of-ms spikes were eliminated.

In addition, after confirmation and analysis with the vendor, the cause of the NVMe SSD I/O bottleneck was determined to be a Linux kernel version mismatch. If you hit the same problem with NVMe SSDs, upgrade the kernel to 3.10.0-957.27.2.el7.x86_64; after the upgrade, the I/O capability of the NVMe SSDs exceeds 2 GB/s.

4. Summary and remaining problems

Through the MongoDB service layer configuration optimization, the storage engine optimization, and the hardware I/O upgrade, the average latency of this high-traffic write cluster dropped from several hundred ms to an average of 2-4ms, an overall performance improvement of several tens of times.

However, as the post-optimization latency graphs in Section 3.2 show, the cluster still jitters occasionally. For reasons of length, we will share the analysis in the next installment; once that jitter is eliminated, the overall latency stays within 2-4ms with no jitter above 10ms.

In addition, we stepped into a number of pitfalls during the cluster optimization, and the next installment will continue with an analysis of the pitfalls encountered in this high-traffic cluster.

Note: some of the optimization methods in this article are not necessarily applicable to every MongoDB scenario. Optimize according to your actual business scenario and hardware resources, rather than copying the steps blindly.

5. Finally…

We will continue to share the following topics in the near future, please stay tuned:

  1. Million-level High-concurrency MongoDB Cluster Performance Improvement Principle (2)

  2. Performance optimization of millions of highly concurrent MongoDB clusters

  3. A summary and analysis of typical cluster jitter and unavailability problems

  4. Best-practice use cases for the MongoDB document database service

Finally, a quick advertisement: the OPPO Internet operations cloud storage team urgently needs to fill several positions:

If you are interested in MongoDB kernel source code, the WiredTiger storage engine, the RocksDB storage engine, multi-datacenter active-active databases, data link synchronization systems, middleware, or databases, welcome to join the OPPO family and take part in OPPO’s development of document databases for million-level concurrency.

Work location: Chengdu/Shenzhen

E-mail: yangyazhou#oppo.com