This article is from the OPPO Internet Technology team; please credit the author when reposting. You are also welcome to follow the team's official account, OPPO_tech, where we share OPPO's cutting-edge Internet technology and activities.

Previously: the OPPO Internet Technology team published "OPPO Million-Level High-Concurrency MongoDB Cluster Performance Improved Tens of Times: Optimization Practice (Part 1)", which was well received. We are now releasing the second part. If you have not read Part 1, we recommend reading it first to understand the background of the problem and the earlier optimizations, which will make this part easier to follow.

1. Background

The peak TPS of this online cluster exceeds 1 million/s, almost all of it write traffic; read traffic is very low (reads are routed to secondaries through read/write separation, with a QPS of only a few hundred to a few thousand). The peak TPS is close to the cluster's upper limit, the average latency exceeds 100 ms, and the latency jitter seriously affects service availability.

The cluster uses MongoDB's native sharded architecture. After a shard key was chosen and sharding was enabled, the data became evenly distributed across the shards and the load is well balanced. The traffic monitoring of each node in the cluster is shown below:

As the figure shows, the cluster traffic is quite heavy, with peaks above 1.2 million/s. The delete traffic generated by expiring data is not counted in this total (the deletes are triggered on the primary but do not show up in its statistics; they only appear on the secondaries when the oplog is pulled and replayed). If the primary's delete traffic is included, the total TPS exceeds 1.5 million/s.
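
As background, enabling sharding on a collection with a shard key looks roughly like the following in the mongo shell. This is only a sketch: the database name, collection name, and shard key here are hypothetical, since the article does not disclose the real ones.

    // Enable sharding on the database, then shard the collection on a hashed key
    // so that writes are spread evenly across all shards.
    sh.enableSharding("appdb")
    sh.shardCollection("appdb.events", { userId: "hashed" })
    sh.status()   // verify the chunk distribution across shards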

In Part 1, after business-level optimization, MongoDB service-layer configuration tuning, WiredTiger storage engine tuning, and hardware I/O optimization, the overall client latency was brought down from several hundred ms to about 2-5 ms, a substantial improvement. However, under bursts of heavy write traffic there were still latency spikes of several tens of ms. The latency of several different interfaces is shown in the figure below:

2. Review of the hardware problem and remaining issues

In Part 1, the I/O problems of the NVMe SSD hardware were located and analyzed together with the vendor; the root cause turned out to be an unsuitable operating system version. The servers hosting the primary and secondary MongoDB instances were therefore upgraded, and the cluster instances were migrated to the upgraded servers.

The specific operation process is as follows:

  1. To verify the upgraded machines, we first replaced one shard's secondary with an upgraded server (the I/O problem was solved: sustained write throughput rose from about 500 MB/s to nearly 2 GB/s; below we call the upgraded servers "high-I/O servers" and the un-upgraded ones "low-I/O servers"). After the replacement, iostat showed that the secondary's 100% I/O utilization problem was largely alleviated, and the sustained drops in I/O throughput no longer occurred.

  2. After the server from step 1 had run for a week, we concluded that the upgraded high-I/O server was stable as a secondary. To be prudent, we still wanted to verify it in the primary role, so we performed a primary-secondary switchover and made the high-I/O server the primary. At that point one shard in the cluster had a high-I/O primary and low-I/O secondaries.

  3. After the high-I/O server had run as that shard's primary for several weeks, we confirmed that it was also stable in the primary role, and concluded that the upgraded servers were reliable.

  4. Having confirmed that the high-I/O servers were fine, we began migrating MongoDB instances to them in batches. To stay on the safe side, since only one high-I/O server had been verified in both roles, we decided to use high-I/O servers for the primaries only. The reasoning was that the client uses the default write concern, so a write is acknowledged as soon as it reaches the primary; even if the secondaries have slow I/O, as long as they can keep up with the oplog, client latency should remain well controlled (a minimal sketch of this default write concern behaviour follows below).
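
With the default write concern a write is acknowledged once the primary has applied it, without waiting for the secondaries. A minimal mongo shell illustration; the collection and field names are hypothetical:

    // The default behaviour is equivalent to { w: 1 }: the primary acknowledges
    // the write without waiting for replication to the secondaries.
    db.getSiblingDB("appdb").events.insert(
        { userId: 12345, ts: new Date() },
        { writeConcern: { w: 1 } }
    )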

Following this cautious process, we replaced only the primary nodes of all shards. The original cluster hardware architecture, before the upgrade, is shown in the figure below:

The architecture after upgrading the primary node of every shard is as follows:

As the figure shows, in the new architecture the I/O capabilities of the primary and secondary servers differ significantly. At first we assumed this would not affect business writes, because the business side does not set a WriteConcern and the default acknowledges a write as soon as it is applied on the primary.

After all shard primaries were upgraded to high-I/O servers, the access latency of the various service interfaces dropped to an average of 2-4 ms, but there were still spikes of several tens of ms under heavy traffic. The latency of one interface is shown as an example in the figure below:

As the figure shows, the spikes are still obvious, especially at moments of heavy traffic.

3. Optimization after the primary node hardware upgrade

3.1 readConcern configuration optimization

In the previous section we replaced all shard primaries with high-I/O servers, while the secondaries remained un-upgraded low-I/O servers. Since the business side does not set a WriteConcern, we assumed the client would be acknowledged as soon as the write succeeded on the primary, even if the secondaries were performing poorly.

After upgrading the primary servers, we continued to tune the storage engine and raised eviction_dirty_trigger from 25% to 30%.
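
For reference, this kind of WiredTiger parameter can be changed at runtime without restarting mongod. The article does not say exactly how we applied the change, so the following is only a sketch of one standard way to do it:

    // Raise the dirty-eviction trigger from 25% to 30% of the cache at runtime.
    db.adminCommand({
        setParameter: 1,
        wiredTigerEngineRuntimeConfig: "eviction_dirty_trigger=30"
    })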

Because of the bursty high-concurrency traffic, TPS could instantly jump from hundreds of thousands to millions even during off-peak periods, and the spikes appeared two or three times almost every day, so they were easy to reproduce. We deployed mongostat in advance to monitor all instances and used iostat to watch real-time I/O on every server. We also wrote a script to periodically collect key cluster information such as db.serverStatus(), db.printSlaveReplicationInfo(), and db.printReplicationInfo(); a minimal sketch of such a script follows.
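
The exact script is not published, so this is only a sketch of the idea, using the standard 3.6 shell and the standard serverStatus() WiredTiger cache statistic names; it could be run periodically against each mongod with "mongo --quiet host:port sample.js":

    // sample.js: print the dirty-cache percentage and replication status once.
    var cache = db.serverStatus().wiredTiger.cache;
    var dirtyPct = cache["tracked dirty bytes in the cache"] * 100 /
                   cache["maximum bytes configured"];
    var usedPct  = cache["bytes currently in the cache"] * 100 /
                   cache["maximum bytes configured"];
    print(new Date().toISOString() + " dirty=" + dirtyPct.toFixed(1) +
          "% used=" + usedPct.toFixed(1) + "%");
    db.printSlaveReplicationInfo();   // lag of each secondary behind the primary
    db.printReplicationInfo();        // oplog window of this node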

When monitoring showed a spike at a certain point, we analyzed the mongostat output and found a problem: even during off-peak periods, the dirty-data percentage kept climbing until it reached the threshold (30%). When the dirty-data percentage exceeds the eviction_dirty_trigger threshold of 30%, user threads are drafted into eviction and block until cache memory is freed, and this eviction is slow. Analyzing the mongostat output at the off-peak spike moments, we found the following:

As the figure shows, even when cluster TPS was only about 400,000-500,000, the dirty data on one shard's primary reached the eviction_dirty_trigger threshold of 30%, and the access latency of the whole cluster increased instantly. The reason is that user threads on that shard had to perform eviction, which increased that shard's access latency (the other shards were actually normal) and ultimately pulled up the overall average latency.

Why would jitter occur even during ordinary off-peak periods? Something clearly did not add up.

We therefore collected monitoring information from the problematic primary and drew the following conclusions:

  1. I/O is normal. I/O is not a bottleneck.

  2. The system load shown by top during the jitter was normal.

  3. The TPS of this shard was only around 40,000, clearly far from the shard's peak.

  4. db.printSlaveReplicationInfo() showed that the primary-secondary replication lag was high.
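
The lag reported by db.printSlaveReplicationInfo() can also be computed programmatically from rs.status(); a small sketch using only the standard replica-set shell helpers:

    // Estimate how far each secondary lags behind the primary, in seconds.
    var status = rs.status();
    var primary = status.members.filter(function (m) {
        return m.stateStr === "PRIMARY";
    })[0];
    status.members.forEach(function (m) {
        if (m.stateStr === "SECONDARY") {
            print(m.name + " lag: " + (primary.optimeDate - m.optimeDate) / 1000 + "s");
        }
    });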

When the client-side monitoring showed a latency spike, everything on the primary looked normal: system load, I/O, and TPS were nowhere near their bottlenecks. The only anomaly was the continuously growing primary-secondary replication lag, as shown in the figure below:

The I/O status of the low-I/O secondary server is as follows:

The secondary's I/O performance is clearly poor, which is the source of the growing replication lag.

As the figure shows, the replication lag was large at the same moments as the latency spikes, so we suspected the spikes were related to the speed at which the secondaries pulled the oplog. We therefore kept mongostat, iostat, top, db.printSlaveReplicationInfo(), and db.serverStatus() running continuously for two days and recorded the core system and MongoDB metrics over that period.

Two days later, we analyzed the monitoring data at the moments when the client saw latency spikes and found a common pattern: the spikes coincided with the moments when the dirty data exceeded the eviction_dirty_trigger threshold, and the replication lag was also very large at those moments.

At this point we increasingly suspected that the problem was related to how fast the secondaries pulled the oplog. We had previously assumed that, since the business side does not set a WriteConcern, the client is acknowledged as soon as the write reaches the primary. But perhaps something else happens before the client is answered that affects writes. Checking the Production Notes for MongoDB 3.6, we found the following:

Production Notes: MongoDB 3.6 enables read concern "majority" by default.

MongoDB added this feature to avoid dirty reads (reading data that could later be rolled back). With it enabled, MongoDB guarantees that data read with readConcern("majority") has actually been replicated to a majority of the replica set members. To do this, MongoDB has to keep more version information in memory, via snapshots and primary-secondary communication, which increases the memory requirements of the WiredTiger storage engine.
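
For illustration, this is roughly what a majority read looks like from the shell (the collection name is hypothetical). Only reads issued this way get the guarantee, but the bookkeeping cost on the primary is paid as long as the feature is enabled:

    // Read only data that has been acknowledged by a majority of replica set members.
    db.getSiblingDB("appdb").events
        .find({ userId: 12345 })
        .readConcern("majority")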

Because the secondaries are low-I/O servers, they are easily congested and cannot pull the oplog fast enough. The primary then has to keep a large amount of snapshot information in memory, memory consumption grows, dirty data rises sharply and quickly reaches the eviction_dirty_trigger threshold, and the service jitters as a result.

As a side note, because MongoDB 3.6 enables enableMajorityReadConcern by default, we hit a serious incident during this process: at one point the cluster's business traffic suddenly jumped, and latency stayed at several thousand ms for a while. The symptoms are shown below:

The root cause was the enableMajorityReadConcern feature. Because the secondaries lagged far behind the primary, the primary had to keep many snapshots in memory, consuming a large amount of memory; the longer the oplog lag, the more version information the primary had to maintain, so the dirty-data percentage kept growing until the secondaries caught up with the oplog. Since our business does not need the readConcern feature, we decided to disable it by adding enableMajorityReadConcern = false to the replication section of the configuration file.
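
In the MongoDB 3.6 YAML configuration file this corresponds to the following setting (other options omitted):

    replication:
      enableMajorityReadConcern: false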

For reasons of length, we will not analyze in detail here the serious business impact caused by enableMajorityReadConcern combined with the insufficient I/O capability of the secondary hardware. We will later publish a dedicated article, "Pitfalls in Optimizing a Million-Level High-Concurrency MongoDB Cluster", which covers this readConcern pitfall along with several other important ones; please stay tuned.

In addition, we will write an analysis article on the principle and code implementation of ReadConcern.

3.2 Replacing the Secondary Node Server with the Upgraded High I/O Server

In addition to setting replication.enableMajorityReadConcern = false in the configuration file to disable the readConcern majority feature, we went on to replace all the shard secondaries, which were still low-I/O servers, with upgraded high-I/O servers. After this upgrade the hardware of all primaries and secondaries is identical. The cluster's shard architecture after the upgrade is shown below:

After disabling the feature and unifying the hardware of the primary and secondary servers, the latency of one interface that had been jittering looked as follows:

As the figure shows, after disabling majority read concern and upgrading all the low-I/O secondary servers, the peak of the latency jitter dropped further, from about 80 ms in Section 2 to around 40 ms.

In addition, the problem described in Section 3.1, where dirty data persistently hitting the eviction_dirty_trigger threshold drove client latency up to several thousand ms, was also resolved.

3.3 Further tuning the storage engine parameters

After the previous rounds of optimization, one business interface still occasionally showed 40 ms latency spikes. Analysis showed that the main cause was eviction_dirty_trigger reaching the configured threshold, at which point business threads start evicting cache pages themselves; this slows the business threads and ultimately produces the spikes in average latency.

To further mitigate the latency spikes, we continued tuning the storage engine on top of the previous settings. The adjusted configuration is as follows:

    eviction_target: 75%
    eviction_trigger: 97%
    eviction_dirty_target: 3%
    eviction_dirty_trigger: 30%
    evict.threads_min: 12
    evict.threads_max: 18
    checkpoint=(wait=20,log_size=1GB)
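
Applied through the same runtime mechanism sketched in Section 3.1, the full set of adjustments maps to the WiredTiger configuration syntax roughly as follows (the article does not state exactly how the settings were applied, so this is only an illustration):

    db.adminCommand({
        setParameter: 1,
        wiredTigerEngineRuntimeConfig:
            "eviction_target=75,eviction_trigger=97," +
            "eviction_dirty_target=3,eviction_dirty_trigger=30," +
            "eviction=(threads_min=12,threads_max=18)," +
            "checkpoint=(wait=20,log_size=1GB)"
    })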

After this round of storage engine tuning, the latency of the service's core interface improved further: the maximum spike dropped from the previous 40 ms to about 30 ms, and the spikes became noticeably less frequent, as shown in the figure below:

Even so, at peak TPS in the millions the eviction rate on some nodes still cannot keep up with the write rate, so user threads are occasionally drafted into eviction.

Strangely, when we analyzed the system load, I/O, and memory of the affected machine at those moments, the system load was fairly normal, but the server's I/O utilization was high, as shown in the following figure:

We also analyzed the slow logs of that server at the same moments. The slow-log statistics at the times when spikes appeared are as follows:

For comparison, the slow-log statistics at times without latency spikes are as follows:

Comparing the slow logs at the two kinds of time points shows that the number of slow operations rises exactly when the latency spikes appear, that is, when the I/O load is high.

From the above analysis, when I/O utilization is high (util above 50%), both the number of slow operations and the latency increase, and the two are roughly proportional.

4. Summary

Through software optimization (MongoDB service configuration, business optimization, storage engine tuning) and hardware optimization (operating system upgrade), the latency of this high-traffic cluster's core interfaces was reduced from an initial average of several hundred ms to an average of 1-2 ms, a considerable improvement: overall latency performance improved by tens of times.

Main interface delay before optimization:

Without adding any physical machines, after this series of optimizations the latency of the business side's main interfaces was finally brought down to a few ms, as shown in the figure below:

Coming next

In the near future we will continue to share articles on the following topics; please stay tuned:

  • Pitfalls encountered while optimizing a million-level high-concurrency MongoDB cluster

  • Summary and analysis of typical online cluster jitter and unavailability

  • Best-practice case sharing for using the MongoDB document database service

Finally, a recruiting note: the OPPO Internet operations and cloud storage team is urgently hiring for several positions.

If you are interested in MongoDB kernel source code, the WiredTiger storage engine, the RocksDB storage engine, cross-datacenter active-active databases, data synchronization pipelines, middleware, or databases in general, welcome to join the OPPO family and help build OPPO's massively high-concurrency document database.

Location: Chengdu/Shenzhen

E-mail: yangyazhou#oppo.com