This article is from the OPPO Internet Basic Technology team. Please credit the author when reposting. You are also welcome to follow our official account, OPPO_tech, where we share OPPO's cutting-edge Internet technology and activities.
1. Background
A core Java long-connection service uses MongoDB as its main storage, with hundreds of clients connected to the same MongoDB cluster. The service suffered several performance jitters within a short period, followed by an "avalanche" failure in which traffic instantly dropped to zero and could not recover on its own. This paper analyzes the root causes of these two failures, which involve a series of problems including improper client configuration, a defect in the MongoDB kernel's connection-authentication path, and incomplete proxy configuration. The root causes were finally pinned down after several rounds of investigation.
More than ten service interfaces access the cluster, and each interface is deployed on dozens of servers, so the total number of clients accessing MongoDB exceeds several hundred. Some requests pull dozens or even hundreds of rows of data at a time.
The cluster is a two-machine-room, multi-active cluster within the same city (the election node consumes few resources and is deployed in a third machine room in a different location). The architecture diagram is as follows:
As shown in the figure above, to achieve multi-active access, Mongos agents are deployed in each machine room, and the clients of each machine room connect to that room's Mongos agents; each room has multiple agents. The proxy-layer IP:PORT address lists are as follows (note: not the real IP addresses):
Room A agent address list: 1.1.1.1:1111, 2.2.2.2:1111, 3.3.3.3:1111
Room B agent address list: 4.4.4.4:1111, 4.4.4.4:2222
The three agents in room A are deployed on three different physical machines, while the two agents in room B are deployed on the same physical machine. Rooms A and B are in the same city, so cross-room access latency can be ignored.
The cluster's storage layer and Config Server use the same architecture: room A (1 primary + 1 secondary) + room B (2 secondaries) + room C (1 arbiter election node), i.e. a 2 (data nodes) + 2 (data nodes) + 1 (election node) mode.
This multi-active machine-room architecture ensures that the failure of either machine room has no impact on the services in the other. The principle is as follows:
-
If machine room A fails, the agents in machine room B are unaffected, because the agent is a stateless node.
-
If machine room A goes down while hosting the primary node, three voting nodes remain in rooms B and C (two data nodes plus the arbiter), which still satisfies the "more than half of the voting nodes" condition required for a new election. A data node in room B is therefore elected as the new primary within a short time, and storage-layer access is not affected.
This paper focuses on the following six questions:
-
Why did burst traffic cause jitter?
-
Why did the data nodes show no slow logs at all while the agent load briefly spiked to 100%?
-
Why did the Mongos agents cause an "avalanche" that lasted for hours and could not recover on its own?
-
Why did the jitter continue after services were switched from the jittering machine room's agents to the other machine room?
-
Why did packet capture show clients frequently establishing and dropping connections, with only a very short interval between the two?
-
In theory the agent only does layer-7 forwarding, consumes fewer resources, and should be faster than mongod storage. Why did the mongod storage nodes show no jitter at all while the Mongos agents did?
2. Fault process
2.1 Occasional Traffic Peaks and Service Jitter
The cluster jittered several times within a period. When the clients in machine room A jittered and the agent load in room A was high, the room-A clients were switched to access the agents in machine room B.
2.1.1 Slow Log Analysis of storage Nodes
First, we analyzed the system CPU, memory, I/O, and load monitoring of all mongod storage nodes in the cluster and found everything normal. We then analyzed the slow logs of each mongod node (since the cluster is latency-sensitive, the slow-log threshold had been lowered to 30ms). The results are as follows:
As shown in the preceding figure, the mongod storage nodes produced no slow logs at all when the service jittered, so we can conclude that all mongod nodes were normal and the jitter had nothing to do with the mongod storage layer.
2.1.2 Mongos agent analysis
Since the storage nodes showed no problems, we turned to the Mongos agent nodes. For historical reasons the cluster was deployed on another platform whose QPS and latency monitoring was incomplete, so the early jitter was not detected by monitoring. After the jitter we migrated the cluster to the new management and control platform developed by OPPO, which has detailed monitoring. The QPS curve after migration is as follows:
For each traffic increase time point, the corresponding service monitoring has a timeout or jitter, as follows:
Analyzing the corresponding Mongos logs revealed the following symptom: at the jitter time points, mongos.log contained a large number of connection establishment and disconnection events, as shown in the following figure:
As can be seen from the figure above, thousands of connections are established within one second while thousands of others are disconnected. Packet capture also showed many connections being torn down shortly after being established (disconnection time minus establishment time = 51ms; some connections were dropped after a little over 100ms):
Packet capture is as follows:
In addition, the number of client connections on the agent machine was very high at peak, even exceeding the normal QPS: QPS was about 7000-8000, but the conn count reached 13000. The monitoring obtained with mongostat is as follows:
2.1.3 Agent machine load analysis
During each traffic burst the agent load was high. We sampled it periodically with a deployed script; the monitoring at the jitter time points is as follows:
As can be seen from the figure above, CPU load is very high during each traffic peak, and it is almost entirely sy% (kernel) load with very low us% (user) load; the load average even reaches several hundred and occasionally exceeds 1000.
2.1.4 Jitter analysis summary
Based on the above analysis, the system appears to be heavily loaded by burst traffic at certain time points. But is burst traffic really the root cause? It is not: as the subsequent analysis shows, this turned out to be a false conclusion, and a few days later the same cluster avalanched.
The business team then identified the interface responsible for the burst traffic and disabled it. The QPS monitoring curve afterwards is as follows:
To reduce the service jitter, the burst-traffic interface was disabled, and after a few hours the jitter stopped. Besides dropping that interface, we also did the following:
-
Since the real reason for the 100% Mongos load had not been found, we expanded the Mongos agents to four per machine room, all on different servers, to spread the traffic and reduce the per-agent load as far as possible.
-
We asked the business teams to configure all eight agents of rooms A and B on every interface, instead of only the agents in the local room. (After the first jitter we analyzed the MongoDB Java SDK and confirmed that its balancing strategy automatically excludes agents with high request latency, so if one agent has problems again it will be kicked out automatically.)
-
We asked the business teams to increase all client timeouts to 500ms.
However, many doubts remained, mainly the following:
-
There are 4 storage nodes and 5 proxy nodes, yet the storage nodes show no jitter at all while the layer-7-forwarding agents are heavily loaded.
-
Why does packet capture show many new connections being torn down after only tens of milliseconds or a little over 100ms? Why the frequent connection establishment and teardown?
-
Why is the agent CPU consumption so high, and almost entirely sy% system load, when the agent QPS is only in the tens of thousands? In my years of middleware proxy development experience, proxies consume very few resources, and their CPU consumption is us%, not sy%.
2.2 The same business “avalanches” a few days later
The good times did not last long. A few days later a more serious fault occurred: the service traffic of machine room B dropped straight to zero at a certain moment. This was no longer a simple jitter problem; traffic dropped to zero and the system sy% load was 100%.
2.2.1 Analysis of machine system monitoring
Machine CPU and system load monitoring are as follows:
As can be seen from the figure above, the system load is almost identical to the previous burst-traffic phenomenon: the CPU sy% load is 100% and the load average is very high. Logging in to the machine and running top showed the same picture as the monitoring.
The corresponding network monitoring at a time is as follows:
Disk I/O monitoring is as follows:
From the system monitoring above, during the problem period the CPU sy% and load were very high, network read/write traffic dropped to almost zero, and disk I/O was normal. The whole process was almost identical to the jitter caused by burst traffic.
2.2.2 How to Restore Services
After the jitter caused by the first traffic burst, we had expanded the agents to eight and asked the business teams to configure all agents on every service interface. However, many business interfaces accessing room B had not yet been updated: they were still configured with only the original two agents, which sit on the same physical machine (4.4.4.4:1111, 4.4.4.4:2222). This eventually triggered a MongoDB performance bottleneck (see the analysis later) and caused the entire MongoDB cluster to "avalanche".
Finally, the services were restarted with all eight agents configured for the room-B business at the same time, and the problem was resolved.
2.2.3 Mongos agent instance monitoring analysis
Analysis of the proxy logs during this period shows the same phenomenon as in 2.1: a large number of new connections, closed again after tens of milliseconds to a little over 100ms. The whole picture matches the previous analysis, so the corresponding logs are not statistically analyzed again here.
In addition, from the proxy QPS monitoring at the time, the QPS curve of normal query read requests is as follows; during the failure the QPS dropped to almost zero as the cluster avalanched:
The command statistics monitoring curve is as follows:
The statistics above show that the agent node sees a spike at the moment of the traffic failure: the command count at that point suddenly jumps to 22000 (it may actually be higher, since our monitoring samples every 30 seconds and this is only the average), that is, about 22,000 connections come in almost instantly. The spike is in the isMaster() command statistics: after a client connects to the server, its first packet is an isMaster request; the server replies by executing db.ismaster(), and the client then starts SASL authentication.
The normal client access process is as follows:
-
The client initiates a connection to Mongos.
-
After the Mongos server accepts the connection, the connection is established.
-
The client sends the db.ismaster() command to the server.
-
The server replies to the client with the isMaster result.
-
The client performs SASL authentication with the Mongos agent (multiple round trips with Mongos).
-
The client starts the normal find() flow.
After a connection is established, the SDK client sends db.ismaster() to the server in order to determine the node type and feed its load-balancing policy: it lets the client quickly sense which agents have a long access delay so that those nodes can be excluded quickly.
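For illustration, the same first-packet probe can be issued manually. The sketch below is my own example, not the SDK's internal code; it assumes the legacy MongoDB 3.x Java driver and uses one of the placeholder mongos addresses from section 1. It runs the ismaster command and measures its round-trip time, which is exactly the signal the SDK uses to rank agents:

```java
import com.mongodb.MongoClient;
import com.mongodb.ServerAddress;
import org.bson.Document;

public class IsMasterProbe {
    public static void main(String[] args) {
        // Placeholder mongos address taken from the proxy list above.
        MongoClient client = new MongoClient(new ServerAddress("1.1.1.1", 1111));
        long start = System.nanoTime();
        // The same command the driver sends as the first packet on every new connection.
        Document reply = client.getDatabase("admin").runCommand(new Document("ismaster", 1));
        long rttMs = (System.nanoTime() - start) / 1_000_000;
        // The reply tells the node type; the measured RTT feeds server selection,
        // so agents with long access delay can be excluded quickly.
        System.out.println("ismaster reply: " + reply.toJson() + ", rtt=" + rttMs + "ms");
        client.close();
    }
}
```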
In addition, the script deployed in advance automatically captures packets when the system load is high. The captured packet analysis results are as follows:
The temporal analysis of the above figure is as follows:
-
11:21:59.506174 The connection is established successfully
-
11:21:59.506254 The client sends db.ismaster() to the server
-
11:21:59.656479 The client sends a FIN to tear down the connection
-
11:21:59.674717 The server sends the db.ismaster() reply to the client
-
11:21:59.675480 The client replies directly with an RST
The difference between the third packet (the FIN) and the first packet is about 150ms, and the business team confirmed that the client at this IP address is configured with a 150ms timeout. Other captures show timeout configurations such as 40ms and 100ms; after checking with the client and business owners, each capture could be matched to the timeout configured on that service interface. Combining the packet captures with the client configuration, we can conclude that when the agent fails to return db.ismaster() within the configured timeout, the client times out immediately and immediately initiates a reconnection.
Conclusion: packet capture and Mongos log analysis show that the reason connections are dropped so soon after being established is that the client's first request to the agent, db.ismaster(), times out, causing the client to reconnect. After reconnecting, db.ismaster() is sent again, and because the CPU load is at 100%, each request after reconnection times out as well. For clients whose timeout is configured at 500ms, db.ismaster() does not time out, so they proceed to SASL authentication.
It is therefore clear that the high system load is related to the repeated connection establishment and teardown: at one point the clients established roughly 22,000 connections, driving the load up. Moreover, because client timeouts are configured differently, some clients do enter the SASL flow, which fetches random numbers from the kernel and drives sy% load even higher; the high sy% load causes more clients to time out, and the whole access pattern becomes a vicious circle, ending in a Mongos agent avalanche.
2.3 Offline simulation of faults
At this point we had roughly identified the cause, but why would 20,000 requests drive sy% load to 100% at the moment of failure? In theory, tens of thousands of connections per second should not cause such a serious problem; after all, the machine has 40 CPU cores. So analyzing why repeated connection establishment and teardown pushed the system sy% load to 100% became the key to this fault.
2.3.1 Simulating the fault process
The steps to simulate the frequent connect/disconnect fault are as follows:
-
Modify the mongos kernel code so that every request is delayed by 600ms.
-
Run two identical Mongos instances on the same machine, distinguished by port.
-
Start 6000 concurrent connections on the client with a timeout of 500ms.
With these settings, every request is guaranteed to time out. After timing out, the client immediately reconnects, and the request after reconnecting times out again, so the repeated connect/disconnect cycle is reproduced. In addition, to keep the environment consistent with the avalanche failure, the two Mongos agents were deployed on the same physical machine.
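As a rough sketch of what such a test client can look like (not the exact harness we used; it assumes the legacy 3.x Java driver and a placeholder local mongos address and port), the following program opens many concurrent clients with a 500ms timeout so that, against a mongos whose requests are delayed by 600ms, every request times out and forces a reconnect plus a fresh isMaster/SASL exchange:

```java
import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ServerAddress;
import org.bson.Document;

public class ReconnectStorm {
    public static void main(String[] args) {
        MongoClientOptions opts = MongoClientOptions.builder()
                .connectTimeout(500)            // connection establishment timeout
                .socketTimeout(500)             // every request, including isMaster, times out after 500ms
                .serverSelectionTimeout(500)
                .build();
        for (int i = 0; i < 6000; i++) {        // 6000 concurrent clients, as in the test
            new Thread(() -> {
                while (true) {
                    // A new MongoClient per iteration forces fresh connections, so each
                    // timed-out request is followed by reconnection and re-authentication.
                    MongoClient c = new MongoClient(
                            new ServerAddress("127.0.0.1", 1111), opts); // placeholder mongos
                    try {
                        c.getDatabase("test").getCollection("t").find(new Document()).first();
                    } catch (Exception ignored) {
                        // expected: the 600ms server-side delay makes this time out
                    } finally {
                        c.close();
                    }
                }
            }).start();
        }
    }
}
```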
2.3.2 Fault simulation test results
To keep the hardware environment identical to the failed Mongos agents, we selected the same server model with the same operating system version (2.6.32-642.el6.x86_64). The problem appeared as soon as the programs started running:
Since the faulty server runs a rather old Linux-2.6 kernel, we suspected the operating system version might be at fault, so we upgraded a physical machine of the same model to Linux-3.10 and repeated the test:
As can be seen from the figure above, with 6000 clients repeatedly reconnecting, the server pressure is normal: almost all CPU consumption is us%, sy% consumption is very low, user CPU uses about 3 cores, and kernel CPU is close to zero. This is the result we expect, which suggests the problem is related to the operating system version.
To verify whether the Linux-3.10 kernel would eventually show the same sy% kernel CPU consumption as the 2.6 kernel, we increased the concurrency from 6000 to 30000. The results are as follows:
Test results: by modifying the MongoDB kernel to deliberately make client requests time out and repeatedly reconnect, the system CPU sy% 100% problem can be reproduced on Linux-2.6 with little more than 1500 concurrent reconnecting clients. On Linux-3.10, sy% load only starts to climb above 10000 concurrent clients, and the higher the concurrency, the higher the sy% load.
Summary: on Linux-2.6, just a few thousand MongoDB reconnections per second push the system sy% load close to 100%. On Linux-3.10, 20000 concurrent reconnections bring sy% load to about 30%, and sy% rises further as client concurrency increases. Linux-3.10 is a significant improvement over 2.6 for the repeated connect/disconnect scenario, but it does not solve the fundamental problem.
2.4 Root Cause of sy% 100% Under Repeated Client Reconnection
To analyze why the sy% system load is so high, we installed perf and sampled the system, finding that almost all CPU time is consumed in the following function:
The perf analysis shows the CPU is consumed in the _spin_lock_irqsave function. Continuing to analyze the kernel-mode call stack yields the following stack:
```
- 89.81% [kernel] [k] _spin_lock_irqsave
   - _spin_lock_irqsave
      - mix_pool_bytes_extract
           extract_buf
           extract_entropy_user
           urandom_read
           vfs_read
           sys_read
           system_call_fastpath
           0xe82d
```
The stack above shows that MongoDB is reading /dev/urandom, and because multiple threads read the device at the same time, they contend heavily on a kernel spinlock.
The root cause, then, is not simply that there are tens of thousands of connections per second. The real cause is that each new connection from the Mongo client makes the MongoDB server create a new thread, which at some point calls urandom_read to read the random device /dev/urandom; because many threads read it concurrently, they spin on the kernel's entropy-pool spinlock, driving the CPU sy% load up.
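This contention can also be illustrated outside of MongoDB. The sketch below is my own illustration, not part of the original investigation: it simply has many threads read /dev/urandom in a loop, so that each read enters the urandom_read path shown in the stack above, and on a 2.6-era kernel the concurrent readers serialize on the entropy-pool spinlock:

```java
import java.io.FileInputStream;
import java.io.IOException;

public class UrandomContention {
    public static void main(String[] args) {
        for (int i = 0; i < 200; i++) {          // hundreds of concurrent readers
            new Thread(() -> {
                byte[] nonce = new byte[24];      // 3 x 64-bit integers, like the SASL nonce
                try (FileInputStream in = new FileInputStream("/dev/urandom")) {
                    while (true) {
                        // Each read() is a sys_read on /dev/urandom and ends up in
                        // urandom_read/extract_entropy_user inside the kernel.
                        if (in.read(nonce) < 0) {
                            break;
                        }
                    }
                } catch (IOException ignored) {
                }
            }).start();
        }
    }
}
```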
2.5 MongoDB Kernel random Number Optimization
2.5.1 MongoDB kernel source location analysis
Since the kernel-side contention comes from multiple MongoDB threads reading /dev/urandom, we searched the MongoDB kernel source for the places where this device is read. Walking through the code, the main places are the following:
```cpp
// When the server receives the first SASL authentication packet from a client,
// it generates a random nonce. On mongos:
SaslSCRAMSHA1ServerConversation::_firstStep(...) {
    ...
    unique_ptr<SecureRandom> sr(SecureRandom::create());
    binaryNonce[0] = sr->nextInt64();
    binaryNonce[1] = sr->nextInt64();
    binaryNonce[2] = sr->nextInt64();
    ...
}

// For the mongod storage node, mongos is the client, so mongos acting as a
// client also needs to generate a random nonce:
SaslSCRAMSHA1ClientConversation::_firstStep(...) {
    ...
    unique_ptr<SecureRandom> sr(SecureRandom::create());
    binaryNonce[0] = sr->nextInt64();
    binaryNonce[1] = sr->nextInt64();
    binaryNonce[2] = sr->nextInt64();
    ...
}
```
2.5.2 MongoDB kernel source code random number optimization
As the analysis in 2.5.1 shows, during SASL authentication of new client connections mongos generates random numbers by reading "/dev/urandom", which is what drives the system sy% CPU so high. Optimizing this random-number generation is therefore the key to solving the problem.
Continuing through the MongoDB kernel source, we found that random numbers are used in many places, and some of them are generated with a user-space algorithm. We can therefore generate the SASL nonce the same way, entirely in user space. The core of the user-space random number generator is the following class:
```cpp
class PseudoRandom {
    ...
    uint32_t _x;
    uint32_t _y;
    uint32_t _z;
    uint32_t _w;
};
```
This algorithm guarantees a good random distribution of the generated values. For the principle of the algorithm, see:
En.wikipedia.org/wiki/Xorshi…
The algorithm can also be found at the following git address: MongoDB random number generation algorithm notes
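For reference, here is a minimal Java sketch of the xorshift128 scheme that the PseudoRandom class follows (the seed constants are the textbook defaults, not necessarily the ones MongoDB uses); the point is that it needs no system call, no /dev/urandom, and therefore no kernel lock:

```java
// Illustrative xorshift128 generator: pure user-space state, no kernel involvement.
public final class XorShift128 {
    private int x = 123456789;
    private int y = 362436069;
    private int z = 521288629;
    private int w = 88675123;

    public int nextInt() {
        int t = x ^ (x << 11);
        x = y;
        y = z;
        z = w;
        w = w ^ (w >>> 19) ^ (t ^ (t >>> 8));
        return w;
    }
}
```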
Summary: by switching the random number generation used in SASL authentication to the user-space algorithm, the CPU sy% 100% problem is solved, and proxy performance in short-connection scenarios improves several-fold to tens of times.
3. Question summary and answer
From the above analysis, the fault was triggered by a combination of factors, including improper client configuration, an extreme-case defect in the MongoDB server kernel, and incomplete monitoring. In summary:
-
Client configurations were not uniform: the many service interfaces in the cluster were configured in different ways, with different timeout and connection settings, which made packet capture and troubleshooting difficult. Timeouts that are set too small cause repeated reconnection.
-
Clients should be configured with all Mongos agents, so that when one agent fails the client SDK removes the failed node by default, minimizing business impact and avoiding single points of failure.
-
Multiple service interfaces in a cluster must be configured in the same configuration center to avoid inconsistent configurations.
-
The random-number generation on new connections in the MongoDB kernel has a serious defect that, in extreme cases, can cause severe performance jitter and even a service "avalanche".
At this point, we can answer the six questions from Chapter 1 as follows:
Why did burst traffic cause jitter?
A: The business is a Java service and accesses the mongos agents through a connection pool. When there is a traffic burst, the connection pool grows its connection count to keep up with the load on MongoDB, so the client adds connections. The first step of SASL authentication on each new connection is to generate a random number, which requires reading the operating system's "/dev/urandom" file. Because the mongos proxy model still defaults to one connection per thread, many threads end up reading this file at the same instant, driving the kernel sy% load up.
Why did the Mongos agents cause an "avalanche", with traffic dropping to zero and the service unusable?
A: At some moment the client traffic surged, the connection pools did not have enough connections, and the clients opened more connections to the Mongos agents. Because this is an old cluster, the agents still use the default one-connection-one-thread model, so a large number of connections (and threads) appear in an instant, and each successfully established connection then starts SASL authentication. For every authentication, the mongos server generates a random number by reading "/dev/urandom", and concurrent reads of this file drive the system sy% load up. Because the sy% load was so high and the client timeouts were set too small, client requests timed out, the clients reconnected, the reconnections triggered SASL authentication again with still more reads of "/dev/urandom", and the cycle kept repeating and reinforcing itself.
In addition, after the first service jitter the server side was expanded to eight Mongos agents, but the clients were not updated. The business in machine room B was still configured with only two agents, both on the same server, so the Mongo Java SDK's strategy of automatically excluding highly loaded nodes could not help, and the "avalanche" finally occurred.
Why did the data nodes show no slow logs while the agent CPU sy% load hit 100%?
A: The client Java programs access the Mongos agents directly, so the flood of connections occurs only between the clients and Mongos. And because the client timeouts were set too short (some interfaces at tens of milliseconds, some at a little over 100ms, some at 500ms), a chain reaction started at the traffic peak (the burst pushed the system load up, which made the clients time out quickly, which made them reconnect quickly, which pushed the load further up, in an endless loop). There is also a connection pool between Mongos and mongod, but with Mongos acting as the client of the mongod storage nodes the timeout is very long (second-level by default), so it does not cause repeated timeout-driven reconnection.
Why did the jitter continue after services were switched from the room-A agents to room B?
A: When the service in machine room A jittered and was switched to machine room B, the clients had to re-establish and re-authenticate all their connections with the room-B agents. This triggered the same storm of repeated connection establishment, teardown, and "/dev/urandom" reads, so the machine-room multi-active design ultimately could not help.
Why did packet capture show clients frequently establishing and dropping connections, with only a very short interval between the two?
Answer: The root cause of the frequent connection establishment and teardown is the high system sy% load; the reason the client drops a connection so soon after establishing it is that the configured client timeout is too short.
In theory the agent only does layer-7 forwarding, consumes fewer resources, and should be faster than mongod storage. Why did the Mongos agents jitter severely while the mongod storage nodes did not jitter at all?
A: Because this is a sharded architecture, a layer of Mongos agents sits in front of all mongod storage nodes, and the Mongos agent acts as the client of the mongod storage nodes with a default timeout at the second level. The Mongos-to-mongod connections therefore do not time out and reconnect repeatedly; the reconnection storm and the resulting authentication load are confined to the client-to-Mongos layer.
If the MongoDB cluster used an ordinary replica set instead, would frequent client reconnection cause the same "avalanche" of the mongod storage nodes?
A: Yes. If there are many clients, the operating system kernel version is old, and the timeouts are set too small, the same thing happens with direct access to the replica set's mongod storage nodes, because the authentication process between client and storage node is the same as with a mongos agent. Frequent reads of "/dev/urandom" would still be triggered, driving CPU sy% too high and, in extreme cases, causing an avalanche.
4. “Avalanche” solutions
From the above analysis, the problem is a combination of unreasonable client configuration and a defect in the MongoDB kernel's authentication path, which reads random numbers from the kernel and, in extreme cases, leads to an avalanche. If you do not have MongoDB kernel development capability, you can avoid the problem by standardizing the client configuration. Of course, if you both standardize the client configuration and fix how the MongoDB kernel obtains random numbers in this path, the problem is solved completely.
4.1 JAVA SDK Client Configuration Standardization
In a business scenario with many service interfaces and many client machines, the client configuration must follow the rules below (a configuration sketch is given after the list):
-
Set the timeout to the second level, to avoid the repeated reconnection caused by timeouts that are set too short.
-
Configure all Mongos proxy addresses on the client; never configure a single one. Otherwise, the traffic concentrated on that single Mongos can easily cause an instantaneous peak of connection authentication.
-
Increase the number of Mongos agents so that, at any moment, each agent handles as few new connections as possible. When a client is configured with multiple agents, traffic is balanced across them by default, and if one agent's load is high the client automatically excludes it.
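A configuration sketch that follows these rules is shown below. It assumes the legacy MongoDB 3.x Java driver and uses the placeholder room-A mongos addresses from section 1; the exact pool sizes are illustrative, not prescriptive:

```java
import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ServerAddress;
import java.util.Arrays;

public class MongoClientFactory {
    public static MongoClient build() {
        MongoClientOptions opts = MongoClientOptions.builder()
                .connectTimeout(2000)            // second-level, never tens of milliseconds
                .socketTimeout(3000)             // second-level request timeout
                .serverSelectionTimeout(3000)
                .connectionsPerHost(50)          // cap the pool so a burst cannot open thousands of links
                .minConnectionsPerHost(10)       // keep warm connections to limit new SASL handshakes
                .build();
        // Configure ALL mongos proxies, never a single one; the driver balances across
        // them and automatically excludes an agent whose latency is high.
        return new MongoClient(Arrays.asList(
                new ServerAddress("1.1.1.1", 1111),
                new ServerAddress("2.2.2.2", 1111),
                new ServerAddress("3.3.3.3", 1111)), opts);
    }
}
```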
If you do not have the ability to develop the MongoDB kernel source, you can follow the client configuration rules above and avoid the Linux-2.6 kernel (use Linux-3.10 or later), which is basically enough to avoid stepping into the same kind of pit.
4.2 MongoDB kernel source code optimization (drop kernel-mode random numbers, use a user-space random number algorithm)
See section 2.5.2 for details.
4.3 How PHP short-connection services can avoid the pit
Because PHP services use short connections, high traffic inevitably means frequent connection establishment, each going through the SASL authentication process and therefore many threads frequently reading "/dev/urandom", which easily reproduces the problem described above. In this case, follow a client specification similar to the Java one in 4.1, and use a 3.x or later kernel instead of an older Linux kernel to avoid the problem.
5. MongoDB kernel source code design and implementation analysis
The source code analysis of the MongoDB thread models and random number algorithms mentioned in this article is available at the following links:
MongoDB dynamic thread model source code design and implementation analysis:
Github.com/y123456yz/r…
MongoDB one-connection-one-thread model source code design and implementation analysis:
Github.com/y123456yz/r…
MongoDB kernel-state and user-state random number algorithm implementation analysis:
Github.com/y123456yz/r…