In day-to-day database operations, the monitoring system is an essential auxiliary tool for detecting, alerting on, and troubleshooting faults, and it plays an important role in problem diagnosis and analysis for DBAs, operations engineers, and business developers alike. The quality of a monitoring system also largely determines whether a fault can be located accurately and fixed correctly so that it does not recur. Three important factors for evaluating a monitoring system are monitoring granularity, completeness of monitored metrics, and real-time behavior.
In terms of monitoring granularity, many current systems only achieve minute-level, or at best half-minute-level, monitoring. In today's fast-moving software environment this is increasingly inadequate: short bursts of anomalies simply cannot be seen. However, finer granularity multiplies the volume of monitoring data and shortens the collection interval accordingly, which is a serious test of resource consumption.
In terms of metric completeness, most current systems collect a predefined set of metrics. This approach has a major drawback: if the importance of a metric is not recognized in advance and it is left out, yet it turns out to be the key metric for a fault, that fault is very likely to remain unexplained.
And as for the real-time nature of monitoring: no one cares about the past; what matters is what is happening right now.
A system that does well on all three of these fronts can be called a good monitoring system. Inspector, the second-level monitoring system developed by Alibaba Cloud, achieves a true granularity of one data point per second, collects all metrics without omission, and even automatically collects and displays real-time data for metrics that have never appeared before. With one point per second, no database jitter can hide; with full metric collection, DBAs get comprehensive and complete information; and with real-time display, both the onset of a fault and its recovery are known the moment they happen.
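To give a concrete sense of what one-point-per-second collection without predefined metrics looks like, here is a minimal, hypothetical sketch using pymongo; it is not the Inspector implementation, and the connection string is an assumption. It polls serverStatus once per second and records every numeric field it finds, so new metrics are picked up automatically:

    import time
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumed instance address

    def flatten(doc, prefix=""):
        """Recursively flatten serverStatus into {"a.b.c": value} pairs."""
        out = {}
        for key, value in doc.items():
            name = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict):
                out.update(flatten(value, name))
            elif isinstance(value, (int, float)):
                out[name] = value
        return out

    while True:
        status = client.admin.command("serverStatus")
        sample = flatten(status)
        # A real collector would write the samples to a time-series store;
        # here we just print the timestamp and the number of metrics captured.
        print(int(time.time()), len(sample))
        time.sleep(1)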
When you encounter a db access timeout, you can use Inspector to check whether the MongoDB instance is running properly.
Case 1
An online service on the business side was using a MongoDB replica set with read/write separation. One day, a large number of online reads suddenly timed out, and Inspector made it obvious that the replication latency of the secondary was extremely high at the time:
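For readers who want to verify replication lag directly on the instance, here is a hedged sketch using pymongo and the standard replSetGetStatus output; the connection string is an assumption:

    from pymongo import MongoClient

    client = MongoClient("mongodb://primary-host:27017")  # assumed address

    status = client.admin.command("replSetGetStatus")
    # optimeDate is the wall-clock time of the last oplog entry applied by each member.
    primary_optime = max(m["optimeDate"] for m in status["members"] if m["stateStr"] == "PRIMARY")
    for member in status["members"]:
        if member["stateStr"] == "SECONDARY":
            lag = (primary_optime - member["optimeDate"]).total_seconds()
            print(member["name"], "lag seconds:", lag)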
High latency on the secondary means that its oplog replay threads cannot keep up with the write speed of the primary. If primary and secondary have identical configurations, yet the secondary cannot keep pace, it can only mean the secondary was doing some high-cost work beyond normal business operations at that time. After further checking, we found that the db's cache usage had surged at that moment:
As the monitoring shows, cache usage climbed rapidly from around 80% to the 95% evict trigger line, and the dirty cache also grew until it reached its own evict trigger line. In WiredTiger, when cache usage reaches the trigger line, WT considers the evict threads too slow at evicting pages, so user (application) threads are drafted into eviction as well, which causes large numbers of timeouts. This can be verified through the application evict time metric:
The figure above clearly shows that user threads spent a great deal of time on eviction, which is why so many normal requests timed out. Investigation on the business side then revealed a large number of data-migration jobs running at that time, which had filled the cache. After throttling the migration jobs and enlarging the cache, the whole DB became smooth again.
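The same conclusion can be double-checked from the command line. Below is a hedged sketch that reads the wiredTiger.cache section of serverStatus and computes the cache fill and dirty ratios; the statistic names are the standard WiredTiger ones, but their availability can vary by version, and the connection string is an assumption:

    from pymongo import MongoClient

    client = MongoClient("mongodb://secondary-host:27017")  # assumed address

    cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
    max_bytes = cache["maximum bytes configured"]
    used_bytes = cache["bytes currently in the cache"]
    dirty_bytes = cache["tracked dirty bytes in the cache"]

    # Eviction normally starts around 80% usage; application threads join around 95%.
    print("cache used %:", round(100.0 * used_bytes / max_bytes, 1))
    # Dirty eviction target/trigger default to roughly 5% / 20%.
    print("cache dirty %:", round(100.0 * dirty_bytes / max_bytes, 1))
    # If this counter keeps growing, application threads are being pulled into eviction.
    print("pages evicted by application threads:",
          cache.get("pages evicted by application threads", "n/a"))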
Case 2
One day, an online business running on a sharded cluster suddenly reported access timeout errors, which disappeared again after a short while. Experience suggested that some lock operation was probably causing the timeouts. Inspector showed that the lock queue on one shard was high at the time of the failure:
This basically confirmed our conjecture that a lock was causing the access timeouts. So what exactly was making the lock queue spike? A quick look at the current operations showed that authentication commands on that shard had surged:
By reading the code we found that although mongos and mongod authenticate with a keyfile, the authentication actually goes through the SCRAM mechanism of the SASL command, which takes a global lock during authentication. The flood of authentications therefore caused the global-lock queue to spike, which in turn caused the access timeouts.
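A hedged sketch of how the lock queue and the in-flight authentication commands could be inspected directly on the shard (saslStart/saslContinue are the wire commands behind SCRAM; the connection string is an assumption, and on newer versions the $currentOp aggregation stage can be used instead of the currentOp command):

    from pymongo import MongoClient

    client = MongoClient("mongodb://shard-host:27017")  # assumed shard address

    # Readers/writers waiting on the global lock; a sustained spike here matches the symptom.
    queue = client.admin.command("serverStatus")["globalLock"]["currentQueue"]
    print("waiting readers:", queue["readers"], "waiting writers:", queue["writers"])

    # Count operations that are currently running authentication commands.
    ops = client.admin.command("currentOp")["inprog"]
    auth_ops = [op for op in ops
                if any(cmd in op.get("command", {})
                       for cmd in ("saslStart", "saslContinue", "authenticate"))]
    print("in-flight auth commands:", len(auth_ops))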
In the end we adjusted the number of client connections downward to reduce the global-lock contention, and the resulting timeouts, caused by sudden bursts of authentication.
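On the client side, the fix amounts to bounding how many connections, and therefore how many fresh SCRAM handshakes, each application instance can open. With pymongo that could look roughly like the following; the pool sizes are illustrative values, not the ones used in this case:

    from pymongo import MongoClient

    # Keep a small, warm pool of long-lived connections instead of letting each
    # traffic burst open (and authenticate) a flood of new ones.
    client = MongoClient(
        "mongodb://mongos-host:27017",   # assumed mongos address
        maxPoolSize=50,                  # cap concurrent connections per host
        minPoolSize=10,                  # keep some connections warm to avoid auth storms
        maxIdleTimeMS=300000,            # recycle idle connections slowly
    )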
These two cases show that sufficiently fine monitoring granularity and sufficiently comprehensive metrics are essential for troubleshooting faults, and real-time display is clearly valuable in monitoring-wall scenarios.
Finally, second-level monitoring is now available on the Alibaba Cloud MongoDB console. Cloud MongoDB users can enable it themselves and experience the high definition that second-level monitoring brings.