Background

At the GOPS Global Operations Conference (Beijing), I heard a lot of valuable content. Particularly instructive was the talk “From 0 to 1 to N: The Full Perspective of Tencent’s Monitoring System”, given by Nie Xin from Tencent’s SNG business unit.

In his talk, he summarized the evolution of Tencent’s monitoring system over the years as point monitoring -> surface monitoring -> depth monitoring:

When I saw this slide, something clicked. At the time of the talk, our system had just completed its migration to microservices, and we had launched a call chain (distributed tracing) system to troubleshoot and track specific problems in the distributed microservice system. Our monitoring system, however, had not evolved along with the architecture: most of it remained at the point-monitoring level, and the small amount of surface monitoring that spanned multiple services was fairly elementary, still requiring manual analysis and intervention.

Point monitoring is easy to understand: monitoring points are placed in the system, and alarms are triggered when thresholds are exceeded.
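As a minimal illustration (the metric name and threshold below are made up, not our actual configuration), point monitoring boils down to a check like this:

```python
# Minimal sketch of point monitoring: one metric checked against a fixed threshold.
def check_point(metric_name: str, value: float, threshold: float) -> bool:
    """Return True if an alarm should be raised for this single monitoring point."""
    return value > threshold

if check_point("mysql.connection_errors_per_min", 12, threshold=5):
    print("ALARM: mysql.connection_errors_per_min exceeded threshold")
```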

Surface monitoring correlates alarm information in time and space, which effectively suppresses spurious (“burr”) alarms and makes alerting more accurate. Alarms are time-sensitive: there is a delay before they arrive, and a continuous stream of alarms may just be noise, so time correlation alone is not enough. It is the combination of link (spatial) correlation and time correlation that determines accuracy.

Depth monitoring, frankly, rides the deep-learning hype a little. Judging from the talk, it really means further refining the link correlation of surface monitoring and using machine learning to do simple classification on the collected data.

After the conference, we clarified the evolution direction of our monitoring system, made some trade-offs based on the characteristics of our own system, and settled on a three-dimensional monitoring scheme.

Objectives of the three-dimensional monitoring scheme

By “three-dimensional monitoring” we mean extending the current point monitoring along the time and space (service link) dimensions, reusing the existing monitoring probes as much as possible, so that the whole system is monitored in both time and space.

Three-dimensional monitoring should eliminate the spurious alarms produced by point monitoring. For example, when services depend on each other, the cascading alarms should be analyzed and merged by three-dimensional monitoring, and only then should a final alarm be raised.

Three-dimensional monitoring should also locate system faults quickly. The granularity of fault location is determined by the monitoring type: microservice level, interface level, database instance level, cache instance level, and so on.

Design and implementation of the three-dimensional monitoring scheme

For microservices, we send the information to be monitored through the data bus (Kafka) for collection. The same approach is used by our call chain (distributed tracing) system. It has to be said that the data bus is an extremely important and practical basic component once an architecture has been broken into microservices.
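A rough sketch of how a monitoring event might be pushed onto the data bus. The kafka-python client is used purely for illustration, and the broker address, topic, and field names are assumptions rather than our actual schema:

```python
import json
import time

from kafka import KafkaProducer  # kafka-python; an illustrative choice of client

# Producer for the monitoring data bus; broker address and topic are placeholders.
producer = KafkaProducer(
    bootstrap_servers="databus-kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def emit_event(service: str, kind: str, detail: str) -> None:
    """Write a monitoring event to the data bus topic consumed by the Event Collector."""
    event = {
        "service": service,
        "kind": kind,          # e.g. "error", "warning", "timeout"
        "detail": detail,
        "timestamp": int(time.time()),
    }
    producer.send("monitor-events", event)

emit_event("order-service", "error", "failed to reserve inventory")
```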

To minimize the invasiveness of integrating the monitoring component into each microservice, we integrate by modifying the base libraries. For example, if a microservice uses a MySQL database, we modify the database driver it depends on so that database access is monitored: whenever an error or warning occurs, a message is written to the data bus. If a microservice needs to call one of Tencent’s interfaces, we modify our HTTP client base library to write error and timeout messages to the data bus, and that interface is monitored as well.
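To make the base-library idea concrete, here is a sketch of the HTTP-client side of it, with requests standing in for our actual client library and a stub in place of the data-bus helper; it is an illustration of the approach, not our production code:

```python
import requests  # stand-in for the base HTTP client library we actually wrap

# Placeholder for the data-bus helper from the previous sketch.
def emit_event(service: str, kind: str, detail: str) -> None:
    print(f"[databus] {service} {kind}: {detail}")

def monitored_get(service: str, url: str, timeout: float = 3.0, **kwargs):
    """GET wrapper installed in the base library: callers keep their usual API,
    but errors and timeouts are reported to the data bus as monitoring events."""
    try:
        resp = requests.get(url, timeout=timeout, **kwargs)
        if resp.status_code >= 500:
            emit_event(service, "error", f"{url} returned {resp.status_code}")
        return resp
    except requests.Timeout:
        emit_event(service, "timeout", f"{url} timed out after {timeout}s")
        raise
    except requests.RequestException as exc:
        emit_event(service, "error", f"{url} failed: {exc}")
        raise
```

Because callers keep the same function signature, the change is invisible to business code, which is exactly why the rollout described below needed no developer involvement.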

As a result, integrating monitoring into a microservice requires almost no work from developers; operations staff only need to update the service’s dependency libraries and re-release it. In fact, we never scheduled a dedicated release for monitoring integration: it went out together with regular feature or hotfix releases, and nearly 200 microservices were integrated within two weeks. One caveat: changes to the base libraries must be made by relatively senior developers on the team and must be tested and rolled out gradually, otherwise they can cause large-scale problems.

For infrastructure such as the Kubernetes cluster, databases, and Redis, we reuse the existing point-monitoring data, but the events are now reported uniformly to the Event Collector service. Besides collecting event data, the Event Collector also filters out data that does not conform to the common rules.
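The filtering step might look something like the following sketch; the required fields and the set of event kinds are assumptions matching the earlier example schema, not our real rule set:

```python
REQUIRED_FIELDS = {"service", "kind", "detail", "timestamp"}
KNOWN_KINDS = {"error", "warning", "timeout"}  # illustrative rule set

def accept_event(event: dict) -> bool:
    """Drop events that do not conform to the common rules before they reach the analyzer."""
    if not REQUIRED_FIELDS.issubset(event):
        return False
    if event["kind"] not in KNOWN_KINDS:
        return False
    return isinstance(event["timestamp"], int) and event["timestamp"] > 0
```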

The Event Analyzer service analyzes the events collected by the Event Collector along the time and space dimensions and sends alarm notifications.

For event analysis, the time dimension is simple: we select the events within a time window (currently 1 minute, based on our experience). The spatial dimension is trickier. We did not adopt the full-link analysis algorithm Tencent uses, mainly because that algorithm requires a pre-generated link topology. In a microservice architecture, services are added, removed, and changed very frequently, so a pre-generated topology is something of an anti-pattern and carries a real maintenance cost. I suspect Tencent does not consider pre-generated topologies a problem because its current architecture is not fully microservice-based and it has strict internal change-management processes.
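On the time dimension, the windowing really is about this simple (the event dict shape is the same assumed schema as in the earlier sketches):

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # the 1-minute window mentioned above

def group_by_window(events):
    """Bucket events into fixed 1-minute windows; each bucket is analyzed together."""
    windows = defaultdict(list)
    for event in events:
        window_start = event["timestamp"] - event["timestamp"] % WINDOW_SECONDS
        windows[window_start].append(event)
    return windows
```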

Since we did not want to pre-generate the link topology, the link-area calculation formula given in the talk could not be used. Moreover, because Tencent’s full-link analysis algorithm has no theoretical proof behind it, we were not confident it would work as well for us, given that our data volume is far smaller than Tencent’s.

But the idea can be borrowed. After some discussion, we agreed that so-called link analysis is really correlation analysis, and for correlation analysis over a link graph there is no reason to hesitate about using Google’s PageRank algorithm:

As shown in the figure above, we treat the alarm events between services (1, 2, 3, 4) as hyperlinks and compute PR values. The service with the highest PR value is then the one we consider most likely to be the source of the problem. We skip the theoretical proof here and happily hand that responsibility back to Google.
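A small sketch of that idea using networkx. The service names and edges are invented, and pointing each alarm edge from the caller toward the dependency it saw failing is an assumption about the modeling, not something prescribed by the talk:

```python
import networkx as nx  # networkx's built-in PageRank, used purely for illustration

# Each alarm of the form "caller observed its dependency failing" becomes a
# directed edge caller -> dependency, analogous to a hyperlink pointing at it.
alarm_edges = [
    ("service-1", "service-3"),
    ("service-2", "service-3"),
    ("service-3", "service-4"),
    ("service-1", "service-4"),
]

graph = nx.DiGraph()
graph.add_edges_from(alarm_edges)

pr = nx.pagerank(graph)  # standard damping factor of 0.85
suspect = max(pr, key=pr.get)
print(f"most likely root cause: {suspect} (PR={pr[suspect]:.3f})")
```

With this orientation, the service that the most alarm edges converge on accumulates the highest PR value, which matches the intuition that the shared dependency at the bottom of a cascade is the likeliest culprit.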

Note that when a complex system fails, it is often not a single component or service that breaks; several services may fail at the same time. So when computing PR you may be faced with multiple independent digraphs. The relationships between these independent graphs can then be processed with machine learning over historical data to further pinpoint the root cause, after which the fault can be handled quickly according to the runbook and service restored.
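When several parts of the system fail at once, the alarm graph falls apart into independent subgraphs. A sketch of handling that split (the machine-learning step over historical data that relates the components is not shown here):

```python
import networkx as nx

def candidate_per_failure_domain(alarm_graph: nx.DiGraph):
    """Split the alarm graph into independent digraphs (weakly connected components)
    and return the highest-PR suspect within each one."""
    suspects = []
    for nodes in nx.weakly_connected_components(alarm_graph):
        component = alarm_graph.subgraph(nodes)
        pr = nx.pagerank(component)
        suspects.append(max(pr, key=pr.get))
    return suspects
```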

The Event Analyzer also provides a page for viewing the links behind historical alarms:

It can also be used to monitor and analyze link traffic (the thicker the edge, the greater the traffic):

Conclusion

Before three-dimensional monitoring went online, operations could only troubleshoot problems point by point; now they can go straight to the developers of the affected services based on the Event Analyzer’s aggregated alarm information. We have also uncovered many latent problems that experience used to write off as “system jitter” and ignore. With the support of call chain tracing and three-dimensional monitoring, our microservice architecture has completed another step in its evolution.