It is easy to prevent trouble before it starts, but hard to remove it once it has taken hold.
As a programmer, have you noticed that on every holiday there are programmers at the major scenic spots opening their laptops to deal with online emergencies? When a flood of alarms fires in production, how do we judge where the problem actually lies? When a major problem hits production in the middle of the night, how do we wake the right people so they can respond quickly? Many readers will be familiar with these questions. The importance of monitoring is self-evident, so how do we build a monitoring system that helps programmers find and locate problems efficiently? This article introduces the monitoring practice of Baidu's game microservices. Building on Baidu's mature monitoring infrastructure, we have built a relatively complete monitoring system, and we describe how we got there below.
The full text contains 4583 words and takes an estimated reading time of 9 minutes.
1. Background
With the rapid growth of the business, each game server developer maintains two to three services on average, and new business scenarios may introduce more. To understand the running state of the microservice system efficiently, and to find and remove faults quickly when a business exception occurs, the game server developers have made many attempts at monitoring in practice.
Early monitoring relied on the company's Argus platform (log and server monitoring), the Monitor platform (business monitoring), and SIA (visual monitoring), which together covered some basic monitoring. Because these pieces were not organized into a system and were not tied closely to the business, the overall effect was not ideal: many problems were still reported by customer service and product staff. One of the most troublesome points when tracking a problem was that locating it often took a long time, which hurt the business. We therefore sorted out the problems we faced, designed and optimized the monitoring system as a whole, and focused on combining problem location closely with the business, which greatly improved the efficiency of locating problems.
The rest of this article walks through how we built the monitoring system, in the hope that it helps readers.
2. Microservice monitoring
In the early stage, we mainly added individual monitors on top of Baidu's monitoring infrastructure, but the effect was limited because they did not form a system. Although our monitoring capability was weak and incomplete at that stage, these scattered measures still helped the R&D engineers find many system problems and laid a foundation for the systematic, multi-dimensional monitoring that came later.
2.1 Log and server monitoring
The game microservices use the Baidu Argus monitoring platform to monitor the machine status and business logs of the online services.
Our initial use of Argus was one-dimensional and did not go deep enough into business scenarios. For example, the initial design did not consider per-instance monitoring thresholds for a given problem, or multi-dimensional alarm capability. The following introduces Baidu Argus's monitoring capability and process:
Argus's overall data flow is as follows. It supports alerts by phone call, SMS, and Baidu Stream.
A familiar equivalent in the industry is the ELK Stack (Elasticsearch + Logstash + Kibana). Beats (optional) is installed on each server as the log collection client; Logstash collects, parses, and filters the logs in a unified way; the data is then sent to Elasticsearch for storage and analysis; and finally Kibana displays the data.
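The per-instance threshold idea mentioned earlier in this section can be illustrated with a minimal sketch. It is not how Argus or ELK work internally: the log format ("<level> <instance> <message>"), the function name find_noisy_instances, and the host names are all assumptions made for illustration.

```python
from collections import Counter


def find_noisy_instances(log_lines, threshold):
    """Return {instance: error_count} for instances at or above the threshold.

    Assumes a hypothetical log format: "<level> <instance> <message>".
    """
    errors = Counter()
    for line in log_lines:
        parts = line.split(maxsplit=2)
        # Count only well-formed ERROR lines, keyed by instance name.
        if len(parts) >= 2 and parts[0] == "ERROR":
            errors[parts[1]] += 1
    return {inst: n for inst, n in errors.items() if n >= threshold}


# Made-up sample logs from two hypothetical instances.
sample = [
    "ERROR host-a timeout calling pay",
    "INFO host-a ok",
    "ERROR host-a timeout calling pay",
    "ERROR host-b disk full",
]
noisy = find_noisy_instances(sample, threshold=2)
print(noisy)
```

Counting per instance rather than per service is what lets a threshold single out one bad machine instead of alarming on the whole fleet.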
2.2 Service polling monitoring
We use the Baidu Monitor platform to run scheduled polling probes against core interfaces, which helps monitor the quality of online services. The Monitor platform supports visual configuration, but each scenario needs its own custom configuration. As the business iterates rapidly, the efficiency and ease of adding new monitors cannot keep up with business needs.
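To make the polling mechanism concrete, here is a minimal sketch of scheduled probing with an alarm on consecutive failures. It does not reflect the Monitor platform's actual configuration or API: the probe callables, the fail_limit rule, and the interface names are assumptions.

```python
import time


def poll_once(probes):
    """Run every probe once; a probe that raises counts as a failure."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    return results


def poll(probes, rounds, interval_s, fail_limit, sleep=time.sleep):
    """Poll for a number of rounds and return the names that reached
    fail_limit consecutive failures (each name is reported once per streak)."""
    streak = {name: 0 for name in probes}
    alerted = []
    for i in range(rounds):
        for name, ok in poll_once(probes).items():
            streak[name] = 0 if ok else streak[name] + 1
            if streak[name] == fail_limit:
                alerted.append(name)
        if i < rounds - 1:
            sleep(interval_s)
    return alerted


def broken_probe():
    # Stand-in for a real HTTP check that times out.
    raise TimeoutError("simulated timeout")


alerts = poll(
    {"login": lambda: True, "pay": broken_probe},
    rounds=3, interval_s=0, fail_limit=2,
)
print(alerts)
```

Requiring consecutive failures before alarming is one common way to keep a single network blip from paging anyone; the right limit depends on the interface.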
2.3 Visual monitoring of services
Using the company's SIA intelligent monitoring system, we visualize indicators such as service traffic, availability, and performance. This helps business R&D observe the state of online services at a glance and raise alarms on abnormal states. However, our services did not make full use of SIA's intelligent monitoring capability, so the visualization played only a limited supporting role and the intelligent capability was not put to use.
Figure 3. Monitoring visualization
The industry's visual monitoring tools, such as Kibana and Grafana, are already very capable and can meet most presentation needs; they are worth looking into.
3. Evolution of microservice monitoring
As described above, the early monitoring measures helped R&D find and locate some problems, but many problems remained, mainly in the following four areas:
- Risk exposure lags: most alarms fire only after the impact has already occurred;
- No unified monitoring plan: monitoring items are disorganized and coverage is incomplete;
- Weak monitoring capability: the monitors cannot provide useful information about the exception;
- Alarm noise: R&D is bombarded with alarm messages.
Weighing the costs and benefits of the overall monitoring system, we did not throw away the existing monitoring but improved on the basic capabilities already in place. First, we designed the monitoring system as a whole from a systematic perspective; then we strengthened each part of it according to that design.
3.1 Systematic design of monitoring
Objective: effective prevention, timely detection, and quick stop-loss.
Implementation: the design objective breaks down into the following implementation ideas.
The systematic monitoring of the microservice system is designed around four aspects: risk control, intelligent monitoring, intelligent alarms, and efficient problem location. The overall process is as follows:
These four aspects are introduced one by one below.
3.1.1 Risk control design
The earlier an online problem is found, the better. Because R&D engineers differ in experience and code review alone cannot reliably prevent online problems, game business R&D has invested heavily in automated test cases and in the release process to reduce the number of problems. The following are the main risk control items for R&D. Applying them prevents more than 95% of release-related problems.
3.1.2 Intelligent monitoring design
The game business's initial monitoring was added piecemeal: log monitoring used Argus and visual monitoring used the SIA intelligent monitoring platform. Monitoring coverage and the synergy between the monitoring systems were not considered as a whole, which exposed several problems:
Question 1:
Monitoring organized by monitored object covers the system effectively along a single dimension. But how do we detect fluctuations of the system as a whole?
Question 2:
When pvlost surges on one instance because of a network or disk failure, how do we get that information efficiently?
Question 3:
When system availability fluctuates, is the problem in a specific machine room, in a specific interface, or in a downstream dependency?
(1) Intelligent anomaly detection
We use the intelligent anomaly detection algorithms of the SIA system and bring indicators such as latency, traffic, SLA, and revenue into the monitoring system, so that periodic and non-periodic fluctuations of the system are detected efficiently. The following describes the main algorithms.
Applying these algorithms to game business metrics such as traffic, latency, and revenue lets us detect periodic fluctuations, non-periodic fluctuations, and even slow declines, which greatly improves the coverage of anomaly detection.
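The source does not spell out SIA's algorithms, so the following is only an illustrative sketch of two generic checks in the same spirit: a period-over-period baseline for periodic metrics, and a window comparison for slow declines. The function names, the k-sigma rule, and the sample numbers are assumptions, not SIA's implementation.

```python
from statistics import mean, pstdev


def periodic_anomaly(history, current, k=3.0):
    """Flag `current` if it deviates more than k standard deviations from
    values seen at the same time slot in earlier periods (e.g. prior days)."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) > k * sigma


def slow_decline(series, window, max_drop):
    """Flag a gradual drop: compare the mean of the last `window` points
    with the mean of the first `window` points of the series."""
    if len(series) < 2 * window:
        return False
    first = mean(series[:window])
    last = mean(series[-window:])
    if first == 0:
        return False
    return (first - last) / first > max_drop


# Traffic at the same hour on five earlier days (made-up numbers).
same_hour_history = [100, 102, 98, 101, 99]
print(periodic_anomaly(same_hour_history, 60))
print(periodic_anomaly(same_hour_history, 101))

# Each step is only about 1% lower, but the trend adds up.
revenue = [100, 99, 98, 97, 96, 95, 94, 93]
print(slow_decline(revenue, window=2, max_drop=0.05))
```

The second check matters because a metric that loses about 1% per step never trips a point-to-point threshold, yet the cumulative drop is significant.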
(2) Monitoring coverage of all scenarios
We cover monitoring from four quadrants so that problems have no blind spot to hide in. For dimensions such as service monitoring, we also refined the multi-dimensional filtering capability, so that problems can be discovered from a macro view and then located efficiently at the micro level.
For data monitoring, we worked through the scenarios specific to the game business and listed in detail the data and scenarios that need monitoring, to ensure complete coverage. Here are some of the data-related monitoring items.
(3) Multi-dimensional monitoring visualization assistance
Multi-dimensional filtering: by service, interface, error code, machine room, and machine instance;
Multi-dimensional visualization of anomalies: for example, the distribution of pvlost by interface, machine, and machine room;
Visualization of error distribution: by interface and by error code.
Figure 6. Visualization of multidimensional monitoring
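The filter-then-break-down pattern behind these views can be sketched in a few lines. This is a minimal illustration, not SIA's data model: the event fields mirror the dimensions listed above, while the function name breakdown and all sample values are assumptions.

```python
from collections import Counter


def breakdown(events, dimension, **filters):
    """Count events per value of `dimension`, keeping only the events that
    match every key=value pair in `filters`. Sorted by count, descending."""
    counts = Counter()
    for event in events:
        if all(event.get(k) == v for k, v in filters.items()):
            counts[event[dimension]] += 1
    return counts.most_common()


# Made-up failed-request events with the dimensions listed above.
events = [
    {"service": "pay", "interface": "/order", "error_code": "E500",
     "room": "bj", "instance": "bj-01"},
    {"service": "pay", "interface": "/order", "error_code": "E500",
     "room": "bj", "instance": "bj-01"},
    {"service": "pay", "interface": "/refund", "error_code": "E504",
     "room": "gz", "instance": "gz-03"},
    {"service": "login", "interface": "/token", "error_code": "E500",
     "room": "bj", "instance": "bj-02"},
]

by_room = breakdown(events, "room", service="pay")
by_code = breakdown(events, "error_code")
print(by_room)
print(by_code)
```

Narrowing by one dimension and then breaking down by another answers questions such as whether failures are concentrated in one machine room or one interface.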
3.1.3 Intelligent alarm design
Alarms are designed in tiers: different alarm scopes and alarm channels are set for different scenarios, to cut down the flood of unimportant alarm messages. The overall design of alarm handling is as follows:
(1) Intelligent merging, filtering, and automatic escalation
Intelligent filtering: screen alarm messages to reduce the excess;
Intelligent merging: combine related alarms into one message, which gives a clearer overall picture and further reduces alarm noise;
Automatic escalation: to solve the problem of alarms not reaching the person on duty, different thresholds widen the alarm to different groups of recipients, and the alarm channel escalates from email -> Baidu Stream -> SMS -> phone call. Phone alarms can be set to redial continuously until someone responds, so that the right person is always reached.
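Merging and escalation can be sketched as follows. Only the channel order (email, Baidu Stream, SMS, phone) comes from the text; the grouping key (service, rule), the time-based escalation rule, and the 5-minute step are assumptions made for illustration.

```python
from collections import defaultdict

# Escalation order taken from the text; everything else is an assumption.
CHANNELS = ["email", "baidu_stream", "sms", "phone"]


def merge_alarms(alarms):
    """Merge alarms that share (service, rule) into one summary alarm."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[(alarm["service"], alarm["rule"])].append(alarm["instance"])
    return [
        {"service": service, "rule": rule, "count": len(instances),
         "instances": sorted(set(instances))}
        for (service, rule), instances in groups.items()
    ]


def channel_for(unacked_minutes, step_minutes=5):
    """Pick a channel based on how long the alarm has gone unacknowledged."""
    level = min(unacked_minutes // step_minutes, len(CHANNELS) - 1)
    return CHANNELS[level]


raw = [
    {"service": "pay", "rule": "5xx_rate", "instance": "bj-01"},
    {"service": "pay", "rule": "5xx_rate", "instance": "bj-02"},
    {"service": "pay", "rule": "5xx_rate", "instance": "bj-01"},
    {"service": "login", "rule": "latency", "instance": "gz-01"},
]
merged = merge_alarms(raw)
print(merged)
print([channel_for(m) for m in (0, 5, 12, 60)])
```

Here four raw alarms collapse into two messages, and an alarm nobody acknowledges climbs one channel every five minutes until it reaches the phone.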
(2) Customizing alarm style and content
For ordinary instance alarms or service alarms, the alarm message is output in a fixed format.
For core logic, we added rich-text alarm content. The alarm shows the complete information about the problem and provides the context in which it occurred, which greatly increases the amount of information and gives enough to work with when locating the problem.
Figure 8. Alarm content style customization
3.1.4 Support for efficient problem location
Efficient exposure of alarm information: for key core logic, we use trace links plus a robot to deliver alarms quickly with customized output, so that information is transmitted efficiently.
Efficient confirmation of alarm information: after an alarm fires, engineers need to confirm the complete online logs for the request and quickly retrieve the data state at that moment. Connecting to the real-time trace system solves this efficiently.
(1) Robot alarms with trace link information for core logic
Alarm exposure for the core logic has basically reached minute-level alarming plus automatic problem location: from the alarm message, R&D can see the offending line of code and the cause of the error, which greatly improves the efficiency of locating problems.
Of course, this style of alarm also has a relatively high implementation cost. For example, if the process of issuing in-game items to a user fails after a top-up completes, we expose the request parameters, the failing function, and the specific cause of the error. R&D can pinpoint the problem directly from this data, but the approach needs more customization, so there is an onboarding cost.
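A minimal sketch of such a customized alarm payload is shown below. It uses Python's standard traceback module to recover the failing function and line; the grant_item function, its parameters, and the payload fields are hypothetical and do not reflect the team's actual alarm robot.

```python
import traceback


def build_alarm(exc, request_params):
    """Build an alarm payload with the failing function, line number,
    error cause, and the request parameters that triggered it."""
    # The last traceback entry is the frame where the exception was raised.
    frame = traceback.extract_tb(exc.__traceback__)[-1]
    return {
        "function": frame.name,
        "line": frame.lineno,
        "reason": f"{type(exc).__name__}: {exc}",
        "request": dict(request_params),
    }


def grant_item(user_id, item_id, inventory):
    """Hypothetical core-logic step: issue an item after a top-up."""
    if item_id not in inventory:
        raise ValueError(f"unknown item {item_id}")
    return inventory[item_id]


params = {"user_id": 42, "item_id": "sword"}
alarm = None
try:
    grant_item(inventory={}, **params)
except Exception as exc:
    alarm = build_alarm(exc, params)
print(alarm)
```

The customization cost the text mentions shows up here as the need to wrap each core-logic call site and decide which parameters are safe and useful to expose.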
(2) Real-time trace system access
With Baidu Trace, service data can be collected non-intrusively at very low onboarding cost. For timeliness, we use the Baidu DataHub message queue and build indexes in real time with Dstream, so that key-based retrieval is available on the fault location platform within 5 minutes of the data being produced, which greatly improves R&D's efficiency in locating problems.
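The idea of indexing records as they arrive, so that they can be retrieved by key fields shortly afterwards, can be sketched with an in-memory inverted index. This is only a toy illustration: it does not use or represent the Baidu Trace, DataHub, or Dstream APIs, and the class name TraceIndex and the record fields are assumptions.

```python
from collections import defaultdict


class TraceIndex:
    """Toy inverted index built incrementally as records are consumed."""

    def __init__(self, keys):
        self.keys = tuple(keys)          # fields to index, e.g. trace_id
        self.records = []                # records in arrival order
        self.index = defaultdict(list)   # (field, value) -> record positions

    def consume(self, record):
        """Store one record from the stream and index its key fields."""
        pos = len(self.records)
        self.records.append(record)
        for key in self.keys:
            if key in record:
                self.index[(key, record[key])].append(pos)

    def search(self, **criteria):
        """Return records matching all field=value criteria, in arrival order."""
        if not criteria:
            return []
        hits = set.intersection(
            *(set(self.index.get((k, v), [])) for k, v in criteria.items())
        )
        return [self.records[i] for i in sorted(hits)]


idx = TraceIndex(keys=("trace_id", "user_id"))
stream = [
    {"trace_id": "t1", "user_id": "u1", "msg": "order created"},
    {"trace_id": "t1", "user_id": "u1", "msg": "grant item failed"},
    {"trace_id": "t2", "user_id": "u2", "msg": "login ok"},
]
for record in stream:
    idx.consume(record)
print([r["msg"] for r in idx.search(trace_id="t1")])
```

Because the index is updated at consume time rather than in a later batch job, a record becomes searchable as soon as it has been processed, which is the property that keeps retrieval latency low.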
4. Panorama of microservice monitoring
4.1 User Access
With multi-dimensional visual monitoring, R&D can quickly work out the likely cause of a problem from the visual interface. Intelligent alarms cover timeliness and business reports cover detailed business health, so together they let the R&D engineers see the full state of the system.
4.2 Monitoring Tool
The company's Argus monitoring, SIA intelligent monitoring, and robot monitoring aids together cover the system fully. Customized monitoring is provided for business data with a long cycle, such as daily active users of the application, download success rate, and white screen rate.
4.3 Monitoring Indicators
The monitoring indicators are roughly classified as above, and the effective coverage of monitoring is achieved based on these classifications.
4.4 Monitoring Objects
Monitoring objects include servers, service logs, service status, service data, service core logic, and core scenarios. Sorting out the monitoring objects gives us full control over what is monitored.
5. Summary and outlook
By building monitoring capability systematically, we have reached a good state in timeliness, problem location efficiency, and coverage. For example, R&D learns of major online problems right away and has thorough supporting information to locate them effectively. Looking back over the whole process, our monitoring practice comes down to the following points.
(1) Systematic design and implementation
A monitoring system must first be clear about what the problems are and what the goals are. Only then should you think about how to solve the problems fully and reach the goals. Following this systematic analysis, we implemented our monitoring system in several parts (risk control, intelligent monitoring, intelligent alarms, and efficient problem location) to reach the goals we set.
(2) Tiered thinking in monitoring and alarms, with effort concentrated on core logic
For both monitoring and alarms, we focus on the important features and core logic; if an existing tool cannot meet the goal, we combine several tools to meet it. For ordinary logic, the emphasis is on breadth of coverage, which the existing tools provide in full.
(3) Easy to implement and roll out
SIA intelligent monitoring and Argus monitoring can aggregate and monitor homogeneous content in one step. Heterogeneous or differentiated services can be onboarded non-intrusively in their existing form, which greatly improves the efficiency of adding monitoring.
(4) Fully integrate the company's existing capabilities, combine them in new ways, and improve efficiency
Different monitoring tools have their own strengths and weaknesses. When using the monitoring infrastructure, we draw on the strengths of each tool to get the best overall result. For monitoring core logic, we use the new combination of robot alarms and customized trace content to give effective feedback on, and location of, core logic problems.
Although our monitoring practice has achieved relatively good results, automatic mechanisms for fault handling and disaster recovery still need to be built out, and system resources are not yet managed intelligently: scaling resources up and down still depends on manual intervention. Our next optimization goal is full automation of fault handling and intelligent resource scaling, to improve the overall maintainability and availability of the system.