Taobao is a super app used by hundreds of millions of users every day, and keeping the client stable is our primary goal. To that end, we set a 5-15-60 target: respond to a fault alarm within 5 minutes, locate the fault within 15 minutes, and recover from it within 60 minutes. However, the existing troubleshooting system falls short of this target, mainly for the following reasons:
Monitoring phase
- Aggregation statistics based on crash stacks and exception information are not precise or sensitive enough.
- After an exception is detected, the client's behavior is limited: it only reports the exception information and cannot provide more useful data.
- Most problems in Mobile Taobao are related to online changes, yet there is no monitoring of change quality.
Troubleshooting phase
- When the exception information reported by monitoring is insufficient, we have to rely on logs to troubleshoot.
- When a crash or exception occurs, logs are not uploaded automatically; they must be retrieved manually, and nothing can be retrieved from users who are offline.
- After obtaining the logs:
  - They lack classification and standards; the logs are cluttered, and engineers cannot understand the logs of other modules.
  - They lack scene information, so the user's operations at the time of the exception cannot be fully reproduced.
  - They lack lifecycle-related event information, so we cannot grasp how the app was running.
  - The upstream and downstream log information of each module cannot be correlated into a complete link.
- Existing log visualization tools are weak and do not improve troubleshooting efficiency.
- Problems are investigated by manual analysis, which is inefficient, and findings for recurring problems are not accumulated.
- Data from the various systems is not correlated, so data has to be pulled from multiple platforms.
Diagnostic system upgrade
To address the problems above, we redesigned the architecture of the wireless operations and diagnosis system. The new architecture introduces the concept of a scenario. Previously, exceptions occurring on the client were treated as independent events, and there was no way to do more detailed processing or data collection for different kinds of exceptions. With scenarios, an exception can be combined with multiple conditions, and different scenarios can be configured to collect richer and more precise exception information.
At the same time, we redefined the client-side exception data, which now includes standard Log data, Trace data that records the full call link, runtime Metric data, and snapshot data of the scene when an exception occurs. The platform side uses the exception data for monitoring and alarming, and the data can also be parsed and visualized. For business-specific differences, the platform provides plug-in capabilities for parsing the data. With this semantic information, the platform can perform an initial diagnosis of the problem.
Our next goals are therefore:
- Implement client-side monitoring and operations.
- Upgrade the logging system, integrating Log, Trace and Metric data to provide richer and more precise troubleshooting information.
- Complete data integration with the high-availability system, and provide a standardized interface and platform for troubleshooting.
- Support platform plug-ins so that businesses can build customized diagnosis plug-ins and integrate with the diagnosis platform.
- Have the platform produce diagnosis results from the diagnostic information, moving toward automated and intelligent diagnosis.
- Propose solutions or fix requirements based on the diagnosis results, forming a closed loop of requirements -> development -> release -> monitoring -> troubleshooting -> diagnosis -> repair -> requirements.
Log System Upgrade
At present, analyzing runtime logs is the main means of client-side troubleshooting, and as mentioned above, our logs have several problems. So our first step was to upgrade the logging system. (Before this, we had already improved the basic capabilities of the logs themselves, such as write performance, compression ratio, upload success rate, and a log data dashboard.)
To improve the efficiency of log-based troubleshooting, we redefined a standard client-side log protocol starting from the log content. A standardized log protocol helps us build log visualization and automated log analysis on the platform side. From the perspective of Log, Trace and Metric, and based on the actual situation of Mobile Taobao, we divide the existing logs into the following categories (a minimal sketch of the protocol follows the list):
- CodeLog: compatible with the original, relatively messy logs.
- PageLog: logs divided by page, so they can be filtered by page dimension during troubleshooting.
- EventLog: records various client events, such as foreground/background switches, network status, configuration changes, exceptions, and click events.
- MetricLog: records various metrics collected while the client runs, such as memory, CPU, and business metrics.
- SpanLog: full-link log data. It links individual points together, defining a unified standard for measuring performance and detecting anomalies. A scheme based on OpenTrace connects with the server, forming an end-to-end full-link troubleshooting mechanism.
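As a rough illustration of what such a standardized protocol could look like, here is a minimal sketch in Kotlin. The `LogEntry` hierarchy, field names, and the tab-separated encoding are assumptions for illustration only; they are not the actual TLOG protocol.

```kotlin
// Minimal sketch of a standardized log protocol; field names are hypothetical.
sealed class LogEntry(
    val timestamp: Long,      // when the entry was produced
    val module: String,       // which module wrote the log
    val sessionId: String     // groups entries of one app session
)

class CodeLog(timestamp: Long, module: String, sessionId: String,
              val message: String) : LogEntry(timestamp, module, sessionId)

class PageLog(timestamp: Long, module: String, sessionId: String,
              val pageName: String, val event: String) : LogEntry(timestamp, module, sessionId)

class EventLog(timestamp: Long, module: String, sessionId: String,
               val eventType: String,            // e.g. "background", "network_change"
               val payload: Map<String, String>) : LogEntry(timestamp, module, sessionId)

class MetricLog(timestamp: Long, module: String, sessionId: String,
                val name: String, val value: Double) : LogEntry(timestamp, module, sessionId)

class SpanLog(timestamp: Long, module: String, sessionId: String,
              val traceId: String, val spanId: String, val parentSpanId: String?,
              val operation: String, val durationMs: Long) : LogEntry(timestamp, module, sessionId)

// One possible line format: tab-separated fields with a category tag, so the
// platform can parse and visualize entries generically.
fun encode(entry: LogEntry): String = when (entry) {
    is CodeLog   -> listOf("CODE", entry.timestamp, entry.module, entry.message)
    is PageLog   -> listOf("PAGE", entry.timestamp, entry.pageName, entry.event)
    is EventLog  -> listOf("EVENT", entry.timestamp, entry.eventType, entry.payload)
    is MetricLog -> listOf("METRIC", entry.timestamp, entry.name, entry.value)
    is SpanLog   -> listOf("SPAN", entry.timestamp, entry.traceId, entry.spanId,
                           entry.parentSpanId ?: "-", entry.operation, entry.durationMs)
}.joinToString("\t")
```

The point of a uniform, category-tagged line format is that platform-side tooling can parse any module's logs the same way, which is what makes visualization and automated analysis possible.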
With all this data, a log visualization platform on the server side can quickly replay client behavior, and thanks to the standardized logs, the platform can analyze them and quickly highlight abnormal nodes.
Client-side diagnosis upgrade
The client is the data source of the whole diagnostic system: all exception data and runtime information are collected and reported by various tools on the client. The main client-side tools currently include:
- APM: collects client runtime and performance information and reports it to the server.
- TLOG: collects client runtime logs and stores them locally; the server issues commands to retrieve them when needed.
- UT: the client-side event tracking tool; many business exceptions are reported through UT.
- Exception monitoring: represented by the Crash SDK, which collects client crash information, plus SDKs related to business exceptions and user feedback.
- Troubleshooting tools: memory detection, lag detection, and blank-screen detection. These are not grouped under APM because they run on a sampled basis, and some are disabled by default.
As you can see, there are already many mature tools on the client, yet missing data is still a common problem during troubleshooting. On the one hand, the data is scattered across platforms with no unified query interface; on the other hand, the client does not integrate the data from these tools, and there is little interaction between them when an exception occurs, so data gets lost.
To address these issues, we introduced a new diagnostic SDK and a staining (change-marking) SDK on the client. Their main functions are:
- Interconnect with existing tools to collect client runtime data, and write this data into TLOG according to the standard log protocol.
- Monitor change information on the client and generate staining marks for changes.
- When a monitored exception occurs, generate snapshot information (including runtime information and change information) and report it to the server.
- Collect and report scenario diagnostic data in specific scenarios, based on rules configured on the server.
- Support directional diagnosis: invoke the corresponding troubleshooting tool to collect data according to configuration delivered by the server.
- Support real-time log uploading and online debugging for specific users and devices.
Exception snapshots
Client-side exceptions include crashes, business exceptions, and user feedback. When they occur, the reported data formats, contents, channels, and platforms all differ, and adding any data would require changes to both the client and the corresponding platform. We therefore hook into the client exception-monitoring SDKs and provide an exception notification interface for businesses. When the diagnostic SDK receives an exception, it generates a snapshot from the runtime information collected so far. Each snapshot has a unique snapshotID, and we only need to pass this ID to the corresponding SDK, so the changes to existing SDKs are minimal.
The snapshot data grows richer as client-side capabilities are enhanced. The collected snapshots are uploaded to the diagnostic platform, which uses the snapshotID to associate data across systems. Based on snapshot and log information, the diagnostic platform can analyze a fault and produce a preliminary diagnosis.
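The following is a minimal sketch of how such an exception notification interface could look on the client. The interface name, the snapshot fields, and the upload callback are assumptions for illustration, not the actual diagnostic SDK API.

```kotlin
// Hypothetical sketch of an exception notification interface; names and
// fields are assumptions, not the real diagnostic SDK API.
data class Snapshot(
    val snapshotId: String,                 // unique ID used to associate data across platforms
    val runtimeInfo: Map<String, String>,   // memory, CPU, current page, network status, ...
    val changeMarks: List<String>           // staining marks of changes active on this device
)

interface ExceptionNotifier {
    // Called by crash / business-exception / feedback SDKs when an exception occurs.
    // Returns the snapshotId so the caller can attach it to its own report,
    // keeping changes to existing SDKs minimal.
    fun onException(type: String, detail: Map<String, String>): String
}

class DiagnosticSdk(private val uploader: (Snapshot) -> Unit) : ExceptionNotifier {
    @Volatile private var runtimeInfo: Map<String, String> = emptyMap()
    @Volatile private var changeMarks: List<String> = emptyList()

    fun updateRuntimeInfo(info: Map<String, String>) { runtimeInfo = info }
    fun updateChangeMarks(marks: List<String>) { changeMarks = marks }

    override fun onException(type: String, detail: Map<String, String>): String {
        val snapshot = Snapshot(
            snapshotId = java.util.UUID.randomUUID().toString(),
            runtimeInfo = runtimeInfo + detail,
            changeMarks = changeMarks
        )
        uploader(snapshot)          // report the snapshot to the diagnostic platform
        return snapshot.snapshotId  // the exception SDK carries only this ID
    }
}
```

The key design idea is that the exception SDKs only carry an opaque ID, while all heavy data stays on the snapshot channel, so existing reporting paths barely change.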
Change monitoring
Most problems in Mobile Taobao are caused by online changes. The current monitoring and troubleshooting system does not watch changes specifically; it mainly raises alarms based on the number and trend of exceptions. This lags behind reality and makes it hard to catch problems during the rollout stage. There is also no unified control standard for releasing changes. We therefore introduced a staining SDK on the client to collect change data, and we monitor change releases together with the change diagnosis platform of wireless operations, so that every change can be rolled out gradually, observed, and rolled back.
Current client-side changes include common configuration changes (the Orange platform), AB test changes, and business-customized changes (Touchstone, Security, Neoultron, etc.). The staining SDK connects with these SDKs on the client; after collecting the change data, it generates and reports the corresponding staining marks. Working with TLOG and the diagnostic SDK, it writes these changes into the logs and marks the exception information when an exception occurs. The platform side also connects with the various release platforms and high-availability data platforms, and makes release decisions based on the data reported by the client.
Staining marks
A change is essentially a process in which the server delivers data to the client for the client to use. For a change, we therefore define:
- Change type: distinguishes the kinds of changes, such as configuration changes (Orange), test changes (ABTest), business A changes, business B changes, and so on.
- Configuration type: one change type may have multiple configurations; for Orange changes, for example, each configuration has a namespace that distinguishes the configuration type.
- Version information: identifies a specific release of a configuration. Not every configuration change carries explicit version information; the unique identifier of a release or configuration can be used instead. For example, each Orange configuration release has a version, and each ABTest release has a publishID.
With this information we can generate a unique staining mark for each change. By reporting staining marks, we can count how many devices a change has actually taken effect on; by carrying staining marks in the snapshot information, we can compute the crash rate and the amount of user feedback tagged with each change, and compare them with the high-availability dashboards to monitor change quality. A business can also attach a staining mark to a network request to check whether an interface is behaving abnormally. A minimal sketch follows.
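As an illustration, a staining mark could simply be a composite key of the three fields defined above. The class name and string format below are assumptions, not the actual staining SDK.

```kotlin
// Hypothetical composite staining mark; the real format may differ.
data class ChangeStain(
    val changeType: String,   // e.g. "orange", "abtest", "businessA"
    val configType: String,   // e.g. an Orange namespace
    val version: String       // e.g. an Orange version or an ABTest publishID
) {
    // A compact string form that can be written to logs, attached to
    // exception snapshots, or carried in network request headers.
    fun asMark(): String = "$changeType|$configType|$version"
}

fun main() {
    val stain = ChangeStain("orange", "cart_config", "20240501.3")  // illustrative values
    println(stain.asMark())   // orange|cart_config|20240501.3
}
```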
Grayscale definition
With the staining SDK, we want to observe changes during the grayscale stage and find problems early, so that problems caused by a change never reach the full online audience. We therefore divide the release process into three stages:
- Preparation stage: prepare the change data, validate it in the pre-release environment, submit it for approval, and publish it online. This stage determines the type, configuration, and version of the change being released.
- Grayscale stage: deliver the change configuration to a subset of users. Our staining monitoring mainly happens in this stage: staining marks are reported, and exception snapshots carry the staining marks. The platform side processes this data to produce grayscale-related metrics.
- Full-release stage: once the grayscale criteria are met, the change enters full release and the configuration is pushed to all users who meet the conditions.
Data reporting control
Because Mobile Taobao has so many users, continuing to report effectiveness data after a change has been fully released would put great pressure on the server, and continuing to tag exception information at that point adds little value. We therefore need a mechanism to control the reporting of staining data from the client.
For common changes such as Orange and ABTest we adapted the control separately: reporting can be governed by experiment ID and namespace through black/white lists, sampling, or release status. For custom changes, however, the controllable conditions vary so much that fine-grained control would require understanding each specific change. So we defined a few general controls: a grayscale flag, a sampling rate, and an expiration time.
This information is delivered to the client as a configuration file. The client does not need to care about how these conditions are set; they are configured on the platform side, which connects to the release platform and the high-availability platform and makes decisions based on the reported data. At present, reporting is mainly controlled by the grayscale flag plus the expiration time in the configuration. A minimal sketch follows.
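Here is a small sketch of how the client might decide whether to report staining data, assuming a hypothetical configuration shape with a grayscale flag, sampling rate, and expiration time; this is not the actual delivered configuration format.

```kotlin
import kotlin.random.Random

// Hypothetical reporting-control configuration delivered by the platform.
data class StainReportConfig(
    val inGrayscale: Boolean,   // grayscale flag: only report while the change is in grayscale
    val sampleRate: Double,     // 0.0..1.0, fraction of devices that should report
    val expireAtMillis: Long    // stop reporting after this timestamp
)

fun shouldReportStain(config: StainReportConfig,
                      nowMillis: Long = System.currentTimeMillis()): Boolean {
    if (!config.inGrayscale) return false                // change already fully released
    if (nowMillis > config.expireAtMillis) return false  // configuration expired
    return Random.nextDouble() < config.sampleRate       // sampled reporting
}
```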
Release gate
Based on the data reported from the client, such as effectiveness counts and stained exceptions, the server can monitor a change and decide, from the number of related crashes, the amount of user feedback, the grayscale duration, and so on, whether the change has met the criteria for full release. The crash and feedback information related to the change is also listed, to help judge whether the release carries any risk. A rough sketch of such a gate check follows.
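As an illustration only, a release gate might compare the stained metrics against thresholds like the following; the threshold values and field names are assumptions, not the actual gate rules.

```kotlin
// Hypothetical gate metrics aggregated from stained reports; names are illustrative.
data class ChangeMetrics(
    val effectiveDevices: Long,     // devices on which the change took effect
    val stainedCrashes: Long,       // crashes whose snapshots carry this change's stain
    val stainedFeedback: Long,      // user feedback items tagged with this change
    val grayDurationHours: Long     // how long the change has been in grayscale
)

// Example thresholds; real criteria would be configured per business.
fun passesReleaseGate(m: ChangeMetrics): Boolean {
    val crashRate = if (m.effectiveDevices == 0L) 0.0
                    else m.stainedCrashes.toDouble() / m.effectiveDevices
    return m.grayDurationHours >= 24 &&   // observed long enough
           crashRate < 1e-4 &&            // stained crash rate below threshold
           m.stainedFeedback < 10         // little related user feedback
}
```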
So far, Orange configuration changes, AB test changes, and businesses such as product detail and order have been integrated. The results are encouraging: four online failures have already been avoided.
Scenario-based reporting
Scenario-based data reporting is an important capability of the diagnostic system. In the past, when an alarm fired, we manually retrieved the relevant data from the client for troubleshooting, and different anomalies required different data, which often meant multiple operations across multiple platforms. Data acquisition lagged behind and the process was hard to control. So once the new log standard, exception snapshot collection, change staining, and the other basic client capabilities were in place, we introduced scenario-based reporting.
For example, under the existing troubleshooting approach, when an online exception alarm fires we first troubleshoot with the reported exception information. But because that information is incomplete, we usually have to pull TLOG for further investigation, and pulling TLOG depends on the user being online, so the whole troubleshooting and fault-localization process takes a very long time.
With scenarios, when the platform detects that the number of exceptions is approaching the alarm threshold, it can automatically generate a scenario rule, select a group of users, and deliver the rule to their clients. When the same exception occurs on the client, the scenario engine collects and reports the required data, so by the time the alarm threshold is reached, the platform already has enough data to analyze and locate the problem.
Scenario rules
The scenario engine on the client executes the scenario rules delivered by the server. A scenario rule consists of three parts (a sketch of a full rule follows the Action list below):
Trigger
A trigger can be an action or an event on the client. Compared with the old approach of reporting data only on crashes and business exceptions, we have expanded the set of triggering events:
- Crash exceptions
- User screenshot feedback
- Network exceptions (MTOP errors, network errors, etc.)
- Page exceptions (blank screen, abnormal display)
- System exceptions (high memory usage, high CPU usage, excessive power drain, overheating, or slowdowns)
- Business exceptions (business error codes, logic errors, etc.)
- App launch (generally used for directional diagnosis)
Condition
Condition matching is the core of scenario-based reporting. With scenario conditions, we can classify exception types more precisely along multiple dimensions. Conditions fall roughly into:
- Basic conditions: match and filter by device information, user information, version information, and other such dimensions.
- State conditions: runtime information such as network status, current page, and memory watermark.
- Specific conditions: these differ per scenario. For example, when a crash occurs, matching can be based on the exception type, stack, and so on; a business exception might be matched by error code or network error.
Action
When a rule is triggered on the client and all its conditions are met, the specified actions run, so different data can be collected for different scenarios. The current client-side actions are:
- Upload TLOG logs
- Upload snapshot information
- Upload memory information
- Work with other troubleshooting tools to collect and report data according to the delivered parameters
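To make the structure concrete, here is a minimal sketch of how a scenario rule might be modeled on the client. The class and field names are assumptions for illustration; the real rule format delivered by the scenario management platform may differ.

```kotlin
// Hypothetical model of a scenario rule: trigger + conditions + actions.
// Names and shapes are illustrative, not the actual rule schema.
enum class TriggerType { CRASH, SCREENSHOT_FEEDBACK, NETWORK_ERROR, PAGE_ERROR,
                         SYSTEM_ERROR, BUSINESS_ERROR, LAUNCH }

enum class ActionType { UPLOAD_TLOG, UPLOAD_SNAPSHOT, UPLOAD_MEMORY, RUN_TOOL }

data class Condition(
    val field: String,     // e.g. "osVersion", "currentPage", "exceptionType", "errorCode"
    val op: String,        // e.g. "equals", "contains", "gte"
    val value: String
)

data class ScenarioRule(
    val ruleId: String,
    val trigger: TriggerType,
    val conditions: List<Condition>,
    val actions: List<ActionType>,
    val maxReports: Int          // server-side threshold; rule goes offline once reached
)

// The scenario engine matches an incoming event against a rule.
fun matches(rule: ScenarioRule, trigger: TriggerType, context: Map<String, String>): Boolean {
    if (trigger != rule.trigger) return false
    return rule.conditions.all { c ->
        val actual = context[c.field] ?: return false
        when (c.op) {
            "equals"   -> actual == c.value
            "contains" -> actual.contains(c.value)
            "gte"      -> (actual.toDoubleOrNull() ?: return false) >=
                          (c.value.toDoubleOrNull() ?: return false)
            else       -> false
        }
    }
}
```

When `matches` returns true, the engine would execute the rule's actions, such as uploading TLOG or snapshot data.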
Scenario delivery
A new scenario management platform has been built on the server side, where scenarios and conditions can be configured easily, with standard release, review, and grayscale processes. Using both push and pull, the client obtains scenario rules in a timely manner.
The platform and the client also support targeted delivery: by specifying some basic conditions, configurations can be delivered to particular devices and users.
Data Traffic Control
Once a scenario rule matches, exception data is reported. If the volume is too large, however, it puts pressure on server storage, so we apply traffic control to scenario-reported data.
From a troubleshooting point of view, a few copies of logs and data are usually enough for the same problem, so we specify a reporting threshold when creating a scenario. Once the platform has collected enough data, the scenario is stopped and the client is notified that the rule has gone offline.
At the same time, to prevent frequent uploads of diagnostic data from affecting normal use, the client also has its own rate-limiting policy: for example, uploading only on Wi-Fi, limiting the number of rules executed per day, limiting the amount of data uploaded per day, and enforcing a minimum interval between uploads. A minimal sketch of such a policy follows.
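Below is a small sketch of such a client-side rate limiter. The limits and field names are hypothetical; in practice the thresholds would be configured on the platform side.

```kotlin
// Hypothetical client-side rate limiter for diagnostic uploads.
class UploadThrottle(
    private val wifiOnly: Boolean = true,
    private val maxRulesPerDay: Int = 10,
    private val maxBytesPerDay: Long = 20L * 1024 * 1024,    // 20 MB/day, illustrative
    private val minIntervalMillis: Long = 10 * 60 * 1000L    // 10 minutes between uploads
) {
    private var rulesToday = 0
    private var bytesToday = 0L
    private var lastUploadAt = 0L

    fun canUpload(onWifi: Boolean, payloadBytes: Long,
                  nowMillis: Long = System.currentTimeMillis()): Boolean {
        if (wifiOnly && !onWifi) return false
        if (rulesToday >= maxRulesPerDay) return false
        if (bytesToday + payloadBytes > maxBytesPerDay) return false
        if (nowMillis - lastUploadAt < minIntervalMillis) return false
        return true
    }

    fun recordUpload(payloadBytes: Long, nowMillis: Long = System.currentTimeMillis()) {
        rulesToday += 1
        bytesToday += payloadBytes
        lastUploadAt = nowMillis
        // A real implementation would reset the daily counters at local midnight.
    }
}
```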
Custom Scenarios
At present, our triggers cover common scenarios and fetch data from the high-availability tools for reporting. However, a business may have its own system for monitoring exceptions, so we also expose interfaces that businesses can call, letting them use our capabilities of scenario delivery, rule expression evaluation, and runtime data collection to diagnose their own problems. A business can define its own trigger timing and trigger conditions. In the future we can also add customizable actions so that businesses can report their own data per scenario.
Directional diagnosis
Besides TLOG, we have troubleshooting tools such as memory tools, performance tools, and lag-detection tools. They are useful for specific problems, but they are enabled or disabled via sampled configuration, and the current configuration cannot be targeted at particular devices or users, so the information available during troubleshooting can be incomplete. We therefore also use the scenario delivery capability, lightly adapting these existing tools so that they can collect and report exception data on demand.
Future
The client-side diagnostic capability is still under construction. We keep iterating on the pain points in the troubleshooting process, such as real-time logs, remote debugging, and more complete full-link and exception data. We also recognize that client-side diagnosis is no longer a matter of simply stacking tools to collect data; in the future, data must be collected and processed in a more targeted and refined way.
As the diagnostic data improves, the next challenge is how the platform can use it to locate the root cause of a problem, analyze its impact, and accumulate diagnosis results. The client should not only focus on data collection; relying on the diagnostic knowledge accumulated on the platform and on on-device intelligence, it should be able to diagnose problems itself, and with client-side degradation and dynamic repair capabilities, even self-heal after an exception occurs.