Background
On April 13, 2018, the white paper “Enterprise-level AIOps Practice Suggestions”, drafted with the participation of BATJ, 360, Huawei, Cloud Intelligence, and many other Internet companies, stated that AIOps refers to intelligent operation and maintenance. Building on existing operations data (logs, monitoring information, application information, and so on), it uses machine learning to solve problems that automated operations cannot. Its goals are to improve the predictive ability and stability of the system, reduce IT costs, and make the enterprise's products more competitive. Gartner proposed the concept of AIOps in 2016 and predicted that adoption would reach 50% of the entire operations industry by 2020.
Why use AIOps, and what is its value? Consider how operations engineers typically handle a fault:
- 18:00 A hidden problem begins to appear.
- 18:30 Services are still normal; the impact of the hidden problem is within what the system can tolerate.
- 19:00 The hidden problem exceeds what the system can tolerate and begins to turn into a fault.
- 19:30 Part of the business becomes abnormal, and some customers are affected.
- 19:40 An abnormal indicator crosses its threshold, and the monitoring system raises an alarm.
- 19:50 Operations engineers log in to the system and start troubleshooting.
- 20:20 The engineers find the cause of the fault and start fixing it. By this time, the fault may have affected a large number of users.
- 20:30 The fault is resolved and services are restored.
This is how traditional operations handles a fault. Several problems stand out across the whole process, from the onset of the fault to recovery:
- Monitoring data collection
  - Monitoring tools are numerous and varied, and invalid data abounds
  - Heavy manual configuration, with no first-level identification of problems
- Discovery of hidden problems
  - Static thresholds: potential risks are not discovered in time, and major risks are detected late
  - Information retrieval and filtering are difficult: with massive amounts of information, it is hard to locate the cause quickly
- Alarms
  - Low event-processing efficiency: there is no model of historical alarms
  - No event compression or escalation: message flooding becomes a side effect of a major failure
- Problem handling
  - Lack of data: heavy reliance on human judgment and experience
  - No way to evaluate the outcome after handling: the only signal is whether a fault exists or not, so fluctuations leave operators helpless
Common scenarios for practicing AIOps
There are certainly more problems beyond these four categories. For example, how can massive, valuable monitoring data be used to keep services running efficiently and securely in complex software and hardware environments? Operations rules also change over time; how can they be managed and maintained effectively? From these pain points of traditional operations, it is not hard to abstract the following typical scenarios:
Intelligent anomaly detection
Anomaly detection here means finding abnormal behavior in time-series indicators among the massive volume of operations monitoring data. Put simply, it means finding outliers: points in the historical data that differ from most of the others. Unlike indicator evaluation based on human judgment, this can improve the accuracy and timeliness of problem discovery.
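As a rough illustration of "finding outliers that differ from most objects", the sketch below flags points by their modified z-score, computed from the median and the median absolute deviation (MAD). This is one common technique chosen for illustration, not a method named by the article; the function name, the sample latencies, and the 3.5 cutoff are all assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices of points whose modified z-score exceeds `threshold`."""
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # At least half of the points are identical; this simple sketch
        # gives up rather than divide by zero.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Hypothetical response times in milliseconds, one sample per minute.
latency_ms = [102, 98, 101, 99, 100, 103, 97, 480, 100, 101]
print(mad_outliers(latency_ms))
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value barely moves them, so the outlier does not hide itself by inflating the baseline.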
Intelligent alarm
Use dynamic thresholds learned from historical data, instead of static thresholds, to discover major risks or faults in time.
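A minimal sketch of a dynamic threshold, assuming the simplest possible "learning": the upper bound is the mean of a sliding window of recent values plus `k` standard deviations. The window size, `k`, the function name, and the CPU samples are illustrative assumptions; a production system would more likely learn seasonal baselines.

```python
from collections import deque
from statistics import mean, stdev


def dynamic_threshold_alerts(series, window=5, k=3.0):
    """Return (index, value, upper_bound) for values above a bound learned
    from the previous `window` points."""
    history = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(series):
        if len(history) == window:
            upper = mean(history) + k * stdev(history)
            if value > upper:
                alerts.append((i, value, upper))
        # Note: abnormal values also enter the history in this sketch,
        # which widens the bound for the next few points.
        history.append(value)
    return alerts


# Hypothetical CPU utilisation in percent. A static threshold of 90
# would never fire on this series, yet the jump to 75 is clearly abnormal.
cpu_percent = [40, 42, 41, 43, 42, 41, 44, 43, 75, 44]
alerts = dynamic_threshold_alerts(cpu_percent)
for index, value, upper in alerts:
    print(f"point {index}: {value} exceeds learned bound {upper:.1f}")
```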
Use intelligent correlation analysis and convergence of alarm messages to resolve the side effects of alarm storms during a fault. By analyzing the correlation between alarm messages, the system can identify alarm patterns and combine multiple related alarms, or convert them into a single alarm that carries more information, which helps diagnose faults faster and more accurately.
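Alarm convergence can be sketched as follows, under the assumption that two alarms are "related" when they share the same service and metric and arrive within a fixed time window of each other. Real systems would learn the correlation from history; the tuple layout, the `MergedAlarm` class, and the 300-second window are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MergedAlarm:
    key: tuple            # (service, metric)
    first_seen: int       # timestamp of the first alarm in the group, in seconds
    last_seen: int        # timestamp of the most recent alarm in the group
    count: int = 1
    messages: list = field(default_factory=list)


def converge(alarms, window_s=300):
    """Merge alarms that share (service, metric) and arrive within
    `window_s` seconds of the previous alarm in the same group."""
    open_groups = {}
    merged = []
    for ts, service, metric, message in sorted(alarms):
        key = (service, metric)
        group = open_groups.get(key)
        if group is not None and ts - group.last_seen <= window_s:
            group.last_seen = ts
            group.count += 1
            group.messages.append(message)
        else:
            group = MergedAlarm(key, ts, ts, 1, [message])
            open_groups[key] = group
            merged.append(group)
    return merged


# Hypothetical raw alarms: (timestamp_s, service, metric, message).
raw_alarms = [
    (0, "order-api", "latency", "p99 > 1s"),
    (30, "order-api", "latency", "p99 > 1s"),
    (45, "db", "connections", "pool exhausted"),
    (60, "order-api", "latency", "p99 > 2s"),
    (1000, "order-api", "latency", "p99 > 1s"),
]
merged_alarms = converge(raw_alarms)
for alarm in merged_alarms:
    print(alarm.key, alarm.count, alarm.first_seen, alarm.last_seen)
```

Here five raw alarms become three merged ones, and each merged alarm keeps its count and message history, so it carries more information than any single raw alarm.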
Intelligent fault root cause analysis
Fault management has three stages: detection, location, and identification. Of these, fault identification and diagnosis is particularly important. Root cause analysis, also known as fault location, fault isolation, or alarm/event correlation, is the process of inferring the set of failures that produced a given set of symptoms. This reasoning requires a model that explains the relationship between failures and symptoms.
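The idea of reasoning over a failure-to-symptom model can be illustrated with a deliberately small sketch. It assumes the model is a hand-written table that maps each candidate failure to the symptoms it is expected to produce, and it ranks candidates by Jaccard similarity with the observed symptoms. The fault names, symptom names, and scoring rule are all hypothetical.

```python
# Hypothetical model: each failure and the symptoms it is expected to produce.
FAULT_MODEL = {
    "db_slow_query": {"api_latency_high", "db_cpu_high"},
    "network_partition": {"api_latency_high", "api_error_rate_high", "packet_loss"},
    "memory_leak": {"gc_pause_long", "api_latency_high", "heap_usage_high"},
}


def rank_root_causes(observed, model=FAULT_MODEL):
    """Rank candidate failures by Jaccard similarity between the symptoms
    each failure explains and the symptoms actually observed."""
    scored = []
    for fault, expected in model.items():
        union = expected | observed
        score = len(expected & observed) / len(union) if union else 0.0
        scored.append((fault, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


observed_symptoms = {"api_latency_high", "gc_pause_long", "heap_usage_high"}
ranked = rank_root_causes(observed_symptoms)
for fault, score in ranked:
    print(f"{fault}: {score:.2f}")
```

A symptom shared by every failure, such as high API latency, contributes little to the ranking; the symptoms that only one failure explains are what separate the candidates.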
Intelligent time series prediction
A model is built from massive historical data to predict future trends, and it is continuously compensated and corrected in production. This makes it possible to give more accurate early warnings before faults or accidents occur.
In these typical scenarios, the expected outputs are the scope of impact, the probability of each cause, the probability of impact, and the specific entities of a given type that are involved. The input data must meet the following requirements:
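One simple way to picture "predict the trend, then keep correcting the model with new data" is Holt's linear exponential smoothing, which updates a level and a trend estimate every time a new observation arrives. This is an illustrative choice rather than the article's method; the class name, the smoothing factors, the disk-usage samples, and the 75% warning line are assumptions.

```python
class HoltForecaster:
    """Holt's linear (double) exponential smoothing, updated one point at a time."""

    def __init__(self, alpha=0.5, beta=0.3):
        self.alpha = alpha    # smoothing factor for the level
        self.beta = beta      # smoothing factor for the trend
        self.level = None
        self.trend = 0.0

    def update(self, value):
        """Correct the model with a newly observed value."""
        if self.level is None:
            self.level = float(value)
            return
        previous_level = self.level
        self.level = self.alpha * value + (1 - self.alpha) * (previous_level + self.trend)
        self.trend = self.beta * (self.level - previous_level) + (1 - self.beta) * self.trend

    def forecast(self, steps):
        """Extrapolate the current level and trend `steps` points ahead."""
        return self.level + steps * self.trend


# Hypothetical disk usage in percent, sampled once per hour.
disk_usage = [50, 52, 54, 56, 58, 60]
forecaster = HoltForecaster()
for sample in disk_usage:
    forecaster.update(sample)

WARNING_LINE = 75  # hypothetical early-warning threshold
if forecaster.forecast(10) >= WARNING_LINE:
    print("early warning: disk usage is projected to cross the warning line within 10 hours")
```

Because the trend starts at zero and is learned gradually, early forecasts lag a steadily growing series; each further observation pulls the estimate closer.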
- Sufficient data volume and sufficient data increments
  - Only with enough data can a model be trained
  - Only with sufficient data increments can the trained model be compensated and corrected
- Sufficient coverage of data dimensions (time, region, system level, application level, and so on)
  - Time dimension: a longer time span helps reveal the periodicity of an indicator and improves prediction accuracy; without a sufficient time span, periodicity is lost and the value of the data drops sharply
  - Regional dimension
    - From a business perspective, network-level failures in large, complex systems often have a regional impact, for example failures of a CDN, of backbone network quality, or of regional network quality. With distributed active network monitoring and real user experience monitoring, failures that seem unpredictable can be anticipated, and the direct impact range of a hidden problem can be perceived as soon as it starts to affect users, which shortens the time the failure affects them.
  - System-level dimension
    - The software and hardware of business systems keep growing in scale, and the way components are combined and call one another keeps growing in complexity. When a large, complex system fails, a small local problem often has far-reaching impact. At the same time, as sub-services change more frequently, the performance and stability of the related upstream and downstream systems are severely challenged. In complex systems, a sudden change in one or a few subsystems often sets off a domino effect. If the collected data has a complete system-level dimension, discovering, locating, and predicting hidden problems and faults becomes much easier.
  - Application-level dimension
    - Apart from physical and network factors, faults on the server are mostly caused by program bugs or improper SQL usage. Among the data needed to analyze a problem, information such as slow or abnormal program execution, the code execution stack at the time of an error, execution slices, SQL call details, and process or thread locks directly affects both the speed of fault location and the quality of predictions.
- Attribution and association markers between data
  - Results produced by unsupervised algorithms sometimes need to be corrected through manual supervision or semi-supervision, which places very high demands on data analysts or business staff. This is feasible only at small data scale or in simple business scenarios; it is not practical for large or complex ones. So if the data carries natural associations from the moment it is generated, model training, calibration, and correction get twice the result with half the effort.
Why AIOps can be combined with APM
With the problems and data requirements laid out above, the following explains why AIOps needs to be combined with APM.
APM stands for Application Performance Management, a management model defined by Gartner. The model has five dimensions: real end-user experience, runtime application architecture mapping, application transaction analysis, deep application diagnostics, and analysis reporting. To support these five dimensions, a complete APM system needs to collect at least the following types of data:
Client experience data
It covers browsers and app clients (native, H5, or hybrid) and includes indicators such as first-screen time, DNS resolution time, time to first packet, JS errors, IP and carrier data, crash analysis, lag analysis, connection and request times and error rates for HTTP and sockets, and resource loading. More important is the user behavior path data built on the session path: it gives all data collected from the client a user-behavior attribute, which is very good news given how much data a large number of clients generate at the same time.
Application performance data
It includes request parameters, execution time, error details and error rates, exception details and exception rates, the time and details of calls to external resources (SQL, MQ, API, RPC, and so on), the execution time and run-time stack of classes and methods, and virtual machine state data (GC, heap, thread pools, and so on). More important are the relationships between data that the Trace model builds, which reveal how applications relate to one another and to the resources they use. These relationships make it possible to derive the logical topology of the running application quickly, and they provide a solid data foundation for later data mining and prediction. Common Trace models include Google Dapper and Twitter Zipkin, with Google Dapper being the most commonly used.
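To make the point about topology concrete, the sketch below derives service-to-service edges from Dapper-style spans, where each span records its trace, its own ID, and its parent span. The dictionary field names and the sample services are assumptions for illustration; real tracers use their own schemas.

```python
from collections import Counter

# Hypothetical spans: every span names its trace, its own ID, its parent span
# (None for the root), and the service that produced it.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "web"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "order-api"},
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "service": "mysql"},
    {"trace_id": "t1", "span_id": "d", "parent_id": "b", "service": "mq"},
    {"trace_id": "t2", "span_id": "a", "parent_id": None, "service": "web"},
    {"trace_id": "t2", "span_id": "b", "parent_id": "a", "service": "order-api"},
]


def service_topology(spans):
    """Count caller -> callee edges by joining each span to its parent span."""
    by_id = {(s["trace_id"], s["span_id"]): s for s in spans}
    edges = Counter()
    for span in spans:
        parent = by_id.get((span["trace_id"], span["parent_id"]))
        # Skip root spans and calls that stay inside one service.
        if parent is not None and parent["service"] != span["service"]:
            edges[(parent["service"], span["service"])] += 1
    return edges


topology = service_topology(spans)
for (caller, callee), calls in topology.items():
    print(f"{caller} -> {callee}: {calls} call(s)")
```

No manual configuration of dependencies is needed: the edges fall out of the parent-child links that the spans already carry.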
State data for servers and services
This differs from the data collected by Zabbix and similar monitoring products in that it is associated with the application and the business. Besides the indirect correlation with the application and business through the time dimension, the Trace model also makes the state data directly related to them.
End-to-end data connection
The most useful weapon for using APM data in AIOps is the Trace model. Data obtained through the Trace model is naturally associated. The Trace model is also easy to extend so that browsers and app clients become part of it. End-to-end here means the data path from the client application to the server application: a unique Trace ID is generated on the side closest to the user when the user initiates a request, and it is passed to the server application through a request header or another request attribute. As a result, the three categories of data (user experience data, application performance data, and server and service state data) carry natural markers of their relationship.
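The propagation step can be sketched in a few lines. This is a toy model, not a real HTTP stack: the header name `X-Trace-Id`, the record layouts, and both function names are hypothetical, and real tracers define their own header formats.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, chosen for this sketch


def client_request(path):
    """Client side: create the trace ID closest to the user and attach it
    to the outgoing request headers."""
    trace_id = uuid.uuid4().hex
    headers = {TRACE_HEADER: trace_id}
    client_record = {"trace_id": trace_id, "kind": "user_experience", "path": path}
    return headers, client_record


def server_handle(headers, path):
    """Server side: reuse the incoming trace ID (or start a new trace if the
    header is missing) and tag every record produced for this request."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    app_record = {"trace_id": trace_id, "kind": "app_performance", "path": path}
    host_record = {"trace_id": trace_id, "kind": "server_state"}
    return [app_record, host_record]


request_headers, client_record = client_request("/checkout")
records = [client_record] + server_handle(request_headers, "/checkout")
for record in records:
    print(record["kind"], record["trace_id"])
```

All three kinds of record end up with the same trace ID, so they can be joined later without any heuristic matching on timestamps or IP addresses.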
How AIOps combines with APM
From this brief introduction to APM and its data collection, it is easy to see both the data that practicing AIOps requires and the multi-dimensional data that an APM system provides. In this supply-and-demand relationship, an APM system provides a sufficient stock of data and sufficient increments, sufficient coverage of data dimensions, and well-formed attribution and association markers between data.
Now let's go back and check how well the data provided by an APM system serves the typical scenarios abstracted from the pain points of traditional operations:
- Intelligent anomaly detection
Critical transactions are an important use case in APM systems. A critical transaction is a key service, either specified by the user or learned by the system, that is accessed at high frequency or is critically important. Because the data is generated as a time series, it is well suited to anomaly detection; in addition, the call-chain and user-behavior relationships make it possible to predict the range of a fault.
- Intelligent alarm and intelligent time series prediction
The data provided by an APM system applies to both of these typical AIOps scenarios. Because the system-level and application-level relationships are already present in the data, pattern recognition becomes simpler and more efficient, and the relationship model can be applied directly to alarm model training. This sidesteps manual intervention, the biggest headache in supervised and semi-supervised scenarios. By applying intelligent alarm convergence, an AIOps system can provide various alarm compression rules, such as flash interruption, high frequency, and negative interruption. Based on these algorithms, the AIOps system can cut down worthless messages, shorten the time to discover a problem, and eliminate the interference of message floods.
- Intelligent fault root cause analysis
The Trace model and the natural relationships it creates between data have been described above. According to several Gartner analysts, the most useful weapon for APM systems in implementing AIOps is the Trace model, which provides the main thread for analyzing problems. What if you do not use APM? Then tracing code is usually embedded in applications by hand, based on human experience or specific business scenarios (the so-called “dotting” approach). This is highly limited and hard to maintain as the business changes, which makes it difficult, if not impossible, to standardize and productize.
Practicing AIOps with data from an APM system lets you look at failures from the outside in, from the perspective of application health, user experience, or business performance. For example, the system finds that a specific critical transaction is slow and that users in a certain region are severely affected. Correlated diagnosis then points to the code snippet or SQL statement, or to the load or I/O condition of an application server or middleware node, that is most likely affecting performance.
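The fault range prediction mentioned under intelligent anomaly detection can be sketched as a graph walk: given the call relationships recovered from traces, everything that directly or indirectly calls a failing component is potentially affected. The call map, service names, and function name below are hypothetical.

```python
from collections import deque

# Hypothetical call relationships: caller -> list of callees.
CALLS = {
    "web": ["order-api", "user-api"],
    "order-api": ["mysql", "mq"],
    "user-api": ["mysql"],
    "report-job": ["warehouse"],
}


def affected_callers(calls, failing):
    """Return every service that directly or indirectly calls `failing`."""
    # Invert the map so callers can be looked up by callee.
    callers = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)
    # Breadth-first walk up the call chain from the failing component.
    seen = set()
    queue = deque([failing])
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


print(sorted(affected_callers(CALLS, "mysql")))
```

In this toy map a slow `mysql` puts `order-api`, `user-api`, and `web` in the blast radius, while `report-job` is untouched because it never reaches the database.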
Summary
As this article has shown, data from APM systems makes AIOps faster and more effective to implement than traditional operations data does. An APM system not only provides data rich enough for the AIOps platform to adapt quickly to enterprise application scenarios, but also provides the key technical foundation for collection, processing, and storage, and it can be used to verify and evaluate the results of AIOps practice.
About the author
Chi Tao Gao (Neeke), Director of Cloud Intelligence, is a member of the PHP development team and the author of PECL/SeasLog. Early in his career he worked on R&D architecture for large-scale enterprise information systems, and he worked successively at BITcar Group and a large microblog marketing platform. In 2009 he moved into Internet digital marketing and did in-depth research on architecture and performance optimization. In 2014 he joined Cloud Intelligence, where he works on the architecture and R&D of APM products. He has distinctive insights into business operations and intelligent operations, and advocates agility, efficiency, and GettingReal.