Tech Guide

On September 22nd, the 18th session of the 360 Internet Technology Training Camp — "AIOps Landing Practice Exploration" — was held as scheduled at the 360 Building in Beijing.

This article is a summary of the talk given by Wang Baoping, a senior operations development engineer at 360.

Introduction

There were four topics at this meetup, shared by two internal lecturers and two external lecturers. The first internal talk, "AIOps implementation practice at 360 — you can land AIOps quickly too", introduced 360's intelligent operations framework and suggested substitutes for each component, aiming to help smaller companies that have not yet adopted AIOps, or are preparing to, get started. The other internal lecturer presented "360's fault self-healing practice on an AI operations platform based on StackStorm", which starts from common, concrete scenarios, adds models for prediction, anomaly detection and correlation analysis, and then makes judgments and triggers self-healing based on the detection results.

Next, let's look at the AIOps work shared by the two external lecturers, from CreditEase and Logyi.

CreditEase

The second talk was given by Xiao Yunpeng, a lecturer from CreditEase: "Building a next-generation intelligent CMDB based on knowledge graphs". Knowledge graphs are an active research area within AIOps. By their nature, AIOps algorithms derive operational rules from big-data analysis, but in certain scenarios some of those rules could be produced directly from human experience, or from existing CMDB call relationships. For example, if A is certain to cause B, there is no need to use big data to discover the relationship between A and B. If you compare AIOps to an operations brain, a knowledge graph is the fixed knowledge you can feed that brain directly from your operations experience: it gives the brain the IQ of a three-year-old straight away, without requiring it to start from zero and learn everything. Operational experience that cannot be turned into rules directly can then be handed to big-data AI analysis, which gradually learns the rules for you.

Although 360 has mature systems such as CMDB, HULK and Odin, and has accumulated plenty of day-to-day front-line operations experience, it has not yet distilled that experience into a knowledge graph. This is something 360 needs to strengthen in the future.

Logyi

The third talk came from Du Weipu of Logyi, who has rich operations experience: "Intelligent operations and security practice based on big log data". As a third-party ToB company, if you want access to a customer's operational data for intelligent analysis, agent-based integration is generally hard to achieve: customers, concerned about server-side security, will not cooperate, so both agent self-collection and agent packet capture are difficult, and you generally have to start from logs. Logyi's entry point is therefore large-scale third-party logs, evolving from ELK at the beginning to a self-developed stack today. The AI part then shows up in NLP analysis of log keywords, time-series anomaly detection generated from logs, and so on.

What is worth learning here is taking logs as the entry point. Within our own company we have full permissions on all machines and can obtain the complete set of system logs, plus the nginx logs and the logs of each web service. We did some basic ELK analysis of web services before, but never invested at scale, so this is also a direction worth pursuing.

Next is an introduction to the AIOps landing practice at 360.

360

After seeing the entry points other companies chose for AIOps, let's look at 360's own AIOps implementation plan. We began implementing and exploring intelligent operations in March 2018, and have distilled a set of implementation ideas that fit 360's current situation.

Overall, as the company has grown, many businesses have entered a plateau stage, and many common operations problems have surfaced: resources sit idle for a long time without being reclaimed promptly; alarms keep multiplying but are full of false positives; and although there are many alarms, there is no correlation between them. All of this is a great challenge and torment for operations staff.

To better free up operations staff, 360 selected three general, high-frequency operations scenarios and applied AI analysis and prediction to them. For example, in the resource-recovery scenario we use classification and time-series prediction models; for false and missed alarms we use time-series anomaly detection, replacing the one-size-fits-all threshold with more accurate dynamic thresholds; and for alarm correlation we use models such as alarm convergence, correlation analysis and root-cause analysis, giving the top-N likely causes to help operations staff make decisions.

A quick rundown: the pipeline of operations big data -> AI center -> alarm self-healing -> operations big screen actually has an anthropomorphic correspondence. Operations big data, i.e. our monitoring system, is analogous to our eyes; the AI center to our brain; self-healing to our hands and feet; and the operations big screen to our face. It is a three-dimensional architecture: we first have to see where the problem is and find it, then let the AI analyze it, and finally let self-healing perform the action and solve it. The whole chain is connected end to end. Of course, no matter how good the work behind the scenes is, you need a good-looking face for others to see your beauty, so the operations big screen is a very necessary display layer for us. Let's look at the face first.

Operations big screen

This figure is actually a large aggregation: the left column shows cost data from resource recovery, the right shows efficiency-improvement data, the bottom shows operations big data, and the middle shows backbone-network link data. The backbone link data can be rendered in real time as the scroll bar is dragged, and major alarms are pushed in real time to the channel at the top: once the AI has judged and analyzed an alarm and confirmed it, the affected scope is displayed on the screen in real time. Of course, this display is still mostly about data dimensions. With some imagination, interactions could be added later: mount a camera and count who looks at the screen, how often and for how long; add gestures to switch screens and zoom locally; or build even cooler interactive displays. In short, display is a field with very high playability. Even though we work in operations, we hope everyone keeps an open mind, dares to think and to experiment, and comes up with new tricks.

Of course, such a complex display owes much to the strong support of the company's Qiwu front-end team. In general, open-source 3D visualization frameworks such as ECharts can also do very well; you can explore them on your own.

Architecture

Next, let's look at the architecture behind the visualization. As a whole, our system collects data via agents. The agents are self-developed and customized, and can collect whatever data we need on demand, such as hardware data, logs, quality metrics, processes and network data. The data is then saved to the appropriate storage and managed centrally. Fixed AI analysis is performed for fixed time-series data and scenarios; if a scenario needs self-healing, the self-healing action can be triggered through the command system; and finally the aggregated data is presented on a focused big-screen display.

Now let's look at the overall framework of data collection.

Our overall data collection includes both agent self-collection and third-party data. For example, syslog is written to ES through Kafka; hardware data is reported via interfaces and written to MongoDB; and process data is collected, reported via interfaces, and finally written to an HA InfluxDB. In total this runs on about twenty 8-core, 16 GB machines. Our overall structure is simple: collection, then a reporting interface gateway. The gateway is a heavyweight component; the data processing and alarm sending inside it have not been split out separately. From an operations point of view this is really a trade-off between a monolith and separate applications and services. I won't make the comparison here, but if you are interested you can refer to my earlier article on microservice splitting.

We strive for sustainability, easy maintenance and low code complexity; don't do microservices for microservices' sake. For example, we do not use RPC between the agent and the gateway but plain short-lived connections, because RPC demands a lot of attention to connection-pool maintenance and connection-state management, while the problems it essentially solves, such as reducing bandwidth and connection-setup time, are not major concerns in our architecture. Putting nginx in front of the gateway is also our usual practice for log analysis: it removes the need to load frameworks and lots of logging code into the business code. The whole thing is a stateless web service that is easy to scale and maintain.

Single-point model applications

Here we have three models: time-series prediction, time-series anomaly detection and alarm correlation analysis. Time-series prediction is used in the resource-recovery scenario, anomaly detection in VIP external-network quality monitoring, and correlation analysis in scenarios such as IO alarm correlation. In fact, for AI to land in AIOps, the accuracy of the algorithm itself is only one aspect; choosing the right scenario for the algorithm is just as important. Whether an algorithm lands well is directly tied to the choice of scenario and the cleanliness of the data.

Take time-series anomaly detection as an example: 500 VIPs and 200 CDN IPs give 100,000 pinged links, which would mean 100,000 alarm thresholds.

To detect anomalous points on these 100,000 links, we have two approaches. The first is a fixed threshold, say 500 ms, which is certain to produce both missed and false alarms. For example, an abnormal fluctuation that stays within 500 ms cannot be detected by the fixed threshold, resulting in a miss. Conversely, a link from Xinjiang to Beijing may always sit around 510 ms; that value is normal for the link, yet the whole link becomes a standing false alarm.

A better approach is to generate a suitable alarm threshold for each of the 100,000 links. Setting a threshold for each link manually is impossible, so we use an anomaly-detection model to find each link's anomalous points, which is equivalent to giving each link its own dynamic threshold.
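As a minimal sketch of the per-link dynamic-threshold idea (the function name and parameters are my own, not the production code), an EWMA baseline plus a 3σ band, one of the statistical approaches mentioned later, adapts to each link's own behavior: the always-510 ms link stays inside its own band, while a fluctuation that never crosses a global 500 ms cap can still be flagged.

```python
import numpy as np

def ewma_3sigma_anomalies(series, alpha=0.3, k=3.0, warmup=10):
    """Flag points outside an EWMA +/- k*sigma band (per-link dynamic threshold).

    The first `warmup` points seed the baseline; after that the mean and
    variance are updated exponentially, so the band tracks this link only.
    """
    s = np.asarray(series, dtype=float)
    ewma = s[:warmup].mean()
    var = s[:warmup].var() + 1e-6   # tiny floor so a flat baseline still has a band
    flags = [False] * warmup        # warmup points are never flagged
    for x in s[warmup:]:
        resid = x - ewma
        flags.append(abs(resid) > k * var ** 0.5)
        # update the moving estimates after scoring the point
        ewma = alpha * x + (1 - alpha) * ewma
        var = alpha * resid ** 2 + (1 - alpha) * var
    return flags
```

A link that always pings around 510 ms produces no flags, while a sudden jump on an otherwise quiet link is caught even if it is below any global cap.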

For the 100,000 links, we first divide them into about 200 classes through clustering, with one model per class. If the clustering algorithm is not well tuned, we can fall back to classifying them manually by source CDN machine room.
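The clustering step can be sketched with a toy k-means over per-link feature rows (for instance mean and standard deviation of latency). This is a stand-in for the production clustering, with an assumed first-k initialization, not 360's actual algorithm:

```python
import numpy as np

def kmeans_links(profiles, k=3, iters=20):
    """Minimal k-means over per-link feature rows.

    A toy stand-in for the clustering that folds ~100k links into ~200
    classes so each class shares one detection model. Initialization is
    simply the first k rows, which is fine for a sketch.
    """
    X = np.asarray(profiles, dtype=float)
    centers = X[:k].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each link to its nearest class center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its class
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

Each resulting class then gets one anomaly-detection model instead of one per link.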

Our anomaly-detection setup includes several statistical models in addition to isolation forest and EWMA+3σ curve fitting. Each model is given a weight, and the final vote across models decides whether a data point is anomalous. Most of the time the isolation forest carries the most weight and is the most accurate; the other models mainly serve to confirm its verdict several times over. Two heads are better than one.
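The weighted vote itself is simple. Here is a minimal sketch (the weights and the 0.5 cutoff are illustrative assumptions, not 360's tuned values):

```python
def weighted_vote(votes, weights, threshold=0.5):
    """Combine per-model anomaly votes (True/False) by weight.

    One strong model (e.g. the isolation forest) can carry most of the
    weight, while the statistical models act as confirmation.
    """
    total = sum(weights)
    score = sum(w for v, w in zip(votes, weights) if v)
    return score / total >= threshold
```

If the heavily weighted model says "anomalous" and one confirmer agrees, the point is flagged; a lone weak model's vote is not enough.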

The overall flow is: when an external-network quality data point arrives, the AI center fetches from the Redis queue the data similar to the current point within the past 2 hours. There are 100,000 data points per minute, and each point is judged in series by 7 models. The statistical models among the 7 need to preload data and load models in real time, and the models themselves also need time to process data, so all 100,000 points must be processed within one minute and each point must return within one minute, otherwise the real-time performance of alarms suffers.

Within a single machine, the loading time of the statistical data (about 10s) is eliminated by maintaining a sliding window in the alarm-data processing layer, so the AI center only receives data and does no data preparation itself.
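The sliding window can be sketched as a bounded per-link buffer (a minimal illustration with assumed names; the window length of 120 one-minute points, roughly 2 hours, is an assumption based on the text):

```python
from collections import defaultdict, deque

class SlidingWindows:
    """Keep the most recent points per link in memory so the statistical
    models read a ready-made window instead of re-loading ~2 hours of
    history on every request.
    """
    def __init__(self, maxlen=120):  # e.g. 120 one-minute points ~= 2h
        self.windows = defaultdict(lambda: deque(maxlen=maxlen))

    def push(self, link_id, value):
        # deque(maxlen=...) silently drops the oldest point when full
        self.windows[link_id].append(value)

    def window(self, link_id):
        return list(self.windows[link_id])
```

Because the window is maintained incrementally as points arrive, scoring a new point is a read, not a reload.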

Loading the models themselves in real time also took about 10s, so we hot-load the online (non-real-time) models into memory in advance, and the AI center swaps in a newly trained offline model every 6 hours. The first two steps save 20s in total. The model's own processing time, about 100 ms per point, cannot be reduced further, so a single goroutine can process about 200 points per minute. We therefore scaled the AI center out into a 10-node high-performance model-processing cluster; the 100,000 requests are split up so that each machine handles more than 10,000. To control bandwidth, requests to and anomaly results from the AI center are executed in batches, so the AI center has a map-reduce-style mechanism. In the end, the anomaly states of all 100,000 points are returned in batches within 1 minute.
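The batched scatter-gather can be sketched as follows. This is an illustration in Python with assumed names (the production version is described as goroutine-based), showing the split / parallel-score / merge shape rather than the real service:

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(items, score_batch, n_workers=10):
    """Split items into n_workers batches, score each batch in parallel,
    and merge the per-batch results.

    `score_batch` takes a list of items and returns {item: is_anomalous},
    standing in for one node of the model-processing cluster.
    """
    batches = [items[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(score_batch, batches))
    merged = {}
    for r in results:
        merged.update(r)
    return merged
```

Batching both the requests and the returned anomaly states keeps the number of round trips, and therefore the bandwidth, under control.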

Finally, the detected anomalies are manually annotated on the annotation platform and fed back to our algorithm models, calibrating the models with human input.

Time-series prediction is used in the resource-recovery scenario; the flow chart is as follows. The core lies in the labeling for classification and the accuracy of the prediction model. I won't expand on it here.

The correlation-analysis model is used for IO alarm correlation. We first judge whether there is correlation between multiple time series, and then make a simple root-cause judgment from the order in which anomaly detection found the anomalous points. I won't expand on it here.
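The two steps just described can be sketched very simply (illustrative names and a Pearson-correlation cutoff chosen as assumptions; the production model is certainly richer):

```python
import numpy as np

def correlated(a, b, threshold=0.8):
    """Step 1: judge whether two series move together (|Pearson r| >= threshold)."""
    r = np.corrcoef(a, b)[0, 1]
    return abs(r) >= threshold

def likely_root(anomaly_times):
    """Step 2: given {metric: first anomalous timestamp}, treat the
    metric whose anomaly appeared earliest as the root-cause candidate.
    """
    return min(anomaly_times, key=anomaly_times.get)
```

So if disk IO and request latency are correlated and the IO anomaly appeared first, disk IO is reported as the likely root cause.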

Annotation platform

We know the data comes out of a model, but the model needs continuous feedback on its accuracy to keep evolving, and that is the annotation platform's role. For each of the three models and their scenarios, the left column shows the model's output, for example an anomalous point, and the right side shows the current value of the metric. The model's consumers, operations or business staff, can choose Y or N, and the feedback is pushed to the model database; based on it, the offline models are continuously updated and retrained.

Self-healing platform

At present our self-healing mostly covers high-frequency scenarios such as machine down, hardware-fault ticketing and process restarts. The bottom layer is currently based on StackStorm: daily self-healing scenarios are abstracted into atomic action operations, and these actions are composed into workflows. When a new scenario is added, the business side can select general-purpose actions in the UI and arrange them into its own workflow. StackStorm also has drawbacks, such as a heavy framework, reliance on YAML and other text files, and relatively high onboarding cost; these can be replaced and optimized over time.
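The "atomic actions composed into a workflow" idea can be sketched like this. Note this is a hypothetical illustration, not StackStorm's real API: a workflow is modeled as an ordered list of actions, each a function that takes a context dict and returns an updated one, and the action names below are invented for the example.

```python
def run_workflow(actions, context=None):
    """Run atomic actions in order, threading a shared context through them."""
    context = dict(context or {})
    for action in actions:
        context = action(context)
    return context

# Hypothetical atomic actions a process-restart self-healing flow might compose:
def check_process(ctx):
    ctx.setdefault("log", []).append("check")
    ctx["alive"] = False  # pretend the probe found the process down
    return ctx

def restart_process(ctx):
    if not ctx.get("alive"):
        ctx["log"].append("restart")
        ctx["alive"] = True
    return ctx

def notify(ctx):
    ctx["log"].append("notify")
    return ctx
```

A new scenario then just means picking a different ordering of existing actions, which is exactly what arranging general-purpose actions in the UI achieves.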

Conclusion

The pipeline above, operations big data -> AI center -> annotation platform -> alarm self-healing -> operations big screen, is our overall AIOps framework. While polishing it internally, we also hope to generalize each component and decouple it from the company, so that in the future it can help more teams land AIOps as a ToB offering. If you have any AIOps landing experience of your own, you are welcome to share it in the comments.

To learn more about the 18th session of the 360 Internet Technology Training Camp, including the fourth lecture, see:

StackStorm-based ChatOps solution — monitoring alarm self-healing

360 Internet Technology Training Camp, 18th session — AIOps landing practice exploration

360's official technology account