On October 20, at OSCAR Open Source Pioneer Day, the Tencent Cloud Metis intelligent operations and maintenance learning software platform was officially announced as open source. Metis is the first open source product in the field of AIOps (Algorithmic IT Operations), also known as intelligent operations. Intelligent operations advocates using algorithms to learn and discover rules from massive volumes of operations data, gradually reducing the dependence on manually specified rules and thereby reducing human error.
OSCAR Open Source Pioneer Day was hosted by the China Academy of Information and Communication Technology. He Baohong, Director of the China ICT Cloud Institute; Li Wei, Deputy Director of the Cloud Computing Department of the China ICT Cloud Institute; Zhao Jianchun, Vice President of Tencent Cloud; and Xiao Shiguang, General Manager of the Tencent Cloud Operation Department, jointly took part in the open source launch ceremony for Metis.
(Tencent Cloud “Metis” open source project officially released)
Zhao Jianchun, Vice President of Tencent Cloud, said: “The combination of artificial intelligence and operations has always been the new idea behind AIOps. Seeking a new breakthrough in intelligent operations, moving from traditional APIs to operations learning software, and contributing the models Tencent has trained on massive data to the open source community and the industry: I think this is the significance of open-sourcing the Zhiyun Metis intelligent operations learning software, and of building AI operations scenarios that integrate AI and operations.”
(Zhao Jianchun, Vice President of Tencent Cloud)
In the name “Tencent Cloud Metis intelligent operations and maintenance learning software platform”, the concept of “learning software” (learnware) was put forward by Professor Zhou Zhihua of Nanjing University. Learning software = model + specification, and it is reusable, evolvable, and understandable. Building on this, Zhao Jianchun further proposed the concept of “operations learning software”, also called an AI operations component, emphasizing that it is an intelligent solution for operations scenarios that is capable of memory.
“Zhiyun” refers to Tencent’s intelligent integrated operations and maintenance platform, while “Metis” is named after Metis, the goddess of wisdom in Greek mythology. As Internet services have expanded rapidly and service types have diversified, the shortcomings of manually written rules have become increasingly apparent, which has driven the rapid development of intelligent operations over the past two years. Zhiyun Metis is a collection of applied practices in intelligent operations. It aims to analyze operations data and make decisions on it through a series of machine-learning-based algorithms, so as to reach a more advanced stage of automated operations.
Because of the wide variety and large scale of its social services, Tencent has built extensive IT infrastructure. Metis came into being to enable multi-dimensional, in-depth operations on the massive volume of operations data generated as the company's various businesses develop and interact.
At present, Metis has put many intelligent operations practices into production across the three areas of operations quality, efficiency, and cost, and has gradually built mature intelligent operations scenarios. These are embodied in six aspects: quality assurance, efficiency improvement, cost management, intelligent detection, general model, and rule learning.
Quality assurance: Machine learning can be used to detect anomalies, locate faults, and analyze bottlenecks, keeping services running stably without human intervention. Examples include threshold-free intelligent monitoring, DLP life-and-death indicator monitoring, and multidimensional root cause analysis.
Efficiency improvement: Based on natural language processing and machine learning, intelligent question answering, intelligent change management, and intelligent decision-making can significantly improve operations efficiency. Examples include the Metis intelligent consulting robot, public opinion monitoring, intelligent cluster load balancing, database parameter tuning, and capacity prediction.
Cost management: Based on big data analysis, Metis manages resources (equipment, bandwidth, storage), quickly analyzes the details of resource usage, and identifies optimization opportunities by comparing large data sets side by side. One example is hard disk life cycle prediction.
The first component Metis is open-sourcing this time is its threshold-free intelligent monitoring software, which addresses intelligent detection of time series data by combining unsupervised and supervised learning. It covers the following three aspects.
Intelligent detection: Operations personnel no longer need to set monitoring thresholds. The model judges abnormal situations itself and directly reports whether a detection result is normal or abnormal. Threshold-based monitoring typically involves setting limits along several dimensions, such as maximum value, year-over-year change, and period-over-period change. This approach works well in the early stage, but as the business grows and expands, keeping the thresholds in an appropriate range requires so much manual effort that, for a large and fast-growing business, the cost outweighs the benefit. The intelligent detection scheme instead combines statistical decision methods, unsupervised learning, and supervised learning to detect anomalies in time series data. A first-level decision made by statistical methods and unsupervised algorithms outputs suspected anomalies, and a supervised model then makes the decision that yields the final detection result. This process avoids the problems of the threshold approach.
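The two-level flow described above can be sketched roughly as follows. This is a minimal illustration, not Metis's actual code: the function names, the 3-sigma rule, the EWMA check standing in for the unsupervised algorithm, the feature set, and the `classifier` callable are all assumptions made for the example.

```python
from statistics import mean, stdev
from typing import Callable, List, Sequence


def three_sigma_suspect(window: Sequence[float], value: float, k: float = 3.0) -> bool:
    """Statistical check (assumed): is value more than k std devs from the window mean?"""
    mu, sigma = mean(window), stdev(window)
    if sigma == 0:
        return value != mu
    return abs(value - mu) > k * sigma


def ewma_suspect(window: Sequence[float], value: float,
                 alpha: float = 0.3, tolerance: float = 0.2) -> bool:
    """Label-free check (assumed stand-in for the unsupervised algorithm):
    does value deviate from an EWMA forecast by more than `tolerance` (relative)?"""
    forecast = window[0]
    for x in window[1:]:
        forecast = alpha * x + (1 - alpha) * forecast
    return abs(value - forecast) / max(abs(forecast), 1e-9) > tolerance


def extract_features(window: Sequence[float], value: float) -> List[float]:
    """Hypothetical features passed to the supervised model."""
    mu, sigma = mean(window), stdev(window)
    z = (value - mu) / sigma if sigma else 0.0
    return [value - mu, z, value - window[-1]]


def detect(window: Sequence[float], value: float,
           classifier: Callable[[List[float]], bool]) -> bool:
    """Level 1 outputs suspected anomalies; level 2 (supervised) gives the final result."""
    suspected = three_sigma_suspect(window, value) or ewma_suspect(window, value)
    if not suspected:
        return False  # normal points never reach the supervised model
    return classifier(extract_features(window, value))
```

In this sketch the inexpensive first-level checks filter out most normal points, so the supervised model only has to judge the suspected anomalies. Any trained model that maps a feature vector to a normal/abnormal decision could be passed in as `classifier`.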
General model: The intelligent detection model is trained on diverse, massive business samples from Tencent Cloud, which makes it suitable for reuse in time series detection across the Internet industry. The effectiveness of supervised detection depends on the accuracy and variety of the labeled samples. Through the sample database management function, a large number of positive and negative samples are accumulated and divided into training sets and test sets. The general model is trained on the large volume of training-set data and covers a relatively comprehensive range of sample classes. It helps users who lack training data: they can load the general model directly and use it for detection.
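As a rough illustration of dividing accumulated samples into training and test sets, the sketch below splits labeled samples per class so that both sets contain positive and negative examples. The function name, the `(series, label)` tuple layout, the label convention (1 = anomalous, 0 = normal), and the 20% test ratio are assumptions for this example, not details of the Metis sample database.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (time series window, label); layout is assumed


def split_samples(samples: List[Sample], test_ratio: float = 0.2,
                  seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Split labeled samples into (train, test), keeping each label's share in both."""
    by_label: Dict[int, List[Sample]] = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train: List[Sample] = []
    test: List[Sample] = []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_ratio)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

Splitting per label matters here because anomalous samples are usually much rarer than normal ones; a purely random split could leave the test set with too few anomalies to measure the model meaningfully.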
Rule learning: In practice, business scenarios differ, and different users have different criteria for what counts as abnormal. Metis therefore supports annotation feedback. Users can label detection results, train on those annotations to generate a new detection model, and in this way have the model learn new business rules.
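The annotation feedback loop can be illustrated with a deliberately tiny model. In the sketch below, a one-dimensional decision stump stands in for the supervised model, and each sample is reduced to a single deviation score; both simplifications, along with the function names, are assumptions for illustration only. The point is the loop itself: user corrections overwrite labels, and retraining moves the decision boundary.

```python
from typing import Dict, List, Tuple

ScoredSample = Tuple[float, int]  # (deviation score, label); 1 = anomalous (assumed)


def train_stump(samples: List[ScoredSample]) -> float:
    """Return the cut t maximizing training accuracy for the rule `score > t => anomalous`."""
    scores = sorted({s for s, _ in samples})
    # Candidate cuts: below the minimum, between adjacent scores, and at the maximum.
    candidates = ([scores[0] - 1.0]
                  + [(a + b) / 2 for a, b in zip(scores, scores[1:])]
                  + [scores[-1]])

    def accuracy(t: float) -> float:
        return sum((s > t) == bool(y) for s, y in samples) / len(samples)

    return max(candidates, key=accuracy)


def apply_feedback(samples: List[ScoredSample],
                   feedback: Dict[int, int]) -> List[ScoredSample]:
    """Overwrite labels with user annotations, given as {sample index: corrected label}."""
    return [(s, feedback.get(i, y)) for i, (s, y) in enumerate(samples)]


samples = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (5.0, 1)]
old_cut = train_stump(samples)
# A user marks the sample with score 3.0 as anomalous for their business.
new_cut = train_stump(apply_feedback(samples, {2: 1}))
```

After the user's correction, the retrained cut drops below 3.0, so similar points are flagged as anomalous from then on. A production system would retrain a much richer model, but the feedback mechanism follows the same pattern.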
The Metis threshold-free intelligent monitoring software already handles anomaly detection for more than 2.4 million business indicators within Tencent. It is widely applicable to anomaly detection and operations monitoring once monitoring data reaches massive scale. It can replace traditional threshold-based detection to provide intelligent anomaly detection on time series data, and it can also push alarms for abnormal data according to service policies.
In keeping with Tencent's open source philosophy, Metis will build an open learning software platform and go on to open-source further intelligent operations learning software, including time series indicator prediction, intelligent host anomaly analysis, intelligent MySQL anomaly analysis, and disk life cycle prediction, while gathering users' experience and practices in building intelligent operations. The goal is to enrich and improve AI learning software across quality, efficiency, and cost and to build complete operations scenarios. In the future, Metis will also be made compatible with other open source products in the monitoring field, such as Zabbix, Nagios, and Open-Falcon.
In recent years, Tencent has become increasingly active in the open source community. Since 2010, Tencent has adopted an internal R&D model of “openness, sharing, and joint development”. Alongside open-sourcing its own projects, it actively participates in community work: it has joined Hyperledger, LF Networking, and the Open Networking Foundation, and has become a premier founding member of the LF Deep Learning Foundation and a platinum member of the Linux Foundation. Open-sourcing Metis is another instance of Tencent's open strategy in technology. Within the industry, it will fill the open source gap in intelligent operations and bring together efforts to advance operations technology.