Yu Tao, head of network transmission quality for Agora's SD-RTN, delivered a speech at RTE2021. He gave a comprehensive analysis of the pain points of traditional OPS, the advantages of AI OPS, and the difficulties of AI OPS engineering, and shared the sNET R&D team's hard-won experience in implementing AI OPS.

▲ Figure: Yu Tao, head of network transmission quality of Agora SD-RTN

01 Why do we need AI OPS

Demand often starts from pain points in production, and every new technology is incubated by demand. Traditional OPS currently has three major pain points: the need for 7×24 uninterrupted operation and maintenance, the quality of operations execution, and execution efficiency.

1. 7×24

High service availability depends on 7×24 uninterrupted operation and maintenance, yet building a 7×24 team is difficult and costly. As business scale grows and the industry globalizes, this difficulty increases roughly linearly.

2. Quality

Operations is a strongly experience-driven job, which is why new operations hires traditionally need a long onboarding period to become familiar with the business. Differences in experience lead to inconsistent handling quality: an exception may be missed, handled inadequately, or handled too aggressively, all of which directly affect the customer experience. AI OPS can compensate for this well: a well-trained algorithm can approach the execution quality of a group decision by experienced operations personnel, while keeping execution results highly consistent.

3. Efficiency

Operations efficiency is also critical to user experience. If service quality is abnormal for an hour, the user experience is damaged for an hour; if handling takes only 5 minutes, only 5 minutes are affected. Taking recovery from network-quality degradation as an example, an experienced engineer takes about 15 minutes on average from alarm to completed handling. At the scale of today's online services, a team of limited size must monitor many clusters at the same time while ensuring every exception is handled efficiently, and here traditional operations inevitably falls short. Automated AI OPS has an inherent advantage in execution efficiency.

In retrospect, these pain points of traditional human OPS are exactly where AI OPS's greatest strengths lie:

  • 7×24: machines do not need to sleep.
  • Quality: the execution quality of a trained model is stable and reliable.
  • Efficiency: an automated AI OPS system is far more efficient than humans.

02 Difficulties in AI OPS engineering

In concrete practice, the engineering implementation of AI OPS still faces many difficulties.

Standardization: From an industry perspective, AI OPS is in a period of inflated expectations. While technologies and tools keep emerging, the industry lacks standards and mature, stable platforms and tool chains. From a company's perspective, AI OPS is still in an exploratory stage, and engineering without standardization is costly and risky. It is as if we wanted to containerize our existing cloud services, but mature tools like Docker and K8S did not yet exist and cloud vendors offered no container adaptation or compatibility: a great deal of extra architectural design and development work would be required. This is the first thorny issue AI OPS faces on the road to implementation.

Inconsistent expectations: Within an enterprise, the business, operations, algorithm, and even big data teams often interpret AI OPS differently. Many people still remember the intelligent-driving accidents in the news over the past few years. One likely reason is that people expected intelligent driving to be "real autopilot", yet AI at its current stage can hardly operate fully independently; when driving conditions exceed the AI's capability boundary and the driver fails to take over, tragedy occurs. "AI" is a fashionable word, but people from different disciplines have different expectations of it. Inconsistent expectations lead to poor communication, which not only slows collaboration but can also cause online failures.

High infrastructure requirements: one cannot make bricks without straw. Besides good algorithms, good AI relies on high-quality data, and AI OPS is no exception. For a global cloud service provider, the foundation of AI OPS is solid big data infrastructure: in addition to a real-time, high-throughput data center, streaming computing interfaces must also be provided.

03 Agora's AI OPS implementation best practices

How did our R&D team solve these problems in the concrete practice of landing AI OPS?

  • Short- and long-term goal setting

Aligning on a long-term goal helps the team clarify its direction of development and sort out the focus of early work.

Short-term goals break the long-term goal into phased milestones. Their main role is to put the project's AI algorithms into practice as soon as possible; they also lay the foundation for the long-term goal, including building common components along the way.

  • Teams align expectations and complement each other

AI OPS is a multi-team effort: it is the product of close collaboration between the SRE, business, algorithm, and big data teams. In this process, aligning expectations and good teamwork are crucial.

The first point is understanding capabilities across teams. Our business team needs to take the initiative to understand what the algorithm team can do. Beyond real-time anomaly capture and time-series anomaly detection on a single metric, the algorithm team can also provide consumption prediction. Depending on data quality and consumption patterns, predictions can even be made on a weekly or monthly basis. For SRE, this is of great value for traffic planning and cost control.
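To make "time-series anomaly detection on a single metric" concrete, here is a minimal sketch of one classic approach: a rolling z-score detector that flags points deviating sharply from recent history. The talk does not describe Agora's actual algorithm; this is only an illustration of the capability.

```python
from collections import deque

def detect_anomalies(series, window=30, threshold=3.0):
    """Flag points that deviate more than `threshold` standard deviations
    from the mean of the preceding `window` samples."""
    history = deque(maxlen=window)
    anomalies = []
    for i, x in enumerate(series):
        if len(history) == window:
            mean = sum(history) / window
            var = sum((v - mean) ** 2 for v in history) / window
            std = var ** 0.5
            # max(std, eps) keeps a perfectly flat history from dividing by zero
            if abs(x - mean) > threshold * max(std, 1e-9):
                anomalies.append(i)
        history.append(x)
    return anomalies

# A flat packet-loss series with a single spike at index 50
series = [1.0] * 100
series[50] = 10.0
print(detect_anomalies(series))  # -> [50]
```

A production detector would handle seasonality and trend, but even this sketch shows why the capability generalizes cheaply across metrics: the detector knows nothing about what the series measures.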

The second is understanding boundaries, which is very important. A vague understanding of the boundaries of other teams' capabilities can lead to online failures or worse. The quality of machine learning depends on the quality of manually annotated data, so an algorithm's capability boundary roughly approximates that of the people (or groups) doing the annotation. If the business team puts too much faith in the algorithm, an online failure becomes almost inevitable. Applying a flawed algorithm to a core service completely outside human control is like an autopilot with no human oversight: by Murphy's law, something will go wrong.

Once capabilities and boundaries are understood, teams need to complement each other to achieve overall excellence: AI does what it is best at, reducing labor costs with high quality and efficiency, while traditional operations handles unexpected situations to ensure overall availability.

  • Decouple business, operations, and algorithms

A crucial step in the landing process is turning the operations platform into a service with APIs.

What do "operations" and "landing" mean here? Operations refers to actions at the service/business layer, so landing AI OPS means having algorithm results take effect on the business. However, if the algorithm layer directly calls business operations interfaces or directly modifies the database, the first problem is high risk, and the second is severe coupling; conversely, the slow iteration of business and operations development drags down the algorithm. This is untenable for a project team that urgently needs rapid iteration during the exploration stage.

To solve the above problem, we split AI OPS into three layers:

  • Layer 1: AI layer
  • Layer 2: decision layer
  • Layer 3: execution layer

Once this split is done, AI and operations are decoupled, with three benefits:

1. The algorithm, decision, and execution layers can be developed independently, improving R&D efficiency. Each module, especially the algorithm, can iterate rapidly while avoiding the "pull one hair and move the whole body" problem, so iteration cost is lower. For AI OPS in the exploration stage, decoupling is an indispensable step on the road to landing.

2. The system is more robust. The decision layer can implement fool-proofing and safety policies to improve robustness: when a single algorithm crashes or outputs an abnormal result, there is more room for graceful handling.

3. Strong scalability. Standardized input interfaces provide a convenient access path for subsequent algorithms and even other automated scripts, essentially standardizing AI OPS within the company.
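The three-layer split above can be sketched as follows. All names and policies here (quality thresholds, rate limits, the ops-API call) are illustrative assumptions, not Agora's actual interfaces; the point is only that the decision layer is the sole gatekeeper between algorithm output and production.

```python
def ai_layer(metrics):
    """AI layer: turn raw metrics into a suggested action."""
    if metrics["quality_rate"] < 0.9:
        return {"action": "reroute", "target": metrics["node"]}
    return None

def decision_layer(suggestion, recent_actions, max_actions_per_hour=5):
    """Decision layer: apply fool-proofing and safety policies before
    anything reaches production."""
    if suggestion is None:
        return None
    if suggestion["action"] not in {"reroute", "throttle"}:
        return None  # unknown action: fail safe, do nothing
    if len(recent_actions) >= max_actions_per_hour:
        return None  # rate-limit to contain a misbehaving algorithm
    return suggestion

def execution_layer(decision):
    """Execution layer: the only place that touches the ops API."""
    if decision is None:
        return "no-op"
    return f"calling ops API: {decision['action']} on {decision['target']}"

metrics = {"node": "sg-edge-01", "quality_rate": 0.82}
decision = decision_layer(ai_layer(metrics), recent_actions=[])
print(execution_layer(decision))  # -> calling ops API: reroute on sg-edge-01
```

Because each layer sees only plain data from the layer above, a new algorithm (or even a hand-written script) can be swapped into the AI layer without touching decision or execution code.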

  • Convert multiple metrics into one composite metric — simplify the complex

Anomaly detection on multiple metrics is more complex than on a single metric: it demands higher data quality and annotation quality and is harder to implement. Here the business side can play its complementary role and try to convert multiple metrics into one composite metric. For example, delay, packet loss, and jitter all affect the network transmission experience; they can be combined into a composite "quality transmission rate" metric. This metric accurately reflects network quality while remaining simple for the algorithm team's single-metric models to train on, accelerating the landing of AI OPS.

  • A “powerful” algorithm is not necessarily a suitable algorithm

Algorithms are at the heart of AI OPS, but the most powerful algorithm is not necessarily the most appropriate one; maturity and robustness also matter.

For example, the engine is the power core of a car, and a powerful engine often needs higher-grade gasoline. Similarly, more powerful algorithms such as deep learning require higher-quality data, which involves a long link: appropriate metrics must be defined at the business level; during transmission, data may be lost or dirtied; and algorithm training needs a lot of manual annotation. Given these three points, the challenges in the early stage of AI OPS are considerable, and selecting an algorithm suited to the business at the current stage promotes landing more effectively.

The business itself must also be considered. An algorithm has two key indicators: precision and recall. If the business is fault-tolerant, it can trade some precision for more recall, which also benefits user experience.
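The precision/recall trade-off mentioned above is typically realized by moving a score threshold. The toy scores and labels below are made up for illustration; the mechanism, not the numbers, is the point.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of 'anomaly' predictions at a score threshold.
    labels: 1 = true anomaly, 0 = normal."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# hypothetical anomaly scores from a model, with ground-truth labels
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    1,    0,    0,    0]

# strict threshold: high precision, but half the real anomalies are missed
print(precision_recall(scores, labels, 0.85))  # (1.0, 0.5)
# relaxed threshold: accept some false positives, catch every anomaly
print(precision_recall(scores, labels, 0.55))  # (0.8, 1.0)
```

For a fault-tolerant business, the relaxed threshold is often the better choice: a spurious remediation costs little, while a missed anomaly costs user experience.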

04 Outlook

With our long-term goals in mind, we hope to explore further in the future: on the one hand, taking both cost and efficiency into account; on the other, enabling the platform to serve more services.
