Building the NetEase Yunxin Service Monitoring Platform in Practice

By | Dai Qiang, senior development engineer, NetEase Yunxin data platform

Overview: Data is crucial to many businesses. At NetEase Yunxin, we use data to improve our services and drive continuous business growth. The service monitoring platform gives us a direct view of how our online services are running. This article explains in detail how NetEase Yunxin's service monitoring platform is built.

Introduction

Human fear often comes from not understanding the real world.

Real life is full of uncertainty, and we feel fear because our current knowledge cannot make sense of it. The recent sudden outbreak is one example: the fear of death spread along with it. What is uncertainty? Suppose we had to decide whether the stock index will go up or down tomorrow. Without any data to back the decision, we could do no better than flip a coin, with a 50/50 chance of being right. In that situation none of our judgments are reliable, and we feel uneasy about our decisions.

If people knew every factor involved in an upcoming event, they could predict it accurately. Conversely, if an event has happened, its occurrence can be regarded as inevitable. This is Laplace's doctrine of determinism.

As this theory implies, data can help us choose a direction and verify whether we are on the right track. Data is just as crucial to the development of NetEase Yunxin: we need it to improve our services and drive the continuous growth of our business.

NetEase Yunxin is a PaaS product built on NetEase's 20 years of IM and audio and video technology. We are committed to providing stable and reliable communication services, but how do we ensure that stability and reliability?

The service monitoring platform is one of the key links. It is like the dashboard of a Bugatti Veyron: how fast the car is going and whether there is enough fuel are clear at a glance, which helps us judge whether we can step on the accelerator or need to brake. This is the goal and the value of the service monitoring platform. It is the dashboard of NetEase Yunxin's own Bugatti Veyron: it tells us the current quality of service, whether we need to add more "fuel", and whether to step on the "accelerator" or the "brake". It also gives us and our customers more information, helping us provide the highest-quality, most reliable, and most stable service.

This article explains in detail how NetEase Yunxin's service monitoring platform is built. It starts with the overall architecture and a brief look at the platform's framework, then examines the implementation of four modules: data collection, data processing, monitoring and alarms, and data application.

System architecture

At present, NetEase Yunxin's audio and video data comes mainly from client and server logs, so the data collection link is critical: it determines the validity and timeliness of the data.

First, let's look at the overall architecture of the NetEase Yunxin collection and monitoring platform, as shown below:

The collection and monitoring platform is divided into four parts: data collection, data processing, monitoring and alarms, and data application. The overall processing flow is as follows:

Data collection:
- Our main data sources are the business SDKs and the application servers. Their data reaches the collection service through the HTTP API or Kafka.
- The collection service performs simple validation and splitting, then passes the data to the data cleaning service through Kafka.
Data processing: The data processing service processes the received data and sends it to downstream services. We provide simple formatting capabilities such as JSON conversion, as well as a script processing module for more flexible and powerful processing, which makes the data processing capabilities of the monitoring platform more versatile.
Monitoring and alarms: This module is the core of the service monitoring capability mentioned at the beginning. The collected data is aggregated and analyzed along different dimensions, and a rich set of aggregation algorithms together with a flexible rule engine provides early warning and helps locate problems.
Data application: The cleaned data can be written directly to the time-series database for use by the troubleshooting platform, or delivered through Kafka to ES, HDFS, and the stream processing platform for use by the application layer, for example the quality service platform, the general query service, and the troubleshooting platform.

Next, we will analyze the above four modules in detail.

Data collection

Data collection is the entry point of the service monitoring platform and the first step of the whole process. The following figure shows the architecture of the data collection module.

As mentioned above, to make access easy for the business side we provide two channels, an HTTP API and Kafka.

- The HTTP API is used for real-time data reporting in client-side or server-side scenarios and supports second-level data access.
- Kafka is mainly used in scenarios with high throughput and looser real-time requirements.
- The filtering and preprocessing module filters out invalid data in advance and performs preliminary steps such as splitting the data.
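
The filtering and splitting step can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the required field names (`app_id`, `event`, `ts`), the JSON-array payload shape, and the `monitor.<event>` topic naming are hypothetical and not taken from the Yunxin platform. The sketch only groups records by topic; handing them to an actual Kafka producer is left out.

```python
import json
from typing import Any, Dict, List

# Hypothetical schema: the real platform's required fields are not documented here.
REQUIRED_FIELDS = ("app_id", "event", "ts")


def is_valid(record: Dict[str, Any]) -> bool:
    """Accept only records that carry every required field and a positive timestamp."""
    if not all(name in record for name in REQUIRED_FIELDS):
        return False
    ts = record["ts"]
    return isinstance(ts, (int, float)) and not isinstance(ts, bool) and ts > 0


def split_batch(payload: str) -> List[Dict[str, Any]]:
    """Split one reported payload (a JSON array) into individual valid records."""
    try:
        batch = json.loads(payload)
    except json.JSONDecodeError:
        return []  # drop payloads that are not valid JSON
    if not isinstance(batch, list):
        batch = [batch]  # tolerate a single object instead of an array
    return [r for r in batch if isinstance(r, dict) and is_valid(r)]


def route(records: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group records by a hypothetical per-event topic name, ready for a producer."""
    routed: Dict[str, List[Dict[str, Any]]] = {}
    for record in records:
        routed.setdefault(f"monitor.{record['event']}", []).append(record)
    return routed
```

Rejecting bad records this early keeps malformed data out of the downstream Kafka topics, so the processing service only has to deal with records that already pass the basic checks.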

Finally, the data is sent through Kafka to the data processing service. The data processing stage is introduced next.

Data processing

Once data collection is complete, the data enters the processing stage. The process is as follows:

Task scheduling: manages the life cycle of the data processing threads, from startup to shutdown.
Consumers: after fetching data, consumers hand it off through internal queues. This decouples consumption from processing, so the data processing threads can scale horizontally and run with higher parallelism.
Processing units: parallelism can be set as required.
- Data processing capabilities fall into two categories: general rules and custom scripts. General rules cover simple JSON conversion and field extraction, which meet roughly 80% of requirements. To support more complex needs such as calculations across multiple fields, regular expressions, and joining multiple streams, we also support custom scripts for data processing.
- Dimension tables are mainly used when multiple data streams have to be joined. To cope with high data volume and concurrency, we combine a local cache with a third-party cache.
- Time-series database output: we use NTSDB as the time-series database. NTSDB is a clustered solution built by NetEase Cloud on top of InfluxDB, offering high availability, a high compression ratio, and high concurrency.
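
As a rough illustration of the processing unit, the sketch below shows a general rule that extracts fields from a JSON record, and a dimension table whose local cache sits in front of a slower lookup. It is a simplified sketch under stated assumptions, not the platform's implementation: the rule format (output name mapped to a dotted source path), the `DimensionTable` class, and the `remote_lookup` callable standing in for the third-party cache are all hypothetical. A production local cache would also need expiry and a size limit, which are omitted here.

```python
import json
from typing import Any, Callable, Dict, Optional


def extract(record: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path such as 'stats.rtt' through nested dicts; None if missing."""
    node: Any = record
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def apply_rule(raw: str, rule: Dict[str, str]) -> Dict[str, Any]:
    """General rule: parse JSON, then map output field names to dotted source paths."""
    record = json.loads(raw)
    return {out: extract(record, path) for out, path in rule.items()}


class DimensionTable:
    """Local dict cache in front of a slower lookup (stand-in for a third-party cache)."""

    def __init__(self, remote_lookup: Callable[[str], Optional[Dict[str, Any]]]):
        self._remote_lookup = remote_lookup
        self._local: Dict[str, Optional[Dict[str, Any]]] = {}

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        if key not in self._local:  # local miss: ask the remote cache once
            self._local[key] = self._remote_lookup(key)
        return self._local[key]


if __name__ == "__main__":
    dimensions = {"a1": {"customer": "demo"}}  # made-up dimension data
    table = DimensionTable(dimensions.get)
    row = apply_rule('{"app_id": "a1", "stats": {"rtt": 42}}',
                     {"app": "app_id", "rtt_ms": "stats.rtt"})
    row.update(table.get(row["app"]) or {})  # enrich the record from the dimension table
    print(row)
```

Serving repeated keys from the local cache means the remote cache is hit once per key rather than once per record, which matters most when a few keys dominate the traffic.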

After the data is processed, the next important stage is monitoring and alarms.

Monitoring and alarms

The following diagram shows the process of monitoring alarms:

Monitoring and alarms consists of an indicator aggregation module and an alarm module.

The indicator aggregation module supports grouping by specified fields, flexible aggregation window lengths, data filtering, fine-grained filtering at the level of individual operators, and a maximum allowed data delay. Most importantly, it supports a rich set of aggregation operators: sum, min/max, firstValue/lastValue, average, record count, distinct count, the TP series (TP90/TP95/TP99), period-over-period change, standard deviation, and more. It also supports composite indicators, that is, further calculation on the results of a first round of aggregation. These operators are what allow us to implement flexible monitoring rules.
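
Grouped, windowed aggregation with several of these operators can be sketched as follows. This is only an illustration under stated assumptions: the event shape (`ts` in seconds plus arbitrary fields), the operator names in `OPERATORS`, the tumbling-window bucketing, and the nearest-rank definition of the TP percentiles are choices made for the example, and the platform may define any of them differently.

```python
import math
import statistics
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical operator names mapped to plain Python implementations.
OPERATORS = {
    "sum": sum,
    "min": min,
    "max": max,
    "avg": statistics.fmean,
    "count": len,
    "distinct_count": lambda v: len(set(v)),
    "first": lambda v: v[0],
    "last": lambda v: v[-1],
    "stddev": statistics.pstdev,
    "tp90": lambda v: percentile(v, 90),
    "tp99": lambda v: percentile(v, 99),
}


def aggregate(events, group_by, field, window_s, ops):
    """Bucket events by (window start, group key) and apply each requested operator."""
    buckets = defaultdict(list)
    for event in events:
        window_start = event["ts"] // window_s * window_s  # tumbling window
        buckets[(window_start, event[group_by])].append(event[field])
    return {key: {op: OPERATORS[op](vals) for op in ops} for key, vals in buckets.items()}


if __name__ == "__main__":
    events = [{"ts": t, "region": "hz", "rtt": t + 1} for t in range(10)]
    print(aggregate(events, "region", "rtt", 60, ["avg", "tp90", "max"]))
```

A composite indicator in this sketch would simply be a second calculation over the returned dictionary, for example dividing one aggregated value by another.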

In addition, we changed the original one-stage aggregation into a two-stage aggregation. Why? Because data processing frequently runs into skew caused by hot keys. We therefore added a pre-aggregation stage in which a random number is used to scatter the data and keep the load balanced. The pre-aggregated results are then aggregated again in the second stage.
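
The two-stage idea can be sketched in a few lines. This is a single-process illustration, assuming events are simple `(key, value)` pairs and that the random number is appended to the key as a "salt"; in a real stream processing job the first stage would run in parallel across partitions. The function names and the `salt_buckets` parameter are hypothetical.

```python
import random
from collections import defaultdict


def stage_one(events, salt_buckets, rng):
    """Pre-aggregate by (key, random salt) so a hot key is spread over several partitions."""
    partial = defaultdict(lambda: [0, 0.0])  # (key, salt) -> [count, sum]
    for key, value in events:
        salt = rng.randrange(salt_buckets)
        acc = partial[(key, salt)]
        acc[0] += 1
        acc[1] += value
    return partial


def stage_two(partial):
    """Drop the salt and merge the partial results for each original key."""
    merged = defaultdict(lambda: [0, 0.0])
    for (key, _salt), (count, total) in partial.items():
        merged[key][0] += count
        merged[key][1] += total
    return {key: {"count": c, "sum": s, "avg": s / c} for key, (c, s) in merged.items()}


if __name__ == "__main__":
    events = [("hot", 1.0)] * 1000 + [("cold", 2.0)] * 10
    print(stage_two(stage_one(events, salt_buckets=8, rng=random.Random(0))))
```

Note that this only works directly for aggregates that can be merged from partial results, such as count, sum, min, and max (with average derived from sum and count). Percentiles and distinct counts need a mergeable intermediate representation instead of a single number.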

The alarm module and the indicator aggregation module are deliberately kept separate. The indicator aggregation module focuses on data aggregation and is not coupled into the alarm module. Alarming is an additional function: based on the data it receives, it only needs to check the alarm rules, control the alarm frequency, assemble the alarm message, and send it through the messaging platform. We support multiple notification channels, including phone calls, SMS, and the internal IM platform, so that problems are noticed as soon as they arise.
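
A minimal sketch of this rule check and frequency control, assuming threshold-style rules and a fixed minimum interval between alarms for the same rule. The `AlarmRule` and `AlarmEngine` classes, the rule fields, and the message format are hypothetical, and each channel is modeled as a plain callable rather than a real phone, SMS, or IM client.

```python
import operator
from dataclasses import dataclass, field
from typing import Callable, Dict, List

COMPARATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass
class AlarmRule:
    name: str
    metric: str
    comparator: str
    threshold: float
    min_interval_s: float = 300.0  # frequency control: at most one alarm per interval


@dataclass
class AlarmEngine:
    rules: List[AlarmRule]
    channels: List[Callable[[str], None]]  # stand-ins for phone, SMS, IM senders
    _last_sent: Dict[str, float] = field(default_factory=dict)

    def evaluate(self, metrics: Dict[str, float], now: float) -> List[str]:
        """Check each rule against the aggregated metrics and send throttled alarms."""
        sent = []
        for rule in self.rules:
            value = metrics.get(rule.metric)
            if value is None or not COMPARATORS[rule.comparator](value, rule.threshold):
                continue
            last = self._last_sent.get(rule.name)
            if last is not None and now - last < rule.min_interval_s:
                continue  # suppressed by frequency control
            message = f"[{rule.name}] {rule.metric}={value} {rule.comparator} {rule.threshold}"
            for send in self.channels:
                send(message)
            self._last_sent[rule.name] = now
            sent.append(message)
        return sent
```

Because the engine only consumes already-aggregated values, it can be changed or replaced without touching the aggregation module, which is the decoupling described above.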

Data applications

Our existing data applications include data visualization, the quality service platform, the ELK log platform, and online and offline analysis. Below is a brief overview of each.

Data visualization

Like most companies, we use Grafana for data visualization. Data that needs to be visualized is first synchronized to NTSDB and then used as the data source for charts and dashboards. For chart types that Grafana does not support out of the box, we did custom development on top of Grafana to cover more visualization requirements.

The following are some dashboards for audio and video troubleshooting scenarios:

Quality Service Platform

The platform aims to give customers an intuitive, efficient, comprehensive, and real-time tool for locating and troubleshooting problems. When customers receive a problem report, they can find and locate the problem right away, then respond to their users and make improvements.

ELK log platform

The ELK stack consists of Elasticsearch (ES), Logstash, and Kibana, and forms a complete solution for log collection, storage, query, and visualization. In our system it is currently used mostly for querying detailed logs.

Online and Offline analysis

Here we use Kafka as the data pipeline and the Flink platform to split and archive the log data. Once this data has been synchronized to the offline data warehouse, further data mining and analysis can be carried out. We will not go into more detail here.

Conclusion

This article has covered the design and practice of the NetEase Yunxin service monitoring platform. It introduced the system architecture of the platform and then walked through its four parts: data collection, data processing, monitoring and alarms, and data application.

Since its launch in early 2020, NetEase Yunxin's data collection and monitoring system has grown from about a dozen collection tasks to more than 300. It now covers 100+ key user behaviors and system events and 300+ core audio and video indicators, and processes millions of rows and terabytes of data per day. The platform's concurrency and throughput keep rising. This comes from the continued growth of the Yunxin business, and it also places higher demands on the platform's stability and scalability. In the future, we will use the platform's capabilities to provide customers with better service.

About the author

Dai Qiang is a senior development engineer on the NetEase Yunxin data platform and has long worked on data platform systems. He built NetEase Yunxin's real-time and offline data warehouse from scratch, and is also responsible for the design and development of the service monitoring platform, the data application platform, and the quality service platform.
