System monitoring has always been part of keeping a project sound, and the principle “don’t bring an unmonitored system online” is gradually gaining acceptance. Without monitoring, we cannot know the system's operating status or how the business is doing, and we may not even learn of downtime or a major failure until it has already caused significant losses.
iQiyi LeDao is a service built by the iQiyi content team that covers the whole process of producing, publishing, and operating video, audio, subtitles, images, and other content. As the business grew and was split into microservices, the number of system projects kept increasing. There are now 100+ microservices, more and more content to maintain, and longer and longer call chains. The existing monitoring could no longer keep up with the business, and increasingly urgent monitoring needs emerged, such as:
1. Detect and locate system anomalies efficiently and promptly
When the system misbehaves, we need to detect the fault before the business side reports it, and we need a more efficient, visual way to locate it. With the existing approach, locating a fault takes a long time, and when there are too many logs the fault sometimes cannot be pinpointed at all.
2. Monitor the release process and roll back as soon as a problem appears
Every release is a window in which bugs can be born. A considerable share of online problems are caused or introduced by releases, because a stable service whose code is not updated will not, under normal circumstances, spontaneously develop problems. We need a release process that can be monitored so that problems are fixed promptly.
3. Measure service, interface, and database performance comprehensively and accurately
We need more comprehensive and convenient monitoring of the interfaces the system provides and of the external service interfaces it depends on, covering indicators such as interface performance, QPS, and slow queries. A latent performance problem may surface at a critical moment, for example during a traffic spike, and cause major faults such as an exhausted database connection pool, long-running transactions, or a gateway being brought down.
4. Monitor service health more promptly and reliably
Although services are currently deployed with two or more redundant instances to guarantee availability, some instances still go down. We need to learn of such problems immediately and have continuous, stable monitoring.
5. Monitor machine (container) performance changes promptly
When the system is hit by sudden traffic or an attack, basic resources are likely to run out: the connection pool fills up, OOM errors occur, the database fills up, or CPU, memory, or disk is exhausted. If these conditions are not detected and handled in time, they lead to serious online incidents.
6. More complete and clearer business monitoring
The needs above all concern system monitoring. Without business monitoring as well, covering indicators such as production volume, success rate, and timeliness, the overall state of the business is a black box to the maintainers, who are then stuck reacting passively. Technology exists to support the growth of the business; if we do not know how the business is doing, much of that work loses its direction.
When there was little business, few systems, and few microservices, we needed little from monitoring. As the business developed and the number of systems exploded, maintenance became more and more demanding. We needed to build a complete monitoring system to shore up this weakness and to safeguard every service and business line as the business grows.
Monitoring content and technology selection
After sorting through our requirements, these are the things our service systems currently need to monitor:
1. Machine monitoring
Basic monitoring tracks the running state of the VMs or containers that host each system, including CPU, memory, disk, and network. Specific indicators include cpu.busy, cpu.idle, cpu.load.5min.per.core, mem.memfree, net.if.in.bytes, and df.statistics.used.percent.
2. System monitoring
Monitor all interfaces and services of each system: service status, QPS, interface performance, success rate, error logs, and so on.
Examples include service health, the QPS, performance, and success rate of each service API, the QPS and request volume of each machine in a cluster, the QPS load on the cluster as a whole, and the trend of exception logs across the system.
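As a rough illustration of how indicators such as QPS and success rate can be derived from raw request records, here is a minimal Python sketch. The `Request` record, the API names, and the fixed aggregation window are assumptions made for this example; they are not part of CAT or of the system described here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    api: str           # hypothetical API name
    success: bool      # whether the call succeeded
    elapsed_ms: float  # time spent handling the call


def summarize(requests, window_seconds):
    """Aggregate per-API QPS, success rate, and mean latency for one window."""
    buckets = defaultdict(list)
    for request in requests:
        buckets[request.api].append(request)

    summary = {}
    for api, items in buckets.items():
        succeeded = sum(1 for r in items if r.success)
        summary[api] = {
            "qps": len(items) / window_seconds,
            "success_rate": succeeded / len(items),
            "avg_ms": sum(r.elapsed_ms for r in items) / len(items),
        }
    return summary


# Four made-up requests observed in a 2-second window.
sample = [
    Request("/video/publish", True, 20.0),
    Request("/video/publish", False, 60.0),
    Request("/video/publish", True, 40.0),
    Request("/subtitle/get", True, 10.0),
]
report = summarize(sample, window_seconds=2)
```

A real monitoring server would do this kind of aggregation continuously and per machine as well as per cluster; the sketch only shows the arithmetic behind the indicators.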
3. Business monitoring
Based on the specific business each service carries, we build business data dashboards with corresponding monitoring and alerting policies, so that business growth and system operation are easy to observe and can inform business decisions.
Following the principle of not reinventing the wheel, we first surveyed the company's existing internal monitoring systems for their functionality, ease of access, intrusiveness, and strengths and weaknesses. Because the internal monitoring systems were scattered and each covered only a narrow function, none could meet our need for comprehensive monitoring. We therefore leaned toward building our own, and surveyed the common open source monitoring systems:
1. OpenFalcon
OpenFalcon is Xiaomi's open source monitoring system, which provides a rich set of basic monitoring indicators.
Advantages: easy to adopt, almost non-intrusive, and able to report custom data.
Disadvantages: mainly monitors machine indicators; system-level monitoring is insufficient.
2. Prometheus
Prometheus is an open source monitoring system released by former Google employees in 2015. It differs from OpenFalcon in that data collection uses a pull model rather than a push model, and its architecture is very simple.
Advantages: a powerful query engine with PromQL support, which can perform all kinds of real-time calculations on the data.
Disadvantages: not easy to use, a steep learning curve, and incomplete built-in functionality; in practice Grafana is needed to build dashboards and view indicators.
3. CAT
CAT is an open source real-time application monitoring platform from Meituan-Dianping that provides comprehensive real-time monitoring and alerting.
Advantages: powerful monitoring functions that can cover all monitoring scenarios.
Disadvantages: high adoption cost and heavy intrusion into service code.
After comparing them from many angles against the monitoring requirements of our systems and business, what mattered most to us was comprehensive monitoring functionality together with a stable, mature UI, reports, and alerting. We therefore chose CAT as the monitoring system for our microservices.
Putting the LeDao-CAT monitoring system into practice
1. Deployment and iteration
We started with a minimal trial deployment based on CAT's open source code and related documentation, first using it on virtual machines for a small number of business systems. The trial went very well and met our initial monitoring needs. We then upgraded and modified CAT several times, gradually optimizing it to suit our business and environment.
The CAT monitoring system currently serves two regions, overseas and mainland China, and each region has a cluster serving the business systems in that region, as shown in Figure 1. The mainland China cluster serves 100+ microservices, handles 10,000+ TPS, and processes about 1.5 TB of data per day. The interaction between the business systems and the CAT monitoring system is shown in Figure 2.
Figure 1 LeDao-CAT deployment
Figure 2 Interaction flow
CAT currently supports applications running on physical machines, virtual machines, QAE containers, and other runtime environments. A service connects to the monitoring system by adding the LeDao-cat-client package and doing some simple environment and monitoring configuration, after which it can be monitored in all respects.
In the main flow, cat-client-proxy collects the monitoring configuration, establishes a connection to the server, reads the configuration of the items to be monitored, and reports the monitoring data. Once the data reaches the CAT server, the server processes and aggregates the raw data and raises alarms.
2. Upgrades and modifications
(1) Upgrading CAT's access mode and instrumentation
Native CAT is integrated by placing a client.xml configuration file at a fixed path on the virtual machine of every application. The file mainly contains the server's connection address and port, so it has to be deployed to every machine that runs the service, which makes operations extremely costly. Without a unified operations team in the company, both onboarding and maintenance are expensive, and the approach does not work for QAE container deployments. To solve these problems, we modified CAT's access configuration module to support three configuration modes: the traditional XML file, QAE environment variables, and the application's own configuration file (Properties or XML). This decouples the configuration from the host and makes it depend only on the application itself.
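The idea of resolving the server address from several configuration sources can be sketched in Python as follows. This is only an illustration under stated assumptions: the environment variable names, the property keys, and the order in which the sources are tried are all hypothetical, and the XML layout is a simplified client.xml-style document rather than a definitive schema.

```python
import xml.etree.ElementTree as ET


def from_env(env):
    """Read the server from environment variables (names are hypothetical)."""
    host = env.get("CAT_SERVER_HOST")
    port = env.get("CAT_SERVER_PORT")
    return (host, int(port)) if host and port else None


def from_properties(text):
    """Read the server from properties text (keys are hypothetical)."""
    props = {}
    for line in (text or "").splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    host = props.get("cat.server.host")
    port = props.get("cat.server.port")
    return (host, int(port)) if host and port else None


def from_xml(xml_text):
    """Read the first <server> entry from a client.xml-style document."""
    if not xml_text:
        return None
    server = ET.fromstring(xml_text).find("./servers/server")
    if server is None:
        return None
    return server.get("ip"), int(server.get("port"))


def resolve_server(env, properties_text=None, xml_text=None):
    """Try each source in an assumed order; the first one that answers wins."""
    candidates = (
        ("env", from_env(env)),
        ("properties", from_properties(properties_text)),
        ("xml", from_xml(xml_text)),
    )
    for source, result in candidates:
        if result is not None:
            return source, result
    raise LookupError("no CAT server configured")


XML = '<config mode="client"><servers><server ip="10.0.0.1" port="2280"/></servers></config>'
PROPS = "cat.server.host=10.0.0.2\ncat.server.port=2280\n"

only_xml = resolve_server({}, xml_text=XML)
with_env = resolve_server(
    {"CAT_SERVER_HOST": "10.0.0.3", "CAT_SERVER_PORT": "2280"}, PROPS, XML
)
```

Because none of the three sources has to live at a fixed path on the host, the same application image can run on a virtual machine or in a container without host-level setup.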
In addition, CAT is an intrusive monitoring system: data must be reported from instrumentation points placed in the code. The native instrumentation method is essentially aspect-based (AOP). To simplify and refine the instrumentation work, we developed a proxy package that makes access easier for users. It extends the instrumentation options to traditional aspects, declarative annotations (Service, Method, Controller, DAO), and bulk configuration through a Properties file, as shown in Figure 3.
Figure 3 Access mode
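Declarative instrumentation can be illustrated with a Python decorator, which plays a role loosely analogous to an annotation processed by an aspect. This is a sketch of the general technique only, not the proxy package's actual API: the `monitored` decorator, the `REPORTED` list standing in for the reporting channel, and the sample functions are all hypothetical.

```python
import functools
import time

REPORTED = []  # stand-in for data that would be sent to the monitoring server


def monitored(kind):
    """Mark a function as an instrumentation point of the given kind."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "success"
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Record the exception type, then let the caller see the error.
                status = type(exc).__name__
                raise
            finally:
                REPORTED.append({
                    "kind": kind,
                    "name": func.__name__,
                    "status": status,
                    "elapsed_ms": (time.perf_counter() - start) * 1000,
                })
        return wrapper
    return decorator


@monitored("Service")
def publish_video(video_id):
    return f"published {video_id}"


@monitored("DAO")
def load_subtitle(subtitle_id):
    raise KeyError(subtitle_id)


result = publish_video("v1")
```

The business function stays free of monitoring calls; only the one-line marker is added, which is the sense in which this style reduces intrusion.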
(2) Adding a CAT health check module
CAT's monitoring functions are very comprehensive, but it had no application health monitoring. CAT relies on clients reporting data; when a service goes down and the business becomes unavailable, the client usually just stops reporting, and CAT cannot detect this kind of client outage. We therefore extended native CAT with a cat-health health check module. Once health checking is configured and the server has received the application information reported by clients, the module periodically selects applications in the different data centers and checks their health. If an application is repeatedly found to be abnormal, the module sends an alarm notification according to configured rules. The new health check module fills the gap in native CAT monitoring. Details are shown in Figure 4.
Figure 4 Health check
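The rule "alarm when an application is repeatedly abnormal" can be sketched as a counter of consecutive failed checks. The following Python example is a minimal illustration under assumptions: the `HealthChecker` class, the threshold of three consecutive failures, and the probe and notify callables are hypothetical and are not taken from the cat-health module.

```python
class HealthChecker:
    """Raise one alarm when an application fails `threshold` checks in a row."""

    def __init__(self, probe, notify, threshold=3):
        self.probe = probe          # callable(app) -> bool, True when healthy
        self.notify = notify        # callable(app, consecutive_failures)
        self.threshold = threshold  # assumed rule: N consecutive failures
        self.failures = {}

    def check(self, apps):
        """Run one round of checks; a scheduler would call this periodically."""
        for app in apps:
            if self.probe(app):
                self.failures[app] = 0
                continue
            count = self.failures.get(app, 0) + 1
            self.failures[app] = count
            # Notify exactly once, at the moment the threshold is reached.
            if count == self.threshold:
                self.notify(app, count)


down = {"subtitle-service"}  # made-up application that is currently down
alarms = []
checker = HealthChecker(
    probe=lambda app: app not in down,
    notify=lambda app, count: alarms.append((app, count)),
    threshold=3,
)
for _ in range(4):
    checker.check(["video-service", "subtitle-service"])
```

Requiring several failures in a row trades a little detection delay for fewer false alarms from a single slow or dropped probe.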
(3) Upgrading the alarm channels
In addition, we integrated CAT's existing open source alarm mechanism with iQiyi's own alarm system. The integrated alarm system can send exception, health, business, and other kinds of alarms through iQiyi email, Hot Chat, and SMS, which greatly improves the reach of alarm messages, so that exceptions are visible in real time and can be handled and recovered from promptly.
Figure 5 Alarm upgrade
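Sending one alarm over several channels can be sketched as a small router that maps alarm categories to channels. This Python example is illustrative only: the `Alarm` and `AlarmRouter` names, the channel keys, and the in-memory senders are assumptions, and no real iQiyi or CAT alarm API is used.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    category: str  # e.g. "exception", "health", "business"
    app: str
    message: str


class AlarmRouter:
    def __init__(self):
        self.senders = {}  # channel name -> callable(alarm)
        self.routes = {}   # alarm category -> list of channel names

    def register(self, channel, sender):
        self.senders[channel] = sender

    def route(self, category, channels):
        self.routes[category] = list(channels)

    def dispatch(self, alarm):
        """Send the alarm on every channel configured for its category."""
        delivered = []
        for channel in self.routes.get(alarm.category, []):
            sender = self.senders.get(channel)
            if sender is None:
                continue
            try:
                sender(alarm)
                delivered.append(channel)
            except Exception:
                # One failing channel should not block the others.
                continue
        return delivered


outbox = []


def broken_sms(alarm):
    raise ConnectionError("sms gateway unreachable")


router = AlarmRouter()
router.register("email", lambda alarm: outbox.append(("email", alarm.app)))
router.register("hot_chat", lambda alarm: outbox.append(("hot_chat", alarm.app)))
router.register("sms", broken_sms)
router.route("health", ["email", "hot_chat", "sms"])

delivered = router.dispatch(Alarm("health", "subtitle-service", "3 failed checks"))
```

Using more than one channel per category is what raises the chance that an alarm actually reaches someone, even when one channel is unavailable.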
Practice results
By deploying native CAT and making the upgrades described above, we built a complete service monitoring system that spans abnormal machine indicators, service health, system monitoring, system performance, slow queries, and business monitoring. Combined with CAT's powerful alarm configuration, iQiyi's varied alarm channels, and the optimized fast access and instrumentation modes, this forms an end-to-end monitoring chain: fast service access, fast instrumentation, alarm configuration, alarm delivery, and alarm handling. It fills what had been a gap in our monitoring.
Specifically:
- CAT offers several access methods and a low onboarding cost. A new service can be connected within 5 minutes.
- With the cat-client-proxy dependency, instrumentation can be configured almost without intrusion, and the whole setup is fast and efficient.
- Microservices can be monitored comprehensively, from hardware indicators and health to exceptions, performance, and business metrics, covering almost every aspect.
- Powerful alarm configuration pushes system exception information in real time, so problems are detected quickly and handled promptly.
Figure 6 Diagram of each monitoring indicator
Figure 7 Hot chat alarm
Figure 8 Configuring access
From the minimal trial deployment, through the feature upgrades, to recognition by the team and colleagues, LeDao-CAT has developed rapidly. The whole team now uses LeDao-CAT as its unified monitoring tool. Two clusters are currently deployed, one for mainland China and one for overseas, and both run stably and efficiently, safeguarding the rapid development of the business.
Conclusion and outlook
A monitoring system is an indispensable part of business development and is vital to keeping services running stably. Different business systems naturally call for different monitoring systems suited to them, and companies at home and abroad have open sourced a number of monitoring systems. CAT is a good one: it provides very comprehensive and powerful monitoring functions and can meet essentially all of our current monitoring needs.
This article briefly introduced how CAT was put into practice at iQiyi LeDao, covering our monitoring requirements, the solution, our optimizations and modifications, and the results so far. Although the current functions largely meet our needs, some gaps are still being studied, such as distributed transactions, business monitoring, and deep integration with Nacos. These are the directions we will work on next, to make the whole monitoring system more complete and to support the rapid development of the business faster and better.