Introduction

A distributed system is far more complex than traditional software, so it is much harder to operate and maintain, and the difficulty grows rapidly as nodes are added. When the system fails, finding the problem among hundreds of application nodes is a huge challenge for developers. When the metrics of many nodes are abnormal at the same time, it is often hard to tell cause from effect.

On the other hand, because SaaS applications serve enterprises, they require a high level of service reliability. When a fault occurs, we need to locate and fix it quickly and restore service. More importantly, we need a clear view of the running state of the system at all times, so that faults can be discovered and eliminated before the system deteriorates.

To do this, the most basic and important requirement is that the system be highly observable.

What is observability

When we drive a car, the speedometer, tachometer, fuel gauge and other instruments on the dashboard show the car’s current basic operating state. A yellow indicator light means that the corresponding part of the vehicle has a latent problem that needs checking but does not affect the vehicle’s basic operation or safety functions. A red indicator light is a serious warning: the car should be sent for inspection and repair immediately, otherwise core components may be damaged or a serious safety risk may lead to an accident.

The dashboard of a car brand

The instrument panel, together with the sensors and signal transmission system connected to it, is the most basic and intuitive example of observability in a car. For a deeper understanding of the state of individual parts, you can connect to the vehicle’s OBD interface to obtain more current and historical state data, which greatly improves the visibility of the vehicle system. Many mobile apps have been built on OBD data; they make it easier to observe the vehicle’s state, record more historical data, and analyze trends in driving habits, further extending the car’s observability.

In IT, observability is more important than ever. The CNCF (Cloud Native Computing Foundation) definition of cloud native lists being observable as an important characteristic. From the car example we can summarize a simple definition of observability: by collecting, analyzing and processing the internal state of the system, and by sensibly aggregating, summarizing and displaying metric data, we can understand everything that happens in the system in the shortest possible time.

With an observable system, operations and development staff can see at a glance whether the overall operating state is healthy, and can just as easily drill into any detail. During normal operation, the observability system helps evaluate system load and informs operations decisions. When a fault occurs, it helps locate and fix the fault quickly. As operations becomes more automated and intelligent, observability is the most fundamental link.

Building a complete observability system

System observability is provided mainly by a complete monitoring system. There are many open source distributed monitoring systems, such as Prometheus, Zabbix, Nagios and CAT, and Prometheus has become a de facto interface standard. Large companies, which need flexibility and customization and have strong operations development capabilities, usually build their own monitoring systems. Whatever the type, the components of a monitoring system are essentially the same. Here is a diagram of the Prometheus architecture:

Image from Prometheus website

As can be seen from the figure, for a system to be observable, it must have the following components:

1. Service status awareness component

This component runs on every node of the system, collects service status information, and provides raw data across a comprehensive set of dimensions. Because the volume of collected data is huge and the component is deployed directly on the nodes where the system runs, it must take measures to avoid affecting normal operation. The component collects state data in a variety of ways and outputs structured data through a standard interface. Common collection methods include:

  • Independent monitoring tools, such as sar, top and dstat, for monitoring the operating status of the system
  • Bytecode injection
  • Structured logs
  • Behavioral event tracking points
2. Status data collection and storage

This component is the core of the entire observability system: it reports the collected data and stores it efficiently. Different data and different analysis methods call for appropriate data formats and storage media. The most common storage for monitoring data is a time-series database, such as Prometheus or InfluxDB. Data collectors are given several metric types for structured reporting, including:

  • Gauges: simple point-in-time values such as memory usage and thread count.
  • Counters: counts such as the number of requests and errors.
  • Histograms: distributions, for scenarios that need the mean, variance or quantiles, such as average response time and the 95th percentile response time.
  • Meters: rate statistics such as TPS, including 1-minute and 5-minute average rates.
  • Timers: latency statistics, such as request latency and disk read latency.
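
The five metric types above can be sketched in a few lines of Python. This is only an illustration of what each type records, not the API of any particular metrics library; every class and method name here is an assumption made for the example, and a production implementation would use bounded sampling and moving averages rather than storing every sample.

```python
import math
import time


class Counter:
    """Monotonically increasing count, e.g. requests or errors."""

    def __init__(self):
        self.value = 0

    def inc(self, n=1):
        self.value += n


class Gauge:
    """Point-in-time value read on demand, e.g. thread count."""

    def __init__(self, read_fn):
        self.read_fn = read_fn

    def value(self):
        return self.read_fn()


class Histogram:
    """Keeps raw samples so the mean and quantiles can be computed."""

    def __init__(self):
        self.samples = []

    def update(self, value):
        self.samples.append(value)

    def mean(self):
        return sum(self.samples) / len(self.samples)

    def quantile(self, q):
        # Nearest-rank quantile over all recorded samples.
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(q * len(ordered)))
        return ordered[rank - 1]


class Meter:
    """Mean rate of events per second since the meter was created."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.start = clock()
        self.count = 0

    def mark(self, n=1):
        self.count += n

    def mean_rate(self):
        elapsed = self.clock() - self.start
        return self.count / elapsed if elapsed > 0 else 0.0


class Timer:
    """Records durations into a histogram; usable as a context manager."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.histogram = Histogram()

    def __enter__(self):
        self._started = self.clock()
        return self

    def __exit__(self, *exc_info):
        self.histogram.update(self.clock() - self._started)
        return False  # never swallow exceptions from the timed block
```

A timer would be used as `with request_timer: handle(request)`, after which `request_timer.histogram.quantile(0.95)` gives the 95th percentile latency of the recorded calls.
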
3. Visual presentation

The visual interface largely determines whether the observability system delivers value. The visualization system must support flexible configuration and composition, be easy to use, and present information intuitively. The most widely used open source visualization tool for monitoring is Grafana.

4. Alerting

Alerting is one of the core values of the whole monitoring system. When the system is abnormal or likely to become abnormal, the alerting system promptly notifies the relevant people through email, IM, SMS, telephone and other channels, so that operations and development staff can intervene.

Example of an alert sent through an IM channel

There are two main types of alert configuration: status event alerts and trend alerts.

A status event is an exception event that has occurred, for example:

  • CPU usage exceeds 95%
  • Disk space usage exceeds 90%
  • JVM full GC occurs continuously
  • An interface call fails more than N times within a period, and the failure ratio exceeds X%
  • The number of Tomcat threads exceeds 120
  • The service writes ERROR logs
A trend alert analyzes how a metric changes over time and fires on abnormal changes, for example:

  • Message volume is down 30% from a week ago
  • The number of requests to a URL interface increased by 30% compared with one minute ago
  • Memory usage increases by more than 5% for ten consecutive minutes
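
The difference between the two alert types can be made concrete with a small sketch. The function names and the way the baseline is supplied are assumptions for illustration; a real alerting system would evaluate such rules against stored time-series data.

```python
def threshold_alert(value, limit):
    """Status event alert: fire when the current value exceeds a fixed limit."""
    return value > limit


def relative_change(current, baseline):
    """Change of current relative to baseline, or None if baseline is zero."""
    if baseline == 0:
        return None
    return (current - baseline) / baseline


def trend_alert(current, baseline, max_change):
    """Trend alert: fire when the relative change reaches max_change.

    Both rises and drops count, so a 30% fall in message volume and a 30%
    jump in request count are caught by the same rule.
    """
    change = relative_change(current, baseline)
    return change is not None and abs(change) >= max_change


# CPU at 97% against a 95% limit is a status event.
cpu_alert = threshold_alert(97.0, 95.0)

# Message volume fell from 1000 a week ago to 600 now, a 40% drop.
volume_alert = trend_alert(600, 1000, 0.30)
```

A status rule needs only the current sample, while a trend rule also needs a baseline from an earlier point in time (one minute ago, one week ago), which is why trend alerts depend on the storage component keeping history.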

Covering the full range of monitoring dimensions

(1) Resource monitoring

Resources mainly means computing and network bandwidth resources, such as CPU, memory, disk I/O, and network adapter traffic. These metrics measure system load as absolute numbers and percentages. As the most basic kind of monitoring, resource monitoring is provided by all kinds of open source monitoring systems and cloud computing platforms.

Example of resource monitoring

For each type of resource, here are some common concerns that should be clearly presented in our monitoring system.

  1. CPU: For compute-intensive applications, CPU is the core resource, and its load level directly indicates the current system load. For other applications, CPU load is usually low. A sudden spike in CPU load usually indicates an application bug, such as an infinite loop, or frequent full GC in the virtual machine. Heavy network traffic can also push up CPU load through I/O processing and context switching.
  2. Process liveness: whether a process with the specified name exists.
  3. Memory: memory usage and remaining capacity.
  4. Disk: disk space, inode usage, and disk I/O status.
  5. Network adapter: incoming and outgoing traffic, incoming and outgoing PPS, packet loss rate, etc.
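
As a rough illustration of how a collection agent might sample two of these resources, the sketch below uses only the Python standard library. The function names are assumptions for the example, and a real agent would more likely rely on tools such as sar or on a dedicated exporter.

```python
import os
import shutil


def usage_percent(used, total):
    """Return used/total as a percentage, or 0.0 when total is zero."""
    return 100.0 * used / total if total else 0.0


def disk_usage_percent(path="."):
    """Percentage of space used on the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return usage_percent(usage.used, usage.total)


def load_per_cpu():
    """1-minute load average divided by CPU count, or None if unavailable."""
    if not hasattr(os, "getloadavg"):
        return None  # os.getloadavg is not available on every platform
    try:
        one_minute, _, _ = os.getloadavg()
    except OSError:
        return None
    return one_minute / (os.cpu_count() or 1)
```

Reporting percentages alongside raw numbers, as described above, makes thresholds such as "disk space usage exceeds 90%" independent of the size of each machine.
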
(2) Performance monitoring

Resource monitoring focuses on the operating system level, while performance monitoring focuses on the application level and is also known as Application Performance Management (APM).

At the application process level, monitoring data is usually obtained by implicit injection at the virtual machine or bytecode execution layer, such as the JVM or the PHP Zend Engine. Taking a Java application as an example, we can obtain the following metrics this way:

  • JVM memory state
  • JVM GC activity
  • Java method call statistics
  • Tomcat thread status
  • Working state of custom thread pools
At the interface level, traditional applications can still be instrumented through bytecode injection, while cloud native applications can be monitored in sidecar mode. Monitoring gives us the QPS of each interface, the number of concurrent calls, the response time distribution, the number of errors and other metrics.

Example of monitoring HTTP interface calls

The types of interfaces that can be monitored in this way are:

  • HTTP interface statistics, for both outgoing and incoming calls
  • RPC interface statistics, for both outgoing and incoming calls
  • SQL execution statistics
  • Redis access statistics
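
To show what interface-level statistics look like once collected, here is a minimal sketch that wraps a function with a Python decorator. The decorator is only a stand-in for bytecode injection or a sidecar, which achieve the same effect without touching application code; `monitored`, `STATS` and `ping` are hypothetical names used for illustration.

```python
import functools
import time
from collections import defaultdict

# interface name -> {"calls": int, "errors": int, "latencies": [seconds]}
STATS = defaultdict(lambda: {"calls": 0, "errors": 0, "latencies": []})


def monitored(name):
    """Record call count, error count and latency for the wrapped function."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            entry = STATS[name]
            entry["calls"] += 1
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            except Exception:
                entry["errors"] += 1
                raise  # monitoring must not change the caller's view
            finally:
                entry["latencies"].append(time.monotonic() - start)

        return wrapper

    return decorator


@monitored("GET /ping")
def ping(fail=False):
    """Hypothetical handler used only to exercise the decorator."""
    if fail:
        raise RuntimeError("simulated failure")
    return "pong"
```

From these raw records, QPS is the call count per time window, the error ratio is errors divided by calls, and the response time distribution comes from the recorded latencies.
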
The interface performance of one node reflects only that node. When the interface calls of all services on the platform are connected together, they give an intuitive view of how the whole system is running; this is the currently popular call-chain tracing system. The theoretical basis of mainstream call-chain tracing systems is essentially Google Dapper, and popular open source components include Zipkin, Pinpoint and SkyWalking. With call-chain tracing, we can easily:

  • Analyze the performance of each service node
  • Locate faults quickly
  • Analyze the call chain of a request
  • Service dependency analysis and governance
Seven Fish full-link monitoring dashboard

(3) Business monitoring

Business performance is always the most immediate concern, so the most direct form of monitoring is business monitoring. The most familiar kind is the business dashboard, such as Taobao’s Double 11 transaction data big screen. Such screens are usually built for external display or for executives, so they are designed to look impressive, and the data shown on them is filtered.

For R&D teams, business monitoring is the most important indicator of business health, so the data presented is different: generally more detailed and covering more dimensions.

First, the overall business status is monitored. Overall monitoring focuses on the health of the core business, so that any exception in the core business triggers a notification.

Part of the Seven Fish business monitoring panel

For example, the core business of Seven Fish is communication between visitors and customer service agents, together with improving agent efficiency, so the overall business monitoring metrics include:

  • Number of concurrent sessions
  • Concurrent traffic
  • Number of issues resolved by AI
  • AI resolution rate
  • Number of online agents
  • Message sending and receiving rate
  • Ticket creation rate
However, the overall business trend may show no obvious anomaly while the business is already completely unavailable in one particular segment. So beneath the overall view, business monitoring keeps drilling down along different dimensions. Common segmentation dimensions are:

  • Geography: This dimension mainly reflects network conditions, which matter most for network-sensitive services such as video. The biggest regional differences are in CDN coverage quality, followed by access restrictions imposed by carriers in each region and the frequently reported incidents of optical fiber being cut.
  • Users: Different types of users may be offered different functions. Common ways to segment users include VIP level, tags and so on.
(4) Tenant status tracking

For SaaS businesses, services are provided at tenant granularity. To support personalized service, tenants have a great deal of freedom to customize functions, so a feature that works for one tenant may be completely unavailable for another. It is therefore necessary to monitor the main business functions per tenant as well. Tenant status tracking has two parts: SaaS platform functions and customer interface monitoring.

SaaS platform functions are the functional services that the platform provides to each customer. Continuously tracking how customers use these functions, especially large customers and new customers, matters for two reasons. First, enterprise customers have high requirements for service stability. Any abnormality that customers notice, even a tiny one, is likely to damage the customer relationship, trigger complaints, and affect later renewals and additional purchases. Second, changes in usage reflect how much a customer depends on the platform. When a customer’s usage of a feature drops suddenly or keeps falling, we should find out whether the customer’s business is shrinking or whether part of it has moved to a competitor, understand the reason promptly, and maintain the customer relationship. If a customer suddenly starts using a feature heavily, we can begin preparing for an upcoming additional purchase.

Message interface monitoring panel for key enterprise customers

The other part is the customer’s own interfaces. Many functions of a SaaS platform involve the customer’s own business data and need to connect to the customer’s own systems, so the platform usually defines a number of interface standards that the customer implements and the platform calls during its business processes. These external interfaces are outside the platform’s control, vary in quality of service, and are prone to anomalies. Although their failures do not affect the overall service of the SaaS platform, they can be catastrophic for a specific customer. The customer’s first reaction is usually to assume that the SaaS platform has failed and to ask us to find the cause quickly, which can waste a considerable amount of team time, and the customer may not have equally thorough monitoring of its own. Therefore, we must monitor calls to these interfaces per tenant and notify customers promptly when something goes wrong. This is responsible behavior toward customers and also reduces our own workload.
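
Monitoring customer interfaces per tenant can be sketched as follows. The class name, the tenant identifiers and the threshold values are assumptions for illustration only; the point is that success and failure are counted separately for each tenant, so that one tenant's broken interface is not hidden inside a healthy overall average.

```python
from collections import defaultdict


class TenantCallStats:
    """Counts calls to customer-implemented interfaces, keyed by tenant."""

    def __init__(self):
        self._calls = defaultdict(int)
        self._failures = defaultdict(int)

    def record(self, tenant_id, ok):
        self._calls[tenant_id] += 1
        if not ok:
            self._failures[tenant_id] += 1

    def failure_rate(self, tenant_id):
        calls = self._calls.get(tenant_id, 0)
        if calls == 0:
            return 0.0
        return self._failures.get(tenant_id, 0) / calls

    def tenants_over(self, max_failure_rate, min_calls=1):
        """Tenants with enough calls whose failure rate exceeds the limit."""
        return sorted(
            tenant
            for tenant, calls in self._calls.items()
            if calls >= min_calls
            and self.failure_rate(tenant) > max_failure_rate
        )


stats = TenantCallStats()
for _ in range(98):
    stats.record("tenant-a", ok=True)
for _ in range(2):
    stats.record("tenant-a", ok=False)
for _ in range(5):
    stats.record("tenant-b", ok=False)

# Overall failure rate is under 7%, yet tenant-b is completely broken.
affected = stats.tenants_over(max_failure_rate=0.5, min_calls=5)
```

The `min_calls` parameter is a design choice to avoid alerting on a tenant that has made only one or two calls, where a single failure would otherwise look like a 100% failure rate.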

(5) Business logs

Logs are the primary tool for post-mortem analysis. Complete log information provides the following benefits:

  • If a business result does not meet expectations, the call-chain information can be fully reconstructed and analyzed to find the cause of the problem.
  • Logs at the ERROR level, or containing specific keywords, can be monitored and alerted on.
  • Structured logs allow statistical analysis of call volume, execution results and other information, which supports operational data reporting.
To obtain these benefits, there are some best practices to follow when using a logging system.

First, good logs record just enough information. The problem with logging too little is obvious, but logging too much is also a problem: it dilutes the truly useful information, burdens the entire logging system, and can even affect the performance of the core business.

Second, log information should be structured. A complete log entry should contain the full input and output parameters, caller and callee information, results of critical path execution, time, and so on. Nginx logs are a typical example. To make logging easier, logs can be recorded at the framework level, with convenient SDKs provided for integration.

Furthermore, in a distributed architecture a complete request passes through many nodes, and the call logs on those nodes need to be joined together when analyzing a problem. This requires a complete set of log collection and query tools, the most commonly used being ELK. Because the logs are spread across different nodes, joining them requires that the entire call chain be marked with an identifier that is written to every log entry.
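
The idea of marking a call chain so that its logs can be joined later can be sketched with structured JSON log lines that all carry the same identifier. The field names (`trace_id`, `service`, `event`) and the helper functions are assumptions for this example, not a format prescribed by ELK or any tracing system.

```python
import json
import uuid
from collections import defaultdict


def new_trace_id():
    """Generate an identifier at the entry point of a request."""
    return uuid.uuid4().hex


def log_line(trace_id, service, event, **fields):
    """Build one structured log entry as a JSON string."""
    record = {"trace_id": trace_id, "service": service, "event": event}
    record.update(fields)
    return json.dumps(record, sort_keys=True)


def group_by_trace(lines):
    """Join log entries from many nodes back into per-request call chains."""
    grouped = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        grouped[record["trace_id"]].append(record)
    return dict(grouped)


trace_id = new_trace_id()
collected = [
    log_line(trace_id, "gateway", "request_in", path="/chat"),
    log_line("another-trace", "gateway", "request_in", path="/ping"),
    log_line(trace_id, "session", "session_created", status="ok"),
]
chains = group_by_trace(collected)
```

In practice the identifier is passed between services along with the request, and the grouping step is a query against the log store rather than an in-memory loop.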

An efficient health dashboard

Different monitoring dimensions call for different monitoring methods and tools. Switching between tools during daily use and troubleshooting is very inefficient. Integrating the data from these systems and displaying and operating on it in one place, through a unified application health dashboard, greatly improves the usability of the observability system.

A simple example of a service monitoring dashboard

As an aggregated display platform, the health dashboard shows only the real-time health status of the system. Each monitoring system reports data to the platform through a unified interface, including:

  • Service information: service node information, service name, node ID, node IP address, etc.
  • Service domain information: The service domain to which the service belongs. Multi-level service domains can be supported if necessary.
  • Data dimensions: Monitoring dimensions such as resources, performance, high availability, and service indicators.
  • Dimension priority: Different priorities have different display weights. Exceptions with higher priorities are displayed first.
  • Health level: the health of the service, such as healthy when there are no anomalies, sub-healthy when there are a few warnings, unhealthy when there are many alarms, and down when the service is unavailable.
  • Detail link: Used for detail drill-down.
The health dashboard supports the following functions:

  • Monitoring dimension aggregation: All monitoring dimension data on the same node can be aggregated and displayed after weighted calculation based on priority weight and health level.
  • Application status aggregation: Data from nodes of the same service can be aggregated, and services in the same service domain can be aggregated and displayed together.
  • Priority display: Items are sorted in real time by their aggregated health priority, and the least healthy items are displayed first.
  • Drill-down analysis: Supports drill-down analysis of node data by monitoring dimension and service dimension.
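
One possible way to implement the weighted aggregation and priority display is sketched below. The numeric scale for health levels, the use of a weighted average, and all names are assumptions for illustration; the source only states that priority weight and health level are combined.

```python
# Assumed numeric scale: a larger number means a less healthy state.
HEALTH_LEVELS = {"healthy": 0, "sub-healthy": 1, "unhealthy": 2, "down": 3}


def node_score(dimensions):
    """Weighted average health level of one node.

    dimensions is a list of (dimension_name, priority_weight, health_level).
    """
    total_weight = sum(weight for _, weight, _ in dimensions)
    if total_weight == 0:
        return 0.0
    weighted = sum(
        weight * HEALTH_LEVELS[level] for _, weight, level in dimensions
    )
    return weighted / total_weight


def rank_nodes(nodes):
    """Return (node_name, score) pairs with the least healthy node first."""
    scores = {name: node_score(dims) for name, dims in nodes.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


nodes = {
    "node-1": [("resources", 1, "healthy"), ("performance", 3, "unhealthy")],
    "node-2": [("resources", 1, "sub-healthy"), ("performance", 3, "healthy")],
}
ranking = rank_nodes(nodes)
```

With these sample weights, a performance problem outweighs a resource warning, so node-1 is shown ahead of node-2. Aggregating to the service or service domain level could reuse the same function over the scores of the member nodes.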

Conclusion

At NetEase Smart Enterprise, the number of services is in the thousands and the number of instances has reached 10,000. A complete observability system hides the complexity of the system architecture and makes the operating status of the whole system clearly visible, which plays a huge role in early fault warning and troubleshooting. Through the observability system we can clearly see the blood flow and pulse of the system and safeguard its health.
