Introduction
On the other hand, the enterprise-oriented nature of SaaS applications requires a high level of service reliability. First, when a fault occurs, we need to locate it quickly, rectify it, and restore service. More importantly, we need a clear understanding of the running state of the system at all times, so that faults can be discovered and eliminated in time, before the system deteriorates.
What is observability
A car’s instrument panel, together with the sensors and signal transmission system connected to it, is the most basic and intuitive example of observability in an automotive system. If you need a deeper understanding of the state of individual parts of the vehicle, you can connect to its OBD interface to obtain more current and historical state data. The OBD interface greatly enhances the visibility of the vehicle’s systems. Many mobile apps have been built on OBD data; they make it easier to observe the vehicle’s state, record more historical data, analyze trends in driving habits, and so on, further expanding the observability of the car.
With observability in place, operations and development staff can see at a glance whether the overall operating state of the system is healthy, and can just as easily drill into the details of any part of it. During normal operation, the observation system can be used to evaluate system load and to provide suggestions for operations work. When faults occur, it helps to locate and rectify them quickly. In the general trend toward automated and intelligent operations, observability is the most fundamental link.
Building a complete observation system
Image from Prometheus website
1. Service status awareness component
- Independent monitoring tools for observing the operating state of the system, such as sar, top, and dstat.
- Bytecode injection
- Structured logs
- Behavioral event tracking points
This component is the core of the entire observation system: it reports the collected data and stores it efficiently. Data formats and storage media should be chosen to suit the data and the way it will be analyzed. Monitoring data is most commonly stored in time-series databases such as Prometheus and InfluxDB. Data collectors provide several metric types for reporting data in a structured way, including:
- Gauges: simple instantaneous measurements, such as memory usage and thread count.
- Counters: cumulative counts, such as the number of requests and errors.
- Histograms: distributions, for scenarios such as average response time and 95th-percentile response time, where the mean, variance, and quantiles need to be calculated.
- Meters: rate statistics such as TPS, along with 1-minute and 5-minute averages.
- Timers: latency statistics, such as request latency and disk read latency.
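The metric types above can be illustrated with a minimal in-memory sketch. The class names `Counter`, `Gauge`, `Histogram`, and `Timer` are hypothetical and are not the API of Prometheus, InfluxDB, or any other specific library; a meter is left out for brevity, and the histogram keeps raw samples, which a production collector would normally avoid.

```python
import math
import time


class Counter:
    """Cumulative count that only goes up, e.g. requests or errors."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Instantaneous value that can go up or down, e.g. thread count."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


class Histogram:
    """Keeps raw samples so that the mean and quantiles can be computed."""

    def __init__(self):
        self.samples = []

    def observe(self, value):
        self.samples.append(value)

    def mean(self):
        return sum(self.samples) / len(self.samples)

    def quantile(self, q):
        # Nearest-rank quantile over the sorted samples.
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(q * len(ordered)))
        return ordered[rank - 1]


class Timer:
    """Context manager that records elapsed seconds into a histogram."""

    def __init__(self, histogram):
        self.histogram = histogram

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.histogram.observe(time.perf_counter() - self._start)
        return False  # do not swallow exceptions


requests = Counter()
threads = Gauge()
latency_ms = Histogram()

# Made-up response times in milliseconds, with one slow outlier.
for rt in [12, 15, 11, 240, 14, 13, 16, 12, 18, 15]:
    requests.inc()
    latency_ms.observe(rt)
threads.set(42)

print(requests.value, threads.value, latency_ms.mean(), latency_ms.quantile(0.95))
```

The single slow request barely shows in the median but dominates the 95th percentile, which is why quantiles are usually reported alongside the mean.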
The visual interface largely determines whether the observation system can deliver value. The visualization system must support flexible configuration and flexible combination, be easy to use, and display information intuitively. The most widely used open source visualization tool for monitoring is Grafana.
The alarm function is one of the core values of the whole monitoring system. When the system is abnormal, or is likely to become abnormal, the alarm system promptly notifies the relevant parties through email, IM, SMS, telephone, and other channels, so that the responsible operations and development staff can intervene.
There are two main types of alarm configuration: status event alarms and trend alarms.
Examples of status event alarms:
- CPU usage exceeds 95%
- Disk space usage exceeds 90%
- JVM full GC occurs continuously
- An interface call fails more than N times within a period of time, and the failure ratio exceeds X%
- The number of Tomcat threads exceeds 120
- The service writes ERROR logs
Examples of trend alarms:
- Message volume is down 30% from a week ago
- The number of requests to a URL increased by 30% compared with one minute ago
- Memory usage increases by more than 5% for ten consecutive minutes
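The difference between the two alarm types can be sketched as follows. This is only an illustration: the function names are hypothetical, the thresholds are taken from the examples above, and a real alarm system would also handle time windows, deduplication, and notification.

```python
def threshold_alarm(value, limit):
    """Status event alarm: fires when the current value exceeds a fixed limit."""
    return value > limit


def change_ratio(current, baseline):
    """Relative change against a baseline, e.g. the same time one week ago.

    The baseline must be non-zero.
    """
    return (current - baseline) / baseline


def drop_alarm(current, baseline, max_drop):
    """Trend alarm: fires when the value fell by at least max_drop (0.30 = 30%)."""
    return change_ratio(current, baseline) <= -max_drop


def rise_alarm(current, baseline, max_rise):
    """Trend alarm: fires when the value rose by at least max_rise (0.30 = 30%)."""
    return change_ratio(current, baseline) >= max_rise


# CPU usage of 97% against a 95% limit.
print(threshold_alarm(97.0, 95.0))
# Message volume of 650 now against 1000 a week ago is a 35% drop.
print(drop_alarm(650, 1000, 0.30))
# 1400 requests now against 1000 one minute ago is a 40% rise.
print(rise_alarm(1400, 1000, 0.30))
```

A status event alarm needs only the current value, whereas a trend alarm needs a stored baseline, so trend alarms depend on the historical data kept by the storage component.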
Covering the full spectrum of observational dimensions
(1) Resource monitoring
For each type of resource, here are some common concerns that should be clearly presented in our monitoring system.
- CPU: For compute-intensive applications, CPU is the core resource, and its load level directly reflects the current load on the system. For other applications, CPU load is usually low, so a sudden spike usually indicates a bug in the application, such as an infinite loop, or pressure on the virtual machine from frequent full GC. High network traffic can also overload the CPU with I/O processing and context switching.
- Process survival: whether a process with the specified name exists.
- Memory: memory usage and remaining capacity.
- Disk: disk space, inode usage, and disk I/O status.
- Network adapter: inbound and outbound traffic, inbound and outbound PPS, packet loss rate, and so on.
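As a small illustration of the disk item above, the following sketch reads disk space and inode usage using only the Python standard library. The function names are hypothetical, `os.statvfs` is available on Unix-like systems only, and in practice these figures are normally gathered by a monitoring agent rather than by hand-written code.

```python
import os
import shutil


def disk_usage_percent(path="/"):
    """Percentage of disk space used on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return None  # filesystem does not report a size
    return 100.0 * usage.used / usage.total


def inode_usage_percent(path="/"):
    """Percentage of inodes used on the filesystem containing path (Unix only)."""
    st = os.statvfs(path)
    if st.f_files == 0:
        return None  # some filesystems do not report inode counts
    return 100.0 * (st.f_files - st.f_ffree) / st.f_files


print("disk space used (%):", disk_usage_percent("."))
print("inodes used (%):", inode_usage_percent("."))
```

Inode usage is worth watching separately because a disk filled with many small files can run out of inodes while plenty of space remains.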
Resource monitoring focuses on the operating system level, while performance monitoring focuses on the application level and is also known as Application Performance Management (APM).
- JVM memory state
- JVM GC activity
- Java method call statistics
- Tomcat thread status
- Working state of custom thread pools
The types of interfaces that can be monitored in this way are:
- HTTP interface statistics, for both outgoing calls and incoming requests
- RPC interface statistics, for both outgoing and incoming calls
- SQL execution statistics
- Redis access statistics
- Performance analysis of each service node
- Fast fault location
- Request call-chain analysis
- Service dependency analysis and governance
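Call statistics of the kind listed in this section (method calls, HTTP, RPC, SQL, Redis) boil down to counting invocations, counting errors, and accumulating elapsed time around each call. The sketch below shows that idea with a Python decorator; `monitored`, `CALL_STATS`, and `load_user` are hypothetical names, the code is not thread-safe, and a Java service would typically achieve the same effect through bytecode injection rather than explicit wrapping.

```python
import functools
import time
from collections import defaultdict

# name -> {"calls": int, "errors": int, "total_seconds": float}
CALL_STATS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def monitored(name):
    """Wrap a function so that every call updates CALL_STATS[name]."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            stats = CALL_STATS[name]
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                stats["errors"] += 1
                raise
            finally:
                # Runs for both successful and failed calls.
                stats["calls"] += 1
                stats["total_seconds"] += time.perf_counter() - start

        return wrapper

    return decorator


@monitored("sql.load_user")
def load_user(user_id):
    # Stand-in for a real SQL query.
    if user_id < 0:
        raise ValueError("invalid user id")
    return {"id": user_id}


load_user(1)
try:
    load_user(-1)
except ValueError:
    pass

print(dict(CALL_STATS["sql.load_user"]))
```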
(3) Business monitoring
For R&D teams, business monitoring is the most important indicator of the health of the business, so the types of data presented are different: generally more detailed and covering more dimensions.
For example, the core business process of Seven Fish is the communication between visitors and customer service agents, together with improving agent efficiency, so its overall business monitoring indicators include:
- Number of concurrent sessions
- Concurrent traffic
- Number of issues resolved by AI
- AI resolution rate
- Number of online agents
- Message sending and receiving rate
- Ticket creation rate
- Geography: The geographic dimension mainly concerns network conditions, especially for network-sensitive businesses such as video. The biggest regional differences lie in the quality of CDN coverage, followed by restrictions that network operators in each region place on access, and the frequently reported incidents of an optical fiber being cut somewhere.
- Users: Different functions may be provided for different types of users. Common ways to distinguish users include VIP level, tags, and so on.
For SaaS businesses, services are provided at tenant granularity. To allow personalized service, tenants have a great deal of freedom to customize functions, so a feature that works for one tenant may be completely unavailable for another. It is therefore also necessary to monitor the core business functions per tenant. Customer status tracking has two parts: monitoring of SaaS platform functions and monitoring of the customer’s own interfaces.
The other part is the customer’s own interfaces. Many functions provided by the SaaS platform involve the customer’s own business data and must integrate with the customer’s own systems, so a SaaS platform usually defines a number of interface standards that the customer implements and the platform then calls during its business processes. These external interfaces are not controlled by the platform, vary in quality of service, and are prone to anomalies. Although failures of these interfaces do not affect the overall service of the SaaS platform, they can be catastrophic for the specific customer. The customer’s first reaction is usually to assume that the SaaS platform has failed and to ask us to investigate quickly, which can consume a considerable amount of team time, and customers may not have comparably thorough monitoring of their own. We must therefore monitor calls to these interfaces per tenant and notify customers promptly when anomalies occur. Doing so is not only responsible toward customers but also reduces our own workload.
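Per-tenant monitoring of customer interfaces can be sketched as a failure-ratio check keyed by tenant. Everything here is illustrative: `TenantCallMonitor`, the 20% threshold, the minimum of five calls, and the tenant IDs are assumptions, not values from the platform described in this article.

```python
from collections import defaultdict


class TenantCallMonitor:
    """Tracks calls to customer-implemented interfaces, grouped by tenant."""

    def __init__(self, max_failure_ratio, min_calls):
        self.max_failure_ratio = max_failure_ratio
        self.min_calls = min_calls  # ignore tenants with too few calls to judge
        self._calls = defaultdict(int)
        self._failures = defaultdict(int)

    def record(self, tenant_id, ok):
        self._calls[tenant_id] += 1
        if not ok:
            self._failures[tenant_id] += 1

    def unhealthy_tenants(self):
        """Tenants whose failure ratio exceeds the configured limit."""
        result = []
        for tenant_id, calls in self._calls.items():
            if calls < self.min_calls:
                continue
            if self._failures[tenant_id] / calls > self.max_failure_ratio:
                result.append(tenant_id)
        return sorted(result)


monitor = TenantCallMonitor(max_failure_ratio=0.2, min_calls=5)

for _ in range(10):
    monitor.record("tenant-a", ok=True)       # healthy interface
for i in range(10):
    monitor.record("tenant-b", ok=(i >= 4))   # 4 failures out of 10 calls
for _ in range(2):
    monitor.record("tenant-c", ok=False)      # too few calls to judge

print(monitor.unhealthy_tenants())
```

Keeping the counts per tenant is what allows the platform to notify only the affected customer, instead of raising a platform-wide alarm for a problem in one customer's system.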
(5) Business logs
- When a business result does not meet expectations, the complete call chain can be reconstructed and analyzed from the logs to find the cause of the problem.
- Monitoring and alarming on ERROR-level logs and on specific keywords.
- Statistical analysis of call volume, execution results, and other information from structured logs, to support operational statistics.
First, good log content should record just enough information. The problem with logging too little is obvious, but logging too much is also a problem: it dilutes the truly useful information, burdens the entire logging system, and can even affect the performance of the core business.
Furthermore, in a distributed architecture a complete request passes through many nodes, and the call logs on these nodes need to be stitched together when analyzing a problem. This requires a complete set of log collection and query tools, the most commonly used being ELK. Because the logs are spread across different nodes, stitching them together requires the entire call chain to be marked with an identifier that is recorded in the logs.
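One common way to mark a call chain is to attach a trace ID to every log line that a request produces. The sketch below does this with Python's standard `logging`, `contextvars`, and `json` modules. The field names, the `handle_request` function, and the idea that the ID arrives from the upstream node are assumptions for illustration; the article does not specify how the marking is implemented.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the trace ID of the request currently being processed.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class JsonFormatter(logging.Formatter):
    """Emits one JSON object per log line, including the current trace ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "trace_id": trace_id_var.get(),
            "message": record.getMessage(),
        })


def build_logger(name, stream):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


def handle_request(logger, incoming_trace_id=None):
    # Reuse the ID passed by the upstream node, or start a new chain.
    trace_id = incoming_trace_id or uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        logger.info("request received")
        logger.info("request finished")
    finally:
        trace_id_var.reset(token)
    return trace_id


buffer = io.StringIO()
log = build_logger("demo.service", buffer)
tid = handle_request(log, incoming_trace_id="abc123")
lines = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(lines)
```

With every node writing the same `trace_id` field, a log query tool can retrieve the whole call chain of one request with a single filter.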
An efficient health dashboard
As an aggregation and display platform, the health dashboard is used only to show the real-time health status of the system. Each monitoring system reports data to the platform through a unified interface, including:
- Service information: service node information, such as service name, node ID, and node IP address.
- Service domain information: the service domain to which the service belongs. Multi-level service domains can be supported if necessary.
- Data dimension: the monitoring dimension, such as resources, performance, high availability, or service indicators.
- Dimension priority: different priorities carry different display weights, and exceptions with higher priority are displayed first.
- Health level: the health of the service, such as healthy when there are no anomalies, sub-healthy when there are a few warnings, unhealthy when there are many alarms, and down when the service is unavailable.
- Detail link: used for drilling down into details.
- Monitoring dimension aggregation: all monitoring dimension data for the same node can be aggregated and displayed after a weighted calculation based on priority weight and health level.
- Application status aggregation: data from nodes of the same service can be aggregated, and services in the same service domain can be aggregated and displayed together.
- Priority display: entries are sorted in real time by their aggregated health and priority, so that the least healthy are displayed first.
- Drill-down analysis: node data can be drilled into by monitoring dimension and by service dimension.
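The weighted aggregation and priority display described above could be implemented in many ways; the sketch below uses a weighted average as one possibility. The numeric scores per health level, the weights, and the node names are assumptions made for illustration, and a real dashboard might instead let a single "down" dimension override everything else.

```python
# Hypothetical scores per health level: higher means worse.
LEVEL_SCORE = {"healthy": 0, "sub-healthy": 1, "unhealthy": 2, "down": 3}


def node_score(reports):
    """Weighted average health score for one node.

    reports is a list of (dimension, priority_weight, health_level) tuples.
    """
    total_weight = sum(weight for _, weight, _ in reports)
    if total_weight == 0:
        return 0.0
    weighted = sum(weight * LEVEL_SCORE[level] for _, weight, level in reports)
    return weighted / total_weight


def rank_nodes(nodes):
    """Return node IDs sorted so that the least healthy node comes first."""
    return sorted(nodes, key=lambda node_id: node_score(nodes[node_id]), reverse=True)


nodes = {
    "node-1": [("resources", 1, "healthy"), ("performance", 2, "sub-healthy")],
    "node-2": [("resources", 1, "unhealthy"), ("performance", 2, "unhealthy")],
    "node-3": [("resources", 1, "healthy"), ("performance", 2, "healthy")],
}

print(rank_nodes(nodes))
```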
Conclusion