Background
With the rapid growth of the Dewu community and live-streaming business, the number of users keeps increasing and the stability requirements on our services keep rising. How do we quickly attribute a monitoring alarm and resolve the problem? Everyone has their own ways of locating and troubleshooting issues, but less experienced engineers have probably all been through the same stages: being confused by alarm information and not knowing where to start, easily going down the wrong line of thinking, and not knowing how to narrow down the cause. This article focuses on accumulating that kind of knowledge. By learning from each other, drawing on the team's collective experience, and summarizing cases we have investigated, we hope everyone can benefit, locate problems quickly, and stop losses in time.
I. Attribution Practice for Live-Streaming Monitoring Alarms
This article is not about attributing specific business problems, but about how to attribute alarm information to a particular area. Business-level code problems require complete log output, full-link tracing information, and sufficient problem context to judge, and the line of thinking is the same.
At present, the Dewu community and live-streaming services are written in Go and run in a K8s environment, with monitoring metrics displayed in Grafana. The current alarm rules cover RT, QPS, goroutines, panics, HTTP status, and service exceptions. Recently, one live-streaming service ran into RT jitter; although the jitter was resolved by scaling out, we also took some detours while attributing the cause. The whole troubleshooting process is walked through below.
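For context, the alarm rules above are all computed from service metrics. As a rough illustration only (not the actual Dewu monitoring code), a Go service might expose RT, QPS, and goroutine metrics with the Prometheus client roughly like this; all metric, path, and handler names here are made up for the example:

```go
package main

import (
	"net/http"
	"runtime"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metrics; alarm rules such as RT/QPS/goroutine would be built on metrics like these.
var (
	httpDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds", // RT: average and p99 are derived from this histogram
		Help:    "HTTP request latency.",
		Buckets: prometheus.DefBuckets,
	}, []string{"path"})

	httpRequests = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total", // QPS: rate() over this counter
		Help: "Total HTTP requests.",
	}, []string{"path", "code"})

	goroutines = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "service_goroutines", // source of the goroutine alarm
		Help: "Current number of goroutines.",
	})
)

// instrument records latency and request count; the status code is simplified to 200 for brevity.
func instrument(path string, h http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		h(w, r)
		httpDuration.WithLabelValues(path).Observe(time.Since(start).Seconds())
		httpRequests.WithLabelValues(path, "200").Inc()
	}
}

func main() {
	// Sample the goroutine count periodically.
	go func() {
		for range time.Tick(15 * time.Second) {
			goroutines.Set(float64(runtime.NumGoroutine()))
		}
	}()

	http.HandleFunc("/ping", instrument("/ping", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("pong"))
	}))
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```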
Service monitoring behavior
- Alarm feedback: the service's RT and goroutine count were abnormal. Checking the service metrics in Grafana showed a traffic spike at that point in time, with a clear rise in QPS.
- Checking the HTTP and gRPC metrics, both the average RT and the 99th percentile rose significantly.
- In the MySQL metrics, RT rose noticeably. We speculated that a large number of MySQL queries, or slow queries, caused the RT fluctuation and triggered the alarm.
- In the Redis metrics, RT also rose noticeably. We speculated that Redis jitter drove RT up and caused timeouts, after which traffic fell through to MySQL (see the sketch after this list).
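The last observation assumes the common cache-aside pattern: read Redis first and fall back to MySQL on a miss or error, so Redis timeouts translate directly into extra MySQL load. A minimal sketch of that pattern using go-redis and database/sql; the key format, table, and function names are illustrative, not the actual service code:

```go
package cache

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"

	"github.com/go-redis/redis/v8"
)

// getLiveRoom illustrates a cache-aside read: Redis first, MySQL on miss or error.
// If Redis jitters and reads time out, every call falls through to MySQL,
// which matches the observed rise in MySQL QPS and RT.
func getLiveRoom(ctx context.Context, rdb *redis.Client, db *sql.DB, id int64) (string, error) {
	key := fmt.Sprintf("live:room:%d", id)

	val, err := rdb.Get(ctx, key).Result()
	if err == nil {
		return val, nil // cache hit
	}
	if err != redis.Nil {
		// A Redis timeout lands here and still falls back to MySQL.
		log.Printf("redis get %s failed, falling back to mysql: %v", key, err)
	}

	var name string
	if err := db.QueryRowContext(ctx, "SELECT name FROM live_room WHERE id = ?", id).Scan(&name); err != nil {
		return "", err
	}
	// Best-effort backfill; ignore the error so a cache problem does not fail the request.
	_ = rdb.Set(ctx, key, name, 5*time.Minute).Err()
	return name, nil
}
```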
Identify possible causes
Monitoring metric | RT | QPS
---|---|---
External HTTP | RT of all interfaces rose | Traffic spikes in QPS
Redis | RT of all requests rose | QPS fluctuated with the traffic
MySQL | RT of all requests rose | QPS fluctuated with the traffic
Goroutine | Rose significantly on all pods |
Third-party dependencies | RT of all requests rose |
Combining the above observations, we can first determine that the scope of impact is the service level. In addition, the service logs contain Redis timeout errors, as well as timeout errors when calling third-party services.
First, check whether system resources are sufficient. The CPU and memory metrics show no bottleneck at the time of the alarm, so we can rule out service jitter caused by either of them.
Second, the service runs in K8s and is served by multiple pods, each scheduled on a different node, i.e. a different ECS instance. Since the whole service was jittering at the same time, a single-pod or single-machine failure can be ruled out.
The entire service was affected while the other services in the K8s cluster were normal, which rules out a network failure. All inbound and outbound traffic interfaces were affected, which rules out the failure of a single dependent service. What remains to consider is a fault in the service's storage layer (MySQL, Redis, etc.) or in a node on the service's traffic path.
Locating the problem
Checking Alibaba Cloud RDS, the performance trends of MySQL and Redis were normal: no slow-query logs, sufficient machine resources, normal network bandwidth, and no high-availability switchover. So the preliminary judgment was that the storage layer was fine.
So how do we confirm that it is not a storage-layer problem? If another service uses the same storage layer and was normal during the alarm period, the storage-layer fault can be ruled out. In the live-streaming microservice system, another service does use the same storage layer and was normal during the alarm period, so this cause can be excluded.
That leaves only one candidate: a failure of a node on the traffic path. Reviewing the whole link, the service is deployed and operated on K8s, with Istio introduced as the service mesh. Could this component be the cause? The monitoring panel also provides Istio monitoring, as follows:
From the monitoring there seemed to be no problem apart from some fluctuation at the alarm time, and the investigation appeared to hit a dead end here. So what was the cause? Reviewing the previous analysis, the other causes could be ruled out with certainty, and the RT jitter returned to normal after the service was scaled out.
Therefore we turned our attention back to the traffic-path nodes and asked the operations team for support, hoping to check Istio's status in real time. At this point we found that the Istio load reported by the operations side was inconsistent with the monitoring panel. This was also where the team took a detour: the real cause had been skipped because the Istio load metric was wrong. After repeated checks, it turned out that the data collected by the monitoring panel was abnormal. After the monitoring data was fixed, the actual Istio load looked as follows:
The actual Istio load clearly points to the cause of the alarm. It was finally confirmed that the sidecar is currently fixed at 2C1G, while the pod spec, originally 1C512M when the service launched, had been upgraded to 4C2G as the business grew and the number of pods kept increasing. After the service was split, most of the traffic of the alarming service is forwarded through the mesh, so the Istio sidecar's CPU load became too high and caused the jitter. Since the sidecar resources are fixed at 2C1G for now, the service spec can be reduced to 1C2G and the number of pods increased instead, keeping pods and sidecars at 1:1.
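In practice the resizing would go through the deployment platform, but purely as an illustration of the mitigation above (shrink the per-pod spec so each fixed-size sidecar handles less traffic, and raise the replica count), here is a rough client-go sketch; the namespace, deployment and container names, and the numbers are assumptions for the example:

```go
package ops

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// shrinkPodsAndScaleOut lowers the application container to 1C2G and raises the
// replica count, so the 2C1G Istio sidecar next to each pod sees less traffic.
func shrinkPodsAndScaleOut(ctx context.Context, cs *kubernetes.Clientset, ns, name string, replicas int32) error {
	dep, err := cs.AppsV1().Deployments(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}

	dep.Spec.Replicas = &replicas
	for i := range dep.Spec.Template.Spec.Containers {
		c := &dep.Spec.Template.Spec.Containers[i]
		if c.Name != "app" { // "app" is an assumed container name; only touch the application container
			continue
		}
		c.Resources = corev1.ResourceRequirements{
			Requests: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("1"),
				corev1.ResourceMemory: resource.MustParse("2Gi"),
			},
			Limits: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("1"),
				corev1.ResourceMemory: resource.MustParse("2Gi"),
			},
		}
	}

	_, err = cs.AppsV1().Deployments(ns).Update(ctx, dep, metav1.UpdateOptions{})
	return err
}
```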
II. Impact Scope, Possible Causes, and a Reference Approach
We need to locate problems quickly, so first determine the scope of impact and what the possible causes are. The attribution practice for the live-streaming service jitter above shows that it is a process of successive elimination that finally yields the answer. For technology stacks other than the community and live-streaming stack, you can summarize a similar set of rules that fits your own services. For the live-streaming stack, the scope levels and possible causes are as follows:
Scope | Abnormal behavior | CPU | Memory | Storage layer | Traffic path | Third-party dependency | Network fault
---|---|---|---|---|---|---|---
Interface level | One interface of the service is abnormal while the other interfaces are normal | Yes | Yes | | | |
Pod level | One pod is abnormal while the other pods of the service are normal | Yes | Yes | Yes | Yes | Yes |
Service level | All pods of a service are abnormal while the other services in the cluster are normal | Yes | Yes | | | |
Cluster level | The entire cluster where the service resides is affected, e.g. an earlier incident where ingress affected the test environment | Yes | Yes | | | |
IDC level | Services within the IDC are affected as a whole | Yes | Yes | | | |
- CPU: temporary traffic spikes, code issues, service scale-in, scheduled scripts
- Memory: temporary traffic spikes, code issues, service scale-in (in K8s, distinguish between RSS and page cache; see the sketch after this list)
- MySQL, Redis: temporary traffic spikes, slow queries, bulk queries, insufficient resources, automatic HA switchover, manual switchover
- Traffic path node: ingress for north-south traffic, Istio for east-west traffic
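On the RSS vs. cache point: a container's raw memory usage includes page cache, so an alarm built on raw usage can look much worse than the real working set. As a rough sketch only (it assumes cgroup v1 paths and mirrors the usual "usage minus inactive file cache" working-set calculation used by container metrics), a Go program inside the container could read the numbers like this:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readCgroupMemory returns total usage, page cache, and an approximate working set
// (usage - total_inactive_file), which is closer to what memory-pressure decisions
// are based on than raw usage. Paths assume cgroup v1; cgroup v2 lays them out differently.
func readCgroupMemory() (usage, cache, workingSet uint64, err error) {
	raw, err := os.ReadFile("/sys/fs/cgroup/memory/memory.usage_in_bytes")
	if err != nil {
		return 0, 0, 0, err
	}
	usage, err = strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
	if err != nil {
		return 0, 0, 0, err
	}

	f, err := os.Open("/sys/fs/cgroup/memory/memory.stat")
	if err != nil {
		return 0, 0, 0, err
	}
	defer f.Close()

	var inactiveFile uint64
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		v, _ := strconv.ParseUint(fields[1], 10, 64)
		switch fields[0] {
		case "cache":
			cache = v
		case "total_inactive_file":
			inactiveFile = v
		}
	}
	if inactiveFile < usage {
		workingSet = usage - inactiveFile
	}
	return usage, cache, workingSet, sc.Err()
}

func main() {
	usage, cache, ws, err := readCgroupMemory()
	if err != nil {
		fmt.Println("not running in a cgroup v1 container?", err)
		return
	}
	fmt.Printf("usage=%d cache=%d working_set=%d\n", usage, cache, ws)
}
```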
Reference approach:
When you receive an alarm, first determine the affected scope, consider the possible causes, and then rule out specific causes based on the available conditions and evidence. The troubleshooting process is like a funnel, and at the bottom of the funnel lies the root cause, as shown below:
Here are some examples of quickly ruling out causes:

- If other services on the same storage layer are normal, the storage-layer fault can be ruled out
- If the service has more than one pod, a service-wide network fault can basically be ruled out, because the pods are distributed across different ECS instances
- If not all traffic entry points are faulty, a traffic-path-node problem can be ruled out
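Purely as a toy illustration of the funnel (not a real tool), the elimination logic for the case above can be written as starting from the full candidate set and crossing causes off as evidence comes in:

```go
package main

import "fmt"

// Candidate root-cause categories from the table above.
var allCauses = []string{
	"cpu", "memory", "storage layer",
	"traffic path node", "third-party dependency", "network fault",
}

func main() {
	// Start the funnel with every candidate cause for a service-level alarm.
	remaining := map[string]bool{}
	for _, c := range allCauses {
		remaining[c] = true
	}

	// Each observation from the case above rules something out.
	ruleOuts := []struct{ cause, evidence string }{
		{"cpu", "CPU metrics show no bottleneck at alarm time"},
		{"memory", "memory metrics show no bottleneck at alarm time"},
		{"network fault", "other services in the same cluster are normal"},
		{"third-party dependency", "all inbound and outbound interfaces are affected"},
		{"storage layer", "another service on the same storage layer is normal"},
	}
	for _, r := range ruleOuts {
		delete(remaining, r.cause)
		fmt.Printf("ruled out %-24s because %s\n", r.cause, r.evidence)
	}

	// Whatever is left at the bottom of the funnel is where to dig next.
	for c := range remaining {
		fmt.Println("remaining candidate:", c) // traffic path node -> check ingress / Istio
	}
}
```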
III. Traffic Paths and the Storage Layer
Beyond the code level, we should also understand the service's entire traffic path and the infrastructure architecture it relies on, so that when troubleshooting we can quickly zero in on the key problem. The following describes the traffic paths of the community and live-streaming services and the high-availability architecture of the infrastructure.
Traffic paths
North-south traffic
On the north-south traffic path, ingress is the core node; if it fails, the entire K8s cluster becomes unavailable.
East-west traffic
On the east-west traffic path, the Envoy proxy (the Istio sidecar) takes over all traffic, and a problem with the proxy affects the service's pods.
Storage layer
MySQL high-availability architecture
Currently, the community and live-streaming services use MySQL deployed across multiple availability zones. Service jitter may occur during an automatic HA switchover.
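During a switchover, connections already in the pool may briefly still point at the old primary until they are recycled, which is one source of that jitter. A minimal database/sql sketch of bounding that window with connection-lifetime and timeout settings; the DSN and all values are illustrative, not a recommendation:

```go
package storage

import (
	"database/sql"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func openDB() (*sql.DB, error) {
	// DSN-level timeouts keep a hung connection from stalling a request for long.
	dsn := "user:pass@tcp(rds-host:3306)/live?timeout=1s&readTimeout=2s&writeTimeout=2s"
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return nil, err
	}

	db.SetMaxOpenConns(100)
	db.SetMaxIdleConns(20)
	// Recycle connections periodically so that, after a primary/secondary switchover,
	// the pool converges onto the new primary instead of holding stale connections.
	db.SetConnMaxLifetime(5 * time.Minute)
	return db, nil
}
```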
Redis high-availability architecture
Currently, Redis provides cluster mode through a proxy. The proxy and Redis instances are 1:N, and each Redis instance is a master/replica pair.
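Because cluster mode sits behind the proxy, the application usually treats it as a single endpoint rather than using a cluster-aware client. A hedged go-redis sketch; the proxy address, pool size, and timeouts are illustrative (short timeouts like these are what surface as the timeout errors seen in the service logs):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	// With a proxy-based cluster, the client talks to the proxy like a single node;
	// the proxy fans out to the N master/replica shards behind it.
	rdb := redis.NewClient(&redis.Options{
		Addr:         "redis-proxy:6379",
		DialTimeout:  500 * time.Millisecond,
		ReadTimeout:  200 * time.Millisecond,
		WriteTimeout: 200 * time.Millisecond,
		PoolSize:     100,
	})

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println(rdb.Ping(ctx).Err())
}
```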
IV. Summary
Looking back at the whole process, it was a matter of peeling back the layers step by step until the true picture emerged. Rapid alarm attribution requires not only the right troubleshooting approach, but also an understanding of the entire system architecture beyond the code level.
Text / Tim
Follow us and be the trendiest technologist!