Abstract
Generally, the three go-to moves for dealing with online problems are restart, rollback, and capacity expansion, and they often resolve an incident quickly. However, in my years of production experience, these three operations are rather blunt instruments, and whether they actually fix the problem is somewhat a matter of luck. Here is a summary of how I usually handle failures in my own applications.
Principles
The following principles should be followed during troubleshooting.
- Detect faults early to prevent them from spreading. A fault can occur at any link in the request chain, and the lower the layer, the wider its impact, so every layer needs its own monitoring. The sooner a problem is discovered, the sooner it can be solved. For example, if a database host the service depends on fails and you only learn about it from user reports, the service may already have been down for several minutes; by the time you have analyzed the problem and fixed it or switched to the standby, several more minutes have passed and a lot of traffic has been affected. If an alert fires the moment the database has a problem, it can often be resolved before users even notice. (A minimal sketch of such a dependency probe follows this list.)
- Broadcast quickly. When you receive a P0 alert and confirm that the application has a problem, announce it in the group immediately and have everyone enter level-1 incident response. If the problem looks related to a dependent service, middleware, operations, or a cloud vendor, notify the responsible people right away and pull them into the response.
- Restore service first; preserve the scene second. A preserved scene (memory dumps, jstack thread states) helps uncover the root cause, but when a fault is in progress there is no time to lose, and you cannot spend several minutes dumping memory and thread stacks. Restore the service first, then dig for the root cause from the monitoring data collected at the time.
- Verify the effect. Resolving the problem may require online restart, rollback, mock, or traffic-limiting operations. Always check whether the operation achieved the desired effect, and keep observing for a while to confirm the service is really back to normal; sometimes it recovers only briefly and the problem returns.
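As an illustration of the first principle, here is a minimal Java sketch of a dependency probe that checks a database the application relies on and raises an alert as soon as the check fails, instead of waiting for user reports. The JDBC URL, credentials, probe interval, and alert channel are placeholder assumptions; in a real setup this logic lives in the monitoring platform.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal dependency probe: check the database every 30 seconds and raise an
// alert as soon as the check fails, rather than waiting for user reports.
// Requires the JDBC driver (e.g. MySQL Connector/J) on the classpath.
public class DbHealthProbe {
    // Hypothetical connection settings; replace with your real datasource.
    private static final String JDBC_URL = "jdbc:mysql://db-host:3306/app";

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(DbHealthProbe::checkOnce, 0, 30, TimeUnit.SECONDS);
    }

    private static void checkOnce() {
        try (Connection conn = DriverManager.getConnection(JDBC_URL, "app", "secret")) {
            if (!conn.isValid(2)) {                  // 2-second validation timeout
                alert("database connection is not valid");
            }
        } catch (Exception e) {
            alert("database probe failed: " + e.getMessage());
        }
    }

    private static void alert(String message) {
        // Placeholder: wire this to your real alert channel (IM group, SMS, pager, etc.).
        System.err.println("[P0 ALERT] " + message);
    }
}
```

The same pattern applies to any other layer: probe it directly rather than inferring its health from user-visible errors.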
Handling measures
The remedy may be a restart, capacity expansion, rollback, traffic limiting, degradation, or a hotfix. Here is how I normally work through an online problem.
Step 1: Was there a change?
Just as most car accidents happen at moments of change, online failures are usually triggered by changes. External changes are hard to perceive, but changes to the service itself, such as a release or a configuration change, are easy to spot. So first determine whether the change can be rolled back; if it can, roll it back immediately.
Step 2: Is the problem limited to a single machine?
Cluster deployment is now common and services are highly available. If only one machine has a problem, take it out of rotation as soon as possible; if it cannot be removed immediately, expand capacity first and then remove it. (A minimal sketch of draining a node via its health check follows.)
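A common way to "remove" a faulty instance without killing its process is to flip its health check so the load balancer drains traffic away from it. The sketch below assumes a balancer that polls an HTTP /health endpoint; the port, paths, and the way the switch is flipped are illustrative assumptions, not a prescribed mechanism.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicBoolean;

// Health endpoint the load balancer polls. Flipping the flag makes it return 503,
// so the balancer stops routing traffic to this instance while the process stays
// up for inspection.
public class HealthToggle {
    static final AtomicBoolean healthy = new AtomicBoolean(true);

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
        server.createContext("/health", exchange -> {
            boolean up = healthy.get();
            byte[] body = (up ? "UP" : "DOWN").getBytes();
            exchange.sendResponseHeaders(up ? 200 : 503, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        // Internal-only operator hook (assumed) that takes the node out of rotation.
        server.createContext("/offline", exchange -> {
            healthy.set(false);
            exchange.sendResponseHeaders(200, -1);   // -1 = no response body
            exchange.close();
        });
        server.start();
    }
}
```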
Step 3: Is the whole cluster affected?
When the entire service cluster has a problem, the situation is more complex, and you need to distinguish between a single API failing and multiple APIs failing.
- A single API fails: check whether the failing API affects other APIs, other modules in the application, or downstream storage. If it does, degrade it in time; if the errors are caused by increased traffic, apply rate limiting so that other modules are not affected. Then troubleshoot the underlying problem and ship a hotfix. (A sketch of a combined degrade-and-limit guard follows this list.)
- Multiple APIs fail: this is usually a Step 4 problem, so go straight to Step 4 to check. If Step 4 shows nothing wrong and the traffic exceeds the expected level, apply rate limiting and expand capacity; otherwise look for a code problem and hotfix it to production.
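As a sketch of the degrade-and-limit idea for a single failing API, written in plain Java with hypothetical names (recommend, defaultResult): a degradation switch short-circuits the broken path, and a concurrency cap sheds excess traffic so the rest of the application stays healthy.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;

// Degrade + limit for one API so its failure does not drag down other modules.
public class RecommendFacade {
    // Flipped by ops or a config center when the dependency behind this API is broken.
    private static final AtomicBoolean degraded = new AtomicBoolean(false);
    // Maximum concurrent calls allowed through; tune to the measured capacity.
    private static final Semaphore permits = new Semaphore(50);

    public String recommend(String userId) {
        if (degraded.get()) {
            return defaultResult();            // degradation: skip the broken path entirely
        }
        if (!permits.tryAcquire()) {
            return defaultResult();            // traffic limiting: shed load instead of queueing
        }
        try {
            return callRealRecommendService(userId);
        } catch (Exception e) {
            return defaultResult();            // fail fast so errors do not cascade
        } finally {
            permits.release();
        }
    }

    private String callRealRecommendService(String userId) {
        // Placeholder for the real downstream call.
        return "personalized-result-for-" + userId;
    }

    private String defaultResult() {
        return "fallback-hot-list";            // cached or static bottom-line data
    }
}
```

A rate-limiting or circuit-breaker library can replace the hand-rolled switch and semaphore; the point is that the guard sits in front of the single failing API only.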
Step 4: A dependent service or storage has a problem
Contact the owning team immediately and look at the problem together. If the fault is caused by abnormal requests from your service, fix those first; if your requests are normal, the dependency needs capacity expansion or a configuration upgrade.
How to prevent failures
The steps above show that many judgment calls have to be made when a fault occurs. Without experience, mishandling a fault can easily escalate it and cause real losses, so prevention is needed.
Know your service
Understand your service the way a philosopher examines himself. In general, that covers the following.
Draw an application architecture diagram
It should cover the following:
- Owners: who should be notified if the service has a problem
- Modules: which functional modules the application contains, so that when a user reports a problem you roughly know which service is at fault
- Flow: how requests flow between the modules
- Dependent middleware: which middleware the application depends on and who is responsible for it
- Dependent storage: which databases and message queues it depends on and who operates them
- Dependent services: which services each function depends on, who to call when one of them breaks, and whether the dependency is weak and can be degraded
Draw an application system deployment diagram
How and in what environment the system is deployed, and how to log in to it, expand it, and upgrade it.
Sort out the system's fault levels
Which modules are core and which are less important and can be downgraded.
Pressure-test drills
How many QPS a single machine can currently sustain, and where the performance bottlenecks may be, need to be determined through pressure testing.
What is the read/write ratio of the application's APIs? What share of traffic does the application send to each storage layer? As the application's QPS rises, which dependency fails first: Redis, MySQL, another dependent service, or the application itself? (A rough load-generator sketch follows.)
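A real pressure test belongs on a proper load-testing platform, but as a rough sketch of the idea, the following Java program hammers one (hypothetical) endpoint from a fixed number of threads for a fixed duration and reports the achieved QPS and error count. The URL, thread count, and duration are assumptions to adjust for your own service.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Crude single-API load generator: N threads call the endpoint until the deadline,
// then the achieved QPS and error count are printed.
public class MiniLoadTest {
    public static void main(String[] args) throws Exception {
        final String url = "http://localhost:8080/api/hello";   // hypothetical endpoint
        final int threads = 20;
        final long durationMs = 30_000;

        HttpClient client = HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(2)).build();
        LongAdder ok = new LongAdder();
        LongAdder errors = new LongAdder();
        long deadline = System.currentTimeMillis() + durationMs;

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
                while (System.currentTimeMillis() < deadline) {
                    try {
                        HttpResponse<Void> resp = client.send(request, HttpResponse.BodyHandlers.discarding());
                        if (resp.statusCode() < 500) ok.increment(); else errors.increment();
                    } catch (Exception e) {
                        errors.increment();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(durationMs + 5_000, TimeUnit.MILLISECONDS);

        System.out.printf("QPS ~= %.1f, errors = %d%n", ok.sum() * 1000.0 / durationMs, errors.sum());
    }
}
```

Watch the dependency metrics (Redis, MySQL, downstream services) while the load runs; whichever saturates first is the bottleneck to fix or to plan capacity around.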
Periodic inventory
Whether it is a user-reported failure or a monitoring alert, by the time it arrives it is usually already late, because a certain number of failed calls have accumulated. So you need to stay one step ahead and inspect your application regularly. The metrics typically revolve around utilization, saturation, throughput, and response time, and should cover all dependencies. (A minimal self-inventory sketch follows this list.)
- Application level: disk, CPU, memory, load, JVM GC
- System level: QPS
- Dependent storage: disk, CPU, IOPS, QPS
- Message queues: whether the consumption rate is normal
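As a minimal example of such a self-inventory, the sketch below uses only the standard JMX beans to log heap usage, system load, and GC activity once a minute; the interval and output format are arbitrary choices, and a real inventory would also cover the dependencies listed above.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Periodic self-inventory via standard JMX beans: heap usage, system load and GC
// counters are logged every minute so slow growth is visible before it becomes an alert.
public class PeriodicInventory {
    public static void main(String[] args) {
        Executors.newSingleThreadScheduledExecutor()
                 .scheduleAtFixedRate(PeriodicInventory::report, 0, 1, TimeUnit.MINUTES);
    }

    static void report() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        long gcCount = 0, gcTimeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            gcCount += gc.getCollectionCount();
            gcTimeMs += gc.getCollectionTime();
        }
        // heap.getMax() assumes -Xmx is set; otherwise it may be -1.
        System.out.printf("heap %.0f%% used, load=%.2f, gcCount=%d, gcTime=%dms%n",
                100.0 * heap.getUsed() / heap.getMax(), load, gcCount, gcTimeMs);
    }
}
```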
In addition, system logs are the primary source of fault information. The application owner should review error logs periodically to catch potential problems before they escalate.
Monitoring and alerting
Monitoring and alerting help detect faults early, so make sure monitoring coverage is complete and alerts are actually delivered. The following are some common monitoring items (a small sketch evaluating two of the host rules follows the table):
Type | Monitoring item |
---|---|
Host state | Disk usage > 85% |
Host state | 5-minute load > number of cores x 1.5 |
Host state | 5-minute memory usage > 80% |
Host state | 5-minute CPU usage > 50% |
API | 5-minute API error rate > 0.1 |
SQL | Slow query time > 100 ms |
Log | More than 10 errors per minute |
Log | More than 50 errors within 5 minutes |
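To make the thresholds concrete, here is a small sketch that evaluates two of the host rules from the table (disk usage > 85% and load > number of cores x 1.5) on the local machine. Note that getSystemLoadAverage() reports roughly a one-minute average, used here as a stand-in for the five-minute value; in practice these rules belong in the monitoring platform, not in application code.

```java
import java.io.File;
import java.lang.management.ManagementFactory;

// Evaluate two of the host-level rules from the table and print an alert line.
public class HostAlertRules {
    public static void main(String[] args) {
        File root = new File("/");                   // assumes a Unix-like root filesystem
        double diskUsedPct = 100.0 * (root.getTotalSpace() - root.getUsableSpace()) / root.getTotalSpace();
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        int cores = Runtime.getRuntime().availableProcessors();

        if (diskUsedPct > 85) {
            System.err.printf("[ALERT] disk usage %.1f%% > 85%%%n", diskUsedPct);
        }
        if (load > cores * 1.5) {
            System.err.printf("[ALERT] load %.2f > cores(%d) x 1.5%n", load, cores);
        }
    }
}
```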