Problem
At about 10:50 this morning, pages in the system suddenly became inaccessible. Ours is a SaaS system, and many tenants hit the same error, so the problem affected tenant usage.
Investigation process
Symptoms
Time
2021-09-15 10:50 am
Observed problems
- Multiple tenant material list pages do not display content
- Sentry reported a large number of errors
All of the errors came from the same project, General.
Investigation method
Because the error logs did not point to any specific business code, we checked correlated factors instead.
Confirm recent changes
The problem started at around 10:50 in the morning, so we reviewed what had changed shortly before and after that time.
- Project releases
A project was released last night, but no impact from it has been found so far.
- Configuration changes
No direct effect.
- Error logs
The logs were not enough to locate the problem.
- Whether the data volume changed significantly
The new tenant T247 had published 800,000 (80W) new materials, which could put pressure on the system.
None of these checks pointed to the cause, either directly or indirectly.
Determine whether it is a code or project issue
The General module was reporting errors, but the error logs could not confirm that a code logic problem was the cause.
At first I wanted to ask O&M to restart the General project, but O&M told me that it had already restarted automatically 6 times and that restarting did not solve the problem.
- View container node
As O&M said, the General project kept restarting. The restarts were triggered by the node's health check: if the health check request gets no response within 3s five times in a row, the container restarts automatically. However, the maximum request time was still high after each restart. O&M then turned off the health check. Strangely, once the health check was off, the system started working again. But we could not relax until we found the real cause. By then it was around 11:25 AM.
- Focus on Grafana
Grafana showed that the maximum request time reached a minute and a half, which is definitely not normal. No wonder the container kept restarting: most likely requests were blocked, so the health check interface could not respond in time.
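The restart loop described above can be sketched as a small simulation. This is only an illustration, assuming a probe with a 3s timeout and a threshold of five consecutive failures as O&M described; the function name and the sample latencies are hypothetical and are not taken from the real cluster configuration.

```python
from typing import Iterable, Optional

PROBE_TIMEOUT_S = 3.0   # each health check waits at most 3s (from the incident notes)
FAILURE_THRESHOLD = 5   # five consecutive failures trigger a restart


def first_restart_index(probe_latencies_s: Iterable[float]) -> Optional[int]:
    """Return the index of the probe that triggers a restart, or None.

    probe_latencies_s holds how long the service took to answer each
    consecutive health check (hypothetical sample data).
    """
    consecutive_failures = 0
    for i, latency in enumerate(probe_latencies_s):
        if latency > PROBE_TIMEOUT_S:
            consecutive_failures += 1
        else:
            consecutive_failures = 0  # one success resets the counter
        if consecutive_failures >= FAILURE_THRESHOLD:
            return i
    return None


# Healthy service: occasional slow answers never reach the threshold.
healthy = [0.1, 4.0, 0.2, 5.0, 0.1, 0.3]
# Blocked service: every probe waits behind requests that take about 90s.
blocked = [0.2, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0]

print(first_restart_index(healthy))
print(first_restart_index(blocked))
```

If a restart does not remove whatever makes requests slow, the same sequence simply repeats after every restart, which would be consistent with the six automatic restarts O&M reported.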
Locate the cause of slow requests (slow SQL)
At that point the project's CPU, memory, and garbage collection all looked normal, so we narrowed the problem down to the database and contacted Yunvilla to get the slow log, then downloaded it.
There were a lot of slow queries.
Among the entries, a familiar identifier, T247, appeared more than once. One of these SQL statements ran for 49 seconds.
Of course, a single 49s slow query would not cripple the system on its own.
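When a slow log is this large, aggregating entries by tenant makes a repeat offender such as T247 stand out quickly. The sketch below assumes the log has already been parsed into (tenant, seconds) pairs; the record layout, the function name, and all sample values except the 49s figure are hypothetical, since the real slow log format is not shown here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def summarize_by_tenant(
    entries: Iterable[Tuple[str, float]],
) -> List[Tuple[str, int, float, float]]:
    """Group slow-query entries by tenant.

    Returns (tenant, count, total_seconds, max_seconds) tuples,
    sorted so the tenant with the most total time comes first.
    """
    durations: Dict[str, List[float]] = defaultdict(list)
    for tenant, seconds in entries:
        durations[tenant].append(seconds)
    summary = [
        (tenant, len(secs), sum(secs), max(secs))
        for tenant, secs in durations.items()
    ]
    return sorted(summary, key=lambda row: row[2], reverse=True)


# Hypothetical sample: only the 49s entry for T247 comes from the incident.
sample = [("T247", 49.0), ("T100", 3.0), ("T247", 41.0), ("T031", 2.0), ("T247", 38.0)]

for tenant, count, total, worst in summarize_by_tenant(sample):
    print(f"{tenant}: {count} slow queries, {total:.0f}s total, worst {worst:.0f}s")
```

Sorting by total time rather than by the single worst query reflects the point made above: one 49s query is tolerable, but many of them from the same tenant add up.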
Ask the people behind the recent launch
Going back to the first step, we found the owner of the recent launch, because that launch added 800,000 (80W) records and slow SQL is closely tied to data volume. On closer inspection we found a killer problem: a front-end page was calling a General interface every 10 seconds, and that interface was very slow. So I opened Grafana, replayed the request, and looked at the request time again. As expected, it was slow. A fix for this logic was then scheduled.
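A rough calculation shows why polling a slow interface every 10 seconds is so damaging. The sketch below applies Little's law (average requests in flight = arrival rate × response time). It assumes the page sends a new request every 10 seconds without waiting for the previous one, and that each request takes about as long as the 49s query seen in the slow log. Both are assumptions, as is the number of open pages.

```python
def in_flight_requests(
    poll_interval_s: float, response_time_s: float, open_pages: int = 1
) -> float:
    """Average number of concurrent requests (Little's law: L = rate * time).

    Assumes each open page fires a new request every poll_interval_s
    without waiting for the previous one to finish.
    """
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    arrival_rate = open_pages / poll_interval_s  # requests per second
    return arrival_rate * response_time_s


fast = in_flight_requests(poll_interval_s=10, response_time_s=0.2)
slow = in_flight_requests(poll_interval_s=10, response_time_s=49)
many = in_flight_requests(poll_interval_s=10, response_time_s=49, open_pages=20)

print(f"fast interface, 1 page:   {fast:.2f} requests in flight")
print(f"slow interface, 1 page:   {slow:.2f} requests in flight")
print(f"slow interface, 20 pages: {many:.2f} requests in flight")
```

Under these assumptions a single open page keeps roughly five slow queries running at all times, and the load grows linearly with the number of users who leave the page open.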
Why did turning off the health check help?
Subsequent tests showed that slow SQL was the cause of the General project's failure. The General project was able to recover because the slow SQL tasks had completed.
While looking at the slow SQL log, I noticed that the number of slow queries for tenant T247 dropped dramatically around the time the health check was turned off (11:25 AM). This basically confirms that slow SQL was to blame.
Conclusion
First, when you run into a difficult problem, do not panic; respond actively. More people means more strength, so analyze it together.
You can follow the investigation steps above one by one.
- View indicators
- Sentry project error detection
- K8s restart count
- Grafana performance indicators
- Database CPU and memory metrics
- Slow SQL checks
- Confirm recent changes
- Check code and project status
- Analyze the code or project
- Bring in the people identified in the first step to analyze together
These steps are basically enough to locate the problem. The remaining 20% of the work is fixing the bug.