Problem

At about 10:50 this morning, system pages suddenly became inaccessible. We run a SaaS system, and many tenants reported the same error, so the problem directly affected tenant usage.

Troubleshooting process

Symptoms

Time

2021-09-15 10:50 am

Observed behavior

  1. Multiple tenants' material list pages display no content

  2. Sentry reports a large number of errors

All the errors came from the same project: General.

Troubleshooting method

Because the error logs could not be traced back to specific business code, we investigated by correlating recent events instead.

  1. Confirm recent changes

The problem started around 10:50 am, so we worked backwards from that point and reviewed what had changed recently.

  2. Recently deployed projects

A project was deployed last night, but no related impact has been found so far.

  3. Configuration changes

No direct impact was found.

  4. Error logs

The error logs alone could not pinpoint the problem.

  5. Significant changes in data volume

A new tenant, T247, had recently published 800,000 (80W) new materials, which may have put pressure on the system.

None of the checks above pointed to the cause, directly or indirectly.

  6. Determine code & project issues

The errors came from the General module, but from the error logs alone we could not confirm that they were caused by a code logic problem.

At first I wanted to ask O&M (operations) to restart the General project, but O&M told me the project had already been restarted automatically 6 times, and restarting did not solve the problem.

  7. Check the container nodes

As O&M said, the General project kept restarting. The restarts were triggered by the node's health check mechanism: if the health check request gets no response within 3s for five consecutive probes, the container is restarted automatically. However, even after each restart the maximum request time stayed high. O&M then turned off the health check mechanism. Strangely, once the health check was off, the system somehow recovered. But we could not relax until we found the real answer. By now it was around 11:25 am.
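For reference, the restart behavior described above works roughly like the loop below (a minimal Python sketch of a liveness-style probe; the endpoint URL, probe interval, and restart hook are assumptions, not our actual cluster configuration):

```python
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # hypothetical health endpoint
TIMEOUT_SECONDS = 3                          # per-probe timeout from the incident
FAILURE_THRESHOLD = 5                        # five consecutive failures trigger a restart

def probe_once() -> bool:
    """Return True if the health endpoint answers within the timeout."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=TIMEOUT_SECONDS) as resp:
            return resp.status == 200
    except Exception:
        return False

def health_check_loop(restart_container) -> None:
    """Probe forever; restart the container after enough consecutive failures."""
    consecutive_failures = 0
    while True:
        if probe_once():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURE_THRESHOLD:
                restart_container()          # counter resets after a restart
                consecutive_failures = 0
        time.sleep(10)                       # probe interval (assumed)
```

With requests blocked, every probe times out, the failure counter reaches 5, and the container restarts in a loop, which matches what O&M observed.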

  8. Focus on Grafana

Grafana showed that the maximum request time reached a minute and a half, which is definitely not normal. No wonder the container kept restarting: requests were blocked, so the health check endpoint could not return a result in time.
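One common way blocked requests take a health endpoint down with them is worker starvation: when every worker thread is stuck on a slow call, even a trivial health handler has to queue for longer than the probe's 3s timeout. Below is a minimal Python sketch of that assumed mechanism (the real General service's threading model may differ):

```python
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)  # stand-in for the HTTP worker threads

def slow_business_request():
    time.sleep(49)                        # stands in for a request waiting on a 49s slow SQL
    return "data"

def health_handler():
    return "ok"                           # trivial, but it still needs a free worker

# Tenant pages keep firing slow requests until every worker is occupied...
for _ in range(4):
    pool.submit(slow_business_request)

# ...so the health probe's request waits in the queue far beyond 3 seconds.
start = time.time()
pool.submit(health_handler).result()
print(f"health check answered after {time.time() - start:.0f}s")  # roughly 49s
```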

  9. Locate the cause of the slow requests (slow SQL)

At that time the project's CPU, memory, and garbage collection were all normal, so we narrowed the problem down to the database and contacted the cloud vendor to pull the slow query log.

There were a lot of slow queries in it.

Among the entries, a familiar identifier, T247, appeared more than once. One of these SQL statements took 49 seconds to execute.

Of course, a single 49s slow SQL statement alone won't cripple the system.
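For reference, the slow-log triage looked roughly like the sketch below (the file name is made up, it assumes the standard MySQL `# Query_time:` header format, that the tenant ID appears literally in the SQL text, and it keeps only the first statement line of each entry for brevity):

```python
import re

SLOW_LOG = "slow.log"   # hypothetical local copy of the slow log from the cloud vendor
TENANT = "T247"

entries = []            # list of (query_time_seconds, first_sql_line)
query_time = None
with open(SLOW_LOG, encoding="utf-8") as f:
    for line in f:
        match = re.match(r"# Query_time:\s*([\d.]+)", line)
        if match:
            query_time = float(match.group(1))
        elif query_time is not None and not line.startswith(("#", "SET ", "use ")):
            entries.append((query_time, line.strip()))
            query_time = None

# Keep only the suspect tenant's statements and show the slowest ones first.
tenant_entries = sorted((e for e in entries if TENANT in e[1]), reverse=True)
for seconds, sql in tenant_entries[:5]:
    print(f"{seconds:6.1f}s  {sql[:80]}")
```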

  10. Ask about the recent launch

As a first step, we found the owner of the recent release, because it had added 800,000 records, and slow SQL is closely related to data volume. On closer inspection, we found the killer problem: a front-end page was calling a General interface every 10 seconds, and that interface was very slow. I opened Grafana, replayed the request, and checked its response time again. As expected, it was slow. A fix for this logic was then scheduled.
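A bit of back-of-the-envelope arithmetic shows why this polling was so damaging (only the 10s interval and the ~49s query time come from the investigation; the number of simultaneously open tenant pages is assumed):

```python
poll_interval_s = 10   # the front-end page calls the General interface every 10 seconds
query_time_s = 49      # each call triggers a ~49s slow SQL query
open_pages = 20        # assumed number of tenant pages open at the same time

# New polls start long before the previous query finishes, so each open page
# keeps roughly query_time / poll_interval requests in flight at once.
in_flight_per_page = query_time_s / poll_interval_s   # ~5 concurrent slow queries per page
total_in_flight = in_flight_per_page * open_pages     # ~100 concurrent slow queries overall

print(f"{in_flight_per_page:.0f} per page, about {total_in_flight:.0f} in total")
```

That pile-up is more than enough to saturate the database and leave ordinary requests blocked behind it.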

Why did turning off the health check help?

Subsequent testing confirmed that the slow SQL was what brought the General project down. The General project could recover because the slow SQL tasks eventually finished.

While reviewing the slow query log, I noticed that the slow SQL from tenant T247 dropped dramatically once the health check was turned off (11:25 am). That basically pinned the blame on the slow SQL.

Conclusion

When you first run into a difficult problem, don't panic; respond actively. Pull in more people and more hands, and analyze it together.

You can follow the investigation method above step by step:

  1. View the indicators
  • Sentry project checks
  • K8s restart counts
  • Grafana performance metrics
  • Database CPU & memory metrics
  • Slow SQL checks
  2. Confirm recent changes

  3. Code and project status checks

  4. Analyze the code or projects

  5. Pull in the people involved in the first steps and analyze together

These steps are basically enough to locate the problem; the remaining 20% of the work is fixing the bug.