Background

A new marketing center project went live. Shortly after launch, the monitoring platform showed RT (response time) rising and falling in a regular pattern.



The first monitoring chart

When I first looked at the monitoring chart, I assumed the cause was a batch of Redis keys expiring at the same time: the interval between the peaks was exactly 15 minutes, and the Redis key expiration time happened to be set to the same value. At the same time, the company's operations (O&M) team reported that the SQL request volume against this table was heavy, 36,530 calls in 15 minutes, accounting for 80% of the database's load.

Link (trace) monitoring also showed that some MySQL RTs were very high.



Initial problem diagnosis

Combining this with the DB response time, the initial diagnosis was: after the cached keys expired, requests penetrated the cache and the resulting flood of SQL queries drove RT up.

But that didn't really explain why the rise was so regular.

Nevertheless, cache breakdown protection was added and released online, and RT actually dropped! The problem was considered solved.
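For context, the sketch below shows the kind of cache breakdown protection that was added, under a couple of assumptions: the CacheClient interface and loadFromDb() are hypothetical stand-ins for the real Redis client and SQL query, and the real project code may look quite different. The idea is that only one thread rebuilds an expired hot key while the others wait briefly and re-read the cache, and the TTL is jittered so keys no longer expire in the same 15-minute batch.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch of cache breakdown protection: a lock guards the rebuild of a
// hot key so that a single expiration does not turn into a burst of identical
// SQL queries. CacheClient and loadFromDb() are hypothetical placeholders.
public class MarketingCache {

    private final CacheClient cache;                     // hypothetical Redis wrapper
    // One lock for all keys keeps the sketch short; a real implementation would lock per key.
    private final ReentrantLock rebuildLock = new ReentrantLock();

    public MarketingCache(CacheClient cache) {
        this.cache = cache;
    }

    public String get(String key) throws InterruptedException {
        String value = cache.get(key);
        if (value != null) {
            return value;
        }
        // Cache miss: only one thread goes to the database; the rest back off
        // briefly and re-read the cache instead of piling onto MySQL.
        if (rebuildLock.tryLock()) {
            try {
                value = loadFromDb(key);                 // hypothetical DB call
                // Jitter the TTL so a batch of keys does not expire at the same instant.
                long ttlSeconds = Duration.ofMinutes(15).getSeconds()
                        + ThreadLocalRandom.current().nextLong(0, 120);
                cache.set(key, value, Duration.ofSeconds(ttlSeconds));
                return value;
            } finally {
                rebuildLock.unlock();
            }
        }
        Thread.sleep(50);                                // brief back-off
        String retried = cache.get(key);
        return retried != null ? retried : loadFromDb(key);
    }

    private String loadFromDb(String key) {
        // Placeholder for the real SQL query against the hot table.
        return "value-for-" + key;
    }

    // Hypothetical cache abstraction, standing in for a Redis client.
    public interface CacheClient {
        String get(String key);
        void set(String key, String value, Duration ttl);
    }
}
```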



The problem reappears

After the Dragon Boat Festival holiday, I looked at RT again today and it was back to the pattern in the first chart.



Troubleshooting

Clearly the problem was not what we had thought. So I checked the server and found that its CPU usage was also behaving strangely.

jstack was then used to examine the application's thread usage, and nothing abnormal was found.

Use top -Hp <pid> (here the Java process, PID 19832) to see which threads are using the most CPU.



printf "%x\n" 19838 converts the busiest thread ID to hexadecimal: 4d7e.

jstack 19832 | grep "4d7e" then locates that thread in the thread dump.



It turned out that the GC threads were the ones consuming the most CPU.

jstat -gc 19832 1000 was then used to check the GC statistics, sampling once per second.
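For reference, the jstat -gc columns that matter most here (these are standard jstat column names, listed from memory rather than taken from the original screenshot):

```
OC / OU     old generation capacity / usage (KB)
YGC / YGCT  young GC count / total young GC time (s)
FGC / FGCT  full GC count / total full GC time (s)
GCT         total GC time (s)
```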



That exposed a big problem: the old generation was configured with only 64 MB, and the service had been doing full GC non-stop online; over the three days of the Dragon Boat Festival it ran more than 190,000 full GCs. Time to go have tea with the O&M team.

Conclusion

The old generation was configured far too small, so the system kept blocking user threads during full GC. Each pause was roughly 100 ms, which made RT soar; when a full GC finished, RT recovered, and then the next full GC started the cycle over again.
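For illustration only, a sketch of the kind of JVM options that would both avoid an undersized old generation and make a full GC storm visible in the logs. The flags are standard HotSpot (Java 8 style) options, but the sizes, log path and jar name are placeholders, not the project's real configuration:

```
# Placeholder sizes and paths, not the project's actual configuration.
# -Xms/-Xmx fix the total heap; -Xmn sets the young generation, and the
# remainder becomes the old generation (so it can no longer be squeezed to 64 MB).
# GC logging makes a full GC storm visible without ssh-ing into the box.
java -Xms2g -Xmx2g -Xmn768m \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -Xloggc:/var/log/app/gc.log \
     -jar marketing-center.jar
```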

Reflections

1. The monitoring platform lacked JVM monitoring (see the sketch after this list).

2. Evaluate the cache breakdown risk of interfaces that take heavy request volume.

3. Troubleshooting should look at CPU, memory, IO and the JVM together.
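On point 1, a minimal sketch of how a service can expose its own GC counts and pause times using the standard java.lang.management API; the report() method is a placeholder for whatever metrics client the monitoring platform actually provides:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal sketch of in-process JVM GC monitoring using the standard
// GarbageCollectorMXBean API. report() is a placeholder for pushing the
// numbers to whatever metrics platform the team actually uses.
public class GcMonitor {

    public static void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(GcMonitor::sample, 0, 15, TimeUnit.SECONDS);
    }

    private static void sample() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionCount()/getCollectionTime() are cumulative since JVM start;
            // the monitoring platform can turn them into rates and alert on old-gen
            // collectors (e.g. "PS MarkSweep", "ConcurrentMarkSweep", "G1 Old Generation").
            report(gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }

    // Placeholder: replace with a real metrics client (StatsD, Prometheus, etc.).
    private static void report(String collector, long count, long millis) {
        System.out.printf("gc.%s count=%d time_ms=%d%n", collector, count, millis);
    }
}
```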