1. Incident overview
After the service went live, the system crashed during peak hours: every service instance failed with a memory overflow (OOM). Adding more instances to the service cluster and increasing the memory allocated to each service did not help.
2. Impact
The marketing services went down, so all marketing campaigns failed and users could not place orders. This produced a large number of access errors and problem tickets.
3. Analysis process
3.1. Reproducing the fault
A Spring AOP aspect recorded the number of calls to each controller endpoint, which gave a rough list of the most frequently accessed endpoints. JMeter was then used to simulate concurrent access to those high-frequency endpoints and reproduce the failure.
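The counting step can be sketched in plain Java as follows. This is a minimal illustration, not the aspect used in the incident: `AccessCounter`, `record`, and `top` are hypothetical names, and in a Spring application `record` would presumably be called from an advice wrapped around the controller methods.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Collectors;

/** Hypothetical helper that counts calls per controller endpoint. */
final class AccessCounter {
    // LongAdder keeps increments cheap when many request threads update the same key.
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    /** Record one call to the given endpoint (thread-safe). */
    void record(String endpoint) {
        counts.computeIfAbsent(endpoint, k -> new LongAdder()).increment();
    }

    /** Number of calls recorded so far for the endpoint, or 0 if none. */
    long count(String endpoint) {
        LongAdder adder = counts.get(endpoint);
        return adder == null ? 0L : adder.sum();
    }

    /** The n most frequently called endpoints, most frequent first. */
    List<String> top(int n) {
        return counts.entrySet().stream()
                .sorted(Comparator.comparingLong(
                        (Map.Entry<String, LongAdder> e) -> e.getValue().sum()).reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        AccessCounter counter = new AccessCounter();
        // Example endpoint names are made up for the demo.
        counter.record("/coupon/list");
        counter.record("/coupon/list");
        counter.record("/activity/detail");
        System.out.println("High-frequency endpoints: " + counter.top(2));
    }
}
```

The endpoints returned by `top` are the candidates to drive with a concurrent load test.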
3.2. Memory leak analysis
- Add service startup parameters, then use MAT to analyze the dump file and locate the cause of the memory leak.
Specific steps: 1>. Add the startup parameters -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/logs/heap.hprof so that a dump file (.hprof) is saved when memory overflows; 2>. Add -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:/opt/logs/gc.log to record the GC log; 3>. Use Memory Analyzer (MAT) to analyze the dump file recorded at OOM.
- GC log: after the stress test, full GC could not reclaim old-generation objects, and as a result young GC could not reclaim young-generation objects either, which confirmed the memory leak.
- MAT analysis: the large-object analysis showed that every request to an endpoint occupied about 100 MB of memory. Even an empty endpoint (deprecated, with no actual processing logic) occupied about 100 MB, so the leak was suspected to come from the framework or its configuration rather than from business code.
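Suspect configuration: server.maxHttpHeadSize=99999999; server.maxHttpHeadSize=99999; A limit of 99,999,999 bytes is about 95 MiB, which is consistent with the roughly 100 MB held per request.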
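Before running the stress test, it can help to confirm that the diagnostic startup parameters from the steps above actually reached the JVM. The sketch below reads the running JVM's input arguments through the standard `RuntimeMXBean`; the class name `JvmFlagCheck` and the `missing` helper are hypothetical. The GC logging flags listed are the JDK 8 forms, and newer JDKs use unified logging (`-Xlog:gc*`) instead, so the list would need adjusting there.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical check that the diagnostic JVM flags are present. */
final class JvmFlagCheck {
    // Flag prefixes taken from the analysis steps (JDK 8 syntax).
    static final List<String> REQUIRED = Arrays.asList(
            "-XX:+HeapDumpOnOutOfMemoryError",
            "-XX:HeapDumpPath=",
            "-XX:+PrintGCDateStamps",
            "-XX:+PrintGCDetails",
            "-Xloggc:");

    /** Returns the required prefixes that no argument in jvmArgs starts with. */
    static List<String> missing(List<String> jvmArgs) {
        List<String> result = new ArrayList<>();
        for (String prefix : REQUIRED) {
            boolean found = false;
            for (String arg : jvmArgs) {
                if (arg.startsWith(prefix)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                result.add(prefix);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Arguments passed to this JVM at startup (excluding the main class and its args).
        List<String> actual = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Missing diagnostic flags: " + missing(actual));
    }
}
```

An empty list means a heap dump and GC log should be available if the OOM is reproduced.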
3.3. Solution
1> Concurrent access to an empty-logic test endpoint reproduces the memory leak; 2> Delete the server.maxHttpHeadSize=99999999 configuration, gradually increase the concurrency, and observe memory usage. In the simulated test environment, with concurrency above 1000, QPS was low but no memory overflow occurred, so the problem was judged to be resolved.
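A rough back-of-the-envelope calculation shows why removing the setting matters. The sketch assumes, as a simplification suggested by the MAT observation, that each in-flight request holds one buffer of the configured header size. The 8 KiB comparison value and the names `HeaderBufferEstimate`, `worstCaseBytes`, and `toMiB` are illustrative assumptions, not taken from the incident.

```java
/** Hypothetical estimate of memory held by request header buffers. */
final class HeaderBufferEstimate {

    /** Worst case: every concurrent request holds one full-size header buffer. */
    static long worstCaseBytes(long maxHeaderBytes, int concurrentRequests) {
        return Math.multiplyExact(maxHeaderBytes, (long) concurrentRequests);
    }

    /** Converts bytes to mebibytes. */
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long configured = 99_999_999L; // value from the faulty configuration
        long small = 8L * 1024L;       // assumed 8 KiB limit, for comparison only
        int concurrency = 1000;        // concurrency level used in the test

        System.out.printf("per request, configured: %.2f MiB%n", toMiB(configured));
        System.out.printf("worst case, configured:  %.2f MiB%n",
                toMiB(worstCaseBytes(configured, concurrency)));
        System.out.printf("worst case, 8 KiB limit: %.2f MiB%n",
                toMiB(worstCaseBytes(small, concurrency)));
    }
}
```

Under these assumptions the configured value needs on the order of 93 GiB at 1000 concurrent requests, far beyond any reasonable heap, while an 8 KiB limit needs under 8 MiB. That would explain why adding instances or memory did not help.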