1. At about 17:20, I received feedback from O&M that system login was abnormal. After logging in to the system, I found the error message.

2. Checked the DingTalk alarm messages: no alarm had been received, so the server load appeared normal.

3. Checked APM monitoring and found a large number of slow interface calls.

4. Checked the application logs and found a number of Redis connection timeout errors, as follows:

Logged in to the Aliyun Redis console; no slow logs were found:

5. Checked the servers: the 100 server was normal, while the CPU load on the 101 server was high. Ran the top command to find the process with high CPU usage, then used top -Hp 18883 to inspect its threads and found that GC threads were consuming the CPU: the application was using too much memory and triggering full GC too frequently.
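
As a side note, the thread IDs printed by top -Hp are decimal, while jstack reports each thread's native ID as a hexadecimal nid, so the busy thread ID has to be converted before it can be matched in the thread dump. A minimal Java sketch of that conversion (the thread ID below is a hypothetical example, not taken from this incident):

```java
public class TidToNid {
    public static void main(String[] args) {
        // Thread ID as shown in the PID column of "top -Hp <pid>" (hypothetical value).
        long busyThreadId = 18992;
        // jstack prints the same thread as "nid=0x4a30", so convert decimal -> hex to match it.
        System.out.println("nid=0x" + Long.toHexString(busyThreadId));
    }
}
```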

6. Used jmap -dump and jstack to capture a heap snapshot and a thread snapshot, then restarted the service immediately.
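
For reference, the same kind of heap dump that jmap -dump produces can also be triggered from inside the JVM through the HotSpotDiagnostic MXBean, which can help when shell access to the host is restricted. A minimal sketch, with an illustrative output path:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void dumpHeap(String outputFile, boolean liveObjectsOnly) throws Exception {
        // Proxy to the HotSpot diagnostic MXBean exposed by the running JVM.
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // Roughly equivalent to "jmap -dump:live,format=b,file=<outputFile> <pid>" when liveObjectsOnly is true.
        bean.dumpHeap(outputFile, liveObjectsOnly);
    }

    public static void main(String[] args) throws Exception {
        dumpHeap("/tmp/heap.hprof", true); // illustrative path
    }
}
```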

1. Analyzed the heap snapshot and thread snapshot and found the following suspicious information:

2. Continued analyzing the suspicious memory and located the corresponding thread:

3. Searched the logs and data around the time of the problem. The cause was roughly located: a service center had selected a two-year account period to generate payment notices, and because the project holds a large amount of data, about 200,000 (20W) billing order records were queried.

PS: Due to various security restrictions, downloading data from the production environment (especially large files) is very troublesome.

Product: Based on business requirements, the creation of the collection task was adjusted as follows: 1. Limit the data volume: if the data volume exceeds 50,000 after the task is created, the task fails (see the sketch below). 2.
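
A minimal sketch of what that data-volume check could look like at task-creation time; the mapper interface, method, and field names below are hypothetical, not the project's actual code:

```java
import java.time.LocalDate;

// Hypothetical DAO; the real project's data-access layer will differ.
interface BillOrderMapper {
    long countByPeriod(String projectId, LocalDate periodStart, LocalDate periodEnd);
}

public class CollectionTaskService {

    private static final long MAX_ORDER_COUNT = 50_000;

    private final BillOrderMapper billOrderMapper;

    public CollectionTaskService(BillOrderMapper billOrderMapper) {
        this.billOrderMapper = billOrderMapper;
    }

    public void createTask(String projectId, LocalDate periodStart, LocalDate periodEnd) {
        // Count the rows first instead of loading them; fail fast if the selected period is too large.
        long count = billOrderMapper.countByPeriod(projectId, periodStart, periodEnd);
        if (count > MAX_ORDER_COUNT) {
            throw new IllegalStateException("Task creation failed: the selected account period covers "
                    + count + " orders, exceeding the limit of " + MAX_ORDER_COUNT);
        }
        // ... create the collection task as before
    }
}
```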

Development: 1. Modify the code so that it no longer loads all data into memory at once; use paging or SQL streaming queries so that a large data volume cannot cause an OOM (see the sketch below).
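
A minimal sketch of the streaming approach using plain JDBC, assuming a MySQL driver (which only streams with a forward-only, read-only statement and fetchSize set to Integer.MIN_VALUE); the table and column names are illustrative. MyBatis cursors or LIMIT/OFFSET paging would achieve the same goal:

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PaymentNoticeExporter {

    // Streams the ~200,000 bill orders row by row instead of materializing them all in memory.
    public void export(Connection conn, String projectId) throws SQLException {
        String sql = "SELECT order_id, amount FROM bill_order WHERE project_id = ?";
        try (PreparedStatement ps = conn.prepareStatement(
                sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            ps.setFetchSize(Integer.MIN_VALUE); // hint to the MySQL driver to stream results
            ps.setString(1, projectId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Handle one row at a time, e.g. append it to the payment notice being generated.
                    writeNoticeLine(rs.getString("order_id"), rs.getBigDecimal("amount"));
                }
            }
        }
    }

    private void writeNoticeLine(String orderId, BigDecimal amount) {
        // Placeholder for the real per-row processing.
    }
}
```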