Online troubleshooting mainly covers CPU, disk, memory, and network. When something goes wrong, the basic routine is the df / free / top trio, followed by jstack and jmap.
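For reference, a minimal sketch of that basic routine (plain Linux commands, nothing project-specific):

```
df -h      # disk usage per filesystem, human-readable
free -m    # memory and swap usage in MB
top        # per-process CPU/memory usage; press 1 for a per-core CPU view
```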
1. jstack
jstack is typically used to diagnose infinite loops in business logic, frequent GC, and excessive context switching.
(1) CPU usage is too high
- 1. Run `top` to view the resource usage of each process. By default, processes are sorted by CPU usage.
- 2. Run `top -H -p 2634` to find the thread with the highest CPU usage inside that process.
- 3. Run `printf '%x\n' tid` to convert the ID of the thread with the highest CPU usage to hexadecimal.
- 4. Use jstack to capture the stacks. Either locate the corresponding stack trace directly with `jstack pid | grep 'nid' -C5 --color`, or generate a thread dump file for the process.
- 5. Analyze the dump file
The thread dump generated by the jstack command contains all live threads in the JVM. To analyze a given thread, you must find its call stack.
The thread ID with high CPU usage has already been obtained from top and converted to hexadecimal. In the thread dump, every thread has an nid; find the thread whose nid matches that hexadecimal value (nid=0x246c in this example), look at its call stack (the JstackCase class in this example), and check the corresponding code for problems. A consolidated sketch of steps 1-5 follows below.
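Putting steps 1-5 together, a minimal sketch of the whole sequence (PID 2634 and nid=0x246c are the example values used above; TID 9324 is a made-up illustration that happens to be 0x246c in hex):

```
# 1. Find the Java process with the highest CPU usage (note its PID, e.g. 2634)
top

# 2. Find the busiest thread inside that process (note its TID)
top -H -p 2634

# 3. Convert the TID to hex, because jstack prints it as nid=0x...
printf '%x\n' 9324        # -> 246c

# 4/5. Dump the threads and locate that thread's call stack
jstack 2634 > jstack.log
grep 'nid=0x246c' -A 30 jstack.log    # or: jstack 2634 | grep 'nid=0x246c' -C5 --color
```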
Thread dump analysis: thread states
In a dump, threads are generally in one of the following states:
1. RUNNABLE, the thread is executing
2. BLOCKED, the thread is blocked
3. WAITING, the thread is waiting
4. TIMED_WAITING, the thread is waiting with a timeout
In general we pay the most attention to the WAITING and TIMED_WAITING parts. A BLOCKED state almost certainly indicates a problem, and too many threads in WAITING or TIMED_WAITING is also abnormal.
- Check the distribution of thread states
grep "java.lang.Thread.State" jstack.log | sort | uniq -c | sort -nr
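To go one step further than counting states, a small sketch (assuming the dump was saved as jstack.log, as in step 4 above) that prints the full stacks of the BLOCKED threads:

```
# Treat the dump as blank-line-separated records (one per thread) and
# print every record whose state line says BLOCKED
awk -v RS= '/java.lang.Thread.State: BLOCKED/' jstack.log
```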
(2) Frequent GC
To be continued
2. Troubleshooting network request delay
Troubleshooting network request problems can be divided into three parts:
1. System delay in the service interface itself. Check whether the request timeout is caused by slow response times inside the service.
2. Network delay and packet loss during request transmission.
3. Execution delay in the caller's multi-threaded client code.
2.1 System delay of the service interface
If the CPU load is too high, processing is slow. Check whether CPU usage is too high, as in the jstack section above.
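Besides watching the CPU, it helps to measure the interface's response time from the server itself (or a host next to it) so that client-side network cost is excluded. A sketch using curl's timing variables; the URL is a placeholder:

```
# Where the time goes for a single request:
#   time_connect       - TCP connection established
#   time_starttransfer - first response byte received (roughly server processing)
#   time_total         - whole request finished
curl -o /dev/null -s -w 'connect: %{time_connect}s  ttfb: %{time_starttransfer}s  total: %{time_total}s\n' \
  http://127.0.0.1:8080/your/endpoint
```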
2.2 Network delay and packet loss during request transmission
- Run the ping command to check whether the network delay is caused by the carrier's network.
- If Nginx is used for load balancing, configure the Nginx log format to check whether the delay is introduced by Nginx itself:
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                '$status $body_bytes_sent "$http_referer" '
                '"$http_user_agent" "$http_x_forwarded_for" '
                'upstream_response_time $upstream_response_time request_time $request_time';
(upstream_response_time: from the time Nginx establishes a connection to the upstream server until it receives the response data and closes that connection. request_time: from the time Nginx receives the first byte of the user's request until the response data has been sent and the connection is closed.)
Adding up the whole process, the stages are:
[1 user request] [2 establish Nginx connection] [3 send response] [4 receive response] [5 close Nginx connection]
So upstream_response_time is 2 + 3 + 4 + 5.
However, the time of [5 close Nginx connection] can be considered close to 0,
so upstream_response_time is effectively 2 + 3 + 4, while request_time is 1 + 2 + 3 + 4.
The difference between the two is the [1 user request] time. If the client's network is poor, or the transferred data itself is large,
and considering that Nginx buffers the request body first for POST requests,
all of that time is added to [1 user request].
This explains why request_time can be larger than upstream_response_time.
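Since the gap between request_time and upstream_response_time is essentially the [1 user request] portion, the access log itself can show which requests spend most of their time on the client/network side. A rough sketch, assuming the log_format above (last field = request_time value, third-from-last = upstream_response_time value) and a log at /var/log/nginx/access.log; real logs need extra handling when $upstream_response_time is "-" or contains several comma-separated values:

```
# Print upstream_response_time, request_time and the request URI ($7 with this
# format) for lines where more than 1 extra second was spent outside the upstream
awk '($NF - $(NF-2)) > 1 { print $(NF-2), $NF, $7 }' /var/log/nginx/access.log
```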