Problem symptoms

Monitoring reported a spike in CPU load on 10.0.0.1, but no process with significant CPU usage could be found.

Troubleshooting

  1. View overall CPU usage with top
$ top -c
top - 10:37:06 up 622 days, 18:13,  1 user,  load average: 55.14, 55.25, 55.94
Tasks: 247 total,   1 running, 246 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.5 sy,  0.0 ni, 99.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32948128 total,  2140896 free, 10739188 used, 20068044 buff/cache
KiB Swap: 16777212 total, 16598708 free,   178504 used. 20294412 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
17549 root      20   0 13.868g 3.532g  13812 S   1.3 11.2   4386:22 /usr/java/jdk1.8.0_121/bin/java -Djava.util.logging.config.file=/data/ifengsite/java/tomcat/conf/logging.properties -Djava.util.logging.manage+
 1261 root      20   0  560244   5436   4924 S   0.3  0.0 138:35.97 /usr/bin/python -Es /usr/sbin/tuned -l -P
17443 root      20   0       0      0      0 S   0.3  0.0   0:12.28 [kworker/1:1]
22211 root      20   0  146324   2292   1504 R   0.3  0.0   0:01.83 top -c
    1 root      20   0   41536   3252   2076 S   0.0  0.0  65:07.05 /usr/lib/systemd/systemd --switched-root --system --deserialize 21
    2 root      20   0       0      0      0 S   0.0  0.0   0:22.00 [kthreadd]
    3 root      20   0       0      0      0 S   0.0  0.0  30:23.26 [ksoftirqd/0]
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 [kworker/0:0H]
    7 root      rt   0       0      0      0 S   0.0  0.0   0:56.78 [migration/0]

The load average is extremely high (around 55), yet no process shows notable CPU utilization and the CPU is 99% idle.
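The mismatch is visible right in the top header. As a minimal sanity check, the 1-minute load average from /proc/loadavg can be compared against the core count; this is a generic sketch (it only assumes a Linux box), not part of the original investigation:

```shell
# Compare the 1-minute load average against the number of CPU cores.
# On the incident host the load was ~55 while the CPU was 99% idle,
# meaning dozens of tasks were waiting on something other than CPU.
cores=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)
echo "cores=$cores load1=$load1"
if awk -v l="$load1" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
  echo "load exceeds core count -- investigate further"
else
  echo "load is within core count"
fi
```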

  2. Check per-thread CPU usage of the suspect process (top -Hp lists threads, not child processes)
$ top -Hp <PID>
top - 10:24:24 up 622 days, 18:01,  1 user,  load average: 55.05, 55.65, 56.88
Threads: 176 total,   0 running, 176 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  0.7 sy,  0.0 ni, 98.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32948128 total,  2124500 free, 10737712 used, 20085916 buff/cache
KiB Swap: 16777212 total, 16598708 free,   178504 used. 20295740 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
17549 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   5:00.00 java
17550 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   0:01.05 java
17551 root      20   0 13.868g 3.531g  13812 S  0.0 11.2  32:20.15 java
17552 root      20   0 13.868g 3.531g  13812 S  0.0 11.2  32:20.45 java
17553 root      20   0 13.868g 3.531g  13812 S  0.0 11.2  32:22.92 java
17554 root      20   0 13.868g 3.531g  13812 S  0.0 11.2  32:21.64 java
17555 root      20   0 13.868g 3.531g  13812 S  0.0 11.2  32:20.46 java
17556 root      20   0 13.868g 3.531g  13812 S  0.0 11.2  32:19.77 java
17558 root      20   0 13.868g 3.531g  13812 S  0.0 11.2  32:20.22 java
17559 root      20   0 13.868g 3.531g  13812 S  0.0 11.2  75:17.56 java
17560 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   0:58.44 java
17561 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   1:17.08 java
17562 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   3:16.71 java
17563 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   3:16.71 java
17564 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   3:27.37 java
17565 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   3:40.50 java
17566 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   1:52.72 java
17567 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   5:00.00 java
17568 root      20   0 13.868g 3.531g  13812 S  0.0 11.2  50:56.66 java
  3. Check disk I/O with iostat
$ iostat
Linux 3.10.0-327.el7.x86_64 (cmpp_tomcatweb_pmop195v137_taiji) 	07/05/2021 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.92    0.00    0.83    0.04    0.00   95.22

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
fd0               0.00         0.00         0.00      59780          0
sda               5.70         0.17      1153.12    8907374 62047106018

Disk I/O looks normal. At this point we were stuck, with no further leads.

  4. Search the web for "high load but low CPU usage" issues; this turned up some leads
  • The load average counts tasks that are runnable (R state) plus tasks in uninterruptible sleep (D state); it is not a direct measure of CPU utilization.
  • So when the load exceeds the number of CPU cores, one of two things is happening:
    • Too many processes are waiting for CPU time
    • Too many processes are waiting for disk I/O to complete
  • Too many processes blocked on disk I/O is commonly caused by:
    1. Heavy disk read/write traffic creating long I/O queues. The iostat output above rules this out.
    2. MySQL statements without indexes, or deadlocks. There is no MySQL on this server, and I/O is low anyway.
    3. A failed external filesystem, typically a mounted NFS share. When the NFS server dies, every process that touches the mount blocks in uninterruptible sleep, so the count of D-state processes, and with it the load, climbs while the CPU stays idle.
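Since the load counts D-state tasks, listing them directly shows what the load is actually made of. A generic sketch using ps (not taken from the original writeup):

```shell
# List tasks in uninterruptible sleep (state D) -- these are usually
# blocked on disk I/O or, as in this incident, on a dead NFS server.
ps -eo state=,pid=,comm= | awk '
  $1 ~ /^D/ { print "D-state:", $2, $3; n++ }
  END { printf "%d task(s) in uninterruptible sleep\n", n }'
```

On the incident host this list would have been full of the Java threads stuck on the dead mount.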
  5. Check whether NFS is mounted
$ df -h
(hangs...)

df hangs, which indicates that something is wrong with one of the mounted filesystems.
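df hangs because a stat() on a dead NFS mount blocks indefinitely in uninterruptible sleep. One way to probe without hanging the shell is to wrap each check in timeout; the mount-point list below matches the mounts found on this host, but the technique is a generic sketch:

```shell
# Probe each mount point with a timeout so a dead NFS server cannot
# hang the check. A probe that times out points at the bad mount.
for m in /mnt/source /mnt/source2 /mnt/source3 /mnt/source4; do
  if timeout 3 stat -t "$m" > /dev/null 2>&1; then
    echo "$m: responding"
  else
    echo "$m: NOT responding"
  fi
done
```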

  6. Inspect the mounts with the `mount` command
$ mount
...
storage.staff.dev.com:/media on /mnt/source2 type nfs (rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=null,mountaddr=10.0.0.202,mountvers=3,mountport=635,mountproto=udp,local_lock=none,addr=10.0.0.202)
10.0.0.161:/data1/media on /mnt/source3 type nfs (rw,noatime,nodiratime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.0.0.161,mountvers=3,mountport=33724,mountproto=udp,local_lock=none,addr=10.0.0.161)
10.0.0.153:/data1/media on /mnt/source4 type nfs (rw,noatime,nodiratime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.0.0.153,mountvers=3,mountport=59983,mountproto=udp,local_lock=none,addr=10.0.0.153)
10.0.0.146:/data/media on /mnt/source type nfs (rw,noatime,nodiratime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.0.0.146,mountvers=3,mountport=36556,mountproto=udp,local_lock=none,addr=10.0.0.146)
...

Four NFS shares are mounted. Next, check the status of each NFS server.

  7. Check the status of each NFS server
$ showmount -e 10.0.0.202
Export list for 10.0.0.202:
/media *
$ showmount -e 10.0.0.161
Export list for 10.0.0.161:
/data1/media *
$ showmount -e 10.0.0.153
Export list for 10.0.0.153:
/data/media *
$ showmount -e 10.0.0.146
^C

The NFS server at 10.0.0.146 is faulty: showmount hangs and has to be interrupted. The first step is to unmount that share and bring the load down.
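showmount itself blocks on the dead server and must be interrupted by hand. Wrapping it in timeout makes the sweep scriptable; this is a sketch using the server IPs from the mount output, assuming showmount is installed:

```shell
# Check every NFS server's export list, but never block longer than 5s.
# timeout exits non-zero (124) when the command has to be killed.
for ip in 10.0.0.202 10.0.0.161 10.0.0.153 10.0.0.146; do
  if timeout 5 showmount -e "$ip" > /dev/null 2>&1; then
    echo "$ip: exports OK"
  else
    echo "$ip: unreachable or not exporting"
  fi
done
```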

  8. Unmount the failed NFS share
$ umount -f /mnt/source
...
umount.nfs: /mnt/source: device is busy

Even a forced umount fails, because processes are still holding the filesystem open.
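Besides killing the holders (the next step), util-linux also offers a lazy unmount, which detaches the mount point immediately and lets the kernel finish the cleanup once the last user releases it. Whether that is acceptable depends on the workload; this is an alternative sketch, not what was done here:

```shell
# Detach /mnt/source from the filesystem tree right away; the kernel
# completes the unmount once no process references it anymore.
# Requires root, like any unmount.
umount -l /mnt/source
```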

  9. Use the `fuser` command to kill the processes accessing the NFS mount
# Check which processes are accessing the mount -- this hangs as well
$ fuser -m -v /mnt/source
^C
# Kill every process accessing the mount
$ fuser -k -v /mnt/source
# Now retry the forced unmount
$ umount -f /mnt/source
  10. Verify the mounts and the CPU load
$ df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/sda6                      10G   71M   10G   1% /
devtmpfs                       16G     0   16G   0% /dev
tmpfs                          16G     0   16G   0% /dev/shm
tmpfs                          16G  1.6G   15G  10% /run
tmpfs                          16G     0   16G   0% /sys/fs/cgroup
/dev/sda2                      20G  3.2G   17G  16% /usr
/dev/sda8                      10G   33M   10G   1% /tmp
/dev/sda7                      10G   33M   10G   1% /home
/dev/sda9                     938G  411G  527G  44% /data
/dev/sda1                     497M  108M  390M  22% /boot
/dev/sda3                      20G  537M   20G   3% /var
storage.staff.dev.com:/media   45T   13T   33T  29% /mnt/source2
10.0.0.161:/data1/media        23T 1013G   22T   5% /mnt/source3
tmpfs                         3.2G     0  3.2G   0% /run/user/1004
tmpfs                         3.2G     0  3.2G   0% /run/user/1005
tmpfs                         3.2G     0  3.2G   0% /run/user/10005
10.0.0.153:/data/media         11T 1007G  9.5T  10% /mnt/source4
$ uptime
11:07:11 up 622 days, 18:44,  2 users,  load average: 5.54, 33.51, 47.06

df -h now returns instantly, and the load average is falling back toward the normal range (the 1-minute average is already down to 5.54).

Summary

  1. Root cause: a colleague changed the NFS server's IP address, and this client was missed when updating mounts before the change.
  2. Monitoring gap: the server had no alert configured for NFS availability, so the dead mount went unnoticed until the load alarm fired.
  3. Knowledge gap: the load average was initially misread as CPU usage; it actually counts runnable plus uninterruptible (D-state) tasks, which is why the load spiked while the CPU stayed idle.
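As a follow-up to point 2, a minimal cron-able health check could look like the sketch below. The alert channel (logger here) and the 5-second threshold are assumptions for illustration, not part of the original setup:

```shell
#!/bin/sh
# Walk all NFS mounts listed in /proc/mounts and flag any that stop
# responding within 5 seconds. Intended to run from cron every minute.
for m in $(awk '$3 == "nfs" || $3 == "nfs4" { print $2 }' /proc/mounts); do
  if ! timeout 5 stat -t "$m" > /dev/null 2>&1; then
    # Replace logger with your real alerting hook (mail, webhook, ...).
    logger -p daemon.err "NFS health check: $m is not responding"
  fi
done
```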