Problem symptoms

Monitoring reported a spike in CPU load on 10.0.0.1, but no process with significant CPU usage could be found.

Troubleshooting

  1. View overall CPU usage with top
$ top -c
top - 10:37:06 up 622 days, 18:13,  1 user,  load average: 55.14, 55.25, 55.94
Tasks: 247 total,   1 running, 246 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.5 sy,  0.0 ni, 99.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32948128 total,  2140896 free, 10739188 used, 20068044 buff/cache
KiB Swap: 16777212 total, 16598708 free,   178504 used. 20294412 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
17549 root      20   0 13.868g 3.532g  13812 S   1.3 11.2   4386:22 /usr/java/jdk1.8.0_121/bin/java -Djava.util.logging.config.file=/data/ifengsite/java/tomcat/conf/logging.properties -Djava.util.logging.manage+
 1261 root      20   0  560244   5436   4924 S   0.3  0.0 138:35.97 /usr/bin/python -Es /usr/sbin/tuned -l -P
17443 root      20   0       0      0      0 S   0.3  0.0   0:12.28 [kworker/1:1]
22211 root      20   0  146324   2292   1504 R   0.3  0.0   0:01.83 top -c
    1 root      20   0   41536   3252   2076 S   0.0  0.0  65:07.05 /usr/lib/systemd/systemd --switched-root --system --deserialize 21
    2 root      20   0       0      0      0 S   0.0  0.0   0:22.00 [kthreadd]
    3 root      20   0       0      0      0 S   0.0  0.0  30:23.26 [ksoftirqd/0]
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 [kworker/0:0H]
    7 root      rt   0       0      0      0 S   0.0  0.0   0:56.78 [migration/0]

The load average is extremely high (around 55), yet no process shows notable CPU utilization and the CPU is 99% idle.
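The mismatch is visible right in the top header. As a minimal sanity check, the 1-minute load average from /proc/loadavg can be compared against the core count; this is a generic sketch (it only assumes a Linux box), not part of the original investigation:

```shell
# Compare the 1-minute load average against the number of CPU cores.
# On the incident host the load was ~55 while the CPU was 99% idle,
# meaning dozens of tasks were waiting on something other than CPU.
cores=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)
echo "cores=$cores load1=$load1"
if awk -v l="$load1" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
  echo "load exceeds core count -- investigate further"
else
  echo "load is within core count"
fi
```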

  2. Check per-thread CPU usage of the suspect process (top -Hp lists threads, not child processes)
$ top -Hp <PID>
top - 10:24:24 up 622 days, 18:01,  1 user,  load average: 55.05, 55.65, 56.88
Threads: 176 total,   0 running, 176 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  0.7 sy,  0.0 ni, 98.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32948128 total,  2124500 free, 10737712 used, 20085916 buff/cache
KiB Swap: 16777212 total, 16598708 free,   178504 used. 20295740 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
17549 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   5:00.00 java
17550 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   0:01.05 java
17551 root      20   0 13.868g 3.531g  13812 S  0.0 11.2  32:20.15 java
17552 root      20   0 13.868g 3.531g  13812 S  0.0 11.2  32:20.45 java
17553 root      20   0 13.868g 3.531g  13812 S  0.0 11.2  32:22.92 java
17554 root      20   0 13.868g 3.531g  13812 S  0.0 11.2  32:21.64 java
17555 root      20   0 13.868g 3.531g  13812 S  0.0 11.2  32:20.46 java
17556 root      20   0 13.868g 3.531g  13812 S  0.0 11.2  32:19.77 java
17558 root      20   0 13.868g 3.531g  13812 S  0.0 11.2  32:20.22 java
17559 root      20   0 13.868g 3.531g  13812 S  0.0 11.2  75:17.56 java
17560 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   0:58.44 java
17561 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   1:17.08 java
17562 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   3:16.71 java
17563 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   3:16.71 java
17564 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   3:27.37 java
17565 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   3:40.50 java
17566 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   1:52.72 java
17567 root      20   0 13.868g 3.531g  13812 S  0.0 11.2   5:00.00 java
17568 root      20   0 13.868g 3.531g  13812 S  0.0 11.2  50:56.66 java
  3. Check disk I/O with iostat
$ iostat
Linux 3.10.0-327.el7.x86_64 (cmpp_tomcatweb_pmop195v137_taiji) 	07/05/2021 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.92    0.00    0.83    0.04    0.00   95.22

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
fd0               0.00         0.00         0.00      59780          0
sda               5.70         0.17      1153.12    8907374 62047106018

Disk I/O looks normal. At this point we were stuck, with no further leads.

  4. Search the web for "high load but low CPU usage" issues; this turned up some leads
  • The load average counts tasks that are runnable (R state) plus tasks in uninterruptible sleep (D state); it is not a direct measure of CPU utilization.
  • So when the load exceeds the number of CPU cores, one of two things is happening:
    • Too many processes are waiting for CPU time
    • Too many processes are waiting for disk I/O to complete
  • Too many processes blocked on disk I/O is commonly caused by:
    1. Heavy disk read/write traffic creating long I/O queues. The iostat output above rules this out.
    2. MySQL statements without indexes, or deadlocks. There is no MySQL on this server, and I/O is low anyway.
    3. A failed external filesystem, typically a mounted NFS share. When the NFS server dies, every process that touches the mount blocks in uninterruptible sleep, so the count of D-state processes, and with it the load, climbs while the CPU stays idle.
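Since the load counts D-state tasks, listing them directly shows what the load is actually made of. A generic sketch using ps (not taken from the original writeup):

```shell
# List tasks in uninterruptible sleep (state D) -- these are usually
# blocked on disk I/O or, as in this incident, on a dead NFS server.
ps -eo state=,pid=,comm= | awk '
  $1 ~ /^D/ { print "D-state:", $2, $3; n++ }
  END { printf "%d task(s) in uninterruptible sleep\n", n }'
```

On the incident host this list would have been full of the Java threads stuck on the dead mount.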
  5. Check whether NFS is mounted
$ df -h
(hangs...)

df hangs, which indicates that something is wrong with one of the mounted filesystems.
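df hangs because a stat() on a dead NFS mount blocks indefinitely in uninterruptible sleep. One way to probe without hanging the shell is to wrap each check in timeout; the mount-point list below matches the mounts found on this host, but the technique is a generic sketch:

```shell
# Probe each mount point with a timeout so a dead NFS server cannot
# hang the check. A probe that times out points at the bad mount.
for m in /mnt/source /mnt/source2 /mnt/source3 /mnt/source4; do
  if timeout 3 stat -t "$m" > /dev/null 2>&1; then
    echo "$m: responding"
  else
    echo "$m: NOT responding"
  fi
done
```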

  6. Inspect the mounts with the `mount` command
$ mount
...
storage.staff.dev.com:/media on /mnt/source2 type nfs (rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=null,mountaddr=10.0.0.202,mountvers=3,mountport=635,mountproto=udp,local_lock=none,addr=10.0.0.202)
10.0.0.161:/data1/media on /mnt/source3 type nfs (rw,noatime,nodiratime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.0.0.161,mountvers=3,mountport=33724,mountproto=udp,local_lock=none,addr=10.0.0.161)
10.0.0.153:/data1/media on /mnt/source4 type nfs (rw,noatime,nodiratime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.0.0.153,mountvers=3,mountport=59983,mountproto=udp,local_lock=none,addr=10.0.0.153)
10.0.0.146:/data/media on /mnt/source type nfs (rw,noatime,nodiratime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.0.0.146,mountvers=3,mountport=36556,mountproto=udp,local_lock=none,addr=10.0.0.146)
...

Four NFS shares are mounted. Next, check the status of each NFS server.

  7. Check the status of each NFS server
$ showmount -e 10.0.0.202
Export list for 10.0.0.202:
/media *
$ showmount -e 10.0.0.161
Export list for 10.0.0.161:
/data1/media *
$ showmount -e 10.0.0.153
Export list for 10.0.0.153:
/data/media *
$ showmount -e 10.0.0.146
^C

The NFS server at 10.0.0.146 is faulty: showmount hangs and has to be interrupted. The first step is to unmount that share and bring the load down.
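showmount itself blocks on the dead server and must be interrupted by hand. Wrapping it in timeout makes the sweep scriptable; this is a sketch using the server IPs from the mount output, assuming showmount is installed:

```shell
# Check every NFS server's export list, but never block longer than 5s.
# timeout exits non-zero (124) when the command has to be killed.
for ip in 10.0.0.202 10.0.0.161 10.0.0.153 10.0.0.146; do
  if timeout 5 showmount -e "$ip" > /dev/null 2>&1; then
    echo "$ip: exports OK"
  else
    echo "$ip: unreachable or not exporting"
  fi
done
```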

  8. Unmount the failed NFS share
$ umount -f /mnt/source
...
umount.nfs: /mnt/source: device is busy

Even a forced umount fails, because processes are still holding the filesystem open.
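Besides killing the holders (the next step), util-linux also offers a lazy unmount, which detaches the mount point immediately and lets the kernel finish the cleanup once the last user releases it. Whether that is acceptable depends on the workload; this is an alternative sketch, not what was done here:

```shell
# Detach /mnt/source from the filesystem tree right away; the kernel
# completes the unmount once no process references it anymore.
# Requires root, like any unmount.
umount -l /mnt/source
```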

  9. Use the `fuser` command to kill the processes accessing the NFS mount
# Check which processes are accessing the mount -- this hangs as well
$ fuser -m -v /mnt/source
^C
# Kill every process accessing the mount
$ fuser -k -v /mnt/source
# Now retry the forced unmount
$ umount -f /mnt/source
  10. Verify the mounts and the CPU load
$ df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/sda6                      10G   71M   10G   1% /
devtmpfs                       16G     0   16G   0% /dev
tmpfs                          16G     0   16G   0% /dev/shm
tmpfs                          16G  1.6G   15G  10% /run
tmpfs                          16G     0   16G   0% /sys/fs/cgroup
/dev/sda2                      20G  3.2G   17G  16% /usr
/dev/sda8                      10G   33M   10G   1% /tmp
/dev/sda7                      10G   33M   10G   1% /home
/dev/sda9                     938G  411G  527G  44% /data
/dev/sda1                     497M  108M  390M  22% /boot
/dev/sda3                      20G  537M   20G   3% /var
storage.staff.dev.com:/media   45T   13T   33T  29% /mnt/source2
10.0.0.161:/data1/media        23T 1013G   22T   5% /mnt/source3
tmpfs                         3.2G     0  3.2G   0% /run/user/1004
tmpfs                         3.2G     0  3.2G   0% /run/user/1005
tmpfs                         3.2G     0  3.2G   0% /run/user/10005
10.0.0.153:/data/media         11T 1007G  9.5T  10% /mnt/source4
$ uptime
11:07:11 up 622 days, 18:44,  2 users,  load average: 5.54, 33.51, 47.06

df -h now returns instantly, and the load average is falling back toward the normal range (the 1-minute average is already down to 5.54).

Summary

  1. Root cause: a colleague changed the NFS server's IP address, and this client was missed when updating mounts before the change.
  2. Monitoring gap: the server had no alert configured for NFS availability, so the dead mount went unnoticed until the load alarm fired.
  3. Knowledge gap: the load average was initially misread as CPU usage; it actually counts runnable plus uninterruptible (D-state) tasks, which is why the load spiked while the CPU stayed idle.
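As a follow-up to point 2, a minimal cron-able health check could look like the sketch below. The alert channel (logger here) and the 5-second threshold are assumptions for illustration, not part of the original setup:

```shell
#!/bin/sh
# Walk all NFS mounts listed in /proc/mounts and flag any that stop
# responding within 5 seconds. Intended to run from cron every minute.
for m in $(awk '$3 == "nfs" || $3 == "nfs4" { print $2 }' /proc/mounts); do
  if ! timeout 5 stat -t "$m" > /dev/null 2>&1; then
    # Replace logger with your real alerting hook (mail, webhook, ...).
    logger -p daemon.err "NFS health check: $m is not responding"
  fi
done
```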