Performance-based understanding of Linux system load average and CPU usage


Preface

As engineers, whenever we find a computer slowing down, our standard reflex is to run uptime or top to see how loaded the system is.

For example, if I type uptime on the command line, the system returns a line of information.

Appletekimbp:~ apple$ uptime
20:44  up 21 days,  6:41, 2 users, load averages: 2.85 2.33 2.91

But do you know what each part of this output means?

20:44                            # Current time
up 21 days,  6:41                # How long the system has been running
2 users                          # Number of logged-in users
load averages: 2.85 2.33 2.91    # Average system load over the last 1, 5, and 15 minutes

The last part of this line reads “load averages”, meaning the average load of the system. It contains three numbers, from which we can judge whether the system load is high or low.
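As a rough illustration, the three numbers can be pulled out of an uptime line programmatically. The sketch below is only one way to do it: the helper names parse_load_averages and current_load_averages are made up for this example, and the regular expression assumes the two common output styles ("load averages: a b c" and "load average: a, b, c") with a dot as the decimal separator.

```python
import os
import re

# Matches both "load averages: 2.85 2.33 2.91" (as in the output above)
# and "load average: 0.52, 0.58, 0.59" (comma-separated style).
_LOAD_RE = re.compile(
    r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)"
)


def parse_load_averages(uptime_line):
    """Return the 1-, 5-, and 15-minute load averages as floats."""
    match = _LOAD_RE.search(uptime_line)
    if match is None:
        raise ValueError("no load averages found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())


def current_load_averages():
    """Ask the OS directly instead of parsing text (Unix-like systems only)."""
    return os.getloadavg()
```

When a script only needs the numbers, os.getloadavg() is usually simpler than parsing text, since it does not depend on how uptime formats its output.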

What is the system load average?

I guess some of you will say: isn’t the load average just CPU usage per unit of time? By that reading, the 2.85 above would mean a CPU usage of 285%. That’s not true.

On Linux, the load value represents the average number of jobs that are running, runnable (ready to execute and waiting for CPU time), or, crucially, sleeping but uninterruptible (the uninterruptible sleep state). That is, the calculation counts processes that are running or waiting for CPU time, plus those in uninterruptible sleep. It does not count processes in normal (interruptible) sleep, zombies, or stopped processes.

Simply put, the load average is the average number of processes per unit of time that are in the runnable or uninterruptible state. In other words, it is the average number of active processes, and it is not directly related to CPU usage.

Process state codes

R  running or runnable (on the run queue)
D  uninterruptible sleep (usually I/O)
S  interruptible sleep (waiting for an event to complete)
Z  defunct (“zombie”) process, terminated but not reaped by its parent
T  stopped, either by a job control signal or because it is being traced
[…]

So let’s briefly explain the runnable state and the uninterruptible state.

A runnable process is a process that is either using the CPU or waiting for the CPU. This is what the ps command shows as the R state (Running or Runnable).

An uninterruptible process is one that is in a critical kernel-mode section that cannot be interrupted, such as waiting for an I/O response from a hardware device. This is the D state (Uninterruptible Sleep, also known as Disk Sleep) that we see in the ps command. For example, when a process reads or writes data on a disk, it cannot be interrupted by other processes or by signals until the disk responds, so that the data stays consistent. If the process were interrupted at that point, the data on disk could end up inconsistent with the data held by the process. The uninterruptible state is therefore a mechanism by which the system protects processes and hardware devices.

So we can simply think of the load average as the average number of active processes, which intuitively is the number of active processes per unit of time. Ideally, there is exactly one active process per CPU, so that each CPU is fully utilized and nothing has to wait.
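To make the idea of “active processes” concrete, the following sketch counts how many processes are in the R or D state at one instant. It assumes a Linux /proc file system in which each /proc/<pid>/stat file starts with "pid (comm) state"; the function names are invented for this example. It is only a snapshot of the main thread of each process, not the averaged figure that the kernel reports.

```python
import os
from collections import Counter


def parse_state(stat_text):
    """Extract the one-letter state from the text of /proc/<pid>/stat.

    The layout is "pid (comm) state ...". The command name may itself
    contain spaces or parentheses, so we cut after the LAST ')'.
    """
    after_comm = stat_text[stat_text.rindex(")") + 1:]
    return after_comm.split()[0]


def count_active(states):
    """Count the states that contribute to the load: R plus D."""
    counts = Counter(states)
    return counts["R"] + counts["D"]


def read_states(proc_root="/proc"):
    """Yield the state of every process listed under /proc (Linux only)."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                yield parse_state(handle.read())
        except OSError:
            continue  # the process exited while we were looking
```

On a Linux machine, count_active(read_states()) gives the instantaneous number of active processes; sampling it repeatedly and averaging would approximate what the load average expresses.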

Here is what different load values mean on a computer with a single-core processor:

0.00: No jobs are running or waiting for the CPU, that is, the CPU is completely idle. If a program (process) needs to perform a task, it requests the CPU from the operating system, which allocates CPU time to it immediately because no other process is competing for it.
0.50: No jobs are waiting, but the CPU is processing earlier jobs at 50% of its capacity. The operating system can still allocate CPU time to other processes immediately, without making them wait.
1.00: No jobs are queued, but the CPU is processing earlier jobs at 100% of its capacity. If a new process requests CPU time, it must wait until another job completes or until the current CPU time slice (for example, a CPU tick) expires and the operating system decides, according to priority, which process runs next.
1.50: The CPU is working at 100% of its capacity, and 5 out of every 15 jobs requesting CPU time, or 33.33%, must wait in line for the others to use up their allotted time. Once the threshold of 1.0 is exceeded, the system can therefore be said to be overloaded, because it cannot immediately handle 100% of the requested work.
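The arithmetic behind these four cases can be written down in a few lines. This is a simplified model for a single core, matching the description above rather than how the kernel actually computes anything; the function name single_core_breakdown is hypothetical.

```python
def single_core_breakdown(load):
    """Split a single-core load value into three figures.

    Returns (utilization, waiting, waiting_share):
      utilization   - fraction of the CPU's capacity in use (at most 1.0)
      waiting       - average number of jobs queued behind the running one
      waiting_share - fraction of all active jobs that have to wait
    """
    utilization = min(load, 1.0)
    waiting = max(load - 1.0, 0.0)
    waiting_share = waiting / load if load > 0 else 0.0
    return utilization, waiting, waiting_share
```

For a load of 1.5 the model gives a fully used CPU, 0.5 jobs waiting on average, and a waiting share of one third, which is the 33.33% mentioned in the last case.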

Obviously, the lower the “load average” value, such as 0.2 or 0.3, the less work the server is doing and the lower the system load.

An analogy

Was that too much to take in? Okay, let’s look at a simple analogy.

In the simplest case, your computer has only one CPU, and all the computations must be done by that CPU.

So, let’s think of this CPU as a bridge with only one lane, which all vehicles must use. (Obviously, the bridge is one-way only.)

A system load of 0 means there are no cars on the bridge.

A system load of 0.5 means half of the bridge has cars on it.

A system load of 1.0 means that every section of the bridge has cars on it, which means the bridge is “full”. It must be noted, however, that up to this point traffic on the bridge still flows freely.

A system load of 1.7 means there are too many vehicles: the bridge is full (100%), and the vehicles waiting to get onto the bridge amount to 70% of those on the deck. By the same reasoning, a system load of 2.0 means there are as many vehicles waiting to get onto the bridge as there are on the deck, and a system load of 3.0 means there are twice as many waiting as on the deck. In general, when the system load is greater than 1, the vehicles behind must wait; the greater the system load, the longer they wait to cross the bridge.

The system load on the CPU works essentially the same way. The capacity of the bridge is the maximum workload of the CPU; the vehicles on the bridge are the processes waiting for the CPU to process them.

Suppose the CPU can process at most 100 processes per minute. A system load of 0.2 means the CPU is processing only 20 processes per minute; a system load of 1.0 means the CPU is processing exactly 100 processes per minute; a system load of 1.7 means that, in addition to the 100 processes being processed by the CPU, there are 70 processes queuing up for it.

For the computer to run smoothly, it is best to keep the system load below 1.0, so that no process has to wait and every process is handled right away. Clearly, 1.0 is a critical value: beyond it, the system is no longer at its best and you need to intervene.

Multiprocessor and multi-core systems

On a system with multiple processors or cores (multiple logical CPUs), the meaning of the load value depends on the number of processors present. A computer with four processors is not 100% utilized until the load reaches 4.00, so the first thing to do when interpreting the three load values reported by commands such as top, htop, or uptime is to divide them by the number of logical CPUs in the system, and draw conclusions from the result.

For example, what happens if your computer has two CPUs? Two CPUs mean twice the processing power and twice the number of processes that can be handled simultaneously. To use the bridge analogy again, two CPUs mean two lanes of traffic on the bridge, doubling its capacity.

So, with two CPUs the system load can reach 2.0, at which point each CPU is working 100% of the time. By extension, the maximum acceptable system load for a computer with n CPUs is n.0.

Chip manufacturers tend to put multiple CPU cores inside a single CPU, which is called a multi-core CPU.

In terms of system load, a multi-core CPU behaves much like multiple CPUs, so when assessing system load you must consider how many CPUs the computer has and how many cores each CPU has. Divide the system load by the total number of cores: as long as the load per core does not exceed 1.0, the computer is operating normally. How do you know how many CPU cores a computer has?
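One way to get the number of logical CPUs and apply the division described above is sketched here. It relies on os.cpu_count() and os.getloadavg() from the Python standard library; the helper names per_core_load and current_per_core_load are made up for this example, and os.getloadavg() is only available on Unix-like systems.

```python
import os


def per_core_load(load, cores):
    """Normalize a load value by the number of logical CPUs."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


def current_per_core_load():
    """Return the 1-, 5-, and 15-minute load averages divided by the core count."""
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return tuple(per_core_load(value, cores) for value in os.getloadavg())
```

With the uptime figures from the preface, per_core_load(2.85, 4) is about 0.71 on a four-core machine, which is below the 1.0 per-core limit even though the raw value looks high.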

Further reading: CPU, physical, and logical core concepts and relationships based on performance

CPU utilization

If we look at the different processes that pass through the CPU in a given time interval, the utilization percentage represents the fraction of that interval the CPU spends executing the instructions of each process. But this calculation considers only processes that are actually running, not those that are waiting, whether they are in the queue (runnable) or sleeping uninterruptibly (for example, waiting for an input/output operation to finish).

Thus, this metric tells us which processes are using the CPU the most, but it does not give a true picture of whether the system as a whole is overloaded or underutilized.

In practice, it’s easy to confuse load average with CPU usage. The load average is the number of runnable and uninterruptible processes per unit of time, so it includes not only processes that are using the CPU, but also processes waiting for the CPU and processes waiting for I/O. CPU usage, as explained above, measures how busy the CPU is per unit of time, and does not necessarily track the load average. For example:

CPU-intensive processes: heavy CPU use raises the load average, so the two are consistent.
I/O-intensive processes: waiting for I/O also raises the load average, but CPU usage is not necessarily high.
A large number of processes waiting to be scheduled on the CPU: this also raises the load average, and CPU usage is high as well.
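The difference between the two metrics shows up when CPU usage is computed by hand. The sketch below assumes the Linux /proc/stat format, whose first line lists cumulative counters in the order user, nice, system, idle, iowait, irq, softirq, steal; the function names are invented for this example. Time spent in idle and iowait is treated as “not busy”, which is why an I/O-bound machine can show a high load average and a low CPU usage at the same time.

```python
def parse_cpu_line(line):
    """Parse the aggregate "cpu ..." line of /proc/stat into integer counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Keep user, nice, system, idle, iowait, irq, softirq, steal.
    # The later guest counters are already included in user and nice.
    return [int(value) for value in fields[1:9]]


def cpu_usage(before, after):
    """Return the busy fraction of CPU time between two samples (0.0 to 1.0)."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    not_busy = deltas[3] + deltas[4]  # idle + iowait
    return (total - not_busy) / total
```

In practice the two samples would come from reading the first line of /proc/stat twice, a second or so apart, and passing both parsed lists to cpu_usage.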

Note the input/output (I/O) operations

This article has repeatedly emphasized uninterruptible sleep (the D state in the process state list above), because sometimes you will find very high load values on a computer while the running processes show relatively low CPU utilization. If you don’t take this state into account, the situation is confusing and you won’t know how to deal with it. A process enters this state when it is waiting for a resource to be released and its execution cannot be interrupted, such as when it is waiting for an uninterruptible I/O operation (not all I/O operations are uninterruptible). Typically, this is caused by a disk failure, a network file system failure (such as an NFS failure), or heavy use of a very slow device (such as a USB 1.0 pen drive).

In this case, we have to turn to other tools, such as iostat or iotop, which show where the I/O is happening and which processes are doing the most of it, so that we can kill them or give them a lower priority (with the nice command) and more CPU time can go to other, more critical processes.

Some tips

Overloading the system and exceeding a load value of 1.0 is sometimes not a problem: even with some delay, the CPU will work through the jobs in the queue and the load will drop below 1.0 again. But if the load stays above 1, the system cannot absorb all the work it is given, so response times increase and the system becomes slow and unresponsive. Sustained values above 1, especially in the 5- and 15-minute load averages, are a clear sign that we need to upgrade the computer’s hardware, reduce resource consumption by limiting what users can do on the system, or spread the load across several similar nodes.

We therefore suggest the following rules of thumb:

>= 0.70: Nothing is wrong yet, but the CPU load needs monitoring. If it stays at this level for a while, investigate before things get any worse.
>= 1.00: There is a problem. You must find it and fix it, or larger spikes in system load will make your applications slow or unresponsive.
>= 3.00: Your system becomes very slow. It is even difficult to work on the command line to find the cause of the problem, so fixing it takes longer than it would have earlier. You run the risk that the system becomes even more saturated and eventually crashes.
>= 5.00: You may not be able to recover the system. You can wait for a miracle to bring the load down on its own or, if you know what is going on and can afford it, run kill -9 <pid> from the console and pray that it gets to run at some point, lightening the load so that you regain control of the system. Otherwise, you will have no choice but to restart the computer.
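These rules of thumb translate directly into a small check that a monitoring script could run. The sketch assumes the load value has already been divided by the number of cores (or comes from a single-core machine); the function name assess_load and the returned labels are invented for this example.

```python
def assess_load(load):
    """Map a per-core load value to one of the rule-of-thumb levels."""
    if load >= 5.00:
        return "may be unrecoverable"
    if load >= 3.00:
        return "very slow"
    if load >= 1.00:
        return "problem"
    if load >= 0.70:
        return "monitor"
    return "ok"
```

Because short spikes above 1.0 are often harmless, it makes more sense to feed this check the 5- or 15-minute average than the 1-minute one.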

References:

[1] : www.ruanyifeng.com/blog/2011/0…
