Brief introduction: During Double Eleven stress testing, one of the most common problems is a load surge. When that happens, the business is usually affected: service RT spikes, machines cannot be logged into, commands hang when run on the machine, and so on. This article looks at what load is, how load is calculated, and under what circumstances load surges.

Author: Jiang Chong | Source: Alibaba Technology official account

During Double Eleven stress testing, one of the common problems is high load. The business is usually affected when this happens: for example, service RT rises, machines cannot be logged into, or commands hang on the machine. So what is load? How is load calculated? And under what circumstances does load surge?

What is load

Load is short for Linux system load averages. Note the two keywords: “load,” which measures the demand a task (the Linux kernel’s term for a process or thread) places on system resources such as CPU, memory, and I/O; and “average,” which means the value is averaged over time, with 1-, 5-, and 15-minute averages. The system load average is calculated by the kernel and recorded in /proc/loadavg, which is then read by user-space tools such as uptime and top.
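For reference, /proc/loadavg can be read directly; here is a minimal C sketch that prints the three averages (error handling kept to a minimum):

#include <stdio.h>

int main(void)
{
        double load1, load5, load15;
        FILE *fp = fopen("/proc/loadavg", "r");

        if (!fp)
                return 1;
        /* The first three fields are the 1-, 5- and 15-minute load averages. */
        if (fscanf(fp, "%lf %lf %lf", &load1, &load5, &load15) == 3)
                printf("load average: %.2f, %.2f, %.2f\n", load1, load5, load15);
        fclose(fp);
        return 0;
}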

The general rules of thumb are:

  • If load is close to zero, it means the system is idle
  • If the 1min average is higher than the 5min or 15min average, the load is increasing
  • If the 1min average is lower than the 5min or 15min average, the load is decreasing
  • If the load averages are higher than the number of CPUs on the system, the system is likely experiencing performance problems (depending on the situation)

How to calculate load

1 Core Algorithm

In plain terms, the core algorithm is the exponentially weighted moving average (EWMA), which can be expressed simply as:

a1 = a0 * factor + a * (1 - factor)

where a0 is the value at the previous moment, a1 is the value at the current moment, factor is a coefficient in the range [0, 1], and a is the sampled value of the metric at the current moment.
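For intuition, here is a one-function floating-point sketch of this update rule (illustrative only; as shown later, the kernel uses fixed-point arithmetic instead):

/* One EWMA step: blend the previous average with the new sample.
 * factor is in [0, 1]; the larger it is, the more weight the history keeps. */
double ewma_step(double prev, double sample, double factor)
{
        return prev * factor + sample * (1.0 - factor);
}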

Why use an exponentially weighted moving average? My personal understanding:

1. In an exponentially weighted moving average, the weight of each sample decays exponentially over time: the closer a sample is to the present, the larger its weight, so the average better reflects the recent trend;

2. The kernel does not have to store all of the historical samples; keeping only the previous average is enough.

Let’s take a look at how the kernel calculates the load average.

Applying the exponential moving average formula above, a1 = a0 * e + a * (1 - e), where a0 is the load at the previous moment, a1 is the load at the current moment, e is a constant coefficient, and a is the number of active processes/threads at the current moment.

As mentioned in the previous section, the Linux kernel calculates three load averages: 1-minute, 5-minute, and 15-minute. Three different constant coefficients e are used, one per window, defined as follows:

#define EXP_1 1884       /* 1/exp(5sec/1min) */
#define EXP_5 2014      /* 1/exp(5sec/5min) */
#define EXP_15 2037   /* 1/exp(5sec/15min) */

Where do these three coefficients come from? The formula is as follows:

  • 1884 = 2048 / power(e, 5/(60*1))
  • 2014 = 2048 / power(e, 5/(60*5))
  • 2037 = 2048 / power(e, 5/(60*15))

Here e = 2.71828 is the natural constant, also known as Euler’s number.

Why this particular formula? The 5 is the sampling interval of five seconds, the 60 is 60 seconds per minute, and the 1, 5, and 15 are the one-, five-, and fifteen-minute windows. As for why 2048 and the natural constant e appear, that involves fixed-point arithmetic and some other math that is not the focus here, so we will skip it for now.
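As a quick sanity check, the constants can be recomputed in user space with floating-point math (my own sketch, not kernel code):

#include <math.h>
#include <stdio.h>

int main(void)
{
        /* FIXED_1 = 2048 is the kernel's fixed-point scaling factor (1.0 == 2048). */
        const double fixed_1 = 2048.0;
        const int windows_min[] = { 1, 5, 15 };   /* averaging windows in minutes */

        for (int i = 0; i < 3; i++) {
                /* coefficient = 2048 / e^(5s / window), with one sample every 5 seconds */
                double coeff = fixed_1 / exp(5.0 / (60.0 * windows_min[i]));
                printf("%2d min -> %.0f\n", windows_min[i], coeff);
        }
        return 0;   /* prints roughly 1884, 2014 and 2037 */
}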

Let’s look at the actual code in the kernel:

/*
 * a1 = a0 * e + a * (1 - e)
 */     
static inline unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{       
        unsigned long newload;
        // FIXED_1 = 2048
        newload = load * exp + active * (FIXED_1 - exp);
        if (active >= load)
                newload += FIXED_1-1;   // round up rather than truncate when load is not decreasing

        return newload / FIXED_1;
}

It’s a pretty straightforward implementation. In the code above, the first parameter is the load at the previous moment, the second parameter is the constant coefficient, and the third parameter is the number of active processes/threads (including runnable and uninterruptible).
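To get a feel for the fixed-point arithmetic, here is a small user-space harness (my own sketch, not kernel code) that drives a copy of calc_load with the 1-minute coefficient and a constant number of active tasks:

#include <stdio.h>

#define FSHIFT   11
#define FIXED_1  (1 << FSHIFT)   /* 2048: fixed-point representation of 1.0 */
#define EXP_1    1884            /* 1-minute coefficient */

/* same as the kernel function shown above */
static unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
        unsigned long newload = load * exp + active * (FIXED_1 - exp);
        if (active >= load)
                newload += FIXED_1 - 1;
        return newload / FIXED_1;
}

int main(void)
{
        unsigned long load = 0;                  /* start from an idle system */
        unsigned long active = 4 * FIXED_1;      /* 4 active tasks, in fixed point */

        /* one iteration per 5-second sample; 12 samples = 1 minute, 60 = 5 minutes */
        for (int tick = 1; tick <= 60; tick++) {
                load = calc_load(load, EXP_1, active);
                if (tick % 12 == 0)
                        printf("after %3d s: load1 = %lu.%02lu\n",
                               tick * 5, load >> FSHIFT,
                               ((load & (FIXED_1 - 1)) * 100) >> FSHIFT);
        }
        return 0;   /* the 1-minute load climbs toward 4.00 but never quite reaches it */
}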

2 Calculation Process

The calculation of load is divided into two steps:

1. Each CPU periodically folds the number of active tasks in its runqueue (rq), including runnable and uninterruptible tasks, into a global variable called calc_load_tasks.

2. The load is then calculated periodically based on calc_load_tasks.

In the first step, every CPU updates calc_load_tasks, but the second step is performed by only one CPU (the one designated by tick_do_timer_cpu), which runs do_timer() -> calc_global_load() to compute the system load.

The overall process is shown in the figure below. Each time a tick arrives (i.e., on each clock interrupt), the following logic is executed:

In the figure above, the brown calc_global_load_tick function completes the first step, the green calc_global_load completes the second step, and the blue calc_load is the core algorithm described in the previous section.
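As a rough illustration of that flow, here is a compact user-space sketch (my own simplification, not the kernel source: locking, NO_HZ idle handling, and the LOAD_FREQ sampling window are all omitted):

#define FIXED_1  2048
#define EXP_1    1884
#define EXP_5    2014
#define EXP_15   2037

/* calc_load() is the kernel function shown in the previous section */
unsigned long calc_load(unsigned long load, unsigned long exp, unsigned long active);

static long calc_load_tasks;       /* global counter: written in step 1, read in step 2 */
static unsigned long avenrun[3];   /* fixed-point 1/5/15-minute loads */

struct rq {                        /* heavily reduced per-CPU runqueue */
        long nr_running;           /* runnable tasks on this CPU */
        long nr_uninterruptible;   /* uninterruptible tasks accounted to this CPU */
        long calc_load_active;     /* value last folded into calc_load_tasks */
};

/* Step 1: on every tick, each CPU folds its active-task delta into the global counter. */
static void calc_global_load_tick_sim(struct rq *rq)
{
        long nr_active = rq->nr_running + rq->nr_uninterruptible;
        long delta = nr_active - rq->calc_load_active;

        rq->calc_load_active = nr_active;
        calc_load_tasks += delta;          /* atomic in the real kernel */
}

/* Step 2: roughly every 5 seconds, only the tick_do_timer_cpu CPU runs this,
 * via do_timer() -> calc_global_load(), and updates avenrun[]. */
static void calc_global_load_sim(void)
{
        long active = calc_load_tasks;

        active = active > 0 ? active * FIXED_1 : 0;
        avenrun[0] = calc_load(avenrun[0], EXP_1, active);
        avenrun[1] = calc_load(avenrun[1], EXP_5, active);
        avenrun[2] = calc_load(avenrun[2], EXP_15, active);
}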

calc_global_load stores the calculated loads in a global array avenrun, defined as unsigned long avenrun[3]; its three elements hold the 1-, 5-, and 15-minute loads. When you read /proc/loadavg, the data comes from this avenrun array.
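Note that avenrun holds fixed-point values. When /proc/loadavg is generated, the kernel converts them to the familiar two-decimal form with helpers along the following lines (my paraphrase, written from the FIXED_1 = 2048 definition above):

#define FSHIFT   11                    /* bits of fractional precision */
#define FIXED_1  (1 << FSHIFT)         /* 1.0 in fixed point == 2048   */

/* integer part, and the first two decimal digits, of a fixed-point load */
#define LOAD_INT(x)  ((x) >> FSHIFT)
#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1 - 1)) * 100)

/* e.g. printf("%lu.%02lu ", LOAD_INT(avenrun[0]), LOAD_FRAC(avenrun[0])); */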

Common causes of high load

A load surge is caused by an increase in the number of runnable or uninterruptible tasks. What makes things complicated is that the number of kernel paths that can put a task into the uninterruptible state is very large (roughly 400-500 paths), and in my personal view some of them abuse this state.

Based on years of Linux kernel development and troubleshooting experience, I have summarized some of that experience for readers below.

1 Periodic Spikes

Some services see periodic spikes in load. If the business itself has no periodic spikes, this is most likely a bug in the kernel’s load calculation. The bug is related to the kernel’s load sampling interval (LOAD_FREQ); I won’t go into the details here. It has been fixed in the ali2016, ali3000, and ali4000 kernels.

After ruling out this cause, the next thing to check is whether disk I/O is responsible.

2 I/O Causes

Disk Performance Bottlenecks

iostat -dx 1 displays the I/O load of all disks. When IOPS or bandwidth is high, the disk becomes a performance bottleneck and a large number of threads sit in the uninterruptible state waiting for I/O, which drives the load up. If you run vmstat at this point, you may also see the b column spike, CPU iowait spike, and procs_blocked in /proc/stat spike.
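As a quick check, the procs_blocked counter mentioned above can be read directly from /proc/stat; a minimal sketch:

#include <stdio.h>

int main(void)
{
        char line[256];
        unsigned long blocked = 0;
        FILE *fp = fopen("/proc/stat", "r");

        if (!fp)
                return 1;
        /* procs_blocked: number of tasks currently blocked waiting for I/O to complete */
        while (fgets(line, sizeof(line), fp)) {
                if (sscanf(line, "procs_blocked %lu", &blocked) == 1)
                        break;
        }
        fclose(fp);
        printf("procs_blocked: %lu\n", blocked);
        return 0;
}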

Cloud disk anomalies

A cloud disk is a virtualized disk: its I/O path is long and complex, and therefore prone to problems. A common anomaly is I/O util at 100% while avgqu-sz never drops to 0 (it stays at 1 or more). util at 100% does not necessarily mean the disk is busy; it only means that every time the device’s request queue is sampled, there is an outstanding I/O request. So if an I/O gets lost for some reason, the cloud disk shows util 100%, and the JBD2 thread in the ECS guest kernel, as well as business threads, can be stuck in the D state, causing the load to spike.

JBD2 bug

JBD2 is the journaling layer of the ext4 file system. Once the JBD2 kernel thread is stuck because of a bug, all disk I/O requests are blocked and a large number of threads enter the uninterruptible state, causing the load to spike.

After I/O is ruled out, you can then look at memory.

3 Memory Causes

Memory reclaim

A task may trigger memory reclaim when it allocates memory. If direct reclaim is triggered, performance takes a hit: the current task blocks until reclaim completes, and new requests may cause the number of tasks to grow (for example, HSF thread pool expansion), causing the load to spike. This can be observed with tsar --cpu --mem --load -i1 -l.

Memory bandwidth contention

You have probably heard of I/O bandwidth and network bandwidth, but not memory bandwidth. In fact, besides the capacity dimension, memory can also hit a bottleneck in the bandwidth dimension, and this metric is not visible to ordinary tools. The Aprof tool we developed can observe memory bandwidth contention, and it proved very powerful in the co-located (mixed-deployment) environment during the Double 11 readiness period.

4 Locks

The bottleneck is usually a spin_lock on some kernel path, especially on the network packet processing path. You can use perf top -g to find the spin_lock hotspot and then locate the kernel source from the function address. This is often accompanied by spikes in sys CPU and softirq.

In addition, on paths that use mutex_lock for concurrency control, if one task holds the lock, other tasks wait in the TASK_UNINTERRUPTIBLE state, causing the load to spike. But if the lock is not on a critical business path, there may be no business impact.

5 User CPU

In some cases a load spike is normal behavior for the business. In this case user CPU spikes, vmstat shows an increase in the r column, tsar --load -i1 -l shows an increase in runq, and looking at /proc/<pid>/schedstat you may see the second number, the sched delay, growing very quickly.
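To watch that delay directly, /proc/<pid>/schedstat can be polled; below is a minimal sketch (to my understanding the three fields are cumulative time on CPU in nanoseconds, cumulative run-queue wait time in nanoseconds, and the number of timeslices run):

#include <stdio.h>

/* Usage: pass a PID; prints its cumulative run-queue wait (the sched delay). */
int main(int argc, char **argv)
{
        char path[64];
        unsigned long long on_cpu_ns = 0, wait_ns = 0, slices = 0;
        FILE *fp;

        if (argc < 2)
                return 1;
        snprintf(path, sizeof(path), "/proc/%s/schedstat", argv[1]);
        fp = fopen(path, "r");
        if (!fp)
                return 1;
        /* fields: time on CPU (ns), time waiting on a runqueue (ns), timeslice count */
        if (fscanf(fp, "%llu %llu %llu", &on_cpu_ns, &wait_ns, &slices) == 3)
                printf("on cpu: %llu ns, sched delay: %llu ns, timeslices: %llu\n",
                       on_cpu_ns, wait_ns, slices);
        fclose(fp);
        return 0;
}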

Root cause analysis

1 RUNNABLE Load Surge Analysis

As mentioned above, this situation is usually due to increased business volume and is normal, but it can also be caused by bugs in business code, such as long loops or even infinite loops. In either case, the cause can usually be found through hotspot analysis, also called on-CPU analysis. There are many tools for on-CPU analysis, such as the ali-diagnose perf tool developed by Alibaba.

2 UNINTERRUPTIBLE Load Surge Analysis

UNINTERRUPTIBLE means waiting, so if we can find out what the tasks are waiting for, we basically know the cause.

Find UNINTERRUPTIBLE state processes

The UNINTERRUPTIBLE state is also referred to below as the D state. There are simple tools that count the number of D-state tasks, and more complex ones that also print the call stack of those D-state tasks. Such tools typically get their data from the proc file system provided by the kernel.

For example, they read /proc/${pid}/stat (or, per thread, /proc/${pid}/task/${tid}/stat), whose third field is the task state, and then read /proc/${pid}/stack to see where the task is waiting.
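As a minimal illustration of that approach (a sketch, not one of the tools mentioned above), the following program checks a single task's state and, if it is D, dumps its kernel stack; reading /proc/${pid}/stack generally requires root:

#include <stdio.h>

int main(int argc, char **argv)
{
        char path[64], line[256];
        char comm[64] = "?";
        char state = '?';
        int pid = 0;
        FILE *fp;

        if (argc < 2)
                return 1;
        snprintf(path, sizeof(path), "/proc/%s/stat", argv[1]);
        fp = fopen(path, "r");
        if (!fp)
                return 1;
        /* /proc/<pid>/stat: pid (comm) state ...  (assumes comm has no spaces) */
        fscanf(fp, "%d %63s %c", &pid, comm, &state);
        fclose(fp);
        printf("pid %d %s state: %c\n", pid, comm, state);

        if (state == 'D') {     /* uninterruptible sleep: show where it is waiting */
                snprintf(path, sizeof(path), "/proc/%s/stack", argv[1]);
                fp = fopen(path, "r");
                if (!fp)
                        return 1;
                while (fgets(line, sizeof(line), fp))
                        fputs(line, stdout);
                fclose(fp);
        }
        return 0;
}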

However, sometimes the tasks in the D state are transient, so the D state cannot be captured or the captured stack is inaccurate. That is where the other big technique comes in: delay analysis.

Delay analysis

Delay analysis requires going deep into the kernel and placing probe points on the relevant kernel paths, so tools of this kind are essentially built on kernel probing technology such as SystemTap, kprobe, and eBPF. But probing technology has to be combined with knowledge and experience to become a useful tool. Alibaba's ali-diagnose implements various kinds of delay analysis, including irq_delay, sys_delay, sched_delay, io_delay, and load-monitor.

Summary

The Linux kernel is a complex concurrent system with intricate relationships between modules. For load problems, however, you can always find the root cause by analyzing runnable tasks and uninterruptible tasks.

The original link

This article is original content from Alibaba Cloud and may not be reproduced without permission.