This article is part of the “Kernel Trace Tools” series produced by the ByteDance STE team to introduce the tools the team has developed for tracking various kinds of delay problems. In practical work, we have encountered high network latency caused by interrupts and softirqs being disabled for long periods. However, there was no effective means or objective evidence for locating and tracing such problems, which cost a great deal of time and effort. The trace-irqoff tool, an interrupts-off/softirqs-off latency tracer, is a self-developed tool born in this background. trace-irqoff is now open source; if you are interested, you can find it in the open source repo: http://github.com/bytedance/trace-irqoff/tree/master.

1. Background

In our work, we often encounter business processes suffering from high network latency. Based on rich experience analyzing similar problems, we know there are many possible causes, and we find that the following two frequently come to our attention:

  • Hardirqs were disabled for a long time.
  • Softirqs were disabled for a long time.

If hardirqs are disabled for too long, scheduling delays occur and softirqs on the local CPU cannot execute. We know that softirqs are used for sending and receiving network packets, so disabling hardirqs for too long is bound to cause problems. Disabling softirqs is similar: while it does not affect hardirqs, it directly delays softirq execution.

2. What do we need

We often waste a lot of time just determining whether the causes above are responsible. It is therefore worth developing a tool specifically to locate network latency problems of this kind. We’re not just looking for the crime scene; we’re looking for the killer. We need to know which process disabled interrupts, and at which point in the code, which lets us solve the problem efficiently.

3. Is there a ready-made solution

Our goal is simple: track how long hardirqs/softirqs stay disabled. How can we do that? The simplest and most intuitive method is to add hook functions at the places where the kernel enables and disables interrupts, record a timestamp on each side, and take the difference between the two timestamps: that difference is the time interrupts were disabled. The Linux kernel provides the following APIs for enabling and disabling interrupts:

/* hardirq enable/disable api */
local_irq_disable()
local_irq_enable()
local_irq_save()
local_irq_restore()

/* softirq enable/disable api */
local_bh_disable()
local_bh_enable()
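
To make the hook idea concrete, here is a minimal sketch of what such wrappers could look like (the wrapper names and the 50ms threshold are hypothetical; this only illustrates the timestamp-diff idea, not how the kernel or trace-irqoff actually works):

#include <linux/irqflags.h>
#include <linux/ktime.h>
#include <linux/percpu.h>
#include <linux/printk.h>
#include <linux/sched/clock.h>

#define IRQOFF_THRESHOLD_NS	(50 * NSEC_PER_MSEC)	/* hypothetical 50ms threshold */

static DEFINE_PER_CPU(u64, irq_disable_ts);

static inline void traced_local_irq_disable(void)
{
	local_irq_disable();
	/* stamp the moment interrupts went off on this CPU */
	__this_cpu_write(irq_disable_ts, local_clock());
}

static inline void traced_local_irq_enable(void)
{
	u64 off_ns = local_clock() - __this_cpu_read(irq_disable_ts);

	/* the difference between the two timestamps is the off time */
	if (off_ns > IRQOFF_THRESHOLD_NS)
		pr_warn("irqs were off for %llu ms\n", off_ns / NSEC_PER_MSEC);
	local_irq_enable();
}

Every enable/disable site would have to go through such wrappers, which already hints at the problems discussed below.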

Yes, hardirq-off tracking is already implemented in the Linux kernel; we just need to enable the following config option:

CONFIG_IRQSOFF_TRACER=y

Okay, it looks like we don’t have to do anything and can just use it. It seems so, but there are two problems:

  • CONFIG_IRQSOFF_TRACER is disabled by default. To use it we would have to recompile the kernel, reinstall it, and then wait for the problem to reproduce. That’s not what we want.
  • Interrupts are enabled and disabled very frequently in the Linux kernel. Even if the approach were feasible, the overhead would be high. Again, that’s not what we want.

4. Our approach

Let’s think about it another way. We can use an hrtimer to measure the interval between interrupts and thereby infer whether interrupts were disabled. An hrtimer is a high-resolution timer in Linux whose handler executes in hardirq context, so it suits this purpose. The method is not precise, but it is good enough for our needs. For example, suppose the hrtimer period is 10ms; the interval between two expirations should then be about 10ms. If two samples turn out to be 50ms apart, that indirectly indicates interrupts were disabled for roughly 50ms. In addition, following the sampling theorem, the interval between two expirations must exceed twice the sampling period (20ms) before we consider interrupts to have been disabled.

So our approach is clear: start an hrtimer on each CPU, pinned to that CPU, and compute the interval between expirations in the hrtimer handler. That completes interrupts-off detection. What about softirq? We can do the same thing: in the Linux kernel, a normal timer’s handler executes in softirq context, which fits our needs exactly. So we sample periodically in the same way, just with a regular timer.
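
The following is a minimal sketch of this per-CPU hrtimer sampling idea (the names, period, and printout are illustrative assumptions, not the actual trace-irqoff implementation):

#include <linux/hrtimer.h>
#include <linux/ktime.h>
#include <linux/percpu.h>
#include <linux/printk.h>
#include <linux/sched/clock.h>
#include <linux/smp.h>

#define SAMPLE_PERIOD_NS	(10 * NSEC_PER_MSEC)	/* 10ms sampling period */

static DEFINE_PER_CPU(struct hrtimer, sample_hrtimer);
static DEFINE_PER_CPU(u64, hardirq_ts);	/* when this handler last ran */

static enum hrtimer_restart sample_fn(struct hrtimer *timer)
{
	u64 now = local_clock();
	u64 last = __this_cpu_read(hardirq_ts);

	/*
	 * This handler runs in hardirq context. If the gap between two
	 * runs is much larger than the period, interrupts must have been
	 * disabled on this CPU for about that long.
	 */
	if (last && now - last > 2 * SAMPLE_PERIOD_NS)
		pr_warn("cpu%d: irqs off ~%llu ms\n", smp_processor_id(),
			(now - last) / NSEC_PER_MSEC);

	__this_cpu_write(hardirq_ts, now);
	hrtimer_forward_now(timer, ns_to_ktime(SAMPLE_PERIOD_NS));
	return HRTIMER_RESTART;
}

static void start_sample_timer(void *unused)	/* runs on each CPU */
{
	struct hrtimer *timer = this_cpu_ptr(&sample_hrtimer);

	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
	timer->function = sample_fn;
	hrtimer_start(timer, ns_to_ktime(SAMPLE_PERIOD_NS),
		      HRTIMER_MODE_REL_PINNED);
}

/* From module init, e.g.: on_each_cpu(start_sample_timer, NULL, 1); */

The HRTIMER_MODE_REL_PINNED mode keeps each timer bound to its CPU, which is what lets a missing expiration be attributed to that CPU.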

5. Record the culprit’s stack

When we find that the interval between two interrupts exceeds the threshold, we need to know what the current CPU was doing that kept hardirqs/softirqs disabled. We can record the current CPU’s stack in the interrupt handler. The premise is that the current process cannot have been scheduled out before the interrupt handler runs; that is, the culprit must still be present. We catch the cat in the act.
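
As a minimal sketch, recording the interrupted CPU’s stack from a timer handler could look like this (assuming the stack_trace_save() API available since kernel 5.2; older kernels use save_stack_trace(), and all names here are illustrative):

#include <linux/percpu.h>
#include <linux/stacktrace.h>

#define MAX_TRACE_ENTRIES	32

struct irqoff_trace {
	unsigned long entries[MAX_TRACE_ENTRIES];
};

static DEFINE_PER_CPU(struct irqoff_trace, irqoff_traces);

/*
 * Called from an (hr)timer handler when the measured gap exceeds the
 * threshold. The interrupted task was never scheduled out, so it is
 * still on this CPU and the captured stack is the culprit's stack.
 */
static void record_culprit_stack(void)
{
	struct irqoff_trace *trace = this_cpu_ptr(&irqoff_traces);
	unsigned int nr;

	nr = stack_trace_save(trace->entries, MAX_TRACE_ENTRIES, 0);
	stack_trace_print(trace->entries, nr, 4);
}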

5.1 hardirq

Consider hardirqs first. If a timer interrupt fires while a process has interrupts disabled, the interrupt is not serviced, but the hardware marks it pending. The moment the process re-enables interrupts, the pending interrupt is serviced and the hardirq handler is called. The current process never had a chance to be scheduled out, so this must be the crime scene, and the current process must be the culprit.

5.2 softirq

Does the above premise hold in the softirqs-off case? A process calls local_bh_enable() to re-enable softirqs. The function that re-enables the bottom half is as follows:

void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
{
        /*
         * Keep preemption disabled until we are done with
         * softirq processing:
         */
        preempt_count_sub(cnt - 1);

        if (unlikely(!in_interrupt() && local_softirq_pending())) {
                do_softirq();
        }

        preempt_count_dec();
}

We can see that if a softirq is pending, do_softirq() handles it right there in the current process context, and preemption is still disabled at that point, so the process cannot be scheduled out before it happens. In other words, calling local_bh_enable() services the pending normal timer immediately. So the softirq case satisfies the premise as well.

6. Softirq is special

Let me remind you of the contexts in which softirqs are executed. There are three places:

  • irq_exit()

  • local_bh_enable()

  • The ksoftirqd process

Softirq pending is checked when an interrupt returns, via irq_exit(); this is where softirqs execute in most cases. Given the property of interrupts described above, recording the current CPU stack in the timer handler is valid there. The local_bh_enable() case, discussed above, can also catch the culprit. But the third case cannot. Why is that?

We record stack information in the timer handler. If softirq processing takes too long in the irq_exit() path, the remaining work is deferred to the ksoftirqd process. So what stack do we end up recording? Simply the stack of the ksoftirqd process, and ksoftirqd is not the culprit, so the record is useless. We therefore need special handling for the ksoftirqd scenario, and we use the hrtimer handler for it: besides measuring the interval between two hardirq-context samples, the hrtimer also measures how long softirqs have gone without running. Each time the hrtimer handler executes, it checks how long it has been since the softirq-context timer last updated its timestamp; if that exceeds the threshold, the stack is recorded there as well. This is necessary because once softirq processing is deferred to ksoftirqd, the stack logged by the softirq-context timer is worthless.
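
A minimal sketch of this special handling, reusing the illustrative names from the sketches above (softirq_ts is refreshed by the softirq-context timer; the hardirq-context hrtimer checks how stale it is):

#include <linux/jiffies.h>
#include <linux/percpu.h>
#include <linux/sched/clock.h>
#include <linux/timer.h>

static DEFINE_PER_CPU(u64, softirq_ts);
static DEFINE_PER_CPU(struct timer_list, sample_timer);

/*
 * Runs in softirq context; set up per CPU, e.g. with
 * timer_setup(t, softirq_sample_fn, TIMER_PINNED) and add_timer_on().
 * If softirqs stop running on this CPU, the timestamp goes stale.
 */
static void softirq_sample_fn(struct timer_list *t)
{
	__this_cpu_write(softirq_ts, local_clock());
	mod_timer(t, jiffies + msecs_to_jiffies(10));
}

/* Extra check performed inside the hardirq-context hrtimer handler. */
static void check_softirq_delay(void)
{
	u64 stale = local_clock() - __this_cpu_read(softirq_ts);

	/*
	 * Record the stack from hardirq context: once the pending work
	 * is deferred to ksoftirqd, the softirq-context timer would only
	 * ever capture ksoftirqd's own stack, which is worthless.
	 */
	if (stale > IRQOFF_THRESHOLD_NS)
		record_culprit_stack();
}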

7. How do I install it

To install the trace-irqoff tool, clone the repository and build it:

git clone https://github.com/bytedance/trace-irqoff.git
cd trace-irqoff
make -j8
make install

8. How to use it

After the trace-irqoff tool is installed successfully, it creates the /proc/trace_irqoff directory:

root@n18-061-206:/proc/trace_irqoff# ls
distribute  enable  sampling_period  trace_latency

Four files exist in the /proc/trace_irqoff directory: distribute, enable, sampling_period, and trace_latency. The tool is disabled by default after installation, so we need to turn tracing on manually.

8.1 Enabling the trace

echo 1 > /proc/trace_irqoff/enable

8.2 Disabling the trace

echo 0 > /proc/trace_irqoff/enable

8.3 Setting the Trace Threshold

The trace-irqoff tool only records stack information when the interrupts-off or softirqs-off time exceeds the threshold. We can check the current threshold with the following command:

root@n18-061-206:/proc/trace_irqoff# cat /proc/trace_irqoff/trace_latency
trace_irqoff_latency: 50ms

hardirq:

softirq:

The default threshold is 50ms, as shown on line 2 of the output. The hardirq: on line 4 indicates that any stacks listed below it disabled interrupts for longer than the threshold; similarly, line 6 heads the stacks whose softirqs-off time exceeded the threshold. To change the threshold to, say, 100ms, run the following command (the written value is in ms):

echo 100 > /proc/trace_irqoff/trace_latency

8.4 Clearing stack Information

Of course, if you need to clear the stack information recorded in /proc/trace_irqoff, you can run the following command (this clears the records; the threshold is not changed to 0):

echo 0 > /proc/trace_irqoff/trace_latency

8.5 Viewing statistics on how often interrupts were disabled

If we need to know how many times interrupts were disabled for a given length of time, we can obtain a distribution with the following command:

root@n18-061-206:/proc/trace_irqoff# cat distribute
hardirq-off:
         msecs        : count    distribution
        20 -> 39      : 1       |**********                              |
        40 -> 79      : 0       |                                        |
        80 -> 159     : 4       |****************************************|
       160 -> 319     : 2       |********************                    |
       320 -> 639     : 1       |**********                              |

softirq-off:
         msecs        : count    distribution
        20 -> 39      : 0       |                                        |
        40 -> 79      : 0       |                                        |
        80 -> 159     : 0       |                                        |
       160 -> 319     : 1       |****************************************|

In this example, we see that hardirqs were disabled for a duration in [80, 159] ms 4 times, and softirqs were disabled for a duration in [160, 319] ms once. If there is no output at all, interrupts were never disabled anywhere for more than 20ms.

8.6 Changing the Sampling Period

As the previous section shows, the minimum granularity of the off-time distribution is 20ms. This is because the sampling period is 10ms: by the sampling theorem, only off-times of at least twice the sampling period can be reliably detected. To improve the granularity of the statistics, reduce the sampling period. For example, to change it to 1ms, run the following command (the write takes effect only while the tracer is disabled):

echo 1 > /proc/trace_irqoff/sampling_period

The minimum sampling period that can be set is 1ms.

9. Case study

9.1 Hardirqs disabled

We use the following schematic test function to disable interrupts for 100ms, then view the contents of the trace_latency file.

static void disable_hardirq(unsigned long latency)
{
        local_irq_disable();
        mdelay(latency);
        local_irq_enable();
}

We trigger the above code from a kernel module (a minimal sketch of such a module follows) and then look at the recorded stack information.
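
Here is a minimal sketch of such a test module, assuming a hypothetical /proc file named irqoff_test and the proc_ops interface of kernels 5.6 and later (older kernels would use file_operations); any write to the file triggers disable_hardirq():

#include <linux/delay.h>
#include <linux/irqflags.h>
#include <linux/module.h>
#include <linux/proc_fs.h>

static void disable_hardirq(unsigned long latency_ms)
{
	local_irq_disable();
	mdelay(latency_ms);
	local_irq_enable();
}

/* Any write to /proc/irqoff_test disables interrupts for ~100ms. */
static ssize_t irqoff_test_write(struct file *file, const char __user *buf,
				 size_t count, loff_t *ppos)
{
	disable_hardirq(100);
	return count;
}

static const struct proc_ops irqoff_test_ops = {
	.proc_write = irqoff_test_write,
};

static int __init irqoff_test_init(void)
{
	return proc_create("irqoff_test", 0200, NULL, &irqoff_test_ops) ? 0 : -ENOMEM;
}

static void __exit irqoff_test_exit(void)
{
	remove_proc_entry("irqoff_test", NULL);
}

module_init(irqoff_test_init);
module_exit(irqoff_test_exit);
MODULE_LICENSE("GPL");

After insmod, a write such as echo 1 > /proc/irqoff_test reproduces the stack below, where the write path (proc_reg_write -> ... -> disable_hardirq) is clearly visible.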

root@n18-061-206:/proc/trace_irqoff# cat trace_latency
trace_irqoff_latency: 50ms
 hardirq:
 cpu: 17
     COMMAND: bash PID: 22840 LATENCY: 107ms
     trace_irqoff_hrtimer_handler+0x39/0x99 [trace_irqoff]
     __hrtimer_run_queues+0xfa/0x270
     hrtimer_interrupt+0x101/0x240
     smp_apic_timer_interrupt+0x5e/0x120
     apic_timer_interrupt+0xf/0x20
     disable_hardirq+0x5b/0x70
     proc_reg_write+0x36/0x60
     __vfs_write+0x33/0x190
     vfs_write+0xb0/0x190
     ksys_write+0x52/0xc0
     do_syscall_64+0x4f/0xe0
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

softirq:

We can see in the hardirq section that CPU 17, running the bash command (PID 22840), disabled interrupts for 107ms (accurate to within the 10ms sampling period), and the stack points at the disable_hardirq() function. The softirq section at the end of the output is empty, meaning no softirqs-off stack was recorded.

9.2 Softirqs disabled

We disable softirqs for 100ms using the following schematic test function, then view the contents of the trace_latency file.

static void disable_softirq(unsigned long latency)
{
        local_bh_disable();
        mdelay(latency);
        local_bh_enable();
}

Test the above code through the module, and then look at the stack information.

root@n18-061-206:/proc/trace_irqoff# cat trace_latency
trace_irqoff_latency: 50ms

hardirq:

softirq:
 cpu: 17
     COMMAND: bash PID: 22840 LATENCY: 51+ms
     trace_irqoff_hrtimer_handler+0x97/0x99 [trace_irqoff]
     __hrtimer_run_queues+0xfa/0x270
     hrtimer_interrupt+0x101/0x240
     smp_apic_timer_interrupt+0x5e/0x120
     apic_timer_interrupt+0xf/0x20
     delay_tsc+0x3c/0x50
     disable_softirq+0x4b/0x80
     proc_reg_write+0x36/0x60
     __vfs_write+0x33/0x190
     vfs_write+0xb0/0x190
     ksys_write+0x52/0xc0
     do_syscall_64+0x4f/0xe0
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

     COMMAND: bash PID: 22840 LATENCY: 106ms
     trace_irqoff_timer_handler+0x3a/0x60 [trace_irqoff]
     call_timer_fn+0x29/0x120
     run_timer_softirq+0x16c/0x400
     __do_softirq+0x108/0x2b8
     do_softirq_own_stack+0x2a/0x40
     do_softirq.part.21+0x56/0x60
     __local_bh_enable_ip+0x60/0x70
     disable_softirq+0x62/0x80
     proc_reg_write+0x36/0x60
     __vfs_write+0x33/0x190
     vfs_write+0xb0/0x190
     ksys_write+0x52/0xc0
     do_syscall_64+0x4f/0xe0
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

For the softirqs-off case there are two stacks. Notice that the top function differs between them: the first stack was captured by the hardirq-context hrtimer handler (trace_irqoff_hrtimer_handler), and the second by the softirq-context timer handler (trace_irqoff_timer_handler). Normally we focus on the second stack, and fall back to the first only when the second is useless. Note also that the LATENCY shown for the first stack, 51+ms, is the threshold information, not the actual latency (hence the ‘+’ suffix, meaning the latency exceeded 51ms). The actual latency is the 106ms shown in the second stack. Let’s see why both stacks are necessary.

9.3 ksoftirqd delay

Let’s look at a real problem we dealt with:

root@n115-081-045:/proc/trace_irqoff# cat trace_latency
trace_irqoff_latency: 300ms

hardirq:

softirq:
 cpu: 4
     COMMAND: lxcfs PID: 4058797 LATENCY: 303+ms
     trace_irqoff_record+0x12b/0x1b0 [trace_irqoff]
     trace_irqoff_hrtimer_handler+0x97/0x99 [trace_irqoff]
     __hrtimer_run_queues+0xdc/0x220
     hrtimer_interrupt+0xa6/0x1f0
     smp_apic_timer_interrupt+0x62/0x120
     apic_timer_interrupt+0x7d/0x90
     memcg_sum_events.isra.26+0x3f/0x60
     memcg_stat_show+0x323/0x460
     seq_read+0x11f/0x3f0
     __vfs_read+0x33/0x160
     vfs_read+0x91/0x130
     SyS_read+0x52/0xc0
     do_syscall_64+0x68/0x100
     entry_SYSCALL_64_after_hwframe+0x3d/0xa2

     COMMAND: ksoftirqd/4 PID: 34 LATENCY: 409ms
     trace_irqoff_record+0x12b/0x1b0 [trace_irqoff]
     trace_irqoff_timer_handler+0x3a/0x60 [trace_irqoff]
     call_timer_fn+0x2e/0x130
     run_timer_softirq+0x1d4/0x420
     __do_softirq+0x108/0x2a9
     run_ksoftirqd+0x1e/0x40
     smpboot_thread_fn+0xfe/0x150
     kthread+0xfc/0x130
     ret_from_fork+0x1f/0x30

We see that the second stack belongs to the process ksoftirqd/4 with a latency of 409ms. The ksoftirqd process is the kernel thread that processes softirqs, so this stack is meaningless to us: the culprit has already left the scene. This is where the first stack helps. It shows that when softirqs had been delayed 303ms, the current CPU was executing lxcfs, with a memory-cgroup-related stack. We can therefore conclude that the lxcfs process spent too long in kernel mode, and since the kernel does not support preemption there, the ksoftirqd process never got a chance to run.

10. Summary

In ByteDance’s internal practice, trace-irqoff has proven easy to install and flexible to use, shortening problem localization to minutes, with considerable benefit. Many problems have already been located with this tool, improving work efficiency.

More articles

Improvement of Bytedance on RocksDB storage engine

Bytedance’s own quadrillion graph database & graph computing practice

Deep understanding of the Linux kernel – TLB shootdown and optimization caused by Jemalloc

Bytedance STE team

The ByteDance STE (System Technologies & Engineering) team is committed to the operating system kernel and virtualization, the construction and performance optimization of system software and libraries, the stability and reliability of large-scale data centers, and the research, development, and deployment of systems built on hardware/software co-design. With comprehensive capabilities in foundational software engineering, the team safeguards ByteDance’s upper-layer businesses. The team also actively follows community technology trends and embraces open source and standards. Ambitious people are welcome to join; if interested, send your resume to: [email protected]

Welcome to the Bytedance technical team