This article is part of the "Kernel Trace Tools" series produced by the ByteDance STE team to introduce the delay-tracking tools the team has developed. In practice, we encountered high network latency and lag caused by threads running in kernel mode without being scheduled, yet there was no effective method or objective evidence for locating and tracking such problems, so troubleshooting cost a great deal of time and energy. The trace-noschedule tool is a self-developed tool born of this background. Trace-noschedule is now open source; see the open source repo if you are interested.

1. Background

In actual project practice, we often encounter problems caused by high latency. Since our server kernels are configured without preemption by default, a thread running too long in kernel mode is one possible cause: if process A spends too long in kernel mode, it inevitably delays other processes waiting to run on that CPU, resulting in scheduling delay. We therefore developed a tool to track processes that stay in kernel mode for a long time without being scheduled. It gives us a direction for troubleshooting, and when a delay is caused this way, it lets us locate the root cause quickly.

2. What do we want to do

When a process enters kernel mode, we record a timestamp; when it leaves the kernel, we record a timestamp again. The difference between the two is the kernel-mode running time of the process. The idea is simple enough.

3. How to achieve it

We know that the main way an application enters kernel mode is through system calls. Can we use the tracepoints provided for system calls? Log a timestamp when the system call starts, log another when it exits, and the difference is how long the system call took. Is this approach reliable? Obviously not. A process in kernel mode may voluntarily schedule away, for example while waiting for a resource. In that case it is not occupying the CPU continuously without scheduling, so this method is not feasible.

Since system calls don't work, let's take a different approach. We know that a thread's execution slice starts and ends in schedule(), so we can measure the interval between a thread starting to execute and giving up the CPU, voluntarily or not. That interval is the total time, i.e. user time plus kernel time. Is there a way to get it? Fortunately, the scheduler's tracepoint gives us exactly that.

static void __sched notrace __schedule(bool preempt)
{
    /* ... */
    trace_sched_switch(preempt, prev, next);
    /* ... */
}

We simply need to enable the sched_switch tracepoint to get the total execution time of each thread. Excited heart, trembling hands, eager to reach for BCC right away? But there is a new problem: how do we filter out the user-mode execution time? User mode is preemptible while kernel mode (on our configuration) is not, so only the kernel-mode time is meaningful to count.

Filtering out user-mode execution time can be done indirectly with a timer. With a periodic timer interrupt, we can tell whether the interrupted context is user mode or kernel mode. The start timestamp, last_timestamp, therefore needs to be updated continuously: if the timer finds it interrupted user mode, last_timestamp is updated; otherwise it is not, and the thread is considered to have been executing in kernel mode since last_timestamp. Let's look at the sample code.

static enum hrtimer_restart trace_nosched_hrtimer_handler(struct hrtimer *hrtimer)
{
    /*
     * Skip the idle task and make sure we are not the only
     * running task on the CPU. If we are interrupted from
     * user mode, it indicates that we are not executing in
     * the kernel space, so we should also skip it.
     */
    if (!is_idle_task(current) && regs && !user_mode(regs) &&
        !single_task_running()) {
        /* The kernel mode */
        /*
         * If the execution time exceeds the threshold, record
         * the call trace.
         */
    } else {
        /* The user mode */
        stack_trace->last_timestamp = now;
    }
}

We see that the conditions for not updating the timestamp are strict. The following cases are considered:

  • If the interrupt occurred in user mode, update last_timestamp.
  • If there is only one task in the CPU's run queue, update last_timestamp: in this case, holding the CPU causes no problem.
  • If the current task is the idle task, also update last_timestamp: the idle task itself may legitimately run in kernel mode for a long time.

Let's look at the hook function implementation for the tracepoint.

static void probe_sched_switch(void *priv, bool preempt,
                               struct task_struct *prev,
                               struct task_struct *next)
{
    u64 now = local_clock();
    u64 last = cpu_stack_trace->last_timestamp;

    cpu_stack_trace->last_timestamp = now;

    /*
     * Skip the idle task and make sure we are not the only
     * running task on the CPU.
     */
    if (!is_idle_task(prev) && !single_task_running())
        hist_update(cpu_stack_trace, now - last);
}

The recorded time difference can, to a good approximation, be considered the kernel-mode execution time. When it exceeds the configured threshold, we update the histogram statistics.

4. How to install it

Installing the trace-noschedule tool is easy: clone the repository with git, then build and install.

make -j8
make install

5. How to use it

After the trace-noschedule tool is installed successfully, the /proc/trace_noschedule directory is created:

root@n18-061-206:/proc/trace_noschedule# ls
distribution  enable  stack_trace  threshold

The /proc/trace_noschedule directory contains four files: distribution, enable, stack_trace, and threshold. The tool is disabled by default after installation.

5.1 Enabling the tracer

Run the following command to enable the tracer:

echo 1 > /proc/trace_noschedule/enable

5.2 Disabling the tracer

Run the following command to disable the tracer:


echo 0 > /proc/trace_noschedule/enable

Note: disable the tracer after you finish debugging the problem. Because the module is implemented on top of the sched tracepoint, its overhead is not negligible.

5.3 Setting thresholds

trace-noschedule records a stack trace only for processes whose unscheduled kernel-mode execution time exceeds the threshold. To keep operation efficient, set a reasonable threshold. For example, to set a threshold of 60 ms (the unit is ns):

echo 60000000 > /proc/trace_noschedule/threshold

5.4 Viewing the execution-time distribution of processes unscheduled in kernel mode for a long time

root@n18-061-206:/proc/trace_noschedule# cat distribution
Trace noschedule thread:
     msecs           : count    distribution
        20 -> 39     : 1        |**********                              |
        40 -> 79     : 0        |                                        |
        80 -> 159    : 4        |****************************************|
       160 -> 319    : 2        |********************                    |

There were four occurrences of kernel-mode execution without scheduling in the [80, 159] ms range.

5.5 Who occupies the CPU without scheduling

stack_trace records the stacks of processes that occupied the CPU beyond the threshold without being scheduled:

root@n18-061-206:/proc/trace_noschedule# cat stack_trace
 cpu: 0
     COMM: sh PID: 1270013 DURATION: 100ms
     delay_tsc+0x21/0x50
     nosched_test_write+0x53/0x90 [trace_noschedule]
     proc_reg_write+0x36/0x60
     __vfs_write+0x33/0x190
     vfs_write+0xb0/0x190
     ksys_write+0x52/0xc0
     do_syscall_64+0x4f/0xe0
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

This is a kernel-mode test case that calls mdelay(100) in kernel mode, occupying the CPU for 100 ms without scheduling. The recorded stack is shown above; DURATION records the execution duration.

5.6 Clearing the stack trace

The stack trace buffer is limited in size, so we may need to clear it from time to time:

echo 0 > /proc/trace_noschedule/stack_trace

6. Case presentation

Here is a practical case to walk you through the tool. The background: the business team found that a Docker host seemed to have problems; logging in to the machine lagged, as did executing many commands.

6.1 Take a look at the CPU usage


Several CPUs are at 100%, and mostly in sys mode. What is the system doing?

6.2 Using the perf tool

Using perf top to look at the CPU shows that about 60% of the time is spent on spinlocks, but it does not show which kernel path is taking the spinlock. Let's see whether the trace-noschedule tool finds anything.

6.3 trace-noschedule tracking

root@n131-103-013:/proc/trace_noschedule# cat distribution
Trace noschedule thread:
     msecs           : count    distribution
        20 -> 39     : 760      |****************************************|
        40 -> 79     : 488      |*************************               |
        80 -> 159    : 73       |***                                     |

Kernel mode goes unscheduled for long stretches, and frequently. Take a look at the stack information recorded by the tool; a capture follows:

COMM: runc PID: 499045 DURATION: 75ms
     native_queued_spin_lock_slowpath+0x112/0x190
     _raw_spin_lock+0x1d/0x20
     prepend_path+0x266/0x2d0
     __d_path+0x4b/0x80
     seq_path_root+0x53/0xb0
     show_mountinfo+0xb8/0x290
     seq_read+0x321/0x3f0
     __vfs_read+0x33/0x160
     vfs_read+0x91/0x130
     SyS_read+0x52/0xc0
     do_syscall_64+0x68/0x100
     entry_SYSCALL_64_after_hwframe+0x3d/0xa2
COMM: falcon-agent PID: 3002056 DURATION: 114ms
     native_queued_spin_lock_slowpath+0x10f/0x190
     _raw_spin_lock+0x1d/0x20
     prepend_path+0x266/0x2d0
     __d_path+0x4b/0x80
     seq_path_root+0x53/0xb0
     show_vfsmnt+0x7a/0x170
     seq_read+0x321/0x3f0
     __vfs_read+0x33/0x160
     vfs_read+0x91/0x130
     SyS_read+0x52/0xc0
     do_syscall_64+0x68/0x100
     entry_SYSCALL_64_after_hwframe+0x3d/0xa2

This means the kernel really is spinning all the time, mainly on the two file-read paths above. The files being read are /proc/$pid/mounts and /proc/$pid/mountinfo, respectively.

6.4 Viewing mountinfo information

Perhaps there are so many mounts that traversing them takes too long, holding the lock for a long time:

root@n131-103-013:/var/log/atop# mount -v | wc -l
65803

That is a lot: more than 65,000 entries, mostly /dev/shm mounts (only part of the output is shown here). Now that the cause is known, the next step is to find out why /dev/shm is mounted so many times. The tool let us quickly identify the cause of the problem.

7. To summarize

In ByteDance's internal practice, trace-noschedule has proved easy to install and flexible to use, shortening problem-location time to minutes, a considerable benefit. A number of problems have already been located with this tool, improving our work efficiency.

More articles

Kernel Trace Tools (1) : Tracing the fault that interrupts and soft interrupts are closed for a long time

Summary of bytedance chaos engineering practice

GDB says the coredump file is truncated

STE team, Bytedance Systems Department

The STE team of the ByteDance Systems Department has long been committed to the operating system kernel and virtualization, the construction and performance optimization of system software and foundational libraries, stability and reliability engineering for large-scale data centers, and research and development in emerging hardware/software co-design, building comprehensive foundational software engineering capability and safeguarding ByteDance's upper-layer businesses. The team also actively follows community technology trends and embraces open source and standards. Ambitious people are welcome to join; if interested, send your resume to [email protected].

Welcome to the Bytedance technical team