Principles & Engineering Practice | Debugging Rust Memory Errors and Dynamic Analysis Tools

Author: Wu Feixiang/Post editor: Zhang Handong

Debugging a segfault with GDB/LLDB

Unlike static analysis tools such as Clippy or rust-analyzer (RA), dynamic analysis tools have to run the program in order to analyze it; the official bench and test tooling are examples

Why dynamic analysis

The above is a flame graph generated by dynamic analysis of a Rust program. It clearly shows the program's performance bottleneck: frequent memory allocation (alloc)

Dynamic analysis can be used not only to find performance bottlenecks, but also to debug memory errors at runtime and to generate function call trees that help new team members read the project code quickly
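To make the "frequent allocation" bottleneck concrete, here is a minimal sketch of one very simple form of dynamic analysis: counting heap allocations while the program runs. It is not how a flame graph is produced; it only illustrates that the information comes from running the code. The names CountingAlloc, count_allocs and allocate_in_loop are hypothetical and do not come from the author's project.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wraps the system allocator and counts every allocation request.
struct CountingAlloc;

static ALLOC_COUNT: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_COUNT.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `f` and returns how many allocations were requested meanwhile.
/// Allocations made by other threads during the call are counted too.
fn count_allocs<F: FnOnce()>(f: F) -> usize {
    let before = ALLOC_COUNT.load(Ordering::Relaxed);
    f();
    ALLOC_COUNT.load(Ordering::Relaxed) - before
}

/// Allocation-heavy on purpose: builds a fresh String for every item.
fn allocate_in_loop(n: usize) -> usize {
    let mut total = 0;
    for i in 0..n {
        // black_box keeps the optimizer from removing the allocation
        let s = std::hint::black_box(i.to_string());
        total += s.len();
    }
    total
}
```

Calling count_allocs(|| { allocate_in_loop(100); }) reports at least one allocation per loop iteration, which is the kind of pattern that shows up as a wide alloc frame in a flame graph.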

Contents: dynamic analysis tools

Common debugging tools and memory detection tools:

First, a segfault case is used to show how common debugging tools detect the memory error:

coredumpctl
valgrind
gdb
lldb/vscode-lldb/IntelliJ Rust

Dynamic analysis tools such as flame graphs and function call trees (profiling):

dmesg
cargo-miri
perf
cargo-flamegraph
KCachegrind
gprof
uftrace
ebpf

Finally, the above tools are used to analyze several memory error cases:

SIGABRT/double-free
SIGABRT/free-dylib-mem

A segfault case and common debugging tools

The following is part of the source code of my rewrite of the ls command (hereinafter the ls application); the complete source code is in this code repository

fn main() {
    let dir = unsafe { libc::opendir(input_filename.as_ptr().cast()) };
    loop {
        let dir_entry = unsafe { libc::readdir(dir) };
        if dir_entry.is_null() {
            break;
        }
        // ...
    }
}

Like the original ls command, it takes a directory as its argument and lists all the file names in that directory

However, if the argument passed to the ls application is not a directory, a segfault occurs:

> cargo r --bin ls -- Cargo.toml
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/ls Cargo.toml`
Segmentation fault (core dumped)

coredumpctl

systemd-coredump configuration

Check whether core dump support is enabled in the kernel by inspecting the /proc/config.gz kernel configuration file

> zcat /proc/config.gz | grep CONFIG_COREDUMP
CONFIG_COREDUMP=y

Since /proc/config.gz is gzip-compressed binary data rather than plain text, print it with zcat instead of cat

Edit the systemd-coredump configuration file /etc/systemd/coredump.conf

Raise the default core dump size limit by setting ExternalSizeMax=20G

Run sudo systemctl restart systemd-coredump to restart the service and apply the change

View the last core dump record

Use coredumpctl list to find the most recent core dump record, which is the segfault that just happened

Tue 2021-07-06 11:20:43 CST 358976 1000 1001 SIGSEGV present /home/w/repos/my_repos/linux_commands_rewritten_in_rust/target/debug/ls 30.6K

Note that 358976, the number before the user ID 1000, is the PID of the crashed process; it is the key used for coredumpctl info queries

coredumpctl info 358976

PID: 358976 (segfault_opendi)
// ...
Command Line: ./target/debug/ls
// ...
Storage: /var/lib/systemd/coredump/core.segfault_opendi.1000.d464328302f146f99ed984edc6503ca0.358976.1625541643000000.zst (present)
// ...

Alternatively, you can open the core dump file of the segfault in GDB:

coredumpctl gdb 358976 or coredumpctl debug 358976

Reference: Core dump-wiki

Valgrind checks for memory errors

valgrind --leak-check=full ./target/debug/ls

// ...
==356638== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==356638==  Access not within mapped region at address 0x4
==356638==    at 0x497D904: readdir (in /usr/lib/libc-2.33.so)
==356638==    by 0x11B64D: ls::main (ls.rs:15)
// ...

Using GDB to find the cause of the error

§ Open the ls application's executable in GDB:

gdb ./target/debug/ls

§ Print the program's source code with the l or list command:

(gdb) l

§ Run the ls application in GDB, passing the file name Cargo.toml as the input argument:

(gdb) run Cargo.toml

Starting program: /home/w/repos/my_repos/linux_commands_rewritten_in_rust/target/debug/ls Cargo.toml
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7e5a904 in readdir64 () from /usr/lib/libc.so.6

§ View the stack frames at the time of the segfault

(gdb) backtrace

#0  0x00007ffff7e5a904 in readdir64 () from /usr/lib/libc.so.6
#1  0x0000555555568952 in ls::main () at src/bin/ls.rs:15

The function that crashed is readdir64, and the frame that called it is at line 15 of ls.rs

§ List the lines around the problem code

(gdb) list 15

§ View the local variables of the problem stack frame

info variables: prints global and static variables
info locals: prints the local variables of the current stack frame
info args: prints the arguments of the current stack frame

(gdb) frame 1 # select frame 1

(gdb) info locals

(gdb) frame 1
#1  0x0000555555569317 in ls::main () at src/bin/ls.rs:20
20          let dir_entry = unsafe { libc::readdir(dir) };
(gdb) info locals
dir = 0x0
// ...

At this point we can see that dir = 0x0 in the main stack frame: dir is a null pointer, and passing it to readdir causes the segfault

Analyzing the cause of the error

let dir = unsafe { libc::opendir(input_filename.as_ptr().cast()) };
loop {
    let dir_entry = unsafe { libc::readdir(dir) };
    // ...
}

The problem is that the return value of the opendir call was never checked. When such a call fails, it returns either NULL or -1

If opendir is given a path whose file type is not a directory, the call fails

So the fix for the bug is to check whether the dir variable returned by the earlier opendir call is NULL

Fixing the segfault

Simply add code that checks whether dir is NULL and, if it is, prints the error message of the failed call and returns

if dir.is_null() {
    unsafe { libc::perror(input_filename.as_ptr().cast()); }
    return;
}
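The null check described above can also be wrapped in a small helper that reports the failure as a Rust error. The following is only a sketch under stated assumptions: the helper name check_is_dir is hypothetical, and opendir/closedir are declared by hand (with DIR treated as an opaque pointer) so the example needs no external crate, whereas the article's code gets them from the libc crate.

```rust
use std::ffi::CString;
use std::io;
use std::os::raw::{c_char, c_int, c_void};

// Hand-written declarations of the two libc functions used below.
// `unsafe extern` needs Rust 1.82+; older toolchains use `extern "C"`.
unsafe extern "C" {
    fn opendir(name: *const c_char) -> *mut c_void;
    fn closedir(dirp: *mut c_void) -> c_int;
}

/// Opens `path` with opendir(3) and closes it again.
/// Returns the OS error (for example "Not a directory") when
/// opendir returns NULL, instead of passing NULL on to readdir.
fn check_is_dir(path: &str) -> io::Result<()> {
    // CString guarantees the trailing NUL byte that C expects
    let c_path = CString::new(path)?;
    let dir = unsafe { opendir(c_path.as_ptr()) };
    if dir.is_null() {
        // errno was set by opendir; read it before any other call
        return Err(io::Error::last_os_error());
    }
    // ... calling readdir(dir) would be safe at this point ...
    unsafe { closedir(dir) };
    Ok(())
}
```

Returning io::Result lets the caller decide whether to print the error, as perror does in the fix above, or to handle it in some other way.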

Test the ls application again with a file that is not a directory

> cargo r --bin ls -- Cargo.toml
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/ls Cargo.toml`
Cargo.toml: Not a directory

There is no segfault this time, and the error message Cargo.toml: Not a directory is printed

The code changes that fix the ls application's segfault are in this commit

See the official gnu.org tutorial at www.gnu.org/software/gc…

LLDB debugging

Debugging with LLDB is almost the same as with GDB; only a few commands differ

(lldb) thread backtrace # gdb is backtrace

error: need to add support for DW_TAG_base_type '()' encoded with DW_ATE = 0x7, bit_size = 0
* thread #1, name = 'ls', stop reason = signal SIGSEGV: invalid address (fault address: 0x4)
  * frame #0: 0x00007ffff7e5a904 libc.so.6`readdir + 52
    frame #1: 0x0000555555568952 ls`ls::main::h5885f3e1b9feb06f at ls.rs:15:34
// ...

(lldb) frame select 1 # gdb is frame 1

frame #1: 0x0000555555569317 ls`ls::main::h5885f3e1b9feb06f at ls.rs:15:34
   12       
   13       let dir = unsafe { libc::opendir(input_filename.as_ptr().cast()) };
   14       loop {
-> 15           let dir_entry = unsafe { libc::readdir(dir) };
   16           if dir_entry.is_null() {
   17               // directory_entries iterator end
   18               break;

§ LLDB's frame variable command is equivalent to GDB's info args plus info locals

(gdb) info locals is equivalent to (lldb) frame variable --no-args

Besides primitive types, LLDB can print the value of a String variable, but it cannot show the value of a Vec variable

vscode-lldb debugging

Running the program without any breakpoints stops at the following instruction

7FFFF7E5A904: 0F B1 57 04 cmpxchgl %edx, 0x4(%rdi)

Pay attention to the CALL STACK panel in the Debug sidebar on the left side of VS Code (the equivalent of GDB's backtrace).

The CALL STACK panel shows that the caller of the current readdir assembly code (the second stack frame of the backtrace) is line 15 of main

Click the main stack frame, which is equivalent to (gdb) frame 1, to jump to the offending line of source code

The dir variable passed to readdir is NULL, which causes the segfault

IntelliJ Rust debugging

The Debug command jumps directly to the line of the offending code and indicates that the dir variable in libc::readdir(dir) is NULL

Dynamic analysis tools

dmesg: viewing segfault records

sudo dmesg shows the most recent kernel messages; after a segfault occurs, you will see a message like the following:

[73815.701427] ls[165042]: segfault at 4 ip 00007fafe9bb5904 sp 00007ffd78ff8510 error 6 in libc-2.33.so[7fafe9b14000+14b000]

cargo-miri: checking unsafe code

Unfortunately, Miri does not seem to support checking FFI calls yet

[w@ww linux_commands_rewritten_in_rust]$ cargo miri run --example sigabrt_free_dylib_data
   Compiling linux_commands_rewritten_in_rust v0.1.0 (/home/w/repos/my_repos/linux_commands_rewritten_in_rust)
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `/home/w/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/bin/cargo-miri target/miri/x86_64-unknown-linux-gnu/debug/examples/sigabrt_free_dylib_data`
error: unsupported operation: can't call foreign function: sqlite3_libversion
 --> examples/sigabrt_free_dylib_data.rs:5:19
  |
5 |         let ptr = sqlite3_libversion() as *mut i8;
  |                   ^^^^^^^^^^^^^^^^^^^^ can't call foreign function: sqlite3_libversion

perf: function call tree

Check the perf configuration

First, run perf record to test whether perf is allowed to read system events. If it returns an error, edit the following configuration file

sudo vim /etc/sysctl.d/sysctl.conf

Add a line to the sysctl.conf configuration file:

kernel.perf_event_paranoid = -1

Then restart the sysctl service to reload the configuration:

sudo systemctl restart systemd-sysctl.service

perf call-graph

Use perf record to record the call information of the Rust program:

perf record -a --call-graph dwarf ./target/debug/tree

When the Rust program finishes running, it generates a perf.data file in the current directory

perf report opens the perf.data file in the current directory by default. You can also specify the data file with the -i parameter

Parsing the Rust program's function call tree with perf report opens an htop-like terminal UI written with curses:

perf report --call-graph

You can select the tree::main function symbol, press Enter, and choose Zoom into tree thread to display the call tree of the functions called by main

Navigate mainly with the up, down, left and right arrow keys, and press the + key to expand or collapse the function call tree on the line under the cursor

On the author's computer, CLion's default profiler uses perf

cargo-flamegraph

cargo-flamegraph requires perf to be installed on the system; it renders the perf data into a flame graph

KCachegrind

valgrind --tool=callgrind ./target/debug/tree

Valgrind generates a callgrind.out.887505 file (887505 is the PID), which can be opened in KCachegrind for visualization

Reference: users.rust-lang.org/t/is-it-pos…

gprof

Compiling with GCC/clang plus the -pg parameter makes the program write the profiling data file gmon.out when it exits

gprof then analyzes the gmon.out file, but Rust does not support this for now

uftrace

To visualize the data as a flame graph, install uftrace together with flamegraph:

yay -S uftrace-git flamegraph-git

As with gprof/KCachegrind, collecting the data and visualizing it are two separate steps

First, compile the Rust program with a flag similar to GCC's -pg:

rustc -g -Z instrument-mcount main.rs

Or compile with gccrs or a GCC back end:

gccrs -g -pg main.rs

Uftrace then starts recording the data:

uftrace record ./main

Space is limited, so this article only covers how uftrace visualizes the data as a flame graph:

uftrace dump --flame-graph | flamegraph > ~/temp/uftrace_flamegraph.svg && google-chrome-stable ~/temp/uftrace_flamegraph.svg

uftrace record options:

--no-libcall: do not record library calls
--nest-libcall: also record nested library calls, for example the malloc() called inside new()
--kernel (requires sudo): trace kernel functions
--no-event: do not record events such as thread scheduling

ebpf

Analyzing Rust programs with eBPF should be feasible, but the author has not tried it

Now that you are familiar with the tools above, let's look at a few memory error cases

SIGABRT/double-free case study

Here is the code of the tree command, which traverses a directory with depth-first search (some irrelevant code is omitted; the full source code is linked here)

unsafe fn traverse_dir_dfs(dirp: *mut libc::DIR, indent: usize) {
    loop {
        let dir_entry = libc::readdir(dirp);
        if dir_entry.is_null() {
            let _sigabrt_line = std::env::current_dir().unwrap();
            return;
        }
        // ...
        if is_dir {
            let dirp_inner_dir = libc::opendir(filename_cstr);
            libc::chdir(filename_cstr);
            traverse_dir_dfs(dirp_inner_dir, indent + 4);
            libc::chdir("..\0".as_ptr().cast());
            libc::closedir(dirp);
        }
    }
}

Running this code fails with an error:

malloc(): unsorted double linked list corrupted

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

GDB debugging shows that the error is raised inside the std::env::current_dir() call, but the cause of the error is still unknown

Rule of thumb: a likely cause of SIGABRT

From the segfault analysis above, we know that SIGSEGV may be caused by dereferencing a NULL pointer, as in readdir(NULL)
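On the Rust side, a null dereference of this kind can be avoided by converting the raw pointer to an Option before reading through it. This is a minimal sketch; the function name read_checked is hypothetical, and it assumes that any non-null pointer passed in is valid and aligned.

```rust
/// Reads the value behind `ptr`, or returns None for a null pointer.
///
/// Safety: a non-null `ptr` must point to a valid, aligned i32.
unsafe fn read_checked(ptr: *const i32) -> Option<i32> {
    // `as_ref` maps a null pointer to None instead of dereferencing it
    let maybe_ref = unsafe { ptr.as_ref() };
    maybe_ref.copied()
}
```

The null check only guards against NULL; a dangling or misaligned pointer is still undefined behavior, which is why the function stays unsafe.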

In the author's development experience, a likely cause of SIGABRT is a double free

Checking for a double free with Valgrind

Valgrind reports an InvalidFree/DoubleFree memory error at libc::closedir(dirp)

Analyzing the cause of the double free

Looking at the source code again: before the recursive call, a pointer to the subdirectory is opened, but when the recursion returns, it is the current directory's pointer that gets closed

This means that once a directory has two or more subdirectories, the current directory's pointer may be freed twice
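The reasoning above can be checked with a toy model that only counts how often each handle is closed, without touching real directory streams. Everything here is hypothetical illustration code: CloseLog, traverse_buggy and traverse_fixed are made-up names, and the "fixed" variant (closing the subdirectory's handle) is an assumption about the intended behavior, not code taken from the author's repository.

```rust
use std::collections::HashMap;

/// Toy stand-in for DIR* bookkeeping: counts closes per handle name.
#[derive(Default)]
struct CloseLog {
    closes: HashMap<String, u32>,
}

impl CloseLog {
    fn close(&mut self, handle: &str) {
        *self.closes.entry(handle.to_string()).or_insert(0) += 1;
    }

    fn times_closed(&self, handle: &str) -> u32 {
        self.closes.get(handle).copied().unwrap_or(0)
    }
}

/// Mirrors the buggy loop: after visiting each subdirectory it closes
/// the parent's handle instead of the subdirectory's handle.
fn traverse_buggy(parent: &str, subdirs: &[&str], log: &mut CloseLog) {
    for _sub in subdirs {
        // opendir(sub) and the recursive call would happen here
        log.close(parent); // bug: the parent is closed once per subdirectory
    }
}

/// Assumed fix: each subdirectory handle is closed once after use,
/// and the parent is left for whoever opened it.
fn traverse_fixed(_parent: &str, subdirs: &[&str], log: &mut CloseLog) {
    for sub in subdirs {
        log.close(sub);
    }
}
```

With two subdirectories, the buggy variant records two closes of the parent and none of the children, which corresponds to the double free that Valgrind reports.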

This reduces the problem to three lines of code:

let dirp = libc::opendir("/home\0".as_ptr().cast());
libc::closedir(dirp);
libc::closedir(dirp);

A generic solution for double free

A C programming habit: after a pointer is freed, it must be set to NULL

let mut dirp = libc::opendir("/home\0".as_ptr().cast());
libc::closedir(dirp);
dirp = std::ptr::null_mut();
libc::closedir(dirp);
dirp = std::ptr::null_mut();

In single-threaded applications, this solution works.

The dirp pointer is NULL after the first free, so the second call receives NULL and frees nothing

Most C/Java functions begin by checking whether the input is null: if (ptr == NULL) return
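The same "set the pointer to NULL after freeing it" habit can be sketched in Rust. To keep the example independent of any directory on disk, it frees a Box-allocated buffer instead of calling closedir; the type name RawBuf and its methods are hypothetical.

```rust
/// Owns a heap buffer through a raw pointer, C-style.
struct RawBuf {
    ptr: *mut [u8; 16],
}

impl RawBuf {
    fn new() -> Self {
        RawBuf { ptr: Box::into_raw(Box::new([0u8; 16])) }
    }

    /// Frees the buffer at most once. Returns true if this call freed it.
    fn free(&mut self) -> bool {
        if self.ptr.is_null() {
            return false; // already freed: do nothing
        }
        // Safety: ptr came from Box::into_raw and has not been freed yet.
        unsafe { drop(Box::from_raw(self.ptr)) };
        self.ptr = std::ptr::null_mut(); // the "NULL after free" habit
        true
    }
}

impl Drop for RawBuf {
    fn drop(&mut self) {
        // Safe even if free() was already called explicitly.
        self.free();
    }
}
```

Calling free() twice on the same RawBuf frees the memory only once; the second call sees the null pointer and returns false. As the text notes for the C version, this guard is only sufficient in single-threaded code.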

Why does a double free sometimes not raise an error

A few questions puzzled me:

Why does writing several closedir calls in a row make the process abort with SIGABRT earlier?
Why can the process sometimes exit the loop normally despite multiple closedir calls?
Why, with multiple closedir calls in the loop, does SIGABRT only occur at std::env::current_dir()?

The reason is that a double free is not always detected right away: it may abort the process immediately, or it may go unreported until the next malloc

Because current_dir() allocates heap memory through a Vec, the memory allocator notices at that point that its memory structures have been corrupted and aborts the process

A system programming book explains this phenomenon:

one reason malloc failed is the memory structures have been corrupted, When this happens, the program may not terminate immediately

Interested readers can refer to page 260 of Beginning Linux Programming, 4th Edition

SIGABRT/free-dylib-mem case study

Suppose I want to print the version of SQLite, and the SQLite version information is stored as a static string in /usr/lib/libsqlite3.so

#[link(name = "sqlite3")]
extern "C" {
    pub fn sqlite3_libversion() -> *const libc::c_char;
}

fn main() {
    unsafe {
        let ptr = sqlite3_libversion() as *mut i8;
        let version = String::from_raw_parts(ptr.cast(), "3.23.0\0".len(), "3.23.0\0".len());
        println!("found sqlite3 version={}", version);
    }
}

When the code above is run, the process is aborted with SIGABRT because of a memory error. GDB debugging shows that the stack frame just before the error is in the destructor that runs at the end of the unsafe code block

Since version, a String, is the only value in the code that needs drop to be called automatically, we can narrow the problem down to the String destructor

SIGABRT is raised when the Rust process tries to free memory that was not allocated by Rust but belongs to the libsqlite3.so dynamic library

The solution is to prevent the String's destructor from running with std::mem::forget, which is the most common use of the mem::forget API
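A different way to avoid the problem, shown here as an alternative to mem::forget rather than as the author's method, is to never take ownership of the foreign memory at all: borrow it through CStr and copy the bytes into a String that Rust allocated itself. In this sketch, fake_libversion and FAKE_VERSION are hypothetical stand-ins for sqlite3_libversion() and the static string inside libsqlite3.so, so that the example runs without linking SQLite.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

// Stand-in for the static string inside libsqlite3.so (assumption:
// the real function returns a pointer to NUL-terminated data like this).
static FAKE_VERSION: &[u8] = b"3.23.0\0";

/// Hypothetical replacement for the FFI call in the article.
fn fake_libversion() -> *const c_char {
    FAKE_VERSION.as_ptr().cast()
}

/// Borrows the C string, then copies it into a Rust-owned String.
/// The foreign memory is only read, never freed by Rust.
///
/// Safety: `ptr` must point to a valid NUL-terminated string.
unsafe fn version_string(ptr: *const c_char) -> String {
    let borrowed = unsafe { CStr::from_ptr(ptr) };
    borrowed.to_string_lossy().into_owned()
}
```

Because the returned String owns its own copy, dropping it frees only Rust-allocated memory, and the length does not have to be hard-coded as it is in the from_raw_parts call.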

More memory error debugging cases

See the src/examples folder of the author's linux_commands_rewritten_in_rust project

The examples directory consists almost entirely of common memory bugs that the author has run into

Project link: