Binary analysis is one of the most underrated skills in the computer industry.

Imagine having no access to a program’s source code, yet still being able to understand how it is implemented, find bugs in it, and, best of all, fix them. All of this with nothing but the binary in hand. That sounds like a superpower, right?

You can have such superpowers too, and the GNU Binary Utilities (Binutils) are a good place to start. GNU Binutils is a collection of binary tools that is installed by default on most Linux distributions.

Binary analysis is primarily used by malware analysts, reverse engineers, and people working on low-level software.

This article explores some of the tools available in Binutils. I use RHEL, but these examples should work on any Linux distribution.

[~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.6 (Maipo)
[~]# 
[~]# uname -r
3.10.0-957.el7.x86_64
[~]#

Note that some packaging commands, such as rpm, may not be available on Debian-based distributions, so use the equivalent dpkg command where applicable.

Basic knowledge of software development

In the open source world, many of us focus on software in source code form. When the source code of the software is readily available, it’s easy to get a copy of the source code, open your favorite editor, have a cup of coffee, and start exploring.

But source code is not what executes on the CPU; the CPU executes binary, or machine language, instructions. The binary, or executable, is what you get when you compile the source code. People skilled at debugging are usually well aware of this difference.

The basics of compilation

Before delving into the Binutils package itself, it’s a good idea to understand the basics of compilation.

Compilation is the process of converting a program from source code (in text form) in a programming language such as C/C++ into machine code.

Machine code is a sequence of ones and zeros that the CPU (or hardware in general) understands and can therefore execute. The machine code is saved to a file in a specific format, usually called an executable or binary file. On Linux (and on BSD, when using Linux binary compatibility), this format is called ELF (Executable and Linkable Format).

The compilation process goes through a complex series of steps before generating an executable or binary for a given source file. Take this source program (C code) as an example. Open your favorite editor and type the following program:

#include <stdio.h>

int main(void)
{
  printf("Hello World\n");
  return 0;
}

Step 1: Preprocessing with cpp

The C preprocessor (cpp) expands all macros and includes the header files. In this example, the header file stdio.h is included in the source code; stdio.h contains information about the printf function used within the program. Run cpp on the source code and save the result in a file named hello.i. You can open the file in a text editor to view its contents. The source code that prints “Hello World” is at the bottom of the file.

[testdir]# cat hello.c
#include <stdio.h>

int main(void)
{
  printf("Hello World\n");
  return 0;
}
[testdir]#
[testdir]# cpp hello.c > hello.i
[testdir]#
[testdir]# ls -lrt
total 24
-rw-r--r--. 1 root root 76 Sep 13 03:20 hello.c
-rw-r--r--. 1 root root 16877 Sep 13 03:22 hello.i
[testdir]#

Step 2: Compile with GCC

At this stage, the preprocessed source code from Step 1 is converted into assembly language instructions, without creating an object file. This stage uses the GNU Compiler Collection (gcc). Running the gcc command with the -S option on the hello.i file creates a new file called hello.s, which contains the assembly language instructions for the C program.

You can view its contents using any editor or cat command.

[testdir]#
[testdir]# gcc -Wall -S hello.i
[testdir]#
[testdir]# ls -l
total 28
-rw-r--r--. 1 root root 76 Sep 13 03:20 hello.c
-rw-r--r--. 1 root root 16877 Sep 13 03:22 hello.i
-rw-r--r--. 1 root root 448 Sep 13 03:25 hello.s
[testdir]#
[testdir]# cat hello.s
.file "hello.c"
.section .rodata
.LC0:
.string "Hello World"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $.LC0, %edi
call puts
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (GNU) 4.8.5 20150623 (Red Hat 4.8.5-36)"
.section .note.GNU-stack,"",@progbits
[testdir]#

Step 3: Assemble with as

The purpose of an assembler is to convert assembly language instructions into machine language code and generate an object file with a .o extension. This stage uses the GNU assembler (as), which is available by default on all Linux platforms.

[testdir]# as hello.s -o hello.o
[testdir]#
[testdir]# ls -l
total 32
-rw-r--r--. 1 root root 76 Sep 13 03:20 hello.c
-rw-r--r--. 1 root root 16877 Sep 13 03:22 hello.i
-rw-r--r--. 1 root root 1496 Sep 13 03:39 hello.o
-rw-r--r--. 1 root root 448 Sep 13 03:25 hello.s
[testdir]#

You now have your first file in the ELF format; however, it cannot be executed yet. Later, you will see the difference between an “object file” and an “executable file.”

[testdir]# file hello.o
hello.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped

Step 4: Link with ld

This is the final stage of compilation, in which the object file is linked to create an executable. An executable typically requires external functions, which usually come from system libraries (libc).

You can call the linker directly using the ld command; however, this command is somewhat complex. Instead, you can run the gcc compiler with the -v (verbose) flag to see how linking happens. (Linking with the ld command directly is an exercise you can explore on your own.)

[testdir]# gcc -v hello.o
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man [...] --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC)
COMPILER_PATH=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/:/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/:[...]:/usr/lib/gcc/x86_64-redhat-linux/
LIBRARY_PATH=/usr/lib/gcc/x86_64-redhat-linux/4.8.5/:/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/:/lib/../lib64/:/usr/lib/../lib64/:/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-v' '-mtune=generic' '-march=x86-64'
/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/collect2 --build-id --no-add-needed --eh-frame-hdr --hash-style=gnu [...]/../../../../lib64/crtn.o
[testdir]#

After running this command, you should see an executable named a.out:

[testdir]# ls -l
total 44
-rwxr-xr-x. 1 root root 8440 Sep 13 03:45 a.out
-rw-r--r--. 1 root root 76 Sep 13 03:20 hello.c
-rw-r--r--. 1 root root 16877 Sep 13 03:22 hello.i
-rw-r--r--. 1 root root 1496 Sep 13 03:39 hello.o
-rw-r--r--. 1 root root 448 Sep 13 03:25 hello.s

Running the file command on a.out shows that it is indeed an ELF executable:

[testdir]# file a.out
a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=48e4c11901d54d4bf1b6e3826baf18215e4255e5, not stripped

Run the executable to see if it works as the source code says:

[testdir]# ./a.out
Hello World

It works! A lot happens behind the scenes just to print “Hello World” on the screen. Imagine what happens in a more complex program.
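
The four stages above can also be driven from one small script. The following is only a sketch, assuming gcc is installed: the function name build_in_stages is made up for illustration, and it uses gcc's -E, -S, and -c options instead of calling cpp and as directly, so the commands differ slightly from the transcript above.

```shell
# Hypothetical helper: build SRC (for example hello.c) one stage at a time,
# keeping every intermediate file so it can be inspected afterwards.
build_in_stages() {
    src=$1
    base=${src%.c}

    # Step 1: preprocess only, producing hello.i
    gcc -E "$src" -o "$base.i" || return 1
    # Step 2: compile the preprocessed source to assembly, producing hello.s
    gcc -S "$base.i" -o "$base.s" || return 1
    # Step 3: assemble into a relocatable object file, producing hello.o
    gcc -c "$base.s" -o "$base.o" || return 1
    # Step 4: link the object file into an executable named hello
    gcc "$base.o" -o "$base"
}
```

Calling build_in_stages hello.c in the test directory should leave hello.i, hello.s, hello.o, and an executable named hello (rather than a.out) side by side.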

Explore the Binutils tools

The above exercise provides a good background for using the tools in the Binutils package. My system has Binutils version 2.27-34; the version on your Linux distribution may be different.

[~]# rpm -qa | grep binutils
binutils-2.27-34.base.el7.x86_64

The following tools are provided in the Binutils package:

[~]# rpm -ql binutils-2.27-34.base.el7.x86_64 | grep bin/
/usr/bin/addr2line
/usr/bin/ar
/usr/bin/as
/usr/bin/c++filt
/usr/bin/dwp
/usr/bin/elfedit
/usr/bin/gprof
/usr/bin/ld
/usr/bin/ld.bfd
/usr/bin/ld.gold
/usr/bin/nm
/usr/bin/objcopy
/usr/bin/objdump
/usr/bin/ranlib
/usr/bin/readelf
/usr/bin/size
/usr/bin/strings
/usr/bin/strip

The compilation exercise above already explored two of these tools: the as command was used as the assembler and the ld command as the linker. Read on to learn about seven of the other GNU Binutils tools listed above.

readelf: Displays information about ELF files

The exercise above mentioned the terms “object file” and “executable file.” Using the files from that exercise, run the readelf command with the -h (header) option to dump each file’s ELF header to the screen. Note that the object file ending with the .o extension is shown as Type: REL (Relocatable file):

[testdir]# readelf -h hello.o
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 [...]
[...]
Type: REL (Relocatable file)
[...]

If you try to execute the object file, you receive an error message indicating that it cannot be executed. This simply means that it does not yet have the information it needs to execute on the CPU.

Keep in mind that you first need to add the x (executable) bit to the object file using the chmod command; otherwise, you will get a “Permission denied” error.

[testdir]# ./hello.o
bash: ./hello.o: Permission denied
[testdir]# chmod +x ./hello.o
[testdir]#
[testdir]# ./hello.o
bash: ./hello.o: cannot execute binary file

If you try the same command on the a.out file, you see that its type is EXEC (Executable file).

[testdir]# readelf -h a.out
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
[...]
Type: EXEC (Executable file)

As shown above, this file can be executed directly by the CPU:

[testdir]# ./a.out
Hello World

The readelf command provides a lot of information about a binary. Here, it tells you that the file is in ELF 64-bit format, which means it can run only on a 64-bit CPU and not on a 32-bit CPU. It also tells you that it is meant to run on the x86-64 (Intel/AMD) architecture. The entry point of this binary is at address 0x400430, which is where execution begins: the startup code that eventually calls main, not main itself (objdump shows main at a different address later in this article).
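
To see where readelf gets some of these fields, the header bytes can be decoded by hand. The sketch below is an illustration, not how readelf is implemented: the function names elf_byte and elf_describe are hypothetical, it relies on od from coreutils, it assumes a little-endian file (so the low byte of each 16-bit field comes first), and it recognizes only a few values.

```shell
# Hypothetical helper: print the byte at OFFSET of FILE as a decimal number.
elf_byte() {
    od -An -tu1 -j "$2" -N 1 "$1" | tr -d ' \n'
}

# Hypothetical helper: print class, type, and machine for an ELF file.
elf_describe() {
    file=$1
    # Bytes 0-3 are the magic number: 0x7f followed by "ELF".
    magic=$(od -An -tx1 -N 4 "$file" | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not ELF"
        return 1
    fi
    # Byte 4 is the class: 1 means 32-bit, 2 means 64-bit.
    case $(elf_byte "$file" 4) in
        1) class=ELF32 ;;
        2) class=ELF64 ;;
        *) class=unknown ;;
    esac
    # Bytes 16-17 hold the file type (low byte first on little-endian files).
    case $(elf_byte "$file" 16) in
        1) type=REL ;;
        2) type=EXEC ;;
        3) type=DYN ;;
        *) type=other ;;
    esac
    # Bytes 18-19 hold the machine; 62 is x86-64.
    case $(elf_byte "$file" 18) in
        62) machine=x86-64 ;;
        *)  machine=other ;;
    esac
    echo "$class $type $machine"
}
```

For the files built in this article, elf_describe hello.o should report a REL file and elf_describe a.out an EXEC file, matching the Type lines printed by readelf -h.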

Try the readelf command on other system binaries you know, such as ls. Note that on RHEL 8, Fedora 30, and later systems, binaries are built as position-independent executables (PIE) for security reasons, so your output (especially Type) may be different.

[testdir]# readelf -h /bin/ls
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)

Run the ldd command to learn which system libraries the ls command depends on, as shown below:

[testdir]# ldd /bin/ls
linux-vdso.so.1 => (0x00007ffd7d746000)
libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f060daca000)
libcap.so.2 => /lib64/libcap.so.2 (0x00007f060d8c5000)
libacl.so.1 => /lib64/libacl.so.1 (0x00007f060d6bc000)
libc.so.6 => /lib64/libc.so.6 (0x00007f060d2ef000)
libpcre.so.1 => /lib64/libpcre.so.1 (0x00007f060d08d000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f060ce89000)
/lib64/ld-linux-x86-64.so.2 (0x00007f060dcf1000)
libattr.so.1 => /lib64/libattr.so.1 (0x00007f060cc84000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f060ca68000)

Run readelf on the libc library file to see what kind of file it is. As it points out, it is DYN (Shared object file), which means it is not a standalone program; it is meant to be used by executables that call functions the library provides.

[testdir]# readelf -h /lib64/libc.so.6
ELF Header:
Magic: 7f 45 4c 46 02 01 01 03 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - GNU
ABI Version: 0
Type: DYN (Shared object file)

size: Lists section sizes and the total size

The size command works only on object files and executables, so if you try to run it on a simple ASCII file, it reports an error saying “File format not recognized.”

[testdir]# echo "test" > file1
[testdir]# cat file1
test
[testdir]# file file1
file1: ASCII text
[testdir]# size file1
size: file1: File format not recognized

Now, run the size command on the object file and the executable from the exercise above. Note that, according to size, the executable (a.out) contains considerably more information than the object file (hello.o):

[testdir]# size hello.o
text data bss dec hex filename
89 0 0 89 59 hello.o
[testdir]# size a.out
text data bss dec hex filename
1194 540 4 1738 6ca a.out

But what do the text, data, and bss sections mean here?

The text section is the code portion of the binary, containing all the executable instructions. The data section is where all initialized data resides, and the bss section is where all uninitialized data is stored. (In the file on disk, these regions are called sections; at run time, they are loaded into memory as segments, and the two terms are sometimes used loosely for each other.)
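
The dec and hex columns in the size output are simply the sum of the text, data, and bss columns, printed in decimal and in hexadecimal. A quick check with shell arithmetic (size_total is a made-up helper name used only for this illustration):

```shell
# Hypothetical helper: recompute the "dec" and "hex" columns of size
# from the text, data, and bss columns.
size_total() {
    total=$(( $1 + $2 + $3 ))
    printf '%d %x\n' "$total" "$total"
}

size_total 89 0 0        # hello.o: prints 89 59
size_total 1194 540 4    # a.out:   prints 1738 6ca
```

Both results match the dec and hex values that size printed for hello.o and a.out above.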

Compare the size results of some other available system binaries.

For the ls command:

[testdir]# size /bin/ls
text data bss dec hex filename
103119 4768 3360 111247 1b28f /bin/ls

Just by looking at the output of the size command, you can see that gcc and gdb are much larger programs than ls:

[testdir]# size /bin/gcc
text data bss dec hex filename
755549 8464 81856 845869 ce82d /bin/gcc
[testdir]# size /bin/gdb
text data bss dec hex filename
6650433 90842 152280 6893555 692ff3 /bin/gdb

strings: Prints the strings of printable characters in files

It is often useful to add the -d flag to the strings command so that it shows only printable strings from the data sections.

hello.o is an object file that contains instructions to print the text Hello World. Therefore, the only output of the strings command is Hello World.

[testdir]# strings -d hello.o 
Hello World

On the other hand, running strings on a.out (executable) shows additional information contained in the binary at the link stage:

[testdir]# strings -d a.out
/lib64/ld-linux-x86-64.so.2
!^BU
libc.so.6
puts
__libc_start_main
__gmon_start__
GLIBC_2.2.5
UH-0
UH-0
=(
[]A\A]A^A_
Hello World
;*3$"
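
The idea behind strings can be approximated with standard text tools. This is a rough stand-in, not the real implementation: mini_strings is a hypothetical name, and unlike strings -d it scans the whole file rather than only the data sections. It keeps the same default minimum length of four characters.

```shell
# Hypothetical helper: turn every non-printable byte into a newline,
# then keep only runs of at least four printable characters.
mini_strings() {
    LC_ALL=C tr -c '[:print:]' '\n' < "$1" | LC_ALL=C grep -E '.{4,}'
}
```

Running mini_strings hello.o should include Hello World in its output, along with any other printable runs found outside the data sections, such as symbol and section names.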

objdump: Displays information from object files

Another binutils tool that dumps machine language instructions from binaries is called objdump. Use the -d option to disassemble all assembly instructions from a binary file.

Recall that compilation is the process of converting source code instructions into machine code. Machine code consists only of ones and zeros and is hard for humans to read, so it helps to represent it as assembly language instructions. What does assembly language look like? Remember that assembly language is architecture-specific; I am using the Intel (x86-64) architecture, so the instructions will be different if you compile the same program on an ARM architecture.

[testdir]# objdump -d hello.o
hello.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <main>:
0:  55              push %rbp
1:  48 89 e5        mov %rsp,%rbp
4:  bf 00 00 00 00  mov $0x0,%edi
9:  e8 00 00 00 00  callq e <main+0xe>
e:  b8 00 00 00 00  mov $0x0,%eax
13: 5d              pop %rbp
14: c3              retq

This output may seem daunting at first, but take a moment to understand it before moving on. Recall that the .text section contains all the machine code instructions. The assembly instructions appear after the raw instruction bytes on each line (push, mov, callq, pop, retq, and so on). These instructions act on registers, which are memory locations built into the CPU. The registers in this example are rbp, rsp, edi, eax, and so on, and each register has a special meaning.

Now run objdump on the executable (a.out) and see what you get. The objdump output for the executable can be large, so I use grep to narrow it down to the main function:

[testdir]# objdump -d a.out | grep -A 9 main\>
000000000040051d <main>:
40051d: 55              push %rbp
40051e: 48 89 e5        mov %rsp,%rbp
400521: bf d0 05 40 00  mov $0x4005d0,%edi
400526: e8 d5 fe ff ff  callq 400400 <puts@plt>
40052b: b8 00 00 00 00  mov $0x0,%eax
400530: 5d              pop %rbp
400531: c3              retq

Note that these instructions are similar to those in the object file hello.o, but they carry some additional information:

  • The object file hello.o has the following instruction: callq e
  • The executable a.out has the following instruction, with an address and a function: callq 400400 <puts@plt>

The assembly instruction above is calling the puts function. Remember, you used the printf function in the source code. The compiler inserted a call to the puts library function to print Hello World to the screen.
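
The bytes after the e8 opcode explain both forms of the call. On x86-64, e8 is followed by a signed 32-bit offset, stored little-endian, that is added to the address of the next instruction. The worked arithmetic below is a sketch; call_target is a hypothetical helper name.

```shell
# Hypothetical helper: compute the target of an x86-64 "call rel32".
# $1 = address of the instruction after the call, $2 = the 32-bit offset.
call_target() {
    next=$(( $1 ))
    rel=$(( $2 ))
    # Interpret the offset as a signed 32-bit value.
    if [ "$rel" -ge $(( 0x80000000 )) ]; then
        rel=$(( rel - 0x100000000 ))
    fi
    printf '%x\n' $(( next + rel ))
}

# hello.o: bytes e8 00 00 00 00 at address 9, next instruction at e.
call_target 0xe 0x0                # prints e
# a.out: bytes e8 d5 fe ff ff at 400526, next instruction at 40052b.
call_target 0x40052b 0xfffffed5    # prints 400400
```

In the object file the offset is still zero, a placeholder to be filled in at link time, so the call appears to point at the very next instruction; in the executable the offset has been filled in and the call lands at 400400.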

Look at the instruction on the line above the call to puts:

  • The object file hello.o has the instruction mov: mov $0x0,%edi
  • In the executable a.out, the mov instruction has an actual address ($0x4005d0) instead of $0x0: mov $0x4005d0,%edi

This instruction loads the address 0x4005d0, which points to a location within the binary, into the register named edi, where it serves as the first argument to puts.

What could be stored at that memory location? Yes, you guessed it: the text Hello World. How can you be sure?

The readelf command lets you dump any section of the binary (a.out) to the screen. Ask it to dump the .rodata section (read-only data):

[testdir]# readelf -x .rodata a.out

Hex dump of section '.rodata':
0x004005c0 01000200 00000000 00000000 00000000 ....
0x004005d0 48656c6c 6f20576f 726c6400 Hello World.

You can see the text Hello World on the right and its address within the binary on the left. Does it match the address you saw in the mov instruction above? Yes, it does.
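
The middle columns of the dump are the same bytes written in hexadecimal, and they can be decoded back into text by hand. A small sketch (hex_to_text is a made-up helper name; it stops at the first 00 byte, which is the terminator of a C string):

```shell
# Hypothetical helper: decode a run of hex digits into characters,
# stopping at the first NUL (00) byte.
hex_to_text() {
    hex=$(printf '%s' "$*" | tr -d ' ')
    while [ "${#hex}" -ge 2 ]; do
        byte=${hex%"${hex#??}"}    # first two hex digits
        hex=${hex#??}              # remainder of the string
        if [ "$byte" = "00" ]; then
            break
        fi
        # Convert the hex pair to an octal escape that printf understands.
        printf "\\$(printf '%03o' "0x$byte")"
    done
    printf '\n'
}

hex_to_text 48656c6c 6f20576f 726c6400    # prints Hello World
```

The input is the row of the dump that starts at 0x004005d0, the same address the mov instruction loads into edi.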

strip: Discards symbols from object files

This command is typically used to reduce the size of binaries before delivering them to customers.

Keep in mind that stripping hinders debugging, because important information is removed from the binary. The binary itself still executes perfectly.

Run the command against the a.out executable and notice what happens. First, ensure that the binary is not stripped by running the following command:

[testdir]# file a.out
a.out: ELF 64-bit LSB executable, x86-64, [......] not stripped

Also, before running the strip command, remember the initial number of bytes in the binary:

[testdir]# du -b a.out
8440 a.out

Now run the strip command on the executable, and use the file command to confirm that it worked:

[testdir]# strip a.out
[testdir]# file a.out
a.out: ELF 64-bit LSB executable, x86-64, [......] stripped

After stripping, the size of this small program dropped from 8440 bytes to 6296 bytes. With savings like this on such a tiny program, it’s no wonder that larger programs are often stripped.

[testdir]# du -b a.out
6296 a.out
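
To put a number on the savings, the two du figures can be compared with a little shell arithmetic (strip_savings is a hypothetical helper name; the percentage is rounded down):

```shell
# Hypothetical helper: report how many bytes stripping saved.
strip_savings() {
    before=$1
    after=$2
    saved=$(( before - after ))
    # Integer percentage of the original size, rounded down.
    printf '%d bytes saved (%d%%)\n' "$saved" $(( saved * 100 / before ))
}

strip_savings 8440 6296    # prints 2144 bytes saved (25%)
```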

addr2line: Converts addresses into file names and line numbers

The addr2line tool simply looks up addresses in the binary and matches them with lines in the C source code program. Cool, isn’t it?

Write another test program for this; only this time, make sure to compile it with gcc’s -g flag, which adds debugging information to the binary, including the source line numbers that help with debugging:

[testdir]# cat -n atest.c
1  #include <stdio.h>
2
3  int globalvar = 100;
4
5  int function1(void)
6  {
7    printf("Within function1\n");
8    return 0;
9  }
10
11 int function2(void)
12 {
13   printf("Within function2\n");
14   return 0;
15 }
16
17 int main(void)
18 {
19   function1();
20   function2();
21   printf("Within main\n");
22   return 0;
23 }

Compile it with the -g flag and execute it. The output is as expected:

[testdir]# gcc -g atest.c
[testdir]# ./a.out
Within function1
Within function2
Within main

Now use objdump to identify the memory addresses where the functions begin. You can use the grep command to filter out the specific lines you want. The address of each function is the value shown in front of 55 push %rbp:

[testdir]# objdump -d a.out | grep -A 2 -E 'main>:|function1>:|function2>:'
000000000040051d <function1>:
40051d: 55 push %rbp
40051e: 48 89 e5 mov %rsp,%rbp
--
0000000000400532 <function2>:
400532: 55 push %rbp
400533: 48 89 e5 mov %rsp,%rbp
--
0000000000400547 <main>:
400547: 55 push %rbp
400548: 48 89 e5 mov %rsp,%rbp

Now, use the addr2line tool to map these addresses in the binary to the corresponding lines in the C source code:

[testdir]# addr2line -e a.out 40051d
/tmp/testdir/atest.c:6
[testdir]#
[testdir]# addr2line -e a.out 400532
/tmp/testdir/atest.c:12
[testdir]#
[testdir]# addr2line -e a.out 400547
/tmp/testdir/atest.c:18

It says that 40051d starts at line 6 in the source file atest.c, which is the line with the opening brace ({) of function1. The outputs for function2 and main match in the same way.
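
Collecting the addresses by eye gets tedious for more than a few functions. In objdump -d output, each function starts with a label line such as 000000000040051d <function1>:, so the name and address can be pulled out with sed. This is a sketch under that assumption; func_addrs is a made-up helper name.

```shell
# Hypothetical helper: read objdump -d output on stdin and print
# "name address" for every function label line.
func_addrs() {
    sed -n 's/^0*\([0-9a-f][0-9a-f]*\) <\([^>]*\)>:$/\2 \1/p'
}
```

Piping objdump -d a.out into func_addrs yields one name and address per line, and each address can then be handed to addr2line -e a.out as shown above.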

nm: Lists symbols from object files

Test the nm tool using the C program above. Quickly compile it with gcc and execute it.

[testdir]# gcc atest.c
[testdir]# ./a.out
Within function1
Within function2
Within main

Now run nm and grep to get information about functions and variables:

[testdir]# nm a.out | grep -Ei 'function|main|globalvar'
000000000040051d T function1
0000000000400532 T function2
000000000060102c D globalvar
                 U __libc_start_main@@GLIBC_2.2.5
0000000000400547 T main

You can see that the functions are marked T, which stands for symbols in the text section, whereas the variable is marked D, which stands for a symbol in the initialized data section.
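
The type letters are easy to translate mechanically. The following sketch reads nm output on standard input and spells out the most common letters; nm_kinds is a hypothetical helper name, and it covers only T, D, B, and U (lowercase letters mark local symbols).

```shell
# Hypothetical helper: read nm output on stdin and name each symbol's section.
nm_kinds() {
    while read -r addr kind name; do
        # Undefined symbols have no address column, so the fields shift left.
        if [ -z "$name" ]; then
            name=$kind
            kind=$addr
        fi
        case $kind in
            T|t) where="text" ;;
            D|d) where="data" ;;
            B|b) where="bss" ;;
            U)   where="undefined" ;;
            *)   where="other" ;;
        esac
        printf '%s %s\n' "$name" "$where"
    done
}

# Two lines taken from the nm output above.
nm_kinds <<'EOF'
000000000040051d T function1
000000000060102c D globalvar
EOF
```

The same function can be fed directly from the tool, for example nm a.out | nm_kinds.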

Imagine how useful it would be to run this command on a binary whose source code you don’t have. It lets you peek inside and see which functions and variables are used. Unless, of course, the binary has been stripped, in which case it contains no symbols and the nm command is of little help, as you can see here:

[testdir]# strip a.out
[testdir]# nm a.out | grep -Ei 'function|main|globalvar'
nm: a.out: no symbols

Conclusion

The GNU Binutils tools offer many options for anyone interested in analyzing binaries, and this is just the tip of the iceberg of what they can do for you. Read the man pages for each tool to learn more about them and how to use them.


Via: opensource.com/article/19/…

By Gaurav Kamathe, lujun9972

This article is originally compiled by LCTT and released in Linux China