Contents

  • Self-cultivation for iOS Programmers: Introduction (Part 0)
  • Self-cultivation for iOS Programmers: Compile and Link Process (Part 1)
  • Self-cultivation for iOS Programmers: MachO File Structure Analysis (Part 2)
  • Self-cultivation for iOS Programmers: MachO File Static Linking (Part 3)
  • Self-cultivation for iOS Programmers: MachO File Dynamic Linking (Part 4)
  • Self-cultivation for iOS Programmers: The Fishhook Principle (Part 5)

Fishhook principle

As described in the MachO dynamic linking article, data accesses and function calls between modules go through indirect addressing. The addresses of dynamic-library data symbols that the main module accesses are stored in the GOT (non-lazy symbol pointers, nl_symbol_ptr) section, and the addresses of dynamic-library functions that it calls are stored in the la_symbol_ptr section. Both sections are readable and writable, so at run time we can change which function or global variable a symbol resolves to by overwriting the pointers in the GOT (nl_symbol_ptr) and la_symbol_ptr sections. That is how Fishhook works. Data and function addresses inside a module, by contrast, are fixed at static link time and live in the code segment (readable and executable, but not writable), so Fishhook cannot rebind symbols that are internal to a module.
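To make the idea of indirect addressing concrete, here is a minimal sketch in portable C. It does not touch a real Mach-O image: the `slot` variable is a hypothetical stand-in for a single la_symbol_ptr/GOT entry, and `original_impl`, `replacement_impl`, `call_through_slot`, and `rebind_slot` are names invented for illustration.

```c
/* Function type shared by the original and the replacement. */
typedef int (*impl_fn)(int);

/* Stand-in for a function exported by a dylib (hypothetical). */
static int original_impl(int x) { return x + 1; }

/* Stand-in for the hook we want to install (hypothetical). */
static int replacement_impl(int x) { return x * 100; }

/* Stand-in for one pointer slot in la_symbol_ptr / got:
   a writable pointer that the call site reads before jumping. */
static impl_fn slot = original_impl;

/* The call site never names the target directly; it goes through the slot. */
static int call_through_slot(int x) { return slot(x); }

/* "Rebinding" is just storing a new address in the slot.
   The previous address is returned so the caller can still reach it. */
static impl_fn rebind_slot(impl_fn new_impl) {
    impl_fn old = slot;
    slot = new_impl;
    return old;
}
```

Because every call goes through the writable slot, swapping the pointer redirects all later calls without touching the code segment. A call that is resolved directly at static link time has no such slot to overwrite.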

Here’s how Facebook introduced Fishhook:

A library that enables dynamically rebinding symbols in Mach-O binaries running on iOS.

“Symbols” here refers to the variables and functions exported by dynamic libraries, so Fishhook can replace both variables and functions.

For example 🌰 (dynamic substitution of variables and functions)

// b.m
char *global_var = "world";

// =========================
// main.m
#import <Foundation/Foundation.h>
#import "fishhook.h"

static void (*orgi_NSLog)(NSString *format, ...);
char *orgi_var = "wukaikai";
extern char *global_var;

void my_NSLog(NSString *format, ...)
{
    printf("hello %s\n", global_var);
}

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        // insert code here...
        printf("hello %s\n", global_var);
        // Rebind the function NSLog and the variable global_var
        struct rebinding rebind[2] = {{"NSLog", my_NSLog, (void *)&orgi_NSLog}, {"global_var", &orgi_var, NULL} };
        rebind_symbols(rebind, 2);
        NSLog(@"%s",global_var);
    }
    return 0;
}

// =========================
// Run these two commands in turn to generate the executable main
clang -fPIC -shared b.m -o libStr.dylib
clang -framework Foundation main.m fishhook.c -o main -L. -lStr

// =========================
// Output
hello world
hello wukaikai
// As you can see, both global_var and NSLog were replaced

Fishhook implementation analysis

Fishhook uses LINKEDIT to calculate the base address, so let's start with the load command LC_SEGMENT_64(__LINKEDIT).

LINKEDIT

The LINKEDIT segment is created by the link editor. It contains the symbol table (symtab), the dynamic symbol table (dysymtab), the string table, and so on.

From the linker's point of view, a Mach-O file is organized by section; a segment merely bundles several sections together. From the loader's point of view, however, a Mach-O file is mapped into memory segment by segment. Even if a segment's content is smaller than one memory page, it still occupies a full page. That is why a segment has both filesize and vmsize, while a section needs no vmsize. The load commands for the symbol table and the indirect symbol table have no vmsize either, so I find it helpful to think of the symbol table and the indirect symbol table as two sections inside LINKEDIT.
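The relationship between filesize and vmsize can be sketched as a simple rounding rule. This is only an illustration of "a segment occupies whole pages": `round_up_to_page` is a hypothetical helper, and the page size is passed in as a parameter because the actual value depends on the platform. The real vmsize should always be read from the segment's load command, not computed.

```c
#include <stdint.h>

/* Round size up to the next multiple of page_size.
   page_size is assumed to be a power of two. */
static uint64_t round_up_to_page(uint64_t size, uint64_t page_size) {
    return (size + page_size - 1) & ~(page_size - 1);
}
```

For example, assuming a 16 KB page (0x4000), a segment holding a single byte would still occupy 0x4000 bytes of virtual memory, and one holding 0x4001 bytes would occupy 0x8000.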

Personally, I feel that concepts such as segments, sections, and load commands are best viewed from different perspectives rather than strictly distinguished.

Replace function/variable address procedure

  1. Pass in the functions/variables to be replaced. (These functions and variables live in other modules, i.e. dylibs.)
  2. Find nl_symbol_ptr (got) / la_symbol_ptr and walk through the pointers they contain. If a pointer's symbol name matches a symbol name passed in during the first step, replace the pointer.

This raises two questions. nl_symbol_ptr (got) / la_symbol_ptr contain only pointers, so how do we know the symbol name for each pointer? And how do we find nl_symbol_ptr (got) / la_symbol_ptr in the first place?

The symbol name of the pointer

This was covered in the MachO dynamic linking article:

// Symbol-table index of the first pointer in the got section
value = IndirectSymbolTable[got.section_64.reserved1];
// SymbolTable[value] is the symbol of the first pointer in the got section
// SymbolTable[IndirectSymbolTable[got.section_64.reserved1 + 1]] is the symbol of the second pointer, and so on
// The symbol name is the C string at StringTable + SymbolTable[value].n_strx
// la_symbol_ptr works the same way, starting from la_symbol_ptr.section_64.reserved1

Starting from reserved1, the symbol name is obtained step by step. If this is hard to follow, you need to revisit the earlier parts of the series.

So once we find the symbol table, the string table, and the indirect symbol table, we can get the symbol name of each pointer. All three tables are easy to locate through the load commands.
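The lookup chain (indirect symbol table, then symbol table, then string table) can be sketched with mock data. Everything below is a stand-in: `mock_nlist` keeps only the `n_strx` field of the real nlist structure, the three `mock_` tables are hand-written arrays rather than data read from a Mach-O file, and `symbol_name_for_pointer` is a hypothetical helper.

```c
#include <stdint.h>

/* Simplified stand-in for nlist_64: only the string-table offset is kept. */
struct mock_nlist { uint32_t n_strx; };

/* String table: NUL-terminated names packed back to back.
   Offsets: 1 -> "_NSLog", 8 -> "_printf", 16 -> "_global_var". */
static const char mock_strtab[] = "\0_NSLog\0_printf\0_global_var";

/* Symbol table: entry 0 is _global_var, 1 is _NSLog, 2 is _printf. */
static const struct mock_nlist mock_symtab[] = { {16}, {1}, {8} };

/* Indirect symbol table: each entry is an index into mock_symtab.
   Entries 0..1 belong to a mock la_symbol_ptr (reserved1 = 0),
   entry 2 belongs to a mock got (reserved1 = 2). */
static const uint32_t mock_indirect_symtab[] = { 2, 1, 0 };

/* Name of the i-th pointer in a section whose reserved1 field is given. */
static const char *symbol_name_for_pointer(uint32_t reserved1, uint32_t i) {
    uint32_t symtab_index = mock_indirect_symtab[reserved1 + i];
    uint32_t strtab_offset = mock_symtab[symtab_index].n_strx;
    return mock_strtab + strtab_offset;
}
```

Note that the i-th pointer is resolved through the i-th entry of the indirect symbol table after reserved1. The symbol-table indices themselves need not be consecutive, as the mock la_symbol_ptr entries (2, then 1) show.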

Find nl_symbol_ptr (got)/la_symbol_ptr

Both sections live in the DATA segments, so we first locate DATA through its load command and then identify nl_symbol_ptr (got) / la_symbol_ptr by the flags field of each section_64:

#define	S_NON_LAZY_SYMBOL_POINTERS	0x6	/* section with only non-lazy symbol pointers */
#define S_LAZY_SYMBOL_POINTERS 0x7 /* section with only lazy symbol pointers */
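As a rough sketch of that check: the section type occupies the low byte of the flags field, so it has to be masked with SECTION_TYPE before it is compared. The constants below repeat the values from <mach-o/loader.h> so that the snippet compiles on any platform; `is_symbol_pointer_section` is a hypothetical helper, not a Fishhook function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in <mach-o/loader.h>, repeated here so the sketch
   does not depend on Apple headers. */
#define SECTION_TYPE               0x000000ffu
#define S_NON_LAZY_SYMBOL_POINTERS 0x6u
#define S_LAZY_SYMBOL_POINTERS     0x7u

/* True if the section holds lazy or non-lazy symbol pointers.
   The upper bits of flags carry attributes and must be masked off first. */
static bool is_symbol_pointer_section(uint32_t flags) {
    uint32_t type = flags & SECTION_TYPE;
    return type == S_LAZY_SYMBOL_POINTERS || type == S_NON_LAZY_SYMBOL_POINTERS;
}
```

Comparing the raw flags value against 0x6 or 0x7 would miss any section that also has attribute bits set, which is why the mask comes first.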

Source code analysis

Note: to keep the reader's attention on the main logic, I have omitted a lot of non-core logic, such as bounds checks, from the source code below. See the Fishhook repository for the full source code.

  1. Step 1: Pass in the function you want to replace
int rebind_symbols(struct rebinding rebindings[], size_t rebindings_nel) {
  // Add the new rebindings to the head of the linked list
  int retval = prepend_rebindings(&_rebindings_head, rebindings, rebindings_nel);
  // On the first call we enter the if branch. _dyld_register_func_for_add_image does two things:
  // first, like the else branch, it calls _rebind_symbols_for_image for every image already loaded;
  // second, whenever dyld loads a new image later, _rebind_symbols_for_image is called for that image too.
  if (!_rebindings_head->next) {
    _dyld_register_func_for_add_image(_rebind_symbols_for_image);
  } else {
    uint32_t c = _dyld_image_count();
    for (uint32_t i = 0; i < c; i++) {
      _rebind_symbols_for_image(_dyld_get_image_header(i), _dyld_get_image_vmaddr_slide(i));
    }
  }
  return retval;
}
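The "first call or not" test relies on the rebindings being kept in a singly linked list whose head has no successor right after the first registration. Here is a minimal sketch of that bookkeeping in portable C. The names `mock_rebinding`, `mock_entry`, `mock_prepend`, and `is_first_registration` are invented for illustration and only approximate what the real source does; see the Fishhook source for the actual implementation.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for struct rebinding. */
struct mock_rebinding {
    const char *name;
    void *replacement;
    void **replaced;
};

/* One node per registration call; newest node sits at the head. */
struct mock_entry {
    struct mock_rebinding *rebindings;
    size_t rebindings_nel;
    struct mock_entry *next;
};

/* Copy the caller's array into a new node and make it the list head.
   Returns 0 on success, -1 if an allocation fails. */
static int mock_prepend(struct mock_entry **head,
                        const struct mock_rebinding rebindings[],
                        size_t nel) {
    struct mock_entry *entry = malloc(sizeof *entry);
    if (!entry) {
        return -1;
    }
    entry->rebindings = malloc(nel * sizeof *entry->rebindings);
    if (!entry->rebindings) {
        free(entry);
        return -1;
    }
    memcpy(entry->rebindings, rebindings, nel * sizeof *entry->rebindings);
    entry->rebindings_nel = nel;
    entry->next = *head;
    *head = entry;
    return 0;
}

/* True only right after the very first prepend: the head has no successor. */
static int is_first_registration(const struct mock_entry *head) {
    return head != NULL && head->next == NULL;
}
```

Copying the array means the caller's storage can go out of scope after registration, and prepending means the most recent registration is searched first when a symbol name is matched.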

  1. Step 2: Do three things
    1. Calculate the base address (needed for item 2)
    2. Find symbol table, string table, indirect symbol table
    3. Find nl_symbol_ptr (got)/la_symbol_ptr

Items 2 and 3 are described above. So why do we need to calculate the base address? Because of ASLR (address space layout randomization). Put simply, without ASLR every program's virtual memory starts at the same address; on iOS, to make attacks harder, the start address is shifted by a random offset (the slide). (Not understanding ASLR does not affect understanding Fishhook, so you can ignore it for now.)

// rebindings: head of the rebindings linked list
static void rebind_symbols_for_image(struct rebindings_entry *rebindings,
                                     const struct mach_header *header,
                                     intptr_t slide) {
  segment_command_t *cur_seg_cmd;
  segment_command_t *linkedit_segment = NULL;   // LINKEDIT
  struct symtab_command* symtab_cmd = NULL;     // LC_SYMTAB
  struct dysymtab_command* dysymtab_cmd = NULL; // LC_DYSYMTAB

  // Walk the load commands to find LINKEDIT, LC_SYMTAB and LC_DYSYMTAB
  uintptr_t cur = (uintptr_t)header + sizeof(mach_header_t);
  for (uint i = 0; i < header->ncmds; i++, cur += cur_seg_cmd->cmdsize) {
    cur_seg_cmd = (segment_command_t *)cur;
    if (cur_seg_cmd->cmd == LC_SEGMENT_ARCH_DEPENDENT) {
      if (strcmp(cur_seg_cmd->segname, SEG_LINKEDIT) == 0) {
        linkedit_segment = cur_seg_cmd;
      }
    } else if (cur_seg_cmd->cmd == LC_SYMTAB) {
      symtab_cmd = (struct symtab_command*)cur_seg_cmd;
    } else if (cur_seg_cmd->cmd == LC_DYSYMTAB) {
      dysymtab_cmd = (struct dysymtab_command*)cur_seg_cmd;
    }
  }

  // Originally: base address = linkedit memory address - linkedit fileoff
  // With ASLR:  base address = slide + linkedit memory address - linkedit fileoff
  uintptr_t linkedit_base = (uintptr_t)slide + linkedit_segment->vmaddr - linkedit_segment->fileoff;
  nlist_t *symtab = (nlist_t *)(linkedit_base + symtab_cmd->symoff);
  char *strtab = (char *)(linkedit_base + symtab_cmd->stroff);
  uint32_t *indirect_symtab = (uint32_t *)(linkedit_base + dysymtab_cmd->indirectsymoff);

  // ===========================
  // Find nl_symbol_ptr (got) / la_symbol_ptr
  cur = (uintptr_t)header + sizeof(mach_header_t);
  for (uint i = 0; i < header->ncmds; i++, cur += cur_seg_cmd->cmdsize) {
    cur_seg_cmd = (segment_command_t *)cur;
    if (cur_seg_cmd->cmd == LC_SEGMENT_ARCH_DEPENDENT) {
      if (strcmp(cur_seg_cmd->segname, SEG_DATA) != 0 &&
          strcmp(cur_seg_cmd->segname, SEG_DATA_CONST) != 0) {
        continue;
      }
      // Look for nl_symbol_ptr (got) / la_symbol_ptr among the sections of this segment
      for (uint j = 0; j < cur_seg_cmd->nsects; j++) {
        section_t *sect =
          (section_t *)(cur + sizeof(segment_command_t)) + j;
        if ((sect->flags & SECTION_TYPE) == S_LAZY_SYMBOL_POINTERS) {
          perform_rebinding_with_section(rebindings, sect, slide, symtab, strtab, indirect_symtab);
        }
        if ((sect->flags & SECTION_TYPE) == S_NON_LAZY_SYMBOL_POINTERS) {
          perform_rebinding_with_section(rebindings, sect, slide, symtab, strtab, indirect_symtab);
        }
      }
    }
  }
}

Have you ever wondered why LINKEDIT is used to calculate the base address? In fact, the base address can be derived from any segment load command, such as TEXT or DATA (this is easy to verify). My guess is that it is because the symbol table, indirect symbol table, and string table we are looking for all live in LINKEDIT: if LINKEDIT is missing, these tables are missing too and there is nothing left to do, whereas the presence of TEXT or DATA gives no such guarantee. (This is only my conjecture; if you see it differently, feel free to share your thoughts in the comments.)
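The base-address arithmetic can be checked with a small worked example. The helpers `linkedit_base` and `table_address` are hypothetical, and the numbers in the note below are made up for illustration; in a real image they come from the LINKEDIT segment command, from LC_SYMTAB, and from dyld's slide for that image.

```c
#include <stdint.h>

/* base = slide + vmaddr - fileoff: the in-memory address that
   corresponds to file offset 0 of the image. */
static uint64_t linkedit_base(uint64_t slide, uint64_t vmaddr, uint64_t fileoff) {
    return slide + vmaddr - fileoff;
}

/* Tables such as symtab are located by file offset, so their
   in-memory address is simply base + file offset. */
static uint64_t table_address(uint64_t base, uint32_t table_file_offset) {
    return base + table_file_offset;
}
```

With an assumed slide of 0x4000, a LINKEDIT vmaddr of 0x100008000, and a fileoff of 0x8000, the base is 0x100004000. A symbol table at file offset 0x8100 then sits at 0x10000C100, which is exactly 0x100 bytes past the slid start of LINKEDIT (0x10000C000), matching its position 0x100 bytes into the segment on disk.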

  1. Step 3: Walk the pointers in nl_symbol_ptr (got) / la_symbol_ptr and replace those whose symbol name matches
static void perform_rebinding_with_section(struct rebindings_entry *rebindings,
                                           section_t *section,
                                           intptr_t slide,
                                           nlist_t *symtab,
                                           char *strtab,
                                           uint32_t *indirect_symtab) {
  uint32_t *indirect_symbol_indices = indirect_symtab + section->reserved1;
  void **indirect_symbol_bindings = (void **)((uintptr_t)slide + section->addr);
  for (uint i = 0; i < section->size / sizeof(void *); i++) {
    uint32_t symtab_index = indirect_symbol_indices[i];
    uint32_t strtab_offset = symtab[symtab_index].n_un.n_strx;
    char *symbol_name = strtab + strtab_offset;
    struct rebindings_entry *cur = rebindings;
    while (cur) {
      for (uint j = 0; j < cur->rebindings_nel; j++) {
        // &symbol_name[1] skips the leading underscore of the symbol name
        if (strcmp(&symbol_name[1], cur->rebindings[j].name) == 0) {
          // The first time, save the original function
          if (cur->rebindings[j].replaced != NULL &&
              indirect_symbol_bindings[i] != cur->rebindings[j].replacement) {
            *(cur->rebindings[j].replaced) = indirect_symbol_bindings[i];
          }
          indirect_symbol_bindings[i] = cur->rebindings[j].replacement;
          goto symbol_loop;
        }
      }
      cur = cur->next;
    }
  symbol_loop:;
  }
}
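The matching and saving logic can be exercised without a Mach-O image. The sketch below is a simplified, hypothetical version of the loop: `slots` stands in for the pointer section, `names` for the symbol names already resolved for each slot, and `sketch_rebinding` and `patch_slots` are invented names. It handles a single rebinding array rather than the linked list used by the real code.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct rebinding. */
struct sketch_rebinding {
    const char *name;   /* name without the leading underscore */
    void *replacement;  /* address to store in the slot */
    void **replaced;    /* optional out-slot for the original address */
};

/* Patch every slot whose symbol name (minus the leading '_') matches a
   rebinding. names[i] is the symbol name of slots[i].
   Returns the number of slots that were patched. */
static size_t patch_slots(void **slots, const char *const *names, size_t nslots,
                          const struct sketch_rebinding *rebindings,
                          size_t nrebindings) {
    size_t patched = 0;
    for (size_t i = 0; i < nslots; i++) {
        if (names[i][0] == '\0') {
            continue; /* nothing to compare for an empty name */
        }
        for (size_t j = 0; j < nrebindings; j++) {
            if (strcmp(&names[i][1], rebindings[j].name) != 0) {
                continue;
            }
            /* Save the original only if the slot is not already patched,
               so a second pass does not overwrite it with the hook itself. */
            if (rebindings[j].replaced != NULL &&
                slots[i] != rebindings[j].replacement) {
                *rebindings[j].replaced = slots[i];
            }
            slots[i] = rebindings[j].replacement;
            patched++;
            break; /* first matching rebinding wins for this slot */
        }
    }
    return patched;
}
```

The guard on `slots[i] != replacement` is what keeps the saved original intact when the same rebinding is applied to a slot more than once.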

Finally

Fishhook is a good way to check whether you understand the MachO file format. If you have no trouble reading the Fishhook source code, you already have a good understanding of MachO. If you don't understand the code, take another look at the previous parts of the series.