Preface

As always, let me first say what this article covers. There are two major parts: MachO files and dyld. The main focus is the overall process by which dyld loads an image (a binary image, not a picture). Both parts are conceptual knowledge, and the MachO part is relatively short, so I have combined the two into this one article 🐶.

The title refers to the second part, dyld. If you came here for the title, you can jump straight to the DYLD section.

MachO file

Mach-O is short for Mach Object file format, a file format for executables, object code, and dynamic libraries.

What are the common MachO files?

  • Object files: .o
  • Library files: .a | .dylib | xxx.framework/xxx
  • Executable files
  • Symbol files: .dSYM

You can view a file's type information in the terminal by running the file command followed by the file path.

Universal binary

A universal binary is a format introduced by Apple in which one binary file contains code for several architectures at once, for example arm64, arm64e, armv7, and armv7s.

  1. It can provide optimal performance on each architecture.
  2. Because it stores code for multiple architectures, a universal binary is larger than a single-architecture binary.
  3. The architectures have different code but share the same resources, and a universal binary holds only one copy of the resources, so it is not simply twice the size of a single-architecture package.
  4. At run time only the code for the matching architecture is executed, so no extra memory is used.
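To make the "several architectures in one file" idea concrete, here is a minimal sketch of how a loader could pick one architecture's slice out of a universal binary. The magic number, CPU type values, and field order follow the fat_header and fat_arch definitions in Apple's headers; the names Slice, findSlice, readBE32, and appendBE32 are hypothetical helpers invented for this example, and it only handles the 32-bit fat format.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Values from <mach-o/fat.h> and <mach/machine.h>.
constexpr uint32_t kFatMagic     = 0xcafebabe;   // FAT_MAGIC
constexpr uint32_t kCpuTypeArm64 = 0x0100000c;   // CPU_TYPE_ARM64
constexpr uint32_t kCpuTypeX8664 = 0x01000007;   // CPU_TYPE_X86_64

// Hypothetical result type: where one architecture's slice lives in the file.
struct Slice {
    bool     found;
    uint32_t offset;
    uint32_t size;
};

// Fat headers are stored big-endian on disk, so read them byte by byte.
static uint32_t readBE32(const std::vector<uint8_t>& buf, size_t pos) {
    return (uint32_t(buf[pos]) << 24) | (uint32_t(buf[pos + 1]) << 16) |
           (uint32_t(buf[pos + 2]) << 8) | uint32_t(buf[pos + 3]);
}

// Layout: fat_header { magic, nfat_arch } followed by nfat_arch entries of
// fat_arch { cputype, cpusubtype, offset, size, align }, all 32-bit fields.
Slice findSlice(const std::vector<uint8_t>& file, uint32_t wantedCpuType) {
    if (file.size() < 8 || readBE32(file, 0) != kFatMagic) return {false, 0, 0};
    const uint32_t count = readBE32(file, 4);
    for (uint32_t i = 0; i < count; ++i) {
        const size_t entry = 8 + size_t(i) * 20;
        if (entry + 20 > file.size()) break;      // truncated table
        if (readBE32(file, entry) == wantedCpuType)
            return {true, readBE32(file, entry + 8), readBE32(file, entry + 12)};
    }
    return {false, 0, 0};
}

// Helper for building test input: append a 32-bit value in big-endian order.
void appendBE32(std::vector<uint8_t>& buf, uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8) buf.push_back(uint8_t(v >> shift));
}
```

Only the table at the front of the file is consulted; the code of the other architectures is never touched, which is consistent with point 4 above.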

Lipo command

Those of you who have written an SDK should be familiar with this. The following command is used when merging device and simulator builds:

lipo -create [device build path/xxx.framework/xxx] [simulator build path/xxx.framework/xxx] -output [merged output file path]

Remember it? This command merges binaries of different architectures (device/simulator).

There is also a command for splitting them apart:

lipo [universal binary path] -thin [architecture to extract] -output [output path for the extracted binary]

MachO file structure

Let's start with the overall picture. As the figure shows, a MachO file can be divided into three parts: Header, Load Commands, and Data.

  1. Header: contains summary information about the MachO file, such as the magic number, CPU architecture type, and file type.
  2. Load Commands: act like a table of contents for the MachO file, specifying the starting location of each region, the symbol table, the dynamic libraries to load, and so on.
  3. Data: the largest part of the MachO file, containing the data of each segment.
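The Header and Load Commands layout described above can be sketched in a few lines. The struct fields and numeric constants below mirror mach_header_64 and load_command from <mach-o/loader.h>; the names MachHeader64, LoadCommand, fileTypeName, and countCommands are hypothetical, and the sketch assumes a 64-bit image in the host's byte order.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Field layout mirrors mach_header_64 and load_command in <mach-o/loader.h>.
struct MachHeader64 {
    uint32_t magic, cputype, cpusubtype, filetype, ncmds, sizeofcmds, flags, reserved;
};
struct LoadCommand {
    uint32_t cmd, cmdsize;
};

constexpr uint32_t kMagic64     = 0xfeedfacf;  // MH_MAGIC_64
constexpr uint32_t kLcSegment64 = 0x19;        // LC_SEGMENT_64
constexpr uint32_t kLcLoadDylib = 0x0c;        // LC_LOAD_DYLIB
constexpr uint32_t kLcMain      = 0x80000028;  // LC_MAIN

// Map the header's filetype field to the kinds of file listed earlier.
std::string fileTypeName(uint32_t filetype) {
    switch (filetype) {
        case 0x1: return "object";      // MH_OBJECT
        case 0x2: return "executable";  // MH_EXECUTE
        case 0x6: return "dylib";       // MH_DYLIB
        case 0xa: return "dsym";        // MH_DSYM
        default:  return "other";
    }
}

// Walk the load commands that follow the header and count those of one kind.
// Returns -1 when the input is not a well-formed 64-bit image.
int countCommands(const std::vector<uint8_t>& image, uint32_t wantedCmd) {
    MachHeader64 header;
    if (image.size() < sizeof(header)) return -1;
    std::memcpy(&header, image.data(), sizeof(header));
    if (header.magic != kMagic64) return -1;

    size_t pos = sizeof(header);
    int matches = 0;
    for (uint32_t i = 0; i < header.ncmds; ++i) {
        LoadCommand lc;
        if (pos + sizeof(lc) > image.size()) return -1;
        std::memcpy(&lc, image.data() + pos, sizeof(lc));
        if (lc.cmdsize < sizeof(lc) || pos + lc.cmdsize > image.size()) return -1;
        if (lc.cmd == wantedCmd) ++matches;
        pos += lc.cmdsize;   // cmdsize tells us where the next command starts
    }
    return matches;
}
```

Counting LC_LOAD_DYLIB commands this way gives the number of dynamic libraries the image asks to have loaded, which is the "table of contents" role of the Load Commands.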

DYLD

Back to the title, “Main is the entry point to the program,” which you’ve probably heard from the moment you started learning to program. But have you ever wondered why main is the entrance to your program? Where is main called? And what happens before main is called?

To answer these questions, read on as we look at dyld.

dyld (the dynamic link editor) is Apple's dynamic linker and an important part of Apple's operating systems. Once the kernel has finished preparing the program, dyld is responsible for the rest of the work. It is also open source, so anyone can download the source code from Apple's website to read how it works and learn the details of how the system loads dynamic libraries.

The load method

As we know, a class's load method is called much earlier than the main function. Code injection, including the example in the previous article, is written in the load method for exactly this reason: if methods are swapped before the app's real logic starts running, then every call made during that logic already uses the swapped methods.

To explore dyld, we first set a breakpoint in the load method of any class and run the app. We can see that there are nine frames in the call stack before the load method executes.

Explore dyLD source code

Click the first frame, _dyld_start, in the call stack to see its assembly code. Note that just before the point where execution stopped there is a call to dyldbootstrap::start, which matches the second frame in the call stack on the left. You can confirm this by looking up the meaning of the bl assembly instruction (branch with link, that is, a function call).

dyldbootstrap is a C++ namespace, and start is a function in that namespace. Next, we search the downloaded dyld source code globally for the namespace dyldbootstrap and look for a start function inside it. The start function mainly does three things:

  1. rebaseDyld: relocates dyld itself.
  2. __guard_setup: sets up stack overflow protection.
  3. Calls the _main function and returns its result.

Step 3 is the call to dyld::_main. If you look closely, it is the last call made in the start function.
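The relocation in step 1 comes down to simple arithmetic: every absolute address stored in the image is shifted by the slide, the difference between where the image was actually loaded and where the linker assumed it would be. The sketch below illustrates only that arithmetic. RebaseSketch, computeSlide, and rebase are hypothetical names, and the real rebaseDyld works from the relocation information in the MachO file rather than from a plain list of slots.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical simplified image: pointer slots that were written at link time
// as if the image were loaded at its assumed base address.
struct RebaseSketch {
    uint64_t base;                       // address the stored pointers assume
    std::vector<uint64_t> pointerSlots;  // absolute addresses stored in the data
};

// The slide is the difference between where the image actually landed and
// the base address its pointers currently assume.
int64_t computeSlide(const RebaseSketch& image, uint64_t actualBase) {
    return int64_t(actualBase - image.base);
}

// Rebasing adds the slide to every stored absolute address.
void rebase(RebaseSketch& image, uint64_t actualBase) {
    const int64_t slide = computeSlide(image, actualBase);
    for (uint64_t& slot : image.pointerSlots) slot += uint64_t(slide);
    image.base = actualBase;   // the pointers now match the real load address
}
```

For example, an image linked for 0x100000000 but loaded at 0x104000000 has a slide of 0x4000000, so a stored pointer 0x100004000 becomes 0x104004000.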

_main function

dyld::_main, so "main"? 😂 Before you call me clickbait: this _main is not the program's entry point main. Jumping to its implementation, the function turns out to be more than 600 lines long, which really is a lot.

_main

Step 1: set up the runtime environment. This mainly sets the main program's launch parameters, environment variables, and so on. The parameter mainExecutableMH is assigned to sMainExecutableMachHeader, a macho_header structure that holds the MachO header information of the current main program.

// Assign the main program's MachO header information to sMainExecutableMachHeader
sMainExecutableMachHeader = mainExecutableMH;   // MH = MachOHeader
// Save the main program's slide (memory address offset)
sMainExecutableSlide = mainExecutableSlide;

setContext() is then called to set up the context information, including callback functions, parameters, and flags. The callbacks it installs are implemented by dyld itself. For example, the loadLibrary() callback actually calls libraryLocator(), which loads the dynamic library.

// Set the context information
setContext(mainExecutableMH, argc, argv, envp, apple);

Next, dyld configures whether the process is restricted and checks the environment variables:

// Configure whether the process is restricted
configureProcessRestrictions(mainExecutableMH, envp);
......
// Check the environment variables
checkEnvironmentVariables(envp);
// If DYLD_PRINT_OPTS / DYLD_PRINT_ENV are set, print the launch options / environment variables
if (sEnv.DYLD_PRINT_OPTS) {
    printOptions(argv);
}
if ( sEnv.DYLD_PRINT_ENV ) {
    printEnvironmentVariables(envp);
}

Step 2: load the shared cache. First, here is the difference between shared libraries, dynamic libraries, and static libraries on iOS.

  1. Shared libraries. System libraries such as Foundation and UIKit are used by almost every app. If each app loaded these libraries from disk into memory separately, loading would be slower and much more memory would be used. Shared libraries solve this: a library is loaded into memory only the first time it is used, and the address of the loaded library is stored in a cache. Later, other apps check the cache to see whether the library is already loaded. If it is, the memory address is copied from the cache and saved; if it is not, the library is loaded from disk into memory.
  2. Dynamic libraries. On many other operating systems, the shared libraries described above are what dynamic libraries are. On iOS, a developer-created dynamic library is essentially a neutered shared library: to keep app processes independent of one another, iOS does not let developers create true shared libraries (truly dynamic libraries) themselves.
  3. Static libraries. A static library is simple: it is like a "folder" that holds all the functions and resource files it uses. The static library's code is compiled into the main program's MachO file during the link stage of the build.

So what does loading the shared cache do? As I understand it, dyld reads the shared library cache, loads into memory any system shared libraries the app uses that have not been loaded yet, and records the memory addresses of the loaded libraries.
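The "load once, then reuse the recorded address" behaviour described above can be sketched as a lookup table. This only illustrates the article's description; SharedCacheSketch and its members are hypothetical names, the addresses are made up, and the real dyld shared cache differs in its details.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for the shared cache: library path -> address of the
// copy that is already in memory.
class SharedCacheSketch {
public:
    // Returns the address of the library, loading it only on the first request.
    uint64_t addressOf(const std::string& path) {
        auto hit = loaded_.find(path);
        if (hit != loaded_.end()) return hit->second;   // already in memory
        const uint64_t address = loadFromDisk(path);
        loaded_[path] = address;                        // remember for later callers
        return address;
    }
    int diskLoads() const { return diskLoads_; }

private:
    // Placeholder for the expensive step; hands out fake, increasing addresses.
    uint64_t loadFromDisk(const std::string&) {
        ++diskLoads_;
        nextAddress_ += 0x100000;
        return nextAddress_;
    }
    std::map<std::string, uint64_t> loaded_;
    uint64_t nextAddress_ = 0x180000000ULL;
    int diskLoads_ = 0;
};
```

Asking for the same library twice returns the same address and touches the "disk" only once, which is the saving in time and memory that the shared cache is meant to provide.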

// Check whether the shared cache is disabled
checkSharedRegionDisable((dyld3::MachOLoaded*)mainExecutableMH, mainExecutableSlide);
......
// Map/load the shared cache
mapSharedCache();

On iOS, the shared cache cannot be disabled.

Step 3: instantiate the main program. The operating system is itself a program, one whose job is to manage other programs, and any program has variables and objects. This step creates a main-program object from the information about the main program. From the operating system's point of view, the app is one of its variables.

// Instantiate the main program: create the main program object sMainExecutable
sMainExecutable = instantiateFromLoadedImage(mainExecutableMH, mainExecutableSlide, sExecPath);

Jumping to the implementation, we see that it simply creates an ImageLoader object named image (the binary image of the main program, not a picture) and stores it by calling addImage().

static ImageLoaderMachO* instantiateFromLoadedImage(const macho_header* mh, uintptr_t slide, const char* path)
{
    // try mach-o loader
    if ( isCompatibleMachO((const uint8_t*)mh, path) ) {
    	// Create an image object. Image refers to the main program
    	ImageLoader* image = ImageLoaderMachO::instantiateMainExecutable(mh, slide, path, gLinkContext);
        // Add the main program image
        addImage(image);
        return (ImageLoaderMachO*)image;
    }
    throw "main executable not a known format";
}

Step 4: load the inserted libraries. "Inserted library" is a literal translation of the comment in the dyld source. I take it to mean dynamic libraries, including the dynamic libraries used in app development and the ones we inject when doing code injection. The code walks the array that sEnv.DYLD_INSERT_LIBRARIES points to (populated from the DYLD_INSERT_LIBRARIES environment variable), taking each inserted library and loading it in turn.

// load any inserted libraries
// Load the inserted libraries into memory. An inserted library is a non-shared dynamic library, because a static library becomes part of the main program at compile time
if ( sEnv.DYLD_INSERT_LIBRARIES != NULL ) {
    for (const char* const* lib = sEnv.DYLD_INSERT_LIBRARIES; *lib != NULL; ++lib)
    	// Load each inserted dynamic library
    	loadInsertedDylib(*lib);
}
// record count of inserted libraries so that a flat search will look at 
// inserted libraries, then main, then others.
sInsertedDylibCount = sAllImages.size()-1;

Stepping into loadInsertedDylib(*lib), we find that it calls load(path, context, cacheIndex), which in turn goes through loadPhase0, loadPhase1, loadPhase2, loadPhase3, loadPhase4, loadPhase5, loadPhase6, and other functions. I did not study the implementation in detail. A quick look shows that the library being loaded goes through checks such as signature verification and a cryptid check. (If you are interested, study the source code.)
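The DYLD_INSERT_LIBRARIES environment variable is a colon-separated list of library paths, and the loop in Step 4 iterates over the resulting array. As a small illustration of that first step, the following sketch splits such a value into individual paths. splitInsertLibraries is a hypothetical helper written for this example, not a function from dyld.

```cpp
#include <string>
#include <vector>

// Split a colon-separated list, such as the value of DYLD_INSERT_LIBRARIES,
// into individual library paths, skipping empty entries.
std::vector<std::string> splitInsertLibraries(const std::string& value) {
    std::vector<std::string> paths;
    std::string::size_type start = 0;
    while (start <= value.size()) {
        std::string::size_type end = value.find(':', start);
        if (end == std::string::npos) end = value.size();
        if (end > start) paths.push_back(value.substr(start, end - start));
        start = end + 1;   // continue after the separator
    }
    return paths;
}
```

Each resulting path would then be handed to the loader in order, which is why the order of entries in the variable determines the order in which inserted libraries are loaded.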

Step 5: link the main program and the inserted libraries

// Link the main program
link(sMainExecutable, sEnv.DYLD_BIND_AT_LAUNCH, true, ImageLoader::RPathChain(NULL, NULL), -1);
......
// Link the inserted libraries
for(unsigned int i=0; i < sInsertedDylibCount; ++i) {
	ImageLoader* image = sAllImages[i+1];
	link(image, sEnv.DYLD_BIND_AT_LAUNCH, true, ImageLoader::RPathChain(NULL, NULL), -1);
	image->setNeverUnloadRecursive();
}

link is called in two places: once for the main program and once for each inserted library. In sAllImages, the first image is the main program's, followed by the images of the inserted libraries. Inside link, symbols are bound recursively for the current image. Note: only non-lazy symbols are bound at this point; lazy symbols are bound dynamically at run time, when they are first used.
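The difference between binding at link time and binding lazily at run time can be sketched with a slot that holds a function pointer. This is a simplified model with hypothetical names (SymbolTable, ImportSlot, doubleIt); real lazy binding goes through stub code and dyld's binder rather than a C++ class.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical symbol table standing in for the symbols a library exports.
using SymbolTable = std::map<std::string, int (*)(int)>;

// One imported symbol. A non-lazy slot is resolved while linking; a lazy slot
// is resolved the first time it is called.
class ImportSlot {
public:
    ImportSlot(std::string name, const SymbolTable& exports)
        : name_(std::move(name)), exports_(exports) {}

    // What binding does: look the symbol up once and remember the target.
    void bindNow() {
        if (target_ == nullptr) { target_ = exports_.at(name_); ++lookups_; }
    }
    // What a lazy call does: bind on first use, then call straight through.
    int call(int arg) {
        bindNow();
        return target_(arg);
    }
    bool isBound() const { return target_ != nullptr; }
    int lookups() const { return lookups_; }

private:
    std::string name_;
    const SymbolTable& exports_;
    int (*target_)(int) = nullptr;
    int lookups_ = 0;
};

// Example exported function used to populate the table.
inline int doubleIt(int x) { return 2 * x; }
```

A non-lazy slot has bindNow() called on it during linking, before any code runs. A lazy slot pays the lookup cost on its first call only, so symbols that are never used are never looked up.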

Step 6: initialize the main program. Now that the previous steps have configured and loaded everything we need, it is time to initialize the main program.

// Initialize the main program
initializeMainExecutable();

Its implementation is as follows:

void initializeMainExecutable()
{
    // record that we've reached this step
    gLinkContext.startedInitializingMainExecutable = true;
    
    // run initializers for any inserted dylibs
    // Initialize all inserted libraries first
    ImageLoader::InitializerTimingList initializerTimes[allImagesCount()];
    initializerTimes[0].count = 0;
    const size_t rootCount = sImageRoots.size();
    if ( rootCount > 1 ) {
    	// Start at 1, because the 0th is the image of the main program
    	for(size_t i=1; i < rootCount; ++i) {
    		sImageRoots[i]->runInitializers(gLinkContext, initializerTimes[0]);
    	}
    }

    // run initializers for main executable and everything it brings up
    // Execute the initialization method of the main program
    sMainExecutable->runInitializers(gLinkContext, initializerTimes[0]);
    
    // register cxa_atexit() handler to run static terminators in all loaded images when this process exits
    if ( gLibSystemHelpers != NULL ) 
    	(*gLibSystemHelpers->cxa_atexit)(&runAllStaticTerminators, NULL, NULL);
    
    // dump info if requested
    // If these two environment variables are configured, the corresponding status information will be printed
    if ( sEnv.DYLD_PRINT_STATISTICS )
    	ImageLoader::printStatistics((unsigned int)allImagesCount(), initializerTimes[0]);
    if ( sEnv.DYLD_PRINT_STATISTICS_DETAILS )
    	ImageLoaderMachO::printStatisticsDetails((unsigned int)allImagesCount(), initializerTimes[0]);
}
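The ordering in initializeMainExecutable() can be sketched as follows: the inserted libraries (roots 1 and up) are initialized first, then the main program (root 0), and for each image its dependencies are initialized before the image itself. InitImage, runInitializersSketch, and initializeAll are hypothetical names, and the "initializer" here just records the image's name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical image: a name, its dependencies, and an "initialized" flag.
struct InitImage {
    std::string name;
    std::vector<InitImage*> dependencies;
    bool initialized = false;
};

// Depth-first: initialize everything an image depends on, then the image.
void runInitializersSketch(InitImage& image, std::vector<std::string>& order) {
    if (image.initialized) return;
    image.initialized = true;          // mark first so dependency cycles terminate
    for (InitImage* dep : image.dependencies) runInitializersSketch(*dep, order);
    order.push_back(image.name);       // stands in for running the initializers
}

// roots[0] is the main program; roots[1..] are the inserted libraries.
std::vector<std::string> initializeAll(const std::vector<InitImage*>& roots) {
    std::vector<std::string> order;
    for (std::size_t i = 1; i < roots.size(); ++i) runInitializersSketch(*roots[i], order);
    if (!roots.empty()) runInitializersSketch(*roots[0], order);
    return order;
}
```

With an app and one injected library that both depend on the same system library, the system library is initialized once, then the injected library, then the app. This is why code in an injected library can run before any of the app's own initializers.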

Trace the call stack

Looking at initializeMainExecutable(), you will notice that it is the fourth frame in the call stack we saw at the beginning, which means the remaining frames can be traced starting from this function. So let's follow them.

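The next frame in the stack, runInitializers, already appears in the code above. From there the stack continues through:

processInitializers

recursiveInitialization

In recursiveInitialization, once an image's dependencies have been initialized, the state dyld_image_state_dependents_initialized is reported through notifySingle, and doInitialization is then called for the image. For libobjc, that notification is what eventually leads to load_images.

Next we jump into notifySingle. Jumping to the definition does not work here, so we fall back on the old way and search for it. notifySingle never calls load_images by name; it calls through the function pointer sNotifyObjCInit.

load_images itself is not part of dyld. It lives in libobjc, so we switch to the runtime source. There, _objc_init passes load_images to _dyld_objc_notify_register.

Back in dyld, _dyld_objc_notify_register calls registerObjCNotifiers, which stores objc's load_images in sNotifyObjCInit. So when notifySingle calls sNotifyObjCInit, it is really calling objc's load_images.

Finally, inside objc's load_images we find the call to call_load_methods().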
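The hand-off between dyld and objc traced above is a plain callback registration: objc gives dyld a function pointer, dyld stores it, and later calls through it. Here is a minimal sketch of that pattern. The names loadImagesSketch, registerNotifier, notifySingleSketch, and sNotifyInit are simplified stand-ins for load_images, registerObjCNotifiers, notifySingle, and sNotifyObjCInit, and the callback signature is reduced to a single path argument.

```cpp
#include <string>
#include <vector>

// Callback type, simplified: it receives only the image path here.
using InitNotifier = void (*)(const std::string& imagePath);

// Stand-in for sNotifyObjCInit: null until objc registers itself.
static InitNotifier sNotifyInit = nullptr;
// Records calls so the example can be checked.
static std::vector<std::string> gLoadImagesCalls;

// objc side: the function that would eventually run the +load methods.
void loadImagesSketch(const std::string& imagePath) {
    gLoadImagesCalls.push_back(imagePath);
}

// dyld side: what registering the notifier boils down to.
void registerNotifier(InitNotifier notifier) {
    sNotifyInit = notifier;
}

// dyld side: the notification step. dyld never names the objc function;
// it only calls through the stored pointer, if one has been registered.
void notifySingleSketch(const std::string& imagePath) {
    if (sNotifyInit != nullptr) (*sNotifyInit)(imagePath);
}
```

This explains why searching the dyld source for load_images finds nothing useful: the connection only exists at run time, after objc has registered its callback.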

call_load_methods()

void call_load_methods(void)
{
    static bool loading = NO;
    bool more_categories;

    loadMethodLock.assertLocked();

    // Re-entrant calls do nothing; the outermost call will finish the job.
    if (loading) return;
    loading = YES;

    void *pool = objc_autoreleasePoolPush();

    do {
        // 1. Repeatedly call class +loads until there aren't any more
        while (loadable_classes_used > 0) {
            // loop call
            call_class_loads();
        }

        // 2. Call category +loads ONCE
        more_categories = call_category_loads();

        // 3. Run more +loads if there are classes OR more untried categories
    } while (loadable_classes_used > 0  ||  more_categories);

    objc_autoreleasePoolPop(pool);

    loading = NO;
}

Here, call_class_loads() is the function that actually calls each class's load method.
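The loop structure of call_load_methods() gives a clear ordering: all pending class load methods run first, then the category load methods. The sketch below reproduces only that shape with plain string queues. LoadQueues and callLoadMethodsSketch are hypothetical names, and unlike the real runtime this sketch never defers a category or discovers new classes while running.

```cpp
#include <string>
#include <vector>

// Hypothetical queues standing in for the runtime's lists of loadable
// classes and categories.
struct LoadQueues {
    std::vector<std::string> classes;
    std::vector<std::string> categories;
};

// Same shape as call_load_methods(): drain every pending class +load, then
// make one pass over the categories, and repeat while anything is pending.
std::vector<std::string> callLoadMethodsSketch(LoadQueues& queues) {
    std::vector<std::string> order;
    bool moreCategories = false;
    do {
        // 1. Class +load methods, until none are left.
        while (!queues.classes.empty()) {
            for (const std::string& cls : queues.classes)
                order.push_back("+[" + cls + " load]");
            queues.classes.clear();
        }
        // 2. Category +load methods, once.
        for (const std::string& cat : queues.categories)
            order.push_back("+[" + cat + " load]");
        queues.categories.clear();
        moreCategories = false;   // this sketch never leaves a category untried
        // 3. Loop again only if new work appeared.
    } while (!queues.classes.empty() || moreCategories);
    return order;
}
```

Given classes Person and Student and a category Person(Logging), the recorded order is the two class load methods followed by the category's load method.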

Return to main program entry

We now know when the load methods are called, but we are not done yet. Back in the _main function where we started, initializing the main program is finished, yet there is more code after it.

// find entry point for main executable
// Find the entry point of the main executable
result = (uintptr_t)sMainExecutable->getEntryFromLC_MAIN();
if ( result != 0 ) {
	// main executable uses LC_MAIN, we need to use helper in libdyld to call into main()
	if ( (gLibSystemHelpers != NULL) && (gLibSystemHelpers->version >= 9) )
		*startGlue = (uintptr_t)gLibSystemHelpers->startGlueToCallExit;
	else
		halt("libdyld.dylib support not present for LC_MAIN");
}
else {
	// main executable uses LC_UNIXTHREAD, dyld needs to let "start" in program set up for main()
	result = (uintptr_t)sMainExecutable->getEntryFromLC_UNIXTHREAD();
	*startGlue = 0;
}

getEntryFromLC_MAIN() finds the entry point of the main executable, that is, the program's main() function, not the _main() function we have been analyzing. After all this setup, _main() returns the address of the main() function it found.
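How an entry address can be derived from the LC_MAIN load command is easy to sketch: the command stores an offset, and the entry point is the address of the loaded MachO header plus that offset. The struct below has the same fields as entry_point_command in <mach-o/loader.h>; EntryPointCommand and entryFromLcMain are hypothetical names, and the addresses in the usage note are made up.

```cpp
#include <cstdint>

// Same fields as entry_point_command in <mach-o/loader.h>.
struct EntryPointCommand {
    uint32_t cmd;        // LC_MAIN
    uint32_t cmdsize;    // size of this command (24 bytes)
    uint64_t entryoff;   // offset of main() from the start of the image
    uint64_t stacksize;  // initial stack size, or 0 for the default
};

constexpr uint32_t kLcMainCommand = 0x80000028;  // LC_MAIN

// With LC_MAIN, the entry address is the loaded header address plus entryoff.
// Returns 0 when the command is not LC_MAIN, mirroring the "result != 0"
// check in the code above.
uint64_t entryFromLcMain(uint64_t loadedHeaderAddress, const EntryPointCommand& command) {
    if (command.cmd != kLcMainCommand) return 0;
    return loadedHeaderAddress + command.entryoff;
}
```

For an image loaded at 0x104000000 with an entryoff of 0x3f50, the entry point is 0x104003f50. A result of 0 corresponds to the fallback branch that uses LC_UNIXTHREAD.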

Conclusion

Time to sum up again. In summary, there are the following six……

Haha, writing this article wore me out, but I learned a lot. Now that we know the steps in an app's loading process, later articles may take advantage of the order in which these things load to crack or protect apps.

The real summary: MachO files

  1. What is a MachO file?
  2. What are the common MachO files?
  3. Universal binaries (MachO files with multiple architectures).
  4. Split and merge MachO files.
  5. MachO file format.

dyld: what happens before main?

  1. The program starts executing at _dyld_start and enters the dyld::_main function.
  2. Set up the runtime environment.
  3. Load the shared cache.
  4. Instantiate the main program.
  5. Load the inserted libraries.
  6. Link the main program and the inserted libraries.
  7. Initialize the main program. After a series of calls, each class's load method is eventually called.
  8. The doModInitFunctions function calls the C functions marked with __attribute__((constructor)).
  9. _main() finishes and returns the program's entry point, the main() function, and execution then enters the main program's main() function.
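Step 8 is easy to observe from ordinary code. The sketch below marks a function with __attribute__((constructor)), which GCC and Clang support, so that it runs before main() is entered. The names events, beforeMain, and recordMainStarted are hypothetical, and the log exists only to make the order visible.

```cpp
#include <string>
#include <vector>

// Records the order in which things run, for the example only.
static std::vector<std::string>& events() {
    // A function-local static avoids relying on global initialization order.
    static std::vector<std::string> log;
    return log;
}

// Functions marked this way run before main() is entered.
__attribute__((constructor))
static void beforeMain() {
    events().push_back("constructor");
}

// Hypothetical helper that the program's main() would call first.
void recordMainStarted() {
    events().push_back("main");
}
```

If main() calls recordMainStarted() as its first statement, the log already contains "constructor" at that point, showing that the marked function ran earlier.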

Original article: https://juejin.cn/post/6844904110857142279