I. The app launch process
1. Launch time
Definition: the work done before main() executes.
2. How to speed up launch
1. Reduce code: the less code there is, the faster the app launches.
2. Use fewer dylibs, and especially fewer embedded dylibs. In terms of launch time, system libraries are cheaper.
3. Declare fewer classes and methods, and use fewer initializers.
4. Use more Swift. Swift avoids many of the pitfalls you can run into in C, C++, and Objective-C.
3. Launch closure
Definition: all the information an app needs in order to launch, such as which dylibs it uses and the offsets of the symbols it uses.
4. Factors that affect launch time
1. Network requests
2. Loading dylibs
3. Too many, or overly complex, NIB files
5. More on Swift
Some Swift features that help launch time:
1. Swift has no static initializers that run at launch; global variables are initialized lazily on first access.
2. Swift disallows certain kinds of misaligned data structures, which would otherwise add to launch time.
3. Swift code is leaner, and development is faster.
II. The history of dyld
1. dyld 1 (1996, shipped in NeXTSTEP 3.3)
Features:
1. Before dyld 1, NeXT used static binaries. dyld 1 predates the POSIX standardization of dlopen(), and it was inefficient and slow.
2. It was written before most systems used large C++ dynamic libraries. Many C++ features (such as the ordering of static initializers) work well in a static environment but hurt performance in a dynamic one, so large C++ code bases left the dynamic linker with a lot of work to do, which slowed launches down.
3. Prebinding: every dylib in the system and in the app was assigned a fixed address. The dynamic loader loaded each image at its assigned address and, if that succeeded, edited all of these binaries so they contained the precomputed addresses. On the next launch, with everything at the same addresses, no extra work was needed. This was a huge speed boost, but it meant editing your binaries, which is not the best approach, not least for security.
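To make the prebinding trade-off concrete, here is a toy Python model (an illustration only, not dyld's real data structures; the function name and the addresses are hypothetical). Pointers written into the file are correct only if the image loads at the address prebinding assumed; otherwise every one of them has to be slid again at launch.

```python
from __future__ import annotations


def load_prebound(stored: list[int], preferred_base: int,
                  actual_base: int) -> tuple[list[int], int]:
    """Return (pointer values in memory, number of pointers that needed fixing).

    `stored` holds absolute addresses that prebinding wrote into the binary,
    computed on the assumption that the image would load at `preferred_base`.
    """
    if actual_base == preferred_base:
        # Prebinding paid off: the values on disk are already correct.
        return list(stored), 0
    # The image landed somewhere else, so every stored pointer is off by the
    # same amount (the "slide") and must be patched before it can be used.
    slide = actual_base - preferred_base
    return [p + slide for p in stored], len(stored)
```

With randomized load addresses, actual_base almost never equals preferred_base, so the fast path disappears; this is one way to see why prebinding and address randomization don't mix.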
2. dyld 2 (a complete rewrite of dyld, introduced in Mac OS X Tiger)
Features:
1. Correct support for C++ initializer semantics.
2. An extended Mach-O format and an updated dyld for efficient C++ library support.
3. A complete native dlopen() and dlsym() implementation.
4. Correct semantics; the old APIs were deprecated (they are still available on macOS but were never added to other platforms).
5. dyld 2 was designed for speed, so it performs only limited sanity checks.
3. Security and performance improvements in dyld 2
Because of security concerns, dyld 2 needed improvements to harden the existing platforms.
Because dyld 2 was faster, the amount of prebinding could be reduced. Instead of editing application binaries, only the system libraries were prebound, and that could be done at software-update time. If you ever saw "Optimizing System Performance" during a software update, that was prebinding being updated. Nowadays that step is used for all of dyld's optimizations.
Over time, dyld 2 gained a large amount of infrastructure and platform support. After first shipping on PowerPC, it added x86, x86_64, arm, and arm64, plus many derivative platforms, and it runs on all of Apple's major operating systems, each of which has needed updates to dyld.
Security was strengthened in several ways, including code signing and ASLR (address space layout randomization), which means that each time a library is loaded, it may be at a different address. For more details, see Nick's WWDC 2016 session on app startup time.
Finally, significant bounds checking was added for items in the Mach-O header, so malformed binaries can't be used for certain kinds of attacks.
On the performance side, prebinding could be eliminated entirely and replaced by the shared cache.
4. The shared cache:
The shared cache was introduced in iOS 3.1 and Mac OS X Snow Leopard, and it completely replaced prebinding. It is a single file that contains most of the operating system's dylibs, and because they are merged into one file, they can be optimized together: all the text segments and all the data segments are rearranged, and the entire symbol table is rewritten to reduce its size. As a result, only a few regions need to be mapped into each process, and the binary segments can be packed to save a lot of RAM. In effect, the shared cache is a dylib prelinker.
The specific optimizations won't be covered here, but the RAM savings are significant: on a typical iOS system, 500 MB to 1 GB of RAM is saved at runtime.
It also prebuilds data structures that dyld and the Objective-C runtime use at runtime, so that work doesn't have to happen at launch, which saves more RAM and time.
On macOS, the dyld shared cache is generated and run locally, which greatly improves system performance, among other benefits. On other platforms, the shared cache is built by Apple and shipped to users.
3. dyld 3 (the new dynamic linker)
1. Overview
dyld 3 is a complete rethink of dynamic linking. It is on by default for most macOS system apps, and it will be used by default for all system apps on Apple's 2017 OS platforms.
In future Apple OS platforms, it will completely replace dyld 2 for third-party apps as well.
2. Why write a new dynamic linker?
The first reason is performance. Performance is a constant theme: we want launch to be as fast as possible, and we think a new linker can help apps launch faster.
The second is security. Security features were added to dyld 2, but it is hard to bolt security on after the fact and keep up with real-world threats. Apple has done a lot of work here, but that goal is difficult to reach this way.
3. Could we run more aggressive security checks and be secure by design?
Finally, testability and reliability. Can dyld be made easy to test? Apple ships great testing frameworks such as XCTest, but they rely on low-level features of the dynamic linker to insert their libraries into processes, so they can't be used to test the existing dyld code. That makes it hard to test dyld's security and performance.
4. What did we do about it?
We moved most of dyld out of the process; it is now an ordinary daemon that can be tested with standard testing tools. This also leaves room for further speed and performance gains. A small part of dyld still resides in the process, but that resident part is kept as small as possible, which shrinks the attack surface, and because so little work happens there, launch gets faster. To see why, here is a brief look at how dyld 2 launches a program.
5. How dyld 2 launches a program
When your program launches with dyld 2:
1. dyld parses your Mach-O file to find out which libraries it needs. Those libraries may need other libraries, so dyld analyzes them recursively until it has the complete graph of dylibs. A typical iOS app uses 300 to 600 dylibs, which is a lot of data and a lot of processing.
2. It then maps all of those Mach-O files into the address space.
3. Next, it performs symbol lookups. If your program uses printf, dyld checks that printf is in libSystem, finds its address, and copies it into a function pointer in your app; that is binding. dyld also rebases, copying in adjusted pointers: because images are loaded at random addresses, every pointer must be adjusted by the address its image was actually loaded at.
4. Finally, it runs all the initializers. At that point, it is ready to call main.
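The four steps above can be sketched as a toy Python model. Everything here is a simplifying assumption for illustration: the MachO class stands in for a parsed image, addresses are plain integers supplied by the caller instead of real mapped memory, and each import is looked up only in the image's direct dependencies (a rough stand-in for real symbol lookup).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MachO:
    """Hypothetical, heavily simplified stand-in for a parsed Mach-O image."""
    name: str
    deps: list[str] = field(default_factory=list)            # libraries it links
    exports: dict[str, int] = field(default_factory=dict)    # symbol -> offset
    imports: list[str] = field(default_factory=list)         # symbols it uses
    rebase_offsets: list[int] = field(default_factory=list)  # internal pointers


def load_order(main: str, images: dict[str, MachO]) -> list[str]:
    """Step 1: walk dependencies recursively; each dependency precedes its users."""
    order: list[str] = []
    seen: set[str] = set()

    def visit(name: str) -> None:
        if name in seen:
            return
        seen.add(name)
        for dep in images[name].deps:
            visit(dep)
        order.append(name)

    visit(main)
    return order


def launch(main: str, images: dict[str, MachO], bases: dict[str, int]) -> dict:
    order = load_order(main, images)
    # Step 2: "map" every image; here mapping just means picking its base address.
    mapped = {name: bases[name] for name in order}
    # Step 3a: rebase - slide each internal pointer by its image's load address.
    rebased = {n: [mapped[n] + off for off in images[n].rebase_offsets] for n in order}
    # Step 3b: bind - find each imported symbol and record its absolute address.
    bindings: dict[str, dict[str, int]] = {}
    for n in order:
        bindings[n] = {}
        for sym in images[n].imports:
            for dep in images[n].deps:
                if sym in images[dep].exports:
                    bindings[n][sym] = mapped[dep] + images[dep].exports[sym]
                    break
            else:
                raise LookupError(f"{n}: symbol not found: {sym}")
    # Step 4: initializers run in load order (dependencies first), then main.
    return {"init_order": order, "rebased": rebased, "bindings": bindings}
```

Even in this toy, steps 1 and 3 do all of the parsing and searching, and they repeat on every launch.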
6. That is a lot of work. How can it be sped up, and moved out of the process?
First, identify the security-sensitive components, which from Apple's perspective are among the biggest security concerns.
One is parsing Mach-O headers and finding dependencies, because someone can attack with a tampered Mach-O header. Your program may also use @rpath, search paths that can be subverted by modifying those paths or by planting libraries in the right places. So Apple does all of this out of process, in the daemon.
Next, identify the expensive parts that can be cached: symbol lookups. For a given library, unless there is a software update or the library changes on disk, a symbol is always at the same offset within that library.
7. How these pieces work in dyld 3
These parts are moved up front and out of the process, and the result is written to disk as a launch closure, which, as mentioned earlier, holds what the program needs to launch. The closure is then used later, inside the process. So dyld 3 is made up of three parts.
It is an out-of-process Mach-O parser and compiler, an in-process engine that runs launch closures, and a launch closure cache service. Most launches use the cache and never need to call the out-of-process Mach-O parser or compiler. Launch closures are simpler than Mach-O: they are memory-mapped files that don't need complex parsing, they are simple to validate, and they are built for speed.
Here’s a detailed look at each section:
First, dyld 3 is an out-of-process Mach-O parser that resolves all search paths, all @rpaths, and all environment variables that can affect your launch.
It then parses the Mach-O binaries, performs all the symbol lookups, and uses the results to create the launch closure. It runs as a normal daemon, which lets Apple use its improved testing infrastructure.
dyld 3 is also a small in-process engine. This is the part that resides in your process, and its job is to validate that the launch closure is correct.
Then it maps in the dylibs and jumps to main.
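A toy Python sketch of that split may help. It is not dyld 3's real closure format: the LaunchClosure fields, and the use of a per-image version string to detect a stale closure, are assumptions made for illustration. build_closure plays the out-of-process role (dependency walk and symbol lookups); launch_with_closure plays the in-process engine, which only validates the closure and applies precomputed results.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchClosure:
    """Hypothetical closure: everything needed to launch, already resolved."""
    init_order: tuple[str, ...]
    versions: tuple[tuple[str, str], ...]         # (image, version) it was built for
    binds: tuple[tuple[str, str, str, int], ...]  # (client, symbol, provider, offset)


def build_closure(main, deps, exports, imports, versions) -> LaunchClosure:
    """Out of process: walk dependencies and do every symbol lookup once."""
    order: list[str] = []
    seen: set[str] = set()

    def visit(name: str) -> None:
        if name in seen:
            return
        seen.add(name)
        for dep in deps.get(name, []):
            visit(dep)
        order.append(name)

    visit(main)
    binds = []
    for client in order:
        for sym in imports.get(client, []):
            provider = next((d for d in deps.get(client, [])
                             if sym in exports.get(d, {})), None)
            if provider is None:
                raise LookupError(f"{client}: symbol not found: {sym}")
            binds.append((client, sym, provider, exports[provider][sym]))
    return LaunchClosure(tuple(order),
                         tuple((n, versions[n]) for n in order),
                         tuple(binds))


def launch_with_closure(closure: LaunchClosure, installed_versions, bases):
    """In process: no Mach-O parsing and no symbol search; validate, then apply."""
    for name, version in closure.versions:
        if installed_versions.get(name) != version:
            return None  # stale closure: a new one must be built out of process
    return {(client, sym): bases[provider] + offset
            for client, sym, provider, offset in closure.binds}
```

Mapping the images and jumping to main are left out; the point is that the in-process side only compares and adds numbers.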
You may have noticed that dyld 3 never needs to parse Mach-O headers or perform symbol lookups to launch your app. Those are the expensive steps, so skipping them greatly speeds up launch.
Finally, dyld 3 is a launch closure cache service. What does that mean? Apple runs the out-of-process tool over every Mach-O file on the system and puts the resulting closures directly into the shared cache, which is mapped in, so system apps launch with them without even opening any other files. For third-party apps, the closure is built when the app is installed or when the system is updated, because the system libraries may have changed at that point. By default, on iOS, tvOS, and watchOS, closures are prebuilt before your app even runs.
On macOS, because apps can be side-loaded, the in-process engine can, if needed, RPC to the daemon on first launch; after that, it can use the cached closure. Other platforms don't need this.
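The cache service's behavior can be sketched the same way. This is a toy model under stated assumptions: ClosureCache, the os_version key, and the build callback (standing in for the out-of-process daemon, or the RPC to it on macOS) are all hypothetical, and real closures are files rather than Python objects.

```python
from __future__ import annotations

from typing import Callable


class ClosureCache:
    """Toy launch-closure cache keyed by app, invalidated by OS changes."""

    def __init__(self, build: Callable[[str, str], object]) -> None:
        self._build = build                                # stand-in for the daemon
        self._entries: dict[str, tuple[str, object]] = {}  # app -> (os, closure)
        self.builds = 0

    def prebuild(self, app: str, os_version: str) -> None:
        """At install time or after a system update: build before the app runs."""
        self._entries[app] = (os_version, self._build(app, os_version))
        self.builds += 1

    def closure_for_launch(self, app: str, os_version: str) -> object:
        entry = self._entries.get(app)
        if entry is not None and entry[0] == os_version:
            return entry[1]  # common case: cache hit, nothing is built
        # Miss or stale entry (for example, a side-loaded macOS app on its
        # first launch): ask the builder once, then reuse the cached result.
        self.prebuild(app, os_version)
        return self._entries[app][1]
```

Keying on the OS version models the rule that closures for third-party apps are rebuilt when the system libraries change.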
Potential issues:
First, dyld 3 is designed to be fully compatible with dyld 2.x, but some existing APIs may make your program run more slowly, or fall back to a compatibility mode, under dyld 3. Try to avoid them; they are discussed below.
Also, some optimizations you've made may no longer be needed, so don't put too much effort into them.
Stricter linking semantics will be used. Many current semantics are loose or even wrong, and adding a new dynamic linker turns up many of these edge cases. Apple's approach is to keep workarounds for old binaries without carrying them forward: dyld checks which SDK a binary was linked against and disables the workarounds for binaries built with new SDKs, which lets you find and fix those issues. So newly built binaries may run into linking problems.
Next are unaligned pointers in data segments. Suppose you have a global struct that points to a function or to another global. That pointer must be fixed up before your program can start. On Apple's systems, pointers should be naturally aligned for best performance, and fixing up unaligned pointers is complex: they can span multiple pages, causing more page faults and other problems, including subtle multiprocessor-related issues. The static linker already warns about this with "ld: warning: pointer not aligned". If you eliminate that warning, the problem is solved.
The seed released this week produces this warning for some Swift key paths. Those will be fixed, so you can either ignore them or fix them. Here is how to fix it in your own code; note that creating this situation takes deliberate effort, and it can't be done in Swift:
In the talk's example, the highlighted fields are strongly typed, and by default the compiler aligns them correctly for you. Sometimes, though, you may ask for special alignment; in this example, the struct explicitly overrides the default alignment rules.
This forces the dynamic linker to fix up the pointer at launch. When you see code like this, remove the custom alignment, or rearrange the struct so the pointer comes first and is naturally aligned. Hopefully, you won't have to do this if you're writing Swift code.
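Python's ctypes follows C layout rules, so it can illustrate the problem without a C compiler. The struct and field names below are hypothetical, and _pack_ = 1 plays the role of a packed-alignment attribute in C. A char followed by a pointer puts the pointer at offset 1 when packed, the default layout pads it to a pointer-sized boundary, and moving the pointer to the front keeps it aligned even when packed.

```python
import ctypes

POINTER_SIZE = ctypes.sizeof(ctypes.c_void_p)


class PackedEntry(ctypes.Structure):
    """Packed: no padding, so the pointer starts at offset 1 (unaligned)."""
    _pack_ = 1
    _fields_ = [("tag", ctypes.c_char), ("callback", ctypes.c_void_p)]


class DefaultEntry(ctypes.Structure):
    """Default alignment: padding is inserted so the pointer is aligned."""
    _fields_ = [("tag", ctypes.c_char), ("callback", ctypes.c_void_p)]


class ReorderedEntry(ctypes.Structure):
    """Still packed, but the pointer comes first, so it sits at offset 0."""
    _pack_ = 1
    _fields_ = [("callback", ctypes.c_void_p), ("tag", ctypes.c_char)]


def pointer_is_aligned(struct_type: type) -> bool:
    """True if the struct's `callback` pointer field is naturally aligned."""
    return struct_type.callback.offset % POINTER_SIZE == 0


for cls in (PackedEntry, DefaultEntry, ReorderedEntry):
    print(cls.__name__, cls.callback.offset, pointer_is_aligned(cls))
```

A global of the packed type that is initialized with a function address is exactly the kind of data-segment pointer dyld would have to fix up at an unaligned location.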
Symbol resolution: dyld 2 performs lazy symbol resolution. Looking up every symbol is expensive, which is why caching the results pays off; without such a cache, binding everything at launch would take a lot of resources and a lot of time, so dyld 2 uses a mechanism called lazy symbol resolution instead.
By default, the function pointer your program uses for a library function such as printf doesn't point at printf. It points at a function in dyld that looks up printf and returns a pointer to it. So the first call to printf goes into dyld, which resolves it; the second call goes straight to printf.
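A toy Python model of that mechanism: each imported slot starts out pointing at a resolver, which looks the symbol up, overwrites the slot with the real function, and forwards the call. LazyBindingTable and its methods are hypothetical names; real stubs are machine code, and the resolver lives in dyld.

```python
from __future__ import annotations

from typing import Callable


class LazyBindingTable:
    """Toy model of lazy binding; each import slot starts at a resolver stub."""

    def __init__(self, exports: dict[str, Callable]) -> None:
        self._exports = exports                 # stand-in for library export tables
        self._slots: dict[str, Callable] = {}
        self.resolver_calls = 0                 # how often "dyld" was entered

    def import_symbol(self, name: str) -> None:
        def resolver(*args, **kwargs):
            # First call: look the symbol up (a KeyError models the crash on a
            # missing symbol), patch the slot, then forward the call.
            self.resolver_calls += 1
            target = self._exports[name]
            self._slots[name] = target
            return target(*args, **kwargs)

        self._slots[name] = resolver

    def call(self, name: str, *args, **kwargs):
        return self._slots[name](*args, **kwargs)
```

Only the first call pays for the lookup; after the slot is patched, calls go straight to the target.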
Because dyld 3 caches and precomputes all symbol lookups, there is no additional cost to bind every symbol at launch, so that is what it does.
This changes how missing symbols behave. With the existing lazy mechanism, if a symbol is missing, the app launches normally and crashes the first time the symbol is called. With eager binding, it crashes immediately at launch.
To handle this, dyld 3 provides a compatibility mode: if it can't find a symbol, it binds that symbol to a stub that crashes when called, so the first call crashes. That is how it works with the current SDK.
In future SDKs, all symbol resolution will be forced up front, so a missing symbol will crash at launch. You should spot these crashes during development, rather than users discovering them while the program is running.
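The difference between the compatibility behavior and the stricter future behavior can be sketched in Python. bind_at_launch, MissingSymbolError, and the strict flag are hypothetical names for illustration; in both modes every symbol is bound before main, and only the treatment of a missing one differs.

```python
from __future__ import annotations

from typing import Callable, Iterable


class MissingSymbolError(RuntimeError):
    """Raised where the real system would crash."""


def bind_at_launch(imports: Iterable[str], exports: dict[str, Callable],
                   strict: bool) -> dict[str, Callable]:
    """Bind every import before main runs.

    strict=False models the compatibility mode: a missing symbol is bound to a
    trap, so the app launches and fails on the first call to that symbol.
    strict=True models the future-SDK behavior: a missing symbol fails launch.
    """
    table: dict[str, Callable] = {}
    for name in imports:
        if name in exports:
            table[name] = exports[name]
        elif strict:
            raise MissingSymbolError(f"launch failed: missing symbol {name}")
        else:
            def trap(*args, _name=name, **kwargs):
                raise MissingSymbolError(f"called missing symbol {_name}")
            table[name] = trap
    return table
```

In strict mode the failure surfaces before any app code runs, which is what makes it easy to catch during development.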
You can emulate this behavior today with the linker flag -bind_at_load. It slows launch down considerably, so it should only go in debug builds, but there it gives you much more reliable behavior and helps you get ready for dyld 3. Use it only in debug builds!
Use dlopen, dlsym, and dladdr only when necessary. Their semantics are problematic, but in some cases they are still needed. With dlsym in particular, the symbol has to be found at runtime; it isn't known in advance, so it can't be prefetched or presearched. When you call dlopen or dlsym, dyld reads in symbol-table pages it hasn't touched before, which is expensive, and depending on the complexity it may also have to RPC to the daemon. Apple is developing better alternatives, which aren't finished yet, and wants to understand your use cases so the solutions fit your needs. Details will be released soon, and feedback is welcome.
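On POSIX systems, Python's ctypes is a thin layer over these calls: constructing ctypes.CDLL calls dlopen, and looking up an attribute on the result calls dlsym. The sketch below (load_libc and lookup are hypothetical helper names) shows why such lookups can't be precomputed into a launch closure: the symbol name is just a string that is only known at runtime.

```python
import ctypes
import ctypes.util


def load_libc() -> ctypes.CDLL:
    """Open the C library; on POSIX, ctypes.CDLL calls dlopen() under the hood."""
    path = ctypes.util.find_library("c")  # may be None on some systems
    # A path of None becomes dlopen(NULL): a handle to the running process,
    # whose global symbols include the C library's.
    return ctypes.CDLL(path)


def lookup(lib: ctypes.CDLL, name: str):
    """Resolve a symbol whose name is only known at runtime (dlsym underneath)."""
    try:
        return getattr(lib, name)
    except AttributeError:
        return None  # dlsym found nothing under that name


libc = load_libc()
symbol_name = "strlen"  # could just as well come from a config file or user input
strlen = lookup(libc, symbol_name)
if strlen is not None:
    print(strlen(b"hello"))
```

Nothing about symbol_name is visible when the app is installed, so the lookup has to happen in the process, at the moment of the call.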
dlclose is a misnomer. It's a Unix API; if we were naming it on our own system, we would call it dlrelease, because it doesn't actually close the dylib. It decrements a reference count, and only when the count reaches zero is the dylib closed. Why does this matter? If you have a library for a particular piece of hardware, don't shut the hardware down in response to dlclose: other code in the process may also have opened that library, so it won't actually close. Use explicit resource management instead.
Our platforms also have many features that prevent a dylib from being unloaded. Here are a few examples, since you might be counting on unloading: a dylib that contains Objective-C classes can't be unloaded; neither can one that contains Swift classes, or one that uses C __thread or C++ thread_local variables. macOS, which runs some existing Unix software, will keep supporting unloading, but on our other platforms almost every dylib does one of these things, so unloading doesn't work well there, and dlclose may be treated as a no-op on those platforms. Let us know if this causes problems.
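Those semantics can be modeled as a reference count plus a set of images that can never be unloaded. This is a toy Python sketch with hypothetical names (ImageRegistry, the pins_image flag), not a description of dyld's internals; the pinned case stands for the Objective-C, Swift, and thread-local cases above.

```python
from __future__ import annotations


class ImageRegistry:
    """Toy model: dlopen retains an image, dlclose releases it."""

    def __init__(self) -> None:
        self.refcounts: dict[str, int] = {}
        self.pinned: set[str] = set()  # images that can never be unloaded

    def dlopen(self, name: str, pins_image: bool = False) -> str:
        # pins_image stands for "contains Objective-C or Swift classes, or uses
        # C __thread / C++ thread_local variables".
        if pins_image:
            self.pinned.add(name)
        self.refcounts[name] = self.refcounts.get(name, 0) + 1
        return name

    def dlclose(self, name: str) -> bool:
        """Decrement the count; return True only if the image was unloaded."""
        count = self.refcounts.get(name, 0)
        if count == 0:
            raise ValueError(f"{name} is not open")
        self.refcounts[name] = count - 1
        if count - 1 == 0 and name not in self.pinned:
            del self.refcounts[name]
            return True
        return False

    def is_loaded(self, name: str) -> bool:
        return name in self.refcounts
```

A dlclose that returns False is the case to design for: the library, and any hardware it manages, is still in use, which is why explicit resource management is the safer pattern.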
dyld_all_image_infos is an interface for inspecting the dylibs in a process. It dates back to dyld 1, but it's just a struct in memory rather than an API. That was fine with 5 or 10 dylibs, but with 300, 400, or 500 dylibs, its design wastes a lot of memory, and we need that memory back. For performance and memory savings, it will be removed in a future release, and an alternative API will be provided. It is rarely used; if you do use it, we'd like to know why and how, to make sure the new API fits your use case. Many of its fields are no longer accurate or don't behave as you would expect; if you don't need them, you can ignore them. Please let us know how you use it.
Best practices
First, add -bind_at_load to your linker flags, but only in debug builds.
Fix any unaligned pointers in data segments and eliminate all "pointer not aligned" warnings. Warnings caused by the new Swift key path features can be ignored, because Apple will fix them.
We would like to know why you use dlopen, dlsym, dladdr, and dyld_all_image_infos, to make sure the alternative APIs meet your needs.
The ones that are part of POSIX will be kept, though they may be slower; dyld_all_image_infos will be removed to save memory.
Please file bugs with "DYLD USAGE" in the title so we can support all of your use cases. For more information, visit: developer.apple.com/videos/play…