1. Background

Cold start time is an important indicator of App performance. As the first “door” of the user experience, it directly determines the user’s first impression of the App. Since November 2013, the Meituan Takeout iOS client has gone through dozens of version iterations; the product form has been continuously refined and the business functions have grown increasingly complex. Meanwhile, the takeout App has evolved from a standalone business App into a platform App, and new businesses such as flash purchase and errand running have been added one after another. As a result, more and more complex work has to be done during cold start, which puts pressure on the App’s cold start performance. In response, our team has carried out continuous, targeted cold start optimization based on the changes in business form and the characteristics of the takeout App, in order to deliver a smoother user experience.

2. Definition of cold start

Generally speaking, a cold start on iOS is defined as the period from the moment the user taps the App icon until the appDelegate’s didFinishLaunching method finishes executing. This process is divided into two stages:

  • T1: before main(). The operating system loads the App’s executable file into memory, performs a series of loading and linking tasks, and finally calls the App’s main() function.
  • T2: after main(). This runs from the start of main() until the appDelegate’s didFinishLaunchingWithOptions method finishes executing.

However, when didFinishLaunchingWithOptions finishes executing, the user still has not seen the App’s main interface and cannot start using the App. In the takeout App, for example, the App still has to do some initialization work and then go through location, the home page request, home page rendering, and so on, before the user can actually see content and start using it. Only then do we consider the cold start complete. We call this stage T3.

To sum up, we define the cold start of the takeout App as the process from the user tapping the App icon to the user seeing the content of the App’s main interface, namely T1+T2+T3. Each of these three stages offers many opportunities for optimization.

3. Current problems

Existing performance problems

After dozens of version iterations, the Meituan Takeout iOS client has accumulated a number of performance problems in its cold start process. Resolving these bottlenecks is the primary goal of cold start optimization. These problems mainly include:

Note: a startup item is something that has to be done during App startup, for example initializing an SDK or preloading a feature.

Incremental performance problems

In general, an App has no obvious cold start performance problems in its early stages. Cold start problems do not appear suddenly in one particular version. Instead, as versions iterate, the App’s functions become more complex, startup tasks multiply, and the cold start time grows longer and longer. By the time we notice and want to optimize, the problem has become very tricky. The incremental performance problems of the takeout App mainly come from the growth in startup items: with each iteration, startup tasks are simply piled into the startup process. If the cold start time grows by 0.1 s with each version, it will have increased significantly after several versions.

4. Governance approach

We set three main goals for governing cold start performance:

  1. Solve existing problems: remove current performance bottlenecks, optimize the startup process, and shorten the cold start time.
  2. Control increments: standardize the cold start process, and use code paradigms and documentation to guide the maintenance of future cold start code and keep added time under control.
  3. Improve monitoring: improve the monitoring of cold start performance indicators, collect more detailed data, and discover performance problems in time.

5. Standardize the startup process

By the end of 2017, Meituan Takeout had reached 250 million users, and the Meituan Takeout App had evolved from an App supporting a single business into a platform App supporting multiple businesses (see “The Promotion, Support and Thinking of Meituan Takeout iOS Multi-Reuse”), with some of the company’s emerging businesses also integrated into the App. The framework of the takeout App is divided into three layers. The bottom layer is the basic component layer. The middle layer is the takeout platform layer, which manages the basic components below it and provides unified adaptation interfaces for the business components above it. The upper layer is the business component layer, which includes the sub-business components split out of the takeout business (the takeout channels in the Takeout App and in the Meituan App can reuse these sub-business components) as well as other non-takeout businesses that have been integrated.

Turning the App into a platform gives the business teams an efficient, standardized, unified platform. At the same time, platformization and rapid business iteration have created problems for cold start:

  1. Existing startup items have piled up and slow down startup.
  2. There is no paradigm for adding new startup items, so they are messy, risky to modify, and hard to read and maintain.

To address this, we first sorted out all the startup items in the current startup process, and then designed a new startup item management model for the platform App: phased startup and self-registration of startup items.

Phased startup

In the early days the business was simple, so startup items were not differentiated and were simply piled into the didFinishLaunchingWithOptions method. As the business grew, more and more startup code accumulated there, resulting in poor performance and bloated, chaotic code.

By sorting through and analyzing the SDKs, we found that startup items need to be classified according to what they do. Some must run immediately after launch, such as crash monitoring and statistics reporting; otherwise information will be missed. Some must be completed at a fairly early time node, such as SDKs that provide user information, initialization of the location function, and network initialization. Others can be deferred, such as some custom configuration, some business service calls, the payment SDK, and the map SDK. Phased startup means dividing the startup process into several reasonable phases and assigning each startup item to a phase according to the priority of its work: high-priority items go into early phases and low-priority items into late phases.

The following is our redefinition of the startup phases of the Meituan Takeout App. We sorted out and reclassified all startup items and mapped each one to an appropriate phase. On the one hand, this defers startup items that do not need to run early and so shortens startup time. On the other hand, grouping the startup items makes them easier to read and maintain later. These rules were then written up as maintenance documentation to guide the creation and maintenance of future startup items.

Through this work, we identified more than a dozen startup items that could be deferred, about 30% of all startup items, which effectively reduced the portion of cold start time spent on startup items.
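The phase assignment described above can be sketched as follows. This is only a minimal Python illustration, not the App’s actual implementation: the phase names, the `StagedLauncher` class, and the three sample startup items are all hypothetical.

```python
from collections import defaultdict

# Hypothetical phase names, ordered from earliest to latest.
PHASES = ["immediately_after_launch", "before_home_render", "after_home_loaded"]


class StagedLauncher:
    """Groups startup items by phase and runs one phase at a time."""

    def __init__(self):
        self._items = defaultdict(list)  # phase -> list of (name, callable)
        self.executed = []               # names of items, in execution order

    def add(self, phase, name, func):
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase}")
        self._items[phase].append((name, func))

    def run_phase(self, phase):
        for name, func in self._items[phase]:
            func()
            self.executed.append(name)


launcher = StagedLauncher()
# Registration order does not matter; the declared phase decides when an item runs.
launcher.add("after_home_loaded", "map_sdk", lambda: None)
launcher.add("immediately_after_launch", "crash_monitoring", lambda: None)
launcher.add("before_home_render", "location", lambda: None)

launcher.run_phase("immediately_after_launch")
launcher.run_phase("before_home_render")
# ... the home page would be rendered here ...
launcher.run_phase("after_home_loaded")
print(launcher.executed)
```

The point of the sketch is that a deferred item such as the map SDK no longer costs anything before the home page is shown, regardless of where in the code it was registered.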

Self-registration of startup items

Once the phases are defined, the next question is how to execute the startup items in each phase. The simple approach is to create a startup manager at launch, have it read all the startup items, and trigger them when the corresponding time node arrives. This approach has two problems:

  1. All startup items must be written into a single file in advance (imported in a .m file or organized in a .plist file). This centralized approach leads to bloated code that is hard to read and maintain.
  2. Startup code cannot be reused: startup items cannot be moved down into the sub-business libraries and have to be implemented separately in the takeout App and the Meituan App, which runs counter to the platform direction of the takeout App.

We wanted startup items to be pluggable, decoupled from one another and from the business modules, and reusable across both Apps. The following figure shows the startup management approach we use, which we call self-registration of startup items: a startup item is defined inside a sub-business module, wrapped as a method, and declares its own startup phase (for example, a startup item can declare that it runs at willFinishLaunch in the standalone App and at resignActive in the Meituan App). With this model, startup items can be reused in both Apps, unrelated startup items are isolated from each other, and adding or removing a startup item becomes much easier.

So how does a startup item declare its startup phase, and how is it triggered at the right time? In code, a startup item ultimately corresponds to the execution of a function, so it can be triggered at run time as long as a pointer to that function is available. This is exactly what Kylin does: its core idea is to write data (such as function pointers) into the __DATA segment of the executable at compile time, and then read the data back from the __DATA segment at run time to perform the corresponding operation (call the function).

Why use the __DATA segment? Because it lets us cover every startup phase, including those before main().

Clang provides many compiler attributes that serve different purposes, one of which is section(). The section() attribute gives us the ability to read and write sections of the binary: it places constants that can be determined at compile time into a data section. The implementation has two parts, compile time and run time. At compile time, the compiler writes data marked with __attribute__((section())) into the specified data section, for example a {key, *pointer} pair, where the key identifies a startup phase. At run time, when the corresponding time node arrives, the function pointer is looked up by key and the function is called.

Calling KLN_STRINGS_EXPORT("Key", "Value") expands to the following:

__attribute__((used, section("__DATA" "," "__kylin__"))) static const KLN_DATA __kylin__0 = (KLN_DATA){(KLN_DATA_HEADER){"Key", KLN_STRING, KLN_IS_ARRAY}, "Value"};

As an example, the following registers startup functions to startup phase A at compile time:

// In a.m, use the registration macro to declare startup item A to be executed in the STAGE_KEY_A phase
KLN_FUNCTIONS_EXPORT(STAGE_KEY_A)() {
    // Startup item code A
}

// In b.m, declare startup item B to be executed in the STAGE_KEY_A phase
KLN_FUNCTIONS_EXPORT(STAGE_KEY_A)() {
    // Startup item code B
}

During startup, when the STAGE_KEY_A time node is reached, all startup items registered to STAGE_KEY_A are triggered. In this way, startup items register themselves very concisely, with almost no auxiliary code.

- (BOOL)application:(UIApplication *)application didFinishLaunchingWithOptions:(NSDictionary *)launchOptions {
    // Other logic
    [[KLNKylin sharedInstance] executeArrayForKey:STAGE_KEY_A];  // Trigger all startup items registered to the STAGE_KEY_A time node
    // Other logic
    return YES;
}
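The register-then-trigger pattern can be illustrated with a short Python analogy. Note the assumptions: Kylin stores function pointers in the __DATA segment at compile time, whereas this sketch uses a decorator that fills an in-memory dictionary when the module is loaded. The names `export_function` and `execute_for_key` are hypothetical stand-ins for the registration macro and the trigger call, not Kylin’s API.

```python
from collections import defaultdict

_REGISTRY = defaultdict(list)  # stage key -> functions registered to that stage


def export_function(stage_key):
    """Decorator standing in for a registration macro (hypothetical name)."""
    def register(func):
        _REGISTRY[stage_key].append(func)
        return func
    return register


def execute_for_key(stage_key):
    """Call every function registered to stage_key and return their results."""
    return [func() for func in _REGISTRY[stage_key]]


STAGE_KEY_A = "STAGE_KEY_A"


# Each module declares its own startup item next to the code it belongs to;
# no central file has to list them.
@export_function(STAGE_KEY_A)
def startup_item_a():
    return "A started"


@export_function(STAGE_KEY_A)
def startup_item_b():
    return "B started"


# At the matching time node, the host triggers everything registered to the key.
results = execute_for_key(STAGE_KEY_A)
print(results)
```

As in the approach described above, adding or removing a startup item only touches the module that owns it; the trigger site stays unchanged.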

After sorting out and optimizing the existing startup items, we also produced a specification for adding and maintaining startup items, which defines the classification principles, priorities, and startup phases for future startup items. Its purpose is to control incremental performance problems and protect the optimization results.

6. Optimizing before main()

Until main() is called, essentially all the work is done by the operating system and there is not much developers can do. To optimize this period, we must first understand what the operating system does before main(). Before main(), the operating system loads the executable file (in Mach-O format) into memory, then loads the dynamic linker dyld, and performs a series of dynamic linking operations and initialization steps (loading, binding, and initializers). For details, see the WWDC session “Optimizing App Startup Time”.

Loading process — from exec() to main()

The actual loading process starts with the exec() function, which is a system call. The operating system allocates memory for the process and performs the following operations:

  1. Load the App’s executable file into memory.
  2. Load dyld into memory.
  3. dyld performs dynamic linking.

Here is a quick breakdown of what dyld does at each stage:

Phase and work:

  • Load dylibs: dyld reads the list of dependent dynamic libraries from the header of the main executable and then has to find each dylib. The dylibs the application depends on may in turn depend on other dylibs, so what has to be loaded is the recursive set of all dependencies.
  • Rebase and Bind:
      - Rebase adjusts pointers inside the image. In the past, a dynamic library was loaded at a specified address, so all pointers and data were already correct for the code. Now that the address space layout is randomized, these pointers must be corrected by the random offset from the original address.
      - Bind sets up pointers to content outside the image. These external pointers are bound by symbol name, so dyld has to search the symbol table to find the implementation of each symbol.
  • ObjC setup:
      - Register ObjC classes.
      - Insert category definitions into the method lists.
      - Ensure that every selector is unique.
  • Initializers:
      - ObjC +load methods.
      - C/C++ functions marked with the constructor attribute.
      - Creation of C++ static global variables of non-primitive types (usually classes or structs).

Finally, dyld calls main(), and main() calls UIApplicationMain(). At this point the pre-main() phase is complete.
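To make the Rebase and Bind steps concrete, here is a toy Python model. It is a deliberately simplified sketch, not dyld’s real data structures: images are reduced to a list of internal pointer values and a dictionary of exported symbols, and all addresses are made-up numbers.

```python
def rebase(pointers, preferred_base, actual_base):
    """Shift every internal pointer by the slide between the preferred
    load address and the randomized address actually used."""
    slide = actual_base - preferred_base
    return [p + slide for p in pointers]


def bind(imports, exported_symbols):
    """Resolve each imported symbol name to an address by looking it up
    in a symbol table (here a plain dictionary)."""
    resolved = {}
    for name in imports:
        if name not in exported_symbols:
            raise LookupError(f"symbol not found: {name}")
        resolved[name] = exported_symbols[name]
    return resolved


# Rebase: the image expected to load at 0x0 but was placed at 0x4000.
internal_pointers = [0x1000, 0x1040]
rebased = rebase(internal_pointers, preferred_base=0x0, actual_base=0x4000)

# Bind: one external reference, resolved by name against another image's exports.
bound = bind(["_objc_msgSend"], {"_objc_msgSend": 0x7FFF0000, "_malloc": 0x7FFF1000})

print([hex(p) for p in rebased], bound)
```

Rebase is pure arithmetic on pointers the image already owns, while Bind needs a lookup by name, which is why a large number of external symbols adds to pre-main() time.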

With the pre-main() loading process understood, we can identify the factors that affect T1:

  1. The more dynamic libraries are loaded, the slower the startup.
  2. The more ObjC classes and methods there are, the slower the startup.
  3. The more ObjC +load methods there are, the slower the startup.
  4. The more C constructor functions there are, the slower the startup.
  5. The more C++ static objects there are, the slower the startup.

Based on these points, we carried out the following optimizations:

Code slimming

As the business iterates, new code is constantly added and old code and resource files fall out of use. In practice, unused code and files are often left in some corner of the project and never cleaned up. They increase the App’s package size and also slow down its cold start, so they need to be removed promptly.

From the Mach-O file format we know that __TEXT:__objc_methname contains all the methods in the code, while __DATA:__objc_selrefs contains references to all the methods that are actually used. Taking the difference between the two sets yields the unused code. The core method is as follows (from objc_cover):

import os
import re

def referenced_selectors(path):
    # Each otool output line points into __TEXT:__objc_methname; capture the selector name
    re_sel = re.compile("__TEXT:__objc_methname:(.+)")
    refs = set()
    # Dump __DATA,__objc_selrefs (iOS & macOS): the selectors that are actually referenced
    lines = os.popen("/usr/bin/otool -v -s __DATA __objc_selrefs %s" % path).readlines()
    for line in lines:
        results = re_sel.findall(line)
        if results:
            refs.add(results[0])
    return refs

In this way, we identified more than a dozen unused classes and 250+ unused methods.
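The set difference itself is a one-liner. The sketch below uses hand-written sample selector names rather than real otool output, and the function name `unused_selectors` is only illustrative.

```python
def unused_selectors(implemented, referenced):
    """Selectors that are implemented but never referenced, sorted for stable output."""
    return sorted(set(implemented) - set(referenced))


# Sample data standing in for the contents of __objc_methname and __objc_selrefs.
implemented = {"viewDidLoad", "legacyBannerTapped:", "refreshHome", "oldShareAction"}
referenced = {"viewDidLoad", "refreshHome"}

candidates = unused_selectors(implemented, referenced)
print(candidates)
```

The result is best treated as a list of candidates to review by hand: methods that are only invoked dynamically, for example through a selector built from a string, can appear unused in this kind of static comparison.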

+load optimization

Most iOS Apps today contain at least some +load methods, which are used to perform operations at App startup. +load methods are executed in the Initializers phase, but too many of them slow down startup, especially in medium and large Apps. Analyzing the +load methods in our App, we found that although much of this code does need to be initialized fairly early, it does not need to run as early as +load. It can be deferred to a time node after cold start completes; some routing operations are an example. In fact, a +load method can also be treated as a startup item, so when replacing +load methods we again used the Kylin approach described above.

Example:

// Replace the +load declaration with the WMAPP_BUSINESS_INIT_AFTER_HOMELOADING declaration; no other changes are required
WMAPP_BUSINESS_INIT_AFTER_HOMELOADING() {
    // Code from the original +load method
}

// Trigger all methods registered to this stage at an appropriate time
[[KLNKylin sharedInstance] executeArrayForKey:@kWMAPP_BUSINESS_INITIALIZATION_AFTER_HOMELOADING_KEY];

7. Optimize time-consuming operations

After main(), the main work is executing the various startup items (described above), building the main interface (TabBarVC, HomeVC, and so on), and loading resources (image I/O, image decoding, archive files, and so on). Time-consuming operations may be hidden in this work, and they are hard to find just by reading the code. How can we find them? The right tools make the job far more efficient.

Time Profiler

Time Profiler is the time profiling tool that ships with Xcode. It samples the call stack of each thread at a fixed interval and, by comparing the stacks between samples, estimates how long each method has been executing; the result is an approximation. There are many Time Profiler tutorials online, so we will not cover its usage here. One such tutorial is “Instruments Tutorial with Swift: Getting Started”.
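The idea behind sampling-based estimation can be shown with a small Python sketch. It is a simplification under stated assumptions, not how Time Profiler is implemented: each sample is a hand-written call stack, the sampling interval is 1 ms, and a function’s time is approximated as the number of samples it appears in multiplied by the interval.

```python
from collections import Counter


def estimate_time_ms(samples, interval_ms):
    """Approximate inclusive time per function as
    (number of samples whose stack contains it) * sampling interval."""
    counts = Counter()
    for stack in samples:
        for func in set(stack):  # count each function once per sample, even if recursive
            counts[func] += 1
    return {func: n * interval_ms for func, n in counts.items()}


# Four hypothetical samples of the main thread, outermost frame first.
samples = [
    ["main", "didFinishLaunching", "setupSDK"],
    ["main", "didFinishLaunching", "setupSDK"],
    ["main", "didFinishLaunching", "buildHomeUI"],
    ["main", "idle"],
]

estimate = estimate_time_ms(samples, interval_ms=1)
print(estimate)
```

Because anything that starts and finishes between two samples is never seen, the result is an estimate, and short-lived calls are systematically under-represented.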

Flame graph

In addition to Time Profiler, flame graphs are a useful tool for analyzing CPU time consumption, and they are more concise and intuitive than Time Profiler. The result of a flame graph analysis is a picture of time consumption across call stacks; it is called a flame graph because the whole picture looks like dancing flames. The tip of a flame is the top of the call stack and the base is the bottom of the stack. The vertical axis represents call stack depth and the horizontal axis represents time consumed. The wider a block is, the more likely it is to be a bottleneck. Analyzing a flame graph mainly means looking at the wider flames, especially those that form a flat “plateau”. Below is a flame graph from Caesium, a performance analysis tool developed by the Meituan platform:

By analyzing the flame graphs, we found many problems in the cold start process and saved more than 0.3 s. The optimizations are summarized as follows:

Optimization point and example:

  • Find hidden time-consuming operations: we found that an image was being archived during cold start, which was very time-consuming.
  • Defer and reduce I/O operations: reduce the number of animation image sets, replace large image resources, and so on, because disk I/O is very slow compared with memory operations.
  • Defer some tasks: some resource I/O, some layout logic, the timing of some object creation, and so on.

8. Optimize serial operations

Many operations in the cold start process are executed serially, and when several tasks run one after another the total time is bound to be long. If serial execution can be changed to parallel execution, the cold start time can be reduced considerably.

Use of splash screen pages

Many Apps today do not launch straight into the home page; instead they first show the user a splash screen for a short time. Used well, the splash screen can save some startup time. When an App is complex, building its UI for the first time at startup is slow. Suppose it takes 0.2 seconds. If we build the home page UI first and then add the splash screen to the window, the App will actually freeze for 0.2 seconds at cold start. If instead we make the splash screen the App’s RootViewController from the start, that build is very fast, because the splash screen contains only a simple ImageView. And since this ImageView is shown to the user for a short while, we can use that time to build the home page UI, killing two birds with one stone.

Cache location & home page pre-request

An important serial flow in the cold start of the Meituan Takeout App is home page location –> home page request –> home page rendering. These three operations account for about 77% of the total home page loading time, so to shorten the cold start time we have to optimize these three steps.

The previous serial operation flow is as follows:

In the optimized design, the client uses its cached location to pre-request the home page data at the same time as it starts locating, so that location and the request run in parallel. When the real location succeeds, we check whether it matches the cached location. If it does, the pre-requested data is valid, which saves about 40% of the home page loading time, a very noticeable improvement. If it does not, the pre-requested data is discarded and a new request is made.
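The hit-or-discard logic can be sketched in Python as follows. This is a minimal illustration under several assumptions: `load_home`, `same_area`, the 0.01-degree tolerance, and the fake request function are all hypothetical, and a background thread stands in for the asynchronous network request.

```python
from concurrent.futures import ThreadPoolExecutor


def same_area(a, b, tolerance=0.01):
    """Hypothetical hit test: both coordinates differ by at most `tolerance` degrees."""
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance


def load_home(cached_location, locate, request_home):
    """Pre-request with the cached location while locating; reuse the
    pre-requested data only if the real location hits the cache."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        prefetch = pool.submit(request_home, cached_location)  # runs in the background
        real_location = locate()                               # runs in parallel with it
        if same_area(cached_location, real_location):
            return prefetch.result(), "hit"
        prefetch.cancel()  # best effort; the pre-requested data is discarded either way
        return request_home(real_location), "miss"
    finally:
        pool.shutdown(wait=False)


def fake_request(location):
    return f"home data for {location}"


cached = (39.90, 116.40)
hit_data, hit_status = load_home(cached, lambda: (39.901, 116.401), fake_request)
miss_data, miss_status = load_home(cached, lambda: (31.23, 121.47), fake_request)
print(hit_status, miss_status)
```

On a hit, the request latency overlaps with the location latency instead of following it; on a miss, the cost is one wasted request plus the normal serial path.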

9. Data monitoring

Both Time Profiler and Caesium flame graphs can only analyze time-consuming operations offline on a single device, which is too limited: they cannot tell us how the App performs on users’ devices in production. The takeout App uses Metrics, a performance monitoring system developed in-house, to monitor App performance indicators over the long term. It helps us understand the App’s real performance in various production environments and provides reliable data to support technical optimization projects. One of the core indicators that Metrics monitors is cold start time.

Start and end time nodes of cold start

  1. End time: the end time is easy to determine. We can treat the display of certain view elements on the home page as the sign that the home page has finished loading.
  2. Start time: normally we only take control of the App after main(), but using main() as the starting point of cold start is clearly inappropriate, because T1 would not be counted. So how do we determine the start time? There are currently two common approaches in the industry. The first uses the execution time of the +load method of any class in the executable as the starting point. The second analyzes the dylib dependency graph, finds a leaf-node dylib, and uses the execution time of the +load method of one of its classes as the starting point. Given the order in which dyld loads dylibs, the latter is earlier. In both cases, however, the starting point falls within the Initializers phase, and the time before Initializers is not counted. Metrics takes a different approach and uses the App’s process creation time (that is, the time the exec function is executed) as the starting point of cold start, because the system allows us to obtain information about the process, including its creation timestamp, through the sysctl function.
#import <sys/sysctl.h>
#import <mach/mach.h>

+ (BOOL)processInfoForPID:(int)pid procInfo:(struct kinfo_proc*)procInfo
{
    int cmd[4] = {CTL_KERN, KERN_PROC, KERN_PROC_PID, pid};
    size_t size = sizeof(*procInfo);
    return sysctl(cmd, sizeof(cmd)/sizeof(*cmd), procInfo, &size, NULL, 0) == 0;
}

+ (NSTimeInterval)processStartTime
{
    struct kinfo_proc kProcInfo;
    if ([self processInfoForPID:[[NSProcessInfo processInfo] processIdentifier] procInfo:&kProcInfo]) {
        // Process start time in milliseconds (seconds * 1000 + microseconds / 1000)
        return kProcInfo.kp_proc.p_un.__p_starttime.tv_sec * 1000.0 + kProcInfo.kp_proc.p_un.__p_starttime.tv_usec / 1000.0;
    } else {
        NSAssert(NO, @"Unable to get process information");
        return 0;
    }
}

The process is created very early. In our experiments with a newly created blank App, the process creation time was 12 ms earlier than the +load method in a leaf-node dylib and 13 ms earlier than the main function (test environment: iPhone 7 Plus (iOS 12.0), Xcode 10.0, Release mode). The production data from the takeout App is even more striking: for the same model (iPhone 7 Plus) and system version (iOS 12.0), the process creation time is 688 ms earlier than the +load method in a leaf-node dylib, and across all models and system versions the figure is 878 ms.

Time nodes during cold start

We also placed a series of timing points at all the key positions in the App’s cold start. Metrics records the name of each timing point and its distance from the process creation time. We did not use automatic instrumentation, because the cold start of the takeout App is very complex and automatic instrumentation cannot be that fine-grained, so it is not practical. In addition, Metrics records a set of sequential time points on a timeline that starts at process creation, rather than a set of time periods. Sequential time points let us compute the distance between any two points and so derive any time period. A set of time periods, by contrast, cannot always be converted back into sequential time points, because the periods may not join end to end, especially with asynchronous or multithreaded execution.
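The advantage of recording time points rather than time periods can be seen in a short Python sketch. The `LaunchTimeline` class, the point names, and the millisecond values are illustrative assumptions, not the Metrics API or real measurements.

```python
class LaunchTimeline:
    """Records named time points as offsets from the process creation time."""

    def __init__(self, process_start_ms):
        self.process_start_ms = process_start_ms
        self.points = []  # (name, ms since process creation), in recording order

    def mark(self, name, now_ms):
        self.points.append((name, now_ms - self.process_start_ms))

    def between(self, start_name, end_name):
        """Duration between any two recorded points, derived on demand."""
        offsets = dict(self.points)
        return offsets[end_name] - offsets[start_name]


# Made-up timestamps in milliseconds, for illustration only.
timeline = LaunchTimeline(process_start_ms=1_000)
timeline.mark("main", 1_700)
timeline.mark("didFinishLaunching_end", 2_100)
timeline.mark("home_rendered", 3_200)

print(timeline.points)
print(timeline.between("main", "home_rendered"))
```

Any period of interest, such as main() to home page rendered, can be computed after the fact from the stored points, which would not be possible if only a fixed set of non-contiguous periods had been recorded.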

After timing is complete, Metrics reports all the timing points to the backend. The following is a screenshot of monitoring data for some of the process nodes in version 6.10 of the Meituan Takeout App:

Metrics also aggregates the data on the backend to compute the 50th, 90th, and 95th percentiles of the total cold start time and of the time at each timing point, giving us a macro view of how cold start time is distributed. In the figure below, the horizontal axis is time and the vertical axis is the number of reported samples.
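As a rough sketch of this kind of aggregation, the following Python example computes the three percentiles with the nearest-rank method. The sample values are invented, and the choice of the nearest-rank definition is an assumption; the source does not say which percentile definition Metrics uses.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p: the smallest sample such
    that at least p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, (p * len(ordered) + 99) // 100)  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]


# Invented cold start times in milliseconds from ten reports.
cold_start_ms = [1800, 2100, 1950, 2600, 2300, 1700, 3900, 2050, 2200, 2450]

summary = {p: percentile(cold_start_ms, p) for p in (50, 90, 95)}
print(summary)
```

High percentiles are worth tracking alongside the median because a single slow outlier, like the 3900 ms sample here, barely moves the 50th percentile but dominates the 95th.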

10. Summary

For a rapidly iterating App, cold start time inevitably grows as business complexity increases, and the cold start process itself is complicated. When we hit a cold start performance bottleneck, we can optimize from several angles according to the characteristics of the App and with the help of tools. At the same time, fixing existing cold start problems is only the first step in cold start governance. Cold start performance problems do not build up in a day, and they cannot be solved by a single round of optimization. We need sound design and enforced conventions to keep incremental performance problems under control, and continuous online monitoring to find and fix performance problems in time. Only then can we ensure a good cold start experience over the long term.

About the authors

Guo Sai is a senior engineer at Meituan Dianping. He joined Meituan in 2015 and is now a lead developer on the takeout iOS team, responsible for mobile business development and for building and maintaining the business infrastructure.

Xu Hong is a senior engineer at Meituan Dianping. He joined Meituan in 2016 and is now a lead developer on the takeout iOS team, responsible for mobile APM performance monitoring and high-availability infrastructure.

Recruitment

Meituan Takeout is looking for senior engineers and technical experts in Android, iOS, and FE, based in Beijing, Shanghai, and Chengdu. Please send your resume to [email protected].