Authors: Yang Xikai, Wu Wenyang
Starting in 2019, Amap spent three years optimizing end-to-end performance, ultimately cutting time spent on the core links by roughly 50% and greatly improving the user experience. This article summarizes the thinking and practical experience in performance optimization accumulated along the way, in the hope that it will be useful to you.
Comparison before and after optimization (time before optimization taken as the 100% baseline)
Overall approach
The overall approach has three parts: identifying performance bottlenecks, solving problems in reverse order through dedicated projects, and long-term control in forward order:
- Identify performance bottlenecks: find the points worth targeting. A scientific, clear assessment is essential here. Reasonable evaluation criteria must be able to judge whether the performance experience is good or bad, stay close to users' real perception, and make the targets quantifiable. Only then can a dedicated project stay on target, execute efficiently, and avoid detours.
- Solve problems in reverse order through dedicated projects: performance problems are never a single team's business problem; they usually require multiple product and engineering teams to cooperate. We start from the problem, work backward to pull the resources of multiple teams together as a dedicated project, set a target, achieve results quickly, and build the team's confidence.
- Long-term control in forward order: optimization works backward from "effect" to "cause"; it solves problems in reverse order. To stop problems at the source of the "cause", and to keep the optimization gains from regressing, our third idea is long-term, continuous forward-order control: prevent existing business from continuously deteriorating while consolidating the results of the dedicated projects.
The following sections examine each of these three parts.
Identify performance bottlenecks
Setting standards
First-screen loading speed strongly affects user experience, so first-screen display time is used as the statistical standard for page timing. As phone hardware keeps improving, many high-end devices have enough raw performance to mask application-level problems, so we optimize across device models of different tiers to cover as many online users as possible.
Statistical standards
First-screen display time is the statistical standard. The dimensions used to pin it down are as follows:
- Business: what counts as the first screen differs across pages, depending on the business form;
- Product perspective: the first screen is defined around feature usage, with high-frequency features given priority;
- Engineering perspective: the start and end of the first screen are anchored by log points instrumented along the business flow;
- Shared jank standard: a unified, quantitative standard gives product and engineering a common language.
Model standards
- Model tiers: devices are scored and divided into high, middle and low tiers;
- Model selection: representative models are chosen within each tier. We select according to each device's share among online users, covering as many users as possible, and prefer typical manufacturers' devices. The devices already available in the test lab also matter, since procurement is not always timely and unnecessary waste should be avoided.
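As a minimal sketch of this selection logic (device names, tiers and share numbers are hypothetical, not Amap's real data), benchmark models can be picked per tier by online user share:

```kotlin
// Hypothetical device record: model name, tier label, share of online users.
data class Device(val model: String, val tier: String, val onlineShare: Double)

// For each tier, keep the few models that the largest share of online users
// actually hold, so a small lab pool still covers the most users.
fun pickBenchmarkDevices(devices: List<Device>, perTier: Int): Map<String, List<Device>> =
    devices.groupBy { it.tier }
        .mapValues { (_, group) ->
            group.sortedByDescending { it.onlineShare }.take(perTier)
        }
```

In practice the greedy pick would be further constrained by manufacturer diversity and what the lab already owns, as described above.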
Determining optimization points
Amap carries a long history of accumulated code, so every scene we optimize involves complex business logic, and there are even business blind spots. Quickly analyzing a large amount of historical business code and accurately locating time costs is a serious challenge. Relying on manual analysis alone is unrealistic in both headcount and time, so tools and methodology are needed to speed things up.
Identify optimization points from top to bottom
- Device-dimension analysis:
Unlimited business scenarios run on the limited performance resources of mobile devices, so resource allocation is inevitably stretched thin. Time costs therefore have to be analyzed per device: what is slow on a low-end phone may be no problem at all on a high-end one, the optimization points differ, and targeted per-tier strategies are needed. For example, complex interactive animation is a time cost on low-end phones; disabling some animations on the search page brought considerable performance gains there without hurting the user experience, while on high-end phones the same cost can be ignored.
- Cross-business dimension analysis:
With so many business touchpoints, why analyze the time costs of travel, search and similar scenarios first? The jank metrics and BI data have to speak for themselves. Analysis of online user behavior shows that most users enter the search page by tapping the search box, and online user feedback about jank also concentrates on search. Compared with other features, the timeliness and importance of the search page are self-evident. So, by the ranked online business data, this scenario comes first, performance resources should tilt toward it, and its time costs should be analyzed before anything else.
- Intra-business dimension analysis:
First, the whole link of the business scenario itself is sorted out to find the key timing points. Then the scenario log tool instruments these points and uploads them to the server, collecting real first-screen timing data from online users and providing a solid basis for quantifying targets. The timestamp difference between two adjacent key points is the phase time, which helps break down where the time goes. Around this analysis process we have also built up a number of tools to assist the work.
Minimum set: subtraction and addition
The minimum set is a bottoming-out exercise: keep only the minimum needed to operate without deforming the first-screen product form, and strip out everything irrelevant to the first screen. That is the subtraction. This minimum set can be understood as the best we can do without changing the existing architecture. If the limit data of the minimum set meets the target, the necessary dependencies are added back one by one on top of it to keep the product complete, while the remaining dependencies are optimized, removed, or deferred. That is the addition. If the limit data of the minimum set cannot meet the target, optimization points have to come from other dimensions; breakthroughs can generally be found in network time cost and architectural soundness.
Solve problems in reverse order
Performance problems are never a single team's business problem; they usually require multiple product and engineering teams to cooperate. The idea is:
Start from the problem, work backward to pull the resources of multiple teams together as a dedicated project, set a target, achieve results quickly, and build the team's confidence.
The dedicated-project engine
A dedicated project is also a process of building while fighting and consolidating while fighting. Our approach to performance problems was initially quite scattered: problems were solved one by one, and the same work had to be repeated the next time in a different scenario. So our thinking became:
Accumulate reusable schemes, problem-solving approaches, general frameworks and tool platforms; optimize the "optimization means" themselves so that the cost falls step by step.
The startup project was the first dedicated performance project: one scenario optimized with 30 people across 3 iterations. The headcount was large because "the first time" faces many problems: metrics had to be analyzed and standards defined, instrumentation tools had to be built, there was no optimization experience, and control mechanisms did not yet exist.
The search project optimized one scenario at a cost of 8 people across 2 versions, a much better headcount. By then the instrumentation tools built during the startup project were already in place, some optimization experience had accumulated, and many detours were avoided.
The core-link project completed six scenarios at a cost of 24 people in a single version. The optimization process was methodical: less manpower, more scenarios, less time. This came from the improved optimization efficiency and steadily falling cost; over the course of continuous optimization, fairly mature analysis, optimization and control tools had accumulated.
Optimization scheme
Performance optimization is a systemic problem. The optimization plan is divided into three layers: business, engine and basic capabilities, with optimization points defined top-down in each. The upper business layer performs adaptive resource scheduling, the middle engines provide acceleration capabilities, and the lower layer provides high-performance components.
Business-layer adaptive resource scheduling
Business-layer optimization mainly reaches optimal performance through scheduling, but scheduling work done by hand is repetitive and tedious. To reduce that cost we built a resource scheduling framework: once a business is onboarded, the framework does the scheduling. While the application runs, the framework senses and collects the running environment, makes different scheduling decisions for different environment states, generates corresponding optimization strategies, and finally executes the matching optimization actions. At the same time, the scheduling context and the effect of each scheduling policy are monitored and fed back to the decision system, providing input for further tuning. In this way the best achievable performance experience can be expected in any running environment.
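The sense → decide → execute → monitor loop can be sketched minimally as follows; all type and strategy names are illustrative placeholders, not the framework's real API:

```kotlin
// A slice of the sensed environment (real framework senses far more).
data class Env(val deviceTier: String, val memoryWarning: Boolean)

// Two example strategies the decision step can choose between.
interface Strategy { val name: String }
object DisableAnimations : Strategy { override val name = "disableAnimations" }
object FullExperience : Strategy { override val name = "fullExperience" }

class Scheduler {
    // Execution records fed back to the decision system for tuning.
    val feedback = mutableListOf<String>()

    // Decide: map environment state to an optimization strategy.
    fun decide(env: Env): Strategy =
        if (env.deviceTier == "low" || env.memoryWarning) DisableAnimations
        else FullExperience

    // Execute + monitor: run the strategy and record what happened.
    fun run(env: Env): Strategy {
        val strategy = decide(env)
        feedback += "env=$env strategy=${strategy.name}"
        return strategy
    }
}
```

The real system makes far richer decisions; the point of the sketch is only the closed loop from environment to strategy to feedback.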
1. Environment sensing
The sensed environment is divided into hardware device, business scenario, user behavior and system state:
- Hardware device: on the one hand, known devices are benchmarked and scored by the group laboratory to determine high- and low-end models; on the other, real-time computing power is evaluated locally on the user's device.
- Business scenario: business is divided into foreground display, background work and interactive operation. In general, a foreground scenario with ongoing interaction has the highest priority and background data preprocessing the lowest. Scenarios of the same type are ranked against one another by business UV, transaction volume, resource consumption and similar dimensions to determine a finer-grained priority.
- User behavior: combining server-side user profiles with local real-time computation, users' feature preferences and usage habits are determined, preparing for precise per-user optimization decisions.
- System state: on the one hand, the system exposes interfaces for extreme states such as memory warnings, temperature warnings and power-saving mode; on the other, available performance resources can be determined in real time by monitoring memory, threads, CPU and power.
2. Scheduling decisions
After sensing the environment, the scheduling system combines the various states with scheduling rules to decide how to allocate business work and resources.
- Degradation rule: disable high-energy or low-priority functions on low-end devices or when alarms such as memory or temperature warnings fire;
- Avoidance rule: while high-priority functions are running, low-priority ones step aside. For example, from the moment a user taps the search box until the search results are fully displayed, low-priority background tasks are suspended to protect the interactive experience;
- Preprocessing rule: preprocess according to user operations and habits. For example, if a user habitually taps search around the 3-second mark, the search results are preloaded before then, so that the moment the user taps, the experience feels instant;
- Congestion-control rule: when device resources are tight, the application proactively reduces its own resource usage, for example cutting the number of concurrent threads when the CPU is busy. This avoids the situation where a high-priority task arrives and cannot obtain resources.
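The four rule families above can be sketched as pure decision functions; all thresholds here (priority cutoff, 3-second habit, 80% CPU) are made-up illustrations, not the framework's real values:

```kotlin
// Degradation: cut expensive features on weak devices or under system alarms.
fun shouldDegrade(deviceTier: String, memoryWarning: Boolean): Boolean =
    deviceTier == "low" || memoryWarning

// Avoidance: low-priority work steps aside while the foreground interacts.
fun shouldYield(foregroundInteracting: Boolean, taskPriority: Int): Boolean =
    foregroundInteracting && taskPriority < 5

// Preprocessing: if the user habitually taps at ~3 s, preload with a margin
// before that moment (returns the millisecond deadline for preloading).
fun preloadDeadlineMs(avgTapDelayMs: Long): Long =
    if (avgTapDelayMs >= 3000) avgTapDelayMs - 1000 else 0L

// Congestion control: halve thread concurrency when the CPU is busy.
fun maxThreads(cpuBusyPercent: Int, normal: Int): Int =
    if (cpuBusyPercent > 80) maxOf(1, normal / 2) else normal
```

Real decisions combine many such rules with weights and feedback; the sketch only shows the shape of each rule.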
3. Policy execution
Policy execution is divided into task execution and hardware tuning. Task execution controls the running of tasks mainly through the memory cache, database, thread pool and network library, indirectly scheduling the corresponding resources. Hardware tuning controls hardware resources directly in cooperation with system vendors; for example, when a CPU-intensive business starts running, the CPU frequency is raised and the running threads are bound to the big cores, avoiding the performance loss of threads switching back and forth and maximizing what the system's resources can deliver.
4. Effect monitoring
During resource scheduling, every module is monitored; the environment state, scheduling strategy, execution records, business effect, resource consumption and so on are fed back to the scheduling system, which judges how well each scheduling strategy worked and tunes further.
Engine acceleration capabilities
1. Map engine
The map engine is the part unique to a map application. This work starts mainly from drawing optimization strategies, including batched tile rendering, frame-rate scheduling, message scheduling and so on.
2. Cross-end engine
The cross-end engine has to support the business and is a common solution across all scenarios. Compared with client-side optimization, the cross-end engine has more room to act and sits close enough to the business through direct contact. Its optimization strategy is therefore to reduce the performance cost of business code. The main schemes are:
- Raising thread priority
- Context preloading
- Business framework reuse
- `require` reference reuse
Here is a brief introduction to context preloading. In order not to disturb the running state of existing business, we designed an idle-time, segmented preloading scheme that moves the page's computation and file imports to before the page is entered:
- Idle: preload only when the business thread is idle, to avoid affecting other pages;
- Segmented: each preloading sub-task is kept under 16 ms, preventing any preload task from blocking the current thread;
- Preloading: the target page's computation is done ahead of time, accelerating entry into the page.
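The "idle + segmented" scheme can be sketched as a queue of small sub-tasks drained within a per-slice time budget; the 16 ms figure is one frame at 60 fps, and slicing by wall-clock time here is a simplification of the real scheme:

```kotlin
// Runs queued preload sub-tasks, but never lets a single idle slice exceed
// the time budget, so preloading cannot block the thread for a whole frame.
class SegmentedPreloader(private val sliceBudgetMs: Long = 16) {
    private val queue = ArrayDeque<() -> Unit>()

    fun submit(task: () -> Unit) { queue += task }

    // Called from an idle callback: run tasks until the budget is spent;
    // whatever remains waits for the next idle slice. Returns tasks executed.
    fun runSlice(): Int {
        val start = System.nanoTime()
        var executed = 0
        while (queue.isNotEmpty() &&
               (System.nanoTime() - start) / 1_000_000 < sliceBudgetMs) {
            queue.removeFirst().invoke()
            executed++
        }
        return executed
    }

    fun pending(): Int = queue.size
}
```

On Android the `runSlice` call would typically hang off something like `MessageQueue.IdleHandler`; the sketch leaves the idle trigger abstract.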
3. H5 container
1. Offline package acceleration
Offline package acceleration mainly addresses the loading speed of complex H5 pages: a large number of resource files take a long time to download, so the page loads slowly. The usual mitigation is to show a loading indicator, which still leaves users waiting a long time and ultimately costs conversion. Against this background, and building on some of Amap's existing platform capabilities, we built an offline-package acceleration capability. The full link includes:
- Offline package construction: front-end scaffolding speeds up business development and dynamically specifies the offline package's resource configuration;
- Offline package release: connects to existing release capabilities with a visual front-end release platform, providing grayscale control, package updates, data statistics and other capabilities;
- On-device management: package download, management, activation, and control over download and update timing; high-frequency pages are pre-downloaded so they open in an instant;
- Resource activation: resource loading inside the container is intercepted. If the offline resource module has the resource, it takes effect immediately; if the cache misses, the resource is downloaded through a normal network request.
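The interception step can be sketched as a loader that answers from the offline package first and falls back to the network on a miss; the maps and the lambda stand in for real disk cache and HTTP:

```kotlin
// Serves container resource requests from the local offline package when
// possible; cache misses fall through to a normal network request.
class OfflineResourceLoader(
    private val offlinePackage: Map<String, ByteArray>,  // stand-in for disk cache
    private val network: (String) -> ByteArray           // stand-in for HTTP fetch
) {
    var offlineHits = 0
        private set

    fun load(url: String): ByteArray =
        offlinePackage[url]?.also { offlineHits++ }  // local hit: no download
            ?: network(url)                          // miss: normal request
}
```

A real implementation also has to verify package versions and handle partial updates, which the sketch omits.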
2. Container pre-creation
Pre-creating and pre-warming containers greatly improves H5 page loading speed. Creating a WebView instance is itself relatively expensive; pre-creating at an appropriate time after app startup, plus cache reuse, addresses both first-open and repeat-load speed. Amap already has startup task scheduling and idle-time task scheduling, and pre-creation can be carried out in the WebView module on that basis. As for the Context-switching problem of a pre-created WebView, Amap's page stack is a custom single-Activity implementation, so it is naturally compatible. Pre-creation trades space for time, so the differentiated configuration for devices of different performance needs careful polishing. It can also be combined with on-device intelligence, such as users' page-jump habits and frequencies, to decide dynamically whether to pre-create.
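Pre-creation generalizes to a warm pool of expensive instances, sketched here with a generic factory standing in for real WebView construction (which cannot run off-device):

```kotlin
// Keeps pre-created instances of an expensive type (think WebView) so that
// acquisition at the moment of page entry is free; an empty pool falls back
// to paying the construction cost on the spot.
class WarmPool<T>(private val factory: () -> T) {
    private val pool = ArrayDeque<T>()

    // Called at an idle moment after startup: build instances ahead of time.
    fun warmUp(count: Int) = repeat(count) { pool.addLast(factory()) }

    // Hand out a pre-created instance if available, else construct now.
    fun acquire(): T = pool.removeFirstOrNull() ?: factory()

    fun warmed(): Int = pool.size
}
```

The space-for-time tradeoff shows up directly: `warmUp` spends memory early so that `acquire` is cheap when the user navigates.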
High-performance basic components
1. Thread pool
The thread pool supports scheduling policies such as task priority scheduling, thread count control and thread avoidance, making full use of device resources.
The thread queue management module provides 5 priority queues:
- High-priority queue: for UI-related tasks whose results must come back quickly, such as high-priority tasks in the startup phase;
- Sub-high-priority queue: for tasks that need to return promptly, such as loading a business page's files;
- Normal (low-priority) queue: mainly for tasks that do not need an immediate return, such as network requests;
- Background (lowest-priority) queue: for tasks users will not perceive, such as log reporting and time-consuming I/O;
- Main-thread idle queue: for tasks that need not run immediately but must run on the main thread; they execute only when the main thread is detected to be idle.
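The priority-ordering core of such a module can be sketched with `java.util.concurrent.PriorityBlockingQueue`; the priority values and the single-threaded `drain` are simplifications of a real pool with worker threads:

```kotlin
import java.util.concurrent.PriorityBlockingQueue

// A task with a priority; higher numbers run first.
class PrioritizedTask(val name: String, val priority: Int, val body: () -> Unit) :
    Comparable<PrioritizedTask> {
    // Reverse comparison so the queue surfaces the highest priority first.
    override fun compareTo(other: PrioritizedTask) = other.priority - this.priority
}

class PriorityExecutor {
    private val queue = PriorityBlockingQueue<PrioritizedTask>()

    fun submit(task: PrioritizedTask) { queue.put(task) }

    // Drain and run tasks in priority order; returns the execution order
    // so the scheduling behavior can be inspected.
    fun drain(): List<String> {
        val order = mutableListOf<String>()
        while (true) {
            val t = queue.poll() ?: break
            t.body()
            order += t.name
        }
        return order
    }
}
```

A production pool would map the five queues above onto worker threads with distinct OS thread priorities; the sketch only shows the ordering guarantee.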
2. Network library
Network requests are the most time-consuming part of a scenario, and their performance almost determines the first-screen time. We monitor every link of the request chain and optimize each to the limit, mainly including refined request scheduling, concurrent preprocessing, DNS preloading, connection reuse and so on.
Request link diagram:
- Queuing: refined scheduling of requests. Thread resources are partitioned by priority: high-priority requests have their own dedicated share and may spill over into the low-priority share, while low-priority requests may not occupy the high-priority share, achieving zero queuing for high-priority requests. Meanwhile the concurrency of low-priority requests is capped, avoiding underlying bandwidth contention from excessive concurrency.
- Preprocessing: a series of time-consuming operations such as common parameters, signing and encryption; these were changed from serial to parallel, reducing preprocessing time;
- DNS resolution: common domain names are whitelisted and pre-resolved at startup, so resolution takes zero time at the moment of use;
- Connection setup: with strategies such as HTTP/2 long connections and pre-connection, connection setup costs almost nothing;
- Upload/download: whether to compress is decided intelligently from the body size, shrinking the body and reducing transmission time;
- Parsing callback: scenarios with more complex responses (such as route planning) use more efficient data protocol formats (such as Protocol Buffers) to reduce both data size and parsing time.
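Of these links, DNS pre-resolution is the simplest to sketch: whitelisted hosts are resolved at startup and later lookups on the request path hit the cache. The resolver lambda stands in for a real blocking DNS query:

```kotlin
// Pre-resolves whitelisted domains at startup; lookups on the hot request
// path are then served from the cache in ~0 time.
class DnsPrefetcher(private val resolver: (String) -> String) {
    private val cache = mutableMapOf<String, String>()

    // Called once at startup for the whitelisted common domains.
    fun prefetch(whitelist: List<String>) =
        whitelist.forEach { cache[it] = resolver(it) }

    // Cache hit avoids a blocking resolve when the request is made;
    // an unlisted host still resolves lazily and is cached for next time.
    fun resolve(host: String): String =
        cache[host] ?: resolver(host).also { cache[host] = it }
}
```

A real implementation also needs TTL expiry and refresh on network changes, which the sketch omits.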
Long-term control in forward order
Optimization works backward from "effect" to "cause"; it solves problems in reverse order. To stop problems at the source of the "cause", and to keep the optimized results from regressing, our third idea is as follows:
Long-term, continuous forward-order control: prevent existing business from continuously deteriorating while consolidating the results of the dedicated projects.
Horizontally, the Amap client spans business lines such as travel, search and ride-hailing; vertically, the architecture spans the business layer, platform adaptation layer, cross-end engine layer and map engine layer, crossing multiple language stacks. Performance problems therefore take a long time to follow up and a long link to investigate, so the control work focuses on building standards, processes, and automated platforms and tools.
Standards
After comprehensive dedicated governance, the goal of performance control is to keep the achieved state from continuously deteriorating. Because of test fluctuation, device heating, rapid iteration and frequent dynamic plug-in releases, there are many outlets to control; any blind spot in control will inevitably let performance degrade again. The control standard is therefore to take a fixed baseline as the benchmark and version-over-version values as the quantitative measure, bringing every changing factor under control and effectively preventing regressions from stacking up.
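The baseline-plus-threshold standard can be sketched as a gate function; the 5% regression threshold here is a made-up illustration, not Amap's actual bar:

```kotlin
// Relative regression of the current build's metric against the fixed baseline.
fun regressionRatio(baselineMs: Double, currentMs: Double): Double =
    (currentMs - baselineMs) / baselineMs

// Integration gate: fail the build if the metric regressed beyond the
// allowed threshold relative to the fixed baseline.
fun passesGate(baselineMs: Double, currentMs: Double, threshold: Double = 0.05): Boolean =
    regressionRatio(baselineMs, currentMs) <= threshold
```

In practice each metric is averaged over repeated runs first, precisely because of the test fluctuation mentioned above.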
Processes
The client release process has three main stages: requirement analysis and design, independent iterative business development, and integration testing. The integration testing phase already has plenty of bugs of its own, leaving very little time to find and fix performance problems. To solve this, version control must make good use of every stage of the main process and eliminate performance problems stage by stage:
- In requirement analysis and scheme design, estimate performance impact and identify problems in advance;
- In iterative development, discover and solve problems early, reducing the risk that performance problems surface too late to fix and hurt the online user experience;
- In integration testing, collect data daily, find problems promptly, and investigate quickly on top of the platform and tools to speed up problem turnaround;
- In grayscale and release, watch the online data dashboard, maintain an alarm mechanism, find problems promptly, and troubleshoot online issues through user logs.
Platform
Relying on the Titan continuous integration platform and the ATap automated test platform, we built a tool chain connecting the complete link of development, build, performance testing, problem follow-up, investigation, circulation and resolution, improving the efficiency of both finding and solving problems.
- Titan continuous integration platform
  - Scheduled builds, with support for pinning down the release-package build task; build types include performance packages
  - Automated test triggering: supports per-package triggers and timed triggers
  - Integration gating and decisions: integration requests display performance test results, wired into the integration approval process
- ATap automated test platform
  - Performance dashboard: aggregates performance data to surface problems quickly
  - Instrumentation-point details, with integrated quick-investigation tools to speed up troubleshooting
  - Problem follow-up: tracks the problem-solving process with Aone to speed up circulation
Conclusion
By now, the performance experience of Amap's core links has improved greatly, from the initial optimization results to the later gains in optimization efficiency and falling optimization cost. The overall process can be summarized as follows:
- Tactically, "dedicated projects" + "technical accumulation" + "long-term control" ensure that performance experience problems get solved;
- Strategically, we used to solve problems with "people" alone; now we solve them with "people", "architecture" and "tools". Will "tools" solve, or even prevent, problems on their own in the future? As the tools accumulated through "technical accumulation" and the platforms built for "long-term control" keep growing, quantitative change turning into qualitative change is only a matter of time.
Follow the [Alibaba Mobile Technology] WeChat official account for three pieces of mobile technology practice and insight every week!