Big promotions have always been a proving ground for technology and products. In every big promotion, a wide variety of rich media such as live streaming, video, 3D, interactive games, and AR goes online, and under Taobao's super-app strategy all of it is concentrated in the Mobile Taobao app. In these promotion scenarios, the client began to hit the ceiling of system resources. During the "Double 11" promotion in 2017, memory problems on the client were particularly prominent, and OOM ranked first among all crash causes. Memory has been the biggest challenge to client-side stability in recent years.

Crash classification during Double 11, 2017

Crash trend during Double 11, 2017, and its relationship to business launches

Let the business

Two years later, through continuous technical work and governance, memory problems are no longer the main factor affecting stability during big promotions. The 618 promotion supported more than ever before: an unlimited number of slots in the "Guess You Like" feed, rich live streaming and short video features, hundreds of operational slots in the venue pages, and a no-downgrade policy for interactive business. Even as these businesses launched, client-side stability improved further, and the crash rate was much better than in the same period last year.

Overall crash rate during the 618 promotion

Grasp the nettle and work your magic

Facing these memory challenges, we spent two years feeling our way forward and accumulated a body of memory governance experience.

When a big promotion exposes a problem, we have to reexamine our current mechanisms and standards, so we defined memory standards and an acceptance process for businesses going online. We also provide a set of memory analysis tools to locate problems quickly and accurately. In addition, we developed three memory optimization strategies:

  1. Budget carefully to improve memory utilization
  2. Fallback disaster recovery to prolong the life of the application as much as possible
  3. Raise the memory limit and break through the system ceiling

Acceptance criteria

Because a memory ceiling exists, we introduced acceptance standards for big promotions from the perspective of stability. While drafting the standards, we collected the memory water levels at which OOM occurred and derived high-risk, dangerous, and normal water levels, which guided the memory standards.
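As an illustration, the water-level idea can be sketched as a simple classification of current usage against the memory ceiling. This is a minimal sketch, not the team's actual standard: the two thresholds and every name below are hypothetical, since the article does not give the real values.

```python
from enum import Enum


class WaterLevel(Enum):
    NORMAL = "normal"
    DANGEROUS = "dangerous"
    HIGH_RISK = "high risk"


# Hypothetical thresholds, expressed as a fraction of the memory ceiling.
DANGEROUS_RATIO = 0.70
HIGH_RISK_RATIO = 0.85


def classify_water_level(used_mb: float, ceiling_mb: float) -> WaterLevel:
    """Map current memory usage to one of the three water levels."""
    if ceiling_mb <= 0:
        raise ValueError("ceiling_mb must be positive")
    ratio = used_mb / ceiling_mb
    if ratio >= HIGH_RISK_RATIO:
        return WaterLevel.HIGH_RISK
    if ratio >= DANGEROUS_RATIO:
        return WaterLevel.DANGEROUS
    return WaterLevel.NORMAL
```

In practice the thresholds would come from the OOM statistics described above rather than from fixed constants.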

Memory problems are complicated because memory is a globally shared pool: when an overflow occurs and no single business is obviously at fault, it is hard to determine which business caused it. We therefore defined standards for two scenarios: single page and link.

The single-page scenario mainly limits the risk caused by a single business using too much memory. As mentioned above, the memory pool is global and limited. If a single page occupies too much memory, the memory available to the rest of the app shrinks sharply, and for the same number of pages browsed the overall memory risk increases.

The link scenario tests memory along common browsing paths. For example, we run multi-page stacking tests along the typical path of home page, venue, interaction, shop, detail page, and order placement to determine the memory risk for users in normal scenarios.

The standards also take into account the differences between technology stacks, for example H5, WEEX, and native, as well as multi-tab venue pages, live streaming, 3D, and so on.
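The two scenarios can be sketched as one check over a browsing path: each page is compared against a single-page budget, and the running total is compared against a link budget. This is only an illustrative sketch; the function name, the budgets, and the per-page numbers are assumptions, not figures from the article.

```python
def check_link(pages, single_page_budget_mb, link_budget_mb):
    """Check a browsing path against both budgets.

    pages: list of (page name, memory increase in MB after opening the page),
    in the order the user visits them.
    Returns a list of human-readable violations (empty if the path passes).
    """
    violations = []
    total_mb = 0.0
    for name, delta_mb in pages:
        total_mb += delta_mb
        # Single-page scenario: one page must not take too much by itself.
        if delta_mb > single_page_budget_mb:
            violations.append(
                f"{name}: page adds {delta_mb:.0f} MB, "
                f"over the {single_page_budget_mb:.0f} MB page budget"
            )
        # Link scenario: stacked pages must stay under the path budget.
        if total_mb > link_budget_mb:
            violations.append(
                f"{name}: path total reaches {total_mb:.0f} MB, "
                f"over the {link_budget_mb:.0f} MB link budget"
            )
    return violations
```

For example, with a 70 MB page budget and a 180 MB link budget, the path home (60 MB), venue (80 MB), detail (50 MB) yields one single-page violation at the venue and one link violation at the detail page.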

The TMQ automated testing tool developed by our test colleagues

Three key moves for memory optimization

The three memory optimization strategies mentioned earlier are described in turn here.

Budget wisely – Improve memory utilization

Every 1 KB of memory is valuable when the business repeatedly hits the memory ceiling.

When analyzing actual memory footprint, we occasionally find scenes that load images much larger than the view that displays them, which wastes memory. In other scenarios, images are held in memory for too long: after a page has been in the background or on the back stack for a long time, the memory held by its images still cannot be released for the current screen to use. To handle such cases, we added the corresponding detection to our high-availability system, so that memory goes to the components the user is actually using and memory utilization improves.

Starting from the image library's data flow and the View life cycle, we designed automatic release and restoration of images: when a View becomes invisible, its image is automatically released back to the image library cache and only the image's key is retained; when the View becomes visible again, the image is restored by key. The image library has its own three-level cache, so if the released image is still in the cache it can be restored immediately, with almost no impact on the experience.
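The release-and-restore idea can be sketched as follows. This is a simplified model, not the image library's real API: `ImageCache`, `ImageView`, and the `loader` callback are hypothetical names, and the three-level cache is collapsed into a single dictionary.

```python
class ImageCache:
    """Stand-in for the image library cache: maps a key to decoded image data."""

    def __init__(self):
        self._store = {}

    def put(self, key, bitmap):
        self._store[key] = bitmap

    def get(self, key):
        return self._store.get(key)

    def evict(self, key):
        # The cache is free to drop entries when memory is tight.
        self._store.pop(key, None)


class ImageView:
    """A view that holds its image only while it is visible."""

    def __init__(self, key, cache, loader):
        self.key = key          # always retained, even when the image is released
        self.bitmap = None
        self._cache = cache
        self._loader = loader   # slow path, e.g. disk or network

    def on_visible(self):
        if self.bitmap is None:
            bitmap = self._cache.get(self.key)
            if bitmap is None:
                bitmap = self._loader(self.key)
                self._cache.put(self.key, bitmap)
            self.bitmap = bitmap

    def on_invisible(self):
        if self.bitmap is not None:
            # Hand the image back to the cache and keep only the key.
            self._cache.put(self.key, self.bitmap)
            self.bitmap = None
```

While the image is still cached, `on_visible` restores it without touching the slow loader; only after the cache has evicted it does the view pay the full loading cost again.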

For heavy memory consumers such as detail pages, large images, video, and WebView, we also agreed with each business team on limits to the number of live instances. Current limits cover detail pages, player instances, and so on.
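One common way to enforce such a limit is to destroy the least recently opened instance once the cap is exceeded. The sketch below assumes that policy; the article does not say how the limit is enforced, and `InstanceLimiter` and its `destroy` callback are hypothetical.

```python
from collections import OrderedDict


class InstanceLimiter:
    """Keeps at most max_instances live instances of one heavy component."""

    def __init__(self, max_instances, destroy):
        if max_instances < 1:
            raise ValueError("max_instances must be at least 1")
        self._max = max_instances
        self._destroy = destroy      # callback: destroy(instance_id, instance)
        self._live = OrderedDict()   # oldest entry first

    def open(self, instance_id, instance):
        self._live[instance_id] = instance
        self._live.move_to_end(instance_id)
        # Over the cap: destroy the oldest instances first.
        while len(self._live) > self._max:
            old_id, old_instance = self._live.popitem(last=False)
            self._destroy(old_id, old_instance)

    def close(self, instance_id):
        self._live.pop(instance_id, None)

    def live_ids(self):
        return list(self._live)
```

With a cap of two detail pages, opening a third page destroys the first one and leaves the two most recent pages alive.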

For a better experience, we also optimized the downgrade strategy. Instead of a one-size-fits-all approach, we downgrade selectively according to each device's capability. To achieve this, we first grade devices, relying on the smart tiering we built.

Unified downgrade builds on device scoring: it provides a default grading of high-end and low-end models and adds configuration capability. Each core business is assigned an Orange configuration that supports service configuration and downgrade along multiple dimensions such as system, brand, model, device, application version, and effective time.

Relying on unified downgrade, the experience can be tiered precisely. High-end models can use all special effects and high-definition images to ensure the best experience. Mid-range models can drop some special effects to run better, while low-end models are guaranteed stability and the basic experience. The goal is "the most dazzling experience on high-end devices, fluency first on low-end devices, and rapid downgrade for emergencies."

Effect of unified downgrade
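The combination of default grading and per-business configuration can be sketched as a score-based tier with override rules checked first. This is a minimal sketch under stated assumptions: the score cutoffs, the rule fields, and all names are hypothetical, and the real Orange configuration format is not shown in the article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Device:
    score: int          # device score from the grading system (assumed 0-100)
    brand: str
    model: str
    os_version: int


@dataclass(frozen=True)
class OverrideRule:
    """A configured rule; a field left as None matches any device."""
    tier: str
    brand: Optional[str] = None
    model: Optional[str] = None
    max_os_version: Optional[int] = None

    def matches(self, device: Device) -> bool:
        return (
            (self.brand is None or self.brand == device.brand)
            and (self.model is None or self.model == device.model)
            and (self.max_os_version is None
                 or device.os_version <= self.max_os_version)
        )


def default_tier(score: int) -> str:
    # Hypothetical cutoffs for the default grading.
    if score >= 80:
        return "high"
    if score >= 50:
        return "mid"
    return "low"


def resolve_tier(device: Device, rules) -> str:
    """Configured rules win over the default score-based tier."""
    for rule in rules:
        if rule.matches(device):
            return rule.tier
    return default_tier(device.score)
```

Checking the configured rules first is what allows an emergency downgrade of one problematic model without touching the default grading for every other device.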

Fallback disaster recovery – prolong the application's life as much as possible

At the most dangerous moment for application memory, the very next allocation may crash the app. At that moment, can we relieve the pressure and let users place a few more orders? For this we designed the memory disaster recovery SDK.

The approach builds on how GC and the LowMemoryKiller work (for Android OOM, insufficient Java heap memory must be distinguished from insufficient native memory). By listening to the system's GC and LowMemoryKiller activity, the SDK estimates the current memory state; when memory is insufficient, it destroys lower-priority Activities so that the user has as much memory as possible and stability problems are avoided.

Basic principles of memory disaster recovery
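The selection step, choosing which Activities to destroy once memory is judged insufficient, can be sketched as below. This is an assumption-laden illustration rather than the SDK's logic: the priority scheme, the 80% target, and the function name are hypothetical, and the GC and LowMemoryKiller listening itself is not modeled.

```python
def activities_to_destroy(activities, used_mb, limit_mb, target_ratio=0.8):
    """Pick background activities to destroy, least important first.

    activities: list of (name, priority, memory_mb) for activities that are
    NOT in the foreground; a lower priority number means less important.
    Returns the names to destroy so usage drops to target_ratio * limit_mb,
    or as close to it as the candidates allow.
    """
    target_mb = limit_mb * target_ratio
    victims = []
    for name, _priority, memory_mb in sorted(activities, key=lambda a: a[1]):
        if used_mb <= target_mb:
            break
        victims.append(name)
        used_mb -= memory_mb
    return victims
```

Leaving the foreground Activity out of the candidate list keeps the page the user is looking at intact, which matches the goal of letting the user finish the current task.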

Expand ceiling – Breach system ceiling

Mobile Taobao's strategy has always been the super-app strategy. The approaches above only relieve the current stability problems; they treat the symptoms, not the root cause. The business's ever-growing demand for memory, from unlimited feed slots to live video in venue pages, brings further pressure. The ultimate solution to the memory problem is to increase the memory capacity available on the client.

Multiple processes

Using multiple processes is one way to break through the system ceiling. Since most new pages are H5, we focused on finding a breakthrough in WebView. At this point Apple's WKWebView came into the scope of our research. Our conclusions on WKWebView's memory advantages are as follows:

WKWebView's memory is not counted against the main application; it is accounted to a separate process. An application using WKWebView can therefore use more memory overall than one using UIWebView, because web memory lives in the WKWebView web process and does not count toward the main application's memory limit.

For Android, the platform itself supports multiple processes, so our initial design ran an Activity in a separate process; that is, BrowserActivity runs in its own process.

In the A/B experiment during the 99 promotion, among users who visited Tao Gold Coins or interaction pages, the main-process native crash rate decreased by 15%-18% compared with the control group, and the crash count (main + child processes) decreased by more than 10,000. Across all users the memory optimization effect was still significant, with a decrease of 3%-5%.

However, many of the basic SDKs were not designed with multiple processes in mind, and the application life cycle also changes under multiple processes, so the overall risk of this solution was high. In the end we adopted the UC kernel's multi-process solution. It moves parsing, layout, and JS execution for the whole H5 page into an independent process, taking part of the memory pressure off the main process and thereby breaking through the single-process memory ceiling.

UC multi-process diagram

Impact of multiple UC processes on crash rate

According to rigorous A/B experiment results, the crash rate can be reduced by 30-40% after Mobile Taobao enables UC multi-process.

64-bit upgrade

Generally speaking, programs in use today are compiled for a 32-bit instruction set. On a 32-bit system a memory address is only four bytes, so the theoretical maximum address space is only 4 GB. As described above, a 4 GB address space can no longer meet the needs of Mobile Taobao's current business, so this year we began upgrading Mobile Taobao on Android from 32-bit to 64-bit.

ARMv8 and later CPUs use a 64-bit architecture, and Android 5.0 and later support 64-bit. According to fairly accurate instrumentation data, about 95% of the phones on the market are 64-bit, which means the vast majority of users can benefit from the upgrade. On the other hand, the 64-bit upgrade carries risks. All C/C++ code needs to be recompiled for the 64-bit instruction set. Possible risks include:

  • Pointer length grows from 32 to 64 bits, so hard-coded assumptions can produce incorrect calculations and cause stability problems.

  • Custom SO loading logic (for example, SOs downloaded remotely from a server) may not account for multiple CPU ABIs and may load the wrong SO, causing stability problems.

  • Data may be inconsistent between 64-bit and 32-bit, especially binary data, leaving existing data unusable after an overwrite install or upgrade.
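The second risk, picking the wrong SO when several CPU ABIs are in play, can be illustrated with a small selection routine. This is a hedged sketch, not Mobile Taobao's loader: `pick_abi` and its parameters are hypothetical, and the device ABI list is assumed to be ordered by preference, as Android's `Build.SUPPORTED_ABIS` is.

```python
# ABI directory names that contain 64-bit libraries.
ABIS_64 = ("arm64-v8a", "x86_64")


def pick_abi(device_abis, available_abis, process_is_64bit):
    """Choose which ABI directory to load a custom SO from.

    device_abis: ABIs the device supports, most preferred first.
    available_abis: set of ABIs the library was actually shipped for.
    process_is_64bit: whether the current process runs in 64-bit mode.
    Returns the ABI name, or None if no compatible build exists.
    """
    for abi in device_abis:
        # An SO must match the bitness of the process that loads it.
        if (abi in ABIS_64) != process_is_64bit:
            continue
        if abi in available_abis:
            return abi
    return None
```

Returning None instead of falling back to a mismatched build makes the missing-ABI case explicit, so the caller can skip the feature rather than crash while loading the wrong SO.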

To address these risks, starting in March we ran regression tests on more than 120 SOs in Mobile Taobao, including upgrade scenarios where 32-bit and 64-bit builds are installed over each other. In addition, we scanned all Mobile Taobao code for custom SO loading logic and reviewed each case to confirm that multiple CPU ABIs are supported. After several months of gray release and iteration, the 64-bit rollout was completed in time for 618.

In the latest 618 release, with the 64-bit upgrade and the multi-process approach combined, OOM's share dropped to around 10%, compared with a peak of around 40% in the previous 618 release. This figure still includes the roughly 5% of users on 32-bit devices, so the effect on OOM governance is significant.

Looking ahead

  • The challenge of new forms of technology

Memory has always been the biggest challenge to client-side stability during big promotions, and today it is well under control. Of course, system resources are limited after all, and we still need to use them effectively and reasonably. More importantly, looking to the future, new technology forms such as Flutter, where multiple virtual machines coexist, will still pose great challenges to system resources.

  • From stability to smooth experience

For users, stability is only the most basic requirement. We will continue to optimize the experience to bring users a truly silky-smooth experience.

Mobile client team
