This three-part series is based on a talk given at the Android Green Alliance Developer Conference by Fei Tai, senior development engineer on the Mobile Taobao client infrastructure team. It introduces the design principles and implementation ideas behind emAS-MoTU, the Mobile Taobao technology team's systematic solution for improving performance and stability.

This article focuses on the definition and metrics of high availability, the automated testing framework, and the performance and stability data platform.

For more information, please watch the video below

V.qq.com/x/page/r080…

Fei Tai

Senior development engineer of Mobile Taobao Client Infrastructure

Mainly responsible for improving the performance and stability of mobile Taobao

Definition and metrics of high availability

Mobile high availability definition

Mobile high availability starts by defining key metrics that objectively reflect and quantify what users actually experience while using the app. Around those metrics, a series of tools and platforms is built to quickly detect, analyze, locate, and fix stability, performance, functional, and other problems, from offline testing through to online monitoring. Together they form a systematic solution for further improving the user experience.

High availability metrics

High availability metrics cover performance and stability. Performance is measured along seven dimensions: lag (jank) rate, startup time, page instant-open rate (the share of pages that open within about a second), frame rate, ANR rate, network traffic, and power consumption. Stability is measured mainly by the crash rate, which is split into the Java crash rate and the native crash rate.
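As a rough illustration of how such rate metrics could be computed from aggregated counters, here is a minimal sketch. The article does not give the exact formulas, so the definitions below (crashes per launch, fast page opens per page open) and all field names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Counters for one reporting window (hypothetical schema)."""
    launches: int          # app launches in the window
    java_crashes: int      # launches that ended in a Java crash
    native_crashes: int    # launches that ended in a native crash
    page_opens: int        # page-open events
    fast_page_opens: int   # page opens that finished within the threshold


def rate(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when there is no data."""
    return numerator / denominator if denominator else 0.0


def summarize(stats: WindowStats) -> dict:
    """Compute the stability and page-open metrics (assumed definitions)."""
    java = rate(stats.java_crashes, stats.launches)
    native = rate(stats.native_crashes, stats.launches)
    return {
        "java_crash_rate": java,
        "native_crash_rate": native,
        "crash_rate": java + native,
        "page_instant_open_rate": rate(stats.fast_page_opens, stats.page_opens),
    }


report = summarize(WindowStats(launches=10_000, java_crashes=12,
                               native_crashes=3, page_opens=40_000,
                               fast_page_opens=36_000))
```

Splitting the crash rate into Java and native parts mirrors the breakdown described above; a real platform would likely also segment by app version and device.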

Automated testing framework and performance and stability data platform

Automated testing framework

V.qq.com/x/page/e135…

The video above is a demo of Mobile Taobao's automated testing. In the first part a tester performs gesture operations by hand, and the second part shows the generated script running automatically. The advantage of this framework is that it turns gestures into an executable automation script in real time. The framework is used for stress testing: by replaying the same script over and over, it uncovers latent performance and stability problems such as memory leaks, thread leaks, and resource leaks. Implementing routine test cases as such scripts also handles regression testing of P0/P1 routine business flows.
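The record-and-replay idea can be sketched as follows. This is not the framework's actual implementation: the `Gesture` schema, the `FakeApp` stand-in, and the leak heuristic (usage never drops between rounds and grows by a minimum amount) are all assumptions chosen to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Gesture:
    """One recorded gesture step (hypothetical schema)."""
    kind: str  # e.g. "tap" or "swipe"
    x: int
    y: int


def replay(script: List[Gesture], perform: Callable[[Gesture], None]) -> None:
    """Play the recorded gestures back in order."""
    for gesture in script:
        perform(gesture)


def stress_run(script: List[Gesture], perform: Callable[[Gesture], None],
               sample_usage: Callable[[], int], rounds: int) -> List[int]:
    """Replay the same script `rounds` times, sampling usage after each round."""
    samples = [sample_usage()]
    for _ in range(rounds):
        replay(script, perform)
        samples.append(sample_usage())
    return samples


def looks_like_leak(samples: List[int], min_growth: int) -> bool:
    """Flag a leak when usage never drops and grows by at least min_growth."""
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_drops and samples[-1] - samples[0] >= min_growth


class FakeApp:
    """Stand-in for a device under test: every tap leaks 2 units of memory."""

    def __init__(self) -> None:
        self.memory = 100

    def perform(self, gesture: Gesture) -> None:
        if gesture.kind == "tap":
            self.memory += 2


app = FakeApp()
script = [Gesture("tap", 10, 20), Gesture("swipe", 0, 300), Gesture("tap", 50, 60)]
samples = stress_run(script, app.perform, lambda: app.memory, rounds=5)
leaking = looks_like_leak(samples, min_growth=10)
```

Replaying an identical script is what makes the comparison meaningful: if each round does the same work, steady growth in memory, thread count, or open handles points to a leak rather than to a change in workload.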

Performance and stability data platform

The performance and stability data platform consists of four modules and displays monitoring data across the various dimensions.

1. Crash analysis

This module analyzes Java crashes and native crashes. A Java crash report includes the call stack at the time of the crash, the current page, the pages the user visited beforehand, the current memory usage level, and logcat output, which helps developers quickly find the cause of the crash and fix it. A native crash report includes the signal that caused the crash, the call stack of the crashing thread, the call stacks of the other threads, logcat output, and information about the loaded shared libraries (.so files). With this information, developers can quickly find the cause of a native crash.
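The two kinds of report can be pictured as simple records. The sketch below only mirrors the fields listed above; the class names, field names, and the `headline` helper are hypothetical and do not describe the platform's real schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class JavaCrashReport:
    call_stack: List[str]      # frames at the time of the crash, innermost first
    current_page: str
    page_history: List[str]    # pages the user visited before the crash
    memory_used_mb: float
    logcat: List[str] = field(default_factory=list)


@dataclass
class NativeCrashReport:
    signal: str                # e.g. "SIGSEGV"
    call_stack: List[str]      # frames of the crashing thread, innermost first
    other_thread_stacks: Dict[str, List[str]]
    loaded_libraries: List[str]  # loaded .so files
    logcat: List[str] = field(default_factory=list)


def headline(report) -> str:
    """One-line summary: what crashed and where (innermost frame)."""
    where = report.call_stack[0] if report.call_stack else "<no stack>"
    if isinstance(report, NativeCrashReport):
        return f"native {report.signal} at {where}"
    return f"java crash on {report.current_page} at {where}"


java_report = JavaCrashReport(
    call_stack=["com.example.Cart.total", "com.example.CartPage.onResume"],
    current_page="CartPage",
    page_history=["HomePage", "DetailPage"],
    memory_used_mb=312.5,
)
native_report = NativeCrashReport(
    signal="SIGSEGV",
    call_stack=["decode_frame", "render_loop"],
    other_thread_stacks={"main": ["epoll_wait"]},
    loaded_libraries=["libexample.so"],
)
summaries = [headline(java_report), headline(native_report)]
```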

2. Anomaly analysis

This module shows the indicators for each performance dimension. The main-thread stall view shows which messages exceeded the time threshold and what their call stacks look like. The ANR view shows the file information under /data/anr and the scene in which the ANR occurred. The main-thread I/O view shows the call stacks of I/O operations performed on the main thread and how long they took. The memory leak view has two parts: the name of the leaking component for Java leaks and the name of the leaking shared library (.so) for native leaks. Together they let you quickly locate the cause of a memory leak. The resource leak view shows the call stack captured at the point where a developer opened the resource.
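The main-thread stall check, "which messages exceed the threshold", can be sketched as timing each message in a loop. This is a simplified model, not the platform's code: the message list, the injectable `clock`, and the threshold value are assumptions, and a real monitor would also capture the main thread's call stack for each slow message.

```python
import time
from typing import Callable, List, Tuple

Message = Tuple[str, Callable[[], None]]


def run_messages(messages: List[Message], threshold_s: float,
                 clock: Callable[[], float] = time.perf_counter
                 ) -> List[Tuple[str, float]]:
    """Run each message in order; return (name, duration) for the slow ones."""
    slow = []
    for name, handler in messages:
        start = clock()
        handler()
        elapsed = clock() - start
        if elapsed > threshold_s:
            slow.append((name, elapsed))
    return slow


class FakeClock:
    """Deterministic clock so the example does not depend on real timing."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


clock = FakeClock()
messages = [
    ("draw_frame", lambda: clock.advance(0.008)),
    ("load_config", lambda: clock.advance(0.250)),  # simulated main-thread I/O
    ("handle_tap", lambda: clock.advance(0.012)),
]
slow = run_messages(messages, threshold_s=0.1, clock=clock)
```

Passing the clock in as a parameter keeps the sketch testable; with the default `time.perf_counter` the same function measures real handlers.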

3. APM performance

Launch performance measures the time from the user tapping the app icon until the page is truly visible and interactive. Page performance measures the time from tapping a page entry until the next page is truly visible and interactive. The system records the time taken by every startup subtask and uses the change in this data to decide whether a version meets the quality standard for release. If it does, the version can be released. If it does not, the subtask timings are analyzed to find which tasks are blocking the release, so the problem can be located, analyzed, and fixed quickly. Mobile Taobao also opens up these data capabilities to business teams so they can build custom performance reports for their own needs.
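The release decision described above can be sketched as a two-step check: compare the total against a budget, then list the subtasks that regressed against a baseline. The budget, the 10% tolerance, the task names, and the assumption that subtasks run sequentially (so their times can be summed) are all illustrative, not taken from the platform.

```python
from typing import Dict, List, Tuple

Regression = Tuple[str, float, float]  # (task, baseline_ms, candidate_ms)


def regressed_subtasks(baseline_ms: Dict[str, float],
                       candidate_ms: Dict[str, float],
                       tolerance: float) -> List[Regression]:
    """Return subtasks slower than baseline * (1 + tolerance), worst first.

    Tasks that have no baseline entry are skipped in this sketch.
    """
    regressions = []
    for task, new in candidate_ms.items():
        old = baseline_ms.get(task)
        if old is not None and new > old * (1 + tolerance):
            regressions.append((task, old, new))
    regressions.sort(key=lambda r: r[2] - r[1], reverse=True)
    return regressions


def release_check(baseline_ms: Dict[str, float],
                  candidate_ms: Dict[str, float],
                  total_budget_ms: float,
                  tolerance: float = 0.10) -> Tuple[bool, List[Regression]]:
    """Return (meets_standard, offending_subtasks) for a candidate build."""
    total = sum(candidate_ms.values())  # assumes sequential subtasks
    offenders = regressed_subtasks(baseline_ms, candidate_ms, tolerance)
    return total <= total_budget_ms, offenders


baseline = {"init_sdk": 120.0, "load_home": 300.0, "warm_cache": 80.0}
candidate = {"init_sdk": 125.0, "load_home": 420.0, "warm_cache": 82.0}
ok, offenders = release_check(baseline, candidate, total_budget_ms=550.0)
```

In this made-up data the candidate misses the budget, and the offender list points at `load_home` as the subtask to investigate first.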

4. Remote tools

Remote tools are mainly used for specific cases affecting specific users. When an online user reports a performance problem through the user feedback (public opinion) platform, the tools can quickly pull remote logs, a memory dump, and per-method timing data from that user's device, so the cause can be analyzed and a fix provided.

Behind this high availability, how does the Mobile Taobao technology team manage performance and stability? How do the processes that emAS-MoTU has developed provide high quality assurance for Mobile Taobao? We will cover this in the next article, so stay tuned!