Usage

The component responsible for jank monitoring in Matrix is TraceCanary. It is implemented with ASM bytecode instrumentation and is used to monitor UI smoothness, startup time, page-switch time, slow methods, jank/ANR, and related problems. As with ResourceCanary, you need to configure the following options before using it, covering frame rate, slow methods, ANR, and startup:

TraceConfig traceConfig = new TraceConfig.Builder()
        .dynamicConfig(dynamicConfig)
        .enableFPS(fpsEnable)               // frame rate
        .enableEvilMethodTrace(traceEnable) // slow methods
        .enableAnrTrace(traceEnable)        // ANR
        .enableStartup(traceEnable)         // startup
        .splashActivities("sample.tencent.matrix.SplashActivity;") // Multiple splash activities can be specified, separated by ";"
        .isDebug(true)
        .isDevEnv(false)
        .build();
TracePlugin tracePlugin = new TracePlugin(traceConfig);

Then start it in your Application or Activity:

tracePlugin.start();
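For completeness, here is a minimal sketch of how the plugin is typically registered during Application initialization, based on the public Matrix sample (package names and builder methods may differ slightly between versions, so treat the exact API as an assumption):

import android.app.Application;
import com.tencent.matrix.Matrix;
import com.tencent.matrix.trace.TracePlugin;
import com.tencent.matrix.trace.config.TraceConfig;

// Sketch based on the Matrix sample: register the plugin with Matrix, then start it.
public class MatrixApplication extends Application {
    @Override
    public void onCreate() {
        super.onCreate();

        TraceConfig traceConfig = new TraceConfig.Builder()
                .enableFPS(true)
                .enableEvilMethodTrace(true)
                .enableAnrTrace(true)
                .enableStartup(true)
                .build();
        TracePlugin tracePlugin = new TracePlugin(traceConfig);

        // Register the plugin with Matrix before starting it.
        Matrix.init(new Matrix.Builder(this).plugin(tracePlugin).build());
        tracePlugin.start();
    }
}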

In addition to the above configuration, TraceCanary supports the following dynamic configuration options:

enum ExptEnum {
    // trace
    clicfg_matrix_trace_care_scene_set, // Scenes (e.g. the splash page) to care about
    clicfg_matrix_trace_fps_time_slice, // If the number of frames dropped by the same Activity * 16.66ms > time_slice, this Activity is reported. The default value is 10s
    clicfg_matrix_trace_evil_method_threshold, // time threshold for slow methods. The default value is 700ms

    clicfg_matrix_fps_dropped_normal, // Normal frame drop, default [3, 9]
    clicfg_matrix_fps_dropped_middle, // Medium frame drop, default [9, 24]
    clicfg_matrix_fps_dropped_high, // High frame drop, default [24, 42]
    clicfg_matrix_fps_dropped_frozen, // Frozen frame drop, default [42, ∞)

    clicfg_matrix_trace_app_start_up_threshold, // Cold start time threshold, 10s by default
    clicfg_matrix_trace_warm_app_start_up_threshold, // Warm start time threshold, 4s by default
}
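These keys are read through Matrix's dynamic config interface. A minimal sketch of supplying custom values, assuming the IDynamicConfig interface used in the Matrix sample (method signatures may differ between versions):

import com.tencent.mrs.plugin.IDynamicConfig;

// Sketch modeled on DynamicConfigImplDemo in the Matrix sample.
public class DynamicConfigImpl implements IDynamicConfig {
    @Override
    public String get(String key, String defStr) { return defStr; }

    @Override
    public int get(String key, int defInt) {
        // e.g. lower the slow-method threshold from the default 700 ms to 500 ms (illustrative)
        if ("clicfg_matrix_trace_evil_method_threshold".equals(key)) {
            return 500;
        }
        return defInt;
    }

    @Override
    public long get(String key, long defLong) { return defLong; }

    @Override
    public boolean get(String key, boolean defBool) { return defBool; }

    @Override
    public float get(String key, float defFloat) { return defFloat; }
}

The instance is then passed to the builder via .dynamicConfig(...), as in the configuration shown above.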

Reporting

The jank monitor reports more detailed information than the memory-leak monitor does.

ANR

When an ANR occurs, Matrix reports the following information:

{
    "tag": "Trace_EvilMethod",
    "type": 0,
    "process": "sample.tencent.matrix",
    "time": 1590397340910,
    "machine": "HIGH",                 // Device level
    "cpu_app": 0.001405802921652454,   // Ratio of CPU time used by the app (appTime / cpuTime)
    "mem": 3030949888,                 // Total memory (RAM) of the device
    "mem_free": 1695964,               // Available memory of the device; part may be used by the system kernel
    "detail": "ANR",
    "cost": 5006,                      // Total method execution time
    "stackKey": "30|",                 // Key method
    "scene": "sample.tencent.matrix.trace.TestTraceMainActivity",
    "stack": "0,1048574,1,5006\n1,15,1,5004\n2,30,1,5004\n",  // Critical path of method execution
    "threadStack": " \nat android.os.SystemClock:sleep(120)\nat sample.tencent.matrix.trace.TestTraceMainActivity:testInnerSleep(234)\nat sample.tencent.matrix.trace.TestTraceMainActivity:testANR(135)\nat java.lang.reflect.Method:invoke(-2)\nat android.view.View$DeclaredOnClickListener:onClick(4461)\nat android.view.View:performClick(5212)\nat android.view.View$PerformClick:run(21214)\nat android.os.Handler:handleCallback(739)\nat android.os.Handler:dispatchMessage(95)\nat android.os.Looper:loop(148)\nat android.app.ActivityThread:main(5619)\n",  // Thread stack
    "processPriority": 20,             // Process priority
    "processNice": 0,                  // Nice value of the process
    "isProcessForeground": true,       // Whether the app is in the foreground
    "memory": {
        "dalvik_heap": 17898,          // Java heap memory allocated in the VM, KB
        "native_heap": 6796,           // Native memory allocated, KB
        "vm_size": 858132              // Virtual memory size, i.e. the total address space accessible to the process, KB
    }
}

The device classification is as follows:

  1. BEST: memory ≥ 4 GB
  2. HIGH: memory ≥ 3 GB, or memory ≥ 2 GB with 4 or more CPU cores
  3. MIDDLE: memory ≥ 2 GB with 2 or more CPU cores
  4. LOW: memory ≥ 1 GB
  5. BAD: memory below 1 GB
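A rough sketch of this classification logic, using the thresholds from the list above (an illustration, not Matrix's exact DeviceUtil implementation):

// Illustration only: classify a device level from total RAM and CPU core count.
enum DeviceLevel { BEST, HIGH, MIDDLE, LOW, BAD }

final class DeviceClassifier {
    private static final long GB = 1024L * 1024L * 1024L;

    static DeviceLevel classify(long totalMemBytes, int cpuCores) {
        if (totalMemBytes >= 4 * GB) return DeviceLevel.BEST;
        if (totalMemBytes >= 3 * GB || (totalMemBytes >= 2 * GB && cpuCores >= 4)) return DeviceLevel.HIGH;
        if (totalMemBytes >= 2 * GB && cpuCores >= 2) return DeviceLevel.MIDDLE;
        if (totalMemBytes >= 1 * GB) return DeviceLevel.LOW;
        return DeviceLevel.BAD;
    }
}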

Startup

Under normal startup conditions:

{ "tag": "Trace_StartUp", "type": 0, "process": "sample.tencent.matrix", "time": 1590405971796, "machine": "HIGH", "cpu_app": 2.979125443261738E-4, "MEM ": 3030949888, "mem_free": 1666132, "application_create": "Application_create_scene ": 100, "first_activity_create": 318, // First activity startup time "startup_duration": 2381, // total startup time "is_warm_start_UP ": false, // Whether it is a warm start}Copy the code

The fields application_create, first_activity_create, and startup_duration correspond to applicationCost, firstScreenCost, and coldCost respectively:

firstMethod.i    LAUNCH_ACTIVITY   onWindowFocusChange          LAUNCH_ACTIVITY  onWindowFocusChange
^                        ^                   ^                         ^                  ^
|                        |                   |                         |                  |
|---------app---------|--|---firstActivity---|----------...------------|---careActivity---|
|<--applicationCost-->|
|<--------------firstScreenCost------------->|
|<----------------------------------------coldCost---------------------------------------->|
.                        |<-----warmCost---->|

There are four startup scenarios:

  1. 100: the process was started by an Activity
  2. 114: the process was started by a Service
  3. 113: the process was started by a BroadcastReceiver
  4. -100: unknown, e.g. a ContentProvider

If the startup is too slow:

{ "tag": "Trace_EvilMethod", "type": 0, "process": "sample.tencent.matrix", "time": 1590407016547, "machine": "HIGH", "cpu_app": 3.616498950411638E-4, "MEM ": 3030949888, "mem_free": 1604416, "detail": "STARTUP", "cost": 2388, "stack": "0,2,1,43 \ n1, 121, 0 \ n1, 1,8,0 \ n2, 99, 0 \ n0, 1048574, 0 \ n0, 1048574,1,176 \ n1, 15,1,144 \ n0, 1048574,1,41 \ n", "stackKey" : "2 |", "subType" : 1 / / 1 on behalf of the cold start, 2 for a warmer}Copy the code

Slow method

{ "tag": "Trace_EvilMethod", "type": 0, "process": "sample.tencent.matrix", "time": 1590407411286, "machine": "HIGH", "CPU_app ": 8.439117339531338E-4," MEM ": 3030949888, "mem_free": 1656536, "detail": "NORMAL", "cost": 804, // Total time of method execution "usage": "0.37%", // Total time of method execution, CPU usage of the current thread "scene": "Sample. Tencent. Matrix. Trace. TestTraceMainActivity", "stack" : "0104574,1,804 \ n1, 14,1,803 \ n2, 29,1,798 \ n", "stackKey" : "29 |}"Copy the code

Frame rate

In the case of frame drop, Matrix reports the following information:

{ "tag": "Trace_FPS", "type": 0, "process": "sample.tencent.matrix", "time": 1590408900258, "machine": "HIGH", "cpu_app": 0.0030701181354057853, "meM ": 3030949888, "mem_free": 1642296, "scene": "Sample. Tencent. Matrix. Trace. TestFpsActivity", "dropLevel" : {/ / the number of different levels of frame drop problem "DROPPED_FROZEN" : 0, "DROPPED_HIGH" : 0, "DROPPED_MIDDLE": 3, "DROPPED_NORMAL": 14, "DROPPED_BEST": 451}, "dropSum": {// Total number of frames dropped for different levels of frame drop problem "DROPPED_FROZEN": 0, "DROPPED_HIGH": 0, "DROPPED_MIDDLE": 41, "DROPPED_NORMAL": 57, "DROPPED_BEST": 57 }, "fps": 46.38715362548828, // frame rate "dropTaskFrameSum": 0 // meaning unknown, normally always 0}Copy the code

How it works

(Note: this part is mainly based on the official Matrix documentation.)

As mentioned at the beginning, Matrix's jank monitoring is based on ASM instrumentation. The principle is to proxy the compile-time task transformClassesWithDexTask, take all class files as input, and scan and instrument every class file with the ASM tool: an AppMethodBeat.i() call is inserted at the beginning of each method and an AppMethodBeat.o() call at the end, each recording a timestamp, so the execution time of the method can be derived.
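
Conceptually, a method looks roughly like this after instrumentation (a hand-written illustration of the effect; the real injection happens in bytecode via the Matrix Gradle plugin, AppMethodBeat below is a minimal stand-in, and 1023 is a hypothetical method ID assigned at build time):

// Minimal stand-in for Matrix's AppMethodBeat, for illustration only.
final class AppMethodBeat {
    static void i(int methodId) { /* record (enter, methodId, timestamp) into a buffer */ }
    static void o(int methodId) { /* record (exit, methodId, timestamp) into a buffer */ }
}

class Worker {
    // After instrumentation, every method is bracketed by i()/o() calls:
    public void doSomething() {
        AppMethodBeat.i(1023);
        heavyWork();
        AppMethodBeat.o(1023);
    }

    private void heavyWork() { /* ... */ }
}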

There are several key points in the instrumentation process:

  1. Instrumentation is done when this compile task runs because ProGuard has already completed before it, meaning the class files being instrumented are already obfuscated. Doing it after ProGuard matters: if instrumentation happened earlier, some methods would no longer satisfy the inline rules and could not be optimized away by ProGuard, so the program's method count could not be reduced, which would lead to a method-count (64K) problem.

  2. To reduce the amount of instrumentation and the performance cost, the instruction set of each class method is traversed to check whether the scanned method contains only simple instructions such as PUT/READ FIELD; default or anonymous constructors are filtered out, as are simple, non-time-consuming methods such as getters and setters.

  3. For page startup time, the interval between Activity#onCreate and Activity#onWindowFocusChanged needs to be measured, so during instrumentation all Activity implementation classes in the application are collected and their onWindowFocusChanged method is overridden.

  4. To record method execution conveniently and efficiently, Matrix assigns an independent ID to each instrumented method. During instrumentation, each method's signature and assigned ID are recorded, and a mapping file is produced when instrumentation finishes, which supports analysis after the data is reported. To optimize memory usage, the method ID and timestamp are packed together and recorded in a long array, roughly in the format sketched below.
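
A sketch of that packing, modeled on Matrix's AppMethodBeat; the exact bit widths (1 flag bit, 20 method-ID bits, 43 timestamp bits) are assumptions:

// Sketch only: pack one record into a single long.
final class MethodRecord {
    static long pack(boolean isIn, int methodId, long uptimeMillis) {
        long record = 0L;
        if (isIn) {
            record |= 1L << 63;                  // highest bit: 1 = method enter, 0 = exit
        }
        record |= ((long) methodId) << 43;       // next 20 bits: method ID
        record |= uptimeMillis & 0x7FFFFFFFFFFL; // lowest 43 bits: timestamp
        return record;
    }

    static boolean isIn(long record)   { return ((record >>> 63) & 0x1) == 1; }
    static int methodId(long record)   { return (int) ((record >>> 43) & 0xFFFFF); }
    static long timestamp(long record) { return record & 0x7FFFFFFFFFFL; }
}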

There is also the problem of stack clustering: if the raw collected data were reported directly, the volume would be large and it would be hard for the backend to cluster the problematic stacks. So before reporting, the collected data is integrated and trimmed, and a key that represents the jank stack is computed to make backend aggregation easier. Concretely, the collected buffer is traversed; an adjacent i and o pair forms one complete method call, from which a call tree and the cost of each call are derived; identical calls at each level are then aggregated; and finally a simple strategy picks the main time-consuming top-level functions as the key that represents the jank stack.
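
A minimal illustration of the i/o matching step (not Matrix's actual TraceDataUtils): enter/exit records are matched with a stack to get each call's cost and depth.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustration only: match method-enter (i) and method-exit (o) records to
// compute the cost and depth of each complete call.
final class CallTreeBuilder {
    static final class Record {
        final boolean isIn; final int methodId; final long timeMs;
        Record(boolean isIn, int methodId, long timeMs) {
            this.isIn = isIn; this.methodId = methodId; this.timeMs = timeMs;
        }
    }

    static final class Call {
        final int methodId; final long costMs; final int depth;
        Call(int methodId, long costMs, int depth) {
            this.methodId = methodId; this.costMs = costMs; this.depth = depth;
        }
    }

    static List<Call> build(List<Record> buffer) {
        Deque<Record> pending = new ArrayDeque<>();
        List<Call> calls = new ArrayList<>();
        for (Record r : buffer) {
            if (r.isIn) {
                pending.push(r);                        // method entered, wait for its exit
            } else if (!pending.isEmpty() && pending.peek().methodId == r.methodId) {
                Record in = pending.pop();              // matched enter/exit pair
                calls.add(new Call(r.methodId, r.timeMs - in.timeMs, pending.size()));
            }
        }
        return calls;
    }
}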

Frame-rate monitoring works by registering a listener with Choreographer and, on each doFrame callback, checking whether the time elapsed since the previous frame exceeds the jank threshold. If it does, all the data before the current index in the buffer array (that is, all the method execution records between the two frames) is taken out for analysis and reporting.
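
A minimal sketch of detecting a janky frame with Choreographer (the threshold and the reporting hook are illustrative, not Matrix's actual UIThreadMonitor):

import android.view.Choreographer;

// Illustration only: flag a frame as janky when the gap since the previous
// vsync exceeds a threshold, then hand the interval off for analysis.
final class FrameMonitor implements Choreographer.FrameCallback {
    private static final long FRAME_INTERVAL_NANOS = 16_666_667L;  // ~60 fps
    private static final int JANK_THRESHOLD_FRAMES = 3;            // illustrative threshold

    private long lastFrameNanos = -1;

    void start() {
        Choreographer.getInstance().postFrameCallback(this);
    }

    @Override
    public void doFrame(long frameTimeNanos) {
        if (lastFrameNanos > 0) {
            long droppedFrames = (frameTimeNanos - lastFrameNanos) / FRAME_INTERVAL_NANOS - 1;
            if (droppedFrames >= JANK_THRESHOLD_FRAMES) {
                // Matrix would slice the method buffer recorded between the two
                // frames and analyse/report it; this is just a placeholder.
                onJank(droppedFrames);
            }
        }
        lastFrameNanos = frameTimeNanos;
        Choreographer.getInstance().postFrameCallback(this);       // keep listening
    }

    private void onJank(long droppedFrames) { /* analyse & report */ }
}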

ANR monitoring is simpler: every time doFrame arrives, a timer is reset and a node is inserted into the buffer array. If the timer is not cancelled within 5 s, an ANR is considered to have occurred; the key methods that led to the ANR can be assumed to lie in that interval and located from the recorded timestamps.
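
A rough sketch of this reset-a-timer-per-frame idea using a Handler on a background thread (illustrative; Matrix's actual AnrTracer also marks the corresponding position in the method buffer):

import android.os.Handler;
import android.os.HandlerThread;

// Illustration only: post a delayed "ANR task" on every frame and cancel it on
// the next frame; if 5 s pass without a cancel, treat it as an ANR.
final class AnrWatchdog {
    private static final long ANR_TIMEOUT_MS = 5_000L;

    private final HandlerThread thread = new HandlerThread("anr-watchdog");
    private final Handler handler;
    private final Runnable anrTask = this::onAnr;

    AnrWatchdog() {
        thread.start();
        handler = new Handler(thread.getLooper());
    }

    // Call this from the frame callback (each doFrame).
    void onFrame() {
        handler.removeCallbacks(anrTask);             // cancel the previous timer
        handler.postDelayed(anrTask, ANR_TIMEOUT_MS); // arm a new one
    }

    private void onAnr() {
        // Matrix would dump the method buffer and the main-thread stack here, then report.
    }
}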

In addition, calling System.nanoTime() before and after every method has a significant performance impact. In practice, a single method that executes in under 5 ms is not the main cause of jank and can be ignored, and if it is called many times, the cost still shows up in its parent method. So to reduce the overhead, Matrix uses a dedicated thread that updates a time variable every 5 ms, and each method only reads that variable on entry and exit, which keeps the performance loss small.
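
A minimal sketch of such a dedicated time-updating thread (illustrative; Matrix's AppMethodBeat implements this with its own update loop and pause/resume logic):

import android.os.SystemClock;

// Illustration only: a background thread refreshes a relative timestamp every
// 5 ms; instrumented methods read the volatile field instead of querying the
// system clock on every entry/exit.
final class TickerClock {
    private static final long UPDATE_INTERVAL_MS = 5;

    private final long baseTimeMs = SystemClock.uptimeMillis();
    private volatile long currentDiffMs = 0;
    private volatile boolean running = true;

    void start() {
        Thread updater = new Thread(() -> {
            while (running) {
                currentDiffMs = SystemClock.uptimeMillis() - baseTimeMs;
                SystemClock.sleep(UPDATE_INTERVAL_MS);
            }
        }, "time-update-thread");   // thread name is illustrative
        updater.setDaemon(true);
        updater.start();
    }

    void stop() { running = false; }

    // Read on method entry/exit instead of calling the clock directly.
    long relativeTimeMs() { return currentDiffMs; }
}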