This article is part of the "Douyin Android Performance Optimization" series, in-depth technical content written by experts from the Douyin Android Foundation Technology team. The series shares the performance optimization methodology, tools, and practices the team has accumulated while building the ultimate user experience for Douyin, in the hope of learning and growing together with fellow engineers. User-perceived interaction response time is the performance metric Android users notice most in daily use, so it matters greatly in day-to-day development. To deliver the best possible interactive experience, the Douyin Android Foundation Technology team has kept pushing on performance, including building first-class timing-analysis tools.

1 Overview

As the saying goes, a craftsman who wants to do his work well must first sharpen his tools. To do performance optimization well, we first need to be able to find performance problems, which requires reliable analysis tools. The mainstream profilers on the market include Systrace, TraceView, and Android Studio's CPU Profiler, all of which should be familiar to anyone working on performance. Douyin used Systrace as its main analysis tool at the very beginning, and it played a large role in the early stage of optimization. As Douyin's performance optimization entered deep water, we needed to find and fix finer-grained, multi-dimensional problems: we now care about costs of a few milliseconds, and about the lock blocking and I/O waits hitting some low-end users online. The mainstream profilers, with their limitations and high overhead, could no longer meet Douyin's needs, so we set out to build more flexible, fine-grained, and information-rich tools to help us optimize efficiently.

Against this background, the Douyin Android Foundation Technology team developed the tracer Rhea (the goddess of time), a performance analysis tool that automatically adds Trace points via static code instrumentation to analyze where an app spends time at runtime. The name reflects the core design principles of the tool: it should be a fully functional, efficient "goddess" everyone loves. Rhea's trace capture not only has low overhead, but can also run entirely on the app side without PC tools. Besides conventional function timing, Rhea can also track system events such as lock information, I/O time, and Binder IPC. Finally, conversion scripts turn the raw trace files into visual reports for analyzing performance issues.

2 Advantages

Thanks to its non-intrusiveness, high performance, and completeness of information, Rhea has been used in multiple ByteDance apps with obvious results, and has repeatedly helped engineers find performance problems quickly. The information Rhea captures includes application-layer function timing at unlimited depth, plus I/O, lock, Binder, and CPU-scheduling timing. Some of its output is shown below:

Compared with other Android profiling tools, its specific advantages are as follows:

Systrace can only monitor specific system information; application-layer timing has to be added by hand. TraceView's overhead depends heavily on the sampling rate: sample too often and the cost is huge, too rarely and the offending function is hard to pin down. Nanoscope has almost no overhead, but it requires flashing a custom ROM each time, which is very costly. In addition, these tools only support offline analysis of debuggable builds. None of them is a complete answer for app performance optimization, whereas Rhea is an aggregator: it combines the advantages of all these tools and remedies their defects.

3 Architecture Evolution

3.1 Stage 1: Supplementing function timing on top of Systrace

Systrace is a common tool for Android performance debugging and optimization. It can collect process activity such as function call times and locks, as well as kernel information such as CPU scheduling, I/O activity, and Binder calls, and display it all on a unified timeline in the Chrome browser, which makes debugging and optimization convenient. Systrace was therefore chosen as the main tool for Douyin's early performance optimization; its general workflow is as follows:

3.1.1 Functional improvements

Systrace can only monitor the elapsed time of particular system calls; it does not support timing analysis of application code, which limits its use. Native Systrace requires developers to manually add Trace.beginSection and Trace.endSection pairs at the start and end of a method, which in practice means guessing where the time goes and then hand-adding monitoring pairs. Repeatedly adding trace points, packaging, running, and collecting data to progressively narrow down a time-consuming method makes Systrace extremely expensive to use.

To improve Systrace's ease of use, we developed Rhea 1.0, which extends Systrace with an automatic instrumentation mechanism: bytecode instrumentation inserts Trace.beginSection and Trace.endSection pairs automatically, and the overhead this monitoring introduces is kept under control by limiting the method depth at runtime.

  • Instrumentation class and method pseudocode:
class Tracer{
    method_stack = list()
    max_size = 6

    methodIn(method_id, method_name){
        if(method_stack.size() <= max_size){
            method_stack.push(method_id)
            Trace.beginSection(method_name)
        }
    }

    methodOut(){
        if(method_stack.size() > 0){
            method_stack.pop()
            Trace.endSection()
        }
    }
}
  • An instrumented method:
method1(){
    Tracer.methodIn(1, "method1")
    ...
    Tracer.methodOut()
}
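To make the depth-limiting idea concrete, here is a minimal runnable sketch. It is written in Python purely for illustration (the real mechanism is injected Java bytecode), and the names `method_in`/`method_out` are stand-ins for the generated instrumentation: events are only emitted while the per-thread call depth is within the limit, and an explicit depth counter keeps begin/end paired even for frames deeper than the limit.

```python
class Tracer:
    """Sketch of a depth-limited tracer: only the outermost MAX_DEPTH frames
    emit begin/end events, bounding the instrumentation overhead."""
    MAX_DEPTH = 6

    def __init__(self):
        self.depth = 0    # current call depth on this (single) thread
        self.events = []  # recorded ("B"/"E", name) pairs, standing in for
                          # Trace.beginSection / Trace.endSection

    def method_in(self, name):
        self.depth += 1
        if self.depth <= self.MAX_DEPTH:
            self.events.append(("B", name))

    def method_out(self, name):
        # emit the end event only if the matching begin was emitted,
        # so the output stays balanced even past the depth limit
        if self.depth <= self.MAX_DEPTH:
            self.events.append(("E", name))
        self.depth -= 1


tracer = Tracer()

def recurse(n):
    """A call chain deeper than MAX_DEPTH: frames 7 and 8 emit nothing."""
    tracer.method_in("m%d" % n)
    if n < 8:
        recurse(n + 1)
    tracer.method_out("m%d" % n)
```

Running `recurse(1)` drives the depth to 8, but only the outer six frames produce events, and every "B" has a matching "E".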

The output data looks like this, and all the methods in the specified hierarchy are displayed in the output HTML as expected:

3.1.2 The "Did Not Finish" problem

When using the modified Systrace, we often encounter the following problems:

Analysis showed that the main cause is a method being interrupted mid-execution: for example, an exception thrown inside a method is caught by its caller, so the exit instrumentation SysTracer.o of the failing method is never called. As shown in the figure: when the error method inside test executes, the array access arr[2] goes out of bounds, so the instrumented call SysTracer.o(13L) in test is never reached, and the exception is caught by the catch block in onCreate. As a result, the instrumentation in test is not called in pairs, and ultimately none of the method sections in the layers outside test can be closed correctly. (Note: the instrumentation methods mentioned in this section, i.e. the SysTracer methods, are inserted automatically via bytecode instrumentation.)

The solution is to insert an extra instrumentation call just after every catch site, which re-balances the unpaired begin/end calls along the exception path. As shown below:

3.1.3 Remaining problems

  • Performance issues

As Rhea 1.0 saw deeper use, alongside the great convenience it brought, its shortcomings were gradually exposed. Its own overhead during data capture skews the results of actual optimization work. Rigorous testing put its performance loss at about 11.5%, as shown below:

In actual use we found that with Systrace enabled, Sleep time could account for over 40% in extreme cases. On one hand, Sleep time came from app locks. For example, while optimizing SharedPreferences on Douyin's startup path, the Systrace function of Rhea 1.0 showed obvious lock time in SP calls, yet after optimizing the SP locks the online effect was not obvious; a series of follow-up checks showed the lock time only appeared when Systrace was enabled. On the other hand, Uninterruptible Sleep time came from I/O. For example, during one optimization we saw a great many __fdget_pos operations, accounting for at least 60% of the total Uninterruptible Sleep time. After spending a long time adding extra I/O information, we found the root cause: once Systrace is started, traces from all threads are written to the same file, and all threads synchronously contend for the kernel's f_pos lock. The tool's own performance problems misled the troubleshooting, and the incomplete I/O information made the investigation inefficient as well.

  • Missing calls due to the depth limit

Systrace's design means that inserting more instrumentation into the application layer to chase application-layer costs can itself cause serious performance problems, so we used the tool offline and limited the instrumentation depth to reduce runtime loss. But because of that limit, function-call data beyond the configured depth is missing, and the exact cost of deeply nested functions cannot be located.

  • Restrictions on Usage Scenarios

Because Systrace depends on a PC during capture, it cannot serve scenarios that require collecting data without one. For example, our product operations colleagues often test Douyin's performance out in the field, in subways, restaurants, cafes, and so on; Systrace cannot capture performance data in such real-world scenarios.

  • Unusable on some low-end devices

While optimizing for low-end devices, we found that some early low-end models, such as certain Samsung and OPPO phones, could not support trace capture at all.

In view of the above problems, we explored and optimized the tool in depth, and its development entered the second stage.

3.2 Stage 2: High-performance full-scene Trace capture tool

3.2.1 Function upgrade

To make up for the shortcomings of the first stage, further improve performance, and cover more scenarios, we found a new approach: in the Java layer, record the timestamps at method entry and exit together with the executing thread, filter out functions below a configured time threshold, and write the data to a file asynchronously. After collection, the output file is converted into the expected format, and the SDK's Systrace tool then turns it into HTML for easy viewing, achieving the same visualization as Systrace.
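The recording scheme can be sketched as follows. This is a Python illustration of the idea, not Rhea's actual code, and the 5 ms threshold is an assumed example value: method-exit records below the threshold are dropped in the hot path, and the surviving records are written on a background thread.

```python
import queue
import threading

THRESHOLD_MS = 5  # assumed example cutoff; the real threshold is configurable

class MethodTrace:
    """Filter method-exit records by duration, then persist them asynchronously."""

    def __init__(self, sink):
        self.q = queue.Queue()
        self.sink = sink  # stands in for the trace file
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def on_method_exit(self, depth, method_id, in_ms, out_ms, tname, tid):
        # hot path: only a cheap comparison and a queue push
        if out_ms - in_ms >= THRESHOLD_MS:
            self.q.put((depth, method_id, in_ms, out_ms, tname, tid))

    def _drain(self):
        # background thread: format and write records off the critical path
        while True:
            rec = self.q.get()
            if rec is None:
                return
            self.sink.append("%d,%d,%d,%d,%s,%d" % rec)

    def close(self):
        self.q.put(None)
        self.worker.join()
```

A call lasting 12 ms is queued and written; a 3 ms call never leaves the hot path.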

3.2.2 Implementation principle

How does Rhea 2.0 collect data and generate HTML with the same visualization as Systrace? The --from-file option of the SDK's Systrace tool can convert raw .trace data into HTML for analysis.

After several attempts, we concluded that a .trace file the SDK Systrace tool can parse must follow this format:

Format description:

  • threadName: the thread name, or the package name for the main thread.
  • threadID: the thread ID.
  • B | E: marks whether the record is a method begin (B) or end (E).
  • pid: the process ID.
  • tag: the method tag; its length cannot exceed 127 characters.

Therefore, the data collected by MTrace must contain at least the fields above.

The following is the corresponding Trace format:

depth,methodID,inTime,outTime,threadName,threadID
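A conversion script then expands each record into the begin/end marker lines that `systrace --from-file` understands. The sketch below is illustrative: the exact whitespace and field layout the SDK tool expects differs across versions, so only the essential fields from the format description above are emitted.

```python
def mtrace_to_atrace(records, pid):
    """Turn 'depth,methodID,inTime,outTime,threadName,threadID' records into
    time-ordered B/E tracing_mark_write lines (timestamps in microseconds)."""
    events = []
    for rec in records:
        depth, mid, t_in, t_out, tname, tid = rec.split(",")
        events.append((int(t_in), tname, tid, "B|%s|%s" % (pid, mid)))
        events.append((int(t_out), tname, tid, "E"))
    events.sort(key=lambda e: e[0])  # interleave begins and ends by timestamp
    return [
        "%s-%s (%s) [000] ...1 %.6f: tracing_mark_write: %s"
        % (tname, tid, pid, ts / 1e6, marker)
        for ts, tname, tid, marker in events
    ]
```

One record yields one B line and one E line on the shared timeline, which is what lets the HTML view nest sections correctly.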

Compared with the Systrace-based Rhea 1.0, the Method Trace of Rhea 2.0 reduces performance loss significantly, from 11.5% down to 3%. The effect is shown below:

3.2.3 Best practices

  • Features

Compared with Systrace, MTrace provides richer offline capabilities, including a feedback mode for pinpointing lag reported by real users, and one for product, operations, and QA colleagues investigating problems in the field.

In short, performance feedback is available at your fingertips no matter where you are. The complete workflow is shown in the figure.

  • Online case

A real case: a user on a Douyin grayscale build reported stuttering, and the MTrace feedback package let us analyze and investigate the problem remotely. Below is the real stutter data returned with the user's cooperation; after parsing, the time-consuming call site was found:

3.2.4 Existing problems

Since this stage only collected Java-method-level data, during Douyin's startup I/O optimization Method Trace could not tell us which functions performed I/O or which files were read or written, which made the optimization much harder. Moreover, in some complex scenarios Method Trace only records how long a function ran; it cannot tell whether the time was inflated by lock contention from multi-thread synchronization or by system I/O.

In response to the above problems, we realized that a good set of Trace tools needed to incorporate more system events, so the tool moved on to the third stage of polishing.

3.3 Stage 3: Dynamic integrated Trace tool planning

Rhea 1.0 and 2.0 achieved remarkable results in Douyin's early performance optimization, but as the work deepened, many limitations and inconveniences were exposed.

On one hand, conventional Systrace has many limitations for performance optimization. First, its Trace information is insufficient: by default it only contains the timing events preset by the system, which is not enough for routine timing analysis, so you have to call Trace.beginSection and Trace.endSection manually on the app side to get more function timing, and then remove those calls after use to avoid bloating the release package. Second, Systrace itself has high overhead; when an application inserts a large amount of instrumentation into business code, the loss can exceed 50% in extreme cases. Third, Systrace depends entirely on PC-side tooling for capture, which is inflexible: for problems that recur only in particular regions or for particular user groups, we cannot capture effective information directly and efficiently, and must rely on developers or testers reproducing the issue; for some probabilistic problems reported by users, even deliberate reproduction fails to yield the corresponding Systrace data, so optimization efficiency is low.

On the other hand, although the simple customized Trace capture has markedly better performance and flexibility than Systrace, its data only contains basic timing information, which is still limiting in some complex scenarios (such as time lost waiting on locks).

None of the above tools can fully support performance optimization of core scenarios such as Douyin's startup, first feed render, and low-end devices, so we needed to design a more powerful, dynamic, integrated Trace tool to assist performance analysis. Its requirements:

  1. The tool must be flexible: it must not depend on PC-side capture scripts, must work online, and must be able to pull whatever data is wanted at any moment while the app runs. As a platform tool, Rhea must also support dynamic extension, multi-scenario configuration, and dynamic switches, so that any needed information can be collected.
  2. The tool must capture comprehensive Trace information, including ATrace instrumentation, lock-wait information, I/O information, and Binder time.
  3. It must support visualization: output in a unified format, with the final result rendered in a Systrace-compatible front end so user habits do not change.
  4. Overhead must be low, so it does not mislead the direction of performance optimization.

Therefore, we redesigned a new generation of Trace analysis tools:

Overall, by integrating the Rhea SDK, the app instruments method timing at unlimited depth during packaging, inserts I/O-, Binder-, and lock-related Trace information at runtime, supports dynamic configuration, and unifies everything in the atrace format. It can simultaneously obtain system-level Linux ftrace, Android framework atrace, and the app's own inserted ATrace information without depending on a PC for capture, and finally provides visual display. The concrete implementation is as follows:

3.3.1 Capturing Trace without a PC

To capture Trace without depending on a PC, we first need to understand how Android atrace is implemented. The data sources atrace covers include:

User-space data includes the application layer's custom Trace, the system layer's GFX rendering Trace, the system layer's lock-related Trace, and so on. All of it ends up written, via Trace.beginSection in the Android SDK or the ATRACE_BEGIN macro, to the same file: /sys/kernel/debug/tracing/trace_marker. This node allows user space to write strings, and ftrace records the timestamp of each write. When upper-layer code calls the different functions, different call information is written, such as function entry and exit. When atrace handles multiple trace categories for user space, it simply activates different tags: if Graphics is selected, ATRACE_TAG_GRAPHICS is activated and rendering events get logged to trace_marker.
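The strings the user layer writes are tiny: beginSection writes "B|pid|name", endSection writes "E", and ftrace timestamps each write. Below is a Python sketch of the writer; a plain file stands in for /sys/kernel/debug/tracing/trace_marker (which needs a rooted or debuggable environment), and newlines are added only so the stand-in file is readable, since the real node treats each write() as one event.

```python
import os

class TraceMarker:
    """Minimal sketch of the user-space side of atrace marker writes."""

    def __init__(self, path):
        # on a device: path = "/sys/kernel/debug/tracing/trace_marker"
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        self.pid = os.getpid()

    def begin_section(self, name):
        # what Trace.beginSection / ATRACE_BEGIN effectively write
        os.write(self.fd, ("B|%d|%s\n" % (self.pid, name)).encode())

    def end_section(self):
        # what Trace.endSection / ATRACE_END effectively write
        os.write(self.fd, b"E\n")

    def close(self):
        os.close(self.fd)
```

Each begin/end pair thus costs a single short write, which is why the marker path is cheap in isolation and only becomes a bottleneck under heavy multi-thread contention.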

Kernel-space data mainly supplements the analysis: freq, sched, binder, and so on. Commonly used CPU-scheduling information includes:

  • CPU frequency changes
  • Task execution status
  • Big/little core scheduling
  • CPU Boost scheduling

An app can read some of this directly from the nodes under /sys/devices/system/cpu, but information identifying thread states can only be obtained through the system or adb, and it is not controlled by a single node: the corresponding event nodes must be activated so that ftrace records tracepoints for the different events. At runtime the kernel logs events into the ftrace buffer according to each node's enable state. For example, to record thread scheduling states, nodes like the following must be activated:

events/sched/sched_switch/enable
events/sched/sched_wakeup/enable

After activation, information related to thread scheduling status can be obtained, such as:

  • Running: the thread is executing code normally
  • Runnable: the thread is runnable and waiting to be scheduled; staying here for a long time means the CPU is busy
  • Sleeping: sleeping, usually waiting for an event to be triggered
  • Uninterruptible Sleep: uninterruptible sleep; check the args description to determine the current cause
  • Uninterruptible Sleep - Block I/O: blocked on I/O

Finally, both categories of event records are collected into the same kernel buffer. The PC-side Systrace script specifies parameters such as which trace categories to capture, then triggers /system/bin/atrace on the device to enable the corresponding file nodes. atrace then reads the ftrace buffer and produces raw output containing the ftrace information, which the script finally converts into a visual HTML file. The general flow is as follows:

Therefore, based on how Android atrace works, and drawing on Facebook Profilo's approach of obtaining atrace directly on the app side, we implemented Trace capture without a PC.

A handle to libcutils.so is obtained via dlopen, and pointers to the atrace_enabled_tags and atrace_marker_fd symbols are resolved via dlsym; setting atrace_enabled_tags then flips the ATrace switch on. The implementation is as follows:


  std::string lib_name("libcutils.so");
  std::string enabled_tags_sym("atrace_enabled_tags");
  std::string marker_fd_sym("atrace_marker_fd");

  if (sdk < 18) {
    lib_name = "libutils.so";
    // android::Tracer::sEnabledTags
    enabled_tags_sym = "_ZN7android6Tracer12sEnabledTagsE";
    // android::Tracer::sTraceFD
    marker_fd_sym = "_ZN7android6Tracer8sTraceFDE";
  }

  if (sdk < 21) {
    handle = dlopen(lib_name.c_str(), RTLD_LOCAL);
  } else {
    handle = dlopen(nullptr, RTLD_GLOBAL);
  }
  // safe check the handle
  if (handle == nullptr) {
    ALOGE("atrace_handle is null");
    return false;
  }

  atrace_enabled_tags_ = reinterpret_cast<std::atomic<uint64_t> *>(dlsym(handle, enabled_tags_sym.c_str()));
  if (atrace_enabled_tags_ == nullptr) {
    ALOGE("atrace_enabled_tags not defined");
    goto fail;
  }

  atrace_marker_fd_ = reinterpret_cast<int *>(dlsym(handle, marker_fd_sym.c_str()));

Next, we hook the write and __write_chk functions of the libcutils dynamic library; by checking the fd against atrace_marker_fd we intercept the corresponding ATrace writes and dump them locally, or upload them to the cloud for analysis. The implementation is as follows:

ssize_t proxy_write_chk(int fd, const void* buf, size_t count, size_t buf_size) {
  BYTEHOOK_STACK_SCOPE();
  if (Atrace::Get().IsAtrace(fd, count)) {
    Atrace::Get().LogTrace(buf, count);
    return count;
  }

  ATRACE_BEGIN_VALUE("__write_chk:", FileInfo(fd, count).c_str());

  size_t ret = BYTEHOOK_CALL_PREV(proxy_write_chk, fd, buf, count, buf_size);

  ATRACE_END();
  return ret;
}

3.3.2 Provide more comprehensive Trace information

1. Lock time

Java-layer locks, whether synchronized methods or synchronized blocks, eventually go through the virtual machine's MonitorEnter and MonitorExit. MonitorEnter handles the inflation path: from unlocked to thin lock, with biasing and reentry handled within the thin lock; when contention occurs and the spin count is exceeded, the lock is upgraded to a fat lock and a monitor object is allocated. Note that ART's "spinning" here is not true busy-waiting: it calls sched_yield to actively yield the CPU and wait for the next schedule.

What we care about first is the wait time after lock contention upgrades a lock to a fat lock; from Android 6.x onward this information is output to trace_marker through ATrace.

However, extra work is needed for the thin-lock information: besides the ATRACE_ENABLE condition there is another switch, systrace_lock_logging, a member of a global variable inside the virtual machine. Its value is normally fixed when the VM starts and defaults to false. It can be enabled by passing -verbose:sys-locks at VM startup, but an ordinary application cannot do that, so we enable it dynamically at runtime with an unusual trick:

  1. First, check whether the size and member order of this structure have changed since Android 7.x.
  2. If nothing changed, we can define an identical structure ourselves, since it contains only primitive bool members and introduces no other dependencies.
  3. If it changed but in a forward-compatible way, i.e. the position of the member we need is unchanged and new members were only appended, we can still define the same structure.
  4. Use dlsym to find the virtual machine's global symbol gLogVerbosity.
  5. Cast it to the predefined structure type.
  6. Access the systrace_lock_logging member and set it to true.
  7. Thin-lock ATrace information can then be output normally.
std::string lib_name("libart.so");
// art::gLogVerbosity
std::string log_verbosity_sym("_ZN3art13gLogVerbosityE");

void *handle = nullptr;
handle = npth_dlopen_full(lib_name.c_str());
if (handle == nullptr) {
  ALOGE("libart handle is null");
  return false;
}

log_verbosity_ = reinterpret_cast<LogVerbosity*>(
    npth_dlsym(handle, log_verbosity_sym.c_str()));
if (log_verbosity_ == nullptr) {
  ALOGE("gLogVerbosity not defined");
  npth_dlclose(handle);
  return false;
}

npth_dlclose(handle);

2. I/O time

While optimizing Douyin's startup path, we measured where cold-start time goes; the largest share was spent with the process in the D state (Uninterruptible Sleep), about 40% of total startup time. Why does a process sit in the D state? A process in uninterruptible sleep is usually waiting on I/O (disk I/O, peripheral I/O, and so on): it enters the state because the I/O has not yet responded, and it can only leave the state once the awaited I/O completes. Similar to the following:

However, when optimizing with Systrace we could only see the kernel-side state above, without knowing what the specific I/O operation was. We therefore designed a dedicated scheme for capturing I/O timing information, covering both user space and kernel space.

First, in user space, to collect the needed I/O timing we hook the standard I/O function family, including open, write, read, fsync, fdatasync, and so on, inserting matching trace points to measure the time of each I/O call. Take fsync as an example:

int proxy_fsync(int fd) {
  BYTEHOOK_STACK_SCOPE();
  ATRACE_BEGIN_VALUE("fsync:", FileInfo(fd).c_str());

  int ret = BYTEHOOK_CALL_PREV(proxy_fsync, fd);

  ATRACE_END();
  return ret;
}

Second, in kernel space, ftrace provides capabilities beyond what Systrace or atrace can enable directly, including advanced features critical for debugging performance problems (features that require root access and often a newer kernel as well). On this basis we added the ability to display custom I/O information: in offline mode, we enable the ftrace node under /sys/kernel/debug/tracing/events/android_fs to collect I/O-related information.

Catapult, an open source project from the Google Android and Chrome teams, is the upstream home of Systrace: it hosts Systrace and its parser, implementing a cross-platform trace parser in JavaScript. On this basis, we developed a Rhea script tool to convert the data into a format Systrace can display, for rapid diagnosis of I/O performance bottlenecks.

For example, a setText call on a View caused an ANR, and the trace captured by Systrace looked like this:

Here we can only see that the main thread is in the D state; there is nothing more to go on. But with our Rhea tool, the trace obtained looks like this:

It is easy to see that the problem is the I/O time spent reading the corresponding font.

3. Binder time

During Douyin's startup optimization we also regularly hit time lost in the Sleep state, typically around 30% of the total. Sleep is usually caused by waiting on locks or by Binder calls. Offline we can enable the tracing/events/binder nodes to obtain this information, but online that is hard due to permission restrictions. We therefore hook IPCThreadState::transact in libbinder.so, the native call backing android_os_BinderProxy_transact, to measure the time of each binder call.

if (TraceProvider::Get().isEnableBinder()) {
  // static jboolean android_os_BinderProxy_transact(JNIEnv* env, jobject obj, jint code, jobject dataObj, jobject replyObj, jint flags)
  bytehook_stub_t stub = bytehook_hook_single(
      "libbinder.so",
      NULL,
      "_ZN7android14IPCThreadState8transactEijRKNS_6ParcelEPS1_j",
      reinterpret_cast<void*>(proxy_transact),
      NULL,
      NULL);
  stubs.push_back(stub);
}

Then the binder call time is measured; if it exceeds the specified threshold, the corresponding stack is printed to assist the analysis of Sleep time.

static void log_binder(int64_t start, int64_t end, int64_t flags) {
  JNIEnv *env = context.env;
  env->CallStaticVoidMethod(context.javaRef, context.logBinder, start, end, flags);
}

status_t proxy_transact(void *pIPCThreadState, int32_t handle, uint32_t code,
                        const void *data, void *reply, uint32_t flags) {
  // todo: add more information
  nsecs_t start = systemTime();
  status_t status = BYTEHOOK_CALL_PREV(proxy_transact, pIPCThreadState, handle, code,
                                       data, reply, flags);
  nsecs_t end = systemTime();
  nsecs_t cost_us = ns2us(end - start);
  if (is_main_thread() && cost_us > 10000) {
    log_binder(ns2us(start), ns2us(end), flags);
  }
  return status;
}

The trace effect is as shown below:

4. Support for adding more data sources later

Of course, supporting only this information cannot cover every problem we may face in future performance optimization. We therefore support dynamic configuration: under the existing framework, simply adding the corresponding configuration item and its implementation lets us quickly and easily collect whatever new information we need.

enum TraceConfigKey {
  kIO = 0,
  kBinder,
  kThinLock,
  kStopTraceUnhook,
  kLockStack,

  kKeyEnd,
};

5. Function timing with unlimited-depth instrumentation

Limiting the instrumentation depth does improve runtime performance, but it causes two problems:

  • Incomplete function call data collection;
  • Difficult to locate deep time-consuming calls;

Therefore, in user mode, to obtain richer app Trace information for performance optimization, we adopted instrumentation with no depth limit, developing a compile-time plugin that instruments methods at every level. Through static code instrumentation, Trace.beginSection and Trace.endSection are inserted at the beginning and end of each app method. The effect is as follows:

3.3.3 Optimization to reduce performance loss

1. Instrumentation performance optimization

At the instrumentation stage, we made the following optimizations:

  • Support a custom instrumentation scope, to eliminate the runtime cost of tracing unrelated modules;
  • To prevent trace sections from being left unclosed, catch blocks are fully instrumented as well, so every beginSection is matched even when an exception is thrown;
  • Frequently called functions can be selectively added to a blacklist to improve runtime performance;
  • To support use in production, instrumentation is performed after ProGuard; thanks to optimizations such as function inlining, this reduces the number of instrumentation points by 2.6% compared with instrumenting before obfuscation. In online mode the method ID is inserted directly, and after the trace is collected the IDs are mapped back to method names on the host or server side; for ease of use offline, the method name itself is inserted during offline packaging;
  • By analyzing bytecode at compile time, instrumentation of functions that cannot be time-consuming is filtered out.
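To illustrate the second point above (again with a hypothetical Tracer stub standing in for android.os.Trace, not Rhea's real API), the generated code wraps the method body so that endSection still runs when an exception escapes, keeping the begin/end pairs balanced:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the shape the plugin generates so trace sections always close.
public class SafeTraceSample {
    // Stand-in for android.os.Trace: tracks currently open sections.
    static class Tracer {
        static final Deque<String> open = new ArrayDeque<>();
        static void beginSection(String name) { open.push(name); }
        static void endSection() { open.pop(); }
    }

    // Without the finally block, a thrown exception would leave an
    // unmatched beginSection and corrupt the nesting of the whole trace.
    static int safeDivide(int a, int b) {
        Tracer.beginSection("Sample#safeDivide");
        try {
            return a / b;
        } finally {
            Tracer.endSection();
        }
    }
}
```

Even when `safeDivide(1, 0)` throws an ArithmeticException, the section is closed and the open-section stack stays empty.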

2. Optimize the performance of start and stop Trace on App side

Since trace capture on the app side relies on hooks, we referred to the implementation of Facebook's Profilo. However, Profilo suffers from a large dynamic library and slow trace start/stop, so we further optimized the size and performance of the dynamic library the app depends on to capture ATrace locally. As follows:

3. Optimize Trace write performance

Since a large number of trace points are inserted into app methods, once atrace is enabled every thread writes its traces into the trace_marker file, causing a sharp increase in I/O overhead that masks the real performance problems. The root cause is that all threads write to trace_marker within a short window and contend for the kernel's pos lock at the same time, so the resulting trace file no longer reflects the real performance issue, as shown in the figure below:

Therefore, in user space we intercept the trace events that were originally written directly to the kernel file, cache them, and dump them with asynchronous I/O. This avoids both the overhead of switching between user mode and kernel mode and the I/O cost of writing directly. The effect is as follows:
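Conceptually, the scheme looks like the simplified user-space sketch below (Rhea's real implementation intercepts the native writes via hooks; the class here is hypothetical): producer threads enqueue events into an in-memory buffer, and a single background thread drains it and performs the batched I/O, so app threads never contend on the kernel file.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified sketch of asynchronous trace writing: app threads do a cheap
// in-memory enqueue; one drainer thread performs the actual writes.
public class AsyncTraceWriter {
    private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>(65536);
    private final StringBuilder sink = new StringBuilder(); // stands in for the trace file
    private final AtomicInteger queued = new AtomicInteger();
    private final AtomicInteger flushed = new AtomicInteger();

    public AsyncTraceWriter() {
        Thread drainer = new Thread(() -> {
            try {
                while (true) {
                    String line = buffer.take();            // blocks until data arrives
                    synchronized (sink) { sink.append(line).append('\n'); }
                    flushed.incrementAndGet();
                }
            } catch (InterruptedException ignored) { }
        }, "trace-drainer");
        drainer.setDaemon(true);
        drainer.start();
    }

    // Called from app threads: drops the event when the buffer is full
    // rather than blocking the caller.
    public void write(String event) {
        if (buffer.offer(event)) queued.incrementAndGet();
    }

    // Wait until everything queued so far has been flushed, then return it.
    public String dump() throws InterruptedException {
        while (flushed.get() < queued.get()) Thread.sleep(1);
        synchronized (sink) { return sink.toString(); }
    }
}
```

Dropping events under backpressure instead of blocking is a deliberate choice here: a tracing tool must never stall the threads it is observing.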

4. Visualization

Since user-space atrace and kernel ftrace are stored in ring buffers in their respective spaces, native Systrace can only visualize them separately. We therefore developed a script tool that merges the traces: multiple trace streams are combined into a single HTML file, which can then be viewed visually in Chrome (chrome://tracing).
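In practice this merge step is script tooling; as a rough illustration of the idea only (the class and method names are hypothetical, and this is not Rhea's actual tool), events from both sources can be normalized into Chrome's JSON Trace Event Format, which chrome://tracing loads directly:

```java
import java.util.List;

// Sketch: normalize events from different sources (user-space atrace,
// kernel ftrace) into Chrome's Trace Event Format so one file shows both.
public class TraceMerger {
    // Build one event record; "ph" is the phase, e.g. "B" (begin) or "E" (end).
    static String event(String source, String name, String phase,
                        long tsMicros, int pid, int tid) {
        return String.format(
            "{\"name\":\"%s\",\"cat\":\"%s\",\"ph\":\"%s\",\"ts\":%d,\"pid\":%d,\"tid\":%d}",
            name, source, phase, tsMicros, pid, tid);
    }

    // Wrap the combined event list in the top-level object the viewer expects.
    static String merge(List<String> events) {
        return "{\"traceEvents\":[" + String.join(",", events) + "]}";
    }
}
```

A real merger must also align the timestamp bases of the two sources before combining them; that step is omitted here.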

4 Future Planning

At present, Rhea's support for native code is incomplete, and its overhead is not low enough: in particular, when analyzing jank and trying to pin down costs of a few milliseconds or finer, the tool's own overhead can still bias the optimization direction to some extent. The tool is also mostly used offline today: because the extensive instrumentation inflates package size, the online portion can only be enabled for small, targeted user groups, and cannot be fully rolled out to locate performance problems among large-scale online users. In the future, we will focus on solving these problems and polishing the trace tool to the utmost.

5 Summary

The main advantages of Rhea, a new generation of Trace analysis tool, are as follows:

1. Flexible to use: it does not depend on PC-side capture scripts, and supports both online and offline modes with configurable switches;

2. Rich data: it collects and traces many kinds of information, including unlimited-depth instrumented ATrace function timings, lock-contention information, I/O information, Binder timings, and more;

3. High compatibility: it supports trace capture on all device models from API 16 to 30;

4. Zero code intrusion: all configuration is done through the Gradle plugin, without any direct code calls.

Join us

We are the client team responsible for Douyin's client-side foundational technologies and for exploring cutting-edge client technology. We focus on performance, architecture, stability, development tools, and compilation and build; we safeguard the team's R&D efficiency and engineering quality, and deliver the ultimate user experience to the 600 million people who use Douyin.

If you are passionate about technology, welcome to join the Douyin basic technology team, and let us build a global app used by hundreds of millions together. We are currently hiring in Shanghai, Beijing, Hangzhou and Shenzhen. For internal referral, please contact [email protected]; email subject: Name - Years of experience - Douyin - Basic technology - Android/iOS.


Welcome to Bytedance Technical Team

Resume mailing address: [email protected]