Background

For a long time, a class of native crashes has occurred on Android when an app calls CookieManager.getCookie(String url). It has troubled many developers and seriously hurt the user experience. The problem spans Android 4.1 through 9.0 and occurs almost exclusively during startup. In Watermelon Video it has long been in the Top 3 of the crash list: it accounts for more than 40% of the crashes in the native Top 10 and more than 30% of all native crashes, and it affects more than 1‰ of users. Android 4.2.2, 4.4.2, 8.1, 9.0, and other versions are all affected. Worst of all, these crashes almost always occur within 2 s of startup, which seriously degrades the experience of the Watermelon Video app. Typical stack screenshots are as follows:

The Native stack

Java stack (>50% of crashes have no Java stack)

Investigation approach

These crash stacks contain only the .so name and offset addresses, with no function names, and the crash does not reproduce reliably, so it is hard to locate the cause directly. The key to the investigation is therefore to find a stack with clear function names, so that we can compare those functions against the AOSP source code and locate the cause.

A preliminary investigation

Although the affected Android versions and models are widely distributed (Android 4.x-9.0), most of the stacks contain almost no function information related to the crash. Fortunately, by combing through all the related crashes, we found a class of crashes on Android 4.2.2 whose stack includes the symbol _ZN4GURLC2ERKSs, which demangles to GURL::GURL(std::string const&).

The stack of this crash is consistent with the problem above: both crash in the native layer when the Java layer calls nativeGetCookie, and the stacks are essentially identical, so they can be treated as the same problem. We pulled the Android 4.2.2 source code related to GURL and analyzed it, but GURL touches a very wide range of code, and finding which call path reaches the memmove function was like looking for a needle in a haystack.

Since GURL is involved, the natural guess was that the problem is URL-related. So I ran a simple experiment online to check whether the URL passed to getCookie was at fault: we hooked every application-layer call to CookieManager.getCookie to observe the calls. It turned out that multiple threads were calling CookieManager.getCookie at the same time when the crash occurred, which pointed to a thread-safety problem.
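The kind of probe used in this experiment can be sketched as follows. This is only a minimal illustration, not the hook actually used in the app: the class name ConcurrentCallProbe and its methods are hypothetical, and the real lookup is passed in as a plain function so the sketch runs without the Android SDK. In an app, the function would presumably be something like url -> CookieManager.getInstance().getCookie(url).

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical diagnostic wrapper; not part of any Android API. */
final class ConcurrentCallProbe {
    // Number of callers currently inside the wrapped lookup.
    private final AtomicInteger inFlight = new AtomicInteger();
    // Highest number of simultaneous callers seen so far.
    private final AtomicInteger maxInFlight = new AtomicInteger();
    // Names of threads that entered while another call was still running.
    private final Set<String> overlappingThreads = ConcurrentHashMap.newKeySet();

    /** Runs the real lookup and records whether it overlapped with another call. */
    String call(Function<String, String> realGetCookie, String url) {
        int now = inFlight.incrementAndGet();
        maxInFlight.accumulateAndGet(now, Math::max);
        if (now > 1) {
            overlappingThreads.add(Thread.currentThread().getName());
        }
        try {
            return realGetCookie.apply(url);
        } finally {
            inFlight.decrementAndGet();
        }
    }

    int maxConcurrentCallers() {
        return maxInFlight.get();
    }

    Set<String> overlappingThreads() {
        return overlappingThreads;
    }
}
```

If maxConcurrentCallers() ever reports more than 1 in the field, at least two calls overlapped, which is the observation that raised the thread-safety suspicion.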

This information alone is not enough. If we could get the name of the function that actually crashed, the problem could be confirmed. We combed through the crashes whose stacks contain GURL again, and sure enough there was such a stack (_ZN8url_util20LowerCaseEqualsASCIIEPKcS1_S1_).

At the same time, we sorted out two other kinds of stacks with clear function names in the upper frames, both of which crash during the execution of the GURL constructor. One of them is an exception in a vector operation. (That vector is not thread-safe left a deep impression on me; the AOSP source code has many such vector thread-safety issues, for example in RenderNodeAnimator.) These exceptions further strengthened the suspicion of a thread-safety problem.

In-depth analysis

_ZN8url_util20LowerCaseEqualsASCIIEPKcS1_S1_ demangles to url_util::LowerCaseEqualsASCII(char const*, char const*, char const*), and this stack also passes through GURL::GURL(std::string const&). Although we cannot simply conclude that it is the same problem, all the signs point that way. This stack has an explicit crashing function name, which may reveal the root cause.

According to the PC value 5CB1453e, the crash was caused by a null value (0x0) in the R2 register. Combined with the source code of DoLowerCaseEqualsASCII, we can determine that R2 holds the function's third parameter, b. In other words, b is null.

Having identified the direct cause of the crash, we searched the source code for callers of url_util::LowerCaseEqualsASCII(char const*, char const*, char const*) that lie on the GURL constructor call chain seen in the stack. There are two: one is the CompareSchemeComponent function shown in the figure below, and the other is the DoIsStandard function.

CompareSchemeComponent passes its own third parameter on as the third parameter of LowerCaseEqualsASCII, but the value passed in is kFileScheme, a constant, so it cannot be null.

In DoIsStandard, the third parameter of LowerCaseEqualsASCII comes from a global variable that is initialized in InitStandardSchemes. The source code of InitStandardSchemes shows that standard_schemes is a global variable that is initialized lazily. So the question is: is this initialization, and the global variable itself, thread-safe?

Unfortunately, this function takes no lock, and std::vector is not thread-safe, so std::vector<const char*>* standard_schemes is not thread-safe either. Problems arise when multiple threads call it at the same time. While one thread is initializing standard_schemes, another thread may be initializing it too, and the two race on the vector. Similarly, while one thread is iterating over standard_schemes, another thread may reset standard_schemes to a new value, in which case a null pointer can be read.
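The shape of this bug can be illustrated with a small Java analogy. It is not the Chromium code, which is C++: the class SchemeRegistry, its methods, and the short scheme list are all hypothetical. The unsafeGet method mirrors the lazy, unlocked initialization described above, where the shared reference becomes visible before the list is filled. The safeGet method shows one common way to make lazy initialization thread-safe in Java, the initialization-on-demand holder idiom.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical Java analogy of a lazily initialized global scheme list. */
final class SchemeRegistry {
    // Illustrative entries only; not the full list used by Chromium.
    private static final String[] SCHEMES = {"http", "https", "file", "ftp"};

    // Unsafe shape: shared, lazily created, no lock and no volatile.
    private static List<String> unsafeSchemes;

    static List<String> unsafeGet() {
        if (unsafeSchemes == null) {            // two threads can both see null here
            unsafeSchemes = new ArrayList<>();  // published before it is filled
            for (String s : SCHEMES) {
                unsafeSchemes.add(s);           // a reader may observe a partial list
            }
        }
        return unsafeSchemes;
    }

    // Safe shape: the JVM runs a class initializer at most once, under its own lock.
    private static final class Holder {
        static final List<String> INSTANCE =
                Collections.unmodifiableList(Arrays.asList(SCHEMES));
    }

    static List<String> safeGet() {
        return Holder.INSTANCE;
    }

    /** Case-insensitive membership test against the safely initialized list. */
    static boolean isStandard(String scheme) {
        return safeGet().contains(scheme.toLowerCase(Locale.ROOT));
    }
}
```

In Java the unsafe variant would typically show up as missing elements or an exception rather than a native null-pointer crash, but the cause is the same: a shared structure is read while another thread is still building or replacing it.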

Looking through the Chromium source code, I found that the versions used by Android 4.0-9.0 all have this thread-safety problem in GURL initialization, so it has existed for a long time. A fix was submitted on 2019.05.21 (Make // URL Initialization Thread-safe). However, the problem still exists in the old Chromium versions on devices in the market running Android versions up to 10, and waiting for system upgrades to solve it is unrealistic. To protect the user experience, the application layer should take the initiative to fix or avoid it.

Repair plan

As the analysis above shows, we only need to ensure that no second thread executes the same logic until the initialization of standard_schemes is complete. We cannot add synchronization at the system level, but the crashes are all triggered from the same entry point in the application layer, so calls can be serialized there. To be on the safe side, we apply the guard globally in the application layer and lift the restriction once the first call has completed. In the Watermelon Video app, a self-developed AOP tool hooks every application-layer call to CookieManager.getCookie(String url).
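A minimal sketch of such a guard is shown below. It assumes that every application-layer call is routed through it, for example by an AOP hook, and that one successful call is enough to finish the native initialization, as described above. The class FirstCallGuard and its methods are hypothetical, and the real lookup is passed in as a function so that the sketch runs without the Android SDK.

```java
import java.util.function.Function;

/** Hypothetical guard: serializes calls until the first one has completed. */
final class FirstCallGuard {
    private final Object lock = new Object();
    // Set once a guarded call has returned normally; volatile so all threads see it.
    private volatile boolean released;

    String getCookie(Function<String, String> realGetCookie, String url) {
        if (released) {
            // Fast path: initialization is assumed complete, so no lock is taken.
            return realGetCookie.apply(url);
        }
        synchronized (lock) {
            // Slow path: only one thread at a time may enter the native code.
            String cookie = realGetCookie.apply(url);
            released = true; // lift the restriction only after a normal return
            return cookie;
        }
    }

    boolean isReleased() {
        return released;
    }
}
```

Threads that arrive during startup wait on the lock and pass through one at a time, and once a call has returned, later calls skip the lock entirely, so the steady-state cost is a single volatile read. If the first call throws, the guard stays in place, which is a conservative choice made for this sketch.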

After this fix shipped in version 432 of the Watermelon Video app, through both the gray-scale (staged) rollout and the full release, the crash no longer occurred. The problem was solved completely at a small cost.

Conclusion

Symbol table information is indispensable when investigating native problems. In most cases the key symbols are missing, which makes native problems much harder to investigate. However, because there are so many Android versions and so many vendor customizations, crashes from some niche models or Android versions may still carry key symbol information, and these are often the breakthrough point. Niche problems therefore deserve attention during troubleshooting as well. We also call on phone manufacturers to keep some key symbol table information so that developers can locate problems.

In addition, although the source code can be browsed online at androidxref.com and cs.android.com, neither site covers every Android version. Almost all versions of the source code can be downloaded from android.googlesource.com, and analyzing the local source in Sublime is also very convenient (it can display and jump to method definitions and references directly).

Welcome to the Bytedance technical team