1. Introduction

After a mobile app is released, its running status is usually learned only indirectly, through server-side interfaces or user feedback. There is no platform for monitoring the client online, raising alarms, or aggregating and accumulating its abnormal data, and abnormal data cannot be compared across multiple dimensions. This makes collecting application information and crash logs increasingly urgent.

Bugless, developed by 37 Mobile Games, detects code anomalies by tracking problems online and helps fix the underlying code problems by tracing them back. Its main functions are:

  1. Monitor SDK service exceptions in real time
  2. Aggregate and deduplicate crash data for each app package
  3. Count the number of affected devices
  4. Report crash logs
  5. Collect compatibility issues with newer iOS versions
  6. Monitor network problems in client requests

Bugless pushes intelligent notifications so that exceptions are detected promptly, which shortens the time and narrows the scope of their impact and reduces losses. Its backend also aggregates error data and analyzes historical anomaly data to help developers monitor projects and optimize product iterations.

2. Understanding crashes and exceptions

Before going into Bugless, let's look at why apps crash and raise exceptions, at three levels, and at what can be done about it.

2.1 App level

An app crashes when it violates the operating rules of the iOS system. There are three types of crash:

2.1.1 Memory-related crashes

These are usually caused by one of the following:

  • Accessing invalid memory
  • Accessing memory out of bounds
  • Calling a method that does not exist at runtime
  • Dereferencing a pointer to an invalid memory address
  • Jumping to an instruction at an invalid address

2.1.2 Response timeout

The app does not respond in time to launch, suspend, resume, or terminate events.

2.1.3 Triggering the Watchdog Mechanism

The watchdog prevents an application from occupying too many system resources. If an operation runs longer than the time allowed, the watchdog forcibly kills the application, and the error code 0x8badF00d appears in the crash log.

2.2 Mach exceptions and Unix signals

Mach is the microkernel at the core of the iOS and macOS kernel, and Mach exceptions are the lowest-level, kernel-level exceptions. In iOS, every thread, task, and host has a set of exception ports, and a Mach exception can be caught by setting the exception ports of the thread, task, or host. A Mach exception is converted to the corresponding Unix signal and delivered to the faulting thread.

Crash reports commonly contain a field such as Exception Type: EXC_BAD_ACCESS (SIGSEGV), where EXC_BAD_ACCESS is the Mach exception and SIGSEGV is the Unix signal. This exception type means that the Mach-layer exception EXC_BAD_ACCESS was converted to a SIGSEGV signal and delivered to the faulting thread. (In Triggered by Thread: 0, thread 0 is the main thread.) Mach exceptions are converted to Unix signals for POSIX compatibility (the SUS specification), so that developers unfamiliar with the Mach kernel can still handle them through Unix signals.
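The pairing of Mach exception and Unix signal in the Exception Type field can be pulled out of a crash log mechanically. The following is a minimal sketch in Python; the function name and the sample log text are illustrative assumptions, not part of Bugless.

```python
import re

# Matches a crash-log line such as "Exception Type: EXC_BAD_ACCESS (SIGSEGV)".
EXCEPTION_TYPE_RE = re.compile(
    r"^Exception Type:\s+(\w+)\s+\((\w+)\)", re.MULTILINE
)


def parse_exception_type(crash_log):
    """Return (mach_exception, unix_signal), or None if the field is absent."""
    match = EXCEPTION_TYPE_RE.search(crash_log)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Made-up excerpt of a crash log, for illustration only.
sample_log = (
    "Exception Type:  EXC_BAD_ACCESS (SIGSEGV)\n"
    "Triggered by Thread:  0\n"
)
print(parse_exception_type(sample_log))
```

Splitting the field this way lets a backend group crashes by either the Mach exception or the signal.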

There are many types of Unix signals. In iOS applications, there are several common Unix signals:

  • SIGILL: Illegal instruction signal, usually caused by an error in the executable itself or by an attempt to execute a data segment. It can also be raised when the stack overflows.
  • SIGABRT: Abort signal raised when the abort function is called.
  • SIGBUS: Bus error signal raised for a misaligned memory address, such as accessing a 4-byte integer at an address that is not a multiple of 4.
  • SIGFPE: Floating-point exception signal, usually raised by a floating-point error, an overflow, or an arithmetic error such as division by zero.
  • SIGKILL: Kill signal that terminates the program immediately; it cannot be handled, blocked, or ignored.
  • SIGSEGV: Invalid memory access signal, raised when a program tries to access unallocated memory or to write to an address without write permission.
  • SIGPIPE: Broken pipe signal, usually raised during interprocess communication.
  • SIGSTOP: Process stop signal; like SIGKILL, it cannot be handled, blocked, or ignored.

In iOS apps, collecting these common signals is generally enough to meet the day-to-day needs of app exception collection.
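The difference between catchable signals and SIGKILL/SIGSTOP can be checked on any POSIX system. The sketch below uses Python's standard signal module purely as an illustration; an iOS client would register its handlers from native code, and the helper name can_install_handler is an assumption. It has to run in the main thread, because Python only allows handlers to be set there.

```python
import signal

# The signals listed above; SIGKILL and SIGSTOP are expected to be refused.
COMMON_SIGNALS = ["SIGILL", "SIGABRT", "SIGBUS", "SIGFPE",
                  "SIGKILL", "SIGSEGV", "SIGPIPE", "SIGSTOP"]


def can_install_handler(name):
    """Try to register a no-op handler, then restore the previous one."""
    signum = getattr(signal, name)
    try:
        previous = signal.signal(signum, lambda signum, frame: None)
    except OSError:
        return False  # the system refuses handlers for this signal
    # signal.signal returns None when the old handler was not set from Python.
    signal.signal(signum, previous if previous is not None else signal.SIG_DFL)
    return True


catchable = [name for name in COMMON_SIGNALS if can_install_handler(name)]
print(catchable)
```

In practice this means a collector can only register for the catchable subset; kills by SIGKILL have to be inferred some other way.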

Reference for this section: iOS Full Burial Solution (Douban)

2.3 Bugless crash capture principle

The exception most closely tied to the app itself is a thrown Objective-C exception, which is also one of the easiest to catch.

The capture flow chart is described below:

After the app launches and initializes, it checks whether exception monitoring is enabled. If it is, the app registers with the APIs that the system exposes. When an exception occurs, iOS invokes the registered callback, so the app only has to handle that callback.

Objective-C exceptions appear in the chart as, for example, the Invalid type exception in its first five columns. Besides Objective-C exceptions, two other kinds of exception are caught, by the Mach exception handler and the POSIX signal handler respectively; those crashes appear as SEGV_ACCERR in the table.

2.3.1 Bugless crash stack reporting

When all data is collected, there are two opportunities to upload a crash log:

  • First: immediately after the crash. This attempt may fail because the process is killed.
  • Second: at the next launch, when the log of the previous crash is reported. If the user never launches the app again, the log may never be uploaded.
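The two reporting opportunities above amount to "persist first, upload when possible". The following Python sketch shows that logic under stated assumptions: the file layout, the function names, and the two stand-in uploaders are hypothetical, and the real client would do this in native code.

```python
import json
import os
import tempfile


def save_crash_log(pending_dir, crash_id, payload):
    """Persist the log first so it survives the process being killed."""
    path = os.path.join(pending_dir, crash_id + ".json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    return path


def flush_pending(pending_dir, upload):
    """Upload every stored log; delete a log only after a successful upload."""
    uploaded = []
    for name in sorted(os.listdir(pending_dir)):
        path = os.path.join(pending_dir, name)
        with open(path, encoding="utf-8") as handle:
            payload = json.load(handle)
        if upload(payload):
            os.remove(path)
            uploaded.append(name)
    return uploaded


# Hypothetical uploaders standing in for the real network call.
def failing_upload(payload):
    return False  # e.g. the process is killed before the request completes


def working_upload(payload):
    return True


pending_dir = tempfile.mkdtemp()
save_crash_log(pending_dir, "crash-001", {"signal": "SIGSEGV"})
first_attempt = flush_pending(pending_dir, failing_upload)   # right after the crash
second_attempt = flush_pending(pending_dir, working_upload)  # at the next launch
print(first_attempt, second_attempt)
```

Deleting the file only after a confirmed upload is what lets the second opportunity recover from a failed first one.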

2.3.2 Bugless anomaly analysis process

Given a crash log, follow the steps below to make an initial determination of the exception type, as shown in the figure below:

2.3.3 Bugless stack resolution

Once the cause of the exception has been analyzed with this process, how is the problem located? This is where crash stack symbolication tools come in.

Two symbolication tools are compared here: symbolicatecrash (a command-line tool) and SymbolicateX (a UI tool). Both ultimately rely on the same key parsing tool, atos.

To symbolicate a log, we first walk through all addresses that belong to the main binary, cheng, and store them in an array. We then pass the dSYM symbol file (armv7 or arm64) and the load address to the symbolicationCommand function.

  • symbolicatecrash: uses atos, the command bundled with Xcode that converts memory addresses to function symbols.
  • SymbolicateX: a third-party open-source tool built as secondary development on the command-line parsing tool XcheckSymb. It can use atosl in place of atos to parse logs across platforms, so it no longer depends on macOS or Xcode's xcrun. This lays the groundwork for offering stack symbolication as a general online service.
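The address collection and atos invocation described above can be sketched as follows. This is a minimal Python illustration, not the Bugless implementation: the frame layout matched by the regular expression, the sample frames, the dSYM path, and the signature of symbolication_command are all assumptions. Only the atos options -arch, -o, and -l are taken from the real tool.

```python
import re

# Assumed frame layout: index, binary name, address, load address, "+", offset.
FRAME_RE = re.compile(
    r"^\d+\s+(?P<binary>\S+)\s+(?P<address>0x[0-9a-fA-F]+)\s+"
    r"(?P<load>0x[0-9a-fA-F]+)\s+\+\s+(?P<offset>\d+)$"
)


def collect_addresses(frames, binary_name):
    """Return (load_address, [addresses]) for frames of the main binary."""
    load_address, addresses = None, []
    for line in frames:
        match = FRAME_RE.match(line.strip())
        if match and match.group("binary") == binary_name:
            load_address = match.group("load")
            addresses.append(match.group("address"))
    return load_address, addresses


def symbolication_command(dsym_binary, arch, load_address, addresses):
    """Build the atos argument list; it can be run with subprocess on macOS."""
    return ["atos", "-arch", arch, "-o", dsym_binary, "-l", load_address] + addresses


# Made-up frames of a crashed thread, for illustration only.
frames = [
    "0   cheng    0x0000000100f3b6a4 0x100b00000 + 4437668",
    "1   UIKit    0x000000018e2f1a2c 0x18e200000 + 989740",
    "2   cheng    0x0000000100b04c10 0x100b00000 + 19472",
]
load_address, addresses = collect_addresses(frames, "cheng")
command = symbolication_command(
    "cheng.app.dSYM/Contents/Resources/DWARF/cheng",  # hypothetical path
    "arm64", load_address, addresses,
)
print(" ".join(command))
```

Passing every address of the main binary in one invocation avoids starting the tool once per frame.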

Further optimization of the parsing tool will start with the low efficiency of stack parsing:

  • Shorten the parsing time.
  • Introduce batch asynchronous parsing and a cache for repeated stacks.

2.4 Aggregation

Crash title aggregation

Crash titles are aggregated mainly by offset.

Apple's official aggregation scheme:

Use the AppBundleName plus the memory address plus the offset.

For example, syios: 0f100afc000 + 8691804

The new scheme:

Use the first valid offset in the crashed thread. The following figure shows the first offset that corresponds to the binary file name cheng in the log: + 4437668
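Picking the first valid offset of the app binary in the crashed thread can be sketched as below. This is a Python illustration under assumptions: the frame layout, the sample frames, and the title format "binary + offset" are hypothetical, and only the binary name cheng and the offset 4437668 come from the text.

```python
import re

# Assumed frame layout: index, binary name, address, load address, "+", offset.
OFFSET_RE = re.compile(
    r"^\d+\s+(\S+)\s+0x[0-9a-fA-F]+\s+0x[0-9a-fA-F]+\s+\+\s+(\d+)$"
)


def crash_title(crashed_thread_frames, binary_name):
    """Return a title from the first frame that belongs to the app binary."""
    for line in crashed_thread_frames:
        match = OFFSET_RE.match(line.strip())
        if match and match.group(1) == binary_name:
            return f"{binary_name} + {match.group(2)}"
    return None  # no frame of the app binary in this thread


# Made-up crashed thread: a system frame first, then two app frames.
frames = [
    "0   libobjc.A.dylib  0x000000018d0a1f30 0x18d09c000 + 24368",
    "1   cheng            0x0000000100f3b6a4 0x100b00000 + 4437668",
    "2   cheng            0x0000000100b04c10 0x100b00000 + 19472",
]
print(crash_title(frames, "cheng"))
```

Because the load address is left out of the title, the same crash site yields the same title on every launch of a given build.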

Stack aggregation

Stacks are aggregated by a hash value computed after the variable parts of the stack are removed.

After the memory addresses and offsets of the crashed thread are filtered out, the remaining text is hashed into a label. Crashes are aggregated by label and deduplicated by device. The stack is filtered first because the MD5 value of the raw stack varies with the iOS version. (Specifically, the dependent-library lines of the crash stack may differ between system versions.)

A regular expression filter excludes the memory addresses and offsets.
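One possible form of this filtering and hashing is sketched below in Python. The two patterns, the function names, and the sample reports are assumptions chosen for illustration; they are not the expressions Bugless uses.

```python
import hashlib
import re
from collections import defaultdict

ADDRESS_RE = re.compile(r"0x[0-9a-fA-F]+")  # memory and load addresses
OFFSET_RE = re.compile(r"\+\s*\d+")         # trailing "+ offset" parts


def stack_label(frames):
    """Hash the crashed thread after stripping addresses and offsets."""
    normalized = []
    for line in frames:
        line = OFFSET_RE.sub("", ADDRESS_RE.sub("", line))
        normalized.append(" ".join(line.split()))  # collapse whitespace
    text = "\n".join(normalized)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def aggregate(reports):
    """Group reports by label and count distinct devices per label."""
    devices_by_label = defaultdict(set)
    for device_id, frames in reports:
        devices_by_label[stack_label(frames)].add(device_id)
    return {label: len(devices) for label, devices in devices_by_label.items()}


# Made-up reports: the same crash seen at different addresses and offsets.
stack_a = ["0 libobjc.A.dylib 0x000000018d0a1f30 objc_msgSend + 16",
           "1 cheng 0x0000000100f3b6a4 0x100b00000 + 4437668"]
stack_b = ["0 libobjc.A.dylib 0x000000019e4c2f40 objc_msgSend + 32",
           "1 cheng 0x0000000104a3b6a4 0x104600000 + 4437668"]
reports = [("device-A", stack_a), ("device-B", stack_b), ("device-A", stack_a)]
print(aggregate(reports))
```

Here the three reports collapse into a single label that affects two devices, because the repeated report from device-A is counted once.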

3. Network-layer exceptions

  • Report network exceptions such as page not found (status code 404) and service unavailable (503) by the minute.
  • Keep detailed statistics on the number of client request timeouts and calculate the proportion of devices whose requests timed out.
  • Monitor for domain name hijacking by checking whether the returned data is in the expected JSON format.
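The three checks above can be sketched on the backend side as follows. This is a minimal Python illustration: the event fields (timestamp, device, status, timed_out), the sample events, and the rule that a valid response is a JSON object are assumptions.

```python
import json
from collections import Counter


def errors_per_minute(events):
    """Count 404 and 503 responses per (minute, status code)."""
    counts = Counter()
    for event in events:
        if event["status"] in (404, 503):
            counts[(event["timestamp"] // 60, event["status"])] += 1
    return counts


def timeout_device_ratio(events):
    """Share of devices that had at least one timed-out request."""
    all_devices = {event["device"] for event in events}
    timed_out = {event["device"] for event in events if event["timed_out"]}
    return len(timed_out) / len(all_devices) if all_devices else 0.0


def looks_hijacked(body):
    """Treat a response that is not a JSON object as a possible hijack."""
    try:
        parsed = json.loads(body)
    except ValueError:
        return True
    return not isinstance(parsed, dict)


# Made-up events; timestamps are seconds since an arbitrary start.
events = [
    {"timestamp": 120, "device": "A", "status": 404, "timed_out": False},
    {"timestamp": 150, "device": "B", "status": 404, "timed_out": False},
    {"timestamp": 185, "device": "B", "status": 503, "timed_out": False},
    {"timestamp": 190, "device": "C", "status": None, "timed_out": True},
    {"timestamp": 200, "device": "D", "status": 200, "timed_out": False},
]
print(errors_per_minute(events))
print(timeout_device_ratio(events))
print(looks_hijacked("<html>ad page</html>"), looks_hijacked('{"code": 0}'))
```

Counting devices rather than requests keeps one device with a bad connection from dominating the timeout ratio.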

4. Server-side service-layer exceptions

The client reports its network request errors, so SDK service exceptions are reported in real time. This makes it easier to monitor account authentication exceptions, in-app purchase exceptions, and delivery exceptions.

5. Alarms

5.1 Real-time alarms

Bugless accumulates errors and raises alarms by minute, hour, or day, and sends an alert through Enterprise WeChat when a threshold is exceeded.
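Accumulating errors per window and comparing the count with a threshold can be sketched as below. This Python example is an assumption-laden illustration: the function name, the window table, and the sample timestamps are hypothetical, and delivery of the alert to Enterprise WeChat is left out.

```python
from collections import Counter

WINDOW_SECONDS = {"minute": 60, "hour": 3600, "day": 86400}


def find_alarms(error_timestamps, window, threshold):
    """Return (window_start, count) for each window whose count exceeds the threshold."""
    size = WINDOW_SECONDS[window]
    counts = Counter(timestamp // size for timestamp in error_timestamps)
    return [
        (bucket * size, count)
        for bucket, count in sorted(counts.items())
        if count > threshold
    ]


# Made-up error timestamps, in seconds since an arbitrary start.
timestamps = [0, 10, 20, 30, 65, 70, 3600]
print(find_alarms(timestamps, "minute", 3))
print(find_alarms(timestamps, "hour", 5))
```

The same error stream can trip a minute-level alarm for a sudden burst and an hour-level alarm for a slow accumulation, which is why several window sizes are useful.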

The structure of the alarm system is as follows:

The following is an example of an alarm sent by the assistant bot:

6. Bugless system

Once the alarm system above went online, it could deliver only scattered alarm information. The Bugless backend fills this gap by making it possible to compare abnormal data across multiple dimensions.

Here is what the Bugless backend does.

6.1 Bugless backend

Service exception statistics in the Bugless backend:

Table 1 Incorrect password of automatically generated account

6.2 Bugless integration cases

So far, Bugless has been integrated into 4 games and has raised 3 valid alarms:

  1. R&D order item ID error
  2. Apple's in-app purchase service was abnormal
  3. High rate of repeated requests for mobile registration

6.3 Accuracy

The statistics are consistent with the crash logs in Apple's iTunes Connect.

Bugless vs. the Xcode Organizer report: for the same crash, Apple's iTunes backend recorded crashes on 61 devices, while Bugless recorded 59 affected devices, as shown in the figure below.

Details of Bugless backend logs

Table 3 Details of Bugless background logs

Table 4 Bugless parsing logs

7. Summary

7.1 Problems in the application of Bugless

Several problems surfaced during use, the most frequent being false alarms. Because the appropriate thresholds were not known in advance, they were deliberately set low so that the vast majority of valid data samples would not be missed. As the data samples grow and the alarm thresholds become more accurate, false alarms will decrease.

7.2 Conclusion

The design, development, and launch of the key technical pieces of Bugless show that the project can continuously and effectively provide data to support troubleshooting in the Apple platform publishing business. The project still has room to improve. For example, the secondary development of the symbol parsing tool still lacks stack information for system library functions, and crash log parsing performance needs further work to reduce users' waiting time.

As the business expands, Bugless will have more opportunities to serve users. It can be extended to more business areas, such as the intercarrier SDK and overseas business. It can also provide event-tracking reports for development teams, so that games can collect user usage habits through a self-built platform rather than a third-party one. Where customized reports are needed, one-to-one technical support can be provided, bringing convenience to more users.

8. Appendix

Reference links: exception stack field descriptions

  • Developer.apple.com/documentati…
  • Developer.apple.com/documentati…
  • Developer.apple.com/documentati…

SymbolicateX: a symbolication tool for crash files from iOS/Mac projects

  • Github.com/tomlee130/S…