Causes of lag
Lag occurs when the main thread is blocked. During development, the main thread is commonly blocked by:
- Heavy I/O on the main thread: large amounts of data are read or written directly on the main thread because that is the most convenient way to write the code
- Heavy computation on the main thread: poorly structured code runs complex calculations on the main thread
- Heavy UI drawing: the interface is too complex and drawing it takes a long time
- Lock waits: the main thread needs lock A, but a child thread currently holds it, so the main thread must wait until the child thread finishes its task
Industry research
WeChat team (Matrix)
Lag detection flow chart
Symptoms of a stalled main thread:
- FPS drops
- CPU usage is high
- The main thread RunLoop takes too long to execute
Monitoring method
Matrix's lag monitor adds Observers at the start and end of the main thread's RunLoop to track when each pass begins and ends. A child thread periodically checks the main thread's status. When the main thread has stayed in the same state for longer than a threshold, it is considered stalled and the stall is recorded.
Two criteria are used:
- Single-core CPU usage exceeds 80%
- The main thread RunLoop executes for more than 2 seconds
The main thread RunLoop timeout threshold is 2 seconds, and the child thread checks every 1 second. On each check, if the main thread RunLoop has been running for more than 2 seconds, the main thread is considered stalled and a snapshot of the current threads is captured. The WeChat team also considers that excessive CPU usage can cause lag, so if CPU usage is too high when the child thread runs its check, it likewise captures a thread snapshot and saves it to a file. WeChat currently treats single-core CPU usage above 80% as too high.
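The two criteria can be sketched as a single check that the child thread runs once per period. This is a minimal illustration in plain C, not Matrix's actual code: the names `matrix_style_check` and `check_result` are hypothetical, and the caller is assumed to supply the RunLoop busy time and the CPU reading.

```c
/* Thresholds quoted in the text above. */
#define RUNLOOP_TIMEOUT_SEC 2.0   /* main thread RunLoop timeout */
#define CPU_LIMIT_PERCENT   80.0  /* single-core CPU usage limit */

typedef enum {
    CHECK_OK,
    CHECK_RUNLOOP_TIMEOUT,  /* RunLoop ran longer than the timeout */
    CHECK_CPU_TOO_HIGH      /* CPU usage above the limit */
} check_result;

/* Called by the checker thread once per check period (1 second in the text).
 * runloop_busy_sec: how long the main RunLoop has been running its current pass.
 * cpu_percent:      single-core CPU usage sampled by the caller (hypothetical input). */
check_result matrix_style_check(double runloop_busy_sec, double cpu_percent)
{
    if (runloop_busy_sec > RUNLOOP_TIMEOUT_SEC)
        return CHECK_RUNLOOP_TIMEOUT;
    if (cpu_percent > CPU_LIMIT_PERCENT)
        return CHECK_CPU_TOO_HIGH;
    return CHECK_OK;
}
```

Either non-OK result would lead to a thread snapshot being captured; the sketch only decides which criterion fired.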
Detection strategy
- Memory dump: a check runs every 1 second. If the main thread is found to be stalled, the call stacks of all threads are dumped into memory.
- File dump: if the stack dumped into memory differs from the one captured at the previous stall, it is written to a file. Otherwise the check interval grows along the Fibonacci sequence (1, 1, 2, 3, 5, 8, ...) until the stall ends or the stalled stack changes. This avoids writing multiple files for the same stall and keeps the detection thread from spinning on it.
Annealing algorithm
To reduce the performance cost of detection, the detection thread uses an annealing algorithm:
- Each time the child thread detects that the main thread is stalled, it captures the main thread stack and keeps it in memory (it does not immediately write a thread snapshot to a file).
- It compares that stack with the main thread stack captured at the previous stall:
  - If the stacks differ, it takes a snapshot of the current threads and writes it to a file.
  - If they are the same, it skips the write and lengthens the check interval along the Fibonacci sequence until the stall ends or the main thread is stalled on a different stack. This avoids writing multiple files for the same stall, and avoids writing snapshot files over and over while the main thread is frozen.
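The steps above can be sketched as a small state machine. This is an illustrative version in plain C under stated assumptions: the stack is reduced to a 64-bit hash supplied by the caller, and the names `annealer`, `annealer_reset`, and `annealer_on_check` are hypothetical rather than taken from Matrix.

```c
#include <stdint.h>

typedef struct {
    uint64_t last_stack_hash; /* hash of the stack seen at the previous stall */
    unsigned fib_prev;        /* previous Fibonacci term */
    unsigned fib_curr;        /* current check interval, in seconds */
    int      has_last;        /* 0 until a stalled stack has been recorded */
} annealer;

/* Back to the initial state: 1-second interval, no remembered stack. */
void annealer_reset(annealer *a)
{
    a->last_stack_hash = 0;
    a->fib_prev = 0;
    a->fib_curr = 1;
    a->has_last = 0;
}

/* Feed the result of one check.
 * Returns 1 when a snapshot file should be written, 0 otherwise.
 * After the call, a->fib_curr is the delay before the next check. */
int annealer_on_check(annealer *a, int stalled, uint64_t stack_hash)
{
    if (!stalled) {               /* stall is over: forget everything */
        annealer_reset(a);
        return 0;
    }
    if (a->has_last && a->last_stack_hash == stack_hash) {
        /* Same stall as last time: skip the file, back off 1,1,2,3,5,8... */
        unsigned next = a->fib_prev + a->fib_curr;
        a->fib_prev = a->fib_curr;
        a->fib_curr = next;
        return 0;
    }
    /* New or different stack: remember it and restart the interval. */
    a->last_stack_hash = stack_hash;
    a->has_last = 1;
    a->fib_prev = 0;
    a->fib_curr = 1;
    return 1;
}
```

With an unchanged stalled stack the interval follows 1, 1, 2, 3, 5 seconds, so a long freeze costs only a handful of checks and a single file.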
Extracting the most time-consuming stack
When the child thread detects that the main thread RunLoop has timed out, it snapshots the current threads as the lag file. However, the main thread stack at that moment is not necessarily the most time-consuming one, nor necessarily the main cause of the timeout. Matrix's lag monitor addresses this by extracting the main thread's most time-consuming stack. The monitor periodically captures the main thread stack and stores it in a circular queue in memory. In the figure below, a stack is captured every interval t and stored in a circular queue with a capacity of 3, and a cursor always points to the most recent stack. WeChat's policy is to capture the main thread stack every 50 milliseconds and keep the most recent 20 stacks. This increases CPU usage by 3%, and the memory cost is negligible.
When a stall is detected, the monitor walks back through the stacks saved in the circular queue to find the most recent, most time-consuming stack. As shown in the figure below, the 20 most recent main thread stacks are held in the in-memory circular queue at the moment a stall is detected. Matrix's lag monitor uses the following properties to find that stack:
- A stack is identified by its top function: two stacks with the same top function are treated as the same stack.
- Stacks are captured at a fixed interval, so the number of times a stack repeats approximates how long it ran. The more repeats, the longer it took.
- Several stacks may repeat the same number of times; in that case the most recent one is taken as the most time-consuming.
The stack found this way is attached to the lag file.
The lag detection procedure is as follows:
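The three properties above can be sketched with a fixed-size ring of top-of-stack addresses. This is an illustration in plain C, not Matrix's implementation: the names `stack_ring`, `ring_push`, and `ring_most_costly` are hypothetical, and it assumes that "repeats" means consecutive samples with the same top function.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_CAPACITY 20  /* the text keeps the 20 most recent stacks */

typedef struct {
    uintptr_t top_frame[RING_CAPACITY]; /* top-of-stack address of each sample */
    size_t count;                       /* valid samples, at most RING_CAPACITY */
    size_t next;                        /* slot the next sample overwrites */
} stack_ring;

void ring_init(stack_ring *r) { r->count = 0; r->next = 0; }

/* Called at every sampling interval (50 ms in the text). */
void ring_push(stack_ring *r, uintptr_t top)
{
    r->top_frame[r->next] = top;
    r->next = (r->next + 1) % RING_CAPACITY;
    if (r->count < RING_CAPACITY) r->count++;
}

/* Returns the top frame of the stack with the longest run of consecutive
 * samples; on a tie the more recent run wins. Returns 0 for an empty ring.
 * *repeats_out receives the length of the winning run. */
uintptr_t ring_most_costly(const stack_ring *r, size_t *repeats_out)
{
    size_t oldest = (r->next + RING_CAPACITY - r->count) % RING_CAPACITY;
    uintptr_t best = 0, run_top = 0;
    size_t best_run = 0, run = 0;

    for (size_t i = 0; i < r->count; i++) {   /* oldest to newest */
        uintptr_t top = r->top_frame[(oldest + i) % RING_CAPACITY];
        if (i > 0 && top == run_top) {
            run++;                            /* same stack as previous sample */
        } else {
            run_top = top;                    /* a new run starts here */
            run = 1;
        }
        if (run >= best_run) {                /* >= prefers the later run on ties */
            best = top;
            best_run = run;
        }
    }
    if (repeats_out) *repeats_out = best_run;
    return best;
}
```

Because the sampling interval is fixed, multiplying the returned repeat count by the interval gives a rough estimate of how long that stack was on the main thread.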
Stack classification method
Lag monitoring needs its own carefully defined classification rules. Stacks can be classified starting from the outermost frames, the middle frames, or the innermost frames of the call stack. Each has its own advantages and disadvantages:
- Outermost classification: groups stalls that come from the same entry point. The drawback is that the number of frames to use is hard to determine: the outer ten or more frames may all be system calls, or the very first frame may already be a WeChat function.
- Middle-layer classification: groups stalls by predefined "feature values". The drawback is that the feature values are hard to determine, and generating them by automatic learning would demand too much of the backend analysis system.
- Innermost classification: groups stalls with the same cause. The drawback is that one category may contain different businesses.
WeChat uses an optimized form of innermost classification: two-level classification. The first level classifies by the innermost 2 frames, which gathers stalls with the same cause. The second level takes each first-level class and classifies it again by the innermost 4 frames, which separates different businesses that share the same cause.
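The two-level scheme can be sketched by building one key from the innermost 2 frames and a second key from the innermost 4. This is a minimal illustration in plain C; `build_key` and `same_key` are hypothetical helpers, frames are represented as symbol strings with the innermost frame first, and the `|` separator is an arbitrary choice.

```c
#include <stdio.h>
#include <string.h>

#define LEVEL1_DEPTH 2  /* innermost 2 frames: same cause */
#define LEVEL2_DEPTH 4  /* innermost 4 frames: separates businesses */

/* Joins the innermost `depth` frames with '|' into `out`.
 * frames[0] is the innermost frame. Output is truncated if it does not fit. */
void build_key(const char *const frames[], size_t frame_count, size_t depth,
               char *out, size_t out_size)
{
    size_t used = frame_count < depth ? frame_count : depth;
    size_t len = 0;

    if (out_size == 0) return;
    out[0] = '\0';
    for (size_t i = 0; i < used; i++) {
        int n = snprintf(out + len, out_size - len, "%s%s",
                         i ? "|" : "", frames[i]);
        if (n < 0 || (size_t)n >= out_size - len) break; /* buffer full */
        len += (size_t)n;
    }
}

/* Two stacks fall into the same class when their keys are equal. */
int same_key(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Two stalls that both end in the same two system frames share a first-level key, while their second-level keys differ as soon as frames 3 and 4 belong to different features.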
Operability
In gray-scale testing, collection produced an average of 30 dump files per day, and uploading them compressed took about 300 KB of traffic. A full release would be expected to put considerable load on the backend and cost users a noticeable amount of traffic, so reports have to be sampled.
- Sampled reporting: a different set of users is selected to report each day, with a sampling probability of 5%.
- File upload: a selected user uploads only the first 20 stack files each day, compressing several files into each upload.
- Whitelist: for users whose problems need investigating, a whitelist can be configured on the backend to force reporting.
In addition, to limit the impact on users' storage space, log files are kept only for the latest 7 days and deleted once they expire.
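One way to pick a different 5% of users each day without server-side state is to hash the user ID together with the day. The sketch below is an assumption about how such sampling could be done, not a description of WeChat's implementation; `is_sampled`, `files_to_upload`, and the use of an FNV-1a hash are all illustrative choices.

```c
#include <stdint.h>

#define SAMPLE_PERCENT    5   /* sampling probability from the text */
#define MAX_FILES_PER_DAY 20  /* upload cap from the text */

#define FNV_OFFSET 14695981039346656037ULL
#define FNV_PRIME  1099511628211ULL

/* Returns 1 if this user reports on this day.
 * user_id:   any stable identifier string (hypothetical input)
 * day_index: days since some fixed epoch, so the selection changes daily
 * percent:   sampling probability in percent, 0..100 */
int is_sampled(const char *user_id, uint32_t day_index, unsigned percent)
{
    uint64_t h = FNV_OFFSET;

    for (const char *p = user_id; *p; p++) {      /* hash the user ID */
        h ^= (unsigned char)*p;
        h *= FNV_PRIME;
    }
    for (int i = 0; i < 4; i++) {                 /* mix in the day */
        h ^= (day_index >> (8 * i)) & 0xFFu;
        h *= FNV_PRIME;
    }
    return (h % 100) < percent;
}

/* A selected user uploads at most the first MAX_FILES_PER_DAY files. */
unsigned files_to_upload(unsigned files_available)
{
    return files_available < MAX_FILES_PER_DAY ? files_available
                                               : MAX_FILES_PER_DAY;
}
```

A whitelist check would simply run before `is_sampled` and force a positive result for the listed users.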
Meituan Hertz
Lag detection method
It is natural to think that FPS can tell you whether an app is lagging, and that the quality of page drawing can be measured by the frame drop rate over a run of consecutive frames. In practice, however, FPS refreshes very quickly and jitters easily, so it is hard to detect lag by comparing FPS values directly. Measuring the execution time of the main thread's message loop is much simpler, and it is a common way to detect lag in the industry. Hertz therefore measures how long the main thread takes for each pass of the message loop, and records a stall whenever that time exceeds a threshold.
Judging lag by duration and continuity
Some lag lasts a long time but does not repeat, such as the stall when opening a new page. Other lag is short each time but occurs frequently, such as stutter while scrolling a list. Hertz therefore uses the rule "N stalls exceeding threshold T": collection and reporting are triggered only when the accumulated number of stalls within a period is greater than N. For example, T = 2000 ms with N = 1 identifies a single long stall, while T = 300 ms with N = 5 identifies frequent short stalls.
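The "N stalls exceeding threshold T" rule can be sketched as a counter over a time window. This is an illustration in plain C with hypothetical names (`lag_rule`, `lag_rule_on_loop`); the length of the counting period is not given in the text and is a parameter here, and the sketch assumes the rule fires on the N-th stall, which is what the N = 1 example suggests.

```c
typedef struct {
    double   threshold_ms;    /* T: a loop slower than this is a stall */
    unsigned required;        /* N: stalls needed to trigger a report */
    double   period_ms;       /* length of the counting period (assumed) */
    double   window_start_ms; /* time of the first stall in the period */
    unsigned count;           /* stalls seen in the current period */
} lag_rule;

void lag_rule_init(lag_rule *r, double threshold_ms, unsigned required,
                   double period_ms)
{
    r->threshold_ms = threshold_ms;
    r->required = required;
    r->period_ms = period_ms;
    r->window_start_ms = 0.0;
    r->count = 0;
}

/* Call once per message-loop pass with the current time and the pass cost.
 * Returns 1 when collection and reporting should be triggered. */
int lag_rule_on_loop(lag_rule *r, double now_ms, double loop_cost_ms)
{
    if (loop_cost_ms <= r->threshold_ms)
        return 0;                          /* fast pass: not a stall */

    if (r->count == 0 || now_ms - r->window_start_ms > r->period_ms) {
        r->window_start_ms = now_ms;       /* start a new counting period */
        r->count = 0;
    }
    r->count++;
    if (r->count >= r->required) {
        r->count = 0;                      /* report and start over */
        return 1;
    }
    return 0;
}
```

Two rule instances, one configured with T = 2000 ms and N = 1 and one with T = 300 ms and N = 5, would cover both kinds of lag described above.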
Stack dump and run logs
The first problem is when to capture the stack. It must be captured while the stall is happening, not afterwards; otherwise the code responsible cannot be pinpointed. The stack is therefore captured from the child thread before the stall has ended. The second problem is how to classify the stacks. Lag stacks cannot be classified the same way as crash stacks. Classifying purely by the innermost code is inappropriate, because different business logic in the outer frames may share the same innermost calls. Classifying by the outermost code is also inappropriate, because the outermost frame may be either business logic or a system call. Hertz starts from the innermost frames and applies some simple matching rules, using the class name that matches a rule as the category.
Bugly
Lag detection criteria and reporting timing (bugly.qq.com/docs/user-g…): Bugly monitors the execution of the main thread RunLoop and checks whether the execution time exceeds a preset threshold (3000 ms by default). When a stall is detected, the thread stack is recorded locally right away, and the report is sent when the app switches from the background to the foreground.
Wedjat
How lag is monitored
- FPS monitoring: this is the first approach that comes to mind, since a higher FPS means a smoother interface. The quality of page drawing is measured by the frame drop rate over a period of consecutive frames.
- Main thread lag monitoring: a method commonly used in the industry. A child thread monitors the main thread's RunLoop, and a stall is recorded when the time between two RunLoop states exceeds a threshold. Wedjat also uses this main thread monitoring scheme.
MTHawkeye
MTHawkeye is Meitu's open source lag monitoring tool. Judging from its source code, the design is similar to WeChat's Matrix, but it incorporates a lot of accumulated engineering experience and appears to be customized for Meitu. The design is worth studying.
Research conclusion
Most monitoring schemes in the industry are similar. They observe RunLoop state notifications and run a resident child thread that periodically checks whether the main thread's RunLoop state change has timed out; a timeout is recorded as a stall. When the stall duration exceeds a preset threshold, the stack is dumped and, after the relevant policies are applied, reported at an appropriate time. Matrix is a recently open-sourced library from Tencent; it has a good stack processing strategy and is currently very popular.
Scheme comparison
FPS
FPS (frames per second) is the number of frames a page renders per second. A higher FPS means a smoother page, and a value between 50 and 60 is considered smooth. FPS is measured indirectly by counting CADisplayLink callbacks over an interval. In addition, because the RunLoop switches from kCFRunLoopDefaultMode to UITrackingRunLoopMode while a scrollable interface is being scrolled, the RunLoop mode can be used to tell whether a drop in smoothness happened during scrolling.
CADisplayLink is a timer that fires at the screen refresh rate of iOS devices (60 frames per second). It is created with a target and a selector and registered with the RunLoop under NSRunLoopCommonModes. On every frame refresh, the selector bound to the target is called and increments a counter. Once the time elapsed since the last measurement exceeds 1 second, the count is the FPS of the current page.
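The counting step can be sketched independently of CADisplayLink. In this plain C illustration the caller passes in each tick's timestamp in seconds (standing in for the display link's timestamp); `fps_counter` and `fps_on_tick` are hypothetical names, and the count is divided by the elapsed time so that a window slightly longer than 1 second still yields a per-second value.

```c
typedef struct {
    double   window_start; /* timestamp of the first tick in the window */
    unsigned frames;       /* ticks counted since window_start */
    double   last_fps;     /* most recent measurement */
    int      started;      /* 0 until the first tick arrives */
} fps_counter;

void fps_init(fps_counter *c)
{
    c->window_start = 0.0;
    c->frames = 0;
    c->last_fps = 0.0;
    c->started = 0;
}

/* Call from the display link callback with the tick's timestamp (seconds).
 * Returns 1 when a new value has been stored in c->last_fps. */
int fps_on_tick(fps_counter *c, double timestamp)
{
    if (!c->started) {            /* first tick only opens the window */
        c->started = 1;
        c->window_start = timestamp;
        c->frames = 0;
        return 0;
    }
    c->frames++;
    double elapsed = timestamp - c->window_start;
    if (elapsed < 1.0)
        return 0;                 /* keep counting until 1 second has passed */

    c->last_fps = c->frames / elapsed;
    c->window_start = timestamp;  /* start the next window */
    c->frames = 0;
    return 1;
}
```

On a real device the callback would also note the current RunLoop mode, so that low readings taken during scrolling can be told apart from the rest.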
Ping-Pong
Implementation principle: ping is normally used to test whether a packet can reach an IP address and to measure the network's response. Applied to lag monitoring, the core idea is that a child thread maintains a ping timer and pings the main thread (sends it a notification) at a fixed interval. If the main thread is not busy, it receives the notification and replies with a pong (sends a notification back to the child thread). If the child thread has not received the pong within the configured timeout, the main thread is considered stalled and the stack is dumped.
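The bookkeeping behind the ping-pong scheme can be sketched without any threading. In this plain C illustration the caller supplies the current time, the actual delivery of ping and pong between threads is left out, and `ping_monitor` and its functions are hypothetical names.

```c
typedef struct {
    double timeout_sec;   /* how long the child thread waits for a pong */
    double ping_sent_at;  /* time of the oldest unanswered ping */
    int    awaiting_pong; /* 1 while a ping has not been answered */
} ping_monitor;

void ping_monitor_init(ping_monitor *m, double timeout_sec)
{
    m->timeout_sec = timeout_sec;
    m->ping_sent_at = 0.0;
    m->awaiting_pong = 0;
}

/* Child thread: record that a ping was sent to the main thread.
 * If an earlier ping is still unanswered, its send time is kept. */
void ping_monitor_send_ping(ping_monitor *m, double now)
{
    if (!m->awaiting_pong) {
        m->ping_sent_at = now;
        m->awaiting_pong = 1;
    }
}

/* Child thread: the main thread answered. */
void ping_monitor_on_pong(ping_monitor *m)
{
    m->awaiting_pong = 0;
}

/* Child thread: 1 if the outstanding ping has gone unanswered too long,
 * which is the moment to dump the stack. */
int ping_monitor_is_stalled(const ping_monitor *m, double now)
{
    return m->awaiting_pong && (now - m->ping_sent_at) > m->timeout_sec;
}
```

In a full implementation the ping would be dispatched to the main queue and the pong sent back from there, so an unanswered ping means the main thread never got around to running the block.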
Listening to the RunLoop
Implementation principle: start a child thread that listens to the RunLoop's state notifications. If no RunLoop state notification arrives within the time threshold, the main thread is judged to be stalled and the stack is dumped. Alternatively, stalls can be counted and reported only once they reach a certain frequency.
Hooking objc_msgSend
Every Objective-C method call is ultimately compiled into an objc_msgSend message send. By hooking it and maintaining a data structure that records the duration of each method call, performance can be analyzed per method. However, this approach has a high performance cost and is expensive to maintain, so it is not recommended.
In the end, we decided to listen to the RunLoop, using Matrix and MTHawkeye as references.
Implementation
Detection flow chart
Design and Implementation
Design principle
Observers are added at the start and end of the RunLoop to measure how long the main thread spends between those states. A child thread checks the main thread's status periodically (every 200 ms by default). When the main thread has been running in the same state for longer than a threshold (400 ms by default), it is considered stalled and the stall is recorded. If the stall lasts longer than 8 seconds, it is considered a freeze. The lag threshold and the check period directly affect both the detection capability and the performance cost of lag monitoring.
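The interplay between the observer and the checking thread can be sketched as follows. This is a simplified model in plain C rather than the project's actual code: `runloop_status`, `runloop_observer_fired`, and `monitor_check` are hypothetical names, time is passed in by the caller, and the RunLoop is treated as idle once it reports that it is about to sleep.

```c
#include <stdint.h>

typedef enum { MAIN_SMOOTH, MAIN_LAG, MAIN_FREEZE } main_state;

typedef struct {
    uint32_t check_period_ms;     /* how often the child thread checks */
    uint32_t lag_threshold_ms;    /* busy longer than this: lag */
    uint32_t freeze_threshold_ms; /* busy longer than this: freeze */
} monitor_config;

typedef struct {
    uint64_t last_change_ms; /* when the RunLoop last changed state */
    int      idle;           /* 1 when the RunLoop is about to sleep */
} runloop_status;

/* Defaults quoted in the text: 200 ms period, 400 ms lag, 8 s freeze. */
monitor_config monitor_default_config(void)
{
    monitor_config c = {200, 400, 8000};
    return c;
}

/* Main thread: called from the RunLoop observer on every state change. */
void runloop_observer_fired(runloop_status *s, uint64_t now_ms, int going_idle)
{
    s->last_change_ms = now_ms;
    s->idle = going_idle;
}

/* Child thread: called once per check period. */
main_state monitor_check(const monitor_config *c, const runloop_status *s,
                         uint64_t now_ms)
{
    if (s->idle)
        return MAIN_SMOOTH;            /* sleeping is not lagging */

    uint64_t busy_ms = now_ms - s->last_change_ms;
    if (busy_ms > c->freeze_threshold_ms)
        return MAIN_FREEZE;
    if (busy_ms > c->lag_threshold_ms)
        return MAIN_LAG;
    return MAIN_SMOOTH;
}
```

A shorter check period detects stalls sooner after they cross the threshold but wakes the child thread more often, which is the trade-off between detection capability and performance cost mentioned above. In real code the shared status would also need to be accessed atomically or under a lock.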
Stack snapshot
The system provides the task_threads function to retrieve all threads of a task. Each thread's register state can be read with thread_get_state, which fills in a parameter of type _STRUCT_MCONTEXT. From it, the thread's stack pointer and frame pointer are obtained, and walking back through the frames yields the address of every function on the call stack. Those addresses are then adjusted by their offsets and finally symbolicated into function names.
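The frame-walking step can be sketched on its own. The plain C illustration below assumes the usual frame-pointer layout on arm64 and x86_64, where each frame record holds the caller's frame pointer followed by the return address; `frame_record`, `walk_frames`, and `unslide` are hypothetical names. It reads memory directly, whereas a real monitor reading another thread's stack would copy each record with a safe read and validate the pointers first.

```c
#include <stddef.h>
#include <stdint.h>

/* One frame record as laid out when frame pointers are in use. */
typedef struct frame_record {
    const struct frame_record *caller; /* saved frame pointer of the caller */
    uintptr_t return_address;          /* where this function returns to */
} frame_record;

/* Collects the program counter plus the return address of every frame.
 * pc and fp would come from the register state of the target thread.
 * Returns the number of addresses written to out (at most max). */
size_t walk_frames(uintptr_t pc, const frame_record *fp,
                   uintptr_t out[], size_t max)
{
    size_t n = 0;

    if (max == 0) return 0;
    out[n++] = pc;                      /* innermost frame */
    while (fp != NULL && n < max) {
        if (fp->return_address == 0)    /* end of the chain */
            break;
        out[n++] = fp->return_address;
        fp = fp->caller;                /* step to the caller's frame */
    }
    return n;
}

/* Converts a runtime address to the address recorded in the binary image
 * by removing the image's load slide, as needed before symbolication. */
uintptr_t unslide(uintptr_t address, uintptr_t slide)
{
    return address - slide;
}
```

Capping the walk with `max` keeps a corrupted frame chain from looping forever, which matters when the target thread is suspended at an arbitrary point.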
Stack deduplication
An annealing algorithm filters out consecutive identical stacks:
ThreadBacktraceSnapshot *mainBacktraceSnapshot = [self generateBacktraceSnapshot:dumpType];
ThreadBacktraceSnapshot *preSnapshot = self.snapshotsArray.lastObject;
if (preSnapshot) {
    if (![preSnapshot.backtraceDescription isEqualToArray:mainBacktraceSnapshot.backtraceDescription]) {
        mainBacktraceSnapshot.capturedCount = self.annealingCount;
        [self.snapshotsArray addObject:mainBacktraceSnapshot]; // different stack: store it
        self.annealingCount = 1;
    } else {
        self.annealingCount += 1; // same stack as last time: only count the repeat
    }
} else {
    self.annealingCount = 1;
    [self.snapshotsArray addObject:mainBacktraceSnapshot];
}
Determining a freeze
The implementation above records how long each stall lasts, so the business side can define how long a stall must last before it counts as a freeze.
Recording freeze duration
When a stall on a page exceeds the freeze threshold, timing continues from that threshold until the RunLoop enters its next state; otherwise it goes on until the WatchDog threshold is reached and the system kills the app. This way the recorded duration approaches the actual freeze time even if the app is eventually terminated. The margin of error has not been determined yet.
API provided
/**
 Set the lag threshold and the check interval of the detection thread, then start monitoring.
 @param runloopTimeOut lag threshold
 @param checkRunLoopTimeOutThreshold check interval of the detection thread
 */
- (void)startWithRunloopTimeOut:(useconds_t)runloopTimeOut andCheckPeriodTime:(unsigned)checkRunLoopTimeOutThreshold;
/** Start with the default values and check every 200 ms. */
- (void)start;
- (void)stop;
Reported data
Performance cost
CPU usage fluctuates by 2%-4%.
To be optimized
At present, the lag monitor has only been used in a demo and has not been deployed online. There are likely still many problems to work on, such as performance bottlenecks, stack filtering, and tuning of the annealing algorithm.
References
- mp.weixin.qq.com/s/M6r7NIk-s…
- tech.meituan.com/2016/12/19/…
- mrpeak.cn/blog/ios-ha…
- github.com/aozhimin/iO…
- github.com/meitu/MTHaw…