Bytedance Terminal Technology — Xia Liu
Preface
The author is a member of the ByteDance terminal technology AppHealth (Client Infrastructure – AppHealth) team. As part of our work, we maintain and customize the open-source LLVM and Swift toolchains and promote the adoption of compiler optimizations in business scenarios. A compiler is a complex piece of software, so it also has bugs as well as compatibility and correctness issues. Here we share a compiler bug that we discovered after enabling Clang's -Oz optimization level.
The problem
In Xcode, we can set different optimization levels for the Clang compiler. For example, Debug builds use -O0 by default and Release builds use -Os by default (balancing speed and size). In scenarios where performance is less important, we can use -Oz instead, which makes the compiler optimize code size more aggressively.
To reduce package size, one of the company's video components was compiled with Clang's -Oz optimization level. In testing afterwards, however, memory usage surged and the app crashed with OOM whenever the component exported a video, and the problem reproduced reliably. Instruments and Xcode's Memory Graph showed that a large number of GLFramebuffer objects had been created, each holding a 2 MB CVPixelBuffer, which together occupied a large amount of memory.
These GLFramebuffer objects were expected to be reused rather than created repeatedly, but the logs showed that no buffer was available each time one was requested, so new buffers kept being created. In the code, a buffer can be reused only after -[GLFramebuffer unlock] has been called, yet we observed that buffers piled up until the end of the export task before being unlocked. So we needed to find out why the unlock was delayed.
Reading the code shows that each GLFramebuffer is held by a SampleData object and is unlocked when -[SampleData dealloc] is called. The memory surge occurs because SampleData objects accumulate in the autorelease pool; the bulk unlock observed earlier happens when the autorelease pool releases its objects all at once.
Note that SampleData objects do not enter the autorelease pool when -Oz is not enabled, so there is no problem in that case. Next we need to find out why they enter the autorelease pool once -Oz is enabled.
Under ARC, objects are autoreleased through C functions such as objc_autoreleaseReturnValue/objc_autorelease, so there is no -[SampleData autorelease] method on which to set a symbolic breakpoint to confirm when the autorelease happens, unless the code is changed back to MRC. We therefore used a trick:
Add the following class to the project and disable ARC for its file with the compiler flag -fno-objc-arc:
// Like SampleData, it inherits from NSObject
@interface BDRetainTracker : NSObject
@end
@implementation BDRetainTracker
- (id)autorelease {
    return [super autorelease]; // Set the breakpoint here
}
@end
Set a breakpoint in the overridden autorelease method, and run the following after the app starts:
class_setSuperclass(SampleData.class, (Class)NSClassFromString(@"BDRetainTracker"));
Now, whenever a SampleData object is autoreleased, execution stops at the breakpoint we set. The autorelease turned out to happen in -[CompileReaderUnit processSampleData:]:
- (BOOL)processSampleData:(SampleData *)sampleData {
    ...
    SampleData *videoData = [self videoReaderOutput];
    ...
}
The memory growth disappears if the method is rewritten as follows:
- (BOOL)processSampleData:(SampleData *)sampleData {
    @autoreleasepool {
        ...
        SampleData *videoData = [self videoReaderOutput];
        ...
    }
}
Returning an autoreleased object from [self videoReaderOutput] is consistent with ARC's conventions. It is just that, when -Oz was not enabled, the compiler and runtime optimized the autorelease away and the object never entered the autorelease pool. See the LLVM documentation:
When returning from such a function or method, ARC retains the value at the point of evaluation of the return statement, then leaves all local scopes, and then balances out the retain while ensuring that the value lives across the call boundary. In the worst case, this may involve an autorelease, but callers must not assume that the value is actually in the autorelease pool.
ARC performs no extra mandatory work on the caller side, although it may elect to do something to shorten the lifetime of the returned value.
Since autorelease is an expensive operation, ARC optimizes it away whenever possible. From this we can guess that the optimization is defeated when -Oz is enabled. Let's look at the assembly for SampleData *videoData = [self videoReaderOutput]:
adrp x8, #0x1018b5000
ldr x1, [x8, #0x1c0] ; Load @selector(videoReaderOutput)
bl _OUTLINED_FUNCTION_40_100333828 ; Call the outlined function
bl _OUTLINED_FUNCTION_0_1003336bc ; Call the outlined function
The contents of the two _OUTLINED_FUNCTION_ functions called are as follows:
_OUTLINED_FUNCTION_40_100333828:
mov x0, x20
b imp___stubs__objc_msgSend
_OUTLINED_FUNCTION_0_1003336bc:
mov x29, x29
b imp___stubs__objc_retainAutoreleasedReturnValue
So the logic of the generated code is as expected:
- It calls objc_msgSend(self, @selector(videoReaderOutput), ...), which returns an autoreleased object.
- It then calls objc_retainAutoreleasedReturnValue on the returned object to take a strong reference.
Comparing this with the code generated previously under -Os shows that LLVM's Machine Outliner has taken effect under -Oz:
adrp x8, #0x10190d000
ldr x1, [x8, #0xf0]
mov x0, x20
bl imp___stubs__objc_msgSend
mov x29, x29
bl imp___stubs__objc_retainAutoreleasedReturnValue
Machine Outliner
At the -Oz optimization level, the instruction pair on lines 3 and 4 and the pair on lines 5 and 6 of the listing above each occur in many places, so each pair is extracted into a standalone function for reuse and replaced at the original site with a single call instruction. The instruction count at this site drops from 4 to 2, which reduces package size. This is what LLVM's Machine Outliner does. It is enabled by default at -Oz for more aggressive code size reduction (at other optimization levels it can be enabled with -mllvm -enable-machine-outliner=always). Its general principle is as follows:
extern int do_something(int);
int calc_1(int a, int b) {
return do_something(a * (a - b));
}
int calc_2(int a, int b) {
return do_something(a * (a + b));
}
In this code, both calc_1 and calc_2 call do_something. Although the arguments differ, the assembly contains some repeated instruction sequences (ARMv7 assembly is used here for ease of demonstration):
calc_1(int, int):
add r1, r1, r0 ; A
mul r0, r1, r0 ; B
add r1, r1, r0 ; A
mul r0, r1, r0 ; B
b do_something(int) ; C
calc_2(int, int):
add r1, r1, r0 ; A
add r1, r1, r0 ; A
mul r0, r1, r0 ; B
b do_something(int) ; C
We label identical instructions with the same label, so calc_1’s instruction sequence is ABABC and calc_2 is AABC. By constructing a suffix tree, the compiler can find that their longest common substring is ABC. Then ABC can be separated into a separate function:
calc_1(int, int):
add r1, r1, r0 ; A
mul r0, r1, r0 ; B
b OUTLINED_FUNCTION_0
calc_2(int, int):
add r1, r1, r0 ; A
b OUTLINED_FUNCTION_0
OUTLINED_FUNCTION_0:
add r1, r1, r0 ; A
mul r0, r1, r0 ; B
b do_something(int) ; C
Because the memory-management instructions that the compiler inserts into ARC code are so common, many of these sequences get outlined (see this talk for details about its implementation).
The ARC optimization
But why does ARC's optimization fail once the instructions are outlined? Note the instruction mov x29, x29: it does nothing meaningful (it copies the value of register x29 back into x29). It is just a special marker that the compiler emits to help the runtime optimize. The implementation of videoReaderOutput returns its autoreleased object through a call like this:
return objc_autoreleaseReturnValue(ret);
Its runtime implementation is roughly as follows:
// Prepare a value at +1 for return through a +0 autoreleasing convention.
id objc_autoreleaseReturnValue(id obj) {
if (prepareOptimizedReturn(ReturnAtPlus1)) return obj;
return objc_autorelease(obj);
}
// Try to prepare for optimized return with the given disposition (+0 or +1).
// Returns true if the optimized path is successful.
// Otherwise the return value must be retained and/or autoreleased as usual.
static ALWAYS_INLINE bool
prepareOptimizedReturn(ReturnDisposition disposition) {
assert(getReturnDisposition() == ReturnAtPlus0);
if (callerAcceptsOptimizedReturn(__builtin_return_address(0))) {
if (disposition) setReturnDisposition(disposition);
return true;
}
return false;
}
static ALWAYS_INLINE bool
callerAcceptsOptimizedReturn(const void *ra) {
// fd 03 1d aa mov x29, x29
if (*(uint32_t *)ra == 0xaa1d03fd) {
return true;
}
return false;
}
static ALWAYS_INLINE void
setReturnDisposition(ReturnDisposition disposition) {
tls_set_direct(RETURN_DISPOSITION_KEY, (void *)(uintptr_t)disposition);
}
objc_autoreleaseReturnValue uses __builtin_return_address to get the return address of its caller and checks whether the instruction at that address is the mov x29, x29 marker. If it is, the returned object will be retained immediately by the caller, so there is no need to put it in the autorelease pool. In that case the runtime records the optimization in thread-local storage (TLS) and returns the object at +1.
Correspondingly, the caller of videoReaderOutput uses objc_retainAutoreleasedReturnValue to take a reference to the returned object. It is implemented as follows:
// Accept a value returned through a +0 autoreleasing convention for use at +1.
id objc_retainAutoreleasedReturnValue(id obj) {
if (acceptOptimizedReturn() == ReturnAtPlus1) return obj;
return objc_retain(obj);
}
// Try to accept an optimized return.
// Returns the disposition of the returned object (+0 or +1).
// An un-optimized return is +0.
static ALWAYS_INLINE ReturnDisposition
acceptOptimizedReturn() {
ReturnDisposition disposition = getReturnDisposition();
setReturnDisposition(ReturnAtPlus0); // reset to the unoptimized state
return disposition;
}
static ALWAYS_INLINE ReturnDisposition
getReturnDisposition() {
return (ReturnDisposition)(uintptr_t)tls_get_direct(RETURN_DISPOSITION_KEY);
}
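When objc_retainAutoreleasedReturnValue sees the mark in TLS, it knows that no additional retain is needed. The two functions thus cooperate to eliminate one autorelease and one retain. However, this is an implementation detail of the compiler and the runtime, and code should not assume that the optimization will take place. With -Oz enabled, the Machine Outliner outlined both the objc_msgSend call and the objc_retainAutoreleasedReturnValue call together with the marker instruction. The outlined function jumps to objc_msgSend with a plain b, so the return address still points to the instruction right after bl _OUTLINED_FUNCTION_40 in the caller, which is now another bl rather than mov x29, x29. The marker check therefore fails, the optimization is not triggered, and the object ends up in the autorelease pool.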
Conclusion
So in essence this was an oversight by the developer: a temporary object that uses a lot of memory was created without an autorelease pool to release it promptly. ARC optimizations hid the problem, and it finally surfaced when -Oz was turned on.
It is also a compiler bug: outlining should not render ARC optimizations ineffective. This bug was only recently fixed in LLVM.
As a related note, some ARC optimizations that affect object lifetimes (such as -enable-copy-propagation) are not enabled yet. Using an object outside the lifetime guaranteed by the compiler may work fine at first, but once these optimizations suddenly kick in because of a compiler upgrade or a code change, places that previously used the object may access a freed object. For more specific examples, see WWDC21 session 10216.
About the ByteDance Terminal Technology team
ByteDance Client Infrastructure is a global R&D team for "big front-end" infrastructure technology (with R&D teams in Beijing, Shanghai, Hangzhou, Shenzhen, Guangzhou, Singapore, and Mountain View). It is responsible for building ByteDance's entire big front-end infrastructure and for improving the performance, stability, and engineering efficiency of the company's whole product line. Supported products include, but are not limited to, Douyin, Toutiao, Xigua Video, Feishu, and Guagualong. We do in-depth research on mobile, Web, desktop, and other platforms.
We are now hiring globally for client, front-end, server, on-device intelligence algorithm, and test development roles. Let's change the world with technology. If you are interested, please contact [email protected] with the subject line: resume - name - job intention - desired city - phone number.