An Introduction to the Unused Code Detection Scheme for 58.com's Mixed OC and Swift iOS Projects

Abstract: This article explains how to detect unused Objective-C (OC) and Swift code by analyzing Mach-O files and applying disassembly, with a focus on the Swift detection scheme. Because it builds on the Swift Mach-O layout, we recommend first reading “On the Storage Differences between Swift and OC from the Perspective of Mach-O” and “Swift Hook New Ideas: Virtual Function Table” to become familiar with the related concepts and structures. The code is open source as WBBlades; if you find the tool or the approach helpful, please give it a star.

Background

Recently, many large apps have started to support mixed Swift and Objective-C development, and the apps of 58 Group are also actively exploring Swift. We can therefore expect the proportion of Swift code in the group's iOS projects to keep growing over the next few years, and we need to think about the problems that arise as Swift code proliferates. Detecting unused code in mixed projects is one of them.

About useless code detection

Does unused code need to be detected, and does it need to be deleted? Of all performance optimizations, removing unused code has the lowest ROI. However, almost all high-ROI techniques are one-time optimizations whose effect fades after several iterations. By contrast, detecting and deleting unused code offers room for optimization over the long term. Taking version 10.15.1 of the 58 app as an example, in the official App Store package for iPhone 7 devices the main binary accounts for 66% of the package size, dynamic libraries account for 15%, and resources account for less than 20%.

Inspecting a package downloaded from the App Store on a jailbroken device gives an accurate view of the package composition on that device (I consider this the most accurate way to measure). Resources take up a relatively small share of the 58 app because we mainly store images in asset catalogs (.xcassets), which take full advantage of app slicing. If your images are still stored in bundles, the share of resources may be considerably higher, in which case we recommend moving them into asset catalogs first.

In addition to package size optimization, timely removal of unused code also helps with launch optimization, and it plays an important role in project maintenance. Redundant code often means extra effort when assessing the impact of a requirement, so removing obsolete code promptly improves development efficiency to some extent. Static detection is not intended to find all dead code in a project; its job is to narrow a large code base down to candidates for the subsequent inspection steps. It therefore needs to produce a reasonably sized set of suspected unused code, and the proportion of genuinely unused code in that set should be as high as possible. Because the accuracy of static detection is limited, it cannot be the only technique and serves only as a pre-filter. At 58.com, in addition to WBBlades detection, there is secondary filtering based on business code characteristics as well as runtime verification.

Several approaches to unused code detection in mixed projects

In a pure OC environment there are many unused code detection schemes, but for mixed OC and Swift projects there are relatively few. The reason is that OC and Swift differ greatly in both the compiler front end and the compiled binary, so a scheme designed for OC may not work for Swift and vice versa. Common techniques in the industry currently include AppCode's inspection and static detection tools such as Pecker, which is based on IndexStoreDB and SwiftSyntax: SwiftSyntax collects all symbols, and IndexStoreDB provides the reference relationships between them, from which the set of unreferenced code is derived. Besides these two, there are many other options, such as source text analysis, optimization of framework object files, and Mach-O file analysis. Which one to choose depends largely on where the tool will be used, and the best choice can even vary with the amount of code in the app. Before adopting Swift, the 58 app already used a detection scheme based on Mach-O analysis, mainly because it is easier to integrate into the release process. To keep the technical approach consistent, we continued to use Mach-O file analysis after the project became mixed.

How is unused code detected for OC?

Unused OC code can be detected and optimized at several stages, including compilation, linking, the build product, and runtime. 58.com uses a double guarantee of product scanning plus runtime verification. Product scanning is done by WBBlades on the Mach-O file. The basic idea is to take the set difference of classlist and classrefs to form a preliminary set of unused classes, then apply secondary filtering based on the characteristics of the business code. For example, classes that serve as base classes or as member variable types, classes invoked dynamically through complete class-name strings, and classes that RN or Hybrid modules register in their load methods are all treated as used and do not appear in the unused set, which reduces the cost of secondary verification. This scheme, however, cannot be applied directly to projects written in Swift. The reasons and our solution are discussed next.
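The set-difference idea can be illustrated with a minimal Python sketch. This is not the WBBlades implementation: the class names, the `keep_rules` parameter, and the two sample rules are hypothetical and only stand in for the business-specific filters described above.

```python
def find_unused_classes(class_list, class_refs, keep_rules=()):
    """Return classes that are defined but never referenced explicitly.

    class_list: names read from the classlist section (all defined classes).
    class_refs: names read from the classrefs section (explicitly referenced).
    keep_rules: hypothetical predicates; a class matching any rule is kept.
    """
    candidates = set(class_list) - set(class_refs)
    return sorted(
        name for name in candidates
        if not any(rule(name) for rule in keep_rules)
    )


# Hypothetical sample data, for illustration only.
class_list = ["HomeViewController", "LegacyBanner", "BaseModel", "RNBridgeModule"]
class_refs = ["HomeViewController"]
super_classes = {"BaseModel"}          # used as a base class somewhere
dynamic_strings = {"RNBridgeModule"}   # appears as a complete string in the binary

keep_rules = [
    lambda name: name in super_classes,
    lambda name: name in dynamic_strings,
]
unused = find_unused_classes(class_list, class_refs, keep_rules)
```

Here only `LegacyBanner` survives both the set difference and the secondary filters, so it is the only candidate handed to the later verification steps.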

Swift class calls

OC's detection scheme relies mainly on the set difference of classlist and classrefs; the other techniques are merely complementary. Without a section like classrefs to supply the main information, the foundation of the whole scheme would be shaken. So we first need to understand under what circumstances a class is recorded in classrefs. Let's start with an example:

WBBladesClass *b = nil; 
id c = [WBBladesClass new];
Class d = NSClassFromString(@"WBBladesClass");

In the example above, the OC class is stored in classrefs only when it is the receiver of an explicit method call such as [WBBladesClass new].

Does the Swift class also have this feature?

In Swift code, explicitly used classes are not added to the classrefs section. After compiling and linking the following code and inspecting the binary in MachOView, neither TestClass0 nor TestClass1 appears in classrefs.

class TestClass0: NSObject {
    dynamic func hello() {
        let obj = TestClass0.init()
    }
}
class TestClass1 {
    func hello() {
        let obj = TestClass1.init()
    }
}

However, if a Swift class is exported to and used from OC code, it is added to classrefs.

// Swift
class TestClass2: NSObject {}

// Objective-C
+ (void)load {
    id obj = [TestClass2 new];
}

This shows that classrefs only applies to the OC language environment. Even setting aside struct, enum, and other types, the set difference of classlist and classrefs cannot be used to detect unused Swift code.

So how do you recognize that a Swift type is being called?

Without classrefs to record it, how can we tell whether a Swift class is used? We described the storage structure of Swift classes in detail in “On the Storage Differences between Swift and OC from the Perspective of Mach-O” and “Swift Hook New Ideas: Virtual Function Table”.

struct ClassContextDescriptor {
    uint32_t Flag;
    uint32_t Parent;
    int32_t  Name;
    int32_t  AccessFunction;
    int32_t  FieldDescriptor;
    int32_t  SuperclassType;
    uint32_t MetadataNegativeSizeInWords;
    uint32_t MetadataPositiveSizeInWords;
    uint32_t NumImmediateMembers;
    uint32_t NumFields;
    uint32_t FieldOffsetVectorOffset;
    // <generic signature>: the number of bytes depends on the number of generic parameters and requirements
    // <MaybeAddResilientSuperclass>: adds 4 bytes if present
    // <MaybeAddMetadataInitialization>: adds 4 * 3 bytes if present
    // VTableList[]: 4 bytes for offset/pointerSize, 4 bytes for the count, then N * (4 + 4) bytes describing each function's type and address
    // OverrideTableList[]: 4 bytes for the count, then N * (4 + 4 + 4) bytes describing the overridden class, the overridden function, and the address of the overriding implementation
};
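To make the layout concrete, here is a minimal Python sketch that reads the eleven fixed 4-byte fields listed above and resolves AccessFunction. It assumes little-endian data and that AccessFunction is a relative pointer, that is, a signed 32-bit offset measured from the address of the field itself. The helper names are hypothetical, and the variable-length tail (generic signature, vtable, override table) is not parsed.

```python
import struct
from collections import namedtuple

ClassContextDescriptor = namedtuple("ClassContextDescriptor", [
    "Flag", "Parent", "Name", "AccessFunction", "FieldDescriptor",
    "SuperclassType", "MetadataNegativeSizeInWords",
    "MetadataPositiveSizeInWords", "NumImmediateMembers", "NumFields",
    "FieldOffsetVectorOffset",
])

# Field types follow the struct as printed above: 11 fields, 44 bytes.
FIXED_LAYOUT = struct.Struct("<IIiiiiIIIII")
ACCESS_FUNCTION_OFFSET = 12  # Flag + Parent + Name come first


def parse_descriptor(data, offset):
    """Read the fixed-size head of a descriptor located at `offset` in `data`."""
    return ClassContextDescriptor._make(FIXED_LAYOUT.unpack_from(data, offset))


def access_function_address(descriptor, descriptor_address):
    """Resolve the relative AccessFunction field to an absolute address."""
    field_address = descriptor_address + ACCESS_FUNCTION_OFFSET
    return field_address + descriptor.AccessFunction
```

Given the virtual address at which a descriptor was found, `access_function_address` yields the address that later has to be searched for in the disassembled code.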

Some readers may wonder why the structure above does not match what they see when debugging. In the debugger, the structure of a Swift class looks like this:

struct SwiftMetadataClass {
    NSInteger kind;
    id superclass;
    NSInteger reserveword1;
    NSInteger reserveword2;
    NSUInteger rodataPointer;
    UInt32 classFlags;
    UInt32 instanceAddressPoint;
    UInt32 instanceSize;
    UInt16 instanceAlignmentMask;
    UInt16 runtimeReservedField;
    UInt32 classObjectSize;
    UInt32 classObjectAddressPoint;
    NSInteger nominalTypeDescriptor;
    NSInteger ivarDestroyer;
    // ... followed by N function addresses
};

At runtime, what we obtain from a class's .self is a struct SwiftMetadataClass, whereas the storage descriptor we refer to is the ClassContextDescriptor; they are not the same structure.

struct SwiftMetadataClass *swiftClass = (__bridge struct SwiftMetadataClass *)(TestClass2.self);

SwiftMetadataClass is Swift runtime data; in a Mach-O file, the SwiftMetadataClass structure of a class is stored in the DATA segment. Compared with OC, a Swift Mach-O file also relies on the (__TEXT,__const) section, which stores the Swift TypeContextDescriptor structures (TypeContextDescriptor is the parent of ClassContextDescriptor). The TypeContextDescriptor is closer to the source form than the SwiftMetadataClass: from it we can easily tell which module the code is defined in, how many properties it has and of what types, whether it is generic, how many functions it has, which functions override the parent class, and so on. However, because MachOView does not handle Swift Mach-O files well, this section appears there as unformatted binary data.

What does ClassContextDescriptor have to do with SwiftMetadataClass?

In short, SwiftMetadataClass.nominalTypeDescriptor points to the class's ClassContextDescriptor, while calling the function referenced by ClassContextDescriptor.AccessFunction returns the address of the corresponding SwiftMetadataClass.

 let tclass = TestClass1.self

Setting a breakpoint and stepping through the code shows that before the class address is obtained, the metadata accessor function of TestClass1 (that is, the AccessFunction of TestClass1's ClassContextDescriptor) is called first.

bl  0x100a3b32c ; type metadata accessor for BBB.TestClass1

In other words, if we find a call to a class's AccessFunction in the assembly code, we know the class is being used.

How to find AccessFunction in assembly code?

In a Mach-O file, code is stored as machine instructions, so mnemonics and operands are not directly available. We therefore need a disassembly library to convert the instructions into assembly code. With the assembly in hand, we can tell whether a class is used in a function simply by looking for the address of its AccessFunction within that function's instruction range.
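As a rough illustration of the search, the Python sketch below does not use a disassembly library at all. It decodes only the arm64 BL instruction by hand (top six bits 100101, followed by a signed 26-bit word offset) and checks whether the branch target is a known accessor address. The function names and the `accessors` mapping are hypothetical, and a real scanner has to cover more patterns, such as tail calls through B or addresses materialized with ADRP and ADD.

```python
import struct


def bl_targets(code, base_address):
    """Yield (instruction_address, target_address) for every BL in `code`.

    code: raw little-endian arm64 instructions of one function.
    base_address: virtual address of the first instruction.
    """
    for offset in range(0, len(code) - len(code) % 4, 4):
        (word,) = struct.unpack_from("<I", code, offset)
        if word & 0xFC000000 != 0x94000000:   # not a BL instruction
            continue
        imm26 = word & 0x03FFFFFF
        if imm26 & 0x02000000:                # sign-extend the 26-bit offset
            imm26 -= 0x04000000
        pc = base_address + offset
        yield pc, pc + (imm26 << 2)           # offset is counted in 4-byte words


def classes_used_in(code, base_address, accessors):
    """Return the class names whose accessor is called from this function.

    accessors: hypothetical mapping of AccessFunction address -> class name.
    """
    return {
        accessors[target]
        for _, target in bl_targets(code, base_address)
        if target in accessors
    }
```

Running `classes_used_in` over every function range and taking the union of the results gives the set of Swift types that are referenced through their accessors.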

How do I know the interval of instructions for each function?

This step is easy. In a Mach-O file built in Debug mode, the symbol table gives the address of every symbol. Functions are symbols too, so the symbol table also records each function's address, which is n_value in the structure below.

/*
 * This is the symbol table entry structure for 64-bit architectures.
 */
struct nlist_64 {
    union {
        uint32_t  n_strx; /* index into the string table */
    } n_un;
    uint8_t n_type;        /* type flag, see below */
    uint8_t n_sect;        /* section number or NO_SECT */
    uint16_t n_desc;       /* see <mach-o/stab.h> */
    uint64_t n_value;      /* value of this symbol (or stab offset) */
};

However, the symbol table only provides the name and the starting address of each symbol. For a function, this means we know its name and where it starts, but not where it ends. The end of a function can be roughly inferred by statically looking for RET instructions, but this is quite inaccurate for Swift assembly code. A more direct approach is to use the symbol table to cut the assembly instructions into segments, which is also one of the reasons WBBlades needs to analyze a Debug package. To do this, first sort the symbol table by address, then treat the starting address of the next symbol as the end of the current function. This yields the instruction range of each function.
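The cutting step can be sketched in a few lines of Python. The sketch assumes the raw symbol table and string table bytes have already been located through LC_SYMTAB, reads each 16-byte nlist_64 entry as laid out above, and uses the next higher symbol address as the end of each function. The function names are hypothetical, and filtering of debug (stab) and undefined entries by n_type is left out for brevity.

```python
import struct

# n_strx (uint32), n_type (uint8), n_sect (uint8), n_desc (uint16), n_value (uint64)
NLIST_64 = struct.Struct("<IBBHQ")


def read_symbols(symtab, strtab, count):
    """Return [(name, address)] for `count` nlist_64 entries."""
    symbols = []
    for index in range(count):
        n_strx, _n_type, _n_sect, _n_desc, n_value = NLIST_64.unpack_from(
            symtab, index * NLIST_64.size)
        end = strtab.index(b"\0", n_strx)     # names are NUL-terminated
        symbols.append((strtab[n_strx:end].decode("utf-8", "replace"), n_value))
    return symbols


def function_ranges(symbols, text_start, text_end):
    """Map each symbol inside [text_start, text_end) to (start, end).

    The end of a function is taken to be the start of the next symbol,
    or the end of the text section for the last one.
    """
    starts = sorted({addr for _, addr in symbols if text_start <= addr < text_end})
    next_start = dict(zip(starts, starts[1:] + [text_end]))
    return {name: (addr, next_start[addr])
            for name, addr in symbols if addr in next_start}
```

Symbols that share a starting address receive the same range, which is a deliberate simplification of this sketch.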

Interesting problems encountered

While adapting WBBlades to Swift, I ran into many interesting problems, which amounted to a series of pitfalls during development.

Section identification is not rigorous

ByteDance published an article titled “Toutiao Optimization Practice: iOS Package Size Binary Optimization, Reducing Download Size by 60 MB with One Line of Code”, so some apps may have migrated sections. After section migration, two sections that used to be in the same segment end up in different segments. Because different segments may have different base addresses, any address calculation that crosses segments must correct for the base address; otherwise the resulting file offset may be wrong. In addition, because segment names can be customized, a section cannot be uniquely identified by combining the segment name with the section name. It has to be identified by the permissions of its segment plus the section name.

if ((segmentCommand.maxprot & (VM_PROT_WRITE | VM_PROT_READ)) == (VM_PROT_WRITE | VM_PROT_READ)) {
    // Readable and writable segment: __DATA, __DATA_CONST, __AUTH_CONST, etc.
}

Of course, this check is not entirely accurate either. After a section is migrated, the new segment is readable and writable by default, which means data that was originally in __TEXT may end up with VM_PROT_WRITE | VM_PROT_READ after migration. This is why permissions need to be reset after migration.
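Both points, the permission check and the per-segment base correction, can be sketched in Python as follows. The `Segment` tuple and the function names are hypothetical; the fields mirror the vmaddr, vmsize, fileoff and maxprot members of a 64-bit segment load command, and VM_PROT_READ and VM_PROT_WRITE carry their usual values of 1 and 2.

```python
from collections import namedtuple

VM_PROT_READ = 0x1
VM_PROT_WRITE = 0x2

# Hypothetical in-memory view of one segment load command.
Segment = namedtuple("Segment", "name vmaddr vmsize fileoff maxprot")


def is_read_write(segment):
    """True when the segment's maximum protection allows both read and write."""
    read_write = VM_PROT_READ | VM_PROT_WRITE
    return (segment.maxprot & read_write) == read_write


def file_offset(address, segments):
    """Convert a virtual address to a file offset using its own segment.

    Using the base of the segment that actually contains the address keeps
    the result correct even after sections were moved to another segment.
    """
    for segment in segments:
        if segment.vmaddr <= address < segment.vmaddr + segment.vmsize:
            return address - segment.vmaddr + segment.fileoff
    raise ValueError("address 0x%x is not inside any segment" % address)
```

Looking sections up through `is_read_write` plus the section name, instead of through a fixed segment name, is what makes the scan tolerant of renamed segments.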

Looping through Parent to get the class name may cause an exception

func extensions(of value: Any) {
    struct Extensions: AnyExtensions {}
    return
}

Swift differs from OC in that classes or structures can be defined in many places. In the code above, for example, a structure is defined inside a function. The Parent of the Extensions structure is not a Module context, so the Parent chain cannot be walked on the assumption that every ancestor is a module or a named type; each Parent has to be parsed according to its own kind, such as the Anonymous layout described below.
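A defensive version of the Parent loop might look like the Python sketch below. It rests on several assumptions that are not spelled out in this article: every context descriptor starts with Flag and Parent, the kind is stored in the low 5 bits of Flag (0 for Module, 2 for Anonymous, 16 to 18 for class, struct and enum), module and type descriptors keep a relative Name pointer at byte offset 8, a Parent value of 0 means there is no parent, and a Parent offset with its low bit set is an indirect reference that this sketch does not follow. All function names are hypothetical.

```python
import struct

KIND_MODULE = 0
KIND_ANONYMOUS = 2
TYPE_KINDS = {16, 17, 18}  # class, struct, enum


def read_cstring(data, offset):
    """Read a NUL-terminated UTF-8 string starting at `offset`."""
    end = data.index(b"\0", offset)
    return data[offset:end].decode("utf-8")


def context_path(data, offset):
    """Walk the Parent chain from the descriptor at `offset` up to the root."""
    parts = []
    while True:
        flags, parent = struct.unpack_from("<Ii", data, offset)
        kind = flags & 0x1F
        if kind == KIND_MODULE or kind in TYPE_KINDS:
            (name_rel,) = struct.unpack_from("<i", data, offset + 8)
            parts.append(read_cstring(data, offset + 8 + name_rel))
        elif kind == KIND_ANONYMOUS:
            parts.append("(anonymous)")       # no Name field to read here
        else:
            parts.append("(kind %d)" % kind)  # extension, protocol, ...
        if parent == 0 or parent & 1:         # root reached, or indirect parent
            break
        offset = offset + 4 + parent          # relative to the Parent field
    return ".".join(reversed(parts))
```

The important part is the branch on `kind`: reading a Name field from an Anonymous parent would interpret unrelated bytes as an offset, which is the kind of exception the heading refers to.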

Complex generic structures

Generics are complex because the generic signature of a type has variable length: it depends on the number of generic parameters and the number of requirements. The layout below shows how many bytes the generic data occupies.

Content	Size	Note
addMetadataInstantiationCache	4 B	class only
addMetadataInstantiationPattern	4 B	class only
GenericParamCount	2 B
GenericRequirementCount	2 B
GenericKeyArgumentCount	2 B
GenericExtraArgumentCount	2 B
params	GenericParamCount B	1 byte per parameter
padding	(unsigned)-GenericParamCount & 3 B	padding to 4-byte alignment
requirements	3 * 4 * GenericRequirementCount B	12 bytes per requirement
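The table translates directly into a small size calculation. The Python sketch below simply adds up the rows; the function and parameter names are hypothetical, and `is_class` corresponds to the two rows marked "class only".

```python
def generic_signature_size(param_count, requirement_count, is_class=True):
    """Number of bytes occupied by the generic data described in the table."""
    size = 0
    if is_class:
        size += 4 + 4                     # instantiation cache + pattern
    size += 2 * 4                         # the four 2-byte counters
    size += param_count                   # one byte per generic parameter
    size += -param_count & 3              # pad the parameters to 4 bytes
    size += 3 * 4 * requirement_count     # 12 bytes per requirement
    return size
```

For example, a class with one generic parameter and no requirements needs 8 + 8 + 1 + 3 = 20 bytes, and the parser has to skip exactly that many bytes before reading whatever follows the generic signature.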

Anonymous layout

The official comment describing Anonymous is as follows:

/// This context descriptor represents an anonymous possibly-generic context
/// such as a function body.
Anonymous = 2,

Unlike the layout of classes, structures, etc., Anonymous has the following layout in binary:

Flag(4Byte) + Parent(4Byte) + generic signature (variable length) + mangleName(4Byte)

However, mangleName does not necessarily exist in Anonymous, so it is necessary to determine whether mangleName exists when parsing Anonymous.

/// Flags for anonymous type context descriptors. These values are used as the
/// kindSpecificFlags of the ContextDescriptorFlags for the anonymous context.
class AnonymousContextDescriptorFlags : public FlagSet<uint16_t> {
  enum {
    /// Whether this anonymous context descriptor is followed by its
    /// mangled name, which can be used to match the descriptor at runtime.
    HasMangledName = 0,
  };
  ...
};

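If the TypeContext is Anonymous, check the kind-specific flags, which occupy the upper two bytes (the high 16 bits) of Flag. HasMangledName is bit 0 of that field, so if these two bytes are 0, the Anonymous descriptor has no mangleName.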
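The check can be written as a short Python sketch. It assumes the bit layout of ContextDescriptorFlags as I understand it from the Swift runtime headers: the kind in the low 5 bits, the generic bit at 0x80, and the kind-specific flags in the high 16 bits. The function names are hypothetical.

```python
KIND_ANONYMOUS = 2
HAS_MANGLED_NAME_BIT = 0  # bit index inside the 16-bit kind-specific flags


def decode_flags(flags):
    """Split the 32-bit Flag value into the parts used in this article."""
    return {
        "kind": flags & 0x1F,
        "is_generic": bool(flags & 0x80),
        "kind_specific": (flags >> 16) & 0xFFFF,
    }


def anonymous_has_mangled_name(flags):
    """True when an Anonymous descriptor is followed by its mangled name."""
    info = decode_flags(flags)
    if info["kind"] != KIND_ANONYMOUS:
        raise ValueError("not an Anonymous context descriptor")
    return bool(info["kind_specific"] & (1 << HAS_MANGLED_NAME_BIT))
```

With this split, a Flag of 0x00010002 describes an Anonymous context that carries a mangled name, while 0x00000082 describes a generic Anonymous context without one.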

Other

Code declared with access levels such as fileprivate and open is laid out somewhat differently, and we have handled these cases in the open source code. In addition, Swift sometimes accesses a class not through AccessFunction but directly through the address of the class. In that case, the “demangling cache variable for type metadata for ...” symbol in the symbol table is used instead.

Supported cases

When running the WBBlades binary scan, we tested apps containing the following kinds of code. Code marked ✅ in the examples is correctly identified as used. V1.1 is the version before the Swift binary adaptation, and V2.0 is the version after it.

Usage

Build an arm64 device package of the app to be tested in the Debug configuration.
Compile WBBlades to generate the WBBlades executable. Github.com/wuba/WBBlad…
Drag the WBBlades executable into Terminal, type -unused, then drag the device package into Terminal.
Press Enter and wait a few minutes. The result file is written to the desktop. If there is a lot of Swift code, this may take longer.

Current adoption and outlook

At present, the 58 app contains more than 20,000 classes and more than 1,000 Swift type definitions. Static detection found that about 8% of the OC code was unused, while the proportion for Swift code was lower, about 2%. After manual review, we found that detection accuracy was high for some lines of business, above 80%, and lower for others. The main causes of reduced accuracy are class names assembled from multiple strings for dynamic invocation and the use of reflection in Swift, both of which are difficult to detect by general means without knowing the code's concatenation rules. In the future we will keep improving the tool and report the number of bytes each piece of unused code occupies in the binary alongside the scan results, to help developers make decisions.

Conclusion

Swift is a magical and profound language whose flexibility at the top comes at the cost of complexity underneath. The author is still exploring and learning, so looking at Swift through an OC mindset is sometimes unavoidable. For example, before developing the tool the author kept thinking about how to detect unused Swift classes, when in fact struct, enum, and other types are equally important in development. WBBlades therefore continues to be optimized. Currently, WBBlades is being trialed with 14 teams or individuals inside and outside 58 Group, and we keep collecting feedback. If you have a good idea or a question, please leave a comment on GitHub.

About the author

Deng Zhuli: senior development engineer in the User Value Growth Center, Platform Technology Department, iOS Technology Department; author of the WBBlades open source tool.
