This article describes a static analysis method for finding unused classes in a Mach-O executable.
In a Mach-O file, the __DATA __objc_classrefs section records the addresses of the classes that are referenced, and the __DATA __objc_classlist section records the addresses of all classes. Taking the set difference of the two yields the addresses of the unused classes, and symbolicating those addresses yields the names of the unreferenced classes.
Referenced class addresses
Section contents of a Mach-O file can be printed with the macOS tool otool. Note that simulator and device builds produce executables for different architectures (x86_64 and arm in the examples below), and otool prints their section data in different layouts, so the two cases must be handled separately. The architecture can be obtained with the file command.
import os
# binary_file_arch: selects how the otool dump is decoded (single bytes for x86_64, 32-bit words for arm)
# file -b output example: Mach-O 64-bit executable arm64
binary_file_arch = os.popen('file -b ' + path).read().split(' ')[-1].strip()
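As a small illustration of the parsing step, the sketch below separates the string handling from the call to file, so it can be checked without a binary at hand. The helper name arch_from_file_output is hypothetical, and the sample strings simply follow the output format quoted in the comment above.

```python
def arch_from_file_output(output):
    """Return the last whitespace-separated token of `file -b` output.

    Hypothetical helper: it assumes a thin (single-architecture) binary,
    where the architecture name is the final word of the description.
    """
    tokens = output.split()
    return tokens[-1] if tokens else ''


print(arch_from_file_output('Mach-O 64-bit executable arm64\n'))   # arm64
print(arch_from_file_output('Mach-O 64-bit executable x86_64\n'))  # x86_64
```

A universal (fat) binary reports several architectures, so under this assumption it would have to be thinned to one architecture first.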
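Class addresses are decoded differently for x86_64 and ARM, because otool prints the section as single bytes for x86_64 and as 32-bit words for ARM.
def pointers_from_binary(line, binary_file_arch):
    # Returns a set of 16-digit hex strings (empty when the line holds no pointers).
    pointers = set()
    # Data lines start with a 16-digit hex address; skip headers and anything else.
    address = line[:16].lower()
    if len(line) <= 16 or any(c not in '0123456789abcdef' for c in address):
        return pointers
    words = line[16:].split()
    if binary_file_arch == 'x86_64':
        # untreated line example: 00000001030cec80 d8 75 15 03 01 00 00 00 68 77 15 03 01 00 00 00
        # Bytes are little-endian, so each group of 8 is reversed into one address.
        for i in range(0, len(words) - 7, 8):
            pointers.add(''.join(words[i:i + 8][::-1]))
    elif binary_file_arch.startswith('arm'):
        # arm64 confirmed, armv7 and armv7s unconfirmed
        # untreated line example: 00000001030bcd20 03138580 00000001 03138878 00000001
        # Each address is two 32-bit words, low word first.
        for i in range(0, len(words) - 1, 2):
            pointers.add(words[i + 1] + words[i])
    return pointers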
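To make the two layouts concrete, the following sketch decodes the sample lines quoted in the comments above with bytes.fromhex and int.from_bytes instead of list slicing. The function name decode_pointers is hypothetical; it assumes 64-bit pointers and that the first 16 characters of a line are the address column.

```python
def decode_pointers(line, arch):
    """Decode the 64-bit pointers on one otool data line into 16-digit hex strings."""
    words = line[16:].split()  # drop the address column
    if arch == 'x86_64':
        # otool prints one byte per token, in memory (little-endian) order.
        raw = bytes.fromhex(''.join(words))
    elif arch.startswith('arm'):
        # otool prints 32-bit words as numbers, so restore each word's
        # little-endian byte order before joining them.
        raw = b''.join(bytes.fromhex(w)[::-1] for w in words)
    else:
        return set()
    return {
        '%016x' % int.from_bytes(raw[i:i + 8], 'little')
        for i in range(0, len(raw) - 7, 8)
    }


x86_line = '00000001030cec80 d8 75 15 03 01 00 00 00 68 77 15 03 01 00 00 00'
arm_line = '00000001030bcd20 03138580 00000001 03138878 00000001'
print(sorted(decode_pointers(x86_line, 'x86_64')))
# ['00000001031575d8', '0000000103157768']
print(sorted(decode_pointers(arm_line, 'arm64')))
# ['0000000103138580', '0000000103138878']
```

Both layouts end up as the same 16-digit lowercase form, which is what allows the addresses to be compared with each other and looked up later.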
Run otool -v -s __DATA __objc_classrefs to obtain the addresses of the referenced classes.
def class_ref_pointers(path, binary_file_arch):
    ref_pointers = set()
    lines = os.popen('/usr/bin/otool -v -s __DATA __objc_classrefs %s' % path).readlines()
    for line in lines:
        pointers = pointers_from_binary(line, binary_file_arch)
        # Skip header lines and lines that could not be parsed.
        if not pointers:
            continue
        ref_pointers = ref_pointers.union(pointers)
    return ref_pointers
All class addresses
Run otool -v -s __DATA __objc_classlist to obtain the addresses of all classes.
def class_list_pointers(path, binary_file_arch):
    list_pointers = set()
    lines = os.popen('/usr/bin/otool -v -s __DATA __objc_classlist %s' % path).readlines()
    for line in lines:
        pointers = pointers_from_binary(line, binary_file_arch)
        # Skip header lines and lines that could not be parsed.
        if not pointers:
            continue
        list_pointers = list_pointers.union(pointers)
    return list_pointers
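The two functions above differ only in the section name, so they could share one helper. The sketch below is one possible arrangement, not the article's code: section_pointers and parse_section_dump are hypothetical names, subprocess.run replaces os.popen so that paths containing spaces are passed safely, and only the arm word layout is handled to keep the example short. The sample dump is illustrative.

```python
import subprocess


def parse_section_dump(text):
    """Collect 64-bit pointers from otool output that uses the arm word layout."""
    pointers = set()
    for line in text.splitlines():
        fields = line.split()
        # Data lines start with a 16-digit hex address; header lines do not.
        if not fields or len(fields[0]) != 16:
            continue
        try:
            int(fields[0], 16)
        except ValueError:
            continue
        words = fields[1:]
        # Each pointer is two 32-bit words, low word first.
        for i in range(0, len(words) - 1, 2):
            pointers.add(words[i + 1] + words[i])
    return pointers


def section_pointers(path, section):
    """Run otool on one __DATA section and parse its dump (needs the macOS tools)."""
    result = subprocess.run(
        ['/usr/bin/otool', '-v', '-s', '__DATA', section, path],
        capture_output=True, text=True, check=True)
    return parse_section_dump(result.stdout)


sample = (
    '/path/to/App:\n'
    'Contents of (__DATA,__objc_classrefs) section\n'
    '00000001030bcd20\t03138580 00000001 03138878 00000001\n'
    '00000001030bcd30\t03138b70 00000001\n'
)
print(sorted(parse_section_dump(sample)))
# ['0000000103138580', '0000000103138878', '0000000103138b70']
```

Iterating over word pairs also keeps the last pointer when a section holds an odd number of entries and the final line is only half full.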
Take the difference set
Subtracting the referenced class addresses from the full class list yields the addresses of the unused classes.
unref_pointers = class_list_pointers(path, binary_file_arch) - class_ref_pointers(path, binary_file_arch)
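A toy illustration of the difference step, with made-up addresses standing in for the parsed sections:

```python
# Hypothetical addresses; in the real script these come from otool.
list_pointers = {'0000000103138580', '0000000103138878', '0000000103138b70'}
ref_pointers = {'0000000103138580', '0000000103138b70'}

# Set difference: every class in the class list that no classref points to.
unref_pointers = list_pointers - ref_pointers
print(unref_pointers)  # {'0000000103138878'}
```

Both sets hold plain hex strings, so the comparison is only meaningful when both sections were decoded the same way and in the same letter case.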
Symbolication
The addresses and their corresponding class names can be obtained with the nm -nm command.
def class_symbols(path):
    symbols = {}
    # class symbol format from nm: 0000000103113f68 (__DATA,__objc_data) external _OBJC_CLASS_$_EpisodeStatusDetailItemView
    re_class_name = re.compile(r'(\w{16}) .* _OBJC_CLASS_\$_(.+)')
    lines = os.popen('nm -nm %s' % path).readlines()
    for line in lines:
        result = re_class_name.findall(line)
        if result:
            (address, symbol) = result[0]
            symbols[address] = symbol
    return symbols
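To show what the pattern accepts and rejects, here is a sketch that applies it to a few lines without running nm. The helper name parse_nm_lines is hypothetical and the sample lines are illustrative; only the first one is taken from the comment above.

```python
import re

# Same pattern as in class_symbols, written as a raw string.
RE_CLASS_NAME = re.compile(r'(\w{16}) .* _OBJC_CLASS_\$_(.+)')


def parse_nm_lines(lines):
    """Map 16-digit addresses to class names for lines that define a class symbol."""
    symbols = {}
    for line in lines:
        match = RE_CLASS_NAME.search(line)
        if match:
            address, name = match.groups()
            symbols[address] = name.strip()
    return symbols


sample = [
    '0000000103113f68 (__DATA,__objc_data) external _OBJC_CLASS_$_EpisodeStatusDetailItemView\n',
    '0000000103113f90 (__DATA,__objc_data) external _OBJC_METACLASS_$_EpisodeStatusDetailItemView\n',
    '                 (undefined) external _OBJC_CLASS_$_NSObject (from Foundation)\n',
]
print(parse_nm_lines(sample))
# {'0000000103113f68': 'EpisodeStatusDetailItemView'}
```

Metaclass symbols and undefined (imported) classes are skipped: the former lack the _OBJC_CLASS_ prefix, and the latter have no 16-digit address for the pattern to capture.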
Filtering
In practice, if a subclass is instantiated but its parent class is not, the parent class does not appear in the __objc_classrefs section. Such parent classes must be filtered out of the set of unused classes. The inheritance relationships can be obtained with otool -oV.
def filter_super_class(unref_symbols):
    re_subclass_name = re.compile(r"\w{16} 0x\w{9} _OBJC_CLASS_\$_(.+)")
    re_superclass_name = re.compile(r"\s*superclass 0x\w{9} _OBJC_CLASS_\$_(.+)")
    # subclass example: 0000000102bd8070 0x103113f68 _OBJC_CLASS_$_TTEpisodeStatusDetailItemView
    # superclass example: superclass 0x10313bb80 _OBJC_CLASS_$_TTBaseControl
    # path is the module-level path of the binary being analyzed.
    lines = os.popen("/usr/bin/otool -oV %s" % path).readlines()
    subclass_name = ""
    superclass_name = ""
    for line in lines:
        subclass_match_result = re_subclass_name.findall(line)
        if subclass_match_result:
            subclass_name = subclass_match_result[0]
        superclass_match_result = re_superclass_name.findall(line)
        if superclass_match_result:
            superclass_name = superclass_match_result[0]
        # Once both names of a pair are known, keep the parent of any used subclass.
        if len(subclass_name) > 0 and len(superclass_name) > 0:
            if superclass_name in unref_symbols and subclass_name not in unref_symbols:
                unref_symbols.remove(superclass_name)
            superclass_name = ""
            subclass_name = ""
    return unref_symbols
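The single pass above removes a parent only when its direct subclass is already known to be in use, so the result can depend on the order in which classes are listed. As a variation (an assumption on my part, not the article's method), the sketch below first builds a class-to-superclass map and then walks up from every used class. The names parent_map and drop_used_ancestors are hypothetical, the address pattern is loosened to 0x\w+, and all sample lines beyond the first two are invented for illustration.

```python
import re

RE_CLASS = re.compile(r'\w{16} 0x\w+ _OBJC_CLASS_\$_(.+)')
RE_SUPER = re.compile(r'\s*superclass 0x\w+ _OBJC_CLASS_\$_(.+)')


def parent_map(lines):
    """Build {class name: superclass name} from otool -oV style lines."""
    parents = {}
    current = None
    for line in lines:
        class_match = RE_CLASS.match(line)
        if class_match:
            current = class_match.group(1).strip()
            continue
        super_match = RE_SUPER.match(line)
        if super_match and current:
            parents[current] = super_match.group(1).strip()
            current = None
    return parents


def drop_used_ancestors(unref_symbols, parents):
    """Remove every ancestor of a class that is itself in use."""
    kept = set(unref_symbols)
    for name in parents:
        if name in unref_symbols:
            continue  # an unused class does not keep its ancestors alive
        ancestor = parents.get(name)
        visited = set()
        while ancestor and ancestor not in visited:
            kept.discard(ancestor)
            visited.add(ancestor)
            ancestor = parents.get(ancestor)
    return kept


sample = [
    '0000000102bd8070 0x103113f68 _OBJC_CLASS_$_TTEpisodeStatusDetailItemView',
    '    superclass 0x10313bb80 _OBJC_CLASS_$_TTBaseControl',
    '0000000102bd8078 0x10313bb80 _OBJC_CLASS_$_TTBaseControl',
    '    superclass 0x10313bc00 _OBJC_CLASS_$_TTBaseView',
    '0000000102bd8080 0x10313bd00 _OBJC_CLASS_$_TTUnusedView',
    '    superclass 0x10313bc00 _OBJC_CLASS_$_TTBaseView',
]
unref = {'TTBaseControl', 'TTBaseView', 'TTUnusedView'}
print(drop_used_ancestors(unref, parent_map(sample)))  # {'TTUnusedView'}
```

Here TTEpisodeStatusDetailItemView is in use, so both its parent and its grandparent are dropped from the candidates, while TTUnusedView stays.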
To avoid flagging classes from third-party libraries by mistake, you can also exclude classes with certain prefixes, or keep only classes with certain prefixes.
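# reserved_prefix and filter_prefix are strings; an empty string disables the rule.
unref_symbols = set()
for unref_pointer in unref_pointers:
    if unref_pointer in symbols:
        unref_symbol = symbols[unref_pointer]
        # Keep only classes with the reserved prefix, if one is given.
        if len(reserved_prefix) > 0 and not unref_symbol.startswith(reserved_prefix):
            continue
        # Drop classes with the filtered prefix, if one is given.
        if len(filter_prefix) > 0 and unref_symbol.startswith(filter_prefix):
            continue
        unref_symbols.add(unref_symbol)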
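The two prefix rules can be tried in isolation with a small predicate. This is a sketch with a hypothetical helper name, keep_symbol, and invented class names; it mirrors the checks in the loop above and relies on the fact that str.startswith also accepts a tuple of prefixes.

```python
def keep_symbol(name, reserved_prefix='', filter_prefix=''):
    """Apply the two prefix rules to one class name."""
    if reserved_prefix and not name.startswith(reserved_prefix):
        return False  # outside the prefixes we want to report
    if filter_prefix and name.startswith(filter_prefix):
        return False  # explicitly excluded, e.g. a third-party library
    return True


names = ['TTUnusedView', 'AFHTTPSessionManager', 'TTDebugPanel']
print([n for n in names if keep_symbol(n, reserved_prefix='TT', filter_prefix='TTDebug')])
# ['TTUnusedView']
print([n for n in names if keep_symbol(n, filter_prefix=('AF', 'SD'))])
# ['TTUnusedView', 'TTDebugPanel']
```

Passing a tuple lets several library prefixes be excluded in one call without changing the function.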
The final result is saved in the script directory.
script_path = sys.path[0].strip()
with open(script_path + "/result.txt", "w") as f:
    f.write("unref class number: %d\n" % len(unref_symbols))
    f.write("\n")
    for unref_symbol in unref_symbols:
        f.write(unref_symbol + "\n")
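Because sets have no fixed order, the report can differ between runs even when the result is identical. A possible refinement, sketched below with a hypothetical write_report helper and invented class names, sorts the names before writing; the demonstration writes to a temporary directory rather than the script directory.

```python
import os
import tempfile


def write_report(unref_symbols, out_path):
    """Write the count followed by one class name per line, sorted."""
    with open(out_path, 'w') as f:
        f.write('unref class number: %d\n' % len(unref_symbols))
        f.write('\n')
        for name in sorted(unref_symbols):
            f.write(name + '\n')


out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'result.txt')
write_report({'TTUnusedView', 'TTLegacyCell'}, out_path)
with open(out_path) as f:
    report = f.read()
print(report)
```

A sorted report is easier to diff between builds when tracking how the list of unused classes changes over time.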
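This approach can reduce redundant code and package size to some extent. Because the analysis is static, classes that are only referenced dynamically at runtime (for example, looked up by name through NSClassFromString) are not detected as used, so every class on the list should be verified before it is deleted.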