On September 26, 2016, version 1.0.0 of Douyin was launched. Douyin's daily active users have since exceeded 600 million; in just six years, it has grown explosively from zero. Amid rapid business development, massive data growth, and rising expectations for picture quality in video and live streaming, how does Douyin's basic technology team respond to changing times with technical innovation and refine the user experience with a craftsman's care? And what has the team done in iOS development that is not widely known outside the company?

On the afternoon of January 22, the third Bytedance Technology Salon was live streamed. Under the theme “Douyin iOS Basic Technology Revealed”, five Douyin iOS client engineers, Chen Xiancai, Chen Wenhuan, Shu Biao, Han Jianlei and Zhu Feng, shared practical experience from Douyin's iOS client development from different perspectives, bringing a down-to-earth, cutting-edge technical feast to an online audience of nearly 40,000.

Evolution and Challenges of Large-scale App Development Architecture by Chen Xiancai

Architecture determines the scale and efficiency of a project. Chen Xiancai, iOS client architect of Douyin Basic Technology, described how the Douyin team kept evolving Douyin's architecture through modularization, componentization, and plug-in stages without disrupting business iteration or the growth in business scale.

Modularization

In the early stage, code volume, business scale, and the number of developers were all expanding rapidly, in tension with the need to keep normal business iteration going. The technical team first approached the problem from the perspective of efficiency: it stripped resources, configuration, and the engineering environment the App depends on out of the main project, and built base capabilities on top of the underlying code, forming a shell project. The team also designed a relatively uniform module template covering source code, environment, resources, and tools, so that modules could be created and developed to a uniform standard.

In terms of R&D process and tooling, the team supported multi-repository MRs (merge requests) and established a basic R&D environment for local development and CI/CD. The team then modularized the entire project according to the module standard and ensured that most modules could be compiled into binary targets. Achieving modularization not only improved efficiency but also laid the foundation for the continued evolution of the architecture.

Componentization

As the business kept growing, the code inside individual business modules also expanded at an accelerating pace. After the modular split, unreasonable interface dependencies needed further analysis and governance. New iOS extensions and basic capabilities also made adopting Swift inevitable. Against this background, engineers started the componentization process to improve the team's overall R&D efficiency.

To solve the code reuse problem and reduce dependency complexity, the team redefined Douyin's architecture as five layers:

  1. Shell project;
  2. Business layer;
  3. Interface layer;
  4. Service layer;
  5. Base layer.

This layering turns the mesh-like dependency structure left by modularization into a tree-like one, which reduces dependency complexity and keeps dependencies between layers from deteriorating.

To manage dependencies across hundreds or even thousands of components, the technical team broke with the conventional “dependency flattening” approach and improved on it with containerization. A version container holds the shell project, the dependency list, the dependency change record, the overall build history, and aggregated release-artifact information. When a dependency chain is missing or broken, the missing dependency is detected during the MR by computing a set difference, which prevents the project's dependency relationships from deteriorating.
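The set-difference check can be pictured with a minimal sketch. The function name and the sample component names below are hypothetical, and how the real check collects the "used" set is not described in the talk; this only shows the comparison itself.

```python
from typing import Set

def missing_dependencies(declared: Set[str], used: Set[str]) -> Set[str]:
    """Return components that the code uses but the dependency list omits."""
    return used - declared

# Hypothetical data: a declared dependency list and the components
# actually referenced by the code in a merge request.
declared = {"Network", "Logger"}
used = {"Network", "Logger", "ImageKit"}
print(sorted(missing_dependencies(declared, used)))  # ['ImageKit']
```

An empty result means the dependency list covers everything the code references.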

At the same time, based on the new layered architecture, engineers defined dependency rules for the components of each layer to prevent unreasonable cyclic dependencies and keep the overall dependency graph from deteriorating. Under these rules, a higher layer may depend on a lower layer, and an implementation may depend on an interface. The interface layer has no dependencies and relies mainly on forward declarations. After several rounds of optimization in Douyin, the number of dependencies of each component dropped significantly.
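A layering rule of this kind can be checked mechanically. The sketch below is one possible reading of the rules, not the team's actual tooling: it assumes the five layers listed earlier, forbids same-layer and upward dependencies, lets any non-interface component depend on an interface, and gives interfaces no dependencies. All component names are hypothetical.

```python
from typing import Dict, List, Tuple

# Layer order from top (index 0) to bottom.
LAYERS = ["shell", "business", "interface", "service", "base"]
RANK = {name: i for i, name in enumerate(LAYERS)}

def violations(layer_of: Dict[str, str],
               deps: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the (from, to) edges that break the assumed layering rules."""
    bad = []
    for src, dst in deps:
        src_layer, dst_layer = layer_of[src], layer_of[dst]
        if src_layer == "interface":
            bad.append((src, dst))      # interfaces depend on nothing
        elif dst_layer == "interface":
            continue                    # implementations may depend on interfaces
        elif RANK[src_layer] >= RANK[dst_layer]:
            bad.append((src, dst))      # same-layer or upward dependency
    return bad

# Hypothetical components and edges.
layer_of = {"App": "shell", "Feed": "business", "FeedAPI": "interface",
            "FeedService": "service", "Net": "base"}
deps = [("App", "Feed"), ("Feed", "FeedAPI"), ("FeedService", "FeedAPI"),
        ("FeedService", "Net"), ("Net", "Feed"), ("FeedAPI", "Net")]
print(violations(layer_of, deps))  # [('Net', 'Feed'), ('FeedAPI', 'Net')]
```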

Another problem with binarization is changes at the interface layer. To deal with binary pollution caused by interface conflicts, engineers used the syntax tree information of the trunk to check real call usage directly in the MR, intercepting about 10% of binary pollution every day and effectively safeguarding the stability of the team's overall development.

To limit the impact on trunk stability of configuration problems, environment problems, and interface calls and conflicts among asynchronous MRs, engineers introduced an RC (Release Candidate) branch. It merges the code from multiple MRs and enters the stable trunk only after checks pass, which avoids problems such as local compilation failures and CI packaging failures.

Once stability was addressed, the continual splitting-off of new business repositories became the next drag on development efficiency. Engineers introduced a single-repository, multi-component layout: a layered structure that lets multiple components live in one repository without splitting it up. Swift and Objective-C code is also isolated at the interface layer to keep compile dependencies from propagating between components.

To improve overall R&D efficiency, the team also provided a binary-based code isolation scheme that isolates App-specific business code into binaries by binding and retrieving adapter protocols. Related infrastructure monitors code changes so that the impact on multiple Apps can be detected and quantified.

Plug-in

During componentization, Douyin's business scale kept expanding and the number of components grew from 100+ to 800+. Static binaries alone could no longer deliver the required efficiency gains, and the team faced new challenges in efficiency, quality, and cost.

Against this background, to improve both online performance and local efficiency, engineers began moving from static binaries to dynamic binaries. For lazy loading of business code, non-home-page business code and its exclusive base-library dependencies are placed directly into lazily loaded dynamic libraries. Dynamic libraries are also used to isolate specialized code, which helps in specific scenarios such as iPad-specific business and the refactoring of large business blocks.

To reduce the complexity of underlying dependencies and improve code quality, the team also designed a service framework that supports binding abstract interfaces to concrete implementations and eager loading of implementations. The framework covers decoupling, dynamic deployment, and service composition, smooths over differences between the underlying languages at compile time, and supports eager loading of services at run time.
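The core idea of such a framework, binding an abstract interface to a concrete implementation and creating the implementation either on first use or at bind time, can be sketched as follows. This is an illustrative registry only; `ServiceRegistry`, `ImageLoading`, `CachedImageLoader`, and the `eager` flag are hypothetical names and not the team's API.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict

class ServiceRegistry:
    """Maps an interface type to a factory and caches the created instance."""

    def __init__(self) -> None:
        self._factories: Dict[type, Callable[[], Any]] = {}
        self._instances: Dict[type, Any] = {}

    def bind(self, interface: type, factory: Callable[[], Any],
             eager: bool = False) -> None:
        self._factories[interface] = factory
        if eager:                       # create immediately instead of on first use
            self._instances[interface] = factory()

    def resolve(self, interface: type) -> Any:
        if interface not in self._instances:
            self._instances[interface] = self._factories[interface]()
        return self._instances[interface]

class ImageLoading(ABC):                # hypothetical abstract interface
    @abstractmethod
    def load(self, url: str) -> str: ...

class CachedImageLoader(ImageLoading):  # hypothetical concrete implementation
    def load(self, url: str) -> str:
        return "image:" + url

registry = ServiceRegistry()
registry.bind(ImageLoading, CachedImageLoader)
print(registry.resolve(ImageLoading).load("a.png"))  # image:a.png
```

Callers only ever name `ImageLoading`, so the implementation can be swapped or moved to another module without touching them.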

In addition, the engineers actively explored multiple modes of local development.

Douyin iOS Automation Services: Container and Scale Exploration by Chen Wenhuan

Automated testing and continuous integration are of great value in guaranteeing software quality, and are one of the safeguards for incremental development of large-scale projects. Chen Wenhuan, iOS client engineer of Douyin Basic Technology, described how Douyin's iOS automation achieved containerization and service at scale, along with some of the technical challenges and solutions involved.

iOS containerized testing

Containerized testing improves test stability and isolates different test tasks from each other's environments. In the service architecture of Douyin's iOS containers, the lowest layer, the device rack platform, provides abstract machine management and control. On top of it, engineers built dedicated testing services, including unit tests and UI tests. The platform also provides consumption of data reports and some business management capabilities. In addition, engineers integrated the R&D environment and CI toolchains according to the company's componentization status and its different CI systems. This architecture lets many of the company's projects and components share common testing services, which are already used in large projects such as Douyin, Live Media, and Toutiao.

For service isolation on the rack platform, engineers adopted Docker on a Linux cluster (left in the figure below); the Docker image contains test cases and toolchains. This lets rack environments run independently of one another and supports rapid deployment and control. The right side of the figure shows the core services used on iOS devices. At the top is the iOS background Runner process developed by Bytedance engineers, which accepts device control instructions and communicates with the underlying iOS services. There are also processes such as the installation proxy and debugserver. The Docker image interacts with the iOS device by communicating with the lockdownd service over USB.

iOS Device Control

Device control is inseparable from UI interaction. Taps, swipe gestures, pop-up handling, keyboard input, and bringing an App to the foreground are all basic capabilities needed in automated testing. Based on the XCTest system library, the test code is built into a dedicated App (called the UI Runner) that is installed on the test device and launched to run the tests. Chen Wenhuan analyzed the iOS device control mechanism in detail, using the XCTest API that simulates pressing the home button as an example.

NSXPCConnection is the name to watch in the case where XCTest simulates pressing the home button. NSXPCConnection is a two-way inter-process communication mechanism provided by Apple; a process can create a listener to receive requests from other processes. Printing the NSXPCConnection instance at run time shows that it points to the com.apple.testmanagerd service, whose binary is testmanagerd. The main function of testmanagerd also registers the service name com.apple.testmanagerd at startup, which matches the run-time analysis. By observing the NSXPCConnection calls made to testmanagerd while the Xcode toolchain drives a test, we can then see how the whole toolchain drives the test.

Analyzing the full XPC message exchange shows roughly two groups of protocols. One group, prefixed with XCT, belongs to XCTest and directly calls the UI interaction capabilities of testmanagerd. The other group, prefixed with IDE, handles whitelist authorization for the Xcode toolchain.

In the iOS device control chain developed at Bytedance, an App is started and its process is authorized through the _IDE_authorize protocol, which adds its PID to the whitelist of testmanagerd interface users. The App can then use all of the device control interfaces in testmanagerd directly through cross-process calls.

Testing at scale on the M1 simulator

In November 2020, Apple released its own M1 chip, which can run iOS programs. Against this background, the Douyin team began exploring testing on M1 devices to reduce build-out costs and open up new possibilities for improving test stability.

Running the device-build package directly on M1 runs into several limitations: App signature checks, only one App per Bundle ID, no home button, and a fixed screen size, device model, and OS version. All of these restrict rack testing at scale. Running device-build packages on the simulator on M1 therefore became the focus of the team's exploration.

To deal with the “Binary with wrong Platform” error raised when launching in the simulator, engineers, after verification, adopted an approach similar to IPAPatch: a post-processing step is added after the build product is generated to modify the Mach-O file, injecting or modifying the LC_BUILD_VERSION field for compatibility. This finally allowed the Douyin device-build package to run smoothly on the M1 simulator.
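To make the LC_BUILD_VERSION modification concrete, here is a minimal sketch that rewrites the platform field of that load command in a thin, little-endian, 64-bit Mach-O image. The constants come from Apple's `mach-o/loader.h`; everything else (the function name, the synthetic test image) is illustrative. It is not the team's tool: it does not inject a missing command, handle fat binaries or the older LC_VERSION_MIN_* commands, or re-sign the binary, all of which a real pipeline would need to consider.

```python
import struct

MH_MAGIC_64 = 0xFEEDFACF        # 64-bit Mach-O magic
LC_BUILD_VERSION = 0x32         # load command carrying the target platform
PLATFORM_IOS = 2
PLATFORM_IOSSIMULATOR = 7

def patch_platform(data: bytearray,
                   new_platform: int = PLATFORM_IOSSIMULATOR) -> int:
    """Rewrite the platform of every LC_BUILD_VERSION command in place.

    Returns the number of commands patched.
    """
    magic, _, _, _, ncmds, _, _, _ = struct.unpack_from("<8I", data, 0)
    if magic != MH_MAGIC_64:
        raise ValueError("not a thin little-endian 64-bit Mach-O image")
    offset, patched = 32, 0     # load commands follow the 32-byte header
    for _ in range(ncmds):
        cmd, cmdsize = struct.unpack_from("<2I", data, offset)
        if cmdsize < 8:
            raise ValueError("malformed load command")
        if cmd == LC_BUILD_VERSION:
            # build_version_command: cmd, cmdsize, platform, minos, sdk, ntools
            struct.pack_into("<I", data, offset + 8, new_platform)
            patched += 1
        offset += cmdsize
    return patched

# Synthetic image: a header plus a single LC_BUILD_VERSION command.
header = struct.pack("<8I", MH_MAGIC_64, 0x0100000C, 0, 2, 1, 24, 0, 0)
command = struct.pack("<6I", LC_BUILD_VERSION, 24, PLATFORM_IOS,
                      0x000E0000, 0x000E0000, 0)
image = bytearray(header + command)
print(patch_platform(image))  # 1
```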

In addition, Chen Wenhuan used adaptation of the Metal framework as an example to introduce the approach to adapting system libraries.

Super App Build Efficiency Increased by 40%: JoJo, Bytedance's Self-developed iOS Build System

JoJo is an iOS build system developed at Bytedance with Bazel at its core, providing a complete solution from CI/CD to local development builds. Starting from the relationship between JoJo and Bazel, Shu Biao, iOS development engineer of Douyin Basic Technology, introduced four features of JoJo: high performance, high scalability, support for multiple project architectures, and support for multiple IDEs. He revealed how JoJo helps Douyin, Toutiao, and other Apps with hundreds of millions of users improve build efficiency by 40%.

Cornerstones of high performance

A build is essentially made up of many different tasks and the dependencies between them. A build system usually requires that a task produce the same output when given the same resources, parameters, and tools. On that basis, the build system can reuse cached results at the level of individual tasks, which greatly speeds up builds.

The core problem in implementing a compile cache is computing the dependencies of each build task. Unlike a typical build system, JoJo combines remote caching, remote execution, and dependency computation. For local builds, JoJo implements an incremental build scheme similar to Xcode's: all the files needed for a build are read from the .d file that the compiler generated the last time the C or Swift source was built, and dependency computation is performed on that basis. A .d file is a dependency description file, generated by the compiler after a build, that lists all the files involved in that build.
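For readers unfamiliar with .d files, the sketch below parses the common Makefile-style form (`target: dep1 dep2 \` with backslash line continuations). It is a simplified illustration, not JoJo's parser: the function name is hypothetical, and paths containing escaped spaces or colons are not handled.

```python
from typing import List, Tuple

def parse_depfile(text: str) -> Tuple[str, List[str]]:
    """Parse a Makefile-style dependency file into (target, dependencies)."""
    joined = text.replace("\\\n", " ")       # fold backslash line continuations
    target, _, rest = joined.partition(":")  # split at the first colon
    return target.strip(), rest.split()

sample = "main.o: main.c \\\n  util.h \\\n  config.h\n"
print(parse_depfile(sample))  # ('main.o', ['main.c', 'util.h', 'config.h'])
```

If none of the listed files has changed since the last build, the task's previous output can be reused.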

In JoJo, engineers built tools on top of the Clang and Swift compilers to compute dependencies quickly for C and Swift code. More than 2,000 C files can be scanned within seconds, and Swift code achieves similar performance. As a result, JoJo offers fast cache reuse with almost no overhead while remaining correct.

In addition, JoJo speeds up builds through distributed caching and build clusters. For each build subtask, JoJo computes a key from its dependencies and uses that key to query the remote cache server for an existing product. On a hit, the product is downloaded, the output files are written, and the subtask is complete. On a miss, JoJo actually invokes the relevant tool to build, either locally or remotely. To avoid uploading resource files from the local machine to the remote cluster, the cluster downloads the required files from the cache server over the internal high-speed network, so the local machine only needs to send a file list. The remote cluster can consist of Mac or Linux machines, which makes it much easier to scale. The result is a complete distributed build architecture.
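The key-then-lookup flow can be sketched in a few lines. This is a toy model under stated assumptions: the key is a SHA-256 over the tool, its arguments, and the content of each input file, and an in-memory dictionary stands in for the remote cache server. How JoJo actually derives its keys is not described in the talk, and the function names here are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

def cache_key(tool: str, args: List[str], inputs: Dict[str, bytes]) -> str:
    """Derive a key from the tool, its arguments, and input file contents."""
    h = hashlib.sha256()
    h.update(tool.encode())
    for arg in args:
        h.update(b"\0" + arg.encode())
    for path in sorted(inputs):          # sort so dict ordering cannot matter
        h.update(b"\0" + path.encode() + b"\0")
        h.update(hashlib.sha256(inputs[path]).digest())
    return h.hexdigest()

def fetch_or_build(key: str, cache: Dict[str, bytes],
                   run: Callable[[], bytes]) -> Tuple[bytes, bool]:
    """Return (product, cache_hit); run the tool only on a miss."""
    if key in cache:
        return cache[key], True
    product = run()
    cache[key] = product                 # publish for later builds
    return product, False

remote_cache: Dict[str, bytes] = {}      # stand-in for the remote cache server
key = cache_key("clang", ["-O2"], {"a.c": b"int x;"})
print(fetch_or_build(key, remote_cache, lambda: b"obj")[1])  # False
print(fetch_or_build(key, remote_cache, lambda: b"obj")[1])  # True
```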

For an individual engineer's build, differences in network speed and local performance make task scheduling requirements highly variable. To handle this, JoJo implements an intelligent scheduling system. Unlike Xcode, which has a fixed limit on the number of concurrent tasks, JoJo adjusts its scheduling policy dynamically according to network, CPU, and cluster resources. JoJo also measures network throughput in real time as data is transferred and, taking local CPU performance into account, decides whether to disable the remote mechanism. Together these further ensure the stability and performance of the distributed build system.
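As a rough illustration of the trade-off such a scheduler weighs, the sketch below prefers the remote path only when the estimated transfer time plus remote execution time beats the local estimate. The function, its parameters, and the cost model are all assumptions made for illustration; JoJo's real policy is not described in this detail.

```python
def prefer_remote(transfer_bytes: int, bandwidth_bytes_per_s: float,
                  remote_seconds: float, local_seconds: float) -> bool:
    """Decide whether a task should go to the remote cluster.

    bandwidth_bytes_per_s is the measured network throughput; a value of
    zero or less is treated as "network unusable".
    """
    if bandwidth_bytes_per_s <= 0:
        return False
    transfer_seconds = transfer_bytes / bandwidth_bytes_per_s
    return transfer_seconds + remote_seconds < local_seconds

# 10 MB over a fast link: 0.2 s transfer + 1 s remote beats 3 s local.
print(prefer_remote(10_000_000, 50_000_000, 1.0, 3.0))  # True
# The same task over a slow link: 10 s transfer makes local the better choice.
print(prefer_remote(10_000_000, 1_000_000, 1.0, 3.0))   # False
```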

High scalability

While using Bazel as its core engine, JoJo rewrote and created a large number of rules that are completely independent of Bazel's own. In practice, JoJo expresses processes such as unit testing, static analysis, lazy loading of dynamic libraries, and index building as rules, so that these tasks can be managed and cached automatically by the build system.

Bazel’s built-in query command and aspect mechanism give JoJo flexible data query capabilities. Engineers can retrieve any compilation information, including build parameters and dependencies, and that information can also be consumed by another rule during the build, which enables dynamic builds.

Multiple engineering architecture support

Monolith, multirepo, and monorepo are the common repository management models. JoJo is designed to be extensible enough to support any of them. Currently, JoJo can build a standard CocoaPods project directly without any changes on the business side, which is how Douyin works. Toutiao manages business code in a monorepo, while third-party and base libraries continue to use CocoaPods, giving a mixed build model. JoJo is also working toward a standard paradigm for monorepo development within the company, to address learning and migration costs in one place.

For each architecture, JoJo adds a rule that supports that architecture's description format, and the architecture-specific rules handle the specifics. Everything is then unified into an intermediate layer whose representation abstractly describes how static and dynamic libraries are built, their dependencies, and so on. Finally, JoJo builds from this intermediate layer to produce the final artifacts. This is what enables support for multiple architectures and mixed modes.

IDE integration

JoJo supports a variety of IDEs. Shu Biao used Xcode as an example to show how to build with JoJo from within Xcode. So that business engineers notice as little difference as possible after switching to JoJo, the technical team, after some research, developed logic that fully takes over Xcode functions such as indexing, debugging, logging, and the progress bar. Under JoJo, the Xcode project becomes purely a “front end”: it is only used to browse project files and the directory structure, while all underlying tasks are carried out by JoJo, and the experience stays close to native.

After the modification, Xcode communicates directly with JoJo’s Build Service, which in turn calls JoJo to build and provides build progress, logs, compilation information, and parameters for Xcode to consume. Unrelated requests are forwarded to XCBBuildService for processing. JoJo also hooks into SKAgent’s index-building process so that index tasks can be built with JoJo, which completes the takeover of the whole process while keeping each function independent.

In addition, the technical team further optimized JoJo through index caching, source indexing for debugging binaries, and an intelligent analysis system that diagnoses errors and offers guidance, in order to better support business development.

Douyin iOS Experience Optimization: Exploring Fluency Optimization

Han Jianlei, who is currently responsible for the basic experience of the Douyin iOS client, used concrete, tangible cases to lay out common fluency problems and optimization strategies, and drew on practical experience to offer troubleshooting approaches and solutions for metric degradation.

Introduction to Fluency

What is fluency? In terms of scenarios, a range of operations, including page refreshes, animations, transitions, pop-ups, drags, and swipes, all fall under fluency. From the user's perspective, fluency covers visual, tactile, and auditory experience. Overall, fluency measures the quality of the user’s interaction across scenarios. In the Douyin team's experience, fluency optimization can bring gains of at least 3% in viewing duration and 6.6% in video plays. It is closely tied to business metrics such as per-capita viewing duration, page penetration, user retention, and advertising revenue.

Currently, Douyin uses dropped frames and FPS as the core metrics for fluency.
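The talk does not give the exact definitions behind these two metrics, so the sketch below uses one common convention as an assumption: FPS is the number of frame intervals divided by elapsed time, and an interval spanning n refresh periods counts as n - 1 dropped frames. The function name and the 60 Hz default are illustrative.

```python
from typing import List, Tuple

def fps_and_dropped(timestamps: List[float],
                    refresh_hz: float = 60.0) -> Tuple[float, int]:
    """Compute average FPS and dropped frames from ascending frame times (s)."""
    if len(timestamps) < 2:
        raise ValueError("need at least two frame timestamps")
    period = 1.0 / refresh_hz
    fps = (len(timestamps) - 1) / (timestamps[-1] - timestamps[0])
    dropped = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        # An interval of n refresh periods means n - 1 frames were skipped.
        dropped += max(0, round((cur - prev) / period) - 1)
    return fps, dropped

# Frames presented at refresh ticks 0, 1, 2, 5, 6: ticks 3 and 4 were missed.
frames = [tick / 60 for tick in (0, 1, 2, 5, 6)]
fps, dropped = fps_and_dropped(frames)
print(round(fps), dropped)  # 40 2
```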

Degradation attribution

Han Jianlei asked the audience to imagine the following scenario: one day, FPS, a core online metric, suddenly degrades sharply. How should the problem be solved?

  • Step 1: Analysis. Break the data down by version, channel, scenario, and other dimensions, and compare horizontally against other metrics to locate the problem.
  • Step 2: Reproduce under Debug. Suppose tools such as Time Profiler reveal many time-consuming operations: how do you tell which are new and which are historical? What if you can tell that something degraded but not by how much? What if the problem cannot even be reproduced under Debug? And can such problems be intercepted in advance so that they never reach production?

To address these problems, the technical team developed a function time monitoring system. By comparing function times against overall online data, it is easy to identify which function has degraded and by how much, so engineers can quickly pinpoint newly degraded functions regardless of whether the problem can be reproduced under Debug.
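The comparison step can be sketched as follows, assuming per-function average times are available for a baseline and for the current version. The function name, the 20% threshold, and the sample numbers are hypothetical.

```python
from typing import Dict, List, Tuple

def degraded_functions(baseline: Dict[str, float], current: Dict[str, float],
                       threshold: float = 0.2) -> List[Tuple[str, float]]:
    """Return (function, relative increase) pairs above threshold, worst first.

    Functions missing from the baseline are skipped here; a real system
    would likely report them separately as newly added cost.
    """
    result = []
    for name, cur in current.items():
        base = baseline.get(name)
        if base is None or base <= 0:
            continue
        increase = (cur - base) / base
        if increase > threshold:
            result.append((name, increase))
    return sorted(result, key=lambda item: item[1], reverse=True)

# Hypothetical average times in milliseconds.
baseline = {"layout": 2.0, "decode": 4.0, "bind": 1.0}
current = {"layout": 2.1, "decode": 6.0, "bind": 1.3, "prefetch": 3.0}
print([name for name, _ in degraded_functions(baseline, current)])
# ['decode', 'bind']
```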

In addition, engineers mapped out the key function calls in scenarios such as swiping and first refresh, and then intercepted those functions with assembly-level hooks. Within the call cycle of a main function, the execution time of each sub-function is recorded, so that the time of every sub-function and its internal call stack can be collected. So that the system can be applied to a variety of scenarios, the upper layer supports dynamically delivered configuration and export of complete call chains, achieving the monitoring goal while keeping the overall performance cost to a minimum.
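The recording idea, timing each hooked function and attributing the time to its position in the call chain, can be imitated in Python with a decorator. This is only an analogy: the team hooks native functions at the assembly level, whereas the hypothetical `CallTimer` below wraps Python callables.

```python
import time
from functools import wraps
from typing import Callable, Dict, List

class CallTimer:
    """Accumulates elapsed time per call path, e.g. 'parent > child'."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self._stack: List[str] = []
        self.records: Dict[str, float] = {}

    def hook(self, func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            self._stack.append(func.__name__)
            path = " > ".join(self._stack)
            start = self._clock()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed = self._clock() - start
                self.records[path] = self.records.get(path, 0.0) + elapsed
                self._stack.pop()
        return wrapper

timer = CallTimer()

@timer.hook
def bind_cell():          # hypothetical sub-function
    return sum(range(1000))

@timer.hook
def on_scroll():          # hypothetical main function
    return bind_cell()

on_scroll()
print(sorted(timer.records))  # ['on_scroll', 'on_scroll > bind_cell']
```

Because each record is keyed by the full call path, the time of a sub-function is always seen in the context of the main function that triggered it.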

Optimization in practice

After reviewing common optimization strategies, Han Jianlei explained, through specific cases, the methodology for two kinds of fine-grained optimization: frame rate and stutter.

Here are three cases of frame rate optimization.

  • The first accesses self.xxViewController.view during a swipe. xxVC is lazily loaded, so if it has not been initialized before, this access creates the view controller and calls viewDidLoad, which is clearly unreasonable.
  • The second reads NSUserDefaults while dragging. The first read loads all the data in the plist into memory, causing a stall. This is a simple problem, but one that often arises in real development.
  • The third iterates over an array in willDisplayCell, matching by string, which can also cause dropped frames if the array is large (say hundreds or thousands of elements).
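For the third case, one common remedy (an assumption here, since the talk does not spell out the fix) is to build a hash-based index once, outside the per-cell callback, so that each cell needs a single lookup instead of a scan with string comparisons. The names below are hypothetical.

```python
from typing import Iterable, List, Set

def is_highlighted_slow(item_id: str, highlighted: List[str]) -> bool:
    # Scans the whole list: O(n) string comparisons for every displayed cell.
    return any(item_id == candidate for candidate in highlighted)

def build_index(highlighted: Iterable[str]) -> Set[str]:
    # Built once, when the data changes, not inside the display callback.
    return set(highlighted)

def is_highlighted_fast(item_id: str, index: Set[str]) -> bool:
    # Single hash lookup: O(1) on average.
    return item_id in index

highlighted_ids = ["video-%d" % i for i in range(2000)]
index = build_index(highlighted_ids)
print(is_highlighted_slow("video-1999", highlighted_ids),
      is_highlighted_fast("video-1999", index))  # True True
```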

Han Jianlei also used three real stutter cases from Douyin to introduce the team's approach and optimization path for handling stutter in practice.

Once stutter is handled effectively, how can small regressions such as dropped frames be intercepted? How can recurring problems be kept from coming back? With these questions in mind, the team found through a survey that fluency degradation was not taken very seriously, and that R&D engineers were far less motivated to fix existing problems than new ones. The team therefore developed a fine-grained monitoring system that hooks and records common bad cases, such as stutters and time-consuming operations, at sensitive moments or in sensitive scenarios, and feeds them into the anti-degradation platform. As more stutter and frame-rate problems are optimized, the list of bad cases grows, and fine-grained monitoring improves in quality and granularity, forming a virtuous cycle.

The overall process is shown in the figure above. The client injects the monitoring code into the host App as a dynamic library and then runs the automated test tasks. Whenever a monitored scenario is hit, data and stack traces are recorded. At the end of the task, the stacks are symbolicated together and reported to the anti-degradation backend, which finally generates data reports and triggers alarms or intelligent diagnosis.

Douyin’s technical team has also invested further in slow functions, animation, splitting up time-consuming tasks, and degradation strategies for low-end phones to improve fluency for users. In the future, the team will continue to explore UI and animation, architecture, thread control, and more, and keep delivering results on fluency and user experience.

Douyin iOS Stability Optimization and Exploration by Zhu Feng

Zhu Feng, iOS client engineer of Douyin Basic Technology, has been involved in stability optimization and in building the stability assurance system for the Douyin iOS App. Starting from the basic concept of stability, he explained the stability framework and core indicators in detail and looked ahead to the future of stability optimization.

Basic concept

In the narrow sense, a Crash refers to problems at the code level such as language-mechanism errors, CPU access exceptions, and deliberate exits. In the broad sense, Crash also covers being killed by the system for using too much memory (OOM), for blocking the main thread (WatchDog), or for excessive CPU usage. All of these fall within the scope of stability.

Stability framework

Any discussion of a stability framework has to start with startup tasks. As for when to initialize the APM SDK, most code should execute after the monitoring SDK, which includes Crash, WatchDog, and OOM monitoring, has started; this bears directly on the stability framework.

For governing pre-main code, Zhu Feng introduced deferring work by means of custom sections. This can replace the traditional +load method, but the number of sections must not grow without bound, or it may exceed a dyld limit and crash at startup.

Logs are essential for troubleshooting stability problems. Hard problems often come with an unclear stack, in which case it is very important to reconstruct the Crash context from the logs. Logs should be written through mmap so that they are not lost. In addition, the log analysis tool developed by the Douyin technical team significantly improves developers' analysis efficiency.
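The reason mmap helps is that bytes written into a shared file mapping live in the kernel's page cache, so they still reach the file if the process dies before it gets a chance to flush. A minimal sketch of the idea, using Python's standard `mmap` module, is shown below; the `MmapLog` class, its fixed capacity, and its "return False when full" behavior are illustrative assumptions, not the team's logger.

```python
import mmap

class MmapLog:
    """Append-only log backed by a fixed-size memory-mapped file."""

    def __init__(self, path: str, capacity: int = 4096) -> None:
        self._file = open(path, "w+b")
        self._file.truncate(capacity)       # reserve the file region up front
        self._map = mmap.mmap(self._file.fileno(), capacity)
        self._offset = 0

    def write(self, line: str) -> bool:
        data = line.encode("utf-8") + b"\n"
        if self._offset + len(data) > len(self._map):
            return False                    # full; a real logger would rotate
        self._map[self._offset:self._offset + len(data)] = data
        self._offset += len(data)
        return True

    def close(self) -> None:
        self._map.flush()
        self._map.close()
        self._file.close()
```

A caller would create `MmapLog("app.log")`, call `write()` for each line, and `close()` on a clean shutdown; the unused tail of the file stays zero-filled.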

Detailed explanation of core indicators

Zhu Feng explained the mechanisms behind, and the strategies for dealing with, common problems such as Objective-C exceptions, multi-threading Crashes, Crashes during process termination, Crashes whose call stack lies entirely in system libraries, and Crashes caused by the compiler optimization level.

Taking Crashes whose call stack lies entirely in system libraries as an example, the usual approach is to analyze the context from the logs, reverse-engineer the system library code, bypass the problematic code with Swizzle or fishhook, and analyze with a core dump. If the problem can be reproduced locally, Malloc Stack Logging in Xcode can be used to find the call stack that allocated the address.

Beyond troubleshooting hard problems, dealing with Crashes also relies on long-term online and offline mechanisms. Offline, these include ASan automated tests, monkey tests in the grayscale phase, and automated startup-crash tests in the integration phase. Online, they include the safety cushion, safe mode, and core dumps.

Common causes of WatchDog kills include file I/O, network I/O, CPU-intensive work, and locks shared between the main thread and background threads. Solutions usually involve moving work to background threads, changing business logic to a callback style, and reducing lock granularity.

OOM is one of the most serious challenges Douyin’s technical team faces in stability optimization. For Douyin, there were more than twice as many OOM problems as Crashes. Ever-growing memory usage, optimizations and regressions that are hard to attribute, and the large number of low-end devices make OOM hard to avoid.

For online OOM, the technical team mainly relies on its self-developed MemoryGraph mechanism, Matrix Memory Stat, and large-image monitoring. Offline, problems are caught with MLeaksFinder, automated tests using Xcode Leaks, and automated tests for missing AutoreleasePool usage.

Zhu Feng explained each of these solutions and tools.

Looking ahead, stability optimization for the Douyin iOS App will explore further in frameworks, processes, static and dynamic analysis, and other areas, to safeguard the building of a super-large App. Finally, Zhu Feng encouraged everyone to stay curious about the underlying technology, keep exploring, and never set limits on their own growth.

After each talk, the online audience interacted with the speaker through the comment section and bullet chat. Drawing on their areas of expertise and practical experience, the five speakers answered questions patiently and in detail.

With that, the third Bytedance Technology Salon came to a successful conclusion.

How can I get the slides and the replay video?

Follow the public account “Bytedance Technical Team” and reply with the keyword “Salon review” to get the download link for the five speakers' slides and the replay video.

Bytedance Technology Salon is a technical exchange event initiated by ByteTech, Bytedance's technology community, for developers across the industry. By building an inclusive, open, and free platform for exchange, it promotes the adoption of cutting-edge technologies and helps technical teams and developers grow quickly. Talks at the salon come from technical experts at Bytedance and other leading Internet companies. Focusing on hot technical directions and lessons from practice, the salon presents a series of technical feasts for technical teams and developers to draw on.

What topics would you like to hear shared at future salons? Which technologists would you like to see share their hands-on experience? Let us know in the comments below. The fourth Bytedance Technology Salon is scheduled for March, so let’s get together in spring!