With its "Reinvent Data Science" vision, Zilliz focuses on open-source data science software that leverages a new generation of heterogeneous computing. As our projects flourish, we place ever higher demands on continuous integration, continuous delivery, and continuous deployment (CI/CD). This article, the first in our CI/CD series, focuses on compilation optimization practices for continuous integration.
| Problems and Challenges
During the build process, we encountered the following problems:
1) Long compilation time
The project completes hundreds of code integrations every day. With hundreds of thousands of lines of code, even a small feature change by a developer can trigger a full rebuild of the project, which takes an hour or more. This is clearly unacceptable.
2) Complex compilation environment
The project code is compiled on different operating systems (CentOS, Ubuntu, etc.), against different underlying dependency libraries (GCC, LLVM, CUDA, etc.), and on different hardware architectures, and the artifacts produced in one compilation environment may not be usable on other platforms.
3) Complex project dependencies
At present, project compilation involves 30 to 40 dependencies across functional components and third-party libraries. As the project evolves, its dependencies change, and dependency conflicts are inevitable. Version relationships between dependencies are complex, and upgrading one dependency can easily break other components.
4) Third-party dependencies download slowly or fail to download
Network latency and unstable third-party repositories cause slow downloads or outright download failures, which seriously affect code integration builds.
| Overall Approach
Decouple project dependencies. Components with complex dependencies are split into separate repositories for version management, and each component's version information, compilation options, and dependencies are recorded in a configuration file. The configuration file is checked into the component repository, versioned with it, and updated as the project iterates.
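As a sketch, such a component configuration file might look like the following (the field names, component names, and versions are hypothetical illustrations, not the project's actual schema):

```yaml
# Hypothetical component configuration, versioned alongside the code.
name: component-a
version:
  branch: main
  commit: 9f3c2ab            # Git commit ID pinned for this iteration
compile_options:
  build_type: Release
  flags: ["-O3", "-fPIC"]
dependencies:
  - name: component-b
    repo: internal/component-b
    tag: v1.6.3
  - name: component-c
    repo: internal/component-c
    tag: v0.17.0
```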
Implement compilation optimization between components. Based on the dependencies, compilation options, and other information recorded in the configuration files, the relevant component code is pulled and compiled. After compilation, the generated binary artifacts and a manifest listing them are uniformly tagged, packaged, and uploaded to a private repository for centralized storage. When a component and the components it depends on have not changed, the archived manifest is used to play back the compiled artifacts, providing a compilation cache effect. Network latency and unstable third-party repositories can be addressed by hosting private repositories internally or by using multiple mirror repositories.
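The playback step can be sketched as follows. This is a minimal illustration that assumes a JSON manifest listing relative artifact paths; the article does not specify the real manifest format.

```python
import json
import shutil
from pathlib import Path

def play_back(archive_dir: str, manifest_name: str, workspace: str) -> list:
    """Copy previously compiled artifacts listed in the archive manifest
    into the build workspace instead of recompiling them."""
    archive = Path(archive_dir)
    manifest = json.loads((archive / manifest_name).read_text())
    restored = []
    for rel_path in manifest["artifacts"]:
        src = archive / rel_path
        dst = Path(workspace) / rel_path
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # replay the cached artifact into the workspace
        restored.append(str(dst))
    return restored
```

Because the manifest travels with the archive, a later build only needs the component's tag to restore everything the previous build produced.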
Implement compilation optimization within components. Choose a compiler cache tool suited to the language to cache the intermediate artifacts produced during compilation, then package and upload them to the private repository for centralized storage. For C/C++, for example, a tool such as ccache can cache compilation intermediates, and the local ccache directory can be archived after the build. Such tools recompile only the source files that changed and reuse the cached artifacts for unchanged files, which then feed directly into the final link.
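The idea behind such tools can be illustrated with a toy cache keyed on source-file content. This is a conceptual sketch only; ccache itself keys on the preprocessed output, compiler version, and flags, not just the raw file.

```python
import hashlib
from pathlib import Path

def cached_compile(source: Path, cache_dir: Path, compile_fn) -> bytes:
    """Return the object code for `source`, invoking the compiler only on
    a cache miss. `compile_fn` stands in for the real compiler call."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(source.read_bytes()).hexdigest()
    cached = cache_dir / key
    if cached.exists():          # unchanged file: reuse the cached artifact
        return cached.read_bytes()
    obj = compile_fn(source)     # changed file: recompile and store
    cached.write_bytes(obj)
    return obj
```

Archiving `cache_dir` after a build and restoring it before the next one gives the cross-build cache effect described above.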
Ensure compilation environment consistency. Because compiled artifacts are sensitive to the system environment, unknown errors may occur across different operating systems, underlying dependency libraries, and so on, so the artifact cache must be tagged and archived per environment. Our environments are too diverse to classify along a few dimensions, so we introduced containerization to unify the compilation environment and solve these problems.
| Key Implementation Points
- Decoupling project dependencies
There is no universal recipe for decoupling project dependencies. Internal dependencies are usually considered in terms of business requirements, technology stack, deployment style, and so on. External dependencies are determined by how third-party libraries relate to internal components: third-party libraries that are strongly coupled to a component's compilation mode or options are compiled together with that component's business code, while third-party libraries that can be shared across the project are collected into a single independent repository and compiled centrally.
- Inter-component compilation optimization
Inter-component compilation optimization is divided into the following tasks:
1. A developer submits modified component code, triggering the project's code integration. The configuration file in the component repository is read, the version information (Git branch or tag, Git commit ID) and compilation options of upstream and downstream dependencies are collected, and a dependency graph is constructed.
2. Dependency checking. Raise alarms for cyclic dependencies, version conflicts, and similar problems between components.
3. Flatten the dependencies. The dependency graph is ordered by depth-first search (DFS), and components that appear as duplicate dependencies are merged in advance.
4. Generate a hash of each component's version information and compilation options, then use a Merkle tree to derive a root hash that covers the component's dependencies. This hash, combined with the component name, forms the component's unique label.
5. Use the component's unique label to check whether the private repository already holds a compiled archive for it. On a hit, the archive is downloaded and unpacked, and its manifest file is used to play back the compiled artifacts. On a miss, the component is compiled, and the resulting artifacts and manifest are tagged, archived, and uploaded to the private repository.
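Steps 3–5 can be sketched as follows. Each component record is assumed to carry its name, pinned commit, compile options, and dependency names; the exact hash composition is illustrative, not the project's actual scheme.

```python
import hashlib

def flatten(name: str, components: dict, seen=None) -> list:
    """DFS over the dependency graph, merging repeated dependencies (step 3)."""
    if seen is None:
        seen = []
    for dep in components[name]["deps"]:
        flatten(dep, components, seen)
    if name not in seen:
        seen.append(name)
    return seen

def leaf_hash(component: dict) -> str:
    """Hash a component's own version info and compile options (step 4)."""
    payload = "|".join([component["name"], component["commit"],
                        " ".join(component["options"])])
    return hashlib.sha256(payload.encode()).hexdigest()

def root_hash(name: str, components: dict) -> str:
    """Merkle-style root hash covering a component and all its dependencies:
    any change in any dependency changes the root."""
    comp = components[name]
    children = sorted(root_hash(dep, components) for dep in comp["deps"])
    payload = leaf_hash(comp) + "".join(children)
    return hashlib.sha256(payload.encode()).hexdigest()

def unique_label(name: str, components: dict) -> str:
    """Component name + root hash forms the cache lookup key (step 5)."""
    return f"{name}-{root_hash(name, components)[:12]}"
```

Because the root hash folds in every dependency's hash, an unchanged component with unchanged dependencies always reproduces the same label and therefore hits the archived build.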
- Intra-component compilation optimization
Intra-component compilation optimization is divided into the following tasks:
1. Add the system environment dependencies needed to compile the component code to a Dockerfile. The Dockerfile is linted with the Hadolint tool to ensure the image follows Docker best practices.
2. Build the environment image, tagged with the iteration version (project version + build version), operating system, and other version information.
3. Start the build container from that image and pass the image ID into the container as an environment variable, for example: `docker inspect --type=image --format '{{.ID}}' repository/build-env:v0.1-centos7`
4. Choose a compiler cache tool appropriate to the technology stack. Inside the container, the code is integrated and compiled, and the private repository is checked for an existing compiler cache keyed on the image ID. On a hit, the cache is downloaded and unpacked to the designated directory. After all components in this environment are compiled, the cache produced by the tool is packaged, tagged with the project iteration version, image ID, and other information, and uploaded to the private repository.
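The cache lookup in step 4 can be sketched as below. The key scheme and the local-directory stand-in for the private repository are illustrative assumptions.

```python
from pathlib import Path
from typing import Optional

def cache_key(project_version: str, build_version: str, image_id: str) -> str:
    """Compose the cache tag from the iteration version and the image ID
    that was passed into the container in step 3."""
    short_id = image_id.removeprefix("sha256:")[:12]
    return f"{project_version}-{build_version}-{short_id}"

def fetch_cache(repo_dir: Path, key: str) -> Optional[Path]:
    """Look up the compiler-cache archive for this exact environment;
    a miss means this environment compiles from scratch."""
    archive = repo_dir / f"ccache-{key}.tar.gz"
    return archive if archive.exists() else None
```

Keying on the image ID guarantees that a cache produced in one environment is never replayed in a different one.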
- Further optimization of the build scheme
At first, the images we built were too bulky, which increased disk and network overhead and made deployment ever slower. Here are some suggestions:
1. Choose the smallest suitable base image, such as Alpine or BusyBox, to reduce image size.
2. Reduce the number of image layers. Make environment dependencies as reusable as possible, and merge instructions by chaining commands with "&&".
3. Clean up intermediate files produced during the image build.
4. Make full use of the image build cache.
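A Dockerfile following these suggestions might look like this (the package list and file names are illustrative, not the project's actual build dependencies):

```dockerfile
# Suggestion 1: minimal base image
FROM alpine:3.18

# Suggestions 2 & 3: one merged RUN layer, with intermediates cleaned
# up inside the same layer so they never reach the final image.
RUN apk add --no-cache gcc g++ make cmake ccache \
    && rm -rf /tmp/*

# Suggestion 4: copy the dependency manifest first, so this layer (and
# everything above it) stays cached until the manifest itself changes.
COPY deps.txt /build/deps.txt
COPY . /build
WORKDIR /build
```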
After running this scheme for a while, the growing compilation cache increased the disk and network costs of the private repository, and some cached entries saw little use. Here are some suggestions:
1. Periodically clean up cache files. Inspect the private repository regularly (with scripts, for example) and remove cache files that have not been updated for some time and have low download counts.
2. Cache selectively. Code that is cheap to compile does not need a compilation cache.
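The first suggestion can be sketched as a periodic cleanup script. The age threshold and directory layout are assumptions; a real setup would also consult download counts via the repository's API.

```python
import time
from pathlib import Path

def clean_stale_caches(repo_dir: Path, max_age_days: float) -> list:
    """Remove cache archives not modified within the retention window,
    returning the paths that were deleted."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for archive in repo_dir.glob("*.tar.gz"):
        if archive.stat().st_mtime < cutoff:  # stale: reclaim the space
            archive.unlink()
            removed.append(str(archive))
    return removed
```

Run from a scheduled CI job, this keeps the private repository's disk footprint bounded while leaving recently used caches in place.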
Installing and using Docker and building a private repository are beyond the scope of this article; interested readers can explore them on their own.
| Summary and Outlook
This article analyzes our own project's dependencies, describes inter-component and intra-component compilation optimization in detail, and offers ideas and best practices for building a stable and efficient continuous integration system. The approach resolves the slow project iteration caused by complex dependencies, runs everything inside containers to ensure environment consistency, and improves compilation efficiency through artifact playback and compiler cache tools.
This practice already supports continuous integration for Milvus and other products. With the compilation optimizations described here, project compilation time has dropped by 60% on average, greatly improving build efficiency. As we continue to empower the data science field, we will explore parallelizing compilation both between and within components.
| Welcome to the Milvus Community
- Source code: github.com/milvus-io/m…
- Official website: milvus.io
- Bilibili: space.bilibili.com/478166626