1. Background

Large C++ projects all face the problem of time-consuming compilation. Whether during development and debugging iterations, acceptance testing, or the continuous integration stage, compilation happens constantly, so reducing compile time is very important for improving development efficiency.

The Meituan Search and NLP department provides the company's basic search platform services. For performance reasons, the underlying basic services are implemented in C++. The DeepQueryUnderstanding (DQU) service we are responsible for also suffered from long build times: before optimization, compiling the entire service took about 20 minutes (a parallel build on a 32-core machine), which hurt the team's development iteration efficiency. Against this background, we carried out a dedicated optimization of the DQU build. Along the way we accumulated some optimization knowledge and experience, which we share here.

2. Compilation Principles and Analysis

2.1 Compilation Principles

To better understand the compilation optimization schemes, let's first briefly review the compilation process. In typical C++ development, compilation consists of the following four steps:

Preprocessor: macro definition replacement, header expansion, conditional compilation expansion, comment removal.

  • The gcc -E option yields the preprocessed result, with a .i or .ii extension.
  • C/C++ preprocessing performs no syntax checking: not only because it lacks that capability, but because preprocessing directives are not C/C++ statements (which is why macros are not terminated with semicolons). Syntax checking is the compiler's job.
  • After preprocessing, what you get is the actual source code to be compiled.

Compiler: generates assembly code, translating the high-level language into an assembly program in which each statement describes exactly one low-level machine instruction in a standard text format.

  • The gcc -S option yields the compiled assembly code file, with a .s extension.
  • Assembly language provides a common output language for different compilers of different high-level languages.

Assembler: Generates object files.

  • The gcc -c option yields the assembled result file, with a .o extension.
  • .o files are binary-encoded object files.

Linker: Generates executable or library files.

  • Static library: at link time, all the library code is copied into the executable, producing a larger file that no longer needs the library at runtime. The suffix is usually .a.
  • Dynamic library: the library code is not copied into the executable at link time; instead the library is loaded by the runtime linker when the program executes, so executables are smaller. Dynamic libraries are usually named with the suffix .so.
  • Executable: links all object files, whether standalone objects or libraries, into one executable file.

2.2 C++ compilation features

(1) Compile each source file independently

The C/C++ compilation system is very different from other high-level languages, where the compilation unit is the entire Module, that is, all the source code under the Module is executed in the same compilation task. In C/C++, the unit of compilation is a file. Each.c/.cc/.cxx/.cpp source file is an independent compilation unit. As a result, compilation optimization can only be performed based on the contents of this file, and it is difficult to provide code optimization across compilation units.

(2) Each compilation unit needs to parse all contained header files independently

If N source files include the same header file, that header needs to be parsed N times (a nightmare when it is a Thrift or Boost header that runs to thousands of lines).

If a header file contains a template (STL/Boost), the template is instantiated once in every .cpp file that uses it: std::vector<int> used in N source files is instantiated N times.

(3) Template function instantiation

The C++98 language standard requires the compiler to instantiate every template instance that appears in the source code, and at link time the linker must remove the duplicate instantiation code. Clearly, the compiler repeats the instantiation and compilation work every time it encounters a template definition; avoiding such repeated instantiation makes the compiler much more efficient. External templates, a new language feature in the C++0x (C++11) standard, solve exactly this problem.

C++98 already had a language feature called explicit instantiation, whose purpose is to instruct the compiler to instantiate a template immediately (that is, forced instantiation). The external template syntax builds on the explicit instantiation syntax: adding extern before an explicit instantiation directive yields an external template declaration.

```cpp
template class vector<int>;         // explicit instantiation
extern template class vector<int>;  // declaration of an external template
```

Once an external template declaration is used in a compilation unit, the compiler skips template instantiations that match the external template declaration when compiling the compilation unit.

(4) Virtual function

The compiler handles virtual functions by adding to each object a hidden pointer to a virtual function table (vtable), which stores the addresses of the virtual functions of that class (including those inherited from base classes). If a derived class overrides a virtual function with a new definition, the vtable stores the address of the new function; if it does not redefine it, the vtable keeps the address of the original version; if the derived class defines a new virtual function, that function's address is appended to the table.

When a virtual function is called, the program follows the vtable pointer stored in the object to the corresponding vtable, looks up the address of the function declared in the class, and executes the function at that address.

Costs of using virtual functions:

① Each object gains one extra pointer of storage (4 bytes on 32-bit systems, 8 bytes on 64-bit systems). ② The compiler creates a virtual function table for each class. ③ Every virtual call adds a table lookup before the function executes.

(5) Compilation optimization

GCC provides nearly a hundred optimization options to meet users' varying optimization needs, trading off among compilation time, object file size, and execution efficiency. The optimization techniques vary, but generally fall into the following categories:

① Simplify operation instructions. ② Fit the CPU's pipelined execution as much as possible. ③ Reorder code execution by predicting program behavior. ④ Make full use of registers. ⑤ Inline simple calls, and so on.

Targeting code with these individual options would be complex work; fortunately GCC offers the optimization levels O0 through O3 plus Os to choose from. These levels bundle most of the effective optimization options, and individual options can be disabled or added on top of them, greatly reducing the difficulty of use.

  • O0: No optimizations are made. This is the default compilation option.
  • O and O1: Partial compilation optimizations are performed on a program. The compiler tries to reduce the size of the generated code and shorten the execution time, but does not perform optimizations that require significant compilation time.
  • O2: a more advanced option than O1, enabling more optimizations. GCC performs almost all supported optimizations that do not involve a space-for-speed trade-off. With O2 set, the compiler still does not perform loop unrolling or function inlining. Compared with O1, O2 improves the execution efficiency of the generated code at the cost of longer compilation time.
  • O3: More optimizations over O2, such as the use of pseudo-register networks, inlining of ordinary functions, and more optimizations for loops.
  • Os: optimizes mainly for code size. Note that optimizations in general disrupt the program's structure, making debugging hard to even start, and change the execution order; code that depends on the order of memory operations needs special handling to remain correct.

Compilation optimizations can cause problems:

① Debugging problems: as mentioned above, any optimization level changes the code structure. For example, merging and eliminating branches, eliminating common subexpressions, and hoisting or replacing load/store operations inside loops change the object code's execution order beyond recognition, leaving the debugging information seriously lacking.

② Memory operation reordering: with O2 and above, the compiler affects the execution order of memory operations. For example, -fschedule-insns allows other instructions to complete before data processing finishes, and -fforce-mem may cause inconsistency between the copies of data in memory and in registers. Logic that depends on the order of memory operations must be handled rigorously before it can be optimized, for example by using the volatile keyword to restrict how a variable may be accessed, or by using barriers to force the CPU to execute instructions strictly in order.

(6) C/C++ optimization across compilation units can only be left to the linker

When linking, the linker first determines where each object file will be placed in the final executable. It then visits each object file's address relocation table and relocates the recorded addresses (adding the offset of that compilation unit's start address within the executable). Next it walks every object file's unresolved symbol table, searches all exported symbol tables for matching symbols, and fills the resolved addresses into the positions recorded in the unresolved symbol table. Finally it writes the contents of all object files to their respective positions to produce the executable. The details of linking are complicated; the link stage runs as a single process and cannot be parallelized, which makes linking large projects extremely slow.

3. Analysis of Service Problems

DQU is the query understanding platform used by Meituan Search. It contains a large number of models and word lists; structurally, it comprises more than 20 Thrift files, uses a large number of Boost processing functions, and pulls in the SF framework, an SDK, and three word-segmentation submodules. Modules are compiled as dynamic libraries and loaded at runtime, and data is passed between modules through a message bus whose message is one big Event. This Event class therefore contains the data type definitions needed by every module, so every module includes the Event header; this unreasonable dependency meant that a change to that file forced nearly all modules to recompile.

Every service's compilation problems have their own characteristics, but the essence of the problem is similar. Combining the compilation process and principles above, we analyzed the DQU service's compilation problems from three angles: precompilation expansion, header file dependencies, and per-stage compilation time.

3.1 Compilation Analysis

The C++ precompilation stage produces .ii files, and checking the size of these expanded files shows how large each compilation unit really is. You can keep the intermediate compilation files by adding the -save-temps option in CMake:

```cmake
set(CMAKE_CXX_FLAGS "-std=c++11 ${CMAKE_CXX_FLAGS} -ggdb -Og -fPIC -w -Wl,--export-dynamic -Wno-deprecated -fpermissive -save-temps")
```

The most direct cause of long compile time is that the expanded files are large. Analyzing the size and content of the expanded files showed that some ran to more than 400,000 lines after preprocessing, with large numbers of Boost library references and header file references making the files big and compilation slow. In this way, commonalities in the compile times of various files can be found. Below is a screenshot of the file sizes after expansion.

3.2 Dependency analysis of header files

Header file dependency analysis examines whether the code is reasonable in terms of how many headers it references. We implemented a script that counts header file dependencies and outputs each header's reference count, which helps judge whether the header dependencies are reasonable.

(1) Total header file reference result statistics

Use the tool to count the total number of header files that are directly and indirectly dependent on the compiled source file, which can be used to analyze the number of imported header files.

(2) Single header file dependency statistics

You can use the tool to analyze the dependency relationship of header files and generate the dependency topology diagram to intuitively see the unreasonable dependence.

The figure contains the hierarchy of references and the number of reference header files.

3.3 Per-stage Compilation Time Statistics

Per-stage time statistics give, for each file, the total compile time and how that time is distributed across the compilation stages, which is an intuitive result. Normally a file's compile time is positively correlated with its size and with the number of header files it references. By setting the following CMake properties, the time spent in the compile and link stages can be printed, and this data makes the time consumption easy to analyze:

```cmake
set_property(GLOBAL PROPERTY RULE_LAUNCH_COMPILE "${CMAKE_COMMAND} -E time")
set_property(GLOBAL PROPERTY RULE_LAUNCH_LINK "${CMAKE_COMMAND} -E time")
```

Compile time result output:

3.4 Analysis tool construction

Analysis with the tools above yields several kinds of compilation data:

① Dependencies and number of header files. ② The size and content of precompilation expansion. ③ Compilation time of each file. ④ Overall link time. ⑤ Compilation parallelism can be calculated.

With these data as input, we can build an automated analysis tool that finds optimization points and presents them in an interface. To this end we built an automated, full-process analysis tool that automatically identifies common time-consuming problems and the Top N most time-consuming files. The tool's processing flow is shown below:

(1) Overall statistical analysis effect

Specific field description:

① cost_time: compilation time, in seconds. ② file_compile_size: size of the compiled intermediate file, in MB. ③ file_name: file name. ④ include_h_nums: number of imported header files. ⑤ top_h_files_info: the Top N most frequently imported header files.

(2) Top10 compilation time file statistics

Used to show the TopN file that takes the most time to compile statistics. N can be customized.

(3) Top10 compilation of intermediate file size statistics

By counting and displaying the sizes of the compiled intermediate files, we can judge whether they meet expectations; these sizes correspond one-to-one with the compilation times.

(4) Header file statistics of the most files introduced in Top10

(5) Top10 header file repetition statistics

At present these tools support one-click generation of the compile-time analysis results. Several of the small tools, such as the dependency file-count tool, have already been integrated into the company's online integration testing process, so that the impact of code changes on compile time is checked by an automated tool. The tools are being iteratively improved, and they will subsequently be integrated into the company's MCD platform so that compile-time problems can be located and solved automatically for other departments as well.

4. Optimization Plan and Practice

Using the tools above, we can find what the Top 10 most time-consuming files have in common; for example, they all depend on the message-bus file platform_query_analysis_event.h, which directly and indirectly pulls in more than 2,000 header files. We focused our optimization on such files, identified common problems such as Boost usage, template class expansion, and Thrift header expansion, and optimized specifically for them. In addition, we applied some general compilation acceleration schemes used in the industry and achieved good results. The optimization schemes we used are detailed below.

4.1 General compilation acceleration schemes

There are a number of general-purpose compile-acceleration tools out there that can speed up compilation without invading code and are well worth trying out.

(1) Parallel compilation

On Linux, compilation is generally driven by GNU Make, and the -j parameter can be added to the make command to raise the parallelism; for example, make -j 4 starts four tasks. In practice we do not hard-code this parameter; instead we use $(nproc) to obtain the number of CPU cores of the build machine dynamically as the compilation concurrency, squeezing the most out of multiple cores.

(2) Distributed compilation

Distributed compilation uses tools such as distcc and dmucs to build a large-scale distributed C++ compilation environment. On Linux, distributed compilation over a network cluster must take network latency and stability into account. It suits very large projects, such as those that take hours or even days to build on a single machine. Given DQU's code size and single-machine build time, distributed acceleration was unnecessary; for details, refer to the distcc documentation.

(3) Precompiled header files

A precompiled header (PCH) saves the compiled result of common header files in advance, so that the compiler can reuse the precompiled result when it encounters the corresponding #include, speeding up compilation. PCH is a very common industry technique for accelerating builds, with generally excellent feedback. In our project, however, it did not achieve the desired effect, because we build many shared libraries and shared libraries cannot share a PCH with one another.
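For reference, CMake 3.16 and later expose PCH directly through target_precompile_headers; a minimal illustrative fragment (the target name dqu and the header list are assumptions, not our real build files):

```cmake
# Illustrative: precompile heavy, rarely changing headers for target `dqu`.
target_precompile_headers(dqu PRIVATE
  <vector>
  <map>
  <boost/algorithm/string.hpp>
)
```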

(4) CCache

CCache (Compiler Cache) is a compilation caching tool. Its principle is to cache compilation results keyed by the preprocessed source, so that a subsequent compilation whose inputs have not changed fetches the result directly from the cache. Make itself already skips recompiling a target whose source timestamps (and dependencies) are unchanged; CCache caches at the file level, and its cache can be shared by multiple projects on the same machine, which makes it more broadly applicable.

(5) Module compilation

If your project uses C++20, module compilation is another way to speed up builds. Before C++20, each .cpp was treated as an independent compilation unit, and included headers were parsed and compiled repeatedly. Modules eliminate the header/source split (one module file instead of separate declaration and implementation files): the compiler processes the module file (.ixx or .cppm) directly and automatically generates a binary interface file. Unlike include preprocessing, a module that has already been compiled is not compiled again the next time it is imported, which can greatly improve compilation efficiency.

(6) Automatic dependency analysis

Google has open-sourced Include-What-You-Use (IWYU), a Clang-based tool that checks C/C++ projects for redundant header files; it also provides scripts that fix header dependency problems automatically. We set the tool up and tried it, but because our code dependencies are complex (dynamic libraries, static libraries, multiple repositories), the fixes it proposes could not be applied directly. Teams with simpler code structures may find it useful for analysis and optimization. It generates result files like the following, indicating which header files should be removed:

```
>>> Fixing #includes in '/opt/meituan/zhoulei/query_analysis/src/common/qa/record/brand_record.h'
@@ -1,9 +1,10 @@
 #ifndef _MTINTENTION_DATA_BRAND_RECORD_H_
 #define _MTINTENTION_DATA_BRAND_RECORD_H_
-#include "qa/data/record.h"
-#include "qa/data/template_map.hpp"
-#include "qa/data/template_vector.hpp"
-#include <boost/serialization/version.hpp>
+#include <boost/serialization/version.hpp>  // for BOOST_CLASS_VERSION
+#include <string>                           // for string
+#include <vector>                           // for vector
+
+#include "qa/data/file_buffer.h"            // for REG_TEMPLATE_FILE_HANDLER
```

4.2 Code optimization scheme and practice

(1) Forward declarations

By analyzing the header reference statistics, we found the bus type Event to be the most referenced in the project; it in turn holds the members required by the various businesses, as in this example:

```cpp
#include "a.h"
#include "b.h"
// ...
class Event {
  // business A, B, C ...
  A1 a1;
  A2 a2;
  // ...
  B1 b1;
  B2 b2;
  // ...
};
```

As a result, Event pulls in a huge number of header files; once expanded, the file reaches 15 MB. And every kind of business needs to use Event, which naturally slows compilation dramatically.

We solved this problem with forward declarations: instead of including the header of each member type, we only forward-declare the type and store pointers to the corresponding types inside Event, as shown below:

```cpp
class A1;
class A2;
// ...
class Event {
  // business A, B, C ...
  shared_ptr<A1> a1;
  shared_ptr<A2> a2;
  // ...
  shared_ptr<B1> b1;
  shared_ptr<B2> b2;
  // ...
};
```

The corresponding header file needs to be included only where the member is actually used; this truly brings in header files on demand.

(2) External templates

Because a template is instantiated only where it is used, the same instantiation can appear in multiple object files: the compiler instantiates the template in every compilation unit, and the linker removes the duplicate instantiation code. In projects that use templates extensively, the compiler generates a great deal of redundant code, which greatly increases compile and link time. This can be avoided with the external templates introduced in C++11.

```cpp
// util.h
template <typename T>
void max(T) { /* ... */ }
```

```cpp
// A.cpp
#include "util.h"
template void max<int>(int);  // explicit instantiation
void test1() { max(1); }
```

Compiling A.cpp instantiates the max<int>(int) version of the function.

```cpp
// B.cpp
#include "util.h"
extern template void max<int>(int);  // declaration of the external template
void test2()
{
    max(2);
}
```

When compiling B.cpp, the max<int>(int) instantiation code is no longer generated, which saves the instantiation, compilation, and link time mentioned earlier.

(3) Replacing templates with polymorphism

Our project makes heavy use of dictionary-related operations such as loading, parsing, and matching dictionaries (all kinds of fancy matching), all of which support different dictionary types through templates. According to our statistics, there are more than 150 dictionary types, which bloats the code generated by template expansion.

```cpp
template <class R>
class Dict {
public:
  // Match key and condition, assign value to record
  bool match(const string &key, const string &condition, R &record);  // expanded once per Record type
private:
  map<string, R> dict;
};
```

Fortunately, most of the operations in our dictionary can be abstracted out of several classes of interfaces, so we can only implement operations on base classes:

```cpp
class Record {  // base class
public:
  virtual bool match(const string &condition);  // implemented by derived classes
};

class Dict {
public:
  shared_ptr<Record> match(const string &key, const string &condition);  // the user passes in a derived-class pointer
private:
  map<string, shared_ptr<Record>> dict;
};
```

Through inheritance and polymorphism, we effectively avoided a great deal of template expansion. Note that using pointers as the Value of a map increases the pressure on memory allocation; it is recommended to use tcmalloc or jemalloc instead of the default ptmalloc to optimize memory allocation.

(4) Replace Boost libraries

Boost is a widely used base library covering a great deal of commonly used functionality; it is very convenient and useful, but has some drawbacks. A significant one is its hpp style of implementation, with declarations and implementations both placed in header files, which makes the precompiled expansion very large.

```cpp
// String manipulation is a common need; merely including this header
// expands the file by more than 4 MB
#include <boost/algorithm/string.hpp>
// By contrast, including several STL headers expands to only about 1 MB
#include <vector>
#include <map>
// ...
```

Our project used no more than 20 Boost functions in total. Some could be replaced with the STL and others we reimplemented by hand, which moved the project from heavy reliance on Boost to mostly Boost-free and greatly reduced the compilation burden.

(5) Precompiling stable code into dynamic libraries

Some code in the project changes relatively rarely but still has a certain impact on compile time, such as the Thrift-generated files, model library files, and the files under the common directory. We precompile these into dynamic libraries, which reduces the compile time of the files that depend on them and also resolves part of the compilation dependency.

(6) Solve compilation dependency and improve compilation parallelism

In our project there are a large number of module-level dynamic library files to compile, and the compilation dependencies specified in the CMake files limit compilation parallelism to some extent.

For example, in the following scenario, compilation parallelism can be improved by properly setting library file dependencies.

4.3 Optimization Effect

We compared the compile times before and after optimization on machines with 32 cores and 64 GB of memory; the statistics are as follows:

4.4 Preserving the optimization results

Compilation optimization is a matter of sailing against the current. Developers tend to keep adding new features, new libraries, and even new frameworks, while removing old code, libraries, and frameworks is often difficult (as front-line developers know well). Therefore, how to hold on to the earlier optimization gains is also crucial. In practice we learned the following:

  • Code review is difficult (changes that increase compile time are usually not visible just by reading the code).
  • It’s the tools and processes that are worth relying on.
  • The key is to control the increment.

We found that the compile time of a .cpp file is (in most cases) positively correlated with the size of its precompiled expansion file (.ii). For each released version, we record the precompiled expansion size of every .cpp file to form the version's Compile Fingerprint (CF). By comparing the CFs of two adjacent versions, we can determine fairly accurately which changes are mainly responsible for any increase in compile time in the new version, and further analyze whether the increase is reasonable and whether there is room for optimization.

We turned this approach into a script tool and introduced it into the release process, so that we can clearly see the impact of each release on compilation performance; it effectively helps us preserve the earlier optimization results.

5. Summary

The DQU project is an important link in Meituan's search business. The system connects to more than 20 RPC services, loads dozens of models and over 300 dictionaries, uses tens of gigabytes of memory, and, as a large C++ service, handles more than 2 billion requests a day. Under rapid business iteration, the lengthy compile time troubled developers considerably and constrained development efficiency to a certain extent. In the end, by building compilation analysis tools and combining general compilation acceleration schemes with code-level optimizations, we shortened DQU's compile time by 70%; and with CCache and other means, local development builds can complete within 100 seconds, saving the development team a great deal of time.

After achieving these initial results, we summarized the whole problem-solving process and distilled some analysis methods, tools, and process norms. In subsequent development iterations, these tools quickly and effectively detect compile-time changes caused by new code, and they have become part of our release process checks. This differs greatly from our previous one-off, targeted compilation optimizations: the maintenance of code is a lasting process, and solving this problem systematically requires not only effective methods and convenient tools but also a standardized release process to preserve the results. We hope this article is helpful to you.


Author’s brief introduction

The authors of this article are Zhou Lei, Shi Han, Zhu Chao, Wang Xin, Liu Liang, Chang Shu, Li Chao, Yun Sen and Yong Chao, all from the AI platform search and NLP department of Meituan.

To read more technical articles, follow the Meituan technical team (meituantech) official WeChat account.

In the official account, reply with keywords such as [2019 goodies], [2018 goodies], [2017 goodies], or [methods] to see collections of the Meituan technical team's articles from past years.