CTR models are widely used in Internet search, recommendation, advertising and other scenarios. In recent years, with the introduction of deep neural networks, CTR model inference has come to demand more and more hardware computing power. This paper introduces Meituan's CTR model optimization practice. By analyzing the structural characteristics of the models and combining them with the GPU hardware architecture, we designed a set of processes to customize and optimize the models, reducing latency, improving throughput and saving cost.
1 Background
Click-through rate (CTR) refers to the click-through rate of an online advertisement, that is, the actual number of clicks on an advertisement divided by the number of times it is displayed. The scoring model that serves the CTR metric is generally called a CTR model. The concept can be further extended to models that estimate conversion rates in Internet applications. CTR models are widely used in recommendation, search, advertising and other scenarios. Compared with CV (computer vision) and NLP (natural language processing) scenarios, the structure of CTR models has historically been relatively simple and required less computation, so Meituan's CTR models have traditionally run inference on CPU. With the introduction of deep neural networks in recent years, CTR model structures have gradually become more complex and the computational load has grown larger and larger, to the point where the CPU can no longer meet the models' demand for computing power.
A GPU, with thousands of computing cores, provides dense parallel computing capability within a single machine and has demonstrated powerful capabilities in CV, NLP and other fields. NVIDIA has built a complete GPU ecosystem through CUDA[1] and related APIs. Building on this, Meituan's basic R&D platform deploys CTR models on GPU through a set of solutions. Considering the model prediction stage alone, our deep GPU optimization scheme based on the NVIDIA T4 improves throughput by 10 times compared with CPU under the same cost constraint. At the same time, in a typical search fine-ranking scenario, end-to-end throughput more than doubles.
In addition to improving throughput and reducing cost, the GPU scheme also opens up new possibilities for applying CTR models. For example, search-box autocomplete is interactive by nature and has very strict latency requirements, so complex models generally could not be used there. With GPU support, however, the average response time of a complex model drops from 15 milliseconds to 6-7 milliseconds, which meets the requirement for going online.
In the rest of this article, we discuss the GPU optimization ideas, results, advantages and shortcomings of the new generation of CTR prediction service provided by the Meituan machine learning platform, hoping to help or inspire engineers engaged in related work.
2 Challenges of GPU Inference for CTR Models
2.1 Challenges at the Application Layer
- CTR model structures are highly variable, contain a large number of business-specific components, and new SOTA models keep emerging. Due to manpower constraints, hardware vendors focus on optimizing commonly used classical structures such as ResNet; for structures that have not converged, there is no official end-to-end optimization tool.
- CTR models usually contain very large Embedding tables, which must be assumed not to fit into GPU memory.
- In typical recommendation scenarios, models must stay fresh so that new POIs can be exposed faster, which means the online model service needs to support incremental model updates.
2.2 Challenges at the Framework Layer
- Operator level: mainstream deep learning frameworks such as TensorFlow and PyTorch can be regarded as second-generation frameworks. They set out to solve an obvious problem of the first-generation framework Caffe: its layer granularity was too coarse, so algorithm developers of that era had to be able to "write their own custom layers". TensorFlow and PyTorch both prioritize expressiveness, which leads to fine operator granularity and in turn introduces significant extra overhead on both CPU and GPU architectures.
- Framework level: TensorFlow and PyTorch are essentially training frameworks, friendly to algorithm developers but not oriented toward deployment. They embed many design choices made for the convenience of distributed training. For example, to make it easy to split a Variable across different parameter servers (PS), TensorFlow has a built-in PartitionedVariable design. In GPU-based single-machine prediction scenarios, these structures introduce additional overhead.
2.3 Challenges at the Hardware Layer
First, TensorFlow's operator granularity is fine, so a model usually consists of thousands of operators, whose execution on the GPU is translated into the execution of corresponding GPU kernels. A kernel is a function executed in parallel by many threads on the GPU.
Executing a GPU kernel can be roughly divided into stages such as data transfer, kernel launch and kernel computation, and each kernel launch costs about 10 μs. With a large number of small operators, each kernel's execution time is short and launch time dominates (a thousand such launches already amount to roughly 10 ms of launch overhead). In addition, adjacent kernels have to exchange data by reading and writing GPU memory, which causes a large amount of memory-access overhead, and since the GPU's memory bandwidth is far lower than its compute throughput, both performance and GPU utilization end up low.
Secondly, a GPU card contains multiple compute units. In theory, different compute units could run different kernels, but in practice, to keep the programming model simple, CUDA by default runs the kernels in a single Stream one at a time. Although multiple Streams can be used, fine-grained coordination between them is lacking.
After thorough investigation and discussion, we decided that the first phase should focus on the low execution efficiency of common CTR model structures on NVIDIA GPUs under the TensorFlow framework. We narrowed the problem down to the following two sub-problems:
- Operator granularity is too fine, so GPU execution efficiency is low.
- Model structures vary widely, so manual optimization is costly and does not generalize.
3 Optimization Methods
To solve these problems, we surveyed the deep learning inference accelerators available in the industry. The mature optimization schemes are mainly TensorRT, XLA and TVM. TensorRT relies on manual optimization, fusing operators for certain fixed model structures and providing highly tuned implementations of compute-intensive operators such as convolution. XLA is TensorFlow's built-in compiler optimization tool; it mainly targets memory-access-intensive structures and achieves operator fusion through compilation. TVM[2] has more comprehensive optimization capabilities: it fuses operators through compilation and automatically tunes compute-intensive operators using machine learning.
After extensive research and comparison, we chose TVM as the optimization tool. Through compilation, TVM copes better with ever-changing model structures and avoids the poor generality of manual optimization. However, applying TVM to our business models also brings a series of problems: it supports fewer operators, and its current support for dynamic shapes is not good enough. To address these two problems, we combined TVM with TensorFlow and, drawing on the structural characteristics of CTR models and the hardware characteristics of the GPU, developed a series of processes to optimize CTR models.
3.1 Operator Fusion
Fusing several small operators into a semantically equivalent large operator effectively reduces the number of kernels on the GPU. On the one hand, fewer kernels directly reduces kernel launch overhead; on the other hand, the fused kernel performs more computation, avoiding the frequent memory traffic caused by passing data between multiple kernels and improving the compute-to-memory-access ratio.
As shown in the two equivalent structures in the figure above, the computation performed by the 21 operators on the left can be completed by a single equivalent operator. On the GPU, the left side requires at least 21 kernel launches and 21 rounds of GPU memory reads and writes, while the right side needs only one kernel and one round of memory reads and writes. Each fused operator requires a corresponding kernel implementation, but the possible combinations of operators in a model are effectively unlimited, so hand-writing a kernel for every fused operator is unrealistic. TVM automatically fuses operators and generates device code through compilation, avoiding the burden of hand-writing kernels one by one.
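As a minimal sketch (not our production pipeline), the snippet below builds a tiny TF1-style graph standing in for an extracted, TVM-compatible subgraph and compiles it with TVM; the shapes and tensor names are illustrative.

```python
import numpy as np
import tensorflow as tf
import tvm
from tvm import relay

# Tiny TF1-style graph standing in for an extracted, TVM-compatible subgraph.
g = tf.Graph()
with g.as_default():
    x = tf.compat.v1.placeholder(tf.float32, shape=(128, 256), name="dense_input")
    w = tf.constant(np.random.randn(256, 64).astype(np.float32), name="w")
    y = tf.nn.relu(tf.matmul(x, w) + 1.0, name="out")

# Import the GraphDef into Relay.
mod, params = relay.frontend.from_tensorflow(
    g.as_graph_def(), shape={"dense_input": (128, 256)})

# relay.build runs graph-level passes (operator fusion included) and generates CUDA code,
# so the fused MatMul + Add + Relu kernel does not have to be written by hand.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda", params=params)
```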
3.1.1 TF-TVM Automatic Graph Cutting
A TensorFlow model that contains operators not supported by TVM cannot be converted to TVM as a whole. The natural idea is to cut out the parts that TVM can optimize and convert them into TVM engines, while the remaining parts continue to use native TensorFlow operators. XLA and TensorRT face similar problems during conversion, so we analyzed the implementations of TF-XLA and TF-TRT:
- TF-XLA performs its graph cutting in the POST_REWRITE_FOR_EXEC phase (a keyword that can be searched in the TensorFlow source code), which follows Grappler[4] graph optimization. In this phase, three passes are run over the graph to mark and rewrite the compilable subgraphs and build the corresponding LaunchOps.
- TF-TRT[3] registers a Grappler optimizer that finds connected subgraphs of supported operators and replaces each of them with a TRT Engine node.
In the final scheme we followed the TF-TRT design. Compared with XLA, its advantage is that the XLA graph-cutting scheme is tightly coupled with the TensorFlow source code, embedding its three passes directly into the main flow of Session startup, and we did not want to be that tightly coupled with TensorFlow internals. We extended this into a TF-TVM scheme and, in practice, run graph cutting as an independent offline process that is triggered automatically whenever a model is deployed or updated.
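The following is only an illustrative sketch of the idea, not our actual implementation: it walks a GraphDef, keeps nodes whose op types TVM can convert, and groups them into connected components large enough to be worth turning into TVM engines. A real cut also has to respect execution-order constraints so that no cycle is created between the TVM node and the remaining TensorFlow graph.

```python
import collections

def find_tvm_subgraphs(graph_def, supported_ops, min_nodes=10):
    """Return connected components of TVM-convertible nodes in a tf.GraphDef (sketch)."""
    supported = {n.name for n in graph_def.node if n.op in supported_ops}

    # Build undirected adjacency between supported nodes only.
    edges = collections.defaultdict(set)
    for n in graph_def.node:
        for inp in n.input:
            src = inp.split(":")[0].lstrip("^")  # strip port and control-dependency markers
            if n.name in supported and src in supported:
                edges[n.name].add(src)
                edges[src].add(n.name)

    seen, components = set(), []
    for name in supported:
        if name in seen:
            continue
        stack, comp = [name], set()
        while stack:                      # DFS over one connected component
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(edges[cur] - comp)
        seen |= comp
        if len(comp) >= min_nodes:        # only large components become TVM engines
            components.append(comp)
    return components
```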
At inference time, TVM executes the optimized subgraphs while the native TensorFlow implementation executes the rest of the computation graph, and together they complete the model's inference. Because TVM and the TensorFlow runtime manage memory independently, passing data across the framework boundary incurs additional overhead. To reduce this overhead, we bridged the underlying data structures of the two frameworks to avoid extra data copies as much as possible.
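As one concrete way to share a GPU buffer across the two runtimes without copying, DLPack can bridge a TensorFlow tensor and a TVM NDArray; this sketch shows the general idea rather than our exact internal mechanism.

```python
import tensorflow as tf
import tvm

# Allocate a tensor on the GPU with TensorFlow, then hand the same device buffer to TVM
# through DLPack instead of copying it across the framework boundary.
with tf.device("/GPU:0"):
    tf_tensor = tf.random.uniform((128, 256))

capsule = tf.experimental.dlpack.to_dlpack(tf_tensor)  # export, no copy
tvm_array = tvm.nd.from_dlpack(capsule)                # TVM view of the same GPU memory
print(tvm_array.shape)
```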
3.1.2 Equivalent Replacement of Computation Graph Structures
If a TensorFlow model contains too many operators that TVM does not support, the TF-TVM graph cutting becomes fragmented, which hurts the final optimization effect. To make the TVM subgraphs as large and complete as possible, and to give TVM more room for fusion during optimization, we detect certain complex structures in the model and replace them with equivalent structures that execute more efficiently or are easier to fuse.
For example, TensorFlow's native EmbeddingLookup structure shards the Embedding table to support distributed training, which introduces dynamic operators such as DynamicPartition and ParallelDynamicStitch. These dynamic operators are not supported by TVM, so the TF-TVM graph becomes overly fragmented. To keep the TVM subgraphs more complete, we rewrite this structure through graph replacement: by merging the Embedding table shards in advance, we obtain a simplified EmbeddingLookup structure.
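The toy example below (TF 2.x eager code with illustrative sizes) shows the equivalence we rely on: under the default "mod" partition strategy, the sharded lookup built from DynamicPartition + Gather + DynamicStitch produces the same result as a single Gather over a table whose shards have been merged back in advance.

```python
import numpy as np
import tensorflow as tf

num_shards, dim = 2, 4
full = np.arange(8 * dim, dtype=np.float32).reshape(8, dim)
# Shard i holds the rows of ids with id % num_shards == i ("mod" partitioning).
shards = [full[i::num_shards] for i in range(num_shards)]

ids = tf.constant([5, 0, 3, 6], dtype=tf.int32)

# Original structure: DynamicPartition + per-shard Gather + ParallelDynamicStitch.
part = tf.cast(ids % num_shards, tf.int32)
id_parts = tf.dynamic_partition(ids // num_shards, part, num_shards)
idx_parts = tf.dynamic_partition(tf.range(tf.size(ids)), part, num_shards)
sharded = tf.dynamic_stitch(
    idx_parts, [tf.gather(tf.constant(s), p) for s, p in zip(shards, id_parts)])

# Replacement: merge the shards back into one table offline, then one Gather suffices.
merged = np.empty_like(full)
for i, s in enumerate(shards):
    merged[i::num_shards] = s
simple = tf.gather(tf.constant(merged), ids)

assert np.allclose(sharded.numpy(), simple.numpy())
```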
3.2 CPU-GPU Data Transmission Optimization
After optimization, the TVM subgraph is replaced by a node that executes on the GPU, typically with dozens or even hundreds of inputs, while the upstream inputs (such as Placeholders) generally run on the CPU, so many CPU-to-GPU transfers are involved. Frequently transferring small pieces of data cannot make full use of the bandwidth. To solve this problem, we modify the model structure by adding merge and split nodes to the computation graph, controlling where the graph is cut and reducing the number of data transfers.
One natural way to merge these inputs is to group them by shape and dtype, concatenate each group, and include the corresponding Split nodes in the TVM subgraph so they are optimized together. However, this approach can cause problems: operator fusion becomes less effective in some subgraphs, and since the parameter space of a GPU kernel function is limited to 4 KB, the generated code becomes invalid when a TVM node has a very large number of inputs (for example, more than 512).
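As a hedged sketch of the merge-and-split idea (the feature names and shapes here are made up), inputs that share a dtype and per-feature shape are concatenated into one tensor on the CPU side and split back apart inside the graph, so one large transfer replaces many small ones. In practice the grouping also has to cap the number of inputs per merged tensor, both for fusion quality and because of the 4 KB kernel parameter limit mentioned above.

```python
import tensorflow as tf

# Illustrative only: three scalar float features per sample, normally three Placeholders.
feat_a = tf.random.uniform((32, 1))
feat_b = tf.random.uniform((32, 1))
feat_c = tf.random.uniform((32, 1))

# Merge node on the CPU side: one [batch, 3] tensor, hence one H2D copy.
merged = tf.concat([feat_a, feat_b, feat_c], axis=1)

# Split node placed inside the GPU/TVM subgraph: recover the original inputs.
a, b, c = tf.split(merged, num_or_size_splits=3, axis=1)
```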
3.3 Manual Optimization of High-Frequency Subgraphs
For subgraphs that TVM cannot support, we abstract the structures that appear frequently in business models and implement them efficiently as hand-written custom operators.
For example, some sequence features in the model are fed in as strings. The input string has to be split and looked up to produce a padded numeric Tensor, and the resulting int Tensor is then used as indices for the subsequent lookup. The semantics of this subgraph are shown in the figure; we refer to it below as the SE structure:
For this structure, TensorFlow's native implementation only has a CPU version; with large data volumes and high concurrency, its performance degrades severely and it becomes the bottleneck of the whole model. To optimize it, we implemented an efficient equivalent operation on the GPU.
As shown in the figure, the PadString operator pads multiple strings to the maximum length on the CPU side and concatenates them into a contiguous uint8 Tensor, so that they can be transferred to the GPU in a single copy. After StringEmbedding receives the padded strings, it exploits the GPU's parallelism, using a large number of cooperating threads to perform string splitting and table lookup. For key steps such as reduction sums and prefix sums, it uses GPU Reduce/Scan algorithms, and the implementation uses warp shuffle instructions so that threads exchange data through registers, avoiding the overhead of frequent memory accesses and achieving good performance.
The GPU Scan algorithm is illustrated in the figure: a prefix sum over 8 elements takes only 3 iterations. In a model containing dozens of similar operations, the GPU timeline comparison before and after manual optimization is shown in the figure below; the time spent on H2D transfer plus StringEmbedding drops sharply from 42 milliseconds to 1.83 milliseconds.
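To make the CPU-side PadString step above concrete, here is a minimal sketch; the real operator is a hand-written custom op, and this NumPy version only illustrates the pad-and-pack layout that enables a single H2D copy.

```python
import numpy as np

def pad_and_pack(strings):
    """Pad each string to the batch maximum length and pack into one contiguous buffer."""
    raw = [s.encode("utf-8") for s in strings]
    max_len = max(len(b) for b in raw)
    packed = np.zeros((len(raw), max_len), dtype=np.uint8)  # zero bytes mark the padding
    lengths = np.array([len(b) for b in raw], dtype=np.int32)
    for i, b in enumerate(raw):
        packed[i, : len(b)] = np.frombuffer(b, dtype=np.uint8)
    return packed, lengths  # each array moves to the GPU in a single copy

packed, lengths = pad_and_pack(["poi_123", "query:hotpot", "shop_45"])
```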
Besides the StringEmbedding structure, we also implemented efficient fused versions of structures such as StringSplit + ToNumber + SparseSegmentSqrt and multi-way parallel StringEmbedding, and replace them through structure matching during the optimization process.
3.4 CPU-GPU Traffic Splitting
In actual online RPC requests, the number of samples per request (referred to below as the Batch) varies within the range [1, MaxValue], where MaxValue is relatively fixed because it is constrained by the upstream business systems and other infrastructure. As shown in the figure above, taking one search service as an example, we measured the online Batch distribution: requests with Batch = MaxValue account for about 45%, those with Batch = 45 account for 7.4%, and those with Batch = 1 account for 2.3%; each remaining Batch value accounts for 0.5% to 1%. For GPUs, a larger per-request Batch makes better use of hardware resources, brings the GPU's parallel computing capability into full play, and delivers better latency and throughput than the CPU. When the Batch is small, the GPU's advantage over the CPU is not obvious (the figure below shows how the CPU/GPU latency of the same model changes under a fixed load).
If the GPU handles most of the requests, the CPU is left fairly idle. We therefore run some of the small-Batch, fragmented requests on the CPU, so that resource utilization across the whole Worker is more balanced and the overall performance of the system improves. Based on testing, we set a Batch threshold, together with the decision logic for executing the computation graph on heterogeneous hardware: requests with a small Batch are executed directly on the CPU, and only requests whose Batch exceeds the threshold are sent to the GPU for inference. According to online statistics, 77% of the total traffic runs on the GPU and 23% on the CPU.
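A simplified sketch of this threshold-based routing (the threshold value and model handles are illustrative, not our actual configuration):

```python
BATCH_THRESHOLD = 16  # illustrative; chosen from the measured CPU/GPU latency crossover

def predict(batch_features, cpu_model, gpu_model):
    """Route fragmented small-batch requests to the CPU copy of the graph,
    and larger batches to the TVM-optimized GPU graph."""
    if len(batch_features) < BATCH_THRESHOLD:
        return cpu_model(batch_features)
    return gpu_model(batch_features)
```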
Batch size is a key piece of information for a whole series of GPU optimization strategies and actions: the optimal kernel implementation can differ from one Batch size to another, each tuned for the corresponding workload. Because of the characteristics of online traffic, the Batch values of requests sent to the GPU are scattered over a range, and optimizing a model's kernel implementations for every possible Batch would clearly be neither economical nor general. We therefore designed a Batch Bucket strategy: we generate N optimized models, each with a fixed Batch size, and when a request arrives we find the closest Bucket and pad the request up to that Batch size for computation, thereby improving GPU utilization.
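A sketch of the Batch Bucket strategy under assumed bucket sizes: pick the smallest pre-compiled bucket that fits the request, pad the inputs up to it, and slice the outputs back down afterwards.

```python
import numpy as np

BUCKETS = [8, 16, 32, 64, 128, 256]  # illustrative fixed batch sizes compiled ahead of time

def pad_to_bucket(features: np.ndarray):
    """Pad a [batch, ...] feature array up to the nearest bucket size."""
    batch = features.shape[0]
    bucket = next(b for b in BUCKETS if b >= batch)         # smallest bucket that fits
    pad = np.zeros((bucket - batch,) + features.shape[1:], dtype=features.dtype)
    return np.concatenate([features, pad], axis=0), batch   # keep real batch to trim outputs

padded, real_batch = pad_to_bucket(np.random.rand(50, 128).astype(np.float32))
# outputs = model_for_bucket(padded.shape[0])(padded)[:real_batch]  # hypothetical usage
```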
4 Load Test Performance Analysis
We selected one model for online load testing and performance analysis.
- CPU model test environment: 16-core Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz, 16 GB memory.
- GPU model test environment: 8-core Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz, Tesla T4 GPU, 16 GB memory.
The figure below compares the inference latency (Y-axis) of the GPU model at different BatchSizes under different QPS levels (X-axis). When the GPU model's BatchSize is below 128, the difference in inference time is not obvious, and a larger BatchSize is more favorable for throughput. Comparing the GPU model with BatchSize 256 against the CPU model with BatchSize 25, when the QPS is below 64 the inference time of the two is basically the same; once the QPS exceeds 64, the GPU's inference latency is lower than the CPU's. GPU throughput is 10 times that of the CPU.
We can also compare the steepness of the curves: once the QPS exceeds 64, the CPU's latency rises rapidly, whereas the GPU stays stable until the QPS exceeds 128. Beyond that point the GPU's latency also rises noticeably, but it remains more stable than the CPU's.
5 Overall Architecture
Based on the structural characteristics of CTR models, we abstracted a general, platform-level optimization pipeline. It analyzes the model structure, automatically applies the appropriate optimization strategies, and guarantees the optimization result through performance evaluation and consistency checks.
6 Shortcomings and Future Plans
In terms of ease of use, the current solution provides a set of online optimization scripts that automatically optimize and deploy a model after the user submits it. Because the computation graph has to be analyzed, edited and compiled by TVM, model optimization currently takes a long time, about 20 minutes in most cases; next we need to look into speeding up TVM compilation.
In terms of generality, our practical gains come mainly from TVM compilation optimization and high-performance hand-written operators. Manual optimization tests an engineer's understanding of the business model and GPU programming ability; writing a high-performance fused operator is not easy, and making it portable and extensible is even harder.
Overall, there is still much to explore for CTR model inference on GPU. Besides delivering better performance based on an understanding of the business, we also need to handle models too large to fit into GPU memory and to support online model updates.
About the Authors
Wei Long, Xiao Zhuo, Wen Kui, ð«fei, Xiao Xin and others, all from the Machine Learning Prediction group of Meituan's basic R&D platform.
References
[1] CUDA C++ Programming Guide
[2] TVM Documentation
[3] Accelerating Inference in TF-TRT User Guide
[4] TensorFlow graph optimization with Grappler
Recruitment information
The Meituan machine learning platform is hiring for a large number of positions (internship and full-time), based in Beijing/Shanghai. Interested candidates are welcome to join us in building a company-level machine learning platform spanning multiple domains, helping everyone eat better and live better. Resumes can be sent to: [email protected].
This article was produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use the content for non-commercial purposes such as sharing and communication; please credit it as "Content reproduced from the Meituan technical team". This article may not be reproduced or used commercially without permission. For any commercial use, please email [email protected] to apply for authorization.