The Bert model has a deep network structure and a large number of parameters, so deploying it as an online service poses great challenges in terms of latency and throughput. This article introduces the difficulties 360 Search encountered while deploying the Bert model as an online service and the engineering optimizations we applied.

Background

In the 360 Search scenario, the latency and throughput requirements for an online Bert service are very strict. After preliminary investigation, exploration, and experiments, we found three main challenges in turning the Bert model into an online service:

1. A large number of model parameters. The 12-layer Bert model has more than 100 million parameters, so its computational cost is far higher than that of other semantic models.

2. Long inference time. We verified that the 12-layer Bert model has a latency of about 200 ms on CPU and an unoptimized inference latency of 80 ms on GPU. Such performance is unacceptable in a search scenario.

3. A heavy inference workload that requires a large amount of resources. According to our load tests, a single data center would need hundreds of GPU cards to handle all online traffic, and the investment cost would far exceed the expected returns.

Given these difficulties, we investigated several popular inference frameworks, including TF-Serving, OnnxRuntime, TorchJIT, and TensorRT. After comparing their support for quantization, preprocessing, variable sequence lengths, stability, performance, and community activity, we finally chose Nvidia's open-source TensorRT. With the framework selected, we optimized the Bert online service at several different levels.

Bert online service optimization

Optimizations provided at the framework level

The TensorRT inference framework itself provides the following optimizations (a build-time sketch follows this list):

Layer fusion and tensor fusion. The essence is to improve GPU utilization by reducing the number of kernel launches.

Kernel auto-tuning. TensorRT selects the best kernel implementation and parallelization strategy for each layer on the target GPU card to ensure optimal performance.

Multi-stream execution. Multiple task streams are processed in parallel and share model weights, which saves GPU memory.

Dynamic tensor memory allocation. GPU memory for a tensor is only allocated when the tensor is actually used, which greatly improves memory utilization.

Model quantization. Throughput is greatly improved and inference latency is reduced while accuracy is preserved.
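To make the build step concrete, here is a minimal sketch of compiling an ONNX export of the model into a TensorRT engine with the TensorRT 7.x Python API. The file name, input tensor names, and shape bounds are illustrative assumptions, not the values used in the actual service; kernel auto-tuning and layer fusion happen inside the final build call, and the FP16 flag is discussed in a later section.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_bert_engine(onnx_path="bert6.onnx", max_batch=32, max_seq=70):
    # Hypothetical build script: file name, input names, and shape bounds
    # are illustrative, not the production values.
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("failed to parse the ONNX model")

    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30          # 1 GB scratch space for tactic selection
    config.set_flag(trt.BuilderFlag.FP16)        # FP16 quantization (see later section)

    # Optimization profile: constrains the dynamic input shapes (batch, seq_len).
    profile = builder.create_optimization_profile()
    for name in ("input_ids", "attention_mask", "token_type_ids"):
        profile.set_shape(name, (1, 1), (max_batch, max_seq), (max_batch, max_seq))
    config.add_optimization_profile(profile)

    # Layer fusion and kernel auto-tuning are performed inside this call.
    return builder.build_engine(network, config)
```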

Knowledge distillation

The online latency of the 12-layer Bert model could not meet our performance requirements, so we distilled it into a lightweight 6-layer model. After knowledge distillation, the 6-layer model reaches 99% of the accuracy of the 12-layer model while requiring far less computation. Experiments also confirmed that the number of Bert layers directly affects online latency: the TP99 of the 6-layer model improved by a factor of two compared with the 12-layer model.
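The article does not describe the distillation recipe in detail; as a reference, a common objective combines the ordinary hard-label loss with a soft-label loss against the teacher's temperature-scaled logits. The snippet below is a generic sketch of that idea, with illustrative temperature and weighting values:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Generic soft-label + hard-label distillation loss (illustrative hyperparameters)."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```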

FP16 quantization

In the Bert model, most tensors are stored in FP32 precision, but inference requires no back propagation, so precision can be lowered while preserving model quality, which greatly improves throughput. After FP16 quantization, inference latency drops to 1/3 of the original, throughput triples, and GPU memory usage is halved. Compared with the original model, the FP16 model does lose precision around the fourth decimal place, but verification in the 360 Search scenario showed almost no impact on the final results. On balance, the gains from quantization far outweigh the losses.
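The size of that loss follows from the FP16 format itself: with a 10-bit mantissa, FP16 carries only about 3 significant decimal digits, so differences appear around the fourth decimal place. A quick NumPy illustration:

```python
import numpy as np

x = np.float32(0.93754213)            # a typical relevance score in FP32
x_half = np.float16(x)                # the precision FP16 can actually hold
print(float(x), float(x_half))        # FP32 keeps ~7 significant digits, FP16 only ~3-4
print(abs(float(x) - float(x_half)))  # the difference shows up around the 4th decimal place
```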

Pipeline optimization

After the online service was developed, we observed a phenomenon during load testing: no matter how much request pressure we applied, GPU utilization hit a bottleneck and stayed at about 80%. As the pressure kept increasing, GPU utilization did not rise any further, but latency kept growing.

In the figure above, H2D and D2H denote copying data from host memory to GPU memory and from GPU memory back to host memory, respectively, and Kernel denotes kernel execution. Serving one inference request involves three steps: first the request data is copied from host memory to GPU memory, then the GPU launches the kernels that perform the computation, and finally the result is copied from GPU memory back to host memory. The GPU only does useful work while the kernels are executing (the blue part in the figure); during the data copies it sits idle (the white part). With a single stream, the GPU therefore always has idle gaps, so its utilization can never be saturated no matter how high the load is.

One way to solve this problem is to add a second stream, so that the kernel execution of one stream overlaps with the data copies of the other. This increases the proportion of time the GPU spends doing useful work and raises GPU utilization above 98%. A stream can be understood as a task queue, and an H2D copy as one task in that queue. Adding a second stream does not consume extra GPU memory, because multiple streams share the model weights.
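Below is a rough sketch of what two-stream execution looks like with the TensorRT and PyCUDA Python APIs. The engine path and the buffer variables (bindings_a, in_a, and so on) are placeholders, and buffer allocation is omitted; the point is that each stream enqueues its copies and kernels asynchronously, so one stream's copies overlap the other's computation.

```python
import pycuda.autoinit          # creates the CUDA context
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("bert6.plan", "rb") as f:                     # assumed path to the serialized engine
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

# Two streams, two execution contexts; both contexts share the engine's weights.
stream_a, stream_b = cuda.Stream(), cuda.Stream()
ctx_a, ctx_b = engine.create_execution_context(), engine.create_execution_context()

def infer_async(ctx, stream, bindings, host_in, dev_in, host_out, dev_out):
    # H2D copy, kernel launches, and D2H copy are all enqueued asynchronously,
    # so the copies of one stream can overlap the kernels of the other.
    cuda.memcpy_htod_async(dev_in, host_in, stream)
    ctx.execute_async_v2(bindings, stream.handle)
    cuda.memcpy_dtoh_async(host_out, dev_out, stream)

# `bindings_a`, `in_a`, `d_in_a`, ... are placeholder pinned-host / device buffers
# whose allocation is omitted for brevity.
infer_async(ctx_a, stream_a, bindings_a, in_a, d_in_a, out_a, d_out_a)
infer_async(ctx_b, stream_b, bindings_b, in_b, d_in_b, out_b, d_out_b)
stream_a.synchronize()
stream_b.synchronize()
```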

Runtime architecture

The figure above shows the runtime architecture of a single Bert service process using two GPU cards. From left to right: a task is a prediction request waiting to be processed; a context stores the contextual information of a request; a stream is a task stream; a profile describes constraints on the model inputs (for example, the maximum batch size); and an engine is the optimized model that TensorRT compiles from the original model. Each GPU card loads one model, and each model serves predictions through two streams that share the model weights.

Whenever the Bert service receives a prediction request from a client, the request is placed in a task queue. Each of the four worker threads in the thread pool takes a prediction task from the queue whenever it is idle, saves the context information, copies the request data to GPU memory through its stream, has the GPU launch the kernels to run inference, and then copies the result back to host memory through the same stream. Finally the worker thread writes the result to the designated location and notifies the upper layer, completing one prediction request.
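In pseudo-Python, that dispatch loop might look roughly like the sketch below; the queue, the four-worker thread pool, and the run_inference placeholder (which would wrap the stream logic shown earlier) are illustrative rather than the production implementation.

```python
import queue
import threading

task_queue = queue.Queue()   # prediction requests from clients land here

def worker(stream_id):
    """Each worker owns one stream/execution context (see the previous sketch)."""
    while True:
        task = task_queue.get()            # block until a request is available
        if task is None:                   # shutdown sentinel
            break
        request_data, on_done = task
        # 1) copy input to GPU memory, 2) run inference, 3) copy the result back;
        # run_inference is a placeholder wrapping the stream logic shown earlier.
        result = run_inference(stream_id, request_data)
        on_done(result)                    # notify the upper layer
        task_queue.task_done()

workers = [threading.Thread(target=worker, args=(i,), daemon=True)
           for i in range(4)]              # four worker threads, as described above
for t in workers:
    t.start()
```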

Cache optimization

In the search scenario, certain hot queries recur within the same day, so caching can effectively avoid part of the computation. After a request-level cache was added to the search system, the average cache hit rate reached 35%, which greatly relieved the pressure on the Bert online service.
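Conceptually, a request-level cache memoizes the model score by its input pair. The sketch below uses Python's functools.lru_cache purely as an illustration; the production cache is presumably a shared, capacity-bounded store, and bert_predict is a placeholder for the call into the Bert service.

```python
from functools import lru_cache

@lru_cache(maxsize=100_000)               # capacity is illustrative
def cached_score(query: str, doc_title: str) -> float:
    # bert_predict is a placeholder for the actual call into the Bert online service.
    return bert_predict(query, doc_title)
```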

Dynamic sequence length

In the initial online service, the input dimension was fixed: the last dimension of the input shape was set to the maximum sequence length observed in offline statistics. After small-traffic verification and statistics on online requests, we found that fewer than 10% of online requests are longer than 70 tokens. We therefore adopted dynamic sequence lengths: within each request batch, the longest sequence determines the input length and the remaining sequences are zero-padded. This optimization improved online performance by 7%.
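A minimal sketch of batch-level dynamic padding (NumPy, illustrative only):

```python
import numpy as np

def pad_batch(token_id_lists):
    """Pad every sequence in the batch to the longest length in that batch."""
    max_len = max(len(ids) for ids in token_id_lists)    # dynamic, per batch
    batch = np.zeros((len(token_id_lists), max_len), dtype=np.int32)
    mask = np.zeros_like(batch)
    for i, ids in enumerate(token_id_lists):
        batch[i, :len(ids)] = ids
        mask[i, :len(ids)] = 1
    return batch, mask   # fed to the engine through the dynamic-shape profile
```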

Bert online service exploration

After the optimizations above, we still ran into some problems during testing and small-traffic verification, which we share below.

Dynamic model loading leads to increased latency

In the search scenario, multiple versions of the model need to be hot-loaded. After going online, we observed that TP99 increased whenever a new model was hot-loaded; we later tracked down the cause.

When Bert serves a prediction, the model input data is copied from host memory to GPU memory. When the service dynamically loads a model, the model weights are also copied from host memory to GPU memory. While the weight copy occupies the PCIe bus, the copies of request data to GPU memory are slowed down, so TP99 rises. The weight copy only lasts a few seconds, during which TP95 remains normal; statistics show that only a small number of requests experience increased latency, so the impact on the service is negligible.

Precision oscillation

During early development, we observed that the score returned for the same input sequence varied with the batch size, oscillating within a fixed interval: for example, the returned score always fell between 0.93 and 0.95 but was not fixed. This behavior was stably reproducible under TensorRT 7.1.3.4. After we reported it to Nvidia, it was fixed in 7.2.2.3.

GPU memory footprint

A single Bert model only occupies a few hundred MB of GPU memory, but once multi-version model support was added, the Bert online service may load 5 to 8 models, and mishandling this can cause out-of-memory (OOM) errors. Our current approach is that if a model cannot be loaded because GPU memory is insufficient, we simply report the load failure without affecting the running service. The GPU memory needed to load a new model can be estimated in advance from three main factors (a rough estimation sketch follows the list):

1. The model weights themselves. The weights must reside in GPU memory, and their size is approximately equal to the size of the model file on disk.

2. The context information needed for inference. This has two components: the persistent context of each execution context, including the memory for input and output buffers, and the intermediate activations used during inference, which usually do not exceed the size of the model weights.

3. The CUDA runtime itself, which consumes a fixed amount of GPU memory.
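Putting the three factors together, a rough pre-check before loading another model might look like the sketch below. The constants (CUDA runtime reserve, per-context I/O buffer size) and the one-activation-copy-per-context assumption are illustrative, not measured values from this article.

```python
import os

CUDA_RUNTIME_MB = 300      # fixed CUDA runtime overhead, assumed figure (paid once per process)
IO_BUFFER_MB = 50          # per-context input/output buffers, assumed figure

def estimated_model_mb(model_path, num_contexts=2):
    """Rough upper bound on the GPU memory needed to load one more model."""
    weights_mb = os.path.getsize(model_path) / (1024 * 1024)   # factor 1: weights ~= file size
    per_context_mb = IO_BUFFER_MB + weights_mb                 # factor 2: context + activations
    return weights_mb + num_contexts * per_context_mb

def safe_to_load(model_path, free_gpu_mb):
    # factor 3: keep the fixed CUDA runtime reserve out of the budget
    return estimated_model_mb(model_path) < free_gpu_mb - CUDA_RUNTIME_MB
```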

Summary and outlook

After the framework investigation and verification, model optimization, engineering architecture work, and deployment exploration described above, the Bert online service has officially gone live in the 360 Search scenario. With the optimized 6-layer model, a single T4 card can currently handle 1,500 requests per second, and the online peak TP99 is 13 ms. While ensuring the stability and performance of the Bert online service, the serving quality has also achieved considerable gains over the baseline.

We will continue to explore and promote the application of Bert within 360. In the search scenario, a few engineering optimizations are currently high priority:

1. The Bert service is still deployed on physical machines, which makes upgrades and capacity expansion difficult, weakens disaster recovery, and wastes resources. We are moving the Bert service to a Kubernetes (K8s) deployment.

2. At present, Bert training, distillation, data and model management, and deployment are handled by scattered, separate modules. We are integrating these modules into the company's internal machine learning platform, so that model training, data management, model management, service deployment and upgrades, A/B experiments, and traffic routing are handled on a single platform, shortening the launch cycle and improving efficiency.