What is the hardest part of deep learning in practice? Guess what roughly 80% of developers say:

“Hyperparameter tuning, of course.”

Why is it hard? Tuning is like a chef working out a recipe from a set of ingredients, or a pharmacist working out a prescription from a pile of herbs.

However, even once we have mastered hyperparameter tuning, we have mastered at most half of deep learning. The other half is “model deployment.”

What’s so hard about model deployment? Take the chef again: after all kinds of training he mastered plenty of recipes at culinary school, but when he finally arrived at the hotel, he found its kitchen was nothing like the one he had trained in. He was a mess during the dinner rush, and customers waited an hour before their food was served.

Although the metaphor is a little exaggerated, it captures the relationship between deep learning model training and inference deployment.

As we know, deep learning is generally divided into training and inference. Training is the “learning” phase of a neural network: it focuses on how to search for and solve the model parameters and discover the patterns in the training data.

Once you have a trained model, you apply it in an online environment to make predictions on unseen data, a process known in AI as inference.

In practical applications, the inference stage may face a completely different hardware environment from the training stage, and with it different compute performance requirements. The trained model must perform inference correctly and efficiently in the target production environment to complete online deployment.

So, after training a model and finally trying to put it online, we may run into problems like:

  • The hardware environment for online deployment is different from that used for training

  • Inference takes too long, which may make the service unavailable

  • The model’s memory footprint is too high for it to go online

For industrial-scale deployments, the requirements are often numerous and demanding, and not every deep learning framework supports real production deployment well. A framework with complete inference support lets your model go online with half the effort and twice the result.

As a deep learning framework born from industrial practice, PaddlePaddle has accumulated and polished its inference and deployment capabilities to an exceptional depth, and provides a high-performance, easy-to-use server-side inference library that frees users from the many headaches of online deployment.

01 What is Paddle Inference


PaddlePaddle’s inference and deployment capability has been upgraded and iterated over several versions, forming a complete inference library named Paddle Inference. Paddle Inference is feature-rich and high-performing, and it has been deeply adapted and optimized for the different application scenarios of different platforms to achieve high throughput and low latency, ensuring that PaddlePaddle models can be deployed quickly on servers.


02 High-performance implementation of Paddle Inference

Memory/GPU memory reuse improves service throughput

At the start of inference, the framework analyzes the dependencies among the OPs’ output Tensors in the model, and Tensors that do not depend on each other reuse the same memory/GPU memory space. This increases computational parallelism and improves service throughput.
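
To make the idea concrete, here is a toy sketch of lifetime-based buffer reuse. It is purely illustrative: the op list and tensor names are made up, and this is not Paddle’s actual implementation.

# Toy illustration of memory reuse: once the last consumer of a tensor has run,
# its buffer can be handed to a later tensor instead of allocating new memory.

# Hypothetical op list: (op_name, inputs, output)
ops = [
    ("conv1", ["image"], "a"),
    ("relu1", ["a"], "b"),      # "a" is dead after this op
    ("conv2", ["b"], "c"),      # "b" is dead after this op
    ("relu2", ["c"], "d"),
]

# Index of the last op that still reads each tensor
last_use = {}
for i, (_, inputs, _) in enumerate(ops):
    for name in inputs:
        last_use[name] = i

free_buffers = []   # buffers whose tensors are no longer needed
assigned = {}       # tensor name -> buffer id
next_buffer = 0

for i, (op, inputs, output) in enumerate(ops):
    # Reuse a freed buffer if one is available, otherwise allocate a new one
    if free_buffers:
        assigned[output] = free_buffers.pop()
    else:
        assigned[output] = next_buffer
        next_buffer += 1
    # Release the buffers of tensors whose last use was this op
    for name in inputs:
        if last_use.get(name) == i and name in assigned:
            free_buffers.append(assigned[name])

print(assigned)  # {'a': 0, 'b': 1, 'c': 0, 'd': 1}: 2 buffers cover 4 tensors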

Fine-grained horizontal and vertical OP fusion reduces computation

At the start of inference, multiple OPs in the model are fused into a single OP according to known fusion patterns, which reduces the model’s computation and the number of kernel launches and thus improves inference performance. Paddle Inference currently supports dozens of fusion patterns.
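
As a minimal sketch of what vertical fusion buys (plain NumPy, not Paddle’s fusion machinery), computing matmul + bias + ReLU in one expression gives the same result as three separate ops while avoiding the intermediate tensors and extra kernel launches:

import numpy as np

x = np.random.randn(8, 16).astype("float32")
w = np.random.randn(16, 4).astype("float32")
b = np.random.randn(4).astype("float32")

# Unfused: three ops, two intermediate tensors materialized
t1 = x @ w
t2 = t1 + b
out_unfused = np.maximum(t2, 0.0)

# "Fused": a single expression; a real fused kernel would also avoid the
# extra memory traffic and kernel launches of the three-op version
out_fused = np.maximum(x @ w + b, 0.0)

assert np.allclose(out_unfused, out_fused)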

Built-in high-performance CPU/GPU kernels

Built-in high-performance kernels co-developed with Intel and NVIDIA ensure high-performance execution of model inference.

Subgraph integration with TensorRT accelerates GPU inference

For GPU inference scenarios, Paddle Inference integrates TensorRT in the form of subgraphs. TensorRT can optimize some of these subgraphs, including horizontal and vertical OP fusion, filtering out redundant OPs, and automatically selecting the optimal kernels for OPs, to speed up inference.

Subgraph integration with the Paddle Lite lightweight inference engine

Paddle Lite is a lightweight, low-overhead inference engine for the PaddlePaddle deep learning framework. Besides mobile applications, Paddle Lite can also run inference on servers. Paddle Inference integrates Paddle Lite in the form of subgraphs, so with only minor changes to the existing server-side inference code, users can switch on Paddle Lite’s inference capability and get faster inference. In addition, Paddle Lite can run inference on Baidu Kunlun and other high-performance AI chips.

Support for loading models compressed with PaddleSlim

PaddleSlim is PaddlePaddle’s model compression tool for deep learning. Paddle Inference works together with PaddleSlim and supports loading and deploying quantized, pruned, and distilled models, which reduces model storage, lowers compute memory, and speeds up inference. For quantization, Paddle Inference has been deeply optimized on x86 CPUs: single-thread performance improves by nearly 3x for common classification models and by 2.68x for ERNIE models.
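
As a hedged sketch, loading such a compressed model follows the same AnalysisConfig workflow described in the usage section below; the ./quant_model directory is a hypothetical path to a model exported by PaddleSlim, and whether additional switches are needed depends on your Paddle version.

from paddle.fluid.core import AnalysisConfig, create_paddle_predictor

# "./quant_model" is an assumed path to a PaddleSlim-exported model directory
config = AnalysisConfig("./quant_model")
config.enable_mkldnn()                    # turn on the x86 optimizations described above
config.switch_use_feed_fetch_ops(False)   # use Zero Copy, as in the examples below
predictor = create_paddle_predictor(config)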

Comparing the forward-pass time during training with the Paddle Inference time for the ResNet50 and BERT models shows that Paddle Inference delivers a significant speedup.

Note on the timing methodology: with the same input data, the model is first run 1,000 times to warm up, then run another 1,000 times in a loop; each run is timed and the average latency is reported.
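
A minimal timing sketch following that methodology might look as follows; it assumes a predictor and its input have already been set up as in the usage section below, and the constants are illustrative only.

import time

WARMUP, REPEATS = 1000, 1000

for _ in range(WARMUP):        # warm-up runs, not timed
    predictor.zero_copy_run()

latencies = []
for _ in range(REPEATS):       # timed runs
    start = time.time()
    predictor.zero_copy_run()
    latencies.append(time.time() - start)

print("average latency: %.3f ms" % (1000.0 * sum(latencies) / len(latencies)))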


03 The generality of Paddle Inference

Compatible with mainstream software and hardware environments

It supports x86 CPUs and NVIDIA GPUs on servers and is compatible with Linux, Mac, and Windows operating systems. It supports every model produced by PaddlePaddle training, so models are ready to use as soon as training finishes.

Rich multi-language APIs that can be called flexibly

It provides C++, Python, C, Go, and R APIs; the interfaces are simple and flexible, and a deployment can be completed in 20 lines of code. For other languages, an ABI-stable C API is provided that users can easily extend.


04 How to use Paddle Inference


Let’s take a look at how to complete server-side inference deployment with PaddlePaddle.

“One function” saves the model

The PaddlePaddle framework provides a built-in function, save_inference_model, that saves the model in a format meant for inference. Based on the inputs and outputs required at inference time, save_inference_model prunes the training model and removes the parts irrelevant to inference. Compared with the training model, the result is leaner and better suited for further optimization and deployment.

from paddle import fluid

place = fluid.CPUPlace()
executor = fluid.Executor(place)

image = fluid.data(name="image", shape=[None, 28, 28], dtype="float32")
label = fluid.data(name="label", shape=[None, 1], dtype="int64")

feeder = fluid.DataFeeder(feed_list=[image, label], place=place)
predict = fluid.layers.fc(input=image, size=10, act='softmax')

loss = fluid.layers.cross_entropy(input=predict, label=label)
avg_loss = fluid.layers.mean(loss)

executor.run(fluid.default_startup_program())

# Save the model to the "model" directory, keeping only the parts of the
# network needed to go from the input image to the inference output predict
fluid.io.save_inference_model("model", feeded_var_names=["image"],
                              target_vars=[predict], executor=executor)

“One configuration manager” takes care of deployment settings

After saving the inference model, you can use the inference library. Paddle Inference provides AnalysisConfig to manage the various settings of inference deployment, such as whether to deploy on CPU or GPU, the model load path, whether computational-graph analysis and optimization is enabled, and whether to accelerate deployment with MKLDNN/TensorRT. You can enable these optimizations according to your online environment. Zero Copy can also be configured to manage inputs and outputs, so that the feed OP and fetch OP are skipped during inference, reducing redundant data copies and improving inference performance.

from paddle.fluid.core import AnalysisConfig

# Create the configuration object from the saved model directory
config = AnalysisConfig("./model")
# Use Zero Copy for inputs and outputs
config.switch_use_feed_fetch_ops(False)
config.switch_specify_input_names(True)

“One predictor” handles high-performance inference

With the deployment configuration defined, you can create the predictor. Paddle Inference provides a number of graph-optimization methods; when the predictor is created, the inference model is loaded and graph optimization is carried out automatically, improving inference performance.

# create a predictor
from paddle.fluid.core import create_paddle_predictor
predictor = create_paddle_predictor(config)

Once the predictor is created, all you need to do is pass in the data and run inference to compute the prediction. Assuming the input data has already been read into a numpy.ndarray, PaddlePaddle provides an easy-to-use API for managing inputs and outputs.

# Get the input names and pass in the data
input_names = predictor.get_input_names()
input_tensor = predictor.get_input_tensor(input_names[0])
input_tensor.copy_from_cpu(input_data.reshape([1, 28, 28]).astype("float32"))

# Run the predictor, where the actual prediction will be performed
predictor.zero_copy_run()

# Output the predicted results
output_names = predictor.get_output_names()
output_tensor = predictor.get_output_tensor(output_names[0])
output_data = output_tensor.copy_to_cpu()

Let’s walk through the whole deployment process with a complete Python API example: deploying a ResNet model on a server with a P4 GPU.

  • (1) Install PaddlePaddle.

Download and install PaddlePaddle from the official website.

  • (2) Obtain the model.

wget http://paddle-inference-dist.bj.bcebos.com/resnet50_model.tar.gz && tar -xzf resnet50_model.tar.gz

  • (3) Prepare the model deployment code and save it in the infer_resnet.py file.

import argparse
import numpy as np
from paddle.fluid.core import AnalysisConfig
from paddle.fluid.core import create_paddle_predictor

def main():   

    args = parse_args()

    # set AnalysisConfig
    config = set_config(args)

    # Create the PaddlePredictor
    predictor = create_paddle_predictor(config)

    # Get the input names
    input_names = predictor.get_input_names()
    input_tensor = predictor.get_input_tensor(input_names[0])

    # set input
    fake_input = np.random.randn(args.batch_size, 3, 318, 318).astype("float32")
    input_tensor.reshape([args.batch_size, 3, 318, 318])
    input_tensor.copy_from_cpu(fake_input)

    # run predictor
    predictor.zero_copy_run()

    # fetch output
    output_names = predictor.get_output_names()
    output_tensor = predictor.get_output_tensor(output_names[0])
    output_data = output_tensor.copy_to_cpu()  # numpy.ndarray
    for i in range(args.batch_size):
        print(np.argmax(output_data[i]))

def parse_args():

    # model path configuration
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_file".type=str, help="model filename")
    parser.add_argument("--params_file".type=str, help="parameter filename")
    parser.add_argument("--batch_size".type=int, default=1, help="batch size")

    return parser.parse_args()

def set_config(args):
    config = AnalysisConfig(args.model_file, args.params_file)
    config.enable_use_gpu(100, 0)
    config.switch_use_feed_fetch_ops(False)
    config.switch_specify_input_names(True)
    return config

if __name__ == "__main__":
    main()

  • (4) Run the inference task.

# model is the model storage path
python3 infer_resnet.py --model_file=model/model --params_file=model/params

This is the complete process of deploying a model with the Paddle Inference Python API; the code is available on the official website. If you want to learn more about C++ deployment, refer to the C++ example provided on the official website.

The Python example:

https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/inference_deployment/inference/python_infer_cn.html#id6

The C++ example:

https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/inference_deployment/inference/native_infer.html#a-name-c-c-a


05 How to further optimize Paddle Inference performance?


At this point we have a basic inference service up and running. Is that enough to ship? Not for developers who strive for excellence. PaddlePaddle can help users further improve inference performance in the following ways:


Enable MKLDNN to speed up CPU inference


On x86 CPUs, if the hardware supports it, you can turn on DNNL (Deep Neural Network Library, originally named MKL-DNN) optimization. DNNL is Intel’s open-source high-performance computing library for neural network optimization on Intel processors and graphics processors. PaddlePaddle invokes it automatically; you just need to turn it on in the configuration options.

config.enable_mkldnn()
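
For CPU deployment it is often also worth controlling the number of math-library threads. The call below exists on AnalysisConfig to the best of my knowledge, but treat the exact name and the value of 4 threads as assumptions to check against the API docs for your Paddle version.

# Assumed AnalysisConfig call: limit the MKL/DNNL math library to 4 threads
config.set_cpu_math_library_num_threads(4)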

Switch to GPU inference

If you want to use an NVIDIA GPU, only one line of configuration is needed to switch to the GPU automatically.

# Initialize 100 MB of GPU memory on GPU 0. This is an initial value; actual GPU memory use may change dynamically.
config.enable_use_gpu(100, 0)

Enable TensorRT to speed up GPU inference

TensorRT is a high-performance deep learning inference acceleration library that provides low-latency, high-throughput optimization for deep learning inference applications on GPUs. Paddle Inference integrates TensorRT in the form of subgraphs. On top of the GPU configuration above, only one more line is needed to enable Paddle-TensorRT accelerated inference:

config.enable_tensorrt_engine(workspace_size=1 << 30,
                              max_batch_size=1,
                              min_subgraph_size=3,
                              precision_mode=AnalysisConfig.Precision.Float32,
                              use_static=False,
                              use_calib_mode=False)

Enable the Paddle Lite lightweight inference engine

For small models with little computation and short actual inference time, the framework overhead of using Paddle Inference directly may be on the same order of magnitude as the model’s own compute time. In that case, the Paddle Lite subgraph mode can be used to reduce the framework overhead. Paddle Inference integrates Paddle Lite in the form of subgraphs, and only one line of configuration is needed to start Paddle Lite’s inference acceleration engine.

config.enable_lite_engine(precision_mode=AnalysisConfig.Precision.Float32)

06 What scenarios does PaddlePaddle cover for industrial deployment?


Industrial deployment can face diverse deployment environments. For different application scenarios, PaddlePaddle provides three inference deployment solutions:

  • As the native high-performance inference library of the PaddlePaddle deep learning framework, Paddle Inference targets local server deployment scenarios: models are ready to use as soon as they are trained.

  • For service-oriented deployment scenarios, PaddlePaddle provides the Paddle Serving solution. Here the inference module runs as a remote service: the client sends a request and the server returns the inference result. It is the essential solution for cloud deployment.

  • For deployment on edge hardware such as mobile devices and embedded chips, PaddlePaddle provides the Paddle Lite solution to meet high-performance, lightweight deployment requirements.

For more information, visit the project addresses below and explore PaddlePaddle’s powerful industrial deployment capabilities.


07 Related Materials


Paddle Inference project address:

https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/inference

Paddle Lite project address: https://github.com/PaddlePaddle/Paddle-Lite

Paddle Serving project address: https://github.com/PaddlePaddle/Serving

PaddleSlim project address: https://github.com/PaddlePaddle/PaddleSlim

Join the official QQ group to meet a large number of like-minded deep learning developers.

Official QQ group: 703252161.

If you want to learn more about PaddlePaddle, please refer to the following documentation.

Website address: https://www.paddlepaddle.org.cn

PaddlePaddle core framework project address:

GitHub: https://github.com/PaddlePaddle/Paddle

Gitee: https://gitee.com/paddlepaddle/Paddle