Author: Zhou Shifu, Agora

In recent years, with the rapid development of AI technology, there has been an increasingly strong demand to deploy AI models on mobile devices. Deploying AI models on mobile devices brings several benefits:

  1. For companies, there is no need to run cloud servers for the AI model, which can greatly reduce operation and service costs.
  2. For users, data such as videos, pictures, audio, and text can be processed locally on the device instead of being uploaded to a cloud server, which effectively protects user privacy.

However, AI models are faced with the following challenges and limitations when deployed on mobile devices:

  1. Mobile devices have limited computing capacity. Although the CPU, GPU, and NPU of current mobile devices can provide considerable computing power, it is still insufficient for many AI models. Some cloud-side models cannot run on mobile devices at all, and even when they can, the user experience is compromised.
  2. Mobile devices are sensitive to the storage size of AI models. Take the VGG16 model as an example: its size is about 500 MB. Few users are willing to download and use an app that bundles such a large model.
  3. Mobile devices cannot provide enough memory for large AI models. In addition to heavy computation and large model files, an AI model may consume anywhere from a few GB to over ten GB of memory at runtime.

Therefore, in order to successfully deploy AI models on mobile devices and provide a good user experience, it is necessary to continuously explore and optimize lightweight model design, model conversion, model quantization, and inference acceleration.

Lightweight model design

In the structural design stage, an AI model should be designed around the hardware characteristics of the target mobile device. On the premise of meeting the model's performance requirements, the number of layers and the number of channels per layer should be kept as small as possible. Convolution kernels should also be kept small: 5×5 and 7×7 kernels should not be used in large quantities, and 3×3 and 1×1 kernels should dominate. Lightweight model design can refer to MobileNet and ShuffleNet.

MobileNet, Google's lightweight model family for mobile devices, has gone through three iterations. The core idea of MobileNet V1 is to factor the standard convolution into a depthwise convolution followed by a 1×1 pointwise convolution. MobileNet V2 introduces a residual structure on top of V1, expands the dimension first, and uses a linear activation in the pointwise convolution at the output end. MobileNet V3 uses Neural Architecture Search (NAS) and introduces a lightweight attention module based on the Squeeze-and-Excitation structure.

ShuffleNet is a lightweight model family proposed by Megvii and has gone through two iterations. The core of ShuffleNet V1 is two operations, pointwise group convolution and channel shuffle, which greatly reduce the computational cost of the model while maintaining accuracy. ShuffleNet V2 observes that models with the same FLOPs (floating-point operations) can run at very different speeds, and points out that memory access cost (MAC) together with FLOPs determines the actual running speed.
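To make the convolution-factoring idea concrete, below is a minimal sketch in PyTorch (not from the original article) comparing the parameter count of a standard 3×3 convolution with a MobileNet V1-style depthwise separable convolution. The channel sizes (64 in, 128 out) are chosen only for illustration.

```python
import torch
import torch.nn as nn

# Standard 3x3 convolution: every output channel looks at every input channel.
standard = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)

# Depthwise separable convolution (MobileNet V1 style):
# 1) depthwise 3x3 convolution, one filter per input channel (groups=in_channels),
# 2) pointwise 1x1 convolution to mix information across channels.
depthwise_separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64),
    nn.Conv2d(64, 128, kernel_size=1),
)

def count_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, 64, 56, 56)
assert standard(x).shape == depthwise_separable(x).shape  # same output shape

print(count_params(standard))             # 73,856 parameters
print(count_params(depthwise_separable))  # 8,960 parameters, roughly 8x fewer
```

The roughly 8x reduction in parameters (and a similar reduction in multiply-accumulate operations) is what makes this factorization attractive on mobile CPUs.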

Model conversion

Currently, the mainstream open-source training frameworks for deep learning are TensorFlow and PyTorch. TensorFlow has been widely adopted in industry thanks to its first-mover advantage and mature community, while PyTorch has surpassed TensorFlow in academia thanks to its ease of use and excellent performance. Although both TensorFlow and PyTorch offer mobile versions, their inference speed is limited and cannot meet the requirements of on-device inference. After several years of development, the mobile inference ecosystem has settled into a landscape dominated by a few frameworks. At present, the mainstream mobile inference frameworks are NCNN, MACE, MNN, and TNN.

NCNN, the first open-source project of Tencent Youtu Lab, is a high-performance neural network forward-inference framework optimized for mobile platforms. From the beginning of its design, NCNN has deeply considered deployment and use on mobile devices. It has no third-party dependencies, is cross-platform, and its mobile CPU performance is faster than that of all known open-source frameworks.

MACE is a neural network computing framework launched by Xiaomi for mobile heterogeneous computing platforms. It is specially optimized for performance, power consumption, system responsiveness, memory usage, model encryption and protection, and the range of supported hardware.

MNN is an efficient and lightweight deep learning framework from Alibaba. It supports both inference and training of deep models, with industry-leading on-device inference and training performance. At present, MNN is used in more than 20 Alibaba apps, such as Mobile Taobao, Mobile Tmall, Youku, DingTalk, and Xianyu, covering more than 70 scenarios such as live streaming, short video, search recommendation, product image search, interactive marketing, benefit distribution, and security risk control. It is also used in several IoT scenarios.

TNN, built by Tencent Youtu Lab, is a high-performance, lightweight inference framework for mobile devices, with prominent advantages such as cross-platform support, high performance, model compression, and code tailoring. Based on the original RapidNet and NCNN frameworks, TNN further strengthens support for and performance optimization on mobile devices, while also drawing on the high performance and good extensibility of mainstream open-source frameworks in the industry.

The mobile inference frameworks NCNN, MACE, MNN, and TNN cannot load TensorFlow or PyTorch models directly. The TensorFlow or PyTorch model must first be converted to ONNX (Open Neural Network Exchange), and the ONNX model is then converted to the format supported by NCNN, MACE, MNN, or TNN. ONNX is a standard for representing deep learning models that allows models to be transferred between frameworks.
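As a rough illustration of the first step of this pipeline, the sketch below exports a PyTorch model to ONNX. The model choice (torchvision's MobileNet V2), file name, and input shape are assumptions for the example; the resulting .onnx file would then be passed to the converter tool shipped with the chosen mobile framework (for example, onnx2ncnn for NCNN or MNNConvert for MNN).

```python
import torch
import torchvision

# Build the model to be deployed (weights omitted here; in practice,
# load your trained weights before exporting).
model = torchvision.models.mobilenet_v2().eval()

# Mobile inference usually assumes a fixed input shape.
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX; the .onnx file is the interchange format fed to
# the framework-specific converter (NCNN, MACE, MNN, or TNN).
torch.onnx.export(
    model,
    dummy_input,
    "mobilenet_v2.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=11,
)
```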

Mobile Deployment

Convert the TensorFlow or PyTorch model to NCNN, MACE, MNN, or TNN format via ONNX, and compile the mobile inference framework into a library for iOS or Android. To further improve inference speed, the convolution layer and the activation layer can be fused within the inference framework to reduce the I/O overhead between network layers. Float32 weights can also be quantized to int8, which reduces the model size by about 4x and increases inference speed by 2-3x with little to no loss of accuracy.
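The sketch below illustrates the arithmetic behind float32-to-int8 quantization, using a simple symmetric per-tensor scheme on random weights. Real mobile frameworks perform this step (often per-channel, with calibration data for activations) inside their own conversion tools, so this is only meant to show where the 4x size reduction comes from.

```python
import numpy as np

# Pretend these are the float32 weights of one convolution layer.
weights = np.random.randn(128, 64, 3, 3).astype(np.float32)

# Symmetric per-tensor quantization: map the float range onto [-127, 127].
scale = np.abs(weights).max() / 127.0
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# At inference time the int8 values are used with the scale factor;
# dequantizing here only shows how small the rounding error is.
dequantized = q_weights.astype(np.float32) * scale

print(weights.nbytes / q_weights.nbytes)    # 4.0 -> int8 storage is 4x smaller
print(np.abs(weights - dequantized).max())  # small per-weight quantization error
```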