By Wang Qifei

Transcript edited by LiveVideoStack

Hi, I’m Qifei Wang, a senior software engineer at Google Research. First of all, I’d like to thank LiveVideoStack for inviting me here. Today, my topic is the latest advances in efficient on-device machine learning.

This presentation consists of five main sections. First, I’ll give a brief introduction to on-device machine learning. Second, I’ll discuss how to build machine learning models suitable for mobile devices. In the third and fourth parts, I’ll cover on-device machine learning optimizations for mobile applications and the latest research on privacy-preserving on-device machine learning, respectively. Finally, I’ll discuss prospects for future work in on-device machine intelligence.

1. On-device machine learning

1.1 What is on-device machine learning?

Thanks to the great success of deep learning, the devices, machines, and things around us are getting smarter and smarter. Devices such as smartphones, home assistants, and wearables, machines such as self-driving cars and drones, and everyday things such as light switches and home sensors are leveraging machine intelligence to support applications such as automatic translation, autonomous driving, smart home, and more. Users can enjoy this machine intelligence anytime and anywhere.

In the early years, because computing resources on mobile devices were so limited, most machine intelligence ran in the cloud. In cloud-based machine intelligence, source data is sent to the cloud for inference, and the results are then downloaded to the local device. Such cloud-to-client systems can suffer from latency, privacy, and reliability problems. More recently, however, we have seen a trend of moving inference from the cloud to the edge to address these issues.

1.2 Why do we need on-device machine learning?

In cloud-based machine intelligence applications, long interaction delays between users and devices are usually caused by unstable network bandwidth. Moving machine intelligence onto the client provides low and stable interaction latency.

Machine intelligence also requires access to private user data, such as pictures, documents, emails, and voice. Uploading all of this data to the cloud raises privacy and security concerns. Because on-device machine intelligence processes all data on the local device, it can protect users’ private data from malware attacks.

Finally, moving intelligent computing onto the device keeps intelligent services always available, even when the network is unavailable or cloud services are down.

Therefore, on-device machine intelligence, together with privacy-preserving cloud computing, has become a key research direction at the intersection of machine intelligence and mobile computing, balancing latency, reliability, privacy, and performance.

1.3 On-device inference

Basically, on-device intelligence is achieved by running deep learning inference on the device itself, using input signals from the device’s sensors, such as the camera, the microphone, and all the other sensors. The model runs entirely on the device without having to communicate with a server.

1.4 The challenges

Limited computing resources

Although machine learning on end devices shows great advantages, it still faces many challenges. The primary challenge is limited computing resources. Over the past few decades, we have seen the computing power of mobile chipsets increase in line with Moore’s Law. However, compared with a cluster or a distributed computing system, the computing resources of a single device are still very limited and cannot meet the growing computing demands of emerging applications.

Limited power

Today, users are using their devices more than ever, and each new phone increases battery capacity and supports faster charging. Nevertheless, the limited power budget of the device remains a major challenge for long battery life.

Device overheating

In addition, high power consumption often causes the device to overheat, especially for wearables, which affects the user experience and raises safety concerns.

Experimental data show that the floating-point computation used in machine learning consumes more power than integer computation. To run inference quickly while reducing power consumption and memory usage, we must optimize machine intelligence models to meet the power, memory, and latency constraints of on-device applications.

2. Building machine learning models suitable for mobile devices

Now, let’s talk about building smart models for mobile.

2.1 Model efficiency

Before diving into the details of developing mobile intelligent models, let’s look at the performance of traditional server-side intelligent models versus mobile intelligent models. The upper figure shows model size versus accuracy; the lower figure shows accuracy versus latency. The red dotted box marks traditional intelligent models, and the blue dotted box marks mobile intelligent models. As you can see from the figure, traditional server-side models such as Google Inception are far heavier than MobileNet models in terms of both model size and inference latency. As a result, the traditional models are too heavy to be used in mobile applications.

2.2 MobileNet V1

In 2017, Google released MobileNet, its well-known on-device deep learning architecture. One of its major contributions is to replace standard convolutions with depthwise separable convolutions. As shown in the figure on the left, a depthwise separable convolution decomposes a standard convolution into two separate operations:

In the first step, it convolves each of the M input channels with its own single-channel kernel (a depthwise convolution);

In the second step, it applies N 1×1 pointwise convolutions to the output of the first step to combine the channels, rather than convolving the input with N different full-depth kernels as a standard convolution does.

Doing so reduces the model’s computational complexity and number of parameters by roughly a factor of 10 while keeping accuracy comparable to the latest server-side intelligent models such as Inception.

In addition, MobileNet V1 can scale the model size uniformly through a global width multiplier; a minimal code sketch of both ideas follows.
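To make the decomposition concrete, here is a minimal Keras sketch (the channel counts, input size, and width multiplier value are illustrative, not taken from the talk) comparing a standard 3×3 convolution with its depthwise separable counterpart and printing the parameter counts:

```python
import tensorflow as tf

M, N, ALPHA = 32, 64, 1.0  # input channels, output channels, width multiplier
inp = tf.keras.Input(shape=(112, 112, int(M * ALPHA)))

# Standard convolution: N kernels, each spanning all M input channels.
standard = tf.keras.layers.Conv2D(int(N * ALPHA), 3, padding="same")(inp)

# Depthwise separable convolution:
#   step 1 - depthwise: one 3x3 kernel per input channel;
#   step 2 - pointwise: N 1x1 convolutions that mix the channels.
depthwise = tf.keras.layers.DepthwiseConv2D(3, padding="same")(inp)
separable = tf.keras.layers.Conv2D(int(N * ALPHA), 1, padding="same")(depthwise)

std_params = tf.keras.Model(inp, standard).count_params()
sep_params = tf.keras.Model(inp, separable).count_params()
print(std_params, sep_params, std_params / sep_params)  # ~8x fewer parameters in this example
```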

2.3 MobileNet V3

In 2019, researchers designed the new MobileNet V3 platform. It builds new MobileNet models by searching for model architectures in a hardware-aware way. The platform constructs intelligent models by combining network adaptation with mobile neural architecture search, using objective functions that incorporate target latency, memory, and power consumption.

2.4 MobileNet Performance Benchmarking

As shown, researchers keep improving the performance of on-device machine learning models with MobileNet V3 and efficient neural architecture search. These on-device intelligent models have achieved accuracy similar to the latest server-side intelligent models, but with much lower computational complexity. More specifically, MobileNet V3 achieves the highest accuracy under the tightest computational complexity constraints. The MobileNet architecture has thus become a reference and benchmark for on-device intelligent models.

MLPerf

In addition, I’d like to introduce you to MLPerf, a machine learning performance benchmark. This is an open platform for researchers to publish the latest performance benchmarks for intelligent models on different hardware platforms, including accuracy, latency, memory footprint, and power consumption.

Each test result covers the most common tasks, including image classification, object detection, image segmentation, and natural language processing on the most popular data sets. Based on these benchmarks, users can easily view model performance and select the appropriate model for their application.

2.5 TFLite

On the other hand, Google has released TFLite, an on-device intelligence infrastructure: a lightweight machine learning library and toolset for mobile and embedded devices. It is embedded in the TensorFlow ecosystem, allowing developers to convert trained TensorFlow models to the TFLite model format through a built-in converter. The converted TFLite model can then be used to build cross-platform applications.
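As a minimal sketch of that conversion path (the SavedModel directory is a placeholder), a trained TensorFlow model can be converted to the TFLite format like this:

```python
import tensorflow as tf

# Convert a trained model (SavedModel format) into the TFLite flat-buffer format.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")  # placeholder path
tflite_model = converter.convert()

# The resulting byte buffer can be bundled with a mobile or embedded app.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```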

On Android, the Android Neural Networks API provides a native interface for running TFLite models, and TFLite provides an interpreter for developers, who can use the C++ and Java APIs to invoke the model on the device for inference. On iOS, users can call the interpreter directly through C++.
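The same interpreter workflow is also exposed in Python, which is convenient for prototyping before wiring up the Java or C++ bindings; a minimal sketch, assuming a float32 model file like the one produced above:

```python
import numpy as np
import tensorflow as tf

# Load the TFLite model and allocate its input/output tensors.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy input with the shape the model expects and run inference.
dummy = np.random.random_sample(input_details[0]["shape"]).astype(np.float32)
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]["index"])
```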

TFLite inference is faster

TFLite stands out for on-device machine learning with the following features. First, its unified model format, based on FlatBuffers, is compatible across platforms. Second, it pre-fuses activation and bias computations, an optimization for mobile. In addition, it provides kernels optimized for ARM NEON, which significantly speeds up execution. Finally, it supports post-training quantization. As one of the most popular model optimization methods, quantization converts floating-point coefficients to integers. Typically, quantization reduces model size by a factor of four and speeds up execution by 10-50%.
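A minimal sketch of post-training quantization with the TFLite converter (the representative-dataset generator is a placeholder you would replace with a few hundred real input samples):

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Placeholder: yield realistic inputs so the converter can calibrate
    # the value ranges used for integer quantization.
    for _ in range(100):
        yield [np.random.random_sample((1, 224, 224, 3)).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]        # enable post-training quantization
converter.representative_dataset = representative_dataset   # needed for full-integer quantization
quantized_model = converter.convert()
```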

As you can see from the figure, models quantized with TFLite’s own quantization tool show significantly reduced inference time for the MobileNet-family models and the Inception V3 model. Moreover, with post-training quantization, developers can take advantage of the latest models without having to retrain them from scratch.

Model compression

TFLite also recently released Learn2Compress, a comprehensive library for squeezing traditional large models into smaller models suitable for on-device scenarios. This technology takes a user-provided, pre-trained, large TensorFlow model as input, then trains, optimizes, and automatically generates a ready-to-use on-device intelligent model that is smaller, more memory-efficient, more energy-efficient, and faster at inference, with minimal loss of accuracy. Specifically, model compression is achieved by removing the weights or operations that are least useful to the prediction (such as low-scoring weights).

It also applies 8-bit quantization as well as joint model training and model distillation to obtain compact small models from large models. For image classification, Learn2Compress can generate small, fast models with good prediction accuracy suitable for mobile applications. For example, on the ImageNet task the Learn2Compress model is 22 times smaller than the Inception V3 base model and 4 times smaller than the MobileNet V1 base model, with only 4.6-7% lower accuracy.
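To illustrate the distillation part of this recipe, here is a minimal sketch under typical assumptions (a frozen teacher guiding a much smaller student with a softened cross-entropy term; the models, temperature, and loss weighting are illustrative, not the actual Learn2Compress recipe):

```python
import tensorflow as tf

# Stand-ins: a large "teacher" and a much smaller "student" (width multiplier 0.25).
teacher = tf.keras.applications.MobileNet(weights=None, classifier_activation=None)
student = tf.keras.applications.MobileNet(alpha=0.25, weights=None, classifier_activation=None)
teacher.trainable = False

temperature, mix = 4.0, 0.5
hard_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
soft_loss = tf.keras.losses.KLDivergence()
optimizer = tf.keras.optimizers.Adam()

@tf.function
def distill_step(images, labels):
    teacher_probs = tf.nn.softmax(teacher(images, training=False) / temperature)
    with tf.GradientTape() as tape:
        student_logits = student(images, training=True)
        student_probs = tf.nn.softmax(student_logits / temperature)
        # Blend the ordinary label loss with the softened teacher-matching loss.
        loss = mix * hard_loss(labels, student_logits) + (1 - mix) * soft_loss(teacher_probs, student_probs)
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss

# Toy call with random data, just to show the expected shapes.
distill_step(tf.random.uniform((2, 224, 224, 3)), tf.constant([0, 1]))
```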

TFLite Task API

In addition to a stable framework and advanced learning techniques, TFLite also provides a set of powerful, easy-to-use tool libraries that application developers can use to create ML experiences with TFLite. It provides optimized, ready-to-use model interfaces for popular machine learning tasks, including BERT-based natural language classifiers and question answerers, as well as vision task APIs including classifiers, detectors, and segmenters.

The TFLite Task Library works across platforms and is developed and supported in Java, C++, and Swift. The TFLite Task API provides four main benefits. First, it offers a concise, well-defined API that non-ML experts can use. Second, it is highly extensible and customizable, allowing developers to build their own Android and iOS applications without needing to know the model internals. Third, it ships a powerful yet generic data-processing library that supports common vision and natural language processing logic for converting between user data and the data formats required by the model, with processing logic reusable for both training and inference. Finally, it achieves high performance through optimized processing pipelines, with data processing taking no more than a few milliseconds, ensuring a fast inference experience with TFLite. All the models used in the Task Library are backed by Google Research. Next, I’ll discuss how to use the TFLite Task API to build on-device machine intelligence applications.

Run the TFLite Task APIs from Java

Here, I’ll show an example of an Android client using the TFLite Task API. The Android client calls the Java interface to pass in the input, which is forwarded through the native API to the model. After the model runs inference, the output is sent back to the Java interface and returned to the Android client.

In the example, the user needs to copy the model file to a local directory on the device:

Step 1: Import the Gradle dependencies and other settings for the model file.

Step 2: Create an object detector using the ObjectDetector options and run synchronous inference by calling the detection method. For an end-to-end system design, the MediaPipe framework can be used for both synchronous and asynchronous designs; refer to the open-source MediaPipe project for details on building an end-to-end vision system.
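Although the example here uses Java, the Task Library also ships Python bindings in the tflite-support package, which are handy for trying a model before building the app; a minimal sketch of the same two steps, with the model path, thresholds, and image path as placeholders:

```python
from tflite_support.task import core, processor, vision

# Step 1 equivalent: point the Task Library at a local TFLite detection model.
base_options = core.BaseOptions(file_name="object_detector.tflite")  # placeholder model path

# Step 2: create the ObjectDetector from options and run synchronous inference.
detection_options = processor.DetectionOptions(max_results=3, score_threshold=0.3)
options = vision.ObjectDetectorOptions(base_options=base_options,
                                       detection_options=detection_options)
detector = vision.ObjectDetector.create_from_options(options)

image = vision.TensorImage.create_from_file("test_image.jpg")  # placeholder image path
result = detector.detect(image)
for detection in result.detections:
    category = detection.categories[0]
    print(category.category_name, category.score)
```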

3. On-device machine learning optimization for mobile applications

It looks like we have already done a good job of enabling the community to build on-device machine intelligence applications. Can we do better? The answer is yes.

3.1 Hardware Acceleration

A major effort currently under way in the on-device machine learning community is to accelerate ML inference through hardware accelerators such as GPUs, Edge TPUs, and DSPs. The image above shows some of the hardware accelerators recently developed for mobile devices. As you can see from the chart, the performance of the latest chipsets, such as the HiSilicon Kirin 980, the Snapdragon 855, and the MediaTek P90, has improved significantly. This exciting news will encourage developers to build more applications for end devices.

The figure on this slide shows power benchmarks for basic filtering and image analysis operations running on ARM GPUs and FPGAs: compared with running on CPUs, the implementations optimized for GPUs and FPGAs have significant advantages in reducing energy cost.

For Filter2D, one of the most common operations in deep learning, running on a GPU cuts power consumption roughly in half compared with the CPU, and running on an FPGA further reduces it to about a quarter of the CPU’s consumption.

We list benchmarks for different hardware platforms running mobile models such as MobileNet and popular server-side models such as Inception. Running MobileNet V1 and V2 on a desktop CPU takes approximately 45 ms; running them jointly on the CPU and an FPGA yields a significant, roughly 20-fold reduction.

In addition, running MobileNet V1 and V2 on an embedded CPU such as a quad-core Cortex-A53 takes more than 150 ms, compared with less than 2.5 ms on an Edge TPU.

By comparing the Inception models running on a CPU and on an Edge TPU, we can also observe that latency on the Edge TPU is significantly lower than on the CPU.

Remarkably, the significant latency reductions above were achieved with the tiny chipset shown in the image on the right.

EfficientNet-EdgeTPU

Here, we show an example of building a hardware-accelerated on-device machine learning model using automated machine learning. We used EfficientNet, one of the most advanced mobile neural network architectures, as the basis for this work. To build EfficientNets that take advantage of the Edge TPU’s accelerator architecture, we invoked the AutoML neural architecture search framework and extended the original EfficientNet search space with building blocks that execute efficiently on the Edge TPU. We also built and integrated a “latency predictor” module that estimates model latency on the Edge TPU by running the model on a cycle-accurate architectural simulator. The AutoML controller uses a reinforcement learning algorithm to search for architectures, attempting to maximize a joint reward function of the predicted latency and the model accuracy.

We know from past experience that the Edge TPU’s power efficiency and performance are maximized when the model fits in its on-chip memory. Therefore, we also modified the reward function to give a higher reward to models that satisfy this constraint.
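A minimal sketch of what such a latency-aware reward could look like (the target latency, exponent, and memory bonus are illustrative placeholders in the style of multi-objective NAS rewards, not the exact function used for EfficientNet-EdgeTPU):

```python
def nas_reward(accuracy: float,
               predicted_latency_ms: float,
               fits_on_chip: bool,
               target_latency_ms: float = 5.0,   # placeholder latency target
               latency_exponent: float = -0.07,  # strength of the soft latency penalty
               memory_bonus: float = 1.1) -> float:
    """Joint reward over accuracy, predicted latency, and on-chip memory fit."""
    # Accuracy scaled by a soft latency penalty: slower-than-target models are discounted.
    reward = accuracy * (predicted_latency_ms / target_latency_ms) ** latency_exponent
    # Extra reward for models whose parameters fit in the Edge TPU's on-chip memory.
    return reward * memory_bonus if fits_on_chip else reward

# Example: a slightly slower model must make up for it with higher accuracy.
print(nas_reward(0.75, 4.0, True), nas_reward(0.76, 6.5, False))
```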

The resulting EfficientNet-EdgeTPU-Small/Medium/Large models deliver better latency and accuracy than the existing EfficientNet, ResNet, and Inception models, thanks to a network architecture specialized for the Edge TPU hardware. The EfficientNet-EdgeTPU-Small model, for example, achieves higher accuracy while running about 10 times faster.

As a widely used on-device inference platform, TFLite also supports native hardware acceleration. Here, we show an example of the MobileNet V1 TFLite model running on a CPU, a GPU, and an Edge TPU.

Overall, running the floating-point MobileNet V1 on the CPU takes about 124 milliseconds per frame. Running the quantized MobileNet V1 on the CPU is 1.9 times faster than the floating-point model, and running the floating-point model on the GPU is 7.7 times faster than the CPU, at only about 16 milliseconds per frame.

Finally, running the quantized model on the Edge TPU takes only 2 milliseconds, which is 62 times faster than the floating-point model on the CPU. From this we can conclude that hardware acceleration can significantly improve model inference in terms of latency, power consumption, and memory.
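In TFLite, these accelerators are reached through delegates. As a minimal sketch, this is how a quantized, Edge-TPU-compiled model could be handed to the Edge TPU delegate from Python (the delegate library name assumes a Linux host with the Edge TPU runtime installed; the model file name is a placeholder):

```python
import tflite_runtime.interpreter as tflite

# Load the Edge TPU delegate and attach it to the interpreter so that
# supported ops run on the accelerator instead of the CPU.
delegate = tflite.load_delegate("libedgetpu.so.1")  # platform-specific library name
interpreter = tflite.Interpreter(
    model_path="mobilenet_v1_quant_edgetpu.tflite",  # placeholder: Edge-TPU-compiled model
    experimental_delegates=[delegate],
)
interpreter.allocate_tensors()
# From here on, set_tensor / invoke / get_tensor work exactly as on the CPU.
```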

4. Privacy-aware on-device machine learning

Have we achieved the ultimate goal of on-device machine intelligence? We are just getting started.

4.1 On-device data is meaningful

As we mentioned at the beginning, data privacy is another major reason for moving machine intelligence onto end devices. However, training the latest on-device machine intelligence models still has to be done on the server side. A typical application: for a machine to recognize an animal such as a dog, we can train the model with public training images like the one on the left, but we usually need to use the model in extremely challenging scenarios like the one shown on the right. So how do we achieve high accuracy in challenging, personalized everyday use cases? A simple solution is to collect users’ private images and retrain the model in a centralized data center. Big companies like Google have built highly secure and powerful cloud infrastructure to process such data and provide better services, but this is clearly still not the best solution, because it means using users’ private data, which may contain sensitive information such as users’ faces or living spaces. How can we improve model personalization while protecting user privacy?

4.2 Federated Learning

Now, for models that are trained through users’ interaction with their mobile devices, we will introduce another approach: federated learning.

Federated learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on the device, thus decoupling the ability to do machine learning from the need to store the data in the cloud. It goes beyond using local models for prediction on mobile devices by bringing model training onto the device as well. Here’s how it works: the user’s device downloads the current model, improves it by learning from the data on the phone, and then summarizes the changes as a small, focused update. Only this model update is sent to the cloud, using encrypted communication, where it is immediately averaged with updates from other users to improve the shared model. All training data remains on the user’s device, and no individual update is stored in the cloud. Federated learning thus provides smarter models, lower latency, and lower power consumption while preserving privacy.
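A minimal sketch of the federated averaging idea behind this loop, using plain NumPy weight vectors and a hypothetical local_update helper (the real deployment uses on-device TensorFlow training and encrypted, aggregated uploads, which this toy version omits):

```python
import numpy as np

def local_update(global_weights: np.ndarray, local_data: np.ndarray) -> np.ndarray:
    """Hypothetical client step: nudge the shared model toward the on-device data
    and return only the small update; the raw data never leaves the device."""
    local_weights = global_weights + 0.1 * (local_data.mean(axis=0) - global_weights)
    return local_weights - global_weights

def federated_round(global_weights: np.ndarray, client_datasets) -> np.ndarray:
    """One round of federated averaging: the server only sees averaged updates."""
    updates = [local_update(global_weights, data) for data in client_datasets]
    return global_weights + np.mean(updates, axis=0)

# Toy usage: three clients, each holding private data that stays "on device".
rng = np.random.default_rng(0)
clients = [rng.normal(size=(20, 4)) for _ in range(3)]
weights = np.zeros(4)
for _ in range(10):  # several communication rounds
    weights = federated_round(weights, clients)
```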

Another immediate advantage of this approach is that, in addition to contributing an update to the shared model, the improved model on your phone can be used right away, providing an experience personalized to the way you use your phone.

Test federated learning with Gboard

We are currently testing federated learning in Gboard, Google’s keyboard app for Android. When Gboard displays a suggested query, your phone locally stores information about the current context and whether the suggestion was accepted. Federated learning processes this history on the device to suggest improvements for the next iteration of Gboard’s query suggestion model.

For Gboard, which has millions of users, deploying this technology across so many different devices is a very challenging task. In the actual deployment, we use a miniature version of TensorFlow to perform model training on the device, with carefully planned scheduling to ensure that training runs only when the device is idle, plugged into power, and on a free wireless connection, so that it does not affect the phone’s performance.

5. Future work

It looks like we have already achieved a lot, so what does the future hold?

In the past, all training and inference was done in centralized cloud systems, which led to growing concerns about privacy, latency, and reliability. Today, we have partially distributed machine intelligence: inference runs on efficient smart devices, while models are still trained in centralized data centers and then deployed to run locally.

In the near future, with federated learning technologies, we will have fully distributed AI that addresses privacy issues and supports lifelong on-device learning. The low latency and high capacity of 5G, which is now being deployed globally, will also allow AI processing to be distributed among devices, edge clouds, and the central cloud, providing flexible hybrid system solutions for a variety of new and enhanced experiences.

This wireless edge architecture is adaptive and can be tuned appropriately for each use case. For example, performance and economic trade-offs may help determine how to allocate workloads to meet the latency or computation requirements of a particular application. By then, we can expect to see many emerging applications in the IoT (Internet of Things), smart cities, and personalization.

Conclusion

In this presentation, we briefly outlined the opportunities and challenges of on-device machine learning. We then discussed resource-efficient computation for on-device machine learning, introducing mobile model architectures, the model compression techniques used by the TFLite framework, and the open-source machine learning Task API for building on-device machine intelligence applications. Finally, we introduced the latest advance in privacy-preserving on-device machine learning, federated learning, and pointed out future directions for on-device artificial intelligence.