By the Youdao Technology Team

In recent years, the Youdao technology team has done a great deal of exploration and applied work on real-time AI capabilities on mobile devices. After Google released TensorFlow Lite (TFLite) in November 2017, the Youdao technology team immediately began tracking the TFLite framework and soon applied it to Youdao Cloud Note. This article describes how we use TFLite for document recognition in Youdao Cloud Note and what TFLite has to offer.

An introduction to document recognition

1. Definition of document recognition

Document recognition was originally a problem we faced when developing the document scanning feature of Youdao Cloud Note. The document scanning feature aims to detect the document region in a photo taken by the user, stretch it back to its original proportions, recognize the text in it, and finally produce either a clean image or a formatted text note. Implementing this feature requires the following steps:

  1. Identify the document region: separate the document from the background and locate its four corners;

  2. Stretch the document region to restore its aspect ratio: from the coordinates of the four corners and the principles of perspective, compute the document's original aspect ratio and stretch the document region back into a rectangle (a minimal warp sketch follows this list);

  3. Color enhancement: choose a color enhancement method according to the document type so that the document image looks clean and tidy;

  4. Layout recognition: understand the layout of the document image and locate its text regions;

  5. OCR: convert the text, which so far exists only as pixels, into encoded text;

  6. Note generation: combine the OCR results with the layout of the document image to generate a formatted note.
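
For step 2, once the four corners and the recovered aspect ratio are known, the stretch itself is a standard perspective warp. Below is a minimal C++ sketch assuming OpenCV is available (the article does not say which image library the product actually uses; the corner order and output size are illustrative assumptions):

    #include <opencv2/imgproc.hpp>

    // Warp the quadrilateral given by `corners` (top-left, top-right,
    // bottom-right, bottom-left) into an upright outW x outH rectangle.
    // outW/outH are assumed to come from the aspect-ratio estimation of step 2.
    cv::Mat RectifyDocument(const cv::Mat& image, const cv::Point2f corners[4],
                            int outW, int outH) {
      const cv::Point2f target[4] = {
          {0.0f, 0.0f},
          {static_cast<float>(outW - 1), 0.0f},
          {static_cast<float>(outW - 1), static_cast<float>(outH - 1)},
          {0.0f, static_cast<float>(outH - 1)}};
      // 3x3 homography mapping the document corners onto the rectangle.
      cv::Mat transform = cv::getPerspectiveTransform(corners, target);
      cv::Mat rectified;
      cv::warpPerspective(image, rectified, transform, cv::Size(outW, outH));
      return rectified;
    }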

Document recognition is the first step in document scanning and also the most complex part of this scenario.

2. The role of document recognition in Youdao's AI technology matrix

In recent years, building on deep neural network algorithms, Youdao has done a series of work on processing and understanding natural language, images, speech, and other media, producing technologies such as neural network-based multi-language translation, OCR (optical character recognition), and speech recognition. Combined, these technologies give our products the ability to let users record content in whatever way feels most natural and comfortable, use technology to understand that content, and unify it into text for further processing. Seen this way, our technologies form a network centered on natural language, in which different media forms can be converted into one another.

Document recognition is a small but indispensable link in the chain that converts images into text. With it, we can accurately find the document to be processed in a vast sea of images and extract it for further processing.

3. An introduction to the document recognition algorithm

Our document recognition algorithm is based on an FCNN (fully convolutional neural network), a special kind of CNN that produces an output for every pixel of the input image (by contrast, an ordinary CNN produces one output per input image). We can therefore annotate a batch of images containing documents, marking the pixels near the document edges as positive samples and everything else as negative samples. During training, the images are fed into the FCNN, and its output is compared with the annotations to obtain the training loss, which drives the training. For more details on the document recognition algorithm, see the Youdao technology team article "Document Scanning: Deep Neural Network Practice on Mobile".
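
The training itself runs in TensorFlow on the server, but to make the per-pixel comparison concrete, here is a small, purely illustrative C++ sketch of the kind of pixel-wise binary cross-entropy loss described above (not the actual training code):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // `pred` holds the network's per-pixel edge probabilities; `label` is 1 for
    // pixels near a document edge (positive samples) and 0 elsewhere (negative).
    float PixelwiseCrossEntropy(const std::vector<float>& pred,
                                const std::vector<float>& label) {
      const float eps = 1e-7f;  // avoid log(0)
      float loss = 0.0f;
      for (size_t i = 0; i < pred.size(); ++i) {
        const float p = std::min(std::max(pred[i], eps), 1.0f - eps);
        loss -= label[i] * std::log(p) + (1.0f - label[i]) * std::log(1.0f - p);
      }
      return loss / static_cast<float>(pred.size());
    }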

Since the core of the algorithm is a CNN, the operators used in the document scanning algorithm are mainly those common in CNNs: convolution layers, depthwise convolution layers, fully connected layers, pooling layers, ReLU layers, and so on.

4. Document recognition and TensorFlow

There are many frameworks for training and deploying CNN models. We chose TensorFlow for the following reasons:

  1. TensorFlow provides a comprehensive and large set of operators, and creating new operators yourself is not difficult. In the early stages of algorithm development you need to try all kinds of network structures and all kinds of unusual operators, and a framework with comprehensive operator support saves a lot of effort here;

  2. TensorFlow covers multiple platforms such as servers, Android, and iOS, with complete operator support on each platform;

  3. TensorFlow is the more mainstream choice, which means it is easier to find an off-the-shelf solution online when you run into problems.

5. Why use TFLite for document recognition

Before TFLite was released, the document recognition feature in Youdao Cloud Note was based on TensorFlow Mobile. When TFLite came out, we wanted to migrate to it. The main driver of the migration was the size of the linked library.

After compression, the TensorFlow dynamic library on Android is about 4.5 MB. To cover the various processor architectures on the Android platform, you may need to package about four dynamic libraries, for a combined size of roughly 18 MB. The TFLite library, by contrast, is only around 600 KB; even packaging libraries for four architectures takes only about 2.5 MB. This matters a great deal in a mobile app, where every bit of space is precious.

An introduction to TFLite

1. What is TFLite

TFLite is a neural network computing framework for mobile and embedded devices that Google released as a developer preview on November 5, 2017. Compared with TensorFlow, it has the following advantages:

  • Lightweight. As mentioned above, the linked libraries produced by TFLite are small;

  • Few dependencies. TensorFlow Mobile requires libraries such as Protobuf to compile, whereas TFLite needs no large dependency libraries;

  • It can use mobile hardware acceleration. TFLite can be hardware-accelerated through the Android Neural Networks API (NNAPI); as long as the acceleration chip supports NNAPI, it can speed up TFLite. On most Android phones today, however, TFLite still runs on the CPU.

2. TFLite code structure

As users of TFLite, we have also explored its code structure, which we share here.

Currently, the TFLite code lives in the tensorflow/contrib/lite folder of the TensorFlow project, which contains header/source files and several subfolders.

Some of the more important headers are:

  • model.h: classes and methods related to model files. FlatBufferModel reads and holds the model content, and InterpreterBuilder parses the model content;

  • interpreter.h: provides the Interpreter class used to run inference; this is the class we deal with most often (a minimal usage sketch follows this list);

  • context.h: provides the TfLiteContext struct, which stores tensors and some state; in practice it is usually wrapped inside the Interpreter.
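
To show how these classes fit together, here is a minimal C++ usage sketch based on the contrib/lite-era API (the header paths follow the folder layout described here; the model file name is a placeholder):

    #include <memory>

    #include "tensorflow/contrib/lite/interpreter.h"
    #include "tensorflow/contrib/lite/kernels/register.h"
    #include "tensorflow/contrib/lite/model.h"

    int main() {
      // Load the FlatBuffers model from disk ("doc_edge.tflite" is a placeholder name).
      auto model = tflite::FlatBufferModel::BuildFromFile("doc_edge.tflite");

      // The resolver maps the operator codes stored in the model to kernel implementations.
      tflite::ops::builtin::BuiltinOpResolver resolver;

      // InterpreterBuilder parses the model content and produces an Interpreter.
      std::unique_ptr<tflite::Interpreter> interpreter;
      tflite::InterpreterBuilder(*model, resolver)(&interpreter);

      // interpreter->UseNNAPI(true);    // optionally ask for NNAPI acceleration
      interpreter->AllocateTensors();    // allocate input/output tensors

      float* input = interpreter->typed_input_tensor<float>(0);
      // ... fill `input` with the preprocessed image here ...

      interpreter->Invoke();             // run the network
      float* output = interpreter->typed_output_tensor<float>(0);
      // ... read the per-pixel edge probabilities from `output` ...
      return 0;
    }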

In addition, there are some more important subfolders:

  • kernels: where operators are defined and implemented. The register.cc file here defines which operators are supported, and it can be customized;

  • downloads: a number of third-party libraries, including:

      • abseil: Google's extensions to the C++ standard library;

      • eigen: a matrix computation library;

      • farmhash: a hashing library;

      • flatbuffers: the FlatBuffers library for the model format used by TFLite;

      • gemmlowp: Google's low-precision matrix computation library;

      • neon_2_sse: maps ARM NEON instructions to the corresponding SSE instructions;

  • java: mainly Android-platform-related code;

  • nnapi: provides the NNAPI calling interface; worth a look if you want to call NNAPI yourself;

  • schema: the concrete definition of the FlatBuffers model format used by TFLite;

  • toco: code for converting models from the Protobuf format to the FlatBuffers format.

How do we use TFLite?

1. TFLite compilation

TFLite runs on both Android and iOS, but the official build process differs between the two platforms.

On Android, we can build with the Bazel build tool. We will not go into installing and configuring Bazel here; anyone who has built TensorFlow before should be familiar with it. According to the official documentation, the Bazel build target is "//tensorflow/contrib/lite/java/demo/app/src/main:TfLiteCameraDemo", which is a demo app. If you only want the library files, you can build the target "//tensorflow/contrib/lite/java:tensorflowlite" to get the libtensorflowlite_jni.so library and the corresponding Java interface.

See the official documentation for more details:

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/docs_src/mobile/tflite/demo_android.md

On iOS, you need to build with a Makefile. Running build_ios_universal_lib.sh on a Mac produces the library file tensorflow/contrib/lite/gen/lib/libtensorflow-lite.a. This is a fat library that bundles builds for x86_64, i386, armv7, armv7s, arm64, and so on.

See the official documentation for more details:

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/docs_src/mobile/tflite/demo_ios.md

The TFLite library's calling interface also differs between the two platforms: Android exposes a Java-level interface, while iOS exposes a C++-level interface.

Of course, the TFLite project structure is fairly simple; once you are familiar with it, you can also build TFLite with whatever build tools you are comfortable with.

2. Model conversion

TFLite uses FlatBuffers instead of the older Protobuf model format (presumably to reduce dependencies). The Protobuf model file produced by training therefore needs to be converted to the FlatBuffers format.

TensorFlow officially provides guidance on this conversion. First, because TFLite supports only a limited set of operators and no training-related operators at all, you need to strip the unnecessary operators from the model in advance; this is the Freeze Graph step. You can then convert the model format with TensorFlow's toco tool. Both tools are built with Bazel as well.

See the official documentation for more details:

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/docs_src/mobile/tflite/devguide.md

3. Missing operators

Currently, TFLite provides only a limited set of operators, mainly the ones commonly used in CNNs, such as convolution and pooling. Our model is a fully convolutional neural network, and TFLite provides most of the operators it needs, but not conv2d_transpose (transposed convolution, also called deconvolution). Fortunately, this operator sits at the very end of the network, so we can take the intermediate result just before the deconvolution out of TFLite and implement the deconvolution ourselves in C++ to compute the final output (a sketch of the idea follows). Since the transposed convolution is not computationally heavy, overall speed is essentially unaffected.
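
A hand-rolled transposed convolution can be quite short. The sketch below (single channel, square kernel, stride s, no padding) only illustrates the scatter-add idea and is not our production code:

    #include <vector>

    // Minimal single-channel transposed convolution with stride s and no padding.
    // input: h x w, kernel: k x k, output: ((h-1)*s + k) x ((w-1)*s + k).
    std::vector<float> Conv2DTranspose(const std::vector<float>& input, int h, int w,
                                       const std::vector<float>& kernel, int k, int s) {
      const int oh = (h - 1) * s + k;
      const int ow = (w - 1) * s + k;
      std::vector<float> output(oh * ow, 0.0f);
      // Each input pixel scatters a scaled copy of the kernel into the output.
      for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
          const float v = input[y * w + x];
          for (int ky = 0; ky < k; ++ky) {
            for (int kx = 0; kx < k; ++kx) {
              output[(y * s + ky) * ow + (x * s + kx)] += v * kernel[ky * k + kx];
            }
          }
        }
      }
      return output;
    }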

But what if the operator your model needs and TFLite lacks is not at the end of the network? In that case you can write a custom TFLite operator and register it in the TFLite kernels list, so that the compiled TFLite library can handle it. You also need to pass the --allow_custom_ops option during model conversion so that operators TFLite does not support by default are kept in the model (a registration sketch follows the documentation link below).

See the official documentation for more details:

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/custom_operators.md
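
For reference, here is a minimal sketch of what defining and registering a custom operator looks like with the contrib/lite API. The op name "MyDeconv" and the identity computation are placeholders for illustration, not our actual deconvolution kernel:

    #include "tensorflow/contrib/lite/context.h"

    namespace custom_op {

    // Prepare: runs once to check the inputs and set the output shape.
    TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
      TfLiteTensor* input = &context->tensors[node->inputs->data[0]];
      TfLiteTensor* output = &context->tensors[node->outputs->data[0]];
      // Placeholder op: the output has the same shape as the input.
      return context->ResizeTensor(context, output, TfLiteIntArrayCopy(input->dims));
    }

    // Eval: the actual computation, executed on every Invoke().
    TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
      const TfLiteTensor* input = &context->tensors[node->inputs->data[0]];
      TfLiteTensor* output = &context->tensors[node->outputs->data[0]];
      const int count = input->bytes / sizeof(float);
      for (int i = 0; i < count; ++i) {
        output->data.f[i] = input->data.f[i];  // placeholder: identity
      }
      return kTfLiteOk;
    }

    TfLiteRegistration* Register_MY_DECONV() {
      static TfLiteRegistration r = {nullptr, nullptr, Prepare, Eval};
      return &r;
    }

    }  // namespace custom_op

    // When building the interpreter (see the earlier sketch), register the op
    // under the name that appears in the converted model:
    //   resolver.AddCustom("MyDeconv", custom_op::Register_MY_DECONV());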

Advantages and disadvantages of TFLite

Advantage: a good balance among library size, ease of development, cross-platform support, and performance

For comparison, the Youdao technology team looked at several other mobile deep learning frameworks and analyzed them along four dimensions: ease of development, cross-platform support, library size, and performance:

  • TensorFlow Mobile: because it shares code with server-side TensorFlow, it can use models trained on the server directly, which makes development very convenient; it supports both Android and iOS, so cross-platform support is not a problem; as mentioned earlier, its libraries are large; performance is mainstream.

  • Caffe2: Caffe models can be converted to Caffe2 fairly easily, but some operators are missing, so development is less convenient; it supports both Android and iOS, so cross-platform support is not a problem; the compiled libraries are large, although static libraries can be trimmed down; performance is mainstream.

  • Metal/Accelerate: both are iOS frameworks. They are quite low-level; you have to convert the model and write the inference code yourself, which makes development painful; iOS only, so no cross-platform support; they ship with the system, so library size is not an issue; very fast.

  • Core ML: the framework released at WWDC17 for iOS 11. With its model conversion tools, development is painless as long as only common operators are involved, but very difficult once custom operators are needed; it supports only iOS 11 and above; it ships with the system, so library size is not an issue; very fast.

The last is TFLite:

  • TFLite: its models can be converted from models trained with TensorFlow, though some operators are missing, so development is reasonably convenient; it supports both Android and iOS, so cross-platform support is not a problem; the compiled library is small; in our experiments it ran a little faster than TensorFlow.

As can be seen, TensorFlow Mobile is easy to develop with and highly general, but its linked library is large and its performance is merely mainstream (mobile versions of other server-side neural network frameworks have similar characteristics). Low-level libraries such as Metal/Accelerate are fast but not cross-platform and can be painful to develop with. Neural network frameworks optimized for mobile, such as Caffe2 and TFLite, are more balanced; although they suffer from incomplete operator coverage early on, as long as the teams behind them keep pushing the frameworks forward, this problem will be solved over time.

Advantage: relatively easy to extend

Because TFLite's code is relatively simple (compared with TensorFlow) and its structure is fairly clear, it is relatively easy to extend. If you want to add an operator that TFLite lacks but TensorFlow has, you can add a custom operator class; if you want to add an operator that TensorFlow does not have either, you can even modify the FlatBuffers model file directly.

Disadvantage: operator coverage is not comprehensive enough

As mentioned above, TFLite currently supports mainly CNN-related operators and does not yet support the operators used in other kinds of networks well. So if you want to move an RNN model to mobile, TFLite is not currently an option.

However, judging from the most recent Google TensorFlow Developer Summit, Google and the TensorFlow community are working hard to broaden operator coverage, and we believe that as more developers run into similar needs, more models will be well supported. That is also one of the reasons we chose a mainstream community such as TensorFlow's.

Disadvantage: support for dedicated compute chips is still limited

Although TFLite is built on NNAPI and can in theory take advantage of all kinds of compute chips, few compute chips support NNAPI so far. We hope TFLite will support more compute chips in the future; after all, there is a limit to how far neural networks can be optimized on a CPU, and dedicated chips open the door to a new world.

Conclusion

Over the past year or two there has been a wave of real-time AI on mobile. The Youdao technology team has also made many attempts in mobile AI algorithms and has shipped real-time, on-device AI capabilities such as offline neural machine translation (offline NMT), offline word recognition (offline OCR), and offline document scanning, which are used in Youdao Dictionary, Youdao Translator, and Youdao Cloud Note. Because mobile AI is still developing rapidly, the various frameworks and computing platforms remain imperfect.

Here we have used the offline document recognition feature in Youdao Cloud Note as a practical case to show that TFLite, as an excellent mobile AI framework, can help developers bring common neural networks to mobile devices relatively easily. In the future we will share more of the Youdao technology team's technical explorations and practical product applications of real-time mobile AI with TFLite.

For more articles on the Youdao technology team's technical practice, follow the "Youdao Technology Team" public account.

