By the Alibaba Tao Department F(x) Team


Preface

To help you better learn Pipcook and machine learning, we have prepared a series of hands-on tutorials covering front-end component recognition, image style transfer, AI poetry, and automatic blog classification, each explaining how to use Pipcook in daily front-end development. If you want to learn more about Pipcook 1.0, read the article AI ❤️ JavaScript, Pipcook 1.0.

Background

Have you ever run into this scenario in front-end work: you have some screenshots on hand and want an automatic way to find out which components each image contains, where each component is located, and what type it is? In the field of deep learning, this kind of task is generally called object detection.


Object detection refers to finding targets in a scene (an image), i.e. determining where they are and what they are.


This kind of detection is very useful. For example, in image-to-code research, front-end code is mainly composed of DIV, IMG, and SPAN elements. If we can identify the positions of shapes, bitmaps, and text in an image, we can directly generate the corresponding description code.
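As an illustration of that last step, here is a minimal sketch assuming the detections have already been converted into objects with a className and a box ([xmin, ymin, xmax, ymax]); the detectionsToHtml helper and the class-to-tag mapping are our own inventions, not part of Pipcook:

// Hypothetical helper: turn detected boxes into absolutely positioned markup.
// The class-to-tag mapping is an assumption for illustration only.
const CLASS_TO_TAG = { shape: 'div', bitmap: 'img', text: 'span' };

function detectionsToHtml(detections) {
  return detections
    .map(({ className, box }) => {
      const [xmin, ymin, xmax, ymax] = box;
      const style =
        `position:absolute;left:${xmin}px;top:${ymin}px;` +
        `width:${xmax - xmin}px;height:${ymax - ymin}px;`;
      const tag = CLASS_TO_TAG[className] || 'div';
      return tag === 'img'
        ? `<${tag} style="${style}" />`
        : `<${tag} style="${style}"></${tag}>`;
    })
    .join('\n');
}

console.log(detectionsToHtml([
  { className: 'shape', box: [83, 31, 146, 71] },
  { className: 'text', box: [210, 48, 256, 78] },
]));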


This tutorial will teach you how to train a model to do such a detection task.

Sample scenario

For example, the image below contains several components, including buttons, switches, input fields, and so on, and we want to identify their positions and types:

When this image is fed into a trained model, the model will output a prediction like the following:

{
  boxes: [
    [83, 31, 146, 71],   // xmin, ymin, xmax, ymax
    [210, 48, 256, 78],
    [403, 30, 653, 72],
    [717, 41, 966, 83]
  ],
  classes: [0, 1, 2, 2],             // class index
  scores: [0.95, 0.93, 0.96, 0.99]   // scores
}

At the same time, a labelmap is generated during training. The labelmap is a mapping between an index and the actual class name. It is needed because our class names in the real world are text, but before entering the model they have to be converted into numbers. Here is an example labelmap:

{
  "button": 0,
  "switch": 1,
  "input": 2
}

Here’s an explanation for the above prediction:


  • Boxes: this field describes the position of each detected component, given as the upper-left and lower-right corners. For example, [83, 31, 146, 71] means the upper-left corner of the component is at (83, 31) and the lower-right corner is at (146, 71).
  • Classes: this field describes the class index of each component. Combined with the labelmap, we can see that the detected components are a button, a switch, an input field, and another input field.
  • Scores: the confidence of each detected component, i.e. how certain the model is about its result. Generally we set a threshold and keep only the results whose confidence exceeds it, as shown in the sketch after this list.
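Putting the three fields together, here is a minimal sketch of how the labelmap and a confidence threshold could be applied to the raw prediction; the helper name and the 0.9 threshold are our own choices, not part of Pipcook's output:

// Invert the labelmap and keep only confident results.
const labelMap = { button: 0, switch: 1, input: 2 };
const indexToName = Object.fromEntries(
  Object.entries(labelMap).map(([name, index]) => [index, name])
);

function readablePrediction(prediction, threshold = 0.9) {
  const { boxes, classes, scores } = prediction;
  return boxes
    .map((box, i) => ({ box, className: indexToName[classes[i]], score: scores[i] }))
    .filter((item) => item.score >= threshold);
}

const prediction = {
  boxes: [[83, 31, 146, 71], [210, 48, 256, 78], [403, 30, 653, 72], [717, 41, 966, 83]],
  classes: [0, 1, 2, 2],
  scores: [0.95, 0.93, 0.96, 0.99],
};
console.log(readablePrediction(prediction));
// -> [{ box: [83, 31, 146, 71], className: 'button', score: 0.95 }, ...]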

Data preparation

When we want to do such an object detection task, we need to collect and store our dataset according to a certain specification. There are two main dataset formats for object detection in the industry today: the COCO format and the Pascal VOC format. We also provide corresponding data-collection plugins for both formats. Here we take the Pascal VOC format as an example; the file directory looks like this:


  • train
    • 1.jpg
    • 1.xml
    • 2.jpg
    • 2.xml
    • ...
  • validation
    • 1.jpg
    • 1.xml
    • 2.jpg
    • 2.xml
    • ...
  • test
    • 1.jpg
    • 1.xml
    • 2.jpg
    • 2.xml
    • ...


We need to split our dataset into train, validation, and test sets in a certain ratio. The training set is used to train the model, while the validation set and the test set are used to evaluate it. The validation set evaluates the model during training, which makes it easy to check for overfitting and convergence; the test set evaluates the model as a whole after all training is finished.
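As a minimal sketch of such a split, the script below distributes image/XML pairs into the three folders. It is our own illustration rather than a Pipcook utility, and the 80/10/10 ratio and folder names are assumptions:

// Split annotated image/XML pairs into train/validation/test (80/10/10 here).
const fs = require('fs');
const path = require('path');

function splitDataset(sourceDir, targetDir, ratios = { train: 0.8, validation: 0.1 }) {
  const images = fs.readdirSync(sourceDir).filter((f) => f.endsWith('.jpg'));
  images.sort(() => Math.random() - 0.5); // rough shuffle so the split is not biased by file order

  const trainEnd = Math.floor(images.length * ratios.train);
  const validationEnd = trainEnd + Math.floor(images.length * ratios.validation);

  images.forEach((image, i) => {
    const subset = i < trainEnd ? 'train' : i < validationEnd ? 'validation' : 'test';
    const destDir = path.join(targetDir, subset);
    fs.mkdirSync(destDir, { recursive: true });
    const xml = image.replace(/\.jpg$/, '.xml');
    fs.copyFileSync(path.join(sourceDir, image), path.join(destDir, image));
    fs.copyFileSync(path.join(sourceDir, xml), path.join(destDir, xml));
  });
}

splitDataset('./annotated', './dataset');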


For each image, Pascal VOC specifies an XML annotation file that records which components the image contains and where each component is located. A typical annotation file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<annotation>
  <folder>less_selected</folder>
  <filename>0a3b6b38-fb11-451c-8a0d-b5503bc351e6.jpg</filename>
  <size>
    <width>987</width>
    <height>103</height>
  </size>
  <segmented>0</segmented>
  <object>
    <name>buttons</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>83</xmin>
      <ymin>31.90625</ymin>
      <xmax>146</xmax>
      <ymax>71.40625</ymax>
    </bndbox>
  </object>
  <object>
    <name>switch</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>210.453125</xmin>
      <ymin>48.65625</ymin>
      <xmax>256.453125</xmax>
      <ymax>78.65625</ymax>
    </bndbox>
  </object>
  <object>
    <name>input</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>403.515625</xmin>
      <ymin>30.90625</ymin>
      <xmax>653.015625</xmax>
      <ymax>72.40625</ymax>
    </bndbox>
  </object>
  <object>
    <name>input</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>717.46875</xmin>
      <ymin>41.828125</ymin>
      <xmax>966.96875</xmax>
      <ymax>83.328125</ymax>
    </bndbox>
  </object>
</annotation>

The XML annotation file consists of the following parts (a sketch for generating such a file programmatically follows this list):


  • folder / filename: these two fields define the location and name of the annotated image
  • size: the width and height of the image
  • object:
    • name: the class name of the component
    • bndbox: the position of the component
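If you generate annotations yourself rather than exporting them from a GUI tool, here is a minimal sketch of writing such a file from box data; the writeVocAnnotation helper is our own, not part of Pipcook or any Pascal VOC tooling:

// Write a Pascal VOC style annotation file for one image.
const fs = require('fs');

function writeVocAnnotation(filePath, { folder, filename, width, height, objects }) {
  const objectXml = objects
    .map(({ name, xmin, ymin, xmax, ymax }) => `  <object>
    <name>${name}</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>${xmin}</xmin>
      <ymin>${ymin}</ymin>
      <xmax>${xmax}</xmax>
      <ymax>${ymax}</ymax>
    </bndbox>
  </object>`)
    .join('\n');

  const xml = `<?xml version="1.0" encoding="UTF-8"?>
<annotation>
  <folder>${folder}</folder>
  <filename>${filename}</filename>
  <size>
    <width>${width}</width>
    <height>${height}</height>
  </size>
  <segmented>0</segmented>
${objectXml}
</annotation>
`;
  fs.writeFileSync(filePath, xml);
}

writeVocAnnotation('./train/1.xml', {
  folder: 'train',
  filename: '1.jpg',
  width: 987,
  height: 103,
  objects: [{ name: 'button', xmin: 83, ymin: 31, xmax: 146, ymax: 71 }],
});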


We have prepared such a dataset; you can download it and take a look: download address

Start training

After the dataset is ready, we can start training. With Pipcook, training an object detection model is very convenient; you only need to build the following pipeline:

{
  "plugins": {
    "dataCollect": {
      "package": "@pipcook/plugins-object-detection-pascalvoc-data-collect",
      "params": {
        "url": "http://ai-sample.oss-cn-hangzhou.aliyuncs.com/pipcook/datasets/component-recognition-detection/component-recognition-detection.zip"
      }
    },
    "dataAccess": {
      "package": "@pipcook/plugins-coco-data-access"
    },
    "modelDefine": {
      "package": "@pipcook/plugins-detectron-fasterrcnn-model-define"
    },
    "modelTrain": {
      "package": "@pipcook/plugins-detectron-model-train",
      "params": {
        "steps": 100000
      }
    },
    "modelEvaluate": {
      "package": "@pipcook/plugins-detectron-model-evaluate"
    }
  }
}

The plugins above are used as follows:


  1. @pipcook/plugins-object-detection-pascalvoc-data-collect: this plugin downloads datasets in Pascal VOC format. The main parameter we need to provide is url, which we set to the address of the dataset prepared above.
  2. @pipcook/plugins-coco-data-access: now that the dataset has been downloaded, we need to access it in the format required by the subsequent model. Since the Detectron2 framework used by our model requires the COCO dataset format, we use this plugin.
  3. @pipcook/plugins-detectron-fasterrcnn-model-define: we build a Faster R-CNN model based on the Detectron2 framework; this model performs very well in terms of object detection accuracy.
  4. @pipcook/plugins-detectron-model-train: this plugin starts training for any model built on Detectron2. We set the number of iterations to 100000; if your dataset is very complex, you need to increase it.
  5. @pipcook/plugins-detectron-model-evaluate: we use this plugin to evaluate the trained model. It takes effect only if a test set is provided, and it outputs the average precision of each class.


Since object detection models, especially those from the R-CNN family, are very large, training must be run on a machine with an NVIDIA GPU and a ready CUDA 10.2 environment:

pipcook run object-detection.json --verbose --tuna

During training, the model prints the loss of each iteration in real time. Please keep an eye on the log to confirm that the model is converging:

[06/28 10:26:57 d2.data.build]: Distribution of instances among all 14 categories:
|   category   | #instances |  category   | #instances |  category  | #instances |
|:------------:|:-----------|:-----------:|:-----------|:----------:|:-----------|
|     tags     | 3114       |    input    | 2756       |  buttons   | 3075       |
| imagesUpload | 316        |    links    | 3055       |   select   | 2861       |
|    radio     | 317        |  textarea   | 292        | datePicker | 316        |
|     rate     | 292        | rangePicker | 315        |   switch   | 303        |
|  timePicker  | 293        |  checkbox   | 293        |            |            |
|    total     | 17598      |             |            |            |            |

[06/28 10:28:32 d2.utils.events]: iter: 0 total_loss: 4.649 loss_cls: 2.798 loss_box_reg: 0.056 loss_rpn_cls: 0.711 loss_rpn_loc: 1.084 data_time: 0.1073 lr: 0.000000
[06/28 10:29:32 d2.utils.events]: iter: 0 total_loss: 4.249 loss_cls: 2.198 loss_box_reg: 0.056 loss_rpn_cls: 0.711 loss_rpn_loc: 1.084 data_time: 0.1073 lr: 0.000000
...
[06/28 12:28:32 d2.utils.events]: iter: 100000 total_loss: 0.032 loss_cls: 0.122 loss_box_reg: 0.056 loss_rpn_cls: 0.711 loss_rpn_loc: 1.084 data_time: 0.1073 lr: 0.000000
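For a long run, you may not want to scroll through the whole log to judge convergence. Here is a minimal sketch, our own helper rather than anything shipped with Pipcook or Detectron2, that extracts the total_loss values from such log lines (the ./train.log path is an assumption):

// Extract iteration and total_loss pairs from a Detectron2-style training log.
const fs = require('fs');

function extractLoss(logPath) {
  return fs
    .readFileSync(logPath, 'utf8')
    .split('\n')
    .map((line) => line.match(/iter: (\d+) total_loss: ([\d.]+)/))
    .filter(Boolean)
    .map(([, iter, loss]) => ({ iter: Number(iter), totalLoss: Number(loss) }));
}

// A steadily decreasing totalLoss suggests the model is converging.
console.log(extractLoss('./train.log').slice(-5));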

After training finishes, an output directory is generated in the current directory. It is a brand-new NPM package, so we first install its dependencies:

cd output
BOA_TUNA=1 npm install

With the environment installed, we can begin to predict:

const predict = require('./output');
(async () => {
  const v1 = await predict('./test.jpg');
  console.log(v1);
  // {
  //   boxes: [
  //     [83, 31, 146, 71],  // xmin, ymin, xmax, ymax
  //     [210, 48, 256, 78],
  //     [403, 30, 653, 72],
  //     [717, 41, 966, 83]
  //   ],
  //   classes: [0, 1, 2, 2],            // class index
  //   scores: [0.95, 0.93, 0.96, 0.99]  // scores
  // }
})();

Note that the result consists of three parts:

  • Boxes: this property is an array where each element is another array of four elements: xmin, ymin, xmax, and ymax (one way to visualize these boxes is sketched after this list)
  • Scores: this property is an array where each element is the confidence of the corresponding prediction
  • Classes: this property is an array where each element is the corresponding predicted class
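To eyeball the result, the boxes can be drawn back onto the test image. The sketch below relies on the canvas npm package, which is our own choice for this illustration and not a Pipcook dependency:

// Draw predicted boxes onto the test image and save the result as a PNG.
const { createCanvas, loadImage } = require('canvas');
const fs = require('fs');

async function drawBoxes(imagePath, prediction, outPath) {
  const image = await loadImage(imagePath);
  const canvas = createCanvas(image.width, image.height);
  const ctx = canvas.getContext('2d');
  ctx.drawImage(image, 0, 0);
  ctx.strokeStyle = 'red';
  ctx.lineWidth = 2;
  prediction.boxes.forEach(([xmin, ymin, xmax, ymax]) => {
    ctx.strokeRect(xmin, ymin, xmax - xmin, ymax - ymin);
  });
  fs.writeFileSync(outPath, canvas.toBuffer('image/png'));
}

// Usage (inside the async block above): await drawBoxes('./test.jpg', v1, './test-annotated.png');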

Make your own data set

After reading the above, if you are eager to solve your own problem with object detection, follow these steps to create your own dataset.

Collect pictures

This step is easy to understand: to have your own training data, you first need to collect enough training images. At this stage the images do not need any annotations; the raw images are enough, and you will annotate them in the next step.

Annotate

There are many annotation tools on the market. You can use them to mark up the components on your raw images, recording the position and type of each component. Let's take labelImg as an example and walk through it in detail.



You can install the software from the labelImg website above and then follow these steps:

  • Build and start the tool following the instructions on the official website.
  • In the File menu, click Change default saved annotation folder.
  • Click Open Dir and choose the directory of images to annotate.
  • Click Create RectBox.
  • Click and release the left mouse button to select a rectangular region to annotate.
  • You can use the right mouse button to drag the rectangular box to copy or move it.

Training

After your dataset is created, organize the file structure as described in the previous section, and then start the pipeline training.

Conclusion

At this point, you have learned how to recognize multiple front-end components in a single image, a skill that can be applied to more general scenarios. In the next chapter, we will introduce a more interesting example: using Pipcook to transfer image styles, such as replacing the oranges in a picture with apples, or turning realistic photos into oil paintings.