Background

For CV students, image recognition is the simplest model to start with, and also the most fundamental one. Across different CV tasks, even after years of development, model weights pretrained on image recognition are still kept as the backbone to accelerate training convergence. But when faced with a concrete image recognition need, how do we reshape a fairly standard recognition task around business requirements? That turns out to be quite an interesting topic.

Business requirements

We want to screen the sneaker photos users upload so that only pictures with a clean, uniform background pass; we call this the background complexity detection task. It reduces the difficulty of the downstream appraisal and resale algorithms, and keeps the image quality of the whole app at a consistent level.

Project requirements

  1. Accuracy on the test set should exceed 80%, which is enough to show users a non-mandatory hint; above 90%, we can force users to meet the upload quality requirements.
  2. The model must be deployable on-device.

Model design

MobileNet backbone + FPN + modified SAM

Broken down into modules, the final model is actually very simple; there is nothing particularly hard to understand.

The overall design rationale is as follows:

  1. Business analysis: this is a typical spatial recognition task, that is, the goal is achieved from the content of certain parts or regions of the image. For background complexity, we need to strip away the subject and judge whether the rest is “complex”, which naturally points to a spatial attention mechanism.
  2. Users cannot be expected to strictly control the proportion of the subject in the frame: some fill the frame with the subject, while others prefer more white space. To classify finely across such scales, we need to work on higher-resolution feature maps, hence the FPN.
  3. The MobileNet family is the natural choice for on-device deployment.

All of the above design decisions were made entirely around the current business scenario.

On the test set, the final CNN model achieves 96% accuracy, enough to serve as the basis for requiring users to upload high-quality pictures.

If you only want to know which model this project uses, the summary above already serves the purpose. Each module is straightforward in design, with no new concepts; the code is concise and easy to implement.

But the most interesting part of the project is the path to the whole design: why we chose this model in the end, how other ideas were ruled out, and how we went through trial and error.

Project process

Traditional CV

The project needs to run on-device. Although there is no hard real-time requirement for the phone, the whole algorithm, including any referenced libraries, must occupy a small enough memory footprint. So the first thing to consider is whether the problem can be solved without a deep network.

Wrong ideas

To analyze background complexity, I first thought of:

  1. For subject content that is irrelevant to the business but seriously affects the result, filter it out with a fixed-size Gaussian kernel, then apply further processing.
  2. Use edge detection and gradient information to analyze the background complexity of the image.
  3. Use Fourier analysis of the image's high- and low-frequency content: more high-frequency energy means more background detail and thus more complexity; mostly low-frequency energy means a clean background.
  4. Build a template from a simple pixel-wise weighted average of many clean-background images, then screen out complex-background images by anomaly detection.

These are the simple ideas I could come up with at the start of the project. But after looking at the sample images, all of them except the first (the spatial one) turned out to be wrong. For example, consider a pair of shoes placed on a carpet: the carpet itself is clean, and apart from the subject and the carpet the picture contains no other objects. But the carpet has its own pattern, full of high-frequency and gradient information. By the logic of ideas 2, 3, and 4 above, such a background is highly complex; in the samples, however, it counts as quite clean.
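For concreteness, ideas 2 and 3 can be sketched in a few lines of NumPy; the cutoff and scoring below are illustrative placeholders, not values actually used in the project. The synthetic "carpet" shows exactly the failure mode described above: a patterned but clean background scores as complex.

```python
import numpy as np

def gradient_complexity(img: np.ndarray) -> float:
    """Idea 2: mean gradient magnitude of a grayscale image.
    More edges -> higher score -> judged 'more complex'."""
    gy, gx = np.gradient(img.astype(np.float64))
    return float(np.mean(np.hypot(gx, gy)))

def highfreq_ratio(img: np.ndarray, cutoff: float = 0.25) -> float:
    """Idea 3: fraction of spectral energy outside a low-frequency window."""
    f = np.fft.fftshift(np.fft.fft2(img.astype(np.float64)))
    power = np.abs(f) ** 2
    h, w = img.shape
    ch, cw = int(h * cutoff), int(w * cutoff)
    low = power[h // 2 - ch:h // 2 + ch, w // 2 - cw:w // 2 + cw].sum()
    return float(1.0 - low / power.sum())

# A flat background scores zero, while a carpet-like texture scores high on
# gradient energy even though a human would label it "clean".
flat = np.full((64, 64), 128.0)
carpet = 128.0 + 20.0 * np.outer(np.sin(np.arange(64)), np.sin(np.arange(64)))
assert gradient_complexity(flat) == 0.0
assert gradient_complexity(carpet) > 0.0
```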

Analyzing the data samples and reflecting on the business logic leads to the following conclusions:

  1. Background complexity is not about differences between the backgrounds of different images.
  2. Background complexity is about whether each image itself contains an inconsistent background or foreground objects other than the subject.

The overall approach therefore had to change: judge background complexity through self-similarity within each image. This was a strategic adjustment made after becoming more familiar with the business.

Template matching

So what parts can be used to judge the self-similarity?

  1. Four corners

As mentioned above, one of the wrong ideas contained a correct kernel: think about the business in the spatial dimension, and removing the subject information is right. That is, when judging background self-similarity, we must avoid the subject. Following this idea, the four corners of an image represent its background reasonably well: if the contents of the four corners are similar, most of the business goal is met. We compute the pairwise self-similarity of the four corners, which gives six values; the higher these six values, the more similar the corners, and the more likely the background is clean.
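A minimal sketch of the six pairwise corner similarities. The patch size and the use of normalized cross-correlation are assumptions for illustration; the text does not specify the production matching metric.

```python
import itertools
import numpy as np

def corner_patches(img: np.ndarray, size: int = 32):
    """Crop the four corner patches of a grayscale image."""
    h, w = img.shape
    return [img[:size, :size], img[:size, w - size:],
            img[h - size:, :size], img[h - size:, w - size:]]

def patch_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized cross-correlation in [-1, 1]; 1 means identical up to
    brightness and contrast."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 1.0

def corner_scores(img: np.ndarray, size: int = 32):
    """The six pairwise similarities between the four corners."""
    patches = corner_patches(img, size)
    return [patch_similarity(p, q)
            for p, q in itertools.combinations(patches, 2)]

# An image tiled from one texture has identical corners: all six scores ~1.
rng = np.random.default_rng(0)
clean = np.tile(rng.normal(size=(32, 32)), (4, 4))
scores = corner_scores(clean)
assert len(scores) == 6 and min(scores) > 0.99
```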

  2. Two corners

Observing actual business samples shows that users tend to leave more white space at the top of the photo than at the bottom; in other words, the two lower corners often contain subject information. So the final scheme matches only the two upper corners for similarity.

  3. Sliding-window matching

However, relying on a single output value, the similarity of just two corners, is too unstable and risky for the business. So I additionally used each of the two upper corners as a template, ran sliding-window matching over the image, and recorded the number of positions whose match score exceeded a threshold as an extra similarity indicator. In the end there are three indicators of background similarity: the similarity of the two upper corners to each other, and, for each corner used alone as a template, how much of the rest of the image it matches. These are combined with preset weights into a single score, which decides whether the picture is classified as complex.

Finally, the input image is resized to a fixed size so that template matching stays within a reasonable time budget. With this approach, accuracy on the test set reaches 80%. The algorithm can be used as a photo-guidance tool in the product: when it judges the background to be complex, it prompts the user to retake the photo.
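The three-indicator scheme can be sketched as follows. The weights, threshold, and stride are illustrative placeholders, not the tuned production values:

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation; 1.0 for identical (or both flat) patches."""
    a = a - a.mean()
    b = b - b.mean()
    d = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / d) if d > 0 else 1.0

def match_ratio(img, template, thresh=0.8, stride=8):
    """Fraction of sliding-window positions whose NCC with the template
    exceeds `thresh`."""
    th, tw = template.shape
    hits = total = 0
    for y in range(0, img.shape[0] - th + 1, stride):
        for x in range(0, img.shape[1] - tw + 1, stride):
            total += 1
            hits += ncc(img[y:y + th, x:x + tw], template) > thresh
    return hits / total if total else 0.0

def background_score(img, size=32, w=(0.4, 0.3, 0.3)):
    """Weighted sum of the three indicators: upper-corner similarity, plus
    each upper corner's match ratio over the whole image."""
    tl, tr = img[:size, :size], img[:size, -size:]
    return (w[0] * ncc(tl, tr)
            + w[1] * match_ratio(img, tl)
            + w[2] * match_ratio(img, tr))

flat = np.full((96, 96), 100.0)                         # clean, uniform background
noisy = np.random.default_rng(1).normal(size=(96, 96))  # cluttered background
assert background_score(flat) > background_score(noisy)
```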

CNN

It does not matter if the traditional CV part above was hard to follow; it simply describes how business analysis produced some valuable ideas for the project. The approaches below are the focus of this article.

Baseline

To iterate on the project quickly, we framed it as a plain recognition task, a very common image recognition setup. Since the model has to run on a phone, the natural and most familiar choice was MobileNet V1.

Optimizing the model

Object detection

As mentioned at the beginning, this is a standard spatial recognition problem. For such recognition problems, I usually turn to a detection model to solve the recognition task.

This line of thinking is common. For example, in a community app, we may want to identify what clothes the person in a picture is wearing, what brand of bag they carry, and what brand of shoes they wear. Setting aside the actual algorithmic difficulty, end to end this is a recognition problem: input an image, output several labels. In practice, however, it is hard to use a recognition model directly because there is a lot of redundant background information. For such problems we still need a detection model; in the end the bounding boxes themselves are not what we need as output, and some deduplication strategy on the detections may also be required.

Back to our business, the difference is that the region we ultimately care about is everything except the subject. One of the traditional CV ideas was to do something similar with a fixed-size Gaussian kernel. But precisely because of this inversion, we cannot simply take an object detection model and add some post-processing strategies to its output, as in the community example above.

Instead, we first need an object detection model to locate the subject, then a mask-filtering step to remove it; the remainder can be fed either to the traditional CV algorithm or to the CNN baseline to get the final answer.

Object detection can clearly solve this business problem, and it is the most intuitive choice.

Hidden object detection

Although the object detection approach is intuitive, it has two major defects:

  1. It is not end to end: it requires at least two steps, or two models, which burdens both speed and memory.
  2. The cost of object detection, in particular its annotation cost, is much higher than that of recognition.

So, can we fix these two defects? The answer is yes.

Just as in object detection, we can have the model predict a region, i.e. a bounding box, as an intermediate step. The output has 4 dimensions, usually encoded as the top-left and bottom-right corners; the top-left corner plus width and height; or the center coordinates plus width and height. We then use the original image and this 4-dimensional output to build a mask filter that removes the subject, and carry out recognition on what remains.
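A minimal PyTorch sketch of this "hidden box" idea, with illustrative layer sizes (the real project uses a MobileNet backbone): a small head predicts a normalized (cx, cy, w, h) box, which is turned into a differentiable soft mask that suppresses the subject before classification. The box is never supervised directly; it is learned end to end from the classification loss.

```python
import torch
import torch.nn as nn

class HiddenBoxRecognizer(nn.Module):
    """Predict a box, mask out the subject, then classify the remainder."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.box_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 4), nn.Sigmoid())        # (cx, cy, w, h) in [0, 1]
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes))

    def soft_mask(self, box, h, w, sharpness=20.0):
        # Differentiable "outside the box" mask: ~0 inside, ~1 outside.
        cx, cy, bw, bh = box.unbind(-1)
        ys = torch.linspace(0, 1, h).view(1, h, 1)
        xs = torch.linspace(0, 1, w).view(1, 1, w)
        in_x = torch.sigmoid(sharpness * (bw.view(-1, 1, 1) / 2
                                          - (xs - cx.view(-1, 1, 1)).abs()))
        in_y = torch.sigmoid(sharpness * (bh.view(-1, 1, 1) / 2
                                          - (ys - cy.view(-1, 1, 1)).abs()))
        return 1.0 - in_x * in_y

    def forward(self, x):
        _, _, h, w = x.shape
        box = self.box_head(self.features(x))            # learned, unsupervised
        mask = self.soft_mask(box, h, w).unsqueeze(1)    # (B, 1, H, W)
        return self.classifier(self.features(x * mask)), box

model = HiddenBoxRecognizer()
logits, box = model(torch.randn(2, 3, 64, 64))
assert logits.shape == (2, 2) and box.shape == (2, 4)
```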

The advantage of doing this is that it avoids the defects of using detection directly: the previous multi-step scheme is merged into a single model, and the intermediate label is learned by the model itself rather than annotated manually, saving a lot of preparation time before training.

In fact, the well-known Pailitao image search uses exactly this idea. Image search is also generally split into detection + recognition, but the Pailitao scheme eliminates the detection pre-task using the method described above.

Hidden segmentation

If object detection can do this, then semantic segmentation, or instance segmentation, can do it too. But first, look at the remaining limitations of hidden object detection:

  1. Because of the nature of a bounding box, it still includes unnecessary information or cuts away necessary information.
  2. For multiple objects, the approach is inflexible. The problem is not insurmountable: multi-object cases can also be handled by adding convolutions over the detection feature map, but the parameter count grows as well.

Similarly, in the mask generation stage we use the typical segmentation idea: discard the rigid rectangle of the bounding box and predict the object's outer contour directly. This boundary is likewise hidden and needs no manual annotation; the input image is then filtered with the resulting mask.

Spatial attention mechanism

Implicit segmentation makes a lot of sense, but it increases both the parameter count and the computation. In fact, both the detection and the segmentation approaches are, broadly speaking, strongly supervised spatial attention: spatial attention mechanisms that we define by hand. But the same question remains: our final goal is not to learn accurate bounding box coordinates or an accurate mask of the background's outer contour. These intermediate results are not critical to the final goal. In other words, the model can learn its own set of regions to attend to, even if they are not highly interpretable. For such scenes, we do not need the top layer of a segmentation model, that is, the highest-resolution mask; a middle layer already does the job, so there is no need to waste more parameters.

Compared with a “U”-shaped image segmentation model, a “J” shape is enough. The “J” structure is simply a standard backbone + FPN: FPN retains a certain amount of high-resolution information while fusing in semantic information from the deeper layers, making it a perfect fusion of the two. Predicting a mask at this layer is equivalent to doing semantic segmentation at this resolution, and that is exactly a spatial attention mechanism.
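A sketch of the “J” structure in PyTorch, with a toy backbone standing in for MobileNet and illustrative channel sizes: one FPN-style top-down merge, and a single-channel mask predicted at the intermediate (1/4) resolution instead of a full “U” decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JNet(nn.Module):
    """Backbone + FPN merge + intermediate-resolution attention mask."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU())   # 1/2
        self.stage2 = nn.Sequential(nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU())  # 1/4
        self.stage3 = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())  # 1/8
        self.lateral = nn.Conv2d(32, 64, 1)     # FPN lateral connection
        self.mask_head = nn.Conv2d(64, 1, 1)    # spatial attention mask
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))

    def forward(self, x):
        c2 = self.stage2(self.stage1(x))        # higher resolution
        c3 = self.stage3(c2)                    # stronger semantics
        # FPN merge: lateral 1x1 conv + upsampled deeper features.
        p2 = self.lateral(c2) + F.interpolate(c3, scale_factor=2)
        mask = torch.sigmoid(self.mask_head(p2))  # learned, no annotation
        return self.classifier(p2 * mask), mask

model = JNet()
logits, mask = model(torch.randn(2, 3, 64, 64))
assert logits.shape == (2, 2) and mask.shape == (2, 1, 16, 16)
```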

Spatial + channel attention mechanism

Now that we have spatial attention, we can also add attention over the channel dimension. The familiar SE module is channel attention; BAM/CBAM apply attention in both the spatial and channel dimensions, but as two separate modules, whether arranged in series or in parallel. In this project, the modified SAM module adopted in YOLOv4 is used to produce the spatial + channel weights in one step, which is simple and easy to understand. That said, the channel dimension really is just thrown in for good measure; for this business, I think it could be included or left out.
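The modified SAM is tiny: instead of the max/avg pooling used by CBAM's spatial attention, a convolution is applied directly to the feature map, and the sigmoid output reweights every channel at every position in a single step. A sketch (the kernel size here is an assumption, not necessarily YOLOv4's exact choice):

```python
import torch
import torch.nn as nn

class ModifiedSAM(nn.Module):
    """YOLOv4-style modified SAM: conv + sigmoid, point-wise reweighting
    over both the spatial and channel dimensions at once."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))

sam = ModifiedSAM(32)
out = sam(torch.randn(2, 32, 16, 16))
assert out.shape == (2, 32, 16, 16)
```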

Let’s look at the Grad-CAM results of the intermediate layers.
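For reference, heatmaps like these can be produced with a short hook-based Grad-CAM routine; this is a generic sketch, and the toy network below merely stands in for the real model:

```python
import torch
import torch.nn as nn

def grad_cam(model, layer, x, class_idx):
    """Minimal Grad-CAM: weight the target layer's activations by the
    spatially averaged gradient of the class score, then apply ReLU."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    score = model(x)[0, class_idx]
    model.zero_grad()
    score.backward()
    h1.remove()
    h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)   # GAP over gradients
    cam = torch.relu((weights * acts["a"]).sum(dim=1))    # (B, H, W)
    return cam / (cam.max() + 1e-8)

# Toy usage on a small CNN:
net = nn.Sequential(nn.Conv2d(3, 8, 3, 2, 1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
cam = grad_cam(net, net[0], torch.randn(1, 3, 32, 32, requires_grad=True), 0)
assert cam.shape == (1, 16, 16) and bool((cam >= 0).all())
```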

Results comparison

The improved model was compared against the same model without the attention module and FPN: the original model reaches 93% accuracy on the validation set, while the improved model reaches 96%. More importantly, the improved model is significantly more interpretable, and better interpretability opens up clear directions for further optimization.

Conclusion

The above is one way of thinking about how to innovate in image recognition. But innovation is not for its own sake; the core of the whole project is solving the business problem.

Article/poem poetry
