Author: breezecheng, WXG application researcher at Tencent

WeChat Scan is already a household habit. On 12.23, the iOS version of Scan object recognition went live. Moving from recognizing images in specific coded forms (QR codes, mini program codes, barcodes, scan-to-translate) to accurately recognizing natural-scene images (clothing, shoes, bags, beauty products, home appliances, toys, books, food, jewelry, furniture, and other goods), what difficulties had to be overcome? And what new scenarios emerge when the scanned object, captured as a picture or video, is connected to the valuable content inside the WeChat ecosystem, such as e-commerce, encyclopedia entries, and news? This article explains in detail.

1. Overview of Scan object recognition

1.1 What does Scan object recognition do?

Scan object recognition takes a picture or video of an item (shoes, bags, beauty products, clothing, home appliances, toys, books, food, jewelry, furniture, or other goods) as input, mines the valuable information in the WeChat content ecosystem (e-commerce + encyclopedia + news, as shown in Figure 1), and presents it to the user. The e-commerce side covers essentially all high-quality WeChat mini program stores, spanning hundreds of millions of product SKUs, and lets users compare offers across sellers and place orders directly. The encyclopedia and news side aggregates leading sources such as WeChat Search (Souyisou), Sogou, and Baidu to show and share content related to the photographed item.

Figure 1. Schematic diagram of Scan object recognition

You are welcome to update to the new iOS version of WeChat and try it yourself via Scan → Identify things. You can also send us feedback through the feedback button on the Identify things screen. Figure 2 shows Scan object recognition on a real photo.


Figure 2. Scan object recognition on a real photo

1.2 What are the use cases of Scan object recognition?

Scan object recognition aims to open a new window through which users can directly reach content inside the WeChat ecosystem. The window takes a scanned picture as input and presents encyclopedia, news, and e-commerce content from the WeChat ecosystem as result pages. Besides the scanning gesture that users already know well, we will later add long-press image recognition so that object recognition is reachable from more places. The main application scenarios, shown in the figure below, fall into three parts:

A. Popular science. By scanning an object, users can get encyclopedia entries, news, and other general knowledge or interesting stories about it from the WeChat ecosystem, helping them understand the object better;

B. Shopping. Same-item search lets users immediately find the exact product they like among WeChat mini program stores, so they can scan and buy;

C. Advertising. Object recognition helps official account articles and videos better understand the images embedded in them, so that better-matched ads can be served and click-through rates improved.

1.3 What new technology does object recognition bring to the Scan family?

When it comes to Scan, people are familiar with QR codes, mini program codes, barcodes, and scan-to-translate. Codes and text characters in their various forms can all be regarded as images encoded in a specific format, whereas object recognition deals with natural-scene images, which is a qualitative leap for the Scan family. Starting from object recognition, we hope to further extend Scan's ability to understand natural-scene images, for example with services for scanning wine, cars, plants, and faces, as shown in Figure 3 below.

Figure 3. The Scan family

2. Technical analysis of Scan object recognition

2.1 Overall framework of Scan object recognition

Below we introduce the complete technical implementation of Scan object recognition. Figure 4 shows the overall framework, which consists of four parts:

1) The user request pipeline;

2) Offline product detection and database ingestion;

3) Same-item retrieval + encyclopedia and news lookup;

4) Model training and deployment.

The core technical modules behind these four parts can be grouped into three: data construction, algorithm research and development, and platform construction. We cover them one by one.

Figure 4. Overall framework of Scan object recognition

Data construction

In the AI era, data is king, and the purpose of data construction is to better serve the AI algorithms. For example, both user request images and merchant images entering the database need subject detection, category prediction, and feature extraction, and each of these needs high-quality training data to support the detection, classification, and retrieval models. In short, data determines the upper bound on the performance of the whole system.

Algorithm research and development

Algorithm development aims to make full use of the existing data and to reach the best trade-off between accuracy and speed at each stage of recognition (detection, category prediction, and retrieval feature extraction), so that any product a user requests is matched accurately to the same item and shown with more relevant information. The quality of algorithm development determines the lower bound on the system's performance.

Platform construction

Data construction, algorithm development, and model deployment all depend on good platform support. We have built a complete platform covering data cleaning, model training, deployment, and launch. Platform construction governs research and development efficiency and determines whether Scan object recognition can go online at all.

2.2 Data construction for Scan object recognition

Data construction has two parts: building the training data for model training, and building the online retrieval database that supports retrieval requests for any product.

Training database construction. The training database mainly supports model training, such as the object detection model, the category prediction model, and the same-item retrieval model; encyclopedia and news search additionally requires training a product-title text classification model, a named entity recognition model, and so on. This article focuses on the visual-semantics algorithms; the NLP algorithms for encyclopedia and news will be covered in the next article. Sections 2.2.1 to 2.2.3 below describe the image deduplication used in data construction, the semi-supervised learning algorithm used to annotate the detection database, and the semi-automatic same-item denoising + merging algorithm we propose for building the retrieval data.

Figure 5. Training data construction (visual module)

Coverage of the online retrieval database is critical: it determines whether we can serve a search request for any product. We use four strategies, namely targeted import, supplementary crawling of long-tail items, replay of user requests, and active discovery, to keep expanding our product image database, which currently covers 95%+ of common goods. This figure keeps rising, and we expect that eventually any product request will recall exactly the same item.

2.2.1 Image deduplication

The first step for both the detection database and the retrieval database is to remove duplicate images, which reduces both storage pressure and model training pressure. Two image deduplication approaches are commonly used:

1) MD5 deduplication, which removes identical images;

2) Hash-based deduplication, which removes not only exact duplicates but also images produced by simple edits to the original, such as changes in brightness, contrast, scale, edge sharpening, blur, chroma, or rotation angle. Such images are of little value to keep, because data augmentation during model training can cover them. Commonly used hashes include aHash, dHash, and pHash; see [1,2] for details. We focus on comparing the speed and robustness of these algorithms, as shown in Figure 6 below.

Figure 6. Comparison of speed and robustness of commonly used deduplication algorithms

Figure 6 shows that dHash is the fastest, because it only needs to compute brightness differences between adjacent pixels, which is very simple to process. It also tolerates simple image edits well: it is robust to small changes in brightness, scale, contrast, edge sharpening, blur, and chroma, which helps us clean out more useless images. We therefore used dHash to deduplicate our training database. In the fully crawled product database of 11 categories, about 30% of the images were duplicates and were removed, about 3 million in total.
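
To make concrete why dHash is cheap and insensitive to brightness shifts, here is a minimal pure-Python sketch. It assumes the image has already been decoded into a 2-D list of grayscale values; `resize_nearest`, `is_duplicate`, and the `max_distance=5` threshold are illustrative choices rather than details from this article, and a real pipeline would normally use an image library for decoding and resizing.

```python
def resize_nearest(gray, width, height):
    """Nearest-neighbour resize of a 2-D list of grayscale values."""
    src_h, src_w = len(gray), len(gray[0])
    return [
        [gray[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def dhash(gray, hash_size=8):
    """Difference hash: one bit per pixel, set when it is brighter than its right neighbour."""
    small = resize_nearest(gray, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def is_duplicate(gray_a, gray_b, max_distance=5):
    """Treat two images as duplicates when their hashes differ in few bits (assumed threshold)."""
    return hamming(dhash(gray_a), dhash(gray_b)) <= max_distance
```

Because only the sign of each neighbouring-pixel difference is stored, adding a constant brightness offset to every pixel leaves the hash unchanged, which is the kind of tolerance described above.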

2.2.2 Detection database construction

As the overall framework in Figure 4 shows, the first step of Scan object recognition is subject detection: locating the region the user is interested in and removing background interference from the later stages. Subject detection is widely accepted as the first step in most search-by-image products, as Figure 7 shows for Alibaba's Pailitao, Baidu's image recognition, and Microsoft's flower recognition mini program. The subject detection algorithms differ: Pailitao uses object detection, Baidu uses salient region prediction, and Microsoft's flower recognition asks the user to help position the subject. To spare users this effort, we want the algorithm to locate the product region automatically. Because salient region prediction struggles when several products appear in the field of view, we follow Pailitao and use a more accurate object detection algorithm to locate products, then select the product with the highest confidence for the subsequent retrieval and information display.

Figure 7. Subject detection in commonly used products

An object detection model depends on a detection database. Here we compare three ways of annotating bounding box positions and categories: manual annotation, weakly supervised annotation, and semi-supervised annotation.

The commonly used manual annotation tool LabelImg is shown in Figure 8 below. We need to annotate the positions and category labels of the 11 product categories in the crawled product database. Given the large time and money cost of manual annotation, we use it only for a small number of samples and rely on the algorithms below to annotate most detection boxes automatically.

Figure 8. LabelImg, a commonly used manual detection annotation tool

The core idea of weakly supervised detection annotation is that labeling only the object category of an image costs far less time than labeling box + category. Can the object's location then be inferred automatically from image-level category labels alone, so that detection annotation becomes automatic? Many researchers in academia and industry have explored this direction. The main idea is to mine locally salient regions of the image and use them to represent the category semantics of the whole image. Figure 9 lists several representative papers. We highlight the algorithm in the lower-left corner of Figure 9: it was the first work in the field to use deep learning for weakly supervised detection, and it was developed by a colleague in our lab. With it we took part in the ImageNet 2014 competition and won one of the sub-tracks. Despite this promise, weakly supervised detection has a serious flaw: it tends to fit local regions. For example, given a large number of cat images it learns to localize the cat's face and has difficulty localizing the whole cat. This is unacceptable for our product retrieval, where we need to recall exactly the same item (fine-grained retrieval) and every part of the object matters, so we essentially ruled this approach out.

Figure 9. Commonly used weakly supervised detection algorithms [3,4,5,6]

Compared with weakly supervised detection, semi-supervised detection annotation is closer to the task itself. A detection model is initialized from a small number of manually annotated boxes plus open-source detection databases; the model then automatically annotates the remaining product images; as the number of annotations grows, the detection model is updated; and model updating and automatic annotation alternate from then on. To keep this loop reinforcing itself in the right direction, we manually correct annotated samples with low confidence, as shown in the schematic below. With this algorithm we annotated the whole product detection database: millions of detection boxes across the full product database of 11 categories, of which only a few hundred thousand were annotated manually to bootstrap the process, and the rest were annotated essentially automatically, greatly saving time and money.

Figure 10. Schematic diagram of semi-supervised detection algorithm flow
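
The loop in Figure 10 can be sketched roughly as follows. This is only an outline under stated assumptions: `train`, `predict`, and `review` are hypothetical callables standing in for detector training, detector inference, and manual correction, and the batch size and confidence threshold are made-up values, since the article does not give them.

```python
def semi_supervised_labeling(seed, unlabeled, train, predict, review,
                             batch_size=1000, conf_threshold=0.8):
    """Grow a labeled detection set from a small manually labeled seed.

    seed:      dict image_id -> manual annotation
    unlabeled: iterable of image ids still to annotate
    train:     callable(labeled dict) -> model                      (hypothetical)
    predict:   callable(model, image_id) -> (annotation, confidence) (hypothetical)
    review:    callable(image_id, annotation) -> corrected annotation (manual step)
    """
    labeled = dict(seed)
    pending = list(unlabeled)
    auto_count, manual_count = 0, 0
    for start in range(0, len(pending), batch_size):
        # Retrain on everything labeled so far, then annotate the next batch.
        model = train(labeled)
        for image_id in pending[start:start + batch_size]:
            annotation, confidence = predict(model, image_id)
            if confidence < conf_threshold:
                # Low-confidence predictions go to manual correction.
                annotation = review(image_id, annotation)
                manual_count += 1
            else:
                auto_count += 1
            labeled[image_id] = annotation
    return labeled, auto_count, manual_count
```

As the labeled set grows, later batches should need fewer manual corrections, which is where the savings in annotation cost come from.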

2.2.3 Retrieval database construction

After image deduplication and subject detection, a natural idea is to train the retrieval model directly on the cropped product images together with their SKU IDs. Unfortunately this does not work, because the cropped product images still have two major problems. 1. Same-item noise: one SKU may contain images of different items, caused by users uploading wrong pictures, merchants showing partial details, detection and cropping errors, and other factors. 2. Same-item merging: different SKUs may correspond to the same item; if they are not merged, the number of classes grows rapidly and the model is hard to converge. So before training the retrieval model we must complete same-item denoising and same-item merging. We propose a same-item denoising algorithm based on automatic clustering and a same-item merging algorithm based on the classification confusion matrix, as shown in Figure 11 below. The two algorithms are described separately below.

Figure 11. Schematic diagram of the same-item denoising + same-item merging algorithms

2.2.3.1 Same-item denoising

We use a clustering algorithm to automatically remove the noisy images within each product SKU. We studied the common clustering algorithms shown in Figure 12 below.

Figure 12. Clustering effect display and speed comparison of common clustering algorithms

For the theory and implementation of these algorithms, see [7]. Figure 13 compares the clustering algorithms on common indicators. Given the characteristics of product SKU clustering, we care most about a clustering algorithm's noise resistance, its adaptability to different SKU feature distributions, and its processing speed. After analysis and practical experiments we chose DBSCAN as our clustering algorithm and adapted it to suit product clustering and denoising better; we call the result hierarchical DBSCAN.

Figure 13. Comparison of common clustering algorithms on various indicators

Hierarchical DBSCAN has two steps. Step 1 finds the largest, most tightly packed cluster. Step 2 revisits the noise samples to recover hard positive samples of the same item and increase diversity. The two steps are briefly described below.

Hierarchical DBSCAN step 1. This step selects the largest, most tightly packed cluster in the SKU, because same-item samples within an SKU are distributed more regularly than noise samples and are usually more numerous. The algorithm is shown in Figure 14 below. We set the minimum number of samples in a neighborhood to 2. After all core points are connected, if there are several clusters, we keep the cluster with the most samples as the selected product images and treat the rest as noise samples.

Figure 14. Schematic diagram of DBSCAN step 1 process

Figure 15 shows the step 1 clustering result for a real SKU. Because the threshold is strict, the largest cluster is very clean, but the noise actually contains some hard positive samples (marked with red circles) that were mistakenly classified as noise. These hard positives are very important for the robustness and generalization of the retrieval model, so we need to move them from the noise back into the largest cluster on the left, which is what step 2 does.

Figure 15. A diagram showing the actual clustering effect of step 1 for an SKU

Hierarchical DBSCAN step 2. To recover hard positive samples from the noise and enrich the training samples, we revisit the noise samples, compute the distance between each noise sample and the largest cluster, and reassign the noise samples that are close enough to satisfy a threshold to the largest cluster, as shown in Figure 16 below.

Figure 16. Schematic diagram of DBSCAN step 2 process
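
The two steps can be sketched in a few lines of pure Python. This is a minimal illustration rather than the production implementation: it assumes Euclidean distance on feature vectors, measures "distance to the largest cluster" as distance to its centroid, and uses two hypothetical thresholds, `eps_strict` for step 1 and `eps_loose` for step 2. None of these choices are specified in the article.

```python
import math


def region(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]


def dbscan(points, eps, min_samples=2):
    """Plain DBSCAN; returns one cluster id per point, -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region(points, i, eps)
        if len(neighbours) < min_samples:
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point earlier marked as noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = region(points, j, eps)
            if len(more) >= min_samples:
                queue.extend(more)  # j is a core point, keep expanding
    return labels


def hierarchical_dbscan(points, eps_strict, eps_loose, min_samples=2):
    """Return (kept indices, removed indices) for the images of one SKU."""
    # Step 1: strict clustering, keep only the largest cluster.
    labels = dbscan(points, eps_strict, min_samples)
    clustered = [label for label in labels if label != -1]
    if not clustered:
        return [], list(range(len(points)))
    largest = max(set(clustered), key=clustered.count)
    keep = [i for i, label in enumerate(labels) if label == largest]
    # Step 2: revisit everything else and recover samples near the cluster centroid.
    dim = len(points[0])
    centroid = [sum(points[i][d] for i in keep) / len(keep) for d in range(dim)]
    noise = [i for i, label in enumerate(labels) if label != largest]
    recovered = [i for i in noise if math.dist(points[i], centroid) <= eps_loose]
    removed = [i for i in noise if i not in recovered]
    return sorted(keep + recovered), removed
```

Using a strict threshold first and a looser one second keeps the seed cluster clean while still pulling hard positives back in.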

2.2.3.2 Same-item merging

Same-item merging combines different SKUs that are actually the same item, so as to obtain the truly distinct item classes. We use the classification confusion matrix to merge same items automatically. On the pre-merge data we train an initial classification model and compute the confusion probability between any two SKUs. The confusion probability is defined as p_ij = c_ij / N_i, where p_ij is the probability that class i is predicted as class j, c_ij is the number of class i samples that the model predicts as class j, and N_i is the actual number of class i samples. An example confusion probability matrix is shown on the left of Figure 17. When two SKUs are actually the same item, such as classes 1 and 3 on the left of Figure 17, they are hard to tell apart and their confusion probability is high. By setting a score threshold we can merge such SKUs. In practice we follow the iterative procedure on the right of Figure 17: first merge the most confident same-item pairs using a high threshold, then update the classification model, then merge more same-item SKUs using a lower threshold. These steps are repeated until no same-item SKUs remain to be merged.

Figure 17. Left: schematic of confusion probability between commodity SKUs, right: schematic of iterative merge process
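
A single merge round based on p_ij = c_ij / N_i might look like the sketch below. The function names and the union-find grouping are illustrative assumptions; retraining the classifier between rounds and lowering the threshold, as in the iterative procedure above, is left outside the sketch.

```python
def confusion_probabilities(true_labels, predicted_labels):
    """Return {(i, j): c_ij / N_i} for every confused pair i != j."""
    counts, totals = {}, {}
    for truth, pred in zip(true_labels, predicted_labels):
        totals[truth] = totals.get(truth, 0) + 1
        if truth != pred:
            counts[(truth, pred)] = counts.get((truth, pred), 0) + 1
    return {pair: c / totals[pair[0]] for pair, c in counts.items()}


def merge_groups(skus, probabilities, threshold):
    """Group SKUs whose confusion probability reaches the threshold (union-find)."""
    parent = {sku: sku for sku in skus}

    def find(sku):
        while parent[sku] != sku:
            parent[sku] = parent[parent[sku]]  # path compression
            sku = parent[sku]
        return sku

    for (i, j), p in probabilities.items():
        if p >= threshold:
            parent[find(i)] = find(j)
    groups = {}
    for sku in skus:
        groups.setdefault(find(sku), []).append(sku)
    return sorted(sorted(group) for group in groups.values())
```

For example, if 4 of 10 samples of SKU 1 are predicted as SKU 3, then p_13 = 0.4, and the two SKUs are merged whenever the threshold is at or below that value in either direction.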

After same-item merging, the scale of the resulting retrieval training database is shown in the figure below: 70,000+ classes and 10 million+ training samples in total. Compared with mainstream open-source databases such as ImageNet (1,000 classes, 1.3 million+ images) and Open Images (6,000 classes, 9 million+ images), it is richer in both the number of classes and the number of samples.

2.2.3.3 Cost and benefit

Compared with purely manual cleaning, using the algorithms for same-item denoising + same-item merging brings huge savings in time and money and speeds up the iteration of the whole project.

2.3 Algorithm research and development for Scan object recognition

The previous section described how to prepare the provisions (cleaning high-quality training data), so this section naturally turns to the troops (algorithm models that use the existing training data efficiently). Following the overall framework, we focus on the three modules involved in visual same-item search: object detection, category prediction, and same-item retrieval.

2.3.1 Object detection

Object detection is the first step of Scan object recognition. We need to locate the product in the user's photo and eliminate the background's interference with the subsequent same-item retrieval. As Figure 18 shows, researchers optimize detection algorithms from different angles. From the speed angle, detectors divide into one-stage and two-stage. From the anchor angle, they divide into anchor-based, anchor-free, and guided-anchor methods; anchor-free methods have recently been on the rise again, matching two-stage detectors in accuracy and one-stage detectors in speed. Other work starts from class imbalance, from detecting objects at different scales, and so on.

Figure 18. Common deep learning algorithms for object detection

For product detection, three issues matter most: 1. speed; 2. class imbalance in the detection annotations; 3. large variation in object scale. With these three in mind, we chose RetinaNet [8], shown in Figure 19, as our detector. First, RetinaNet is a one-stage detector that regresses object locations and categories directly from the original image, so it is fast. Second, it uses a feature pyramid architecture, which adapts well to multi-scale object detection. Third, RetinaNet proposes focal loss, which handles class imbalance as well as the imbalance between easy and hard samples. Later we will optimize the RetinaNet head with an anchor-free strategy to speed up detection further; this is not covered here.

Figure 19. Schematic diagram of the retinanet algorithm architecture

We compared RetinaNet-ResNet50-FPN with the classic one-stage detector YOLOv3 and the two-stage detector Faster R-CNN-ResNet50-FPN, as shown in Figure 20 below. The evaluation data are 7K images from the 11 categories collected through outsourcing. The results in the table show that RetinaNet achieves a good trade-off between speed and accuracy. Later we will use TensorRT to further optimize RetinaNet's forward speed; as a preview, it can reach up to 80 FPS.

Figure 20. Comparison of RetinaNet-ResNet50-FPN with YOLOv3 and Faster R-CNN-ResNet50-FPN

2.3.1 Category prediction

Category prediction determines the category of the object in the cropped detection region, so that features can then be extracted and indexed with the model for that category. You may wonder: the detector has already produced the product's location and category, so why not use the detector's category directly instead of predicting it again? The reason is shown in Figure 21 below. The detector's training data are limited, while user-uploaded pictures can be highly varied, so subclasses that never appeared in the training set easily cause the detector to misclassify; confusion between classes also leads to classification errors.

Figure 21. Problems with using the detector's category directly

How, then, can category accuracy be improved? We use the massive online retrieval database. We search the retrieval database with the user's query, look up the categories of its top-20 nearest neighbors, and combine them with the detector's predicted category in a re-weighted vote to obtain the object's final category. With detection + retrieval, category prediction accuracy improves substantially, by nearly 6 percentage points.
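
One way such a re-weighted vote could work is sketched below. The weighting scheme is an assumption made for illustration: each neighbor votes with weight 1/rank, and the detector's prediction votes with its confidence times a hypothetical `detector_weight`. The article does not specify the actual weights.

```python
from collections import defaultdict


def predict_category(detector_category, detector_score, neighbour_categories,
                     detector_weight=5.0):
    """Combine the detector's category with the categories of the nearest neighbours.

    neighbour_categories is ordered from nearest to farthest (e.g. the top-20 results).
    """
    votes = defaultdict(float)
    votes[detector_category] += detector_weight * detector_score
    for rank, category in enumerate(neighbour_categories, start=1):
        votes[category] += 1.0 / rank  # closer neighbours count more
    return max(votes, key=votes.get)
```

With this scheme a low-confidence detector prediction can be overruled when most of the nearest neighbors agree on another category, while a confident prediction backed by the neighbors is kept.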

2.3.2 Same-item retrieval

Same-item retrieval is the soul of Scan object recognition. Unlike general search by image, which only needs to find similar images, same-item retrieval is fine-grained: it must return exactly the same item as the query. For example, a query of the Huawei Watch GT2 must return that watch model, not watches of other brands, which makes same-item retrieval very challenging. Figure 22 lists its difficulties and challenges: 1. inter-class confusion, that is, how to distinguish similar items from the same item; 2. same-item recall, that is, how to retrieve the same item when images of it differ greatly from one another. With these difficulties in mind, we propose 9 broad directions for optimizing our same-item retrieval model, explained below.

Figure 22. Difficulties and challenges of same-item retrieval

2.3.2.1 Same-item retrieval: classification model

This is our baseline model. The classification model we use is shown on the left of Figure 23 below. As is well known, a classification model uses softmax to map logits to class probabilities. As the table in the upper right of the figure shows, softmax amplifies the differences between logits and pushes the probability of the correct class toward 1, which helps the model converge quickly. The figure also shows the decision boundary of a softmax classifier and the feature distribution learned on the 10 MNIST classes [9,10]. The feature space learned by a softmax classifier has three notable characteristics:

1) The feature space is fan-shaped, so retrieval by cosine distance works better than retrieval by Euclidean distance. We therefore use cosine distance for same-item retrieval from here on.

2) Different classes are unevenly distributed. Ideally every class would be weighted equally and the classes would divide the feature space evenly, which benefits the model's generalization.

3) Retrieval is inaccurate for boundary samples. The feature plot shows that a boundary sample's cosine distance to the adjacent class may well be smaller than its distance to its own class, causing retrieval errors. Next we focus on fixing these problems of the classification model.

Figure 23. Analysis of important characteristics of the classification model

2.3.2.2 Same-item retrieval: classification model improvement 1, normalization

There are two kinds of normalization: normalizing the classification weights W in Figure 23 and normalizing the features X. What does normalization do? First consider the relationship between the norm of W and the uneven distribution of the feature space. Researchers [10] have shown that the more samples a class has in the training database, the larger the norm of its W, meaning the model attends more to the classification accuracy of that class and neglects classes with few samples, as shown in Figure 24; both the MNIST and WebFace databases confirm this relationship. What we want instead is for every class to be valued equally and for the classes to divide the feature space evenly. We therefore normalize W so that the weights of all classes have the same norm, namely:

$$\hat{W}_j = \frac{W_j}{\lVert W_j \rVert_2}$$

Figure 24. Relationship between the number of samples in each category and the norm of its classification weight W

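Feature normalization is analogous, namely:

$$\hat{X} = \frac{X}{\lVert X \rVert_2}$$

Recall the decision boundary of softmax classification between two classes (bias terms omitted), where $\theta_j$ is the angle between $W_j$ and $X$:

$$\lVert W_1 \rVert \, \lVert X \rVert \cos\theta_1 = \lVert W_2 \rVert \, \lVert X \rVert \cos\theta_2$$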

We normalize both W and X, so the decision boundary depends only on the angle, which forces the features learned by the model to be distributed in an even more fan-shaped way after convergence and benefits cosine retrieval. However, normalizing both makes the model hard to converge. Why? Refer to the softmax behavior in Figure 23: once weights and features are both normalized, the classification logits are at most 1 and at least -1. For the same three-class problem, the softmax probability of the ground-truth class is then only about 0.78 at best, far below 1, so the loss stays large and the model converges poorly. The fix is simple: multiply the logits by a scale value S, which widens the gaps between them and helps the model converge.
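
The effect of the scale value is easy to check numerically. The sketch below computes the best achievable ground-truth probability when all logits are cosines in [-1, 1], with and without scaling; the function names and the particular scale values tried are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_case_probability(scale, num_classes=3):
    """Ground-truth probability when its cosine logit is 1 and all others are -1."""
    logits = [scale * 1.0] + [scale * -1.0] * (num_classes - 1)
    return softmax(logits)[0]


for scale in (1, 4, 16, 30):
    print(scale, round(best_case_probability(scale), 4))
```

With three classes and no scaling, the best case is roughly 0.787, which matches the ceiling described above; with a scale of 30 it is indistinguishable from 1. The ceiling drops further as the number of classes grows, so scaling matters even more for a database with tens of thousands of classes.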

2.3.2.3 Same-item retrieval: classification model improvement 2, angular margin

The core purpose of adding an angular margin is to make the fan-shaped feature distribution of softmax classification better suited to retrieval: samples of the same class become more compact and different classes move farther apart. Figure 25 below shows three common ways to add a margin: multiplicative angular margin [10,11], additive cosine margin [12], and additive angular margin [13].

Figure 25. Comparison between common Softmax and Margin Softmax

After a margin is added, the intra-class features learned by the softmax classification model become more compact, as shown in Figure 26 below. In addition, additive margins are easier to train than the multiplicative margin. The multiplicative margin shrinks the monotonic interval of the cosine from [0, π] to [0, π/m], which narrows the range over which gradients update, whereas additive margins leave the monotonic interval unchanged, which helps the model converge.

Figure 26. Comparison of feature distribution between common Softmax and Margin Softmax
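
For reference, the three margin types differ only in how they transform the ground-truth class logit. The sketch below is a simplified illustration: the function name, the `kind` strings, and the default values are assumptions, and the multiplicative form is written as plain cos(m·θ), which is only monotonic for θ in [0, π/m] as noted above.

```python
import math


def target_logit(cos_theta, kind="softmax", m=0.0, s=1.0):
    """Ground-truth class logit under different margin types.

    cos_theta is the cosine between the normalized feature and class weight,
    m is the margin, and s is the scale applied to the logit.
    """
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    if kind == "softmax":             # normalized softmax, no margin
        value = cos_theta
    elif kind == "multiplicative":    # multiplicative angular margin: cos(m * theta)
        value = math.cos(m * theta)
    elif kind == "additive_cosine":   # additive cosine margin: cos(theta) - m
        value = cos_theta - m
    elif kind == "additive_angular":  # additive angular margin: cos(theta + m)
        value = math.cos(theta + m)
    else:
        raise ValueError(f"unknown margin type: {kind}")
    return s * value
```

In every case the margin lowers the ground-truth logit for the same angle, so the model has to pull same-class features closer to the class center to reach the same loss.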

2.3.2.4 Same-item retrieval: classification model improvement 3, ranking loss

A classification model learns to discriminate between classes, which differs somewhat from the retrieval task. Introducing a ranking loss optimizes the retrieval task explicitly: in Euclidean space, the distance between samples of the same class should be smaller than the distance between samples of different classes. In general, combining a ranking loss with the classification loss works better. The model structure we designed is shown in Figure 27 below:

Figure 27. Architecture of the classification model with ranking loss

The right side of Figure 27 lists commonly used ranking loss functions, such as contrastive loss, triplet loss, and lifted structure loss. The figure highlights the triplet loss and visualizes its optimization target: the distance between same-class samples should be smaller than the distance between different-class samples by at least a margin.
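
The triplet objective can be written in a couple of lines. This is a minimal sketch using Euclidean distance on plain tuples; the margin value of 0.3 is an arbitrary illustrative default, not a setting from this article.

```python
import math


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge loss that is zero once d(anchor, positive) + margin <= d(anchor, negative)."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

The loss vanishes for triplets that already satisfy the margin, so only violating triplets contribute gradients, which is why mining hard triplets matters in practice.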

2.3.2.5 Same-item retrieval: performance comparison of the classification model and its improvements

Here we compare the classification model and its improved versions on the product retrieval task. The evaluation set is a product database of 11 categories that we collected. User review images serve as query samples, and the retrieval set consists of merchant images of the same item plus a noise set of other items, which simulates the online retrieval database well. Figure 28 compares the models on the jewelry category. It shows that: 1) adding normalization and an additive angular margin to the classification model, that is, ArcFace [13], outperforms the plain softmax classification model; 2) adding a ranking loss such as hard triplet loss to the classification model also outperforms the plain softmax classification model; 3) classification + normalization + angular margin + ranking loss, shown in the last two rows, improves performance further, though not by much.

Figure 28. Performance comparison of classification model and its improved version
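
To make "normalization plus additive angular margin" concrete, the following NumPy sketch computes ArcFace-style logits. The function name, the scale of 30 and the margin of 0.5 are assumptions for illustration, and the sketch omits the handling of angles that exceed pi after the margin is added, which practical implementations usually include.

```python
import numpy as np

def arcface_logits(features, class_weights, labels, scale=30.0, margin=0.5):
    """Cosine logits with an additive angular margin on the target class."""
    # Normalize both features and class weights so that logits are cosines.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = np.clip(f @ w.T, -1.0, 1.0)
    theta = np.arccos(cos)
    rows = np.arange(len(labels))
    theta[rows, labels] += margin  # penalize only the ground-truth class
    return scale * np.cos(theta)
```

The resulting logits are fed to an ordinary softmax cross-entropy loss; the margin lowers the target-class logit, so the network has to pull features of the same class closer together in angle.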

2.3.2.6 Multi-task model for same-item retrieval

To further improve the expressive power of the retrieval model, we explored letting the retrieval features capture richer commodity characteristics. For example, in addition to the style category, commodity attributes such as viewing angle and brand were added, as shown in Figure 29 below.

Figure 29. Retrieval features embed multiple commodity attributes

To learn from multi-attribute annotations, we designed a multi-task collaborative learning model, as shown in Figure 30 below. The benefits of multi-task collaborative learning are clear: it makes better use of the correlation between attributes, which helps the network converge and improves the generalization ability of the model. The retrieval features obtained this way are better suited to same-item commodity retrieval. This raises a question: how should the weights of the different tasks be set? Of course, we can set the weight of each task manually, but this requires a deeper understanding of each task and a larger number ...

Figure 30. Multi-task learning networks using multi-attribute tagging

The other strategy is to obtain the weight of each task automatically, which many researchers have studied. Here we adopt the validation set trend algorithm [14] to compute the weight of each task automatically, as shown in Figure 31 below. The idea is fairly straightforward: a high weight is set manually for the main task, such as style classification, and the other tasks are weighted according to how difficult they are to optimize. Tasks that are difficult to optimize (large loss fluctuation and high absolute loss) receive a large weight, while tasks that are easy to optimize receive a small weight. With multi-task collaborative learning, same-item retrieval performance improves greatly compared with the single-task model, as shown in Figure 32 below.

Figure 31. Multi-task collaborative learning model based on the validation set trend algorithm

Figure 32. Retrieval performance comparison between multi-task model and single-task model
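
The weighting idea can be sketched as follows. This is not the algorithm of [14]: the difficulty score (mean of recent validation losses plus their standard deviation), the function name `task_weights`, and the parameters `main_weight` and `aux_budget` are all assumptions chosen only to illustrate "fixed weight for the main task, larger weight for harder auxiliary tasks".

```python
import numpy as np

def task_weights(val_loss_history, main_task, main_weight=1.0, aux_budget=1.0):
    """val_loss_history maps task name -> list of recent validation losses."""
    difficulty = {}
    for task, losses in val_loss_history.items():
        if task == main_task:
            continue
        losses = np.asarray(losses, dtype=float)
        # Assumed difficulty score: loss level plus loss fluctuation.
        difficulty[task] = losses.mean() + losses.std()
    total = sum(difficulty.values())
    # Auxiliary tasks share aux_budget in proportion to their difficulty.
    weights = {task: aux_budget * score / total
               for task, score in difficulty.items()}
    weights[main_task] = main_weight  # the main task keeps its manual weight
    return weights
```

For example, with `{"style": [0.5, 0.4], "brand": [2.0, 1.0], "view": [0.2, 0.2]}` and `main_task="style"`, the brand task receives a much larger weight than the view task because its validation loss is both higher and less stable.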

2.3.2.7 Attention model for same-item retrieval

In Figure 22 we showed that one of the biggest challenges of same-item retrieval is confusion with similar items. To better distinguish the same item from similar ones, we want the model to pay more attention to salient regions, such as the logo and engraved text of a watch or the logo and pattern of shoes, as shown in Figure 33 below. Commonly used attention models fall into three types [15]: spatial attention, channel attention, and a combination of the two. Experimental comparison shows that retrieval performance improves when spatial attention is added but degrades when channel attention is added, as shown in Figure 34 below. We believe that spatial attention strengthens the model's focus on salient spatial regions, whereas channel attention may cause the model to overfit, so performance drops.

Figure 33. Attention model

Figure 34. Comparison of retrieval performance of three attention models
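
The three attention types can be illustrated with a simplified NumPy sketch in the spirit of [15]. The gates here use a single linear layer each, and the weights `w_spatial` and `w_channel` are passed in as plain arrays; these simplifications and all names are assumptions for illustration, not the architecture used in the experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat, w_spatial):
    """feat: (C, H, W); w_spatial: (C,) weights of a 1x1 convolution."""
    # One gate value per spatial position, shared by all channels.
    gate = sigmoid(np.tensordot(w_spatial, feat, axes=(0, 0)))  # (H, W)
    return feat * gate[None, :, :]

def channel_attention(feat, w_channel):
    """w_channel: (C, C) weights of one fully connected layer (simplified)."""
    pooled = feat.mean(axis=(1, 2))     # global average pooling, (C,)
    gate = sigmoid(w_channel @ pooled)  # one gate value per channel
    return feat * gate[:, None, None]

def combined_attention(feat, w_spatial, w_channel):
    """Sum of the spatially gated and channel-gated feature maps."""
    return spatial_attention(feat, w_spatial) + channel_attention(feat, w_channel)
```

Spatial attention rescales positions (for example a logo region) while leaving the channel mix untouched; channel attention rescales whole feature channels regardless of position.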

2.3.2.8 Hierarchical difficulty-aware model for same-item retrieval

To address the two major challenges of same-item retrieval, namely inter-class confusion and intra-class variation, we adopt a hierarchical difficulty-aware model that attends to both, as shown in Figure 35 below. Even among samples of the same item, the degree of similarity can vary greatly; for example, the same bag looks different under different shooting angles and occlusion. Among negative samples of other items, some are bags of a very similar style, while others, such as wine and lipstick, differ greatly.

Figure 35. Hierarchical distribution characteristics of intra-category differences and inter-category confusion in commodity retrieval

The structure of the hierarchical difficulty-aware model [16] is shown in Figure 36. The ranking loss is distributed across levels: the first level learns from all positive and negative sample pairs, the second level handles the harder pairs, and the third level handles the hardest pairs, with progressively harder samples assigned to deeper parts of the network.

Figure 36. Hierarchical difficulty-aware model

By matching the hierarchical distribution of positive and negative sample pairs with a hierarchical design, the hierarchical difficulty-aware model effectively improves same-item retrieval performance, as shown by the experiment in Figure 37 below.

Figure 37. Retrieval performance of the hierarchical difficulty-aware model
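
The level-by-level selection of harder pairs can be sketched as follows. This is a simplified illustration of the cascade idea, not the exact procedure of [16]; the distance thresholds and all function names are assumptions for the example.

```python
import numpy as np

def select_hard_pairs(distances, is_positive, pos_threshold, neg_threshold):
    """Keep positives that are still far apart and negatives that are still close."""
    hard_pos = is_positive & (distances > pos_threshold)
    hard_neg = ~is_positive & (distances < neg_threshold)
    return hard_pos | hard_neg

def cascade_masks(level_distances, is_positive, pos_thresholds, neg_thresholds):
    """level_distances[k] holds the pair distances under the level-(k+1) embedding.

    Returns one boolean mask per level: level 1 trains on every pair, and each
    later level trains only on pairs the previous level still found hard.
    """
    masks = [np.ones(len(is_positive), dtype=bool)]
    for dist, pos_t, neg_t in zip(level_distances, pos_thresholds, neg_thresholds):
        hard = select_hard_pairs(dist, is_positive, pos_t, neg_t)
        masks.append(masks[-1] & hard)
    return masks
```

Each mask selects the pairs whose ranking loss is back-propagated at that level, so deeper levels concentrate on same-item pairs that look different and on different-item pairs that look alike.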

2.3.2.9 Mutual learning model for same-item retrieval

It is well known that competition entries often fuse the results of multiple models to improve accuracy. In a real project, however, we must consider both the accuracy and the speed of the deployed model. Fusing multiple models occupies more computing resources and slows down the forward pass. So how can we combine the strengths of multiple model structures without increasing computing resources? This is the core idea of the mutual learning model. As shown in Figure 38 below, the mutual learning model absorbs the structural advantages of other models through a KL divergence loss; only one model needs to be deployed, so no extra computing resources are required.

Figure 38. Mutual learning model

In the experiment, resnet152 and inceptionv4 were trained with mutual learning, and resnet152 was used for retrieval in the actual deployment. Its performance is shown in Figure 39 below. Apart from a longer training time, mutual learning adds no burden to the online model, yet accuracy improves significantly.

Figure 39. Same-item retrieval accuracy of the mutual learning model
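
The KL divergence coupling between two networks can be sketched as follows. This is a minimal illustration on classification logits, assuming each network treats the other's prediction as a fixed target; the function names are hypothetical and the sketch does not reproduce the actual training setup.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """Mean KL(p || q) over the batch."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1).mean()

def mutual_learning_losses(logits_a, logits_b, labels):
    """Per-network loss: own cross-entropy plus KL toward the peer's prediction."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    rows = np.arange(len(labels))
    ce_a = -np.log(p_a[rows, labels] + 1e-12).mean()
    ce_b = -np.log(p_b[rows, labels] + 1e-12).mean()
    loss_a = ce_a + kl_divergence(p_b, p_a)  # network A mimics network B
    loss_b = ce_b + kl_divergence(p_a, p_b)  # network B mimics network A
    return loss_a, loss_b
```

Both networks are trained together with their own loss, but only one of them has to be kept for deployment, which is why inference cost does not grow.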

2.3.2.10 Local saliency erasing for same-item retrieval

After applying all the strategies above, we found that one problem remained: deep learning models pay too much attention to the texture regions of an image and ignore the shape of the object. For example, a same-item search for a suitcase returns bags, wallets and other items with the same pattern rather than suitcases. How can we make the model attend to the shape of the object as well as its texture? We use local saliency erasing to break up the texture of the original image and force the model to focus on object shape. There are three common types of local saliency erasing, as shown in Figure 41 below: random erasing, Bernoulli erasing and adversarial erasing. Here we focus on comparing the first two; adversarial erasing ...

Figure 40. Commonly used CNN models pay too much attention to image texture and ignore shape

Figure 41. Local saliency erasure

The experimental results are shown in Figure 42 below. Local saliency erasing greatly improves same-item retrieval accuracy.

Figure 42. Same-item retrieval performance with local saliency erasing
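
The first two erasing variants can be sketched as simple data augmentations. The article does not define them precisely, so the interpretation here is an assumption: random erasing blanks one randomly placed rectangle, and Bernoulli erasing blanks each grid cell independently with probability `p`. All names and default values are illustrative.

```python
import numpy as np

def random_erase(image, rng, area_frac=0.2, fill=0.0):
    """Erase one randomly placed rectangle covering roughly area_frac of the image."""
    h, w = image.shape[:2]
    eh = max(1, int(round(h * np.sqrt(area_frac))))
    ew = max(1, int(round(w * np.sqrt(area_frac))))
    top = rng.integers(0, h - eh + 1)   # upper bound is exclusive
    left = rng.integers(0, w - ew + 1)
    out = image.copy()
    out[top:top + eh, left:left + ew] = fill
    return out

def bernoulli_erase(image, rng, cell=4, p=0.3, fill=0.0):
    """Erase each cell x cell patch independently with probability p."""
    h, w = image.shape[:2]
    out = image.copy()
    for top in range(0, h, cell):
        for left in range(0, w, cell):
            if rng.random() < p:
                out[top:top + cell, left:left + cell] = fill
    return out
```

A typical use would be `random_erase(image, np.random.default_rng(0))` on each training image; because the erased patches destroy local texture, the model has to rely more on the overall outline of the object.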

2.3.2.11 k-reciprocal nearest neighbor re-ranking for same-item retrieval

So far we have discussed how to optimize the retrieval model. Here we discuss how to further optimize its retrieval results, namely re-ranking. For re-ranking, I personally appreciate the k-reciprocal nearest neighbor algorithm [17], which is very simple and efficient. It was first proposed for person re-identification, as shown in Figure 43 below. Its core observation is this: among the top-10 samples retrieved for a query, if a positive sample (the same person as the query) is itself used as a query, the original query appears among its k nearest neighbors; for a negative sample (not the same person as the query), the original query does not appear among its k nearest neighbors. Based on this observation, the authors add a distance measure based on k-reciprocal nearest neighbors on top of the Mahalanobis distance, as shown at the bottom of the figure. This measure re-ranks the original list effectively, moving positive samples forward and pushing negative samples back.

Figure 43. k-reciprocal nearest neighbor algorithm in person re-identification
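
The k-reciprocal idea can be sketched on a precomputed distance matrix as follows. This is a simplified illustration in the spirit of [17], without the local query expansion and other refinements of the original method; adding each sample to its own neighbor set, the function names, and the values of `k` and `lam` are assumptions for the example.

```python
import numpy as np

def k_nearest(dist, i, k):
    """Indices of the k nearest samples to i (excluding i itself)."""
    order = np.argsort(dist[i])
    return [int(j) for j in order if j != i][:k]

def k_reciprocal(dist, i, k):
    """Samples j in i's k-NN list that also have i in their own k-NN list."""
    return {j for j in k_nearest(dist, i, k) if i in k_nearest(dist, j, k)}

def jaccard_distance(dist, i, j, k):
    # Assumption: each sample is added to its own set so the sets are never empty.
    a = k_reciprocal(dist, i, k) | {i}
    b = k_reciprocal(dist, j, k) | {j}
    return 1.0 - len(a & b) / len(a | b)

def rerank(dist, query, k=2, lam=0.5):
    """Blend original and Jaccard distances, then sort the other samples."""
    final = {}
    for j in range(dist.shape[0]):
        if j == query:
            continue
        jac = jaccard_distance(dist, query, j, k)
        final[j] = lam * dist[query, j] + (1.0 - lam) * jac
    return sorted(final, key=final.get)
```

Samples whose reciprocal neighborhoods overlap strongly with that of the query get a small Jaccard distance and move up the list, while samples that are close only in one direction are pushed down.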

In practice, however, we cannot apply this algorithm directly to same-item commodity retrieval, because our queries are user review images while the retrieval set consists of merchant images. The large gap between the two causes k-reciprocal nearest neighbor matching to fail. Going forward, our focus is on optimizing the feature metric space to reduce the domain gap, and then applying k-reciprocal nearest neighbors for re-ranking.

2.3.2.12 Comparison of competing products

Finally, on 7K images uploaded by users, we evaluated accuracy on the 11 detection categories and compared it with competing products, running on an NVIDIA P4 GPU. The comparison shows that our algorithm outperforms JD.com (Jingdong) and is close to Pailitao.

2.4 Platform construction for scanning objects

As the saying goes, sharpening the axe does not delay the cutting of firewood. Platform construction is crucial for our data construction, model learning and model deployment. We introduce each platform below.

2.4.1 Data cleaning platform

To speed up manual calibration and annotation, we developed a series of tools that assist annotation and verify model accuracy; we will not go into detail here.

2.4.2 Model training platform

In recent years, machine learning platforms have developed rapidly, and many companies have their own deep learning platforms. With few people and a limited budget, we mainly developed two same-item retrieval platforms, one on Caffe and one on PyTorch, which are introduced below.

Figure 44. History of machine learning platforms [18]

2.4.2.1 Caffe

The first model training platform we developed was based on Caffe. The core architecture of the Caffe platform is shown in Figure 45 below. Caffe is now used mostly in industry and less in academia. The Caffe retrieval platform we developed has the following features:

1) Support for rich data augmentation strategies; 2) support for multi-type data loading; 3) support for distillation learning/mutual learning algorithms; 4) support for the difficulty-aware model algorithm; 5) support for ranking model algorithms; 6) support for batch size expansion.

The advantages and disadvantages of Caffe are both obvious. Its advantages are: 1) fast training and stable results; 2) arbitrary combinations of multiple models, multiple labels and multiple data sources can be tested quickly through Prototxt. Its disadvantages are: 1) new algorithms are slow to develop; 2) debugging is inflexible; 3) GPU memory optimization is poor; 4) frontier academic methods are rarely updated. The fourth disadvantage is the most critical: we could not follow the academic frontier quickly, so we decided to develop a PyTorch retrieval platform next.

Figure 45. Caffe platform core architecture

2.4.2.2 PyTorch

The PyTorch retrieval architecture we developed, shown in Figure 46 below, supports essentially all features of the Caffe retrieval platform: 1) rich data augmentation strategies; 2) multi-type data loading; 3) distillation learning/mutual learning algorithms; 4) ranking model algorithms; 5) more mainstream networks such as EfficientNet; 6) data denoising/same-item merging/retrieval; 7) mixed precision training. PyTorch also has clear advantages and disadvantages. Its advantages are: 1) automatic differentiation and high algorithm development efficiency; 2) dynamic graphs and Python programming, which make it easy to use; 3) easy visualization with Tensorboard; 4) many resources on GitHub, which helps keep up with the latest trends; 5) PyTorch 1.3 supports mobile deployment. Of course, PyTorch is not perfect and has some disadvantages compared with Caffe: 1) multi-task configuration is not as convenient as with Caffe Prototxt.

Figure 46. Same-item retrieval platform built on PyTorch

2.4.3 Model deployment platform

During model training we can ignore the runtime cost of the model, but during deployment we must pay close attention to its resource usage and try to improve its concurrency, for example through GPU memory optimization, hardware adaptation and speed-up. Here we focus on TensorRT, used for backend GPU deployment, and NCNN, used for mobile deployment.

Figure 47. Model training to deployment

2.4.3.1 Model deployment platform: TensorRT

TensorRT is a deployment platform developed by NVIDIA that effectively reduces model GPU memory usage and speeds up the forward pass. We do not go into details here. As the detection and retrieval models below show, TensorRT quantization and acceleration bring a huge improvement in both GPU memory usage and speed.

Figure 48. TensorRT deployment acceleration

2.4.3.2 Model deployment platform: NCNN

For mobile deployment, we use the NCNN framework developed by Tencent. Its advantages are shown on the left of Figure 49 below, and a demo is shown on the right.

Figure 49. Mobile NCNN deployment

2.4.4 Task scheduling system platform

The task scheduling platform was developed by our backend experts and is mainly used to invoke each task efficiently. Since our retrieval database holds hundreds of millions of items, the platform must provide good fault tolerance, disaster recovery and robustness mechanisms, as shown in Figure 50 below. Of course, what is shown here is only the tip of the iceberg; the backend experts will explain it in detail on KM.

Figure 50. Hundred-million-scale retrieval task scheduling platform

3. Outlook for scanning objects

Finally, we look to the future of scanning objects. As we have said before, we hope that scanning objects becomes an everyday habit for everyone: scan, and know what you see; scan, for a new life and a new style.

Figure 51. The future of scanning objects

References

[1] Internal company documents

[2] blog.csdn.net/Notzuonotdi…

[3] Learning Deep Features for Discriminative Localization, CVPR16

[4] Weakly Supervised Object Localization with Latent Category Learning, ECCV14

[5] Weakly Supervised Deep Detection Networks, arXiv16

[6] Seed, Expand and Constrain: Three Principles for Weakly-Supervised Image Segmentation, arXiv16

[7] https://scikit-learn.org/stable/modules/clustering.html

[8] Focal Loss for Dense Object Detection, arXiv18

[9] zhuanlan.zhihu.com/p/76391405

[10] SphereFace: Deep Hypersphere Embedding for Face Recognition, arXiv18

[11] Large-Margin Softmax Loss for Convolutional Neural Networks, arXiv17

[12] CosFace: Large Margin Cosine Loss for Deep Face Recognition, arXiv18

[13] ArcFace: Additive Angular Margin Loss for Deep Face Recognition, arXiv18

[14] Adaptively Weighted Multi-task Deep Network for Person Attribute Classification, MM17

[15] Concurrent Spatial and Channel 'Squeeze & Excitation' in Fully Convolutional Networks, arXiv18

[16] Hard-Aware Deeply Cascaded Embedding, ICCV17

[17] Re-ranking Person Re-identification with k-reciprocal Encoding, CVPR17

[18] Internal documents