Image retrieval is the task of finding images that contain the same or a similar instance in an image database, given a query image. A typical application is e-commerce product search: on Taobao, users only need to take a photo to search accurately, which greatly improves the shopping experience. Let's look at the implementation and the computer vision technology behind Taobao's Pailitao.



Image retrieval is an AI technology with many application scenarios in the Internet industry. Typical applications include e-commerce product retrieval (Taobao's "Pailitao", JD's "photo shopping"), where users can simply take a photo to retrieve products accurately; the full implementation involves a great deal of computer vision technology. In this article, we look at the implementation and the computer vision technology behind Taobao Pailitao, based on a share by Frank, a senior algorithm engineer at Alibaba CV.

1. Image retrieval improves shopping experience

The image retrieval task refers to finding images that contain the same or a similar instance in an image database, given a query image containing a specific instance (a particular object, building, scene, etc.).

Because images differ in shooting angle, illumination, occlusion, and so on, accurate retrieval requires substantial algorithmic support. Meanwhile, for major Internet companies with huge image databases, query efficiency is also one of the core issues to consider.

Take e-commerce as an example. Pailitao was first launched in the Taobao app in 2014 and now has tens of millions of daily active users. Compared with traditional text-based e-commerce search, Pailitao only requires users to take a photo to search accurately, eliminating tedious text descriptions, simplifying the shopping process, and greatly improving the e-commerce shopping experience.

2. Image search architecture of Taobao Pailitao

The image search architecture of Pailitao is shown in the figure below; it is divided into an offline process and an online process.

2.1 Offline Process

The offline process periodically extracts features from images and builds the index. The complete offline flow includes:

1. Detection and feature learning: build the offline image collection, and extract the products of interest from the selected images via object detection;

2. Feature extraction: extract features for the products, build a large-scale index library, and load it into the image search engine for querying;

3. Index building: keep the index database updated at a fixed frequency.

2.2 Online Process

The online process mainly retrieves results from the database for query images uploaded by users. The steps include:

1. Category recognition: classify the query image to identify the product category;

2. Object localization & CNN feature extraction: extract features from the target region of the image and retrieve candidates from the index engine by similarity;

3. Image indexing and re-ranking: re-rank the candidate products and return the search results.
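The online flow above can be sketched in a few lines. This is a toy illustration, not Pailitao's actual engine: the index layout, category names, and the `search` function are all hypothetical, and the real system uses binarized CNN features and a distributed engine rather than an in-memory list.

```python
from math import dist

# Toy index: each entry is (category, feature_vector, item_id).
index = [
    ("shoes", (0.9, 0.1), "item-1"),
    ("shoes", (0.2, 0.8), "item-2"),
    ("bags",  (0.5, 0.5), "item-3"),
]

def search(query_feature, query_category, top_k=2):
    """Online flow: restrict to the predicted category, then rank by L2 distance."""
    candidates = [e for e in index if e[0] == query_category]
    ranked = sorted(candidates, key=lambda e: dist(e[1], query_feature))
    return [item_id for _, _, item_id in ranked[:top_k]]
```

Restricting the search to the predicted category first is what makes the category recognition step below worthwhile: it shrinks the candidate set before any expensive distance computation.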

3. Category recognition module

3.1 Image selection processing

Image selection is needed because the Taobao platform hosts many identical or highly similar product images. Using them directly would lead to many duplicates in the final search results, hurting the user experience.

Taobao contains a huge number of product images from different sources, such as the "main image", "SKU images", and "unboxing images". These massive images must first be screened, selecting the images users are most likely interested in as the product images for indexing.

This step is equivalent to filtering the entire image library by attached attributes, image quality, and so on. With the image selection module in place, duplicate or highly similar product images are selected and removed periodically every day, optimizing the index file.

3.2 Category prediction based on the combination of model and search

Taobao's category system is a hierarchy of leaf categories that balances visual and semantic similarity. During a Pailitao query, the image is first classified into one of 14 categories (such as clothing, shoes, bags), which reduces the search space of the image library. The implementation combines model-based and search-based prediction, as follows:

3.2.1 Model-based prediction module

  1. The GoogLeNet V1 network structure was used to balance high accuracy and low latency, trained on image sets labeled with the target product categories.
  2. Input images were resized to 256×256 and randomly cropped to 227×227; softmax cross-entropy was used as the loss function for the classification task.
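The resize-then-random-crop augmentation in step 2 can be sketched as follows. This is a minimal illustration operating on nested lists; a real pipeline would use tensor operations, and only the 256→227 crop sizes come from the text.

```python
import random

def random_crop(image, crop=227):
    """image: H x W grid of pixels (e.g. 256x256 after resizing).
    Returns a random crop x crop sub-grid, as used for training augmentation."""
    h, w = len(image), len(image[0])
    top = random.randint(0, h - crop)
    left = random.randint(0, w - crop)
    return [row[left:left + crop] for row in image[top:top + crop]]
```

Random cropping gives the classifier some translation invariance cheaply, which matters for user photos where the product is rarely centered.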

3.2.2 Prediction module based on search

This module does not train a classification model directly. Instead, based on the idea of similarity matching, it uses the feature model and the retrieval database to perform a search-based weighted KNN classification. The prediction process is as follows:

  1. Take the image to be classified as input from the user;
  2. Apply the search-based classification method: extract image features, find the top-K similar images in the retrieval database, and predict the input image's label from the category labels of those images.
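The search-based weighted KNN described above can be sketched as below. The article says the weight is a function of the query-neighbor distance without giving it; inverse distance is used here as one plausible choice, and the `knn_predict` name and data layout are illustrative.

```python
from collections import defaultdict
from math import dist

def knn_predict(query, reference, k=3):
    """reference: list of (feature_vector, label). Weighted vote of the
    top-k nearest neighbours; weight = 1 / (distance + eps) is an assumption."""
    neighbours = sorted(reference, key=lambda e: dist(e[0], query))[:k]
    votes = defaultdict(float)
    for feat, label in neighbours:
        votes[label] += 1.0 / (dist(feat, query) + 1e-6)
    return max(votes, key=votes.get)
```

No classifier is trained: prediction quality depends entirely on the feature model and the labeled reference library, which is why Pailitao invests in a 200-million-image reference set.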

3.2.3 Application practice in Taobao Pailitao

  • Offline part: Taobao collected 200 million images with ground-truth category labels as the reference image library, trained a general category feature model, extracted general features from the reference library offline, and built the index.
  • Online part: at prediction time, general features are extracted from the query image and the top-30 results are retrieved from the reference set. The query image's label is predicted by weighted voting over the category labels of these 30 neighbors, where the weight is a function of the distance between the query image and each neighbor.

To improve category recognition further, Taobao fuses the model-based predictions and the search-based predictions with a weighted combination. Compared with either method alone, combining the two improves Taobao's final category prediction accuracy by more than 2%.
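The weighted fusion of the two branches can be sketched as below. The fusion weight `alpha` and the per-category score dictionaries are hypothetical; the article does not disclose the actual fusion weights.

```python
def fuse_predictions(model_probs, search_probs, alpha=0.6):
    """Weighted fusion of two per-category score dicts; alpha weights the
    model-based branch. Returns the top category after fusion."""
    categories = set(model_probs) | set(search_probs)
    fused = {c: alpha * model_probs.get(c, 0.0)
                + (1 - alpha) * search_probs.get(c, 0.0)
             for c in categories}
    return max(fused, key=fused.get)
```

A confident search-based vote can overturn an uncertain model prediction, which is one intuition for why the combination beats either branch alone.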

4. Object detection, joint feature learning & metric learning module

In the Taobao scenario, applying Pailitao to image search faces a big challenge: user images and merchant images differ enormously. Merchants' product photos are usually of very high quality, shot in good environments with good equipment. Users' query images, often taken with a phone camera, come with a host of problems such as poor lighting, blur, and complex backgrounds.

To reduce the influence of complex backgrounds, the system must be able to locate the subject object in the image and extract its features. Taobao proposed a two-branch CNN framework based on metric learning that jointly learns the subject detection box and the feature representation, achieving feature consistency between users' real-world photos and merchants' index images, free of background interference.

The figure illustrates the difference between retrieval with and without subject detection:

  • In the first row, without subject detection, the retrieval results are clearly affected by background interference.
  • The second row uses subject detection, and the retrieval results improve very significantly.

4.1 Triplet mining

4.1.1 Principle

In metric learning, constructing a triplet loss function from triplet samples is a common and effective method.

This method was first proposed by the Google research team in the paper FaceNet: A Unified Embedding for Face Recognition and Clustering. It is often used in face recognition tasks to distinguish extremely similar samples of different classes (such as siblings).

The basic idea: for a given triplet (Anchor, Positive, Negative), triplet loss tries to learn a feature space in which the anchor is closer to the positive sample of the same category, and farther from the negative sample of a different category.

Note: the anchor and positive are different samples of the same class; the anchor and negative are samples of different classes. For example, in the image retrieval scenario, the positive is an image of the same item as the anchor, while the negative is an image of a different item.

4.1.2 Advantages and disadvantages of the method

  • Advantage 1: neural network models trained with triplet loss can distinguish fine details well. In particular, when two inputs are very similar, triplet loss can learn subtle discriminative features from the small differences between them, performing well on classification tasks.
  • Advantage 2: compared with other classification losses, triplet loss lets the designer set a threshold according to training needs. In a network trained with triplet loss, a margin is generally set, and the designer controls the required distance between positive and negative samples by changing the margin value.

While Triplet Loss is effective, it also has drawbacks:

  • Disadvantage 1: triplet selection leads to an uneven data distribution, so training is unstable and converges slowly, requiring continual parameter tuning based on results.
  • Disadvantage 2: triplet loss overfits more easily than classification loss.

4.1.3 Application practice in Taobao Pailitao

In the Pailitao scenario, given an input image, the core problem is to reliably match images from different sources (users and sellers) using image features. For the triplet setting, we want to shrink the distance between the "query image" and "images of the same item", and enlarge the distance between the "query image" and "images of different items".

The Triplet Loss function used here is as follows:


\operatorname{loss}\left(q, q^{+}, q^{-}\right)=\left[L_{2}\left(f(q), f\left(q^{+}\right)\right)-L_{2}\left(f(q), f\left(q^{-}\right)\right)+\delta\right]_{+}

  • $L_2$ denotes the L2 distance between two vectors
  • $\delta$ is the parameter controlling the margin
  • $f$ is the CNN feature extractor to be learned, trainable end to end

4.1.4 Triplet sample mining

Model training relies on a large number of samples, and the core here is mining hard triplet samples. The naive approach selects positive images from the same category as the query and negative images from other categories. The problem is that such negatives differ too much visually from the query, so the triplet ranking loss easily becomes zero during training and contributes no gradient.

In practice, the Taobao app's user click data is used to mine hard triplet samples, as follows:

  • The basic idea

In the image retrieval scenario, users mostly click on images of the same item in the returned list. This means a "clicked image" can be treated as a "positive sample" for the query image, while "non-clicked images" can serve as "hard negative samples": they are similar to the query image but belong to different items.

  • Problem points & Solutions

In the above idea, a "non-clicked image" may still show the same item as the query image: when many images of the same item are returned, the user clicks only one or two of them. We therefore need to filter out "non-clicked images of the same item as the query".

  • Negative sample construction

The final rule for selecting "negative samples for a query image" is:


q^{-} \in\left\{d^{\text{nonclick}} \mid \min \left[\operatorname{dist}\left(d^{\text{nonclick}}, q\right), \operatorname{dist}\left(d^{\text{nonclick}}, d^{\text{click}}\right)\right] \geqslant \gamma\right\}

  • $\operatorname{dist}$ is the feature distance function
  • $\gamma$ is the distance threshold

Taobao designed the distance computation with multi-feature fusion, combining "local features", "features from different model versions", and "general ImageNet-pretrained features", so as to identify noisy negative samples more accurately.

  • Positive sample construction

With similar reasoning, a more accurate selection rule for "positive samples of a query image" is given by:


q^{+} \in\left\{d^{\text{click}} \mid \operatorname{dist}\left(d^{\text{click}}, q\right) \leqslant \varepsilon\right\}
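The two selection rules can be sketched together. This is a direct transcription of the formulas under one assumed distance function (plain L2 on feature vectors, whereas Pailitao fuses multiple features); `select_samples` and the threshold values are illustrative.

```python
from math import dist

def select_samples(query, clicked, nonclicked, eps=0.3, gamma=1.0):
    """Filter click data into triplet candidates.
    Positives: clicked images within eps of the query.
    Negatives: non-clicked images at least gamma away from both the query
    and every clicked image (so same-item non-clicks are dropped)."""
    positives = [d for d in clicked if dist(d, query) <= eps]
    negatives = [d for d in nonclicked
                 if min([dist(d, query)] + [dist(d, c) for c in clicked]) >= gamma]
    return positives, negatives
```

Requiring negatives to be far from the clicked images, not just from the query, is what removes the "same item but not clicked" noise described above.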

  • Data amplification

In Pailitao's sample construction, all negative images are shared across the triplets collected within a mini-batch. This expands the set of usable triplets in the batch and adds more training data: without sharing, a batch yields $m$ triplets; with shared negatives, $m^2$ triplets can be generated before entering the loss layer.
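The negative-sharing expansion can be sketched in a few lines (the `share_negatives` name is illustrative):

```python
def share_negatives(triplets):
    """Given m triplets (q, pos, neg) from one mini-batch, pair every
    (q, pos) with every negative in the batch, yielding m*m triplets."""
    negatives = [neg for _, _, neg in triplets]
    return [(q, pos, neg) for q, pos, _ in triplets for neg in negatives]
```

The quadratic expansion comes for free: no extra feature extraction is needed, only re-pairing of embeddings already in the batch.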

  • Loss function optimization

To further reduce noise in the training images, the original triplet ranking loss is improved:


\begin{array}{l}
\text{loss}=\frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\left|N_{q}\right|} \sum_{q^{-} \in N_{q}}\left[L_{2}\left(f(q), f\left(q^{+}\right)\right)-L_{2}\left(f(q), f\left(q^{-}\right)\right)+\delta\right]_{+} \\
Q=\left\{q \mid \exists q^{-}, L_{2}\left(f(q), f\left(q^{+}\right)\right)-L_{2}\left(f(q), f\left(q^{-}\right)\right)+\delta>0\right\} \\
N_{q}=\left\{q^{-} \mid L_{2}\left(f(q), f\left(q^{+}\right)\right)-L_{2}\left(f(q), f\left(q^{-}\right)\right)+\delta>0\right\}
\end{array}

The idea of the improvement is to average the loss over all triplets of the same query image, minimizing the influence of noisy triplets. By computing the triplet loss at the query-image level and learning CNN features, users' real photos and merchants' high-quality images are mapped into the same feature space, so that images from different sources can be matched more reliably.
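For a single query, the inner average over its active negatives (the set $N_q$) can be computed as below; `query_level_loss` is an illustrative name, and the distances are assumed to be precomputed scalars rather than tensors.

```python
def query_level_loss(d_pos, d_negs, delta=0.1):
    """Averaged hinge loss for one query: d_pos is L2(f(q), f(q+)),
    d_negs are L2(f(q), f(q-)) for all mined negatives. Only negatives
    with a positive hinge term (the set N_q) contribute to the average."""
    hinges = [d_pos - d_neg + delta for d_neg in d_negs]
    active = [h for h in hinges if h > 0]
    return sum(active) / len(active) if active else 0.0
```

Averaging instead of summing means a query with many noisy negatives cannot dominate the batch loss, which is exactly the robustness the improved formulation is after.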

A simple PyTorch implementation of triplet loss:

```python
import torch.nn as nn
import torch.nn.functional as F

class TripletLoss(nn.Module):
    """Triplet loss implementation in PyTorch."""
    def __init__(self, margin):
        super(TripletLoss, self).__init__()
        self.margin = margin

    def forward(self, anchor, positive, negative, size_average=True):
        # Squared L2 distances (append .pow(.5) for the true L2 distance)
        distance_positive = (anchor - positive).pow(2).sum(1)
        distance_negative = (anchor - negative).pow(2).sum(1)
        losses = F.relu(distance_positive - distance_negative + self.margin)
        return losses.mean() if size_average else losses.sum()
```

4.2 Deep Ranking Network

4.2.1 Network Structure

An important step in the whole pipeline is eliminating background noise and detecting the subject object. The standard computer vision approach is to deploy an existing detector (such as Faster R-CNN or SSD), but such methods have high latency and require large numbers of bounding-box annotations.

Taobao Pailitao proposed a two-branch joint network model that learns detection and feature representation simultaneously. The branch network structure is shown in the figure below:

4.2.2 Parameter learning and training

The triplets mined in the steps above serve as supervision for learning the joint model under a Deep Ranking framework. In this way, discriminative features are learned through the metric relation between triplet positives and negatives, and the object subject mask that matters for feature discrimination is obtained by regression in the branch structure. Subject masks are learned by the branch in an attention-like manner, without any bounding-box annotations. Overall, the Deep Ranking framework looks like the figure below.

The deep joint models under the Deep Ranking framework share parameters. The detected mask function $M(x, y)$ is given by the formula below: the detection branch first regresses the rectangle coordinates $(x_l, x_r, y_t, y_b)$, and $h$ denotes the step function:


\begin{array}{c}
M(x, y)=\left[h\left(x-x_{l}\right)-h\left(x-x_{r}\right)\right] \times\left[h\left(y-y_{t}\right)-h\left(y-y_{b}\right)\right] \\
h\left(x-x_{0}\right)=\left\{\begin{array}{l}
0, x<x_{0} \\
1, x \geqslant x_{0}
\end{array}\right.
\end{array}

The subject bounding-box region is the element-wise product of the input image $I(x, y)$ and $M(x, y)$. However, the step function $h$ is not differentiable; for end-to-end training, a sigmoid can be used to approximate it:


f ( x ) = 1 1 + e k x f(x)=\frac{1}{1+\mathrm{e}^{-k x}}

which closely approximates the step function when $k$ is large enough, while remaining differentiable. The whole training process is weakly supervised by user click data alone and does not rely on bounding-box annotations, which greatly reduces labeling cost and improves training efficiency.
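The sigmoid-relaxed box mask can be sketched as below. The `soft_mask` name, the normalized coordinate range, and the choice of `k` are illustrative; in the real model the box coordinates are regressed by the detection branch and the mask multiplies the feature map.

```python
from math import exp

def soft_mask(x, y, box, k=50.0):
    """Differentiable approximation of the in-box indicator M(x, y).
    box = (x_l, x_r, y_t, y_b); sigmoid(k * t) replaces the step h(t)."""
    sig = lambda t: 1.0 / (1.0 + exp(-k * t))
    x_l, x_r, y_t, y_b = box
    return (sig(x - x_l) - sig(x - x_r)) * (sig(y - y_t) - sig(y - y_b))
```

Inside the box the mask is close to 1 and outside close to 0, yet gradients flow to the four box coordinates, which is what lets clicks alone supervise the detector.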

5. Image indexing & re-ranking

5.1 Billion-scale image retrieval engine

To maximize response speed, Taobao Pailitao uses a large-scale binary engine for query and ranking, with an overall multi-shards and multi-replications engine architecture.

5.1.1 Architecture

The structure is shown in the figure below; it ensures fast responses to massive user queries while scaling very well.

  • Multi-shards

A single machine's memory cannot hold the full feature data, so features are stored across multiple nodes. For each query, top-K results are retrieved from each node and merged to produce the final result.
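The per-shard top-K merge can be sketched with the standard library; the `merge_shards` name and the (distance, item_id) layout are illustrative.

```python
import heapq

def merge_shards(shard_results, k):
    """Each shard returns its own sorted top-k list of (distance, item_id);
    merge them and keep the global k smallest distances."""
    return heapq.nsmallest(k, heapq.merge(*shard_results))
```

Because each shard already returns a sorted list, the merge is linear in the number of candidates rather than requiring a full re-sort.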

  • Multi-replications

A single feature database cannot handle the full query traffic, so multiple replicas are created and the query traffic is distributed across different server clusters, reducing the average query latency for users.

5.1.2 Coarse filtering and fine ranking

Each node uses two types of index, coarse filtering and fine ranking, explained as follows:

Coarse filtering: an improved binary inverted-index retrieval over binary features (binarized CNN features) is used.

  1. With the image ID as key and the binary feature as value, large numbers of mismatches can be filtered out quickly via Hamming distance computation;
  2. the nearest neighbors are then refined according to the binary codes of the returned images.
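The Hamming-distance coarse filter over binary codes can be sketched as below; the `coarse_filter` name, the int-encoded codes, and the radius value are illustrative.

```python
def hamming(a, b):
    """Hamming distance between two binary codes stored as ints."""
    return bin(a ^ b).count("1")

def coarse_filter(query_code, index, radius):
    """index: dict of image_id -> binary code. Keep only the images whose
    code is within `radius` bits of the query code."""
    return [img_id for img_id, code in index.items()
            if hamming(query_code, code) <= radius]
```

A single XOR plus popcount per candidate is why binarized features make it cheap to discard most of a billion-scale index before the slower fine-ranking stage.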

Fine ranking: the coarsely filtered candidates are ranked more precisely using additional metadata, such as visual attributes and features. This stage is slower because:

  1. metadata is stored in non-binary form;
  2. metadata is too large to load entirely into memory, so the cache hit ratio becomes a key factor affecting performance.

Through coarse filtering plus fine ranking (a process very similar to recall + ranking in recommender systems), recall quality remains essentially lossless while retrieval efficiency improves greatly.

5.2 Re-ranking combining multiple information dimensions

5.2.1 Existing problems

The image retrieval results above, ranked by computer vision and deep learning, aim to retrieve the pictured product accurately. But returning to the nature of e-commerce: the same item can appear in many different listings, and the result whose image most resembles the query is not necessarily the product most likely to make users click and buy.

5.2.2 Taobao Pailitao's solution

With the business objective adjusted, this becomes a holistic ranking problem: the result list must be re-sorted by comprehensive signals such as price, ratings, sales volume, and user profile. After the image retrieval stage above, Pailitao's preliminary list contains the top-60 results by image similarity; these top 60 are then reordered using semantic information, including sales volume, conversion rate, click-through rate, and user profile.

  • A simple approach is to combine the descriptive features of the different dimensions with a traditional GBDT+LR model and normalize the final score to [0, 1], preserving appearance similarity while weighing the importance of the other attribute dimensions.
  • More precise ranking can use CTR prediction models and multi-objective optimization networks (see earlier articles such as "Multi-objective ranking" and "iQiyi multi-objective"). After this multi-dimension re-ranking stage, the final results keep overall appearance similarity while down-weighting or filtering low-quality images, yielding product images that better match the user's intent.
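As a stand-in for the GBDT+LR fusion described above, a linear score fusion over normalized signals shows the shape of the re-ranking step. The signal names, weights, and the `rerank` function are all hypothetical; the production model is learned, not hand-weighted.

```python
def rerank(candidates, weights):
    """candidates: list of (item_id, visual_sim, sales, ctr) with each
    signal already normalised to [0, 1]; weights: (w_vis, w_sales, w_ctr).
    Returns item ids sorted by the fused score, best first."""
    def score(c):
        _, vis, sales, ctr = c
        w_vis, w_sales, w_ctr = weights
        return w_vis * vis + w_sales * sales + w_ctr * ctr
    return [c[0] for c in sorted(candidates, key=score, reverse=True)]
```

Keeping a substantial weight on visual similarity preserves the "looks like my photo" property while letting business signals break ties among near-identical listings.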

6. Related code implementation references

Get "computer vision" industry solutions

The project implementation code, datasets, paper collections, and article collections of the "recommendation and computational advertising" series have been organized into industry solutions. Reply with the keyword "computer vision" in the background of the public account (AI Algorithm Research Institute) to obtain them.

Related code implementation reference

ShowMeAI community technical experts have also implemented the typical image retrieval algorithms and built related applications. For details of the "CNN and triplet based image retrieval implementation", see the implementation code in our GitHub project (github.com/ShowMeAI-Hu…). Thanks to the ShowMeAI community partners who participated in this project, and PRs and Stars are welcome!


7. References

  • [1] Shaoqing Ren, Kaiming He, Ross B. Girshick, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 2017: 1137-1149.
  • [2] Wei Liu, Dragomir Anguelov, Dumitru Erhan, et al. SSD: Single Shot MultiBox Detector. In European Conference on Computer Vision (ECCV), 2016: 21-37.
  • [3] Yanhao Zhang, Pan Pan, Yun Zheng, et al. Visual Search at Alibaba. In Proceedings of the 24th International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2018: 993-1001.
  • [4] Yushi Jing, David C. Liu, Dmitry Kislyuk, et al. Visual Search at Pinterest. In Proceedings of the 21st International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2015: 1889-1898.
  • [5] Christian Szegedy, Wei Liu, Yangqing Jia, et al. Going Deeper with Convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 1-9.
  • [6] Jiang Wang, Yang Song, Thomas Leung, et al. Learning Fine-Grained Image Similarity with Deep Ranking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014: 1386-1393.
  • [7] Olga Russakovsky, Jia Deng, Hao Su, et al. ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575, 2014.

  • Author: Han Xinzi @ShowMeAI, Frank @Taobao
  • Address: showmeai.tech/article-det…
  • Statement: all rights reserved; for reprints, please contact the platform and the author and indicate the source