Did you buy a new dress during the May Day holiday?
When it comes to buying clothes, you have probably had this experience: walking down the street, you spot someone wearing a beautiful outfit and can't help wondering where they bought it. You would love to buy the same piece, but you don't know them, so all you can do is look. What if there were a way to find online sellers based on a photo alone?
Aleksandr Movchan, a programmer in Germany, gave this problem a name, Street to Shop, and decided to solve it using Distance Metric Learning (DML) from machine learning. (No more worrying about not being able to buy the dress you like!)
Metric learning
Before diving into this "street to shop" tutorial, a quick word about metric learning. Metric learning is also known as similarity learning. Suppose we need to compute the similarity between two pictures: the goal of metric learning is to learn a measure under which pictures of the same category are highly similar while pictures of different categories are not.
In mathematics, a metric (or distance function) is a function that defines a distance between the elements of a set; a set equipped with a metric is called a metric space. For example, if our goal is face recognition, we need a distance function that emphasizes the appropriate features (such as hair color or face shape); if our goal is posture recognition, we need a distance function that captures posture similarity. To handle this variety of notions of similarity, we could construct the distance function by hand, selecting the appropriate features for each particular task. However, this requires a great deal of manual effort and may not be robust to variations in the data. As an ideal alternative, metric learning can learn a task-specific distance function on its own.
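To make the definition concrete, here is a tiny sketch (my own illustration, not from the original) of the most familiar metric, the Euclidean distance; metric learning effectively replaces such a hand-picked function with one learned from data:

```python
import math

def euclidean_distance(x, y):
    # A valid metric: non-negative, symmetric, d(x, x) == 0,
    # and it satisfies the triangle inequality.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Two toy feature vectors (e.g., scores for hair color and face shape):
print(euclidean_distance([0.2, 0.7], [0.3, 0.5]))  # ~0.224
```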
Metric learning methods can be divided into those that learn a linear transformation and those that learn a nonlinear model. Some classical unsupervised linear dimensionality reduction algorithms, such as principal component analysis and multidimensional scaling, can also be viewed as unsupervised Mahalanobis metric learning.
Metric learning has been applied in computer vision to image retrieval and classification, face recognition, human activity recognition and pose estimation; in text analysis; and in other fields such as music analysis, automated program debugging and microarray data analysis. Now let's get into the tutorial.
Building a data set
First, as with any machine learning problem, we need a dataset. The idea actually came to me one day while looking at the huge number of clothing photos on Alibaba.com: I could use this data to build a search-by-photo feature. For simplicity and convenience, I decided to focus on women's clothing, since girls (and some boys) love to shop for clothes.
Here are the categories of women's wear I crawled:
- Skirts
- Shirts
- Hoodies & sweatshirts
- Sweaters
- Jackets & coats
I used Requests and BeautifulSoup to crawl the images. The sellers' clothing photos can be obtained from the main page of each clothing category, but photos uploaded by buyers have to be scraped from the review area. A clothing page has a "color" attribute, which may indicate a different color of the same garment or even a completely different garment, so we treat different colors of one garment as different items.
You can click here to see the code I use to get all the information about a garment.
We just need to go through the clothing pages by category, collect the URLs of all the garments, and use the code above to fetch the information for each item.
Finally, we get two image sets for each garment: images from the seller (the url field of each element in item['colors']) and images from the buyer (the url field of each element in item['feedbacks']).
For each color we get only one photo from the seller, but there may be several photos from the buyer, and sometimes none at all (much like buyer shows on Tmall: some buyers post lots of pictures, others post none).
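To make the structure concrete, here is a toy sketch of collecting the two image sets for one garment; apart from item['colors'], item['feedbacks'] and the url field mentioned above, the exact layout is my own assumption:

```python
# Toy example of the `item` structure described above.
item = {
    'colors':    [{'url': 'http://example.com/seller_red.jpg'},
                  {'url': 'http://example.com/seller_blue.jpg'}],
    'feedbacks': [{'url': 'http://example.com/buyer_1.jpg'}],  # may be empty
}

seller_urls = [c['url'] for c in item['colors']]     # exactly one photo per color
buyer_urls = [f['url'] for f in item['feedbacks']]   # zero or more photos
```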
Very good! We got the data we wanted. However, the resulting dataset was quite noisy, especially the images from buyers: photos of delivery packages, photos showing only the texture of the fabric or only part of the garment, and photos taken right after the package was ripped open.
To mitigate this problem, we labeled 5,000 images into two categories: clean and noisy. My initial plan was to train a two-class classifier and use it to clean the dataset, but I decided to postpone that work and simply add clean data to the test and validation sets.
The second problem is that several sellers sometimes sell the same clothes, and sometimes several stores show the same photos (or slightly edited versions of them). How do we solve this?
The easiest way is to do nothing and rely on a robust distance metric learning algorithm. However, the same garments would then appear in both the training and validation data, which distorts validation: it leads to data leakage. Another approach is to look for similar (or even identical) garments and merge them into one item. We can use perceptual hashing to find identical clothing photos, or train a model on the noisy data to find near-identical ones. I chose the second method because it also merges photos that have been slightly edited.
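For reference, here is a minimal sketch of the first option, duplicate detection with perceptual hashing, using the third-party Pillow and ImageHash libraries; the file paths and the distance threshold are placeholder assumptions:

```python
from PIL import Image
import imagehash  # pip install ImageHash

def phash(path):
    # Perceptual hash: visually similar images get similar hashes.
    return imagehash.phash(Image.open(path))

h1 = phash('seller_photo_a.jpg')  # placeholder paths
h2 = phash('seller_photo_b.jpg')

# Subtracting two hashes gives their Hamming distance; a small value
# means near-duplicate photos. The threshold 5 is a tunable assumption.
if h1 - h2 <= 5:
    print('Likely the same garment photo; merge the items.')
```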
Distance metric learning
One of the most commonly used loss functions for distance metric learning is the triplet loss:

L = max(D(F(a), F(p)) - D(F(a), F(n)) + m, 0)

where max(x, 0) is the hinge function, D(x, y) is the distance between x and y, F(x) is the deep neural network, m is the margin, a is the anchor, p is the positive example, and n is the negative example. F(a), F(p) and F(n) are points (vectors) in a high-dimensional space produced by the deep neural network. To make the model more robust to changes in illumination and contrast in the photos, the vectors are usually normalized to unit length, i.e. ||x|| = 1. The anchor and the positive example belong to the same category, while the negative example belongs to a different category.
The main idea of the triplet loss, then, is to separate the distance of the positive pair (anchor, positive) from that of the negative pair (anchor, negative) by at least the margin m.
But how do we choose the triplets (a, p, n)? We could sample them at random, but that causes problems. First, there are N^3 possible triplets, so going through all of them would take a very long time. In practice we don't need to, because after a few training iterations most triplets already satisfy the triplet constraint (i.e., have zero loss), which makes them useless for training.
One of the most common ways to select triplets is hard negative mining: for each anchor, pick the negative example currently closest to it, i.e. n* = argmin_n D(F(a), F(n)).
In practice, selecting the hardest samples leads to bad local minima early in training; in particular, it can cause the model to collapse (e.g., F(x) = 0). To alleviate this problem, we can use semi-hard negative mining.
Semi-hard negatives are farther from the anchor than the positive example, but they are still hard (they violate the triplet constraint) because they lie inside the margin m:

D(F(a), F(p)) < D(F(a), F(n)) < D(F(a), F(p)) + m
There are two ways to mine semi-hard (and hard) negatives: online and offline.
- The online approach means that we randomly select a mini-batch from the training dataset and pick triplets from the samples inside it. Online mining needs a large number of samples per batch; in my case that was not feasible, because I only had a GTX 1070 with 8 GB of memory.
- In the offline approach, we pause training at intervals, predict vectors for a certain number of samples, select triplets from them, and train the model on those triplets. This means two forward passes per sample, which is the small price we pay for the offline approach (a sketch is given below).
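As a rough illustration of the offline step (a sketch under my own naming, not the author's exact code), the following NumPy snippet picks one semi-hard negative per (anchor, positive) pair using the squared Euclidean distance adopted later in this post:

```python
import numpy as np

def pick_semi_hard_negatives(anchors, positives, negatives, margin):
    # anchors, positives: (N, d) embeddings of matched pairs;
    # negatives: (M, d) embeddings of candidate negatives.
    d_ap = np.sum((anchors - positives) ** 2, axis=1)                          # (N,)
    d_an = np.sum((anchors[:, None, :] - negatives[None, :, :]) ** 2, axis=2)  # (N, M)
    chosen = np.full(len(anchors), -1)  # -1 means "no semi-hard negative found"
    for i in range(len(anchors)):
        # Semi-hard: farther than the positive, but still inside the margin.
        mask = (d_an[i] > d_ap[i]) & (d_an[i] < d_ap[i] + margin)
        candidates = np.where(mask)[0]
        if len(candidates) > 0:
            chosen[i] = np.random.choice(candidates)
    return chosen
```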
Very good! We can now train the model with the triplet loss and offline semi-hard negative mining. But! We need one more trick to solve the street-to-shop problem well. Our task is to find the seller photo most similar to the photo uploaded by the buyer.
However, seller photos are usually of much better quality than buyer photos (an online store's photos have typically been through countless rounds of Photoshop), so we are dealing with two domains: seller photos and buyer photos. To train an effective model we need to narrow the gap between them. This problem is called domain adaptation.
I propose a very simple method to narrow the gap between the two domains: choose the anchor from the seller photos and the positive and negative examples from the buyer photos. That's it! Simple but effective. A sketch of this sampling scheme follows.
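For illustration only, here is a minimal sketch of that sampling scheme; seller_photos and buyer_photos are assumed to be dicts mapping a garment id to its list of photos (my own names, not from the original code):

```python
import random

def sample_triplet(seller_photos, buyer_photos):
    # Only garments that actually have buyer photos can form triplets.
    ids = [g for g in seller_photos if buyer_photos.get(g)]
    pos_id = random.choice(ids)
    neg_id = random.choice([g for g in ids if g != pos_id])
    anchor = random.choice(seller_photos[pos_id])    # anchor: seller domain
    positive = random.choice(buyer_photos[pos_id])   # positive: buyer domain
    negative = random.choice(buyer_photos[neg_id])   # negative: buyer domain
    return anchor, positive, negative
```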
Implementation
To quickly test my idea, I used the Keras library with the TensorFlow backend.
I chose Inception V3 as the base convolutional neural network for my model and, as usual, initialized it with ImageNet weights. Then I added two fully connected layers at the end of the network, followed by L2 normalization. The embedding size is 128.
```python
from keras import backend as K
from keras.applications.inception_v3 import InceptionV3
from keras.layers import Dense, Lambda
from keras.models import Model

def get_model():
    # Inception V3 pretrained on ImageNet, without the classification head.
    no_top_model = InceptionV3(include_top=False, weights='imagenet', pooling='avg')
    x = no_top_model.output
    x = Dense(512, activation='elu', name='fc1')(x)
    x = Dense(128, name='fc2')(x)
    # L2-normalize so that every embedding has unit length.
    x = Lambda(lambda x: K.l2_normalize(x, axis=1), name='l2_norm')(x)
    return Model(no_top_model.inputs, x)
```
We also need to implement the triplet loss function. It receives the anchors and the positive and negative examples together as a single mini-batch, and splits that mini-batch into three tensors inside the function. The distance function is the squared Euclidean distance.
```python
import tensorflow as tf
from keras import backend as K

def margin_triplet_loss(y_true, y_pred, margin, batch_size):
    # The mini-batch is laid out as repeating (anchor, positive, negative) triplets.
    out_a = tf.gather(y_pred, tf.range(0, batch_size, 3))
    out_p = tf.gather(y_pred, tf.range(1, batch_size, 3))
    out_n = tf.gather(y_pred, tf.range(2, batch_size, 3))
    # Squared Euclidean distances, hinged at the margin.
    loss = K.maximum(margin
                     + K.sum(K.square(out_a - out_p), axis=1)
                     - K.sum(K.square(out_a - out_n), axis=1),
                     0.0)
    return K.mean(loss)
```
And compile the model with an optimizer:
```python
import keras
from functools import partial, update_wrapper

# Utility function to freeze some portion of a function's arguments,
# so that Keras sees a loss with the standard (y_true, y_pred) signature.
def wrapped_partial(func, *args, **kwargs):
    partial_func = partial(func, *args, **kwargs)
    update_wrapper(partial_func, func)
    return partial_func

opt = keras.optimizers.Adam(lr=0.0001)
model.compile(loss=wrapped_partial(margin_triplet_loss,
                                   margin=margin, batch_size=batch_size),
              optimizer=opt)
```
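One detail worth noting: the loss above assumes every mini-batch stacks its samples in consecutive (anchor, positive, negative) order, matching the tf.range(..., 3) strides. As a hedged illustration (triplet_generator below is my own placeholder, not the author's code), training could look like this:

```python
import random
import numpy as np

def triplet_generator(triplets, batch_size):
    # `triplets` is a list of (anchor, positive, negative) image arrays;
    # batch_size must be divisible by 3.
    while True:
        batch = []
        for a, p, n in random.sample(triplets, batch_size // 3):
            batch.extend([a, p, n])
        # Dummy labels: margin_triplet_loss only looks at y_pred.
        yield np.stack(batch), np.zeros(batch_size)

model.fit_generator(triplet_generator(triplets, batch_size),
                    steps_per_epoch=100, epochs=10)
```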
The experimental results
The performance metric for the model is called R@K. Let's see how it is computed. Each buyer photo in the validation set serves as a query, for which we need to find the corresponding seller photo. For each query photo we compute its embedding vector and search for that vector's nearest neighbors among all seller photos. We use the seller photos not only from the validation set but also from the training set, because this increases the number of distractors and makes our task more challenging.
So for a query photo we get a list of the most similar seller photos. If the corresponding seller photo appears among the K most similar ones, the query scores 1; otherwise it scores 0. R@K is the average of this score over all queries in the validation set.
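A minimal sketch of computing R@K with scikit-learn's NearestNeighbors (my own illustration; query_vecs, seller_vecs and the ground-truth id arrays are assumed to be precomputed):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def recall_at_k(query_vecs, true_ids, seller_vecs, seller_ids, k):
    # Fraction of queries whose correct seller photo is in the top k.
    nn = NearestNeighbors(n_neighbors=k).fit(seller_vecs)
    _, idx = nn.kneighbors(query_vecs)  # idx: (n_queries, k)
    hits = [true_ids[i] in {seller_ids[j] for j in idx[i]}
            for i in range(len(query_vecs))]
    return float(np.mean(hits))
```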
As I mentioned above, I cleaned a small set of buyer photos of noise, so I measured the quality of the model on two validation sets: the full validation set and a subset containing only clean data.
The model's results are not great yet; here are some ways to improve them:
- Clean the buyer photos of noisy data. I took a first step in this direction by cleaning a small dataset.
- Merge duplicate clothing photos more accurately (at least in the validation set).
- Further narrow the gap between the domains. I think this could be done with domain-specific augmentation (such as adjusting image illumination) and other specialized methods (such as the one in this paper).
- Try a different distance metric learning method. I tried the method from this paper, but it performed worse.
- And, of course, collect more data.
Demo, code and trained model
I made a demo of the model, which can be viewed here.
You can take a photo of an outfit you like on the street (be careful out there), or upload a random photo from the validation set, feed it to the model, and see how it performs.
Click here to view the project code base.