Learning Fashion Compatibility with Bidirectional LSTMs

Address: arxiv.org/abs/1707.05…

Code address: github.com/xthan/polyv…

This article was first published on my official account:

Mp.weixin.qq.com/s/tl5603eTD…

Contact information:

GitHub: github.com/ccc013

Zhihu column: Machine learning and Computer Vision, AI paper notes

WeChat official account: AI algorithm notes

1. Introduction

The demand for fashion outfit recommendation keeps growing. This paper focuses on two aspects of fashion recommendation:

  1. Given some existing clothing items, recommend the missing item so that together they form an outfit, e.g., given a jacket and pants, recommend a pair of shoes;
  2. Generate an outfit from multiple forms of input, such as a text description or a garment image;

The main difficulty at present is how to model and infer the compatibility relationship between items from different fashion categories, which cannot be done by simply computing visual similarity.

Most related work so far focuses on three directions: clothing parsing, clothing recognition, and clothing retrieval. The small amount of work on clothing recommendation still has the following problems:

  1. It does not consider outfit (collocation) recommendation;
  2. It supports only one of the two directions above, i.e., either recommending a whole outfit or recommending a missing item for an existing outfit;
  3. No existing work supports multiple forms of input, such as keywords, images, or images plus keywords;

For a good outfit, as shown in the figure below, this paper argues that two key properties should hold:

  1. The items in the outfit should be compatible in visual appearance and style;
  2. The outfit should not contain duplicate categories, such as two pairs of shoes or two pairs of pants;

The main approaches tried so far in outfit recommendation are the following:

  1. Use semantic attributes to specify which garments go together; however, this requires annotated data, which is costly and does not scale.
  2. Use metric learning to learn the distance between a pair of fashion items; however, this only captures pairwise compatibility, not the compatibility of a whole outfit;
  3. As an improvement on 2, adopt a voting strategy; however, the computational cost is high, and the coherence among all items in the outfit is still not well exploited.

To address these problems, this paper proposes an end-to-end framework that jointly learns a visual-semantic embedding and the compatibility relationships among garments. The figure below shows the overall framework.

First, an Inception V3 model is used as the feature extractor to convert each input image into a feature vector, which is then fed into a bidirectional LSTM (Bi-LSTM) layer with 512 hidden units. A bidirectional LSTM is adopted because the authors view an outfit as a sequence in a specific order, with each garment in the outfit corresponding to one time step. At each time step, the Bi-LSTM predicts the next image conditioned on the previous images.
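As a rough sketch of this pipeline (this is not the authors' released code; the PyTorch framing, module names, and the assumption of precomputed 2048-dimensional Inception V3 features are assumptions), the projection layer and the bidirectional LSTM could look like this:

```python
import torch
import torch.nn as nn

class OutfitBiLSTM(nn.Module):
    """Minimal sketch: project 2048-d Inception V3 features to 512-d,
    then run a bidirectional LSTM over the garments of one outfit."""

    def __init__(self, cnn_dim=2048, emb_dim=512, hidden_dim=512):
        super().__init__()
        self.project = nn.Linear(cnn_dim, emb_dim)  # FC layer after the CNN
        self.dropout = nn.Dropout(p=0.7)            # dropout rate from the paper
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, cnn_feats):
        # cnn_feats: (batch, num_items, 2048), one feature vector per garment
        x = self.dropout(self.project(cnn_feats))
        h, _ = self.bilstm(x)                       # (batch, num_items, 2 * hidden_dim)
        return h                                    # hidden states used to score the next/previous item

# toy usage: a batch of 10 outfits, each with 5 garments
model = OutfitBiLSTM()
feats = torch.randn(10, 5, 2048)
print(model(feats).shape)                            # torch.Size([10, 5, 1024])
```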

In addition, the method also learns a visual-semantic embedding by mapping image features to a semantic representation. This not only provides semantic attribute and category information as input and regularization for training the LSTM, but also lets users supply multiple forms of input when generating an outfit.

After the model is trained, it is evaluated on three tasks, as shown in the figure below:

  1. Fill in the blank: given an outfit with one item missing and four candidate items, the model picks the candidate that best matches the rest of the outfit;
  2. Outfit generation: generate an outfit from multiple forms of input, such as a text query or a garment image;
  3. Compatibility prediction: given an outfit, predict its compatibility score.

2. Polyvore dataset

The experiments use a dataset collected from Polyvore, a popular fashion website where users can create and upload outfit data. These outfits contain rich information in multiple forms, such as images and text descriptions of the items, the number of likes an outfit receives, and the hashtags attached to it.

The Polyvore dataset contains 21,889 outfits in total, split into a training set, a validation set, and a test set of 17,316, 1,497, and 3,076 outfits respectively.

Here, following the paper Mining Fashion Outfit Composition Using An End-to-end Deep Learning Approach, a graph segmentation algorithm is used to ensure that no garment appears in more than one of the training, validation, and test sets. In addition, outfits that contain too many items are truncated to their first 8 garments for convenience. The resulting dataset contains 164,379 samples in total, each consisting of an image and its text description.

To clean the text descriptions, words appearing fewer than 30 times were removed, yielding a vocabulary of 2,757 words.

In addition, the garments in the Polyvore dataset follow a fixed order: generally tops, then bottoms, then shoes, then accessories. The order within tops is also fixed, usually shirts or T-shirts first, followed by outerwear. Accessories generally follow the order handbag, hat, glasses, watch, necklace, earrings, etc.

This fixed order therefore allows the LSTM model to learn sequential information.

3. Method

3.1 Fashion Compatibility Learning with Bi-LSTM

The first part is compatibility learning based on a bidirectional LSTM. It exploits the nature of LSTMs, which can learn the relationship between time steps and use gated memory cells to capture long-term temporal dependencies.

Based on this property, the paper treats an outfit as a sequence, with each image in the outfit as a separate time step, and uses the LSTM to model the visual compatibility relationships within the outfit.

Given an outfit $F = \{x_1, x_2, \cdots, x_N\}$, where $x_t$ is the CNN feature extracted from the $t$-th garment in the outfit, the forward LSTM predicts, at each time step, the next image given the previous ones. In this way, learning the relationship between two time steps is equivalent to learning the compatibility relationship between two garments.

The loss function used here is as follows:
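Spelled out from the definitions that follow, the forward loss presumably takes a negative log-likelihood form (the averaging over the $N$ time steps is an assumption):

$$E_f(F; \theta_f) = -\frac{1}{N}\sum_{t=1}^{N}\log Pr\left(x_{t+1} \mid x_1, \ldots, x_t; \theta_f\right)$$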

$\theta_f$ denotes the parameters of the forward prediction model, and $Pr(\cdot)$, computed by the LSTM, is the probability of predicting $x_{t+1}$ given the previous inputs.

In more detail, the LSTM maps inputs to outputs through a series of hidden states, computed as follows:
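The updates referred to here are the standard LSTM equations, written in the usual form (the paper's exact parameterization may differ in minor details such as bias or peephole terms):

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$h_t = o_t \odot \tanh(c_t)$$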

where $x_t$ and $h_t$ denote the input and output vectors respectively, and $i_t$, $f_t$, $c_t$, $o_t$ denote the activations of the input gate, forget gate, memory cell, and output gate respectively.

As in the Recurrent Neural Network Based Language Model, where a softmax output predicts the next word in a sentence, this paper adds a softmax layer on top of $h_t$ to compute the probability of the next garment:
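In equation form, with $\mathcal{X}$ defined in the next paragraph (the inner-product form inside the softmax is the standard choice and is assumed here):

$$Pr(x_{t+1} \mid x_1, \ldots, x_t; \theta_f) = \frac{\exp(h_t \cdot x_{t+1})}{\sum_{x \in \mathcal{X}} \exp(h_t \cdot x)}$$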

Here $\mathcal{X}$ denotes all images in the current batch, which lets the model learn to discriminate styles and compatibility across diverse samples. In principle $\mathcal{X}$ could be the whole dataset, but this is not done here because the number of images and the feature dimensionality are too large; restricting it to one batch also speeds up training.

In addition to predicting garments in the forward direction, garments can also be predicted in the reverse direction; for example, the item next to a pair of trousers could be a top in one direction or a shoe in the other, so a backward LSTM is used as well:
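By symmetry with the forward loss, the backward loss presumably reads:

$$E_b(F; \theta_b) = -\frac{1}{N}\sum_{t=N}^{1}\log Pr\left(x_t \mid x_{N+1}, \ldots, x_{t+1}; \theta_b\right)$$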

where $Pr(\cdot)$ is again computed by the (backward) LSTM, analogous to the forward direction.

Note that two zero vectors $x_0$ and $x_{N+1}$ are appended to $F$ so that the bidirectional LSTM knows when to stop predicting the next garment.

In general, an outfit is a collection of garments that share a similar style, color, or texture. The idea of this paper is to treat the outfit as a sequence in a fixed order, so that the model learns both pairwise compatibility and the overall style of the outfit (mainly through the memory cells).

3.2 Visual-semantic Embedding

The second part is learning the visual-semantic embedding.

Clothing recommendation usually has to handle multiple forms of input, such as images or text descriptions, so it is necessary to learn a joint multimodal embedding space for text and images.

Instead of manually labeling images with attribute tags, which is time-consuming and laborious, weakly labeled web data is used, namely the text description attached to each image in the dataset. Based on this information, and following Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, images and their text descriptions are mapped into a joint space to train a visual-semantic embedding.

Some definitions first. Let $S = \{w_1, w_2, \cdots, w_M\}$ denote a text description, where each word $w_i$ is represented by a one-hot vector $e_i$. Each $e_i$ is mapped into the embedding space as $v_i = W_T \cdot e_i$, where $W_T$ is the word embedding matrix, so the final representation of the text description is


$$v = \frac{1}{M}\sum_i v_i$$

The image embedding is computed as:


$$f = W_I \cdot x$$

In the visual-semantic embedding space, cosine similarity is used to measure the similarity between an image and its text description:


$$d(f, v) = f \cdot v$$

where $f$ and $v$ are L2-normalized. A contrastive loss is then used to optimize the visual-semantic embedding:


$$E_e(\theta_e) = \sum_f \sum_k \max(0,\, m - d(f, v) + d(f, v_k)) + \sum_v \sum_k \max(0,\, m - d(v, f) + d(v, f_k))$$

where $\theta_e = \{W_I, W_T\}$ are the model parameters, $v_k$ denotes text descriptions that do not match $f$, and $f_k$ denotes images that do not match $v$. Minimizing this loss encourages an image embedding $f$ to be closer to its own text description $v$ than to any mismatched description $v_k$ by a margin $m$, and symmetrically for $v$. During training, the mismatched samples are drawn from the same mini-batch, which pulls garments with similar semantic attributes and styles closer together in the learned embedding space.
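A minimal sketch of this contrastive loss in PyTorch, assuming in-batch negatives where matching image-text pairs sit on the diagonal of the similarity matrix (function and variable names are illustrative, not from the released code):

```python
import torch
import torch.nn.functional as F

def visual_semantic_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based contrastive loss: each matching (image, text) pair should
    have higher cosine similarity than any mismatched in-batch pair by at
    least `margin`."""
    f = F.normalize(img_emb, dim=1)          # image embeddings W_I * x, L2-normalized
    v = F.normalize(txt_emb, dim=1)          # averaged word embeddings, L2-normalized
    sim = f @ v.t()                          # sim[i, j] = d(f_i, v_j), cosine similarity
    pos = sim.diag().view(-1, 1)             # d(f_i, v_i) for the matching pairs

    cost_txt = (margin - pos + sim).clamp(min=0)      # image i vs. mismatched texts v_k
    cost_img = (margin - pos.t() + sim).clamp(min=0)  # text j vs. mismatched images f_k

    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_txt = cost_txt.masked_fill(mask, 0.0)        # drop the matching-pair terms
    cost_img = cost_img.masked_fill(mask, 0.0)
    return cost_txt.sum() + cost_img.sum()

# toy usage with a batch of 10 image/text embedding pairs
loss = visual_semantic_loss(torch.randn(10, 512), torch.randn(10, 512))
```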

3.3 Joint Modeling

Here the two components above are combined, i.e., outfit compatibility and the visual-semantic embedding are learned jointly, so the overall objective function is as follows:
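Collecting the three losses over all outfits $F$ in the training set, the joint objective presumably has the form (with $\theta = \{\theta_f, \theta_b, \theta_e\}$):

$$\min_{\theta}\ \sum_{F}\left(E_f(F; \theta_f) + E_b(F; \theta_b)\right) + E_e(\theta_e)$$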

The first two terms are the objectives of the bidirectional LSTM, and the third is the visual-semantic embedding loss. The whole model is trained with back-propagation through time (BPTT). Compared with a standard bidirectional LSTM, the only difference is that the gradient flowing into the CNN is the average of two sources (the LSTM part and the visual-semantic embedding part), which lets the CNN learn useful semantic information at the same time.

4. Experiments

4.1 Implementation Details

  • Bidirectional LSTM: an Inception V3 model outputs 2048-dimensional CNN features, which pass through a fully connected layer to give 512-dimensional features that are fed to the LSTM. The LSTM has 512 hidden units and the dropout probability is set to 0.7.
  • Visual-semantic embedding: the joint embedding space has 512 dimensions, so $W_I$ is $2048 \times 512$ and $W_T$ is $2757 \times 512$, where 2757 is the vocabulary size, and the margin is set to $m = 0.2$.
  • Joint training: the initial learning rate is 0.2, decayed by a factor of 2 every 2 epochs (a sketch of this schedule follows below). The batch size is 10, i.e., each batch contains 10 outfit sequences, about 65 images with their text descriptions. All layers of the network are then fine-tuned, and training stops when the validation loss plateaus.
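To make the schedule concrete, here is a small illustrative sketch of the step decay (not the paper's training code):

```python
def step_decay_lr(epoch, base_lr=0.2, decay_factor=2.0, decay_every=2):
    """Divide the learning rate by `decay_factor` every `decay_every` epochs."""
    return base_lr / (decay_factor ** (epoch // decay_every))

# epochs 0-1 -> 0.2, epochs 2-3 -> 0.1, epochs 4-5 -> 0.05, ...
print([step_decay_lr(e) for e in range(6)])
```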

4.2 Experimental Results

For the fill-in-the-blank task, the experimental results are as follows:

Some good and bad examples:

For outfit compatibility prediction, some example predictions are as follows:

For outfit generation, the first setting takes only garment images as input, as shown below:

  • Given a single image, the LSTMs in both directions can be run simultaneously to produce a complete outfit, as shown in Figure A.
  • If multiple images are given, as with the two images in Figure C, an operation similar to fill-in-the-blank is first performed to predict the garments between the two images; then, as shown in Figure D, the garments at the remaining positions are predicted to complete the outfit.

When both a garment image and a text description are given, it looks like this:

In this setting, an initial outfit is first generated from the given garment image; then, for the given text query $v_q$, each non-query garment $f_i$ in the initial outfit is updated as $\arg\min_f d(f, f_i + v_q)$, so that the updated garment is not only similar to the original one but also close to the text query in the visual-semantic embedding space.
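A minimal sketch of this update step, assuming a matrix of candidate garment embeddings in the joint space (names are illustrative; since $d$ is cosine similarity on normalized embeddings, the nearest candidate is found by maximizing similarity, which for L2-normalized vectors is equivalent to minimizing Euclidean distance):

```python
import torch
import torch.nn.functional as F

def update_item(candidates, f_i, v_q):
    """Among candidate garment embeddings, pick the one nearest to f_i + v_q
    in the visual-semantic space (maximum cosine similarity on normalized
    vectors, i.e. minimum Euclidean distance)."""
    target = F.normalize((f_i + v_q).unsqueeze(0), dim=1)   # (1, 512)
    cands = F.normalize(candidates, dim=1)                  # (num_candidates, 512)
    scores = cands @ target.t()                             # cosine similarities
    return candidates[scores.argmax()]                      # nearest candidate embedding

# toy usage: 1000 candidate garments, a current item f_i and a text query v_q
best = update_item(torch.randn(1000, 512), torch.randn(512), torch.randn(512))
```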

Alternatively, only a text description can be given, as follows:

  • In the first scenario (the image examples in the first two rows), the text query describes an attribute or style. The garment image closest to the text query is taken as the query image, the Bi-LSTM generates an outfit from it, and the outfit is then updated based on the query image and the text input.
  • In the second scenario (the image examples in the last two rows), the text query refers to a garment category, so the corresponding garment images are retrieved from the text query and used as query images to generate the outfit.