Motivation:
Imagine a data set of hundreds of thousands to millions of images with no metadata describing the content of each image. How do we build a system that can find a subset of these images to better answer users’ search queries?
What we essentially need is a search engine that can rank image results by how well they correspond to a search query, expressed either in natural language or as another query image.
The way we solve the problem in this post is to train a deep neural model that learns a fixed-length representation (or embedding) of any input image or text, such that if a text and an image, or two images, are “similar”, their representations are close in Euclidean space.
Data set:
We could not find a suitable public search-ranking data set, but we can use the Amazon product data set (http://jmcauley.ucsd.edu/data/amazon/), in which each e-commerce product image is linked to its title and description. We will use this metadata as a supervisory signal to learn meaningful joint text-image representations. To keep computing and storage costs manageable, the experiments were limited to fashion items (clothing, shoes and jewelry) and to 500,000 images.
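For reference, here is a rough sketch of how the metadata can be streamed from the gzipped file; the file name and field names (`asin`, `imUrl`, `title`, `description`) are assumptions about the metadata format, not taken from the original code:

```python
import gzip
from ast import literal_eval

def iter_products(path):
    # The metadata is distributed as one record per line; older releases use
    # Python-dict-style literals (swap in json.loads for strict-JSON files).
    with gzip.open(path, "rt") as f:
        for line in f:
            record = literal_eval(line)
            # Field names below are assumptions about the metadata schema.
            if "imUrl" in record and "title" in record:
                yield (record["asin"],
                       record["imUrl"],
                       record.get("title", ""),
                       record.get("description", ""))

# Example (assumed file name for the fashion subset):
# products = list(iter_products("meta_Clothing_Shoes_and_Jewelry.json.gz"))
```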
Problem setting:
Our data set links each image to a natural-language description. We therefore define a task in which we learn fixed-length joint representations of images and text, so that each image representation is close to the representation of the text that describes it.
Model:
The model has three inputs: an image (the anchor), the image’s title and description (the positive example), and some randomly sampled text (the negative example).
We then define two sub-models (a minimal code sketch follows this list):
- Image encoder: ResNet50 pre-trained on ImageNet + GlobalMaxPooling2D
- Text encoder: GRU + GlobalMaxPooling1D
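A minimal Keras sketch of these two encoders, assuming a 224×224 input resolution; the embedding size, vocabulary size, maximum text length, and the final Dense projection on the image side (added so both encoders output vectors of the same dimension) are assumptions, not values from the original code:

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import ResNet50

EMB_DIM = 128       # assumed embedding size
VOCAB_SIZE = 20000  # assumed vocabulary size
MAX_LEN = 100       # assumed maximum text length (in tokens)

def build_image_encoder():
    # ResNet50 pre-trained on ImageNet (without the classification head),
    # followed by GlobalMaxPooling2D and a projection to the embedding size.
    base = ResNet50(weights="imagenet", include_top=False)
    inp = layers.Input(shape=(224, 224, 3))
    x = layers.GlobalMaxPooling2D()(base(inp))
    out = layers.Dense(EMB_DIM)(x)  # projection so image/text dims match (assumed)
    return Model(inp, out, name="image_encoder")

def build_text_encoder():
    # Token embedding, a GRU returning the full sequence, then GlobalMaxPooling1D.
    inp = layers.Input(shape=(MAX_LEN,))
    x = layers.Embedding(VOCAB_SIZE, 128)(inp)
    x = layers.GRU(EMB_DIM, return_sequences=True)(x)
    out = layers.GlobalMaxPooling1D()(x)
    return Model(inp, out, name="text_encoder")
```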
The image sub-model produces the embedding of the anchor E_a, while the text sub-model outputs the embedding of the positive title and description E_p and the embedding of the negative text E_n.
Then, we train by optimizing the following loss function:
L = max(d(E_a, E_p) - d(E_a, E_n) + alpha, 0)
where d is the Euclidean distance and alpha is a margin hyperparameter, set to 0.4 in this experiment.
In essence, this loss drives d(E_a, E_p) down and d(E_a, E_n) up, so that each image embedding ends up close to the embedding of its description and far from the embeddings of random text.
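A minimal TensorFlow sketch of this loss, assuming the three embeddings arrive as tensors of shape (batch, dim); how it is wired into the triplet model (a shared text encoder for the positive and negative inputs) is left as in the original code:

```python
import tensorflow as tf

ALPHA = 0.4  # margin used in this experiment

def triplet_loss(e_a, e_p, e_n, alpha=ALPHA):
    # Euclidean distances anchor-positive and anchor-negative, per example.
    d_ap = tf.norm(e_a - e_p, axis=-1)
    d_an = tf.norm(e_a - e_n, axis=-1)
    # L = max(d(E_a, E_p) - d(E_a, E_n) + alpha, 0), averaged over the batch.
    return tf.reduce_mean(tf.maximum(d_ap - d_an + alpha, 0.0))
```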
Visualization results:
Once the image and text embedding models are trained, we can visualize them by projecting the embeddings into two-dimensional space with t-SNE (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). Test images and their corresponding text descriptions are connected by green lines.
As can be seen from the figure, an image and its corresponding description are usually close in the embedding space, which is what we would expect given the training loss we used.
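As a sketch, assuming `image_emb` and `text_emb` are NumPy arrays of test-set embeddings in matching row order (hypothetical names), the figure can be reproduced roughly like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_joint_tsne(image_emb, text_emb):
    # Project image and text embeddings into 2D with a single t-SNE fit
    # so that both sets live in the same projected space.
    joint = np.vstack([image_emb, text_emb])
    coords = TSNE(n_components=2).fit_transform(joint)
    n = len(image_emb)
    img_xy, txt_xy = coords[:n], coords[n:]
    plt.scatter(img_xy[:, 0], img_xy[:, 1], s=8, label="images")
    plt.scatter(txt_xy[:, 0], txt_xy[:, 1], s=8, label="descriptions")
    # Green lines connect each test image to its corresponding description.
    for (x1, y1), (x2, y2) in zip(img_xy, txt_xy):
        plt.plot([x1, x2], [y1, y2], color="green", linewidth=0.5)
    plt.legend()
    plt.show()
```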
Text-to-image search:
Here, we use several example text queries to search for the best matches in a set of 70,000 images. We compute the text embedding of the query and the embedding of each image in the collection, then select the nine images closest to the query in the embedding space.
These examples show that the embedding model learns useful representations of images, and of text composed of simple words.
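A rough sketch of this retrieval step, assuming the image embeddings have been precomputed into a NumPy array and the query has already been tokenized and padded (`query_tokens` and `image_embeddings` are hypothetical names):

```python
import numpy as np

def search_by_text(query_tokens, image_embeddings, text_encoder, k=9):
    # query_tokens: tokenized/padded query of shape (1, MAX_LEN);
    # image_embeddings: precomputed (N, d) embeddings of the image collection.
    q = text_encoder.predict(query_tokens)[0]
    # Euclidean distance from the query embedding to every image embedding.
    dists = np.linalg.norm(image_embeddings - q, axis=1)
    # Indices of the k images closest to the query in the embedding space.
    return np.argsort(dists)[:k]
```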
Image search:
Here, we use an image as the query and search a database of 70,000 images for the most similar examples. The ranking is determined by the Euclidean distance between each pair of images in the embedding space.
The results show that the generated embeddings are high-level representations of images that capture the most important characteristics of the objects shown, without being unduly influenced by orientation, illumination, or local detail, and without having been explicitly trained for this.
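Image-to-image retrieval reuses the same ranking logic as above, only the query is now embedded with the image encoder (again a sketch with assumed names):

```python
import numpy as np

def search_by_image(query_image, image_embeddings, image_encoder, k=9):
    # query_image: a preprocessed image batch of shape (1, 224, 224, 3).
    q = image_encoder.predict(query_image)[0]
    dists = np.linalg.norm(image_embeddings - q, axis=1)
    # Note: if the query image is itself in the collection, it will rank first.
    return np.argsort(dists)[:k]
```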
Conclusion: In this project, we built the machine learning components of a keyword- and image-based search engine for image collections. The basic idea is to learn a meaningful joint text-image embedding function, and then use the distance between items in the embedding space to rank search results.
References:
- Large Scale Online Learning of Image Similarity Through Ranking
- Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
- https://github.com/KinWaiCheuk/Triplet-net-keras/blob/master/Triplet%20NN%20Test%20on%20MNIST.ipynb
Code to reproduce the results: https://github.com/CVxTz/imagesearchengine