
Unsupervised Image Captioning is a paper on image captioning accepted at CVPR 2019. I have been reading it recently, so this post records my notes on the paper. If anything is wrong, feel free to discuss it in the comments section.

1. Abstract

At present, most image captioning models rely heavily on paired image-sentence datasets, which are expensive to obtain. In this paper, the author therefore attempts an unsupervised model for the first time. The model requires only an image set, a sentence corpus, and an existing visual concept detector. Because existing corpora are mostly built for language research and have little connection to images, the author crawls a large-scale image description corpus containing over two million natural-language sentences.

2. Introduction

This figure illustrates the conceptual differences among existing image captioning models:

  1. Figure A refers to supervised learning, which requires paired image-sentence data for training.
  2. Figure B refers to generating captions for objects that do not appear in any image-sentence pair but do appear in an image recognition dataset, so that novel objects can be introduced into the sentences.
  3. Figure C refers to generalizing what is learned from existing image-sentence pairs to unpaired data; in this way, the new model is not trained on paired image-sentence data.
  4. Figure D refers to converting an image into a sentence in a pivot language (Chinese) and then translating the pivot-language sentence into the target language (English).
  5. Figure E refers to a semi-supervised learning framework in which an external text corpus is used for pre-training.
  6. Figure F is the unsupervised learning model proposed by the author of this paper.

There are three key steps in this model:

  1. A sentence generator is trained on the corpus with an adversarial text generation approach, which produces sentences from given image features. Because no ground-truth captions for the training images are available in the unsupervised setting, adversarial training is used to make the generated sentences plausible.
  2. To ensure that the generated captions contain the content of the image, the author distills the knowledge provided by the visual detector into the model; that is, the generator is rewarded whenever a word corresponding to a visual concept detected in the image appears in the generated sentence.
  3. Given an image feature, the model can decode a caption, which can in turn be used to reconstruct the image feature. Likewise, a sentence from the corpus can be encoded into a feature and then reconstructed. Through this bi-directional reconstruction, the generated sentences are encouraged to capture the semantic meaning of the image, which improves the model.

In general, this paper makes four contributions:

  1. An unsupervised image captioning framework
  2. Three objectives for training the image captioning model
  3. An initialization pipeline that uses only unlabeled data
  4. A crawled corpus of over two million sentences, which yields good results

3. Unsupervised Image Captioning

The unsupervised image captioning model in this paper needs a set of images $\mathcal{I} = \{I_1, I_2, \ldots, I_n\}$, a set of sentences $\mathcal{S} = \{S_1, S_2, \ldots, S_T\}$, and an existing visual concept detector. It should be noted that the sentences come from an external corpus and have no relation to the images.

3.1 The Model

The figure above shows the main elements of the model, including an image encoder, a sentence generator, and a sentence discriminator.

Encoder

The encoder in this paper is the Inception-V4 network, a CNN, which encodes the input image into a feature representation $f_{im}$:
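As a rough illustration of this step (not the authors' code), encoding amounts to running a pretrained CNN with its classification head removed. The sketch below assumes the `timm` library, which provides an Inception-V4 implementation; the input is expected to be resized and normalized to 299x299.

```python
import torch
import timm  # assumption: timm is installed and provides 'inception_v4'

# Load Inception-V4 as a feature extractor; num_classes=0 drops the classifier head.
encoder = timm.create_model('inception_v4', pretrained=True, num_classes=0)
encoder.eval()

# A dummy batch of 4 images already preprocessed to Inception-V4's 299x299 input.
images = torch.randn(4, 3, 299, 299)

with torch.no_grad():
    f_im = encoder(images)   # image features f_im, shape (4, 1536) for Inception-V4

print(f_im.shape)
```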

Generator

An LSTM is used as the generator (i.e., the decoder). At each time step, the LSTM outputs a probability distribution over words conditioned on the image feature and the previously generated words.

Here FC denotes a fully connected layer, $\sim$ denotes the sampling operation, and $n$ is the length of the generated sentence. $W_e$ is the word embedding matrix, $x_t$ is the input to the LSTM, $s_t$ is the one-hot representation of the generated word, $h_t^g$ is the hidden state of the generator LSTM, and $p_t$ is the probability distribution over the dictionary at step $t$. $s_0$ and $s_n$ denote the start-of-sentence (SOS) and end-of-sentence (EOS) tokens, respectively, and $h_{-1}^g$ is initialized to zero. In the unsupervised model, the input $x_t$ comes from the word sampled from the probability distribution $p_t$ (i.e., $s_t \sim p_t$ and $x_t = W_e s_t$).
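As a concrete illustration of the notation above, here is a minimal PyTorch-style sketch of one sampling roll-out (my own simplification rather than the authors' implementation; the vocabulary size, dimensions, EOS id, and maximum length are hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim, hidden_dim, feat_dim = 10000, 512, 512, 1536  # hypothetical sizes
EOS, max_len = 2, 20                                                 # hypothetical EOS id and cap

fc_in  = nn.Linear(feat_dim, embed_dim)        # FC projecting f_im into the LSTM input space
W_e    = nn.Embedding(vocab_size, embed_dim)   # word embedding matrix W_e
lstm_g = nn.LSTMCell(embed_dim, hidden_dim)    # generator LSTM
fc_out = nn.Linear(hidden_dim, vocab_size)     # FC producing logits over the dictionary

def generate(f_im):
    """Roll out one caption from an image feature f_im of shape (feat_dim,)."""
    h = torch.zeros(1, hidden_dim)              # h_{-1}^g initialized to zero
    c = torch.zeros(1, hidden_dim)
    x = fc_in(f_im.unsqueeze(0))                # x_{-1} = FC(f_im): the image is fed only once
    words = []
    for t in range(max_len):
        h, c = lstm_g(x, (h, c))                # next hidden state h_t^g
        p_t = F.softmax(fc_out(h), dim=-1)      # p_t: distribution over the dictionary
        s_t = torch.multinomial(p_t, 1).item()  # s_t ~ p_t ("~" is the sampling operation)
        words.append(s_t)
        if s_t == EOS:                          # stop once the EOS token is sampled
            break
        x = W_e(torch.tensor([s_t]))            # x_t = W_e s_t is fed back as the next input
    return words

print(generate(torch.randn(feat_dim)))
```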

Discriminator

The discriminator in this paper is also implemented with an LSTM, which is used to distinguish whether a sentence is a real sentence from the corpus or one generated by the model.


Here $h_t^d$ is the hidden state of the discriminator LSTM, and $q_t$ is the probability that the discriminator considers the partial generated sentence $S_{1:t} = [s_1, \ldots, s_t]$ to be real. Similarly, given a real sentence $S$ from the corpus, the discriminator outputs $q_t$ for $1 \le t \le L$, where $L$ is the length of $S$ and $q_t$ is the probability that the first $t$ words of $S$ are regarded as real.
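A matching PyTorch-style sketch of the discriminator, again a simplification with hypothetical sizes: the LSTM reads a sentence word by word, and a fully connected layer with a sigmoid turns each hidden state $h_t^d$ into the score $q_t$.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10000, 512, 512   # hypothetical sizes

embed_d = nn.Embedding(vocab_size, embed_dim)
lstm_d  = nn.LSTM(embed_dim, hidden_dim, batch_first=True)   # discriminator LSTM
fc_q    = nn.Linear(hidden_dim, 1)                           # scores each sentence prefix

def discriminate(word_ids):
    """Return q_t for every prefix of a sentence given as a 1-D tensor of word indices."""
    x = embed_d(word_ids.unsqueeze(0))                       # (1, L, embed_dim)
    h_all, _ = lstm_d(x)                                     # h_t^d for every step: (1, L, hidden_dim)
    q = torch.sigmoid(fc_q(h_all)).squeeze(-1).squeeze(0)    # q_t for t = 1..L, shape (L,)
    return q

sentence = torch.randint(0, vocab_size, (12,))   # a dummy sentence of length 12
print(discriminate(sentence))                    # 12 probabilities, one per prefix
```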

3.2 Training

In this paper, the authors define three objectives to make the unsupervised model feasible:

3.2.1 Adversarial Caption Generation

The author uses adversarial training to make the generated sentences more realistic. At each time step, the generator is given an adversarial reward whose value is

For the discriminator, the corresponding loss function is

We need to make the first part as large as possible and the second part as small as possible; in other words, the discriminator should score real corpus sentences high and generated sentences low.
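The exact formulas are given in the paper; the sketch below only illustrates the standard GAN-style pattern they follow (the epsilon smoothing and tensor shapes are my own assumptions):

```python
import torch

def adversarial_reward(q_gen):
    """Per-step generator reward: higher when the discriminator believes the
    generated prefix is real (log of the discriminator score)."""
    return torch.log(q_gen + 1e-8)

def discriminator_loss(q_real, q_gen):
    """Real-vs-fake objective: push q_t towards 1 on real corpus sentences and
    towards 0 on generated sentences."""
    loss_real = -torch.log(q_real + 1e-8).mean()
    loss_fake = -torch.log(1.0 - q_gen + 1e-8).mean()
    return loss_real + loss_fake

q_real = torch.rand(15)   # discriminator scores on a real corpus sentence (length 15)
q_gen  = torch.rand(12)   # discriminator scores on a generated sentence (length 12)
print(adversarial_reward(q_gen), discriminator_loss(q_real, q_gen))
```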

3.2.2 Visual Concept Distillation

To relate the generated sentences to the images, the author introduces a concept reward, which is given whenever a generated word matches a visual concept detected in the image. For each image, the detector provides a set of concepts with confidence scores:

where $c_i$ is the $i$-th detected concept and $v_i$ is the corresponding confidence score. The concept reward for the $t$-th generated word $s_t$ is

where $I(\cdot)$ is the indicator function, i.e., its value is 1 when $s_t = c_i$ and 0 otherwise.
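A minimal sketch of this reward, assuming the detector output is available as a word-to-confidence mapping (the dictionary format and example values are hypothetical):

```python
def concept_reward(word, detections):
    """Reward a generated word that matches a detected visual concept.

    detections: dict mapping a detected concept c_i to its confidence v_i.
    Returns I(word = c_i) * v_i, i.e. v_i on a match and 0 otherwise.
    """
    return detections.get(word, 0.0)

detections = {"dog": 0.92, "frisbee": 0.71, "grass": 0.55}   # toy detector output
print(concept_reward("dog", detections))      # 0.92
print(concept_reward("table", detections))    # 0.0
```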

3.2.3 Bi-directional Image-Sentence Reconstruction

In order to make the resulting sentences better reflect the image, the author proposes to project the image and the sentence into a hidden space and have them reconstruct each other.

Image Reconstruction

This step makes the generated sentence more relevant to the image. The author chooses to reconstruct the features of the image rather than the whole image. As shown in the figure above, the discriminator can be regarded as a sentence encoder, with a fully connected layer at its end projecting the final hidden state $h_n^d$ into the space shared by the image and the sentence:

where $x'$ can be regarded as the image feature reconstructed from the generated sentence. An additional image reconstruction loss function is then defined as

As a result, the reward for image reconstruction can be defined as
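A hedged sketch of the idea with hypothetical dimensions; the exact placement of the projection layers and the choice of target image feature follow the paper's equations, which are only summarized here.

```python
import torch
import torch.nn as nn

hidden_dim, shared_dim, feat_dim = 512, 512, 1536   # hypothetical sizes

fc_proj_sent = nn.Linear(hidden_dim, shared_dim)    # projects h_n^d into the shared space
fc_proj_img  = nn.Linear(feat_dim, shared_dim)      # projects f_im into the same space

def image_reconstruction(h_n_d, f_im):
    """L2-style reconstruction loss between the sentence-derived feature x' and the
    projected image feature; the reward is simply the negative of the loss."""
    x_prime = fc_proj_sent(h_n_d)          # x': image feature reconstructed from the sentence
    target  = fc_proj_img(f_im)
    loss = torch.sum((x_prime - target) ** 2)
    return loss, -loss                     # (image reconstruction loss, reward)

loss, reward = image_reconstruction(torch.randn(hidden_dim), torch.randn(feat_dim))
print(loss.item(), reward.item())
```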

Sentence Reconstruction

The discriminator encodes a sentence and projects it into the shared space; this projection can be viewed as the image representation associated with the given sentence, from which the generator should be able to reconstruct the original sentence. The cross-entropy loss of sentence reconstruction can be defined as

where $s_t$ denotes the $t$-th word in the sentence.
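A minimal sketch of this loss, assuming teacher forcing so that the generator produces one score vector per target word (shapes are hypothetical):

```python
import torch
import torch.nn.functional as F

def sentence_reconstruction_loss(logits, target_ids):
    """Cross-entropy between the words decoded from the sentence's shared-space
    representation and the original corpus sentence.

    logits:     (L, vocab_size) generator scores at each step
    target_ids: (L,) indices s_t of the original sentence
    """
    return F.cross_entropy(logits, target_ids)

vocab_size, L = 10000, 12
logits = torch.randn(L, vocab_size, requires_grad=True)   # dummy generator outputs
target = torch.randint(0, vocab_size, (L,))               # the original corpus sentence
print(sentence_reconstruction_loss(logits, target).item())
```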

3.2.4 Integration

The full training considers all three objectives. For the generator, the author uses policy gradient, which estimates the gradients of the trainable parameters from the joint reward. The joint reward consists of the adversarial reward, the concept reward, and the image reconstruction reward. Besides the gradient estimated by policy gradient, the sentence reconstruction loss also provides gradients to the generator through back-propagation, and both kinds of gradients are used to update the generator. Let $\theta$ denote the trainable parameters of the generator; then the gradient with respect to $\theta$ is

Here $\gamma$ is the discount (decay) factor, $b_t$ is the baseline reward estimated with the self-critic algorithm, and $\lambda_c$, $\lambda_{im}$, and $\lambda_{sen}$ are hyperparameters controlling the weights. For the discriminator, the adversarial and reconstruction losses are combined, and its parameters are updated via gradient descent:

The generator and the discriminator are updated alternately during training.
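To make the integration concrete, here is a hedged sketch of how such a joint generator objective could be assembled; the per-step reward composition, the crediting of the image reconstruction reward at the last step, and all tensor shapes are my own assumptions rather than the paper's exact equations.

```python
import torch

gamma, lam_c, lam_im, lam_sen = 0.9, 10.0, 0.2, 1.0   # weights reported in Section 4.2

def generator_loss(log_probs, r_adv, r_c, r_im, baseline, sen_recon_loss):
    """Surrogate loss whose gradient combines the REINFORCE (policy gradient)
    estimate from the discounted joint reward with the directly back-propagated
    sentence reconstruction term.

    log_probs:  (n,) log p_t(s_t) of the sampled words
    r_adv, r_c: (n,) per-step adversarial and concept rewards
    r_im:       scalar image reconstruction reward (credited at the last step)
    baseline:   (n,) self-critic baseline rewards b_t
    """
    n = log_probs.shape[0]
    rewards = r_adv + lam_c * r_c
    rewards[-1] = rewards[-1] + lam_im * r_im
    # Discounted return-to-go for each step t: sum_{k >= t} gamma^(k - t) * r_k
    returns = torch.zeros(n)
    running = 0.0
    for t in reversed(range(n)):
        running = rewards[t] + gamma * running
        returns[t] = running
    advantage = returns - baseline
    pg_loss = -(advantage.detach() * log_probs).sum()
    return pg_loss + lam_sen * sen_recon_loss

# Toy usage with a 12-word rollout (all tensors are dummy stand-ins).
n = 12
log_probs = torch.log(torch.rand(n) + 1e-3).requires_grad_()
loss = generator_loss(log_probs, torch.rand(n), torch.rand(n), torch.tensor(0.5),
                      torch.rand(n), torch.tensor(1.2, requires_grad=True))
loss.backward()
print(log_probs.grad.shape)
```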

3.3 Initialization

The author proposes an initialization pipeline to pretrain the generator and the discriminator.

For the generator, the author generates a pseudo-caption for each image and then initializes the generator with these image-caption pairs. Specifically, this involves the following four steps (a toy sketch follows the list):

  1. First, build a concept dictionary that contains all the object classes in the OpenImages dataset.
  2. Second, use the corpus to train a concept-to-sentence (con2sen) model. Given a sentence, a one-layer LSTM encoder encodes the concept words it contains into a feature vector, and another one-layer LSTM decoder decodes that feature back into a full sentence.
  3. Third, use the existing concept detector to detect the concepts contained in each image. With the detected concepts and the concept-to-sentence model, a pseudo-caption can be generated for each image.
  4. Fourth, use these pseudo image-caption pairs to train the generator.
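A toy sketch of this pseudo-captioning pipeline; `detect_concepts` and `con2sen` below are hypothetical stand-ins for the OpenImages detector and the concept-to-sentence model, and their outputs are made up for illustration.

```python
# Toy stand-ins: the real pipeline uses an OpenImages object detector and an
# LSTM encoder-decoder (con2sen) trained on the crawled corpus.
def detect_concepts(image_id):
    """Concepts detected in the given image (hard-coded toy output)."""
    return ["dog", "frisbee", "grass"]

def con2sen(concepts):
    """Sentence decoded from the detected concept words (hard-coded toy output)."""
    return "a dog catches a frisbee on the grass"

def build_pseudo_pairs(image_ids):
    """Steps 3-4: pair each image with the pseudo-caption decoded from its concepts."""
    return [(img, con2sen(detect_concepts(img))) for img in image_ids]

print(build_pseudo_pairs(["img_001", "img_002"]))  # pairs used to initialize the generator
```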

For the discriminator, the parameters are initialized by training an adversarial sentence generation model on the corpus.

4. Experiments

The author uses the images of the MSCOCO dataset as the image set (discarding the captions) and builds the corpus by crawling Shutterstock image descriptions. An object detection model trained on OpenImages is used as the visual concept detector.

4.1 Shutterstock Image Description Corpus

Each Shutterstock image is accompanied by a description. The author uses the names of the 80 object categories in the MSCOCO dataset as search keywords and collects 2,322,628 distinct image descriptions in total.

4.2 Experimental Settings

The LSTM hidden dimension and the shared space dimension are both set to 512. $\lambda_c$, $\lambda_{im}$, and $\lambda_{sen}$ are set to 10, 0.2, and 1, respectively, and $\gamma$ is set to 0.9. The model is trained with the Adam optimizer with a learning rate of 0.0001. During initialization, Adam with a learning rate of 0.001 is used to minimize the cross-entropy loss. At test time, beam search with a beam size of 3 is used to generate captions.
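For reference, a minimal sketch of these settings (variable names are mine, and the linear layer is only a dummy stand-in for the real generator/discriminator parameters):

```python
import torch
import torch.nn as nn

# Hyperparameters reported in this section.
hidden_dim, shared_dim = 512, 512
lambda_c, lambda_im, lambda_sen, gamma = 10.0, 0.2, 1.0, 0.9
beam_size = 3

# Dummy module standing in for the model parameters.
dummy = nn.Linear(hidden_dim, shared_dim)

# Adam with lr=1e-4 for the main training stage; the initialization stage
# instead uses Adam with lr=1e-3 to minimize the cross-entropy loss.
optimizer      = torch.optim.Adam(dummy.parameters(), lr=1e-4)
init_optimizer = torch.optim.Adam(dummy.parameters(), lr=1e-3)
print(optimizer, init_optimizer)
```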

If you found this post useful, please give it a thumbs up, and feel free to exchange ideas in the comments section.