
This paper proposes a Dynamic Memory Generative Adversarial Network (DM-GAN) to generate high-quality images from text. When the initial image is poorly generated, a dynamic memory module refines the blurry image content, so that the image can be generated more accurately from the text description.

The paper was published at CVPR 2019 (IEEE Conference on Computer Vision and Pattern Recognition).

Address: arxiv.org/abs/1904.01…

Code address: github.com/MinfengZhu/…

This blog post is a close reading of the paper, including some personal understanding, background knowledge, and a summary.

1. Original abstract

In this paper, we focus on generating realistic images from text descriptions. Existing methods first generate an initial image with rough shapes and colors, and then refine it into a high-resolution image. Most existing text-to-image synthesis methods have two main problems. (1) These methods depend heavily on the quality of the initial image; if the initial image is not well initialized, the following steps can hardly refine it to a satisfactory quality. (2) Each word carries a different level of importance when depicting different image contents, yet an unchanged text representation is used in the existing image refinement process. In this paper, we propose a Dynamic Memory Generative Adversarial Network (DM-GAN) to generate high-quality images. When the initial image is not well generated, a dynamic memory module is introduced to refine the blurry image contents. A memory writing gate is designed to select important text information according to the initial image content, which enables our method to accurately generate images from the text description. We also utilize a response gate to adaptively fuse the information read from the memory with the image features. We evaluate the DM-GAN model on the Caltech-UCSD Birds 200 dataset and the Microsoft Common Objects in Context dataset. Experimental results demonstrate that our DM-GAN model performs favorably against the state-of-the-art approaches.

2. Keywords

Text-to-Image Synthesis, Generative Adversarial Nets, Computer Vision

3. Why is DM-GAN proposed?

The multi-stage image generation methods prior to DM-GAN (StackGAN, StackGAN++, AttnGAN) had two problems.

1. The generated results depend largely on the quality of the initial synthesized images (i.e. the first-stage images). If the initial image is poorly generated, the image refinement process cannot produce high-quality images.

2. Each word carries a different amount of information when describing the image content, while previous methods treat every word with the same fixed importance during refinement. In other words, each word has a different effect on and importance for the image content: how can words be properly correlated with the different parts of the image to be generated?

This paper proposes solutions to the above problems:

For the first problem, a memory mechanism is introduced to process the generated initial image: a key-value memory structure is added to the GAN. The rough features of the initial image are sent to the key-value memory to query features, and the initial image is refined with the query results.

For the second problem, a memory writing gate is introduced to dynamically select the words relevant to the generated image. The image information is used to determine the importance of each word during refinement, which makes the generated image depend properly on the text description. A response gate is used to adaptively fuse information from the image and the memory, and the memory components are dynamically written and read according to the initial image and the text information at each refinement step.

For example, AttnGAN treats all words equally and uses the same attention module to represent them, whereas the memory module proposed by DM-GAN can capture this difference during image generation because it dynamically selects important word information based on the initial image content.

4. Model structure

4.1. Model composition

The model consists of two phases: initial image generation and image refinement based on dynamic memory.

Initial image generation stage: similar to previous models, a text encoder first converts the input text description into feature vectors (a sentence feature vector and word feature vectors). The sentence feature then passes through a Conditioning Augmentation (CA) module and is concatenated with random noise, and the generator (fully connected layer + upsampling + convolution) produces an initial image with a rough shape and only a few details.
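To make this concrete, here is a minimal PyTorch sketch of such an initial stage, assuming a 100-dimensional noise vector and a 128-dimensional conditioned sentence feature from the CA module; the layer sizes and names are illustrative, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class InitialStageG(nn.Module):
    """Sketch of the initial generator: sentence condition + noise -> 64x64 image."""
    def __init__(self, z_dim=100, cond_dim=128, ngf=64):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(z_dim + cond_dim, ngf * 8 * 4 * 4),
            nn.BatchNorm1d(ngf * 8 * 4 * 4),
            nn.ReLU(inplace=True),
        )
        def up(cin, cout):
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode="nearest"),
                nn.Conv2d(cin, cout, 3, 1, 1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )
        self.upsample = nn.Sequential(up(ngf * 8, ngf * 4), up(ngf * 4, ngf * 2),
                                      up(ngf * 2, ngf), up(ngf, ngf))
        self.to_img = nn.Sequential(nn.Conv2d(ngf, 3, 3, 1, 1), nn.Tanh())

    def forward(self, z, c):
        # z: random noise, c: sentence condition from the CA module
        h = self.fc(torch.cat([z, c], dim=1)).view(z.size(0), -1, 4, 4)
        feat = self.upsample(h)            # R_0: rough 64x64 image features
        return self.to_img(feat), feat     # initial image and its features
```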

Dynamic memory based image refinement stage: more fine-grained visual content is added to the blurry initial image to generate a photo-realistic image $x_i$:

$$x_{i}=G_{i}\left(R_{i-1}, W\right)$$

where $R_{i-1}$ denotes the image features from the previous stage and $W$ the word features. This refinement stage can be repeated several times.

The dynamic memory based image refinement stage consists of four parts: Memory Writing, Key Addressing, Value Reading, and Response. A memory writing gate is used to highlight important word information, and a response gate is used to adaptively fuse the information read from memory with the image features.

Memory Writing: store text information in a key-value structured memory for further retrieval.

Key Addressing and Value Reading: read features from the memory module to improve and refine the visual features of the low-quality image.

Response: control the fusion of the image features and the memory readout.

4.2. Dynamic memory mechanism

Memory networks use explicit storage and attention to reason more effectively over information held in memory: information is first written to external memory, and then read from the memory slots according to relevance probabilities.

First, define $W$ as the input word features, $x$ as the image, $R_i$ as the image features, $T$ as the number of words, $N_w$ as the dimension of a word feature, $N$ as the number of image pixels, and $N_r$ as the dimension of an image pixel feature:

$$W=\left\{w_{1}, w_{2}, \ldots, w_{T}\right\}, \quad w_{i} \in \mathbb{R}^{N_{w}}$$

$$R_{i}=\left\{r_{1}, r_{2}, \ldots, r_{N}\right\}, \quad r_{i} \in \mathbb{R}^{N_{r}}$$

Dynamic memory refines the image by fusing text and image information in a more effective way, through non-trivial transformations between keys and values.

4.2.1. Memory Writing


$$m_{i}=M\left(w_{i}\right), \quad m_{i} \in \mathbb{R}^{N_{m}}$$

where $M(\cdot)$ denotes a 1×1 convolution operation.

Memory writing encodes prior knowledge: after a 1×1 convolution, the word (text) features are embedded into the $N_m$-dimensional memory feature space.
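As a rough illustration, plain memory writing can be sketched in PyTorch as a 1×1 convolution over the word features (the dimensions below are assumptions for illustration):

```python
import torch
import torch.nn as nn

N_w, N_m, T, batch = 256, 128, 18, 4
# word features W: (batch, N_w, T); a 1x1 Conv1d maps them into memory space
M = nn.Conv1d(N_w, N_m, kernel_size=1)
W = torch.randn(batch, N_w, T)
memory = M(W)                  # m_i for each word: (batch, N_m, T)
```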

4.2.2. Key Addressing


$$\alpha_{i, j}=\frac{\exp \left(\phi_{K}\left(m_{i}\right)^{T} r_{j}\right)}{\sum_{l=1}^{T} \exp \left(\phi_{K}\left(m_{l}\right)^{T} r_{j}\right)}$$

Here $\exp(x)$ is the exponential function with base $e$, $\phi_K(\cdot)$ is a 1×1 convolution that maps the key memory features to the image-feature dimension, and $\alpha_{i,j}$ is the similarity probability between the $i$-th memory slot and the $j$-th image feature.

Key addressing uses the key memory to retrieve the relevant memories, computing a weight for each memory slot.
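A minimal PyTorch sketch of key addressing, computing the softmax-normalized similarity between memory slots and image pixels (shapes are illustrative assumptions):

```python
import torch
import torch.nn as nn

N_m, N_r, T, N, batch = 128, 64, 18, 64 * 64, 4
phi_K = nn.Conv1d(N_m, N_r, kernel_size=1)    # maps key memory to the image-feature dim
memory = torch.randn(batch, N_m, T)           # m_i
R = torch.randn(batch, N_r, N)                # image pixel features r_j

keys = phi_K(memory)                          # (batch, N_r, T)
logits = torch.bmm(keys.transpose(1, 2), R)   # (batch, T, N): phi_K(m_i)^T r_j
alpha = torch.softmax(logits, dim=1)          # normalize over the T memory slots
```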

4.2.3. Value Reading


$$o_{j}=\sum_{i=1}^{T} \alpha_{i, j} \phi_{V}\left(m_{i}\right)$$

$\phi_V(\cdot)$ is a 1×1 convolution that maps the value memory features to the image-feature dimension. The memory representation $o_j$ is defined as a weighted sum of the value memories, weighted by the similarity probabilities.
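Likewise, value reading can be sketched as a weighted sum of value-mapped memories (again with assumed shapes):

```python
import torch
import torch.nn as nn

N_m, N_r, T, N, batch = 128, 64, 18, 64 * 64, 4
memory = torch.randn(batch, N_m, T)                      # m_i from memory writing
alpha = torch.softmax(torch.randn(batch, T, N), dim=1)   # weights from key addressing
phi_V = nn.Conv1d(N_m, N_r, kernel_size=1)               # maps value memory to image-feature dim

values = phi_V(memory)                                   # (batch, N_r, T)
o = torch.bmm(values, alpha)                             # o_j = sum_i alpha_{i,j} phi_V(m_i): (batch, N_r, N)
```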

4.2.4. Response

After receiving the memory output, the current image features are combined with the output representation to form new image features:

$$r_{i}^{\text{new}}=\left[o_{i}, r_{i}\right]$$

The new image features $R^{\text{new}}$ are then upsampled into higher-resolution features by upsampling and residual blocks, and the refined image is finally obtained by a 3×3 convolution.

4.3. Memory writing gate

First, the image features and word features are projected into the same dimension and combined to form the memory writing gate $g_i^w$:

$$g_{i}^{w}\left(R, w_{i}\right)=\sigma\left(A \cdot w_{i}+B \cdot \frac{1}{N} \sum_{i=1}^{N} r_{i}\right)$$

where $g_i^w$ is the memory writing gate, $\sigma$ is the sigmoid function, $A$ is a $1 \times N_w$ matrix, and $B$ is a $1 \times N_r$ matrix.

Then the image and text features are combined and written into the memory slot $m_i$:

$$m_{i}=M_{w}\left(w_{i}\right) * g_{i}^{w}+M_{r}\left(\frac{1}{N} \sum_{i=1}^{N} r_{i}\right) *\left(1-g_{i}^{w}\right)$$

where $m_i$ is the memory slot, and $M_w$ and $M_r$ denote 1×1 convolution operations.
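A minimal PyTorch sketch of this gated memory writing; here the matrices A and B are implemented as linear projections and M_r as a projection of the pooled image feature, which are assumptions for illustration:

```python
import torch
import torch.nn as nn

N_w, N_r, N_m, T, N, batch = 256, 64, 128, 18, 64 * 64, 4
A = nn.Linear(N_w, 1)                 # 1 x N_w matrix acting on each word feature
B = nn.Linear(N_r, 1)                 # 1 x N_r matrix acting on the mean image feature
M_w = nn.Conv1d(N_w, N_m, 1)          # 1x1 conv on the word features
M_r = nn.Linear(N_r, N_m)             # projection of the pooled image feature

W = torch.randn(batch, T, N_w)        # word features w_i
R = torch.randn(batch, N_r, N)        # image pixel features r_i
r_mean = R.mean(dim=2)                # (1/N) * sum_i r_i, shape (batch, N_r)

# memory writing gate g_i^w = sigmoid(A * w_i + B * mean(r))
g = torch.sigmoid(A(W) + B(r_mean).unsqueeze(1))          # (batch, T, 1)

# gated write: m_i = M_w(w_i) * g + M_r(mean(r)) * (1 - g)
m_word = M_w(W.transpose(1, 2)).transpose(1, 2)           # (batch, T, N_m)
m_img = M_r(r_mean).unsqueeze(1)                          # (batch, 1, N_m)
memory = m_word * g + m_img * (1 - g)                     # (batch, T, N_m)
```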

4.4. Response gate

An adaptive gating mechanism is used to dynamically control the information flow and update the image features:

$$g_{i}^{r}=\sigma\left(W\left[o_{i}, r_{i}\right]+b\right)$$

$$r_{i}^{\text{new}}=o_{i} * g_{i}^{r}+r_{i} *\left(1-g_{i}^{r}\right)$$

where $g_i^r$ is the response gate for information fusion, $\sigma$ is the sigmoid function, $W$ is a parameter matrix, and $b$ is the bias term.
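A minimal sketch of the response gate, implementing $W[o_i, r_i] + b$ as a 1×1 convolution over the concatenated features (an assumption for illustration):

```python
import torch
import torch.nn as nn

N_r, N, batch = 64, 64 * 64, 4
gate = nn.Conv1d(2 * N_r, N_r, kernel_size=1)          # W[o_i, r_i] + b, applied per pixel

o = torch.randn(batch, N_r, N)                         # memory readout
r = torch.randn(batch, N_r, N)                         # current image features

g_r = torch.sigmoid(gate(torch.cat([o, r], dim=1)))    # response gate g_i^r
r_new = o * g_r + r * (1 - g_r)                        # gated fusion of readout and image features
```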

4.5. Loss function

The total loss function is:

$$L=\sum_{i} L_{G_{i}}+\lambda_{1} L_{CA}+\lambda_{2} L_{DAMSM}$$

It has three terms: the first is the adversarial loss (the loss of each generator), the second is the Conditioning Augmentation (CA) loss, and the third is the DAMSM (Deep Attentional Multimodal Similarity Model) loss; $\lambda_1$ and $\lambda_2$ are weighting coefficients.

4.5.1 Adversarial loss

As in StackGAN++, the loss of each generator is:

$$L_{G_{i}}=-\frac{1}{2}\left[\mathbb{E}_{x \sim p_{G_{i}}} \log D_{i}(x)+\mathbb{E}_{x \sim p_{G_{i}}} \log D_{i}(x, s)\right]$$

As before, the first term is the unconditional loss, which makes the generated image as realistic as possible, and the second term is the conditional loss, which makes the image match the input sentence.

Correspondingly, the loss of each discriminator is:

$$L_{D_{i}}=\underbrace{-\frac{1}{2}\left[\mathbb{E}_{x \sim p_{\text {data }}} \log D_{i}(x)+\mathbb{E}_{x \sim p_{G_{i}}} \log \left(1-D_{i}(x)\right)\right.}_{\text {unconditional loss }}\underbrace{\left.+\;\mathbb{E}_{x \sim p_{\text {data }}} \log D_{i}(x, s)+\mathbb{E}_{x \sim p_{G_{i}}} \log \left(1-D_{i}(x, s)\right)\right]}_{\text {conditional loss }}$$
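A minimal sketch of these two adversarial losses, assuming discriminators that output unconditional and text-conditional logits; the $\log D$ terms are computed via binary cross-entropy with logits for numerical stability:

```python
import torch
import torch.nn.functional as F

def generator_loss(fake_uncond_logits, fake_cond_logits):
    """L_Gi = -1/2 [ E log D(x) + E log D(x, s) ] on generated images."""
    real = torch.ones_like(fake_uncond_logits)
    return 0.5 * (F.binary_cross_entropy_with_logits(fake_uncond_logits, real) +
                  F.binary_cross_entropy_with_logits(fake_cond_logits, real))

def discriminator_loss(real_uncond, real_cond, fake_uncond, fake_cond):
    """Unconditional + conditional loss over real and generated images."""
    ones, zeros = torch.ones_like(real_uncond), torch.zeros_like(real_uncond)
    uncond = (F.binary_cross_entropy_with_logits(real_uncond, ones) +
              F.binary_cross_entropy_with_logits(fake_uncond, zeros))
    cond = (F.binary_cross_entropy_with_logits(real_cond, ones) +
            F.binary_cross_entropy_with_logits(fake_cond, zeros))
    return 0.5 * (uncond + cond)
```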

4.5.2 CA loss

$$L_{CA}=D_{KL}\left(\mathcal{N}(\mu(s), \Sigma(s)) \,\|\, \mathcal{N}(0, I)\right)$$

where $\mu(s)$ and $\Sigma(s)$ are the mean and diagonal covariance matrix of the sentence features.
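A minimal sketch of this KL term in closed form, assuming the CA module outputs the mean and the log-variance of the diagonal Gaussian:

```python
import torch

def ca_loss(mu, logvar):
    """KL( N(mu, sigma^2 I) || N(0, I) ) for a diagonal Gaussian, given log-variance."""
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1, dim=1).mean()
```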

4.5.3 DAMSM loss

The DAMSM loss measures the degree of match between the image and the text description; it encourages the generated image to be better conditioned on the text. For the details of this loss, see the AttnGAN blog post.

5. Experiments

Datasets: CUB and COCO. Evaluation metrics: Inception Score (IS), Fréchet Inception Distance (FID), and R-Precision. The experimental results show that the DM-GAN model generates higher-quality images than other methods, learns a better data distribution, and produces images that better match the given text description, which further demonstrates the effectiveness of the proposed dynamic memory.

6. Attention and dynamic memory

Figure 6 of the paper visualizes the most relevant words chosen by AttnGAN and DM-GAN. Note that the attention mechanism fails to accurately select the relevant words when the initial image is not well generated.

DM-GAN instead selects the most relevant words with its dynamic memory module based on global image features. As shown in Figure 6(a), although a bird with an incorrect red breast is generated, the dynamic memory module selects the proper word, i.e., "white", to correct the image.

DM-GAN selects and combines word information with image features in two steps (see Figure 6(b)). The Memory Writing step first roughly selects the words relevant to the image and writes them into memory; the Key Addressing step then reads the more relevant words from memory with finer granularity.

Further reading

CogView: Mastering Text-to-image Generation via Transformers

Navigation: one-stop navigation of text-to-image generation blog posts (Text-to-Image summary directory)