Reprinted with permission from Machine Heart

Official account: PaperWeekly

Author: Guo Xiaofeng (IQiyi)

While learning about GANs recently, I found that most existing GAN reviews are the 2016 ones by Ian Goodfellow or by Wang Feiyue of the Institute of Automation. But in deep learning, and in GANs especially, progress is measured in months, and those two reviews now feel a bit dated.

Recently I found a new survey paper on GANs, more than 40 pages long, covering many aspects of GANs, so I studied it and organized my notes as follows. Much of this article summarizes my own understanding; if anything is inaccurate, corrections are welcome.

This article also draws on many blog posts, with reference links given. If anything here infringes your rights, please message me privately and I will remove it. The table of contents is as follows:

A basic introduction to GAN

Generative Adversarial Networks (GANs) are an excellent class of generative models that have sparked many interesting image-generation applications. Compared with other generative models, GANs have two distinguishing characteristics:

1. They do not rely on prior assumptions. Many traditional methods assume the data obey some distribution and then estimate that distribution by maximum likelihood.

2. Generating a realistic-looking sample is very simple: one forward pass through the generator. Sampling in traditional methods, by contrast, can be quite complex; interested readers may refer to Zhou Zhihua's book Machine Learning for an introduction to various sampling methods.

We elaborate on these two points below.

The basic concepts of GAN

GAN (Generative Adversarial Network), as the name implies, learns a generative model of the data distribution through an adversarial process.

The adversarial element is the contest between a generator network and a discriminator network: the generator produces samples that look as realistic as possible, while the discriminator tries to tell real samples from generated fakes. The schematic diagram is as follows:

A latent variable z (usually random noise following a Gaussian distribution) is passed through the generator to produce X_fake; the discriminator judges whether its input is the generated sample X_fake or a real sample X_real. The objective function is as follows:
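The equation image did not survive extraction; for reference, the standard minimax objective from the original GAN formulation, consistent with the V(D,G) discussed below, is:

```latex
\min_G \max_D V(D,G)
  = \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```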

For the discriminator D this is a binary classification problem, and V(D,G) is the standard binary cross-entropy loss. For the generator G, deceiving D as much as possible means maximizing the discriminator's output D(G(z)) on generated samples, i.e., minimizing log(1 - D(G(z))). Note that the log D(x) term does not involve G, so G can ignore it.

In practice, the generator and discriminator are trained alternately: first D, then G, back and forth. Note that for the generator the objective is min_G max_D V(D,G), i.e., it minimizes the maximum value of V(D,G).

To keep V(D,G) close to its maximum, we usually iterate the discriminator k times for each generator update (in practice k is often 1). With the generator G fixed, taking the derivative of V(D,G) yields the optimal discriminator D*(x):
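The formula image is missing here; for reference, the standard result for the optimal discriminator is:

```latex
D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}
```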

Substituting the optimal discriminator back into the objective function shows that, under the optimal discriminator, the generator's objective is equivalent to minimizing the Jensen-Shannon divergence (JSD) between Pdata(x) and Pg(x).

It can be proved that when G and D have sufficient capacity, the model converges and the two reach a Nash equilibrium. At that point Pdata(x) = Pg(x), and the discriminator outputs probability 1/2 for samples from either distribution; that is, generated samples are indistinguishable from real ones.

The objective function

We mentioned earlier that the GAN objective minimizes the JS divergence between the two distributions. In fact there are many ways to measure the distance between two distributions, and JS divergence is just one of them. Defining a different distance measure yields a different objective function. Many improvements to GAN training stability, such as EBGAN and LSGAN, define different measures of the distance between distributions.

f-divergence 

The f-divergence defines the distance between two distributions via the following formula:

Here f is a convex function with f(1) = 0. Different choices of f yield different optimization objectives. Details are as follows:
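The formula image is missing here; for reference, the standard definition is:

```latex
D_f(P_{data} \,\|\, P_g) = \int_x p_g(x)\, f\!\left(\frac{p_{data}(x)}{p_g(x)}\right) dx
```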

Note that this kind of divergence is not symmetric: D_f(Pdata ‖ Pg) and D_f(Pg ‖ Pdata) are in general not equal.

LSGAN 

LSGAN, mentioned above, is a special case of the f-divergence framework. Specifically, the LSGAN loss is as follows:

With a = c = 1 and b = 0, LSGAN has two major advantages:
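As a minimal sketch in plain Python (function names are mine, not from the LSGAN paper's code), assuming real samples labeled 1 and fake samples labeled 0:

```python
# Illustrative least-squares GAN losses. d_real / d_fake are lists of
# raw discriminator outputs for a batch of real / generated samples.

def _mean(xs):
    return sum(xs) / len(xs)

def lsgan_d_loss(d_real, d_fake, real_label=1.0, fake_label=0.0):
    """Discriminator: push D(x) toward real_label, D(G(z)) toward fake_label."""
    return 0.5 * _mean([(d - real_label) ** 2 for d in d_real]) \
         + 0.5 * _mean([(d - fake_label) ** 2 for d in d_fake])

def lsgan_g_loss(d_fake, target=1.0):
    """Generator: push D(G(z)) toward the real label. The quadratic keeps
    a nonzero gradient even when D is very confident (no saturation)."""
    return 0.5 * _mean([(d - target) ** 2 for d in d_fake])
```

Unlike the sigmoid cross-entropy, the gradient here grows with the distance from the target, which is exactly the stability argument made below.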

  • More stable training: it alleviates the gradient-saturation problem of traditional GAN training

  • Better generation quality: it penalizes generated samples that lie far from the discriminator's decision boundary

For the first point, stability training, take a look at this chart:

The left side of the figure compares input and output for a traditional GAN using sigmoid cross-entropy as the loss; the right side shows LSGAN using the least-squares loss. On the left, when the input is large the gradient is 0; in other words, the cross-entropy loss saturates easily. The least-squares loss on the right does not.

On the second point, improving generation quality, the original paper also explains this in detail. Specifically: samples that the discriminator classifies correctly contribute nothing to the gradient. But is a sample the discriminator classifies correctly necessarily close to the real data distribution? Obviously not.

Consider the ideal case: in a well-trained GAN, the generated distribution Pg coincides with the real distribution Pdata, and the discriminator's decision surface passes through the real data points. Conversely, we can use a sample's distance from the decision surface to measure the quality of a generated sample: the closer a sample is to the decision surface, the better the GAN is trained.

In figure (b), some points far from the decision surface are correctly classified, but they are not good generated samples, and traditional GANs generally ignore this. For LSGAN, the least-squares loss penalizes the distance between the decision surface and a sample point, as shown in figure (c); this "pulls" points far from the decision surface, i.e., far from the real data, back toward it.

Integral Probability Metric (IPM) 

IPM defines a family of evaluation functions F to measure the distance between two distributions. On a compact space, with P(x) a probability measure on x, the IPM between two distributions Pdata and Pg is defined as follows:
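The formula image is missing here; for reference, the standard definition is:

```latex
\gamma_{\mathcal{F}}(P_{data}, P_g)
  = \sup_{f \in \mathcal{F}} \;
    \mathbb{E}_{x \sim P_{data}}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)]
```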

Analogous to the f-divergence, each choice of the function family F defines a different optimization objective. Typical examples include WGAN and Fisher GAN. Here is a brief introduction to WGAN.

WGAN  

WGAN proposed a new distance measure, the Earth-Mover (EM) distance, also known as the Wasserstein distance. For an introduction to the Wasserstein distance, see the blog post "Wasserstein distance in plain language" [1].

Wasserstein distance is defined as follows:
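The formula image is missing here; for reference, the standard definition, consistent with the explanation of Π below, is:

```latex
W(P_{data}, P_g)
  = \inf_{\gamma \in \Pi(P_{data},\, P_g)} \mathbb{E}_{(x,y) \sim \gamma}\big[\, \|x - y\| \,\big]
```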

Π(Pdata, Pg) denotes the set of joint distributions γ whose marginals are Pdata(x) and Pg(x).

Intuitively, if we view a probability density function as the amount of mass at each point, then W(Pdata, Pg) is the minimum amount of work needed to move the probability mass of Pdata(x) onto Pg(x).

WGAN can also be explained through optimal transport theory: the WGAN generator solves for the optimal transport map, and the discriminator computes the Wasserstein distance, that is, the optimal total transport cost [4]. The theoretical derivation of WGAN is involved, but the code changes are very simple. Specifically [3]:

  • Remove the sigmoid from the last layer of the discriminator

  • Do not take the log in the generator and discriminator losses

  • After each update of the discriminator's parameters, clip their absolute values to at most a fixed constant c

Regarding the third point, WGAN's follow-up work WGAN-GP replaces weight clipping with a gradient penalty.
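The three changes can be sketched in plain Python (helper names are mine); this illustrates the losses and the clipping step, not a full training loop:

```python
# Illustrative WGAN pieces: the critic outputs raw scores (no sigmoid),
# the losses take no log, and weights are clipped after each critic update.

def _mean(xs):
    return sum(xs) / len(xs)

def wgan_critic_loss(d_real, d_fake):
    """Critic maximizes E[D(x)] - E[D(G(z))]; written as a loss to minimize."""
    return -(_mean(d_real) - _mean(d_fake))

def wgan_g_loss(d_fake):
    """Generator maximizes E[D(G(z))]."""
    return -_mean(d_fake)

def clip_weights(weights, c=0.01):
    """Clip every parameter into [-c, c] to roughly enforce the Lipschitz constraint."""
    return [max(-c, min(c, w)) for w in weights]
```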

Comparing f-divergence and IPM

There are two problems with the f-divergence. First, it depends on the dimensionality of the data space: as the dimension grows, the f-divergence becomes very difficult to compute. Second, the supports of the two distributions [3] are usually not aligned, which causes the divergence to tend to infinity.

IPM is unaffected by the data dimension and converges uniformly to the distance between the distributions Pdata and Pg. Moreover, even when the supports of the two distributions do not overlap, the IPM does not blow up.

Auxiliary objective function

In many GAN applications, an additional loss is used to stabilize training or to serve other purposes. For example, in image translation, image inpainting, and super-resolution, the generator adds the target image as supervision. EBGAN treats the GAN discriminator as an energy function and adds a reconstruction error to it. CGAN uses class-label information as supervision.

Other common generative models

Autoregressive models: PixelRNN and PixelCNN

The autoregressive model explicitly models the probability distribution Pdata(x) of image data, and optimizes the model by maximum likelihood estimation. Details are as follows:

The formula above is easy to understand: multiplying the conditional probabilities p(x_i | x_1, …, x_{i-1}) together gives the distribution of the image data. Modeling this relation with an RNN gives PixelRNN; with a CNN, PixelCNN. The details are as follows [5]:
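The factorization image did not survive extraction; for reference, the chain-rule form being described is:

```latex
p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, x_2, \ldots, x_{i-1})
```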

Clearly, both PixelCNN and PixelRNN are slow, because pixel values are generated one at a time. In the speech domain, WaveNet is a typical autoregressive model.

VAE 

PixelCNN/RNN defines a tractable density function, so we can directly optimize the likelihood of the training data. The variational autoencoder instead defines an intractable density function and models it through an additional latent variable z. The VAE schematic is as follows [6]:

In VAE, a neural network computes the mean and variance for a real sample X (assuming the latent variable follows a normal distribution); a latent sample z is then drawn and used to reconstruct X. Both VAE and GAN learn a mapping from the latent variable z to the real data distribution. But unlike GAN:

1. GAN uses a discriminator to measure the distance between the distribution produced by the distribution-transforming module (i.e., the generator) and the real data distribution.

2. VAE is less intuitive: it realizes the distribution-transforming map X = G(z) by constraining the latent variable z to follow a standard normal distribution and by reconstructing the data.

Generative model comparison

1. The autoregressive model generates data by explicitly modeling the probability distribution;

2. VAE and GAN both assume the latent variable z follows some distribution and learn a mapping X = G(z) that converts the latent distribution of z into the real data distribution Pdata(X);

3. GAN uses a discriminator to measure the quality of the mapping X = G(z), while VAE uses the KL divergence between the latent variable z and the standard normal distribution, plus a reconstruction error.

GAN common model structure

DCGAN 

DCGAN proposed to use CNN structure to stabilize GAN’s training, and used the following tricks:

  • Batch Normalization 

  • Use transposed convolution for up-sampling

  • Use LeakyReLU as the activation function

These tricks help a great deal with the stability of GAN training and can also be used when designing your own GAN networks.

Hierarchical structures

GANs have many problems generating high-resolution images. Hierarchical GANs improve image resolution step by step through hierarchical, staged generation. Typical models using multiple GANs are StackGAN and GoGAN; ProgressiveGAN generates in stages with a single GAN. The structures of StackGAN and ProgressiveGAN are as follows:

Autoencoder structures

In the classical GAN structure, the discriminator is a probabilistic model that distinguishes real from generated samples. In autoencoder structures, the discriminator (an autoencoder, AE) is instead treated as an energy function: samples close to the data manifold have low energy, and vice versa. With such a distance measure, it is natural to use the discriminator to guide generator learning.

Why can an AE discriminator serve as an energy function measuring the distance between a generated sample and the data manifold? First, look at the AE loss:

The AE loss is a reconstruction error. When an AE is used as the discriminator, a real sample yields a small reconstruction error, while a generated sample yields a large one, because the AE has difficulty learning a compressed representation of generated samples (that is, generated samples lie far from the data manifold). The AE reconstruction error is therefore a reasonable measure of the distance between Pdata and Pg. Typical GANs with an autoencoder structure include BEGAN, EBGAN, and MAGAN.

Obstacles in GAN training

Problems with the theory

The classic GAN has two alternative generator losses, namely:

When the first form above is used as the loss: once the discriminator is optimal, it is equivalent to minimizing the JS divergence between the generated and real distributions. Since a randomly initialized generated distribution rarely overlaps the real distribution non-negligibly, and the JS divergence changes abruptly in that regime, the generator faces vanishing gradients.

When the second form above is used as the loss: under the optimal discriminator, it is equivalent to minimizing the KL divergence between the generated and real distributions while simultaneously maximizing their JS divergence, which is contradictory and leads to unstable gradients. Moreover, the asymmetry of the KL divergence makes the generator prefer losing diversity over losing accuracy, which causes mode collapse [7].

Problems in practice

GAN has two problems in practice:

First, Ian Goodfellow, who proposed GAN, proved in theory that GAN can reach a Nash equilibrium. In practice, however, we optimize in parameter space rather than function space, so the theoretical guarantee does not hold.

Second, the GAN objective is a minimax problem: min_G max_D V(G,D). That is, when optimizing the generator, we minimize max_D V(G,D). But we optimize iteratively, and ensuring V(G,D) is actually maximized requires many discriminator iterations, which makes training slow.

If instead we iterate the discriminator once, then the generator once, back and forth, the original minimax problem easily turns into a maxmin problem, and the two are not the same:

Under the maxmin ordering, the iteration looks like this: the generator first produces some samples, then the discriminator flags the errors and penalizes the generator, which adjusts its generated distribution. This often makes the generator "lazy": it produces simple, repetitive samples that lack variety, which is known as mode collapse.

Techniques for stabilizing GAN training

As discussed above, GAN has several problems in theory and in practice that make training unstable and cause mode collapse. The following techniques help stabilize training:

Feature matching: the method is very simple. Replace the discriminator's output in the original GAN loss with the features of an intermediate discriminator layer, and minimize the distance between the discriminator features of generated images and those of real images.
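A minimal sketch of that distance in plain Python (names are mine): compare the batch-mean features of real and generated images from one discriminator layer.

```python
# Illustrative feature-matching loss: squared L2 distance between the
# batch means of intermediate discriminator features.

def feature_matching_loss(feat_real, feat_fake):
    """feat_real / feat_fake: lists of feature vectors (lists of floats)
    taken from the same intermediate discriminator layer."""
    dim = len(feat_real[0])
    mean_real = [sum(f[j] for f in feat_real) / len(feat_real) for j in range(dim)]
    mean_fake = [sum(f[j] for f in feat_fake) / len(feat_fake) for j in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_real, mean_fake))
```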

Label smoothing: the labels in GAN training are either 0 or 1, which pushes the discriminator toward overconfident predictions. Label smoothing alleviates this. Specifically, the label 1 is replaced with a random number between 0.8 and 1.0.
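As a one-function sketch (the function name is mine), generating smoothed real labels before computing the discriminator loss:

```python
import random

def smooth_real_labels(n, low=0.8, high=1.0, seed=None):
    """Replace the hard real label 1 with random values in [low, high]."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(n)]
```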

Spectral normalization: WGAN and Improved WGAN constrain the optimization by imposing a Lipschitz condition. Spectral normalization instead imposes the Lipschitz constraint on each layer of the discriminator, and it is computationally more efficient than Improved WGAN.

PatchGAN: strictly speaking, PatchGAN is not a stabilization technique, but it is widely used in image translation. PatchGAN amounts to discriminating each small patch of the image, which helps the generator produce sharper, clearer edges.

The specific approach: a 256×256 image is fed to the discriminator, which outputs a 4×4 confidence map. Each value in the map is the confidence that the corresponding patch of the image is real; that is PatchGAN. The patch size equals the receptive field of one output value. Finally, the losses of all patches are averaged to give the final loss.
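That averaging step can be sketched in plain Python (the helper name is mine), assuming a 4×4 map of per-patch "real" probabilities:

```python
import math

def patchgan_loss(conf_map, is_real, eps=1e-7):
    """Average the per-patch binary cross-entropy over a confidence map.
    conf_map: nested list (e.g. 4x4) of per-patch 'real' probabilities."""
    target = 1.0 if is_real else 0.0
    total, count = 0.0, 0
    for row in conf_map:
        for p in row:
            p = min(max(p, eps), 1.0 - eps)  # clamp for numerical safety
            total += -(target * math.log(p) + (1 - target) * math.log(1 - p))
            count += 1
    return total / count
```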

Mode Collapse solutions

Improved methods for the objective function

To avoid the mode-collapse problem caused by the maxmin ordering described earlier, UnrolledGAN modifies the generator loss. Specifically, when updating the generator, UnrolledGAN's reference loss is not the discriminator's current loss, but its loss after k further iterations.

Note that in these k look-ahead iterations the discriminator does not actually update its parameters; they are used only to compute the loss for the generator update. This lets the generator anticipate the discriminator's next k changes and avoids the mode collapse caused by hopping between modes. Be careful to distinguish this from iterating the generator k times and then the discriminator once [8].

DRAGAN introduces the no-regret algorithm from game theory and transforms its loss accordingly to address mode collapse [9]. The EBGAN mentioned above addresses mode collapse by adding an autoencoder reconstruction error.

Improved methods for the network structure

Multi-Agent Diverse GAN (MAD-GAN) uses multiple generators and one discriminator to ensure diversity among generated samples. The specific structure is as follows:

Compared with an ordinary GAN, it has several extra generators, and a regularization term is added to the loss. The regularizer uses cosine distance to penalize agreement among the samples produced by the three generators.

MRGAN adds a discriminator that penalizes mode collapse in the generated samples. The specific structure is as follows:

An input sample X is encoded by an encoder into a latent variable E(x), which the generator then reconstructs. Training uses three losses.

D_M and R (the reconstruction error) guide the generation of realistic samples. D_D discriminates between samples generated from E(x) and from z. Since both are fake samples, this discriminator mainly judges whether the generated samples are diverse, i.e., whether mode collapse has occurred.

Mini-batch Discrimination 

Mini-batch discrimination inserts a mini-batch layer into an intermediate layer of the discriminator to compute sample statistics based on L1 distance. These statistics indicate how close a sample is to the other samples in a batch, and the discriminator can use this information to identify samples that lack diversity. The generator, in turn, is pushed to produce diverse samples.
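A simplified sketch of such a closeness statistic in plain Python (this is a pedagogical reduction of the original layer, not its exact form): for each sample, sum exp(-L1 distance) to the other samples in the batch; near-identical samples, i.e. a collapsing batch, score high.

```python
import math

def minibatch_similarity(features):
    """features: list of feature vectors; returns one closeness score per sample."""
    scores = []
    for i, fi in enumerate(features):
        s = 0.0
        for j, fj in enumerate(features):
            if i == j:
                continue
            l1 = sum(abs(a - b) for a, b in zip(fi, fj))  # L1 distance
            s += math.exp(-l1)
        scores.append(s)
    return scores

# A collapsed batch (identical samples) scores higher than a diverse one:
collapsed = minibatch_similarity([[0.0, 0.0], [0.0, 0.0]])
diverse = minibatch_similarity([[0.0, 0.0], [5.0, 5.0]])
```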

Understanding GAN's latent space

A latent space is a compressed representation of the data. In general, modifying an image directly in data space is impractical, because image attributes lie on a manifold in a high-dimensional space. In latent space, however, each latent dimension can represent a specific attribute, so such modifications become feasible.

In this section, we explore how GAN handles the latent space and its attributes, and how variational methods are incorporated into the GAN framework.

Latent space decomposition

GAN's input latent variable z is unstructured: we do not know which attributes each of its dimensions controls. Some researchers therefore proposed decomposing the latent variable into a conditional variable c and a standard latent variable z. This includes supervised and unsupervised methods.

Supervised method

Typical supervised methods are CGAN and ACGAN.

CGAN feeds random noise z and a class label c to the generator, while the discriminator takes the generated/real sample together with the class label as input, learning the association between labels and images.

ACGAN likewise feeds random noise z and a class label c to the generator, but its discriminator takes only the generated/real sample as input and additionally predicts the image's class label, learning the association between labels and images. The two structures are as follows (CGAN on the left, ACGAN on the right):

Unsupervised method

In contrast to supervised methods, unsupervised methods do not use any label information. Therefore, unsupervised methods need to decouple the hidden space to obtain meaningful feature representation.

InfoGAN decomposes the input noise into a latent variable z and a conditional variable c (sampled from a uniform distribution during training), which are fed to the generator together. Variable decoupling is achieved by maximizing the mutual information I(c; G(z,c)) between c and G(z,c). The mutual information I(c; G(z,c)) measures how much information c carries about G(z,c); maximizing it maximizes the correlation between the generated result and the conditional variable c.

The structure of the model is basically consistent with that of CGAN, except that Loss has one more maximum mutual information. The details are as follows [10] :

From the analysis above, InfoGAN only achieves the decoupling of information; we cannot control what each value of the conditional variable c specifically means.

SS-InfoGAN then appeared. It uses semi-supervised learning to split the conditional variable c into two parts, Css and Cus: Css learns from labels as in CGAN, while Cus learns as in InfoGAN.

The combination of GAN and VAE

Compared with VAE, GAN generates sharp images but is prone to mode collapse. VAE, because it encourages reconstructing all samples, does not suffer from mode collapse.

A typical work combining the two is VAEGAN, which is similar in structure to the MRGAN mentioned above, as follows:

The model's loss has three parts: the reconstruction error of an intermediate discriminator feature layer, the VAE loss, and the GAN loss.

Summary of GAN models

The previous two sections introduced a variety of GAN models, most of which were designed around two common GAN problems: mode collapse and training instability. The following table summarizes these models for review:

Applications of GAN

Since GAN can generate realistic samples without explicitly modeling the data distribution, it has a wide range of applications in images, text, speech, and many other fields. The following table summarizes GAN applications in various areas; these algorithms are introduced below.

Images

Image translation

Image translation means converting an image from one domain (the source domain) into another (the target domain). It is analogous to machine translation, which converts one language into another. The translation keeps the source-domain image content unchanged while the style or some other attributes become those of the target domain.

Paired two domain data  

A typical method for paired image translation is Pix2Pix, which uses paired data to train a conditional GAN; its loss combines the GAN loss with a per-pixel difference loss. PAN uses the per-pixel difference on feature maps as a perceptual loss, instead of the per-pixel difference on images, to generate images that are perceptually closer to the desired output.

Unpaired two domain data  

A typical method for image translation with unpaired training data is CycleGAN. CycleGAN uses two GANs: one network converts source-domain data to the target domain, and the other converts the target-domain result back to the source domain. The data converted back must match the original source data exactly, which forms the supervision signal.
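That round-trip constraint is CycleGAN's cycle-consistency loss, sketched here in plain Python (the function name is mine; λ = 10 is the weighting used in the paper):

```python
# Illustrative cycle-consistency loss: an image translated to the other
# domain and back should reproduce the original, measured by L1 error.

def cycle_consistency_loss(x, x_roundtrip, y, y_roundtrip, lam=10.0):
    """x, y: flattened source/target images as lists of floats;
    *_roundtrip: the same images after translating to the other domain
    and back again."""
    def l1(a, b):
        return sum(abs(p - q) for p, q in zip(a, b)) / len(a)
    return lam * (l1(x, x_roundtrip) + l1(y, y_roundtrip))
```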

Super resolution

SRGAN uses GAN together with a perceptual loss to generate images rich in detail. The perceptual loss focuses on errors in intermediate feature layers rather than per-pixel errors of the output, which avoids the lack of texture detail in generated high-resolution images.

Target detection

Building on GAN's success in super-resolution, GAN can generate high-resolution images of small objects, improving small-object detection accuracy.

Joint image distribution learning

While most GANs learn the data distribution of a single domain, CoupledGAN proposes a partially weight-shared network that uses unsupervised learning to learn the joint distribution of images across domains. The specific structure is as follows [11]:

As shown above, CoupledGAN uses two GAN networks. The first half of each generator is shared, encoding high-level information common to the two domains, while the second half is not shared, decoding data for its own domain. The first half of each discriminator is not shared, while the second half, which extracts high-level features, shares weights between the two. For a trained network, inputting one random noise vector outputs two images from different domains.

Note that the model above learns the joint distribution P(x,y). If two separate GANs were trained instead, we would learn only the marginal distributions P(x) and P(y), and in general P(x,y) ≠ P(x)·P(y).

Video generation

Typically, a video consists of a relatively static background and a moving foreground. VideoGAN uses a two-stream generator: a 3D CNN generator produces the moving foreground and a 2D CNN generator produces the static background.

Pose GAN uses VAE and GAN to generate video. First, the VAE predicts the motion information of the next frame from the current frame's pose and past pose features; then a 3D CNN uses that motion information to generate the subsequent video frames.

Motion and Content GAN (MoCoGAN) proposes separating the motion part and the content part of the latent space, and models the motion part with an RNN.

Sequence generation

Compared with GAN's applications to images, its applications to text and speech are far fewer. There are two main reasons:

1. GAN optimizes with backpropagation. For discrete data such as text and speech, the generator cannot jump directly to a target value; it can only approach it step by step along the gradient, and a small continuous change does not map to a valid discrete token.

2. In sequence generation, we would like to judge whether the sequence is reasonable each time a token is generated, but GAN's discriminator cannot do that, short of building a discriminator for every step, which is clearly impractical.

To address these problems, policy gradient methods from reinforcement learning were introduced into GAN-based sequence generation.

Music generation

RNN-GAN uses LSTMs as the generator and the discriminator and directly generates the entire audio sequence. However, as noted above, music involves both lyrics and notes, and applying GAN directly to such discrete data raises many problems, in particular a lack of local consistency in the generated output.

SeqGAN, by contrast, treats the generator's output as an agent's policy and the discriminator's output as a reward, and trains the model with policy gradients. ORGAN, based on SeqGAN, adds an objective function tailored to a specific goal.

Language and speech

VAW-GAN (Variational Autoencoding Wasserstein GAN) combines VAE and WGAN to implement a voice-conversion system. The encoder encodes the content of the speech signal, and the decoder reconstructs the timbre. Since VAE tends to produce over-smoothed results, WGAN is used here to generate clearer speech signals.

Semi-supervised learning

Labeling image data requires a great deal of manual annotation, which is time-consuming and laborious.

Semi-supervised learning using discriminator

The GAN-based semi-supervised learning method of [12] uses unlabeled data. Its implementation is basically the same as the original GAN; the specific framework is as follows [13]:

Compared with the original GAN, the main difference is that the discriminator outputs K+1 classes, with the (K+1)-th class denoting generated samples. The discriminator loss has two parts: a supervised loss (predicting the sample's class) and an unsupervised loss (judging whether a sample is real or generated). The generator only needs to produce samples that are as realistic as possible. After training, the discriminator can be used as a classification model.
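The two loss parts can be sketched in plain Python (function names are mine), with class indices 0..K-1 for the real classes and index K for "generated":

```python
import math

def _softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def supervised_loss(logits, label):
    """Cross-entropy on the true class label of a labeled real sample."""
    return -math.log(_softmax(logits)[label])

def unsupervised_loss(logits_real, logits_fake):
    """Real samples should avoid the fake class (index -1); fake samples
    should land in it. p(fake) = softmax(logits)[-1]."""
    p_fake_real = _softmax(logits_real)[-1]
    p_fake_fake = _softmax(logits_fake)[-1]
    return -math.log(1.0 - p_fake_real) - math.log(p_fake_fake)
```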

Intuitively, the generated samples mainly help the classifier learn where the boundary of the real data space lies.

Semi-supervised learning using auxiliary classifiers

The discriminator-based semi-supervised model above has a problem: the discriminator both distinguishes real from fake samples and predicts labels. The two goals are not consistent, so neither may reach its optimum. An intuitive fix is to separate label prediction from real/fake discrimination, which is exactly what Triple-GAN does [14]:

Here (Xg, Yg) ~ pg(X, Y), (Xl, Yl) ~ p(X, Y), and (Xc, Yc) ~ pc(X, Y) denote generated data, labeled data, and unlabeled data respectively. CE stands for the cross-entropy loss.

Domain adaptive

Domain adaptation is a concept in transfer learning. In simple terms, we define the source data field distribution as Ds(x,y) and the target data field distribution as DT(x,y). We have many labels for source domain data, but no labels for target domain data. We want to be able to learn a model from labeled data in the source domain and unlabeled data in the target domain and generalize it well in the target domain. The word “transfer” of transfer learning refers to the transfer of source domain data distribution to target domain data distribution.

When GAN is used for transfer learning, the core idea is to use generator to transform source domain data features into target domain data features, and discriminator to distinguish real data and generated data features as much as possible. Here are two examples of GAN for transfer learning: DANN and ARDA

For example, DANN on the left of the above figure, Is and It represent data in the source domain and data in the target domain respectively. Ys represents the label of data in the source domain. Fs and Ft are source domain features and target domain features. In DANN, generators are used to extract features and make it difficult for the discriminator to distinguish the extracted features from the source domain data features or the target domain data features.

CycleGAN-based transfer learning has many applications for data augmentation in person re-identification. One difficulty of person re-identification lies in the large differences in viewpoint and environment between images shot by different cameras, which leads to a large domain gap.

Therefore, GAN can be used to generate data under different cameras for data augmentation. [15] proposed a CycleGAN-based method for this purpose. The specific model structure is as follows:

Training a CycleGAN for each pair of cameras makes it possible to convert data from one camera to data from the other, but the content (people) remains the same.
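The key constraint that keeps the content (the person) unchanged is the cycle-consistency loss: translating from camera A to camera B and back should recover the original image. As a toy numpy sketch, with hypothetical linear maps G (A to B) and F (B to A) standing in for the two generators:

```python
import numpy as np

rng = np.random.default_rng(0)
x_a = rng.normal(size=(64, 8))  # toy stand-in for camera-A image features

# Hypothetical "generators": G maps A -> B, F maps B -> A.
# Here F is the exact inverse of G purely for illustration.
W = rng.normal(size=(8, 8))
G = lambda x: x @ W
F = lambda x: x @ np.linalg.inv(W)

# Cycle-consistency loss: A -> B -> A should reproduce the input,
# so the identity is preserved while the camera style changes.
cycle_loss = np.abs(F(G(x_a)) - x_a).mean()  # near 0 when F inverts G
```

In the real model, G and F are convolutional networks and the loss is minimized jointly with the two adversarial losses; the sketch only illustrates the constraint itself.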

Other applications

GAN has a wide variety of variants and applications, including non-machine learning applications. Here are some examples.

Medical image segmentation

[16] proposed a Segmentor-Critic structure for medical image segmentation. The segmentor, similar to the generator in GAN, produces segmentation maps, while the critic maximizes the distance between the generated segmentation maps and the ground truth. In addition, DI2IN uses GAN to segment 3D CT images, and SCAN uses GAN to segment X-ray images.

Image steganalysis

Steganography refers to hiding secret information in a non-secret container, such as a picture, and a steganalyzer is used to determine whether a container carries secret information. Some studies try to use a GAN generator to produce pictures carrying steganographic information, with two discriminators: one judges whether the picture is real, and the other judges whether the picture carries secret information [17].

Continual learning

Continual learning aims to solve multiple tasks while accumulating new knowledge in the process. A prominent problem in continual learning is knowledge forgetting. [18] used a GAN generator as part of a "scholar" model: the generator is continually trained to replay previous knowledge, and a solver provides the answers, thereby avoiding the problem of knowledge forgetting.

Discussion

In the first and second parts we discussed GAN and its variants, and in the third part we discussed the applications of GAN. The following table summarizes some of the better-known GAN model structures and the additional constraints they impose.

The above are all about the micro level of GAN. Next, we will discuss GAN from a macro perspective.

The evaluation of GAN

There are a variety of GAN evaluation methods. Existing sample-based methods (as the name implies, evaluation at the sample level) extract features from generated samples and real samples and then measure a distance in the feature space. The general framework is as follows:

The notation used in this section is as follows:

P_g denotes the generated data distribution and P_r the real data distribution; E denotes mathematical expectation and x an input sample; x ~ P_g denotes sampling from the generated distribution and x ~ P_r sampling from the real distribution. y denotes the sample label, and M denotes the classification network, usually an Inception network.

The common evaluation indicators are introduced one by one below.

Inception Score

For a GAN well trained on ImageNet, when its generated samples are fed to an Inception network for classification, the predicted probabilities should have the following characteristics:

1. For pictures of the same category, the output probability distribution should tend toward a sharp, one-hot-like distribution. This ensures the fidelity of the generated samples.

2. Across all categories, the marginal output distribution should be close to uniform, which indicates the absence of mode collapse and ensures the diversity of the generated samples.

Therefore, the following metric can be designed: IS(G) = exp(E_{x~P_g} KL(p_M(y|x) || p_M(y))).

According to the previous analysis, for a well trained GAN, p_M(y|x) tends toward a sharp distribution and p_M(y) toward a uniform one, so the KL divergence, and hence the Inception Score, will be large. Experiments show that the Inception Score tends to be consistent with human subjective judgment. Note that the calculation of IS uses no real data, and its value depends on the choice of the model M.

Features: it can measure the diversity and fidelity of generated samples to a certain extent, but it cannot detect overfitting. The same is true for the Mode Score. Not recommended for data that differs significantly from the ImageNet dataset.
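To make the computation concrete, here is a minimal numpy sketch, with synthetic softmax outputs standing in for the p_M(y|x) produced by an actual Inception network:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, C) softmax outputs p_M(y|x) over C classes for N generated samples."""
    p_y = probs.mean(axis=0)  # marginal class distribution p_M(y)
    # IS = exp( E_x [ KL( p_M(y|x) || p_M(y) ) ] )
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

n, c = 1000, 10
labels = np.repeat(np.arange(c), n // c)    # balanced classes -> uniform p_M(y)
sharp = np.full((n, c), 0.01 / (c - 1))
sharp[np.arange(n), labels] = 0.99          # sharp p_M(y|x) per sample
uniform = np.full((n, c), 1.0 / c)          # uninformative predictions

print(inception_score(sharp))    # close to the maximum of 10 for 10 classes
print(inception_score(uniform))  # 1.0, the lowest possible score
```

Sharp per-sample predictions with a uniform marginal (diverse, confident generations) score near the number of classes; uniform predictions score 1.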

Mode Score  

As an improved version of the Inception Score, the Mode Score additionally measures the similarity between the predicted label distributions of generated samples and real samples. The specific formula is as follows:

Kernel MMD

The calculation formula is as follows:

To compute the Kernel MMD, we first need to select a kernel function k that maps samples into a Reproducing Kernel Hilbert Space (RKHS). Compared with Euclidean space, RKHS has many advantages; in particular, inner products of functions are well defined and computable.

Expanding the above formula gives: MMD²(P_r, P_g) = E_{x,x'~P_r}[k(x, x')] − 2 E_{x~P_r, y~P_g}[k(x, y)] + E_{y,y'~P_g}[k(y, y')].

The smaller the MMD value is, the closer the two distributions are.

Features: It can measure the quality of images generated by the model to a certain extent, and the calculation cost is small. Recommended.
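A minimal numpy sketch of the biased squared-MMD estimate under a Gaussian (RBF) kernel; the toy Gaussian samples below stand in for real and generated feature vectors:

```python
import numpy as np

def mmd2_rbf(x, y, sigma=1.0):
    """Biased estimate of MMD^2 between sample sets x and y under an RBF kernel."""
    def kernel(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
        return np.exp(-d2 / (2 * sigma ** 2))
    # MMD^2 = E[k(x,x')] - 2 E[k(x,y)] + E[k(y,y')]
    return float(kernel(x, x).mean() - 2 * kernel(x, y).mean() + kernel(y, y).mean())

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 2))
good = rng.normal(0.0, 1.0, size=(500, 2))  # "generator" matching the real distribution
bad = rng.normal(3.0, 1.0, size=(500, 2))   # "generator" with a shifted distribution

print(mmd2_rbf(real, good))  # near 0: distributions match
print(mmd2_rbf(real, bad))   # clearly larger: distributions differ
```

The kernel bandwidth sigma is a free parameter here; in practice it is often set by the median heuristic on pairwise distances.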

Wasserstein Distance 

The Wasserstein distance, also commonly called the earth mover's distance in optimal transport problems, was discussed in detail when it was introduced in WGAN. The formula is as follows: W(P_r, P_g) = inf_{γ ∈ Π(P_r, P_g)} E_{(x,y)~γ}[||x − y||].

Wasserstein Distance can measure the similarity between two distributions. The smaller the distance, the more similar the distribution.

Features: if the feature space is chosen properly, it works to some extent, but its O(n³) computational complexity is too high.
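The cubic cost comes from solving the general transport problem as a linear program; in one dimension the distance reduces to comparing quantiles and is cheap to compute, which `scipy.stats.wasserstein_distance` does. A sketch with toy 1-D Gaussian features standing in for real and generated features:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=5000)  # stand-in for real 1-D feature values
fake = rng.normal(0.5, 1.0, size=5000)  # generated features, mean shifted by 0.5

# For two distributions differing only by a shift, the distance equals the shift
print(wasserstein_distance(real, fake))  # roughly 0.5
print(wasserstein_distance(real, real))  # 0.0: identical samples
```

The smaller the value, the closer the two distributions, matching the interpretation above.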

Fréchet Inception Distance (FID)

FID computes the distance between real samples and generated samples in a feature space. The Inception network is first used to extract features, and then a Gaussian model is fitted to the features; the distance is computed from the means and covariances of the Gaussians. The specific formula is as follows: FID(P_r, P_g) = ||μ_r − μ_g||² + Tr(C_r + C_g − 2 (C_r C_g)^(1/2)).

μ and C represent the mean and covariance, respectively.

Features: although only the first two moments of the feature distribution are used, FID is robust and efficient to compute.
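A minimal numpy/scipy sketch: given feature matrices (here toy Gaussian features instead of real Inception activations), fit a Gaussian to each and plug the means and covariances into the formula:

```python
import numpy as np
from scipy import linalg

def fid(feats_r, feats_g):
    """FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2})."""
    mu_r, mu_g = feats_r.mean(axis=0), feats_g.mean(axis=0)
    c_r = np.cov(feats_r, rowvar=False)
    c_g = np.cov(feats_g, rowvar=False)
    covmean = linalg.sqrtm(c_r @ c_g)  # matrix square root
    if np.iscomplexobj(covmean):       # discard tiny numerical imaginary parts
        covmean = covmean.real
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(c_r + c_g - 2 * covmean))

rng = np.random.default_rng(0)
feats_r = rng.normal(0.0, 1.0, size=(2000, 4))  # stand-in for real features
feats_g = feats_r + 2.0                         # "generated" features, shifted by 2

print(fid(feats_r, feats_r))  # ~0: identical distributions
print(fid(feats_r, feats_g))  # ~16: mean shift of 2 squared, times 4 dimensions
```

In real use, `feats_r` and `feats_g` would be Inception pool-layer activations of real and generated images.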

1-Nearest Neighbor classifier 

Using leave-one-out evaluation combined with a 1-NN classifier (other classifiers would also work), compute the classification accuracy on real and generated images. If the two distributions are close, the accuracy approaches 50%; in the extreme case where the generator memorizes the real samples, it drops toward 0%. For GAN evaluation, the authors use the classification accuracy on real samples and on generated samples separately to measure the fidelity and diversity of the generated samples.

For real samples Xr under 1-NN classification: if the generated samples are realistic enough, the real sample region will be surrounded by generated samples Xg, and the accuracy on Xr will be very low.

For generated samples Xg: if their diversity is insufficient, the generated samples cluster around a few modes, so Xg is easily distinguished from Xr, resulting in high accuracy.

Features: an ideal metric, which can also detect overfitting.
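A sketch of the leave-one-out 1-NN two-sample test with scikit-learn, again using toy Gaussian features as stand-ins for real and generated samples:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_accuracy(real, fake):
    """Leave-one-out 1-NN accuracy on the pooled real/generated set (0.5 is ideal)."""
    x = np.vstack([real, fake])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    # Ask for 2 neighbors and drop the first: each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=2).fit(x)
    _, idx = nn.kneighbors(x)
    pred = y[idx[:, 1]]  # label of the nearest other point
    return float((pred == y).mean())

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(300, 2))
good = rng.normal(0.0, 1.0, size=(300, 2))  # matches the real distribution
bad = rng.normal(5.0, 1.0, size=(300, 2))   # easily separated from real data

print(nn_accuracy(real, good))  # near 0.5: samples are indistinguishable
print(nn_accuracy(real, bad))   # near 1.0: generated samples are easy to spot
```

Accuracy near 0.5 indicates matching distributions; well above 0.5 indicates distinguishable (e.g. mode-collapsed) samples, and well below 0.5 indicates memorization.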

Other evaluation methods

AIS, KDE and other methods can also be used to evaluate GAN, but they are not model-agnostic metrics; that is, they cannot be computed from generated samples and real samples alone.

Conclusion

In practice, MMD and the 1-NN two-sample test are the most suitable evaluation metrics: both can distinguish real samples from generated samples, detect mode collapse, and be computed efficiently.

Generally speaking, GAN learning is an unsupervised process, so it is hard to find an objective, quantifiable evaluation metric, and a high score on some metric does not necessarily mean good generation quality. In short, GAN evaluation remains an open problem.

GAN and reinforcement learning

The goal of reinforcement learning is, given a state s, to select the best action a for an agent. In general, we define a value function Q(s, a) to measure this: for state s, the return of taking action a is Q(s, a), and obviously we want to maximize that return. For many complex problems it is hard to define the value function Q(s, a), just as it is hard to quantify how good a GAN-generated picture is.

You may have noticed the analogy by now: since there is no good metric for how realistic a GAN-generated picture is, we learn a discriminator to judge the distance between generated and real pictures; likewise, since the value function Q(s, a) in reinforcement learning is hard to define, we use a neural network to learn it. Typical models include DDPG, TRPO, and so on.

Advantages and Disadvantages of GAN

Advantages

The advantages of GAN were introduced at the beginning of this article. Here is a summary:

1. GAN can generate data in parallel. Compared with PixelCNN and PixelRNN, GAN generation is very fast, because GAN uses the generator in place of a sequential sampling process;

2. GAN does not need to introduce a lower bound to approximate the likelihood. VAE optimizes a variational lower bound on the likelihood because direct optimization is difficult; however, VAE makes assumptions about the prior and posterior distributions, so even when the variational lower bound is optimized well, VAE may still fail to approach the true likelihood.

In practice, images generated by GAN are much sharper than those generated by VAE.

Disadvantages

The disadvantages of GAN are also discussed in detail in the previous article. The main problems are as follows:

1. Training is unstable and prone to collapse. Scholars have proposed many solutions to this problem, such as WGAN and LSGAN.

2. Mode collapse. Despite much related research, this problem is still not completely solved, owing to the high-dimensional nature of image data.

Future research direction

Training collapse and mode collapse in GAN still need to be improved. Although deep learning is very powerful, there are still many domains it cannot conquer at present, and GAN is expected to make progress in some of them.

References

[1] https://zhuanlan.zhihu.com/p/57062205

[2]  https://blog.csdn.net/victoriaw/article/details/60755698 

[3] https://zhuanlan.zhihu.com/p/25071913 

[4] GAN and Monge-Ampère equation theory

[5]  https://blog.csdn.net/poulang5786/article/details/80766498 

[6] https://spaces.ac.cn/archives/5253 

[7] https://www.jianshu.com/p/42c42e13d09b 

[8] https://medium.com/@jonathan_hui/gan-unrolled-gan-how-to-reduce-mode-collapse-af5f2f7b51cd 

[9] https://medium.com/@jonathan_hui/gan-dragan-5ba50eafcdf2 

[10] https://medium.com/@jonathan_hui/gan-cgan-infogan-using-labels-to-improve-gan-8ba4de5f9c3d 

[11]  https://blog.csdn.net/carrierlxksuper/article/details/60479883 

[12] Salimans, Tim, et al. “Improved techniques for training gans.” Advances in neural information processing systems. 2016. 

[13]  https://blog.csdn.net/qq_25737169/article/details/78532719 

[14] https://medium.com/@hitoshinakanishi/reading-note-triple-generative-adversarial-nets-fc3775e52b1e1 

[15] Zheng Z, Zheng L, Yang Y. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. 2017 IEEE International Conference on Computer Vision (ICCV). IEEE Computer Society, 2017.

[16] Yuan Xue, Tao Xu, Han Zhang, Rodney Long, and Xiaolei Huang. SegAN: Adversarial network with multi-scale L1 loss for medical image segmentation. arXiv preprint arXiv:1706.01805, 2017.

[17] Denis Volkhonskiy, Ivan Nazarov, Boris Borisenko. Steganographic generative adversarial networks. arXiv preprint arXiv:1703.05502, 2017.

[18] Shin, Hanul, et al. “Continual learning with deep generative replay.” Advances in Neural Information Processing Systems. 2017.
