Recently, Imperial College London, the University of Montreal, and other research institutions jointly published a review paper on generative adversarial networks. The paper gives a comprehensive overview of GANs, from the most basic architecture and its variants to the training process and training techniques, covering the core concepts, open problems, and proposed solutions. Heart of the Machine gives a brief introduction to the paper.

For the complete theoretical derivation and TensorFlow implementation of GANs (Goodfellow et al., 2014), please refer to the Heart of the Machine GitHub project and the article "Complete theoretical derivation and implementation of GAN". Below, we introduce the review paper for our readers.


Address: arxiv.org/pdf/1710.07…

Generative adversarial networks (GANs) provide a way to learn deep representations without requiring large amounts of annotated training data. The two networks are trained in competition with each other, with their parameters updated by backpropagation. The representations learned by GANs can be used in a variety of applications, including image synthesis, semantic image editing, style transfer, image super-resolution, and classification. The purpose of this paper is to provide an overview of GANs for the signal processing community. In addition to introducing different approaches to building and training GANs, we also discuss the remaining theoretical and application challenges.


1. Introduction

Generators and discriminators typically consist of multi-layer networks with convolutional and/or fully connected layers. The generator and discriminator must be differentiable, but they need not be directly invertible (in theory). If the generator network is viewed as a mapping from some representation space, called a latent space, to the data space (here we focus on images), it can be expressed more formally as G: G(z) → R^|x|, where z ∈ R^|z| is a sample from the latent space, x ∈ R^|x| is an image, and |·| denotes dimensionality.
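As a concrete illustration (not taken from the paper), the sketch below implements such a generator as a small fully connected PyTorch network mapping a latent vector in R^|z| to an image vector in R^|x|; the latent dimension, layer sizes, and MNIST-sized output are arbitrary assumptions.

```python
# A minimal sketch of a generator as a differentiable map G: R^|z| -> R^|x|,
# here a small fully connected network in PyTorch.
import torch
import torch.nn as nn

latent_dim = 100          # |z|: dimensionality of the latent space (assumed)
image_dim = 28 * 28       # |x|: dimensionality of the image space, e.g. MNIST

G = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.ReLU(),
    nn.Linear(256, image_dim),
    nn.Tanh(),            # outputs scaled to [-1, 1], a common convention
)

z = torch.randn(16, latent_dim)   # a batch of latent samples z ~ N(0, I)
x_fake = G(z)                     # generated images, shape (16, |x|)
```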

The discriminator network D of the original GAN can be regarded as a function D: D(x) → (0, 1) that maps image data to a discrimination probability, i.e. the probability that the image comes from the real data distribution rather than the generator distribution. For a fixed generator G, the discriminator D may be trained to distinguish whether an image comes from the training data (real, probability close to 1) or from the generator (fake, probability close to 0). Once the discriminator is optimal, it can no longer be fooled, so generator G must continue training to reduce the discriminator's accuracy. If the generator distribution matches the real data distribution well enough, the discriminator will be maximally confused and output a probability of 0.5 for every input. In practice, the discriminator may not be trained to this ideal state; we explore the training process in more depth in Section 4.
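A matching sketch of the discriminator as a map D: R^|x| → (0, 1), again with assumed layer sizes; the sigmoid output is read as the probability that the input came from the real data distribution.

```python
# A minimal sketch of a discriminator D: R^|x| -> (0, 1).
import torch
import torch.nn as nn

image_dim = 28 * 28       # |x|: dimensionality of the image space (assumed)

D = nn.Sequential(
    nn.Linear(image_dim, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
    nn.Sigmoid(),
)

x = torch.rand(16, image_dim)   # stand-in for a batch of images
p_real = D(x)                   # shape (16, 1); near 1 = "real", near 0 = "fake"
# A maximally confused discriminator would output 0.5 for every input.
```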

Figure 1. The training flow of the discriminator D and the generator G in a GAN. They are usually implemented as neural networks, but in practice they can be implemented by any form of differentiable system that maps data from one space to another.


3. GAN architecture

Figure 2. During GAN training, the generator is encouraged to produce a sample distribution p_g(x) that matches the real data distribution p_data(x). With a suitably parameterized and trained GAN, these distributions become indistinguishable. The representations of a GAN are constructed through the learned parameters (weights) of the generator and discriminator networks.


A. Fully connected GANs

The first GAN architectures used fully connected neural networks for both the generator and the discriminator. This type of architecture was applied to relatively simple image datasets: MNIST (handwritten digits), CIFAR-10 (natural images), and the Toronto Face Dataset (TFD).


B. Convolutional GANs

Because CNNs are very well suited to image data, extending from fully connected to convolutional neural networks is a natural step. Early experiments on CIFAR-10 showed, however, that it was difficult to train generator and discriminator networks using CNNs with the same level of capacity and representational power as those used in supervised learning.

The Laplacian pyramid of adversarial networks (LAPGAN) [13] contributed a solution to this problem by using a multi-scale decomposition of the generation process: a ground-truth image is decomposed into a Laplacian pyramid, and a conditional convolutional GAN is trained to generate each level given the level above.
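To make the multi-scale decomposition concrete, the snippet below (an illustration, not the LAPGAN code) builds a Laplacian pyramid from a batch of images; in LAPGAN, a conditional convolutional GAN would be trained to generate the residual at each level, conditioned on an upsampled version of the level above.

```python
# Illustrative Laplacian pyramid decomposition (not the LAPGAN implementation).
# Each level stores the high-frequency residual lost when downsampling.
import torch
import torch.nn.functional as F

def laplacian_pyramid(images, levels=3):
    pyramid = []
    current = images
    for _ in range(levels):
        down = F.avg_pool2d(current, kernel_size=2)
        up = F.interpolate(down, scale_factor=2, mode="bilinear", align_corners=False)
        pyramid.append(current - up)   # residual (detail) at this scale
        current = down
    pyramid.append(current)            # coarsest low-resolution image
    return pyramid

x = torch.randn(4, 3, 32, 32)          # a batch of 32x32 RGB images (e.g. CIFAR-10)
levels = laplacian_pyramid(x)          # [32x32, 16x16, 8x8 residuals, 4x4 base]
```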

In addition, Radford et al. [5] proposed a family of network architectures called DCGAN (Deep Convolutional GAN), which allows a pair of deep convolutional generator and discriminator networks to be trained. DCGANs use strided convolutions and fractionally strided convolutions, learning the spatial downsampling and upsampling operators during training. These operators handle changes in sampling rate and location, which are important requirements for mapping from image space to a lower-dimensional latent space, and from image space to the discriminator. Section IV-B covers the DCGAN architecture and training in more detail.
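The sketch below illustrates how fractionally strided (transposed) convolutions let a DCGAN-style generator learn its upsampling; the channel counts and the 32×32 output size are choices made for this example, not the configuration used in [5].

```python
# Illustrative DCGAN-style generator: fractionally strided (transposed)
# convolutions learn the upsampling from a latent vector to an image.
import torch
import torch.nn as nn

latent_dim = 100

G = nn.Sequential(
    # latent vector reshaped to a 1x1 "image" with latent_dim channels
    nn.ConvTranspose2d(latent_dim, 128, kernel_size=4, stride=1, padding=0),  # -> 4x4
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),          # -> 8x8
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),           # -> 16x16
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),            # -> 32x32
    nn.Tanh(),
)

z = torch.randn(16, latent_dim, 1, 1)
images = G(z)                          # shape (16, 3, 32, 32)
```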

As an extension of 2D image synthesis, Wu et al. [14] demonstrated GAN-based synthesis of 3D data samples using volumetric convolutions. Wu et al. [14] synthesized novel objects such as chairs, tables, and cars; in addition, they presented a method for mapping from 2D images to 3D versions of objects.


C. Conditional GANs

Mirza et al. extended the (2D) GAN framework to the conditional setting by making both the generator and the discriminator class-conditional. The advantage of conditional GANs is that they can provide better representations for multi-modal data generation. Parallel to conditional GANs, InfoGAN [16] decomposes the noise source into an incompressible source and a "latent code", and discovers latent factors of variation by maximizing the mutual information between the latent code and the generator's output. The latent code can be used to discover object classes in a purely unsupervised fashion, even though the latent factors are not explicitly labeled. The representations learned by InfoGAN appear to be semantically meaningful and able to disentangle complex factors of variation in images (including pose, lighting, and the emotional content of face images).
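A minimal sketch of the class-conditional idea (layer sizes and dimensions assumed): the class label, encoded as a one-hot vector, is concatenated with the noise vector before the generator's first layer; the discriminator would likewise receive the label alongside the image.

```python
# Minimal sketch of a class-conditional generator: the class label, encoded as
# a one-hot vector, is concatenated with the noise vector z before the first layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, num_classes, image_dim = 100, 10, 28 * 28

G = nn.Sequential(
    nn.Linear(latent_dim + num_classes, 256),
    nn.ReLU(),
    nn.Linear(256, image_dim),
    nn.Tanh(),
)

z = torch.randn(16, latent_dim)
labels = torch.randint(0, num_classes, (16,))
y = F.one_hot(labels, num_classes).float()   # class condition
x_fake = G(torch.cat([z, y], dim=1))         # class-conditional samples
```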


D. GAN inference models

In its original form, a GAN cannot map a given input x back to a vector in the latent space (in the GAN literature this is often referred to as an inference mechanism). Several techniques have been proposed to invert the generator of a pre-trained GAN. Adversarially Learned Inference (ALI) and Bidirectional GANs (BiGANs) provide a simple and effective extension: an inference network is added so that the discriminator jointly examines data space and latent space.

In this form, the generator consists of two networks: an encoder (the inference network) and a decoder. They are trained together to fool the discriminator. The discriminator receives a pair of vectors (x, z) (see Figure 4) and decides whether it contains a real image with its encoding, or a generated image sample with the latent-space input that produced it.
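A minimal sketch of this setup (dimensions and layer sizes assumed): the encoder E maps real images to latent codes, the decoder G maps latent samples to images, and the discriminator scores concatenated (x, z) pairs.

```python
# Minimal sketch of the ALI/BiGAN setup: the discriminator receives a pair (x, z)
# and must decide whether it is (real image, E(x)) or (G(z), sampled z).
import torch
import torch.nn as nn

latent_dim, image_dim = 64, 28 * 28

E = nn.Sequential(nn.Linear(image_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))   # encoder
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, image_dim))   # decoder
D = nn.Sequential(nn.Linear(image_dim + latent_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

x_real = torch.rand(16, image_dim)                   # stand-in for real images
z_fake = torch.randn(16, latent_dim)                 # latent samples for the generator

pair_real = torch.cat([x_real, E(x_real)], dim=1)    # (real image, its encoding)
pair_fake = torch.cat([G(z_fake), z_fake], dim=1)    # (generated image, its latent input)

p_real, p_fake = D(pair_real), D(pair_fake)          # discriminator scores for each pair type
```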

Ideally, in such an encoder-decoder model the output, being a reconstruction of the input, should be similar to the input. In general, however, the fidelity of reconstructed data samples synthesized with ALI/BiGAN is low. The fidelity can be improved by adding an additional adversarial cost function on the data samples and their reconstructions.


E. Adversarial autoencoders (AAE)

An autoencoder is a network composed of an encoder and a decoder. It learns to map data to an internal latent representation and back out again: it maps an image (or other data) from data space to latent space through the encoder, and then from latent space back to data space through the decoder. These two mappings form a reconstruction operation, and both networks are trained until the reconstructed image is as close as possible to the original.
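As an illustration of the reconstruction objective (a sketch, not the AAE reference implementation), the snippet below trains an encoder/decoder pair to minimize reconstruction error; an adversarial autoencoder would additionally train a discriminator on the latent codes to push the encoder's output distribution toward a chosen prior.

```python
# Illustrative autoencoder: encoder maps data -> latent space, decoder maps back,
# and both are trained to minimise the reconstruction error.
# In an adversarial autoencoder, a discriminator on the latent codes would add an
# adversarial loss pushing the encoder's output distribution toward a chosen prior p(z).
import torch
import torch.nn as nn

latent_dim, image_dim = 32, 28 * 28

encoder = nn.Sequential(nn.Linear(image_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, image_dim), nn.Sigmoid())

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
recon_loss = nn.MSELoss()

x = torch.rand(16, image_dim)           # stand-in for a batch of images in [0, 1]
x_recon = decoder(encoder(x))           # reconstruction through the latent bottleneck
loss = recon_loss(x_recon, x)
opt.zero_grad()
loss.backward()
opt.step()
```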

Figure 3. Left: the conditional GAN proposed by Mirza et al. performs class-conditional synthesis; its discriminator judges whether an image is real or fake given the conditioning class. Right: InfoGAN, whose discriminator also estimates the class label.

Figure 4. The ALI/BiGAN structure, consisting of three networks. One of these is a discriminator; another network maps noise vectors from the latent space to the image space (the decoder, denoted G); and the last network (the encoder, denoted E) maps real images from the image space to the latent space.


4. Training of GANs


A. Introduction

Training a GAN involves finding the discriminator parameters that maximize its classification accuracy and the generator parameters that maximally deceive the discriminator. The training process is summarized in Figure 5.

The cost of training is evaluated by a value function V(G, D) that depends on both the generator and discriminator parameters.

The training process can be expressed as follows:
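In the original formulation of Goodfellow et al. (2014), this is the two-player minimax game

min_G max_D V(D, G) = E_{x∼p_data(x)} [log D(x)] + E_{z∼p_z(z)} [log(1 − D(G(z)))],

where p_z(z) is the prior over the latent noise and the expectations are taken over real samples x and latent samples z, respectively.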

During training, the parameters of one model are updated while the parameters of the other are held fixed. Goodfellow et al. show that for a fixed generator there is a unique optimal discriminator, D*(x) = p_data(x) / (p_data(x) + p_g(x)). They also show that the generator G is optimal when p_g(x) = p_data(x), which is equivalent to the optimal discriminator outputting a probability of 0.5 for every sample x. In other words, G is optimal when the discriminator D is maximally confused and cannot distinguish real samples from fake ones.

Ideally, the discriminator would be trained until it is optimal for the current generator, and then the generator would be updated again. In practice, however, the discriminator may not be trained to optimality; a common strategy is to train the discriminator for only a small number of iterations and to update the generator and discriminator in alternation. In addition, generators typically use an alternative, non-saturating training criterion, max_G log D(G(z)), instead of min_G log(1 − D(G(z))).
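A minimal sketch of this alternating scheme with the non-saturating generator loss; the networks, the data placeholder, k, and the learning rates are illustrative assumptions rather than settings from the paper.

```python
# Minimal sketch of alternating GAN updates with the non-saturating generator loss:
# the discriminator is updated k times per generator update, and the generator
# maximises log D(G(z)) instead of minimising log(1 - D(G(z))).
import torch
import torch.nn as nn

latent_dim, image_dim, k = 100, 28 * 28, 1

G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, image_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(image_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def real_batch(n=64):                       # placeholder for a real data loader
    return torch.rand(n, image_dim) * 2 - 1

for step in range(1000):
    # --- update the discriminator k times with the generator fixed ---
    for _ in range(k):
        x_real = real_batch()
        z = torch.randn(x_real.size(0), latent_dim)
        x_fake = G(z).detach()              # freeze generator parameters
        d_loss = bce(D(x_real), torch.ones(x_real.size(0), 1)) + \
                 bce(D(x_fake), torch.zeros(x_fake.size(0), 1))
        opt_D.zero_grad()
        d_loss.backward()
        opt_D.step()

    # --- update the generator with the discriminator fixed ---
    z = torch.randn(64, latent_dim)
    # non-saturating loss: maximise log D(G(z)) == minimise BCE against "real" labels
    g_loss = bce(D(G(z)), torch.ones(64, 1))
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
```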

Despite the existence of a theoretically unique solution, GAN training is challenging and often unstable, for several reasons. One way to improve the effectiveness of GAN training is to watch for empirical symptoms that may arise during training, including:


  • The two models (generator and discriminator) fail to converge [5];
  • The generator "collapses", producing similar samples for different inputs [25];
  • The discriminator's loss quickly converges to zero, leaving no gradient signal strong enough to keep updating the generator.

Figure 5. The main loop of GAN training. New data samples x′ are produced by passing random noise samples z through the generator network. The discriminator may be updated k times before the generator is updated once.


B. Training tricks

The first significant improvement in training GANs for image generation was the DCGAN architecture proposed by Radford et al. [5]. That work was a further exploration of CNN architectures previously used in computer vision, and it led to a set of guidelines for building and training both generators and discriminators. Section III-B mentioned the importance of strided and fractionally strided convolutions [27], which are key components of the architectural design. They allow the generator and discriminator to learn good upsampling and downsampling operations, which may improve the quality of image synthesis. Specifically for training, the authors recommend using batch normalization [28] in both networks to stabilize training of deep models. Another recommendation is to minimize the number of fully connected layers, which improves the trainability of deep models. Finally, Radford et al. [5] observed that using Leaky ReLU activations in the intermediate layers of the discriminator performs better than using regular ReLUs.
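Complementing the earlier generator sketch, the discriminator below follows these guidelines in spirit (channel counts and image size assumed): strided convolutions perform the downsampling, batch normalization is applied in the hidden layers, Leaky ReLU replaces ReLU, and no fully connected layers are used.

```python
# Illustrative DCGAN-style discriminator for 32x32 RGB images: strided convolutions
# downsample, batch normalisation stabilises the hidden layers, LeakyReLU replaces
# ReLU, and the final strided convolution plays the role of a fully connected layer.
import torch
import torch.nn as nn

D = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1),    # 32x32 -> 16x16
    nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),   # 16x16 -> 8x8
    nn.BatchNorm2d(64),
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),  # 8x8 -> 4x4
    nn.BatchNorm2d(128),
    nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, kernel_size=4, stride=1, padding=0),   # 4x4 -> 1x1 score
    nn.Sigmoid(),
)

x = torch.randn(16, 3, 32, 32)
p_real = D(x).view(-1, 1)        # probability that each image is real
```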

Salimans et al. [25] further proposed heuristic approaches to stabilize GAN training. First, feature matching slightly changes the generator's objective in order to increase the amount of available information: the discriminator is still trained to distinguish real from fake samples, but the generator is now trained to match the expected intermediate activations (features) of the discriminator on fake and real samples. Second, mini-batch discrimination adds an extra input to the discriminator that encodes the distances between a given sample in a mini-batch and the other samples. The goal is to prevent mode collapse, because the discriminator can then easily tell when the generator is producing the same outputs.
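A sketch of the feature-matching objective only (mini-batch discrimination is not shown); the split of the discriminator into intermediate feature layers and a final head, and all layer sizes, are assumptions made for this illustration.

```python
# Sketch of feature matching: instead of fooling the final classifier output, the
# generator matches the discriminator's expected intermediate features on real vs.
# generated data. D_features stands in for the discriminator's hidden layers.
import torch
import torch.nn as nn

latent_dim, image_dim = 100, 28 * 28

G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, image_dim), nn.Tanh())
D_features = nn.Sequential(nn.Linear(image_dim, 256), nn.LeakyReLU(0.2))   # intermediate layers
D_head = nn.Sequential(nn.Linear(256, 1), nn.Sigmoid())                    # final real/fake output

x_real = torch.rand(64, image_dim) * 2 - 1
z = torch.randn(64, latent_dim)

f_real = D_features(x_real).mean(dim=0)            # expected features on real data
f_fake = D_features(G(z)).mean(dim=0)              # expected features on generated data
g_loss = ((f_real.detach() - f_fake) ** 2).sum()   # generator's feature-matching loss
g_loss.backward()                                  # only the generator optimiser would be stepped
```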

A third heuristic technique, which helps convergence towards an equilibrium, is heuristic averaging, which penalizes the network parameters if they deviate from the running average of their previous values. The fourth technique is virtual batch normalization, which reduces the dependence of each sample on the other samples in its mini-batch by computing the batch statistics from a reference mini-batch chosen at the beginning of training.
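A sketch of the heuristic (historical) averaging penalty; the decay rate, penalty weight, and stand-in model are arbitrary choices, and virtual batch normalization is not shown.

```python
# Sketch of heuristic (historical) averaging: penalise parameters that drift away
# from the running average of their own past values.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                    # stand-in for G or D
history = [p.detach().clone() for p in model.parameters()]  # running averages
decay, penalty_weight = 0.999, 0.1

def historical_average_penalty():
    penalty = 0.0
    for p, avg in zip(model.parameters(), history):
        penalty = penalty + ((p - avg) ** 2).sum()
        avg.mul_(decay).add_(p.detach(), alpha=1 - decay)   # update the running average
    return penalty_weight * penalty

loss = model(torch.randn(8, 10)).mean() + historical_average_penalty()
loss.backward()
```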

Finally, one-sided label smoothing smooths the classification boundary of the discriminator by replacing the discriminator's targets of 1 with 0.9, preventing the discriminator from becoming over-confident and providing poor gradients to the generator. Sønderby et al. [29] improved on this idea by adding noise to the samples before feeding them to the discriminator, thereby challenging it. Sønderby et al. [29] argue that one-sided label smoothing biases the optimal discriminator, whereas their technique, instance noise, moves the manifolds of the real and fake samples closer together and prevents the discriminator from easily finding a decision boundary that completely separates them. In practice, this technique can be implemented by adding Gaussian noise to both the synthesized and the real images, with a standard deviation that gradually decreases over time. Arjovsky et al. [26] later formalized this process of adding noise to data samples to stabilize training.
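A sketch combining one-sided label smoothing (real targets set to 0.9) with instance noise (Gaussian noise, with a decaying standard deviation, added to both real and generated images before the discriminator); the networks and the linear noise schedule are placeholder assumptions.

```python
# Sketch of one-sided label smoothing and instance noise in the discriminator update.
# Real targets are 0.9 instead of 1.0; Gaussian noise with a decaying standard
# deviation is added to both real and generated images before the discriminator.
import torch
import torch.nn as nn

image_dim, latent_dim = 28 * 28, 100
D = nn.Sequential(nn.Linear(image_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, image_dim), nn.Tanh())
bce = nn.BCELoss()

def d_loss(x_real, step, total_steps=10000, sigma0=0.1):
    sigma = sigma0 * (1 - step / total_steps)                   # noise level decays over training
    z = torch.randn(x_real.size(0), latent_dim)
    x_fake = G(z).detach()
    x_real_noisy = x_real + sigma * torch.randn_like(x_real)    # instance noise on real images
    x_fake_noisy = x_fake + sigma * torch.randn_like(x_fake)    # instance noise on fake images
    real_targets = torch.full((x_real.size(0), 1), 0.9)         # one-sided label smoothing
    fake_targets = torch.zeros(x_fake.size(0), 1)               # fake targets stay at 0
    return bce(D(x_real_noisy), real_targets) + bce(D(x_fake_noisy), fake_targets)

loss = d_loss(torch.rand(64, image_dim) * 2 - 1, step=0)
```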


C. Modified cost functions

1) Generalization of the GAN cost function: Nowozin et al. [30] showed that GAN training can be generalized to minimize not only the Jensen-Shannon divergence but an estimate of any f-divergence; these are called f-GANs.

2) Alternative cost functions to prevent vanishing gradients: Arjovsky et al. [32] proposed WGAN, a GAN with an alternative cost function derived from an approximation of the Wasserstein distance.
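A sketch of the WGAN critic update with the weight clipping used in [32]; the network sizes, clipping constant, and optimizer settings are illustrative assumptions. The critic has no sigmoid output, and its objective approximates the Wasserstein distance between the real and generated distributions.

```python
# Sketch of the WGAN critic update: the critic outputs an unbounded score (no sigmoid),
# its loss approximates the (negative) Wasserstein distance between real and generated
# distributions, and the weights are clipped to enforce the Lipschitz constraint.
import torch
import torch.nn as nn

latent_dim, image_dim, clip_value = 100, 28 * 28, 0.01

G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, image_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(image_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

x_real = torch.rand(64, image_dim) * 2 - 1
z = torch.randn(64, latent_dim)

# Critic maximises E[critic(x_real)] - E[critic(G(z))]; we minimise the negative.
c_loss = critic(G(z).detach()).mean() - critic(x_real).mean()
opt_c.zero_grad()
c_loss.backward()
opt_c.step()

for p in critic.parameters():                 # weight clipping (Lipschitz constraint)
    p.data.clamp_(-clip_value, clip_value)

# The generator would then be trained to minimise -critic(G(z)).mean().
```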


7. Discussion


A. Open questions

1) Mode collapse: As described in Section 4, a common problem with GANs is the generator collapsing, producing only a small family of similar samples (partial collapse) or, in the worst case, a single sample (complete collapse) [26], [48].

2) Training instability, saddle points: In a GAN, the Hessian of the loss function is indefinite rather than positive definite, so the best that can be found is a saddle point rather than a local minimum.


B. Conclusion

GANs are of interest not only because they can learn deep, highly nonlinear mappings from a latent space to a data space (and their inverses), but also because they can make use of large amounts of unlabeled image data, in a way that parallels deep representation learning. In GAN training, there are many opportunities for theoretical and algorithmic development, and with the help of deep networks there are plenty of opportunities for new applications.


Selected from arXiv

Compiled by Heart of the Machine

Contributors: Lu Xue, Liu Xiaokun, Jiang Siyuan


This article was compiled by Heart of the Machine. Please contact this official account for authorization before reprinting.