Congratulations to the Toutiao AI Lab on its article "From Generative Adversarial Networks to More Automated Artificial Intelligence", published in the 9th issue (2017) of Communications of the China Computer Federation (CCCF)!
It is reproduced here to share with readers who love technology.
"What I cannot create, I do not understand." This is a famous saying of the physicist Feynman. Put in the context of artificial intelligence, it means that for a machine to truly understand something, it has to learn how to create it. In recent years, deep learning research has been in full bloom; in computer vision in particular, deep learning algorithms have surpassed human-level performance in applications such as face recognition and object classification. So are machines really smart enough to understand images? To paraphrase Feynman, if a machine learning algorithm can produce realistic images, then it should really understand them, right? However, machine learning algorithms have long fallen far short of this ideal: the images they could generate were blurry and lacking in detail. Fortunately, with the emergence of the Generative Adversarial Network (GAN), generative models have enjoyed a wonderful spring.
What is a generative adversarial network
The generative adversarial network was proposed by Ian Goodfellow in 2014. The first problem he wanted to solve was how to generate high-quality artificial data to compensate for the lack of real data; it originally had nothing to do with "adversaries". The challenge was how to measure the quality of the generated data. The straightforward approach is to require the generated data to be as similar as possible to the real data, that is, to use the L1 or L2 norm as the loss function, or simply to interpolate between two real samples to produce a new one. But Goodfellow found that none of these methods worked well, so he came up with the idea of training a neural network to judge whether a generated sample is good or bad. He used one network, the generator, to produce fake samples, and another, the discriminator, to distinguish real samples from fake ones. This is the prototype of GAN. During training, the generator tries to fool the discriminator, while the discriminator tries to learn how to correctly distinguish real samples from fake ones, so the two form an adversarial relationship, hence the name generative adversarial network. Subsequently, papers on GAN mushroomed.
Specifically, a GAN consists of two independent neural networks: a generator and a discriminator (see Figure 1). The generator's task is to sample a noise vector z from a uniform distribution and output a synthetic sample G(z); the discriminator takes either a real sample x or a synthetic sample G(z) as input and outputs the probability that the input is "real".
Figure 1 The general structure of a generative adversarial network
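As a concrete illustration, here is a minimal sketch of the two networks, written in PyTorch (an assumption of ours; the article does not prescribe a framework, and all layer sizes below are purely illustrative):

```python
import torch
import torch.nn as nn

NOISE_DIM, DATA_DIM = 100, 784  # illustrative sizes, e.g. 28x28 images flattened into vectors

# Generator: maps a noise vector z to a synthetic sample G(z)
generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, DATA_DIM),
    nn.Tanh(),  # outputs in [-1, 1], matching real data normalized to that range
)

# Discriminator: maps a (real or synthetic) sample to the probability that it is "real"
discriminator = nn.Sequential(
    nn.Linear(DATA_DIM, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
    nn.Sigmoid(),  # probability that the input sample is real
)

z = torch.rand(16, NOISE_DIM)  # noise sampled from a uniform distribution
fake = generator(z)            # G(z): a batch of synthetic samples
p_real = discriminator(fake)   # D(G(z)): the discriminator's "real" probability for each sample
```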
The objective function of GAN is shown in Formula 1. D(x) is the probability the discriminator assigns to a real sample x being real, and 1 - D(G(z)) is the probability it assigns to a synthetic sample being fake. Taking logarithms and summing the two terms yields Formula 1. When training a GAN, the discriminator tries to maximize the objective function, that is, to maximize the probability of judging real samples as "real" and synthetic samples as "fake". The generator, in contrast, tries to minimize the objective function, that is, to reduce the probability that the discriminator is correct about the source of the data.
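Written out explicitly, this is the standard minimax objective from the original GAN paper, where p_data is the real data distribution and p_z is the noise distribution:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```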
Formula 1 The objective function of a generative adversarial network
While the adversarial mechanism neatly frees us from designing complex objective functions, GAN also has its problems. In practice, because generator updates depend on the discriminator, a poorly trained discriminator leads to a poorly trained generator. To mitigate this, we often update the discriminator several times for every single update of the generator. Even so, GAN training remains very unstable, and the generated data is still not as diverse as the real samples, a phenomenon known as "mode collapse". Later, through a series of theoretical studies, researchers found that the main cause of training instability and mode collapse was that the distance measure between the real distribution and the generated distribution used in the original GAN was not appropriate. They therefore proposed smoother measures to replace it, leading to methods such as WGAN-GP [1]. The mathematical derivation is rather involved; interested readers can refer to the original paper.
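The alternating update described above might look roughly like the following sketch (again assuming PyTorch; the number of discriminator steps, the optimizers, and the use of the commonly seen non-saturating generator loss are illustrative choices, not the article's prescription):

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()
K_DISC_STEPS = 5  # illustrative: several discriminator updates per generator update

def train_step(generator, discriminator, opt_g, opt_d, real_batch, noise_dim=100):
    batch_size = real_batch.size(0)
    ones = torch.ones(batch_size, 1)
    zeros = torch.zeros(batch_size, 1)

    # 1. Update the discriminator: maximize log D(x) + log(1 - D(G(z)))
    for _ in range(K_DISC_STEPS):
        z = torch.rand(batch_size, noise_dim)
        fake = generator(z).detach()  # do not backpropagate into the generator here
        d_loss = bce(discriminator(real_batch), ones) + bce(discriminator(fake), zeros)
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

    # 2. Update the generator once, using the common non-saturating form:
    #    make the discriminator label the fakes as "real"
    z = torch.rand(batch_size, noise_dim)
    g_loss = bce(discriminator(generator(z)), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```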
What changes does GAN bring
It has only been three years since GAN was proposed, but there is already a huge number of papers about it. Looking at many papers in computer vision, we find that simply adding a discriminator on top of an old method and introducing the adversarial mechanism often achieves better results than the original. However, there is still no unified opinion or complete explanation of why GAN achieves better results. Take image generation as an example. A common explanation is that the loss functions based on the L1 or L2 norm used in earlier generative models pay too much attention to a pixel-by-pixel correspondence between the generated sample and the real sample; because the error is averaged over every pixel, the generated images end up blurred. Although the L1 norm favors sparser solutions than the L2 norm and can in theory yield sharper images, the quality of the generated images is still far from ideal.
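A toy demonstration of why averaging pixel errors favors blurry outputs: when two sharp images are equally plausible targets, the single prediction that minimizes the expected L2 error is their pixel-wise mean, which is blurred. The 1-D "images" below are purely illustrative:

```python
import torch

# Two equally plausible sharp targets for the same input: an "edge"
# located at slightly different positions (purely illustrative data).
img_a = torch.tensor([0., 0., 1., 1., 1., 1.])
img_b = torch.tensor([0., 0., 0., 0., 1., 1.])

candidates = {
    "pixel-wise mean (blurry)": (img_a + img_b) / 2,
    "one sharp guess": img_a,
}
for name, pred in candidates.items():
    # Expected MSE when each target is equally likely
    mse = 0.5 * ((pred - img_a) ** 2).mean() + 0.5 * ((pred - img_b) ** 2).mean()
    print(f"{name}: expected MSE = {mse.item():.3f}")
# The blurry mean attains the lower expected MSE, which is why pixel-wise
# losses favor blurry outputs whenever the target is ambiguous.
```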
So what is so special about GAN? Just as Goodfellow's original inspiration was to design a neural network to judge how well a sample is generated, the biggest advantage of GAN comes from the introduction of the discriminator, which lets us devise objective functions for tasks that are difficult to measure directly with mathematical formulas. For example, style transfer is a hot topic in computer vision: converting a selfie into a cartoon style, or converting a landscape photo into an image that fits a particular artist's style. In this kind of application it is difficult to measure directly with a mathematical formula whether the style of the generated image is right, and a norm-based loss function would require a large amount of paired annotated data. With GAN, however, we can let the discriminator learn to judge the style of an image. Since the discriminator learns the style of the image rather than the objects in it, it only needs training images that share the same style, regardless of what they depict. The same holds in natural language processing. In a chatbot, for example, a norm-based loss function may make the bot produce fixed answers to certain questions, which is a symptom of overfitting. In fact, there are many ways to express the same meaning, and if we want to teach machines to express the same meaning in different ways, we can use GAN. The greatest charm of GAN therefore lies in the universality of this adversarial mechanism, and it is believed that GAN will bring breakthroughs to many research fields with complex objective functions.
While the adversarial mechanism brings us a lot of convenience, it also has significant drawbacks. GANs are not very good at many complex image generation tasks, for example often producing distorted human bodies. In contrast, norm-based methods can guarantee a basically correct object contour even when generating complex images. This difference mainly comes from the fact that a norm-based objective function operates on each pixel, matching the generated image to the real image pixel by pixel, which effectively constrains the contour of the object. GAN, on the other hand, judges the image as a whole and does not constrain the contours. Therefore, when GAN is used as a model for image generation, an L1 or L2 norm loss is usually added when training the generator. Experimental results show that this indeed works better than using the adversarial objective alone.
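In code, such a combined generator objective might be sketched as follows (assuming PyTorch; the weighting constant and the networks are illustrative, and the heavy weighting of the L1 term follows common practice rather than any specific model discussed here):

```python
import torch
import torch.nn as nn

adv_criterion = nn.BCELoss()
l1_criterion = nn.L1Loss()
LAMBDA_L1 = 100.0  # illustrative weight balancing pixel fidelity against adversarial realism

def generator_loss(discriminator, fake_image, target_image):
    # Adversarial term: the discriminator should label the generated image as "real"
    pred = discriminator(fake_image)
    adv_loss = adv_criterion(pred, torch.ones_like(pred))
    # Pixel-wise L1 term: keeps overall layout and contours close to the target image
    pixel_loss = l1_criterion(fake_image, target_image)
    return adv_loss + LAMBDA_L1 * pixel_loss
```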
In addition, GAN is really a framework that unifies generative and discriminative models. We usually use the generator in GAN to produce realistic data, so we tend to refer to GAN as a generative model. In fact, its discriminator is itself a discriminative model, and it can show its value in certain specific tasks. IRGAN, the best paper of the recent SIGIR conference, uses GAN to combine generative and discriminative models in information retrieval. In that field there are traditionally two schools, generative models and discriminative models. A generative model, given a query, outputs the document that best matches it; a discriminative model takes a query-document pair as input and outputs how well the two match. As can be seen, GAN's framework directly links the two: the output of the generative model is the input of the discriminative model, and the output of the discriminative model drives the learning of the generative model.
GAN applications in computer vision
At present, the most basic use of GAN is to generate images that look real. Image generation tasks fall mainly into two kinds: the first is to generate images of a certain class, and the second is to generate images according to a user's description. The first kind has already achieved good results. For example, the PPGN [2] model published in 2016 achieved state-of-the-art visual quality (see Figure 2); its generated volcano images are good enough to pass for real ones at a glance. For the second kind, generating images from descriptions, current results are not yet satisfactory. The difficulty is that, unlike a human drawing step by step, the generator does not learn how to draw each object and then compose them; instead it tries to produce the entire image in one shot. GAN does a good job of generating images of single objects from text, but does much worse on complex images containing multiple objects, sometimes producing images whose content is hard even to identify. As Figure 3 shows, the images generated from text are reasonable, but the two images on the right of the first row would be confusing without reading the accompanying text. In text-to-image generation there is therefore still a lot of room for research.
Figure 2 Volcano images generated by the PPGN model [2]
Figure 3 Images generated from text by the PPGN model [2]
Another popular application is image-to-image translation, of which image style transfer is just one small category. Image translation covers many things, such as converting a summer scene into a winter one, filling the outline of an object with colored details and textures, automatically retouching a photo taken with a mobile phone so that it looks like it was taken with an SLR camera, and so on. The CycleGAN [3] model published at the beginning of this year achieves good performance on unannotated, unpaired data sets by combining the adversarial mechanism, a norm-based loss function, and a dual-model design similar to dual learning. Figure 4 shows examples of CycleGAN's image translation, in which the mutual conversion between ordinary horses and zebras is particularly impressive. Image translation also includes face editing, which can turn an unsmiling face into a smiling one, put on or take off glasses, change hair styles, and so on. GAN can thus be expected to be combined with many mobile photo-editing apps to further enrich their functionality and fun. Figure 5 shows examples of the IcGAN [4] model transforming a face from black hair to blonde, from straight hair to curly hair, from a smile to a grin, and even changing gender. One has to admire GAN's almost uncanny power.
Figure 4 Examples of CycleGAN's image translation [3]
Figure 5 Examples of IcGAN's face transformations [4]
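The extra ingredient that makes CycleGAN work on unpaired data is the cycle-consistency loss, which can be sketched as follows (assuming PyTorch; G maps domain X to domain Y, F maps Y back to X, and the weight is illustrative):

```python
import torch.nn as nn

l1 = nn.L1Loss()
LAMBDA_CYC = 10.0  # illustrative weight for the cycle-consistency term

def cycle_consistency_loss(G, F, real_x, real_y):
    # Translate X -> Y -> X and Y -> X -> Y; each image should map back to itself,
    # which constrains the two generators even without paired training data.
    rec_x = F(G(real_x))
    rec_y = G(F(real_y))
    return LAMBDA_CYC * (l1(rec_x, real_x) + l1(rec_y, real_y))
```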
GAN can also be used for image description, that is, generating a detailed description of an image. This task again shows GAN's advantage: when we describe the content of an image, different people focus on different things and produce different descriptions, whereas previous models could only generate one fixed description for a given picture. The RTT-GAN [5] model proposed in a recent paper realizes this ability to "look at a picture and write about it": given a picture and brief descriptions of some objects in it, the model generates a coherent paragraph, and users can control which objects are described first, thereby experiencing different viewing perspectives through different descriptions, as shown in Figure 6.
Figure 6 Example of RTT-GAN describing an image in a paragraph [5]
GAN applications in natural language processing
Generating text is much harder than generating images, so GAN's performance in natural language processing is not yet ideal. For images, if the gradient returned by the discriminator says "add 1", we can update the pixel values directly and the change still makes sense in the image. But in natural language, what does "a word plus one" mean? Perhaps we could decree that "iPhone6" + 1 = "iPhone7", but for most words no such rule exists. Even with word vectors, the vectors are sparsely distributed in a high-dimensional space; adding or subtracting a small amount from a word vector is unlikely to land on any other word vector, so such updates do not help the generator learn. Then how did the image-description model we just discussed manage to generate text?
To train GAN to generate text, we have to turn to the policy gradient method from reinforcement learning. This is an approach based on Monte Carlo search: before generating the next word, the model outputs a multinomial probability distribution over the vocabulary. Given a search width d, d words are sampled from this distribution, and so on, until the generated sequence reaches the length limit N or an end-of-sentence token appears. We then feed these sentences into the discriminator to obtain scores, and use the gradients of these scores to update each of the generated multinomial distributions and, through backpropagation, the parameters of the network that produced them. This scheme of using policy gradients in GAN first appeared in last year's SeqGAN model, and subsequent research on applying GAN to natural language processing has largely followed it.
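A minimal sketch of such a policy-gradient update is given below (assuming PyTorch; the tiny recurrent generator, the vocabulary size, and the assumption that the discriminator maps a batch of token sequences to "real" probabilities are all illustrative, and this simplified version scores only whole finished sentences rather than performing the full Monte Carlo roll-outs used in SeqGAN):

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM, HID_DIM, MAX_LEN = 5000, 64, 128, 20  # illustrative sizes

class TextGenerator(nn.Module):
    """Tiny recurrent generator: at each step it outputs a multinomial
    distribution over the vocabulary and samples the next word from it."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.cell = nn.GRUCell(EMB_DIM, HID_DIM)
        self.out = nn.Linear(HID_DIM, VOCAB_SIZE)

    def sample(self, batch_size):
        h = torch.zeros(batch_size, HID_DIM)
        word = torch.zeros(batch_size, dtype=torch.long)  # start-of-sentence token, id 0
        tokens, log_probs = [], []
        for _ in range(MAX_LEN):
            h = self.cell(self.embed(word), h)
            dist = torch.distributions.Categorical(logits=self.out(h))
            word = dist.sample()                  # sample the next word
            tokens.append(word)
            log_probs.append(dist.log_prob(word))
        return torch.stack(tokens, dim=1), torch.stack(log_probs, dim=1)

def policy_gradient_step(generator, discriminator, opt_g, batch_size=16):
    sentences, log_probs = generator.sample(batch_size)
    with torch.no_grad():
        reward = discriminator(sentences)  # assumed to return one "real" probability per sentence
    # REINFORCE: weight each sentence's log-probability by its discriminator reward
    loss = -(log_probs.sum(dim=1) * reward.view(-1)).mean()
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()
```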
Papers applying GAN to natural language processing mainly involve human-machine dialogue, machine translation, question answering, and so on.
GAN is a big step towards artificial intelligence
Mankind's pursuit of artificial intelligence has never stopped. From traditional machine learning to today's deep learning, humans have been able to create "intelligent agents" with self-decision-making ability in certain tasks, such as the well-known Go master AlphaGo. In the era of traditional machine learning, people needed to carefully design how to extract useful features from data, design objective functions for specific tasks, and use general optimization algorithms to build machine learning systems. With the rise of deep learning, people no longer rely on hand-crafted features and instead let neural networks learn useful features automatically. However, for specific tasks we still need to design specific objective functions to constrain the direction of learning, otherwise the learned model may not do what we originally intended. Since the arrival of generative adversarial networks, in many scenarios we no longer need elaborate objective functions either, and are better off letting the discriminator learn them on its own. As Figure 7 shows, going from traditional machine learning to deep learning automated feature learning, and the emergence of generative adversarial networks further automates the learning of objective functions. In addition, learn2learn [6], an approach proposed by Google last year that uses neural networks to let deep learning models learn how to optimize themselves, also automates the final optimization step. In theory, then, artificial intelligence algorithms could fully automate the path from input to output. The next development should be algorithms that automatically design neural network architectures for different tasks. Let's wait and see.
Figure 7 From traditional machine learning to automated artificial intelligence algorithms
References:
[1] Gulrajani I, Ahmed F, Arjovsky M, et al. Improved training of Wasserstein GANs[OL]. (2017). arXiv preprint arXiv:1704.00028.
[2] Nguyen A, Yosinski J, Bengio Y, et al. Plug & play generative networks: Conditional iterative generation of images in latent space[OL]. (2016). arXiv preprint arXiv:1612.00005.
[3] Zhu J Y, Park T, Isola P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks[OL]. (2017). arXiv preprint arXiv:1703.10593.
[4] Perarnau G, van de Weijer J, Raducanu B, et al. Invertible conditional GANs for image editing[OL]. (2016). arXiv preprint arXiv:1611.06355.
[5] Liang X, Hu Z, Zhang H, et al. Recurrent topic-transition GAN for visual paragraph generation[OL]. (2017). arXiv preprint arXiv:1703.07022.
[6] Andrychowicz M, Denil M, Gomez S, et al. Learning to learn by gradient descent by gradient descent[C]//Proceedings of the Advances in Neural Information Processing Systems. 2016: 3981-3989.
About the authors:
Huang He
PhD candidate at the University of Illinois at Chicago, USA, and intern at an artificial intelligence laboratory. Research interests: deep learning and data mining.
Wang Changhu
Senior member of CCF and member of the CCCF editorial board. Scientist and technical director at the Toutiao Artificial Intelligence Laboratory. Research interests: computer vision, multimedia analysis, and machine learning.