This article was originally published by AI Frontier.
Book giveaway | AI as illustrator: how to generate anime faces with a PyTorch-based generative adversarial network?


Author | Chen Yun


Edit | Natalie

“2016 was the year of TensorFlow, which made headlines thanks to a big push from Google. In early 2017, PyTorch came out of the blue and attracted a great deal of attention from researchers. PyTorch’s simple and elegant design, unified and easy-to-use interface, speed, and surprising flexibility left many impressed.

This article is adapted from chapter 7 of PyTorch Introduction and Practice. It explains the currently popular generative adversarial network (GAN) and guides readers to implement, from scratch, a generator that can produce anime avatars in a variety of styles. Note: there is a book giveaway at the end of the article!”


Generative Adversarial Networks (GAN) are a hot topic in deep learning. Yann LeCun, the father of convolutional networks and a veteran of deep learning, once said that “GAN is the most interesting idea in the last 10 years in machine learning.” In the past two years in particular, there has been an explosion of GAN papers. Someone on GitHub has collected the various GAN variants, applications, and research papers, a list that already runs to several hundred titles. That author also plotted the number of GAN papers published over time, shown in Figure 7-1, which illustrates how popular GAN has become.

Figure 7-1 Monthly accumulation of GAN papers


Introduction to the principle of GAN

In his classic 2014 paper Generative Adversarial Networks, Ian Goodfellow, often referred to as the “father of GAN,” proposed generative adversarial networks and designed the first GAN experiment: handwritten digit generation.

GAN was born out of a brilliant idea:

“What I cannot create, I do not understand.”


– Richard Feynman

Similarly, if deep learning cannot create images, it does not really understand them either. At the time, deep learning had already made inroads into all kinds of computer vision tasks and was achieving breakthroughs in almost every one of them. However, people have long questioned the black-box nature of neural networks, so more and more researchers have explored the features, and combinations of features, learned by convolutional networks from the perspective of visualization, while GAN demonstrates the power of neural networks from the perspective of generative learning. GAN tackles a well-known problem in unsupervised learning: given a set of samples, train a system to generate similar new samples.

Figure 7-2 shows the structure of a generative adversarial network, which consists of the following two sub-networks.

  • Generator: takes random noise as input and generates an image.
  • Discriminator: judges whether an input image is a real image or a generated (fake) one.

Figure 7-2 Structure of a generative adversarial network

When training the discriminator, we need both fake images produced by the generator and real images from the real world. When training the generator, only noise is used to generate fake images; the discriminator evaluates the quality of these fakes, and its feedback drives the generator to adjust its parameters.

The goal of the generator is to produce images that look as real as possible, so that the discriminator believes they are genuine. The goal of the discriminator is to tell the generator’s images apart from real-world images. The two therefore have opposite goals and are pitted against each other during training, which is why the model is called a generative adversarial network.

The description above may be a bit abstract, so let us illustrate it with the example of a painting collector and a fake-painting dealer who both deal in Qi Baishi’s works (one of Qi Baishi’s works is shown in Figure 7-3). The fake dealer is the generator: he hopes to imitate the master’s originals and forge paintings that look like the real thing, so as to deceive the collector and sell them at a high price. The collector wants to separate fakes from originals, keeping the originals and destroying the forgeries. Here the counterfeiter and the collector mainly trade in Qi Baishi’s shrimp paintings. Qi Baishi’s shrimp are considered masterpieces of the painting world and have long been sought after.

Figure 7-3 Qi Baishi’s shrimp painting

In this example, both the fake dealer and the collector are novices at first, and their notions of what is authentic and what is forged are vague. The fake paintings produced by the dealer are little more than random scribbles, and the collector’s eye is so poor that he takes many fakes for originals and many originals for fakes.

First, the collector gathers a large number of fakes from the market along with Qi Baishi’s originals. After careful study and comparison, he initially learns the structure of the shrimp in the paintings and understands that the creature is curved in shape and has a pair of claw-like pincers; all fake paintings that do not meet this condition are filtered out. Once the collector applies this standard in the market, fakes basically cannot fool him any more, and the fake dealer suffers heavy losses. However, a few of the dealer’s own fakes do happen to have a curved shape and a pair of pincers, so the dealer adjusts his method: every forgery now gets a curved body and a pair of claw-like “pincers,” while everything else, such as colors and lines, is still drawn at random. The first batch of forgeries produced by the dealer is shown in Figure 7-4.

Figure 7-4 The first batch of forgeries produced by the fake dealer

When the dealer put these paintings on the market, they easily fooled the collector: a curved creature with a pair of pincers met his criterion for authenticity, so he bought them as if they were genuine. As time went on, the collector bought more and more fakes and lost heavily, so he shut his doors to study the differences between the fakes and the originals. After repeated comparison, he found that in Qi Baishi’s original shrimp paintings, besides the curved shape, the shrimp have long antennae and translucent bodies, and are drawn in rich detail, with each segment of the body clearly separated.

Having learned his lesson, the collector returns to the market. The dealer’s counterfeiting technique, however, has not improved, and his forgeries are now easily detected. The dealer starts experimenting with different shrimp paintings, mostly in vain, but among the many attempts a few forgeries do fool the collector. The dealer finds that these fakes have long antennae, translucent bodies, and richly detailed shrimp, as shown in Figure 7-5. He begins to copy them in large numbers and sell them on the market, and many of them successfully deceive the collector.

Figure 7-5 The second batch of forgeries produced by the fake dealer

Once again the collector suffers heavy losses and is forced to close his doors to study the differences between Qi Baishi’s originals and the forgeries, learning the characteristics of the originals and improving his eye. In this way, through the game between the collector and the fake dealer, the collector gradually improves, from scratch, his ability to distinguish authentic paintings from fakes, while the dealer constantly improves his ability to imitate Qi Baishi’s originals. The collector sharpens his appreciation of Qi Baishi’s shrimp paintings by comparing the dealer’s fakes against the real ones, and the dealer keeps raising the quality of his forgeries; even though the final products are still fakes, they are very close to the originals. The collector and the dealer play a game against each other and, at the same time, constantly push each other to learn and progress, so that both improve together.

In this story, the fake dealer plays the role of the generator and the collector plays the role of the discriminator. Both the generator and the discriminator are poor at first, because both are randomly initialized. Training alternates between two steps. The first step trains the discriminator (only the discriminator’s parameters are modified, the generator is fixed), with the goal of distinguishing authentic works from fakes. The second step trains the generator (only the generator’s parameters are modified, the discriminator is fixed), so that the generated fakes are judged authentic by the discriminator (the collector). These two steps alternate, and both the generator and the discriminator reach a very high level. By the end of training, the shrimp images produced by the generator (shown in Figure 7-6) are almost indistinguishable from genuine Qi Baishi works.

Figure 7-6 Shrimp generated by the generator

Now let us think about the design of the network architecture. The goal of the discriminator is to determine whether an input image is authentic or fake, so it can be regarded as a binary classification network; referring to the Dogs vs. Cats experiment in chapter 6, we can design a simple convolutional network. The goal of the generator is to produce a color image from noise. Here we adopt the widely used Deep Convolutional Generative Adversarial Network (DCGAN) structure, that is, a fully convolutional network, whose structure is shown in Figure 7-7. The input of the network is 100-dimensional noise and the output is a 3×64×64 image. The input can be viewed as a 100×1×1 image, which is gradually enlarged by up-convolutions to 4×4, 8×8, 16×16, 32×32, and 64×64. Up-convolution, also called transposed convolution, is a special convolution operation that can be seen as roughly the inverse of ordinary convolution: when an ordinary convolution has stride 2, the output is downsampled to half the size of the input, while a transposed convolution with stride 2 upsamples the output to twice the size of the input. This kind of upsampling can be understood as follows: the image information is stored in a 100-dimensional vector, and based on that vector the neural network sketches basic information such as contour and tone in the first upsampling steps and gradually fills in the details in later ones. The deeper the network, the richer the detail.

Figure 7-7 Generator network structure in DCGAN
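As a quick sanity check of the size arithmetic above (this snippet is illustrative, not code from the book), a transposed convolution with kernel size 4, stride 2, and padding 1 exactly doubles the spatial resolution:

```python
import torch
import torch.nn as nn

# kernel 4, stride 2, padding 1: H_out = (H_in - 1) * 2 - 2 * 1 + 4 = 2 * H_in
up = nn.ConvTranspose2d(in_channels=8, out_channels=4,
                        kernel_size=4, stride=2, padding=1)
x = torch.randn(1, 8, 16, 16)
print(up(x).shape)  # torch.Size([1, 4, 32, 32])
```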

In DCGAN the structure of the discriminator is symmetric to that of the generator: the generator uses upsampling (transposed) convolutions, while the discriminator uses downsampling convolutions. The generator outputs a 64×64×3 image from the noise, while the discriminator takes a 64×64×3 image as input and outputs a score (probability) that the sample is real.


Using GAN to generate anime avatars

This section uses GAN to implement an example that generates anime character avatars. A blogger on a Japanese technical blog site (presumably an anime fan) trained a DCGAN on 200,000 anime avatars and was able to generate avatars automatically with the program, as shown in Figure 7-8. The original program was implemented with the Chainer framework; in this section we try to implement it with PyTorch.

Figure 7-8 Anime avatars generated by DCGAN

The original images were scraped from a website and the heads were cropped out with OpenCV, which is rather troublesome to reproduce. Here we instead use roughly 50,000 images already extracted and processed by Zhihu user He Zhiyuan. You can download the image archive from the Baidu Netdisk link given in the README.MD of this book’s companion code and extract it into the specified folder. Note that the images here have a resolution of 3×96×96, not the 3×64×64 used in the paper, so the network structure needs to be adjusted accordingly so that the generated images are 96×96.

Let’s first look at the code structure of this experiment.
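The original listing is not reproduced here; a plausible layout (file names other than Model.py and README.MD are assumptions) looks roughly like this:

```
AnimeGAN/
├── main.py      # assumed entry point for training and generation
├── Model.py     # generator (NetG) and discriminator (NetD) definitions
├── data/        # put the downloaded face images into a sub-folder here
└── README.MD    # environment setup and data preparation instructions
```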

Next, let’s look at how the generator is defined in Model.py.
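The original code is not reproduced here. The following is a minimal sketch of such a generator, a guess at the structure rather than the book’s exact code: the class name NetG and the ngf base width are assumptions, while the kernel/stride/padding values follow the sizes described below (100-dimensional noise up to a 3×96×96 image).

```python
import torch.nn as nn

class NetG(nn.Module):
    def __init__(self, opt):
        super(NetG, self).__init__()
        ngf = opt.ngf  # base number of generator feature maps (assumption)
        self.main = nn.Sequential(
            # nz x 1 x 1 noise -> (ngf*8) x 4 x 4
            nn.ConvTranspose2d(opt.nz, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(True),
            # (ngf*8) x 4 x 4 -> (ngf*4) x 8 x 8
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),
            # (ngf*4) x 8 x 8 -> (ngf*2) x 16 x 16
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),
            # (ngf*2) x 16 x 16 -> ngf x 32 x 32
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),
            # ngf x 32 x 32 -> 3 x 96 x 96 (kernel 5, stride 3, padding 1)
            nn.ConvTranspose2d(ngf, 3, 5, 3, 1, bias=False),
            nn.Tanh(),  # normalize output pixels to -1 ~ 1
        )

    def forward(self, x):
        return self.main(x)
```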

As can be seen, the generator is relatively simple to build: nn.Sequential is used to chain operations such as transposed convolution, batch normalization, and activation. Pay attention to the use of the transposed convolution ConvTranspose2d. When the kernel size is 4, the stride is 2, and the padding is 1, then according to the formula H_out = (H_in - 1) × stride - 2 × padding + kernel_size, the output is exactly twice the size of the input. In the last layer the kernel size is 5, the stride is 3, and the padding is 1, in order to upsample 32×32 to 96×96; this is the image size used in this example, which differs from the 64×64 used in the paper. The last layer uses Tanh to normalize the output pixels to -1~1; if you want to normalize to 0~1, use Sigmoid instead. Next, let’s look at the network structure of the discriminator.
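The book’s discriminator code is likewise omitted here; a minimal sketch that mirrors the generator above (the class name NetD and the ndf base width are assumptions) could look like this:

```python
import torch.nn as nn

class NetD(nn.Module):
    def __init__(self, opt):
        super(NetD, self).__init__()
        ndf = opt.ndf  # base number of discriminator feature maps (assumption)
        self.main = nn.Sequential(
            # 3 x 96 x 96 -> ndf x 32 x 32 (kernel 5, stride 3, padding 1)
            nn.Conv2d(3, ndf, 5, 3, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # ndf x 32 x 32 -> (ndf*2) x 16 x 16
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2),
            nn.LeakyReLU(0.2, inplace=True),
            # (ndf*2) x 16 x 16 -> (ndf*4) x 8 x 8
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4),
            nn.LeakyReLU(0.2, inplace=True),
            # (ndf*4) x 8 x 8 -> (ndf*8) x 4 x 4
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 8),
            nn.LeakyReLU(0.2, inplace=True),
            # (ndf*8) x 4 x 4 -> a single score, squashed to 0 ~ 1
            nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.main(x).view(-1)
```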

As can be seen, the discriminator and the generator are almost symmetric in structure, from the kernel sizes down to the padding and stride settings. For example, the last convolution layer of the generator uses (kernel_size, stride, padding) = (5, 3, 1), and so does the first convolution layer of the discriminator. Also note that the generator uses ReLU as its activation function while the discriminator uses LeakyReLU; there is no essential difference between the two, and the choice is mostly a rule of thumb. For each sample passed through it, the discriminator outputs a number between 0 and 1, representing the probability that the sample is a real image. Before writing the training function, let’s look at the model’s configuration parameters.
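The original configuration code is not shown here. A plausible version, with attribute names taken as assumptions and values following the DCGAN defaults discussed below, might be:

```python
# A plain configuration object; all names and values here are illustrative.
class Config(object):
    data_path = 'data/'      # folder whose sub-folder holds the face images
    image_size = 96          # image resolution used in this example
    batch_size = 256
    num_workers = 4
    nz = 100                 # dimension of the input noise
    ngf = 64                 # generator feature-map base width
    ndf = 64                 # discriminator feature-map base width
    max_epoch = 200
    lr1 = 2e-4               # generator learning rate (DCGAN default)
    lr2 = 2e-4               # discriminator learning rate (DCGAN default)
    beta1 = 0.5              # Adam beta1, as recommended by DCGAN
    gpu = True               # whether to use the GPU
    vis = True               # whether to use Visdom for visualization
    netd_path = None         # pretrained discriminator weights, if any
    netg_path = None         # pretrained generator weights, if any

opt = Config()
```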

These are just the model’s default parameters; they can be overridden from the command line with tools such as Fire. We can also access them directly as opt.attr and take advantage of the auto-completion provided by an IDE or IPython, which is very convenient. The hyperparameter settings here mostly follow the defaults of the DCGAN paper, whose authors found through extensive experiments that these values train a good model quickly.

After downloading the data, put all the images into one folder and move that folder into the data directory (make sure there are no other folders under data). The point of this arrangement is that we can then use torchvision’s built-in ImageFolder to read the images without writing our own Dataset. The code for reading and loading the data is as follows:
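The exact code is omitted here; a sketch of that pipeline, reusing the opt object assumed above, might be:

```python
import torchvision as tv
from torch.utils.data import DataLoader

transforms = tv.transforms.Compose([
    tv.transforms.Resize(opt.image_size),
    tv.transforms.CenterCrop(opt.image_size),
    tv.transforms.ToTensor(),
    # normalize each channel to -1 ~ 1, matching the generator's Tanh output
    tv.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

dataset = tv.datasets.ImageFolder(opt.data_path, transform=transforms)
dataloader = DataLoader(dataset,
                        batch_size=opt.batch_size,
                        shuffle=True,
                        num_workers=opt.num_workers,
                        drop_last=True)
```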

As you can see, using the ImageFolder in conjunction with the DataLoader is very convenient for loading images.

Before training, we need to define several variables: model, optimizer, noise, etc.
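The original code is omitted; under the same assumptions as the earlier sketches, the setup might look like this:

```python
import torch
from torch.optim import Adam

netg, netd = NetG(opt), NetD(opt)

# optionally resume from pretrained weights, loading onto the CPU first
map_location = lambda storage, loc: storage
if opt.netg_path:
    netg.load_state_dict(torch.load(opt.netg_path, map_location=map_location))
if opt.netd_path:
    netd.load_state_dict(torch.load(opt.netd_path, map_location=map_location))

optimizer_g = Adam(netg.parameters(), opt.lr1, betas=(opt.beta1, 0.999))
optimizer_d = Adam(netd.parameters(), opt.lr2, betas=(opt.beta1, 0.999))
criterion = torch.nn.BCELoss()

true_labels = torch.ones(opt.batch_size)    # target 1 for real images
fake_labels = torch.zeros(opt.batch_size)   # target 0 for fake images
fix_noises = torch.randn(opt.batch_size, opt.nz, 1, 1)  # fixed noise for visualization
noises = torch.randn(opt.batch_size, opt.nz, 1, 1)      # resampled every step

if opt.gpu:
    netd.cuda()
    netg.cuda()
    criterion.cuda()
    true_labels, fake_labels = true_labels.cuda(), fake_labels.cuda()
    fix_noises, noises = fix_noises.cuda(), noises.cuda()
```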


It is best to specify map_location when loading a pretrained model (for example, map_location=lambda storage, loc: storage). If the model was trained and saved on a GPU, its parameters are stored as torch.cuda.Tensor objects and will by default be loaded straight into GPU memory, which fails on a machine without a GPU. By specifying map_location, the tensors are first loaded into main memory and can then be moved to GPU memory when needed.

Let’s start training the network. The training steps are as follows.

(1) Train the discriminator.

  • Fix the generator
  • For real images, the discriminator should output a probability as close to 1 as possible
  • For fake images produced by the generator, the discriminator should output a value as close to 0 as possible

(2) Train the generator.

  • Fix the discriminator
  • The generator produces images and tries to make the discriminator output a value as close to 1 as possible

(3) Return to step (1) and keep alternating between the two.
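The book’s training code is not reproduced here. The loop below is a condensed sketch of the steps above, reusing the netg, netd, optimizers, labels, and noises assumed in the earlier sketches:

```python
for epoch in range(opt.max_epoch):
    for step, (real_imgs, _) in enumerate(dataloader):
        if opt.gpu:
            real_imgs = real_imgs.cuda()

        # (1) train the discriminator: push real images toward 1, fakes toward 0
        optimizer_d.zero_grad()
        real_output = netd(real_imgs)
        error_d_real = criterion(real_output, true_labels)
        error_d_real.backward()

        noises.normal_()                      # resample the noise
        fake_imgs = netg(noises).detach()     # truncate the graph: no gradient flows into G
        fake_output = netd(fake_imgs)
        error_d_fake = criterion(fake_output, fake_labels)
        error_d_fake.backward()
        optimizer_d.step()

        # (2) train the generator: make the discriminator output 1 on fakes
        optimizer_g.zero_grad()
        noises.normal_()
        fake_imgs = netg(noises)
        fake_output = netd(fake_imgs)
        error_g = criterion(fake_output, true_labels)
        error_g.backward()
        optimizer_g.step()
```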

The following points need to be noted here.

  • When training the generator there is no need to adjust the discriminator’s parameters, and when training the discriminator there is no need to adjust the generator’s parameters.
  • When training the discriminator, the images produced by the generator need to be detach-ed to truncate the computation graph, so that gradients do not propagate back into the generator. Since training the discriminator does not involve training the generator, the generator’s gradients are not needed.
  • When training the discriminator, backpropagation is performed twice: once to push the scores of real images toward 1 and once to push the scores of fake images toward 0. The two could also be combined into a single batch with one forward and one backward pass, but in practice it works best to keep each batch purely real or purely fake.
  • For fake images, when training the discriminator we want it to output 0, while when training the generator we want it to output 1. Hence the seemingly contradictory pair of lines: error_d_fake = criterion(fake_output, fake_labels) and error_g = criterion(fake_output, true_labels). The discriminator wants to label a fake image fake_label, the generator wants it labeled true_label, and the two push each other forward.

Here is the visualization code. Each visualization uses the same fixed noise vectors (fix_noises), which makes it easy to compare, for identical inputs, how the generator’s images improve step by step. In addition, since we normalized the input images to -1~1, we need to map them back to 0~1 when visualizing.
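The visualization code itself is omitted; a sketch of that step, reusing netg and fix_noises from the earlier sketches and assuming Visdom is used as mentioned later, could be:

```python
import torch
import visdom

vis = visdom.Visdom(env='GAN')  # the environment name is an assumption

with torch.no_grad():
    fix_fake_imgs = netg(fix_noises)

# images were normalized to -1 ~ 1, so map them back to 0 ~ 1 for display
vis.images((fix_fake_imgs * 0.5 + 0.5).cpu().numpy(), win='fixfake')
```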

In addition, a function is provided that loads pre-trained models and randomly generates images from noise.
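The book’s implementation is not reproduced here; a sketch along those lines (keeping the images the discriminator scores highest is an assumption, not necessarily the book’s behaviour) might be:

```python
import torch
import torchvision as tv

def generate(opt, gen_num=64, search_num=512, save_path='result.png'):
    """Load pretrained weights and save gen_num generated avatars."""
    netg, netd = NetG(opt).eval(), NetD(opt).eval()
    map_location = lambda storage, loc: storage
    netg.load_state_dict(torch.load(opt.netg_path, map_location=map_location))
    netd.load_state_dict(torch.load(opt.netd_path, map_location=map_location))

    with torch.no_grad():
        noises = torch.randn(search_num, opt.nz, 1, 1)
        fake_imgs = netg(noises)
        scores = netd(fake_imgs)

    # keep the gen_num images the discriminator considers most "real"
    indices = scores.topk(gen_num)[1]
    result = fake_imgs[indices]
    # map pixels back from -1 ~ 1 to 0 ~ 1 before saving
    tv.utils.save_image(result * 0.5 + 0.5, save_path)
```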

See chapter7/AnimeGAN for the complete code. Follow the instructions in readme.md to configure the environment, prepare the data, and begin training with the following command:
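The exact command depends on the entry point defined in the companion code; with a Fire-style main.py (an assumption here), it would look roughly like:

```
python main.py train --gpu=True --vis=True
```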

If you use Visdom, open http://[your IP]:8097 at this point to see the generated image.

After training, we can use the generator network to randomly produce anime avatars by entering a command along the following lines:
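Again assuming a Fire-style main.py, a plausible invocation is (the checkpoint paths below are placeholders for your trained weights):

```
python main.py generate --netg-path=checkpoints/netg_199.pth --netd-path=checkpoints/netd_199.pth --gen-img=result.png
```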


Analysis of experimental results

The experimental results are shown in Figure 7-9: anime avatars generated by the network after 1, 10, 20, 30, 40, and 200 epochs of training. Note that the generator receives the same input noise every time, so we can compare how the quality of the generated images improves over time for identical inputs.

At the beginning (after 1 epoch), the generated images are rather blurry, but rough facial contours are already visible.

After 10 epochs of training, the generated images show more detail, including hair and color, but the overall picture is still very blurry.

After 20 epochs, the details keep improving, including hair texture and eye details, but there are still many smearing artifacts.

By the 40th epoch, clear facial contours and details can be seen, but smearing remains and some details are not plausible, for example eyes of different sizes and severely distorted facial contours.

After 200 epochs of training, the details of the images are quite polished, with smoother lines and clearer outlines. Although some unreasonable spots remain, many of the images could pass as real.

Figure 7-9 Anime avatars generated by GAN

A similar project for generating anime avatars is “generating high-definition anime avatars with DRAGAN,” as shown in Figure 7-10. Unfortunately, the data used in the paper was not released due to copyright issues. The paper’s main improvements are higher-quality training images and a deeper, more complex model.

Figure 7-10 Anime avatars generated by DRAGAN

The sample program explained in this chapter can also be applied to other image-generation scenarios simply by swapping in different training images, such as the LSUN scene dataset, the MNIST handwritten digit dataset, or the CIFAR-10 dataset. In fact, there is plenty of room for improvement: the fully convolutional network used here has only four layers and the model is relatively shallow. After the ResNet paper was published, many researchers also tried introducing Residual Blocks into GAN architectures and achieved good visual results. Interested readers can try replacing the single convolution layers in the sample code with Residual Blocks and can expect good results.

In recent years, a major advance in GAN has come from theoretical research. The paper Towards Principled Methods for Training Generative Adversarial Networks analyzes, from a theoretical perspective, why GANs are hard to train, and its authors then proposed a better solution in a follow-up paper, Wasserstein GAN. However, some technical details in the Wasserstein GAN implementation were rather ad hoc, so Improved Training of Wasserstein GANs was later proposed to train WGAN better. The latter two papers have reference implementations in PyTorch and TensorFlow respectively, and the code can be found on GitHub. I have also tried to implement Wasserstein GAN in about 100 lines of code, for those who are interested.

As GAN research gradually matures, people are also trying to apply GANs to practical industrial problems. Among the many relevant papers, the most impressive is Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks, which proposes a new GAN architecture called CycleGAN. CycleGAN uses GANs to achieve style transfer, colorization of black-and-white images, translation between horses and zebras, and so on, with outstanding results. The paper’s authors implemented all the code in PyTorch and open-sourced it on GitHub for interested readers to check out.

This chapter introduced the basic principles of GAN and walked readers through using a GAN to generate anime avatars. There are many variants of GAN, and GitHub hosts many PyTorch implementations of them for those who are interested.

About the author

Yun Chen is a Python programmer, Linux enthusiast, and source-code contributor to PyTorch. His research interests include computer vision and machine learning. He won first prize in the 2017 Zhihu Kanshan Cup Machine Learning Challenge and eighth place in the 2017 Tianchi Medical AI Competition. He is keen on promoting PyTorch, has rich experience using it, and is active on the PyTorch forums and the related sections of Zhihu.

Giveaway! We are sending 10 hard copies of PyTorch Introduction and Practice for Deep Learning to AI fans! Leave a comment below with your reason for wanting the book and we will invite you to join the giveaway group; winners will be drawn at random by the raffle mini-program on Tuesday, February 6, at 10 a.m., and each winner will receive a copy. A JD.com purchase link is attached; tap “Read the original” to go there!

For more content, follow AI Front (ID: AI-front) and reply “AI”, “TF”, or “Big Data” to get the AI Front series of PDF mini-books and skill maps.