GANs can also be massively compressed, and recent research from Song Han's team at MIT has many researchers excited.

Excerpted from arXiv, by Muyang Li et al. Compiled by the Heart of Machine editorial board.



GAN-based image generation is one of the most important research directions in machine learning. But such algorithms demand so much computing power that most researchers struggle to produce new results, and in recent years the field has tended to be dominated by large institutions.


But recently, researchers from the Massachusetts Institute of Technology (MIT), Adobe, and Shanghai Jiao Tong University have proposed a general method for compressing conditional GANs. The new technique reduces the computation of widely used conditional GAN models such as pix2pix, CycleGAN, and GauGAN to between 1/9 and 1/21 of the original while maintaining visual fidelity. The method is applicable to a variety of generator architectures, learning objectives, and both paired and unpaired settings.




The paper has been accepted to CVPR 2020, and a PyTorch implementation of the GAN compression framework has been open-sourced.


Project link: https://github.com/mit-han-lab/gan-compression


How well does a compressed GAN actually perform? In the demo the researchers released, using CycleGAN to add zebra stripes to a horse in a video required less than 1/16 of the original computation, ran at three times the frame rate, and even improved output quality:

Notably, the hardware platform used in the demo is NVIDIA's edge AI computing chip, the Jetson Xavier GPU. According to official specifications, the Jetson Xavier delivers 22 (+10) TOPS of INT8 performance, while the Snapdragon 865 delivers 15 TOPS. Compressed GANs can now run on robots, drones, and other small devices, and may soon fit into mobile phones.





Paper link: https://arxiv.org/pdf/2003.08936v1.pdf




General introduction


Generative adversarial networks (GANs) excel at synthesizing highly realistic images. The conditional generative adversarial network (cGAN), a variant of GAN, enables controllable image synthesis in many computer vision and graphics applications. However, most of these applications require the model to interact with humans, so a good user experience demands low latency on-device.


However, some recently developed cGANs require 1 to 2 orders of magnitude more computation than current recognition convolutional neural networks (CNNs). For example, GauGAN consumes 281G MACs per image, while MobileNet-V3 requires only 0.44G MACs, making interactive deployment difficult.


Moreover, most edge devices are constrained by limited memory and power budgets (e.g., batteries), which further hinders GAN deployment on them.


Therefore, to address these problems of GANs and cGANs in image synthesis, Song Han and his team proposed GAN Compression, a general compression method that reduces the inference time and computational cost of GANs. Compressing generative models faces two fundamental difficulties: GAN training is unstable, especially in the unpaired setting; and generators differ substantially from recognition CNNs, so existing CNN compression designs are hard to apply directly. To solve this, the team transfers knowledge from intermediate representation layers of the original teacher generator to the corresponding layers of the student generator.


To reduce training cost, the team also decoupled model training from architecture search by training a once-for-all network that contains all possible channel numbers. This once-for-all network generates many sub-networks through weight sharing, and the performance of each sub-network can be evaluated without retraining. The method can be applied to various conditional GAN models, regardless of generator architecture, learning algorithm, or supervision setting (paired or unpaired).


Through extensive experiments, the team demonstrated that this approach reduces the computation of widely used GAN models such as pix2pix, CycleGAN, and GauGAN to between 1/9 and 1/21 of the original without compromising the fidelity of the generated images.


The specific method


Compressing conditional generative models for interactive applications is challenging for two reasons. First, the training dynamics of GANs are inherently unstable. Second, the large architectural differences between recognition and generation models make it difficult to apply existing CNN compression algorithms directly.


For these reasons, the researchers propose a training procedure tailored to efficient generative models and use neural architecture search (NAS) to further increase the compression ratio. The overall architecture of the GAN compression framework is shown in Figure 3 below, using the ResNet generator as an example. Importantly, the same framework can be applied to different generator architectures and learning objectives.


Figure 3: Overall architecture diagram of GAN compression framework in this paper.



The objective function


1. Unify paired learning and unpaired learning


The wide range of training objectives makes it difficult to build a general compression framework. To address this, the researchers unified paired and unpaired learning in the model-compression setting, regardless of how the teacher model was originally trained. Given the original teacher generator G′, the unpaired training setting can be converted into a paired one: the teacher's outputs are treated as ground truth, and the compressed student generator G is trained with a paired training objective.


Learning objectives are summarized as follows:

L_recon = E_{x,y} ‖G(x) − y‖_1 (paired setting), or E_x ‖G(x) − G′(x)‖_1 (unpaired setting, where the teacher output G′(x) serves as the pseudo ground truth)


With these changes, the same compression framework can be applied to different types of cGANs. Furthermore, learning with the pseudo pairs described above makes training more stable and produces better results than the original unpaired training setting.
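The pseudo-pair idea can be sketched in a few lines of plain Python. This is a toy sketch: flat lists of floats stand in for image tensors, and the function names are ours, not the paper's; the real framework computes this loss over image batches.

```python
def l1_loss(a, b):
    """Mean absolute error between two flattened outputs."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def reconstruction_loss(student_out, teacher_out, ground_truth=None):
    """Unified reconstruction objective.

    Paired setting: compare the student output G(x) with the real target y.
    Unpaired setting: no real target exists, so the teacher output G'(x)
    is treated as a pseudo ground truth.
    """
    target = ground_truth if ground_truth is not None else teacher_out
    return l1_loss(student_out, target)
```

In either setting the student ends up with a paired regression target, which is what makes one compression framework cover both cases.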


2. Learn from the teacher discriminator


Although this study focuses on compressing the generator, the discriminator D also stores useful information about the learned GAN. Therefore, the researchers reuse the same discriminator architecture, initializing it with the pre-trained weights of the teacher discriminator and fine-tuning it together with the compressed generator.


In experiments, the researchers observed that a pre-trained discriminator can guide the training of the student generator, whereas a randomly initialized discriminator usually leads to unstable training and degraded image quality. The GAN objective can be written as follows:

L_cGAN = E_{x,y} [log D(x, y)] + E_x [log (1 − D(x, G(x)))]


In the formula above, the student discriminator D is initialized with the weights of the teacher discriminator D′. G and D are then trained with standard minimax optimization.
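The value of this minimax objective is easy to compute for a batch of discriminator scores. A toy sketch (the score lists stand in for discriminator outputs in (0, 1); names are ours):

```python
import math

def cgan_objective(d_real, d_fake):
    """Value of L_cGAN = E[log D(x, y)] + E[log(1 - D(x, G(x)))].

    d_real: discriminator scores on real pairs (x, y)
    d_fake: discriminator scores on generated pairs (x, G(x))
    The discriminator ascends this value; the generator descends it.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term
```

A maximally uncertain discriminator (all scores 0.5) yields 2·log(0.5) ≈ −1.386, the saddle-point value of the original GAN game.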


3. Intermediate characteristic distillation


Knowledge distillation is a common method for CNN model compression. By matching the logit distributions of the output layer, the "dark knowledge" of the teacher model can be transferred to the student model to improve its performance. However, a conditional GAN usually outputs a deterministic image rather than a probability distribution.


To address this, the researchers instead match the intermediate representations of the teacher generator. The more channels an intermediate layer contains, the richer the information it provides, and the more the student model can capture beyond the final output alone. The distillation objective is as follows:

L_distill = Σ_{t=1}^{T} ‖G_t(x) − G′_t(x)‖_2


where G_t(x) and G′_t(x) are the intermediate feature activations of the t-th selected layer in the student and teacher models respectively, and T is the number of selected layers.
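A minimal sketch of this sum over layers, with each activation flattened to a plain list. Note one simplification: in the actual framework the student's channel count differs from the teacher's, so the two activations would first need to be aligned (e.g., by a small learnable mapping layer), which this sketch omits.

```python
import math

def distillation_loss(student_feats, teacher_feats):
    """Sum over the T selected layers of the L2 distance between the
    student activation G_t(x) and the teacher activation G'_t(x).

    student_feats, teacher_feats: lists of T flattened activations,
    assumed already aligned to the same length per layer.
    """
    total = 0.0
    for s, t in zip(student_feats, teacher_feats):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))
    return total
```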


4. The complete optimization objective


The complete objective can be written as follows:

L = L_cGAN + λ_recon · L_recon + λ_distill · L_distill


where the hyperparameters λ_recon and λ_distill control the relative importance of each term.


Efficient generator design space


Choosing a well-designed student architecture is crucial for the final knowledge distillation. The researchers found that simply reducing the number of channels of the teacher model does not yield a compact enough student model: once computation is reduced by more than 4x, performance degrades significantly.


1. Convolution decomposition and layer sensitivity


Existing generators usually adopt conventional convolutions, following CNN designs for classification and segmentation. Recently, efficient CNN designs have widely adopted a decomposed form of convolution (depthwise + pointwise) to achieve a better trade-off between performance and computation. The researchers found that decomposed convolutions also work well in cGAN generator design.
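The saving from decomposition is easy to quantify by counting MACs. A sketch under a stride-1, "same"-padding assumption; the layer sizes in the example are illustrative, not from the paper:

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs of a standard k x k convolution on an h x w feature map."""
    return h * w * c_in * c_out * k * k

def separable_conv_macs(h, w, c_in, c_out, k):
    """Depthwise (one k x k filter per input channel) + pointwise (1 x 1)."""
    return h * w * c_in * k * k + h * w * c_in * c_out

# Example: a 3x3 layer with 64 -> 64 channels on a 32x32 feature map
full = standard_conv_macs(32, 32, 64, 64, 3)   # 37,748,736 MACs
lite = separable_conv_macs(32, 32, 64, 64, 3)  #  4,784,128 MACs, ~7.9x fewer
```

In general the ratio is roughly 1/c_out + 1/k², which is why the saving grows with both the channel count and the kernel size.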


2. Automatic channel pruning with NAS


Existing generators use a hand-designed (and almost uniform) number of channels across all layers, which introduces redundancy and is far from optimal.
To further improve compression efficiency, the researchers use channel pruning to automatically select the channel width of each layer in the generator, removing redundancy and reducing computation further. The approach supports fine-grained selection of channel numbers: for each convolutional layer, the number of channels can be chosen from multiples of 8, balancing MACs and hardware parallelism.
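The resulting per-layer search space can be sketched as follows; the generator layer widths in the example are hypothetical, chosen only to show how quickly the number of candidate configurations grows:

```python
def channel_choices(teacher_width, step=8):
    """Candidate widths for one conv layer: multiples of `step` up to the
    teacher's original channel count."""
    return list(range(step, teacher_width + 1, step))

# Hypothetical generator with four prunable layers
search_space = 1
for width in [64, 128, 128, 64]:
    search_space *= len(channel_choices(width))
# 8 * 16 * 16 * 8 = 16,384 candidate sub-networks
```

Even this small example yields tens of thousands of configurations, which is why the next section decouples training from search rather than training each candidate separately.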


Decoupled training and architecture search


Following recent one-shot NAS approaches, the researchers decouple model training from architecture search. First, a once-for-all network supporting different channel numbers is trained, in which each sub-network is trained equally; Figure 3 illustrates the overall framework. Suppose the original teacher generator has a certain number of channels at each layer. For a given (smaller) channel number, the weights of a sub-network are obtained by extracting the first channels, up to that number, from the corresponding weight tensors of the once-for-all network.


In each training step, a sub-network with a random number of channels is sampled, its output and gradients are computed with the learning objective, and the extracted weights are updated (Formula 4). Because the first few channels are extracted and updated more frequently, they play a more critical role among all the weights.
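The channel-extraction and sampling steps above can be sketched as follows. Nested lists stand in for weight tensors and the names are ours; a real implementation would slice PyTorch tensors instead.

```python
import random

def extract_subnet_weight(weight, c_out, c_in):
    """Slice a once-for-all weight 'tensor' (a c_out_max x c_in_max nested
    list): a sub-network reuses the FIRST c_out output channels and FIRST
    c_in input channels, so the leading channels are shared by, and updated
    through, many sampled sub-networks."""
    return [row[:c_in] for row in weight[:c_out]]

def sample_subnet(layer_widths, step=8, rng=None):
    """Randomly pick a channel number (a multiple of `step`) for every
    layer, as done once per training step."""
    rng = rng or random.Random()
    return [rng.choice(range(step, w + 1, step)) for w in layer_widths]
```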


After the once-for-all network is trained, the researchers directly evaluate the performance of each sub-network on the validation set to find the best one. Because the once-for-all network has been thoroughly trained through weight sharing, no fine-tuning is required, and the result approximates the performance of a model trained from scratch.


In this way, with only a single training run, all channel configurations can be evaluated and the best one selected from the search results. Optionally, the selected architecture can be fine-tuned to further improve performance.
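The search step reduces to a plain loop over candidates, since no retraining is needed. A sketch in which the `evaluate` and `macs_of` callbacks are placeholders for real validation-set evaluation (e.g., FID, lower is better) and MAC counting:

```python
def search_best_subnet(configs, evaluate, macs_of, mac_budget):
    """Directly evaluate every candidate sub-network (no retraining) and
    return the best one that fits the compute budget."""
    best_cfg, best_score = None, float("inf")
    for cfg in configs:
        if macs_of(cfg) > mac_budget:
            continue  # over budget: skip without evaluating
        score = evaluate(cfg)  # lower-is-better metric such as FID
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

With a toy metric where larger configurations score better, the search returns the largest configuration that still fits the budget, which matches the intuition behind budgeted channel search.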


The experimental results


The researchers conducted experiments on three conditional GAN models, CycleGAN, pix2pix, and GauGAN, to verify the generality of the proposed GAN compression framework. The four datasets used are Horse↔Zebra, Edges→Shoes, Cityscapes, and Map↔Aerial photo.


Table 1 below shows the quantitative results of compressing the CycleGAN, pix2pix, and GauGAN models on the four datasets above.


Table 1: Quantitative evaluation of compressing the three conditional GAN models, using mAP (higher is better) on the Cityscapes dataset and FID (lower is better) on the other datasets. The results show that the GAN compression method reduces MACs by 7-21x and model size by 5-33x on current SOTA conditional GANs with only slight performance degradation. For CycleGAN compression, this method far surpasses the previous CycleGAN-specific Co-Evolution method.



Tradeoffs between performance and computation


In addition to achieving higher compression rates, this method also improves performance across different model sizes. Figure 6 shows the trade-off between performance and computation for the pix2pix model on different datasets.


Figure 6: Trade-off curves of pix2pix on the Cityscapes and Edges→Shoes datasets. Pruning and distillation methods outperform training from scratch for large models, but perform poorly when the model is compressed aggressively.



Results show


Figure 4 below shows the results obtained with this method: the input data, ground truth, original model output, and compressed model output respectively. As the figure shows, the proposed method maintains the visual fidelity of the output images even at large compression rates.


Figure 4: Comparison of results on the Cityscapes, Edges→Shoes, and Horse→Zebra datasets.



Hardware inference acceleration


For real-world interactive applications, inference acceleration on actual hardware matters far more than a reduction in theoretical computation. To verify the method's effectiveness in practice, the researchers measured the inference speed of the compressed models on devices with different computing capabilities, as shown in Table 2 below.


Table 2: Memory savings and latency reductions measured on the NVIDIA Jetson AGX Xavier, NVIDIA Jetson Nano, 1080Ti GPU, and Xeon CPU.



Conclusion


The general compression framework proposed by Han's team in this paper significantly reduces the computational cost and model size of generators in conditional GANs, improving training stability and model efficiency through knowledge distillation and neural architecture search. Experiments show that the proposed GAN compression method can compress several conditional GAN models while maintaining visual quality. The team says future work will focus on reducing model latency and building efficient architectures for video generation models.