
More articles at xerrors.fun


The pain point to solve is that paired (matched) images are hard to obtain. The idea is to find a mapping function G that maps images from domain X into domain Y. Since there is no pairing constraint on this mapping, training easily goes astray, so a second mapping function F is trained to map images from domain Y back to domain X, with the requirement that F(G(X)) is approximately X.

The original paper first puts forward an assumption:

We assume there is some underlying relationship between the domains — for example, that they are two different renderings of the same underlying scene — and seek to learn that relationship.


Therefore, the author wants to train a mapping G that maps the X domain to an output distribution G(X) matching the Y distribution. However, there is no guarantee that the transformation pairs an individual input x with a meaningful output y, so there is no direct way to optimize for such a correspondence (the problem here should be the same one posed by unpaired data). Meanwhile, training often runs into mode collapse: all input images get mapped to the same output.

To ensure that each output remains related to its input, which can be understood as preserving the original "latent relationship", the author introduces the principle of "cycle consistency": after x is mapped through G to G(x), the other mapping F must be able to map it back, so that F(G(x)) ≈ x and G(F(y)) ≈ y.

1. Related work

This part mainly reviews prior work together with the author's own reading of it; a few notes are recorded here.

The author attributes the success of GANs to the idea of the "adversarial loss", and builds the "cycle consistency loss" on top of it.

The key to GANs’ success is the idea of an adversarial loss that forces the generated images to be, in principle, indistinguishable from real photos.


When summarizing previous image-to-image work, the paper points out that, unlike earlier methods, this network needs neither task-specific supervision nor a predefined matching relation, nor does it assume that the inputs and outputs lie in the same low-dimensional embedding space. (Doesn't that somewhat contradict the assumption stated earlier, I wonder?)

Unlike the above approaches, our formulation does not rely on any task-specific, predefined similarity function between the input and output, nor do we assume that the input and output have to lie in the same low-dimensional embedding space.


On "cycle consistency", the author explains that the idea of using transitivity as a form of supervision has a long history and has been applied in many fields, such as translation and 3D model matching. Some prior work has likewise applied a cycle consistency loss during training, using transitivity to supervise the training of CNNs.

Of these, Zhou et al. and Godard et al. are most similar to our work, as they use a cycle consistency loss as a way of using transitivity to supervise CNN training.


There is also concurrent work with the same idea, DualGAN:

Concurrent with our work, in these same proceedings, Yi et al. independently use a similar objective for unpaired image-to-image translation, inspired by dual learning in machine translation.


The author also compares against Neural Style Transfer. Although the visual results look similar, CycleGAN focuses on the mapping between two image collections and captures higher-level structure beyond appearance alone, so the model can easily be applied to other tasks.

2. Mathematical theory

To understand the math here, it is important to understand what each part of this diagram does.

The network contains two mapping functions and two discriminators. When training the mapping G and the discriminator DY, the objective is as follows (and symmetrically for F and DX):


$$
\begin{aligned}
\mathcal{L}_{\mathrm{GAN}}\left(G, D_{Y}, X, Y\right) &= \mathbb{E}_{y \sim p_{\text{data}}(y)}\left[\log D_{Y}(y)\right] \\
&+ \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log \left(1-D_{Y}(G(x))\right)\right]
\end{aligned}
$$
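
As a rough sketch of the generator-side part of this term in PyTorch (not the author's exact code): `netG_A2B`, `netD_B` and `real_A` below are assumed names for the X→Y generator, the Y-domain discriminator and a batch from domain X. In practice CycleGAN replaces the log-likelihood with a least-squares loss, which is what `MSELoss` provides here.

import torch
import torch.nn as nn

criterion_GAN = nn.MSELoss()  # least-squares stand-in for the log-likelihood GAN loss

# Assumed to exist: netG_A2B (G: X -> Y), netD_B (D_Y), real_A (a batch x ~ p_data(x))
fake_B = netG_A2B(real_A)                      # G(x)
pred_fake = netD_B(fake_B)                     # D_Y(G(x))
loss_GAN_A2B = criterion_GAN(pred_fake, torch.ones_like(pred_fake))  # try to fool D_Y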

Cycle consistency loss:


$$
\begin{aligned}
\mathcal{L}_{\text{cyc}}(G, F) &= \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\|F(G(x))-x\|_{1}\right] \\
&+ \mathbb{E}_{y \sim p_{\text{data}}(y)}\left[\|G(F(y))-y\|_{1}\right]
\end{aligned}
$$
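
The same idea as a minimal code sketch: an L1 penalty on the reconstruction in both directions (`netG_A2B`, `netG_B2A`, `real_A` and `real_B` are assumed names, as above).

import torch.nn as nn

criterion_cycle = nn.L1Loss()

# Assumed to exist: netG_A2B (G), netG_B2A (F), real_A (x), real_B (y)
recovered_A = netG_B2A(netG_A2B(real_A))   # F(G(x)) should come back close to x
recovered_B = netG_A2B(netG_B2A(real_B))   # G(F(y)) should come back close to y
loss_cycle = criterion_cycle(recovered_A, real_A) + criterion_cycle(recovered_B, real_B)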

Taken together, the full objective is obtained as follows, where $\lambda$ controls the relative importance of the two objectives:


$$
\begin{aligned}
\mathcal{L}\left(G, F, D_{X}, D_{Y}\right) &= \mathcal{L}_{\mathrm{GAN}}\left(G, D_{Y}, X, Y\right) \\
&+ \mathcal{L}_{\mathrm{GAN}}\left(F, D_{X}, Y, X\right) \\
&+ \lambda \mathcal{L}_{\mathrm{cyc}}(G, F)
\end{aligned}
$$

The whole algorithm then amounts to solving:


$$
G^{*}, F^{*} = \arg \min_{G, F} \max_{D_{X}, D_{Y}} \mathcal{L}\left(G, F, D_{X}, D_{Y}\right)
$$

In addition, for style transfer tasks there is one more loss, the identity loss. The author finds that adding this extra term, which encourages each generator to leave images of its target domain unchanged, helps preserve the color composition between input and output:


$$
\mathcal{L}_{\text{identity}}(G, F) = \mathbb{E}_{y \sim p_{\text{data}}(y)}\left[\|G(y)-y\|_{1}\right] + \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\|F(x)-x\|_{1}\right]
$$
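
A corresponding code sketch: each generator is fed an image already in its target domain and is penalised (L1) for changing it. The same assumed names are used.

import torch.nn as nn

criterion_identity = nn.L1Loss()

# Assumed to exist: netG_A2B (G), netG_B2A (F), real_A (x), real_B (y)
loss_identity = (criterion_identity(netG_A2B(real_B), real_B)
                 + criterion_identity(netG_B2A(real_A), real_A))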

3. Network structure

I did not follow the network-architecture section of the paper very well, but the picture below makes it clearer: it shows a one-way training pass.

The generator consists of three parts:

The three stages are encoding, transformation and decoding. The first part, encoding, can be understood as feature extraction: abstract features are extracted from the original image. The second part, transformation, converts those features from image domain A to image domain B; decoding then restores them into an image in domain B. This is why the generator is usually drawn as two trapezoids facing each other.

The generator

import torch.nn as nn


class Generator(nn.Module):
    def __init__(self, input_nc, output_nc, n_residual_blocks=9):
        super(Generator, self).__init__()

        # Initial convolution block (256x256x3 -> 256x256x64)
        ## On nn.ReflectionPad2d, see https://zhuanlan.zhihu.com/p/66989411
        model = [   nn.ReflectionPad2d(3),
                    nn.Conv2d(input_nc, 64, 7),
                    nn.InstanceNorm2d(64),
                    nn.ReLU(inplace=True) ]

        # Downsampling, Encoding (256x256x64 -> 128x128x128 -> 64x64x256)
        ## Two strided convolution layers are used for feature extraction
        in_features = 64
        out_features = in_features*2
        for _ in range(2):
            model += [  nn.Conv2d(in_features, out_features, 3, stride=2, padding=1),
                        nn.InstanceNorm2d(out_features),
                        nn.ReLU(inplace=True) ]
            in_features = out_features
            out_features = in_features*2

        # Residual blocks, Transformation (64x64x256 -> 64x64x256)
        ## n_residual_blocks (9 by default) residual blocks perform the domain transformation
        for _ in range(n_residual_blocks):
            model += [ResidualBlock(in_features)]

        # Upsampling, Decoding (64x64x256 -> 128x128x128 -> 256x256x64)
        ## Restore the features to image domain B with two transposed convolutions
        out_features = in_features//2
        for _ in range(2):
            model += [  nn.ConvTranspose2d(in_features, out_features, 3, stride=2, padding=1, output_padding=1),
                        nn.InstanceNorm2d(out_features),
                        nn.ReLU(inplace=True) ]
            in_features = out_features
            out_features = in_features//2

        # Output layer (256x256x64 -> 256x256x3)
        model += [  nn.ReflectionPad2d(3),
                    nn.Conv2d(64, output_nc, 7),
                    nn.Tanh() ]

        self.model = nn.Sequential(*model)

    def forward(self, x):
        return self.model(x)
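
As a quick shape check (a usage sketch, assuming the Generator above and the ResidualBlock below are both defined):

import torch

netG = Generator(input_nc=3, output_nc=3)   # 3-channel input and output
x = torch.randn(1, 3, 256, 256)             # a dummy 256x256 RGB image
y = netG(x)
print(y.shape)                               # torch.Size([1, 3, 256, 256]), values in (-1, 1) from the Tanh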

Residual module

class ResidualBlock(nn.Module):
    def __init__(self, in_features):
        super(ResidualBlock, self).__init__()

        conv_block = [  nn.ReflectionPad2d(1),
                        nn.Conv2d(in_features, in_features, 3),
                        nn.InstanceNorm2d(in_features),
                        nn.ReLU(inplace=True),
                        nn.ReflectionPad2d(1),
                        nn.Conv2d(in_features, in_features, 3),
                        nn.InstanceNorm2d(in_features)  ]

        self.conv_block = nn.Sequential(*conv_block)

    def forward(self, x):
        return x + self.conv_block(x)

The discriminator

It is just a simple classifier:

import torch.nn as nn
import torch.nn.functional as F


class Discriminator(nn.Module):
    def __init__(self, input_nc):
        super(Discriminator, self).__init__()

        # A stack of convolutions, one after another
        model = [   nn.Conv2d(input_nc, 64, 4, stride=2, padding=1),
                    nn.LeakyReLU(0.2, inplace=True) ]

        model += [  nn.Conv2d(64, 128, 4, stride=2, padding=1),
                    nn.InstanceNorm2d(128),
                    nn.LeakyReLU(0.2, inplace=True) ]

        model += [  nn.Conv2d(128, 256, 4, stride=2, padding=1),
                    nn.InstanceNorm2d(256),
                    nn.LeakyReLU(0.2, inplace=True) ]

        model += [  nn.Conv2d(256, 512, 4, padding=1),
                    nn.InstanceNorm2d(512),
                    nn.LeakyReLU(0.2, inplace=True) ]

        # FCN classification layer
        model += [nn.Conv2d(512, 1, 4, padding=1)]

        self.model = nn.Sequential(*model)

    def forward(self, x):
        x = self.model(x)
        # Average pooling and flatten
        return F.avg_pool2d(x, x.size()[2:]).view(x.size()[0], -1)
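
And a shape check for the discriminator (a usage sketch): for a 256x256 input, the convolution stack produces a grid of patch scores, which the average pooling collapses into one score per image.

import torch

netD = Discriminator(input_nc=3)
x = torch.randn(1, 3, 256, 256)
print(netD(x).shape)   # torch.Size([1, 1]): one real/fake score per image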

4. Training process

When training the generator, three loss functions are used: GAN loss, identity loss and cycle loss. A picture is worth a thousand words (I drew this one myself and am quite pleased with it).

As the figure shows, the generator is trained on two images, Real A and Real B, taken from domain A and domain B. The generator training splits into three parts: the orange area computes the identity loss, the yellow area computes the cycle loss, and the remaining purple area computes the GAN loss. The six losses produced by the two generators are summed into a single generator loss, which is then backpropagated. The code sketch below also illustrates this.
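
A minimal sketch of one generator update (names such as `netG_A2B`, `netD_A`, `optimizer_G` and the loss weights are assumptions, loosely following the readable PyTorch implementation referenced below rather than the author's exact script):

import torch
import torch.nn as nn

criterion_GAN = nn.MSELoss()
criterion_cycle = nn.L1Loss()
criterion_identity = nn.L1Loss()

# Assumed to exist: netG_A2B, netG_B2A, netD_A, netD_B, optimizer_G, real_A, real_B
optimizer_G.zero_grad()

# Identity loss (orange area)
loss_id_B = criterion_identity(netG_A2B(real_B), real_B)
loss_id_A = criterion_identity(netG_B2A(real_A), real_A)

# GAN loss (purple area)
fake_B = netG_A2B(real_A)
pred_fake_B = netD_B(fake_B)
loss_GAN_A2B = criterion_GAN(pred_fake_B, torch.ones_like(pred_fake_B))
fake_A = netG_B2A(real_B)
pred_fake_A = netD_A(fake_A)
loss_GAN_B2A = criterion_GAN(pred_fake_A, torch.ones_like(pred_fake_A))

# Cycle loss (yellow area)
loss_cycle_ABA = criterion_cycle(netG_B2A(fake_B), real_A)
loss_cycle_BAB = criterion_cycle(netG_A2B(fake_A), real_B)

# Sum the six losses and backpropagate (the weights 10.0 and 5.0 are assumed, not taken from this article)
loss_G = (loss_GAN_A2B + loss_GAN_B2A
          + 10.0 * (loss_cycle_ABA + loss_cycle_BAB)
          + 5.0 * (loss_id_A + loss_id_B))
loss_G.backward()
optimizer_G.step()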

Training method of the discriminator (a minimal sketch follows):
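
Here is a rough sketch of one update of D_A (D_B is symmetric). Implementations typically draw the fake image from a buffer of previously generated images; for brevity it is simply detached from the generator here. The names are the same assumptions as above.

import torch
import torch.nn as nn

criterion_GAN = nn.MSELoss()

# Assumed to exist: netD_A, netG_B2A, optimizer_D_A, real_A, real_B
optimizer_D_A.zero_grad()

# Real loss: D_A should score genuine domain-A images as real
pred_real = netD_A(real_A)
loss_real = criterion_GAN(pred_real, torch.ones_like(pred_real))

# Fake loss: D_A should score generated images as fake (detach so only D_A is updated)
fake_A = netG_B2A(real_B).detach()
pred_fake = netD_A(fake_A)
loss_fake = criterion_GAN(pred_fake, torch.zeros_like(pred_fake))

loss_D_A = 0.5 * (loss_real + loss_fake)
loss_D_A.backward()
optimizer_D_A.step()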

The full training loop can be found in this readable PyTorch implementation: Pytorch CycleGAN/blob/master/train

References

  1. Tensorflow implementation of CycleGAN – GitHub
  2. A Clean and readable Pytorch Implementation of CycleGAN – GitHub
  3. Understand CycleGAN and implement it easily with TensorFlow – QbitAI (量子位)