
Preface

Although we improved on traditional neural style transfer in the previous article on improved style transfer, we could still only use a fixed set of styles obtained through training. In this article we therefore learn another neural network model that allows real-time arbitrary style transfer, giving us far more creative choices.

Adaptive instance normalization

Adaptive Instance Normalization (AdaIN) is a form of instance normalization, meaning that its mean and standard deviation are calculated per image and per channel, over the spatial dimensions (H, W). In conditional instance normalization (CIN), the γ and β coefficients are trainable variables that learn the mean and variance required for different styles. In AdaIN, γ and β are replaced by the standard deviation and mean of the style features:


$$AdaIN(x, y) = \sigma(y)\frac{x - \mu(x)}{\sigma(x)} + \mu(y)$$

AdaIN can still be understood as a form of conditional instance normalization, where the condition is the style features rather than a style label. For both training and inference, we use VGG to extract the style layer outputs and use their statistics as the style condition, which avoids having to define a fixed set of styles in advance. We use TensorFlow to create a custom AdaIN layer:

import tensorflow as tf
from tensorflow.keras import layers

class AdaIN(layers.Layer):
    def __init__(self, epsilon=1e-5):
        super(AdaIN, self).__init__()
        self.epsilon = epsilon

    def call(self, inputs):
        x = inputs[0]  # content features
        y = inputs[1]  # style features
        mean_x, var_x = tf.nn.moments(x, axes=(1, 2), keepdims=True)
        mean_y, var_y = tf.nn.moments(y, axes=(1, 2), keepdims=True)
        std_x = tf.sqrt(var_x + self.epsilon)
        std_y = tf.sqrt(var_y + self.epsilon)
        # Normalize the content features, then scale and shift them
        # with the style statistics
        output = std_y * (x - mean_x) / std_x + mean_y
        return output

Tips: As you can see, this is a direct implementation of AdaIN's equation. tf.nn.moments is used to calculate the mean and variance of the feature maps, where axes 1 and 2 correspond to the H and W dimensions. keepdims=True is set to keep the result four-dimensional, with shape (N, 1, 1, C) instead of the default (N, C). The former allows TensorFlow to use its broadcasting mechanism with input tensors of shape (N, H, W, C).
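
As a quick standalone check of this broadcasting behaviour (a sketch with a random tensor, not part of the model code):

import tensorflow as tf

x = tf.random.normal((2, 32, 32, 64))           # (N, H, W, C) feature map
mean, var = tf.nn.moments(x, axes=(1, 2), keepdims=True)
print(mean.shape)                               # (2, 1, 1, 64): broadcasts with x
mean2, var2 = tf.nn.moments(x, axes=(1, 2))
print(mean2.shape)                              # (2, 64): would not broadcast against (N, H, W, C)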

Next, we integrate AdaIN into the style transfer network.

Style transfer network

The following figure shows the architecture and training flow of the style transfer network:

The style transfer network (STN) is an encoder-decoder network in which the encoder extracts content and style features using a fixed VGG. AdaIN then transfers the statistics of the style features onto the content features, and the decoder takes these new features to generate the stylized image.

Encoder structure and implementation

The encoder is built using VGG:

    def build_encoder(self, name='encoder'):
        self.encoder_layers = [
            'block1_conv1', 'block2_conv1', 'block3_conv1', 'block4_conv1']
        vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')

        # Collect the outputs of the selected style layers
        layer_outputs = [vgg.get_layer(x).output for x in self.encoder_layers]

        # Model is tf.keras.models.Model
        return Model(vgg.input, layer_outputs, name=name)

Tips: This is similar to neural style transfer, except that the last style layer, block4_conv1, is also used as the content layer. Therefore, we do not need to define a separate content layer.
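
A minimal standalone sketch of the same encoder wiring (assuming a 256×256 RGB input) shows the four feature maps it returns, the last of which doubles as the content representation:

import tensorflow as tf
from tensorflow.keras.models import Model

encoder_layers = ['block1_conv1', 'block2_conv1', 'block3_conv1', 'block4_conv1']
vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
encoder = Model(vgg.input, [vgg.get_layer(x).output for x in encoder_layers])

features = encoder(tf.random.normal((1, 256, 256, 3)))
print([f.shape for f in features])
# (1, 256, 256, 64), (1, 128, 128, 128), (1, 64, 64, 256), (1, 32, 32, 512)
content_feature = features[-1]  # block4_conv1, the feature map AdaIN consumes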

Next, we need to make a small but significant improvement to the convolutional layer to improve the appearance of the generated image.

Reduce block artifacts by reflection padding

When we apply padding to the input tensor of a convolutional layer, the tensor is surrounded with constant zeros. However, the sudden drop in value at the boundary produces high-frequency components and leads to block artifacts in the generated image. One way to reduce these high-frequency components is to add a total variation loss as a regularizer during network training:

  1. First, calculate the high-frequency components by shifting the image by one pixel,
  2. then subtract the original image to create a matrix of differences.

The total variation loss is the sum of the $L_1$ norms of these differences. Training therefore attempts to minimize this loss function in order to reduce the high-frequency components.
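
We do not use total variation loss here (the reflection padding described next is used instead), but a minimal sketch of the shift-and-subtract computation above could look like this; TensorFlow also ships an equivalent built-in, tf.image.total_variation:

import tensorflow as tf

def total_variation_loss(images):
    # High-frequency components: the image shifted by one pixel minus the original
    dh = images[:, 1:, :, :] - images[:, :-1, :, :]   # vertical differences
    dw = images[:, :, 1:, :] - images[:, :, :-1, :]   # horizontal differences
    # Sum of the L1 norms of the differences
    return tf.reduce_sum(tf.abs(dh)) + tf.reduce_sum(tf.abs(dw))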

Another option is to replace the constant zeros used for padding with reflected values. For example, if we pad the array [10, 8, 9] with zeros, we get [0, 10, 8, 9, 0], and the values change very abruptly between 0 and its neighbors.

If we use reflection padding instead, the padded array becomes [8, 10, 8, 9, 8], which provides a smoother transition at the boundary. However, Keras Conv2D does not support reflection padding, so we need to create a custom Conv2D using TensorFlow. The following code snippet adds reflection padding to the input tensor before the convolution:

class Conv2D(layers.Layer):
    def __init__(self, in_channels, out_channels, kernel=3, use_relu=True):
        super(Conv2D, self).__init__()
        self.kernel = kernel
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.use_relu = use_relu

    def build(self, input_shape):
        self.w = self.add_weight(shape=[
                    self.kernel,
                    self.kernel,
                    self.in_channels,
                    self.out_channels],
                    initializer='glorot_normal',
                    trainable=True, name='kernel')

        self.b = self.add_weight(shape=(
                    self.out_channels,),
                    initializer='zeros',
                    trainable=True,
                    name='bias')

    @tf.function
    def call(self, inputs):
        # Reflection-pad H and W by 1 pixel (matching the 3x3 kernel);
        # the batch and channel dimensions are left untouched
        padded = tf.pad(inputs, [[0, 0], [1, 1], [1, 1], [0, 0]], mode='REFLECT')
        # perform conv2d using the low-level API
        output = tf.nn.conv2d(padded, self.w, strides=1, padding='VALID') + self.b
        if self.use_relu:
            output = tf.nn.relu(output)
        return output
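As a quick check of the padding behaviour on the 1-D example above:

x = tf.constant([10., 8., 9.])
print(tf.pad(x, [[1, 1]], mode='REFLECT'))
# tf.Tensor([ 8. 10.  8.  9.  8.], shape=(5,), dtype=float32)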

Decoder structure and implementation

Although the encoder uses four VGG layers (block1_conv1 to block4_conv1), AdaIN uses only the last encoder layer, block4_conv1. Therefore, the input tensor of the decoder has the same shape as the activation output of block4_conv1. The decoder consists of convolutional and upsampling layers, as shown in the following code:

    def build_decoder(self):
        # Conv2D is the custom reflection-padded layer defined above;
        # UpSampling2D is tf.keras.layers.UpSampling2D
        block = tf.keras.Sequential([
            Conv2D(512, 256, 3),
            UpSampling2D((2, 2)),
            Conv2D(256, 256, 3),
            Conv2D(256, 256, 3),
            Conv2D(256, 256, 3),
            Conv2D(256, 128, 3),
            UpSampling2D((2, 2)),
            Conv2D(128, 128, 3),
            Conv2D(128, 64, 3),
            UpSampling2D((2, 2)),
            Conv2D(64, 64, 3),
            Conv2D(64, 3, 3, use_relu=False)
        ], name='decoder')
        return block

Tips: The preceding code uses the custom Conv2D with reflection padding. ReLU activation is used in all layers except the output layer, which has no nonlinear activation function.

Now we have AdaIN, the encoder, and the decoder. Next, we can move on to image preprocessing.

VGG preprocessing

Just like in the neural style transfer we built before, we need to preprocess the image by converting the color channels to BGR and subtracting the color means. The code is as follows:

    def preprocess(self, image):
        # RGB to BGR
        image = tf.reverse(image, axis=[-1])
        return tf.keras.applications.vgg19.preprocess_input(image)

We could do the reverse in post-processing by adding back the color means and reversing the color channels. However, this is something the decoder can learn, since the color means are equivalent to a bias in the output layer. So we let training take care of the post-processing, and all we need to do is clip the pixels to the [0, 255] range:

    def postprocess(self, image):
        return tf.clip_by_value(image, 0., 255.)

Now that we have all the components ready, all that remains is to put them together to create the STN and the training pipeline.

Implementing the style transfer network

Constructing the STN is as simple as connecting the encoder, AdaIN, and decoder, as shown in the architecture diagram above. The STN is also the model we will use to perform inference. The code to do this is as follows:

        """ Style Transfer Network """
        content_image = self.preprocess(content_image_input)
        style_image = self.preprocess(style_image_input)

        self.content_target = self.encoder(content_image)
        self.style_target = self.encoder(style_image)

        adain_output = AdaIN()([self.content_target[-1], self.style_target[-1]])
        
        self.stylized_image = self.postprocess(self.decoder(adain_output))

        self.stn = Model([content_image_input, style_image_input], self.stylized_image)

The content and style images are preprocessed and fed into the encoder. The last feature layer, block4_conv1, goes into AdaIN(). The stylized features then go into the decoder to generate the RGB stylized image.
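
Once trained, stylization is a single forward pass through self.stn. A usage sketch (the image shapes and the model variable name here are assumptions):

# Hypothetical inference call; pixel values are expected in the [0, 255] range
content_batch = tf.random.uniform((1, 256, 256, 3), 0., 255.)
style_batch = tf.random.uniform((1, 256, 256, 3), 0., 255.)
stylized = model.stn([content_batch, style_batch])   # same shape as content_batch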

Real-time arbitrary style transfer model training

As in neural style transfer, content loss and style loss are calculated from activations extracted by the fixed VGG. The content loss is also an $L_2$ norm, but now the content features of the generated stylized image are compared with AdaIN's output rather than with the features of the content image, as shown in the following code; this makes convergence faster:

            content_loss = tf.reduce_sum((output_features[-1]-adain_output)**2)

For the style loss, the commonly used Gram matrix is replaced with the $L_2$ norms of the means and standard deviations of the activations. This produces results similar to the Gram matrix. The style loss function is:


$$\mathcal{L}_s = \sum_{i=1}^{L} \left\| \mu(\phi_i(stylized)) - \mu(\phi_i(style)) \right\|_2 + \left\| \sigma(\phi_i(stylized)) - \sigma(\phi_i(style)) \right\|_2$$

Here, $\phi_i$ denotes the $i$-th VGG-19 layer used to compute the style loss.

As in the AdaIN layer, we use tf.nn.moments to calculate the statistics, compute the $L_2$ norm between the features of the stylized and style images, and average the losses over the style layers, as shown below:

    def calc_style_loss(self, y_true, y_pred):
        n_features = len(y_true)
        epsilon = 1e-5
        loss = []

        for i in range(n_features):
            mean_true, var_true = tf.nn.moments(y_true[i], axes=(1, 2), keepdims=True)
            mean_pred, var_pred = tf.nn.moments(y_pred[i], axes=(1, 2), keepdims=True)
            std_true = tf.sqrt(var_true + epsilon)
            std_pred = tf.sqrt(var_pred + epsilon)
            mean_loss = tf.reduce_sum(tf.square(mean_true - mean_pred))
            std_loss = tf.reduce_sum(tf.square(std_true - std_pred))
            loss.append(mean_loss + std_loss)

        return tf.reduce_mean(loss)

The final step is to write the training steps:

    @tf.function
    def train_step(self, train_data):
        with tf.GradientTape() as tape:
            adain_output, output_features, style_target = self.training_model(train_data)

            content_loss = tf.reduce_sum((output_features[-1] - adain_output)**2)
            style_loss = self.style_weight * self.calc_style_loss(style_target, output_features)
            loss = content_loss + style_loss

        # Only the decoder is trainable; the VGG encoders are frozen
        gradients = tape.gradient(loss, self.decoder.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.decoder.trainable_variables))

        return content_loss, style_loss

    def train(self, train_generator, test_generator, steps, interval=500, style_weight=1e4):
        self.style_weight = style_weight

        for i in range(steps):
            train_data = next(train_generator)
            content_loss, style_loss = self.train_step(train_data)

            if i % interval == 0:
                ckpt_save_path = self.ckpt_manager.save()
                print('Saving checkpoint for step {} at {}'.format(i, ckpt_save_path))
                print(f"content_loss {float(content_loss):.4f}, style_loss {float(style_loss):.4f}")
                val_data = next(test_generator)
                self.stylized_images = self.stn(val_data)
                self.plot_images(val_data[0], val_data[1], self.stylized_images)

Tips: We fix the content weight to 1 and adjust only the style weight, which in our example is set to 1e4. In the network architecture diagram shown above, it looks as if there are three networks to train, but two of them are the VGG encoders with frozen parameters, so the only trainable network is the decoder.
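
Note that train_step references self.training_model, which is not shown above. Based on the architecture described, it presumably re-encodes the stylized image with the fixed VGG encoder and returns everything the losses need; a hedged sketch (the wiring and names are assumptions), placed next to the STN construction, might look like this:

        """ Training model (sketch, assumptions only) """
        # Re-encode the stylized image with the fixed VGG encoder
        output_features = self.encoder(self.preprocess(self.stylized_image))

        # Outputs: AdaIN output (content target), features of the stylized image,
        # and features of the style image, matching what train_step unpacks
        self.training_model = Model(
            [content_image_input, style_image_input],
            [adain_output, output_features, self.style_target])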

More results

In the training example, we use faces as content images and the CycleGAN monet2photo dataset as style images. Although Monet's paintings belong to a single artistic style, from the point of view of style transfer each style image is a unique style. The monet2photo dataset contains 1193 style images, which means we train the network with 1193 different styles! The image below shows examples of images generated by our network:

The figure above shows style transfer performed at inference time with style images that were not used during training (i.e., the test style dataset). Each style transfer is a single forward pass, which is much faster than the iterative optimization of the original neural style transfer algorithm.

Series links

TensorFlow 2 implements neural style transfer

Improved neural style transfer