Preface
Neural style transfer has attracted a great deal of interest in industry since it was first proposed. Some websites let users upload photos for style transfer, and some even use it for merchandising (DIY digital custom oil-painting photos, and so on).
Neural style transfer
An image can be divided into content and style. The content describes the composition of the image, such as the flowers and trees in it. The style refers to the details, such as the texture of a lake or the color of the trees. Photographs of the same building taken at different times of day, with different tones and brightness, can be seen as having the same content but different styles.
In the paper published by Gatys et al., a CNN was used to transfer the artistic style of one image onto another:
Unlike most deep learning models, which require large amounts of training data, neural style transfer needs only two images: a content image and a style image. A pre-trained CNN, such as VGG, can be used to transfer the style from the style image onto the content image.
As shown above, (A) is the content image, while (B) to (D) show the content image stylized with different style images, and the results are amazing! Some people even use the algorithm to create and sell artwork. There are websites and apps where you can upload photos for style transfer without understanding the underlying principles, but as technologists we naturally want to implement the model ourselves.
Using VGG to extract features
A classifier CNN can be divided into two parts: the first part, composed mainly of convolutional layers, is called the feature extractor; the latter part consists of several fully connected layers that output class probability scores and is called the classifier head. A CNN pre-trained for classification on ImageNet can also be used for other tasks; this is called transfer learning, because we transfer or reuse some of the learned knowledge in a new network or application. Reconstructing an image with a CNN takes two steps:
- The image is passed forward through the CNN to extract its features.
- A randomly initialized input is trained so that its features best match the reference features from step 1.
In normal network training, the input image is fixed and backpropagated gradients are used to update the network weights. In neural style transfer, all the network layers are frozen and the gradients are used to modify the input instead. The original paper used VGG19, and Keras provides a pre-trained model we can use. VGG's feature extractor consists of five blocks, each ending with a downsampling layer. Each block has two to four convolutional layers, and the whole VGG19 network has 16 convolutional layers and 3 fully connected layers.
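To make this contrast concrete, here is a minimal sketch of the setup (an illustration only, not the article's actual code; the input size and the placeholder loss are assumptions): the network weights stay fixed, and the image itself is the trainable variable.
import tensorflow as tf

# Frozen feature extractor: the VGG weights are never updated
vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
vgg.trainable = False

# The "trainable parameter" is the input image itself
image = tf.Variable(tf.random.uniform((1, 224, 224, 3)))

with tf.GradientTape() as tape:
    features = vgg(image)                # forward pass through the frozen VGG
    loss = tf.reduce_sum(features ** 2)  # placeholder loss, for illustration only
grad = tape.gradient(loss, image)        # gradient w.r.t. the image, not the weights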
In the following sections, we will implement content reconstruction and then extend it to perform style transfer. Here is the code to extract the output of the block4_conv2 layer using the pre-trained VGG.
Since we only need to extract features, we pass include_top=False when instantiating the VGG model to discard the classifier head:
import tensorflow as tf
from tensorflow.keras.models import Model

# Load VGG19 pre-trained on ImageNet, without the classifier head
vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
# Use the output of block4_conv2 as the content features
content_layers = ['block4_conv2']
content_outputs = [vgg.get_layer(x).output for x in content_layers]
model = Model(vgg.input, content_outputs)
A pre-trained Keras CNN model is made up of two parts. The bottom, made up of convolutional layers, is usually called the feature extractor, while the top is the classifier head made up of fully connected layers. Because we only want to extract features and don't care about the classifier, we set include_top=False when instantiating the VGG model.
Loading the images
First you need to load the content image and style image:
import numpy as np
from PIL import Image

def scale_image(image):
    # Resize so that the longest side is at most 512 pixels
    MAX_DIM = 512
    scale = np.max(image.shape) / MAX_DIM
    new_shape = tf.cast(np.array(image.shape[:2]) / scale, tf.int32)
    image = tf.image.resize(image, new_shape)
    # Scale pixel values to [0., 1.], as assumed later in the text
    return image / 255.

content_image = scale_image(np.asarray(Image.open('7.jpg')))
style_image = scale_image(np.asarray(Image.open('starry-night.jpg')))
VGG preprocessing
The Keras pre-trained model expects input images in BGR format with pixel values in the range [0, 255]. Therefore, the first step is to reverse the color channels, converting RGB to BGR. VGG also subtracts a different mean value from each color channel. You can use tf.keras.applications.vgg19.preprocess_input() for this preprocessing; inside preprocess_input(), the values 103.939, 116.779, and 123.68 are subtracted from the pixel values of the B, G, and R channels, respectively.
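As a rough illustration of the preprocessing just described (a sketch for clarity, not the Keras implementation itself), the channel flip and mean subtraction could be written manually as:
# Manual illustration of the preprocessing described above: flip RGB to BGR,
# then subtract the per-channel means from an image with values in [0, 255].
def manual_vgg_preprocess(rgb_image):
    bgr = tf.reverse(rgb_image, axis=[-1])                    # RGB -> BGR
    channel_means = tf.constant([103.939, 116.779, 123.68])   # B, G, R means
    return bgr - channel_means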
The following is the forward-pass code, which preprocesses the image before the forward evaluation and then feeds it into the model to return the content features. We extract the content features of the content image and use them as our target:
def extract_features(image):
    image = tf.keras.applications.vgg19.preprocess_input(image * 255.)
    content_ref = model(image)
    return content_ref

content_image = tf.reverse(content_image, axis=[-1])
content_ref = extract_features(content_image)
In the code, since the image has been normalized to [0., 1.], we need to restore it to [0., 255.] by multiplying it by 255. Then we create a randomly initialized input, which will also become the stylized image:
image = tf.Variable(tf.random.normal(shape=content_image.shape))
Next, we will use backpropagation to reconstruct the image from the content features.
Reconstruction of content
In each training step, the image is fed into the frozen VGG to extract its content features, which are compared against the target content features. The $L_2$ loss is computed for each feature layer:
def calc_loss(y_true, y_pred):
    loss = [tf.reduce_sum((x - y) ** 2) for x, y in zip(y_pred, y_true)]
    return tf.reduce_mean(loss)
Calculate GradientTape using tF.gradienttape (). In normal neural network training, gradient updating is applied to the trainable variable, namely the weight of neural network. However, in neural style transfer, gradient is applied to the image. Then, the image value is clipped between [0., 1.], as shown below:
for i in range(1, steps + 1):
    with tf.GradientTape() as tape:
        content_features = extract_features(image)
        loss = calc_loss(content_features, content_ref)
    grad = tape.gradient(loss, image)
    # Apply the gradient to the image rather than to the network weights
    optimizer.apply_gradients([(grad, image)])
    image.assign(tf.clip_by_value(image, 0., 1.))
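One detail not shown in these snippets is the optimizer itself; a reasonable choice (an assumption here, not necessarily the original article's setting) is Adam:
# Hypothetical optimizer setup for the reconstruction loop above;
# the learning rate value is an assumption, not taken from the original article.
optimizer = tf.optimizers.Adam(learning_rate=0.02)
steps = 2000   # the text below reports results after 2,000 training steps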
Using block1_1 to reconstruct the image, after 2,000 training steps we obtain the following reconstructed content image:
Using block4_1 instead, after 2,000 training steps we obtain this reconstructed content image:
You can see that with layer block4_1, you start to lose details, such as the shape of the leaves. When we use block5_1, we see that almost all the details disappear and are filled with some random noise:
If we look closely, though, the structure and edges of the leaves are still preserved and in their proper places. Now that we have extracted the content features, the next step is to extract the style features.
Reconstruct style with Gram matrix
As you can see from the content reconstruction, the feature maps (especially those of the first few layers) contain both style and content. So how do we extract style features from an image? The method is to use the Gram matrix, which computes the correlation between different filter responses. Assume the activations of a convolutional layer have shape (H, W, C), where H and W are the spatial dimensions and C is the number of channels, equal to the number of filters; each filter detects a different image feature.
When feature maps share common features, such as colors and edges, they can be considered to have the same texture. For example, if we feed an image of a meadow into a convolutional layer, the filters that detect vertical lines and the color green will produce larger responses in their feature maps. We can therefore use the correlations between feature maps to represent the textures in an image.
To create a Gram matrix from activations of shape (H, W, C), we first reshape them into C vectors, each a flattened feature map of size H×W. Taking the dot products of the C vectors gives a symmetric C×C Gram matrix. The detailed steps for computing the Gram matrix in TensorFlow are as follows:
- Use tf.squeeze() to change the batched shape (1, H, W, C) to (H, W, C);
- Transpose the tensor to convert the shape from (H, W, C) to (C, H, W);
- Flatten the last two dimensions to (C, H x W);
- Perform the dot product of the features to create a Gram matrix of shape (C, C);
- Normalize by dividing the matrix by the number of elements in each flattened feature map (H * W).
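These steps compute the normalized Gram matrix. Written as a formula, with $F \in \mathbb{R}^{C \times (HW)}$ denoting the flattened feature maps, each entry is

$$G_{ij} = \frac{1}{HW}\sum_{k=1}^{HW} F_{ik} F_{jk}$$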
The code for calculating the Gram matrix is as follows:
def gram_matrix(x):
    x = tf.transpose(tf.squeeze(x), (2, 0, 1))
    x = tf.keras.backend.batch_flatten(x)
    num_points = x.shape[-1]
    gram = tf.linalg.matmul(x, tf.transpose(x)) / num_points
    return gram
We can use this function to obtain the Gram matrix for each of the specified VGG style layers. We then apply an $L_2$ loss between the Gram matrices of the target image and the reference image. The loss function is the same one used for content reconstruction. The code to create the list of Gram matrices is as follows:
def extract_features(image):
    image = tf.keras.applications.vgg19.preprocess_input(image * 255.)
    styles = self.model(image)
    styles = [self.gram_matrix(s) for s in styles]
    return styles
The following images are reconstructed from the style features of different VGG layers:
In the style image reconstructed from block1_1, the content information disappears completely and only high-frequency texture details remain. A higher layer, block3_1, shows some curly shapes:
These shapes capture the higher-level style of the input image. The Gram-matrix loss is a sum of squared errors rather than a mean squared error, so higher-level style layers, whose Gram matrices are larger, carry a higher inherent weight. This allows higher-level style representations, such as brush strokes, to be transferred. If mean squared error were used instead, low-level style features (such as textures) would stand out more visually and could look like high-frequency noise.
Implementing neural style transfer
We can now combine the code from the content and style reconstructions to perform neural style transfer.
We start by creating a model that extracts two sets of features, one for content and one for style. Content reconstruction uses the block5_conv1 layer, while the five layers from block1_conv1 to block5_conv1 are used to capture style at different levels of the hierarchy, as shown below:
vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
default_content_layers = ['block5_conv1']
default_style_layers = ['block1_conv1', 'block2_conv1', 'block3_conv1', 'block4_conv1', 'block5_conv1']
content_layers = content_layers if content_layers else default_content_layers
style_layers = style_layers if style_layers else default_style_layers
self.content_outputs = [vgg.get_layer(x).output for x in content_layers]
self.style_outputs = [vgg.get_layer(x).output for x in style_layers]
self.model = Model(vgg.input, [self.content_outputs, self.style_outputs])
Before the training loop begins, we extract the content and style features from the respective images to use as targets. Although we could start from a randomly initialized input, as in the content and style reconstructions, training converges faster if we start from the content image:
content_ref, _ = self.extract_features(content_image)
_, style_ref = self.extract_features(style_image)
We then calculate the content and style losses and add them together:
def train_step(self, image, content_ref, style_ref):
    with tf.GradientTape() as tape:
        content_features, style_features = self.extract_features(image)
        content_loss = self.content_weight * self.calc_loss(content_ref, content_features)
        style_loss = self.style_weight * self.calc_loss(style_ref, style_features)
        loss = content_loss + style_loss
    grad = tape.gradient(loss, image)
    self.optimizer.apply_gradients([(grad, image)])
    image.assign(tf.clip_by_value(image, 0., 1.))
    return content_loss, style_loss
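The snippets above define only train_step(); for completeness, here is a minimal sketch of an outer loop that drives it (the step count, logging interval, and the assumption that this runs inside the same class are mine, not taken from the article):
# Hypothetical driver loop; assumes it runs inside the same class that defines
# train_step(), and that content_ref / style_ref were extracted as shown above.
image = tf.Variable(content_image)   # start from the content image, as suggested above
for step in range(1, steps + 1):
    content_loss, style_loss = self.train_step(image, content_ref, style_ref)
    if step % 100 == 0:
        print(f'step {step}: content loss {content_loss.numpy():.2f}, '
              f'style loss {style_loss.numpy():.2f}')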
Results
Here are four stylized images generated with different weights and content layers:
You can change the weights and layers to create the desired styles.
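For example, the weights consumed by train_step() could be set like this (the values below are arbitrary illustrations, not the settings used for the images above):
# Hypothetical weight settings: a larger style_weight relative to content_weight
# pushes the output toward the style image, while a larger content_weight
# preserves more of the original content.
self.content_weight = 1e-3
self.style_weight = 1.0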
Of course, this model also has a drawback: it takes several minutes to generate an image, so real-time style transfer is not possible. Improved models that address this will be discussed later.