Autoregressive models
Deep neural network generative algorithms fall into three main categories:
- Generative Adversarial Network (GAN)
- Variational Autoencoder (VAE)
- Autoregressive models
VAE is covered in detail in "VAE in Detail and Implementation"; for GAN, see "Deep Convolutional Generative Adversarial Network (DCGAN) Principles and Implementation". This article introduces the lesser-known autoregressive models. Although autoregression is not common in image generation, it remains an active area of research: DeepMind's WaveNet uses autoregression to generate realistic audio. This article introduces the autoregressive model and the PixelCNN model.
Introduction to autoregression
In "autoregressive", "auto" means self, while in machine learning terminology "regression" means predicting new values. Taken together, autoregression means using a model to predict new data points based on its own past data points.
Let the probability distribution of an image be $p(x)$, the joint probability distribution of its pixels $p(x_1, x_2, \dots, x_n)$, which is difficult to model directly because of its high dimensionality. Here, we assume that the value of a pixel depends only on the value of the pixels before it. In other words, the current pixel is conditioned only on its preceding pixels, $p(x_i, x_{i-1}) = p(x_i \mid x_{i-1})\, p(x_{i-1})$, so we can approximate the joint probability as the product of conditional probabilities:

$$p(x) \approx p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2) \cdots p(x_n \mid x_{n-1})$$
To take a concrete example, imagine that there is a red apple near the center of the image and that the apple is surrounded by green leaves. In this case, imagine that there are only two possible colors: red and green. $x_1$ is the upper-left pixel, so $p(x_1)$ is the probability that the upper-left pixel is green or red. If $x_1$ is green, the pixel to its right, $x_2$, may also be green, because it may belong to more leaves. But, although less likely, it could also be red.
Keep going, and we will eventually reach a red pixel. Starting from that pixel, the next few pixels are likely to be red as well. Reasoning about one pixel at a time in this way is much easier than having to consider every pixel at once.
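To make the factorization concrete, here is a toy sketch in plain Python that computes the probability of a short binary pixel sequence as a product of conditional probabilities; the probability values are made-up numbers for illustration only.

```python
# Toy illustration of the autoregressive factorization
# p(x) ≈ p(x1) * p(x2 | x1) * p(x3 | x2) * ...
# Two "colors": 0 = green, 1 = red. All probabilities are invented.

p_x1 = {0: 0.7, 1: 0.3}                 # distribution of the first pixel
p_next = {0: {0: 0.8, 1: 0.2},          # p(x_i | x_{i-1} = 0)
          1: {0: 0.1, 1: 0.9}}          # p(x_i | x_{i-1} = 1)

def sequence_probability(pixels):
    # Probability of a pixel sequence under the Markov factorization.
    prob = p_x1[pixels[0]]
    for prev, curr in zip(pixels, pixels[1:]):
        prob *= p_next[prev][curr]
    return prob

print(sequence_probability([0, 0, 1, 1]))  # 0.7 * 0.8 * 0.2 * 0.9 = 0.1008
```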
PixelRNN
PixelRNN was created by DeepMind in 2016. As the RNN (recurrent neural network) in the name suggests, the model uses a type of RNN called long short-term memory (LSTM) to learn the distribution of images. At each LSTM step it reads one row of the image, processes it with a one-dimensional convolution layer, and then feeds the activations to subsequent layers to predict the pixels of that row.
Because LSTM runs slowly, it takes a long time to train and generate samples. Therefore, we won’t study it too much, but instead turn our attention to a variant proposed in the same paper, PixelCNN.
Building a PixelCNN model with TensorFlow 2
PixelCNN consists only of convolution layers, making it much faster than PixelRNN. Here, we will train a simple PixelCNN model using the MNIST dataset.
Input and label
MNIST consists of 28 x 28 x 1 grayscale handwritten digits; each image has only one channel:
In this experiment, the problem is simplified by converting the image to binary data: 0 represents black, 1 represents white:
def binarize(image, label):
    image = tf.cast(image, tf.float32)
    image = tf.math.round(image / 255.)
    return image, tf.cast(image, tf.int32)
This function takes two inputs, an image and a label. The first two lines of the function convert the image to a binary float32 form, i.e. 0.0 or 1.0. We then convert the binarized image to an integer and return it, following the convention of using integers as labels. The returned data, which will be used as the input and the label for network training, are both 28 x 28 x 1 binarized MNIST images; they differ only in data type.
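For completeness, here is a minimal sketch of how the ds_train and ds_test pipelines used later in training might be built with tensorflow_datasets; the use of tfds, the batch size, and the shuffle buffer are assumptions rather than details from the original text.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Load MNIST as (image, label) pairs and apply the binarize() function above.
# Batch size and shuffle buffer size are arbitrary choices for illustration.
mnist_train, mnist_test = tfds.load('mnist', split=['train', 'test'],
                                    as_supervised=True)
ds_train = mnist_train.map(binarize).shuffle(1024).batch(128)
ds_test = mnist_test.map(binarize).batch(128)
```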
Masks
Unlike PixelRNN, which reads the image row by row, PixelCNN slides a convolution kernel across the image from left to right and from top to bottom. When performing convolution to predict the current pixel, a conventional convolution kernel can see the current input pixel and the surrounding pixels, including the pixels that come after the current one, which contradicts the conditional-probability assumption in the introduction.
To avoid this, we need to ensure that the CNN cannot see the input pixel $x_i$ when predicting the output pixel $x_i$.
This is achieved with masked convolution, where a mask is applied to the convolution kernel weights before the convolution is performed. The figure below shows the mask for a 7 x 7 convolution kernel, in which the weights from the center onward are set to 0. This prevents the CNN from seeing the pixel it is predicting (the center of the kernel) and all subsequent pixels. This is called a type A mask and is applied only to the input layer.
Since the current pixel is already hidden in the first layer, we no longer need to hide the center element in later layers. On the contrary, the center of the convolution kernel must be set to 1 so that it can read the features of the current position from the previous layer; this is called a type B mask.
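As a quick illustration (not part of the original code), the snippet below builds and prints type A and type B masks for a 5 x 5 kernel using the same raster-scan rule described above.

```python
import numpy as np

def make_mask(kernel_size, mask_type):
    # Raster-scan mask: zero out everything after the center position;
    # a type A mask also zeros out the center (the current pixel) itself.
    mask = np.ones(kernel_size ** 2, dtype=np.float32)
    center = len(mask) // 2
    mask[center + 1:] = 0
    if mask_type == 'A':
        mask[center] = 0
    return mask.reshape(kernel_size, kernel_size)

print(make_mask(5, 'A'))   # current pixel and everything after it hidden
print(make_mask(5, 'B'))   # current pixel visible, everything after it hidden
```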
Implement a custom layer
Now we will create a custom layer for the masked convolution. In TensorFlow 2.x we can create a custom layer by subclassing the base class tf.keras.layers.Layer, which lets us use it just like any other Keras layer. Here is the basic structure of the custom layer class:
class MaskedConv2D(tf.keras.layers.Layer):
    def __init__(self):
        ...
    def build(self, input_shape):
        ...
    def call(self, inputs):
        ...
        return output
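The build() and call() methods shown next refer to self.mask_type, self.kernel, and self.filters, so the constructor has to store them. The original text does not show the constructor; a plausible sketch (the argument names and defaults are assumptions) looks like this:

```python
class MaskedConv2D(tf.keras.layers.Layer):
    def __init__(self, mask_type, kernel=3, filters=64, **kwargs):
        super().__init__(**kwargs)
        self.mask_type = mask_type   # 'A' for the input layer, 'B' afterwards
        self.kernel = kernel         # size of the (square) convolution kernel
        self.filters = filters       # number of output channels
```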
build() takes the shape of the input tensor as an argument, and we use this information to make sure we create variables of the correct shape. This function runs only once, when the layer is built. We can create the mask as a non-trainable variable or a constant so that TensorFlow knows no gradients need to be back-propagated through it:
def build(self, input_shape):
    self.w = self.add_weight(shape=[self.kernel,
                                    self.kernel,
                                    input_shape[-1],
                                    self.filters],
                             initializer='glorot_normal',
                             trainable=True)
    self.b = self.add_weight(shape=(self.filters,),
                             initializer='zeros',
                             trainable=True)
    # Build the raster-scan mask: hide everything after the center,
    # and also the center itself for a type A mask.
    mask = np.ones(self.kernel**2, dtype=np.float32)
    center = len(mask)//2
    mask[center+1:] = 0
    if self.mask_type == 'A':
        mask[center] = 0
    mask = mask.reshape((self.kernel, self.kernel, 1, 1))
    self.mask = tf.constant(mask, dtype='float32')
call() performs the forward pass. In the masked convolution layer, we multiply the weights by the mask, which sets the values in the lower half of the kernel to zero, before performing the convolution with the low-level tf.nn API:
def call(self, inputs):
    masked_w = tf.math.multiply(self.w, self.mask)
    output = tf.nn.conv2d(inputs, masked_w, 1, "SAME") + self.b
    return output
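A quick sanity check of the layer (using the hypothetical constructor sketched earlier, with illustrative shapes) might look like this:

```python
# Apply a type A masked convolution to a batch of binarized 28 x 28 x 1 images.
layer = MaskedConv2D(mask_type='A', kernel=7, filters=128)
dummy = tf.zeros((8, 28, 28, 1))
print(layer(dummy).shape)   # expected: (8, 28, 28, 128)
```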
The network architecture
The PixelCNN architecture is very simple. After the first 7 x 7 Conv2D layer with a type A mask, there are several residual blocks whose convolutions use a type B mask:
The following figure illustrates the residual block architecture used in PixelCNN:
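One plausible way to assemble this residual block and the overall SimplePixelCnn model from the MaskedConv2D layer above is sketched below; the layer counts, filter sizes, and the output head are assumptions based on the description, not the author's exact code.

```python
class ResidualBlock(tf.keras.layers.Layer):
    # 1x1 -> 3x3 -> 1x1 masked (type B) convolutions with a skip connection.
    def __init__(self, filters=128, **kwargs):
        super().__init__(**kwargs)
        self.conv1 = MaskedConv2D(mask_type='B', kernel=1, filters=filters // 2)
        self.conv2 = MaskedConv2D(mask_type='B', kernel=3, filters=filters // 2)
        self.conv3 = MaskedConv2D(mask_type='B', kernel=1, filters=filters)

    def call(self, inputs):
        x = tf.nn.relu(self.conv1(inputs))
        x = tf.nn.relu(self.conv2(x))
        x = self.conv3(x)
        return x + inputs                      # skip connection

class SimplePixelCnn(tf.keras.Model):
    def __init__(self, filters=128, n_residual_blocks=5, **kwargs):
        super().__init__(**kwargs)
        self.input_conv = MaskedConv2D(mask_type='A', kernel=7, filters=filters)
        self.blocks = [ResidualBlock(filters) for _ in range(n_residual_blocks)]
        self.out_conv = MaskedConv2D(mask_type='B', kernel=1, filters=filters)
        # Final 1x1 convolution with a sigmoid outputs p(pixel = 1).
        self.prob = tf.keras.layers.Conv2D(1, 1, activation='sigmoid')

    def call(self, inputs):
        x = tf.nn.relu(self.input_conv(inputs))
        for block in self.blocks:
            x = block(x)
        x = tf.nn.relu(self.out_conv(x))
        return self.prob(x)
```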
Cross entropy loss
Cross-entropy loss, also known as log loss, measures the performance of a model whose output is a probability between 0 and 1. The following is the equation for binary cross-entropy loss, where there are only two classes, the label $y$ can be 0 or 1, and $p(x)$ is the model's prediction:

$$\mathrm{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\bigl(y_i \log p(x_i) + (1 - y_i)\log\left(1 - p(x_i)\right)\bigr)$$
In PixelCNN, individual image pixels are used as labels. In binarized MNIST, we have to predict whether the output pixel is 0 or 1, which makes it a classification problem using cross entropy as a loss function.
Finally, to compile and train the neural network, we use binary cross entropy as both the loss and the metric, and RMSprop as the optimizer. There are many different optimizers available; their main difference is how they adjust the learning rate based on past statistics. No single optimizer is best in all situations, so it is advisable to try several.
Compile and train the PixelCNN model:
pixelcnn = SimplePixelCnn()
pixelcnn.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),
    metrics=[tf.keras.metrics.BinaryCrossentropy()])
pixelcnn.fit(ds_train, epochs=10, validation_data=ds_test)
Next, we will generate a new image based on the previous model.
Sample generated image
After training, we can use the model to generate new images through the following steps (a sketch of this sampling loop follows the list):

1. Create an empty tensor with the same shape as the input image and fill it with 1. Feed it into the network to obtain $p(x_1)$, the probability of the first pixel.
2. Sample from $p(x_1)$ and assign the sampled value to pixel $x_1$ in the input tensor.
3. Feed the input into the network again and perform step 2 for the next pixel.
4. Repeat steps 2 and 3 until $x_N$ has been generated.
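A minimal sketch of this pixel-by-pixel sampling loop, assuming the binarized 28 x 28 x 1 setup and the SimplePixelCnn sketched earlier (function and variable names are illustrative):

```python
import numpy as np

def sample_images(model, n_images=8, height=28, width=28):
    # Step 1: start from a tensor filled with 1s; the masks ensure that
    # not-yet-generated pixels are never visible to the network anyway.
    images = np.ones((n_images, height, width, 1), dtype=np.float32)
    for row in range(height):
        for col in range(width):
            # Steps 2-3: predict p(pixel = 1) at the current position ...
            probs = model.predict(images, verbose=0)[:, row, col, 0]
            # ... then sample a binary value and write it back into the input.
            images[:, row, col, 0] = np.random.binomial(1, probs)
    return images

generated = sample_images(pixelcnn)
```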
A major disadvantage of autoregressive models is that generation is slow, because images must be generated pixel by pixel and the process cannot be parallelized. The following images were generated by our PixelCNN model after 100 training epochs. They don't look quite like proper digits yet, but we can now generate new images out of thin air. Better digits can be obtained by training for longer and doing some hyperparameter tuning.