
Using an autoencoder to learn latent variables

Autoencoders were first introduced in the 1980s by Geoffrey Hinton and colleagues. The idea is that a high-dimensional input space contains a lot of redundancy, so it can be compressed into a few low-dimensional variables. Traditional machine learning techniques for reducing input dimensionality include Principal Component Analysis (PCA). In image generation, however, we also want to map the low-dimensional space back to the high-dimensional space. You can think of this as image compression: a raw image is compressed into a file format such as JPEG, which is small and easy to store and transfer, and the computer can then restore the JPEG to its original pixels for display. In other words, the original pixels are compressed into a low-dimensional JPEG representation and then restored to high-dimensional pixels.

Autoencoders are an unsupervised machine learning technique: models are trained without labels. However, since the image itself serves as the label, some call this self-supervised machine learning ("auto" is Greek for "self"). The basic building blocks of an autoencoder are the encoder and the decoder. The encoder reduces the high-dimensional input to a low-dimensional latent (hidden) variable. The decoder converts the latent variable back into high-dimensional space. Encoder-decoder architectures are also used in other machine learning tasks, such as semantic segmentation, where a neural network first learns a representation of an image and then generates pixel-level labels. The following figure shows the general architecture of an autoencoder:

The input and output are images of the same dimensions, and z is a latent vector of lower dimension. The encoder compresses the input to z, and the decoder reverses the process to produce the output image.

The encoder

The encoder is composed of multiple neural network layers. We will construct the encoder for the MNIST dataset, which accepts inputs of size 28×28×1. We need to set the dimension of the latent variable; here it is a one-dimensional vector. The size of the latent variable should be smaller than the input size. It is a hyperparameter; we first try 10, which gives a compression ratio of 28 × 28 / 10 = 78.4.
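The compression ratio above can be checked with a quick calculation:

```python
# Ratio of input dimensionality to latent dimensionality for MNIST
input_dim = 28 * 28 * 1   # flattened 28x28 grayscale image
z_dim = 10                # chosen latent size (a hyperparameter)
compression_ratio = input_dim / z_dim
print(compression_ratio)  # 78.4
```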

This design forces the model to learn the important features and discard minor ones layer by layer, arriving at the 10 most important features. It looks very similar to a CNN classifier, where the size of the feature maps decreases from input to output.

Convolutional layers are used to build the encoder. Earlier CNNs (such as VGG) used max pooling to downsample feature maps, but newer networks tend to achieve this with a stride-2 convolution in the convolutional layer. We will follow that convention and name the latent variable z:

from tensorflow.keras import layers, Model
from tensorflow.keras.layers import Conv2D, Dense, Flatten

def Encoder(z_dim):
    inputs = layers.Input(shape=(28, 28, 1))
    x = inputs
    x = Conv2D(filters=8, kernel_size=(3, 3), strides=2, padding='same', activation='relu')(x)
    x = Conv2D(filters=8, kernel_size=(3, 3), strides=1, padding='same', activation='relu')(x)
    x = Conv2D(filters=8, kernel_size=(3, 3), strides=2, padding='same', activation='relu')(x)
    x = Conv2D(filters=8, kernel_size=(3, 3), strides=1, padding='same', activation='relu')(x)
    x = Flatten()(x)
    out = Dense(z_dim, activation='relu')(x)
    return Model(inputs=inputs, outputs=out, name='encoder')

In a typical CNN architecture, the number of filters increases while the size of the feature maps decreases. Here, however, the goal is to reduce the feature size, so the number of filters remains the same, which is sufficient for simple data such as MNIST. Finally, we flatten the output of the last convolutional layer and feed it into a dense layer that outputs the latent variables.
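As a sanity check, we can instantiate the encoder and confirm that the two stride-2 convolutions reduce the 28×28 input to 7×7 before the dense layer produces the latent vector. This is a sketch assuming TensorFlow 2.x, with the encoder restated so the snippet is self-contained:

```python
# Sanity check: the encoder maps a 28x28x1 image to a z_dim vector.
import numpy as np
from tensorflow.keras import layers, Model
from tensorflow.keras.layers import Conv2D, Dense, Flatten

def Encoder(z_dim):
    inputs = layers.Input(shape=(28, 28, 1))
    x = inputs
    for strides in (2, 1, 2, 1):       # two stride-2 convs: 28 -> 14 -> 7
        x = Conv2D(filters=8, kernel_size=(3, 3), strides=strides,
                   padding='same', activation='relu')(x)
    x = Flatten()(x)                   # 7 * 7 * 8 = 392 features
    out = Dense(z_dim, activation='relu')(x)
    return Model(inputs=inputs, outputs=out, name='encoder')

encoder = Encoder(10)
z = encoder(np.zeros((1, 28, 28, 1), dtype='float32'))
print(z.shape)  # (1, 10)
```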

The decoder

The decoder works essentially in reverse of the encoder, converting low-dimensional latent variables into high-dimensional output that approximates the input image. Here, convolutional layers are used in the decoder to upsample the feature maps from 7×7 to 28×28:

from tensorflow.keras import layers, Model
from tensorflow.keras.layers import Conv2D, Dense, Reshape, UpSampling2D

def Decoder(z_dim):
    inputs = layers.Input(shape=(z_dim,))
    x = inputs
    x = Dense(7 * 7 * 64, activation='relu')(x)
    x = Reshape((7, 7, 64))(x)
    x = Conv2D(filters=64, kernel_size=(3, 3), strides=1, padding='same', activation='relu')(x)
    x = UpSampling2D((2, 2))(x)   # 7 -> 14
    x = Conv2D(filters=32, kernel_size=(3, 3), strides=1, padding='same', activation='relu')(x)
    x = UpSampling2D((2, 2))(x)   # 14 -> 28
    x = Conv2D(filters=32, kernel_size=(3, 3), strides=1, padding='same', activation='relu')(x)
    out = Conv2D(filters=1, kernel_size=(3, 3), strides=1, padding='same', activation='sigmoid')(x)
    return Model(inputs=inputs, outputs=out, name='decoder')

Unlike the encoder, the decoder is not designed to reduce size, so we use more filters to give it more generative power.
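Likewise, we can verify that the decoder maps a latent vector back to a 28×28×1 image. Again a sketch assuming TensorFlow 2.x, with the decoder restated for self-containment:

```python
# Sanity check: the decoder maps a z_dim latent vector to a 28x28x1 image.
import numpy as np
from tensorflow.keras import layers, Model
from tensorflow.keras.layers import Conv2D, Dense, Reshape, UpSampling2D

def Decoder(z_dim):
    inputs = layers.Input(shape=(z_dim,))
    x = Dense(7 * 7 * 64, activation='relu')(inputs)
    x = Reshape((7, 7, 64))(x)
    x = Conv2D(64, (3, 3), strides=1, padding='same', activation='relu')(x)
    x = UpSampling2D((2, 2))(x)   # 7 -> 14
    x = Conv2D(32, (3, 3), strides=1, padding='same', activation='relu')(x)
    x = UpSampling2D((2, 2))(x)   # 14 -> 28
    x = Conv2D(32, (3, 3), strides=1, padding='same', activation='relu')(x)
    out = Conv2D(1, (3, 3), strides=1, padding='same', activation='sigmoid')(x)
    return Model(inputs=inputs, outputs=out, name='decoder')

decoder = Decoder(10)
img = decoder(np.zeros((1, 10), dtype='float32'))
print(img.shape)  # (1, 28, 28, 1)
```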

UpSampling2D interpolates pixels to increase resolution. This is an affine transformation (linear multiplication and addition), so gradients can propagate through it, but its weights are fixed and therefore untrainable. Another popular upsampling method is the transposed convolutional layer, which is trainable but can create checkerboard-like artifacts in the generated image.

Therefore, recent image generation models usually avoid transposed convolutions.
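The difference can be sketched as follows: both layers double a 7×7 feature map, but only Conv2DTranspose has learnable weights (assuming TensorFlow 2.x):

```python
# Compare UpSampling2D (fixed interpolation) with Conv2DTranspose (learned).
import numpy as np
from tensorflow.keras.layers import Conv2DTranspose, UpSampling2D

x = np.zeros((1, 7, 7, 32), dtype='float32')

up = UpSampling2D((2, 2))                          # no trainable weights
tconv = Conv2DTranspose(32, (3, 3), strides=2, padding='same')

y_up, y_tconv = up(x), tconv(x)
print(y_up.shape, y_tconv.shape)                   # both (1, 14, 14, 32)
print(len(up.weights), len(tconv.weights))         # 0 vs 2 (kernel and bias)
```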

Build autoencoder

Putting the encoder and decoder together creates the autoencoder. First, we instantiate the encoder and decoder. Then we feed the encoder's output into the decoder's input and instantiate a Model using the encoder's input and the decoder's output:

z_dim = 10
encoder = Encoder(z_dim)
decoder = Decoder(z_dim)
model_input = encoder.input
model_output = decoder(encoder.output)
autoencoder = Model(model_input, model_output)

For training, the L2 loss is used, implemented as the mean squared error (MSE) between each pixel of the output and the expected result. In this example, some callbacks have been added that are invoked at the end of each training epoch:

  1. ModelCheckpoint(monitor='val_loss') saves the model whenever the current validation loss is lower than that of previous epochs.
  2. EarlyStopping(monitor='val_loss', patience=10) stops training early if the validation loss does not improve within 10 epochs.
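A minimal training setup along these lines might look as follows. The small dense autoencoder, random data, checkpoint filename, and epoch counts are illustrative placeholders; substitute the convolutional autoencoder and real MNIST data:

```python
# Sketch: MSE (L2) loss plus the two callbacks described above.
import numpy as np
from tensorflow.keras import layers, Model
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Stand-in model and data so the snippet runs end to end
inp = layers.Input(shape=(784,))
z = layers.Dense(10, activation='relu')(inp)
out = layers.Dense(784, activation='sigmoid')(z)
autoencoder = Model(inp, out)

x = np.random.rand(64, 784).astype('float32')

callbacks = [
    ModelCheckpoint('autoencoder.h5', monitor='val_loss', save_best_only=True),
    EarlyStopping(monitor='val_loss', patience=10),
]

autoencoder.compile(optimizer='adam', loss='mse')   # pixel-wise MSE = L2 loss
history = autoencoder.fit(x, x,                     # the image is its own label
                          validation_split=0.25,
                          epochs=2, batch_size=16,
                          callbacks=callbacks, verbose=0)
```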

The resulting image is as follows:

Generate images from latent variables

So, what is an autoencoder good for? One application is image denoising: add some noise to the input image and train the model to produce a clean image.
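The denoising setup can be sketched like this; the images and noise level are illustrative stand-ins:

```python
# Denoising autoencoder data preparation: corrupt the inputs with Gaussian
# noise, and keep the clean images as the training target.
import numpy as np

x_clean = np.random.rand(8, 28, 28, 1).astype('float32')   # stand-in images
x_noisy = x_clean + 0.3 * np.random.randn(*x_clean.shape)
x_noisy = np.clip(x_noisy, 0.0, 1.0)   # keep pixels in the valid range

# The model is then trained with noisy inputs and clean targets, e.g.
# autoencoder.fit(x_noisy, x_clean, ...)
print(x_noisy.shape)  # (8, 28, 28, 1)
```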

If you are interested in generating images with an autoencoder, you can ignore the encoder and use only the decoder to sample from the latent variables. Our first challenge is to determine how to sample from the latent variables.

For illustration, train another autoencoder with z_dim = 2 so that we can explore the latent space in two dimensions:

The graph is generated by passing 1,000 samples to the trained encoder and plotting the two latent variables on a scatter plot. The color bar on the right indicates the value of the digit label. We can observe from the figure:

The latent variables are not evenly distributed across categories. Clusters that are completely separated from the other categories can be seen in the upper left and upper right, while the categories in the center of the graph are more densely packed and overlap each other.

In the figure below, the images are generated by sampling the latent variables on a grid over [-5, +5] with a spacing of 1.0:
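The grid sampling itself can be sketched as follows. The placeholder dense decoder here is untrained, so only the sampling logic (not image quality) is being illustrated:

```python
# Sample the 2-D latent space on a grid and decode each point into an image.
import numpy as np
from tensorflow.keras import layers, Model

z_dim = 2
inp = layers.Input(shape=(z_dim,))
x = layers.Dense(28 * 28, activation='sigmoid')(inp)   # placeholder decoder
out = layers.Reshape((28, 28, 1))(x)
decoder = Model(inp, out)

# Grid over [-5, +5] with spacing 1.0 in each latent dimension
grid = np.array([[zx, zy] for zx in np.arange(-5, 6, 1.0)
                          for zy in np.arange(-5, 6, 1.0)], dtype='float32')
images = decoder.predict(grid, verbose=0)
print(grid.shape, images.shape)  # (121, 2) (121, 28, 28, 1)
```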

We can see that the digits 0 and 1 are well represented in the sample distribution and are drawn clearly, but the digits in the middle are fuzzy, and some are even missing from the samples.

We can also add controls to the .ipynb code that let you slide the latent variable values to generate images interactively.