A fully convolutional network (FCN) uses a convolutional neural network to transform image pixels into pixel categories [36]. Unlike the convolutional neural networks introduced earlier, a fully convolutional network transforms the height and width of intermediate-layer feature maps back to the size of the input image through transposed convolution layers, so that the predictions correspond one-to-one with the input image in the spatial dimensions (height and width): given a position in the spatial dimensions, the output in the channel dimension is the category prediction for the pixel at that position.

Let’s first import the packages and modules needed for the experiment, and then explain what a transposed convolution layer is.

In [1]: %matplotlib inline
        import d2lzh as d2l
        from mxnet import gluon, image, init, nd
        from mxnet.gluon import data as gdata, loss as gloss, model_zoo, nn
        import numpy as np
        import sys

9.10.1 Transposed convolution layer

As the name implies, the transposed convolution layer gets its name from the transpose operation on matrices. In fact, convolution can be implemented by matrix multiplication. In the example below, we define an input X with height and width of 4, and a convolution kernel K with height and width of 3. We print the output of the two-dimensional convolution operation and the convolution kernel. As you can see, the output is 2 in height and 2 in width.

In [2]: X = nd.arange(1, 17).reshape((1, 1, 4, 4))
        K = nd.arange(1, 10).reshape((1, 1, 3, 3)) 
        conv = nn.Conv2D(channels=1, kernel_size=3) 
        conv.initialize(init.Constant(K))
        conv(X), K

Out[2]: (
         [[[[348. 393.]
            [528. 573.]]]]
         <NDArray 1x1x2x2 @cpu(0)>, 
         [[[[1. 2. 3.]
            [4. 5. 6.]
            [7. 8. 9.]]]]
         <NDArray 1x1x3x3 @cpu(0)>)

Next, we rewrite the convolution kernel K as a sparse matrix W containing a large number of zero elements, i.e., a weight matrix. The shape of the weight matrix is (4, 16), where the non-zero elements come from the elements of the convolution kernel K. We concatenate the input X row by row to get a vector of length 16, then multiply W by the vectorized X to get a vector of length 4. After reshaping it, we get the same result as the convolution operation above. So, in this example, we have implemented convolution using matrix multiplication.

In [3]: W, k = nd.zeros((4, 16)), nd.zeros(11)
        k[:3], k[4:7], k[8:] = K[0, 0, 0, :], K[0, 0, 1, :], K[0, 0, 2, :]
        W[0, 0:11], W[1, 1:12], W[2, 4:15], W[3, 5:16] = k, k, k, k
        nd.dot(W, X.reshape(16)).reshape((1, 1, 2, 2)), W

Out[3]: (
         [[[[348. 393.]
            [528. 573.]]]]
         <NDArray 1x1x2x2 @cpu(0)>,
         [[1. 2. 3. 0. 4. 5. 6. 0. 7. 8. 9. 0. 0. 0. 0. 0.]
          [0. 1. 2. 3. 0. 4. 5. 6. 0. 7. 8. 9. 0. 0. 0. 0.]
          [0. 0. 0. 0. 1. 2. 3. 0. 4. 5. 6. 0. 7. 8. 9. 0.]
          [0. 0. 0. 0. 0. 1. 2. 3. 0. 4. 5. 6. 0. 7. 8. 9.]]
         <NDArray 4x16 @cpu(0)>)

Now let’s describe convolution in terms of matrix multiplication. Let the input vector be x and the weight matrix be W. The forward computation of the convolution can then be implemented as multiplying the function’s input by the weight matrix to output the vector y = Wx. We know that backpropagation follows the chain rule. Since ∇x y = W⊤, the backpropagation function of the convolution can be implemented as multiplying the function’s input by the transposed weight matrix W⊤. The transposed convolution layer simply swaps the forward computation function and the backpropagation function of the convolution layer: these two functions can be regarded as multiplying the function’s input vector by W⊤ and W, respectively.

It is not hard to see that the transposed convolution layer simply swaps the shapes of the input and output of the convolution layer. Let’s continue to describe convolution in terms of matrix multiplication. Let the weight matrix be of shape (4, 16). For an input vector of length 16, the convolution’s forward computation outputs a vector of length 4. If the length of the input vector is 4 and the shape of the transposed weight matrix is (16, 4), then the transposed convolution layer outputs a vector of length 16. In model design, transposed convolution layers are often used to transform smaller feature maps into larger ones. In a fully convolutional network, when the input is a feature map with small height and width, a transposed convolution layer can be used to enlarge the height and width to the size of the input image.
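To make this shape swap concrete, here is a minimal sketch (an illustrative addition reusing the W and X defined in In [3] above; it is not part of the original notebook). Multiplying by W maps the length-16 vector to length 4, and multiplying by its transpose maps the length-4 vector back to length 16, which is exactly the shape behavior of a transposed convolution layer:

    y = nd.dot(W, X.reshape(16))  # forward computation: (4, 16) times (16,) gives (4,)
    z = nd.dot(W.T, y)            # multiply by the transpose: (16, 4) times (4,) gives (16,)
    y.shape, z.shape              # ((4,), (16,))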

Let’s look at an example. We construct a convolution layer conv and let the input X have shape (1, 3, 64, 64). The convolution output Y increases the number of channels to 10, but halves the height and width.

In [4]: conv = nn.Conv2D(10, kernel_size=4, padding=1, strides=2) 
        conv.initialize()

        X = nd.random.uniform(shape=(1, 3, 64, 64)) 
        Y = conv(X)
        Y.shape

Out[4]: (1, 10, 32, 32)

Next, we construct a transposed convolution layer conv_trans by creating a Conv2DTranspose instance. Here we assume that the convolution kernel of conv_trans has the same shape, padding, and stride as those of conv, and that the number of output channels is 3. When the input is the output Y of the convolution layer conv, the output of the transposed convolution layer has the same height and width as the input of the convolution layer: the transposed convolution layer doubles both the height and the width of the feature map.

In [5]: conv_trans = nn.Conv2DTranspose(3, kernel_size=4, padding=1, strides=2) 
        conv_trans.initialize()
        conv_trans(Y).shape

Out[5]: (1, 3, 64, 64)

In some literature, transposed convolution is also called fractionally-strided convolution [12].

9.10.2 Constructing the model

Here we present the basic design of the fully convolutional network model. As shown in Figure 9-11, the fully convolutional network first uses a convolutional neural network to extract image features, then transforms the number of channels into the number of categories through a 1 × 1 convolution layer, and finally transforms the height and width of the feature map to the size of the input image through a transposed convolution layer. The model output has the same height and width as the input image and corresponds to it one-to-one in spatial position: the final output channels contain the category predictions for the pixel at each spatial position.

Figure 9-11 Fully convolutional network

Next, we use a ResNet-18 model pre-trained on the ImageNet data set to extract image features, and denote the network instance as pretrained_net. As we can see, the last two layers of the model’s member variable features are the global average pooling layer GlobalAvgPool2D and the sample flattening layer Flatten, while the output module contains the fully connected layer used for output. These layers are not needed in a fully convolutional network.

In [6]: pretrained_net = model_zoo.vision.resnet18_v2(pretrained=True)
        pretrained_net.features[-4:], pretrained_net.output

Out[6]: (HybridSequential(
           (0): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False,
         ➥ use_global_stats=False, in_channels=512)
           (1): Activation(relu)
           (2): GlobalAvgPool2D(size=(1, 1), stride=(1, 1), padding=(0, 0),
         ➥ ceil_mode=True)
           (3): Flatten
         ), Dense(512 -> 1000, linear))

Next, we create the fully convolutional network instance net. It copies all the layers of the pretrained_net instance’s member variable features except the last two, together with the pre-trained model parameters.

In [7]: net = nn.HybridSequential()
        for layer in pretrained_net.features[:-2]: 
            net.add(layer)

Given an input with a height of 320 and a width of 480, the forward computation of net reduces the input’s height and width to 1/32 of the original, namely 10 and 15.

In [8]: X = nd.random.uniform(shape=(1, 3, 320, 480)) 
        net(X).shape

Out[8]: (1, 512, 10, 15)

Next, we transform the number of output channels into the number of categories of the Pascal VOC2012 data set, 21, through a 1 × 1 convolution layer. Finally, we need to magnify the height and width of the feature map by a factor of 32 to restore them to the height and width of the input image. Recall the method for computing the output shape of a convolution layer described in Section 5.2. Since (320 − 64 + 16 × 2 + 32) / 32 = 10 and (480 − 64 + 16 × 2 + 32) / 32 = 15, we construct a transposed convolution layer with a stride of 32, and set the height and width of the convolution kernel to 64 and the padding to 16. It is not hard to see that if the stride is s, the padding is s/2 (assuming s/2 is an integer), and the height and width of the convolution kernel are 2s, the transposed convolution kernel magnifies the height and width of the input by a factor of s.
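As a quick check of this relationship (an illustrative sketch, not part of the original notebook), a transposed convolution layer with stride s = 32, padding s/2 = 16, and kernel size 2s = 64 maps a 10 × 15 feature map to 320 × 480:

    s = 32
    tconv = nn.Conv2DTranspose(1, kernel_size=2 * s, padding=s // 2, strides=s)
    tconv.initialize()
    tconv(nd.zeros((1, 1, 10, 15))).shape  # (1, 1, 320, 480): height and width scaled by s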

In [9]: num_classes = 21
        net.add(nn.Conv2D(num_classes, kernel_size=1), 
                nn.Conv2DTranspose(num_classes, kernel_size=64, padding=16,
                                   strides=32))

9.10.3 Initializing the transposed convolution layer

We already know that the transposed convolution layer can magnify a feature map. In image processing, we sometimes need to enlarge an image, i.e., upsample it. There are many methods for upsampling, and bilinear interpolation is one of the most common. Simply put, to get the pixel of the output image at coordinate (x, y), we first map this coordinate to a coordinate (x′, y′) on the input image, for example, according to the ratio of the input size to the output size. The mapped x′ and y′ are usually real numbers. Then we find the 4 pixels closest to coordinate (x′, y′) on the input image. Finally, the pixel of the output image at coordinate (x, y) is computed from these 4 pixels on the input image and their relative distances to (x′, y′). Upsampling by bilinear interpolation can be implemented with a transposed convolution layer whose convolution kernel is constructed by the bilinear_kernel function below. Due to space limitations, we only give the implementation of the bilinear_kernel function without discussing the principle of the algorithm.
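To make the 4-pixel weighting concrete, here is a minimal NumPy sketch of bilinear interpolation at a single real-valued coordinate (an illustrative addition; the helper bilinear_sample is hypothetical and not part of the book’s code; it uses the np imported in In [1]):

    def bilinear_sample(im, x, y):
        # the 4 pixels nearest to the real-valued coordinate (x, y)
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        x1 = min(x0 + 1, im.shape[0] - 1)
        y1 = min(y0 + 1, im.shape[1] - 1)
        dx, dy = x - x0, y - y0
        # weight each of the 4 pixels by its relative distance to (x, y)
        return ((1 - dx) * (1 - dy) * im[x0, y0] + dx * (1 - dy) * im[x1, y0]
                + (1 - dx) * dy * im[x0, y1] + dx * dy * im[x1, y1])

    im = np.arange(16.0).reshape((4, 4))
    bilinear_sample(im, 1.5, 2.5)  # average of the 4 pixels im[1:3, 2:4], i.e. 8.5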

In [10]: def bilinear_kernel(in_channels, out_channels, kernel_size):
             factor = (kernel_size + 1) // 2
             if kernel_size % 2 == 1:
                 center = factor - 1
             else:
                 center = factor - 0.5
             og = np.ogrid[:kernel_size, :kernel_size]
             filt = (1 - abs(og[0] - center) / factor) * \
                    (1 - abs(og[1] - center) / factor)
             weight = np.zeros((in_channels, out_channels, kernel_size,
                                kernel_size), dtype='float32')
             weight[range(in_channels), range(out_channels), :, :] = filt
             return nd.array(weight)

Let’s experiment with upsampling by bilinear interpolation using a transposed convolution layer. We construct a transposed convolution layer that doubles the height and width of the input, and initialize its convolution kernel with the bilinear_kernel function.

In [11]: conv_trans = nn.Conv2DTranspose(3, kernel_size=4, padding=1, strides=2) 
         conv_trans.initialize(init.Constant(bilinear_kernel(3, 3, 4)))

Read the image X and denote the upsampled result as Y. To print the image, we need to adjust the position of the channel dimension.

In [12]: img = image.imread('../img/catdog.jpg')
         X = img.astype('float32').transpose((2, 0, 1)).expand_dims(axis=0) / 255
         Y = conv_trans(X)
         out_img = Y[0].transpose((1, 2, 0))

As we can see, the transposed convolution layer doubles both the height and the width of the image. It is worth noting that, apart from the different coordinate scales, the image enlarged by bilinear interpolation looks exactly the same as the original image printed in Section 9.3.

In [13]: d2l.set_figsize()
         print('input image shape:', img.shape) 
         d2l.plt.imshow(img.asnumpy()); 
         print('output image shape:', out_img.shape) 
         d2l.plt.imshow(out_img.asnumpy());

input image shape: (561, 728, 3)
output image shape: (1122, 1456, 3)

In the fully convolutional network, we initialize the transposed convolution layer to perform upsampling by bilinear interpolation. For the 1 × 1 convolution layer, we use Xavier random initialization.

In [14]: net[-1].initialize(init.Constant(bilinear_kernel(num_classes, num_classes,
                                                        64)))
         net[-2].initialize(init=init.Xavier())

9.10.4 Reading a Data Set

We read the data set using the method described in Section 9.9. Here we specify the shape of the randomly cropped output images as 320 × 480: both the height and the width are divisible by 32.

In [15]: crop_size, batch_size, colormap2label = (320, 480), 32, nd.zeros(256**3)
         for i, cm in enumerate(d2l.VOC_COLORMAP):
             colormap2label[(cm[0] * 256 + cm[1]) * 256 + cm[2]] = i
         voc_dir = d2l.download_voc_pascal(data_dir='../data')

         num_workers = 0 if sys.platform.startswith('win32') else 4
         train_iter = gdata.DataLoader(
             d2l.VOCSegDataset(True, crop_size, voc_dir, colormap2label),
             batch_size, shuffle=True, last_batch='discard',
             num_workers=num_workers)
         test_iter = gdata.DataLoader(
             d2l.VOCSegDataset(False, crop_size, voc_dir, colormap2label),
             batch_size, last_batch='discard', num_workers=num_workers)

read 1114 examples
read 1078 examples

9.10.5 Training the model

Now we can start training the model. The loss function and accuracy calculation here are not fundamentally different from those used in image classification. Because we use the channels of the transposed convolution layer to predict pixel categories, the axis=1 (channel dimension) option is specified in SoftmaxCrossEntropyLoss. In addition, the model calculates accuracy based on whether the predicted category of each pixel is correct.

In [16]: ctx = d2l.try_all_gpus()
         loss = gloss.SoftmaxCrossEntropyLoss(axis=1)
         net.collect_params().reset_ctx(ctx)
         trainer = gluon.Trainer(net.collect_params(), 'sgd',
                                 {'learning_rate': 0.1, 'wd': 1e-3})
         d2l.train(train_iter, test_iter, net, loss, trainer, ctx, num_epochs=5)

training on [gpu(0), gpu(1), gpu(2), gpu(3)]
epoch 1, loss 1.3306, train acc 0.726, test acc 0.811, time 17.5 sec
epoch 2, loss 0.6524, train acc 0.811, test acc 0.820, time 16.6 sec
epoch 3, loss 0.5364, train acc 0.838, test acc 0.812, time 16.3 sec
epoch 4, loss 0.4650, train acc 0.856, test acc 0.842, time 16.5 sec
epoch 5, loss 0.4017, train acc 0.872, test acc 0.851, time 16.3 sec

9.10.6 Predicting pixel categories

When predicting, we need to standardize the input image in each channel and convert it into the four-dimensional input format required by the convolutional neural network.

In [17]: def predict(img):
             X = test_iter._dataset.normalize_image(img)
             X = X.transpose((2, 0, 1)).expand_dims(axis=0)
             pred = nd.argmax(net(X.as_in_context(ctx[0])), axis=1)
             return pred.reshape((pred.shape[1], pred.shape[2])) 

To visualize the predicted categories for each pixel, we map the predicted categories back to their annotated colors in the dataset.

In [18]: def label2image(pred):
             colormap = nd.array(d2l.VOC_COLORMAP, ctx=ctx[0], dtype='uint8')
             X = pred.astype('int32')
             return colormap[X, :]

The images in the test data set vary in size and shape. Since the model uses a transposed convolution layer with a stride of 32, when the height or width of an input image is not divisible by 32, the height or width of the output of the transposed convolution layer deviates from the size of the input image. To solve this problem, we can crop multiple rectangular regions whose heights and widths are integer multiples of 32 from the image, and compute the pixels in these regions separately. The union of these regions must completely cover the input image. When a pixel is covered by multiple regions, the average of the outputs of the transposed convolution layer over the different regions can be used as the input to the softmax operation that predicts its category. A sketch of this idea is given below.
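As a hedged sketch of this tiling-and-averaging idea (not the book’s implementation; predict_tiled is a hypothetical helper that reuses net, num_classes, test_iter, and ctx from above, and assumes the image is at least tile_h × tile_w):

    def predict_tiled(img, tile_h=320, tile_w=480):
        h, w = img.shape[0], img.shape[1]
        scores = nd.zeros((num_classes, h, w), ctx=ctx[0])
        counts = nd.zeros((1, h, w), ctx=ctx[0])
        # pick top-left corners so that the union of the tiles covers the image
        rows = sorted(set(list(range(0, h - tile_h + 1, tile_h)) + [h - tile_h]))
        cols = sorted(set(list(range(0, w - tile_w + 1, tile_w)) + [w - tile_w]))
        for r in rows:
            for c in cols:
                X = test_iter._dataset.normalize_image(
                    image.fixed_crop(img, c, r, tile_w, tile_h))
                X = X.transpose((2, 0, 1)).expand_dims(axis=0)
                out = net(X.as_in_context(ctx[0]))[0]  # (num_classes, tile_h, tile_w)
                scores[:, r:r + tile_h, c:c + tile_w] = \
                    scores[:, r:r + tile_h, c:c + tile_w] + out
                counts[:, r:r + tile_h, c:c + tile_w] = \
                    counts[:, r:r + tile_h, c:c + tile_w] + 1
        # average where tiles overlap, then take the per-pixel argmax
        return nd.argmax(scores / counts, axis=0)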

For simplicity, we only read a few large test images and crop a region of shape 320 × 480 starting from the top-left corner of each image: only this region is used for prediction. For each input image, we print the cropped region first, then the prediction result, and finally the labeled categories (see also color plate 20).

In [19]: test_images, test_labels = d2l.read_voc_images(is_train=False) 
         n, imgs = 4, []
         for i in range(n):
             crop_rect = (0, 0, 480, 320)
             X = image.fixed_crop(test_images[i], *crop_rect)
             pred = label2image(predict(X))
             imgs += [X, pred, image.fixed_crop(test_labels[i], *crop_rect)]
         d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n);

Summary

Convolution can be implemented by matrix multiplication.

A fully convolutional network first uses a convolutional neural network to extract image features, then transforms the number of channels into the number of categories through a 1 × 1 convolution layer, and finally transforms the height and width of the feature map to the size of the input image through a transposed convolution layer, so as to output the category of each pixel.

In a fully convolutional network, the transposed convolution layer can be initialized to perform upsampling by bilinear interpolation.

This article is excerpted from Hands-on Deep Learning

This book aims to deliver an interactive learning experience about deep learning. It not only explains the principles of deep learning algorithms, but also demonstrates their implementation and operation. Unlike traditional books, each section of this book is a downloadable and runnable Jupyter notebook that combines text, formulas, images, code, and running results. In addition, readers can access and participate in discussions of the book’s contents. The book is divided into three parts: the first part introduces the background of deep learning, provides preparatory knowledge, and covers the basic concepts and techniques of deep learning; the second part describes the important components of deep learning computation and explains convolutional neural networks and recurrent neural networks, which have made deep learning successful in many fields in recent years; the third part discusses optimization algorithms, examines the important factors that affect the computational performance of deep learning, and presents important applications of deep learning in computer vision and natural language processing. This book is suitable for college students, engineers, and researchers interested in both the methods and the practice of deep learning. Reading it requires an understanding of basic Python programming and of the basics of linear algebra, differentiation, and probability described in the appendices.