CNN Network Architecture Evolution: From LeNet to DenseNet
Convolutional neural networks (CNNs) are a popular class of deep learning models, especially in computer vision. CNNs started with LeNet in the 1990s, went quiet in the early 2000s, and came roaring back with AlexNet in 2012. From ZF Net to VGG, GoogLeNet to ResNet and, more recently, DenseNet, the networks have grown deeper and their architectures more complex, and the tricks used to fight vanishing gradients during backpropagation have become ever more ingenious. During the New Year holiday, let's walk through these classic CNN architectures and appreciate the clever ideas that drove CNN development.
The figure above shows the ILSVRC top-5 error rate over the years; we will introduce these classic networks in the chronological order in which they appeared.
This article covers the following classic convolutional neural networks:
- LeNet
- AlexNet
- ZF
- VGG
- GoogLeNet
- ResNet
- DenseNet
The first: LeNet
Highlights: defined the basic components of a CNN; the originator of the CNN family.
LeNet was proposed in 1998 by LeCun, the father of convolutional neural networks, to solve the visual task of handwritten digit recognition. It fixed the most basic architecture of a CNN: convolution layers, pooling layers and fully connected layers. The LeNet used in today's deep learning frameworks is a simplified and improved LeNet-5 ("-5" refers to its five layers), which differs slightly from the original LeNet; for example, the activation function is changed to the ReLU that is commonly used today.
LeNet-5 differs slightly from today's conv->pool->ReLU routine: it uses conv1->pool1->conv2->pool2 followed by the fully connected layers, but the basic conv->pool pattern is unchanged.
The figure above shows the classic LeNet-5; a layer-by-layer analysis follows:
- The input is a single-channel 28*28 image, represented as a [1,28,28] matrix.
- The first convolution layer conv1 uses 5*5 kernels with a sliding stride of 1 and 20 kernels. The image size becomes 24 (28-5+1=24), so the output matrix after this layer is [20,24,24].
- The first pooling layer pool1 uses a 2*2 window with stride 2, i.e. non-overlapping max pooling. Pooling halves the image size to 12×12, giving an output matrix of [20,12,12].
- The second convolution layer conv2 uses 5*5 kernels with stride 1 and 50 kernels. After convolution the image size becomes 8 (12-5+1=8), giving an output matrix of [50,8,8].
- The second pooling layer pool2 uses a 2*2 window with stride 2, again non-overlapping max pooling. Pooling halves the image size to 4×4, giving an output matrix of [50,4,4].
- pool2 is followed by the fully connected layer fc1 with 500 neurons, followed by a ReLU activation.
- Then comes fc2 with 10 neurons, producing a 10-dimensional feature vector used to train the 10-class digit classifier; it is fed into a softmax to obtain the class probabilities.
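All of the spatial sizes above follow the standard formula for a valid (unpadded) convolution or pooling, output = (input - kernel) / stride + 1. A minimal sketch of that arithmetic, applied to the layer sizes just listed:

# Output-size arithmetic for the LeNet-5 walkthrough above (valid padding, square inputs).
def out_size(in_size, kernel, stride):
    return (in_size - kernel) // stride + 1

c1 = out_size(28, 5, 1)   # conv1: (28 - 5)/1 + 1 = 24
p1 = out_size(c1, 2, 2)   # pool1: (24 - 2)/2 + 1 = 12
c2 = out_size(p1, 5, 1)   # conv2: (12 - 5)/1 + 1 = 8
p2 = out_size(c2, 2, 2)   # pool2: (8 - 2)/2 + 1 = 4
print(c1, p1, c2, p2)     # 24 12 8 4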
LeNet’s Keras implementation:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def LeNet():
    # Note: this implementation uses 32/64 filters and a 100-unit FC layer,
    # rather than the 20/50/500 configuration described in the text above.
    model = Sequential()
    model.add(Conv2D(32, (5, 5), strides=(1, 1), input_shape=(28, 28, 1), padding='valid', activation='relu', kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(64, (5, 5), strides=(1, 1), padding='valid', activation='relu', kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(100, activation='relu'))
    model.add(Dense(10, activation='softmax'))
    return model
Return of the King: AlexNet
AlexNet won the 2012 ImageNet competition by an absolute margin of 10.9 percentage points over the runner-up, and deep learning and convolutional neural networks became famous overnight. Research on deep learning sprang up everywhere, and the emergence of AlexNet can fairly be described as the return of the king of convolutional neural networks.
Highlights:
- Deeper network
- Data augmentation
- ReLU
- dropout
- LRN
Take the AlexNet architecture in the figure above as an example. The first five layers of the network are convolution layers and the last three are fully connected layers; the final softmax outputs 1000 classes. The first two layers are explained in detail below.
- AlexNet contains 5 convolution layers and 3 fully connected layers, far more than LeNet, but the overall flow of the convolutional neural network is unchanged; only a great deal of depth has been added.
- AlexNet targets a 1000-class classification problem, with 256×256 three-channel color images as input. To improve generalization and avoid over-fitting, the authors randomly crop the original 256×256 images into 3×224×224 images, which are fed into the network for training (a minimal sketch of this crop follows this list).
- Since multi-GPU training is used, the network splits into two identical branches after the first convolution layer to accelerate training.
- Analysis of one branch: the first convolution layer conv1 uses 11×11 kernels with a sliding stride of 4 and 48 kernels, and the output matrix after convolution is [48,55,55]. The number 55 looks puzzling at first, and the authors do not explain it: with the usual formula, (224-11)/4+1 ≠ 55. The trick is to pad first and then convolve: pad the image to 227×227, and then (227-11)/4+1 = 55. These outputs pass through the ReLU1 unit, producing two groups of activated feature maps of size 48×55×55, which are then normalized with local response normalization at a scale of 5×5; after the subsequent pooling, the feature maps of this first stage are 48×27×27.
- With the [48,55,55] input, the pooling layer then performs max pooling with a 3×3 window and stride 2, so the pooled image size is (55-3)/2+1 = 27 and the output matrix is [48,27,27]. The remaining layers follow the same pattern and are not described again.
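As a side note to the random cropping mentioned above, here is a minimal NumPy sketch of taking a random 224×224 crop (plus the horizontal flip that AlexNet also uses) from a 256×256 image; the image layout is assumed to be height × width × channels:

import numpy as np

# Minimal sketch of AlexNet-style augmentation: random 224x224 crop plus random horizontal flip.
def random_crop(img, crop=224):
    h, w = img.shape[0], img.shape[1]
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop, :]
    if np.random.rand() < 0.5:          # random horizontal flip
        patch = patch[:, ::-1, :]
    return patch

patch = random_crop(np.zeros((256, 256, 3)))   # -> shape (224, 224, 3)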
Training techniques used by AlexNet:
- Data augmentation to improve the model's generalization ability.
- ReLU instead of Sigmoid to speed up the convergence of SGD.
- Dropout: Dropout is similar in spirit to the ensemble methods of classical machine learning. Neurons in the fully connected layers (the model applies Dropout to the first two fully connected layers) are deactivated with some probability (say 0.5); deactivated neurons no longer take part in forward and backward propagation, so roughly half of the neurons stop working during training. At test time, the output of every neuron is multiplied by 0.5. Dropout effectively alleviates model overfitting.
- Local Response Normalization (LRN): the basic idea is that, given a block of the network of, say, 13×13×256, LRN picks a spatial location, takes the 256 values across the channels at that location, and normalizes them (a minimal sketch follows this list). The motivation is that, for each position in the 13×13 map, we probably do not need too many highly activated neurons. Since then, however, many researchers have found that LRN matters very little, and it is rarely used to train networks today.
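For reference, here is a minimal NumPy sketch of the across-channel normalization just described, using the constants reported in the AlexNet paper (n=5, k=2, alpha=1e-4, beta=0.75); the feature map is assumed to be laid out as H×W×C:

import numpy as np

# Minimal sketch of local response normalization across channels for an H x W x C feature map.
def lrn(x, n=5, k=2.0, alpha=1e-4, beta=0.75):
    out = np.empty_like(x)
    channels = x.shape[-1]
    for i in range(channels):
        lo, hi = max(0, i - n // 2), min(channels, i + n // 2 + 1)
        denom = (k + alpha * np.sum(x[..., lo:hi] ** 2, axis=-1)) ** beta
        out[..., i] = x[..., i] / denom
    return out

y = lrn(np.random.rand(13, 13, 256))   # output has the same shape as the input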
Keras implementation of AlexNet:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def AlexNet():
    model = Sequential()
    # conv1: 11x11 kernels, stride 4, on the 227x227x3 (padded) input
    model.add(Conv2D(96, (11, 11), strides=(4, 4), input_shape=(227, 227, 3), padding='valid', activation='relu', kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
    model.add(Conv2D(256, (5, 5), strides=(1, 1), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
    model.add(Conv2D(384, (3, 3), strides=(1, 1), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(Conv2D(384, (3, 3), strides=(1, 1), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(Conv2D(256, (3, 3), strides=(1, 1), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
    model.add(Flatten())
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1000, activation='softmax'))
    return model
Steady progress: ZF-Net
ZF-Net was the winner of the 2013 ImageNet classification task. Its structure is not a redesign of AlexNet but a re-tuning of its parameters, yet its performance improves considerably over AlexNet. ZF-Net simply changes the first-layer convolution kernel of AlexNet from 11×11 to 7×7, reduces the stride from 4 to 2, and uses 384, 384 and 256 kernels in convolution layers 3, 4 and 5. 2013 was a relatively quiet ImageNet year, and its champion ZF-Net is not as famous as the classic architectures of other years.
Keras implementation of ZF-NET:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def ZF_Net():
    model = Sequential()
    # conv1: 7x7 kernels with stride 2 (AlexNet used 11x11 with stride 4)
    model.add(Conv2D(96, (7, 7), strides=(2, 2), input_shape=(224, 224, 3), padding='valid', activation='relu', kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
    model.add(Conv2D(256, (5, 5), strides=(2, 2), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
    model.add(Conv2D(384, (3, 3), strides=(1, 1), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(Conv2D(384, (3, 3), strides=(1, 1), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(Conv2D(256, (3, 3), strides=(1, 1), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
    model.add(Flatten())
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1000, activation='softmax'))
    return model
Go deeper: VGG-Nets
VGG-Nets were proposed by the Visual Geometry Group (VGG) at the University of Oxford and formed the basis of the winner of the 2014 ImageNet localization task and the runner-up of the classification task. VGG can be regarded as a deepened version of AlexNet: both consist of conv layers followed by FC layers. At the time it looked like a very deep network, with the layer count reaching well past ten, and the paper's title (Very Deep Convolutional Networks for Large-Scale Image Recognition) makes depth the selling point, even though VGG is not really a very deep network from today's perspective.
The table above shows the network configurations of VGG-Net and how the final network came about. To address weight-initialization problems, VGG adopts a pre-training approach that is common for classical neural networks: first train a smaller part of the network, and once that part is stable, gradually deepen it on top of it. The table shows this process from left to right. The network performs best at stage D, so the stage-D network is VGG-16; the network obtained at stage E is VGG-19. VGG-16 means there are 16 conv + FC layers in total, not counting the max-pool layers.
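A minimal Keras sketch of this progressive-deepening idea: initialize a deeper model from an already-trained shallower one. This assumes the shared layers carry matching names in both models, and 'shallow_vgg.h5' is a hypothetical file name used only for illustration:

# Hypothetical sketch: reuse the trained weights of a shallower variant to initialize a deeper one,
# in the spirit of VGG's stage-by-stage training; requires matching layer names in both models.
deep_model = VGG_16()                                     # the deeper network (defined later in this section)
deep_model.load_weights('shallow_vgg.h5', by_name=True)   # matching layers are loaded, new layers keep their random init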
The following figure shows the network structure of VGG-16.
As the figure shows, the structure of VGG-16 is very clean and much deeper than AlexNet; it contains several conv->conv->max_pool blocks. The convolution layers in VGG are all 'same' convolutions, i.e. the output of each convolution has the same spatial size as its input; all down-sampling is done by the max-pooling layers.
The convolutional part is followed by 3 fully connected layers. The number of filters (output channels after convolution) starts at 64 and doubles after each pooling stage, to 128, 256 and finally 512. VGG's notable contribution is showing that small filter sizes combined with a regular convolution-pooling pattern work very well.
Highlights:
- The convolution layers use smaller filter sizes and strides
Compared with AlexNet, the convolution kernels of VGG-Nets are much smaller. AlexNet's first convolution layer uses 11*11 kernels, which are very large; VGG-Nets instead use small 1×1 and 3×3 kernels, which can replace the large filters.
Advantages of 3×3 convolution kernels:
- A stack of 3×3 convolution layers has more nonlinearity than a single large-filter convolution layer, which makes the decision function more discriminative.
- A stack of 3×3 convolution layers also has fewer parameters than a single large filter. Assuming the input and output feature maps of the convolution layers all have C channels, three 3×3 convolution layers have 3×(3×3×C×C) = 27C² parameters, while one 7×7 convolution layer has 49C² parameters. Three 3×3 filters can therefore be seen as a decomposition of a 7×7 filter (with extra nonlinearities between the layers).
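A quick numeric check of this parameter comparison (biases ignored), with C input and C output channels:

# Parameter counts for the comparison above (biases ignored), with C input/output channels.
C = 256
stack_of_three_3x3 = 3 * (3 * 3 * C * C)   # 27*C*C = 1,769,472 for C = 256
single_7x7 = 7 * 7 * C * C                 # 49*C*C = 3,211,264 for C = 256
print(stack_of_three_3x3, single_7x7)      # the 3x3 stack needs roughly 55% of the parameters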
Advantages of 1*1 convolution kernels:
- A 1×1 convolution applies a linear transform across the input channels without changing the input and output spatial dimensions, and the following ReLU adds nonlinearity, increasing the network's nonlinear expressive power.
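In Keras terms this is a single layer. A minimal sketch (the shapes are chosen purely for illustration) showing that a 1×1 convolution mixes channels at every spatial position while leaving the spatial size unchanged:

from keras.layers import Input, Conv2D
from keras.models import Model

# A 1x1 convolution is a per-position linear map across channels followed by ReLU:
# the spatial size stays 56x56, only the channel count changes (256 -> 64 here).
inp = Input(shape=(56, 56, 256))
out = Conv2D(64, (1, 1), strides=(1, 1), padding='same', activation='relu')(inp)
model = Model(inp, out)   # output shape: (None, 56, 56, 64)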
VGG-16 Keras implementation:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def VGG_16():
    model = Sequential()
    # Block 1
    model.add(Conv2D(64, (3, 3), strides=(1, 1), input_shape=(224, 224, 3), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(Conv2D(64, (3, 3), strides=(1, 1), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    # Block 2
    model.add(Conv2D(128, (3, 3), strides=(1, 1), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(Conv2D(128, (3, 3), strides=(1, 1), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    # Block 3
    model.add(Conv2D(256, (3, 3), strides=(1, 1), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(Conv2D(256, (3, 3), strides=(1, 1), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(Conv2D(256, (3, 3), strides=(1, 1), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    # Block 4
    model.add(Conv2D(512, (3, 3), strides=(1, 1), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(Conv2D(512, (3, 3), strides=(1, 1), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(Conv2D(512, (3, 3), strides=(1, 1), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    # Block 5
    model.add(Conv2D(512, (3, 3), strides=(1, 1), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(Conv2D(512, (3, 3), strides=(1, 1), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(Conv2D(512, (3, 3), strides=(1, 1), padding='same', activation='relu', kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    # Classifier
    model.add(Flatten())
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1000, activation='softmax'))
    return model
Riding the wave: GoogLeNet
GoogLeNet beat VGG-Nets to win the 2014 ImageNet classification task, so its strength is beyond doubt. Unlike AlexNet and VGG-Nets, which improve performance simply by making the network deeper, GoogLeNet takes another path: while deepening the network (22 layers), it also innovates on the structure itself, introducing the Inception module to replace the traditional operation of plain convolution plus activation (an idea first proposed by Network in Network). GoogLeNet pushed research on convolutional neural networks to a new height.
Highlights:
- Introduces the Inception structure
- Adds auxiliary loss units in the middle layers
- Replaces the final fully connected layer with simple global average pooling
The structure above is an Inception module. All convolutions in the module use stride 1, and zero padding keeps the feature maps the same size; each convolution layer is followed immediately by a ReLU. Before the output there is a concatenate layer, which literally "juxtaposes": the four groups of feature maps, of different types but the same spatial size, are stacked along the channel dimension to form a new feature map. The Inception module mainly does two things: 1. it extracts features from the input feature map in four parallel ways, using 3×3 pooling and convolution kernels at three scales, 1×1, 3×3 and 5×5; 2. it reduces computation: information flows through fewer connections to obtain sparser representations, and 1×1 convolution kernels are used for dimensionality reduction.
Let me say a bit more about what the 1×1 convolution kernel does and how it actually reduces dimension. Figure 1 below shows convolution with a 3×3 kernel, and Figure 2 shows the convolution process with a 1×1 kernel. For a single-channel input, a 1×1 convolution indeed does not reduce dimension, but for a multi-channel input it is different. Suppose we have 256 input features and 256 output features, and suppose the Inception layer only performs 3×3 convolutions. That means 256×256×3×3 multiply-accumulate (MAC) operations, about 589,000, which might exceed our computing budget, say running this layer in 0.5 milliseconds on a Google server. Instead, we decide to reduce the number of features to be convolved to, say, 64 (256/4): first perform a 1×1 convolution from 256 to 64 channels, then run the 3×3 convolution of the Inception branch on those 64 channels, and finally a 1×1 convolution from 64 back to 256 channels. The computations are now:
- 256 * 64 * 1 * 1 = 16000
- 64 * 64 * 3 * 3 = 36000
- 64 * 256 * 1 * 1 = 16000
The total is now around 70,000 (16,000 + 36,000 + 16,000), down from roughly 600,000, a reduction of nearly an order of magnitude. This is how small 1×1 kernels reduce the dimension and with it the computation.
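Writing that arithmetic out exactly (multiplications per spatial position):

# Exact per-position multiplication counts for the example above.
direct_3x3  = 256 * 256 * 3 * 3                            # 589,824 (~600,000 in the text)
reduce_1x1  = 256 * 64 * 1 * 1                             # 16,384
conv_3x3    = 64 * 64 * 3 * 3                              # 36,864
restore_1x1 = 64 * 256 * 1 * 1                             # 16,384
bottleneck  = reduce_1x1 + conv_3x3 + restore_1x1          # 69,632 (~70,000 in the text)
print(direct_3x3 / bottleneck)                             # about 8.5x fewer multiplications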
Now consider another question: why must we use a 1×1 convolution kernel; wouldn't 3×3 also work? Take an input matrix of [50,200,200]. We can use 20 1×1 convolution kernels to obtain an output of [20,200,200]. Someone may ask: I can also get a [20,200,200] output with 20 3×3 kernels, so why use 1×1? Compare the convolution cost: for the 1×1 kernels the total is 20×200×200×(1×1), while for the 3×3 kernels it is 20×200×200×(3×3); the 1×1 total is only one ninth of the 3×3 total. That is why 1×1 convolution kernels are used.
There are three loss units in the GoogLeNet structure; this design is meant to help the network converge. Auxiliary loss units are added in the middle layers so that the lower-layer features already have good discriminative power when the loss is computed, which helps the network train better. In the paper, the two auxiliary losses are multiplied by 0.3 and added to the final loss, and this sum is used as the loss function to train the network.
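A minimal sketch of how such weighted auxiliary losses can be expressed in Keras through loss_weights; the layers and sizes below are illustrative placeholders, not the exact GoogLeNet heads:

from keras.models import Model
from keras.layers import Input, Conv2D, Dense, GlobalAveragePooling2D

# Hypothetical sketch: two auxiliary softmax heads weighted by 0.3 each, as described above.
inp = Input(shape=(224, 224, 3))
x = Conv2D(64, (7, 7), strides=(2, 2), padding='same', activation='relu')(inp)
aux1 = Dense(1000, activation='softmax', name='aux1')(GlobalAveragePooling2D()(x))
x = Conv2D(128, (3, 3), strides=(2, 2), padding='same', activation='relu')(x)
aux2 = Dense(1000, activation='softmax', name='aux2')(GlobalAveragePooling2D()(x))
x = Conv2D(256, (3, 3), strides=(2, 2), padding='same', activation='relu')(x)
main = Dense(1000, activation='softmax', name='main')(GlobalAveragePooling2D()(x))

model = Model(inp, [main, aux1, aux2])
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              loss_weights=[1.0, 0.3, 0.3])   # total loss = main + 0.3*aux1 + 0.3*aux2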
Another highlight worth mentioning in GoogLeNet is that the final fully connected layer is replaced by simple global average pooling, which greatly reduces the number of parameters. In AlexNet, the last three fully connected layers account for almost 90% of all parameters; by using a network that is large in both width and depth, GoogLeNet can remove the fully connected layers without hurting accuracy, achieving 93.3% top-5 accuracy on ImageNet while being faster than VGG.
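For a rough sense of the savings, assume a final 7×7×1024 feature map, 1000 classes and, for comparison, an AlexNet/VGG-style 4096-4096-1000 fully connected head (biases ignored):

# Rough parameter comparison for the final classifier stage (7x7x1024 feature map, 1000 classes).
fc_style  = (7 * 7 * 1024) * 4096 + 4096 * 4096 + 4096 * 1000   # AlexNet/VGG-style FC head: ~226 million
gap_style = 1024 * 1000                                          # global average pooling + one FC layer: ~1 million
print(fc_style, gap_style)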
Keras implementation of GoogLeNet:
from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, AveragePooling2D, Dropout, Dense, BatchNormalization, concatenate

def Conv2d_BN(x, nb_filter, kernel_size, padding='same', strides=(1, 1), name=None):
    if name is not None:
        bn_name = name + '_bn'
        conv_name = name + '_conv'
    else:
        bn_name = None
        conv_name = None
    x = Conv2D(nb_filter, kernel_size, padding=padding, strides=strides, activation='relu', name=conv_name)(x)
    x = BatchNormalization(axis=3, name=bn_name)(x)
    return x

def Inception(x, nb_filter):
    branch1x1 = Conv2d_BN(x, nb_filter, (1, 1), padding='same', strides=(1, 1), name=None)

    branch3x3 = Conv2d_BN(x, nb_filter, (1, 1), padding='same', strides=(1, 1), name=None)
    branch3x3 = Conv2d_BN(branch3x3, nb_filter, (3, 3), padding='same', strides=(1, 1), name=None)

    branch5x5 = Conv2d_BN(x, nb_filter, (1, 1), padding='same', strides=(1, 1), name=None)
    branch5x5 = Conv2d_BN(branch5x5, nb_filter, (5, 5), padding='same', strides=(1, 1), name=None)

    branchpool = MaxPooling2D(pool_size=(3, 3), strides=(1, 1), padding='same')(x)
    branchpool = Conv2d_BN(branchpool, nb_filter, (1, 1), padding='same', strides=(1, 1), name=None)

    x = concatenate([branch1x1, branch3x3, branch5x5, branchpool], axis=3)
    return x

def GoogLeNet():
    inpt = Input(shape=(224, 224, 3))
    # padding='same' pads by (kernel size - 1)/2; ZeroPadding2D((3, 3)) could also be used
    x = Conv2d_BN(inpt, 64, (7, 7), strides=(2, 2), padding='same')
    x = MaxPooling2D(pool_size=(3, 3), strides=(2, 2), padding='same')(x)
    x = Conv2d_BN(x, 192, (3, 3), strides=(1, 1), padding='same')
    x = MaxPooling2D(pool_size=(3, 3), strides=(2, 2), padding='same')(x)
    x = Inception(x, 64)   # 4 branches x 64 = 256 output channels
    x = Inception(x, 120)  # 480
    x = MaxPooling2D(pool_size=(3, 3), strides=(2, 2), padding='same')(x)
    x = Inception(x, 128)  # 512
    x = Inception(x, 128)
    x = Inception(x, 128)
    x = Inception(x, 132)  # 528
    x = Inception(x, 208)  # 832
    x = MaxPooling2D(pool_size=(3, 3), strides=(2, 2), padding='same')(x)
    x = Inception(x, 208)
    x = Inception(x, 256)  # 1024
    x = AveragePooling2D(pool_size=(7, 7), strides=(7, 7), padding='same')(x)
    x = Dropout(0.4)(x)
    x = Dense(1000, activation='relu')(x)
    x = Dense(1000, activation='softmax')(x)
    model = Model(inpt, x, name='inception')
    return model
Landmark innovation: ResNet
ResNet, proposed by Kaiming He and colleagues, swept the ILSVRC and COCO competitions in 2015. ResNet makes a major innovation in network structure rather than simply stacking layers; its new idea is without doubt a milestone in the development of deep learning.
Highlights:
- Very deep, over a hundred layers
- Introduces residual units to solve the degradation problem
Intuitively, as network depth increases, accuracy should increase as well (setting aside over-fitting). But one problem with deeper networks is that the signal for updating the parameters, the gradient, propagates from back to front, so as depth increases the gradients of the earlier layers become very small. These layers essentially stop learning; this is the vanishing-gradient problem. A second problem with training deep networks is that a deeper network means a larger parameter space and a harder optimization problem, so simply increasing depth can produce higher training error: even though the deeper network converges, it begins to degrade, and adding layers leads to larger error. As shown below, a 56-layer network performs worse than a 20-layer one, not because of over-fitting (the training error is still high), but because of the annoying degradation problem. ResNet designs a residual module that allows us to train much deeper networks.
Here is a detailed analysis of the residual unit to understand the essence of ResNet.
As the figure below shows, the data takes two routes: one is the conventional path, and the other is a shortcut that directly implements an identity mapping, somewhat like a "short circuit" in an electrical circuit. Experiments show that this shortcut structure handles degradation very well. Write the relation between the input and output of a module as y = H(x); solving for H(x) directly by gradient descent runs into the degradation problem described above. With the shortcut structure, the part with learnable parameters no longer has to fit H(x) itself. If F(x) denotes the part to be optimized, then H(x) = F(x) + x, i.e. F(x) = H(x) − x. Under the identity-mapping assumption y = x plays the role of the observed value, so F(x) corresponds to the residual, hence the name residual network. Why do this? Because the authors believe that learning the residual F(x) is easier than learning H(x) directly. We now only need to learn the difference between input and output, turning an absolute quantity into a relative one (H(x) − x is how much the output changes with respect to the input), which is much easier to optimize.
Since the dimension of x may not match that of F(x), dimension matching is required. The paper uses two methods to solve this (there is actually a third, but experiments showed it causes a sharp drop in performance, so it is not used):
- Zero-padding: pad the identity branch with zeros to match the dimension; this adds no extra parameters.
- Projection: apply a 1×1 convolution on the identity branch to increase the dimension; this adds extra parameters.
The figure below shows the two types of residual modules. On the left is the conventional residual module, built from two 3×3 convolution layers; as the network gets deeper, this structure is not very effective in practice. The "bottleneck residual block" on the right works better for deep networks: it stacks 1×1, 3×3 and 1×1 convolution layers, where the 1×1 convolutions reduce and then restore the dimension so that the 3×3 convolution operates on a relatively low-dimensional input, improving computational efficiency.
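To see the saving, compare rough parameter counts for the two block types, assuming 256 input/output channels for both and a 64-channel bottleneck (the exact channel counts vary by stage; biases ignored):

# Parameter counts for the two residual block designs above (no biases).
basic_block      = 2 * (3 * 3 * 256 * 256)                                 # two 3x3 convs: 1,179,648
bottleneck_block = 256 * 64 * 1 * 1 + 64 * 64 * 3 * 3 + 64 * 256 * 1 * 1   # 1x1 + 3x3 + 1x1: 69,632
print(basic_block / bottleneck_block)                                      # about 17x fewer parameters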
ResNet-50 Keras implementation:
from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, AveragePooling2D, Flatten, Dense, BatchNormalization, ZeroPadding2D, add

def Conv2d_BN(x, nb_filter, kernel_size, strides=(1, 1), padding='same', name=None):
    if name is not None:
        bn_name = name + '_bn'
        conv_name = name + '_conv'
    else:
        bn_name = None
        conv_name = None
    x = Conv2D(nb_filter, kernel_size, padding=padding, strides=strides, activation='relu', name=conv_name)(x)
    x = BatchNormalization(axis=3, name=bn_name)(x)
    return x

def Conv_Block(inpt, nb_filter, kernel_size, strides=(1, 1), with_conv_shortcut=False):
    # bottleneck: 1x1 -> 3x3 -> 1x1
    x = Conv2d_BN(inpt, nb_filter=nb_filter[0], kernel_size=(1, 1), strides=strides, padding='same')
    x = Conv2d_BN(x, nb_filter=nb_filter[1], kernel_size=(3, 3), padding='same')
    x = Conv2d_BN(x, nb_filter=nb_filter[2], kernel_size=(1, 1), padding='same')
    if with_conv_shortcut:
        # projection shortcut: convolution on the shortcut branch to match dimensions
        shortcut = Conv2d_BN(inpt, nb_filter=nb_filter[2], strides=strides, kernel_size=kernel_size)
        x = add([x, shortcut])
        return x
    else:
        # identity shortcut
        x = add([x, inpt])
        return x

def ResNet50():
    inpt = Input(shape=(224, 224, 3))
    x = ZeroPadding2D((3, 3))(inpt)
    x = Conv2d_BN(x, nb_filter=64, kernel_size=(7, 7), strides=(2, 2), padding='valid')
    x = MaxPooling2D(pool_size=(3, 3), strides=(2, 2), padding='same')(x)

    x = Conv_Block(x, nb_filter=[64, 64, 256], kernel_size=(3, 3), strides=(1, 1), with_conv_shortcut=True)
    x = Conv_Block(x, nb_filter=[64, 64, 256], kernel_size=(3, 3))
    x = Conv_Block(x, nb_filter=[64, 64, 256], kernel_size=(3, 3))

    x = Conv_Block(x, nb_filter=[128, 128, 512], kernel_size=(3, 3), strides=(2, 2), with_conv_shortcut=True)
    x = Conv_Block(x, nb_filter=[128, 128, 512], kernel_size=(3, 3))
    x = Conv_Block(x, nb_filter=[128, 128, 512], kernel_size=(3, 3))
    x = Conv_Block(x, nb_filter=[128, 128, 512], kernel_size=(3, 3))

    x = Conv_Block(x, nb_filter=[256, 256, 1024], kernel_size=(3, 3), strides=(2, 2), with_conv_shortcut=True)
    x = Conv_Block(x, nb_filter=[256, 256, 1024], kernel_size=(3, 3))
    x = Conv_Block(x, nb_filter=[256, 256, 1024], kernel_size=(3, 3))
    x = Conv_Block(x, nb_filter=[256, 256, 1024], kernel_size=(3, 3))
    x = Conv_Block(x, nb_filter=[256, 256, 1024], kernel_size=(3, 3))
    x = Conv_Block(x, nb_filter=[256, 256, 1024], kernel_size=(3, 3))

    x = Conv_Block(x, nb_filter=[512, 512, 2048], kernel_size=(3, 3), strides=(2, 2), with_conv_shortcut=True)
    x = Conv_Block(x, nb_filter=[512, 512, 2048], kernel_size=(3, 3))
    x = Conv_Block(x, nb_filter=[512, 512, 2048], kernel_size=(3, 3))

    x = AveragePooling2D(pool_size=(7, 7))(x)
    x = Flatten()(x)
    x = Dense(1000, activation='softmax')(x)
    model = Model(inputs=inpt, outputs=x)
    return model
Build on the past: DenseNet
Since ResNet was proposed, ResNet variants have appeared in an endless stream, each with its own characteristics and each improving performance further. The last network introduced in this article is DenseNet, the best-paper winner of CVPR 2017. The DenseNet (Dense Convolutional Network) proposed in the paper is mainly compared with ResNet and Inception; it borrows some ideas from them but is a brand-new structure. The network is not complex, yet it is very effective, beating ResNet on the CIFAR benchmarks. It can be said that DenseNet absorbs the best parts of ResNet and innovates further on top of them, pushing network performance up another notch.
Highlights:
- Dense connections: alleviate vanishing gradients, strengthen feature propagation, encourage feature reuse, and greatly reduce the number of parameters
DenseNet is a convolutional neural network with dense connections: within a dense block, any two layers are directly connected, i.e. the input of each layer is the concatenation of the outputs of all previous layers, and the feature maps learned by that layer are passed directly to all subsequent layers as input. The figure below shows one dense block of DenseNet. The internal structure of each layer resembles the bottleneck in ResNet: BN->ReLU->Conv(1×1)->BN->ReLU->Conv(3×3), and a DenseNet consists of multiple such blocks. The layers between dense blocks are called transition layers and consist of BN->Conv(1×1)->AveragePooling(2×2).
Don't dense connections create redundancy? No! At first glance, "dense connection" suggests a huge increase in parameters and computation, but in fact DenseNet is more efficient than other networks, because each layer computes fewer feature maps and features are reused. DenseNet lets the input of layer l influence all subsequent layers directly; its output is x_l = H_l([x_0, x_1, ..., x_{l-1}]), where [x_0, x_1, ..., x_{l-1}] concatenates the previous feature maps along the channel dimension. Because each layer already receives the outputs of all previous layers, it only needs to produce a small number of feature maps of its own, which is why DenseNet has far fewer parameters than other models. This dense connectivity also means every layer has a direct path to the input and to the loss, so vanishing gradients are alleviated and going deeper is no longer a problem.
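Concretely, with an initial channel count k0 and growth rate k, the l-th layer inside a dense block sees k0 + k*(l-1) input channels but contributes only k new ones. A short sketch of this bookkeeping for the first dense block of DenseNet-121 (k0 = 64, k = 32, 6 layers, matching the implementation below):

# Channel bookkeeping inside one dense block (DenseNet-121's first block: k0=64, k=32, 6 layers).
k0, k, layers = 64, 32, 6
in_channels = [k0 + k * i for i in range(layers)]   # inputs seen by each layer: 64, 96, 128, 160, 192, 224
out_channels = k0 + k * layers                      # channels leaving the block: 256
print(in_channels, out_channels)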
To be clear, dense connectivity exists only within a dense block; there are no dense connections between different dense blocks, as the figure below shows.
There is no free lunch, and networks are no exception. Better convergence at the same depth naturally comes at a price, and one cost of DenseNet is its heavy memory footprint during training.
Keras implementation of Densenet-121:
# Note: this implementation assumes the old Keras 1 API (Convolution2D, merge, K.image_dim_ordering)
# and a custom Scale layer (not part of Keras) that must be provided separately.
from keras.models import Model
from keras.layers import Input, Dense, Dropout, Activation, Convolution2D, ZeroPadding2D, MaxPooling2D, AveragePooling2D, GlobalAveragePooling2D, BatchNormalization, merge
from keras import backend as K

def DenseNet121(nb_dense_block=4, growth_rate=32, nb_filter=64, reduction=0.0, dropout_rate=0.0, weight_decay=1e-4, classes=1000, weights_path=None):
    '''Instantiate the DenseNet 121 architecture.
    # Arguments
        nb_dense_block: number of dense blocks to add to end
        growth_rate: number of filters to add per dense block
        nb_filter: initial number of filters
        reduction: reduction factor of transition blocks
        dropout_rate: dropout rate
        weight_decay: weight decay factor
        classes: optional number of classes to classify images
        weights_path: path to pre-trained weights
    # Returns
        A Keras model instance.
    '''
    eps = 1.1e-5

    # compute compression factor
    compression = 1.0 - reduction

    # Handle dimension ordering for different backends
    global concat_axis
    if K.image_dim_ordering() == 'tf':
        concat_axis = 3
        img_input = Input(shape=(224, 224, 3), name='data')
    else:
        concat_axis = 1
        img_input = Input(shape=(3, 224, 224), name='data')

    # From architecture for ImageNet (Table 1 in the paper)
    nb_filter = 64
    nb_layers = [6, 12, 24, 16]  # For DenseNet-121

    # Initial convolution
    x = ZeroPadding2D((3, 3), name='conv1_zeropadding')(img_input)
    x = Convolution2D(nb_filter, 7, 7, subsample=(2, 2), name='conv1', bias=False)(x)
    x = BatchNormalization(epsilon=eps, axis=concat_axis, name='conv1_bn')(x)
    x = Scale(axis=concat_axis, name='conv1_scale')(x)
    x = Activation('relu', name='relu1')(x)
    x = ZeroPadding2D((1, 1), name='pool1_zeropadding')(x)
    x = MaxPooling2D((3, 3), strides=(2, 2), name='pool1')(x)

    # Add dense blocks
    for block_idx in range(nb_dense_block - 1):
        stage = block_idx + 2
        x, nb_filter = dense_block(x, stage, nb_layers[block_idx], nb_filter, growth_rate, dropout_rate=dropout_rate, weight_decay=weight_decay)

        # Add transition_block
        x = transition_block(x, stage, nb_filter, compression=compression, dropout_rate=dropout_rate, weight_decay=weight_decay)
        nb_filter = int(nb_filter * compression)

    final_stage = stage + 1
    x, nb_filter = dense_block(x, final_stage, nb_layers[-1], nb_filter, growth_rate, dropout_rate=dropout_rate, weight_decay=weight_decay)

    x = BatchNormalization(epsilon=eps, axis=concat_axis, name='conv'+str(final_stage)+'_blk_bn')(x)
    x = Scale(axis=concat_axis, name='conv'+str(final_stage)+'_blk_scale')(x)
    x = Activation('relu', name='relu'+str(final_stage)+'_blk')(x)
    x = GlobalAveragePooling2D(name='pool'+str(final_stage))(x)

    x = Dense(classes, name='fc6')(x)
    x = Activation('softmax', name='prob')(x)

    model = Model(img_input, x, name='densenet')

    if weights_path is not None:
        model.load_weights(weights_path)

    return model

def conv_block(x, stage, branch, nb_filter, dropout_rate=None, weight_decay=1e-4):
    '''Apply BatchNorm, ReLU, bottleneck 1x1 Conv2D, 3x3 Conv2D, and optional dropout.
    # Arguments
        x: input tensor
        stage: index for dense block
        branch: layer index within each dense block
        nb_filter: number of filters
        dropout_rate: dropout rate
        weight_decay: weight decay factor
    '''
    eps = 1.1e-5
    conv_name_base = 'conv' + str(stage) + '_' + str(branch)
    relu_name_base = 'relu' + str(stage) + '_' + str(branch)

    # 1x1 convolution (bottleneck layer)
    inter_channel = nb_filter * 4
    x = BatchNormalization(epsilon=eps, axis=concat_axis, name=conv_name_base+'_x1_bn')(x)
    x = Scale(axis=concat_axis, name=conv_name_base+'_x1_scale')(x)
    x = Activation('relu', name=relu_name_base+'_x1')(x)
    x = Convolution2D(inter_channel, 1, 1, name=conv_name_base+'_x1', bias=False)(x)

    if dropout_rate:
        x = Dropout(dropout_rate)(x)

    # 3x3 convolution
    x = BatchNormalization(epsilon=eps, axis=concat_axis, name=conv_name_base+'_x2_bn')(x)
    x = Scale(axis=concat_axis, name=conv_name_base+'_x2_scale')(x)
    x = Activation('relu', name=relu_name_base+'_x2')(x)
    x = ZeroPadding2D((1, 1), name=conv_name_base+'_x2_zeropadding')(x)
    x = Convolution2D(nb_filter, 3, 3, name=conv_name_base+'_x2', bias=False)(x)

    if dropout_rate:
        x = Dropout(dropout_rate)(x)

    return x

def transition_block(x, stage, nb_filter, compression=1.0, dropout_rate=None, weight_decay=1e-4):
    '''Apply BatchNorm, 1x1 Convolution, averagePooling, optional compression, dropout.
    # Arguments
        x: input tensor
        stage: index for dense block
        nb_filter: number of filters
        compression: calculated as 1 - reduction; reduces the number of feature maps in the transition block
        dropout_rate: dropout rate
        weight_decay: weight decay factor
    '''
    eps = 1.1e-5
    conv_name_base = 'conv' + str(stage) + '_blk'
    relu_name_base = 'relu' + str(stage) + '_blk'
    pool_name_base = 'pool' + str(stage)

    x = BatchNormalization(epsilon=eps, axis=concat_axis, name=conv_name_base+'_bn')(x)
    x = Scale(axis=concat_axis, name=conv_name_base+'_scale')(x)
    x = Activation('relu', name=relu_name_base)(x)
    x = Convolution2D(int(nb_filter * compression), 1, 1, name=conv_name_base, bias=False)(x)

    if dropout_rate:
        x = Dropout(dropout_rate)(x)

    x = AveragePooling2D((2, 2), strides=(2, 2), name=pool_name_base)(x)

    return x

def dense_block(x, stage, nb_layers, nb_filter, growth_rate, dropout_rate=None, weight_decay=1e-4, grow_nb_filters=True):
    '''Build a dense_block where the output of each conv_block is fed to subsequent ones.
    # Arguments
        x: input tensor
        stage: index for dense block
        nb_layers: the number of layers of conv_block to append to the model
        nb_filter: number of filters
        growth_rate: growth rate
        dropout_rate: dropout rate
        weight_decay: weight decay factor
        grow_nb_filters: flag to decide to allow number of filters to grow
    '''
    eps = 1.1e-5
    concat_feat = x

    for i in range(nb_layers):
        branch = i + 1
        x = conv_block(concat_feat, stage, branch, growth_rate, dropout_rate, weight_decay)
        concat_feat = merge([concat_feat, x], mode='concat', concat_axis=concat_axis, name='concat_'+str(stage)+'_'+str(branch))

        if grow_nb_filters:
            nb_filter += growth_rate

    return concat_feat, nb_filter