This article mainly covers interview questions on Batch Normalization, basic convolutional neural networks, and loss functions.
1. Batch Normalization
Batch Normalization is a data pre-processing technique that normalizes the input of each network layer to a distribution with mean 0 and variance 1. Without BN, the input distribution of each layer keeps shifting, which hurts the training accuracy of the network. Forward formulas:

$$\mu = \frac{1}{N}\sum_{i} x_i,\qquad \sigma^2 = \frac{1}{N}\sum_{i}(x_i-\mu)^2,\qquad \hat{x}_i = \frac{x_i-\mu}{\sqrt{\sigma^2+\epsilon}},\qquad y_i = \gamma\hat{x}_i + \beta$$

Forward propagation code
```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps):
    N, D = x.shape
    # done step by step to make deriving the backward pass easier
    # step1: compute the mean
    mu = 1. / N * np.sum(x, axis=0)
    # step2: subtract the mean
    xmu = x - mu
    # step3: compute the variance
    sq = xmu ** 2
    var = 1. / N * np.sum(sq, axis=0)
    # step4: compute the denominator of x^
    sqrtvar = np.sqrt(var + eps)
    ivar = 1. / sqrtvar
    # step5: normalize -> x^
    xhat = xmu * ivar
    # step6: scale and shift
    gammax = gamma * xhat
    out = gammax + beta
    # store intermediate variables for the backward pass
    cache = (xhat, gamma, xmu, ivar, sqrtvar, var, eps)
    return out, cache
```
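A quick sanity check of the forward pass (my own minimal sketch, not from the original post): with γ = 1 and β = 0 the output should have per-feature mean close to 0 and variance close to 1.

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(64, 10) * 3.0 + 5.0      # a batch of 64 samples, 10 features
gamma, beta = np.ones(10), np.zeros(10)

out, cache = batchnorm_forward(x, gamma, beta, eps=1e-5)
print(out.mean(axis=0))   # every entry should be close to 0
print(out.var(axis=0))    # every entry should be close to 1
```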
Backward propagation code
```python
def batchnorm_backward(dout, cache):
    # unpack the intermediate variables
    xhat, gamma, xmu, ivar, sqrtvar, var, eps = cache
    N, D = dout.shape
    # step6
    dbeta = np.sum(dout, axis=0)
    dgammax = dout
    dgamma = np.sum(dgammax * xhat, axis=0)
    dxhat = dgammax * gamma
    # step5
    divar = np.sum(dxhat * xmu, axis=0)
    dxmu1 = dxhat * ivar                  # note: this is the first branch of xmu
    # step4
    dsqrtvar = -1. / (sqrtvar ** 2) * divar
    dvar = 0.5 * 1. / np.sqrt(var + eps) * dsqrtvar
    # step3
    dsq = 1. / N * np.ones((N, D)) * dvar
    dxmu2 = 2 * xmu * dsq                 # note: this is the second branch of xmu
    # step2
    dx1 = dxmu1 + dxmu2                   # note: this is the first branch of x
    dmu = -1 * np.sum(dxmu1 + dxmu2, axis=0)
    # step1
    dx2 = 1. / N * np.ones((N, D)) * dmu  # note: this is the second branch of x
    # step0: done!
    dx = dx1 + dx2
    return dx, dgamma, dbeta
```
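A simple way to verify the backward pass (again my own sketch, using the two functions above) is to compare the analytic dx with a numerical gradient of the forward pass; dgamma and dbeta can be checked the same way.

```python
import numpy as np

def num_grad(f, x, h=1e-5):
    """Central-difference numerical gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + h; fp = f(x)
        x[idx] = old - h; fm = f(x)
        x[idx] = old
        grad[idx] = (fp - fm) / (2 * h)
        it.iternext()
    return grad

np.random.seed(1)
x = np.random.randn(4, 5)
gamma, beta = np.random.randn(5), np.random.randn(5)
dout = np.random.randn(4, 5)

out, cache = batchnorm_forward(x, gamma, beta, 1e-5)
dx, dgamma, dbeta = batchnorm_backward(dout, cache)

# choose loss = sum(out * dout) so that d(loss)/d(out) = dout
f = lambda x_: np.sum(batchnorm_forward(x_, gamma, beta, 1e-5)[0] * dout)
print(np.max(np.abs(dx - num_grad(f, x))))   # should be on the order of 1e-8
```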
1. What does Batch Norm do?
Batch Norm (batch normalization) aims to ease the training difficulty caused by the irregular, shifting distribution of each batch of data during training. By normalizing each batch of data it also helps against the vanishing-gradient problem during backpropagation.
BatchNorm also acts as a regularizer and can replace other regularization methods such as dropout, although this kind of normalization also washes out some of the differences between individual samples.
2. What parameters does BatchNorm have, and which of them are learnable?
In the scale-and-shift step, two parameters γ and β are introduced, called the scaling parameter and the shifting parameter. By choosing different γ and β, the hidden units can take on different distributions. γ and β are learned by the model and can be updated with gradient descent, Adam, and so on.
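As a toy illustration of this learnability (my own sketch, relying on the batchnorm_forward and batchnorm_backward functions above; the target values and learning rate are arbitrary), γ and β can be updated with plain gradient descent using the dgamma and dbeta returned by the backward pass:

```python
import numpy as np

np.random.seed(2)
x = np.random.randn(32, 8)                  # a fixed batch of activations
gamma, beta = np.ones(8), np.zeros(8)
lr = 0.1

for step in range(100):
    out, cache = batchnorm_forward(x, gamma, beta, eps=1e-5)
    target = np.full_like(out, 0.5)         # toy objective: push BN outputs towards 0.5
    dout = 2 * (out - target) / out.size    # gradient of the mean squared error
    _, dgamma, dbeta = batchnorm_backward(dout, cache)
    gamma -= lr * dgamma                    # learnable scaling parameter
    beta -= lr * dbeta                      # learnable shifting parameter

print(beta)   # beta drifts towards 0.5 to match the toy target
```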
3. Why do we need Batch Normalization?
During neural network training, as the network gets deeper, the overall distribution of the activation-function inputs gradually drifts towards the upper and lower saturation limits of the activation function's range, which makes the gradients of the lower layers vanish during backpropagation. Batch Normalization pulls this increasingly skewed distribution back to a standardized distribution, so that the activation inputs fall in the region where the activation function is sensitive to its input. This keeps the gradients large, speeds up convergence, and avoids the vanishing-gradient problem.
① Training speed is greatly improved and convergence is much faster. ② Classification performance can also improve; one explanation is that BN provides a regularization effect similar to Dropout and prevents overfitting, so the same effect can be achieved without Dropout. ③ Tuning becomes much simpler: the initialization requirements are less strict and a large learning rate can be used.
4. How is a BN layer implemented?
1. Compute the sample mean. 2. Compute the sample variance. 3. Standardize the sample data. 4. Scale and shift: introduce the two parameters γ and β and train them. With these learnable reconstruction parameters γ and β, the network can learn to recover the feature distribution that the original network would have learned.
5. In which part of the network is BN usually used?
Convolution first, then BN. The "Batch" in Batch Normalization refers to a mini-batch of data: the training data is split into small batches for stochastic gradient descent, and during the forward pass of each batch the inputs of every layer are normalized with that batch's statistics.
6. Why does BN include a reconstruction (scale and shift) step?
To recover the features that the corresponding layer of the original network would have learned. That is why the learnable reconstruction parameters γ and β are introduced: with them the network can learn to restore the feature distribution that the original network would learn.
7. How are the derivatives computed in the backward pass of a BN layer?
Backward propagation:
The backward pass needs to compute three gradients: $\frac{\partial L}{\partial \gamma}$, $\frac{\partial L}{\partial \beta}$ and $\frac{\partial L}{\partial x_i}$.
Define $\delta_i = \frac{\partial L}{\partial y_i}$ as the gradient passed back from the next layer.
Looking at the scale-and-shift and normalization formulas, the chain from $x_i$ to $y_i$ gives:

$$\frac{\partial L}{\partial \gamma} = \sum_i \delta_i\,\hat{x}_i,\qquad \frac{\partial L}{\partial \beta} = \sum_i \delta_i,\qquad \frac{\partial L}{\partial \hat{x}_i} = \delta_i\,\gamma$$

$$\frac{\partial L}{\partial \sigma^2} = \sum_i \frac{\partial L}{\partial \hat{x}_i}\,(x_i-\mu)\cdot\Big(-\tfrac{1}{2}\Big)(\sigma^2+\epsilon)^{-3/2},\qquad \frac{\partial L}{\partial \mu} = -\frac{1}{\sqrt{\sigma^2+\epsilon}}\sum_i \frac{\partial L}{\partial \hat{x}_i} + \frac{\partial L}{\partial \sigma^2}\cdot\frac{-2\sum_i(x_i-\mu)}{N}$$

$$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial \hat{x}_i}\cdot\frac{1}{\sqrt{\sigma^2+\epsilon}} + \frac{\partial L}{\partial \sigma^2}\cdot\frac{2(x_i-\mu)}{N} + \frac{\partial L}{\partial \mu}\cdot\frac{1}{N}$$

The three terms joined by the plus signs in the last expression correspond to the three chain paths from $x_i$ to the loss: through $\hat{x}_i$, through $\sigma^2$, and through $\mu$.
8. What is the difference between BatchNorm at training time and at test time?
Training phase: for each training batch, first compute the batch mean and variance, then normalize, then scale and shift.
Test phase: only a single image may be fed in at a time, so how do we get a batch mean and variance? The implementation keeps two extra lines of code that accumulate running estimates of the mean and variance during training, and these running statistics are used directly at test time, without computing any per-batch statistics.
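The "two lines" referred to are the usual exponential-moving-average update of the batch statistics (a hedged sketch; the variable names follow the common cs231n-style implementation and are only illustrative):

```python
import numpy as np

momentum = 0.9
running_mean, running_var = np.zeros(10), np.ones(10)
gamma, beta, eps = np.ones(10), np.zeros(10), 1e-5

# during training, after computing each batch's statistics:
x_batch = np.random.randn(32, 10)
sample_mean = x_batch.mean(axis=0)
sample_var = x_batch.var(axis=0)

# the two lines: running estimates of mean and variance
running_mean = momentum * running_mean + (1 - momentum) * sample_mean
running_var = momentum * running_var + (1 - momentum) * sample_var

# at test time a single sample is normalized with the stored statistics,
# so no batch mean/variance has to be computed
x_test = np.random.randn(1, 10)
out = gamma * (x_test - running_mean) / np.sqrt(running_var + eps) + beta
```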
9. Should BN come before or after the activation, and what is the difference? (Activation first.)
In current practice BN is often placed after ReLU, and some empirical comparisons also report that BN works better when placed after ReLU.
2. Basic convolutional neural network
1. Classic CNN models
LeNet, AlexNet, VGG, GoogLeNet, ResNet, DenseNet
2. Understanding CNN
CNN = Input Layer + {[Conv Layer × A + ReLU Layer] × B + Pooling Layer} × C + FC Layer × D.
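A minimal PyTorch sketch of this pattern (my own illustration; the choices A = 1, B = 2, C = 2, D = 2, the channel sizes and the 32×32 RGB input are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    # {[Conv + ReLU] * 2 + Pool} * 2
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    # FC * 2 (for a 32x32 input the spatial size is now 8x8)
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

logits = model(torch.randn(1, 3, 32, 32))   # -> shape (1, 10)
```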
3. What is the difference between CNN and traditional fully connected neural network?
In a fully connected network, every node in one layer is connected to every node in the adjacent layer, so the nodes of each fully connected layer can be arranged as a single column, which makes the connection structure easy to display. In a convolutional network, only some of the nodes in adjacent layers are connected, and in order to show the dimensions of the neurons at each layer, the nodes of each convolutional layer are usually organized as a three-dimensional matrix. The only difference between a fully connected network and a convolutional network is how neurons in adjacent layers are connected.
4. Talk about CNN, each layer and its role
Convolution layer: it is used for feature extraction
Pooling layer: compresses the input feature map; on the one hand this makes the feature map smaller and reduces the computational complexity, and on the other hand it compresses the features and keeps the main ones.
Activation function: it is used to add nonlinear factors, because linear models are not expressive enough.
Fully connected (FC) layer: acts as the "classifier" of the whole convolutional network, mapping the learned "distributed feature representation" to the sample label space.
5. Why do neural networks use convolution layers? Parameter sharing and local connectivity.
What is the prerequisite for using a convolution layer? The data distribution should be consistent, i.e. the same local statistics apply across positions.
6. Compared with earlier convolutional network models, what is ResNet's biggest improvement, and what problem does it solve?
Skip (shortcut) connections in the residual block, together with the bottleneck layer. ResNet fits residuals instead of the full mapping, which makes the learning task simpler and effectively alleviates gradient dispersion (vanishing gradients).
Why does ResNet alleviate vanishing gradients, and how? Can you derive it?
Because every convolution (including the corresponding activation) loses some information, for example due to the randomness (blindness) of the convolution kernels and the suppression caused by the activation function. The shortcut in ResNet is equivalent to carrying the information from earlier layers forward and processing it together with the current layer, and during backpropagation the identity path lets the gradient flow back directly, as in the sketch below.
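A hedged PyTorch sketch of a basic residual block (my own illustration of out = F(x) + x; the original post contains no code for this, and the channel count is arbitrary):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # shortcut: the identity path carries information and gradients
        # around F(x), so gradients do not have to pass through the weights
        return self.relu(out + x)

y = BasicBlock(16)(torch.randn(1, 16, 8, 8))   # same shape in, same shape out
```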
7. What improvements were made in ResNet v2? What is the most powerful variant of ResNet, what is its structure, and how does it work?
ResNet v2: 1. Compared with the original structure, the pre-activation design makes the skip path an identity transformation, which makes the model easier to optimize. 2. Applying the activation to the input first (pre-activation) also acts as regularization and reduces overfitting.
The most powerful variant of ResNet is ResNeXt.
ResNeXt combines ResNet with the Inception idea of "split + transform + concat". Its structure is simple, easy to understand and powerful enough. (The Inception network uses a split-transform-merge strategy: the input is split into several lower-dimensional branches, each branch produces its own feature maps, and the results are merged. However, that design does not generalize well, because too many things have to be hand-designed for each new task.)
ResNeXt introduces the concept of cardinality as an additional measure of model complexity. Cardinality is the number of identical parallel branches in a block.
Compared with ResNet, ResNeXt gives better results with the same number of parameters: a 101-layer ResNeXt reaches roughly the same accuracy as a 200-layer ResNet, with only about half the computation.
ResNet's key feature is the introduction of skip connections, which effectively solves the vanishing-gradient problem when the network is very deep and makes it feasible to design much deeper networks.
8. Briefly describe the networks from Inception V1 to V4, their differences and improvements.
The core of Inception V1 (GoogLeNet) is to replace some large convolution layers with small 1×1, 3×3 and 5×5 convolutions, which greatly reduces the number of weight parameters.
Inception V2 adds batch normalization to the layer inputs (its paper is often referred to simply as the batch-normalization paper). With this addition training converges faster, learning is more efficient, and the reliance on dropout is reduced.
Inception V3 factorizes some 7×7 convolutions of GoogLeNet into a two-layer series of 1×7 and 7×1 convolutions, and likewise 3×3 into 1×3 and 3×1, which speeds up computation, increases the non-linearity of the network and reduces the probability of overfitting. In addition, the network input size is changed from 224 to 299.
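A rough parameter count illustrates the speed-up claim (my own back-of-the-envelope arithmetic, assuming equal channel counts $C$):

$$7\times 7:\; 49\,C_{\text{in}}C_{\text{out}} \qquad\text{vs.}\qquad 1\times 7 \text{ then } 7\times 1:\; 7\,C_{\text{in}}C_{\text{mid}} + 7\,C_{\text{mid}}C_{\text{out}} \approx 14\,C^2$$

That is roughly 3.5× fewer weights per position, plus one extra non-linearity between the two factorized convolutions.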
Inception V4 adds ResNet-style residual connections to the original Inception design: a node can skip some intermediate nodes and connect directly to later nodes, with the residual added back afterwards. In addition, V4 changes some 1×1-then-3×3 sequences into 3×3-then-1×1.
The paper states that the residual connections were introduced not to increase depth (and hence accuracy) but mainly to speed up training.
9. Why is DenseNet more expressive than ResNet?
DenseNet increases the depth and widens each DenseBlock, which strengthens the network's feature-recognition ability. Moreover, because the "horizontal" structure of a DenseBlock is similar to that of an Inception block, the number of parameters to compute is greatly reduced.
3. Loss functions
1. Talk about Smooth L1 Loss and explain the advantages of using it
Advantages of Smooth L1: ① it converges faster than the L1 loss; ② compared with the L2 loss it is not sensitive to outliers, its gradient changes more gently, and training is less likely to blow up (diverge). The standard definition is given below.
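For reference, the standard Smooth L1 definition (as used in Fast R-CNN; not reproduced in the original post), with $x = f(x_i) - y_i$:

$$\mathrm{SmoothL1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$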
2. Difference between L1_loss and L2_loss
Mean Absolute Error (L1 loss, MAE): the average absolute distance between the model prediction f(x) and the true value y:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|f(x_i) - y_i\right|$$

Mean Squared Error (L2 loss, MSE): the average squared difference between the model prediction f(x) and the true value y:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(f(x_i) - y_i\right)^2$$
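A small numpy sketch of my own comparing the three losses on the same residuals (the numbers are made up for illustration):

```python
import numpy as np

def l1_loss(pred, target):
    return np.mean(np.abs(pred - target))

def l2_loss(pred, target):
    return np.mean((pred - target) ** 2)

def smooth_l1_loss(pred, target):
    diff = np.abs(pred - target)
    return np.mean(np.where(diff < 1, 0.5 * diff ** 2, diff - 0.5))

pred = np.array([0.5, 1.2, 3.0, 10.0])     # the last prediction is an outlier
target = np.array([0.4, 1.0, 3.1, 2.0])

print(l1_loss(pred, target), l2_loss(pred, target), smooth_l1_loss(pred, target))
# the outlier inflates the L2 loss far more than the L1 or Smooth L1 loss
```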
3. Why is cross entropy used instead of squared loss for classification problems? What is cross entropy?
1. With the squared-error loss, the parameter gradient grows as the error grows at first, but when the error becomes large the gradient shrinks again (because the activation saturates). 2. With the cross-entropy loss, the larger the error, the larger the parameter gradient, so training can converge quickly.
Why is the cross-entropy loss more commonly used than the mean-squared-error loss in classification?
The gradient of the cross-entropy loss with respect to the input weights is proportional to the error between the predicted and true values and does not contain the gradient of the activation function, while the gradient of the MSE loss does contain it. Because commonly used activation functions such as sigmoid/tanh have saturation regions where their gradient is nearly zero, the MSE weight gradient becomes very small, the weights w adjust slowly and training is slow. The cross-entropy loss does not have this problem: its weight updates scale with the error, so training is faster and works better.
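A sketch of the gradients behind this claim (my own summary of the standard derivation), for a single sample with $z = w^\top x + b$, sigmoid output $a = \sigma(z)$ and label $y$:

$$\text{MSE: } L = \tfrac{1}{2}(a-y)^2 \;\Rightarrow\; \frac{\partial L}{\partial w} = (a-y)\,\sigma'(z)\,x$$

$$\text{Cross entropy: } L = -\big[y\ln a + (1-y)\ln(1-a)\big] \;\Rightarrow\; \frac{\partial L}{\partial w} = (a-y)\,x$$

The MSE gradient carries the extra factor $\sigma'(z)$, which is nearly 0 in the saturated regions of the sigmoid, while the cross-entropy gradient depends only on the error $(a-y)$.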
4. How do you design the loss function for an image that belongs to multiple categories (multi-label classification)?
How to handle multi-label classification from the loss-function point of view:
| Classification problem | Output-layer activation function | Loss function |
| --- | --- | --- |
| Binary classification | Sigmoid | Binary cross-entropy (binary_crossentropy) |
| Multi-class classification | Softmax | Categorical cross-entropy (categorical_crossentropy) |
| Multi-label classification | Sigmoid | Binary cross-entropy (binary_crossentropy) |
(The relationship between the multi-label problem and binary classification was discussed above: compute the loss for each label of a sample (using a sigmoid at the output layer) and then take the average, i.e. turn the multi-label problem into a binary classification problem on each label; see the sketch below.)
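A hedged PyTorch sketch of this recipe (sigmoid per label plus binary cross-entropy averaged over the labels; the shapes and names are illustrative):

```python
import torch
import torch.nn as nn

num_labels = 5
logits = torch.randn(8, num_labels, requires_grad=True)   # raw scores for 8 samples
targets = torch.randint(0, 2, (8, num_labels)).float()    # each label is an independent 0/1

# BCEWithLogitsLoss applies a sigmoid to each output and averages the
# per-label binary cross-entropy: one binary problem per label, then the mean
criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, targets)
loss.backward()
```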
5. Loss function of LR? What’s its derivative? And what’s the derivative of this with regularization?
Logistic regression (LR) is a machine learning method commonly used in industry.
Although logistic regression has "regression" in its name, it is actually a classification method, mainly used for binary classification. It uses the logistic (sigmoid) function: the input can be any real number in (-inf, +inf), while the output lies in (0, 1). The function is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The loss function of LR is the cross-entropy loss; its derivative, with and without regularization, is sketched below.
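A sketch of the derivative part of the question (standard logistic-regression results, not spelled out in the original post), with $h_w(x) = \sigma(w^\top x)$:

$$J(w) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y_i \ln h_w(x_i) + (1-y_i)\ln\big(1-h_w(x_i)\big)\Big]$$

$$\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\big(h_w(x_i) - y_i\big)\,x_i$$

With L2 regularization $\frac{\lambda}{2m}\lVert w\rVert^2$ added to the loss, the gradient becomes

$$\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\big(h_w(x_i) - y_i\big)\,x_i + \frac{\lambda}{m}\,w.$$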
References
LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks[J]. Advances in neural information processing systems, 2012, 25: 1097-1105.
Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[C].Proceedings of the IEEE conference on computer vision and pattern recognition. 2015:1-9.
He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C].Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.
Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks[C].Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 4700-4708.
He K, Zhang X, Ren S, et al. Identity mappings in deep residual networks[C].European conference on computer vision. Springer, Cham, 2016: 630-645.
Xie S, Girshick R, Dollár P, et al. Aggregated residual transformations for deep neural networks[C].Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 1492-1500.
Reference links
zhuanlan.zhihu.com/p/26138673
Blog.csdn.net/elaine_bao/…
www.cnblogs.com/guoyaohua/p…
Blog.csdn.net/bl128ve900/…
Blog.csdn.net/bll1992/art…
Blog.csdn.net/kittyzc/art…
Blog.csdn.net/qq122716072…
www.cnblogs.com/wangguchang…
zhuanlan.zhihu.com/p/97324416
Blog.csdn.net/zouxy09/art…