16. What data sets are unsuitable for deep learning?
1. When the data set is too small and the data samples are insufficient, deep learning has no obvious advantages over other machine learning algorithms.
2. The data set has no local correlation. Deep learning currently performs best in image, speech, and natural language processing, and what these fields share is local correlation: pixels in an image form objects, phonemes in a speech signal form words, and words in text form sentences. Once the arrangement of these elements is disrupted, the meaning changes as well. Deep learning is poorly suited to data sets without such local correlation. For example, the features used to predict a person's health (age, occupation, income, family status, and so on) can be reordered without affecting the outcome.
17. How can generalized linear models be used in deep learning?
A Statistical View of Deep Learning (I): Recursive GLMs
From a statistical point of view, deep learning can be regarded as a recursive generalized linear model.
Compared with the classical linear model (y = wx + b), the core of the generalized linear model is the introduction of a link function g(.), so that y = g^{-1}(wx + b).
In deep learning, the activation function of a neuron plays the role of the link function in this recursive generalized linear model. The logistic function of logistic regression (one kind of generalized linear model) is exactly the sigmoid activation function of a neuron. Many equivalent methods go by different names in statistics and in neural networks, which easily confuses beginners (mainly me, here).
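To make the correspondence concrete, here is a minimal sketch in plain Python. The function names (`logit`, `sigmoid`, `neuron`, `two_layer`) and the weights in the usage lines are illustrative assumptions, not taken from the referenced article. It shows that a sigmoid neuron computes y = g^{-1}(wx + b) with g the logit link, and that stacking such units gives the "recursive" structure.

```python
import math

def logit(p):
    """Link function g of logistic regression: probability -> real line."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse link g^{-1}, which is also the sigmoid activation of a neuron."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """One neuron seen as one GLM: y = g^{-1}(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

def two_layer(x, W1, b1, w2, b2):
    """Outputs of several GLMs become the inputs of another GLM (recursive GLM)."""
    hidden = [neuron(x, w, b) for w, b in zip(W1, b1)]
    return neuron(hidden, w2, b2)

# Hypothetical weights, chosen only for illustration.
y = two_layer([1.0, 2.0], W1=[[0.5, -0.3], [0.1, 0.2]], b1=[0.0, 0.1],
              w2=[1.0, -1.0], b2=0.0)
print(y)
```

Applying `logit` to the output of `sigmoid` recovers the linear predictor, which is what makes the sigmoid the inverse of the logit link.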
18. How can vanishing and exploding gradients be mitigated (fine-tuning, gradient clipping, improved activation functions, etc.)?
In short, the causes of vanishing and exploding gradients are as follows. Vanishing gradients: by the chain rule, if the partial derivative of each layer's output with respect to the previous layer's output, multiplied by the weight, is less than 1, then even at a value as high as 0.99 the partial derivative of the error with respect to the input layer tends to zero after passing through enough layers.
The ReLU activation function can effectively address vanishing gradients, as can Batch Normalization. See also: Why does Batch Normalization work well in deep learning?
Exploding gradients: by the chain rule, if the partial derivative of each layer's output with respect to the previous layer's output, multiplied by the weight, is greater than 1, the partial derivative of the error with respect to the input layer tends to infinity after passing through enough layers.
This can be addressed through activation functions, or through Batch Normalization.
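The chain-rule argument above can be checked with a few lines of arithmetic. This is a simplified sketch that assumes every layer contributes the same scalar factor (0.99 or 1.01, both chosen for illustration); real networks multiply Jacobian matrices, but the compounding effect is the same.

```python
def backprop_factor(per_layer_factor, n_layers):
    """Product of identical per-layer factors (derivative times weight)
    accumulated by the chain rule across n_layers layers."""
    return per_layer_factor ** n_layers

for depth in (10, 100, 1000):
    shrinking = backprop_factor(0.99, depth)  # slightly below 1: decays toward 0
    growing = backprop_factor(1.01, depth)    # slightly above 1: grows without bound
    print(depth, shrinking, growing)
```

Even factors this close to 1 drift by several orders of magnitude once the depth reaches the hundreds or thousands.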
19. Describe the development history of neural networks
In 1949, Hebb proposed the Hebbian learning theory, a neuropsychological learning paradigm
In 1952, IBM's Arthur Samuel wrote a checkers program
In 1957, Rosenblatt’s perceptron algorithm was the second machine learning model with a neuroscience background.
Three years later, Widrow made ML history by inventing the Delta learning rule, which was immediately applied to perceptron training
The enthusiasm for the perceptron was doused with cold water by Minsky in 1969. He posed the famous XOR problem and demonstrated the weakness of perceptrons on linearly inseparable data such as XOR.
20. Common methods of deep learning
At present, DNN, CNN, and RNN are the most widely used in practice.
DNN is the traditional fully connected network, which can be used for ad click-through rate prediction, recommendation, and similar tasks. Using embeddings to encode many discrete features before feeding them into the network can greatly improve results.
CNN is mainly used in computer vision. CNN emerged chiefly to solve the problem that a DNN has too many parameters on image data. CNN has since developed a series of building blocks and architectures, such as convolution, pooling, Batch Normalization, Inception, ResNet, and DeepNet, and has made great progress in classification, object detection, face recognition, image segmentation, and other areas. CNN is not limited to images: it has also made great progress in natural language processing, where some CNN-based language models now outperform LSTM. ResNet, a CNN architecture, is also one of the two basic algorithms in the latest AlphaZero.
GAN is a training method for generative models and now has many applications in CV, such as image translation, image super-resolution, and image inpainting.
21. Please briefly describe the development history of neural networks.
The sigmoid saturates, causing the gradient to vanish. Hence ReLU.
The negative half-axis of ReLU is a dead zone where the gradient becomes 0. Hence LeakyReLU and PReLU.
Emphasis then shifted to the stability of gradients and of weight distributions, which led to ELU and, more recently, SELU.
When a network is too deep, the gradient cannot propagate back through it. Hence highway networks.
Dropping even the gate parameters of highway networks and using a plain residual connection instead gives ResNet.
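The activation functions named in this chain can be written down in a few lines. This is a minimal sketch using their standard textbook definitions; the default slopes (`alpha=0.01` for LeakyReLU, `alpha=1.0` for ELU) are common conventions assumed here, not values given in the text. PReLU has the same form as LeakyReLU except that `alpha` is learned during training.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # approaches 0 when |x| is large (saturation)

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # exactly 0 on the negative half-axis (dead zone)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha  # small but non-zero gradient for x <= 0

def elu(x, alpha=1.0):
    # Smooth for negative inputs and bounded below by -alpha.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

for x in (-10.0, -1.0, 1.0, 10.0):
    print(x, sigmoid_grad(x), relu_grad(x), leaky_relu_grad(x), elu(x))
```

Comparing the gradient columns shows each step of the story: the sigmoid gradient collapses at both extremes, ReLU fixes the positive side but is flat on the negative side, and the leaky variants keep a small gradient there.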
22. What is the real purpose of the activation function in a neural network? Which properties must an activation function have? Which properties are desirable but not required?
- Nonlinearity: the derivative is not a constant. This is the foundation of multilayer neural networks, because it ensures that a multilayer network does not degenerate into a single-layer linear network. This is the whole point of the activation function.
- Differentiability almost everywhere: differentiability guarantees that gradients can be computed during optimization. Traditional activation functions such as sigmoid are differentiable everywhere. Piecewise linear functions such as ReLU are differentiable only almost everywhere (that is, non-differentiable at only a finite number of points). For SGD, since it almost never converges to a point where the gradient is close to zero, a finite number of non-differentiable points has little effect on the optimization result [1].
- Simple computation: there are many nonlinear functions. In the extreme case, a multilayer neural network can itself serve as a nonlinear function, as in Network In Network [2], which treats it as a convolution operation. However, in the forward pass the activation function is evaluated a number of times proportional to the number of neurons, so a simple nonlinear function is naturally better suited as an activation function. This is one reason ReLU and its variants are more popular than activation functions that rely on operations such as exp.
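The nonlinearity requirement can be verified numerically. The following sketch assumes NumPy, random weights, and bias-free layers (all illustrative choices): two stacked linear layers give exactly the same outputs as one linear layer with the product of the weight matrices, while inserting ReLU between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))    # 5 samples, 4 features
W1 = rng.normal(size=(4, 8))   # first layer weights (biases omitted for brevity)
W2 = rng.normal(size=(8, 3))   # second layer weights

# Two layers with no activation in between ...
two_linear = (x @ W1) @ W2
# ... equal a single linear layer with weights W1 @ W2.
one_linear = x @ (W1 @ W2)
collapsed = np.allclose(two_linear, one_linear)

# With a nonlinearity between the layers, the network no longer
# matches that single linear map.
with_relu = np.maximum(x @ W1, 0.0) @ W2
differs = not np.allclose(with_relu, one_linear)

print(collapsed, differs)
```

Adding biases does not change the conclusion: the composition of affine maps is still a single affine map.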
23. Neural networks trained with gradient descent are said to converge easily to a local optimum. Why is the method still so widely used?
The idea that deep neural networks "converge easily to a local optimum" is probably an illusion. In reality we may never find even a local optimum, let alone the global optimum.
Many people hold the view that "local optima are the main difficulty in neural network optimization". This comes from intuition about one-dimensional optimization problems. In the univariate case, the most intuitive difficulty of an optimization problem is the presence of many local extrema.
24. Briefly describe several commonly used CNN models
25. Why do many face recognition papers add a locally connected convolution (Local-Conv) at the end?
Take Facebook's DeepFace as an example:
DeepFace first applies two standard convolution layers and one pooling layer to extract low-level edge and texture features, followed by three Local-Conv layers. Local-Conv is used here because a face has different features in different regions (the positions of the eyes, nose, and mouth are relatively fixed). When local features do not share one distribution across the whole image, Local-Conv is better suited to feature extraction.
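The difference between a standard convolution and a locally connected layer can be shown in one dimension. This is a simplified sketch, not DeepFace's actual implementation: the function names, the toy input, and the kernel values are assumptions made for illustration, and "convolution" here means the sliding dot product used in deep learning (no kernel flip, no padding, stride 1).

```python
import numpy as np

def conv1d_shared(x, kernel):
    """Standard convolution: one kernel shared by every position."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

def conv1d_local(x, kernels):
    """Locally connected layer: a separate kernel for every output position."""
    n_out, k = kernels.shape
    return np.array([x[i:i + k] @ kernels[i] for i in range(n_out)])

x = np.arange(6, dtype=float)              # toy input: [0, 1, 2, 3, 4, 5]
shared = np.array([1.0, -1.0])             # 2 parameters in total
local = np.tile(shared, (5, 1))            # 5 positions x 2 = 10 parameters
local[0] = [2.0, 0.0]                      # position 0 gets its own filter

out_shared = conv1d_shared(x, shared)
out_local = conv1d_local(x, local)
print(out_shared, shared.size)
print(out_local, local.size)
```

The locally connected layer costs one kernel per position, which is affordable only when the input is aligned so that each position always sees the same kind of content, as with aligned face images.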
26. What is a gradient explosion?
The error gradient is the direction and magnitude computed during neural network training, used to update the network weights in the right direction and by the right amount.
In deep networks or recurrent neural networks, error gradients can accumulate during updates and become very large gradients, which can then lead to large updates of network weights and thus make the network unstable. In extreme cases, the weights become so large that they overflow, resulting in NaN values.
Exponential growth resulting from repeated multiplication of gradients (values greater than 1.0) between network layers produces a gradient explosion.
27. What are the problems caused by gradient explosions?
In deep multilayer perceptron networks, gradient explosion can make the network unstable: at best it cannot learn from the training data, and at worst the weights become NaN values that can no longer be updated.
The gradient explosion makes the learning process unstable. — Deep Learning, 2016.
In recurrent neural networks, gradient explosion makes the network unstable and unable to learn from the training data. At best, the network cannot learn from long input sequences.
28. How do I determine if there is a gradient explosion?
Gradient explosion during training will be accompanied by some subtle signals, such as:
The model fails to make progress on the training data (for example, the loss stays poor).
The model is unstable, so the loss changes significantly from one update to the next.
During training, the model loss becomes NaN.
If you find these problems, then you need to look carefully for gradient explosion problems.
Here are some slightly more obvious signs that can help determine if there is a gradient explosion problem.
The model gradient increases rapidly during training.
Model weights become NaN values during training.
During training, the error gradient values of each node and layer are consistently above 1.0.
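These signs can be checked programmatically. The sketch below is framework-independent and purely illustrative: the helper names, the dictionary layout, and the `threshold=1e3` cut-off are assumptions, and in practice the gradients would come from whatever training framework is in use.

```python
import math

def gradient_report(grads_by_layer):
    """Map each layer name to the L2 norm of its gradient values.

    grads_by_layer: dict of layer name -> list of floats (hypothetical layout).
    """
    return {name: math.sqrt(sum(g * g for g in grads))
            for name, grads in grads_by_layer.items()}

def looks_exploding(report, threshold=1e3):
    """Flag NaN, infinite, or very large gradient norms.

    The threshold is an arbitrary illustrative value, not a recommended setting.
    """
    return any(math.isnan(n) or math.isinf(n) or n > threshold
               for n in report.values())

healthy = gradient_report({"layer1": [3.0, 4.0], "layer2": [0.1, -0.2]})
broken = gradient_report({"layer1": [1e6, 2e6], "layer2": [float("nan")]})
print(healthy, looks_exploding(healthy))
print(broken, looks_exploding(broken))
```

Logging such a report every few updates makes a rapidly growing norm visible before the loss itself turns into NaN.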
29. How to fix gradient explosion?
There are many ways to solve the gradient explosion problem; this section lists some best-practice approaches.
- Redesign the network model
In deep neural networks, gradient explosion can be addressed by redesigning the network with fewer layers. Using a smaller batch size also benefits training. In recurrent neural networks, updating over fewer previous time steps during training (truncated backpropagation through time) can alleviate the gradient explosion problem.
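Question 18 also names gradient clipping (sometimes translated as gradient truncation) as a remedy. A minimal sketch of clipping by global norm follows, assuming the gradients are available as a flat list of floats; the function name and the `max_norm` value are illustrative, and deep learning frameworks ship their own equivalents.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so that their L2 norm is at most max_norm.

    The direction of the gradient is preserved; only its length changes.
    """
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm or total == 0.0:
        return list(grads)            # already small enough: leave unchanged
    scale = max_norm / total
    return [g * scale for g in grads]

print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))   # norm 5 scaled down to 1
print(clip_by_global_norm([0.3, 0.4], max_norm=1.0))   # norm 0.5 left unchanged
```

Clipping limits the size of a single update without changing the network architecture, so it can be combined with the redesign options described above.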
30. What is the input and output of LSTM neural network?
The first thing to make clear is that the unit a neural network operates on is always a vector.
So why does the training data appear as matrices and tensors?
Regular feedforward input and output: matrices
Input matrix shape: (n_samples, dim_input)
Output matrix shape: (n_samples, dim_output)
Note: when the network is actually trained or tested, its input and output are just vectors. The n_samples dimension is added so that multiple samples can be processed at once and the weights updated with their average gradient, which is called mini-batch gradient descent. If n_samples equals 1, the update scheme is called Stochastic Gradient Descent (SGD).
The input and output of Feedforward are essentially a single vector.
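The shapes can be confirmed with a single linear layer. This sketch assumes NumPy and arbitrary illustrative sizes (32 samples, 10 inputs, 3 outputs); it shows that the batched matrix computation is just the single-vector computation applied to every row.

```python
import numpy as np

n_samples, dim_input, dim_output = 32, 10, 3   # illustrative sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(dim_input, dim_output))
b = np.zeros(dim_output)

# One sample: a vector in, a vector out.
x_single = rng.normal(size=dim_input)
y_single = x_single @ W + b                    # shape (dim_output,)

# A mini-batch: a matrix in, a matrix out.
X_batch = rng.normal(size=(n_samples, dim_input))
Y_batch = X_batch @ W + b                      # shape (n_samples, dim_output)

# Each row of the batch output equals the per-vector computation.
row0 = X_batch[0] @ W + b
print(y_single.shape, Y_batch.shape, np.allclose(Y_batch[0], row0))
```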
Regular recurrent (RNN/LSTM/GRU) input and output: tensors
Input tensor shape: (time_steps, n_samples, dim_input)
Output tensor shape: (time_steps, n_samples, dim_output)
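The tensor shapes can be illustrated with a bare recurrent loop. For brevity this sketch uses a plain tanh RNN cell rather than an LSTM (an LSTM adds gates and a cell state but consumes and produces the same shapes); NumPy, the sizes, and the weight names `W_x` and `W_h` are assumptions for illustration.

```python
import numpy as np

time_steps, n_samples, dim_input, dim_output = 7, 4, 5, 3   # illustrative sizes
rng = np.random.default_rng(0)
W_x = rng.normal(size=(dim_input, dim_output)) * 0.1    # input-to-hidden weights
W_h = rng.normal(size=(dim_output, dim_output)) * 0.1   # hidden-to-hidden weights
b = np.zeros(dim_output)

inputs = rng.normal(size=(time_steps, n_samples, dim_input))
h = np.zeros((n_samples, dim_output))      # initial hidden state for every sample

outputs = []
for x_t in inputs:                         # x_t has shape (n_samples, dim_input)
    h = np.tanh(x_t @ W_x + h @ W_h + b)   # one time step for the whole batch
    outputs.append(h)
outputs = np.stack(outputs)                # (time_steps, n_samples, dim_output)

print(inputs.shape, outputs.shape)
```

At every time step the cell still receives one vector per sample; the time and batch dimensions only organize many such vectors.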