1. What are the applications of deep learning
- Image: image recognition, object recognition, image beautification, image restoration, object detection.
- Natural language processing: machine writing (text generation), personalized recommendation, text classification, translation, automatic error correction, sentiment analysis.
- Numerical prediction, quantitative trading
2. What are neural networks
Take house price prediction as an example. The area of the house is the input of the neural network (we call it 𝑥); it passes through a node (a small circle), which finally outputs the price (we call it 𝑦). This small circle is a single neuron, loosely analogous to a neuron in the human brain, and the whole thing is a single-neuron network. Any neural network, large or small, is formed by stacking such neurons together. If you think of each neuron as a Lego block, you build a larger neural network by stacking blocks.
That said, neural networks have little to do with the brain. Comparing the logistic unit of a neural network to the biological neuron on the right is an oversimplified analogy. Even neuroscientists still find it hard to explain what a single neuron does.
2.1 What is a perceptron
It starts with logistic regression. The logistic regression model is:

$$z = w^T x + b, \qquad a = \sigma(z) = \frac{1}{1 + e^{-z}}$$

Drawn as a network, this model is called a perceptron:
If you add a hidden layer to this perceptron, you get the neural network structure that we’re talking about.
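As a minimal sketch of the single logistic unit described above (plain Python, not the author's code), the snippet below computes the weighted sum 𝑧 and then the activation 𝑎. The weights, bias, and input values are hypothetical and chosen by hand purely for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """One logistic unit: weighted sum z, then activation a."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical parameters and input, for illustration only.
w = [2.0, -1.0]
b = 0.5
a = neuron([1.0, 3.0], w, b)   # z = 2*1 + (-1)*3 + 0.5 = -0.5
```

Because 𝑧 is negative here, the output 𝑎 is below 0.5; a hidden layer is simply several such units fed with the same input.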
2.2 Structure of neural network
A neural network generally consists of an input layer, one or more hidden layers (of neurons), and an output layer. Adjacent layers are connected to each other, as shown in the following figure.
By convention, the input layer is not counted when counting the layers of a neural network: the number of layers from the first hidden layer through the output layer is the number of layers of the network. For example, the figure above is a three-layer neural network.
**Meaning of the hidden layer:** when a neural network is trained with supervised learning, the training set contains the inputs 𝑥 and the target outputs 𝑦. The term "hidden layer" means that the true values of these intermediate nodes are not given in the training set; in other words, you cannot see in the training set what values they should have.
- In practice, a neural network with multiple hidden layers usually performs much better than one with a single hidden layer.
- Increasing the number of hidden layers or neurons increases the “capacity” of the network and strengthens its representational power.
- Too many hidden layers or neurons can lead to overfitting.
- Do not try to reduce overfitting by cutting the number of network parameters; use regularization or dropout instead.
2.3 Why does neural network have nonlinear segmentation capability
Suppose we want to classify the points in the following figure, where circles are one class and red crosses are the other. No single straight line can separate them.
Now introduce a two-layer neural network, which contains one hidden layer. The two hidden neurons produce the lines P1 and P2 respectively: the region above line P1 is assigned to the red crosses, and so is the region below line P2. Going from the hidden layer to the output layer then only requires combining these two lines to get h(x): the intersection of the space above P1 and the space below P2 is classified as red cross, and the rest of the space is classified as circle. In this way, a problem that no single line can solve is handled with a nonlinear decision boundary.
With a more complex hidden layer, even intricate distributions of sample points in the plane can be separated almost perfectly (similar to image matting), as shown in the figure below:
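The idea of combining two lines can be sketched in a few lines of plain Python. Everything concrete here is an assumption made for illustration: the lines P1 (x2 = x1 - 1) and P2 (x2 = x1 + 1), the step activation, and the four sample points, which form an XOR-like layout that no single line separates.

```python
def step(v):
    """Threshold activation: 1 if v > 0, else 0."""
    return 1 if v > 0 else 0

def classify(x1, x2):
    """Return 1 for 'cross' and 0 for 'circle'."""
    h1 = step(x2 - x1 + 1.0)     # 1 when the point lies above line P1: x2 = x1 - 1
    h2 = step(x1 + 1.0 - x2)     # 1 when the point lies below line P2: x2 = x1 + 1
    return step(h1 + h2 - 1.5)   # output neuron: logical AND of the two half-planes

crosses = [(0.0, 0.0), (2.0, 2.0)]   # lie in the band between P1 and P2
circles = [(2.0, 0.0), (0.0, 2.0)]   # lie outside the band
predictions = [classify(*p) for p in crosses + circles]   # [1, 1, 0, 0]
```

Each hidden neuron is still a linear separator; the nonlinearity of the final boundary comes from the output neuron combining their thresholded results.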
3. Calculation process of neural network
3.1 Calculation Process
As shown in the figure below, each neuron performs two steps: it first computes 𝑧, and then applies the sigmoid activation function to 𝑧 to obtain 𝑎. A neural network simply repeats this calculation many times.
The computation of one of these neurons is shown below:
Vectorization: when you implement a neural network, computing each neuron with a for loop is very inefficient, so we vectorize these four equations. Vectorizing means stacking the parameters of all neurons in one layer vertically. For example, the 𝑤 vectors of the hidden layer are stacked into a (4,3) matrix, denoted 𝑊[1]. Another way to look at it: we have four logistic regression units, each with its own parameter vector 𝑤, and stacking these four vectors gives the 4 by 3 matrix.
The formula above is the vectorized calculation for a single sample. To vectorize over multiple samples, you add columns to the input: each column corresponds to one sample.
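A small NumPy sketch may make the stacking concrete. It assumes the sizes mentioned in the text (3 inputs, 4 hidden units) plus an arbitrary batch of 5 samples; the names `W1`, `b1`, and `X` and the random values are illustrative assumptions, not taken from the original figures.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5              # 3 inputs, 4 hidden units, 5 samples

W1 = rng.normal(size=(n_h, n_x))   # each row is one hidden unit's weight vector w
b1 = np.zeros((n_h, 1))            # one bias per hidden unit, broadcast across columns
X = rng.normal(size=(n_x, m))      # each column is one sample

# Loop version: one hidden unit and one sample at a time.
A_loop = np.zeros((n_h, m))
for i in range(m):
    for j in range(n_h):
        z = W1[j] @ X[:, i] + b1[j, 0]
        A_loop[j, i] = sigmoid(z)

# Vectorized version: the whole layer for all samples in one matrix product.
A_vec = sigmoid(W1 @ X + b1)
```

Both versions produce the same (4, 5) activation matrix; the vectorized form hands the loops to optimized linear algebra routines.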
3.2 Random initialization of model parameters
In neural networks, random initialization of model parameters is usually required. Here’s why.
Assume that the output layer keeps only one output unit o1 (removing o2 and o3 and the arrows pointing to them) and that all hidden units use the same activation function. If the parameters of every hidden unit are initialized to the same value, then in forward propagation every hidden unit computes the same value from the same input and passes it to the output layer. In back propagation, the parameter gradients of every hidden unit are equal. Therefore, after an update by a gradient-based optimization algorithm these parameters are still equal, and the same holds for all subsequent iterations.
In this case, no matter how many hidden units there are, the hidden layer effectively behaves as a single hidden unit. Therefore, as in the previous experiment, we usually initialize the model parameters of a neural network randomly, especially the weights.
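The symmetry argument can be checked numerically. The sketch below is an illustration under assumptions not stated in the text: a tanh hidden layer with 3 units, a single linear output, a squared loss, and all weights set to 0.5.

```python
import numpy as np

x = np.array([1.0, -2.0])        # one input sample with 2 features
y = 1.0                          # its target value

W1 = np.full((3, 2), 0.5)        # 3 hidden units, all initialized identically
w2 = np.full(3, 0.5)             # output weights, also identical

# Forward pass: tanh hidden layer, linear output, squared loss.
h = np.tanh(W1 @ x)              # every hidden unit computes the same value
o = w2 @ h
loss = 0.5 * (o - y) ** 2

# Backward pass (chain rule).
d_o = o - y
d_w2 = d_o * h
d_h = d_o * w2
d_W1 = np.outer(d_h * (1.0 - h ** 2), x)   # one gradient row per hidden unit
```

Every row of `d_W1` is identical, so a gradient step keeps the hidden units identical; random initialization is what breaks this symmetry.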
Two initialization methods are common:
- Random initialization from a normal distribution.
- Xavier initialization: assuming that a fully connected layer has a inputs and b outputs, Xavier random initialization samples each element of the layer’s weight parameter from the uniform distribution

$$U\left(-\sqrt{\frac{6}{a+b}},\ \sqrt{\frac{6}{a+b}}\right)$$

Xavier initialization is designed so that, after initialization, the variance of each layer’s output is not affected by the number of inputs of the layer, and the variance of each layer’s gradient is not affected by the number of outputs of the layer.
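A minimal NumPy sketch of Xavier uniform initialization follows, assuming the commonly used bound sqrt(6 / (a + b)) for a layer with a inputs and b outputs. The function name `xavier_uniform` and the layer sizes are hypothetical.

```python
import numpy as np

def xavier_uniform(a, b, rng):
    """Sample a (b, a) weight matrix from U(-limit, limit), limit = sqrt(6 / (a + b))."""
    limit = np.sqrt(6.0 / (a + b))
    return rng.uniform(-limit, limit, size=(b, a))

rng = np.random.default_rng(0)
W = xavier_uniform(a=256, b=128, rng=rng)   # layer with 256 inputs and 128 outputs
limit = np.sqrt(6.0 / (256 + 128))
```

A uniform distribution on (-limit, limit) has variance limit² / 3 = 2 / (a + b), so the weight variance shrinks as the layer gets wider.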
3.3 Activation function
3.3.1 What are the activation functions
In a hidden layer, the linear transformation is followed by a nonlinear transformation (such as sigmoid), which is called the transfer function or activation function. The examples above all use the sigmoid activation function from logistic regression. The figure below shows where the activation function sits in the network.
The sigmoid function:

$$a = \sigma(z) = \frac{1}{1 + e^{-z}}$$

Tanh, the hyperbolic tangent function:

$$a = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$

In fact, the tanh function is a shifted and rescaled sigmoid: $\tanh(z) = 2\sigma(2z) - 1$. It passes through the point (0,0) and its range is between -1 and +1, which is why it is usually preferred over sigmoid in hidden layers. There is one exception: in binary classification, 𝑦 is 0 or 1, so you want 𝑦^ to lie between 0 and 1 rather than between -1 and +1, and the output layer therefore uses the sigmoid activation function.
A shared disadvantage of the sigmoid and tanh functions is that when 𝑧 is very large or very small, the derivative (the slope of the function) becomes very small and approaches zero, which slows down gradient descent.
ReLU (rectified linear unit) function:

$$a = \max(0, z)$$

As long as 𝑧 is positive, the derivative is equal to 1, and as long as 𝑧 is negative, the derivative is equal to 0.
Here are some rules of thumb for selecting activation functions: if the output is 0 or 1 (binary classification), use the sigmoid function for the output layer and ReLU for all other units.
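The three activation functions and their derivatives can be written out in plain Python as a rough sketch. The choice of 0 for the ReLU derivative at exactly 𝑧 = 0 is a convention assumed here, since the derivative is undefined at that point.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    a = sigmoid(z)
    return a * (1.0 - a)

def tanh(z):
    return math.tanh(z)

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2

def relu(z):
    return max(0.0, z)

def d_relu(z):
    # Undefined at exactly z = 0; 0 is used here by convention.
    return 1.0 if z > 0 else 0.0

# Compare the slopes at a saturated negative input, at zero, and at a saturated positive input.
for z in (-10.0, 0.0, 10.0):
    print(z, d_sigmoid(z), d_tanh(z), d_relu(z))
```

At 𝑧 = ±10 the sigmoid and tanh slopes are close to zero, while the ReLU slope is still 1 for the positive input.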
Softmax activation function
- Calculation before the nonlinear transformation: $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$
- After the nonlinear transformation, a temporary variable: $t = e^{z^{[l]}}$
- Normalization: $a^{[l]}_i = \frac{t_i}{\sum_{j=1}^{n} t_j}$
- $a^{[l]}$ holds the probability of each category, and these probabilities sum to 1.
The activation functions we saw earlier, such as sigmoid and ReLU, take one real number as input and output one real number. The softmax activation function is special: because it has to normalize over all possible outputs, it takes a vector as input and outputs a vector.
The hardmax function looks at the elements of 𝑧 and puts 1 at the position of the largest element and 0 everywhere else. The mapping from 𝑧 to probabilities performed by softmax is gentler.
Softmax regression generalizes logistic regression to more than two categories.
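To contrast the two mappings, here is a small plain-Python sketch of softmax and hardmax. Subtracting the maximum before exponentiating is a standard numerical-stability step added here; it does not change the result. The score vector `z` is made up for illustration.

```python
import math

def softmax(z):
    """Map a list of scores to probabilities that sum to 1."""
    m = max(z)                             # subtract the max for numerical stability
    t = [math.exp(v - m) for v in z]       # temporary variable t (shifted exponentials)
    total = sum(t)
    return [v / total for v in t]          # normalization step

def hardmax(z):
    """Put 1 at the position of the largest score and 0 elsewhere."""
    k = z.index(max(z))
    return [1.0 if i == k else 0.0 for i in range(len(z))]

z = [5.0, 2.0, -1.0, 3.0]    # illustrative scores
probs = softmax(z)
one_hot = hardmax(z)
```

Both functions pick out the first element as the largest, but softmax still assigns a small positive probability to every other category.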
3.3.2 Pros and cons
- Over a wide range of 𝑧, the derivative (slope) of ReLU stays well above 0. In code, ReLU is a simple if-else statement, whereas the sigmoid function requires floating-point arithmetic, including an exponential. In practice, neural networks using the ReLU activation function generally learn faster than those using sigmoid or tanh.
- The derivatives of the sigmoid and tanh functions are close to 0 in the positive and negative saturation regions, which causes vanishing gradients, whereas the derivatives of ReLU and Leaky ReLU are constant for 𝑧 greater than 0 and do not vanish there. (Note also that when 𝑧 falls in the negative half, the gradient of ReLU is 0 and the neuron does not train, which produces so-called sparsity; Leaky ReLU does not have this problem.) The gradient of ReLU is 0 over half of the range of 𝑧; however, with enough hidden units, 𝑧 is greater than 0 for many of them, so learning can still be fast for most of the training data.
3.3.3 Why use the activation function
If you use a linear activation function, or no activation function at all, then no matter how many layers your neural network has, it only ever computes a linear function of the input, so you might as well remove all the hidden layers. In our simple example, if you use linear activation functions in the hidden layer and a sigmoid function in the output layer, the model is no more expressive than one with no hidden layer at all: it is equivalent to standard logistic regression.
Linear hidden layers are of no use here, because the composition of two linear functions is itself a linear function. Unless you introduce nonlinearity, you cannot compute more interesting functions, no matter how many layers you have.
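The collapse of stacked linear layers can be verified directly. In this NumPy sketch the layer sizes and random parameters are arbitrary assumptions; the point is only that W2(W1 x + b1) + b2 equals a single linear map with W = W2 W1 and b = W2 b1 + b2.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))   # hidden layer, identity activation
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=(1, 1))   # output layer, before any sigmoid
X = rng.normal(size=(3, 6))                                  # 6 samples as columns

two_layers = W2 @ (W1 @ X + b1) + b2

# The same map written as a single linear layer.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ X + b
```

Applying a sigmoid to either result gives the same predictions, which is exactly a logistic regression on the raw inputs.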
3.3.4 Why ReLu is better than TANH and Sigmoid function in artificial neural network?
- Computing sigmoid-type activation functions involves an exponential, which is expensive. When back propagation computes the error gradient, the derivative involves division and exponentials, which is again relatively expensive. Using the ReLU activation function saves a lot of computation throughout this process.
- In a deep network, gradients easily vanish when back-propagating through sigmoid functions: near the saturation region the sigmoid changes too slowly and its derivative tends to 0, so information is lost. This phenomenon is called saturation, and it can prevent a deep network from being trained. ReLU, by contrast, does not saturate for positive inputs, so its gradient does not become very small there.
- ReLU sets the output of some neurons to zero, which makes the network sparse, reduces the interdependence of parameters, and alleviates overfitting (some people also offer a biological interpretation). There are also improved variants of ReLU, such as PReLU and randomized ReLU, which can improve training speed or accuracy on some data sets. See the relevant papers for details.
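A rough way to see the saturation effect is to multiply activation derivatives across layers, as the chain rule does. This plain-Python sketch is deliberately simplified: it ignores the weight matrices and evaluates every layer at the same 𝑧, both of which are assumptions made only to isolate the contribution of the activation function.

```python
import math

def d_sigmoid(z):
    a = 1.0 / (1.0 + math.exp(-z))
    return a * (1.0 - a)

def d_relu(z):
    return 1.0 if z > 0 else 0.0

def chained_factor(derivative, z, depth):
    """Product of `depth` activation derivatives, all evaluated at the same z."""
    factor = 1.0
    for _ in range(depth):
        factor *= derivative(z)
    return factor

depth = 10
best_case_sigmoid = chained_factor(d_sigmoid, 0.0, depth)   # 0.25 ** 10, sigmoid's largest slope
saturated_sigmoid = chained_factor(d_sigmoid, 5.0, depth)   # far smaller in the saturation region
active_relu = chained_factor(d_relu, 5.0, depth)            # stays at 1 for positive inputs
```

Even in its best case the sigmoid contributes a factor of 0.25 per layer, so ten layers shrink the gradient by about a factor of a million, while an active ReLU path leaves it unchanged.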
3.3.5 What are the properties of the activation function?
- Nonlinearity: when the activation function is nonlinear, a two-layer neural network can approximate almost any function. The identity activation function, i.e. $f(x) = x$, does not have this property: if an MLP uses the identity activation function, the whole network is equivalent to a single-layer neural network;
- Differentiability: this property is needed when the optimization method is gradient-based;
- Monotonicity: when the activation function is monotonic, the error surface of a single-layer network is guaranteed to be convex;
- $f(x) \approx x$: when the activation function has this property and the parameters are initialized to small random values, training the neural network is very efficient; if the property does not hold, the initial values need to be set carefully;
- Range of output values: when the output of the activation function is bounded, gradient-based optimization is more stable, because each pattern significantly affects only a limited set of weights; when the output of the activation function is unbounded, training is more efficient, but a smaller learning rate is generally required.
3.4 Forward Propagation
Forward propagation refers to computing and storing intermediate variables (including outputs) of the model in sequence from input layer to output layer of the neural network.
Logistic regression steps: when we discussed logistic regression, the forward propagation step computed 𝑧, then 𝑎, and then the loss function 𝐿:

$$z = w^T x + b \;\Rightarrow\; a = \sigma(z) \;\Rightarrow\; L(a, y)$$

Forward propagation in a neural network is similar: first compute $z^{[1]}$ and $a^{[1]}$, then compute $z^{[2]}$ and $a^{[2]}$, and finally obtain the loss function.
3.5 Back Propagation (BP)
Back propagation is the method for computing the gradients of the neural network parameters. In general, following the chain rule of calculus, back propagation computes and stores the gradients of the objective function with respect to the intermediate variables and parameters of each layer, in order from the output layer to the input layer.
Forward propagation passes through all the hidden layers to the output layer and produces an output $\hat{y}$. This $\hat{y}$ is then substituted into the loss function, and the SGD algorithm is used to minimize the loss. In each gradient descent step, BP supplies the gradients used to update the parameter values of every layer; this is what is meant by BP propagating the error backward.
- Forward propagation computes the loss; BP propagates the error back.
- The weights of each layer are corrected according to the error signal: take the derivative with respect to each w, then update each w.
- The loss function depends on the parameters through a chain of nested functions, for example $\hat{y} = h(g(f(x)))$, so its derivatives are obtained with the chain rule.
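The forward and backward passes can be sketched together in NumPy for a small two-layer network. This is an illustration under assumptions not fixed by the text: sigmoid activations in both layers, a binary cross-entropy loss, 3 inputs, 4 hidden units, and randomly generated data. A finite-difference check on one weight is included to confirm that the chain-rule gradient is consistent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X, Y):
    """Forward pass: compute z and a layer by layer, then the mean loss."""
    W1, b1, W2, b2 = params
    Z1 = W1 @ X + b1
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)
    loss = -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
    return (A1, A2), loss

def backward(params, cache, X, Y):
    """Backward pass: apply the chain rule from the output layer to the input layer."""
    W1, b1, W2, b2 = params
    A1, A2 = cache
    m = X.shape[1]
    dZ2 = (A2 - Y) / m                     # sigmoid + cross-entropy simplifies to a - y
    dW2 = dZ2 @ A1.T
    db2 = dZ2.sum(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)     # sigmoid'(z) = a(1 - a)
    dW1 = dZ1 @ X.T
    db1 = dZ1.sum(axis=1, keepdims=True)
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))                          # 8 samples as columns
Y = (rng.random(size=(1, 8)) > 0.5).astype(float)    # random binary labels
params = [rng.normal(size=(4, 3)), np.zeros((4, 1)),
          rng.normal(size=(1, 4)), np.zeros((1, 1))]

cache, loss = forward(params, X, Y)
grads = backward(params, cache, X, Y)

# Finite-difference check on one weight of W1.
eps = 1e-6
params[0][0, 0] += eps
_, loss_plus = forward(params, X, Y)
params[0][0, 0] -= eps
numeric = (loss_plus - loss) / eps

# One gradient-descent update using the back-propagated gradients.
learning_rate = 0.01
new_params = [p - learning_rate * g for p, g in zip(params, grads)]
_, new_loss = forward(new_params, X, Y)
```

The numerical estimate `numeric` should closely match `grads[0][0, 0]`, and the loss after the small update should be lower than before it.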
3.6 Stochastic Gradient Descent (SGD)
3.6.1 Mini-batch Gradient descent
You can divide the training set into smaller subsets called mini-batches. Suppose each subset has 1000 samples: take 𝑥(1) to 𝑥(1000) as the first subset (the first mini-batch), then the next 1000 samples, 𝑥(1001) to 𝑥(2000), then another 1000 samples, and so on.
To run mini-batch gradient descent on the training set, loop over t = 1…5000, because there are 5000 mini-batches of 1000 samples each. Inside the for loop you perform one gradient descent step on 𝑋{𝑡} and 𝑌{𝑡}.
- batch_size = 1: stochastic gradient descent (SGD).
- batch_size = n: mini-batch gradient descent.
- batch_size = m: batch gradient descent.
Here 1 < n < m, where m is the size of the entire training set.
Advantages and disadvantages:
- Batch: relatively low noise and relatively large steps, so you keep moving toward the minimum.
- SGD: most of the time you move toward the global minimum, but sometimes you move away from it because a particular sample points you in the wrong direction. Stochastic gradient descent is therefore very noisy: on average it ends up close to the minimum, but it never truly converges and keeps fluctuating around the minimum. It is also inefficient, because processing one training sample at a time loses the speed-up from vectorization.
- Mini-batch: in practice a medium-sized mini-batch works best, keeping most of the benefit of vectorization while converging quickly.
First, if the training set is small (fewer than about 2000 samples), use batch gradient descent directly. Typical mini-batch sizes range from 64 to 512, and because of the way computer memory is laid out and accessed, the code tends to run faster when the mini-batch size is a power of 2.
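The mini-batch loop can be sketched with NumPy as follows. The helper `iter_minibatches`, the synthetic noiseless linear-regression data, the batch size of 64, and the learning rate are all assumptions chosen for illustration; batches are taken in order, as in the description above, without shuffling.

```python
import numpy as np

def iter_minibatches(X, Y, batch_size):
    """Yield consecutive (X_t, Y_t) pairs; samples are stored as columns."""
    m = X.shape[1]
    for start in range(0, m, batch_size):
        yield X[:, start:start + batch_size], Y[:, start:start + batch_size]

rng = np.random.default_rng(0)
m = 1000
X = rng.normal(size=(2, m))
true_w = np.array([[3.0, -2.0]])
Y = true_w @ X                                   # noiseless linear targets

w = np.zeros((1, 2))
learning_rate = 0.1
for epoch in range(20):
    for X_t, Y_t in iter_minibatches(X, Y, batch_size=64):
        error = w @ X_t - Y_t                    # forward pass on this mini-batch
        grad = error @ X_t.T / X_t.shape[1]      # gradient of half the mean squared error
        w -= learning_rate * grad                # one gradient-descent step per mini-batch

batch_sizes = [X_t.shape[1] for X_t, _ in iter_minibatches(X, Y, 64)]
```

With 1000 samples and a batch size of 64 there are 16 mini-batches per epoch, the last one holding the remaining 40 samples. Setting `batch_size=1` would give SGD and `batch_size=m` would give batch gradient descent.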
3.6.2 What is the effect of Batch_Size adjustment on training effect?
- If Batch_Size is too small, the model performs extremely poorly (the error surges).
- As Batch_Size increases, the same amount of data is processed faster.
- As Batch_Size increases, more epochs are needed to reach the same accuracy.
- Because these two factors pull in opposite directions, there is some Batch_Size at which the total training time is optimal.
- Because training ends up in different local optima, there is likewise some Batch_Size at which the final convergence accuracy is optimal.
4. Why is a neural network an end-to-end network?
End-to-end learning is one approach to problem solving. Its counterpart is multi-step problem solving, in which a problem is split into several steps that are solved one after another; end-to-end means that the result is obtained at the output end directly from the input data.
Instead of separate preprocessing and feature extraction, you feed in the raw data and get the final result.
Feature extraction is contained inside the neural network, so a neural network is an end-to-end network.
Reducing manual preprocessing and post-processing lets the model go from the raw input to the final output as directly as possible, which gives the model more room to adjust itself to the data and improves the overall fit of the model.
End-to-end learning also has drawbacks:
- It can require a lot of data. To learn this 𝑥 to 𝑦 mapping directly, you may need a lot of (𝑥, 𝑦) data.
- It excludes hand-designed components that might be useful.
5. Comparison of deep learning frameworks
There are Caffe, PyTorch, MXNet, CNTK, Theano, TensorFlow, Keras, Fastai, etc.
| Platform | Advantages | Disadvantages |
| --- | --- | --- |
| TensorFlow | 1. Full-featured; can build a wide range of networks. 2. Supports multiple programming languages. 3. Supports powerful computing clusters. 4. Backed by Google. 5. Very active community. 6. Supports multiple GPUs. 7. TensorBoard provides graphical visualization. | 1. Harder to get started with. 2. The computation graph is pure Python, so it is slow. 3. Graph construction is static, meaning a graph must be “compiled” before it can be run. |
| Keras | 1. Keras is TensorFlow’s high-level API. 2. Keras is a compact API that helps you build applications quickly. 3. Code is more readable and concise. 4. Keras is a highly integrated framework. 5. Active community. | 1. Environment configuration is somewhat more complex than for the underlying frameworks. 2. Although models are easier to create, it may not handle complex network structures as well as TensorFlow. 3. Performance is limited. |
| PyTorch | 1. The architecture can be changed at run time. 2. The training process is simple and clear. 3. You can write for loops using standard Python syntax. 4. Many pre-trained models. | 1. Not as comprehensive as TensorFlow, though it is expected to catch up. 2. Does not deploy well on mobile. |
| MXNet | 1. Supports multiple languages. 2. Complete documentation. 3. Supports multiple GPUs. 4. Code is clear and easy to maintain. 5. Offers a choice between imperative and symbolic programming styles. | 1. Not widely used. 2. The community is not active enough. 3. Harder to learn. |
Judging from job postings, most companies currently use TensorFlow. TensorFlow is very strong in community, performance, and deployment, so the sample code that follows is written with TensorFlow.
6. Softmax classifier
6.1 What is SoftMax
In image classification, the output of a softmax classifier is a discrete value denoting the image category. Unlike linear regression, softmax regression has multiple output units instead of one.
Like linear regression, softmax regression computes a linear combination of the input features and weights. One major difference from linear regression is that **the number of output values of softmax regression is equal to the number of categories in the labels.** The figure below depicts softmax regression as a neural network; it is also a single-layer neural network. Because the computation of each output depends on all of the inputs, the output layer of softmax regression is also a fully connected layer.
6.2 Softmax calculation
A simple approach is to treat the output value $o_i$ as the confidence that the predicted category is i, and to take the category whose output is largest as the prediction. For example, if $o_1, o_2, o_3$ are 0.1, 10, and 0.1 respectively, then $o_2$ is the largest, so the predicted category is 2.
However, there are two problems with using output directly from the output layer:
- Since the range of output values of the output layer is uncertain, it is difficult for us to intuitively judge these values.
- Since real labels are discrete values, the error between these discrete values and the output values in an uncertain range is difficult to measure.
Softmax solves both problems. It transforms the output values into a probability distribution whose values are positive and sum to 1, using the following formula:

$$\hat{y}_j = \frac{\exp(o_j)}{\sum_{i} \exp(o_i)}$$
6.3 Cross entropy loss function
We already know that softmax transforms the outputs into a valid predicted distribution over categories. In fact, the true label can also be expressed as a category distribution:
for sample i, we construct a vector $y^{(i)}$ whose element at the position of the true category of sample i is 1 and whose other elements are 0. Our training goal can then be to make the predicted probability distribution $\hat{y}^{(i)}$ as close as possible to the true label distribution $y^{(i)}$.
To predict the classification result correctly, the predicted probabilities do not need to equal the label probabilities exactly, so the squared loss is too strict. One way to improve on this is to use a measure better suited to comparing two probability distributions. Cross entropy is a common choice:

$$H\left(y^{(i)}, \hat{y}^{(i)}\right) = -\sum_{j=1}^{q} y_j^{(i)} \log \hat{y}_j^{(i)}$$

Here q is the number of categories, and the subscripted $y_j^{(i)}$ is an element of the vector $y^{(i)}$, which is either 0 or 1. In other words, cross entropy is only concerned with the predicted probability of the correct class, because as long as that value is large enough, the classification result is guaranteed to be correct. That is, minimizing the cross entropy loss function is equivalent to maximizing the joint predicted probability of all label classes in the training data set.
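As a small plain-Python sketch (not the TensorFlow code the text refers to), the snippet below applies softmax to the example outputs 0.1, 10, 0.1 from section 6.2 and evaluates the cross entropy against a one-hot label. The helper names and the choice of label are assumptions for illustration.

```python
import math

def softmax(o):
    m = max(o)                                  # shift by the max for numerical stability
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_one_hot, y_hat):
    """H(y, y_hat) = -sum_j y_j * log(y_hat_j); zero entries of y contribute nothing."""
    return -sum(y * math.log(p) for y, p in zip(y_one_hot, y_hat) if y > 0)

o = [0.1, 10.0, 0.1]            # raw outputs from the example in section 6.2
y_hat = softmax(o)              # predicted distribution, sums to 1
y = [0.0, 1.0, 0.0]             # one-hot label: the true category is 2

loss_correct = cross_entropy(y, y_hat)                  # true category gets high probability
loss_wrong = cross_entropy([1.0, 0.0, 0.0], y_hat)      # as if the true category were 1
```

Only the predicted probability of the true category enters the loss, so the loss is small when that probability is close to 1 and large when it is close to 0.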
7. Neural network implementation
TensorFlow example: Linear regression
Machine Learning
Author: @ mantchs
Welcome to join the discussion! Work together to improve this project! Group Number: [541954936]