Directory
- [1] NN complexity
- [2] Exponential decay of learning rate
- [3] Activation function
  - Characteristics of a good activation function
  - Common activation functions
  - Advice for beginners
- [4] Loss function
- [5] Alleviate overfitting — regularization
- [6] Parameter optimizer
  - [1] SGD
  - [2] SGDM (adds first-order momentum to SGD)
  - [3] Adagrad (adds second-order momentum to SGD)
  - [4] RMSProp (adds second-order momentum to SGD)
  - [5] Adam (combines SGDM's first-order momentum and RMSProp's second-order momentum)
  - Optimizer comparison summary
[1] NN complexity
Space complexity:
Layers: number of hidden layers + the output layer (the input layer is not counted)
Total parameters: total W + total b
Time complexity:
Number of multiply-add operations
For example, for a network with 3 inputs, one hidden layer of 4 neurons, and 2 outputs: total parameters = 3×4+4 (layer 1) + 4×2+2 (layer 2) = 26
Multiply-add operations = 3×4 + 4×2 = 20
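As a quick check of this arithmetic, here is a minimal sketch (not from the original text) that computes both counts for any stack of dense layers; the sizes [3, 4, 2] are inferred from the example above.

def complexity(layer_sizes):
    params, mult_adds = 0, 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        params += fan_in * fan_out + fan_out  # weights W plus biases b
        mult_adds += fan_in * fan_out         # one multiply-add per weight
    return params, mult_adds

print(complexity([3, 4, 2]))  # (26, 20), matching the example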
[2] Exponential decay of learning rate
It is known that a learning rate that is too small makes convergence slow, while a learning rate that is too large can keep training from converging at all. A practical compromise is to start with a relatively large learning rate to quickly reach a good solution, then gradually reduce it so the model stays stable in the later stages of training, for example with exponential decay of the learning rate.
Exponentially decayed learning rate = initial learning rate × decay rate ^ (current epoch / decay step)
epochs = 40
LR_BASE = 0.2  # initial learning rate
LR_DECAY = 0.99  # learning rate decay rate
LR_STEP = 1  # update the learning rate once every LR_STEP epochs (batches of BATCH_SIZE)

# In this example there is a single trainable parameter w, initialized to the constant 5.
for epoch in range(epochs):
    lr = LR_BASE * LR_DECAY ** (epoch / LR_STEP)
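The loop above computes the decayed rate but never applies it. Below is a minimal sketch of the full update, assuming the single parameter w initialized to 5 mentioned above and a toy loss of (w + 1)² chosen purely for illustration:

import tensorflow as tf

w = tf.Variable(tf.constant(5, dtype=tf.float32))  # single trainable parameter, initial value 5

for epoch in range(40):
    lr = 0.2 * 0.99 ** (epoch / 1)  # exponentially decayed learning rate, as above
    with tf.GradientTape() as tape:
        loss = tf.square(w + 1)     # assumed toy loss, minimized at w = -1
    grads = tape.gradient(loss, w)
    w.assign_sub(lr * grads)        # apply the decayed rate to the update

print(w.numpy())  # close to -1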
[3] Activation function
With a purely linear function, even if many neurons are chained into a deep neural network, the result is still a linear combination of the inputs, so the model's expressive power is insufficient.
Compared with the simplified linear model, the MP model adds an activation function. This addition strengthens the model's expressive power: the deep network is no longer a linear combination of the input x, and its expressiveness grows with the number of layers.
Characteristics of a good activation function
Nonlinearity: with a nonlinear activation function, a multilayer neural network can approximate almost any function.
Differentiability: most optimizers update parameters by gradient descent, so the activation function should be differentiable.
Monotonicity: when the activation function is monotonic, the loss function of a single-layer network is guaranteed to be convex.
Approximate identity: f(x) ≈ x near zero, so the network is more stable when parameters are initialized to small random values.
Range of output values: 1. When the output of the activation function is finite, the weights have a more significant impact on the features, and gradient-based optimization is more stable. 2. When the output is unbounded, training is more efficient, but the learning rate should be set to a smaller value.
Common activation functions
1. Sigmoid function: sigmoid(x) = 1 / (1 + e^(-x)) (function image and derivative image not shown).
Sigmoid features: 1. It easily causes the gradient to vanish (when the absolute value of the input is large, the gradient is approximately 0). 2. Its output is not zero-mean, so convergence is slow. 3. Its power operations are computationally expensive, so training takes a long time.
In the early days of neural networks, the sigmoid function was the most widely used activation function, but few networks use it today. When a deep neural network updates its parameters, it must apply the chain rule layer by layer from the output layer back to the input layer, and the derivative of the sigmoid function has range (0, 0.25]. The chain rule multiplies many such derivatives together, so the product tends to 0, the gradient vanishes, and the parameters can no longer be updated.
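A quick numeric check of this claim, using the identity sigmoid'(x) = sigmoid(x) · (1 - sigmoid(x)):

import numpy as np

x = np.linspace(-10, 10, 1001)
sig = 1 / (1 + np.exp(-x))
dsig = sig * (1 - sig)  # derivative of sigmoid
print(dsig.max())       # ~0.25, attained at x = 0
print(0.25 ** 10)       # ~9.5e-07: even in the best case, a 10-layer chained gradient shrinks a million-fold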
2. Tanh function: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)) (function image and derivative image not shown).
Tanh features: 1. It easily causes the gradient to vanish. 2. Its output is zero-mean. 3. Its power operations are computationally expensive, so training takes a long time.
3. Relu function: relu(x) = max(0, x) (function image and derivative image not shown).
Relu advantages: 1. It solves the vanishing-gradient problem in the positive interval. 2. It only needs to check whether the input is greater than 0, so computation is fast. 3. It converges much faster than sigmoid and tanh. Disadvantages: 1. Its output is not zero-mean, so convergence is slow. 2. The Dead ReLU problem: some neurons may never be activated, so their parameters can never be updated.
When the input feature to the activation function is negative, the output is 0 and the gradient passed backward through the activation is also 0. If too many of the features fed into relu are negative, the neuron dies. We can improve the random initialization to avoid feeding in too many negative features, and we can also set a smaller learning rate to reduce drastic shifts in the parameter distribution, thereby avoiding producing too many negative features during training.
4. Leaky Relu function: f(x) = max(αx, x) (function image and derivative image not shown).
Theoretically, Leaky Relu has all the advantages of Relu and no Dead ReLU problem, but in practice it has not been proven that Leaky Relu is always better than Relu. The sketch below contrasts the two on negative inputs.
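A minimal sketch of the contrast (tf.nn.leaky_relu defaults to alpha = 0.2):

import tensorflow as tf

x = tf.constant([-3.0, -1.0, 0.0, 1.0, 3.0])
print(tf.nn.relu(x).numpy())        # [ 0.   0.   0.   1.   3. ]  negatives are zeroed: zero gradient there
print(tf.nn.leaky_relu(x).numpy())  # [-0.6 -0.2  0.   1.   3. ]  the leak keeps a small gradient alive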
Advice for beginners
1. Prefer the relu activation function. 2. Set the learning rate to a small value. 3. Standardize the input features so they follow a normal distribution with mean 0 and standard deviation 1. 4. Center the initial parameters: let the randomly generated parameters follow a normal distribution with mean 0 and standard deviation sqrt(2 / number of input features of the current layer). These points are sketched in code below.
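A minimal sketch of this advice in code; the batch shape, layer width, and learning rate here are illustrative assumptions, not values from the text:

import tensorflow as tf
import numpy as np

x = np.random.rand(32, 4).astype(np.float32)  # hypothetical batch of raw input features
x = (x - x.mean(axis=0)) / x.std(axis=0)      # 3. standardize: mean 0, std 1 per feature

fan_in = x.shape[1]
w = tf.Variable(tf.random.normal([fan_in, 8], stddev=np.sqrt(2 / fan_in)))  # 4. centered initialization
b = tf.Variable(tf.zeros([8]))

h = tf.nn.relu(tf.matmul(x, w) + b)           # 1. relu as the first-choice activation
lr = 0.001                                    # 2. a small learning rate for the update steps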
[4] Loss function
NN optimization objective: minimize loss. Three loss functions are commonly used: 1. MSE (mean squared error); 2. custom losses; 3. CE (cross entropy).
1. MSE (mean squared error): loss = mean((y_ - y)²), in TensorFlow:
loss_mse = tf.reduce_mean(tf.square(y_ - y))
Yogurt sales prediction code:
import tensorflow as tf
import numpy as np

SEED = 23455

rdm = np.random.RandomState(seed=SEED)  # generate random numbers in [0, 1)
x = rdm.rand(32, 2)
y_ = [[x1 + x2 + (rdm.rand() / 10.0 - 0.05)] for (x1, x2) in x]  # noise: [0,1)/10 = [0,0.1); [0,0.1) - 0.05 = [-0.05,0.05)
x = tf.cast(x, dtype=tf.float32)

w1 = tf.Variable(tf.random.normal([2, 1], stddev=1, seed=1))

epochs = 15000
lr = 0.002

for epoch in range(epochs):
    with tf.GradientTape() as tape:
        y = tf.matmul(x, w1)
        loss_mse = tf.reduce_mean(tf.square(y_ - y))
    grads = tape.gradient(loss_mse, w1)
    w1.assign_sub(lr * grads)
    if epoch % 500 == 0:
        print("After %d training steps, w1 is " % epoch)
        print(w1.numpy(), "\n")
print("Final w1 is: ", w1.numpy())
The learned coefficients are indeed both approximately equal to 1, as expected, since the data was generated as y = x1 + x2 plus noise.
2. Custom loss function
Take sales forecasts for example:
By default, the mean-squared-error loss assumes that over-predicting and under-predicting sales produce the same loss, but in practice they do not:
Predict too much: the loss is the wasted cost.
Predict too little: the loss is the lost profit.
If profit ≠ cost, the loss produced by MSE cannot maximize profit!
Loss can instead be defined as a piecewise function:
loss_zdy = tf.reduce_sum(tf.where(tf.greater(y, y_), COST * (y - y_), PROFIT * (y_ - y)))
For example, when forecasting yogurt sales: yogurt COST is 1 yuan, yogurt PROFIT is 99 yuan.
Predicting one unit too few loses 99 yuan of profit; predicting one unit too many loses only 1 yuan of cost.
Since under-prediction loses far more than over-prediction, we hope the trained model tends to predict on the high side.
Yogurt prediction code (with the modified loss function):
import tensorflow as tf
import numpy as np

SEED = 23455
COST = 1
PROFIT = 99

rdm = np.random.RandomState(SEED)
x = rdm.rand(32, 2)
y_ = [[x1 + x2 + (rdm.rand() / 10.0 - 0.05)] for (x1, x2) in x]  # noise: [0,1)/10 = [0,0.1); [0,0.1) - 0.05 = [-0.05,0.05)
x = tf.cast(x, dtype=tf.float32)

w1 = tf.Variable(tf.random.normal([2, 1], stddev=1, seed=1))

epochs = 10000
lr = 0.002

for epoch in range(epochs):
    with tf.GradientTape() as tape:
        y = tf.matmul(x, w1)
        loss = tf.reduce_sum(tf.where(tf.greater(y, y_), (y - y_) * COST, (y_ - y) * PROFIT))
    grads = tape.gradient(loss, w1)
    w1.assign_sub(lr * grads)
    if epoch % 500 == 0:
        print("After %d training steps, w1 is " % epoch)
        print(w1.numpy(), "\n")
print("Final w1 is: ", w1.numpy())

# Custom loss function
# Yogurt cost: 1 yuan; yogurt profit: 99 yuan
# The cost is low and the profit high, so we want the model to over-predict; the learned coefficients end up greater than 1
The generated coefficients are indeed all greater than 1.
3. CE (cross entropy)
import tensorflow as tf

loss_ce1 = tf.losses.categorical_crossentropy([1, 0], [0.6, 0.4])
loss_ce2 = tf.losses.categorical_crossentropy([1, 0], [0.8, 0.2])
print("loss_ce1:", loss_ce1)
print("loss_ce2:", loss_ce2)
# Cross entropy loss function: [0.8, 0.2] is closer to the label [1, 0], so loss_ce2 < loss_ce1
Generally, the output is first converted into a probability distribution by the softmax function, and then the cross entropy loss between y_ and y is computed. TensorFlow provides tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y), which performs both steps at once.
# Softmax combined with the cross entropy loss function
import tensorflow as tf
import numpy as np

y_ = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]])
y = np.array([[12, 3, 2], [3, 10, 1], [1, 2, 5], [4, 6.5, 1.2], [3, 6, 1]])
y_pro = tf.nn.softmax(y)
loss_ce1 = tf.losses.categorical_crossentropy(y_, y_pro)
loss_ce2 = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y)
print('Step-by-step result:\n', loss_ce1)
print('Combined result:\n', loss_ce2)
[5] Alleviate overfitting — regularization
To understand why regularization can alleviate overfitting, check out these two videos:
Overfitting
How regularization works
Regularization does not discard features; it reduces the magnitude of the parameters and thereby simplifies the hypothesis model. Since there are many variables and we do not know in advance how relevant each one is to the result, we do not know which parameters to shrink; so we shrink all of them, i.e. we add a penalty term for every parameter (except θ0).
The regularization parameter λ is used to control the choice between two different objectives:
1. Better fitting of training data
2. Keep parameters as small as possible
Too large a λ can also cause underfitting.
Underfitting remedies: 1. Add more input features; 2. Increase the network parameters; 3. Reduce the regularization parameter.
Example: using a neural network to separate blue dots from red dots.
Idea: after training the network on the labeled points, feed the coordinates of every point in a grid to the trained network; the network outputs a predicted value for each coordinate, and the contour of these predictions forms the dividing line. Output the dividing line without regularization and with regularization, and compare the effects.
On the left is before regularization; on the right is after regularization.
| Before regularization | After regularization |
| --- | --- |
| One hidden layer, 4 neurons | One hidden layer, 4 neurons |
| One hidden layer, 22 neurons | One hidden layer, 22 neurons |
| One hidden layer, 40 neurons | One hidden layer, 40 neurons |

(The table cells held the corresponding decision-boundary plots.)
It is obvious that with L2 regularization the dividing curve is smoother, and overfitting is effectively alleviated. A sketch of how the L2 penalty enters the loss follows.
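Here is a minimal sketch in the style of the training loops above; the synthetic data, the variable names, and the regularization factor 0.03 are assumptions for illustration:

import tensorflow as tf
import numpy as np

# toy setup (assumed shapes): 32 points with 2 features, linear model y = x @ w1 + b1
x = tf.constant(np.random.rand(32, 2), dtype=tf.float32)
y_ = tf.constant(np.random.rand(32, 1), dtype=tf.float32)
w1 = tf.Variable(tf.random.normal([2, 1], stddev=1, seed=1))
b1 = tf.Variable(tf.zeros([1]))
lr = 0.002

with tf.GradientTape() as tape:
    y = tf.matmul(x, w1) + b1
    loss_mse = tf.reduce_mean(tf.square(y_ - y))
    loss_reg = tf.nn.l2_loss(w1)       # L2 penalty: sum(w ** 2) / 2; biases are usually not penalized
    loss = loss_mse + 0.03 * loss_reg  # 0.03 is an assumed regularization factor (lambda)
grads = tape.gradient(loss, [w1, b1])
w1.assign_sub(lr * grads[0])
b1.assign_sub(lr * grads[1])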
[6] Parameter optimizer
[1] SGD
Formula: w(t+1) = w(t) - lr · g(t), where g(t) = ∂loss/∂w(t) is the current gradient.
Code:
w1 = w1 - lr * w1_grad
b = b - lr * b_grad

w1.assign_sub(lr * grads[0])  # the w1 parameter self-updates
b1.assign_sub(lr * grads[1])  # the b1 parameter self-updates
[2] SGDM (adds first-order momentum to SGD)
Formula: m(t) = β · m(t-1) + (1 - β) · g(t); w(t+1) = w(t) - lr · m(t)
m(t) is an exponential moving average of the gradient directions at each step, i.e. an average over the recent past; β is a hyperparameter close to 1, usually 0.9.
code:
m_w, m_b = 0, 0
beta = 0.9

# sgd-momentum
m_w = beta * m_w + (1 - beta) * grads[0]
m_b = beta * m_b + (1 - beta) * grads[1]
w1.assign_sub(lr * m_w)
b1.assign_sub(lr * m_b)
[3] Adagrad (adds second-order momentum to SGD)
Formula: v(t) = v(t-1) + g(t)²; w(t+1) = w(t) - lr · g(t) / sqrt(v(t))
code:
v_w, v_b = 0, 0
# adagrad
v_w += tf.square(grads[0])
v_b += tf.square(grads[1])
w1.assign_sub(lr * grads[0] / tf.sqrt(v_w))
b1.assign_sub(lr * grads[1] / tf.sqrt(v_b))
[4] RMSProp (adds second-order momentum to SGD, as an exponential moving average rather than Adagrad's cumulative sum)
Formula: v(t) = β · v(t-1) + (1 - β) · g(t)²; w(t+1) = w(t) - lr · g(t) / sqrt(v(t))
code:
v_w, v_b = 0, 0
beta = 0.9
# rmsprop
v_w = beta * v_w + (1 - beta) * tf.square(grads[0])
v_b = beta * v_b + (1 - beta) * tf.square(grads[1])
w1.assign_sub(lr * grads[0] / tf.sqrt(v_w))
b1.assign_sub(lr * grads[1] / tf.sqrt(v_b))
[5] Adam (combines SGDM's first-order momentum with RMSProp's second-order momentum)
Formula: m(t) = β1 · m(t-1) + (1 - β1) · g(t); v(t) = β2 · v(t-1) + (1 - β2) · g(t)²; bias corrections: m̂(t) = m(t) / (1 - β1^t), v̂(t) = v(t) / (1 - β2^t); update: w(t+1) = w(t) - lr · m̂(t) / sqrt(v̂(t))
code:
m_w, m_b = 0, 0
v_w, v_b = 0, 0
beta1, beta2 = 0.9, 0.999
delta_w, delta_b = 0, 0
global_step = 0
# adam (inside the training loop)
global_step += 1  # one step per batch; must be at least 1 before the bias corrections below
m_w = beta1 * m_w + (1 - beta1) * grads[0]
m_b = beta1 * m_b + (1 - beta1) * grads[1]
v_w = beta2 * v_w + (1 - beta2) * tf.square(grads[0])
v_b = beta2 * v_b + (1 - beta2) * tf.square(grads[1])
m_w_correction = m_w / (1 - tf.pow(beta1, int(global_step)))
m_b_correction = m_b / (1 - tf.pow(beta1, int(global_step)))
v_w_correction = v_w / (1 - tf.pow(beta2, int(global_step)))
v_b_correction = v_b / (1 - tf.pow(beta2, int(global_step)))
w1.assign_sub(lr * m_w_correction / tf.sqrt(v_w_correction))
b1.assign_sub(lr * m_b_correction / tf.sqrt(v_b_correction))
Optimizer comparison summary
Optimizer comparison (lr = 0.1, epochs = 500, batch = 32). A sketch of the common training loop these update rules plug into follows.
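For context, here is a minimal sketch of the loop that each update snippet above slots into, showing where grads[0] and grads[1] come from; the synthetic one-batch dataset is an assumption for illustration:

import tensorflow as tf
import numpy as np

w1 = tf.Variable(tf.random.normal([2, 1], stddev=1, seed=1))
b1 = tf.Variable(tf.random.normal([1], stddev=1, seed=1))
lr = 0.1  # matching the comparison setting above

# one synthetic batch of 32 samples, standing in for the real dataset
x = tf.constant(np.random.rand(32, 2), dtype=tf.float32)
y_ = tf.constant(np.random.rand(32, 1), dtype=tf.float32)

for step in range(500):
    with tf.GradientTape() as tape:
        y = tf.matmul(x, w1) + b1
        loss = tf.reduce_mean(tf.square(y_ - y))
    grads = tape.gradient(loss, [w1, b1])  # grads[0] pairs with w1, grads[1] with b1
    # drop in one of the update rules above, e.g. plain SGD:
    w1.assign_sub(lr * grads[0])
    b1.assign_sub(lr * grads[1])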