1. Two questions
Assume a 3-layer neural network: input node v0; first-layer nodes v1, v2, v3; second-layer nodes v4, v5; third-layer node v6. Here vi = f(ai), i = 4, 5, 6, where f is the activation function.
Forward propagation:
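The forward-propagation formula that the questions below refer to did not survive in this copy; a minimal reconstruction in LaTeX notation, assuming a fully connected network with weights w_ij and biases b_i (everything beyond v_i, a_i and f is my own notation):

$a_i = \sum_j w_{ij} v_j + b_i, \qquad v_i = f(a_i)$

$\frac{\partial L}{\partial w_{ij}} = \delta_i v_j, \qquad \delta_i = f'(a_i) \sum_k w_{ki}\, \delta_k$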
1. Is all-zero initialization OK?
In general, no. With all-zero initialization, the values of all nodes except the input nodes are 0 (assuming an activation with f(0) = 0). According to the formula above, apart from the first layer, whose gradient is related to the input values, the gradients of all other layers are 0.
Logistic regression (LR) and other single-layer models can be initialized to all zeros, because their gradient depends directly on the input values. Zero-initializing a single layer therefore does not block training, but once two or more layers are involved, the gradients from those layers back to the input layer are 0 and the parameters can never be updated, as the sketch below illustrates.
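A minimal numpy sketch of this argument (the shapes, the tanh activation, and the squared-error loss are my own illustrative choices, not from the article):

```python
import numpy as np

# Toy 2-layer network (tanh hidden layer, linear output) with all-zero parameters.
x = np.array([[1.0, 2.0, 3.0]])          # one sample, 3 input features
y = np.array([[1.0]])                    # target
W1, b1 = np.zeros((3, 4)), np.zeros(4)   # all-zero initialization
W2, b2 = np.zeros((4, 1)), np.zeros(1)

h = np.tanh(x @ W1 + b1)                 # all zeros: W1 = 0 and tanh(0) = 0
out = h @ W2 + b2                        # all zeros

# Backprop for the squared-error loss 0.5 * (out - y)**2
d_out = out - y                          # non-zero
dW2 = h.T @ d_out                        # zero, because h is all zeros
d_h = (d_out @ W2.T) * (1 - h ** 2)      # zero, because W2 is all zeros
dW1 = x.T @ d_h                          # zero, even though the input x is non-zero

print(dW1, dW2)                          # every weight gradient is 0 -> no update
```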
2. Can all parameters be initialized to the same value?
No. If all parameters are initialized to the same value, every node in a hidden layer produces the same output and receives the same gradient, so the nodes stay identical throughout training; the layer is effectively reduced to a single node.
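The same kind of sketch for constant (non-zero) initialization, again with illustrative shapes and values:

```python
import numpy as np

# Constant initialization: every hidden unit computes exactly the same thing.
x = np.array([[0.5, -1.0, 2.0]])
y = np.array([[1.0]])
W1, b1 = np.full((3, 4), 0.1), np.zeros(4)   # every weight set to the same value
W2, b2 = np.full((4, 1), 0.1), np.zeros(1)

h = np.tanh(x @ W1 + b1)                 # all 4 hidden units output the same value
d_out = (h @ W2 + b2) - y
d_h = (d_out @ W2.T) * (1 - h ** 2)
dW1 = x.T @ d_h                          # every column (hidden unit) gets the same gradient

print(h)      # identical hidden activations
print(dW1)    # identical gradient columns -> the units can never differentiate
```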
2. Parameter initialization methods
1. Pre-training initialization
Pretraining + fine-tuning: load the parameters of an already-trained model, then continue training (fine-tune) them on the downstream task.
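A hedged Keras sketch of this pattern; the ResNet50 backbone, the 10-class head, and the commented-out dataset are placeholders, not something the article specifies:

```python
import tensorflow as tf

# Load pretrained parameters, freeze them, and train only a new task head first.
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False                       # keep the pretrained weights fixed initially

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(10, activation="softmax"),   # new head for the downstream task
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(train_ds, epochs=...)            # fine-tune on the downstream dataset
```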
2. Random initialization
2.1 Naive random initialization
Random initialization: np.random.randn(m, n)
This generates an m × n matrix whose entries are drawn from the standard normal distribution.
Disadvantage: the gradient tends to vanish. As the network gets deeper, the chain rule drives the layer outputs, and with them the gradients, closer and closer to 0.
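A small experiment that illustrates the vanishing effect; the 0.01 scale factor, the tanh activation, and the layer sizes are my illustrative assumptions:

```python
import numpy as np

# Naive random initialization: activations shrink toward 0 layer by layer.
np.random.seed(0)
h = np.random.randn(1000, 500)               # a batch of 1000 inputs with 500 features
for layer in range(10):
    W = 0.01 * np.random.randn(500, 500)     # small random weights
    h = np.tanh(h @ W)
    print(f"layer {layer + 1}: std of activations = {h.std():.6f}")
# The std collapses toward 0, so the backpropagated gradients vanish as well.
```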
2.2 Xavier initialization
W = tf.Variable(np.random.randn(node_in, node_out) / np.sqrt(node_in))
node_in and node_out are the input and output dimensions of the layer.
Scaling by 1/sqrt(node_in) keeps the variance of a layer's outputs roughly equal to the variance of its inputs.
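The same experiment as above, now with Xavier scaling, for comparison (still an illustrative sketch):

```python
import numpy as np

# Xavier initialization: divide by sqrt(node_in) to keep the variance roughly constant.
np.random.seed(0)
node_in = node_out = 500
h = np.random.randn(1000, node_in)
for layer in range(10):
    W = np.random.randn(node_in, node_out) / np.sqrt(node_in)   # Xavier scaling
    h = np.tanh(h @ W)
    print(f"layer {layer + 1}: std of activations = {h.std():.4f}")
# The activation std stays in a stable range instead of collapsing to 0.
```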
2.3 He initialization
W = tf.Variable(np.random.randn(node_in, node_out) / np.sqrt(node_in / 2))
Suited to the ReLU activation function: because ReLU passes only half of its input region and zeroes the rest, the variance is doubled (divide by node_in/2) to compensate.
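A matching sketch for He initialization with ReLU:

```python
import numpy as np

# He initialization: scale by sqrt(2 / node_in) to compensate for ReLU's zeroed half.
np.random.seed(0)
node_in = node_out = 500
h = np.maximum(0, np.random.randn(1000, node_in))                     # ReLU-activated inputs
for layer in range(10):
    W = np.random.randn(node_in, node_out) * np.sqrt(2.0 / node_in)   # He scaling
    h = np.maximum(0, h @ W)                                          # ReLU
    print(f"layer {layer + 1}: std of activations = {h.std():.4f}")
# The factor of 2 offsets the roughly half of the activations that ReLU sets to 0.
```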
3. Fixed initialization
For example, biases are usually initialized to 0; the forget-gate bias of an LSTM is often set to 1 or 2 to strengthen gradients along the time dimension; and for ReLU neurons the bias is sometimes set to a small positive value such as 0.01 so that the units are more easily activated early in training.
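A hedged Keras sketch of these fixed bias choices; the layer sizes are placeholders, and note that Keras's unit_forget_bias option sets the forget-gate bias to 1 (not 2):

```python
import tensorflow as tf

# Small positive bias for ReLU units so they are more easily activated early on.
relu_layer = tf.keras.layers.Dense(
    128, activation="relu",
    bias_initializer=tf.keras.initializers.Constant(0.01))

# LSTM with the forget-gate bias initialized to 1 to help gradients flow over time.
lstm_layer = tf.keras.layers.LSTM(64, unit_forget_bias=True)
```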
Reference:
www.leiphone.com/category/ai…