This is the eighth day of my November challenge

In addition to the cross-entropy cost, there is another way to improve the network's learning: adding a softmax layer of neurons as the output.

Softmax neuron layer

The softmax layer defines a new kind of output layer for the neural network; its position in the network is shown in the figure below:

For the weighted input $z^L_j = \sum_k w^L_{jk} a^{L-1}_k + b^L_j$, the softmax layer applies the softmax function to $z^L_j$. That is, for the $j$-th neuron, the activation value is:


$$a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}$$

Here $\sum_k$ denotes a sum over all the neurons in the output layer. According to the formula, the softmax function performs two operations on the neuron inputs:

  • The weighted input $z^L_j$ is mapped into the positive interval $(0, +\infty)$ by the exponential function $e^x$
  • On that positive interval, each mapped value is divided by the sum of all the mapped values, giving the proportion each neuron contributes to the total

It follows from the formula that the activation values of all the output neurons sum to 1, namely:


$$\sum_j a^L_j = \frac{\sum_j e^{z^L_j}}{\sum_k e^{z^L_k}} = 1$$

In other words, the output of the softmax layer can be viewed as a probability distribution. In many problems the output activations can be interpreted directly as the probability that the input sample belongs to each class, which is often convenient.
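As a quick sanity check, here is a minimal NumPy sketch (the weighted inputs $z^L_j$ are made-up values, not taken from any particular network) that computes the softmax activations and confirms they lie in $(0, 1)$ and sum to 1:

```python
import numpy as np

def softmax(z):
    """Softmax activation: a^L_j = e^{z^L_j} / sum_k e^{z^L_k}."""
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

# Hypothetical weighted inputs z^L_j for a four-neuron output layer.
z_L = np.array([2.0, 1.0, 0.1, -1.5])

a_L = softmax(z_L)
print(a_L)           # roughly [0.646, 0.238, 0.097, 0.020] -- each value in (0, 1)
print(a_L.sum())     # 1.0 -- the activations form a probability distribution
print(a_L.argmax())  # 0   -- index of the most probable class
```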

Softmax function


$$S_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
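One practical note (an implementation detail, not part of the original text): computing $e^{z_i}$ directly can overflow for large inputs. A common trick, sketched below, is to subtract $\max_j z_j$ before exponentiating, which leaves $S_i$ unchanged because the common factor cancels in numerator and denominator:

```python
import numpy as np

def softmax_stable(z):
    """Numerically stable softmax: S_i = e^{z_i} / sum_j e^{z_j}.

    Shifting by max(z) multiplies numerator and denominator by the same
    factor e^{-max(z)}, so the result is identical but exp() cannot overflow.
    """
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

z = np.array([1000.0, 1001.0, 1002.0])  # naive exp(z) would overflow here
print(softmax_stable(z))                 # approximately [0.090, 0.245, 0.665]
```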

How softmax improves the learning rate

To explain this, first define the log-likelihood cost function, which is a natural partner of the softmax layer. Let $x$ denote a training input to the network and $y$ the desired target output (the index of the correct class); the log-likelihood cost for $x$ is:


$$C = -\ln a^L_y$$

Calculating the partial derivatives of the cost function with respect to the weights $w$ and biases $b$ (the derivation is omitted here) gives, where $y_j$ is the $j$-th component of the one-hot target vector:


$$\frac{\partial C}{\partial b^L_j} = a^L_j - y_j$$

$$\frac{\partial C}{\partial w^L_{jk}} = a^{L-1}_k (a^L_j - y_j)$$

These expressions contain no $\sigma'(z)$ factor, so the learning-slowdown problem is avoided. In this sense, the combination "softmax layer + log-likelihood cost" behaves much like "sigmoid output layer + cross-entropy cost".
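The sketch below makes these formulas concrete (the layer sizes, weights, and target index are arbitrary placeholders): it evaluates the log-likelihood cost $C = -\ln a^L_y$, forms the analytic gradients above, and compares one of them against a finite-difference estimate.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def cost(W, b, a_prev, y):
    """Log-likelihood cost C = -ln a^L_y for a single training example."""
    a = softmax(W @ a_prev + b)
    return -np.log(a[y])

rng = np.random.default_rng(0)
a_prev = rng.random(5)           # a^{L-1}: activations from the previous layer
W = rng.normal(size=(3, 5))      # w^L_{jk}: weights into the softmax layer
b = rng.normal(size=3)           # b^L_j: biases of the softmax layer
y = 1                            # index of the desired class

# Analytic gradients from the formulas above (y_onehot plays the role of y_j).
a = softmax(W @ a_prev + b)
y_onehot = np.eye(3)[y]
grad_b = a - y_onehot                    # dC/db^L_j    = a^L_j - y_j
grad_W = np.outer(a - y_onehot, a_prev)  # dC/dw^L_{jk} = a^{L-1}_k (a^L_j - y_j)

# Finite-difference check of one bias derivative.
eps = 1e-6
b_plus = b.copy(); b_plus[0] += eps
numeric = (cost(W, b_plus, a_prev, y) - cost(W, b, a_prev, y)) / eps
print(grad_b[0], numeric)  # the two values should agree to several decimals
```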

Why softmax

Multiply the inputs of the softmax function by a positive constant $c$ and the formula becomes:


$$a^L_j = \frac{e^{c z^L_j}}{\sum_k e^{c z^L_k}}$$

As explained above, the outputs still sum to 1 after introducing $c$. When $c = 1$ we recover the ordinary softmax function. As $c \to +\infty$, the activation of the neuron with the largest input tends to 1 while all the others tend to 0, which is a "hard" maximum (argmax). The "soft" in softmax means the probability distribution is smoothed out, so that even the smaller inputs keep some non-zero probability.

From the graph of $e^x$ it can be seen that the function grows very quickly: large inputs are mapped to very large values and small inputs to comparatively small ones, which effectively singles out the class with the largest probability.
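A tiny sketch (made-up inputs) illustrates the effect of the constant $c$: as it grows, the scaled softmax pushes nearly all of the probability mass onto the largest input, approaching a "hard" argmax.

```python
import numpy as np

def softmax_scaled(z, c=1.0):
    """Softmax applied to inputs scaled by a positive constant c."""
    exp_cz = np.exp(c * z - np.max(c * z))  # shift by the max for stability
    return exp_cz / np.sum(exp_cz)

z = np.array([2.0, 1.0, 0.5])
for c in (1, 5, 50):
    print(c, softmax_scaled(z, c))
# c = 1  -> a smooth distribution, roughly [0.63, 0.23, 0.14]
# c = 50 -> essentially [1.0, 0.0, 0.0]: the largest input takes all the mass
```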

Backpropagation with softmax and the log-likelihood cost

The backpropagation algorithm needs the output-layer error $\delta^L$. With a softmax layer and the log-likelihood cost, it is given by:


$$\delta^L_j = a^L_j - y_j$$

This follows because $\frac{\partial C}{\partial b^L_j} = \delta^L_j$; the proof was given in the previous notes.
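To show how this error is used, here is a small sketch (placeholder weights, biases, and target) in which $\delta^L = a^L - y$ is computed in a forward pass and then plugged into the usual backpropagation expressions for the last layer's gradients:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

a_prev = np.array([0.2, 0.7, 0.1, 0.4])  # a^{L-1}: previous-layer activations
W = np.full((3, 4), 0.1)                 # w^L: placeholder weights
b = np.zeros(3)                          # b^L: placeholder biases
y = np.array([0.0, 1.0, 0.0])            # one-hot target

a = softmax(W @ a_prev + b)              # forward pass through the softmax layer
delta_L = a - y                          # delta^L_j = a^L_j - y_j (no sigma'(z) factor)

nabla_b = delta_L                        # dC/db^L      = delta^L
nabla_W = np.outer(delta_L, a_prev)      # dC/dw^L_{jk} = a^{L-1}_k * delta^L_j
print(delta_L)
print(nabla_b)
print(nabla_W)
```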