This article is a note on the activation function section of CS231n.

The activation function preserves and maps a neuron's input features to its output, introducing the nonlinearity into the network.

Sigmoid

The sigmoid nonlinearity squashes its input into the range $[0, 1]$. Its mathematical formula is $\sigma(x) = \dfrac{1}{1 + e^{-x}}$.

Historically, the sigmoid function was very common; however, it has since fallen out of favor and is now rarely used in practice, because it has two main disadvantages:

1. Saturation of the function makes the gradient vanish

Sigmoid neurons saturate near 0 or 1, and in those regions the gradient is almost zero. During backpropagation, this local gradient is multiplied by the gradient of the whole cost function with respect to the unit's output, so the result is also close to 0.

As a result, almost no signal flows through the neuron to its weights, and onward to the data, so the gradient contributes almost nothing to updating the model.

In addition, special attention must be paid to the initialization of the weight matrix to prevent saturation. For example, if the initial weights are too large, most neurons will saturate and the network will hardly learn.
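As a minimal NumPy sketch (not part of the original note), the following shows how the sigmoid's local gradient shrinks in the saturated regions, and how overly large initial weights push pre-activations into those regions; the helper names `sigmoid` and `sigmoid_grad` are just for illustration.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: squashes inputs into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Local gradient of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The local gradient peaks at 0.25 (at x = 0) and vanishes in the saturated regions.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  sigmoid = {sigmoid(x):.6f}  gradient = {sigmoid_grad(x):.6f}")

# Overly large initial weights push pre-activations deep into the saturated zone.
rng = np.random.default_rng(0)
x = rng.standard_normal(100)
large_w = 10.0 * rng.standard_normal(100)
print("mean |gradient| with large weights:", np.abs(sigmoid_grad(large_w * x)).mean())
```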

2. The sigmoid output is not zero-centered

This means the inputs to the following layer are not zero-centered, which affects the dynamics of gradient descent.

If the input to a neuron $f = w^\top x + b$ is always positive (for example, every element of $x > 0$), then during backpropagation the gradient with respect to $w$ will be either all positive or all negative (depending on the sign of the gradient flowing into $f$), which causes zig-zagging dynamics when gradient descent updates the weights.

Of course, if training is done in mini-batches, different examples in a batch may contribute gradients of different signs, and summing them over the batch alleviates this problem. It is therefore only a minor annoyance compared with the saturation problem above, and not as serious.
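A tiny NumPy example (my own illustration, not from the note) of why all-positive inputs force the weight gradient of a single neuron $f = w^\top x + b$ to share the same sign in every component:

```python
import numpy as np

# Toy neuron f = w . x + b with all-positive inputs
# (e.g. sigmoid outputs from the previous layer).
rng = np.random.default_rng(1)
x = rng.random(5)              # every element of x is positive
w = rng.standard_normal(5)
upstream = -0.7                # dL/df coming from the rest of the network

grad_w = upstream * x          # dL/dw = dL/df * x
print("x:      ", x)
print("grad_w: ", grad_w)      # every component has the sign of `upstream`

# With a mini-batch, different examples can contribute gradients of different
# signs, which partly cancels this zig-zag effect.
```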

tanh

The tanh function also suffers from saturation problems, but its output is zero-centered, so in practice tanh is more popular than Sigmoid.

The tanh function is in fact a scaled and shifted sigmoid; the mathematical relation is $\tanh(x) = 2\sigma(2x) - 1$.
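A quick numerical check of this relation (a sketch I added, not from the original note):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
lhs = np.tanh(x)
rhs = 2.0 * sigmoid(2.0 * x) - 1.0   # tanh(x) = 2 * sigmoid(2x) - 1
print(np.allclose(lhs, rhs))          # True
```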

ReLU

ReLU has become very popular in recent years. Its mathematical formula is $f(x) = \max(0, x)$.

The figure below shows the effect of ReLU in two dimensions:

Advantages of ReLU:

  1. Compared with the sigmoid and tanh functions, ReLU greatly accelerates the convergence of SGD (Alex Krizhevsky et al. report a factor of six). Some attribute this to its linear, non-saturating form.
  2. In contrast to sigmoid/tanh, ReLU only needs a threshold to compute the activation, without expensive (exponential) operations; see the sketch after this list.
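A minimal NumPy sketch (my own illustration) of the ReLU forward pass and its thresholded gradient; the names `relu_forward` and `relu_backward` are just for this example.

```python
import numpy as np

def relu_forward(x):
    """ReLU: f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def relu_backward(x, upstream):
    """Gradient is 1 where x > 0 and 0 elsewhere -- a simple threshold, no exponentials."""
    return upstream * (x > 0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu_forward(x))                     # [0.  0.  0.  0.5 2. ]
print(relu_backward(x, np.ones_like(x)))   # [0. 0. 0. 1. 1.]
```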

The downside of ReLU is that units can be fragile during training and can "die".

For example, if a very large gradient flows through a ReLU neuron, the parameter update may leave the neuron in a state where it never activates for any input again. From then on, all gradients flowing through this neuron will be zero.

That is, the ReLU unit dies irreversibly during training, and the diversity it contributed to the representation is lost. In practice, if the learning rate is set too high, you may find that as much as 40% of the neurons in the network die (i.e., they never activate on any example in the training set).

Setting a reasonable learning rate reduces the probability of this happening.
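To make the "dead ReLU" condition concrete, here is a small sketch (my own construction, with arbitrary numbers) of a unit whose pre-activation is negative for every input, so its output and weight gradients are always zero:

```python
import numpy as np

# A ReLU neuron "dies" when its pre-activation is negative for every input:
# the output is always 0, so the gradient w.r.t. its weights is always 0 and
# the weights can never recover.
rng = np.random.default_rng(2)
X = np.abs(rng.standard_normal((1000, 3)))   # all inputs positive
w = np.array([-5.0, -5.0, -5.0])             # weights pushed far negative by a huge update
b = -1.0

pre_activation = X @ w + b
active_fraction = np.mean(pre_activation > 0)
print("fraction of inputs that activate this neuron:", active_fraction)  # 0.0 -> dead unit
```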

Leaky ReLU

Leaky ReLU is an attempt to solve the problem of “ReLU dying.”

In ReLU, the function value is 0 when $x < 0$. Leaky ReLU instead gives the negative side a small slope, e.g. $f(x) = 0.01x$ for $x < 0$.
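A one-function sketch of Leaky ReLU (my own illustration; the slope 0.01 is the commonly used default, not a value from the note):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x >= 0, alpha * x for x < 0 (alpha is a small fixed constant)."""
    return np.where(x >= 0, x, alpha * x)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(leaky_relu(x))   # [-0.03 -0.01  0.    1.    3.  ]
```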

Some researchers have reported that this activation function works well, but its effect is not very stable.

In the paper Delving Deep into Rectifiers, published by Kaiming He et al. in 2015, a new method, PReLU, is introduced, which treats the slope on the negative interval as a parameter learned for each neuron. However, it is not yet clear how consistently this activation function helps across different tasks.
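A rough sketch of the PReLU idea (my own simplification: a single scalar slope rather than the per-channel parameters used in the paper), including the gradient with respect to the slope so it can be updated by SGD:

```python
import numpy as np

def prelu(x, alpha):
    """PReLU: like Leaky ReLU, but the negative slope `alpha` is a learned parameter."""
    return np.where(x >= 0, x, alpha * x)

def prelu_grad_alpha(x, upstream):
    """Gradient w.r.t. alpha: x where x < 0, 0 elsewhere."""
    return np.sum(upstream * np.where(x < 0, x, 0.0))

x = np.array([-2.0, -0.5, 1.0])
alpha = 0.25
print(prelu(x, alpha))                          # [-0.5   -0.125  1.   ]
print(prelu_grad_alpha(x, np.ones_like(x)))     # -2.5
```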

Maxout

Maxout is a generalization of ReLU and Leaky ReLU; with two linear pieces, its function is $\max(w_1^\top x + b_1,\; w_2^\top x + b_2)$. ReLU and Leaky ReLU are both special cases of this formula (e.g. ReLU corresponds to $w_1 = 0,\; b_1 = 0$).
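A minimal sketch of a two-piece Maxout layer and a check that ReLU is the special case $w_1 = 0, b_1 = 0$ (my own illustration, with arbitrary shapes):

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    """Maxout (k = 2): element-wise max of two affine functions of x."""
    return np.maximum(W1 @ x + b1, W2 @ x + b2)

rng = np.random.default_rng(3)
x = rng.standard_normal(4)
W2, b2 = rng.standard_normal((3, 4)), rng.standard_normal(3)

# ReLU is the special case W1 = 0, b1 = 0: max(0, W2 @ x + b2).
W1, b1 = np.zeros((3, 4)), np.zeros(3)
print(np.allclose(maxout(x, W1, b1, W2, b2), np.maximum(0.0, W2 @ x + b2)))  # True
```

Note that each Maxout neuron carries two weight vectors and two biases, which is where the doubling of parameters mentioned below comes from.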

In this way, a Maxout neuron enjoys all the advantages of a ReLU unit (a linear, non-saturating regime) without its drawback (dying units). However, compared with ReLU it doubles the number of parameters per neuron, which leads to a large increase in the total number of parameters.

How to choose the activation function?

In general, different activation functions are rarely mixed within the same network.

If you use ReLU, set the learning rate carefully and watch that too many neurons in your network do not "die". If this is a problem, try Leaky ReLU, PReLU, or Maxout.

It is best to avoid sigmoid. You can try tanh, though you can expect it to work less well than ReLU and Maxout.

References

  1. http://cs231n.github.io/neural-networks-1/
  2. https://zhuanlan.zhihu.com/p/21462488
  3. http://blog.csdn.net/cyh_24/article/details/50593400