Activation functions are an important part of neural network models. In this article, author Sukanya Bag explains in detail the advantages and disadvantages of ten activation functions, starting from their mathematical principles.

From Medium, by Sukanya Bag; compiled and edited by Heart of a Machine (Boat and Mayonnaise).

Activation functions are functions added to artificial neural networks to help the network learn complex patterns in the data. By analogy with the neuron-based models in the human brain, the activation function ultimately decides what gets fired to the next neuron.

In artificial neural networks, the activation function of a node defines the output of that node for a given input or set of inputs. A standard computer chip circuit can be seen as a digital network of activation functions that output either on (1) or off (0) depending on the input. An activation function is therefore the mathematical equation that determines the output of a neural network node. This article summarizes ten activation functions commonly used in deep learning, together with their advantages and disadvantages.

First, let’s take a look at how artificial neurons work, which is roughly as follows:

The mathematical visualization of this process is shown in the figure below:
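Expressed in code, the computation of a single artificial neuron is just a weighted sum of the inputs plus a bias, passed through an activation function. Here is a minimal NumPy sketch; the inputs, weights, and bias below are made-up illustrative values, not taken from the article:

```python
import numpy as np

def neuron(x, w, b, activation):
    """One artificial neuron: weighted sum of inputs plus bias, then activation."""
    z = np.dot(w, x) + b      # linear combination of the inputs
    return activation(z)      # non-linear "firing" decision

# Illustrative values only
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.4, 0.7, -0.2])   # weights
b = 0.1                          # bias
print(neuron(x, w, b, activation=np.tanh))
```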

1. Sigmoid activation function

The Sigmoid function’s image looks like an S-shaped curve.

The function expression is as follows:

f(x) = 1 / (1 + exp(−x))

When is it appropriate to use the Sigmoid activation function?

  • The Sigmoid function has an output range of 0 to 1. Since the output values are bounded between 0 and 1, it normalizes the output of each neuron;

  • It suits models that output a predicted probability: since probabilities range from 0 to 1, the Sigmoid function is very suitable;

  • The gradient is smooth, avoiding “jumping” output values;

  • The function is differentiable, which means the slope of the Sigmoid curve can be found at any point;

  • Clear predictions, i.e. outputs very close to 1 or 0.

What are the disadvantages of the Sigmoid activation function?

  • The gradient tends to vanish;

  • The function output is not zero-centered, which reduces the efficiency of weight updates;

  • The Sigmoid function performs exponential operations, which are slow for computers to evaluate.
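To make the saturation and zero-centering points above concrete, here is a minimal NumPy sketch of the Sigmoid function and its derivative (not part of the original article):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # derivative of the sigmoid

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))        # all outputs lie in (0, 1) and are positive, i.e. not zero-centered
print(sigmoid_grad(x))   # ~0 at |x| = 10: the gradient vanishes for large inputs
```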

2. Tanh/hyperbolic tangent activation function

The graph of the tanh activation function is also S-shaped, and the expression is as follows:

f(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x))

Tanh is the hyperbolic tangent function. The curves of the tanh and Sigmoid functions are quite similar, but tanh has some advantages over the Sigmoid function.

  • First, like Sigmoid, when the input is very large or very small, the output is almost flat and the gradient is small, which is not conducive to weight updates. The difference between the two lies in the output interval: tanh’s output interval is (−1, 1), and the whole function is zero-centered, which is better than Sigmoid.

  • In the tanh graph, negative inputs are mapped strongly negative, and inputs near zero are mapped to values near zero.

Note: In general binary classification problems, the tanh function is used for the hidden layers and the sigmoid function for the output layer, but this is not fixed and should be adjusted for the specific problem.
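As a rough illustration of that convention, here is a tiny one-hidden-layer forward pass with tanh in the hidden layer and sigmoid in the output layer; the weights here are random placeholders rather than trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, b1, w2, b2):
    h = np.tanh(W1 @ x + b1)     # hidden layer: tanh, zero-centered activations
    return sigmoid(w2 @ h + b2)  # output layer: sigmoid, a probability in (0, 1)

x = rng.normal(size=4)                          # one example with 4 features
W1 = rng.normal(size=(3, 4)); b1 = np.zeros(3)  # hidden layer weights and biases
w2 = rng.normal(size=3);      b2 = 0.0          # output layer weights and bias
print(forward(x, W1, b1, w2, b2))               # predicted class-1 probability
```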

3. ReLU activation function

The graph of the ReLU activation function is shown in the figure above, and the function expression is as follows:

f(x) = max(0, x)

The ReLU function is a popular activation function in deep learning. Compared with the Sigmoid and tanh functions, it has the following advantages:

  • When the input is positive, there is no gradient saturation problem.

  • Computation is much faster. The ReLU function involves only a simple linear relationship, so it can be computed much faster than Sigmoid and tanh.

Of course, there are drawbacks:

  1. The Dead ReLU problem. When the input is negative, ReLU fails completely. This is not an issue during forward propagation (some regions are simply sensitive and others are not), but during backpropagation a negative input produces a gradient of exactly zero. The Sigmoid and tanh functions suffer from the same kind of problem.

  2. We find that the output of the ReLU function is either 0 or positive, which means that ReLU is not a zero-centered function.
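A minimal sketch of ReLU and its gradient makes both drawbacks visible: negative inputs are clipped to 0 (so the output is not zero-centered) and they receive exactly zero gradient (the Dead ReLU behaviour):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)   # 1 for positive inputs, 0 otherwise

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))        # [0.  0.  0.  0.5 3. ] -> outputs are 0 or positive
print(relu_grad(x))   # [0.  0.  0.  1.  1. ] -> zero gradient for negative inputs
```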

4. Leaky ReLU

It is an activation function specifically designed to solve the Dead ReLU problem:

f(x) = max(0.01x, x)

ReLU vs Leaky ReLU

Why is Leaky ReLU better than ReLU?

  1. Leaky ReLU addresses the zero-gradient problem for negative inputs by giving the negative input a very small linear component of x (for example 0.01x);

  2. The leak helps expand the range of the ReLU function; the leak coefficient is usually around 0.01;

  3. Leaky ReLU’s function range is (−∞, +∞).

Note: In theory, Leaky ReLU has all the advantages of ReLU and none of the Dead ReLU problems, but in practice it has not been fully proven that Leaky ReLU is always better than ReLU.
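Here is a sketch of Leaky ReLU with the commonly used leak of 0.01; unlike ReLU, the gradient for negative inputs is small but never exactly zero:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)   # never exactly zero

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(leaky_relu(x))        # negative inputs keep a small slope: [-0.03 -0.005 0.5 3.]
print(leaky_relu_grad(x))   # [0.01 0.01 1.   1.  ]
```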

5. ELU

ELU vs Leaky ReLU vs ReLU

ELU was also proposed to solve the problems of ReLU. Unlike ReLU, ELU has negative values, which pushes the mean of the activations closer to zero. Mean activations that are close to zero make learning faster, because they bring the gradient closer to the natural gradient.

Obviously, ELU has all the advantages of ReLU and:

  • There is no Dead ReLU problem, and the mean of the output is close to 0 (zero-centered);

  • By reducing the bias shift effect, ELU brings the normal gradient closer to the unit natural gradient, which speeds up learning as the mean moves toward zero;

  • ELU saturates to a negative value for large negative inputs, thereby reducing the variation and information propagated forward.

One small problem is that ELU is more computationally intensive. As with Leaky ReLU, although it is better than ReLU in theory, there is currently insufficient practical evidence that ELU is always better than ReLU.
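A sketch of ELU with α = 1 (the value of α here is an assumption for illustration, not taken from the article); negative inputs saturate smoothly toward −α instead of being cut off at zero, which is what pushes the mean activation toward zero:

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for positive inputs, alpha * (exp(x) - 1) for non-positive inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(x))   # the negative side saturates toward -alpha; the positive side is the identity
```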

6. Parametric ReLU

PReLU is also an improved version of ReLU:

Take a look at PReLU’s formula: f(x) = x if x > 0, and f(x) = a_i · x if x ≤ 0. The parameter a_i is usually a number between 0 and 1, and is usually relatively small.

  • If a_i = 0, then f becomes ReLU;

  • If a_i > 0 (a fixed small value), f becomes Leaky ReLU;

  • If a_i is a learnable parameter, f becomes PReLU

PReLU has the following advantages:

  1. In the negative range, the slope of PReLU is smaller, which also avoids the Dead ReLU problem.

  2. Compared with ELU, PReLU is a linear operation in the negative range: even though the slope is small, it does not tend to 0.
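The three cases above can be checked with a small sketch: a = 0 reduces PReLU to ReLU, a small fixed a gives Leaky ReLU, and in an actual network a would be a parameter learned by backpropagation (here it is just a plain argument for illustration):

```python
import numpy as np

def prelu(x, a):
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, -0.5, 1.5])
print(prelu(x, a=0.0))    # a = 0          -> behaves like ReLU
print(prelu(x, a=0.01))   # small fixed a  -> behaves like Leaky ReLU
print(prelu(x, a=0.25))   # learnable a (fixed placeholder here) -> PReLU
```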

7. Softmax

Softmax is the activation function for multi-class classification problems, in which class membership must be assigned among more than two class labels. For any real vector of length K, Softmax compresses it into a real vector of length K whose values lie in the range (0, 1) and sum to 1.

Softmax differs from an ordinary max function: the max function outputs only the largest value, whereas Softmax ensures that smaller values receive a smaller probability instead of being discarded outright. We can think of it as a probabilistic or “soft” version of the argmax function.

The denominator of the Softmax function combines all factors of the original output values, which means that the probabilities produced by Softmax are related to one another.

The main disadvantages of Softmax activation functions are:

  1. Not differentiable at zero;

  2. The gradient for negative inputs is zero, which means that for activations in this region the weights are not updated during backpropagation, producing dead neurons that never activate.
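Here is a minimal sketch of Softmax, softmax(x)_i = exp(x_i) / Σ_j exp(x_j); subtracting the maximum before exponentiating is a standard numerical-stability trick that does not change the result. The shared denominator is what couples the output probabilities to one another:

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)     # numerical stability; the result is unchanged
    e = np.exp(z)
    return e / np.sum(e)  # shared denominator: every input affects every output

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p)          # values in (0, 1), e.g. [0.659 0.242 0.099]
print(p.sum())    # ~1.0: the probabilities sum to one
```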

8. Swish

f(x) = x * sigmoid(x)

Swish’s design was inspired by the use of the sigmoid function for gating in LSTMs and highway networks. Here the same value is used as both the gate and the gated input, which simplifies the gating mechanism; this is called self-gating.

The advantage of self-gating is that it only requires simple scalar inputs, whereas normal gating requires multiple scalar inputs. This makes it easy for self-gated activation functions such as Swish to replace activation functions (such as ReLU) that take a single scalar as input without changing the hidden capacity or the number of parameters.

The main advantages of the Swish activation function are as follows:

  • “Unboundedness” helps prevent the gradient from gradually approaching zero and saturating during slow training (at the same time, boundedness is also advantageous, because bounded activation functions can provide strong regularization, and the problem of large negative inputs is also resolved);

  • The derivative is always greater than 0;

  • Smoothness plays an important role in optimization and generalization.
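A sketch of Swish as defined above, f(x) = x * sigmoid(x); note how small negative inputs are kept (scaled down) rather than being clipped to zero as in ReLU:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)   # self-gating: the input gates itself

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))   # smooth, unbounded above, small (non-zero) values for negative inputs
```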

9. Maxout

In a Maxout layer, the activation function is the maximum of its inputs, so a multilayer perceptron with only 2 Maxout nodes can fit any convex function.

A single Maxout node can be interpreted as making a piecewise linear (PWL) approximation to a real-valued function whose graph lies below the line segment between any two of its points, i.e. a convex function.

Maxout can also be implemented for a d-dimensional vector v:

f(v) = max_i (w_i^T v + b_i)

Suppose two convex functions h_1(x) and h_2(x) are approximated by two Maxout nodes; their difference g(x) = h_1(x) − h_2(x) is then a continuous PWL function.

Thus, a Maxout layer consisting of two Maxout nodes can well approximate any continuous function.
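Here is a sketch of a single Maxout unit on a d-dimensional input, taking the maximum over k affine functions of the input (k = 2 in this example); the weights are random placeholders chosen only to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)

def maxout(v, W, b):
    """One Maxout unit: the maximum over k affine functions of the d-dimensional input v."""
    return np.max(W @ v + b)   # W has shape (k, d), b has shape (k,)

d, k = 4, 2
v = rng.normal(size=d)
W = rng.normal(size=(k, d))
b = rng.normal(size=k)
print(maxout(v, W, b))   # a piecewise linear, convex function of v
```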

10. Softplus

f(x) = ln(1 + exp(x))

The derivative of Softplus is

f′(x) = exp(x) / (1 + exp(x)) = 1 / (1 + exp(−x)), which is also the logistic/sigmoid function.

The Softplus function is similar to the ReLU function, but it is relatively smooth; like ReLU, it applies one-sided suppression. It has a wide acceptance range: (0, +∞).
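Finally, a sketch of Softplus together with a quick numerical check that its derivative matches the sigmoid function, as stated above:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))   # ln(1 + exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)
eps = 1e-6
numeric_grad = (softplus(x + eps) - softplus(x - eps)) / (2 * eps)  # central difference
print(np.allclose(numeric_grad, sigmoid(x), atol=1e-5))   # True: d/dx softplus(x) = sigmoid(x)
```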

The original link: sukanyabag.medium.com/activation-…