
1. Where the activation function shows up

Whenever I copy code, I keep running into something called an activation function. Look at the following code:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')])

The activation='relu' part is the activation function.

So the question is: why is it there? What does it do?

2. Why do we need activation functions?

2.1 Weighted sum of neurons

We know that artificial intelligence’s neural networks mimic human neurons.

Look at a neuron: it has many dendrites, and each one picks up different information. Each dendrite makes its own judgment about that information, the signals are combined and passed down the axon, and the brain makes a decision: Run!

graph TD
    A1[dendrite X1] -- stimulus W1 --> B[axon Y]
    A2[dendrite X2] -- stimulus W2 --> B
    A3[dendrite X3] -- stimulus W3 --> B
    B -- comprehensive judgment --> C[make a decision]

Looking at a single neuron, its inputs and outputs look like this: Y = X1×W1+X2×W2+X3×W3.

In a neural network, this simply becomes a weighted multiplication of multidimensional matrices.

In essence, what each neuron computes is a weighted sum.

A weighted sum is a linear model, and its limitation is that no matter how many layers you stack, the whole thing is still a linear model.

Linear models do not solve nonlinear problems.
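Here is a minimal NumPy sketch (my own illustration, not from the original article) of why stacking purely linear layers changes nothing: two weight matrices applied one after another collapse into a single matrix, so the stack is still one linear layer. The weights below are arbitrary made-up values.

import numpy as np

# Two hypothetical "layers" with no activation function in between
W1 = np.array([[1.0, 2.0],
               [3.0, 4.0]])
W2 = np.array([[0.5, -1.0],
               [2.0,  1.5]])
x = np.array([1.0, -2.0])          # an arbitrary input

two_layers = W2 @ (W1 @ x)         # pass x through both linear layers
one_layer = (W2 @ W1) @ x          # one linear layer with the combined matrix

print(np.allclose(two_layers, one_layer))   # True: still just a linear model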

2.2 Can someone explain what linearity is?

Plenty of articles say that the activation function is there to remove linearity. And that one word, "linear," has scared off many beginners from ever looking at artificial intelligence again.

So what does linearity mean?

Linearity means the line never bends: like a stubborn fellow, it only ever goes one way. Take y = kx + b: as the variable grows, the output grows with it, without any limit in either direction, all the way to plus or minus infinity.

But, in practice, that’s not what we want.

A few examples:

  • Digit recognition: no matter how many pictures we feed in, we want the output to be one of 10 categories: 0~9.
  • Movie review sentiment analysis: no matter how many review sentences we feed in, we want the output to be the probability that the review is positive.
  • Stock forecasting: whether in a bull or a bear market, prices do not rise or fall to infinity; they approach some value.

In real life, very few things are linear. No matter how much money a person has, it is still a finite amount. Even hair that seems countless actually has a finite number of strands.

So, for neural networks to be useful in real life, they have to be de-linearized. That is why an activation function is attached to the layers of a neural network.

You can try out the following examples in the TensorFlow Playground.

A straight line can easily separate the two groups of samples.

But for samples distributed like the ones below, a straight line can no longer separate them.

But if we wrap the weighted sum in another function, y = f(kx + b), we get a function of a function. The model is no longer linear, and the two groups of samples can be separated easily.
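As a minimal sketch (my own illustration, not from the article), take the classic XOR problem: no straight line separates its two classes, yet a tiny network that wraps each weighted sum in a ReLU does the job. The weights below are hand-picked for illustration, not learned.

import numpy as np

def relu(x):
    return np.maximum(0, x)

# The four XOR inputs; their labels are 0, 1, 1, 0
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])      # hidden weights (hand-picked)
b1 = np.array([0.0, -1.0])       # hidden biases
w2 = np.array([1.0, -2.0])       # output weights

hidden = relu(X @ W1 + b1)       # the nonlinear step: a function within a function
output = hidden @ w2

print(output)                    # [0. 1. 1. 0.] -- XOR, which no single straight line can produce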

graph TD
    A1[dendrite X1] -- stimulus W1 --> B[axon Y]
    A2[dendrite X2] -- stimulus W2 --> B
    A3[dendrite X3] -- stimulus W3 --> B
    B -- activation function --> C[nonlinear effect]

The activation function maps the input to the output according to a specific rule.

The Galton board below, which is like adding an activation function to free fall, is a good illustration of what it does.

If you change the activation function, the inputs are the same, but the distribution of the outputs is different.
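A minimal sketch (my own illustration): feed the same batch of random inputs through sigmoid, tanh, and ReLU, and the output ranges and means come out noticeably different.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)                 # the same inputs for every activation

activations = {
    'sigmoid': 1 / (1 + np.exp(-x)),        # squashed into (0, 1)
    'tanh': np.tanh(x),                     # squashed into (-1, 1), zero-centered
    'relu': np.maximum(0, x),               # negatives clipped to 0
}

for name, y in activations.items():
    print(f"{name:7s} min={y.min():+.2f} max={y.max():+.2f} mean={y.mean():+.2f}")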

Different activation functions have different effects, so let's take a look at some common ones.

3. Common activation functions

3.1 Sigmoid S-shaped function

The Sigmoid function, also called the Logistic function, is common in biology as well, where it is known as the S-shaped growth curve.

I remember my biology teacher saying that when the environment is harsh, a population grows slowly; as the environment improves, the population grows rapidly; and once it reaches a certain size, competitive pressure makes it level off.

Because it does not grow without bound and maps every input into the interval between 0 and 1, the Sigmoid function is often used as a neural-network activation function in information science, and it is well suited to binary classification.
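For reference, sigmoid(x) = 1 / (1 + e^(-x)). A minimal sketch (my own illustration) of its squashing behavior:

import numpy as np

def sigmoid(x):
    # 1 / (1 + e^(-x)): maps any real number into the interval (0, 1)
    return 1 / (1 + np.exp(-x))

print(np.round(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])), 4))
# approximately [0.     0.2689 0.5    0.7311 1.    ]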

For example, the text sentiment analysis article "NLP foundation – Chinese Semantic Understanding: Douban film review sentiment analysis" uses sigmoid as the activation function of its last layer.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense

model = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=max_length),
    GlobalAveragePooling1D(),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')])

When a piece of text is fed in, a value between 0 and 1 comes out: close to 0 means negative, close to 1 means positive.

"Good, good acting ", 0.91799414 ===> Positive" It would be strange if it were good ", 0.19483969 ===> Negative "One-star subtitle ", 0.0028086603 ===> Negative" Good acting, good acting, Bad ", 0.17192301 ===> Bad "Acting good, acting good, acting good, acting good, very badCopy the code

3.2 Tanh hyperbolic tangent function

Tanh is the hyperbolic tangent function. Its formula is tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)).

It has a similar shape to sigmoid, except that the output range is -1 to 1.

It handles the zero-centered problem better than sigmoid: tanh outputs are centered around 0, whereas sigmoid outputs are always positive.
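A minimal sketch (my own illustration) comparing the two on the same inputs:

import numpy as np

x = np.linspace(-4, 4, 9)              # inputs symmetric around 0

sigmoid = 1 / (1 + np.exp(-x))         # outputs in (0, 1), never negative
tanh = np.tanh(x)                      # outputs in (-1, 1), zero-centered

print("sigmoid mean:", sigmoid.mean())  # 0.5, shifted away from zero
print("tanh mean:", tanh.mean())        # roughly 0, centered on zero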

3.3 ReLU linear rectifier function

ReLU stands for Rectified Linear Unit. It simply outputs max(0, x): negative inputs become 0, positive inputs pass through unchanged.

Compared with Sigmoid and Tanh, the rectified linear function has several advantages:

  • Closer to biology: perhaps only 1 to 4 percent of the brain's neurons are active at any one time, and ReLU can be used to keep neuron activity similarly sparse.
  • Simpler computation: there are no exponential operations as in the other functions, and because neuron activity is sparse, the overall computational cost drops.

If you’re building a neural network and you don’t know which activation function to use, use ReLU.
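A minimal sketch of ReLU and the sparsity it produces (my own illustration):

import numpy as np

def relu(x):
    # Rectified Linear Unit: negative inputs become 0, positive inputs pass through
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))        # [0.  0.  0.  0.5 2. ]

x = np.random.default_rng(0).normal(size=1000)
print("inactive (zero) outputs:", np.mean(relu(x) == 0))   # roughly half for these inputs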

3.4 Softmax normalized exponential function

Softmax is called the normalized exponential function. It generalizes Sigmoid's binary classification to multiple classes, and its purpose is to present multi-class results in the form of probabilities.

Softmax has been a controversial issue in the industry: does it count as an activation function?

From my point of view, it’s an activation function.

Softmax mainly targets multi-class classification problems where there is exactly one correct answer. It outputs the likelihood of each category as a probability; the outputs are mutually exclusive, and the probabilities sum to 1.
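A minimal sketch of softmax (my own illustration), showing that the outputs are nonnegative and sum to 1:

import numpy as np

def softmax(z):
    # Subtract the max for numerical stability, exponentiate, then normalize
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # raw scores (logits) for three classes
probs = softmax(scores)

print(probs)          # roughly [0.66 0.24 0.10]
print(probs.sum())    # 1.0 (up to floating-point rounding)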

For handwritten digit recognition, the final layer is softmax: Dense(10, activation='softmax'), which outputs the probabilities of the 10 categories from 0 to 9.

[[2.3691061e-11 1.0000000e+00 5.5736946e-14 1.7459076e-10 1.8988343e-13
  8.0071365e-31 1.2010425e-14 0.0000000e+00 6.0655720e-27 1.8470960e-20]]

A call to np.argmax(p) then shows that the index of the maximum probability in the list is 1, so the predicted digit is 1.

So that’s the activation function.