
Why do we need activation functions?

Let's analyze this with the following neural network:

For the figure above, we know the dimensions $X^{4 \times 1}$, $H^{5 \times 1}$, and $O^{3 \times 1}$.

The calculation between the layers:


$$
\begin{array}{ll}
H = W_{1} X + b_{1}, & \text{in matrix form } [H]_{5 \times 1} = [W_{1}]_{5 \times 4}\,[X]_{4 \times 1} + [b_{1}]_{5 \times 1} \\
O = W_{2} H + b_{2}, & \text{in matrix form } [O]_{3 \times 1} = [W_{2}]_{3 \times 5}\,[H]_{5 \times 1} + [b_{2}]_{3 \times 1}
\end{array}
$$
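As a quick sanity check on these shapes, here is a minimal NumPy sketch (NumPy is my choice of tooling; the random values are only placeholders, the dimensions are what matter):

```python
import numpy as np

# Shapes taken from the formulas above; random values are placeholders.
x  = np.random.randn(4, 1)   # input,           4 x 1
W1 = np.random.randn(5, 4)   # layer-1 weights, 5 x 4
b1 = np.random.randn(5, 1)   # layer-1 bias,    5 x 1
W2 = np.random.randn(3, 5)   # layer-2 weights, 3 x 5
b2 = np.random.randn(3, 1)   # layer-2 bias,    3 x 1

H = W1 @ x + b1              # (5x4)(4x1) + (5x1) -> 5 x 1
O = W2 @ H + b2              # (3x5)(5x1) + (3x1) -> 3 x 1
print(H.shape, O.shape)      # (5, 1) (3, 1)
```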

Combining the above two formulas:


$$
\begin{aligned}
O &= W_{2} H + b_{2} \\
  &= W_{2}\left(W_{1} X + b_{1}\right) + b_{2} \\
  &= W_{2} W_{1} X + W_{2} b_{1} + b_{2}
\end{aligned}
$$

Let’s look at the matrix operations:


$$
[W_{2}]_{3 \times 5} \cdot [W_{1}]_{5 \times 4} = [W]_{3 \times 4}
$$


$$
[W_{2}]_{3 \times 5} \cdot [b_{1}]_{5 \times 1} = [b_{1'}]_{3 \times 1}
$$


$$
[b_{1'}]_{3 \times 1} + [b_{2}]_{3 \times 1} = [b]_{3 \times 1}
$$

Substituting these merged matrices back into the formula, it becomes $O = WX + b$. In that case the multi-layer neural network is pointless: if the layers can simply be merged, why not just write a single layer directly? But multi-layer models can do things a single layer cannot, so how do we keep the layers from being merged away? That is exactly what the activation function is for.
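A minimal NumPy sketch of this argument (my own illustration, with placeholder random values): two stacked linear layers collapse exactly into one, while inserting a ReLU between them breaks the equivalence.

```python
import numpy as np

x = np.random.randn(4, 1)
W1, b1 = np.random.randn(5, 4), np.random.randn(5, 1)
W2, b2 = np.random.randn(3, 5), np.random.randn(3, 1)

# Two stacked linear layers ...
O_stacked = W2 @ (W1 @ x + b1) + b2
# ... collapse exactly into one linear layer with merged parameters.
W = W2 @ W1                                 # 3 x 4
b = W2 @ b1 + b2                            # 3 x 1
print(np.allclose(O_stacked, W @ x + b))    # True

# With a ReLU between the layers, the merge no longer holds.
relu = lambda z: np.maximum(z, 0)
O_relu = W2 @ relu(W1 @ x + b1) + b2
print(np.allclose(O_relu, W @ x + b))       # almost always False
```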

1. ReLU function (Rectified Linear Unit)

Formula: $\operatorname{ReLU}(x) = \max(x, 0)$


Equivalently, written piecewise:

$$
\operatorname{ReLU}(x)= \begin{cases} x & \text{if } x>0 \\ 0 & \text{otherwise} \end{cases}
$$

The ReLU function preserves only positive elements and discards all negative elements by setting the corresponding activation value to 0.

A plot of the ReLU function looks like this:
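To reproduce the curve yourself, here is a minimal sketch, assuming NumPy and Matplotlib are available (my choice of libraries):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-8.0, 8.0, 200)
y = np.maximum(x, 0)          # ReLU(x) = max(x, 0)

plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("ReLU(x)")
plt.grid(True)
plt.show()
```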

2. Sigmoid function

Sigmoid was the first activation function I came across when I started studying; Andrew Ng's course discusses the benefits of using sigmoid in detail. I took notes at the time, which you can find here: Logistic Regression | Logistic Regression – Juejin (juejin.cn)

Formula: $\operatorname{sigmoid}(x) = \frac{1}{1 + \exp(-x)}$


The sigmoid can be viewed as a smooth approximation of the thresholding unit

$$
\sigma(x)= \begin{cases} 1 & \text{if } x>0 \\ 0 & \text{otherwise} \end{cases}
$$

For any input in its domain $\mathbb{R}$, the sigmoid function transforms the input into an output on the interval (0, 1). For this reason, sigmoid is often called a squashing function: it compresses an arbitrary input in the range $(-\infty, \infty)$ to a value in the range (0, 1).
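As a quick check of this squashing behaviour, a minimal NumPy sketch with arbitrarily chosen sample inputs:

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))
# -> roughly [4.5e-05, 0.269, 0.5, 0.731, 0.99995], all inside (0, 1)
```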

3. Tanh function

Formula: $\tanh(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}$

Similar to the sigmoid function, the tanh (hyperbolic tangent) function compresses its input into the interval (-1, 1).
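And a matching NumPy sketch (again with arbitrary sample inputs) showing tanh squashing values into (-1, 1):

```python
import numpy as np

x = np.array([-10.0, -1.0, -0.1, 0.0, 0.1, 1.0, 10.0])
print(np.tanh(x))
# -> roughly [-1.0, -0.762, -0.0997, 0.0, 0.0997, 0.762, 1.0], all inside (-1, 1)
```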