This is day 15 of my participation in the November Gwen Challenge. For details, see: The Last Gwen Challenge 2021

This note was originally only about weight decay; I added the regularization part myself. Because I started with Andrew Ng's course, it took me a while into Li Mu's lecture to realize that this is the same thing Andrew Ng covers in his regularization section, so I added the regularization material on my own.

Note that the $L_2$ norm used here is just one form of regularization.

Weight decay (often called $L_2$ regularization) is one of the most widely used regularization techniques for training parametric machine learning models.

By adding the $L_2$ norm of the weights to the loss function as a penalty term, the training objective changes from minimizing only the prediction loss on the training labels to minimizing the sum of the prediction loss and the penalty term.

To make the expression cleaner after taking the derivative, we also put a factor of $\frac{1}{2}$ in front of the penalty term.


$$L(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2$$

where


$$L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$
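As a concrete illustration, here is a minimal PyTorch sketch of this penalized objective for linear regression. The data `X`, `y`, the parameters `w`, `b`, and the strength `lam` are placeholder names chosen for the example, not anything from the original notes.

```python
import torch

# Hypothetical shapes: n samples, d features (toy data just for the sketch)
n, d = 100, 5
X = torch.randn(n, d)
y = torch.randn(n, 1)
w = torch.randn(d, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lam = 3e-3  # lambda, the weight-decay strength (illustrative value)

def squared_loss(X, y, w, b):
    """L(w, b): mean of 1/2 * (w^T x + b - y)^2 over the samples."""
    return (0.5 * (X @ w + b - y) ** 2).mean()

def penalized_loss(X, y, w, b, lam):
    """L(w, b) + lambda/2 * ||w||^2 (the bias b is typically not penalized)."""
    return squared_loss(X, y, w, b) + (lam / 2) * (w ** 2).sum()
```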

The minibatch stochastic gradient descent weight update (without weight decay) is as follows:


$$\begin{aligned} \mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)\end{aligned}$$
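Here is a rough sketch of that plain minibatch update done by hand in PyTorch (no weight decay yet). The batch tensors `Xb`, `yb` and the learning rate `eta` are illustrative assumptions.

```python
import torch

# Hypothetical minibatch and parameters, continuing the linear-regression setup
eta, batch_size = 0.03, 10
Xb = torch.randn(batch_size, 5)   # x^{(i)} for i in the batch B
yb = torch.randn(batch_size, 1)   # y^{(i)}
w = torch.randn(5, 1)
b = torch.zeros(1)

# Per-sample gradient w.r.t. w is x^{(i)} (w^T x^{(i)} + b - y^{(i)});
# averaging over the batch gives the sum/|B| term in the formula above.
err = Xb @ w + b - yb             # shape (|B|, 1)
grad_w = Xb.T @ err / batch_size  # (1/|B|) * sum_i x^{(i)} * err^{(i)}
grad_b = err.mean()

w = w - eta * grad_w
b = b - eta * grad_b
```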

Now for the derivation:

The $L_2$ norm $\|w\|_2$ is written as $\|w\|$ for short, and $\|w\|^2$ is its square.


$$\begin{aligned} &\because\|w\|=\|w\|_{2}=\sqrt{w_{1}^{2}+w_{2}^{2}+\cdots+w_{n}^{2}}=\sqrt{\sum_{i=1}^{n} w_{i}^{2}} \\ &\therefore\|w\|^{2}=\sum_{i=1}^{n} w_{i}^{2} \end{aligned}$$
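A tiny numeric check of these two quantities, using an arbitrary toy vector $w = (3, 4)$:

```python
import torch

w = torch.tensor([3.0, 4.0])

norm = torch.sqrt((w ** 2).sum())   # ||w||_2 = sqrt(3^2 + 4^2) = 5
norm_sq = (w ** 2).sum()            # ||w||^2 = 25

print(norm.item(), norm_sq.item())  # 5.0 25.0
print(torch.norm(w).item())         # same value via the built-in norm
```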

Taking the derivative of $\|w\|^{2}$ with respect to $w$:


$$\frac{\partial\|w\|^{2}}{\partial w_{i}}=2 w_{i}, \qquad \text{i.e.}\quad \frac{\partial\|w\|^{2}}{\partial \mathbf{w}}=2 \mathbf{w}$$
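A quick autograd sanity check of this derivative, on an arbitrary example vector:

```python
import torch

# Verify that the gradient of ||w||^2 with respect to w is 2w
w = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
norm_sq = (w ** 2).sum()
norm_sq.backward()

print(w.grad)  # tensor([ 2., -4.,  6.]) == 2 * w
```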

Adding the penalty's gradient to the minibatch weight update:


$$\begin{aligned} &\mathbf{w}-\frac{\eta}{|\mathcal{B}|}\left(\sum_{i \in \mathcal{B}} \mathbf{x}^{(i)}\left(\mathbf{w}^{\top} \mathbf{x}^{(i)}+b-y^{(i)}\right)+\frac{\lambda}{2} \cdot 2 \sum_{i \in \mathcal{B}} \mathbf{w}\right)\\ &=\mathbf{w}-\frac{\eta}{|\mathcal{B}|}\left(\sum_{i \in \mathcal{B}} \mathbf{x}^{(i)}\left(\mathbf{w}^{\top} \mathbf{x}^{(i)}+b-y^{(i)}\right)+\lambda|\mathcal{B}|\, \mathbf{w}\right)\\ &=\mathbf{w}-\eta \lambda \mathbf{w}-\frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)}\left(\mathbf{w}^{\top} \mathbf{x}^{(i)}+b-y^{(i)}\right) \end{aligned}$$

(Since $\mathbf{w}$ does not depend on $i$, $\sum_{i \in \mathcal{B}} \mathbf{w} = |\mathcal{B}|\,\mathbf{w}$, so the $\frac{1}{|\mathcal{B}|}$ cancels and leaves the $\eta\lambda\mathbf{w}$ term.)

That is:


$$\begin{aligned} \mathbf{w} & \leftarrow \left(1- \eta\lambda \right) \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned}$$
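Finally, a sketch of one update step in this "shrink the weights, then take the gradient step" form; the names and hyperparameter values are again placeholders. For plain SGD without momentum, `torch.optim.SGD`'s `weight_decay` argument adds $\lambda\mathbf{w}$ to the gradient, which works out to exactly this $(1-\eta\lambda)$ shrinkage.

```python
import torch

# Manual update in the form derived above:
#   w <- (1 - eta*lambda) * w - (eta/|B|) * sum_i x^{(i)} (w^T x^{(i)} + b - y^{(i)})
eta, lam, batch_size = 0.03, 3e-3, 10
Xb, yb = torch.randn(batch_size, 5), torch.randn(batch_size, 1)
w, b = torch.randn(5, 1), torch.zeros(1)

err = Xb @ w + b - yb
w = (1 - eta * lam) * w - eta * (Xb.T @ err) / batch_size
b = b - eta * err.mean()   # the bias is not decayed

# Equivalent built-in route (sketch): torch.optim.SGD([w, b], lr=eta, weight_decay=lam)
# would also decay b unless b is placed in a separate parameter group.
```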

You can read more of my Hands-on Deep Learning notes here: Hands-on Deep Learning – LolitaAnn's Column – Nuggets (juejin.cn)

Notes are still being updated …