CS229 Lecture notes

Original authors: Andrew Ng, Kian Katanforoosh
Translator: CycleUser

Links: Zhihu column | CS229, Stanford University | NetEase open course video with Chinese subtitles

Additional handout on Backpropagation

1 Forward Propagation

Recall that, given an input feature $x$, we define $a^{[0]}=x$. Then for layers $l=1,2,3,\dots,N$, where $N$ is the number of layers in the network, we have

  1. $z^{[l]}=W^{[l]}a^{[l-1]}+b^{[l]}$
  2. $a^{[l]}=g^{[l]}(z^{[l]})$

In the notes, the nonlinearity $g^{[l]}$ is assumed to be the same for all layers except the $N$-th. This is because at the output layer we may want to perform regression [hence $g(x)=x$], binary classification [$g(x)=\text{sigmoid}(x)$], or multiclass classification [$g(x)=\text{softmax}(x)$]. So we distinguish $g^{[N]}$ from $g$, and assume $g$ is used for all layers other than layer $N$.
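To make the recursion concrete, here is a minimal NumPy sketch of the forward pass (not from the original handout; the names `Ws`, `bs`, `g`, and `g_out` are illustrative). It assumes column-vector inputs, uses the same activation `g` for layers $1,\dots,N-1$, and an output activation `g_out` for layer $N$, matching the convention above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, Ws, bs, g=sigmoid, g_out=lambda z: z):
    """Forward pass: a[0] = x; z[l] = W[l] a[l-1] + b[l]; a[l] = g[l](z[l]).

    Ws and bs are lists of length N. g is applied at layers 1..N-1 and
    g_out (here the identity, i.e. regression) at the output layer N.
    """
    a = x
    N = len(Ws)
    for l in range(N):
        z = Ws[l] @ a + bs[l]
        a = g(z) if l < N - 1 else g_out(z)
    return a  # a[N], i.e. y_hat

# Example: a network with one hidden layer mapping R^3 -> R^1
Ws = [np.random.rand(4, 3), np.random.rand(1, 4)]
bs = [np.random.rand(4, 1), np.random.rand(1, 1)]
y_hat = forward(np.random.rand(3, 1), Ws, bs)
```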

Finally, given the output of the network $a^{[N]}$, which we will write as $\hat y$ for brevity, we can measure the loss $J(W,b)=L(a^{[N]},y)=L(\hat y,y)$. For example, for regression on real values we can use the squared loss:

$$ L(\hat y,y)=\frac{1}{2}(\hat y-y)^2 $$

For binary classification using logistic regression, we can use the following loss function:

$$ L(\hat y,y)=-\left[y\log \hat y+(1-y)\log(1-\hat y)\right] $$

Equivalently, this is the negative log-likelihood. Finally, for softmax regression over $k$ classes, we use the cross-entropy loss:

$$ L(\hat y,y)=-\sum^k_{j=1}1\{y=j\}\log\hat y_j $$

This is simply the negative log-likelihood extended to the multiclass setting. Note that $\hat y$ is a $k$-dimensional vector here. If we also let $y$ denote the $k$-dimensional vector with a 1 at the $l$-th position and 0 everywhere else, where $l$ is the true label, then the cross-entropy loss can also be written as:

$$ L(\hat y,y)=-\sum^k_{j=1}y_j\log \hat y_j $$
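To see that the indicator form and the one-hot form agree, here is a tiny NumPy check with made-up numbers (not part of the original notes; classes are 0-indexed here for convenience):

```python
import numpy as np

k = 3
y_hat = np.array([0.2, 0.5, 0.3])  # softmax output over k classes (sums to 1)
label = 1                          # the true class l

# Indicator form: -sum_j 1{y = j} log(y_hat_j) keeps only the true-class term
loss_indicator = -np.log(y_hat[label])

# One-hot form: y has a 1 at position l and 0 elsewhere
y = np.zeros(k)
y[label] = 1
loss_onehot = -np.sum(y * np.log(y_hat))

assert np.isclose(loss_indicator, loss_onehot)
```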

2 Backpropagation

We can now define some additional notation for backpropagation. (This part of the notes is based on the scribe notes by Ziang Xie for Stanford's supervised learning lectures on multi-layer neural networks.)

First, define:

$$ \delta^{[l]} =\nabla_{z^{[l]}}L(\hat y,y) $$

We can then define a three-step "recipe" for computing the gradient with respect to each $W^{[l]}, b^{[l]}$, as follows:

  1. For the output layer $N$:

    $$ \delta^{[N]} =\nabla_{z^{[N]}}L(\hat y,y) $$

    Sometimes we can compute $\nabla_{z^{[N]}}L(\hat y,y)$ directly (for example, when $g^{[N]}$ is the softmax function), while at other times (for example, when $g^{[N]}$ is the sigmoid function $\sigma$) we can apply the chain rule (see the worked sigmoid example below):

    $$ \nabla_{z^{[N]}}L(\hat y,y)= \nabla_{\hat y}L(\hat y,y) \circ (g^{[N]})'(z^{[N]}) $$

    Note that $(g^{[N]})'(z^{[N]})$ denotes the elementwise derivative with respect to $z^{[N]}$.

  2. For $l = N-1, N-2, \dots, 1$:

    $$ \delta^{[l]} = \left( {W^{[l+1]}}^T \delta^{[l+1]} \right) \circ g'( z^{[l]} ) $$

  3. Finally, we can compute the gradients for layer $l$:

    $$ \begin{aligned} \nabla_{W^{[l]}}J(W,b)&= \delta^{[l]}{a^{[l-1]}}^T \\ \nabla_{b^{[l]}}J(W,b)&= \delta^{[l]} \end{aligned} $$

Here $\circ$ denotes the elementwise (Hadamard) product. Note that the above procedure is for a single training example.
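As a worked example of step 1 (this derivation is not in the original handout, but it is what lies behind the `dz2 = (a2 - y)` line in the code at the end), take a sigmoid output unit with the binary cross-entropy loss above, so that $\hat y=\sigma(z^{[N]})$ and $\sigma'(z)=\sigma(z)(1-\sigma(z))$. The chain rule then gives

$$ \delta^{[N]} = \nabla_{\hat y}L(\hat y,y)\circ(g^{[N]})'(z^{[N]}) = \left(-\frac{y}{\hat y}+\frac{1-y}{1-\hat y}\right)\hat y(1-\hat y)=\hat y-y $$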

You can verify steps 1 and 3 of the above procedure on logistic regression ($N=1$, with $g^{[1]}$ the sigmoid function $\sigma$). Recall that $\sigma'(z)=\sigma(z)\circ(1-\sigma(z))$ and that $\sigma(z^{[1]})$ is $a^{[1]}$. Note that for logistic regression, if $x\in \mathbb{R}^{n\times 1}$ is a column vector, then $W^{[1]}\in \mathbb{R}^{1\times n}$, and hence $\nabla_{W^{[1]}}J(W,b)\in \mathbb{R}^{1\times n}$. A code example is available at http://cs229.stanford.edu/notes/backprop.py and is shown below.

(Translator's note: for convenience, the code from the link above is reproduced below.)

```python
# http://cs229.stanford.edu/notes/backprop.py
import numpy as np
from copy import copy

# Example backpropagation code for binary classification with 2-layer
# neural network (single hidden layer)

sigmoid = lambda x: 1 / (1 + np.exp(-x))

def fprop(x, y, params):
    # Follows procedure given in notes
    W1, b1, W2, b2 = [params[key] for key in ('W1', 'b1', 'W2', 'b2')]
    z1 = np.dot(W1, x) + b1
    a1 = sigmoid(z1)
    z2 = np.dot(W2, a1) + b2
    a2 = sigmoid(z2)
    loss = -(y * np.log(a2) + (1-y) * np.log(1-a2))
    ret = {'x': x, 'y': y, 'z1': z1, 'a1': a1, 'z2': z2, 'a2': a2, 'loss': loss}
    for key in params:
        ret[key] = params[key]
    return ret

def bprop(fprop_cache):
    # Follows procedure given in notes
    x, y, z1, a1, z2, a2, loss = [fprop_cache[key] for key in ('x', 'y', 'z1', 'a1', 'z2', 'a2', 'loss')]
    dz2 = (a2 - y)
    dW2 = np.dot(dz2, a1.T)
    db2 = dz2
    dz1 = np.dot(fprop_cache['W2'].T, dz2) * sigmoid(z1) * (1 - sigmoid(z1))
    dW1 = np.dot(dz1, x.T)
    db1 = dz1
    return {'b1': db1, 'W1': dW1, 'b2': db2, 'W2': dW2}

# Gradient checking
if __name__ == '__main__':
    # Initialize random parameters and inputs
    W1 = np.random.rand(2, 2)
    b1 = np.random.rand(2, 1)
    W2 = np.random.rand(1, 2)
    b2 = np.random.rand(1, 1)
    params = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
    x = np.random.rand(2, 1)
    y = np.random.randint(0, 2)  # Returns 0/1

    fprop_cache = fprop(x, y, params)
    bprop_cache = bprop(fprop_cache)

    # Numerical gradient checking
    # Note how slow this is! Thus we want to use the backpropagation algorithm instead.
    eps = 1e-6
    ng_cache = {}
    # For every single parameter (W, b)
    for key in params:
        param = params[key]
        # This will be our numerical gradient
        ng = np.zeros(param.shape)
        for j in range(ng.shape[0]):
            for k in range(ng.shape[1]):
                # For every element of parameter matrix, compute gradient of loss wrt
                # that element numerically using finite differences
                add_eps = np.copy(param)
                min_eps = np.copy(param)
                add_eps[j, k] += eps
                min_eps[j, k] -= eps
                add_params = copy(params)
                min_params = copy(params)
                add_params[key] = add_eps
                min_params[key] = min_eps
                ng[j, k] = (fprop(x, y, add_params)['loss'] - fprop(x, y, min_params)['loss']) / (2 * eps)
        ng_cache[key] = ng

    # Compare numerical gradients to those computed using backpropagation algorithm
    for key in params:
        print(key)
        # These should be the same
        print(bprop_cache[key])
        print(ng_cache[key])
```
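When the script is run, it prints each parameter name followed by the gradient computed by `bprop` and the numerical gradient obtained by finite differences; the two matrices should match closely, which is a handy sanity check whenever you implement backpropagation by hand.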