CS229 Lecture notes

Original authors: Andrew Ng, Kian Katanforoosh
Translator: CycleUser

Links: Zhihu column | CS229, Stanford University | NetEase open course video with Chinese subtitles

Additional handout on Backpropagation

1 Forward Propagation

Recall that, given an input feature $x$, we define $a^{[0]}=x$. Then for layers $l=1,2,3,\dots,N$, where $N$ is the number of layers in the network, we have

  1. $z^{[l]}=W^{[l]}a^{[l-1]}+b^{[l]}$
  2. $a^{[l]}=g^{[l]}(z^{[l]})$

In the notes, the nonlinearity $g^{[l]}$ is assumed to be the same for all layers except the $N$-th. This is because at the output layer we may want to perform regression [hence $g(x)=x$], binary classification [$g(x)=\text{sigmoid}(x)$], or multiclass classification [$g(x)=\text{softmax}(x)$]. So we distinguish $g^{[N]}$ from $g$, and assume $g$ is used for all layers other than layer $N$.
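To make the recursion concrete, here is a minimal NumPy sketch of the forward pass (not from the original handout; the names `Ws`, `bs`, `g`, and `g_out` are illustrative). It assumes column-vector inputs, uses the same activation `g` for layers $1,\dots,N-1$, and an output activation `g_out` for layer $N$, matching the convention above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, Ws, bs, g=sigmoid, g_out=lambda z: z):
    """Forward pass: a[0] = x; z[l] = W[l] a[l-1] + b[l]; a[l] = g[l](z[l]).

    Ws and bs are lists of length N. g is applied at layers 1..N-1 and
    g_out (here the identity, i.e. regression) at the output layer N.
    """
    a = x
    N = len(Ws)
    for l in range(N):
        z = Ws[l] @ a + bs[l]
        a = g(z) if l < N - 1 else g_out(z)
    return a  # a[N], i.e. y_hat

# Example: a network with one hidden layer mapping R^3 -> R^1
Ws = [np.random.rand(4, 3), np.random.rand(1, 4)]
bs = [np.random.rand(4, 1), np.random.rand(1, 1)]
y_hat = forward(np.random.rand(3, 1), Ws, bs)
```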

Finally, given the output of the network $a^{[N]}$, which we will write as $\hat y$ for brevity, we can measure the loss $J(W,b)=L(a^{[N]},y)=L(\hat y,y)$. For example, for regression on real values we can use the squared loss:

$$ L(\hat y,y)=\frac{1}{2}(\hat y-y)^2 $$

For binary classification using logistic regression, we can use the following loss function:

$$ L(\hat y,y)=-\left[y\log \hat y+(1-y)\log(1-\hat y)\right] $$

Equivalently, this is the negative log-likelihood. Finally, for softmax regression over $k$ classes, we use the cross-entropy loss:

$$ L(\hat y,y)=-\sum^k_{j=1}1\{y=j\}\log\hat y_j $$

This is simply the negative log-likelihood extended to the multiclass setting. Note that $\hat y$ is a $k$-dimensional vector here. If we also let $y$ denote the $k$-dimensional vector with a 1 at the $l$-th position and 0 everywhere else, where $l$ is the true label, then the cross-entropy loss can also be written as:

$$ L(\hat y,y)=-\sum^k_{j=1}y_j\log \hat y_j $$
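To see that the indicator form and the one-hot form agree, here is a tiny NumPy check with made-up numbers (not part of the original notes; classes are 0-indexed here for convenience):

```python
import numpy as np

k = 3
y_hat = np.array([0.2, 0.5, 0.3])  # softmax output over k classes (sums to 1)
label = 1                          # the true class l

# Indicator form: -sum_j 1{y = j} log(y_hat_j) keeps only the true-class term
loss_indicator = -np.log(y_hat[label])

# One-hot form: y has a 1 at position l and 0 elsewhere
y = np.zeros(k)
y[label] = 1
loss_onehot = -np.sum(y * np.log(y_hat))

assert np.isclose(loss_indicator, loss_onehot)
```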

2 Backpropagation

We can now define some additional notation for backpropagation. (This part of the notes is based on the scribe notes by Ziang Xie for Stanford's supervised learning lectures on multi-layer neural networks.)

First, define:

$$ \delta^{[l]} =\nabla_{z^{[l]}}L(\hat y,y) $$

We can then define a three-step "recipe" for computing the gradient with respect to each $W^{[l]}, b^{[l]}$, as follows:

  1. For the output layer $N$:

    $$ \delta^{[N]} =\nabla_{z^{[N]}}L(\hat y,y) $$

    Sometimes we can compute $\nabla_{z^{[N]}}L(\hat y,y)$ directly (for example, when $g^{[N]}$ is the softmax function), while at other times (for example, when $g^{[N]}$ is the sigmoid function $\sigma$) we can apply the chain rule (see the worked sigmoid example below):

    $$ \nabla_{z^{[N]}}L(\hat y,y)= \nabla_{\hat y}L(\hat y,y) \circ (g^{[N]})'(z^{[N]}) $$

    Note that $(g^{[N]})'(z^{[N]})$ denotes the elementwise derivative with respect to $z^{[N]}$.

  2. For $l = N-1, N-2, \dots, 1$:

    $$ \delta^{[l]} = \left( {W^{[l+1]}}^T \delta^{[l+1]} \right) \circ g'( z^{[l]} ) $$

  3. Finally, we can compute the gradients for layer $l$:

    $$ \begin{aligned} \nabla_{W^{[l]}}J(W,b)&= \delta^{[l]}{a^{[l-1]}}^T \\ \nabla_{b^{[l]}}J(W,b)&= \delta^{[l]} \end{aligned} $$

Here $\circ$ denotes the elementwise (Hadamard) product. Note that the above procedure is for a single training example.
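As a worked example of step 1 (this derivation is not in the original handout, but it is what lies behind the `dz2 = (a2 - y)` line in the code at the end), take a sigmoid output unit with the binary cross-entropy loss above, so that $\hat y=\sigma(z^{[N]})$ and $\sigma'(z)=\sigma(z)(1-\sigma(z))$. The chain rule then gives

$$ \delta^{[N]} = \nabla_{\hat y}L(\hat y,y)\circ(g^{[N]})'(z^{[N]}) = \left(-\frac{y}{\hat y}+\frac{1-y}{1-\hat y}\right)\hat y(1-\hat y)=\hat y-y $$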

You can verify steps 1 and 3 of the above procedure on logistic regression ($N=1$, with $g^{[1]}$ the sigmoid function $\sigma$). Recall that $\sigma'(z)=\sigma(z)\circ(1-\sigma(z))$ and that $\sigma(z^{[1]})$ is $a^{[1]}$. Note that for logistic regression, if $x\in \mathbb{R}^{n\times 1}$ is a column vector, then $W^{[1]}\in \mathbb{R}^{1\times n}$, and hence $\nabla_{W^{[1]}}J(W,b)\in \mathbb{R}^{1\times n}$. A code example is available at http://cs229.stanford.edu/notes/backprop.py and is shown below.

(Translator's note: for convenience, the code from the link above is reproduced below.)

```python
# http://cs229.stanford.edu/notes/backprop.py
import numpy as np
from copy import copy

# Example backpropagation code for binary classification with 2-layer
# neural network (single hidden layer)

sigmoid = lambda x: 1 / (1 + np.exp(-x))

def fprop(x, y, params):
    # Follows procedure given in notes
    W1, b1, W2, b2 = [params[key] for key in ('W1', 'b1', 'W2', 'b2')]
    z1 = np.dot(W1, x) + b1
    a1 = sigmoid(z1)
    z2 = np.dot(W2, a1) + b2
    a2 = sigmoid(z2)
    loss = -(y * np.log(a2) + (1-y) * np.log(1-a2))
    ret = {'x': x, 'y': y, 'z1': z1, 'a1': a1, 'z2': z2, 'a2': a2, 'loss': loss}
    for key in params:
        ret[key] = params[key]
    return ret

def bprop(fprop_cache):
    # Follows procedure given in notes
    x, y, z1, a1, z2, a2, loss = [fprop_cache[key] for key in ('x', 'y', 'z1', 'a1', 'z2', 'a2', 'loss')]
    dz2 = (a2 - y)
    dW2 = np.dot(dz2, a1.T)
    db2 = dz2
    dz1 = np.dot(fprop_cache['W2'].T, dz2) * sigmoid(z1) * (1 - sigmoid(z1))
    dW1 = np.dot(dz1, x.T)
    db1 = dz1
    return {'b1': db1, 'W1': dW1, 'b2': db2, 'W2': dW2}

# Gradient checking
if __name__ == '__main__':
    # Initialize random parameters and inputs
    W1 = np.random.rand(2, 2)
    b1 = np.random.rand(2, 1)
    W2 = np.random.rand(1, 2)
    b2 = np.random.rand(1, 1)
    params = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
    x = np.random.rand(2, 1)
    y = np.random.randint(0, 2)  # Returns 0/1

    fprop_cache = fprop(x, y, params)
    bprop_cache = bprop(fprop_cache)

    # Numerical gradient checking
    # Note how slow this is! Thus we want to use the backpropagation algorithm instead.
    eps = 1e-6
    ng_cache = {}
    # For every single parameter (W, b)
    for key in params:
        param = params[key]
        # This will be our numerical gradient
        ng = np.zeros(param.shape)
        for j in range(ng.shape[0]):
            for k in range(ng.shape[1]):
                # For every element of parameter matrix, compute gradient of loss wrt
                # that element numerically using finite differences
                add_eps = np.copy(param)
                min_eps = np.copy(param)
                add_eps[j, k] += eps
                min_eps[j, k] -= eps
                add_params = copy(params)
                min_params = copy(params)
                add_params[key] = add_eps
                min_params[key] = min_eps
                ng[j, k] = (fprop(x, y, add_params)['loss'] - fprop(x, y, min_params)['loss']) / (2 * eps)
        ng_cache[key] = ng

    # Compare numerical gradients to those computed using backpropagation algorithm
    for key in params:
        print(key)
        # These should be the same
        print(bprop_cache[key])
        print(ng_cache[key])
```
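When the script is run, it prints each parameter name followed by the gradient computed by `bprop` and the numerical gradient obtained by finite differences; the two matrices should match closely, which is a handy sanity check whenever you implement backpropagation by hand.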