This is the 27th day of my participation in the August More Text Challenge

Notes of Andrew Ng’s Machine Learning —— (9) Neural Networks: Learning

Cost Functon & Backpropagation

Cost Function

Firstly, define some needed variables:

Notation Represent

L L
total number of layers in the network

s l s_l
number of units (not counting bias unit) in layer
l l

K K
number of output units/classes

Recall that in neural networks, we may have many output nodes.

The “Denote H” θ (x)kH_\Theta(x)_kH θ (x) K as being a hypothesis that results in the KthK^{th}Kth output.

Our cost function for neural networks is going to be a generalization of the one we used for logistic regression. Recall that the cost function for regularized logistic regression was:


J ( Theta. ) = 1 m i = 1 m [ y ( i )   log ( h Theta. ( x ( i ) ) ) + ( 1 y ( i ) )   log ( 1 h Theta. ( x ( i ) ) ) ] + Lambda. 2 m j = 1 n Theta. j 2 J(\theta) = – \frac{1}{m} \sum_{i=1}^m \large[ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 – y^{(i)})\ \log (1 – h_\theta(x^{(i)}))\large] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2

For neural networks, it is going to be slightly more complicated:


J ( Θ ) = 1 m i = 1 m k = 1 K [ y k ( i ) l o g ( ( h Θ ( x ( i ) ) ) k ) + ( 1 y k ( i ) ) l o g ( 1 ( h Θ ( x ( i ) ) ) k ) ] + Lambda. 2 m l = 1 L 1 i = 1 s l j = 1 s l + 1 ( Θ j . i ( l ) ) 2 J(\Theta)=-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[ y_k^{(i)}log\Big(\big(h_\Theta(x^{(i)})\big)_k\Big)+ (1-y_k^{(i)})log\Big(1-\big(h_\Theta(x^{(i)})\big)_k\Big) \Big]+ \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\Big(\Theta_{j,i}^{(l)}\Big)^2

We have added a few nested summations to account for our multiple output nodes. In the first part of the equation, before the square brackets, we have an additional nested summation that loops through the number of output nodes.

In the regularization part, after the square brackets, we must account for multiple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit). As before with logistic regression, we square every term.

Note:

  • the double sum simply adds up the logistic regression costs calculated for each cell in the output layer
  • The triple sum simply adds up the squares of all the individual θ s in the entire network.
  • the i in the triple sum does not refer to training example i

Backpropagation Algorithm

Backpropagation is neural-network terminology for minimizing our cost function, just like what we were doing with gradient descent in logistic and linear regression. Our goal is to compute:


min Θ J ( Θ ) \mathop{\textrm{min}}_\Theta J(\Theta)

That is, we want to minimize our cost function
J J
using an optimal set of parameters in theta. In this section we’ll look at the equations we use to compute the partial derivative of
J ( Θ ) J(\Theta)
:


partial partial Θ i . j ( l ) J ( Θ ) \frac{\partial}{\partial\Theta_{i,j}^{(l)}}J(\Theta)

To do so, we use the following algorithm:


Backpropagation Algorithm

Given training set (x(1),y(1)),… ,(x(m),y(m)){(x^{(1)},y^{(1)}),… ,(x^{(m)},y^{(m)})}(x(1),y(1)),… ,(x(m),y(m))

  • Set △ I,j(l):=0\Delta_{I,j}^{(l)}:=0 \Delta_{I,j}^{(l)}:=0 △ I,j(l):=0 for all l, I,jl, I,jl, I,j, (hence you end up having a matrix full of zeros)

(👇 Begin For loop: 👇)

For training example
t = 1  to  m t=1\textrm{ to } m
:

  1. Set
    a ( 1 ) : = x ( t ) a^{(1)}:=x^{(t)}
  2. Preform Forward Propagation to compute a(l)a^{(l)}a(l) for l=2,… Ll = 2, 3,… Ll = 2, 3,… ,L

  1. The Using of y (t) y ^ {} (t) y (t), compute the delta (L) = a (L) – y (t) \ delta ^ {} (L) = a ^ {} (L) – y ^ {(t)} the delta (L) = a (L) – y (t)

Where
L L
is our total number of layers and
a ( L ) a^{(L)}
is the vector of outputs of the activation units for the last layer.

So, our error values for the last layer are simply the differences of our actual results in the last layer and the correct outputs in
y y
.

To get the delta values of the layers before the last layer, we can use an equation that steps us back from right to left:

  1. Compute the delta (L – 1), the delta (L – 2),… Delta, delta (2) \ ^ {(L – 1)}, \ delta ^ {} (L – 2),… ^ delta, \ {} (2) the delta (L – 1), the delta (L – 2),… , the delta (2) using The delta (l) = ((Θ (l) T delta (l + 1)). ∗ (l). A ∗ (1 – a (l)) \ delta ^ {} (l) = \ big ((\ Theta ^ {} (l)) ^ ^ delta T \ {(l + 1)} \ big). * a ^ {} (l). * (1 – a ^ {} (l)) delta (l) = ((Θ ( T delta l) (l + 1)). ∗ (l). A ∗ (1 – (l) a)

*.∗” above are the element-wise multiplications in Octave/MATLAB.

The delta values of layer
l l
are calculated by multiplying the delta values in the next layer with the theta matrix of layer
l l
.

We then element-wise multiply that with a function called g’g ‘g ‘, or g-prime, which is the derivative of the activation function ggg evaluated with the input values given by z(l)z^{(l)}z(l).

The g-prime derivative terms can also be written out as: G ‘(l) (z) = a (l) ∗ (1 – (l) a) g (z ^ {} (l)) = a ^ {} (l). * (1 – a ^ {} (l)) g’ (l) (z) = a (l) ∗ (1 – (l) a)

  1. Δ I, j (l) : = Δ I, jl + aj (l) the delta (l + 1) \ Delta_ {I, j} ^ {} (l) : \ Delta_ = {I, j} ^ {l} + a_j ^ {} (l) \ delta ^ {} (l + 1) Δ I, j (l) : = Δ I, jl + aj (l) the delta (l + 1) or with Vectorization, Δ : (l) = Δ (l) + Delta (l + 1) (a) (l) T \ Delta ^ {} (l) : = ^ Delta \ + \ {} (l) Delta ^ {} (l + 1) (a ^ {} (l)) ^ T Δ : (l) = Δ (l) + Delta (l + 1) (a) (l) T

(👆 End For 👆)

Hence we update our new δ \Delta δ matrix.

  • Di, j (l) : = 1 m (Δ I, j (l) + lambda Θ I, j (l)) D_ {I, j} ^ {} (l) : = \ frac {1} {m} \ big (\ Delta_ {I, j} ^ {} (l) + \ lambda \ Theta_ {I, j} ^ {} (l) \ big) Di, j (l) : = m1 (Δ I, j (l) + lambda Θ I, j (l)), if j j \ indicates 0 not = 0 j  = 0;

  • Di, j (l) : = 1 m Δ I, j (l) D_ {I, j} ^ {} (l) : = \ frac {1} {m} \ Delta_ {I, j} ^ {} (l) Di, j (l) : = m1 Δ I, j (l) if j = 0 j = 0 j = 0.

The capital-delta matrix
D D
is used as an accumulator to add up our values as we go along and eventually compute our partial derivate.

Thus we get Partial partial Θ I, j, j (Θ) = Di (l), j (l) \ frac {\ partial} {\ partial \ Theta_ {I, j} ^ {(l)}} j = D_ (\ Theta) {I, j} ^ {} (l) partial Θ I, j (l) partial j (Θ) = Di, j (l).


Backpropagation Intuition

Recall that the cost function for a neural network is:


J ( Θ ) = 1 m i = 1 m k = 1 K [ y k ( t ) log ( h Θ ( x ( t ) ) ) k + ( 1 y k ( t ) ) log ( 1 h Θ ( x ( t ) ) ) k ] + Lambda. m l = 1 L 1 i = 1 s l j = 1 s l + 1 ( Θ j . i ( l ) ) 2 J(\Theta)=-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[y_k^{(t)}\log\big(h_\Theta(x^{(t)})\big)_k+(1-y_k^{(t)})\log\big( 1-h_\Theta(x^{(t)})\big)_k\Big]+\frac{\lambda}{m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_l+1}\Big(\Theta_{j,i}^{(l )}\Big)^2

If we consider simple non-multiclass classification (
K = 1 K=1
) and disregard regularization, the cost is copmputed with:


c o s t ( t ) = y ( t ) log ( h Θ ( x ( t ) ) ) + ( 1 y ( t ) ) log ( 1 h Θ ( x ( t ) ) ) cost(t) = y^{(t)}\log\big(h_\Theta(x^{(t)})\big)+(1-y^{(t)})\log\big(1-h_\Theta(x^{(t)})\big)

Intuitively,
δ j ( l ) \delta_j^{(l)}
is the “error” for
a j ( l ) a_j^{(l)}
(unit
j j
in layer
l l
). More formally, the delta values are actually the derivative of the cost function:


Delta t. j ( l ) = partial partial z j ( l ) c o s t ( t ) \delta_j^{(l)}=\frac{\partial}{\partial z_j^{(l)}}cost(t)

Recall that our derivative is the slope of a line tangent to the cost function, so the steeper the slope the more incorrect we are. Let us consider the following neural network below and see how we Could calculate some delta j (l) \ delta_j ^ {} (l) the delta j (l) :

In the image above, to calculate δ2(2)\delta_2^{(2)}δ2(2), We multiply the weights θ 12(2)\Theta_{12}^{(2)} θ 12(2) and θ 22(2)\Theta_{22}^{(2)} θ 22(2) by their respective δ\deltaδ values found to the right of each edge. So we get Delta 2 (2) = Θ 12 (2) 1 (3) + x delta Θ 22 (2) x delta 2 (3) \ delta_2 ^ {} (2) = \ Theta_ {12} ^ {(2)} \ times \ delta_1 ^ {} (3) + \ Theta_ {and} ^ {(2)} \ times \ delta_2 ^ {(3) } the delta 2 (2) = Θ 12 (2) 1 (3) + x delta Θ 22 (2) x delta 2 (3) To calculate every single possible delta j (l) \ delta_j ^ {} (l) the delta j (l), We could start from the right of our diagram. We can think of our edges as our θ Ij \Theta_{ij} θ ij. Going from right to Left, to calculate the value of δj(l)\delta_j^{(l)}δj(l), You can just take the over all sum of each weight times the δ\deltaδ it is coming from. Another example whenever the delta 2 (3) = Θ 12 (3) 1 (4) x delta \ delta_2 ^ {} (3) = \ Theta_ {12} ^ {(3)} \ times \ delta_1 ^ {} (4) the delta 2 (3) = Θ 12 (3) x delta 1 (4).

Backpropagation in Practice

Unrolling Parameters

With neural networks, we are working with sets of matrices:


Θ ( 1 ) . Θ ( 2 ) . Θ ( 3 ) . D ( 1 ) . D ( 2 ) . D ( 3 ) . \begin{array}{lll} \Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}, \dots \newline D^{(1)}, D^{(2)}, D^{(3)}, \dots \end{array}

In order to use optimizing function such as fminunc(), we will want to unroll all the elements and put them into one long vector:

thetaVector = [ Theta1(:); Theta2(:); Theta3(:) ];
deltaVector = [ D1(:); D2(:); D3(:) ];
Copy the code

If the dimensions of Theta1 is 10×11, Theta2 is 10×11 and Theta3 is 1×11, then we can get back our original matrices from the “unrolled” versions as follows:

Theta1 = reshape(thetaVector(1:110),10.11)
Theta2 = reshape(thetaVector(111:220),10.11)
Theta3 = reshape(thetaVector(221:231),1.11)
Copy the code

To summarize:

Gradient Checking

Gradient checking will assure that our backpropagation works as intended. We can approximate the derivative of our cost function with:


partial partial Θ J ( Θ ) material J ( Θ + ϵ ) J ( Θ ϵ ) 2 ϵ \frac{\partial}{\partial\Theta}J(\Theta) \approx \frac{J(\Theta+\epsilon)-J(\Theta-\epsilon)}{2\epsilon}

A small value for ϵ\epsilonϵ such as ϵ=10−4\epsilon=10^{-4}ϵ=10−4, The math works out properly. If the value for ϵ\ Epsilon ϵ, we can end up with numerical problems.

With multiple theta matrices, We can approximate the derivative with respect to θ J \Theta_j θ J (where we get JJJ from 1 to N, for each one) as follows:


partial partial Θ j J ( Θ ) material J ( Θ 1 . . . . . Θ j + ϵ . . . . . Θ n ) J ( Θ 1 . . . . . Θ j ϵ . . . . . Θ n ) 2 ϵ \frac{\partial}{\partial\Theta_j}J(\Theta) \approx \frac{J(\Theta_1,… ,\Theta_j+\epsilon,… ,\Theta_n)-J(\Theta_1,… ,\Theta_j-\epsilon,… ,\Theta_n)}{2\epsilon}

Hence, we are only adding or subtracting epsilon to the θ J \Theta_j θ J matrix.

In octave we can do it as follows:

epsilon = 1e-4;
for i = 1 : n
	thetaPlus = theta;
	thetaPlus(i) += epsilon;
	thetaMinus = theta;
	thetaMinus(i) += epsilon;
	gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*epsilon);
end
Copy the code

We previously saw how to calculate the deltaVector. So once we compute our gradApprox vector, we can check that gradApprox ≈ deltaVector.

Once we have verified once that your backpropagation algorithm is correct, we don’t need to compute gradApprox again. The code to compute gradApprox can be very slow. So we guess it’s better to turn off or disable the codes for gradient checking.

Randoom Initialization

Initializing all theta weights to zero does not work with neural networks. When we backpropagate, All nodes will update to the same value. Instead we can get initialize our weights for our Theta Theta matrices using the following method:

Hence, We Initialize each θ Ij (L)\Theta_{ij}^{(L)} θ Ij (L) to a random value between [−ϵ,ϵ][-\epsilon,\epsilon][−ϵ,ϵ]. Using the Above formula guarantees that we get the desired bound. The same procedure applies to all the θ \Theta Theta ‘s.

Below is some working code implement.

# If the dimensions of Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11.

Theta1 = rand(10.11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10.11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1.11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Copy the code

rand(x,y) is just a function in octave that will initialize a matrix of random real numbers between 0 and 1.

(Note: the epsilon used above is unrelated to the epsilon from Gradient Checking)

Putting It Together

First, pick a network architecture; choose the layout of your neural network, including how many hidden units in each layer and how many layers in total you want to have.

  • Number of input units = dimension of features
    x ( i ) x^{(i)}
  • Number of output units = number of classes
  • Number of hidden units per layer = usually more the better (must balance with cost of computation as it increases with more hidden units)
  • Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.

Training a Neural Network

  1. Randomly initialize the weights
  2. H θ (x^{(I)}) H θ (x(I)) for any x(I)x^{(I)}x(I)
  3. Implement the cost function
  4. Implement backpropagation to compute partial derivatives
  5. Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
  6. Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.

When we perform forward and back propagation, we loop on every training example:

for i = 1:m,
   Perform forward propagation and backpropagation using example (x(i),y(i))
   (Get activations a(l) and delta terms d(l) for l = 2. ,LCopy the code

The following image gives us an intuition of what is happening as we are implementing our neural network:

Ideally, We want H θ (x(I))≈y(I)h_\Theta(x^{(I)})\approx y^{(I)} H θ (x(I))≈y(I). This will minimize our cost function. However, Keep in mind that J(θ)J(\Theta)J(θ) is not convex and thus we can end up in a local minimum instead.