
Want to learn how the optimization algorithms used by TensorFlow or PyTorch work under the hood, how to implement them yourself with Numpy, and how to create beautiful animations using Matplotlib?

This article discusses how to implement different variants of the gradient descent optimization technique and how to visualize the operation of their update rules using Matplotlib.

The content and structure of this article are based on One-Fourth Labs.

Gradient descent is one of the most commonly used techniques for optimizing neural networks. The gradient descent algorithm updates the parameters by moving in the direction opposite to the gradient of the objective function with respect to the network parameters.
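For a generic parameter θ, learning rate η, and objective function L, this update can be written as:

\[ \theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(\theta_t) \]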

Use Numpy in Python

Photo credit: Unsplash, Christopher Gower

The coding section will cover the following topics.

• Sigmoid neuron class

• Overall setting — what is the data, model, task

• Plotting functions — 3D and contour plots

• Individual algorithms and how they are executed

Before you can start implementing gradient descent, you first need to import the required libraries. The Axes3D import from mpl_toolkits.mplot3d provides some basic 3D plotting tools (scatters, surfaces, lines, grids). It is not the fastest or most fully featured 3D library, but it comes with Matplotlib. We also import colors and the colormap module (cm) from Matplotlib. Because we want to animate the plots to illustrate how each optimization algorithm works, we import animation and rc to make the plots look nice. HTML is imported to display the animations inline in the Jupyter Notebook. Finally, we import numpy, which does the heavy lifting for computation.

from mpl_toolkits.mplot3d import Axes3D

import matplotlib.pyplot as plt

from matplotlib import cm

import matplotlib.colors

from matplotlib import animation, rc

from IPython.display import HTML

import numpy as np

Implement Sigmoid neurons

To implement the gradient descent optimization technique, we take the sigmoid neuron (logistic function) as an example and see how the different variants of gradient descent learn the parameters "w" and "b".

Sigmoid neuron recap

The sigmoid neuron is similar to the perceptron in that for each input xᵢ it has an associated weight wᵢ. The weights indicate the importance of the inputs in the decision-making process. The output of the sigmoid neuron differs from the perceptron model in that it is not 0 or 1, but a real value between 0 and 1 that can be interpreted as a probability. The most commonly used sigmoid function is the logistic function, which has the characteristic "S" shape.

Sigmoid neuron representation (logistic function)
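For a single input x with weight w and bias b, the output of the neuron is the logistic function, which is exactly what the sigmoid method in the code below computes:

\[ \sigma(x) = \frac{1}{1 + e^{-(wx + b)}} \]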

Learning algorithm

The goal of the learning algorithm is to determine the best possible values for parameters (w and b) so as to minimize the overall loss (squared error loss) of the model as much as possible.

First, "w" and "b" are initialized randomly. Then, we iterate over all observations in the data. For each observation, the sigmoid function is used to find the corresponding prediction, and the squared error loss is computed. Based on the loss value, the weights are updated so that the overall loss of the model with the new parameters is lower than the current loss of the model.
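The squared error loss being minimized, matching the error function in the code below, is:

\[ L(w, b) = \sum_i \frac{1}{2} \left( \sigma(x_i) - y_i \right)^2 \]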

Sigmoid neuron class

Before we start analyzing the different variants of the gradient descent algorithm, we will build the model in a class named SN.

class SN:

#constructor

def __init__(self, w_init, b_init, algo):

self.w = w_init

self.b = b_init

self.w_h = []

self.b_h = []

self.e_h = []

self.algo = algo

#logistic function

def sigmoid(self, x, w=None, b=None):

if w is None:

w = self.w

if b is None:

b = self.b

return 1. / (1. + np.exp(-(w*x + b)))

#loss function

def error(self, X, Y, w=None, b=None):

if w is None:

w = self.w

if b is None:

b = self.b

err = 0

for x, y in zip(X, Y):

err += 0.5 * (self.sigmoid(x, w, b) - y) ** 2

return err

def grad_w(self, x, y, w=None, b=None):

if w is None:

w = self.w

if b is None:

b = self.b

y_pred = self.sigmoid(x, w, b)

return (y_pred - y) * y_pred * (1 - y_pred) * x

def grad_b(self, x, y, w=None, b=None):

if w is None:

w = self.w

if b is None:

b = self.b

y_pred = self.sigmoid(x, w, b)

return (y_pred - y) * y_pred * (1 - y_pred)

def fit(self, X, Y,

epochs=100, eta=0.01, gamma=0.9, mini_batch_size=100, eps=1e-8,

beta=0.9, beta1=0.9, beta2=0.9

):

self.w_h = []

self.b_h = []

self.e_h = []

self.X = X

self.Y = Y

if self.algo == 'GD':

for i in range(epochs):

dw, db = 0, 0

for x, y in zip(X, Y):

dw += self.grad_w(x, y)

db += self.grad_b(x, y)

self.w -= eta * dw / X.shape[0]

self.b -= eta * db / X.shape[0]

self.append_log()

elif self.algo == 'MiniBatch':

for i in range(epochs):

dw, db = 0, 0

points_seen = 0

for x, y in zip(X, Y):

dw += self.grad_w(x, y)

db += self.grad_b(x, y)

points_seen += 1

if points_seen % mini_batch_size == 0:

self.w -= eta * dw / mini_batch_size

self.b -= eta * db / mini_batch_size

self.append_log()

dw, db = 0, 0

elif self.algo == 'Momentum':

v_w, v_b = 0, 0

for i in range(epochs):

dw, db = 0, 0

for x, y in zip(X, Y):

dw += self.grad_w(x, y)

db += self.grad_b(x, y)

v_w = gamma * v_w + eta * dw

v_b = gamma * v_b + eta * db

self.w = self.w - v_w

self.b = self.b - v_b

self.append_log()

elif self.algo == 'NAG':

v_w, v_b = 0, 0

for i in range(epochs):

dw, db = 0, 0

v_w = gamma * v_w

v_b = gamma * v_b

for x, y in zip(X, Y):

dw += self.grad_w(x, y, self.w - v_w, self.b - v_b)

db += self.grad_b(x, y, self.w - v_w, self.b - v_b)

v_w = v_w + eta * dw

v_b = v_b + eta * db

self.w = self.w - v_w

self.b = self.b - v_b

self.append_log()

#logging

def append_log(self):

self.w_h.append(self.w)

self.b_h.append(self.b)

self.e_h.append(self.error(self.X, self.Y))

#constructor

def __init__(self, w_init, b_init, algo):

self.w = w_init

self.b = b_init

self.w_h = []

self.b_h = []

self.e_h = []

self.algo = algo

The __init__ function (constructor) helps initialize the parameters of the sigmoid neuron, the weight w and the bias b. The function takes three arguments:

• w_init, b_init: these take the initial values of the parameters "w" and "b". Instead of randomly initializing the parameters, we set them to specific values. This lets us visualize how the algorithm performs from different starting points; some algorithms get stuck in local minima for certain starting parameters.

• algo indicates which variant of the gradient descent algorithm to use for finding the best parameters.

In this function, we initialize the parameters and define three new array variables with the suffix "_h", indicating that they are history variables, to track how the values of the weight (w_h), bias (b_h), and error (e_h) change as the sigmoid neuron learns the parameters.

def sigmoid(self, x, w=None, b=None):

if w is None:

w = self.w

if b is None:

b = self.b

return 1. / (1. + np.exp(-(w*x + b)))

Next comes the sigmoid function, which takes a mandatory input x and evaluates the logistic function for that input and the parameters. The function also takes two optional arguments:

• w & b: passing "w" and "b" as arguments makes it possible to calculate the value of the sigmoid function at specific parameter values. If these arguments are not passed, the logistic function is computed with the learned parameter values.
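As a quick illustration of the optional arguments, here is a minimal sketch; the parameter values are arbitrary and assume the SN class defined above is available:

sn = SN(1.0, 0.0, 'GD')

print(sn.sigmoid(0.5)) # uses the stored parameters self.w and self.b

print(sn.sigmoid(0.5, w=-2.0, b=1.0)) # evaluates at explicitly supplied parameter values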

def error(self, X, Y, w=None, b=None):

if w is None:

w = self.w

if b is None:

b = self.b

err = 0

for x, y in zip(X, Y):

err += 0.5 * (self.sigmoid(x, w, b) - y) ** 2

return err

Below is the error function, which takes X and Y as mandatory arguments and the same optional arguments as the sigmoid function. In this function, we iterate over each data point and use the sigmoid function to compute the cumulative squared error between the actual and predicted values. As in the sigmoid function, the optional arguments support computing the error at specified parameter values.

def grad_w(self, x, y, w=None, b=None):

.

def grad_b(self, x, y, w=None, b=None):

.

Next, we define two functions, grad_w and grad_b, which take "x" and "y" as mandatory parameters and compute the gradient of the sigmoid output with respect to the parameters "w" and "b", respectively. Again, two optional parameters allow computing the gradient at specified parameter values.
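The expressions returned by grad_w and grad_b follow from applying the chain rule to the squared error loss, with the prediction ŷ = σ(wx + b):

\[ \frac{\partial L}{\partial w} = (\hat{y} - y)\, \hat{y}\, (1 - \hat{y})\, x, \qquad \frac{\partial L}{\partial b} = (\hat{y} - y)\, \hat{y}\, (1 - \hat{y}) \]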

def fit(self, X, Y, epochs=100, eta=0.01, gamma=0.9, mini_batch_size=100, eps=1e-8, beta=0.9, beta1=0.9, beta2=0.9):

self.w_h = []

.

Next, we define the fit method, which takes the inputs "X" and "Y" along with a series of other parameters. Each of these parameters is explained wherever it is used by a particular variant of the gradient descent algorithm. The function starts by initializing the history variables and storing the inputs in local variables.

Then comes a series of "if-else" statements, one for each algorithm the function supports. Depending on the algorithm chosen, the corresponding variant of gradient descent is executed inside the fit method. Each of these implementations is explained in detail later in this article.

def append_log(self):

self.w_h.append(self.w)

self.b_h.append(self.b)

self.e_h.append(self.error(self.X, self.Y))

Finally, there is the append_log function, which stores the parameter values and the loss function value at every epoch for each variant of gradient descent.

Plotting setup

This section defines the configuration parameters for simulating the gradient descent update rules on a simple toy data set, as well as the functions for creating and animating the 3D and 2D plots that visualize the update rules in action. This setup makes it possible to run different experiments, plotting or animating the update rules for different starting points, different hyperparameter settings, and different gradient descent variants.

#Data

X = np.asarray([3.5, 0.35, 3.2, -2.0, 1.5, -0.5])

Y = np.asarray([0.5, 0.50, 0.5, 0.5, 0.1, 0.3])

#Algo and parameter values

algo = 'GD'

w_init = 2.1

b_init = 4.0

#parameter min and max values- to plot update rule

w_min = -7

w_max = 5

b_min = -7

b_max = 5

#learning algorithm options

epochs = 200

mini_batch_size = 6

gamma = 0.9

eta = 5

#animation number of frames

animation_frames = 20

#plotting options

plot_2d = True

plot_3d = False

First, we take a simple toy data set of inputs X and outputs Y. In line 5, we define a string variable algo that holds the type of algorithm to execute. In lines 6-7, we initialize the parameters "w" and "b" to indicate where the algorithm starts.

In lines 9-12, we set the limits of the parameters, i.e., the range within which the sigmoid neuron searches for the best parameters. These numbers are carefully chosen to illustrate the working of the gradient descent update rules. Next, we set the values of the hyperparameters; some of these variables are specific to certain algorithms and are discussed along with the algorithm implementations. Finally, in lines 19-22, we declare the variables needed for animating or plotting the update rules.

sn = SN(w_init, b_init, algo)

sn.fit(X, Y, epochs=epochs, eta=eta, gamma=gamma, mini_batch_size=mini_batch_size)

plt.plot(sn.e_h, 'r')

plt.plot(sn.w_h, 'b')

plt.plot(sn.b_h, 'g')

plt.legend(('error', 'weight', 'bias'))

plt.title("Variation of Parameters and loss function")

plt.xlabel("Epoch")

plt.show()

After setting the configuration parameters, we instantiate the SN class and invoke the fit method with those parameters. We also plot the three history variables to visualize how the parameter values and the loss function vary across epochs.

3D and 2D plotting setup

if plot_3d:

W = np.linspace(w_min, w_max, 256)

b = np.linspace(b_min, b_max, 256)

WW, BB = np.meshgrid(W, b)

Z = sn.error(X, Y, WW, BB)

fig = plt.figure(dpi=100)

ax = fig.add_subplot(projection='3d')  # axes handle for 3D plotting (fig.gca(projection='3d') in older Matplotlib)

surf = ax.plot_surface(WW, BB, Z, rstride=3, cstride=3, alpha=0.5, cmap=cm.coolwarm, linewidth=0, antialiased=False)

cset = ax.contourf(WW, BB, Z, 25, zdir='z', offset=-1, alpha=0.6, cmap=cm.coolwarm)

ax.set_xlabel('w')

ax.set_xlim(w_min - 1, w_max + 1)

ax.set_ylabel('b')

ax.set_ylim(b_min - 1, b_max + 1)

ax.set_zlabel('error')

ax.set_zlim(-1, np.max(Z))

ax.view_init(elev=25, azim=-75)  # azim = -20

ax.dist=12

title = ax.set_title('Epoch 0')

To create the 3D plot, we first create 256 equally spaced values between the minimum and maximum of "w" and "b" to build the grid, as shown in lines 2-5. Using the grid, we compute the error for these values by calling the error function of the sigmoid neuron class SN. In line 8, we create the axes handle for the 3D plot.

To draw the surface, we use ax.plot_surface to plot the error against the weight and bias, with rstride and cstride specifying how often the data points are sampled. Next, using ax.contourf, the error contours with respect to the weight and bias are drawn beneath the surface by specifying the offset of the contour plane in the "z" direction (lines 9-10). In lines 11-16, we label each axis and set the axis limits for all three dimensions. Since this is a 3D plot, we also need to define a viewpoint: in lines 17-18, we set a viewpoint with an elevation of 25 degrees and a distance of 12 units.

def plot_animate_3d(i):

i = int(i*(epochs/animation_frames))

line1.set_data(sn.w_h[:i+1], sn.b_h[:i+1])

line1.set_3d_properties(sn.e_h[:i+1])

line2.set_data(sn.w_h[:i+1], sn.b_h[:i+1])

line2.set_3d_properties(np.zeros(i+1) - 1)

title.set_text('Epoch: {:d}, Error: {:.4f}'.format(i, sn.e_h[i]))

return line1, line2, title

if plot_3d:

#animation plots of gradient descent

i = 0

line1, = ax.plot(sn.w_h[:i+1], sn.b_h[:i+1], sn.e_h[:i+1], color='black', marker='.')

line2, = ax.plot(sn.w_h[:i+1], sn.b_h[:i+1], np.zeros(i+1) - 1, color='red', marker='.')

anim = animation.FuncAnimation(fig, func=plot_animate_3d, frames=animation_frames)

rc('animation', html='jshtml')

anim

Building on the static 3D plot, we want to visualize the algorithm in action, as captured by the history variables for the parameters and the error function at every epoch. To create the animation of the gradient descent algorithm, we use animation.FuncAnimation, passing the custom function plot_animate_3d as one of its arguments and specifying the number of frames. The plot_animate_3d function updates the plotted parameter and error values for the corresponding values of "w" and "b". In line 7 of the same function, the title text is set to show the error value at that particular epoch. Finally, to display the animation inline, the rc function is called to render the HTML content in the Jupyter notebook.

Similar to the 3D plot, we can create a function to draw a 2D contour map.

if plot_2d:

W = np.linspace(w_min, w_max, 256)

b = np.linspace(b_min, b_max, 256)

WW, BB = np.meshgrid(W, b)

Z = sn.error(X, Y, WW, BB)

fig = plt.figure(dpi=100)

ax = plt.subplot(111)

ax.set_xlabel('w')

ax.set_xlim(w_min - 1, w_max + 1)

ax.set_ylabel('b')

ax.set_ylim(b_min - 1, b_max + 1)

title = ax.set_title('Epoch 0')

cset = plt.contourf(WW, BB, Z, 25, alpha=0.8, cmap=cm.bwr)

plt.savefig("temp.jpg", dpi=2000)

plt.show()

def plot_animate_2d(i):

i = int(i*(epochs/animation_frames))

line.set_data(sn.w_h[:i+1], sn.b_h[:i+1])

title.set_text('Epoch: {:d}, Error: {:.4f}'.format(i, sn.e_h[i]))

return line, title

if plot_2d:

i = 0

line, = ax.plot(sn.w_h[:i+1], sn.b_h[:i+1], color='black', marker='.')

anim = animation.FuncAnimation(fig, func=plot_animate_2d, frames=animation_frames)

rc('animation', html='jshtml')

anim

Implementation of the algorithms

This section implements the different variants of the gradient descent algorithm and generates 3D and 2D animated plots for each.

Vanilla gradient descent

The gradient descent algorithm updates the parameters by moving in the direction opposite to the gradient of the objective function with respect to the network parameters.

The parameter update rule is given by the following formula, where η is the learning rate:

\[ w_{t+1} = w_t - \eta \, \frac{\partial L}{\partial w}, \qquad b_{t+1} = b_t - \eta \, \frac{\partial L}{\partial b} \]

for i in range(epochs):

dw, db = 0, 0

for x, y in zip(X, Y):

dw += self.grad_w(x, y)

db += self.grad_b(x, y)

self.w -= eta * dw / X.shape[0]

self.b -= eta * db / X.shape[0]

self.append_log()

In batch gradient descent, we iterate over all training data points and compute the cumulative sum of the gradients of the parameters "w" and "b". Then, the parameter values are updated using the cumulative gradient value and the learning rate.

To execute the gradient descent algorithm, change the configuration settings as shown below.

X = np.asarray([0.5, 2.5])

Y = np.asarray([0.2, 0.9])

algo = 'GD'

w_init = -2

b_init = -2

w_min = -7

w_max = 5

b_min = -7

b_max = 5

epochs = 1000

eta = 1

animation_frames = 20

plot_2d = True

plot_3d = True

In the configuration settings, the variable algo is set to 'GD', indicating that vanilla gradient descent should be executed in the sigmoid neuron to find the best parameter values. After the configuration parameters are set, we call the fit method of the SN class to train the sigmoid neuron on the toy data.

The history of gradient descent

The figure above shows how the history values of the error, weights, and biases vary across epochs while the algorithm learns the best parameters. The key point to note is that the error hovers around 0.5 in the initial epochs, but after about 200 epochs it is almost zero.

To draw 3D or 2D animations, set the boolean variables plot_2d and plot_3d. The 3D plot shows what the error surface looks like over the corresponding values of "w" and "b". The goal of the learning algorithm is to move toward the dark blue region, where the error/loss is minimal.

To visualize the algorithm executing dynamically, we use the function plot_animate_3d to generate the animation. While the animation plays, you can see the epoch number and the corresponding error value.

If you want to slow the animation down, click the minus symbol in the video controls, as shown in the animation above. Similarly, you can animate the 2D contour map to see how the algorithm moves toward the global minimum.

Momentum-based gradient descent

In momentum GD, we move with an exponentially decaying cumulative average of the previous gradients in addition to the current gradient.
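In equation form, with γ the momentum term and η the learning rate (shown for "w"; "b" is analogous), the update is:

\[ v_t = \gamma v_{t-1} + \eta \, \nabla_w L_t, \qquad w_{t+1} = w_t - v_t \]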

Momentum GD has the following code:

v_w, v_b = 0, 0

for i in range(epochs):

dw, db = 0, 0

for x, y in zip(X, Y):

dw += self.grad_w(x, y)

db += self.grad_b(x, y)

v_w = gamma * v_w + eta * dw

v_b = gamma * v_b + eta * db

self.w = self.w - v_w

self.b = self.b - v_b

self.append_log()

In momentum GD, history variables are maintained to keep track of the previous gradient values. The variable gamma indicates how much momentum to apply. The variables v_w and v_b are used to compute the step based on both the history and the current gradient. At the end of each epoch, the append_log function is called to store the parameter values and the loss function value.

To execute Momentum GD for the Sigmoid neuron, modify the configuration Settings as follows:

X = np.asarray([0.5, 2.5])

Y = np.asarray([0.2, 0.9])

algo = 'Momentum'

w_init = -2

b_init = -2

w_min = -7

w_max = 5

b_min = -7

b_max = 5

epochs = 1000

mini_batch_size = 6

gamma = 0.9

eta = 1

animation_frames = 20

plot_2d = True

plot_3d = True

The variable algo is set to 'Momentum' to indicate that we want to use momentum GD to find the best parameters for the sigmoid neuron. Another important change is the gamma variable, which controls how much momentum the learning algorithm uses; gamma ranges from 0 to 1. After the configuration parameters are set, we call the fit method of the SN class to train the sigmoid neuron on the toy data.

Momentum GD changes

As can be seen from the figure, the momentum accumulated from the history causes the algorithm to oscillate in and out of the minimum, and the values of the weight and bias terms show similar fluctuations.

Nesterov accelerated gradient descent

In Nesterov accelerated gradient descent, we want to know whether we are close to the minimum before taking the next step based on the current gradient value, so that we can avoid overshooting.
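In equation form (shown for "w"), the gradient is evaluated at a look-ahead point rather than at the current parameters:

\[ w_{\text{look}} = w_t - \gamma v_{t-1}, \qquad v_t = \gamma v_{t-1} + \eta \, \nabla_w L(w_{\text{look}}), \qquad w_{t+1} = w_t - v_t \]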

NAG GD has the following code:

v_w, v_b = 0, 0

for i in range(epochs):

dw, db = 0, 0

v_w = gamma * v_w

v_b = gamma * v_b

for x, y in zip(X, Y):

dw += self.grad_w(x, y, self.w - v_w, self.b - v_b)

db += self.grad_b(x, y, self.w - v_w, self.b - v_b)

v_w = v_w + eta * dw

v_b = v_b + eta * db

self.w = self.w - v_w

self.b = self.b - v_b

self.append_log()

The main change in the NAG GD code is the computation of v_w and v_b. In momentum GD these variables are computed in one step, but in NAG they are computed in two steps.

v_w = gamma * v_w

v_b = gamma * v_b

for x, y in zip(X, Y):

dw += self.grad_w(x, y, self.w - v_w, self.b - v_b)

db += self.grad_b(x, y, self.w - v_w, self.b - v_b)

v_w = v_w + eta * dw

v_b = v_b + eta * db

In the first part, before iterating over the data, we multiply the history variables by gamma; we then compute the gradients at the look-ahead values self.w - v_w and self.b - v_b. To execute NAG GD, just set the algo variable to 'NAG'. You can then generate 3D or 2D animations to see how NAG GD differs from momentum GD in reaching the global minimum.

Mini-batch and Stochastic gradient descent

Instead of looking at all the data points at once, we divide the entire data into multiple subsets. For each subset, we compute the derivatives for each point in the subset and update the parameters. That is, instead of computing the derivative of the loss over the entire data, we approximate it using fewer points, the mini-batch. This method of computing the gradients in batches is called mini-batch gradient descent.

The code for Mini-Batch GD is as follows:

for i in range(epochs):

dw, db = 0, 0

points_seen = 0

for x, y in zip(X, Y):

dw += self.grad_w(x, y)

db += self.grad_b(x, y)

points_seen += 1

if points_seen % mini_batch_size == 0:

self.w -= eta * dw / mini_batch_size

self.b -= eta * db / mini_batch_size

self.append_log()

dw, db = 0, 0

In mini-batch GD, we iterate over the entire data, using the variable points_seen to keep track of the number of points seen. Whenever the number of points seen is a multiple of the mini-batch size, we update the parameters of the sigmoid neuron. In the special case where the mini-batch size equals 1, the algorithm becomes stochastic gradient descent. To execute mini-batch GD, just set the algo variable to 'MiniBatch'. 3D or 2D animations can be generated to see how mini-batch GD differs from momentum GD in reaching the global minimum.
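As a usage sketch with the SN class defined earlier (hyperparameter values carried over from the configuration above), setting mini_batch_size to 1 gives stochastic gradient descent, while setting it to the size of the data recovers batch GD:

sn = SN(w_init, b_init, 'MiniBatch')

sn.fit(X, Y, epochs=epochs, eta=eta, mini_batch_size=1) # stochastic GD: update after every point

sn = SN(w_init, b_init, 'MiniBatch')

sn.fit(X, Y, epochs=epochs, eta=eta, mini_batch_size=X.shape[0]) # batch GD: one update per epoch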

AdaGrad gradient descent

The main motivation behind AdaGrad is to adapt the learning rate to the different features of the data set; that is, instead of using the same learning rate for all features, it uses a different effective learning rate for each.
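In equation form (shown for "w", with ε a small constant for numerical stability), AdaGrad accumulates the squared gradients and divides the learning rate by their square root:

\[ v_t = v_{t-1} + (\nabla_w L_t)^2, \qquad w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \, \nabla_w L_t \]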

The code for Adagrad looks like this:

v_w, v_b = 0, 0

for i in range(epochs):

dw, db = 0, 0

for x, y in zip(X, Y):

dw += self.grad_w(x, y)

db += self.grad_b(x, y)

v_w += dw**2

v_b += db**2

self.w -= (eta / (np.sqrt(v_w) + eps)) * dw

self.b -= (eta / (np.sqrt(v_b) + eps)) * db

self.append_log()

In AdaGrad, we maintain the running sum of squared gradients and then update the parameters by dividing the learning rate by the square root of this history. Instead of a static learning rate, each parameter thus gets a dynamic rate: parameters associated with dense features receive smaller effective updates, and parameters associated with sparse features receive larger ones. The mechanics of generating the plots/animations are the same as above; the idea here is to try different toy data sets and different hyperparameter configurations.

RMSProp gradient descent

In RMSProp, unlike the plain sum of squared gradients in AdaGrad, the gradient history is computed as an exponentially decaying average, which helps prevent the denominator from growing too quickly for dense features.
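In equation form (shown for "w", with β controlling the decay of the history):

\[ v_t = \beta v_{t-1} + (1 - \beta)(\nabla_w L_t)^2, \qquad w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \, \nabla_w L_t \]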

The code for RMSProp looks like this:

v_w, v_b = 0, 0

for i in range(epochs):

dw, db = 0, 0

for x, y in zip(X, Y):

dw += self.grad_w(x, y)

db += self.grad_b(x, y)

v_w = beta * v_w + (1 - beta) * dw**2

v_b = beta * v_b + (1 - beta) * db**2

self.w -= (eta / (np.sqrt(v_w) + eps)) * dw

self.b -= (eta / (np.sqrt(v_b) + eps)) * db

self.append_log()

The only change from the AdaGrad code is the way the variables v_w and v_b are updated. In AdaGrad, v_w and v_b always grow from the first epoch onward by the square of the gradient of each parameter, whereas in RMSProp they are exponentially decaying weighted sums of the squared gradients, controlled by the hyperparameter beta. To execute RMSProp GD, just set the algo variable to 'RMSProp'. 3D or 2D animations can be generated to see how RMSProp GD differs from AdaGrad GD in reaching the global minimum.

Adam gradient descent

Adam maintains two histories: "mₜ", similar to the history used in momentum GD, and "vₜ", similar to the history used in RMSProp.
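In equation form, the two histories accumulate as:

\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, \nabla_w L_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla_w L_t)^2 \]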

In operation, Adam also performs bias correction, using the following equations for "mₜ" and "vₜ":

\[ \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \]

Bias correction ensures that there is no strange behavior at the beginning of training. The point of Adam is that it combines the strengths of momentum GD (moving faster through gentle regions) and RMSProp GD (adapting the learning rate).

The code for Adam GD looks like this:

v_w, v_b = 0, 0

m_w, m_b = 0, 0

num_updates = 0

for i in range(epochs):

dw, db = 0, 0

for x, y in zip(X, Y):

dw = self.grad_w(x, y)

db = self.grad_b(x, y)

num_updates += 1

m_w = beta1 * m_w + (1-beta1) * dw

m_b = beta1 * m_b + (1-beta1) * db

v_w = beta2 * v_w + (1-beta2) * dw**2

v_b = beta2 * v_b + (1-beta2) * db**2

m_w_c = m_w / (1 - np.power(beta1, num_updates))

m_b_c = m_b / (1 - np.power(beta1, num_updates))

v_w_c = v_w / (1 - np.power(beta2, num_updates))

v_b_c = v_b / (1 - np.power(beta2, num_updates))

self.w -= (eta / (np.sqrt(v_w_c) + eps)) * m_w_c

self.b -= (eta / (np.sqrt(v_b_c) + eps)) * m_b_c

self.append_log()

In the Adam optimizer, we compute m_w and m_b to keep track of the momentum history, and v_w and v_b to decay the denominator and prevent it from growing rapidly, just as in RMSProp.

After that, bias correction is applied to the momentum-based and RMSProp-based history variables. Once the corrected values for the parameters "w" and "b" have been computed, they are used to update the parameter values.

To execute Adam gradient descent, change the configuration settings as shown below.

X = np.asarray([3.5, 0.35, 3.2, -2.0, 1.5, -0.5])

Y = np.asarray([0.5, 0.50, 0.5, 0.5, 0.1, 0.3])

algo = 'Adam'

w_init = -6

b_init = 4.0

w_min = -7

w_max = 5

b_min = -7

b_max = 5

epochs = 200

gamma = 0.9

eta = 0.5

eps = 1e-8

animation_frames = 20

plot_2d = True

plot_3d = False

The variable algo is set to 'Adam', indicating that Adam GD should be used to find the best parameters for the sigmoid neuron. Another important change involves the hyperparameters beta1 and beta2 (left here at their defaults of 0.9 from the fit signature), which control the momentum history and the squared-gradient history; both range from 0 to 1. After the configuration parameters are set, we call the fit method of the SN class to train the sigmoid neuron on the toy data.

Parameter changes in Adam GD

Create a 2D contour animation that shows how Adam GD learns the path to a global minimum.

Adam GD animation

Unlike in the RMSProp case, there is not much oscillation; especially after the first few epochs, the path moves more decisively toward the minimum.

This concludes our discussion of how to use Numpy to implement optimization techniques.

Further practice

This article presented the different variants on a toy data set with a fixed initialization point, but you can use different initialization points and, for each of them, different algorithms to see what adjustments are needed in the hyperparameters. All of the code discussed in this article is available in the GitHub repository below. Feel free to fork or download it. Best of all, you can run the code directly in Google Colab without having to worry about installing packages.

https://github.com/Niranjankumar-c/GradientDescent_Implementation

In summary, this article described how to implement different variants of the gradient descent algorithm using a simple sigmoid neuron. You also saw how to create 3D and 2D animations for each variant, showing how the learning algorithm finds the best parameters.
