First, a brief introduction to artificial neural networks (ANNs).

Many machine learning algorithms are inspired by nature, but the biggest inspiration comes from our brains: how we think, learn, and make decisions.

Interestingly, when we touch something hot, neurons in our body send signals to our brain, which then triggers a response that tells us to pull away from the hot area. We learn from experience, and based on that experience, we start to make better decisions.

Using the same analogy, when we send an input to the neural network (touching the hot material), then based on learning (prior experience) it produces an output (pulling away from the hot area). In the future, when it receives a similar signal (contact with a hot surface), it can predict the output (pull away from the hot area).

Suppose we input information such as temperature, wind speed, visibility, humidity and so on to predict future weather conditions – rainy, cloudy or sunny.

This can be expressed as follows.

Let’s represent it in terms of a neural network and understand the components of a neural network.

The neural network receives the input, transforms the input signal by changing the state using the activation function, and then generates the output.

The output will vary depending on the input received, the strength of the signal (represented by the weights), and the activation function applied to the input parameters and weights.

Artificial neurons are very similar to the neurons in our nervous system.

X1, X2, … Xn are the input signals received by the neuron's dendrites; the state changes at the end of the neuron's axon, producing the outputs Y1, Y2, … Yn.

In the case of weather forecasting, for example, temperature, wind speed, visibility, and humidity are the input parameters. The neuron processes them by applying weights to the inputs and an activation function, and produces the output, which here is the predicted type of day: sunny, rainy, or cloudy.

So what are the components of a neural network?

A neural network will have:

  • An input layer with a bias unit
  • One or more hidden layers, each with a bias unit
  • An output layer
  • Weights associated with each connection
  • An activation function that converts a node’s input signal into an output signal

The input, hidden, and output layers together are often referred to as fully connected layers.
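As a minimal sketch of how these components fit together (the four inputs and three outputs follow the weather example above; the hidden-layer size of five is an illustrative assumption):

```python
import numpy as np

# Illustrative sizes: 4 inputs (temperature, wind speed, visibility, humidity),
# one hidden layer with 5 nodes, and 3 outputs (sunny, rainy, cloudy).
n_input, n_hidden, n_output = 4, 5, 3

rng = np.random.default_rng(0)

# Weights for every connection, plus a bias unit for each non-input layer.
W1 = rng.normal(0.0, 0.01, size=(n_hidden, n_input))   # input  -> hidden
b1 = np.zeros(n_hidden)                                 # hidden-layer bias
W2 = rng.normal(0.0, 0.01, size=(n_output, n_hidden))   # hidden -> output
b2 = np.zeros(n_output)                                 # output-layer bias
```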

What are these weights, what are the activation functions, what are these equations?

So let’s simplify this.

Weights are how neural networks learn, and we adjust the weights to determine the strength of the signal.

Weights help us get different outputs.

For example, to predict a sunny day, the temperature may range from pleasant to hot and the visibility will be very good, so the weights for temperature and visibility will be higher.

Humidity will not be too high on a sunny day (otherwise it would rain), so the weight for humidity may be lower, or even negative.

Wind speed may be unrelated to whether a day is sunny, so its weight may be zero or very small.

We randomly initialize the weights (w), multiply them by the inputs (x), and add the bias term (b); for the hidden layer we compute z and then apply the activation function (φ).

We call this forward propagation. The equation can be expressed as follows, where $l$ is the layer index and $l = 1$ is the input layer.
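A standard way of writing this step (the bracketed layer superscripts are a notation assumption, with $a^{[l]}$ the activations of layer $l$ and $a^{[1]} = x$ for the input layer) is:

$$
z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}, \qquad a^{[l]} = \phi\left(z^{[l]}\right), \qquad l \ge 2.
$$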

Speaking of activation functions, let’s see what they do.

The activation function helps us decide whether a neuron should be activated and, if so, how strong its signal should be.

The activation function is the mechanism by which neurons process and transmit information through the neural network.
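As a minimal sketch, two commonly used activation functions look like this (sigmoid is included only for contrast; the worked example later in this article uses ReLU):

```python
import numpy as np

def sigmoid(z):
    # Squashes any input into the range (0, 1), so the output can be read
    # as the intensity (or probability) of the signal.
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Passes positive signals through unchanged and outputs 0 for negative
    # ones, effectively deciding whether the neuron "fires".
    return np.maximum(0.0, z)
```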

Let’s use sample data for weather prediction to understand neural networks.

To make things easier to understand, we will simplify: we use only two inputs, temperature and visibility, with two hidden nodes and no bias. For the output, we still want to classify the weather as sunny or not sunny.

The temperature is in degrees Fahrenheit and the visibility is in miles.

Let’s look at a temperature of 50 degrees Fahrenheit and visibility of 0.01 miles.

Step 1: We randomly initialize the weights to values close to, but not equal to, 0.

Step 2: Next, we take our single data point, with temperature and visibility as the input nodes, and feed it into the neural network.

Step 3: Propagate forward from left to right, multiply the weights by the input values, and then use ReLU as the activation function. ReLU is currently the most commonly used activation function in fully connected networks.
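A minimal sketch of steps 1 to 3 for this single data point, assuming small random weights and a sigmoid on the output node so that ŷ can be read as the probability of a sunny day (the concrete weight values and the sigmoid output are illustrative assumptions, not values from the article):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The single data point: temperature 50 F, visibility 0.01 miles.
x = np.array([50.0, 0.01])

# Step 1: randomly initialize the weights close to, but not equal to, 0
# (two inputs -> two hidden nodes -> one output node, no bias terms).
rng = np.random.default_rng(42)
W1 = rng.normal(0.0, 0.01, size=(2, 2))
W2 = rng.normal(0.0, 0.01, size=(1, 2))

# Steps 2-3: feed the data point in and propagate forward, left to right.
z1 = W1 @ x          # weighted sum at the hidden nodes
a1 = relu(z1)        # ReLU activation in the hidden layer
z2 = W2 @ a1         # weighted sum at the output node
y_hat = sigmoid(z2)  # predicted probability that the day is sunny
print(y_hat)
```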

Step 4: Now we predict the output and compare the predicted output to the actual output value. Since this is a classification problem, we use the cross-entropy cost function.

Cross entropy is a non-negative cost function, and the predicted output ŷ lies between 0 and 1. In our case, the actual output is not sunny, so the y value is 0. If ŷ is 1, let’s plug the values into the cost function and see what we get.
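For this binary case, the cross-entropy cost is usually written as:

$$
C = -\left[\, y \ln \hat{y} + (1 - y)\ln(1 - \hat{y}) \,\right]
$$

With $y = 0$ and $\hat{y} = 1$, this gives $C = -\ln(1 - 1) = -\ln(0)$, which tends to infinity.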

Similarly, when the actual output is the same as the predicted output, we get the cost c=0.

We can see that for the cross-entropy function, when the predicted output matches the actual output, the cost is zero; when the predicted output is completely wrong, the cost tends to infinity.

Step 5: Propagate back from right to left and adjust the weights. The weights are adjusted according to how responsible they are for errors, and the learning rate determines how much we update the weights.

Back propagation, learning rate: we’ll explain everything in simple terms.

Back propagation

Think of back propagation as the feedback we sometimes get from parents, mentors, and peers, feedback that helps us become better.

Back propagation is a fast learning algorithm that tells us how the cost function changes when we change the weights and biases, which in turn changes the behavior of the neural network.

We won’t delve into the detailed mathematics of back propagation here. In back propagation, we calculate the partial derivatives of the cost with respect to each weight and each bias for every training instance, and then average these partial derivatives over all training samples.
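In symbols, for $m$ training examples the averaged partial derivatives used to adjust a weight $w$ and a bias $b$ are:

$$
\frac{\partial C}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\frac{\partial C_i}{\partial w},
\qquad
\frac{\partial C}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\frac{\partial C_i}{\partial b},
$$

where $C_i$ is the cost on the $i$-th training example.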

For each individual data point, we determine how much each weight and bias contributes to the error, and based on that contribution, we adjust all the weights simultaneously.

With the batch gradient descent (GD) algorithm, the weights are updated once after all the training data has been processed. With the stochastic gradient descent (SGD) algorithm, the weights are updated once per training example.
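A minimal sketch of the difference in update frequency, using a deliberately tiny one-weight model so the code stays self-contained (the data, learning rate, and squared-error cost here are assumptions for illustration, not part of the weather example):

```python
import numpy as np

# Toy data: y = 2 * x, so the ideal weight is 2.0.
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([2.0, 4.0, 6.0, 8.0])
learning_rate, num_epochs = 0.01, 100

def gradient(w, x, y):
    # Derivative of the squared-error cost 0.5 * (w * x - y)^2 with respect to w.
    return (w * x - y) * x

# Batch gradient descent: one weight update per epoch, using the gradient
# averaged over all training examples.
w_gd = 0.0
for _ in range(num_epochs):
    w_gd -= learning_rate * np.mean(gradient(w_gd, X, Y))

# Stochastic gradient descent: one weight update per training example.
w_sgd = 0.0
for _ in range(num_epochs):
    for x, y in zip(X, Y):
        w_sgd -= learning_rate * gradient(w_sgd, x, y)

print(w_gd, w_sgd)  # both approach the ideal weight 2.0
```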

With the updated weights, we repeat steps 1 through 5 using GD or SGD.

As weights are adjusted, certain nodes are turned on or off based on the activation function.

In our weather example, temperature is only weakly correlated with predicting a cloudy day: it can be cloudy in summer with temperatures above 70 degrees, and it can also be cloudy in winter with temperatures of 30 degrees or lower. In this case, the activation function may decide to shut down the hidden node responsible for temperature and keep only the visibility node open in order to predict that the output is not sunny, as shown in the figure below.

An epoch refers to one learning pass over the complete data set: one forward propagation and one back propagation for every training example.

We can repeat this, that is, propagate forward and backward over multiple epochs, until we converge to a global minimum.

What is learning rate?

The learning rate controls the extent to which we should adjust the weights according to the loss gradient.
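The update scaled by the learning rate $\eta$ can be written as:

$$
w \leftarrow w - \eta \, \frac{\partial C}{\partial w}
$$

For example, if $\frac{\partial C}{\partial w} = 2.0$, a learning rate of $0.1$ moves the weight by $0.2$, while a learning rate of $0.001$ moves it by only $0.002$.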

The lower the value, the slower the learning and the slower the convergence to the global minimum.

If the learning rate is too high, gradient descent may overshoot the minimum and fail to converge.

The learning rate is usually initialized to a small value and then tuned by experiment.

How do I determine the number of hidden layers and the number of nodes per hidden layer?

As the number of hidden layers and the number of neurons (nodes) per hidden layer increase, the capacity of the neural network also increases. The neurons can cooperate to express more complex functions, which often leads to over-fitting, so we must be careful to avoid it.

For the optimal number of hidden layers in a neural network, the following table is proposed by Jeff Heaton.
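That table is commonly summarized along the following lines:

  • 0 hidden layers: only capable of representing linearly separable functions or decisions.
  • 1 hidden layer: can approximate any function that contains a continuous mapping from one finite space to another.
  • 2 hidden layers: can represent an arbitrary decision boundary to arbitrary accuracy and can approximate any smooth mapping to any accuracy.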

For the optimal number of neurons in the hidden layer, we can use the following rules of thumb (a quick worked example follows the list).

  • Average number of neurons in input layer and output layer.
  • Between the size of the input layer and the size of the output layer.
  • 2/3 the size of the input layer, plus the size of the output layer.
  • Less than twice the size of the input layer.
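For instance, with the four weather inputs and three output classes used earlier, these rules of thumb suggest roughly (4 + 3) / 2 ≈ 3 or 4 neurons, a value between 3 and 4, 2/3 × 4 + 3 ≈ 6, and fewer than 2 × 4 = 8.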