Author: Vivek Patel | Translated by: Flin | Source: towardsdatascience
Don’t reinvent the wheel unless you can learn something.
Powerful libraries already exist, such as TensorFlow, PyTorch, Keras, and so on. I’ll cover the basics of creating a multilayer perceptron (MLP) neural network in Python.
The perceptron is the basic building block of a neural network. The perceptron's input function is a linear combination of weights, biases, and input data: in_j = weight · input + bias. On each perceptron, we can also specify an activation function g.
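As a minimal NumPy sketch (the variable names and shapes here are illustrative, not taken from the article's code), the input function of a small layer of perceptrons is just a matrix product plus a bias vector, followed by the activation g:

```python
import numpy as np

# Illustrative shapes: 3 inputs feeding 2 perceptrons.
x = np.array([[0.5], [0.1], [0.9]])   # input column vector, shape [3, 1]
W = np.random.randn(2, 3)             # weight matrix, shape [2, 3]
b = np.zeros((2, 1))                  # bias vector, shape [2, 1]

in_j = W @ x + b                      # linear combination: weights * input + bias
a_j = 1.0 / (1.0 + np.exp(-in_j))     # activation g (sigmoid here)
```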
An activation function is a mathematical way of ensuring that a perceptron "fires", or activates, only after a certain level of input is reached. Common nonlinear activation functions are sigmoid, softmax, the rectified linear unit (ReLU), or simply tanh.
There are many options for the activation function, but in this article we will only cover sigmoid and softmax.
Figure 1: Perceptron
For supervised learning, we then pass the input data forward through a series of hidden layers to the output layer. This is called forward propagation. At the output layer, we can read off the prediction y*. From our prediction y*, we can calculate the error ||y − y*|| and propagate that error backwards through the neural network. This is called backpropagation. The weights and biases of each perceptron in the hidden layers are then updated through stochastic gradient descent (SGD).
Figure 2: Basic structure of neural network
Now that we’ve covered the basics, let’s implement a neural network. The goal of our neural network is to classify handwritten numbers in the MNIST database. I will use the NumPy library for basic matrix calculations.
In our problem, each MNIST image is represented as a single 8-bit channel in a [784,1] matrix. In essence, we have a [784,1] matrix of values in the range [0, 1, …, 255], where 0 represents white and 255 represents black.
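For illustration only (the article's data-loading code is not shown here), a 28×28 MNIST image can be flattened into that [784, 1] column vector like this:

```python
import numpy as np

image = np.random.randint(0, 256, size=(28, 28))   # stand-in for one 28x28 MNIST image
x = image.reshape(784, 1).astype(np.float64)       # flatten to a [784, 1] column vector
```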
Results
The MNIST handwritten digits database contains 60,000 examples for training and 10,000 samples for testing. After training for 30 epochs on the 60,000 training examples, I ran the trained neural network on the test dataset and achieved 93.2% accuracy. It can be optimized even further by tweaking the hyperparameters.
How does it work?
This article is divided into five parts:

1. Activation functions
2. Weight initialization
3. Bias initialization
4. Training algorithm
5. Prediction
1. Activation functions
The sigmoid activation function, defined as 1 / (1 + exp(-x)), will be used in the hidden-layer perceptrons.
Softmax is an activation function typically used in the output layer when we want to classify inputs. In our example, we want to classify a digit into one of 10 buckets [0, 1, 2, …, 9]. It computes a probability for each entry in the matrix, and the probabilities sum to 1. The entry with the highest probability corresponds to the prediction, i.e. 0, 1, …, 9. Softmax is defined as exp(x) / sum(exp(x)).
Figure 3: Implementation of the activation function
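Figure 3 appears as an image in the original article. A minimal sketch along the same lines, using the definitions above, might look like this:

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: 1 / (1 + exp(-x)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Softmax activation: exp(x) / sum(exp(x)).
    Shifting by max(x) keeps exp() numerically stable without changing the result."""
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)
```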
2. Initialize the weights
For each of our hidden layers, we will need to initialize a weight matrix. There are a couple of different ways to do this:
- Zero initialization – initialize all weights to 0.
- Random initialization – initialize the weights with random numbers, though not completely at random: we typically use random numbers drawn from the standard normal distribution (mean 0 and variance 1).
- Xavier initialization – initialize the weights with random numbers drawn from a normal distribution with a set variance. We set the variance based on the size of the previous layer.
As mentioned above, the edge into the perceptron is multiplied by the weight matrix. The key point is that the size of the matrix depends on the size of the current layer and the layer before it. Specifically, the weight matrix is of size [currentLayerSize, previousLayerSize].
Suppose we have a hidden layer with 100 nodes. The size of our input layer is [784,1], and the size of our desired output layer is [10,1]. The weight matrix between the input layer and the first hidden layer is therefore [100,784]. Each weight matrix between hidden layers is [100,100]. Finally, the weight matrix between the last hidden layer and the output layer is [10,100].
For educational purposes, we will stick with a single hidden layer; in the final model, we will use multiple layers.
Figure 4: Weight initialization implementation
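Figure 4 is also an image in the original. A sketch of Xavier-style weight initialization for the layer sizes discussed above (784 inputs, a hidden layer of 100 nodes, 10 outputs) could look like this:

```python
import numpy as np

layer_sizes = [784, 100, 10]   # input, hidden, output

# One weight matrix per pair of consecutive layers: [currentLayerSize, previousLayerSize].
# Xavier initialization scales the variance by the size of the previous layer.
weights = [
    np.random.randn(curr, prev) * np.sqrt(1.0 / prev)
    for prev, curr in zip(layer_sizes[:-1], layer_sizes[1:])
]
# weights[0].shape == (100, 784), weights[1].shape == (10, 100)
```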
3. Bias initialization
Like weight initialization, the size of the bias matrix depends on the layer size, specifically the size of the current layer. One method of bias initialization is to set the biases to zero.
For our implementation, we will need to provide a bias for each hidden layer and the output layer. The size of each hidden-layer bias matrix is [100,1], based on 100 nodes per hidden layer, while the size of the output-layer bias is [10,1].
Figure 5: Bias initialization implementation
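Figure 5 is an image in the original as well. Continuing the sketch above, zero-initialized biases for the hidden and output layers would be:

```python
import numpy as np

# layer_sizes as defined in the weight-initialization sketch above: [784, 100, 10].
# One bias column vector per non-input layer, initialized to zero.
biases = [np.zeros((size, 1)) for size in layer_sizes[1:]]
# biases[0].shape == (100, 1), biases[1].shape == (10, 1)
```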
4. Training algorithm
As mentioned earlier, training is based on the concept of stochastic gradient descent (SGD). In SGD, we only consider one training point at a time.
In our example, we will use softmax activation in the output layer, and the cross entropy loss formula to calculate the loss. For SGD, we need the derivative of the cross entropy loss with respect to the softmax output. Conveniently, the derivative simplifies to y* − y, i.e. the predicted value y* minus the expected value y.
Figure 6: Cross entropy loss and its derivatives for Softmax activation
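Figure 6 is an image in the original. In code, the cross entropy loss for a softmax output and its simplified derivative (predicted minus expected) might be written as:

```python
import numpy as np

def cross_entropy_loss(y_pred, y_true):
    """Cross entropy loss for a softmax output y_pred and a one-hot label y_true."""
    return -np.sum(y_true * np.log(y_pred + 1e-12))   # small epsilon avoids log(0)

def cross_entropy_loss_derivative(y_pred, y_true):
    """With softmax + cross entropy, the gradient w.r.t. the output pre-activation
    simplifies to (predicted - expected)."""
    return y_pred - y_true
```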
We also need the derivative of the sigmoid activation function. In Figure 7, I define the sigmoid function and its derivative.
Figure 7: Sigmoid function (top) and its derivatives (bottom)
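Figure 7 is an image in the original; since d/dx sigmoid(x) = sigmoid(x) · (1 − sigmoid(x)), a sketch of the derivative, reusing the sigmoid helper defined earlier, is:

```python
def sigmoid_derivative(x):
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)
```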
Typically, a neural network lets the user specify several "hyperparameters." In our implementation, we will focus on letting users specify the number of epochs, the batch size, the learning rate, and the momentum. There are other optimization techniques as well!
- Learning rate (LR): the learning rate is a parameter that controls how quickly the network updates its parameters as it learns. Choosing a good learning rate is an art. If the LR is too high, we may never converge to an acceptable training error. If the LR is too low, we can waste a lot of computing time.
- Epochs: an epoch is one iteration over the entire training set. To make sure we don't overfit to data seen early on, we randomly shuffle the data after each epoch.
- Batch size: within each epoch, we train on the data in batches. For each training point in a batch, we accumulate the gradients and update the weights/biases once the batch is complete.
- Momentum: a parameter that accelerates learning by accumulating a moving average of past gradients and allowing movement in that direction. In most cases, this leads to faster convergence. Typical values range from 0.5 to 0.9.
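For reference only, these hyperparameters might be collected as follows. The exact values are not given in the original article beyond the 30 epochs mentioned in the results and the typical 0.5 to 0.9 momentum range, so treat them as placeholders:

```python
hyperparams = {
    "epochs": 30,          # full passes over the training set (30 used in the results above)
    "batch_size": 16,      # illustrative value; the original does not state one
    "learning_rate": 0.1,  # illustrative value; tune as discussed above
    "momentum": 0.9,       # within the typical 0.5-0.9 range mentioned above
}
```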
Below, I write some generic pseudocode to give an overview of the backpropagation learning algorithm. Tasks such as computing the outputs and grouping the training data into batches are annotated for ease of reading.
Now, we will show the implementation of the pseudocode.
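The original article shows both the pseudocode and its implementation as images. Below is a minimal sketch, under the single-hidden-layer setup and the sigmoid/softmax/sigmoid_derivative helpers defined above, of what one SGD-with-momentum training step on a single (input, one-hot label) pair might look like; batching and shuffling are left out:

```python
def train_step(x, y, weights, biases, velocities, lr=0.1, momentum=0.9):
    """One SGD step on a single training point (x: [784, 1], y: one-hot [10, 1])."""
    W1, W2 = weights
    b1, b2 = biases

    # Forward propagation.
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = softmax(z2)

    # Backward propagation (softmax + cross entropy gives dz2 = a2 - y).
    dz2 = a2 - y
    dW2 = dz2 @ a1.T
    db2 = dz2
    dz1 = (W2.T @ dz2) * sigmoid_derivative(z1)
    dW1 = dz1 @ x.T
    db1 = dz1

    # Momentum update: accumulate a moving average of past gradients.
    params = [W1, W2, b1, b2]
    grads = [dW1, dW2, db1, db2]
    for i, (param, grad) in enumerate(zip(params, grads)):
        velocities[i] = momentum * velocities[i] - lr * grad
        param += velocities[i]   # in-place update of the shared weight/bias arrays
```

In a full training loop, this step would run inside nested epoch and batch loops, with the training data shuffled at the start of each epoch and the velocities initialized to zero arrays matching each parameter's shape.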
5. Make predictions
Right now, we are missing just one key piece of this implementation: the prediction algorithm. We have already done most of the work while writing the backpropagation algorithm; we just need to reuse the same forward-propagation code to make a prediction. The softmax activation function in the output layer computes the probability of each entry in a matrix of size [10,1].
Our goal is to classify digits from 0 to 9, so the index within the output activation matrix corresponds to the prediction. The index with the highest probability is selected with np.argmax() and serves as our prediction.
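A sketch of the corresponding prediction step, reusing the forward-propagation logic above (a2 below simply names the output-layer activation the text refers to), might be:

```python
import numpy as np

def predict(x, weights, biases):
    """Forward-propagate a single [784, 1] input and return the predicted digit."""
    W1, W2 = weights
    b1, b2 = biases
    a1 = sigmoid(W1 @ x + b1)
    a2 = softmax(W2 @ a1 + b2)   # [10, 1] vector of class probabilities
    return int(np.argmax(a2))    # index of the highest probability = predicted digit
```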
Conclusion
That's it! We're done. We have written an implementation of a neural network in Python.
But how do we choose the best parameters? We can use our general knowledge of the algorithm to select meaningful hyperparameters. We want hyperparameters that generalize well without overfitting the data. We can adjust the momentum, learning rate, number of epochs, batch size, and number of hidden nodes to achieve our goal. Taking it a step further, we can write algorithms to do this for us!
A genetic algorithm is an AI technique that can be used to select the best parameters. The idea is to create a population of children with different parameters and have each of them produce a test error associated with its parameters. We then breed and mutate the neural networks with the best hyperparameters to find parameters that perform even better. After spending a lot of time on this, we can learn a lot about the hyperparameter space and find new, near-optimal hyperparameter values.
Is there anything else we can do to reduce the test error? Yes, we can scale the input data. Like many algorithms, the scale of the inputs can have a big impact on the results. In our example, the pixel values range over [0, 255]. This bias can be reduced if we scale the values so that they range over [0, 1].
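As a quick sketch, scaling is a one-line change:

```python
x_scaled = x / 255.0   # map pixel values from [0, 255] down to [0, 1]
```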
Thanks for reading!
The original link: towardsdatascience.com/implementin…