In this chapter, we will discuss the artificial neural network (ANN), a powerful nonlinear model for both supervised and unsupervised tasks that uses a different strategy to overcome the limitations of the perceptron. If the perceptron is analogous to a neuron, then an ANN, or neural network, is analogous to a brain. Just as a human brain is made up of billions of neurons and trillions of synapses, an ANN is a directed graph of artificial neurons. The graph's edges carry weights, the parameters that the model must learn.

This chapter provides an overview of the structure and training of small feedforward artificial neural networks. scikit-learn implements neural networks for classification, regression, and feature extraction, but these implementations are only suitable for small networks. Training a neural network requires a lot of computing power; in practice, most neural networks are trained on graphics processing units (GPUs) containing thousands of parallel processing cores. scikit-learn does not support GPUs and has no plans to do so in the near future: GPU acceleration is not yet mature but is rapidly evolving, and providing GPU support would add many dependencies that conflict with the project's goal of being "easy to install on all platforms". In addition, other machine learning algorithms rarely benefit from GPU acceleration to the same extent as neural networks. Training neural networks is better done with specialized libraries such as Caffe, TensorFlow, and Keras than with general-purpose machine learning libraries such as scikit-learn.

Although we will not use scikit-learn to train a deep convolutional neural network (CNN) for object recognition or a recurrent network for speech recognition, understanding the principles of the small networks we will train is an important prerequisite for those tasks.

12.1 Non-linear decision boundary

Recall from Chapter 10 that while some Boolean functions such as AND, OR, and NAND can be approximated by perceptrons, XOR, a linearly inseparable function, cannot, as shown in Figure 12.1.

 

Figure 12.1

Let's review the XOR function in more detail to build an intuition about an ANN's capabilities. Unlike the AND function (whose output is 1 only when both inputs are 1) and the OR function (whose output is 1 when at least one input is 1), the XOR function outputs 1 only when exactly one of its inputs is 1. We can treat the output of the XOR function as 1 when two conditions are both true. The first condition is that at least one of the inputs is 1; this is the same test as the OR function. The second condition is that the inputs are not both 1; this is the same test as the NAND function. We can reproduce the output of the XOR function by processing the inputs with both the OR and NAND functions, and then using the AND function to verify that the outputs of both functions are 1. That is, the OR, NAND, and AND functions can be composed to produce the same output as the XOR function.

Table 12.1 is a truth table for inputs A and B under the XOR, OR, AND, and NAND functions. From this table, we can verify that feeding the outputs of the OR and NAND functions on inputs A and B into the AND function produces the same result as applying the XOR function directly, as shown in Table 12.2.
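The composition described above can be sketched in a few lines of Python (a minimal illustration; the function names are ours, not scikit-learn's):

```python
# Verify that XOR(a, b) == AND(OR(a, b), NAND(a, b))
# for every row of the truth table.

def OR(a, b):
    return 1 if (a or b) else 0

def AND(a, b):
    return 1 if (a and b) else 0

def NAND(a, b):
    # NAND is the negation of AND
    return 1 - AND(a, b)

def XOR(a, b):
    # XOR composed from OR, NAND, and AND
    return AND(OR(a, b), NAND(a, b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, XOR(a, b))
```

Running the loop reproduces the XOR column of the truth table: the output is 1 only for the input pairs (0, 1) and (1, 0).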

 

12.2 Feedforward and feedback artificial neural networks

An ANN can be described by three key components. The first is the model's architecture, or topology, which describes the types of its neurons and the connections between them. The second is the activation function used by the artificial neurons. The third is the learning algorithm that finds the optimal weight values.

There are two main types of ANNs. Feedforward neural networks are the most common type and are defined by directed acyclic graphs. In feedforward neural networks, information travels in only one direction, toward the output layer. In contrast, feedback, or recurrent, neural networks contain cycles. A feedback cycle can represent an internal state of the network that causes its behavior to change over time based on its own inputs. Feedforward neural networks are often used to learn a function that maps inputs to outputs. For example, a feedforward neural network can be used to identify objects in a photo, or to predict the likelihood that a SaaS product will lose a subscriber. The temporal behavior of recurrent neural networks makes them suitable for processing sequences of inputs; they have been used to translate documents between two languages and to automatically transcribe speech. Since recurrent neural networks are not implemented in scikit-learn, we will confine our discussion to feedforward neural networks.

12.3 Multilayer Perceptron

The multilayer perceptron is a simple ANN. However, its name is a misnomer: a multilayer perceptron is not a multilayer structure with a single perceptron in each layer, but a multilayer structure of artificial neurons that simulate perceptrons. A multilayer perceptron consists of three or more layers of artificial neurons that form a directed acyclic graph. In general, each layer is fully connected to the following layer: the output, or activation, of each artificial neuron in one layer is an input to every artificial neuron in the next layer. Features enter through the input layer. The simple neurons of the input layer are connected to at least one hidden layer. Hidden layers represent latent variables that cannot be observed in the training data; the hidden neurons in these layers are often referred to as hidden units. The last hidden layer is connected to an output layer whose activations are the predicted values of the response variable. Figure 12.2 depicts a multilayer perceptron with three layers. The neurons marked +1 are constant bias neurons and do not appear in most architecture diagrams. This network has two input neurons, three hidden neurons, and two output neurons.

 

Figure 12.2

 

The input layer is not included in the layer count of a neural network, but the count returned by the MLPClassifier.n_layers_ attribute does include the input layer.
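This can be checked directly (the tiny training set and solver choice below are ours, picked only so that fitting completes instantly):

```python
from sklearn.neural_network import MLPClassifier

# A throwaway two-sample dataset, just so fit() can run
clf = MLPClassifier(hidden_layer_sizes=(3,), solver='lbfgs', max_iter=50)
clf.fit([[0, 0], [1, 1]], [0, 1])

# One hidden layer, but n_layers_ counts input + hidden + output
print(clf.n_layers_)  # 3
```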

Recall from Chapter 10 that a perceptron consists of one or more binary inputs, one binary output, and a Heaviside step activation function. A small change in a perceptron's weights either has no effect on its output or causes it to flip from 1 to 0 or from 0 to 1. This property makes it difficult to understand how the network's performance changes as we adjust its weights. For this reason, we will build MLPs from a different type of neuron. A sigmoid neuron has one or more real-valued inputs and one real-valued output, and uses a sigmoid activation function. As shown in Figure 12.3, the sigmoid activation function is a smoothed version of the step function: it approximates the step function at the extremes, but can output any value between 0 and 1, which allows us to understand how changes in the inputs affect the output.
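The contrast between the two activation functions can be sketched numerically (a minimal illustration; the sample points are arbitrary):

```python
import numpy as np

def heaviside(z):
    # Heaviside step function: outputs only 0 or 1
    return (z >= 0).astype(float)

def sigmoid(z):
    # Logistic sigmoid: smooth, outputs any value in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
print(heaviside(z))  # abrupt jump at 0
print(sigmoid(z))    # gradual transition; close to 0 and 1 at the extremes
```

Near the extremes the sigmoid approximates the step function, but in between it varies smoothly, so small weight changes produce small output changes.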

 

Figure 12.3

12.4 Training a multilayer perceptron

In this section, we will discuss how to train a multilayer perceptron. Recall from Chapter 5 that we can use gradient descent to minimize a real-valued function C of many variables. Suppose C is a function of two variables, v1 and v2. For gradient descent to work, a small change in the variables must produce a small change in the output. We express a change in the value of v1 as Δv1, a change in the value of v2 as Δv2, and the resulting change in the value of C as ΔC. The relationship between ΔC and the changes in the variables is shown in Formula 12.1:

ΔC ≈ (∂C/∂v1)Δv1 + (∂C/∂v2)Δv2

(Formula 12.1)

 

∂C/∂v1 represents the partial derivative of C with respect to v1. For convenience, we express Δv1 and Δv2 as a vector, as shown in Formula 12.2:

Δv = (Δv1, Δv2)^T

(Formula 12.2)

We will also express the partial derivatives of C with respect to each variable as the gradient vector ∇C of C, as shown in Formula 12.3:

∇C = (∂C/∂v1, ∂C/∂v2)^T

(Formula 12.3)

We can then rewrite the formula for ΔC as Formula 12.4:

ΔC ≈ ∇C · Δv

(Formula 12.4)

In each iteration, ΔC should be negative in order to reduce the value of the cost function. To ensure that ΔC is negative, we set Δv as shown in Formula 12.5:

Δv = −η∇C

(Formula 12.5)

In Formula 12.5, η is a hyperparameter called the learning rate. Substituting for Δv makes clear why ΔC is negative, as shown in Formula 12.6:

ΔC ≈ ∇C · (−η∇C) = −η‖∇C‖²

(Formula 12.6)

 

The squared norm ‖∇C‖² is always greater than or equal to 0, so multiplying it by the learning rate and negating the product guarantees that ΔC is negative. In each iteration, we compute the gradient ∇C and update the variables to take a step in the direction of steepest descent. To train a multilayer perceptron this way, we have so far omitted an important detail: how do changes in the hidden units' weights affect the cost function? More specifically, how do we calculate the partial derivative of the cost function with respect to a weight connecting into a hidden layer?
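The update rule Δv = −η∇C can be sketched on a toy cost function (the function C = v1² + v2² and the learning rate below are illustrative choices of ours, not from the text):

```python
import numpy as np

# Illustrative cost C(v1, v2) = v1^2 + v2^2, whose gradient is (2*v1, 2*v2)
def gradient(v):
    return 2 * v

eta = 0.1                      # learning rate (hyperparameter)
v = np.array([4.0, -3.0])      # arbitrary starting point
for _ in range(100):
    v = v - eta * gradient(v)  # step in the direction of steepest descent

print(v)  # both components approach the minimum at (0, 0)
```

Each step shrinks both coordinates by the factor (1 − 2η), so the iterates converge to the minimizer at the origin.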

12.4.1 Backpropagation

We have seen that gradient descent minimizes a function iteratively by computing the function's gradient and using it to update the function's parameters. To minimize the cost function of a multilayer perceptron, we need to calculate its gradient. Recall that multilayer perceptrons contain layers of hidden units that represent latent variables; we cannot calculate their errors directly from a cost function. The training data indicates the desired output of the entire network, but does not describe how the hidden units should contribute to that output. Since we cannot calculate the errors of the hidden units, we cannot calculate their gradients or update their weights. A naive solution to this problem is to randomly perturb the weights of the hidden units: if a random change reduces the value of the cost function, the weight is updated and another change is evaluated. Even for ordinary networks, this approach consumes an enormous amount of computing power. In this section, we describe a more efficient solution: the backpropagation algorithm, which computes the gradient of a neural network's cost function with respect to each of its weights. Backpropagation allows us to understand how each weight contributes to the error, and how to update the weights to minimize the cost function.

The name of the algorithm is a portmanteau of backward and propagation, which refers to the direction in which errors flow through the network's layers when the gradient is calculated. Backpropagation is often used in conjunction with an optimization algorithm, such as gradient descent, to train feedforward neural networks. In theory, it can be used to train feedforward networks with any number of hidden units in any number of layers.

Like gradient descent, backpropagation is an iterative algorithm, and each iteration consists of two phases. The first phase is forward propagation, or the forward pass. In the forward pass, inputs travel forward through the network's layers of neurons until they reach the output layer; the cost function can then be used to calculate the error of the prediction. The second phase is the backward pass. The error propagates backward from the cost function toward the inputs so that each neuron's contribution to the error can be estimated. This process is based on the chain rule, which can be used to compute the derivative of a composition of two or more functions; as we have seen, a neural network approximates complex nonlinear functions precisely by composing simpler functions. These per-neuron errors are then used to calculate the gradients that gradient descent requires to update the weights. After the weights are updated, the features can be propagated forward through the network again to begin the next iteration.

 

The chain rule can be used to compute the derivative of a composition of two or more functions. Suppose z depends on y, and y depends on x. The derivative of z with respect to x can be expressed as dz/dx = (dz/dy)(dy/dx).

To propagate forward through a network, we calculate the activation terms of the neurons in one layer and use those activations as inputs to the connected neurons in the next layer. To accomplish this, we first calculate the pre-activation term of each neuron in a layer. Recall that a neuron's pre-activation term is a linear combination of its inputs and weights. We then calculate its activation term by applying the activation function to the pre-activation term. The activations of this layer become the inputs of the next layer in the network.
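The forward pass just described can be sketched as follows (the network shape matches Figure 12.2, but the randomly drawn weights are purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """One forward pass: each layer's activations feed the next layer."""
    a = x
    for W, b in zip(weights, biases):
        z = a @ W + b   # pre-activation: linear combination of inputs and weights
        a = sigmoid(z)  # activation: apply the activation function
    return a

# Hypothetical weights for a 2-3-1 network like Figure 12.2
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]
biases = [rng.normal(size=3), rng.normal(size=1)]
print(forward(np.array([0.8, 0.3]), weights, biases))
```

Whatever the weights, the sigmoid output layer produces a value strictly between 0 and 1.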

To propagate backward through the network, we first calculate the partial derivative of the cost function with respect to each activation term in the last hidden layer. Then we calculate the partial derivative of the last hidden layer's activation terms with respect to their pre-activation terms. Next, we calculate the partial derivative of the last hidden layer's pre-activation terms with respect to their weights, and the process repeats until the input layer is reached. Through this process, we approximate each neuron's contribution to the error and obtain the gradient values needed to update the weights and minimize the cost function. More specifically, for each unit in each layer we must compute two partial derivatives. The first is the partial derivative of the error with respect to the unit's activation term. This derivative is not used to update the unit's own weights; instead, it is used to update the weights of the units in the previous layer that connect to it. Second, we calculate the partial derivative of the error with respect to the unit's weights in order to update the weight values and minimize the cost function. Let's look at an example. We will train a neural network consisting of two input units, a hidden layer of two hidden units, and one output unit. Its architecture diagram is shown in Figure 12.4.

Let’s assume that the initial values for the weights are shown in Table 12.3.

 

 

Figure 12.4

The feature vector is [0.8, 0.3], and the true value of the response variable is 0.5. Let's calculate the first forward pass, starting with hidden unit h1. We first calculate h1's pre-activation term, then apply the logistic sigmoid function to it to obtain the activation term, as shown in Formula 12.7:

z_h1 = w1·x1 + w2·x2 + b1,  a_h1 = 1 / (1 + e^(−z_h1))

(Formula 12.7)

We can use the same procedure to calculate h2's activation term, which comes to 0.615. The activations of hidden units h1 and h2 are then used as the inputs of the output layer, and o1's activation is calculated in the same way; the result is 0.813. Now we can calculate the error of the network's prediction. For this network, we will use the squared error cost function, as shown in Formula 12.8:

E = (1/2) Σ_{i=1}^{n} (y_i − a_oi)²

(Formula 12.8)

In Formula 12.8, n is the number of output units, a_oi is the activation term of output neuron o_i, and y_i is the true value of the response variable. Our network has only one output unit, so n equals 1. The network's prediction is 0.813 and the true value of the response variable is 0.5, so the error term is 0.313. Now we can update the weight w5. First we calculate ∂E/∂w5, that is, how a change in w5 affects the error. By the chain rule, ∂E/∂w5 is equal to Formula 12.9:

∂E/∂w5 = (∂E/∂a_o1)(∂a_o1/∂z_o1)(∂z_o1/∂w5)

(Formula 12.9)

That is, we can approximate how the error changes with w5 by answering the following three questions:

  • How much does a change in o1's activation term affect the error?
  • How much does a change in o1's pre-activation term affect its activation term?
  • How much does a change in the weight w5 affect o1's pre-activation term?

Then we subtract ∂E/∂w5 times the learning rate from w5 to update the weight. The first question is answered by approximating how the error changes with o1's activation term. The partial derivative of the cost function with respect to the output unit's activation term is shown in Formula 12.10:

∂E/∂a_o1 = a_o1 − y

(Formula 12.10)

We answer the second question by approximating how o1's activation term changes with its pre-activation term. The derivative of the logistic function is shown in Formula 12.11:

f′(x) = f(x)(1 − f(x))

(Formula 12.11)

In Formula 12.11, f(x) is the logistic function, 1/(1 + e^(−x)). Applying this to o1's activation gives Formula 12.12:

∂a_o1/∂z_o1 = a_o1(1 − a_o1)

(Formula 12.12)

Finally, we approximate how o1's pre-activation term changes with w5. The pre-activation term is a linear combination of the weights and inputs, as shown in Formula 12.13:

z_o1 = w5·a_h1 + w6·a_h2 + b2

(Formula 12.13)

The derivatives of the bias term b2 and of the term w6·a_h2 with respect to w5 are both 0: both terms are constant with respect to w5, so a change in w5 has no effect on them, and ∂z_o1/∂w5 = a_h1. Now that we have answered the three questions, we can calculate the partial derivative of the error with respect to w5, as shown in Formula 12.14:

∂E/∂w5 = (a_o1 − y) · a_o1(1 − a_o1) · a_h1

(Formula 12.14)

We can now update w5 by subtracting the learning rate times ∂E/∂w5 from it. We then follow the same process to update the remaining weights. After this first backward pass, we can use the updated weights to propagate forward through the network again.
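The whole update of w5 can be sketched numerically. Since Table 12.3's initial weight values are not reproduced here, the values below are hypothetical stand-ins; only the structure of the computation follows the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values standing in for Table 12.3
x = np.array([0.8, 0.3])     # feature vector
y = 0.5                      # true value of the response variable
W1 = np.array([[0.1, 0.3],   # input -> hidden weights
               [0.2, 0.4]])
b1 = np.array([0.25, 0.25])  # hidden-layer biases
w5, w6, b2 = 0.5, 0.6, 0.35  # hidden -> output weights and bias
eta = 0.5                    # learning rate

# Forward pass
a_h = sigmoid(x @ W1 + b1)             # hidden activations a_h1, a_h2
z_o1 = w5 * a_h[0] + w6 * a_h[1] + b2  # Formula 12.13
a_o1 = sigmoid(z_o1)                   # network prediction

# Backward pass: the three chain-rule factors of Formula 12.9
dE_da = a_o1 - y                 # Formula 12.10
da_dz = a_o1 * (1 - a_o1)        # Formula 12.12
dz_dw5 = a_h[0]                  # Formula 12.13 differentiated w.r.t. w5
dE_dw5 = dE_da * da_dz * dz_dw5  # Formula 12.14

w5 = w5 - eta * dE_dw5  # gradient descent update
```

With these stand-in values the prediction exceeds the target, so ∂E/∂w5 is positive and the update shrinks w5, which in turn reduces the prediction on the next forward pass.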

12.4.2 Training a multilayer perceptron to approximate the XOR function

Let's use scikit-learn to train a network that approximates the XOR function. We pass the activation='logistic' keyword argument to the MLPClassifier constructor to specify that the neurons should use the logistic sigmoid activation function. The hidden_layer_sizes parameter takes a tuple of integers indicating the number of hidden units in each hidden layer. We will use the same architecture as in the previous section: a hidden layer with two hidden units and an output layer with one output unit, as shown in Code 12.1.

Code 12.1

# In[1]:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

y = [0, 1, 1, 0]
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
clf = MLPClassifier(solver='lbfgs', activation='logistic',
                    hidden_layer_sizes=(2,), random_state=20)
clf.fit(X, y)
predictions = clf.predict(X)
print('Accuracy: %s' % clf.score(X, y))
for i, p in enumerate(predictions):
    print('True: %s, Predicted: %s' % (y[i], p))

# Out[1]:
Accuracy: 1.0
True: 0, Predicted: 0
True: 1, Predicted: 1
True: 1, Predicted: 1
True: 0, Predicted: 0

After several iterations, the network converges. Let's look at the learned weights and perform a forward pass on the feature vector [1, 1], as shown in Code 12.2.

Code 12.2

# In[2]:
print('Weights connecting the input layer and the hidden layer: \n%s' % clf.coefs_[0])
print('Hidden layer bias weights: \n%s' % clf.intercepts_[0])
print('Weights connecting the hidden layer and the output layer: \n%s' % clf.coefs_[1])
print('Output layer bias weight: \n%s' % clf.intercepts_[1])

# Out[2]:
Weights connecting the input layer and the hidden layer: 
[[6.11803955 6.35656369]
 [5.79147859 6.14551916]]
Hidden layer bias weights: 
[-9.38637909 -2.77751771]
Weights connecting the hidden layer and the output layer: 
[[-14.95481734]
 [ 14.53080968]]
Output layer bias weight: 
[-7.2284531]

To propagate forward, we compute the following using the learned weights, as shown in Formula 12.15:

z_h1 = 6.118 × 1 + 5.791 × 1 − 9.386 ≈ 2.523,  a_h1 = σ(z_h1) ≈ 0.926
z_h2 = 6.357 × 1 + 6.146 × 1 − 2.778 ≈ 9.725,  a_h2 = σ(z_h2) ≈ 1.000
z_o1 = −14.955 × a_h1 + 14.531 × a_h2 − 7.228 ≈ −6.545,  a_o1 = σ(z_o1) ≈ 0.001

(Formula 12.15)

The probability that the response variable is positive is about 0.001, so the network predicts ŷ = 0.
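We can check this forward pass numerically with the weights printed by Code 12.2 (a sketch that reimplements the network's prediction outside scikit-learn):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights and biases as printed in Code 12.2
W1 = np.array([[6.11803955, 6.35656369],
               [5.79147859, 6.14551916]])
b1 = np.array([-9.38637909, -2.77751771])
W2 = np.array([[-14.95481734],
               [14.53080968]])
b2 = np.array([-7.2284531])

x = np.array([1.0, 1.0])
hidden = sigmoid(x @ W1 + b1)     # hidden activations
output = sigmoid(hidden @ W2 + b2)
print(output)  # a small probability, so the predicted class for [1, 1] is 0
```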

12.4.3 Training a multilayer perceptron to classify handwritten digits

In the previous chapter, we used an SVM to classify the handwritten digits of the MNIST dataset. In this section, we will use an ANN to classify those images, as shown in Code 12.3.

Code 12.3

# In[1]:
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

if __name__ == '__main__':
    digits = load_digits()
    X = digits.data
    y = digits.target
    pipeline = Pipeline([
        ('ss', StandardScaler()),
        ('mlp', MLPClassifier(hidden_layer_sizes=(150, 100), alpha=0.1,
                              max_iter=300))
    ])
    print(cross_val_score(pipeline, X, y, n_jobs=-1))

# Out[1]:
[0.94850498 0.94991653 0.90771812]

First we load the dataset using the load_digits convenience function. Cross-validation with n_jobs=-1 spawns additional processes, which requires the code to execute inside a main guard block. Scaling the features is very important for ANNs and ensures faster convergence for some learning algorithms. Next, we create a Pipeline that scales the data before fitting an MLPClassifier. The network consists of an input layer, a hidden layer with 150 units, a second hidden layer with 100 units, and an output layer. We also increased the regularization hyperparameter alpha, and raised the maximum number of iterations from the default of 200 to 300. Finally, we print the accuracies from three-fold cross-validation. The mean accuracy is comparable to that of the support vector classifier. Adding more hidden units or hidden layers, and fine-tuning the hyperparameters with grid search, could further improve the accuracy.

12.5 summary

In this chapter, we introduced the ANN, a model that composes artificial neurons to represent complex functions for classification and regression. In particular, we discussed directed acyclic graphs of artificial neurons, called feedforward neural networks. The multilayer perceptron is a feedforward neural network in which each layer is fully connected to the next. An MLP with one hidden layer and a finite number of hidden units is a universal function approximator: it can represent any continuous function, although it will not necessarily learn the approximating weight values. We described how a network's hidden layers represent latent variables, and how its weights can be learned with the backpropagation algorithm. Finally, we used scikit-learn's multilayer perceptron implementation to approximate the XOR function and to classify handwritten digits.

This article is excerpted from SciKit-Learn Machine Learning (2nd edition)

 

In 14 chapters, this book introduces in detail a series of machine learning models and techniques for using scikit-learn. Starting from the basic theory of machine learning, it covers simple linear regression, the K-nearest neighbors algorithm, feature extraction, multiple linear regression, logistic regression, naive Bayes classification, decision trees, nonlinear regression, random forests, the perceptron, support vector machines, artificial neural networks, the K-means algorithm, principal component analysis, and other important topics.

The book is intended for engineers in the field of machine learning and for data scientists who want to learn more about scikit-learn. By reading it, readers will improve their ability to construct and evaluate machine learning models and to solve machine learning problems efficiently.