This article is excerpted from Deep Learning Principles and PyTorch in Action

We will start from the practical problem of predicting the number of shared bikes at a given location and lead readers into the world of neural networks. Using PyTorch, we will build a shared bike predictor and master basic concepts such as neurons, neural networks, activation functions, machine learning, and data preprocessing. In addition, we will open the "black box" of neural networks to see how they work and which neurons play a key role, giving readers a deeper understanding of how neural networks operate.

3.1 Troubles of shared bikes

Since about 2016, shared bikes have appeared all around us, and colorful bikes of all kinds have flooded the city's streets.

While sharing bikes brings convenience to people, there is also a troublesome problem: the distribution of bikes is very uneven. For example, in the morning rush hour, a large number of bikes are often gathered at some subway entrances, while in the evening rush hour, it is difficult to find a bike, which brings inconvenience to people who need to use shared bikes.

So how can the problem of uneven bike distribution be solved? Currently, bike-sharing companies hire workers to carry bikes to the areas where they are needed. But how many bikes should be moved? When, and to where? Answering these questions requires knowing exactly how many bikes are distributed across different locations in the city, and arrangements need to be made in advance, because there is a delay before workers can deliver the bikes. This is a serious challenge for bike-sharing companies.

To solve this problem more scientifically and effectively, we need to build a bike-count predictor that predicts the number of bikes in a given parking area at a given time, for bike-sharing companies to reference when deciding how to distribute bikes reasonably.

One cannot make bricks without straw. To build such a bike predictor, a certain amount of shared bike data is needed. To avoid business disputes, and to make development and explanation in this book more convenient, this example uses a public foreign bike-sharing data set (Capital Bikeshare) to complete our task. The data set can be downloaded from: www.capitalbikeshare.com/system-data.

Once the data set is downloaded, it can be opened directly using regular table processing software or a text editor, as shown in Figure 3.1.

The data span January 1, 2011 to December 31, 2012. Each row is one data record, for a total of 17,379 records. A record contains the day of the week, whether it is a holiday, the weather and wind speed in a particular area within one hour, and the number of bikes used in that area (the cnt variable), which is the quantity we care about most.

We can take a slice of the data and plot how cnt changes over time. Figure 3.2 shows the data from January 1 to January 10, 2011. The horizontal axis is time, and the vertical axis is the number of bikes. The number of bikes fluctuates with time and shows a certain regularity; it is not hard to see that the peak bike counts on weekdays are much higher than those on weekends.

The question is: can we use historical data to predict how the number of bikes in this region will evolve over time? In this chapter, we will learn how to design a neural network model to predict the number of bikes. Instead of offering a perfect solution all at once, we take a step-by-step approach and try different solutions. Along the way, we will explain what an artificial neuron is, what a neural network is, how to build a neural network according to our needs, what overfitting is, how to solve the overfitting problem, and so on. In addition, we will learn how to dissect a neural network to understand how it works and how it corresponds to the data.

3.2 Bicycle predictor 1.0

In this section, we will build our first bicycle predictor: a neural network with a single hidden layer. We will train it to fit the fluctuating curve of shared bike usage.

Before designing the bicycle predictor, however, it is important to understand what an artificial neural network is and how it works.

3.2.1 Introduction to artificial neural network

An artificial neural network (neural network for short) is a computing model inspired by the biological neural networks of the human brain. Artificial neural networks are very good at learning the mapping from input data to labels in order to make predictions or solve classification problems. They are also known as universal fitters because they can fit arbitrary functions or mappings.

The feedforward neural network is the most commonly used network. It generally includes three layers of artificial neural units: an input layer, a hidden layer, and an output layer, as shown in Figure 3.3. The hidden layer may itself contain multiple layers, which is what constitutes a so-called deep neural network.

Each circle represents an artificial neuron, and each line represents an artificial synapse connecting two neurons. Each edge carries a number called a weight, usually denoted by w.

The operation of neural network usually includes feedforward prediction process (or decision-making process) and feedback learning process.

In the feedforward prediction process, the signal enters at the input unit and travels along the network's edges. Each signal is multiplied by the weight of the edge it travels along, producing the inputs of the hidden layer units. Next, each hidden layer unit sums all of its incoming signals and, after certain processing (described in the next section), produces an output. These outputs are multiplied by the weights of the edges from the hidden layer to the output layer to obtain the input signals of the output unit. Finally, the output unit sums the signals on its incoming edges, processes them, and produces the final output, which is the output of the whole neural network. In the training stage, the neural network adjusts the weight w of each connecting edge.
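
To make the feedforward computation concrete, here is a minimal NumPy sketch of one forward pass through a tiny network with one input unit, three hidden units, and one output unit. All names and numbers here are illustrative; they are not the bike data or the book's code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = 0.5                                 # a single input value
w = np.array([0.2, -1.0, 0.7])          # weights from the input to the 3 hidden units
b = np.array([0.1, 0.0, -0.3])          # biases of the hidden units
w_out = np.array([1.5, -0.8, 0.4])      # weights from the hidden units to the output

hidden_in = w * x + b                   # each hidden unit sums its weighted input
hidden_out = sigmoid(hidden_in)         # and passes it through the activation function
y = np.dot(w_out, hidden_out)           # the output unit sums the weighted hidden outputs
print(y)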

In the feedback learning process, each output neuron first computes its prediction error, and then this error is propagated backward along all the connecting edges of the network to obtain the error of each hidden layer node. Finally, based on the errors of the two nodes connected by each edge, the weight update of that edge is computed, which completes the learning and adjustment of the network.

Now, let’s start with artificial neurons and talk about how neural networks work in detail.

3.2.2 Artificial neuron

Like a biological neural network, an artificial neural network is composed of artificial neurons (neurons for short). Neurons use simple mathematical models to simulate the signaling and activation of biological nerve cells. To understand how artificial neural networks work, let's start with the simplest case: the single-neuron model. As shown in Figure 3.4, it has only one input layer unit, one hidden layer unit, and one output layer unit.

Here x is the input and y is the output; both are real numbers. The weight w from the input unit to the hidden layer unit, the bias b of the hidden layer unit, and the weight w' from the hidden layer to the output layer are all real numbers that can take arbitrary values.

We can think of this simplest neural network as a function mapping x to y, with w, b, and w' as the parameters of the function. The equation of this function, shown in Figure 3.5, is y = w'·σ(w·x + b), where σ denotes the sigmoid function. When w = 1, w' = 1, and b = 0, the graph of this function is as shown in Figure 3.5.

This is the shape of the sigmoid function, whose mathematical expression is σ(x) = 1/(1 + e^(-x)). When x is less than 0, σ(x) is less than 1/2, and the smaller x is, the closer σ(x) gets to 0. When x is greater than 0, σ(x) is always greater than 1/2, and the larger x is, the closer σ(x) gets to 1. Near x = 0 there is a rapid transition from 0 to 1.
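
These properties are easy to verify numerically. The short sketch below (our own illustration, not the book's code) evaluates σ(x) at a few points.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(x, sigmoid(x))
# roughly 4.5e-05, 0.269, 0.5, 0.731, 0.99995:
# close to 0 far to the left, exactly 0.5 at x = 0, close to 1 far to the right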

When we change w, b, and w', the graph of the function changes accordingly. For example, if we keep w' = 1 and b = 0 fixed and change the size of w, the function graph changes as shown in Figure 3.6.

It can be seen that when w > 0, its size controls how sharply the function bends: the larger w is, the steeper the bend near x = 0, and the more abrupt the transition there. When w < 0, the curve flips left to right and jumps from 1 down to 0 instead.

Next, let's look at the effect of the parameter b on the curve, keeping w = w' = 1 unchanged, as shown in Figure 3.7.

It is clear that b controls the horizontal position of the sigmoid curve: when b > 0, the curve shifts to the left; when b < 0, it shifts to the right. Finally, let's see how w' affects the curve, as shown in Figure 3.8.

It is not difficult to see that when w' > 0, w' controls the height of the curve; when w' < 0, the curve is flipped upside down.

So by controlling w, w', and b, we can arbitrarily adjust the shape of the function from input x to output y. But no matter how we adjust them, the curve is always S-shaped (including inverted S-shapes). To obtain more complex function graphs, we need to introduce more neurons.

3.2.3 Two hidden layer neurons

Let’s make the model a little more complex and see how the two hidden layer neurons affect the curve, as shown in Figure 3.9.

Once the input enters the network, it splits into two paths: one goes into the first (left) neuron and the other into the second (right) neuron. The two neurons perform their computations separately, and their outputs, weighted by w'1 and w'2, are summed to give y. So the output y is essentially a superposition of the two neurons. The network is still a function mapping x to y, and the function equation is:

y = w'1·σ(w1·x + b1) + w'2·σ(w2·x + b2)

In this formula there are six different parameters: w1, w2, w'1, w'2, b1, b2. Their combination also affects the shape of the curve.

For example, if we take w1 = w2 = w'1 = w'2 = 1, b1 = -1, b2 = 0, then the curve of this function has the shape shown in Figure 3.10.

Thus, the resultant function graph becomes a curve with two steps.

Let's look at another parameter combination: w1 = w2 = 1, b1 = 0, b2 = -1, w'1 = 1, w'2 = -1. The function graph is then as shown in Figure 3.11.

Thus we have synthesized a curve with a single crest, somewhat similar to the bell curve of a normal distribution. In general, by changing the parameter combination, two hidden layer neurons can be used to fit any single-peaked curve.
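
As a sanity check of this single-peak construction, the short sketch below (our own illustration, not the book's code) evaluates the two-neuron function with the parameter combination given above and plots the resulting bump.

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# parameter combination from the text: w1 = w2 = 1, b1 = 0, b2 = -1, w'1 = 1, w'2 = -1
w1, w2, b1, b2, wp1, wp2 = 1.0, 1.0, 0.0, -1.0, 1.0, -1.0

x = np.linspace(-10, 10, 200)
y = wp1 * sigmoid(w1 * x + b1) + wp2 * sigmoid(w2 * x + b2)  # superposition of two neurons

plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.show()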

Then, with four, six, or even more hidden layer neurons, it is not difficult to imagine curves with two peaks, three peaks, or any number of peaks. Roughly speaking, two neurons can be used to approximate one peak (or trough). In fact, for the more general case, scientists have already proved theoretically that a finite number of hidden layer neurons can approximate any curve on a finite interval; this is known as the universal approximation theorem.

3.2.4 Training and operation

In the previous discussion, we saw that any desired curve can be obtained by adjusting the combination of parameters in a neural network. The question is, how do we pick these parameters? The answer lies in training.

In order to train the neural network, we must define a loss function for it, which measures how well the network performs under the current combination of parameters. This is similar to the total error function L used in the linear regression of Chapter 2 to predict house prices (the sum of squared distances between the fitted line and all the points). Similarly, for bike prediction, we can define the loss function as the mean of the squared differences between the number of bikes predicted by the neural network and the actual number of bikes in the data, over all data samples, namely:

L = (1/N) · Σᵢ (ŷᵢ - yᵢ)²

Here, N is the total number of samples, yᵢ is the actual number of bikes in the i-th sample, and ŷᵢ is the number predicted by the network.
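
A minimal sketch of this loss, using small made-up prediction and target tensors, looks like the following; torch.mean((predictions - y) ** 2) is also exactly how it is computed in the training code later in this section.

import torch

y = torch.tensor([16.0, 40.0, 32.0, 13.0])             # actual bike counts (made-up numbers)
predictions = torch.tensor([20.0, 35.0, 30.0, 10.0])    # network outputs (made-up numbers)

loss = torch.mean((predictions - y) ** 2)   # L = (1/N) * sum((y_hat_i - y_i)^2)
print(loss)                                 # tensor(13.5000)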

With this loss function L, we have a direction for adjusting the parameters of the neural network: make L as small as possible. Therefore, what the neural network needs to learn are the weights and biases of the connecting edges between neurons, and the goal of learning is to find a combination of parameter values that minimizes the total error.

This is an optimization (extremum) problem, and calculus tells us that we just need to set the derivative to zero. However, since neural networks are generally very complex and contain a large number of nonlinear operations, solving for the derivative analytically is not feasible, so we generally solve the problem numerically, that is, with the gradient descent algorithm. Each iteration moves the parameters in the negative direction of the gradient, gradually decreasing the error. The parameters are updated with the backpropagation algorithm, which passes the loss function L back through the network layer by layer to correct the parameters of each layer. We will not go into the details of backpropagation here, because PyTorch has packaged this complex algorithm into a simple command: backward(). Once this command is invoked, PyTorch automatically runs backpropagation and computes the gradient of every parameter; we then only need to update the parameters according to these gradients to complete one step of learning.
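
As a minimal illustration of what backward() does (a toy example of ours, not the bike predictor), the following sketch computes the gradient of a simple loss with respect to a single parameter and takes one gradient descent step. Plain tensors with requires_grad=True play the role of the Variable wrapper used in the code below.

import torch

w = torch.tensor([2.0], requires_grad=True)   # a single trainable parameter
x, y = torch.tensor([3.0]), torch.tensor([10.0])

loss = torch.mean((w * x - y) ** 2)   # a toy loss: (w*x - y)^2
loss.backward()                       # PyTorch fills w.grad with dloss/dw = 2*x*(w*x - y)
print(w.grad)                         # tensor([-24.])

learning_rate = 0.01
w.data.add_(-learning_rate * w.grad.data)   # move against the gradient, as in the training loop below
w.grad.data.zero_()                         # clear the gradient before the next iteration
print(w)                                    # tensor([2.2400], requires_grad=True)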

Neural network learning and operation usually alternate. In other words, in each cycle the neural network performs a feedforward pass from input to output; then, based on the loss value at the output, the backpropagation algorithm is used to adjust the network's parameters. Repeating these two steps over and over makes the neural network learn better and better.

3.2.5 A failed neural predictor

Having figured out how a neural network works, let's see how to use it to predict the shared bike curve. We hope to use an artificial neural network to fit the bike usage curve over a period of time and produce the curve of bike usage at future time points, similar to the approach used for predicting housing prices.

To keep the demonstration simple and clear, we select only the first 50 records from the data and plot them, as shown in Figure 3.12. In this curve, the horizontal axis is the index of the data record, and the vertical axis is the number of bikes.

Next, we design a neural network whose input x is the data record index and whose output is the corresponding number of bikes. Looking at this curve, we find that it has at least 3 peaks, so 10 hidden layer units are more than enough to fit it. Our artificial neural network architecture is therefore as shown in Figure 3.13.

Next, we start writing the program that implements this network. First, import all the dependencies the program uses. We will use the pandas library to read and manipulate the data. To install it, run conda install pandas in the Anaconda environment.

import numpy as np
import pandas as pd  # Library to read CSV files
import torch
from torch.autograd import Variable
import torch.optim as optim
import matplotlib.pyplot as plt
# Let the output graphics be displayed directly in the Notebook
%matplotlib inline

Next, import the desired data from the hard disk file.

data_path = 'hour.csv'  # Path to the data file
rides = pd.read_csv(data_path)  # rides is a dataframe object holding the data in memory
rides.head()  # Output the first few rows of data
counts = rides['cnt'][:50]  # Take the first 50 bike-count records
x = np.arange(len(counts))  # Get the variable x
y = np.array(counts)  # The number of bikes is y
plt.figure(figsize = (10, 7))  # Set drawing window size
plt.plot(x, y, 'o-')  # Draw raw data
plt.xlabel('X')  # Axis label
plt.ylabel('Y')  # Axis label

Here we use the pandas library to quickly import the data from the CSV file into the variable rides. rides stores the data as a two-dimensional table that can be accessed and manipulated like an array. head() prints out the first few data records.

Then we select the first 50 of all the records in rides and keep only the cnt field, storing it in the counts array, which holds the first 50 bike usage records. Next, we plot these 50 records, as shown in Figure 3.12.

With the data ready, we can use PyTorch to build the artificial neural network. As in the linear regression example in Chapter 2, we first need to define a set of variables, including the weights and biases of all the edges, and let PyTorch automatically build the computation graph from these variables.

# Input variable: a one-dimensional array 0, 1, 2, ...
x = Variable(torch.FloatTensor(np.arange(len(counts), dtype = float)))
# Output variable: a one-dimensional array of 50 data points, the bike count at each time step read from counts
y = Variable(torch.FloatTensor(np.array(counts, dtype = float)))

sz = 10  # Set the number of hidden layer neurons
# Initialize the weight matrix from the input layer to the hidden layer; its size is (1, 10)
weights = Variable(torch.randn(1, sz), requires_grad = True)
# Initialize the bias vector of the hidden layer nodes, a one-dimensional vector of size 10
biases = Variable(torch.randn(sz), requires_grad = True)
# Initialize the weight matrix from the hidden layer to the output layer; its size is (10, 1)
weights2 = Variable(torch.randn(sz, 1), requires_grad = True)

After setting the variables and initial parameters of the neural network, it is time to iteratively train the neural network.

learning_rate = 0.0001  # Set the learning rate
losses = []  # This array records the loss function value of each iteration, for plotting later
for i in range(1000000):
    # Computation from the input layer to the hidden layer
    hidden = x.expand(sz, len(x)).t() * weights.expand(len(x), sz) + biases.expand(len(x), sz)
    # At this point the size of hidden is (50, 10): 50 data points and 10 hidden layer neurons

    # Apply the sigmoid function to each hidden layer neuron
    hidden = torch.sigmoid(hidden)
    # From the hidden layer to the output layer, compute the final prediction
    predictions = hidden.mm(weights2)
    # At this point the size of predictions is (50, 1): the predicted values of the 50 data points
    # Compute the mean squared error against the ground truth y in the data
    loss = torch.mean((predictions - y) ** 2)
    # At this point loss is a scalar, i.e. a single number
    losses.append(loss.data.numpy())

    if i % 10000 == 0:  # Print the loss function value every 10,000 iterations
        print('loss:', loss)

    #*************************************************************
    # Next, run gradient descent to propagate the error backward
    loss.backward()  # Backpropagate the loss function

    # Update weights and biases using the gradients computed above
    weights.data.add_(- learning_rate * weights.grad.data)
    biases.data.add_(- learning_rate * biases.grad.data)
    weights2.data.add_(- learning_rate * weights2.grad.data)

    # Clear the gradients of all variables
    weights.grad.data.zero_()
    biases.grad.data.zero_()
    weights2.grad.data.zero_()

In the code above, we run 1,000,000 training iterations. In each iteration, the x values of all 50 data points are fed into the neural network as one array; the network carries out its computation step by step from the input layer to the hidden layer, then from the hidden layer to the output layer, and finally outputs the prediction array predictions for the 50 data points.

Then the error between predictions and the ground truth y is computed, and the average error over all 50 data points is obtained; this is the loss function L mentioned above. Next, calling loss.backward() propagates the error backward through the neural network and computes the gradient of each leaf node in the computation graph, recording it in the .grad attribute of each variable. Finally, we use these gradient values to update each parameter, completing one iteration.

A close comparison of this code with the linear regression code in Chapter 2 shows that everything is the same except the computation in the middle and the loss function. In fact, almost all the machine learning cases in this book follow the same steps: feedforward computation, backpropagation to compute gradients, and updating the parameter values based on the gradients.

We can plot how the loss declines as the iterations proceed, which helps us see the training process of the neural network intuitively, as shown in Figure 3.14.

plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')

As can be seen from the curve, the prediction error of the neural network does decrease over time. Moreover, after about 20,000 steps, the error hardly drops any further.

Next, we can draw the prediction curve of the trained network on these 50 data points and compare it with the ground truth y. The code is as follows:

x_data = x.data.numpy()  # Extract the data wrapped in x
plt.figure(figsize = (10, 7))  # Set drawing window size
xplot, = plt.plot(x_data, y.data.numpy(), 'o')  # Draw raw data
yplot, = plt.plot(x_data, predictions.data.numpy())  # Draw the fit data
plt.xlabel('X')  # Change the axis label
plt.ylabel('Y')  # Change the axis label
plt.legend([xplot, yplot], ['Data', 'Prediction under 1000000 epochs'])  # Draw a legend
plt.show()

The final visualization is shown in Figure 3.15.

As you can see, our prediction curve fits the data well at the first wave peak, but after that, it doesn’t match the real data very well. Why is that?

As we know, x ranges from 1 to 50, and the initial values of all the weights and biases are standard normal random numbers, so the input values reaching the hidden layer nodes range roughly from -50 to 50. It would take a lot of training time to move the peaks of the sigmoid functions to the desired positions on such a wide range. In fact, if we kept training long enough, we could fit the rest of the curve fairly well too.

The solution to this problem is to normalize the range of the input data, that is, to squeeze the input values of x into the range 0~1. Since x ranges from 1 to 50, we just divide each number by 50:

x = Variable(torch.FloatTensor(np.arange(len(counts), dtype = float) / len(counts)))

This operation changes the values of x to 0.02, 0.04, …, 1. After these improvements, running the program again shows that training is significantly faster and the visualized fit is better, as shown in Figure 3.16.

We can see that the improved model captures both peaks and fits the data points very well, forming a nice curve.

Next, we use the trained model to make predictions. Our task is to predict the number of bikes for the next 50 data points, whose x values are 51, 52, …, 100; these must also be divided by 50.

counts_predict = rides['cnt'][50:100]  # Read the next 50 data points to be predicted
x = Variable(torch.FloatTensor((np.arange(len(counts_predict), dtype = float) + len(counts)) / len(counts)))
# Read the y values of the next 50 points without normalization
y = Variable(torch.FloatTensor(np.array(counts_predict, dtype = float)))  

# Use x to predict y
hidden = x.expand(sz, len(x)).t() * weights.expand(len(x), sz) + biases.expand(len(x), sz)  # Computation from the input layer to the hidden layer
hidden = torch.sigmoid(hidden)  # Apply the sigmoid function to each hidden layer neuron
predictions = hidden.mm(weights2)  # Output from the hidden layer to the output layer, and calculate the final prediction
loss = torch.mean((predictions - y) ** 2)  # Calculate the loss function on the forecast data
print(loss)

# Draw the prediction curve
x_data = x.data.numpy()  # Extract the data wrapped in x
plt.figure(figsize = (10, 7)) # Set drawing window size
xplot, = plt.plot(x_data, y.data.numpy(), 'o') # Draw raw data
yplot, = plt.plot(x_data, predictions.data.numpy())  # Draw the fit data
plt.xlabel('X')  # Change the axis label
plt.ylabel('Y')  # Change the axis label
plt.legend([xplot, yplot], ['Data', 'Prediction'])  # Draw a legend
plt.show()

Finally, we obtain the curve shown in Figure 3.17. The solid line is the prediction curve given by our model, and the dots correspond to the actual data. The model's predictions and the actual data are completely off!

Why is it that our neural network can fit 50 known data points very well, but can’t predict any more data points at all? The reason: overfitting.

3.2.6 Overfitting

Overfitting refers to the situation where a model performs very well on the training data but poorly on new test data. In this example, the training data are the first 50 data points and the test data are the next 50. By adjusting its parameters, our model can fit the training curve smoothly, but this deliberate fit has no generalization value at all, so the fitted curve ends up far from the true values of the test data. Our neural network model has not learned the patterns in the data.
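
A simple way to see the phenomenon is to compare the error on training data with the error on held-out data. The sketch below is our own toy illustration: it swaps the neural network for a small polynomial fit, purely because a degree-4 polynomial through 5 points makes the train/test gap easy to reproduce; all numbers are made up.

import numpy as np

# 5 "training" points with essentially random y values (made-up numbers)
x_train = np.arange(5, dtype=float)
y_train = np.array([3.0, 7.0, 2.0, 9.0, 4.0])
x_test = np.arange(5, 10, dtype=float)
y_test = np.array([5.0, 6.0, 8.0, 3.0, 7.0])

# a degree-4 polynomial has enough parameters to pass through all 5 training points exactly
coeffs = np.polyfit(x_train, y_train, deg=4)
train_error = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_error = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
print(train_error, test_error)   # train error is numerically zero, test error is huge: overfitting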

So why can't our neural network learn the patterns in the curve? The reason is that we chose the wrong feature variable: we tried to use the index of each data record (1, 2, 3, …), or its normalized value, to predict y. However, the fluctuation pattern of the curve (i.e. bike usage) obviously does not depend on the index, but on factors such as the weather, wind speed, day of the week, and whether it is a holiday. We ignored all of this and forced a powerful artificial neural network to fit the whole curve, which naturally led to overfitting, and very serious overfitting at that.

As this example shows, blindly pursuing AI techniques without considering the context of the actual problem can easily lead us astray. When facing big data, the meaning behind the data can often guide us to a shortcut for analyzing it more quickly.

In this section, we learned a lot: how neural networks work, how to choose the number of hidden neurons based on the complexity of the problem, and how to adjust the data to make training faster. More importantly, we learned a hard lesson about what overfitting is.

3.3 Bicycle predictor 2.0

Now, let's get on the right track. Since we suspect that information such as the weather, wind speed, day of the week, and whether it is a holiday can better predict bike usage, and since our raw data already contain this information, we might as well design a neural network that takes all of this relevant information as input to predict the number of bikes.

3.3.1 Data preprocessing

Before we start designing the neural network, however, it is best to take a closer look at the data, because nothing matters more than improving our understanding of the data.

Looking further at the data in Figure 3.2, we find that all variables can be divided into two types: type variables and numerical variables.

A type variable is one that takes values in a few discrete categories. For example, the weekday variable takes the values 0 to 6, representing the days of the week (with 0 standing for Sunday). The weathersit variable ranges from 1 to 4: 1 is sunny, 2 is cloudy, 3 is light rain/snow, and 4 is heavy rain/snow.

The other kind is the numerical variable, which takes continuous values within an interval. For example, humidity is a variable taking continuous values in [0, 1]. Temperature and wind speed are also variables of this kind.

We cannot feed these different kinds of variables into the neural network without any processing, because different values mean completely different things. In a type variable, the magnitude of the number has no real meaning: the number 5 is greater than the number 1, but that does not mean Friday is somehow more special than Monday. In addition, numerical variables have different ranges of variation; mixing them together directly is bound to cause unnecessary trouble. For these reasons, we need to preprocess the two kinds of variables separately.

1. One-hot encoding of type variables

The magnitude of a type variable has no meaning; it only distinguishes between different categories. For example, the season variable can equal 1, 2, 3, or 4, for the four seasons, and the numbers merely tell the seasons apart. We cannot feed the season variable directly into the neural network, because its value does not represent a corresponding signal strength. Our solution is to convert the type variable into a one-hot encoding, as shown in Table 3.1.
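
For instance, here is a minimal sketch of one-hot encoding a small season column with pandas (toy data, not the actual hour.csv); pd.get_dummies() is also what the preprocessing code below uses.

import pandas as pd

df = pd.DataFrame({'season': [1, 2, 3, 4, 2]})   # toy column taking the four season codes 1-4
dummies = pd.get_dummies(df['season'], prefix='season')
print(dummies)
# each row now has a single 1 in the column of its own season and 0 elsewhere,
# e.g. season 2 becomes season_1=0, season_2=1, season_3=0, season_4=0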

Next, we simply convert each column of type variables in the data into multiple columns of one-hot vectors to complete the preprocessing of these variables, as shown in Figure 3.18.

As a result, the weekday property is turned into seven different properties, adding six columns to the data table.

Programmatically, pandas can easily do the following:

dummy_fields = ['season', 'weathersit', 'mnth', 'hr', 'weekday']  # Names of all the type variables to encode
for each in dummy_fields:
    # Take each type variable and convert it to one-hot encoding
    dummies = pd.get_dummies(rides[each], prefix=each, drop_first=False)
    # Merge the new one-hot encoded variables with the existing variables
    rides = pd.concat([rides, dummies], axis=1)

# Delete the original type variables and unused columns from the table
fields_to_drop = ['instant', 'dteday', 'season', 'weathersit', 'weekday', 'atemp', 'mnth', 'workingday', 'hr']  # Names of the columns to be deleted
data = rides.drop(fields_to_drop, axis=1)  # Drop them from the dataframe

After this processing, the original 17 columns of data suddenly become 59 columns, and some data fragments are shown in Figure 3.19.

2. Handling of numerical variables

The problem with numerical variables is that each one varies over a different range and in different units, so different variables cannot be compared directly. Our solution is to standardize each variable using its mean and standard deviation, so that the values fluctuate roughly within the range [-1, 1]. For example, for the variable temp, its mean over the whole data set is mean(temp) and its standard deviation is std(temp); the standardized temperature is then calculated as:

temp' = (temp - mean(temp)) / std(temp)

Here temp' is a number fluctuating roughly within [-1, 1]. The advantage of this is that variables with different value ranges are placed on an equal footing.

We can standardize these variables with the following code:

quant_features = ['cnt', 'temp', 'hum', 'windspeed']  # Names of the numerical variables
scaled_features = {}  # Store the mean and variance of each variable in the scaled_features variable
for each in quant_features:
    # Calculate the mean and variance of these variables
    mean, std = data[each].mean(), data[each].std()
    scaled_features[each] = [mean, std]
    # Standardize each variable
    data.loc[:, each] = (data[each] - mean)/std

3. Division of the data set

After preprocessing, our data set contains 17,379 records and 59 variables. Next, we split this data set.

First, the variables are divided into a feature set and a target set. The feature variables include: year (yr), whether it is a holiday (holiday), temperature (temp), humidity (hum), wind speed (windspeed), seasons 1-4 (season), weather conditions 1-4 (weathersit, the different weather types), months 1-12 (mnth), hours 0-23 (hr), and days of the week 0-6 (weekday); these are the inputs to the neural network. The target variables include the bike count (cnt), the number of casual users (casual), and the number of registered users (registered). Of these, we take only cnt as the target variable and leave the other two untouched for now. We will use 56 feature variables as the input of the neural network to predict this 1 variable as its output.

Next, we split the 17,379 records into two sets: the first 16,875 records serve as the training set for training our neural network, and the data of the last 21 days (504 records) serve as the test set for evaluating the model's predictions. The test data are not involved in training at all, as shown in Figure 3.20.

The data processing code is as follows:

test_data = data[-21*24:]  # The last 21 days form the test set
train_data = data[:-21*24]  # The rest forms the training set

# The fields contained in the target columns
target_fields = ['cnt', 'casual', 'registered']

# Split the training set into feature variable columns and target feature columns
features, targets = train_data.drop(target_fields, axis=1), train_data[target_fields]

# Split the test set into feature variable columns and target feature columns
test_features, test_targets = test_data.drop(target_fields, axis=1), test_data[target_fields]

# Convert the data into NumPy arrays
X = features.values  # Convert data from pandas Dataframe to NumPy
Y = targets['cnt'].values
Y = Y.astype(float)

Y = np.reshape(Y, [len(Y),1])
losses = []

3.3.2 Neural network construction

With the data processed, we now construct a new artificial neural network. The network has three layers: an input layer, a hidden layer, and an output layer. The dimensions (numbers of neurons) of the layers are 56, 10, and 1 respectively (as shown in Figure 3.21). The numbers of neurons in the input and output layers are determined by the data, while the number of hidden neurons is determined by our estimate of the data's complexity. In general, the more complex the data and the larger the data volume, the more neurons are needed. However, too many neurons easily lead to overfitting.

Besides constructing the neural network manually with tensor operations, PyTorch provides ready-made functions that do the same thing automatically, and the code is much simpler:

# Define the neural network architecture: features.shape[1] input units, 10 hidden units, 1 output unit
input_size = features.shape[1]
hidden_size = 10
output_size = 1
batch_size = 128
neu = torch.nn.Sequential(
    torch.nn.Linear(input_size, hidden_size),
    torch.nn.Sigmoid(),
    torch.nn.Linear(hidden_size, output_size),
)

In this code, calling torch.nn.Sequential() constructs a neural network and stores it in the variable neu. torch.nn.Sequential() builds a multi-layer neural network from a sequence of computation modules. Here these modules are: a linear mapping from the input layer to the hidden layer, torch.nn.Linear(input_size, hidden_size); the nonlinear sigmoid function, torch.nn.Sigmoid(); and a linear mapping from the hidden layer to the output layer, torch.nn.Linear(hidden_size, output_size). It is worth noting that the layers inside Sequential do not correspond strictly to the layers of the neural network; rather, they refer to multiple computation steps, which correspond to the layers of the dynamic computation graph.

We can also use PyTorch’s built-in loss function:

cost = torch.nn.MSELoss()

torch.nn.MSELoss() is PyTorch's packaged loss function for computing the mean squared error (MSE). It is a function object assigned to the variable cost. During computation, we only need to call cost(x, y) to obtain the mean squared error between the prediction vector x and the target vector y.
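
For example, with two small made-up tensors, cost() gives the same number as the hand-written expression used in predictor 1.0:

import torch

cost = torch.nn.MSELoss()
pred = torch.tensor([[20.0], [35.0], [30.0]])
target = torch.tensor([[16.0], [40.0], [32.0]])

print(cost(pred, target))                 # tensor(15.)
print(torch.mean((pred - target) ** 2))   # same value, computed by hand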

In addition, PyTorch also comes with an optimizer to automatically implement the optimization algorithm:

optimizer = torch.optim.SGD(neu.parameters(), lr = 0.01)

torch.optim.SGD() calls PyTorch's stochastic gradient descent (SGD) algorithm as the optimizer. When initializing the optimizer, we need to pass in all the parameters to be optimized (here, all the weights and biases contained in the network neu, i.e. neu.parameters()) as well as the learning rate for gradient descent, lr = 0.01. With all the materials ready, we can start training.

Batch processing of data

There is, however, one more problem with the training loop. In the previous example, we fed all the data into the neural network at once in every training cycle. That is fine for small data sets, but we now have 16,875 records; if all of them are processed in every training cycle, we run into problems such as slow computation and non-converging iterations.

The usual solution is batch processing: divide all the data records into small batches of size batch_size, and feed the network one batch in each training step, as shown in Figure 3.22. The batch size depends on the complexity of the problem and the size of the data. In this case, we set batch_size = 128.

The training code after batch processing is as follows:

# Neural network training loop
losses = []
for i in range(1000):
    # The samples are split into batches of 128 points, read in batch by batch during the loop
    batch_loss = []
    # start and end are the start and end indices used to extract one batch of data
    for start in range(0, len(X), batch_size):
        end = start + batch_size if start + batch_size < len(X) else len(X)
        xx = Variable(torch.FloatTensor(X[start:end]))
        yy = Variable(torch.FloatTensor(Y[start:end]))
        predict = neu(xx)
        loss = cost(predict, yy)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        batch_loss.append(loss.data.numpy())

    # Record and print the loss value every 100 epochs
    if i % 100 == 0:
        losses.append(np.mean(batch_loss))
        print(i, np.mean(batch_loss))

# Plot the loss values
plt.plot(np.arange(len(losses))*100, losses)
plt.xlabel('epoch')
plt.ylabel('MSE')

Running this program trains the neural network. Figure 3.23 shows how the loss function declines as training proceeds. The horizontal axis is the training epoch and the vertical axis is the mean error. It can be seen that the mean error drops rapidly as training progresses.

3.3.3 Testing the neural network

We can then use the trained neural network to make predictions on the test set, and plot the prediction data for the next 21 days against the real data for comparison.

targets = test_targets['cnt']  # Read the cnt values of the test set
targets = targets.values.reshape([len(targets), 1])  # Convert the data into the proper tensor shape
targets = targets.astype(float)  # Make sure the data are floating-point numbers

# Wrap the feature variables and target variables in Variables
x = Variable(torch.FloatTensor(test_features.values))
y = Variable(torch.FloatTensor(targets))

# Use neural networks for prediction
predict = neu(x)
predict = predict.data.numpy()

fig, ax = plt.subplots(figsize = (10, 7))

mean, std = scaled_features['cnt']
ax.plot(predict * std + mean, label='Prediction')
ax.plot(targets * std + mean, label='Data')
ax.legend()
ax.set_xlabel('Date-time')
ax.set_ylabel('Counts')
dates = pd.to_datetime(rides.loc[test_data.index]['dteday'])
dates = dates.apply(lambda d: d.strftime('%b %d'))
ax.set_xticks(np.arange(len(dates))[12::24])
_ = ax.set_xticklabels(dates[12::24], rotation=45)

The comparison between the actual and predicted curves is shown in Figure 3.24, where the horizontal axis is the date and the vertical axis is the predicted or actual value. The dashed line is the prediction curve and the solid line is the actual data.

It can be seen that the two curves are basically consistent, but there is a large deviation between the actual value and the predicted value in the days around December 25. Why was the performance so poor during this period?

A closer look at the data shows that December 25 is Christmas Day. For European and American countries, Christmas is the equivalent of China's Spring Festival, and around the Christmas holiday people's travel habits differ greatly from usual. However, because the whole data set covers only two years, the period around Christmas appears in the training sample only once, which makes it impossible for the model to predict the pattern of this special holiday well.

3.4 Anatomy of neural network Neu

By rights, our work is now finished. However, we would also like to gain a deeper understanding of how the artificial neural network works. Therefore, we will dissect the trained network Neu to see why it performs well on some data and poorly on other data.

To us, what happens inside the neural network during training is a complete black box, but the weights of the network's edges are actually stored in the computer's memory, and we can extract the data we are interested in and analyze it.

We define a function feature(), which extracts the parameters stored on the edges and nodes of the neural network and computes the hidden layer activations for a given input. The code is as follows:

def feature(X, net):
    # define a function to extract the weight information of the network. All network parameter information is stored in the named_parameters collection of NEU
    X = Variable(torch.from_numpy(X).type(torch.FloatTensor), requires_grad = False)
    dic = dict(net.named_parameters())  # Extract the collection of named parameters as a dictionary
    weights = dic['0.weight']  # Use the "layer number.name" key to index the corresponding parameter values
    biases = dic['0.bias']
    h = torch.sigmoid(X.mm(weights.t()) + biases.expand([len(X), len(biases)]))  # Hidden layer computation
    return h  # Return the hidden layer activations

In this code, we use the net.named_parameters() command to extract all the parameters of the neural network, including the weights and biases of each layer, and put them into a Python dictionary. They can then be retrieved as in the code above, for example with dic['0.weight'] and dic['0.bias'] to obtain all the weights and biases of the first layer. We can also obtain the names of all extractable parameters by iterating over the dictionary dic.
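
A quick way to see which names are available, sketched below, is simply to loop over named_parameters() of the trained network (assuming the network neu defined above):

# List every trainable parameter of the Sequential network and its shape
for name, param in neu.named_parameters():
    print(name, tuple(param.shape))
# For the 56-10-1 network this prints:
#   0.weight (10, 56)   weights from the input layer to the hidden layer
#   0.bias   (10,)      biases of the hidden layer
#   2.weight (1, 10)    weights from the hidden layer to the output layer
#   2.bias   (1,)       bias of the output unit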

Because the amount of data is large, we select part of it to feed into the neural network and extract the network's activation patterns. As we know, the forecast is inaccurate on December 22, 23, and 24. Therefore, we gather the data of these three days and store them in the variables subset and subtargets.

bool1 = rides['dteday'] == '2012-12-22'
bool2 = rides['dteday'] == '2012-12-23'
bool3 = rides['dteday'] == '2012-12-24'

# Combine the three Boolean arrays: keep a record if it falls on any of the three days
bools = [any(tup) for tup in zip(bool1, bool2, bool3)]
# Pull out the corresponding variable
subset = test_features.loc[rides[bools].index]
subtargets = test_targets.loc[rides[bools].index]
subtargets = subtargets['cnt']
subtargets = subtargets.values.reshape([len(subtargets),1])

We feed the data of these three days into the neural network, read out the activation values of the hidden layer neurons with the feature() function defined above, and store them in results. For readability, the normalized predicted values are converted back to the range of the original data.

# Input the data into the neural network, read the activation value of neurons in the hidden layer, and store it in results
results = feature(subset.values, neu).data.numpy()
# Predicted values corresponding to these data (output layer)
predict = neu(Variable(torch.FloatTensor(subset.values))).data.numpy()
# Restore the predicted values to the value range of the original data
mean, std = scaled_features['cnt']
predict = predict * std + mean
subtargets = subtargets * std + mean

Next, let’s draw the activation of the hidden layer neurons. Meanwhile, for comparison, we plotted these curves together with the values predicted by the model, and the visualized results are shown in Figure 3.25.

# Plot all neuron activation levels on the same graph
fig, ax = plt.subplots(figsize = (8, 6))
ax.plot(results[:, :], ':', alpha = 0.1)
ax.plot((predict - min(predict)) / (max(predict) - min(predict)), 'bo-', label='Prediction')
ax.plot((subtargets - min(predict)) / (max(predict) - min(predict)), 'ro-', label='Real')
ax.plot(results[:, 6], ':', alpha=1, label='Neuro 6')

ax.set_xlim(right=len(predict))
ax.legend()
plt.ylabel('Normalized Values')

dates = pd.to_datetime(rides.loc[subset.index]['dteday'])
dates = dates.apply(lambda d: d.strftime('%b %d'))
ax.set_xticks(np.arange(len(dates))[12::24])
_ = ax.set_xticklabels(dates[12::24], rotation=45)

The curve labeled Prediction shows the model's predicted values, the curve labeled Real shows the actual values, and the dotted lines are the output values of the individual hidden neurons. We find that the output curve of Neuro 6 is close to the real output curve, so we may assume that this neuron contributes most to improving prediction accuracy.

We also want to know why Neuro 6 performs better and what determines its activation. To analyze the factors that influence it, we look at the weights pointing to it from the input layer, as shown in Figure 3.26.

We can visualize these weights using the following code.

# Find the neuron corresponding to the peak and plot its weights from the input layer
dic = dict(neu.named_parameters())
weights = dic['0.weight']
plt.plot(weights.data.numpy()[6, :],'o-')
plt.xlabel('Input Neurons')
plt.ylabel('Weight')

The result is shown in Figure 3.27. The horizontal axis indexes the different weights, i.e. the numbers of the input neurons; the vertical axis shows the trained edge weights. For example, the 10th position on the horizontal axis corresponds to the 10th input neuron, which corresponds to the input column that detects the weather category; the 32nd position corresponds to the hour-of-day input, also a type variable, which detects the 6 a.m. pattern. We can read a positive value on the vertical axis as excitation and a negative value as inhibition, so the peaks in the figure are the inputs that activate this neuron, and the troughs are the inputs that suppress it.

We see that the curve has large weights at hr_12, weekday_0, and weekday_6, which means that Neuro 6 detects whether it is 12 noon and whether it is Saturday or Sunday; if these conditions are met, the neuron is activated. In contrast, the neuron has large negative weights on the weathersit_3 and hr_6 inputs, meaning it is suppressed when it rains or snows and at 6 a.m. Checking the calendar, December 22 and 23, 2012 fell on a Saturday and Sunday, so Neuro 6 was activated on those days, and it contributed to correctly predicting the midday peaks then. However, with Christmas approaching, people probably went home early to prepare for the holiday, so this particular weekend did not show the usual high weekend demand for bikes; the activation of Neuro 6 therefore led to an overestimate of the midday bike count.

Similarly, we can find the cause of the over-predicted morning and evening peaks on December 24. We find that neuron No. 4 plays the main role, because its waveform is negatively correlated with just the morning and evening peaks of the prediction curve on the 24th, as shown in Figure 3.28.

Likewise, the weights feeding this neuron, and the pattern it detects, are shown in Figure 3.29.
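
Under the assumption that this is the hidden neuron at index 4 of the same weight matrix, its input weights can be visualized with the same recipe as before (our own sketch, mirroring the code above):

# Plot the input-layer weights feeding hidden neuron No. 4 (index 4 is our assumption)
dic = dict(neu.named_parameters())
weights = dic['0.weight']
plt.plot(weights.data.numpy()[4, :], 'o-')
plt.xlabel('Input Neurons')
plt.ylabel('Weight')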

This neuron detects a pattern similar to, but opposite to, that of Neuro 6: it is inhibited during morning and evening rush hours and activated on holidays and weekends. Looking further at the connection from the hidden layer to the output layer, we find that Neuro 4 has a negative output weight, but one that is not very large. As a result, it suppressed the morning and evening peaks on December 24, but the suppression was not strong enough, which is why the model still predicted peaks that did not occur.

So our analysis shows that the neural predictor Neu went wrong during these three days because of the unusual pattern of the Christmas holiday. December 24 is Christmas Eve, and the network's unit for suppressing morning and evening peaks is not restrained strongly enough during holidays, so the prediction is inaccurate. With more training data, the weight of neuron No. 4 might be adjusted even lower, which could improve the accuracy of the prediction.

3.5 Summary

In this chapter, we took the problem of predicting the number of shared bikes at a given location as the entry point and introduced the working principles of artificial neural networks. By adjusting the parameters of a neural network, we can obtain curves of arbitrary shape. We then tried to use a neural network with a single input and a single output to fit the shared bike data and make predictions.

However, the prediction was very poor. After analysis, we found that the feature variable used was merely the record index, which has nothing to do with the number of bikes, so the illusion of a perfect fit was just the result of overfitting. We then tried a new approach, using the feature variables in each record, including the weather, wind speed, day of the week, whether it is a holiday, and time of day, to predict bike usage, and this succeeded.

In the second attempt, we also learned how to split the data and how to implement the artificial neural network, the loss function, and the optimizer with PyTorch's built-in functions. At the same time, we introduced the concept of batch processing: the data are divided into batches, and in each training step a small batch is used to train the neural network and adjust its parameters. Batch processing not only speeds up the program but also lets the neural network adjust its parameters steadily.

Finally, we dissected the trained neural network and learned how the artificial neurons are activated under different conditions by detecting inherent patterns in the data. It also became clear that the network performs poorly on some data because situations such as special holidays are rarely encountered in the training data.

3.6 Q&A

The content of this book comes from Zhang Jiang's online deep learning course with PyTorch at the Jizhi AI Academy. To help readers quickly sort out their thinking or resolve common practical problems, we have selected representative questions raised by students in class, attached Zhang Jiang's answers, and placed these "Q&A" sections at the end of the relevant chapters. If readers run into similar questions while reading, we hope they can find answers here.

Q: Are more neurons better?

A: Certainly not; more is not always better. The predictive ability of a neural network model depends not only on the number of neurons but also on the structure of the network and on the input data.

Q: In the experiment to predict the usage of shared bikes, why do we do gradient clearing?

A: If the gradients are not cleared, backward() accumulates them. We perform backpropagation immediately after each training step, so we do not need to accumulate gradients across steps. If the gradients are not cleared, the model may fail to converge.
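
A tiny sketch (our own toy example) makes this accumulation visible: calling backward() twice without clearing doubles the stored gradient.

import torch

w = torch.tensor([1.0], requires_grad=True)
x = torch.tensor([2.0])

loss = (w * x).sum()
loss.backward()
print(w.grad)        # tensor([2.]) -- dloss/dw = x

loss = (w * x).sum()
loss.backward()
print(w.grad)        # tensor([4.]) -- the new gradient was ADDED to the old one

w.grad.data.zero_()  # this is why the training loops above clear the gradients each iteration
print(w.grad)        # tensor([0.])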

Q: For neural networks, can non-convergent functions also be approximated?

A: Yes, on a closed interval. On a closed interval a function cannot diverge without bound; it always has a bound, so a neural network model can approximate it. On an infinite interval the neural network model does not work, because the network has only a finite number of neurons available for fitting.

Q: In the case of predicting bike sharing, the model did not predict bike usage accurately enough during Christmas. So can we improve the accuracy of neural network prediction by increasing training data?

A: Yes. If the model were trained with more data that includes bike usage during Christmas, it would predict Christmas bike usage more accurately.

Q: Since models predicting the use of shared bikes can be analyzed and dissected, can every neural network be analyzed in the same way?

A: Not necessarily. Because the model for predicting the usage of shared bikes is relatively simple, there are only 10 hidden layer neurons. When the number of neurons in the network model is large or there are multiple layers of neurons, a “decision” of the neural network model will be difficult to attribute to a single neuron. At this time, it is difficult to analyze the neural network model in the way of “anatomy”.

Q: When training the neural network model, the ratio "training set : test set = k" was mentioned. What is a reasonable ratio k? Does k affect the convergence speed and the prediction error?

A: When the amount of data is relatively small, we generally choose the test set according to a 10:1 ratio. When the amount of data is large, for example more than 100,000 records, it is not necessary to divide the training and test sets strictly by that ratio.