
Preface

Motivation

Today there are many mature and useful deep learning frameworks, such as PyTorch, TensorFlow and MXNet, in which most of the popular algorithms and models are already implemented and ready to call. So why write a neural network by hand? I think it is still worthwhile. After learning the theory of deep learning, and back propagation in particular, it is hard to gain a deeper understanding of back propagation without implementing it yourself. Beyond that, in practice the task is sometimes not complex enough to justify pulling in a large framework, or your model needs to be trained and run on a small device, in which case you can implement a simple neural network directly in a lower-level language.

Prerequisites

  • A basic understanding of deep learning
  • Some familiarity with Python and its libraries, such as NumPy, Pandas and Matplotlib

Goals

Step by step, I will walk you through a hand-written implementation of a neural network in Python, explaining the key steps along the way to help you understand them.

import csv
import pandas as pd

Pandas is introduced here mainly for reading and inspecting the dataset; it is a powerful toolkit for analysing structured data. It comes up constantly in data mining and analysis work, so if you do not know it yet, it is worth learning.

Preparing the dataset

Before we start writing code, let's talk about the task that motivates this neural network; only with a reasonably complete understanding of the task can we come up with a solution to it. The task is to predict whether a person has heart disease, using a heart disease dataset from the UCI repository. You can download it here.

Data characteristics

headers = ['age', 'sex', 'chest_pain', 'resting_blood_pressure', 'serum_clolestoral', 'fasting_blood_sugar', 'resting_ecg_results', 'max_heart_rate_achieved', 'exercise_induced_angina', 'oldpeak', 'slope_of_the_peak', 'number_of_major_vessels', 'thal', 'heart_disease']

This sets the attribute names for each sample in the dataset; the last column is the label, which indicates whether the record corresponds to heart disease. If you are interested in these indicators you can look them up, but I will not go into them here.

heart_df = pd.read_csv('./data/heart.dat',sep=' ',names=headers)
heart_df.head()
age sex chest_pain resting_blood_pressure serum_clolestoral fasting_blood_sugar resting_ecg_results max_heart_rate_achieved exercise_induced_angina oldpeak slope_of_the_peak number_of_major_vessels thal heart_disease
0 70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2
1 67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 2.0 0.0 7.0 1
2 57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3 1.0 0.0 7.0 2
3 64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2 2.0 1.0 7.0 1
4 74.0 0.0 2.0 120.0 269.0 0.0 2.0 121.0 1.0 0.2 1.0 1.0 3.0 1

As developers we do not need to know much about these features to get good predictions from our model; this is one of the benefits of deep learning: you do not need a domain expert to process and feature-engineer the data. Note the label values in the last column: 1 means no heart disease and 2 means heart disease.
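As a quick check on the labels (a one-liner I added; it assumes the dataframe loaded above), you can count how many records fall into each class:

# how many samples carry each label value (1 = no heart disease, 2 = heart disease)
print(heart_df['heart_disease'].value_counts())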

# Check for missing values in each column
heart_df.isna().sum()
age                        0
sex                        0
chest_pain                 0
resting_blood_pressure     0
serum_clolestoral          0
fasting_blood_sugar        0
resting_ecg_results        0
max_heart_rate_achieved    0
exercise_induced_angina    0
oldpeak                    0
slope_of_the_peak          0
number_of_major_vessels    0
thal                       0
heart_disease              0
dtype: int64

Viewing data types

# Check the data type
heart_df.dtypes
age                        float64
sex                        float64
chest_pain                 float64
resting_blood_pressure     float64
serum_clolestoral          float64
fasting_blood_sugar        float64
resting_ecg_results        float64
max_heart_rate_achieved    float64
exercise_induced_angina    float64
oldpeak                    float64
slope_of_the_peak          float64
number_of_major_vessels    float64
thal                       float64
heart_disease                int64
dtype: object

So there are no missing values in the data, and every feature column is of type float64 while the label column is int64.

Splitting the data into training and test sets

After we have a general understanding of the data, such as what attributes it has, whether it is complete, and which data type each attribute uses, we can split the dataset into a training set and a test set. For splitting and standardising the data we use two utilities provided by scikit-learn: train_test_split and StandardScaler.

import numpy as np
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = heart_df.drop(columns=['heart_disease'])

# Remap heart_disease from 1/2 to 0/1; in binary classification it is conventional to use 0 and 1 for the two classes

heart_df['heart_disease'] = heart_df['heart_disease'].replace(1, 0)
heart_df['heart_disease'] = heart_df['heart_disease'].replace(2, 1)

# print(heart_df['heart_disease'].values.shape)
# (270,) to (270,1) [[0],[0]...]
y_label = heart_df['heart_disease'].values.reshape(X.shape[0], 1)

X_train, X_test, y_train,y_test = train_test_split(X,y_label,test_size=0.2,random_state=2)

sc = StandardScaler()
sc.fit(X_train)

X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

print(f"Shape of train set: {X_train.shape}")
print(f"Shape of test set: {X_test.shape}")
print(f"Shape of train label set: {y_train.shape}")
print(f"Shape of test label  set: {y_test.shape}")
Shape of train set: (216, 13)
Shape of test set: (54, 13)
Shape of train label set: (216, 1)
Shape of test label  set: (54, 1)

Defining the layers

Our neural network structure is relatively simple: a stack of layers. Without activation functions it is just a sequence of linear transformations applied to our data (matrices).

A layer in the middle of the network is mainly made up of its input, its output and its parameters (weights). As the figure shows, the edges between nodes are the parameters (weights), and when every input node is connected to every output node we have a fully connected layer. Each element (component) of the output vector is a linear transformation of all the input components: $y_j = \sum_{i} w_{ij} x_i + b_j$

class Layer:
    def __init__(self):
        self.input = None
        self.output = None

    def forward(self, input):
        raise NotImplementedError

    def backward(self, output_error, learning_rate):
        raise NotImplementedError

Defining the base layer

The base class Layer can be thought of as an interface: it defines what a layer looks like and which operations it must provide. Here forward and backward correspond to forward propagation and back propagation respectively. forward defines what computation the layer performs during the forward pass, and backward computes the gradients of the layer's input and parameters.

class FCLayer(Layer):
    def __init__(self, input_size, output_size):
        self.weights = np.random.rand(input_size, output_size) - 0.5
        self.bias = np.random.rand(1, output_size) - 0.5

    def forward(self, input_data):
        self.input = input_data
        self.output = np.dot(self.input, self.weights) + self.bias
        return self.output

    def backward(self, output_error, learning_rate):
        # gradient w.r.t. the input, passed back to the previous layer
        input_error = np.dot(output_error, self.weights.T)
        # gradient w.r.t. the weights
        weights_error = np.dot(self.input.T, output_error)

        self.weights -= learning_rate * weights_error
        self.bias -= learning_rate * output_error
        return input_error

We have implemented a fully connected layer. Forward propagation is simply $Y = XW + B$ (with $X$ a row vector of inputs); the key part is back propagation, where we need the derivatives with respect to the input, the weights and the bias, which is explained below.
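As a tiny illustration of the forward pass (my own example with made-up sizes, not from the original), a single sample flows through a fully connected layer like this:

# one sample with 3 features through a 3 -> 2 fully connected layer
layer = FCLayer(3, 2)
x = np.array([[0.5, -1.0, 2.0]])    # shape (1, 3): a single row vector
y = layer.forward(x)                # y = xW + b
print(y.shape)                      # (1, 2)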

class ActivationLayer(Layer):
    def __init__(self, activation, activation_prime):
        self.activation = activation
        self.activation_prime = activation_prime

    def forward(self, input_data):
        self.input = input_data
        self.output = self.activation(self.input)
        return self.output

    def backward(self, output_error, learning_rate):
        # the activation layer has no parameters; just pass the gradient through
        return self.activation_prime(self.input) * output_error

Back propagation

Forward propagation in a neural network is relatively easy to understand and implement: it is a linear transformation, and if the matrix form looks a little strange, a quick refresher on basic linear algebra is enough to make it clear. The hard part is back propagation, which involves matrix calculus; this example is comparatively simple, and I will list all the derivations, which are not hard to follow with a little background. Working through the formulas yourself is more convincing than reading them. You need to be familiar with the chain rule, and when you differentiate with respect to a particular parameter you have to consider every path from that parameter to the loss function.
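Before the matrix case, a one-dimensional example (my own illustration, not from the original) may make the chain rule concrete. With a single weight $w$, input $x$, output $y = wx + b$ and loss $E = (y - t)^2$ for a target $t$, there is only one path from $w$ to $E$, so

$$\frac{\partial E}{\partial w} = \frac{\partial E}{\partial y}\,\frac{\partial y}{\partial w} = 2(y - t)\,x$$

The matrix formulas that follow are just this idea applied to every weight, bias and input at once.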

Derivative of the loss with respect to the weights


$$y_j = b_j + \sum_{i} x_i w_{ij}$$

Here we take the partial derivative of the loss with respect to $w_{ij}$ as an example. As the figure above shows, the path is relatively simple: $w_{ij}$ only affects $y_j$, which in turn connects to the output and the loss.


$$\frac{\partial E}{\partial W} = \begin{bmatrix} \frac{\partial E}{\partial w_{11}} & \cdots & \frac{\partial E}{\partial w_{1j}} \\ \vdots & \ddots & \vdots \\ \frac{\partial E}{\partial w_{i1}} & \cdots & \frac{\partial E}{\partial w_{ij}} \end{bmatrix}$$

Differentiating the parameters with respect to the loss means taking the derivative of the scalar loss $E$ with respect to each element of the parameter matrix, so $\partial E / \partial W$ has the same shape as $W$.


$$y_j = \sum_{i} w_{ij} x_i = x_1 w_{1j} + x_2 w_{2j} + \cdots + x_i w_{ij}$$

Let's use the figure to build intuition. The $j$-th component of the output vector $Y$ is the sum of every element (feature) of the input $X$ multiplied by its corresponding weight, so in $y_j$ the coefficient of $w_{ij}$ is $x_i$.


$$\frac{\partial y_j}{\partial w_{ij}} = x_i$$


$$\frac{\partial E}{\partial y} = \sum_{j} \frac{\partial E}{\partial y_j}$$

Then applying the chain rule gives the following formula:


$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial w_{ij}} = \frac{\partial E}{\partial y_j} x_i$$

Applying this to every entry of the matrix $W$ and writing the result in matrix form:


$$\frac{\partial E}{\partial W} = \begin{bmatrix} \frac{\partial E}{\partial y_{1}} x_1 & \cdots & \frac{\partial E}{\partial y_{j}} x_1 \\ \vdots & \ddots & \vdots \\ \frac{\partial E}{\partial y_{1}} x_i & \cdots & \frac{\partial E}{\partial y_{j}} x_i \end{bmatrix} = \begin{bmatrix} x_1 \\ \vdots \\ x_i \end{bmatrix} \begin{bmatrix} \frac{\partial E}{\partial y_{1}} & \cdots & \frac{\partial E}{\partial y_{j}} \end{bmatrix} = X^T \frac{\partial E}{\partial Y}$$
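If you are unsure about this result, a quick numerical check may help (a sketch I added for illustration; the variable names and values are made up and do not appear in the original article). It compares $X^T \frac{\partial E}{\partial Y}$ with a finite-difference estimate of $\partial E / \partial W$ for a tiny layer with an MSE loss:

import numpy as np

# toy problem: one sample X (1 x 3), weights W (3 x 2), a made-up target T
rng = np.random.default_rng(0)
X = rng.normal(size=(1, 3))
W = rng.normal(size=(3, 2))
B = rng.normal(size=(1, 2))
T = rng.normal(size=(1, 2))

def loss(W):
    # mean squared error of Y = XW + B against the target T
    return np.mean((X @ W + B - T) ** 2)

# analytic gradient from the formula above: dE/dW = X^T (dE/dY)
dE_dY = 2 * (X @ W + B - T) / T.size
analytic = X.T @ dE_dY

# numerical gradient by central finite differences
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.allclose(analytic, numeric))   # should print True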

Derivative of the loss with respect to the bias

Next is the bias. Since $b_j$ only affects $y_j$, the gradient is relatively easy to compute; I will not explain it in much detail here, but if you are interested please leave me a message.


$$\frac{\partial E}{\partial B} = \begin{bmatrix} \frac{\partial E}{\partial b_{1}} & \frac{\partial E}{\partial b_{2}} & \cdots & \frac{\partial E}{\partial b_{j}} \end{bmatrix}$$


$$\frac{\partial E}{\partial b_j} = \frac{\partial E}{\partial y_j}$$


$$\frac{\partial E}{\partial B} = \begin{bmatrix} \frac{\partial E}{\partial y_1} & \frac{\partial E}{\partial y_2} & \cdots & \frac{\partial E}{\partial y_j} \end{bmatrix} = \frac{\partial E}{\partial Y}$$

Derivative of the loss with respect to the input X

Back propagation updates parameters using the gradients of the parameters, so why do we also differentiate with respect to the input variables? Because the gradient of the loss has to be passed down layer by layer, we also need the gradient with respect to the layer's input.


$$\frac{\partial E}{\partial X} = \begin{bmatrix} \frac{\partial E}{\partial x_{1}} & \frac{\partial E}{\partial x_{2}} & \cdots & \frac{\partial E}{\partial x_{i}} \end{bmatrix}$$

To differentiate with respect to $X$, we take the partial derivative of $E$ with respect to each element of the sample and then collect them into $\partial E / \partial X$.

Again, look at the figure: for each element of $X$ we consider every path from that element to the output, take the derivative along each path, and sum them.


$$\frac{\partial E}{\partial x_i} = \frac{\partial E}{\partial y_1} \frac{\partial y_1}{\partial x_{i}} + \cdots + \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial x_{i}}$$

$$\frac{\partial E}{\partial x_i} = \frac{\partial E}{\partial y_1} w_{i1} + \cdots + \frac{\partial E}{\partial y_j} w_{ij}$$

If this derivation is not obvious, you can work it out by following the diagram above.


$$\frac{\partial E}{\partial X} = \begin{bmatrix} \left( \frac{\partial E}{\partial y_1} w_{11} + \cdots + \frac{\partial E}{\partial y_j} w_{1j} \right) & \cdots & \left( \frac{\partial E}{\partial y_1} w_{i1} + \cdots + \frac{\partial E}{\partial y_j} w_{ij} \right) \end{bmatrix}$$

This can be written compactly as a matrix product:


$$\frac{\partial E}{\partial X} = \frac{\partial E}{\partial Y} W^T$$
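To see the weight, bias and input gradients side by side on concrete numbers, here is a toy example I added (the values and variable names are made up purely for illustration):

import numpy as np

# one sample with two features, a 2 -> 2 weight matrix, and an upstream gradient dE/dY
X     = np.array([[1.0, 2.0]])
W     = np.array([[0.1, 0.2],
                  [0.3, 0.4]])
dE_dY = np.array([[0.5, -0.5]])

dE_dW = X.T @ dE_dY     # shape (2, 2), same as W
dE_dB = dE_dY           # shape (1, 2), same as B
dE_dX = dE_dY @ W.T     # shape (1, 2), same as X

print(dE_dW)    # [[ 0.5 -0.5]
                #  [ 1.  -1. ]]
print(dE_dX)    # [[-0.05 -0.05]]

These are exactly the three quantities FCLayer.backward works with: weights_error, the bias update, and input_error.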

Derivative through the activation function


$$Y = \begin{bmatrix} f(x_1) & \cdots & f(x_i) \end{bmatrix}$$

$$\frac{\partial E}{\partial X} = \begin{bmatrix} \frac{\partial E}{\partial x_{1}} & \frac{\partial E}{\partial x_{2}} & \cdots & \frac{\partial E}{\partial x_{i}} \end{bmatrix} = \begin{bmatrix} \frac{\partial E}{\partial y_{1}} \frac{\partial y_1}{\partial x_{1}} & \frac{\partial E}{\partial y_{2}} \frac{\partial y_2}{\partial x_{2}} & \cdots & \frac{\partial E}{\partial y_{i}} \frac{\partial y_i}{\partial x_{i}} \end{bmatrix} = \begin{bmatrix} \frac{\partial E}{\partial y_{1}} f^{\prime}(x_1) & \frac{\partial E}{\partial y_{2}} f^{\prime}(x_2) & \cdots & \frac{\partial E}{\partial y_{i}} f^{\prime}(x_i) \end{bmatrix} = \frac{\partial E}{\partial Y} \odot f^{\prime}(X)$$

where $\odot$ is the element-wise product, since the activation is applied element by element ($y_k = f(x_k)$). This is exactly what ActivationLayer.backward computes.

Loss function


$$E = \frac{1}{n} \sum_{i}^{n} (y^{*}_i - y_i)^2$$


$$\frac{\partial E}{\partial Y} = \frac{2}{n}(Y - Y^*)$$

def tanh(x):
    return np.tanh(x)

def tanh_prime(x):
    return 1 - np.tanh(x)**2

def mse(y_true, y_pred):
    return np.mean(np.power(y_true - y_pred, 2))

def mse_prime(y_true, y_pred):
    return 2 * (y_pred - y_true) / y_true.size
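As a quick sanity check of the two derivative functions (my own addition, not part of the original), you can compare them against central finite differences:

eps = 1e-6

# tanh_prime should match the numerical derivative of tanh
x = 0.3
print(tanh_prime(x), (tanh(x + eps) - tanh(x - eps)) / (2 * eps))

# mse_prime should match the numerical derivative of mse with respect to y_pred
y_true, y_pred = np.array([[1.0]]), np.array([[0.2]])
numeric = (mse(y_true, y_pred + eps) - mse(y_true, y_pred - eps)) / (2 * eps)
print(mse_prime(y_true, y_pred), numeric)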

Defining the network structure

The network structure here is a simple sequence of layers, so in the constructor we maintain a list, and the add method appends fully connected layers and activation layers to the network. The use method accepts two functions, one for computing the loss value and one for computing its gradient during back propagation. The fit method trains the network, updating the parameters in every layer according to the gap between the model output and the true value, while predict runs the model on input data.

class Network:

    def __init__(self):
        self.layers = []
        self.loss = None
        self.loss_prime = None

    def add(self, layer):
        self.layers.append(layer)

    def use(self, loss, loss_prime):
        self.loss = loss
        self.loss_prime = loss_prime

    def predict(self, input_data):
        samples = len(input_data)
        result = []

        for i in range(samples):
            # forward propagate each sample through every layer
            output = input_data[i]
            for layer in self.layers:
                output = layer.forward(output)
            result.append(output)
        return result

    def fit(self, x_train, y_train, epochs, learning_rate):
        samples = len(x_train)

        for i in range(epochs):
            err = 0
            for j in range(samples):
                # forward pass
                output = x_train[j]
                for layer in self.layers:
                    output = layer.forward(output)
                err += self.loss(y_train[j], output)

                # backward pass: push the loss gradient through the layers in reverse order
                error = self.loss_prime(y_train[j], output)
                for layer in reversed(self.layers):
                    error = layer.backward(error, learning_rate)
            err /= samples
            print(f"epoch {(i+1)/epochs} error ={err}")

Before tackling the heart disease data, let's try the network on a simple dataset, XOR, and see how it does.

x_train = np.array([[[0, 0]], [[0, 1]], [[1, 0]], [[1, 1]]])
y_train = np.array([[[0]], [[1]], [[1]], [[0]]])

The infrastructure is ready; the next step is to assemble a network from these modules, starting with a small neural network for the XOR task.

net = Network()
net.add(FCLayer(2, 3))
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(3, 1))
net.add(ActivationLayer(tanh, tanh_prime))

net.use(mse, mse_prime)
net.fit(x_train, y_train, epochs=2000, learning_rate=0.01)

out = net.predict(x_train)
print(out)
epoch 0.0005 error =0.7136337374260526
epoch 0.001 error =0.5905739289052101
epoch 0.0015 error =0.49847803406876834
epoch 0.002 error =0.4315229169435869
epoch 0.0025 error =0.3833645522357618
epoch 0.003 error =0.34869455999659626
epoch 0.0035 error =0.3235475456763668
epoch 0.004 error =0.3051103124983356
epoch 0.0045 error =0.29142808210519544
error =0.28114855069402345
[array([[0.01603471]]), array([[0.884377]]), array([[0.89208809]]), array([[0.02946468]])]

Judging by the predictions, the network does learn something over 2,000 iterations and gives a reasonable answer.

[array([[0.01603471]]), array([[0.884377]]), array([[0.89208809]]), array([[0.02946468]])]
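To turn these raw outputs into class labels you can threshold them at 0.5 (a small helper I added, not in the original code):

# round each prediction to 0 or 1 with a 0.5 threshold
preds = [int(o[0, 0] > 0.5) for o in out]
print(preds)   # [0, 1, 1, 0]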

Give it a try

A small piece of homework for you, ha ha. I wrote a network and gave it a quick run on the heart disease dataset; it does not perform very well yet, so feel free to tune it and run it yourself.

X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1])
net2 = Network()
net2.add(FCLayer(13, 8))
net2.add(ActivationLayer(tanh, tanh_prime))
net2.add(FCLayer(8, 5))
net2.add(ActivationLayer(tanh, tanh_prime))
net2.add(FCLayer(5, 2))
net2.add(ActivationLayer(tanh, tanh_prime))

net2.use(mse, mse_prime)
# train on y_train (the labels matching X_train), not the full y_label
net2.fit(X_train, y_train, epochs=35, learning_rate=0.1)
X_test = X_test.reshape(X_test.shape[0], 1, X_test.shape[1])
output = net2.predict(X_test[0:3])
output
[array([[0.82602593, 0.82602593]]), array([[0.17209814, 0.17209814]]), array([[0.95838074, 0.95838074]])]
y_test[:3]

The corresponding true labels are as follows:

array([[1],
       [0],
       [0]])
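To get a fuller picture than three samples, you could score the whole test set (a rough sketch I added, using the same 0.5 threshold on the first output component):

# predict the entire test set and compare against y_test
outputs = net2.predict(X_test)
preds = np.array([int(o[0, 0] > 0.5) for o in outputs]).reshape(-1, 1)
print(f"test accuracy: {np.mean(preds == y_test):.2f}")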

If you have any questions about the above, please leave a message. This was written fairly hastily, so if anything is missing or wrong, corrections and criticism are very welcome.