
Definition

Neural networks are not very complicated; the simplest ones amount to generalized linear regression plus gradient descent.

Applications: Neural networks are used very widely. The simplest use is data prediction, and because of their computational topology they can also be used for classification. For example, if we extract features from an image and then predict and classify on those features, we get simple image recognition (of course, there are also convolutional neural networks and so on for that).

This post covers the simplest neural network algorithm and implements one by hand (in Python).

Basic principles

A basic transition

In fact, the simplest description of how a neural network works is that it is a brute-force approach.

To take a simple example (using gradient descent to demonstrate simple linear regression), suppose we have a set of inputs

x = [1 2 3 4 5]

and the corresponding set of outputs

y = [1 2 3 4 5]

Let's assume the relationship is y = w*x + b.

Now all we need to do is find the values of w and b.

The simplest approach, as you probably know, is to generate random values for w and b and then adjust them according to the error. To do that we need a way to measure the error; for example, we can compute D(y actual, y predicted) as the squared difference between the actual and predicted values.

Say x = 1 (so the actual y is 1), w = 2 and b = 1. (Of course, in practice we work with averages over all the x and y values rather than a single point.) Then we get

y predicted = 2*1 + 1 = 3, squared error = (3 - 1)^2 = 4

And then we can apply the gradient descent algorithm

This squared error is called the loss function: Loss = (w*x + b - y)^2

Take the partial derivatives with respect to w and b:

∂Loss/∂w = 2 * x * (w*x + b - y)

∂Loss/∂b = 2 * (w*x + b - y)

Then update each value by taking a step against its partial derivative:

w' = w - step * ∂Loss/∂w

b' = b - step * ∂Loss/∂b

Then we enter the loop and eventually end up with w = 1 and b = 0.
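The loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used later in the post: the function name `fit_line`, the step size of 0.02, and the iteration count are assumptions chosen so that the example converges, and the gradient is averaged over all five samples.

```python
import numpy

# Data from the example above: y equals x, so the ideal fit is w = 1, b = 0.
x = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0])

def fit_line(x, y, step=0.02, iterations=5000, seed=0):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    rng = numpy.random.default_rng(seed)
    w, b = rng.normal(), rng.normal()      # random starting guess
    for _ in range(iterations):
        error = w * x + b - y              # prediction minus actual value
        grad_w = (2 * error * x).mean()    # averaged dLoss/dw
        grad_b = (2 * error).mean()        # averaged dLoss/db
        w -= step * grad_w
        b -= step * grad_b
    return w, b

w, b = fit_line(x, y)
print(round(w, 3), round(b, 3))
```

A step that is too large makes the loop diverge for this data, which is why a fairly small value is assumed here.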

It may look like I am just showing you how to use gradient descent, but if we draw the process as a flow chart, you can see the following.

This is essentially a “neural network” with only one node in the middle.

Basic neural network structure

At this point we can introduce the concept of a neural network. In the previous computation, w and b started from random initial values (starting from 0 by default is also common) and passed through a single node. For more complex fitting problems, different initial values of w and b may lead to different results, and a single node does not seem rigorous enough. So, borrowing from biology (actually, I prefer to say that two heads are better than one), we can use multiple nodes: split the x value according to weights, feed it into the nodes for computation, and then combine the results according to weights again. Each node ends up with its own values of w and b, and together they make up the final model.

So it gets a little more complicated and looks like this:

Then you find that this is still not enough, and online you see plenty of intimidating diagrams like this

or

But even so, the general principle is the same.
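To make the "split by weight, compute in each node, combine by weight" idea concrete, here is a rough sketch of a forward pass through any number of layers. It assumes each node squashes its weighted sum with the sigmoid function (the activation chosen later in this post); the `forward` helper and the layer sizes 2, 3 and 1 are hypothetical and only serve the illustration.

```python
import numpy

def sigmoid(x):
    return 1 / (1 + numpy.exp(-x))

def forward(x, layers):
    """Pass x through a list of (weights, biases) pairs, one pair per layer."""
    activation = x
    for weights, biases in layers:
        # Each row of `weights` belongs to one node: weighted sum, then squash.
        activation = sigmoid(weights @ activation + biases)
    return activation

rng = numpy.random.default_rng(0)
# Hypothetical sizes: 2 inputs -> 3 hidden nodes -> 1 output node.
layers = [
    (rng.normal(size=(3, 2)), rng.normal(size=3)),
    (rng.normal(size=(1, 3)), rng.normal(size=1)),
]
output = forward(numpy.array([1.0, 2.0]), layers)
print(output.shape, output)
```

Adding a layer only means appending another (weights, biases) pair, which is why the bigger diagrams follow the same principle.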

Implementation

Instead of writing my own demo, I used a simple demo that someone else had implemented. After all, a hand-written implementation is still not as good as a ready-made framework, which has many details to take care of, but the basic principle is the same.

The goal

The neural network we are going to simulate this time has two inputs, a hidden layer with two nodes (h1 and h2), and one output node (o1).

As the previous example should have made clear, the first thing we need to do is select (or guess) a suitable fitting equation for the hidden layer.

Here it is called the activation function.

We choose the most common one, the sigmoid:

def sigmoid (x) :
    return 1/ (1+numpy.exp(-x))

This is also the activation function I used directly in mathematical modeling (although there I relied on the MATLAB toolbox).
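For a feel of what the sigmoid does, the short sketch below evaluates it at a few arbitrarily chosen inputs. It only illustrates the function defined above: every output lies between 0 and 1, and an input of 0 maps to 0.5.

```python
import numpy

def sigmoid(x):
    return 1 / (1 + numpy.exp(-x))

# A few arbitrary inputs, from strongly negative to strongly positive.
inputs = numpy.array([-10.0, -1.0, 0.0, 1.0, 10.0])
outputs = sigmoid(inputs)
for value, squashed in zip(inputs, outputs):
    print(f"sigmoid({value:+.0f}) = {squashed:.5f}")
```

Large negative inputs land close to 0 and large positive inputs close to 1, which is what makes the output convenient to read as a score for classification.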

Using the activation function

Now that we have chosen an activation function, what we do is weight the inputs x, plug the weighted sum into the activation function, and pass the result on, again weighted, to the output. Of course, we don't know the weights yet; training exists precisely to determine the weights of each layer, and in the end we can derive a very complicated equation. (Yes, I did the same thing in mathematical modeling: since there was no multi-objective optimization, I fitted the weight relationship between the targets directly with a neural network, obtained a very complex equation, and ran a genetic algorithm on it.) Reference: www.nnetinfo.com/text/show/4…

This is what the network in that picture looks like in code:

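def feedforward(self,x) :
        # hidden node 1: weighted sum of the inputs, then activation
        h1 = x[0]*self.w1+x[1]*self.w2+self.b1
        h1f = sigmoid(h1)
        # hidden node 2
        h2 = x[0]*self.w3+x[1]*self.w4+self.b2
        h2f = sigmoid(h2)
        # output node: weighted sum of the hidden outputs, then activation
        o1 = h1f*self.w5+h2f*self.w6+self.b3
        of = sigmoid(o1)
        # return every intermediate value; training needs them for the derivatives
        return h1,h1f,h2,h2f,o1,of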


We initialize the weights and biases with random values first and train them later.

class nerualnetwo() :
    def __init__(self) :
       self.w1 = numpy.random.normal()
       self.w2 = numpy.random.normal()
       self.w3 = numpy.random.normal()
       self.w4 = numpy.random.normal()
       self.w5 = numpy.random.normal()
       self.w6 = numpy.random.normal()
       self.b1 = numpy.random.normal()
       self.b2 = numpy.random.normal()
       self.b3 = numpy.random.normal()

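Because the starting values are random, two runs of the network will generally train along different paths. If repeatable runs are wanted, one option (not something the original code does) is to draw the starting values from a seeded generator. The helper `random_parameters` and the seed value 42 below are hypothetical.

```python
import numpy

def random_parameters(seed):
    """Draw the nine starting values (w1..w6, b1..b3) from a seeded generator."""
    rng = numpy.random.default_rng(seed)
    names = ["w1", "w2", "w3", "w4", "w5", "w6", "b1", "b2", "b3"]
    return {name: rng.normal() for name in names}

first = random_parameters(seed=42)
second = random_parameters(seed=42)
print(first == second)   # same seed, same starting values
```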

Loss function

As in the previous example, how do I know whether the values of w and b are any good? I need a way to measure the error so that I can correct them, and again I use the mean squared error:

def mse_loss(y_tr,y_pre) :
    return((y_tr - y_pre)**2).mean()
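As a quick sanity check of the loss function, here is a small usage sketch. The prediction vector is made up for illustration: it guesses 0 for every sample, so two of the four answers are off by 1.

```python
import numpy

def mse_loss(y_tr, y_pre):
    return ((y_tr - y_pre) ** 2).mean()

y_true = numpy.array([1, 0, 0, 1])
y_guess = numpy.array([0, 0, 0, 0])   # a made-up prediction, all zeros
loss = mse_loss(y_true, y_guess)
print(loss)   # (1 + 0 + 0 + 1) / 4 = 0.5
```

A perfect prediction gives a loss of 0, so training amounts to pushing this number down.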

Feedback optimization (based on gradient descent)

Strictly speaking, gradient descent is not the only way to optimize the weights of each node, but it is the easiest to implement and the easiest to understand. (It is also the one I know how to write.)

This is where gradient descent comes in, but with one difference from before.

We now have two levels of gradient descent:

Input to the hidden layer

Hidden layer to output

So there are two functions we need to differentiate: the activation function and the loss function.

def der_sigmoid(x) :
    return sigmoid(x)*(1-sigmoid(x))


der_L_y_pre = -2*(y_tr-y_pre)

Notice that -2*(y_tr-y_pre) is the partial derivative of the loss (y_tr-y_pre)^2 with respect to the prediction y_pre.
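One way to convince yourself that both derivative formulas are right is to compare them with a numerical estimate. The sketch below is only a check, not part of the original demo: the helper `numeric_derivative` and the test values are assumptions picked for illustration.

```python
import numpy

def sigmoid(x):
    return 1 / (1 + numpy.exp(-x))

def der_sigmoid(x):
    return sigmoid(x) * (1 - sigmoid(x))

def numeric_derivative(f, x, eps=1e-6):
    """Central finite difference: (f(x + eps) - f(x - eps)) / (2 * eps)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 0.7                      # arbitrary test point
y_tr, y_pre = 1.0, 0.3       # arbitrary target and prediction

# Gap between the formula and the numerical estimate, for each derivative.
sigmoid_gap = abs(der_sigmoid(x) - numeric_derivative(sigmoid, x))
loss_gap = abs(-2 * (y_tr - y_pre)
               - numeric_derivative(lambda p: (y_tr - p) ** 2, y_pre))
print(sigmoid_gap, loss_gap)
```

Both gaps should come out tiny, limited only by floating-point rounding.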

Because there are two layers, each gradient descent update is a product of several derivatives, like this:

self.w1 -= learn_rate * der_L_y_pre * der_y_pre_h1 * der_h1_w1

self.b1 -= learn_rate * der_L_y_pre * der_y_pre_h1 * der_h1_b1

der_h1_b1 = der_sigmoid(valcell[0])

der_h1_w1 = der_sigmoid(valcell[0])*x[0]

**This treatment is called the chain rule for partial derivatives!**

These are the partial derivatives with respect to each w and b.

And valcell holds h1, h1f, h2, h2f, o1, of, in that order.
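The chain-rule product for w1 can be checked against a numerical gradient on a single sample. This is a rough sketch under stated assumptions: the dictionary `p` of parameters, the `forward` helper, and the sample values are hypothetical stand-ins for the class used in this post, chosen so the snippet runs on its own.

```python
import numpy

def sigmoid(x):
    return 1 / (1 + numpy.exp(-x))

def der_sigmoid(x):
    return sigmoid(x) * (1 - sigmoid(x))

def forward(p, x):
    """Same 2-2-1 network as in the post; p maps names like 'w1' to values."""
    h1 = x[0] * p["w1"] + x[1] * p["w2"] + p["b1"]
    h2 = x[0] * p["w3"] + x[1] * p["w4"] + p["b2"]
    o1 = sigmoid(h1) * p["w5"] + sigmoid(h2) * p["w6"] + p["b3"]
    return h1, o1, sigmoid(o1)

rng = numpy.random.default_rng(1)
names = ["w1", "w2", "w3", "w4", "w5", "w6", "b1", "b2", "b3"]
p = {name: rng.normal() for name in names}   # arbitrary parameters
x, y_tr = numpy.array([-2.0, -1.0]), 1.0     # one sample

# Analytic gradient of the loss with respect to w1 via the chain rule.
h1, o1, y_pre = forward(p, x)
analytic = (-2 * (y_tr - y_pre)) * (der_sigmoid(o1) * p["w5"]) * (der_sigmoid(h1) * x[0])

# Numeric gradient: nudge w1 slightly and watch the loss change.
eps = 1e-6
def loss_with_w1(value):
    return (y_tr - forward({**p, "w1": value}, x)[2]) ** 2
numeric = (loss_with_w1(p["w1"] + eps) - loss_with_w1(p["w1"] - eps)) / (2 * eps)
print(analytic, numeric)
```

The three factors in `analytic` correspond to der_L_y_pre, der_y_pre_h1 and der_h1_w1, and the two printed numbers should agree to several decimal places.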

Think it through!

Gradient correction

That is easy to say, but the whole process has to run in a loop; here I simply run it 1,000 times.

This is the heart of the operation:

def train(self,data,all_y_tr) :
        epochs = 1000
        learn_rate = 0.1
        for i in range(epochs):
            for x , y_tr in zip(data,all_y_tr):
                valcell = self.feedforward(x)
                y_pre = valcell[5]
                der_L_y_pre = -2*(y_tr-y_pre)
                der_y_pre_h1 = der_sigmoid(valcell[4])*self.w5
                der_y_pre_h2 = der_sigmoid(valcell[4])*self.w6
                der_h1_w1 = der_sigmoid(valcell[0])*x[0]
                der_h1_w2 = der_sigmoid(valcell[0])*x[1]
                der_h2_w3 = der_sigmoid(valcell[2])*x[0]
                der_h2_w4 = der_sigmoid(valcell[2])*x[1]
                der_y_pre_w5 = der_sigmoid(valcell[4])*valcell[1]
                der_y_pre_w6 = der_sigmoid(valcell[4])*valcell[3]
                der_y_pre_b3 = der_sigmoid(valcell[4])
                der_h1_b1 = der_sigmoid(valcell[0])
                der_h2_b2 = der_sigmoid(valcell[2])
                # update the weights and biases
                self.w1 -= learn_rate * der_L_y_pre * der_y_pre_h1 * der_h1_w1
                self.w2 -= learn_rate * der_L_y_pre * der_y_pre_h1 * der_h1_w2
                self.w3 -= learn_rate * der_L_y_pre * der_y_pre_h2 * der_h2_w3
                self.w4 -= learn_rate * der_L_y_pre * der_y_pre_h2 * der_h2_w4
                self.w5 -= learn_rate * der_L_y_pre * der_y_pre_w5
                self.w6 -= learn_rate * der_L_y_pre * der_y_pre_w6
                self.b1 -= learn_rate * der_L_y_pre * der_y_pre_h1 * der_h1_b1
                self.b2 -= learn_rate * der_L_y_pre * der_y_pre_h2 * der_h2_b2
                self.b3 -= learn_rate * der_L_y_pre *der_y_pre_b3
            # print the current loss once every 10 epochs
            if i % 10 == 0:
                y_pred = numpy.apply_along_axis(self.simulate,1,data)
                loss = mse_loss(all_y_tr, y_pred)
                print(i,loss)


At this point training is finished and we know the weights of each layer, so we can use them for prediction.

The prediction function

def simulate (self,x) :
        # same forward pass as feedforward, but only the final output is returned
        h1 = x[0]*self.w1+x[1]*self.w2+self.b1
        h1f = sigmoid(h1)
        h2 = x[0]*self.w3+x[1]*self.w4+self.b2
        h2f = sigmoid(h2)
        o1 = h1f*self.w5+h2f*self.w6+self.b3
        of = sigmoid(o1)
        return of


This is the model that actually does the computation: once you have trained the model and know the weights, you just plug the input into the equation.
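As a sketch of what "plugging into the equation" looks like in practice, the snippet below runs the same forward pass with a fixed set of weights and turns the scores into class labels. The weights and samples are made up for illustration (they are not trained values from this post), and the 0.5 cut-off is an assumption about how the output would be read.

```python
import numpy

def sigmoid(x):
    return 1 / (1 + numpy.exp(-x))

def simulate(p, x):
    """Forward pass only: plug x into the fitted equation and return `of`."""
    h1f = sigmoid(x[0] * p["w1"] + x[1] * p["w2"] + p["b1"])
    h2f = sigmoid(x[0] * p["w3"] + x[1] * p["w4"] + p["b2"])
    return sigmoid(h1f * p["w5"] + h2f * p["w6"] + p["b3"])

# Made-up weights standing in for a trained model; not values from the post.
p = {"w1": -1.0, "w2": -1.0, "w3": 1.0, "w4": 1.0,
     "w5": 4.0, "w6": -4.0, "b1": 0.0, "b2": 0.0, "b3": 0.0}

samples = numpy.array([[-2.0, -1.0], [3.0, 2.0]])   # made-up inputs
scores = numpy.apply_along_axis(lambda row: simulate(p, row), 1, samples)
labels = (scores > 0.5).astype(int)   # assumed cut-off at 0.5
print(scores, labels)
```

No derivatives are involved at this stage; prediction is just the forward pass.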

The complete code

import numpy

def sigmoid (x) :
    return 1/ (1+numpy.exp(-x))

def der_sigmoid(x) :
    return sigmoid(x)*(1-sigmoid(x))

def mse_loss(y_tr,y_pre) :
    return((y_tr - y_pre)**2).mean()


class nerualnetwo() :
    def __init__(self) :
       self.w1 = numpy.random.normal()
       self.w2 = numpy.random.normal()
       self.w3 = numpy.random.normal()
       self.w4 = numpy.random.normal()
       self.w5 = numpy.random.normal()
       self.w6 = numpy.random.normal()
       self.b1 = numpy.random.normal()
       self.b2 = numpy.random.normal()
       self.b3 = numpy.random.normal()
    def feedforward(self,x) :
        h1 = x[0]*self.w1+x[1]*self.w2+self.b1
        h1f = sigmoid(h1)
        h2 = x[0]*self.w3+x[1]*self.w4+self.b2
        h2f = sigmoid(h2)
        o1 = h1f*self.w5+h2f*self.w6+self.b3
        of = sigmoid(o1)
        return h1,h1f,h2,h2f,o1,of
    def simulate (self,x) :
        h1 = x[0]*self.w1+x[1]*self.w2+self.b1
        h1f = sigmoid(h1)
        h2 = x[0]*self.w3+x[1]*self.w4+self.b2
        h2f = sigmoid(h2)
        o1 = h1f*self.w5+h2f*self.w6+self.b3
        of = sigmoid(o1)
        return of
    def train(self,data,all_y_tr) :
        epochs = 1000
        learn_rate = 0.1
        for i in range(epochs):
            for x , y_tr in zip(data,all_y_tr):
                valcell = self.feedforward(x)
                y_pre = valcell[5]
                der_L_y_pre = -2*(y_tr-y_pre)
                der_y_pre_h1 = der_sigmoid(valcell[4])*self.w5
                der_y_pre_h2 = der_sigmoid(valcell[4])*self.w6
                der_h1_w1 = der_sigmoid(valcell[0])*x[0]
                der_h1_w2 = der_sigmoid(valcell[0])*x[1]
                der_h2_w3 = der_sigmoid(valcell[2])*x[0]
                der_h2_w4 = der_sigmoid(valcell[2])*x[1]
                der_y_pre_w5 = der_sigmoid(valcell[4])*valcell[1]
                der_y_pre_w6 = der_sigmoid(valcell[4])*valcell[3]
                der_y_pre_b3 = der_sigmoid(valcell[4])
                der_h1_b1 = der_sigmoid(valcell[0])
                der_h2_b2 = der_sigmoid(valcell[2])

                self.w1 -= learn_rate * der_L_y_pre * der_y_pre_h1 * der_h1_w1
                self.w2 -= learn_rate * der_L_y_pre * der_y_pre_h1 * der_h1_w2
                self.w3 -= learn_rate * der_L_y_pre * der_y_pre_h2 * der_h2_w3
                self.w4 -= learn_rate * der_L_y_pre * der_y_pre_h2 * der_h2_w4
                self.w5 -= learn_rate * der_L_y_pre * der_y_pre_w5
                self.w6 -= learn_rate * der_L_y_pre * der_y_pre_w6
                self.b1 -= learn_rate * der_L_y_pre * der_y_pre_h1 * der_h1_b1
                self.b2 -= learn_rate * der_L_y_pre * der_y_pre_h2 * der_h2_b2
                self.b3 -= learn_rate * der_L_y_pre *der_y_pre_b3
            # print the current loss once every 10 epochs
            if i % 10 == 0:
                y_pred = numpy.apply_along_axis(self.simulate,1,data)
                loss = mse_loss(all_y_tr, y_pred)
                print(i,loss)

                    
if __name__ == "__main__":
    data = numpy.array([[-2, -1], [25, 6], [17, 4], [-15, -6]])
    all_y_trues = numpy.array([1, 0, 0, 1])
    ner = nerualnetwo()

    ner.train(data,all_y_trues)


Conclusion

That is what a basic neural network looks like. At its heart are two things (and, of course, a lot of details):

The activation function

Automatic weight correction

Select an appropriate activation function, apply it, and then use a self-correcting method to minimize the loss function. As the number of layers increases, the network becomes more complex, and accuracy does not necessarily grow in proportion to the number of layers, as I found in my own tests.

Reference:

zhuanlan.zhihu.com/p/58964140

Blog.csdn.net/Syuhen/arti…

www.nnetinfo.com/text/show/4