This is the fifth day of my November challenge

Perceptron

A neural network can be understood simply as a network composed of artificial neurons. The perceptron is one kind of artificial neuron, and its simplest form is as follows:

In the example, the perceptron has three inputs $x_1, x_2, x_3$. The output is either 0 or 1, and each connection has a weight $w$. If the weighted sum of the inputs is greater than a certain threshold, the output is 1; otherwise the output is 0.


$$
output=\begin{cases} 0\ \ \ if\ \sum_i w_i x_i \leq threshold\\ 1\ \ \ if\ \sum_i w_i x_i > threshold \end{cases} \tag{1}
$$

According to formula (1), different decision models can be obtained by adjusting the threshold and the weights.
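
As a quick illustration, here is a minimal Python sketch of formula (1); the function name, weights, threshold, and inputs are made-up values for demonstration, not taken from the text:

```python
# Minimal perceptron decision rule from formula (1).
# The weights, threshold, and inputs below are illustrative values.
def perceptron(inputs, weights, threshold):
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

# Example: three binary inputs with different importance (weights).
print(perceptron([1, 0, 1], weights=[6, 2, 2], threshold=5))  # 8 > 5  -> outputs 1
print(perceptron([0, 1, 1], weights=[6, 2, 2], threshold=5))  # 4 <= 5 -> outputs 0
```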

The above example is just a simple perceptron. A more complex network consisting of multiple perceptrons is shown below:

  • Layer 1 perceptrons: make three simple judgments based on the input
  • Layer 2 perceptrons: make the next judgment based on the results of the layer 1 perceptrons. The decisions made by the layer 2 perceptrons are more complex and abstract than those made by layer 1.

Similarly, the more layers there are, the more abstract and complex the perceptron’s decisions become.

Writing $\sum_i w_i x_i = \pmb{w}\cdot\pmb{x}$ and denoting the $threshold$ by $-b$, we get:


$$
output=\begin{cases} 0\ \ \ if\ \pmb{w}\cdot\pmb{x}+b \leq 0\\ 1\ \ \ if\ \pmb{w}\cdot\pmb{x}+b > 0 \end{cases} \tag{2}
$$

After replacing the $threshold$ with $-b$ (the bias), the bias can be read as a measure of how easy it is to make the perceptron output 1 (that is, the larger the value of $b$, the easier it is for the perceptron to output 1).

The core of a neural-network learning algorithm is to adjust the weights and biases of the artificial neurons automatically according to an objective function, without direct human intervention; this is the fundamental difference between artificial neurons and traditional logic gates.

Sigmoid neuron

During learning, we usually want a small adjustment of a parameter such as a weight or bias to cause only a small change in the output. However, as can be seen from formula (2), an artificial neuron such as the perceptron can hardly meet this requirement: changing the parameters may flip the output from 0 to 1 (since the perceptron has only two output values). An artificial neuron called the sigmoid neuron was therefore introduced to improve on this.

The sigmoid neuron uses the sigmoid function $\sigma$, defined as:


$$
\sigma(z)=\frac{1}{1+e^{-z}}
$$

The output of a sigmoid neuron becomes:


$$
\frac{1}{1+\exp(-(\pmb{w}\cdot\pmb{x}+b))}
$$

The general shape of the sigmoid function $\sigma$ is shown below:

Letting $z=\pmb{w}\cdot\pmb{x}+b$, the function approaches 1 when $z$ is very large and 0 when $z$ is very small.
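
A quick numeric check of this behaviour, using the definition of $\sigma$ above (the sample values of $z$ are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# sigma(z) saturates: close to 1 for large z, close to 0 for very negative z,
# and equals 0.5 at z = 0.
for z in (-10, -2, 0, 2, 10):
    print(z, round(float(sigmoid(z)), 4))
# -10 -> ~0.0000, -2 -> ~0.1192, 0 -> 0.5, 2 -> ~0.8808, 10 -> ~1.0000
```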

If the sigmoid function $\sigma$ is replaced by a step function, it looks as follows:

You can see that this is exactly the perceptron, which means the sigmoid neuron is really a smoothed version of the perceptron; this smoothness is what allows small changes in the parameters to produce small changes in the output.

The change in output, $\Delta output$, can be expressed as:


$$
\Delta output \approx \sum_i \frac{\partial output}{\partial w_i}\Delta w_i+\frac{\partial output}{\partial b}\Delta b
$$
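
To see the approximation in action, here is a small sketch comparing the linearized change with the true change for a single sigmoid neuron. The weights, bias, input, and parameter changes are assumed illustrative values, and it uses the fact that for a sigmoid neuron $\partial output/\partial w_i = \sigma'(z)x_i$ and $\partial output/\partial b = \sigma'(z)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

# Illustrative weights, bias, and input (not from the text).
w = np.array([0.5, -0.3]); b = 0.1; x = np.array([1.0, 2.0])
dw = np.array([0.01, -0.02]); db = 0.005          # small parameter changes

z = np.dot(w, x) + b
# Linear approximation: sum_i (d output/d w_i) dw_i + (d output/d b) db,
# where d output/d w_i = sigma'(z) * x_i and d output/d b = sigma'(z).
approx = sigmoid_prime(z) * (np.dot(x, dw) + db)
exact = sigmoid(np.dot(w + dw, x) + b + db) - sigmoid(z)
print(approx, exact)  # the two values agree closely for small changes
```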

Neural network architecture

The basic architecture of a neural network is as follows:

The leftmost layer is called the input layer, and its neurons are called input neurons; the rightmost layer is the output layer, whose neurons are called output neurons; the layers in between are collectively called hidden layers.

A network for classifying handwritten digits

Recognizing handwritten digits can be split into two problems: segmenting a string of digits into individual digits, and classifying each individual digit. Only the second problem is addressed here, using the three-layer neural network shown below to identify a digit:

  • Input layer: each handwritten digit image fed to the network is $28 \times 28$ pixels; each input is a grayscale value, with 0.0 representing white, 1.0 representing black, and intermediate values representing shades of gray, so the input layer contains $28 \times 28 = 784$ neurons.
  • Hidden layer: $n$ denotes the number of neurons in the hidden layer; $n$ is an adjustable parameter. In the figure, $n=15$.
  • Output layer: contains 10 neurons, corresponding to the digits $0$-$9$. Whichever neuron is most strongly activated indicates the class assigned to the input (see the sketch after this list).
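
Here is a minimal sketch of how these layer sizes and the output convention fit together; the image here is a random placeholder, and the real forward pass is implemented by the Network class in the code section below:

```python
import numpy as np

# Layer sizes as described above: 784 input neurons, 15 hidden, 10 output.
# (This list would be passed to the Network class as Network(sizes).)
sizes = [784, 15, 10]

# A flattened 28x28 grayscale image becomes a (784, 1) column vector,
# with 0.0 for white and 1.0 for black (random values used here as a placeholder).
image = np.random.rand(28, 28)
x = image.reshape(784, 1)

# The output layer has 10 activations; the index of the largest one is the predicted digit.
fake_output = np.random.rand(10, 1)   # placeholder for the network's output vector
predicted_digit = int(np.argmax(fake_output))
print(predicted_digit)
```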

Learning parameters with gradient descent

The training dataset is the well-known MNIST dataset.

The input to the network is $\pmb{x}$, a 784-dimensional vector, and the expected output is the 10-dimensional vector $\pmb{y}=y(\pmb{x})$. For example, if the input image is the digit 6, the expected network output is $y(\pmb{x})=(0,0,0,0,0,0,1,0,0,0)^T$.

In order to quantify the learning objective of the neural network, the objective function is defined as follows:


$$
C(\pmb{w},b)=\frac{1}{2n}\sum_x||y(\pmb{x})-a||^2
$$

Here $\pmb{w}$ denotes all the weights in the network, $b$ denotes all the biases, $n$ is the total number of training inputs, $a$ is the output vector of the network when the input is $\pmb{x}$, $\sum_x$ sums over all training inputs, and $||\cdot||$ denotes the norm of a vector. This objective function is known as the mean squared error. The smaller the difference between $a$ and $y(\pmb{x})$, the smaller the value of $C(\pmb{w},b)$; when $a$ is approximately equal to $y(\pmb{x})$ for all inputs, $C(\pmb{w},b) \rightarrow 0$.
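
As a minimal sketch of this cost, computed over a couple of hand-made output vectors (in the real code, the outputs $a$ come from the network's feedforward pass):

```python
import numpy as np

# Mean squared error cost C(w, b) = 1/(2n) * sum_x ||y(x) - a||^2,
# written over a list of (expected, actual) output pairs.
def mse_cost(expected_outputs, actual_outputs):
    n = len(expected_outputs)
    return sum(np.linalg.norm(y - a) ** 2
               for y, a in zip(expected_outputs, actual_outputs)) / (2 * n)

# Illustrative example: expected one-hot vector for the digit 6 vs. two made-up outputs.
y = np.zeros((10, 1)); y[6] = 1.0
good_a = 0.9 * y                    # output close to the target -> small cost
bad_a = np.full((10, 1), 0.1)       # diffuse output -> larger cost
print(mse_cost([y], [good_a]), mse_cost([y], [bad_a]))
```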

The problem is thus transformed into finding $\pmb{w}, b$ such that:


$$
\min C(\pmb{w},b)
$$

We can solve this minimization problem with the gradient descent method. Its principle is very simple: at each step, move the two parameters in the direction in which the function value decreases, until an extremum is reached.

$\nabla C$ denotes the gradient vector of the function $C(\pmb{w},b)$:


$$
\nabla C = (\frac{\partial C}{\partial w},\frac{\partial C}{\partial b})^T
$$

The change $\Delta C$ can then be approximated as:


$$
\Delta C \approx \nabla C \cdot (\Delta w,\Delta b)
$$

The problem becomes how to choose $(\Delta w, \Delta b)$ so that $\Delta C$ is guaranteed to be negative. Introducing the learning rate $\eta > 0$, let:


$$
(\Delta w,\Delta b) = -\eta \nabla C
$$

Since $\eta$ is a small positive number, it follows that:


$$
\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta||\nabla C||^2
$$

Because $||\nabla C||^2 > 0$, $\Delta C$ is guaranteed to be negative, and adjusting the value of $\eta$ adjusts the rate of descent. The figure below shows the geometric picture of the gradient descent method.

To sum up, the update rules for the two parameters are:


$$
w \rightarrow w'=w-\eta \frac{\partial C}{\partial w}\\
b \rightarrow b'=b-\eta \frac{\partial C}{\partial b}
$$

Repeat the above steps until you find the minimum of the function.
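
The update rule amounts to a simple loop. Here is a sketch on a toy one-dimensional cost $C(w)=(w-3)^2$; the cost function, starting point, and learning rate are illustrative and not the network's:

```python
# Gradient descent on a toy cost C(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
# The starting point and learning rate eta are arbitrary illustrative choices.
def grad_C(w):
    return 2 * (w - 3)

w = 0.0
eta = 0.1
for step in range(100):
    w = w - eta * grad_C(w)   # w -> w' = w - eta * dC/dw
print(w)  # converges toward the minimum at w = 3
```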

Stochastic gradient descent

Note that the learning procedure described above is time-consuming in practice. The objective function introduced earlier is $C(\pmb{w},b)=\frac{1}{2n}\sum_x||y(\pmb{x})-a||^2$. Writing $C_x = \frac{||y(\pmb{x})-a||^2}{2}$, we get $C(\pmb{w},b)=\frac{1}{n}\sum_x C_x$. To compute the gradient we must compute $\nabla C_x$ for every training input $\pmb{x}$ and average them, which requires a lot of computation and makes learning slow.

Therefore, the stochastic gradient descent algorithm is introduced to speed up learning. The core idea is to randomly select a small sample of inputs, compute $\nabla C_x$ only for them, and use the gradient computed on the small sample to estimate the gradient over all samples.

Select a mini-batch of data containing $m$ samples; when $m$ is large enough, we obtain:


$$
\frac{\sum^m_{i=1} \nabla C_{X_i}}{m} \approx \frac{\sum_x \nabla C_{x}}{n} = \nabla C
$$

In stochastic gradient descent, a new mini-batch is selected for each update until all training samples have been used, which is called completing one training epoch; a new epoch then begins.
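
A quick numeric sketch of this estimate, using random vectors in place of the real per-example gradients $\nabla C_x$:

```python
import numpy as np

# Stand-in per-example gradients: one random 5-dimensional vector per training example.
# In the real network these would be the nabla C_x computed by backpropagation.
rng = np.random.default_rng(0)
n = 50000
per_example_grads = rng.normal(size=(n, 5))

full_gradient = per_example_grads.mean(axis=0)            # average over all n examples

m = 100                                                    # mini-batch size
batch = per_example_grads[rng.choice(n, size=m, replace=False)]
mini_batch_estimate = batch.mean(axis=0)                   # average over the mini-batch

print(full_gradient, mini_batch_estimate)  # the estimate is close to the full gradient
```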

Code implementation

Building the neural network

```python
import numpy as np
import random

# sigmoid function
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# derivative of the sigmoid function
def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

class Network(object):
    # e.g. net = Network([2, 3, 1]) builds a three-layer network
    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        # The first layer is the input layer and has no biases, so start from sizes[1:].
        # np.random.randn returns samples drawn from the standard normal distribution.
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        # zip pairs the sizes of adjacent layers to build the weight matrices.
        self.weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

    # feedforward: return the output of the network for input a
    def feedforward(self, a):
        for b, w in zip(self.biases, self.weights):
            # np.dot is matrix multiplication
            a = sigmoid(np.dot(w, a) + b)
        return a

    # stochastic gradient descent
    def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):
        if test_data:
            n_test = len(test_data)
        # training_data is a list of (x, y) tuples
        n = len(training_data)
        # epochs and mini_batch_size are the number of epochs and the mini-batch size
        for j in range(epochs):
            random.shuffle(training_data)
            # slice the training set into mini-batches of size mini_batch_size
            mini_batches = [training_data[k:k + mini_batch_size]
                            for k in range(0, n, mini_batch_size)]
            # update weights and biases for each mini-batch with learning rate eta
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print("Epoch {0}: {1} / {2}".format(j, self.evaluate(test_data), n_test))
            else:
                print("Epoch {0} complete".format(j))

    # update weights and biases using one mini-batch
    def update_mini_batch(self, mini_batch, eta):
        # .shape gives the shape of each matrix
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            # backprop is a fast method to compute the gradient of the cost function
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            # accumulate the gradient over the mini-batch
            nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w - (eta / len(mini_batch)) * nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b - (eta / len(mini_batch)) * nb
                       for b, nb in zip(self.biases, nabla_b)]

    # backpropagation: return (nabla_b, nabla_w), the gradient of the cost function C_x
    def backprop(self, x, y):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x]  # list to store all activations, layer by layer
        zs = []  # list to store all z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation) + b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l + 1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l - 1].transpose())
        return nabla_b, nabla_w

    def evaluate(self, test_data):
        """Return the number of test inputs for which the neural network outputs the
        correct result. Note that the neural network's output is assumed to be the
        index of whichever neuron in the final layer has the highest activation."""
        test_results = [(np.argmax(self.feedforward(x)), y) for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x / \partial a
        for the output activations."""
        return output_activations - y
```

Read the data set and convert it to the appropriate format

```python
import pickle
import gzip
import numpy as np

# load_data returns the MNIST data as a tuple (training_data, validation_data, test_data).
# training_data is a tuple with two entries: the first contains the actual training images,
# a numpy ndarray with 50000 entries; each entry is in turn a numpy ndarray with 784 values,
# since a MNIST image is 28x28 = 784 pixels.
# The second entry is a numpy ndarray with 50000 entries: the digit labels of the images.
# validation_data and test_data have the same structure, each containing 10000 images.
def load_data():
    # the gzip module provides a file-like interface for GNU zip files;
    # it uses zlib to compress and decompress the data
    f = gzip.open('mnist.pkl.gz', 'rb')
    # pickle.load reads the byte stream from f and reconstructs the original Python objects
    training_data, validation_data, test_data = pickle.load(f, encoding='bytes')
    f.close()
    return training_data, validation_data, test_data

# load_data_wrapper returns a tuple (training_data, validation_data, test_data) in a format
# that is convenient for the network:
# training_data is a list of 50000 tuples (x, y), where x is a 784-dimensional numpy ndarray
# and y is a 10-dimensional numpy ndarray encoding the class of x.
def load_data_wrapper():
    tr_d, va_d, te_d = load_data()
    # np.reshape changes the shape of an array without changing its contents
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = list(zip(training_inputs, training_results))
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = list(zip(validation_inputs, va_d[1]))
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = list(zip(test_inputs, te_d[1]))
    return training_data, validation_data, test_data

# vectorized_result turns a digit j into a 10-dimensional unit vector
# with a 1.0 in the j-th position and zeros elsewhere
def vectorized_result(j):
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e
```

Test

```python
import handwrite
import mnist_loader

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
net = handwrite.Network([784, 30, 10])
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
```

The test results

```
Epoch 0: 8917 / 10000
Epoch 1: 9193 / 10000
Epoch 2: 9305 / 10000
Epoch 3: 9301 / 10000
Epoch 4: 9338 / 10000
Epoch 5: 9366 / 10000
Epoch 6: 9356 / 10000
Epoch 7: 9425 / 10000
Epoch 8: 9405 / 10000
Epoch 9: 9415 / 10000
Epoch 10: 9402 / 10000
Epoch 11: 9448 / 10000
Epoch 12: 9437 / 10000
Epoch 13: 9426 / 10000
Epoch 14: 9466 / 10000
Epoch 15: 9414 / 10000
Epoch 16: 9439 / 10000
Epoch 17: 9432 / 10000
Epoch 18: 9473 / 10000
Epoch 19: 9462 / 10000
Epoch 20: 9456 / 10000
Epoch 21: 9471 / 10000
Epoch 22: 9485 / 10000
Epoch 23: 9460 / 10000
Epoch 24: 9472 / 10000
Epoch 25: 9492 / 10000
Epoch 26: 9492 / 10000
Epoch 27: 9475 / 10000
Epoch 28: 9508 / 10000
Epoch 29: 9470 / 10000
```

You can see that the final test accuracy is close to 95%.