Introduction to the algorithm

The perceptron algorithm is a classical binary classification algorithm.

A popular description: there are many oranges and apples on a table, and there is a line such that the apples lie on one side and the oranges on the other. The purpose of the perceptron algorithm is to find this line.

Formal description (adapted from Li Hang's Statistical Learning Methods): the perceptron is a linear classification model for binary classification. Its input is the feature vector of an instance, and its output is the category of the instance, taking the binary values +1 and -1. The perceptron corresponds to a hyperplane that divides the instances into positive and negative categories in the input space (feature space), and it belongs to the class of discriminative models. The goal of the perceptron algorithm is to find the separating hyperplane that divides the training data into the two categories. To this end, a loss function based on misclassification is introduced, and gradient descent is used to minimize this loss function to obtain the perceptron model. The model was proposed by Rosenblatt in 1957 and is the basis of neural networks and support vector machines.

Model description

To help describe the problem, we will draw figures with matplotlib.

Model description: substitute the coordinates of each point into a function whose result is either +1 or -1. These results classify the points into two categories. The corresponding function is called the perceptron.

Formal description of the model:

Suppose that the input space (the feature space) is $X \subseteq R^n$ and the output space is $Y = \{-1, +1\}$.

The input $x \in X$ denotes the feature vector of an instance, corresponding to a point in the input space (the feature space); $y \in Y$ denotes the category of the instance. The following function from the input space to the output space:

$$f(x) = \mathrm{sign}(w \cdot x + b)$$

It’s called a perceptron.

where:

$w$ and $b$ are the perceptron model parameters;

$w \in R^n$ is called the weight vector;

$b \in R$ is called the bias;

$w \cdot x$ denotes the inner product of $w$ and $x$.

Note: $w = (w_0, w_1)$ and $x = (x_0, x_1)$ are both vectors (here in the two-dimensional case); notice that this is the vectorized form.

$\mathrm{sign}$ is the sign function:

$$\mathrm{sign}(x)=\begin{cases} +1, & x \geq 0 \\ -1, & x < 0 \end{cases}$$

As shown in the figure:
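To make the model concrete, here is a minimal NumPy sketch of the decision function $f(x) = \mathrm{sign}(w \cdot x + b)$. The values of w and b below are arbitrary illustrative choices, not fitted parameters.

import numpy as np

def perceptron_predict(x, w, b):
    # f(x) = sign(w . x + b); a value of exactly 0 is mapped to +1 here
    return np.where(np.dot(x, w) + b >= 0.0, 1, -1)

w = np.array([1.0, 1.0])  # illustrative weight vector
b = -2.5                  # illustrative bias
print(perceptron_predict(np.array([3.0, 1.0]), w, b))  # 1
print(perceptron_predict(np.array([0.5, 0.5]), w, b))  # -1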

Perceptron learning strategies

The perceptron algorithm tries to separate the positive and negative instances exactly using a hyperplane (a line in two dimensions). To find such a hyperplane, we can measure the sum of the distances from all misclassified points to it. So let's first recall the point-to-line distance formula.

The distance from a point $(x_1, y_1)$ to the line $ax + by + c = 0$ is:

$$\frac{|ax_1 + by_1 + c|}{\sqrt{a^2 + b^2}}$$

Similarly, for the perceptron model defined in the previous step, expressed in vectorized form, the distance from a point $x_0$ to the hyperplane $S$ is:

$$\frac{1}{\|w\|} |w \cdot x_0 + b|$$
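As a quick sanity check of this formula (with purely illustrative values for $w$, $b$ and $x_0$):

import numpy as np

w = np.array([1.0, 1.0])
b = -2.5
x_0 = np.array([3.0, 1.0])
# distance = |w . x_0 + b| / ||w||
print(np.abs(np.dot(w, x_0) + b) / np.linalg.norm(w))  # 1.5 / sqrt(2) ≈ 1.06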

Next, for a misclassified point $(x_i, y_i)$, one of the following holds:

$$\begin{cases} \frac{1}{\|w\|}(w \cdot x_i + b) > 0, & y_i < 0 \\ \frac{1}{\|w\|}(w \cdot x_i + b) < 0, & y_i > 0 \end{cases}$$

Simplified as:


$$\mathrm{distance} = -\frac{1}{\|w\|} y_i (w \cdot x_i + b) > 0$$

This is because, for a misclassified point, when $w \cdot x_i + b > 0$ we have $y_i = -1$, and when $w \cdot x_i + b < 0$ we have $y_i = +1$.

So we have the distance from a single misclassified point to the hyperplane. Suppose the set of all misclassified points is $M$. Then the sum of the distances from all misclassified points to the hyperplane is:


$$-\frac{1}{\|w\|} \sum_{x_i \in M} y_i (w \cdot x_i + b)$$

P.S.: $w$ and $b$ can be scaled up or down without changing the hyperplane they represent, so we drop the factor $\frac{1}{\|w\|}$ here. This gives the perceptron learning loss function. Given a training set


$$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$$

where

$$x_i \in X \subseteq R^n, \quad y_i \in Y = \{-1, +1\}, \quad i = 1, 2, \ldots, N$$

The loss function of the perceptron algorithm is


$$L(w, b) = -\sum_{x_i \in M} y_i (w \cdot x_i + b)$$

where $M$ is the set of misclassified points.

So solving the perceptron model now amounts to finding $w, b$ that minimize $L(w, b)$. That is:


$$\min_{w,b} L(w,b)$$

Perceptron learning algorithm steps

We use gradient descent to solve this, so let's compute the gradient of the loss function $L(w,b)$:


$$\nabla_w L(w,b)=-\sum_{x_i\in M}y_i x_i$$

$$\nabla_b L(w,b)=-\sum_{x_i\in M}y_i$$
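As a small sketch of these formulas (the data points and the values of $w$, $b$ below are made up purely for illustration), the loss and the two gradients over the misclassified set $M$ can be computed directly:

import numpy as np

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])  # toy data
y = np.array([1, 1, -1])
w, b = np.array([-1.0, 0.0]), 1.0
mask = y * (X @ w + b) <= 0                     # misclassified: y_i(w.x_i + b) <= 0
M_x, M_y = X[mask], y[mask]
loss = -np.sum(M_y * (M_x @ w + b))             # L(w, b)
grad_w = -np.sum(M_y[:, None] * M_x, axis=0)    # gradient with respect to w
grad_b = -np.sum(M_y)                           # gradient with respect to b
print(loss, grad_w, grad_b)                     # 5.0 [-6. -5.] -1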

Specific steps of the algorithm:

(1) Select initial values $w_0, b_0$;

(2) Select a data point $(x_i, y_i)$ from the training set;

(3) If $y_i (w \cdot x_i + b) \leq 0$, update:

$$w \leftarrow w + \eta y_i x_i$$

$$b \leftarrow b + \eta y_i$$

(4) Go back to (2) until there are no misclassified points in the training set.
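Before moving on, here is a minimal NumPy sketch of steps (1)–(4) on a tiny, linearly separable toy data set (a fuller, class-based implementation follows later in this article):

import numpy as np

def train_perceptron(X, y, eta=1.0, max_iter=1000):
    w = np.zeros(X.shape[1])                     # (1) initial values w_0, b_0
    b = 0.0
    for _ in range(max_iter):
        updated = False
        for x_i, y_i in zip(X, y):               # (2) pick a training point
            if y_i * (np.dot(w, x_i) + b) <= 0:  # (3) misclassified: update w and b
                w = w + eta * y_i * x_i
                b = b + eta * y_i
                updated = True
        if not updated:                          # (4) stop when nothing is misclassified
            break
    return w, b

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
print(train_perceptron(X, y))  # (array([1., 1.]), -3.0)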

So far, we have described the steps of the perceptron algorithm. Next, we will do two things: (1) call the perceptron algorithm in sklearn, and (2) write the perceptron algorithm ourselves and call it.

Invoke the perceptron model in sklearn

Start by importing the required libraries:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris

Import the data set:

iris = load_iris()
# There are 4 features in total. For convenience, we take the features at index 2 and 3 (petal length and width),
# and keep only the first two classes so that the problem is binary.
data = iris.data[:, 2:4][iris.target < 2]
target = iris.target[iris.target < 2]

Divide the data into a training set and a test set:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.33, random_state=42)

Let’s look at the distribution of the dataset:

plt.scatter(X_train[y_train==0][:, 0], X_train[y_train==0][:, 1], c='r',)
plt.scatter(X_train[y_train==1][:, 0], X_train[y_train==1][:, 1], c='g')

plt.xlim(-2, 6)
plt.ylim(-2, 6)

The result is shown below:

Next, we import the perceptron model and start training:

from sklearn.linear_model import Perceptron
clf = Perceptron()
clf.fit(X_train, y_train)
X_predict = clf.predict(X_test)
X_predict

The results are as follows:

array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1])

Compare this with y_test:

y_test

The results are as follows:

array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1])

All predictions turned out to be accurate. Let’s use the score function to see how accurate it is:

clf.score(X_test, y_test)

The results are as follows:

1.0

The accuracy is indeed 100%: the data set is small and completely linearly separable, so a perfect score is expected.

Now I’m sure some of you can’t wait to see what the hyperplane is.

Let's take a look at the fitted weight vector $w$:

clf.coef_

The results are as follows:

array([[0.8, 0.8]])

And the bias $b$:

clf.intercept_

The results are as follows:

array([-2.])

Now that we have the parameters, we have the corresponding hyperplane:

$$0.8 x_1 + 0.8 x_2 - 2 = 0$$

That is: $x_1 + x_2 - 2.5 = 0$
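As a side note, the same line can be derived programmatically from the fitted model instead of being hard-coded (a small sketch using the clf fitted above):

w = clf.coef_[0]       # array([0.8, 0.8])
b = clf.intercept_[0]  # -2.0
# w[0]*x_1 + w[1]*x_2 + b = 0  =>  x_2 = -(w[0]*x_1 + b) / w[1]
x_1 = np.arange(-2, 6, 0.01)
x_2 = -(w[0] * x_1 + b) / w[1]

which reproduces $x_2 = -x_1 + 2.5$.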

Now let's draw this line on the same graph as the previous data points to see what it looks like:

plt.scatter(X_train[y_train==0][:, 0], X_train[y_train==0][:, 1], c='r')
plt.scatter(X_train[y_train==1][:, 0], X_train[y_train==1][:, 1], c='g')
plt.xlim(-2, 6)
plt.ylim(-2, 6)
x_1 = np.arange(-2, 6, 0.01)
x_2 = -x_1 + 2.5
plt.plot(x_1, x_2)

The results are as follows:

It looks a bit odd, doesn't it? The line is very close to the red points, but it still splits the data set into two parts, which is all the perceptron asks for.

Implement the perceptron algorithm yourself

In the previous step, we called the perceptron algorithm in sklearn to split the samples into two classes. Now let's implement the perceptron algorithm ourselves and call it.

The self-implemented perceptron code is as follows:

#coding: utf-8
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.metrics import accuracy_score


class Perceptron:
    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state

    def fit(self, X, y):
        """Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object
        """
        assert X.ndim == 2, "X must be 2 dimensional"
        assert y.ndim == 1, "y must be 1 dimensional"
        assert X.shape[0] == y.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        rgen = np.random.RandomState(self.random_state)
        # w_[0] is the bias b, w_[1:] is the weight vector w
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
        for _ in range(self.n_iter):
            # collect the misclassified points M
            M = X[self.predict(X) != y]
            M_target = y[self.predict(X) != y]
            if len(M) > 0:
                M_predict = self.predict(M)
                # take the first misclassified point and update w and b
                x_i = M[0]
                M_target_i = M_target[0]
                M_predict_i = M_predict[0]
                update = self.eta * (M_target_i - M_predict_i)
                self.w_[1:] += update * x_i
                self.w_[0] += update
        return self

    def _compute_value(self, X):
        assert X.ndim == 2, "function _compute_value: X must be 2 dimensional"
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        assert X.ndim == 2, "function predict: X must be 2 dimensional"
        return np.where(self._compute_value(X) >= 0.0, 1, -1)

    def score(self, X_test, y_test):
        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)

Now let's call it and see what happens. (For importing and splitting the data set, reuse the code from the previous steps; here we directly call our own perceptron.)
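One caveat, and an assumption on my part since the label handling is not shown explicitly: this perceptron predicts labels in {-1, +1}, while the iris targets loaded earlier are {0, 1}, and the plotting code below checks y_train==-1. So the labels presumably need to be remapped before splitting, for example:

target = np.where(target == 0, -1, 1)  # map {0, 1} to {-1, +1} (assumed step)
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.33, random_state=42)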

from Perceptron import Perceptron  # assuming the class above is saved as Perceptron.py

clf = Perceptron(n_iter=10000, eta=0.1)
clf.fit(X_train, y_train)
clf.predict(X_test)
clf.score(X_test, y_test), clf.w_

The results are as follows:

(1.0, array([-0.38375655,  0.13388244,  0.19471828]))

We can see that the accuracy is 100%. clf.w_[0] corresponds to $b$ in our model, and the remaining portion, clf.w_[1:], represents $w$.

Let’s plot the points and the computed line:

plt.scatter(X_train[y_train==1][:, 0], X_train[y_train==1][:, 1], c='r')
plt.scatter(X_train[y_train==-1][:, 0], X_train[y_train==-1][:, 1], c='g')
plt.xlim(-2, 6)
plt.ylim(-2, 6)
y_1 = np.arange(-2, 6, 0.01)
y_2 = -clf.w_[1] / clf.w_[2] * y_1 - clf.w_[0] / clf.w_[2]
plt.plot(y_1, y_2)

The results are as follows:

You can see that the data set has been successfully separated.

Conclusion

There is an eta parameter in the perceptron implemented above, which represents the learning rate. The larger eta is, the faster the convergence; however, if eta is too large, the convergence process fluctuates too much. You need to tune this parameter yourself.

The Perceptron in sklearn has corresponding parameters as well. If you are interested, you can set different values and observe the convergence process, as in the sketch below.
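For example, one could loop over a few learning rates and compare the resulting accuracy and weights. This sketch assumes the custom Perceptron class and the train/test split from the previous steps are in scope:

for eta in (0.001, 0.01, 0.1, 1.0):
    clf = Perceptron(eta=eta, n_iter=1000)
    clf.fit(X_train, y_train)
    print(eta, clf.score(X_test, y_test), clf.w_)

# sklearn's Perceptron exposes a comparable learning-rate parameter, eta0:
from sklearn.linear_model import Perceptron as SkPerceptron
print(SkPerceptron(eta0=0.1, max_iter=1000).fit(X_train, y_train).score(X_test, y_test))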