Introduction to the algorithm
The perceptron algorithm is a classical binary classification algorithm.
A popular description: there are many oranges and apples on a table, and there is a line such that the apples lie on one side and the oranges on the other. The purpose of the perceptron algorithm is to find this line.
Formal description (adapted from Statistical Learning Methods by Li Hang): the perceptron is a linear classification model for binary classification. Its input is the feature vector of an instance, and its output is the class of the instance, taking the values +1 and -1. The perceptron corresponds to a hyperplane that divides instances into positive and negative classes in the input space (feature space), and it is a discriminative model. The goal of the perceptron algorithm is to find the separating hyperplane that divides the training data into the two classes. To that end, a loss function based on misclassification is introduced, and gradient descent is used to minimize it, which yields the perceptron model. The model was proposed by Rosenblatt in 1957 and is the basis of neural networks and support vector machines.
Model description
To help describe the problem, we draw a picture with matplotlib.
Model description: substitute the coordinates of every point into a function whose result is either +1 or -1. These results divide the points into two classes. The corresponding function is called a perceptron.
Formal description of the model:
Suppose the input space (feature space) is $X \subseteq R^n$,
and the output space is $Y = \{-1, +1\}$.
The input $x \in X$ denotes the feature vector of an instance, corresponding to a point in the input space (feature space);
the output $y \in Y$ denotes the class of the instance. The following function from the input space to the output space
$$f(x) = \mathrm{sign}(w \cdot x + b)$$
is called a perceptron.
Here:
$w$ and $b$ are the parameters of the perceptron model;
$w \in R^n$ is called the weight (or weight vector);
$b \in R$ is called the bias;
$w \cdot x$ denotes the inner product of $w$ and $x$.
Note: $w = (w_0, w_1)$ is a vector and $x = (x_0, x_1)$ is a vector; the model is written in vectorized form.
$\mathrm{sign}$ is the sign function (its graph is the step function shown in the figure):
$$\mathrm{sign}(x) = \begin{cases} +1, & x \geq 0 \\ -1, & x < 0 \end{cases}$$
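To make the definition concrete, here is a minimal numpy sketch of the decision function. The function name perceptron_predict and the example values are just for illustration; sign(0) is taken as +1, as in the definition above.
import numpy as np

# f(x) = sign(w·x + b), with sign(0) treated as +1
def perceptron_predict(w, b, x):
    return np.where(np.dot(x, w) + b >= 0.0, 1, -1)

# example: with w = (1, 1) and b = -2, the point (3, 1) is classified as +1 and (0, 0) as -1
print(perceptron_predict(np.array([1.0, 1.0]), -2.0, np.array([[3.0, 1.0], [0.0, 0.0]])))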
Perceptron learning strategies
The perceptron algorithm tries to separate the positive and negative instances perfectly with a hyperplane (a line in two dimensions). To find such a line, we can measure the sum of the distances to the hyperplane over all misclassified points. So let's first recall the distance formula.
Recall that the distance from a point $(x_1, y_1)$ to the line $ax + by + c = 0$ is
$$\frac{|ax_1 + by_1 + c|}{\sqrt{a^2 + b^2}}.$$
Similarly, for the perceptron model defined in the previous step, written in vectorized form, the distance from a point $x_0$ to the hyperplane $S$ is
$$\frac{1}{\|w\|} |w \cdot x_0 + b|,$$
where $\|w\|$ is the $L_2$ norm of $w$.
Second, for a misclassified point $(x_i, y_i)$ we have
$$-y_i(w \cdot x_i + b) > 0,$$
because when $w \cdot x_i + b > 0$ the label is $y_i = -1$, and when $w \cdot x_i + b < 0$ the label is $y_i = +1$. So the distance from a misclassified point $x_i$ to the hyperplane simplifies to
$$-\frac{1}{\|w\|} y_i (w \cdot x_i + b).$$
So we have the distance from a single misclassified point to the hyperplane. Suppose the set of all misclassified points is $M$. Then the sum of the distances from all misclassified points to the hyperplane is
$$-\frac{1}{\|w\|} \sum_{x_i \in M} y_i (w \cdot x_i + b).$$
Note: $w$ and $b$ can be scaled up or down together without changing the hyperplane they represent, so we drop the factor $\frac{1}{\|w\|}$ and obtain the perceptron learning loss function. Given a training set
$$T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\},$$
where $x_i \in X = R^n$, $y_i \in Y = \{-1, +1\}$, $i = 1, 2, \dots, N$,
the loss function of the perceptron algorithm is
$$L(w, b) = -\sum_{x_i \in M} y_i (w \cdot x_i + b),$$
where $M$ is the set of misclassified points.
So solving the perceptron model now amounts to finding $w, b$ that minimize $L(w, b)$. That is:
$$\min_{w, b} L(w, b) = -\sum_{x_i \in M} y_i (w \cdot x_i + b).$$
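As a small sanity check (not part of the original derivation), the loss can be computed directly in numpy. perceptron_loss is a hypothetical helper; X, y, w, b are assumed to be defined as above, with labels in {-1, +1}.
import numpy as np

def perceptron_loss(w, b, X, y):
    # L(w, b) = - sum over misclassified points of y_i * (w·x_i + b)
    values = np.dot(X, w) + b
    misclassified = y * values <= 0   # points with y_i(w·x_i + b) <= 0 belong to M
    return -np.sum(y[misclassified] * values[misclassified])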
Perceptron learning algorithm steps
We solve this with stochastic gradient descent, so first compute the gradient of the loss function $L(w, b)$:
$$\nabla_w L(w, b) = -\sum_{x_i \in M} y_i x_i, \qquad \nabla_b L(w, b) = -\sum_{x_i \in M} y_i.$$
Specific steps of the algorithm (a minimal code sketch of one update step follows the list):
(1) Select initial values $w_0, b_0$;
(2) Select a data point $(x_i, y_i)$ from the training set;
(3) If $y_i(w \cdot x_i + b) \leq 0$, update
$$w \leftarrow w + \eta y_i x_i, \qquad b \leftarrow b + \eta y_i,$$
where $\eta$ is the learning rate;
(4) Go back to (2) until there are no misclassified points left in the training set.
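To make step (3) concrete, here is a minimal numpy sketch of a single update. The function name perceptron_update and the default eta are illustrative only.
import numpy as np

# one stochastic-gradient update for a misclassified point (x_i, y_i),
# i.e. a point with y_i * (w·x_i + b) <= 0
def perceptron_update(w, b, x_i, y_i, eta=0.1):
    w = w + eta * y_i * x_i
    b = b + eta * y_i
    return w, b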
So far, we have described the steps of the perceptron algorithm. Next, we need to do two things: 1. call the perceptron algorithm in sklearn, and 2. write the perceptron algorithm ourselves and call it.
Invoke the perceptron model in sklearn
Start by importing the required libraries
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
Import the data set:
iris = load_iris()  # the iris data has 4 features in total; we take the two in columns 2 and 3 for convenient analysis
data = iris.data[:, 2:4][iris.target < 2]
target = iris.target[iris.target < 2]
Divide the data into training set and test set:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.33, random_state=42)
Let’s look at the distribution of the dataset:
plt.scatter(X_train[y_train==0][:, 0], X_train[y_train==0][:, 1], c='r',)
plt.scatter(X_train[y_train==1][:, 0], X_train[y_train==1][:, 1], c='g')
plt.xlim(-2, 6)
plt.ylim(-2, 6)
The result is shown below:
Next, we import the perceptron model and start training
from sklearn.linear_model import Perceptron
clf = Perceptron()
clf.fit(X_train, y_train)
X_predict = clf.predict(X_test)
X_predict
The results are as follows:
array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1,
0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1])
Compare this with y_test:
y_test
The results are as follows:
array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1,
0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1])
All predictions turned out to be accurate. Let’s use the score function to see how accurate it is:
clf.score(X_test, y_test)
The results are as follows:
1.0
The accuracy is indeed 100%: the data set is small and completely linearly separable, so a perfect score is expected.
Now I’m sure some of you can’t wait to see what the hyperplane is.
Let's take a look at the fitted weight vector $w$:
clf.coef_
The results are as follows:
array([[0.8, 0.8]])
And the bias $b$:
clf.intercept_
The results are as follows:
array([-2.])
Now that we have the parameters, we have the corresponding hyperplane: $0.8 x_1 + 0.8 x_2 - 2 = 0$, i.e. $x_2 = -x_1 + 2.5$.
Now let’s draw the line on the same graph as the previous data point to see what it looks like:
plt.scatter(X_train[y_train==0][:, 0], X_train[y_train==0][:, 1], c='r')
plt.scatter(X_train[y_train==1][:, 0], X_train[y_train==1][:, 1], c='g')
plt.xlim(-2, 6)
plt.ylim(-2, 6)
x_1 = np.arange(-2, 6, 0.01)
x_2 = -x_1 + 2.5
plt.plot(x_1, x_2)
The results are as follows:
It looks a bit odd, doesn't it? The line sits very close to the red points. But it still splits the data set into the two classes, which is all the perceptron is required to do.
Implement the perceptron algorithm yourself
In the previous step, we called the perceptron in sklearn and classified the samples into two classes. Now let's implement the perceptron algorithm ourselves and call it.
Self-implemented perceptron code is as follows:
#coding: utf-8
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.metrics import accuracy_score


class Perceptron:
    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        self.eta = eta                      # learning rate
        self.n_iter = n_iter                # maximum number of iterations
        self.random_state = random_state    # seed for weight initialization

    def fit(self, X, y):
        """Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values (+1 or -1).

        Returns
        -------
        self : object
        """
        assert X.ndim == 2, "X must be 2 dimensional"
        assert y.ndim == 1, "y must be 1 dimensional"
        assert X.shape[0] == y.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        # initialize w_ with small random numbers; w_[0] is the bias b, w_[1:] is w
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
        for _ in range(self.n_iter):
            # M is the set of misclassified points
            M = X[self.predict(X) != y]
            M_target = y[self.predict(X) != y]
            if len(M) > 0:
                # pick the first misclassified point and update w and b
                M_predict = self.predict(M)
                x_i = M[0]
                # update = eta * (y_i - predicted_i); for a misclassified point this is +/- 2 * eta
                update = self.eta * (M_target[0] - M_predict[0])
                self.w_[1:] += update * x_i
                self.w_[0] += update
        return self

    def _compute_value(self, X):
        assert X.ndim == 2, "function _compute_value: X must be 2 dimensional"
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        assert X.ndim == 2, "function predict: X must be 2 dimensional"
        return np.where(self._compute_value(X) >= 0.0, 1, -1)

    def score(self, X_test, y_test):
        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)
Now let's call it and see what happens. (For importing and splitting the data set, reuse the code from the previous section; here we directly call our own perceptron.)
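One detail to watch: load_iris gives labels 0 and 1, while this implementation predicts -1 and +1 (and the plot below filters on y_train == -1). A small sketch of the conversion, assuming the X_train/y_train variables from the sklearn section:
import numpy as np

# map labels {0, 1} -> {-1, +1} so they match the perceptron's sign output
y_train = np.where(y_train == 0, -1, 1)
y_test = np.where(y_test == 0, -1, 1)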
from perceptron import Perceptron  # assuming the class above is saved as perceptron.py
clf = Perceptron(n_iter=10000, eta=0.1)
clf.fit(X_train, y_train)
clf.predict(X_test)
clf.score(X_test, y_test), clf.w_
The results are as follows:
(1.0, array([-0.38375655, 0.13388244, 0.19471828]))
We can see that the accuracy is 100%. clf.w_[0] corresponds to $b$ in our model, and the remaining part, clf.w_[1:], represents $w$.
Let’s plot the points and the computed line:
plt.scatter(X_train[y_train==1][:, 0], X_train[y_train==1][:, 1], c='r')
plt.scatter(X_train[y_train==-1][:, 0], X_train[y_train==-1][:, 1], c='g')
plt.xlim(-2, 6)
plt.ylim(-2, 6)
y_1 = np.arange(-2, 6, 0.01)
y_2 = -clf.w_[1] / clf.w_[2] * y_1 - clf.w_[0] / clf.w_[2]
plt.plot(y_1, y_2)
The results are as follows:
You can see that the data set has been correctly separated.
Conclusion
The self-implemented algorithm has an eta parameter, which is the learning rate. The larger eta is, the faster the convergence; however, if eta is too large, the convergence process fluctuates too much. You need to tune this parameter yourself.
The Perceptron in sklearn has corresponding parameters as well. If you are interested, set different values and observe how the training behaves.
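For example, here is an illustrative way to compare a few learning rates with our implementation, reusing X_train, y_train, X_test, y_test from above:
# try a few learning rates and compare the test accuracy
for eta in (0.001, 0.01, 0.1, 1.0):
    clf = Perceptron(n_iter=1000, eta=eta)
    clf.fit(X_train, y_train)
    print(eta, clf.score(X_test, y_test))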