6.1 the introduction
PLA is the full name of Perceptron Linear Algorithm, namely Linear Perceptron Algorithm, which is one of the simplest Perceptron models. The perceptron is a linear classification model of binary classification. Its basic structure is shown in Figure 1. Its input is the feature vector of the instance, and its output is the category of the instance, with binary values of +1 and -1. The perceptron, corresponding to the input space (feature space), divides the instance into positive and negative separated hyperplane, which belongs to the discriminant model and solves the classification problem. Perceptron learning algorithm is simple and easy to implement. It can be divided into primitive form and dual form. Perceptron prediction is to classify new input instances using the learned perceptron model. Perceptron algorithm was proposed by Rosenblatt in 1957, which is the basis of support vector machine (SVM) and neural network. Learning perceptron is of great benefit to SVM and neural network.
Figure 1. Perceptron model
Where x I x_i xi is the input, W I W_i WI is the weight coefficient, and b b b is the offset constant.
According to the three elements of statistical learning:
- According to the linear model, a symbolic function (discriminant model);
- The loss function is the sum of the distance between the error point and the hyperplane.
- According to the algorithm, the gradient descent algorithm is used for optimization.
Ok, let’s start to learn the perceptron algorithm. The goal of this chapter is to find the classification hyperplane that linearly divides training data, import the loss function of misclassification, and minimize the loss function by using gradient descent method to find the perceptron model.
6.2 Perceptron model
6.2.1 Perceptron Model definition
It was assumed that the input space (the characteristic space) was a χ⊆Rn \chi\subseteq R^n χ⊆Rn, and the output space was Y=+1,−1 Y={+1,-1} Y=+1,−1. The input x⊆χ x \subseteq\chi x⊆χ represented the feature vector of the instance, corresponding to the point of the input space (the feature space); “Y ⊆ y\subseteq y y⊆ y” indicates the category of an instance. The following function from input space to output space:
C (x)= S I gn(W ⋅x+b) \color{red}f(x)=sign(w\cdot x+b) F (x)=sign(W \ x+b)
It’s called a perceptron. In which, w w w and b b are model parameters of the perceptron, w⊆Rn w\subseteq R^n w⊆Rn is called weight or weight vector, ⋅x w\cdot x w⋅x represents the interior product of W W w and x x x. S I gn() sign() is a sign function:
S I g n (x) = {+ 1 x ≤ 0 − 1 x < 0 sign(x)= \begin{cases} +1& {x\leq0}\\ -1& {x< {0} \ end cases} sign (x) = {x + 1-1 0 x or less < 0
Perceptron is a linear classification model and belongs to discriminant model. The hypothesis space of perceptron model is all linear classification models or linear classifiers defined in the feature space, that is, the set of functions:
{f ⋅ x ∣ f (x) = w + b} \ {\ | f (x) = w f cdot x + b \} {f ⋅ x ∣ f (x) = w + b}
6.2.2 Geometric interpretation of perceptron model
The linear equation W ⋅x+b=0 w\cdot x+b=0 W ⋅x+b=0 w⋅x+b=0 corresponds to a hyperplane S S S in the eigenspace Rn R^n Rn, where W w w w is the normal vector of the hyperplane and b b is the intercept of the hyperplane. This hyperplane divides the feature space into two parts. Points in both parts (eigenvectors) are divided into positive and negative categories, because S S S hyperplanes are called separating Hyperplanes.
Figure 2
The hyperplane in Rn R^n Rn is:
W ⃗ ⋅ ⃗ + b = 0 x \ vec {w} \ cdot \ vec {x} ⋅ x + b + b = 0 w = 0
In a multidimensional space, the vectors w ⃗,x ⃗ \vec{w},\vec{x} w,x is the multidimensional. Of course, w ⃗,x ⃗ \vec{w},\vec{x} w,x belongs to this space. In two dimensions, the equation represents a straight line, which is a hyperplane of a plane. In three dimensions, this equation represents a plane, which is the hyperplane of space.
The distance between a point and the hyperplane
Vector projection: given two vectors u ⃗,v ⃗ \vec{u},\vec{v} u,v, find the length of the projection of w ⃗ \vec{w} w on x ⃗ \vec{x} x, the Angle between the vectors is cosθ cosθ cosθ.
Figure 3
D = ∣ u ⃗ ∣ c o s theta d = | \ vec {u} | cos \ theta d = ∣ u ∣ cosine theta, C o s theta = u ⃗ ⋅ v ⃗ ∣ u ⃗ ∣ ∣ v ⃗ ∣ cos \ theta = \ frac {\ vec {u} \ cdot \ vec {n}} {| \ vec {u} | | \ vec |} {n} cosine theta equals ∣ u ∣ ∣ u ⋅ ∣ v v, in conclusion, D = u ⃗ ⋅ v ⃗ ∣ v ⃗ ∣ d = \ frac {\ vec {u} \ cdot \ vec {n}} {| \ vec |} {n} d = ∣ u ⋅ ∣ v v.
Distance from point to hyperplane: ⋅x ⃗ +b=0 \vec{w}\cdot \vec{x}+b=0 w ⋅x +b= any point on Then the distance of point x x x from the hyperplane is x−x0 x-x_0 x−x0 on the projection length of the normal vector w ⃗ \vec{w} w of the hyperplane:
D = ∣ ∣ w (x – x 0) + b ∣ ∣ ∣ ∣ w ∣ ∣ = ∣ ∣ w x – w x 0 + b ∣ ∣ ∣ ∣ w ∣ ∣ = ∣ ∣ w x ⃗ + b ∣ ∣ ∣ ∣ w ∣ ∣ d=\frac{||w(x-x_0)+b||}{||w||}=\frac{||wx-wx_0+b||}{||w||}=\frac{||w\vec{x}+b||}{||w||} D = ∣ ∣ w ∣ ∣ ∣ ∣ w (x – x0) + b ∣ ∣ = ∣ ∣ w ∣ ∣ ∣ ∣ wx – wx0 + b ∣ ∣ = ∣ ∣ w ∣ ∣ ∣ ∣ wx + b ∣ ∣
The origin of the distance to the hyperplane – b ∣ ∣ w ∣ ∣ – \ frac {b} {| | w | |} – ∣ ∣ w ∣ ∣ for b.
Perceptron learning: the training data set T = {(1 x, y, 1), (2) 2 x, y,… (N x, y, N)} T = \ {(x_1, y_1), (x_2, y_2),… (x_N,y_N)\} T={(x1,y1),(x2,y2),… (xN,yN)} (eigenvectors and classes of instances), where, X I ∈χ Rn x_i\in \chi \subseteq R^n xi∈χ⊆Rn, y I ∈ y ⊆ = {+ 1, – 1} y_i \ \ Y in subseteq = \ \} {+ 1, – 1 yi ∈ ⊆ Y = {+ 1-1}, I = 1, 2, 3,…., N I = 1, 2, 3,… , N I = 1, 2, 3,… N, find the perceptron model, that is, find the parameters W, W, W and b, b, b. Perceptron prediction: The perceptron model obtained through learning can give corresponding output categories for new input instances.
6.3 Perceptron Policy
Given a data set, a T = {(1 x, y, 1), (2) 2 x, y,… (N x, y, N)} T = \ {(x_1, y_1), (x_2, y_2),… (x_N,y_N)\} T={(x1,y1),(x2,y2),… (xN, yN)}, X I ∈χ Rn x_i\in \chi \subseteq R^n xi∈χ⊆Rn, y I ∈ y ⊆ = {+ 1, – 1} y_i \ \ Y in subseteq = \ \} {+ 1, – 1 yi ∈ ⊆ Y = {+ 1-1}, I = 1, 2, 3,…., N I = 1, 2, 3,… , N I = 1, 2, 3,… , N. If there is a hyperplane:
⋅x+b=0 W \cdot x+b=0 W ⋅x+b=0
K k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k 0 w\cdot x_i+b> 0 w ⋅ xi + b > 0; 12. W x I + b < for instance I I I =−1 0 w\cdot x_i+b< 12. 0 W ⋅xi+b<0, then the dataset T T T is called Linear fractable dataset. Otherwise, the data set T T T is said to be linearly indivisible.
Assuming that the training data set is linearly separable, the goal of perceptron learning is to obtain a separate hyperplane that can completely separate the positive and negative instance points of the training set. In order to find such a hyperplane, namely, to determine the parameters w, Bw, bw, b of the perceptron model, a learning strategy needs to be determined. That is to define and minimize the (empirical) loss function.
A natural selection of the loss function is the total number of misclassification points, but such a loss function is not the continuous differentiable function of parameters W, b w, b w, b, and should not be optimized. Another alternative to the loss function is the total distance from the misclassification point to the hyperplane S S S, which is used by the perceptron. To do this, first write the distance (from the point to the line) from any point in the input space Rn R^n x0 X_0 x0 to the hyperplane S S S:
1 ∣ ∣ w ∣ ∣ ⋅ 0 x ∣ w + b ∣ \ frac {1} {| | w | |} \ | w cdot x_0 + b | ∣ ∣ w ∣ ∣ ∣ 1 w ⋅ x0 + b ∣ including ∣ ∣ w ∣ ∣ | | w | | ∣ ∣ w ∣ ∣ is w w w L 2 L_2 L2 norm.
Secondly, for the misclassification points (xi,yi) (x_i,y_i) (xi,yi),
− Y I (W ⋅ x I + b) > 0 -y_i(w\cdot x_i+b)> 0 – yi ⋅ xi + b (w) > 0
Was established. Because w… 0 w\cdot x_0+b> ⋅x0+b>0, y I =−1 y_i=-1 yi=−1, and w⋅x I +b < 0 w\cdot x_i+b< 0 w⋅xi+b<0, y I =±1 y_i=\pm1 yi=±1 So the distance from the misequinox x I x_I xi to the hyperplane S S S can be written as follows:
– 1 ∣ ∣ w ∣ ∣ y I ⋅ x 0 + b (w) – \ frac {1} {| | w | |} y_i (w \ cdot x_0 + b) – ∣ ∣ w ∣ ∣ 1 yi (w ⋅ x0 + b)
Thus, assuming that the set of misclassification points of the hyperplane S S is M M M, the total distance between all misclassification points and the hyperplane S S is:
– 1 ∣ ∣ w ∣ ∣ ∑ x I ⊆ M y I ⋅ x 0 + b (w) – \ frac {1} {| | w | |} \ sum_} {x_i \ subseteq M y_i (w \ cdot x_0 + b) 1 ∑ – ∣ ∣ w ∣ ∣ xi ⊆ Myi (w ⋅ x0 + b) do not consider 1 ∣ ∣ w ∣ ∣ \ frac {1} {| | w | |} ∣ ∣ w ∣ ∣ 1, get the perceptron learning loss function.
【 note 】 don’t consider 1 ∣ ∣ w ∣ ∣ \ frac {1} {| | w | |} ∣ ∣ w ∣ ∣ 1. 1, 1 ∣ ∣ w ∣ ∣ \ frac {1} {| | w | |} ∣ ∣ w ∣ ∣ 1 constant is positive, ⋅ X I + B) -y_i(W \cdot X_I + B) – Yi (W \ xi+ B) positive and negative judgment, that is, does not affect the intermediate process of learning algorithm. Since the perceptron learning algorithm is driven by misclassification (the model is adjusted only when misclassification occurs, or the loss function is only related to the misclassification point), it should be noted that ⋅ X I +b) -y_i(W \cdot X_I + B) -Yi (W \ xi+b) And 1 ∣ ∣ w ∣ ∣ \ frac {1} {| | w | |} ∣ ∣ w ∣ ∣ 1 does not affect the positive and negative value judgment, so 1 ∣ ∣ w ∣ ∣ \ frac {1} {| | w | |} ∣ ∣ w ∣ ∣ 1 could do without the sense of machine learning algorithm in the middle of the process.
2, 1 ∣ ∣ w ∣ ∣ \ frac {1} {| | w | |} ∣ ∣ w ∣ ∣ 1 does not affect the final result of perception machine learning algorithm, because the perceptron learning algorithm of the final termination conditions are all input are correct classification, namely there is no point of misclassification, loss function is zero at this time, Corresponding to a – 1 ∣ ∣ w ∣ ∣ y I ⋅ x (I) + b (w) – \ frac {1} {| | w | |} y_i (w \ cdot x_i + b) – ∣ ∣ w ∣ ∣ 1 yi ⋅ xi + b (w), namely molecules to 0. You can see 1 ∣ ∣ w ∣ ∣ \ frac {1} {| | w | |} ∣ ∣ w ∣ ∣ 1 is no effect on the final result.
To sum up, even ignore 1 ∣ ∣ w ∣ ∣ \ frac {1} {| | w | |} ∣ ∣ w ∣ ∣ 1, nor the implementation process of perceptron learning algorithm have any effect, but also can simplify operation, improve the efficiency of algorithm implementation. Given a data set, a T = {(1 x, y, 1), (2) 2 x, y,… (N x, y, N)} T = \ {(x_1, y_1), (x_2, y_2),… (x_N,y_N)\} T={(x1,y1),(x2,y2),… (xN, yN)}, X I ∈χ Rn x_i\in \chi \subseteq R^n xi∈χ⊆Rn, y I ∈ y ⊆ = {+ 1, – 1} y_i \ \ Y in subseteq = \ \} {+ 1, – 1 yi ∈ ⊆ Y = {+ 1-1}, I = 1, 2, 3,…., N I = 1, 2, 3,… , N I = 1, 2, 3,… , N. ⋅ X + B) Sign (W \cdot x+ B) Sign (W \ x+ B) Loss function is defined as:
L (w, b) = – ∑ x I ⊆ M y I ⋅ x (I) + b (w) (w, b) = L – \ sum_} {x_i \ subseteq M y_i (w \ cdot x_i + b) L (w, b) = – ∑ xi ⊆ Myi ⋅ xi + b (w)
Where, M M M is the set of misclassification points, and the loss function is the empirical risk function of perceptron learning.
Obviously, the loss function L(w,b) L(w,b) L(w,b) is non-negative. If there are no misclassification points, the loss function is zero. In addition, the fewer misclassification points and the closer the misclassification points are to the hyperplane, the smaller the loss function value will be. The loss function for a particular sample point: Is a linear function of parameters W, bw, bw,b in misclassification, and is 0 in correct classification. Therefore, given the training data set T, T, T, the loss function L(w,b) L(w,b) L(w,b) is the continuous differentiable function of W, bw, bw,b.
In a word, the perceptron learning strategy is to select the model parameters W, Bw, bw,b that minimize the loss function formula in the hypothesis space, namely, the perceptron model.
6.4 Perceptron learning algorithm
6.4.1 Original form of perceptron algorithm
Given a set of data T = {(1 x, y, 1), (2) 2 x, y,… (N x, y, N)} T = \ {(x_1, y_1), (x_2, y_2),… (x_N,y_N)\} T={(x1,y1),(x2,y2),… (xN, yN)}, X I ∈χ Rn x_i\in \chi \subseteq R^n xi∈χ⊆Rn, y I ∈ y ⊆ = {+ 1, – 1} y_i \ \ Y in subseteq = \ \} {+ 1, – 1 yi ∈ ⊆ Y = {+ 1-1}, I = 1, 2, 3,…., N I = 1, 2, 3,… , N I = 1, 2, 3,… , N. Find the parameters w, bw, bw,b and make them the solution of the following loss function minimization problem:
m i n w , b ( L ( w , B)) = – ∑ x I ⊆ M y I ⋅ I + b) x (w \ underset {w, b} {min} (L) (w, b) = – \ sum_} {x_i \ subseteq M y_i (w \ cdot x_i + b) W, bmin (L) (w, b) = – ∑ xi ⊆ Myi ⋅ xi + b (w)
Where, M M M is the set of misclassification points.
Perceptron learning algorithm is driven by misclassification and adopts stochastic gradient descent algorithm. Firstly, a hyperplane w0,b0, W_0, B_0, w0,b0 is selected arbitrarily, and then the objective function is continuously minimized by gradient descent algorithm. In the process of minimization, the gradient descent of all misclassification points in M M M is not made at one time, but one misclassification point is randomly selected at a time. Assuming that the set of misclassification points M M M is fixed, then the gradient of the loss function L(w,b) L(w,b) L(w,b) is determined by:
▽ W L (W, B))=−∑ X I ⊂My I x I \bigtriangledown_wL(W,b))=-\ SUM_ {x_I \subset M} y_IX_I ▽wL(w,b))=−∑xi⊂Myixi ▽ B L(W, B)) = – ∑ I ⊂ M y x x I \ I bigtriangledown_bL (w, b)) = – \ sum_ {} x_i \ subset M y_ix_i del bL (w, b)) = – ∑ xi ⊂ Myixi
Is given. A random misclassification point (xi,yi) (x_i,y_i) (xi,yi) is selected to update w, bw, bw,b:
W ← W +ηy I x I w \leftarrow w+ eta y_ix_I W ← W +ηyixi b←+ηy I b\leftarrow+ eta y_i b←+ηyi
Where, η (0 < η ≤ 1) \eta(0< \eta \leq 1) η(0<η≤1) is the step length, also known as the learning rate in statistical learning. In this way, the loss function L(w,b) L(w,b) L(w,b) can be expected to shrink continuously until it is 0 through iteration. In summary, the following algorithm can be obtained:
Algorithm: the perceptron algorithm: the original form of input training data sets, T = {(1 x, y, 1), (2) 2 x, y,… (N x, y, N)} T = \ {(x_1, y_1), (x_2, y_2),… (x_N,y_N)\} T={(x1,y1),(x2,y2),… (xN,yN)}, where, X I ∈χ Rn x_i\in \chi \subseteq R^n xi∈χ⊆Rn, y I ∈ y ⊆ = {+ 1, – 1} y_i \ \ Y in subseteq = \ \} {+ 1, – 1 yi ∈ ⊆ Y = {+ 1-1}, I = 1, 2, 3,…., N I = 1, 2, 3,… , N I = 1, 2, 3,… ,N, learning rate η (0 < η ≤ 1) \eta(0< \eta \leq 1) η(0
(1) Select initial values w0,b0 w_0,b_0 w0,b0; (2) Select data (xi, Yi) (X_i, Y_i) (xi,yi) in the training set; 12) If y I (W ⋅x I +b)≤0 y_i(w\cdot x_i+b)\leq 0 yi(W \ xi+b)≤0
W ← W +ηy I x I w \leftarrow w+ eta y_ix_I W ← W +ηyixi b←+ηy I b\leftarrow+ eta y_i b←+ηyi
(4) Go to (2) until there is no mismark in the training set. Intuitively this learning algorithm has the following explanation, when an instance is misclassification, namely in the separating hyperplane wrong side, the adjustment of w, w, b b w, the value of b, the separation of super flat for the misclassification points on one side of the move, to reduce the misclassification point and the distance between the hyperplane, until the hyperplane through the misclassification, make its be classified correctly. As the perceptron learning algorithm adopts different initial values (W 0,b0) (W_0,b_0) (w0, B0) or selects different misclassification points (because the misclassification points are randomly selected), the final solutions can be different.
Example 1 (original form solution) The training set shown below, the real example point is x I =(3,3)T x_i=(3,3)^T xi=(3,3)T, x2=(4,3)T x_2=(4,3)^T x2=(4,3)T, X 3=(1,1)T x_3=(1,1)^T x3=(1,1)T, ⋅ X + B) f(x)= Sign (W \cdot x+b) F (x)= Sign (W \ x+ B) here w = ( w ( 1 ) , w ( 2 ) ) T w=(w^{(1)},w^{(2)})^T w=(w(1),w(2))T , x = ( x ( 1 ) , T x = x (2)) (x ^ {} (1), x ^ {} (2)) ^ T x = (x (1), x (2)) T.
Figure 4. Perceptron example
Solution: Construct optimal problem: m i n w , b ( L ( w , B)) = – ∑ x I ⊆ M y I ⋅ I + b) x (w \ underset {w, b} {min} (L) (w, b) = – \ sum_} {x_i \ subseteq M y_i (w \ cdot x_i + b) W, bmin (L) (w, b) = – ∑ xi ⊆ Myi ⋅ xi + b (w) in accordance with the above algorithm for w, eta = 1 w, b, b, eta = 1 w, \ b, eta = 1. (1) Take the initial value, w0=0,b0=0, W_0 =0,b_0=0, w0=0,b0=0; X I =(3,3)T x_i=(3,3)^T xi=(3,3)T, y I (w0 x I +b0)=0 y_i(w_0\cdot x_i+b_0)=0 yi(w0 \ xi+b0)=0, Not classified correctly, update w,b w,b w,b
Y 0 + 1 x 1 w = 1 w = (3, 3) T w_1 = w_0 + y_1x_1 w1 = = (3, 3) ^ T w0 + y1x1 = T (3, 3), b = b y 0 + 1 = 1 b_1 = b_0 + y_1 = 1 b1 = b0 + y1 = 1
You get a linear model
w 1 x + b 1 = 3 x ( 1 ) + 3 x ( 2 ) + 1 w_1x+b_1=3x^{(1)}+3x^{(2)}+1 w1x+b1=3x(1)+3x(2)+1
12. For x1,x2 x_1,x_2 x1,x2, obviously, y I (w 1 ⋅ x I + b 1) > 0 y_i(w_1\cdot x_i+b_1)> X 3=(1,1)T x3=(1,1) ^T x3=(1,1)T, y 3 (w1⋅x 3 +b1) < 0 y_3(w_1\cdot x_3+b_1)< ⋅ 3(w1⋅x3+b1)<0
W 1 + 2 = w 3 x 3 = y (2, 2) T + w_2 = w_1 y_3x_3 w2 = = (2, 2) ^ T w1 + y3x3 = (2, 2) T, b 2 b = 1 + 3 = 0, y = b_2 b_1 + y_3 = 0 b2 b1 + y3 = = 0
You get a linear model
w 2 x + b 2 = 2 x ( 1 ) + 2 x ( 2 ) w_2x+b_2=2x^{(1)}+2x^{(2)} w2x+b2=2x(1)+2x(2)
And so on, All the way to w7=(1,1)T w_7=(1,1)^T w7=(1,1)T, B = 7-3 b_7 = 3 b7 = 7 x – 3 w + b 7 = x (1) + x (2) – 3 w_7x + b_7 = x ^ {} (1) + x ^ {} (2) – 3 w7x + b7 = x (1) + x (2) – 3
For all classification points, there is no misclassification point and the loss function is minimum. Classification hyperplane as follows: (1) + x x (2) – 3 x ^ {} (1) + x ^ {} (2) – 3 x (1) + x (2) – 3 perceptron model for: F (x = s I g n (x (1) + x (2) – 3) f (x = sign (x ^ {} (1) + x ^ {} (2) – 3) f (x = sign (x (1) + x (2) – 3) iterative process table:
Table 1 Iterative process of solution
This is the separated hyperplane and the perceptron model obtained by taking the error points successively in the calculation. If the error points are taken successively in the calculation, the separated hyperplane obtained is. It can be seen that the perceptron model algorithm can have different solutions due to adopting different initial values or selecting different misclassification points.
[Note] Convergence Proof for the Perceptron Algorithm
6.4.1 Dual form of perceptron algorithm
The basic idea of the dual form is to express W and b as linear combinations of instances and markers, and to find W, W, W and B, b by solving for their coefficients. Suppose that the initial values w0,b0, W_0,b_0, w0,b0 are all 0,. For misclassification point x I, Y I x_i,y_i xi,yi through α I ←α I +ηy I x I \alpha_i\leftarrow\alpha_i+\eta y_ix_i α I ←α I +ηyixi b ← +ηy I B eta y_i please b + + \ \ leftarrow eta yi
Step by step change w, W, W,b,b,b, let’s do it n, n, n times, then the increments of w,b, W,b, w,b with respect to x I,y I x_i,y_i xi,yi are respectively and, here, so as you can see from the learning process, The w that I learned at the end, ∑ I =1Nα I y I x I \sum_{I =1}^{N}\alpha_iy_ix_i ∑ I =1Nαiyixi b = ∑ I =1Nα I y I B = \ sum_ {I = 1} ^ {N} \ alpha_i y_i b = ∑ I = 1 N alpha iyi
Where, α I > = 0 \alpha_i> =0 α I >=0, I =1,2, N I =1,2… , N I = 1, 2,… N When η=1 \eta =1 η=1, it means the number of misclassification of the i-th instance point. The more times the instance point is updated, the closer it is to the separation hyperplane, and the more difficult it is to classify.
Algorithm: the perceptron algorithm: the dual form of input training data sets, T = {(1 x, y, 1), (2) 2 x, y,… (N x, y, N)} T = \ {(x_1, y_1), (x_2, y_2),… (x_N,y_N)\} T={(x1,y1),(x2,y2),… (xN,yN)}, where, X I ∈χ Rn x_i\in \chi \subseteq R^n xi∈χ⊆Rn, y I ∈ y ⊆ = {+ 1, – 1} y_i \ \ Y in subseteq = \ \} {+ 1, – 1 yi ∈ ⊆ Y = {+ 1-1}, I = 1, 2, 3,…., N I = 1, 2, 3,… , N I = 1, 2, 3,… ,N, learning rate η (0 < η ≤ 1) \eta(0< Eta \ \ 1) leq eta (0 < eta 1 or less); Output: α,b \alpha,b α,b; ⋅ x I +b) F (x)=sign(\ Sum_ {j=1}^{n}\alpha_jy_jx_j\cdot x_i+b) F (x) = sign (∑ j = 1 n alpha jyjxj ⋅ xi + b). (1) A ←0, a\leftarrow0, A ←0, B ←0 b\leftarrow0 b←0; (2) Select data (xi, Yi) (X_i, Y_i) (xi,yi) in the training set; (3) If y I (∑ j=1 N α jy jx j j k k I +b) ≤ 0 y_i(\sum_{j=1}^{N}\alpha_jy_jx_j\cdot x_i+b) \leq0 Yi (∑ j = 1 n alpha jyjxj ⋅ xi + b) 0 or less
α i ← α i + η \alpha_i\leftarrow\alpha_i+\eta αi←αi+η
b ← + η y i b\leftarrow+\eta y_i b←+ηyi
(4) Go to (2) until there is no mismark in the training set. In the dual form, training instances only appear in the form of inner product. For convenience, the inner product between instances in the training set can be calculated in advance and stored in the form of matrix, which is the so-called Gram matrix: G = [x I ⋅ x j] N * N G = [x_i \ cdot x_j] _ \ times N} {N G = [xi ⋅ xj] N * N
Example 2 (dual form solution) Data same as example 1, using the dual form of perceptron learning algorithm to solve the perceptron model. Solution: (1) take αj=0 \alpha_j=0 αj=0 I =1,2,3 I =1,2,3 I =1,2,3 I =1,2,3, b=0 b=0, η=1 \eta=1 η=1 (2) calculate Gram matrix
(3) G = [ 18 21 6 21 25 7 6 7 2 ] G=\left[ \begin{matrix} 18 & 21 & 6 \\ 21 & 25 & 7 \\ 6 & 7 & 2 \end{matrix} \right] \tag{3} G=⎣, repeatable, repeatable ⎤(3)
σ j=1 N α jy jx jx x I +b) ≤ 0 y_i(\ Sum_ {j=1}^{N}\alpha_jy_jx_j\cdot x_i+b) \leq0 Yi (∑ j = 1 n alpha jyjxj ⋅ xi + b) 0 or less
Parameter update α I ←α I +1, B ←+y I \alpha_i\leftarrow\ alpha_I +1, B \leftarrow+y_i α I ←α I +1, B ←+yi
(4) Iteration, For x 1 = (3, 3) T, Y I (0 ∗ I ⋅ x 1 x 1 + b y 0) = 0 x_1 = (3, 3) ^ T, y_i (0 * y_ix_1 \ cdot x_1 + b_0) = 0 x1 = T (3, 3), yi (0 ∗ yix1 ⋅ x1 + b0) = 0, is not the correct classification, Updated alpha b \ alpha, alpha, b b at alpha 1 = 0, alpha 2 = 0, alpha 3 = 0, b = 0 \ alpha_1 = 0, \ alpha_2 = 0, \ alpha_3 = 0, b = 0 alpha 1 = 0, alpha 2 = 0, alpha 3 = 0, b = 0
Alpha 1 = alpha 1 + 1, + 1, b = b \ alpha_1 = \ alpha_1 + 1, b = b + 1, alpha 1 = alpha 1 + 1, b = b + 1,
So alpha 1 = 1, alpha 2 = 0, alpha 3 = 0, b = 1 \ alpha_1 = 1, \ alpha_2 = 0, \ alpha_3 = 0, b = 1 alpha 1 = 1, alpha 2 = 0, alpha 3 = 0, b = 1
12 (∑ j = 1 3 α j y j x j j I + b) = (α 1 y 1 x 1 + α 2 y 2 x 2 + α 3 y 3 x 3) x + b = (α 1 y 1 x 1 (3, 3) T ⋅ x + 1 y_i (\ sum_ ^ {j = 1} {3} \ alpha_jy_jx_j \ cdot x_i + b) = (\ alpha_1y_1x_1 + + \ \ alpha_2y_2x_2 alpha_3y_3x_3) x + b = (\ alpha_1y_1x_1) \ cdot x + b = (3, 3) ^ T \ cdot x + 1 yi (∑ j = 13 alpha jyjxj ⋅ xi + b) = (alpha 1 y1x1 + alpha 2 y2x2 + alpha 3 y3x3) x + b = (alpha 1 y1x1) ⋅ x + b = (3, 3) T ⋅ x + 1
And so on, Table 2 shows the results. (5) W =2×1+0x2−5×3=(1,1)T w=2x_1+0x_2-5x_3=(1,1)^T W =2×1+0x2−5×3=(1,1)T b=−3 b=-3 b=−3
The classified hyperplane is:
X (1) (2) – 3 + x = 0 x ^ {} (1) + x ^ {} (2) – 3 = 0 x (1) (2) – 3 + x = 0
Perceptron model: f (x) = s I g n (x (1) + x (2) – 3) f (x) = sign (x ^ {} (1) + x ^ {} (2) – 3) f (x) = sign (x (1) + x (2) – (3)
The iterative process is as follows:
Table 2 Iterative process
For example 1, the results are consistent, the iterative steps are compared with each other, and there are multiple solutions.
[1] Li Hang. Statistical Learning Methods