Background knowledge needed to read this article: linear support vector machines, a little programming knowledge.
1. Introduction
In the previous two sections we introduced two support vector machine models: the hard-margin support vector machine and the soft-margin support vector machine. These two models can be collectively called linear support vector machines. Next, we introduce another model, the nonlinear support vector machine1 (non-linear support vector machine).
2. Model Introduction
Before introducing the nonlinear support vector machine, consider the following linearly inseparable data set:
Figure 2-1
In Figure 2-1, the circles and crosses represent the two different label classes. Clearly, this data set cannot be correctly separated by a straight line, but as Figure 2-2 shows, it can be correctly separated by an elliptic curve.
Figure 2-2
However, solving a nonlinear classification problem with an elliptic boundary such as the one in Figure 2-2 is relatively difficult. Since linear classification problems are comparatively easy to solve, can the data be turned into a linearly separable data set through some nonlinear transformation? The answer is yes.
Figure 2-3
As shown in Figure 2-3, after applying the nonlinear transformation (z = x * x) to the data set, the transformed data can be correctly separated by a straight line. The elliptic curve in Figure 2-2 becomes the straight line in Figure 2-3, and the original nonlinear classification problem becomes a linear classification problem.
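As a small, hedged illustration of this idea (assuming the transformation in the figures squares each coordinate), the sketch below shows that data separated by an ellipse becomes linearly separable after the transformation:

import numpy as np

# Squaring each coordinate turns the elliptical boundary a*x1^2 + b*x2^2 = c
# into the straight line a*z1 + b*z2 = c in the transformed space.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
# label +1 inside the ellipse x1^2 / 2 + x2^2 <= 1, label -1 outside
y = np.where(X[:, 0] ** 2 / 2 + X[:, 1] ** 2 <= 1, 1, -1)
# nonlinear transformation z = (x1^2, x2^2)
Z = X ** 2
# in z-space the classes are separated by the straight line z1 / 2 + z2 = 1
print(np.all((Z[:, 0] / 2 + Z[:, 1] <= 1) == (y == 1)))  # True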
The original model
Let's first review the primal model of the soft-margin support vector machine from the previous section:
Equation 2-1
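Since the original equation image is not reproduced here, the standard soft-margin primal problem that Equation 2-1 corresponds to is:

$$\min_{w,b,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\ \ y_i\big(w^\top x_i + b\big) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1,\dots,N$$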
Suppose there is a function φ that transforms x (for example, mapping x to z in the example above). Substituting it into the primal model of the soft-margin support vector machine gives:
Equation 2-2
Equation 2-2 above is the primal model of the nonlinear support vector machine.
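In standard notation (a reconstruction, since the equation image is not shown here), this is the soft-margin primal with every x_i replaced by φ(x_i):

$$\min_{w,b,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\ \ y_i\big(w^\top \phi(x_i) + b\big) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1,\dots,N$$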
Dual model
Similarly to the dual models derived in the previous two sections, the dual model of the nonlinear support vector machine can be obtained:
Equation 2-3
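In the standard maximization form (a reconstruction of the missing image), this is the soft-margin dual with the inner product of the samples replaced by the inner product of the transformed features:

$$\max_{\alpha}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\,\phi(x_i)^\top\phi(x_j) \quad \text{s.t.}\ \ \sum_{i=1}^{N}\alpha_i y_i = 0,\ \ 0 \le \alpha_i \le C$$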
The kernel trick
Looking at the dual model in Equation 2-3, the dimension of the feature vector after the φ transformation may be very high, which usually makes the inner product hard to compute directly. To bypass this inner product computation, we can look for a function, as shown in Equation 2-4, whose value equals the inner product of the transformed features. Such a function is called a kernel function, and this substitution method is called the kernel trick2.
Equation 2-4
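That is, a kernel function computes the transformed inner product directly:

$$K(x, z) = \phi(x)^\top \phi(z)$$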
Substituting the kernel trick into Equation 2-3 gives the final dual model of the nonlinear support vector machine:
Equation 2-5
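In the same notation as above (a reconstruction of the missing image):

$$\max_{\alpha}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\,K(x_i, x_j) \quad \text{s.t.}\ \ \sum_{i=1}^{N}\alpha_i y_i = 0,\ \ 0 \le \alpha_i \le C$$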
The decision function under the kernel trick: (1) the original decision function; (2) substitute in the weight vector w; (3) apply the kernel trick to obtain the final decision function.
Equation 2-6
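Written out as the three steps listed above (a standard reconstruction, using $w = \sum_{i=1}^{N}\alpha_i y_i\,\phi(x_i)$):

$$f(x) = \operatorname{sign}\big(w^\top\phi(x) + b\big) = \operatorname{sign}\Big(\sum_{i=1}^{N}\alpha_i y_i\,\phi(x_i)^\top\phi(x) + b\Big) = \operatorname{sign}\Big(\sum_{i=1}^{N}\alpha_i y_i\,K(x_i, x) + b\Big)$$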
The kernel trick is not only applicable to support vector machines; it is also widely used in other machine learning algorithms.
Kernel function
Common kernel functions are as follows: (1) Linear kernel: an SVM with the linear kernel gives the same classification results as the soft-margin SVM. (2) Polynomial kernel: when γ = 1, ε = 0, and d = 1, the polynomial kernel degenerates into the linear kernel. (3) Radial basis function (RBF) kernel: the commonly used Gaussian kernel is one of the radial basis kernels. (4) Sigmoid kernel: tanh is the hyperbolic tangent function.
Equation 2-7
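In the notation of the description above (a reconstruction of the missing image; ε is the constant term, corresponding to coef0 in the code below, and γ and d are the remaining kernel parameters):

- Linear kernel: $K(x, z) = x^\top z$
- Polynomial kernel: $K(x, z) = (\gamma\, x^\top z + \varepsilon)^d$
- Gaussian (RBF) kernel: $K(x, z) = \exp(-\gamma\,\lVert x - z\rVert^2)$
- Sigmoid kernel: $K(x, z) = \tanh(\gamma\, x^\top z + \varepsilon)$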
3. Algorithm Steps
Compared with the dual model of the soft-margin support vector machine, the nonlinear support vector machine only replaces the inner product of the feature vectors with a kernel function; everything else is unchanged, so the corresponding sequential minimal optimization (SMO) algorithm changes very little. For the algorithm steps, please refer to the previous two sections, the code implementation below, and the implementation of the SMO algorithm in the paper3.
4. Code Implementation
Implementing the nonlinear support vector machine (SMO algorithm) in Python:
import numpy as np

class SMO:
    """Support vector machine trained with the sequential minimal optimization (SMO) algorithm."""

    def __init__(self, X, y, kernel, degree=3, coef0=0.0, gamma=1.0):
        # training sample feature matrix (N x p)
        self.X = X
        # training sample label vector (N x 1), labels in {-1, 1}
        self.y = y
        # Lagrange multiplier vector (N x 1)
        self.alpha = np.zeros(X.shape[0])
        # error vector, initialized to the negative label vector (N x 1)
        self.errors = -y
        # offset (bias)
        self.b = 0
        # cost (objective) value
        self.cost = -np.inf
        # kernel function
        self.kernel = kernel
        # kernel function parameters
        self.degree = degree
        self.coef0 = coef0
        self.gamma = gamma
    def fit(self, C=1, tol=1e-4):
        """Train with the SMO algorithm from John C. Platt's paper:
        https://www.microsoft.com/en-us/research/uploads/prod/1998/04/sequential-minimal-optimization.pdf
        """
        X = self.X
        # number of multipliers updated in the current pass
        numChanged = 0
        # whether to examine all samples in this pass
        examineAll = True
        while numChanged > 0 or examineAll:
            numChanged = 0
            if examineAll:
                # examine every sample
                for idx in range(X.shape[0]):
                    numChanged += self.update(idx, C)
            else:
                # examine only samples with a non-zero multiplier
                for idx in range(X.shape[0]):
                    if self.alpha[idx] <= 0:
                        continue
                    numChanged += self.update(idx, C)
            if examineAll:
                examineAll = False
            elif numChanged == 0:
                examineAll = True
            # compute the cost value
            cost = self.calcCost()
            if self.cost > 0:
                # stop when the relative change of the cost value is within the tolerance
                rate = (cost - self.cost) / self.cost
                if rate <= tol:
                    break
            self.cost = cost
    def update(self, idx, C=1):
        """Try to update the Lagrange multiplier with index idx; return 1 on success, 0 otherwise."""
        X = self.X
        y = self.y
        alpha = self.alpha
        # if the current multiplier already satisfies the KKT conditions, return 0 directly
        if self.checkKKT(idx, C):
            return 0
        if len(alpha[alpha != 0]) > 1:
            # pick the second multiplier to optimize by the |E1 - E2| maximum heuristic
            jdx = self.selectJdx(idx)
            # update the multipliers with indices idx and jdx; return 1 on success
            if self.updateAlpha(idx, jdx, C):
                return 1
        # if that failed, iterate over the non-zero multipliers
        for jdx in range(X.shape[0]):
            if alpha[jdx] == 0:
                continue
            if self.updateAlpha(idx, jdx, C):
                return 1
        # if still no success, iterate over the zero multipliers
        for jdx in range(X.shape[0]):
            if alpha[jdx] != 0:
                continue
            if self.updateAlpha(idx, jdx, C):
                return 1
        # return 0 if no update was made
        return 0
    def selectJdx(self, idx):
        """Select the index of the second Lagrange multiplier to optimize."""
        errors = self.errors
        if errors[idx] > 0:
            # when the error term is positive, pick the index of the smallest error
            return np.argmin(errors)
        elif errors[idx] < 0:
            # when the error term is negative, pick the index of the largest error
            return np.argmax(errors)
        else:
            # when the error term is zero, pick whichever of the extreme errors has the larger absolute value
            minJdx = np.argmin(errors)
            maxJdx = np.argmax(errors)
            if np.abs(errors[minJdx]) > np.abs(errors[maxJdx]):
                return minJdx
            else:
                return maxJdx
    def calcB(self):
        """Compute the offset as the average of b = y - w.x over the non-zero Lagrange multipliers."""
        X = self.X
        y = self.y
        alpha = self.alpha
        alpha_gt = alpha[alpha > 0]
        # number of non-zero Lagrange multipliers
        alpha_gt_len = len(alpha_gt)
        # return 0 if all multipliers are zero
        if alpha_gt_len == 0:
            return 0
        X_gt = X[alpha > 0]
        y_gt = y[alpha > 0]
        s = 0
        for idx in range(X_gt.shape[0]):
            ss = 0
            for jdx in range(X_gt.shape[0]):
                ss += alpha_gt[jdx] * y_gt[jdx] * self.kernel(X_gt[jdx], X_gt[idx], self.degree, self.coef0, self.gamma)
            s += y_gt[idx] - ss
        return s / alpha_gt_len
    def calcCost(self):
        """Compute the value of the dual objective (the cost), following the paper."""
        X = self.X
        y = self.y
        alpha = self.alpha
        cost = 0
        for idx in range(X.shape[0]):
            for jdx in range(X.shape[0]):
                cost = cost + y[idx] * y[jdx] * alpha[idx] * alpha[jdx] * self.kernel(X[idx], X[jdx], self.degree, self.coef0, self.gamma)
        return np.sum(alpha) - cost / 2
    def checkKKT(self, idx, C=1):
        """Check the KKT conditions for index idx:
        1. alpha >= 0  2. alpha <= C  3. y * f(x) - 1 >= 0  4. alpha * (y * f(x) - 1) = 0
        """
        y = self.y
        errors = self.errors
        alpha = self.alpha
        # r = E * y = y * f(x) - 1
        r = errors[idx] * y[idx]
        if (0 < alpha[idx] < C and r == 0) or (alpha[idx] == 0 and r > 0) or (alpha[idx] == C and r < 0):
            return True
        return False
    def calcU(self, idx, jdx, C=1):
        """Compute the lower bound of the new multiplier for jdx, following the paper,
        so that both optimized multipliers stay within [0, C]."""
        y = self.y
        alpha = self.alpha
        if y[idx] * y[jdx] == 1:
            return max(0.0, alpha[jdx] + alpha[idx] - C)
        else:
            return max(0.0, alpha[jdx] - alpha[idx])

    def calcV(self, idx, jdx, C=1):
        """Compute the upper bound of the new multiplier for jdx, following the paper,
        so that both optimized multipliers stay within [0, C]."""
        y = self.y
        alpha = self.alpha
        if y[idx] * y[jdx] == 1:
            return min(C, alpha[jdx] + alpha[idx])
        else:
            return min(C, C + alpha[jdx] - alpha[idx])
    def updateAlpha(self, idx, jdx, C=1):
        """Jointly optimize the Lagrange multipliers with indices idx and jdx, following the paper."""
        if idx == jdx:
            return False
        X = self.X
        y = self.y
        alpha = self.alpha
        errors = self.errors
        # error term of idx
        Ei = errors[idx]
        # error term of jdx
        Ej = errors[jdx]
        # feasible range [U, V] of the new multiplier for jdx
        U = self.calcU(idx, jdx, C)
        V = self.calcV(idx, jdx, C)
        if U == V:
            return False
        Kii = self.kernel(X[idx], X[idx], self.degree, self.coef0, self.gamma)
        Kjj = self.kernel(X[jdx], X[jdx], self.degree, self.coef0, self.gamma)
        Kij = self.kernel(X[idx], X[jdx], self.degree, self.coef0, self.gamma)
        # second derivative of the objective along the constraint direction
        K = Kii + Kjj - 2 * Kij
        oldAlphaIdx = alpha[idx]
        oldAlphaJdx = alpha[jdx]
        oldB = self.b
        s = y[idx] * y[jdx]
        if K > 0:
            # compute the new multiplier for jdx and clip it to [U, V]
            newAlphaJdx = oldAlphaJdx + y[jdx] * (Ei - Ej) / K
            if newAlphaJdx < U:
                # below the lower bound, clip to the lower bound
                newAlphaJdx = U
            if newAlphaJdx > V:
                # above the upper bound, clip to the upper bound
                newAlphaJdx = V
        else:
            # K <= 0: evaluate the objective at both ends of the feasible range
            fi = y[idx] * (Ei + oldB) - oldAlphaIdx * Kii - s * oldAlphaJdx * Kij
            fj = y[jdx] * (Ej + oldB) - s * oldAlphaIdx * Kij - oldAlphaJdx * Kjj
            Vv = oldAlphaIdx + s * (oldAlphaJdx - V)
            Uu = oldAlphaIdx + s * (oldAlphaJdx - U)
            Vv = Vv * fi + V * fj + 0.5 * (Vv ** 2) * Kii + 0.5 * (V ** 2) * Kjj + s * V * Vv * Kij
            Uu = Uu * fi + U * fj + 0.5 * (Uu ** 2) * Kii + 0.5 * (U ** 2) * Kjj + s * U * Uu * Kij
            if Vv < Uu:
                newAlphaJdx = V
            elif Vv > Uu:
                newAlphaJdx = U
            else:
                newAlphaJdx = oldAlphaJdx
        if oldAlphaJdx == newAlphaJdx:
            # no progress if the new value equals the old value
            return False
        # compute the new multiplier for idx
        newAlphaIdx = oldAlphaIdx + s * (oldAlphaJdx - newAlphaJdx)
        # update the Lagrange multiplier vector
        alpha[idx] = newAlphaIdx
        alpha[jdx] = newAlphaJdx
        oldB = self.b
        # recompute the offset
        self.b = self.calcB()
        # recompute the error vector
        newErrors = []
        for i in range(X.shape[0]):
            newError = errors[i] \
                + y[idx] * (newAlphaIdx - oldAlphaIdx) * self.kernel(X[idx], X[i], self.degree, self.coef0, self.gamma) \
                + y[jdx] * (newAlphaJdx - oldAlphaJdx) * self.kernel(X[jdx], X[i], self.degree, self.coef0, self.gamma) \
                - oldB + self.b
            newErrors.append(newError)
        self.errors = np.array(newErrors)
        return True
    def predict(self, X):
        """Predict labels with the decision function f(x) = sum_j alpha_j * y_j * K(x_j, x) + b."""
        fxs = []
        for idx in range(len(X)):
            fx = 0
            for jdx in range(self.X.shape[0]):
                fx += self.y[jdx] * self.alpha[jdx] * self.kernel(self.X[jdx], X[idx], self.degree, self.coef0, self.gamma)
            fxs.append(fx + self.b)
        return np.sign(fxs)
Linear kernel support vector machines
# Linear kernel support vector machines
def linearKernel(x, z, degree=3, coef0=0.0, gamma=1.0):
    return x.dot(z)

linearSmo = SMO(X, y, kernel=linearKernel)
linearSmo.fit()
Polynomial kernel function support vector machine
# Polynomial kernel support vector machine
def polynomialKernel(x, z, degree=3, coef0=0.0, gamma=1.0):
    return np.power(gamma * x.dot(z) + coef0, degree)

polynomialSmo = SMO(X, y, kernel=polynomialKernel)
polynomialSmo.fit()
Radial basis kernel function support vector machines
# RBF kernel support vector machine
def rbfKernel(x, z, degree=3, coef0=0.0, gamma=1.0):
    return np.exp(-gamma * np.power(np.linalg.norm(x - z), 2))

rbfSmo = SMO(X, y, kernel=rbfKernel)
rbfSmo.fit()
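The snippets above assume that X and y are already defined. As a minimal usage sketch (the data set and variable names here are illustrative, not from the original demo), the RBF version can be tried on a linearly inseparable toy data set like this:

import numpy as np
from sklearn.datasets import make_circles

# toy linearly inseparable data; convert the labels from {0, 1} to {-1, 1}
X, y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)
y = 2 * y - 1
model = SMO(X, y, kernel=rbfKernel, gamma=1.0)
model.fit(C=1)
print(np.mean(model.predict(X) == y))  # training accuracy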
5. Third-Party Library Implementation
scikit-learn4 implementation:
from sklearn.svm import SVC
# Linear kernel support vector machines
# Linear kernel support vector machine
svc = SVC(kernel="linear")
# Polynomial kernel support vector machine
svc = SVC(kernel="poly", gamma=1)
# RBF kernel support vector machine
svc = SVC(kernel="rbf", gamma=1)
# Fit the data
svc.fit(X, y)
Scikit-learn internally uses the LIBSVM5 library; see reference 6 for a detailed implementation.
6. Animation Demonstration
The following three figures show the classification results on the linearly inseparable data set for different kernel functions. Red represents sample points with label -1, and blue represents sample points with label 1; the light-red areas are regions with predicted value -1, and the light-blue areas are regions with predicted value 1:
Figure 6-1
Figure 6-2
Figure 6-3
It can be seen that different kernel functions have a great influence on the classification results. When the form of the feature transformation is unknown, it is hard to choose an appropriate kernel function directly; in general, for nonlinear classification problems, the radial basis (RBF) kernel can be used, with suitable parameters found through cross-validation.
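For example, one common way to do this search with scikit-learn (a sketch added here for illustration, not part of the original demo) is a cross-validated grid search over C and gamma:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# search for suitable RBF-kernel parameters by 5-fold cross-validation
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)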
7. Mind Map
Figure 7-1
8. References
1. en.wikipedia.org/wiki/Suppor…
2. en.wikipedia.org/wiki/Kernel…
3. www.microsoft.com/en-us/resea…
4. scikit-learn.org/stable/modu…
5. www.csie.ntu.edu.tw/~cjlin/libs…
6. github.com/Kaslanarian…
The full demo can be found here
Note: This article strives to be both accurate and easy to understand, but the author is a beginner with limited knowledge. If there are any errors or omissions in the article, readers are welcome to point them out in the comments.