Background knowledge needed to read this article: hard-margin support vector machines, slack variables, and a little programming knowledge.
1. Introduction
In the previous section we introduced the most basic support vector machine model, the hard-margin support vector machine, which can classify linearly separable data sets. In reality, however, data sets are often not linearly separable, so in this section we introduce a second kind of support vector machine, the soft-margin support vector machine [1], to handle linearly inseparable data sets.
2. Model introduction
Primal model
Let's first look at the primal model of the hard-margin support vector machine:
Equation 2-1
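In the usual notation (N samples $(x_i, y_i)$ with $y_i \in \{-1, +1\}$), this primal is commonly written as:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\quad i = 1,\dots,N$$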
The constraint y(wx + b) ≥ 1 forces every sample point to be correctly classified, which is why the model is called hard-margin. When the data set cannot be completely separated by a hyperplane, however, some sample points must be allowed to violate this constraint. We therefore modify the cost function above and add a penalty term for the incorrectly classified sample points, where 1(x) is the indicator function [2] mentioned in the previous section: it returns 1 when the inequality inside it holds and 0 otherwise. The constant C > 0 controls the strength of the penalty; as C tends to infinity, every sample point is forced to be correctly classified, and the model reduces to the hard-margin support vector machine.
Equation 2-2
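In the same notation, the penalized objective described above is commonly written with the indicator function as:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\mathbb{1}\big(y_i(w^\top x_i + b) < 1\big)$$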
The indicator function in Equation 2-2 is neither convex nor continuous, which makes it awkward to handle, so it is replaced by a max function of the following form:
Equation 2-3
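Replacing the indicator with the max (hinge) penalty gives the usual form:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\max\big(0,\ 1 - y_i(w^\top x_i + b)\big)$$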
At this point a slack variable [3] is introduced, together with the corresponding constraints, to remove the max function; the result, Equation 2-4, is the primal model of the soft-margin support vector machine:
Equation 2-4
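With slack variables $\xi_i \ge 0$, the soft-margin primal takes its standard form:

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i \quad\text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,N$$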
Dual model
As with the hard-margin support vector machine, the primal model is transformed by the Lagrange multiplier method: (1) introduce two kinds of Lagrange multipliers, λ and μ, to obtain the Lagrangian function; (2) take the partial derivative of the Lagrangian with respect to w and set it to zero; (3) obtain the same analytical solution for w as in the hard-margin case; (4) take the partial derivative with respect to b and set it to zero; (5) take the partial derivative with respect to ξ and set it to zero; (6) obtain C = λ + μ.
Equation 2-5
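For reference, the Lagrangian and the stationarity conditions from steps (1) to (6) are, in the usual notation:

$$L(w,b,\xi,\lambda,\mu) = \frac{1}{2}\|w\|^2 + C\sum_{i}\xi_i + \sum_{i}\lambda_i\big(1 - \xi_i - y_i(w^\top x_i + b)\big) - \sum_{i}\mu_i\xi_i$$

$$\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i}\lambda_i y_i x_i,\qquad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i}\lambda_i y_i = 0,\qquad \frac{\partial L}{\partial \xi_i} = 0 \Rightarrow C = \lambda_i + \mu_i$$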
Substituting the results of Equation 2-5 back into the Lagrangian: (1) write down the Lagrangian; (2) substitute and expand the parentheses; (3) note that in line (2) the 2nd and 5th terms cancel, the 3rd and 8th terms cancel, and the 7th term vanishes after merging.
Equation 2-6
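After these cancellations, the objective left in terms of λ alone is the familiar dual objective:

$$\sum_{i=1}^{N}\lambda_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\lambda_i\lambda_j y_i y_j\, x_i^\top x_j$$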
Regarding the Lagrange multipliers: (1) start from the constraint on the multiplier μ; (2) add the multiplier λ to both sides; (3) combine with result (6) of Equation 2-5 to obtain the range of λ.
Equation 2-7
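Worked out briefly: since $\mu_i \ge 0$ and $C = \lambda_i + \mu_i$,

$$\mu_i \ge 0 \ \Rightarrow\ \lambda_i + \mu_i \ge \lambda_i \ \Rightarrow\ 0 \le \lambda_i \le C$$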
As with the hard-margin support vector machine, the KKT conditions are introduced as follows:
Equation 2-8
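Writing $f(x) = w^\top x + b$, the KKT conditions for the soft-margin problem are, in their standard form:

$$\lambda_i \ge 0,\quad \mu_i \ge 0,\quad y_i f(x_i) - 1 + \xi_i \ge 0,\quad \lambda_i\big(y_i f(x_i) - 1 + \xi_i\big) = 0,\quad \xi_i \ge 0,\quad \mu_i\xi_i = 0$$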
Applying the Lagrange multiplier method in this way yields the Lagrangian dual of the primal model, subject to the KKT conditions shown above:
Equation 2-9
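Putting the pieces together, the dual model is commonly stated as:

$$\max_{\lambda}\ \sum_{i=1}^{N}\lambda_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\lambda_i\lambda_j y_i y_j\, x_i^\top x_j \quad\text{s.t.}\quad \sum_{i=1}^{N}\lambda_i y_i = 0,\quad 0 \le \lambda_i \le C$$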
3. Algorithm steps
Comparing the dual model in Equation 2-9 above with the dual model of the hard-margin support vector machine, the only difference lies in the constraint on the Lagrange multipliers: the former adds an upper bound equal to the penalty factor C. The solving algorithm is therefore roughly the same as that of the hard-margin support vector machine, except for the following two places.
One change in the algorithm is the calculation of the upper and lower bounds of the new λ_b: (1), (2) every λ must be greater than or equal to zero and less than or equal to C; (3) discuss the resulting restrictions on λ_b; (4) combine with equation (1) to obtain the bounds for the new λ_b.
Equation 3-1
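Concretely, for the two multipliers $\lambda_a$ and $\lambda_b$ optimized in one SMO step, the clipping bounds used by calcU and calcV in the code below take the standard form:

$$y_a \ne y_b:\quad L = \max(0,\ \lambda_b - \lambda_a),\quad H = \min(C,\ C + \lambda_b - \lambda_a)$$

$$y_a = y_b:\quad L = \max(0,\ \lambda_a + \lambda_b - C),\quad H = \min(C,\ \lambda_a + \lambda_b)$$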
The other change is the check of the KKT conditions. (1) Consider first the case λ = 0: (2) from result (6) of Equation 2-5, μ = C; (3) since μ = C ≠ 0, the KKT condition of Equation 2-8 gives ξ = 0; (4) substituting ξ = 0 into the KKT condition yields y·f(x) ≥ 1.
Equation 3-2
(1) Next consider the case λ = C: (2) from result (6) of Equation 2-5, μ = 0; (3) since λ ≠ 0, the KKT condition of Equation 2-8 gives y·f(x) = 1 − ξ; (4) since ξ ≥ 0, combining with (3) gives y·f(x) ≤ 1.
Equation 3-3
(1) Finally consider 0 < λ < C: (2) from result (6) of Equation 2-5, μ is also strictly between 0 and C; (3) since μ ≠ 0, the KKT condition of Equation 2-8 gives ξ = 0; (4) since λ ≠ 0, the KKT condition of Equation 2-8 combined with (3) gives y·f(x) = 1.
Equation 3-4
Based on equations 3-2, 3-3 and 3-4 above, the following new constraints can be obtained:
Equation 3-5
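Summarizing the three cases, the KKT check applied to each sample point becomes:

$$\lambda_i = 0 \ \Rightarrow\ y_i f(x_i) \ge 1,\qquad 0 < \lambda_i < C \ \Rightarrow\ y_i f(x_i) = 1,\qquad \lambda_i = C \ \Rightarrow\ y_i f(x_i) \le 1$$

This is the condition tested by checkKKT in the code below.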
For the detailed algorithm steps, please refer to Section 3 of the hard-margin support vector machine article and the code implementation below.
4. Code implementation
Using Python
import numpy as np
class SMO:
    """Sequential Minimal Optimization (SMO) algorithm for the soft-margin support vector machine."""

    def __init__(self, X, y):
        # Training sample feature matrix (N * p)
        self.X = X
        # Training sample label vector (N * 1)
        self.y = y
        # Lagrange multiplier vector (N * 1)
        self.alpha = np.zeros(X.shape[0])
        # Error vector, initialized to the negative of the label vector (N * 1)
        self.errors = -y
        # Offset
        self.b = 0
        # Weight vector (p * 1)
        self.w = np.zeros(X.shape[1])
        # Cost value
        self.cost = -np.inf
    def fit(self, C = 1, tol = 1e-4):
        """Algorithm from John C. Platt's paper
        https://www.microsoft.com/en-us/research/uploads/prod/1998/04/sequential-minimal-optimization.pdf"""
        # Number of updates
        numChanged = 0
        # Whether to examine all samples
        examineAll = True
        while numChanged > 0 or examineAll:
            numChanged = 0
            if examineAll:
                for idx in range(self.X.shape[0]):
                    numChanged += self.update(idx, C)
            else:
                for idx in range(self.X.shape[0]):
                    if self.alpha[idx] <= 0:
                        continue
                    numChanged += self.update(idx, C)
            if examineAll:
                examineAll = False
            elif numChanged == 0:
                examineAll = True
            # Compute the cost value
            cost = self.calcCost()
            if self.cost > 0:
                # Stop when the relative change of the cost value is within the tolerance
                rate = (cost - self.cost) / self.cost
                if rate <= tol:
                    break
            self.cost = cost
    def update(self, idx, C = 1):
        """Update the Lagrange multiplier with index idx."""
        X = self.X
        y = self.y
        alpha = self.alpha
        # If the current Lagrange multiplier already satisfies the KKT conditions, return 0 directly
        if self.checkKKT(idx, C):
            return 0
        if len(alpha[alpha != 0]) > 1:
            # Choose the index of the second Lagrange multiplier by maximizing |E1 - E2|
            jdx = self.selectJdx(idx)
            # Update the Lagrange multipliers with indices idx and jdx; return 1 on success
            if self.updateAlpha(idx, jdx, C):
                return 1
        # If that did not succeed, iterate over the non-zero Lagrange multipliers
        for jdx in range(X.shape[0]):
            if alpha[jdx] == 0:
                continue
            # Update the Lagrange multipliers with indices idx and jdx; return 1 on success
            if self.updateAlpha(idx, jdx, C):
                return 1
        # If there is still no success, iterate over the zero Lagrange multipliers
        for jdx in range(X.shape[0]):
            if alpha[jdx] != 0:
                continue
            # Update the Lagrange multipliers with indices idx and jdx; return 1 on success
            if self.updateAlpha(idx, jdx, C):
                return 1
        # Return 0 if nothing was updated
        return 0
    def selectJdx(self, idx):
        """Choose the index of the second Lagrange multiplier to optimize."""
        errors = self.errors
        if errors[idx] > 0:
            # When the error term is greater than zero, choose the index of the smallest value in the error vector
            return np.argmin(errors)
        elif errors[idx] < 0:
            # When the error term is less than zero, choose the index of the largest value in the error vector
            return np.argmax(errors)
        else:
            # When the error term is zero, choose the index whose error has the largest absolute value
            minJdx = np.argmin(errors)
            maxJdx = np.argmax(errors)
            if np.abs(errors[minJdx]) > np.abs(errors[maxJdx]):
                return minJdx
            else:
                return maxJdx
    def calcB(self):
        """Compute the offset for each non-zero Lagrange multiplier and take the average."""
        X = self.X
        y = self.y
        alpha = self.alpha
        alpha_gt = alpha[alpha > 0]
        # Number of non-zero Lagrange multipliers
        alpha_gt_len = len(alpha_gt)
        # Return 0 if all multipliers are zero
        if alpha_gt_len == 0:
            return 0
        # b = y - wx; see the algorithm in the article for details
        X_gt = X[alpha > 0]
        y_gt = y[alpha > 0]
        alpha_gt_y = np.array(np.multiply(alpha_gt, y_gt)).reshape(-1, 1)
        s = np.sum(np.multiply(alpha_gt_y, X_gt), axis=0)
        return np.sum(y_gt - X_gt.dot(s)) / alpha_gt_len
    def calcCost(self):
        """Compute the cost value according to the algorithm in the paper."""
        X = self.X
        y = self.y
        alpha = self.alpha
        cost = 0
        for idx in range(X.shape[0]):
            for jdx in range(X.shape[0]):
                cost = cost + (y[idx] * y[jdx] * X[idx].dot(X[jdx]) * alpha[idx] * alpha[jdx])
        return np.sum(alpha) - cost / 2
    def checkKKT(self, idx, C = 1):
        """Check whether the Lagrange multiplier with index idx satisfies the KKT conditions:
        1. alpha >= 0
        2. alpha <= C
        3. y * f(x) - 1 >= 0
        4. alpha * (y * f(x) - 1) = 0
        """
        y = self.y
        errors = self.errors
        alpha = self.alpha
        r = errors[idx] * y[idx]
        if (alpha[idx] > 0 and alpha[idx] < C and r == 0) or (alpha[idx] == 0 and r > 0) or (alpha[idx] == C and r < 0):
            return True
        return False
    def calcE(self):
        """Compute the error vector E = f(x) - y."""
        X = self.X
        y = self.y
        alpha = self.alpha
        alpha_y = np.array(np.multiply(alpha, y)).reshape(-1, 1)
        errors = X.dot(X.T).dot(alpha_y).T[0] + self.b - y
        return errors
    def calcU(self, idx, jdx, C = 1):
        """Compute the lower bound of the new multiplier so that both optimized Lagrange
        multipliers stay greater than or equal to 0, following the algorithm in the paper."""
        y = self.y
        alpha = self.alpha
        if y[idx] * y[jdx] == 1:
            return max(0, alpha[jdx] + alpha[idx] - C)
        else:
            return max(0, alpha[jdx] - alpha[idx])

    def calcV(self, idx, jdx, C = 1):
        """Compute the upper bound of the new multiplier so that both optimized Lagrange
        multipliers stay less than or equal to C, following the algorithm in the paper."""
        y = self.y
        alpha = self.alpha
        if y[idx] * y[jdx] == 1:
            return min(C, alpha[jdx] + alpha[idx])
        else:
            return min(C, C + alpha[jdx] - alpha[idx])
    def updateAlpha(self, idx, jdx, C = 1):
        """Update the Lagrange multipliers with indices idx and jdx according to the algorithm in the article."""
        if idx == jdx:
            return False
        X = self.X
        y = self.y
        alpha = self.alpha
        errors = self.errors
        # Error term of idx
        Ei = errors[idx]
        # Error term of jdx
        Ej = errors[jdx]
        Kii = X[idx].dot(X[idx])
        Kjj = X[jdx].dot(X[jdx])
        Kij = X[idx].dot(X[jdx])
        # Compute K
        K = Kii + Kjj - 2 * Kij
        # Skip when K is not positive (e.g. duplicate sample points) to avoid division by zero
        if K <= 0:
            return False
        oldAlphaIdx = alpha[idx]
        oldAlphaJdx = alpha[jdx]
        # Compute the new Lagrange multiplier for jdx
        newAlphaJdx = oldAlphaJdx + y[jdx] * (Ei - Ej) / K
        U = self.calcU(idx, jdx, C)
        V = self.calcV(idx, jdx, C)
        if newAlphaJdx < U:
            # Clip to the lower bound when the new value falls below it
            newAlphaJdx = U
        if newAlphaJdx > V:
            # Clip to the upper bound when the new value exceeds it
            newAlphaJdx = V
        if oldAlphaJdx == newAlphaJdx:
            # Do not update when the new value equals the old value
            return False
        # Compute the new Lagrange multiplier for idx
        newAlphaIdx = oldAlphaIdx + y[idx] * y[jdx] * (oldAlphaJdx - newAlphaJdx)
        # Update the weight vector
        self.w = self.w + y[idx] * (newAlphaIdx - oldAlphaIdx) * X[idx] + y[jdx] * (newAlphaJdx - oldAlphaJdx) * X[jdx]
        # Update the Lagrange multiplier vector
        alpha[idx] = newAlphaIdx
        alpha[jdx] = newAlphaJdx
        oldB = self.b
        # Recompute the offset
        self.b = self.calcB()
        # Recompute the error vector
        eps = np.finfo(np.float32).eps
        newErrors = []
        for i in range(X.shape[0]):
            newError = errors[i] + y[idx] * (newAlphaIdx - oldAlphaIdx) * X[idx].dot(X[i]) + y[jdx] * (newAlphaJdx - oldAlphaJdx) * X[jdx].dot(X[i]) - oldB + self.b
            if np.abs(newError) < eps:
                newError = 0
            newErrors.append(newError)
        self.errors = np.array(newErrors)
        return True
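To see the class in action, here is a minimal usage sketch (the toy data set below is made up for illustration; any feature matrix X of shape N * p with labels y in {-1, +1} should work):

import numpy as np

np.random.seed(0)
# Two overlapping Gaussian blobs, so the data set is not linearly separable
X = np.vstack([np.random.randn(20, 2) + 1.5, np.random.randn(20, 2) - 1.5])
y = np.hstack([np.ones(20), -np.ones(20)])

smo = SMO(X, y)
smo.fit(C = 1)
print("w", smo.w, "b", smo.b)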
5. Third-party library implementation
Scikit-learn [4] implementation
from sklearn.svm import SVC
svc = SVC(kernel = "linear", C = 1)
# Fit data
svc.fit(X, y)
# Weight coefficient
w = svc.coef_
# intercept
b = svc.intercept_
print("w", w, "b", b)
6. Animation demonstration
The following figure shows a linearly inseparable data set, where red represents the sample points with label -1 and blue represents the sample points with label 1:
Figure 6-1
The following figure shows the result of fitting the data with the soft-margin support vector machine, where the light red area is the region predicted as -1 and the light blue area is the region predicted as 1:
Figure 6-2
7. Mind mapping
Figure 7-1
8. References
1. en.wikipedia.org/wiki/Suppor…
2. en.wikipedia.org/wiki/Indica…
3. en.wikipedia.org/wiki/Slack_…
4. scikit-learn.org/stable/modu…
The full demo can be found here
Note: this article strives to be accurate and easy to understand, but as the author is a beginner, there may be errors or omissions; readers are welcome to point them out in the comments.