The background knowledge needed to read this article: linear support vector machines, a little programming knowledge

1. Introduction

In the previous two sections we introduced two support vector machine models — the hard-margin support vector machine and the soft-margin support vector machine. These two models can be collectively called linear support vector machines. Next, we will introduce another support vector machine model — the nonlinear support vector machine1 (Non-Linear Support Vector Machine).

2. Model introduction

Before introducing nonlinear support vector machines, let's first look at the following linearly inseparable data set:

Figure 2-1

In Figure 2-1, the circles and crosses represent different label classes. Clearly, this data set cannot be correctly classified by a straight line, but as Figure 2-2 shows, it can be correctly classified by an elliptic curve.

Figure 2-2

However, solving a nonlinear classification problem such as the elliptic curve in Figure 2-2 directly is relatively difficult. Since linear classification problems are comparatively easy to solve, can the data be turned into a linearly separable data set through some nonlinear transformation? The answer is yes.

Figure 2-3

As shown in Figure 2-3, a nonlinear transformation (z = x * x) is applied to the data set. It can be seen that the transformed data can be correctly classified by a straight line: the elliptic curve in Figure 2-2 becomes a straight line in Figure 2-3, and the original nonlinear classification problem becomes a linear classification problem.
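
As an illustration (a minimal sketch, not taken from the original article), the snippet below builds a toy two-dimensional data set separated by an ellipse, applies the transformation z = x * x to each coordinate, and checks that a linear rule classifies the transformed data perfectly:

import numpy as np

# Toy 2-D data set: points inside an ellipse get label 1, points outside get label -1,
# so no straight line can separate the two classes in the original space
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0] ** 2 / 4 + X[:, 1] ** 2 <= 1, 1, -1)

# Nonlinear transformation z = x * x (square each coordinate)
Z = X ** 2

# In the transformed space the elliptic boundary x1^2/4 + x2^2 = 1 becomes the
# straight line z1/4 + z2 = 1, so a linear rule classifies the data correctly
y_hat = np.where(Z[:, 0] / 4 + Z[:, 1] <= 1, 1, -1)
print(np.mean(y_hat == y))  # 1.0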

The primal model

Let's review the primal model of the soft-margin support vector machine from the previous section:


$$
\begin{array}{c}
\underset{w, b, \xi}{\min} \ \frac{1}{2} w^{T} w + C \sum_{i=1}^{N} \xi_{i} \\
\text { s.t. } \quad C>0, \quad \xi_{i} \geq 0, \quad \xi_{i} \geq 1-y_{i}\left(w^{T} x_{i}+b\right), \quad i=1,2, \cdots, N
\end{array}
$$

Equation 2-1

Suppose there is a function φ that transforms x (for example, mapping x to z in the example above). Substituting φ(x) into the primal model of the soft-margin support vector machine gives:


$$
\begin{array}{c}
\underset{w, b, \xi}{\min} \ \frac{1}{2} w^{T} w + C \sum_{i=1}^{N} \xi_{i} \\
\text { s.t. } \quad C>0, \quad \xi_{i} \geq 0, \quad \xi_{i} \geq 1-y_{i}\left(w^{T} \phi(x_{i})+b\right), \quad i=1,2, \cdots, N
\end{array}
$$

Equation 2-2

Equation 2-2 above is the primal model of the nonlinear support vector machine.

Dual model

Following the same derivation as the dual models in the previous two sections, the dual model of the nonlinear support vector machine can be obtained:


$$
\begin{array}{c}
\underset{\lambda}{\max} \ \sum_{i=1}^{N} \lambda_{i}-\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \lambda_{i} \lambda_{j} y_{i} y_{j} \phi(x_{i})^{T} \phi(x_{j}) \\
\text { s.t. } \quad \sum_{i=1}^{N} \lambda_{i} y_{i}=0, \quad 0 \leq \lambda_{i} \leq C, \quad i=1,2, \cdots, N
\end{array}
$$

Equation 2-3

The kernel trick

Looking at the dual model in Equation 2-3, the dimension of the feature vector after the φ transformation may be very high, and computing the inner product directly is usually difficult. To bypass this inner product calculation, we can look for a function of the form shown in Equation 2-4 whose value equals the inner product of the transformed feature vectors. Such a function is called a kernel function, and this substitution method is called the kernel trick2.


$$
K(x_i, x_j) = \phi(x_i)^T \phi(x_j)
$$

Equation 2-4
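
To make Equation 2-4 concrete, here is a minimal sketch (not part of the original article) that checks numerically that a degree-2 polynomial kernel K(x, z) = (xᵀz)² equals the inner product of an explicitly computed feature map φ for two-dimensional inputs:

import numpy as np

def phi(x):
    # Explicit degree-2 feature map for a 2-dimensional input:
    # phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def poly2_kernel(x, z):
    # Degree-2 polynomial kernel (gamma = 1, epsilon = 0, d = 2 in Equation 2-7)
    return x.dot(z) ** 2

x_i = np.array([1.0, 2.0])
x_j = np.array([3.0, -1.0])

# Both values are (1*3 + 2*(-1))^2 = 1.0
print(poly2_kernel(x_i, x_j))   # 1.0
print(phi(x_i).dot(phi(x_j)))   # 1.0

The kernel evaluates the inner product in the three-dimensional transformed space while only ever touching the original two-dimensional vectors, which is exactly what the kernel trick exploits.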

Substituting the kernel function into Equation 2-3 gives the final dual model of the nonlinear support vector machine:


$$
\begin{array}{c}
\underset{\lambda}{\max} \ \sum_{i=1}^{N} \lambda_{i}-\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \lambda_{i} \lambda_{j} y_{i} y_{j} K(x_{i}, x_{j}) \\
\text { s.t. } \quad \sum_{i=1}^{N} \lambda_{i} y_{i}=0, \quad 0 \leq \lambda_{i} \leq C, \quad i=1,2, \cdots, N
\end{array}
$$

Equation 2-5

The decision function under the kernel trick: (1) the original decision function; (2) substitute the expansion of the weight vector w; (3) the final decision function after applying the kernel trick:


$$
\begin{aligned}
f(x) &=w^{T} \phi(x)+b & (1)\\
&=\sum_{i=1}^{N} \lambda_{i} y_{i} \phi\left(x_{i}\right)^{T} \phi(x)+b & (2)\\
&=\sum_{i=1}^{N} \lambda_{i} y_{i} K\left(x_{i}, x\right)+b & (3)
\end{aligned}
$$

Equation 2-6
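
As a minimal sketch (the variable names below are illustrative and not from the article), Equation 2-6 can be evaluated directly once the multipliers λ, the labels y, the training samples and the bias b are known:

import numpy as np

def decision_function(x, X_train, y_train, lambdas, b, kernel):
    # f(x) = sum_i lambda_i * y_i * K(x_i, x) + b, as in Equation 2-6
    fx = sum(l * yi * kernel(xi, x) for l, yi, xi in zip(lambdas, y_train, X_train))
    return fx + b

def linear_kernel(a, c):
    return a.dot(c)

# Placeholder values standing in for a trained model
X_train = np.array([[1.0, 2.0], [-1.0, -2.0]])
y_train = np.array([1, -1])
lambdas = np.array([0.5, 0.5])
print(np.sign(decision_function(np.array([2.0, 1.0]), X_train, y_train, lambdas, 0.0, linear_kernel)))  # 1.0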

The kernel trick is applicable not only to support vector machines; it is also widely used in other machine learning algorithms.

Kernel function

Common kernel functions are listed below: (1) Linear kernel: an SVM with a linear kernel gives the same classification results as the soft-margin SVM. (2) Polynomial kernel: when γ = 1, ε = 0, d = 1, the polynomial kernel degenerates into the linear kernel. (3) Radial basis function (RBF) kernel: the commonly used Gaussian kernel is one type of radial basis kernel. (4) Sigmoid kernel: tanh is the hyperbolic tangent function. A quick numerical check of (2) is given after Equation 2-7.


$$
\begin{aligned}
K\left(x_{i}, x_{j}\right)&=x_{i}^{T} x_{j} & (1) \\
K\left(x_{i}, x_{j}\right)&=\left(\gamma x_{i}^{T} x_{j}+\varepsilon\right)^{d} & (2)\\
K\left(x_{i}, x_{j}\right)&=e^{-\gamma\left\|x_{i}-x_{j}\right\|^{2}} & (3)\\
K\left(x_{i}, x_{j}\right)&=\tanh \left(\gamma x_{i}^{T} x_{j}+\varepsilon\right) & (4)
\end{aligned}
$$

Equation 2-7
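
A quick numerical check of the claim in (2) above (a sketch, not from the original article): with γ = 1, ε = 0, d = 1 the polynomial kernel returns exactly the same value as the linear kernel.

import numpy as np

def linear_kernel(x, z):
    return x.dot(z)

def polynomial_kernel(x, z, gamma=1.0, coef0=0.0, degree=1):
    return (gamma * x.dot(z) + coef0) ** degree

x_i = np.array([1.0, -2.0, 0.5])
x_j = np.array([0.0, 3.0, 4.0])
print(linear_kernel(x_i, x_j))                                      # -4.0
print(polynomial_kernel(x_i, x_j, gamma=1.0, coef0=0.0, degree=1))  # -4.0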

3. Algorithm steps

Compared with the dual model of the soft-margin support vector machine, the nonlinear support vector machine only replaces the inner product of the feature vectors with a kernel function; everything else is unchanged, so the corresponding sequential minimal optimization (SMO) algorithm changes very little. For the algorithm steps, please refer to the previous two sections and the code implementation below, as well as the SMO implementation described in the paper3.

4. Code implementation

Implementing the nonlinear support vector machine (SMO algorithm) in Python

import numpy as np

class SMO:
    """Sequential minimal optimization (SMO) algorithm for support vector machines."""

    def __init__(self, X, y, kernel, degree=3, coef0=0.0, gamma=1.0):
        # Training sample feature matrix (N * p)
        self.X = X
        # Training sample label vector (N * 1)
        self.y = y
        # Lagrange multiplier vector (N * 1)
        self.alpha = np.zeros(X.shape[0])
        # Error vector, initialized to the negative label vector (N * 1)
        self.errors = -y
        # Bias
        self.b = 0
        # Cost value
        self.cost = -np.inf
        # Kernel function
        self.kernel = kernel
        # Kernel function parameters
        self.degree = degree
        self.coef0 = coef0
        self.gamma = gamma

    def fit(self, C=1, tol=1e-4):
        """Algorithm from John C. Platt's paper https://www.microsoft.com/en-us/research/uploads/prod/1998/04/sequential-minimal-optimization.pdf"""
        # Number of multipliers updated in the current pass
        numChanged = 0
        # Whether to examine all samples
        examineAll = True
        while numChanged > 0 or examineAll:
            numChanged = 0
            if examineAll:
                for idx in range(self.X.shape[0]):
                    numChanged += self.update(idx, C)
            else:
                for idx in range(self.X.shape[0]):
                    if self.alpha[idx] <= 0:
                        continue
                    numChanged += self.update(idx, C)
            if examineAll:
                examineAll = False
            elif numChanged == 0:
                examineAll = True
            # Compute the cost value
            cost = self.calcCost()
            if self.cost > 0:
                # Stop when the relative change of the cost value is within the tolerance
                rate = (cost - self.cost) / self.cost
                if rate <= tol:
                    break
            self.cost = cost

    def update(self, idx, C=1):
        """Update the Lagrange multiplier with index idx."""
        X = self.X
        alpha = self.alpha
        # If the current Lagrange multiplier already satisfies the KKT conditions, return 0 directly
        if self.checkKKT(idx, C):
            return 0
        if len(alpha[alpha != 0]) > 1:
            # Find the index of the second Lagrange multiplier by maximizing |E1 - E2|
            jdx = self.selectJdx(idx)
            # Update the Lagrange multipliers with indices idx and jdx; return 1 on success
            if self.updateAlpha(idx, jdx, C):
                return 1
        # If the update was not successful, iterate over the non-zero Lagrange multipliers
        for jdx in range(X.shape[0]):
            if alpha[jdx] == 0:
                continue
            # Update the Lagrange multipliers with indices idx and jdx; return 1 on success
            if self.updateAlpha(idx, jdx, C):
                return 1
        # If there is still no successful update, iterate over the zero Lagrange multipliers
        for jdx in range(X.shape[0]):
            if alpha[jdx] != 0:
                continue
            # Update the Lagrange multipliers with indices idx and jdx; return 1 on success
            if self.updateAlpha(idx, jdx, C):
                return 1
        # Return 0 if nothing was updated
        return 0

    def selectJdx(self, idx):
        """Select the index of the second Lagrange multiplier to optimize."""
        errors = self.errors
        if errors[idx] > 0:
            # When the error term is greater than zero, select the index of the smallest value in the error vector
            return np.argmin(errors)
        elif errors[idx] < 0:
            # When the error term is less than zero, select the index of the largest value in the error vector
            return np.argmax(errors)
        else:
            # When the error term equals zero, select whichever of the minimum and maximum has the larger absolute value
            minJdx = np.argmin(errors)
            maxJdx = np.argmax(errors)
            if np.abs(errors[minJdx]) > np.abs(errors[maxJdx]):
                return minJdx
            else:
                return maxJdx
    
    def calcB(self):
        """Compute the bias from each non-zero Lagrange multiplier and take the average."""
        X = self.X
        y = self.y
        alpha = self.alpha
        alpha_gt = alpha[alpha > 0]
        # Number of non-zero Lagrange multipliers
        alpha_gt_len = len(alpha_gt)
        # Return 0 if all multipliers are zero
        if alpha_gt_len == 0:
            return 0
        # b = y - w * x; see the derivation in the article
        X_gt = X[alpha > 0]
        y_gt = y[alpha > 0]

        s = 0
        for idx in range(X_gt.shape[0]):
            ss = 0
            for jdx in range(X_gt.shape[0]):
                ss += alpha_gt[jdx] * y_gt[jdx] * self.kernel(X_gt[jdx], X_gt[idx], self.degree, self.coef0, self.gamma)
            s += y_gt[idx] - ss
        return s / alpha_gt_len

    def calcCost(self):
        """Compute the cost value according to the algorithm in the paper."""
        X = self.X
        y = self.y
        alpha = self.alpha
        cost = 0
        for idx in range(X.shape[0]):
            for jdx in range(X.shape[0]):
                cost = cost + (y[idx] * y[jdx] * self.kernel(X[idx], X[jdx], self.degree, self.coef0, self.gamma) * alpha[idx] * alpha[jdx])
        return np.sum(alpha) - cost / 2

    def checkKKT(self, idx, C=1):
        """Check the KKT conditions for the multiplier with index idx:
        1. alpha >= 0  2. alpha <= C  3. y * f(x) - 1 >= 0  4. alpha * (y * f(x) - 1) = 0
        """
        y = self.y
        errors = self.errors
        alpha = self.alpha
        r = errors[idx] * y[idx]
        if (alpha[idx] > 0 and alpha[idx] < C and r == 0) or (alpha[idx] == 0 and r > 0) or (alpha[idx] == C and r < 0):
            return True
        return False

    def calcU(self, idx, jdx, C=1):
        """Compute the lower bound for the multiplier with index jdx so that both optimized multipliers stay within [0, C], following the algorithm in the paper."""
        y = self.y
        alpha = self.alpha
        if y[idx] * y[jdx] == 1:
            return max(0.0, alpha[jdx] + alpha[idx] - C)
        else:
            return max(0.0, alpha[jdx] - alpha[idx])

    def calcV(self, idx, jdx, C=1):
        """Compute the upper bound for the multiplier with index jdx so that both optimized multipliers stay within [0, C], following the algorithm in the paper."""
        y = self.y
        alpha = self.alpha
        if y[idx] * y[jdx] == 1:
            return min(C, alpha[jdx] + alpha[idx])
        else:
            return min(C, C + alpha[jdx] - alpha[idx])

    def updateAlpha(self, idx, jdx, C=1):
        """Update the Lagrange multipliers with indices idx and jdx according to the algorithm in the paper."""
        if idx == jdx:
            return False
        X = self.X
        y = self.y
        alpha = self.alpha
        errors = self.errors
        # Error term of idx
        Ei = errors[idx]
        # Error term of jdx
        Ej = errors[jdx]
        U = self.calcU(idx, jdx, C)
        V = self.calcV(idx, jdx, C)
        if U == V:
            return False
        Kii = self.kernel(X[idx], X[idx], self.degree, self.coef0, self.gamma)
        Kjj = self.kernel(X[jdx], X[jdx], self.degree, self.coef0, self.gamma)
        Kij = self.kernel(X[idx], X[jdx], self.degree, self.coef0, self.gamma)
        # Compute K
        K = Kii + Kjj - 2 * Kij
        oldAlphaIdx = alpha[idx]
        oldAlphaJdx = alpha[jdx]
        oldB = self.b
        s = y[idx] * y[jdx]
        if K > 0:
            # Compute the new Lagrange multiplier of jdx
            newAlphaJdx = oldAlphaJdx + y[jdx] * (Ei - Ej) / K
            if newAlphaJdx < U:
                # Clip to the lower bound when the new value falls below it
                newAlphaJdx = U
            if newAlphaJdx > V:
                # Clip to the upper bound when the new value exceeds it
                newAlphaJdx = V
        else:
            # When K <= 0, evaluate the objective at both bounds and pick the smaller one
            fi = y[idx] * (Ei + oldB) - oldAlphaIdx * Kii - s * oldAlphaJdx * Kij
            fj = y[jdx] * (Ej + oldB) - s * oldAlphaIdx * Kij - oldAlphaJdx * Kjj
            Vv = oldAlphaIdx + s * (oldAlphaJdx - V)
            Uu = oldAlphaIdx + s * (oldAlphaJdx - U)
            Vv = Vv * fi + V * fj + 0.5 * (Vv ** 2) * Kii + 0.5 * (V ** 2) * Kjj + s * V * Vv * Kij
            Uu = Uu * fi + U * fj + 0.5 * (Uu ** 2) * Kii + 0.5 * (U ** 2) * Kjj + s * U * Uu * Kij
            if Vv < Uu:
                newAlphaJdx = V
            elif Vv > Uu:
                newAlphaJdx = U
            else:
                newAlphaJdx = oldAlphaJdx
        if oldAlphaJdx == newAlphaJdx:
            # No update if the new value equals the old value
            return False
        # Compute the new Lagrange multiplier of idx
        newAlphaIdx = oldAlphaIdx + s * (oldAlphaJdx - newAlphaJdx)
        # Update the Lagrange multiplier vector
        alpha[idx] = newAlphaIdx
        alpha[jdx] = newAlphaJdx
        # Recompute the bias
        self.b = self.calcB()
        # Recompute the error vector
        newErrors = []
        for i in range(X.shape[0]):
            newError = errors[i] + y[idx] * (newAlphaIdx - oldAlphaIdx) * self.kernel(X[idx], X[i], self.degree, self.coef0, self.gamma) + y[jdx] * (newAlphaJdx - oldAlphaJdx) * self.kernel(X[jdx], X[i], self.degree, self.coef0, self.gamma) - oldB + self.b
            newErrors.append(newError)
        self.errors = np.array(newErrors)
        return True
    
    def predict(self, X):
        """Predict labels for the samples in X using the decision function in Equation 2-6."""
        fxs = []
        for idx in range(len(X)):
            fx = 0
            for jdx in range(self.X.shape[0]):
                fx += self.y[jdx] * self.alpha[jdx] * self.kernel(self.X[jdx], X[idx], self.degree, self.coef0, self.gamma)
            fxs.append(fx + self.b)
        return np.sign(fxs)
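
The training data X and y used in the examples below are not defined in this excerpt. As a stand-in (an assumption, not the article's original data set), a small ring-shaped data set similar to the one in Figure 2-1 can be generated like this:

import numpy as np

# Hypothetical stand-in for the article's linearly inseparable data set:
# label 1 inside the unit circle, label -1 in a ring around it
rng = np.random.default_rng(0)
radius = np.concatenate([rng.uniform(0.0, 1.0, 50), rng.uniform(1.5, 2.5, 50)])
angle = rng.uniform(0.0, 2 * np.pi, 100)
X = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
y = np.where(radius <= 1.0, 1.0, -1.0)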

Linear kernel support vector machines

# Linear kernel support vector machines
def linearKernel(x, z, degree=3, coef0=0.0, gamma=1.0):
    return x.dot(z)

linearSmo = SMO(X, y, kernel = linearKernel)
linearSmo.fit()

Polynomial kernel function support vector machine

# Polynomial kernel support vector machine
def polynomialKernel(x, z, degree=3, coef0=0.0, gamma=1.0):
    return np.power(gamma * x.dot(z) + coef0, degree)

polynomialSmo = SMO(X, y, kernel = polynomialKernel)
polynomialSmo.fit()

Radial basis kernel function support vector machines

# RBF kernel support vector machine
def rbfKernel(x, z, degree=3, coef0=0.0, gamma=1.0):
    return np.exp(-gamma * np.power(np.linalg.norm(x - z), 2))

rbfSmo = SMO(X, y, kernel = rbfKernel)
rbfSmo.fit()

5. Third-party library implementation

scikit-learn4 implementation

from sklearn.svm import SVC

# Linear kernel support vector machine
svc = SVC(kernel="linear")
# Polynomial kernel support vector machine
svc = SVC(kernel="poly", gamma=1)
# RBF kernel support vector machine
svc = SVC(kernel="rbf", gamma=1)

# Fit data
svc.fit(X, y)

scikit-learn internally uses the LIBSVM5 library; a detailed implementation of it can be found in article 6.

6. Animation demonstration

The following three figures show the classification results of different kernel functions on the linearly inseparable data set. Red represents sample points with label -1 and blue represents sample points with label 1; the light red areas are regions with a predicted value of -1 and the light blue areas are regions with a predicted value of 1:

Figure 6-1

Figure 6-2

Figure 6-3

It can be seen that different kernel functions have a great influence on the classification results. When the form of the feature transformation is unknown, it is hard to choose the appropriate kernel function directly. In general, however, the radial basis function kernel can be used for nonlinear classification problems, with its parameters chosen by cross-validation, as sketched below.
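
For example, a minimal cross-validation sketch with scikit-learn (the parameter grid below is illustrative and not taken from the article) could look like this:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Search RBF-kernel parameters C and gamma by 5-fold cross-validation
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)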

7. Mind mapping

Figure 7-1

8. References

  1. en.wikipedia.org/wiki/Suppor…
  2. en.wikipedia.org/wiki/Kernel…
  3. www.microsoft.com/en-us/resea…
  4. scikit-learn.org/stable/modu…
  5. www.csie.ntu.edu.tw/~cjlin/libs…
  6. github.com/Kaslanarian…

The full demo can be found here

Note: this article strives to be accurate and easy to understand, but since the author is a beginner, mistakes or omissions may exist; readers are welcome to point them out in the comments.