Support vector machine

Linear Regression, the Perceptron Learning Algorithm, and Logistic Regression can all be used as classifiers, for both linear and nonlinear classification problems such as the following:

Before introducing SVM, let's look at the concept of a hyperplane: a hyperplane is the object that divides the space of the current dimension. For example, one dimension can be divided by a point and two dimensions by a line, and these points and lines are "hyper" planes in quotes, because strictly speaking hyperplanes live in higher dimensions: the hyperplane of a four-dimensional space is three-dimensional. For example, a hyperplane in two dimensions is a straight line:

Hyperplane in three dimensions:

Let's write a, b, and c as w0, w1, w2, ..., and x, y, z as x1, x2, x3; then we have:
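In standard vector notation (writing the bias term as b, a common convention rather than anything specific to this post), that equation is:

$$w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = 0, \qquad \text{i.e.}\quad w^\top x + b = 0 .$$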

Support vector machines do not look for just any separating hyperplane, but for the best one: the one with the greatest tolerance for errors in the points. In fact, the larger the functional margin, the better:

We want the margin to be as large as possible: among all points take the shortest distance to the hyperplane, and choose the hyperplane that makes this shortest distance the greatest. If we assume the projection of x0 onto the plane is x1, then there must be a normal vector w along x0x1 perpendicular to the plane:
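Following that projection argument through (the standard derivation, stated here for reference): since $x_0 = x_1 + d\,\frac{w}{\|w\|}$ for some signed distance $d$ and $w^\top x_1 + b = 0$,

$$d(x_0) = \frac{\lvert w^\top x_0 + b\rvert}{\|w\|} .$$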

The inner optimization finds the point with the minimum distance to the hyperplane; the outer optimization picks the best w and b, the ones that make that minimum distance as large as possible.

For the above expression, notice the inside expression:

Why add the 1/2? Only to simplify things when taking the derivative; it makes no real difference. Squaring ‖w‖ turns the problem into convex optimization, since it is a convex function, and it has no effect on where the optimum lies.
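Putting the pieces together, the standard hard-margin primal that this discussion arrives at is:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1,\qquad i = 1,\dots,N .$$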

For this constrained optimization problem, it is natural to use the Lagrange multiplier method.
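For reference, the Lagrangian built from the constrained problem above, with one multiplier $\alpha_i \ge 0$ per constraint, is the standard:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i\bigl[y_i(w^\top x_i + b) - 1\bigr], \qquad \alpha_i \ge 0 .$$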



Let’s pause for a moment and consider:

####⑴ The feasibility of SVM as machine learning: first, observe this formula; it feels familiar because it resembles L2 regularization. In L2 regularization we minimize Ein (the expected error of the model on the current training data) together with a penalty on ‖w‖², which gives the regularized L2 form; the support vector machine, on the other hand, assumes the plane classifies the training data correctly and minimizes ‖w‖²:

We will use that later. To review, the problems machine learning tries to solve are:
(1) Ein ≈ Eout. Ein was just explained; Eout is the error the model makes globally. Ein ≈ Eout requires the model to reflect the global behaviour; if it does not, the model is overfitting.
(2) Ein ≈ 0. This means the training error is close to zero, and this is the step where overfitting easily appears.
Ein ≈ Eout, i.e. generalization, is limited by the VC dimension: the more complex the model, the larger the VC dimension, so the Ω term on the right of the VC bound grows, the bound grows, and Ein can end up much smaller than Eout, because complex models mean small Ein. (The VC dimension, by the way, is the break point minus 1.) By that reasoning the VC dimension of an SVM should be very large, because w has as many components as the data has dimensions, and the number of free parameters in w usually determines the VC dimension.
But the SVM still works very well. Why? What limits its VC dimension? Look at an example: three points on a circle. If a separating line can always be found, the VC dimension is 3. But if we add the constraint that the separating line must have width 5, there may be nowhere to place it, and the VC dimension drops to 0. So with a large margin, the VC dimension is ≤ 3. As shown in the figure:

Therefore, the large margin restricts the VC dimension of the SVM, which leads to its good classification performance: a small VC dimension naturally means good generalization. This settles Ein ≈ Eout; Ein ≈ 0 is the problem we will solve later with convex optimization.



Back to business:

We need to minimize:

When a point violates the constraint, the constraint term is greater than 0 and α ≥ 0, so maximizing over α drives the expression to infinity, and it can never be the minimum.
When a point does not violate the constraint, the constraint term is ≤ 0 and α ≥ 0, so the maximum of α times the constraint term is 0, and minimizing over w and b gives exactly the original target function.

== original target function
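In other words, the constrained problem can be rewritten as the unconstrained min-max below (this is the standard identity that the two cases above verify):

$$\min_{w,b}\ \max_{\alpha_i \ge 0}\ \Bigl\{ \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N}\alpha_i\bigl[y_i(w^\top x_i + b) - 1\bigr] \Bigr\} \;=\; \min_{w,b}\ \frac{1}{2}\|w\|^2\ \ \text{s.t.}\ \ y_i(w^\top x_i + b)\ge 1 .$$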



Pause again and consider what the KKT condition is:

####⑵ Derivation of the KKT conditions: the modified target function above has the following property:

So the left side is ≥ the right side for any α. Now take the minimum over w and b on the left:

Since the left side is greater than or equal to the right-hand side in every case, we can also take the maximum over α on the right:

The right-hand side, with the min and max swapped, is called the dual problem. **Therefore, primal problem ≥ dual problem.** So is there a way to turn solving the original problem into solving the dual problem?



####⑶ simple proof of KKT condition

Strong duality means there is an optimal solution at which both sides of the inequality become equal. So assume an optimal w and b exist; then:

In the end, we can see that the solution is exactly the original objective function of f(W), and the original formula is:

And that is the same thing, because α·g(x) equals 0 after the maximization, so essentially this function is still minimizing f(w). Hence there is no difference between the dual and the original. According to this, we are essentially looking for

the minimum of the Lagrangian, where the w and b here stand for the original variables. Minimizing means the gradient is 0, so the gradients with respect to w and b are 0. And then there is the condition α·g(x) = 0, which is in fact a prerequisite:

We said this must be the same as the target function, and for that to hold the g(x) and h(x) terms must vanish, i.e. α·g(x) = 0: if g(x) < 0 then α = 0, and if α > 0 then g(x) = 0. Add the usual Lagrange multiplier conditions to this and you get the KKT conditions.
So the KKT conditions are:

The last few conditions are actually conditions for the Lagrange multiplier method.
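Collected in one place, in standard form and matching the conditions just listed:

$$
\begin{aligned}
&\nabla_w L = 0,\quad \frac{\partial L}{\partial b} = 0 &&\text{(stationarity)}\\
&y_i(w^\top x_i + b) - 1 \ge 0 &&\text{(primal feasibility)}\\
&\alpha_i \ge 0 &&\text{(dual feasibility)}\\
&\alpha_i\bigl[y_i(w^\top x_i + b) - 1\bigr] = 0 &&\text{(complementary slackness)}
\end{aligned}
$$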



Back to the topic. Now that we know the original target function can be converted into the dual problem via the KKT conditions, let's try to solve the problem above using them. First, take partial derivatives with respect to w and b:

When α > 0, the formula above tells us the point lies exactly on the boundary; such points are called support vectors. The fitted line is determined by these points alone, and every point that is not a support vector has α = 0.
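For reference, the two stationarity conditions written out give the standard results used from here on:

$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N}\alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N}\alpha_i y_i = 0 .$$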



Pause again, and let’s think about this:

####⑷ Why do we need the dual problem? In fact this optimization problem can be solved by ordinary QP, but if the dimension of our data is large, and even larger after a feature transform, the VC dimension grows, especially with the polynomial kernel, the RBF kernel, and so on. After applying the KKT conditions of the dual problem, our target expression becomes:

After converting to the dual problem, there are N variables and N constraints, neither of which depends on the feature dimension. In a sense the computation is simplified; in fact the complexity is still there, but the dependence on the dimension has been pushed into inner products between data points. The original motivation for transforming the SVM into its dual is to remove the influence of d' when d' becomes large after a feature transform in the nonlinear case; the kernel SVM, introduced next, solves this problem at the root. Note that this is just a way of looking at it: the d' dimension does not actually disappear, it just moves into the inner product.
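For reference, substituting the stationarity results back into the Lagrangian gives the standard dual, which involves the data only through inner products:

$$\max_{\alpha}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_{i=1}^{N}\alpha_i y_i = 0 .$$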



Back to the subject, our target function:



Pause again. **What is a support vector, and why does a non-support-vector point have α = 0?** Here we only consider the linear (hard-margin) SVM; with soft margin it is slightly different.

####⑸ support vector

A support vector has functional margin exactly 1. A non-support-vector point has functional margin > 1, so the bracketed term is negative, and to attain the maximum its α can only be 0. That is why every non-support-vector point has α = 0.



##⑤ Kernel Support Vector Machine

What we have discussed so far is also called linear SVM, and linear SVM only works for linearly separable data. What about linearly inseparable data, for example a circle? Then it simply cannot cope. Recall the feature transform discussed earlier for Linear Regression and Logistic Regression: if the data is nonlinear, we can map it into other dimensions and solve it there, the most common example being the Polynomial Transform. But that raises a question: didn't I just move the dimension d into the inner product (the KKT condition of the dual problem)? After a feature transform that inner product becomes φ(x0)·φ(x1), and the dimension is even larger.

Such as polynomial:

As you can see, the final computation depends only on the original space, not on the dimension of the z-space after the transform. For example, x lives in a 2-D space; to solve a nonlinear problem we map it to z-space, whose dimension is certainly larger than that of x-space. Yet the inner product in z-space can be computed using only the original x-space coordinates, because we can take the inner product first and then raise the dimension, instead of raising the dimension first and then taking the inner product.
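A small worked example of that idea (my own illustration with a degree-2 kernel in 2-D): computing $(1 + x^\top x')^2$ in the original space already equals the inner product of the degree-2 feature maps, without ever forming them:

$$(1 + x^\top x')^2 = 1 + 2x_1x_1' + 2x_2x_2' + x_1^2{x_1'}^2 + 2x_1x_2x_1'x_2' + x_2^2{x_2'}^2 = \Phi(x)^\top\Phi(x'),$$

$$\Phi(x) = \bigl(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ \sqrt{2}\,x_1x_2,\ x_2^2\bigr).$$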

polynomial kernel function

Gaussian kernel function: the Q-order polynomial just introduced has finite dimension. If the dimension is infinite, can it still be simplified with a kernel? Yes: there is an infinite-dimensional kernel called the Gaussian kernel.
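Its standard form, with γ as the width parameter, is shown below; the kernelTrans code later in this post uses the equivalent exp(-‖x − x′‖²/σ²) parameterization with kTup = ('rbf', σ):

$$K(x, x') = \exp\!\bigl(-\gamma\,\|x - x'\|^2\bigr).$$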



Again, I pause to think about what kernels mean and how they compare:

####⑹ Comparison of kernels. Polynomial kernel: the separating boundary is composed of polynomial curves. Advantages: the order Q can be set flexibly, so the boundary can follow the actual distribution more closely. Disadvantages: when Q is very high, kernel values below 1 are driven toward 0 while values above 1 blow up, which increases numerical and computational problems; and with several parameters it is hard to choose good values.

The advantages of the Gaussian kernel are that the boundary can be complex and varied, so it can partition the data most accurately; the computed kernel values fluctuate little; and there is only one parameter, which is easy to choose. The disadvantages are that w cannot be written out explicitly, because the feature transform is infinite-dimensional; the computation is slower than with a linear kernel; and overfitting may occur. In short: mysterious (no w), slower, too powerful.

As said before, by using the dual problem we moved the data dimension into the inner product, so in a way we simplified the computation, but the inner product itself is still a large computation.
So one function of the kernel is to simplify computation: it folds the dimension-raising step and the inner product into a single calculation, reducing the complexity. Where we used to map first and then compute the inner product, now we do both at once. The second function of the kernel is to measure the similarity of two sample points, i.e. the inner product. Since it represents similarity, can we use other functions as the kernel, or invent our own, such as Euclidean distance or cosine distance? The answer is no. Let's take a look at the kernel matrix:

This looks a little like the covariance matrix we saw before, except that the mean is not subtracted, so being symmetric and positive semi-definite is fundamental. So when we create or choose a kernel ourselves, it must be
① symmetric and ② positive semi-definite. This is also the validity condition for a kernel.
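As a quick sanity check, these two conditions can be verified numerically on a sample of points. The sketch below is my own illustration; the helper name, the RBF kernel and the tolerance are assumptions, not part of the implementation later in this post.

import numpy as np

def is_valid_kernel_matrix(X, kernel, tol=1e-8):
    # build the Gram matrix K[i, j] = k(x_i, x_j) on the sample
    n = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    symmetric = np.allclose(K, K.T)
    # eigvalsh expects a symmetric matrix; all eigenvalues should be >= 0 (up to tolerance)
    psd = bool(np.all(np.linalg.eigvalsh((K + K.T) / 2) >= -tol))
    return symmetric and psd

rbf = lambda a, b, gamma=0.5: np.exp(-gamma * np.sum((a - b) ** 2))
X = np.random.randn(20, 2)
print(is_valid_kernel_matrix(X, rbf))   # a valid (Mercer) kernel should print True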



Anyway, that should give a rough understanding of kernels.

##⑥ Soft-margin Support Vector Machine

The Gaussian kernel applied above still appears to overfit, and quite seriously, which shows that the large margin alone no longer restrains the Gaussian kernel. We need another way to handle this. Recall the earlier, relatively simple algorithm, the Perceptron Learning Algorithm, which has a good way of dealing with data that cannot be separated: do not demand that every point be classified correctly, just look for the classification with the fewest errors. So its function looks like this:

First, it is not linear, so quadratic programming, let alone the dual, cannot be used. Second, it treats big and small mistakes the same and cannot distinguish between minor and serious errors.
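The standard soft-margin primal, which addresses both issues by introducing a slack ξᵢ per point and a penalty coefficient C, is:

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0 .$$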

Next comes the derivation of the dual problem, which is very similar to the hard-margin case: the Lagrange multiplier method plus the duality conditions:

There is a circularity in the formula for b: we need ξ to solve for b, but ξ can only be obtained from the formula containing b, a chicken-and-egg problem. So we eliminate ξ. Complementary slackness gives βξ = 0, and since β = C − α, ξ can only be 0 when C − α is not 0. Therefore when α ∈ (0, C) we have ξ = 0; as we will see later, α ∈ (0, C) is exactly a support vector point. And then we can solve for b.

Next, let's look at something relatively important: the physical meaning of α.

Why βξ = 0? The reason is the same as for the previous formula: we take a maximum, β ≥ 0, and the term enters as −βξ, so at the maximum the product can only be 0. Also, if it were not 0 the value would no longer equal the solution of the original problem. With this, the transformation to the dual problem can be carried out.
We’re going to look at the physical implications of different values of alpha from these two formulas.

When α = 0, then ξ = 0: the point makes no error, because ξ = 0 means no violation is tolerated; and since α = 0 it is not a support vector. So it is a correctly classified point lying outside the boundary.

When α ∈ (0, C), again ξ = 0, so there is still no error, but now the bracketed part of the first expression equals 0, which means the point lies exactly on the boundary.

When α = C, it is not certain whether ξ is zero.


ξ measures how much the margin is violated, and there are two cases: either the classification is correct but the point is too close to the boundary, or the classification is simply wrong.
α > C cannot occur.

Let's get the whole line of thought straight: ① find the best hyperplane, the widest one; ② derive the target function; ③ realize that after a feature transform the dimension dominates the computational complexity, so use the dual problem to move it into inner products; ④ realize that the complexity is still there, so introduce the kernel function; ⑤ find that the kernel can still overfit, so introduce the soft margin.



Before talking about SMO algorithm, let’s talk about the understanding of error function:

####⑺ Understanding the SVM error function: we change the form of the SVM. ξ measures how far each point violates the boundary: when y(wTz + b) < 1 the point violates the margin and ξ = 1 - y(wTz + b) > 0; when y(wTz + b) ≥ 1 the point is outside the boundary and ξ = 0. Both cases are captured by the single definition ξ = max(1 - y(wTz + b), 0).

This is the error function of the support vector machine, which, as mentioned earlier, corresponds to driving Ein toward 0. This function is similar to an L2-style loss function:
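Written out, this is the standard unconstrained hinge-loss form of the soft-margin SVM:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\max\bigl(1 - y_i(w^\top z_i + b),\ 0\bigr).$$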

This is similar to the cost function of L2-regularized Logistic Regression.

It is almost the same; so if it is the same, why not solve it this way? There are two reasons. One is that this unconstrained optimization problem is not a QP, so the dual derivation and the kernel trick cannot be used. The other is that the max(·) term makes the function non-differentiable at some points, which makes solving by differentiation awkward.

By comparison, L2 regularization and the soft-margin SVM have the same form, with λ and C corresponding to each other. The large margin in soft-margin SVM corresponds to the short w in L2 regularization: both make the hypothesis set simpler. A larger λ means a smaller w and stronger regularization; a small C makes Ein large and the margin correspondingly wide. So increasing C and decreasing λ mean the same thing: large margin is equivalent to regularization, and both are used to prevent overfitting.


If we go by our earlier err0/1, which simply scores 0 when correct and 1 when wrong, then:


It can be seen that the SVM error is always ≥ err0/1, so by VC bound theory it can be used in place of err0/1 for classification.
Now add the cost function of the logistic function:



And this is almost the same as L2-regularized logistic regression. Both logistic regression and soft-margin SVM minimize upper bounds on the optimal err0/1, so solving the regularized logistic regression problem is roughly equivalent to solving the soft-margin SVM problem.

####⑻ loss function: err0/1:

So err0/1 looks like this: it depends only on whether the score is greater than or less than 0.
Hinge loss function: an insensitive loss function.

There are also the log loss, cross entropy, and so on; logistic regression uses cross entropy, while SVM uses the hinge loss. The support vector machine approximately implements structural risk minimization: the structural risk is an upper bound on the expected risk (Eout) and consists of the empirical risk (Ein) plus a confidence term given by the model complexity (Ω). The empirical risk depends on the chosen decision function f, while the confidence term is an increasing function of the VC dimension of f, and the two pull in opposite directions: as the VC dimension grows you can pick a better f with smaller empirical risk, but the confidence term grows. This corresponds exactly to VC bound theory. Fortunately, listening to Lin Xuanyu's course at Taiwan University gave me a good grasp of this theoretical foundation of machine learning.



Let’s get back to the SMO algorithm. SMO target function:

Select two variables, fix all the others, and construct a quadratic programming problem in just those two. Each step minimizes the objective over the two chosen variables; after solving, new variables are selected and the process repeats, so the objective keeps improving until every αi has been updated to satisfy the conditions. As for how the two α are selected: the first is the one that violates the KKT conditions most seriously, and the second is then determined automatically by the constraints.

First, suppose we select two variables α1 and α2 and fix other variables:

So when we update α2, we have to clip it to its feasible range.
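For reference, the standard two-variable update that the simplified SMO code later in this post implements, written with errors $E_i = g(x_i) - y_i$ and $\eta = K_{11} + K_{22} - 2K_{12}$ (the code computes η with the opposite sign, which is equivalent):

$$\alpha_2^{new,\,unc} = \alpha_2^{old} + \frac{y_2\,(E_1 - E_2)}{\eta},$$

then the unclipped value is cut to the box $[L, H]$:

$$L = \max(0,\ \alpha_2^{old} - \alpha_1^{old}),\quad H = \min(C,\ C + \alpha_2^{old} - \alpha_1^{old}) \qquad (y_1 \ne y_2),$$

$$L = \max(0,\ \alpha_2^{old} + \alpha_1^{old} - C),\quad H = \min(C,\ \alpha_2^{old} + \alpha_1^{old}) \qquad (y_1 = y_2),$$

and finally $\alpha_1^{new} = \alpha_1^{old} + y_1 y_2\,(\alpha_2^{old} - \alpha_2^{new})$.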

We assume:

The SMO algorithm has two key points: ① the selection of α1, the variable violating the KKT conditions most seriously; ② the selection strategy for α2.



How to choose the variables is a very important question; let me give an example.

####⑼ How the variables are selected: SMO calls the process of selecting the first variable the outer loop. The outer loop picks the training sample with the most serious KKT violation. The first time I saw this I was stunned, but think carefully about how to detect which samples fail the KKT conditions:

First iterate over all sample points with 0 < α < C and check whether they satisfy the KKT conditions; if none of them violate, sweep over all variables and test them. In each SMO step, as long as one of the two α's violates the KKT conditions, the objective value is guaranteed to improve after the iteration. Generally, the larger the KKT violation, the more obvious the improvement. The choice of the second variable is called the inner loop; for now let's just use ordinary random selection and see what happens.



##⑦ Implementation

import numpy as np  
import matplotlib.pyplot as plt  
import random  
import seaborn as sea  
import pandas as pd  


def get_positive_and_negative():  
  dataSet = pd.read_csv('Datas/LogiReg_data.txt', names=['V1', 'V2', 'Class'])  
  dataSet.Class[dataSet.Class == 0] = -1  
  dataSet = dataSet[60 : 80]  
  positive = dataSet[dataSet['Class'] == 1]  
  negative = dataSet[dataSet['Class'] == -1]  
  return positive , negative , dataSet  


def show_picture(positive , negative):  
  columns = ['V1', 'V2']  
  fig, ax = plt.subplots(figsize=(10, 5))  
  ax.scatter(positive[columns[0]], positive[columns[1]], s=30, c="b", marker="o", label="class 1")  
  ax.scatter(negative[columns[0]], negative[columns[1]], s=30, c="r", marker="x", label="class -1")  
  ax.legend()  
  ax.set_xlabel('V1')  
  ax.set_ylabel('V2')
  plt.show()


def load_data_set():
  _, _, file = get_positive_and_negative()
  orig_data = file.values   # .as_matrix() was removed in newer pandas; .values is equivalent
  cols = orig_data.shape[1]
  data_mat = orig_data[ : , 0 : cols-1]
  label_mat = orig_data[ : , cols-1 : cols]
  return data_mat , label_mat

positive , negative , data = get_positive_and_negative()  
show_picture(positive , negative)  
print(data)  

The first function gets the positive and negative samples, the second displays them, and the last one loads the data; just pick whatever data you like.

positive , negative , data = get_positive_and_negative()  
show_picture(positive , negative)  

Finally, call them to see what the points look like. Next come a couple of small tool functions:

'''Generate a random number'''  
def select_jrand(i , m):  
  j = i  
  while(j == i):  
      j = int(random.uniform(0 , m))  
  return j  
  pass  

'''Restrain (clip) the alpha'''  
def clip_alpha(aj , H , L):  
  if aj > H:  
      aj = H  
  elif aj < L:  
      aj = L  
  return aj  
  pass  

The next step is to implement support vector machines:

def SVM(data_mat , class_label , C , tolar , max_iter):  

  data_mat = np.mat(data_mat)  
  label_mat = np.mat(class_label)  
  b = 0  
  m , n = np.shape(data_mat)  
  alphas = np.zeros((m , 1))  
  iter = 0  

  while iter < max_iter:  
      # as an iterative variable
      alpha_pairs_changed = 0  
      # as the first A
      for i in range(m):  
          WT_i = np.dot(np.multiply(alphas , label_mat).T , data_mat)  
          f_xi = float(np.dot(WT_i , data_mat[i , :].T)) + b  
          Ei = f_xi - float(label_mat[i])  
          if ((label_mat[i]*Ei < -tolar) and (alphas[i] < C)) or ((label_mat[i]*Ei > tolar) and (alphas[i] > 0)):  
              j = Tools.select_jrand(i , m)  
              WT_j = np.dot(np.multiply(alphas , label_mat).T , data_mat)  
              f_xj  = float(np.dot(WT_j , data_mat[j , :].T)) + b  
              Ej = f_xj - float(label_mat[j])  
              alpha_iold = alphas[i].copy()  
              alpha_jold = alphas[j].copy()  

               if (label_mat[i] != label_mat[j]):  
                   L = max(0 , alphas[j] - alphas[i])  
                   H = min(C , C + alphas[j] - alphas[i])  
               else:  
                  L = max(0 , alphas[j] + alphas[i] - C)  
                  H = min(C , alphas[j] + alphas[i])  
              if H == L:  
                   continue  
               eta = 2.0 * data_mat[i, :] * data_mat[j, :].T - data_mat[i, :] * data_mat[i, :].T - data_mat[j, :] * data_mat[j, :].T  
               if eta >= 0:  
                   continue  
              alphas[j] = (alphas[j] - label_mat[j]*(Ei - Ej))/eta  
              alphas[j] = Tools.clip_alpha(alphas[j], H, L)  
              if (abs(alphas[j] - alpha_jold) < 0.00001):  
                  continue  
              alphas[i] = alphas[i] + label_mat[j]*label_mat[i]*(alpha_jold - alphas[j])  


              b1 = b - Ei + label_mat[i]*(alpha_iold - alphas[i])*np.dot(data_mat[i,:], data_mat[i,:].T) +\  
              label_mat[j]*(alpha_jold - alphas[j])*np.dot(data_mat[i,:], data_mat[j,:].T)  
              b2 = b - Ej + label_mat[i]*(alpha_iold - alphas[i])*np.dot(data_mat[i,:], data_mat[j,:].T) +\  
              label_mat[j]*(alpha_jold - alphas[j])*np.dot(data_mat[j,:], data_mat[j,:].T)  
              if (0 < alphas[i]) and (C > alphas[i]):  
                  b = b1  
              elif (0 < alphas[j]) and (C > alphas[j]):  
                  b = b2  
              else:  
                  b = (b1 + b2)/2.0  
              print(b)  
              alpha_pairs_changed += 1  
              pass  
      if alpha_pairs_changed == 0:  
          iter += 1  
      else:  
          iter = 0  

  support_x = []  
  support_y = []  
  class1_x = []  
  class1_y = []  
  class01_x = []  
  class01_y = []  
  for i in range(m):  
       if alphas[i] > 0.0:  
           support_x.append(data_mat[i, 0])  
           support_y.append(data_mat[i, 1])  
   for i in range(m):  
      if label_mat[i] == 1:  
          class1_x.append(data_mat[i, 0])  
          class1_y.append(data_mat[i, 1])  
      else:  
          class01_x.append(data_mat[i, 0])  
          class01_y.append(data_mat[i, 1])  
  w_best = np.dot(np.multiply(alphas, label_mat).T, data_mat)  
  fig, ax = plt.subplots(figsize=(10, 5))  
  ax.scatter(support_x, support_y, s=100, c="y", marker="v", label="support_v")  
  ax.scatter(class1_x, class1_y, s=30, c="b", marker="o", label="class 1")  
  ax.scatter(class01_x, class01_y, s=30, c="r", marker="x", label="class -1")  
  lin_x = np.linspace(0, 100)  
  lin_y = (-float(b) - w_best[0, 0] * lin_x) / w_best[0, 1]  
  plt.plot(lin_x, lin_y, color="black")  
  ax.legend()  
  ax.set_xlabel("factor1")  
  ax.set_ylabel("factor2")  
  plt.show()  
   return b , alphas


datamat , labelmat = load_data_set()
b , alphas = SVM(datamat , labelmat , 0.6 , 0.001 , 10)
print(b , alphas)  

The last parameters passed in are the penalty degree C and the tolerance. The most important line is this one:

if ((label_mat[i]*Ei < -tolar) and (alphas[i] < C)) or ((label_mat[i]*Ei > tolar) and (alphas[i] > 0)):  

It translates to y·g(x) < 1 − ξ or y·g(x) > 1 + ξ. In the first case the point is too close to the hyperplane, so its α should equal C; in the second case the point is far outside the boundary, yet α > 0, and being far from the boundary means it is not a support vector, so α should be 0. Either way the α can be changed. The lines that follow come straight from the formulas:
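Written out (using $E_i = g(x_i) - y_i$ and $y_i^2 = 1$, so $y_iE_i = y_ig(x_i) - 1$), that line is the usual KKT-violation test within tolerance ε:

$$\bigl[\,y_i g(x_i) < 1 - \varepsilon \ \text{and}\ \alpha_i < C\,\bigr] \quad\text{or}\quad \bigl[\,y_i g(x_i) > 1 + \varepsilon \ \text{and}\ \alpha_i > 0\,\bigr].$$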

Effect:

I have summarized several reasons for the instability: ① no cache, so updates are slow and the number of iterations is insufficient; ② no good strategy for selecting α2; ③ the η in the update formula:

We cannot be sure it is greater than 0. What exactly is η?

If it is less than zero, the kernel matrix is not positive semi-definite (K11 + K22 < 2K12). In addition, this η is actually:

the second derivative of the objective along the update direction; if it is less than 0, the subproblem is not convex. So the instability most likely comes from updates performed under these conditions.

'''
load data and define some tool functions
'''
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import random

def loadDataSet(filename):
  ''':param filename: :return: dataset and label'''

  dataset = []
  label = []
  fr = open(filename)
  for line in fr.readlines():
      lineArr = line.strip().split('\t')
      dataset.append( [np.float32(lineArr[0]) , np.float32(lineArr[1])] )
      label.append(np.float32(lineArr[2]))
  return dataset , label
  pass

'''
select alpha2 randomly
'''
def selectAlphaTwo(i , m):
  ''':param i: :param m: :return:'''
  j = i
  while(j == i):
      j = int(random.uniform(0 , m))
  return j

def rangeSelectionForAlpha(aj , H , L):
  if aj > H:
      aj = H
  if L > aj:
      aj = L
  return aj
  pass

'''
calculate Ei
'''
def calculateEi(os , k):
  fxk = float(np.multiply(os.alphas, os.labels).T * (os.x * os.x[k, :].T)) + os.b
  Ek = fxk - float(os.labels[k])
  return Ek

'''put the Ei into the cache when calculating Ei'''
def selectj(i , os , Ei):
  maxk = -1
  maxDeltaE = 0
  Ej = 0
  os.eCache[i] = [1 , Ei]
  validEachlist = np.nonzero(os.eCache[: , 0].A)[0]
  if (len(validEachlist) > 1):
      for k in validEachlist:
          if k == i:
              continue
          Ek = calculateEi(os , k)
          deltaE = np.abs(Ei - Ek)
          if deltaE > maxDeltaE:
              maxk = k
              maxDeltaE = deltaE
              Ej = Ek
      return maxk , Ej
      pass
  else:
      j = selectAlphaTwo(i , os.m)
      Ej = calculateEi(os , j)
  return j , Ej
  pass

'''
draw picture
'''
def drawDataset(data , label , x = None , y = None , line = True , alphas = None , kernel = True):
  index_one = []
  index_negative_one = []
  for i in range(100):
      if label[i] == 1:
          index_one.append(data[i])
      else:
          index_negative_one.append(data[i])
  index_one = np.matrix(index_one)
  index_negative_one = np.matrix(index_negative_one)
  plt.scatter(index_one[ : , 0].tolist() , index_one[: , 1].tolist() , c = 'r' , marker='<' , label = 'class equal one')
  plt.scatter(index_negative_one[: , 0].tolist() , index_negative_one[: , 1].tolist() , c = 'b' , marker='x' , label = 'class equal negative one')
  if line == True:
      plt.plot(x , y)
      pass

  '''Draw the support vectors, the points whose α is not equal to zero'''
  if line == True or kernel == True:
      a1 = []
      for i in range(len(alphas)):
          a = alphas[i]
           if a != 0:
               a1.append(data[i])
       a1 = np.matrix(a1)
       print('The number of the support vector : ' , len(a1))
       plt.scatter(a1[: , 0].tolist(), a1[: , 1].tolist(), s=150, c='none', alpha=0.7, linewidth=1.5, edgecolor='#AB3319' , label='support vector')

  plt.legend()
  plt.xlabel('X axis')
  plt.ylabel('Y axis')
  plt.show()

def updateEk(os,k):
  Ek = calculateEi(os,k)
  os.eCache[k]=[1,Ek]

if __name__ == '__main__':
  data , label = loadDataSet('../Data/testSetRBF.txt')
  drawDataset(data , label , line=False ,kernel=False)



The state of the SMO algorithm is held in a single class:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import random
import KernelTransform
class optStruct:
  def __init__(self , dataMat , labels , C , toler):
      self.x = dataMat
      self.labels = labels
      self.C = C
      self.toler = toler
      self.m = np.shape(dataMat)[0]
      self.alphas = np.mat(np.zeros((self.m , 1)))
      self.b = 0
      self.eCache = np.mat(np.zeros((self.m , 2)))
      self.K = np.mat(np.zeros((self.m , self.m)))
      for i in range(self.m):
          self.K[: , i] = KernelTransform.kernelTrans(self.x , self.x[i , :] , kTup=('rbf' , 1.2))
      pass


if __name__ == '__main__':
  # wrapped in np.mat so the kernel pre-computation in __init__ can index rows
  oS = optStruct(np.mat([[1.], [2.]]) , np.mat([[3.], [4.]]) , 1 , 1)
  a = oS.alphas.tolist()[0][0] - oS.alphas.tolist()[1][0]
  print(max(1.0 , a))

The only thing left to explain should be selectj(), which chooses α2 by maximizing the step size |Ei - Ek|. First the maximum is initialized to -1, since an absolute value can never be negative; os.eCache is our cache of Ei values. We store [1, Ei], where the leading 1 marks the entry as valid, and then collect all valid (non-zero) entries in the cache. If the resulting list has anything in it, we pick the k with the largest |Ei - Ek|; if not, we fall back to random selection.

Why this form? When picking the first α1 we prefer the non-boundary points (0 < α < C), because non-boundary samples are the ones most likely to need adjustment, while boundary samples usually cannot be adjusted further and stay on the boundary. Most samples clearly cannot be support vectors, and once their α reaches zero it never needs adjusting again. So we sweep the non-boundary samples, selecting those that violate the KKT conditions. When a sweep finds no non-boundary sample to adjust, we traverse all samples to test whether the whole set satisfies the KKT conditions; if that full pass still changes some samples, we go back to sweeping the non-boundary ones. The algorithm keeps alternating between all samples and the non-boundary samples until the entire set satisfies the KKT conditions, where "satisfies" is checked only up to some accuracy ε, since demanding very high precision usually makes convergence slow. Because the first α is picked from the cache in eCache (the non-boundary points), the candidate α2 is found by walking that cache; even though the cache holds Ei values, they cannot always be used directly, because they may be stale. So the iteration strategy for α alternates between non-boundary and global selection.

Then comes the formal algorithm:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import random
import Tool
import smo_class
import KernelTransform
def innerL(i ,os):
  Ei = Tool.calculateEi(os , i)
  if ((os.labels[i]*Ei < -os.toler) and
      (os.alphas[i] < os.C)) or ((os.labels[i]*Ei > os.toler) and
                                 (os.alphas[i] > 0)):
      j , Ej = Tool.selectj(i , os , Ei)
      alphaIold = os.alphas[i].copy()
      alphaJold = os.alphas[j].copy()
      if (os.labels[i] != os.labels[j]):
          L = max(0 , os.alphas[j] - os.alphas[i])
          H = min(os.C , os.C + np.array(os.alphas)[j] - np.array(os.alphas)[i])
      else:
          L = max(0 , os.alphas[j] + os.alphas[i] - os.C)
          H = min(os.C , np.array(os.alphas)[j] + np.array(os.alphas)[i])
      if L == H:
          return 0
      eta = 2.0 * os.x[i, :] * os.x[j, :].T - os.x[i, :] * os.x[i, :].T - os.x[j, :] * os.x[j, :].T
      if eta >= 0:
          print('η> 0, the kernel matrix is not semi-positive definite')
          return 0
      os.alphas[j] -= os.labels[j]*(Ei - Ej)/eta
      os.alphas[j] = Tool.rangeSelectionForAlpha(os.alphas[j] , H , L)
      Tool.updateEk(os , j)

      if (abs(os.alphas[j] - alphaJold) < 0.00001):
          print("j not moving enough")
          return 0
      os.alphas[i] += os.labels[j] * os.labels[i] * (alphaJold - os.alphas[j])
      Tool.updateEk(os , i)
      b1 = os.b - Ei - os.labels[i] * (os.alphas[i] - alphaIold) * \
           os.x[i, :] * os.x[i, :].T - os.labels[j] * \
           (os.alphas[j] - alphaJold) * os.x[i, :] * os.x[j, :].T
      b2 = os.b - Ej - os.labels[i] * (os.alphas[i] - alphaIold) * \
           os.x[i, :] * os.x[j, :].T - os.labels[j] * \
           (os.alphas[j] - alphaJold) * os.x[j, :] * os.x[j, :].T
      if (0 < os.alphas[i]) and (os.C > os.alphas[i]):
          os.b = b1
      elif (0 < os.alphas[j]) and (os.C > os.alphas[j]):
          os.b = b2
      else:
          os.b = (b1 + b2) / 2.0
      return 1
  else:
      return 0


def smo(data , labels , C = 0.6 , toler = 0.001 , maxIter = 40 , kernel = True):
   oS = smo_class.optStruct(np.mat(data) , np.mat(labels).transpose() , C , toler)
   iter = 0
   entireSet = True
   alphaPairsChanged = 0
   while (iter < maxIter) and ((alphaPairsChanged > 0) or (entireSet)):
      alphaPairsChanged = 0
      if entireSet:
          for i in range(oS.m):
              if kernel == True:
                  alphaPairsChanged += KernelTransform.innerL(i,oS)
              else:
                  alphaPairsChanged += innerL(i, oS)
          print("fullSet,iter: %d i: %d,pairs changed %d" %\
              (iter,i,alphaPairsChanged))
          iter +=1
      else:
           # element-wise product keeps the indices where both factors are non-zero: [0,1,1,0,0] * [1,1,0,1,0] = [0,1,0,0,0]
          nonBoundIs = np.nonzero((oS.alphas.A > 0)*(oS.alphas.A < C))[0]
          for i in nonBoundIs:
              alphaPairsChanged += innerL(i,oS)
              print("nou-bound,iter: %d i:%d,pairs changed %d" % (iter,i,alphaPairsChanged))
          iter +=1
       # entireSet controls the alternating selection policy
      if entireSet:
          entireSet = False
       # switch back to the full set only when no alpha was updated
      elif(alphaPairsChanged == 0):
          entireSet = True
      print("The iteration number: % d." % iter)
  return oS.b,oS.alphas

entireSet is the flag for the alternating policy; there isn't much else to say. After that, it's time for the driver code:

import Tool
import SMO
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import KernelTransform
'''Calculate w and draw the picture; the points whose α is not zero are called support vectors'''
def calculateW(alphas , data , labels):
  x = np.mat(data)
  label = np.mat(labels).transpose()
  m , n = np.shape(x)
  w = np.zeros((n , 1))
  for i in range(m):
      w += np.multiply(alphas[i] * label[i] , x[i , :].T)
  return w
  pass

if __name__ == '__main__':
  data, label = Tool.loadDataSet('../Data/testSet.txt')
  b,alphas = SMO.smo(data , label , kernel=False)
  w = calculateW(alphas , data , label)
  x = np.arange(0 , 11)
  print(w)
  y = (-b - w[0]*x)/w[1]
  Tool.drawDataset(data , label , x , y.tolist()[0] , line=True , alphas=alphas)

  data, label = Tool.loadDataSet('../Data/testSetRBF.txt')
  b, alphas = SMO.smo(data, label,kernel=True ,maxIter=100)
  svInd = np.nonzero(alphas.A > 0)[0]
  Tool.drawDataset(data, label,  line=False, alphas=alphas)








One of the two runs uses the kernel function, which is covered next, so don't worry about it yet. The effect:

Finally, the kernel transform and the kernel versions of innerL and calculateEi:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import Tool
def kernelTrans(X,A,kTup):
  m,n = np.shape(X)
  K = np.mat(np.zeros((m,1)))
  if kTup[0]=='lin':
      K = X*A.T
  elif kTup[0] =='rbf':
      for j in range(m):
          deltRow = X[j,:]-A
          K[j] = deltRow*deltRow.T
      K = np.exp(K/(-1*kTup[1]**2))
  return K

'''
update the innerL function
'''
def innerL(i ,os):
  Ei = calculateEi(os , i)
  if ((os.labels[i]*Ei < -os.toler) and
      (os.alphas[i] < os.C)) or ((os.labels[i]*Ei > os.toler) and
                                 (os.alphas[i] > 0)):
      j , Ej = Tool.selectj(i , os , Ei)
      alphaIold = os.alphas[i].copy()
      alphaJold = os.alphas[j].copy()
      if (os.labels[i] != os.labels[j]):
          L = max(0 , os.alphas[j] - os.alphas[i])
          H = min(os.C , os.C + np.array(os.alphas)[j] - np.array(os.alphas)[i])
      else:
          L = max(0 , os.alphas[j] + os.alphas[i] - os.C)
          H = min(os.C , np.array(os.alphas)[j] + np.array(os.alphas)[i])
      if L == H:
          return 0
      eta = 2.0 * os.K[i, j] - os.K[i, i] - os.K[j, j]
      if eta >= 0:
          print('η> 0, the kernel matrix is not semi-positive definite')
          return 0
      os.alphas[j] -= os.labels[j]*(Ei - Ej)/eta
      os.alphas[j] = Tool.rangeSelectionForAlpha(os.alphas[j] , H , L)
      updateEk(os , j)

      if (abs(os.alphas[j] - alphaJold) < 0.00001):
          print("j not moving enough")
          return 0
      os.alphas[i] += os.labels[j] * os.labels[i] * (alphaJold - os.alphas[j])
      updateEk(os , i)
      b1 = os.b - Ei - os.labels[i] * (os.alphas[i] - alphaIold) * \
           os.K[i , i] - os.labels[j] * \
           (os.alphas[j] - alphaJold) *  os.K[i , j]
      b2 = os.b - Ej - os.labels[i] * (os.alphas[i] - alphaIold) * \
           os.K[i , j] - os.labels[j] * \
           (os.alphas[j] - alphaJold) * os.K[j , j]
      if (0 < os.alphas[i]) and (os.C > os.alphas[i]):
          os.b = b1
      elif (0 < os.alphas[j]) and (os.C > os.alphas[j]):
          os.b = b2
      else:
          os.b = (b1 + b2) / 2.0
      return 1
  else:
      return 0

'''
update the Ei
'''
def calculateEi(os , k):
  fxk = float(np.multiply(os.alphas, os.labels).T * os.K[:, k] + os.b)
  Ek = fxk - float(os.labels[k])
  return Ek
def updateEk(os,k):
  Ek = calculateEi(os,k)
  os.eCache[k]=[1,Ek]

The implementation of this function already includes the kernel, so you can see the effect directly:

Github.com/GreenArrow2…