This is the 35th article in our machine learning series. We are continuing with the principles of the SVM model, and today we will explain the SMO algorithm.

Formula review

In the previous article, we analyzed and derived the formulas for both the hard-margin and soft-margin cases. We found that the two forms are very similar, differing only in a few parameters, so here we will focus on the soft-margin case.

By applying the Lagrange multiplier method and solving the dual of the original problem, we obtain the following quadratic programming problem:
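For reference, equation (1) below refers to the usual soft-margin dual, which can be written as:

\max_{\alpha} \; \sum_{i=1}^m \alpha_i - \frac12 \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j x_i^T x_j \qquad (1)

\text{s.t.} \; \sum_{i=1}^m \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \; i = 1, \dots, m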

It should satisfy the following KKT conditions:
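These are the standard soft-margin KKT conditions, where ξ_i are the slack variables, μ_i are the multipliers of the constraints ξ_i ≥ 0, and f(x_i) = w^T x_i + b:

\alpha_i \ge 0, \quad \mu_i \ge 0, \quad \alpha_i + \mu_i = C

y_i f(x_i) - 1 + \xi_i \ge 0, \quad \alpha_i \left( y_i f(x_i) - 1 + \xi_i \right) = 0

\xi_i \ge 0, \quad \mu_i \xi_i = 0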

In other words, we have to find the extremum of equation (1) under these conditions. Even though the problem has been reduced to a single set of parameters α, how do we actually find the extremum of equation (1) under these constraints? To solve this problem, we need to introduce a new algorithm, which is the main topic of today's article: the SMO algorithm.

Introduction to the SMO algorithm

SMO stands for Sequential Minimal Optimization. The starting point of the algorithm is that maximizing (1) requires finding a whole set of α values, and it is difficult to optimize all of them at the same time. The clever idea behind SMO is to treat just two of the α values as variables and hold all of the others constant.

The question here is: why do we choose two α values as variables rather than one? Wouldn't it be easier to pick just one? The reason is the constraint ∑ y_i α_i = 0; if we adjusted only a single α, we would obviously break it. So we pick two: one of them changes, and the other changes along with it, so that the constraint still holds.
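Written out, if the chosen pair is denoted α_1 and α_2 (as in the next paragraph) and all the other multipliers are held fixed, the pair must satisfy

\alpha_1 y_1 + \alpha_2 y_2 = -\sum_{i=3}^m y_i \alpha_i = \text{const}

so any change in one of them is compensated by the other.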

For ease of exposition, we take the two chosen multipliers to be α_1 and α_2. In addition, since the objective involves the products x_i^T x_j, we write K_ij = x_i^T x_j. Equation (1) above can then be written as:
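Keeping only the terms that involve α_1 and α_2 (everything else is constant for this subproblem), and flipping the sign so that maximizing (1) becomes a minimization, the subproblem takes the familiar two-variable form:

W(\alpha_1, \alpha_2) = \frac12 K_{11}\alpha_1^2 + \frac12 K_{22}\alpha_2^2 + y_1 y_2 K_{12}\alpha_1\alpha_2 - (\alpha_1 + \alpha_2) + y_1\alpha_1 \sum_{j=3}^m y_j\alpha_j K_{1j} + y_2\alpha_2 \sum_{j=3}^m y_j\alpha_j K_{2j} + \text{const}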

Since y_1 = ±1 and y_i^2 = 1, the constraint on the pair becomes α_1 y_1 + α_2 y_2 = k, with α_1, α_2 ∈ [0, C]. Because each y_i can only be 1 or -1, we can discuss the cases separately.

Case by case discussion

When y_1 and y_2 have opposite signs, the constraint becomes α_1 - α_2 = k, that is, α_2 = α_1 - k. In the first case we assume k > 0; in the second case k < 0, so that α_2 = α_1 + |k|. It is easy to see that the first case with k < 0 is simply the second case, and the second case with k > 0 is simply the first. This becomes a linear programming problem, and it is very clear once we draw it.

In the first case, the range of α_2 is (0, C - α_1 + α_2); in the second case it is (α_2 - α_1, C). Here we have expressed k back in terms of α_1 and α_2. Since we optimize α_1 and α_2 iteratively, denote the values from the previous round by α_{1o} and α_{2o}. Then the lower bound for the next round is L = max(0, α_{2o} - α_{1o}), and the upper bound is H = min(C + α_{2o} - α_{1o}, C).

In the same way, we can plot the case where y_1 and y_2 have the same sign; again there are the two sub-cases k > 0 and k < 0.

In the first case, y_1 = y_2 = 1, so α_1 + α_2 = k with k > 0, and the corresponding range of α_2 is (0, α_{1o} + α_{2o}). When k > C, which is the situation labeled 1' in the upper right corner beyond the middle dashed line, the range of α_2 is (α_{1o} + α_{2o} - C, C).

In the second case, y_1 = y_2 = -1 and α_1 + α_2 = k with k < 0; since the constraint 0 ≤ α_1, α_2 ≤ C cannot be satisfied, there is no solution in this case. Putting the two together, the lower bound is L = max(0, α_{1o} + α_{2o} - C) and the upper bound is H = min(α_{1o} + α_{2o}, C).
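To make the two cases concrete, here is a minimal sketch in Python of how these bounds are typically computed inside an SMO implementation (the function name and arguments are my own, not from the article):

def compute_bounds(alpha1_old, alpha2_old, y1, y2, C):
    """Feasible interval [L, H] for the new alpha_2, given the previous pair."""
    if y1 != y2:
        # Opposite signs: alpha_1 - alpha_2 is constant
        L = max(0.0, alpha2_old - alpha1_old)
        H = min(C, C + alpha2_old - alpha1_old)
    else:
        # Same sign: alpha_1 + alpha_2 is constant
        L = max(0.0, alpha1_old + alpha2_old - C)
        H = min(C, alpha1_old + alpha2_old)
    return L, H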

Suppose the unconstrained value obtained in the next iteration is α_{2new,unc}, where unc stands for unconstrained. Adding the constraint, we get:
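This is the standard clipping of the unconstrained solution onto the interval [L, H]:

\alpha_{2new} = \begin{cases} H, & \alpha_{2new,unc} > H \\ \alpha_{2new,unc}, & L \le \alpha_{2new,unc} \le H \\ L, & \alpha_{2new,unc} < L \end{cases}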

α_{2new,unc} is the value of α_2 obtained by setting the derivative to zero, but because of the constraints this value is not always attainable. The whole series of operations above is about finding the best value we can actually reach under the constraints. It does not matter too much if you cannot follow every step of the derivation, but the conclusion needs to be understood.

Substitution and elimination

Now that we have the range of values the new α_2 can take after the next iteration, what we need to do is find the values of α_1 and α_2 that minimize the loss function, much as in gradient descent. Since the value of α_1 y_1 + α_2 y_2 is already fixed, we only need to solve for one of them.

We let α_1 y_1 + α_2 y_2 = ξ, from which (using y_1^2 = 1) we get α_1 = y_1(ξ - α_2 y_2).

If we substitute this into the original formula, we can eliminate α_1, leaving a formula that contains only α_2, which we can treat as a function of α_2. To simplify it further, we define

v_i = \sum_{j=3}^m y_j \alpha_j K_{ij}, \qquad E_i = f(x_i) - y_i = \sum_{j=1}^m \alpha_j y_j K_{ij} + b - y_i

Here, E_i is the difference between the predicted value and the true label of the i-th sample. Substituting the two expressions above into the original formula and simplifying, we obtain:


f(\alpha_2) = \frac12 K_{11}(\xi - \alpha_2 y_2)^2 + \frac12 K_{22}\alpha_2^2 + y_2 K_{12}(\xi - \alpha_2 y_2)\alpha_2 - (\xi - \alpha_2 y_2)y_1 - \alpha_2 + (\xi - \alpha_2 y_2)v_1 + y_2\alpha_2 v_2

Then we take the derivative and set it to zero to find the extremum, which is high-school math.


\frac{\partial W}{\partial \alpha_2} = K_{11}\alpha_2 + K_{22}\alpha_2 - 2K_{12}\alpha_2 - K_{11}\xi y_2 + K_{12}\xi y_2 + y_1 y_2 - 1 - v_1 y_2 + y_2 v_2 = 0

If we solve this equation, we can get:


\alpha_{2new,unc} = \alpha_{2o} + \frac{y_2(E_1 - E_2)}{K_{11} + K_{22} - 2K_{12}}

We can calculate the value of α_2 for the next iteration from this formula. After computing it, we clip it against the upper and lower bounds of the constraint, which gives the best value attainable while the constraints are satisfied. Finally, we substitute α_2 back into ξ = α_1 y_1 + α_2 y_2 to solve for α_1. In this way we optimize a pair of α parameters at once. The SMO algorithm simply repeats this procedure, continually selecting two parameters to optimize, until the iteration limit is reached or no further improvement can be made.
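Putting the pieces together, here is a minimal sketch of a single SMO update for a pair of indices i and j. It is only an illustration under the assumptions of a precomputed kernel matrix K, labels y, multipliers alpha, bias b and penalty C; the function and variable names are my own, not from the article:

import numpy as np

def smo_step(i, j, alpha, y, K, b, C, eps=1e-8):
    """One SMO update of the pair (alpha[i], alpha[j]); returns the new alphas."""
    # Prediction errors E_k = f(x_k) - y_k
    f = K @ (alpha * y) + b
    E_i, E_j = f[i] - y[i], f[j] - y[j]

    # Feasible interval [L, H] for the new alpha[j] (see compute_bounds above)
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    if H - L < eps:
        return alpha  # no room to move this pair

    # eta = K_11 + K_22 - 2 K_12 from the update formula
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if eta < eps:
        return alpha  # degenerate pair, skip it

    # Unconstrained optimum, then clip onto [L, H]
    alpha_j_new = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, L, H)
    # alpha[i] compensates so that alpha[i]*y[i] + alpha[j]*y[j] stays constant
    alpha_i_new = alpha[i] + y[i] * y[j] * (alpha[j] - alpha_j_new)

    alpha = alpha.copy()
    alpha[i], alpha[j] = alpha_i_new, alpha_j_new
    return alpha

In a full implementation the bias b is also updated after each step, and the pair (i, j) is usually chosen by heuristics based on how strongly a point violates the KKT conditions, but the core of the algorithm is exactly the update above.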

The logic of the whole algorithm is not hard to understand, but there are quite a few formulas to derive along the way. This is why I put the SVM model at the end of the machine learning series. In the next article we will cover the kernel functions of the SVM model. After that, our machine learning series will come to an end, and we will start a new series on deep learning.

This is the end of today's article. If you liked it, please give me a little support (follow, share, like).

Original article link; please consider following.