
5.1 Original problem and dual problem

5.1.1 The dual problem

Let’s put aside the SVM problem for a moment and look at how to solve a general optimization problem with inequality constraints. Suppose the problem is as follows:


$$
\begin{cases}
\min_w f(w)\\
s.t.{\quad}g_i(w) \le 0 {\quad}i=1,\dots,k\\
s.t.{\quad}h_j(w) = 0{\quad}j=1,\dots,m
\end{cases}\tag{2}
$$

This is an optimization problem with both equality and inequality constraints. We first define two functions:


$$
\begin{aligned}
L(w,\alpha,\beta) &= f(w) + \sum^{k}_{i=1}\alpha_i g_i(w)+\sum^{m}_{j=1}\beta_j h_j(w)\\
\theta(\alpha,\beta) &= \min_{w}L(w,\alpha,\beta)
\end{aligned}\tag{3}
$$

Next, the dual problem of the original problem (2) is defined as:


$$
\begin{cases}
\max_{\alpha,\beta}\theta(\alpha,\beta)\\
s.t.{\quad}\alpha_i \ge 0{\quad}i=1,\dots,k
\end{cases}\tag{4}
$$

Here we have the first conclusion (weak duality): if $w^*$ is a solution of the original problem and $(\alpha^*,\beta^*)$ is a solution of the dual problem, then $f(w^*) \ge \theta(\alpha^*,\beta^*)$.

The proof is as follows:


$$
\begin{aligned}
\theta(\alpha^*,\beta^*)&=\min_w L(w,\alpha^*,\beta^*)\\
&\le L(w^*,\alpha^*,\beta^*)\\
&=f(w^*) + \sum^{k}_{i=1}\alpha_i^* g_i(w^*)+\sum^{m}_{j=1}\beta_j^* h_j(w^*)
\end{aligned}
$$

Because $w^*$ is a solution to the original problem, it satisfies the original constraints $g_i(w^*) \le 0$ and $h_j(w^*) = 0$; because $(\alpha^*,\beta^*)$ is a solution to the dual problem, it satisfies the dual constraint $\alpha_i^* \ge 0$. It follows that:


$$
\begin{aligned}
\sum^{k}_{i=1}\alpha_i^* g_i(w^*) & \le 0\\
\sum^{m}_{j=1}\beta_j^* h_j(w^*) & = 0
\end{aligned}
$$

Therefore $f(w^*) \ge \theta(\alpha^*,\beta^*)$.

5.1.2 Strong duality theorem

We can see that the optimal values of the original problem and the dual problem are not always equal; when equality holds, we say that strong duality holds. When does equality hold? To answer this we introduce the strong duality theorem:


If $g(w) = Aw + b$, $h(w) = cw + d$, and $f(w)$ is a convex function, then $f(w^*) = \theta(\alpha^*,\beta^*)$.

Put simply, if the objective $f(w)$ of the original problem is a convex function and the constraints are linear functions, then the optimal value $f(w^*)$ of the original problem equals the optimal value $\theta(\alpha^*,\beta^*)$ of the dual problem. Due to limited space, the detailed proof is not given here; for more details, see the machine learning course of Mr. Hu Haoji from Zhejiang University on MOOC.
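
As a quick illustration (a toy example of my own, not taken from the course above), consider minimizing $f(w)=w^2$ subject to the single constraint $g(w)=1-w\le 0$; here $f$ is convex and $g$ is linear, so the theorem applies:

$$
\begin{aligned}
L(w,\alpha) &= w^2 + \alpha(1-w), \qquad \frac{\partial L}{\partial w}=2w-\alpha=0 \;\Rightarrow\; w=\frac{\alpha}{2}\\
\theta(\alpha) &= \frac{\alpha^2}{4} + \alpha\Bigl(1-\frac{\alpha}{2}\Bigr)=\alpha-\frac{\alpha^2}{4}\\
\max_{\alpha\ge 0}\theta(\alpha)&\;\Rightarrow\;\alpha^*=2,\quad\theta(\alpha^*)=1
\end{aligned}
$$

The original problem has $w^*=1$ and $f(w^*)=1=\theta(\alpha^*)$, so the two optimal values coincide; note also that $\alpha^*g(w^*)=2\cdot 0=0$, which matches the condition derived next.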

From the strong duality theorem we can prove that for every $i = 1,\dots,k$, $\alpha_i g_i(w) = 0$ (i.e., the KKT complementary slackness condition). The proof is as follows:


$$
\begin{aligned}
f(w^*)&=\theta(\alpha^*,\beta^*)\\
&=\min_w L(w,\alpha^*,\beta^*)\\
&\le L(w^*,\alpha^*,\beta^*)\\
&=f(w^*) + \sum^{k}_{i=1}\alpha_i^* g_i(w^*)+\sum^{m}_{j=1}\beta_j^* h_j(w^*)
\end{aligned}
$$

Because $w^*$ is a solution to the original problem, it satisfies the original constraints $g_i(w^*) \le 0$ and $h_j(w^*) = 0$; because $\alpha_i^*$ is a solution to the dual problem, it satisfies $\alpha_i^* \ge 0$. The chain above then gives $0 \le \sum^{k}_{i=1}\alpha_i^* g_i(w^*)$, while each term satisfies $\alpha_i^* g_i(w^*) \le 0$, so every term must vanish: $\alpha_i^* g_i(w^*) = 0$.

5.1.3 Summary

Based on the above content, we get the original problem and the dual problem:

The original problem:


$$
\begin{cases}
\min_w f(w)\\
s.t.{\quad}g_i(w) \le 0 {\quad}i=1,\dots,k\\
s.t.{\quad}h_j(w) = 0{\quad}j=1,\dots,m
\end{cases}
$$

Dual problem:


$$
\begin{cases}
\max_{\alpha,\beta}\theta(\alpha,\beta)\\
s.t.{\quad}\alpha_i \ge 0{\quad}i=1,\dots,k
\end{cases}
$$

where


$$
\begin{aligned}
\theta(\alpha,\beta) &= \min_{w}L(w,\alpha,\beta)\\
L(w,\alpha,\beta) &= f(w) + \sum^{k}_{i=1}\alpha_i g_i(w)+\sum^{m}_{j=1}\beta_j h_j(w)
\end{aligned}
$$

For a problem satisfying the strong duality theorem, the optimal value of the original problem equals that of the dual problem, and the optimal solutions satisfy $\alpha_i g_i(w) = 0$.

Next, we will show how to transform the SVM problem into its corresponding dual problem and solve it.

5.2 Transforming the SVM into its dual problem

First, we rewrite the SVM problem obtained in the previous section so that it matches the form of the original problem (2):


$$
\begin{cases}
\min_{w,b} \frac {||w||^2}{2}\\
s.t.{\quad} 1-y_i(w^Tx_i+b) \le 0
\end{cases}
$$

Note that the original SVM problem has two variables, $w$ and $b$. Although only $w$ appears in the objective, $b$ implicitly restricts the value of $w$ through the constraint, so there are two variables here, not one.

Following Equation (3), we write the corresponding functions for the SVM:


$$
\begin{aligned}
L(w,b,\alpha) &= \frac {||w||^2}{2} + \sum^{m}_{i=1}\alpha_i (1-y_i(w^Tx_i+b))\\
\theta(\alpha) &= \min_{w,b}L(w,b,\alpha)
\end{aligned}\tag{5}
$$

Note the difference between (3) and (5): since the original SVM problem has no equality constraints, there is no $h(w)$ and no $\beta$. The original problem has two variables, $w$ and $b$, where $w$ is a vector and $b$ is a scalar. The dual problem we need to solve is:


$$
\begin{cases}
\max_{\alpha}\theta(\alpha)\\
s.t.{\quad}\alpha_i \ge 0{\quad}i=1,\dots,m
\end{cases}
$$

We can see that this final problem involves neither $w$ nor $b$, so we first need to write out the expression for $\theta(\alpha)$. Computing $\theta(\alpha)$ is just an unconstrained minimization over $w$ and $b$, which is usually solved by setting the derivatives to zero:


$$
\begin{aligned}
\frac{\partial L}{\partial w}&=w-\sum^{m}_{i=1}\alpha_iy_ix_i=0\\
\frac{\partial L}{\partial b}&=-\sum^{m}_{i=1}\alpha_iy_i=0
\end{aligned}
$$

Rearranging, we get:


$$
\begin{aligned}
\sum^{m}_{i=1}\alpha_iy_ix_i&=w\\
\sum^{m}_{i=1}\alpha_iy_i&=0
\end{aligned}\tag{6}
$$

Substitute (6) into (5) to get:


$$
\theta(\alpha) = \sum^{m}_{i=1}\alpha_i - \frac{1}{2}\sum^{m}_{i=1}\sum^{m}_{j=1}\alpha_i\alpha_jy_iy_jx_i^Tx_j \tag{7}
$$
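
For readers who want the intermediate steps that were skipped, here is a sketch of the expansion; it substitutes equation (6) into (5) and uses $x_j^Tx_i = x_i^Tx_j$:

$$
\begin{aligned}
L(w,b,\alpha) &= \frac{1}{2}w^Tw + \sum^{m}_{i=1}\alpha_i - \sum^{m}_{i=1}\alpha_iy_iw^Tx_i - b\sum^{m}_{i=1}\alpha_iy_i\\
&= \frac{1}{2}\sum^{m}_{i=1}\sum^{m}_{j=1}\alpha_i\alpha_jy_iy_jx_i^Tx_j + \sum^{m}_{i=1}\alpha_i - \sum^{m}_{i=1}\sum^{m}_{j=1}\alpha_i\alpha_jy_iy_jx_j^Tx_i - 0\\
&= \sum^{m}_{i=1}\alpha_i - \frac{1}{2}\sum^{m}_{i=1}\sum^{m}_{j=1}\alpha_i\alpha_jy_iy_jx_i^Tx_j
\end{aligned}
$$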

In the expansion, note that $||w||^2 = w^Tw$. Combining (7) with the constraints and the KKT condition obtained earlier, we obtain the final form of the dual problem:


$$
\begin{cases}
\max_\alpha \left(\sum^{m}_{i=1}\alpha_i - \frac{1}{2}\sum^{m}_{i=1}\sum^{m}_{j=1}\alpha_i\alpha_jy_iy_jx_i^Tx_j\right)\\
s.t.{\quad}\sum^{m}_{i=1}\alpha_iy_i=0\\
s.t.{\quad}\alpha_i(y_i(w^Tx_i+b)-1)=0\\
s.t.{\quad}y_i(w^Tx_i+b)-1 \ge 0\\
s.t.{\quad}\alpha_i \ge 0
\end{cases}\tag{8}
$$

From $\alpha_i(y_i(w^Tx_i+b)-1)=0$ we know that, for each sample, either $\alpha_i = 0$ or $y_i(w^Tx_i+b)-1 = 0$.

When $\alpha_i = 0$, the sample $x_i$ contributes nothing to $w = \sum^{m}_{i=1}\alpha_iy_ix_i$.

When $y_i(w^Tx_i+b)-1=0$, the sample $x_i$ is a support vector.

From these two points we can see that after training, most of the training samples do not need to be retained; the final model depends only on the support vectors, which is why this method is called the support vector machine.

Problem (8) is a quadratic programming problem: the objective function is a quadratic function of $\alpha$ and the constraints are linear. Such a convex problem can be solved to global optimality with a general-purpose quadratic programming algorithm, or, more efficiently, with an algorithm designed around the structure of the problem itself, such as Sequential Minimal Optimization (SMO). A small numerical sketch is given below.

After solving (8) for $\alpha$, $w$ can be computed from $\sum^{m}_{i=1}\alpha_iy_ix_i=w$; and because any support vector $x_i$ satisfies $y_i(w^Tx_i+b)=1$, i.e. $b = y_i - w^Tx_i$, $b$ can then be obtained.
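
To make this concrete, here is a minimal numerical sketch (my own illustration, not code from the references): it solves the dual (8) on a tiny hand-made, linearly separable data set, using scipy's general-purpose SLSQP solver in place of a dedicated QP or SMO routine, and then recovers $w$ and $b$ as described above.

```python
# A minimal sketch of solving the SVM dual (8) numerically.
# The data set and variable names below are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

# Toy data: two points per class, linearly separable
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

# Q_ij = y_i y_j x_i^T x_j, the matrix behind the double sum in (7)/(8)
Q = (y[:, None] * X) @ (y[:, None] * X).T

# Maximizing sum(alpha) - 0.5 * alpha^T Q alpha  <=>  minimizing its negative
def neg_dual(alpha):
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

constraints = [{"type": "eq", "fun": lambda a: a @ y}]  # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * m                              # alpha_i >= 0

res = minimize(neg_dual, x0=np.zeros(m), bounds=bounds,
               constraints=constraints, method="SLSQP")
alpha = res.x

# Recover w from equation (6), and b from one support vector (largest alpha_i)
w = (alpha * y) @ X
sv = np.argmax(alpha)
b = y[sv] - w @ X[sv]

print("alpha =", np.round(alpha, 4))
print("w =", w, "b =", b)
```

On this toy data only the two points closest to the separating line end up with nonzero $\alpha_i$, which matches the support-vector discussion above.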

We then obtain the classification hyperplane for the case where the samples are linearly separable:


$$
f(x) = w^Tx + b
$$
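
Continuing the hypothetical sketch above, a new point is classified by the sign of $f(x)$:

```python
# Continuing the sketch above: the predicted class of a new point is sign(w^T x + b)
x_new = np.array([1.0, 2.5])  # a hypothetical test point
print("predicted label:", np.sign(w @ x_new + b))
```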

Next, we will discuss how to use the SVM for classification when the samples are not linearly separable.

References

Machine Learning, Zhihua Zhou, Tsinghua University Press

Detailed Explanation of Machine Learning Formulas, Xie Wenrui, People's Posts and Telecommunications Press

Machine learning course taught by Hu Haoji of Zhejiang University, on MOOC