SVM is usually solved via its dual problem, which has two advantages: (1) the dual has only N variables, where N is the number of training samples, whereas the primal has as many variables as the samples have features; (2) kernel functions can be introduced naturally, which makes nonlinear SVM tractable. SMO is the most commonly used algorithm for solving the dual problem, but beginners often find it hard to understand thoroughly. This article tries to retrace the thought process of the algorithm's inventor so that SMO becomes easy to follow. The “I” below refers to that imagined inventor.

Searching high and low, and finding nothing

Recently, many friends complained to me that solving the SVM dual problem is too slow: when the training set is large, the existing algorithms crawl like snails, and world-renowned scholars are all hunting for new ones. Hearing this, I was delighted: “Brothers, this is my chance to make a name for myself!”

I opened the book, turned to the problem, and saw something like this:





$$
\min_{\alpha}\;\; \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j K(x_i,x_j)\;-\;\sum_{i=1}^{N}\alpha_i
$$

$$
\text{s.t.}\quad \sum_{i=1}^{N}\alpha_i y_i = 0,\qquad 0 \le \alpha_i \le C,\quad i=1,2,\dots,N
$$

This is obviously a convex quadratic programming problem, right? Wait — my friends said the current algorithms are slow, so the conventional approach won't do. I have to think differently.

A way, a way. Where are you?

I thought hard for days, but no good idea came. It looked like fame was going down the drain. I put down my book and decided to go to the lake for a break — I had been cooped up in a dark room for too long.

Then it comes without the slightest effort

Noon, no wind. Around the lake, couples murmured to each other; only I was painfully single. I sat on a large rock by the shore, and the calm water reflected my bearded, haggard face. I smiled wryly to myself: “Even the lake must pity me, casting a reflection to keep me company — a pair of shadows.” “A pair??!!!!!” A light flashed through my mind like the first thunder over parched earth — I suddenly had a new idea!

I ran frantically back to my room, leaving the startled couples' resentful stares behind me.

I began to gather my thoughts:

If this is treated as a plain convex quadratic programming problem, there is little room for anything new — convex quadratic programming has been studied to death. What makes it special is that it is the dual of another problem, and it satisfies the KKT conditions. How can I make the most of that particularity?

Let me pick an arbitrary α = (α1, α2, …, αN). Suppose it were the optimal solution; then the KKT conditions would give the optimal solution (w, b) of the primal problem, which looks like this:





$$
w = \sum_{i=1}^{N}\alpha_i y_i x_i
$$





$$
b = y_j - \sum_{i=1}^{N}\alpha_i y_i K(x_i, x_j)\quad\text{for any } j \text{ with } 0 < \alpha_j < C
$$

Then the separating hyperplane g(x) can be obtained:





$$
g(x) = \sum_{i=1}^{N}\alpha_i y_i K(x_i, x) + b
$$
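To make the recovery of b and the evaluation of g(x) concrete, here is a minimal Python sketch (the function names and the linear kernel are my own choices for illustration; the original gives no code):

```python
import numpy as np

def linear_kernel(a, b):
    # the simplest kernel, K(a, b) = <a, b>
    return float(np.dot(a, b))

def g(x, X, y, alpha, b, kernel=linear_kernel):
    """g(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(alpha[i] * y[i] * kernel(X[i], x) for i in range(len(y))) + b

def recover_b(X, y, alpha, C, kernel=linear_kernel):
    """b = y_j - sum_i alpha_i * y_i * K(x_i, x_j), for any j with 0 < alpha_j < C."""
    j = next(i for i in range(len(y)) if 0 < alpha[i] < C)
    return y[j] - sum(alpha[i] * y[i] * kernel(X[i], X[j]) for i in range(len(y)))
```

For the two-point toy set x = +1 (label +1), x = −1 (label −1) with α = (0.5, 0.5), this recovers b = 0 and gives g(+1) = 1, g(−1) = −1, exactly on the margins.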

According to SVM theory, if this g(x) is the optimal separating hyperplane, then:





$$
\begin{cases}
y_i\,g(x_i) \ge 1, & \alpha_i = 0\\[2pt]
y_i\,g(x_i) = 1, & 0 < \alpha_i < C\\[2pt]
y_i\,g(x_i) \le 1, & \alpha_i = C
\end{cases}
$$

Let's call these the g(x) target conditions. According to existing theory, the derivation above is reversible. In other words, as long as I can find an α that satisfies the two initial constraints of the dual problem:





$$
\sum_{i=1}^{N}\alpha_i y_i = 0,\qquad 0 \le \alpha_i \le C,\quad i=1,2,\dots,N
$$

and whose separating hyperplane g(x) also satisfies the g(x) target conditions, then this α is the optimal solution of the dual problem!!
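Checking the target conditions can be done mechanically. Below is a hedged sketch of such a checker (the helper names and the tolerance `tol` are my own choices, not from the original; a linear kernel is assumed for illustration):

```python
import numpy as np

def linear_kernel(a, b):
    return float(np.dot(a, b))

def g(x, X, y, alpha, b, kernel=linear_kernel):
    # g(x) = sum_i alpha_i * y_i * K(x_i, x) + b
    return sum(alpha[i] * y[i] * kernel(X[i], x) for i in range(len(y))) + b

def violates_target(i, X, y, alpha, b, C, tol=1e-3, kernel=linear_kernel):
    """True if sample i violates the g(x) target conditions, up to tolerance tol."""
    ygx = y[i] * g(X[i], X, y, alpha, b, kernel)
    if alpha[i] < tol:            # alpha_i == 0     ->  need y_i g(x_i) >= 1
        return ygx < 1.0 - tol
    if alpha[i] > C - tol:        # alpha_i == C     ->  need y_i g(x_i) <= 1
        return ygx > 1.0 + tol
    return abs(ygx - 1.0) > tol   # 0 < alpha_i < C  ->  need y_i g(x_i) == 1
```

In a full solver, sweeping this check over all samples is exactly how one detects whether optimization is finished.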

At this point my plan was set: first initialize an α that satisfies the two initial constraints of the dual problem; then keep optimizing it so that the g(x) it determines approaches the target conditions, always preserving the initial constraints along the way. That is how the optimal solution can be found.

I couldn't help feeling smug: instead of thinking about how to minimize the objective function, I think about how to make α satisfy the g(x) target conditions. I'm fucking awesome! Haha!!!!!

Mountains and rivers, with seemingly no way through

So how should α be optimized? After some thought, I found two basic principles to follow:

  • Each optimization step must update at least two components of α, because changing only one component would break the equality constraint among the initial conditions.

  • The two components chosen each time should violate the g(x) target conditions as badly as possible. For instance, where yi·g(xi) is supposed to be at least 1, the smaller it actually is, the worse the violation — and that gives a basic criterion for choosing the two components to optimize.

Well, let me pick the first component. Some components of α equal 0, some equal C, and some lie strictly between 0 and C. My intuition says it is wise to search first among the components with 0 < αi < C, and to fall back to the other two groups only if nothing optimizable is found there.

Now I have picked a component; call it α1. The subscript 1 means it is the first component I chose to optimize, not the first component of α.

Why not pick the second component the same way?

My thinking was this: the chosen components should indeed badly violate the g(x) target conditions, but another important consideration is that a single optimization step should change the two components as much as possible, so that the target conditions are met in as few iterations as possible. Since α1 was chosen as the worst violator of the target conditions, I would like to choose α2 so that α1 and α2 change as much as possible after the update.

You might be thinking: that sounds nice, but how do you actually choose α2?

After much thought, I did find a criterion for choosing α2!!

I calculate an index Ei for each component, which looks like this:





$$
E_i = g(x_i) - y_i
$$

I found that the larger |E1 − E2| is, the more α1 and α2 change after the update. So if E1 is positive, the more negative E2 the better; if E1 is negative, the more positive E2 the better. That is how I pick α2.

What? Why is that, you ask?

I'll explain that later; for now, let me optimize my α1 and α2.

Boundless scenery on the perilous peak

How should α1 and α2 be updated so that their samples satisfy the g(x) target conditions, or at least violate them less? I'm not a greedy man — as long as each step moves things in a good direction, that is enough.

I thought I had rounded the peak, but behind it stood an even steeper fucking mountain! I gritted my teeth: even if you were a 90-degree cliff, brother here would still climb you!!

In the midst of my contemplation, my eyes stumbled upon the dual problem again:





$$
\min_{\alpha}\;\; \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j K(x_i,x_j)\;-\;\sum_{i=1}^{N}\alpha_i
$$

$$
\text{s.t.}\quad \sum_{i=1}^{N}\alpha_i y_i = 0,\qquad 0 \le \alpha_i \le C,\quad i=1,2,\dots,N
$$

A flash of inspiration — the plan came to me!

Although I don't know how to update α1 and α2 so that their samples directly violate the g(x) target conditions less, I can make the objective function value smaller after the update! Making the objective smaller is certainly optimizing in the right direction, and it must be the same direction that lessens the g(x) violations — the two coincide!!

How clever of me!

Treat α1 and α2 as variables and the other components as constants; then the dual problem reduces to a super simple two-variable quadratic optimization problem:





$$
\min_{\alpha_1,\alpha_2}\; W(\alpha_1,\alpha_2) = \frac{1}{2}K_{11}\alpha_1^2 + \frac{1}{2}K_{22}\alpha_2^2 + y_1 y_2 K_{12}\alpha_1\alpha_2 - (\alpha_1+\alpha_2) + y_1\alpha_1 v_1 + y_2\alpha_2 v_2
$$

$$
\text{s.t.}\quad \alpha_1 y_1 + \alpha_2 y_2 = -\sum_{i=3}^{N}\alpha_i y_i,\qquad 0 \le \alpha_i \le C,\quad i=1,2
$$

where:





$$
K_{ij} = K(x_i, x_j)
$$





$$
v_i = \sum_{j=3}^{N}\alpha_j y_j K(x_i, x_j),\quad i=1,2
$$

At this point, the problem is super easy!

For example, if y1 and y2 both equal 1, then the equality constraint becomes (with K a constant determined by the frozen components):





$$
\alpha_1 + \alpha_2 = K
$$

First, substitute α1 = K − α2 into the objective function, turning it into a single-variable function of α2; setting its derivative with respect to α2 to zero yields α2_new.

Meanwhile α2 must also satisfy the box constraints 0 ≤ α1, α2 ≤ C; combined with α1 = K − α2, this gives the feasible range:

$$
\max(K - C,\, 0) \;\le\; \alpha_2 \;\le\; \min(K,\, C)
$$

If α2_new falls inside this range, great — solve for α1_new, and one iteration is complete. If α2_new falls outside, truncate it to the nearest boundary to get a clipped α2_new, recover the corresponding α1_new from the equality constraint, and the iteration likewise ends!!
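The whole two-variable step for the equal-label case described above can be sketched as follows. The closed-form update α2_new = α2 + y2(E1 − E2)/η with η = K11 + K22 − 2K12 is my phrasing of the standard SMO result, and the function name is hypothetical:

```python
def smo_pair_update(a1, a2, y1, y2, E1, E2, K11, K22, K12, C):
    """One SMO two-variable update (sketch): returns the new (a1, a2), clipped.

    Only the y1 == y2 case from the text is handled, where a1 + a2 = K."""
    assert y1 == y2, "this sketch covers the equal-label case only"
    K_sum = a1 + a2                      # the constant K in the text
    lo = max(K_sum - C, 0.0)             # feasible range for a2 ...
    hi = min(K_sum, C)                   # ... max(K - C, 0) <= a2 <= min(K, C)
    eta = K11 + K22 - 2.0 * K12          # curvature of the 1-D objective in a2
    a2_new = a2 + y2 * (E1 - E2) / eta   # unconstrained minimizer
    a2_new = min(max(a2_new, lo), hi)    # truncate (clip) to the feasible range
    a1_new = K_sum - a2_new              # recover a1 from the equality constraint
    return a1_new, a2_new
```

Note that a1_new + a2_new always equals the original a1 + a2, so the equality constraint of the dual problem is preserved no matter whether clipping happened.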

At this point, I had finally found a new method for solving the SVM dual problem and planted a tree of my own in the land of SVM. Fame would naturally follow!