Summary of the SVM

The support vector machine (SVM) is a supervised classification algorithm, most often applied to binary classification problems. Let's first build intuition for several SVM concepts through a series of pictures. In the figure above, the orange dots and blue dots represent two classes of labels. How would you go about separating them? Some readers may think of fitting a decision boundary with logistic regression, as covered in the previous article. That is certainly a good method, and the SVM introduced in this article can also solve this classification problem. Since both are classification algorithms, we can compare their similarities and differences through an example.

Hyperplane

Two ways of partitioning the data are shown here. The solid line in the figure is called a decision boundary in logistic regression, while in SVM it is called a hyperplane.

In the example above, the data points lie in a two-dimensional plane, so the hyperplane is a straight line. What if the given data set is three-, four-, ..., or n-dimensional? Then the corresponding hyperplane is two-, three-, ..., or (n-1)-dimensional. The figure below shows the hyperplane of a higher-dimensional data set.
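
To state this idea precisely: in n-dimensional space, a hyperplane is the set of points satisfying a single linear equation (this general form is standard; the two-dimensional case is derived below):

$$ \{\, x \in \mathbb{R}^n : w^\top x + b = 0 \,\}, \qquad w \in \mathbb{R}^n,\ b \in \mathbb{R} $$

One linear equation removes one degree of freedom, which is why the hyperplane has dimension n - 1.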

The maximum margin

For this example, there may be multiple hyperplanes that classify the samples correctly. Among them, the hyperplane with the maximum margin (the distance between the two dotted lines) is the optimal solution the SVM looks for, and this optimal solution is determined by the sample points the two dotted lines pass through: the "support vectors". The distance between a support vector and the hyperplane is called the margin, as marked in the figure below.

The formulas

Hyperplane equation

In the end, we want the SVM to pick out, from the many candidate hyperplanes, the one with the maximum margin, so this is an optimization problem.

There are two basic elements to the optimization problem:

  • Objective function: the quantity you want to make as good as possible.
  • Optimization variable: the factors you can change in order to optimize the objective function. In the linear SVM algorithm, the objective function is the "margin" and the optimization variable is the "hyperplane".

Therefore, the equation of the "hyperplane" needs to be derived first. In two-dimensional space the "hyperplane" is just the equation of a line:

$$ y = ax + b $$

Renaming x to x_1 and y to x_2 vectorizes it, and rearranging finally gives:

$$ w^\top x + b = 0, \qquad w = \begin{pmatrix} a \\ -1 \end{pmatrix}, \quad x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} $$

The normal vector w is a column vector, which is why the transpose appears. The vector w is perpendicular to our line, which can be proven from the fact that the slope a of the line is a constant. Here w controls the direction of the line and b controls its position, so the objective function is optimized by varying w and b.
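
A quick worked example (with numbers chosen purely for illustration): take the line y = 2x + 1. Renaming the variables gives

$$ x_2 = 2x_1 + 1 \;\Longrightarrow\; 2x_1 - x_2 + 1 = 0 \;\Longrightarrow\; w = \begin{pmatrix} 2 \\ -1 \end{pmatrix},\ b = 1 $$

The direction vector of this line is (1, 2)^T, and w^T (1, 2)^T = 2·1 + (-1)·2 = 0, confirming that w is indeed perpendicular (normal) to the line.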

Margin formula

The "margin" is the distance between a point in the graph and the hyperplane. The formula is as follows:

$$ d = \frac{\lvert w^\top x + b \rvert}{\lVert w \rVert} $$

where d stands for the distance and ‖w‖ is the 2-norm (modulus) of w, that is, the square root of the sum of squares of all its elements. The goal of modeling is to find the maximum margin, where the full margin width is W = 2d. The larger W is, the better this model separates the two classes, so the problem finally becomes one of maximizing d.
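
A concrete check (illustrative numbers, not taken from the figure): for the hyperplane 3x_1 + 4x_2 - 5 = 0 and the point x = (2, 1)^T,

$$ d = \frac{\lvert 3 \cdot 2 + 4 \cdot 1 - 5 \rvert}{\sqrt{3^2 + 4^2}} = \frac{5}{5} = 1 $$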

The constraint


For the classifier we built above, when we feed it a data point it returns a class label. Here blue is defined as the negative class (y = -1) and red as the positive class (y = +1), and we get a set of formulas. If the hyperplane classifies every sample point in the figure correctly, the following holds:

$$ \begin{cases} w^\top x_i + b > 0, & y_i = +1 \\ w^\top x_i + b < 0, & y_i = -1 \end{cases} $$

The two cases above can be unified as:

$$ y_i\bigl(w^\top x_i + b\bigr) > 0 $$
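
A small sanity check with made-up numbers: take the hyperplane x_1 + x_2 - 3 = 0, a positive sample x = (3, 2)^T with y = +1, and a negative sample x = (1, 1)^T with y = -1. Then

$$ (+1)(3 + 2 - 3) = 2 > 0, \qquad (-1)(1 + 1 - 3) = 1 > 0 $$

so both samples satisfy the unified constraint.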

"s.t." means "subject to", that is, constrained by the conditions that follow.

Let's review the maximum margin idea above. It can be summarized as: take the sample point with the smallest geometric distance to the hyperplane, and maximize that distance. The "smallest" part ensures that samples of both classes are classified correctly, while maximizing that distance yields the "maximum margin" that tunes the classifier. The formula is as follows:

$$ \max_{w,b}\; d \qquad \text{s.t.}\ \frac{y_i(w^\top x_i + b)}{\lVert w \rVert} \ge d,\ i = 1, \dots, m $$

We want the geometric distance between every sample point and the hyperplane to be at least d, so the following formula must hold:

$$ y_i\bigl(w^\top x_i + b\bigr) \ge d\,\lVert w \rVert $$

Here we set d‖w‖ to 1. Think of it this way: whatever d‖w‖ is, dividing both sides of the inequality by it shrinks the coefficients w and b by the same factor, but the hyperplane itself is fixed, because the coefficients of a hyperplane can be scaled by a common factor, which can be compared to the equation of a line. After fixing d‖w‖ = 1, the following formula is obtained:

$$ y_i\bigl(w^\top x_i + b\bigr) \ge 1, \qquad d = \frac{1}{\lVert w \rVert} $$

One more processing step: maximizing 1/‖w‖ and minimizing ½‖w‖² are equivalent; the latter makes taking derivatives of the objective function convenient during optimization and has no influence on the optimal solution:

$$ \min_{w,b}\; \frac{1}{2}\lVert w \rVert^2 \qquad \text{s.t.}\ y_i\bigl(w^\top x_i + b\bigr) \ge 1,\ i = 1, \dots, m $$

The first formula is our objective function, and the second formula is the constraint of this optimization problem. Because ½‖w‖² is a convex function, this is a convex optimization problem.
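
To see the rescaling step with concrete numbers (my own illustration): 2x_1 + 2x_2 + 2 = 0 and x_1 + x_2 + 1 = 0 describe the same hyperplane even though (w, b) differs by a factor of 2, and distances to it are unchanged. For the point (1, 0)^T:

$$ \frac{\lvert 2 \cdot 1 + 2 \cdot 0 + 2 \rvert}{\sqrt{2^2 + 2^2}} = \frac{4}{2\sqrt{2}} = \sqrt{2} = \frac{\lvert 1 + 0 + 1 \rvert}{\sqrt{1^2 + 1^2}} $$

Since scaling (w, b) by a common factor changes nothing geometrically, we are free to fix the functional margin d‖w‖ to 1.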

Solving the optimization problem

Classification of optimization problems

Optimization problems can be divided into two categories, unconstrained and constrained, and constrained problems can be further divided into those with equality constraints and those with inequality constraints.

  • For an unconstrained optimization problem, we can take the derivative of the function, set it to zero, select the optimal value from the candidate values, and verify it. If the function is convex, the result is guaranteed to be the optimal solution. Stochastic gradient descent and batch gradient descent are unconstrained optimization methods.

  • For optimization problems with equality constraints, the Lagrange multiplier method is commonly used to transform them into unconstrained optimization problems. Specifically, the constraints and the objective function are combined into one function, called the Lagrangian function, whose coefficients are the Lagrange multipliers; we then take the derivative of the Lagrangian with respect to each variable, set it to zero, select the optimal value from the candidates, and verify it (a worked example follows after this list).

  • For optimization problems with inequality constraints, the KKT conditions are used to transform them into unconstrained optimization problems. Specifically, a Lagrangian function is constructed, and the necessary conditions for attaining the optimal value under certain circumstances are the KKT conditions.

A necessary condition of A is a conclusion that can be derived from A.
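
Here is the classic textbook example of the equality-constrained case (my own illustration, not from the original derivation): minimize f(x, y) = x² + y² subject to x + y = 1. The Lagrangian function is

$$ L(x, y, \lambda) = x^2 + y^2 + \lambda\,(x + y - 1) $$

Setting the partial derivatives to zero gives 2x + λ = 0, 2y + λ = 0, and x + y = 1, hence x = y = 1/2 and λ = -1, with minimum value f = 1/2.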

The optimization problem we have constructed is clearly one with inequality constraints. We will not say much about the definition of the Lagrangian function itself; the Lagrange multiplier method is introduced below, and through it we arrive at the dual problem and the KKT conditions.

Lagrange multiplier method

The idea of the Lagrange multiplier method is to convert an optimization problem with d variables and k constraints into an unconstrained optimization problem with d + k variables by introducing Lagrange multipliers.

Interested readers can search for detailed introductions in various blogs or in the watermelon book (Zhou Zhihua's Machine Learning); I really regret not listening carefully to this part in my advanced mathematics course. By introducing the Lagrange multipliers α_i, the optimization problem above can be rewritten in the following form:

$$ L(w, b, \alpha) = \frac{1}{2}\lVert w \rVert^2 + \sum_{i=1}^{m} \alpha_i \bigl(1 - y_i(w^\top x_i + b)\bigr) $$

One thing to notice is that α_i ≥ 0. Via the Lagrangian function, the original problem can be transformed into:

$$ \min_{w,b}\; \max_{\alpha}\; L(w, b, \alpha) $$

Some readers may not understand why this minimization of the maximum of the Lagrangian is equivalent to the original problem, so here is the reason. When a constraint is violated, i.e. 1 - y_i(w^\top x_i + b) > 0, the inner maximum can push the objective to positive infinity by letting α_i → ∞, which is meaningless to the outer minimization; when all constraints are satisfied, the inner maximum equals exactly ½‖w‖², so the two problems are equivalent.

The dual problem

Duality can be used to transform the original problem above into a dual problem, as follows:

$$ \max_{\alpha}\; \min_{w,b}\; L(w, b, \alpha) $$

The main operation of this step is to swap min and max, and the two satisfy the property that the original (min-max) >= the dual (max-min): picking the shortest person from the tall group still gives someone at least as tall as the tallest person from the short group. By default this only gives weak duality, but under this objective function and these constraints strong duality (an equality relation) holds.
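
A tiny numeric illustration of weak duality (my own example): suppose L(x, α) takes the values L(1, 1) = 1, L(1, 2) = 4, L(2, 1) = 3, L(2, 2) = 2 over x ∈ {1, 2} and α ∈ {1, 2}. Then

$$ \max_{\alpha} \min_{x} L = \max(1, 2) = 2 \;\le\; \min_{x} \max_{\alpha} L = \min(4, 3) = 3 $$

so the dual value can sit strictly below the primal value; strong duality is the special situation where the two coincide.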

Once we have converted it to the dual problem, we can solve it. Take the partial derivatives of L with respect to w and b and set them equal to 0:

$$ \frac{\partial L}{\partial w} = 0 \;\Longrightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \;\Longrightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0 $$

After these two expressions are substituted back into the constructed Lagrangian function, we obtain:

$$ \min_{w,b} L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j $$

Finally, the optimization problem after our derivation is sorted out as follows:

$$ \max_{\alpha}\; \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \qquad \text{s.t.}\ \alpha_i \ge 0,\ \sum_{i=1}^{m} \alpha_i y_i = 0 $$
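
To make the dual concrete, here is a tiny example solved by hand (two made-up points): x_1 = (1, 1)^T with y_1 = +1 and x_2 = (-1, -1)^T with y_2 = -1. The constraint Σα_i y_i = 0 forces α_1 = α_2 = α, and with x_1^T x_1 = x_2^T x_2 = 2 and x_1^T x_2 = -2 the dual objective becomes 2α - 4α², which is maximized at α = 1/4. Therefore

$$ w = \tfrac{1}{4} x_1 - \tfrac{1}{4} x_2 = \begin{pmatrix} 1/2 \\ 1/2 \end{pmatrix}, \qquad b = y_1 - w^\top x_1 = 1 - 1 = 0 $$

giving the hyperplane w^\top x + b = 0, the line through the origin perpendicular to (1, 1)^T, which sits exactly midway between the two points with margin d = 1/‖w‖ = √2.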

KKT conditions

Suppose there is an optimization problem with inequality constraints:

$$ \min_x f(x) \qquad \text{s.t.}\ g_i(x) \le 0,\ i = 1, \dots, k $$

If we want to use the KKT conditions to handle this optimization problem, we use the Lagrange multiplier method to combine the inequality constraints (and any equality constraints) with the objective function into one formula, as follows:

$$ L(x, \alpha) = f(x) + \sum_{i=1}^{k} \alpha_i\, g_i(x) $$

The KKT conditions state that the optimal value obtained must satisfy the following:

$$ \frac{\partial L}{\partial x} = 0, \qquad \alpha_i \ge 0, \qquad g_i(x) \le 0, \qquad \alpha_i\, g_i(x) = 0,\ i = 1, \dots, k $$

Strong duality between the primal problem and the dual problem is a sufficient and necessary condition for this problem to satisfy the KKT conditions, so the optimization problem in this paper satisfies the following:

$$ \alpha_i \ge 0, \qquad y_i\bigl(w^\top x_i + b\bigr) - 1 \ge 0, \qquad \alpha_i \bigl( y_i(w^\top x_i + b) - 1 \bigr) = 0 $$

From these conditions we can conclude that only when a sample point is a support vector (the point lies on the dotted line) can α_i be nonzero; for any sample point that is not a support vector, α_i must be 0. This is a difference from logistic regression: when logistic regression fits a decision boundary, every sample has an influence, whereas in SVM only the samples near the boundary have an effect.
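
Continuing the two-point example above: both points lie exactly on the margin, so both are support vectors with α = 1/4 > 0. If we added a third (made-up) point x_3 = (2, 2)^T with y_3 = +1, then

$$ y_3\bigl(w^\top x_3 + b\bigr) = \tfrac{1}{2} \cdot 2 + \tfrac{1}{2} \cdot 2 = 2 > 1 $$

so its constraint is slack, and complementary slackness forces α_3 = 0: the point has no influence on the hyperplane.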

Because the solution must satisfy the hyperplane equation, the solution of the original optimization problem is obtained as follows. Taking the derivative of L with respect to w already gave the solution for w:

$$ w^* = \sum_{i=1}^{m} \alpha_i y_i x_i $$

whereas for b, the solution process is as follows. The thing to watch out for here is to pick a support vector x_s (with label y_s); it satisfies y_s(w^{*\top} x_s + b) = 1, and since y_s² = 1, multiplying both sides by y_s cancels it out:

$$ b^* = y_s - w^{*\top} x_s $$

Finally, the optimal hyperplane equation is as follows:

$$ f(x) = \operatorname{sign}\Bigl( \sum_{i=1}^{m} \alpha_i y_i\, x_i^\top x + b^* \Bigr) $$
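
Using the solution of the toy example above (w* = (1/2, 1/2)^T, b* = 0) to classify a new made-up point x = (2, 0)^T:

$$ f(x) = \operatorname{sign}\bigl(\tfrac{1}{2} \cdot 2 + \tfrac{1}{2} \cdot 0 + 0\bigr) = \operatorname{sign}(1) = +1 $$

so the point is assigned to the positive class.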

Conclusion

  • The concepts of the hyperplane and the maximum margin are introduced.
  • The hyperplane equation is derived in two-dimensional space and the margin formula is introduced.
  • The constraint conditions of the optimization problem are derived.
  • The classification of optimization problems and the corresponding solution methods are introduced.
  • The dual problem is introduced via the Lagrange multiplier method.
  • The KKT conditions are explained on the basis of the dual problem.
  • The blogger is still a novice, so feel free to point out any problems in the article. This article contains no code and focuses on formula derivation, because the important prerequisite for handwriting the code is the derivation itself: only by understanding the idea and origin of each step can you program it well. The next article will introduce a simplified version of the sequential minimal optimization (SMO) algorithm. Welcome to follow along, and thank you for reading.

Follow the official account [sugar cat] to get the source code and data for each article. Readers are welcome to exchange ideas with the blogger.