I have put the code, the data set, and the article at Github.com/AAAZC/SVM_b… . The article itself is in the issues, so I suggest you read it there.
Machine Learning in Action: an intuitive understanding of support vector machines
About this article
Machine Learning in Action is, after all, a practical book: light on theory, heavy on walking the reader through how to use each algorithm. Chapter 6 on support vector machines is a good example: the most critical part, actually solving the optimization problem, gets less than two pages. Support vector machines are already difficult to understand, and such brevity makes them even harder. Still, the book is a good introduction, and I believe many students are learning from it, so I use it as the basis for this analysis.
There are many well-written articles about SVM on the Internet; one of the best, I think, is July’s Introduction to Support Vector Machines (Understanding the Three Levels of SVM). Its formula derivations are comprehensive and well reasoned. But I am a poor student of mathematics, so that article is too “hardcore” for me. I believe that to learn an algorithm you first have to understand it, which in plain terms means translating the theory into your own words. So I want to write an article that students at my level of mathematics can follow easily. Of course, I will not skip the derivation of the formulas, because that is the core; I will just interpret it in a more accessible way.
In fact, this is a study note and purely my personal understanding. If you spot problems, corrections are welcome. Thank you!
Which part of the textbook does this passage correspond to?
This part of the textbook is genuinely hard to understand. (If you haven’t read it yet, just follow along below.)
What is an SVM
I saw a very vivid example in an article:
A long time ago, a hero’s lover was captured by a devil. The hero wanted to save her, but the devil first made him play a game: “There is a handful of rice and some stones on the table. I’ll give you a stick; use it to separate the two.”
So the hero placed the stick like this:
Doesn’t that look good? But then the devil deliberately made things difficult for him and put many more things on the table. This time some of them seem to end up on the wrong side:
So the hero repositioned the stick:
Now, no matter how much more the devil puts on the table, the stick serves as a good boundary:
The devil grew anxious. He mixed the rice and stones together like this, then handed the hero a piece of paper and asked him to separate them:
The hero immediately felt something was wrong: this devil was not playing fair. Furious, he slammed the table. Light things like rice grains bounced high, while the heavier stones stayed lower, and in that instant the hero slipped the paper between them, completing the task.
When a single line can separate two classes of objects, as in the devil’s first problem, we call the data linearly separable. In real life, however, very few data sets are linearly separable, so we have to go into a higher dimension and separate them with a sheet of paper instead. That paper (or stick) is called the separating hyperplane. We also want the hyperplane we find to be stable: no matter how the data shifts, or if a few errors appear, the separation should still hold. This is where the support vector machine (SVM) comes from. If we can find the points closest to the hyperplane and work out their distance to it, then the farther apart the two classes are kept, the less error-prone the classifier is. The problem a support vector machine solves is finding the maximum margin.
Support vectors are the points closest to the hyperplane, and the distance from a support vector to the hyperplane is called the margin. We use the concept of a vector because, in higher dimensions, it is easier to work with data that has both direction and magnitude.
Linear classification
Where do the labels 1 and -1 come from
Before we move on to SVM, we need to address where the book’s two class labels, 1 and -1, come from.
This starts with a familiar linear classifier: logistic regression.
Logistic regression is a commonly used binary classification method, explained in detail in Chapter 5 of Machine Learning in Action. The basic idea is that we want a function that takes the value we compute and maps it to class 0 or class 1. The most common such function is the sigmoid, whose formula and graph are as follows:
Looking at the graph, we can see that when $z = 0$, $\sigma(z) = 0.5$; as $z$ grows larger, $\sigma(z)$ approaches 1; and as $z$ grows more negative, $\sigma(z)$ approaches 0. If you zoom out to a large scale of $z$, the graph looks very close to a step function, so we can use it as a classifier: when $\sigma(z) > 0.5$ we predict class 1, otherwise class 0.
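As a minimal sketch of this idea (the function names and the 0.5 threshold here are my own, not the book’s Chapter 5 code), classification with the sigmoid looks like this:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def classify(z):
    """Predict class 1 if sigmoid(z) > 0.5 (equivalently z > 0), else class 0."""
    return 1 if sigmoid(z) > 0.5 else 0

print(sigmoid(0.0))    # 0.5, right on the decision boundary
print(classify(3.2))   # 1
print(classify(-1.7))  # 0
```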
At this point the classifier itself is settled, so we only need to focus on the linear fit, that is, $z = w_0 + w_1x_1 + \dots + w_nx_n$, and the formula above can be rewritten as:
This is what we call logistic regression. Next, we relate it to SVM:
The hyperplane formula in the book, $w^Tx + b$, is easiest to understand in two dimensions, where $w$ holds the coefficients in front of the coordinates and $b$ is the intercept term. For example, take the line $-x + 2 = y$, i.e. $x + y - 2 = 0$: here $b$ is $-2$, $w$ is $[1, 1]$, and $x$ is the point $[x, y]$. That is all the formula means. Why write it as $w^Tx$? Because that is exactly the form the linear fit above takes: a dot product between the weight vector and the feature vector.
Now let’s change the label $y = 0$ to $y = -1$ and replace $z$ with $w^Tx + b$:
It turns out the relationship really is just a change of label from 0 to -1; nothing else changes. I’ll explain below why we don’t simply keep 0 and 1.
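To make the ±1 labels and the hyperplane formula concrete, here is a tiny sketch using the line $x + y - 2 = 0$ from above (the sample points are made up for illustration):

```python
import numpy as np

w = np.array([1.0, 1.0])   # coefficients of the line x + y - 2 = 0
b = -2.0                   # intercept term

points = np.array([[3.0, 2.0],    # a point above the line
                   [0.5, 0.5]])   # a point below the line

for x in points:
    score = np.dot(w, x) + b           # w^T x + b
    label = 1 if score > 0 else -1     # above the line -> +1, below -> -1
    print(x, "score =", score, "label =", label)
```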
Functional margin and geometric margin
Why set the label to 1 and -1
Let’s take a simple example. As shown in the figure below, there are two groups of points on the two-dimensional plane, one drawn in blue and the other in red. Let blue be class 1 and red be class -1. It’s not hard to see that they are linearly separable, so let’s pick an arbitrary separating hyperplane: the points above it are class 1, and the ones below are class -1.
Remember what problem SVM is trying to solve? We want the hyperplane whose nearest points are as far away as possible. For an intuitive analysis, translate the line toward each side: the first points that these two parallel lines meet are the support vectors of this hyperplane.
(To be honest, this is where the formulas we actually need begin. When I first studied this I skimmed this part and then couldn’t follow what came later, because the derivation of SVM relies on quite a few assumptions, so this is a key point.)
In fact, the derivation of this formula is not very difficult, but to make the later steps more convenient, SVM substitutes in something called the functional margin, so let’s first look at what that is:
Define the functional margin $\hat{y}$ as:
Here $y_i$ in the formula denotes the category and $f(x)$ is the hyperplane function. The functional margin can be understood intuitively as the category times the function value, the part shown in red below. In other words, our hyperplane is the black solid line: if a point lies on the line, $f(x)$ is 0; if it is off the line, $f(x)$ is some non-zero number. Multiplying by the category is then equivalent to taking the absolute value, so we don’t have to worry about which class a point belongs to. Since we are only looking for the support vectors, we only care about the distance, not its sign. That is the nice thing about using -1 and 1: it removes the sign without distorting the magnitude of the function value.
But as a measure of distance, the functional margin alone is not enough. Consider this: if you multiply both $w$ and $b$ by 2, giving $2w^Tx + 2b = 0$, the hyperplane itself does not change at all, yet the functional margin doubles, which clearly makes no sense as a distance.
This is where the concept of the geometric margin is introduced:
Suppose there is a point $x$ whose projection onto the hyperplane is $x_0$, $w$ is the normal vector to the hyperplane, and $y$ is the distance from $x$ to the hyperplane:
From basic geometry we get $x = x_0 + y\frac{w}{\|w\|}$.
$\|w\|$ is the L2 norm of $w$, which is just its length, and $\frac{w}{\|w\|}$ is the unit normal vector of the hyperplane (a vector divided by its norm is a unit vector). Since $x_0$ is a point on the hyperplane, substituting it in gives $w^Tx_0 + b = 0$ (I’ll call this formula 1). Now multiply both sides of $x = x_0 + y\frac{w}{\|w\|}$ by $w^T$ and combine with formula 1 to obtain the following. (One property used in this derivation is $w^Tw = \|w\|^2$; the rest is just transposition and not hard to work through.)
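Written out step by step, this is just the algebra the paragraph above describes:

$$
w^Tx = w^Tx_0 + y\,\frac{w^Tw}{\|w\|} = -b + y\,\|w\|
\quad\Longrightarrow\quad
y = \frac{w^Tx + b}{\|w\|}
$$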
But the formula above has one drawback: it can be negative, and a distance cannot be negative. So we multiply it by the category, which is again equivalent to taking an absolute value, and we obtain the geometric margin:
Does this formula look familiar? It is just the functional margin divided by $\|w\|$, which gives the more intuitive geometric distance, and this geometric margin is exactly what we want.
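A small numerical sketch of the scaling issue mentioned above (the point and hyperplane are made up): rescaling $w$ and $b$ changes the functional margin but leaves the geometric margin untouched.

```python
import numpy as np

def functional_margin(w, b, x, y):
    # y * (w^T x + b)
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    # functional margin divided by ||w||
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

w, b = np.array([1.0, 1.0]), -2.0    # the line x + y - 2 = 0 again
x, y = np.array([3.0, 2.0]), 1       # a class-1 point above the line

print(functional_margin(w, b, x, y))          # 3.0
print(functional_margin(2 * w, 2 * b, x, y))  # 6.0 -> doubles with (w, b)
print(geometric_margin(w, b, x, y))           # ~2.12
print(geometric_margin(2 * w, 2 * b, x, y))   # ~2.12 -> unchanged
```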
Definition of the Maximum Margin Classifier
This is the part the book compresses the most. It’s actually fairly easy to understand given the groundwork we’ve laid, but the book covers its most important piece in just a few sentences, so let’s use what we have to explain how to find the maximum margin.
Again, what does SVM do? It looks for the hyperplane whose support vectors are as far from it as possible. Whatever we derive, we always come back to this in the end! In the last section we introduced the geometric margin for measuring this distance, so let’s take a closer look at the formula:
It turns out the margin is hard to compute directly, because we know neither $w$ nor $b$: both the numerator and the denominator are unknown. So let’s fix the numerator to a constant. Many books set the functional margin equal to 1, simply because 1 is convenient. The geometric margin then becomes $\tilde{y} = \frac{1}{\|w\|}$, and the problem gets much easier: to make the geometric margin larger, we make $\|w\|$ smaller. Applying this assumption to the figure we drew above, it looks like this:
(Actually, this is the picture that bothered me for a long time, because at first I kept thinking: if the margin is fixed at 1, what do we do with the points that fall inside it? Don’t worry about that for now; it involves something called a slack variable, which we’ll get to later.)
Now suppose we translate the original hyperplane outward to the dotted lines on either side; each dotted line sits at distance $\frac{1}{\|w\|}$ from the hyperplane, so the margin on each side has width $\frac{1}{\|w\|}$. This not only simplifies the problem: as shown in the figure above, the points lying on $w^Tx + b = 1$ and $w^Tx + b = -1$ are the support vectors, and the other points are irrelevant to us. So we impose the constraint $y_i(w^Tx_i + b) \ge 1$, which says every point lies on the boundary lines or beyond them (we allow “or beyond” because the initial hyperplane may not yet touch the true support vectors, so there could still be points inside a tentative margin). The problem we are solving then looks like the following, where s.t. stands for the constraints:
But this formulation still has an awkward spot: that $\frac{1}{\|w\|}$ is unpleasant to look at. Imagine being asked for the extremum of an expression; our first instinct is to take derivatives, and differentiating a reciprocal norm is uncomfortable. So we flip it: maximizing $\frac{1}{\|w\|}$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$ (try differentiating it yourself and you’ll find the derivative is simply $w$). The square and the factor $\frac{1}{2}$ are introduced precisely so that the derivative comes out clean.
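Putting the pieces together, the optimization problem described above is the standard hard-margin primal (written out here since the book’s formula image is not reproduced):

$$
\min_{w,\,b}\ \frac{1}{2}\|w\|^2
\qquad \text{s.t.}\quad y_i\left(w^Tx_i + b\right) \ge 1,\quad i = 1,\dots,n
$$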
What are Lagrange multipliers and duality
If we only had to minimize $\frac{1}{2}\|w\|^2$ on its own, it would honestly be very easy: set the derivative to 0 and solve. But that is not realistic here, because the problem above comes with constraints, and the solution has to satisfy them. To handle this kind of constrained optimization, we introduce the Lagrange multiplier method. I’ll quote someone else’s explanation directly (because it is well written):
Let’s first understand the Lagrange multiplier. Why do we need it? Remember: wherever you see a Lagrange multiplier, there is a constrained optimization problem behind it. Equality-constrained optimization problems are comparatively easy to handle, for example:
This is an optimization problem with an equality constraint: it has an objective and a constraint. Let’s first think about how we would solve it with no constraint at all: we could simply set the derivative with respect to each $x$ to 0 and solve. But the $x$ we get that way does not satisfy the constraint, so we have a problem. One point worth pausing on: why are we allowed to say “set the derivative to 0”? In theory this works for most problems, but not all. For “derivative equals zero” to give the optimum, $f$ must be convex. What is convexity? Something like the figure below (“convex” is used loosely here; the bowl can open upward or downward):
For this convex optimization, it is more accurate to say:
A function satisfying $\frac{f(x_1)+f(x_2)}{2} > f\!\left(\frac{x_1+x_2}{2}\right)$ or $\frac{f(x_1)+f(x_2)}{2} < f\!\left(\frac{x_1+x_2}{2}\right)$ for all $x_1 \ne x_2$ is a convex (respectively concave) function.
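A tiny numeric check of this midpoint condition (the functions here are my own examples): $x^2$ satisfies it for every pair of points, while something wavy like $\sin$ does not.

```python
import numpy as np

def midpoint_gap(f, x1, x2):
    """(f(x1) + f(x2)) / 2 - f((x1 + x2) / 2); non-negative for every pair when f is convex."""
    return (f(x1) + f(x2)) / 2 - f((x1 + x2) / 2)

print(midpoint_gap(lambda x: x**2, -1.0, 3.0))  # 4.0 > 0, consistent with convexity
print(midpoint_gap(np.sin, 0.5, 2.8))           # negative for this pair...
print(midpoint_gap(np.sin, 3.5, 5.8))           # ...positive for this one: sin is neither convex nor concave
```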
That means that a function that is not convex should look like this:
For a function that is not convex, different regions produce different stationary points, so simply setting the derivative to zero no longer guarantees that we have found the best solution.
Now back to the constrained problem. Since the constraint stops us from taking the derivative directly, why not just get rid of the constraint? How? That is exactly what the Lagrange method does. Since this is an equality constraint, we multiply the constraint by a coefficient and add it into the objective function. This way we account for the original objective and the constraint at the same time. For the function above, adding the constraint gives:
Here $\alpha_1$ and $\alpha_2$ are the Lagrange multipliers. Taking the derivative with respect to each variable and setting it to 0, we can solve for $\alpha_1$ and $\alpha_2$:
We find $\alpha_1 = -0.39$ and $\alpha_2 = -1.63$, and substituting these $\alpha$ values back into the original formula gives us $x$. That is a quick run through solving an equality-constrained problem. So what do we do with inequality constraints? That is where the KKT conditions come in.
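Before moving on, here is the same equality-constrained mechanics on a toy problem of my own (the quoted source’s exact problem and numbers are not reproduced here): minimize $x_1^2 + x_2^2$ subject to $x_1 + x_2 = 1$.

```python
from sympy import symbols, diff, solve

# Made-up toy problem: minimize x1^2 + x2^2  subject to  x1 + x2 - 1 = 0
x1, x2, a = symbols("x1 x2 a")
f = x1**2 + x2**2
g = x1 + x2 - 1           # equality constraint
L = f + a * g             # Lagrangian: objective plus multiplier times constraint

# Set every partial derivative of L to zero and solve the system.
sol = solve([diff(L, v) for v in (x1, x2, a)], [x1, x2, a], dict=True)
print(sol)  # [{a: -1, x1: 1/2, x2: 1/2}]
```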
(Strictly speaking, you should first write down the hyperplane’s constraints and prove that they satisfy the KKT conditions, but I find it much easier to look at KKT first and then work backwards, because once you understand it you can verify it in reverse yourself.)
KKT conditions
So let’s keep going. There are three kinds of constraints: equality constraints, greater-than constraints, and less-than constraints. For convenience we convert the greater-than constraints, which leaves us two kinds. Here is another example to illustrate the KKT conditions:
As stated above, we convert the greater-than problem into a less-than problem (changing the sign is just for uniformity: it is easier to handle constraints that all point the same way), and then add the $\alpha$ from the Lagrange multiplier method to get:
So what exactly are the KKT conditions? After the transformation above, the problem becomes:
Here $g$ denotes the inequality constraints and $h$ the equality constraints. The KKT conditions then say that the optimal solution of the function satisfies the following:
The first two are easy to understand; they are the same as in the equality-constrained case. For the third I won’t draw a picture; I’ll just use the formula from the book:
The equation of SVM is:
After KKT:
Now for the third condition, we can understand it this way:
All the points that satisfy the constraint $y_i(w^Tx_i + b) \ge 1$ with room to spare (the points in the blue region) lie safely inside the feasible region, so for them the constraint is inactive; their $\alpha$ is 0 and the optimum can be found as if they weren’t there. Conversely, the points that sit right on the boundary of the constraint are the ones we actually want, so their $\alpha$ is non-zero; for them we look for where the constraint is tightest, which corresponds to maximizing over $\alpha$.
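For reference, specialized to the SVM constraint these three conditions take the standard form (written out since the image is not reproduced here):

$$
\alpha_i \ge 0,\qquad
y_i(w^Tx_i + b) - 1 \ge 0,\qquad
\alpha_i\left[\,y_i(w^Tx_i + b) - 1\,\right] = 0
$$

The third one is exactly the statement above: either a point sits on the margin boundary, or its $\alpha_i$ is 0.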
Therefore, according to the above statement, formula 1 can be written:
Let’s look at the formula we got above:
Reading from left to right, it is easy to spot a problem: to solve this we would have to work from the inside out. That is, we would first have to find the solution of the part involving the inequality constraints (the $\alpha$’s), and only then compute $w$ and $b$, which is hard. A better idea is to flip the order: first solve the unconstrained inner part, then bring the constraints back in and solve again. This is duality. In other words, our equation above is equivalent to:
Let me write it in more detail
Now this is much easier to handle, and the transformation has real advantages. Once $w$ and $b$ are expressed through $\alpha$, the only unknown left to solve for is $\alpha$: it reduces the number of things we solve for to one, and $\alpha$ carries all the information. Remember we said above that points strictly inside the constraint have $\alpha = 0$? That makes things very convenient: the support vectors are exactly the points with non-zero $\alpha$, and $\alpha$ is also what we solve for. How about that? Isn’t that clever?
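Concretely, the swap described above is the standard one (the Lagrangian is written out here since the images are not reproduced):

$$
L(w,b,\alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n}\alpha_i\left[\,y_i(w^Tx_i+b) - 1\,\right]
$$

$$
\min_{w,b}\ \max_{\alpha_i \ge 0}\ L(w,b,\alpha)
\quad\Longrightarrow\quad
\max_{\alpha_i \ge 0}\ \min_{w,b}\ L(w,b,\alpha)
$$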
A three-step derivation for solving dual problems
The origin of the SMO algorithm
We’re getting close to the answer now, so let’s keep going along the original line of thought. We have now brought the whole problem into a form where the optimum can be found by setting derivatives to zero, so let’s try:
First, we fix $\alpha$ and take the partial derivatives of $L$ with respect to $w$ and $b$:
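Setting those two partial derivatives to zero gives the standard results, which are also what the code below relies on:

$$
\frac{\partial L}{\partial w} = 0 \;\Longrightarrow\; w = \sum_{i=1}^{n}\alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Longrightarrow\; \sum_{i=1}^{n}\alpha_i y_i = 0
$$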
In fact, this is exactly what this line from the book’s code computes:
fXi = float(np.multiply(alphas, labelMat).T * (dataMatrix * dataMatrix[i, :].T)) + b  # f(x_i) = sum_j alpha_j * y_j * <x_j, x_i> + b
So we substitute the results above back into the original formula (typing out the formulas here would be too cumbersome, so I’ll just use July’s derivation):
The final result is:
For this part of the derivation, it’s enough to remember the result: the derivation itself is genuinely tedious and not strictly necessary. Just remember the formula below.
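For completeness, the result being referred to is the standard dual problem (written out here since the image is not reproduced):

$$
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i \;-\; \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\, y_i y_j\, \langle x_i, x_j\rangle
\qquad \text{s.t.}\quad \alpha_i \ge 0,\quad \sum_{i=1}^{n}\alpha_i y_i = 0
$$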
So what we’ve done here is introduce the indices $i$ and $j$ into the formula, which essentially says: take any two vectors at a time and work with that pair, using it to push toward the maximum margin.
What is a slack variable
Remember the problem we left hanging? We fixed the functional margin at 1 purely to simplify the calculation, but then noticed that doing so can leave some points inside the margin. What do we do about them? In the example below, the data are in fact linearly separable. After fixing the functional margin at 1 and determining the margin, the points inside it do not satisfy the constraint, so we keep moving the two dotted lines inward; the band gets narrower and narrower, and when no points remain inside it, that final band is what we want.
In other words, as long as the data are linearly separable, a hyperplane that separates them perfectly must exist. But such data are rare in real life, so what do we do with data like the following?
It was originally a perfectly good data set, but it happens to contain an outlier (the large red point). Our margin could have been wide, but because of this single point it becomes narrow or even disappears. This is where we need to introduce the concept of a slack variable.
The slack variable, as the name implies, relaxes the margin requirement a little:
Comparing the figures above and below, I have added two red dotted lines to the graph; the slack variable is the amount of give they represent.
Setting this up means that, for data falling between the slack boundary and the hyperplane, we no longer care which class they are; we simply give them up. After all, looking at the data above, those oddly placed points are only a handful, and our sample is large, so losing a few is not a problem. Combining this with everything derived above, we can easily write down a new constraint:
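The standard soft-margin form of this, consistent with the $0 < \alpha < C$ range discussed just below, introduces a slack $\xi_i \ge 0$ for each point and a penalty weight $C$:

$$
\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i
\qquad \text{s.t.}\quad y_i(w^Tx_i+b) \ge 1 - \xi_i,\quad \xi_i \ge 0
$$

In the dual, the only change is that the multipliers become bounded: $0 \le \alpha_i \le C$.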
The implication of this relationship is that, for $\alpha$, we only need to work with the values strictly between 0 and $C$; the other points are handled by clipping them to the bounds according to the constraint above.
Now let’s tie this back to SMO, which picks two vectors at a time and checks whether they can be optimized. What counts as “can be optimized”? It means the two vectors you picked are actually relevant to the problem you’re solving and worth exploring further. In other words, we want both chosen $\alpha$’s to lie inside $(0, C)$. What if a chosen one is not in that range? Then we clamp it to the boundary: as in the figure above, for a point outside the black dotted line (where the functional margin is already at least 1) we force $\alpha = 0$ (as explained earlier), and similarly for a point outside the red dashed line we set $\alpha = C$. That way, even if I happen to draw a vector I don’t need, it won’t affect the margin we are assuming.
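To make the clipping concrete, here is a hedged sketch of the core SMO step for a single pair $(\alpha_i, \alpha_j)$, in the spirit of the book’s simplified SMO (the function and variable names are mine; a full implementation also updates $b$ and loops over many pairs):

```python
import numpy as np

def smo_pair_step(alphas, X, y, b, C, i, j):
    """One simplified SMO-style update of the pair (alphas[i], alphas[j]).

    X: (n, d) array of samples, y: (n,) array of +1/-1 labels.
    Returns updated (alphas, b); a sketch, not a complete SMO implementation.
    """
    K = X @ X.T                              # linear kernel matrix
    f = (alphas * y) @ K + b                 # f(x_k) = sum_m alpha_m y_m <x_m, x_k> + b
    Ei, Ej = f[i] - y[i], f[j] - y[j]        # prediction errors for the pair

    # Bounds L, H keep 0 <= alpha <= C while holding sum(alpha * y) fixed.
    if y[i] != y[j]:
        L, H = max(0.0, alphas[j] - alphas[i]), min(C, C + alphas[j] - alphas[i])
    else:
        L, H = max(0.0, alphas[i] + alphas[j] - C), min(C, alphas[i] + alphas[j])

    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]  # curvature along the constraint line
    if L == H or eta <= 0:
        return alphas, b                     # this pair cannot be usefully optimized

    alphas = alphas.copy()
    ai_old, aj_old = alphas[i], alphas[j]
    alphas[j] = np.clip(aj_old + y[j] * (Ei - Ej) / eta, L, H)  # unconstrained optimum, then clip to [L, H]
    alphas[i] = ai_old + y[i] * y[j] * (aj_old - alphas[j])     # compensate so sum(alpha * y) is unchanged
    return alphas, b
```

In the book’s smoSimple the corresponding quantities appear under names like fXi, Ei, L, H and eta, with the same clipping to the [0, C] box.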
Now that I’ve covered the basics, the kernel part is to be continued.
References
Machine Learning in Action, by Peter Harrington
Mathematical Foundations of Artificial Intelligence, by Tang Yudi
Introduction to Support Vector Machines (Understanding the three levels of SVM)
Blog.csdn.net/v_JULY_v/ar…
Deep parsing of python version of the SVM source series (3) – to calculate the predicted categories of samples
Blog.csdn.net/bbbeoy/arti…
Decrypting SVM series (I) : On Lagrange multiplier methods and KKT conditions
Blog.csdn.net/on2way/arti…
minL(w) becomes min max L(α, w)
Blog.csdn.net/appleyuchi/…
Convex optimization problems, convex quadratic programming problems QP, convex functions
Blog.csdn.net/promisejia/…
The transformation and cause of SVM from original problem to dual problem
Blog.csdn.net/qq_44987376…
Why does SVM transform the original problem into a dual problem?
Blog.csdn.net/qq_40598006…
Mathematical derivation of support vector machine Part1
Blog.csdn.net/qq_23869697…
Understanding the relaxation variables of support vector machines
Blog.csdn.net/ustbbsy/art…
SVM relaxation variable
Blog.csdn.net/wusecaiyun/…
SVM maximum interval hyperplane study notes and thinking about setting function interval to 1
Blog.csdn.net/weixin_4402…