
The SVM algorithm was officially published in 1995, and it performs extremely well on classification tasks with small and medium-sized data sets. It also comes with a complete theoretical foundation, and through the late 1990s and the first decade of the 2000s it thoroughly outperformed neural networks. In his 2003 Machine Learning open course, Andrew Ng spent two full lectures on SVM while covering neural networks in about 20 minutes. It was this algorithm that kept neural networks in the shadows for roughly 15 years, until deep learning came along. Even so, the SVM algorithm is still widely used today.


SVM can be roughly divided into three levels of understanding:

(1) A linear classifier for linearly separable data. This is the most original SVM, and its core idea is margin maximization.
(2) A linear classifier for data that is not strictly linearly separable, which introduces the concept of the soft margin.
(3) A nonlinear classifier for linearly inseparable data, which combines the SVM with a kernel function.

Only level (1) is covered below.


What is the maximum classification margin

The inspiration for the SVM maximum margin comes from a very intuitive observation: if there are two classes of data and each data point has two features, we can draw the data on a two-dimensional plane. We then want to find a decision surface (decision boundary) that separates the two classes. As shown below:

Theoretically there are infinitely many choices for this decision boundary, like the four black lines in the picture, all of which classify the data correctly. But which one is best? The SVM algorithm considers the best choice to be the boundary that maximizes the distance to the points (positive and negative samples) closest to it on either side:

The red line in the figure above is the quantity to be optimized. It represents the distance from the data to the decision boundary, and maximizing it is what the maximum classification margin refers to. At the same time, if some of the data points farther from the boundary were removed, the decision boundary would not be affected; only the three data points in the red box determine the final decision boundary, so these three points are called support vectors.


Linear classifier

How does the support vector machine algorithm achieve the maximum classification margin? We can start to understand it as a linear classifier: without a kernel function, the SVM is itself a linear classifier. Assume that the vector perpendicular to the decision boundary (the normal vector of the decision surface) is V:

The black line is the hypothetical decision boundary, X1 and X2 are two points on either side of it, and Xt1 and Xt2 are the projections of these points onto the normal vector V. It is intuitively clear that the distance from the origin to Xt1 is smaller than the distance from the origin to Xt2. This extends to the general case: as long as the data points lie on opposite sides of the decision boundary, the projections of the points on the left side onto the normal vector are always shorter than the projections of the points on the right side. This is the basis on which the support vector machine makes classification predictions. So how do we compute the length of a point's projection onto this line? Let us verify this:
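As a quick numeric illustration (the vectors below are a toy example of my own, not taken from the article), the signed projection onto the normal vector really does separate the two sides:

    import numpy as np

    # Toy example: decision boundary x + y = 0, with normal vector V = (1, 1).
    V = np.array([1.0, 1.0])
    x1 = np.array([-2.0, 1.0])   # a point on the negative side of the boundary
    x2 = np.array([2.0, 1.5])    # a point on the positive side of the boundary

    def signed_projection(x):
        """Signed length of the projection of x onto V."""
        return V.dot(x) / np.linalg.norm(V)

    print(signed_projection(x1))  # about -0.71
    print(signed_projection(x2))  # about  2.47, larger: the two sides are told apart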

As shown in the figure above, we want the projection distance D of vector B onto vector A. The inner product of the two vectors can be expressed as:
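$$A \cdot B = \|A\|\,\|B\|\cos\theta$$

where θ is the angle between A and B.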

Distance D can be expressed as:
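$$D = \|B\|\cos\theta = \frac{A \cdot B}{\|A\|}$$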

Once we reach the formula above, the problem becomes very clear. The normal vector V is in fact the coefficient vector of the decision boundary (this is standard analytic geometry), so you have almost certainly seen a very similar formula: the linear equation of a hyperplane in sample space:
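$$w^{T}x + b = 0$$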

This is exactly what a linear classifier looks like! The logistic model is the same thing without the sigmoid function, and an individual neuron in a neural network looks like this too, before its activation function is applied.
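As a small illustration of this point (the parameter values below are made up for the example, not taken from the article), the same linear score w^T x + b appears in all three models:

    import numpy as np

    # Toy parameters and input (illustrative values only).
    w = np.array([2.0, -1.0])
    b = 0.5
    x = np.array([1.0, 3.0])

    score = w.dot(x) + b                             # the hyperplane expression w^T x + b

    svm_output = np.sign(score)                      # linear SVM: predict by the sign of the score
    logistic_output = 1.0 / (1.0 + np.exp(-score))   # logistic regression: sigmoid of the same score
    neuron_pre_activation = score                    # a single neuron before its activation function

    print(score, svm_output, logistic_output, neuron_pre_activation)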


How to achieve the maximum classification margin

As can be seen from the above, the support vector machine (without a kernel function) is at this point a linear classifier, and its excellent performance comes from imposing the maximum margin on top of that linear classifier. In essence, then, the SVM only needs to train the parameters w and b, and the key question is how the SVM expresses the idea of the maximum margin in its optimization. For all the training data, the SVM requires:
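$$w^{T}x_i + b \geq +1 \quad \text{when } y_i = +1$$
$$w^{T}x_i + b \leq -1 \quad \text{when } y_i = -1$$

Here x_i is the i-th training sample and y_i ∈ {+1, -1} is its label (labelling the two classes as ±1 is the usual SVM convention).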

Here the plus or minus 1 is what expresses the maximum margin. The value ±1 is chosen purely for convenience of calculation: whatever the actual margin is, w and b can always be rescaled by a common factor so that the values on the margin boundaries become ±1. For example, if the closest points satisfy w^T x + b = ±c for some c > 0, dividing both w and b by c puts the margin boundaries at ±1 without changing the decision boundary at all. The formula above is the maximum-margin assumption of the SVM. This is illustrated in the diagram below:

In this figure, the distance between the two lines on either side of the decision boundary (the maximum margin) is:
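$$d = \frac{2}{\|w\|}$$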

The support vectors in the data are what determine this maximum margin. Assume the two support vectors x1 and x2 lie on the negative and positive margin boundaries respectively; the maximum margin is then the projection of x2 - x1 onto the normal vector:
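$$d = \frac{w^{T}(x_2 - x_1)}{\|w\|} = \frac{(w^{T}x_2 + b) - (w^{T}x_1 + b)}{\|w\|} = \frac{(+1) - (-1)}{\|w\|} = \frac{2}{\|w\|}$$

using the fact that the two support vectors satisfy w^T x_2 + b = +1 and w^T x_1 + b = -1.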

This is the optimization objective of the SVM: it wants to find max(d). You may notice that b does not appear in this objective, only w, so can b be anything at all? Obviously not. The optimization has a constraint, because the derivation assumed that every sample lies on the correct side of its margin boundary, so the constraint can be written as:
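$$y_i\,(w^{T}x_i + b) \geq 1, \quad i = 1, \dots, m$$

which is simply the two per-class conditions above folded into a single inequality.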

So the ultimate goal is:
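$$\max_{w,\,b} \ \frac{2}{\|w\|} \quad \text{s.t.} \quad y_i\,(w^{T}x_i + b) \geq 1, \quad i = 1, \dots, m$$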

Or as follows:
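$$\min_{w,\,b} \ \frac{1}{2}\|w\|^{2} \quad \text{s.t.} \quad y_i\,(w^{T}x_i + b) \geq 1, \quad i = 1, \dots, m$$

Maximizing 2/||w|| is the same as minimizing ||w||, and the squared form with the factor 1/2 only makes the later derivation cleaner.

As a rough sanity check of this formulation (this sketch is my own illustration, not part of the original article; it assumes scikit-learn is available and uses a very large C so that the soft-margin solver behaves almost like the hard-margin problem above):

    import numpy as np
    from sklearn.svm import SVC

    # A tiny linearly separable toy data set (illustrative values only).
    X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
                  [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
    y = np.array([1, 1, 1, -1, -1, -1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates the hard margin
    w, b = clf.coef_[0], clf.intercept_[0]

    print("w =", w, "b =", b)
    print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
    print("support vectors:")
    print(clf.support_vectors_)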


Next: understanding the dual problem in SVM.