Introduction
-
A support vector machine (SVM) consists of two main ingredients: the hinge loss and the kernel method.
Loss function
-
Suppose the training data are given in the following format:
-
Here x is the feature vector and y is the label, which takes one of two values, +1 or -1. The model function is:
-
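The figures are not reproduced here, but as a sketch (with notation assumed), the training set and the classification rule can be written as:

$$\{(x^1, y^1), (x^2, y^2), \dots, (x^N, y^N)\}, \qquad y^n \in \{+1, -1\}$$

$$g(x) = \begin{cases} +1, & f(x) > 0 \\ -1, & f(x) < 0 \end{cases}$$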
Therefore, the loss function used for classification is:
-
This loss is 1 whenever the model's output disagrees with the label and 0 when they agree. Its drawback is that it is not differentiable, so it cannot be minimized by gradient descent. We therefore adopt an approximate, differentiable function as the surrogate loss; the candidates are discussed in the next section.
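Written out (with notation assumed, since the original formula images are not reproduced), the ideal loss simply counts the misclassified examples:

$$L(f) = \sum_n \delta\big(g(x^n) \neq y^n\big)$$

where δ(·) is 1 when its argument is true and 0 otherwise; this is the non-differentiable step function that the surrogates below approximate.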
Discussion of various loss functions
-
Below, we will discuss the characteristics of various loss functions, as shown in the figure below:
-
In the figure above, the horizontal axis is yf(x). If the sign of the prediction f(x) matches the sign of the label y, the loss is 0; if the signs disagree, the loss is 1. The black curve is this ideal loss, but it is clearly not differentiable, so we replace it with an approximate loss function. There are several candidates for this approximation, each of which is discussed below:
-
The red curve is a quadratic curve (square loss), and its loss function is expressed as:
-
We can see that when the label is +1, the loss is minimized when the prediction f(x) equals +1, and when the label is -1, the loss is minimized when f(x) equals -1. In both cases the minimum is attained at yf(x)=1. However, this loss behaves unreasonably beyond that point: as yf(x) keeps growing past 1, that is, as the example is classified ever more confidently, the loss increases again, which clearly makes no sense.
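Consistent with this description (minimum exactly at yf(x) = 1), the square loss presumably has the standard form:

$$l\big(f(x^n), y^n\big) = \big(y^n f(x^n) - 1\big)^2$$

For y^n = +1 this is (f(x^n) - 1)^2 and for y^n = -1 it is (f(x^n) + 1)^2, which matches the behaviour described above, including the unreasonable growth once yf(x) exceeds 1.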
-
Next, we consider using Sigmoid + Square Loss as the loss function, whose function curve is the blue one, and the specific loss function is as follows:
Its curve is plotted in the figure above. This choice does not work as well as the sigmoid + cross entropy loss, for the following reasons:
-
As can be seen from the expression, when yf(x) tends to positive infinity the loss tends to ln 1 = 0, and when yf(x) tends to negative infinity the loss tends to infinity, which gives the curve shown in the figure above. This is a reasonable curve: as yf(x) increases, the loss decreases. Comparing it with sigmoid + square loss, the sigmoid + cross entropy loss still has a large gradient when the argument (the value on the horizontal axis) is very negative, whereas the gradient of sigmoid + square loss is almost zero there. This means that with the former, effort during gradient descent is rewarded even for badly misclassified points, while with the latter it is not. Note also that the cross entropy is divided by ln 2; this rescaling makes the curve an upper bound of the ideal loss function.
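For reference, the two losses being compared presumably take the standard forms (assumed here, since the figure is not reproduced):

$$l_{\text{sigmoid+square}} = \big(\sigma(y^n f(x^n)) - 1\big)^2, \qquad l_{\text{sigmoid+cross entropy}} = \frac{1}{\ln 2}\,\ln\big(1 + \exp(-y^n f(x^n))\big)$$

where σ is the sigmoid function. The second expression indeed tends to 0 as yf(x) → +∞ and to infinity as yf(x) → −∞, and dividing by ln 2 makes its value exactly 1 at yf(x) = 0, so it upper-bounds the ideal loss.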
-
Finally, we consider the hinge loss function; its expression and curve are shown in the figure below:
From this loss function we can see that a positive example incurs zero loss once f(x) > 1, and a negative example incurs zero loss once f(x) < −1. There is no benefit in pushing f(x) further, because the loss stays the same. Looking at the curve, once yf(x) is greater than 1 the example is "good enough". In the interval 0 < yf(x) < 1, however, the example is already classified correctly but is still penalized: the loss keeps encouraging yf(x) to move to the right, and this interval is called the margin. The constant 1 in the loss is what makes it a tight upper bound of the ideal loss function.
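The hinge loss itself has the standard form (assumed consistent with the figure):

$$l\big(f(x^n), y^n\big) = \max\big(0,\; 1 - y^n f(x^n)\big)$$

It is exactly zero whenever y^n f(x^n) ≥ 1, which is the f(x) > 1 (positive examples) and f(x) < −1 (negative examples) condition described above.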
-
If we compare the cross entropy loss with the hinge loss, the biggest difference is their attitude toward examples that are already classified correctly. If the black dot in the figure is moved from position 1 to position 2, the cross entropy loss continues to decrease; in other words, it still wants to do better even after the result is already good. The hinge loss, by contrast, is satisfied once the example is beyond the margin: good enough is good enough. In practice the hinge loss is only slightly better than the cross entropy loss, not by a large amount, but it is more robust, and its advantage tends to show up more clearly in multi-class classification.
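As a small numeric illustration of this comparison (using the standard loss formulas assumed above, not read off the original figures), the snippet below evaluates each surrogate at a few values of yf(x):

```python
import numpy as np

def ideal_loss(z):          # z = y * f(x); 1 if misclassified, else 0
    return (z < 0).astype(float)

def square_loss(z):         # (y f(x) - 1)^2
    return (z - 1.0) ** 2

def sigmoid_square_loss(z): # (sigmoid(y f(x)) - 1)^2
    return (1.0 / (1.0 + np.exp(-z)) - 1.0) ** 2

def cross_entropy_loss(z):  # ln(1 + exp(-y f(x))) / ln 2
    return np.log1p(np.exp(-z)) / np.log(2.0)

def hinge_loss(z):          # max(0, 1 - y f(x))
    return np.maximum(0.0, 1.0 - z)

z = np.array([-3.0, -1.0, 0.0, 0.5, 1.0, 3.0])
for name, fn in [("ideal", ideal_loss), ("square", square_loss),
                 ("sigmoid+square", sigmoid_square_loss),
                 ("sigmoid+cross entropy", cross_entropy_loss),
                 ("hinge", hinge_loss)]:
    print(f"{name:>22}: {np.round(fn(z), 3)}")
```

It shows the hinge loss hitting exactly zero once yf(x) ≥ 1, the cross entropy still decreasing beyond that point, and the square loss growing again for yf(x) > 1.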
Linear SVM classifier
-
The linear SVM procedure consists of three main steps, as shown in the figure below:
-
The second step is to construct the loss function, which is the hinge loss plus an L2 regularization term. Since the hinge loss is convex and the L2 regularization term is convex, their sum is also convex.
-
In the third step, the parameters are updated by gradient descent. One might ask: this loss is piecewise linear, so can gradient descent still be used? Yes. It is the same situation as ReLU, which is also piecewise linear and is trained with gradient descent without any trouble. A minimal sketch is given below.
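The post gives no code, so the following is only a minimal sketch of step 3 under the assumptions above: a linear model f(x) = w·x + b trained by (sub)gradient descent on the hinge loss plus an L2 penalty (all names and hyperparameters are illustrative):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize sum_n max(0, 1 - y_n * (w.x_n + b)) + lam * ||w||^2 by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)              # y_n * f(x_n)
        active = margins < 1                   # only points inside the margin contribute a gradient
        grad_w = -(y[active, None] * X[active]).sum(axis=0) + 2 * lam * w
        grad_b = -y[active].sum()
        w -= lr / n * grad_w
        b -= lr / n * grad_b
    return w, b

# Tiny toy problem: two Gaussian blobs with labels +1 / -1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
w, b = train_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```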
-
Here we can see that the difference between logistic regression and linear SVM lies simply in the choice of loss function. We can also see from the first step that SVMs and neural networks are connected, which is explored in the 2013 ICML paper "Deep Learning Using Linear Support Vector Machines".
SVM training using gradient descent
-
First, we take the hinge loss term out of the loss function and ask whether the following two formulations are equivalent:
Considered on their own, the two are not equivalent, but they become equivalent once we minimize the following loss function:
Here ϵ replaces the original hinge loss term; ϵ must satisfy the two inequalities in the red box above, so it acts as a slack variable, and the resulting problem can be solved by quadratic programming.
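In symbols, the reformulation presumably reads as follows (a sketch following the standard soft-margin derivation):

$$\min_{w}\; \lambda\lVert w\rVert^2 + \sum_n \max\big(0,\, 1 - y^n f(x^n)\big) \;\;\Longleftrightarrow\;\; \min_{w,\,\epsilon}\; \lambda\lVert w\rVert^2 + \sum_n \epsilon^n \quad \text{s.t.}\;\; \epsilon^n \geq 0,\;\; \epsilon^n \geq 1 - y^n f(x^n)$$

The constraints alone do not force ϵ^n to equal the hinge value, but minimization pushes each ϵ^n down onto the larger of its two lower bounds, which is exactly max(0, 1 − y^n f(x^n)); this constrained form is the standard quadratic program.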
Kernel method
Dual representation
-
The weight vector of the final classification function is actually a linear combination of the input data points, as shown in the figure below:
-
The origin of this result can be seen from the gradient-descent update described earlier. As shown in the figure above, we write the weight update in vector form, where the derivative of the hinge loss is denoted c(w); its expression is also given in the figure. If the weights are initialized to 0, then the parameter w is always a linear combination of the input data. Moreover, because the hinge loss has a derivative of zero over a large region, many of the coefficients α are zero, so w is a sparse combination of the inputs; the data points whose coefficients are nonzero are called support vectors. This makes the model more robust: even if there are outliers or unreasonable points, as long as they are not chosen as support vectors they will not greatly affect the classification result.
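As a sketch of this argument (notation assumed), each gradient step for a linear model f(x) = w·x has the form

$$w \leftarrow w - \eta \sum_n c^n(w)\, x^n, \qquad c^n(w) = \frac{\partial\, l\big(f(x^n), y^n\big)}{\partial f(x^n)},$$

so if w starts at 0 it always remains a linear combination of the data,

$$w = \sum_n \alpha_n x^n,$$

where each α_n accumulates the −η c^n(w) terms over the iterations. Because the hinge loss has zero derivative whenever y^n f(x^n) > 1, many c^n(w) are zero and hence many α_n stay zero.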
-
As shown above, the weight vector of the SVM is a linear combination of the input points, so it can be written in the following form:
-
Here the original sum is rewritten as a matrix-vector product, a very common manipulation in machine learning. After obtaining this dual expression for the weights, the next step is to substitute it back into f(x), as shown in the figure below:
-
Plugging in w, we see that f(x) has three factors: the first is the coefficient vector α, written as a row vector with N entries; the other two can be grouped by the associativity of matrix multiplication into a column vector with N entries. So f(x) can be expressed in the form above, where the inner product x^n · x is written as K(x^n, x), and K is called the kernel function.
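Written out (a sketch with the notation above), the substitution is:

$$f(x) = w \cdot x = \Big(\sum_n \alpha_n x^n\Big) \cdot x = \sum_n \alpha_n\, (x^n \cdot x) = \sum_n \alpha_n\, K(x^n, x)$$

Equivalently, with X the matrix whose rows are the training points, w = X^T α and f(x) = α^T (X x), which is the row vector of N coefficients times the column vector of N inner products described above.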
-
As shown in the figure below, after f(x) is expressed as a linear combination of the coefficients and K(x^n, x), the next task is to minimize the loss function:
-
Note that although the latter term involves both n and n′, it is still just a double sum over the data: we sum over n′ first and then over n. We never need to know x^n itself; we only need to know the inner products between x^n and x^{n′}. This is called the kernel trick.
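Concretely (again a sketch), after the substitution the objective depends on the data only through kernel values:

$$L(\alpha) = \sum_n l\Big(\sum_{n'} \alpha_{n'}\, K(x^{n'}, x^n),\; y^n\Big)$$

and if an L2 term on w were included it would likewise reduce to $\sum_n \sum_{n'} \alpha_n \alpha_{n'} K(x^n, x^{n'})$, so only K(x^n, x^{n'}) is ever needed, never x^n itself.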
Kernel Trick
-
When we map the original data points into a higher-dimensional space, the kernel trick is often useful, as shown below:
-
As shown in the figure above, we map x to φ(x). Computing the transformation to the higher-dimensional space and then taking the dot product turns out, after simplification, to be equivalent to taking the dot product of the original data first and then squaring it. This mainly saves a large amount of computation when the input data are high-dimensional, as shown in the figure below:
-
If you do the high-dimensional mapping first, each point requires about k + C(k,2) multiplications to compute, and the dot product between the two high-dimensional points requires another k + C(k,2), for a total of about 3 × (k + C(k,2)) multiplications. But if you take the inner product of the original vectors first and then square it, you only need k + 1 multiplications, which is obviously far less work. So the main benefit of the kernel trick is to reduce the amount of computation.
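A tiny check of this equivalence (the explicit feature map below is one standard choice for the squared-dot-product kernel, not taken from the post's figure):

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Explicit feature map for the kernel K(x, z) = (x . z)^2."""
    squares = x ** 2                                      # x_i^2 terms (k of them)
    crosses = [np.sqrt(2) * x[i] * x[j]                   # sqrt(2) * x_i * x_j terms (C(k,2) of them)
               for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([squares, crosses])

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)

explicit = phi(x) @ phi(z)          # map to the high-dimensional space, then dot product
trick    = (x @ z) ** 2             # dot product first, then square
print(np.isclose(explicit, trick))  # True: the kernel trick gives the same value
```

Computing `phi` explicitly costs on the order of k + C(k,2) multiplications per point, while the one-line kernel evaluation costs k + 1, which is the saving described above.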
-
Next, take the Radial Basis Function (RBF) kernel as an example:
-
If we expand the RBF kernel using Taylor's formula, we see that it is a sum of infinitely many terms, so there is no way to compute it by explicitly mapping to a higher-dimensional space and then taking dot products. This also tells us that the RBF kernel corresponds to a classifier in an infinite-dimensional feature space; it often works well, but it is also very easy to overfit.
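A sketch of the expansion being described, for the version K(x, z) = exp(−½‖x − z‖²) (the exact scaling used in the post's figure is unknown):

$$\exp\!\Big(-\tfrac{1}{2}\lVert x - z\rVert^2\Big) = \exp\!\Big(-\tfrac{1}{2}\lVert x\rVert^2\Big)\,\exp\!\Big(-\tfrac{1}{2}\lVert z\rVert^2\Big)\,\exp(x \cdot z) = C_x\, C_z \sum_{k=0}^{\infty} \frac{(x \cdot z)^k}{k!}$$

Each power (x·z)^k corresponds to a feature map of degree-k monomials, so the implied feature map stacks features of every degree and is therefore infinite-dimensional.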
-
Next, take the sigmoid kernel as an example:
-
We can see that the computation of f(x) can be carried out by the following neural network: for an input x, we first take its dot product with x^1, which is exactly the weighted input of the first hidden neuron; there are N such neurons, one per training point. Their outputs are then multiplied by the corresponding coefficients and summed to obtain the target function f(x). Therefore, an SVM with the sigmoid kernel is in fact a neural network with a single hidden layer whose activation function is the sigmoid.
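A minimal sketch of this view, assuming the common tanh form of the sigmoid kernel, K(x^n, x) = tanh(x^n · x) (the values below are purely illustrative): evaluating f(x) = Σ_n α_n K(x^n, x) is exactly a one-hidden-layer network whose n-th hidden unit has weight vector x^n and whose output weights are the α_n.

```python
import numpy as np

def svm_sigmoid_kernel_predict(alpha, X_train, x):
    """f(x) = sum_n alpha_n * tanh(x_n . x), viewed as a one-hidden-layer network."""
    hidden = np.tanh(X_train @ x)   # hidden layer: the weights are the training points x_n
    return alpha @ hidden           # output layer: the weights are the dual coefficients alpha_n

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10, 3))  # 10 "support vectors" in 3 dimensions
alpha   = rng.normal(size=10)       # dual coefficients (random here, purely for illustration)
x_new   = rng.normal(size=3)
print(svm_sigmoid_kernel_predict(alpha, X_train, x_new))
```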
-
Not every function of a dot product (or other operation on vectors) is a valid kernel; whether a function can serve as a kernel can be checked with Mercer's theorem.
The relationship between deep learning and support vector machines
-
Deep learning is closely related to support vector machines in principle, as shown in the figure below:
-
In deep learning, the hidden layers in front are actually a high-dimensional mapping of the features, and the final layer is a linear classifier.
-
An SVM uses the kernel method to map the data nonlinearly and then uses a linear classifier to classify them.
-
Both are therefore "feature mapping followed by classification" methods. In fact, the kernel of an SVM can also be learned, although not to the same extent as the representations in deep learning.
-
When multiple kernel functions are used, the weights connecting the kernels can be learned. A single-kernel SVM is like a network with only one hidden layer; with the multiple-kernel method there are, in effect, several hidden layers, and the weights between them can be learned.