In the last lesson, we introduced Kernel Logistic Regression and discussed how to apply SVM techniques to soft binary classification. One method is two-level learning: first run SVM to obtain parameters b and w, and then use a general logistic regression optimization algorithm to fine-tune b and w iteratively to reach the best solution. Alternatively, by the Representer Theorem, the SVM kernel trick can be brought into the Z space to solve logistic regression directly. This lesson extends the previous one and discusses how to apply SVM kernel techniques to regression problems.
Kernel Ridge Regression
First, let's review the Representer Theorem introduced in the last lesson: for any L2-regularized linear model, the optimal solution w can be written as a linear combination of the z_n. Therefore, the kernel trick can be introduced to kernelize the model.
So how do we put regression models into kernel form? The most common error measure for linear/ridge regression, introduced earlier, is the squared error err(y, w^T z) = (y − w^T z)². This form has an analytic solution: the optimal w can be obtained directly through matrix operations using the linear least-squares method. What we want to study now is how to introduce a kernel into ridge regression and obtain a corresponding analytic solution.
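As a reference point before kernelizing, here is a minimal NumPy sketch of the closed-form linear ridge regression solution; the regularization weight lam and the toy data are illustrative assumptions, not values from the lecture (note that a 1/N factor in the objective only rescales lam):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form L2-regularized linear regression:
    w = (lam * I + X^T X)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

# toy usage
X = np.random.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(100)
w = ridge_fit(X, y, lam=0.1)
print(w)
```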
Let's first write down the kernel ridge regression problem:

min_w (λ/N) w^T w + (1/N) Σ_{n=1}^{N} (y_n − w^T z_n)²

By the Representer Theorem, the best solution w must be a linear combination of the z_n, i.e. w = Σ_{n=1}^{N} β_n z_n. Substituting this into ridge regression and replacing every inner product of z's with the kernel K(x_n, x_m), the problem of solving for w becomes the problem of solving for β.

Written in matrix form, the first term of the resulting expression can be viewed as the regularizer on β and the second term as the squared error of the hypothesis. So what we want to do is find the β that minimizes this expression, thereby solving the kernel ridge regression problem.
To solve for β, the objective E_aug(β) can be written out as:

E_aug(β) = (λ/N) β^T K β + (1/N) ( β^T K^T K β − 2 β^T K^T y + y^T y )

This is a convex quadratic polynomial in β, so to find its minimum we only need to compute the gradient and set it to zero:

∇E_aug(β) = (2/N) K^T ( (λI + K) β − y )

Setting the gradient to zero, one possible analytic solution for β is:

β = (λI + K)^(-1) y

The question to ask here is whether (λI + K) has an inverse. The answer is yes: as introduced before, a kernel K satisfying Mercer's condition is positive semi-definite, and λ > 0, so (λI + K) must be invertible. As for computation time, (λI + K) is an N×N matrix, so the complexity is O(N³). One more observation: β is the product of (λI + K)^(-1) and y, and the kernel K represents inner products in the Z space; since two vectors are perpendicular only in special cases, most entries of K are non-zero. This is also why (λI + K)^(-1) is a dense matrix and most entries of β are non-zero. We will come back to this point later.
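To make the closed-form solution concrete, here is a minimal NumPy sketch of kernel ridge regression with a Gaussian (RBF) kernel; the kernel width gamma, the regularization weight lam, and the toy data are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def krr_fit(X, y, lam=0.1, gamma=1.0):
    """beta = (lam * I + K)^{-1} y, the analytic kernel ridge solution."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(lam * np.eye(len(X)) + K, y)

def krr_predict(X_train, beta, X_test, gamma=1.0):
    """g(x) = sum_n beta_n * K(x_n, x)."""
    return rbf_kernel(X_test, X_train, gamma) @ beta

# toy usage: fit a sine curve that a straight line cannot capture
X = np.linspace(0, 3, 50).reshape(-1, 1)
y = np.sin(2 * X).ravel() + 0.1 * np.random.randn(50)
beta = krr_fit(X, y, lam=0.05, gamma=2.0)
print(krr_predict(X, beta, X[:5]))
```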
Therefore, we can solve non-linear regression problems through the kernel. Let's compare linear ridge regression with kernel ridge regression.
As shown in the figure above, the linear ridge regression on the left is a straight line, while the kernel ridge regression on the right is a curve; roughly speaking, the curve on the right fits the data better. What are the pros and cons of the two? Linear ridge regression is a linear model that can only fit straight lines; its training complexity is O(d³ + d²N) and its prediction complexity is O(d), so it is efficient when N is much larger than d. Kernel ridge regression transforms to the Z space and uses the kernel trick to produce non-linear models, so it is more flexible; its training complexity is O(N³) and its prediction complexity is O(N), both depending only on N, so it is suitable when N is not too large. In comparison, linear versus kernel is really a trade-off between efficiency and flexibility.
Support Vector Regression Primal
In the Machine Learning Foundations course we saw that linear regression can be used for classification; similarly, the kernel ridge regression from the previous part can also be used for classification. Kernel ridge regression applied to classification goes by the new name least-squares SVM (LSSVM).
First, see the differences between the soft-margin Gaussian SVM and Gaussian LSSVM results for a certain problem.
As shown in the figure above, there is not much difference between soft-margin Gaussian SVM and Gaussian LSSVM if we only look at the classification boundary: the classification lines are almost the same. However, if we look at the support vectors (the boxed points in the figure), the soft-margin Gaussian SVM on the left has only a few SVs, while on the right almost every point of the Gaussian LSSVM is an SV. This is because in the soft-margin Gaussian SVM most of the α_n are zero and only a minority are non-zero, so there are few SVs. For LSSVM, as we saw in the previous part, most entries of β are non-zero, so basically every point is an SV. The problem with having too many SVs arises at prediction time: g(x) = Σ β_n K(x_n, x), so if many β_n are non-zero, computing g is expensive and prediction slows down. In this respect the soft-margin Gaussian SVM is more advantageous.
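To illustrate this dense-versus-sparse contrast, here is a hedged scikit-learn sketch; it uses KernelRidge fit on ±1 labels as a rough stand-in for LSSVM and SVC as the soft-margin Gaussian SVM, and the dataset and hyperparameters are made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.kernel_ridge import KernelRidge
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
y_pm = 2 * y - 1  # ±1 labels for the regression-style classifier

# Soft-margin Gaussian SVM: only a minority of points end up as SVs
svm = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
print("SVM support vectors:", len(svm.support_), "of", len(X))

# Kernel ridge on ±1 labels (an LSSVM-like classifier): beta is dense,
# so essentially every training point contributes to the prediction
krr = KernelRidge(kernel="rbf", alpha=0.1, gamma=1.0).fit(X, y_pm)
beta = krr.dual_coef_
print("non-zero beta in kernel ridge:", np.sum(np.abs(beta) > 1e-8), "of", len(X))
```

The exact counts depend on the data and parameters, but the pattern of few SVs for the SVM versus a dense β for the ridge-style model is what the lecture's figure shows.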
So, can we make the dense β in LSSVM sparse by some method, so that there are not too many SVs while keeping the same classification performance as soft-margin SVM? That is the problem we will try to solve next.
The method is to introduce Tube Regression: delimit a region (a neutral zone, or "tube") above and below the regression line. If a data point falls inside this region, its error is ignored; only points outside the neutral zone count as errors.
Assume the neutral zone extends a distance ε above and below the line (ε > 0). Then the error measure can be written as err(y, s) = max(0, |s − y| − ε), which corresponds to the distance marked in red in the figure above.

This error is usually called the ε-insensitive error. Its max form is similar to the hinge error measure we introduced in the last lesson. What we do next is carry out an L2-regularized tube regression derivation, similar to soft-margin SVM, in order to obtain a sparse β.
First, let's compare the error used in tube regression, err(y, s) = max(0, |s − y| − ε), with the squared error err(y, s) = (s − y)². Then draw the curve of err(y, s) against s for each:
In the figure above, the red line represents the squared error and the blue line represents the tube error. When |s − y| is small, i.e. when s is close to y, the two errors are of similar size. When |s − y| is large, the squared error grows much faster than the tube error. The faster an error grows, the more susceptible it is to the influence of noise, which is not good for the optimization problem. So, in this respect, the tube regression error function is better.
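A tiny NumPy sketch of the two error measures side by side; the value of epsilon and the sample points are just examples:

```python
import numpy as np

def tube_error(y, s, epsilon=0.5):
    """epsilon-insensitive (tube) error: max(0, |s - y| - epsilon)."""
    return np.maximum(0.0, np.abs(s - y) - epsilon)

def squared_error(y, s):
    return (s - y) ** 2

s = np.linspace(-3, 3, 7)   # predictions
y = 0.0                     # true target
print("s:      ", s)
print("tube:   ", tube_error(y, s))
print("squared:", squared_error(y, s))
# For |s - y| <= epsilon the tube error is exactly 0; for large |s - y|
# it grows linearly, while the squared error grows quadratically.
```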
Now, let's write down the L2-regularized tube regression (with b absorbed into w for the moment):

min_w (λ/N) w^T w + (1/N) Σ_{n=1}^{N} max(0, |w^T z_n − y_n| − ε)
This optimization problem contains a max term, so it is not differentiable everywhere and is not well suited to GD/SGD. Moreover, although it satisfies the Representer Theorem and could be solved by introducing a kernel, that route does not guarantee sparsity. Instead, we can transform the problem into a constrained QP problem and follow the derivation of the dual SVM, introducing the kernel and the KKT conditions, so as to guarantee that the solution β is sparse.
Therefore, we can write the L2-regularized tube regression in a form similar to SVM:
It is worth mentioning that the coefficient λ is inversely proportional to C: the larger C is, the smaller λ is, and the smaller C is, the larger λ is. This formula also writes the hypothesis as w^T z + b, i.e. b is taken out separately, which is consistent with our earlier derivation of SVM.
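Since the formula images are not shown here, the two forms can be reconstructed roughly as follows, based on the standard SVM-style derivation:

```latex
% L2-regularized tube regression (b absorbed into w):
\min_{w}\ \frac{\lambda}{N}\, w^{T}w
  + \frac{1}{N}\sum_{n=1}^{N} \max\!\left(0,\ |w^{T}z_{n} - y_{n}| - \epsilon\right)

% SVM-like form (b separated out; up to an overall scaling, C corresponds to 1/(2\lambda)):
\min_{b,\,w}\ \frac{1}{2}\, w^{T}w
  + C \sum_{n=1}^{N} \max\!\left(0,\ |w^{T}z_{n} + b - y_{n}| - \epsilon\right)
```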
Now we have the SVM-like form of the problem. However, because of the max term it is still not a standard QP problem, so let's continue with some transformations and derivations of this expression:
The expression on the right of the figure above is the standard QP problem, where the two sets of slack variables ξ_n^∨ and ξ_n^∧ record the upper-tube violations and the lower-tube violations. This form is called the Support Vector Regression (SVR) primal.
The standard QP form of SVR contains two important parameters: C and ε. C represents the trade-off between regularization and tube violation: a large C puts more weight on penalizing tube violations, while a small C puts more weight on regularization. ε is the half-width of the tube, i.e. the tolerance for error: the larger ε is, the more error is tolerated. ε is a tunable parameter that is unique to SVR and does not exist in SVM. In total, the QP form of SVR has d̃ + 1 + 2N variables (the d̃-dimensional w, b, and the 2N slack variables) and 2N + 2N constraints.
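In practice one rarely writes this QP by hand. Below is a hedged scikit-learn sketch exposing the two SVR parameters C and epsilon; the data and parameter values are illustrative only:

```python
import numpy as np
from sklearn.svm import SVR

# toy 1-D regression data
X = np.linspace(0, 3, 100).reshape(-1, 1)
y = np.sin(2 * X).ravel() + 0.1 * np.random.randn(100)

# C trades off regularization vs. tube violation; epsilon is the tube half-width
model = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=2.0)
model.fit(X, y)

print("number of support vectors:", len(model.support_))
print("sample predictions:", model.predict(X[:3]))
```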
Support Vector Regression Dual
Now that we have the primal form of SVR, we derive its dual form. First, as in the SVM dual, introduce Lagrange multipliers α_n^∨ and α_n^∧ corresponding to the two tube inequality constraints. (The Lagrange multipliers corresponding to ξ_n^∨ ≥ 0 and ξ_n^∧ ≥ 0 are ignored here.)
Then, carrying out the same derivation and simplification as for SVM, we set the partial derivatives of the Lagrange function with respect to the relevant parameters to zero and obtain the corresponding KKT conditions:
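Since the original formula image is not reproduced here, the key stationarity conditions from the standard SVR derivation are, as a reconstruction:

```latex
% stationarity with respect to b and w:
\frac{\partial \mathcal{L}}{\partial b} = 0 \;\Rightarrow\;
  \sum_{n=1}^{N} \left(\alpha_n^{\wedge} - \alpha_n^{\vee}\right) = 0

\frac{\partial \mathcal{L}}{\partial w} = 0 \;\Rightarrow\;
  w = \sum_{n=1}^{N} \left(\alpha_n^{\wedge} - \alpha_n^{\vee}\right) z_n

% box constraints implied by the slack-variable multipliers:
0 \le \alpha_n^{\wedge} \le C, \qquad 0 \le \alpha_n^{\vee} \le C
```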
Next, by observing the correspondence between the SVM primal and the SVM dual, the SVR dual form can be written down directly from the SVR primal. (The detailed mathematical derivation is omitted here.)
Finally, it's time to discuss whether the SVR solution is truly sparse. The SVR dual gives the solution w derived above:

w = Σ_{n=1}^{N} (α_n^∧ − α_n^∨) z_n = Σ_{n=1}^{N} β_n z_n
The corresponding complementary slackness conditions are:

α_n^∧ ( ε + ξ_n^∧ − y_n + w^T z_n + b ) = 0
α_n^∨ ( ε + ξ_n^∨ + y_n − w^T z_n − b ) = 0
For points distributed strictly inside the tube, |w^T z_n + b − y_n| < ε holds and the error is ignored, so ξ_n^∧ and ξ_n^∨ are both zero. In that case the second factor of each of the two equations above is non-zero, so complementary slackness forces α_n^∧ = 0 and α_n^∨ = 0, i.e. β_n = α_n^∧ − α_n^∨ = 0.
So, for the points inside the tube, β_n = 0, which makes the solution β sparse; only points on or outside the tube can have β_n ≠ 0 and act as support vectors. At this point, we have obtained a sparse solution for SVR.
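As a hedged sanity check with scikit-learn: SVR stores only the non-zero β_n of the support vectors in dual_coef_, so we can compare the number of SVs against N (data and parameters are again illustrative):

```python
import numpy as np
from sklearn.svm import SVR

X = np.linspace(0, 3, 200).reshape(-1, 1)
y = np.sin(2 * X).ravel() + 0.1 * np.random.randn(200)

svr = SVR(kernel="rbf", C=1.0, epsilon=0.2, gamma=2.0).fit(X, y)

# dual_coef_ holds beta_n = alpha_n_up - alpha_n_down for the support vectors only;
# points strictly inside the epsilon-tube have beta_n = 0 and are not SVs
n_sv = len(svr.support_)
print(f"support vectors: {n_sv} / {len(X)}")
print("all stored beta_n are non-zero:", np.all(svr.dual_coef_ != 0))

# a wider tube (larger epsilon) typically ignores more points, giving fewer SVs
svr_wide = SVR(kernel="rbf", C=1.0, epsilon=0.5, gamma=2.0).fit(X, y)
print(f"support vectors with larger epsilon: {len(svr_wide.support_)} / {len(X)}")
```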
Summary of Kernel Models
This section summarizes all the kernel models we have introduced. We have covered three linear models: PLA/pocket, regularized logistic regression, and linear ridge regression. All three can be solved with the LIBLINEAR library developed by Prof. Chih-Jen Lin of National Taiwan University.
In addition, we introduced the linear soft-margin SVM, whose error function is the hinge error max(0, 1 − y_n(w^T z_n + b)); it can be solved as a standard QP problem and addresses the same kind of problem as PLA/pocket. Then, from the SVM point of view, we introduced the linear SVR problem, which addresses the same kind of problem as linear ridge regression; its ε-insensitive error is converted into a QP problem and solved, which is the main content of this lesson.
The models in the figure above can also be transformed into their dual forms and combined with kernels. The overall block diagram is as follows:
Among them, SVM, SVR and probabilistic SVM can all be solved with the LIBSVM library developed by Prof. Chih-Jen Lin of National Taiwan University. Generally speaking, SVR and probabilistic SVM are the most commonly used of these models.
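As a practical pointer, scikit-learn wraps both libraries, so the models above can be tried without calling LIBLINEAR or LIBSVM directly; the mapping below is a rough sketch under the usual scikit-learn defaults:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC, SVR

# LIBLINEAR-backed linear models
linear_svm = LinearSVC(C=1.0)                            # linear soft-margin SVM
logreg = LogisticRegression(solver="liblinear", C=1.0)   # regularized logistic regression

# LIBSVM-backed kernel models
kernel_svm = SVC(kernel="rbf", C=1.0)                    # soft-margin Gaussian SVM
prob_svm = SVC(kernel="rbf", C=1.0, probability=True)    # probabilistic SVM (Platt scaling)
kernel_svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)       # kernel SVR
```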
Conclusion
This lesson mainly introduced SVR. We first converted ridge regression into kernel form, namely kernel ridge regression, via the Representer Theorem, and derived its analytic solution. However, that solution β is dense, with many non-zero values. We therefore defined a new tube regression error, followed the SVM derivation to minimize the regularized tube error, converted it into the dual form, and obtained a sparse solution. Finally, we summarized all the kernel models introduced so far and briefly outlined their characteristics; in practice, we should choose a suitable model for the problem at hand.