This post summarizes the scikit-learn SVM algorithm library from a practical point of view. The scikit-learn SVM library encapsulates the implementations of libsvm and liblinear, and only rewrites the interface part of the algorithms.

1. Scikit-learn SVM algorithm library usage overview

The SVM algorithm library in scikit-learn is divided into two kinds. One is the classification algorithm library, which includes the SVC, NuSVC, and LinearSVC classes. The other is the regression algorithm library, which includes the SVR, NuSVR, and LinearSVR classes. All of these classes live in the sklearn.svm module.

For the three classification classes SVC, NuSVC, and LinearSVC, SVC and NuSVC are almost the same; the only difference lies in how the loss is measured. LinearSVC, as the name suggests, is a linear classifier: it does not support the kernel functions that map from low to high dimensions, only the linear kernel, so it cannot be used on linearly non-separable data.

Similarly, for the three regression classes SVR, NuSVR, and LinearSVR: SVR and NuSVR are almost the same, the only difference being how the loss is measured, while LinearSVR is linear regression and can only use the linear kernel function.

When using these classes, if we know from experience that the data can be fitted linearly, we can use LinearSVC for classification or LinearSVR for regression. They do not require us to slowly try out various kernel functions and their corresponding parameters, and they are also fast. If we have no prior knowledge of the data distribution, we generally use SVC for classification or SVR for regression, which requires us to select a kernel function and tune its parameters.

In what special scenarios do we need NuSVC classification or NuSVR regression? We can choose NuSVC or NuSVR if we have requirements on the training error rate or on the fraction of support vectors in the training set: they have a parameter, nu, that controls this percentage.
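To make the division of labor concrete, here is a minimal sketch (toy data from sklearn.datasets; all settings are illustrative, not recommendations) that instantiates all six classes:

```python
from sklearn.svm import SVC, NuSVC, LinearSVC, SVR, NuSVR, LinearSVR
from sklearn.datasets import make_classification, make_regression

# Toy data, just to exercise the six classes.
Xc, yc = make_classification(n_samples=100, n_features=4, random_state=0)
Xr, yr = make_regression(n_samples=100, n_features=4, noise=0.1, random_state=0)

LinearSVC().fit(Xc, yc)        # linear classification: fast, no kernel to choose
SVC(kernel='rbf').fit(Xc, yc)  # general case: pick a kernel and tune it
NuSVC(nu=0.5).fit(Xc, yc)      # nu bounds the training error / support-vector fraction

LinearSVR().fit(Xr, yr)        # the regression counterparts
SVR(kernel='rbf').fit(Xr, yr)
NuSVR(nu=0.5).fit(Xr, yr)
```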

The detailed use of these classes will be described below.

2. Review SVM classification algorithm and regression algorithm

Let's briefly review the SVM classification and regression algorithms, because some parameters of the algorithm library correspond to parameters of the algorithms themselves. Without this review, the parameter descriptions may be a little hard to understand.

For the SVM classification algorithm, its original (primal) form is:

$$\min_{w,b,\xi} \;\; \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i$$
$$s.t. \;\; y_i(w \bullet \phi(x_i) + b) \geq 1 - \xi_i \;\;(i=1,2,...,m)$$
$$\xi_i \geq 0 \;\;(i=1,2,...,m)$$

where $m$ is the number of samples, our samples are $(x_1,y_1), (x_2,y_2), ..., (x_m,y_m)$, $w$ and $b$ are the coefficients of our separating hyperplane $w \bullet \phi(x_i) + b = 0$, $\xi_i$ is the slack variable of the $i$-th sample, $C$ is the penalty coefficient, and $\phi(x)$ is a mapping function from low dimension to high dimension.

The form after applying the Lagrangian function and duality is:

$$\min_{\alpha} \;\; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j K(x_i,x_j) - \sum_{i=1}^{m}\alpha_i$$
$$s.t. \;\; \sum_{i=1}^{m}\alpha_i y_i = 0$$
$$0 \leq \alpha_i \leq C$$

which differs from the original form in that $\alpha$ is the vector of Lagrange coefficients and $K(x_i,x_j) = \phi(x_i) \bullet \phi(x_j)$ is the kernel function we are going to use.

 

For the SVM regression algorithm, its original (primal) form is:

$$\min_{w,b,\xi^{\lor},\xi^{\land}} \;\; \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}(\xi_i^{\lor} + \xi_i^{\land})$$
$$s.t. \;\; -\epsilon - \xi_i^{\lor} \leq y_i - w \bullet \phi(x_i) - b \leq \epsilon + \xi_i^{\land}$$
$$\xi_i^{\lor} \geq 0, \;\; \xi_i^{\land} \geq 0 \;\;(i=1,2,...,m)$$

where $m$ is the number of samples, our samples are $(x_1,y_1), (x_2,y_2), ..., (x_m,y_m)$, $w$ and $b$ are the coefficients of our regression hyperplane $w \bullet \phi(x_i) + b = 0$, $\xi_i^{\lor}$ and $\xi_i^{\land}$ are the slack variables of the $i$-th sample, $C$ is the penalty coefficient, $\epsilon$ is the loss boundary (points of the training set whose distance to the hyperplane is less than $\epsilon$ incur no loss), and $\phi(x)$ is a mapping function from low dimension to high dimension.

The form after applying the Lagrangian function and duality is:

$$\min_{\alpha^{\lor},\alpha^{\land}} \;\; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}(\alpha_i^{\land} - \alpha_i^{\lor})(\alpha_j^{\land} - \alpha_j^{\lor})K(x_i,x_j) + \sum_{i=1}^{m}\left[(\epsilon - y_i)\alpha_i^{\land} + (\epsilon + y_i)\alpha_i^{\lor}\right]$$
$$s.t. \;\; \sum_{i=1}^{m}(\alpha_i^{\land} - \alpha_i^{\lor}) = 0$$
$$0 \leq \alpha_i^{\lor} \leq C, \;\; 0 \leq \alpha_i^{\land} \leq C$$

which differs from the original form in that $\alpha^{\lor}$ and $\alpha^{\land}$ are the vectors of Lagrange coefficients and $K(x_i,x_j) = \phi(x_i) \bullet \phi(x_j)$ is the kernel function we are going to use.

3. Overview of SVM kernel functions

There are four built-in kernels in scikit-learn (only three if you don't count the linear kernel as a kernel function).

1) The linear kernel is expressed as $K(x,z) = x \bullet z$, i.e., the ordinary inner product. LinearSVC and LinearSVR can only use this kernel.

2) The polynomial kernel is one of the commonly used kernel functions for linearly non-separable SVM. Its expression is $K(x,z) = (\gamma x \bullet z + r)^d$, where $\gamma$, $r$, and $d$ all need to be tuned, which is rather troublesome.

3) The Gaussian kernel, called the Radial Basis Function (RBF) in SVM, is the default kernel function of libsvm and, of course, of scikit-learn. Its expression is $K(x,z) = e^{-\gamma\|x-z\|^2}$, where $\gamma > 0$ needs to be tuned.

4) The sigmoid kernel is also one of the commonly used kernel functions for linearly non-separable SVM. Its expression is $K(x,z) = \tanh(\gamma x \bullet z + r)$, where $\gamma$ and $r$ need to be tuned.

In general, for nonlinear data it is better to use the default Gaussian kernel. If you are not an SVM tuning expert, it is recommended to use the Gaussian kernel for data analysis.
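As a minimal sketch (the gamma, coef0, and degree values are purely illustrative), this is how the four built-in kernels map onto SVC's constructor arguments:

```python
from sklearn.svm import SVC

# One classifier per built-in kernel, with the parameters each kernel exposes.
linear_svm  = SVC(kernel='linear')                                # K(x,z) = x . z
poly_svm    = SVC(kernel='poly', degree=3, gamma=0.5, coef0=1.0)  # (gamma x.z + r)^d
rbf_svm     = SVC(kernel='rbf', gamma=0.5)                        # exp(-gamma ||x-z||^2)
sigmoid_svm = SVC(kernel='sigmoid', gamma=0.5, coef0=1.0)         # tanh(gamma x.z + r)
```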

4. Summary of parameters of SVM classification algorithm library

Here we give a detailed explanation of the important parameters of the SVM classification algorithm library, focusing on the points to watch when tuning.

The important parameters are listed one by one below; for each parameter, the notes state how it behaves in LinearSVC, SVC, and NuSVC.
Penalty coefficient C

LinearSVC and SVC: this is the penalty coefficient C in the primal and dual forms of the SVM classification model in Section 2. The default value is 1. Generally an appropriate C needs to be selected through cross validation; broadly speaking, the noisier the data, the smaller C should be.

NuSVC: does not have this parameter. It controls the training error rate through another parameter, nu, which is equivalent to choosing a C such that the trained model meets a certain error rate on the training set.
nu

LinearSVC and SVC: do not have this parameter; they use the penalty coefficient C to control the strength of the penalty.

NuSVC: nu represents an upper bound on the training error rate, or a lower bound on the fraction of support vectors. Its range is (0, 1] and the default is 0.5. Like the penalty coefficient C, it controls the strength of the penalty.
Kernel function kernel

LinearSVC: does not have this parameter; LinearSVC is restricted to the linear kernel function.

SVC and NuSVC: there are four built-in choices for the kernel, as described in Section 3: 'linear' (the linear kernel), 'poly' (the polynomial kernel), 'rbf' (the Gaussian kernel), and 'sigmoid' (the sigmoid kernel). If one of these is selected, the corresponding kernel parameters have their own arguments, described below, to tune. The default is the Gaussian kernel, 'rbf'.

Another option is 'precomputed': we compute the Gram matrix for all samples of the training and test sets in advance, and $K(x,z)$ is then looked up directly at the corresponding position of that Gram matrix.

Of course, we can also pass a custom kernel function, but I won't go into that because I haven't used custom kernels.

Regularization parameter penalty

LinearSVC: for linear fitting only; you can choose 'l1' for L1 regularization or 'l2' for L2 regularization. The default is L2 regularization. If we need sparse coefficients, we can choose L1 regularization, similar to Lasso in linear regression.

SVC and NuSVC: do not have this parameter.
Whether to optimize with the dual form: dual

LinearSVC: a Boolean that controls whether the algorithm optimizes the dual form. The default is True, i.e., the dual form of the classification algorithm in Section 2 above is optimized. If the number of samples is larger than the number of features, the dual form is computationally expensive, so it is recommended to set dual to False, i.e., optimize the original (primal) form.

SVC and NuSVC: do not have this parameter.
Kernel parameter degree

LinearSVC: does not have this parameter; LinearSVC is restricted to the linear kernel function.

SVC and NuSVC: if we use the polynomial kernel 'poly' for the kernel parameter, we need to tune this parameter. It corresponds to $d$ in $K(x,z) = (\gamma x \bullet z + r)^d$. The default is 3; a suitable value generally needs to be selected through cross validation.
Kernel parameter gamma

LinearSVC: does not have this parameter; LinearSVC is restricted to the linear kernel function.

SVC and NuSVC: if we use the polynomial kernel 'poly', the Gaussian kernel 'rbf', or the sigmoid kernel for the kernel parameter, we need to tune this parameter.

For the polynomial kernel it corresponds to $\gamma$ in $K(x,z) = (\gamma x \bullet z + r)^d$. A suitable value generally needs to be selected through cross validation.

For the Gaussian kernel it corresponds to $\gamma$ in $K(x,z) = e^{-\gamma\|x-z\|^2}$. A suitable value generally needs to be selected through cross validation.

For the sigmoid kernel it corresponds to $\gamma$ in $K(x,z) = \tanh(\gamma x \bullet z + r)$. A suitable value generally needs to be selected through cross validation.

The default is 'auto', i.e., $\gamma = \frac{1}{\text{number of features}}$.

Kernel parameter coef0

LinearSVC: does not have this parameter; LinearSVC is restricted to the linear kernel function.

SVC and NuSVC: if we use the polynomial kernel 'poly' or the sigmoid kernel for the kernel parameter, we need to tune this parameter.

For the polynomial kernel it corresponds to $r$ in $K(x,z) = (\gamma x \bullet z + r)^d$. A suitable value generally needs to be selected through cross validation.

For the sigmoid kernel it corresponds to $r$ in $K(x,z) = \tanh(\gamma x \bullet z + r)$. A suitable value generally needs to be selected through cross validation.

coef0 defaults to 0.

 

Sample weights class_weight

LinearSVC, SVC, and NuSVC: specifies the weight of each class in the sample set, mainly to prevent the training set from having too many samples of certain classes, which would bias the trained decision too much toward those classes. You can specify the weight of each class yourself, or use 'balanced'; with 'balanced' the algorithm computes the weights itself, and classes with few samples get higher weights. Of course, if your class distribution has no obvious bias, you can ignore this parameter and keep the default None.
Classification decision decision_function_shape

LinearSVC: does not have this parameter; it uses the multi_class parameter instead.

SVC and NuSVC: you can choose 'ovo' or 'ovr'. Version 0.18 defaults to 'ovo'; version 0.19 will default to 'ovr'.

The idea of OvR (one-vs-rest) is very simple: no matter how many classes you have, treat the problem as binary classification. Specifically, for the decision of the k-th class, take all samples of the k-th class as positive examples and all other samples as negative examples, then do binary classification to obtain the classification model of the k-th class. The models of the other classes are obtained in the same way.

OvO (one-vs-one) picks two classes out of the T classes at a time, call them T1 and T2, gathers all samples whose labels are T1 or T2, takes T1 as positive examples and T2 as negative examples, and does binary classification to obtain the model parameters. We need $T(T-1)/2$ classifications in total.

From the description above, OvR is relatively simple, but its classification performance is relatively poor (for most sample distributions; for some distributions OvR may be better). OvO is more accurate, but not as fast as OvR. OvO is generally recommended for better classification accuracy.

Classification decision multi_class

LinearSVC: you can choose 'ovr' or 'crammer_singer'.

'ovr' is the same as the OvR described for decision_function_shape in SVC and NuSVC.

'crammer_singer' is a kind of improved 'ovr'. It is called an improvement, but it is no better than 'ovr', and it is generally not recommended in applications.

SVC and NuSVC: do not have this parameter; they use decision_function_shape instead.
Cache size cache_size

LinearSVC: its computation is light, so it does not need this parameter.

SVC and NuSVC: with large samples, the cache size affects the training speed, so if the machine has plenty of memory, 500 (MB) or even 1000 (MB) is recommended. The default is 200, i.e., 200MB.
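To tie the parameters above together, here is a minimal sketch (toy data; every parameter value is illustrative rather than a recommendation) that exercises C, nu, kernel, class_weight, decision_function_shape, and the 'precomputed' option:

```python
import numpy as np
from sklearn.svm import SVC, NuSVC, LinearSVC

# Toy 3-class data, just to exercise the parameters.
rng = np.random.RandomState(0)
X = rng.randn(60, 2)
y = np.repeat([0, 1, 2], 20)

# SVC: Gaussian kernel with explicit C and gamma, balanced class weights.
clf = SVC(C=1.0, kernel='rbf', gamma=0.5, class_weight='balanced',
          decision_function_shape='ovr', cache_size=500)
clf.fit(X, y)
print(clf.decision_function(X).shape)   # (60, 3) with 'ovr'

# NuSVC: nu replaces C; polynomial kernel with its degree/gamma/coef0 knobs.
NuSVC(nu=0.3, kernel='poly', degree=3, gamma=0.5, coef0=1.0).fit(X, y)

# LinearSVC: linear kernel only; penalty, dual, and multi_class are its knobs.
LinearSVC(C=1.0, penalty='l2', dual=True, multi_class='ovr').fit(X, y)

# kernel='precomputed': supply the Gram matrix K(x, z) yourself.
gamma = 0.5
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
gram = np.exp(-gamma * sq_dists)        # RBF Gram matrix of the training set
pre_clf = SVC(kernel='precomputed').fit(gram, y)
print(pre_clf.predict(gram)[:5])        # at predict time, rows are K(test, train)
```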

5. Summary of parameters of SVM regression algorithm library

A large part of the important parameters of the SVM regression algorithm library are similar to those of the classification algorithm library, so here we focus on the parts different from the classification algorithm library. For the same parts, please refer to the corresponding parameters in the previous section.

Again the parameters are listed one by one; the notes state how each behaves in LinearSVR, SVR, and NuSVR.
Penalty coefficient C

LinearSVR and SVR: this is the penalty coefficient C in the primal and dual forms of the SVM regression model in Section 2. The default value is 1. Generally an appropriate C needs to be selected through cross validation; broadly speaking, the noisier the data, the smaller C should be.

NuSVR: also has this parameter. You may notice that in the classification model, NuSVC uses nu as the equivalent parameter to control the error rate and has no C, so why does NuSVR still have this parameter? Isn't it redundant? The reason is that in the regression model, besides the penalty coefficient C we also have a distance error $\epsilon$; in other words, the regression error rate is the joint result of the penalty coefficient C and the distance error $\epsilon$. We will see below what nu does in NuSVR.
nu

LinearSVR and SVR: do not have this parameter; they use $\epsilon$ to control the error rate.

NuSVR: nu represents an upper bound on the training error rate, or a lower bound on the fraction of support vectors. Its range is (0, 1] and the default is 0.5. By choosing different error rates we obtain different distance errors $\epsilon$; that is, nu here plays a role equivalent to the $\epsilon$ parameter of LinearSVR and SVR.
Distance error epsilon

LinearSVR and SVR: this is the $\epsilon$ in our Section 2 regression model; samples of the training set should satisfy $-\epsilon - \xi_i^{\lor} \leq y_i - w \bullet \phi(x_i) - b \leq \epsilon + \xi_i^{\land}$.

NuSVR: does not have this parameter; it uses nu to control the error rate.
Whether to optimize with the dual form: dual

LinearSVR: similar to LinearSVC; refer to the description of dual in the previous section.

SVR and NuSVR: do not have this parameter.
Regularization parameter penalty

LinearSVR: similar to LinearSVC; refer to the description of penalty in the previous section.

SVR and NuSVR: do not have this parameter.
Kernel function kernel

LinearSVR: does not have this parameter; LinearSVR is restricted to the linear kernel function.

SVR and NuSVR: similar to SVC and NuSVC; refer to the description of kernel in the previous section.
Kernel parameters degree, gamma, and coef0

LinearSVR: does not have these parameters; LinearSVR is restricted to the linear kernel function.

SVR and NuSVR: similar to SVC and NuSVC; refer to the descriptions of the kernel parameters in the previous section.
Loss measure loss

LinearSVR: the value can be 'epsilon_insensitive' or 'squared_epsilon_insensitive'. If 'epsilon_insensitive' is selected, the loss measure satisfies $-\epsilon - \xi_i^{\lor} \leq y_i - w \bullet \phi(x_i) - b \leq \epsilon + \xi_i^{\land}$, the same as the loss measure in Section 2; it is the default loss measure of SVM regression.

If 'squared_epsilon_insensitive' is selected, the loss measure satisfies $(y_i - w \bullet \phi(x_i) - b)^2 \leq \epsilon + \xi_i$, in which case there is one fewer slack variable. The optimization process for this form is not covered in my SVM principles series, but the optimization of the objective function is completely similar.

The default 'epsilon_insensitive' is generally sufficient.

SVR and NuSVR: do not have this parameter.
Cache size cache_size

LinearSVR: its computation is light, so it does not need this parameter.

SVR and NuSVR: with large samples, the cache size affects the training speed, so, as with SVC and NuSVC, if the machine has plenty of memory, 500 (MB) or even 1000 (MB) is recommended. The default is 200, i.e., 200MB.
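Here, likewise, is a minimal sketch (toy data; parameter values illustrative) contrasting how the three regression classes control the error: SVR sets epsilon directly, NuSVR trades it for nu while keeping C, and LinearSVR additionally chooses the loss measure:

```python
import numpy as np
from sklearn.svm import SVR, NuSVR, LinearSVR

# Toy regression data: a linear signal plus noise.
rng = np.random.RandomState(0)
X = rng.randn(80, 3)
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(80)

# SVR: both the tube width epsilon and the penalty C are set directly.
SVR(kernel='rbf', C=1.0, epsilon=0.1, gamma=0.5).fit(X, y)

# NuSVR: nu takes the place of epsilon; C is still present (see the C row above).
NuSVR(kernel='rbf', C=1.0, nu=0.5, gamma=0.5).fit(X, y)

# LinearSVR: choose the loss measure; 'epsilon_insensitive' is the default.
LinearSVR(C=1.0, epsilon=0.1, loss='epsilon_insensitive').fit(X, y)
LinearSVR(C=1.0, epsilon=0.1, loss='squared_epsilon_insensitive').fit(X, y)
```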

6. Other tuning points of SVM algorithm library

The parameters of the SVM library in scikit-learn have been summarized above; here we summarize a few other tuning points.

1) It is generally recommended to normalize the data before training; of course, the data in the test set also needs to be normalized.

2) When the number of features is very large, or the number of samples is far smaller than the number of features, the linear kernel works well, and only the penalty coefficient C needs to be selected.

3) When selecting a kernel function, if the linear fit is not good, it is generally recommended to use the default Gaussian kernel 'rbf'. In this case we mainly need to tune the penalty coefficient C and the kernel parameter $\gamma$: through painstaking, multiple rounds of cross validation, appropriate values of C and $\gamma$ are selected (a sketch follows at the end of this list).

4) In theory, the Gaussian kernel is no worse than the linear kernel, but that theory rests on spending more time on tuning. So in practice, if a linear kernel can solve the problem, we prefer to use the linear kernel.
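A minimal sketch of points 1) and 3) together, assuming scikit-learn 0.18 or later (the grid values are illustrative): normalization is done inside a Pipeline so each cross-validation fold is scaled with its own training statistics, and C and $\gamma$ are searched jointly:

```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Scale, then fit an RBF-kernel SVC; grid-search C and gamma together.
pipe = Pipeline([('scale', StandardScaler()),
                 ('svc', SVC(kernel='rbf'))])
param_grid = {'svc__C': [0.1, 1, 10, 100],
              'svc__gamma': [0.01, 0.1, 1]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```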

    

(Reprints are welcome; please indicate the source. Questions and discussion welcome: [email protected])