SVM
SVM stands for support vector machine, a model commonly used for binary classification. Its main ideas are:
- It is a linear classifier that maximizes the margin in feature space.
- When the samples are not linearly separable, they are mapped from the low-dimensional input space into a higher-dimensional feature space by a nonlinear mapping, and the problem can then be analyzed linearly in that higher-dimensional feature space.
Structural risk
For a given loss function $L(y, f(x))$, the empirical risk can be computed on a sample set, and empirical risk minimization means minimizing this quantity over the sample set:

$$R_{emp}(f) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i))$$

If we could capture all the data, we would want the loss over the entire data distribution to be as small as possible, which would mean the model is as good as possible. In most cases, however, it is impossible to obtain all the data, so the expected risk is defined with respect to the joint distribution $P(X, Y)$ of the samples:

$$R_{exp}(f) = E_P[L(Y, f(X))] = \int L(y, f(x))\, dP(x, y)$$
As the sample size approaches infinity, the empirical risk tends to the expected risk. In practice, however, the training samples are very limited. If only the empirical risk is considered, overfitting may occur: the model predicts the training samples well, but its generalization ability is limited and its predictions on data outside the training set may be poor. To overcome this problem, structural risk is introduced.
Structural risk = empirical risk + confidence risk. The confidence risk is usually expressed as a regularization term, hence the name regularization. The regularization term reflects the complexity of the model: the higher the complexity, the larger its value. A common choice is a norm of the model parameters:

$$R_{srm}(f) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f)$$

where $J(f)$ measures the model complexity and $\lambda \geq 0$ balances the two terms.
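As a concrete illustration, here is a minimal sketch of this quantity for a linear model $f(x) = w^T x + b$, with the hinge loss as the empirical risk and the squared L2 norm of $w$ as the regularization term; the function name, the toy data, and the value of $\lambda$ (`lam`) are only illustrative assumptions.

```python
import numpy as np

def structural_risk(w, b, X, y, lam=0.1):
    """Structural risk = empirical risk (hinge loss) + regularization (squared L2 norm of w)."""
    scores = X @ w + b                         # f(x_i) = w^T x_i + b
    hinge = np.maximum(0.0, 1.0 - y * scores)  # per-sample hinge loss
    empirical_risk = hinge.mean()              # average loss over the sample set
    regularization = lam * np.dot(w, w)        # grows with model complexity
    return empirical_risk + regularization

# toy usage: 4 samples in 2 dimensions, labels in {+1, -1}
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([0.5, 0.5]), 0.0
print(structural_risk(w, b, X, y))
```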
Alternatively, if structural risk is not used, the idea of cross-validation can be applied: the candidate models are all trained and compared, and the one that predicts the validation set most accurately is considered the better model.
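For instance, assuming scikit-learn is available, the comparison could look like the sketch below; the synthetic data set and the candidate values of C are placeholders for whatever models are actually being compared.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# a synthetic binary classification problem stands in for the real data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# compare candidate models by their mean accuracy on held-out validation folds
for C in (0.1, 1.0, 10.0):
    scores = cross_val_score(SVC(kernel="linear", C=C), X, y, cv=5)
    print(f"C={C}: mean validation accuracy = {scores.mean():.3f}")
```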
Margin maximization
Let us start with an intuitive view of the SVM. As in the figure below, suppose there are two different classes of data distributed in space; the task of the SVM is to find a separating hyperplane that divides them into two classes. Clearly, more than one hyperplane can divide them, so which one is the best? We need a criterion for choosing the best hyperplane.
For a given sample set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $y_i \in \{+1, -1\}$ is the class label, suppose the hyperplane equation is

$$w^T x + b = 0$$
Here $w$ is the normal vector and $b$ is the displacement term; the hyperplane is determined by the normal vector (which fixes its direction) and the displacement term (which fixes its distance from the origin). The distance from a sample point $x$ to the hyperplane is

$$r = \frac{|w^T x + b|}{\|w\|}$$
If the hyperplane correctly separates all samples, then the scale of $w$ and $b$ can be chosen so that for every $(x_i, y_i) \in T$: if $y_i = +1$ then $w^T x_i + b \geq +1$, and conversely, if $y_i = -1$ then $w^T x_i + b \leq -1$.
On both sides of the hyperplane there are some sample points nearest to it (these points are called support vectors). Translating the middle hyperplane in the figure toward each side gives two hyperplanes, whose equations are respectively

$$w^T x + b = +1 \quad \text{and} \quad w^T x + b = -1$$
The distance between these two hyperplanes is called the margin; it is the sum of the distances from the support vectors on the two hyperplanes to the middle hyperplane:

$$L = \frac{2}{\|w\|}$$
Our task is now to maximize the margin under the constraints, that is, to find $w$ and $b$ such that $L$ is maximized:

$$\max_{w, b} \frac{2}{\|w\|} \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1, \; i = 1, 2, \ldots, N$$

Maximizing $\frac{2}{\|w\|}$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$, so the optimization problem becomes

$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1, \; i = 1, 2, \ldots, N$$
Solving this optimization problem yields the maximum-margin separating hyperplane.
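As a rough sketch of what solving this problem looks like in practice, assuming scikit-learn is available: a linear-kernel SVC with a very large C approximates the hard-margin problem above (scikit-learn itself solves the soft-margin formulation), and the toy data are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# two linearly separable clusters, labels +1 / -1
X = np.array([[3.0, 3.0], [4.0, 3.0], [3.5, 4.0],
              [1.0, 1.0], [0.5, 2.0], [1.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# a very large C approximates the hard-margin problem min 1/2 ||w||^2
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]          # normal vector of the separating hyperplane
b = clf.intercept_[0]     # displacement term
print("w =", w, "b =", b)
print("margin L = 2/||w|| =", 2.0 / np.linalg.norm(w))
```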
The dual problem
The above optimization problem is a convex quadratic program, which can be solved by the method of Lagrange multipliers. We first introduce the Lagrangian function

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{N} \alpha_i \left(1 - y_i(w^T x_i + b)\right)$$

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_N)$ is the vector of Lagrange multipliers, with $\alpha_i \geq 0$.
To obtain the solution of the dual problem, we first minimize the Lagrangian with respect to $w$ and $b$, and then maximize it with respect to $\alpha$:

$$\max_{\alpha} \min_{w, b} L(w, b, \alpha)$$
For the minimization, take the partial derivatives with respect to $w$ and $b$ and set them to zero, which gives

$$w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$$
Substituting $w$ back into the Lagrangian, $w$ and $b$ are eliminated, and the problem becomes a function of $\alpha$ alone:

$$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \; \alpha_i \geq 0$$
This problem can be solved by a quadratic programming algorithm, by numerical methods, or by the SMO algorithm. Once $\alpha$ is found, $w$ and $b$ can be recovered from it.
There is also another very important constraint, the KKT complementary slackness condition $\alpha_i\left(y_i(w^T x_i + b) - 1\right) = 0$. It shows that every sample satisfies either $\alpha_i = 0$ or $y_i(w^T x_i + b) = 1$. When the corresponding Lagrange multiplier is nonzero, the sample is a support vector; when it is zero, the sample is not a support vector. It follows that most Lagrange multipliers are zero, because most samples are not support vectors.
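This can be seen directly in a fitted model. Assuming scikit-learn and the same toy data as the earlier sketch, `dual_coef_` stores $\alpha_i y_i$ for the support vectors only (all other samples have $\alpha_i = 0$ and are simply omitted), and $w = \sum_i \alpha_i y_i x_i$ can be recovered from it.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[3.0, 3.0], [4.0, 3.0], [3.5, 4.0],
              [1.0, 1.0], [0.5, 2.0], [1.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# only the support vectors appear here; samples with alpha_i = 0 are left out
print("support vector indices:", clf.support_)
print("alpha_i * y_i for the support vectors:", clf.dual_coef_[0])

# w = sum_i alpha_i y_i x_i, summed over the support vectors only
w = clf.dual_coef_[0] @ clf.support_vectors_
print("w from the dual:", w)
print("w from coef_:   ", clf.coef_[0])
```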
Kernel function
The kernel function is introduced to handle the linearly inseparable case. The samples discussed so far are linearly separable, but for real-world problems it may be impossible to find a hyperplane that directly divides the samples into two classes. As shown below, such samples cannot be separated with a single straight cut.
For this kind of linearly inseparable problem, the samples can be mapped from the original space into a higher-dimensional feature space in which they become linearly separable. For example, a two-dimensional space can be mapped into a three-dimensional space, and a separating hyperplane can be found in that three-dimensional space, as the sketch below illustrates. We assume that there always exists a higher-dimensional space into which the original space can be mapped so that the samples become separable.
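A minimal sketch of this idea, assuming scikit-learn: two concentric rings are not linearly separable in two dimensions, but adding $x_1^2 + x_2^2$ as a third coordinate (a hand-written mapping chosen purely for illustration) lets a linear classifier separate them in three dimensions.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# two concentric rings: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

def phi(X):
    """Map (x1, x2) -> (x1, x2, x1^2 + x2^2); the third coordinate separates the rings."""
    return np.c_[X, (X ** 2).sum(axis=1)]

linear_2d = SVC(kernel="linear").fit(X, y)
linear_3d = SVC(kernel="linear").fit(phi(X), y)
print("training accuracy in the original 2-D space:", linear_2d.score(X, y))
print("training accuracy after mapping to 3-D:     ", linear_3d.score(phi(X), y))
```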
Suppose $\phi(x)$ is the feature vector after the mapping; then the model in feature space becomes

$$f(x) = w^T \phi(x) + b$$

and the corresponding optimization problem is

$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T \phi(x_i) + b) \geq 1, \; i = 1, 2, \ldots, N$$

As before, we solve the dual problem

$$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j) \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \; \alpha_i \geq 0$$

Computing $\phi(x_i)^T \phi(x_j)$ directly can be difficult, so suppose further that there exists a function

$$k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$

which computes the inner product in feature space directly from the original inputs. The solution can then be written as

$$f(x) = w^T \phi(x) + b = \sum_{i=1}^{N} \alpha_i y_i\, k(x, x_i) + b$$

Here $k$ is the kernel function.
Common kernel functions include the linear kernel, the polynomial kernel, the Gaussian (RBF) kernel, the sigmoid kernel, and so on.
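Assuming scikit-learn, the sketch below runs each of these kernels on the same ring-shaped data; it is only meant to show how the kernel is switched, not to suggest which kernel is best in general.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# ring-shaped data that a linear kernel cannot separate well
X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)

# the common kernels mentioned above, with scikit-learn's default parameters
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(f"{kernel:8s} kernel: mean cross-validation accuracy = {scores.mean():.3f}")
```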