1. Foreword

If you want to work in data mining or machine learning, you need to master the common machine learning algorithms. They include:

  • Supervised learning algorithms: logistic regression, linear regression, decision trees, Naive Bayes, K-nearest neighbors, support vector machines, ensemble algorithms such as AdaBoost, etc.
  • Unsupervised algorithms: clustering, dimensionality reduction, association rules, PageRank, etc.

To understand the principles in detail, I have read classic machine learning books such as the "watermelon book" and Statistical Learning Methods, and listened to some machine learning courses, but the writing always felt abstruse and I lacked the patience to read it through. Theory is everywhere, but practice matters most, so I want to use the plainest possible language to write a "plain-language theory + practice" series on machine learning algorithms.

In my opinion, understanding the idea behind an algorithm and how to use it matters more than understanding its mathematical derivation. The idea gives you an intuitive feeling for why the algorithm is reasonable; the mathematical derivation just expresses that reasonableness in more rigorous language. For example, you can express how sweet a pear is in mathematical language by saying its sugar content is 90%, but only by taking a bite yourself can you truly feel how sweet the pear is, and really understand what 90% sugar content tastes like. If these machine learning algorithms are pears, the primary purpose of this series is to take a bite out of them. There are also several other purposes:

  • Test my own understanding of the algorithms and summarize the theory behind them.
  • Learn the core ideas of these algorithms in an enjoyable way, find the fun in them, and lay a foundation for studying them in depth later.
  • Pair the theory of each algorithm with a practical case, so that what is learned can actually be applied; this exercises programming ability and also deepens the grasp of the theory.
  • Gather all my previous notes and references in one place, so they are easy to look up later.

Learning an algorithm should give you not only the theory, but also some fun and the ability to solve practical problems!

Support Vector Machine (SVM) is a binary classification model. Its basic form is a linear classifier defined in feature space with the largest margin, which distinguishes it from the perceptron. SVM also includes the kernel trick, which makes it in essence a nonlinear classifier. Support vector machines perform very well on binary classification problems, and SVM is an excellent algorithm. As a supervised learning model, SVM is commonly used for pattern recognition, classification, and regression analysis.

But writing this article is genuinely a little difficult, because the derivation of the support vector machine involves a lot of mathematics, such as Lagrange duality, the KKT conditions, and so on. And if I really dug deep into those derivations, this would no longer be a plain-language series, so I am going to skip them here; most of the time, unless you are writing papers, you will not need them.

By "plain language" I mean understanding the basic principles of the algorithm and then actually using it. The details can be filled in when they are needed, and I will post links to the derivations of all the mathematical formulas and the deeper principles at the end. Today I will first explain the theory of the support vector machine in plain language, then show how to call existing tools to use SVM, and finally do a hands-on project on breast cancer detection.

The outline is as follows:

  • How SVM works (finding the maximum classification margin)
  • Hard margin, soft margin, and nonlinear SVM (linearly separable SVM, linear SVM, nonlinear SVM)
  • How SVM solves multi-classification problems (one-vs-rest and one-vs-one)
  • SVM in practice: how to detect breast cancer?

OK, let's go!

2. Support vector machine? The name sounds deep, so let's start with a few exercises!

"Support Vector Machine" (SVM): the name alone sounds lofty. Many people see it, feel it must be too difficult since they cannot even parse the name, and slip away, because watching TV sounds like more fun. But aren't you curious how something that sounds so lofty shows up in everyday life? In fact, it comes from life. Don't believe it? Why don't we do some exercises first?

Exercise 1: There are red and blue balls on the table. Can you separate the two colors with a stick? You say: that's easy, just lay the stick neatly between them. Yes, great.

Exercise 2: Was the first one too easy? Then let's make it harder. There are still red and blue balls on the table, but now their arrangement is irregular, as shown in the picture below. How do you separate the two colors with a single stick?

Now you are in a quandary. Can this still be separated with one stick? Not unless the stick is bent, as follows. So the line has become a curve. If you keep looking at it in the same plane, the red and blue balls are very hard to separate. Is there a more natural way to separate them?

Here you may have a brainwave: slam the table, and the balls instantly rise into the air, as shown below. While they are in the air, a horizontal section appears that just separates the red and blue balls. Here, a two-dimensional plane has become a three-dimensional space, and the original curve has become a plane. This plane is what we call a hyperplane. (You're so good. Doesn't your hand hurt, making the balls fly like that?)

You know what? This is actually quite like a support vector machine. The calculation process of the support vector machine is to help us find that hyperplane, which is our SVM classifier. Isn't that amazing? Such a grand-sounding thing, and all it does is look for a hyperplane like this.

Let’s look at the principle of SVM.

3. Working principle of SVM

Back to exercise 1: I believe many people might draw the dividing line differently from me, for example like the following. These three lines, A, B, and C, can all separate the red and blue balls. Which one is best?


Obviously, line B is closer to the blue balls; in real life, if there were more balls, a blue ball might fall to the right of line B and be judged red. Similarly, line A is closer to the red balls, and with more red balls, one might be mistaken for a blue ball. So line C divides the balls better than lines A and B, because it is more robust. What is robustness? Strong fault tolerance: it does not easily confuse red balls with blue ones.

You might say: well, of course it's easy for you to draw it, but how on earth does a computer find a line like C? (In fact, our classification setting is not a two-dimensional plane but a multi-dimensional space, so line C becomes decision surface C.)

That's a good question, but before we answer it, we need to introduce an SVM-specific concept: the classification margin.

Keeping the direction of the decision surface unchanged and without producing any classification errors, we can move decision surface C until two limits are reached: decision surfaces A and B in the figure. A limit position means that moving any further would produce a classification error. In this case, the surface C midway between the two limit positions A and B is the optimal decision surface.

The distance between a limit position and the optimal decision surface C is the "classification margin" (margin in English). If we rotate the optimal decision surface, you will find that there can be multiple candidate decision surfaces that all divide the data set correctly, but their classification margins may differ. The decision surface with the maximum margin is the SVM's optimal solution. So how do we determine the maximum margin?

Speaking of the classification margin, we need to talk about how this distance is measured. (Margins can be divided into the functional margin and the geometric margin. The former is a relative quantity that changes if the hyperplane's parameters are rescaled, while the latter is the actual distance from a point to a hyperplane, so it can be used to compare different hyperplanes. Since we are comparing many candidate hyperplanes, the geometric margin is what we need here.)
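For readers who want the precise definitions (these are the standard ones from the SVM literature, stated here without derivation), the functional and geometric margins of a sample $(x_i, y_i)$ with respect to the hyperplane $(w, b)$ are:

$$\hat{\gamma}_i = y_i\,(w \cdot x_i + b), \qquad \gamma_i = y_i\left(\frac{w \cdot x_i + b}{\|w\|}\right) = \frac{\hat{\gamma}_i}{\|w\|}$$

Rescaling $w$ and $b$ by the same factor changes the functional margin but leaves the geometric margin untouched, which is exactly why the geometric margin is the one used to compare hyperplanes.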

And if you want to talk about distance, you have to define the hyperplane.

In the example above, when we lift the red and blue balls into three-dimensional space, the decision surface becomes a plane. We can express it as a linear function: in one dimension it represents a point, in two dimensions a line, in three dimensions a plane, and of course there can be more dimensions, so we give this linear function the name hyperplane. The mathematical expression of the hyperplane can be written as w · x + b = 0. In this formula, w and x are vectors in n-dimensional space, where x is the function variable and w is the normal vector. The normal vector is the vector perpendicular to the plane, and it determines the direction of the hyperplane.

What SVM does is find a hyperplane that separates the different classes of samples and maximizes the minimum distance (i.e., the classification margin) between the sample points and the hyperplane.

Next, we define the distance from a sample set to the hyperplane as the shortest distance from any sample in that set to the hyperplane. We use d_i to denote the Euclidean distance from the point x_i to the hyperplane w · x + b = 0; the minimum of the d_i is then the shortest distance from the sample set to the hyperplane. d_i can be calculated as d_i = |w · x_i + b| / ||w||, where ||w|| is the norm of w; this formula can be derived with analytic geometry, which I will not expand on here.

What SVM needs to do is maximize the classification margin. How? We already have the formula for the distance to the hyperplane. For the training samples (x_i, y_i), let γ be the minimum geometric margin between the sample points and the separating hyperplane; every other training sample must have a margin of at least γ. Under this constraint, we want to maximize γ, that is, maximize the geometric margin of the hyperplane (w, b) with respect to the training set. This becomes an optimization problem. Since ||w|| is just a scaling factor, it can be pulled out of the constraint, and we know that the value of the functional margin does not affect the solution (if w and b are scaled to λw and λb, the functional margin also becomes λ times larger, which does not change the optimization problem), so we can simply set it to 1. That is, the functional margin between every sample point and the hyperplane is at least 1. Taking the reciprocal then turns the maximization into an equivalent minimization problem. At this point this is a convex optimization problem: it can be turned into a Lagrangian dual problem, checked against the KKT conditions, and then solved, which requires the mathematical derivation. I won't write that out, because what I want to show is only that the hyperplane can be solved for this way, not how to solve it.
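For reference, the standard hard-margin formulation that the paragraph above sketches (stated without derivation) is:

$$\max_{w,\,b}\ \gamma \quad \text{s.t.}\quad y_i\left(\frac{w \cdot x_i + b}{\|w\|}\right) \ge \gamma,\ \ i = 1, \dots, N$$

and after fixing the functional margin to 1 and turning the maximization of $1/\|w\|$ into a minimization, it becomes the convex problem

$$\min_{w,\,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i\,(w \cdot x_i + b) \ge 1,\ \ i = 1, \dots, N$$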

Let's say I'm done and I have found this hyperplane, like the figure below. It doesn't matter if you don't know the mathematical formulas, but you do have to know a few terms, or you'll sound like an amateur.

  • Separating hyperplane: the optimal hyperplane that separates the samples.
  • Supporting hyperplanes: the two hyperplanes obtained by shifting the separating hyperplane to its two limit positions.
  • Separation margin: the distance between the two supporting hyperplanes (see the short formula note after this list).
  • Support vectors: the samples sitting on the two limit positions. Only the support vectors play a role in determining the separating hyperplane; the other instance points do not. Since the support vectors play the decisive role in determining the separating hyperplane, the method is called the support vector machine.
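A small formula note (a standard result, stated without derivation): with the functional margin fixed to 1, the two supporting hyperplanes are

$$w \cdot x + b = +1 \qquad \text{and} \qquad w \cdot x + b = -1$$

and the separation margin, i.e. the distance between them, is $2/\|w\|$, which is exactly why minimizing $\frac{1}{2}\|w\|^2$ maximizes the margin.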

Do you now see how the support vector machine works? It looks for the hyperplane that separates all the sample points with the greatest confidence, where "greatest confidence" means the maximum distance from the sample points to the hyperplane. The reason for the name "support vector" is that only the points at the limit positions matter for determining the separating hyperplane; the other points do not matter at all, because as long as the limit positions are as far as possible from the hyperplane, it is the best separating plane. The actual solution is a convex optimization problem that uses some mathematics, which we will not expand on here. Does that make sense?

Now there is another problem to discuss: linear separability is usually an idealized situation in real life. The condition is just too strict. Where do you find data that nice, with no noise at all? What if the data contains a bit of noise?

Don't panic; support vector machines are not that fragile. The completely linearly separable case is called hard margin maximization. Where there is hard, there is of course soft; next, let's talk about soft margin maximization, which tolerates a little noise.

4. Hard margin, soft margin and nonlinear SVM

If the data is completely linearly separable, the learned model is called a hard-margin support vector machine. In other words, a hard margin means completely accurate classification with no classification errors allowed. A soft margin allows a certain number of samples to be misclassified.

We know that real-world data is not that "clean"; there will more or less be some noise, so linear separability is an ideal case. At this point we need the soft-margin SVM (approximately linearly separable), for example in the situation below. Isn't that a bit softer? Misclassified points are allowed; nobody's perfect, and besides, it's a machine. The optimization objective then changes slightly: it is similar to the optimization problem above, except that the objective gains an extra slack term. That is, misclassification is allowed, but as little as possible. This can again be solved as a Lagrangian dual problem; again, you don't have to solve it, you just have to understand it. If you want to know how, please refer to my link below (Notes on Support Vector Machines from Statistical Learning Methods).
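For reference, the standard soft-margin objective (with slack variables $\xi_i$ and penalty coefficient $C$, stated without derivation) is:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \dots, N$$

The larger $C$ is, the more heavily misclassification is punished; this is the same C parameter that appears later in sklearn's SVC.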

There is another case, nonlinear support vector machines.

For example, the sample set below is nonlinear: the two classes of data are distributed in the shape of two circles. In this case, no matter how good the classifier is, as long as its mapping function is linear it cannot handle the data, and neither can a linear SVM.

At this point we need to introduce a new concept: the kernel function. It maps samples from the original space into a higher-dimensional feature space, making the samples linearly separable in the new space. We can then use the same derivation as before to do the computation, except that all of it now happens in the new space rather than the old one. Remember when the balls flew off the table? That slam of the table is essentially a kernel mapping: it takes points from a two-dimensional plane and, through an equivalent transformation, puts them into a three-dimensional space, where it becomes possible to find a hyperplane to classify them.

The kernel trick looks roughly like the figure above. In nonlinear SVM, the choice of kernel function is therefore the biggest variable affecting the SVM. The most common kernel functions are the linear kernel, polynomial kernel, Gaussian (RBF) kernel, Laplacian kernel, and sigmoid kernel, or combinations of them. The difference between these functions is the way they map. With these kernels we can project the sample space into a new, higher-dimensional space.
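For reference, the usual forms of the most common kernels (standard definitions, not specific to this article) are:

$$K(x, z) = x \cdot z \ \ \text{(linear)}, \qquad K(x, z) = (x \cdot z + c)^d \ \ \text{(polynomial)}$$

$$K(x, z) = \exp\!\left(-\gamma \|x - z\|^2\right) \ \ \text{(Gaussian / RBF)}, \qquad K(x, z) = \tanh(\gamma\, x \cdot z + c) \ \ \text{(sigmoid)}$$

In the dual problem, every inner product $x_i \cdot x_j$ is simply replaced by $K(x_i, x_j)$, which is why the high-dimensional mapping never has to be computed explicitly.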

For details on what these kernels look like and how to choose them when using sklearn, see the links below.

So much for the theory of support vector machines. How much did you understand?

Let me quickly sort out the reasoning (it is usually glossed over because it really does involve a lot of mathematics, and describing it purely in plain language is a little difficult, so here is what the discussion above was trying to say):


First of all, when we talk about support vector machines, there are actually three cases:

  1. The samples are completely linearly separable. In this case we learn by hard margin maximization, that is, we look for a hyperplane that classifies with no errors at all (the objective function is the hard-margin problem given earlier). To solve it, we convert it to its Lagrangian dual form, apply the Lagrangian multiplier method, check the KKT conditions, and after a series of steps find the optimal w* and b*. This gives the hyperplane g(x) = w*·x + b*, which is the optimal surface, and the resulting model is called a linearly separable support vector machine.
  2. The samples are not completely linearly separable. In this case soft margin maximization is used, that is, classification errors are allowed but kept as few as possible (the objective function is the soft-margin problem given earlier). Solving it also requires converting to the Lagrangian dual form and considering the Lagrangian multiplier method, the KKT conditions, and so on, to find the optimal w*; at this point, however, b* is no longer unique. The resulting model is called a linear support vector machine.
  3. The samples themselves form a nonlinear data set. Then you have to transform them with a kernel function into another space where they become linearly separable, and find the optimal hyperplane there. The optimization problem is again turned into its dual form, where K(x_i, x_j) is the transformation done by the kernel. This can be solved as well; I won't go into detail here. Such a model is called a nonlinear support vector machine.

That's what I want to say about the principle of support vector machines: know what these three cases are and what the optimization objective is, in broad strokes. There is no need to chase every detail. In the age of deep learning, neural networks are everywhere; unless you really want to do research that breaks new ground on SVMs themselves, I suggest you first know what I described above and then focus on learning how to use them. You will find that even if you cannot solve these problems by hand, you can still learn an SVM from sample features and use it to detect breast cancer. And even if you could solve a support vector machine by hand, you could not do breast cancer detection by hand; you would still have to use an SVM tool to do the detection.

Okay, you are probably a bit dizzy by now: what's this, what's that? I understand; if you want to really learn it thoroughly, you do need some mathematics. My level is limited, and trying to stay plain-spoken while avoiding the math at the same time is genuinely hard for me.

Now for a lighter topic: how can a support vector machine separate more than two kinds of balls? What if it is not a two-color problem at all? What the support vector machine has to handle then is a multi-classification problem.

5. How to solve multi-classification problems with SVM

SVM itself is a binary classifier, originally designed for binary classification problems, that is, to answer yes or no. But the problems we want to solve may be multi-class, such as text classification or image recognition.

In this situation we can combine multiple binary classifiers into a multi-class classifier. The common approaches are the "one-vs-rest" (one-to-many) method and the "one-vs-one" method.

  1. One-vs-rest: suppose we want to classify objects into four categories A, B, C, and D. We can take one category as class 1 and all the remaining categories as class 2 each time. This way we construct four SVMs: (1) samples of A as the positive set, B, C and D as the negative set; (2) samples of B as the positive set, A, C and D as the negative set; (3) samples of C as the positive set, A, B and D as the negative set; (4) samples of D as the positive set, A, B and C as the negative set. For K classes, this method needs to train K classifiers. Classification is fast, but training is slow, because each classifier has to be trained on all the samples, and the negative set is far larger than the positive set, which makes the classes unbalanced; moreover, when a new category (the (K+1)-th class) is added, the classifiers have to be rebuilt.
  2. One-vs-one: the idea is to be more flexible during training. We construct an SVM between every pair of classes, so for K classes there will be C(K,2) classifiers. For example, to divide three classes A, B and C we construct three classifiers: (1) classifier 1: A vs. B; (2) classifier 2: A vs. C; (3) classifier 3: B vs. C. When an unknown sample is classified, each classifier gives one classification result, i.e. one vote, and the category with the most votes is the category of the unknown sample. The advantage is that when a new class is added we do not need to retrain all the SVMs, only the classifiers involving the new class; moreover, each single SVM is trained quickly. The disadvantage is that the number of classifiers grows with the square of K, so when K is large, training and testing become slow. (A small sklearn sketch of both strategies follows this list.)
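If you just want to try both strategies, sklearn already provides wrappers for them. The sketch below is a toy illustration of my own (it uses the iris data set as a stand-in for a three-class problem and is not part of the breast cancer project later); note that sklearn's SVC already applies the one-vs-one strategy internally when given multi-class input.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Toy 3-class data set (K = 3)
X, y = load_iris(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)

# One-vs-rest: trains K binary SVMs, each class against all the others
ovr = OneVsRestClassifier(SVC(kernel='rbf', gamma='auto')).fit(train_X, train_y)
print('one-vs-rest accuracy:', ovr.score(test_X, test_y))

# One-vs-one: trains C(K, 2) binary SVMs and lets them vote
ovo = OneVsOneClassifier(SVC(kernel='rbf', gamma='auto')).fit(train_X, train_y)
print('one-vs-one accuracy:', ovo.score(test_X, test_y))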

6. Support vector machine – How to detect breast cancer?

After understanding the principles above, we can use support vector machines ourselves and put them into practice. Excited? You might say: wait, I still don't know how they are solved; I haven't worked through the KKT conditions and Lagrange duality yet. That's okay; you can use them first and learn the math later.

6.1 How to use SVM in sklearn

Python's sklearn toolkit already provides the SVM algorithm; first import it:

from sklearn import svm

SVM can do both classification and regression.

  • For regression with SVM, we can use SVR or LinearSVR. SVR stands for Support Vector Regression.
  • For classification, we use SVC or LinearSVC. SVC stands for Support Vector Classification.

A brief explanation of the differences:


As you can see from its name, LinearSVC is a linear classifier for linearly separable data, and it can only use a linear kernel. For nonlinear data you need SVC. In SVC you can use either a linear kernel function (linear partition) or a high-dimensional kernel function (nonlinear partition).

How do you create an SVM classifier?


First use SVC's constructor: model = svm.SVC(kernel='rbf', C=1.0, gamma='auto'). It has three important parameters: kernel, C, and gamma.

  • kernel selects the kernel function; there are four choices, and the default is rbf, i.e. the Gaussian kernel:


  1. linear: linear kernel function, used when the data is linearly separable; it is fast and effective. Its disadvantage is that it cannot handle linearly inseparable data.
  2. poly: polynomial kernel function, which can map data from a low-dimensional space to a high-dimensional space, but it has many parameters and is computationally expensive.
  3. rbf: Gaussian kernel function (the default). It also maps samples to a high-dimensional space, but requires fewer parameters than the polynomial kernel and usually performs well, which is why it is the default.
  4. sigmoid: sigmoid kernel function, often used in neural network mappings; when the sigmoid kernel is selected, the SVM implements a multi-layer neural network.

  • The parameter C is the penalty coefficient of the objective function, i.e. how heavily misclassified samples are punished; it defaults to 1.0. The larger C is, the higher the classifier's accuracy on the training data, but the lower its tolerance for errors and the worse its generalization ability. Conversely, the smaller C is, the stronger the generalization ability, but the training accuracy drops.
  • The parameter gamma represents the coefficient of the kernel function, which defaults to the reciprocal of the sample feature number, namely gamma = 1 / n_features.

After creating the SVM classifier, you can pass in the training set to train it.


We call model.fit(train_X, train_y), passing in the feature matrix train_X and the class labels train_y of the training set. The feature matrix is the matrix obtained after feature selection (of course, you can also use all the features), and the class labels are the manually labelled result for each sample. The model then trains the classifier automatically. We can then use prediction = model.predict(test_X) to predict results: passing in the feature matrix test_X of the test set gives the predicted classes for the test set.

We can also create a linear SVM classifier:


Use model = svm.LinearSVC(). LinearSVC has no kernel parameter, which limits us to a linear kernel function. Because LinearSVC is optimized for linear classification, it is more efficient than SVC on linearly separable problems with large amounts of data.

If you do not know whether the data set is linearly separable, you can simply create an SVM classifier with the SVC class. LinearSVC, like SVC, uses model.fit(train_X, train_y) and model.predict(test_X) for training and prediction.
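To make the workflow above concrete, here is a minimal self-contained sketch (a toy example of my own, not part of the breast cancer project below) that fits an RBF-kernel SVC on a tiny two-dimensional data set and predicts two new points:

import numpy as np
from sklearn import svm

# Toy 2-D training data: two small clusters labelled 0 and 1
train_X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [3.0, 3.0], [3.2, 2.9], [2.8, 3.1]])
train_y = np.array([0, 0, 0, 1, 1, 1])

# RBF-kernel SVC with the parameters discussed above
model = svm.SVC(kernel='rbf', C=1.0, gamma='auto')
model.fit(train_X, train_y)

# Predict the class of two new points
test_X = np.array([[1.1, 0.9], [3.1, 3.0]])
print(model.predict(test_X))   # expected output: [0 1]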

6.2 SVM for breast cancer detection

6.2.1 Introduction to Data sets

The data set is from the Breast cancer Diagnostic data set in Wisconsin, USA, and can be downloaded here.

Digital images of fine needle aspirates (FNA) of patients' breast masses were collected, and features describing the appearance of the cell nuclei in the images were extracted. Tumors are classified as benign or malignant. A screenshot of part of the data is shown below. The table contains 32 fields in total. In the field names, mean denotes the mean, se the standard error, and worst the largest value (the mean of the three largest values); these are computed for each image, giving 30 feature values (excluding the id field and the diagnosis field). There are really only 10 underlying characteristics (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension), each reported as mean, standard error, and worst value. The feature values are given to four significant digits, and there are no missing values. Of the 569 patients, 357 are benign and 212 malignant.

6.2.2 Project execution process


  1. First we need to load the data source;
  2. In the preparation stage, we need to explore the loaded data, looking at the sample features and their values. In this process we can also use data visualization, which helps us understand the data and the relationships between fields. Then the quality of the data is assessed; if it is not high, data cleaning is needed. After cleaning, we can do feature selection, which makes the subsequent model training easier.
  3. In the classification stage, we choose a kernel function and train. If we do not know whether the data is linearly separable, we can consider SVC(kernel='rbf'), i.e. an SVM classifier with the Gaussian kernel. The trained model is then evaluated on the test set.

Before importing the dataset, you need to use these packages:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.svm import SVC, LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

Import the data set:

# Import the data
data = pd.read_csv("./data.csv")

# Show all columns when printing
pd.set_option('display.max_columns', None)
print(data.columns)
print(data.head(5))
print(data.describe())

# Output (columns and first rows):
Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')
         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0    842302         M        17.99         10.38          122.80     1001.0
1    842517         M        20.57         17.77          132.90     1326.0
2  84300903         M        19.69         21.25          130.00     1203.0
3  84348301         M        11.42         20.38           77.58      386.1
4  84358402         M        20.29         14.34          135.10     1297.0

Next, data cleaning


From the result you can see that among the 32 fields, id has no real meaning and can be dropped. The diagnosis field is either B or M; we can replace them with 0 and 1. The remaining 30 fields fall into three groups: the mean, se, and worst suffixes after the underscore denote different measurements of the same quantities, namely the mean, the standard error, and the largest value respectively.

The code is as follows:

# Divide the feature fields into 3 groups
features_mean = list(data.columns[2:12])
features_se = list(data.columns[12:22])
features_worst = list(data.columns[22:32])

# The id column carries no information, drop it
data.drop("id", axis=1, inplace=True)

# Replace B (benign) with 0 and M (malignant) with 1
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})

Then we need to screen the feature fields. First we look at the relationships between the features_mean variables. Here we can use the DataFrame's corr() function and then a heat map to visualize the result. We also look at the overall counts of benign and malignant tumors.

# Visualize the distribution of the tumor diagnosis results
sns.countplot(data['diagnosis'], label="Count")
plt.show()

# Correlation between the mean features, shown as a heat map
corr = data[features_mean].corr()
plt.figure(figsize=(14, 14))
sns.heatmap(corr, annot=True)
plt.show()

This is the result of the run. In the heat map, the correlation coefficient of each variable with itself on the diagonal is 1, and the lighter the color, the stronger the correlation. You can see that radius_mean, perimeter_mean and area_mean are strongly correlated, and compactness_mean, concavity_mean and concave points_mean are also correlated, so from each group we can keep one field as a representative.

So how do you do feature selection?

The purpose of feature selection is dimensionality reduction: describing the data with a small number of features, which also strengthens the classifier's generalization ability and avoids overfitting. We saw that mean, se and worst are three different measurements of the same quantities, so in feature selection we can keep the mean group and ignore se and worst. Meanwhile, within the mean group, radius_mean, perimeter_mean and area_mean are highly correlated, and compactness_mean, concavity_mean and concave points_mean are highly correlated. From each of these two groups we pick one attribute as the representative, for example radius_mean and compactness_mean.

This allows us to reduce the original 10 attributes to 6, as follows:

# The 6 features kept after feature selection
features_remain = ['radius_mean', 'texture_mean', 'smoothness_mean',
                   'compactness_mean', 'symmetry_mean', 'fractal_dimension_mean']

After selecting the features, we can prepare the training set and test set:

# Hold out 30% of the data as the test set
train, test = train_test_split(data, test_size=0.3)

train_X = train[features_remain]
train_y = train['diagnosis']
test_X = test[features_remain]
test_y = test['diagnosis']

Before training, we need to normalize the data so that all features are on the same scale, avoiding errors caused by differing magnitudes:

# Z-score normalization: each feature dimension has mean 0 and variance 1
ss = StandardScaler()
train_X = ss.fit_transform(train_X)
test_X = ss.transform(test_X)

Finally, we can let SVM do training and prediction:

# Create an SVM classifier (default RBF kernel); SVC was imported above
model = SVC()

# Train on the training set
model.fit(train_X, train_y)

# Predict on the test set and evaluate the accuracy
prediction = model.predict(test_X)
print('Accuracy:', accuracy_score(prediction, test_y))
# Accuracy: 0.9181286549707602

From the output above, the accuracy is decent. You can also try training with all the features, or with a linear support vector machine (LinearSVC), and compare the results.

7. To summarize

I have finally finished the support vector machine; I did not expect it to be this long, so let's sum up. Today we started from the principle of support vector machines, got a first feel for the hyperplane through a small exercise, and then introduced the margin and the three kinds of support vector machines. Each kind is trained for a different type of data set, and behind them hides some quite advanced mathematical derivation. That mathematics is extensive and I did not go through it, only describing roughly how things are computed; the detailed derivations are in the links below for anyone interested.

Then we covered how SVM handles multi-classification, with the two approaches of one-vs-one and one-vs-rest.

Finally, after understanding the principles, we used sklearn to build a support vector machine and applied it to a breast cancer detection example.

I hope today’s learning can also let you harvest full! Come on!

Reference:

  • Note.youdao.com/noteshare?i…
  • Note.youdao.com/noteshare?i…
  • Blog.csdn.net/b285795298/…
  • Note.youdao.com/noteshare?i…
  • Note.youdao.com/noteshare?i…
  • www.jiqizhixin.com/articles/20…
