In this exercise, we start with some simple 2D data sets and apply SVMs to them to see how they work.
1. Linear Kernel SVM
As the name implies, an SVM with a linear kernel is mainly used for classification problems whose decision boundary is linear.
1.1 Display of original data
import scipy.io as sio
import pandas as pd

# Fetch data
mat_data = sio.loadmat('./data/ex6data1.mat')
data = pd.DataFrame(mat_data['X'], columns=['x1', 'x2'])
data['y'] = mat_data['y']
Data display function:
import matplotlib.pyplot as plt

def show_data(data):
    # Split the samples by label: data['y'].isin([1]) gives a boolean mask (False, True, False, ...)
    positive = data[data['y'].isin([1])]
    negative = data[data['y'].isin([0])]
    # Create the canvas
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.scatter(positive['x1'], positive['x2'], s=50, c='g', marker='o', label='T')
    ax.scatter(negative['x1'], negative['x2'], s=50, c='r', marker='x', label='F')
    ax.legend()  # show the labels
    ax.set_xlabel('x1')
    ax.set_ylabel('x2')
    ax.set_title('Source Data')
    plt.show()
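Calling the helper on the DataFrame built above draws the scatter plot shown next:

show_data(data)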
Data Display:
1.2 Observe the influence of changing parameter C on the model
C acts as the inverse of the regularization hyperparameter λ (roughly C = 1/λ): the larger λ is, the smaller C is. A larger C therefore means less regularization, so the model bias is smaller and the variance is larger.
When C is equal to 1, draw the decision boundary
Training model:
import sklearn.svm

# When C = 1
svc1 = sklearn.svm.LinearSVC(C=1, loss='hinge')
svc1.fit(data[['x1', 'x2']], data['y'])
svc1.score(data[['x1', 'x2']], data['y'])
Draw decision boundaries:
# Compute the functional distance from each sample point to the decision boundary
data['SVM1 Boundary'] = svc1.decision_function(data[['x1', 'x2']])
# Draw the decision boundary by coloring each point with its signed distance
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(data['x1'], data['x2'], s=50, c=data['SVM1 Boundary'], cmap='RdBu')
ax.set_title('SVM (C=1) Decision Boundary')
plt.show()
Results:
It can be seen that the classes are well separated, and the outlier lies far from the decision boundary, hardly affecting the fit of the model.
When C is equal to 100, draw the decision boundary
Training model:
# When C = 100
svc100 = sklearn.svm.LinearSVC(C=100, loss='hinge')
svc100.fit(data[['x1', 'x2']], data['y'])
svc100.score(data[['x1', 'x2']], data['y'])
Draw decision boundaries:
# Compute the functional distance from each sample point to the decision boundary
data['SVM100 Boundary'] = svc100.decision_function(data[['x1', 'x2']])
# Draw the decision boundary by coloring each point with its signed distance
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(data['x1'], data['x2'], s=50, c=data['SVM100 Boundary'], cmap='RdBu')
ax.set_title('SVM (C=100) Decision Boundary')
plt.show()
Results:
It can be seen that the decision boundary is greatly affected by outliers, and the model is overfitted.
2. Gaussian kernel SVM
In machine learning, the (Gaussian) radial basis function kernel, or RBF kernel, is a common kernel function. It can be understood as a similarity measure on the original features, and it is the most commonly used kernel in support vector machine classification. The mathematical formula is as follows:
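$$K(x, x') = \exp\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right)$$

where σ controls the width of the Gaussian, i.e. how quickly the similarity decays with distance.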
2.1 Data Display
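The second data set is loaded the same way as the first; a minimal sketch, assuming it is stored as ./data/ex6data2.mat (the file name is an assumption mirroring the first data set):

# Assumed file name: ./data/ex6data2.mat
mat_data2 = sio.loadmat('./data/ex6data2.mat')
data2 = pd.DataFrame(mat_data2['X'], columns=['x1', 'x2'])
data2['y'] = mat_data2['y']
show_data(data2)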
2.2 Model training
svc = sklearn.svm.SVC(C=100, kernel='rbf', gamma=10, probability=True)
# Train the model
svc.fit(data2[['x1', 'x2']], data2['y'])
svc.score(data2[['x1', 'x2']], data2['y'])
Draw decision boundaries:
# predict_proba returns an ndarray of shape (n_samples, n_classes);
# each column holds the probability of the corresponding class.
# Select only the column for the positive class.
predict_prob = svc.predict_proba(data2[['x1', 'x2']])[:, 1]
# Draw the decision boundary
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(data2['x1'], data2['x2'], s=30, c=predict_prob, cmap='Reds')
plt.show()
Results:
It can be seen that an SVM with a Gaussian kernel fits the nonlinear decision boundary well.
3. Manual cross-validation and sklearn grid search
Use cross-validation to find which (C, σ) parameter combination is optimal. σ is a parameter of the Gaussian kernel function: when σ is large, the model tends towards high bias and low variance.
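Note that scikit-learn's RBF kernel is parameterized by gamma rather than σ: its form is $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$, so gamma corresponds to $1/(2\sigma^2)$. The code below simply treats the candidate values as gamma directly.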
3.1 Data display and parameter combination
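The training and validation frames used below (data3_train, data3_cv) can be built as in the following sketch, assuming the third data set is ./data/ex6data3.mat and ships with separate training (X, y) and validation (Xval, yval) arrays; the file and field names are assumptions:

# Assumed file name and field names (X, y, Xval, yval)
mat_data3 = sio.loadmat('./data/ex6data3.mat')
data3_train = pd.DataFrame(mat_data3['X'], columns=['x1', 'x2'])
data3_train['y'] = mat_data3['y']
data3_cv = pd.DataFrame(mat_data3['Xval'], columns=['x1', 'x2'])
data3_cv['y'] = mat_data3['yval']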
Data Display:
Set 8*8=64 different parameter combinations.
# Set possible parameter values
paras = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]
# Build all (C, sigma) combinations
combine = []
for C in paras:
    combine += [(C, sigma) for sigma in paras]
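As a side note, an equivalent way to build the same list is itertools.product:

import itertools
combine = list(itertools.product(paras, paras))  # 64 (C, sigma) pairs in the same order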
3.2 Extraction of optimal parameters
Calculate cross validation accuracy:
acc = []
# Calculate cross validation accuracy
for C, sigma in combine:
    svc = sklearn.svm.SVC(C=C, gamma=sigma)
    svc.fit(data3_train[['x1', 'x2']], data3_train['y'])
    acc.append(svc.score(data3_cv[['x1', 'x2']], data3_cv['y']))
Extract the best parameters
import numpy as np

best_para = combine[np.argmax(acc)]
best_para
The optimal value is acc=0.965, best_para=(3,30).
Draw decision boundaries:
Note that the decision boundaries of the training set are drawn here.
svc = sklearn.svm.SVC(C=best_para[0], gamma=best_para[1], probability=True)
svc.fit(data3_train[['x1', 'x2']], data3_train['y'])
# predict_proba returns an ndarray of shape (n_samples, n_classes);
# each column holds the probability of the corresponding class.
# Select only the column for the positive class.
predict_prob = svc.predict_proba(data3_train[['x1', 'x2']])[:, 1]
# Draw the decision boundary
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(data3_train['x1'], data3_train['x2'], s=30, c=predict_prob, cmap='Reds')
plt.show()
Results:
Evaluate the model on the cross-validation set:
from sklearn import metrics

# F1-score evaluation
y_pred = svc.predict(data3_cv[['x1', 'x2']])
print(metrics.classification_report(data3_cv['y'], y_pred))
3.3 Using grid search for cross-validation
Code:
from sklearn import svm
from sklearn.model_selection import GridSearchCV

# sklearn GridSearchCV
parameters = {'C': paras, 'gamma': paras}
svc = svm.SVC()
# If n_jobs=-1, all CPUs are used
cscv = GridSearchCV(svc, parameters, n_jobs=-1)
cscv.fit(data3_train[['x1', 'x2']], data3_train['y'])
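The figures reported below can be read off the fitted search object via its best_params_ and best_score_ attributes:

print(cscv.best_params_)  # best (C, gamma) combination found by the search
print(cscv.best_score_)   # mean cross-validated accuracy of that combination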
Best_paras = {'C': 10, 'gamma': 30}, acc = 0.9004739336492891.
This differs from the manual cross-validation result because, during grid search, part of the training set is held out as a validation set in each fold; that is, the whole training set is not used for training.
F1-score:
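A sketch of how this report could be produced; cscv.predict uses the best estimator refitted on the full training set:

y_pred = cscv.predict(data3_cv[['x1', 'x2']])
print(metrics.classification_report(data3_cv['y'], y_pred))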