This article is the third in the machine learning series (and the eighth counting the pre-machine-learning articles). The concepts here are relatively simple, and the focus is on code practice. As mentioned in the last article, we can use linear regression to make predictions, but real life presents not only prediction problems but also classification problems. A simple way to distinguish them is by the type of the predicted value: predicting a continuous variable is regression; predicting a discrete variable is classification.
1. Logistic regression: binary classification
1.1 Understand logistic regression
We artificially impose a boundary on the continuous predicted values: one side of the boundary is defined as 1 and the other side as 0. In this way we turn a regression problem into a classification problem.
As shown in the figure above, we compress the continuous values into the range 0-1 and take 0.5 as the classification decision boundary: if the probability is greater than 0.5, the prediction is 1; if it is less than 0.5, the prediction is 0.
We cannot do arithmetic with infinity and negative infinity, but we can limit the computed values to the range 0-1 with the logistic function (also called the sigmoid function or S-shaped function).
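As a quick illustration (a minimal sketch, not part of the article's dataset code), here is the sigmoid function in NumPy together with the 0.5 decision threshold:

```python
import numpy as np

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
probs = sigmoid(z)                   # ~[0.007, 0.269, 0.5, 0.731, 0.993]
labels = (probs > 0.5).astype(int)   # threshold at 0.5 -> [0, 0, 0, 1, 1]
```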
That is a simple explanation of logistic regression. Next, let's apply it to a real dataset for binary classification practice.
1.2 Code practice – Import the dataset
First, import the required libraries:
```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```
Import the dataset (don't worry about this domain name):
```python
df = pd.read_csv('https://blog.caiyongji.com/assets/hearing_test.csv')
df.head()
```
| age | physical_score | test_result |
|---|---|---|
| 33 | 40.7 | 1 |
| 50 | 37.2 | 1 |
| 52 | 24.7 | 0 |
| 56 | 31.0 | 0 |
| 35 | 42.9 | 1 |
The dataset comes from an experiment with 5,000 participants that studied the effect of age and physical fitness on hearing loss, specifically the ability to hear high-pitched tones. Each participant was assessed and rated for physical ability, and then took an audio test (pass/fail) measuring their ability to hear high frequencies.
- Features: 1. age 2. physical_score (physical ability rating)
- Label: test_result (1 = pass / 0 = fail)
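Before training, a quick check of the data's shape and class balance can be useful; a minimal exploratory sketch (the expected counts in the comments are illustrative, not reported in the article):

```python
print(df.shape)                          # expect (5000, 3): 5,000 rows, 2 features + 1 label
print(df['test_result'].value_counts())  # number of passes (1) vs. fails (0)
```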
1.3 Observe the data
```python
sns.scatterplot(x='age', y='physical_score', data=df, hue='test_result')
```
We use Seaborn to plot a scatter plot of the age and physical-score features, colored by test result.
```python
sns.pairplot(df, hue='test_result')
```
The pairplot method plots the pairwise relationships between features.
We can already make a rough judgment: it is hard to pass the test past age 60, and participants who pass generally have a physical score above 30.
1.4 Train the model
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, plot_confusion_matrix

X = df.drop('test_result', axis=1)
y = df['test_result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=50)

scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)

# define the model
log_model = LogisticRegression()
# train the model
log_model.fit(scaled_X_train, y_train)
# predict on the test set
y_pred = log_model.predict(scaled_X_test)
accuracy_score(y_test, y_pred)
```
After preparing the data, we define a LogisticRegression model, fit the training data with the fit method, and predict with the predict method. Finally, the accuracy_score method reports a model accuracy of 92.2%.
2. Model performance evaluation: accuracy, precision, and recall
How did we arrive at 92.2% accuracy? Let's call the plot_confusion_matrix method to draw the confusion matrix.
```python
plot_confusion_matrix(log_model, scaled_X_test, y_test)
```
Evaluating the 500 test instances yields the matrix shown above. Its cells are defined as follows:
- True Positive (TP): the prediction is positive and the actual result is positive, e.g. 285 in the lower right of the figure.
- True Negative (TN): the prediction is negative and the actual result is negative, e.g. 176 in the upper left.
- False Positive (FP): the prediction is positive but the actual result is negative, e.g. 19 in the lower left.
- False Negative (FN): the prediction is negative but the actual result is positive, e.g. 20 in the upper right.
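If you want the raw counts rather than a plot, scikit-learn's confusion_matrix returns the same four numbers; a minimal sketch, assuming the model and test split from section 1.4:

```python
from sklearn.metrics import confusion_matrix

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
# with the split above, the diagonal holds 176 (TN) and 285 (TP),
# and the two off-diagonal cells hold the 19 and 20 misclassifications
```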
The formula for Accuracy is: Accuracy = (TP + TN) / (TP + TN + FP + FN).
In this example: (285 + 176) / 500 = 92.2%.
The formula for Precision is: Precision = TP / (TP + FP).
In this example: 285 / (285 + 19) ≈ 93.8%.
The formula for Recall is: Recall = TP / (TP + FN).
In this example: 285 / (285 + 20) ≈ 93.4%.
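We can check these numbers both by hand and with scikit-learn's metric helpers; a minimal sketch, assuming the y_test and y_pred from section 1.4:

```python
from sklearn.metrics import precision_score, recall_score

# by hand, from the confusion-matrix counts above
accuracy = (285 + 176) / 500    # 0.922
precision = 285 / (285 + 19)    # ~0.938
recall = 285 / (285 + 20)       # ~0.934

# the same precision and recall via scikit-learn
print(precision_score(y_test, y_pred), recall_score(y_test, y_pred))
```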
We call the classification_report method to verify these results.
```python
print(classification_report(y_test, y_pred))
```
3. Softmax: multiclass classification
3.1 Understand Softmax (multinomial logistic regression)
Both logistic regression and Softmax regression are classification models built on top of linear regression, and there is no essential difference between them: both estimate their parameters by maximum likelihood, with logistic regression assuming a Bernoulli distribution over the label.
Maximum likelihood estimation: in simple terms, maximum likelihood estimation uses the known sample results to infer the parameter values that are most likely (with the highest probability) to have produced those results.
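As a concrete example (not from the original article): if a coin lands heads k times in n independent flips, the Bernoulli likelihood of the heads-probability θ is maximized at the observed frequency:

```latex
L(\theta) = \theta^{k}(1-\theta)^{n-k}
\quad\Rightarrow\quad
\frac{d}{d\theta}\log L(\theta) = \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0
\quad\Rightarrow\quad
\hat{\theta} = \frac{k}{n}
```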
The terms "probability" and "likelihood" are often used interchangeably in English, but they have very different meanings in statistics. Given a statistical model with parameters θ, "probability" describes how plausible a future outcome x is (knowing the parameter values θ), while "likelihood" describes how plausible a particular set of parameter values θ is once the outcome x is known.
The Softmax regression model first computes a score for each class, then applies the Softmax function to these scores to estimate a probability for each class. We predict the class with the highest estimated probability, which is simply the class with the highest score.
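A minimal NumPy sketch of that two-step computation (the class scores below are made up for illustration):

```python
import numpy as np

def softmax(scores):
    """Turn a vector of class scores into probabilities that sum to 1."""
    exps = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])  # hypothetical per-class scores
probs = softmax(scores)             # ~[0.659, 0.242, 0.099]
predicted = probs.argmax()          # 0 -- the class with the highest score
```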
3.2 Code practice – Import the dataset
Import the dataset (don't worry about this domain name):
```python
df = pd.read_csv('https://blog.caiyongji.com/assets/iris.csv')
df.head()
```
| sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
This dataset contains 150 iris samples, with the length and width of the petals and the length and width of the sepals, covering three iris species: setosa, versicolor, and virginica.
- Features: 1. sepal length 2. sepal width 3. petal length 4. petal width
- Label: species (setosa, versicolor, virginica)
3.3 Observe the data
```python
sns.scatterplot(x='sepal_length', y='sepal_width', data=df, hue='species')
```
We use Seaborn to plot a scatter plot of the sepal length and width features, colored by iris species.
```python
sns.scatterplot(x='petal_length', y='petal_width', data=df, hue='species')
```
We use Seaborn to plot a scatter plot of the petal length and width features, colored by iris species.
```python
sns.pairplot(df, hue='species')
```
The pairplot method plots the pairwise relationships between features.
We can make a rough judgment: considering the overall size of the petals and sepals, setosa is the smallest, versicolor is medium-sized, and virginica is the largest.
3.4 Train the model
```python
X = df.drop('species', axis=1)
y = df['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=50)

scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)

# define a multinomial (Softmax) model
softmax_model = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
# train the model
softmax_model.fit(scaled_X_train, y_train)
# predict on the test set
y_pred = softmax_model.predict(scaled_X_test)
accuracy_score(y_test, y_pred)
```
After preparing the data, we define a multinomial LogisticRegression model with multi_class="multinomial" and set the solver to lbfgs. We fit the training data with the fit method and predict with the predict method. Finally, the accuracy_score method reports a model accuracy of 92.1%.
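Because Softmax regression estimates a probability for every class, we can inspect those probabilities directly; a small sketch, assuming the softmax_model and scaled test set defined above:

```python
# per-class probabilities for the first three test samples;
# each row sums to 1, and the prediction is the argmax of the row
print(softmax_model.classes_)  # e.g. ['setosa' 'versicolor' 'virginica']
print(softmax_model.predict_proba(scaled_X_test[:3]).round(3))
```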
We call the classification_report method to check the precision, recall, and accuracy.
```python
print(classification_report(y_test, y_pred))
```
3.5 Extension: Plot the petal-based classification
Using only the petal length and width features, we draw the classification regions for the iris species.
```python
X = df[['petal_length', 'petal_width']].to_numpy()
y, _ = df['species'].factorize()  # setosa=0, versicolor=1, virginica=2

softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
softmax_reg.fit(X, y)

# build a grid covering the petal feature space
x0, x1 = np.meshgrid(np.linspace(0, 8, 500).reshape(-1, 1),
                     np.linspace(0, 3.5, 200).reshape(-1, 1))
X_new = np.c_[x0.ravel(), x1.ravel()]

y_proba = softmax_reg.predict_proba(X_new)
y_predict = softmax_reg.predict(X_new)

zz1 = y_proba[:, 1].reshape(x0.shape)  # probability of versicolor, drawn as contour lines
zz = y_predict.reshape(x0.shape)       # predicted class over the grid

plt.figure(figsize=(10, 4))
plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris virginica")
plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris versicolor")
plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris setosa")

from matplotlib.colors import ListedColormap
custom_cmap = ListedColormap(['#fafab0', '#9898ff', '#a0faa0'])

plt.contourf(x0, x1, zz, cmap=custom_cmap)
contour = plt.contour(x0, x1, zz1, cmap=plt.cm.brg)
plt.clabel(contour, inline=1, fontsize=12)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="center left", fontsize=14)
plt.axis([0, 7, 0, 3.5])
plt.show()
```
The resulting classification of irises by petal features looks like this:
4. Summary
This article focuses on hands-on practice rather than conceptual understanding, and you should get a feel for the process by coding along. By the end of this article, you should be familiar with the core concepts of machine learning. Let's briefly summarize:
- Classification of machine learning
- The industrial workflow of machine learning
- Concepts of features, labels, instances, and models
- Overfitting, underfitting
- Loss function, least square method
- Gradient descent, learning rate
- Linear regression, logistic regression, polynomial regression, stepwise regression, ridge regression, Lasso regression, and ElasticNet regression: the most commonly used regression techniques
- Sigmoid function, Softmax function, maximum likelihood estimation
If anything is still unclear, please refer to:
- Machine learning (2): Understand linear regression and gradient descent and make simple predictions
- Machine learning (1): 5 minutes to understand and practice machine learning
- Pre-machine learning (5): Master common Matplotlib usage in 30 minutes
- Pre-machine learning (4): Master common Pandas usage in 30 minutes
- Pre-machine learning (3): Master common NumPy usage in 30 minutes
- Pre-machine learning (2): Master common Jupyter Notebook usage in 30 minutes
- Pre-machine learning (1): Mathematical symbols and Greek letters