Related articles:
- Linear regression for machine learning
- Machine learning logistic regression and Python implementation
- Machine learning project actual trading data anomaly detection
- Decision Tree for Machine Learning
- Python implementation of Decision Tree for machine learning
- PCA for Machine Learning
- Feature engineering for machine learning
We now have a batch of processed transaction data from credit card users, and we need to learn a model from this data that can be used to predict whether a new transaction is suspected of credit card fraud.
First, of course, you need to import the necessary Python libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
Let’s take a look at the raw data
data = pd.read_csv("creditcard.csv")
print(data.shape)
data.head()  # print the first 5 rows
(284807, 31)
 | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.0 | 1.359807 | 0.072781 | 2.536347 | 1.378155 | 0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | . | 0.018307 | 0.277838 | 0.110474 | 0.066928 | 0.128539 | 0.189115 | 0.133558 | 0.021053 | 149.62 | 0 |
1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | 0.082361 | 0.078803 | 0.085102 | 0.255425 | . | 0.225775 | 0.638672 | 0.101288 | 0.339846 | 0.167170 | 0.125895 | 0.008983 | 0.014724 | 2.69 | 0 |
2 | 1.0 | 1.358354 | 1.340163 | 1.773209 | 0.379780 | 0.503198 | 1.800499 | 0.791461 | 0.247676 | 1.514654 | . | 0.247998 | 0.771679 | 0.909412 | 0.689281 | 0.327642 | 0.139097 | 0.055353 | 0.059752 | 378.66 | 0 |
3 | 1.0 | 0.966272 | 0.185226 | 1.792993 | 0.863291 | 0.010309 | 1.247203 | 0.237609 | 0.377436 | 1.387024 | . | 0.108300 | 0.005274 | 0.190321 | 1.175575 | 0.647376 | 0.221929 | 0.062723 | 0.061458 | 123.50 | 0 |
4 | 2.0 | 1.158233 | 0.877737 | 1.548718 | 0.403034 | 0.407193 | 0.095921 | 0.592941 | 0.270533 | 0.817739 | . | 0.009431 | 0.798278 | 0.137458 | 0.141267 | 0.206010 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |
5 rows × 31 columns
It can be seen that there are 284,807 samples in total, and each sample has 31 features. The 28 features V1 to V28 are cleaned, anonymized data; although we don't know what they mean, they can be used directly. The remaining features are:
Time: the transaction time
Amount: the total amount of the transaction
Class: the output, indicating whether the transaction involves credit card fraud; 0 is normal and 1 is fraudulent. We call the Class 0 samples negative and the Class 1 samples positive.
Feature scaling
In addition, the Amount feature has a value range that differs significantly from the V1 to V28 features, so we need to scale the Amount feature.
from sklearn.preprocessing import StandardScaler
# print(data['Amount'].values.reshape(-1, 1))
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time', 'Amount'], axis=1)  # drop the columns we no longer use
data.head()
 | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Class | normAmount
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1.359807 | 0.072781 | 2.536347 | 1.378155 | 0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | 0.090794 | . | 0.018307 | 0.277838 | 0.110474 | 0.066928 | 0.128539 | 0.189115 | 0.133558 | 0.021053 | 0 | 0.244964 |
1 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | 0.082361 | 0.078803 | 0.085102 | 0.255425 | 0.166974 | . | 0.225775 | 0.638672 | 0.101288 | 0.339846 | 0.167170 | 0.125895 | 0.008983 | 0.014724 | 0 | 0.342475 |
2 | 1.358354 | 1.340163 | 1.773209 | 0.379780 | 0.503198 | 1.800499 | 0.791461 | 0.247676 | 1.514654 | 0.207643 | . | 0.247998 | 0.771679 | 0.909412 | 0.689281 | 0.327642 | 0.139097 | 0.055353 | 0.059752 | 0 | 1.160686 |
3 | 0.966272 | 0.185226 | 1.792993 | 0.863291 | 0.010309 | 1.247203 | 0.237609 | 0.377436 | 1.387024 | 0.054952 | . | 0.108300 | 0.005274 | 0.190321 | 1.175575 | 0.647376 | 0.221929 | 0.062723 | 0.061458 | 0 | 0.140534 |
4 | 1.158233 | 0.877737 | 1.548718 | 0.403034 | 0.407193 | 0.095921 | 0.592941 | 0.270533 | 0.817739 | 0.753074 | . | 0.009431 | 0.798278 | 0.137458 | 0.141267 | 0.206010 | 0.502292 | 0.219422 | 0.215153 | 0 | 0.073403 |
5 rows × 30 columns
Class imbalance problem
Now, let’s see how many negative samples we have and how many positive samples we have
count_class = pd.value_counts(data['Class'],sort=True).sort_index()
print(count_class)
0 284315
1 492
Name: Class, dtype: int64
We find that there are 284,315 negative samples and only 492 positive samples, a serious imbalance between positive and negative samples, i.e. the class imbalance problem. Let's talk about it more specifically.
Class imbalance means that when training a classifier, the classes in the sample set are not evenly distributed. In the problem above there are 284,807 samples; ideally the numbers of positive and negative samples should be roughly equal, but as shown above there are 284,315 negative samples and only 492 positive samples, which is a serious class imbalance.
Why should we avoid this problem? From the perspective of training the model, if the number of samples of a certain class is very small, the "information" that this class can provide is very limited, which leads to a poorly trained model.
How do we deal with it? There are two ways to balance the samples of different classes:
- Downsampling (undersampling): undersample the class with many samples in the training set (the majority class), discarding some of its samples to alleviate class imbalance.
- Oversampling: oversample the class with few samples in the training set (the minority class), synthesizing new samples to alleviate class imbalance. SMOTE, which we will use later, is a classic oversampling algorithm.
So let’s deal with this problem with downsampling
First, feature X and output variable y are separated
# Separate out feature X and output variable y
X = data.iloc[:,data.columns != 'Class']
y = data.iloc[:,data.columns == 'Class']
# print(X.head())
# print(X.shape)
# print(y.head())
# print(y.shape)
The so-called downsampling is to randomly select a small number of samples from the majority of classes and then combine the original minority class samples as a new training data set.
Specifically, since there are only 492 positive samples, we randomly select 492 of the 284,315 negative samples, and then combine them with the 492 positive samples to form a new training data set.
# Number of positive samples
positive_sample_count = len(data[data.Class == 1])
print("Number of positive samples is:", positive_sample_count)
# Indices of the positive samples
positive_sample_index = np.array(data[data.Class == 1].index)
print("The index corresponding to the positive sample in the dataset is (print the first 5):", positive_sample_index[:5])
# Indices of the negative samples
negative_sample_index = data[data.Class == 0].index
# numpy.random.choice(a, size=None, replace=True, p=None) draws a random sample from the given 1-D array
# replace controls whether sampling is done with replacement; False means each index can be drawn at most once
random_negative_sample_index = np.random.choice(negative_sample_index, positive_sample_count, replace=False)
random_negative_sample_index = np.array(random_negative_sample_index)
print("The index corresponding to the negative sample in the dataset is (print the first 5):", random_negative_sample_index[:5])
under_sample_index = np.concatenate([positive_sample_index, random_negative_sample_index])
under_sample_data = data.iloc[under_sample_index, :]
X_under_sample = under_sample_data.iloc[:, under_sample_data.columns != 'Class']
y_under_sample = under_sample_data.iloc[:, under_sample_data.columns == 'Class']
print('Proportion of positive samples in the new dataset after undersampling:',
      len(under_sample_data[under_sample_data.Class == 1]) / len(under_sample_data))
print('Proportion of negative samples in the new dataset after undersampling:',
      len(under_sample_data[under_sample_data.Class == 0]) / len(under_sample_data))
print('Number of samples in the new dataset after undersampling:', len(under_sample_data))
Number of positive samples is: 492
The index corresponding to the positive sample in the dataset is (print the first 5): [541 623 4920 6108 6329]
The index corresponding to the negative sample in the dataset is (print the first 5): [38971 9434 75592 113830 203239]
Proportion of positive samples in the new dataset after undersampling: 0.5
Proportion of negative samples in the new dataset after undersampling: 0.5
Number of samples in the new dataset after undersampling: 984
Training/test set split and cross-validation
The next thing we need to do is split the current data set into a training set and a test set.
The training set is used to train the model, and the test set is used to evaluate the final model. Within the training set we also perform an operation called cross-validation, which is used for parameter tuning and model selection.
First of all, it needs to be emphasized: cross-validation operates on the training set only, and never touches the test set!
So-called K-fold cross-validation randomly divides the training set into K parts. We then successively take K-1 parts for training and the remaining part for validation, cycling K times (using a different combination of K-1 parts each time), and take the average score as the score of the currently trained model.
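As a quick illustration of this loop, here is a minimal sketch on made-up data (X_demo and y_demo are toy names, not part of this article's analysis); cross_val_score performs exactly the split-train-evaluate cycle and returns one score per fold:

# A minimal 5-fold cross-validation sketch on synthetic data
# (imported from sklearn.cross_validation as in this article; newer scikit-learn uses sklearn.model_selection)
import numpy as np
from sklearn.cross_validation import cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 5))                       # 100 toy samples, 5 features
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)   # a toy binary label

# Train on 4 folds, validate on the remaining fold, repeat 5 times
scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=5, scoring='recall')
print(scores)         # one recall value per fold
print(scores.mean())  # the averaged score used to compare models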
We can randomly split the training and test sets with train_test_split from the sklearn.cross_validation module (moved to sklearn.model_selection in scikit-learn 0.18). It is called as:
X_train, X_test, y_train, y_test = train_test_split(train_data, train_target, test_size=0.4, random_state=0)
train_data: the feature set of the samples to be split
train_target: the corresponding outputs (labels) of the samples
test_size: the proportion of the test set if it is a decimal, or the number of test samples if it is an integer
random_state: the random seed. Passing the same integer (for example 1) with the same other parameters gives the same split on every run; leaving it unset gives a different split each time.
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in scikit-learn >= 0.18
# Split the original (pre-undersampling) data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Split the undersampled data into training and test sets
X_train_under_sample, X_test_under_sample, y_train_under_sample, y_test_under_sample = train_test_split(
    X_under_sample, y_under_sample, test_size=0.3, random_state=0)
print('Training set sample size:', len(X_train_under_sample))
print('Test set sample size:', len(X_test_under_sample))
C:\Anaconda3\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
Next, we will use logistic regression to train the model through cross-validation
Cross-validation can be handled with the KFold class in the sklearn.cross_validation module (sklearn.model_selection.KFold, with a slightly different interface, in newer versions). It is called as:
class sklearn.cross_validation.KFold(n, n_folds=3, shuffle=False, random_state=None)
Parameter description:
n: the total number of samples to be split
n_folds: the number of folds for cross-validation
shuffle: whether to shuffle the data before splitting
A small sketch of what iterating over a KFold object yields follows the parameter list.
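Here is a minimal sketch, using the deprecated sklearn.cross_validation.KFold that this article runs on (in scikit-learn >= 0.18 the equivalent is sklearn.model_selection.KFold(n_splits=5), iterated via kf.split(X)):

from sklearn.cross_validation import KFold

# Split 10 samples into 5 folds; each iteration yields (train indices, validation indices)
fold = KFold(10, n_folds=5, shuffle=False)
for iteration, indices in enumerate(fold, start=1):
    print('fold', iteration,
          '| train indices:', indices[0],
          '| validation indices:', indices[1])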
Model training and selection
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.metrics import confusion_matrix,recall_score,classification_report
def Kfold_for_TrainModel(X_train_data, y_train_data):
    fold = KFold(len(X_train_data), 5, shuffle=False)
    # Candidate values for the regularization parameter C
    c_params = [0.01, 0.1, 1, 10, 100]
    # A DataFrame to hold each C parameter and the corresponding mean recall
    result_tables = pd.DataFrame(columns=['C_parameter', 'Mean recall score'])
    result_tables['C_parameter'] = c_params
    j = 0
    for c_param in c_params:
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')
        recall_list = []
        for iteration, indices in enumerate(fold, start=1):
            # Use L1 regularization
            lr = LogisticRegression(C=c_param, penalty='l1')
            # indices[0] holds the indices of the k-1 folds used for training in this iteration
            # indices[1] holds the indices of the remaining fold used for validation
            lr.fit(X_train_data.iloc[indices[0], :],
                   y_train_data.iloc[indices[0], :].values.ravel())  # .ravel() flattens the labels to one dimension
            # Validate on the remaining fold (the indices stored in indices[1])
            y_undersample_pred = lr.predict(X_train_data.iloc[indices[1], :].values)
            recall = recall_score(y_train_data.iloc[indices[1], :].values,
                                  y_undersample_pred)
            recall_list.append(recall)
            print('Iteration', iteration, 'recall rate is:', recall)
        print('')
        print('Average recall rate is:', np.mean(recall_list))
        print('')
        result_tables.loc[j, 'Mean recall score'] = np.mean(recall_list)
        j = j + 1
    # print(result_tables['Mean recall score'])
    result_tables['Mean recall score'] = result_tables['Mean recall score'].astype('float64')
    best_c_param = result_tables.loc[result_tables['Mean recall score'].idxmax(), 'C_parameter']
    print('*********************************************************')
    print('C parameter corresponding to the best model =', best_c_param)
    print('*********************************************************')
    return best_c_param
best_c_param = Kfold_for_TrainModel(X_train_under_sample, y_train_under_sample)
-------------------------------------------
C parameter:  0.01
-------------------------------------------
Iteration 1 recall rate is: 0.9315068493150684
Iteration 2 recall rate is: 0.9178082191780822
Iteration 3 recall rate is: 1.0
Iteration 4 recall rate is: 0.972972972973
Iteration 5 recall rate is: 0.95454545454546
Average recall rate is: 0.9553666992023157

-------------------------------------------
C parameter:  0.1
-------------------------------------------
Iteration 1 recall rate is: 0.8356164383561644
Iteration 2 recall rate is: 0.863013698630137
Iteration 3 recall rate is: 0.9491525423728814
Iteration 4 recall rate is: 0.9459459459459459
Iteration 5 recall rate is: 0.9090909090909091
Average recall rate is: 0.9005639068792076

-------------------------------------------
C parameter:  1
-------------------------------------------
Iteration 1 recall rate is: 0.8493150684931506
Iteration 2 recall rate is: 0.8904109589041096
Iteration 3 recall rate is: 0.9830508474576272
Iteration 4 recall rate is: 0.9459459459459459
Iteration 5 recall rate is: 0.9090909090909091
Average recall rate is: 0.9155627459783485

-------------------------------------------
C parameter:  10
-------------------------------------------
Iteration 1 recall rate is: 0.863013698630137
Iteration 2 recall rate is: 0.863013698630137
Iteration 3 recall rate is: 0.9830508474576272
Iteration 4 recall rate is: 0.9459459459459459
Iteration 5 recall rate is: 0.8939393939393939
Average recall rate is: 0.9097927169206482

-------------------------------------------
C parameter:  100
-------------------------------------------
Iteration 1 recall rate is: 0.8767123287671232
Iteration 2 recall rate is: 0.863013698630137
Iteration 3 recall rate is: 0.9661016949152542
Iteration 4 recall rate is: 0.9459459459459459
Iteration 5 recall rate is: ...
Average recall rate is: 0.9091426124395708

*********************************************************
C parameter corresponding to the best model = 0.01
*********************************************************
When the regularization parameter C is set to 0.01, the recall of the resulting model is the highest, so we choose the L1-regularized model with C = 0.01 as the final model. Next, we use this trained model to predict the test set and see what happens.
Performance measurement
First, I need to say a few words about performance measures for classification problems, in particular the confusion matrix; from the confusion matrix we can derive a lot of useful performance measures.
In machine learning, the confusion matrix is a visualization tool for evaluating the quality of a classification model. Its rows represent the actual classes and its columns represent the predicted classes.
Let's first explain TP, FP, TN and FN:
TP (True Positive): a sample that is actually positive and is correctly judged as positive
FP (False Positive): a sample that is actually negative but is wrongly judged as positive
TN (True Negative): a sample that is actually negative and is correctly judged as negative
FN (False Negative): a sample that is actually positive but is wrongly judged as negative
Here are some commonly used performance metrics (a small worked example follows):
Precision: the proportion of true positives among the samples predicted as positive, i.e. Precision = TP / (TP + FP)
Recall: the proportion of correctly predicted positives among all actual positive samples, i.e. Recall = TP / (TP + FN)
Accuracy: the proportion of correct predictions among all samples, i.e. Accuracy = (TP + TN) / (TP + TN + FP + FN)
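As a quick sanity check of how these quantities fall out of the confusion matrix, here is a minimal sketch on hand-made labels (toy data, not from the credit card set):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score

# Toy labels: rows of the matrix are actual classes, columns are predicted classes
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]    row 0: actual negatives -> TN=2, FP=1
#  [1 2]]   row 1: actual positives -> FN=1, TP=2

tn, fp, fn, tp = cm.ravel()
print('precision:', tp / (tp + fp), '=', precision_score(y_true, y_pred))
print('recall:   ', tp / (tp + fn), '=', recall_score(y_true, y_pred))
print('accuracy: ', (tp + tn) / (tp + tn + fp + fn), '=', accuracy_score(y_true, y_pred))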
Now, let's plot the confusion matrix.
import itertools
def plot_confusion_matrix(confusion_matrix, classes):
    # print(confusion_matrix)
    # plt.imshow draws the matrix as a heat map
    plt.figure()
    plt.imshow(confusion_matrix, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title('confusion matrix')
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    thresh = confusion_matrix.max() / 2.
    for i, j in itertools.product(range(confusion_matrix.shape[0]), range(confusion_matrix.shape[1])):
        plt.text(j, i, confusion_matrix[i, j],
                 horizontalalignment="center",
                 color="white" if confusion_matrix[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
    print('Precision is:', confusion_matrix[1, 1] / (confusion_matrix[1, 1] + confusion_matrix[0, 1]))
    print('Recall rate is:', confusion_matrix[1, 1] / (confusion_matrix[1, 1] + confusion_matrix[1, 0]))
    print('Accuracy is:', (confusion_matrix[0, 0] + confusion_matrix[1, 1]) /
          (confusion_matrix[0, 0] + confusion_matrix[0, 1] + confusion_matrix[1, 1] + confusion_matrix[1, 0]))
    print('*********************************************************')
lr = LogisticRegression(C=best_c_param, penalty='l1')
lr.fit(X_train_under_sample, y_train_under_sample.values.ravel())
# Get predictions for the undersampled test set
y_undersample_pred = lr.predict(X_test_under_sample.values)
# Build the confusion matrix
conf_matrix = confusion_matrix(y_test_under_sample, y_undersample_pred)
# np.set_printoptions(precision=2)
class_names = [0, 1]
plot_confusion_matrix(conf_matrix, classes=class_names)
Recall rate is: 0.9387755102040817
Accuracy is: 0.918918918918919
*********************************************************
The metrics above are for the test set drawn from the undersampled data; we also need to evaluate on the test set drawn from the full data. So let's apply the model we just trained to the test set that was split off before undersampling.
# Get predictions for the test set of the original data
y_pred = lr.predict(X_test.values)
# Build the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
# np.set_printoptions(precision=2)
class_names = [0, 1]
plot_confusion_matrix(conf_matrix, classes=class_names)
Recall rate is: 0.9251700680272109
Accuracy is: 0.8694217197429865
*********************************************************
Although the recall and accuracy are decent, the precision is far too low: 136 of the 147 positive samples are correctly predicted, but at the cost of 10,894 negative samples being predicted as positive, giving a precision of only about 136 / (136 + 10894) ≈ 1.2%. This really is "better to wrongly condemn a thousand than to let one slip away."
Why does this happen? We selected too few negative samples: out of 284,315 negative samples we used only 492 to train the model, so the model generalizes poorly. This is the drawback of downsampling: because the training set is much smaller than the original sample set, some information is lost, and the discarded samples often carry important information.
Oversampling, SMOTE
Now we process the data with oversampling, using the SMOTE algorithm.
SMOTE stands for Synthetic Minority Oversampling Technique. The idea is to analyze the minority class samples and, based on them, artificially synthesize new samples that are added to the data set.
The algorithm flow is as follows (a small illustrative sketch follows the list):
- For each sample x in the minority class, compute the Euclidean distance from x to all other samples in the minority class set and obtain its k nearest neighbors.
- Determine a sampling ratio according to the degree of imbalance. For each minority class sample x, randomly select several samples from its k nearest neighbors; call a selected neighbor x̃.
- For each randomly selected neighbor x̃, build a new sample from it and the original sample according to x_new = x + rand(0, 1) × (x̃ − x).
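Here is a minimal hand-rolled sketch of that synthesis step (smote_synthesize_one is a hypothetical helper, for illustration only); the actual resampling below uses the imblearn implementation:

import numpy as np

def smote_synthesize_one(x, neighbor, rng=np.random):
    """Create one synthetic sample on the segment between a minority sample x
    and one of its k nearest minority-class neighbors."""
    gap = rng.uniform(0, 1)            # random number in (0, 1)
    return x + gap * (neighbor - x)    # x_new = x + rand(0, 1) * (neighbor - x)

x = np.array([1.0, 2.0])
neighbor = np.array([3.0, 2.5])
print(smote_synthesize_one(x, neighbor))  # lies somewhere between the two points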
Now, let's look at the actual code using imblearn.
# pip install imblearn
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
oversampler = SMOTE(random_state=0)
# Only oversample the training data; newer imbalanced-learn versions use fit_resample instead of fit_sample
X_over_samples, y_over_samples = oversampler.fit_sample(X_train, y_train.values.ravel())
Let's see how many positive and negative samples there are after SMOTE oversampling.
len(y_over_samples[y_over_samples == 1]), len(y_over_samples[y_over_samples == 0])
(199019, 199019)
You can see that the positive and negative samples are now balanced. Now we can retrain and select the model on the new sample set.
Note that the Kfold_for_TrainModel function expects DataFrame inputs, so we wrap the arrays in pd.DataFrame.
best_c_param = Kfold_for_TrainModel(pd.DataFrame(X_over_samples),
pd.DataFrame(y_over_samples))
-------------------------------------------
C parameter:  0.01
-------------------------------------------
Iteration 1 recall rate is: ...
Iteration 2 recall rate is: 0.912
Iteration 3 recall rate is: 0.9129489124936773
Iteration 4 recall rate is: 0.8972829022573392
Iteration 5 recall rate is: 0.8974462044795055
Average recall rate is: 0.9096498895603901

-------------------------------------------
C parameter:  0.1
-------------------------------------------
Iteration 1 recall rate is: 0.9285714285714286
Iteration 2 recall rate is: 0.92
Iteration 3 recall rate is: 0.9145422357106727
Iteration 4 recall rate is: 0.8986521285816574
Iteration 5 recall rate is: 0.8987777456756315
Average recall rate is: 0.9121087077078782

-------------------------------------------
C parameter:  1
-------------------------------------------
Iteration 1 recall rate is: 0.9285714285714286
Iteration 2 recall rate is: 0.92
Iteration 3 recall rate is: 0.9146686899342438
Iteration 4 recall rate is: 0.8987777456756315
Iteration 5 recall rate is: 0.8989536096071954
Average recall rate is: 0.9121942947576999

-------------------------------------------
C parameter:  10
-------------------------------------------
Iteration 1 recall rate is: 0.9285714285714286
Iteration 2 recall rate is: 0.92
Iteration 3 recall rate is: 0.9146686899342438
Iteration 4 recall rate is: 0.8988028690944264
Iteration 5 recall rate is: 0.8991294735387592
Average recall rate is: 0.9122344922277715

-------------------------------------------
C parameter:  100
-------------------------------------------
Iteration 1 recall rate is: 0.9285714285714286
Iteration 2 recall rate is: 0.92
Iteration 3 recall rate is: 0.9146686899342438
Iteration 4 recall rate is: 0.8991169118293617
Iteration 5 recall rate is: ...
Average recall rate is: 0.9122646403303254

*********************************************************
C parameter corresponding to the best model = 100.0
*********************************************************
It can be seen that when the C parameter is 100, the model recall is best, so we use this model to predict the test set.
lr = LogisticRegression(C=best_c_param, penalty='l1')
lr.fit(X_over_samples, y_over_samples)
# lr.fit(pd.DataFrame(X_over_samples), pd.DataFrame(y_over_samples).values.ravel())
# Get predictions for the test set of the original data
y_pred = lr.predict(X_test.values)
# Build the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
# np.set_printoptions(precision=2)
class_names = [0, 1]
plot_confusion_matrix(conf_matrix, classes=class_names)
Recall rate is: 0.9183673469387755
Accuracy is: 0.9752817667918964
*********************************************************
Compared with the undersampling approach above, this result is clearly better.
Welcome to follow my personal WeChat official account, AI computer vision workshop. It posts articles on machine learning, deep learning, computer vision and related topics from time to time; you are welcome to learn and exchange ideas with me.