
Data preprocessing

import pandas as pd
import numpy as np
# csv_df is the raw abalone data loaded earlier (e.g. via pd.read_csv)
csv_df.columns = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight',
                  'Shucked weight', 'Viscera weight', 'Shell weight', 'result']
df = pd.DataFrame(csv_df)
df.head(5)

# Encode the label as integer codes (factorize) and one-hot encode the categorical variables (get_dummies)
label = pd.DataFrame({'label': pd.factorize(df['result'])[0]})
dummies = pd.get_dummies(df.drop(columns=['result']))
data = pd.concat([dummies, label], axis=1)  # concatenate features and label
data.head(10)

pandas.factorize converts string (categorical) features into numeric codes.
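A minimal sketch of factorize on a toy column (the values are made up for illustration):

import pandas as pd

s = pd.Series(['M', 'F', 'I', 'M'])  # hypothetical Sex values
codes, uniques = pd.factorize(s)
print(codes)    # [0 1 2 0] -> one integer code per row
print(uniques)  # Index(['M', 'F', 'I'], dtype='object')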

Why is data one-hot encoded? 🤔

For example, blood type is generally divided into four types: A, B, O and AB, which form an unordered multi-class variable. When entering such data, we often assign the values 1, 2, 3 and 4 in order to quantify it.

Numerically, the values 1, 2, 3 and 4 carry an order from small to large, but in fact no such relationship exists between the four blood types; they should be equal and independent. Plugging the assigned values 1, 2, 3, 4 into a regression model is therefore unreasonable, so we convert them into dummy variables.

One-hot encoding replaces a categorical variable with a "dummy matrix". If one column of a DataFrame contains K distinct values, you can derive a K-column matrix or DataFrame of 0s and 1s. Pandas provides the get_dummies function for this. The columns parameter, as in pd.get_dummies(df, columns=["key", "sorce"]), specifies which columns to encode; the result contains both the encoded and the untouched columns. The prefix argument adds a prefix to the dummy variable names, as sketched below.
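A hedged sketch of get_dummies on a made-up frame (column names are purely illustrative):

import pandas as pd

df = pd.DataFrame({'blood': ['A', 'B', 'O', 'AB'], 'x': [1, 2, 3, 4]})
# Encode only the 'blood' column; the numeric column 'x' passes through unchanged
encoded = pd.get_dummies(df, columns=['blood'], prefix='blood')
print(encoded.columns.tolist())  # ['x', 'blood_A', 'blood_AB', 'blood_B', 'blood_O']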

EDA(exploratory data analysis)

import matplotlib.pyplot as plt
import seaborn as sns
data.describe()

corr = data.corr()
ax = plt.subplots(figsize=(14, 8))[1]
ax.set_title('Correlation')
sns.heatmap(corr, annot=True, cmap='RdBu')
plt.show()

SMOTE (Synthetic Minority Oversampling Technique) is an improved oversampling scheme for the minority class. Because plain random oversampling increases the minority class simply by copying existing samples, the model easily overfits: what it learns is too specific to generalize. The basic idea of SMOTE is instead to analyse the minority-class samples and synthesize new minority samples from them, which are then added to the data set.
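A minimal sketch of how SMOTE from imblearn is typically applied (the imbalanced toy data is generated just for illustration):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build an artificial imbalanced binary problem (roughly 9:1)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=10)
print('before:', Counter(y))

smo = SMOTE(random_state=10)
X_res, y_res = smo.fit_resample(X, y)   # synthesize new minority samples
print('after: ', Counter(y_res))        # the two classes are now balanced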

Model training and testing

Code

def showClassificationData(X, Y, ifFit=False, overSampling='None'):
    """Project the data to 2D with PCA and plot it by class.

    :param X: design matrix (without labels)
    :param Y: labels
    :param ifFit: whether to fit a regression line when drawing
    :param overSampling: resampling method ('None', 'SMOTE', 'SMOTETomek' or 'SMOTEENN')
    """
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.decomposition import PCA
    from imblearn.over_sampling import SMOTE
    from imblearn.combine import SMOTETomek, SMOTEENN

    def getSMOTE():
        return SMOTE(random_state=10)

    def getSMOTETomek():
        return SMOTETomek(random_state=10)

    def getSMOTEENN():
        return SMOTEENN(random_state=10)

    switch = {'SMOTE': getSMOTE, 'SMOTETomek': getSMOTETomek, 'SMOTEENN': getSMOTEENN}
    if overSampling != 'None':
        overSampler = switch.get(overSampling)()
        X, Y = overSampler.fit_resample(X, Y)

    pca = PCA(n_components=2)
    X = pca.fit_transform(X)
    X = pd.DataFrame(X)
    Y = pd.DataFrame(Y)
    X = pd.concat([X, Y], ignore_index=True, axis=1)
    X.columns = ['firstIngridient', 'secondIngridient', 'label']
    sns.lmplot(x='firstIngridient', y='secondIngridient', data=X, hue='label', fit_reg=ifFit)
    plt.show()


def showHistogram(Y, ifLog=False):
    """Plot a histogram of the labels.

    :param Y: labels
    :param ifLog: whether to use a log scale on the y axis
    """
    import matplotlib.pyplot as plt
    import seaborn as sns

    Y = pd.DataFrame(Y)
    sns.set_style('whitegrid')
    fig, ax = plt.subplots()
    Y.hist(ax=ax, bins=100)
    if ifLog:
        ax.set_yscale('log')
    ax.tick_params(labelsize=14)
    ax.set_xlabel('count', fontsize=14)
    ax.set_ylabel('Y', fontsize=14)
    print('showHistogram:')
    plt.show()


def multiClassificationModel(X, Y, ifSmote=False, testSize=0.2, cv=5, ifKFold=False):
    """Test naive Bayes, SVM, logistic regression, XGBoost, KNN and decision tree on the data set.

    :param X: design matrix (without labels)
    :param Y: labels
    :param ifSmote: whether to resample the training set with SMOTE
    :param testSize: test set size, default 0.2
    :param cv: number of folds for cross-validation
    :param ifKFold: whether to also run K-fold cross-validation
    """
    print('Classification performance of the different models:')
    from sklearn import naive_bayes, svm, linear_model, neighbors, tree
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
    import xgboost
    from imblearn.over_sampling import SMOTE

    def train(model, x_train, y_train):
        return model.fit(x_train, y_train)

    def printScore(model, x, y):
        print('recall: ', recall_score(y, model.predict(x)))
        print('precision: ', precision_score(y, model.predict(x)))
        print('f1: ', f1_score(y, model.predict(x)))
        print('accuracy: ', accuracy_score(y, model.predict(x)))

    def printKFoldScore(model, x, y, cv):
        print('recall: ', cross_val_score(model, x, y, cv=cv, scoring='recall').mean())
        print('precision: ', cross_val_score(model, x, y, cv=cv, scoring='precision').mean())
        print('f1: ', cross_val_score(model, x, y, cv=cv, scoring='f1').mean())
        print('accuracy: ', cross_val_score(model, x, y, cv=cv, scoring='accuracy').mean())

    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=testSize, random_state=1)
    if ifSmote:
        smo = SMOTE(random_state=10)
        x_train, y_train = smo.fit_resample(x_train, y_train)

    # KNN
    KNN = neighbors.KNeighborsClassifier()
    # decision tree
    decision_tree = tree.DecisionTreeClassifier()
    # naive Bayes
    NB = naive_bayes.GaussianNB()
    # SVM
    SVM = svm.SVC()
    # logistic regression
    LR = linear_model.LogisticRegression(max_iter=5000)
    # xgboost
    Xgb = xgboost.XGBClassifier()

    model_list = [NB, SVM, LR, Xgb, KNN, decision_tree]
    for i in range(len(model_list)):
        model_list[i] = train(model_list[i], x_train, y_train)

    print('----- train -----')
    for model in model_list:
        print('#### {} ####'.format(str(model).split('(')[0]))
        printScore(model, x_train, y_train)
        print()

    print('----- test -----')
    for model in model_list:
        print('#### {} ####'.format(str(model).split('(')[0]))
        printScore(model, x_test, y_test)
        print()

    if ifKFold:
        print('********** K-Fold cross-validation **********')
        for model in model_list:
            print('#### {} ####'.format(str(model).split('(')[0]))
            printKFoldScore(model, X, Y, cv=cv)
            print()

Application

Y = data["label"] # showHistogram(Y) multiClassificationModel(X, Y, ifSmote=True)Copy the code

Training effects of different models:

Effects of testing with different models:

Evaluation metrics

Positive sample / positive example / positive class: (in binary classification) the category you want to find (identify) in the task.

Precision

The proportion of samples the model judges positive that are actually positive (true positives over all predicted positives).

Purpose: evaluates how accurate the detector is on the detections it makes.

Recall

The proportion of all positive examples in the data set that the model correctly judges as positive.

Purpose: evaluates how completely the detector covers the targets it is supposed to detect.

Roughly speaking, precision measures how many of the retrieved items are correct, while recall measures how many of all the correct items were retrieved. The two differ only in the denominator: for precision it is the number of samples predicted positive, for recall it is the number of positive samples in the original data.
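In confusion-matrix terms the two metrics look like this (the counts below are made up just to show the arithmetic):

TP, FP, FN = 80, 20, 40      # made-up counts: true positives, false positives, false negatives
precision = TP / (TP + FP)   # 0.8   -> of everything predicted positive, how much is right
recall = TP / (TP + FN)      # ~0.67 -> of all actual positives, how much was found
print(precision, recall)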

When a precision-recall curve is drawn, the larger the area under the curve, the better the recognition performance, and vice versa.
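A hedged sketch of drawing such a curve with sklearn's precision_recall_curve (the labels and scores below are made up; in practice you would use a classifier's predicted probabilities):

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, auc

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                     # made-up binary labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6]   # made-up predicted probabilities

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
plt.plot(recall, precision)
plt.xlabel('recall')
plt.ylabel('precision')
plt.title('P-R curve, area = {:.2f}'.format(auc(recall, precision)))
plt.show()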

# sklearn precision
sklearn.metrics.precision_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')
# sklearn recall
sklearn.metrics.recall_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')

Input parameters:

y_true: the true labels.

y_pred: the predicted labels.

labels: optional, a list. Not needed for binary classification.

pos_label: string or int, default 1.

average: a string from [None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted'].

sample_weight: sample weights.

zero_division: the value returned when there is a zero division (i.e. all predictions and labels are negative); can be 0 or 1, default 'warn'. With 'warn' the value acts as 0, but a warning is also raised.

Output:

The precision / recall of the positive class, as a floating point number.
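A small usage sketch with made-up labels:

from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # made-up ground truth
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # made-up predictions

print(precision_score(y_true, y_pred))  # 0.75: 3 of the 4 predicted positives are correct
print(recall_score(y_true, y_pred))     # 0.75: 3 of the 4 actual positives were found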

F1 value

Precision and recall are sometimes at odds, so they need to be considered together. The most common way to do this is the F1 score, which combines the two; a higher F1 indicates a more effective method.
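F1 is the harmonic mean of precision and recall; continuing the made-up numbers above:

precision, recall = 0.75, 0.75                       # from the toy example above
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
print(f1)  # 0.75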

Accuracy

The proportion of samples the model judges correctly (positive samples predicted as positive and negative samples predicted as negative) out of all samples.

In general, the higher the accuracy, the better the classifier.
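In confusion-matrix terms (the counts are again made up):

TP, TN, FP, FN = 80, 50, 20, 40
accuracy = (TP + TN) / (TP + TN + FP + FN)   # correct predictions over all predictions
print(accuracy)  # ~0.68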

K-fold cross validation

In machine learning modeling, it is common practice to split the data into a training set and a test set. The test set is independent of training, takes no part in it, and is used only for the final evaluation of the model. During training, overfitting often appears.

Overfitting means that the model matches the training data well but cannot predict data outside the training set.

If the test data were used to tune model parameters at this point, it would be equivalent to seeing part of the test data during training, which would bias the final evaluation. The usual practice is therefore to set aside part of the training data as validation data to evaluate the training of the model.

The validation data comes from the training data but does not participate in training, so it can objectively evaluate how well the model fits data outside the training set. Evaluating a model on validation data commonly uses cross-validation, also known as cyclic validation. It splits the original data into K folds, uses each fold in turn as the validation set and the remaining K-1 folds as the training set, yielding K models. Each model is evaluated on its validation set, and the resulting errors (e.g. MSE, mean squared error) are summed and averaged to obtain the cross-validation error.

Cross-validation effectively utilizes limited data, and the evaluation results can be as close as possible to the model’s performance on the test set, which can be used as an indicator of model optimization.

When should K-fold be used 🤔? When the total amount of data is small and other methods cannot improve performance further, K-fold is worth trying. In other cases it is not recommended: with a large amount of data there is no need to train on even more, while the training cost grows roughly K times (mainly in training time).
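A hedged sketch of 5-fold cross-validation with sklearn's cross_val_score (the data set and classifier are arbitrary placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')  # one score per fold
print(scores, scores.mean())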

sklearn

Scikit-learn, also known as sklearn, is an open-source machine learning toolkit for Python. It builds on Python's NumPy, SciPy and Matplotlib libraries for efficient algorithm implementation and covers almost all mainstream machine learning algorithms.

What’s in Sklearn 👀❓

Commonly used modules in sklearn cover classification, regression, clustering, dimensionality reduction, model selection and preprocessing.

1️ classification: Commonly used algorithms include SVM, nearest neighbors and random forest, and common applications include spam recognition and image recognition.

2️ regression: predicting the continuous value properties associated with the object. The common algorithms include SVR (support vector regression), Ridge regression and Lasso. The common applications include drug reaction and stock price prediction.

3️ clustering: Automatic grouping of similar objects. The commonly used algorithms include k-means, spectral clustering and mean-shift. The common applications include customer segmentation and grouping of experimental results.

4️ dimension reduction: reduce the number of random variables to be considered. Common algorithms include PCA (principal component analysis), feature selection and non-negative matrix factorization. Common applications include visualization and efficiency improvement.

5️ model selection: comparison, verification, parameter selection and model selection, commonly used modules include grid Search, cross validation and metrics. Its goal is to improve accuracy through parameter adjustment.

6️ pre-processing: feature extraction and normalization, commonly used modules include: Preprocessing, feature extraction, and common applications include: converting input data (such as text) into data available for machine learning algorithms.

7️ Built-in data sets: sklearn also ships with data sets, such as the iris data set and the Boston housing price data set we often see…

Using sklearn is as simple as 💥

To install it, just run pip install scikit-learn.

To use it in a program, you only need to import what you want:

from sklearn import naive_bayes, svm, linear_model,neighbors,tree
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

Whatever algorithm you want, just from sklearn import it; the model is one line away and you can take off 😋
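For instance, a minimal hedged sketch of the usual fit/predict pattern (the data set and model are arbitrary choices):

from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

clf = tree.DecisionTreeClassifier().fit(x_train, y_train)   # train
print(accuracy_score(y_test, clf.predict(x_test)))          # evaluate on the held-out set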
