This is the 23rd day of my participation in the Gwen Challenge.
Using existing employee information and resignation records, we can build an employee attrition prediction model that predicts whether an employee will resign later.
1. Data reading and preprocessing
First, read the employee information and the resignation record (i.e. whether the employee quit or not). The code is as follows.
import pandas as pd
# Replace ' XLSX ' with the path to the employee data workbook.
df = pd.read_excel(' XLSX ')
Print df.head() to view the first five rows of the table. The result is shown below; in the “quit” column, 1 means the employee quit and 0 means the employee did not quit.
The table contains 15,000 historical records. The first 3,571 rows are employees who resigned, and the last 11,429 rows are employees who did not. Our goal is to build a decision tree model on this historical data to predict the probability that an employee will resign.
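The class counts quoted above can be checked with value_counts(). The sketch below does not read the real workbook; it builds a hypothetical stand-in Series with the same 3,571 / 11,429 split, so only the counts come from the text.

```python
import pandas as pd

# Hypothetical stand-in for the real "quit" column: 3,571 resigned rows
# followed by 11,429 non-resigned rows, matching the counts in the text.
quit_flags = pd.Series([1] * 3571 + [0] * 11429, name="quit")

counts = quit_flags.value_counts()   # number of rows per class
quit_rate = quit_flags.mean()        # share of rows where quit == 1

print(counts)
print(f"quit rate: {quit_rate:.3f}")
```

Roughly 24% of the rows belong to the "quit" class, so the two classes are noticeably imbalanced.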
The “salary” column in the raw data takes the text values “high,” “medium,” and “low.” A scikit-learn model cannot work with this text directly, so the contents of the “salary” column need to be converted to numbers. The replace() function in pandas replaces the text “low”, “medium”, and “high” with the numbers 0, 1, and 2, respectively.
# Encode the ordered salary levels as integers.
df = df.replace({'salary': {'low': 0, 'medium': 1, 'high': 2}})
df.head()
The first 5 rows of the processed table are shown below.
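As a minimal sketch of the same encoding on made-up rows, the snippet below uses Series.map() rather than replace(). This is a swapped-in alternative, not the article's call: map() turns any value missing from the dictionary into NaN, which makes unexpected salary labels easy to spot. The frame `toy` and its rows are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration; only the 'salary' column name and its
# three levels follow the text.
toy = pd.DataFrame({"salary": ["low", "high", "medium", "low"]})

# Map the ordered text levels to integers (low < medium < high).
toy["salary"] = toy["salary"].map({"low": 0, "medium": 1, "high": 2})

print(toy["salary"].tolist())
```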
The “quit” column in the table is taken as the target variable, and the remaining fields are taken as the feature variables, so that whether an employee will resign is judged from the employee's characteristics. For the convenience of demonstration, only 6 feature variables are selected here; commercial practice would use more. Next we build the decision tree model, following the routine steps used for most machine learning models.
2. Extract characteristic variables and target variables
First, the feature variables and the target variable are extracted separately. The code is as follows.
X = df.drop(columns='quit')  # feature variables: every column except "quit"
y = df['quit']               # target variable: the "quit" column
Line 1 drops the “quit” column with drop() and assigns the remaining data to the variable X as the feature variables. Line 2 extracts the “quit” column with ordinary DataFrame column selection and assigns it to the variable y as the target variable.
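The split into X and y can be tried on a tiny made-up frame, as in the sketch below. The "satisfaction" column and all row values are hypothetical; only the "quit" and "salary" names come from the text.

```python
import pandas as pd

# Made-up rows; 'satisfaction' is a hypothetical feature column.
toy = pd.DataFrame({
    "salary": [0, 2, 1, 0],
    "satisfaction": [0.38, 0.80, 0.11, 0.72],
    "quit": [1, 0, 1, 0],
})

X = toy.drop(columns="quit")   # every column except the target
y = toy["quit"]                # the target column as a Series

print(X.columns.tolist(), y.tolist())
```

Note that drop() returns a new DataFrame, so `toy` itself still contains the "quit" column afterwards.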
3. Divide training set and test set
After extracting the feature variables and the target variable, the original 15,000 records need to be divided into a training set and a test set. As the names imply, the training set is used to train the model, and the test set is used to check how well the trained model performs. Here’s the code.
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
The partitioned data is shown in the following figure.
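With test_size=0.2, 20% of the 15,000 rows (3,000) go to the test set and the remaining 12,000 to the training set. The sketch below confirms the resulting shapes on synthetic placeholder data: the random arrays stand in for the real table and share only its row count and number of feature columns.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 15,000 rows and 6 feature columns as in the
# text; the values themselves are random and carry no meaning.
rng = np.random.default_rng(0)
X = rng.random((15000, 6))
y = rng.integers(0, 2, size=15000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

print(X_train.shape, X_test.shape)
```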
4. Model training and building
After dividing the data into a training set and a test set, we can import the decision tree model from the Scikit-learn library and train it. The code is as follows.
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=3,random_state=123)
model.fit(X_train,y_train)
At this point, the decision tree model has been built and trained.
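Once the model is fitted, the held-out test set can be used to check it, as described in step 3. The following is a minimal sketch under stated assumptions: the data comes from make_classification rather than the employee workbook, and accuracy_score is just one possible metric, so the printed accuracy says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the employee table (6 features, binary target).
X, y = make_classification(n_samples=1000, n_features=6, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Same model settings as in the text.
model = DecisionTreeClassifier(max_depth=3, random_state=123)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)                # hard 0/1 predictions
y_proba = model.predict_proba(X_test)[:, 1]   # estimated probability of class 1
acc = accuracy_score(y_test, y_pred)          # share of correct predictions

print(f"test accuracy: {acc:.3f}")
```

predict_proba() is the call that yields a resignation probability per employee, while predict() returns only the 0/1 label.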