Original link: tecdat.cn/?p=23518
Original source: Tuoduan Data Tribe (拓端数据部落) WeChat official account
Project background: The bank's main profitable business is lending, yet most of its customers are liability customers (depositors) with deposits of varying sizes. The customer base is growing, and the bank wants to add borrowers (asset customers) so that it can make more loans and earn more interest. It therefore wants to convert liability customers into personal loan customers while retaining them as depositors. A campaign the bank ran for liability customers last year had a success rate of more than 9%. The department hopes to build a model that identifies the customers most likely to take a loan, which would raise the success rate while reducing campaign costs.
The data set
The file contains data on 5,000 customers: demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer's response to the last personal loan campaign (Personal Loan). Of these 5,000 customers, only 480 (9.6%) accepted the personal loan offered to them in the previous campaign.
data.head()
data.columns
Attribute information
The attributes can be grouped as follows:
- ID – A customer's ID has no relationship with the loan and supports no general conclusions about future potential loan customers, so we can ignore it for model prediction.
The five binary categorical variables are:
- Personal Loan – Did the customer accept the personal loan offered in the previous campaign? This is our target variable.
- Securities account – Does the customer have a securities account with the bank?
- CD account – Does the customer have a certificate of deposit (CD) account with the bank?
- Internet banking – Does the customer use Internet banking?
- Credit card – Does the customer use a credit card issued by the bank?
The numerical variables are as follows:
- Age – The age of the customer
- Experience – Work experience
- Income – Annual income (yuan)
- CCAvg – Average credit card spending
- Mortgage – The value of the home mortgage
The ordinal categorical variables are:
- Family – The number of people in the customer's family
- Education – The education level of the customer
The nominal variables are:
- ID
- ZIP code
data.shape
data.info()
data.apply(lambda x: sum(x.isnull()))
data.describe().transpose()
data.apply(lambda x: len(x.unique()))
Pairwise variable scatter plot
- Age is approximately normally distributed, with most customers aged between 30 and 60.
- Experience is also roughly normally distributed, with most customers having more than 8 years of experience; the mean is about equal to the median. There are negative values, which are probably data entry errors, since work experience cannot be negative. Deleting these records would be an option if only a few were affected; the count below shows how many there are.
- Income is positively skewed. Most customers earn between 45K and 55K. The mean being greater than the median confirms the skew.
- CCAvg is also positively skewed: spending ranges between 0K and 10K, and most customers spend less than 2.5K.
- Mortgage: about 70% of customers have a mortgage of less than $40,000, but the maximum is 635K.
- Family and Education are ordinal variables. Family size is fairly evenly distributed.
There are 52 records with negative Experience values. We need to clean up these records before going any further.
data[data['Experience'] < 0]['Experience'].count()
52
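dfExp = data.loc[data['Experience'] > 0]  # records with positive experience
negExp = data['Experience'] < 0  # boolean mask of records with negative experience
neg_ids = data.loc[negExp]['ID'].tolist()  # IDs of customers with negative experience
There are 52 records with negative experience.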
The following code performs these steps:
- For the record with the given ID, get the value of the Age column
- For the record with the given ID, get the value of the Education column
- From the DataFrame of records with positive experience, filter the records that match these Age and Education values and take the median of Experience
- Fill in the median where the negative experience was
data.loc[np.where(data['ID']==id)]["Education"].tolist()[0]
df_filtered['Experience'].median()
data[data['Experience'] < 0]['Experience'].count()
0
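The imputation steps above can be sketched end to end on a small toy DataFrame. The toy values and the names `positive`, `neg_ids` and `peers` are illustrative assumptions, not the article's data or code, and the fallback to the overall median when no matching record exists is an assumption added here.

```python
import pandas as pd

# Toy data (hypothetical values); column names follow the article.
# Experience is stored as float so that a median such as 2.5 fits the column.
data = pd.DataFrame({
    "ID":         [1, 2, 3, 4, 5, 6],
    "Age":        [25, 25, 25, 40, 40, 25],
    "Education":  [1, 1, 1, 2, 2, 1],
    "Experience": [1.0, 3.0, -1.0, 15.0, 17.0, -2.0],
})

# Records with positive experience serve as the reference population.
positive = data.loc[data["Experience"] > 0]
# IDs of the records whose experience is negative.
neg_ids = data.loc[data["Experience"] < 0, "ID"].tolist()

for cust_id in neg_ids:
    row = data.loc[data["ID"] == cust_id].iloc[0]
    # Reference records with the same Age and Education.
    peers = positive[(positive["Age"] == row["Age"]) &
                     (positive["Education"] == row["Education"])]
    # Assumption: fall back to the overall median if there are no peers.
    if peers.empty:
        fill = positive["Experience"].median()
    else:
        fill = peers["Experience"].median()
    data.loc[data["ID"] == cust_id, "Experience"] = fill

remaining = int((data["Experience"] < 0).sum())
print(remaining)
```

Without a fallback, a customer whose Age and Education combination has no positive-experience peers would receive a missing value instead of a number.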
The impact of income and education on personal loans
sns.boxplot(x='Education', y='Income', data=data)
Observation: Customers with education level 1 appear to have higher incomes. However, customers who took out a personal loan have similar income levels across education levels.
Inference: The chart suggests that both customers without a personal loan and customers with one have high mortgages.
Observation: Most customers who do not have a loan have a securities account.
Observation: Family size has no clear effect on personal loans, but families of size 3 seem more likely to take a loan. This may be worth keeping in mind when planning future promotions.
Observation: Customers who do not have a CD account mostly do not have a loan either, and they appear to be the majority. But almost all customers who have a CD account also have a loan.
Observation: The chart shows that customers with personal loans have higher average credit card spending. A median average spend of 3,800 yuan goes with a higher likelihood of a personal loan, while lower spending (median 1,400 yuan) goes with a lower likelihood. This could be useful information.
Observation: The plot above shows that experience and age are positively correlated: as experience increases, so does age. The colors indicate education level. There is a gap around the mid-forties, and more customers are at the undergraduate level.
corr = data.corr()
plt.figure(figsize=(13,7))
# mask the upper triangle so that we only see each correlation value once
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
a = sns.heatmap(corr, mask=mask, annot=True, fmt='.2f')
Observations
- Income is moderately correlated with CCAvg.
- Age is highly correlated with work experience.
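The upper-triangle mask used for the heatmap can be illustrated without plotting. This is a minimal sketch on random toy data (an assumption, with Experience deliberately constructed to track Age); it uses only NumPy and pandas. Passing such a `mask` to `sns.heatmap` hides the cells where the mask is True.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.integers(23, 68, size=200)
toy = pd.DataFrame({
    "Age": age,
    # Toy assumption: experience is age minus 25, plus a little noise.
    "Experience": age - 25 + rng.integers(-2, 3, size=200),
    "Income": rng.gamma(2.0, 35.0, size=200),
})

corr = toy.corr()
# True above the diagonal: those cells duplicate the ones below it.
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
# Same effect as the heatmap mask: keep the diagonal and the lower triangle.
visible = corr.where(~mask)
print(visible.round(2))
```

A correlation this close to 1 between two features is the reason one of them (Experience) is dropped before modelling in the KNN code later in the article.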
sns.boxplot
The chart shows that households earning less than $100,000 are less likely to take out a loan than households earning more.
Applying the models
Split the data into training and test sets
train_labels = train_set.pop('PersonalLoan')  # remove the target column and keep it as the labels
test_labels = test_set.pop('PersonalLoan')
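The split itself is not shown in the article. A minimal sketch with scikit-learn's `train_test_split` might look like the following; the toy data, `test_size=0.3`, `random_state=100` and the use of `stratify` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1000
# Toy stand-in for the bank data (hypothetical values).
toy = pd.DataFrame({
    "Income": rng.gamma(2.0, 35.0, size=n),
    "CCAvg": rng.gamma(1.5, 1.3, size=n),
    "Family": rng.integers(1, 5, size=n),
})
# Roughly 10% positives, mimicking the imbalance described in the article.
toy["PersonalLoan"] = (rng.random(n) < 0.1).astype(int)

# 70/30 split; stratify keeps the share of positives similar in both parts.
train_set, test_set = train_test_split(
    toy, test_size=0.3, random_state=100, stratify=toy["PersonalLoan"]
)
# pop() removes the target column from the features and returns it.
train_labels = train_set.pop("PersonalLoan")
test_labels = test_set.pop("PersonalLoan")

print(train_set.shape, test_set.shape)
print(round(train_labels.mean(), 3), round(test_labels.mean(), 3))
```

With only about one positive in ten, an unstratified split can leave the test set with a noticeably different share of borrowers, which makes accuracy figures harder to compare.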
Decision tree classifier
DecisionTreeClassifier(class_weight=None, criterion='entropy', ...)
dt_model.score(test_set, test_labels)
0.9773333333333334
dt_model.predict(test_set)
Predictions
array([0, 0, 0, 0, 0])
View the test set
test_set.head(5)
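The decision tree steps (fit, score, predict) can be sketched as follows. The data here is synthetic, generated with `make_classification` to have roughly 10% positives; it is an assumption standing in for the bank data, so the printed accuracy will not match the article's 0.977.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data: about 10% positives (assumption).
X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# criterion='entropy' matches the classifier shown in the article.
dt_model = DecisionTreeClassifier(criterion="entropy", random_state=0)
dt_model.fit(X_train, y_train)

accuracy = dt_model.score(X_test, y_test)   # mean accuracy on held-out data
first_five = dt_model.predict(X_test[:5])   # predicted class labels
# Share of the majority class: the accuracy of always predicting "no loan".
baseline = max(y_test.mean(), 1 - y_test.mean())
print(accuracy, baseline, first_five)
```

Comparing `accuracy` with `baseline` is a useful sanity check on imbalanced data: in the article's data set, always predicting "no loan" would already be right for 90.4% of customers.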
Naive Bayes
naive_model.fit(train_set, train_labels)
naive_model.score(test_set, test_labels)
0.8866666666666667
Random forest classifier
RandomForestClassifier(max_depth=2, random_state=0)
Importance.sort_values
randomforest_model.score(test_set,test_labels)
0.8993333333333333
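The feature-importance step is only hinted at above. One way to build and sort such a table is sketched below; the synthetic data and the feature names `f0` to `f5` are hypothetical, and only the hyperparameters `max_depth=2, random_state=0` come from the article.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic features with hypothetical names f0..f5 (not the article's columns).
X, y = make_classification(n_samples=1500, n_features=6, n_informative=3,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)
features = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

# Same hyperparameters as the classifier shown in the article.
randomforest_model = RandomForestClassifier(max_depth=2, random_state=0)
randomforest_model.fit(features, y)

# Impurity-based importances, as percentages, sorted ascending.
importance = pd.DataFrame(
    {"Importance": randomforest_model.feature_importances_ * 100},
    index=features.columns,
).sort_values("Importance")
print(importance)
```

With `max_depth=2` each tree makes at most two splits along any path, which may explain why the article's forest scores 0.899, close to the 90.4% share of customers who did not take a loan.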
KNN (K- Nearest Neighbor)
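# features: drop the identifier, the redundant Experience column, and the target (variable name assumed)
train_set_indep = data.drop(['Experience', 'ID'], axis=1).drop(labels="PersonalLoan", axis=1)
train_set_dep = data["PersonalLoan"]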
acc = accuracy_score(Y_Test, predicted)
print(acc)
0.9106070713809206
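The KNN fitting step is not shown in the article. A minimal sketch is given below on synthetic data; the standardisation step and `n_neighbors=5` are assumptions, since the article does not say whether the features were scaled or which k was used.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bank data: about 10% positives (assumption).
X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, Y_Train, Y_Test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Scale first so that no single feature dominates the distance computation.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, Y_Train)

predicted = knn.predict(X_test)
acc = accuracy_score(Y_Test, predicted)
print(acc)
```

Scaling matters for KNN because it classifies by distance: on the bank data, a column such as Mortgage (values up to 635K) would otherwise outweigh a column such as Family (values 1 to 4).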
Model comparison
for name, model in models:
    kfold = model_selection.KFold(n_splits=10)
    cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
# boxplot comparing the algorithms
plt.figure()
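A self-contained version of the comparison loop might look like this. The synthetic data, the choice of `GaussianNB` for naive Bayes, and the shuffled folds with a fixed seed are assumptions; the article does not show how `models`, `X` and `y` were defined.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data: about 10% positives (assumption).
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)

models = [
    ("DT", DecisionTreeClassifier(criterion="entropy", random_state=0)),
    ("NB", GaussianNB()),
    ("RF", RandomForestClassifier(max_depth=2, random_state=0)),
    ("KNN", KNeighborsClassifier()),
]

results = {}
for name, model in models:
    # Shuffle so folds do not depend on row order; fixed seed for repeatability.
    kfold = KFold(n_splits=10, shuffle=True, random_state=0)
    cv_results = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    results[name] = cv_results
    print(f"{name}: {cv_results.mean():.3f} ({cv_results.std():.3f})")
```

Each entry of `results` holds ten fold accuracies, which is the shape a box plot per algorithm needs, for example `plt.boxplot(list(results.values()))`.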
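Conclusion
The bank's goal is to convert liability customers into loan customers. It wants to launch a new marketing campaign and therefore needs information about the relationships between the variables in the data. Four classification algorithms were used in this study. According to the comparison figure, the random forest appears to have the highest accuracy, so we can choose it as the final model. On the single train/test split above, however, the decision tree scored highest (0.977, against 0.911 for KNN, 0.899 for the random forest with max_depth=2 and 0.887 for naive Bayes), so the ranking depends on the evaluation setup and the hyperparameters.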