Original link: tecdat.cn/?p=23518

Original source: Tuoduan Data Tribe WeChat official account

Project background: The bank's main source of profit is lending, yet most of its customers are liability customers (depositors) with deposits of varying sizes. The customer base is growing, and the bank wants to add borrowers (asset customers) so that it can make more loans and earn more interest. It therefore wants to convert liability customers into personal loan customers while keeping them as depositors. A campaign the bank ran for liability customers last year achieved a success rate of more than 9%. The department now hopes to build a model that identifies the customers most likely to take a loan, which would raise the success rate while reducing campaign costs.

The data set

The data file contains records for 5000 customers. It includes demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer's response to the last personal loan campaign (the personal loan column). Of these 5000 customers, only 480 (9.6%) accepted the personal loan offered to them in the previous campaign.
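Because only 9.6% of customers accepted the offer, the two classes are imbalanced, which is worth keeping in mind when reading accuracy scores. As a quick check that uses only the counts stated above, a trivial model that always predicts "no loan" would already be right for most customers:

```python
# Counts taken from the dataset description.
total_customers = 5000
accepted = 480

acceptance_rate = accepted / total_customers
# Accuracy of a trivial model that always predicts "no loan".
majority_baseline = 1 - acceptance_rate

print(f"acceptance rate:   {acceptance_rate:.1%}")
print(f"majority baseline: {majority_baseline:.1%}")
```

Any classifier is only informative to the extent that it beats this baseline.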

data.head()

data.columns

Attribute information

The attributes can be grouped as follows:

  • ID – The customer ID has no relationship with the loan and tells us nothing general about future potential loan customers, so we can ignore it when building the model.

The binary category has five variables, as follows:

  • Personal Loan – Did the customer accept the personal loan offered in the previous campaign? This is our target variable
  • Securities account – Does the customer have a securities account with the bank?
  • CD account – Does the customer have a certificate of deposit (CD) account with the bank?
  • Online banking – Does the customer use online banking?
  • Credit card – Does the customer use a bank issued credit card?

The numerical variables are as follows:

  • Age – The age of the customer
  • Experience – Years of work experience
  • Income – Annual income (yuan)
  • CCAvg – Average credit card spending
  • Mortgage – The mortgage value of the home

The ordinal categorical variables are:

  • Family – The number of people in the customer’s family
  • Education – The education level of the customer

The nominal variables are:

  • ID
  • ZIP code

data.shape

data.info()

data.apply(lambda x: sum(x.isnull()))

data.describe().transpose()

data.apply(lambda x: len(x.unique()))

Pairwise variable scatter plot

  • Age is approximately normally distributed, with most customers between 30 and 60 years old.
  • Experience is concentrated among customers with more than 8 years of experience, and the mean is equal to the median. There are negative values, which are probably data entry errors because work experience cannot be negative. Records like these can be dropped when only a few are affected, so it is worth counting them first.
  • Income is positively skewed. Most customers earn between 45K and 55K. The mean being greater than the median confirms the skew.
  • CCAvg is also positively skewed, with average spending between 0K and 10K and most spending below 2.5K.
  • 70% of the people with mortgages have mortgages of less than $40,000, although the maximum is 635K.
  • Family and Education are ordinal variables. Family size is fairly evenly distributed.
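The skewness remarks above rest on comparing the mean with the median. A minimal sketch of that check, using invented toy values rather than the bank's income column:

```python
import pandas as pd

# Toy income values (in thousands), invented for illustration only.
income = pd.Series([20, 30, 40, 45, 50, 55, 60, 120, 200])

mean_income = income.mean()
median_income = income.median()
skewness = income.skew()  # sample skewness; > 0 means a longer right tail

print(mean_income, median_income, skewness)
# A mean above the median is the usual sign of positive skew.
print(mean_income > median_income and skewness > 0)
```

On the real data the same three calls could be applied to data['Income'] and data['CCAvg'].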

There are 52 records with negative experience values. We need to clean up these records before going any further.

data[data['Experience'] < 0]['Experience'].count()
52
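dfExp = data.loc[data['Experience'] > 0]      # records with positive experience
negExp = data['Experience'] < 0               # mask of records with negative experience
neg_ids = data.loc[negExp]['ID'].tolist()     # IDs of those customers (variable name assumed)

There are 52 records with negative experience.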

The following code performs these steps:

  • For each record with one of these IDs, get the value of the Age column
  • For each record with one of these IDs, get the value of the Education column
  • From the data frame of records with positive experience, filter the records that match this age and education level and take the median of their experience
  • Fill in the median where the negative experience value was

data.loc[np.where(data['ID']==id)]["Education"].tolist()[0]
df_filtered['Experience'].median()
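Only fragments of the imputation loop are shown. The steps listed above can be sketched end to end as follows. The toy data frame and the names df_exp, neg_ids and cust_id are assumptions made for illustration, not the article's exact code:

```python
import pandas as pd

# Toy data, invented for illustration; the real dataset has 5000 rows.
data = pd.DataFrame({
    "ID":         [1, 2, 3, 4, 5, 6],
    "Age":        [25, 25, 25, 40, 40, 25],
    "Education":  [1, 1, 1, 2, 2, 1],
    "Experience": [1.0, 3.0, -1.0, 15.0, 17.0, -2.0],
})

df_exp = data.loc[data["Experience"] > 0]                  # records with positive experience
neg_ids = data.loc[data["Experience"] < 0, "ID"].tolist()  # IDs with negative experience

for cust_id in neg_ids:
    row = data.loc[data["ID"] == cust_id].iloc[0]
    # Positive-experience records with the same age and education level.
    df_filtered = df_exp[(df_exp["Age"] == row["Age"]) &
                         (df_exp["Education"] == row["Education"])]
    # The median is NaN when no record matches; a fallback would be needed then.
    data.loc[data["ID"] == cust_id, "Experience"] = df_filtered["Experience"].median()

print(data[data["Experience"] < 0]["Experience"].count())
```

Filling with the median of similar customers keeps all 5000 rows, at the cost of assuming that age and education are reasonable proxies for experience.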
data[data['Experience'] < 0]['Experience'].count()
0

The impact of income and education on personal loans

sns.boxplot(x='Education', y='Income', data=data)

Observation: Customers with education level 1 appear to have higher incomes. However, customers who took out a personal loan have similar income levels across education levels.

Inference: The chart shows that mortgages are high both for customers without personal loans and for customers with personal loans.

Observation: Most customers who do not have a loan have a securities account.

Observation: Family size has no clear effect on personal loans, although families of three appear more likely to borrow. This may be worth considering in future campaigns.

Observation: Customers without a CD account mostly do not have a loan either, and they appear to be the majority. Almost all customers with a CD account, however, also have a loan.

Observation: The chart shows that customers with personal loans have higher average credit card spending. A median spend of 3,800 yuan is associated with a higher likelihood of taking a personal loan, whereas lower spending (median 1,400 yuan) is associated with a lower likelihood. This could be useful information.

The plot shows a positive correlation between experience and age: as experience increases, so does age. Color indicates education level. There is a gap around the mid-forties, and more customers are at the undergraduate level.

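corr = data.corr()
plt.figure(figsize=(13,7))
# Mask the upper triangle so that each correlation value appears only once
# (the mask definition is not shown in the original; this one is assumed)
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
a = sns.heatmap(corr, mask=mask, annot=True, fmt='.2f')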

Observations

  • Income was moderately correlated with CCAvg.
  • Age is highly correlated with work experience.

sns.boxplot

The chart shows that households earning less than $100,000 are less likely to take a loan than households earning more.

Applying the models

Split the data into training and test sets

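train_labels = train_set.pop('PersonalLoan')  # separate the target column from the training features
test_labels = test_set.pop('PersonalLoan')    # same for the test set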
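The call that creates train_set and test_set is not shown. A minimal sketch, assuming scikit-learn's train_test_split and a 70/30 split on synthetic stand-in data (test_size and random_state are assumed values, not taken from the article):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data; column names follow the article.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Income": rng.integers(10, 200, size=1000),
    "CCAvg": rng.uniform(0, 10, size=1000),
    "PersonalLoan": rng.integers(0, 2, size=1000),
})

# test_size and random_state are assumptions made for this sketch.
train_set, test_set = train_test_split(data, test_size=0.3, random_state=100)

# pop() removes the target column from the features and returns it.
train_labels = train_set.pop("PersonalLoan")
test_labels = test_set.pop("PersonalLoan")

print(train_set.shape, test_set.shape)
```

With an imbalanced target such as this one, passing stratify=data["PersonalLoan"] would keep the acceptance rate similar in both sets.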

Decision tree classifier

DecisionTreeClassifier(class_weight=None, criterion='entropy', ...)
dt_model.score(test_set, test_labels)
0.9773333333333334
dt_model.predict(test_set)

Predictions

array([0, 0, 0, 0, 0])
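The fit, score and predict steps above can be sketched on synthetic data as follows. The data comes from make_classification rather than the bank file, and max_depth is an assumed value; only criterion='entropy' is taken from the model printed above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data (about 10% positives) standing in for the bank data.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# criterion='entropy' matches the printed model; max_depth is an assumption.
dt_model = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0)
dt_model.fit(X_train, y_train)

print(dt_model.score(X_test, y_test))  # mean accuracy on the held-out set
print(dt_model.predict(X_test[:5]))    # predicted classes for the first five rows
```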

View the test set

test_set.head(5)

Naive Bayes

naive_model.fit(train_set, train_labels)
naive_model.score(test_set, test_labels)
0.8866666666666667

Random forest classifier

RandomForestClassifier(max_depth=2, random_state=0)
Importance.sort_values

randomforest_model.score(test_set,test_labels)
0.8993333333333333
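The feature importance table referred to above is only shown as a fragment. One way it could be built, sketched on synthetic data with placeholder feature names (the layout of the importance frame is an assumption):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the feature names are placeholders, not the bank's columns.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Same hyperparameters as the model printed above.
randomforest_model = RandomForestClassifier(max_depth=2, random_state=0)
randomforest_model.fit(X, y)

# feature_importances_ sums to 1; it is scaled to percentages here.
importance = pd.DataFrame(
    {"Importance": randomforest_model.feature_importances_ * 100},
    index=feature_names,
)
print(importance.sort_values(by="Importance", ascending=False))
```

A max_depth of 2 makes every tree very shallow, which may partly explain why the forest scores below the single decision tree in this article.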

KNN (K-Nearest Neighbors)

train_set_indep = data.drop(['Experience', 'ID'], axis=1).drop(labels="PersonalLoan", axis=1)  # features
train_set_dep = data["PersonalLoan"]  # target

acc = accuracy_score(Y_Test, predicted)
print(acc)
0.9106070713809206
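The steps between building the feature set and computing the accuracy are not shown. A minimal sketch on synthetic data, assuming KNeighborsClassifier with n_neighbors=5 (the article does not state the value it used):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the independent and dependent variables.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           n_redundant=0, random_state=1)
X_Train, X_Test, Y_Train, Y_Test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# n_neighbors is an assumed value for this sketch.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_Train, Y_Train)
predicted = knn.predict(X_Test)

acc = accuracy_score(Y_Test, predicted)
print(acc)
```

KNN relies on distances, so on the real data features with large ranges such as Income and Mortgage may dominate unless the columns are scaled first.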

Model comparison

for name, model in models:
    kfold = model_selection.KFold(n_splits=10)
    # the scoring metric is assumed to be accuracy
    cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring='accuracy')

# Boxplot comparison of the algorithms
plt.figure()
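The comparison loop depends on a models list and on X and y, none of which are shown. A self-contained sketch under stated assumptions: synthetic data, the four classifiers named in this article with mostly default settings, and shuffled folds added for reproducibility:

```python
from sklearn import model_selection
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data standing in for the bank data.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)

# The contents of this list are an assumption; the article does not show it.
models = [
    ("DT", DecisionTreeClassifier(criterion="entropy", random_state=0)),
    ("NB", GaussianNB()),
    ("RF", RandomForestClassifier(max_depth=2, random_state=0)),
    ("KNN", KNeighborsClassifier()),
]

results = {}
for name, model in models:
    # shuffle and random_state are assumptions added for reproducibility.
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
    cv_results = model_selection.cross_val_score(
        model, X, y, cv=kfold, scoring="accuracy")
    results[name] = cv_results
    print(f"{name}: {cv_results.mean():.3f} ({cv_results.std():.3f})")
```

The per-fold scores collected in results could then be passed to plt.boxplot to draw the comparison figure.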

 

Conclusion

The bank's aim is to turn liability customers into loan customers. It wants to launch a new marketing campaign and therefore needs information about the relationships between the variables in the data. Four classification algorithms were used in this study. According to the comparison figure, the random forest algorithm appears to have the highest accuracy and could be chosen as the final model, although on the held-out test set the decision tree scored highest (0.977, against 0.899 for the random forest).

