Original link: tecdat.cn/?p=5521

Original source: Tuoduan Data Tribe (tecdat) WeChat public account

 

Data background

A telephone company wants to determine which customer characteristics are useful for predicting churn, that is, which customers will leave its service.

The data set is Churn. The fields are as follows:

 

state  discrete.
account length  continuous.
area code  continuous.
phone number  discrete.
international plan  discrete.
voice mail plan  discrete.
number vmail messages  continuous.
total day minutes  continuous.
total day calls  continuous.
total day charge  continuous.
total eve minutes  continuous.
total eve calls  continuous.
total eve charge  continuous.
total night minutes  continuous.
total night calls  continuous.
total night charge  continuous.
total intl minutes  continuous.
total intl calls  continuous.
total intl charge  continuous.
number customer service calls  continuous.
churn  discrete.

Data Preparation and Exploration 

 

##      state      account.length    area.code       phone.number
##  WV     : 158   Min.   :  1.0   Min.   :408.0   327-1058:   1
##  MN     : 125   1st Qu.: 73.0   1st Qu.:408.0   327-1319:   1
##  AL     : 124   Median :100.0   Median :415.0   327-2040:   1
##  ID     : 119   Mean   :100.3   Mean   :436.9   327-2475:   1
##  VA     : 118   3rd Qu.:127.0   3rd Qu.:415.0   327-3053:   1
##  OH     : 116   Max.   :243.0   Max.   :510.0   327-3587:   1
##  (Other):4240                                   (Other) :4994
##
##  international.plan voice.mail.plan number.vmail.messages
##  no :4527           no :3677        Min.   : 0.000
##  yes: 473           yes:1323        1st Qu.: 0.000
##                                     Median : 0.000
##                                     Mean   : 7.755
##                                     3rd Qu.:17.000
##                                     Max.   :52.000
##
##  total.day.minutes total.day.calls total.day.charge total.eve.minutes
##  Min.   :  0.0     Min.   :  0     Min.   : 0.00    Min.   :  0.0
##  1st Qu.:143.7     1st Qu.: 87     1st Qu.:24.43    1st Qu.:166.4
##  Median :180.1     Median :100     Median :30.62    Median :201.0
##  Mean   :180.3     Mean   :100     Mean   :30.65    Mean   :200.6
##  3rd Qu.:216.2     3rd Qu.:113     3rd Qu.:36.75    3rd Qu.:234.1
##  Max.   :351.5     Max.   :165     Max.   :59.76    Max.   :363.7
##
##  total.eve.calls total.eve.charge total.night.minutes total.night.calls
##  Min.   :  0.0   Min.   : 0.00    Min.   :  0.0       Min.   :  0.00
##  1st Qu.: 87.0   1st Qu.:14.14    1st Qu.:166.9       1st Qu.: 87.00
##  Median :100.0   Median :17.09    Median :200.4       Median :100.00
##  Mean   :100.2   Mean   :17.05    Mean   :200.4       Mean   : 99.92
##  3rd Qu.:114.0   3rd Qu.:19.90    3rd Qu.:234.7       3rd Qu.:113.00
##  Max.   :170.0   Max.   :30.91    Max.   :395.0       Max.   :175.00
##
##  total.night.charge total.intl.minutes total.intl.calls total.intl.charge
##  Min.   : 0.000     Min.   : 0.00      Min.   : 0.000   Min.   :0.000
##  1st Qu.: 7.510     1st Qu.:    ?      1st Qu.:     ?   1st Qu.:    ?
##  Median : 9.020     Median :10.30      Median : 4.000   Median :2.780
##  Mean   : 9.018     Mean   :10.26      Mean   : 4.435   Mean   :2.771
##  3rd Qu.:10.560     3rd Qu.:12.00      3rd Qu.: 6.000   3rd Qu.:3.240
##  Max.   :17.770     Max.   :20.00      Max.   :20.000   Max.   :5.400
##
##  number.customer.service.calls    churn
##  Min.   :0.00                  False.:4293
##  1st Qu.:1.00                  True. : 707
##  Median :1.00
##  Mean   :1.57
##  3rd Qu.:2.00
##  Max.   :9.00

The data overview shows that there are no missing values. It also shows that the phone number and area code fields carry no useful information for prediction (each phone number occurs exactly once), so they can be deleted.
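The output in this article looks like R, but the code behind it is not shown. As a language-neutral illustration of this preparation step, here is a minimal Python sketch that counts missing values per column and drops the two identifier-like fields. The toy rows and the helper names `count_missing` and `drop_columns` are assumptions made for the example, not part of the original analysis.

```python
# Toy rows standing in for the Churn data; the values are made up for illustration.
rows = [
    {"state": "WV", "area.code": 415, "phone.number": "327-1058",
     "total.day.minutes": 180.1, "churn": "False."},
    {"state": "MN", "area.code": 408, "phone.number": "327-1319",
     "total.day.minutes": 216.2, "churn": "True."},
]


def count_missing(rows):
    """Return {column: number of values that are None or an empty string}."""
    counts = {}
    for row in rows:
        for col, val in row.items():
            counts[col] = counts.get(col, 0) + (val is None or val == "")
    return counts


def drop_columns(rows, to_drop):
    """Return copies of the rows without the listed columns."""
    return [{c: v for c, v in row.items() if c not in to_drop} for row in rows]


missing = count_missing(rows)
cleaned = drop_columns(rows, {"phone.number", "area.code"})

print(missing)             # per-column count of missing values
print(sorted(cleaned[0]))  # remaining column names
```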

 

Examine the variables graphically

 

   

From the results above, we can see that the number of samples in which churn is no is much larger than the number in which churn is yes, so non-churners form the large majority of the sample (4,293 of 5,000, about 86%).
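The class balance can be checked with a few lines of Python. This is a minimal sketch that rebuilds the label column from the two counts reported in the summary (4293 `False.` and 707 `True.`); the variable names are assumptions for the example. It also computes the accuracy of a trivial model that always predicts the majority class, which is a useful reference point for any classifier trained on imbalanced data.

```python
from collections import Counter

# Class counts taken from the summary output: 4293 "False." and 707 "True.".
labels = ["False."] * 4293 + ["True."] * 707

counts = Counter(labels)
total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
majority_label, majority_n = counts.most_common(1)[0]
baseline_accuracy = majority_n / total

print(shares)
print(majority_label, round(baseline_accuracy, 4))
```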

 

From the above results, we can see that the numeric variables, apart from the voice mail message count (number.vmail.messages) and area.code, are approximately normally distributed.

##  account.length    area.code     number.vmail.messages total.day.minutes
##  Min.   :  1.0   Min.   :408.0   Min.   : 0.000        Min.   :  0.0
##  1st Qu.: 73.0   1st Qu.:408.0   1st Qu.: 0.000        1st Qu.:143.7
##  Median :100.0   Median :415.0   Median : 0.000        Median :180.1
##  Mean   :100.3   Mean   :436.9   Mean   : 7.755        Mean   :180.3
##  3rd Qu.:127.0   3rd Qu.:415.0   3rd Qu.:17.000        3rd Qu.:216.2
##  Max.   :243.0   Max.   :510.0   Max.   :52.000        Max.   :351.5
##
##  total.day.calls total.day.charge total.eve.minutes total.eve.calls
##  Min.   :  0     Min.   : 0.00    Min.   :  0.0     Min.   :  0.0
##  1st Qu.: 87     1st Qu.:24.43    1st Qu.:166.4     1st Qu.: 87.0
##  Median :100     Median :30.62    Median :201.0     Median :100.0
##  Mean   :100     Mean   :30.65    Mean   :200.6     Mean   :100.2
##  3rd Qu.:113     3rd Qu.:36.75    3rd Qu.:234.1     3rd Qu.:114.0
##  Max.   :165     Max.   :59.76    Max.   :363.7     Max.   :170.0
##
##  total.eve.charge total.night.minutes total.night.calls total.night.charge
##  Min.   : 0.00    Min.   :  0.0       Min.   :  0.00    Min.   : 0.000
##  1st Qu.:14.14    1st Qu.:166.9       1st Qu.: 87.00    1st Qu.: 7.510
##  Median :17.09    Median :200.4       Median :100.00    Median : 9.020
##  Mean   :17.05    Mean   :200.4       Mean   : 99.92    Mean   : 9.018
##  3rd Qu.:19.90    3rd Qu.:234.7       3rd Qu.:113.00    3rd Qu.:10.560
##  Max.   :30.91    Max.   :395.0       Max.   :175.00    Max.   :17.770
##
##  total.intl.minutes total.intl.calls total.intl.charge
##  Min.   : 0.00      Min.   : 0.000   Min.   :0.000
##  1st Qu.:    ?      1st Qu.:     ?   1st Qu.:    ?
##  Median :10.30      Median : 4.000   Median :2.780
##  Mean   :10.26      Mean   : 4.435   Mean   :2.771
##  3rd Qu.:12.00      3rd Qu.: 6.000   3rd Qu.:3.240
##  Max.   :20.00      Max.   :20.000   Max.   :5.400
##
##  number.customer.service.calls
##  Min.   :0.00
##  1st Qu.:1.00
##  Median :1.00
##  Mean   :1.57
##  3rd Qu.:2.00
##  Max.   :9.00

Relationships between variables

From the results, we can see that there is a significant positive linear correlation between the two.



Using the statistics node, report

##                               account.length    area.code
## account.length                  1.0000000000 -0.018054187
## area.code                      -0.0180541874  1.000000000
## number.vmail.messages          -0.0145746663 -0.003398983
## total.day.minutes              -0.0010174908 -0.019118245
## total.day.calls                 0.0282402279 -0.019313854
## total.day.charge               -0.0010191980 -0.019119256
## total.eve.minutes              -0.0095913331  0.007097877
## total.eve.calls                 0.0091425790 -0.012299947
## total.eve.charge               -0.0095873958  0.007114130
## total.night.minutes             0.0006679112  0.002083626
## total.night.calls              -0.0078254785  0.014656846
## total.night.charge              0.0006558937  0.002070264
## total.intl.minutes              0.0012908394 -0.004153729
## total.intl.calls                0.0142772733 -0.013623309
## total.intl.charge               0.0012918112 -0.004219099
## number.customer.service.calls  -0.0014447918  0.020920513
##
##                               number.vmail.messages total.day.minutes
## account.length                        -0.0145746663      -0.001017491
## area.code                             -0.0033989831      -0.019118245
## number.vmail.messages                  1.0000000000       0.005381376
## total.day.minutes                      0.0053813760       1.000000000
## total.day.calls                        0.0008831280       0.001935149
## total.day.charge                       0.0053767959       0.999999951
## total.eve.minutes                      0.0194901208      -0.010750427
## total.eve.calls                                   ?       0.008128130
## total.eve.charge                       0.0194959577      -0.010760022
## total.night.minutes                    0.0055413838       0.011798660
## total.night.calls                      0.0026762202       0.004236100
## total.night.charge                     0.0055349281       0.011782533
## total.intl.minutes                     0.0024627018      -0.019485746
## total.intl.calls                       0.0001243302      -0.001303123
## total.intl.charge                      0.0025051773      -0.019414797
## number.customer.service.calls         -0.0070856427      -0.002732576
##
##                               total.day.calls total.day.charge
## account.length                   0.0282402279     -0.001019198
## area.code                       -0.0193138545     -0.019119256
## number.vmail.messages            0.0008831280      0.005376796
## total.day.minutes                0.0019351487      0.999999951
## total.day.calls                  1.0000000000      0.001935884
## total.day.charge                 0.0019358844      1.000000000
## total.eve.minutes               -0.0006994115     -0.010747297
## total.eve.calls                  0.0037541787      0.008129319
## total.eve.charge                -0.0006952217     -0.010756893
## total.night.minutes              0.0028044650      0.011801434
## total.night.calls                           ?                ?
## total.night.charge               0.0028018169      0.011785301
## total.intl.minutes               0.0130972198     -0.019489700
## total.intl.calls                 0.0108928533     -0.001306635
## total.intl.charge                0.0131613976     -0.019418755
## number.customer.service.calls   -0.0107394951     -0.002726370
##
##                               total.eve.minutes total.eve.calls
## account.length                    -0.0095913331     0.009142579
## area.code                          0.007097877     -0.012299947
## number.vmail.messages              0.0194901208               ?
## total.day.minutes                 -0.0107504274     0.008128130
## total.day.calls                   -0.0006994115     0.003754179
## total.day.charge                  -0.0107472968     0.008129319
## total.eve.minutes                  1.0000000000     0.002763019
## total.eve.calls                    0.0027630194     1.000000000
## total.eve.charge                   0.9999997749     0.002778097
## total.night.minutes               -0.0166391160     0.001781411
## total.night.calls                  0.0134202163    -0.013682341
## total.night.charge                -0.0166420421     0.001799380
## total.intl.minutes                 0.0001365487    -0.007458458
## total.intl.calls                   0.0083881559     0.005574500
## total.intl.charge                  0.0001593155    -0.007507151
## number.customer.service.calls     -0.0138234228     0.006234831
##
##                               total.eve.charge total.night.minutes
## account.length                   -0.0095873958        0.0006679112
## area.code                         0.0071141298        0.0020836263
## number.vmail.messages             0.0194959577        0.0055413838
## total.day.minutes                -0.0107600217        0.0117986600
## total.day.calls                  -0.0006952217        0.0028044650
## total.day.charge                 -0.0107568931        0.0118014339
## total.eve.minutes                 0.9999997749       -0.0166391160
## total.eve.calls                   0.0027780971        0.0017814106
## total.eve.charge                  1.0000000000       -0.0166489191
## total.night.minutes              -0.0166489191        1.0000000000
## total.night.calls                 0.0134220174        0.0269718182
## total.night.charge               -0.0166518367        0.9999992072
## total.intl.minutes                0.0001320238       -0.0067209669
## total.intl.calls                  0.0083930603       -0.0172140162
## total.intl.charge                 0.0001547783       -0.0066545873
## number.customer.service.calls    -0.0138363623       -0.0085325365
 
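If highly correlated variables are all kept, they can cause multicollinearity in the model, so one variable from each highly correlated pair should be deleted. Each minutes variable and its matching charge variable, for example, have a correlation above 0.9999.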
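A minimal Python sketch of this filtering step, under stated assumptions: the toy columns are made up, the charge is generated as exactly 0.17 per minute purely for illustration, and the 0.9 threshold and the greedy keep-the-first rule in the hypothetical helper `drop_correlated` are choices of this example, not something the article specifies.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns, threshold=0.9):
    """Greedy filter: keep a column only if |r| <= threshold with every column kept so far."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Made-up toy data; the charge is an exact multiple of the minutes (assumption).
minutes = [143.7, 180.1, 216.2, 351.5, 95.0]
columns = {
    "total.day.minutes": minutes,
    "total.day.charge": [round(0.17 * m, 2) for m in minutes],
    "total.day.calls": [100, 87, 113, 90, 165],
}

kept = drop_correlated(columns)
print(kept)
```

Because the filter is greedy, which member of a correlated pair survives depends on column order; here the minutes column comes first, so the charge column is the one dropped.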

Data Manipulation

 
As can be seen from the results, there is some correlation between total.day.calls and total.day.charge.
In particular, the correlation is negative among customers whose voice.mail.plan is no.

 

Discretize (make categorical) a relevant numeric variable

 

 

 

Discretization of variables
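A minimal Python sketch of discretizing a numeric variable and looking at churn within each bin. The cut points (the first and third quartiles of total.day.minutes from the summary), the labels, the toy data and the helper names are all assumptions for illustration; the article does not state which thresholds were used.

```python
from bisect import bisect_right

# Hypothetical cut points: the 1st and 3rd quartiles of total.day.minutes.
CUTS = [143.7, 216.2]
LABELS = ["short", "medium", "long"]


def discretize(value, cuts=CUTS, labels=LABELS):
    """Map a number to the label of its interval; a value equal to a cut goes to the upper bin."""
    return labels[bisect_right(cuts, value)]


def churn_rate_by_bin(minutes, churned):
    """Return {bin label: share of customers in that bin who churned}."""
    totals, churners = {}, {}
    for m, c in zip(minutes, churned):
        b = discretize(m)
        totals[b] = totals.get(b, 0) + 1
        churners[b] = churners.get(b, 0) + int(c)
    return {b: churners[b] / totals[b] for b in totals}


# Made-up toy data for illustration only.
minutes = [100.0, 150.0, 180.1, 230.0, 300.0, 120.0]
churned = [False, False, False, True, True, True]

rates = churn_rate_by_bin(minutes, churned)
print(rates)
```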

 

Construct a distribution of the variable with a churn overlay

Construct a histogram of the variable with a churn overlay

 

 

 Find a pair of numeric variables which are interesting with respect to churn. 

 
As can be seen from the results, there is some correlation between total.day.calls and total.day.charge.

Model Building

In particular, the correlation is present among customers who did not churn (churn is no).

##                                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)                    0.3082150  0.0735760   4.189 2.85e-05 ***
## stateAL                        0.0151188  0.0462343   0.327 0.743680
## stateAR                        0.0894792  0.0490897   1.823 0.068399 .
## stateAZ                        0.0329566  0.0494195   0.667 0.504883
## stateCA                        0.1951511  0.0567439   3.439 0.000588 ***
## international.plan yes         0.3059341  0.0151677  20.170  < 2e-16 ***
## voice.mail.plan yes           -0.1375056  0.0337533  -4.074 4.70e-05 ***
## number.vmail.messages          0.0017068  0.0010988   1.553 0.120402
## total.day.minutes              0.3796323  0.2629027   1.444 0.148802
## total.day.charge              -2.2207671  1.5464583  -1.436 0.151056
## total.eve.minutes              0.0288233  0.1307496   0.220 0.825533
## total.eve.charge              -0.3316041  1.5382391  -0.216 0.829329
## total.night.minutes            0.0083224  0.0695916   0.120 0.904814
## total.night.calls             -0.0001824  0.0002225  -0.820 0.412290
## total.night.charge            -0.1760782  1.5464674  -0.114 0.909355
## total.intl.minutes            -0.0104679  0.4192270  -0.025 0.980080
## total.intl.charge              0.0676460  1.5528267   0.044 0.965254
## number.customer.service.calls  0.0566474  0.0033945  16.688  < 2e-16 ***
## total.day.minutes1medium       0.0502681  0.0160228   3.137 0.001715 **
## total.day.minutes1Short        0.2404020  0.0322293   7.459 1.02e-13 ***

 

From the results, we can see that the variables state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium and total.day.minutes1short have significant effects.

Use K-Nearest-Neighbors (K-NN) algorithm to develop a model for predicting Churn

##         Direction.2005
## knn.pred   1   2
##        1 760  97
##        2 100  43


## [1] 0.803
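The code that produced the table above is not shown in the article. As an illustration of the K-NN idea itself, here is a minimal Python sketch: a query point is assigned the majority class among its k nearest training points under Euclidean distance. The two features, the toy data, the choice k = 3 and the function name `knn_predict` are assumptions for the example; the class codes 1 and 2 mirror the coding in the table.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points nearest to `query`."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Made-up toy data: (customer service calls, day minutes / 100).
train_X = [(1, 1.8), (0, 1.5), (2, 2.0), (5, 3.0), (6, 2.8), (4, 3.2)]
train_y = [1, 1, 1, 2, 2, 2]  # 1 and 2 are class codes, as in the table

pred_a = knn_predict(train_X, train_y, (1, 1.7))
pred_b = knn_predict(train_X, train_y, (5, 2.9))
print(pred_a, pred_b)
```

K-NN depends on distances, so features on very different scales should be rescaled first; the toy data divides the minutes by 100 for that reason.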
A confusion matrix is a table used to evaluate classifiers in supervised learning; in unsupervised learning it is usually called a matching matrix. One axis holds the predicted class and the other the actual class. In the tables shown here, each row is a predicted class (knn.pred) and each column is an actual class.
##         Direction.2005
## knn.pred   1   2
##        1 827 104
##        2  33  36

 

From the results on the test set, we can see an accuracy of about 86% ((827 + 36) / 1000 = 0.863).
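The accuracy figures can be recomputed directly from the two confusion matrices. This minimal Python sketch also derives precision and recall for class 2, assuming that class 2 is the churn class (an assumption: it is the minority class, but the article does not say which code means churn). The nested-dictionary layout and the function name `metrics` are choices of this example.

```python
def metrics(matrix, positive=2):
    """Compute accuracy, precision and recall from matrix[predicted][actual] counts."""
    total = sum(n for row in matrix.values() for n in row.values())
    correct = sum(matrix[c][c] for c in matrix)
    tp = matrix[positive][positive]
    predicted_pos = sum(matrix[positive].values())       # row total
    actual_pos = sum(matrix[p][positive] for p in matrix)  # column total
    return {
        "accuracy": correct / total,
        "precision": tp / predicted_pos,
        "recall": tp / actual_pos,
    }


# Counts copied from the two tables; rows are predictions, columns are actual classes.
first = {1: {1: 760, 2: 97}, 2: {1: 100, 2: 43}}
second = {1: {1: 827, 2: 104}, 2: {1: 33, 2: 36}}

m1 = metrics(first)
m2 = metrics(second)
print(m1)
print(m2)
```

Under that assumption the second table gives a recall of 36 / 140, about 0.26, for class 2, which is worth reading alongside the accuracy figure.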

 

Findings

 

There is a correlation between total.day.calls and total.day.charge, in particular among customers who did not churn. The variables state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium and total.day.minutes1short have significant effects. Finally, the KNN model reaches an accuracy of about 80% on the training set and about 86% on the test set, which the analysis takes as evidence of good predictive performance.

