This is the seventh day of my participation in the First Challenge 2022. For details: First Challenge 2022.
I. Bank customer churn prediction
The 13th session of the 3-day AI Advanced Practice Camp covers bank customer churn prediction. With BML's one-click training, prediction, and deployment, the whole workflow is quite fast; it looks like BML may well beat my hand-built model, but I will give it a try anyway.
1. Introduction to the data set
Background
We know that acquiring new customers is much harder than retaining existing ones.
It is helpful for banks to understand what drives a customer's decision to leave.
Preventing attrition allows banks to develop loyalty programs and retention campaigns to keep as many customers as possible.
Data description
- RowNumber – corresponds to the record (row) number and has no effect on the output.
- CustomerId – contains a random identifier and has no effect on whether the customer leaves the bank.
- Surname – the customer's surname has no effect on their decision to leave the bank.
- CreditScore – may affect customer churn, since customers with higher credit scores are less likely to leave the bank.
- Geography – the customer's location may influence their decision to leave the bank.
- Gender – it is worth exploring whether gender plays a role in customers leaving the bank.
- Age – certainly relevant, as older customers are less likely to leave the bank than younger ones.
- Tenure – the number of years the person has been a customer of the bank. In general, longer-tenured customers are more loyal and less likely to leave.
- Balance – also a good indicator of churn, as people with higher account balances are less likely to leave the bank than those with lower balances.
- NumOfProducts – the number of products the customer has purchased through the bank.
- HasCrCard – whether the customer has a credit card. This column also matters, because people with credit cards are less likely to leave the bank.
- IsActiveMember – active customers are less likely to leave the bank.
- EstimatedSalary – people with lower salaries are more likely to leave the bank than people with higher salaries.
- Exited – whether the customer has left the bank (the prediction target).
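As a reference, the fields above can be written down as an expected column-to-dtype mapping and checked after loading the CSV. This is only a sketch: the dtypes are assumptions inferred from the data.head() and data.info() output shown later in the article, and check_schema is a helper introduced here, not part of the original notebook.

import pandas as pd

# Assumed dtypes for data/data107968/churn.csv, inferred from the outputs shown later
EXPECTED_SCHEMA = {
    'RowNumber': 'int64', 'CustomerId': 'int64', 'Surname': 'object',
    'CreditScore': 'int64', 'Geography': 'object', 'Gender': 'object',
    'Age': 'int64', 'Tenure': 'int64', 'Balance': 'float64',
    'NumOfProducts': 'int64', 'HasCrCard': 'int64', 'IsActiveMember': 'int64',
    'EstimatedSalary': 'float64', 'Exited': 'int64',
}

def check_schema(df):
    # Report missing columns or unexpected dtypes before any modelling
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            print(f'missing column: {col}')
        elif str(df[col].dtype) != dtype:
            print(f'{col}: got {df[col].dtype}, expected {dtype}')

# Example usage: check_schema(pd.read_csv('data/data107968/churn.csv'))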
2. Data set reading
import numpy as np
import warnings
warnings.simplefilter('ignore')
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')
data = pd.read_csv('data/data107968/churn.csv')
print(data.shape)
data.head()
(10000, 14)
| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
3. Delete unnecessary columns
Columns such as the row number, customer ID, and surname carry no information about whether a customer churns, so we drop them.
# The list gives the columns to drop; axis=1 means drop columns
data = data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
data.head()
| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
4. Data analysis
The data analysis below borrows from a fellow student's notebook: aistudio.baidu.com/aistudio/pr… so I won't elaborate much here. As the chart shows, the gap between churned and retained customers is quite wide.
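To put numbers on that imbalance, and on some of the intuitions from the data description, a quick check such as the following can be run. It is only a minimal sketch on the data frame already loaded above:

# Overall class balance: Exited is the minority class
print(data.Exited.value_counts(normalize=True))

# Churn rate for a few of the categorical features discussed in the data description
for col in ['Geography', 'Gender', 'IsActiveMember', 'HasCrCard']:
    print(data.groupby(col)['Exited'].mean())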
# Geographical statistics
data.Geography.unique()
array(['France', 'Spain', 'Germany'], dtype=object)
# Gender statistics
data.Gender.unique()
array(['Female', 'Male'], dtype=object)
# Check which columns contain null values
for column in data.columns:
    print(column, data[column].isnull().any())
CreditScore False
Geography False
Gender False
Age False
Tenure False
Balance False
NumOfProducts False
HasCrCard False
IsActiveMember False
EstimatedSalary False
Exited False
plt.figure(figsize=(16, 6))
sns.countplot(x=data.Exited, palette='Blues_r')  # Bar chart of churned vs. retained counts
<matplotlib.axes._subplots.AxesSubplot at 0x7fed58c05790>
# Data set partitioning
from sklearn.model_selection import train_test_split  # needed here, before the bulk imports in Section 5

train_dataset, eval_dataset = train_test_split(data, test_size=0.2, random_state=1024)
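Because churned customers are the minority class, it may be worth stratifying this split so that both partitions keep the same churn ratio. The following is only a possible variant, not the split the article actually used, so downstream numbers would differ slightly:

# Stratified variant of the split above: keeps the Exited ratio identical in both partitions
train_dataset, eval_dataset = train_test_split(
    data, test_size=0.2, random_state=1024, stratify=data['Exited'])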
5. Import libraries
!pip install catboost
import gc
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier

from sklearn import metrics
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
6. Data preparation
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   CreditScore      10000 non-null  int64
 1   Geography        10000 non-null  object
 2   Gender           10000 non-null  object
 3   Age              10000 non-null  int64
 4   Tenure           10000 non-null  int64
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64
 7   HasCrCard        10000 non-null  int64
 8   IsActiveMember   10000 non-null  int64
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64
dtypes: float64(2), int64(7), object(2)
memory usage: 859.5+ KB
None
# Recombine train_dataset and eval_dataset only to fit the encoders on all categories
data = pd.concat([train_dataset, eval_dataset], axis=0)
cate_cols = ['Geography', 'Gender']
for col in cate_cols:
    lb = LabelEncoder()
    lb.fit(data[col])  # fit on the combined data so train and eval share one encoding
    train_dataset[col] = lb.transform(train_dataset[col])
    eval_dataset[col] = lb.transform(eval_dataset[col])

no_feas = ['Exited']
features = [col for col in train_dataset.columns if col not in no_feas]
X_train = train_dataset[features]
X_test = eval_dataset[features]
y_train = train_dataset['Exited'].astype(int)
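As an aside, LightGBM and CatBoost can also consume categorical columns natively instead of the manual label encoding above. A hedged sketch of that alternative (not what this article actually does) would look like:

# Alternative (not used in this article): let the tree libraries handle categoricals natively
for col in cate_cols:
    train_dataset[col] = train_dataset[col].astype('category')
    eval_dataset[col] = eval_dataset[col].astype('category')
# LightGBM would then pick them up via
#   lgb.LGBMClassifier(...).fit(X_train, y_train, categorical_feature=cate_cols)
# and CatBoost via cat_features=cate_cols in the CatBoost branch of train_model_classification.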
def train_model_classification(X, X_test, y, params, num_classes=2,
                               folds=None, model_type='lgb',
                               eval_metric='logloss', columns=None,
                               plot_feature_importance=False,
                               model=None, verbose=10000,
                               early_stopping_rounds=200,
                               splits=None, n_folds=3):
    """Train a classification model and return a dictionary containing
    out-of-fold predictions, test predictions, scores and, if requested,
    feature importances.

    :params: X - training data, pd.DataFrame
    :params: X_test - test data, pd.DataFrame
    :params: y - target
    :params: folds - folds to split the data
    :params: model_type - model type ('lgb', 'xgb', 'sklearn' or 'cat')
    :params: eval_metric - evaluation metric name
    :params: plot_feature_importance - whether to display feature importance
    :params: model - sklearn model, used only when model_type == 'sklearn'
    """
    start_time = time.time()
    global y_pred_valid, y_pred

    columns = X.columns if columns is None else columns
    X_test = X_test[columns]
    splits = folds.split(X, y) if splits is None else splits
    n_splits = folds.n_splits if splits is None else n_folds

    # to set up scoring parameters
    metrics_dict = {
        'logloss': {
            'lgb_metric_name': 'logloss',
            'xgb_metric_name': 'logloss',
            'catboost_metric_name': 'Logloss',
            'sklearn_scoring_function': metrics.log_loss
        },
        'lb_score_method': {
            'sklearn_scoring_f1': metrics.f1_score,  # online evaluation metric
            'sklearn_scoring_accuracy': metrics.accuracy_score,  # online evaluation metric
            'sklearn_scoring_auc': metrics.roc_auc_score
        },
    }

    result_dict = {}

    # out-of-fold predictions on train data
    oof = np.zeros(shape=(len(X), num_classes))
    # averaged predictions on test data
    prediction = np.zeros(shape=(len(X_test), num_classes))
    # list of scores on folds
    acc_scores = []
    scores = []
    # feature importance
    feature_importance = pd.DataFrame()

    # split and train on folds
    for fold_n, (train_index, valid_index) in enumerate(splits):
        if verbose:
            print(f'Fold {fold_n + 1} started at {time.ctime()}')
        if type(X) == np.ndarray:
            X_train, X_valid = X[train_index], X[valid_index]
            y_train, y_valid = y[train_index], y[valid_index]
        else:
            X_train, X_valid = X[columns].iloc[train_index], X[columns].iloc[valid_index]
            y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]

        if model_type == 'lgb':
            model = lgb.LGBMClassifier(**params)
            model.fit(X_train, y_train,
                      eval_set=[(X_train, y_train), (X_valid, y_valid)],
                      eval_metric=metrics_dict[eval_metric]['lgb_metric_name'],
                      verbose=verbose,
                      early_stopping_rounds=early_stopping_rounds)
            y_pred_valid = model.predict_proba(X_valid)
            y_pred = model.predict_proba(X_test, num_iteration=model.best_iteration_)

        if model_type == 'xgb':
            model = xgb.XGBClassifier(**params)
            model.fit(X_train, y_train,
                      eval_set=[(X_train, y_train), (X_valid, y_valid)],
                      eval_metric=metrics_dict[eval_metric]['xgb_metric_name'],
                      verbose=bool(verbose),  # xgb expects a bool here
                      early_stopping_rounds=early_stopping_rounds)
            y_pred_valid = model.predict_proba(X_valid)
            y_pred = model.predict_proba(X_test, ntree_limit=model.best_ntree_limit)

        if model_type == 'sklearn':
            model = model
            model.fit(X_train, y_train)
            y_pred_valid = model.predict_proba(X_valid)
            score = metrics_dict[eval_metric]['sklearn_scoring_function'](y_valid, y_pred_valid)
            print(f'Fold {fold_n}. {eval_metric}: {score:.4f}.')
            y_pred = model.predict_proba(X_test)

        if model_type == 'cat':
            model = CatBoostClassifier(iterations=20000,
                                       eval_metric=metrics_dict[eval_metric]['catboost_metric_name'],
                                       **params,
                                       loss_function=metrics_dict[eval_metric]['catboost_metric_name'])
            model.fit(X_train, y_train, eval_set=(X_valid, y_valid), cat_features=[], use_best_model=True,
                      verbose=False)
            y_pred_valid = model.predict_proba(X_valid)
            y_pred = model.predict_proba(X_test)

        oof[valid_index] = y_pred_valid
        # per-fold evaluation metrics
        acc_scores.append(
            metrics_dict['lb_score_method']['sklearn_scoring_accuracy'](y_valid, np.argmax(y_pred_valid, axis=1)))
        scores.append(
            metrics_dict['lb_score_method']['sklearn_scoring_auc'](y_valid, y_pred_valid[:, 1]))
        print(acc_scores)
        print(scores)

        prediction += y_pred

        if model_type == 'lgb' and plot_feature_importance:
            # feature importance
            fold_importance = pd.DataFrame()
            fold_importance["feature"] = columns
            fold_importance["importance"] = model.feature_importances_
            fold_importance["fold"] = fold_n + 1
            feature_importance = pd.concat([feature_importance, fold_importance], axis=0)

        if model_type == 'xgb' and plot_feature_importance:
            # feature importance
            fold_importance = pd.DataFrame()
            fold_importance["feature"] = columns
            fold_importance["importance"] = model.feature_importances_
            fold_importance["fold"] = fold_n + 1
            feature_importance = pd.concat([feature_importance, fold_importance], axis=0)

    prediction /= n_splits
    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))

    result_dict['oof'] = oof
    result_dict['prediction'] = prediction
    result_dict['acc_scores'] = acc_scores
    result_dict['scores'] = scores

    if model_type == 'lgb' or model_type == 'xgb':
        if plot_feature_importance:
            feature_importance["importance"] /= n_splits
            cols = feature_importance[["feature", "importance"]].groupby("feature").mean().sort_values(
                by="importance", ascending=False)[:50].index
            best_features = feature_importance.loc[feature_importance.feature.isin(cols)]

            plt.figure(figsize=(16, 12))
            sns.barplot(x="importance", y="feature",
                        data=best_features.sort_values(by="importance", ascending=False))
            plt.title('LGB Features (avg over folds)')
            plt.show()

            result_dict['feature_importance'] = feature_importance

    end_time = time.time()
    print("train_model_classification cost time:{}".format(end_time - start_time))
    return result_dict
7. LGB model
- GitHub homepage: github.com/microsoft/L…
- Documentation: lightgbm.readthedocs.io/en/latest/
- Core parameters: lightgbm.readthedocs.io/en/latest/
lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'n_estimators': 100000,
    'learning_rate': 0.1,
    'random_state': 2948,
    'bagging_freq': 8,
    'bagging_fraction': 0.80718,
    'feature_fraction': 0.38691,  # 0.3
    'feature_fraction_seed': 11,
    'max_depth': 9,
    'min_data_in_leaf': 40,
    'min_child_weight': 0.18654,
    "min_split_gain": 0.35079,
    'min_sum_hessian_in_leaf': 1.11347,
    'num_leaves': 29,
    'num_threads': 6,
    "lambda_l1": 0.55831,
    'lambda_l2': 1.67906,
    'cat_smooth': 10.4,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    # 'n_jobs': -1,
    'metric': 'auc'
}
n_fold = 5
num_classes = 2
print("Num_classes :{}".format(num_classes))
folds = StratifiedKFold(n_splits=n_fold, random_state=1314,shuffle=True)
X = train_dataset[features]
print(y_train.value_counts())
X_test = eval_dataset[features]
result_dict_lgb = train_model_classification(X=X,
X_test=X_test,
y=y_train,
params=lgb_params,
num_classes=num_classes,
folds=folds,
model_type='lgb',
eval_metric='logloss',
plot_feature_importance=True,
verbose=200,
early_stopping_rounds=200,
n_folds=n_fold
)
acc_score = np.mean(result_dict_lgb['acc_scores'])
score = np.mean(result_dict_lgb['scores'])
print(score)
Num_classes :2
0 6347
1 1653
Name: Exited, dtype: int64
Fold 1 started at Sun Jan 23 23:34:02 2022
[LightGBM] [Warning] feature_fraction is set=0.38691, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.38691
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=6, n_jobs=-1 will be ignored. Current value: num_threads=6
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.230063 training's auc: 0.956131 valid_1's binary_logloss: 0.377107 valid_1's auc: 0.826328
Early stopping, best iteration is:
[87] training's binary_logloss: 0.276623 training's auc: 0.932371 valid_1's binary_logloss: 0.371376 valid_1's auc: 0.830094
[0.850625]
[0.8300943483819361]
Fold 2 started at Sun Jan 23 23:34:03 2022
[LightGBM] [Warning] feature_fraction is set=0.38691, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.38691
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=6, n_jobs=-1 will be ignored. Current value: num_threads=6
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.233911 training's auc: 0.954853 valid_1's binary_logloss: 0.366604 valid_1's auc: 0.844229
Early stopping, best iteration is:
[100] training's binary_logloss: 0.271444 training's auc: 0.934158 valid_1's binary_logloss: 0.359972 valid_1's auc: 0.850745
[0.850625, 0.846875]
[0.8300943483819361, 0.8507448117912859]
Fold 3 started at Sun Jan 23 23:34:03 2022
[LightGBM] [Warning] feature_fraction is set=0.38691, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.38691
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=6, n_jobs=-1 will be ignored. Current value: num_threads=6
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.238286 training's auc: 0.95085 valid_1's binary_logloss: 0.355037 valid_1's auc: 0.849645
Early stopping, best iteration is:
[98] training's binary_logloss: 0.276559 training's auc: 0.93025 valid_1's binary_logloss: 0.353108 valid_1's auc: 0.856094
[0.850625, 0.846875, 0.85375]
[0.8300943483819361, 0.8507448117912859, 0.8560943150516975]
Fold 4 started at Sun Jan 23 23:34:04 2022
[LightGBM] [Warning] feature_fraction is set=0.38691, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.38691
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=6, n_jobs=-1 will be ignored. Current value: num_threads=6
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.239597 training's auc: 0.95257 valid_1's binary_logloss: 0.318056 valid_1's auc: 0.882052
Early stopping, best iteration is:
[106] training's binary_logloss: 0.275256 training's auc: 0.931964 valid_1's binary_logloss: 0.31983 valid_1's auc: 0.883624
[0.850625, 0.846875, 0.85375, 0.873125]
[0.8300943483819361, 0.8507448117912859, 0.8560943150516975, 0.8836244333094726]
Fold 5 started at Sun Jan 23 23:34:04 2022
[LightGBM] [Warning] feature_fraction is set=0.38691, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.38691
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=6, n_jobs=-1 will be ignored. Current value: num_threads=6
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.236783 training's auc: 0.952846 valid_1's binary_logloss: 0.350703 valid_1's auc: 0.851403
Early stopping, best iteration is:
[119] training's binary_logloss: 0.265393 training's auc: 0.937635 valid_1's binary_logloss: 0.347245 valid_1's auc: 0.853543
[0.850625, 0.846875, 0.85375, 0.873125, 0.859375]
[0.8300943483819361, 0.8507448117912859, 0.8560943150516975, 0.8836244333094726, 0.8535433070866143]
CV mean score: 0.8548, std: 0.0171.
train_model_classification cost time:2.9498422145843506
0.8548202431242012
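For completeness, the averaged fold probabilities in result_dict_lgb['prediction'] can also be scored directly against the hold-out labels. The snippet below is a minimal sketch added for illustration and is not part of the original run:

# Score the averaged fold predictions on the hold-out (eval) set
y_eval = eval_dataset['Exited'].astype(int)
eval_proba = result_dict_lgb['prediction'][:, 1]   # churn probability
eval_pred = (eval_proba >= 0.5).astype(int)        # simple 0.5 threshold

print('hold-out AUC:', roc_auc_score(y_eval, eval_proba))
print('hold-out accuracy:', metrics.accuracy_score(y_eval, eval_pred))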
8. Compare with BML
BML results
Self-built model results
Training until validation scores don't improve for 200 rounds
[200]  training's binary_logloss: 0.236783  training's auc: 0.952846  valid_1's binary_logloss: 0.350703  valid_1's auc: 0.851403
Early stopping, best iteration is:
[119]  training's binary_logloss: 0.265393  training's auc: 0.937635  valid_1's binary_logloss: 0.347245  valid_1's auc: 0.853543
[0.850625, 0.846875, 0.85375, 0.873125, 0.859375]
[0.8300943483819361, 0.8507448117912859, 0.8560943150516975, 0.8836244333094726, 0.8535433070866143]
CV mean score: 0.8548, std: 0.0171.
Comparing the results
It can be seen that the BML score is slightly higher, which may be because the self-built model does no class balancing on the data.
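If class imbalance really is the gap, one cheap follow-up is to re-weight the minority class. The sketch below uses LightGBM's built-in options; the parameter values are untuned assumptions, and lgb_params_balanced / lgb_params_weighted are names introduced here, not from the original notebook:

# Option 1: let LightGBM re-weight the minority (churned) class automatically
lgb_params_balanced = dict(lgb_params, is_unbalance=True)

# Option 2: set the positive-class weight from the observed ratio (about 6347 retained : 1653 churned)
lgb_params_weighted = dict(lgb_params, scale_pos_weight=6347 / 1653)

# Either dict can then be passed as params= to train_model_classification for a re-run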