This is the seventh day of my participation in the First Challenge 2022. For details: First Challenge 2022.
I. Bank customer churn prediction
The 13th session of the 3-day AI Advanced Practice Camp covers bank customer churn prediction. With BML's one-click training, prediction, and deployment, the whole workflow is quite fast; it looks like BML may well beat my hand-built model, but I will give it a try anyway.
1. Introduction to the data set
Background
We know that acquiring new customers is much harder than retaining existing ones.
It is helpful for banks to understand what drives a customer's decision to leave.
Preventing attrition allows banks to develop loyalty programs and retention campaigns to keep as many customers as possible.
Data description
- RowNumber – corresponds to the record (row) number and has no effect on the output.
- CustomerId – contains a random identifier and has no effect on whether the customer leaves the bank.
- Surname – the customer's surname has no effect on their decision to leave the bank.
- CreditScore – may affect customer churn, since customers with higher credit scores are less likely to leave the bank.
- Geography – the customer's location may influence their decision to leave the bank.
- Gender – it is worth exploring whether gender plays a role in customers leaving the bank.
- Age – certainly relevant, as older customers are less likely to leave the bank than younger ones.
- Tenure – the number of years the person has been a customer of the bank. In general, longer-tenured customers are more loyal and less likely to leave.
- Balance – also a good indicator of churn, as people with higher account balances are less likely to leave the bank than those with lower balances.
- NumOfProducts – the number of products the customer has purchased through the bank.
- HasCrCard – whether the customer has a credit card. This column also matters, because people with credit cards are less likely to leave the bank.
- IsActiveMember – active customers are less likely to leave the bank.
- EstimatedSalary – people with lower salaries are more likely to leave the bank than people with higher salaries.
- Exited – whether the customer has left the bank (the prediction target).
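As a reference, the fields above can be written down as an expected column-to-dtype mapping and checked after loading the CSV. This is only a sketch: the dtypes are assumptions inferred from the data.head() and data.info() output shown later in the article, and check_schema is a helper introduced here, not part of the original notebook.

import pandas as pd

# Assumed dtypes for data/data107968/churn.csv, inferred from the outputs shown later
EXPECTED_SCHEMA = {
    'RowNumber': 'int64', 'CustomerId': 'int64', 'Surname': 'object',
    'CreditScore': 'int64', 'Geography': 'object', 'Gender': 'object',
    'Age': 'int64', 'Tenure': 'int64', 'Balance': 'float64',
    'NumOfProducts': 'int64', 'HasCrCard': 'int64', 'IsActiveMember': 'int64',
    'EstimatedSalary': 'float64', 'Exited': 'int64',
}

def check_schema(df):
    # Report missing columns or unexpected dtypes before any modelling
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            print(f'missing column: {col}')
        elif str(df[col].dtype) != dtype:
            print(f'{col}: got {df[col].dtype}, expected {dtype}')

# Example usage: check_schema(pd.read_csv('data/data107968/churn.csv'))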
2. Data set reading
import numpy as np
import warnings
warnings.simplefilter('ignore')
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')
data = pd.read_csv('data/data107968/churn.csv')
print(data.shape)
data.head()
(10000, 14)
| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
3. Delete unnecessary columns
Columns such as the row number, customer ID, and surname carry no information about whether a customer churns, so we drop them.
# The list gives the columns to drop; axis=1 means drop columns
data = data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
data.head()
| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
4. Data analysis
The data analysis below borrows from a fellow student's notebook: aistudio.baidu.com/aistudio/pr… so I won't elaborate much here. As the chart shows, the gap between churned and retained customers is quite wide.
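To put numbers on that imbalance, and on some of the intuitions from the data description, a quick check such as the following can be run. It is only a minimal sketch on the data frame already loaded above:

# Overall class balance: Exited is the minority class
print(data.Exited.value_counts(normalize=True))

# Churn rate for a few of the categorical features discussed in the data description
for col in ['Geography', 'Gender', 'IsActiveMember', 'HasCrCard']:
    print(data.groupby(col)['Exited'].mean())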
# Geographical statistics
data.Geography.unique()
array(['France', 'Spain', 'Germany'], dtype=object)
# Gender statistics
data.Gender.unique()
array(['Female', 'Male'], dtype=object)
# Check which columns contain null values
for column in data.columns:
    print(column, data[column].isnull().any())
CreditScore False
Geography False
Gender False
Age False
Tenure False
Balance False
NumOfProducts False
HasCrCard False
IsActiveMember False
EstimatedSalary False
Exited False
plt.figure(figsize=(16, 6))
sns.countplot(x=data.Exited, palette='Blues_r')  # Bar chart of churned vs. retained counts
<matplotlib.axes._subplots.AxesSubplot at 0x7fed58c05790>
# Data set partitioning
from sklearn.model_selection import train_test_split  # needed here, before the bulk imports in Section 5

train_dataset, eval_dataset = train_test_split(data, test_size=0.2, random_state=1024)
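Because churned customers are the minority class, it may be worth stratifying this split so that both partitions keep the same churn ratio. The following is only a possible variant, not the split the article actually used, so downstream numbers would differ slightly:

# Stratified variant of the split above: keeps the Exited ratio identical in both partitions
train_dataset, eval_dataset = train_test_split(
    data, test_size=0.2, random_state=1024, stratify=data['Exited'])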
5. Import libraries
!pip install catboost
import gc
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier

from sklearn import metrics
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
6. Data preparation
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   CreditScore      10000 non-null  int64
 1   Geography        10000 non-null  object
 2   Gender           10000 non-null  object
 3   Age              10000 non-null  int64
 4   Tenure           10000 non-null  int64
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64
 7   HasCrCard        10000 non-null  int64
 8   IsActiveMember   10000 non-null  int64
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64
dtypes: float64(2), int64(7), object(2)
memory usage: 859.5+ KB
None
# Recombine train_dataset and eval_dataset only to fit the encoders on all categories
data = pd.concat([train_dataset, eval_dataset], axis=0)
cate_cols = ['Geography', 'Gender']
for col in cate_cols:
    lb = LabelEncoder()
    lb.fit(data[col])  # fit on the combined data so train and eval share one encoding
    train_dataset[col] = lb.transform(train_dataset[col])
    eval_dataset[col] = lb.transform(eval_dataset[col])

no_feas = ['Exited']
features = [col for col in train_dataset.columns if col not in no_feas]
X_train = train_dataset[features]
X_test = eval_dataset[features]
y_train = train_dataset['Exited'].astype(int)
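As an aside, LightGBM and CatBoost can also consume categorical columns natively instead of the manual label encoding above. A hedged sketch of that alternative (not what this article actually does) would look like:

# Alternative (not used in this article): let the tree libraries handle categoricals natively
for col in cate_cols:
    train_dataset[col] = train_dataset[col].astype('category')
    eval_dataset[col] = eval_dataset[col].astype('category')
# LightGBM would then pick them up via
#   lgb.LGBMClassifier(...).fit(X_train, y_train, categorical_feature=cate_cols)
# and CatBoost via cat_features=cate_cols in the CatBoost branch of train_model_classification.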
def train_model_classification(X, X_test, y, params, num_classes=2,
                               folds=None, model_type='lgb',
                               eval_metric='logloss', columns=None,
                               plot_feature_importance=False,
                               model=None, verbose=10000,
                               early_stopping_rounds=200,
                               splits=None, n_folds=3):
    """Train a classification model and return a dictionary containing
    out-of-fold predictions, test predictions, scores and, if requested,
    feature importances.

    :params: X - training data, pd.DataFrame
    :params: X_test - test data, pd.DataFrame
    :params: y - target
    :params: folds - folds to split the data
    :params: model_type - model type ('lgb', 'xgb', 'sklearn' or 'cat')
    :params: eval_metric - evaluation metric name
    :params: plot_feature_importance - whether to display feature importance
    :params: model - sklearn model, used only when model_type == 'sklearn'
    """
    start_time = time.time()
    global y_pred_valid, y_pred

    columns = X.columns if columns is None else columns
    X_test = X_test[columns]
    splits = folds.split(X, y) if splits is None else splits
    n_splits = folds.n_splits if splits is None else n_folds

    # to set up scoring parameters
    metrics_dict = {
        'logloss': {
            'lgb_metric_name': 'logloss',
            'xgb_metric_name': 'logloss',
            'catboost_metric_name': 'Logloss',
            'sklearn_scoring_function': metrics.log_loss
        },
        'lb_score_method': {
            'sklearn_scoring_f1': metrics.f1_score,  # online evaluation metric
            'sklearn_scoring_accuracy': metrics.accuracy_score,  # online evaluation metric
            'sklearn_scoring_auc': metrics.roc_auc_score
        },
    }

    result_dict = {}

    # out-of-fold predictions on train data
    oof = np.zeros(shape=(len(X), num_classes))
    # averaged predictions on test data
    prediction = np.zeros(shape=(len(X_test), num_classes))
    # list of scores on folds
    acc_scores = []
    scores = []
    # feature importance
    feature_importance = pd.DataFrame()

    # split and train on folds
    for fold_n, (train_index, valid_index) in enumerate(splits):
        if verbose:
            print(f'Fold {fold_n + 1} started at {time.ctime()}')
        if type(X) == np.ndarray:
            X_train, X_valid = X[train_index], X[valid_index]
            y_train, y_valid = y[train_index], y[valid_index]
        else:
            X_train, X_valid = X[columns].iloc[train_index], X[columns].iloc[valid_index]
            y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]

        if model_type == 'lgb':
            model = lgb.LGBMClassifier(**params)
            model.fit(X_train, y_train,
                      eval_set=[(X_train, y_train), (X_valid, y_valid)],
                      eval_metric=metrics_dict[eval_metric]['lgb_metric_name'],
                      verbose=verbose,
                      early_stopping_rounds=early_stopping_rounds)
            y_pred_valid = model.predict_proba(X_valid)
            y_pred = model.predict_proba(X_test, num_iteration=model.best_iteration_)

        if model_type == 'xgb':
            model = xgb.XGBClassifier(**params)
            model.fit(X_train, y_train,
                      eval_set=[(X_train, y_train), (X_valid, y_valid)],
                      eval_metric=metrics_dict[eval_metric]['xgb_metric_name'],
                      verbose=bool(verbose),  # xgb expects a bool here
                      early_stopping_rounds=early_stopping_rounds)
            y_pred_valid = model.predict_proba(X_valid)
            y_pred = model.predict_proba(X_test, ntree_limit=model.best_ntree_limit)

        if model_type == 'sklearn':
            model = model
            model.fit(X_train, y_train)
            y_pred_valid = model.predict_proba(X_valid)
            score = metrics_dict[eval_metric]['sklearn_scoring_function'](y_valid, y_pred_valid)
            print(f'Fold {fold_n}. {eval_metric}: {score:.4f}.')
            y_pred = model.predict_proba(X_test)

        if model_type == 'cat':
            model = CatBoostClassifier(iterations=20000,
                                       eval_metric=metrics_dict[eval_metric]['catboost_metric_name'],
                                       **params,
                                       loss_function=metrics_dict[eval_metric]['catboost_metric_name'])
            model.fit(X_train, y_train, eval_set=(X_valid, y_valid), cat_features=[], use_best_model=True,
                      verbose=False)
            y_pred_valid = model.predict_proba(X_valid)
            y_pred = model.predict_proba(X_test)

        oof[valid_index] = y_pred_valid
        # per-fold evaluation metrics
        acc_scores.append(
            metrics_dict['lb_score_method']['sklearn_scoring_accuracy'](y_valid, np.argmax(y_pred_valid, axis=1)))
        scores.append(
            metrics_dict['lb_score_method']['sklearn_scoring_auc'](y_valid, y_pred_valid[:, 1]))
        print(acc_scores)
        print(scores)

        prediction += y_pred

        if model_type == 'lgb' and plot_feature_importance:
            # feature importance
            fold_importance = pd.DataFrame()
            fold_importance["feature"] = columns
            fold_importance["importance"] = model.feature_importances_
            fold_importance["fold"] = fold_n + 1
            feature_importance = pd.concat([feature_importance, fold_importance], axis=0)

        if model_type == 'xgb' and plot_feature_importance:
            # feature importance
            fold_importance = pd.DataFrame()
            fold_importance["feature"] = columns
            fold_importance["importance"] = model.feature_importances_
            fold_importance["fold"] = fold_n + 1
            feature_importance = pd.concat([feature_importance, fold_importance], axis=0)

    prediction /= n_splits
    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))

    result_dict['oof'] = oof
    result_dict['prediction'] = prediction
    result_dict['acc_scores'] = acc_scores
    result_dict['scores'] = scores

    if model_type == 'lgb' or model_type == 'xgb':
        if plot_feature_importance:
            feature_importance["importance"] /= n_splits
            cols = feature_importance[["feature", "importance"]].groupby("feature").mean().sort_values(
                by="importance", ascending=False)[:50].index
            best_features = feature_importance.loc[feature_importance.feature.isin(cols)]

            plt.figure(figsize=(16, 12))
            sns.barplot(x="importance", y="feature",
                        data=best_features.sort_values(by="importance", ascending=False))
            plt.title('LGB Features (avg over folds)')
            plt.show()

            result_dict['feature_importance'] = feature_importance

    end_time = time.time()
    print("train_model_classification cost time:{}".format(end_time - start_time))
    return result_dict
7. LGB model
- GitHub homepage: github.com/microsoft/L…
- Documentation: lightgbm.readthedocs.io/en/latest/
- Core parameters: lightgbm.readthedocs.io/en/latest/
lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'n_estimators': 100000,
    'learning_rate': 0.1,
    'random_state': 2948,
    'bagging_freq': 8,
    'bagging_fraction': 0.80718,
    'feature_fraction': 0.38691,  # 0.3
    'feature_fraction_seed': 11,
    'max_depth': 9,
    'min_data_in_leaf': 40,
    'min_child_weight': 0.18654,
    "min_split_gain": 0.35079,
    'min_sum_hessian_in_leaf': 1.11347,
    'num_leaves': 29,
    'num_threads': 6,
    "lambda_l1": 0.55831,
    'lambda_l2': 1.67906,
    'cat_smooth': 10.4,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    # 'n_jobs': -1,
    'metric': 'auc'
}
n_fold = 5
num_classes = 2
print("Num_classes :{}".format(num_classes))
folds = StratifiedKFold(n_splits=n_fold, random_state=1314,shuffle=True)
X = train_dataset[features]
print(y_train.value_counts())
X_test = eval_dataset[features]
result_dict_lgb = train_model_classification(X=X,
X_test=X_test,
y=y_train,
params=lgb_params,
num_classes=num_classes,
folds=folds,
model_type='lgb',
eval_metric='logloss',
plot_feature_importance=True,
verbose=200,
early_stopping_rounds=200,
n_folds=n_fold
)
acc_score = np.mean(result_dict_lgb['acc_scores'])
score = np.mean(result_dict_lgb['scores'])
print(score)
Num_classes :2
0 6347
1 1653
Name: Exited, dtype: int64
Fold 1 started at Sun Jan 23 23:34:02 2022
[LightGBM] [Warning] feature_fraction is set=0.38691, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.38691
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=6, n_jobs=-1 will be ignored. Current value: num_threads=6
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.230063 training's auc: 0.956131 valid_1's binary_logloss: 0.377107 valid_1's auc: 0.826328
Early stopping, best iteration is:
[87] training's binary_logloss: 0.276623 training's auc: 0.932371 valid_1's binary_logloss: 0.371376 valid_1's auc: 0.830094
[0.850625]
[0.8300943483819361]
Fold 2 started at Sun Jan 23 23:34:03 2022
[LightGBM] [Warning] feature_fraction is set=0.38691, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.38691
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=6, n_jobs=-1 will be ignored. Current value: num_threads=6
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.233911 training's auc: 0.954853 valid_1's binary_logloss: 0.366604 valid_1's auc: 0.844229
Early stopping, best iteration is:
[100] training's binary_logloss: 0.271444 training's auc: 0.934158 valid_1's binary_logloss: 0.359972 valid_1's auc: 0.850745
[0.850625, 0.846875]
[0.8300943483819361, 0.8507448117912859]
Fold 3 started at Sun Jan 23 23:34:03 2022
[LightGBM] [Warning] feature_fraction is set=0.38691, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.38691
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=6, n_jobs=-1 will be ignored. Current value: num_threads=6
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.238286 training's auc: 0.95085 valid_1's binary_logloss: 0.355037 valid_1's auc: 0.849645
Early stopping, best iteration is:
[98] training's binary_logloss: 0.276559 training's auc: 0.93025 valid_1's binary_logloss: 0.353108 valid_1's auc: 0.856094
[0.850625, 0.846875, 0.85375]
[0.8300943483819361, 0.8507448117912859, 0.8560943150516975]
Fold 4 started at Sun Jan 23 23:34:04 2022
[LightGBM] [Warning] feature_fraction is set=0.38691, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.38691
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=6, n_jobs=-1 will be ignored. Current value: num_threads=6
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.239597 training's auc: 0.95257 valid_1's binary_logloss: 0.318056 valid_1's auc: 0.882052
Early stopping, best iteration is:
[106] training's binary_logloss: 0.275256 training's auc: 0.931964 valid_1's binary_logloss: 0.31983 valid_1's auc: 0.883624
[0.850625, 0.846875, 0.85375, 0.873125]
[0.8300943483819361, 0.8507448117912859, 0.8560943150516975, 0.8836244333094726]
Fold 5 started at Sun Jan 23 23:34:04 2022
[LightGBM] [Warning] feature_fraction is set=0.38691, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.38691
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=6, n_jobs=-1 will be ignored. Current value: num_threads=6
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.236783 training's auc: 0.952846 valid_1's binary_logloss: 0.350703 valid_1's auc: 0.851403
Early stopping, best iteration is:
[119] training's binary_logloss: 0.265393 training's auc: 0.937635 valid_1's binary_logloss: 0.347245 valid_1's auc: 0.853543
[0.850625, 0.846875, 0.85375, 0.873125, 0.859375]
[0.8300943483819361, 0.8507448117912859, 0.8560943150516975, 0.8836244333094726, 0.8535433070866143]
CV mean score: 0.8548, std: 0.0171.
train_model_classification cost time:2.9498422145843506
0.8548202431242012
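For completeness, the averaged fold probabilities in result_dict_lgb['prediction'] can also be scored directly against the hold-out labels. The snippet below is a minimal sketch added for illustration and is not part of the original run:

# Score the averaged fold predictions on the hold-out (eval) set
y_eval = eval_dataset['Exited'].astype(int)
eval_proba = result_dict_lgb['prediction'][:, 1]   # churn probability
eval_pred = (eval_proba >= 0.5).astype(int)        # simple 0.5 threshold

print('hold-out AUC:', roc_auc_score(y_eval, eval_proba))
print('hold-out accuracy:', metrics.accuracy_score(y_eval, eval_pred))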
8. Compare with BML
BML results
Self-built model results
Training until validation scores don't improve for 200 rounds
[200]  training's binary_logloss: 0.236783  training's auc: 0.952846  valid_1's binary_logloss: 0.350703  valid_1's auc: 0.851403
Early stopping, best iteration is:
[119]  training's binary_logloss: 0.265393  training's auc: 0.937635  valid_1's binary_logloss: 0.347245  valid_1's auc: 0.853543
[0.850625, 0.846875, 0.85375, 0.873125, 0.859375]
[0.8300943483819361, 0.8507448117912859, 0.8560943150516975, 0.8836244333094726, 0.8535433070866143]
CV mean score: 0.8548, std: 0.0171.
Comparing the results
It can be seen that the BML score is slightly higher, which may be because the self-built model does no class balancing on the data.
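If class imbalance really is the gap, one cheap follow-up is to re-weight the minority class. The sketch below uses LightGBM's built-in options; the parameter values are untuned assumptions, and lgb_params_balanced / lgb_params_weighted are names introduced here, not from the original notebook:

# Option 1: let LightGBM re-weight the minority (churned) class automatically
lgb_params_balanced = dict(lgb_params, is_unbalance=True)

# Option 2: set the positive-class weight from the observed ratio (about 6347 retained : 1653 churned)
lgb_params_weighted = dict(lgb_params, scale_pos_weight=6347 / 1653)

# Either dict can then be passed as params= to train_model_classification for a re-run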