[toc]

Dataset source

2021 China College Student Insurance Digital Challenge — Digital Course (starting on April 26, 2021)

  • The link: www.heywhale.com/org/pingan/…

The 2021 China College Student Insurance Digital Challenge is a campus-wide competition whose organizing committee consists of China Banking and Insurance News (the only daily newspaper supervised by the China Banking and Insurance Regulatory Commission), the Insurance Society of China, and Ping An Property & Casualty Insurance, with the Shenzhen Research Institute of Big Data and Zhihu as cooperating platforms.

Problem description

Based on the data and the modeling/analysis platform provided by the organizers, participants need to make full use of the information contained in the data, combine it with the business characteristics of non-auto insurance, carry out reasonable data preprocessing, complete the data processing and analysis, and predict, for each target (carid), the intention to purchase non-auto products. (The final round will pose insurance-business-related problems and will be more difficult; details will be announced after the final round is completed.)

Competition result

After a week of hard work, the best result was an AUC of 0.90001787, ranking 20/314.

Data preprocessing

A first look at the data

  • Read training and test data
```python
import pandas as pd

df_train = pd.read_csv('/home/mw/input/pre8881/train.csv', index_col=0)
df_test = pd.read_csv('/home/mw/input/pretest_a3048/test_a.csv', index_col=0)
df_train.shape, df_test.shape
# ((684283, 65), (186925, 64))
```

The training set has 684,283 rows and the test set has 186,925 rows, with 65 columns including the label y1_is_purchase. The competition does not give the specific meaning of each field, which makes the problem harder to understand.

```
Index(['dpt', 'xz', 'xb', 'carid', 'nprem_ly', 'ncd_ly', 'newvalue', 'bi_renewal_year',
       'clmnum', 'regdate', 'trademark_cn', 'brand_cn', 'make_cn', 'series', 'capab',
       'seats', 'use_type', 'change_owner', 'nprem_od', 'si_od', 'nprem_tp', 'si_tp',
       'nprem_bt', 'si_bt', 'nprem_vld', 'si_vld', 'nprem_vlp', 'si_vlp',
       'p1_prior_days_to_insure', 'suiche_nonauto_nprem_20', 'suiche_nonauto_nprem_19',
       'suiche_nonauto_nprem_18', 'suiche_nonauto_nprem_17', 'suiche_nonauto_nprem_16',
       'suiche_nonauto_amount_20', 'suiche_nonauto_amount_19', 'suiche_nonauto_amount_18',
       'suiche_nonauto_amount_17', 'suiche_nonauto_amount_16', 'num_notcar_claim',
       'p1_gender', 'p1_age', 'p1_census_register', 'p2_marital_status', 'f1_child_flag',
       'f2_posses_house_flag', 'f2_cust_housing_price_total', 'p2_client_grade',
       'w1_pc_wx_use_flag', 'p1_is_bank_eff', 'p2_is_enterprise_owner', 'p2_is_smeowner',
       'active_7', 'active_30', 'active_90', 'active_365', 'p2_is_child_under_15_family',
       'p2_is_adult_over_55_family', 'birth_month', 'p1_service_offer_cnt',
       'p3_service_use_cnt', 'dur_personal_insurance_90', 'service_score_available',
       'y1_is_purchase'],
      dtype='object')
dtypes: float64(32), int64(9), object(23)
```

Of these fields, 32 are float64, 9 are int64, and 23 are object type.

Missing value filling

  • View the 20 fields with the most missing values
```python
df_train.isnull().sum().sort_values(ascending=False)[:20]
```

Numeric fields are filled with their mode. Nominal (categorical) fields are left unfilled, because LightGBM treats the missing values of a categorical feature as a category of their own.

```python
from tqdm import tqdm

columns = df_train.columns
for i in tqdm(range(len(columns))):
    # Fill numeric (non-object) columns, excluding the label, with the mode
    if df_train[columns[i]].dtype != 'object' and columns[i] != 'y1_is_purchase':
        df_train[columns[i]].fillna(df_train[columns[i]].mode()[0], inplace=True)
        df_test[columns[i]].fillna(df_test[columns[i]].mode()[0], inplace=True)
```

Nominal attribute encoding

LabelEncoder is used to encode categorical feature values, i.e. to map discrete values or text labels to integers.

```python
from sklearn.preprocessing import LabelEncoder

# Other categorical variables: apply label encoding directly
no_features = ['client_no', 'carid', 'y1_is_purchase']
data = pd.concat([df_train, df_test], axis=0)
for col in df_train.select_dtypes(include=['object']).columns:
    if col not in no_features:
        lb = LabelEncoder()
        lb.fit(data[col].astype(str))
        df_train[col] = lb.transform(df_train[col].astype(str))
        df_test[col] = lb.transform(df_test[col].astype(str))
features = [col for col in df_train.columns if col not in no_features]
features
```

Processing date data

Convert regdate to datetime type and then extract the year.

```python
df_train['regdate'] = pd.to_datetime(df_train['regdate'])
df_test['regdate'] = pd.to_datetime(df_test['regdate'])
df_train['regdate'] = df_train['regdate'].dt.year
df_test['regdate'] = df_test['regdate'].dt.year
```

Adding features

Count how many records each value of the 'dpt', 'client_no', 'trademark_cn', 'brand_cn', 'make_cn' and 'series' fields appears in.

```python
# Count features
# (df_feature holds the data used for feature construction)
for f in [['dpt'], ['client_no'], ['trademark_cn'], ['brand_cn'], ['make_cn'], ['series']]:
    df_temp = df_feature.groupby(f).size().reset_index()
    df_temp.columns = f + ['{}_count'.format('_'.join(f))]
    df_feature = df_feature.merge(df_temp, how='left')
```

For each value of the 'p1_census_register' and 'dpt' fields, compute the mean purchase rate (the mean of y1_is_purchase) in the training set.

```python
import gc

# Simple group statistics
def stat(df, df_merge, group_by, agg):
    group = df.groupby(group_by).agg(agg)

    columns = []
    for on, methods in agg.items():
        for method in methods:
            columns.append('{}_{}_{}'.format('_'.join(group_by), on, method))
    group.columns = columns
    group.reset_index(inplace=True)
    df_merge = df_merge.merge(group, on=group_by, how='left')

    del group
    gc.collect()

    return df_merge


def statis_feat(df_know, df_unknow):
    # Mean target statistics computed on the known (training) part only
    for f in tqdm(['p1_census_register', 'dpt']):
        df_unknow = stat(df_know, df_unknow, [f], {'y1_is_purchase': ['mean']})

    return df_unknow
```

Data analysis

Sample balance analysis

```python
import matplotlib.pyplot as plt

fig = plt.figure()
plt.pie(df_train['y1_is_purchase'].value_counts(),
        labels=df_train['y1_is_purchase'].value_counts().index,
        autopct='%1.2f%%', counterclock=False)
plt.title('Purchase rate')
plt.show()
```

```python
# flag_0 = df_train['y1_is_purchase'] == 0
# n = sum(flag_0)
# flag_1 = df_train['y1_is_purchase'] == 1
# n, sum(flag_1)

# df_train[flag_0].shape
# df_train = pd.concat([df_train[flag_0][:n], df_train[flag_1][:int(n * 1.5)]])
# # df_train = df_train.reset_index()
# df_train['y1_is_purchase'].value_counts()
```

The data is clearly imbalanced, so I immediately thought of undersampling and oversampling, but the results were worse than with the original data.
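For reference, a minimal sketch of the resampling experiment described above (assuming df_train as loaded earlier; the exact ratios I tried are not preserved, so the values below are illustrative only):

```python
# Illustrative under/over-sampling sketch; ratios are placeholders, not the competition settings
pos = df_train[df_train['y1_is_purchase'] == 1]
neg = df_train[df_train['y1_is_purchase'] == 0]

# Undersample the majority class down to the minority class size
neg_down = neg.sample(n=len(pos), random_state=42)
df_under = pd.concat([pos, neg_down]).sample(frac=1, random_state=42).reset_index(drop=True)

# Oversample the minority class (sampling with replacement) up to the majority class size
pos_up = pos.sample(n=len(neg), replace=True, random_state=42)
df_over = pd.concat([neg, pos_up]).sample(frac=1, random_state=42).reset_index(drop=True)

print(df_under['y1_is_purchase'].value_counts())
print(df_over['y1_is_purchase'].value_counts())
```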

Data distribution analysis

Use boxplots to inspect the distribution of the numeric fields and remove data that is clearly inconsistent (outliers).


```python
import seaborn as sns

columns = df_train.columns.tolist()
fig = plt.figure(figsize=(80, 60), dpi=75)
j = 0
for i in range(len(columns)):
    if not df_train[columns[i]].dtype == 'object':
        j += 1
        plt.subplot(7, 10, j + 1)
        sns.boxplot(df_train[columns[i]], orient='v', width=0.5)
        plt.ylabel(columns[i], fontsize=36)
plt.show()
```

```python
from scipy import stats

train_rows = len(df_train.columns)
plt.figure(figsize=(10 * 4, 10 * 4))

# Histogram with normal fit and Q-Q plot for each int64 field
columns = df_train.columns.tolist()
fig = plt.figure(figsize=(80, 60), dpi=75)
j = 0
for i in range(len(columns)):
    if df_train[columns[i]].dtype == 'int64' and columns[i] != 'y1_is_purchase':
        df_train[columns[i]].fillna(df_train[columns[i]].mode()[0], inplace=True)
        df_test[columns[i]].fillna(df_test[columns[i]].mode()[0], inplace=True)

        j += 1
        ax = plt.subplot(4, 4, j)
        sns.distplot(df_train[columns[i]], fit=stats.norm)

        j += 1
        ax = plt.subplot(4, 4, j)
        stats.probplot(df_train[columns[i]], plot=plt)
plt.tight_layout()
plt.show()
```

Model application

Decision trees

The decision tree is a basic method for classification and regression. Decision tree models classify quickly and are easy to visualize and interpret, but they also overfit easily; pruning helps, but the result is often still not satisfactory.
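As mentioned in the conclusion, the first baseline was a decision tree. A minimal sketch of such a baseline (hyperparameters are illustrative; it assumes the preprocessed df_train and features list from above, with no remaining missing values):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Hold out 20% of the training data to score a single pruned tree
X = df_train[features]
y = df_train['y1_is_purchase']
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

tree = DecisionTreeClassifier(max_depth=8, min_samples_leaf=100, random_state=42)
tree.fit(X_tr, y_tr)
print('decision tree baseline AUC:', roc_auc_score(y_va, tree.predict_proba(X_va)[:, 1]))
```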

In classification problems, boosting learns multiple classifiers by re-weighting the training samples (increasing the weight of misclassified samples and decreasing the weight of correctly classified ones) and combines these classifiers linearly to improve performance. Mathematically, boosting can be expressed as:


f(x) = w_0 + \sum_{m=1}^{M} w_m \phi_m(x)

where $w_m$ are the weights and $\phi_m$ are the weak classifiers; the final model is a linear combination of basis functions.

Combining decision trees with boosting therefore produces many algorithms, mainly boosted trees, GBDT, and so on.

Gradient boosting is a boosting method whose main idea is that each new model is built in the direction of the negative gradient of the loss function of the ensemble built so far. The loss function evaluates model performance (generally goodness of fit plus a regularization term); the smaller the loss, the better the model. The best way to improve the model is therefore to let the loss function descend along its gradient direction (reasonably, the gradient direction is the direction of fastest descent).
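A toy sketch of this idea (not the competition code): with squared loss, the negative gradient is simply the residual, so each new tree is fitted to the residuals of the ensemble built so far.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boosting(X, y, n_rounds=50, lr=0.1):
    """Gradient boosting with squared loss: each tree fits the current residuals."""
    pred = np.full(len(y), float(np.mean(y)))   # f_0: constant initial prediction
    trees = []
    for _ in range(n_rounds):
        residual = y - pred                     # negative gradient of 0.5 * (y - f)^2 w.r.t. f
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residual)
        pred = pred + lr * tree.predict(X)      # small step along the descent direction
        trees.append(tree)
    return trees, pred
```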

LightGBM

Gradient boosting is a framework within which many different algorithms can be built. LightGBM builds on it with the following features:

  • Histogram-based decision tree algorithm.

The basic idea of the histogram algorithm is to discretize continuous floating-point feature values into k integers and build a histogram of width k. While traversing the data, statistics are accumulated in the histogram using the discretized value as the index; after the traversal, the histogram holds the required statistics, and the optimal split point is then searched over the discrete histogram bins (a small binning sketch follows this list).

  • Gradient-based One-Side Sampling (GOSS): GOSS drops a large fraction of the data instances with small gradients, so that only the remaining instances with large gradients are used to compute the information gain, saving time and space compared with traversing all feature values as XGBoost does.
  • Exclusive Feature Bundling (EFB): EFB bundles many mutually exclusive features into a single feature, thus reducing the dimensionality.

  • Leaf-wise growth with a depth limit: most GBDT tools use the less efficient level-wise tree-growing strategy, which treats all leaves in the same level indiscriminately and incurs a lot of unnecessary overhead; in practice many leaves have low split gain and do not need to be searched or split. LightGBM instead uses a leaf-wise algorithm with a depth limit.

  • Direct support for categorical features
  • Support for efficient parallelism
  • Cache hit rate optimization
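As a rough illustration of the histogram idea mentioned in the first bullet (this is not LightGBM's actual implementation): continuous values are bucketed into k bins and gradient/count statistics are accumulated per bin, so candidate splits only need to be scanned over at most k bin boundaries.

```python
import numpy as np

def histogram_split_stats(feature, gradients, k=255):
    """Illustrative only: bucket a continuous feature into k bins and accumulate
    per-bin gradient sums and counts, as a histogram-based GBDT does before
    scanning the bin boundaries for the best split."""
    # Discretize the continuous values into integer bin indices 0..k-1
    edges = np.quantile(feature, np.linspace(0, 1, k + 1)[1:-1])
    bins = np.searchsorted(edges, feature)
    # Accumulate statistics per bin; splits are then evaluated only at bin boundaries
    grad_sum = np.bincount(bins, weights=gradients, minlength=k)
    count = np.bincount(bins, minlength=k)
    return grad_sum, count
```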
```python
import gc
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold

# seed, features, df_train and df_test come from the earlier cells
ycol = 'y1_is_purchase'
feature_names = features
model = lgb.LGBMClassifier(objective='binary',
                           boosting_type='gbdt',
                           num_leaves=64,
                           max_depth=10,
                           learning_rate=0.01,
                           n_estimators=10000,
                           subsample=0.8,
                           feature_fraction=0.6,
                           reg_alpha=10,
                           reg_lambda=12,
                           random_state=seed,
                           is_unbalance=True,
                           metric='auc')

df_oof = df_train[['carid', ycol]].copy()
df_oof['prob'] = 0
prediction = df_test[['carid']].copy()
prediction['prob'] = 0
df_importance_list = []

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
for fold_id, (trn_idx, val_idx) in enumerate(
        kfold.split(df_train[feature_names], df_train[ycol])):
    X_train = df_train.iloc[trn_idx][feature_names]
    Y_train = df_train.iloc[trn_idx][ycol]

    X_val = df_train.iloc[val_idx][feature_names]
    Y_val = df_train.iloc[val_idx][ycol]

    print('\nFold_{} Training ================================\n'.format(
        fold_id + 1))
    # print(df_oof.loc[val_idx]['prob'])
    lgb_model = model.fit(X_train,
                          Y_train,
                          eval_names=['train', 'valid'],
                          eval_set=[(X_train, Y_train), (X_val, Y_val)],
                          verbose=100,
                          early_stopping_rounds=50)

    pred_val = lgb_model.predict_proba(
        X_val, num_iteration=lgb_model.best_iteration_)[:, 1]
    # print(df_oof.iloc[val_idx])
    df_oof.loc[val_idx, 'prob'] = pred_val

    pred_test = lgb_model.predict_proba(
        df_test[feature_names], num_iteration=lgb_model.best_iteration_)[:, 1]
    prediction['prob'] += pred_test / kfold.n_splits


    df_importance = pd.DataFrame({
        'column': feature_names,
        'importance': lgb_model.feature_importances_,
    })
    df_importance_list.append(df_importance)

    del lgb_model, pred_val, pred_test, X_train, Y_train, X_val, Y_val
    gc.collect()
```

Model evaluation

Five-fold cross-validation
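A minimal sketch for scoring the out-of-fold predictions collected in the training loop above (df_oof holds the five-fold predictions):

```python
from sklearn.metrics import roc_auc_score

# Overall AUC of the five-fold out-of-fold predictions
oof_auc = roc_auc_score(df_oof['y1_is_purchase'], df_oof['prob'])
print('5-fold OOF AUC: {:.6f}'.format(oof_auc))
```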

Bayesian tuning

The application of Bayesian optimization to machine learning hyperparameter tuning was proposed by J. Snoek (2012). The main idea: given an objective function to optimize (a black box for which only inputs and outputs are available, without knowing its internal structure or mathematical properties), sample points are added continually to update the posterior distribution of the objective function (a Gaussian process) until the posterior basically fits the true distribution. Simply put, it takes previous parameter evaluations into account so as to choose the next parameters better.

What distinguishes it from a regular grid or random search is:

  • Bayesian tuning uses a Gaussian process, takes previous evaluation results into account, and continually updates the prior; grid search ignores previous evaluations.
  • Bayesian tuning needs fewer iterations and is faster; grid search is slow, and with many parameters it easily suffers a combinatorial explosion.
  • Bayesian tuning remains robust on non-convex problems; grid search easily ends up in a local optimum on non-convex problems.

| Learning control parameter | Meaning | Usage |
| --- | --- | --- |
| max_depth | Maximum depth of a tree | When the model overfits, consider reducing max_depth first |
| min_data_in_leaf | Minimum number of records a leaf may have | Default is 20; used against overfitting |
| feature_fraction | e.g. 0.8 means 80% of the features are randomly selected to build trees in each iteration | Used when boosting is random forest |
| bagging_fraction | Proportion of the data used in each iteration | Used to speed up training and reduce overfitting |
| early_stopping_round | Training stops if a metric on the validation data has not improved in the last early_stopping_round rounds | Speeds up analysis and avoids too many iterations |
| lambda | Regularization strength | 0 ~ 1 |
| min_gain_to_split | Minimum gain required to make a split | Controls useful splits of the tree |
| max_cat_group | Finds split points on group boundaries | When the number of categories is large, split points overfit easily |

```python
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score

# Define the objective over the parameters to tune
def lgb_cv(colsample_bytree, min_child_samples, reg_alpha, reg_lambda, min_split_gain,
           subsample_freq, num_leaves, subsample, max_depth, learning_rate, min_child_weight):
    model = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary',
           colsample_bytree=float(colsample_bytree), learning_rate=float(learning_rate),
           min_child_samples=int(min_child_samples), min_child_weight=float(min_child_weight),
           min_split_gain=float(min_split_gain), subsample_freq=int(subsample_freq),
           n_estimators=8000, n_jobs=-1, num_leaves=int(num_leaves),
           random_state=None, reg_alpha=float(reg_alpha), reg_lambda=float(reg_lambda),
           max_depth=int(max_depth), subsample=float(subsample))
    # x, y are the training features and labels prepared earlier
    cv_score = cross_val_score(model, x, y, scoring="f1", cv=5).mean()
    return cv_score

# Use Bayesian optimization over the parameter ranges
lgb_bo = BayesianOptimization(
    lgb_cv,
    {
        'learning_rate': (0.008, 0.01),
        'max_depth': (3, 10),
        'num_leaves': (31, 127),
        'min_split_gain': (0.0, 0.4),
        'min_child_weight': (0.001, 0.002),
        'min_child_samples': (18, 22),
        'subsample': (0.6, 1.0),
        'subsample_freq': (3, 5),
        'colsample_bytree': (0.6, 1.0),
        'reg_alpha': (0, 0.5),
        'reg_lambda': (0, 0.5)
    },
)
lgb_bo.maximize(n_iter=1000)
lgb_bo.max
# Put the optimized parameters into use
```

Visualizing the decision process

```python
from IPython.display import Image
import lightgbm as lgb
import matplotlib.pyplot as plt

model = lgb.LGBMClassifier(num_leaves=64,
                           max_depth=10,
                           learning_rate=0.01,
                           boosting_type='goss',
                           n_estimators=10000,
                           subsample=0.8,
                           feature_fraction=0.8,
                           reg_alpha=0.5,
                           reg_lambda=0.5,
                           random_state=seed,
                           metric='auc')
X_train = df_train[feature_names]
Y_train = df_train[ycol]
# X_val, Y_val: a validation split prepared beforehand
lgb_model = model.fit(X_train,
                      Y_train,
                      eval_names=['valid'],
                      eval_set=[(X_val, Y_val)],
                      verbose=10,
                      eval_metric='auc',
                      early_stopping_rounds=50)

fig2 = plt.figure(figsize=(200, 200))
ax = fig2.subplots()
lgb.plot_tree(lgb_model, tree_index=1, ax=ax)
plt.show()
```

The ROC curve
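The original ROC figure is not reproduced here; a sketch of how it can be drawn from the out-of-fold predictions (assuming df_oof from the training loop above):

```python
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

fpr, tpr, _ = roc_curve(df_oof['y1_is_purchase'], df_oof['prob'])
oof_auc = roc_auc_score(df_oof['y1_is_purchase'], df_oof['prob'])
plt.plot(fpr, tpr, label='LightGBM (AUC = {:.4f})'.format(oof_auc))
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')  # chance level
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve (out-of-fold)')
plt.legend()
plt.show()
```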

Feature importance distribution

By inspecting how important each attribute is in the decision process, further feature engineering can then be focused on the top-ranked fields.
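A sketch of how the per-fold importances collected in df_importance_list (see the training loop above) can be averaged and inspected:

```python
# Average the feature importances over the five folds and look at the top fields
df_importance = pd.concat(df_importance_list)
df_importance = (df_importance.groupby('column')['importance']
                 .mean()
                 .sort_values(ascending=False)
                 .reset_index())
print(df_importance.head(20))
```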

Model ensembling

The predictions of the three models are fused with simple weights (a weighted geometric mean) to reduce overfitting.

```python
from sklearn.metrics import roc_auc_score

w_lgb = 0.333
w_xgb = 0.333
w_cbt = 0.333

# Weighted geometric mean of the out-of-fold probabilities
oof['prob'] = oof['prob_lgb'] ** w_lgb * oof['prob_xgb'] ** w_xgb * oof['prob_cbt'] ** w_cbt
auc = roc_auc_score(oof['y1_is_purchase'], oof['prob'])

# The same fusion applied to the test predictions
sub['label'] = sub['prob_lgb'] ** w_lgb * sub['prob_xgb'] ** w_xgb * sub['prob_cbt'] ** w_cbt
sub.head()
```

View the fusion result
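Finally, the fused score can be checked and the result written out; a sketch only, since the exact submission format and file name are not shown above (the column subset below is hypothetical):

```python
# Hypothetical submission writer; adjust the columns and file name to the required format
print('fused OOF AUC: {:.6f}'.format(auc))
sub[['carid', 'label']].to_csv('submission.csv', index=False)
```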

Conclusion

As this was my first time officially participating in a competition, I felt somewhat at a loss when faced with more than 600,000 rows of data. I slowly reviewed what I had learned in class while looking up material, going from feature engineering, to model selection, to parameter optimization. At the beginning I built my baseline with a decision tree; in the end I fused XGBoost, LightGBM and CatBoost, and I also learned to use Bayesian parameter optimization along the way. The design of this experiment may not be perfect, and there are indeed many problems, which I will correct and improve in the future. Although the competition is over, there is still a long way to go; I will keep moving step by step and constantly improve myself.

References

Ke G., Meng Q., Finley T., et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. NeurIPS 2017.

Biau G., Devroye L., Lugosi G. Consistency of Random Forests and Other Averaging Classifiers. Journal of Machine Learning Research, 2008, 9: 2015-2033.

Decision tree model, XGBoost, LightGBM and CatBoost model visualization

LightGBM algorithm summary

LightGBM tuning notes

scikit-learn

Frazier P. I. A Tutorial on Bayesian Optimization. arXiv preprint arXiv:1807.02811, 2018.