[toc]
Dataset source
2021 China College Student Insurance Digital Challenge — Digital Course (starting on April 26, 2021)
- The link: www.heywhale.com/org/pingan/…
The 2021 China College Student Insurance Digital Challenge is a campus-oriented event run by an organizing committee formed by China Banking and Insurance News (supervised by the China Banking and Insurance Regulatory Commission), the Insurance Society of China, China Ping An Property Insurance, and the Shenzhen Institute of Big Data, with Zhihu as the cooperating platform.
Problem description
Based on the data and modeling platform provided by the organizers, participants need to make full use of the information contained in the data, combine it with the business characteristics of non-auto insurance, carry out reasonable data preprocessing and analysis, and predict each target vehicle's (carid) purchase intention for non-auto products. (The final round poses a harder, insurance-business-related problem; details are announced after the final is completed.)
Competition result
After a week of work, the best score was AUC = 0.90001787, ranking 20/314.
Data preprocessing
A first look at the data
- Read training and test data
```python
import pandas as pd

df_train = pd.read_csv('/home/mw/input/pre8881/train.csv', index_col=0)
df_test = pd.read_csv('/home/mw/input/pretest_a3048/test_a.csv', index_col=0)
df_train.shape, df_test.shape
# ((684283, 65), (186925, 64))
```
The training set has 684,283 rows and the test set has 186,925 rows. The training set contains 65 columns, including the label. The competition does not give the specific meaning of each field, which makes the problem harder to understand.
```
Index(['dpt', 'xz', 'xb', 'carid', 'nprem_ly', 'ncd_ly', 'newvalue',
       'bi_renewal_year', 'clmnum', 'regdate', 'trademark_cn', 'brand_cn',
       'make_cn', 'series', 'capab', 'seats', 'use_type', 'change_owner',
       'nprem_od', 'si_od', 'nprem_tp', 'si_tp', 'nprem_bt', 'si_bt',
       'nprem_vld', 'si_vld', 'nprem_vlp', 'si_vlp', 'p1_prior_days_to_insure',
       'suiche_nonauto_nprem_20', 'suiche_nonauto_nprem_19',
       'suiche_nonauto_nprem_18', 'suiche_nonauto_nprem_17',
       'suiche_nonauto_nprem_16', 'suiche_nonauto_amount_20',
       'suiche_nonauto_amount_19', 'suiche_nonauto_amount_18',
       'suiche_nonauto_amount_17', 'suiche_nonauto_amount_16',
       'num_notcar_claim', 'p1_gender', 'p1_age', 'p1_census_register',
       'p2_marital_status', 'f1_child_flag', 'f2_posses_house_flag',
       'f2_cust_housing_price_total', 'p2_client_grade', 'w1_pc_wx_use_flag',
       'p1_is_bank_eff', 'p2_is_enterprise_owner', 'p2_is_smeowner',
       'active_7', 'active_30', 'active_90', 'active_365',
       'p2_is_child_under_15_family', 'p2_is_adult_over_55_family',
       'birth_month', 'p1_service_offer_cnt', 'p3_service_use_cnt',
       'dur_personal_insurance_90', 'service_score_available',
       'y1_is_purchase'],
      dtype='object')
dtypes: float64(32), int64(9), object(23)
```
Of the 65 fields, 32 are float64, 9 are int64, and 23 are object.
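For reference, this breakdown can be reproduced with a one-line check (assuming `df_train` is loaded as above):

```python
# Count columns by dtype; should match the float64 / int64 / object counts above
print(df_train.dtypes.value_counts())
```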
Missing value filling
- View the 20 fields with the most missing values
```python
df_train.isnull().sum().sort_values(ascending=False)[:20]
```
Numeric fields are filled with the mode. Nominal (categorical) fields are left unfilled, since LightGBM treats the null values of a categorical attribute as a category of their own.
```python
from tqdm import tqdm

columns = df_train.columns
for i in tqdm(range(len(columns))):
    # Fill numeric (non-object) columns with the mode; leave the label untouched
    if df_train[columns[i]].dtype != 'object' and columns[i] != 'y1_is_purchase':
        df_train[columns[i]].fillna(df_train[columns[i]].mode()[0], inplace=True)
        df_test[columns[i]].fillna(df_test[columns[i]].mode()[0], inplace=True)
```
Nominal attribute encoding
LabelEncoder is used to encode categorical feature values, i.e. discontinuous numerical values or text.
```python
from sklearn.preprocessing import LabelEncoder

# Other categorical variables: direct label encoding
no_features = ['client_no', 'carid', 'y1_is_purchase']
data = pd.concat([df_train, df_test], axis=0)
for col in df_train.select_dtypes(include=['object']).columns:
    if col not in no_features:
        lb = LabelEncoder()
        lb.fit(data[col].astype(str))
        df_train[col] = lb.transform(df_train[col].astype(str))
        df_test[col] = lb.transform(df_test[col].astype(str))
features = [col for col in df_train.columns if col not in no_features]
features
```
Processing date data
Convert regdate to datetime before extracting the year.
```python
df_train['regdate'] = pd.to_datetime(df_train['regdate'])
df_test['regdate'] = pd.to_datetime(df_test['regdate'])
df_train['regdate'] = df_train['regdate'].dt.year
df_test['regdate'] = df_test['regdate'].dt.year
```
Adding features
Count how many rows each value of the 'dpt', 'client_no', 'trademark_cn', 'brand_cn', 'make_cn', and 'series' fields has.
```python
# Count how often each value appears
for f in [['dpt'], ['client_no'], ['trademark_cn'], ['brand_cn'], ['make_cn'], ['series']]:
    df_temp = df_feature.groupby(f).size().reset_index()
    df_temp.columns = f + ['{}_count'.format('_'.join(f))]
    df_feature = df_feature.merge(df_temp, how='left')
```
Compute, on the training set, the mean purchase rate for each value of the 'p1_census_register' and 'dpt' fields.
```python
import gc

# Simple group statistics
def stat(df, df_merge, group_by, agg):
    group = df.groupby(group_by).agg(agg)
    columns = []
    for on, methods in agg.items():
        for method in methods:
            columns.append('{}_{}_{}'.format('_'.join(group_by), on, method))
    group.columns = columns
    group.reset_index(inplace=True)
    df_merge = df_merge.merge(group, on=group_by, how='left')
    del group
    gc.collect()
    return df_merge

def statis_feat(df_know, df_unknow):
    for f in tqdm(['p1_census_register', 'dpt']):
        df_unknow = stat(df_know, df_unknow, [f], {'y1_is_purchase': ['mean']})
    return df_unknow
```
Data analysis
Class balance analysis
```python
import matplotlib.pyplot as plt

fig = plt.figure()
plt.pie(df_train['y1_is_purchase'].value_counts(),
        labels=df_train['y1_is_purchase'].value_counts().index,
        autopct='%1.2f%%', counterclock=False)
plt.title('Purchase rate')
plt.show()
```
```python
# flag_0 = df_train['y1_is_purchase'] == 0
# n = sum(flag_0)
# flag_1 = df_train['y1_is_purchase'] == 1
# n, sum(flag_1)
# df_train[flag_0].shape
# df_train = pd.concat([df_train[flag_0][:n], df_train[flag_1][:int(n * 1.5)]])
# # df_train = df_train.reset_index()
# df_train['y1_is_purchase'].value_counts()
```
The data is clearly imbalanced, so I immediately tried undersampling and oversampling, but the results were worse than with the original data.
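For completeness, here is a minimal sketch of the random oversampling experiment. My exact resampling code is not shown above, so treat this as an illustrative reconstruction using `sklearn.utils.resample`:

```python
from sklearn.utils import resample

# Split the training set by class
df_majority = df_train[df_train['y1_is_purchase'] == 0]
df_minority = df_train[df_train['y1_is_purchase'] == 1]

# Randomly duplicate minority rows until both classes are the same size
df_minority_up = resample(df_minority,
                          replace=True,
                          n_samples=len(df_majority),
                          random_state=2021)

df_balanced = pd.concat([df_majority, df_minority_up]).sample(frac=1, random_state=2021)
print(df_balanced['y1_is_purchase'].value_counts())
```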
Data distribution analysis
Use boxplots to inspect the distribution of the numeric fields and remove clearly abnormal values.
```python
import seaborn as sns

columns = df_train.columns.tolist()
fig = plt.figure(figsize=(80, 60), dpi=75)
j = 0
for i in range(len(columns)):
    if not df_train[columns[i]].dtype == 'object':
        j += 1
        plt.subplot(7, 10, j + 1)
        sns.boxplot(df_train[columns[i]], orient='v', width=0.5)
        plt.ylabel(columns[i], fontsize=36)
plt.show()
```
```python
from scipy import stats

train_rows = len(df_train.columns)
plt.figure(figsize=(10 * 4, 10 * 4))
columns = df_train.columns.tolist()
fig = plt.figure(figsize=(80, 60), dpi=75)
j = 0
for i in range(len(columns)):
    if df_train[columns[i]].dtype == 'int64' and columns[i] != 'y1_is_purchase':
        df_train[columns[i]].fillna(df_train[columns[i]].mode()[0], inplace=True)
        df_test[columns[i]].fillna(df_test[columns[i]].mode()[0], inplace=True)
        j += 1
        ax = plt.subplot(4, 4, j)
        sns.distplot(df_train[columns[i]], fit=stats.norm)
        j += 1
        ax = plt.subplot(4, 4, j)
        stats.probplot(df_train[columns[i]], plot=plt)
plt.tight_layout()
plt.show()
```
Model application
Decision tree
The decision tree is a basic classification and regression method. Decision tree models are fast at classification and easy to visualize and interpret, but they are also prone to overfitting; pruning helps, yet the results are still not satisfactory.
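As mentioned in the conclusion, my first baseline was a plain decision tree. A minimal sketch of such a baseline on the encoded features (the hyperparameters here are illustrative guesses, not the ones actually used):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Depth-limited tree as a quick baseline on the encoded features
tree = DecisionTreeClassifier(max_depth=8, min_samples_leaf=50, random_state=2021)
scores = cross_val_score(tree,
                         df_train[features],
                         df_train['y1_is_purchase'],
                         scoring='roc_auc',
                         cv=5)
print('Decision tree baseline AUC: %.4f +/- %.4f' % (scores.mean(), scores.std()))
```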
In classification problems, boosting learns multiple classifiers by reweighting the training samples (increasing the weights of misclassified samples and decreasing the weights of correctly classified ones) and then combines these classifiers linearly to improve overall performance. Mathematically, the boosted model can be written as

$$f(x) = \sum_{m=1}^{M} w_m \, \phi_m(x)$$

where $w_m$ are the weights and $\phi_m$ are the weak classifiers; the final model is a linear combination of basis functions.
Combining decision trees with boosting therefore produces many algorithms, notably the boosting tree and GBDT.
Gradient boosting is a boosting method whose main idea is that each new model is built along the gradient-descent direction of the loss function of the previously built models. The loss function evaluates the performance of the model (usually fitting error plus a regularization term); the smaller the loss, the better the performance. The best way to improve the model is therefore to let the loss function decrease along its gradient direction, since the gradient direction is the direction of steepest descent.
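In the standard formulation (stated here for completeness), step $m$ fits a weak learner $h_m$ to the negative gradient of the loss at the current model $F_{m-1}$ (the pseudo-residuals) and adds it with a learning rate $\nu$:

$$r_{im} = -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F=F_{m-1}}, \qquad F_m(x) = F_{m-1}(x) + \nu\, h_m(x)$$

Each new tree therefore nudges the ensemble in the direction in which the loss decreases fastest.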
LightGBM
Gradient boosting is a framework within which many different algorithms can be built; LightGBM adds several optimizations on top of it:
- Histogram-based decision tree algorithm.
The basic idea of the histogram algorithm is to discretize continuous floating-point feature values into k integers and build a histogram with k bins. While traversing the data, statistics are accumulated in the histogram using the discretized value as the index; after one pass over the data the histogram holds the required statistics, and the optimal split point is then found by scanning the k discrete bins (see the sketch after this list).
- Gradient-based One-Side Sampling (GOSS): GOSS reduces the number of instances with small gradients so that mainly the remaining, large-gradient instances are used to estimate the information gain, saving time and space compared with XGBoost, which traverses all feature values.
- Exclusive Feature Bundling (EFB): EFB bundles many mutually exclusive features into a single feature, reducing the feature dimension.
- Leaf-wise growth with a depth limit: most GBDT tools grow trees level-wise, which treats all leaves in the same level indiscriminately and incurs a lot of unnecessary overhead; many leaves have low split gain and do not really need to be searched and split. LightGBM instead grows trees leaf-wise with a maximum-depth limit.
- Direct support for categorical features.
- Efficient parallel training.
- Cache hit-rate optimization.
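To make the histogram idea concrete, here is a toy sketch (my own illustration, not LightGBM's actual implementation) that bins one continuous feature into k buckets, accumulates gradient statistics in one pass, and then scans the k bins for the best split:

```python
import numpy as np

def best_split_by_histogram(feature, grad, hess, k=255, lam=1.0):
    """Toy histogram split finding: bin a feature, accumulate per-bin
    gradient/hessian sums, then scan bin boundaries for the best gain."""
    # 1) Discretize the continuous feature into k integer bins
    thresholds = np.quantile(feature, np.linspace(0, 1, k + 1)[1:-1])
    idx = np.digitize(feature, thresholds)            # bin index per sample, 0..k-1

    # 2) One pass over the data accumulates statistics into the histogram
    g_hist = np.bincount(idx, weights=grad, minlength=k)
    h_hist = np.bincount(idx, weights=hess, minlength=k)

    # 3) Scan the k bins (instead of all raw feature values) for the best split
    g_tot, h_tot = g_hist.sum(), h_hist.sum()
    g_left = h_left = 0.0
    best_gain, best_bin = -np.inf, None
    for b in range(k - 1):
        g_left += g_hist[b]
        h_left += h_hist[b]
        g_right, h_right = g_tot - g_left, h_tot - h_left
        gain = (g_left ** 2 / (h_left + lam)
                + g_right ** 2 / (h_right + lam)
                - g_tot ** 2 / (h_tot + lam))
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```

The full LightGBM training code with 5-fold cross-validation follows.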
```python
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold

ycol = 'y1_is_purchase'
feature_names = features

model = lgb.LGBMClassifier(objective='binary',
                           boosting_type='gbdt',
                           num_leaves=64,
                           max_depth=10,
                           learning_rate=0.01,
                           n_estimators=10000,
                           subsample=0.8,
                           feature_fraction=0.6,
                           reg_alpha=10,
                           reg_lambda=12,
                           random_state=seed,
                           is_unbalance=True,
                           metric='auc')

df_oof = df_train[['carid', ycol]].copy()
df_oof['prob'] = 0
prediction = df_test[['carid']]
prediction['prob'] = 0
df_importance_list = []

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
for fold_id, (trn_idx, val_idx) in enumerate(
        kfold.split(df_train[feature_names], df_train[ycol])):
    X_train = df_train.iloc[trn_idx][feature_names]
    Y_train = df_train.iloc[trn_idx][ycol]
    X_val = df_train.iloc[val_idx][feature_names]
    Y_val = df_train.iloc[val_idx][ycol]

    print('\nFold_{} Training ================================\n'.format(
        fold_id + 1))

    lgb_model = model.fit(X_train,
                          Y_train,
                          eval_names=['train', 'valid'],
                          eval_set=[(X_train, Y_train), (X_val, Y_val)],
                          verbose=100,
                          early_stopping_rounds=50)

    # Out-of-fold predictions for this validation fold
    pred_val = lgb_model.predict_proba(
        X_val, num_iteration=lgb_model.best_iteration_)[:, 1]
    df_oof.loc[val_idx, 'prob'] = pred_val

    # Average the test predictions over the folds
    pred_test = lgb_model.predict_proba(
        df_test[feature_names], num_iteration=lgb_model.best_iteration_)[:, 1]
    prediction['prob'] += pred_test / kfold.n_splits

    df_importance = pd.DataFrame({
        'column': feature_names,
        'importance': lgb_model.feature_importances_,
    })
    df_importance_list.append(df_importance)

    del lgb_model, pred_val, pred_test, X_train, Y_train, X_val, Y_val
    gc.collect()
```
Model evaluation
Five-fold cross-validation
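The out-of-fold predictions collected in `df_oof` during training give the cross-validated score directly; a minimal check (assuming `sklearn.metrics` is available):

```python
from sklearn.metrics import roc_auc_score

# Overall AUC over the 5 out-of-fold prediction sets
oof_auc = roc_auc_score(df_oof['y1_is_purchase'], df_oof['prob'])
print('5-fold OOF AUC: %.6f' % oof_auc)
```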
Bayesian tuning
The application of Bayesian optimization to machine-learning hyperparameter tuning was proposed by J. Snoek et al. (2012). The main idea: given an objective function to optimize (a black box for which only the inputs and outputs are specified, with no knowledge of its internal structure or mathematical properties), keep adding sample points to update the posterior distribution of the objective function (modeled as a Gaussian process) until the posterior essentially fits the true distribution. Simply put, it takes the previously evaluated parameters into account in order to choose the next parameters better.
What distinguishes it from a regular grid or random search is:
- Bayesian tuning uses a Gaussian process, takes previous evaluations into account, and keeps updating the prior; grid search ignores previous evaluations.
- Bayesian tuning needs fewer iterations and is faster; grid search is slow and easily suffers a combinatorial explosion with many parameters.
- Bayesian tuning remains robust on non-convex problems; grid search tends to end up in a local optimum on non-convex problems.
Learning control parameter | Meaning | Usage |
---|---|---|
`max_depth` | Maximum depth of the tree | When the model overfits, consider reducing `max_depth` first |
`min_data_in_leaf` | Minimum number of records a leaf may have | Default is 20; used against overfitting |
`feature_fraction` | Fraction of features randomly selected to build each tree (e.g. 0.8 means 80% of the features are used per iteration) | Used when `boosting` is random forest |
`bagging_fraction` | Proportion of the data used in each iteration | Used to speed up training and reduce overfitting |
`early_stopping_round` | Training stops if a validation metric has not improved in the last `early_stopping_round` rounds | Speeds up analysis and avoids wasted iterations |
`lambda` | Regularization strength | Typically 0 ~ 1 |
`min_gain_to_split` | Minimum gain required to make a split | Controls how many useful splits the tree makes |
`max_cat_group` | Finds split points on category-group boundaries | When the number of categories is large, split points tend to overfit |
```python
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score

# Objective over the parameters to tune
def lgb_cv(colsample_bytree, min_child_samples, reg_alpha, reg_lambda, min_split_gain,
           subsample_freq, num_leaves, subsample, max_depth, learning_rate, min_child_weight):
    model = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary',
                               colsample_bytree=float(colsample_bytree), learning_rate=float(learning_rate),
                               min_child_samples=int(min_child_samples), min_child_weight=float(min_child_weight),
                               min_split_gain=float(min_split_gain), subsample_freq=int(subsample_freq),
                               n_estimators=8000, n_jobs=-1, num_leaves=int(num_leaves),
                               random_state=None, reg_alpha=float(reg_alpha), reg_lambda=float(reg_lambda),
                               max_depth=int(max_depth), subsample=float(subsample))
    cv_score = cross_val_score(model, x, y, scoring="f1", cv=5).mean()
    return cv_score
# Use Bayesian optimization
lgb_bo = BayesianOptimization(
    lgb_cv,
    {
        'learning_rate': (0.008, 0.01),
        'max_depth': (3, 10),
        'num_leaves': (31, 127),
        'min_split_gain': (0.0, 0.4),
        'min_child_weight': (0.001, 0.002),
        'min_child_samples': (18, 22),
        'subsample': (0.6, 1.0),
        'subsample_freq': (3, 5),
        'colsample_bytree': (0.6, 1.0),
        'reg_alpha': (0, 0.5),
        'reg_lambda': (0, 0.5)
    },
)
lgb_bo.maximize(n_iter=1000)
lgb_bo.max
# Put the optimized parameters into use
```
Visualization of decision process
```python
from IPython.display import Image

model = lgb.LGBMClassifier(num_leaves=64,
                           max_depth=10,
                           learning_rate=0.01,
                           boosting_type='goss',
                           n_estimators=10000,
                           subsample=0.8,
                           feature_fraction=0.8,
                           reg_alpha=0.5,
                           reg_lambda=0.5,
                           random_state=seed,
                           metric="f1")
X_train = df_train[feature_names]
Y_train = df_train[ycol]
lgb_model = model.fit(X_train,
                      Y_train,
                      eval_names=['valid'],
                      eval_set=[(X_val, Y_val)],
                      verbose=10,
                      eval_metric='auc',
                      early_stopping_rounds=50)

import matplotlib.pyplot as plt
fig2 = plt.figure(figsize=(200, 200))
ax = fig2.subplots()
lgb.plot_tree(lgb_model, tree_index=1, ax=ax)
plt.show()
```
ROC curve
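A minimal sketch for drawing the ROC curve from the out-of-fold predictions (assuming the `df_oof` frame built during training):

```python
from sklearn.metrics import auc, roc_curve

fpr, tpr, _ = roc_curve(df_oof['y1_is_purchase'], df_oof['prob'])
plt.plot(fpr, tpr, label='LightGBM (AUC = %.4f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')  # chance level
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
```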
Feature importance distribution
By checking the importance of each feature in the decision process, further feature engineering can then be focused on the top-ranked fields.
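A minimal sketch of how the per-fold importances collected in `df_importance_list` can be averaged and ranked:

```python
# Average feature importance across the 5 folds and rank it
df_importance = pd.concat(df_importance_list)
df_importance = (df_importance.groupby('column')['importance']
                 .mean()
                 .sort_values(ascending=False)
                 .reset_index())
print(df_importance.head(20))
```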
Model fusion
The predictions of the three models are fused with a simple weighted combination (a weighted geometric mean: each probability is raised to its weight and the results are multiplied), which helps avoid overfitting to any single model.
```python
w_lgb = 0.333
w_xgb = 0.333
w_cbt = 0.333

oof['prob'] = oof['prob_lgb'] ** w_lgb * oof['prob_xgb'] ** w_xgb * oof['prob_cbt'] ** w_cbt
auc = roc_auc_score(oof['y1_is_purchase'], oof['prob'])

sub['label'] = sub['prob_lgb'] ** w_lgb * sub['prob_xgb'] ** w_xgb * sub['prob_cbt'] ** w_cbt
sub.head()
```
View the fusion result
Conclusion
As this was my first time officially taking part in a competition, I felt a little lost when facing 600K rows of data. I slowly reviewed what I had learned in class while looking things up, going from feature engineering, to model selection, to parameter optimization. At the beginning I built my baseline with a decision tree; in the end XGBoost, LightGBM and CatBoost were fused. I also learned to use Bayesian parameter optimization during this project. The design of this experiment is certainly not perfect and there are still many problems, which I will correct and improve on later. Although the competition is over, there is still a long way to go in learning; I will keep going step by step and keep improving.
References
LightGBM: A Highly Efficient Gradient Boosting Decision Tree
Biau G., Devroye L., Lugosi G. Consistency of Random Forests and Other Averaging Classifiers. Journal of Machine Learning Research, 2008, 9: 2015-2033.
Decision tree model, XGBoost, LightGBM and CatBoost model visualization
LightGBM algorithm summary
LightGBM tuning notes
scikit-learn
Frazier P. I. A Tutorial on Bayesian Optimization. arXiv preprint arXiv:1807.02811, 2018.