
0. Dataset introduction and task analysis

Project Tasks:

Given a large volume of user ad-click data, build a machine learning model that predicts whether a user will click on an advertisement, i.e. predict the click outcome from the media, context, and user-tag information associated with the ad. Our job is to get there by analyzing the data.

The raw data is provided as train1.txt and train2.txt, both of which can be read with pandas.

Field descriptions:

instance_id              Sample id
click                    Whether the ad was clicked (label)
adid                     Ad id
advert_id                Advertiser id
orderid                  Order id
advert_industry_inner    Advertiser industry
advert_name              Advertiser name
campaign_id              Campaign id
creative_id              Creative id
creative_type            Creative type
creative_tp_dnf          Style targeting id
creative_has_deeplink    Whether the response material has a deeplink (Boolean)
creative_is_jump         Whether it is a landing-page jump (Boolean)
creative_is_download     Whether it is an off-page download (Boolean)
creative_is_js           Whether it is a JS material (Boolean)
creative_is_voicead      Whether it is a voice ad (Boolean)
creative_width           Creative width
creative_height          Creative height
app_cate_id              App category
f_channel                Level-1 channel
app_id                   Media id
inner_slot_id            Media ad slot
app_paid                 Whether the app is paid
user_tags                User tag information, comma-separated
city                     City
carrier                  Carrier
time                     Timestamp
province                 Province
nnt                      Network type
devtype                  Device type
os_name                  Operating system name
osv                      Operating system version
os                       Operating system
make                     Brand (e.g. Apple)
model                    Model (e.g. "iPhone")

1. Data preprocessing

1. Load and inspect the data

import pandas as pd

data = pd.read_csv("train1.txt", sep='\t')
# append/concat returns a new DataFrame, so the result must be assigned back
data = pd.concat([data, pd.read_csv("train2.txt", sep='\t')], ignore_index=True)

(1) A special feature: user_tags

The user_tags feature is very different from the others: each record packs a variable number of tags into a single field. To make later processing easier, we first separate user_tags from data and handle it on its own.

user_tags = data["user_tags"]
data = data.drop("user_tags", axis=1)

(2) Processing of special features

Because user_tags holds a variable number of tags per record, we process it as follows (see the code below):

1. Fill in empty entries; the author takes an empty entry to mean the user has no tags, so it is filled with the label '0'.
2. Count how many times each user tag occurs.
3. Keep the save_n most frequent tags as the reserved tags.
4. Create a zero matrix user_tags_mark of shape (n, save_n), where n is the number of records.
5. If reserved tag j appears in the user tags of record i, increment user_tags_mark[i][j].
6. To make predictions on new data easier later, save tags_dict (the tag occurrence counts) and tags_map (the mapping from tags to columns of user_tags_mark).
import numpy as np

def get_tags(user_tags, save_n):
    user_tags.fillna('0', inplace=True)

    # count how many times each tag occurs
    tags_dict = dict()
    for i in range(len(user_tags)):
        tl = user_tags[i].split(',')
        for t in tl:
            if t in tags_dict:
                tags_dict[t] += 1
            else:
                tags_dict[t] = 1

    # keep the save_n most frequent tags and map each one to a column index
    tags_dict = sorted(tags_dict.items(), key=lambda x: x[1], reverse=True)[:save_n]
    tags_dict = {key: value for key, value in tags_dict}
    tags_map = {key: value for value, key in enumerate(tags_dict.keys())}

    # mark which reserved tags appear in each record
    user_tags_mark = np.zeros((len(user_tags), len(tags_map)))
    for i in range(len(user_tags)):
        tl = user_tags[i].split(',')
        for t in tl:
            if t in tags_map:
                user_tags_mark[i][tags_map[t]] += 1
    return user_tags, tags_dict, tags_map, user_tags_mark


user_tags, tags_dict, tags_map, user_tags_mark = get_tags(user_tags, 20)
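The saved tags_map can be reused later to encode the user_tags of new, unseen records in exactly the same way. The helper below is only a sketch: encode_new_tags is a hypothetical name, and it assumes the new tags arrive as the same comma-separated strings.

import numpy as np

def encode_new_tags(new_user_tags, tags_map):
    # new_user_tags: pandas Series of comma-separated tag strings
    new_user_tags = new_user_tags.fillna('0')
    mark = np.zeros((len(new_user_tags), len(tags_map)))
    for i, tags in enumerate(new_user_tags):
        for t in tags.split(','):
            if t in tags_map:               # tags never seen in training are ignored
                mark[i][tags_map[t]] += 1
    return mark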

(3) New features generated after special feature processing

2. Data type and missing value processing

(1) View data information

data.info()

From the output we can see that make, model, osv, os_name, advert_industry_inner, f_channel, inner_slot_id, advert_name, app_cate_id, and app_id all have missing values, but they are not completed in the same way. The columns from make through advert_name have dtype object; inspecting the data shows they are stored as strings, and their absence is plausibly caused by technical limitations or by users unwilling to disclose the information, so the fact that a value is missing itself carries information. We therefore mark this kind of missing value with the string "NaN", meaning no further information is available. app_cate_id and app_id are numeric (float) and are filled with the median, which reflects the most common case.

Note: at this point user_tags has already been separated from data.

(2) Data completion and type conversion

Since models cannot work directly on strings, we need to convert the string columns into numeric labels. While completing the data, we use OrdinalEncoder to convert the features in objects_list into label form and save the mapping between strings and labels (objects_cates).

objects_list = ["make"."model"."osv"."os_name"."advert_industry_inner"."f_channel"."inner_slot_id"."advert_name"]
floats_list = ["app_cate_id"."app_id"]


def Completer(data, bool_list=[], objects_list=[], floats_list=[]) :
    from sklearn.preprocessing import OrdinalEncoder
    objects_cates = dict()
    flag = False
    for obj in objects_list:
        flag = True
        data[obj].fillna("NaN", inplace=True)
        
        data_cat = data[[obj]]
        encoder = OrdinalEncoder()
        data_cat = encoder.fit_transform(data_cat)
        cate_dict = dict()
        categories = encoder.categories_[0]
        for i in range(len(categories)):
            cate_dict[categories[i]] = i
        objects_cates[obj] = cate_dict
        data[obj]=data_cat.reshape(-1.1) :0]

    for f in floats_list:
        median = data[f].median()
        data[f].fillna(median, inplace=True)
        
    return data, objects_cates, flag


data, objects_cates, flag = Completer(data, bool_list, objects_list, floats_list)
Copy the code
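Because objects_cates stores the string-to-label mapping learned from the training data, new records can be encoded without refitting the encoder. Below is a minimal sketch; encode_new_objects is a hypothetical helper, and mapping unseen strings to the "NaN" label is an assumption rather than something the original pipeline defines.

def encode_new_objects(new_data, objects_cates, objects_list):
    for obj in objects_list:
        cate_dict = objects_cates[obj]
        new_data[obj] = new_data[obj].fillna("NaN")
        # strings never seen during training fall back to the "NaN" label (assumption)
        fallback = cate_dict.get("NaN", -1)
        new_data[obj] = new_data[obj].map(lambda s: cate_dict.get(s, fallback))
    return new_data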

(3) Data after completion

3. Outlier analysis

EllipticEnvelope is used to identify outliers, and KNNImputer is then used to replace them.

bool_list = ["creative_is_jump"."creative_is_download"."creative_is_js"."creative_is_voicead"."creative_has_deeplink"."app_paid"]

def OutlierHander(data, boll_list) :
    from sklearn.covariance import EllipticEnvelope
    from sklearn.impute import KNNImputer
    import numpy as np
    
    for b in bool_list:
        data[b] = data[b].astype(np.float64)
    detector = EllipticEnvelope() # construct an outlier identifier
    detector.fit(data) # Fit recognizer
    idx = detector.predict(data) == -1# Predict outliers
    ls = [i for i in range(len(data))]
    
    data[idx] *= np.nan
    imputer = KNNImputer()
    data = imputer.fit_transform(data)   
    return data

data = OutlierHander(data, bool_list)
Copy the code

Note: 1. this function is very slow to run; 2. all bool columns must be converted to numeric before the imputation.
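If the running time is a problem, one possible workaround (not part of the original run) is to fit the EllipticEnvelope on a random subsample and only use the full table for prediction; the sketch below assumes the data is already fully numeric, as in OutlierHandler.

def fit_envelope_on_sample(data, sample_size=50000, random_state=7):
    from sklearn.covariance import EllipticEnvelope
    # fit the covariance estimate on a subsample to keep the cost manageable
    sample = data.sample(n=min(sample_size, len(data)), random_state=random_state)
    detector = EllipticEnvelope().fit(sample)
    return detector.predict(data) == -1     # boolean outlier mask for every row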

2. Exploratory analysis and feature engineering

As we know, user_tags reflects each user's preferences and therefore, to some extent, whether the user will click the ad. However, the tags are high-dimensional data, especially after we generate the new feature matrix user_tags_mark from them; analysing only part of the tag information would clearly be misleading, so we keep the user-tag features here without further analysis.

1. Univariate graph analysis

(1) Analysis of original data variables

import matplotlib.pyplot as plt

data.hist(bins=51, figsize=(20, 15))
plt.show()

From the figure we can see that:

1. The sample id (instance_id) is evenly distributed and, by common sense, has no influence on whether an ad is clicked, so this feature can be removed.
2. The data is distributed periodically over time, which means the collected data is time-dependent: different time periods produce different numbers of ad impressions, so we need to add new time features such as month and day of week.
3. Province, city, media id, ad id, advertiser id, order id, campaign id, creative id and style targeting id use encodings that hurt both the plots and later training, so we remap these ids.
4. The clicked and non-clicked samples are imbalanced; before training we should undersample, i.e. reduce the number of negative samples.

(2) Data processing

1. Add new features

data["time"] = pd.to_datetime(df['time'],unit='s',origin=pd.Timestamp('1970-01-01'))
data["month"] = data["time"].dt.month
data["dayofweek"] = data["time"].dt.dayofweek
Copy the code

2. Id mapping

id_cols = ["city", "province", "campaign_id", "app_id", "adid",
           "advert_id", "orderid", "creative_id", "creative_tp_dnf"]

id_maps = {}
for col in id_cols:
    # map each raw id to a small consecutive integer and keep the mapping
    mapping = {key: value for value, key in enumerate(set(data[col]))}
    id_maps[col] = mapping
    data[col] = data[col].map(mapping)
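The same remapping can also be written with pandas' built-in factorize, which assigns consecutive integer codes and returns the original values so the mapping can still be saved; a quick equivalent sketch:

id_maps = {}
for col in id_cols:
    codes, uniques = pd.factorize(data[col])
    data[col] = codes                                   # consecutive integers from 0
    id_maps[col] = dict(zip(uniques, range(len(uniques))))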

(3) Analysis of variables after data processing

2. Correlation analysis

(1) Calculate the correlation matrix

columns = ["tag"+str(i) for i in range(20)]
tags_mark = pd.DataFrame(user_tags_mark, columns=columns)
tags_mark["click"] = data["click"]
data_corr =data.corr()
tags_corr = tags_mark.corr()
Copy the code

(2) Heat map display

data:

plt.matshow(data_corr, cmap=plt.cm.gray)
plt.show()

tags_mark:

plt.matshow(tags_corr, cmap=plt.cm.gray)
plt.show()

The larger the correlation value, the brighter the corresponding cell in the image. Several rows/columns of the heat map built from data's correlation matrix are particularly bright, not because their correlation is particularly strong, but because the correlation there is undefined (NaN), which happens for columns with essentially no variance. At the same time, the tags_mark map is relatively bright overall, showing that the tags are fairly correlated with one another. However, the correlation of both data and tags with click is small, which suggests the target is not linearly related to any single feature and that several features have to act together.
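The NaN entries can be traced back to columns whose values are (almost) constant, for which Pearson correlation is undefined; a quick check to list those columns:

# columns whose entire correlation row is NaN are the (near-)constant ones
constant_cols = data_corr.columns[data_corr.isna().all().values]
print("Columns with undefined correlation:", list(constant_cols))
print(data[constant_cols].nunique())    # confirm how many distinct values they hold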

3. Feature engineering and feature selection

data_corr["click"].sort_values(ascending=True)

tags_corr["click"].sort_values(ascending=True)

Here we keep the features in data that correlate strongly with click, while tags_mark is retained in full.
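To make "strong correlation" concrete, the features of data can be ranked by the absolute value of their correlation with click; the sketch below keeps the top 8, a number chosen only for illustration:

click_corr = data_corr["click"].drop("click").abs().sort_values(ascending=False)
strong_features = click_corr.head(8).index.tolist()   # the 8 features most correlated with click
print(strong_features)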

4. Bivariate graph analysis

from pandas.plotting import scatter_matrix

attributes1 = ["creative_tp_dnf", "campaign_id", "creative_width", "creative_height",
               "app_id", "advert_name", "creative_type", "click"]

scatter_matrix(data[attributes1], figsize=(12, 8))
plt.show()

The scatter plots show little pairwise correlation between these features, so there is no clear linear relationship.

5. Generation of new features

The analysis above has already produced new features: dayofweek, month, and user_tags_mark. We now merge the more relevant features of data with the tags_mark features.

# drop the duplicate click column in tags_mark before concatenating
new_data = pd.concat([data[attributes1], tags_mark.drop("click", axis=1)], axis=1)
new_data.tail()

3. Machine learning models and cross-validation

1. Undersampling and splitting the dataset

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# undersample the negative class down to the size of the positive class
data_0 = new_data[new_data["click"] == 0]
data_sample = data_0.sample(n=len(new_data[new_data["click"] == 1]))

new_data = pd.concat([data_sample, new_data[new_data["click"] == 1]])

x_train, x_test, y_train, y_test = train_test_split(
    new_data.drop("click", axis=1), new_data["click"], test_size=0.3, random_state=7)

2. Normalization

data_std = StandardScaler()
data_std.fit(x_train)
x_train = data_std.transform(x_train)
x_test = data_std.transform(x_test)

3. Cross-verify and compare the effects of different models

Because the dataset is large, KNN and SVM are excluded up front for efficiency.

(1) Decision tree

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

tree = DecisionTreeClassifier()
tree.fit(x_train, y_train)
cross_val_score(tree, x_test, y_test)

(2) Bayes

from sklearn.naive_bayes import BernoulliNB

nb = BernoulliNB()
nb.fit(x_train, y_train)
cross_val_score(nb, x_test, y_test)

(3) Random forest

from sklearn.ensemble import RandomForestClassifier

rd_tree = RandomForestClassifier()
rd_tree.fit(x_train, y_train)

cross_val_score(rd_tree, x_test, y_test)

(4) bagging

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)
bag_clf.fit(x_train, y_train)

print("bagging:")
print("Cross validation:", cross_val_score(bag_clf, x_test, y_test))
print("Precision:", precision_score(y_test, bag_clf.predict(x_test)))
print("Recall:", recall_score(y_test, bag_clf.predict(x_test)))

(5) Ada Boosting

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(), n_estimators=100)
ada_clf.fit(x_train, y_train)

print("ada:")
print("Cross validation:", cross_val_score(ada_clf, x_test, y_test))
print("Precision:", precision_score(y_test, ada_clf.predict(x_test)))
print("Recall:", recall_score(y_test, ada_clf.predict(x_test)))

(6) Logistic regression

from sklearn.linear_model import LogisticRegression

reg = LogisticRegression()
reg.fit(x_train, y_train)
cross_val_score(reg, x_test, y_test)

(7) Deep learning fully connected model

import numpy as np

def to_one_hot(y):
    y = np.asarray(y)                      # works for both Series and arrays
    ans = np.zeros((len(y), 2))
    for i in range(len(y)):
        ans[i][int(y[i])] = 1
    return ans

y_train_hot = to_one_hot(y_train)
y_test_hot = to_one_hot(y_test)

from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.layers import Dense, Input
from keras.models import Model

early_stopping_cb = EarlyStopping(patience=5, restore_best_weights=True)
checkpoint_cb = ModelCheckpoint("datas.h5", save_best_only=True)

# 27 input features: the selected data features plus the 20 tag columns
data_input = Input(shape=(27,))
data_layer1 = Dense(128)(data_input)
data_layer2 = Dense(64)(data_layer1)
data_layer3 = Dense(32)(data_layer2)
data_layer4 = Dense(2)(data_layer3)

data_model = Model(data_input, data_layer4)
data_model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
data_history = data_model.fit(x_train, y_train_hot, epochs=100, validation_split=0.2,
                              batch_size=128, callbacks=[early_stopping_cb, checkpoint_cb])
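For the precision and recall figures reported below, the two-unit output of the Keras model has to be converted back into class labels first; a minimal sketch assuming the data_model trained above:

import numpy as np
from sklearn.metrics import precision_score, recall_score

pred = np.argmax(data_model.predict(x_test), axis=1)   # take the higher-scoring class
print("Precision:", precision_score(y_test, pred))
print("Recall:", recall_score(y_test, pred))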

(8) Stochastic gradient descent

from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier()
sgd.fit(x_train, y_train)
cross_val_score(sgd, x_test, y_test)

(9) xgboost

from xgboost import XGBClassifier
from sklearn.metrics import log_loss

xgb = XGBClassifier()
xgb.fit(x_train, y_train, eval_metric="logloss")
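The log_loss import above points at how the logloss numbers in the tables below can be computed: score the held-out set with predicted probabilities. A short sketch for the xgb model just fitted:

proba = xgb.predict_proba(x_test)
print("xgboost logloss:", log_loss(y_test, proba))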

(10) GradientBoosting

Because a single GradientBoosting classifier was still relatively weak here, we wrap it in a Bagging ensemble for better results.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier

gbc = GradientBoostingClassifier()
gbc_bag = BaggingClassifier(gbc, n_estimators=80)
gbc_bag.fit(x_train, y_train)

(11) Voting classifier

from sklearn.ensemble import VotingClassifier

vot_clf = VotingClassifier(estimators=[
    ('xgb_clf', xgb),
    ('gbc_bag_clf', gbc_bag)], voting = 'soft')

vot_clf.fit(x_train, y_train)

cross_val_score(vot_clf, x_test, y_test, cv=5)

4. Precision and recall

Model                                  Precision   Recall
Decision tree                          0.632       0.783
Random forest                          0.635       0.816
Bagging                                0.634       0.812
AdaBoost                               0.633       0.8095
Logistic regression                    0.611       0.926
Fully connected deep model             0.501       0.326
Stochastic gradient descent            0.606       0.951
Naive Bayes (Bernoulli)                0.613       0.867

5. Logloss

Model                                  Logloss
Decision tree                          13.29
Random forest                          1.00
Bagging                                0.59
AdaBoost                               0.74
Logistic regression                    0.58
Fully connected deep model             11.0
Stochastic gradient descent            13.57
Naive Bayes (Bernoulli)                0.78
xgboost                                0.423
GradientBoosting                       0.429

From the tables above we can see that this problem is well suited to tree-based and ensemble-based learning models.

4. Adjust the hyperparameters of the model

1. Hyperparameter optimization for the best models

According to the above analysis, we can get a table like this

Model                                  Accuracy   Precision   Recall   Logloss
Decision tree                          0.664      0.632       0.783    13.29
Random forest                          0.669      0.635       0.816    1.00
Bagging                                0.666      0.634       0.812    0.59
AdaBoost                               0.664      0.633       0.8095   0.74
Logistic regression                    0.667      0.611       0.926    0.58
Fully connected deep model             0.672      0.501       0.326    11.0
Stochastic gradient descent            0.666      0.606       0.951    13.57
Naive Bayes (Bernoulli)                0.657      0.613       0.867    0.78
xgboost                                -          -           -        0.423
GradientBoosting                       -          -           -        0.429

After weighing the results, xgboost and GradientBoosting were selected for hyperparameter optimization.
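The search itself is not shown in the article; below is a minimal sketch of how it could be done with GridSearchCV, where the parameter grid is purely illustrative and not the grid that was actually used.

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(XGBClassifier(), param_grid, scoring="neg_log_loss", cv=3)
search.fit(x_train, y_train)
print(search.best_params_, "logloss:", -search.best_score_)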

2. Evaluate the effect

After tuning, the xgboost and GradientBoosting models were evaluated again:

Model              Logloss (after tuning)
xgboost            0.417
GradientBoosting   0.416

5. Conclusion

From the experiments we can draw the following conclusions.

Although the stochastic gradient descent model is only average in accuracy, it performs well on recall, which is the highest of all models. Logistic regression has somewhat lower recall but also a fairly low logloss, though still not a satisfactory one; xgboost and GradientBoosting have the lowest logloss of all the models. Depending on the actual needs of ad delivery, we can choose different models. If we insist that every user who would click gets to see the ad, we can chase recall alone and pick stochastic gradient descent. If there is a logloss requirement but not a strict one, logistic regression is an option; if logloss is the priority, xgboost and GradientBoosting are the choices. The models that best meet the evaluation criteria can also be combined with a voting classifier, and for even better results one could try LGBM, or ensemble learning over xgboost and GradientBoosting.