Preface

In the era of big data, many fields need to mine value from the data they produce. Analysing that data lets us describe how things have developed and, to some extent, predict how they will develop. Taking the medical field as the background, this article analyses heart-rate signal data to predict which category each signal belongs to.

1.1 Jupyter setup, package imports and data set loading

Import related modules.

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.exceptions import ConvergenceWarning
import sklearn
import pandas_profiling

Suppress warnings.

warnings.filterwarnings('ignore')
warnings.filterwarnings(action ='ignore',category=ConvergenceWarning)   

To prevent garbled Chinese characters in plot labels, set a Chinese font for Matplotlib and Seaborn.

mpl.rcParams['font.sans-serif'] =[u'simHei']
mpl.rcParams['axes.unicode_minus'] =False
sns.set(font='SimHei')

Set the number of rows and columns displayed in Jupyter.

pd.options.display.min_rows = None
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', 30)

Load data.

df_train =pd.read_csv('train.csv',encoding='utf-8')
df_test=pd.read_csv('testA.csv',encoding='utf-8')

1.2 Exploratory analysis

1.2.1 Dataset Preview

df_train.head(5).append(df_train.tail(5))
----------------------------
          id                                  heartbeat_signals  label
0          0  0.9912297987616655,0.9435330436439665,0.764677...    0.0
1          1  0.9714822034884503,0.9289687459588268,0.572932...    0.0
2          2  1.0,0.9591487564065292,0.7013782792997189,0.23...    2.0
3          3  0.9757952826275774,0.9340884687738161,0.659636...    0.0
4          4  0.0,0.055816398940721094,0.26129357194994196,0...    2.0
99995  99995  1.0,0.677705342021188,0.22239242747868546,0.25...    0.0
99996  99996  0.9268571578157265,0.9063471198026871,0.636993...    2.0
99997  99997  0.9258351628306013,0.5873839035878395,0.633226...    3.0
99998  99998  1.0,0.9947621698382489,0.8297017704865509,0.45...    2.0
99999  99999  0.9259994004527861,0.916476635326053,0.4042900...    0.0
  • Preview relevant statistics
df_train.describe()
----------------------------
                  id          label
count  100000.000000  100000.000000
mean    49999.500000       0.856960
std     28867.657797       1.217084
min         0.000000       0.000000
25%     24999.750000       0.000000
50%     49999.500000       0.000000
75%     74999.250000       2.000000
max     99999.000000       3.000000
  • Previewing data types
  • The heartbeat_signals column is stored as a string (object dtype) and needs to be converted to float before training; a quick check follows the info() output below
df_train.info()
----------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   id                 100000 non-null  int64
 1   heartbeat_signals  100000 non-null  object
 2   label              100000 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 2.3+ MB
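As a quick check that one of these strings parses cleanly into numbers (a small verification added here, not in the original notebook):

signal_0 = np.array(df_train['heartbeat_signals'].iloc[0].split(','), dtype=float)
signal_0.shape   # expected: (205,), one float per sampling point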
  • Preview training set, test set dimensions
df_train.shape, df_test.shape
----------------------------
((100000, 3), (20000, 2))
  • The data quality is good: there are no missing values, as the organisers have already preprocessed the data
missing_pct = df_train.isnull().sum() * 100 / len(df_train)
missing = pd.DataFrame({'name': df_train.columns, 'missing_pct': missing_pct})
missing.sort_values(by='missing_pct', ascending=False).head()
----------------------------
                                name  missing_pct
id                                id          0.0
heartbeat_signals  heartbeat_signals          0.0
label                          label          0.0

1.2.2 Number and distribution of heartbeat signal categories

  • Category 0 accounted for the most at 64.3%, while Category 1 accounted for the least at 6.4%
fig, ax = plt.subplots(1, 2, figsize=(15, 8))
sns.countplot('label', data=df_train, ax=ax[0], palette=['m', 'c', 'pink', 'orange'])
df_train['label'].value_counts().plot.pie(autopct='%1.1f%%', ax=ax[1], colors=['m', 'c', 'pink', 'orange'])
ax[0].set_ylabel('')
ax[0].set_xlabel('label')
ax[1].set_ylabel('')
ax[1].set_xlabel('label')
plt.show()
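For the exact numbers behind the pie chart, a one-line check (added for illustration; output omitted) is:

df_train['label'].value_counts(normalize=True) * 100   # percentage per class; class 0 should be about 64.3%, class 1 about 6.4%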

1.2.3 Variation trend of different types of signals

  • The signal curves of the four classes follow different trends
  • The signal strengthens quickly and each curve shows a spike
def get_figure(n):
    fig, ax = plt.subplots(2, 2, figsize=(15, 8))
    axs = [ax[i][j] for i in range(2) for j in range(2)]
    for i in range(4):
        labels_ds = []
        for j in range(n):
            df_train[df_train['label'] == i][:n].iloc[j, :-1].plot(ax=axs[i])
            labels_ds.append(f'signal {j+1}')
        axs[i].legend(labels_ds)
        axs[i].set_title(f'category {i}', fontsize=20)

get_figure(4)

1.2.4 Average variation trend of heartbeat signal categories

  • It can be seen that the average trends of the four heartbeat-signal classes differ: category 3 changes steeply overall, while category 0 is gentle.
def get_figure(n):
    labels_ds = []
    plt.figure()
    for i in range(4):
        df_train[df_train['label'] == i].iloc[:, :-1].mean().plot()
        labels_ds.append(f'category {i}')
    plt.legend(labels=labels_ds)
    plt.title('average', fontsize=20)

get_figure(4)

1.3 Feature Engineering

1.3.1 Data feature extraction

  • Because heartbeat_signals is a string, it cannot be analysed directly; it needs to be split into separate numeric features. The test set is appended to the training set so that both are transformed together.
targets=df_train.label
df_train.drop(['label'],axis=1,inplace=True)
combined=df_train.append(df_test)
  • Build new features through transformation functions
def transform_dataset():
    global combined
    hbs = 'heartbeat_signals'
    rebuilt = combined[hbs].str.split(',', expand=True)
    rebuilt.columns = [f'hbs_{i}' for i in range(rebuilt.shape[1])]
    combined.drop(hbs, axis=1, inplace=True)
    combined = pd.concat([combined, rebuilt], axis=1)
    combined.drop('id', axis=1, inplace=True)
    combined = combined.astype(float)
    return combined

combined = transform_dataset()
  • Preview the processed data set
combined.head()
----------------------------
      hbs_0     hbs_1     hbs_2     hbs_3     hbs_4  ...  hbs_200  hbs_201  hbs_202  hbs_203  hbs_204
0  0.991230  0.943533  0.764677  0.618571  0.379632  ...      0.0      0.0      0.0      0.0      0.0
1  0.971482  0.928969  0.572933  0.178457  0.122962  ...      0.0      0.0      0.0      0.0      0.0
2  1.000000  0.959149  0.701378  0.231778  0.000000  ...      0.0      0.0      0.0      0.0      0.0
3  0.975795  0.934088  0.659637  0.249921  0.237116  ...      0.0      0.0      0.0      0.0      0.0
4  0.000000  0.055816  0.261294  0.359847  0.433143  ...      0.0      0.0      0.0      0.0      0.0

[5 rows x 205 columns]

1.3.2 Time-series feature processing (described separately)

  • Objective: pivot the data so that each sample's signal values become a long-format time series, one value per row, indexed by sample id and time step
  • First split each sample's signal string into one column per time step
train_hbs = df_train["heartbeat_signals"].str.split(",", expand=True)
  • stack() moves the column labels (the time steps) onto the innermost level of the row index
  • After stacking, reset the index, turn the resulting level_0 column back into the row labels and drop the index name
  • Rename the columns: {"level_1": "time", 0: "heartbeat_signals"}
  • Change the data type to float
train_hbs=train_hbs.stack()
train_hbs = train_hbs.reset_index()
train_hbs = train_hbs.set_index("level_0")
train_hbs.index.name = None
train_hbs.rename(columns={"level_1":"time", 0:"heartbeat_signals"}, inplace=True)
train_hbs["heartbeat_signals"] = train_hbs["heartbeat_signals"].astype(float)
train_hbs
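To make the stack/reset_index pipeline easier to follow, here is a minimal toy example (hypothetical data, two samples with three signal points each, not part of the competition data):

toy = pd.DataFrame([[0.9, 0.5, 0.1],
                    [1.0, 0.4, 0.0]])                # 2 samples, columns 0..2 are the time steps
toy_long = toy.stack().reset_index()                 # wide -> long: one row per (sample, time step)
toy_long = toy_long.set_index("level_0")             # the sample id becomes the row index again
toy_long.index.name = None
toy_long.rename(columns={"level_1": "time", 0: "heartbeat_signals"}, inplace=True)
toy_long                                             # 6 rows with columns 'time' and 'heartbeat_signals'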
  • Save the labels, then join the long-format signal data onto the training set
targets =df_train['label']
df_train.drop(['label','heartbeat_signals'],axis=1,inplace=True)
df_train=df_train.join(train_hbs)
  • Import the time feature extraction module
import tsfresh as tsf
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute
  • Execute feature extraction
  • This step is very demanding on memory and compute: with 16 GB of RAM the extraction still raises a MemoryError at about 60% progress, and a run on Kaggle reports insufficient memory at about 20%. For now only the idea and method are recorded; the concrete implementation will be completed later.
df_features = extract_features(df_train, column_id='id', column_sort='time')
df_features
----------------------------
Feature Extraction:  60%|█████████████████████▏  | 18/30 [43:52<13:09, 65.83s/it]
Exception in thread Thread-8:
Traceback (most recent call last):
  File "E:\AI_home\Anaconda3\lib\threading.py", line 926, in _bootstrap_inner
    self.run()
  File "E:\AI_home\Anaconda3\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "E:\AI_home\Anaconda3\lib\multiprocessing\pool.py", line 470, in _handle_results
    task = get()
  File "E:\AI_home\Anaconda3\lib\multiprocessing\connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "E:\AI_home\Anaconda3\lib\multiprocessing\connection.py", line 318, in _recv_bytes
    return self._get_more_data(ov, maxsize)
  File "E:\AI_home\Anaconda3\lib\multiprocessing\connection.py", line 340, in _get_more_data
    ov, err = _winapi.ReadFile(self._handle, left, overlapped=True)
MemoryError
Feature Extraction:  60%|█████████████████████▌  | 18/30 [58:00<38:40, 193.36s/it]
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-7-250d996ac0cf> in <module>
----> 1 df_features = extract_features(df_train, column_id='id', column_sort='time')
      2 df_features

# The same cell run on Kaggle:
df_features = extract_features(df_train, column_id='id', column_sort='time')
df_features
----------------------------
Feature Extraction:  20%|██        | 2/10 [38:37<2:07:36, 957.02s/it]
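If the full extraction does not fit in memory, one workaround (a hedged sketch of an alternative, not what was run above) is to restrict tsfresh to a smaller, cheaper feature set and to tune its parallelism and chunking:

from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters   # EfficientFCParameters is a larger middle ground

# A reduced feature dictionary drastically cuts memory and runtime;
# n_jobs and chunksize trade parallel speed against peak memory use.
df_features = extract_features(df_train,
                               column_id='id',
                               column_sort='time',
                               default_fc_parameters=MinimalFCParameters(),
                               n_jobs=2,
                               chunksize=10)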
  • Handle null values (imputation)
from tsfresh.utilities.dataframe_functions import impute
from tsfresh import select_features
impute(df_features)
  • Feature selection
train_features_filtered = select_features(df_features, targets)

1.3.3 One-hot encoding

  • Finally, one-hot encode the target label
def dummies_coder():
    global df_train
    name='label'
    df_dummies = pd.get_dummies(df_train[name],prefix=name)
    df_train = pd.concat([df_train,df_dummies],axis=1)
    df_train.drop(name,axis=1,inplace=True)
    return df_train

1.4 Model training

  • Import related modules
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.linear_model import LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb
from sklearn.svm import SVC
import lightgbm as lgb
  • Separate training set and test set
df_train = combined[:100000]
df_test = combined[100000:]
  • Change data types to reduce memory usage
def reduce_mem_usage(df):
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

df_train = reduce_mem_usage(df_train)
----------------------------
Memory usage of dataframe is 157.17 MB
Memory usage after optimization is: 39.86 MB
Decreased by 74.6%
  • Split into training and validation sets
x_train, x_val, y_train, y_val = train_test_split(df_train, targets, test_size=0.2, random_state=42)
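Because the classes are imbalanced (class 0 is about 64% of the data, class 1 about 6%), a stratified split is an option worth considering (an added variation, not used in the runs below):

x_train, x_val, y_train, y_val = train_test_split(
    df_train, targets, test_size=0.2, random_state=42,
    stratify=targets)   # keep the 0/1/2/3 label proportions identical in both splits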

1.4.1 Basic model usage and scoring

  • Create a model classifier
lgrcv =LogisticRegressionCV()
rf =RandomForestClassifier()
knn=KNeighborsClassifier()
dt=DecisionTreeClassifier()
xgbc = xgb.XGBClassifier()
lgbm = lgb.LGBMClassifier()
models = [lgrcv, rf, knn, dt, xgbc, lgbm]
  • Model training and evaluation
for model in models:
    model=model.fit(x_train,y_train)
    predict_train =model.predict(x_train)
    predict_val=model.predict(x_val)
    print(model)
#     print('train Accuracy:',metrics.accuracy_score(y_train,predict_train))
    print('val Accuracy:',metrics.accuracy_score(y_val,predict_val))
#     print('train f1-score :',metrics.f1_score(y_train,predict_train))
    print('val f1-score :',metrics.f1_score(y_val,predict_val,average='macro'))
#     print('train mean_squared_error :',metrics.mean_squared_error(y_train,predict_train))
    print('val mean_squared_error :',metrics.mean_squared_error(y_val,predict_val))
    a = model.predict_proba(x_val)
    fpr, tpr, thresholds = metrics.roc_curve(y_val, y_score=[i[1] for i in a], pos_label=1)
    print('auc:',metrics.auc(fpr, tpr))
    print('**********************************')
  • Model evaluation Results
LogisticRegressionCV()
val Accuracy: 0.8738
val f1-score : 0.7797043643993938
val mean_squared_error : 0.52295
auc: 0.8751658941402964
**********************************
RandomForestClassifier()
val Accuracy: 0.9806
val f1-score : 0.9486853009647026
val mean_squared_error : 0.06155
auc: 0.9920398427752294
**********************************
KNeighborsClassifier()
val Accuracy: 0.9783
val f1-score : 0.9486774842272341
val mean_squared_error : 0.0706
auc: 0.9699227365057602
**********************************
DecisionTreeClassifier()
val Accuracy: 0.9568
val f1-score : 0.9094243474402218
val mean_squared_error : ...
auc: 0.8794967552617907
**********************************
XGBClassifier(objective='multi:softprob')
val Accuracy: ...
val f1-score : 0.8916767743199139
val mean_squared_error : 0.2072
auc: 0.9635107593719082
**********************************
LGBMClassifier()
val Accuracy: 0.9826
val f1-score : ...
val mean_squared_error : 0.05355
auc: 0.9909607298485397

1.4.2 Adjusting hyperparameters

  • LightGBM performed well, so it is selected for further parameter tuning
  • Convert to Dataset data format
train_matrix = lgb.Dataset(x_train,label = y_train)
val_matrix = lgb.Dataset(x_val,label = y_val)
  • Define an f1_score evaluation function (feval) for LightGBM
from sklearn.metrics import f1_score

def val_f1_score(preds, lgtrain):
    # LightGBM passes multiclass raw predictions as a flat array grouped by class,
    # so reshape to (num_class, num_data) and take the argmax over the class axis
    preds = np.argmax(preds.reshape(4, -1), axis=0)
    label = lgtrain.get_label()
    result_f1_score = f1_score(preds, label, average='macro')
    return 'f1_score', result_f1_score, True
  • Set initial parameters
params = {
    "learning_rate": 0.1,
    "lambda_l2": 0.1,
    "max_depth": -1,
    "num_leaves": 128,
    "bagging_fraction": 0.8,
    "feature_fraction": 0.8,
    "metric": None,
    "objective": "multiclass",
    "num_class": 4,
    "nthread": 10,
    "verbose": -1,
}
  • Model training
model = lgb.train(params,
                  train_set=train_matrix,
                  valid_sets=val_matrix,
                  num_boost_round=2000,
                  verbose_eval=50,
                  early_stopping_rounds=200,
                  feval=val_f1_score)
----------------------------
Training until validation scores don't improve for 200 rounds
[50]    valid_0's multi_logloss: 0.0472522    valid_0's f1_score: 0.962766
[100]   valid_0's multi_logloss: 0.0414565    valid_0's f1_score: 0.969621
[150]   valid_0's multi_logloss: 0.0432075    valid_0's f1_score: 0.971069
[200]   valid_0's multi_logloss: 0.0446821    valid_0's f1_score: 0.971485
[250]   valid_0's multi_logloss: 0.0460871    valid_0's f1_score: 0.971569
Early stopping, best iteration is:
[99]    valid_0's multi_logloss: 0.0414366    valid_0's f1_score: 0.969254
  • With these parameters the validation f1_score improves
val_pre = model.predict(x_val, num_iteration=model.best_iteration)
preds = np.argmax(val_pre, axis=1)
score = f1_score(y_true=y_val, y_pred=preds, average='macro')
score
----------------------------
0.9692538553684411
  • Tune the parameters with grid search
from sklearn.metrics import make_scorer

model_lgb = lgb.LGBMClassifier(random_state=2021)
params_dic = dict(learning_rate=[0.01, 0.1, 1],
                  n_estimators=[20, 50, 120, 300],
                  num_leaves=[10, 30],
                  max_depth=[-1, 4, 10])
f1_scorer = make_scorer(f1_score, average='micro')
grid_search = GridSearchCV(model_lgb, cv=5, param_grid=params_dic, scoring=f1_scorer)
grid_search.fit(x_train, y_train)
print(f'The best parameters are: {grid_search.best_params_}')
print(f'The best score is: {grid_search.best_score_}')
----------------------------
The best parameters are: {'learning_rate': 0.1, 'max_depth': -1, 'n_estimators': 300, 'num_leaves': 30}
The best score is: 0.9863375000000001

1.4.3 Calculation of ABS-sum

  • Train the final model and predict on df_test
model_final = lgb.LGBMClassifier(random_state=2021, learning_rate=0.1,
                                 n_estimators=300, max_depth=-1, num_leaves=30)
model_final.fit(x_train, y_train)
pre_test = model_final.predict(df_test)
proba_test = model_final.predict_proba(df_test)
  • One-hot encode the predicted labels into 4 classes
df_pred =pd.DataFrame(pre_test,columns=['label'])
def dummies_coder():
    global df_pred
    name='label'
    df_dummies = pd.get_dummies(df_pred[name],prefix=name)
    df_pred = pd.concat([df_pred,df_dummies],axis=1)
    df_pred.drop(name,axis=1,inplace=True)
    return df_pred
df_pred =dummies_coder()
pred_arr=np.array(df_pred)
  • Calculation of abs-sum
def abs_sum():
    global pred_arr, proba_test
    return sum(sum(abs(proba_test - pred_arr)))

result = abs_sum()
result
----------------------------
310.1228671266683
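The same value can be written as a single vectorised expression (added for clarity; proba_test and pred_arr are the arrays built above). The metric simply sums |p_ij - y_ij| over every sample i and class j:

np.abs(proba_test - pred_arr).sum()   # identical to abs_sum()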

1.4.4 Model prediction and output result file

  • Generate and submit the prediction probability file

df_proba =pd.DataFrame(proba_test,columns=['1','2','3','4'])
df_result = pd.DataFrame()
abc = pd.read_csv('sample_submit.csv',encoding='utf-8')
df_result['id'] = abc['id']
df_result['label_0']=df_proba['1']
df_result['label_1']=df_proba['2']
df_result['label_2']=df_proba['3']
df_result['label_3']=df_proba['4']
df_result
df_result.to_csv('411_submit.csv', index=False)