One-hot encoding (OneHotEncoder) and label encoding (LabelEncoder)
1. Introduction
In feature engineering, we often use OneHotEncoder and LabelEncoder to handle non-numeric categorical features. Take gender as an example, with the categories male and female. A model cannot work with these string values directly, so they need to be encoded as numbers. For example:
Feature | Encoding |
---|---|
male | 1 |
female | 0 |
female | 0 |
male | 1 |
female | 0 |
male | 1 |
LabelEncoder converts these two categories to 0 and 1; with three categories, the codes would be 0, 1 and 2. Codes are assigned in sorted order of the labels, which is why female maps to 0 and male to 1.
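As a minimal sketch of the table above, the following runs LabelEncoder on a toy list of the same six values. The list `sex` is made up for illustration and is not the article's dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Toy data mirroring the table above (an assumption, not the article's dataset).
sex = ["male", "female", "female", "male", "female", "male"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sex)

# classes_ holds the sorted unique labels; a label's code is its index there.
print(encoder.classes_.tolist())  # ['female', 'male']
print(codes.tolist())             # [1, 0, 0, 1, 0, 1]

# inverse_transform maps integer codes back to the original labels.
print(encoder.inverse_transform([0, 1]).tolist())  # ['female', 'male']
```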
With OneHotEncoder, the feature is converted into a matrix instead:
Feature | Sex_male | Sex_female |
---|---|---|
male | 1 | 0 |
female | 0 | 1 |
female | 0 | 1 |
male | 1 | 0 |
female | 0 | 1 |
male | 1 | 0 |
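The same toy values can be passed through OneHotEncoder as a rough illustration. This is a sketch on made-up data. Note that scikit-learn sorts the categories, so the female column comes first here, whereas the table above lists Sex_male first.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects 2-D input: one row per sample, one column per feature.
sex = np.array([["male"], ["female"], ["female"], ["male"], ["female"], ["male"]])

encoder = OneHotEncoder()
# The result is a sparse matrix by default; toarray() makes it dense for display.
onehot = encoder.fit_transform(sex).toarray()

# Categories are sorted, so column 0 is 'female' and column 1 is 'male'.
print(encoder.categories_[0].tolist())  # ['female', 'male']
print(onehot.astype(int).tolist())
```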
Both approaches turn the feature into numbers, so what is the difference?
- LabelEncoder keeps the feature one-dimensional but produces integer codes such as 0, 1, 2 and 3.
- OneHotEncoder produces linearly independent vectors, one per category.
If red, blue and green are encoded as 0, 1 and 2, the encoding introduces numerical relationships that did not exist before: green is greater than red, and the mean of green and red equals blue. The categories are actually independent of one another and have no such relationships. OneHotEncoder avoids this problem by producing one linearly independent vector per category. The drawback is that a feature with many categories greatly increases the feature dimension, which wastes resources, lengthens computation and yields a very sparse matrix. In such cases, PCA can sometimes be used to reduce the dimensionality.
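The artificial relationships described above can be checked with a few lines of arithmetic. This sketch hard-codes the hypothetical codes red=0, blue=1, green=2 from the text (LabelEncoder itself would assign codes in alphabetical order) and compares them with hand-written one-hot vectors.

```python
import numpy as np

# Hypothetical codes matching the example in the text, not LabelEncoder output.
label_codes = {"red": 0, "blue": 1, "green": 2}

# Arithmetic on the codes produces a relation that has no real meaning.
mean_red_green = (label_codes["red"] + label_codes["green"]) / 2
print(mean_red_green == label_codes["blue"])  # True

# One-hot vectors: each colour gets its own axis.
onehot = {
    "red": np.array([1, 0, 0]),
    "blue": np.array([0, 1, 0]),
    "green": np.array([0, 0, 1]),
}

def distance(a, b):
    """Euclidean distance between two encoded values."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

# With label codes, red is twice as far from green as it is from blue.
print(distance(label_codes["red"], label_codes["green"]))  # 2.0
print(distance(label_codes["red"], label_codes["blue"]))   # 1.0

# With one-hot vectors, every pair of colours is equally far apart.
print(distance(onehot["red"], onehot["green"]) == distance(onehot["red"], onehot["blue"]))  # True
```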
2. Code testing
2.1 Importing related libraries
import numpy as np
import pandas as pd
# support vector classifier
from sklearn.svm import SVC
# evaluation metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
# encoders
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# cross-validation
from sklearn.model_selection import cross_val_score
2.2 Reading Data
data=pd.read_csv('Narrativedata.csv',index_col=0)
data
2.3 Checking for missing values
data.isnull().sum()
2.4 Filling missing Age values with the median
# assign the result back instead of calling inplace=True on a column selection
data['Age']=data['Age'].fillna(data['Age'].median())
data.isnull().sum()
2.5 Dropping rows with missing Embarked values
data.dropna(inplace=True)
data.isnull().sum()
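Steps 2.4 and 2.5 can be tried without the CSV file on a small made-up frame. The column names follow the article, but the values in `toy` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing Age and one missing Embarked.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
})
print(toy.isnull().sum().to_dict())  # {'Age': 1, 'Embarked': 1}

# Fill the missing Age with the median of the observed ages (22, 30, 40).
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
print(toy["Age"].tolist())  # [22.0, 30.0, 30.0, 40.0]

# Drop the rows that still contain a missing value (the missing Embarked).
toy = toy.dropna()
print(len(toy))  # 3
```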
2.6 Viewing the category of each feature
display(np.unique(data['Sex']))
display(np.unique(data['Embarked']))
display(np.unique(data['Survived']))
x=data.drop(columns=['Survived'])
y=data['Survived']
2.7 Encoding the labels with LabelEncoder
from sklearn.preprocessing import LabelEncoder
y=LabelEncoder().fit_transform(y)
y
2.8 Creating dummy variables for the labels with pandas
y=data['Survived']
y=pd.get_dummies(y)
y
2.9 Creating dummy variables for the features
x=pd.get_dummies(x.drop(columns=['Age']))
x
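To see what pd.get_dummies does with a frame that mixes column types, here is a small sketch on made-up data. The frame `toy` and its values are assumptions; only the column names follow the article. Numeric columns are left as they are, and each string column is expanded into one column per category.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "Age": [22.0, 38.0, 26.0],
})

dummies = pd.get_dummies(toy)

# The untouched numeric column comes first, then the expanded categorical ones.
print(dummies.columns.tolist())
# ['Age', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_S']
```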
2.10 One-hot encoding the features
from sklearn.preprocessing import OneHotEncoder
x=data.drop(columns=['Survived','Age'])
x=OneHotEncoder().fit_transform(x).toarray()
pd.DataFrame(x)
2.11 Model testing
2.11.1 One-hot encoding
x=data.drop(columns=['Age','Survived'])
y=data['Survived']
# one-hot encode the categorical features, then add Age back
x=pd.get_dummies(x)
x['Age']=data['Age']
y=LabelEncoder().fit_transform(y)
# model test: 5-fold cross-validated accuracy for each kernel
for kernel in ["linear","poly","rbf","sigmoid"]:
    clf = SVC(kernel=kernel, gamma="auto", degree=1, cache_size=5000)
    score=cross_val_score(clf,x,y,cv=5,scoring='accuracy').mean()
    print('{:10s}:{}'.format(kernel,score))
2.11.2 LabelEncoder encoding
x=data.drop(columns=['Age','Survived'])
y=data['Survived']
df=pd.DataFrame(index=x.index)
# label-encode each feature column separately
for i in x.columns:
    df[i]=LabelEncoder().fit_transform(x[i])
y=LabelEncoder().fit_transform(y)
# model test: 5-fold cross-validated accuracy for each kernel
for kernel in ["linear","poly","rbf","sigmoid"]:
    clf = SVC(kernel=kernel, gamma="auto", degree=1, cache_size=5000)
    score=cross_val_score(clf,df,y,cv=5,scoring='accuracy').mean()
    print('{:10s}:{}'.format(kernel,score))
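The two experiments above can also be written as one loop over encoders. The sketch below is an alternative formulation, not the code used in this article. It swaps in `ColumnTransformer`, `make_pipeline` and `OrdinalEncoder` (scikit-learn's integer encoder for feature columns, used here in place of the per-column LabelEncoder), and it runs on randomly generated data. Because the labels are random, the printed scores mean nothing; only the mechanics are illustrated.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption): column names follow the article,
# values and labels are random.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Sex": rng.choice(["male", "female"], size=n),
    "Embarked": rng.choice(["S", "C", "Q"], size=n),
    "Age": rng.uniform(1, 80, size=n),
})
y = rng.integers(0, 2, size=n)

categorical = ["Sex", "Embarked"]
encoders = {
    "one-hot": OneHotEncoder(handle_unknown="ignore"),
    "ordinal": OrdinalEncoder(),
}

scores = {}
for name, encoder in encoders.items():
    # Encode only the categorical columns; Age is passed through unchanged.
    preprocess = ColumnTransformer(
        [("cat", encoder, categorical)], remainder="passthrough"
    )
    # Fitting the encoder inside the pipeline means it is refit on each training fold.
    model = make_pipeline(preprocess, SVC(kernel="rbf"))
    scores[name] = cross_val_score(model, toy, y, cv=5, scoring="accuracy").mean()
    print("{:10s}:{:.3f}".format(name, scores[name]))
```

Replacing `toy` and `y` with the real features and labels would give a comparison in the same spirit as sections 2.11.1 and 2.11.2, with Age included for both encodings.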