Label Encoding and One Hot Encoding are the two main encoding methods for categorical columns. What is the difference between them?
Straight to the conclusion:
- If the raw data are ordered discrete values => Label Encoding
- If the raw data are unordered discrete values => One Hot Encoding (Dummies)
Two points are explained below:
Why do we convert discrete values to numbers?
Since most models are based on mathematical operations, string data cannot be operated on directly.
Why convert unordered values to one-hot?
Mathematical operations here generally mean using distance as a proxy for similarity (from a geometric point of view): the difference between two encoded values is taken as the degree of similarity between them.
If a male/female field is converted to 0 and 1, there is no problem, because the field is binary. If apple, banana and watermelon in an unordered fruit field are converted to 0, 1 and 2, it implies that “banana and apple” are more similar than “watermelon and apple”, which is wrong. If an ordered age field holds old, middle-aged and young, label encoding them as 0, 1 and 2 is appropriate. If one-hot is forced on such a field instead, the ordering relationship is lost.
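The distance argument above can be checked numerically. The following is a minimal sketch, assuming Euclidean distance as the similarity measure and using NumPy; the helper name `pairwise_distances` is made up for this illustration.

```python
import numpy as np

# Row order: apple, banana, watermelon

# Label encoding: apple=0, banana=1, watermelon=2 (one column)
label = np.array([[0.0], [1.0], [2.0]])

# One-hot encoding: one column per category
one_hot = np.eye(3)

def pairwise_distances(x):
    """Euclidean distance between every pair of rows (hypothetical helper)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

label_dist = pairwise_distances(label)
one_hot_dist = pairwise_distances(one_hot)

# apple-banana vs apple-watermelon under each encoding
print(label_dist[0, 1], label_dist[0, 2])
print(one_hot_dist[0, 1], one_hot_dist[0, 2])
```

With label codes, apple ends up closer to banana than to watermelon, while with one-hot every pair of fruits is the same distance apart.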
Implementation in Python
Conveniently, both pandas and scikit-learn provide encoding operations. The following illustrates the differences in usage:
① Label Encoding
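import pandas as pd
df = pd.DataFrame({'size': ['XXL', 'XL', 'L', 'M', 'S']})
# Using Pandas
# Codes follow the order of `categories` (here the order of appearance: XXL=0 ... S=4)
cat = pd.Categorical(df['size'], categories=df['size'].unique(), ordered=True)
df['size_code'] = cat.codes
# Using sklearn
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df['size'])
# LabelEncoder sorts the classes (L, M, S, XL, XXL), so its codes do not follow the size order
le.transform(df['size'])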
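Since label encoding is meant for ordered values, it may help to spell the order out rather than rely on order of appearance or alphabetical sorting. A minimal sketch with pandas, assuming the sizes run from S (smallest) to XXL (largest); the `size_order` list is an assumption made for this example:

```python
import pandas as pd

df = pd.DataFrame({'size': ['XXL', 'XL', 'L', 'M', 'S']})

# Assumed ordering, smallest to largest
size_order = ['S', 'M', 'L', 'XL', 'XXL']

# Codes are the positions in size_order, so S=0 ... XXL=4
cat = pd.Categorical(df['size'], categories=size_order, ordered=True)
df['size_code'] = cat.codes
print(df)
```

This way the numeric gap between two codes reflects how far apart the sizes are in the chosen order.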
② One Hot Encoding (Dummies)
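# Using Pandas
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c']})
# One 0/1 column per (column, value) pair: A_a, A_b, B_a, B_b, B_c
pd.get_dummies(df)
# Using sklearn
enc = preprocessing.OneHotEncoder()
# Each of the three input columns is treated as its own categorical feature
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
# transform returns a sparse matrix; toarray() makes it dense
enc.transform([[0, 1, 3]]).toarray()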
Note that both libraries perform the same kind of operation; only the toolkit differs.
License
This work is by Chang Wei-Yaun (V123582) and is distributed under a Creative Commons Attribution-ShareAlike 3.0 Unported license.