Label Encoding and One Hot Encoding are the two main encoding methods for categorical columns. What is the difference between them?

Straight to the conclusion:

  • If the raw data are ordered discrete values => Label Encoding
  • If the raw data are unordered discrete values => One Hot Encoding (Dummies)

Two questions explain why:

Why do we convert discrete values to numbers?

Most models are built on mathematical operations, and those operations cannot be applied to string data.

Why use one-hot for unordered values?

The mathematical operations in question generally use distance as a proxy for similarity (from a geometric point of view): the smaller the difference between two encoded values, the more similar the model treats them.

If a male/female field is encoded as 0 and 1, there is no problem, because the field is binary. If apple, banana and watermelon in an unordered fruit field are encoded as 0, 1 and 2, the encoding implies that "banana and apple" are more similar than "watermelon and apple", which is wrong. If the values are old, middle-aged and young in an ordered age column, labeling them 0, 1 and 2 is appropriate; forcing them into one-hot would discard the ordering between them.
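The distance argument above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the fruit codes and the helper names `euclidean` and `pairwise` are made up for this example and do not come from any library.

```python
import math

# Hypothetical encodings for an unordered fruit column (illustration only).
label_codes = {"apple": [0.0], "banana": [1.0], "watermelon": [2.0]}
one_hot_codes = {
    "apple": [1.0, 0.0, 0.0],
    "banana": [0.0, 1.0, 0.0],
    "watermelon": [0.0, 0.0, 1.0],
}


def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise(codes):
    """Distance for every unordered pair of categories."""
    names = sorted(codes)
    return {
        (a, b): euclidean(codes[a], codes[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


label_dist = pairwise(label_codes)
one_hot_dist = pairwise(one_hot_codes)

# Label codes make apple/watermelon twice as far apart as apple/banana,
# while one-hot codes keep every pair at the same distance.
for pair in label_dist:
    print(pair, label_dist[pair], round(one_hot_dist[pair], 3))
```

With label codes the model sees an ordering that the fruits do not have; with one-hot codes no pair is favored over another.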

Implementation in Python

Conveniently, both pandas and scikit-learn provide encoding operations. The following shows how their usage differs:

1) Label Encoding

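import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'size': ['XXL', 'XL', 'L', 'M', 'S']})

# Using Pandas
# Categories are taken in order of first appearance, so XXL=0, XL=1, ..., S=4
cat = pd.Categorical(df['size'], categories=df['size'].unique(), ordered=True)
df['size_code'] = cat.codes

# Using sklearn
# LabelEncoder sorts the classes itself (alphabetically here), so its codes differ from the Pandas ones
le = preprocessing.LabelEncoder()
le.fit(df['size'])
le.transform(df['size'])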
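
One caveat when the order matters: LabelEncoder assigns codes by sorted class value, not by any ranking of the sizes. The following is a minimal sketch of one way to keep an explicit order in scikit-learn, using OrdinalEncoder with a category list. The ranking in `size_order` (smallest first) is an assumption made for this example.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = ["XXL", "XL", "L", "M", "S"]

# LabelEncoder sorts its classes, so the codes follow alphabetical order.
le = LabelEncoder()
alphabetical_codes = le.fit_transform(sizes).tolist()
print(list(le.classes_), alphabetical_codes)

# OrdinalEncoder accepts an explicit category order for each column.
size_order = ["S", "M", "L", "XL", "XXL"]  # assumed ranking, smallest first
oe = OrdinalEncoder(categories=[size_order])
# The encoder expects 2-D input: one row per sample, one column per feature.
ordinal_codes = oe.fit_transform([[s] for s in sizes]).ravel().tolist()
print(ordinal_codes)
```

Here the ordinal codes decrease from XXL to S, matching the assumed ranking, whereas the LabelEncoder codes place L below S.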

2) One Hot Encoding (Dummies)

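import pandas as pd
from sklearn import preprocessing

# Using Pandas
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c']})
pd.get_dummies(df)

# Using sklearn
# Each input column is one categorical feature: 2, 3 and 4 distinct values give 9 output columns
enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
enc.transform([[0, 1, 3]]).toarray()  # [[1., 0., 0., 1., 0., 0., 0., 0., 1.]]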
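
To make the shape of the result concrete, here is a small sketch that inspects what get_dummies produces for the same two-column frame. Passing `dtype=int` is a choice made for this example so the indicators print as 0/1 regardless of the pandas default.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a"], "B": ["b", "a", "c"]})

# One indicator column per (original column, distinct value) pair.
dummies = pd.get_dummies(df, dtype=int)

columns = list(dummies.columns)
first_row = dummies.iloc[0].tolist()

print(columns)    # ['A_a', 'A_b', 'B_a', 'B_b', 'B_c']
print(first_row)  # [1, 0, 0, 1, 0]

# Every row has exactly one 1 per original column, so each row sums to 2 here.
print(dummies.sum(axis=1).tolist())
```

Two distinct values in A and three in B give five indicator columns in total.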

Note that both toolkits perform the same operations; only the library differs.


License

This work is by Chang Wei-Yaun (V123582) and is distributed under a Creative Commons Attribution-ShareAlike 3.0 Unported license.