• Why One-Hot Encode Data in Machine Learning
  • Originally by Jason Brownlee
  • The Nuggets translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: lsvih
  • Proofread by: TrWestdoor, Portandbridge

Why one-hot encode data in machine learning?

Getting started with machine learning applications, especially when it involves processing real data, can be difficult.

Generally, machine learning tutorials will recommend or require you to prepare your data in a specific way before you start fitting the model.

A good example is one-hot encoding of categorical data.

  • Why is one-hot encoding necessary?
  • Why can’t you fit a model on your data directly?

In this article, you will get answers to these important questions and gain a better understanding of data preparation in machine learning applications.

What is categorical data?

Categorical data are variables that contain label values rather than numeric values.

The number of possible values is usually limited to a fixed, finite set.

Categorical variables are also often referred to as nominal.

Here are some examples:

  • The pet variable contains the following values: dog, cat.
  • The color variable contains the following values: red, green, and blue.
  • The place variable contains the following values: first, second, and third.

Each value in the above example represents a different category.

Some categories have a natural relationship with each other, such as a natural ordering relationship.

In the example above, the values of the place variable have this natural ordering relationship. Such variables are called ordinal variables.

What is the problem with categorical data?

Some algorithms can be applied directly to categorical data.

For example, a decision tree can be learned directly from categorical data with no data transformation required (this depends on the specific implementation).

But many machine learning algorithms cannot operate on label data directly. They require all input and output variables to be numeric.

In general, this is mostly a constraint of efficient implementations of these algorithms rather than a hard limitation of the algorithms themselves.

This means that categorical data must be converted to a numeric form. If the output variable is categorical, you may also have to convert the model’s predictions back into categorical form in order to present them or use them in an application.

How do I convert categorical data to numeric data?

This consists of two steps:

  1. Integer encoding
  2. One-hot encoding

1. Integer encoding

The first step is to assign an integer value to each category value.

For example, 1 is red, 2 is green, and 3 is blue.

This is called label encoding or integer encoding, and it is easily reversible: each integer maps back to exactly one category value.

For some variables, this encoding is sufficient.

There is a natural ordering relationship between integers that machine learning algorithms may be able to understand and exploit.

The ordinal place variable from the earlier example is a good case: label encoding alone is sufficient for it.
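As a rough illustration of this step, the sketch below builds the red = 1, green = 2, blue = 3 mapping in plain Python and reverses it. The helper name `fit_integer_encoding` and the `first_code` parameter are made up for this example and do not come from any library. For the ordinal place variable, the order is written out by hand so that the integer codes preserve it.

```python
def fit_integer_encoding(values, first_code=1):
    """Map each distinct category to an integer, in order of first appearance.

    Hypothetical helper for illustration only.
    """
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = first_code + len(mapping)
    return mapping


colors = ["red", "green", "blue", "green", "red"]

color_to_int = fit_integer_encoding(colors)
# The reverse lookup is what makes the encoding easy to undo.
int_to_color = {code: label for label, code in color_to_int.items()}

encoded = [color_to_int[c] for c in colors]
decoded = [int_to_color[i] for i in encoded]

# For an ordinal variable, state the order explicitly so the codes keep it.
place_order = ["first", "second", "third"]
place_to_int = {label: rank for rank, label in enumerate(place_order, start=1)}

print(color_to_int)       # {'red': 1, 'green': 2, 'blue': 3}
print(encoded)            # [1, 2, 3, 2, 1]
print(decoded == colors)  # True
print(place_to_int)       # {'first': 1, 'second': 2, 'third': 3}
```

Assigning codes by order of first appearance is an arbitrary choice, which is harmless for the colors but is exactly why the order for place is spelled out rather than inferred from the data.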

2. One-hot encoding

For categorical variables with no ordinal relationship, however, integer encoding alone is not enough.

In fact, using integer encoding lets the model assume a natural ordering between categories, which may result in poor performance or unexpected results (for example, predictions that fall halfway between two categories).
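A toy calculation can make this concrete. The snippet below does not train a model. It only assumes that a numeric algorithm might treat the integer codes as quantities and average them, which is enough to show how a spurious "in-between" category appears.

```python
color_to_int = {"red": 1, "green": 2, "blue": 3}
int_to_color = {code: label for label, code in color_to_int.items()}

# Two samples: one red, one blue. Treating the codes as quantities,
# their average is a perfectly valid number...
codes = [color_to_int["red"], color_to_int["blue"]]
average_code = sum(codes) / len(codes)

# ...but that number happens to be the code for green, which is
# meaningless for an unordered variable like color.
print(average_code)                        # 2.0
print(int_to_color[round(average_code)])   # green

# Averaging red and green gives a value that is not a category at all.
halfway = (color_to_int["red"] + color_to_int["green"]) / 2
print(halfway)                  # 1.5
print(halfway in int_to_color)  # False
```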

In this case, one-hot encoding can be applied to the integer representation. The integer-encoded variable is removed and a new binary variable is added for each unique integer value.

In the color example, there are three categories, so three binary variables are needed. The position of the corresponding color is marked “1” and the positions of the other colors are marked “0”.

For example:

red, green, blue
1, 0, 0
0, 1, 0
0, 0, 1

In other fields, such as statistics, these binary variables are often called “dummy variables”.
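The color table can be reproduced with a few lines of plain Python. This is a minimal sketch, not a replacement for a library encoder: the function names `one_hot_encode` and `one_hot_decode` are invented for this example, and the column order is assumed to be red, green, blue. The decoder also shows one way to turn a row of model scores back into a category label, by taking the position of the largest value.

```python
def one_hot_encode(values, categories):
    """Return one binary row per value, with a single 1 in its category's column.

    Hypothetical helper; raises KeyError for a value not listed in `categories`.
    """
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return rows


def one_hot_decode(rows, categories):
    """Recover labels by taking the position of the largest entry in each row."""
    return [categories[row.index(max(row))] for row in rows]


categories = ["red", "green", "blue"]  # assumed column order
colors = ["red", "green", "blue"]

encoded = one_hot_encode(colors, categories)
for row in encoded:
    print(row)
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 1]

print(one_hot_decode(encoded, categories))            # ['red', 'green', 'blue']

# The same decoder works on score-like output, picking the highest score.
print(one_hot_decode([[0.1, 0.7, 0.2]], categories))  # ['green']
```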

One-hot encoding tutorials

If you want to learn how to one-hot encode your data in Python, see:

  • Data Preparation for Gradient Boosting with XGBoost in Python
  • How to One Hot Encode Sequence Data in Python

Further reading

  • Categorical variable, Wikipedia
  • Nominal category, Wikipedia
  • Dummy variable, Wikipedia

Conclusion

In this article, you saw why categorical data often needs to be encoded when working with machine learning algorithms.

Specifically:

  • Categorical data is defined as variables whose values are labels drawn from a finite set.
  • Most machine learning algorithms require numeric input and output variables.
  • Categorical data can be converted to numeric data using integer encoding and one-hot encoding.

Any other questions?

Leave your questions in the comments and I’ll do my best to answer them.
