A simple tool I use to deal with categorical features that have many unique values
Photo: George Pagan III on Unsplash
What is high cardinality?
Almost every data set now contains categorical variables, and each categorical variable is made up of unique values. When there are too many of these unique values, a categorical feature is said to have high cardinality. In this case, one-hot encoding becomes a big problem, because we have to create a separate column for each unique value in the categorical variable to indicate its presence or absence. This leads to two problems: one is the obvious space consumption, but that is not as serious as the second problem, namely the curse of dimensionality. I’ll discuss the curse of dimensionality in more detail below, but first, let’s look at the data before and after one-hot encoding.
Take a peek at our categorical feature before and after one-hot encoding
The feature we want to look at is Qualification. Because the data was collected from a form filled in by many people, this column contains many different qualifications. Here’s what this column looks like, with all its unique values.
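A rough sketch of this inspection step is shown below; the DataFrame name `df`, the file name `survey_data.csv`, and the exact column name `Qualification` are assumptions about how the data is loaded, not the original code.

```python
import pandas as pd

# Assumed setup: the survey responses live in a DataFrame `df`
# with the categorical feature of interest in the "Qualification" column.
df = pd.read_csv("survey_data.csv")

# Every distinct qualification that respondents entered
print(df["Qualification"].unique())

# Number of unique values, i.e. the cardinality of the feature
print(df["Qualification"].nunique())

# Approximate memory footprint of the raw column, in kilobytes
print(df["Qualification"].memory_usage(deep=True) / 1024)
```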
We can see that there are 15 unique values in this feature, and that it consumes 316KB of space. Let’s one-hot encode this feature with pandas.
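Something along these lines, using pandas’ `get_dummies` (again assuming the `df` and column name from above):

```python
# One-hot encode the Qualification column: one new column per unique value
qualification_ohe = pd.get_dummies(df["Qualification"])

# Shape shows one indicator column for each of the 15 unique values
print(qualification_ohe.shape)

# Memory footprint after one-hot encoding, in kilobytes
print(qualification_ohe.memory_usage(deep=True).sum() / 1024)
```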
We now see that our original feature has grown so much that, naturally, the amount of space required to store it has increased to **592KB**. And this is just one feature: if we have hundreds of categorical variables during training, we end up with an enormous number of columns, which in some cases is bad for model training, since simple models cannot handle that many variables. But now let’s look at the other, bigger problem: the curse of dimensionality.
Curse of dimensionality
Here’s a quick summary.
As the number of features grows, the amount of data we need in order to accurately distinguish between those features (and give us a prediction) and to generalize our model (the learned function) grows exponentially.
If you don’t want to get into the technical details below, feel free to skip to the next section.
I want to use Yoshua Bengio’s (yes, the legendary Yoshua Bengio!) Quora answer to explain this in more detail, and I highly recommend reading the entire answer here. According to the answer, increasing the number of distinct values in a feature simply increases the total number of possible configurations of the input **(which contains n such features)**. Suppose we have two features, each with two distinct values, which gives us a total of four possible ways to combine them. Now, if one of them had three distinct values instead, we would have 3 × 2 = 6 possible combinations.
In classical non-parametric learning algorithms (e.g., nearest neighbor, Gaussian-kernel SVMs, Gaussian processes with a Gaussian kernel, etc.), the model needs to see at least one example of each of these combinations (or at least of every configuration of interest) in order to produce a correct answer near a configuration that requires a different answer from the other configurations.
The way around this is that, even without a large amount of training data, as long as there is some structure (a pattern) in these combinations, the model can still discriminate between configurations it has not seen in the training set and make predictions for them. In most cases, however, high cardinality makes it difficult for the model to recognize such a pattern, so the model does not generalize well to examples outside the training set.
Reduce cardinality using a simple aggregation function
Below is a simple function I used to reduce the cardinality of a feature. The idea is very simple: the instances that belong to the high-frequency values are kept, and all other instances are replaced by a new category, which we call “Other”.
- Select a threshold
- Sort unique values in a column in descending order of their frequency
- Add up the frequencies of these sorted (descending) unique values until the threshold is reached.
- The categories counted so far are the only ones we keep; all instances of the other categories are replaced by **“Other”**.
Before reading the code, let’s look at a simple example. Assume our color column has 100 values and our threshold is 90% (that is, 90 values). We have 5 different color categories: **red (50), blue (40), yellow (5), green (3) and orange (2)**, where the numbers in parentheses indicate how many instances of that category exist in the column.
We see that red (50) + blue (40) reaches our threshold of 90. In this case, we keep only two categories (red, blue) and mark all instances of the other colors as “Other.”
Therefore, we have reduced the cardinality from 5 to 3 (red, blue, and Other).
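To make the arithmetic concrete, here is a small pandas sketch of the cumulative counting on this toy color column (the data is made up purely to mirror the example above):

```python
import pandas as pd

# Toy column: 50 red, 40 blue, 5 yellow, 3 green, 2 orange (100 values in total)
colors = pd.Series(
    ["red"] * 50 + ["blue"] * 40 + ["yellow"] * 5 + ["green"] * 3 + ["orange"] * 2
)

# Frequencies in descending order, followed by their running total
counts = colors.value_counts()
print(counts.cumsum())
# red        50
# blue       90   <- the 90% threshold (90 of 100 values) is reached here
# yellow     95
# green      98
# orange    100
```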
I wrote a utility function to facilitate our work. It is well commented and follows exactly the steps described above, so you should have no trouble following along. We can set a custom threshold, and the return_categories option lets us see the list of all unique values that remain after lowering the cardinality.
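A minimal sketch of what such a helper could look like follows; the function name `cumulatively_categorise`, the default threshold value, and the exact signature are assumptions rather than the original code, but it implements the steps listed above and exposes the threshold and return_categories options.

```python
import pandas as pd

def cumulatively_categorise(column, threshold=0.75, return_categories_list=True):
    """Keep the most frequent categories of `column` and collapse the rest into "Other".

    `threshold` is the fraction of all rows that the kept categories must cover.
    This is a reconstruction of the helper described in the text, not the original code.
    """
    # Absolute number of rows the kept categories must cover
    threshold_value = int(threshold * len(column))

    # Unique values sorted by frequency, most common first
    counts = column.value_counts()

    # Walk down the sorted counts, accumulating frequency until the threshold is met
    categories_list = []
    running_total = 0
    for value, count in counts.items():
        categories_list.append(value)
        running_total += count
        if running_total >= threshold_value:
            break

    # Instances of kept categories stay as-is; everything else becomes "Other"
    new_column = column.where(column.isin(categories_list), other="Other")

    if return_categories_list:
        return new_column, categories_list + ["Other"]
    return new_column


# Example usage on the Qualification column (reusing the `df` assumed earlier)
transformed, categories = cumulatively_categorise(
    df["Qualification"], threshold=0.75, return_categories_list=True
)
print(categories)
```

The threshold simply controls how aggressively the long tail of rare categories is collapsed into “Other”.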
As you can see, using this function, we reduced the cardinality of the Qualification column from 15 to 6!
Conclusion
We saw how cardinality can be reduced with a simple function and, more importantly, why this is necessary (the curse of dimensionality). But remember, we are lucky that the distribution of values in our column allows us to use this method. We could not use it if all 15 categories were evenly distributed; in that case, something like PCA might need to be applied together with the other features of the data set, but more on that another time.