Author | Samuele Mazzanti

Translation | vitamin k

Source | Towards Data Science

“Which gradient boosting algorithms do you know?”

“XGBoost, LightGBM, CatBoost, HistGradient.”

“Which encodings for categorical variables do you know?”

“One-hot.”

I wouldn’t be surprised to hear this conversation in a data science interview. It would be quite astonishing, though, because only a small share of data science projects involve machine learning, while practically all of them involve some categorical data.

The encoding of categorical variables is the process of converting a categorical column into one (or more) numeric columns.

This is necessary because computers handle numbers much more easily than strings. Why? Because with numbers it is easy to find relationships (“bigger”, “smaller”, “double”, “half”), whereas when given strings a computer can only say whether they are “equal” or “different”.

However, although the encoding of categorical variables has a real impact, it is easily overlooked by data science practitioners.

The encoding of categorical variables is a surprisingly underrated topic.

That’s why I decided to deepen my knowledge of encoding algorithms. I started with a Python library called “category_encoders” (here’s the GitHub link: github.com/scikit-lear… ). Installing and using it is as simple as:

!pip install category_encoders

import category_encoders as ce

ce.OrdinalEncoder().fit_transform(x)

This article is a walkthrough of the 17 encoding algorithms included in the library. For each algorithm, I provide a brief explanation and a Python implementation in a few lines of code. The goal is not to reinvent the wheel, but to understand how the algorithms work. After all,

“Unless you can write code, you don’t understand.”

Not all encodings are the same

I classified the 17 encoding algorithms according to some of their characteristics, in the style of a decision tree:

The split points are:

  • “Supervised/unsupervised”: when the encoding is based entirely on the categorical column itself, it is unsupervised. It is supervised if the encoding is based on some function of the original column and of a second (numeric) column.
  • “Output dimension”: the encoding of a categorical column may result in one numeric column (output dimension = 1) or in several numeric columns (output dimension > 1).
  • “Mapping”: if each level always gets the same output, be it a scalar (for example OrdinalEncoder) or an array (for example OneHotEncoder), the mapping is unique. Conversely, if the same level is allowed to have different possible outputs, the mapping is not unique.

17 categorical encoding algorithms
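
All of the snippets below operate on a pandas Series x (the categorical column) and, for the supervised encoders, on a numeric target y. The article's original tables are images and are not reproduced here, so here is a minimal setup of my own, chosen to be consistent with the figures quoted later in the text (incomes of 35, 45, 52 and 68 thousand dollars, one observation per level):

# Minimal assumed setup for the snippets below (a reconstruction, not the article's original table)
import numpy as np
import pandas as pd

x = pd.Series(['High School', 'Bachelor', 'Master', 'PhD'], name = 'x')   # categorical column
y = pd.Series([35, 45, 52, 68], name = 'y')                               # numeric target: income in k$

Some snippets also expect scalar parameters that you choose yourself (smoothing, m, base, output_dimension, a).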

1.OrdinalEncoder

Each level maps to an integer, from 1 to L (where L is the number of levels). In this case, we used alphabetical order, but any other custom order is acceptable.

sorted_x = sorted(set(x))
ordinal_encoding = x.replace(dict(zip(sorted_x, range(1, len(sorted_x) + 1))))

You might think that this encoding is rather meaningless, especially if the levels have no intrinsic order. You’re right! In fact, it’s just a convenient representation, often used to save memory or as an intermediate step for other kinds of encoding.

2.CountEncoder

Each level maps to the number of observations at that level.

count_encoding = x.replace(x.value_counts().to_dict())

This encoding can be used as an indicator of the “confidence” of each level. For example, a machine learning algorithm could automatically decide to take into account only the information coming from levels whose count is above a certain threshold.

3.OneHotEncoder

The most commonly used encoding algorithm. Each level is mapped to a dummy column (that is, a 0/1 column) indicating whether that row carries that level.

# row (oe - 1) of the L x L identity matrix is the dummy vector of level oe
one_hot_encoding = ordinal_encoding.apply(lambda oe: pd.Series(np.diag(np.ones(len(set(x))))[oe - 1].astype(int)))

This means that while your input is a single column, your output consists of L columns (one for each level of the original column). This is why OneHot encoding should be handled with care: you may end up with data frames that are much larger than the original.

Once the data has been one-hot encoded, it can be used in any predictive algorithm. To make things clear, let’s keep just one observation per level.

Suppose we observe a target variable, called Y, that contains everyone’s income (in thousands of dollars). Let’s use linear regression (OLS) to fit the data.

To make the results easy to read, I append OLS coefficients to the side of the table.

In the case of one-hot encoding, the intercept has no particular meaning. In this case, since we have just one observation per level, adding the intercept to each level’s coefficient gives back the exact value of y (with no error).
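
As a sketch of how this regression can be reproduced (assuming the toy x and y defined above, and using statsmodels OLS; the exact coefficient values shown in the article may differ, since the dummy columns and the constant are perfectly collinear and the individual coefficients are therefore not unique):

# Hedged sketch: OLS on the one-hot columns
import statsmodels.api as sm

X = sm.add_constant(one_hot_encoding)
ols = sm.OLS(y, X).fit()
print(ols.params)        # intercept plus one coefficient per level
print(ols.fittedvalues)  # with one observation per level, this reproduces y exactly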

4.SumEncoder

The following code may look a little obscure at first. But don’t worry: in this case, understanding how the encoding is obtained is not that important; what matters is how to use it.

sum_encoding = one_hot_encoding.iloc[:, :-1].apply(lambda row: row if row.sum() == 1 else row.replace(0, -1), axis = 1)

SumEncoder belongs to a class known as “contrast encodings”. These encodings are designed to have specific behaviours when used in regression problems. In other words, if you want the regression coefficients to have certain properties, you can use one of these encodings.

In particular, use SumEncoder when you want the regression coefficients to add up to zero. If we take the same data as before and fit OLS, we get the following results:

This time, the intercept corresponds to the mean of y. Moreover, taking the y of the last level and subtracting the intercept (68 - 50) gives 18, which is exactly the opposite of the sum of the remaining coefficients (-15 - 5 + 2 = -18). This is the SumEncoder property I mentioned above.
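
A quick way to verify this property (a sketch under the same toy-data assumption as above):

# Hedged sketch: check the SumEncoder properties in an OLS fit
import statsmodels.api as sm

ols = sm.OLS(y, sm.add_constant(sum_encoding)).fit()
print(ols.params['const'], y.mean())   # the intercept equals the mean of y (50 here)
coefs = ols.params.drop('const')
print(-coefs.sum())                    # the implied effect of the last level (+18 here),
                                       # so all the level effects together sum to zero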

5.BackwardDifferenceEncoder

Another contrast coding.

This encoder is useful for ordinal variables, that is, variables whose levels can be sorted in a meaningful way. BackwardDifferenceEncoder is designed for comparing adjacent levels.

backward_difference_encoding = ordinal_encoding.apply(
    lambda oe: pd.Series(
        [i / len(set(x)) for i in range(1, oe)] +
        [-i / len(set(x)) for i in range(len(set(x)) - oe, 0, -1)]))

Suppose you have an ordinal variable (such as education level) and you want to know how it relates to a numeric variable (such as income). It may be interesting to compare each level with the previous one (e.g. bachelor vs. high school, master vs. bachelor) with respect to the target variable. This is what BackwardDifferenceEncoder is designed for. Let’s see an example with the same data.

The intercept coincides with the mean of y. The coefficient of bachelor is 10 because the y of bachelor is 10 higher than that of high school; the coefficient of master is 7 because the y of master is 7 higher than that of bachelor, and so on.

6.HelmertEncoder

HelmertEncoder is very similar to BackwardDifferenceEncoder, but instead of comparing each level only with the previous one, it compares each level with all of the previous levels.

helmert_encoding = ordinal_encoding.apply(
    lambda oe: pd.Series([0] * (oe - 2) + ([oe - 1] if oe > 1 else []) + [-1] * (len(set(x)) - oe))
).div(pd.Series(range(2, len(set(x)) + 1)))

Let’s see what the OLS model gives us:

The coefficient of PhD is 24 because the y of PhD is 24 above the mean of the previous levels: 68 - (35 + 45 + 52) / 3 = 24. The same reasoning applies to all the levels.

7.PolynomialEncoder

Another kind of contrast coding.

As the name implies, PolynomialEncoder is designed to quantify the linear, quadratic and cubic behaviour of the target variable with respect to the categorical variable.

def do_polynomial_encoding(order):
    # adapted from https://github.com/pydata/patsy/blob/master/patsy/contrasts.py
    n = len(set(x))
    scores = np.arange(n)
    scores = np.asarray(scores, dtype=float)
    scores -= scores.mean()
    raw_poly = scores.reshape((-1, 1)) ** np.arange(n).reshape((1, -1))
    q, r = np.linalg.qr(raw_poly)
    q *= np.sign(np.diag(r))
    q /= np.sqrt(np.sum(q ** 2, axis=1))
    q = q[:, 1:]
    return q[order - 1]

polynomial_encoding = ordinal_encoding.apply(lambda oe: pd.Series(do_polynomial_encoding(oe)))

I know what you’re thinking: how can a numeric variable have a linear (or quadratic, or cubic) relationship with a variable that is not numeric? This rests on the assumption that the underlying categorical variable is not only ordinal, but also equally spaced.

For this reason, I recommend using it with caution, only if you are convinced that the assumption is reasonable.

8.BinaryEncoder

BinaryEncoder is basically the same as OrdinalEncoder, the only difference being that the integers are converted into binary numbers, and then every positional digit is one-hot encoded.

binary_base = ordinal_encoding.apply(lambda oe: bin(oe)[2:].zfill(len(bin(len(set(x)))) - 2))
binary_encoding = binary_base.apply(lambda bb: pd.Series(list(bb))).astype(int)

The output consists of dummy columns, as with OneHotEncoder, but with a lower dimensionality than one-hot.

To be honest, I don’t know what practical application this kind of coding has.

9.BaseNEncoder

BaseNEncoder is just a generalization of BinaryEncoder. In fact, in BinaryEncoder, the numbers have base 2, while in BaseNEncoder, the numbers have base N, which is greater than 1.

def int2base(n, base):
    out = ''
    while n:
        out += str(int(n % base))
        n //= base
    return out[::-1]

base_n = ordinal_encoding.apply(lambda oe: int2base(n = oe, base = base))
base_n_encoding = base_n.apply(lambda bn: pd.Series(list(bn.zfill(base_n.apply(len).max())))).astype(int)

Let’s look at an example where base=3.
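
For instance, setting base = 3 in the snippet above (and assuming the toy setup, where the ordinal values are 1 to 4):

# Usage sketch: base-3 digits produced for the ordinal values 1..4
base = 3
print([int2base(n, base) for n in range(1, 5)])   # ['1', '2', '10', '11'], zero-filled to '01', '02', '10', '11'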

To be honest, I don’t know what practical application this kind of coding has.

10.HashingEncoder

In HashingEncoder, each original level is hashed with some hashing algorithm, such as SHA-256. The result is then converted to an integer, and the remainder of that integer with respect to some (large) divisor is taken. In this way, each original string is mapped to an integer in a fixed range. Finally, the resulting integers are one-hot encoded.

import hashlib

# output_dimension is the divisor you choose (10 in the example below)
def do_hash(string, output_dimension):
    hasher = hashlib.new('sha256')
    hasher.update(bytes(string, 'utf-8'))
    string_hashed = hasher.hexdigest()
    string_hashed_int = int(string_hashed, 16)
    string_hashed_int_remainder = string_hashed_int % output_dimension
    return string_hashed, string_hashed_int, string_hashed_int_remainder

hashing = x.apply(
    lambda string: pd.Series(do_hash(string, output_dimension),
        index = ['x_hashed', 'x_hashed_int', 'x_hashed_int_remainder']))
hashing_encoding = hashing['x_hashed_int_remainder'].apply(lambda rem: pd.Series(np.diag(np.ones(output_dimension))[rem])).astype(int)

Let’s look at an example with an output dimension of 10.

The fundamental property of hashing is that the resulting integers are uniformly distributed. So, if the divisor is large enough, two different strings are very unlikely to be mapped to the same integer. Why is that useful? There is actually a very practical application of this, called the “hashing trick”.

Suppose you want to build an e-mail spam classifier using logistic regression. You could do this by one-hot encoding all the words contained in the dataset. The main drawbacks are that you need to store the mapping in a separate dictionary, and that your model’s dimension changes whenever new strings appear.

These problems are easily overcome with the hashing trick: with hashed inputs you no longer need a dictionary, and the output dimension is fixed (it depends only on the divisor you chose at the beginning). Moreover, thanks to the properties of hashing, a new string will most likely get an encoding different from the existing strings.
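
To make the hashing trick concrete, here is a hedged sketch (not part of the original article) that reuses the do_hash helper defined above on a few hypothetical words, with a fixed output dimension of 10:

# Hedged sketch of the hashing trick: any word, even an unseen one, maps to one of 10 fixed columns,
# with no dictionary to store and no change of dimension when new words appear
output_dimension = 10

for word in ['free', 'offer', 'meeting', 'some-new-word']:
    column_index = do_hash(word, output_dimension)[2]   # keep only the remainder, i.e. the column index
    print(word, '-> column', column_index)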

11.TargetEncoder

Suppose you have two variables: a categorical one (x) and a numeric one (y), and that you want to convert x into a numeric variable. You may want to “carry over” the information contained in y. The obvious idea is to take, for each level of x, the mean of y within that level. In formula:

$\text{enc}_i = \bar{y}_i$

This is reasonable, but there is a big problem with this approach: some groups may be too small or too unstable to be reliable. Many supervised encodings overcome this problem by choosing an intermediate value between the group mean and the global mean of y:

$\text{enc}_i = w_i \cdot \bar{y}_i + (1 - w_i) \cdot \bar{y}$

where $w_i$ is between 0 and 1, depending on how “trusted” the group is.

The next three algorithms (TargetEncoder, MEstimateEncoder, and JamesSteinEncoder) differ depending on how they define $w_i$.

In TargetEncoder, the weight depends on the group count and on a parameter called “smoothing”. When smoothing is 0, we rely only on the group mean. Then, as smoothing increases, the global mean gets more and more weight, leading to a stronger regularization.

y_mean = y.mean()
y_level_mean = x.replace(y.groupby(x).mean())
weight = 1 / (1 + np.exp(-(count_encoding - 1) / smoothing))
target_encoding = y_level_mean * weight + y_mean * (1 - weight)

Let’s see how the results vary with some different smoothing values.
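
Since the comparison table is not reproduced here, this is a quick way to inspect the effect yourself. It is a sketch on a small hypothetical grouped sample of mine (the toy data above has one observation per level, so all counts are 1 and smoothing would have no visible effect there):

# Hedged sketch: the encoding of the bigger group moves from its group mean towards the global mean
# as smoothing increases, while a one-observation group stays halfway
x2 = pd.Series(['a', 'a', 'a', 'a', 'b'])
y2 = pd.Series([10, 12, 11, 13, 40])
counts2 = x2.replace(x2.value_counts().to_dict())
level_mean2 = x2.replace(y2.groupby(x2).mean().to_dict())
for smoothing in [0.1, 1, 10]:
    weight = 1 / (1 + np.exp(-(counts2 - 1) / smoothing))
    encoding = level_mean2 * weight + y2.mean() * (1 - weight)
    print('smoothing =', smoothing, '->', encoding.round(2).tolist())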

12.MEstimateEncoder

MEstimateEncoder is similar to TargetEncoder, but $w_i$ depends on a parameter called “m”, which sets the absolute weight of the global mean. m is easy to interpret because it can be seen as a number of observations: if a level has exactly m observations, then the level mean and the global mean get the same weight.

y_mean = y.mean()
y_level_mean = x.replace(y.groupby(x).mean())
weight = count_encoding / (count_encoding + m)
m_estimate_encoding = y_level_mean * weight + y_mean * (1 - weight)

Let’s see how the results vary with different m values:

13.JamesSteinEncoder

TargetEncoder and MEstimateEncoder depend both on the group count and on a parameter value set by the user (smoothing and m, respectively). This is inconvenient, because setting those weights is a manual task.

A natural question is: is there a way to set an optimal weight without any human intervention? JamesSteinEncoder tries to do exactly that, in a statistically grounded way.

y_mean = y.mean()
y_var = y.var()
y_level_mean = x.replace(y.groupby(x).mean())
y_level_var = x.replace(y.groupby(x).var())

weight = 1 - (y_level_var / (y_var + y_level_var) * (len(set(x)) - 3)/(len(set(x)) - 1))
james_stein_encoding = y_level_mean * weight + y_mean * (1 - weight)

The intuition is that the mean of a group with high variance should be trusted less. Therefore, the higher the group variance, the lower the weight (if you want to know more about the formula, I suggest this post by Chris Said).

Let’s look at a numerical example:

JamesSteinEncoder has two significant advantages: it provides better estimates than the maximum likelihood estimator, and it does not require any parameter setting.

14.GLMMEncoder

GLMMEncoder takes a completely different approach.

Basically, it fits a linear mixed effects model on y. This approach takes advantage of the fact that linear mixed effects models are designed precisely for handling homogeneous groups of observations. So the idea is to fit a model with no regressors (only the intercept) and to use the levels as the groups.

The output is then simply the sum of the intercept and the random effect of the group.

import statsmodels.formula.api as smf

model = smf.mixedlm(formula = 'y ~ 1', data = y.to_frame(), groups = x).fit()
intercept = model.params['Intercept']
random_effect = x.replace({k: float(v) for k, v in model.random_effects.items()})
glmm_encoding = intercept + random_effect

15.WOEEncoder

WOEEncoder can be used only with binary target variables, i.e. target variables whose levels are 0/1.

The idea behind the weight of evidence is that you have two distributions:

  • the distribution of 1s (number of 1s in each group / total number of 1s in y)
  • the distribution of 0s (number of 0s in each group / total number of 0s in y)

The core of the algorithm is dividing the distribution of 1s by the distribution of 0s, for each group. Of course, the higher this value, the more confident we are that the group is “skewed” towards 1, and vice versa. Then the logarithm of this value is taken.

y_level_ones = x.replace(y.groupby(x).apply(lambda l: (l == 1).sum()))
y_level_zeros = x.replace(y.groupby(x).apply(lambda l: (l == 0).sum()))
y_ones = (y == 1).sum()
y_zeros = (y == 0).sum()
numerator = y_level_ones / y_ones
denominator = y_level_zeros / y_zeros
woe_encoding = np.log(numerator / denominator)

As you can see, the output cannot be interpreted directly because there are logarithms in the formula. However, it works well as a pre-processing step for machine learning.

16.LeaveOneOutEncoder

So far, all 15 encoders have a unique mapping.

However, this can be a problem if you plan to use the encoding as the input of a predictive model (e.g. gradient boosting). Indeed, suppose you use TargetEncoder: this means introducing information about y_train into X_train, which implies a serious risk of overfitting.

So the point is: how can you limit the risk of overfitting while keeping a supervised encoding? LeaveOneOutEncoder provides a brilliant solution. It performs ordinary target encoding, but, for each row, it does not take into account the y value observed for that row. In this way, target leakage is avoided.

y_level_except_self = x.to_frame().apply(lambda row: y[x == row['x']].drop(row.name).to_list(), axis = 1)
leave_one_out_encoding = y_level_except_self.apply(np.mean)

17.CatBoostEncoder

CatBoost is a gradient boosting algorithm (like XGBoost or LightGBM) that performs very well in a wide range of problems.

CatboostEncoder works basically like LeaveOneOutEncoder, but as an online method.

But how do you simulate online behaviour? Imagine you have a table, and take a row somewhere in the middle of it. What CatBoost does is pretend that the rows above the current row have already been observed in time, while the rows below have not been observed yet (they will be observed in the future). The algorithm then performs leave-one-out encoding, but based only on the already observed rows.

# a is an additive smoothing parameter that you choose (a > 0)
y_mean = y.mean()
y_level_before_self = x.to_frame().apply(lambda row: y[(x == row['x']) & (y.index < row.name)].to_list(), axis = 1)
catboost_encoding = y_level_before_self.apply(lambda ylbs: (sum(ylbs) + y_mean * a) / (len(ylbs) + a))

This may seem absurd: why throw away information that could be useful? You can see it simply as a more extreme attempt to randomize the output (i.e. to reduce overfitting).


Thanks for reading! I hope you found this article useful.
