• Normalization vs Standardization — Quantitative Analysis
  • Author: Shay Geller
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: ccJia
  • Proofreader: Fengziyin1234, portandbridge

Not using Sklearn's StandardScaler as your default feature scaling method can give your trained model as much as a 7% improvement in accuracy.

Every ML practitioner knows that feature scaling is an important issue.

The two most discussed scaling methods are normalization and standardization. Normalization typically means rescaling values into the range [0, 1]. Standardization typically means rescaling data to have a mean of 0 and a standard deviation of 1.
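To make the distinction concrete, here is a minimal sketch using the two corresponding Sklearn scalers (the toy array is my own, not data from the experiments below):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0]])

# Normalization: rescale each feature into [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())  # [0.    0.444 1.   ]

# Standardization: rescale each feature to mean 0, standard deviation 1
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(), X_std.std())  # ~0.0 and 1.0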

This blog aims to answer the following questions through some experiments:

  1. Do we always need to scale features?

  2. Is there a single best scaling method?

  3. How do different scaling methods affect different classifiers?

  4. Should the scaling method be treated as an important hyperparameter?

I will analyze the experimental results of applying several different scaling methods to different features and classifiers.

Content overview

  • Why are you here?
  1. Out-of-the-box classifiers
  2. Classifier + scaling
  3. Classifier + scaling + PCA
  4. Classifier + scaling + PCA + hyperparameter tuning
  5. Repeating the experiment with more datasets
  • — 5.1 Rain in Australia dataset
  • — 5.2 Bank Marketing dataset
  • — 5.3 Sloan Digital Sky Survey DR14 dataset
  • — 5.4 Income Classification dataset
  • Conclusion

Why are you here?

First, I tried to understand the difference between normalization and standardization.

Then I found this great blog by Sebastian Raschka, which satisfied my curiosity from a mathematical point of view. If you are not familiar with the concepts of normalization and standardization, take five minutes to read this blog post.

Here’s another article by Hinton explaining why classifiers trained with gradient descent (such as neural networks) need feature scaling.

Okay, so we’ve crammed some math. Is that enough? Not quite.

I found that Sklearn provides many different scaling methods. Its documentation gives a good intuition by showing the effect of the different scalers on data with outliers, but it doesn’t make clear how these methods affect different classification tasks.

In most mainstream ML tutorials, features are scaled with StandardScaler (zero-mean, unit-variance standardization) or MinMaxScaler (min-max normalization). Why does hardly anyone use the other scaling methods for classification? Are StandardScaler and MinMaxScaler really the best scaling methods?

The tutorials give no explanation of why or when to use each of these methods, so I decided to study their performance experimentally. That’s what this article is about.

Project details

Like many data science projects, we’re going to read some data and experiment with several out-of-the-box classifiers.

The dataset

The Sonar dataset contains 208 rows and 60 feature columns. The classification task is to determine whether a sonar signal was bounced off a metal cylinder or a roughly cylindrical rock.

It is a balanced dataset:

sonar[60].value_counts()  # 60 is the name of the label column

M    111
R     97

All features in the dataset lie between 0 and 1, but it is not guaranteed that 1 is the maximum or 0 the minimum in every feature.

I chose this dataset for two reasons: first, it is small enough to experiment with quickly; second, the problem is hard enough that no classifier reaches 100% accuracy, which makes the comparisons meaningful.

In later sections, we will also experiment on other datasets.

Code

I precomputed all the results in a preprocessing step (which took a while), so here we just read the results file and analyze it.

You can get the code to produce the results on my GitHub: github.com/shaygeller/…

I have selected some of the most popular classifiers from Sklearn, as follows:

(MLP is a multi-layer perceptron, a neural network)

The scaling methods used are as follows:

  • Don’t confuse Normalizer, the last scaling method in the table above, with the min-max normalization mentioned earlier. Min-max normalization corresponds to the second row, MinMaxScaler. Sklearn’s Normalizer rescales each sample to unit norm; it is a row-wise, not column-wise, operation.
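A small sketch of the difference (the toy matrix is my own):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

X = np.array([[1.0, 10.0],
              [2.0, 20.0]])

# MinMaxScaler works column by column: each feature is mapped to [0, 1]
print(MinMaxScaler().fit_transform(X))
# [[0. 0.]
#  [1. 1.]]

# Normalizer works row by row: each sample is rescaled to unit (L2) norm
print(Normalizer().fit_transform(X))
# [[0.0995 0.995 ]
#  [0.0995 0.995 ]]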

Experiment details:

  • To make the experiments reproducible, we use the same random seed throughout.

  • The data was randomly split into training and test sets at a ratio of 8:2.

  • All reported accuracies are means over 10 randomly split cross-validation folds drawn from the training set.

  • We do not discuss results on the test set. The test set should be treated as unseen, and our conclusions are drawn only from the classifiers’ scores on the cross-validation folds.

  • In Part 4, I used nested cross-validation: an inner cross-validation of 5 random folds for hyperparameter tuning, and an outer cross-validation of 10 random folds that scores the model using the best parameters found by the inner loop. All of this data comes from the training set.
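A minimal sketch of this nested scheme, assuming Sklearn's GridSearchCV and cross_val_score (the SVC estimator and parameter grid are illustrative, not the experiments' exact setup):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X_train, y_train = make_classification(n_samples=200, random_state=0)

# Inner loop: 5-fold cross-validation for hyperparameter tuning
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)

# Outer loop: 10-fold cross-validation scoring the best model from each inner search
outer_scores = cross_val_score(inner, X_train, y_train, cv=10)
print(outer_scores.mean())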

So let’s see what happens

import os
import pandas as pd

results_file = "sonar_results.csv"
results_df = pd.read_csv(os.path.join("..", "data", "processed", results_file)).dropna().round(3)
results_df

1. Out-of-the-box classifiers

import operator

# Baseline results: no scaler (the classifier name starts with "_") and no PCA
results_df.loc[operator.and_(results_df["Classifier_Name"].str.startswith("_"), ~results_df["Classifier_Name"].str.endswith("PCA"))].dropna()

Nice results. Looking at the cross-validation means, we can see that MLP does best while SVM does worst.

The standard deviations are all roughly the same, so we mainly focus on the mean scores: the mean over the 10 randomly split cross-validation folds.

Now, let’s see how the different scaling methods change each classifier’s score.

2. Classifier + scaling

import operator
import numpy as np

temp = results_df.loc[~results_df["Classifier_Name"].str.endswith("PCA")].dropna()
temp["model"] = results_df["Classifier_Name"].apply(lambda sen: sen.split("_")[1])
temp["scaler"] = results_df["Classifier_Name"].apply(lambda sen: sen.split("_")[0])

def df_style(val):
    return 'font-weight: 800'

# Pivot to a scaler x model table of mean CV scores, and bold each column's maximum
pivot_t = pd.pivot_table(temp, values='CV_mean', index=["scaler"], columns=['model'], aggfunc=np.sum)
pivot_t_bold = pivot_t.style.applymap(df_style,
                      subset=pd.IndexSlice[pivot_t["CART"].idxmax(), "CART"])
for col in list(pivot_t):
    pivot_t_bold = pivot_t_bold.applymap(df_style,
                      subset=pd.IndexSlice[pivot_t[col].idxmax(), col])
pivot_t_bold

The first row, the one with the empty index name, shows each algorithm’s baseline score without any scaling.

import operator

# Find, for each model, the scaler that achieves the highest mean CV score
cols_max_vals = {}
cols_max_row_names = {}
for col in list(pivot_t):
    row_name = pivot_t[col].idxmax()
    cell_val = pivot_t[col].max()
    cols_max_vals[col] = cell_val
    cols_max_row_names[col] = row_name

sorted_cols_max_vals = sorted(cols_max_vals.items(), key=lambda kv: kv[1], reverse=True)

print("Best classifiers sorted:\n")
counter = 1
for model, score in sorted_cols_max_vals:
    print(str(counter) + ". " + model + " + " + cols_max_row_names[model] + ": " + str(score))
    counter += 1

The best combinations are as follows:

  1. SVM + StandardScaler: 0.849
  2. MLP + PowerTransformer-Yeo-Johnson: 0.839
  3. KNN + MinMaxScaler: 0.813
  4. LR + QuantileTransformer-Uniform: 0.808
  5. NB + PowerTransformer-Yeo-Johnson: 0.752
  6. LDA + PowerTransformer-Yeo-Johnson: 0.747
  7. CART + QuantileTransformer-Uniform: 0.74
  8. RF + Normalizer: 0.723

Let’s analyze the results

  1. No single scaling method gives the best results for every classifier.

  2. We find that scaling helps: SVM, MLP, KNN, and NB each gain significantly from some scaling method.

  3. It is worth noting that some scaling methods have no effect on NB, RF, LDA, and CART. This follows from how each classifier works. Tree-based classifiers are unaffected because they sort the feature values and compute the entropy of each candidate partition before splitting; a scaler that preserves this order leaves the splits unchanged, so there is no improvement (see the sketch after this list). NB is unaffected because its model priors are determined by the counts in each class, not by the actual values. Linear discriminant analysis (LDA) finds its coefficients from the between-class variance, so it is also insensitive to scaling.

  4. Some scaling methods, such as QuantileTransformer-Uniform, do not preserve the exact order of the values within each feature, which is why they still change the scores even of the classifiers above that are indifferent to the other scaling methods.
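As a quick sanity check of point 3, here is a small experiment of my own (not from the article’s result files): a decision tree splits on the order of feature values, so a monotonic scaler such as MinMaxScaler should leave its cross-validation score unchanged.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Same tree, same folds: raw features vs min-max scaled features
raw = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
scaled = cross_val_score(DecisionTreeClassifier(random_state=0),
                         MinMaxScaler().fit_transform(X), y, cv=5)
print(raw.mean(), scaled.mean())  # expected to match (monotonic rescaling)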

3. Classifier + scaling + PCA

We know that well-known ML methods such as PCA can benefit from scaling (blog). We try adding PCA (n_components=4) to the experiments, as in the sketch below, and analyze the results.
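For concreteness, here is how one scaler + PCA + classifier combination might be assembled with an Sklearn Pipeline (the scaler and classifier choices and the toy data are illustrative, not the experiments’ exact setup):

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

pipe = Pipeline([
    ("scaler", QuantileTransformer(n_quantiles=100, output_distribution="uniform", random_state=0)),
    ("pca", PCA(n_components=4)),
    ("clf", SVC()),
])
print(cross_val_score(pipe, X, y, cv=10).mean())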

import operator
import numpy as np

temp = results_df.copy()
temp["model"] = results_df["Classifier_Name"].apply(lambda sen: sen.split("_")[1])
temp["scaler"] = results_df["Classifier_Name"].apply(lambda sen: sen.split("_")[0])

def df_style(val):
    return 'font-weight: 800'

pivot_t = pd.pivot_table(temp, values='CV_mean', index=["scaler"], columns=['model'], aggfunc=np.sum)
pivot_t_bold = pivot_t.style.applymap(df_style,
                      subset=pd.IndexSlice[pivot_t["CART"].idxmax(), "CART"])
for col in list(pivot_t):
    pivot_t_bold = pivot_t_bold.applymap(df_style,
                      subset=pd.IndexSlice[pivot_t[col].idxmax(), col])
pivot_t_bold

Results analysis

  1. In most cases, scaling improves models that use PCA, but no particular scaling method stands out. Look at QuantileTransformer-Uniform, which does well on most models: it raised the accuracy of LDA-PCA from 0.704 to 0.783, almost 8%! Yet for RF-PCA it had a negative effect, dropping the accuracy from 0.711 to 0.668, a decrease of 4.35%. If QuantileTransformer-Normal is used instead, the accuracy of RF-PCA improves to 0.766, a 5% gain.

  2. We can see that PCA only improves LDA and RF, so PCA is not a silver bullet. We did not tune the n_components hyperparameter, and even if we had, there would be no guarantee of improvement.

  3. Also, StandardScaler and MinMaxScaler scored best in only 4 of the 16 experiments, so we should think carefully before choosing either as the default scaling method.

We can conclude that even though PCA is known to benefit from scaling, no scaling method is guaranteed to improve all the experimental results; some are even hurt, such as the RF-PCA model with StandardScaler.

The dataset is also an important factor in the experiments above. To better understand how scaling methods affect PCA, we will experiment on more datasets (including imbalanced datasets, datasets with features on different scales, and datasets with both numerical and categorical features). We analyze those in Section 5.

4. Classifier + scaling + PCA + hyperparameter tuning

For a given classifier, different scaling methods can lead to significant differences in accuracy. One might expect that after hyperparameter tuning the model would be less sensitive to the choice of scaling method, so that we could just use StandardScaler or MinMaxScaler as many online tutorials do. Let’s verify that.

First, NB is not included in this section because it has no parameters to tune.

Comparing with the results from the earlier stages shows that almost all the algorithms benefit from hyperparameter tuning. An interesting exception is MLP, which got worse. A likely explanation is that neural networks easily overfit this dataset (especially when the number of parameters is much larger than the number of training samples), and we did not guard against overfitting with early stopping or regularization.

However, even with a well-tuned set of hyperparameters, the results still differ considerably across scaling methods. When we compared the other methods with the widely used StandardScaler on KNN, we found up to a 7% improvement in accuracy.

The main conclusion of this section is that even with a well-tuned set of hyperparameters, changing the scaling method still has a significant impact on the results. So we should also treat the scaling method as a key hyperparameter of the model, as sketched below.
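One way to act on this conclusion is to put the scaler itself into the hyperparameter grid. A sketch, assuming an Sklearn Pipeline (the KNN estimator and the grids here are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X, y = make_classification(n_samples=300, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", KNeighborsClassifier())])
grid = GridSearchCV(pipe, cv=10, param_grid={
    # The scaler is searched over just like any other hyperparameter;
    # "passthrough" means no scaling at all
    "scaler": [StandardScaler(), MinMaxScaler(), RobustScaler(), "passthrough"],
    "clf__n_neighbors": [3, 5, 7],
})
grid.fit(X, y)
print(grid.best_params_)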

In the fifth part, we dig deeper with more datasets. If you don’t want to dig that deep, feel free to skip straight to the conclusion.

5. Repeating the experiment with more datasets

To reach a better understanding and more general conclusions, we need to experiment on more datasets.

We will use the same classifier + scaling + PCA setup as in Section 3 on several datasets with different characteristics, and analyze the results in the subsections below. All datasets come from Kaggle.

  • For convenience, I used only the numeric columns of each dataset. How to scale mixed datasets (numerical plus categorical features) is still an open debate.

  • I did not tune the classifiers’ hyperparameters.

5.1 Rain in Australia dataset

Link
Classification task: predicting rain?
Metric: Precision
Dataset size: (56420, 18)
Counts per class: no rain 43993, rain 12427

Below are five sample rows from a subset of the columns; we can’t show all of them in one figure.

dataset.describe()

We suspect that scaling will improve classifier performance, given the scale differences among the features (look at the minimum and maximum values in the table above; the differences among the remaining features are even larger than shown).

Results

Results analysis

  • We see that StandardScaler and MinMaxScaler never achieve the highest score.

  • There is as much as a 20% difference between StandardScaler and the other methods on the CART-PCA algorithm.

  • We can also see that scaling usually helps: the precision of SVM even jumped from 78% to 99%.

5.2 Bank Marketing dataset

Link
Classification task: has the client subscribed to a term deposit?
Metric: AUC (the dataset is imbalanced)
Dataset size: (41188, 11)
Counts per class: not subscribed 36548, subscribed 4640

Below are five sample rows from a subset of the columns; we can’t show all of them in one figure.

dataset.describe()

Again, the features are on different scales.

Results

Results analysis

  • On this dataset, even though the features are on different scales, scaling does not necessarily benefit every model that uses PCA. Nevertheless, in all the PCA models the second-highest score is very close to the highest. This may mean that tuning PCA’s output dimension while trying several scaling methods would beat all the results without scaling.

  • Again, no single scaling method stands out.

  • Another interesting result is that no scaling method produced a very large improvement on most models (generally between 1% and 3%). This is because the dataset itself is imbalanced and we did not tune the hyperparameters. Another reason is that the AUC scores are already high (around 90%), so a big improvement is hard to see.

5.3 Sloan Digital Sky Survey DR14 dataset

Link
Classification task: galaxy, star, or quasar?
Metric: Accuracy (multi-class)
Dataset size: (10000, 18)
Counts per class: galaxies 4998, stars 4152, quasars 850

Below are five sample rows from a subset of the columns; we can’t show all of them in one figure.

dataset.describe()

Again, the features are on different scales.

Results

Results analysis

  • Scaling makes a big difference to the results here, as expected given that the features are on different scales.

  • RobustScaler performs well on almost all the models that use PCA. This may be because the many outliers in this dataset shift the PCA eigenvectors; without PCA, the same outliers matter much less. We would need to dig into the data to be sure.

  • The accuracy gap between StandardScaler and the other scaling methods reaches 5%, which again means we need to experiment with several scaling methods.

  • PCA always benefits from scaling here.

5.4 Income Classification dataset

Link
Classification task: is income >50K or <=50K?
Metric: AUC (the dataset is imbalanced)
Dataset size: (32561, 7)
Counts per class: <=50K 24720, >50K 7841

Below are five sample rows from a subset of the columns; we can’t show all of them in one figure.

dataset.describe()

This is again a dataset whose features are on different scales.

Results

Results analysis

  • The dataset is again imbalanced, but here scaling is clearly effective and can improve the results by up to 20%. This is probably because the AUC scores are lower (around 80%) than on the Bank Marketing dataset, leaving more room for large improvements.

  • Although StandardScaler is not highlighted (I only highlight the highest score in each column), it comes close to the best result in many columns, though not always. From the runtimes (not shown), StandardScaler is also faster than most other scaling methods. If speed matters most to you, StandardScaler is a fine choice; but if accuracy is what you care about, you need to try other scaling methods.

  • Again, no single scaling method performs best across all the algorithms.

  • PCA almost always benefits from scaling.

Conclusion

  • The experiments show that scaling can improve results even on models with well-tuned hyperparameters. The scaling method therefore needs to be treated as an important hyperparameter.

  • Different scaling methods affect different classifiers differently. Distance-based classifiers such as SVM, KNN, and MLP (a neural network) benefit greatly from scaling. But even tree-based classifiers (CART and RF), which are unaffected by some scaling techniques, can benefit from other scaling methods.

  • Understanding the math behind the models and the preprocessing methods is the best way to understand these results. (For example: how does a tree-based classifier work, and why are some scaling methods powerless against it?) Knowing, say, that StandardScaler won’t help when you are using a random forest will save you a lot of time.

  • Preprocessing methods like PCA do benefit from scaling. When they don’t, the cause may be a poor choice of PCA dimension, too many outliers, or the wrong scaling method.

Please feel free to contact me if you find any bugs, gaps in the experiments’ coverage, or suggestions for improvement.
