Small knowledge, big challenge! This article is taking part in the "Essential Tips for Programmers" creation activity.

In this article you will learn why we need data standardization and how to do it.

Why we need standardization

Dataset standardization is a common requirement for many machine learning estimators implemented in scikit-learn: if the individual features do not more or less look like standard normally distributed data (zero mean and unit variance), the algorithm may perform poorly. In practice, we often ignore the shape of the distribution and simply transform the data to zero mean and unit variance. Many elements in the objective function of a learning algorithm assume that all features are centered around zero and have variances of the same order of magnitude. If one feature's variance is orders of magnitude larger than the others', it may dominate the objective function and prevent the model from learning effectively from the other features.
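To make this concrete, here is a minimal sketch (with made-up numbers) comparing the per-feature variances of a toy dataset before and after applying scikit-learn's preprocessing.scale:

import numpy as np
from sklearn import preprocessing

# Two made-up features on very different scales.
X = np.array([[1000., 0.1],
              [2000., 0.2],
              [3000., 0.3]])

print(X.var(axis=0))                       # the first feature's variance is ~8 orders of magnitude larger
print(preprocessing.scale(X).var(axis=0))  # [1. 1.] -- after scaling, both features contribute comparably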


Three ways to standardize

Z-score standardization

  • Principle: standardize the data using the mean and standard deviation of the original data. An original feature value x is converted to x' by the z-score formula x' = (x - mean) / std, i.e. each feature (column) has its mean subtracted and is then divided by its standard deviation. As a result, for each feature/column the data are centered around 0 with a variance of 1.
  • Scope of application: the maximum and minimum values of the feature are unknown, or there are outliers beyond the normal value range.

from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
X_scaled = preprocessing.scale(X_train)
X_scaled

X_scaled.mean(axis=0)
# array([0., 0., 0.])

X_scaled.std(axis=0)
# array([1., 1., 1.])
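The scale function standardizes an array in one shot. scikit-learn also provides the StandardScaler class, which stores the per-column mean and standard deviation learned from the training data so that exactly the same transformation can be reused on later data. A minimal sketch (the test sample below is made up for illustration):

from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

scaler = preprocessing.StandardScaler().fit(X_train)
scaler.mean_                        # per-column means learned from X_train
scaler.scale_                       # per-column standard deviations
scaler.transform([[-1., 1., 0.]])   # standardize a new (made-up) sample with the same parameters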


Min-Max standardization

  • Principle: perform a linear transformation on the original data. If the minimum and maximum values of a feature A are minA and maxA, then an original value x of A is mapped by min-max normalization to a value x' in the interval [0, 1] via x' = (x - minA) / (maxA - minA).
  • Scope of application: the maximum and minimum values of the feature are known.

from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax
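The fitted MinMaxScaler remembers each column's minimum and maximum from the training data, so new samples are mapped with the same formula; values outside the training range therefore fall outside [0, 1]. A minimal self-contained sketch (the test sample is made up for illustration):

from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler().fit(X_train)

# A new sample outside the training range maps outside [0, 1].
X_test = np.array([[-3., -1., 4.]])
min_max_scaler.transform(X_test)
# array([[-1.5       ,  0.        ,  1.66666667]])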


MaxAbs standardization

  • Principle: scale each feature by its maximum absolute value so that the training data falls within the range [-1, 1]. This scaler does not shift or center the data, so it is intended for data that is already centered at zero or for sparse data.
  • Scope of application: the maximum absolute value of the feature is known.

from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_train_maxabs
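Because MaxAbsScaler only divides each column by its maximum absolute value and never shifts the data, it preserves sparsity, which is why it is recommended for sparse data. A minimal sketch using a made-up scipy sparse matrix:

import numpy as np
import scipy.sparse as sp
from sklearn import preprocessing

# A made-up sparse matrix; MaxAbsScaler leaves the zero entries untouched.
X_sparse = sp.csr_matrix([[ 0., -2.,  0.],
                          [ 4.,  0.,  0.],
                          [ 0.,  1., -1.]])
max_abs_scaler = preprocessing.MaxAbsScaler()
X_scaled = max_abs_scaler.fit_transform(X_sparse)   # the result stays a sparse matrix
X_scaled.toarray()
# array([[ 0. , -1. ,  0. ],
#        [ 1. ,  0. ,  0. ],
#        [ 0. ,  0.5, -1. ]])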

If you are a Python beginner or want to get started with Python, you can search for [A new vision of Python] on WeChat. Sometimes a simple question can hold you up for a long time, while a small hint from someone else can suddenly make everything clear. I sincerely hope we can all make progress together.