In this article you will learn why and how to standardize.
Why we need standardization
Standardizing the data set is a common requirement for many machine learning estimators implemented in scikit-learn: if the individual features do not look more or less like standard normally distributed data (a Gaussian with zero mean and unit variance), the estimator may perform poorly. In practice, we often ignore the shape of the distribution and simply center each feature at zero mean and scale it to unit standard deviation. Many elements of the objective function of a machine learning algorithm assume that all features are centered around zero and have variances of the same order of magnitude. If the variance of one feature is orders of magnitude larger than that of the others, that feature may dominate the objective function and prevent the model from learning effectively from the other features.
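To make the "dominating feature" point concrete, here is a minimal sketch. The data is made up for illustration, and the scaling used is the z-score described later in this article. It compares how much of the total variance each feature contributes before and after scaling.

```python
import numpy as np

# Hypothetical data: column 0 is on a scale of thousands, column 1 on a scale of units.
X = np.array([[1000.0, 1.0],
              [2000.0, 2.0],
              [3000.0, 9.0]])

# Share of the total variance contributed by each feature before scaling.
variances = X.var(axis=0)
share_before = variances / variances.sum()

# Z-score each column: subtract the column mean, divide by the column standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Share of the total variance contributed by each feature after scaling.
variances_std = X_std.var(axis=0)
share_after = variances_std / variances_std.sum()

print(share_before)
print(share_after)
```

Before scaling, almost all of the variance comes from the first column; after scaling, both columns contribute equally.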
Three ways to standardize
Z-score standardization
- Principle: standardize the data using the mean and standard deviation of the raw data. The z-score transforms an original feature value x into x' = (x - mean) / std, where mean and std are computed per feature (per column). In other words, each column has its mean subtracted and is then divided by its standard deviation. As a result, the values of every feature/column are centered around 0 with a variance of 1.
- Scope of application: the maximum and minimum values of the features are unknown, or the data contains outliers beyond the expected value range.
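from sklearn import preprocessing
import numpy as np
# Toy training data: 3 samples (rows), 3 features (columns)
X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
# Standardize each column to zero mean and unit variance
X_scaled = preprocessing.scale(X_train)
X_scaled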
X_scaled.mean(axis=0)
# array([0., 0., 0.])
X_scaled.std(axis=0)
# array([1., 1., 1.])
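As a rough check of the principle above, the z-score can also be computed by hand with NumPy and compared with `preprocessing.scale`. The sketch below additionally uses `StandardScaler`, which stores the training mean and standard deviation so the same transformation can be applied to new data; the sample `X_new` is a hypothetical value chosen only for illustration.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Manual z-score: subtract the column mean, divide by the column standard deviation.
X_manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

# preprocessing.scale should give the same result as the manual formula.
X_scaled = preprocessing.scale(X_train)
same = np.allclose(X_manual, X_scaled)
print(same)

# StandardScaler learns the mean and scale on the training data
# and reuses them when transforming other data.
scaler = preprocessing.StandardScaler().fit(X_train)
X_new = np.array([[-1., 1., 0.]])  # hypothetical unseen sample
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)
```

Fitting the scaler on the training set and reusing it on later data keeps both on the same scale, which a fresh call to `scale` on the new data alone would not guarantee.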
Min-max standardization
- Principle: apply a linear transformation to the original data. Let minA and maxA be the minimum and maximum values of a feature A. Min-max normalization maps an original value x of A to a value x' in the interval [0, 1]: x' = (x - minA) / (maxA - minA).
- Scope of application: the maximum and minimum values of the features are known.
from sklearn import preprocessing
import numpy as np
# Toy training data: 3 samples (rows), 3 features (columns)
X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
# Rescale each column linearly to the interval [0, 1]
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax
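The min-max mapping can be sketched by hand in the same way. The snippet below applies the formula column by column, compares it with `MinMaxScaler`, and then transforms a hypothetical new sample (`X_new`, not part of the original data) to show that values outside the training range land outside [0, 1] with the default settings.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Manual min-max: x' = (x - min) / (max - min), computed per column.
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
X_manual = (X_train - col_min) / (col_max - col_min)

# MinMaxScaler with its default feature_range of (0, 1).
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
print(np.allclose(X_manual, X_train_minmax))

# A hypothetical new sample with values outside the training min/max.
X_new = np.array([[-3., -1., 4.]])
X_new_minmax = min_max_scaler.transform(X_new)
print(X_new_minmax)
```

This is why the method suits features whose minimum and maximum are known in advance: the [0, 1] guarantee only holds for values inside the range seen during fitting.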
MaxAbs standardization
- Principle: divide each feature of the training data by its maximum absolute value, so that the scaled values fall in the interval [-1, 1]. No centering is performed, so the data should already be centered at zero or be sparse.
- Scope of application: the maximum absolute value of each feature is known.
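from sklearn import preprocessing
import numpy as np
# Toy training data: 3 samples (rows), 3 features (columns)
X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
# Divide each column by its maximum absolute value, giving values in [-1, 1]
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_train_maxabs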
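A minimal sketch of the same idea by hand: each column is divided by its largest absolute value and the result is compared with `MaxAbsScaler`. The check on zero entries is added here only to illustrate why the method is a reasonable fit for sparse data.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Manual max-abs scaling: divide each column by its maximum absolute value.
col_max_abs = np.abs(X_train).max(axis=0)
X_manual = X_train / col_max_abs

X_train_maxabs = preprocessing.MaxAbsScaler().fit_transform(X_train)
print(np.allclose(X_manual, X_train_maxabs))

# The data is only divided, never shifted, so zero entries stay zero.
zeros_preserved = np.array_equal(X_train == 0, X_train_maxabs == 0)
print(zeros_preserved)
```

Because no mean is subtracted, the zeros in the input remain zeros in the output, so sparsity is not destroyed.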