In machine learning, data normalization matters a great deal: without it, training may fail outright or produce a badly distorted model.
Why normalize data?
Suppose we have a training dataset containing two samples:

 | Tumor size (cm) | Discovery time (days)
---|---|---
Sample 1 | 1 | 200 |
Sample 2 | 5 | 100 |
Take the k-nearest neighbors algorithm as an example. Because the values of "discovery time" are much larger than those of "tumor size", the distance between samples is dominated by "discovery time": the trained model is driven almost entirely by "discovery time", and the influence of "tumor size" becomes negligible.
The solution is to map all features onto the same scale; this is called data normalization.
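To make the scale problem concrete, here is a small sketch (using the two samples from the table above) showing how the raw Euclidean distance is dominated by the larger-scaled feature, and how merely changing units flips the result:

```python
import numpy as np

# Two samples from the table: [tumor size (cm), discovery time (days)]
a = np.array([1.0, 200.0])
b = np.array([5.0, 100.0])

# Raw Euclidean distance: sqrt(4^2 + 100^2), dominated by "discovery time"
raw_dist = np.linalg.norm(a - b)

# Express discovery time in years instead of days: now "tumor size" dominates.
# The "distance" depends entirely on the units chosen, which is arbitrary.
a_years = np.array([1.0, 200.0 / 365.0])
b_years = np.array([5.0, 100.0 / 365.0])
rescaled_dist = np.linalg.norm(a_years - b_years)

print(raw_dist)       # ~100.08, driven by the 100-day difference
print(rescaled_dist)  # ~4.01, driven by the 4 cm difference
```

This unit-dependence is exactly what normalization removes.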
Two common methods of data normalization are maximum value normalization and mean-variance normalization.
Maximum value normalization
Maximum value normalization maps the data into the range [0, 1] and is suitable when the data distribution has clear boundaries. Subtract the feature's minimum value from each sample's feature value and divide by the feature's range. The corresponding formula is:

x_scale = (x - x_min) / (x_max - x_min)
Use np.random to generate a 50×2 two-dimensional integer array and convert it to floating point:
import numpy as np
X = np.random.randint(0, 100, size=(50, 2))
X = np.array(X, dtype=float)
For the first column of data, the minimum is np.min(X[:, 0]) and the maximum is np.max(X[:, 0]); applying the formula:
X[:, 0] = (X[:, 0] - np.min(X[:, 0])) / (np.max(X[:, 0]) - np.min(X[:, 0]))
The second column is handled the same way:
X[:, 1] = (X[:, 1] - np.min(X[:, 1])) / (np.max(X[:, 1]) - np.min(X[:, 1]))
At this point, all eigenvalues of the sample are between 0 and 1.
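The two per-column statements above can also be written as one vectorized step using NumPy broadcasting; this is just a compact restatement, with `X_demo` standing in for the array generated earlier:

```python
import numpy as np

# Same setup as in the text: a 50x2 integer array converted to float
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 100, size=(50, 2)).astype(float)

# Broadcasting applies the min-max formula to every column at once
X_scaled = (X_demo - X_demo.min(axis=0)) / (X_demo.max(axis=0) - X_demo.min(axis=0))

# Each column's minimum is now 0 and its maximum is 1
print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```

Note that if a column were constant (min equal to max), the denominator would be zero; with random data over a wide range that case does not arise here.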
Mean-variance normalization
Mean-variance normalization maps all the data into a distribution with mean 0 and variance 1. It applies whether or not the data distribution has clear boundaries. The formula is:

x_scale = (x - x_mean) / x_std

where x_mean is the feature's mean and x_std is the feature's standard deviation.
Again, use np.random to generate a 50×2 two-dimensional integer array and convert it to floating point:
X2 = np.random.randint(0, 100, size=(50, 2))
X2 = np.array(X2, dtype=float)
For the first column of data, the mean is np.mean(X2[:, 0]) and the standard deviation is np.std(X2[:, 0]); applying the formula:
X2[:, 0] = (X2[:, 0] - np.mean(X2[:, 0])) / np.std(X2[:, 0])
The second column is handled the same way:
X2[:, 1] = (X2[:, 1] - np.mean(X2[:, 1])) / np.std(X2[:, 1])
You can see that the mean of each column of X2 is now very close to 0 and the standard deviation is very close to 1:

np.mean(X2[:, 0])  # -4.440892098500626e-18
np.mean(X2[:, 1])  # -1.2878587085651815e-16
np.std(X2[:, 0])   # 0.9999999999
np.std(X2[:, 1])   # 0.999999999999
Normalizing the test data set
All of the above normalizes the training data set; the test data set must be handled differently. The test data simulates the real environment, where the mean and variance of all incoming data cannot be known in advance, so applying the operations above directly to the test set is wrong. The correct approach is to normalize the test set using the statistics of the training set.
For example, maximum value normalization of the test set uses the training set's minimum and maximum:

x_test_scale = (x_test - x_train_min) / (x_train_max - x_train_min)

and mean-variance normalization of the test set uses the training set's mean and standard deviation:

x_test_scale = (x_test - x_train_mean) / x_train_std
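The mean-variance case can be sketched with NumPy alone (the variable names are mine; any arrays of matching width would do). Note that only the training set ends up exactly standardized, because the statistics come from it:

```python
import numpy as np

np.random.seed(0)
X_train = np.random.randint(0, 100, size=(50, 2)).astype(float)
X_test = np.random.randint(0, 100, size=(10, 2)).astype(float)

# Statistics are computed from the TRAINING set only
mean_train = X_train.mean(axis=0)
std_train = X_train.std(axis=0)

# Both sets are transformed with the same training statistics
X_train_std = (X_train - mean_train) / std_train
X_test_std = (X_test - mean_train) / std_train

print(X_train_std.mean(axis=0))  # essentially 0
print(X_test_std.mean(axis=0))   # near 0, but not exactly
```

This mirrors what happens in a real deployment: new data arrives one sample at a time, so it can only be scaled with statistics saved from training.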
Take mean-variance normalization as an example: scikit-learn provides the StandardScaler class for normalizing both the training and test data sets.
Take the data of iris as an example:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
The StandardScaler class lives in the sklearn.preprocessing module:
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
Pass the training data to the fit() method, which stores the training data's mean and standard deviation and returns the StandardScaler instance itself:
standardScaler.fit(X_train)
The mean_ and scale_ attributes hold the per-feature mean and standard deviation:

standardScaler.mean_   # array([5.83416667, 3.08666667, 3.70833333, 1.17])
standardScaler.scale_  # array([..., 0.44327067, 1.76401924, 0.75317107])
We can then pass the training and test data to the transform() method to get the normalized data:
X_train = standardScaler.transform(X_train)
X_test = standardScaler.transform(X_test)
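As a sanity check (my own addition, not from the original), the transform applied by StandardScaler is exactly the hand-written formula (x - mean_) / scale_, and the transformed training set has mean ≈ 0 and standard deviation ≈ 1:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# transform() reproduces the manual formula using the stored statistics
manual = (X_test - scaler.mean_) / scaler.scale_
print(np.allclose(manual, X_test_std))  # True
```

The test set's mean and standard deviation will only be approximately 0 and 1, since it was scaled with the training set's statistics.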
Source code
Github | ML-Algorithms-Action