Python data preprocessing – normalization, standardization, regularization
January 19, 2024
by Eli Meads
A few concepts about data preprocessing
Normalization:
Normalization scales each feature to a range between a specified minimum and maximum value (usually 0 to 1).
The common min-max normalization formula is (x - min(x)) / (max(x) - min(x)).
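To make the formula concrete, here is a minimal NumPy sketch that applies it column by column (the small example array is the same one used in the MinMaxScaler example below):

import numpy as np

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# (x - min(x)) / (max(x) - min(x)), computed per column
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# every column of X_minmax now lies in the range [0, 1];
# the result matches the MinMaxScaler output shown below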
Besides applying the formula by hand, a common approach is to scale each feature to the specified range (usually 0 to 1) with the preprocessing.MinMaxScaler class.
The purposes for using this method include:
1. It improves the stability of attributes with very small variance.
2. It preserves zero entries in sparse data.
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])
>>> # the same scaling is applied to the test-set data
>>> X_test = np.array([[-3., -1.,  4.]])
>>> X_test_minmax = min_max_scaler.transform(X_test)
>>> X_test_minmax
array([[-1.5       ,  0.        ,  1.66666667]])
>>> # the scaling factors for each attribute
>>> min_max_scaler.scale_
array([ 0.5       ,  0.5       ,  0.33...])
>>> min_max_scaler.min_
array([ 0.        ,  0.5       ,  0.33...])
Standardization:
Standardization rescales the data so that it falls into a small, specific range. The standardized data can be positive or negative, and its absolute value is generally not very large.
Each attribute/column is processed independently: subtract the attribute's mean (computed over the column) and divide by its standard deviation. The result is that, for each attribute/column, the data is centered around 0 with variance 1.
The z-score method: (x - mean(x)) / std(x)
Matlab has a dedicated function for this.
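Before turning to the scikit-learn helpers, here is a minimal NumPy sketch of the z-score formula itself, applied column by column to the same example array used below:

import numpy as np

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# z-score: subtract each column's mean and divide by its standard deviation
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)

# every column of X_zscore now has mean 0 and standard deviation 1;
# the result matches preprocessing.scale(X) shown below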
The sklearn.preprocessing.scale() function can be used to standardize the given data directly:
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
>>> # mean and standard deviation of the processed data
>>> X_scaled.mean(axis=0)
array([ 0.,  0.,  0.])
>>> X_scaled.std(axis=0)
array([ 1.,  1.,  1.])
The sklearn.preprocessing.StandardScaler class offers the advantage of storing the parameters learned from the training set (mean and standard deviation), so the same object can be used directly to transform test-set data:
>>> scaler = preprocessing.StandardScaler().fit(X)
>>> scaler
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> scaler.mean_
array([ 1.  ...,  0.  ...,  0.33...])
>>> scaler.std_
array([ 0.81...,  0.81...,  1.24...])
>>> scaler.transform(X)
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
>>> # the scaler fitted on the training set can be used directly to transform test data
>>> scaler.transform([[-1.,  1., 0.]])
array([[-2.44...,  1.22..., -0.26...]])
Regularization:
The regularization process scales each sample to unit norm (the norm of each sample is 1). This is useful if the similarity between two samples is later computed with a quadratic form such as the dot product, or with other kernel methods.
The main idea of Normalization is to calculate the P-norm for each sample and then divide each element in that sample by that norm. The result of this process is that the P-norm (L1-norm, L2-norm) of each processed sample is equal to 1.
The formula for the p-norm: ||x||_p = (|x1|^p + |x2|^p + ... + |xn|^p)^(1/p)
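As a small illustration of the formula (a sketch with made-up numbers, evaluating the p-norm for p = 1 and p = 2):

import numpy as np

x = np.array([1., -1., 2.])

# p = 1 (L1-norm): |1| + |-1| + |2| = 4
l1_norm = np.abs(x).sum()

# p = 2 (L2-norm): (|1|**2 + |-1|**2 + |2|**2) ** (1/2) = sqrt(6), about 2.449
l2_norm = (np.abs(x) ** 2).sum() ** 0.5

# dividing each element of x by its L2-norm yields a sample whose L2-norm is 1
x_unit = x / l2_norm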
This method is mainly used in text classification and clustering. For example, the dot product of two L2-normalized TF-IDF vectors is the cosine similarity of the two vectors.
You can use the preprocessing.normalize() function to convert the specified data:
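A minimal sketch of how that call might look, reusing the same small example array from above (norm='l2' scales each row to unit L2-norm; norm='l1' is also available):

import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# scale each sample (row) to unit L2-norm
X_normalized = preprocessing.normalize(X, norm='l2')

# every row of X_normalized now has an L2-norm of (approximately) 1
row_norms = np.linalg.norm(X_normalized, axis=1)

# the dot product of two L2-normalized rows is their cosine similarity,
# which is how the TF-IDF example mentioned above would be computed
cos_sim = X_normalized[0] @ X_normalized[1]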