Normalization is an important issue when building feature vectors in data mining. When features of different scales are stacked together as columns, features with small absolute values tend to be "eaten" by features with large absolute values, so the feature vector must be normalized to ensure that each feature is treated equally by the classifier. Several common normalization methods have been introduced in previous articles; this article focuses on how to translate those formulas and methods into Python code.

Min-Max Normalization

Also known as deviation normalization, this is a linear transformation of the original data that maps the resulting values into [0, 1]. The conversion function is:

x' = (x - x_min) / (x_max - x_min)

where x_min is the minimum value of the sample data and x_max is the maximum value of the sample data.

Python code implementation:

def max_min_normalization(x, x_max, x_min):
    return (x - x_min) / (x_max - x_min)

Or:

def max_min_normalization(x):
    return [(float(i) - min(x)) / float(max(x) - min(x)) for i in x]

In addition to the built-in max and min of a list, np.max and np.min are recommended because they are more powerful.

>>> a = np.array([[0, 1, 6], [2, 4, 1]])
>>> np.max(a)
6
>>> np.max(a, axis=0) # max of each column
array([2, 4, 6])

Reference links: stackoverflow.com/questions/3…

If you want to map the data to [-1, 1], replace the formula with:

x' = (x - x_mean) / (x_max - x_min)

where x_mean is the mean of the sample data.

Python code implementation:

import numpy as np
 
def normalization(x):
    return [(float(i) - np.mean(x)) / (max(x) - min(x)) for i in x]

One drawback of this method is that if the data contains outliers that deviate far from the normal values, the normalized results will be distorted.
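To see this drawback concretely, here is a small sketch (the data values are made up for illustration): a single outlier stretches the range, so all the normal values are squeezed into a narrow band near 0.

```python
def max_min_normalization(x):
    return [(float(i) - min(x)) / float(max(x) - min(x)) for i in x]

normal = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 5, 100]

print(max_min_normalization(normal))        # evenly spread over [0, 1]
print(max_min_normalization(with_outlier))  # first five values squeezed below 0.05
```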

Z-Score Standardization

The z-score standardization method applies when the maximum and minimum values of attribute A are unknown, or when there are outliers beyond the normal value range. It standardizes the data using the mean and standard deviation of the original data.

The processed data follow a standard normal distribution, i.e. mean 0 and standard deviation 1. The transformation function is:

z = (x - μ) / σ

where μ is the mean of all sample data and σ is the standard deviation of all sample data.

Python implementation:

def z_score_normalization(x,mu,sigma):  
    return (x - mu) / sigma

With mu the mean and sigma the standard deviation, the code can also be written with NumPy:

import numpy as np

x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
x = np.array(x)

def z_score(x):
    return (x - np.mean(x)) / np.std(x, ddof=1)

The z-score standardization method is also affected by outliers. An improved z-score changes the standard-score formula by replacing the mean with the median and the standard deviation with the absolute deviation.

The median is obtained by sorting all the data and taking the middle value; if the number of data points is even, take the average of the two middle values.

# For a finite set of numbers, the median is found by sorting all observations.
# If there is an even number of observations, the median is taken as the
# average of the two middle values.
def get_median(data):
    data = sorted(data)
    size = len(data)
    if size % 2 == 0:  # even
        median = (data[size // 2] + data[size // 2 - 1]) / 2
    else:  # odd
        median = data[(size - 1) // 2]
    return median

σ1 is the mean absolute deviation of all sample data from the median, which can be calculated as:

σ1 = (1/n) Σ |x_i - median|
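Putting the two pieces together, a minimal sketch of the improved z-score (the function name robust_z_score is my own; the original article does not give a combined implementation):

```python
import numpy as np

def get_median(data):
    # sort and take the middle value; average the two middle values if even
    data = sorted(data)
    size = len(data)
    if size % 2 == 0:
        return (data[size // 2] + data[size // 2 - 1]) / 2
    return data[(size - 1) // 2]

def robust_z_score(x):
    # modified z-score: median replaces the mean, and the mean absolute
    # deviation from the median replaces the standard deviation
    x = np.asarray(x, dtype=float)
    med = get_median(list(x))
    sigma1 = np.mean(np.abs(x - med))
    return (x - med) / sigma1

print(robust_z_score([1, 2, 3, 4, 5]))
```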

The Sigmoid function

The sigmoid function has an S-shaped curve and works well as a threshold function. It is centrally symmetric about (0, 0.5), where its slope is relatively large; as the input tends to positive or negative infinity, the mapped value tends to 1 or 0. It is a "normalization method" that I like very much; the quotation marks are there because I think the sigmoid also performs well for threshold segmentation, and the segmentation threshold can be moved by changing the formula. Here, using it purely as a normalization method, we only consider (0, 0.5) as the segmentation threshold:

Python implementation:

import numpy as np

def sigmoid(x, useStatus):
    if useStatus:
        return 1.0 / (1 + np.exp(-float(x)))
    else:
        return float(x)

Here useStatus controls whether the sigmoid is applied, which makes debugging easier.
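Since np.exp is vectorized, the same mapping can be applied to a whole array at once; this is a sketch of my own, not part of the original code:

```python
import numpy as np

def sigmoid_array(x):
    # vectorized sigmoid: maps any real input into (0, 1),
    # with 0 mapping to exactly 0.5
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid_array([-100, 0, 100]))  # values near 0, exactly 0.5, and near 1
```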

Normalization in sklearn

sklearn.preprocessing provides several useful functions for scaling data so that it is ready for use by learning algorithms.

1) Mean-standard deviation scaling

This corresponds to the z-score standardization described above.

from sklearn import preprocessing
import numpy as np

x_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
x_scaled = preprocessing.scale(x_train)
print(x_scaled)
# output:
# [[ 0.         -1.22474487  1.33630621]
#  [ 1.22474487  0.         -0.26726124]
#  [-1.22474487  1.22474487 -1.06904497]]
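preprocessing.scale is a one-shot function. When the same transformation must later be applied to test data, scikit-learn's StandardScaler stores the fitted mean and scale; this is a sketch of that standard pattern (x_test is a made-up example):

```python
from sklearn import preprocessing
import numpy as np

x_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
# fit remembers the per-column mean and scale of the training set
scaler = preprocessing.StandardScaler().fit(x_train)
print(scaler.mean_)

# new data is scaled with the statistics learned from the training set
x_test = np.array([[-1., 1., 0.]])
print(scaler.transform(x_test))
```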

2) Min-max standardization

from sklearn import preprocessing
import numpy as np

x_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()
x_train_minmax = min_max_scaler.fit_transform(x_train)
print(x_train_minmax)
# output:
# [[0.5        0.         1.        ]
#  [1.         0.5        0.33333333]
#  [0.         1.         0.        ]]

3) Maximum absolute value scaling (each value divided by the maximum absolute value of its dimension)

from sklearn import preprocessing
import numpy as np

x_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
max_abs_scaler = preprocessing.MaxAbsScaler()
x_train_maxabs = max_abs_scaler.fit_transform(x_train)
print(x_train_maxabs)
# output:
# [[ 0.5 -1.   1. ]
#  [ 1.   0.   0. ]
#  [ 0.   1.  -0.5]]

4) Normalization

Normalization is the basis of vector space model in text classification and clustering.

from sklearn import preprocessing
import numpy as np

x_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
x_normalized = preprocessing.normalize(x_train, norm='l2')
print(x_normalized)
# output:
# [[ 0.40824829 -0.40824829  0.81649658]
#  [ 1.          0.          0.        ]
#  [ 0.          0.70710678 -0.70710678]]

The norm parameter is optional and defaults to 'l2' (divide each vector by the square root of the sum of squares of its elements), normalizing each non-zero sample vector. If the axis parameter is set to 0, each non-zero feature (column) is normalized instead.
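For example, with axis=0 each column, rather than each row, becomes a unit vector:

```python
from sklearn import preprocessing
import numpy as np

x_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
# axis=0: each feature (column) is scaled to unit L2 norm
x_cols = preprocessing.normalize(x_train, norm='l2', axis=0)
print(x_cols)
print(np.linalg.norm(x_cols, axis=0))  # every column norm is 1
```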

For details, see normalizing L0, L1, and L2 norms

5) Binarization (converting data to 0 and 1)

from sklearn import preprocessing
import numpy as np

x_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
binarizer = preprocessing.Binarizer().fit(x_train)
print(binarizer)
print(binarizer.transform(x_train))
# output:
# Binarizer(copy=True, threshold=0.0)
# [[1. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
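The cutoff defaults to 0.0 but can be changed through the threshold parameter; here is a sketch binarizing at 1.0 instead (values strictly greater than the threshold become 1, the rest 0):

```python
from sklearn import preprocessing
import numpy as np

x_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
binarizer_1 = preprocessing.Binarizer(threshold=1.0).fit(x_train)
print(binarizer_1.transform(x_train))
# only the entries strictly greater than 1.0 (here, the single value 2.0
# in each of the first two rows) map to 1
```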

Reference links:

  • Blog.csdn.net/gamer_gyt/a…
  • Scikit-learn.org/stable/modu…