
preprocessing.StandardScaler

When the data x is centered by its mean (μ) and scaled by its standard deviation (σ), the result follows a distribution with mean 0 and variance 1. This process is called z-score normalization (standardization).
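As a quick sanity check, z-score normalization can be computed by hand with NumPy (a sketch of the formula itself, not sklearn's implementation):

```python
import numpy as np

x = np.array([-1.0, -0.5, 0.0, 1.0])

# z-score: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

print(round(z.mean(), 10))  # 0.0
print(round(z.std(), 10))   # 1.0
```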

```python
from sklearn.preprocessing import StandardScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

scaler = StandardScaler()
scaler.fit(data)                 # fit computes the mean and variance
scaler.mean_                     # check the mean learned during fit
scaler.var_                      # check the variance learned during fit

x_std = scaler.transform(data)   # transform standardizes the data
x_std.mean()                     # use mean() to check the mean of the result
x_std.std()                      # use std() to check the standard deviation

scaler.fit_transform(data)       # fit_transform(data) reaches the result in one step
scaler.inverse_transform(x_std)  # inverse_transform recovers the original data
```

For StandardScaler and MinMaxScaler, a null value NaN is treated as a missing value: it is ignored during fit and kept as NaN during transform. In addition, although dimensionless scaling is not a specific algorithm, the fit interface still requires an array of at least two dimensions; importing a one-dimensional array will raise an error. Normally, our input X is a feature matrix, and in real cases a feature matrix is rarely one-dimensional, so this problem seldom arises.
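A minimal sketch of both behaviors, using a toy single-column array: the NaN is ignored when fit computes the statistics and stays NaN after transform, while a 1-D array raises an error.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# 2-D column with one missing value
data = np.array([[1.0], [2.0], [np.nan], [3.0]])

scaler = StandardScaler()
x_std = scaler.fit_transform(data)   # NaN is ignored when computing mean/std
print(np.isnan(x_std[2, 0]))         # the NaN row stays NaN -> True

try:
    StandardScaler().fit(np.array([1.0, 2.0, 3.0]))  # 1-D input
except ValueError:
    print("1-D input raises ValueError")
```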

StandardScaler or MinMaxScaler: which one to choose?

  • It depends. In most machine learning algorithms, StandardScaler is chosen for feature scaling because MinMaxScaler is very sensitive to outliers. In PCA, clustering, logistic regression, support vector machines, and neural networks, StandardScaler is usually the better choice.
  • MinMaxScaler is widely used when no distance measurement, gradient, or covariance calculation is involved and the data needs to be compressed into a specific interval. For example, when quantizing pixel intensities in digital image processing, MinMaxScaler is used to compress the data into the interval [0, 1].
  • A practical rule: try StandardScaler first; if the result is poor, switch to MinMaxScaler.
  • Besides StandardScaler and MinMaxScaler, sklearn provides a variety of other scalers (centering only requires subtracting a number, which pandas handles easily, so sklearn does not provide a centering-only utility). For example, when we want to compress the data without affecting its sparsity (without changing the number of zeros in the matrix), we use MaxAbsScaler; when there are many outliers and the noise is heavy, we may choose quantile-based scaling, in which case RobustScaler is used.
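A brief sketch contrasting the two alternatives mentioned above, on an assumed toy matrix: MaxAbsScaler preserves zeros, while RobustScaler centers on the median and scales by the interquartile range so outliers have limited influence.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, RobustScaler

X = np.array([[0.0, 1.0], [0.0, -2.0], [5.0, 0.0], [10.0, 4.0]])

# MaxAbsScaler divides each column by its maximum absolute value,
# so existing zeros stay zeros and sparsity is preserved.
X_maxabs = MaxAbsScaler().fit_transform(X)
print((X == 0).sum() == (X_maxabs == 0).sum())  # same number of zeros -> True

# RobustScaler uses the median and quantiles instead of mean and variance,
# which makes the scaling robust to outliers.
X_robust = RobustScaler().fit_transform(X)
```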

Missing values

The data used in machine learning and data mining is never perfect. Many features are of great significance for analysis and modeling, yet real-world data collection is imperfect, so important fields often contain many missing values while the fields themselves cannot simply be abandoned. Handling missing values is therefore a very important part of data preprocessing.

```python
import pandas as pd

# index_col=0 tells pandas to use the first column of the file as the index
data = pd.read_csv(r".\Narrativedata.csv", index_col=0)
```

impute.SimpleImputer

class sklearn.impute.SimpleImputer(missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True)

We use this class to fill in missing values; it is designed specifically for that purpose. It has four important parameters:

  • missing_values: tells SimpleImputer what the missing value in the data looks like; the default is the null value np.nan
  • strategy: the strategy for filling missing values; the default is "mean"
  • Enter "mean" to fill with the mean (numeric features only)
  • Enter "median" to fill with the median (numeric features only)
  • Enter "most_frequent" to fill with the mode (both numeric and character features)
  • Enter "constant" to fill with the value given in the parameter fill_value (both numeric and character features)
  • fill_value: available when strategy is set to "constant"; a string or number indicating the value to fill in
  • copy: defaults to True, in which case a copy of the feature matrix is created; otherwise missing values are filled into the original feature matrix
```python
data.info()

# sklearn feature matrices must be 2-D, so reshape the 1-D column
Age = data.loc[:, "Age"].values.reshape(-1, 1)
Age[:20]

from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer()                                # default: fill with the mean
imp_median = SimpleImputer(strategy="median")             # fill with the median
imp_0 = SimpleImputer(strategy="constant", fill_value=0)  # fill with 0

imp_mean = imp_mean.fit_transform(Age)                    # fit_transform in one step
imp_median = imp_median.fit_transform(Age)
imp_0 = imp_0.fit_transform(Age)

imp_mean[:20]
imp_median[:20]
imp_0[:20]

# here we use the median to fill Age
data.loc[:, "Age"] = imp_median

# fill the categorical column Embarked with the mode
Embarked = data.loc[:, "Embarked"].values.reshape(-1, 1)
imp_mode = SimpleImputer(strategy="most_frequent")
data.loc[:, "Embarked"] = imp_mode.fit_transform(Embarked)
data.info()
```
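The strategies above can be checked on a tiny, self-contained toy column (assumed data, independent of the Narrativedata.csv file used in the article):

```python
import numpy as np
from sklearn.impute import SimpleImputer

col = np.array([[1.0], [2.0], [np.nan], [4.0]])

# mean of the observed values (1, 2, 4) is 7/3
print(SimpleImputer(strategy="mean").fit_transform(col).ravel())

# median of the observed values is 2.0
print(SimpleImputer(strategy="median").fit_transform(col).ravel())

# constant fill replaces NaN with fill_value
print(SimpleImputer(strategy="constant", fill_value=0).fit_transform(col).ravel())
```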