1 Data Normalization

Dimension refers to expressing a derived physical quantity as a product of powers of several fundamental quantities. When comparing data, two things matter: absolute value and dimension. Because features carry different dimensions, their magnitudes cannot be compared directly by absolute value, so feature importance cannot be judged this way either. For example, if the variance of one feature is several orders of magnitude larger than that of the others, it will dominate the learning algorithm, drown out the other features, and may even prevent the model from converging.

Nondimensionalization is a data-preprocessing step that gives features equal weight: comparison shifts from absolute values to relative values and is no longer affected by dimension, which improves model accuracy and stability and accelerates convergence. The main nondimensionalization method is standardization, in which distributions with different ranges of variation are mapped onto the same fixed range. When that range is 0-1, the mapping is called normalization.

1.1 Max-min normalization

The core idea is to map feature values linearly into the 0-1 interval using the feature's minimum and maximum over the samples, without destroying the distribution. The transformation function is:

$$X_{scale}=\frac{X-X_{\min}}{X_{\max}-X_{\min}}$$

Its characteristics are:

  • The computation is simple and intuitive;
  • Newly added data may change the minimum or maximum, which then has to be recomputed;
  • It is sensitive to outliers, because outliers directly affect the extreme values. Max-min normalization is therefore only suitable for data distributed within a bounded range without outliers, such as human height data or exam score data, as illustrated in the sketch below.
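As a concrete illustration, here is a minimal NumPy sketch of max-min normalization; the toy feature matrix `X` is made up for the example.

```python
import numpy as np

# Toy feature matrix: rows are samples, columns are features (illustrative data).
X = np.array([[170.0, 60.0],
              [180.0, 80.0],
              [160.0, 50.0]])

# Max-min normalization: map each feature linearly into [0, 1].
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_scale = (X - X_min) / (X_max - X_min)

print(X_scale)  # every column now lies in [0, 1]
```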

1.2 Z-Score normalization

$$X_{scale}=\frac{X-\mu}{\sigma}$$

Its characteristics are:

  • It fits naturally with data that is inherently normally distributed;
  • Even if there are outliers in the original data set, the standardized data still has zero mean, so it does not become biased;
  • Its sensitivity to outliers is low, but outliers still affect the computed mean and variance, so when outliers are present the distributions of different dimensions may still differ considerably after the transformation. A short sketch follows the list.
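For comparison, a minimal NumPy sketch of Z-score standardization on the same kind of toy data (the matrix `X` is illustrative):

```python
import numpy as np

X = np.array([[170.0, 60.0],
              [180.0, 80.0],
              [160.0, 50.0]])

# Z-score standardization: zero mean and unit variance per feature.
mu = X.mean(axis=0)
sigma = X.std(axis=0)          # population standard deviation
X_scale = (X - mu) / sigma

print(X_scale.mean(axis=0))    # approximately 0 for each feature
print(X_scale.std(axis=0))     # approximately 1 for each feature
```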

2. Category balancing

Class imbalance refers to the situation in a classification task where the numbers of training samples of different classes differ greatly; for example, a training set with 998 negative examples but only 2 positive examples. Under class imbalance the trained learner is often worthless, because it can achieve a small training error simply by always predicting the majority class.

Strategies to solve the problem of category imbalance mainly include:

2.1 Threshold Movement

Also known as rebalancing or rescaling. Suppose the training set contains $m^+$ positive examples and $m^-$ negative examples, and the learner outputs a predicted probability $y$ that a sample is positive. Assuming the training set is an unbiased sample of the true distribution, then for any test sample:

$$\frac{y'}{1-y'}=\frac{y}{1-y}\times \frac{m^-}{m^+}$$

The sample is then classified as positive or negative by comparing $y'$ with 0.5, as in the sketch below.
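A minimal sketch of this rescaling, using a helper function `rescale_and_classify` (a name introduced here only for illustration) that applies the formula above and then compares $y'$ with 0.5:

```python
def rescale_and_classify(y, m_pos, m_neg):
    """Rescale the predicted positive-class probability y by the class ratio
    m_neg / m_pos, then classify by comparing the rescaled y' with 0.5."""
    odds = y / (1.0 - y) * (m_neg / m_pos)   # rescaled odds y' / (1 - y')
    y_prime = odds / (1.0 + odds)            # back to a probability
    return 1 if y_prime > 0.5 else 0

# Example: 2 positive vs. 998 negative training samples (illustrative numbers).
print(rescale_and_classify(y=0.4, m_pos=2, m_neg=998))  # -> 1
```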

2.2 Undersampling

The core idea is to achieve class balance by removing some samples from the majority class. Undersampling reduces training time because samples are discarded, but care must be taken to avoid underfitting caused by losing important information. A representative undersampling algorithm is EasyEnsemble.
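A minimal sketch of plain random undersampling (not EasyEnsemble itself, which additionally trains an ensemble over several undersampled subsets); the labels and features below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative imbalanced data: class 0 is the majority, class 1 the minority.
y = np.array([0] * 95 + [1] * 5)
X = rng.normal(size=(100, 3))

maj_idx = np.where(y == 0)[0]
min_idx = np.where(y == 1)[0]

# Randomly keep only as many majority samples as there are minority samples.
kept_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
balanced_idx = np.concatenate([kept_maj, min_idx])

X_bal, y_bal = X[balanced_idx], y[balanced_idx]
print(np.bincount(y_bal))   # [5 5]
```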

2.3 Oversampling

The core idea is to achieve class balance by adding samples to the minority class. Oversampling increases training time because samples are added, and care must be taken to avoid overfitting. SMOTE is the typical oversampling method.
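A simplified sketch of the SMOTE interpolation idea, in which a synthetic sample is placed on the segment between a minority sample and one of its minority-class nearest neighbours; a production implementation such as the one in imbalanced-learn handles neighbour selection and edge cases more carefully:

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation
    between a random minority sample and one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)   # distances to all minority samples
        neighbours = np.argsort(d)[1:k + 1]            # skip the sample itself
        j = rng.choice(neighbours)
        lam = rng.random()                             # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.2]])
print(smote_like(X_minority, n_new=4))
```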

3. Discretization of continuous values

Discretization of continuous attributes means segmenting continuous data into a series of discrete intervals, each corresponding to one attribute value. The main reasons for discretizing continuous attributes are as follows:

  • Algorithm requirements, for example classification decision-tree algorithms that operate on categorical attributes;
  • Improving the expressive power of features at the knowledge level: for example, ages 5 and 65 have to be compared numerically as continuous values, but are more intuitive when mapped to "young" and "old";
  • Discretized data is more robust to outliers and can improve model stability.

The main methods of continuous attribute discretization are described below; a small binning sketch follows the list.

  • Unsupervised discretization methods:
    • Equal-width discretization: the continuous attribute is divided into a finite number of intervals of equal length.
    • Equal-frequency discretization: the continuous attribute is divided into a finite number of intervals, each containing the same number of samples.
  • Supervised discretization methods:
    • The information gain method, a bi-partition approach whose core is to choose the split point that maximizes the information gain before and after discretization.
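As an illustration of the two unsupervised schemes, a small pandas sketch using `cut` (equal-width) and `qcut` (equal-frequency); the age data is made up:

```python
import pandas as pd

ages = pd.Series([5, 18, 23, 31, 40, 47, 52, 65])

# Equal-width discretization: 3 intervals of equal length.
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "old"])

# Equal-frequency discretization: 3 intervals with (roughly) equal sample counts.
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))
```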

4 Processing missing values

In practice, high measurement cost, privacy protection, invalid data, omitted information and other situations lead to missing attribute values in data sets, so handling missing values is unavoidable.

The main approaches to handling missing values are described below; a small imputation sketch follows the list.

  • Interpolation filling, which infers missing values from the distribution of the existing data: for example mean filling (mainly for continuous attributes), mode filling (mainly for discrete attributes), and regression filling (building a regression equation on the existing attribute values to estimate the missing value).
  • Similarity-based filling, which estimates missing values from samples similar to the one with the missing attribute: for example Hot Deck Imputation, which selects the most similar sample in the data set under some similarity measure and uses its attribute value to replace the missing one, and cluster filling, which uses cluster analysis to select the most similar subset of samples and then performs interpolation filling within it.
  • The C4.5 approach, which uses samples with missing attributes directly and weights their influence on the result; it is mainly used in decision-tree algorithms. For decision trees, see Python Machine Learning in Practice (I): decision tree principles, construction, pruning and visualization from scratch.
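A minimal pandas sketch of mean filling and mode filling on a toy table (the column names and values are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "height": [170.0, np.nan, 160.0, 175.0],   # continuous attribute
    "city":   ["A", "B", np.nan, "B"],         # discrete attribute
})

# Mean filling for the continuous attribute.
df["height"] = df["height"].fillna(df["height"].mean())

# Mode filling for the discrete attribute.
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```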

5 Dummy encoding

Dummy encoding (Dummy Encode) is a method that vectorizes and quantifies the values of a set of qualitative, discrete features as 0-1 indicator variables.

The advantages of dummy encoding are:

  • The data become sparse, and sparse vectors can be operated on quickly, with many available optimizations;
  • It improves the expressive power of the model: dummy encoding effectively introduces nonlinearity and increases model capacity;
  • It is dimensionless and quantified: features of different types are uniformly expressed as 0-1 vectors, which is convenient for inference and calculation.

The disadvantage of dummy encoding is that the dummy codes of different features are concatenated, so the final feature vector can suffer from the curse of dimensionality in feature space. Therefore, PCA dimensionality reduction is often used together with dummy encoding.

Specifically, dummy encoding comes in two forms. As shown in Figure 1.2.8, using an $N$-bit state register to encode $N$ states, where each state has its own register bit and only one bit is active at any time, is called one-hot encoding. Removing one degree of freedom (using $N-1$ bits) gives ordinary dummy encoding.
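A small pandas sketch of both forms using `get_dummies`; dropping the first level removes one degree of freedom:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: N states -> N indicator columns, exactly one is 1 per row.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Dummy encoding: drop one degree of freedom, i.e. N states -> N-1 columns.
dummy = pd.get_dummies(df["color"], prefix="color", drop_first=True)

print(one_hot)
print(dummy)
```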

6 Regularization

Regularization is a strategy that introduces structural risk minimization on top of empirical risk minimization. It provides a way to inject domain prior knowledge and training intent, and is also a commonly used penalty-function method for avoiding overfitting. The general form of regularization is:

$$\underset{f}{\min}\ \lambda \,\Omega \left( f \right) +\sum_{i=1}^m{\ell \left( f\left( \boldsymbol{x}_i \right) ,y_i \right)}$$

$\Omega \left( f \right)$ is the regularization term, which describes certain properties of the model in order to reduce structural risk. $\sum_{i=1}^m{\ell \left( f\left( \boldsymbol{x}_i \right) ,y_i \right)}$ is the empirical risk, which describes how well the model fits the training data. The constant $\lambda$ expresses the trade-off between structural and empirical risk. The common regularization methods are summarized below.

6.1 L1 regularization

Under $L^1$-norm regularization the optimal solution tends to occur at the edges and corners of the regularizer's constraint region, which produces sparsity, so $L^1$ regularization is also called the LASSO (Least Absolute Shrinkage and Selection Operator). It introduces a preference for sparse parameters and helps highlight key features, making it useful for feature selection. The $L^0$ norm can also achieve feature selection and sparsity, but it is much harder to optimize than the $L^1$ norm, so the $L^1$ norm is preferred in practice.

6.2 L2 regularization

The $L^2$-norm regularizer, also known as Weight Decay, prefers to retain more features with smaller, more uniform weights (shrunk toward zero). In regression analysis, the cost function with $L^2$-norm regularization is called Ridge Regression.

As shown in the figure, consider only two features. If $\lambda \to 0$, i.e. the regularization constraint is weak, the height of the regularization cone decreases and the regularized solution approaches the least-squares solution. If $\lambda$ increases, i.e. the regularization constraint is strong, the height of the regularization cone increases, the regularized solution moves away from the least-squares solution and closer to the coordinate axes, and the parameters become smaller and smaller.
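A minimal scikit-learn sketch contrasting L1 (Lasso) and L2 (Ridge) regularization; here `alpha` plays the role of $\lambda$ and the data are randomly generated for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features actually matter; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: many coefficients driven exactly to 0
ridge = Ridge(alpha=0.1).fit(X, y)   # L2: coefficients shrunk but mostly nonzero

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
```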

7 Data dimension reduction

This section mainly introduces PCA dimensionality reduction.

As shown in the figure, most of the variation in the data occurs along the $x_2$ direction, while the values along $x_1$ are approximately the same. For such problems the $x_1$ coordinate can simply be dropped and only the $x_2$ coordinate retained.


However, some cases cannot be handled this way. For example, the data in the figure are spread evenly along both the $x_1$ and $x_2$ directions, and removing either dimension may strongly affect the result. In this case the principle of PCA is to find the direction along which the data are most spread out, i.e. have the largest variance (the red coordinate system in the figure), and project onto it to achieve dimensionality reduction.

From the above examples, the optimization objectives of the PCA algorithm can be summarized as:

  • (a) The correlation between the selected feature dimensions should be as small as possible, to reduce the number of dimensions and the computational cost;
  • (b) The retained feature dimensions should reflect the nature of the data as much as possible, i.e. have the largest variance.

These two optimization goals can be unified by a covariance matrix:


$$C=\left[ \begin{matrix} Cov\left( \boldsymbol{\alpha }_1,\boldsymbol{\alpha }_1 \right)& Cov\left( \boldsymbol{\alpha }_1,\boldsymbol{\alpha }_2 \right)& \cdots& Cov\left( \boldsymbol{\alpha }_1,\boldsymbol{\alpha }_n \right)\\ Cov\left( \boldsymbol{\alpha }_2,\boldsymbol{\alpha }_1 \right)& Cov\left( \boldsymbol{\alpha }_2,\boldsymbol{\alpha }_2 \right)& \cdots& Cov\left( \boldsymbol{\alpha }_2,\boldsymbol{\alpha }_n \right)\\ \vdots& \vdots& \ddots& \vdots\\ Cov\left( \boldsymbol{\alpha }_n,\boldsymbol{\alpha }_1 \right)& Cov\left( \boldsymbol{\alpha }_n,\boldsymbol{\alpha }_2 \right)& \cdots& Cov\left( \boldsymbol{\alpha }_n,\boldsymbol{\alpha }_n \right)\\ \end{matrix} \right] \xrightarrow{\text{ideal covariance matrix after dimensionality reduction}}\left[ \begin{matrix} \delta _1& & & \\ & \delta _2& & \\ & & \ddots& \\ & & & \delta _n\\ \end{matrix} \right]$$

Based on this, let $X_{m\times n}$ be the centered sample matrix, $P_{r\times m}$ the PCA dimension-reduction matrix, $Y_{r\times n}=PX$ the sample matrix after dimension reduction, and $C_X$ and $C_Y$ the covariance matrices of the original and reduced samples respectively. Since the correlation between different features is what matters here, the matrices are written uniformly as row vector groups, $Y_{r\times n}=\left[ \begin{matrix} \boldsymbol{\beta }_1& \boldsymbol{\beta }_2& \cdots& \boldsymbol{\beta }_r\\ \end{matrix} \right] ^T$, and the optimization problem is:

$$\max\ tr\left( C_Y \right) \quad s.t.\ P^TP=I$$

The former reflects the optimization objective (a), while the latter reflects the optimization objective (b).

The following is a simple derivation of the condition on the PCA dimension-reduction matrix.


$$\max\ tr\left( C_Y \right) =\max\ tr\left[ \frac{1}{n-1}PX\left( PX \right) ^T \right] =\max\ tr\left[ PC_XP^T \right]$$

By the Lagrange multiplier method, let $f\left( P \right) =tr\left( PC_XP^T \right) +\lambda \left( P^TP-I \right)$.

Then:


$$\frac{\partial f\left( P \right)}{\partial P}=\frac{\partial tr\left( PC_XP^T \right)}{\partial P}+\lambda \frac{\partial \left( P^TP \right)}{\partial P}=PC_{X}^{T}+\lambda P$$

Setting the derivative to zero gives $C_XP^T=-\lambda P^T$.

That is, the rows of the dimension-reduction matrix $P$ are the first $r$ eigenvectors of the covariance matrix $C_X$ of the original samples, forming an orthogonal row-vector group.
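A minimal NumPy sketch of this procedure, following the row-vector convention above (features as rows, samples as columns); the toy data are illustrative:

```python
import numpy as np

def pca(X, r):
    """Reduce the m x n sample matrix X (rows = features, columns = samples)
    to r dimensions, following the derivation above."""
    X = X - X.mean(axis=1, keepdims=True)        # centre each feature (row)
    C_X = X @ X.T / (X.shape[1] - 1)             # m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C_X)       # eigen-decomposition (ascending order)
    order = np.argsort(eigvals)[::-1][:r]        # indices of the r largest eigenvalues
    P = eigvecs[:, order].T                      # r x m dimension-reduction matrix
    return P @ X                                 # r x n reduced sample matrix

# Toy data: 2 features, 5 samples, almost all variance along one direction.
X = np.array([[2.5, 0.5, 2.2, 1.9, 3.1],
              [2.4, 0.7, 2.9, 2.2, 3.0]])
print(pca(X, r=1))
```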


Welcome to my AI channel “AI Technology Club”.