
The concept of dimensionality reduction

Dimensionality reduction, as the name implies, maps high-dimensional data to low-dimensional data. Before introducing how it is done, let's first consider a question: why do we reduce dimensionality at all?

The data we encounter generally has many features, so it can be regarded as high-dimensional, and many of those feature dimensions have a negligible effect on the quantity we want to predict. If we feed such data into a model without any processing, training not only takes much longer, but the final predictions may also be degraded by these nearly irrelevant features. Dimensionality reduction is therefore critical for both efficiency and accuracy.

If that's not intuitive, consider a concrete example: suppose we want to predict a student's score on an exam. The features we collect include the student's usual grades, mock exam scores, the student's clothing, how much hair the student has, what stationery the student uses, whether the student entered the classroom with the left foot or the right foot, and so on, for more than a hundred features. Obviously, it is most reasonable to keep the usual grades and mock exam scores, while the other features have little impact even if they vary.

From the above example we now have a general idea of the purpose of dimensionality reduction.

Common methods of dimensionality reduction

Direct dimensionality reduction

The test-score example in the concept section above is an intuitive case: when some features are completely irrelevant to the desired result, they can simply be removed. Of course, this is only the ideal case. In many situations the features are mixed together, and then singular value decomposition or the covariance method is needed.
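As a minimal sketch of this "just drop the irrelevant columns" idea, here is a toy pandas example; the column names and values are made up purely for illustration:

```python
import pandas as pd

# Hypothetical exam dataset: only the score-related columns matter.
df = pd.DataFrame({
    "usual_score":  [78, 85, 90, 66],
    "mock_score_1": [74, 88, 92, 60],
    "mock_score_2": [80, 83, 95, 63],
    "hair_amount":  [3, 1, 2, 3],             # irrelevant feature
    "entered_with_left_foot": [1, 0, 1, 1],   # irrelevant feature
})

# "Direct" dimensionality reduction: drop the columns known to be irrelevant.
reduced = df.drop(columns=["hair_amount", "entered_with_left_foot"])
print(reduced.shape)  # (4, 3) -- from 5 features down to 3
```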

Singular value decomposition

The principle

We know that when a matrix is stretched or rotated, the relative relations between its elements remain unchanged.

Suppose the data matrix we obtain is M, and U is the projection of M onto a set of coordinate axes. Such a projection can separate the mixed features well. Of course, this projection is not limited to a single axis: it is the projection of N-dimensional data onto k-dimensional axes (k <= N).

Since projecting onto low-dimensional coordinates can solve the feature-mixing problem that direct dimensionality reduction cannot, how do we carry out the projection?

We learned in middle school that projecting onto a direction means multiplying the original data by a trigonometric function. Taking that idea further: once the angle is fixed, the value of the trigonometric function is determined, so projecting is equivalent to multiplying the original data by a real number; in matrix operations, the corresponding step is multiplying by a matrix. So our goal becomes finding the matrix that projects the original data M onto low-dimensional coordinate axes.
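As a small illustration of "projection is just multiplication by a matrix", here is a numpy sketch that projects a few hypothetical 2-D points onto a line at a fixed angle; the points and the angle are arbitrary choices:

```python
import numpy as np

# Hypothetical 2-D data points, one point per row.
points = np.array([[3.0, 1.0],
                   [1.0, 2.0],
                   [4.0, 0.5]])

# Projecting onto a line at angle theta is multiplying by a fixed vector
# of trigonometric values -- i.e. by a (2, 1) matrix.
theta = np.pi / 6                      # the angle is fixed in advance
w = np.array([[np.cos(theta)],
              [np.sin(theta)]])        # projection matrix, shape (2, 1)

projected = points @ w                 # shape (3, 1): one coordinate per point
print(projected.ravel())
```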

Singular value decomposition analysis

The formula


$$M = USV^T$$

Where:
M: the original data
U: can be regarded as the data expressed on the low-dimensional coordinate axes
S: stretches U (in linear algebra, multiplying by a diagonal matrix, also called a stretching matrix, stretches the data)
V^T: rotates US
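For concreteness, here is a minimal numpy sketch of the decomposition itself (not the author's code, just `np.linalg.svd` applied to a random matrix), showing that the three factors reconstruct M:

```python
import numpy as np

# A small random data matrix M (rows = samples, columns = features).
rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))

# full_matrices=False gives the reduced SVD: U (6x4), S (4,), Vt (4x4).
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Reconstructing M from the three factors: M = U @ diag(S) @ V^T.
M_rebuilt = U @ np.diag(S) @ Vt
print(np.allclose(M, M_rebuilt))   # True
```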

Explanation of the formula

First, assume that there is a matrix U that partitions the data well in the low-dimensional coordinate system. By stretching and rotating U, the matrix M can be obtained. Since the matrix we actually have is M, we only need S and V^T to solve for U. Thinking a bit more deeply, there is in fact no need to solve for S, because S only applies a stretching transformation to U, and even after stretching the data can still be well separated. The problem therefore becomes solving for V^T in order to obtain US, that is, the value of U after the stretching transformation.

Solving for V

U and V are orthonormal (orthogonal matrices), so $U^TU = E$ and $V^T = V^{-1}$.


$$
\begin{aligned}
M^TM &= (USV^T)^T(USV^T) \\
&= (VSU^T)(USV^T) \\
&= VSU^TUSV^T \\
&= VSSV^T \\
M^TMV &= VSS
\end{aligned}
$$

Looking at the result of the derivation, SS (that is, S squared) contains the eigenvalues of M^TM and the columns of V are its eigenvectors, so to solve for V we only need the eigenvectors of M^TM.
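This claim can be checked numerically. A minimal numpy sketch, using a random matrix, showing that the eigenvalues of M^TM are the squared singular values and the eigenvectors match the columns of V up to sign:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))

# V from the singular value decomposition.
_, S, Vt = np.linalg.svd(M, full_matrices=False)

# Eigen-decomposition of M^T M (symmetric, so eigh applies).
eigvals, eigvecs = np.linalg.eigh(M.T @ M)
order = np.argsort(eigvals)[::-1]          # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Eigenvalues of M^T M are the squared singular values (SS),
# and the eigenvectors match the columns of V up to sign.
print(np.allclose(eigvals, S**2))
print(np.allclose(np.abs(eigvecs), np.abs(Vt.T)))
```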

Mapping the matrix to lower-dimensional coordinates

$$X = MW$$, where W consists of the first k column vectors of V.

This begs the question: why does taking the first k columns reduce the dimensionality?

Assume the shape of the matrix M is m × n, the shape of U is m × m, the shape of S is m × s, the shape of V^T is s × n, and the shape of V is n × s.

An intuitive view of why the dimensionality is reduced

The original matrix M has shape m × n, but after the transformation it becomes m × k (k <= n). The number of features is clearly reduced, so the effect of dimensionality reduction is achieved.

Why the dimensionality is reduced, explained

Taking the first k columns of V is equivalent to changing the shape of V to n × k, so the shape of V^T becomes k × n and that of S becomes m × k (k <= s). Since S is a stretching matrix, reducing its columns is equivalent to zeroing out part of the original S matrix. When U and S are multiplied, part of the column vectors of U are zeroed out as well, which is equivalent to removing part of the features, so the effect of dimensionality reduction is achieved.
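Putting the pieces together, here is a minimal numpy sketch of the projection X = MW, where W is the first k columns of V; the 100 × 5 data matrix and k = 2 are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(100, 5))              # 100 samples, 5 features

U, S, Vt = np.linalg.svd(M, full_matrices=False)

k = 2                                      # keep the first k directions
W = Vt.T[:, :k]                            # first k columns of V, shape (5, 2)

X = M @ W                                  # projected data, shape (100, 2)
print(M.shape, "->", X.shape)              # (100, 5) -> (100, 2)
```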

Covariance method

The eigenvectors of the covariance matrix are the directions of the principal components in principal component analysis (PCA). In fact, this is similar to the singular value decomposition approach: the goal is to find a rotation matrix that rotates the original data.

The principle of the covariance method

The formula


$$cov(x, y) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$
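A quick numerical check of this formula against numpy's built-in `np.cov` (the sample values below are arbitrary):

```python
import numpy as np

x = np.array([2.1, 2.5, 3.6, 4.0])
y = np.array([8.0, 10.0, 12.0, 14.0])

# cov(x, y) = sum((x_i - x_bar)(y_i - y_bar)) / (n - 1)
n = len(x)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the full 2x2 covariance matrix; the off-diagonal entry
# is cov(x, y), so the two values should agree.
print(cov_xy, np.cov(x, y)[0, 1])
```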

Principal component direction

$$C = \begin{bmatrix} cov(x, x) & cov(x, y) \\ cov(x, y) & cov(y, y) \end{bmatrix} = \begin{bmatrix} \frac{\sum_{i=1}^n x_i^2}{n-1} & \frac{\sum_{i=1}^n x_i y_i}{n-1} \\ \frac{\sum_{i=1}^n y_i x_i}{n-1} & \frac{\sum_{i=1}^n y_i^2}{n-1} \end{bmatrix}$$

(The second equality assumes the data has been centered, i.e. $\bar{x} = \bar{y} = 0$.)

If you look carefully, you can find:


$$C = \frac{1}{n-1}\begin{bmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \end{bmatrix}\begin{bmatrix} x_1 & y_1 \\ x_2 & y_2 \\ \vdots & \vdots \\ x_n & y_n \end{bmatrix} = \frac{1}{n-1}M^TM$$
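A minimal numpy sketch of this identity, using a random matrix whose columns have been centered (as the identity assumes), checked against `np.cov`:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(50, 2))          # 50 samples of (x, y), one row per sample
M = M - M.mean(axis=0)                # the identity assumes centered columns

C = (M.T @ M) / (len(M) - 1)          # C = M^T M / (n - 1)

# np.cov treats rows as variables by default, so use rowvar=False here.
print(np.allclose(C, np.cov(M, rowvar=False)))   # True
```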

Correlation between covariance and singular value decomposition

$$C = \frac{1}{n-1}M^TM = \frac{1}{n-1}VSSV^T, \qquad CV = \frac{1}{n-1}VSS$$

You can see that the covariance method and singular value decomposition give the same rotation of the matrix: both yield the same directions V, and the eigenvalues of C are simply the squared singular values SS scaled by 1/(n-1).
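Finally, a minimal numpy sketch verifying this connection on random data: the eigenvectors of the covariance matrix match V from the SVD (up to sign), and the eigenvalues equal S²/(n-1).

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(50, 3))
M = M - M.mean(axis=0)                     # center the data first

# Route 1: eigenvectors of the covariance matrix.
C = (M.T @ M) / (len(M) - 1)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: V from the singular value decomposition of M.
_, S, Vt = np.linalg.svd(M, full_matrices=False)

# Same directions (up to sign), and eigenvalues = S^2 / (n - 1).
print(np.allclose(np.abs(eigvecs), np.abs(Vt.T)))
print(np.allclose(eigvals, S**2 / (len(M) - 1)))
```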