The mathematical principles and visualization of PCA

Structure of this paper:

  • What is PCA
  • Mathematical principles
  • Visualization

1. What is PCA

Principal Component Analysis (PCA) is a method of dimensionality reduction in machine learning.

For example, suppose we have transaction data with the following fields: (date, number of views, number of visitors, number of transactions, amount of transactions). We know from experience that there is a strong correlation between "number of views" and "number of visitors", and between "number of transactions" and "amount of transactions". In this case we can drop one dimension from each correlated pair and still retain most of the original information.
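As a quick illustration of this kind of redundancy, here is a minimal sketch (with made-up transaction numbers, not data from the article) that checks pairwise correlations with NumPy:

```python
import numpy as np

# Hypothetical transaction records: each row is one day.
# Columns: views, visitors, transactions, transaction_amount (made-up numbers).
data = np.array([
    [500, 320, 40, 4100],
    [620, 410, 52, 5300],
    [480, 300, 38, 3900],
    [700, 460, 60, 6200],
    [550, 350, 45, 4600],
], dtype=float)

# Pearson correlation between every pair of fields.
corr = np.corrcoef(data, rowvar=False)
print(np.round(corr, 2))
# Pairs with correlation close to 1 are candidates for being merged
# into a single dimension during dimensionality reduction.
```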

But dimensionality reduction always loses some information.

For example, if the following five data points are projected onto the X-axis, the two points on the left overlap, and so do the two points in the middle. After projection only three distinct points remain, which is a serious loss of information:

[Figure: five 2D points projected onto the X-axis; the two left points and the two middle points each collapse onto a single projected point]

Therefore, we want to choose the projection direction so that the variance of the mapped data is as large as possible: the more spread out the projected points are, the more information is retained.

PCA seeks to preserve as much of the data's information as possible after dimensionality reduction. It measures the importance of a projection direction by the variance of the data along that direction, making it a linear dimensionality reduction method that loses the least of the original data's information in this sense.
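A minimal sketch of this idea, using five arbitrary example points (chosen so that two pairs collapse when projected onto the X-axis, as in the figure above), comparing the projection variance in two directions:

```python
import numpy as np

# Five example 2D points (arbitrary; two pairs share an x-coordinate).
points = np.array([[-2.0, 1.0], [-2.0, -1.0],
                   [0.0, 2.0], [0.0, -2.0],
                   [3.0, 0.5]])
points = points - points.mean(axis=0)  # zero-center

def projection_variance(direction):
    """Variance of the points projected onto the given direction (normalized to unit length)."""
    d = direction / np.linalg.norm(direction)
    return (points @ d).var()

print(projection_variance(np.array([1.0, 0.0])))  # onto the X-axis (projections overlap)
print(projection_variance(np.array([1.0, 1.0])))  # onto a diagonal direction
# PCA picks the direction with the largest projection variance,
# so the projected points stay as spread out as possible.
```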

PCA algorithm steps, for m samples of n-dimensional data:

1) Arrange the original data by columns into a matrix X with n rows and m columns;
2) Zero-center each row of X (each row represents one attribute/field) by subtracting that row's mean;
3) Compute the covariance matrix C = (1/m)XX𝖳;
4) Compute the eigenvalues of C and their corresponding eigenvectors;
5) Stack the eigenvectors as rows, ordered from top to bottom by decreasing eigenvalue, and take the first k rows to form the matrix P;
6) Y = PX is the data reduced to k dimensions.
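A minimal NumPy sketch of these six steps (the function and variable names are my own, not from the article):

```python
import numpy as np

def pca(X, k):
    """PCA following the six steps above.

    X : array of shape (n, m) -- n attributes (rows), m samples (columns).
    k : target dimension.
    Returns Y of shape (k, m), the data reduced to k dimensions.
    """
    n, m = X.shape
    # Step 2: zero-center each row (each attribute/field).
    Xc = X - X.mean(axis=1, keepdims=True)
    # Step 3: covariance matrix C = (1/m) X X^T.
    C = (Xc @ Xc.T) / m
    # Step 4: eigenvalues and eigenvectors (eigh suits symmetric matrices).
    eigvals, eigvecs = np.linalg.eigh(C)
    # Step 5: sort eigenvectors by decreasing eigenvalue, take the first k as rows of P.
    order = np.argsort(eigvals)[::-1]
    P = eigvecs[:, order[:k]].T
    # Step 6: Y = P X is the reduced data.
    return P @ Xc

# Example: 3-dimensional data with 5 samples, reduced to 2 dimensions.
X = np.array([[2.5, 0.5, 2.2, 1.9, 3.1],
              [2.4, 0.7, 2.9, 2.2, 3.0],
              [1.0, 0.3, 1.3, 1.1, 1.6]])
print(pca(X, 2).shape)  # (2, 5)
```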


2. Mathematical principles

Finding this low-dimensional space means determining a basis for it.

As shown below, the blue orthogonal arrows are the basis vectors for the new coordinate system:

[Figure: data points with the new orthogonal basis vectors shown as blue arrows]

We want the projections of the original data onto the new coordinate system to be as spread out as possible.

The coordinates of a point in the new coordinate system are its inner products with the two basis vectors.

This is because the inner product of A and B is A · B = |A||B|cos(θ). When B is a basis vector with |B| = 1, this reduces to A · B = |A|cos(θ); that is, the inner product equals the length of A's projection onto B, which is exactly A's coordinate along that basis direction.
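A quick numeric check of this fact, with an arbitrary vector A and a unit basis vector B (the example values are my own):

```python
import numpy as np

A = np.array([3.0, 2.0])                  # an arbitrary point/vector
B = np.array([1.0, 1.0]) / np.sqrt(2)     # a unit basis vector, |B| = 1

inner = A @ B                              # A . B
cos_theta = inner / (np.linalg.norm(A) * np.linalg.norm(B))
proj_len = np.linalg.norm(A) * cos_theta   # |A| cos(theta)

print(inner, proj_len)  # identical: the inner product with a unit basis vector
                        # is the coordinate of A along that basis direction
```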

This spread of the projected points can be expressed mathematically as variance.

The variance of a field is the mean of the squared differences between each element and the field's mean:

$$\mathrm{Var}(a) = \frac{1}{m}\sum_{i=1}^{m}(a_i - \mu)^2$$

In step 2 of PCA each field has already been zero-centered, so its mean is 0 and the variance simplifies to the sum of squared elements divided by the number of elements:

$$\mathrm{Var}(a) = \frac{1}{m}\sum_{i=1}^{m}a_i^2$$

So we want to maximize the variance of the mapped data.

Moreover, when choosing the second projection direction, we do not want it to be linearly correlated with the first; otherwise the two directions would carry redundant information.

Mathematically, the covariance of two fields expresses their correlation. Since each field's mean has already been set to 0, we have:

$$\mathrm{Cov}(a, b) = \frac{1}{m}\sum_{i=1}^{m}a_i b_i$$

We therefore want the covariance to be 0, so that the two fields are uncorrelated; this means the second basis vector can only be chosen among the directions orthogonal to the first.
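A quick numeric check of the zero-mean variance and covariance formulas above (the field values are arbitrary), compared against NumPy's built-in routines:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
a -= a.mean()   # zero-center each field, as in step 2
b -= b.mean()
m = len(a)

var_a = (a ** 2).sum() / m        # Var(a) = (1/m) sum a_i^2
cov_ab = (a * b).sum() / m        # Cov(a, b) = (1/m) sum a_i b_i

print(var_a, np.var(a))                        # matches NumPy's population variance
print(cov_ab, np.cov(a, b, bias=True)[0, 1])   # matches NumPy's covariance (bias=True divides by m)
```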

This gives the optimization goal of the dimensionality reduction problem: to reduce a set of n-dimensional vectors to k dimensions, choose k orthonormal basis vectors such that, after the original data is transformed onto this basis, the covariance between every pair of fields is 0 and the variance of each field is as large as possible (i.e., under the orthogonality constraint, take the k directions with the largest variances).

Suppose we have data X with two fields a and b:

$$X = \begin{pmatrix} a_1 & a_2 & \cdots & a_m \\ b_1 & b_2 & \cdots & b_m \end{pmatrix}$$

The covariance matrix of X is calculated as follows:

$$C = \frac{1}{m}XX^{\mathsf T} = \begin{pmatrix} \frac{1}{m}\sum_{i=1}^{m}a_i^2 & \frac{1}{m}\sum_{i=1}^{m}a_i b_i \\ \frac{1}{m}\sum_{i=1}^{m}a_i b_i & \frac{1}{m}\sum_{i=1}^{m}b_i^2 \end{pmatrix}$$

The two diagonal elements of this matrix are the variances of the two fields, and the off-diagonal elements are the covariance of a and b.

Thus the optimization goal is equivalent to making the covariance matrix of the transformed data diagonal, with the diagonal elements arranged from top to bottom in decreasing order.

Let P be the matrix whose rows are the new basis vectors; then Y = PX is the transformation of X onto this basis. Let D be the covariance matrix of Y; then D and C are related by:

$$D = \frac{1}{m}YY^{\mathsf T} = \frac{1}{m}(PX)(PX)^{\mathsf T} = P\left(\frac{1}{m}XX^{\mathsf T}\right)P^{\mathsf T} = PCP^{\mathsf T}$$

Thus the optimization objective becomes: find a matrix P such that PCP𝖳 is a diagonal matrix, i.e., diagonalize C.

How to find P:

C is a real symmetric matrix; the relevant result is the diagonalization of real symmetric matrices, found in any linear algebra textbook.

An n × n real symmetric matrix always has n orthonormal (unit-length, mutually orthogonal) eigenvectors e1, e2, …, en.

Arranging them as the columns of a matrix E = (e1, e2, …, en) diagonalizes C:

$$E^{\mathsf T}CE = \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$$

The diagonal elements are the eigenvalues corresponding to each eigenvector.

So P = E𝖳; that is, each row of P is an eigenvector of C. Ordering the rows by decreasing eigenvalue and keeping the first k rows gives the projection matrix.

This yields steps 3 to 6 of the PCA algorithm above.
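A sketch of this diagonalization in NumPy (the zero-centered example data is arbitrary; `np.linalg.eigh` is used because C is symmetric):

```python
import numpy as np

# Arbitrary zero-centered data: 2 fields, 5 samples.
X = np.array([[-1.0, -1.0, 0.0, 2.0, 0.0],
              [-2.0,  0.0, 0.0, 1.0, 1.0]])
m = X.shape[1]

C = (X @ X.T) / m                      # covariance matrix
eigvals, E = np.linalg.eigh(C)         # columns of E are orthonormal eigenvectors
order = np.argsort(eigvals)[::-1]      # sort by decreasing eigenvalue
E = E[:, order]

P = E.T                                # P = E^T: rows are eigenvectors of C
D = P @ C @ P.T                        # diagonal, with the eigenvalues on the diagonal
print(np.round(D, 6))

Y = P @ X                              # full basis change; keep only the first k rows of P to reduce dimension
```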


3. Visualization

For example, in the following figure we want to reduce 2 dimensions to 1. The hollow circles are the data points, the red arrow is the direction of the first principal component, and the blue arrow is the direction of the second principal component:

[Figure: 2D data (hollow circles) with the first principal component as a red arrow and the second principal component as a blue arrow]

Using the two principal components as the new coordinate axes, the data looks as follows:

[Figure: the same data plotted in the PC1–PC2 coordinate system]

When we keep only the first principal component, as shown below, the blue dots are the data after dimensionality reduction. The variance along PC1 is large while the variance along PC2 is small, so PC2 can be ignored by comparison, which achieves the dimensionality reduction.

[Figure: data projected onto PC1 (blue dots); the variance along PC1 is much larger than along PC2]

Comparing the reduced data with the original data, we can see that the original data's information along PC2 has been discarded:

[Figure: the reduced data overlaid on the original data; the PC2 component has been discarded]
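A sketch of how figures like these could be reproduced with NumPy and Matplotlib, using synthetic correlated data rather than the article's dataset:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic correlated 2D data with one dominant direction.
data = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.5], [1.5, 1.0]], size=200)
data -= data.mean(axis=0)

# Principal components from the covariance matrix.
C = np.cov(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
pc1, pc2 = eigvecs[:, order[0]], eigvecs[:, order[1]]

# Reduce to 1D by keeping only the PC1 coordinate, then map back to 2D for plotting.
coords_pc1 = data @ pc1
reduced = np.outer(coords_pc1, pc1)

fig, ax = plt.subplots()
ax.scatter(data[:, 0], data[:, 1], facecolors='none', edgecolors='gray', label='original data')
ax.scatter(reduced[:, 0], reduced[:, 1], s=10, color='blue', label='projected onto PC1')
ax.quiver(0, 0, pc1[0], pc1[1], color='red', scale=4)    # PC1 direction
ax.quiver(0, 0, pc2[0], pc2[1], color='blue', scale=4)   # PC2 direction
ax.set_aspect('equal')
ax.legend()
plt.show()
```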


Materials: machine learning — stats.stackexchange.com/questions/7…, Blog.codinglabs.org/articles/pc…