
Author: Wang Yixiong

Introduction: This article explains in detail PCA, a data cleaning and feature extraction method commonly used in machine learning, and analyzes it at three levels: theory, data, and code.


Machine learning is by now a familiar term. Although the concept was proposed a long time ago, it developed slowly for years because the available technology lagged behind. In recent years, however, with the rapid improvement of computer hardware, the field has come back into view, and its pace of development has made many people sit up and take notice. In particular, AlphaGo's impressive performance at the game of Go over the past two years has given people in this field enormous room to dream.

Machine learning (ML) is a multidisciplinary field, drawing on probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other subjects. It studies how computers can simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance.

Machine learning is a combination of techniques, and in my opinion the core of that combination is how data is analyzed and processed. In machine learning, data analysis usually means sifting through a large amount of data to extract meaningful features and infer a tentative conclusion. The usual steps to arrive at this (as yet unproven) conclusion are:

1. Preprocessing: turn the raw data into meaningful features. The main purpose of this step is dimensionality reduction.

2. Modeling: build a model (usually by curve fitting) and construct a plausible decision boundary for the classifier.

3. Classification: classify the data according to the model and predict conclusions for new data.

This article focuses on data preprocessing (dimensionality reduction), and the method adopted here is PCA.

Personal theoretical analysis of PCA:

Suppose a student information management system needs to store a gender field. We could create two columns in the database, M and F, with 1 and 0 representing yes and no respectively. For a male student, M is 1 and F is 0; for a female student, M is 0 and F is 1. We find that for any record, if M is 1 then F must be 0, and vice versa. So in practice we lose no information by dropping either the M column or the F column, because we can always deduce it from the remaining one. In this case the correlation between the M and F columns is as high as it can be: 100%.

Take another example. Xiao Ming runs a shop, and every day he records its page views V and sales volume D. You can see that when V is high, D is usually high as well, and when D is low, V is usually low too. You can guess that V and D are related, though not perfectly correlated. Now, if Xiao Ming wants to measure how good a day was based on V and D, he can compute the correlation between V and D from historical data. If the correlation is greater than, say, 80%, he can use just one of V and D to measure the day. This reduces the dimension.
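As a rough illustration of this check (the numbers below are hypothetical, and corrcoef is simply MATLAB's built-in correlation function), the correlation between V and D can be computed like this:

    V = [120 80 200 150 90 170];    % hypothetical daily page views
    D = [ 30 20  55  40 25  45];    % hypothetical daily sales volume
    R = corrcoef(V, D);             % 2 x 2 matrix of correlation coefficients
    fprintf('correlation between V and D: %.2f\n', R(1, 2));
    % If R(1, 2) is high enough (say above 0.8), keeping only one of the
    % two columns loses little information.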

Of course, dimensionality reduction is not limited to picking one of the existing dimensions, such as choosing [V] as the feature for the two-dimensional data [V, D]. Some combination of the two, for example V + D, may correlate with [V, D] even more strongly.

And that is essentially the idea of PCA. To put it simply: suppose we have data with dimensions x1, x2, x3, ..., xn and we want to reduce it to k dimensions. From the historical n-dimensional data we can compute k new dimensions, each related to x1, ..., xn, such that the correlation between the k-dimensional data and the historical data is maximized.

Mathematical analysis

Suppose we have a set of two-dimensional data, say five points scattered in the plane.

If we have to represent this data in one dimension while keeping as much of the original information as possible, how should we choose?

The problem is actually to choose a direction in the two-dimensional plane, project all the data onto the line in that direction, and use the projection values to represent the original records. This is a genuine two-dimensions-to-one-dimension problem.

So how do we choose this direction in order to retain as much of the original information as possible? An intuitive answer is that we want the projected values to be as spread out as possible: the wider the spread of the projections, the easier it is to build a classifier later on.

Looking at the example above, if we project onto the x-axis, the two points on the left overlap, and so do the two points in the middle; of those four distinct two-dimensional points, only two distinct values remain after projection, which is a serious loss of information. Similarly, if we project onto the y-axis, the three points in the middle all overlap, which is even worse. So neither the x-axis nor the y-axis is a good choice of projection direction. Intuitively, if we project onto a diagonal line running through the first and third quadrants, the five points are still distinguishable after projection.

We want the projected values to be as spread out as possible. What statistic measures this dispersion? Mathematically, the obvious choice is the variance.

And in general, to simplify the calculation, we first subtract the mean of each field from every point, so that every field has mean 0. This step is called mean-centering (homogenization). After centering, the variance of a field is simply the mean of the squares of its values.

Thus the problem above is formalized as: find a basis vector such that, when all the data are expressed as coordinates on this basis, the variance of those coordinates is maximized.
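Written out explicitly, in my own notation (which may differ from the formulas originally shown here): given m mean-centered records x_1, ..., x_m and a unit basis vector u, the problem is

    \max_{u,\ \lVert u\rVert = 1}\ \frac{1}{m}\sum_{i=1}^{m}\bigl(u^{\top}x_i\bigr)^{2}

The quantity being maximized is exactly the variance of the projected coordinates, because those coordinates already have mean 0.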

Let's step away from the concrete example, since the reasoning generalizes easily to arbitrary dimensions. To find the basis u that maximizes the variance of the projected points, there are two ways to proceed:

Method 1:

Suppose the projection of the data onto this basis is A = (a1, a2, ..., am), one value per record:

Since the data has been mean-centered, the variance V can obviously be expressed as:

V = (1/m) Σ ai² = (1/m) A·Aᵀ

And the projection is A = uᵀX, where the columns of X are the mean-centered original records and u is the unit basis vector;

Then the variance can be expressed as:

V = (1/m) uᵀ·X·Xᵀ·u

To maximize this variance under the constraint that u is a unit vector (uᵀ·u = 1), we can use the method of Lagrange multipliers (the constant factor 1/m is dropped, since it does not change where the maximum is attained).

L(u, λ) is:

L(u, λ) = uᵀ·X·Xᵀ·u + λ(1 − uᵀ·u)

Taking the derivative L′ with respect to u:

L′ = 2·X·Xᵀ·u − 2λ·u

Setting the derivative to 0:

X·Xᵀ·u = λ·u

This transforms the problem into finding the eigenvalues and eigenvectors of X·Xᵀ, and the problem is solved.

At the same time, we know that X·Xᵀ has several eigenvalues and eigenvectors; the eigenvector corresponding to the largest λ is called the principal component vector. If the data needs to be reduced from n dimensions to k dimensions, we only need the eigenvectors corresponding to the k largest eigenvalues.
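To spell out the step implied above (a standard fact about symmetric matrices, written in my notation): substituting the eigenvalue equation back into the variance shows that the attainable variance at a critical point is an eigenvalue divided by m, so the best direction is the eigenvector of the largest eigenvalue:

    V = \frac{1}{m}\,u^{\top}XX^{\top}u = \frac{1}{m}\,u^{\top}(\lambda u) = \frac{\lambda}{m},
    \qquad
    \max_{\lVert u\rVert=1} V = \frac{\lambda_{\max}\bigl(XX^{\top}\bigr)}{m}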

Method 2:

For the problem of reducing two dimensions to one, we simply find the direction that maximizes the variance. For higher dimensions, we first find one direction (basis vector) that maximizes the variance after projection. When we then look for the second direction (basis vector), in order to retain as much information as possible, we obviously do not want the second direction to carry the same information as the first. From the vector point of view, that means the projection of this second vector onto the first one has to be 0.

In terms of the projected data, this means that the coordinates a and b of the (mean-centered) records on the two directions should be uncorrelated, i.e. their covariance should be 0:

Cov(a, b) = (1/m) Σ ai·bi = 0

Now our goal is clear: to reduce a set of N-dimensional vectors to k dimensions (0 < k < N), we want to select k unit-norm orthogonal basis vectors such that, after the original data is transformed onto this basis, the covariance between every pair of fields is 0 and the variance of each field is as large as possible.

So suppose our original data matrix is A, with one field per row and one record per column.

If we compute A·Aᵀ, we get a symmetric matrix whose diagonal elements are (up to a constant factor of 1/m) the variances of the individual fields, and whose off-diagonal elements are the covariances between fields.

And we can see that if we could find a basis that makes this matrix zero everywhere except on the diagonal, that would be exactly the basis we need. So the problem becomes diagonalizing this matrix.
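For the two-dimensional case, writing the two mean-centered fields of A as a = (a_1, ..., a_m) and b = (b_1, ..., b_m), this matrix works out to the following (my sketch; the constant factor 1/m does not affect the eigenvectors):

    \frac{1}{m}AA^{\top} =
    \begin{pmatrix}
      \frac{1}{m}\sum_{i} a_i^{2}   & \frac{1}{m}\sum_{i} a_i b_i \\
      \frac{1}{m}\sum_{i} a_i b_i   & \frac{1}{m}\sum_{i} b_i^{2}
    \end{pmatrix}

The diagonal entries are the variances of the two fields, and the off-diagonal entries are their covariance, which is exactly what we want to drive to zero.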

Let me first state a piece of prior knowledge:

From linear algebra we know that eigenvectors corresponding to different eigenvalues of a real symmetric matrix must be orthogonal, and that for an n-by-n real symmetric matrix, n orthonormal (unit, mutually orthogonal) eigenvectors can always be found. Let these n eigenvectors be e1, e2, ..., en.

Arranged as columns, they form the matrix:

E = (e1  e2  ...  en)

A further conclusion that follows is that, for a real symmetric matrix C whose orthonormal eigenvector matrix is E, we must have:

Eᵀ·C·E = Λ

where Λ is a diagonal matrix whose diagonal elements are the eigenvalues of C.

With this prior knowledge in hand, assume the original data is A, the basis matrix is U (one basis vector per row), and the projected data is Y, so that Y = U·A. According to the discussion above, we want Y·Yᵀ to be a diagonal matrix. Then we have:

Y·Yᵀ = (U·A)·(U·A)ᵀ = U·(A·Aᵀ)·Uᵀ

So if Y·Yᵀ is to be a diagonal matrix, U only needs to be built from the (orthonormal) eigenvectors of A·Aᵀ, and the problem is finally converted to finding the eigenvectors of A·Aᵀ.
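As a quick numerical sanity check (my own sketch; the random 3-by-50 matrix is hypothetical data, and eig is MATLAB's built-in eigendecomposition), we can verify that the eigenvector basis of A·Aᵀ does make Y·Yᵀ diagonal:

    A = randn(3, 50);                     % hypothetical 3-dimensional data, 50 records
    A = A - repmat(mean(A, 2), 1, 50);    % mean-center every field (row)
    [V, ~] = eig(A * A');                 % orthonormal eigenvectors of A*A'
    U = V';                               % one basis vector per row
    Y = U * A;                            % projected data
    D = Y * Y'                            % off-diagonal entries are numerically ~0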

Code implementation:

Having described two ways of thinking about the PCA computation, let's take a brief look at a code implementation. Since MATLAB has a built-in eigenvector function, MATLAB is used for the simulation.

Let's try it with some test data:
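A minimal MATLAB sketch of the whole procedure is given below. The test matrix A, the variable names other than newA, and the retention ratio are my assumptions for illustration, not necessarily the original test setup:

    A = [2.5 0.5 2.2 1.9 3.1 2.3 2.0 1.0 1.5 1.1;
         2.4 0.7 2.9 2.2 3.0 2.7 1.6 1.1 1.6 0.9;
         1.0 0.9 1.1 1.0 1.2 1.1 0.9 1.0 1.1 1.0];   % 3 dimensions x 10 records
    ratio = 0.9;                          % fraction of the variance to retain

    [n, m] = size(A);
    mu = mean(A, 2);                      % mean of each dimension
    B  = A - repmat(mu, 1, m);            % mean-centering (homogenization)

    C = B * B' / m;                       % the matrix A*A'/m discussed above
    [V, D] = eig(C);                      % eigenvectors and eigenvalues
    [lambda, idx] = sort(diag(D), 'descend');
    V = V(:, idx);                        % eigenvectors, largest eigenvalue first

    k = find(cumsum(lambda) / sum(lambda) >= ratio, 1);   % components to keep
    U = V(:, 1:k)';                       % k x n basis, one basis vector per row

    Y    = U * B;                         % reduced data, k x m
    newA = U' * Y + repmat(mu, 1, m);     % reconstruction back to n dimensions

Raising ratio keeps more components (a larger k), so the reconstruction newA gets closer to A, which is the behavior described below.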

When we kept only 0.5 of the components, newA was reduced from 3 dimensions to 1, and the reconstruction accuracy was noticeably worse.

When we kept 0.9 of the components, newA was reduced from 3 dimensions to 2, and the reconstruction was somewhat better.

When we kept 0.97 of the components, the dimension could not be reduced at all; in that case the data can be restored 100%.

To sum up:

When we do data analysis for machine learning, the dimensionality of the data set may be very high, so we need to reduce it. This article introduced PCA, a classic dimensionality reduction method, from several angles, and also walked through the dimensionality reduction process from the code's perspective. The actual operation may be fairly simple, but I personally feel there is still a lot worth learning in the underlying principles.


This article was published by the Tencent Cloud Technology Community with the author's authorization.