Principal Component Analysis (PCA) is an unsupervised machine learning algorithm, mainly used for dimensionality reduction of data.
Basic principle of PCA
Take a two-dimensional plane with two features as an example, as shown in the figure:
The horizontal axis represents feature 1, the vertical axis represents feature 2, and four points represent two-dimensional feature samples. If you want to reduce the dimension of the sample to one dimension, you can map these points to the horizontal, vertical, or other axes as follows:
The spacing between the samples will vary depending on which axis you map them to, and what we want to find is the axis that maximizes the spacing between the samples. Suppose the direction of this axis is $w = (w_1, w_2)$.
The variance can be used to measure the spacing between samples: $Var(x) = \frac{1}{m}\sum_{i=1}^{m}(x_i - \bar{x})^2$. The larger the variance, the more spread out the samples are.
The samples are then demeaned, i.e. shifted so that their mean is 0, which turns the original samples into new samples whose variance is simply $Var(x) = \frac{1}{m}\sum_{i=1}^{m}x_i^2$. Mapping the demeaned samples onto the axis $w$ gives new samples $X_{project}^{(i)}$, whose variance is $Var(X_{project}) = \frac{1}{m}\sum_{i=1}^{m}\left(X_{project}^{(i)} - \bar{X}_{project}\right)^2$.
Because $X^{(i)}$ is a vector, the variance is actually expressed as $Var(X_{project}) = \frac{1}{m}\sum_{i=1}^{m}\left\|X_{project}^{(i)}\right\|^2$ (the mean is 0 after demeaning).
Since $w$ is a unit vector, the projection works out to $\left\|X_{project}^{(i)}\right\| = X^{(i)} \cdot w$, so the desired objective function is obtained:
$$f(w) = Var(X_{project}) = \frac{1}{m}\sum_{i=1}^{m}\left(X^{(i)} \cdot w\right)^2$$
That is, find the $w$ that maximizes $\frac{1}{m}\sum_{i=1}^{m}\left(X^{(i)} \cdot w\right)^2$; for $n$-dimensional features, maximize $\frac{1}{m}\sum_{i=1}^{m}\left(X_1^{(i)} w_1 + X_2^{(i)} w_2 + \dots + X_n^{(i)} w_n\right)^2$.
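As a concrete illustration of the objective, the following minimal NumPy sketch (hypothetical four-point data, not the dataset used later) compares the variance of the same demeaned samples projected onto two candidate unit directions; the direction with the larger projected variance is the better candidate for the first principal component:
import numpy as np

X = np.array([[1., 1.], [2., 2.5], [3., 2.8], [4., 4.2]])  # four hypothetical two-feature samples
X = X - np.mean(X, axis=0)                                 # demean

for w in (np.array([1., 0.]), np.array([0.8, 0.6])):       # two candidate unit directions
    print(w, np.sum(X.dot(w) ** 2) / len(X))               # variance of the projections onto w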
Solving the PCA problem with gradient ascent
In gradient ascent, $+\eta\nabla f$ points in the direction in which $f$ increases.
To maximize the PCA objective function $f(w) = \frac{1}{m}\sum_{i=1}^{m}\left(X^{(i)} \cdot w\right)^2$, gradient ascent can be used.
The gradient of $f(w)$ is:
$$\nabla f = \frac{2}{m}\left(\sum_{i=1}^{m}\left(X^{(i)} \cdot w\right)X_1^{(i)},\ \sum_{i=1}^{m}\left(X^{(i)} \cdot w\right)X_2^{(i)},\ \dots,\ \sum_{i=1}^{m}\left(X^{(i)} \cdot w\right)X_n^{(i)}\right)^T$$
Note: the intermediate derivation is omitted here.
After vectorization, the gradient formula becomes $\nabla f = \frac{2}{m}X^T(Xw)$.
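A brief sketch of the omitted step, obtained by applying the chain rule to each component $w_j$:
$$\frac{\partial f}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m} 2\left(X^{(i)} \cdot w\right)\frac{\partial\left(X^{(i)} \cdot w\right)}{\partial w_j} = \frac{2}{m}\sum_{i=1}^{m}\left(X^{(i)} \cdot w\right) X_j^{(i)}$$
Stacking these $n$ partial derivatives into a vector and recognizing the matrix product gives the vectorized form above.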
Strictly speaking, the gradient ascent used here is batch gradient ascent (there are also stochastic gradient ascent and mini-batch gradient ascent, which are not covered in this article).
Obtaining principal components by gradient ascent
First principal component
Start by defining a data set with two features and 100 samples:
import numpy as np

X = np.empty((100, 2))
X[:, 0] = np.random.uniform(0., 100., size=100)
X[:, 1] = 0.75 * X[:, 0] + 3. + np.random.normal(0, 10., size=100)
Visualization:
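The scatter plot itself is not reproduced here; a minimal plotting sketch (assuming matplotlib is available) would be:
import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1])  # feature 1 on the horizontal axis, feature 2 on the vertical axis
plt.show()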
The demean() method shifts the samples so that the mean of each feature is 0:
def demean(X):
    return X - np.mean(X, axis=0)

X_demean = demean(X)
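As a quick sanity check (not in the original code), the feature means should now be numerically close to 0:
np.mean(X_demean[:, 0]), np.mean(X_demean[:, 1])  # both approximately 0, up to floating-point error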
Visualization after demeaning (the mean is now 0):
The f() method computes the value of the objective function $f(w)$:
def f(w, X):
    return np.sum(X.dot(w) ** 2) / len(X)
The df() method computes the gradient of $f(w)$ according to the gradient formula $\nabla f = \frac{2}{m}X^T(Xw)$:
def df(w, X):
    return X.T.dot(X.dot(w)) * 2 / len(X)
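Optionally, the analytic gradient can be cross-checked against a numerical approximation by finite differences; the df_debug() helper below is only a verification sketch, not part of the PCA algorithm itself:
def df_debug(w, X, epsilon=0.0001):
    # approximate each partial derivative of f with a central difference
    res = np.empty(len(w))
    for i in range(len(w)):
        w_1 = w.copy()
        w_1[i] += epsilon
        w_2 = w.copy()
        w_2[i] -= epsilon
        res[i] = (f(w_1, X) - f(w_2, X)) / (2 * epsilon)
    return res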
The gradient_ascent() method implements the gradient ascent process. Within at most n_iters iterations, each step produces a new w; if the difference between the objective values f(w) for the new w and the old w is smaller than the precision epsilon, the desired w has been found:
def gradient_ascent(X, initial_w, eta, n_iters=1e4, epsilon=1e-8):
    w = direction(initial_w)
    cur_iter = 0
    while cur_iter < n_iters:
        gradient = df(w, X)
        last_w = w
        w = w + eta * gradient
        w = direction(w)  # keep w a unit vector at every step
        if abs(f(w, X) - f(last_w, X)) < epsilon:
            break
        cur_iter += 1
    return w
The direction() method called above is used to convert $w$ into a unit vector:
def direction(w):
    return w / np.linalg.norm(w)
Then, to use gradient ascent, first set an initial $w$ vector and the learning rate $\eta$:
initial_w = np.random.random(X.shape[1])
eta = 0.001
Call the gradient_ascent() method directly:
w = gradient_ascent(X_demean, initial_w, eta)
Visualize the resulting $w$ together with the samples ($w$ drawn as a straight line):
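As a sanity check (not part of the original code), the direction found by gradient ascent can be compared with the top eigenvector of the covariance matrix, which is the classical closed-form solution for the first principal component; the two should agree up to sign:
eig_vals, eig_vecs = np.linalg.eigh(np.cov(X_demean.T))  # eigendecomposition of the 2x2 covariance matrix
w_eig = eig_vecs[:, np.argmax(eig_vals)]                 # eigenvector with the largest eigenvalue
print(w, w_eig)                                          # should match up to sign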
Take the first n principal components
So now that we have the first principal component, how do we figure out the next principal component? Take two-dimensional features as an example, as shown in the figure:
A sample $X^{(i)}$ has the component $\left(X^{(i)} \cdot w\right)w$ along the first principal component $w$; its component on the next principal component is what remains, $X'^{(i)} = X^{(i)} - \left(X^{(i)} \cdot w\right)w$. So we just remove the component of the data along the first principal component and then compute the first principal component of the new data.
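Concretely, the removal can be done sample by sample; the short sketch below (using the X_demean and w computed above) shows that this loop agrees with the vectorized expression used in the next method:
X2 = np.empty(X_demean.shape)
for i in range(len(X_demean)):
    X2[i] = X_demean[i] - X_demean[i].dot(w) * w           # remove the component along w, one sample at a time

X2_vectorized = X_demean - X_demean.dot(w).reshape(-1, 1) * w
print(np.allclose(X2, X2_vectorized))                      # True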
The first_n_component() method is used to find the first n principal components of X. In the for loop, a random initial_w is set each time a principal component is computed; after a principal component w is obtained, its component is removed from the data with X_pca = X_pca - X_pca.dot(w).reshape(-1, 1) * w:
def first_n_component(n, X, eta=0.001, n_iters=1e4, epsilon=1e-8):
    X_pca = X.copy()
    X_pca = demean(X_pca)
    res = []
    for i in range(n):
        initial_w = np.random.random(X_pca.shape[1])
        w = gradient_ascent(X_pca, initial_w, eta)
        res.append(w)
        X_pca = X_pca - X_pca.dot(w).reshape(-1, 1) * w
    return res
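A usage sketch with the two-feature data from above; the two principal components found this way should be (numerically) orthogonal:
w1, w2 = first_n_component(2, X)
print(w1.dot(w2))  # approximately 0, i.e. the two directions are perpendicular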
Dimension reduction mapping
Once you have multiple principal components, you can map high-dimensional data to low-dimensional data through these principal components.
For high-dimensional data $X$ with m samples and n features (an m×n matrix), principal component analysis yields the first k (k < n) principal components $W_k$ (a k×n matrix).
The new data after reducing from n dimensions to k dimensions is $X_k = X \cdot W_k^T$ (an m×k matrix).
Information can be lost in this dimensionality reduction process. If the original data contains some useless information, dimensionality reduction can also serve as noise reduction.
You can also map the reduced data back to the original dimension: $X_m = X_k \cdot W_k$ (an m×n matrix).
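A minimal NumPy sketch of the two mappings, using the first_n_component() defined above, just to make the shapes explicit:
W_k = np.array(first_n_component(1, X))  # k = 1 principal component; shape (1, 2) for this data
X_k = X.dot(W_k.T)                       # reduced data X_k = X · W_k^T, shape (100, 1)
X_m = X_k.dot(W_k)                       # restored data X_m = X_k · W_k, shape (100, 2); the lost information is not recovered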
Implement PCA
Define a PCA class. In the constructor, n_components represents the number of principal components, i.e. the dimension after reduction, and components_ stores the principal components; the fit() method finds the principal components in the same way as first_n_component() above; the transform() method maps X onto the principal components to obtain X_k, i.e. dimensionality reduction; the inverse_transform() method maps X_k back to the original feature space to obtain X_m.
import numpy as np


class PCA:

    def __init__(self, n_components):
        self.n_components = n_components
        self.components_ = None

    def fit(self, X, eta=0.001, n_iters=1e4):
        def demean(X):
            return X - np.mean(X, axis=0)

        def f(w, X):
            return np.sum(X.dot(w) ** 2) / len(X)

        def df(w, X):
            return X.T.dot(X.dot(w)) * 2 / len(X)

        def direction(w):
            return w / np.linalg.norm(w)

        def first_component(X, initial_w, eta, n_iters, epsilon=1e-8):
            w = direction(initial_w)
            cur_iter = 0
            while cur_iter < n_iters:
                gradient = df(w, X)
                last_w = w
                w = w + eta * gradient
                w = direction(w)
                if abs(f(w, X) - f(last_w, X)) < epsilon:
                    break
                cur_iter += 1
            return w

        X_pca = demean(X)
        self.components_ = np.empty(shape=(self.n_components, X.shape[1]))
        for i in range(self.n_components):
            initial_w = np.random.random(X_pca.shape[1])
            w = first_component(X_pca, initial_w, eta, n_iters)
            self.components_[i, :] = w
            X_pca = X_pca - X_pca.dot(w).reshape(-1, 1) * w
        return self

    def transform(self, X):
        return X.dot(self.components_.T)

    def inverse_transform(self, X):
        return X.dot(self.components_)
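A usage sketch with the two-feature data defined earlier:
pca = PCA(n_components=1)
pca.fit(X)
X_reduction = pca.transform(X)                  # shape (100, 1)
X_restore = pca.inverse_transform(X_reduction)  # shape (100, 2)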
PCA in Scikit Learn
PCA in Scikit Learn does not use gradient ascent, but a direct mathematical method (based on singular value decomposition). The usage is as follows:
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
The n_components parameter indicates the number of principal components.
Since PCA loses some of the original information during dimensionality reduction, PCA in Scikit Learn also accepts a value 0 < n_components < 1, indicating how much of the information should be retained during dimensionality reduction, for example:
pca = PCA(0.95)
That is, retain 95% of the original data information; in this case the program determines the number of principal components (components_) automatically.
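If you want to inspect how that threshold is met, a fitted Scikit Learn PCA exposes the variance ratio of each retained component and the number of components it chose; a short sketch (X_train here stands for whatever training data you are reducing):
pca = PCA(0.95)
pca.fit(X_train)
print(pca.explained_variance_ratio_)  # proportion of variance explained by each retained component
print(pca.n_components_)              # number of components chosen automatically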
Using kNN to see the advantages of PCA
Now let's look at the advantages of PCA with the help of the kNN classification algorithm, using the classic MNIST handwritten digit dataset. The procedure for obtaining the sample data is as follows:
import numpy as np
from sklearn.datasets import fetch_mldata

mnist = fetch_mldata("MNIST original")
X, y = mnist['data'], mnist['target']
# The MNIST dataset is already split by convention: the first 60,000 samples form the training set, the rest the test set
X_train = np.array(X[:60000], dtype=float)
y_train = np.array(y[:60000], dtype=float)
X_test = np.array(X[60000:], dtype=float)
y_test = np.array(y[60000:], dtype=float)
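Note: fetch_mldata has been removed from recent scikit-learn releases. If it is unavailable in your environment, the same data can be loaded with fetch_openml (a sketch assuming scikit-learn >= 0.20; the as_frame argument requires >= 0.24):
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, as_frame=False)  # the same 70,000-sample MNIST data as NumPy arrays
X, y = mnist['data'], mnist['target']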
First, let's look at the efficiency of the kNN algorithm without PCA:
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
%time knn_clf.fit(X_train, y_train)
At this point, the time consumed by the fit process is:
CPU times: user 29.6 s, sys: 306 ms, total: 29.9 s
Wall time: 30.7 s
Then verify the accuracy of the test data set:
%time knn_clf.score(X_test, y_test)
The time consumed in this process is:
CPU times: user 10min 23s, sys: 2.05 s, total: 10min 25s
Wall time: 10min 31s
And the accuracy of the test data set is 0.9688.
Now take a look at the advantages of PCA. When using PCA, set the retained information to 90%:
from sklearn.decomposition import PCA

pca = PCA(0.9)
pca.fit(X_train)
X_train_reduction = pca.transform(X_train)

knn_clf = KNeighborsClassifier()
%time knn_clf.fit(X_train_reduction, y_train)
At this point, the time consumed by the fit process is:
CPU times: user 317 ms, sys: 5.31 ms, total: 323 ms
Wall time: 323 ms
Then verify the accuracy of the test data set:
X_test_reduction = pca.transform(X_test)
%time knn_clf.score(X_test_reduction, y_test)
The time consumed in this process is:
CPU times: user 1min 15s, sys: 469 ms, total: 1min 15s
Wall time: 1min 17s
It can be seen that the time consumed after PCA is greatly reduced, and the accuracy on the test data set is 0.9728. The accuracy not only does not decrease but actually increases, because the information lost during dimensionality reduction was mostly useless information, so the reduction also achieved a noise-reduction effect.
The source address
Github | ML-Algorithms-Action