Produced by: The Cabin | Author: Peter | Editor: Peter

Ng Machine Learning (9): Dimensionality Reduction and PCA

This article covers data dimensionality reduction, focusing on the PCA algorithm:

  • Why reduce dimensionality
    • Data compression
    • Data visualization
  • The PCA algorithm
    • PCA vs. linear regression
    • Characteristics of PCA
    • PCA in Python
  • PCA in scikit-learn

Why reduce dimensionality

With real high-dimensional data, samples become sparse, distance computations become difficult, and other problems arise; this is known as the curse of dimensionality.

The remedy is dimensionality reduction: transforming the original high-dimensional attribute space into a low-dimensional “subspace” through some mathematical method. In this subspace the sample density is greatly increased; in effect, a lower-dimensional structure is “embedded” in the higher-dimensional space.

Dimensionality Reduction

There are two main motivations for data dimensionality reduction:

  • Data compression (Data Compression)
  • Data visualization (Data Visualization)

Data Compression

The illustration above explains:

  1. Feature vectors in a three-dimensional space are reduced to feature vectors in two dimensions.
  2. Projecting three-dimensional data onto a two-dimensional plane forces all the data to lie on the same plane.
  3. The same process can reduce data from any dimension to any lower dimension, for example from 1000-dimensional features down to 100 dimensions (see the sketch below).
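
As a hedged sketch of this kind of compression (the array shapes and the use of scikit-learn's PCA here are illustrative assumptions, not part of the original example), random 1000-dimensional features are reduced to 100 dimensions:

import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 500 samples, each with 1000 features
X = np.random.RandomState(0).rand(500, 1000)

pca = PCA(n_components=100)   # keep 100 principal components
X_reduced = pca.fit_transform(X)

print(X.shape)          # (500, 1000)
print(X_reduced.shape)  # (500, 100)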

Data Visualization

Dimensionality reduction can help us visualize the data.

Explanation of the above image:

  1. Suppose the given data has several different attributes.
  2. Some attributes may express the same meaning, so they can be placed on the same axis in the graph, reducing the dimensionality of the data.

PCA: Principal Component Analysis

In PCA, what we need to do is find a direction vector such that, when all the data are projected onto it, the projection error is minimized; the key point of PCA is finding the projection surface (direction) that minimizes this projection error.

A direction vector is a vector passing through the origin, and the projection error is the length of the perpendicular from each feature vector to that direction vector.
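
As a minimal sketch of this idea (the 2D points and the candidate direction below are made up), the projection error of each point onto a unit direction vector through the origin is the length of the component orthogonal to that direction:

import numpy as np

# Made-up 2D data points (one per row) and a candidate unit direction through the origin
X = np.array([[2.0, 1.9], [1.0, 1.1], [-1.5, -1.4]])
u = np.array([1.0, 1.0]) / np.sqrt(2)   # unit direction vector

proj_lengths = X @ u                      # scalar projection of each point onto u
projections = np.outer(proj_lengths, u)   # projected points on the line through the origin
errors = np.linalg.norm(X - projections, axis=1)   # perpendicular (projection) distances

print(errors)                # projection error of each sample
print(np.mean(errors ** 2))  # mean squared projection error, the quantity PCA minimizes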

Difference between PCA and linear regression

  1. In linear regression the vertical axis is the predicted value; in PCA every axis is a feature attribute.
  2. The errors differ: PCA minimizes the projection (perpendicular) error, while linear regression minimizes the prediction (vertical) error; see the sketch below.
  3. Linear regression aims to predict an output; PCA makes no prediction at all.
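
The toy sketch below (with made-up noisy linear data) makes the contrast concrete: least squares minimizes vertical prediction errors, while the first PCA direction minimizes perpendicular projection errors, so the two slopes generally differ:

import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(100)
y = 2 * x + 0.3 * rng.randn(100)   # made-up noisy linear data

# Linear regression: minimizes the vertical (prediction) error
slope, intercept = np.polyfit(x, y, deg=1)

# PCA: direction of largest variance, minimizes the perpendicular (projection) error
X = np.column_stack([x, y])
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
pca_dir = eigvecs[:, np.argmax(eigvals)]   # first principal direction

print("regression slope:", slope)
print("PCA direction slope:", pca_dir[1] / pca_dir[0])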

PCA algorithm

In principal component analysis, the given data are first normalized so that each variable has mean 0 and variance 1.

An orthogonal transformation is then applied: data originally expressed in terms of (possibly) linearly correlated variables are re-expressed in terms of a set of linearly uncorrelated new variables.

Among all possible orthogonal transformations, the new variables are chosen so that their variances, which measure how much information they carry, are as large as possible. These new variables are called the first principal component, the second principal component, and so on. The original data can then be approximately represented by the leading principal components, which is the dimensionality reduction of the data.

The steps of the PCA algorithm for reducing data from n dimensions to k dimensions are listed below (a NumPy sketch follows the steps):

  • Mean normalization: compute the mean $\mu_j$ of every feature and replace each value $x_j$ with $x_j - \mu_j$; if the features are not on the same order of magnitude, also divide by the standard deviation.

  • Calculate the covariance matrix


$$\Sigma = \frac{1}{m} \sum_{i=1}^{m}\left(x^{(i)}\right)\left(x^{(i)}\right)^{T}$$

  • Compute the eigenvectors of the covariance matrix $\Sigma$

This procedure is also described in the “watermelon book” (Zhou Zhihua, Machine Learning).
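
A minimal NumPy sketch of these steps, assuming a made-up data matrix X with m samples as rows (the helper name pca_reduce is made up), using the SVD of the covariance matrix to obtain its eigenvectors:

import numpy as np

def pca_reduce(X, k):
    """Reduce X (m samples x n features) to k dimensions; a sketch of the steps above."""
    # Mean normalization (divide by the standard deviation as well if scales differ)
    mu = X.mean(axis=0)
    X_norm = X - mu

    # Covariance matrix: Sigma = (1/m) * X_norm^T X_norm  (n x n)
    m = X.shape[0]
    Sigma = (X_norm.T @ X_norm) / m

    # Eigenvectors of Sigma via SVD; the columns of U are the principal directions
    U, S, Vt = np.linalg.svd(Sigma)
    U_reduce = U[:, :k]       # keep the first k directions

    Z = X_norm @ U_reduce     # projected, k-dimensional data
    return Z, U_reduce, mu, S

# Made-up example: 5 samples with 3 features, reduced to 2 dimensions
X = np.random.RandomState(1).rand(5, 3)
Z, U_reduce, mu, S = pca_reduce(X, k=2)
print(Z.shape)   # (5, 2)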

Determining the number of principal components

The number of principal components k in the PCA algorithm is generally determined from the formula:


$$\frac{\frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)}-x_{\text{approx}}^{(i)}\right\|^{2}}{\frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)}\right\|^{2}} \leq 0.01$$

The 0.01 on the right-hand side of the inequality can also be 0.05, 0.1, and so on, which are common choices. A threshold of 0.01 means that 99% of the variance in the data is retained, i.e., most of the information in the data is preserved.

When the number k is given, the diagonal entries $S_{ii}$ of the matrix $S$ obtained from the singular value decomposition of the covariance matrix satisfy:


$$1-\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \leq 0.01$$

That is, it satisfies:


$$\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \geq 0.99$$

This is equivalent to the formula above.
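
A small sketch of this criterion: given the singular values $S_{ii}$ of the covariance matrix in descending order (as returned by np.linalg.svd), choose the smallest k whose retained-variance ratio reaches the threshold. The helper name choose_k and the singular values below are made up for illustration:

import numpy as np

def choose_k(S, threshold=0.99):
    """Smallest k such that sum(S[:k]) / sum(S) >= threshold."""
    ratios = np.cumsum(S) / np.sum(S)
    return int(np.argmax(ratios >= threshold)) + 1

# Made-up singular values in descending order
S = np.array([5.0, 2.0, 0.4, 0.1])
print(choose_k(S))         # k that retains at least 99% of the variance -> 4
print(choose_k(S, 0.95))   # k that retains at least 95% of the variance -> 3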

Reconstruction from the compressed representation

Reconstruction from Compressed Representation refers to the process of restoring data from the low-dimensional representation back to the original high-dimensional space.

Suppose there are two samples $x^{(1)}, x^{(2)}$, and their compressed representations are given by $z = U_{\text{reduce}}^{T} x$.

We can map the compressed points back to the original space by approximately inverting this equation:


$$x_{\text{approx}} = U_{\text{reduce}} \cdot z, \quad x_{\text{approx}} \approx x$$
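
A hedged sketch of the reconstruction (the tiny 2D dataset and the helper name reconstruct are made up); when the discarded components carry little variance, the reconstruction is close to the original:

import numpy as np

def reconstruct(Z, U_reduce, mu):
    """Map k-dimensional points back to the original n-dimensional space."""
    # x_approx = U_reduce * z, then add back the mean removed during normalization
    return Z @ U_reduce.T + mu

# Made-up example: reduce 2D points to 1D, then reconstruct
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2]])
mu = X.mean(axis=0)
U, S, Vt = np.linalg.svd(np.cov(X - mu, rowvar=False))
U_reduce = U[:, :1]
Z = (X - mu) @ U_reduce              # compressed representation z = U_reduce^T x
print(reconstruct(Z, U_reduce, mu))  # approximately equal to the original X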

Characteristics of PCA

  1. PCA essentially takes the directions of greatest variance as the main features, so that the data become uncorrelated across the different orthogonal directions (demonstrated in the sketch below).
  2. PCA is a parameter-free technique: no parameters need to be tuned.
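
The decorrelation property in point 1 can be checked numerically: after projecting centered data onto the principal directions, the covariance matrix of the new coordinates is diagonal (up to floating-point error). A sketch with made-up correlated data:

import numpy as np

rng = np.random.RandomState(0)
A = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 0.2]])
X = rng.randn(200, 3) @ A          # made-up data with correlated features

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
Z = Xc @ eigvecs                   # coordinates along the orthogonal principal directions

print(np.round(np.cov(Z, rowvar=False), 6))   # off-diagonal entries are ~0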

PCA in Python

The PCA algorithm is implemented below using the NumPy, pandas, and Matplotlib libraries.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def loadData(filename):
  # File-loading function
  df = pd.read_table(filename, sep='\t')
  return np.array(df)  # be sure to return an ndarray

def showData(dataMat, reconMat):
  # Plotting function
  fig = plt.figure()  # the canvas
  ax = fig.add_subplot(111)  # subplot
  ax.scatter(dataMat[:, 0], dataMat[:, 1], c='green')  # scatter plot of the original data
  ax.scatter(np.array(reconMat[:, 0]), reconMat[:, 1], c='red')  # reconstructed data
  plt.show()

def pca(dataMat, topNfeat):   # topNfeat is the number of principal components to keep
  # 1. Center the samples: subtract the mean of each attribute from every sample
  meanVals = np.mean(dataMat, axis=0)   # column means
  meanRemoved = dataMat - meanVals  # centered data

  # 2. Compute the covariance matrix of the samples
  covmat = np.cov(meanRemoved, rowvar=0)
  print(covmat)

  # 3. Eigendecompose the covariance matrix, sort the eigenvalues from large to small,
  #    and keep the eigenvectors of the topNfeat largest eigenvalues
  # np.mat creates a matrix; np.linalg.eig returns the eigenvalues and eigenvectors
  eigVals, eigVects = np.linalg.eig(np.mat(covmat))
  eigValInd = np.argsort(eigVals)  # argsort returns the indices that sort the eigenvalues
  eigValInd = eigValInd[:-(topNfeat + 1):-1]   # e.g. [:-8:-1] keeps the 7 largest
  redEigVects = eigVects[:, eigValInd]   # eigenvectors of the topNfeat largest eigenvalues

  # 4. Project the data into the lower-dimensional space
  lowDataMat = meanRemoved * redEigVects   # reduced data: only topNfeat dimensions remain
  reconMat = (lowDataMat * redEigVects.T) + meanVals   # reconstructed data
  return np.array(lowDataMat), np.array(reconMat)

# Main function
if __name__ == "__main__":
  dataMat = loadData(filepath)   # filepath: path of the data file
  lowDataMat, reconMat = pca(dataMat, 1)
  # showData(dataMat, lowDataMat)
  showData(dataMat, reconMat)
  print(lowDataMat)
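
The original data file is not included, so here is a hedged usage sketch that calls the pca and showData functions defined above on synthetic two-dimensional data (the synthetic array stands in for what loadData would return):

import numpy as np

# Synthetic stand-in for the data file: 100 correlated 2D points
rng = np.random.RandomState(0)
x1 = rng.rand(100) * 10
x2 = 0.8 * x1 + rng.randn(100)
dataMat = np.column_stack([x1, x2])

lowDataMat, reconMat = pca(dataMat, 1)   # keep the first principal component
print(lowDataMat.shape)                  # (100, 1)
showData(dataMat, reconMat)              # original points (green) vs. reconstructed points (red)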

PCA in scikit-learn

Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD.

(The description above is from the scikit-learn PCA documentation.)

Implementation module

In scikit-learn, the PCA-related classes are in the sklearn.decomposition package, and the most commonly used one is sklearn.decomposition.PCA.

Whitening: after dimensionality reduction, normalize each feature so that its variance is 1 (a sketch follows the class signature below).

class sklearn.decomposition.PCA(n_components=None,    # number of components to keep after reduction; specify an integer directly
                                copy=True,
                                whiten=False,          # whitening is off by default
                                svd_solver='auto',     # method used for the singular value decomposition (SVD)
                                tol=0.0,
                                iterated_power='auto',
                                random_state=None)
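
A short hedged sketch of the two most commonly adjusted parameters, n_components and whiten, on made-up data: with whiten=True, each retained component is rescaled to roughly unit variance, and explained_variance_ratio_ reports the fraction of variance each component keeps.

import numpy as np
from sklearn.decomposition import PCA

# Made-up data with features on very different scales
X = np.random.RandomState(0).randn(200, 5) * [5, 3, 1, 0.5, 0.1]

pca = PCA(n_components=2, whiten=True)   # keep 2 components and whiten them
Z = pca.fit_transform(X)

print(Z.shape)                          # (200, 2)
print(np.round(Z.std(axis=0), 2))       # roughly 1.0 per component because of whitening
print(pca.explained_variance_ratio_)    # fraction of variance retained by each component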

Demo

Here is an example of using PCA to reduce the Iris dataset to three components and visualize it:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # 3D plotting module
from sklearn import decomposition  # dimensionality reduction module
from sklearn import datasets

np.random.seed(5)

centers = [[1, 1], [-1, -1], [1, -1]]  # (not used in this demo)
iris = datasets.load_iris()  # load the data

X = iris.data  # feature matrix
y = iris.target  # class labels

fig = plt.figure(1, figsize=(4, 3))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, 0.95, 1], elev=48, azim=134)

plt.cla()
pca = decomposition.PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
  ax.text3D(X[y == label, 0].mean(),
            X[y == label, 1].mean() + 1.5,
            X[y == label, 2].mean(), name,
            horizontalalignment='center',
            bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))

y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.nipy_spectral, edgecolor='k')

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])

plt.show()