
1. PCA (Principal Component Analysis)

1.1 Introduction to the Algorithm

Although data samples are high-dimensional, the information relevant to the learning task may lie in a low-dimensional embedding, so the data can often be reduced in dimension effectively.

Principal component analysis (PCA) is a method for statistical analysis and simplification of data sets.


It applies an orthogonal transformation to linearly transform the observed values of a set of possibly correlated variables, projecting them onto the values of a set of linearly uncorrelated variables, which are called the principal components.
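As a quick, hedged illustration (my own toy example, not part of the original article), here is how that orthogonal transformation looks with scikit-learn's PCA: the component axes are orthogonal and the projected scores are approximately uncorrelated.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 5) @ rng.randn(5, 5)      # correlated 5-D toy data

pca = PCA(n_components=2)
Z = pca.fit_transform(X)                     # projected scores (the principal components)

print(pca.components_ @ pca.components_.T)   # ~identity: the new axes are orthogonal
print(np.cov(Z.T))                           # ~diagonal: the projected variables are uncorrelated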

1.2 Implementation Roadmap

In general, the simplest way to obtain a lower dimensional subspace is to perform a linear transformation on the original higher dimensional space.

Given data points in an 𝒎-dimensional space, project them into a lower-dimensional space while retaining as much information as possible.

  • Orthogonally project the data onto a low-dimensional linear subspace:

Maximize the variance of the projected data (purple line), or equivalently minimize the mean squared distance (sum of the blue lines) between the data points and their projections.
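A small numerical sketch of the equivalence just stated (my own toy example, assuming centered data): for any unit direction, the projected variance plus the mean squared residual distance is constant, so maximizing one is the same as minimizing the other.

import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(500, 2) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X -= X.mean(axis=0)                              # center the data

def variance_and_residual(w):
    w = w / np.linalg.norm(w)                    # unit projection direction
    proj = X @ w                                 # coordinates along w
    resid = X - np.outer(proj, w)                # what the projection discards
    return proj.var(), (resid ** 2).sum(axis=1).mean()

for w in (np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 1.0])):
    var, mse = variance_and_residual(w)
    print(f"variance={var:.3f}  mse={mse:.3f}  sum={var + mse:.3f}")   # the sum stays constant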

  • Principal component concept:

    1. The idea of principal component analysis (PCA) is to map the 𝒎-dimensional features onto 𝒌 dimensions (𝒌 < 𝒎), where the 𝒌 dimensions are new, mutually orthogonal features.
    2. These 𝒌-dimensional features are called the principal components (PCs); they are 𝒌 reconstructed features.
  • Principal component characteristics:

    1. Each principal component is a vector starting from the centroid (mean) of the data.
    2. Principal component #1 points in the direction of maximum variance.
    3. Each subsequent principal component is orthogonal to the previous ones and points in the direction of maximum variance within the residual subspace (see the sketch below).
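The characteristics above can be checked with a rough NumPy sketch (my own example, using the standard eigendecomposition view of PCA): center the data, eigendecompose the sample covariance matrix, and the eigenvectors sorted by decreasing eigenvalue are the principal components.

import numpy as np

rng = np.random.RandomState(2)
X = rng.randn(300, 3) @ rng.randn(3, 3)          # toy 3-D data

Xc = X - X.mean(axis=0)                          # 1. vectors start at the centroid (mean)
C = np.cov(Xc, rowvar=False)                     # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)             # eigh: symmetric matrix, ascending eigenvalues
order = np.argsort(eigvals)[::-1]                # sort by explained variance, descending
pcs = eigvecs[:, order]                          # 2. PC #1 = direction of maximum variance

print(np.allclose(pcs.T @ pcs, np.eye(3)))       # True: 3. the PCs are mutually orthogonal
print(eigvals[order])                            # variances along PC1, PC2, PC3 (decreasing)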

1.3 Calculation by Formula

1.3.1 Computing the Principal Vectors Sequentially

Given the centered data {𝒙_𝟏, 𝒙_𝟐, …, 𝒙_𝒏}, compute the principal vectors: the first principal vector maximizes the variance of the projection of 𝒙 onto it.

Each subsequent principal vector maximizes the variance of the projection within the residual subspace, i.e. the subspace orthogonal to the principal vectors already found.
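This sequential view can be sketched with a simple deflation loop (a toy illustration of mine, not the article's code): find the direction of maximum variance, remove that direction from the data, then repeat in the residual subspace.

import numpy as np

rng = np.random.RandomState(3)
X = rng.randn(400, 4) @ rng.randn(4, 4)
X = X - X.mean(axis=0)                           # centered data, as assumed above

components = []
R = X.copy()
for _ in range(2):
    C = np.cov(R, rowvar=False)
    vals, vecs = np.linalg.eigh(C)
    w = vecs[:, -1]                              # direction maximizing the variance of R
    components.append(w)
    R = R - np.outer(R @ w, w)                   # project onto the residual subspace

w1, w2 = components
print(abs(w1 @ w2))                              # ~0: the next direction is orthogonal to the previous one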

1.3.2 Sample Covariance Matrix

Given the data {𝒙_𝟏, 𝒙_𝟐, …, 𝒙_𝒏}, compute the sample covariance matrix 𝚺; the principal vectors are the eigenvectors of 𝚺 with the largest eigenvalues.

I’m not going to prove it. There are too many formulas.
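Instead of a proof, here is a quick numerical sketch (toy data of my own) of the sample covariance matrix itself, computed by hand and checked against np.cov:

import numpy as np

rng = np.random.RandomState(4)
X = rng.randn(100, 3)

mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / (X.shape[0] - 1)     # unbiased sample covariance
print(np.allclose(Sigma, np.cov(X, rowvar=False)))   # True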

1.4 A Small Practice

Given an image data set, we discuss how the number of principal components retained after PCA dimensionality reduction affects clustering performance.

from PIL import Image
import numpy as np
import os
from ex1.clustering_performance import clusteringMetrics  # local helper that returns ACC, NMI, ARI
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False

def getImage(path):
    images = []
    for root, dirs, files in os.walk(path):
        if len(dirs) == 0:
            images.append([os.path.join(root, x) for x in files])
    return images

# loading images
images_files = getImage('face_images')
y = []
all_imgs = []
for i in range(len(images_files)):
    y.append(i)
    imgs = []
    for j in range(len(images_files[i])):
        img = np.array(Image.open(images_files[i][j]).convert("L"))  # gray
        # img = np.array(Image.open(images_files[i][j])) #RGB
        imgs.append(img)
    all_imgs.append(imgs)

# Visualization
w, h = 180, 200
pic_all = np.zeros((h * 10, w * 10))  # gray
for i in range(10):
    for j in range(10):
        pic_all[i * h:(i + 1) * h, j * w:(j + 1) * w] = all_imgs[i][j]
pic_all = np.uint8(pic_all)
pic_all = Image.fromarray(pic_all)
pic_all.show()

# Construct the input X
label = []
X = []
for i in range(len(all_imgs)):
    for j in all_imgs[i]:
        label.append(i)
        # temp = j.reshape(h * w, 3) #RGB
        temp = j.reshape(h * w)  # GRAY
        X.append(temp)

def kmeans_in(X_Data, k):
    kMeans1 = KMeans(k)
    y_p = kMeans1.fit_predict(X_Data)
    ACC, NMI, ARI = clusteringMetrics(label, y_p)
    t = "ACC:{},NMI:{:.4f},ARI:{:.4f}".format(ACC, NMI, ARI)
    print(t)
    return ACC, NMI, ARI

# PCA
def pca(X_Data, n_component, height, weight):
    X_Data = np.array(X_Data)
    pca1 = PCA(n_component)
    pca1.fit(X_Data)
    faces = pca1.components_
    faces = faces.reshape(n_component, height, weight)
    X_t = pca1.transform(X_Data)
    return faces, X_t

def draw(n_component, faces):
    plt.figure(figsize=(10, 4))
    plt.subplots_adjust(hspace=0, wspace=0)
    for i in range(n_component):
        plt.subplot(2, 5, i + 1)
        plt.imshow(faces[i], cmap='gray')
        plt.title(i + 1)
        plt.xticks(())
        plt.yticks(())
    plt.show()

score = []
for i in range(10):
    _, X_trans = pca(X, i + 1, h, w)
    acc, nmi, ari = kmeans_in(X_trans, 10)
    score.append([acc, nmi, ari])

score = np.array(score)
bar_width = 0.25
x = np.arange(1, 11)
plt.bar(x, score[:, 0], bar_width, align="center", color="orange", label="ACC", alpha=0.5)
plt.bar(x + bar_width, score[:, 1], bar_width, color="blue", align="center", label="NMI", alpha=0.5)
plt.bar(x + bar_width*2, score[:, 2], bar_width, color="red", align="center", label="ARI", alpha=0.5)
plt.xlabel("n_component")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

2. LDA (Linear Discriminant Analysis)

2.1 Introduction to the Algorithm

When we project the data, different projection directions lead to different results after dimensionality reduction. Comparing the two projections below, the classes are separated much more clearly by method 2, so method 2 is better.

Comparison with the projection obtained by PCA.

2.2 Implementation Roadmap

After projection, the within-class variance should be as small as possible and the between-class variance as large as possible.

As we can see in the 3D projection example above, method 2 is better because it has the smallest within-class variance and the largest between-class variance.

The data is mapped to ℝ^k (from d dimensions down to k dimensions). The transformation should map samples of the same class as close together as possible (minimum within-class distance) and samples of different classes as far apart as possible (maximum between-class distance), while retaining as much of the discriminant information in the sample data as possible.

Write 𝒁_𝒊 = {𝑻(𝒙) | 𝒙 ∈ 𝑿_𝒊}. Following the basic idea of linear discriminant analysis, we want:

  • Between-class dispersion: the mean of 𝒁_𝟏 and the mean of 𝒁_𝟐 should be as far apart as possible.

  • Within-class dispersion: the elements of each 𝒁_𝒊 should be concentrated as tightly as possible around the mean of 𝒁_𝒊.
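A toy illustration of these two criteria (my own example, not from the article): for a candidate direction w, compare how far apart the projected class means are (between-class dispersion) with how tightly each projected class clusters around its own mean (within-class dispersion).

import numpy as np

rng = np.random.RandomState(0)
X1 = rng.randn(100, 2) + np.array([0.0, 0.0])    # class 1
X2 = rng.randn(100, 2) + np.array([4.0, 0.5])    # class 2

def criteria(w):
    w = w / np.linalg.norm(w)
    z1, z2 = X1 @ w, X2 @ w                      # Z_i = {T(x) | x in X_i} for a linear T
    between = (z1.mean() - z2.mean()) ** 2       # are the projected means far apart?
    within = z1.var() + z2.var()                 # is each class tight around its own mean?
    return between, within

for w in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    b, s = criteria(w)
    print(f"w={w}  between={b:.2f}  within={s:.2f}  ratio={b / s:.2f}")   # larger ratio = better direction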

Input: the training samples {(𝒙_𝒊, 𝒚_𝒊)}, 𝒊 = 𝟏, …, 𝒏, and the target dimension k after reduction.

Output: the low-dimensional representation 𝒁 = [𝒛_𝟏, …, 𝒛_𝒏] of 𝑿 = [𝒙_𝟏, …, 𝒙_𝒏].

Steps (a sketch of these steps in code follows below):

    1. Compute the within-class scatter matrix Sw.
    2. Compute the between-class scatter matrix Sb.
    3. Compute the matrix Sw⁻¹Sb.
    4. Compute the k largest eigenvalues of Sw⁻¹Sb and the corresponding k eigenvectors (w1, w2, …, wk), which form the projection matrix W.
    5. Transform each sample feature xi into a new sample zi = Wᵀxi.
    6. Output the transformed sample set {(𝒛_𝒊, 𝒚_𝒊)}.
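Below is a rough NumPy sketch of steps 1 to 6 (my own illustration, not the article's code), run on the digits data set that the practice in the next section also uses; the small ridge added to Sw is an assumption of mine to keep it invertible.

import numpy as np
from sklearn import datasets

X = datasets.load_digits().data                  # (1797, 64)
y = datasets.load_digits().target
mu = X.mean(axis=0)

d = X.shape[1]
Sw = np.zeros((d, d))                            # step 1: within-class scatter matrix
Sb = np.zeros((d, d))                            # step 2: between-class scatter matrix
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)
    Sb += len(Xc) * np.outer(mc - mu, mc - mu)

# steps 3-4: eigenvectors of Sw^(-1) Sb with the largest eigenvalues
M = np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb)   # small ridge keeps Sw invertible
vals, vecs = np.linalg.eig(M)
order = np.argsort(vals.real)[::-1]
k = 2
W = vecs.real[:, order[:k]]                      # projection matrix, shape (d, k)

Z = X @ W                                        # steps 5-6: z_i = W^T x_i for every sample
print(Z.shape)                                   # (1797, 2)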

2.3 A Small Practice

Given an image data set, we discuss the dimensionality-reduction effect of LDA.

from sklearn import datasets  # import the data sets
from sklearn.neighbors import KNeighborsClassifier #KNN
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt # PLT is used to display images
from matplotlib import offsetbox

def calLDA(k):
    # LDA
    lda = LinearDiscriminantAnalysis(n_components=k).fit(data, label)  # n_components sets the reduced dimension to k
    dataLDA = lda.transform(data)  # apply the fitted transform to the data
    return dataLDA

def calPCA(k):
    # PCA
    pca = PCA(n_components=k).fit(data)
    # Return the data after dimensionality reduction
    dataPCA = pca.transform(data)
    return dataPCA

def draw():
    # matplotlib has trouble displaying Chinese text in plots; the rcParams font settings below handle this
    fig = plt.figure('example', figsize=(11, 6))
    # plt.xlabel('X')
    # plt.ylabel('Y')
    # plt.xlim(xmax=9, xmin=-9)
    # plt.ylim(ymax=9, ymin=-9)
    color = ["red", "yellow", "blue", "green", "black", "purple", "pink", "brown", "gray", "orange"]
    colors = []
    for target in label:
        colors.append(color[target])
    plt.subplot(121)
    plt.title("Visualization of LDA dimension Reduction")
    plt.scatter(dataLDA.T[0], dataLDA.T[1], s=10,c=colors)
    plt.subplot(122)
    plt.title("PCA Dimensionality Reduction Visualization")
    plt.scatter(dataPCA.T[0], dataPCA.T[1], s=10, c=colors)

    #plt.legend()
    plt.show()

def plot_embedding(X, title=None):
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)  # Normalize each dimension 0-1. Note that X has only two dimensions
    colors = ['#5dbe80', '#2d9ed8', '#a290c4', '#efab40', '#eb4e4f', '#929591', '#ababab', '#eeeeee', '#aaaaaa', '#213832']

    ax = plt.subplot()

    # Draw sample points
    for i in range(X.shape[0]):  # each row represents a sample
        plt.text(X[i, 0], X[i, 1], str(label[i]),
                 # color=plt.cm.Set1(y[i] / 10.),
                 color=colors[label[i]],
                 fontdict={'weight': 'bold', 'size': 9})  # draw the sample's digit label at the sample's location

    # Draw thumbnails on sample points and make sure the thumbnails are sparse enough not to cover each other
    if hasattr(offsetbox, 'AnnotationBbox'):
        shown_images = np.array([[1., 1.]])  # assume the first thumbnail sits at position (1, 1)
        for i in range(data.shape[0]):
            dist = np.sum((X[i] - shown_images) ** 2, 1)  # distance from this sample to every thumbnail already shown
            if np.min(dist) < 4e-3:  # too close to an existing thumbnail: skip it so the images do not overlap
                continue
            shown_images = np.r_[shown_images, [X[i]]]  # append this sample's position to shown_images (vertical stacking)

            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(datasets.load_digits().images[i], cmap=plt.cm.gray_r),
                X[i])
            ax.add_artist(imagebox)

    #plt.xticks([]), plt.yticks([]) #
    if title is not None:
        plt.title(title)

    plt.show()

data = datasets.load_digits().data  # 1797 samples, each a 64-dimensional digit image
label = datasets.load_digits().target
dataLDA = calLDA(2)
dataPCA = calPCA(2)

#draw() # common graph


plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

plot_embedding(dataLDA,"Visualization of LDA dimension Reduction")
plot_embedding(dataPCA,"PCA Dimensionality Reduction Visualization")

Finally

Xiao Sheng Fan Yi, looking forward to your attention.