In this exercise, you will implement the K-means algorithm and use it to compress images. In Part 2, you will use principal component analysis (PCA) to find low-dimensional representations of facial images.
1. Clustering (implementing the K-means algorithm)
In this part we will implement K-means clustering and use it to compress images. We’ll start with a simple 2D data set to see how K-means works, and then apply it to image compression.
K-means is an iterative, unsupervised clustering algorithm that groups similar instances into clusters. The algorithm starts by guessing an initial center for each cluster, then repeatedly assigns each instance to its nearest cluster and recomputes that cluster’s center. The first part of our implementation is a function that finds the closest cluster center for each instance in the data.
1.1 Original Data
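The snippets in this section assume the setup below. The loading step is only a sketch: the file name and the .mat key are placeholders of my own, so adjust them to wherever your 2D data set actually lives; all that matters is that data ends up as a DataFrame with columns x1 and x2.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat  # only needed if the data is stored as a .mat file

# Assumed loading step: the path and the 'X' key are placeholders
raw = loadmat('data/ex7data2.mat')
data = pd.DataFrame(raw['X'], columns=['x1', 'x2'])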
# Display data
plt.figure(figsize=(12, 8))
plt.scatter(data['x1'], data['x2'])
plt.title('Source Data')
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
From the plot we can see that the data naturally falls into three clusters.
1.2 Finding cluster centers (seed points)
First, randomly initialize the seed points; picking three points from the sample is enough.
# Randomly initialize seed points
centroids = np.array(data.sample(3))
Compute the distance between each sample point and every cluster center, and take the minimum to determine which cluster each sample belongs to.
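Written as a formula, this is the standard K-means assignment step, where $\mu_j$ denotes the center of cluster $j$:

$$c^{(i)} := \arg\min_{j} \lVert x^{(i)} - \mu_j \rVert^2$$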
def find_closest_centroids(X, centroids):
    """Find the seed point (cluster center) closest to each sample."""
    # number of samples
    m = X.shape[0]
    # number of classes
    k = centroids.shape[0]
    # stores the class of each sample
    idx = np.zeros(m)
    # iterate over the samples to find the minimum distance
    for i in range(m):
        # minimum distance found so far, start with a large default
        min_dist = 100000
        # loop over the seed points
        for j in range(k):
            # squared Euclidean distance to seed point j
            dist = np.sum((X[i,:] - centroids[j,:]) ** 2)
            # if it is smaller than the stored distance, keep it and record the class
            if dist < min_dist:
                min_dist = dist
                idx[i] = j
    return idx
Returns a list of categories for each sample.
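As a quick check, the function can be run on the data and the randomly chosen seed points from above (a small sketch using the variables already defined):
X = np.array(data)
idx = find_closest_centroids(X, centroids)
print(idx[:10])  # cluster index (0.0, 1.0 or 2.0) of the first ten samples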
1.3 Draw clustering results
def plot_cluster(k, idx, X):
    """Draw the K-means result."""
    fig, ax = plt.subplots(figsize=(12, 8))
    for i in range(k):
        # Loop over each class and extract its samples
        cluster = X[np.where(idx == i)[0], :]
        ax.scatter(cluster[:,0], cluster[:,1], s=30, cmap=plt.cm.gist_rainbow, label='Cluster {}'.format(i))
    ax.legend()
    plt.show()
The result of matching the samples to the randomly initialized cluster centers is shown in the figure:
1.4 Calculate the new clustering center
Next, we need a function to calculate the cluster center of the cluster. The cluster center is the average of all samples currently assigned to the cluster.
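As a formula, the update for cluster $k$, where $C_k$ is the set of samples currently assigned to it, is:

$$\mu_k := \frac{1}{\lvert C_k \rvert} \sum_{i \in C_k} x^{(i)}$$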
def compute_centroids(X, idx, k):
    """Compute the new centers. k: number of classes."""
    # m: number of samples, n: number of features
    m, n = X.shape
    # initialize the matrix that stores the cluster centers
    centroids = np.zeros((k, n))
    # loop over the classes and compute each center
    for i in range(k):
        # indices of the samples with idx == i (np.where returns a tuple)
        samples_id = np.where(idx == i)
        # average the samples column by column
        centroids[i,:] = (np.sum(X[samples_id[0],:], axis=0) / len(samples_id[0])).ravel()
    return centroids
Here np.sum(X[samples_id[0],:], axis=0) sums each feature column within a class; dividing by the number of samples in that class gives the mean of every feature.
The samples are then matched against the new cluster centers, with the result shown below:
1.5 Building the complete algorithm
To sum up, we just need to repeat two steps:
- Match each sample to its nearest cluster center
- Compute the new cluster centers
def run_k_means(X, centroids, max_iters):
    """max_iters: maximum number of iterations."""
    # m: number of samples, n: number of features
    m, n = X.shape
    # number of classes
    k = centroids.shape[0]
    # iterate
    for i in range(max_iters):
        # find the class of each sample
        idx = find_closest_centroids(X, centroids)
        # compute the new centers
        centroids = compute_centroids(X, idx, k)
    return idx, centroids
The function returns the final cluster assignments and the cluster centers; the result after 10 iterations is shown in the figure:
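Putting the pieces together, a full run on the 2D data could look like the following sketch (three clusters and 10 iterations, as in the figure):
X = np.array(data)
centroids = np.array(data.sample(3))
idx, centroids = run_k_means(X, centroids, 10)
plot_cluster(3, idx, X)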
2. Using K-means for image compression
We can use clustering to find a small number of the most representative colors, and then use the cluster assignments to map the original 24-bit colors (3 channels * 8 bits) into a lower-dimensional color space.
Why 24-bit color?
Since the image we use (240 * 240) is represented in RGB space with three channels, and each pixel in each channel takes one of 256 values (0-255), every pixel of the 240 * 240 image needs 3 * 8 = 24 bits to represent its color (equivalently, there are 2^24, roughly 16.7 million, possible colors). The total space occupied by the image is 240 * 240 * 3 * 8 = 1,382,400 bits.
Our goal is to group the color information of the original image into 16 classes, i.e., 16 colors. Each pixel then needs only 4 bits to store its color (2^4 = 16). The total space occupied by the image becomes 240 * 240 * 4 = 230,400 bits, plus a small amount of extra space to store the 16 cluster colors: 16 * 3 * 8 = 384 bits.
That compresses the image by a factor of nearly 6.
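These numbers can be verified with a quick calculation (not part of the original exercise code):
original_bits = 240 * 240 * 24                # 1,382,400 bits for the raw RGB image
compressed_bits = 240 * 240 * 4 + 16 * 24     # 230,400 + 384 = 230,784 bits
print(original_bits / compressed_bits)        # ~5.99, i.e. roughly a 6x compression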
2.1 Reading Data
- The original image
from IPython.display import Image
Image(filename='data/20180402155901_62188.jpg')
# Read image data
img = plt.imread('data/20180402155901_62188.jpg')
img.shape
Shape: (240, 240, 3)
- Data processing
The image data needs to be reshaped from 3D to 2D. Because the pixel values span a wide range (0-255), feature normalization is also required.
(240, 240, 3) -> (57600, 3)
# Expand the pixels
X = np.reshape(img, (img.shape[0] * img.shape[1], img.shape[2]))
# Feature normalization
X = X / 255
X.shape
2.2 Running the K-means algorithm
# Randomly initialize seed points
centroids = np.array(pd.DataFrame(X).sample(16))
# Run algorithm
idx, centroids = run_k_means(X, centroids, 10)
The clustering results are shown as follows:
This plot uses the first two columns of the data, i.e. the red and green channels of each pixel.
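It can be reproduced with the plot_cluster helper from section 1.3, which only looks at those first two columns (a sketch; drawing all 57,600 points may take a moment):
plot_cluster(16, idx, X)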
2.3 Observe the result of image compression
Next, we need to replace the value of each pixel in the original image with the value of its corresponding cluster center.
# Replace each pixel with the value of one of the 16 cluster centers
X_recovered = centroids[idx.astype(int), :]
The result is shown below:
# Reshape the pixels back into the original image shape
X_recovered = np.reshape(X_recovered, (img.shape[0], img.shape[1], img.shape[2]))
plt.imshow(X_recovered)
# Remove the tick marks
plt.xticks([])
plt.yticks([])
plt.show()
You can see that the compressed image contains far fewer distinct colors.
3. Principal Component Analysis (PCA)
PCA is a linear transformation that finds the directions of maximum variance (the “principal components”) in a data set, and it can be used for dimensionality reduction. In this exercise we first implement PCA and apply it to a simple two-dimensional data set to see how it works. We start by loading and visualizing the data set.
3.1 Loading Data
Before running PCA, we need to normalize the features. Normalization is simple with a DataFrame.
# Feature normalization
X = pd.DataFrame(X)
X = (X - X.mean()) / X.std()
X = np.array(X)
# Data Visualization
plt.figure(figsize=(12, 8))
plt.scatter(X[:, 0], X[:, 1])
plt.title('Source Data')
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
Show the raw data after normalization:
3.2 Calculate the covariance matrix and perform singular value decomposition
Covariance matrix: $\Sigma = \frac{1}{m} X^T X$, where $m$ is the number of samples.
def pca(X):
    """Return U: principal directions (eigenvectors), S: singular values, V: the remaining SVD factor."""
    # Calculate the covariance matrix
    X = np.matrix(X)
    cov = (X.T * X) / X.shape[0]
    # Singular value decomposition
    U, S, V = np.linalg.svd(cov)
    return U, S, V
The returned U is a matrix of shape (n, n), where n is the number of features.
3.3 Data dimension reduction
Now that we have the eigenvectors of the covariance matrix (the matrix U), we can use them to project the original data into a lower-dimensional space. For this task, we will implement a function that computes the projection using only the top k components, reducing the dimension to k.
def dimensionality_reduction(X, U, k):
# shape of X (m, n)
# Select K components of U (n, K)
U_reduced = U[:,:k]
# Calculate the value of dimension reduction (m, k)
Z = np.dot(X, U_reduced)
return Z
The return value Z is the data after dimension reduction.
Let’s map the data back to two dimensions and take a look:
The recovery is only approximate: in the plot, each recovered point lies on the projection of the original point onto the main direction.
def recover_data(Z, U, k):
# (n, k)
U_reduced = U[:,:k]
# (m, k) @ (k, n)
X_re = np.dot(Z, U_reduced.T)
return X_re
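The plot below compares the recovered data X_re with the original. A minimal sketch of how X_re can be produced with the functions defined above, projecting the 2D data onto its single main direction (k = 1):
U, S, V = pca(X)
Z = dimensionality_reduction(X, U, 1)   # project onto the first principal component
X_re = recover_data(Z, U, 1)            # map back to the original 2D space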
Drawing:
# Data Visualization
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(np.array(X_re[:,0]), np.array(X_re[:,1]), s=30, color='r', label='Recover Data')
ax.scatter(X[:, 0], X[:, 1], s=30, color='b', label='Source Data')
plt.title('Versus')
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
3.4 Compress images through PCA
Our final task in this exercise is to apply PCA to face images. By using the same dimensionality reduction technique, we can capture the “essence” of the image with much less data than the original image.
3.4.1 Data Display
The data consists of 32 * 32 grayscale face images; let’s draw the first 100.
def plot_100figs(X):
    """Draw 100 pictures."""
    # indices of the first 100 images
    sample_list = np.arange(100)
    # extract those 100 samples from the data
    sample_data = X[sample_list, :]
    # sharex/sharey: subplots share the x and y axes; ax_array is a 10 x 10 grid of axes
    fig, ax_array = plt.subplots(nrows=10, ncols=10, sharex=True, sharey=True, figsize=(8, 8))
    # fill each subplot
    for r in range(10):
        for c in range(10):
            ax_array[r, c].matshow(sample_data[10 * r + c, :].reshape((32, 32)).T,
                                   cmap=plt.cm.gray)
    # Remove the tick marks
    plt.xticks([])
    plt.yticks([])
    plt.show()
Original data:
3.4.2 Data Dimension reduction
Each of our images therefore has 32 * 32 = 1024 features.
- Now we take the top 100 principal components through PCA
U, S, V = pca(X)
Z = dimensionality_reduction(X, U, 100)
X_re100 = recover_data(Z, U, 100)
The result is shown below:
You can see that details are missing, but the general outline is preserved.
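The singular values returned by pca quantify this: the fraction of total variance kept by the first k components is the ratio of the corresponding singular values (a quick check, not part of the original exercise):
k = 100
print(S[:k].sum() / S.sum())  # fraction of the total variance kept by the first 100 components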
- What if we keep only 10 principal components?
U, S, V = pca(X)
Z = dimensionality_reduction(X, U, 10)
X_re10 = recover_data(Z, U, 10)
The result is shown below:
- What if we keep only a single principal component?
U, S, V = pca(X)
Z = dimensionality_reduction(X, U, 1)
X_re1 = recover_data(Z, U, 1)
The result is shown below:
Extra
Notice that with the K-means algorithm we compressed the color information of an RGB image. What would the result be if PCA were used to compress the color information instead?
I’ll use this picture again:
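A sketch of what that experiment could look like with the PCA helpers defined above, keeping a single principal component of each pixel's three color values and mapping it back. This assumes img is the RGB image from section 2; the preprocessing and display steps below are my own choices, not from the original post.
# Flatten the image to (pixels, 3) and normalize the color channels
pixels = pd.DataFrame(np.reshape(img, (-1, 3)).astype(float))
mean, std = pixels.mean(), pixels.std()
X_img = np.array((pixels - mean) / std)

# Keep only 1 of the 3 color components, then map back
U, S, V = pca(X_img)
Z = dimensionality_reduction(X_img, U, 1)
X_back = np.array(recover_data(Z, U, 1))

# Undo the normalization and display the result
X_back = (X_back * np.array(std) + np.array(mean)) / 255
plt.imshow(np.clip(np.reshape(X_back, img.shape), 0, 1))
plt.xticks([])
plt.yticks([])
plt.show()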
This is why it is important to choose an algorithm suited to the application at hand.