In this exercise, you will implement the K-means algorithm and use it to compress images. In Part 2, you will use principal component analysis (PCA) to find low-dimensional representations of facial images.
1. Clustering (implementing the K-means algorithm)
In this part we will implement K-means clustering and use it to compress images. We’ll start with a simple 2D data set to see how K-means works, and then apply it to image compression.
K-means is an iterative, unsupervised clustering algorithm that groups similar instances into clusters. The algorithm starts by guessing an initial center for each cluster, then repeatedly assigns each instance to its nearest cluster and recomputes that cluster’s center. The first part of our implementation is a function that finds the closest cluster center for each instance in the data.
1.1 Original Data
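The snippets in this section assume the setup below. The loading step is only a sketch: the file name and the .mat key are placeholders of my own, so adjust them to wherever your 2D data set actually lives; all that matters is that data ends up as a DataFrame with columns x1 and x2.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat  # only needed if the data is stored as a .mat file

# Assumed loading step: the path and the 'X' key are placeholders
raw = loadmat('data/ex7data2.mat')
data = pd.DataFrame(raw['X'], columns=['x1', 'x2'])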
# Display data
plt.figure(figsize=(12, 8))
plt.scatter(data['x1'], data['x2'])
plt.title('Source Data')
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
From the plot we can see that the data naturally falls into three clusters.
1.2 Finding cluster centers (seed points)
First, randomly initialize the seed points; picking three points from the sample is enough.
# Randomly initialize seed points
centroids = np.array(data.sample(3))
Compute the distance between each sample point and every cluster center, and take the minimum to determine which cluster each sample belongs to.
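Written as a formula, this is the standard K-means assignment step, where $\mu_j$ denotes the center of cluster $j$:

$$c^{(i)} := \arg\min_{j} \lVert x^{(i)} - \mu_j \rVert^2$$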
def find_closest_centroids(X, centroids):
    """Find the seed point (cluster center) closest to each sample."""
    # number of samples
    m = X.shape[0]
    # number of classes
    k = centroids.shape[0]
    # stores the class of each sample
    idx = np.zeros(m)
    # iterate over the samples to find the minimum distance
    for i in range(m):
        # minimum distance found so far, start with a large default
        min_dist = 100000
        # loop over the seed points
        for j in range(k):
            # squared Euclidean distance to seed point j
            dist = np.sum((X[i,:] - centroids[j,:]) ** 2)
            # if it is smaller than the stored distance, keep it and record the class
            if dist < min_dist:
                min_dist = dist
                idx[i] = j
    return idx
Returns a list of categories for each sample.
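As a quick check, the function can be run on the data and the randomly chosen seed points from above (a small sketch using the variables already defined):
X = np.array(data)
idx = find_closest_centroids(X, centroids)
print(idx[:10])  # cluster index (0.0, 1.0 or 2.0) of the first ten samples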
1.3 Draw clustering results
def plot_cluster(k, idx, X):
    """Draw the K-means result."""
    fig, ax = plt.subplots(figsize=(12, 8))
    for i in range(k):
        # Loop over each class and extract its samples
        cluster = X[np.where(idx == i)[0], :]
        ax.scatter(cluster[:,0], cluster[:,1], s=30, cmap=plt.cm.gist_rainbow, label='Cluster {}'.format(i))
    ax.legend()
    plt.show()
The result of matching the samples to the randomly initialized cluster centers is shown in the figure:
1.4 Calculate the new clustering center
Next, we need a function to calculate the cluster center of the cluster. The cluster center is the average of all samples currently assigned to the cluster.
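As a formula, the update for cluster $k$, where $C_k$ is the set of samples currently assigned to it, is:

$$\mu_k := \frac{1}{\lvert C_k \rvert} \sum_{i \in C_k} x^{(i)}$$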
def compute_centroids(X, idx, k):
    """Compute the new centers. k: number of classes."""
    # m: number of samples, n: number of features
    m, n = X.shape
    # initialize the matrix that stores the cluster centers
    centroids = np.zeros((k, n))
    # loop over the classes and compute each center
    for i in range(k):
        # indices of the samples with idx == i (np.where returns a tuple)
        samples_id = np.where(idx == i)
        # average the samples column by column
        centroids[i,:] = (np.sum(X[samples_id[0],:], axis=0) / len(samples_id[0])).ravel()
    return centroids
Here np.sum(X[samples_id[0],:], axis=0) sums each feature column within a class; dividing by the number of samples in that class gives the mean of every feature.
The samples are then matched against the new cluster centers, with the result shown below:
1.5 Building the complete algorithm
To sum up, we just need to repeat two steps:
- Match each sample to its nearest cluster center
- Compute the new cluster centers
def run_k_means(X, centroids, max_iters):
    """max_iters: maximum number of iterations."""
    # m: number of samples, n: number of features
    m, n = X.shape
    # number of classes
    k = centroids.shape[0]
    # iterate
    for i in range(max_iters):
        # find the class of each sample
        idx = find_closest_centroids(X, centroids)
        # compute the new centers
        centroids = compute_centroids(X, idx, k)
    return idx, centroids
The function returns the final cluster assignments and the cluster centers; the result after 10 iterations is shown in the figure:
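Putting the pieces together, a full run on the 2D data could look like the following sketch (three clusters and 10 iterations, as in the figure):
X = np.array(data)
centroids = np.array(data.sample(3))
idx, centroids = run_k_means(X, centroids, 10)
plot_cluster(3, idx, X)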
2. Using K-means for image compression
We can use clustering to find a small number of the most representative colors, and then use the cluster assignments to map the original 24-bit colors (3 channels * 8 bits) into a lower-dimensional color space.
Why 24-bit color?
Since the image we use (240 * 240) is represented in RGB space with three channels, and each pixel in each channel takes one of 256 values (0-255), every pixel of the 240 * 240 image needs 3 * 8 = 24 bits to represent its color (equivalently, there are 2^24, roughly 16.7 million, possible colors). The total space occupied by the image is 240 * 240 * 3 * 8 = 1,382,400 bits.
Our goal is to group the color information of the original image into 16 classes, i.e., 16 colors. Each pixel then needs only 4 bits to store its color (2^4 = 16). The total space occupied by the image becomes 240 * 240 * 4 = 230,400 bits, plus a small amount of extra space to store the 16 cluster colors: 16 * 3 * 8 = 384 bits.
That compresses the image by a factor of nearly 6.
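These numbers can be verified with a quick calculation (not part of the original exercise code):
original_bits = 240 * 240 * 24                # 1,382,400 bits for the raw RGB image
compressed_bits = 240 * 240 * 4 + 16 * 24     # 230,400 + 384 = 230,784 bits
print(original_bits / compressed_bits)        # ~5.99, i.e. roughly a 6x compression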
2.1 Reading Data
- The original image
from IPython.display import Image
Image(filename='data/20180402155901_62188.jpg')
# Read image data
img = plt.imread('data/20180402155901_62188.jpg')
img.shape
Shape: (240, 240, 3)
- Data processing
The image data needs to be reshaped from 3D to 2D. Because the pixel values span a wide range (0-255), feature normalization is also required.
(240, 240, 3) -> (57600, 3)
# Expand the pixels
X = np.reshape(img, (img.shape[0] * img.shape[1], img.shape[2]))
# Feature normalization
X = X / 255
X.shape
2.2 Running the K-means algorithm
# Randomly initialize seed points
centroids = np.array(pd.DataFrame(X).sample(16))
# Run algorithm
idx, centroids = run_k_means(X, centroids, 10)
The clustering results are shown as follows:
This plot uses the first two columns of the data, i.e. the red and green channels of each pixel.
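It can be reproduced with the plot_cluster helper from section 1.3, which only looks at those first two columns (a sketch; drawing all 57,600 points may take a moment):
plot_cluster(16, idx, X)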
2.3 Observe the result of image compression
Next, we need to replace the value of each pixel in the original image with the value of its corresponding cluster center.
# Replace each pixel with the value of one of the 16 cluster centers
X_recovered = centroids[idx.astype(int), :]
The result is shown below:
# Reshape the pixels back into the original image shape
X_recovered = np.reshape(X_recovered, (img.shape[0], img.shape[1], img.shape[2]))
plt.imshow(X_recovered)
# Remove the tick marks
plt.xticks([])
plt.yticks([])
plt.show()
You can see that the compressed image contains far fewer distinct colors.
3. Principal Component Analysis (PCA)
PCA is a linear transformation that finds the directions of maximum variance (the “principal components”) in a data set, and it can be used for dimensionality reduction. In this exercise we first implement PCA and apply it to a simple two-dimensional data set to see how it works. We start by loading and visualizing the data set.
3.1 Loading Data
Before running PCA, we need to normalize the features. Normalization is simple with a DataFrame.
# Feature normalization
X = pd.DataFrame(X)
X = (X - X.mean()) / X.std()
X = np.array(X)
# Data Visualization
plt.figure(figsize=(12, 8))
plt.scatter(X[:, 0], X[:, 1])
plt.title('Source Data')
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
Show the raw data after normalization:
3.2 Calculate the covariance matrix and perform singular value decomposition
Covariance matrix: $\Sigma = \frac{1}{m} X^T X$, where $m$ is the number of samples.
def pca(X):
    """Return U: principal directions (eigenvectors), S: singular values, V: the remaining SVD factor."""
    # Calculate the covariance matrix
    X = np.matrix(X)
    cov = (X.T * X) / X.shape[0]
    # Singular value decomposition
    U, S, V = np.linalg.svd(cov)
    return U, S, V
The returned U is a matrix of shape (n, n), where n is the number of features.
3.3 Data dimension reduction
Now that we have the eigenvectors of the covariance matrix (the matrix U), we can use them to project the original data into a lower-dimensional space. For this task, we will implement a function that computes the projection using only the top k components, reducing the dimension to k.
def dimensionality_reduction(X, U, k):
# shape of X (m, n)
# Select K components of U (n, K)
U_reduced = U[:,:k]
# Calculate the value of dimension reduction (m, k)
Z = np.dot(X, U_reduced)
return Z
The return value Z is the data after dimension reduction.
Let’s map the data back to two dimensions and take a look:
The recovery is only approximate: in the plot, each recovered point lies on the projection of the original point onto the main direction.
def recover_data(Z, U, k):
# (n, k)
U_reduced = U[:,:k]
# (m, k) @ (k, n)
X_re = np.dot(Z, U_reduced.T)
return X_re
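The plot below compares the recovered data X_re with the original. A minimal sketch of how X_re can be produced with the functions defined above, projecting the 2D data onto its single main direction (k = 1):
U, S, V = pca(X)
Z = dimensionality_reduction(X, U, 1)   # project onto the first principal component
X_re = recover_data(Z, U, 1)            # map back to the original 2D space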
Drawing:
# Data Visualization
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(np.array(X_re[:,0]), np.array(X_re[:,1]), s=30, color='r', label='Recover Data')
ax.scatter(X[:, 0], X[:, 1], s=30, color='b', label='Source Data')
plt.title('Versus')
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
3.4 Compress images through PCA
Our final task in this exercise is to apply PCA to face images. By using the same dimensionality reduction technique, we can capture the “essence” of the image with much less data than the original image.
3.4.1 Data Display
The data consists of 32 * 32 grayscale face images; let’s draw the first 100.
def plot_100figs(X):
    """Draw 100 pictures."""
    # indices of the first 100 images
    sample_list = np.arange(100)
    # extract those 100 samples from the data
    sample_data = X[sample_list, :]
    # sharex/sharey: subplots share the x and y axes; ax_array is a 10 x 10 grid of axes
    fig, ax_array = plt.subplots(nrows=10, ncols=10, sharex=True, sharey=True, figsize=(8, 8))
    # fill each subplot
    for r in range(10):
        for c in range(10):
            ax_array[r, c].matshow(sample_data[10 * r + c, :].reshape((32, 32)).T,
                                   cmap=plt.cm.gray)
    # Remove the tick marks
    plt.xticks([])
    plt.yticks([])
    plt.show()
Original data:
3.4.2 Data Dimension reduction
Each of our images therefore has 32 * 32 = 1024 features.
- Now we take the top 100 principal components through PCA
U, S, V = pca(X)
Z = dimensionality_reduction(X, U, 100)
X_re100 = recover_data(Z, U, 100)
The result is shown below:
You can see that details are missing, but the general outline is preserved.
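The singular values returned by pca quantify this: the fraction of total variance kept by the first k components is the ratio of the corresponding singular values (a quick check, not part of the original exercise):
k = 100
print(S[:k].sum() / S.sum())  # fraction of the total variance kept by the first 100 components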
- What if we keep only 10 principal components?
U, S, V = pca(X)
Z = dimensionality_reduction(X, U, 10)
X_re10 = recover_data(Z, U, 10)
The result is shown below:
- What if we keep only a single principal component?
U, S, V = pca(X)
Z = dimensionality_reduction(X, U, 1)
X_re1 = recover_data(Z, U, 1)
The result is shown below:
Extra
Notice that with the K-means algorithm we compressed the color information of an RGB image. What would the result be if PCA were used to compress the color information instead?
I’ll use this picture again:
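A sketch of what that experiment could look like with the PCA helpers defined above, keeping a single principal component of each pixel's three color values and mapping it back. This assumes img is the RGB image from section 2; the preprocessing and display steps below are my own choices, not from the original post.
# Flatten the image to (pixels, 3) and normalize the color channels
pixels = pd.DataFrame(np.reshape(img, (-1, 3)).astype(float))
mean, std = pixels.mean(), pixels.std()
X_img = np.array((pixels - mean) / std)

# Keep only 1 of the 3 color components, then map back
U, S, V = pca(X_img)
Z = dimensionality_reduction(X_img, U, 1)
X_back = np.array(recover_data(Z, U, 1))

# Undo the normalization and display the result
X_back = (X_back * np.array(std) + np.array(mean)) / 255
plt.imshow(np.clip(np.reshape(X_back, img.shape), 0, 1))
plt.xticks([])
plt.yticks([])
plt.show()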
This is why it is important to choose an algorithm suited to the application at hand.