Birds of a feather flock together: a classic unsupervised learning algorithm – the K-means clustering algorithm
1. K-Means definition
The K-means clustering algorithm first selects K objects at random as the initial cluster centers, then computes the distance between each sample and each cluster center and assigns each sample to the nearest center.
A cluster center together with the objects assigned to it represents one cluster. After each round of assignment, the center of each cluster is recalculated from the objects currently in it. This process repeats until some termination condition is met.
Typical termination conditions: no (or only a minimum number of) samples are reassigned to different clusters, no (or only a minimum number of) cluster centers change, or the sum of squared errors reaches a local minimum.
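In scikit-learn's KMeans, for instance, these conditions surface as the tol and max_iter parameters and the inertia_ (sum of squared errors) attribute. A minimal sketch with illustrative values:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
# stop when the centers move less than tol, or after max_iter rounds at most
kms = KMeans(n_clusters=3, tol=1e-4, max_iter=300).fit(X)
print(kms.n_iter_)   # iterations actually run before convergence
print(kms.inertia_)  # sum of squared errors at the solution found
```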
2. K-Means steps
Theory:
1. Randomly generate K cluster centers.
2. Compute the (Euclidean) distance between each sample and each cluster center, and assign each sample to the set of its nearest center.
3. Recalculate the cluster center of each set.
4. Repeat steps 2 and 3 until convergence.
5. Return all cluster labels.
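A minimal NumPy sketch of these five steps (the function name, the fixed random seed, and the iteration cap are my own choices, not code from this article):

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100):
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), k, replace=False)]  # 1. K random samples as initial centers
    for _ in range(n_iter):
        # 2. Euclidean distance from every sample to every center, then nearest-center assignment
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute each center as the mean of its samples (empty clusters not handled here)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. repeat steps 2 and 3 until the centers stop moving
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels  # 5. return all cluster labels
```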
K-Means illustrated:
The dataset is generated with make_blobs.
① Set K = 2. The first random partition produces two clusters, and the result is clearly very unbalanced! It has not yet converged.
② Compute the Euclidean distance between each sample and the two cluster centers, and assign each sample to the set of the nearer center.
- If a sample point is exactly equidistant from the two cluster centers, it is randomly assigned to one of them.
- When this round of clustering ends, the cluster centers of the two clusters are recalculated, so the centers change.
- In the next round of distance calculations between sample points and cluster centers, that sample point will be assigned to whichever cluster it really belongs to.
- In other words, this randomness does not affect the final clustering result.
Think of it from a different frame of reference: the two cluster centers are constantly searching for the samples that belong to them. Even if a mistake is made along the way, it does not change which sample points each center ends up with in the final result.
③ After all sample points have been assigned to a cluster center (each point joins the set of its nearest center), recalculate the location of each cluster center, shown as the black points.
In the figure below, the red points are the centers from the previous round, before recalculation: the purple samples are closer to the right red point, the yellow samples are closer to the left red point, and the black points are the recalculated centers.
Then, using the recalculated black points, the distance from each sample point to each black point is computed again: the region closer to the left center becomes the yellow cluster, and the region closer to the right center becomes the right cluster.
We can take a look at the clustering process: it took 15 iterations to converge
3. Comparison between K-means and KNN
| | K-Means | KNN |
|---|---|---|
| Learning type | Unsupervised learning | Supervised learning |
| Scope of computation | Global, over all samples | Local, among the K neighbors |
| Meaning of K | K clusters | K nearest neighbors |
| Final result | Only which cluster each sample falls in; the cluster itself has no class label | The predicted class label is known explicitly |
| Advantages | Simple principle, easy to implement, fast convergence | Simple and easy to use, easy to understand, high accuracy, mature theory; usable for both classification and regression |
| Disadvantages | Very sensitive to the choice of K and to the sample distribution | Suffers from class imbalance (some classes have many samples while others have few) |
The two pictures in this section are from the Bilibili uploader "Five Minutes of Machine Learning".
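The contrast shows up directly in code. A minimal sketch with scikit-learn (the dataset and parameter values are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=3, random_state=0)

# K-Means: unsupervised, fit sees only X; its K means "K clusters"
km_labels = KMeans(n_clusters=3).fit_predict(X)      # cluster ids, not class labels

# KNN: supervised, fit needs X and y; its K means "K nearest neighbors"
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
knn_labels = knn.predict(X)                          # actual class labels
```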
4. Small practice
4.1 The first question
Generate data with the make_circles method in sklearn, then cluster and visualize it with K-means.
""" The make_circles method in Sklearn generates the data, clustering and visualizing it with K-means. "" "
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles
import matplotlib.pyplot as plt
from ex1.clustering_performance import clusteringMetrics # Import the teacher-written library
fig = plt.figure(1, figsize=(10.5))
X1, y1 = make_circles(n_samples=400, factor=0.5, noise=0.1)
plt.subplot(121)
plt.title('original')
plt.scatter(X1[:, 0], X1[:, 1], c=y1)
plt.subplot(122)
plt.title('K-means')
kms = KMeans(n_clusters=2, max_iter=400) # n_cluster Number of cluster centers max_iter number of iterations
y1_sample = kms.fit_predict(X1, y1) # Calculate and predict sample categories
centroids = kms.cluster_centers_
plt.scatter(X1[:, 0], X1[:, 1], c=y1_sample)
plt.scatter(centroids[:, 0], centroids[:, 1], s=30, marker=The '*', c='b')
print(clusteringMetrics(y1, y1_sample))
plt.show()
Copy the code
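clusteringMetrics here is a teacher-provided helper; if you don't have the ex1 package, scikit-learn's own metrics can stand in. A substitute sketch (the choice of metrics is my own):

```python
from sklearn.metrics import adjusted_rand_score, silhouette_score

print(adjusted_rand_score(y1, y1_sample))  # agreement with the true labels
print(silhouette_score(X1, y1_sample))     # cluster quality, no labels needed
```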
4.2 The second question
Generate data with the make_moons method in sklearn, then cluster and visualize it with a hand-written K-means.
""" The make_moons method in Sklearn generates data, clustering and visualization using K-means. "" "
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets as datasets
def create_data() :
X, y = datasets.make_moons(n_samples=400, noise=0.1)
return X, y
def init_centers(data, k) :
m, n = data.shape
# m sample number, n feature number
center_ids = np.random.choice(m, k)
centers = data[center_ids]
return centers
def cal_dist(ptA, ptB) :
return np.linalg.norm(ptA - ptB)
def kmeans_process(data, k) :
centers = init_centers(data, k)
m, n = data.shape
keep_changing = True
pred_y = np.zeros((m,))
iteration = 0
while keep_changing:
keep_changing = False
# Calculate the category of the remaining samples
for i in range(m):
min_distance = np.inf
for center in range(k):
distance = cal_dist(data[i, :], centers[center, :])
if distance < min_distance: # Decide which is closer
min_distance = distance
idx = center # Change the category
ifpred_y[i] ! = idx:# Determine if something has changed
keep_changing = True
pred_y[i] = idx
# Update the center point coordinates of the category
for center in range(k):
cluster_data = data[pred_y == center]
centers[center, :] = np.mean(cluster_data, axis=0) Find the centroid of data points of the same category
print(centers)
plt.clf()
plt.title(f'iteration: {iteration}')
plt.scatter(X[:, 0], X[:, 1], s=30, c=pred_y)
plt.scatter(centers[:, 0], centers[:, 1], s=100, c='k')
plt.pause(1)
iteration += 1
return centers, pred_y
if __name__ == '__main__':
X, y = create_data()
plt.ion()
centers, pred_y = kmeans_process(data=X, k=2)
plt.ioff()
plt.show()
Copy the code
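As a sanity check, the hand-written version can be compared against sklearn's KMeans on the same data. The comparison below is my own addition, reusing create_data and kmeans_process from the script above:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y = create_data()
_, my_labels = kmeans_process(data=X, k=2)       # hand-written version above
sk_labels = KMeans(n_clusters=2).fit_predict(X)  # library version
# a score near 1.0 means the two partitions agree (cluster ids may be swapped)
print(adjusted_rand_score(my_labels, sk_labels))
```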
4.3 The third question
Given an image, cluster its pixels and visualize the result.
```python
from scipy.cluster.vq import kmeans, vq
from pylab import *
from PIL import Image
import numpy as np


def clusterpixels(infile, k, steps):
    im = array(Image.open(infile))
    dx = im.shape[0] / steps
    dy = im.shape[1] / steps
    features = []
    for x in range(steps):  # average the R, G, B channels over each block
        for y in range(steps):
            R = mean(im[int(x * dx):int((x + 1) * dx), int(y * dy):int((y + 1) * dy), 0])
            G = mean(im[int(x * dx):int((x + 1) * dx), int(y * dy):int((y + 1) * dy), 1])
            B = mean(im[int(x * dx):int((x + 1) * dx), int(y * dy):int((y + 1) * dy), 2])
            features.append([R, G, B])
    features = array(features, 'f')  # make into a float array
    # clustering, k is the number of clusters
    centroids, variance = kmeans(features, k)
    code, distance = vq(features, centroids)
    codeim = code.reshape(steps, steps).astype(np.uint8)  # cast so PIL can handle the label image
    # nearest-neighbour resizing keeps the cluster labels intact
    codeim = np.array(Image.fromarray(codeim).resize((im.shape[1], im.shape[0]), Image.NEAREST))
    return codeim


infile_Stones = 'stones.jpg'
im_Stones = array(Image.open(infile_Stones))
steps = (50, 100)  # the image is divided into steps*steps regions

# show the original image
figure()
subplot(231)
title('original')
axis('off')
imshow(im_Stones)

for k in range(2, 7):
    codeim = clusterpixels(infile_Stones, k, steps[-1])
    subplot(2, 3, k)
    title('K=' + str(k))
    axis('off')
    imshow(codeim)
show()
```
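One caveat: the scipy docs for kmeans recommend whitening the observations (scaling each feature to unit variance) first; this script skips it because R, G and B are already on the same scale. A sketch of what whitening would look like inside clusterpixels:

```python
from scipy.cluster.vq import whiten, kmeans, vq

features_w = whiten(features)                # divide each column by its standard deviation
centroids, variance = kmeans(features_w, k)  # cluster in the whitened space
code, distance = vq(features_w, centroids)
```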
Finally
Xiao Sheng Fan Yi, looking forward to your follow.