
Birds of a feather flock together: a classic unsupervised learning algorithm, K-means clustering

1. K-Means definition

The K-means clustering algorithm first selects K objects at random as the initial cluster centers, then computes the distance between each sample and every cluster center and assigns each sample to its nearest center.

A cluster center together with the samples assigned to it forms one cluster. After every assignment pass, the center of each cluster is recalculated from the samples currently in that cluster. This process repeats until a termination condition is met.

Typical termination conditions are: no (or only a minimal number of) samples are reassigned to a different cluster, no (or only a minimal number of) cluster centers change, or the sum of squared errors (SSE) reaches a local minimum.
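For reference, these termination controls have direct counterparts in scikit-learn's KMeans. Here is a minimal sketch (assuming scikit-learn is installed; the dataset and parameter values are only illustrative, not from the article):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=3, random_state=0)

kms = KMeans(
    n_clusters=3,
    max_iter=300,  # hard cap on the number of assign/update iterations
    tol=1e-4,      # stop once the cluster centers barely move between iterations
    n_init=10,     # rerun with several random initializations and keep the lowest SSE
)
kms.fit(X)
print(kms.n_iter_)   # iterations actually used before convergence
print(kms.inertia_)  # final sum of squared errors (SSE)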

2. K-Means steps

Theory:

1. Randomly generate K cluster centers.
2. Compute the (Euclidean) distance between every sample and every cluster center, and assign each sample to the set of its nearest center.
3. Recalculate the cluster center of each set.
4. Repeat steps 2 and 3 until convergence.
5. Return the cluster labels of all samples.
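These steps can be written almost line for line in NumPy. Below is a minimal vectorized sketch (the function name kmeans_sketch, the seed, and the convergence check are my own choices, not from the article; empty clusters are not handled). A full loop-based implementation appears in exercise 4.2 below.

import numpy as np

def kmeans_sketch(data, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct samples as the initial cluster centers
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(max_iter):
        # Step 2: Euclidean distance from every sample to every center,
        # then assign each sample to its nearest center
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned samples
        new_centers = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Step 5: return the centers and the cluster label of every sample
    return centers, labels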

An illustrated walk-through of K-Means:

The data set is generated with make_blobs.

① Set K = 2 and randomly partition the samples into two clusters. As you can see, the split is very unbalanced; it has not converged yet.

② Compute the (Euclidean) distance between each sample and the two cluster centers, and assign each sample to the set of whichever center it is closer to.

  • If a sample point is exactly equidistant from both cluster centers, it is randomly assigned to one of them.
  • When this round of assignment ends, the centers of the two clusters are recalculated, so the cluster centers move.
  • In the next round of distance calculations between sample points and cluster centers, that sample point will be assigned to whichever cluster it really belongs to.
  • In other words, this randomness does not affect the final clustering result.

Looked at from another angle, the two cluster centers are constantly searching for their own samples; even if they make a mistake along the way, it does not change which sample points they end up with in the final result.

③ After all sample points have been assigned to a cluster center (each point joins the set of the center it is closest to), recalculate the location of each cluster center, i.e., the location of the black points.

In the figure below, the red points (which have not been recalculated yet) are the cluster centers from the previous iteration: the purple sample points are those closer to the red point on the right, the yellow sample points are those closer to the red point on the left, and the black points are the newly recalculated centers.

Then, using the recalculated black points, the distance from each sample point to each black point is computed again: the points closer to the left center form the yellow cluster, and the points closer to the right center form the other cluster.

We can watch the whole clustering process: it took 15 iterations to converge.
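If you want to reproduce a similar run yourself, a rough sketch with scikit-learn might look like the following (the make_blobs parameters are my own guesses, not the article's exact settings, so the iteration count will differ):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=2, random_state=1)
kms = KMeans(n_clusters=2, n_init=1, max_iter=300)
labels = kms.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(kms.cluster_centers_[:, 0], kms.cluster_centers_[:, 1], c='k', s=100)
plt.title(f'converged after {kms.n_iter_} iterations')
plt.show()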

3. Comparison between K-means and KNN

| | K-Means | KNN |
| --- | --- | --- |
| Training type | Unsupervised learning | Supervised learning |
| Range of samples used | Computed globally over all samples | Computed locally among the nearest neighbors |
| Meaning of K | Number of clusters | Number of nearest neighbors |
| Final result | We only know which cluster each sample falls into, not what that cluster's label means | The class label of each sample after classification is explicitly known |
| Advantages | Simple principle, easy to implement, fast convergence | Simple and easy to use, easy to understand, accurate, theoretically mature; works for both classification and regression |
| Disadvantages | Very sensitive to the choice of K and to the sample distribution | Suffers from class imbalance (some classes have many samples while others have few) |

The two figures in this section are from the Bilibili uploader "Five Minutes of Machine Learning".
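To make the unsupervised-versus-supervised row of the comparison concrete, here is a short sketch of my own (not from the article) showing how the two algorithms are called in scikit-learn:

from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=3, random_state=0)

# K-Means: unsupervised, never sees y; K is the number of clusters,
# and the returned values are arbitrary cluster ids, not class labels
cluster_ids = KMeans(n_clusters=3).fit_predict(X)

# KNN: supervised, trained on (X, y); K is the number of neighbors,
# and the prediction is an actual class label taken from y
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
pred_labels = knn.predict(X)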

4. Practice exercises

4.1 The first question

Use the make_circles method in sklearn to generate data, then cluster and visualize it with K-means.

""" The make_circles method in Sklearn generates the data, clustering and visualizing it with K-means. "" "
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles
import matplotlib.pyplot as plt
from ex1.clustering_performance import clusteringMetrics  # Import the teacher-written library

fig = plt.figure(1, figsize=(10, 5))
X1, y1 = make_circles(n_samples=400, factor=0.5, noise=0.1)
plt.subplot(121)
plt.title('original')
plt.scatter(X1[:, 0], X1[:, 1], c=y1)
plt.subplot(122)
plt.title('K-means')
kms = KMeans(n_clusters=2, max_iter=400)  # n_clusters: number of cluster centers; max_iter: maximum number of iterations
y1_sample = kms.fit_predict(X1, y1)  # Fit and predict the cluster label of each sample (y1 is ignored by KMeans)
centroids = kms.cluster_centers_
plt.scatter(X1[:, 0], X1[:, 1], c=y1_sample)
plt.scatter(centroids[:, 0], centroids[:, 1], s=30, marker='*', c='b')

print(clusteringMetrics(y1, y1_sample))
plt.show()

4.2 The second question

Use the make_moons method in sklearn to generate data, then cluster and visualize it with K-means.

""" The make_moons method in Sklearn generates data, clustering and visualization using K-means. "" "
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets as datasets


def create_data():
    X, y = datasets.make_moons(n_samples=400, noise=0.1)
    return X, y

def init_centers(data, k):
    m, n = data.shape
    # m: number of samples, n: number of features
    center_ids = np.random.choice(m, k, replace=False)  # pick k distinct samples as the initial centers
    centers = data[center_ids]
    return centers

def cal_dist(ptA, ptB):
    return np.linalg.norm(ptA - ptB)

def kmeans_process(data, k):
    centers = init_centers(data, k)
    m, n = data.shape
    keep_changing = True
    pred_y = np.zeros((m,))
    iteration = 0
    while keep_changing:
        keep_changing = False
        # Calculate the category of the remaining samples
        for i in range(m):
            min_distance = np.inf
            for center in range(k):
                distance = cal_dist(data[i, :], centers[center, :])
                if distance < min_distance:  # Decide which is closer
                    min_distance = distance
                    idx = center  # Change the category
            if pred_y[i] != idx:  # Check whether this sample's assignment changed
                keep_changing = True
            pred_y[i] = idx
        # Update the center point coordinates of the category
        for center in range(k):
            cluster_data = data[pred_y == center]
            centers[center, :] = np.mean(cluster_data, axis=0)  # The new center is the centroid of the points in this cluster
        print(centers)
        plt.clf()
        plt.title(f'iteration: {iteration}')
        plt.scatter(data[:, 0], data[:, 1], s=30, c=pred_y)  # plot the samples colored by their current cluster
        plt.scatter(centers[:, 0], centers[:, 1], s=100, c='k')
        plt.pause(1)
        iteration += 1
    return centers, pred_y

if __name__ == '__main__':
    X, y = create_data()
    plt.ion()
    centers, pred_y = kmeans_process(data=X, k=2)
    plt.ioff()
    plt.show()

4.3 The third question

Given an image, cluster its pixels and visualize the result.

import numpy as np  # needed explicitly for np.array below
from scipy.cluster.vq import *
from pylab import *
from PIL import Image


def clusterpixels(infile, k, steps):
    # Divide the image into steps*steps blocks, use each block's mean RGB as its feature, and cluster the blocks
    im = array(Image.open(infile))
    dx = im.shape[0] / steps
    dy = im.shape[1] / steps
    features = []

    for x in range(steps):
        for y in range(steps):
            # Mean R, G, B values of the current block form its feature vector
            R = mean(im[int(x * dx):int((x + 1) * dx), int(y * dy):int((y + 1) * dy), 0])
            G = mean(im[int(x * dx):int((x + 1) * dx), int(y * dy):int((y + 1) * dy), 1])
            B = mean(im[int(x * dx):int((x + 1) * dx), int(y * dy):int((y + 1) * dy), 2])
            features.append([R, G, B])
    features = array(features, 'f')  # make into array
    # clustering, k is the number of clusters
    centroids, variance = kmeans(features, k)
    code, distance = vq(features, centroids)
    codeim = code.reshape(steps, steps)
    # Scale the block labels back up to the original image size for display
    codeim = np.array(Image.fromarray(codeim.astype('uint8')).resize((im.shape[1], im.shape[0])))
    return codeim


# k = 5
infile_Stones = 'stones.jpg'
im_Stones = array(Image.open(infile_Stones))
steps = (50, 100)  # the image is divided into a steps*steps grid of blocks

# Show the original image
figure()
subplot(231)
title('original')
axis('off')
imshow(im_Stones)

for k in range(2, 7):
    codeim = clusterpixels(infile_Stones, k, steps[-1])
    subplot(2, 3, k)
    title('K=' + str(k))
    axis('off')
    imshow(codeim)

show()

Finally

I am Xiao Sheng Fan Yi, looking forward to your follow.