K-means clustering in Python with OpenCV

This post introduces what k-means clustering is and how to use the cv2.kmeans() function for data clustering.

  • What k-means clustering is
  • cv2.kmeans() performs data clustering

1. Results

Five clusters of points are sampled and clustered, with each cluster drawn in a different color. The result figures:

[figure: clustering of 5 sampled clusters of points]
[figure: a second run, again with 5 sampled clusters]
[figure: clustering of 10 sampled clusters of points]
[figure: single-feature clustering, T-shirt sizes split into three groups by height alone]
[figure: two-feature clustering, T-shirt sizes clustered by height and weight]

Color quantization is used to reduce the number of colors in an image, producing a new image. The original image vs. the result clustered to 15 colors:

[figure: original image vs. 15-color quantization]

The more color clusters there are, the slower the algorithm runs, but the closer the result looks to the original image.

2. Principle

2.1 What is K-means clustering?

Consider a company that is about to introduce a new T-shirt to the market. Obviously it will have to make models in different sizes to fit people of all sizes, so the company collects data on people's heights and weights and plots it in a chart, as shown below:

[figure: scatter plot of height vs. weight]

The company cannot produce T-shirts in every possible size. Instead, it divides people into small, medium, and large, and makes just three sizes that fit everyone. This grouping of people into three clusters can be done with k-means clustering: the algorithm gives us the best three sizes to satisfy everyone. If three are not enough, the company can divide people into more groups, perhaps five, as shown below:

[figure: the same data grouped into five clusters]

2.2 K-means clustering process

K-means clustering is an iterative process; in its most basic form it proceeds in the following steps.

  1. Suppose we have a set of points that we want to divide into two classes.
  2. Randomly pick two points C1 and C2 as centroids, compute the distance from every point to C1 and C2, and assign each point to whichever centroid is closer.
  3. After this assignment, compute the mean of all points in class C1 as the new centroid of C1, and the mean of all points in class C2 as the new centroid of C2.
  4. Repeat steps 2 and 3 until a preset condition is met (a maximum number of iterations, or a required accuracy).
  5. At convergence, the sum of the distances from C1 to all of its points plus the distances from C2 to all of its points is at its minimum.

In other words, the clustering minimizes the sum of the distances between each sample and its corresponding centroid. A minimal sketch of this loop follows.
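To make the loop concrete, here is a minimal NumPy sketch of the two-cluster iteration described above. It is illustrative only: the helper name kmeans_2 and the fixed iteration count are assumptions, and cv2.kmeans implements the same idea internally with many more refinements.

import numpy as np

def kmeans_2(points, n_iter=10):
    # Steps 1-2: pick two random points as the initial centroids C1 and C2
    # (points is assumed to be an (N, 2) float array)
    centers = points[np.random.choice(len(points), 2, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid
        dists = np.linalg.norm(points[:, None] - centers[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its points
        # (assumes neither cluster ends up empty)
        centers = np.array([points[labels == k].mean(axis=0) for k in (0, 1)])
    # Step 4 is simplified here to a fixed iteration count; a real implementation
    # would also stop once the centroids move less than some epsilon
    return labels, centers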

The steps above are only the skeleton of k-means clustering. Real implementations add many optimizations, such as how to choose the initial centroids (one common seeding strategy is sketched below) and how to speed up the iterative process.
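As an example of such an optimization, the sketch below shows k-means++-style seeding, the strategy OpenCV exposes through the cv2.KMEANS_PP_CENTERS flag. The helper name pp_seed is made up for illustration; this is the textbook idea, not OpenCV's exact code.

import numpy as np

def pp_seed(points, k):
    # Choose the first centroid uniformly at random
    centers = [points[np.random.randint(len(points))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        # Sample the next centroid with probability proportional to d2,
        # so far-away points are more likely to be picked
        centers.append(points[np.random.choice(len(points), p=d2 / d2.sum())])
    return np.array(centers)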

The iterative process can be visualized with the following pictures, which correspond to the steps above.

The first figures show steps 1 and 2: points are randomly chosen as centroids, and distances are computed to assign each point to a class. The centroids are then recomputed, and steps 3 and 4 repeat to give the final result. After several iterations the points separate into two clusters (red and blue), and the centroids settle near the middle of each cluster, becoming more regular.

[figures: k-means iterations converging to two clusters]

2.3 cv2.kmeans() performs data clustering

compactness, labels, centers = cv2.kmeans(z, 2, None, criteria, 10, flags)

The arguments:

  • samples: data of np.float32 type, with each feature in a single column.
  • nclusters (K): the number of clusters required at the end.
  • criteria: the iteration termination criteria. When these are met, the algorithm stops. It should be a tuple of three parameters, described below.

(type, max_iter, epsilon):

A – type of termination condition. It has three flags: cv2.TERM_CRITERIA_EPS stops the iteration once the specified accuracy epsilon is reached; cv2.TERM_CRITERIA_MAX_ITER stops after the specified number of iterations max_iter; cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER stops when either of the above conditions is met.

B – max_iter: an integer specifying the maximum number of iterations.

C – epsilon: the required accuracy.

  • attempts: the number of times the algorithm is executed with different initial labellings. The algorithm returns the labels that produce the best compactness, and this compactness is returned as an output.
  • flags: specifies how the initial centers are chosen. Two flags are usually used: cv2.KMEANS_PP_CENTERS and cv2.KMEANS_RANDOM_CENTERS. A complete example call is shown after the return values below.

The return values:

  • compactness: the sum of the squared distances from each point to its corresponding center.
  • labels: the label array, where each element is 0, 1, …
  • centers: the array of cluster centers.
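Putting the arguments and return values together, a minimal end-to-end call might look like this (the random data here is just a stand-in for real samples):

import cv2
import numpy as np

data = np.float32(np.random.rand(100, 2))  # (N, d) samples, np.float32
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
compactness, labels, centers = cv2.kmeans(
    data, 2, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)
print(compactness)          # sum of squared distances to the centers
print(labels.ravel()[:10])  # cluster index (0 or 1) per sample
print(centers)              # one row per cluster center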

3. Source code

3.1 Clustering randomly generated sample points

# K-means clustering demo: randomly generate several clusters of points and cluster them

from __future__ import print_function

import cv2 as cv
import numpy as np
from numpy import random


def make_gaussians(cluster_n, img_size):
    points = []
    ref_distrs = []
    for _i in range(cluster_n):
        mean = (0.1 + 0.8 * random.rand(2)) * img_size
        a = (random.rand(2, 2) - 0.5) * img_size * 0.1
        cov = np.dot(a.T, a) + img_size * 0.05 * np.eye(2)
        n = 100 + random.randint(900)
        pts = random.multivariate_normal(mean, cov, n)
        points.append(pts)
        ref_distrs.append((mean, cov))
    points = np.float32(np.vstack(points))
    return points, ref_distrs


def main():
    cluster_n = 5
    img_size = 300

    # Generate a bright palette
    colors = np.zeros((1, cluster_n, 3), np.uint8)
    colors[0, :] = 255
    colors[0, :, 0] = np.arange(0, 180, 180.0 / cluster_n)
    colors = cv.cvtColor(colors, cv.COLOR_HSV2BGR)[0]

    while True:
        print('sampling distributions... ')
        points, _ = make_gaussians(cluster_n, img_size)

        term_crit = (cv.TERM_CRITERIA_EPS, 30, 0.1)
        _ret, labels, _centers = cv.kmeans(points, cluster_n, None, term_crit, 10, 0)

        img = np.zeros((img_size, img_size, 3), np.uint8)
        for (x, y), label in zip(np.int32(points), labels.ravel()):
            c = list(map(int, colors[label]))

            cv.circle(img, (x, y), 1, c, -1)

        cv.imshow('kmeans', img)
        ch = cv.waitKey(0)
        if ch == 27:
            break

    print('Done')


if __name__ == '__main__':
    main()
    cv.destroyAllWindows()


3.2 Single feature clustering, multi-feature clustering and color quantization

# OpenCV implementation of k-means clustering
#
# Input parameters:
# - samples: data of np.float32 type, with each feature in a single column.
# - nclusters (K): the number of clusters required at the end.
# - criteria: the iteration termination criteria; when these are met, the algorithm stops.
#   A tuple of three parameters (type, max_iter, epsilon):
#   a - type of termination condition, one of three flags:
#       cv2.TERM_CRITERIA_EPS - stop once the specified accuracy epsilon is reached.
#       cv2.TERM_CRITERIA_MAX_ITER - stop after the specified number of iterations max_iter.
#       cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER - stop when either condition is met.
#   b - max_iter: the maximum number of iterations.
#   c - epsilon: the required accuracy.
# - attempts: the number of times the algorithm is executed with different initial labellings;
#   the algorithm returns the labels that produce the best compactness.
# - flags: how the initial centers are chosen, usually cv2.KMEANS_PP_CENTERS or
#   cv2.KMEANS_RANDOM_CENTERS.
#
# Output parameters:
# - compactness: the sum of the squared distances from each point to its corresponding center.
# - labels: the label array, where each element is 0, 1, ...
# - centers: the array of cluster centers.

# 1. Data with only one feature (the T-shirt problem, based on height only)
import numpy as np
import cv2
from matplotlib import pyplot as plt

x = np.random.randint(25, 100, 25)
y = np.random.randint(175, 255, 25)
z = np.hstack((x, y))
z = z.reshape((50, 1))
z = np.float32(z)
plt.hist(z, 256, [0, 256]), plt.show()

# Stop the algorithm and return whenever 10 iterations have run or accuracy epsilon = 1.0 is reached
# criteria = (type, max_iter, epsilon)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)

# set flags (just to avoid line breaks in code)
flags = cv2.KMEANS_RANDOM_CENTERS

# Apply k-means clustering algorithm
compactness, labels, centers = cv2.kmeans(z, 2, None, criteria, 10, flags)

A = z[labels == 0]
B = z[labels == 1]

# Draw A in red, B in blue, and their centroids in yellow
plt.hist(A, 256, [0, 256], color='r')
plt.hist(B, 256, [0, 256], color='b')
plt.hist(centers, 32, [0, 256], color='y')
plt.title("one feature res")
plt.show()

# 2. Data with multiple features (the T-shirt problem, based on height and weight)
import numpy as np
import cv2
from matplotlib import pyplot as plt

X = np.random.randint(25, 50, (25, 2))  # first group: 25 samples of (height, weight)
Y = np.random.randint(60, 85, (25, 2))  # second group: 25 samples of (height, weight)
Z = np.vstack((X, Y))

# Convert to np.float32
Z = np.float32(Z)

# Define termination criteria and apply k-means clustering
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
ret, label, center = cv2.kmeans(Z, 2, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)

# Separate data and flatten
A = Z[label.ravel() == 0]
B = Z[label.ravel() == 1]

# draw data
plt.scatter(A[:, 0], A[:, 1])
plt.scatter(B[:, 0], B[:, 1], c='r')
plt.scatter(center[:, 0], center[:, 1], s=80, c='y', marker='s')
plt.xlabel('Height'), plt.ylabel('Weight')
plt.title("two features res")
plt.show()

# 3. Color quantization
# Color quantization is the process of reducing the number of colors in an image. One reason to do
# this is to reduce memory use; sometimes a device is so limited that it can only display a limited
# number of colors. In such cases color quantization is performed. Here we use k-means clustering
# for color quantization. There are three features (R, G, B), so the image is reshaped into an
# array of size Mx3 (M is the number of pixels in the image). After clustering, the centroid
# values (also R, G, B) are applied to all pixels, so the resulting image has the specified
# number of colors; finally it is reshaped back to the shape of the original image.
import numpy as np
import cv2

img = cv2.imread('images/ml.jpg')
Z = img.reshape((-1, 3))

# Convert to np.float32
Z = np.float32(Z)

# Define termination criteria and apply k-means clustering
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
K = 15
ret, label, center = cv2.kmeans(Z, K, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)

# Convert back to uint8 and make the original image
center = np.uint8(center)
res = center[label.flatten()]
res2 = res.reshape((img.shape))

cv2.imshow("origin",img)
cv2.imshow('res2', res2)
cv2.waitKey(0)
cv2.destroyAllWindows()
