The KNN algorithm was briefly introduced last time. In short, it computes the distance between a target point and the sample data, selects the k nearest samples, and assigns the target point to the class that occurs most often among them. It is simple to implement and belongs to supervised learning. This post gives a brief introduction to the K-means clustering algorithm which, unlike KNN, is an unsupervised learning method. Two of the big problems in machine learning are classification and clustering. Classification trains a learner on samples with known category labels so that it can then classify samples of unknown category; this is supervised learning. Clustering starts with samples whose categories are not known in advance and uses an algorithm to divide them into several groups; in machine learning this is called unsupervised learning.

K-means clustering algorithm

Clustering groups samples into different categories according to some distance or similarity measure between them. For example, given a data set of 80 two-dimensional points (stored one point per line in dataSet.csv), the first few rows look like this:

1.658985   4.285136
-3.453687  3.424321
4.838138   -1.151539
-5.379713  -3.362104
0.972564   2.924086

The visualization is as follows:
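The scatter plot itself is not reproduced here. Below is a minimal sketch of a show_fig helper that would produce an equivalent plot; the helper is referenced (commented out) in the __main__ block further down, but its body is not part of the excerpt, so treat this as an assumption-laden sketch that simply plots the raw data from dataSet.csv:

import numpy as np
import matplotlib.pyplot as plt

def show_fig():
    # Scatter plot of the raw two-dimensional data before clustering
    dataSet = np.loadtxt("dataSet.csv")
    plt.scatter(dataSet[:, 0], dataSet[:, 1], marker='o')
    plt.title("Raw data")
    plt.show()

With the data in place, the K-means algorithm itself proceeds in four steps: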

  1. Choose k points at random as the initial centers of mass.
  2. For each point in the sample, compute its distance to each of the k centers of mass and assign the point to the class of the nearest one.
  3. Recalculate the center of mass of each of the k resulting classes.
  4. When the centers of mass obtained in step 3 differ only very slightly from the previous ones, the clustering ends. The formulas used are extremely simple; they are sketched just after this list and appear again in the code below.
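For reference, these are the two formulas the steps rely on: step 2 uses the plain Euclidean distance between a point $x$ and a center of mass $c_i$, and step 3 replaces each center of mass by the mean of the points currently assigned to it ($S_i$ denotes the set of points in cluster $i$):

$$
d(x, c_i) = \lVert x - c_i \rVert_2 = \sqrt{\sum_{j}\bigl(x_j - c_{i,j}\bigr)^2},
\qquad
c_i \leftarrow \frac{1}{\lvert S_i \rvert}\sum_{x \in S_i} x
$$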

Python code implementation

# Data initialization
import numpy as np
import random
import re
import matplotlib.pyplot as plt

def loadDataSet():
    dataSet = np.loadtxt("dataSet.csv")
    return dataSet

def initCentroids(dataSet, k):
    # Select k data points at random from the data set as the initial centers of mass
    dataSet = list(dataSet)
    return random.sample(dataSet, k)
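The functions that follow call calcuDistance, whose definition is not included in the excerpt. A minimal sketch, assuming the usual Euclidean distance between two numpy vectors:

def calcuDistance(vec1, vec2):
    # Euclidean distance between two points (1-D numpy arrays of the same length)
    return np.sqrt(np.sum(np.square(vec1 - vec2)))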

Corresponding to step 2: compute the distance from each point to every center of mass, assign the point to the nearest one, and save the result in a dictionary.

def minDistance(dataSet, centroidList):

    # For each item in dataSet, compute the distance between item and each of the k
    # centers in centroidList, find the smallest distance, and add item to that cluster
    clusterDict = dict()  # dict holding the clustering result
    k = len(centroidList)
    for item in dataSet:
        vec1 = item
        flag = -1
        minDis = float("inf")  # initialize to positive infinity, larger than any real distance
        for i in range(k):
            vec2 = centroidList[i]
            distance = calcuDistance(vec1, vec2)  # distance from item to center of mass i
            if distance < minDis:
                minDis = distance
                flag = i  # when the loop ends, flag is the index of the center of mass closest to the current item
        if flag not in clusterDict.keys():
            clusterDict.setdefault(flag, [])
        clusterDict[flag].append(item)  # Add to the appropriate category
    return clusterDict  # Different categories
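A quick illustration of the structure minDistance returns, using a made-up three-point data set and two centers of mass (toyData and toyCentroids are hypothetical names used only for this example):

toyData = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0]])
toyCentroids = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
clusters = minDistance(toyData, toyCentroids)
# clusters is roughly {0: [array([0., 0.]), array([0.1, 0.2])], 1: [array([5., 5.])]}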

def getCentroids(clusterDict):
    # Recalculate k centers of mass
    centroidList = []
    for key in sorted(clusterDict.keys()):  # sort the keys so that centroidList[i] corresponds to cluster i (getVar indexes by key)
        centroid = np.mean(clusterDict[key], axis=0)
        centroidList.append(centroid)
    return centroidList  # get the new center of mass
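np.mean with axis=0 averages the points dimension by dimension, which is exactly the center-of-mass update from step 3. For example:

points = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
print(np.mean(points, axis=0))  # -> [2. 3.], the per-dimension mean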

The sum of distances within each cluster is computed to measure the clustering quality; the smaller it is, the tighter the clusters.

def getVar(centroidList, clusterDict):
    # Compute the total within-cluster distance: for each cluster, sum the distance
    # between every vector in the cluster and its center of mass
    sum = 0.0
    for key in clusterDict.keys():
        vec1 = centroidList[key]
        distance = 0.0
        for item in clusterDict[key]:
            vec2 = item
            distance += calcuDistance(vec1, vec2)
        sum += distance
    return sum
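The test function below also calls showCluster, which is not shown in the excerpt. A minimal sketch, assuming at most four clusters (the test uses k = 4) and the matplotlib import from the top of the code:

def showCluster(centroidList, clusterDict):
    # Plot each cluster in its own color and mark its center of mass with a diamond
    colors = ['or', 'ob', 'og', 'oy']          # circle markers, one color per cluster
    centroidMarks = ['dr', 'db', 'dg', 'dy']   # diamond markers for the centers of mass
    for key in clusterDict.keys():
        plt.plot(centroidList[key][0], centroidList[key][1], centroidMarks[key], markersize=12)
        for item in clusterDict[key]:
            plt.plot(item[0], item[1], colors[key])
    plt.show()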

# Test the clustering effect and visualize it
def test_k_means():
    dataSet = loadDataSet()
    centroidList = initCentroids(dataSet, 4)
    clusterDict = minDistance(dataSet, centroidList)
    # getCentroids(clusterDict)
    # showCluster(centroidList, clusterDict)
    newVar = getVar(centroidList, clusterDict)
    oldVar = 1  # when the change between two consecutive iterations is smaller than a threshold, the centers of mass are essentially fixed

    times = 2
    while abs(newVar - oldVar) >= 0.00001:
        centroidList = getCentroids(clusterDict)
        clusterDict = minDistance(dataSet, centroidList)
        oldVar = newVar
        newVar = getVar(centroidList, clusterDict)
        times += 1
        showCluster(centroidList, clusterDict)

if __name__ == '__main__':
    # show_fig()
    test_k_means()

As mentioned above, clustering is considered complete when the change in the total within-cluster distance between two consecutive iterations is less than 0.00001. Run the function:

As can be seen from the results, the clustering improves with each iteration until the change drops below the threshold and the clustering is complete.

The complete code and data are available on GitHub: github:k-means

Conclusion

  • Unsupervised learning
  • K-means algorithm
  • Minimizing the within-cluster distance (used here as the error measure) characterizes how tightly the vectors in each cluster are grouped around their center of mass.

Related reading: Machine Learning Algorithms and Python Practice (5): K-means Clustering