The KNN algorithm was briefly introduced last time. In short, it computes the distance between the target sample and the training data, selects the k nearest neighbors, and assigns the target sample the label that appears most often among them. The algorithm is simple to implement and is a supervised learning method. This article briefly introduces the K-means clustering algorithm, which, unlike KNN, is an unsupervised learning method. Two of the big problems in machine learning are classification and clustering. Classification trains a learner on samples with known category labels so that it can then classify samples of unknown category; this is supervised learning. Clustering starts with no category labels at all and uses an algorithm to divide a group of unlabeled samples into several groups; in machine learning this is called unsupervised learning.
K-means clustering algorithm
Clustering usually groups samples into different categories according to some distance or similarity measure between them. For example, given a data set of 80 two-dimensional points, the first few records look like this:
 1.658985   4.285136
-3.453687   3.424321
 4.838138  -1.151539
-5.379713  -3.362104
 0.972564   2.924086
The visualization is as follows:
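The original post shows a scatter plot of the raw data at this point. The plotting helper (presumably the show_fig referenced, commented out, in the main block further down) is not included in the source; a minimal sketch of what it might look like, assuming matplotlib and the dataSet.csv file used throughout:

import numpy as np
import matplotlib.pyplot as plt

def show_fig():
    # Hypothetical helper: scatter-plot the raw two-dimensional data
    dataSet = np.loadtxt("dataSet.csv")
    plt.scatter(dataSet[:, 0], dataSet[:, 1], marker='o')
    plt.show()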
The K-means procedure is:

1. Choose k points at random as the initial centroids.
2. For each point in the sample, compute its distance to each of the k centroids and assign the point to the cluster of the nearest centroid.
3. For each of the k clusters obtained this way, recompute the centroid as the mean of its points.
4. When the difference between the centroids computed in step 3 and the previous centroids is very small, the clustering ends. The formulas involved are extremely simple, as the code below shows; a small hand-worked example follows.
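To make these steps concrete, here is a tiny hand-worked example (not from the original post) on four one-dimensional points with k = 2; the same logic applies to the two-dimensional data above:

import numpy as np

points = np.array([1.0, 2.0, 10.0, 11.0])
centroids = np.array([1.0, 2.0])   # step 1: two points picked as initial centroids

for _ in range(3):   # a few iterations are enough for this toy data
    # step 2: assign each point to its nearest centroid
    labels = np.array([np.argmin(np.abs(centroids - p)) for p in points])
    # step 3: recompute each centroid as the mean of the points assigned to it
    centroids = np.array([points[labels == i].mean() for i in range(2)])
    print(centroids)
# step 4: the centroids settle at [1.5, 10.5], i.e. clusters {1, 2} and {10, 11}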
Python code implementation
# Data initialization
import numpy as np
import random
import re
import matplotlib.pyplot as plt
def loadDataSet():
    dataSet = np.loadtxt("dataSet.csv")
    return dataSet
def initCentroids(dataSet, k):
    # Select k samples at random from the dataset as the initial centroids
    dataSet = list(dataSet)
    return random.sample(dataSet, k)
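The distance helper calcuDistance, called throughout the code below, is not listed in the original post; a minimal sketch, assuming plain Euclidean distance is intended:

def calcuDistance(vec1, vec2):
    # Assumed helper: Euclidean distance between two vectors
    return np.sqrt(np.sum(np.square(np.array(vec1) - np.array(vec2))))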
Corresponding to step 2: compute each sample's distance to every centroid, assign the sample to the cluster of the nearest centroid, and store the result in a dictionary.
def minDistance(dataSet, centroidList):
    # For each item in the dataSet, compute its distance to each of the k
    # centroids in centroidList, find the smallest one, and add the item
    # to the corresponding cluster
    clusterDict = dict()  # dict holding the clustering result
    k = len(centroidList)
    for item in dataSet:
        vec1 = item
        flag = -1
        minDis = float("inf")  # initialize to the largest possible value
        for i in range(k):
            vec2 = centroidList[i]
            distance = calcuDistance(vec1, vec2)  # distance to the i-th centroid
            if distance < minDis:
                minDis = distance
                flag = i  # after the loop, flag holds the index of the centroid closest to item
        if flag not in clusterDict.keys():
            clusterDict.setdefault(flag, [])
        clusterDict[flag].append(item)  # add the item to its cluster
    return clusterDict  # clusters keyed by centroid index
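A quick check (not in the original) of what minDistance returns: a dict that maps each centroid index to the list of points assigned to it. Assuming the dataSet.csv file and the helpers above:

dataSet = loadDataSet()
centroidList = initCentroids(dataSet, 4)
clusterDict = minDistance(dataSet, centroidList)
for key in clusterDict.keys():
    print(key, len(clusterDict[key]))   # cluster index and how many points it received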
def getCentroids(clusterDict):
    # Recompute the k centroids
    centroidList = []
    for key in clusterDict.keys():
        centroid = np.mean(clusterDict[key], axis=0)  # mean of all points in the cluster
        centroidList.append(centroid)
    return centroidList  # the new centroids
The within-cluster error (the summed distance from each point to its cluster's centroid) is calculated to measure the clustering quality.
def getVar(centroidList, clusterDict):
    # Compute the clustering error: for every cluster, sum the distance
    # between each vector in the cluster and that cluster's centroid
    sum = 0.0
    for key in clusterDict.keys():
        vec1 = centroidList[key]
        distance = 0.0
        for item in clusterDict[key]:
            vec2 = item
            distance += calcuDistance(vec1, vec2)
        sum += distance
    return sum
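The plotting helper showCluster, called by the test function below, is likewise not listed in the original; a minimal sketch, assuming matplotlib, at most four clusters, and the two-dimensional data used here:

def showCluster(centroidList, clusterDict):
    # Assumed helper: plot each cluster in its own color, centroids as larger diamonds
    colorMark = ['or', 'ob', 'og', 'oy']        # point style per cluster (circle + color)
    centroidMark = ['dr', 'db', 'dg', 'dy']     # centroid style per cluster (diamond + color)
    for key in clusterDict.keys():
        plt.plot(centroidList[key][0], centroidList[key][1],
                 centroidMark[key], markersize=12)
        for item in clusterDict[key]:
            plt.plot(item[0], item[1], colorMark[key])
    plt.show()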
# Test the clustering effect and visualize it
def test_k_means():
    dataSet = loadDataSet()
    centroidList = initCentroids(dataSet, 4)
    clusterDict = minDistance(dataSet, centroidList)
    # getCentroids(clusterDict)
    # showCluster(centroidList, clusterDict)
    newVar = getVar(centroidList, clusterDict)
    oldVar = 1  # when the error of two successive clusterings barely changes, the centroids are considered settled
    times = 2
    while abs(newVar - oldVar) >= 0.00001:
        centroidList = getCentroids(clusterDict)
        clusterDict = minDistance(dataSet, centroidList)
        oldVar = newVar
        newVar = getVar(centroidList, clusterDict)
        times += 1
        showCluster(centroidList, clusterDict)


if __name__ == '__main__':
    # show_fig()
    test_k_means()
As mentioned above, clustering is considered complete when the error of two successive iterations differs by less than 0.00001. Run the function:
As the results show, the clustering improves with each iteration until the change in error drops below the threshold and the clustering is complete.
The complete code and data are available on GitHub: github:k-means
Conclusion
- Unsupervised learning
- K-means algorithm
- Minimize the within-cluster squared error, which characterizes how tightly the vectors in each cluster are grouped.
Machine Learning Algorithms and Python Practice (5): K-means clustering