Scikit-learn divides machine learning into four areas: classification, clustering, regression, and dimensionality reduction. Although k-means is a relatively simple clustering algorithm, it is rich in ideas, which makes it very suitable as an introductory exercise for beginners.
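
For reference, scikit-learn ships a ready-made KMeans estimator. A minimal sketch of its API (the data array here is made up purely for illustration):

from sklearn.cluster import KMeans
import numpy as np

data = np.random.rand(100, 2)                      # toy data: 100 samples, 2 attribute values each
model = KMeans(n_clusters=4, n_init=10).fit(data)  # fit 4 clusters
print(model.labels_[:10])                          # cluster index of the first 10 samples
print(model.cluster_centers_)                      # the 4 learned centroids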

There are plenty of k-means write-ups and reference implementations on the Internet, but their running efficiency leaves something to be desired. With a little spare time today, I wrote a k-means clustering algorithm in fewer than 20 lines of code that takes about 20 milliseconds on average (mean of 10 runs) for 10,000 samples. A popular implementation from the web takes about 3,000 milliseconds on average (mean of 10 runs) on the same data. The gap was more than a hundredfold, which surprised me so much that I couldn't help offering my knees to NumPy once again!

Here is my code. Including comments and blank lines, it comes to about 26 lines in total.

import numpy as np

def kmeans_xufive(ds, k):
    """k-means clustering algorithm

    k  - number of clusters
    ds - ndarray(m, n), dataset of m samples with n attribute values each
    """
    m, n = ds.shape                          # m: number of samples, n: attribute values per sample
    result = np.empty(m, dtype=int)          # clustering result of the m samples
    cores = ds[np.random.choice(np.arange(m), k, replace=False)]  # pick k distinct samples as initial centroids

    while True:                              # iterate
        d = np.square(np.repeat(ds, k, axis=0).reshape(m, k, n) - cores)
        distance = np.sqrt(np.sum(d, axis=2))     # ndarray(m, k), distance of each sample to the k centroids
        index_min = np.argmin(distance, axis=1)   # index of the nearest centroid for each sample

        if (index_min == result).all():      # if no sample changed its cluster
            return result, cores             # return the clustering result and the centroid data

        result[:] = index_min                # reclassify
        for i in range(k):                   # walk through the centroids
            items = ds[result == i]          # subset of samples assigned to the current centroid
            cores[i] = np.mean(items, axis=0)     # the mean of the subset becomes the new centroid position
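The heavy lifting happens in the distance computation: np.repeat(ds, k, axis=0).reshape(m, k, n) - cores builds every sample-to-centroid difference in one C-level operation instead of a Python loop. The same (m, k) distance matrix can also be obtained with plain broadcasting; a small self-contained sketch of my own (not part of the code above):

import numpy as np

m, n, k = 10000, 2, 4
ds = np.random.rand(m, n)                # stand-in samples
cores = ds[:k]                           # stand-in centroids

d = np.square(ds[:, None, :] - cores)    # (m, 1, n) broadcasts against (k, n) to give (m, k, n)
distance = np.sqrt(d.sum(axis=2))        # (m, k), identical to the np.repeat version

This skips the intermediate copy that np.repeat materializes; either way, the per-sample loop never touches the interpreter.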

This is the k-means clustering code that is popular on the Internet: 57 lines in total including comments and blank lines, and 37 lines of effective code.

import numpy as np

# load data
def loadDataSet(fileName):
    data = np.loadtxt(fileName, delimiter='\t')
    return data

# Euclidean distance calculation
def distEclud(x, y):
    return np.sqrt(np.sum((x - y)**2))  # compute the Euclidean distance

# construct a set of k random centroids for a given dataset
def randCent(dataSet, k):
    m, n = dataSet.shape
    centroids = np.zeros((k, n))
    for i in range(k):
        index = int(np.random.uniform(0, m))  # pick a random sample index
        centroids[i, :] = dataSet[index, :]
    return centroids

# k-means clustering
def kmeans_open(dataSet, k):
    m = np.shape(dataSet)[0]  # number of rows
    # the first column stores which cluster the sample belongs to,
    # the second column stores the error from the sample to the cluster center
    clusterAssment = np.mat(np.zeros((m, 2)))
    clusterChange = True

    # step 1: initialize the centroids
    centroids = randCent(dataSet, k)
    while clusterChange:
        clusterChange = False

        # walk through all the samples (rows)
        for i in range(m):
            minDist = 100000.0
            minIndex = -1

            # step 2: walk through all the centroids and find the nearest one
            for j in range(k):
                # compute the Euclidean distance from the sample to the centroid
                distance = distEclud(centroids[j, :], dataSet[i, :])
                if distance < minDist:
                    minDist = distance
                    minIndex = j

            # step 3: update the cluster the sample belongs to
            if clusterAssment[i, 0] != minIndex:
                clusterChange = True
                clusterAssment[i, :] = minIndex, minDist**2

        # step 4: update the centroids
        for j in range(k):
            pointsInCluster = dataSet[np.nonzero(clusterAssment[:, 0].A == j)[0]]  # all the points of cluster j
            centroids[j, :] = np.mean(pointsInCluster, axis=0)  # mean over the rows

    return clusterAssment.A[:, 0], centroids
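Where does the hundredfold gap come from? kmeans_open evaluates m × k distances per iteration, each through an interpreted call to distEclud, while kmeans_xufive does the same work in a few compiled array operations. As a sketch of my own (reusing the variable names from kmeans_open, with stand-in data), the whole inner double loop collapses to:

import numpy as np

dataSet = np.random.rand(10000, 2)   # stand-in data, same shape as the benchmark below
centroids = dataSet[:4]              # any 4 rows serve as centroids for this illustration

distance = np.linalg.norm(dataSet[:, None, :] - centroids, axis=2)  # (m, k), replaces both loops
minIndex = distance.argmin(axis=1)                                  # step 2 for all samples at once
minDist = distance[np.arange(len(dataSet)), minIndex]               # distance to the chosen centroid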

The create_data_set() function is used to generate the test data. Each element of cores is a triple: the x and y coordinates of a centroid and the number of data points to generate around that centroid.

def create_data_set(*cores):
    """Generate a test dataset for k-means clustering"""
    ds = list()
    for x0, y0, z0 in cores:
        x = np.random.normal(x0, 0.1 + np.random.random()/3, z0)
        y = np.random.normal(y0, 0.1 + np.random.random()/3, z0)
        ds.append(np.stack((x, y), axis=1))

    return np.vstack(ds)

The test code is as follows:

import time
import matplotlib.pyplot as plt

k = 4
ds = create_data_set((0, 0, 2500), (0, 2, 2500), (2, 0, 2500), (2, 2, 2500))

t0 = time.time()
result, cores = kmeans_xufive(ds, k)
t = time.time() - t0

plt.scatter(ds[:, 0], ds[:, 1], s=1, c=result.astype(int))
plt.scatter(cores[:, 0], cores[:, 1], marker='x', c=np.arange(k))
plt.show()

print(u'Using the kmeans_xufive algorithm, 10,000 sample points took %f seconds' % t)

t0 = time.time()
result, cores = kmeans_open(ds, k)
t = time.time() - t0

plt.scatter(ds[:, 0], ds[:, 1], s=1, c=result.astype(int))
plt.scatter(cores[:, 0], cores[:, 1], marker='x', c=np.arange(k))
plt.show()

print(u'Using the kmeans_open algorithm, 10,000 sample points took %f seconds' % t)

The test results are as follows:

PS D:\XufiveGit\CSDN\code> py -3 .\k-means.py
Using the kmeans_xufive algorithm, 10,000 sample points took 0.015655 seconds
Using the kmeans_open algorithm, 10,000 sample points took 3.999089 seconds

The effect is as follows: