[Machine learning]: The principle of the k-means clustering algorithm (implemented with Python code)

This algorithm is the k-means clustering algorithm. To make it easier to understand, we first walk through how it works in the special case of two dimensions.

Step 1: Randomly generate the centroids

Since this is an unsupervised learning algorithm, we start by scattering a set of random points on a two-dimensional plane and then picking two centroids at random. The goal of the algorithm is to divide this pile of points into two classes based on their coordinates alone, so once the two centroids are chosen, every point can be assigned to one of them, splitting the points into two piles. As shown below:
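This initialization step can be sketched in a few lines of NumPy. The point data, the seed, and k = 2 here are made-up toy values; a common simple strategy, shown here, is to pick k of the data points themselves as the starting centroids:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-dimensional point cloud (values are illustrative).
points = rng.normal(size=(20, 2))

# Step 1: pick k = 2 distinct random points from the data as initial centroids.
k = 2
initial_indices = rng.choice(len(points), size=k, replace=False)
centroids = points[initial_indices]

print(centroids.shape)  # (2, 2): two centroids, each with an x and y coordinate
```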

Step 2: Classify by distance

The red and blue dots represent our two randomly chosen centroids. Since we want to divide the points into two piles, with each point assigned to its nearest centroid, we first compute how far every point is from each centroid. If a point is closer to the red centroid than to the blue one, it is classified with the red centroid, and vice versa, as shown in the figure:
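In code, this assignment step looks like the following. The four points and two centroids are made-up toy values standing in for the "red" and "blue" centroids above:

```python
import numpy as np

points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])  # the "red" and "blue" centroids

# Distance from every point to every centroid: shape (n_points, k).
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Step 2: each point joins the cluster of its nearest centroid.
labels = distances.argmin(axis=1)
print(labels)  # [0 0 1 1]
```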

Step 3: Find the mean of the points in each class and update the centroid positions

In this step, we average the x and y values of all the points in the same class; the resulting (x, y) value is the position of our new centroid, as shown in the figure:

We can see that the position of the centroid has changed.
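The update step can be sketched like this. The points and labels are the same kind of toy values as in the assignment example; each new centroid is simply the mean of its cluster's points:

```python
import numpy as np

points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])  # the cluster assignments from step 2
k = 2

# Step 3: the new centroid of each cluster is the mean of its points.
new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
print(new_centroids)  # approximately [[0.1, 0.05], [5.05, 4.95]]
```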

Step 4: Repeat steps 2 and 3

We repeat the operations of steps 2 and 3: assign each point to its nearest centroid, then update the centroid positions. This continues until we hit a cap on the number of iterations (which we can set ourselves, for example 10000), or until the centroid positions stop changing between two consecutive iterations, as shown below:

At this point, with no supervision at all, we have divided this pile of points into two categories according to their own features!
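Putting steps 1 to 4 together, the whole loop can be sketched as follows. This is a minimal sketch, not a production implementation: in particular it does not handle the rare case where a cluster ends up empty, which a real implementation must deal with.

```python
import numpy as np

def kmeans(points, k, max_iter=10000, seed=0):
    """Minimal k-means sketch: repeat assignment (step 2) and update (step 3)
    until the centroids stop moving or the iteration cap is reached."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        distances = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its cluster.
        new_centroids = np.array(
            [points[labels == j].mean(axis=0) for j in range(k)]
        )
        # Step 4: stop early once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

For two well-separated piles of toy points, `kmeans(points, 2)` recovers the two piles without ever seeing a label.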

Step 5: How do we cluster when a single point has many features? First we introduce a concept, the Euclidean distance. It is defined as follows, and is easy to understand:

In other words, the Euclidean distance d(xi, xj) is the square root of the sum, over every dimension, of the squared difference between the two points' coordinates in that dimension: d(xi, xj) = sqrt( Σ_k (xi_k − xj_k)² ). It is easy to understand.
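In code, this distance generalizes directly to any number of features. A minimal sketch:

```python
import math

def euclidean_distance(xi, xj):
    """d(xi, xj): for every feature, square the difference between the two
    points, sum the squares, and take the square root. Works for points
    with any number of features, not just (x, y)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

print(euclidean_distance([0, 0], [3, 4]))        # 5.0 (the 3-4-5 triangle)
print(euclidean_distance([1, 2, 3], [4, 6, 3]))  # 5.0, now in three dimensions
```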

We can also understand the k-means algorithm another way: clustering is achieved by minimizing the variance between each centroid and the points assigned to it, as shown in the figure below:

Problem solved!
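This variance-minimization view corresponds to the within-cluster sum of squared distances (scikit-learn calls this quantity "inertia"), which every k-means iteration drives down. A small sketch with the same toy values as before:

```python
import numpy as np

points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.1, 0.05], [5.05, 4.95]])

# Within-cluster sum of squared distances: the objective k-means minimizes.
inertia = sum(
    np.sum((points[labels == j] - centroids[j]) ** 2) for j in range(2)
)
print(round(inertia, 4))  # 0.035
```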

Step 6: Code implementation

We now implement the k-means algorithm in Python. We import the make_blobs function from sklearn.datasets to generate a dataset, and receive its result in two variables, X and y: X holds the data points, and y holds the category each point truly belongs to. Of course, real data would not tell us which point belongs to which category; we would only have X. But make_blobs returns both values, so we must accept both, not just X. Here's the code:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

plt.figure(figsize=(15, 15))  # specify the size of our drawing

# Draw 1600 samples with a fixed random state; this dataset has
# three data centers, i.e. three clusters
X, y = make_blobs(n_samples=1600, random_state=170)
y_pred = KMeans(n_clusters=3, random_state=170).fit_predict(X)

plt.subplot(221)  # the first of the four squares
plt.scatter(X[:, 0], X[:, 1], c=y_pred)  # plot feature 0 against feature 1
plt.title("The result of the Kmeans")

plt.subplot(222)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title("The real result of the Kmeans")

# Stretch the data with a linear transformation
array = np.array([[0.60834549, 0.63667341], [0.40887178, 0.85253229]])
lashen = np.dot(X, array)
y_pred = KMeans(n_clusters=3, random_state=170).fit_predict(lashen)

plt.subplot(223)  # the third of the four squares
plt.scatter(lashen[:, 0], lashen[:, 1], c=y_pred)
plt.title("The result of the transformed data")
plt.show()
```

When we use the scatter function to plot, we write the code according to the shape of our data. Here X is a dataset with 1600 rows, because we drew 1600 samples, and each sample has only two features, namely its coordinates on the x and y axes. So X is a two-dimensional ndarray object (ndarray is the array type in NumPy), and we can print it out to inspect the data, as shown below:

We can also see that y is an ndarray object. Since make_blobs generates three centers (that is, three clusters) by default, y can only take the values 0, 1, and 2. We use matplotlib to draw the classification results; running the code above produces the following output:
