Original link: tecdat.cn/?p=9997

Original source: Tecdat (拓端数据部落) public account


A brief introduction to k-medoids clustering

K-medoids is another clustering algorithm that can be used to find groups in a dataset. K-medoids clustering is very similar to k-means clustering, differing mainly in its optimization function: instead of minimizing the distance of points to cluster means, it minimizes the distance to representative data points called medoids. In this section, we will study k-medoids clustering.

The k-medoids clustering algorithm

There are several algorithms that can perform k-medoids clustering, of which the simplest and best known is PAM (Partitioning Around Medoids). In PAM, we perform the following steps to find the cluster centers:

  1. Select k data points from the scatter plot as the starting cluster centers.
  2. Calculate the distance of each center from all the points in the scatter plot.
  3. Assign each point to the cluster whose center is closest to it.
  4. In each cluster, select a new center: the point that minimizes the sum of the distances between itself and all the other points in the cluster.
  5. Repeat Steps 2 to 4 until the centers stop changing.

As you can see, the PAM algorithm is identical to the k-means clustering algorithm except for Steps 1 and 4. For most practical purposes, k-medoids clustering gives almost the same results as k-means clustering. But in some special cases, where the data set contains outliers, k-medoids clustering is preferred because it is more robust to outliers.
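To make Step 4 concrete, here is a minimal sketch of the medoid update in R; the function name update_medoid is hypothetical and not part of any package:

    # Minimal sketch of the medoid update (Step 4 above).
    # `cluster_points`: numeric matrix of the points currently assigned to one cluster.
    update_medoid <- function(cluster_points) {
      d <- as.matrix(dist(cluster_points))   # pairwise distances within the cluster
      medoid_index <- which.min(rowSums(d))  # point with the smallest total distance to the rest
      cluster_points[medoid_index, , drop = FALSE]
    }

    # Example: the medoid of the first ten iris points (first two columns)
    update_medoid(as.matrix(iris[1:10, 1:2]))

Because the medoid must be an existing data point, this update is what makes PAM less sensitive to outliers than the mean-based update in k-means.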

K-medoids clustering code

In this section, we will use the same iris data set as in the previous two sections and compare the results to see whether they differ significantly from those obtained earlier.

Implementing k-medoids clustering

In this exercise, we will perform k-medoids clustering using a pre-built R package:

  1. Store the first two columns of the dataset in the iris_data variable:

     

    iris_data <- iris[, 1:2]
  2. Install the package:

     

    install.packages("cluster")
  3. Import the package:

     

    library("cluster")
  4. Store the PAM clustering results in the km variable:

     

    km <- pam(iris_data, 3)
  5. Import libraries:

     

    library("factoextra")
  6. Plot the PAM clustering results:

     

    fviz_cluster(km, data = iris_data, palette = "jco", ggtheme = theme_minimal())

    The output is as follows:

    Figure: Results of k-medoids clustering

The results of the k-medoids clustering are not much different from those of the k-means clustering we performed in the previous section: the PAM algorithm divides our data set into three clusters that are similar to the ones we obtained through k-means clustering.

 

Figure: Results of k-medoids clustering and k-means clustering

In the previous figure, observe how close the centers of the k-means and k-medoids clusters are to each other; however, the k-medoids centers coincide exactly with existing points in the data, whereas the k-means centers do not.

K-means clustering versus k-medoids clustering

Now that we have looked at k-means and k-medoids clustering and seen that their results are almost identical, we will consider the differences between them and when to use each type of clustering:

  • Computational complexity: Of the two methods, k-medoids clustering is more computationally expensive. When our data set is large (> 10,000 points) and we want to save computing time, we prefer k-means clustering over k-medoids clustering.

    Whether a data set counts as large depends entirely on the available computing power.

  • Presence of outliers: K-means clustering is more sensitive to outliers than k-medoids clustering.

  • Cluster centers: The two algorithms find cluster centers in different ways: a k-medoids center (the medoid) is always an actual point in the data set, whereas a k-means center is the mean of the cluster's points and usually is not an existing data point.

Using k-medoids clustering for customer segmentation

Perform k-means and k-medoids clustering on the customer data set, and then compare the results.

Steps:

  1. Select just two columns, Grocery and Frozen, so the clusters are easy to visualize in two dimensions.
  2. Use k-medoids clustering to plot a chart showing four clusters of this data.
  3. Use k-means clustering to plot a four-cluster chart.
  4. Compare the two charts to comment on how the results of the two methods differ.

The result will be a k-means plot of the clusters, as shown below:

 

Figure: The expected k-means plot of the clusters
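The code for this activity is not included above; here is a minimal sketch, under the assumption that the wholesale customers data has been downloaded as a CSV file (the file name below is a placeholder):

    library(cluster)
    library(factoextra)

    # Placeholder file name; columns 5 and 6 are Grocery and Frozen
    ws <- read.csv("wholesale_customers.csv")
    customer_data <- ws[, 5:6]

    # k-medoids (PAM) with four clusters
    km_med <- pam(customer_data, 4)
    fviz_cluster(km_med, data = customer_data, palette = "jco", ggtheme = theme_minimal())

    # k-means with four clusters, for comparison
    km_mean <- kmeans(customer_data, 4)
    fviz_cluster(km_mean, data = customer_data, palette = "jco", ggtheme = theme_minimal())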

Determine the optimal number of clusters

So far, we have been working with the iris data set, where we know how many species of flower there are, and we chose to divide the data set into three clusters based on that knowledge. However, in unsupervised learning our main task is to work with data about which we have no prior information, such as how many natural clusters or categories exist in the data set. Clustering can thus be a form of exploratory data analysis.

Types of clustering metrics

There is more than one method to determine the optimal number of clusters in unsupervised learning. Here’s what we’ll look at in this chapter:

  • Silhouette score
  • Elbow method/WSS
  • Gap statistic

Silhouette score

The silhouette score, or average silhouette score, quantifies the quality of the clusters produced by a clustering algorithm.

The silhouette score lies between -1 and 1. If a cluster's silhouette score is low (between 0 and -1), it indicates that the cluster is spread out or that the distances between its points are high. If a cluster's silhouette score is high (close to 1), it indicates that the cluster is well defined: its points lie close to each other and far from the points of other clusters. Therefore, the ideal silhouette score is close to 1.
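For reference, the silhouette score of a single point $i$ is commonly defined as:

$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$

where $a(i)$ is the mean distance from point $i$ to the other points of its own cluster, and $b(i)$ is the mean distance from $i$ to the points of the nearest other cluster. Averaging $s(i)$ over all points gives the silhouette score of the whole clustering.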

 

Calculating the silhouette score

In this exercise, we will calculate the silhouette score of a data set with a fixed number of clusters:

  1. Place the first two columns of the iris data set (sepal length and sepal width) in the iris_data variable (a consolidated code sketch for these steps follows the list):

  2. Implement k-means clustering:

  3. Store the k-means clustering results in the km.res variable:

  4. Store the pairwise distance matrix of all data points in the pair_dis variable:

  5. Calculate the silhouette score of each point in the data set:

  6. Plot the silhouette scores:

    The output is as follows:

    Figure: The silhouette score of each point in each cluster, represented by one bar per point
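Here is a minimal sketch of what the six steps above might look like, using kmeans from base R together with the cluster and factoextra packages:

    library(cluster)     # provides silhouette()
    library(factoextra)  # provides fviz_silhouette()

    # Step 1: first two columns (sepal length and sepal width)
    iris_data <- iris[, 1:2]

    # Steps 2-3: k-means clustering with three clusters
    km.res <- kmeans(iris_data, 3)

    # Step 4: pairwise distance matrix of all data points
    pair_dis <- dist(iris_data)

    # Step 5: silhouette score of each point
    sil <- silhouette(km.res$cluster, pair_dis)

    # Step 6: one bar per point, grouped by cluster
    fviz_silhouette(sil)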

The previous figure shows an average silhouette score of 0.45 for the data set, along with the average silhouette score of each cluster and the score of each individual point.

We calculated the silhouette score for three clusters. However, to determine the optimal number of clusters, you must calculate the silhouette score for several different numbers of clusters in the data set.

Determine the optimal number of clusters

Calculate the silhouette score for each value of k to determine the optimal number of clusters:

  1. Place the first two columns of the iris data set (sepal length and sepal width) in the iris_data variable (a code sketch follows this list):

  2. Import the factoextra library:

  3. Plot the average silhouette score against the number of clusters (up to 20):

    Note

    In the second argument, you can replace k-means with k-medoids (pam) or any other type of clustering function.

    The output is as follows:

    Figure: Number of clusters versus average silhouette score

From the previous figure, select the value of k with the highest score. According to the silhouette score, the optimal number of clusters is 2.
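A minimal sketch of the steps above, assuming the factoextra function fviz_nbclust is used for the plot:

    library(factoextra)

    # Step 1: first two columns of the iris data set
    iris_data <- iris[, 1:2]

    # Step 3: average silhouette score for up to 20 clusters;
    # replace kmeans with pam (cluster package) for k-medoids
    fviz_nbclust(iris_data, kmeans, method = "silhouette", k.max = 20)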

WSS/elbow method

To identify clusters in the data set, we try to minimize the distance between points within a cluster, and the within-cluster sum of squares (WSS) method measures exactly this. The WSS score is the sum of the squared distances of all points in a cluster from the cluster center.
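In symbols, WSS is commonly written as:

$$ \mathrm{WSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 $$

where $C_k$ is the $k$-th cluster and $\mu_k$ is its center.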

WSS is used to determine the number of clusters

In this exercise, we will see how to use WSS to determine the number of clusters. Perform the following steps.

  1. Place the first two columns of the iris data set (sepal length and sepal width) in the iris_data variable (a code sketch follows this list):

  2. Import libraries:

  3. Plot WSS against the number of clusters:

    The output is as follows:

    Figure: WSS versus the number of clusters
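A minimal sketch of the steps above, again using fviz_nbclust from factoextra (k.max = 20 is an assumption carried over from the previous exercise):

    library(factoextra)

    # Step 1: first two columns of the iris data set
    iris_data <- iris[, 1:2]

    # Step 3: total within-cluster sum of squares for each number of clusters
    fviz_nbclust(iris_data, kmeans, method = "wss", k.max = 20)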

In the previous figure, we can choose the elbow of the plot at k = 3, because after k = 3 the value of WSS starts to fall more slowly. Choosing the elbow of a plot is always a subjective call, and sometimes k = 4 or k = 2 could be chosen instead of k = 3; for this plot, however, it is clear that any k > 5 would be a poor choice, because those points do not lie at the elbow, the place where the slope of the plot changes dramatically.

Gap statistic

The gap statistic is one of the most effective methods for finding the optimal number of clusters in a data set, and it is suitable for any type of clustering method. The gap statistic is calculated by comparing the WSS values of the clusters generated on the observed data set with those of a reference data set that has no obvious clustering.

In a nutshell, the gap statistic measures the WSS values of both the observed data set and a random reference data set, and finds the deviation of the observed data set from the reference. To find the ideal number of clusters, we choose the value of k that gives the largest gap statistic.
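For reference, the gap statistic is commonly defined as:

$$ \mathrm{Gap}(k) = \mathbb{E}^{*}\!\left[\log W_k\right] - \log W_k $$

where $W_k$ is the WSS obtained with $k$ clusters on the observed data, and the expectation is estimated by clustering reference data sets drawn from a distribution with no cluster structure.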

Calculating the ideal number of clusters with the gap statistic

In this exercise, we will use the gap statistic to calculate the ideal number of clusters:

  1. Place the first two columns of the iris data set (sepal length and sepal width) in the iris_data variable (a code sketch follows this list):

  2. Import the factoextra library:

  3. Plot the gap statistic against the number of clusters (up to 20):

    Figure: Gap statistic versus the number of clusters
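A minimal sketch of the steps above, using the gap-statistic method of fviz_nbclust (the number of bootstrap reference sets, nboot, is an assumed value):

    library(factoextra)

    # Step 1: first two columns of the iris data set
    iris_data <- iris[, 1:2]

    # Step 3: gap statistic for up to 20 clusters
    fviz_nbclust(iris_data, kmeans, method = "gap_stat", k.max = 20, nboot = 50)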

As shown in the figure above, the gap statistic is largest at k = 3. Therefore, the ideal number of clusters for this data set is 3.

Find the ideal number of market segments

Use all three methods to find the optimal number of clusters in the customer data set:

Load columns 5 and 6 (Grocery and Frozen) of the wholesale customers data set into a variable.

  1. Calculate the optimal number of clusters for k-means clustering using the silhouette score.
  2. Calculate the optimal number of clusters for k-means clustering using the WSS score.
  3. Calculate the optimal number of clusters for k-means clustering using the gap statistic.

The result will be three plots indicating the optimal number of clusters according to the silhouette score, the WSS score, and the gap statistic.
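A minimal sketch of this activity, under the same assumption as before that the wholesale customers data is available as a CSV file (placeholder file name):

    library(factoextra)

    # Columns 5 and 6: Grocery and Frozen (placeholder file name)
    ws_data <- read.csv("wholesale_customers.csv")[, 5:6]

    # 1. Silhouette score
    fviz_nbclust(ws_data, kmeans, method = "silhouette")

    # 2. WSS / elbow method
    fviz_nbclust(ws_data, kmeans, method = "wss")

    # 3. Gap statistic
    fviz_nbclust(ws_data, kmeans, method = "gap_stat")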

