Original link:tecdat.cn/?p=7275
Original source: Tuoduan Data Tribe official WeChat account
Determining the optimal number of clusters in a data set is a fundamental problem in partitional clustering (such as k-means clustering), which requires the user to specify the number of clusters k to be generated.
A simple and popular solution is to examine the dendrogram produced by hierarchical clustering to see whether it suggests a particular number of clusters. Unfortunately, this approach is also subjective.
We will introduce different methods for determining the optimal number of clusters for k-means, k-medoids (PAM), and hierarchical clustering.
These methods include direct methods and statistical test methods:
- Direct methods: optimize a criterion, such as the within-cluster sum of squares or the average silhouette width. The corresponding methods are called the elbow method and the silhouette method, respectively.
- Statistical test methods: compare the evidence against a null hypothesis. An example is the gap statistic.
In addition to the elbow, silhouette, and gap statistic methods, more than thirty other indices and methods have been published for identifying the optimal number of clusters. We will provide R code that computes all 30 indices and determines the optimal number of clusters using the "majority rule."
For each of the following methods:
- We will describe the basic ideas and algorithms
- We will provide easy-to-use R code with many examples to determine the optimal number of clusters and visualize the output.
The elbow method
Recall that the basic idea behind partitioning methods, such as k-means clustering, is to define clusters so that the total intra-cluster variation [or total within-cluster sum of squares (WSS)] is minimized. The total WSS measures the compactness of the clustering, and we want it to be as small as possible.
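As a quick sanity check of this definition, the total WSS reported by kmeans() can be reproduced by summing squared deviations from each cluster centroid; this is a base-R sketch, and the choice of k = 4 and the seed are arbitrary:

```r
# Verify that kmeans()'s tot.withinss equals the hand-computed total WSS
df <- scale(USArrests)                     # standardized demo data
set.seed(123)
km <- kmeans(df, centers = 4, nstart = 25)
manual_wss <- sum(sapply(1:4, function(k) {
  pts <- df[km$cluster == k, , drop = FALSE]
  sum(sweep(pts, 2, colMeans(pts))^2)      # squared deviations from centroid
}))
all.equal(manual_wss, km$tot.withinss)
```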
The elbow method treats the total WSS as a function of the number of clusters: one should choose a number of clusters such that adding another cluster does not substantially improve the total WSS.
The optimal number of clusters can be determined as follows:
- Run the clustering algorithm (for example, k-means clustering) for different values of k, e.g., varying k from 1 to 10 clusters.
- For each k, calculate the total within-cluster sum of squares (WSS).
- Plot the WSS curve against the number of clusters k.
- The location of the bend (knee) in the curve is generally considered an indicator of the appropriate number of clusters.
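The steps above can be sketched with base R alone (no extra packages); the standardization anticipates the "Data preparation" section below, and the cut-off of 10 clusters is an arbitrary choice:

```r
# Elbow method by hand: total WSS for k = 1..10
df <- scale(USArrests)                     # standardize the demo data
set.seed(123)                              # kmeans() uses random starts
wss <- sapply(1:10, function(k) {
  kmeans(df, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares (WSS)")
```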
Average silhouette method
The average silhouette method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over the range of possible values of k (Kaufman and Rousseeuw 1990).
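A minimal sketch of this method using silhouette() from the bundled cluster package; the variable names and the k = 2..10 search range are my own choices:

```r
library(cluster)                           # silhouette(); ships with R
df <- scale(USArrests)
d  <- dist(df)                             # Euclidean distances
set.seed(123)
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(df, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
(2:10)[which.max(avg_sil)]                 # k maximizing the average silhouette
```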
Gap statistics
This method can be applied to any clustering method.
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimate of the optimal number of clusters is the value that maximizes the gap statistic (that is, the value that yields the largest gap statistic).
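This can be sketched with clusGap() from the bundled cluster package; using B = 50 bootstrap reference samples matches the console output shown later in this article:

```r
library(cluster)                           # clusGap() and maxSE()
df <- scale(USArrests)
set.seed(123)
gap <- clusGap(df, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
# Tibshirani's "firstSEmax" rule: smallest k whose gap statistic is within
# one standard error of the first local maximum
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
```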
Data preparation
We will use the USArrests data as a demonstration dataset. We start by standardizing the data to make the variables comparable.
df <- scale(USArrests)
head(df)
##             Murder Assault UrbanPop     Rape
## Alabama     1.2426   0.783   -0.521 -0.00342
## Alaska      0.5079   1.107   -1.212  2.48420
## Arizona     0.0716   1.479    0.999  1.04288
## Arkansas    0.2323   0.231   -1.074 -0.18492
## California  0.2783   1.263    1.759  2.06782
## Colorado    0.0257   0.399    0.861  1.86497
Silhouette and gap statistic methods
The simplified format of the fviz_nbclust() function is:
fviz_nbclust(x, FUNcluster, method = c("silhouette", "wss", "gap_stat"))
The following R code determines the optimal number of clusters for k-means clustering:
library(factoextra)  # fviz_nbclust(); loads ggplot2 for geom_vline() and labs()
# Elbow method
fviz_nbclust(df, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) +
labs(subtitle = "Elbow method")
# Silhouette method
fviz_nbclust(df, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette method")
# Gap statistic
set.seed(123)
fviz_nbclust(df, kmeans, nstart = 25, method = "gap_stat", nboot = 50) +
labs(subtitle = "Gap statistic method")
## Clustering k = 1,2,..., K.max (= 10): .. done
## Bootstrapping, b = 1,2,..., B (= 50)  [one "." per sample]:
## .................................................. 50
Based on these observations, it is possible to define k = 4 as the optimal number of clusters in the data.
30 indices to select the optimal number of clusters
The NbClust() function takes the following main arguments:
- data: the matrix to be clustered
- diss: the dissimilarity matrix to use. By default diss = NULL, but if a dissimilarity matrix is supplied, distance should be "NULL".
- distance: the distance measure used to compute the dissimilarity matrix. Possible values include "euclidean", "manhattan", or "NULL".
- min.nc, max.nc: the minimum and maximum number of clusters, respectively
- To compute NbClust() for k-means, use method = "kmeans".
- To compute NbClust() for hierarchical clustering, method should be one of c("ward.D", "ward.D2", "single", "complete", "average").
The following R code computes NbClust() for k-means:
library(NbClust)
NbClust(data = df, diss = NULL, distance = "euclidean", min.nc = 2, max.nc = 10, method = "kmeans")
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 10 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 8 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 2 .
According to the majority rule, the optimal number of clusters is 2.
Most popular insights
1. K-Shape algorithm for stock price time-series clustering in R
2. Comparison of different types of clustering methods in R
3. K-medoids clustering and GAM regression for electricity-load time-series data in R
4. Hierarchical clustering of the IRIS dataset in R
5. Monte Carlo K-means clustering in Python
6. Text-mining clustering of website comments with R
7. Python NLP: multi-label text classification with a Keras LSTM neural network
8. MNIST dataset analysis and handwritten-digit classification in R
9. Deep-learning image classification on small datasets with Keras in R