Clustering, or cluster analysis, is an unsupervised learning problem. It is often used as a data analysis technique for discovering interesting patterns in data, such as groups of customers based on their behavior. There are many clustering algorithms to choose from, and no single best clustering algorithm for all cases. Instead, it is a good idea to explore a range of clustering algorithms and different configurations for each algorithm. In this tutorial, you will discover how to install and use top clustering algorithms in Python. After completing this tutorial, you will know:

  • Clustering is the unsupervised problem of finding natural groups in the feature space of input data.

  • There are many different clustering algorithms, and no single best method for all datasets.

  • How to implement, fit, and use top clustering algorithms in Python with the scikit-learn machine learning library.

This tutorial is divided into three parts:

  1. Clustering
  2. Clustering Algorithms
  3. Examples of Clustering Algorithms
    • Library Installation
    • Clustering Dataset
    • Affinity Propagation
    • Agglomerative Clustering
    • BIRCH
    • DBSCAN
    • K-Means
    • Mini-Batch K-Means
    • Mean Shift
    • OPTICS
    • Spectral Clustering
    • Gaussian Mixture Model

1. Clustering

Cluster analysis, or clustering, is an unsupervised machine learning task. It involves automatically discovering natural groupings in data. Unlike supervised learning (like predictive modeling), clustering algorithms only interpret the input data and find natural groups or clusters in the feature space.

Clustering techniques are useful when there is no class to predict but rather when the instances are to be divided into natural groups. — From: Data Mining: Practical Machine Learning Tools and Techniques, 2016.

A cluster is often an area of density in the feature space where examples from the domain (observations or rows of data) are closer to each other than to examples in other clusters. A cluster may have a center (the centroid) that is a sample or a point in the feature space, and may have a boundary or extent.

These clusters may reflect some mechanism at work in the domain from which instances are drawn, a mechanism that makes some instances more similar to each other than they are to the rest. — From: Data Mining: Practical Machine Learning Tools and Techniques, 2016.

Clustering can be helpful as a data analysis activity for learning more about the problem domain, so-called pattern discovery or knowledge discovery. For example:

  • The phylogenetic (evolutionary) tree can be regarded as the result of a manual cluster analysis;
  • Separating normal data from outliers or anomalies may be considered a clustering problem;
  • Separating customers into groups based on their natural behavior is a clustering problem, referred to as market segmentation.

Clustering can also be useful as a type of feature engineering, where existing and new examples can be mapped onto and labeled as belonging to one of the clusters identified in the data. Although many cluster-specific quantitative measures do exist, evaluation of the identified clusters is subjective and may require a domain expert. Typically, clustering algorithms are compared academically on synthetic datasets with pre-defined clusters that an algorithm is expected to discover.

Clustering is an unsupervised learning technique, so it is hard to evaluate the quality of the output of any given method. — From: Machine Learning: A Probabilistic Perspective, 2012.
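
To make the feature-engineering use case concrete, here is a minimal sketch of tagging each example with its cluster and appending that label as a new input feature. The choice of KMeans with two clusters is an illustrative assumption, not a recommendation from this tutorial.

# a minimal sketch: using cluster membership as an engineered feature
# (KMeans with two clusters is an arbitrary, illustrative choice)
from numpy import hstack
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
# define a dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# fit a clustering model and get a cluster label for each example
model = KMeans(n_clusters=2)
labels = model.fit_predict(X)
# append the cluster label as an additional input feature
X_new = hstack((X, labels.reshape(-1, 1)))
print(X.shape, X_new.shape)

New examples could be labeled the same way by calling predict() on the fitted model before appending the label.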

2. Clustering Algorithms

There are many types of clustering algorithms. Many algorithms use similarity or distance measures between examples in the feature space in an effort to discover dense regions of observations. As such, it is often good practice to scale the data prior to using clustering algorithms.
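
For example, a minimal sketch of scaling the data before clustering might look like the following; MinMaxScaler and KMeans are arbitrary illustrative choices, and any scaler or clustering algorithm could be substituted.

# a minimal sketch: scale inputs before clustering
# (MinMaxScaler and KMeans are arbitrary, illustrative choices)
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
# define a dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# scale each input variable to the range [0, 1]
X_scaled = MinMaxScaler().fit_transform(X)
# cluster the scaled data
yhat = KMeans(n_clusters=2).fit_predict(X_scaled)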

Central to all of the goals of cluster analysis is the notion of the degree of similarity (or dissimilarity) between the individual objects being clustered. A clustering method attempts to group the objects based on the definition of similarity supplied to it. — From: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2016.

Some clustering algorithms require you to specify or guess the number of clusters to discover in the data, whereas others require the specification of a minimum distance between observations at which examples may be considered "close" or "connected." As such, cluster analysis is an iterative process in which subjective evaluation of the identified clusters is fed back into changes to the algorithm configuration until a desired or appropriate result is achieved. The scikit-learn library provides a suite of different clustering algorithms to choose from. Here are 10 popular ones:

  1. Affinity Propagation
  2. Agglomerative Clustering
  3. BIRCH
  4. DBSCAN
  5. K-Means
  6. Mini-Batch K-Means
  7. Mean Shift
  8. OPTICS
  9. Spectral Clustering
  10. Gaussian Mixture

Each algorithm offers a different approach to the challenge of discovering natural groups in data. There is no best clustering algorithm, and no easy way to find the best algorithm for your data without using controlled experiments. In this tutorial, we will review how to use each of these 10 popular clustering algorithms from the scikit-learn library. The examples will provide a basis for you to copy-paste the examples and test the methods on your own data. We will not dive into the theory behind how the algorithms work or compare them directly. Let's dig in.
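
As one possible shape for such a controlled experiment, the sketch below fits a few candidate algorithms on the same data and scores each result with the silhouette coefficient. The candidate models and the use of the silhouette score are illustrative assumptions on my part, not part of this tutorial; substitute whatever configurations and evaluation measure suit your problem.

# a minimal sketch: a controlled comparison of several clustering algorithms
# (the candidate models and the silhouette score are illustrative assumptions)
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans, AgglomerativeClustering, Birch
from sklearn.metrics import silhouette_score
# define a dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# candidate algorithms and configurations
models = {
    'kmeans': KMeans(n_clusters=2),
    'agglomerative': AgglomerativeClustering(n_clusters=2),
    'birch': Birch(threshold=0.01, n_clusters=2),
}
# fit each model and report its silhouette score (higher is generally better)
for name, model in models.items():
    yhat = model.fit_predict(X)
    print(name, silhouette_score(X, yhat))

Keep in mind that any single score can be misleading, so visual review of the resulting clusters, as done throughout this tutorial, remains useful.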

3. Examples of Clustering Algorithms

In this section, we will review how to use 10 popular clustering algorithms in scikit-learn. This includes an example of fitting the model and an example of visualizing the result. The examples are designed for you to copy-paste into your own project and apply the methods to your own data.

1. Library Installation

First, let’s install the library. Don’t skip this step, as you will need to ensure you have the latest version installed. You can install the scikit-learn library using the pip Python installer, as follows:

sudo pip install scikit-learn

Next, let’s confirm that the library is installed and that you are using a modern version. Run the following script to output the library version number.

import sklearn
print(sklearn.__version__)

When you run the example, you should see the following version number or higher.

0.22.1

2. Clustering Dataset

We will use the make_classification() function to create a test binary classification dataset. The dataset will have 1,000 examples, with two input features and one cluster per class. The clusters are visually obvious in two dimensions, so we can plot the data with a scatter plot and color the points by their assigned cluster. This will help to see, at least on the test problem, how well the clusters can be identified. The clusters in this test problem are based on a multivariate Gaussian, and not all clustering algorithms are effective at identifying these types of clusters. As such, the results in this tutorial should not be used as a basis for comparing the methods in general. An example of creating and summarizing the synthetic clustering dataset is listed below.

# synthetic classification dataset
from numpy import where
from sklearn.datasets import make_classification
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# create scatter plot for samples from each class
for class_value in range(2):
    # get row indexes for samples with this class
    row_ix = where(y == class_value)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example creates the synthetic clustering dataset, then creates a scatter plot of the input data with points colored by class label (the idealized clusters). We can clearly see two distinct groups of data in two dimensions, and the hope is that an automatic clustering algorithm can detect these groupings.

Scatter Plot of Synthetic Clustering Dataset With Points Colored by Known Cluster

Next, we can start looking at examples of clustering algorithms applied to this dataset. I have made some minimal attempts to tune each method to the dataset.

3. Affinity Propagation

Affinity propagation involves finding a set of exemplars that best summarize the data.

We devised a method called "affinity propagation," which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. — From: Clustering by Passing Messages Between Data Points, 2007.

It is implemented via the AffinityPropagation class, and the main configuration to tune is the "damping" hyperparameter, set between 0.5 and 1, and perhaps "preference." The complete example is listed below.

# affinity propagation clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import AffinityPropagation
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = AffinityPropagation(damping=0.9)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster. In this case, I could not achieve a good result.

Scatter Plot of Dataset With Clusters Identified Using Affinity Propagation

4. Agglomerative Clustering

Agglomerative clustering involves merging examples until the desired number of clusters is achieved. It is part of a broader class of hierarchical clustering methods, and is implemented via the AgglomerativeClustering class. The main configuration is the "n_clusters" hyperparameter, set to an estimate of the number of clusters in the data, e.g., 2. The complete example is listed below.

# agglomerative clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import AgglomerativeClustering
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = AgglomerativeClustering(n_clusters=2)
# fit the model and assign a cluster to each example
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster. In this case, a reasonable grouping is found.

Scatter Plot of Dataset With Clusters Identified Using Agglomerative Clustering

5. BIRCH

BIRCH clustering (BIRCH is short for Balanced Iterative Reducing and Clustering using Hierarchies) involves constructing a tree structure from which cluster centroids are extracted.

BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). — From: BIRCH: An Efficient Data Clustering Method for Very Large Databases, 1996.

It is implemented via the Birch class, and the main configuration to tune is the "threshold" and "n_clusters" hyperparameters, the latter of which provides an estimate of the number of clusters. The complete example is listed below.

# birch clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import Birch
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = Birch(threshold=0.01, n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster. In this case, a good grouping is found.

Scatter Plot of Dataset With Clusters Identified Using BIRCH Clustering

6. DBSCAN

DBSCAN clustering (where DBSCAN is short for Density-Based Spatial Clustering of Applications with Noise) involves finding high-density areas in the domain and expanding those areas of the feature space around them into clusters.

… we present the new clustering algorithm DBSCAN, relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. — From: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996.

It is implemented via the DBSCAN class, and the main configuration to tune is the "eps" and "min_samples" hyperparameters. The complete example is listed below.

# dbscan clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import DBSCAN
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = DBSCAN(eps=0.30, min_samples=9)
# fit the model and assign a cluster to each example
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster. In this case, a reasonable grouping is found, although more tuning is required.

Scatter Plot of Dataset With Clusters Identified Using DBSCAN Clustering

7. K-Means

K-Means clustering may be the most widely known clustering algorithm, and involves assigning examples to clusters in an effort to minimize the variance within each cluster.

The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called "k-means," appears to give partitions which are reasonably efficient in the sense of within-class variance. — From: Some Methods for Classification and Analysis of Multivariate Observations, 1967.

It is implemented via the KMeans class, and the main configuration to tune is the "n_clusters" hyperparameter, set to the estimated number of clusters in the data. The complete example is listed below.

# k-means clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = KMeans(n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster. In this case, a reasonable grouping is found, although the unequal variance in each dimension makes the method less suited to this dataset.

Scatter Plot of Dataset With Clusters Identified Using K-Means Clustering

8. Mini-Batch K-Means

Mini-Batch K-Means is a modified version of K-Means that updates the cluster centroids using mini-batches of samples rather than the entire dataset, which can make it faster for large datasets and perhaps more robust to statistical noise.

… we propose the use of mini-batch optimization for k-means clustering. This reduces computation cost by orders of magnitude compared to the classic batch algorithm, while yielding significantly better solutions than online stochastic gradient descent. — From: Web-Scale K-Means Clustering, 2010.

It is implemented via the MiniBatchKMeans class, and the main configuration to tune is the "n_clusters" hyperparameter, set to the estimated number of clusters in the data. The complete example is listed below.

# mini-batch k-means clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import MiniBatchKMeans
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = MiniBatchKMeans(n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster. In this case, a result equivalent to the standard k-means algorithm is found.

Scatter Plot of Dataset With Clusters Identified Using Mini-Batch K-Means Clustering

9. Mean Shift

Mean shift clustering involves finding and adapting centroids based on the density of examples in the feature space.

Proof is shown for the convergence of a recursive mean shift procedure to the nearest stationary point of the underlying density function and, thus, its utility in detecting the modes of the density is demonstrated. — From: Mean Shift: A Robust Approach Toward Feature Space Analysis, 2002.

It is implemented through the MeanShift class, with the primary configuration being the “bandwidth” hyperparameter. The complete example is listed below.

# mean shift clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import MeanShift
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = MeanShift()
# fit the model and assign a cluster to each example
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster. In this case, a reasonable set of clusters is found in the data.

Scatter Plot of Dataset With Clusters Identified Using Mean Shift Clustering

10. OPTICS

OPTICS clustering (where OPTICS is short for Ordering Points To Identify the Clustering Structure) is a modified version of DBSCAN described above.

We introduce a new algorithm for the purpose of cluster analysis which does not produce a clustering of a data set explicitly; but instead creates an augmented ordering of the database representing its density-based clustering structure. This cluster-ordering contains information which is equivalent to the density-based clusterings corresponding to a broad range of parameter settings. — From: OPTICS: Ordering Points To Identify the Clustering Structure, 1999.

It is implemented via the OPTICS class, and the main configuration to tune is the "eps" and "min_samples" hyperparameters. The complete example is listed below.

# optics clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import OPTICS
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = OPTICS(eps=0.8, min_samples=10)
# fit the model and assign a cluster to each example
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster. In this case, I could not achieve a reasonable result on this dataset.

Scatter Plot of Dataset With Clusters Identified Using OPTICS Clustering

11. Spectral Clustering

Spectral clustering is a general class of clustering methods drawn from linear algebra.

A promising alternative that has recently emerged in a number of fields is to use spectral methods for clustering. Here, one uses the top eigenvectors of a matrix derived from the distance between points. — From: On Spectral Clustering: Analysis and an Algorithm, 2002.

It is implemented via the SpectralClustering class, and the main configuration to tune is the "n_clusters" hyperparameter, used to specify the estimated number of clusters in the data. The complete example is listed below.

# spectral clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import SpectralClustering
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = SpectralClustering(n_clusters=2)
# fit the model and assign a cluster to each example
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster. In this case, reasonable clusters are found.

Scatter Plot of Dataset With Clusters Identified Using Spectral Clustering

12. Gaussian Mixture Model

A Gaussian mixture model summarizes a multivariate probability density function with a mixture of Gaussian probability distributions, as its name suggests. It is implemented via the GaussianMixture class, and the main configuration to tune is the "n_components" hyperparameter, used to specify the estimated number of clusters in the data. The complete example is listed below.

# gaussian mixture clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = GaussianMixture(n_components=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster. In this case, we can see that the clusters were identified perfectly. This is not surprising, as the dataset was generated as a mixture of Gaussians.

Scatter Plot of Dataset With Clusters Identified Using Gaussian Mixture Clustering

4. Summary

In this tutorial, you discovered how to install and use top clustering algorithms in Python. Specifically, you learned:

  • Clustering is the unsupervised problem of finding natural groups in the feature space of input data.
  • There are many different clustering algorithms and there is no single best method for all data sets.
  • How to implement, fit, and use top clustering algorithms in Python with the scikit-learn machine learning library.