Introduction

I love working on unsupervised learning problems. They provide a completely different challenge from supervised learning problems, and there is much more room to experiment with the data I have. There is no doubt that most of the developments and breakthroughs in machine learning have taken place in the unsupervised space.

One of the most popular techniques in unsupervised learning is clustering. It's a concept we usually learn early on in machine learning, and it's easy to understand. I'm sure you've worked on projects like customer segmentation or market basket analysis.

But here's the thing: clustering has many facets. It isn't limited to the basic algorithms we learned earlier; it is a powerful unsupervised learning technique that we can use effectively in the real world.

The Gaussian mixture model is the clustering algorithm I want to discuss in this article.

Want to predict the sales of your favorite product? Maybe you want to understand customer churn through the lens of different customer groups. Either way, you'll find the Gaussian mixture model very useful.

In this article, we will take a bottom-up approach. We'll first look at the basics of clustering, including a quick recap of the K-means algorithm. Then we'll dive into the concept of Gaussian mixture models and implement them in Python.

Contents

  1. Introduction to clustering
  2. Introduction to K-means clustering
  3. Disadvantages of K-means clustering
  4. Introduction to Gaussian mixture models
  5. The Gaussian distribution
  6. The expectation-maximization (EM) algorithm
  7. Expectation-maximization for Gaussian mixture models
  8. Implementing Gaussian mixture models for clustering in Python

Introduction to clustering

Before we get into the nitty-gritty of Gaussian mixture models, let's quickly refresh some basic concepts.

Note: If you are already familiar with the idea behind clustering and how the K-means clustering algorithm works, you can skip straight to Part 4, "Introduction to Gaussian mixture models."

So, let’s start by formally defining the core idea:

Clustering refers to grouping similar data points together according to their attributes or characteristics.

For example, if we have the income and expenditure of a group of people, we can divide them into the following groups:

  • Earn more, spend more
  • Earn more, spend less
  • Earn less, spend less
  • Earn less, spend more

Each of these groups holds people with similar characteristics, which can be particularly useful in certain cases, such as targeting credit cards or car/home loans. In simple terms, clustering helps us understand our data by grouping similar data points together.

Introduction to K-means clustering

K-means clustering is a distance-based algorithm. This means that it tries to group the nearest points into a cluster.

Let's take a closer look at how this algorithm works. This will establish the basics you need to understand how Gaussian mixture models come into play later in this article.

So, we first define the number of groups we want to divide the population into: this is the value of k. Based on the number of clusters or groups we want, we then randomly initialize k centroids.

These data points are then assigned to the cluster with the nearest centroid. The centroids are then updated, and the data points are reassigned. This process is repeated until the positions of the centroids no longer change.
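If you want to see those steps in code, here is a minimal NumPy sketch of the K-means loop just described. The function name and setup are my own, purely for illustration; for simplicity it assumes no cluster ever ends up empty.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain NumPy K-means: assign points to the nearest centroid, then move the centroids."""
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # pick k random points as initial centroids
    for _ in range(n_iters):
        # assignment step: distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid moves to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break                                          # centroids stopped moving
        centroids = new_centroids
    return labels, centroids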

Note: This is a brief overview of k-means clustering and is sufficient for this article.

Disadvantages of K-means clustering

The k-means clustering concept sounds great, doesn’t it? It is easy to understand, relatively easy to implement, and can be applied to a considerable number of use cases. But there are some drawbacks and limitations that we need to be aware of.

Let's take the income-expenditure example we saw above. The K-means algorithm seems to work pretty well, right? Wait, if you look closely, you will notice that all of the clusters are circular. This is because the centroids of the clusters are updated iteratively using the mean values.

Now, consider the following example where the distribution of points is not circular. What do you think would happen if we used K-means clustering on this data? It still tries to group data points in a circular fashion. That’s not very good.

Therefore, we need a different way to assign clusters to the data points. Instead of using a distance-based model, we will use a distribution-based model. Enter Gaussian mixture models!

Introduction to Gaussian mixture models

The Gaussian mixture model (GMM) assumes that there are a certain number of Gaussian distributions, each representing a cluster. A Gaussian mixture model therefore tends to group together the data points that belong to the same distribution.

Suppose we have three Gaussian distributions (more on that in the next section): GD1, GD2, and GD3. They have means (μ1, μ2, μ3) and variances (σ1², σ2², σ3²), respectively. For a given set of data points, our GMM will identify the probability of each data point belonging to each of these distributions.

Wait a minute, probability?

You read that right! The Gaussian mixture model is a probabilistic model that uses a soft clustering approach to distribute the points among the different clusters. Let me take another example to make it easier to understand.

Here, we have three clusters represented by three colors: blue, green, and cyan. Let's take the data point highlighted in red as an example. The probability that this point belongs to the blue cluster is 1, while the probability that it belongs to the green or cyan cluster is 0.

Now, consider another point, somewhere in between the blue and cyan clusters (highlighted in the figure below). The probability that this point is green is 0, while the probabilities that it is blue and cyan are 0.2 and 0.8, respectively.

The Gaussian mixture model uses soft clustering to assign data points to the Gaussian distribution.
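To make the idea of soft clustering concrete, here is a small sketch using scikit-learn's GaussianMixture on some made-up 2D points. The data and the three-component setup are purely illustrative.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# three made-up, partially overlapping 2D blobs
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.7, size=(100, 2)),
    rng.normal(loc=[4, 0], scale=0.7, size=(100, 2)),
    rng.normal(loc=[2, 3], scale=0.7, size=(100, 2)),
])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# each row sums to 1: a point lying between two clusters gets split probabilities,
# e.g. something like [0.2, 0.8, 0.0] rather than a single hard label
print(gmm.predict_proba(X[:5]).round(3))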

The Gaussian distribution

I'm sure you're familiar with the Gaussian (or normal) distribution. It has a bell-shaped curve, with the data points symmetrically distributed around the mean value. The image below shows several Gaussian distributions with different mean (μ) and variance (σ²) values. Remember, the lower the value of σ, the sharper the curve:

In one-dimensional space, the probability density function of the Gaussian distribution is:

f(x | μ, σ²) = (1 / √(2πσ²)) · exp( −(x − μ)² / (2σ²) )

where μ is the mean and σ² is the variance.
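If you want to play with this density, SciPy implements it directly. The mean and standard deviation below are arbitrary example values.

# Evaluating the 1D Gaussian density with SciPy
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.5                     # arbitrary example values
x = np.linspace(-5, 5, 5)
print(norm.pdf(x, loc=mu, scale=sigma))  # note: scale is the standard deviation σ, not the variance σ²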

But that’s only true in one dimension. In the two-dimensional case, instead of using a 2D bell curve, we use a 3D bell curve, as shown below:

The probability density function is:

f(x | μ, Σ) = (1 / ((2π)^(D/2) · |Σ|^(1/2))) · exp( −(1/2) · (x − μ)ᵀ Σ⁻¹ (x − μ) )

with D = 2 in the two-dimensional case.

where x is the input vector, μ is the 2-dimensional mean vector, and Σ is the 2×2 covariance matrix. The covariance defines the shape of the curve. We can generalize the same formula to the D-dimensional case.

In the general multivariate Gaussian model, x and μ are vectors of length D, and Σ is a D×D covariance matrix.
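Again, SciPy can evaluate this multivariate density for us. The mean vector and covariance matrix here are arbitrary examples.

# Evaluating a 2D (multivariate) Gaussian density with SciPy
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])                # example 2D mean vector
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])           # example 2x2 covariance matrix; off-diagonal terms tilt the bell
x = np.array([1.0, -0.5])
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))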

Therefore, for a dataset with D features, we would have a mixture of k Gaussian distributions (where k is the number of clusters), each with its own mean vector and covariance matrix. But wait, how are the mean and covariance of each distribution determined?

These values are determined using a technique called expectation maximization (EM). Before delving into gaussian mixture models, we need to understand this technique.

The expectation-maximization (EM) algorithm

Expectation maximization (EM) is a statistical algorithm for finding the correct model parameters. We usually use EM when data is missing values, or in other words, when data is incomplete.

These missing variables are called latent (hidden) variables. When dealing with an unsupervised learning problem, the target, that is, the cluster each point belongs to, is unknown.

The absence of these variables makes it difficult to determine the correct model parameters. Think of it this way — if you know which data points belong to which cluster, then you can easily determine the mean vector and the covariance matrix.

Since we do not have values for the latent variables, expectation maximization first tries to use the existing data to determine the best values for these variables, and then finds the model parameters. Based on these model parameters, we go back and update the values of the latent variables, and so on.

Broadly speaking, the expectation maximization algorithm has two steps:

  • E-step: In this step, the available data is used to estimate (guess) the values of the missing (latent) variables
  • M-step: Based on the estimates generated in the E-step, the completed data is used to update the model parameters

Expectation maximization is the basis of many algorithms, including Gaussian mixture models. So how does GMM use the EM concept? How do we apply it to a given set of points? Let's see!

Expectation-maximization for Gaussian mixture models

Let's understand this with another example. I want you to visualize the idea as you read along; it will help you better understand what we are talking about.

Let's say we need to fit k Gaussian distributions to the data. This means there are k means μ1, μ2, …, μk and k covariance matrices Σ1, Σ2, …, Σk. In addition, each distribution has a mixing weight πc that represents the fraction of points in that cluster.

Now, we need to find the values of these parameters to define the Gaussian distributions. We have already determined the number of clusters, and we randomly assign initial values for the means, covariances, and weights. Next, we perform the E-step and the M-step!

E-step:

For each point xi, calculate the probability that it belongs to each of the distributions c1, c2, …, ck. This is done using the following formula, where r(i, c) denotes the probability (responsibility) that point xi belongs to distribution c:

r(i, c) = πc · N(xi | μc, Σc) / [ π1 · N(xi | μ1, Σ1) + … + πk · N(xi | μk, Σk) ]

Here N(x | μ, Σ) is the Gaussian density defined in the previous section.

This value will be higher when the point is assigned to the correct cluster, and lower otherwise.

M-step:

After the E-step, we go back and update the π, μ, and Σ values. They are updated as follows:

  1. The new weight of each cluster is defined as the ratio of the (soft) number of data points in the cluster to the total number of data points N:

     πc = ( Σi r(i, c) ) / N

  2. The mean and covariance matrix of each cluster are updated based on the responsibilities, in proportion to the probability values of the data points. Hence, a data point with a higher probability of being part of a distribution contributes more:

     μc = ( Σi r(i, c) · xi ) / ( Σi r(i, c) )
     Σc = ( Σi r(i, c) · (xi − μc)(xi − μc)ᵀ ) / ( Σi r(i, c) )

Based on the updated values generated in this step, we recalculate the probabilities for each data point and update the parameters again. This process is repeated iteratively in order to maximize the log-likelihood function. In effect, we can say that:

K-means only considers the mean when updating the cluster centers, while GMM takes into account both the mean and the variance (covariance) of the data.
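To tie the E-step and M-step together, here is a compact, from-scratch NumPy sketch of the EM loop for a GMM. The function name, initialization scheme, and the small regularization term are my own choices for illustration; in practice you would rely on a library implementation such as scikit-learn's, which we use in the next section.

import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iters=100, seed=0):
    """Illustrative EM loop for a Gaussian mixture model. X is an (n, d) array."""
    n, d = X.shape
    rng = np.random.RandomState(seed)

    # random initialization of weights, means, and covariances
    pi = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, k, replace=False)]
    sigma = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(d) for _ in range(k)])

    for _ in range(n_iters):
        # E-step: responsibility r(i, c) of each component for each point
        dens = np.array([pi[c] * multivariate_normal.pdf(X, mean=mu[c], cov=sigma[c])
                         for c in range(k)]).T              # shape (n, k)
        resp = dens / dens.sum(axis=1, keepdims=True)

        # M-step: update weights, means, and covariances from the responsibilities
        nk = resp.sum(axis=0)                                # soft count of points per cluster
        pi = nk / n
        mu = (resp.T @ X) / nk[:, None]
        sigma = np.array([
            ((resp[:, c, None] * (X - mu[c])).T @ (X - mu[c])) / nk[c] + 1e-6 * np.eye(d)
            for c in range(k)
        ])
    return pi, mu, sigma, resp

Each pass through the loop performs one E-step (computing r(i, c)) and one M-step (updating π, μ, and Σ), exactly as described above.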

Implementing Gaussian mixture models for clustering in Python

It’s time to dig into the code! This is one of my favorite parts of any article, so let’s get started.

We’ll start by loading the data. This is a temporary file I created – you can download the data from this link: https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2019/10/Clustering_gmm.csv.

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('Clustering_gmm.csv')

plt.figure(figsize=(7, 7))
plt.scatter(data["Weight"], data["Height"])
plt.xlabel('Weight')
plt.ylabel('Height')
plt.title('Data Distribution')
plt.show()

So that's our data. Let's first fit a K-means model to it:

# training the K-means model
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4)
kmeans.fit(data)

# predictions from K-means
pred = kmeans.predict(data)
frame = pd.DataFrame(data)
frame['cluster'] = pred
frame.columns = ['Weight', 'Height', 'cluster']

# plotting the clusters
color = ['blue', 'green', 'cyan', 'black']
for k in range(0, 4):
    cluster_data = frame[frame["cluster"] == k]
    plt.scatter(cluster_data["Weight"], cluster_data["Height"], c=color[k])
plt.show()

The results are not accurate. The K-means model failed to identify the right clusters. Look closely at the clusters in the center: although the data distribution is elliptical, K-means has tried to build circular clusters (remember the drawback we discussed earlier?).

Now let's build a Gaussian mixture model on the same data and see if we can improve on K-means:

import pandas as pd
data = pd.read_csv('Clustering_gmm.csv')

# training the Gaussian mixture model
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=4)
gmm.fit(data)

# predictions from the GMM
labels = gmm.predict(data)
frame = pd.DataFrame(data)
frame['cluster'] = labels
frame.columns = ['Weight', 'Height', 'cluster']

# plotting the clusters
color = ['blue', 'green', 'cyan', 'black']
for k in range(0, 4):
    cluster_data = frame[frame["cluster"] == k]
    plt.scatter(cluster_data["Weight"], cluster_data["Height"], c=color[k])
plt.show()

This is exactly what we were hoping for. The Gaussian mixture model has beaten the K-means model on this dataset.
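And if you want to inspect the soft assignments we discussed earlier rather than just the hard labels, the fitted model exposes them through predict_proba. Here is a small follow-up to the code above:

# soft cluster memberships for the first few points: one probability per cluster
probs = gmm.predict_proba(data)
print(probs[:5].round(3))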

End notes

This was an introduction to Gaussian mixture models. My aim here was to introduce you to this powerful clustering technique and show how effective and efficient it can be compared to traditional algorithms.

I encourage you to pick up a clustering project and try out GMMs there. This is the best way to learn and reinforce a concept, and trust me, you will fully appreciate how useful this algorithm can be.