Clustering is a common and basic approach to understanding big data. Recently, data scientist and programmer Peter Gleeson published an in-depth article on freeCodeCamp, giving a basic introduction to some clustering algorithms and explaining how they work with simple and detailed examples.

Look at the picture below. There are all kinds of bugs and snails. Can you try grouping them into different groups?

It’s not too difficult. Let’s start by picking out the spiders!

Are you done? Although there isn’t necessarily a “right answer” here, in general we can divide the bugs into four groups: spiders, snails, butterflies/moths, and bees/wasps.

Easy, right? You could probably double the number of bugs and still tell them apart. With a little time and a passion for entomology, you could even sort through thousands of bugs this way.

But it is not easy for a machine to sort those 10 objects into meaningful groups. A branch of mathematics called combinatorics tells us there are 115,975 different ways to group those 10 bugs. If the number of bugs rises to 20, there are more than 50 trillion possible groupings. With 100 bugs, the number of possible solutions exceeds the number of particles in the known universe. How much greater? By my calculation, about 500,000,000,000,000,000,000,000,000,000,000,000 times greater: a truly astronomical number!
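(For the curious: these counts are the Bell numbers, which count the ways to partition a set. As a quick sketch of how to check them yourself, assuming SymPy is installed:)

from sympy import bell

print(bell(10))    # 115975 ways to group 10 bugs
print(bell(20))    # 51724158235372, i.e. more than 50 trillion
print(bell(100))   # roughly 4.8 * 10**115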

But most of these grouping schemes are meaningless, and you can only find a few useful ways to group bugs in a sea of grouping options.

We humans can do it quickly, and we tend to take for granted our ability to group quickly and understand large amounts of data. Whether it’s a piece of text, an image on a screen, or a sequence of objects, humans are usually very effective at understanding the data they’re presented with.

Given that the key to AI and machine learning is to quickly understand large amounts of input data, what are the shortcuts to developing these technologies? In this article, you’ll read about three clustering algorithms that machines can use to quickly understand large data sets. Of course, there are other algorithms out there, but hopefully this will give you a good start!

In this article, I will give an overview of each clustering algorithm, a brief description of how it works, and a more detailed step-by-step implementation case. I’m sure this will help you understand these algorithms.

Three neat clusters, K=3

K-means clustering

When to use it?

When you know in advance how many groups you expect to find.

How it works

The algorithm randomly assigns each observation to one of k groups, then calculates the mean of each group. Next, it reassigns each observation to the group whose mean is closest to it, and then recalculates the means. These steps are repeated until no reassignments are needed.

Worked example

Suppose you have a group of nine footballers, each of whom has scored a certain number of goals this season (say, somewhere between 3 and 30). We want to divide them into separate groups, say three.

Step 1: We need to randomly divide these athletes into 3 groups and calculate the mean value of each group.

Group 1: Player A (5 goals), Player B (20 goals), Player C (11 goals). Group mean = (5 + 20 + 11) / 3 = 12
Group 2: Player D (5 goals), Player E (9 goals), Player F (19 goals). Group mean = (5 + 9 + 19) / 3 = 11
Group 3: Player G (30 goals), Player H (3 goals), Player I (15 goals). Group mean = (30 + 3 + 15) / 3 = 16

Step 2: For each player, reassign them to the group whose mean is closest to their goal count. For example, Player A (5 goals) is reassigned to Group 2 (mean = 11). Then we recalculate the group means.

Group 1 (original mean = 12): Player C (11 goals), Player E (9 goals). New mean = (11 + 9) / 2 = 10
Group 2 (original mean = 11): Player A (5 goals), Player D (5 goals), Player H (3 goals). New mean = 4.33
Group 3 (original mean = 16): Player B (20 goals), Player F (19 goals), Player G (30 goals), Player I (15 goals). New mean = 21

Repeat the second step until the mean of each group has stopped changing. For this simple task, the next iteration will achieve our goal. Now that you’re done, you’ve got 3 clusters from the original data set!

Group 1 (original mean = 10): Player C (11 goals), Player E (9 goals), Player I (15 goals). Final mean = 11.7
Group 2 (original mean = 4.33): Player A (5 goals), Player D (5 goals), Player H (3 goals). Final mean = 4.33
Group 3 (original mean = 21): Player B (20 goals), Player F (19 goals), Player G (30 goals). Final mean = 23
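The whole procedure fits in a few lines of Python. Here is a minimal sketch of the steps above, applied to the same nine players (the function and variable names are my own, not from the original article):

import random

goals = {"A": 5, "B": 20, "C": 11, "D": 5, "E": 9, "F": 19, "G": 30, "H": 3, "I": 15}

def k_means_1d(data, k, seed=0):
    random.seed(seed)
    players = list(data)
    # Step 1: randomly assign each player to one of the k groups.
    assignment = {p: random.randrange(k) for p in players}
    while True:
        # Calculate the mean of each (non-empty) group.
        means = {}
        for g in range(k):
            members = [data[p] for p in players if assignment[p] == g]
            if members:
                means[g] = sum(members) / len(members)
        # Step 2: reassign each player to the group with the closest mean.
        new_assignment = {p: min(means, key=lambda g: abs(data[p] - means[g]))
                          for p in players}
        if new_assignment == assignment:   # stop once nothing moves
            return assignment, means
        assignment = new_assignment

clusters, means = k_means_1d(goals, k=3)
print(clusters)
print(means)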

In this case, the clusters might correspond to the players’ positions on the field, such as defence, midfield, and attack. K-means works here because we can reasonably expect the data to fall naturally into these three groups.

In this way, when given a set of performance statistics, the machine can do a good job of estimating the positions of the players on any football team — either for sports analysis or for any other classification task for the purpose of classifying data sets into predefined groups.

More subtle details:

There are also some variations of the algorithm described above. The initial “seeding” of the clusters can be done in several ways. Here, we randomly assigned every player to a group and then calculated the group means. That causes the initial means to be close to one another, which tends to increase the number of subsequent steps needed.

Another way to seed the clusters is to start with just one player per group and then assign the remaining players to the group closest to them. The clusters returned this way are more sensitive to the initial seeds, which reduces repeatability on highly variable data sets. However, this approach may reduce the number of iterations needed to complete the algorithm, because the groups need fewer steps to converge.

One obvious limitation of k-means clustering is that you have to supply an assumption about the expected number of clusters in advance. There are also methods for evaluating how well a particular clustering fits. For example, the Within-Cluster Sum-of-Squares (WCSS) measures the variance within each cluster. The better the clustering, the lower the overall WCSS.
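As a rough sketch (my own illustration, not from the original article), here is what a WCSS calculation could look like for the final clustering of the football example:

def wcss(data, assignment, means):
    # Sum of squared distances from each observation to its cluster's mean.
    return sum((data[p] - means[g]) ** 2 for p, g in assignment.items())

data = {"A": 5, "B": 20, "C": 11, "D": 5, "E": 9, "F": 19, "G": 30, "H": 3, "I": 15}
assignment = {"C": 1, "E": 1, "I": 1, "A": 2, "D": 2, "H": 2, "B": 3, "F": 3, "G": 3}
means = {1: 11.7, 2: 4.33, 3: 23}
print(wcss(data, assignment, means))   # lower is better, all else being equal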

Hierarchical Clustering

When to use it?

When we want to explore the underlying relationships within the observed data, we can use a hierarchical clustering algorithm.

How it works

First, we calculate a distance matrix, in which element (i, j) is the distance between observations i and j. The two closest observations are then paired up and their average is calculated. By merging paired observations into a single object, we generate a new distance matrix. The merging process repeats, each time averaging the nearest pair and filling in a new distance matrix, until all observations have been combined.

Worked example:

Here is a super-simple data set on a selection of whale and dolphin species. As a professionally trained biologist, I can assure you that we normally use much more elaborate data sets to build taxonomies. For now, we can just look at the typical body lengths of these six species. We will repeat two steps over and over.

Species             Initials  Length (m)
Bottlenose Dolphin     BD         3.0
Risso's Dolphin        RD         3.6
Pilot Whale            PW         6.5
Killer Whale           KW         7.5
Humpback Whale         HW        15.0
Fin Whale              FW        20.0

Step 1: Calculate the distance matrix between every pair of species. Here we use Euclidean distance, essentially how far apart the data points are. You can read the matrix just as you would a distance chart in a road atlas: the length difference between any two species sits at the intersection of the relevant row and column.

  BD   RD   PW   KW   HW
RD  0.6                    
PW  3.5  2.9               
KW  4.5  3.9  1.0          
HW 12.0 11.4  8.5  7.5     
FW 17.0 16.4 13.5 12.5  5.0
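As a quick cross-check (my own sketch, not part of the original article), the same matrix can be computed with SciPy, assuming NumPy and SciPy are available:

import numpy as np
from scipy.spatial.distance import pdist, squareform

lengths = np.array([[3.0], [3.6], [6.5], [7.5], [15.0], [20.0]])  # BD, RD, PW, KW, HW, FW

# Pairwise Euclidean distances between the body lengths.
print(np.round(squareform(pdist(lengths, metric="euclidean")), 1))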

Step 2: Pick the two closest species, in this case the bottlenose dolphin and Risso's dolphin, whose average body length is 3.3 metres. Repeat step 1 and calculate the distance matrix again, but this time replace the bottlenose dolphin and Risso's dolphin with a single term of mean length 3.3m.

   [BD, RD]   PW   KW   HW
PW      3.2
KW      4.2  1.0
HW     11.7  8.5  7.5
FW     16.7 13.5 12.5  5.0

Next, we repeat step 2 with the new distance matrix. Now the smallest distance is between the pilot whale and the killer whale, so we calculate their average length (7.0m) and merge them into a new term.

We then repeat step 1 and calculate the distance matrix again, but now combine the pilot whale and orca into one term and set the length to 7.0m.

         [BD, RD] [PW, KW]   HW
[PW, KW]      3.7
HW           11.7      8.0
FW           16.7     13.0  5.0

Then we repeat step 2 with the current distance matrix. The smallest distance (3.7m) is now between the two merged terms, so we combine them into an even larger term (with a mean of 5.2m).

   [[BD, RD], [PW, KW]]   HW
HW                  9.8
FW                 14.8  5.0

Next, we repeat step 2 again, and the minimum distance (5.0m) appears between humpback and fin whales, so we continue to merge them into one term and calculate the mean (17.5m).

Go back to step 1 and compute a new distance matrix where humpback and fin whales have been combined into one term.

         [[BD, RD], [PW, KW]]
[HW, FW]                 12.3

Finally, we repeat step 2. Only one distance (12.3m) is left in the matrix, so we merge everything into a single term and can now stop. Here is the final merged term:

[[[BD, RD],[PW, KW]],[HW, FW]]

It now has a nested structure (much like JSON) that can be drawn as a tree, which is read in much the same way as a family tree. In the tree diagram, the closer two observations are, the more similar and closely related they are.

The resulting tree, generated at R-Fiddle.org

The structure of the tree diagram gives us deeper insight into the structure of the data set. In the case above, we see two main branches: one with HW and FW, and the other with BD, RD, PW and KW.

In biological evolution, large data sets containing more species and measurements are often used to infer taxonomic relationships between these species. Beyond biology, hierarchical clustering is also used in machine learning and data mining.

Importantly, using this method does not require setting the number of groups as k-means clustering does. You can “slice” the tree by a given height to return the segmented cluster. Height selection can be done in several ways, depending on the resolution at which we want to cluster the data.

In the figure above, for example, if we draw a line at height equal to 10, we split the two main branches into two subgraphs. If we split from height equal to 2, three clusters are generated.
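The whole procedure, including the "slicing" at a chosen height, can be sketched with SciPy's hierarchical clustering tools (this is my own illustration, not the author's original R code):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

labels = ["BD", "RD", "PW", "KW", "HW", "FW"]
lengths = np.array([[3.0], [3.6], [6.5], [7.5], [15.0], [20.0]])

# 'centroid' linkage merges the two clusters whose means are closest,
# mirroring the step-by-step merging shown above.
Z = linkage(lengths, method="centroid")

# Cutting the tree at height 10 separates the two main branches.
print(dict(zip(labels, fcluster(Z, t=10, criterion="distance"))))
# Cutting lower down returns finer-grained clusters.
print(dict(zip(labels, fcluster(Z, t=3, criterion="distance"))))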

More details:

There are three main aspects in which the hierarchical clustering algorithm presented here can vary.

The most fundamental is the approach itself. Here we used an agglomerative process: we start from individual data points and merge them together until everything forms one large cluster. The alternative (and more computationally intensive) divisive approach starts from one giant cluster and splits the data into smaller and smaller clusters, all the way down to individual data points.

Second, there are different ways to compute the distance matrix. For many cases Euclidean distance (think Pythagoras' theorem) will suffice, but there are alternatives better suited to special situations.

Finally, the linkage criterion can be changed. Clusters are merged according to the distance between them, but the way we define "near" is flexible. In the case above, we matched the nearest clusters by measuring the distance between their means, or centroids. But you might want to use other definitions. (Scikit-learn provides three: "ward", "complete" and "average". See the section on linkage criteria for hierarchical clustering below. My understanding is that the linkage criterion decides which two clusters get merged.)

Linkage criteria for hierarchical clustering

Background

The linkage criterion came up in the blog post on clustering above. I hadn't used hierarchical clustering much before, so I wasn't familiar with the concept. I did a bit of research and put together a few notes, drawn mostly from Wikipedia and Quora, as well as the scikit-learn documentation.

Linkage criteria

Wikipedia defines it as: The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations.

In other words, the linkage criterion determines how the distance between two clusters is measured and calculated, as a function of the pairwise distances between their observations.

Wikipedia lists ten linkage criteria:

  1. Maximum or complete-linkage clustering
  2. Minimum or single-linkage clustering
  3. Mean or average linkage clustering, or UPGMA
  4. Centroid linkage clustering, or UPGMC
  5. Minimum energy clustering
  6. The sum of all intra-cluster variance.
  7. The decrease in variance for the cluster being merged (Ward’s criterion).
  8. The probability that candidate clusters spawn from the same distribution function (V-linkage).
  9. The product of in-degree and out-degree on a k-nearest-neighbour graph (graph degree linkage).
  10. The increment of some cluster descriptor (i.e., a quantity defined for measuring the quality of a cluster) after merging two clusters.

There are so many criteria here that I’m not going to go through them all, because some of them involve quite complicated mathematical formulas that we rarely use.

Which linkage criterion to use

What is the best linkage criterion for hierarchical cluster analysis?

A PhD from MIT said that many people have run experiments on this question and there are many papers on it. The conclusion is that average linkage is generally the most effective, so when doing hierarchical clustering we should try average linkage first, while single linkage is the least effective.

Linkage criterion in sklearn

Here we focus on the three criteria provided by scikit-learn: ward, complete and average (see the sklearn.cluster.AgglomerativeClustering documentation for details). scikit-learn defines the three as follows:

  • ward minimizes the variance of the clusters being merged.
  • average uses the average of the distances of each observation of the two sets.
  • complete or maximum linkage uses the maximum distances between all observations of the two sets.

The second and third are easier to understand; they correspond to the third and first items in the Wikipedia list. Ward's definition mentions variance, which makes it a little harder to grasp.

Ward’s Method on the wiki has this quote: Ward’s minimum variance criterion minimizes the total within-cluster variance. To implement this method, at each step find the pair of clusters that leads to minimum increase in total within-cluster variance after merging.

My understanding: at the start, each point is its own cluster, so every cluster's variance is 0 and the total variance is also 0. Each merge increases the total within-cluster variance, and Ward's method chooses the pair of clusters whose merge increases the total variance the least.
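As a small sketch of how the choice of linkage is expressed in scikit-learn (the toy data below is my own, purely for illustration):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy 2-D data: two loose groups of points plus an outlier.
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [8, 8]])

for criterion in ("ward", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=2, linkage=criterion)
    print(criterion, model.fit_predict(X))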

For example, each cluster consists of several discrete points. We can define the distance between two clusters as the minimum (or maximum) distance between any two of their points, as shown in the figure below. There are other ways to define the linkage criterion that may suit other scenarios.

Red/blue: centroid linkage; red/green: minimum linkage; green/blue: maximum linkage

Graph Community Detection

When to use it?

When your data can be represented as a network or a graph.

How it works

A graph community is usually defined as a subset of vertices that are more densely connected to each other than to the rest of the network. A variety of algorithms exist for identifying communities, based on more specific definitions; they include (but are not limited to): Edge Betweenness, Modularity-Maximisation, Walktrap, Clique Percolation, Leading Eigenvector…

Worked example

Graph theory is a branch of mathematics that deals with networks (see the earlier article, “To understand probabilistic graph models, you must understand the basic definitions and forms of graph theory”). Using graph theory, we can model a complex system as an abstract set of vertices and edges.

Perhaps the most obvious example is social networking. The vertices represent people, and the edges connecting them represent users who are friends or fans of each other.

However, to model a system as a network, you have to find an effective way to connect the various components. Some innovative applications of graph theory in clustering include feature extraction of image data and analysis of gene regulatory networks.

An entry-level example is given below. This is a simple, straightforward graph showing eight websites I recently visited, connected according to the links between their Wikipedia pages. The data is simple enough to draw by hand, but for larger projects it is quicker to write a Python script. Here’s one I wrote:

import wikipedia as wiki
import csv

project_name = input("Project name: ")

def from_user():
    # Collect a list of Wikipedia page titles from the user.
    pageList = []
    while True:
        try:
            searchTerm = input("Enter page title: ")
            wiki.page(searchTerm)
            pageList.append(searchTerm)
            done = input("Page added! Press enter to continue, or type 'Q' to finish adding pages. ")
            print("")
            if done == "Q" or done == "q":
                break
        except Exception:
            print("Error, please try a different search term!")
            print("")
    return pageList

def makeCSV(pageList):
    # Write an adjacency matrix: 1 if page p1 links to page p2, else 0.
    with open(project_name.replace(" ", "_") + "_adj_matrix.csv", "w") as f:
        wr = csv.writer(f, quoting=csv.QUOTE_ALL)
        wr.writerow(pageList)
        counter = 0
        for p1 in pageList:
            p1L = list(wiki.page(p1).links)
            row = [p1]
            for p2 in pageList:
                if p2 in p1L:
                    row.append(1)
                else:
                    row.append(0)
            counter += 1
            prog = round(counter / len(pageList), 3) * 100
            print(p1 + " links read! Added to .csv file. PROGRESS: " + str(prog) + "%")
            wr.writerow(row)
    print("")

makeCSV(from_user())

Graph drawn using igraph in version 3.3.3 of the R language

The colour of each vertex indicates its community membership, and its size is set according to its centrality. See how Google and Twitter are the most central?

In addition, these clusters make sense in real life (and are always an important performance indicator). Yellow vertices are usually reference/search sites, blue vertices are all online publishing sites (articles, tweets or code), and orange vertices are YouTube and PayPal — since YouTube was founded by former PayPal employees. The machine sums it up pretty well!

In addition to being a useful way of visualizing large systems, the real power of networks is their mathematical analysis power. Let’s translate the network in the picture above into a more mathematical form. Here is the adjacency matrix of the network:

         GH  Gl M  P  Q  T  W  Y
GitHub    0  1  0  0  0  1  0  0  
Google    1  0  1  1  1  1  1  1
Medium    0  1  0  0  0  1  0  0
PayPal    0  1  0  0  0  1  0  1
Quora     0  1  0  0  0  1  1  0
Twitter   1  1  1  1  1  0  0  1
Wikipedia 0  1  0  0  1  0  0  0
YouTube   0  1  0  1  0  1  0  0

The value at the intersection of each row and column indicates whether there are edges between the corresponding pair of vertices. For example, there’s an edge between Medium and Twitter, so their rows and columns intersect at 1. Similarly, there are no edges between Medium and PayPal, so their rows and columns intersect at 0.

This adjacency matrix encodes all the properties of the network; it is the key that unlocks every possibility for valuable insight. To start with, summing any row or column gives the degree of that vertex, i.e. how many other vertices it is connected to, usually denoted by the letter k. Summing all the degrees and dividing by 2 gives the number of edges, or “links”, denoted L. The number of rows/columns gives the number of vertices, or “nodes”, in the network, denoted N.
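A quick sketch of how k, L and N fall out of the adjacency matrix with NumPy (the matrix is the one shown above; the code is my own illustration):

import numpy as np

A = np.array([
    [0, 1, 0, 0, 0, 1, 0, 0],   # GitHub
    [1, 0, 1, 1, 1, 1, 1, 1],   # Google
    [0, 1, 0, 0, 0, 1, 0, 0],   # Medium
    [0, 1, 0, 0, 0, 1, 0, 1],   # PayPal
    [0, 1, 0, 0, 0, 1, 1, 0],   # Quora
    [1, 1, 1, 1, 1, 0, 0, 1],   # Twitter
    [0, 1, 0, 0, 1, 0, 0, 0],   # Wikipedia
    [0, 1, 0, 1, 0, 1, 0, 0],   # YouTube
])

k = A.sum(axis=1)     # degree of each vertex
L = A.sum() / 2       # number of edges (each edge is counted twice in A)
N = A.shape[0]        # number of vertices
print(k, L, N)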

Knowing only the values of K, L, and N, as well as each cell in the adjacency matrix A, allows us to calculate the modularity of any given cluster of the network.

Suppose we have clustered the network into groups. We can then use this modularity score to evaluate the quality of the cluster. A higher score means we split the network into “accurate” groups, while a lower score means our clustering is more random. As shown below:

Modularity is a standard used to measure the “quality” of a partition

Modularity can be calculated using the following formula:
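(The formula appeared as an image in the original article; it is reconstructed here in LaTeX from the description that follows.)

M = \frac{1}{2L} \sum_{i,j=1}^{N} \left( A_{ij} - \frac{k_i k_j}{2L} \right) \delta(c_i, c_j)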

This formula is a little complicated, but let’s break it down so we can understand it better.

M is the modularity that we want to calculate.

1/2L tells us to divide the rest by 2L, which is twice the number of edges in the network.

The sigma notation represents a sum, iterating over every row and column of the adjacency matrix A. If you’re not familiar with this notation, you can think of i, j = 1, …, N as nested for-loops in a programming language. In Python, it can be written like this:

sum = 0
for i in range(N):      # Python is 0-indexed; this covers i = 1, ..., N in the formula
    for j in range(N):
        ans = #stuff with i and j as indices
        sum += ans

So what is the “#stuff with i and j” part of this code?

The term in parentheses says: take A_ij and subtract k_i k_j / 2L.

A_ij is the element of the adjacency matrix at row i, column j.

k_i and k_j are the degrees of vertices i and j, found by summing the entries of their rows (or columns). Multiplying them together and dividing by 2L gives the expected number of edges between vertices i and j if the network were wired up at random.

Taken as a whole, the term in parentheses measures the difference between the network’s real structure and the structure we would expect if it were wired randomly. Studying its value shows that it is largest when A_ij = 1 and (k_i k_j) / 2L is small. In other words, we get a higher value when there is an “unexpected” edge between vertices i and j.

And then finally, we multiply the term in parentheses by δ(c_i, c_j), the famous but largely harmless Kronecker delta function. Here it is, explained in Python:

def Kronecker_Delta(ci, cj):
    if ci == cj:
        return 1
    else:
        return 0

Kronecker_Delta("A", "A")    # returns 1
Kronecker_Delta("A", "B")    # returns 0

Yes, it’s that simple. The Kronecker delta function returns 1 if the two arguments are equal or 0 if they are not.

That is, if vertices i and j have been put in the same cluster, then δ(c_i, c_j) = 1; otherwise they are in different clusters and the function returns 0.

When we multiply the terms in parentheses by the Kronecker delta function, we find that for the nested summation sigma, the results are highest when there are a large number of “unexpected” edges connecting vertices assigned to the same cluster. Modularity is thus a measure of the degree to which graphs can be clustered into different groups.

Dividing by 2L sets the upper limit of modularity to 1. Modularity close to or less than 0 indicates that the current clustering of the network is useless. The higher the modularity, the better the network can cluster into different groups. By maximizing modularity, we can find the best way to cluster the network.
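Putting the pieces together, here is a sketch of the full modularity calculation in Python (my own illustration of the formula, using the adjacency matrix A from above and one community label per vertex; the split at the end is an arbitrary example, purely for illustration):

import numpy as np

def modularity(A, communities):
    # Modularity M of a partition, given adjacency matrix A and a community label per vertex.
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)          # vertex degrees
    L = A.sum() / 2.0          # number of edges
    N = A.shape[0]
    M = 0.0
    for i in range(N):
        for j in range(N):
            if communities[i] == communities[j]:      # Kronecker delta
                M += A[i, j] - (k[i] * k[j]) / (2 * L)
    return M / (2 * L)

A = [[0,1,0,0,0,1,0,0],
     [1,0,1,1,1,1,1,1],
     [0,1,0,0,0,1,0,0],
     [0,1,0,0,0,1,0,1],
     [0,1,0,0,0,1,1,0],
     [1,1,1,1,1,0,0,1],
     [0,1,0,0,1,0,0,0],
     [0,1,0,1,0,1,0,0]]
labels = [0, 0, 0, 1, 0, 1, 0, 1]   # an arbitrary split of the eight sites
print(modularity(A, labels))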

Note that we have to define the clustering of the graph in advance before we can evaluate how good that clustering is. Unfortunately, brute-forcing every possible clustering to find the one with the highest modularity score is computationally demanding, and quickly becomes impossible even for modestly sized data sets.

Combinatorics tells us that a network with just eight vertices can be clustered in 4,140 different ways. A 16-vertex network can be clustered in more than 10 billion ways. A 32-vertex network has more than 128 septillion (about 1.3 x 10^26) possible clusterings; and if your network has 80 vertices, the number of ways it can be clustered already exceeds the number of atoms in the observable universe.

Therefore, we have to turn to a heuristic that does a good job of finding clusterings with high modularity scores without trying every single possibility. One such algorithm is Fast-Greedy Modularity Maximization, which is somewhat analogous to the agglomerative hierarchical clustering algorithm described above. Instead of merging groups by distance, “Mod-Max” merges groups according to changes in modularity.

Here’s how it works:

Each vertex is initially assigned to its own community, and then the modularity M of the entire network is calculated.

Step 1: for every pair of communities that are linked by at least one edge, the algorithm calculates how much the modularity would change (ΔM) if the two communities were merged.

Step 2: the pair with the largest increase in ΔM is merged. The new modularity M of this clustering is then calculated and recorded.

Steps 1 and 2 are repeated: each time, the pair of groups whose merge gives the biggest gain in ΔM is fused, and the new clustering pattern and its modularity score M are recorded.

The process stops when all the vertices have been grouped into one giant cluster. The algorithm then looks back through its records and finds the clustering pattern that returned the highest value of M. That is the community structure it returns.
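For practical use, a ready-made implementation of this greedy approach ships with NetworkX. Here is a sketch applied to the website graph above (assuming NetworkX is installed; this is not the code used in the original article):

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Rebuild the website graph from the adjacency matrix above.
edges = [("GitHub", "Google"), ("GitHub", "Twitter"), ("Google", "Medium"),
         ("Google", "PayPal"), ("Google", "Quora"), ("Google", "Twitter"),
         ("Google", "Wikipedia"), ("Google", "YouTube"), ("Medium", "Twitter"),
         ("PayPal", "Twitter"), ("PayPal", "YouTube"), ("Quora", "Twitter"),
         ("Quora", "Wikipedia"), ("Twitter", "YouTube")]
G = nx.Graph(edges)

# Greedily merge communities by the largest gain in modularity.
for community in greedy_modularity_communities(G):
    print(sorted(community))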

More details:

Whew! That’s a lot of computation, at least for us humans. Graph theory poses plenty of computational challenges; problems are often NP-hard. But it also has enormous potential for providing valuable insights into complex systems and data sets. Larry Page knows this: his famous PageRank algorithm, which helped Google go from startup to near-world domination in less than a decade, is based entirely on graph theory.

Community detection is a hot research area in graph theory, and there are many alternatives to Modularity-Maximization (although useful, it has disadvantages).

First, its agglomerative approach means that small, well-defined communities tend to get swallowed up into larger ones. This is known as the resolution limit: the algorithm will not find communities below a certain size. Another challenge is that, rather than having one clear global peak, the Mod-Max approach tends to produce a “plateau” of many similarly high modularity scores, which can make it difficult to identify the true maximum.

Other algorithms use different methods to define and identify communities. Edge-Betweenness is a divisive algorithm: it starts with all vertices grouped in one giant cluster and iteratively removes the “least important” edges in the network until all vertices are separated. This process produces a hierarchy in which similar vertices end up close to each other.
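A sketch of this idea with NetworkX’s Girvan-Newman implementation (the toy graph is my own; at each step the algorithm removes the edge with the highest betweenness):

import networkx as nx
from networkx.algorithms.community import girvan_newman

# A toy graph: two triangles joined by a single bridge edge.
G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)])

# Each iteration removes the most "between" edge and yields the current split.
splits = girvan_newman(G)
print([sorted(c) for c in next(splits)])   # first split: the two triangles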

Another algorithm, Clique Percolation, takes possible overlap between communities into account. Yet other algorithms are based on random walks across the graph, and then there is spectral clustering, which starts from the eigendecomposition of the adjacency matrix and other matrices derived from it. These methods are used in tasks such as feature extraction in computer vision.

Giving examples of in-depth applications of each algorithm is beyond the scope of this introduction. Effective ways to extract usable information from data that were inaccessible a few decades ago have become a very active area of research.

Conclusion

Hopefully, this article has inspired you and helped you better understand how machines make sense of big data. The future is one of rapid change, and many of those changes will be driven by what these technologies become capable of in the next generation or two.

As mentioned in the introduction, machine learning is a very promising research field, where a large number of complex problems need to be solved in an accurate and effective way. Tasks that are easy for humans require innovative solutions when performed by machines.

There is still a lot of work to be done in this area, and whoever contributes to the next big breakthrough will no doubt be generously rewarded. Maybe someone reading this could be the next powerful algorithm inventor? All great ideas start from scratch.

From Medium and additional notes by the blogger