This is the 20th day of my participation in the August More Text Challenge

I put a lot of work into this article, so please do not reprint it without my permission. Thank you!

Hierarchical clustering

Nowadays the pace of life keeps getting faster. Everyone likes fast food and short videos, which do save time for the things we want to do, but they also make it hard to settle down, stay focused, and see something through without getting irritable. In reality, most things cannot be accomplished in one stroke; progress often fluctuates, and sometimes effort even seems to take us backwards. That is exactly when we need to persist: rest when tired, relax a little, but keep going, and I believe the results will come.

The two types of hierarchical clustering are Agglomerative and Divisive.

  • Agglomerative: data points are clustered bottom-up, starting from individual data points
  • Divisive: all data points start in one large cluster, and clustering proceeds top-down by splitting the large cluster into sub-clusters

In this article, we will focus on agglomerative clustering, which takes the bottom-up approach.

General steps for clustering

The following are the steps of agglomerative clustering; a minimal code sketch of the procedure follows the list.

  • Start by treating each data point as its own cluster, so the initial number of clusters K equals the number of data points
  • Join the two closest data points into one cluster, leaving K-1 clusters
  • Join the two closest clusters to form yet another cluster, leaving K-2 clusters
  • Repeat the previous steps until all sample points are merged into one large cluster
  • Once that single cluster is formed, the dendrogram (tree diagram) can be cut in different ways, depending on the task, to divide the data into multiple clusters
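
To make the steps concrete, here is a minimal sketch of the bottom-up procedure in plain NumPy, assuming single-linkage (nearest-point) merging; the toy points and the helper name naive_agglomerative are illustrative assumptions, not part of the original article.

import numpy as np

def naive_agglomerative(points):
    # Step 1: every data point starts as its own cluster
    clusters = [[i] for i in range(len(points))]
    history = []
    # Steps 2-4: repeatedly merge the two closest clusters until one remains
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest members of the two clusters
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]   # merge cluster b into cluster a
        del clusters[b]
    return history

points = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8]])
for left, right, dist in naive_agglomerative(points):
    print("merge", left, "+", right, "at distance", round(dist, 2))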

There are different ways to measure the distance between clusters. The distance itself can be Euclidean distance or Manhattan distance. Here are some alternative ways to measure the distance between two clusters (a code sketch mapping them to scipy's linkage methods follows the list):

  • Measure the distance between the closest points of the two clusters
  • Measure the distance between the farthest points of the two clusters
  • Measure the distance between the centers of the two clusters
  • Measure the distance between all possible pairs of points across the two clusters and take their average
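
These four strategies correspond to the method argument of scipy's linkage function ('single', 'complete', 'centroid', and 'average' respectively); below is a minimal sketch, with toy points chosen only for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy 2-D data; any (n_samples, n_features) array works
pts = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8]])

# 'single'   : distance between the closest points of two clusters
# 'complete' : distance between the farthest points of two clusters
# 'centroid' : distance between the centers of two clusters
# 'average'  : mean distance over all pairs of points between two clusters
for method in ['single', 'complete', 'centroid', 'average']:
    Z = linkage(pts, method=method, metric='euclidean')
    print(method)
    print(Z)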

The role of the dendrogram in hierarchical clustering

A large cluster is built up from smaller clusters, and this merge structure can be represented by a dendrogram, which is then used to actually split the data into multiple clusters of related data points. Let’s see how it works.

Next we have a collection of data points represented as a NumPy array, as shown below.

import numpy as np

X = np.array([[5, 3],
              [10, 15],
              [15, 12],
              [24, 10],
              [30, 30],
              [85, 70],
              [71, 80],
              [60, 78],
              [70, 55],
              [80, 91]])
import matplotlib.pyplot as plt

labels = range(1, 11)
plt.figure(figsize=(10, 7))
plt.subplots_adjust(bottom=0.1)
plt.scatter(X[:, 0], X[:, 1], label='True Position')

# Annotate each point with its label (1 to 10) so it is easy to identify in the plot
for label, x, y in zip(labels, X[:, 0], X[:, 1]):
    plt.annotate(
        label,
        xy=(x, y), xytext=(-3, 3),
        textcoords='offset points', ha='right', va='bottom')
plt.show()

The sample points are plotted on a graph for easy observation, and each point is annotated with a label identifying which sample it is.

In the figure above, it is not hard to see that the sample points fall into two obvious groups: the lower-left part of the figure (points 1-5) forms one group, and the upper-right part (points 6-10) forms another. In practical problems, however, there may be tens of thousands of sample points and the data may be high-dimensional; in such cases we cannot see the pattern just by looking.

Going back to the use of dendrograms in hierarchical clustering, we can plot the sample points as a dendrogram using the scipy library.
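
The plotting code itself is not shown in this article; a minimal sketch with scipy.cluster.hierarchy, assuming single (nearest-point) linkage to match the description below, might look like this:

from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Build the merge hierarchy; 'single' linkage merges clusters by nearest-point distance
linked = linkage(X, method='single')

plt.figure(figsize=(10, 7))
dendrogram(linked,
           orientation='top',
           labels=list(range(1, 11)),   # label the leaves 1-10 to match the scatter plot
           distance_sort='descending',
           show_leaf_counts=True)
plt.show()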

The algorithm first finds the two points closest to each other in Euclidean distance. Looking back at Figure 1, sample points 2 and 3 are closest to each other, as are sample points 7 and 8, so clusters form at these two pairs first. In Figure 2, points 2 and 3 have been joined to show that they form one cluster, and points 7 and 8 likewise. As Figure 2 also shows, the Euclidean distance between sample points 7 and 8 is greater than the distance between sample points 2 and 3.

The next step is to join the cluster formed by two points to the next closest cluster or point, which in turn forms another cluster. Looking at Figure 1, point 4 is closest to the cluster of points 2 and 3, so in Figure 2 point 4 is joined to the subtree of points 2 and 3. This process continues until all the points are joined together into one large cluster.

Once the single large cluster is formed, we can cut the dendrogram with a horizontal line to obtain clusters; horizontal lines at different heights give different clusterings.

We can see that the largest vertical distance not crossed by any horizontal line is marked by the blue line. So we draw a new horizontal red line through the blue line; since it crosses two vertical lines, the number of clusters will be two.

Basically, this horizontal line acts as a threshold: it defines the minimum distance required to be considered a separate cluster. If we draw the line lower, the threshold for forming a new cluster drops and more clusters appear, as shown in the figure below.

In the figure above, the horizontal line crosses four vertical lines, resulting in four clusters: a cluster of points 6, 7, 8, and 10; a cluster of points 3, 2, and 4; and points 9 and 5, which are treated as single-point clusters.
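
In code, cutting the dendrogram at a given height corresponds to scipy's fcluster with a distance threshold; here is a small self-contained sketch. The threshold values 30 and 18 are illustrative assumptions for this data under single linkage, not values taken from the article.

from scipy.cluster.hierarchy import linkage, fcluster

# Re-build the single-linkage hierarchy for the 10 points defined above
linked = linkage(X, method='single')

# A higher cut separates only the tallest merge  -> fewer clusters;
# a lower cut also breaks the smaller merges apart -> more clusters.
print(fcluster(linked, t=30, criterion='distance'))   # two clusters
print(fcluster(linked, t=18, criterion='distance'))   # four clusters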

The next step is to import sklearn's clustering class and call its fit_predict method to predict the cluster to which each data point belongs.

from sklearn.cluster import AgglomerativeClustering

cluster = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward')
cluster.fit_predict(X)

In the above code, we import the AgglomerativeClustering class from the sklearn.cluster module. The number of clusters is set to 2 with the n_clusters parameter, the affinity is set to 'euclidean' (the distance measure between data points), and finally the linkage parameter is set to 'ward', which minimizes the variance of the clusters being merged.

Next we call the fit_predict method on cluster, our instance of the AgglomerativeClustering class. This method returns the label of the cluster to which each data point belongs. To see how the data points were clustered, print the labels:

print(cluster.labels_)

The output is a one-dimensional array of 10 elements, corresponding to the cluster labels assigned to the 10 data points.

[1 1 1 1 1 0 0 0 0 0]
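
As a final illustration (not part of the original article), the predicted labels can be used to color the scatter plot and visually confirm the two groups:

import matplotlib.pyplot as plt

# Color each point by its predicted cluster label
plt.figure(figsize=(10, 7))
plt.scatter(X[:, 0], X[:, 1], c=cluster.labels_, cmap='rainbow')
plt.show()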