When reprinting, please credit the original source, thanks: blog.csdn.net/pentiumCM/a…
Machine learning – Implementing hierarchical clustering in Python (sklearn/SciPy), with a precomputed custom distance matrix
For the principles of hierarchical clustering, please refer to my other blog post: The principle of hierarchical clustering. This post shows how to implement hierarchical clustering simply and directly in Python.
It covers two ways to implement hierarchical clustering (SciPy and sklearn), and how to use a custom distance matrix in sklearn via the precomputed parameter.
I. SciPy implementation
(1) Function description
There are two main functions: linkage and fcluster
1. linkage
def linkage(y, method='single', metric='euclidean', optimal_ordering=False):
Perform hierarchical/agglomerative clustering.
Parameters:
- y: the input; either a 1-D condensed distance matrix (a distance vector) or a 2-D array of observation vectors (a matrix of coordinate points).
- method: the method used to compute the distance between clusters; one of single, complete, average, weighted, centroid, ward (a worked example follows this list):
  - single: the distance between the nearest pair of points, one in cluster $u$ and one in cluster $v$, is taken as the distance between $u$ and $v$: $d(u,v) = \min_{i,j}\operatorname{dist}(u[i], v[j])$
  - complete: the distance between the farthest pair of points in cluster $u$ and cluster $v$ is taken as the distance between $u$ and $v$: $d(u,v) = \max_{i,j}\operatorname{dist}(u[i], v[j])$
  - average: the average distance over all pairs between the two clusters; for points $i$ in cluster $u$ and points $j$ in cluster $v$, the distance between $u$ and $v$ is: $d(u,v) = \sum_{i,j} \frac{\operatorname{dist}(u[i], v[j])}{|u|\,|v|}$
  - weighted: weighted/WPGMA linkage on the condensed distance matrix; if cluster $u$ was formed by merging clusters $s$ and $t$, the distance between $u$ and $v$ is: $d(u,v) = \frac{d(s,v) + d(t,v)}{2}$
  - centroid: centroid distance; the distance between the centroids of the two clusters is taken as the cluster distance
  - ward: minimizes the variance of the clusters being merged.
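As a quick worked illustration (using the five sample points introduced in the example below, with $u = \{(1,2),(2,3)\}$ and $v = \{(-3,3),(-2,-1)\}$), the four pairwise distances are

$$\sqrt{17} \approx 4.123, \qquad \sqrt{18} \approx 4.243, \qquad 5, \qquad \sqrt{32} \approx 5.657,$$

so single gives about 4.123, complete gives about 5.657, and average gives $(4.123 + 4.243 + 5 + 5.657)/4 \approx 4.756$, which is exactly the merge distance 4.75565014 that appears in the linkage output later in this post.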
Returns:
Z (ndarray): the hierarchical clustering encoded as a linkage matrix.
Z has four columns. The first two columns are the indices of the two clusters being merged: the original observations are numbered 0 to n-1, and each newly formed cluster is assigned the next index in turn. The third column is the distance between the two merged clusters, and the fourth column is the number of original observations contained in the newly formed cluster.
Resources: docs.scipy.org/doc/scipy/r…
2. fcluster
def fcluster(Z, t, criterion='inconsistent', depth=2, R=None, monocrit=None):
Form flat clusters from the hierarchical clustering defined by the given linkage matrix.
- Parameters:
- Z: the linkage matrix, which records the hierarchy information of the hierarchical clustering.
- t: the clustering threshold; its exact meaning depends on criterion
- criterion: the criterion used to form flat clusters
  - inconsistent: (the default) forms a flat cluster from a cluster node whenever the node and all of its descendants have an inconsistency coefficient no greater than t
  - distance: uses the distance between clusters as the splitting criterion, so that the observations in each flat cluster have a cophenetic distance no greater than t
- Returns: fcluster (ndarray): an array of length n, where fcluster[i] is the number of the flat cluster to which original observation i belongs.
- Detailed reference: docs.scipy.org/doc/scipy/r…
(2) Complete example
For example, cluster the following five points.
|         | x  | y  |
|---------|----|----|
| Point 0 | 1  | 2  |
| Point 1 | 2  | 3  |
| Point 2 | -3 | 3  |
| Point 3 | -2 | -1 |
| Point 4 | 5  | -1 |
(Figures: the positions of the five points on the coordinate plane, and the dendrogram of the hierarchical clustering result.)
The complete code is as follows:
```python
#!/usr/bin/env python
# encoding: utf-8
'''
@Author  : pentiumCM
@Email   : [email protected]
@Software: PyCharm
@File    : sci_cluster.py
@Time    : 2020/4/15 22:21
@desc    : SciPy implementation of hierarchical clustering
'''

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from matplotlib import pyplot as plt

data = np.array([[1, 2], [2, 3], [-3, 3], [-2, -1], [5, -1]])

# Draw the points
plt.scatter(x=data[:, 0:1], y=data[:, 1:2], marker='.', color='red')
n = np.arange(data.shape[0])
for i, txt in enumerate(n):
    plt.annotate(txt, (data[i:i + 1, 0:1], data[i:i + 1, 1:2]))
plt.show()

# 1. Hierarchical clustering
# linkage computes the distance d(s, t) between two clusters s and t;
# the hierarchical clustering is encoded as a linkage matrix.
Z = linkage(data, 'average')
print("Clustering process:", Z)

# Form flat clusters from the hierarchical clustering defined by the linkage matrix Z.
# criterion='distance': use the distance between clusters as the splitting criterion.
f = fcluster(Z, 4, 'distance')
print("Flat clustering result:", f)

fig = plt.figure(figsize=(5, 3))
# Display the hierarchical clustering result as a dendrogram
dn = dendrogram(Z)
plt.show()
```
Algorithm running results:

```
Clustering process: [[0.         1.         1.41421356 2.        ]
 [2.         3.         4.12310563 2.        ]
 [5.         6.         4.75565014 4.        ]
 [4.         7.         6.48606798 5.        ]]
Flat clustering result: [1 1 2 3 4]
```
The main clustering information is in Z = linkage(data, 'average'). Z has four columns: the first two are the indices of the clusters being merged (the original observations are numbered 0 to n-1, and each newly formed cluster is assigned the next index in turn), the third is the distance between the two merged clusters, and the fourth is the number of original observations contained in the newly formed cluster.
Hierarchical clustering computes the whole merge hierarchy in a single pass. Once the clustering tree has been generated, the number of clusters can be chosen by drawing a horizontal line across the tree (in the code above, the cut is based on the distance between clusters). If the user wants two clusters, a horizontal line drawn near the top of the tree splits it into two groups: one containing point 4, the other containing points 0, 1, 2 and 3. To get three clusters, simply move the horizontal line further down and read off the three groups. A programmatic way to do the same thing is sketched below.
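A minimal sketch (not from the original post): instead of choosing a distance threshold by hand, fcluster can be asked directly for a fixed number of flat clusters with criterion='maxclust'.

```python
# A minimal sketch: request a fixed number of flat clusters directly,
# rather than cutting the dendrogram at a distance threshold.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([[1, 2], [2, 3], [-3, 3], [-2, -1], [5, -1]])
Z = linkage(data, 'average')

print(fcluster(Z, t=2, criterion='maxclust'))  # 2 clusters: {4} and {0, 1, 2, 3}
print(fcluster(Z, t=3, criterion='maxclust'))  # 3 clusters: {0, 1}, {2, 3} and {4}
```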
II. sklearn implementation
(1) Function description
Hierarchical clustering in the sklearn library is provided by AgglomerativeClustering in sklearn.cluster:
```python
def __init__(self, n_clusters=2, affinity="euclidean",
             memory=None,
             connectivity=None, compute_full_tree='auto',
             linkage='ward', distance_threshold=None):
```
The AgglomerativeClustering constructor takes three important parameters: the number of clusters n_clusters, the linkage method linkage, and the distance metric affinity:
- n_clusters: the number of clusters the user wants to obtain.
- linkage: the method used to compute the distance between clusters; one of ward, complete, average, single:
  - ward: minimizes the variance of the clusters being merged.
  - complete: maximum distance; uses the maximum distance between all observations of the two clusters.
  - single: minimum distance; uses the minimum distance between all observations of the two clusters.
  - average: average distance; uses the average of the distances between all observations of the two clusters.
- affinity: the metric used to compute the distances. It can be "euclidean", "l1", "l2", "manhattan", "cosine", or "precomputed".
  - If linkage is "ward", only "euclidean" is accepted.
  - If "precomputed", a distance matrix is required as input to the fit method.
Distance matrix generation: suppose there are n observation points. First build the list of pairwise distances between the n points, i.e. a distance list of length n*(n-1)/2 (the condensed form), and then construct the square distance matrix with the squareform function from scipy.spatial.distance. The advantage of this approach is that the user can compute the distance between any two observation points with a self-defined distance measure and then cluster on it. A short sketch of this construction follows below, and a complete clustering example based on a custom distance matrix is given later in this post.
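A minimal sketch of this construction (pdist with a Euclidean metric stands in here for whatever user-defined distance is actually needed):

```python
# A minimal sketch: build the condensed distance list of length n*(n-1)/2
# and convert it into a square distance matrix with squareform.
import numpy as np
from scipy.spatial.distance import pdist, squareform

data = np.array([[1, 2], [2, 3], [-3, 3], [-2, -1], [5, -1]])

condensed = pdist(data, metric='euclidean')   # condensed form, shape (n*(n-1)/2,) = (10,)
dist_matrix = squareform(condensed)           # square form, shape (n, n) = (5, 5)

print(condensed.shape, dist_matrix.shape)
```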
- Reference: the source code documentation.
(2) Complete example
```python
#!/usr/bin/env python
# encoding: utf-8
'''
@Author  : pentiumCM
@Email   : [email protected]
@Software: PyCharm
@File    : sklearn_hierarchical_cluster.py
@Time    : 2020/4/23 15:00
@desc    : Hierarchical clustering with sklearn
'''

import numpy as np
from matplotlib import pyplot as plt
from sklearn.cluster import AgglomerativeClustering

data = np.array([[1, 2], [2, 3], [-3, 3], [-2, -1], [5, -1]])

# Draw the points
plt.scatter(x=data[:, 0:1], y=data[:, 1:2], marker='.', color='red')
n = np.arange(data.shape[0])
for i, txt in enumerate(n):
    plt.annotate(txt, (data[i:i + 1, 0:1], data[i:i + 1, 1:2]))
plt.show()

# Train the model
ac = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='average')
clustering = ac.fit(data)

print("Cluster number of each data:", clustering.labels_)
print("Members of each cluster:", clustering.children_)
```
Algorithm running results:
```
Cluster number of each data: [2 2 0 0 1]
Members of each cluster: [[0 1]
 [2 3]
 [5 6]
 [4 7]]
```
clustering.labels_ indicates which cluster each data point belongs to. [2 2 0 0 1] means that points 0 and 1 form one cluster, points 2 and 3 form another, and point 4 forms a cluster on its own. clustering.children_ records which members each merged cluster contains. [[0 1] [2 3] [5 6] [4 7]] means: the data points are first initialized as clusters 0 to n-1; then clusters 0 and 1 are merged into cluster 5, clusters 2 and 3 are merged into cluster 6, clusters 5 and 6 are merged into cluster 7, and finally clusters 4 and 7 are merged. A sketch of how to turn this merge record back into a dendrogram follows.
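A minimal sketch (not from the original post): children_ can be converted back into a SciPy-style linkage matrix so the merge tree can be drawn with dendrogram. Refitting with distance_threshold=0 and n_clusters=None makes sklearn expose the merge distances as model.distances_.

```python
# A minimal sketch: rebuild a SciPy-style linkage matrix from a fitted
# AgglomerativeClustering model and plot it with scipy's dendrogram.
import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

data = np.array([[1, 2], [2, 3], [-3, 3], [-2, -1], [5, -1]])

# distance_threshold=0 with n_clusters=None builds the full tree and fills model.distances_
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None, linkage='average')
model.fit(data)

n_samples = len(model.labels_)
counts = np.zeros(model.children_.shape[0])
for i, merge in enumerate(model.children_):
    # number of original observations in the cluster created at merge step i
    counts[i] = sum(1 if child < n_samples else counts[child - n_samples] for child in merge)

# linkage matrix columns: merged cluster indices, merge distance, cluster size
linkage_matrix = np.column_stack([model.children_, model.distances_, counts]).astype(float)
dendrogram(linkage_matrix)
plt.show()
```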
(3) Supplement: clustering based on a precomputed custom distance matrix
The complete algorithm is as follows:
```python
#!/usr/bin/env python
# encoding: utf-8
'''
@Author  : pentiumCM
@Email   : [email protected]
@Software: PyCharm
@File    : sklearn_hierarchical_cluster.py
@Time    : 2020/4/23 15:00
@desc    : Hierarchical clustering with sklearn
'''

import numpy as np
from matplotlib import pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import euclidean_distances
import scipy.spatial.distance as dist
from scipy.cluster.hierarchy import dendrogram, linkage

data = np.array([[1, 2], [2, 3], [-3, 3], [-2, -1], [5, -1]])

# Draw the points
plt.scatter(x=data[:, 0:1], y=data[:, 1:2], marker='.', color='red')
n = np.arange(data.shape[0])
for i, txt in enumerate(n):
    plt.annotate(txt, (data[i:i + 1, 0:1], data[i:i + 1, 1:2]))
plt.show()

# Clustering method 1
# Train the model
ac = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='average')
clustering = ac.fit(data)

print("Cluster number of each data:", clustering.labels_)
print("Members of each cluster:", clustering.children_)

# Clustering method 2
# Build a custom distance matrix
num = data.shape[0]
dist_matrix = np.zeros((num, num))
for i in range(num):
    for j in range(i, num):
        distance = euclidean_distances(data[i:i + 1], data[j:j + 1])
        dist_matrix[i:i + 1, j:j + 1] = distance
        dist_matrix[j:j + 1, i:i + 1] = dist_matrix[i:i + 1, j:j + 1]

# Cluster based on the user-defined distance matrix
model = AgglomerativeClustering(n_clusters=3, affinity='precomputed', linkage='average')
clustering2 = model.fit(dist_matrix)

print("Custom distance matrix clustering method:")
print("Cluster number of each data:", clustering2.labels_)
print("Members of each cluster:", clustering2.children_)

# Convert the square distance matrix to the condensed form expected by linkage
dist_matrix = dist.squareform(dist_matrix)

# linkage computes the distance d(s, t) between two clusters s and t;
# the hierarchical clustering is encoded as a linkage matrix.
Z = linkage(dist_matrix, 'average')
print("Clustering process:", Z)

# Display the hierarchical clustering result as a dendrogram
fig = plt.figure(figsize=(5, 3))
dn = dendrogram(Z)
plt.show()
```
Algorithm running results:
```
Cluster number of each data: [2 2 0 0 1]
Members of each cluster: [[0 1]
 [2 3]
 [5 6]
 [4 7]]
Custom distance matrix clustering method:
Cluster number of each data: [2 2 0 0 1]
Members of each cluster: [[0 1]
 [2 3]
 [5 6]
 [4 7]]
Clustering process: [[0.         1.         1.41421356 2.        ]
 [2.         3.         4.12310563 2.        ]
 [5.         6.         4.75565014 4.        ]
 [4.         7.         6.48606798 5.        ]]
```
The code above contains the two sklearn clustering approaches. The second one is the precomputed approach: the distance matrix dist_matrix between all data points is computed up front, and affinity='precomputed' is then passed when clustering. A short sketch with a genuinely user-defined distance follows.
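A minimal sketch (not from the original post): the same pipeline with a user-defined distance function. A Manhattan-style distance stands in here for any custom metric; note that newer sklearn versions rename the affinity parameter to metric.

```python
# A minimal sketch: cluster on a distance matrix built from a user-defined distance.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def my_distance(a, b):
    # any symmetric, non-negative, user-defined distance works here
    return float(np.sum(np.abs(a - b)))

data = np.array([[1, 2], [2, 3], [-3, 3], [-2, -1], [5, -1]])
n = data.shape[0]
dist_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist_matrix[i, j] = dist_matrix[j, i] = my_distance(data[i], data[j])

model = AgglomerativeClustering(n_clusters=3, affinity='precomputed', linkage='average')
print(model.fit(dist_matrix).labels_)
```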
By comparison, the SciPy approach is easier to follow when you want to understand the hierarchical clustering process itself, since it exposes the full linkage matrix and the dendrogram.
References
- Implementation of hierarchical clustering algorithm:
  - blog.csdn.net/andy_shenzl…
  - blog.csdn.net/weixin_4453…
- sklearn/scipy help documentation:
  - scikit-learn.org/stable/modu…
  - docs.scipy.org/doc/scipy/r…