Introduction to spectral clustering
Spectral clustering is an algorithm that evolved from graph theory and is widely used for clustering. Its main idea is to view all data points as vertices in space that can be connected by weighted edges: the edge weight between two distant points is low, while the edge weight between two nearby points is high. The graph composed of all data points is then cut so that the total edge weight between different subgraphs is as low as possible, while the edge weight within each subgraph is as high as possible, thereby achieving clustering.
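To make "cutting the graph" concrete, a standard formalization (not spelled out in the original text) writes the weight of a cut between two subgraphs, and the normalized objective that spectral clustering actually relaxes, as follows:

```latex
% Weight of the cut between subgraphs A and B:
% the total weight of edges crossing from A to B
\mathrm{cut}(A, B) = \sum_{i \in A,\; j \in B} w_{ij}

% Spectral clustering minimizes a normalized variant such as RatioCut,
% which penalizes cuts that split off tiny subgraphs:
\mathrm{RatioCut}(A_1, \dots, A_k) = \sum_{m=1}^{k} \frac{\mathrm{cut}(A_m, \bar{A}_m)}{|A_m|}
```

Minimizing this objective exactly is NP-hard; its spectral relaxation is what leads to the Laplacian eigenvector computation described in the next section.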
Spectral clustering principle
The spectral clustering algorithm is easy to use, but its principle is not as easy to understand. The algorithm can be summarized in the following steps.

Input: sample set D = (x1, x2, ..., xn), the generation method of the similarity matrix, the dimension k1 after dimensionality reduction, a clustering method, and the number of clusters k2.

Output: cluster division C(c1, c2, ..., ck2).

1) Construct the similarity matrix S according to the input generation method.
2) Construct the adjacency matrix W from the similarity matrix S, and construct the degree matrix D.
3) Compute the Laplacian matrix L.
4) Construct the normalized Laplacian matrix D^(-1/2) L D^(-1/2).
5) Compute the eigenvectors F corresponding to the k1 smallest eigenvalues of D^(-1/2) L D^(-1/2).
6) Normalize the matrix composed of these eigenvectors by row to form an n × k1 feature matrix F.
7) Take each row of F as a k1-dimensional sample (n samples in total) and cluster them with the input clustering method into k2 clusters, as sketched in the code below.
8) Obtain the cluster division C(c1, c2, ..., ck2).

The detailed principle of the spectral clustering algorithm requires some background in graph theory, which is not expanded here. For a detailed explanation, see Stanford CS224W, Lecture 5. The courseware for this course is available at [reference link 2] or [reference link 3].
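Below is a minimal NumPy sketch of these steps. It assumes an RBF (Gaussian) similarity for step 1 and a fully connected adjacency W = S for step 2; it is an illustration of the recipe above, not a production implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k1, k2, gamma=1.0):
    """Sketch of the steps above: RBF similarity, normalized Laplacian,
    eigendecomposition, then k-means on the spectral embedding."""
    # 1) Similarity matrix S (RBF kernel, one possible choice)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    S = np.exp(-gamma * sq_dists)

    # 2) Adjacency matrix W (fully connected graph: W = S) and degree matrix D
    W = S
    D = np.diag(W.sum(axis=1))

    # 3) Laplacian matrix L = D - W
    L = D - W

    # 4) Normalized Laplacian D^(-1/2) L D^(-1/2)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
    L_norm = D_inv_sqrt @ L @ D_inv_sqrt

    # 5) Eigenvectors of the k1 smallest eigenvalues
    #    (np.linalg.eigh returns eigenvalues in ascending order)
    eigvals, eigvecs = np.linalg.eigh(L_norm)
    F = eigvecs[:, :k1]                      # n x k1 feature matrix

    # 6) Normalize F by row
    F = F / np.linalg.norm(F, axis=1, keepdims=True)

    # 7)-8) Cluster the n rows of F into k2 clusters with k-means
    return KMeans(n_clusters=k2, n_init=10).fit_predict(F)
```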
Summary of spectral clustering algorithm
The main advantages of spectral clustering
- Spectral clustering only needs the similarity matrix between data points, so it is very effective for clustering sparse data, something traditional clustering algorithms such as K-means find difficult (see the sketch after this list).
- Because it applies dimensionality reduction, its complexity when clustering high-dimensional data is better than that of traditional clustering algorithms.
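A minimal sketch of the first advantage, assuming scikit-learn's SpectralClustering with affinity='precomputed' (the toy 6×6 similarity matrix below is invented for illustration; sparse matrices are accepted as well):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# A hand-built similarity matrix for 6 items: two dense groups
# (0-2 and 3-5) joined by one weak bridge edge. No feature vectors needed.
S = np.array([
    [0., 1., 1., 0., 0., 0.],
    [1., 0., 1., 0., 0., 0.],
    [1., 1., 0., .1, 0., 0.],
    [0., 0., .1, 0., 1., 1.],
    [0., 0., 0., 1., 0., 1.],
    [0., 0., 0., 1., 1., 0.],
])

labels = SpectralClustering(
    n_clusters=2,
    affinity='precomputed',  # treat the input as the similarity matrix itself
    random_state=0,
).fit_predict(S)
print(labels)  # items 0-2 in one cluster, items 3-5 in the other
```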
Major disadvantages of spectral clustering
- If the final clustering dimension is very high, the dimensionality reduction is not sufficient, so both the running speed and the final clustering quality of spectral clustering suffer.
- The clustering result depends on the similarity matrix: different similarity matrices may lead to very different final clustering results, as illustrated below.
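As a concrete illustration of this sensitivity, here is a toy comparison I am adding (using scikit-learn's make_moons; results depend on parameter choices): with an RBF affinity the two interleaving half-moons are often split incorrectly, while a nearest-neighbors similarity graph recovers them.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Two interleaving half-moons: the same data, two similarity matrices
X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

for affinity in ('rbf', 'nearest_neighbors'):
    labels = SpectralClustering(n_clusters=2, affinity=affinity,
                                random_state=0).fit_predict(X)
    # Agreement with the true moons, accounting for label permutation
    agreement = max((labels == y).mean(), (labels != y).mean())
    print(f'{affinity}: agreement with true moons = {agreement:.2f}')
```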
Implementation of spectral clustering in Sklearn
An implementation of the spectral clustering algorithm is provided in sklearn:
sklearn.cluster.SpectralClustering(n_clusters=8, *, eigen_solver=None, n_components=None, random_state=None, n_init=10, gamma=1.0, affinity='rbf', n_neighbors=10, eigen_tol=0.0, assign_labels='kmeans', degree=3, coef0=1, kernel_params=None, n_jobs=None, verbose=False)
Parameter description:
- 'n_clusters': the dimension of the projection subspace, i.e. the number of clusters after clustering; this corresponds to k in the principle above.
- 'eigen_solver': the eigenvalue decomposition strategy to use. The options are 'arpack', 'lobpcg', and 'amg'. 'amg' requires pyamg to be installed; it can be faster on very large sparse problems, but may also lead to instability. If unsure, 'arpack' is a safe choice.
- 'random_state': a pseudo-random number generator used for the eigenvector decomposition and for the K-means initialization.
- 'gamma': the kernel coefficient for the rbf, poly, sigmoid, laplacian, and chi2 kernels.

These are only a few of the parameters. For the full list and the meaning and principle of each parameter, see the official documentation [reference link 4] or [reference link 5].
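A minimal usage sketch tying the parameters above together (the three-blob toy dataset is my choice for illustration, not from the original text):

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

# Three well-separated Gaussian blobs as easy-to-cluster toy data
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

model = SpectralClustering(
    n_clusters=3,            # number of clusters / projection dimension
    affinity='rbf',          # build the similarity matrix with an RBF kernel
    gamma=1.0,               # kernel coefficient described above
    assign_labels='kmeans',  # cluster the spectral embedding with k-means
    random_state=0,
)
labels = model.fit_predict(X)
print(labels[:20])
```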