
This article is the fifth part of my notes on Andrew Ng's machine learning course. It introduces several clustering algorithms: the classic K-means algorithm, its extension bisecting K-means, and another commonly used algorithm, DBSCAN.

Clustering

Clustering is the process of dividing a data set into groups (clusters) of similar objects, so that the similarity between objects in the same cluster is maximized and the similarity between objects in different clusters is minimized. It is an unsupervised learning problem.

K-means algorithm

K is simply the number of clusters, and it is chosen manually. K-means is a centroid-based partitioning algorithm.

Algorithm process:

  • First, randomly select K points as the initial cluster centers.
  • For each sample in the data set, compute its distance to each of the K centers and associate it with the nearest center.
  • Recompute the mean position of each cluster and move the cluster center there; the last two steps are repeated until the centers stabilize.

Distance measure: Euclidean distance or cosine similarity.
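Both measures are straightforward to compute. Here is a minimal NumPy sketch (the helper names and the example vectors are mine, for illustration only):

```python
import numpy as np

def euclidean_distance(x, mu):
    # Straight-line distance between a sample x and a center mu
    return np.sqrt(np.sum((x - mu) ** 2))

def cosine_similarity(x, mu):
    # Cosine of the angle between the vectors; closer to 1 means more similar
    return np.dot(x, mu) / (np.linalg.norm(x) * np.linalg.norm(mu))

x = np.array([1.0, 2.0])
mu = np.array([2.0, 0.0])
print(euclidean_distance(x, mu))  # ~2.236
print(cosine_similarity(x, mu))   # ~0.447
```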

Algorithm description:

  • $\mu^{(i)}$ denotes the i-th cluster center, $x^{(i)}$ denotes the i-th sample, and m denotes the number of samples.

$$
\begin{aligned}
&Repeat\ \{ \\
&\quad for\ i = 1\ to\ m: \\
&\qquad c^{(i)} := \text{index of the cluster center nearest to sample } x^{(i)} \\
&\quad for\ k = 1\ to\ K: \\
&\qquad \mu^{(k)} := \text{mean position of the samples assigned to cluster } k \\
&\}
\end{aligned}
$$
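The loop above translates almost directly into code. Below is a minimal NumPy sketch (the function name, the initialization scheme, and the fixed iteration count are my own choices for illustration, not from the course):

```python
import numpy as np

def k_means(X, K, n_iters=100):
    """Cluster the (m, n) data matrix X into K clusters."""
    m = X.shape[0]
    # Randomly pick K distinct samples as the initial cluster centers
    centers = X[np.random.choice(m, K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: c[i] = index of the center nearest to x^(i)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        c = np.argmin(dists, axis=1)
        # Update step: mu^(k) = mean position of the samples in cluster k
        # (for simplicity this sketch does not handle empty clusters)
        centers = np.array([X[c == k].mean(axis=0) for k in range(K)])
    return c, centers
```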

Optimization objective

K-means also minimizes a clustering cost: the sum of squared distances between every data point and its associated cluster center. This leads to the K-means cost function:

$$
J(c^{(1)},\dots,c^{(m)},\mu^{(1)},\dots,\mu^{(K)}) = \dfrac{1}{m}\sum\limits_{i=1}^{m}\left\|x^{(i)}-\mu^{c^{(i)}}\right\|^2
$$

J is also known as the distortion function.

In fact, the first for loop, which assigns each sample to a cluster center, decreases the cost with respect to $c^{(i)}$ while the centers stay fixed; the second for loop decreases the cost with respect to $\mu^{(k)}$ while the assignments stay fixed.
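For a given assignment, the distortion is easy to compute. A small helper continuing the sketch above:

```python
def distortion(X, c, centers):
    # J = (1/m) * sum of squared distances between each sample
    # and its assigned cluster center
    return np.mean(np.sum((X - centers[c]) ** 2, axis=1))
```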

Determining the value of K

Because different initializations can lead to different clustering results, the algorithm is usually run with multiple random initializations and the corresponding cost is computed for each candidate K. The elbow method then selects the K at the obvious "elbow" of the cost curve as the number of clusters, as sketched in the code below:
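A hedged sketch of the elbow method, reusing the k_means and distortion helpers above and assuming X is the (m, n) data matrix (the range of candidate K values and the number of restarts are arbitrary):

```python
import matplotlib.pyplot as plt

Ks = range(1, 11)
costs = []
for K in Ks:
    # Several random initializations per K; keep the lowest-cost run
    runs = [k_means(X, K) for _ in range(10)]
    costs.append(min(distortion(X, c, centers) for c, centers in runs))

# The "elbow" is the K after which the cost stops dropping sharply
plt.plot(list(Ks), costs, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("cost J")
plt.show()
```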

Two other commonly used clustering algorithms are described below.

Bisecting K-means algorithm

The bisecting K-means algorithm, unlike standard K-means, does not pick K random cluster centers all at once. It first assigns all points to a single cluster and then splits that cluster in two. At each step it computes the cost (error) of every current cluster, selects the cluster whose split most reduces the total error (so that the overall error is minimized), splits it in two, and repeats until the desired number of clusters is reached.

The error measure is the SSE (Sum of Squared Errors). The smaller the SSE, the more tightly concentrated the objects in a cluster are.

The algorithm is greedy and computationally expensive.
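A minimal sketch of the idea, reusing the k_means and distortion helpers from above (it splits the cluster whose bisection most reduces the total SSE; this is one common variant, and details differ between implementations):

```python
def bisecting_k_means(X, K):
    # Assumes X has at least K points; start with all points in one cluster
    clusters = [X]
    while len(clusters) < K:
        best_gain, best_idx, best_split = -np.inf, None, None
        for i, pts in enumerate(clusters):
            if len(pts) < 2:
                continue  # cannot bisect a single point
            # SSE of the cluster before and after a 2-means split
            sse_before = len(pts) * distortion(
                pts, np.zeros(len(pts), dtype=int),
                pts.mean(axis=0, keepdims=True))
            c, centers = k_means(pts, 2)
            sse_after = len(pts) * distortion(pts, c, centers)
            if sse_before - sse_after > best_gain:
                best_gain, best_idx = sse_before - sse_after, i
                best_split = [pts[c == 0], pts[c == 1]]
        # Replace the chosen cluster with its two halves
        clusters = clusters[:best_idx] + best_split + clusters[best_idx + 1:]
    return clusters
```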

DBSCAN algorithm

DBSCAN is a density-based algorithm: it treats regions of sufficient density as clusters and can find clusters of arbitrary shape in a spatial database with noise. It defines a cluster as the largest set of density-connected points.

Basic concepts:

  • Eps-neighborhood: the region within radius Eps of a given object is called that object's Eps-neighborhood.
  • MinPts: the minimum number of points a neighborhood must contain; used to decide whether a given point p is a core point of a cluster, a boundary point, or noise.
  • Core object: an object is a core object if its Eps-neighborhood contains at least MinPts objects.
  • Boundary point: a point that is not a core point but falls within the neighborhood of a core point; a point on the edge of a dense region.
  • Noise point: any point that is neither a core point nor a boundary point; a point in a sparse region.
  • Directly density-reachable: if p lies in the Eps-neighborhood of q, and q is a core object, then object p is said to be directly density-reachable from object q.
  • Density-reachable: if there is a chain of objects $p_1, p_2, \dots, p_n$ with $p_1 = q$ and $p_n = p$, such that each $p_{i+1}$ is directly density-reachable from $p_i$ ($p_i \in D$, $1 \le i < n$) with respect to Eps and MinPts, then object p is density-reachable from object q with respect to Eps and MinPts.
  • Density-connected: if there is an object $o \in D$ such that both p and q are density-reachable from o with respect to Eps and MinPts, then objects p and q are density-connected with respect to Eps and MinPts.
  • Density-based cluster: a maximal set of density-connected objects, based on density-reachability.
  • Noise: objects not contained in any cluster.

The DBSCAN algorithm searches for clusters by checking the Eps-neighborhood of each point in the data set.

Algorithm process


$$
\begin{aligned}
&(1)\ \text{mark all objects in data set } D \text{ as unprocessed} \\
&(2)\ for\ \text{each object } p \text{ in data set } D\ \ do \\
&(3)\ \quad if\ p \text{ has already been assigned to a cluster or marked as noise}\ \ then \\
&(4)\ \qquad continue; \\
&(5)\ \quad else \\
&(6)\ \qquad \text{examine the } Eps\text{-neighborhood } E(p) \text{ of object } p; \\
&(7)\ \qquad if\ E(p) \text{ contains fewer than } MinPts \text{ objects}\ \ then \\
&(8)\ \qquad \quad \text{mark object } p \text{ as a boundary point or noise point}; \\
&(9)\ \qquad else \\
&(10)\ \qquad \quad \text{mark object } p \text{ as a core point and create a new cluster } C; \\
&(11)\ \qquad \quad for\ \text{each unprocessed object } q \text{ in } E(p)\ \ do \\
&(12)\ \qquad \qquad \text{examine its } Eps\text{-neighborhood } E(q) \text{; if } E(q) \text{ contains at least } MinPts \\
&\qquad \qquad \quad \text{objects, add the objects in } E(q) \text{ that do not yet belong to any cluster to } C
\end{aligned}
$$
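Rather than hand-rolling the loop above, scikit-learn's DBSCAN implementation can be used directly. A quick sketch on toy data (the eps and min_samples values are arbitrary and would need tuning on real data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two dense blobs plus a few scattered noise points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2)),
               rng.uniform(-2, 7, (5, 2))])

# eps corresponds to the Eps radius, min_samples to MinPts
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_       # cluster index per point; -1 marks noise
print(np.unique(labels))  # e.g. [-1  0  1]
```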

Algorithm evaluation

Advantages:

  • No need to specify the number of clusters in advance
  • Good at finding outliers (useful for anomaly detection)
  • Can find clusters of arbitrary shape
  • Requires only two parameters (Eps and MinPts)

Disadvantages:

  • The parameters are hard to choose (and strongly influence the results)
  • High-dimensional data is harder to handle (dimensionality reduction can help)