This article is the fifth in a series of notes on Andrew Ng's machine learning course. It introduces several clustering algorithms: the classic K-means algorithm, its extension the bisecting K-means algorithm, and another commonly used algorithm, DBSCAN.
Clustering
Clustering is the process of dividing a data set into groups, or clusters, of similar objects, so that similarity between objects in the same group is maximized and similarity between objects in different groups is minimized. It is an unsupervised learning problem.
K-means algorithm
$K$ is simply the number of clusters, and it is chosen manually. The K-means algorithm partitions the data based on centroids.
Algorithm process:
- First, randomly select $K$ points as the initial cluster centers.
- For each point in the data set, compute its distance to each of the $K$ centers and associate it with the nearest center.
- Recalculate the mean of each cluster and move the cluster center to that mean.
- Repeat the previous two steps until the cluster centers no longer change.
Distance measure: Euclidean distance or cosine similarity are commonly used; a minimal implementation of these steps is sketched below.
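To make the steps concrete, here is a minimal NumPy sketch (an illustrative toy implementation assuming Euclidean distance, not code from the course; the function name `kmeans` is my own):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: X is an (m, n) data matrix, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick K distinct data points as the initial centers.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign every sample to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of the samples assigned to it.
        new_centers = centers.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:               # keep the old center if a cluster empties
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):  # converged: centers stopped moving
            break
        centers = new_centers
    return labels, centers
```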
Notation in the algorithm description:
- $\mu^{(i)}$ denotes the $i$-th cluster center, $x^{(i)}$ denotes the $i$-th sample, and $m$ denotes the number of samples.
Optimization objective
K-means also minimizes a clustering cost: the sum of squared distances between every data point and its associated cluster center. The cost function of K-means is:
$$J(c^{(1)},\ldots,c^{(m)},\mu^{(1)},\ldots,\mu^{(K)})=\dfrac{1}{m}\sum\limits_{i=1}^{m}||x^{(i)}-\mu^{c^{(i)}}||^2$$

where $c^{(i)}$ is the index of the cluster to which sample $x^{(i)}$ is assigned, and $\mu^{c^{(i)}}$ is the center of that cluster. $J$ is also known as the Distortion Function.
In fact, the first step of each iteration (assigning samples to cluster centers) minimizes $J$ with respect to the assignments $c^{(i)}$ while holding the centers fixed, and the second step (moving the centers) minimizes $J$ with respect to the centers $\mu^{(k)}$ while holding the assignments fixed, so the cost never increases from one iteration to the next.
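For reference, the distortion for a given assignment can be computed directly; this small NumPy helper (my own, matching the formula above) makes it easy to verify that the cost never increases across iterations:

```python
import numpy as np

def distortion(X, labels, centers):
    """J = (1/m) * sum_i ||x^(i) - mu^(c^(i))||^2, the cost function above."""
    return np.mean(np.sum((X - centers[labels]) ** 2, axis=1))
```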
Determining the value of $K$
Because different random initializations can produce different clustering results, the algorithm is usually run with many random initializations, the cost function is computed for each run, and the best result is kept. To choose the number of clusters, the Elbow Method plots the cost against $K$ and selects the $K$ at the obvious "elbow" of the curve.
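For example, one can run K-means for a range of $K$ values and plot the resulting cost curve; a sketch with scikit-learn on toy data (the elbow is then read off the plot):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data

ks = range(1, 11)
costs = []
for k in ks:
    # n_init=10: rerun each K with 10 random initializations and keep the best.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    costs.append(km.inertia_)  # inertia_ = sum of squared distances to centers

plt.plot(list(ks), costs, marker="o")
plt.xlabel("K")
plt.ylabel("cost")
plt.show()  # pick the K where the curve bends (for this toy data, around K = 4)
```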
Two other commonly used clustering algorithms are described below:
Bisecting K-means algorithm
The bisecting K-means algorithm, unlike conventional K-means, does not pick $K$ random cluster centers all at once. Instead it first assigns all points to a single cluster and then splits that cluster in two. At each step it computes the cost (error) of every existing cluster, selects the cluster with the largest error, splits it in two (so that the total error drops as much as possible), and repeats until the desired number of clusters is reached.
The error here is the SSE (Sum of Squared Errors); the smaller the SSE, the more tightly concentrated the objects in a cluster are.
The algorithm is greedy and computationally expensive. A sketch of this strategy follows.
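A rough sketch reusing scikit-learn's `KMeans` for each two-way split (illustrative only; splitting the cluster with the largest SSE is one common selection criterion):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    return ((points - points.mean(axis=0)) ** 2).sum()

def bisecting_kmeans(X, K, seed=0):
    clusters = [np.arange(len(X))]            # start with all points in one cluster
    while len(clusters) < K:
        # Select the cluster with the largest SSE and split it with 2-means.
        worst = max(range(len(clusters)), key=lambda i: sse(X[clusters[i]]))
        idx = clusters.pop(worst)
        halves = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[idx])
        clusters += [idx[halves == 0], idx[halves == 1]]
    return clusters                           # one index array per cluster

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
print([len(c) for c in bisecting_kmeans(X, K=4)])  # cluster sizes
```

Recent versions of scikit-learn also ship a ready-made `sklearn.cluster.BisectingKMeans`.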
DBSCAN algorithm
DBSCAN is a density-based algorithm: it treats regions of sufficient density as clusters and can discover clusters of arbitrary shape in a spatial database with noise. It defines a cluster as a maximal set of density-connected points.
Basic concepts:
- $Eps$-neighborhood: the region within radius $Eps$ of a given object is called the $Eps$-neighborhood of that object.
- $MinPts$: the minimum number of points an $Eps$-neighborhood must contain; together with $Eps$ it determines whether a given point $p$ is a core point of a cluster, a border point, or noise.
- Core object: an object is a core object if its $Eps$-neighborhood contains at least $MinPts$ objects.
- Border point: a point that is not a core point but falls within the $Eps$-neighborhood of some core point; a point on the edge of a dense region.
- Noise point: any point that is neither a core point nor a border point; a point in a sparse region.
- Directly density-reachable: if $p$ is in the $Eps$-neighborhood of $q$ and $q$ is a core object, then object $p$ is said to be directly density-reachable from object $q$.
- Density-reachable: if there is a chain of objects $p_1, p_2, \ldots, p_n$ with $p_1 = q$ and $p_n = p$, such that for every $p_i \in D$ $(1 \le i < n)$, $p_{i+1}$ is directly density-reachable from $p_i$ with respect to $Eps$ and $MinPts$, then $p$ is density-reachable from $q$ with respect to $Eps$ and $MinPts$.
- Density-connected: if there exists an object $o \in D$ such that both $p$ and $q$ are density-reachable from $o$ with respect to $Eps$ and $MinPts$, then $p$ and $q$ are density-connected with respect to $Eps$ and $MinPts$.
- Density-based cluster: a maximal set of density-connected objects, defined through density-reachability.
- Noise: Objects not contained in any cluster.
The DBSCAN algorithm searches for clusters by examining the $Eps$-neighborhood of each point in the data set.
Algorithm process
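In outline: visit each point; if it is an unvisited core point, start a new cluster there and grow the cluster through every point that is density-reachable from it; points that end up in no cluster are noise. A minimal from-scratch sketch (my own illustrative code, assuming Euclidean distance, not an optimized implementation):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN. Returns labels: -1 = noise, 0..k-1 = cluster ids."""
    n = len(X)
    labels = np.full(n, -1)                   # everything starts as noise
    visited = np.zeros(n, dtype=bool)
    # Precompute every Eps-neighborhood (pairwise Euclidean distances).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]

    cluster = 0
    for i in range(n):
        if visited[i] or len(neighbors[i]) < min_pts:
            continue                          # skip visited and non-core points
        visited[i] = True                     # unvisited core point:
        labels[i] = cluster                   # start a new cluster from it
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster           # border or core point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:   # j is also a core point,
                    queue.extend(neighbors[j])     # so expand through it
        cluster += 1
    return labels
```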
Advantages:
- No need to specify the number of clusters in advance
- Good at finding outliers (useful for anomaly detection tasks)
- Can find clusters of arbitrary shape
- Only two parameters ($Eps$ and $MinPts$)
Disadvantages:
- The parameters are difficult to select and strongly influence the results (a common heuristic for choosing $Eps$ is sketched below)
- Higher-dimensional data is harder to handle (dimensionality reduction can help)
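On the parameter-selection point: a common heuristic (not from the course notes) is to fix $MinPts$, sort every point's distance to its $MinPts$-th nearest neighbor, and read $Eps$ off the "knee" of that curve. A sketch with scikit-learn on toy data:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.06, random_state=0)  # toy data

min_pts = 5
# k-distance curve: each point's distance to its min_pts-th nearest neighbor
# (kneighbors counts the point itself as its own 0-distance neighbor).
d, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
print(np.sort(d[:, -1]))  # plot this sorted curve and pick eps at the knee

labels = DBSCAN(eps=0.1, min_samples=min_pts).fit_predict(X)  # eps read off the knee
print(f"clusters: {labels.max() + 1}, noise points: {(labels == -1).sum()}")
```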