DBSCAN interpretation

The full name of DBSCAN:

A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.

DBSCAN is a density-based clustering algorithm that is robust to noise.

The algorithm proceeds in two steps.

Step 1: search for core points and form temporary clusters

The corresponding scikit-learn source:

neighbors_model = NearestNeighbors(
    radius=self.eps,
    algorithm=self.algorithm,
    leaf_size=self.leaf_size,
    metric=self.metric,
    metric_params=self.metric_params,
    p=self.p,
    n_jobs=self.n_jobs,
)

neighbors_model.fit(X)
# This has worst case O(n^2) memory complexity
neighborhoods = neighbors_model.radius_neighbors(X, return_distance=False)

Interpretation:

Scan all sample points. If a sample point has at least min_samples points within its eps radius, it is added to the list of core points, and the points that are directly density-reachable from it form a temporary cluster.
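Step 1 can be sketched in plain Python. The toy data, eps, and min_samples values below are illustrative, not from the original source:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# toy 2-D data: one dense group of three points, a pair, and an isolated point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.0], [9.0, 9.0]])
eps, min_samples = 0.5, 3

nn = NearestNeighbors(radius=eps).fit(X)
# neighborhoods[i] holds the indices of all points within eps of X[i],
# including i itself
neighborhoods = nn.radius_neighbors(X, return_distance=False)

n_neighbors = np.array([len(nb) for nb in neighborhoods])
core_mask = n_neighbors >= min_samples
print(core_mask)  # [ True  True  True False False False]
```

Only the first three points have enough neighbors (counting themselves) to become core points; the pair and the isolated point do not.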

Step 2: merge the temporary clusters into the final clusters

# n_neighbors[i] is the number of samples (or total sample_weight) within
# eps of sample i
core_samples = np.asarray(n_neighbors >= self.min_samples, dtype=np.uint8)
dbscan_inner(core_samples, neighborhoods, labels)

Here dbscan_inner lives in _dbscan_inner.pyx (a .pyx file is Cython source, analogous to a C-language .c source file). The dbscan_inner function is the computational core of DBSCAN: it merges clusters with a stack-based depth-first search starting from each unvisited core point i, very much like the classical connected-components algorithm. The difference is that non-core points are marked as part of a cluster, but their neighbors are not expanded.

while True:
    if labels[i] == -1:
        labels[i] = label_num
        if is_core[i]:
            # only core points spread the cluster to their neighbors
            neighb = neighborhoods[i]
            for i in range(neighb.shape[0]):
                v = neighb[i]
                if labels[v] == -1:
                    push(stack, v)

    if stack.size() == 0:
        break
    # continue the depth-first search from the next point on the stack
    i = stack.back()
    stack.pop_back()
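The Cython loop can be rendered in pure Python to make the control flow easier to follow. This is a sketch, not the actual scikit-learn implementation, and the toy data is made up:

```python
import numpy as np

def dbscan_inner_py(is_core, neighborhoods, labels):
    """Label connected components of core points; border points are
    labelled but never expanded, mirroring the Cython loop."""
    label_num = 0
    stack = []
    for i in range(labels.shape[0]):
        if labels[i] != -1 or not is_core[i]:
            continue  # skip already-labelled points and non-core seeds
        while True:  # depth-first search starting from core point i
            if labels[i] == -1:
                labels[i] = label_num
                if is_core[i]:  # only core points spread the cluster
                    for v in neighborhoods[i]:
                        if labels[v] == -1:
                            stack.append(v)
            if not stack:
                break
            i = stack.pop()
        label_num += 1

# two dense 1-D groups and one isolated point
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [9.0]])
eps, min_samples = 0.3, 3
d = np.abs(X - X.T)  # pairwise distances for 1-D data
neighborhoods = [np.flatnonzero(d[i] <= eps) for i in range(len(X))]
is_core = np.array([len(nb) >= min_samples for nb in neighborhoods])
labels = np.full(len(X), -1)
dbscan_inner_py(is_core, neighborhoods, labels)
print(labels)  # [ 0  0  0  1  1  1 -1]  (-1 marks noise)
```

The two dense groups become clusters 0 and 1; the isolated point is never reached and keeps the noise label -1.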

Points to note:

1. The parameter min_samples primarily controls how tolerant the algorithm is towards noise. The parameter eps is crucial to choose appropriately for the data set and distance function, and usually cannot be left at the default value.

eps is the neighborhood radius and min_samples is the minimum number of points. These two parameters control the clustering result and should not be left at their default values.
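A small experiment (toy data and parameter values are illustrative) shows how strongly eps changes the result on the same data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# two groups of three collinear points each, 2.4 apart
X = np.array([[0.0, 0.0], [0.3, 0.0], [0.6, 0.0],
              [3.0, 0.0], [3.3, 0.0], [3.6, 0.0]])

tight = DBSCAN(eps=0.5, min_samples=3).fit(X)  # separates the two groups
loose = DBSCAN(eps=5.0, min_samples=3).fit(X)  # merges everything
print(tight.labels_)  # [0 0 0 1 1 1]
print(loose.labels_)  # [0 0 0 0 0 0]
```

With eps=0.5 only the middle point of each group is core and two clusters emerge; with eps=5.0 every point sees every other point and the whole data set collapses into one cluster.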

2. The clusters to which non-core samples are assigned can differ depending on the data order. This happens when a non-core sample has a distance lower than eps to two core samples in different clusters.

This is another problem that is often overlooked: when the data is provided in a different order, the clustering results may not be the same! See: github.com/irvingc/dbs…

The boundary-point problem (more troublesome than in K-means): a sample point lies within eps of two core objects that are not density-reachable from each other and so belong to different clusters. How is its category decided? The cluster that is expanded first claims the sample as its own.
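The boundary-point problem can be reproduced directly. In this constructed 1-D example (all values are made up), the non-core point at 0.8 lies within eps of core points in both groups, and reversing the data order flips which cluster claims it:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# four core points near 0, four near 1.5, one border point at 0.8
X = np.array([0.0, 0.1, 0.2, 0.3, 0.8, 1.3, 1.4, 1.5, 1.6]).reshape(-1, 1)
params = dict(eps=0.55, min_samples=4)

lab_fwd = DBSCAN(**params).fit_predict(X)        # left group scanned first
lab_rev = DBSCAN(**params).fit_predict(X[::-1])  # right group scanned first

# forward order: the border point (index 4) joins the left group ...
assert lab_fwd[4] == lab_fwd[3] and lab_fwd[4] != lab_fwd[5]
# ... reversed order: it joins the right group (index 3 is now the 1.3 point)
assert lab_rev[4] == lab_rev[3] and lab_rev[4] != lab_rev[5]
```

The border point always ends up in whichever cluster is expanded first, so its label depends purely on the row order of the input.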

3. The remaining parameters metric, metric_params=None, algorithm="auto", leaf_size=30, p=2 determine how distances between sample points are computed; the default is Euclidean distance with a full traversal. sample_weight assigns a weight to each sample (default 1). A weight greater than 1 makes the sample and its eps-neighbors more likely to become core points, while a negative weight can inhibit its eps-neighbors from becoming core points.
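sample_weight can be illustrated with a tiny example (the data and weights are made up): giving one point a weight of 3 lets it stand in for three coincident points when core status is decided.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0], [0.2], [5.0]])
w = np.array([3.0, 1.0, 1.0])  # the first point counts as three samples

# unweighted: no neighborhood reaches min_samples=4, everything is noise
plain = DBSCAN(eps=0.5, min_samples=4).fit_predict(X)
# weighted: the neighborhoods of points 0 and 1 have total weight 4
weighted = DBSCAN(eps=0.5, min_samples=4).fit_predict(X, sample_weight=w)
print(plain)     # [-1 -1 -1]
print(weighted)  # [ 0  0 -1]
```

With the weight applied, points 0 and 1 both become core points and form a cluster, while the distant point remains noise.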

References:

blog.csdn.net/qq_35405379…

zhuanlan.zhihu.com/p/336501183