1. In this paper,

Clustering is a technology of statistical data analysis, which is widely used in many fields, including machine learning, data mining, image analysis and so on. Clustering is to divide similar objects into different groups or more subsets so that the member objects of each subset have similar attributes.

The so-called clustering algorithm is actually a method that automatically divides a pair of unlabeled data into several categories. In terms of application scenarios, clustering can help us solve many classification problems in computers, such as color classification, density classification in spatial coordinates, and crowd characteristics classification in e-commerce. In addition to classification problems, it can also help us to implement “exception checking”. What is exception checking? We can understand to find noise point, colloquially speaking is to find those rat excrement in a pot of porridge.

This article mainly introduces the realization principle of clustering algorithm and how clustering algorithm is applied in D2C design draft generation code.

2 DBSCAN clustering algorithm

DBSCAN – Density – based clustering algorithm with noise. Compared with k-means clustering, which is only suitable for convex sample sets, DBSCAN can be applied to both convex and non-convex sample sets. It can sort the scattered samples based on a certain degree of similarity, that is, without knowing the number of facts, the density of the samples can be classified. Here’s an example:

We need to cluster the data of “100, 101, 123, 98, 200, 203, 220”. If the minimum value of cluster forming is 2, then the cluster density threshold set is 30. Then “100, 101, 123, 98” and “200, 203, 220” will be grouped into two clusters. When the clustering density threshold is 10. Then “100, 101, 98”, “200, 203”, divided into two clusters, “123” and “220” are the noise points.

2.1 Core Ideas

DBSCAN algorithm focuses on finding all the dense areas in the sample points, which are called cluster clusters. So the sample points that are not in the dense region are called noise points. So DBSCAN can not only help you do classification, but also find “rat droppings in a pot of porridge”.

2.2 Algorithm Parameters

parameter instructions
Neighborhood radius Eps: Refers to the search radius of each sample point. Other sample points scanned within the search radius can be interpreted as the sample points scanned and the center pointsimilar
Minimum number of points: The minimum number of samples that can be aggregated into clusters, can be understood as eachStraw cocooning frameMinimum number of samples required. In the figure above, we can see that the number of sample points scanned by both red and blue within radius R is > minpoints, while the number scanned by yellow is less than minpoints.

Category of 2.3 points

category instructions
emphasis The number of sample points in neighborhood radius Eps >= the minimum number of minpoints.
Boundary point A point that is not a core point but is in the neighborhood of a core point.
noise It’s neither a core point nor a boundary point.

2.4

Relationship between instructions
Density of direct A is the core point, and B is in the neighborhood Eps of A, so the density from A to B is direct. Any boundary point from the core point to its neighborhood EpsDensity of direct.
The density of If there are core points C, D, E, F. C to D, D to F, E to F. So we could call it C to FThe density of. F to G is density direct, C to G is density directThe density of.
Density is connected If there is a core point at which both sample points X and Y are densified, then X is said to be connected to the density of Y.
Nondensity connection If it’s not connected to the densityNondensity connection, the two non-density points belong to different mounting points, or one of them is the noise point.

2.5 Algorithm implementation steps

The maximum density connected sample set derived from the density reachability relation is a category or cluster of our final clustering. The implementation can be divided into the following four steps:

Step 1: Select any core location with no category as the initial point;

Step 2: Find out the sample set whose density can be reachable to the core point, that is, find out all the boundary points in the neighborhood of the core point, and then it can be turned into a cluster.

Step 3: Continue to find another core point without category and repeat step 2.

Step 4: Until all the points.

To give a more vivid example: you can imagine that there is a pyramid selling person (the core point) in a group of people. To develop downlines, he needs to find N people (minPoints) first, so he goes around to find people to develop downlines. Then downlines (boundary points) will continue to find downlines until there is no one around.

3 combination of layout algorithm and DBSCAN

After a brief introduction of DBSCAN algorithm concept and algorithm implementation, we will talk about the application scenario of clustering algorithm in Deco layout algorithm.

The core of layout algorithm is actually group. How to judge whether a group can be formed based on the location information and size of each module of the design draft is the key. In simple terms, it is how to accurately put a DIV around a bunch of nodes.

As shown in the figure above, there are 11 nodes with white block nodes on the design draft, and the upper half and the lower half are separated according to the close distance between each node. But this is limited to our vision, so how do we make the vision of a machine think of itself as separate? We need the DBSCAN clustering algorithm to generate a cluster. Then our goal is to make the upper part form a cluster and the lower part form a cluster.

Just now, we mentioned that DBSCAN is based on the Euclidean distance between points as the basis of close relationship. Then, when looking at nodes, we change our thinking to the shortest distance between blocks as the basis of close relationship.

3.1 Point distance > block distance

In fact, to obtain the minimum distance between blocks is relatively simple, there are three cases:

First, the two blocks intersect, so the distance is actually 0.

Second: block A and block B are on/down/left/right, so only need to obtain the spacing between the two position can;

The third kind: block A and block B are above/below left/above/below right, so the Pythagorean theorem is used to obtain the distance between the diagonal line of the relative vertices of the two.

The result is as shown in the following figure. Based on the realization of the clustering algorithm, the upper and lower two clusters can be separated into two clusters:

3.2 Derivation of neighborhood radius

DBSCAN clustering algorithm has sample data set, data object number threshold MinPoints and neighborhood radius Eps in addition to input, so what is the appropriate value for neighborhood radius Eps in the layout algorithm? It can’t be a fixed value. The overall spacing of some modules is larger, while that of others is smaller. We need to work out this dynamic neighborhood radius Eps when we aggregate blocks in the actual layout.

Step 1: We first make a statistic of the distance between sample data sets, and first work out the shortest distance between these 5 blocks:

Module 1 Module 2 Module 3 Module 4 Module 5
Module 1 5 5 7 210
Module 2 5 7 5 100
Module 3 5 7 5 214
Module 4 7 5 5 107
Module 5 210 100 214 107

Step 2: Then according to the distance matrix table, we can get the shortest distance between each module and its closest module:

The module Module 1 Module 2 Module 3 Module 4 Module 5
The shortest distance 5 5 5 5 100

Step 3: In this pile of data, we need to extract more effective data as our Eps value and remove some interference items:

According to the calculation formula of standard deviation, we took 1 standard deviation as the filter term to screen out the data set conforming to the majority of samples, and took [5, 5, 5, 5, 100] to calculate its standard deviation. It can be concluded that the overall standard deviation is 38, and the average value is 24.

Then we take one standard deviation as the basis, and it can be concluded that within the range of one standard deviation, the maximum number is 24 + 38 = 62, so we can take 62 as our neighborhood radius Eps in this sample set.

3.3 Algorithm Optimization

Based on the above algorithm transformation, in fact, we have completed a more reliable module clustering and separation in the layout. So in the application of the actual algorithm, we will also make an optimization of the layout of the actual scene for the dynamic generation of neighborhood radius Eps:

For example, a layout with a horizontal spacing of 5 and a vertical spacing of 10 looks like this:

So if according to the form of the shortest distance standard deviation, in fact, the shortest distance of the 8 modules is 5, and Eps is also 5 after the final calculation, it is likely to separate the upper and lower lines.

Therefore, in practical application, in the process of generating standard deviation samples, according to certain rules, “10” of horizontal distance is also taken into account and calculated as the sample of standard deviation.

4. Technology landing

The above technologies have been implemented in Deco intelligent code generation project. Deco is our team’s exploration in the direction of “front-end intelligence”. It focuses on the one-click generation of multi-terminal code from design draft. Implement the ability to parse Sketch/Photoshop and generate multi-terminal code (Taro/React/Vue) directly. Deco enables front-end engineers not to spend a lot of energy on the design draft, greatly saving the development cost, providing powerful support for the output of more multi-terminal pages, and bringing great power for the business cost reduction and efficiency increase.

In the past year, Deco has successfully landed in two major promotions of JD. In the construction of personalized event venue, the r&d efficiency has increased by 48%.

Those who are interested can go to Deco’s official website to experience it. Also attached is a nanny level tutorial on the Deco experience.

5. To summarize

This article mainly introduces the realization principle of DBSCAN, and gives specific code implementation in the introduction. If you are interested in this piece, there are also many specific code implementation logic on the Internet. The main purpose is to tell you the implementation of clustering algorithm, as well as the clustering algorithm in D2C layout on the application of the ground. In addition to DBSCAN clustering algorithm based on density, there are also many algorithms waiting for our mining on D2C layout algorithm.

Citation:

  • [1] [] density of DBSCAN clustering algorithm (www.cnblogs.com/pinard/p/62)…
  • [2] [DBSCAN clustering algorithm — Machine learning (theory + illustration + Python code)] (blog.csdn.net/huacha__/ar…)
  • [3] [DBSCAN] (blog.csdn.net/hansome_hon…)