Geek Planet | A Weighted DBSCAN-Based Method for Calculating Workplace and Residence

This article highlights

With the spread of smartphones and the continuing development of information and communication technology, storing massive amounts of trajectory data has become common, and this data is now an important source for mining user behavior patterns. A user's workplace and residence are among the most important of these patterns. They can support smart-city construction, for example by optimizing commuter routes, planning industry layout, and analyzing population flows, which in turn can reduce traffic congestion and improve convenience and satisfaction. However, existing methods for calculating workplace and residence have several problems. This article proposes an improved scheme.

Existing methods

At present, there are two main approaches to locating workplace and residence: rule-based and model-based.

The rule-based approach designs its logic from business experience and selects the workplace and residence according to predefined statistical indicators.

For example, with vehicle data, indicators such as frequency and duration are collected for the start and end points of each user's daily trips, and the highest-ranking locations are selected as the workplace and residence.

With base station data, the time each user spends connected to each base station is counted, and the location with the longest connection time during working hours and the most connections on the month's working days/rest days is selected as the workplace/residence.

The model-based approach locates the workplace and residence with clustering plus a supervised model: clustering removes noise points, rules generate features, workplaces and residences are labeled manually, and a supervised model finally predicts the workplace and residence.

Limitations of existing methods

The rule-based approach has significant limitations. Each industry has its own rules and its own data structures, so the rules do not generalize. It is also difficult to enumerate every rule, so the approach adapts poorly to abnormal cases, becomes complicated, and is not precise enough.

The model-based approach requires manual annotation, which is costly, and the overall calculation process is complex. Its accuracy depends heavily on how representative the features are and how broadly the samples cover real cases.

The workplaces and residences calculated by existing methods often do not conform to business logic. For example, most workplaces should normally fall in office buildings, industrial parks, and similar POIs, with a small share in restaurants, shopping malls, and other POI types. Most residences should normally fall in POIs such as residential communities, villas, and apartments, with a small share in other POI types. However, the results of existing methods are strongly affected by the data source and cannot guarantee this distribution, which may place a large number of workplaces in residential communities and a large number of residences in office buildings or shopping malls, making the results unusable for the business.

This article proposes a more general calculation method that reduces the complexity of the whole process and improves accuracy, and that is designed to conform more closely to business logic so that the resulting workplaces and residences are more usable.

Background Information

1. Introduction to DBSCAN Clustering:

First, a threshold a is set. For each point in the sample set, a circle is drawn with the point as its center and a as its radius. The number of points inside the circle, including the center, is denoted b.

Then a threshold c is set. If b >= c, the center of the circle is called a core object.

If a core object A lies within the circle of another core object B, and B lies within the circle of another core object C, then A is said to be density-reachable from C.

If core objects Y and Z are both density-reachable from a core object X, then Y and Z are density-connected. A maximal set of density-connected samples forms one cluster, as shown in the figure below:

Its advantages are:

It can cluster dense data sets of any shape, which suits geographic location data. In contrast, clustering algorithms such as k-means are generally suited only to convex data sets;

It can find outliers during clustering and is insensitive to outliers in the data set;

The clustering results do not depend on initial values. In contrast, the initial values chosen for k-means and similar algorithms strongly influence their results.
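The definitions above (radius a, threshold c, core objects, density-reachability) can be sketched in a few lines of plain Python. This is a minimal illustration on planar points with Euclidean distance, not the article's production code; the function name dbscan and the sample points are assumptions made here.

```python
from math import dist

def dbscan(points, a, c):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise.

    a: neighborhood radius.
    c: minimum number of points in the circle (center included) for the
       center to count as a core object.
    """
    # Neighborhood of every point: indices within distance a (itself included).
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= a]
        for p in points
    ]
    is_core = [len(nb) >= c for nb in neighbors]
    labels = [-1] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster from this unassigned core object.
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            current = frontier.pop()
            if not is_core[current]:
                continue  # border point: joins the cluster but does not extend it
            for j in neighbors[current]:
                if labels[j] == -1:
                    labels[j] = cluster_id
                    frontier.append(j)
        cluster_id += 1
    return labels

# Illustrative points: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, a=1.5, c=3)
```

With these inputs the two groups become clusters 0 and 1, and the isolated point is labeled -1 (noise), which is the outlier behavior described above.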

2. Weighted geometric mean:
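The weighted geometric mean of positive values x_1, …, x_n with weights w_1, …, w_n is exp((w_1 * ln(x_1) + … + w_n * ln(x_n)) / (w_1 + … + w_n)). A minimal Python sketch of this standard definition follows; the function name and the handling of zero values are choices made here for illustration, not taken from the original method.

```python
from math import exp, log

def weighted_geometric_mean(values, weights):
    """Return exp(sum(w_i * ln(x_i)) / sum(w_i)) for non-negative values."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(v < 0 for v in values) or any(w < 0 for w in weights):
        raise ValueError("values and weights must be non-negative")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    # A zero value with positive weight drives the whole product to zero.
    if any(v == 0 and w > 0 for v, w in zip(values, weights)):
        return 0.0
    log_sum = sum(w * log(v) for v, w in zip(values, weights) if w > 0)
    return exp(log_sum / total_weight)
```

Working in log space avoids overflow or underflow when many values are multiplied together.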

3. Center point of a set of longitude/latitude points:

lat_i = lat_i * pi / 180, i = 1, 2, …, n

lon_i = lon_i * pi / 180, i = 1, 2, …, n

x_i = cos(lat_i) * cos(lon_i), i = 1, 2, …, n

y_i = cos(lat_i) * sin(lon_i), i = 1, 2, …, n

z_i = sin(lat_i), i = 1, 2, …, n

x = (x_1 + x_2 + … + x_n) / n

y = (y_1 + y_2 + … + y_n) / n

z = (z_1 + z_2 + … + z_n) / n

lon = atan2(y, x)

hyp = sqrt(x * x + y * y)

lat = atan2(z, hyp)

lon_center = lon * 180 / pi

lat_center = lat * 180 / pi
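The center-point formulas translate directly into Python. The sketch below follows them step by step: convert to radians, average the unit vectors, then convert back to degrees. The function name geographic_center and the (lat, lon) tuple layout are assumptions made for this example.

```python
from math import atan2, cos, degrees, radians, sin, sqrt

def geographic_center(points):
    """points: iterable of (lat, lon) in degrees -> (lat_center, lon_center)."""
    points = list(points)
    if not points:
        raise ValueError("at least one point is required")
    n = len(points)
    x = y = z = 0.0
    for lat_deg, lon_deg in points:
        lat, lon = radians(lat_deg), radians(lon_deg)
        # Unit vector of the point on the sphere.
        x += cos(lat) * cos(lon)
        y += cos(lat) * sin(lon)
        z += sin(lat)
    x, y, z = x / n, y / n, z / n
    lon_c = atan2(y, x)
    hyp = sqrt(x * x + y * y)
    lat_c = atan2(z, hyp)
    return degrees(lat_c), degrees(lon_c)

lat_center, lon_center = geographic_center([(0.0, 10.0), (0.0, 20.0)])
```

Averaging unit vectors rather than raw degrees keeps the result correct for points on both sides of the 180° meridian, where a plain arithmetic mean of longitudes would fail.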

The specific method

1. Preprocess each user's trajectory data from the most recent X months, removing abnormal data and holiday data (long and short public holidays, not including weekends);

2. Divide working time and rest time;

3. Crawl POI data and clean it;

4. Judge whether each of the user's trajectory points falls into a certain type of POI:

If the POI has a boundary, judge directly against the boundary.

If the POI has no boundary, use the geohash8 cell of the POI point's latitude and longitude together with the ring of geohash8 cells surrounding it.

For subway stations, the longitude and latitude of each exit is taken as the center point and a square with a side length of roughly 100 m is generated; that is, 0.0005 is added to and subtracted from the exit's longitude and latitude to form the square used for the judgment.
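The two boundary-free checks in step 4 can be sketched as follows. The geohash encoder is the standard bit-interleaving algorithm; the helper names (geohash_encode, geohash8_ring, in_poi_without_boundary, in_subway_exit_square) are hypothetical, and finding the surrounding ring by shifting the point one cell width in each direction is a simplification assumed here that ignores the poles and the 180° meridian.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

# Cell size of a geohash8: 40 bits, 20 for longitude and 20 for latitude.
GEOHASH8_LAT_STEP = 180.0 / 2 ** 20
GEOHASH8_LON_STEP = 360.0 / 2 ** 20

def geohash_encode(lat, lon, precision=8):
    """Encode a (lat, lon) pair in degrees as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    use_lon = True  # bits alternate, starting with longitude
    while len(bits) < precision * 5:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bits.append(1 if lon >= mid else 0)
            lon_lo, lon_hi = (mid, lon_hi) if lon >= mid else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bits.append(1 if lat >= mid else 0)
            lat_lo, lat_hi = (mid, lat_hi) if lat >= mid else (lat_lo, mid)
        use_lon = not use_lon
    chars = []
    for k in range(0, len(bits), 5):
        value = 0
        for b in bits[k:k + 5]:
            value = (value << 1) | b
        chars.append(_BASE32[value])
    return "".join(chars)

def geohash8_ring(lat, lon):
    """The geohash8 cell of a point plus its 8 surrounding cells."""
    return {
        geohash_encode(lat + di * GEOHASH8_LAT_STEP, lon + dj * GEOHASH8_LON_STEP)
        for di in (-1, 0, 1)
        for dj in (-1, 0, 1)
    }

def in_poi_without_boundary(lat, lon, poi_lat, poi_lon):
    """True if the trajectory point's geohash8 is the POI's cell or a neighbor."""
    return geohash_encode(lat, lon) in geohash8_ring(poi_lat, poi_lon)

def in_subway_exit_square(lat, lon, exit_lat, exit_lon, half_side_deg=0.0005):
    """True if the point lies in the square of +/- 0.0005 degrees around an exit."""
    return (abs(lat - exit_lat) <= half_side_deg
            and abs(lon - exit_lon) <= half_side_deg)
```

In practice the ring of cells for every POI would be precomputed once and stored in a lookup table keyed by geohash8, so that each trajectory point needs only one encode and one lookup.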

5. Give different weights to points that fall into different categories of POI, depending on whether the workplace or the residence is being calculated;

For example, when calculating the workplace, a point that falls in an office building is given a higher weight;

6. Trajectory points falling into different time periods are also given different weights;

For example, when calculating the workplace, points recorded during working time are given a relatively high weight;
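One way to express steps 5 and 6 is a pair of lookup rules whose results are combined into a single weight per trajectory point. Everything specific in the sketch below is hypothetical: the weight values, the POI type names, the 09:00-18:00 weekday definition of working time, and the choice to multiply the two weights are assumptions for illustration, since the article does not publish them.

```python
from datetime import datetime

# Hypothetical POI weights for the workplace calculation.
WORK_POI_WEIGHTS = {
    "office_building": 3.0,
    "industrial_park": 3.0,
    "shopping_mall": 1.5,
    "residential_community": 0.5,
}
DEFAULT_POI_WEIGHT = 1.0  # point matched no POI, or an unlisted POI type

def is_working_time(ts):
    """Assumed split: weekdays from 09:00 up to 18:00 count as working time."""
    return ts.weekday() < 5 and 9 <= ts.hour < 18

def work_point_weight(poi_type, ts):
    """Weight of one trajectory point when calculating the workplace."""
    poi_weight = WORK_POI_WEIGHTS.get(poi_type, DEFAULT_POI_WEIGHT)
    time_weight = 2.0 if is_working_time(ts) else 1.0  # hypothetical values
    return poi_weight * time_weight  # combining by product is an assumption

weekday_office = work_point_weight("office_building", datetime(2024, 3, 6, 10, 0))
weekend_office = work_point_weight("office_building", datetime(2024, 3, 9, 10, 0))
```

A second table with higher weights for residential POI types and rest-time periods would play the same role for the residence calculation.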

7. Using the weights from steps 5 and 6, run weighted DBSCAN clustering separately on each user's working-time and rest-time trajectory data, and tune the parameters;
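A common way to weight DBSCAN, assumed here, is to replace the point count in the core-object test with the sum of the weights inside the radius. The following pure-Python sketch does this with great-circle (haversine) distance in meters; the function names, the sample coordinates, and the parameter values are illustrative, not the article's actual settings.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000.0  # mean Earth radius

def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))

def weighted_dbscan(points, weights, eps_m, min_weight):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if haversine_m(points[i], points[j]) <= eps_m]
        for i in range(n)
    ]
    # Weighted core test: the summed weight inside the radius (the point itself
    # included) replaces the plain point count of ordinary DBSCAN.
    is_core = [sum(weights[j] for j in nb) >= min_weight for nb in neighbors]
    labels = [-1] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            current = frontier.pop()
            if not is_core[current]:
                continue  # border point: joins the cluster but does not extend it
            for j in neighbors[current]:
                if labels[j] == -1:
                    labels[j] = cluster_id
                    frontier.append(j)
        cluster_id += 1
    return labels

# Hypothetical trajectory points (lat, lon) and their weights.
points = [(31.2300, 121.4700), (31.2301, 121.4701), (31.2302, 121.4700),
          (31.3000, 121.5000), (31.3001, 121.5001)]
weights = [1.0, 1.0, 1.0, 3.0, 3.0]
labels = weighted_dbscan(points, weights, eps_m=50.0, min_weight=5.0)
```

Here the two heavily weighted points form a cluster while the three low-weight points are treated as noise; with equal weights and min_weight=3.0 the outcome is reversed. The pairwise distance computation is quadratic and only suitable for small inputs. For larger data, scikit-learn's DBSCAN accepts a sample_weight argument in fit that applies the same weighted core test.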

8. Count the number of working-time/rest-time points and the number of working/rest days within each cluster:

Calculate each user's total number of working-time/rest-time points;

Calculate the number of the user's working-time/rest-time points in each cluster and its proportion of the user's total working-time/rest-time points;

Calculate the proportion of the user's working/rest days in each cluster to the user's total working/rest days;

Calculate the weighted geometric mean of the proportion of total points and the proportion of total days to obtain a score for each cluster. The highest-scoring working-time cluster is the workplace, and the highest-scoring rest-time cluster is the residence.

Example: in the first example, cluster B is ultimately selected; in the second example, cluster C.
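The scoring in step 8 can be sketched as below. The numbers are hypothetical and do not reproduce the article's examples; the equal default weights for the two proportions, and the function names, are assumptions, since the article does not state the weights it uses.

```python
from math import exp, log

def cluster_score(point_share, day_share, w_points=1.0, w_days=1.0):
    """Weighted geometric mean of a cluster's point share and day share."""
    if point_share <= 0 or day_share <= 0:
        return 0.0
    weighted_logs = w_points * log(point_share) + w_days * log(day_share)
    return exp(weighted_logs / (w_points + w_days))

def pick_cluster(cluster_stats, total_points, total_days,
                 w_points=1.0, w_days=1.0):
    """cluster_stats maps cluster id -> (n_points, n_days) for one user and
    one time split (working time or rest time). Returns (best_id, scores)."""
    scores = {
        cid: cluster_score(n_points / total_points, n_days / total_days,
                           w_points, w_days)
        for cid, (n_points, n_days) in cluster_stats.items()
    }
    best_id = max(scores, key=scores.get)
    return best_id, scores

# Hypothetical user: cluster 0 has many points on few days,
# cluster 1 has fewer points but appears on most working days.
stats = {0: (600, 5), 1: (300, 20)}
best, scores = pick_cluster(stats, total_points=1000, total_days=22)
```

Because the geometric mean is pulled down by whichever proportion is small, a cluster that is dense in points but visited on only a few days scores lower than one that recurs on most days, so cluster 1 is selected here.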

9. Calculate the geographic center of the workplace cluster and of the residence cluster using the center-point formula above to get the final workplace and residence locations;

10. Assign a confidence level to the workplace and the residence based on the number of active days and the number of points in the corresponding cluster: the more active days and the more points, the higher the confidence;

11. Even with the weighted clustering described above, the results cannot be guaranteed to conform fully to business logic. Therefore, the types of the POIs within a certain range of each workplace/residence, and their distances from it, are also provided so that business users can filter workplaces/residences by distance.
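Step 11 amounts to a radius query around each computed location. A minimal sketch using haversine distance follows; the function name nearby_pois, the (type, lat, lon) tuple layout, the 200 m radius, and the sample POIs are all hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000.0  # mean Earth radius

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))

def nearby_pois(center_lat, center_lon, pois, radius_m):
    """pois: iterable of (poi_type, lat, lon).
    Returns [(poi_type, distance_m), ...] within radius_m, nearest first."""
    found = []
    for poi_type, lat, lon in pois:
        distance = haversine_m(center_lat, center_lon, lat, lon)
        if distance <= radius_m:
            found.append((poi_type, distance))
    return sorted(found, key=lambda item: item[1])

# Hypothetical POIs around a computed workplace location.
pois = [
    ("residential_community", 31.2400, 121.4700),
    ("shopping_mall", 31.2309, 121.4700),
    ("office_building", 31.2300, 121.4700),
]
result = nearby_pois(31.2300, 121.4700, pois, radius_m=200.0)
```

A business user could then, for instance, keep a workplace only if an office building or industrial park appears within a chosen distance.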
