A good scheme to help you achieve complex data sources in the cell information accurate normalization

background

Residential area is very important information in rental business, it can reflect the location and quality of housing. For tenants, the ability to browse accurate community information is the key to efficient housing search. Therefore, the collection and display of accurate community information is an important aspect to improve the efficiency of users looking for housing. In order to obtain comprehensive community information, the rental business usually relies on a variety of data sources to obtain community data, which are in different formats, information is chaotic and contains a lot of redundant information. In order to improve the efficiency of housing search, different data of the same community must be aggregated together and the subordination relationship between community information must be clarified. In this paper, a method based on text matching is designed to solve this problem by seizing the unique characteristics of the cell and using the similarity algorithm.

The target

There are many duplicate communities in the existing community data, such as “Fuding Homeland”, “Fuding Homeland Xiaofeng Garden”, “Fuding Homeland building 2 unit 3”, “West Xixi Beiyuan” and “East Xixi Beiyuan” and so on. Although these community names are not exactly the same, but some of them represent the same community or the sub-community of the same community, we call these community names synonymous community, such as “Fuding Homeland”, “Fuding Homeland Xiaofeng Garden”, “Fuding homeland 2 building 3 units”. The entire community is the father of the community, such as fuding home, Xixi North Garden. Part of the area below the community is called a sub-community, such as Fuding Home Xiaofeng Garden, East xixi Beiyuan. And like “fuding home 2 3 units” such residential address, known as building address.

In order to search and display housing information accurately and efficiently, we need to parse out the corresponding district information of each district data, as well as the hierarchical relationship of the district, and even supplement some district information. To be specific, one is to unify the existing communities into sub-communities: sub-communities are at the level of phase, district and garden, such as’ Fuding Jiayuan Xiaofeng Garden ‘and’ Fuding Jiayuan Yulu Garden ‘: 1. Sub-community is a unit, building, building on the next level: like unit, building, building, XX building such a name, belongs to the community building; 2. Each sub-community has a unique parent community. For example, the parent community of “Fuding Homeland Xiaofeng Garden” is “Fuding Homeland”; Two is to be able to supplement the father community, child community information: for the community library does not exist in the father community or child community information, can be added.

Train of thought

As a unique address unit, a cell has the following characteristics:

The parent of a child community is usually the longest common prefix of multiple sub-communities: the naming of a community is a hierarchical structure, and the children of the same parent community usually have the same prefix, which is in line with the naming habit of people for the location;
Community and the name of the community building has unique characteristics: such as sub district most fit this pattern: “PP [ww | | | xx yy zz] area”, “PP/ww | | xx yy period”, “PP/ww | | xx yy number” and “PP/ww | | xx yy” and so on. Where PP is a common prefix, that is, the name of the virtual parent community, Ww represents a number, XX represents a Chinese character representing a number, YY represents a uppercase or lowercase letter, and ZZ represents a location word (such as east, west, south, north, northeast, northwest, etc.). The residential building address is usually in the following form: “PP/ww | | xx yy house”, “the PP/ww | | xx yy floor”, “PP/ww | | xx yy building”, “PP/ww | | xx yy unit” and “PP/ww | | xx yy tower” and so on (PP is sub district, ww, xx and yy on behalf of the meaning is the same as above);
As a relatively small address unit, the area is small, and the distance between different sub-communities of the same community can not be too far;
Different subcells of the same cell often have very similar names.

Based on the above observations, we propose a cell normalization scheme based on prefix matching and text similarity. The basic idea is:

Use prefix matching algorithm to implement preliminary clustering of cells,
Then, the text similarity is calculated to add distance weight for further screening.
Finally, father and son identification.

Data preprocessing

We determine a community according to city, district, community name and latitude and longitude information. All plot data is stored in a table: plot ID, city, district, name of the plot, GPS of the plot, source source (to mark the source of the plot), type type (0 for the parent plot, 1 for the virtual parent plot, 2 for the child plot, 3 for the building address), and id of the parent plot. We need to preprocess the original cell data:

Need to do data processing on the original data: urban format, sorted into similar: Hangzhou, Yuhang District; Some cell GPS are non-Autonavi GPS and need to be converted to Autonavi GPS
In some district data, there is only the name of city and city street district, without specific region and latitude and longitude information, so it is necessary to use map prompts for correction, and try to complete the information of region and latitude and longitude
There will also be many punctuation marks in the cell name, which will interfere with our analysis. We will first remove these punctuation marks and only conduct matching analysis on Chinese characters

Normalized algorithm

The normalization process of cell information is shown in the figure below:

The main idea is to use the prefix matching algorithm matching tree village get approximate village, and then in the same approximation village trees filter is not suitable for the village, then tree algorithm based on similarity matching between the residential area, and then merge synonymous community tree, get the final normalized plot tree, in every community tree can use pattern matching to identify parent-child area. Here are the first four steps.

The prefix matches the aggregation cell

Most of the children of the same parent cell mentioned earlier have the same prefix. We use this as the entry point to identify the approximate community. The specific method is as follows: For each community in the same city and region, start to search all the communities that start with the first two characters of their name, and these communities have a common prefix, called the approximate community tree with the community as the root. Find all the approximate cell tree, increase the prefix length, the approximate cell tree constantly split into smaller trees, stop the increase of prefix length at the appropriate time, and finally get each cell tree inside the approximate cell, can extract the parent cell and child cell. But how do you determine the maximum prefix length? Divided into the following situations:

If it is able to judge the name of the community is the child community or building, directly extract the name of the father community, if there is no father community of the same name is a new name of the father community, and then search all to the father community name prefix of the community, a tree to the father community as the root of the approximate community tree; The method to judge whether a community name represents a sub-community or a community building is to complete the regular matching of the community name explained above.
If there are other cells prefixed with the local cell, the local cell is considered a parent cell and all the cells prefixed with the local cell are grouped together. The figure below shows two examples of cases 1 and 2, where blue is the parent community, red is the building address, and yellow is the child community:
For other communities, the prefix length should be determined according to the following principles: the number of communities prefixed with this length should not exceed 20 (excluding duplicate communities and those with the same city, district and community name). Generally speaking, there are not too many children in a community, and it is very rare to have more than 20 children in a parent community. For example, in the figure below, the community “Jade City Mulan Garden” could not be identified as a sub-community or building address, and its approximate community tree was obtained by prefix. The number of trees was less than 20, so prefix growth stopped.

Doing this for each cell results in the corresponding prefix tree for each cell, also known as the cell tree (implemented using prefix trees), which is called the root of the cell tree. It’s easy to know that a cell can be in more than one cell tree. In this process can also identify part of the community building, the child community and the father of the community. Essentially, this step is a process of text clustering. As shown in the figure below, text clustering usually divides text words, and then uses TF-IDF(Term Frequency-inverse Document Frequency) to calculate word Frequency, set word weight, and construct VSM(Vector Space Model, Vector space model), construct equal length vectors for each text, finally set metrics (Euclidean distance, cosine similarity, etc.), and use clustering algorithm to cluster the text.

This method is not suitable for our scenario: Firstly, this method is usually used for clustering documents with multiple feature words, but the community name is short, so it is difficult to extract effective feature words; Secondly, there is an obvious feature of community name, that is, the name of the father-son community and the pillar number of the building is hierarchical in order, but this feature can not be reflected and utilized in the text vectorization clustering algorithm; 3 At the same time, common clustering algorithms such as K-means require skill and exploration to select the appropriate K value, while our method avoids this problem because the number of subcommunities is not too large.

Non-synonymous cell filtering

The result from the previous step is a cell tree with the same prefix. Rely on the prefix, we circle the selection of the community is more, many do not belong to the same father of the community was chosen to circle the same community tree. In general, the different subdivisions of a community are not very far apart. So, we filter out the geographically distant neighborhoods. Specifically, if the distance between A and the root of the approximate cell tree is greater than 2km, the cell will be deleted from the cell tree.

Approximate cell reaggregation

Not all synonymous cells have the same suffix. Therefore, we also supplement some missing synonymous cells through text similarity: calculate the editing distance and GPS distance of the approximate tree prefix. For the approximate tree with GPS distance less than 1km and similarity greater than 2, it can be combined into a synonymous cell tree, where similarity is calculated as follows:

Where A and B are the names of root plots of two plot trees respectively; Max (a,b) is the maximum length of A and B; LevenshteinDistance(a, b) is the editing distance; the larger similarity(a, b) indicates that A and B are more similar.

As shown in the figure above, “Xixi Style” is not the prefix of “Dahua Xixi Style”, so in the prefix matching aggregation in the first step, it is not added to the approximate community tree of “Dahua Xixi Style”. To solve this problem, we calculated the similarity between “Xixi Style” and “Dahua Xixi Style” as 3, which means that the total length of the text is a multiple of the text difference. The larger the similarity is, the smaller the relative difference is. When this similarity is greater than 2, we merge the two similarity trees.

The normalized

There will be a lot of overlap between the trees of each cell after the corresponding cell tree is gathered by taking each cell as the root. This step merges the intersecting cell trees to get the final cell normalization result. The communities in the merged community tree can be considered as synonymous communities, and this step can be said to be the final community normalization.

assessment

In order to measure the accuracy of the neighborhood normalization algorithm, the data and artificial recognition of the neighborhood in AmAP are used to accurately distinguish the synonymous neighborhood. Mainly from two aspects:

False positives: people who fail to recognize a child that belongs to a father.
Negative miscommunication (false negatives to identify as a child of a parent community, when not belonging to that community; The data show that the positive error rate of the algorithm introduced in this paper is less than 8%, and the negative error rate is less than 5%, indicating that the accuracy of this normalization method is guaranteed to some extent.

conclusion

By observing the law of cell name and hierarchy, this paper proposes a method to solve the problem of cell information normalization by using text matching and approximation analysis. The method is simple to implement and has high accuracy, and can quickly identify the approximate residential area, which provides basic data guarantee for improving the efficiency of housing search and the accuracy of housing release.

Qcon2018·Flutter&Dart three-terminal integrated development

GDD2018·TensorFlow application “UI 2 Code”

Thousands of online problem playback technology revealed

2018 Double 11· Real-time selection of top goods and excellent products — “Mach”

Pay attention to two-dimensional code, forward-looking technology is in control