1. Background

“Aggregation solves the problem of ‘comparing data’. The success rate and accuracy of aggregation directly set the tone for the price comparison experience users have on the website.”

In the hotel business unit, aggregation has always been regarded as fundamental and core to the business. Whether under Qunar’s initial positioning as a quotation search engine or its current transformation into a price comparison platform, our business model requires us to acquire large amounts of hotel data from numerous agents and channels and to integrate them. Aggregation therefore solves the problem of “comparing data”. It is no exaggeration to say that the success rate and accuracy of aggregation directly set the tone of the price comparison experience for users on the website.

The responsibilities of hotel aggregation can be summarized as follows:

Hotel data (hotel trees) from different agents are unified under the corresponding Qunar hotel, which gives the price comparison platform a mapping between each agent hotel and its Qunar hotel.

For example, the following set of aggregation relationships:

At present, the hotel aggregation algorithm mainly relies on the hotel name, address, city, coordinates, and telephone number. The name and address are the key references and can directly determine the aggregation result in most scenarios. Coordinates and telephone numbers are used only for auxiliary judgment because of problems with source data standards (for example, inconsistent coordinate systems, or hotel phone numbers mixed with agent phone numbers).
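The auxiliary role of coordinates and telephone numbers can be sketched as follows. This is a minimal illustration, not Qunar's implementation: the field names `phone` and `coord`, the 0.2 km radius, the bonus value, and the trailing-digit phone comparison are all assumptions, and both coordinates are assumed to be in the same coordinate system already.

```python
import math
import re


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def same_phone(a, b, min_digits=7):
    """Compare phone numbers by trailing digits, ignoring formatting and prefixes."""
    da, db = re.sub(r"\D", "", a), re.sub(r"\D", "", b)
    if len(da) < min_digits or len(db) < min_digits:
        return False
    return da[-min_digits:] == db[-min_digits:]


def auxiliary_bonus(rec_a, rec_b, max_km=0.2, bonus=0.05):
    """Return a small additive bonus; these fields never decide a match alone."""
    score = 0.0
    if rec_a.get("phone") and rec_b.get("phone"):
        if same_phone(rec_a["phone"], rec_b["phone"]):
            score += bonus
    if rec_a.get("coord") and rec_b.get("coord"):
        if haversine_km(*rec_a["coord"], *rec_b["coord"]) <= max_km:
            score += bonus
    return score
```

Keeping the bonus small and purely additive reflects the point above: a shared phone number may belong to the agent rather than the hotel, so it should nudge a score rather than determine it.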

2. Pain points and difficulties

“Address conventions vary greatly from country to country, and an algorithm based on text similarity has limited ability to parse the information in hotel names and addresses.”

The international hotel business started late at Qunar, so all kinds of basic data, especially aggregation data, have had little time to accumulate. Data from international agents is uneven in quality and poorly standardized. In addition, operating resources are limited, which makes relying on manual effort to establish aggregation relationships costly, inefficient, and unrealistic. Given this situation, we urgently need to improve the capability of the automatic aggregation algorithm for international hotels. This article describes how the aggregation algorithm was optimized for several key countries over the past year, in the hope of offering some ideas and references to colleagues with similar business pain points.

At present, the pain points and difficulties of international hotel aggregation fall into the following two areas:

A. Localized differences in address information between different countries

As the chart above shows, the address format varies from country to country and is mixed with localized information (highlighted in red).

B. The original algorithm is built around text similarity and has limited ability to parse the information in hotel names and addresses

In essence, the hotel aggregation algorithm calculates the text similarity between two hotel records and judges that an aggregation relationship exists once the similarity score reaches a certain threshold. However, long text fields such as the name and address contain a variety of components, such as brand, branch, hotel industry words, road name, house number, and city, and these components differ in importance.

For example, “Yongzheng”, as brand information, can almost directly identify “Beijing Yongzheng Business Hotel”, whereas “business hotel”, as an industry term, is too vague to locate any specific hotel. If these components are not clearly distinguished and text similarity is calculated directly, the following two hotels may be aggregated in error.
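This failure mode is easy to reproduce with a small sketch. The code below uses token-level Jaccard similarity as a stand-in for the original text similarity measure, which the article does not specify. The threshold and the second hotel name are invented for illustration and are not taken from the original figure.

```python
def tokens(name):
    """Lowercase and split on whitespace; real data needs proper word segmentation."""
    return set(name.lower().split())


def jaccard(a, b):
    """Token-set Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


THRESHOLD = 0.5  # hypothetical aggregation score line

a = "Beijing Yongzheng Business Hotel"
b = "Beijing Yongxing Business Hotel"  # invented name with a different brand

score = jaccard(a, b)  # 3 shared tokens out of 5 distinct tokens
falsely_aggregated = score >= THRESHOLD
```

The two names share the city and the industry words, so three of five distinct tokens match and the pair clears the threshold even though the only token that identifies the hotel, the brand, is different.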

As shown in the figure below, the current international hotel aggregation algorithm is almost unable to break hotel names and addresses down into their detailed components.

3. Optimization approach

“Sort out frequently occurring name and address formats to form a variety of word segmentation results; move from plain text matching to matching segmentation results against possible patterns, and calculate similarity as a weighted score.”

As mentioned above, hotel names and addresses can be subdivided into a variety of components. The important ones, and the way their dictionaries are built, can be summarized as follows:

City, brand, and POI dictionaries are built from Qunar's basic hotel information, with coverage and synonyms expanded using agent data and official hotel websites;

Dictionaries of hotel description words and industry words are compiled through word segmentation of hotel data and reverse high-frequency statistics, followed by manual screening by the operations team;

Because road names in hotel addresses contain a large amount of localized content, this keyword dictionary is built by segmenting address words per region (generally at the country level), counting the high-frequency results, and then screening and sorting them against official data (Google, Wikipedia, etc.).
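The high-frequency statistics step can be sketched as follows. This is a simplified illustration under stated assumptions: whitespace splitting stands in for real word segmentation, and the function name `candidate_words`, the `min_share` cutoff, and the sample names are hypothetical. The output is only a candidate list for manual screening, not a finished dictionary.

```python
from collections import Counter


def candidate_words(names, known_words, min_share=0.5):
    """Return tokens found in at least `min_share` of the names that are not
    already covered by a brand/city/POI dictionary, most frequent first."""
    doc_freq = Counter()
    for name in names:
        doc_freq.update(set(name.lower().split()))  # count each token once per name
    cutoff = min_share * len(names)
    return sorted(
        (tok for tok, n in doc_freq.items() if n >= cutoff and tok not in known_words),
        key=lambda tok: (-doc_freq[tok], tok),
    )


# Invented sample data for illustration.
names = [
    "Beijing Yongzheng Business Hotel",
    "Tokyo Garden Inn",
    "Paris Central Business Hotel",
    "Grand Hotel Oslo",
]
known = {"beijing", "tokyo", "paris", "oslo", "yongzheng"}
industry_candidates = candidate_words(names, known)
```

For road-name keywords, the same function could be run separately on the addresses of each country, so that localized terms are counted against that country's data only.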

Based on the existing hotel data in the database, we compiled statistics on how names and addresses are composed and sorted out the high-frequency formats, as shown in the fragment below:

By introducing pattern matching, the various components can be extracted easily from a given hotel name and address, for example:

Example of splitting the hotel name
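A minimal sketch of this kind of pattern-based splitting is given below. It enumerates every segmentation of a name that the dictionaries allow and keeps the first one whose sequence of component types equals a known format. The lexicon entries, the pattern list, the type labels, and the rule of picking by pattern priority are all assumptions made for illustration; the article does not specify how the best match is chosen.

```python
# Hypothetical dictionaries: phrase -> set of possible component types.
LEXICON = {
    "beijing": {"CITY"},
    "yongzheng": {"BRAND"},
    "business hotel": {"INDUSTRY"},
    "hotel": {"INDUSTRY"},
}
# Hypothetical high-frequency name formats, in priority order.
PATTERNS = [
    ("CITY", "BRAND", "INDUSTRY"),
    ("BRAND", "INDUSTRY", "CITY"),
    ("BRAND", "INDUSTRY"),
]
MAX_PHRASE_LEN = 3  # longest dictionary phrase, in words


def segmentations(words, start=0):
    """Yield every way to cover words[start:] with typed segments."""
    if start == len(words):
        yield []
        return
    for end in range(start + 1, min(len(words), start + MAX_PHRASE_LEN) + 1):
        phrase = " ".join(words[start:end])
        # Unknown single words fall back to the generic OTHER type.
        types = LEXICON.get(phrase) or ({"OTHER"} if end == start + 1 else set())
        for t in sorted(types):
            for rest in segmentations(words, end):
                yield [(t, phrase)] + rest


def parse_name(name):
    """Return the first segmentation whose type sequence equals a known pattern."""
    candidates = list(segmentations(name.lower().split()))
    for pattern in PATTERNS:
        for seg in candidates:
            if tuple(t for t, _ in seg) == pattern:
                return seg
    return None


parsed = parse_name("Beijing Yongzheng Business Hotel")
```

Here "business hotel" can be read either as one industry phrase or as an unknown word followed by "hotel"; only the first reading fits a known format, so that segmentation is the one returned.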

【 Highlight 】

A complete aggregation process after optimization is as follows:

  1. Take a hotel record to be aggregated;

  2. Parse the hotel name and address against the pre-collected dictionaries for each type of vocabulary (brand, city, country) to find all possible word segmentation results;

  3. Match the word segmentation results against all possible patterns and select the best valid match as the parsing result;

  4. Using the key components (brand + branch name, road name + house number), run a full-text search over the already aggregated data and take the n results with the highest similarity as the initial candidate set;

  5. Compare the hotel being aggregated with each candidate hotel component by component. The string similarity of each component (possible match types: exact, contains, prefix, suffix, etc.) is combined with the weight of that component (for example, in hotel names: brand word >> branch > industry word, city) to produce a comprehensive score;

  6. Adjust the score in certain special cases: for example, deduct points when the road name is the same but the house number differs, and add points when the contact number is the same or the coordinates are close;

  7. Select the candidate hotel with the highest similarity score as the final candidate and check whether its score is above the given aggregation score line; if so, the aggregation is considered successful.
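The scoring, adjustment, and selection steps can be sketched as follows. This is an illustration under stated assumptions rather than the production algorithm: all weights, match scores, adjustment values, the score line, and the component keys (`BRAND`, `ROAD`, `HOUSE_NO`, `PHONE`, and so on) are hypothetical, and the candidates are assumed to have been parsed and retrieved already. The coordinate bonus is left out for brevity.

```python
# Hypothetical weights: brand matters most, generic industry words least.
WEIGHTS = {"BRAND": 0.5, "BRANCH": 0.3, "CITY": 0.15, "INDUSTRY": 0.05}
# Hypothetical score for each kind of string match.
MATCH_SCORES = {"exact": 1.0, "prefix": 0.8, "suffix": 0.8, "contains": 0.6, "none": 0.0}
SCORE_LINE = 0.7  # hypothetical aggregation score line


def match_kind(a, b):
    """Classify how two component strings relate."""
    a, b = a.lower(), b.lower()
    if a == b:
        return "exact"
    short, long_ = sorted((a, b), key=len)
    if not short:
        return "none"
    if long_.startswith(short):
        return "prefix"
    if long_.endswith(short):
        return "suffix"
    if short in long_:
        return "contains"
    return "none"


def name_score(x, y):
    """Weighted similarity over the components of x; components missing in y score 0."""
    present = [c for c in WEIGHTS if c in x]
    total = sum(WEIGHTS[c] for c in present)
    if total == 0:
        return 0.0
    got = sum(WEIGHTS[c] * MATCH_SCORES[match_kind(x[c], y.get(c, ""))] for c in present)
    return got / total


def adjust(score, x, y):
    """Deduct for same road but different house number; add for same phone."""
    if x.get("ROAD") and x.get("ROAD") == y.get("ROAD"):
        if x.get("HOUSE_NO") and y.get("HOUSE_NO") and x["HOUSE_NO"] != y["HOUSE_NO"]:
            score -= 0.2
    if x.get("PHONE") and x.get("PHONE") == y.get("PHONE"):
        score += 0.1
    return score


def aggregate(target, candidates):
    """Return the best candidate if it clears the score line, else None."""
    if not candidates:
        return None
    scored = [(adjust(name_score(target, c), target, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= SCORE_LINE else None
```

With these weights, a candidate that shares only the city and the industry words with the target scores well below the line, because the mismatched brand carries most of the weight.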

4. Optimization results

For eight key international countries, we optimized the aggregation algorithm using the pattern matching approach. The results are shown in the figure below:

In the future, we will continue to share our experience in the aggregation of international hotel rooms. We welcome you to communicate with us and make progress together.