When traveling, booking a hotel is an essential step. Clean, comfortable accommodation matters to everyone who is away from home.

Booking hotels online makes this easier. When a user opens a selected hotel on Mafengwo, the booking information provided by different suppliers is presented accurately in a single aggregated list. This serves two purposes: first, it avoids showing the user the same information repeatedly, which would hurt the experience; more importantly, it lets the user compare real-time prices for a hotel across the whole network and quickly find the most cost-effective supplier to complete the booking decision.

The strength of a platform’s hotel aggregation capability determines the “thickness” of the price options available when booking, and thus the user’s personalized, diversified booking experience. To make hotel aggregation more real-time, accurate, and efficient, nearly 80% of the aggregation tasks in Mafengwo’s hotel business are completed automatically by machines. This article explains in detail what hotel aggregation is and how popular machine learning techniques are applied to it.

Part.1 Application Scenarios and Challenges

1. Application scenarios of hotel aggregation

The Mafengwo hotel platform connects to a large number of suppliers. Different suppliers often list many of the same hotels, but their descriptions of the same hotel may differ, for example:

What hotel aggregation does is merge the hotel information from different suppliers and present it to users in one place, providing a one-stop, real-time price-comparison booking service:

The following figure shows Mafengwo’s aggregation of hotels from different suppliers. The quotes from different suppliers are clear at a glance, making booking decisions more efficient and convenient for users.

2. The challenge

(1) Accuracy

As mentioned above, different suppliers may describe the same hotel differently. If the aggregation is wrong, the hotel the user sees in the App is not the one they actually book:

In the figure above, the user intends to book “Jinglu Hotel” in the App, but the system may actually book the “boutique hotel” provided by supplier E. We call such wrongly aggregated hotels “AB stores”. As you can imagine, arriving at a hotel only to find there is no reservation is disastrous for the user experience.

(2) Real-time performance

The most direct way to solve this problem is fully manual aggregation. Manual aggregation ensures high accuracy, which is feasible when supplier and hotel data volumes are not that large.

But Mafengwo connects to hotel resources from suppliers across the whole network, and manual aggregation would be very slow. First, some hotel resources would remain unaggregated, so users could not be shown rich booking information; second, when prices fluctuate, current quotes could not be provided in a timely manner. It also consumes a great deal of human resources.

The importance of hotel aggregation is obvious. However, as the business develops and the volume of hotel data grows rapidly, technical difficulties and challenges keep arriving one after another.

Part.2 Initial scheme: cosine similarity algorithm

In the early stage, we performed hotel aggregation based on a cosine similarity algorithm, in order to reduce labor cost and improve aggregation efficiency.

Usually a hotel can be uniquely identified by its name, address, and coordinates. Naturally, the easiest technical solution that comes to mind is to compare the names, addresses, and distance of two hotels to determine whether they are the same.

Based on the above analysis, the aggregation process of our preliminary technical solution is as follows:

  1. Input hotel A to be aggregated;

  2. Use ES (Elasticsearch) to retrieve the N online hotels most similar to hotel A within 5 km of it;

  3. Compare each of the N hotels with hotel A pairwise;

  4. For each pair, compute the cosine similarity of the full names, the cosine similarity of the full addresses, and the distance between the two hotels;

  5. Apply manually set similarity and distance thresholds to decide whether the two hotels are the same.
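The comparison logic in steps 4 and 5 can be sketched as follows. This is a minimal illustration only: the bigram-based cosine similarity, the threshold values, and the `same_hotel` signature are assumptions for demonstration, and the ES retrieval step is not shown.

```python
import math
from collections import Counter

def char_bigrams(text):
    """Bag of overlapping character bigrams for a string."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine_similarity(a, b):
    """Cosine similarity between the bigram vectors of two strings."""
    va, vb = char_bigrams(a), char_bigrams(b)
    dot = sum(cnt * vb[gram] for gram, cnt in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def same_hotel(name_a, name_b, addr_a, addr_b, dist_km,
               name_thresh=0.85, addr_thresh=0.85, max_dist_km=5.0):
    """V1-style decision: name similarity, address similarity,
    and distance must all pass manually set thresholds."""
    return (dist_km <= max_dist_km
            and cosine_similarity(name_a, name_b) >= name_thresh
            and cosine_similarity(addr_a, addr_b) >= addr_thresh)
```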

The overall process diagram is as follows:

After “Hotel Aggregation Process V1” went live, we verified the feasibility of this solution. Its biggest advantages are simplicity and low cost of implementation and maintenance, while the machine can automatically handle part of the aggregation workload, which is more efficient and timely than fully manual processing.

But precisely because this solution is so simple, its problems are also obvious. Consider the following example (the data is fictitious, for illustration only):

Any of us can quickly tell that these are two different hotels. But when the machine computes the overall similarity, the value is not low:

To reduce the error rate, we had to raise the similarity threshold to a higher range, so a large number of similar hotels could not be aggregated automatically and still needed manual handling.

In the end, this version could automatically process only about 30% of cases; the remaining 70% still required manual handling. In addition, the machine’s automatic aggregation accuracy was about 95%, meaning there was a 5% probability of generating AB stores, where the user arrives to find no reservation and has a very poor check-in experience.

Therefore, with the rise of machine learning, we began to explore applying machine learning techniques to hotel aggregation, to resolve the tension between real-time performance and accuracy.

Part.3 Application of machine learning in hotel aggregation

In the following, I will walk through the hotel aggregation scenario from the perspectives of word segmentation, feature construction, algorithm selection, model training and iteration, and model effect.

3.1 Word segmentation

The previous scheme obtained similarity by comparing the full name and full address, but that granularity is too coarse.

Word segmentation cuts text such as the hotel name and address, dividing each whole string into structured data. The purpose is to fix the overly coarse whole-string comparison of names and addresses, and also to prepare for constructing feature vectors later.

3.1.1 Segmentation dictionary

Before discussing name and address segmentation specifically, let’s talk about building the dictionary. Existing word segmentation techniques are generally dictionary-based; how rich and accurate the dictionary is often determines the quality of the segmentation results.

When segmenting hotel names, we need dictionaries of hotel brands and hotel types. Purely manual maintenance requires a lot of manpower, is inefficient, and makes it hard to maintain a rich dictionary.

Here we use a statistical idea, combining machine and manual work to maintain the dictionaries quickly:

  1. Randomly select 100,000+ hotels and obtain their name data;

  2. Cut each name step by step, both from back to front and from front to back;

  3. For each cut word obtained, increment that word’s frequency by 1;

  4. Words with higher frequency are usually hotel brand words or hotel type words.

The table above shows the high-frequency words. With these words in hand, we can quickly build the hotel brand and hotel type dictionaries.
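The frequency-counting steps above can be sketched roughly as follows. This is a toy illustration on a handful of English names; the real pipeline cuts 100,000+ Chinese hotel names, and the cut lengths here are arbitrary assumptions.

```python
from collections import Counter

def candidate_words(name, min_len=2, max_len=6):
    """Enumerate substrings cut step by step from both ends of a name."""
    words = []
    for n in range(min_len, max_len + 1):
        if n <= len(name):
            words.append(name[:n])    # front-to-back cut
            words.append(name[-n:])   # back-to-front cut
    return words

def build_frequency_dict(names):
    """Count how often each cut word appears across all names."""
    counts = Counter()
    for name in names:
        counts.update(candidate_words(name))
    return counts

# High-frequency cuts surface as brand/type word candidates for manual review.
names = ["Home Inn Wangjing", "Home Inn Sanlitun", "Home Inn Guomao"]
freq = build_frequency_dict(names)
```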

3.1.2 Name segmentation

Imagine how a person compares the names of two hotels. For example:

  • A: 7 Days Hotel (Jiuxianqiao Branch)

  • B: Home Inn (Wangjing Branch)

First, drawing on empirical knowledge, a person will unconsciously “segment first, then compare”, namely:

  • 7 Days –> Home Inn

  • Hotel –> Hotel

  • Jiuxianqiao Branch –> Wangjing Branch

So to make an accurate comparison, we must segment words the way people think. After manually simulating segmentation on a large number of hotel names, we divided hotel names into the following structured fields:

Focus on the “first 2 words of the type” field. Suppose we need to compare the following two hotel names:

  • Hotel 1: Longmen South Kunshan Country Garden Zilai Longting Hot Spring Holiday Villa

  • Hotel 2: Longmen South Kunshan Country Garden Han Mingju Hot Spring Holiday Villa

Word segmentation effects are as follows:

We can see that after segmentation, the similarity of each field is very high. However, the first 2 words of the type differ:

  • Hotel 1, first 2 words of the type: Longting

  • Hotel 2, first 2 words of the type: Mingju

In this case, this field (the first 2 words of the type) is highly discriminative and can therefore serve as a very effective comparison feature.
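A minimal sketch of this kind of dictionary-driven name segmentation on English-style names. The brand/type mini-dictionaries, the field names, and the parenthesized-branch convention are all hypothetical, for illustration only; the production segmenter works on Chinese names with the mined dictionaries.

```python
import re

# Hypothetical mini-dictionaries; the real ones are mined from name frequencies.
BRAND_WORDS = {"7 Days", "Home Inn"}
TYPE_WORDS = {"Hotel", "Inn", "Villa", "Hostel"}

def segment_name(name):
    """Split a 'Brand Type (Branch)' style hotel name into structured fields."""
    fields = {"brand": "", "type": "", "branch": "", "type_first2": ""}
    # The branch name usually sits in parentheses at the end.
    m = re.search(r"\((.*?)\)", name)
    if m:
        fields["branch"] = m.group(1)
        name = name[:m.start()].strip()
    # Longest-match the brand dictionary against the front of the name.
    for brand in sorted(BRAND_WORDS, key=len, reverse=True):
        if name.startswith(brand):
            fields["brand"] = brand
            name = name[len(brand):].strip()
            break
    # Match the type dictionary against the tail of the name.
    for word in sorted(TYPE_WORDS, key=len, reverse=True):
        if name.endswith(word):
            fields["type"] = word
            name = name[: -len(word)].strip()
            break
    # Rough analogue of "first 2 words of the type": the tokens just before it.
    fields["type_first2"] = " ".join(name.split()[-2:])
    return fields
```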

3.1.3 Address segmentation

Similarly, address segmentation simulates human thinking to make address comparison finer and more specific. See the following figure for the specific segmentation:

The specific word segmentation effect is shown as follows:

Summary

Word segmentation fixes the overly coarse comparison granularity, and we now have about 20 comparison dimensions. But how do we determine the comparison rules and thresholds?

Manual rules and thresholds have many disadvantages, for example:

  1. The rules are changeable. Combinations of 20 comparison dimensions produce a huge number of rules, which cannot all be covered manually.

  2. Manual thresholds are easily guided by “empiricism” and are prone to misjudgment.

Therefore, although the comparison dimensions are rich, the difficulty of rule-making grows by orders of magnitude. Machine learning makes up for this shortcoming: by training on large amounts of data, it learns changeable rules and effectively solves tasks that are essentially impossible for humans.

Let’s take a closer look at feature construction and machine learning.

3.2 Feature Construction

We spent a lot of effort simulating human thinking during word segmentation, which was in fact preparation for constructing feature vectors.

Feature construction likewise simulates human thinking. The aim is to compare the segmented structured data pairwise and digitize the comparison results, constructing feature vectors for machine learning.

Across different suppliers, the data we can reliably obtain mainly includes the hotel name, address, and coordinates (longitude and latitude), and possibly phone numbers and email addresses.

After a series of data investigations, the usable data was finally determined to be name, address, and telephone number, mainly because:

  1. Some suppliers have problems with their latitude/longitude coordinate systems and the accuracy is not high, so we do not use coordinates as a feature for now, but we do limit aggregated hotels to within 5 km of each other;

  2. Email coverage is low, so it is not used for the time being.

Note that the expanded comparison dimensions for name and address are mainly based on their segmentation results, but if phone data is added to the comparison, the phone format must be cleaned first.

The final feature vector is roughly as follows. The similarity algorithms involved are relatively simple, so they are not described here:
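As a rough illustration of how the pairwise comparisons might be digitized into one vector: the field names, the toy character-Jaccard similarity, and the exact-match phone feature below are assumptions for demonstration, not the production feature set.

```python
def field_similarity(a, b):
    """Toy field similarity: Jaccard overlap of the characters in two values."""
    if not a or not b:
        return 0.0
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# Hypothetical field names produced by name/address segmentation.
FIELDS = ["brand", "type", "branch", "type_first2",
          "province", "city", "road", "road_no"]

def build_feature_vector(hotel_a, hotel_b, dist_km):
    """Digitize pairwise field comparisons of two hotels into a feature vector."""
    vec = [field_similarity(hotel_a.get(f, ""), hotel_b.get(f, ""))
           for f in FIELDS]
    vec.append(dist_km)  # raw distance between the two hotels
    # Exact phone match as a binary feature (phones assumed pre-cleaned).
    phone_a, phone_b = hotel_a.get("phone"), hotel_b.get("phone")
    vec.append(1.0 if phone_a and phone_a == phone_b else 0.0)
    return vec
```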

3.3 Algorithm selection: decision tree

Judging whether two hotels are the same is clearly a supervised binary classification problem:

  • The training, validation, and test sets are manually annotated;

  • Given two hotels as input, the model returns only a “same” or “different” result.

After comparing several mature existing algorithms, we finally chose the decision tree. Its core idea is to build the tree by splitting on different features; each split is made in the direction that reduces information entropy, so every split reduces uncertainty. Here is an illustration for your understanding:

(Source: the “watermelon book”, Zhou Zhihua’s Machine Learning)

3.3.1 Ada Boosting OR Gradient Boosting

The specific family of algorithms we chose is Boosting. A Chinese proverb describes Boosting well: “three cobblers with their wits combined equal Zhuge Liang.” Boosting is like an expert consultation: one person’s decision may be uncertain and wrong, but the error in a group’s final decision is usually very small.

Boosting is generally based on tree models and falls into two categories: Ada Boosting and Gradient Boosting. Ada Boosting first trains a model, finds the points it fails to fit, then increases the weight of those points and trains further models in succession. The resulting models then vote at prediction time. As shown below:

Gradient Boosting instead has each later model fit the errors produced by the model before it, whose own errors are in turn fitted by the next model, and so on. The predictions of these models are then summed:

In general, Gradient Boosting is more widely used in the industry, and we also base it on Gradient Boosting.
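To make the “fit the previous model’s errors” idea concrete, here is a toy gradient-boosting sketch using regression stumps on 1-D data. It is purely illustrative, not our production implementation; the stump learner and the data are assumptions.

```python
def fit_stump(xs, ys):
    """Best single-split regression stump on 1-D data (min squared error)."""
    best = None
    for split in xs:
        left = [y for x, y in zip(xs, ys) if x <= split]
        right = [y for x, y in zip(xs, ys) if x > split]
        lm = sum(left) / len(left) if left else 0.0
        rm = sum(right) / len(right) if right else 0.0
        err = sum((y - (lm if x <= split else rm)) ** 2
                  for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, split, lm, rm)
    _, split, lm, rm = best
    return lambda x: lm if x <= split else rm

def gradient_boost(xs, ys, rounds=10):
    """Each new stump fits the residual errors left by the previous ones;
    the final prediction is the sum of all stumps."""
    models = []
    residuals = list(ys)
    for _ in range(rounds):
        stump = fit_stump(xs, residuals)
        models.append(stump)
        residuals = [r - stump(x) for x, r in zip(xs, residuals)]
    return lambda x: sum(m(x) for m in models)
```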

3.3.2 XGBoost OR LightGBM

Both XGBoost and LightGBM are efficient system implementations of Gradient Boosting.

We compared memory usage, accuracy, and training time. LightGBM’s memory usage is much lower, its accuracy is basically the same, and its training time is also much shorter.

Memory usage comparison:

Comparison of accuracy:

Comparison of training time:

(Photo by Microsoft Research Asia)

Based on the comparison data above, we finally selected LightGBM for fast, iterative model training.

3.4 Model training iteration

With LightGBM, the training time was greatly reduced, so we were able to iterate quickly.

Model training mainly focuses on two aspects:

  • Analysis of training results

  • Model hyperparameter tuning

3.4.1 Analysis of training results

The training results may be unsatisfactory at first and fall short of the ideal. At this point we need to analyze carefully what caused them: is it a feature vector problem? A similarity calculation problem? The algorithm? Analyze each specific cause specifically, and the desired results will gradually be achieved.

3.4.2 Model hyperparameter tuning

This section mainly shares some experience with hyperparameter tuning. First, a brief description of the more important parameters:

(1) num_leaves and max_depth

max_depth and num_leaves are important parameters for improving accuracy and preventing overfitting:

  • max_depth: as the name implies, the depth of the tree; too large a value may cause overfitting

  • num_leaves: the number of leaves on a tree. Since LightGBM grows trees leaf-wise, this is the main parameter controlling the complexity of the tree model

(2) feature_fraction and bagging_fraction

feature_fraction and bagging_fraction can prevent overfitting and speed up training:

  • feature_fraction: randomly select a fraction of the features (0 < feature_fraction < 1)

  • bagging_fraction: randomly select a fraction of the data (0 < bagging_fraction < 1)

(3) lambda_l1 and lambda_l2

lambda_l1 and lambda_l2 are regularization terms that effectively prevent overfitting:

  • lambda_l1: the L1 regularization term

  • lambda_l2: the L2 regularization term
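Putting the parameters above together, a LightGBM parameter dictionary might look like the sketch below. The values are purely illustrative assumptions, not our production configuration.

```python
# Hypothetical LightGBM parameter set illustrating the knobs discussed above.
params = {
    "objective": "binary",    # same-hotel vs. different-hotel classification
    "max_depth": 7,           # limit tree depth to curb overfitting
    "num_leaves": 63,         # keep < 2**max_depth under leaf-wise growth
    "feature_fraction": 0.8,  # randomly sample 80% of features per tree
    "bagging_fraction": 0.8,  # randomly sample 80% of the rows
    "lambda_l1": 0.1,         # L1 regularization term
    "lambda_l2": 0.1,         # L2 regularization term
}
```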

3.5 Model Effect

After several rounds of iteration, optimization and verification, our hotel aggregation model has become stable.

The effectiveness of such a scheme is usually evaluated by precision and recall. For the hotel aggregation business, however, we must first ensure absolutely high precision (aggregation errors produce AB stores and hurt users’ check-in experience), and only then pursue high recall.
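Precision and recall over a labeled set of hotel pairs can be computed as follows; this is a minimal sketch in which `predict` stands in for the trained model.

```python
def precision_recall(pairs, predict):
    """pairs: list of ((hotel_a, hotel_b), is_same) labeled examples.
    predict(a, b) -> True if the model says the two hotels are the same."""
    tp = fp = fn = 0
    for (a, b), same in pairs:
        pred = predict(a, b)
        if pred and same:
            tp += 1          # correctly aggregated
        elif pred and not same:
            fp += 1          # wrong aggregation: this is what creates AB stores
        elif not pred and same:
            fn += 1          # missed aggregation: handled manually instead
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```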

After several rounds of verification, the precision of the current model reaches over 99.92%, and recall reaches over 85.62%:

Precision has clearly reached a fairly high level. Still, to be safe, after aggregation completes we also run a set of secondary verification rules based on hotel name, address, coordinates, facilities, type, and other dimensions. At the same time, for same-day-booking, same-day-check-in orders, we add manual real-time verification to further control the risk of AB stores.

3.6 Solution Summary

With the overall scheme introduced, the machine-learning-based hotel aggregation process is roughly illustrated in the following figure:

From the above exploration, we can draw some general lessons:

  1. Solutions evolve gradually; iterate when they no longer meet requirements;

  2. Word segmentation fixes the overly coarse comparison granularity by simulating human thinking when cutting text;

  3. Machine learning can derive complex rules from vast amounts of training data and solve tasks that humans cannot.

Part.4 Final Thoughts

Exploring new technologies is challenging and meaningful. In the future we will iterate and optimize further, complete hotel aggregation efficiently, ensure the accuracy and timeliness of information, and improve users’ booking experience. For example:

  1. Unify the coordinate systems of domestic hotel resources from different suppliers. Coordinates are an important feature for hotel aggregation; once the coordinate system is unified, the precision and recall of hotel aggregation should improve further.

  2. Close the loop between risk control and aggregation. Establishing a real-time, two-way data channel between risk control and aggregation will further improve the underlying capabilities of both services.

The above mainly concerns the evolution of domestic hotel aggregation. For overseas hotel data, the aggregation methods differ considerably, for example in how to segment overseas hotel names and addresses, and how to perform lemmatization and stemming. We have corresponding exploration and practice there too, and the overall effect is even better than for domestic hotels. We will share it in future articles, and we hope interested readers stay tuned.

**Authors:** Liu Shuchao, Trading Center – hotel search R&D engineer; He Xialong and Kang Wenyun, Intelligent Middle Platform – content mining engineers.