This week's paper is Applying Deep Learning To Airbnb Search, which, as the title suggests, describes how Airbnb applied deep learning to its search ranking.
What makes the paper interesting is that it proposes no novel algorithm or model; instead it collects the useful lessons Airbnb learned in the process of applying deep learning and shares them with everyone. Before this, the authors had been using GBDT for search ranking, but it had hit a plateau, and they needed deep learning to break through it.
Let's take a look at the difficulties Airbnb ran into, and the gains it made, in applying deep learning.
Background
Airbnb is a two-sided marketplace: hosts list their properties, and potential guests book them. In most cases, a booking starts with a search.
The first version of search ranking is often a hand-crafted scoring function. For Airbnb, the next milestone after that was GBDT. When GBDT's gains could not be pushed any further, Airbnb began to embrace deep learning.
Beyond ranking, the overall search system also needs models that predict the probability that a host will accept a guest's booking request, the probability that a guest will leave a 5-star review, and so on.
On Airbnb, users tend to search → browse → search again → … → book a room; the booking is ultimately reached through multiple rounds of searching and browsing, and these behaviors become the training data. The trained model then goes online for A/B testing.
The paper's overall structure: first the evolution of Airbnb's search model, then feature engineering and systems engineering, and finally some explorations of tooling and hyperparameters.
What I like about this paper is that it breaks from the common template: propose a novel model, describe its structure, and prove it excellent by experiments. Instead, it walks through Airbnb's exploration and progress in search, showing from a macro perspective how to gradually deepen a project and which details deserve attention.
Model evolution
The two graphs show the different models Airbnb tried in search: one plots offline NDCG performance, the other plots online booking gains.
Initially it was a simple single-layer NN, mainly to get the whole pipeline running end to end (which happens to coincide with what I mentioned in the annual summary in my last article). Once the single-layer network was in place, Airbnb began to refine it. Since the offline metric is NDCG, LambdaRank was the natural choice.
The authors give a brief snippet of code here to illustrate their thinking:
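The paper's snippet is in TensorFlow; what follows is my own minimal NumPy sketch of the same idea, not the paper's code. The function name and the 0-based rank convention are mine:

```python
import numpy as np

def pairwise_lambdarank_loss(booked_logit, other_logit,
                             booked_rank, other_rank):
    """LambdaRank-style loss for one (booked, not-booked) pair.

    Logistic loss on the logit difference pushes the booked listing
    above the other; the |delta NDCG| weight makes swaps near the top
    of the list matter more.
    """
    # Positional discount: 1 / log2(2 + rank), with 0-based ranks.
    def discount(rank):
        return np.log(2.0) / np.log(2.0 + rank)

    # How much NDCG would change if the two listings swapped positions.
    delta_ndcg = abs(discount(booked_rank) - discount(other_rank))

    # log(1 + exp(-(booked - other))), computed stably.
    log_loss = np.logaddexp(0.0, -(booked_logit - other_logit))
    return delta_ndcg * log_loss
```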
"Decision Tree/Factorization Machine NN": besides LambdaRank, the authors tried model ensembling, combining GBDT, FM, and an NN in a straightforward structure:
According to the authors, although the combination of the three models scored about the same as the plain neural network on the test-set metrics, the head of the final ranked list was quite different. They therefore believe the ensemble combines the respective strengths of the three models, though the paper does not elaborate further; it does give a reference (Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & Cross Network for Ad Click Predictions. In Proceedings of ADKDD'17. ACM, New York, NY, USA, Article 12, 7 pages.)
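My reading of the paper's figure is that the FM's prediction enters the NN as one extra dense feature, while the GBDT's leaf indices enter as categorical features. A sketch of that wiring, assuming sklearn-style fm_model and gbdt_model objects (the helper is mine, not Airbnb's code):

```python
import numpy as np

def build_nn_input(dense_features, fm_model, gbdt_model, x_raw):
    # FM prediction appended as a single extra dense feature.
    fm_score = fm_model.predict(x_raw)        # shape: (n,)
    # Index of the leaf each tree routes the example to, used as
    # categorical ids to be embedded; shape (n, n_trees) for a
    # sklearn GradientBoostingRegressor's .apply().
    leaf_ids = gbdt_model.apply(x_raw)
    return np.hstack([dense_features, fm_score[:, None]]), leaf_ids
```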
"Deep NN": next came the DNN era, which the authors describe as simply working well; there isn't much to say beyond that. Their initial features were simple things such as price, amenities, historical booking counts, and so on, and the outputs of the two earlier models were also fed in as features.
As shown in the figure below, as the training data grows, the gap between training set and test set gradually disappears:
But the authors point out that, unlike DNNs' success in CV (where machines match human accuracy), there is no way for a human to judge what the right result to show is in this scenario: after all, booking a home depends on the user's budget and taste.
Here, the authors share something rarely covered elsewhere: some failed attempts.
Failed models
To paraphrase Tolstoy: successful models are all alike, but every failed model fails in its own way. Models are a bit like that too, and the authors say there were plenty of failed attempts; two representative ones follow.
Listing ID
Taking the IDs of items a user has interacted with and feeding them directly into the model is a common technique in search, recommendation, NLP, and other fields, and it is often very effective. Airbnb was full of hope going in too, but the result turned out to be a disappointment.
As shown in the figure above, convergence speed and quality differ sharply with and without the ID feature: the runs with the ID feature keep falling into overfitting.
The authors analyzed the likely cause: Airbnb's particular business. In other domains there is no cap on interactions: on TikTok, there is no limit to how many times a short video can be viewed; on Taobao, there is no limit to how many times a product can be purchased. Using interaction records keyed by ID as a feature depends precisely on such large volumes of behavior. But on Airbnb, even the most popular listing can be booked at most 365 times a year.
So this also shows that when selecting features, we need to weigh our own business scenario against our current business sense.
Multi-task learning
Given the hard cap on bookings, Airbnb considered that browsing behavior has no such restriction: could they start from browsing instead?
The figure above shows the ratio between views and bookings; bookings are much sparser than views. On the other hand, more page views generally lead to more bookings. So Airbnb built a multi-task model, optimizing for both booking and view duration, as sketched below.
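A minimal shared-bottom sketch of such a two-head model in Keras; the layer sizes, labels, and loss weights are my own illustration, not Airbnb's configuration:

```python
import tensorflow as tf

# Two binary labels per example are assumed: booked, and long_view
# (e.g. "view duration above a threshold").
inputs = tf.keras.Input(shape=(64,), name="features")
shared = tf.keras.layers.Dense(128, activation="relu")(inputs)
shared = tf.keras.layers.Dense(64, activation="relu")(shared)

booking_head = tf.keras.layers.Dense(1, activation="sigmoid",
                                     name="booking")(shared)
view_head = tf.keras.layers.Dense(1, activation="sigmoid",
                                  name="long_view")(shared)

model = tf.keras.Model(inputs, [booking_head, view_head])
model.compile(optimizer="adam",
              loss={"booking": "binary_crossentropy",
                    "long_view": "binary_crossentropy"},
              # Illustrative: keep bookings the dominant objective.
              loss_weights={"booking": 1.0, "long_view": 0.5})
```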
With this model in hand, the listing ID could finally be used as a feature.
In the end, though, it still didn't pan out: the model showed users listings that were expensive or unusual, which users happily spent more time browsing, but that did little for bookings themselves.
Feature engineering
Here Airbnb shares some ideas for processing features so that they satisfy certain properties, helping the neural network carry out effective numerical computation.
Feature normalization
At first, the authors directly reused their GBDT-era features, with poor results. Tracing the cause, they found that GBDT cares little about the exact values of features, mainly about their relative order; neural networks, however, are sensitive to the values themselves.
If features of different kinds have very different value ranges, gradient propagation suffers; for example, overly large values can saturate the ReLU, killing the unit and making its gradient vanish.
So for the input features, the ideal is to have them fall roughly in the range −1 to 1, centered at 0.
To this end, the authors use two normalization methods:
- For features that follow a normal distribution, transform with (feature_val − μ) / σ.
- For features that follow a power-law distribution, transform with log((1 + feature_val) / (1 + median)).
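Both transforms as minimal NumPy helpers (the statistics should come from the training set only; the example prices are made up):

```python
import numpy as np

def normalize_gaussian(x, mu, sigma):
    """(feature_val - mu) / sigma, for normally distributed features."""
    return (x - mu) / sigma

def normalize_power_law(x, median):
    """log((1 + feature_val) / (1 + median)), for power-law features."""
    return np.log((1.0 + x) / (1.0 + median))

# Example: nightly prices tend to follow a power law.
prices = np.array([50.0, 80.0, 120.0, 400.0, 2000.0])
print(normalize_power_law(prices, median=np.median(prices)))
```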
Feature distributions
The previous step smooths out the feature distributions, but why bother?
Spotting bugs
The first reason is anomalous feature values in the samples, which are often hard to observe directly. But if the features are smooth, a distribution plot makes abnormal values easy to spot.
For example, suppose listing prices in an area are a feature. If a sample mistakenly records a monthly price as a nightly price, that value will stand out clearly in the feature's distribution plot.
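A quick way to eyeball this (my own illustration, with simulated data): log-scale the power-law feature first, then histogram it; a monthly total stored as a nightly price shows up as a spike far to the right.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
nightly = rng.lognormal(mean=4.5, sigma=0.5, size=10_000)  # ~$90/night
nightly[:20] *= 30  # simulated bug: monthly totals stored as nightly

plt.hist(np.log1p(nightly), bins=100)
plt.xlabel("log(1 + nightly price)")
plt.ylabel("count")
plt.show()
```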
Helping the model generalize
Compared with other machine learning methods, one advantage of neural networks is excellent generalization. From this angle, the authors analyze why smooth feature distributions help.
The figure above shows the output distribution of each layer of the network: the final output layer's distribution is the smoothest, and the distributions get progressively smoother from the input layer upward. Could this be why neural networks generalize so well?
The authors argue that in practice, the model learns a distribution over a high-dimensional feature space from the training data. But it is impossible for all feature combinations to be observed during training, so training is really a process of gradually learning to interpolate from unevenly covered data.
It stands to reason, then, that the smoother the distributions in the lower layers, the better the upper layers can interpolate for unseen feature combinations; and the lowest layer is, naturally, the input layer.
As a check on generalization, the authors tried scaling all prices to 2x, 3x, 4x and so on, and the final NDCG was surprisingly stable.
In general, feature smoothing is easy: with a little processing, a feature's distribution can usually be migrated into a suitable, smooth form. But sometimes Airbnb has to do dedicated feature engineering to smooth out a particular feature's distribution.
One example is the location of listings shown after a user searches. The figure below shows the whole process: at first, the raw latitude and longitude distributions are clearly uneven; they were changed to offsets from the center of the user's search map, and then a logarithm was applied, finally yielding a feature distribution that meets the requirement.
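A sketch of that transform as I read it (the exact handling of signs and units is my assumption, not spelled out here):

```python
import numpy as np

def location_features(lat, lng, center_lat, center_lng):
    # Offset from the center of the user's search map.
    d_lat = lat - center_lat
    d_lng = lng - center_lng
    # Log-compress the offsets, keeping the sign, so the resulting
    # distribution is smooth and centered near zero.
    return (np.sign(d_lat) * np.log1p(np.abs(d_lat)),
            np.sign(d_lng) * np.log1p(np.abs(d_lng)))
```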
Checking feature completeness
If a feature's distribution is not smooth, that can also help us check whether the feature is complete (in fact, in a way, I think this is similar to the bug-spotting above).
Here the authors give the example of using a listing's occupancy (days booked) as a feature, and finding the distribution not smooth, as in the left figure below. The reason is that listings impose different minimum stays, some even requiring months at a time.
So the authors combined occupancy with the average length of stay, producing the smooth power-law distribution on the right.
High-cardinality categorical features
Listing ID didn't work out earlier, so what else has Airbnb tried with high-cardinality categorical features?
One example is location, which makes direct use of Google's S2 geometry library. The city name from the query is combined with a listing's S2 cell and run through a hash function, and the hash value is fed into the network as a categorical feature to be embedded.
For example: {"San Francisco", 539058204} → 71829521.
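A sketch of that trick, assuming the s2sphere package (a Python port of the S2 library); the cell level, bucket count, and hash choice are mine, since the paper only says the pair is hashed to an integer:

```python
import hashlib
import s2sphere  # assumed installed: pip install s2sphere

def location_feature(city, lat, lng, level=12, buckets=1_000_000):
    latlng = s2sphere.LatLng.from_degrees(lat, lng)
    cell_id = s2sphere.CellId.from_lat_lng(latlng).parent(level).id()
    key = f"{city}:{cell_id}".encode()
    # Stable hash (Python's built-in hash() is seeded per process).
    return int(hashlib.md5(key).hexdigest(), 16) % buckets

# e.g. location_feature("San Francisco", 37.7749, -122.4194)
```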
An example of the final effect is shown below:
Notice in the heat map that when you search for San Francisco, the heat does not simply radiate outward by straight-line distance from the center; the land south of San Francisco is clearly hotter than its neighbors over the water.
Systems engineering
The overall setup is probably familiar to most: the online service is a Java service, Spark is used for data processing, TensorFlow for model training, and Scala and Java for data analysis, among other things.
Here the authors mention a few tricks that may help us:
Protobufs and Datasets
Previously with GBDT, the training data lived in CSV files. When they moved to neural networks, they kept this format and piped the CSV data into the network via feed_dict. It turned out GPU utilization was only about 25%.
They then switched to Protobufs and the Dataset pipeline, avoiding on-the-fly format conversion during training, which brought roughly a 17x training speedup and pushed GPU utilization to about 90%.
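A minimal sketch of that kind of pipeline: serialize examples once as TFRecords (protobufs on disk), then stream them with tf.data. The feature names here are illustrative, not Airbnb's schema:

```python
import tensorflow as tf

feature_spec = {
    "price": tf.io.FixedLenFeature([], tf.float32),
    "booked": tf.io.FixedLenFeature([], tf.int64),
}

def parse(record):
    return tf.io.parse_single_example(record, feature_spec)

dataset = (tf.data.TFRecordDataset("train.tfrecord")
           .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(512)
           .prefetch(tf.data.AUTOTUNE))  # keeps the GPU fed
```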
Static features
For a business like Airbnb, a listing's hard attributes rarely (or at least not easily) change: location, layout, amenities, and so on. Reloading these features with every training example is a large IO cost.
So instead of carrying these static features in every example, Airbnb keys them by listing ID and only loads the static information into training when it changes.
Java neural network library
To put it briefly: they developed their own neural network library in Java.
Hyperparameters
Not much came out of this section either.
Dropout performed poorly; the authors reckon it is best suited to situations where the data is noisy.
As for initial weights, they found random values between −1 and 1 better than all zeros.
And tuning the learning rate and batch size didn't turn up anything either.
Feature importance
The authors did some analysis here too, which we can briefly summarize as follows:
Score decomposition
How do you know which features matter and which can be dropped? The authors considered decomposing the final score, but concluded that with nonlinear activation functions there is no way to isolate a feature's true effect on the final score, so such a decomposition is meaningless.
Ablation test
The idea is obvious: remove one feature at a time and watch the performance change. But the authors found that the effect of removing a single feature was indistinguishable from ordinary noise in training.
Permutation test
Another idea is to randomly shuffle a given feature's values across samples and measure the effect: the feature is still present, but its values are scrambled, so any degradation shows how much the model relied on it. This really can surface the importance of some features; a sketch follows below.
However, the authors raise two caveats: first, the method assumes the observed feature is independent of the others, which is rarely true in practice; second, shuffling values can construct feature combinations that never occur in real life, so the model's results may mislead us.
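A minimal permutation-importance sketch (the generic technique, not Airbnb's code); model and metric are assumed to be sklearn-style, i.e. model.predict(X) and metric(y_true, y_pred):

```python
import numpy as np

def permutation_importance(model, X, y, metric, col, n_rounds=5, seed=0):
    rng = np.random.default_rng(seed)
    base = metric(y, model.predict(X))
    drops = []
    for _ in range(n_rounds):
        X_perm = X.copy()
        rng.shuffle(X_perm[:, col])   # scramble one feature in place
        drops.append(base - metric(y, model.predict(X_perm)))
    return np.mean(drops)             # larger drop => more important
```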
#### Feature analysis on ranked results
The idea here is to rank the test-set results by model score, then compare the feature distributions at the head and tail of the list; this reveals how the model uses a feature across different value ranges.
For price, the head and tail results show clearly different distributions, so the model is responding to price; but for the view-duration feature, the two distributions are close, suggesting the model is not exploiting it. This too is a way to analyze feature importance.
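A sketch of that head/tail comparison (my own illustration): slice the ranked list and histogram one feature in each slice; similar histograms hint the model ignores the feature.

```python
import numpy as np
import matplotlib.pyplot as plt

def head_tail_hist(scores, feature_values, k=1000, name="feature"):
    order = np.argsort(-scores)            # best-scored first
    head, tail = order[:k], order[-k:]
    plt.hist(feature_values[head], bins=50, alpha=0.5, label="head")
    plt.hist(feature_values[tail], bins=50, alpha=0.5, label="tail")
    plt.xlabel(name)
    plt.legend()
    plt.show()
```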
Conclusion
As stated at the end of the paper, the authors feel they are only just beginning to apply deep learning to search.
Migrating a business from non-deep-learning models to neural networks is not a matter of feeding data straight into a model. It takes steady groundwork in the early stages; only when the whole pipeline is complete, with data, feature engineering, systems engineering and so on in place, is it time to start exploring models.
That process of exploration, at its core, tests your familiarity with and understanding of the business.
I am also migrating a search project myself at the moment, preparing to open up its full NN pipeline, and this paper gave me a lot of inspiration. But the most important core is really understanding the business: you can see that much of the above can be explained from the perspective of Airbnb's own business.
I will keep sharing other search and recommendation articles; you are welcome to come discuss and learn with me.