Embedding is a technique that transforms discrete variables into continuous vector representations. It was first applied as word embedding in NLP, mapping words originally represented as one-hot vectors into a multidimensional vector space to form a new representation whose properties can be computed and compared. Later, image embedding, graph embedding, and other variants emerged. Embedding is also used to construct features in feature engineering. This article discusses how Airbnb uses embedding technology in its recommendation scenarios.

1. Word2Vec

Any discussion of word embedding cannot avoid the classic Word2Vec. Word2Vec is a model that learns semantic knowledge from a large text corpus in an unsupervised way, and it is widely used in NLP. By learning from text, it represents the semantic information of words as word vectors, so that semantically similar words end up close to each other in the embedding space. Embedding is essentially a mapping: words are mapped from their original space to a new multidimensional space, i.e., the original word space is embedded into a new space.

For example, the words cat and kitten are semantically close, dog and kitten less so, and iPhone and kitten are even more semantically distant. Learning this numerical representation of the words in a vocabulary (that is, turning words into word vectors) lets us draw interesting conclusions from vector arithmetic on these representations. For example, if we compute kitten - cat + dog on the vectors for kitten, cat, and dog, we end up with an embedding vector that is very similar to that of puppy.

Word2Vec comes in two flavors: Skip-Gram (the skip-word model) and CBOW (the continuous bag-of-words model), which respectively use the input word to predict its context and use the context to predict the input word. The two models are very similar in structure.

1.1 The Skip-Gram Model

This article mainly summarizes the Skip-Gram model. It consists of two parts: building the model and obtaining the embedding from the model. Essentially, Skip-Gram follows an idea very similar to an autoencoder: construct a neural network and train it on data, and after the model is fitted, what we really need is not the output but the parameter layer, namely the weight matrix in the middle of the model.

The training data for the Skip-Gram model consists of natural sentences, such as "The quick brown fox jumps over the lazy dog." The training process is as follows:

  • First select the input word, for example "fox";
  • After the input word is selected, a skip_window parameter defines how many words to take from each side of the current input word. Assuming skip_window = 2, our window covers the words ['quick', 'brown', 'fox', 'jumps', 'over'], and the number of context words in the window is span = 2 × 2 = 4. With skip_window = 2 and num_skips = 2, we obtain two training pairs in the form (input word, output word): ('fox', 'brown') and ('fox', 'jumps'). A small sketch of this pair generation follows the list.
  • Based on these pairs, the neural network outputs a probability distribution that represents, for every word in the dictionary, the probability that it is the output word.
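A minimal sketch of the pair generation described above, assuming a tokenized sentence as input; the function and variable names (generate_skipgram_pairs, skip_window, num_skips) are illustrative, not from any particular library:

```python
import random

def generate_skipgram_pairs(tokens, skip_window=2, num_skips=2):
    """Build (input word, output word) training pairs for Skip-Gram.

    tokens: a tokenized sentence, e.g. ["the", "quick", "brown", ...]
    skip_window: how many words to consider on each side of the input word
    num_skips: how many context words to sample per input word
    """
    pairs = []
    for i, center in enumerate(tokens):
        # Indices of the context window around the center word.
        window = [j for j in range(max(0, i - skip_window),
                                   min(len(tokens), i + skip_window + 1))
                  if j != i]
        # Sample num_skips context words (or all of them if fewer exist).
        for j in random.sample(window, min(num_skips, len(window))):
            pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(generate_skipgram_pairs(sentence, skip_window=2, num_skips=2))
# e.g. [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...]
```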

Our model learns these statistics from the co-occurrence counts of word pairs. After training, an input-output combination such as ("France", "Paris") will have a higher probability than ("France", "Tokyo").

As for the details of model training, first the words in the training data are represented with one-hot encoding. A vocabulary is built from the corpus, and the vocabulary size is the dimensionality of the one-hot vectors. Since the vocabulary of an entire corpus is very large, each word becomes a very sparse high-dimensional vector after one-hot encoding. These sparse vectors are then fed into the neural network, as shown in the figure below:

The hidden layer uses no activation function, and the output layer uses softmax for multi-class classification. We train the neural network with one-hot encoded (input word, output word) pairs, and the final output of the model is a probability distribution.

The objective function of Skip-Gram is:


$$\mathrm{loss}=\frac{1}{T}\sum_{t=1}^{T}\sum_{-c\leq j\leq c,\, j\neq 0}\log p\left(w_{t+j}\mid w_{t}\right)$$

Because every word $w_t$ determines its adjacent words $w_{t+j}$, by maximum likelihood estimation we want the conditional probability $p\left(w_{t+j}\mid w_{t}\right)$ over all samples to be as large as possible. The softmax at the output layer is expressed as:


$$p\left(w_{o}\mid w_{t}\right)=\frac{\exp\left((v'_{w_o})^{T}v_{w_i}\right)}{\sum_{w=1}^{W}\exp\left((v'_{w})^{T}v_{w_i}\right)}$$

Hidden layer

After the model is trained, an input vector changes from a 10,000-dimensional sparse vector to a 300-dimensional dense vector at the hidden layer, and is then mapped from the 300-dimensional dense vector back to 10,000 dimensions at the output layer. In between there are two weight matrices, multiplied respectively by the 10,000-dimensional input vector and the 300-dimensional intermediate vector. What we need is the weight matrix of the hidden layer. In the figure below the weight matrix is viewed from two angles: on the left, each column is the weight vector of a single hidden-layer neuron; on the right, each row is the word vector of one word. In other words, once we obtain this weight matrix, we have the word embeddings.

When we want to get the embedding of a one-hot vector, we need to multiply the one-hot vector by the weight matrix $W$ between the input layer and the hidden layer. But because matrix multiplication is expensive, this is inefficient. Viewing the weight matrix from the other angle described above, the embedding can instead be obtained by a look-up. By the rules of vector-matrix multiplication, only the non-zero element of the one-hot vector contributes to the hidden layer: if the $k$-th element of the input $x$ satisfies $x_k=1$, then the $k$-th row of the weight matrix $W$ is exactly the embedding. This table look-up greatly improves the efficiency of computing embeddings.
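A minimal NumPy sketch of this equivalence, with the dimensions used above (vocabulary size 10,000, embedding size 300) and randomly initialized weights standing in for trained ones:

```python
import numpy as np

vocab_size, embed_dim = 10_000, 300
W_in = np.random.randn(vocab_size, embed_dim)   # input-to-hidden weights (the embeddings)
W_out = np.random.randn(embed_dim, vocab_size)  # hidden-to-output weights

k = 42                                  # index of the input word in the vocabulary
x = np.zeros(vocab_size); x[k] = 1.0    # its one-hot encoding

# Full matrix multiplication: 10,000-dim one-hot -> 300-dim hidden vector.
h_matmul = x @ W_in

# Equivalent table look-up: just read row k of the weight matrix.
h_lookup = W_in[k]
assert np.allclose(h_matmul, h_lookup)

# Output layer: scores over the vocabulary, turned into probabilities by softmax.
scores = h_lookup @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(probs.shape)  # (10000,)
```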

2. Embedding in Airbnb

As mentioned above, embedding can be used in feature engineering. In the paper Real-time Personalization Using Embeddings for Search Ranking at Airbnb, Airbnb applied this technique to two scenarios on its search platform, similar listing recommendation and search ranking, and successfully improved both.

Airbnb's main business scenario has the following characteristics:

  • A two-sided short-term rental platform (guests and hosts)
  • Guests find listings through search or system recommendations (99% of bookings come from these two channels)
  • A guest rarely books the same listing more than once
  • At any given time, a listing can be booked by only one guest
  • The data is therefore severely sparse

Personalized recommendation for this business also needs to take the following into account:

  • Real-time personalization for search ranking and similar listing recommendation
  • Search objectives defined according to business needs
  • Click-through rate (CTR), time spent viewing a listing, booking conversion rate, etc.
  • Unlike ordinary commodities, a user cannot complete a booking just by wanting it
  • In a two-sided market, service must be provided to both buyers (guests) and sellers (hosts)
  • Bilateral recommendation needs to consider both the guest's booking and the host's acceptance (host actions: reject, accept, no response)
  • For a query (with location and travel dates), search results are optimized for both host and guest:
  • Guest perspective: rank listings by location, price, type, reviews, and other factors so that guests see listings they are likely to like
  • Host perspective: listings whose hosts are likely to reject the guest (because of bad reviews, pets, length of stay, number of guests, and other factors) need to be filtered out or ranked lower
  • Learning to Rank is used for this, with the problem cast as pairwise regression: booked listings are positive samples and rejected ones are negative samples.

In view of the situations described above, Airbnb uses embedding technology to improve recommendation quality in its recommendation scenarios. The main implementation steps are:

  • Use embedding to learn a low-dimensional representation of each listing, improving the quality of similar listing recommendation

  • Use embedding to model users' long-term preferences, improving the quality of search ranking

2.1 Listing Embedding for Similar Listing Recommendation

Listing Embedding is used for Similar Listing Recommendation. When a user clicks on a short-term rental in the search results and views its details, the system recommends other rentals at the bottom of the page that are similar to the one clicked; this is the similar listing recommendation. However, if we build a user-based or item-based recommendation directly from booking data, the data is far too sparse. To address this, the paper proposes in-session personalization: it keeps track of the listings the user has clicked on recently and recommends listings similar to those recent clicks.

Based on user actions such as liking a listing, viewing its detail page, making a booking, or contacting the host, a series of click sessions is built. Each session is an uninterrupted sequence of M listings clicked by one user; if the interval between two clicks exceeds 30 minutes, the session is cut off there and subsequent clicks form the next session.
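A minimal sketch of this 30-minute session cut, assuming one user's click log is available as (timestamp, listing_id) pairs; the data layout is an assumption for illustration, not the paper's pipeline:

```python
from datetime import timedelta

SESSION_GAP = timedelta(minutes=30)

def split_into_sessions(clicks):
    """Split one user's clicks into sessions.

    clicks: list of (timestamp: datetime, listing_id: str), sorted by time.
    Returns a list of sessions, each a list of listing ids.
    """
    sessions, current = [], []
    last_ts = None
    for ts, listing_id in clicks:
        # Start a new session when the gap since the last click exceeds 30 minutes.
        if last_ts is not None and ts - last_ts > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append(listing_id)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions
```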

As shown in the figure above, each session is a sequence of the clicked listing ids. Each session can then be treated as a sentence and fed into the Skip-Gram model of Word2Vec introduced in the first chapter, where each listing plays the role of a word. The objective function of the original negative-sampling Skip-Gram model is as follows:
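For reference, per the Airbnb paper the negative-sampling objective takes roughly the following form (the notation is explained in the list below):

$$\mathop{\arg\max}_{\theta}\sum_{(l,c)\in D_p}\log\frac{1}{1+e^{-v'_c v_l}}+\sum_{(l,c)\in D_n}\log\frac{1}{1+e^{v'_c v_l}}$$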

  • $v_l$ denotes the embedding vector of the listing at the center of the current action

  • $v_c$ denotes the embedding vector of a context listing around the center

  • $(l,c)$ denotes the current input pair

  • $(l,c)\in D_p$ indicates that the current input pair is a positive sample

  • $(l,c)\in D_n$ indicates that the current input pair is a negative sample

2.1.1 Optimization 1: Booked Listing as Global Context

For click sessions that end with a booking, the booked listing is added to model training as a positive sample acting as global context. That is, whether or not the booked listing falls within the window around the current center listing, we always treat the booked listing as part of its context. The objective function is therefore modified as follows:
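Per the paper, the modified objective adds a global-context term for the booked listing $l_b$ to the objective above, roughly:

$$\mathop{\arg\max}_{\theta}\sum_{(l,c)\in D_p}\log\frac{1}{1+e^{-v'_c v_l}}+\sum_{(l,c)\in D_n}\log\frac{1}{1+e^{v'_c v_l}}+\log\frac{1}{1+e^{-v'_{l_b} v_l}}$$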

$l_b$ denotes the finally booked listing. What this objective function says is: with the current center listing $l$ as input, we want the context listings in $D_p$, as well as the eventually booked listing $l_b$, to appear with probability as high as possible, and the non-context listings in $D_n$ to appear with probability as low as possible.

The training of a single step then looks as follows:

2.1.2 Optimization 2: Adapting Training for Congregated Search

Most Airbnb users search within a fixed location (market), so the corresponding negative samples should also come from the same market. However, most listings obtained by random negative sampling do not satisfy this, so the second optimization adds negative samples drawn from the same market to the objective function:
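Per the paper, this extends the previous objective with a sum over the same-market negatives $D_{m_n}$, roughly:

$$\mathop{\arg\max}_{\theta}\sum_{(l,c)\in D_p}\log\frac{1}{1+e^{-v'_c v_l}}+\sum_{(l,c)\in D_n}\log\frac{1}{1+e^{v'_c v_l}}+\log\frac{1}{1+e^{-v'_{l_b} v_l}}+\sum_{(l,m_n)\in D_{m_n}}\log\frac{1}{1+e^{v'_{m_n} v_l}}$$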

Here $D_{m_n}$ is the set of negative samples drawn from the same market. Note that only the $l_b$ term is not preceded by a summation, because there is only one booked listing per training step.

2.1.3 Cold Start Problem

Airbnb's solution to the cold-start problem for new listings is simple: find the three most similar listings within a 10-mile radius and average their listing embeddings. The paper mentions that this simple method covers 98% of new-listing cold-start cases.
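A rough sketch of this cold-start averaging, assuming we already have coordinates and trained embeddings for existing listings. The helper names and the plain Euclidean distance on raw coordinates are simplifications for illustration, not the paper's exact neighbor-selection procedure (which also matches listing type and price range):

```python
import numpy as np

def cold_start_embedding(new_listing_coord, listing_coords, listing_embeddings, k=3):
    """Approximate the embedding of a new listing.

    new_listing_coord: (lat, lon) of the new listing.
    listing_coords: array of shape (N, 2) with coordinates of existing listings.
    listing_embeddings: array of shape (N, d) with their trained embeddings.
    Returns the mean embedding of the k nearest existing listings.
    """
    # Simplified distance on raw coordinates; a real system would use
    # geodesic distance and additional similarity constraints.
    dists = np.linalg.norm(listing_coords - np.asarray(new_listing_coord), axis=1)
    nearest = np.argsort(dists)[:k]
    return listing_embeddings[nearest].mean(axis=0)
```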

2.1.4 Effect evaluation

The effect of the listing embeddings is evaluated offline. The evaluation measures how well, given a user's most recently clicked listings, the embeddings can rank the listing the user eventually booked. The corresponding steps are:

  1. Obtain the user's most recently clicked listings, the candidate listings that need to be ranked, and the listing the user finally booked
  2. Compute the cosine similarity between the clicked listings and each candidate listing in the embedding space
  3. Rank the candidates by similarity and observe where the finally booked listing lands in the ranking (a small sketch follows this list)
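A minimal sketch of steps 2 and 3, assuming the trained embeddings are stored in a dict keyed by listing id; the function and argument names are illustrative:

```python
import numpy as np

def booked_listing_rank(clicked_ids, candidate_ids, booked_id, emb):
    """Rank candidates by cosine similarity to the recent clicks and
    return the rank (1 = best) of the finally booked listing.

    emb: dict mapping listing id -> embedding vector (np.ndarray).
    """
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Similarity of each candidate to the average of the clicked embeddings.
    click_vec = np.mean([emb[i] for i in clicked_ids], axis=0)
    ranked = sorted(candidate_ids,
                    key=lambda c: cosine(emb[c], click_vec),
                    reverse=True)
    return ranked.index(booked_id) + 1
```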

The effect is as follows:

The horizontal axis of the figure above shows how many clicks before the final booking the evaluation is performed at, and the vertical axis shows the average rank of the finally booked listing in the recommendation order. We can see that the purple curve, the variant with the additional negative sampling, performs best: on average it places the booked listing closest to the top.

In addition, clustering can be used to inspect the effectiveness of the embeddings. In the figure below, similar listings can be seen to cluster together clearly.

2.2 Type Embedding for Search Recommendation

Listing Embedding works for similar-listing recommendation, but search recommendation also involves long-term behavior that spans different locations. For example, a user's recent sessions may be searches for listings in Chengdu, but they previously stayed in Shanghai, Guiyang, and Beijing, and those stays also reflect their preferences. Listing Embedding does not apply here, for the following reasons:

  • The training data set is small
  • Many users booked only once in the past, and such data cannot be used to train the model (a high-dimensional sparsity problem)
  • Listings that have been booked only a few times in total on the platform (e.g., fewer than 5-10 times) also need to be removed
  • The time span is long, and users' preferences may have changed

So in this case Airbnb coarsens the granularity of the problem, aggregating users and listings according to rules into user_type and listing_type, and then building booked sessions of (user_type, listing_type) pairs for training. The aggregation rule is very simple: bucket each attribute by its value and concatenate the bucket ids, as shown in the figure below:

In this way, data that was originally specific to a single user is lifted to all users of the same type, and the data becomes dense. With this scheme, even the same listing or the same user can change type over time (for example, when the user's behavior changes). The first five features at the user level are generic profile features that can be mapped directly onto new users.
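A toy sketch of this kind of rule-based bucketing; the attribute names, bucket boundaries, and id format are invented for illustration and are not the paper's actual configuration:

```python
def bucketize(value, boundaries):
    """Return the index of the bucket that value falls into."""
    for i, b in enumerate(boundaries):
        if value < b:
            return i
    return len(boundaries)

def listing_type(listing):
    """Map a listing's attributes to a coarse listing_type id.

    listing: dict with illustrative keys such as 'market', 'price', 'capacity'.
    """
    price_bucket = bucketize(listing["price"], [60, 100, 160, 250])
    capacity_bucket = bucketize(listing["capacity"], [2, 4, 6])
    return f"lt_{listing['market']}_p{price_bucket}_c{capacity_bucket}"

print(listing_type({"market": "SF", "price": 120, "capacity": 3}))
# -> "lt_SF_p2_c1"
```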

The objective functions:

User Type Embedding:

Listing Type Embedding:

$v_c$ denotes the context parameter vector applied to the input item. $(l,m)\in D_m$ indicates that the current input pair belongs to the negative samples of the target market. The input data takes the form (user_type, listing_type); the two embeddings are mapped into the same feature space, so $v_c$ takes the same value for the same sample.
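For reference, per the paper the two objectives take roughly the following form, with the user_type $u_t$ (respectively the listing_type $l_t$) as the center item, $D_{book}$ the (user_type, listing_type) pairs from booked sessions, and $D_{neg}$ the sampled negatives:

$$\mathop{\arg\max}_{\theta}\sum_{(u_t,c)\in D_{book}}\log\frac{1}{1+e^{-v'_c v_{u_t}}}+\sum_{(u_t,c)\in D_{neg}}\log\frac{1}{1+e^{v'_c v_{u_t}}}$$

$$\mathop{\arg\max}_{\theta}\sum_{(l_t,c)\in D_{book}}\log\frac{1}{1+e^{-v'_c v_{l_t}}}+\sum_{(l_t,c)\in D_{neg}}\log\frac{1}{1+e^{v'_c v_{l_t}}}$$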

2.2.1 Explicit Negatives for Host Rejections

Because the host's rejection behavior also needs to be taken into account, listings whose hosts previously rejected this user are added to the model as negative samples, telling the model not to recommend this type of listing to this type of user. The objective functions above are therefore extended:

User Type Embedding:

Listing Type Embedding:
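Per the paper, each objective gains an additional sum over the rejection pairs $D_{reject}$. For the user_type side this looks roughly like the following (the listing_type side is symmetric):

$$\mathop{\arg\max}_{\theta}\sum_{(u_t,c)\in D_{book}}\log\frac{1}{1+e^{-v'_c v_{u_t}}}+\sum_{(u_t,c)\in D_{neg}}\log\frac{1}{1+e^{v'_c v_{u_t}}}+\sum_{(u_t,l_t)\in D_{reject}}\log\frac{1}{1+e^{v'_{l_t} v_{u_t}}}$$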

The training process becomes:

2.2.2 Real-time personalized search based on Embedding

After the embeddings above have been built, they are used as new features in the Search Ranking Model. The paper constructs the following new features:

The features in the first half aggregate the user's historical data from the past two weeks and are updated in real time; these are short-term features. The long-term part uses only one feature, UserTypeListingTypeSim: convert the current user_id to its user_type and the candidate listing_id to its listing_type, then compute their cosine similarity.
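A minimal sketch of how such embedding features could be computed at ranking time, assuming lookup dicts for listing, user_type, and listing_type embeddings. The helper names (e.g. emb_click_sim) are illustrative; only UserTypeListingTypeSim is a feature name taken from the paper:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def emb_click_sim(candidate_id, clicked_ids, listing_emb):
    """Short-term feature: similarity of the candidate listing to the
    listings the user clicked recently."""
    click_vec = np.mean([listing_emb[i] for i in clicked_ids], axis=0)
    return cosine(listing_emb[candidate_id], click_vec)

def user_type_listing_type_sim(user_type, listing_type, ut_emb, lt_emb):
    """Long-term feature: similarity between the user's type embedding
    and the candidate listing's type embedding (both in one space)."""
    return cosine(ut_emb[user_type], lt_emb[listing_type])
```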

The importance of features is as follows:

It can be seen that, once added alongside the original features, the newly constructed embedding features rank very high in importance.

The online effect is as follows:

It can be seen that DCU and NDCU, the main metrics they care about, are both significantly improved.

2.3 Paper Summary

The two embedding techniques proposed by Airbnb treat the data in search sessions as word-like sequences and learn an embedding for each listing id with a (modified) Word2Vec-style approach. Listing Embedding has the finest granularity: a user's click stream is split into sessions whenever the gap between clicks exceeds 30 minutes, which suits short-term personalized ranking and recommendation. User Type Embedding and Listing Type Embedding learn long-term interests and are better suited to long-term personalization. In addition, given the characteristics of the two-sided market, the strong feedback signals of booking and host rejection are added to guide the otherwise unsupervised learning.

3. Summary

Embedding is a very practical technique. Since it transforms sparse vectors into dense vectors, it can be used for feature embedding. Airbnb's use of this technique in its recommendation system is an example that successfully improved recommendation effectiveness. We can therefore also consider using embedding to construct new features in future feature engineering.

References

  1. Real-time Personalization using Embeddings for Search Ranking at Airbnb

  2. zhuanlan.zhihu.com/p/69153719

  3. zhuanlan.zhihu.com/p/102516974

  4. zhuanlan.zhihu.com/p/27234078