Airbnb personalized recommendations
Airbnb personalization scenarios
Airbnb usage scenarios:
Two-sided short-term rental platform (guests and hosts)
Guests find listings through search or system recommendations => these contribute 99% of Airbnb's bookings
It is rare for a guest to book the same listing more than once
A listing can be booked by only one guest for a given period
The data is therefore severely sparse
List Embedding
Map each listing => listing embedding
The dataset consists of N users' click sessions, where each session is an uninterrupted sequence of M listing ids clicked by a user
A new session starts whenever two consecutive clicks are more than 30 minutes apart
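The 30-minute session rule can be sketched as follows (a minimal illustration; `split_sessions` and the `(timestamp, listing_id)` input format are assumptions, not Airbnb's actual pipeline):

```python
from datetime import timedelta

def split_sessions(clicks, gap=timedelta(minutes=30)):
    """Split one user's click stream into sessions.

    `clicks` is a time-ordered list of (timestamp, listing_id) pairs;
    a new session starts whenever two consecutive clicks are more
    than `gap` apart.
    """
    sessions = []
    current = []
    prev_ts = None
    for ts, listing_id in clicks:
        if prev_ts is not None and ts - prev_ts > gap:
            sessions.append(current)
            current = []
        current.append(listing_id)
        prev_ts = ts
    if current:
        sessions.append(current)
    return sessions
```

Each returned session is then treated as one "sentence" of listing ids for training.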
The goal is to learn a D-dimensional (32-dimensional) embedding for each listing from the set S of sessions, so that similar listings are close to each other in the embedding space
Borrows the Skip-Gram algorithm from Word2Vec
Listing embedding treats each user's uninterrupted click session as a sentence and each listing as a word, then trains listing embeddings the same way word embeddings are trained
The use of Word2Vec
• Embedding maps words from their original space to a new space, so that semantically similar words are close to each other in that space
• Word embedding => learn the weight matrix of the hidden layer
• The input layer is a one-hot encoding, and the output layer is a vector of probability values
• The input layer and output layer sizes are both equal to the vocabulary size
• The number of neurons in the hidden layer is hidden_size (the embedding size)
• The weight matrix W between the input layer and the hidden layer has size [vocab_size, hidden_size]
• The output layer is a vector of size [vocab_size], with each value representing the probability of outputting that word
• Suppose there are several training samples: (juice, apple), (juice, pear), (juice, banana)
• The center word is juice, with apple, pear, and banana inside the window. The same input juice corresponds to different output words. After training on these samples, the output probabilities of apple, pear, and banana are all high => their corresponding hidden-layer parameters are similar
• Compute the cosine similarity between the hidden-layer vectors corresponding to apple, pear, and banana
• Word2vec gives words similarity and analogy properties
• The result we want is not the model itself but the hidden-layer parameters, i.e., the mapping from the input vector to the new embedding
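The hidden-layer mechanics above can be illustrated with a toy full-softmax Skip-Gram in NumPy (a minimal sketch, not gensim's optimized implementation; the pair list, learning rate, and sizes are illustrative assumptions):

```python
import numpy as np

def train_skipgram(pairs, vocab_size, hidden_size=8, lr=0.1, epochs=200, seed=0):
    """Toy full-softmax Skip-Gram.

    W_in ([vocab_size, hidden_size]) holds the hidden-layer weights we keep
    as the word embeddings; W_out maps the hidden layer back to a
    vocab_size-dimensional probability vector.
    """
    rng = np.random.default_rng(seed)
    W_in = rng.normal(scale=0.1, size=(vocab_size, hidden_size))
    W_out = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
    for _ in range(epochs):
        for center, context in pairs:
            h = W_in[center]                  # one-hot input selects one row of W_in
            logits = h @ W_out
            p = np.exp(logits - logits.max())
            p /= p.sum()                      # softmax over the vocabulary
            p[context] -= 1.0                 # gradient of cross-entropy wrt logits
            grad_h = W_out @ p
            W_out -= lr * np.outer(h, p)
            W_in[center] -= lr * grad_h
    return W_in

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Vocabulary: 0=juice, 1=apple, 2=pear, 3=banana; window pairs in both directions
pairs = [(0, 1), (0, 2), (0, 3), (1, 0), (2, 0), (3, 0)]
W = train_skipgram(pairs, vocab_size=4)
```

Because apple, pear, and banana all share the context word juice, their hidden-layer vectors end up more similar to one another than to juice's vector, which is the similarity property described above.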
Evaluation of List Embedding
Offline evaluation of List Embedding:
Before running online search tests with the embedding-based recommendation system, several offline tests are needed. The purpose is to compare embeddings trained with different parameters and to settle the embedding dimension, algorithm choices, and so on.
Evaluation criterion: given a user's most recent click, test how highly the listing the user eventually books ranks among the recommended candidates
Step 1: collect the listing the user most recently clicked, the candidate listings that need to be ranked, and the listing the user eventually booked
Step 2: compute the cosine similarity between the clicked listing and each candidate listing in the embedding space
Step 3: rank the candidate listings by similarity, and observe the position of the eventually booked listing in this ranking
The search results are re-ranked by embedding similarity, and the average rank of the eventually booked listing is computed over the clicks preceding each booking, traced back as far as 17 clicks before the booking
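The three steps above can be sketched as follows (`booked_rank` and the toy vectors are illustrative; the real evaluation averages this rank over many bookings and click positions):

```python
import numpy as np

def booked_rank(clicked_vec, candidates, booked_id):
    """Rank candidate listings by cosine similarity to the embedding of the
    most recently clicked listing, and return the 1-based rank of the
    listing that was eventually booked.

    `candidates` maps listing_id -> embedding vector."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    ranked = sorted(candidates,
                    key=lambda lid: cos(clicked_vec, candidates[lid]),
                    reverse=True)
    return ranked.index(booked_id) + 1
```

A lower average rank of the booked listing means the embedding is better at surfacing what the user actually wants.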
Evaluation of List Embedding:
The validity of the embedding is verified in several ways:
K-means clustering: cluster the embeddings and check that the clusters separate geographically
Cosine similarity between embeddings:
Cosine similarity between listings of different types
Cosine similarity between listings in different price ranges
Basic attributes (price, geographical location) can be read off directly, while implicit attributes, such as architectural style and design, can be discovered through the embedding
Compute the k-nearest neighbors of each listing's embedding and compare the listing with its neighbors => an embedding evaluation tool
A demo video on YouTube shows this tool and demonstrates that the embedding works well => similar listings are close to each other in the embedding space
Cold start of List Embedding
Similar housing recommendation based on List Embedding
Each Airbnb listing detail page contains a "similar listings" carousel, recommending listings that are similar to the current one and bookable in the same time frame
After list embedding was deployed, an A/B test showed that embedding-based recommendation increased the click-through rate of "similar listings" by 21% and the bookings generated through "similar listings" by 4.9%
In embedding-based recommendation, similar listings are found as the k nearest neighbors in the list-embedding space
Given a well-trained list embedding, compute the cosine similarity between the given listing's vector and the vectors of all listings from the same destination that are bookable for the specified dates (conditional on the check-in and check-out dates: a listing must be available for that period) => the K listings with the highest similarity form the similar-listings list
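The selection just described can be sketched as follows (the function name and the availability filter are illustrative assumptions, not Airbnb's production code):

```python
import numpy as np

def similar_listings(query_id, embeddings, available_ids, k=3):
    """Return the k listings most similar to `query_id` by cosine similarity,
    restricted to listings that are bookable for the requested dates.

    `embeddings` maps listing_id -> vector; `available_ids` is the set of
    same-destination listings open for the guest's check-in/check-out dates."""
    q = embeddings[query_id]
    q = q / np.linalg.norm(q)  # normalize once; cosine becomes a dot product
    scored = []
    for lid in available_ids:
        if lid == query_id:
            continue
        v = embeddings[lid]
        scored.append((float(q @ v) / float(np.linalg.norm(v)), lid))
    scored.sort(reverse=True)
    return [lid for _, lid in scored[:k]]
```

Restricting the candidate set before the nearest-neighbor search keeps the lookup cheap and guarantees every recommendation is actually bookable.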
User Type Embedding and Listing Type Embedding
Long-term behavior matters: a person who booked a certain type of room in another city long ago is likely to prefer the same type of room in the current city
Capture this information from the listings a user has booked
Construct the dataset: moving from click sequences to booking sequences, the dataset is a set of sessions consisting of the listings booked by N users; each session can be represented as the time-ordered sequence of listings booked by one user
Existing problems:
The training dataset is small, because booking data is an order of magnitude smaller than click data
Many users have booked only once in the past, and a single booking cannot be used to train the model
Listings booked only a few times in total on the platform (e.g., fewer than 5-10 bookings) also need to be removed
The time span is long, and users' preferences may have changed over it
Real-time personalized search based on Embedding
Compute the cosine similarity between the user's user-type embedding and each candidate listing's listing-type embedding
Recommend the listings with the highest similarity to the user
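A minimal sketch of this scoring step (the function name and dict format are assumptions; in practice such a similarity is typically one personalization feature fed into the search-ranking model rather than the sole score):

```python
import numpy as np

def rank_by_user_type(user_vec, listing_type_vecs):
    """Order candidate listings by the cosine similarity between the
    searcher's user-type embedding and each candidate's listing-type
    embedding; higher similarity means a better personalized match."""
    u = user_vec / np.linalg.norm(user_vec)
    sims = {lid: float(u @ v) / float(np.linalg.norm(v))
            for lid, v in listing_type_vecs.items()}
    return sorted(sims, key=sims.get, reverse=True)
```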
Fintech data analysis
Fintech application scenarios
Fintech:
Finance + technology: making financial services more efficient through technological means
Financial services: insurance, banking, securities brokerage, and funds all need technology support. In addition, Internet companies are also launching financial businesses, such as Ant Financial
Fintech Companies & Talent development
The fintech industry seeks interdisciplinary talent with both digital skills and business-analysis skills, drawn from software engineers at technology companies and from the financial industry
Typical companies: Ant Financial, JD Finance, Grab, Sofi, Oscar Health, Nubank, Robinhood, Atom Bank, Lufax, Bloomberg, Factset, PayPal
How to use Python for quantitative trading
Quantitative trading platforms (vn.py, JoinQuant, Ricequant)