This article was first published on the WeChat official account “Shopee Technical Team”.

Abstract

In mainstream search engines, shopping apps, chatbots and other applications, drop-down recommendations effectively help users find the content they need quickly, and have become a standard feature. In this article, the Shopee Chatbot team introduces how drop-down recommendations were built from 0 to 1 in Chatbot, and shares lessons from iteratively optimizing the models.

In particular, we explore a multi-language, multi-task pre-trained language model and apply it to vector recall in drop-down recommendation to improve recall quality. In addition, to make drop-down recommendations as helpful as possible in solving users’ problems, we model two objectives simultaneously, user clicks and problem resolution, and explore multi-objective optimization.

1. Business background

1.1 Shopee Chatbot

With the expansion of Shopee’s business, consumer demand for customer-service consultation continues to climb. The Shopee Chatbot team is committed to building an organic combination of Chatbot and human customer-service agents based on artificial intelligence. Chatbot handles users’ everyday consultation demands, provides a better experience, relieves the pressure on human customer service, and saves the company substantial labor costs. So far, we have launched Chatbot in several markets. As shown above, users can experience the Chatbot product through the Me page in the Shopee App.

We are also continuously refining Shopee Chatbot to enhance its functionality and help users solve their shopping problems. Drop-down recommendation is one of its most important features.

1.2 Drop-down Recommendation

Drop-down recommendations, also known as input suggestions, search suggestions, autocomplete, or question recommendations, have become a standard feature in many products, including major search engines, shopping apps, and chatbots. Suggestions are displayed as the user types a query, helping users express what they want to retrieve more quickly and, in turn, find the content they need faster.

In Shopee Chatbot, we likewise want a drop-down recommendation feature, so that users can solve their problems faster and enjoy a better shopping experience.

2. Overall plan

To add drop-down recommendation to the current Chatbot scenario, we drew on search and recommendation systems and adopted a recall + ranking pipeline, as shown in the figure below. Given the user’s current input, we find the most similar and relevant suggestions to display. To this end, we need to build a recommendation candidate pool, multi-way recall, and a ranking module.

2.1 Recommended candidate pool

2.1.1 Construction process

At present, the way we construct the recommendation candidate pool is relatively simple and includes three steps:

  • Collect data from multiple sources, including solution titles, annotated data for intent recognition, and a large number of chat logs;
  • Perform preprocessing, such as removing messages that are too short or too long, and removing duplicate queries using edit distance or clustering;
  • Finally, to control the quality of the recommendations, local business teams in each market review these queries or rewrite them into standardized queries, for example correcting errors and removing profanity.
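The edit-distance deduplication step above can be sketched as follows. The `max_dist` threshold and the greedy keep-first strategy are illustrative assumptions, not the production implementation:

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def dedup_queries(queries, max_dist=2):
    """Keep a query only if it is more than `max_dist` edits away
    from every query already kept."""
    kept = []
    for q in queries:
        if all(edit_distance(q, k) > max_dist for k in kept):
            kept.append(q)
    return kept
```

Near-duplicates such as typo variants collapse into a single canonical query, which keeps the candidate pool compact before human review.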

2.1.2 Recommended Example

Here are some sample entries from our recommendation candidate pool. Every suggestion is mapped to a solution; when the user clicks the suggestion, we return the corresponding solution as the answer.

| Suggestion | Solution |
| --- | --- |
| May I know the status of my order? | Solution 1 |
| I have not received my order yet. | Solution 1 |
| Why is my order not delivered yet? | Solution 1 |
| Why i cannot activate my shopeepay? | Solution 2 |
| Why am I unable to set up my shopeepay account for my refund? | Solution 2 |
| What is shopeepay? | Solution 3 |
| I would like to check on my shopeepay status. | Solution 4 |
| Why is my order cancelled? | Solution 5 |
| When is the delivery time for my order? | Solution 6 |

2.2 Multiple recall

For recall, we adopt multi-way recall followed by merging; at present this includes text recall and vector recall.

2.2.1 Text recall

For text recall, we use the industry-standard Elasticsearch (ES) for keyword-matching recall. In addition to ranking by ES’s BM25 score [1], we use the CTR of the solution as a weighting factor to further improve recall quality.
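One common way to combine BM25 with a CTR weighting factor in ES is a `function_score` query. The sketch below is an assumption about how such a query could look; the field names `suggestion_text` and `solution_ctr` are hypothetical, not our actual schema:

```python
def build_recall_query(user_query: str, size: int = 20) -> dict:
    """Compose an ES function_score query: BM25 text match on the
    suggestion, re-weighted by the historical CTR of its solution."""
    return {
        "size": size,
        "query": {
            "function_score": {
                # BM25 keyword-matching part of the score
                "query": {"match": {"suggestion_text": user_query}},
                "functions": [
                    {
                        "field_value_factor": {
                            "field": "solution_ctr",  # per-document historical CTR
                            "factor": 10.0,
                            "modifier": "log1p",      # dampen large values
                            "missing": 0.0,
                        }
                    }
                ],
                "boost_mode": "sum",  # final score = BM25 score + CTR factor
            }
        },
    }
```

`boost_mode` controls how the function value combines with the BM25 score; `sum` keeps zero-CTR documents retrievable, whereas `multiply` would suppress them.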

2.2.2 Vector recall

Text recall is easy to understand and implement, but its effectiveness depends on whether the query and suggestion share matching keywords. To recall suggestions that use different words but express the same meaning as the query, we would often need typo correction, stopword removal, synonym replacement, and query rewriting, especially when few suggestions match exactly.

In our scenario, user queries are long and colloquial, so the cost of mining stopwords and synonyms is high. In addition, Shopee Chatbot serves different regions and languages, and this diversity of scenarios and data further increases the algorithmic workload.

With this in mind, we adopted vector recall [2]: a recall model is trained on large amounts of weakly supervised data and user behavior logs, the query vector and suggestion vector are used to measure semantic similarity, and this implicit synonym-rewriting recall supplements text recall to alleviate the problem to some extent.

The vector recall scheme also suits multi-region, multi-language deployment, reducing adaptation cost. In addition, it supports cross-language recall, such as entering a query in Chinese and recalling an English suggestion.

Since the current recommendation candidate pool is monolingual (e.g., English), text recall alone cannot return an appropriate suggestion when the user types in another language. Semantically similar suggestions, such as “Where to use my voucher” and “where can I use a voucher”, can also be matched by cross-language vector recall.

In addition, for multilingual markets such as MY and PH, we only need a single multilingual vector recall model to understand and recall across languages, instead of training a recall model for each language.

To implement vector recall, we need a text encoder that maps text to a vector, which is then used for retrieval. There are usually two approaches:

1) Based on the two-tower model

In the two-tower model [2], a query tower and a suggestion tower encode the user’s query and the suggestion respectively, mapping each to a dense vector. Based on the query and suggestion vectors, we can compute their similarity, such as cosine similarity. We want related query–suggestion pairs to be more similar, and unrelated pairs less so. Therefore, a loss can be computed from the similarity and the relatedness label, and the model trained accordingly.
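The training objective can be sketched with in-batch negatives, a common choice for two-tower retrieval models. This NumPy sketch assumes the towers have already produced embeddings; the softmax contrastive loss and the temperature value are illustrative, not necessarily the exact loss used in our system:

```python
import numpy as np

def cosine_sim(q, s):
    """Cosine similarity matrix between L2-normalized query and suggestion vectors."""
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)
    s = s / np.linalg.norm(s, axis=-1, keepdims=True)
    return q @ s.T

def in_batch_softmax_loss(q_vecs, s_vecs, temperature=0.05):
    """Row i's positive is suggestion i; the other rows in the batch
    serve as negatives (in-batch negative sampling)."""
    logits = cosine_sim(q_vecs, s_vecs) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # NLL of the positive pair
```

When each query embedding aligns with its own suggestion and not with the others, the diagonal dominates and the loss approaches zero.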

2) Based on pre-trained language model

Another approach is to use a pre-trained model as the text encoder, such as a pre-trained language model. This method is widely used and is currently state of the art in many NLP applications. Specifically, we use Facebook’s cross-lingual pre-trained language model XLM [3], which, combined with its tokenizer, can handle more than 100 languages.

2.2.3 Vector recall based on multi-language and multi-task pre-training

1) Continued pre-training

Since Facebook’s XLM is trained on open-domain data, we improved model performance by continuing pre-training on domain and task corpora to better fit our scenarios and data. Specifically, following [4], we use a three-stage method:

  • Stage 1: A large number of chat logs from different markets and languages are used as unlabeled data to train a multi-language masked language modeling task (M-MLM) [3];
  • Stage 2: A multi-language intent classification task (M-CLS) is trained using click logs from the intent recognition module as weakly annotated data (constructed as shown in the figure below). As in Stage 1, we use data from different markets for multi-task training;
  • Stage 3: XLM is further fine-tuned on the intent classification task (CLS) using annotated data from the intent recognition model. For intent classification in different markets, we fine-tune separately on the corresponding corpus.

2) Knowledge Distillation

To make the large pre-trained language model usable in online services, we further distilled its knowledge [5], turning the large model (teacher) into a smaller model (student), such as TextCNN, as shown in the diagram above.

In practice, we use the following three techniques to improve the distillation effect:

  • Introducing noise: following [6], we introduce noise during distillation to make the learned feature representations more robust and to improve sample efficiency, covering a more comprehensive input distribution;
  • Using a large amount of semi-supervised data: following [6], we use abundant semi-supervised data, such as unlabeled chat logs and weakly labeled click logs, so that the student fully learns the teacher XLM’s feature distribution over a broader input distribution;
  • Two-stage distillation: to preserve as much of the teacher XLM’s knowledge as possible, we use a two-stage distillation analogous to the continued pre-training, distilling in turn from the teacher XLM obtained in Stages 2 and 3.
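The core distillation objective [5] can be sketched as a weighted sum of a soft loss against the teacher’s temperature-scaled distribution and a hard loss against gold labels. The temperature `T` and weight `alpha` below are illustrative assumptions:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """alpha * soft cross-entropy against the teacher (temperature T)
    + (1 - alpha) * hard cross-entropy against gold labels."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student_T = np.log(softmax(student_logits, T) + 1e-12)
    # The T*T factor keeps soft-loss gradients on the same scale as the hard loss.
    soft = -np.mean((p_teacher * log_p_student_T).sum(axis=-1)) * T * T
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    hard = -np.mean(log_p_student[np.arange(len(labels)), labels])
    return alpha * soft + (1 - alpha) * hard
```

A student that matches the teacher’s logits on correctly labeled samples attains a lower loss than one that disagrees with both.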

The complete continuing pre-training and distillation process is shown below.

2.2.4 Experimental results

1) Experiment 1: Intention recognition task

| Model | Accuracy |
| --- | --- |
| TextCNN (baseline) | 0.64287 |
| XLM | 0.63125 |
| + chat-log continued pre-training | 0.66556 |
| + click-log continued pre-training | 0.67346 |
| + distillation | 0.66400 |

Since our XLM model is trained on intent-recognition data and tasks, we first compare the different methods on the intent recognition task. Here, TextCNN (baseline) is the benchmark model using FastText pre-trained word vectors with TextCNN, and XLM is the intent recognition model fine-tuned from the public XLM.

As the table above shows, continued pre-training on domain and task data improves the pre-trained model’s performance by 3% to 4%, while distillation costs only about 1%.

2) Experiment 2: Drop-down recommendation task

| Model | Recall rate |
| --- | --- |
| Text recall | 0.83443 |
| Vector recall (SentBERT) | 0.84129 |
| Vector recall (TextCNN) | 0.88071 |

We then verify the effect of applying the pre-trained model to vector recall in drop-down recommendation. On offline datasets, we compare text recall with vector recall. For vector recall, we use text encoders based on the public SentBERT [7] and on the TextCNN distilled from the teacher XLM, respectively.

As the table shows, vector recall outperforms text recall overall, and the TextCNN with domain pre-training and distillation is 4% better than SentBERT.

3) Comparison before and after distillation

| Model | Model size | Inference time |
| --- | --- | --- |
| XLM | 1064MB | 150ms |
| TextCNN | 180MB | 1.5ms |
| Reduction | 83% | 99% |

Experiment 1 showed that the distilled model loses only 1% relative to XLM. The table above further compares model size and inference time before and after distillation: both drop significantly, and the reduction in inference time in particular makes the distilled TextCNN suitable for online inference.

2.3 Sorting Module

2.3.1 Background

Based on multi-way recall, we obtain a preliminary ranking of the recalled suggestions. On top of this, a ranking stage is generally still needed, for the following reasons:

  • Recall scores, such as BM25 or vector similarity, are not sufficient for the best ranking quality;
  • Beyond the input query and suggestion, we want to include other features (for example, user features) for better ranking;
  • We also want to jointly optimize multiple goals, such as click-through rate (CTR) and resolution rate (analogous to conversion rate, CVR).

2.3.2 CTR prediction model based on DeepFM

Based on the exposure and click data accumulated by drop-down recommendations, we built a CTR prediction model to rank the recommendations. Specifically, we adopted the widely used DeepFM [8].

In the input layer at the bottom of the figure above, we have various types of features: text features such as the input query and suggestion, categorical features such as the solution ID, and numeric features such as statistics of the solution. In the middle layer, we designed a processing unit for each input type, such as a CrossEncoder for text input (e.g. ALBERT [9], RE2 [10], ESIM [11]).

2.3.3 Multi-objective ranking model based on ESMM

At the same time, we also attempted multi-objective optimization. In the Chatbot scenario, we want the user to click a suggestion, and we want the solution to solve the user’s problem. We define a problem as solved if the user was not transferred to a human agent, did not drop out, and left no negative feedback. The user path is shown in the figure below.

By analogy with search, recommendation, and advertising scenarios, estimating the resolution rate can be likened to estimating the conversion rate (CVR). Therefore, we tried the multi-objective optimization model ESMM [12] on top of DeepFM, and further explored an ESMM variant using MMoE [14] and an attention mechanism. The structures of the two models are shown in the figure below.

The probabilities in the figure above satisfy

$$P(\text{resolved}=1, \text{click}=1 \mid q, s) = P(\text{click}=1 \mid q, s) \cdot P(\text{resolved}=1 \mid \text{click}=1, q, s)$$

where $q$ denotes the input query and $s$ the recommended suggestion, and

$$pCTCVR = P(\text{resolved}=1, \text{click}=1 \mid q, s)$$

$$pCTR = P(\text{click}=1 \mid q, s)$$

$$pCVR = P(\text{resolved}=1 \mid \text{click}=1, q, s)$$

The final model’s loss is the sum of the CTR loss and the CTCVR loss, both computed over all exposure samples.

2.3.4 Experimental results

1) Experiment 1: CTR prediction

model NDCG@5
BM25 (baseline) 0.66515
LR 0.73303
XGBoost[13] 0.75931
DeepFM+ALBERT[9] 0.77874
DeepFM+RE2[10] 0.77118
DeepFM+ESIM[11] 0.79288

We compared the performance of different CTR prediction models based on exposure and click data accumulated by pull-down recommendations. As shown in the table above, DeepFM with different CrossEncoder has the optimal effect.

2) Experiment 2: Multi-objective optimization

model AUC on CTR Task AUC on CVR Task AUC on CTCVR Task
DeepFM (CTR Task) 0.85500 0.52100 0.77470
ESMM NS 0.85342 0.66379 0.79534
ESMM 0.85110 0.58316 0.78895
ESMM+MMoE 0.85410 0.65215 0.79532

Based on the exposure, click, and resolution data accumulated by pull-down recommendations, we compare the effectiveness of different multi-objective sequencing models, with ESMM as shared-bottom and ESMM NS as non-shared-DEEPFM. As shown in the above table, compared with the pure CTR model, the MULTI-objective model based on ESMM has better effects in both CTR and CVR (in our scenario, corresponding resolution rate) tasks.

It can also be seen that THE effect of ESMM NS is better than that of ESMM, which is contrary to the result of the original PAPER [12] of ESMM.

For this result, our guess is that the resolution rate in our scenario is generally high (e.g., over 70%), so the sample size of the resolution rate prediction task is not as sparse as that of the CVR task. In this case, the shared-bottom model does not necessarily improve the performance. On the contrary, different DeepFM for each task can learn different features of different tasks and achieve better results. However, ESMM NS has higher memory footprint and calculation time than ESMM.

We further tried to integrate MMoE and attention mechanism into the FRAMEWORK of ESMM, and the specificity between learning tasks was further improved, while its performance was only 1% lower than that of ESMM NS in CVR tasks. ESMM+MMoE achieves a good balance between memory footprint/computation time and performance due to its shared DeepFM.

3. System implementation

In Chapter 2, we introduce the overall scheme of Shopee Chatbot’s pull-down recommendation. In this chapter, we reintroduce the overall system architecture (as shown in the figure above), including the offline part and the online part.

3.1 Offline Part

3.1.1 Model training

For the two core models, namely the text encoder in vector recall and the sorting model, we will use exposure and click log to regularly train and update the model, so that the model can maintain a good effect. The process is shown in the figure below.

3.2 Online Part

3.2.1 Model inference

The reasoning part of the model is to chain the parts together. For the user’s input query, the text encoder is called first to obtain its text vector, then the text recall and vector recall are requested in parallel, and then the recall results are combined. Finally, the sorting model is called, and the Top5 sorted recommendations are output and presented to the user as suggestion.

To speed up model reasoning, we used ONNX to deploy the text encoder and the sorting model (both using PyTorch training). At the same time, Redis is used to cache the results of multi-way recall with high frequency query input to reduce double calculation and recall requests. Since the subsequent ranking model will use the characteristics of users and scenarios, we do not directly cache the final Top5 recommendation results at present, but the ranking model dynamically determines the final ranking output.

3.2.2 Recall services

After the recommendation candidate pool is obtained, or the recommendation candidate pool is updated, we build the text recall service based on Elasticsearch.

For vector recall, we used the vector recall service built by Shopee Chatbot engineering team based on Faiss, and the process is shown in the figure below.

At present, the candidate pool of pull-down recommendation relies on manual annotation and the number is relatively small, around 10,000 level, so we choose HNSW [15]. Suggestion 5W or so, 512 dimensional vectors, the overall index size is about 800M.

At the same time, we compare the online retrieval effect of IndexIVFFlat and HNSW, HNSW can provide faster retrieval effect under the condition of lower CPU resource consumption, and meet the demand of online service of pull-down recommendation.

3.2.3 Model experiment

For the convenience of AB experiment, we designed and implemented the pull-down recommendation system so that different modules could be easily assembled within it to build available online services, such as combining different recall and sorting models. In addition, combined with log data and experimental reports, we can further analyze the experimental effect.

4. Business effect

With the collaboration of multiple business, product, engineering and algorithm teams, we launched pull-down recommendations for the first time in June 2021 and rolled them out to all markets covered by Shopee Chatbot in September 2021, with good business results. So far, we have about 3 major versions, which are as follows:

  • Function launch: launched in all markets covered by Shopee Chatbot, the overall resolution rate increased by 1%;
  • Recall optimization: online multi-way recall based on text recall and vector recall, CTR online increased by 2%;
  • Ranking optimization: CTR prediction model based on LR and DeepFM+ESIM was put online, and CTR increased by 2% and 6%, respectively.

5. Future outlook

After a period of optimization, Shopee Chatbot pull-down recommendation has achieved preliminary results. In the future, we hope to further improve its effectiveness in the following aspects to help users better use Chatbot products.

1) Data

  • Expand the number of recommendation pools to cover more user queries;
  • Improve the quality of the recommendation pool to provide a better user experience.

2) Recall

  • The present vector recall model is optimized by using the exposure and click data accumulated by pull-down recommendation.
  • Explore multi-language recall and solve the drop-down recommendation problem in multi-language scenarios [16].

3) Sorting

  • Introduce more user and scene features, improve ranking effect, and explore personalized user recommendation;
  • Evaluate the online effect of multi-objective ranking model, and also try other multi-task and multi-objective optimization methods [17].

4) Products

  • The mechanism of exploration and utilization is introduced to solve the problem of cold start of knowledge points and continuously optimize the effect.
  • Explore new product forms, such as auto – complete mechanism.

6. References

[1] en.wikipedia.org/wiki/Okapi_…

[2] Embedding-based Retrieval in Facebook Search

[3] Unsupervised Cross-lingual Representation Learning at Scale

[4] Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data

[5] Distilling the Knowledge in a Neural Network

[6] Self-training with Noisy Student improves ImageNet classification

[7] Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

[8] DeepFM: A Factorization-Machine based Neural Network for CTR Prediction

[9] ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

[10] Simple and Effective Text Matching with Richer Alignment Features

[11] Enhanced LSTM for Natural Language Inference

[12] Entire Space Multi-task Model: An E o ffective Approach for Estimating post-click Conversion Rate

[13] XGBoost: A Scalable Tree Boosting System

[14] Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts

[15] github.com/facebookres…

[16] Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation

[17] Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations

In this paper, the author

Qinxing, Yihong, Yiming, Chenglong, from Shopee Chatbot team.

Thank you

Lennard, Hengjie, from Shopee Chatbot team.