Abstract: Deep learning is a complete solution that can not only process and learn features but also produce the final ranking scores. With deep learning, the way search and recommendation work will change greatly. Want to know how Alibaba applies deep learning to search and recommendation? Want to know how the personalized search results of Mobile Taobao and Youku are produced? This article is not to be missed!

Video of this talk: click.aliyun.com/m/48161/
PDF download: click.aliyun.com/m/49207/

About the speaker:
Sun Xiuyu (nickname: Chong) is an algorithm expert at the Alibaba Machine Intelligence Technology Laboratory. Since joining Alibaba in 2014, Sun has worked on basic deep learning research and its application in various industries.

The following content is compiled from the speaker’s video talk and slides.

This article covers the following topics:
  1. Why use deep learning
  2. Mobile Taobao main search scenario
  3. Recommendation scenario on the details page
  4. Youku search scenario
The article first explains why Alibaba uses deep learning in search and recommendation. It then uses three typical scenarios to show how deep learning techniques are applied there: the Mobile Taobao main search scenario, the Mobile Taobao details page recommendation scenario, and the Youku search scenario.

1. Why use deep learning

Why does Alibaba use deep learning in search and recommendation? As is well known, a traditional search or recommendation task amounts to feature engineering over items, users, and other context such as search terms. The features include statistical features, ID features, and a variety of hand-crafted cross features. These hand-crafted features are then fed into a machine learning tool such as LR or XGBoost, which is trained on logs of clicks and other user behavior to produce a ranking model for a specific search or recommendation area.
With deep learning, this whole way of working changes. As you may know, deep learning was first applied in the image domain. One of its major advantages is that it can directly learn features that would otherwise be designed by hand, or learn from past samples the features it finds more valuable, and these learned features can replace hand-crafted ones. This is the feature extraction capability of deep learning. Deep learning also has very strong fitting ability in classification, stronger than XGBoost, GBDT, and LR. Deep learning is therefore a complete solution that can process features, learn features, and produce the final ranking scores. These are the reasons Alibaba uses deep learning in search and recommendation.

2. Enabling e-commerce – Mobile Taobao main search scenario

For the main search scenario of Mobile Taobao, an End2End training framework was designed. It learns features automatically from raw behavior data, such as users’ historical clicks and purchases, so that the final model ranks items better for a given user and search term and improves the business metric (GMV).

Model structure

The model structure designed for Mobile Taobao is shown in the figure below. The model divides the input into three main domains: the user representation domain, the item representation domain, and the search term representation domain. Unlike traditional schemes, this model uses no statistical features. It uses only ID features for users, items, and search terms, in the same way as a traditional one-hot representation. These ID features cover more than 100 million items, more than 200 million users, and more than 5 million frequently used queries.
The dotted box in the figure shows the embedding process. The information in each domain is embedded into a low-dimensional continuous space by a three-layer fully connected network. The difference from a standard network is that the first layer is a sparse fully connected layer rather than a dense one. After all the IDs in the three domains have been mapped to the low-dimensional space, a Concat layer merges the information, followed by another three-layer fully connected network. The learning targets are the click, conversion, and purchase behaviors mentioned above. With this structure, the search ranking solution can be trained End2End.
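
The structure described above can be sketched in a few lines of plain Python. This is only an illustrative forward pass under stated assumptions, not Alibaba's implementation: the layer sizes, the toy vocabulary sizes, the ReLU activations, the single click output, and the names Tower and score are all hypothetical, and training is omitted.

```python
import math
import random

random.seed(0)

def dense(n_in, n_out):
    """Randomly initialised weight matrix (n_in x n_out); biases omitted for brevity."""
    return [[random.gauss(0, 0.1) for _ in range(n_out)] for _ in range(n_in)]

def forward(x, w, relu=True):
    """One fully connected layer: y = x @ w, optionally followed by ReLU."""
    y = [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]
    return [max(0.0, v) for v in y] if relu else y

class Tower:
    """One domain (user, item or query): ID -> low-dimensional vector via three layers."""
    def __init__(self, vocab_size, hidden=16, out=8):
        self.table = dense(vocab_size, hidden)  # sparse first layer: one row per ID
        self.w2 = dense(hidden, hidden)
        self.w3 = dense(hidden, out)

    def __call__(self, idx):
        # Multiplying a one-hot vector by the first weight matrix is a row lookup.
        h = [max(0.0, v) for v in self.table[idx]]
        return forward(forward(h, self.w2), self.w3)

# Toy vocabulary sizes (assumed); the article cites far larger ID spaces.
user_tower, item_tower, query_tower = Tower(50), Tower(40), Tower(30)
top = [dense(24, 16), dense(16, 8), dense(8, 1)]  # three layers after the Concat

def score(user_id, item_id, query_id):
    """Concatenate the three domain vectors and map them to a click probability."""
    x = user_tower(user_id) + item_tower(item_id) + query_tower(query_id)  # Concat
    x = forward(x, top[0])
    x = forward(x, top[1])
    logit = forward(x, top[2], relu=False)[0]
    return 1.0 / (1.0 + math.exp(-logit))

p = score(user_id=3, item_id=7, query_id=1)
```

In a real system the weights would be trained on click, conversion, and purchase labels; here they are random, so p only demonstrates the data flow.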

Item encoding

As mentioned above, Mobile Taobao initially represents items, users, and search terms with one-hot features. The problem with this approach is the very high dimensionality of items and users, on the order of more than 100 million dimensions. Representing such high dimensions directly with one-hot vectors consumes a lot of resources. The main search therefore uses random coding to reduce the n-dimensional one-hot representation to an n/20-dimensional code.
The method is very simple but effective: build a mapping. In the figure, red points mark the non-zero bits and black points mark zeros. Each n-dimensional one-hot vector on the left is mapped to a code on the right in which six red points are set. A constraint is imposed: any two codes may share at most 3 bits. Under this constraint, the one-hot representation is forced into a lower-dimensional space. Each code is expressed by multiple points, and different codes differ enough from one another, which achieves compression of the one-hot encoding.

In affiliated coding, popular items, whose behavior data is rich, each get their own unique code. Unpopular items have sparse behavior data, so something similar to hashing is used to give items considered related a similar code.

Word-segmentation coding is similar in spirit to affiliated coding. It adds some manually designed codes on top of the random coding described above. For example, “red” and “dress” in a query each get a dedicated code, and the part prone to confusion can use a special code.
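
The random coding constraint can be illustrated with a small rejection-sampling sketch. It assumes six active bits per code and at most three shared bits between any two codes, as in the text; the function name random_codes, the sampling strategy, and the toy sizes (1000 IDs compressed to 50 dimensions, that is n/20) are assumptions made for illustration, not the production method.

```python
import random

def random_codes(num_ids, code_dim, active=6, max_overlap=3, seed=0, max_tries=100_000):
    """Assign each ID a set of `active` bit positions in a `code_dim`-dimensional code.

    A candidate is accepted only if it shares at most `max_overlap` positions
    with every code accepted so far (simple rejection sampling).
    """
    rng = random.Random(seed)
    codes = []
    tries = 0
    while len(codes) < num_ids:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("code_dim is too small for the requested constraints")
        candidate = frozenset(rng.sample(range(code_dim), active))
        if all(len(candidate & code) <= max_overlap for code in codes):
            codes.append(candidate)
    return codes

# 1000 IDs would need 1000 one-hot dimensions; here they fit in 1000 / 20 = 50.
codes = random_codes(num_ids=1000, code_dim=50)
```

Each entry of codes lists the positions of the 1-bits for one ID, which is also the form a sparse first layer consumes directly.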

Sparse coding layer
As mentioned earlier, the sparse coding layer is implemented as a sparse fully connected layer. Its main purpose is to reduce computation. Replacing the original dense matrix multiplication with a sparse matrix multiplication greatly reduces the amount of computation and improves efficiency. It also relieves memory pressure: a problem that might originally have required multiple machines with multiple GPUs can be handled on a single machine with a single GPU, which improves training efficiency.
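
The saving can be seen in a minimal sketch that compares a dense product with its sparse equivalent. The dimensions, the weight values, and the six active positions are made up for illustration; the point is only that a 0/1 input with a handful of 1-bits needs a handful of row additions instead of a full matrix product.

```python
import random

random.seed(0)
code_dim, hidden = 50, 4
W = [[random.gauss(0, 1) for _ in range(hidden)] for _ in range(code_dim)]

def dense_layer(x, W):
    """Dense product x @ W: touches every row of W, even where x is zero."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def sparse_layer(active, W):
    """Sparse product: x is given by the indices of its 1-bits, so only those rows are summed."""
    return [sum(W[i][j] for i in active) for j in range(len(W[0]))]

active = [3, 11, 19, 27, 35, 42]  # the six 1-bits of one hypothetical code
x = [1.0 if i in active else 0.0 for i in range(code_dim)]

dense_out = dense_layer(x, W)      # 50 * 4 multiply-adds
sparse_out = sparse_layer(active, W)  # 6 * 4 additions
```

Both calls return the same vector; the sparse version never reads the rows of W that correspond to zero inputs, which is also why the full input vector never has to be materialized in memory.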

Multi-task learning

Multi-task learning is used to learn the final ranking score in Mobile Taobao search.

In traditional search ranking, the pipeline is usually divided into two or three steps. The first step is recall: items related to the current search term are selected to form a candidate pool, which is then narrowed layer by layer. The items in the pool are first sorted by item-level information, such as historical statistics or scores, and the top items are kept. In this way the data volume shrinks from hundreds of millions to millions, and then to tens of thousands. Finally, fine ranking is carried out, where the user’s personalized information is added to the ranking model to improve the conversion rate.

The stage that reduces the items from millions to tens of thousands is called pre-selection. This stage depends only on the items, so it is trained at the same time as the ranking model. This is equivalent to learning the user representation and an item score simultaneously, so that the model learns how good or bad an item is. Learning the two tasks together yields two scores: the pre-selection score and the fine-ranking score. Using both scores together increases the diversity of the ranking results, which in turn improves the final conversion target.
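
A minimal sketch of the two-task idea follows. It assumes a shared item embedding with two heads, an item-only pre-selection score and a personalized fine-ranking score, trained with the sum of two log losses. The search term is left out for brevity, and the embedding sizes, the dot-product heads, and the way the two scores are used at serving time are all assumptions rather than the production design.

```python
import math
import random

random.seed(0)
DIM = 4

def vec():
    return [random.gauss(0, 0.5) for _ in range(DIM)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

item_emb = {i: vec() for i in range(5)}  # shared by both tasks
user_emb = {u: vec() for u in range(3)}
w_pre = vec()                            # head 1: item-only quality score

def preselect_score(item):
    """Task 1: how good the item is, independent of the user."""
    return sigmoid(dot(w_pre, item_emb[item]))

def rank_score(user, item):
    """Task 2: personalised score for this user and item."""
    return sigmoid(dot(user_emb[user], item_emb[item]))

def log_loss(p, y):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def joint_loss(samples):
    """Sum of both task losses; minimising it trains the shared item embedding for both."""
    total = 0.0
    for user, item, clicked in samples:
        total += log_loss(preselect_score(item), clicked)
        total += log_loss(rank_score(user, item), clicked)
    return total / len(samples)

samples = [(0, 1, 1), (0, 2, 0), (1, 1, 1), (2, 4, 0)]  # (user, item, clicked)
loss = joint_loss(samples)

# Serving (assumed): the item-only score prunes the pool, the personalised score orders it.
pool = sorted(item_emb, key=preselect_score, reverse=True)[:3]
ranking = sorted(pool, key=lambda i: rank_score(0, i), reverse=True)
```

Because the pre-selection head never sees the user, its scores can be computed once per item and reused across requests.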

Multimodal and online learning

Multimodal and online learning are also used in the Taobao main search scenario, mainly to cope with Taobao’s big promotion events. “Double 11” is a typical big promotion: user behavior on that day is extremely rich, and the marketing and promotion activities, as well as item behavior, are highly varied. Taobao’s solution is to use deep learning to learn the long-term, stable relationships among users, items, and search terms, and at the same time to bring in traditional hand-crafted features such as continuous, ID, and cross features, together with features that express items in real time. The two kinds of features are fused, and the last three fully connected layers are then learned through an online learning mechanism. Alternatively, only the LR in the last layer is learned online. The result takes into account both the stable preferences of items and users and the users’ preferences during the promotion. By fusing features learned by deep learning with hand-crafted features, Taobao’s recommendations have achieved very good results in “Double 11” and other promotions.
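
The lighter variant, learning only the last LR layer online, might look like the sketch below. The class name OnlineLR, the learning rate, and the feature values are hypothetical; the deep features stand in for the frozen output of the offline model, and the real-time features stand in for hand-crafted signals computed during the promotion.

```python
import math

class OnlineLR:
    """Logistic regression updated one sample at a time with SGD."""

    def __init__(self, dim, lr=0.1):
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """One SGD step on the log loss for a single (features, label) pair."""
        g = self.predict(x) - y
        self.w = [wi - self.lr * g * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * g

deep_features = [0.2, -0.1, 0.4]  # frozen output of the offline deep model (made up)
realtime_features = [1.0, 0.3]    # e.g. promotion flag, recent click rate (made up)
x = deep_features + realtime_features  # fusion by concatenation

model = OnlineLR(dim=len(x))
before = model.predict(x)
for _ in range(20):               # a stream of clicks on this kind of sample
    model.update(x, 1)
after = model.predict(x)
```

Only the small LR weight vector changes during the promotion, so the stable long-term representations learned offline are left untouched.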

3. Enabling e-commerce – details page “look again and again” recommendation

The previous section covered deep learning techniques commonly used in search and the results obtained. This section turns to recommendation. The recommendation and search scenarios have both similarities and differences. In search, items are related to the current search term. In recommendation, the system has to guess which items are related to the user’s past behavior; in the details page scenario shown in the figure below, the recommended items must also be related to the item on the details page. The two scenarios therefore differ in how candidate items are selected during recall. In the final ranking stage, however, the tasks are quite similar, which is why a similar scheme can be used for both.

The details page recommendation scenario works as follows: given a main item, other items in the current store are recommended based on it. An interesting aspect of this scenario is that behavior is very rich overall, with hundreds of millions or even a billion user views every day, yet the behavior that links one item to another is very sparse. Users view many items every day, but very few effective behaviors occur between items, and the data is highly imbalanced. Training directly on this data can therefore be skewed, and the model does not train particularly well because of the data distribution and the sparsity of user behavior.

Transfer learning

To address these problems, Taobao adopted a new idea for details page recommendation: transfer learning. First, the full set of data on users’ historical behavior in Taobao, including search, recommendation, and advertising data, is processed in a unified way. Then the model structure described above for the search scenario is used to train a deep learning model and learn the features of users and items. In the unified processing, the scenario-specific domains are removed, such as the query or search term domain in the search scenario and the main item information of the details page in the recommendation scenario. Only the relationship between users and the items they click or purchase is kept, so that the deep learning model learns the features of users and items.

As mentioned above, an important reason deep learning is effective is that it learns a feature representation for the raw input by itself, that is, a better representation for the current target. The transfer learning scheme yields stable and robust feature representations of users and items. These features are used directly as the representations of items and users and combined with the logs of the current business, namely the details page recommendation business. A method similar to traditional machine learning is then applied, for example LR or GBDT, which can also be realized through DL. The transfer learning scheme formed by combining these two parts brought a large improvement. This method also ensures that scenarios with only a small amount of data, or relatively sparse data, can still achieve stable conversion and a higher click-through rate.
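
The two-stage idea can be sketched as follows, under stated assumptions: the embeddings below stand in for representations pre-trained on the full behavior logs and are kept frozen, and a small LR model is then fitted on a sparse details page click log. The embedding values, the element-wise product features, the learning rate, and the log itself are invented for illustration.

```python
import math

# Frozen representations standing in for the pre-trained model's output (made-up values).
USER_EMB = {"u1": (0.9, 0.1), "u2": (0.1, 0.8)}
ITEM_EMB = {"i1": (1.0, 0.0), "i2": (0.0, 1.0)}

def features(user, item):
    """Element-wise products of the frozen user and item vectors, plus a bias term."""
    u, v = USER_EMB[user], ITEM_EMB[item]
    return [a * b for a, b in zip(u, v)] + [1.0]

def predict(w, x):
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

def mean_log_loss(w, logs):
    total = 0.0
    for user, item, y in logs:
        p = predict(w, features(user, item))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logs)

# Small, sparse click log from the details page scenario: (user, item, clicked).
logs = [("u1", "i1", 1), ("u1", "i2", 0), ("u2", "i2", 1), ("u2", "i1", 0)]

w = [0.0, 0.0, 0.0]
loss_before = mean_log_loss(w, logs)
for _ in range(200):  # batch gradient descent on the LR weights only
    grad = [0.0] * len(w)
    for user, item, y in logs:
        x = features(user, item)
        err = predict(w, x) - y
        grad = [g + err * xi for g, xi in zip(grad, x)]
    w = [wi - 0.5 * g / len(logs) for wi, g in zip(w, grad)]
loss_after = mean_log_loss(w, logs)
```

Only three LR weights are fitted on the target scenario, which is why a small log can be enough once the representations have been learned elsewhere.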

4. New scenario exploration – Youku short video search

Next, Youku short video search is used as an example of Alibaba’s exploration of deep learning in new scenarios. The search, recommendation, and personalization described above were mostly in e-commerce, and they have now been extended to Youku’s short video search scenario.

Searching for short videos is not quite the same as searching for traditional TV series. On the one hand, the relevance between the search term and the short video must be considered. On the other hand, the actual quality of the short video must be considered, so the notion of conversion applies here as well. For short video search, the previous overall model scheme was improved in three respects.

First, the previous retrieval scheme introduced personalized information only in the re-ranking stage, over tens of thousands or thousands of items. In the short video search model, the personalized part is moved forward into recall and unified with the final ranking model.

Second, multimodal information is used. Video retrieval must take relevance into account, and the simplest form of relevance is between text and text. For a short video, however, the content being searched is expressed partly by text and partly by video and image information. The information in these different domains is handled with the unified model described above, which fuses behavior codes with content codes. In this way relevance is taken into account and the final conversion rate is improved.

Third, the representation of user history behavior is improved. Previously, users were represented mostly by their historical behavior. In the Taobao scenario, user behavior is very rich: each user generates a lot of behavior within a month, and the behavior is quite stable, so a large amount of historical data can be added to describe the user’s preferences. In a new scenario such as Youku short video search, user behavior is very sparse. A user may search for short videos only a few times in a month, so the search history can hardly describe the user’s real preferences. A new scheme is therefore adopted: all of the user’s historical viewing behavior across the whole network is extracted and used to express the user’s preferences, and this is fused with the user’s other information in the ranking model to strengthen personalization.

Personalized recall

For personalized recall, a classic model structure, DSSM, is adopted. Based on this structure, the model directly learns the relationships among the embeddings of search terms, users, and videos by minimizing the distance between embeddings. To simplify online computation, the user embedding and the search term embedding are simply added together rather than encoded jointly. The sum is then trained to minimize its cosine distance to the embedding of the relevant video.
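
The scoring step described above can be written out as a short sketch. The three-dimensional vectors are made up; in the real model they would be the outputs of the learned user, search term, and video encoders, and training would push the distance to the clicked video toward zero.

```python
import math

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for vectors pointing the same way, up to 2 for opposite ones."""
    return 1.0 - cosine(a, b)

user_vec = [0.2, 0.1, 0.0]        # made-up embeddings for illustration
query_vec = [0.1, 0.4, 0.3]
clicked_video = [0.3, 0.5, 0.3]
other_video = [-0.4, 0.1, -0.2]

request = add(user_vec, query_vec)  # user and search term are simply summed
d_clicked = cosine_distance(request, clicked_video)
d_other = cosine_distance(request, other_video)
```

Summing the two embeddings keeps the online step to one vector addition per request, at the cost of not modeling interactions between user and search term before the comparison.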

As shown on the left side of the figure above, the model does not learn an ID representation for every user, because that would require a large amount of effective behavior data. Instead, the user ID representation is replaced by video ID representations: a user is represented by a simple average over the IDs of the videos the user has watched. With this framework, the distances between embeddings are learned directly, and candidates are sorted by distance in ascending order to obtain the final ranking.

This is a personalized recall model because the user’s information is added directly to the relevance computation between search term and video, that is, the user’s information is taken into account in the very first step of retrieval. Previous models could complete recall with an inverted index alone. The embedding-based scheme in Youku short video search instead uses a new quantized index to retrieve from massive amounts of data. With engineering work and engine optimization for the quantized index, the personalized recall computation can be completed in real time, with better results and no increase in computation time.
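
A small sketch of this recall step is given below, assuming made-up video embeddings and a brute-force scan. The scan is only a stand-in for the quantized index mentioned in the text, whose details the article does not give; the names VIDEO_EMB, user_vector, and recall are hypothetical.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

# Made-up video embeddings; in practice these are learned by the recall model.
VIDEO_EMB = {
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.1],
    "dog_1": [0.1, 0.9, 0.0],
    "dog_2": [0.2, 0.8, 0.1],
    "cooking_1": [0.0, 0.1, 0.9],
}

def user_vector(watched):
    """Represent a user as the plain average of the embeddings of watched videos."""
    vecs = [VIDEO_EMB[v] for v in watched]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def recall(watched, query_vec, k):
    """Return the k videos closest to (user vector + query vector), nearest first."""
    request = [u + q for u, q in zip(user_vector(watched), query_vec)]
    return sorted(VIDEO_EMB, key=lambda v: cosine_distance(request, VIDEO_EMB[v]))[:k]

pets_query = [0.5, 0.5, 0.0]  # the same search term for both users
cat_fan = recall(["cat_1"], pets_query, k=2)
dog_fan = recall(["dog_1"], pets_query, k=2)
```

The same search term returns different candidates for the two users, which is what makes the recall personalized. A production system would replace the sorted scan with an approximate nearest-neighbor index.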

Multimodal representation

The model in the previous figure uses only representations of queries and IDs. In practice, recall also needs to consider the text of the search term and the text of the video, as well as the embedding of display images related to the video, as shown in the following figure. All of this information is considered together: a fused representation of behavior, text, video, and image information is used to improve the overall effect. Compared with the previous ID-only scheme, this scheme is more robust, because the added text information gives better recall for newly published videos.
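
Why text helps with new videos can be shown with a toy fusion sketch. Fusion by plain concatenation, the two-dimensional parts, and all values are assumptions made for illustration; the article does not specify how the modalities are combined. A newly published video has no behavior data, so its ID part is all zeros here.

```python
import math

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norms if norms else 0.0

def fuse(behavior_vec, text_vec, image_vec):
    """Fuse the modalities by concatenation (an assumed, simplest-possible choice)."""
    return list(behavior_vec) + list(text_vec) + list(image_vec)

# A query has no image, so its image part is padded with zeros.
query = fuse([0.6, 0.2], [0.9, 0.1], [0.0, 0.0])
new_video = fuse([0.0, 0.0], [0.9, 0.2], [0.3, 0.2])  # no behavior data yet

id_only_score = cosine(query[:2], new_video[:2])  # ID-only scheme: nothing to match on
fused_score = cosine(query, new_video)            # text still links query and video
```

With the ID part alone the new video cannot be scored at all, while the fused vector still matches on its title text.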

Representation of user history behavior

The simplest way to represent user history behavior is to represent the user by videos: either store the IDs of the videos the user has watched as a list, or embed each video first and then average the embedding vectors to express the user’s historical preference. There is a problem, however. A user’s viewing history, like purchase history in e-commerce, is diverse and may span multiple fields. Which part of it is most relevant to the search term in the current context? Finding the historical preferences most relevant to the current search term can greatly improve the ranking.

A simple example: when users search for “sports video”, some of them have previously watched football videos and others basketball videos. By computing the similarity between the search term embedding and the embeddings of past videos, or by applying Attention, the current “sports video” query is used to pick out the historical behaviors most relevant to it. Different users thus get different representations, irrelevant information is ignored, and the user’s preference under the current search term is captured better. Combining this preference with the overall model framework described earlier produces the recall results shown to the user.

This is the improved scheme for representing user history behavior. The Attention-based scheme, together with the other improvements described above, has greatly improved conversion in the Youku short video search scenario.
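
The Attention step can be sketched as a softmax over query-to-history similarities followed by a weighted average. Dot-product scoring, the softmax normalization, and the three-dimensional embeddings are assumptions; the article only states that similarity or Attention is used to select the relevant history.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_pool(query, history):
    """Weight each watched video by its similarity to the query, then average.

    Returns (weights, pooled_vector); the weights are a softmax over dot products.
    """
    scores = [dot(query, h) for h in history]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * h[j] for w, h in zip(weights, history)) for j in range(len(query))]
    return weights, pooled

# Made-up embeddings: axis 0 ~ football, axis 1 ~ basketball, axis 2 ~ cooking.
sports_query = [0.7, 0.7, 0.0]
history = [
    [1.0, 0.0, 0.0],  # football video
    [0.0, 0.0, 1.0],  # cooking video
    [0.9, 0.1, 0.0],  # another football video
]

weights, user_pref = attention_pool(sports_query, history)
```

For this user the cooking video receives the smallest weight, so the pooled preference leans toward football, whereas a plain average would weight all three videos equally.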


To sum up, this article first described the strengths of deep learning: strong feature extraction and fitting capabilities. It then introduced the deep learning techniques used in several Mobile Taobao scenarios. Finally, it showed how the search and recommendation practice from e-commerce was extended to the new scenario of short video to improve results.
