As senior director of the Precision Display Technology Department at Alimama, Gai Kun, who goes by the name Jingshi inside Alibaba, is known as an “algorithm genius”. In 2011, shortly after joining Alibaba, he proposed the piecewise linear model MLR, which substantially improved CTR estimation accuracy at a time when the industry mainly relied on simple linear models. MLR has since been used widely for several years in Through Train targeted advertising and Diamond Booth display advertising.
Gai Kun and his team have since introduced a new model structure for CTR prediction, the deep user interest distribution network. It starts from the observation that a user’s interests are diverse, and uses deep learning to perform a partial match between the user’s historical behaviors and the candidate ad: the better a historical behavior matches the ad, the more it influences the prediction, which lets the model single out the interests relevant to the current candidate. Gai Kun walked through these algorithms at the Xinzhiyuan Industry · Jump AI Technology Summit on March 29.
Gai Kun: I am very glad to share with you how our deep learning work has evolved. Alimama is the big data marketing platform of Alibaba Group and the business unit responsible for Alibaba’s monetization. Everyone at Alibaba uses a nickname, a “flower name”, to communicate; mine is Jingshi. My research areas are machine learning, computer vision, recommender systems and computational advertising. I studied computer vision through my undergraduate and doctoral studies at Tsinghua University. After graduating I joined Alibaba’s Advertising Technology Department, which later became the Alimama business unit, responsible for all of Alibaba’s advertising monetization products. I am now a researcher at Alimama in charge of the precision targeted advertising technology team, which is responsible for products such as Smart Diamond Booth display advertising and Through Train targeted advertising. Those familiar with the Alibaba ecosystem may know these two products.
My talk has three parts. First, how deep learning has evolved on Internet data; second, how to use deep learning in advertising, recommendation and search, and in particular how to solve the problems that arise in retrieval; and finally, the challenges ahead.
First, big data on the Internet. What characterizes Internet data? The first feature is scale: in machine learning terms, the feature dimension is extremely high and the number of samples is very large. The second is that Internet data contains rich internal relationships.
Here is an example of the data on a typical app or website. On one side there are many users, and on the other many items (goods or other material); both are big data. Historically there is a large volume of behavior, which is one kind of connection between users and items. Extending further, every user has profile information, and every item has a title, a detail page, reviews and so on. At very large scale, all of this data is linked by such relationships. That is the nature of Internet data.
CTR prediction. Take this classic problem: why does CTR estimation matter? It is the core technology of advertising, recommendation and search, which are core businesses for many companies. Taking advertising as the example, there are two reasons. First, CTR estimation in advertising is fertile ground for deep learning research, with many new techniques to explore and evolve. Second, CTR estimation is directly tied to platform revenue for Internet companies, which is arguably even more important for AI. Many AI companies are focused on the future, but where does the cash flow come from? Many Internet companies earn their cash from advertising, so advertising matters.
Using advertising as the example, let us look at how deep learning has progressed on the core problem of CTR prediction. Traditional CTR prediction falls into two camps. The first uses a small number of strong, hand-designed features, usually strong statistical features, so the dimension is not very high; Yahoo’s GBDT-based approach is representative. This approach is simple and effective, but the manual processing makes the data lose resolution and keeps the dimension very low. The second mainstream approach expands the data into a very high-dimensional representation. The classical choice is large-scale logistic regression, a generalized linear model. The model is very simple, but its modeling power is limited.
Before getting to deep learning, I would like to describe my first piece of work at Alimama. We turned logistic regression from a simple linear model into a nonlinear one, which can be viewed as a three-layer neural network. As mentioned above, a classic approach is large-scale data plus logistic regression. One problem with logistic regression is that a linear model is too simple, so a great deal of manual feature engineering is needed to make it work well. Our first idea was to make the algorithm smarter, so that it automatically extracts nonlinear patterns from large amounts of data.
We tried a piecewise linear model, and the idea behind it is quite intuitive. The whole space is divided into many regions, with a linear model inside each region. Adjacent regions are joined smoothly, so over the whole space the model is piecewise linear. With enough regions, any complex nonlinear surface can be approximated.
This is a schematic of the model from a neural network point of view. How is a sample scored? First, compute its membership in each region; with four regions this gives a four-dimensional membership vector. Suppose the sample falls exactly in the first region, so its membership vector is (1, 0, 0, 0). Each region also has its own predictor, a linear classifier, so the four regions yield four predictions, which form a second four-dimensional vector. Taking the inner product of the two vectors picks out the prediction of the first region. In practice, for mathematical convenience, the model uses a soft membership rather than the hard membership (1, 0, 0, 0).
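The scoring procedure just described can be sketched in a few lines. This is a minimal illustration and not Alimama’s implementation: it assumes the soft membership is a softmax over linear region scores and that each region’s predictor is a logistic regression, and the toy parameters `U`, `W` and the sample `x` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw region scores into soft memberships that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlr_predict(x, region_weights, predictor_weights):
    """p(click | x) = sum_i softmax_i(u_i . x) * sigmoid(w_i . x)."""
    memberships = softmax([dot(u, x) for u in region_weights])
    predictions = [sigmoid(dot(w, x)) for w in predictor_weights]
    # Inner product of the membership vector and the prediction vector.
    return dot(memberships, predictions)

# Hypothetical toy parameters: 4 regions, 2 input features.
U = [[8.0, 0.0], [-8.0, 0.0], [0.0, 8.0], [0.0, -8.0]]   # region (gating) weights
W = [[1.0, 1.0], [-1.0, 0.5], [0.3, -2.0], [0.0, 0.0]]   # per-region predictors
x = [3.0, 0.0]                                           # falls deep inside region 1

memberships = softmax([dot(u, x) for u in U])
p = mlr_predict(x, U, W)
```

Because the sample sits deep inside the first region, its soft membership is almost exactly (1, 0, 0, 0) and the output is essentially the first region’s prediction. Near a boundary the memberships blend, which is what joins the regions smoothly.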
How to train this model is the main difficulty. We also added a grouped sparsity technique on the coefficients so that, under big data, the model selects features automatically. The result is a non-convex, non-smooth optimization problem. The model was proposed in 2011 and the algorithm went online in 2012. At that time there were no good tools for non-convex, non-smooth problems: a non-smooth objective is not differentiable everywhere, so it is unclear how to descend where no derivative exists. However, although the function is not differentiable everywhere, it does have a directional derivative everywhere, so we use directional derivatives to find the steepest descent direction and a quasi-Newton method to accelerate convergence. This work is called mixed logistic regression (MLR), and those of you who have worked on CTR estimation probably know it. It became the foundation for our exploration of deep learning in advertising.
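The talk does not spell out the form of the grouped penalty. One common way to obtain feature selection from grouped coefficients is an L2,1 norm, which sums the L2 norms of each feature’s parameter group; the sketch below assumes that form, and the function names and toy numbers are hypothetical.

```python
import math

def l21_penalty(params_by_feature):
    """Sum over features of the L2 norm of that feature's parameter group."""
    return sum(math.sqrt(sum(v * v for v in group)) for group in params_by_feature)

def selected_features(params_by_feature, tol=1e-12):
    """Indices of features whose parameter group is not entirely zero."""
    return [i for i, group in enumerate(params_by_feature)
            if any(abs(v) > tol for v in group)]

# Each row holds all parameters tied to one input feature
# (for example, its weights across every region of the model).
groups = [
    [3.0, 4.0, 0.0, 0.0],   # feature 0: active
    [0.0, 0.0, 0.0, 0.0],   # feature 1: whole group driven to zero, so dropped
    [0.0, 0.0, 1.0, 0.0],   # feature 2: active
]
penalty = l21_penalty(groups)     # 5.0 + 0.0 + 1.0
kept = selected_features(groups)
```

A penalty of this kind is not differentiable where a group is exactly zero, which is one source of the non-smoothness mentioned above. Directional derivatives still exist at those points, which is what makes the descent scheme in the talk applicable.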
MLR can be viewed as a three-layer neural network: it turns the large-scale, sparse, discrete input into two vectors and takes their inner product. Concatenate the two vectors into one long vector and you have what is now called an embedding. Once very large, hard-to-handle data has been embedded into a continuous space as vectors, those continuous vectors are easy to process with deep learning models such as multi-layer perceptrons. Our first step toward deep learning taught us a lesson that has run through all of our later design. We extracted the intermediate vectors produced by MLR and fed them directly into a multi-layer perceptron as input. This gave no improvement, for two reasons: first, MLR is already nonlinear; second, there was no end-to-end training.
The breakthrough came when embedding learning and multi-layer perceptron training were combined into a single end-to-end learning process, which improved clearly on the earlier technology. This also explains why deep learning has made major breakthroughs only in the last decade. Without end-to-end training, one has to use a shallow model to generate features, train on them, then generate features again, layer by layer. Many people were unable to make such deep cascaded networks work, until end-to-end learning allowed breakthroughs on many problems. We stacked grouped embeddings with a multi-layer perceptron, and this became Alimama’s first-generation deep learning network. It is trained end to end on multiple GPUs with tens of billions of samples and hundreds of millions of feature dimensions, and serves online traffic. After launch it improved CTR and GMV significantly.
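To make the “grouped embedding plus multi-layer perceptron” structure concrete, here is a forward pass only. It is a sketch under assumptions: the feature group names, vocabulary sizes, sum pooling within a group, layer widths and random weights are all hypothetical, and no training is shown. In end-to-end training, gradients from the click loss would update both the MLP weights and the embedding tables.

```python
import math
import random

random.seed(0)
EMBED_DIM = 4

def make_table(rows, cols):
    """Randomly initialised matrix, used for embedding tables and layer weights."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def sum_pool(table, ids):
    """Embed each sparse ID of one feature group and sum the vectors."""
    pooled = [0.0] * len(table[0])
    for i in ids:
        for d, v in enumerate(table[i]):
            pooled[d] += v
    return pooled

def dense(x, weights, bias, activation):
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical feature groups, each with its own embedding table.
groups = ("user_profile", "behaviors", "ad")
tables = {
    "user_profile": make_table(10, EMBED_DIM),
    "behaviors": make_table(50, EMBED_DIM),
    "ad": make_table(20, EMBED_DIM),
}
sample = {"user_profile": [3], "behaviors": [7, 12, 41], "ad": [5]}

# Grouped embedding: one pooled vector per group, concatenated into one input.
concat = []
for name in groups:
    concat.extend(sum_pool(tables[name], sample[name]))

# Multi-layer perceptron on top of the concatenated embeddings.
hidden = dense(concat, make_table(8, len(concat)), [0.0] * 8, relu)
ctr = dense(hidden, make_table(1, 8), [0.0], sigmoid)[0]
```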
So far I have covered the classic, relatively standard use of deep learning in advertising. Next, let us look at Internet data itself and at how insight into user behavior leads to better deep learning models. Here is an example. With embeddings, each item is represented as a point in the embedding space, and so is each user. The final interest score is computed between the user point and the item point. If that score decreases with distance, the user’s interest becomes a unimodal function over the space: interest peaks at the user’s own point and falls off with distance.
But are users’ interests really unimodal? We do not think so. Have you shopped on Singles’ Day? The shopping cart is typically filled with many different kinds of goods, which shows that a user’s interests are diverse. We find the same diversity on ordinary days outside big promotions. A user’s behavior sequence contains many subsequences from different categories, and the user keeps jumping between them.
Based on this insight, we proposed a deep neural network that models the multi-peak distribution of user interests. We describe a user’s multiple interests by extracting subsequences. CTR estimation always involves a candidate item. When we estimate the CTR of a candidate, we use that item to pull out of the user’s behavior sequence only the subsequences that help the estimate, instead of using the entire sequence. In this way, the relevant subsequence is extracted from a complex sequence that mixes many subsequences, and is used to form a user representation specific to this item. The multi-peak interest distribution can be seen as finding, for any item, the interest peak closest to it and computing the degree of interest from that peak. That is roughly the process.
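One way to realize this candidate-driven extraction is a soft version of it: weight every historical behavior by how well it matches the candidate, then sum. The sketch below is a simplification under assumptions, not the production model: it scores matches with a plain dot product and normalizes with a softmax, and the 2-D embeddings are hypothetical toy values.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def interest_vector(behaviors, candidate):
    """Weight each historical behavior by its match with the candidate, then sum."""
    scores = [dot(b, candidate) for b in behaviors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * b[d] for w, b in zip(weights, behaviors))
              for d in range(len(candidate))]
    return pooled, weights

# Hypothetical embeddings: two behaviors from one category (near the first axis)
# interleaved with two from another category (near the second axis).
behaviors = [[5.0, 0.0], [0.0, 5.0], [4.0, 0.5], [0.5, 4.0]]

pooled_a, weights_a = interest_vector(behaviors, [5.0, 0.0])   # candidate in category A
pooled_b, weights_b = interest_vector(behaviors, [0.0, 5.0])   # candidate in category B
```

The same behavior sequence yields two different user representations, one per candidate: behaviors from the candidate’s own category dominate the weights, so each candidate effectively sees the interest peak nearest to it.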
In practice we used a technique similar to attention to achieve this, and it also significantly improved CTR and GMV on Alimama’s traffic. When users browse material on the Internet, it is important to understand the nature of that material. In e-commerce, for example, users often decide what to do next based on the product image. Can we feed this image information into a deep neural network to model user interest better? This brings a challenge. Once every behavior is represented by a product image rather than a product ID, the data volume of each sample grows many times over. An ID takes a few bytes, whereas an image is often hundreds of kilobytes or even a few megabytes, so the data volume grows at least a thousandfold. Internet-scale data already needs tens, hundreds or thousands of machines training in parallel; with a thousandfold explosion in data volume, the problem is very hard to handle even for a company like Alibaba.
How can this challenge be solved? We analyzed the Parameter Server, a distributed architecture often used for deep learning on Internet data. In this design a worker traverses the samples and pulls parameters from the server when needed. Can it carry the load? First, if images are stored inside the samples, a thousandfold expansion of storage is unacceptable. Storing each image once on a remote server removes the redundant storage, so storage is solved, but the relevant images still have to be transmitted, and a thousandfold explosion in traffic is also unacceptable. So can the remote side hold more than parameters and images? Can it also hold part of the model? In our solution the remote side holds both the images and a sub-model that processes them, while the worker side runs the main CTR model and traverses the samples. The two models are joined and trained end to end. As I mentioned earlier, an important lesson is that only end-to-end training works. Many teams within Alimama had tried adding image features to the CTR prediction model as plain features, and when the CTR model is already strong, this has no effect. We therefore train end to end and propose a new Model Server, changing the distribution mode from parameter distribution to model distribution. The server side holds not only parameters but also a sub-model that performs computation and is updated together with the main model on the worker side. Each image is thus turned into a vector before transmission, cutting traffic by tens or hundreds of times and making joint training feasible. Meeting this challenge through a change to the distributed framework significantly improved click-through rate and revenue across Alibaba’s internal business lines and commercial platform.
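The traffic argument can be illustrated with a toy comparison between a remote store that ships raw images and a server that runs an image sub-model and ships only the resulting vector. Everything here is hypothetical: the class names, the 200 KB fake image, the 256-dimensional output and the byte-averaging stand-in for a learned sub-model. It sketches the idea and does not describe the actual framework.

```python
class RemoteImageStore:
    """Remote side stores each image once; workers must pull the raw bytes."""
    def __init__(self, images):
        self.images = images

    def fetch(self, image_id):
        return self.images[image_id]

class ModelServer:
    """Remote side stores images AND a sub-model; workers pull a small vector."""
    def __init__(self, images, embed_dim=256):
        self.images = images
        self.embed_dim = embed_dim

    def embed(self, image_id):
        data = self.images[image_id]
        chunk = len(data) // self.embed_dim
        # Stand-in for a learned image sub-model: mean intensity of each chunk.
        # In joint training this sub-model would be updated with the main model.
        return [sum(data[i * chunk:(i + 1) * chunk]) / (255.0 * chunk)
                for i in range(self.embed_dim)]

# Fake pixel data, about 200 KB for a single product image.
images = {"item_1": bytes(range(256)) * 800}

raw_bytes = len(RemoteImageStore(images).fetch("item_1"))
vector = ModelServer(images).embed("item_1")
vector_bytes = len(vector) * 4          # assuming 32-bit floats on the wire
reduction = raw_bytes / vector_bytes
```

With these made-up sizes the worker receives 1 KB instead of 200 KB per image, a 200-fold reduction, which is in the range the talk mentions. The real ratio depends on image size and on the dimension of the vector the sub-model emits.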
When deep learning is applied in the retrieval systems behind search, recommendation and advertising, it runs into the matching, or retrieval, problem. Such a system usually splits the handling of each request into several modules. A request typically represents a user browsing in some scenario. Matching comes first. After that, prediction models estimate the user’s interest in each given item, for example click-through rate and conversion rate, and then the results are ranked. Advertising also involves bids, while non-advertising scenarios do not. However, it is impossible to run these estimates over a huge item inventory.
Suppose the inventory holds 10 billion items. Computing the click-through rate of 10 billion items online for every user is impossible. The matching module in front must cut the candidates down to a far smaller set, on the order of thousands, so that the online system can bear the computation. The retrieval and matching stage at the front of the pipeline sets the upper limit on the performance of the whole system. No matter how refined the models behind it are, if matching is weak, the overall business goal cannot be improved. Matching methods fall into three categories. The first is heuristic statistical rules, which are now very mature; collaborative filtering is widely used in recommendation. It matches on item similarity: items similar to those in the user’s history are recommended, and many people have encountered recommendations of this kind. It makes personalization easy to implement and greatly improves business metrics over non-personalized approaches, but it brings a problem: users keep seeing products similar to their past behavior, which is something users complain about in many recommendation scenarios.
A natural way to improve matching is to introduce machine learning to measure interest and find the best items. But with machine learning, scoring the entire inventory is computationally hard, so the second category is a degenerate form. If the model is an inner product model, with the user as a vector and every item as a vector, finding the best items becomes a k-nearest-neighbor (KNN) search problem, and vector retrieval engines can find the nearest neighbors. The drawback is that CTR models often rely on cross features, user interest distribution and many other advanced deep learning structures, none of which can be used under the inner product form. To allow an arbitrary deep model to search for the optimum over the entire inventory, we proposed a tree-structured full-inventory retrieval engine. The idea is again intuitive. All items are organized into a hierarchical tree; for 1 billion items, a binary tree of about 30 levels suffices, and its leaf layer can hold 2 billion items. A deep model scores nodes level by level. At each level we keep only the best nodes, and the children of nodes that were not kept are never computed, which amounts to discarding them, until the best leaves are found. Going from top to bottom, this replaces scoring 1 billion items with scoring a small set of nodes at each of about 30 levels, which solves the problem of finding the optimum in the full inventory with deep learning, that is, the retrieval and matching problem. Compared with the previous two generations of methods, recommendation recall improves significantly. In addition, we restrict recommendations to items in categories the user has not interacted with, and use recall on these new categories as a combined measure of novelty and recall. On this measure the method is nearly four times better than first-generation collaborative filtering. This is our technical solution to full-inventory retrieval with deep learning.
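The level-by-level search can be sketched as a beam search over a complete binary tree. This is an illustration under strong assumptions: node scores here are fixed numbers in which every internal node takes the maximum of its children, whereas in the system described above a deep model would score each node for the current user. The tree size, beam width and score values are hypothetical.

```python
import heapq

def build_tree(leaf_scores):
    """Complete binary tree in heap layout; internal score = max of its children.

    The max property is an idealisation that makes the beam search exact.
    The number of leaves must be a power of two.
    """
    n = len(leaf_scores)
    tree = [0.0] * (n - 1) + list(leaf_scores)
    for i in range(n - 2, -1, -1):
        tree[i] = max(tree[2 * i + 1], tree[2 * i + 2])
    return tree

def beam_search(tree, num_leaves, k):
    """Keep the k best nodes per level; return (top-k leaf indices, nodes scored)."""
    first_leaf = num_leaves - 1
    beam, scored = [0], 1
    while beam[0] < first_leaf:
        # Only children of kept nodes are scored; every other subtree is discarded.
        children = [c for node in beam for c in (2 * node + 1, 2 * node + 2)]
        scored += len(children)
        beam = heapq.nlargest(k, children, key=lambda c: tree[c])
    return [node - first_leaf for node in beam], scored

NUM_ITEMS = 1024                                             # 10 levels below the root
leaf_scores = [(i * 37) % NUM_ITEMS for i in range(NUM_ITEMS)]  # distinct toy scores
tree = build_tree(leaf_scores)

top_items, nodes_scored = beam_search(tree, NUM_ITEMS, k=2)
brute_force = heapq.nlargest(2, range(NUM_ITEMS), key=lambda i: leaf_scores[i])
```

With 1,024 items and a beam of 2, the search scores 39 nodes instead of 1,024 leaves and still returns the same two items as the brute-force scan. The number of scored nodes grows with the depth of the tree times the beam width rather than with the size of the inventory.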
As for future challenges: machine learning needs labeled data, also called target data. This is a problem for the experience issues of recommendation and advertising, where such data is missing. At present the available targets are user-generated signals such as clicks and purchases, and we can optimize those metrics. Many experience problems have no labels, which makes them hard to solve with machine learning. How should experience problems be addressed? Should algorithms infer the user experience automatically, should we use human annotation, as search engines do with relevance teams that label how users would feel, or should users give feedback actively through interaction? This is a question for future exploration.
Recall is widely used to evaluate recommendation in both industry and academia. But recall only evaluates performance on items that users actually consumed; how well newly recommended items stimulate users is not reflected in it. Then there is the self-reinforcing loop of recommendation: you click on what interests you, you get more and more recommendations of the same kind, and you end up missing many other things that might interest you. Many apps also contain many recommendation scenarios, so how should multiple scenarios be coordinated? From the business side, every merchant faces the full, massive user base and needs to find potential customers. Merchants face the consumer’s entire journey, and optimizing and innovating along the whole chain, from latent interest to purchase, is a problem they hope to solve.
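For reference, here is what the recall metric measures, together with the category-restricted variant mentioned earlier as a proxy for novelty. This is a minimal sketch: the function names, item IDs and categories are hypothetical, and the exact evaluation protocol used in the talk is not specified.

```python
def recall(recommended, consumed):
    """Fraction of the items the user actually consumed that were recommended."""
    consumed = set(consumed)
    if not consumed:
        return 0.0
    return len(set(recommended) & consumed) / len(consumed)

def novel_category_recall(recommended, consumed, category_of, seen_categories):
    """Recall restricted to items in categories the user has not interacted with."""
    def is_novel(item):
        return category_of[item] not in seen_categories
    return recall([i for i in recommended if is_novel(i)],
                  [i for i in consumed if is_novel(i)])

category_of = {"a": "shoes", "b": "shoes", "c": "phones", "e": "books"}
seen_categories = {"shoes"}            # categories in the user's history
recommended = ["a", "c", "e"]
consumed = ["a", "b", "e"]             # what the user went on to consume

overall = recall(recommended, consumed)
novel = novel_category_recall(recommended, consumed, category_of, seen_categories)
```

Both numbers are computed only from items the user ended up consuming, which is the limitation raised above: an item that was recommended and might have sparked a new interest, but was never consumed in the logged data, contributes nothing to either score.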
The Alimama technology team continues to evolve and innovate in deep learning. We pursue business results, and behind those results we hope to do something different in technology and to build innovative business models. If you are interested, please feel free to contact us. Alimama and Tianchi are jointly hosting this year’s Alimama International Advertising Algorithm Competition, and anyone interested is welcome to take up the challenge.
Alimama International Advertising Algorithm Competition:
Alibaba (Taobao, Tmall) is the largest e-commerce platform in China, providing hundreds of millions of users with convenient and high-quality transaction services and accumulating massive transaction data. As Alibaba’s advertising business unit, Alimama has used this data over the past few years to predict users’ purchase intentions efficiently and accurately with artificial intelligence technologies such as deep learning, online learning and reinforcement learning, effectively improving users’ shopping experience and advertisers’ ROI. However, an e-commerce platform is a complex ecosystem, and factors such as user behavior preferences, the long-tail distribution of goods and marketing around hot events still pose great challenges for conversion rate estimation. For example, during the Singles’ Day shopping carnival, promotions by merchants and the platform cause drastic changes in traffic distribution, and a model trained on normal traffic does not fit this special traffic well. How to make better use of massive transaction data to predict users’ purchase intentions efficiently and accurately is a technical problem that artificial intelligence and big data must continue to solve in e-commerce scenarios.
In 2018, Alimama, together with the International Joint Conference on Artificial Intelligence (IJCAI-2018) and the Alibaba Cloud Tianchi platform, launched the Alimama International Advertising Algorithm Competition. Taking Alibaba’s e-commerce advertising as the research subject, the competition provides massive data from real scenarios and asks contestants to build prediction models with artificial intelligence techniques to predict users’ purchase intentions. The winning team will not only receive substantial prize money and travel sponsorship, but will also be eligible to attend the IJCAI-2018 main conference in Stockholm in July.
The original article was published on April 9, 2018
Author: Gai Kun
This article is from Xinzhiyuan, a partner of the cloud community. For related information, follow the WeChat public account “AI_era”.
Interpretation of Ali deep learning practice, CTR prediction, MLR model, interest distribution network, etc