Strata Data Conference, the world’s leading big data conference, was recently held in Beijing. Hailed by Forbes magazine as a “milestone in the big data movement,” Strata attracts the most influential data scientists and architects in big data and artificial intelligence. Chen Yuqiang, co-founder and chief research scientist of the Fourth Paradigm, was invited to attend and delivered a keynote entitled “Pain Points and Solutions in Industrial Applications of Artificial Intelligence.”
Yuqiang Chen is a world-class expert in deep learning and transfer learning. He has published papers at NIPS, AAAI, ACL, and SIGKDD, won the APWeb 2010 Best Paper Award, and ranked third in KDD Cup 2011; his academic work has been covered by MIT Technology Review. He also has extensive experience applying AI in industry: at Baidu Phoenix Nest he led the world’s first commercial deep learning system, and at Toutiao (Today’s Headlines) he led the design and implementation of the recommendation and advertising systems for its new traffic. As chief research scientist at the Fourth Paradigm, he leads the team researching cutting-edge machine learning technology, focusing on the company’s platform-level artificial intelligence product “Wevin.”
The following is based on Chen Yuqiang’s keynote speech.
Hello, everyone. I am Chen Yuqiang from the Fourth Paradigm, where I am mainly responsible for the research, development, and application of artificial intelligence algorithms. I am very glad to share with you some pain points of applying artificial intelligence in industry, together with the corresponding solutions.
Artificial intelligence is a very hot term and has been successfully applied in many fields, such as speech and image recognition. But has artificial intelligence reached the point where it can easily be put into production? What technology does industrial AI need? Let us start with these questions.
First of all, what kind of system does industrial artificial intelligence need? The rise of artificial intelligence comes from a combination of bigger data, better computing performance, and the development of parallel computing technology. At the same time, the problems in industry are very complex. Therefore, we need a system that is scalable, not only in terms of throughput and computing power, but also in terms of intelligence: it must become smarter as the volume of data and users grows. How do you build such a scalable system? The key point is that industry needs high-VC-dimension models to achieve this scalability of intelligence. How do you get a high-VC-dimension model? As we all know, machine learning = data + features + model. If the data is fixed, we need to optimize both the features and the model.
Features come in two kinds. One is macro features, such as age, income, or how many books you have bought and how many movies you have watched. The other is micro features, which are much finer-grained: which specific books you have read and which specific movies you have watched. Every movie, every book, and every person is a separate feature; with millions of books and millions of movies, the number of features is enormous.
Models can also be divided into two types: simple models, such as linear models, and complex models, such as nonlinear models.
This divides AI into four quadrants, as shown above. The first quadrant, in the lower left corner, uses a simple model with macro features. This approach is rarely used in industry: with few features and a simple model, the VC dimension is low, so it cannot solve very complex problems. The second quadrant, in the lower right corner, is a simple model plus micro features; the most famous example is Google AdWords, which uses a linear model with hundreds of billions of features to build the world’s best click-through-rate estimation system for ads. The third quadrant, in the upper left corner, is a complex model plus macro features. Many well-known companies, such as Bing Ads and Yahoo, have done very well here; the classic COEC plus a complex model is a common tool in this quadrant. Finally, the fourth quadrant, a complex model with micro features, is a research hotspot because the model space is so large.
So how do you push machine learning to higher VC dimensions along the model axis and the feature axis? Research on models happens mainly in academia, with most of the work coming from conferences such as ICML, NIPS, and ICLR. The three classic tools for nonlinearity are kernels, boosting, and neural networks. Kernels became very popular a decade ago, giving nonlinear power to the then-dominant SVM algorithm. Boosting is most widely used in the form of GBDT, which solves many problems well. Neural networks have also been applied successfully in many fields. The way industry optimizes along the model axis can be summarized as follows: first, form a hypothesis by examining past data; then abstract the hypothesis into a mathematical model with parameters; fit those parameters to the data; and finally use a held-out portion of the data to verify the model’s accuracy. Kepler’s discovery of his laws of planetary motion is a classic example of going along the model path. In the sixteenth century, Tycho Brahe practically strapped himself to his telescope for 30 years, observing the night sky and recording the movements of the planets. Based on these data, Kepler kept forming hypotheses, and finally hypothesized that planetary orbits are elliptical. He fitted elliptical equations to the data, found that the fit was very good, and thus obtained a new model: Kepler’s first law. This is the typical model path: by observing data, a scientist forms a hypothesis, which is a model; then uses the data to fit the model’s parameters; and finally verifies the model’s correctness on new data.
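As a toy illustration of this model path, the sketch below fits the parameters of an elliptical-orbit model to synthetic observations and checks it on held-out data; the orbit equation, noise level, and parameter values are assumptions made purely for the example, not Kepler’s actual data.

```python
# A minimal sketch of the "model path": hypothesize a model (an ellipse in polar
# form r = p / (1 + e*cos(theta))), fit its parameters to observed data, then
# verify the fit on held-out observations. All data here is synthetic.
import numpy as np
from scipy.optimize import curve_fit

def ellipse_radius(theta, p, e):
    """Polar equation of an ellipse with the focus at the origin."""
    return p / (1.0 + e * np.cos(theta))

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)              # "observation angles"
r_true = ellipse_radius(theta, p=1.52, e=0.21)       # hypothetical true orbit
r_obs = r_true + rng.normal(0, 0.01, theta.size)     # noisy "measurements"

# Fit the hypothesized model's parameters on the first 150 observations...
(p_hat, e_hat), _ = curve_fit(ellipse_radius, theta[:150], r_obs[:150], p0=[1.0, 0.1])
# ...and verify the model on the held-out 50 observations.
resid = r_obs[150:] - ellipse_radius(theta[150:], p_hat, e_hat)
print(f"fitted p={p_hat:.3f}, e={e_hat:.3f}, held-out RMSE={np.sqrt((resid**2).mean()):.4f}")
```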
The feature path is dominated by industry. Google AdWords, for example, has hundreds of billions of features describing why an ad gets clicked, and training them requires solving efficient parallelization; most of these advances come from conferences such as KDD or WWW. Optimizing machine learning along the feature path means breaking the problem down into fine enough features to build a very accurate model.
Is a deep model better than a wide model? Here the no-free-lunch theorem applies: there is no one-size-fits-all model. Simply put, no optimization algorithm works best for every problem; for any algorithm we can always find a problem on which it performs no better than random. Going further, every machine learning model embodies a bias, a piece of assumed knowledge about the world. If the data is small, the bias needs to be strong. For example, scientists observing physical phenomena do not have a lot of data; in that case you need strong theories and guesses, with only a little data to fit them. But if the assumption is wrong, you reach the wrong conclusion: the geocentric model in astronomy, for example, turned out to be wrong. If there is a lot of data, on the other hand, we do not need a strong bias; we can give the model more degrees of freedom and let the data do the fitting. Taken together, there is no free lunch in industrial machine learning and no one-size-fits-all model, so the best practice is to make the right choice for your business.
Artificial intelligence is still far from being everywhere. Even once we solve the problems of model breadth and depth, there is still a lot to do: training a good model, choosing good parameters, and combining features are not easy tasks.
For example, before model training, data needs to be collected, organized, imported, cleaned, joined, and put through feature engineering. After the model goes online, the system’s stability, timeliness, and throughput must be guaranteed. At the same time, to provide online services, an online architecture has to be built that guarantees real-time data flow and online–offline consistency, along with mechanisms for keeping the model up to date. Only when all of this is done do we really have a production artificial intelligence system; otherwise it is just an AI toy on a laptop.
What I just described covers only one company’s problems and systems. Applied to different problems across different industries, many more issues appear. Enterprises therefore need an AI platform that integrates all of the above capabilities. Good tool platforms and algorithms in the open-source community help a lot, and these tools will mature, but they are not enough. Although artificial intelligence has been applied, or breakthroughs attempted, in many fields, it is far from widespread compared with distributed storage and computing systems like Hadoop.
Let’s start with Hadoop. The reason so many people use Hadoop is that, although it is a distributed system, programmers who use it do not need deep knowledge of distributed systems, and developers do not need to make special changes to their data or business in order to use it; nor do they need to redesign their online serving systems around the MapReduce framework. This is not the case with artificial intelligence. To use AI, every upstream and downstream component becomes tied to the model: a different model means not only a different training system, but also different real-time and offline data flows, different data-joining and framework choices, different feature extraction, a different online serving architecture, a different disaster-recovery architecture, and a different rollback architecture. As a result, the architects of the data flows and online systems in an AI system must understand machine learning in order to do their jobs.
So the people who can build AI applications right now are mainly machine learning scientists who do both research and application, and they need to be engineers who understand machine learning as well as the business and the system architecture. This creates a high bar for AI. It is just like 30 or 40 years ago, when the real programmers were not people like us today but a group of scientists: they fed programs in on paper tape, and they not only had to program, they also had to understand computer architecture. As a result, not everyone had access to the technology, and not every business could benefit from it. Now that it is possible to program even in Excel, programmers may know nothing about computer architecture, operating systems, compilers, or databases, and can devote all their attention to understanding and solving business problems, achieving more with less effort.
Therefore, if we want AI to have a greater impact in industry and really be deployed, we need a complete AI application platform that lets people use artificial intelligence at a lower cost. From this perspective, what hinders the popularization of AI is not that current algorithms are not good enough, but that their barrier to entry is too high. Building new platforms that lower that barrier matters more than squeezing out further algorithmic gains; the goal is good results at a low threshold.
How do you lower those barriers? Let me share some of the Fourth Paradigm’s results. First of all, feature engineering is a huge challenge for the industrial application of AI. The goal of feature engineering is to identify the key attributes that are relevant to the problem the model is meant to solve. Several open-source projects attempt to address feature engineering; the feature engineering algorithms included in the official Spark 2.2 documentation are listed below. So, for different businesses and different models, are these operators sufficient for low-threshold modeling?
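To make this concrete, here is a minimal PySpark sketch chaining a few of the operators that ship with Spark MLlib (QuantileDiscretizer, StringIndexer, OneHotEncoder, VectorAssembler); the column names and rows are invented for illustration only.

```python
# A minimal sketch of feature engineering with operators shipped in Spark MLlib
# (as documented for Spark 2.x). Column names and data are made up.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import QuantileDiscretizer, StringIndexer, OneHotEncoder, VectorAssembler

spark = SparkSession.builder.appName("feature-engineering-sketch").getOrCreate()
df = spark.createDataFrame(
    [(23, 3500.0, "beijing"), (31, 8200.0, "shanghai"), (45, 12000.0, "beijing")],
    ["age", "income", "city"])

pipeline = Pipeline(stages=[
    QuantileDiscretizer(numBuckets=3, inputCol="age", outputCol="age_bucket"),  # bin a continuous value
    StringIndexer(inputCol="city", outputCol="city_idx"),                       # category -> index
    OneHotEncoder(inputCol="city_idx", outputCol="city_vec"),                   # index -> one-hot vector
    VectorAssembler(inputCols=["age_bucket", "income", "city_vec"],
                    outputCol="features"),                                      # assemble the final vector
])
pipeline.fit(df).transform(df).select("features").show(truncate=False)
```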
To do feature engineering well, you need a deep understanding of the machine learning algorithm you are going to use; simply throwing in every feature does not work well with existing algorithms. Sometimes different algorithms achieve the same goal with completely different feature engineering. Take news recommendation as an example: to improve the click-through rate of recommended news, we need to build two kinds of features. One is first-order features, which describe content the user directly likes. The other is second-order features, which describe how the user’s interests extend; for example, people who like big data are likely to also be interested in machine learning.
In the diagram below, the person represents a user, and beneath it is the user profile obtained through statistics, that is, the user’s historical interest points (User_Topic). On the right are three news stories, each with a topic (News_Topic).
So, how do you add a first-order feature along the “simple model (linear model) + micro features” path mentioned earlier? As shown in the upper left corner, we simply take the Cartesian product of the user and the news topic (user-news_topic). This way we do not need any user profile statistics, because each click or non-click on a news story already trains the weight of the corresponding “user-news_topic” combination feature, i.e., that user’s preference for that topic. At serving time, all of this information is available when making recommendations. However, in order to pick up changes in users’ interests promptly, the model itself must be updated with high timeliness.
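A minimal sketch of this first-order cross, assuming hypothetical user_id and news_topic fields: each (user, topic) pair is hashed into a large sparse feature space, the usual way such Cartesian-product features are fed to a linear model.

```python
# Sketch: first-order "user x news_topic" cross features for a linear model,
# using the hashing trick. Field names and dimensionality are illustrative.
from sklearn.feature_extraction import FeatureHasher

def first_order_cross(user_id, news_topic):
    # One Cartesian-product feature per (user, topic) pair.
    return [f"user={user_id}&news_topic={news_topic}"]

hasher = FeatureHasher(n_features=2**22, input_type="string")
samples = [first_order_cross("u42", "machine_learning"),
           first_order_cross("u42", "sports"),
           first_order_cross("u7",  "machine_learning")]
X = hasher.transform(samples)   # sparse matrix, one row per impression
print(X.shape)                  # each row has a single non-zero cross feature
```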
Looking at the other path, how do you add a first-order feature along the “complex model (nonlinear model) + macro features” route? As shown in the lower left corner of the figure, since these are macro features, we need to collapse the different topics into a single feature. One way is a first-order-logic feature such as “does the topic of this news belong to the user’s historical interests?” In this setup, serving online requires not only real-time recommendation but also maintaining each user’s historical interest points in real time, while the model itself does not need to be updated as frequently. After all, to achieve timely recommendations, either the features are static and the model is extremely real-time, or the features are real-time and the model stays static.
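A minimal sketch of this macro first-order feature, assuming a hypothetical user_interests store that is maintained in real time: the feature is simply whether the news topic appears among the user’s historical interest points.

```python
# Sketch: the macro first-order feature "does this news topic belong to the
# user's historical interests?"  user_interests is a hypothetical store of
# User_Topic statistics that would be maintained in real time.
user_interests = {"u42": {"machine_learning", "big_data"},
                  "u7":  {"sports"}}

def topic_in_history(user_id, news_topic):
    # Single indicator feature for a nonlinear model over macro features.
    return 1.0 if news_topic in user_interests.get(user_id, set()) else 0.0

print(topic_in_history("u42", "machine_learning"))  # 1.0
print(topic_in_history("u7",  "machine_learning"))  # 0.0
```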
What if we want to learn second-order features? For the linear model (upper right corner), we again use the user’s historical interest points, crossing the user’s historical preferences with the topic of the article (user_topic-news_topic); the model can then learn which news topics are liked by people with which historical interests, achieving second-order transfer. For the nonlinear model (lower right corner), we turn what used to be first-order logic (which can be viewed as an identity matrix) into a second-order state-transition matrix: from historical statistics we learn how interest in one topic converts into interest in another, and use that to judge whether the user will like the topic of the current article even when it is not among the user’s existing interests.
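The sketch below illustrates both second-order variants under the same hypothetical data: for the linear path, each (user historical topic, news topic) pair becomes a cross feature; for the nonlinear path, a statistically estimated topic-to-topic transition matrix replaces the identity matrix and scores how strongly the user’s historical topics convert into interest in the news topic. All names and probabilities are invented for illustration.

```python
# Sketch of the two second-order variants. Topic names, transition
# probabilities, and the user-interest store are made up.
user_interests = {"u42": {"big_data", "machine_learning"}}

# (a) Linear model + micro features: cross every historical user topic with the
# news topic (user_topic-news_topic), to be hashed and fed to LR as above.
def second_order_crosses(user_id, news_topic):
    return [f"user_topic={t}&news_topic={news_topic}"
            for t in user_interests.get(user_id, set())]

# (b) Nonlinear model + macro features: replace the identity matrix with a
# statistically learned topic-to-topic transition matrix, aggregated to a score.
transition = {("big_data", "machine_learning"): 0.62,
              ("big_data", "sports"): 0.05,
              ("machine_learning", "machine_learning"): 1.00}

def transition_score(user_id, news_topic):
    topics = user_interests.get(user_id, set())
    return sum(transition.get((t, news_topic), 0.0) for t in topics)

print(second_order_crosses("u42", "sports"))
print(transition_score("u42", "machine_learning"))   # 0.62 + 1.00
```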
To summarize and compare: for the models in the second, third, and fourth quadrants described above, the way we do feature engineering differs greatly. For first-order features, a linear model with fine-grained features can use direct combinations without any statistics; a nonlinear model that relies on statistics can use inclusion relations; and a nonlinear model that does not need inclusion relations will perform feature combination by itself. For second-order features, all of these approaches use statistical features, but in different ways; a nonlinear model over macro features, for example, needs three related pieces of information and a great deal of statistics.
This example illustrates a point: doing feature engineering well requires a great deal of model-specific, customized optimization; today’s tools alone are not enough, and human experience and judgment are still needed. That is why developing algorithms for automatic feature engineering is so important. Automatic feature engineering is a hard problem, actively studied in both academia and industry. Here I will share three directions of automatic feature engineering: implicit feature combination (e.g., NN, FM), semi-explicit feature combination (e.g., GBDT), and explicit feature combination (explicit feature cross products).
The main characteristic of implicit feature combination is that it is very friendly to continuous-valued features. Its most successful applications are speech and images: for problems where the raw signal is pixels or sound waves, deep learning generates low-level filters and hierarchical feature combinations through the neural network, far surpassing manual feature engineering. However, deep neural networks are not a panacea. In deep learning, handling high-dimensional discrete features is very complex and lacks interpretability, and the heavily black-box nature of neural networks is itself a frequent criticism. As a result, feature combinations derived from deep learning are relatively hard to transfer to other algorithms and hard to turn into clear, actionable feedback.
To address NN’s difficulty with discrete features, we need Large Scale Embedding technology. The earliest application of embeddings in NNs was in NLP: researchers mapped each word into a low-dimensional space and formed fixed-length inputs at the bottom layer through concatenation, summation, pooling, convolution, and so on, then trained a standard deep neural network on top. Embedding technology has since been applied in more and more fields, recommendation being a typical scenario: when RBMs were proposed, embeddings were already being used to attack collaborative filtering. Recently, Google published a description of how it uses large-scale embedding to recommend billions of videos to billions of users on YouTube: it learns an embedding for each user and each video, aggregates the vectors of a user’s watch history, search history, and other videos into features, and then applies deep learning, with great success.
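A minimal numpy sketch of the embedding-plus-pooling idea described above: each discrete ID is mapped to a low-dimensional vector, the user’s watch history is sum-pooled into a fixed-length vector, and the result is concatenated with other features before entering the network. Vocabulary size, dimensions, and the choice of sum pooling are assumptions for illustration.

```python
# Sketch: Large Scale Embedding for discrete IDs, with sum pooling over a
# variable-length watch history. In a real system the embedding table is
# learned jointly with the network; here it is just randomly initialized.
import numpy as np

vocab_size, emb_dim = 1_000_000, 32                  # illustrative sizes
rng = np.random.default_rng(0)
embedding_table = rng.normal(0, 0.01, (vocab_size, emb_dim))

def pooled_history(video_ids):
    # Map each discrete video ID to its vector, then sum-pool to a fixed length.
    return embedding_table[np.asarray(video_ids)].sum(axis=0)

watch_history = [10532, 77, 99321]                   # hypothetical video IDs
other_features = np.array([0.3, 1.0])                # e.g. age bucket, gender flag
dnn_input = np.concatenate([pooled_history(watch_history), other_features])
print(dnn_input.shape)                               # (34,) -> fed into the deep network
```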
Large Scale Embedding is still an active research area, with results including the Discrete Factorization Machine, FNN, PNN, DeepFM, and others. The figure above shows the similarities and differences among these algorithms. Simply put, these models can both discover inference relationships between features and memorize finer-grained features. In this area, the Fourth Paradigm has proposed the Deep Sparse Network (DSN) algorithm, a very wide and deep model that also implements large-scale embedding, uses a neural network for automatic feature combination, and aims to solve the regularization and parallel-computation problems of extremely high-dimensional models (tens of trillions of VC dimensions).
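For reference, the factorization-machine family that FNN, PNN, and DeepFM build on scores second-order interactions through low-dimensional embeddings; the sketch below computes that interaction term with the standard identity that makes it linear in the number of active features. The feature vector and embedding matrix here are random placeholders, not a trained model.

```python
# Sketch: the second-order interaction term of a factorization machine,
# computed with the standard identity
#   sum_{i<j} <v_i, v_j> x_i x_j
#     = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
# x and V are random placeholders; in FM/DeepFM-style models V is learned.
import numpy as np

rng = np.random.default_rng(0)
n_features, k = 1000, 8
x = np.zeros(n_features)
x[[3, 42, 777]] = 1.0                    # a sparse sample with three active features
V = rng.normal(0, 0.1, (n_features, k))  # one k-dimensional embedding per feature

interaction = 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))
print(f"second-order interaction term: {interaction:.6f}")
```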
The second direction is semi-explicit combination, based primarily on tree models. Why “semi-explicit”? You might think it natural that a tree model can explain itself or perform feature combination, but that is not quite true: each path to a leaf node is not an explicit, direct combination of features, but a combination of those features restricted to particular value ranges. So in terms of results we do get feature combinations with some interpretability, but there is still no way to read off feature correlations or combination relations directly. As nonlinear models, tree models are easy to understand and effective, but they too struggle with fine-grained discrete features. Traditionally, training T trees of depth K on N training examples with M features takes O(M*N*T*K) time, and even with optimizations for sparse features it is hard to reduce the space and communication cost of the buckets used in feature splitting. In this area, the Fourth Paradigm has proposed a series of algorithms, including He-TreeNet and the GBM series, which use Embedding, Ensemble, Stacking, and General Boosting to make feature combination with tree models possible at large feature scales.
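A common way to see this “semi-explicit” combination in practice is to read each sample’s leaf index out of a trained GBDT and hand those indices to a linear model as combination features; the sketch below does this with scikit-learn on synthetic data. This is only the generic GBDT-to-features idea, not the Fourth Paradigm’s He-TreeNet or GBM-series algorithms.

```python
# Sketch: leaf indices of a GBDT used as "semi-explicit" combination features.
# Each leaf corresponds to a conjunction of value-range conditions on features.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=4, random_state=0).fit(X, y)

leaves = gbdt.apply(X)[:, :, 0]                        # (n_samples, n_trees): leaf index per tree
leaf_features = OneHotEncoder().fit_transform(leaves)  # one indicator per (tree, leaf)
lr = LogisticRegression(max_iter=1000).fit(leaf_features, y)
print("train accuracy on leaf-combination features:", lr.score(leaf_features, y))
```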
The third direction is explicit combination, where the algorithm’s output explicitly states which features should be combined (as Cartesian products) to form new base features. The overall approach follows the lines of search and search optimization, with regularization and greedy selection in places. Because the solution space of explicit feature combination is enormous, finding the optimal combination is very hard. Compare it with AlphaGo: each of the 361 points on the 19-by-19 board can be black, white, or empty, so the maximum state space is on the order of 3^361. For explicit feature combination, we have to select, from the base features, combination features up to some limited order, and then choose a subset of those candidates; the number of possible subsets is 2 raised to the number of candidate combinations. Even when the number of base features is scaled down, this is far larger than the game of Go, so from the standpoint of solution-space size, explicit feature combination is harder than playing Go.
Another question for explicit feature combination is how to combine continuous values. For example, if someone is 30 years old and earns 10,000, how do you combine those features: multiply them, add them, take the sum of squares? In an NN this is handled by linear combinations and nonlinear transformations, and GBDT handles it by splitting on feature values, but for explicit feature combination there is currently no good off-the-shelf method for combining continuous values directly.
Despite these difficulties, explicit feature combination has the advantage of interpretability: it gives very deep insight into which features are potentially related and should be combined. At the same time, the method composes well with everything else: all machine learning algorithms are built on features, and since the output of explicit feature combination is itself a feature set, it can strengthen any other machine learning algorithm and serve as a basis for training.
At present there are a few families of algorithms for explicit feature combination. Some are based on boosting, training weak classifiers on single features and looking to the boosting process for hints about which combinations matter; others rely on regularization-based weight truncation to form candidate combinations. These algorithms were generally not designed for feature combination; the combinations are mostly by-products of the training process, so it is hard for them to find genuinely high-order combination features.
A new generation of explicit feature combinations: FeatureGO
Here is FeatureGO, the Fourth Paradigm’s latest algorithm. It is based on MCTS: it models features and the states of feature combinations, and trains a payoff function over those combinations. During the search we applied many tuning techniques and used our internal LFC algorithm to handle the combination of continuous-valued features. In the end, we found feature combinations of order 10 and higher, and even the order-10 combinations brought a significant performance improvement, something humans could not do before: even in the best advertising or recommendation systems, hand-crafted feature combinations generally only reach order 5 or 6.
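FeatureGO itself (MCTS plus the LFC handling of continuous values) is not reproduced here, but the basic idea of searching a combination space and scoring candidates by validation gain can be sketched with a toy greedy search. Everything below — the synthetic data, the parity-style label, the greedy scoring loop — is an illustration of the search idea only, not the actual algorithm.

```python
# Toy sketch: greedy search over pairwise feature crosses, scored by the AUC
# gain of an LR model on a validation split. Illustrative only; not FeatureGO.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 20000
# Hypothetical categorical base features; the label depends on an f0-f1 interaction.
f0, f1, f2 = (rng.integers(0, 10, n) for _ in range(3))
y = ((f0 + f1) % 2 == 0).astype(int)
cols = {"f0": f0.astype(str), "f1": f1.astype(str), "f2": f2.astype(str)}

def validation_auc(columns):
    X = np.column_stack(list(columns.values()))
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
    enc = OneHotEncoder(handle_unknown="ignore").fit(Xtr)
    lr = LogisticRegression(max_iter=1000).fit(enc.transform(Xtr), ytr)
    return roc_auc_score(yte, lr.predict_proba(enc.transform(Xte))[:, 1])

base_auc = validation_auc(cols)
best = None
for a in ("f0", "f1", "f2"):
    for b in ("f0", "f1", "f2"):
        if a >= b:
            continue
        cand = dict(cols)
        cand[a + "_x_" + b] = np.char.add(np.char.add(cols[a], "_"), cols[b])
        gain = validation_auc(cand) - base_auc
        if best is None or gain > best[1]:
            best = ((a, b), gain)
print("base AUC:", round(base_auc, 4), "best cross:", best[0], "gain:", round(best[1], 4))
```

A real search would of course explore higher-order combinations and reuse computation across candidates, which is where the architectural work described later comes in.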
We tested the FeatureGO algorithm on four data sets: two public data sets (higgs, criteoDeepFM) and two private ones (l_data, m_data). The statistics of the data sets are as follows:
On these four data sets we trained an LR model on the feature set generated by FeatureGO, using AUC as the evaluation metric. The experimental results are shown below: with the feature engineering done by FeatureGO, the results improved significantly on all four data sets, in most cases by about 2 percentage points of AUC.
We also tested how the effect evolves over time as new combination features are added, as shown in the figure below. As time goes on and the number of feature combinations grows, the results keep improving.
In addition to LR, we also compared against some of the latest nonlinear algorithms on the Criteo advertising data; the results are shown below. This experiment also shows that even feature combination based on the latest NN or FM models cannot find all of the usable information, and explicit feature combination still performs very well in this comparison. Note also that the new combination features produced by FeatureGO further improve the results of all of these models.
Behind an algorithm like FeatureGO there is also a great deal of architectural work that makes search-based algorithms practical. For example, we proposed technologies such as CPS (Cross Parameter-server Sharing) and Dynamic Graph to share common data reading and processing, as well as feature requests, storage, and updates during computation, so that training 10 models at the same time costs far less than 10 times the time of training a single model.
In terms of computing power and architecture, we think this is a critical part of AI. Optimizing an app’s start-up time from 20 ms to 2 ms is not particularly meaningful, but in machine learning a 10x speedup means training on 10 times more data in the same time, or training 10 models at once, which means better results. In other words, architecture now contributes directly to model quality: optimization that used to be done only by scientists can now be done by a good architect. Google is a good example. In the past, nobody believed a model as simple as LR could work so well, but through extreme engineering and implementation optimization Google showed that, with hundreds of billions of features, even a simple LR model can match any nonlinear model. The Fourth Paradigm is likewise a company built on architecture, engineering optimization, and algorithms: we are not only building a general artificial intelligence platform, but also devoting a lot of energy to optimizing speed and architecture, hoping to reach a stronger and more complete level of artificial intelligence capability.
On the other hand, Google’s DeepMind team is working on reinforcement learning. “Neural Architecture Search with Reinforcement Learning” is a way to learn neural network structures automatically. Some have pointed out that, to obtain a network structure automatically, that work used 800 GPUs training simultaneously, which is very impractical. But we believe that in the future the trend will not be experts saving machines by reducing machine usage, but machines saving experts by reducing expert involvement: with Moore’s Law, the computing cost of machines falls exponentially, while the number of experts will not grow exponentially. AutoML, which needs fewer people, is the sure way to bring artificial intelligence to more fields.
Currently, the Fourth Paradigm’s algorithms and its product, the “Wevin Platform,” are under continuous investment and research in both AutoML and, more broadly, lowering the barrier to entry. These new research directions include automatic table data import, model interpretation, automatic optimization, and so on. What we want to do is make AI more automated and as ubiquitous as Windows. A trial version of the “Wevin Platform” is now officially open to the public; you are welcome to scan the QR code to register.