Algorithmic distribution has gradually become standard in almost all software, including information platforms, search engines, browsers, and social apps, but it has also begun to face doubts, challenges, and misunderstandings.

In January 2018, Dr. Cao Huanhuan, senior algorithm architect at Toutiao, publicly presented the principles behind Toutiao’s algorithm for the first time, hoping to encourage the whole industry to examine algorithms and offer advice on them, and to dispel misunderstandings by making the algorithm transparent.

Toutiao’s information recommendation algorithm has reportedly undergone four major revisions since the first version was developed and put into operation in September 2012. It currently serves hundreds of millions of users around the world.

The following is Cao Huanhuan’s talk on the principles of the Toutiao algorithm (published with authorization):

This talk gives an overview of the Toutiao recommendation system and then covers the principles behind content analysis, user tags, evaluation and analysis, and content security.

I. System overview

Described formally, a recommendation system fits a function of a user’s satisfaction with a piece of content, y = F(Xi, Xu, Xc), which takes input variables along three dimensions.

The first dimension is content. Toutiao has become a comprehensive content platform, with articles, videos, UGC short videos, Q&A, and micro headlines. Each content type has many features of its own, so we need to consider how to extract features from different content types in order to make recommendations. The second dimension is user features. These include interest tags, occupation, age, gender, and so on, as well as implicit user interests characterized by many models. The third dimension is environmental features. These are specific to recommendation in the mobile Internet era: users move around at any time, and their information preferences shift across scenarios such as work, commuting, and travel.

Combining the three dimensions, the model estimates whether a piece of content is appropriate to recommend to this user in this scenario.

There is also the question of how to introduce goals that cannot be directly measured.

In the recommendation model, click-through rate, reading time, likes, comments, and shares are all quantifiable targets. The model can fit them directly and produce estimates, and watching how they move online shows whether a change is an improvement. However, a large-scale recommendation system that serves a very large number of users cannot be evaluated by metrics alone, so it is also important to introduce factors other than data metrics.

One example is frequency control for advertising and special content. Question-and-answer cards are a special form of content: the goal of recommending them is not only for users to view them, but also to attract users to answer and contribute content to the community. How to mix such content with ordinary content, and how to control its frequency, both need to be considered.

In addition, content ecology and social responsibility call for measures such as suppressing vulgar content, suppressing clickbait and low-quality content, pinning, up-weighting, and force-inserting important news, and down-weighting content from low-tier accounts. The algorithm cannot accomplish these on its own, so further intervention on content is required.

Next, I will briefly introduce how the algorithm is implemented on the basis of the objectives above.

The formula y = F(Xi, Xu, Xc) mentioned above describes a classic supervised learning problem. It can be implemented in many ways, such as the traditional collaborative filtering model, the supervised Logistic Regression (LR) model, deep learning-based models, Factorization Machines (FM), and GBDT.
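As a minimal sketch of what fitting y = F(Xi, Xu, Xc) can look like in the logistic regression case, the Python below scores one (content, user, environment) triple. The feature names, the weights, and the `predict_satisfaction` helper are illustrative assumptions, not Toutiao's actual model.

```python
import math

def predict_satisfaction(weights, bias, x_item, x_user, x_context):
    """Logistic-regression estimate of y = F(Xi, Xu, Xc) over sparse features."""
    score = bias
    # Namespace each feature group so that content, user and environment
    # features with the same name do not collide.
    for prefix, group in (("i", x_item), ("u", x_user), ("c", x_context)):
        for name, value in group.items():
            score += weights.get(f"{prefix}:{name}", 0.0) * value
    # The sigmoid maps the linear score to a probability-like estimate.
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights; a real system would learn them from logged behavior.
weights = {"i:category=sports": 0.8, "u:likes_sports": 1.2, "c:commuting": -0.3}
estimate = predict_satisfaction(
    weights,
    bias=-1.0,
    x_item={"category=sports": 1.0},
    x_user={"likes_sports": 1.0},
    x_context={"commuting": 1.0},
)
```

Features without a learned weight contribute nothing, so an empty model returns 0.5 for every input.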

An excellent industrial recommendation system needs a very flexible algorithm experiment platform that supports many combinations of algorithms, including adjustments to the model structure, because it is difficult for one common model architecture to fit all recommendation scenarios. Combining LR with DNN is very popular now, and a few years ago Facebook combined LR with GBDT. Toutiao’s products share the same powerful recommendation system, but the model architecture is adjusted for each business scenario.

With the model covered, let’s look at typical recommendation features. Four main types of features play an important role in recommendation.

The first category is relevance features, which evaluate whether the attributes of the content match the user. Explicit matching includes keyword matching, category matching, source matching, topic matching, and so on. There is also implicit matching, as in the FM model, where it can be derived from the distance between the user vector and the content vector.

The second category is environmental features, including geographic location and time. They serve both as bias features and as a basis for constructing matching features.

The third category is popularity features, including global popularity, category popularity, topic popularity, and keyword popularity. Content popularity information is very effective in large recommendation systems, especially during user cold start.

The fourth category is collaborative features, which can partly solve the problem of recommendations becoming narrower and narrower. Collaborative features do not rely on the user’s own history. Instead, they analyze similarities between different users through their behavior, such as click similarity, interest category similarity, topic similarity, interest word similarity, and even vector similarity, so as to expand the model’s ability to explore.
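One simple way to picture a click-similarity feature is Jaccard similarity over the sets of items two users have clicked. This is an illustrative assumption: the talk does not say which similarity measure Toutiao uses, and `click_similarity` is a hypothetical helper.

```python
def click_similarity(clicks_a, clicks_b):
    """Jaccard similarity between the sets of items two users clicked."""
    a, b = set(clicks_a), set(clicks_b)
    if not a and not b:
        return 0.0  # no evidence at all, treat as dissimilar
    return len(a & b) / len(a | b)

# Two users sharing 2 of 4 distinct clicked articles.
similarity = click_similarity(["a1", "a2", "a3"], ["a2", "a3", "a4"])
```

Content clicked by a sufficiently similar user can then be offered to the other user, even when it falls outside that user's own history.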

Most of Toutiao’s recommendation products train their models in real time. Real-time training saves resources and gives fast feedback, which is very important for feed products: a user’s behavior needs to be captured by the model quickly and reflected in the recommendations at the next refresh. Currently, our online sample data is processed in real time on a Storm cluster and covers action types such as clicks, impressions, favorites, and shares. The model parameter server is a high-performance system developed in-house. Toutiao’s data has grown so fast that the stability and performance of comparable open-source systems could not meet our needs, whereas our own system has many targeted low-level optimizations and comprehensive operations tooling, which makes it a better fit for our existing business scenarios.

At present, Toutiao’s recommendation model is among the larger ones in the world, with tens of billions of raw features and billions of vector features. The overall training process is as follows. The online servers record real-time features and write them to a Kafka queue, and a Storm cluster consumes the Kafka data. The client sends back the label for each recommendation (the user’s action), which is joined with the features to construct training samples. Online training then updates the model parameters on the latest samples, and finally the online model is updated. The main delay in this process is the delay in user feedback, because a user may not read an article immediately after it is recommended. Apart from that, the whole system is almost real-time.
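The online-training step can be sketched as a model that updates its parameters one sample at a time as labeled samples arrive. The class below is a toy stochastic-gradient logistic regression under stated assumptions: the class name, the learning rate, and the feature names are hypothetical, and the Kafka, Storm, and parameter-server components are replaced by a plain loop.

```python
import math

class OnlineLogisticRegression:
    """Toy online learner: one SGD step per incoming (features, label) sample."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}
        self.bias = 0.0

    def predict(self, features):
        score = self.bias + sum(
            self.weights.get(name, 0.0) * value for name, value in features.items()
        )
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, features, label):
        # Gradient of the log loss with respect to the score is (prediction - label).
        error = self.predict(features) - label
        self.bias -= self.learning_rate * error
        for name, value in features.items():
            current = self.weights.get(name, 0.0)
            self.weights[name] = current - self.learning_rate * error * value

# Simulated stream: sports impressions get clicked (1), finance ones do not (0).
model = OnlineLogisticRegression()
for _ in range(50):
    model.update({"category=sports": 1.0}, 1)
    model.update({"category=finance": 1.0}, 0)
```

After the stream is consumed, the model scores sports content above 0.5 and finance content below it, without ever retraining from scratch.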

However, Toutiao’s content pool is very large, and short videos alone number in the tens of millions, so the recommendation system cannot run the model over all content. It is therefore necessary to design recall strategies that, for each recommendation request, select a candidate pool on the order of thousands of items from the mass of content. The most important requirement for a recall strategy is extreme performance: in general, it must not take longer than 50 milliseconds.

There are many kinds of recall strategies; we mainly use the idea of an inverted index. An inverted index is maintained offline. Its keys can be categories, topics, entities, sources, and so on, and each posting list is ordered by popularity, freshness, user actions, and so on. At serving time, recall quickly truncates the posting lists that match the user’s interest tags, efficiently selecting a small set of reliable content from a very large content pool.
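The inverted-index idea above can be sketched as follows. The item fields, the single precomputed `score` standing in for popularity and freshness, and the `per_key_limit` truncation length are all illustrative assumptions.

```python
from collections import defaultdict

def build_inverted_index(items):
    """Offline step: map each key (category, topic, ...) to items sorted by score."""
    index = defaultdict(list)
    for item in items:
        for key in item["keys"]:
            index[key].append((item["score"], item["id"]))
    for postings in index.values():
        postings.sort(reverse=True)  # best-scored items first
    return index

def recall(index, interest_tags, per_key_limit=2):
    """Online step: truncate each matching posting list and merge without duplicates."""
    seen, candidates = set(), []
    for tag in interest_tags:
        for _score, item_id in index.get(tag, [])[:per_key_limit]:
            if item_id not in seen:
                seen.add(item_id)
                candidates.append(item_id)
    return candidates

items = [
    {"id": "a", "keys": ["sports"], "score": 0.9},
    {"id": "b", "keys": ["sports", "football"], "score": 0.8},
    {"id": "c", "keys": ["sports"], "score": 0.1},
    {"id": "d", "keys": ["technology"], "score": 0.5},
]
index = build_inverted_index(items)
candidates = recall(index, ["sports", "technology"])
```

Because sorting happens offline, the online step only slices short list prefixes, which is what keeps recall within a tight latency budget.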

II. Content analysis

Content analysis includes text analysis, image analysis, and video analysis. Toutiao started as a news product, so today we will focus on text analysis. One very important role of text analysis in a recommendation system is user interest modeling: without content and text tags, user interest tags cannot be obtained. For example, only if we know that an article is tagged "Internet" and that the user has read articles with that tag can we know that the user has the "Internet" interest tag. The same goes for other keywords.

On the other hand, text tags can directly serve as recommendation features. For example, Meizu content can be recommended to users who follow Meizu, which is a match on user tags. If recommendations in the main feed are unsatisfactory for a period of time and become too narrow, users will find that after reading a specific channel (such as technology, sports, entertainment, or military) and then returning to the main feed, the recommendations improve. This is because the whole model is shared across channels, and a sub-channel has a smaller exploration space, which makes it easier to meet the user’s needs. It is difficult to improve recommendation accuracy through feedback from a single channel alone, so it is very important for the sub-channels to perform well, and that requires good content analysis.

Above is an actual text example from Toutiao. The article has text features such as category, keywords, topics, and entity words. Of course, this does not mean that a recommendation system cannot work without text features. The earliest recommendation systems, at Amazon and even as far back as Walmart, as well as Netflix’s video recommendation, had no text features and used collaborative filtering directly. However, for news products, most of what users consume is content from the same day. Without text features, cold-starting new content is very difficult, and collaborative features cannot solve the article cold-start problem.

The main text features extracted by the Toutiao recommendation system fall into the following categories. The first is semantic tag features, which explicitly label articles with semantic tags. These tags are defined manually, each tag has a clear meaning, and the tag system is predefined. In addition, there are implicit semantic features, mainly topic features and keyword features. A topic feature describes a probability distribution over words and has no explicit meaning, while keyword features are based on a unified feature description and do not come from a fixed set.

Text similarity features are also very important. At Toutiao, one of the biggest complaints from users used to be that they kept being recommended repetitive content. The difficulty is that everyone defines repetition differently. For example, someone who read an article about Real Madrid and Barcelona yesterday may consider another article about the two teams today a repeat. But a serious fan, especially a Barcelona fan, may want to read every report. To solve this problem, we need to identify the topic, writing style, subject, and other characteristics of similar articles and build online strategies on them.

There are also spatio-temporal features, which analyze the location and timeliness of content. For example, it may not make sense to push news about restrictions in Wuhan to users in Beijing. Finally, there are quality-related features, which determine whether content is vulgar, pornographic, advertorial, or shallow inspirational "chicken soup" writing.

The figure above shows the features and usage scenarios of Toutiao’s semantic tags. Tags at different levels have different requirements.

The goal of the category system is comprehensive coverage: we want every article and every video to have a category. The entity system needs to be precise, so that for the same name or text it can clearly distinguish which person or thing is meant, but it does not have to cover everything. The concept system is responsible for resolving semantics that are more precise yet still abstract. This was our original division. In practice, we found that categories and concepts are technically interoperable, so we later unified them under one technical architecture.

Implicit semantic features can already help recommendation a great deal, whereas semantic tags require continuous labeling: new terms and concepts keep appearing, and the labeling has to keep iterating. Getting them right takes far more effort and resources than implicit semantic features, so why do we still need semantic tags? One reason is product requirements: channels, for example, need clearly defined categories of content and a tag system that is easy to understand. The quality of semantic tags is also a litmus test of a company’s NLP capability.

The online classification in the Toutiao recommendation system uses a typical hierarchical text classification algorithm. Below the root, the first layer contains categories such as technology, sports, finance, and entertainment. Sports is then subdivided into football, basketball, table tennis, tennis, athletics, swimming, and so on; football is subdivided into international football and Chinese football; and Chinese football is subdivided into China League One, the Chinese Super League, the national team, and so on. Compared with a single flat classifier, hierarchical text classification handles data skew better. There are a few exceptions: where we want to improve recall, we add some cross-links between nodes, as shown in the figure. The framework is general, but each meta-classifier can be heterogeneous depending on the difficulty of its problem. For example, an SVM works well for some categories, some need to be combined with a CNN, and some need further processing with an RNN.
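The hierarchy can be sketched as a tree in which each internal node owns its own meta-classifier, and classification walks from the root to a leaf. In this sketch the meta-classifiers are simple keyword rules standing in for the SVM, CNN, or RNN models mentioned above; the `Node`, `classify`, and `keyword_classifier` names and the keyword lists are hypothetical.

```python
class Node:
    """One category in the hierarchy, with an optional classifier over its children."""

    def __init__(self, label, classifier=None, children=None):
        self.label = label
        self.classifier = classifier  # callable: text -> child label
        self.children = children or {}

def keyword_classifier(rules, default=None):
    """Build a trivial classifier; any model with the same interface could replace it."""
    def choose(text):
        lowered = text.lower()
        for label, keywords in rules.items():
            if any(keyword in lowered for keyword in keywords):
                return label
        return default
    return choose

def classify(root, text):
    """Walk down the tree, letting each node's classifier pick the next child."""
    node, path = root, [root.label]
    while node.classifier is not None:
        child_label = node.classifier(text)
        if child_label not in node.children:
            break  # no confident child: stop at the current level
        node = node.children[child_label]
        path.append(node.label)
    return path

sports = Node(
    "sports",
    keyword_classifier({"football": ["football"], "basketball": ["basketball"]}),
    {"football": Node("football"), "basketball": Node("basketball")},
)
root = Node(
    "root",
    keyword_classifier({"sports": ["football", "basketball"], "technology": ["phone"]}),
    {"sports": sports, "technology": Node("technology")},
)
path = classify(root, "Chinese Super League football match tonight")
```

Each node only has to separate its own children, which is one reason a hierarchy copes with skewed data better than a single flat classifier over all leaves.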

The figure above shows an example of the entity word recognition algorithm. Candidates are selected based on word segmentation results and part-of-speech tagging, and some of them may need to be spliced together according to the knowledge base, because some entities are combinations of several words and we have to determine which words combine to describe an entity. If a candidate maps to multiple entities, word vectors, topic distribution, and even word frequency are used for disambiguation. Finally, a relevance model is computed.

III. User tags

Content analysis and user tags are the two cornerstones of a recommendation system. Content analysis involves more machine learning, whereas user tagging poses greater engineering challenges.

Toutiao’s commonly used user tags include the categories and topics a user is interested in, keywords, sources, interest-based user clusters, and various vertical interest features (car models, sports teams, stocks, and so on), as well as gender, age, and location. Gender information comes from the third-party social media account the user logs in with. Age information is usually predicted by a model from signals such as device model and reading time distribution. Resident location comes from location information that the user has authorized access to, from which resident points are derived with traditional clustering methods. Combined with other information, resident points can be used to infer the user’s workplace, business trip destinations, and travel destinations. These user tags are very helpful for recommendation.

Of course, the simplest user tags are the tags of the content the user has browsed. But some data-processing strategies are involved, mainly the following. First, noise filtering: clicks with a very short dwell time are filtered out to remove clickbait. Second, hot-topic penalty: a user’s actions on very popular articles (such as the PG One news some time ago) are down-weighted, because in theory, the more widely content is distributed, the less confidence its signal carries. Third, time decay: users’ interests drift, so the strategy favors recent behavior. As user actions accumulate, the weights of old features decay over time and features contributed by new actions carry more weight. Fourth, impression penalty: if an article recommended to a user is not clicked, the weights of the related features (category, keywords, source) are penalized. Of course, the global context also has to be considered, such as whether related content has been pushed too often, along with dismiss and dislike signals.
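The four strategies above can be sketched together in a small profile class. The half-life, the minimum dwell time, the penalty size, and the linear popularity discount are hypothetical values chosen for illustration, not Toutiao's parameters.

```python
MIN_DWELL_SECONDS = 5.0    # assumed cut-off for the noise filter
IMPRESSION_PENALTY = 0.05  # assumed penalty for a shown-but-not-clicked item

class UserProfile:
    """Toy interest profile: tag -> weight, updated one action at a time."""

    def __init__(self, half_life_days=30.0):
        self.half_life_days = half_life_days
        self.weights = {}
        self.last_update_day = None

    def _decay(self, day):
        # Time decay: old weights shrink by half every half_life_days.
        if self.last_update_day is not None and day > self.last_update_day:
            elapsed = day - self.last_update_day
            factor = 0.5 ** (elapsed / self.half_life_days)
            for tag in self.weights:
                self.weights[tag] *= factor
        if self.last_update_day is None or day > self.last_update_day:
            self.last_update_day = day

    def observe(self, day, tags, clicked, dwell_seconds=0.0, popularity=0.0):
        self._decay(day)
        if clicked:
            if dwell_seconds < MIN_DWELL_SECONDS:
                return  # noise filter: very short stays are ignored
            delta = 1.0 - popularity  # hot-topic penalty: popular items count less
        else:
            delta = -IMPRESSION_PENALTY  # impression penalty
        for tag in tags:
            self.weights[tag] = max(0.0, self.weights.get(tag, 0.0) + delta)

profile = UserProfile()
profile.observe(0, ["football"], clicked=True, dwell_seconds=60)
profile.observe(0, ["gossip"], clicked=True, dwell_seconds=2)
profile.observe(0, ["hot_news"], clicked=True, dwell_seconds=60, popularity=0.8)
profile.observe(30, ["football"], clicked=False)
```

Because every update touches only one user's weights and needs no global pass over the logs, the same logic fits a streaming system as well as a batch job.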

User tag mining is relatively simple overall; the main difficulties are the engineering challenges just mentioned. The first version of Toutiao’s user tags was a batch computation framework with a fairly simple process: every day, it extracted the past two months of action data for the previous day’s active users and computed the results in batches on a Hadoop cluster.

The problem was that, as the number of users grew rapidly, the number of interest model types and other batch jobs also kept increasing, and the amount of computation became too large. In 2014, the Hadoop batch jobs that updated tags for millions of users began to struggle to finish within the same day. The strain on cluster computing resources easily affected other work, the pressure of concentrated writes to the distributed storage system increased, and the update latency of user interest tags grew higher and higher.

Faced with these challenges, Toutiao launched a streaming computation system for user tags on a Storm cluster at the end of 2014. With streaming, tags are updated whenever a user action arrives. The CPU cost is relatively small: it saves 80% of CPU time and greatly reduces computing resource overhead. Only a few dozen machines are needed to support daily interest model updates for tens of millions of users, and features are updated very quickly, essentially in near real time. The system has been in use ever since.

Of course, we also found that not all user tags need a streaming system. Information such as a user’s gender, age, and resident location does not need to be recomputed in real time and is still updated daily.

IV. Evaluation and analysis

The overall architecture of the recommendation system has been introduced above. So how do we evaluate the recommendation effect?

There is a saying that I think is very wise, “if you can’t evaluate something, you can’t optimize it.” The same is true for recommendation systems.

In fact, many factors can affect recommendation quality: changes to the candidate set, improvements or additions to recall modules, additional recommendation features, improvements to the model architecture, optimization of algorithm parameters, and so on. Evaluation matters because many optimizations can end up having a negative effect; it is not the case that results improve simply because an optimization has gone live.

Evaluating a recommendation system comprehensively requires a complete evaluation system, a powerful experiment platform, and easy-to-use experiment analysis tools. A complete evaluation system means not relying on a single metric: we cannot look only at click-through rate or dwell time, and a comprehensive assessment is needed. Over the past few years we have tried to combine as many metrics as possible into a single one, but we are still exploring that. For now, decisions are still made after in-depth discussion by a review committee of senior colleagues from each business line.

Many companies do poorly at algorithms not because their engineers lack ability, but because they lack a powerful experiment platform and convenient experiment analysis tools that can intelligently analyze the confidence of data metrics.

Building a good evaluation system requires following several principles. The first is to balance short-term and long-term metrics. When I was in charge of e-commerce at my previous company, I observed that many strategy changes felt novel to users in the short term but did not actually help in the long term.

Second, both user metrics and ecosystem metrics should be taken into account. As a content creation platform, Toutiao must provide value to content creators so that they can create with more dignity, and it also has an obligation to satisfy users. The interests of advertisers have to be considered as well. This is a process of multi-party negotiation and balance.

Third, pay attention to interaction effects between experiments. Strict traffic isolation is difficult to achieve in experiments, so external effects need attention.

The direct advantage of a powerful experiment platform is that, when many experiments are online at the same time, the platform allocates traffic automatically without manual coordination, and traffic is reclaimed as soon as an experiment ends, which improves management efficiency. This reduces the cost of analysis and speeds up algorithm iteration, so that algorithm optimization across the whole system can advance quickly.

This is the basic principle of Toutiao’s A/B test experiment system. First, users are divided into buckets offline. Then experiment traffic is allocated online: the users in the selected buckets are tagged and assigned to experiment groups. For example, to run an experiment on 10% of traffic with two groups of 5% each, one 5% group is the baseline, whose strategy is the same as the main production traffic, and the other uses the new strategy.
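The bucketing step can be sketched with a stable hash of the user ID. The bucket count, the salt, and the `assign_group` helper are illustrative assumptions; the talk does not describe how Toutiao computes its buckets.

```python
import hashlib

NUM_BUCKETS = 100  # assumed granularity: each bucket is 1% of users

def user_bucket(user_id, salt):
    """Deterministically map a user to one of NUM_BUCKETS buckets."""
    digest = hashlib.md5(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def assign_group(user_id, experiment):
    """Return the user's experiment group, or None if the user is not enrolled."""
    bucket = user_bucket(user_id, experiment["salt"])
    for group, (start, end) in experiment["groups"].items():
        if start <= bucket < end:
            return group
    return None

# A 10% experiment: 5% baseline (same as production) and 5% new strategy.
experiment = {
    "salt": "experiment-42",
    "groups": {"baseline": (0, 5), "new_strategy": (5, 10)},
}
group = assign_group("user-123", experiment)
```

Hashing keeps the assignment stable across requests without storing per-user state, and a different salt per experiment decorrelates the buckets of concurrent experiments.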

During the experiment, user actions are collected in near real time, and results can be viewed every hour. However, because hourly data fluctuates, results are usually examined at daily granularity. After the actions are collected, the logs are processed, statistics are computed in a distributed fashion, and the results are written to a database, which is very convenient.

With this system, engineers only need to set the traffic requirement and experiment duration, define any special filtering conditions, and customize the experiment group IDs. The system then automatically generates a comparison of experiment data, the confidence of the experiment data, a summary of the conclusions, and suggestions for optimizing the experiment.
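One common way to compute the confidence of a difference in click-through rate between two groups is a two-proportion z-test. The sketch below is an assumption about what such a report might contain; the talk does not say which statistical test the platform uses, and the counts are made up.

```python
import math

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click-through rate."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    if std_error == 0:
        return 0.0, 1.0  # both groups all-click or no-click: nothing to compare
    z = (rate_b - rate_a) / std_error
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: baseline CTR 10%, new strategy CTR 12%.
z, p_value = two_proportion_z_test(1000, 10000, 1200, 10000)
```

A small p-value suggests the difference is unlikely to be noise, which is why hourly data with few samples is less trustworthy than daily data.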

Of course, an experiment platform alone is not enough. An online experiment platform can only infer changes in user experience from changes in data metrics, but data metrics differ from user experience, and many aspects cannot be fully quantified. Many improvements still require manual analysis, and major improvements need to be confirmed by a second round of manual evaluation.

V. Content security

Finally, I will introduce some of Toutiao’s measures on content security. Toutiao is now the largest content creation and distribution platform in China, and it must pay more and more attention to its social responsibility and its responsibility as an industry leader. If even 1% of recommended content has a problem, the impact is significant.

For this reason, Toutiao has treated content security as one of the company’s highest priorities since its founding. From the very beginning there has been a dedicated review team responsible for content security. At that time, fewer than 40 people were developing all the clients, backends, and algorithms, which shows how much importance Toutiao attached to content review.

Today, Toutiao’s content comes mainly from two sources. One is PGC platforms with mature content production capabilities. The other is UGC, such as Q&A, user comments, and micro headlines. Both go through a unified review mechanism. PGC content, which is relatively small in volume, goes directly to risk review and is recommended widely if no problem is found. UGC content is first filtered by a risk model, and anything flagged goes through a second risk review. Once approved, the content is actually recommended. If it then receives more than a certain number of comments or negative reports, it returns to the review stage and is taken down directly if a problem is found. The whole mechanism is relatively sound. As an industry leader, Toutiao has always held itself to the highest standards on content security.
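The review flow described above can be sketched as two small functions. The risk threshold, the report threshold, and the callables standing in for the risk model and the human review step are hypothetical.

```python
def initial_review(content, risk_model, manual_review, risk_threshold=0.5):
    """Decide whether newly published content may be recommended."""
    if content["source"] == "PGC":
        approved = manual_review(content)  # PGC goes straight to risk review
    elif risk_model(content) >= risk_threshold:
        approved = manual_review(content)  # flagged UGC gets a second review
    else:
        approved = True  # UGC that passes the risk model filter
    return "recommended" if approved else "removed"

def handle_feedback(content, negative_reports, manual_review, report_threshold=10):
    """Send already-recommended content back to review if reports pile up."""
    if negative_reports <= report_threshold:
        return "recommended"
    return "recommended" if manual_review(content) else "removed"

# Stand-ins for the risk model and for human reviewers.
low_risk = lambda content: 0.1
high_risk = lambda content: 0.9
reviewer_approves = lambda content: True
reviewer_rejects = lambda content: False

status = initial_review({"source": "UGC"}, high_risk, reviewer_rejects)
```

Content keeps its "recommended" status only as long as negative feedback stays under the threshold or a reviewer clears it again.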

Now to content recognition technology, which mainly consists of a pornography model, an abuse model, and a vulgarity model. Toutiao’s vulgarity model is trained with deep learning on a very large sample base and analyzes images and text together. This model emphasizes recall and can even sacrifice some precision. The sample base of the abuse model also exceeds one million, with recall above 95% and precision above 80%. If users frequently use abusive language or post inappropriate comments, we have mechanisms to penalize them.
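Favoring recall over precision amounts to choosing a lower decision threshold on the model score. The sketch below computes both metrics and picks the highest threshold that still reaches a target recall; the scores and labels are made-up illustrative data.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when every score >= threshold is flagged."""
    pairs = list(zip(scores, labels))
    true_pos = sum(1 for s, y in pairs if s >= threshold and y == 1)
    false_pos = sum(1 for s, y in pairs if s >= threshold and y == 0)
    false_neg = sum(1 for s, y in pairs if s < threshold and y == 1)
    flagged = true_pos + false_pos
    positives = true_pos + false_neg
    precision = true_pos / flagged if flagged else 1.0
    recall = true_pos / positives if positives else 1.0
    return precision, recall

def threshold_for_recall(scores, labels, target_recall):
    """Highest threshold whose recall meets the target, or None for empty input."""
    for threshold in sorted(set(scores), reverse=True):
        _precision, recall = precision_recall(scores, labels, threshold)
        if recall >= target_recall:
            return threshold
    return None

scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]  # 1 = truly abusive, 0 = benign
threshold = threshold_for_recall(scores, labels, target_recall=0.95)
```

On this data, lowering the threshold from 0.5 to 0.4 raises recall from 2/3 to 1.0; in general, a lower threshold flags more benign content, which is why manual review follows the model.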

General low-quality content recognition covers a wide range of cases, such as fake news, smear articles, mismatches between title and content, clickbait headlines, and poor content quality. This kind of content is very hard for machines to understand, and recognizing it requires a lot of feedback information, including comparison against other samples. At present, the precision and recall of the low-quality model are not particularly high, so it has to be combined with manual review, with the threshold raised. The final recall has now reached 95%, and there is still a lot of work to do here. Li Hang of Toutiao’s Artificial Intelligence Lab is also working with the University of Michigan on a joint research project to build a rumor recognition platform.

That concludes this introduction to the principles of the Toutiao recommendation system. We hope to receive more suggestions in the future to help us do our work better.
