This article was originally published in Programmer magazine and may not be reproduced without permission.
Author: Chen Kaijiang (@XingWudao), CTO of an agricultural science and technology company, previously responsible for recommendation algorithms at a financial technology company, former senior engineer at Sina Weibo, and algorithm lead at Kaola FM. He has worked on Weibo anti-spam data mining, intelligent and personalized customer service platforms, and back-end recommendation algorithms, and built Kaola FM's personalized audio recommendation system from scratch.
A recommender system that moves
The recommender system we usually talk about is relatively "static" in form and typically sits at the periphery of a website's main content, such as the "people who viewed this also viewed" and "people who bought this also bought" modules on e-commerce sites. In most scenarios, this kind of recommender system cannot carry a product on its own.
According to Wikipedia, "a recommender system is a subclass of information filtering system that seeks to predict the preference or rating a user would give to an item". An interest Feed is also a recommender system: it predicts a user's preference for the activities generated by neighboring nodes in the social network, and decides the display order of those activities according to the predicted preference.
A Feed is the stream of updates we see. Once a user establishes a connection with some content sources (by following, liking, favoriting, and so on), the new activities generated by those sources keep flowing to the user through the connection. The activities from different content sources are aggregated and presented to the user; this aggregation is the Feed.
A Feed that broadcasts social network activity usually appears, by default, in the chronological order in which the activities were generated; this is called a timeline. Twitter and Instagram both started as timeline feeds, and in China, Weibo, QQ Zone, and WeChat Moments are still timelines.
Twitter experimented with algorithmic ranking as early as 2012 and later launched the "While You Were Away" feature. Facebook abandoned the timeline for its NewsFeed long ago.
There are two main reasons for this. On one hand, the popularity of smartphones and faster mobile networks made UGC easier and easier to produce; generating and browsing activities became more fragmented, and both volume and frequency rose sharply. Users who previously consumed timeline feeds without pressure began to experience information overload and to miss interesting content. Instagram said its users missed an average of 70% of the content in their feeds, and Facebook once said that each user sees only about 300 of the roughly 1,500 stories available per day. On the other hand, timeline feeds are not conducive to commercialization: commercial accounts publish ads as ordinary posts without restraint, which not only hurts user experience but also bypasses the platform entirely for their commercial activity, producing a "tragedy of the commons".
Therefore, the Feed will inevitably evolve from a timeline into an algorithmically reordered stream displayed according to how relevant the content is to the user's interests. On one hand this helps users cope with information overload; on the other, it balances commercial value and user experience on the platform.
Successful Interest Feed — NewsFeed
One successful interest Feed is Facebook's NewsFeed. In September 2006, Facebook launched NewsFeed together with MiniFeed. By its tenth anniversary, NewsFeed had become a cash cow generating tens of millions of dollars a day.
When Facebook first introduced the feature, there was a big debate about privacy: why can other people see my activity? As users kept questioning and protesting, Facebook responded by adding the first privacy controls, such as letting users hide their own activity, and NewsFeed survived.
In 2009, Facebook acquired FriendFeed and integrated its "Like" feature into NewsFeed, and began reordering the Feed by popularity, which again provoked a backlash from users accustomed to reading in chronological order.
Over the past ten years, NewsFeed has made countless improvements, even deploying multiple versions of the algorithm for online A/B testing every day. The EdgeRank algorithm, however, is a landmark on this road of improvement. We can divide NewsFeed's ranking strategy into the pre-EdgeRank era, the EdgeRank era, and the post-EdgeRank era.
In the pre-EdgeRank era, according to Chris Cox, Facebook's chief product officer, "originally, NewsFeed ranking was done by gut feeling: photos got more weight and system messages were downweighted." Algorithm engineers today may smile when they read this: Facebook, now a giant of artificial intelligence and deep learning, started from exactly that era.
Serkan Piantino, now an engineering director at Facebook AI Research, led the development of the first version of EdgeRank around 2010.
The EdgeRank algorithm
To understand how the famous EdgeRank works, first look at how a friend's new activity flows in front of you after it is created:
- First, a friend of yours creates something new: he posts a thought, likes a Page, or tags a photo.
- Then, through the connection between you and that friend, it travels to your door (your home page); when you open the door (log in or refresh), you may see it.
- In general, newer arrivals get more attention than older ones.
- When there are not many new arrivals, greeting them all at the door is fine and they can afford to wait; when there is too much content, you have to decide who goes first and who goes last.
These steps roughly describe the idea of EdgeRank, simple and direct. Based on this assumption, the EdgeRank sorting algorithm mainly considers three factors:
- Intimacy. This corresponds to the second step above: with so many items arriving at the door, we naturally give priority to the friends we "like" more. Quantifying intimacy takes into account how often you interact with this friend and how close your everyday connection is.
- Edge weight. This is where EdgeRank's name comes from: different types of activity carry different weights, and a like is obviously not equal to a photo.
- Freshness. Since it is a NewsFeed, newer items are favored.
The three scores are multiplied together to produce the score of each new item, which is used for ranking and filtering, as shown in Figure 1. The method is really simple: quantify the three main factors and multiply them subjectively, with no optimization objective behind it. According to Facebook, early EdgeRank did not involve machine learning, so it was not an intelligent algorithm at all.
Figure 1 The principle of the EdgeRank ranking algorithm
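To make the idea concrete, here is a minimal sketch of EdgeRank-style scoring. The field names, edge weights, and decay function are illustrative assumptions, not Facebook's actual values.

```python
import math
import time

# Illustrative edge-type weights; the real values were never published.
EDGE_WEIGHTS = {"photo": 1.0, "status": 0.6, "like": 0.2, "comment": 0.8}

def edgerank_score(affinity, edge_type, created_at, now=None, half_life_hours=24.0):
    """Score one activity for one viewer: affinity * edge weight * time decay."""
    now = now or time.time()
    age_hours = max(0.0, (now - created_at) / 3600.0)
    decay = math.exp(-math.log(2) * age_hours / half_life_hours)  # exponential half-life decay
    return affinity * EDGE_WEIGHTS.get(edge_type, 0.5) * decay

# Example: a close friend's photo posted three hours ago
score = edgerank_score(affinity=0.9, edge_type="photo", created_at=time.time() - 3 * 3600)
```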
Be smart with machine learning
After 2011, the EdgeRank algorithm was no longer mentioned inside Facebook, and NewsFeed entered the post-EdgeRank era. By then Facebook had more than a billion monthly active users, around 20 million public Pages, and mobile devices contributing most of the traffic; facing such complex contextual factors, it had to introduce machine learning to keep the whole thing under control.
The advantages of introducing machine learning are obvious. The factors that were previously considered by hand become features of the machine learning model. A linear model can handle large-scale sparse features and find the optimal parameter (i.e., weight) for each one. Under such a framework, product managers and engineers simply need to find the features that affect ranking, hand them to the machine learning model, and let it do the rest.
The NewsFeed team built on the original EdgeRank and defined intimacy at different levels in much more detail. Deep neural networks were introduced to understand image and text content, so the system can tell whether the object in a photo interests the user and can analyze newly emerging discussion topics for ranking. As the product iterated, more and more ranking factors were considered, such as reading time, video content, linked content, removing a connection, or hiding a source. In total, more than 100,000 ranking factors have been considered (the model's feature space is even larger). Adjusting weights by hand, as in the original approach, is clearly neither scientific nor efficient.
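With a feature space this large, weights must be learned rather than set by hand. Below is a minimal sketch, assuming scikit-learn, of how arbitrarily many ranking factors can be hashed into a fixed-size sparse vector and fed to a linear model; the feature names and data are made up for illustration.

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

# Each sample: raw ranking factors for one (viewer, story) pair; names are hypothetical.
samples = [
    {"edge_type=photo": 1, "author_affinity": 0.9, "age_hours": 3.0},
    {"edge_type=like": 1, "author_affinity": 0.1, "age_hours": 30.0},
]
labels = [1, 0]  # 1 = the viewer interacted with the story

# Hash any number of factor names into a fixed-size sparse feature vector.
hasher = FeatureHasher(n_features=2 ** 20)
X = hasher.transform(samples)

model = LogisticRegression()
model.fit(X, labels)                  # learns one weight per hashed feature
print(model.predict_proba(X)[:, 1])   # predicted interaction probability
```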
Breaking the limits of machine learning and data
Besides shifting fully to machine learning, the NewsFeed team has also been rethinking the relationship between people and algorithms: the focus should be "how to find what users really care about", not just "how to increase click-through rate". Facebook has always been data-driven, and that belief is what carried NewsFeed through the controversies, but inside the team there was also plenty of thinking that went beyond the data. For example:
- The team found that 85% of the "hide this story" operations came from 5% of users. After talking with these users, they discovered that this 5% treat "hide" the way they treat "mark as read" in email: they click "hide" on everything they have read, whether or not they dislike it.
- For sad news, users may care deeply but still not click "like".
- Some likes do not indicate real interest; some users simply like everything they see.
- When a user reads a long post and stops halfway, it does not necessarily mean he is not interested in it.
This made them pay attention to the limitations of machine learning and data. So, in addition to the algorithm team, Facebook built a global human evaluation team. Its job is not simply to label the algorithm's selections with "like" and "dislike", but to explain in depth why they like or dislike what the algorithm picked, and to discuss the detailed evaluation results with the engineers. Human evaluation of this kind effectively exposes the lies hidden in the data and keeps the product from obsessing over metric improvements alone.
In addition, NewsFeed has two important supporting facilities: a social relationship recommendation system and an advertising system.
NewsFeed exists because users establish many social connections and become overloaded with information, so a "People You May Know" recommendation system is essential for NewsFeed to exist "legitimately".
This is a recommendation system in the product form we are all familiar with. Its core is a set of large-scale matrix factorization algorithms that use the existing collaborative (interaction) matrix to recommend new items you may want to connect with, including users, apps, and public Pages.
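As a rough illustration of the idea (not Facebook's actual system), here is a minimal matrix factorization sketch with numpy: factor the user-item connection matrix and recommend, for each user, the unconnected item with the highest reconstructed score.

```python
import numpy as np

# Toy user x item connection matrix (1 = already connected); purely illustrative.
R = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
], dtype=float)

# Low-rank factorization via truncated SVD: R is approximated by U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
scores = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # reconstructed affinity scores

# For each user, suggest the unconnected item with the highest reconstructed score.
candidates = np.where(R == 0, scores, -np.inf)
print(candidates.argmax(axis=1))
```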
Facebook ads come in a variety of forms, including:
- Suggested Page (a public Page you might like)
- Page Post (a promoted post from a public Page)
- Suggested App (an app you might like)
- Video Ads
All the data showed that the reordered NewsFeed made users more willing to read, so despite the questions and arguments, NewsFeed's reordering never stopped. Clearly, an egalitarian "everyone gets the same" timeline does not fit the basic logic of a commercial society, and an interest Feed that improves efficiency is a necessity.
Uncomplicated interest Feed implementation
How are interest feeds implemented? Here’s a look at Pinterest’s Smart Feed, followed by some general technical points.
➤Smart Feed technology implementation
The logic of the main module of the whole Smart Feed back-end is shown in Figure 2, which consists of three parts:
- Background task (Worker)
- Content generator
- Front-end service
Figure 2 Smart Feed backend architecture
Background task (Worker)
Worker has two responsibilities:
- Receive new Pins generated by the data sources, determine which users each Pin should be pushed to, and compute how attractive the Pin is to each of those users, commonly known as "scoring".
- Store these scored Pins for later use.
Figure 3 Smart Feed background task module
Scored Pins are put into pools according to their source. The storage structure is a priority queue sorted by score; newly arriving Pins are ranked together with the existing (but not yet seen) Pins.
The pools can be built directly on top of a KV database such as HBase or Redis. Each record written to the database is a triple (user, Pin, score). Pinterest uses HBase, with two clusters: one stores Pins that have not yet been read, and the other stores Pins that have been read.
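Pinterest uses HBase, but the same pool is easy to sketch with Redis sorted sets (an assumption for illustration, not Pinterest's implementation): one sorted set per user and source, with the score as the sort key.

```python
import redis

r = redis.Redis()

def add_to_pool(user_id, source, pin_id, score):
    """Insert a scored pin into the user's unread pool for one content source."""
    r.zadd(f"pool:unread:{user_id}:{source}", {pin_id: score})

def take_top(user_id, source, n):
    """Pop the n highest-scored unread pins so the same pin is never served twice."""
    key = f"pool:unread:{user_id}:{source}"
    pins = r.zrevrange(key, 0, n - 1, withscores=True)
    if pins:
        r.zrem(key, *[p for p, _ in pins])
        r.zadd(f"pool:read:{user_id}:{source}", {p: s for p, s in pins})
    return pins
```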
When a data source generates a new Pin, a module called PinLater pushes the Pin to followers via Zen (a graph data storage module that wraps basic HBase operations). The push is asynchronous, with a delay ranging from a few seconds to a few minutes.
Content generator
As shown in Figure 4, the content generator does the following:
- Decide how many Pins to return: the number is not fixed, but depends on how frequently the user visits and how much new content they saw last time.
- Decide the proportion of each content source. Pinterest does not reveal the proportions; it may be a fixed ratio, or there may be some heuristic algorithm behind it.
- Put the Pins in order, sorted by score.
- Remove the content to be pushed from the pool, so that every request returns only unread content.
Figure 4. Smart Feed content generator module
Each batch of generated content is collectively called a "chunk".
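A minimal content generator might look like the sketch below. The source mix and the chunk-size rule are illustrative guesses, since Pinterest has not published them; the pre-scored pool is passed in as a plain dict to keep the example self-contained.

```python
# Hypothetical source mix: followed boards, related pins, search-based pins.
SOURCE_MIX = {"followed": 0.6, "related": 0.3, "search": 0.1}

def generate_chunk(unread_pool, visits_per_day, unseen_last_time):
    """Build one 'chunk' of scored, unread pins from per-source priority pools."""
    # Heavier users, or users who cleared their previous batch, get more new content.
    chunk_size = min(100, max(20, visits_per_day * 5 - unseen_last_time))
    chunk = []
    for source, share in SOURCE_MIX.items():
        n = int(chunk_size * share)
        pins = sorted(unread_pool.get(source, []), key=lambda p: p[1], reverse=True)[:n]
        chunk.extend(pins)
        # Remove served pins so every request returns only unread content.
        unread_pool[source] = [p for p in unread_pool.get(source, []) if p not in pins]
    chunk.sort(key=lambda p: p[1], reverse=True)  # final order by score
    return chunk

# Example: two sources with pre-scored pins.
pool = {"followed": [("pin1", 0.9), ("pin2", 0.4)], "related": [("pin3", 0.7)]}
print(generate_chunk(pool, visits_per_day=10, unseen_last_time=5))
```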
Front-end service (Feed Service)
As shown in Figure 5, the Feed Service is the front-end service. To remain highly available, it has two tasks:
- Receive the "new" content returned by the content generator.
- Merge the new content with the previous "old" content.
Figure 5 Smart Feed front-end service module
Pinterest's ranking algorithm
The ranking algorithm is called Pinnability, which we can loosely translate as "pin-ability": a set of machine learning models that estimate how likely a user is to interact with a Pin.
Figure 6 The task of the Pinnability ranking algorithm
The machine learning algorithms used in Pinnability are all common models, including logistic regression (LR), support vector machines (SVM), gradient boosted decision trees (GBDT), and convolutional neural networks (CNN). The overall Pinnability workflow is shown in Figure 7 and has three stages: preparing training data, training the model, and serving it online.
Figure 7 The Pinnability model workflow
Technical essentials of interest feeds
After analyzing Pinterest’s interest Feed implementation, let’s summarize what needs to be considered for a general interest Feed.
➤The overall logic
As a whole, the logical structure of an interest Feed is roughly shown in Figure 8.
Figure 8. Overall logic of interest feeds
➤The data model
A Feed is also called an Activity Stream. As the name implies, it is a Stream of data formed by the user’s activities.
A Feed involves three basic kinds of data: users (User), activities (Activity), and connections (Connection).
The elements that express a user activity have a specification called Atom, which you can refer to when defining your own Feed data model according to product requirements. According to Atom, an activity contains the following elements:
- Time: when the activity occurred.
- Actor: who initiated it. The Actor is usually a user ID, but it can be extended to other anthropomorphized objects, such as a "shop" that is followed or a "movie" that is favorited.
- Verb: the action at the core of the activity, such as "follow" or "like".
- Object: the single main object the action applies to.
- Target: the final target of the action. It corresponds to what follows the preposition "to" in English, as in "John saved a movie to his wishlist", where the movie is the Object and the wishlist is the Target.
- Title: the title of the activity, a natural-language description.
- Summary: usually a small piece of HTML describing the activity, possibly containing visual elements such as thumbnails. It can be thought of as the view of the Activity and is optional.
For example: at 23:51:01 on May 6, 2016 (Time), @XingWudao (Actor) shared (Verb) a microblog post (Object) to @resysChina (Target). The Title is the preceding sentence with the parentheses removed, and the Summary can be left empty.
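To make the model concrete, here is a minimal sketch of this activity as a data structure. The field names follow the Atom elements above; the class itself is only an illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Activity:
    time: str                     # when it happened
    actor: str                    # who did it (user ID or other anthropomorphized object)
    verb: str                     # the core action: follow, like, share, ...
    object: str                   # the main object the action applies to
    target: Optional[str] = None  # what the action is directed "to"
    title: str = ""               # natural-language description
    summary: str = ""             # optional HTML view of the activity

activity = Activity(
    time="2016-05-06T23:51:01",
    actor="@XingWudao",
    verb="share",
    object="weibo:123456",
    target="@resysChina",
    title="@XingWudao shared a microblog to @resysChina",
)
```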
Relationships are connections. Strong and weak connections are everywhere in Internet products. Social relationships such as friendships and follows are strong connections; likes, favorites, comments, and page views can also be regarded as connections established between a user and another object. Once connections exist, feeds can be delivered and published.
The elements that define a connection are:
- From: the initiator of the connection.
- To: the connected party.
- Type/Name: the type or name of the connection: follow, friend, like, browse, comment, and so on.
- Affinity: the strength of the connection.
If you describe the act of creating a connection with the Atom model, "from" corresponds to the Actor and "to" corresponds to the Object.
Connections point from "from" to "to", and activities flow in the opposite direction, from "to" back to "from". Connections and activities reinforce each other like chicken and egg: activities create new connections, and new connections feed you more activities.
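A matching sketch of the connection record, again only an illustration of the fields listed above:

```python
from dataclasses import dataclass

@dataclass
class Connection:
    from_user: str   # initiator of the connection ("from" is a reserved word in Python)
    to_object: str   # the connected party, i.e. the future content source
    type: str        # follow, friend, like, browse, comment, ...
    affinity: float  # connection strength, used later for ranking and push decisions

follow = Connection(from_user="user:42", to_object="@XingWudao",
                    type="follow", affinity=0.8)
```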
➤Publishing new activities
How is the Feed generated after a user logs in or refreshes? The process by which content reaches the audience's Feed is called fan-out.
Our intuition goes something like this:
- Get all the endpoints of the user's connections (such as friends or followees).
- Get the new content (activities) generated by those connection endpoints (the followed objects).
- Sort and output.
Figure 9. Pull pattern produces content
This is known in the industry as pull mode (fan-out-on-load): the Feed is generated in real time when the user logs in or refreshes.
The benefits of pull mode are as follows:
- Simple and direct to implement: roughly one line of SQL.
- Real time: content becomes visible to the audience as soon as they refresh.
But there are also disadvantages:
- As the number of connections grows, the complexity of this operation grows dramatically and quickly becomes unacceptable.
- Everyone's content has to be kept readily available (in memory) to serve the query.
- It is hard to make the service highly available.
Corresponding to the pull pattern, there is a push pattern (fan-out-on-write).
Figure 10. Push pattern produces content
When a user (Actor) generates an activity, it is immediately pushed to the corresponding users (those connected to this Actor), regardless of whether the audience refreshes. The system opens a separate Feed storage area for each user to receive the pushed content; when a user logs in, the system only needs to read his or her own Feed.
The benefits of push mode are obvious: there is almost no complex query operation when users access their feeds, so service availability is high.
Push mode also has the following shortcomings:
- A large amount of writing: one write per fan.
- A large amount of redundant storage: each piece of content is stored N times (N = audience size).
- Not real time: after a piece of content is generated, there is some delay before it reaches the audience's Feed.
Since each mode has its advantages and disadvantages, what about combining the two? A simple combination is a global strategy:
- For highly active users, use push mode, so each refresh is fast and there is more content to page through.
- For less active users, use pull mode and fetch the latest content when they log in.
- For popular content producers, cache their latest N items so that the various scenarios above can pull from the cache.
Another combination is user-specific, and this is Etsy’s design:
- If the affinity between an audience user and a content producer is high, the content is pushed first, because the audience is more likely to be interested in it.
- If the affinity between them is low, the push is delayed or skipped.
- Pushing is not strictly ordered by affinity, but by a probability derived from it.
For small and medium-sized social networks, pure push mode is enough; combined schemes can be considered once the business reaches a certain scale.
A push-mode Feed publishing implementation is simple (a minimal sketch follows the list):
- A database that centrally stores all activity content, typically MySQL.
- A key-value database, such as Redis or HBase, that stores each user's sorted Feed.
- A distributed asynchronous task queue; PinLater or Celery are good choices.
➤Ranking by interest
There are two pitfalls to avoid when sorting interest feeds:
- Having no ranking objective. Before designing a ranking algorithm, it is important to know: why reorder the timeline? What do you hope to achieve? How is the objective quantified? Only when the objective is clear can the algorithm be tested and optimized.
- Quantifying ranking factors by hand. We often see product or operations colleagues asking to raise or lower the weight of some factor. This is unwise: human knowledge is good for inspiration, not for quantification. People can suggest many potentially useful ranking factors to the algorithm and shorten the path to better results, but having people directly specify parameter weights contributes essentially nothing to those results.
Let us design a simple interest Feed aimed at improving interaction rate, following a machine learning approach. First, define what counts as an interaction, such as likes, reposts, comments, and viewing details. Second, distinguish positive from negative interactions: hiding a piece of content or clicking "not interested" are negative interactions.
This is a typical binary-classification supervised learning problem, with all positive interactions treated as one class. After an activity is generated and before it is shown to a user, machine learning predicts the probability that the user will interact positively with it, and that predicted probability can be output as the interest ranking score.
Any binary classifier that produces a probability output can be used here, including naive Bayes, maximum entropy, and logistic regression. Logistic regression is the most common choice on the Internet, and it has many benefits:
- It is a linear model, simple enough.
- It produces outputs between 0 and 1 that can be compared with one another.
- Open-source implementations exist, so the initial technical cost is low.
- Industry has validated it over and over again.
When using machine learning to rank an interest Feed, the most important step is to represent each <activity, audience> pair as a feature vector; the feature vector is the vectorized expression of the ranking factors. Once the algorithm is chosen, most of the human effort goes into finding the factors that affect ranking, the legendary "feature engineering". Feature engineering also includes selecting among the existing features, because after the model is trained it is exposed as an RPC service and invoked whenever new content is published into the Feed system.
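A minimal sketch of this pipeline, with made-up feature names and data: build a feature vector for each <activity, audience> pair, train logistic regression on past positive and negative interactions, and use the predicted probability as the ranking score.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def make_features(activity, viewer):
    """Turn one <activity, audience> pair into a feature dict; names are illustrative."""
    return {
        "affinity": viewer["affinity"].get(activity["actor"], 0.0),
        "verb=" + activity["verb"]: 1.0,
        "age_hours": activity["age_hours"],
        "has_image": float(activity["has_image"]),
    }

# Historical samples: label 1 = positive interaction (like/comment/...), 0 = none or negative.
pairs = [
    ({"actor": "a", "verb": "share", "age_hours": 2, "has_image": True},  {"affinity": {"a": 0.9}}, 1),
    ({"actor": "b", "verb": "like",  "age_hours": 30, "has_image": False}, {"affinity": {"b": 0.1}}, 0),
]
vec = DictVectorizer()
X = vec.fit_transform([make_features(a, v) for a, v, _ in pairs])
y = [label for _, _, label in pairs]

model = LogisticRegression().fit(X, y)

# At serving time: score a new activity for a viewer and use the probability to rank it.
new_pair = make_features({"actor": "a", "verb": "share", "age_hours": 1, "has_image": True},
                         {"affinity": {"a": 0.9}})
print(model.predict_proba(vec.transform([new_pair]))[0, 1])
```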
For the RPC framework, Apache Thrift can be used. Vowpal Wabbit is a distributed machine learning framework that combines easily with Hadoop.
➤Data and effectiveness tracking
We not only need to find the optimal model parameters from historical data, but also need to verify the ranking effect with new data, so attention must be paid to how data is stored and used.
The data associated with interest feeds are:
- Interaction data related to the objective (recording every action users take on the Feed).
- The content exposed to the user (each exposure should have a unique ID, and the exposed content only needs to record that ID).
- The mapping between interactions and exposures (each interaction record corresponds to one exposure record).
- User profiles (provided by offline mining and data synchronization).
- Feed content analysis data (providing content features).
Kafka and Hadoop are generally used for log collection and storage; Hive is used to process the data, generate training samples, and monitor product metrics. The most important job among these is updating the model parameters, that is, training the model.
For a first version of an interest Feed, it is not necessary to update the ranking parameters online in real time, so the data pipeline can borrow from Pinterest: for example, use logistic regression to predict interactions and rank the Feed, and in the offline phase focus on whether the model's AUC improves.
In addition, the amount of interaction data is much smaller than the amount of exposure data, so negative samples (items that were shown but not interacted with) must be downsampled when generating training data. The sampling ratio is itself a parameter that can be optimized: fix the algorithm and features, then choose the ratio that gives the best results.
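A minimal sketch of that downsampling step, assuming the exposure and interaction logs have already been joined into labelled rows; the ratio is the tunable parameter discussed above.

```python
import random

def sample_training_data(labelled_rows, neg_per_pos=4, seed=42):
    """Keep all positives and a random subset of negatives (neg_per_pos negatives per positive)."""
    rng = random.Random(seed)
    positives = [row for row in labelled_rows if row["label"] == 1]
    negatives = [row for row in labelled_rows if row["label"] == 0]
    n_neg = min(len(negatives), neg_per_pos * len(positives))
    return positives + rng.sample(negatives, n_neg)

# Example: 2 positives and 100 negatives -> at most 8 negatives are kept.
rows = [{"label": 1}] * 2 + [{"label": 0}] * 100
train = sample_training_data(rows)
```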
During A/B testing, watch whether the specific product objectives improve, such as interaction rate, and also watch auxiliary metrics depending on the concrete form of the product.
Challenges and solutions of interest feeds
The interest Feed is an inevitable trend as the Internet matures, and many Feed products have verified this with data. But to be clear: interest feeds are simple in concept, yet not without challenges.
➤User habits
A chronological Feed is natural and easy for users to accept. Once an algorithm determines the order of the Feed, whether users can accept it becomes a real product-design challenge. In particular, if a product starts with a timeline Feed and later converts to an interest Feed, changing the habit is harder than starting with an interest Feed from day one.
To face this challenge, we need to consider the following:
- Do you really need an interest Feed? Without information overload, you do not. Whether there is overload is easy to verify with data: how much content are users missing?
- The product design of an interest Feed requires deep thought. Even though an algorithm filters out content users are not interested in, the UI/UX should dilute the traces of the technology and present a more natural browsing experience. For example, should the selected content still be displayed in chronological order?
- The algorithm behind an interest Feed must improve quickly. It is normal for an interest Feed to perform poorly at first, but if it becomes genuinely usable before users lose patience, the risk is much lower.
➤Technical challenges
In a Feed product that needs algorithmic ranking, the data volume is never small, and if the interest Feed really works, data growth accelerates, so the technical challenges arrive quickly.
- High availability of the Feed service. Ensure that critical modules degrade gracefully when they fail and that all data is redundant and readily replaceable.
- Large-scale machine learning. The high-dimensional sparse feature space and large sample volumes require a machine learning platform that can handle large-scale learning problems; it must be parallelized and convenient for algorithm engineers to iterate on quickly.
- An online experimentation system. Traffic must be split orthogonally so that as many experiments as possible can run at the same time without interfering with one another, while keeping the conclusions scientifically valid. Google's published work on its online experimentation system is a good reference, and large Chinese Internet companies such as Baidu have also publicly shared how their experimentation systems split traffic.
➤The boundaries of algorithms
We have to admit that algorithms have boundaries, but many products are still far from those boundaries and have not fully mined the value in their data. Facebook's human evaluation team shows how seriously it takes using human insight to complement its algorithms.
Because most people are irrational most of the time, their standards of interest are inconsistent, and group psychology interferes as well, finding the content an individual is truly interested in is a very complicated problem.
In addition, introducing an algorithm increases the complexity of the whole product: the user's interest in the content can only be measured under the intervention of the algorithm itself, somewhat like the "uncertainty principle" in quantum theory.
Facing a Feed curated by algorithms, we should neither make too many subjective decisions, since a scientific algorithm will certainly do better than purely hand-specified rules, nor become lazy: we still need to draw inspiration from the data, survey the whole picture from our human vantage point, and help the algorithm perform better.