This article was originally published by AI Frontier.
YouTube and Toutiao feel wronged: should recommendation systems take the blame for porn and violence?


By Vincent, Debra Chen


Editor | Emily

The “Elsagate” incident on YouTube exposed the problem of “children’s cult videos” lurking on video sites. Major video sites at home and abroad have since cleaned up similar videos on their own platforms. Beyond their anger, parents asked the sites: why were these videos recommended to my child?

The recommendation system, one of the most widely deployed applications of artificial intelligence, has become a target of public criticism in this period. Not only video sites but also many news sites and apps, such as Toutiao, have been found recommending vulgar content to users. Facebook, known for its technology, has likewise been exposed for recommending fake news.

The question then arises: is the recommendation system merely the scapegoat for this huge “black hole,” or are these problems really all the fault of the technology?




Event review

The YouTube Elsagate incident


Let’s look back at the “Elsagate” scandal of a few months ago.

In December 2017, Elsagate caused a public uproar on YouTube and drew collective condemnation. In fact, Elsagate was not a recent development: as early as 2016, certain organizations and companies were recording and uploading to YouTube videos that used cartoon characters such as Elsa, Spider-Man and Mickey Mouse in content utterly unsuitable for children — murder, kidnapping, pregnancy, injections, beatings, blood and violence. The content of the videos that triggered the incident is shocking, especially for parents of young children.

From the perspective of the recommendation mechanism, a typical recommendation system usually consists of three stages — mining, recall and ranking — forming a funnel model. Ideally, the information that survives these three layers of filtering should be relatively clean and high-quality. In practice, however, the system is rarely perfect: problems can arise in any of the three filter layers, and interest groups can maliciously exploit them for profit.
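As a rough illustration of the funnel — not any platform’s actual code, and every field name here is hypothetical — the three stages can be sketched as successively stricter filters:

```python
# A minimal sketch of the mining -> recall -> ranking funnel.
# All functions and item fields are hypothetical illustrations.

def mine(raw_items):
    """Mining: analyze raw content, keep items that pass basic quality checks."""
    return [it for it in raw_items if it["quality_score"] > 0.5]

def recall(candidate_pool, user, k=500):
    """Recall: cheaply narrow a huge pool down to a few hundred candidates."""
    relevant = [it for it in candidate_pool if it["topic"] in user["interests"]]
    return relevant[:k]

def rank(candidates, n=20):
    """Ranking: score each candidate precisely and keep the top n."""
    return sorted(candidates, key=lambda it: it["ctr_estimate"], reverse=True)[:n]

def recommend(raw_items, user):
    return rank(recall(mine(raw_items), user))
```

A bad item that slips past any one of these filters can ride the rest of the pipeline unchallenged, which is exactly the weakness exploited below.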

According to the contracted engineers who train YouTube’s rating algorithms, the content rating system of the world’s largest online video platform can only be described as “confusing” and “imperfect.” While Google Brain reportedly helped multiply YouTube viewing twentyfold, Elsagate grew by exploiting a flaw in YouTube’s recommendation algorithm. Put simply, YouTube’s recommender consists of two neural networks. The first generates the candidate set: it takes the user’s viewing history as input and uses collaborative filtering to select a few hundred videos. The second network ranks those hundreds of videos: the system scores each video with logistic regression, and the ranking is then continuously improved through A/B tests.
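Based only on the public description above, here is a heavily simplified sketch of the two stages. The embedding-average retrieval and the scikit-learn logistic-regression ranker are stand-ins consistent with that description, not YouTube’s production code, and all data below is fake:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stage 1: candidate generation. Summarize the user's watch history as a
# vector and retrieve the few hundred nearest videos by dot-product score.
video_embeddings = rng.normal(size=(100_000, 32))   # one vector per video

def generate_candidates(watched_ids, k=300):
    user_vec = video_embeddings[watched_ids].mean(axis=0)  # history average
    scores = video_embeddings @ user_vec
    return np.argsort(scores)[-k:]                   # top-k video ids

# Stage 2: ranking. A logistic-regression scorer over richer features,
# trained on click labels and, in practice, tuned with A/B tests.
train_features = rng.normal(size=(5_000, 10))        # fake training features
train_clicks = rng.integers(0, 2, size=5_000)        # fake click labels
ranker = LogisticRegression().fit(train_features, train_clicks)

def rank(candidate_ids, candidate_features, n=20):
    probs = ranker.predict_proba(candidate_features)[:, 1]
    return candidate_ids[np.argsort(probs)[::-1][:n]]

candidates = generate_candidates(watched_ids=[3, 17, 42])
top = rank(candidates, rng.normal(size=(len(candidates), 10)))
```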

Exploiting these features of the YouTube algorithm, the profiteers behind Elsagate stuffed video titles with the names of popular cartoon characters and tags like “funny” and “children,” so the algorithm automatically classified them as children’s videos and pushed them to the top of the recommendations. Once a child clicked on one, the site would recommend videos of the same genre, one after another. Moreover, because of the limited content-understanding capacity of the mining stage, the Matthew effect in the recall stage and the conversion-rate-only mindset in the ranking stage, large numbers of Elsagate videos rode the algorithm’s fast lane and easily surfaced among YouTube’s popular videos, claiming unsupervised children as victims.


Toutiao has been “invited to tea” (summoned by regulators) several times

In 2017, Toutiao came under fire for vulgar content and invasion of user privacy.

Regarding the “microphone incident” that caused public panic, Toutiao responded that, from a technical standpoint, today’s audio processing is still far from being able to extract personal privacy through the microphone, and that “Toutiao’s accumulation of user information relies entirely on behavioral data such as users’ clicks within the app.”

Despite Toutiao’s denial, users remain spooked and skeptical of the explanation. On Baidu Tieba, Zhihu, Weibo and other social platforms, many netizens report that what they say aloud matches Toutiao’s recommendations, and that even things they mention on Taobao or in WeChat groups produce matching recommendations the next day.

In fact, this was not the first time Toutiao had received such an “invitation” from the government. On December 29, 2017, several Toutiao channels were shut down for 24 hours for continuously spreading pornographic and vulgar information and illegally providing online news services. Toutiao was suspected of using algorithms and other technical means to push users pornographic and vulgar content that easily attracts traffic, in order to earn advertising revenue.

Before this year’s Spring Festival Gala, two Toutiao products — Volcano Video and Douyin — also had their Gala title sponsorships abruptly pulled by several TV channels.

As early as last June, the Beijing Cyberspace Administration ordered Toutiao to close more than a dozen accounts, telling it and other news portals to curb reporting on celebrity scandals and to “actively spread core socialist values and create a healthy and upward mainstream public-opinion environment.”

In September 2017, The People’s Daily published a series of commentary pieces that harshly criticized AI-based news apps like Toutiao for spreading misinformation and superficial content.

In response, Toutiao’s parent company, Beijing ByteDance Technology Co., Ltd., closed or suspended more than 1,100 blogger accounts, saying they had posted “vulgar content” on the app. It also replaced its “Society” section with a new section called “New Era,” which features extensive state-media coverage of government decisions.

Not coincidentally, the hot-search section of Weibo, China’s largest we-media platform, was also taken offline for rectification, and a “New Era” tab was added to its trending section.


Facebook’s ad model sustains fake news

Separately, Fox News reported that Dipayan Ghosh, a former privacy and public-policy advisor at the social network, said the kind of disinformation that interfered with the US election and Brexit is closely tied to Facebook’s nature as an advertising platform. “Political disinformation succeeds because it follows the basic business logic that someone will profit from the product, and because broader digital-advertising market strategies have been perfected,” Ghosh and his co-author Ben Scott wrote in a report published by the New America Foundation.

Ghosh left Facebook in 2017, shortly after the US election, amid the fallout from the misinformation scandal. In the new report, he and Scott argue that as long as a social network’s core business model is driven by advertising, algorithms and user attention, attempts to tweak the platform are doomed to fail.

Facebook’s user base is enormous, so the fake-news problem can touch almost every social-media user and harm people around the world.

Facebook relies on a family of algorithms of its own, historically known as EdgeRank. Its News Feed algorithm, like Google’s search algorithm and Netflix’s recommendation algorithm, is a distributed, complex system composed of many smaller algorithms.

The algorithm has gone through countless iterations since the early days, when Facebook acquired FriendFeed and incorporated its “likes,” but the general approach — an interest-driven feed — has remained the same: Facebook’s news-feed ranking, descended from EdgeRank, is trained to show users what they like to see.
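In its publicly described form, EdgeRank scored a story by summing, over its interaction “edges” (likes, comments and so on), affinity × edge-type weight × time decay. The toy version below invents all the weights and the half-life purely for illustration:

```python
# Toy EdgeRank: a story's score is the sum, over its interaction "edges",
# of affinity(viewer, actor) * edge-type weight * time decay.
# All weights and the half-life below are invented for illustration.
EDGE_WEIGHTS = {"comment": 4.0, "share": 6.0, "like": 1.0}

def time_decay(edge_age_hours, half_life_hours=24.0):
    return 0.5 ** (edge_age_hours / half_life_hours)

def edgerank_score(edges):
    """edges: list of dicts with 'type', 'affinity', 'age_hours'."""
    return sum(
        e["affinity"] * EDGE_WEIGHTS[e["type"]] * time_decay(e["age_hours"])
        for e in edges
    )

story = [
    {"type": "comment", "affinity": 0.9, "age_hours": 2},
    {"type": "like", "affinity": 0.3, "age_hours": 30},
]
print(edgerank_score(story))  # higher score -> higher in the feed
```

Note that nothing in this scoring looks at whether a story is true — only at how much engagement it generates, which is the structural weakness the report describes.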

As Facebook’s role in information dissemination has shifted to that of a de facto content-distribution intermediary, it is expected to take responsibility for verifying the authenticity of information. The company has rolled out a series of measures against fake news, including streamlined procedures for users to report false information and flagging controversial content through third-party fact-checkers. In 2017 it launched the “disputed” tag, which lets users flag contested news with a report label.

Last week, Facebook announced changes to its algorithm for news feeds on its home page, reducing the share of news in news feeds to 4% from the current 5%.

However, after several more fake-news incidents the effect proved limited, and, as the Wall Street Journal points out, Facebook’s almost routine annual adjustments to the News Feed have had limited effect: each time, publishers that relied heavily on Facebook bounced back.


What’s the problem?

This series of events has pointed the finger at recommendation systems, but is the technology itself really to blame?

As for vulgar content being recommended, the matter is clearly not as simple as blaming the developers of recommendation systems. A piece of content generally travels this chain from production to consumption: creation, publication, crawling, distribution, clicking, reading.

Across these six links, three groups of people are involved. Those who create and publish are the content producers; crawling and distribution form the second group, much of it done by recommendation engines; those who click and read are the content consumers. The current outcry is that vulgar content is being seen. Clearly it is inappropriate to condemn only the recommendation engine in the middle — though it cannot escape criticism either — since all three groups are implicated.

If there is no human intervention in the recommendation engine, the most likely reason a piece of vulgar junk gets pushed to a user’s home page is that it is genuinely popular, because user behavior is the data a recommendation system relies on most. In this case, beyond platform self-inspection and manual intervention, there are technical measures worth trying:


  • In content analysis, manually screened data can serve as training samples for recognition models. Vulgar junk content can be subdivided, with separate models trained per type, to help human reviewers screen quickly.
  • In content crawling, control the quality of sources and avoid the worst breeding grounds of vulgar junk content.
  • In recommendation and distribution, shift the algorithm from purely data-driven to data-informed, and move its optimization objective from a single target to multi-objective optimization that weighs content diversity alongside effectiveness metrics. Screen the user behavior that feeds the models, taking user value into account. Apply filtering and restraint when exploiting popularity — for example, use popularity within a category or within circles of high-quality users rather than global popularity. (A toy multi-objective re-ranking is sketched after this list.) And so on.
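To make the third point concrete, here is a toy greedy re-ranker that trades a single effectiveness metric (predicted CTR) against category diversity instead of ranking on CTR alone. The weights and item fields are arbitrary placeholders, not any platform’s values:

```python
from collections import Counter

def rerank(items, w_ctr=0.8, w_div=0.2, n=10):
    """Greedy multi-objective re-ranking: balance predicted CTR against
    category diversity instead of optimizing CTR alone.
    items: list of dicts with 'id', 'ctr', 'category' (hypothetical fields)."""
    chosen, counts = [], Counter()
    pool = list(items)
    while pool and len(chosen) < n:
        def score(it):
            # The diversity bonus shrinks as a category fills up the slate,
            # so one viral category cannot monopolize the results.
            diversity = 1.0 / (1 + counts[it["category"]])
            return w_ctr * it["ctr"] + w_div * diversity
        best = max(pool, key=score)
        pool.remove(best)
        counts[best["category"]] += 1
        chosen.append(best)
    return chosen
```

Under a scheme like this, a vulgar item with a high CTR no longer automatically crowds out everything else, because each extra slot it claims earns a smaller and smaller diversity bonus.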

No content-distribution platform wants to see itself overrun with vulgar junk — it damages the brand and carries operational risk — but where there is profit, people will keep testing the limits. This is therefore a never-ending battle of offense and defense, with no end in sight.

As for Elsagate-style floods of vulgar junk in the information stream, if we want to fight them with technical means, the focus must be on deep mining and recognition of the content itself. The difficulty of Elsagate is that it takes great pains over form — stacking keywords and imitating familiar characters — while its plots carry strong innuendo, sex, violence and abuse.

AI Front believes that, technically, manual annotation combined with machine learning can identify part of the offending content. Elsagate videos have two characteristics worth emphasizing when training a model: extremely garish, even eye-searing colors, and frequent screaming and crying — both abnormal in ordinary children’s videos. Ultimately, though, humans and machines must work in tandem.
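To make those two cues concrete, here is a hedged sketch of feature extraction — average frame saturation via OpenCV, and loudness/brightness statistics of the audio via librosa (assuming the audio track has been extracted and ffmpeg is available). Thresholds and the downstream classifier are omitted; this is illustrative, not a production detector:

```python
import cv2        # pip install opencv-python
import librosa    # pip install librosa  (audio decoding needs ffmpeg)
import numpy as np

def frame_saturation(video_path, every_n=30):
    """Average color saturation over sampled frames; garish Elsagate-style
    videos tend to score high. Returns a value in [0, 1]."""
    cap = cv2.VideoCapture(video_path)
    sats, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            sats.append(hsv[:, :, 1].mean() / 255.0)  # S channel
        i += 1
    cap.release()
    return float(np.mean(sats)) if sats else 0.0

def audio_stats(audio_path):
    """Loudness (RMS) and spectral centroid; screaming and crying
    tend to push both upward relative to normal children's audio."""
    y, sr = librosa.load(audio_path, mono=True)
    rms = librosa.feature.rms(y=y).mean()
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    return float(rms), float(centroid)

# These features would feed a classifier trained on human-labeled examples;
# the model and thresholds are whatever the labeled data supports.
```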

From many perspectives, the Internet’s junk-recommendation problem is not directly caused by recommendation algorithms but calls for better content-analysis algorithms. Recommendation algorithms focus on discovering users’ interests and then serving matching items. Clearly, the suppression of fake news and vulgar junk should not center on that step, but should happen at the source.


Factors affecting the quality of recommendation systems

That recommendation systems can produce so many “flowers of evil” shows there are still many hard problems to solve. AI Front interviewed a recommendation-system expert — Xing Wudao (Chen Kaijiang), a senior algorithm expert at Lianjia. In his view, the recommendation system currently faces two biggest problems: cold start, and exploration versus exploitation.

Cold start is mainly addressed by introducing more third-party data to warm up a cold system. Purely technical solutions generally mean reinforcement learning — in its simplest form, the multi-armed bandit — but relying on technology alone to solve cold start is somewhat unrealistic; it is usually combined with various operational measures.
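The multi-armed bandit mentioned above is the classic explore/exploit primitive. An epsilon-greedy version — the simplest variant, shown here as a generic illustration rather than any platform’s code — looks like this:

```python
import random

class EpsilonGreedyBandit:
    """Epsilon-greedy multi-armed bandit: with probability eps pick a
    random arm (explore), otherwise pick the best arm so far (exploit)."""
    def __init__(self, n_arms, eps=0.1):
        self.eps = eps
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms   # running mean reward per arm

    def select(self):
        if random.random() < self.eps:
            return random.randrange(len(self.counts))   # explore
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean: v += (r - v) / n
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# For a cold user, each "arm" might be a candidate item or content category;
# the reward is a click (1) or a skip (0).
bandit = EpsilonGreedyBandit(n_arms=5)
arm = bandit.select()
bandit.update(arm, reward=1)
```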

The second problem is exploration versus exploitation, sometimes called the EE problem. The media now call its result an information cocoon: the feed gets narrower and narrower. The cause is that the recommendation algorithm extracts signal from the user-item relationship matrix and then fills that matrix back in — a self-reinforcing loop, so the narrower it gets, the more inevitable further narrowing becomes. Once a user’s interests are detected, they are only exploited, and no new interests are discovered. No recommendation system can escape this fate on its own. With mining and exploitation alone, the recommender is a closed system, and the entropy of a closed system only increases — every closed system, without exception, drifts toward heat death. In a recommender this shows up not as bad recommendations, but as recommendations you simply no longer want to see. The only way out is to keep the system open, continuously exchanging information with the outside — for example, recommending some items at random, independent of detected interests, or importing data from other, external products.
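One cheap way to “open” the system as just described is to reserve a few slots in each recommendation slate for items chosen independently of the user’s detected interests. A sketch, with all names hypothetical:

```python
import random

def open_slate(model_topn, global_pool, n=10, n_explore=2):
    """Fill most of the slate from the model, but reserve n_explore slots
    for items drawn at random from outside the model's picks, so new
    user interests can still be discovered."""
    exploit = model_topn[: n - n_explore]
    outside = [it for it in global_pool if it not in exploit]
    explore = random.sample(outside, min(n_explore, len(outside)))
    slate = exploit + explore
    random.shuffle(slate)   # don't always bury the exploratory items last
    return slate
```

The clicks those exploratory slots earn feed back into the user-item matrix, injecting exactly the outside information that keeps the loop from closing.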

Beyond the defects of the algorithm itself, many other factors affect the quality of recommended content — the review mechanism, user factors, data factors, algorithm strategy, engineering architecture, and so on. Let’s take YouTube’s moderation mechanism as an example of how review affects content recommendation.

According to BuzzFeed News, a search-algorithm engineer contracted to YouTube said there were flaws in YouTube’s system, based on a review of the company’s video-review guidelines and interviews with ten current and former “raters.” These so-called guidelines often contradict one another, recommending “high quality” videos on the basis of “production value” regardless of whether the content might disturb viewers of different ages. This not only let thousands of Elsagate videos spread across the site, but also made them algorithmically more discoverable.

Raters say they have recently received more than 100 assignments asking them to assess whether videos aimed at children are safe. “I rated more than 50 children’s videos in just one day — about seven hours of work,” said one rater. “None of those videos should be shown to children, and as a parent I am outraged.” Though animated, the videos contain profanity, dirty jokes, and hurtful and sexual content. A child watching them unsupervised is a truly frightening thing.

According to the raters, they have no power to decide how YouTube videos rank in search results, whether videos violate the guidelines, whether they should be removed or age-restricted, or whether they constitute illegal advertising; that power rests with Google and other teams at YouTube.

In the wake of the controversy, YouTube CEO Susan Wojcicki announced plans to increase the number of human reviewers on the platform to 10,000 by 2018, and raters were recently given real power to judge content: a video passes review only if it is deemed safe for children aged 9-12 to watch without parental supervision.

We don’t yet know how effective a strengthened review system will be, but we do know from experience that the review system ultimately has a human element.


What does a good recommendation system look like?

So, in terms of algorithms, data, architecture and product form, how should one design a better recommendation system? And what counts as a “good” recommendation system?

Let’s start with a set of examples of the benefits of a good recommendation system.

According to Amazon’s earnings report, its second-quarter sales rose 29% to $12.83 billion, up from $9.9 billion a year earlier. That growth arguably owes much to how Amazon has woven recommendations into almost every step of the purchase journey, from product discovery to checkout.

Toutiao adds 100 million users a year, and YouTube, with Google Brain behind its recommendation algorithm, has seen watch time grow 50% year over year — all thanks to recommendation systems.

These are only a few cases where recommendation algorithms bring us convenience and profit; it is also an objective fact that many other applications improve the user experience.

Xing Wudao put it this way: “Good systems are not designed; they evolve. It is hard to design a better recommendation system in the abstract. A recommendation system ultimately serves the product experience — return to that essence. There is no standard manual for optimizing a recommender: use your own product every day, look at the data, observe the data rather than blindly follow it.”

AI Front has also learned that deep learning is already applied in many recommendation systems: representation learning and embeddings for content; RNNs for sequential recommendation; and, increasingly, models such as Wide & Deep replacing the single linear model traditionally used for fused ranking. These applications strengthen a recommender’s capabilities and improve its experience.
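As one example of the last point, here is a minimal Wide & Deep model in Keras. The layer sizes and feature shapes are invented; this follows the published Google architecture only in outline:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

N_WIDE = 100                 # e.g. cross-product "memorization" features
VOCAB, N_DEEP = 10_000, 8    # id features fed to embeddings

# Wide part: a linear model over sparse features (memorization).
wide_in = layers.Input(shape=(N_WIDE,), name="wide_features")

# Deep part: embeddings + MLP over id features (generalization).
deep_in = layers.Input(shape=(N_DEEP,), dtype="int32", name="id_features")
x = layers.Flatten()(layers.Embedding(VOCAB, 16)(deep_in))
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(32, activation="relu")(x)

# Joint sigmoid output, e.g. click probability.
out = layers.Dense(1, activation="sigmoid")(layers.concatenate([wide_in, x]))
model = Model([wide_in, deep_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```

The design point is the combination: the wide half memorizes specific feature co-occurrences, while the deep half generalizes to unseen combinations through embeddings.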


Reflection: balancing commercial interests and social interests

“All the world’s hustle is for profit; all the world’s bustle is for profit.” Thousands of years on, this remains true everywhere. The drive for profit has cost many platforms their integrity.

Are all teams as unscrupulous as some platforms, or are there platforms that uphold the right values? We dare not draw a conclusion, but judging from what has been exposed so far — platforms like Facebook, YouTube and Google indulging content that violates social morality and values, and steering search results — one has to suspect that the interest chains behind them are driving these choices.

In China, Toutiao, the largest we-media platform, and Weibo have both been ordered by the relevant authorities to rectify. If even the biggest platforms are like this, one can only imagine how many unwritten rules quietly shape the information we see every day, and how many interest chains lie behind it — and one has to ask how users can escape the predicament of being “consumed.” And while such measures can keep some content that violates social values off the platforms, more of it can appear overnight.

“Behind this is the ‘value misalignment’ that is prevalent across content-distribution platforms,” Selman said. “It is the interests of business versus the interests of society.” When the two are irreconcilable, regulatory intervention and corporate responsibility are the keys to giving users back a clean space.

In short, the issue goes beyond the recommendation algorithms we discussed today: every technological advance has an “evil” side. But in the end these are just tools. Like a scalpel, which can kill as easily as it can save, what effect a technology has depends on whether the hand holding it belongs to a healer or a killer — on the person wielding it, and the purpose it is put to.

Reference article:

[1] www.foxnews.com/tech/2018/0…

[2] www.wired.com/story/dont-…

[3] www.cnet.com/news/youtub…

[4] www.buzzfeed.com/daveyalba/y…

[5] qz.com/1194566/goo…


About the author

Xing Wudao, born Chen Kaijiang, is a senior algorithm expert at Lianjia, working on algorithm products. He was previously a senior algorithm engineer at Sina Weibo and algorithm lead at Kaola FM. In eight years of practice, his work and research have never strayed from recommendation systems.

Over the years, Xing has worked at startups, traditional companies and large Internet companies, witnessing the construction of recommendation systems of every shape and size. And because he has built their recommendation systems from zero to one, he knows where the pitfalls lie.
