Author | Zhou Xiang



Can you imagine Zhihu and Toutiao dissing each other when they disagree?


In recent days, a WeChat Moments post by Zhihu big V @Evil Dad has become the trigger for an all-out battle between Toutiao's "Wukong Answers" and Zhihu.

This year Toutiao signed more than 300 Zhihu big Vs in one go, and just signed me. They pay real money; the annual income is higher than an ordinary white-collar worker's. After signing, none of the content may be posted on Zhihu. The quality of Zhihu has gone from bad to worse…

Zhang Liang, co-founder of Zhihu, said: "For at least the past year, I have been hoping that two people would leave Zhihu. One is 'Evil Dad' and the other is 'Master Huo'."

You can smell the gunpowder.


However, the "war" between Toutiao and Zhihu is not that simple. The two are also competing over content-distribution technology and AI talent, and one form this competition takes is the algorithm contest.


As everyone knows, Toutiao poached Ma Weiying, former executive vice president of Microsoft Research Asia, to head its artificial intelligence lab. Zhihu, meanwhile, has been quietly making moves of its own.


According to AI Technology Base, Zhihu has set up a machine learning team, and Zhang Rui, formerly of Baidu and Wandoujia, is one of its members. Zhang holds a master's degree from Beijing University of Posts and Telecommunications and has long worked on search engines and natural language processing.


Since both companies have the data and the need, it makes sense for them to hold algorithm contests to source new ideas, expand their influence, and attract talent.


The Zhihu Kanshan Cup Machine Learning Challenge

In 2016, Toutiao, the Chinese Association for Artificial Intelligence, and IEEE China jointly organized the three-month Byte Cup International Machine Learning Competition. The task was to build models predicting the probability that a given expert would answer a particular question, so that questions from ordinary users of Toutiao's Q&A product could be pushed more efficiently to experts willing to answer them.


On May 15 this year, Zhihu's algorithm team, together with the Chinese Association for Artificial Intelligence, the IEEE Computer Society, and the IEEE China Representative Office, launched the "Zhihu Kanshan Cup Machine Learning Challenge". The contest centers on applied semantic analysis: automatically assigning precise topic labels to Zhihu content, in order to improve Zhihu's user experience and its content-distribution efficiency.


Both tasks bear directly on improving the distribution efficiency of Q&A content. Keep in mind that Toutiao's Q&A channel officially launched inside the Toutiao app on July 14, 2016; so although Zhihu's algorithm contest came a bit late, it still fired a first shot at Toutiao.


On August 15, Innovation Works, Sogou, and Toutiao announced that they would jointly launch the AI Challenger Global AI Challenge, with the first round officially opening on September 4.


The AI Challenger has not yet begun, however, while the results of Zhihu's contest were announced yesterday (August 30): the Init team from the Pattern Recognition Lab of Beijing University of Posts and Telecommunications took first place.


Although Zhihu is not the first Chinese company to hold an algorithm competition, the event is still a positive example for the country's AI community; at the very least, its reputation is far better than that of Ctrip's big-data competition.

Let's take a look back at the competition.

Background

At present, an important channel of content distribution on Zhihu is the feed stream generated by follow relationships. A follow can target either a person or a topic tag, and recommending content to users based on the topic tags they care about better serves their needs for different kinds of knowledge in different fields. Accurate automatic topic labeling of Zhihu content therefore plays an important supporting role in improving Zhihu's user experience and content-distribution efficiency. At the same time, semantic understanding and automatic tagging of text, especially tagging against a very large label set, is a frontier research direction in natural language processing.


Task description

Using the training data Zhihu provides, which consists of questions and the topic tags bound to them, participants must train a model that automatically labels unseen data.


The labeled data contains 3 million questions, each carrying one or more of 1,999 distinct labels. Each label corresponds to a "topic" on Zhihu, and parent-child relationships among topics organize them into a directed acyclic graph (DAG).
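In other words, the task is large-scale multi-label text classification: each question maps to a multi-hot vector over the 1,999 topics. A minimal sketch of that encoding, using invented toy topic ids rather than the contest's real ones:

```python
import numpy as np

NUM_TOPICS = 1999  # distinct topic labels in the contest data

def encode_labels(question_topics, topic_index):
    """Encode one question's bound topics as a multi-hot vector.

    question_topics -- list of topic ids bound to the question
    topic_index     -- dict mapping each topic id to a column in 0..1998
    """
    y = np.zeros(NUM_TOPICS, dtype=np.float32)
    for tid in question_topics:
        y[topic_index[tid]] = 1.0
    return y

# Toy example with invented ids; the real index has 1,999 entries.
topic_index = {"topic_ml": 0, "topic_nlp": 1, "topic_cv": 2}
y = encode_labels(["topic_ml", "topic_nlp"], topic_index)
print(int(y.sum()))  # 2 labels set
```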


Out of concern for user privacy and data security, the contest does not provide the original text of questions or topic descriptions; instead, texts are represented as sequences of character IDs and of word IDs produced by word segmentation. Given how widely word-vector techniques are used in natural language processing, the contest also provides character-level and word-level embedding vectors.
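Since texts arrive as ID sequences plus pre-trained embedding vectors, a natural first step is loading those vectors into a model's embedding layer. A sketch of that step, assuming the provided vectors have already been parsed into a NumPy array aligned with the ID vocabulary (the sizes and names here are illustrative, not the contest's actual ones):

```python
import numpy as np
import torch
import torch.nn as nn

# Assume word_vectors[i] holds the provided pre-trained vector for word id i.
vocab_size, embed_dim = 100000, 256                    # illustrative sizes
word_vectors = np.random.randn(vocab_size, embed_dim).astype(np.float32)

embedding = nn.Embedding(vocab_size, embed_dim)
embedding.weight.data.copy_(torch.from_numpy(word_vectors))  # init from pre-trained vectors
embedding.weight.requires_grad = True                  # fine-tune; set False to freeze

ids = torch.tensor([[3, 17, 42]])                      # one toy question as word ids
print(embedding(ids).shape)                            # torch.Size([1, 3, 256])
```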


Evaluation method

Zhihu provides an evaluation set of 217,360 questions. Participants run their trained model over these questions and submit the top 5 predicted topic tags for each. The predicted tags are ordered:


  1. The five predicted topic tags are ranked by their predicted scores.

  2. Predicted tags are expected to be unique. If a tag appears more than once, only its first occurrence is kept and later duplicates are discarded. If fewer than five tags remain after deduplication, the remaining positions are filled with -1, which matches no tag. Submissions with more than five tags are truncated after the fifth.

  3. Evaluation criteria:

    Precision: a predicted tag counts as correct if it hits any of the ground-truth tags. The final precision is the position-weighted sum of the precision at each of the five positions:

    $$\text{Precision} = \sum_{pos \in \{1,2,3,4,5\}} \frac{\text{Precision@}pos}{\log(pos + 1)}$$



    Recall: the fraction of the ground-truth tags covered by the top 5 predicted tags.

    The final evaluation metric is the harmonic mean of Precision and Recall, that is:

    $$\text{Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
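A sketch of this metric in code, following the stated rules (deduplication, -1 padding, position weights); the organizers' exact implementation may differ in details such as how Precision@pos is normalized:

```python
import math

def dedup_top5(tags):
    """Keep the first occurrence of each tag, truncate to 5, pad with -1."""
    seen, out = set(), []
    for t in tags:
        if t not in seen:
            seen.add(t)
            out.append(t)
        if len(out) == 5:
            break
    return out + [-1] * (5 - len(out))

def final_score(predictions, ground_truths):
    """predictions: ordered tag lists per question; ground_truths: tag sets."""
    weighted_hits = 0.0   # position-weighted correct predictions
    weight_total = 0.0    # sum of position weights (precision normalizer)
    recall_hits, recall_total = 0, 0
    for pred, truth in zip(predictions, ground_truths):
        pred = dedup_top5(pred)
        for pos, tag in enumerate(pred, start=1):
            w = 1.0 / math.log(pos + 1)
            weight_total += w
            if tag in truth:      # -1 never matches a real tag
                weighted_hits += w
        recall_hits += len(set(pred) & truth)
        recall_total += len(truth)
    precision = weighted_hits / weight_total
    recall = recall_hits / recall_total
    return 2 * precision * recall / (precision + recall)  # harmonic mean

print(final_score([["a", "b", "c", "c", "d"]], [{"a", "c", "x"}]))
```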
Results

Winners were determined by the review panel based on each team's performance on a validation dataset, using the models the teams submitted (performance and rankings: https://biendata.com/competition/zhihu/leaderboard/), followed by a final round of screening and confirmation.
To verify the results, the winning teams first submitted method descriptions together with reproducible code and model data; the review panel then examined each winning team's method and code one by one, randomly sampled models submitted by some of the teams, and re-checked their results on a separate validation dataset. In all, 7 teams won prizes. The winners are as follows:

First prize (40,000 yuan): the Init team from Beijing University of Posts and Telecommunications.


Second prize (10,000 yuan each) went to two teams:


  • the Koala team from Beijing University of Posts and Telecommunications;

  • the YesOfCourse team from the Chinese Academy of Sciences (CAS), Google, and Baidu.

Third prize (5,000 yuan each) went to four teams:


  • NLPFakers from Microsoft and Peking University;

  • Gower Street & 81 Road from Wuhan University and University College London;

  • Ye team from Beijing University of Posts and Telecommunications;

  • the Yin & Bird team from Zhengzhou Railway Bureau, Flush, and the Zhongshan Institute of the University of Electronic Science and Technology of China.

Competition highlights

According to Zhang Rui of Zhihu, all seven winning teams, without exception, used deep neural networks (DNNs) of various structures, while traditional text-classification methods such as Support Vector Machines (SVM) or Naive Bayes were rarely used. This suggests that, to some extent, deep neural networks have become the mainstream method in NLP.
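For reference, here is a minimal multi-label TextCNN of the general kind these teams built, with a sigmoid output over the 1,999 topics and binary cross entropy loss; this is an illustrative sketch under assumed sizes, not any team's actual architecture or code:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Minimal TextCNN for multi-label topic tagging (illustrative only)."""

    def __init__(self, vocab_size=100000, embed_dim=256, num_labels=1999,
                 kernel_sizes=(2, 3, 4, 5), num_filters=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes
        )
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_labels)

    def forward(self, ids):                      # ids: (batch, seq_len)
        x = self.embedding(ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Convolve with several window widths, max-pool over time, concatenate.
        feats = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(feats, dim=1))  # one raw logit per label

model = TextCNN()
loss_fn = nn.BCEWithLogitsLoss()                 # sigmoid cross entropy, multi-label
ids = torch.randint(0, 100000, (8, 50))          # toy batch: 8 questions, 50 words
targets = torch.zeros(8, 1999)
targets[:, :3] = 1.0                             # toy multi-hot ground truth
loss = loss_fn(model(ids), targets)
loss.backward()
```

With that baseline in mind, the teams' specific recipes were: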


  • Init team: TextCNN + TextRNN + RCNN with a shared embedding trained jointly. For model ensembling, multiple models were combined by equal-weight bagging. For data preprocessing, random deletion and shuffling were used as data augmentation.

  • Koala team: FastText + TextCNN + TextRNN, trained with a layer-by-layer boosting idea, with weighted-average bagging across the networks.

  • YesOfCourse team: TextCNN + LSTM/GRU + RCNN as base models, with GBRank fusing the outputs of the multiple neural networks.

  • NLPFakers team: TextCNN + RNN + RCNN as base models, combined by linear weighting; an attention mechanism was used during neural network training.

  • Gower Street & 81 Road team: an RNN as the base model, jointly trained with a query-topic-title similarity signal; Bagging with Ensemble Selection served as the ensembling strategy.

  • Ye team: TextCNN + BiGRU as base models, with weight-searched bagging as the ensembling strategy.

  • Yin & Bird team: LSTM and Bayesian methods as base models, with stacking for model ensembling.

In modeling the problem, every team cast it as multi-label text classification or text-label prediction. During training, most teams used cross entropy as the loss function. All teams applied ensemble learning, using multiple complementary models to improve performance. The contestants also did plenty of optimization based on their own understanding of the problem, and some of those optimizations stood out.

For example:


Init, the first-place team, did creative work on data augmentation. During model training, it used deletion and shuffling mechanisms (sketched below) to curb overfitting and to keep the individual models diverse. In its review submission, Init noted that, with these data augmentation mechanisms alone, its multiple models combined by plain equal-weight bagging already outperformed the second-place results.
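A sketch of that kind of augmentation, randomly deleting words and locally shuffling word order in a question before training; the probabilities and window size are made-up values, and this is one reading of the delete-and-shuffle idea rather than Init's actual code:

```python
import random

def augment(words, drop_prob=0.1, shuffle_prob=0.5, window=3):
    """Randomly delete words and shuffle word order within small windows.

    Each pass produces a slightly different copy of the question, which
    curbs overfitting and decorrelates the ensemble's member models.
    """
    # Random deletion: drop each word independently with probability drop_prob.
    kept = [w for w in words if random.random() > drop_prob] or words[:1]
    # Local shuffle: with probability shuffle_prob, shuffle within windows.
    if random.random() < shuffle_prob:
        out = []
        for i in range(0, len(kept), window):
            chunk = kept[i:i + window]
            random.shuffle(chunk)
            out.extend(chunk)
        kept = out
    return kept

print(augment("w1 w2 w3 w4 w5 w6".split()))
```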


Koala, the second-place team, applied a layer-by-layer boosting method when training its neural networks to improve the performance of individual models. By the team's account, this optimization improves the performance of a multilayer neural network by about 1.5 percentage points.

YesOfCourse, in third place, recast tag prediction as a two-step recall-rerank problem: a large set of neural network models handles recall; the networks' predicted tag scores then feed GBRank as features, a pairwise method optimizes the ranking of the tags, and the top 5 ranked tags become the model's output (a sketch follows below). According to the team's submitted description, the recall + rerank pipeline scored more than 0.002 better than a non-linear NN ensemble. YesOfCourse also experimented with a variety of loss functions and attention mechanisms to keep its models diverse.
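A compressed sketch of such a recall-rerank pipeline on toy data; scikit-learn's GradientBoostingRegressor stands in as a pointwise substitute for the pairwise GBRank the team describes, and every size and name below is illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_questions, n_labels, n_models, k_recall = 200, 50, 3, 10  # toy sizes

nn_scores = rng.random((n_models, n_questions, n_labels))   # each NN's tag scores
truth = rng.random((n_questions, n_labels)) < 0.05          # toy ground-truth tags

# Step 1 (recall): keep the top-k tags by averaged NN score per question
# (the team used many more networks for this stage).
mean_scores = nn_scores.mean(axis=0)
candidates = [np.argsort(-mean_scores[q])[:k_recall] for q in range(n_questions)]

# Step 2 (rerank): each (question, candidate tag) pair becomes one row whose
# features are the individual models' scores for that tag.
X = np.array([nn_scores[:, q, t] for q in range(n_questions) for t in candidates[q]])
y = np.array([truth[q, t] for q in range(n_questions) for t in candidates[q]], dtype=float)

ranker = GradientBoostingRegressor().fit(X, y)

# Inference: rerank one question's candidates and keep the top 5 tags.
q = 0
scores = ranker.predict(np.array([nn_scores[:, q, t] for t in candidates[q]]))
top5 = [candidates[q][i] for i in np.argsort(-scores)[:5]]
print(top5)
```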

The Gower Street & 81 Road team, ranked fifth, exploited the topic-title information in the data, jointly training an RNN with a question-topic-title similarity signal. This lifted its single-model score from 0.415 to 0.419, and an ensemble of 20 models reached a solid 0.432.

The future of algorithmic contests

Zhang Rui says Zhihu started research on this problem long ago, with approaches such as word2vec + CNN and LSTM, and a fairly mature version has been running in production for some time.


"We are absolutely not 'using technology to fleece people for pennies.' We have high expectations of the contestants. I believe that through the competition they will offer unique insights from unexpected angles and spark thinking that greatly inspires further improvement of our own work. We also hope participants use the competition to develop their interest and abilities in natural language processing, making it a win-win event for both sides."


However, as Zhang Rui wrote in his column, the most influential competitions in the field of AI are mostly held abroad, and beyond language barriers and differing development priorities, the lack of high-quality datasets is a major constraint.


According to Zhang, in addition to this competition's text-tag dataset, Zhihu will release further datasets and machine learning tasks closely tied to its product, such as content recommendation and social-network link prediction, opening them to the public after strict anonymization and review.


Nowadays it is harder and harder for algorithms alone to make breakthrough progress, which only heightens the importance of high-quality datasets. And it is far more difficult for academics and independent developers to obtain research data than it is for big companies.


At the same time, both Zhihu's "Kanshan Cup" and the "AI Challenger" that Toutiao is holding with Innovation Works and Sogou have, through algorithm contests, indirectly contributed a great deal of data to the AI community.


So whether these companies are trying to expand their influence or attract talent, it’s at least a good start.