I've done a lot this month: we launched an ES cluster on the order of 100 nodes, and I designed and built a real-time recommendation system. The title is a bit long because I want to highlight three things about this recommendation system: it is real-time, it is built on user portraits, and it is asynchronous.

Real-time is mainly reflected in three aspects:

  1. Short-term interest models in user profiles are constructed in real time.

    That is, when you watch a video, within seconds that video affects your short-term interest model, and the effect shows up in your next batch of recommendations.

  2. Candidate sets change in real time.

    In the system I designed, a candidate set is a pool of videos of a given type waiting to be recommended. A user never sees a whole candidate set, only the part selected by the matching algorithm, so the update cycle of a candidate set directly determines how fresh the videos a user sees can be. There can be many candidate sets, and different candidate sets address different recommendation scenarios. For example, combining the latest-content candidate set with the hottest-in-the-last-N-hours candidate set gives an effect similar to Toutiao: the latest-content candidate set is generated in real time, while the hottest-in-N-hours set can be updated at minute granularity (a minimal sketch of such a minute-level computation follows this list). Likewise, collaborative filtering yields related-video recommendations, while the hot candidate set helps pick out, from what everyone cares about, the content a particular user likes.

  3. Real-time presentation of recommended performance indicators.

    Key metrics you watch after launch, such as the click conversion rate, can be updated at minute granularity. A recommendation system has the particular property that its quality is not decided by any one person's opinion; it is measured by indicators, the click conversion rate being a typical one.
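To make the "hottest in the last N hours, refreshed every minute" idea concrete, here is a minimal Spark Streaming sketch in Scala. It is not the production code: the event format, the socket stand-in source (the real system reads from Kafka), the six-hour window, and the top-500 cut-off are all assumptions for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object HotCandidateSet {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("hot-candidate-set").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(30))

    // Stand-in source; the real system would read play events from Kafka.
    // Assumed line format: "<userId>\t<videoId>"
    val events = ssc.socketTextStream("localhost", 9999)

    val hottest = events
      .map(_.split("\t"))
      .filter(_.length >= 2)
      .map(parts => (parts(1), 1L))
      // Count plays over the last 6 hours, recomputed every minute.
      .reduceByKeyAndWindow(_ + _, Minutes(6 * 60), Minutes(1))

    hottest.foreachRDD { rdd =>
      // Keep only the top 500 videos as the "hottest" candidate set.
      val top = rdd.sortBy(-_._2).take(500)
      // In the real system this would be written to the candidate-set store
      // (e.g. Redis/HBase); here we just print a few entries.
      top.take(10).foreach { case (videoId, cnt) => println(s"$videoId -> $cnt") }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```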

User portrait and video portrait

The user portrait is embodied in the interest models. By building both a long-term and a short-term interest model for each user, we can serve the user's stable interests and hobbies as well as what they care about within the current session. There are many ways to make recommendations, such as collaborative filtering and assorted tricks. Building on user portraits and video portraits is harder to get started with, but in the long run it deepens the team's understanding of users and videos and supports business beyond recommendation.
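As a concrete illustration of a short-term interest model, here is a tiny Scala sketch of one common approach: decay the existing tag weights on every event and add the tags of the video just watched, so the profile tracks the current session. The decay factor, weights, and update rule are assumptions, not the production model.

```scala
object ShortTermInterest {
  // tag -> weight
  type Profile = Map[String, Double]

  /** Decay the existing profile and fold in the tags of the video just watched. */
  def update(profile: Profile,
             videoTags: Seq[String],
             decay: Double = 0.8,   // assumed per-event decay factor
             boost: Double = 1.0): Profile = {
    val decayed = profile.map { case (tag, w) => tag -> w * decay }
    videoTags.foldLeft(decayed) { (p, tag) =>
      p.updated(tag, p.getOrElse(tag, 0.0) + boost)
    }
  }

  def main(args: Array[String]): Unit = {
    var p: Profile = Map.empty
    p = update(p, Seq("cats", "funny"))
    p = update(p, Seq("cats", "cooking"))
    // "cats" now outweighs the others; within seconds this can feed the next refresh.
    println(p.toSeq.sortBy(-_._2))
  }
}
```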

Asynchronous

The recommendation computation is triggered by the user's refresh action: the user's information is sent asynchronously to Kafka, a Spark Streaming program consumes it, matches the candidate sets against the user, and writes the result into the user's private queue in Redis. The interface service is only responsible for reporting the refresh action and fetching recommendation data from the queue. Asynchrony becomes a problem for new users, or for users who have been away so long that their private queue has expired. Once the front-end interface detects this, there are two solutions:

  1. Send a special message (handled by a Storm cluster on the back end) and hold the request until the asynchronous computation returns a result
  2. Have the interface service itself fetch the user's interest tags, apply certain rules together with collaborative signals, query ES, fill the private queue, and return a result quickly (the scheme we adopted)

Apart from new users, such cases are a small minority; the vast majority of computation is covered by the asynchronous path.
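Below is a minimal Scala sketch of that asynchronous path with Spark Streaming. The topic name `user-refresh`, the Redis key scheme `rec:queue:<userId>`, the one-day expiry, and the placeholder matching step are all assumptions made for illustration, not the original system's code.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import redis.clients.jedis.Jedis

object AsyncRecommendJob {
  // Placeholder for matching candidate sets against the user.
  def matchCandidates(userId: String): Seq[String] =
    Seq(s"video-for-$userId-1", s"video-for-$userId-2")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("async-recommend").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "recommend",
      "auto.offset.reset"  -> "latest")

    // The interface service only reports the refresh action; we consume it here.
    val refreshes = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Set("user-refresh"), kafkaParams))

    refreshes.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val jedis = new Jedis("redis-host", 6379)
        records.foreach { record =>
          val userId = record.value()          // assumed: the message carries the user id
          val recs   = matchCandidates(userId)
          if (recs.nonEmpty) {
            val key = s"rec:queue:$userId"
            jedis.lpush(key, recs: _*)
            jedis.expire(key, 24 * 3600)       // queues expire, hence the fallback above
          }
        }
        jedis.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```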

The influence of streaming technology on the recommendation system

I've written a lot about streaming technology, the most explicit point being that data is inherently streaming. The main work of my department over the past two years has been building data pipelines and solving problems related to real-time log billing. Streaming computing has a great influence on recommendation systems and can carry essentially all of their computation.

In this recommendation system, everything apart from the interface service is computation, including but not limited to:

  1. Preprocessing new content, such as tagging it, and writing it to multiple stores
  2. Building user portraits, such as the short-term interest model
  3. Candidate sets of new and hot data
  4. Short-term collaborative filtering
  5. Recommendation metrics such as the click conversion rate

All of these are done with Spark Streaming. Spark batch jobs handle long-term collaborative filtering (more than a day of data) and the long-term user interest model. Thanks to StreamingPro, the whole computation flow is driven by configuration: what you see is a set of description files that together form the core computation flow of the recommendation system.
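To show what the long-term batch side might look like, here is a hedged Spark sketch that computes item co-occurrence ("related video" candidates) from archived Parquet play logs. It is not one of the StreamingPro description files; the paths, column names, and top-30 cut-off are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object LongTermCooccurrence {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("long-term-cooccurrence").getOrCreate()
    import spark.implicits._

    // Assumed archive layout: Parquet play logs with columns userId and videoId.
    val plays = spark.read.parquet("/warehouse/play_log/")
      .select("userId", "videoId")
      .rdd
      .map(r => (r.getString(0), r.getString(1)))

    // Two videos co-occur when the same user watched both.
    val cooc = plays
      .groupByKey()
      .flatMap { case (_, videos) =>
        val vs = videos.toSeq.distinct
        for (a <- vs; b <- vs if a != b) yield ((a, b), 1L)
      }
      .reduceByKey(_ + _)

    // Per video, keep the most co-watched neighbours as "related video" candidates.
    cooc
      .map { case ((a, b), cnt) => (a, (b, cnt)) }
      .groupByKey()
      .flatMap { case (a, neigh) =>
        neigh.toSeq.sortBy(-_._2).take(30).map { case (b, cnt) => (a, b, cnt) }
      }
      .toDF("videoId", "relatedVideoId", "coWatchCount")
      .write.mode("overwrite").parquet("/warehouse/related_videos/")

    spark.stop()
  }
}
```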

Three other points worth mentioning here are:

  1. Spark Streaming + ElasticSearch for recommendation evaluation. Spark Streaming preprocesses the reported exposure and click data and stores it in ES, and ES then serves as the query interface for BI reports. This avoids having to precompute metrics and constantly modify the streaming program whenever a metric turns out not to have been anticipated (see the sketch after this list).

  2. Reuse of the existing big data infrastructure. In the whole recommendation system, only the services providing external APIs need to be deployed independently; all other computation runs as Spark jobs on the Hadoop cluster.

  3. All computation cycles and resources can be adjusted easily, even dynamically (Spark Dynamic Resource Allocation). This matters a lot: I can trade away some real-time latency to save resources, or hand more resources to offline tasks during idle hours. All of this relies on Spark's support.
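Here is a minimal sketch of the evaluation pipeline from point 1, using the elasticsearch-hadoop connector to write lightly structured events into ES. The index name, field names, stand-in socket source (the real pipeline reads from Kafka), and ES endpoint are assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.elasticsearch.spark._   // adds saveToEs to RDDs (elasticsearch-hadoop)

object EventsToEs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("events-to-es")
      .setMaster("local[2]")
      .set("es.nodes", "es-host:9200")   // assumed ES endpoint

    val ssc = new StreamingContext(conf, Seconds(10))

    // Stand-in source; assumed line format: "<ts>\t<userId>\t<videoId>\t<action>"
    // where action is "expose" or "click".
    val lines = ssc.socketTextStream("localhost", 9999)

    val events = lines
      .map(_.split("\t"))
      .filter(_.length >= 4)
      .map(p => Map("ts" -> p(0), "userId" -> p(1), "videoId" -> p(2), "action" -> p(3)))

    events.foreachRDD { rdd =>
      rdd.saveToEs("recommend-events/event")   // BI reports then query ES directly
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The click conversion rate then reduces to a ratio of two counts (clicks over exposures for a time range) that ES can answer at query time, so adding a new metric does not require changing the streaming job.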

Architecture of the recommendation system

The structure of the whole recommendation system is shown as follows:




[Figure: overall architecture of the recommendation system (Snip20161201_6.png)]

It looks fairly simple. Distributed stream computing is mainly responsible for five things:

  1. Processing reported data such as exposures and clicks
  2. Tagging new videos
  3. Computing the short-term interest model
  4. Generating recommendations for users
  5. Computing candidate sets, such as latest and hottest (over any time window)

Storage uses:

  1. Codis (user recommendation queues)
  2. HBase (user portraits and video portraits)
  3. Parquet (HDFS) (archived data)
  4. ElasticSearch (a queryable copy of the HBase data)

The diagram below refines the stream computing part:




[Figure: detail of the stream computing part (Snip20161201_7.png)]

The technical stack for user-side event reporting:

  1. Nginx
  2. Flume (Collecting Nginx logs)
  3. Kafka (receives Flume reports)

For third-party content (from across the web), we developed our own collection system.

Personalized recommendation




[Figure: personalized recommendation flow (Snip20161201_10.png)]

All candidate sets are updated in real time.

Here we talk about the concept of a parameter configuration server.

Suppose I have three algorithms A, B, and C, each implemented as its own streaming program, independent of the others, each computing its own result set. Different candidate sets and algorithms differ in how often they compute and in how much data they produce. Suppose A's result set is too large while B's is small but of very high quality. Before pushing results to a user's recommendation queue, each program submits its output size to the parameter configuration server, and the parameter server decides how many items each is finally allowed to send to the queue. The parameter server also controls frequency: for example, if algorithm A produces a new recommendation only 10 seconds after its last one, the parameter server can refuse to let that content be written into the user's recommendation queue.
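A tiny sketch of that flow control in plain Scala, with invented interfaces rather than the real parameter server; the quotas and minimum intervals below are made-up examples.

```scala
object ParamServerSketch {
  final case class Quota(maxItems: Int, minIntervalMs: Long)

  // In-memory stand-in for the parameter configuration server.
  object ParamServer {
    private val quotas = Map(
      "algoA" -> Quota(maxItems = 20,  minIntervalMs = 60000), // A: large output, capped hard
      "algoB" -> Quota(maxItems = 100, minIntervalMs = 10000)  // B: small but high quality
    )
    private var lastSent = Map.empty[String, Long]

    /** Returns how many items `algo` may push now, or 0 if it is pushing too soon. */
    def admit(algo: String, wanted: Int, now: Long = System.currentTimeMillis()): Int =
      synchronized {
        val q = quotas.getOrElse(algo, Quota(0, Long.MaxValue))
        val tooSoon = now - lastSent.getOrElse(algo, 0L) < q.minIntervalMs
        if (tooSoon) 0
        else { lastSent += algo -> now; math.min(wanted, q.maxItems) }
      }
  }

  def main(args: Array[String]): Unit = {
    val resultsFromA = (1 to 500).map(i => s"videoA$i")
    val allowed = ParamServer.admit("algoA", resultsFromA.size)
    val toQueue = resultsFromA.take(allowed)   // only this slice reaches the user queue
    println(s"A wanted ${resultsFromA.size}, parameter server allowed ${toQueue.size}")
    println(ParamServer.admit("algoA", 50))    // 0: still within the 60s minimum interval
  }
}
```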

The above is flow control across multiple algorithms. There is another mode: I may want the results of A and B to be blended by a new algorithm K according to some mixing rule. Because every algorithm is a configurable module in StreamingPro, A, B, and K can be placed in a single Spark Streaming application. K periodically invokes A and B, mixes their results, and, with the parameter server's authorization, writes the final result into the user's recommendation queue.
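A sketch of this blending mode, again with invented module interfaces rather than the real StreamingPro configuration: K calls A and B, mixes their scores by configured weights, and only the blended list would go to the user queue.

```scala
object BlendSketch {
  type Rec = (String, Double)   // (videoId, score)

  trait Algo { def recommend(userId: String): Seq[Rec] }

  // Stand-ins for the A and B modules.
  object AlgoA extends Algo {
    def recommend(userId: String): Seq[Rec] = Seq(("v1", 0.9), ("v2", 0.7), ("v3", 0.5))
  }
  object AlgoB extends Algo {
    def recommend(userId: String): Seq[Rec] = Seq(("v4", 0.95), ("v2", 0.6))
  }

  /** K: weighted blend of A and B, deduplicated, highest score wins. */
  def blend(userId: String, wA: Double = 0.4, wB: Double = 0.6): Seq[Rec] = {
    val weighted =
      AlgoA.recommend(userId).map { case (v, s) => (v, s * wA) } ++
      AlgoB.recommend(userId).map { case (v, s) => (v, s * wB) }
    weighted
      .groupBy(_._1)
      .map { case (v, scored) => (v, scored.map(_._2).max) }
      .toSeq
      .sortBy(-_._2)
  }

  def main(args: Array[String]): Unit =
    println(blend("user42"))   // this blended list is what would be written to the queue
}
```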

Some reflections

I built a recommendation system back in 2014 or 2015. At that time there was no notion of streaming computing for us, and existing technical systems could not be reused, which made the system overly complex and hard to productize. Moreover, the effect of the recommendations could only be seen a day later, so the improvement cycle was very long, and the whole development took more than a month. Now, with StreamingPro, two or three people each spending only two or three hours a day got it built in about two weeks. The next step is to dig further into user portraits and video portraits; the core is to build a tag system and then attach these tags to users and videos. Along the way we combined algorithms such as LDA and Bayes and accumulated a lot of useful experience.