Author: Xianyu Technology – Huacai

Background

Scattering works on the output of recommendation, advertising, and search systems to improve the user's visual experience. The main idea is to reorder the results before presentation so that objects of similar categories are dispersed and the user does not tire of seeing the same thing. The results coming from the algorithm side often have the following pain points: **1. Goods of similar categories tend to cluster.** If products have similar features, they easily receive similar recommendation scores, and a page full of near-identical products is certainly not what users expect. **2. User preferences are captured too aggressively.** Psychologically, users are sensitive to having their privacy or preferences captured too perfectly. Overly accurate results not only tend to put users off, but also limit the discovery of their potential interests. **3. Errors are easily amplified.** For users with little usage history, the few available features are easily over-weighted, which can lead to poor recommendations. By changing the presentation order, the scatter algorithm separates similar categories, acting as a buffer between the recommendation system and the user and improving the user experience; it is the final link in the chain of algorithm-enabled experience.

Problem definition

First, let us clarify what the scatter algorithm is. Each object carries one or more attributes that need to be distinguished, and the required output is a list in which objects with similar attributes are dispersed. Several details are involved: 1. The degree of dispersion: is it enough that objects of the same class are not adjacent, or should they be kept as far apart as possible? 2. The dimensions of dispersion: is it sufficient to scatter along a single attribute, or must several factors be considered together? 3. Performance: as a frequently called interface, the faster it runs the better. It is also worth noting that we do not want to lose the personalization produced by the algorithm side, so how to make the most of the original object order while still scattering is a question well worth weighing.

The solution

Below we discuss three common scatter methods, each approaching the problem from a different angle. Among them, the by-column scatter method disperses most thoroughly; the weight distribution method can take multiple factors into account; and the window sliding method only needs local computation, which helps performance.

1) By-column scatter method

Since we want to avoid presenting content with similar attributes next to each other, the straightforward idea is to put content with different attributes into different buckets and, each time we take an element, try to pick from a different bucket. This scatters the elements as much as possible. In this example, the initial list consists of three categories (blue, yellow, and red). We put the elements into buckets in their original order (usually a HashMap), and then take the elements out column by column, one from each bucket per round, which guarantees the maximum degree of dispersion. To preserve the result of the original algorithm, the elements taken in each column are kept in their original order.

The advantages of this algorithm are: 1. It is simple and direct, and easy to implement. 2. The effect is good; although taking elements column by column may cause accidental adjacency at the boundary between columns, at most two similar elements end up adjacent before the tail, which does not hurt the experience. Its disadvantages are: 1. The scattering fails at the tail, where similar elements tend to stick together. 2. It does not respect the original order well: an object with a very low recommendation score may still be forced into the first few positions. 3. It can only consider one classification dimension and cannot take other factors into account. It can also be seen that the algorithm depends heavily on how categories are defined; if the categories are fine-grained, its shortcomings can be partially masked. In terms of performance, both time and space consumption are O(n).
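To make the procedure concrete, here is a minimal Java sketch of the by-column scatter, assuming a generic item type and a caller-supplied category extractor; the names and example data are illustrative, not the production code:

```java
import java.util.*;
import java.util.function.Function;

/** A minimal sketch of the by-column scatter: bucket items by category,
 *  then take one element from each bucket per round (round-robin). */
public class ColumnScatter {

    public static <T> List<T> scatter(List<T> items, Function<T, String> category) {
        // Group items into buckets by category; LinkedHashMap and ArrayList
        // preserve the original (recommendation) order inside each bucket.
        Map<String, List<T>> buckets = new LinkedHashMap<>();
        for (T item : items) {
            buckets.computeIfAbsent(category.apply(item), k -> new ArrayList<>()).add(item);
        }

        // Take elements "by column": one element from every non-empty bucket
        // per round, until all buckets are drained.
        List<T> result = new ArrayList<>(items.size());
        List<Iterator<T>> iterators = new ArrayList<>();
        for (List<T> bucket : buckets.values()) {
            iterators.add(bucket.iterator());
        }
        while (result.size() < items.size()) {
            for (Iterator<T> it : iterators) {
                if (it.hasNext()) {
                    result.add(it.next());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> items = List.of("blue1", "blue2", "yellow1", "blue3", "red1", "yellow2");
        // The color prefix acts as the category to be scattered.
        System.out.println(scatter(items, s -> s.replaceAll("\\d", "")));
        // prints [blue1, yellow1, red1, blue2, yellow2, blue3]
    }
}
```

Because each bucket keeps its insertion order, the recommendation ranking survives inside every category, which is the "keep the original order within each column" step mentioned above.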

2) Weight distribution method

When we want to consider several factors together, it is no longer intuitive to classify each item directly, so we can use the weight distribution method. First, we define a new weight for each object:

weight = Σᵢ ( Wᵢ × Countᵢ )

where Wᵢ is a manually assigned coefficient for attribute i, representing the priority of scattering along that attribute, and Countᵢ is how many times the same value of attribute i has already appeared earlier in the list. Intuitively, the more often a similar attribute has already appeared, the larger the weight becomes; and because the counting walks the list in its original order, the original ranking is naturally taken into account. After the weights are computed, no further processing is needed beyond sorting by weight. In the example in the figure, suppose the weight coefficient of the font color is 2 and that of the color block is 1. For items 1 and 2, neither their font color nor their block color has been seen before, so the weight is 0; for item 3, each has been seen once, so the weight is 2 × 1 + 1 × 1 = 3; and so on. For item 8, the font color has appeared twice and the block color three times, so the weight is 2 × 2 + 1 × 3 = 7. In this way, a single sort by weight completes the scattering. As can be seen, by assigning a heavier coefficient we give priority to scattering the font colors, while the color blocks, with their lower coefficient, are allowed a limited amount of adjacency.

The advantages of this algorithm are: 1. It considers the different scatter factors comprehensively. 2. Only one sort operation is required. 3. By modifying the weight function it can easily incorporate other considerations; for example, if the original order should be respected, it can simply be added into the weight calculation. Its disadvantages are: 1. Because the weights accumulate, the algorithm is still prone to failure at the tail. 2. The whole list must be sorted by weight, so the cost is O(n log n) rather than O(n).
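A minimal Java sketch of the weight distribution method follows, assuming the two hypothetical attributes from the example above (font color with coefficient 2, color block with coefficient 1); the `Card` record and the data in `main` are made up for illustration:

```java
import java.util.*;

/** A minimal sketch of the weight distribution method: each item's weight is
 *  the sum over attributes of coefficient * (times that attribute value has
 *  already appeared), and the list is then sorted by weight ascending. */
public class WeightScatter {

    record Card(int id, String fontColor, String blockColor) {}

    static final double W_FONT = 2.0;   // higher coefficient: scatter font color first
    static final double W_BLOCK = 1.0;  // lower coefficient: some adjacency is tolerated

    public static List<Card> scatter(List<Card> cards) {
        Map<String, Integer> fontSeen = new HashMap<>();
        Map<String, Integer> blockSeen = new HashMap<>();
        double[] weights = new double[cards.size()];

        // Walk the list in its original order, counting how many times each
        // attribute value has already appeared; repeated items get larger
        // weights and therefore sink toward the tail.
        for (int i = 0; i < cards.size(); i++) {
            Card c = cards.get(i);
            weights[i] = W_FONT * fontSeen.getOrDefault(c.fontColor(), 0)
                       + W_BLOCK * blockSeen.getOrDefault(c.blockColor(), 0);
            fontSeen.merge(c.fontColor(), 1, Integer::sum);
            blockSeen.merge(c.blockColor(), 1, Integer::sum);
        }

        // One stable sort by weight finishes the scatter; ties keep the
        // original (recommendation) order.
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < cards.size(); i++) indices.add(i);
        indices.sort(Comparator.comparingDouble(i -> weights[i]));

        List<Card> result = new ArrayList<>(cards.size());
        for (int i : indices) result.add(cards.get(i));
        return result;
    }

    public static void main(String[] args) {
        List<Card> cards = List.of(
                new Card(1, "black", "blue"),
                new Card(2, "black", "blue"),   // both attributes seen once: 2*1 + 1*1 = 3
                new Card(3, "white", "red"),    // nothing seen before: weight 0
                new Card(4, "black", "green")); // black font seen twice: 2*2 = 4
        scatter(cards).forEach(c -> System.out.println(c.id()));  // output order: 1, 3, 2, 4
    }
}
```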

3) Window sliding method

In the complexity estimates above, n is the size of the whole original sequence. In real scenarios, however, users do not see the entire sequence at once; a request usually returns only the top N items needed to fill the user's screen. This suggests a method that only looks at the list locally, and the basic idea is a sliding window. As shown in the figure, we open a window of size 3 to simulate the user's viewing window and require that no elements of the same category repeat inside it, i.e. no repetition on any single page the simulated user sees. If a repeated element is found inside the window, we look ahead for a suitable element and swap it with the current one. In the first step, elements 2 and 3 are detected to be of the same type, so 3 is taken out; element 4 is found to satisfy the constraint, so 3 and 4 are swapped, and the window slides back one position. In the second step, the same types 2 and 3 are again found in the window, so 3 and 5 are swapped; the window slides one position, finds no conflict, and slides one more. In the third step, 5 and 6 are found to be in the same group, so 6 and 7 are swapped, and the window slides to the end; although 7 and 8 are also in the same group, there is no element left to swap with, so nothing is done.

The advantage of this algorithm is that it only performs local computation and does not need to scatter the whole list, which suits the top-N use case. Its disadvantages are also obvious: it is not very robust, it is strongly affected by how the sequence is distributed, and it cannot avoid similar elements piling up at the tail.
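The following Java sketch illustrates the window sliding method under the rules described above (fixed window, no repeated category inside it, look-ahead swap, tolerate conflicts at the tail); the category strings in `main` are illustrative only:

```java
import java.util.*;
import java.util.function.Function;

/** A minimal sketch of the window sliding method: slide a fixed-size window
 *  over the list; whenever an element repeats a category already present in
 *  the window, look further down the list for the nearest element whose
 *  category is not yet in the window and swap it in. If nothing suitable
 *  remains (typically at the tail), the repetition is tolerated. */
public class WindowScatter {

    public static <T> void scatter(List<T> items, Function<T, String> category, int windowSize) {
        for (int start = 0; start + windowSize <= items.size(); start++) {
            int end = start + windowSize;                    // window covers [start, end)
            for (int i = start + 1; i < end; i++) {
                if (!seenInWindow(items, category, start, i, category.apply(items.get(i)))) {
                    continue;                                // no repetition at position i
                }
                // Position i repeats a category: search past it for a replacement
                // whose category does not yet appear in the window prefix.
                for (int j = i + 1; j < items.size(); j++) {
                    if (!seenInWindow(items, category, start, i, category.apply(items.get(j)))) {
                        Collections.swap(items, i, j);
                        break;                               // conflict resolved, keep sliding
                    }
                }
            }
        }
    }

    /** True if the given category already appears at window positions [start, i). */
    private static <T> boolean seenInWindow(List<T> items, Function<T, String> category,
                                            int start, int i, String cat) {
        for (int k = start; k < i; k++) {
            if (category.apply(items.get(k)).equals(cat)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Categories only, for brevity; window of size 3 as in the walkthrough above.
        List<String> list = new ArrayList<>(List.of("A", "B", "B", "C", "B", "A", "C", "C"));
        scatter(list, c -> c, 3);
        System.out.println(list);   // prints [A, B, C, A, B, C, B, C]
    }
}
```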

Comprehensive considerations

Based on the above discussion, we can draw the following conclusions about these methods:

| Method | Scatter degree | Can combine dimensions | Performance |
| --- | --- | --- | --- |
| By-column scatter | High | No | Medium |
| Weight distribution method | Medium | Yes | Medium |
| Window sliding method | Low | No | High |

To compare the performance of the three methods intuitively, we generated 100,000 random records and measured the running time of the three algorithms at different scales on a laptop, with the output size on the horizontal axis and the running time (in ms) on the vertical axis. Within a certain data range the sliding window method has a clear advantage, but its performance also depends strongly on the window size: if the window is too large, conflicts become more frequent and the swapping slows down considerably.

Generally speaking, the applicable scenarios of the three algorithms are as follows:

  • If the usual scenario only scatters along a single dimension, the by-column scatter method is perfectly adequate
  • If performance matters and the original arrangement is already fairly sparse, a sliding window with a small size is the better choice
  • If multiple dimensions have to be introduced, the weight distribution method is essential

The algorithms proposed in this article all run at the O(n) or O(n log n) level, and because n in real scenarios is usually small, they do not become a performance bottleneck in practice, which also leaves plenty of room for modification and trade-offs. Later, this problem can be explored further in terms of balancing global and local dispersion and of the pile-up at the tail.

A selection example

In practice we generally do not use any single method as-is. We must first clarify the requirements and then analyse them against the strengths of the three methods.

This time, while addressing the scatter requirement of the Mach selection system on Xianyu, we identified the following characteristics: 1. The commodity list is about 2,000 items long, while the number of objects fetched per user request is small, usually in the single or double digits. 2. Several factors have to be dispersed at the same time rather than a single category. 3. There are a great many user ids (every user may publish goods), while the number of product categories is very limited. We then chose our scheme accordingly. Characteristic 2 indicates that the weight distribution method is needed in order to integrate multiple factors; combining characteristics 1 and 3, the sliding window method is chosen to address performance, because each fetch returns very few items and the product categories are limited. We therefore combine the weight distribution method with the sliding window method: a sliding window of size 4 is used, the weight coefficients 13 and 7 (both primes, which is convenient for sorting) are applied to the user and the category respectively in the weight function, and the in-window constraint is changed to requiring that the weight difference with every other object stays within a certain threshold. In this way, multi-factor statistics and performance are unified.
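Since the exact in-window rule is not fully spelled out above, the sketch below shows just one possible interpretation of the combined approach, not the production logic of the Mach system: each candidate is scored against the window prefix with the coefficients 13 (same seller) and 7 (same category), and is swapped out when that score reaches a hypothetical threshold, reusing the look-ahead swap of the sliding window method.

```java
import java.util.*;

/** A rough, assumption-laden sketch of combining the weight idea with a
 *  sliding window: pairwise similarity = 13 * [same seller] + 7 * [same
 *  category]; an item too similar to the window prefix is swapped out. */
public class CombinedScatter {

    record Product(long id, String sellerId, String category) {}

    static final int W_SELLER = 13;    // prime coefficients chosen in the article
    static final int W_CATEGORY = 7;
    static final int WINDOW = 4;       // window size chosen in the article
    static final int THRESHOLD = 13;   // hypothetical: a repeated category alone (7) is
                                       // tolerated, a repeated seller (13 or 20) is not

    static int similarity(Product a, Product b) {
        int s = 0;
        if (a.sellerId().equals(b.sellerId())) s += W_SELLER;
        if (a.category().equals(b.category())) s += W_CATEGORY;
        return s;
    }

    public static void scatter(List<Product> items) {
        for (int start = 0; start + WINDOW <= items.size(); start++) {
            int end = start + WINDOW;
            for (int i = start + 1; i < end; i++) {
                if (maxSimilarity(items, start, i, items.get(i)) < THRESHOLD) continue;
                // Too similar to something already in the window: look ahead for
                // the nearest product that fits and swap it in; otherwise tolerate.
                for (int j = i + 1; j < items.size(); j++) {
                    if (maxSimilarity(items, start, i, items.get(j)) < THRESHOLD) {
                        Collections.swap(items, i, j);
                        break;
                    }
                }
            }
        }
    }

    // Highest weighted similarity between the candidate and the window prefix [start, i).
    private static int maxSimilarity(List<Product> items, int start, int i, Product candidate) {
        int max = 0;
        for (int k = start; k < i; k++) {
            max = Math.max(max, similarity(items.get(k), candidate));
        }
        return max;
    }

    public static void main(String[] args) {
        List<Product> feed = new ArrayList<>(List.of(
                new Product(1, "sellerA", "bike"),
                new Product(2, "sellerA", "bike"),
                new Product(3, "sellerB", "phone"),
                new Product(4, "sellerC", "bike"),
                new Product(5, "sellerD", "book")));
        scatter(feed);
        feed.forEach(p -> System.out.println(p.id()));  // prints 1, 3, 4, 5, 2
    }
}
```

In this reading, the two items from the same seller are pushed apart, while a repeated category with nothing else in common is tolerated, reflecting the higher priority given to the user dimension.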

However, the parameters (window size and weight coefficients) have not yet been tuned through repeated comparison. We plan to use A/B testing and similar methods to adjust them in the real scenario so that the algorithm performs even better.

Conclusion

This article has discussed several implementations of the scatter algorithm, describing each from its mechanism to its advantages and disadvantages. With these methods, results of particular categories can be dispersed in the presentation order, improving the user's visual experience. We can see that there is an inherent tension between the thoroughness of the scattering and respect for the ordering produced by the original algorithm. How to balance the two under complex real-world requirements, so that a solution generalises to more scenarios, is something we need to keep exploring; how to keep improving the user experience through better technology is an eternal question for engineers.