"This article introduces the evaluation metrics commonly used in recommendation systems, including rating prediction metrics, set-based (top-N) recommendation metrics, ranking metrics, diversity, and stability."

Source: zhuanlan.zhihu.com/p/67287992

Over the past half year I have collected the evaluation metrics used in the recommendation-system literature. If you find any metric missing from this article, please point it out in the comments and I will add it.

Contents

1. Overview

2. Commonly used evaluation metrics

3. Other evaluation metrics


1. Overview

Since the beginning of recommendation-system research, evaluating predictions and recommendation results has been a crucial step, and the quality of a recommendation algorithm is directly reflected in its performance on these metrics. Broadly, depending on the recommendation task, the most commonly used quality measures fall into three categories: (1) evaluating the predicted ratings, which applies to rating prediction tasks; (2) evaluating the predicted item set, which applies to top-N recommendation tasks; (3) weighting the evaluation by position in the ranked list, which applies to both rating prediction and top-N recommendation tasks.

The specific metrics corresponding to these three categories are:

(a) Rating prediction metrics: accuracy metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and Normalized Mean Absolute Error (NMAE); and Coverage.

(b) Set-based recommendation metrics, such as Precision, Recall, ROC and AUC.

(c) Ranking metrics, such as half-life utility and discounted cumulative gain.

The rest of this article covers these metrics in detail.

2. Commonly used evaluation metrics

1. Quality of the Predictions

To measure the accuracy of a recommendation system's results, the most common prediction-error metrics are used, among which Mean Absolute Error (MAE) and its related metrics, Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and Normalized Mean Absolute Error (NMAE), are the most representative.

Symbol definition

Let $U$ denote the set of users in the test set and $I$ the set of items in the test set. $r_{ui}$ denotes user $u$'s rating of item $i$, and $\bullet$ denotes a missing rating ($r_{ui} = \bullet$ means $u$ has not rated $i$). $p_{ui}$ denotes the model's predicted rating of $u$ on $i$.

$O_u = \{ i \in I \mid p_{ui} \neq \bullet \wedge r_{ui} \neq \bullet \}$ denotes the set of items for which user $u$ has both a rating in the test set and a prediction generated by the model.

1.1 Mean Absolute Error (MAE)

The MAE of a single user $u$ is

$$\mathrm{MAE}_u = \frac{1}{|O_u|} \sum_{i \in O_u} |p_{ui} - r_{ui}|$$

and the Normalized Mean Absolute Error (NMAE) of user $u$ is

$$\mathrm{NMAE}_u = \frac{\mathrm{MAE}_u}{r_{\max} - r_{\min}}$$

where $r_{\max}$ and $r_{\min}$ are the maximum and minimum values of user $u$'s rating scale. Averaging over all users gives the system-level MAE and NMAE.

1.2 Root Mean Squared Error (RMSE)

The RMSE of a single user $u$ is

$$\mathrm{RMSE}_u = \sqrt{\frac{1}{|O_u|} \sum_{i \in O_u} (p_{ui} - r_{ui})^2}$$

Removing the square root from the equation above gives the Mean Squared Error (MSE).
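As a quick illustration, here is a minimal Python sketch of these per-user error metrics; the function names and the toy ratings are illustrative assumptions, not part of the original article.

```python
import math

def mae(predicted, actual):
    """Mean absolute error over one user's co-rated pairs."""
    return sum(abs(p - r) for p, r in zip(predicted, actual)) / len(actual)

def nmae(predicted, actual, r_max, r_min):
    """MAE normalized by the width of the rating scale."""
    return mae(predicted, actual) / (r_max - r_min)

def rmse(predicted, actual):
    """Root mean squared error over one user's co-rated pairs."""
    mse = sum((p - r) ** 2 for p, r in zip(predicted, actual)) / len(actual)
    return math.sqrt(mse)

# toy example on a 1-5 rating scale
print(mae([4.2, 3.1, 5.0], [4, 3, 4]))          # ~0.433
print(nmae([4.2, 3.1, 5.0], [4, 3, 4], 5, 1))   # ~0.108
print(rmse([4.2, 3.1, 5.0], [4, 3, 4]))         # ~0.592
```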

1.3 Coverage

The simplest definition of coverage is the proportion of all items that the recommendation system is able to recommend. Higher coverage means the model can generate recommendations for more items, which helps surface the long tail. If $R(u)$ denotes the set of items recommended to user $u$, coverage can be defined as

$$\mathrm{Coverage} = \frac{\left|\bigcup_{u \in U} R(u)\right|}{|I|}$$

Some works instead define a per-user coverage as the proportion of items for which a prediction can be computed from user $u$'s nearest-neighbor set.

In addition, information entropy and Gini coefficient can also be used to measure coverage.
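A minimal sketch of the catalog-level coverage defined above; the data structures (a dict of per-user recommendation lists and a catalog set) are illustrative assumptions.

```python
def coverage(recommendations, catalog):
    """Fraction of the item catalog that appears in at least one
    user's recommendation list.

    recommendations: dict mapping user -> iterable of recommended item ids
    catalog: set of all item ids known to the system
    """
    recommended = set()
    for items in recommendations.values():
        recommended.update(items)
    return len(recommended & catalog) / len(catalog)

# toy example: 4 of 6 catalog items are ever recommended -> coverage ~0.67
recs = {"u1": ["a", "b"], "u2": ["b", "c"], "u3": ["d"]}
print(coverage(recs, {"a", "b", "c", "d", "e", "f"}))
```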

2. Quality of the Set of recommendations

Due to data sparsity and the cold-start problem, it is sometimes difficult to predict a user's rating of an item directly. For this reason, some researchers proposed top-N recommendation: instead of predicting ratings, the system generates a set of items the user is most likely to like, based on implicit user-item interactions (such as clicks and favorites), and recommends it to the user.

In this section, we introduce the most widely used recommendation quality measures in top-N recommendation. They are: (1) Precision, the proportion of recommended items that are relevant;

(2) Recall, the proportion of all relevant items that are recommended;

(3) F1, which combines precision and recall;

(4) ROC and AUC.

Symbol definition

R(u) denotes the recommendation list generated for user u based on the user's behavior in the training set, and T(u) denotes the user's behavior list in the test set.

2.1 Precision

$$\mathrm{Precision} = \frac{\sum_{u \in U} |R(u) \cap T(u)|}{\sum_{u \in U} |R(u)|}$$

2.2 Recall

$$\mathrm{Recall} = \frac{\sum_{u \in U} |R(u) \cap T(u)|}{\sum_{u \in U} |T(u)|}$$

2.3 F1

$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
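A minimal Python sketch of these three metrics for a single user's R(u) and T(u); the system-level values simply aggregate the same counts over all users. The toy lists are illustrative.

```python
def precision_recall_f1(recommended, relevant):
    """Precision, recall and F1 for one user.

    recommended: the top-N list R(u) produced from the training data
    relevant:    the user's held-out behaviour list T(u) in the test set
    """
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# toy example: 2 of 5 recommended items appear in the 4-item test list
print(precision_recall_f1(["a", "b", "c", "d", "e"], ["b", "d", "x", "y"]))
# -> (0.4, 0.5, ~0.444)
```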

2.4 Receiver Operating Characteristic (ROC) and AUC (Area Under the Curve)

AUC is the area under the Receiver Operating Characteristic (ROC) curve; it measures how well a recommendation system distinguishes the items a user likes from those the user does not like.

Because drawing the full ROC curve is tedious, the AUC can be approximated as follows: each time, randomly pick an item $\alpha$ from the relevant set (the items the user likes) and an item $\beta$ from the irrelevant set, and compare their predicted scores. If $\alpha$'s predicted score is higher than $\beta$'s, add one point; if the two scores are equal, add 0.5 points. After $n$ such comparisons, if the predicted score of $\alpha$ was higher $n'$ times and the two scores were equal $n''$ times, the AUC can be approximated as

$$\mathrm{AUC} \approx \frac{n' + 0.5\, n''}{n}$$

Clearly, if all predicted scores were generated at random, AUC = 0.5, so the amount by which AUC exceeds 0.5 measures how much more accurate the algorithm is than random recommendation. The AUC summarizes the overall performance of a recommendation algorithm in a single value and covers all recommendation-list lengths. However, it does not account for specific ranking positions, so two algorithms with the same area under the ROC curve cannot be distinguished, which limits its applicability.
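A sketch of this sampling approximation; the item sets, scores and the number of sampled pairs are illustrative assumptions.

```python
import random

def approx_auc(scores, liked, disliked, n_pairs=10000, seed=0):
    """Approximate AUC by sampling (liked, disliked) item pairs and
    checking how often the liked item gets the higher predicted score."""
    rng = random.Random(seed)
    better = ties = 0
    for _ in range(n_pairs):
        a = rng.choice(liked)      # relevant item alpha
        b = rng.choice(disliked)   # irrelevant item beta
        if scores[a] > scores[b]:
            better += 1
        elif scores[a] == scores[b]:
            ties += 1
    return (better + 0.5 * ties) / n_pairs

# toy example with hypothetical predicted scores
scores = {"a": 0.9, "b": 0.7, "c": 0.7, "d": 0.2}
print(approx_auc(scores, liked=["a", "b"], disliked=["c", "d"]))  # ~0.875
```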

2.5 Hit Rate (HR)

HR is a very popular metric in current top-N recommendation studies. It is defined as

$$\mathrm{HR} = \frac{\#hits}{\#users}$$

where $\#users$ is the total number of users and $\#hits$ is the number of users whose test-set item appears in their top-N recommendation list.

2.6 Average Reciprocal Hit Rank (ARHR)

ARHR is also a very popular metric in current top-N recommendation. It is a weighted version of HR that measures how strongly an item is recommended, defined as

$$\mathrm{ARHR} = \frac{1}{\#users} \sum_{i=1}^{\#hits} \frac{1}{p_i}$$

where the weight $1/p_i$ is the reciprocal of the position $p_i$ at which the hit item appears in the recommendation list.
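A minimal sketch of HR and ARHR for a leave-one-out style evaluation; the per-user data layout and the toy example are assumptions made for illustration.

```python
def hit_rate_and_arhr(topn_lists, test_items):
    """HR and ARHR over all users.

    topn_lists: dict user -> ordered top-N recommendation list
    test_items: dict user -> the single held-out test item of that user
    """
    n_users = len(test_items)
    hits = 0
    reciprocal_sum = 0.0
    for user, held_out in test_items.items():
        ranked = topn_lists.get(user, [])
        if held_out in ranked:
            hits += 1
            # weight of a hit is the inverse of its 1-based position
            reciprocal_sum += 1.0 / (ranked.index(held_out) + 1)
    return hits / n_users, reciprocal_sum / n_users

# toy example: 2 of 3 users are hit, at positions 1 and 3
topn = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"], "u3": ["g", "h", "i"]}
test = {"u1": "a", "u2": "f", "u3": "z"}
print(hit_rate_and_arhr(topn, test))  # -> (~0.667, (1 + 1/3) / 3 ~ 0.444)
```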

3. Quality of the List of Recommendations

When many items are recommended, users pay more attention to the items at the top of the list, so errors there are more serious than errors further down. Weighting the evaluation by position in the ranked list takes this into account. The most commonly used ranking metrics are the following standard information-retrieval measures:

(a) Half-life utility, which assumes that the user's interest decays as they move down the recommendation list;

(b) Discounted cumulative gain, where the decay function is logarithmic;

(c) Rank-Biased Precision (RBP), where the decay follows a geometric series.

3.1 Half-Life Utility (HL)

Half-life utility is based on the assumption that the probability that a user views an item decays exponentially with the item's position in the recommendation list. It measures the utility of the recommendation list to a user, i.e. the difference between the user's actual ratings and a default rating. The expected utility for user $u$ is defined as

$$R_u = \sum_{\alpha} \frac{\max(r_{u\alpha} - d,\, 0)}{2^{(l_\alpha - 1)/(h - 1)}}$$

where $r_{u\alpha}$ is user $u$'s actual rating of item $\alpha$; $l_\alpha$ is the rank of item $\alpha$ in user $u$'s recommendation list; $d$ is the default rating (e.g. the average rating); and $h$ is the system's half-life, i.e. the list position that the user views with 50% probability. Clearly, a user's half-life utility is maximized when all of the items they like are placed at the top of the recommendation list.
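A minimal sketch of this half-life utility for one user, under the formula above; the default rating, half-life value and toy ratings are illustrative assumptions.

```python
def half_life_utility(ranked_items, ratings, d=3.0, h=5):
    """Expected utility of one user's recommendation list under the
    half-life assumption.

    ranked_items: the recommendation list, best first
    ratings:      dict item -> the user's actual rating (missing = not rated)
    d:            default / neutral rating
    h:            half-life position
    """
    utility = 0.0
    for rank, item in enumerate(ranked_items, start=1):
        gain = max(ratings.get(item, d) - d, 0.0)
        utility += gain / (2 ** ((rank - 1) / (h - 1)))
    return utility

# toy example on a 1-5 scale with default rating 3
print(half_life_utility(["a", "b", "c"], {"a": 5, "c": 4}, d=3.0, h=5))  # ~2.71
```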

3.2 Discounted Cumulative Gain (DCG)

The main idea of discounted cumulative gain (DCG) is that items the user likes contribute more to the user experience when they appear near the top of the recommendation list than near the bottom. It is defined as

$$\mathrm{DCG} = \sum_{i=1}^{L} \frac{r_i}{\max(1,\, \log_b i)}$$

where $r_i$ indicates whether the item at position $i$ is liked by the user ($r_i = 1$ if liked, $r_i = 0$ otherwise); $b$ is a free parameter, usually set to 2; and $L$ is the length of the recommendation list.

Since DCG values are not directly comparable across users, they need to be normalized. With non-negative relevance scores, the worst-case DCG is 0. For the best case, all items in the test set are placed in the ideal order, the first K are taken and their DCG is computed, giving the ideal DCG. Dividing the original DCG by this ideal DCG yields the Normalized Discounted Cumulative Gain (NDCG), a number between 0 and 1.
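A minimal sketch of DCG and NDCG with the logarithm base b as a parameter, using the discount form given above; the 0/1 relevance list in the example is illustrative.

```python
import math

def dcg(relevances, b=2):
    """DCG of a ranked list of 0/1 (or graded) relevance values; the first
    position is not discounted, position i is discounted by log_b(i)."""
    total = 0.0
    for i, rel in enumerate(relevances, start=1):
        discount = max(1.0, math.log(i, b))
        total += rel / discount
    return total

def ndcg(relevances, b=2):
    """DCG normalized by the DCG of the ideally reordered list."""
    ideal_dcg = dcg(sorted(relevances, reverse=True), b)
    return dcg(relevances, b) / ideal_dcg if ideal_dcg > 0 else 0.0

# toy example: relevant items at positions 1, 3 and 4 of a length-5 list
print(dcg([1, 0, 1, 1, 0]))   # 1 + 1/log2(3) + 1/log2(4) ~ 2.13
print(ndcg([1, 0, 1, 1, 0]))  # compared against the ideal order [1,1,1,0,0]
```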

3.3 Rank-Biased Precision (RBP)

Unlike DCG, Rank-Biased Precision (RBP) assumes that the user always views the item at the top of the recommendation list first and then moves on to the next item with a fixed probability $p$, stopping with probability $1-p$. RBP is defined as

$$\mathrm{RBP} = (1 - p) \sum_{i=1}^{L} r_i \, p^{\,i-1}$$

The only difference between RBP and DCG is that RBP discounts the probability of viewing an item geometrically with its position, while DCG discounts it logarithmically (a log-harmonic decay).
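A one-function sketch of RBP under the formula above; the persistence value p = 0.8 and the relevance list are illustrative.

```python
def rbp(relevances, p=0.8):
    """Rank-biased precision: the user continues to the next position
    with probability p, so position i is weighted by p**(i-1)."""
    return (1 - p) * sum(rel * p ** i for i, rel in enumerate(relevances))

# toy example: the same 0/1 relevance list as above, persistence p = 0.8
print(rbp([1, 0, 1, 1, 0], p=0.8))  # 0.2 * (1 + 0.64 + 0.512) ~ 0.43
```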

3.4 Mean Reciprocal Rank (MRR)

MRR takes the reciprocal of the rank at which the correct item appears in the recommendation list as the accuracy for a single query (or user), and then averages over all queries:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}$$

For example, suppose there are three queries whose best-matching results appear at ranks 3, 2 and 1 respectively; then MRR = (1/3 + 1/2 + 1)/3 = 11/18 ≈ 0.61.
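A minimal sketch of MRR that reproduces the three-query example above; the query results and correct answers are placeholders.

```python
def mean_reciprocal_rank(ranked_lists, correct_items):
    """MRR over a set of queries (or users).

    ranked_lists:  list of ranked result lists, one per query
    correct_items: the correct / best-matching item of each query
    """
    total = 0.0
    for results, correct in zip(ranked_lists, correct_items):
        if correct in results:
            total += 1.0 / (results.index(correct) + 1)  # 1-based rank
    return total / len(ranked_lists)

# the three-query example from the text: correct answers at ranks 3, 2 and 1
lists = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
correct = ["c", "e", "g"]
print(mean_reciprocal_rank(lists, correct))  # (1/3 + 1/2 + 1) / 3 ~ 0.61
```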

3.5 Mean Average Precision (MAP)

Consider searching for a keyword on Google and getting back 10 results. The best case is that all 10 results are relevant. If only some of them are relevant, say 5, the result is still fairly good as long as those 5 appear at the top; if the 5 relevant results only start appearing from the 6th position onward, the situation is much worse. This is what Average Precision (AP) captures: it is similar in spirit to recall, except that AP is a "rank-sensitive recall".

For user $u$, given the list of items recommended to them, the average precision of $u$ is

$$\mathrm{AP}_u = \frac{1}{|T(u)|} \sum_{i \in R(u) \cap T(u)} \frac{\left|\{\, j \in T(u) : p_{uj} \le p_{ui} \,\}\right|}{p_{ui}}$$

where $p_{ui}$ is the rank of item $i$ in user $u$'s recommendation list, so $p_{uj} < p_{ui}$ means that item $j$ is ranked before item $i$.

MAP is obtained by averaging $\mathrm{AP}_u$ over all users $u$.
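A minimal sketch of AP and MAP in the usual running-precision form (relevant items missing from the list simply contribute nothing); the toy lists are illustrative.

```python
def average_precision(recommended, relevant):
    """AP of one user's ranked recommendation list."""
    relevant = set(relevant)
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            # precision measured at the rank of each relevant item
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(rec_lists, relevant_lists):
    """MAP: the mean of AP over all users."""
    aps = [average_precision(r, t) for r, t in zip(rec_lists, relevant_lists)]
    return sum(aps) / len(aps) if aps else 0.0

# toy example: relevant items land at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2
print(average_precision(["a", "b", "c", "d"], ["a", "c"]))  # ~0.833
```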

3. Other evaluation metrics

Most papers focus on improving the accuracy of rating prediction (RMSE, MAE, etc.) or the precision, recall, ROC/AUC, etc. of top-N recommendation. However, to achieve higher user satisfaction, other goals should also be considered, such as topic diversity, novelty, and fairness of the recommendations.

Currently, the field is increasingly interested in algorithms that produce diverse and novel recommendations, even at the expense of some accuracy and precision. To assess these aspects, various metrics have been developed to measure the novelty and diversity of recommendations.

1. Diversity and Novelty

Let $\mathrm{sim}(i, j)$ denote the similarity between items $i$ and $j$. The diversity of user $u$'s recommendation list $R(u)$ can then be defined as

$$\mathrm{Diversity}(R(u)) = 1 - \frac{\sum_{i, j \in R(u),\, i \neq j} \mathrm{sim}(i, j)}{\tfrac{1}{2}\, |R(u)|\, (|R(u)| - 1)}$$

In addition to diversity, novelty is another important aspect of user experience. It refers to the ability to recommend items that are not yet popular or well known to the user. Recommending popular items may improve recommendation accuracy to some extent, but it lowers user satisfaction. The simplest way to measure the novelty of a recommendation list is through the similarity of the recommended items to what the user already knows: the less similar an item in the recommended set $Z_u$ is to the items the user already knows, the more novel it is to that user, where $Z_u$ denotes the set of $N$ items recommended to user $u$.
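A minimal sketch of the list diversity defined above; the pairwise similarity table and lookup helper are hypothetical stand-ins for whatever item-similarity measure the system uses.

```python
from itertools import combinations

def diversity(rec_list, sim):
    """Diversity of one recommendation list: 1 minus the average pairwise
    similarity of the recommended items.

    rec_list: list of recommended item ids
    sim:      function (i, j) -> similarity in [0, 1]
    """
    pairs = list(combinations(rec_list, 2))
    if not pairs:
        return 0.0
    avg_sim = sum(sim(i, j) for i, j in pairs) / len(pairs)
    return 1.0 - avg_sim

# toy example with a hypothetical similarity table
sims = {("a", "b"): 0.9, ("a", "c"): 0.1, ("b", "c"): 0.2}
lookup = lambda i, j: sims.get((i, j), sims.get((j, i), 0.0))
print(diversity(["a", "b", "c"], lookup))  # 1 - (0.9 + 0.1 + 0.2) / 3 = 0.6
```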

2. Stability

The stability of predictions and recommendations affects users' trust in a recommendation system. A system is stable if the predictions it provides do not change strongly over a short period of time. Adomavicius and Zhang proposed a measure of stability: the Mean Absolute Shift (MAS).

Suppose we have a set of known user ratings $R_1$, from which we predict ratings for items the users have not yet rated, obtaining a set of predicted ratings $P_1$. After a period of interaction, the users rate some of these previously unrated items; the system then predicts the scores of the remaining items in $P_1$ again, giving a new set of predictions $P_2$. MAS can then be expressed as

$$\mathrm{MAS} = \frac{1}{|P_2|} \sum_{(u,\, i) \in P_2} \left| P_2(u, i) - P_1(u, i) \right|$$
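A minimal sketch of MAS over two prediction snapshots; the dictionary layout keyed by (user, item) pairs and the toy drift values are illustrative assumptions.

```python
def mean_absolute_shift(p1, p2):
    """MAS: average absolute change between two prediction snapshots.

    p1, p2: dicts mapping (user, item) -> predicted rating; only pairs
            predicted at both points in time are compared.
    """
    common = p1.keys() & p2.keys()
    if not common:
        return 0.0
    return sum(abs(p2[k] - p1[k]) for k in common) / len(common)

# toy example: predictions drift slightly after new ratings arrive
p1 = {("u1", "a"): 4.0, ("u1", "b"): 3.5, ("u2", "a"): 2.0}
p2 = {("u1", "a"): 4.2, ("u1", "b"): 3.5, ("u2", "a"): 2.5}
print(mean_absolute_shift(p1, p2))  # (0.2 + 0.0 + 0.5) / 3 ~ 0.23
```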
