
In machine learning, evaluation metrics are usually chosen according to the actual business scenario. Different kinds of problems, such as regression, classification, and ranking, call for different evaluation metrics.

Accuracy, precision, recall, and F1 score

Definitions

  • Accuracy: the proportion of correctly classified samples among all samples, $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$.
  • Precision: the proportion of samples predicted as positive that are actually positive, $P = \frac{TP}{TP + FP}$.
  • Recall: the proportion of actual positive samples that are correctly predicted as positive, $R = \frac{TP}{TP + FN}$.
  • F1 score: the harmonic mean of precision and recall, $F_1 = \frac{2 \cdot P \cdot R}{P + R}$.

Calculation

Context: suppose there are 100 ads; a user is interested in 20 of them and not interested in the other 80. The goal is to find all the ads the user is interested in. The model selects 40 ads as positive, and 10 of them are ads the user is actually interested in. This gives the confusion matrix:

|                    | Actually positive | Actually negative |
| ------------------ | ----------------- | ----------------- |
| Predicted positive | TP = 10           | FP = 30           |
| Predicted negative | FN = 10           | TN = 50           |

From this confusion matrix we can compute: $\text{Accuracy} = \frac{10 + 50}{100} = 0.6$, $P = \frac{10}{40} = 0.25$, $R = \frac{10}{20} = 0.5$, and $F_1 = \frac{2 \times 0.25 \times 0.5}{0.25 + 0.5} \approx 0.33$.
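As a minimal sketch, the same numbers can be computed directly from the confusion-matrix counts above, using nothing beyond plain Python:

```python
# Confusion-matrix counts from the ad example above.
TP, FP, FN, TN = 10, 30, 10, 50

accuracy = (TP + TN) / (TP + TN + FP + FN)          # 0.6
precision = TP / (TP + FP)                          # 0.25
recall = TP / (TP + FN)                             # 0.5
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.333

print(accuracy, precision, recall, f1)
```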

Advantages and disadvantages

Accuracy, precision, recall, and F1 are mainly used in classification scenarios.

Accuracy can be understood as the probability of a correct prediction. Its weakness is that when the positive and negative classes are highly imbalanced, the majority class dominates the metric. For example, in outlier detection where 99.9% of samples are non-outliers, a model that labels every sample as a non-outlier already achieves very high accuracy.

Precision can be interpreted as how many of the items predicted for the user are ones the user is actually interested in, and recall as how many of the items the user is interested in are actually retrieved. Precision and recall are generally contradictory measures: improving one tends to lower the other. To characterize a learner's performance on both at once, we introduce the F1 score.

In some fields we may weight precision and recall differently, so we introduce the more general $F_\beta$ score to express this preference:

$$F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$$

When $\beta < 1$, precision carries more weight; when $\beta > 1$, recall carries more weight ($\beta = 1$ recovers $F_1$).
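A small sketch of how $\beta$ shifts the balance; the helper `f_beta` is ours, not a library API:

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """General F-beta score: beta < 1 favors precision, beta > 1 favors recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.25, 0.5, beta=1.0))  # ≈ 0.333, the plain F1
print(f_beta(0.25, 0.5, beta=2.0))  # ≈ 0.417, recall weighted more heavily
```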

P-R, ROC, and AUC

Definitions

  • P-R curve: recall on the horizontal axis, precision on the vertical axis.
  • Receiver Operating Characteristic (ROC) curve: the curve traced out by the true positive rate (TPR) and the false positive rate (FPR) as the classification threshold varies, with FPR on the horizontal axis and TPR on the vertical axis. If the ROC curve is smooth, the model is usually not badly overfit.
  • Area Under the Curve (AUC): the area under the ROC curve from (0, 0) to (1, 1), which measures the generalization ability of a binary classifier. An equivalent reading: AUC is the probability that the model ranks a randomly chosen positive sample above a randomly chosen negative sample.

Calculation

P-R

Each point on the P-R curve corresponds to a threshold: the model treats samples scored above the threshold as positive and samples scored below it as negative.

The precisions of models A and B differ at different recall levels, so judging a model's performance at a single point is one-sided; only the overall P-R curve gives a comprehensive evaluation.
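A minimal sketch of tracing a P-R curve with scikit-learn; the labels and scores below are toy values, not the models A and B above:

```python
from sklearn.metrics import precision_recall_curve

y_true = [1, 0, 1, 0]            # toy ground-truth labels
y_score = [0.8, 0.4, 0.35, 0.1]  # toy predicted probabilities

# One (precision, recall) point per candidate threshold.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(precision, recall, thresholds)
```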

ROC and AUC

In addition to F1 and the P-R curve, ROC and AUC can also reflect a model's overall performance. Consider four samples of a binary classification problem, with their true labels and predicted probabilities of belonging to the positive class:

| True label            | P   | N   | P    | N   |
| --------------------- | --- | --- | ---- | --- |
| Predicted probability | 0.8 | 0.4 | 0.35 | 0.1 |

To trace the curve, we sort the samples by score and take each score as the threshold in turn, so the thresholds are successively 0.8, 0.4, 0.35, and 0.1.

Then we compute TPR and FPR at each threshold. Take the threshold 0.35 as an example: the samples scored 0.8, 0.4, and 0.35 are predicted positive, giving

|                    | Actually positive | Actually negative |
| ------------------ | ----------------- | ----------------- |
| Predicted positive | TP = 2            | FP = 1            |
| Predicted negative | FN = 0            | TN = 1            |

True positive rate: $TPR = \frac{TP}{TP + FN} = \frac{2}{2} = 1$

False positive rate: $FPR = \frac{FP}{FP + TN} = \frac{1}{2} = 0.5$

This yields the coordinates of one point on the ROC curve: $(FPR, TPR) = (0.5, 1)$.

Repeating this for all four thresholds gives:

| Threshold | 0.8 | 0.4 | 0.35 | 0.1 |
| --------- | --- | --- | ---- | --- |
| FPR       | 0   | 0.5 | 0.5  | 1   |
| TPR       | 0.5 | 0.5 | 1    | 1   |

Plotting these points produces the ROC curve.

We can see that the ROC curve is generated by sweeping the classifier's threshold, each threshold contributing one point on the curve. The ROC curve generally lies above the diagonal, so the AUC is generally between 0.5 and 1; the larger the AUC, the better the classification performance.
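The same four-sample example can be checked with scikit-learn, assuming it is available:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 0, 1, 0]            # true labels: P, N, P, N
y_score = [0.8, 0.4, 0.35, 0.1]  # predicted probabilities of the positive class

# roc_curve sweeps the threshold from high to low and returns the FPR/TPR arrays.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr, thresholds)

print(roc_auc_score(y_true, y_score))  # AUC = 0.75 for this toy data
```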

Advantages and disadvantages

P-R, ROC, and AUC are used in classification scenarios.

Compared with the P-R curve, the ROC curve has one notable property: its shape does not change much when the distribution of positive and negative samples changes, whereas the P-R curve can change dramatically.

For example, when the number of negative samples in the test set increases tenfold, the P-R curve changes significantly while the ROC curve remains basically unchanged. In practice the numbers of positive and negative samples are often imbalanced, which explains why the ROC curve is more widely used.

MSE, RMSE, MAE, and R²

Definitions

  • MSE (Mean Squared Error): $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
  • RMSE (Root Mean Squared Error): $\text{RMSE} = \sqrt{\text{MSE}}$
  • MAE (Mean Absolute Error): $\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
  • $R^2$, the coefficient of determination: $R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}$
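A minimal NumPy sketch of these four definitions on toy data:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # toy ground truth
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # toy predictions

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_true - y_pred))
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)

print(mse, rmse, mae, r2)
```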

Advantages and disadvantages

Mainly used in regression models.

MSE and RMSE reflect well the deviation between a regression model's predictions and the true values. However, if some outliers deviate very strongly, the RMSE deteriorates even when the outliers are few, because the errors are squared. There are three main ways to deal with this:

  1. If the points are considered outliers, filter them out during data preprocessing.
  2. If they are not outliers, improve the model's predictive power and model the mechanism that produces them.
  3. Alternatively, switch to a more robust evaluation metric, such as MAPE (Mean Absolute Percentage Error): $\text{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\%$.

Cosine distance applications

The distance between samples can be defined in many ways, such as Euclidean distance, Manhattan distance, Hamming distance, and cosine distance. Here we focus on cosine distance and its applications.

Definition

Cosine similarity is defined as:

$$\cos(A, B) = \frac{A \cdot B}{\|A\|_2 \, \|B\|_2}$$

Its value lies in $[-1, 1]$.

If we want a distance-like quantity instead, we simply subtract the cosine similarity from 1:

$$\text{dist}(A, B) = 1 - \cos(A, B)$$

Its value lies in $[0, 2]$.

Note that although we call this the cosine distance, it is not a strictly defined distance. A strict definition of distance (a metric) must satisfy non-negativity, symmetry, and the triangle inequality.

  • Non-negativity: since $\cos(A, B) \le 1$, we have $\text{dist}(A, B) = 1 - \cos(A, B) \ge 0$.

    In particular, $\text{dist}(A, B) = 0$ holds exactly when $A$ and $B$ point in the same direction.

  • Symmetry: $\text{dist}(A, B) = 1 - \cos(A, B) = 1 - \cos(B, A) = \text{dist}(B, A)$.

  • Triangle inequality: this does not hold in general.

    A counterexample: let $A = (1, 0)$, $B = (1, 1)$, $C = (0, 1)$. Then $\text{dist}(A, B) = 1 - \frac{\sqrt{2}}{2} \approx 0.29$ and $\text{dist}(B, C) = 1 - \frac{\sqrt{2}}{2} \approx 0.29$, while $\text{dist}(A, C) = 1$.

    Thus $\text{dist}(A, B) + \text{dist}(B, C) \approx 0.59 < 1 = \text{dist}(A, C)$.

So this shows that cosine distance does not satisfy the definition of a distance.
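The counterexample is easy to verify numerically; a minimal NumPy sketch:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance: 1 - cos(a, b)."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A, B, C = np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 1.0])

d_ab, d_bc, d_ac = cosine_distance(A, B), cosine_distance(B, C), cosine_distance(A, C)
print(d_ab + d_bc, d_ac)  # ≈ 0.586 < 1.0: the triangle inequality fails
```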

Advantages and disadvantages

Cosine similarity captures the angular relationship between two vectors, not their absolute magnitudes. Its most direct advantage in recommender systems is that different users rate movies on different scales: some users are strict and give low scores on average, while others are lenient and give high scores on average. Cosine similarity removes this rating-scale interference and focuses on relative differences.

In general, Euclidean distance reflects absolute differences in magnitude, while cosine distance reflects relative differences in direction.

A/B testing

A/B testing is the primary means of verifying a model's final effect. An A/B test usually involves two (or more) groups, group A and group B: the first is a control group, and the second changes one or more factors under test.

Why A/B tests are needed

  1. Offline evaluation cannot eliminate the influence of model overfitting, so offline results cannot fully replace online results.
  2. Offline evaluation cannot fully reproduce the online engineering environment, including issues such as data loss and missing labels.
  3. Some metrics cannot be evaluated offline, such as user click-through rate, retention time, and page views (PV).

Theoretical basis

Central limit theorem: given a population with an arbitrary distribution, randomly draw n samples from it, m separate times, and take the mean of each of the m groups. The distribution of these means is approximately normal.

The central limit theorem is the foundation for analyzing A/B test data: by random sampling we can estimate the mean and variance of the population.
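A minimal simulation of this, assuming NumPy: draw repeatedly from a skewed exponential population and watch the sample means come out approximately normal:

```python
import numpy as np

rng = np.random.default_rng(0)

# m = 10,000 groups of n = 50 draws from a skewed exponential population.
sample_means = rng.exponential(scale=2.0, size=(10_000, 50)).mean(axis=1)

# The means cluster around the population mean (2.0) with spread
# roughly sigma / sqrt(n) = 2 / sqrt(50) ≈ 0.28.
print(sample_means.mean(), sample_means.std())
```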

Design principles

Users are split into buckets, forming an experimental group and a control group: the experimental group uses the new model and the control group uses the old model. During bucketing, ensure the samples are independent and the assignment is unbiased, so that the same user always falls into the same bucket; a minimal bucketing sketch follows.
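The salt and bucket names below are illustrative assumptions, not a standard API:

```python
import hashlib

def assign_bucket(user_id: str, salt: str = "exp-2024-01") -> str:
    """Hash the user ID with an experiment salt so the same user
    always lands in the same bucket."""
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

print(assign_bucket("user_42"))  # deterministic: same user, same bucket
```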

Hypothesis testing

The basic principle of hypothesis testing is to make an assumption about some characteristic of the population and then, through statistical inference on a sample, decide whether that assumption should be rejected or accepted. In other words, we must decide whether to believe the null hypothesis or the alternative hypothesis.

The general steps are as follows:

  1. Pose the question (state the null hypothesis and the alternative hypothesis; the two are complementary);
  2. Collect evidence (compute the p-value: the probability of obtaining the observed sample mean, or something more extreme, when the null hypothesis is true);
  3. Fix the judgment criterion (the significance level $\alpha$, commonly 0.1%, 1%, or 5%);
  4. Draw the conclusion (if $p \le \alpha$, reject the null hypothesis; otherwise accept it).

The essence of hypothesis testing is to construct a reasonable test statistic from the available data: when the statistic exceeds some critical value, we abandon the null hypothesis; otherwise we continue to believe it.

Common hypothesis tests include the t-test, the z-test, and the chi-square test.

t-test

Also known as Student's t-test, it is used for normally distributed data with a small sample size (e.g., n < 30) and an unknown population standard deviation σ. Its purpose is to compare the sample mean, which represents an unknown population mean μ, with a known population mean.

Applicable conditions:

  1. A known population mean;
  2. An obtainable sample mean and sample standard deviation;
  3. Samples drawn from a normal or near-normal population.

Steps:

  1. State the hypothesis, i.e., assume there is no significant difference between the two population means;
  2. Compute the t statistic; different problem types use different formulas;
  3. Using the degrees of freedom, look up the critical t value in the t table at significance level 0.01 or 0.05;
  4. Compare the computed t value with the critical value, infer the probability of the observed difference, and judge significance accordingly.
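A minimal sketch of a two-sample t-test with SciPy; the per-bucket metric values are toy data:

```python
from scipy import stats

control = [0.52, 0.48, 0.50, 0.47, 0.53, 0.49]    # toy metric, old model
treatment = [0.55, 0.57, 0.52, 0.56, 0.54, 0.58]  # toy metric, new model

t_stat, p_value = stats.ttest_ind(treatment, control)
print(t_stat, p_value)  # reject the null hypothesis if p_value < alpha
```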

z-test

The z-test is generally used to test differences between means for large samples (sample size greater than 30). It uses the standard normal distribution to infer the probability of the observed difference and thereby judge whether the difference between two means is significant.

Steps:

  1. State the null hypothesis, i.e., first assume there is no significant difference between the two means;
  2. Compute the z statistic; different problem types use different formulas;
  3. Compare the computed z value with the critical z value, infer the probability of the observed difference, and judge significance accordingly.
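A minimal sketch of a two-sample z-test computed by hand; the summary statistics are toy values:

```python
import math
from scipy.stats import norm

mean_a, std_a, n_a = 0.50, 0.10, 2000  # control: mean, std, sample size
mean_b, std_b, n_b = 0.52, 0.11, 2000  # treatment

z = (mean_b - mean_a) / math.sqrt(std_a**2 / n_a + std_b**2 / n_b)
p_value = 2 * (1 - norm.cdf(abs(z)))   # two-sided test
print(z, p_value)
```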

Chi-square test

The first two are tests for normally distributed data; the chi-square test is a nonparametric test. It is mainly used to compare rates (composition ratios) across two or more samples and to analyze the association between two categorical variables. Its basic idea is to compare how well the theoretical (expected) frequencies agree with the observed frequencies.

The chi-square test is a common hypothesis-testing method based on the $\chi^2$ distribution, and its null hypothesis is that there is no difference between the observed and expected frequencies.

The basic idea of the chi-square test is: first assume the null hypothesis $H_0$ is true, and on that premise compute the $\chi^2$ value, which measures the degree of deviation between the observed and theoretical values. From the $\chi^2$ distribution and the degrees of freedom, we can determine the probability P of obtaining the current statistic, or a more extreme one, under $H_0$. If P is very small, the observed values deviate too much from the theoretical values and the null hypothesis should be rejected, indicating a significant difference; otherwise the null hypothesis is accepted.

The $\chi^2$ value measures the degree of deviation between observed and theoretical values; the general steps for constructing it are as follows:

  1. Let A denote the observed frequency of some category and E the expected frequency computed under the hypothesis; the difference $A - E$ is called the residual.
  2. A residual reflects the deviation between the observed and theoretical values for one category, but simply summing raw residuals across categories is inadequate: being positive and negative, they cancel out and always sum to 0, so we square the residuals before summing.
  3. The size of a residual is also relative: a residual of 20 is very large when the expected frequency is 10 but very small when the expected frequency is 1000. To account for this, each squared residual is divided by its expected frequency before summing, which estimates the difference between the observed and expected frequencies.

After these operations, we obtain the commonly used $\chi^2$ statistic:

$$\chi^2 = \sum_{i=1}^{k} \frac{(A_i - E_i)^2}{E_i} = \sum_{i=1}^{k} \frac{(A_i - n p_i)^2}{n p_i}$$

Here $A_i$ is the observed frequency at level $i$, $E_i$ the expected frequency at level $i$, $n$ the total frequency, and $p_i$ the expected probability at level $i$, so $E_i = n p_i$; $k$ is the number of cells. When $n$ is large, the statistic approximately follows a chi-square distribution with $k - 1$ degrees of freedom (minus the number of parameters estimated from the data).
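A minimal sketch of the statistic on toy frequencies, both by hand and with SciPy (the observed and expected rows sum to the same total, as `chisquare` requires):

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([30, 14, 34, 45, 57, 20])  # toy observed frequencies
expected = np.array([20, 20, 30, 40, 60, 30])  # toy expected frequencies

chi2_by_hand = np.sum((observed - expected) ** 2 / expected)
chi2, p_value = chisquare(observed, f_exp=expected)
print(chi2_by_hand, chi2, p_value)
```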

Example: a test of independence

An organization wants to know whether gender is related to views on income. It randomly sampled 500 people and asked their opinion; the answers fell into three categories: “related”, “unrelated”, and “hard to say”.

  1. Null hypothesis H0: gender has nothing to do with views on income.
  2. The degrees of freedom are (3 − 1) × (2 − 1) = 2, and the significance level is α = 0.05.
  3. Compute the expected counts of men and women holding each view: each expected count equals the row total times the column total divided by the grand total. In the original spreadsheet this is entered as “=B5*E3/E5” in cell B9, and similarly for the other cells.
  4. Compute the chi-square terms using the formula above: enter “=(B3-B9)^2/B9” in cell B15, and so on for the remaining cells, then sum the terms.
  5. The resulting statistic is 14.32483. At significance level 0.05 with 2 degrees of freedom, the critical value of the chi-square distribution is 5.9915.
  6. Since the statistic 14.32483 is greater than the critical value 5.9915, the null hypothesis is rejected: gender is related to views on income (a SciPy sketch of the same kind of test follows).
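The same kind of independence test can be run with SciPy's `chi2_contingency`; the 2×3 counts below are hypothetical stand-ins, since the original survey table is not shown here:

```python
from scipy.stats import chi2_contingency

# Rows: male, female; columns: "related", "unrelated", "hard to say".
# These counts are illustrative, not the original 500-person data.
table = [[120, 80, 50],
         [90, 110, 50]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value, dof)  # reject H0 if p_value < 0.05
```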

