Author: Han Xiaoqi
Link: zhuanlan.zhihu.com/p/42435889
2.1 A data set contains 1000 samples: 500 positive examples and 500 negative examples. It is to be split into a training set with 70% of the samples and a test set with 30%. Estimate the number of possible splits.
A: This is a combinatorics problem.

The split should keep the class distribution of the training and test sets as consistent as possible, so the training set should contain 350 positive and 350 negative examples, with the rest forming the test set. The number of possible splits is therefore

$$C_{500}^{350} \times C_{500}^{350} = \left(C_{500}^{350}\right)^2$$
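The count above can be checked numerically; a minimal sketch using Python's `math.comb`:

```python
from math import comb

# Number of ways to choose 350 of the 500 positives and, independently,
# 350 of the 500 negatives for a stratified 70/30 split.
n_ways = comb(500, 350) ** 2
print(n_ways)  # an enormous integer, (C(500, 350))^2
```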
2.2 A data set contains 100 samples, half positive and half negative. Suppose the model produced by the learning algorithm predicts every new sample as the majority class of its training set (guessing randomly when the classes are tied). Compare the error rates estimated by 10-fold cross-validation and by the leave-one-out method.
A:
10-fold cross-validation: each fold should keep the data distribution as consistent as possible, so each of the 10 training runs uses 45 positive and 45 negative examples. The classes are tied, the model guesses randomly, and the expected error rate is 50%.
Leave-one-out: if the held-out sample is a positive example, the training set contains 50 negative and 49 positive examples, so the model predicts negative; conversely, if the held-out sample is negative, the model predicts positive. The error rate is 100%.
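The leave-one-out result can be verified with a short simulation, a sketch of the majority-vote learner described above (ties never occur here, since exactly one sample is always held out):

```python
# Majority-vote "learner" from exercise 2.2: predicts the majority class of
# its training set. With 50 positives and 50 negatives, leaving one sample
# out always leaves a majority of the *opposite* class, so every held-out
# sample is misclassified.
labels = [1] * 50 + [0] * 50

errors = 0
for i in range(len(labels)):
    train = labels[:i] + labels[i + 1:]          # 99 remaining samples
    majority = 1 if sum(train) > len(train) / 2 else 0
    if majority != labels[i]:
        errors += 1

print(errors / len(labels))  # 1.0, i.e. the 100% leave-one-out error rate
```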
2.3 If the F1 value of learner A is higher than that of learner B, determine whether the BEP value of learner A is also necessarily higher than that of learner B.
A:
Let's start with the definition of F1:

$$F1 = \frac{2 \times P \times R}{P + R}$$

where:

$$P = \frac{TP}{TP + FP}$$

is the precision, [samples predicted positive that are truly positive] / [samples predicted positive], plainly speaking, the accuracy of the positive predictions; and

$$R = \frac{TP}{TP + FN}$$

is the recall, [samples predicted positive that are truly positive] / [samples that are truly positive]. F1 weighs precision and recall equally.
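A minimal sketch computing P, R and F1 from confusion-matrix counts (the example counts are illustrative, not from the text):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean F1 from confusion-matrix counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# e.g. 40 true positives, 10 false positives, 10 false negatives:
print(precision_recall_f1(40, 10, 10))  # P = R = F1 = 0.8
```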
Now let's look at BEP.

Most classification algorithms (logistic regression, XGBoost, etc.) output a probability between 0 and 1, and classification is done with a threshold (typically 0.5): samples whose output exceeds the threshold are labeled class 1 (positive). Sort the samples by output value from largest to smallest (the "sample ranking" below); the first sample is the one most likely to be positive, and the last is the least likely. Going through the ranking from front to back, predict the samples as positive one by one (i.e., use the current sample's output as the threshold: everything at or above it is predicted positive, everything below it negative), and compute the current precision and recall at each step. Plotting recall on the horizontal axis against precision on the vertical axis and connecting the points in order yields the P-R curve. BEP (the Break-Even Point) is the value at which recall = precision.
(Figure: P-R curve)
Discussion:
By definition, the F1 value combines the recall and precision obtained after all samples are classified at a fixed threshold, while the BEP value is obtained by seeking a threshold at which recall and precision are equal (BEP = recall = precision).
In other words, the BEP value depends only on the sample ranking, not on the magnitudes of the predicted values: for the same ranking, even if all predicted values are multiplied by 0.5, the BEP value is unchanged. The F1 value, however, does change: if that multiplication pushes every predicted value below the threshold (assume 0.5), all samples are predicted negative and the F1 value becomes 0.
Back to the question itself: "if the F1 value of learner A is higher than that of learner B, is the BEP value of A also higher than that of B?" If we can find two learners with the same BEP value but different F1 values, the proposition fails. The discussion above already makes the answer clear. Imagine that learner A's outputs are exactly twice learner B's, so their rankings (and hence their BEP values) are identical. If A's outputs lie in (0, 1), then B's lie in (0, 0.5); with a threshold of 0.5, B's F1 value is 0 while A's F1 value is between 0 and 1. So the proposition does not hold.
P.S. Intuitively, there is no clean relationship between the BEP value and the F1 value. The "multiply all outputs by 0.5" example in the discussion generalizes: picture a fixed ranking of points whose output probabilities all shift forward or backward within (0, 1) (each point can shift by a different amount, as long as the ranking is preserved). The BEP value never changes, while the F1 value keeps changing.
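The ranking-invariance argument can be checked numerically; a minimal sketch (the scores and labels are made-up illustrations) that finds the BEP by sweeping the ranking:

```python
def bep(scores, labels):
    """Break-Even Point: precision at the P-R point where precision is
    closest to recall. Depends only on the *ranking* of the scores."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    best = None  # (|precision - recall|, precision)
    for k in range(1, len(scores) + 1):
        tp = sum(labels[i] for i in order[:k])   # positives among top-k
        p, r = tp / k, tp / n_pos
        if best is None or abs(p - r) < best[0]:
            best = (abs(p - r), p)
    return best[1]

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(bep(scores, labels))                        # 2/3 for this ranking
print(bep([s * 0.5 for s in scores], labels))     # identical: same ranking
```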
2.4 Describe the relationships among the true positive rate (TPR), false positive rate (FPR), precision (P), and recall (R).
A: Confusion matrix:

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actually positive | TP | FN |
| Actually negative | FP | TN |
Recall:

$$R = \frac{TP}{TP + FN}$$

[samples predicted positive that are truly positive] / [samples that are truly positive].

Precision:

$$P = \frac{TP}{TP + FP}$$

[samples predicted positive that are truly positive] / [samples predicted positive].

True positive rate (TPR): identical to recall,

$$TPR = \frac{TP}{TP + FN}$$

False positive rate (FPR):

$$FPR = \frac{FP}{TN + FP}$$

the proportion of all negative examples that are predicted positive.
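A small sketch computing TPR and FPR from labels and predictions (the example vectors are illustrative):

```python
def roc_rates(y_true, y_pred):
    """TPR (= recall) and FPR from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (tn + fp)

print(roc_rates([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))  # TPR = 2/3, FPR = 1/2
```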
2.5 Prove equation (2.22): $AUC = 1 - \ell_{rank}$.
A:
A: In the ROC curve, a horizontal segment corresponds to one (or more) negative examples, a vertical segment corresponds to one (or more) positive examples, and a diagonal segment corresponds to multiple positive and negative examples that share the same predicted value.

Consider the area under the curve one strip at a time. Let $v$ be the predicted value shared by the negative examples driving a horizontal (or diagonal) step, and let $m^+$ and $m^-$ be the numbers of positive and negative examples. The width of the strip under this step is

$$\frac{\left|\{x^- : f(x^-) = v\}\right|}{m^-},$$

its lower edge has ordinate

$$\frac{\left|\{x^+ : f(x^+) > v\}\right|}{m^+},$$

and its upper edge has ordinate

$$\frac{\left|\{x^+ : f(x^+) \ge v\}\right|}{m^+}.$$

The strip is a trapezoid, so its area is the width times the average of the two ordinates:

$$\frac{1}{m^+ m^-} \left|\{x^- : f(x^-) = v\}\right| \left( \left|\{x^+ : f(x^+) > v\}\right| + \frac{1}{2}\left|\{x^+ : f(x^+) = v\}\right| \right).$$

Let $V$ denote the set of distinct predicted values among the negative examples. Summing the strips over $v \in V$ gives the AUC:

$$AUC = \frac{1}{m^+ m^-} \sum_{x^- \in D^-} \left( \left|\{x^+ : f(x^+) > f(x^-)\}\right| + \frac{1}{2}\left|\{x^+ : f(x^+) = f(x^-)\}\right| \right) = \frac{1}{m^+ m^-} \sum_{x^+ \in D^+} \sum_{x^- \in D^-} \left( \mathbb{I}\left(f(x^+) > f(x^-)\right) + \frac{1}{2}\mathbb{I}\left(f(x^+) = f(x^-)\right) \right).$$

Meanwhile, by the definition of $\ell_{rank}$,

$$\ell_{rank} = \frac{1}{m^+ m^-} \sum_{x^+ \in D^+} \sum_{x^- \in D^-} \left( \mathbb{I}\left(f(x^+) < f(x^-)\right) + \frac{1}{2}\mathbb{I}\left(f(x^+) = f(x^-)\right) \right),$$

and for every pair $(x^+, x^-)$,

$$\mathbb{I}\left(f(x^+) > f(x^-)\right) + \mathbb{I}\left(f(x^+) < f(x^-)\right) + \mathbb{I}\left(f(x^+) = f(x^-)\right) = 1,$$

so

$$AUC + \ell_{rank} = \frac{1}{m^+ m^-} \sum_{x^+ \in D^+} \sum_{x^- \in D^-} 1 = 1,$$

that is, $AUC = 1 - \ell_{rank}$.
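The identity can be sanity-checked numerically; a sketch that evaluates the pairwise forms of AUC and ℓ_rank on made-up scores (including a tie):

```python
def auc_and_lrank(scores_pos, scores_neg):
    """Pairwise AUC and ranking loss over all (positive, negative) pairs;
    ties contribute 1/2 to each, so the two must sum to 1."""
    m_pos, m_neg = len(scores_pos), len(scores_neg)
    auc = lrank = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                auc += 1.0
            elif sp == sn:
                auc += 0.5
                lrank += 0.5
            else:
                lrank += 1.0
    return auc / (m_pos * m_neg), lrank / (m_pos * m_neg)

auc, lrank = auc_and_lrank([0.9, 0.7, 0.5], [0.7, 0.4, 0.2])
print(auc, lrank, auc + lrank)  # the sum is exactly 1.0
```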
2.6 Describe the relationship between the error rate and the ROC curve.

A: The error rate is computed at a fixed threshold, while the ROC curve is traced out as the threshold sweeps through the samples' predicted values. Every point on the ROC curve therefore corresponds to an error rate.
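This correspondence can be made concrete: given a ROC point (FPR, TPR) and the class counts, the overall error rate follows directly (a sketch with illustrative numbers):

```python
def error_rate_at_roc_point(tpr, fpr, m_pos, m_neg):
    """Overall error rate at a given ROC point: false positives plus
    false negatives, divided by the total number of samples."""
    fp = fpr * m_neg          # negatives wrongly predicted positive
    fn = (1 - tpr) * m_pos    # positives wrongly predicted negative
    return (fp + fn) / (m_pos + m_neg)

print(error_rate_at_roc_point(0.8, 0.1, 50, 50))  # (5 + 10) / 100 = 0.15
```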
2.7 Prove that every ROC curve has a corresponding cost curve, and vice versa.
A:
First, "every ROC curve has a corresponding cost curve": each point (FPR, TPR) on the ROC curve corresponds to one line segment on the cost plane, from (0, FPR) to (1, FNR) where FNR = 1 − TPR, and the lower envelope of all such segments yields a unique cost curve.

Conversely, a cost curve is (with finite samples) a polyline, and each of its edges corresponds to a line segment on the cost plane. By traversing the edges from left to right, the points of the ROC curve can be recovered from left to right.

P.S. An ROC curve corresponds to a unique cost curve, but one cost curve can correspond to multiple different ROC curves. If, say, three segments all pass through the same point of the lower envelope, removing the middle segment leaves the cost curve unchanged while the ROC curve loses one point.
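The segment-to-envelope construction can be sketched directly (the ROC points below are illustrative, and sampling the cost axis on a finite grid is an assumption of this sketch):

```python
def cost_curve(roc_points, n_grid=101):
    """Lower envelope of the cost-plane segments induced by ROC points.

    Each ROC point (FPR, TPR) maps to the segment from (0, FPR) to
    (1, FNR) with FNR = 1 - TPR; the cost curve is the pointwise minimum
    over all segments, sampled on a grid of the horizontal cost axis.
    """
    envelope = []
    for i in range(n_grid):
        p = i / (n_grid - 1)  # position along the horizontal cost axis
        envelope.append(min((1 - p) * fpr + p * (1 - tpr)
                            for fpr, tpr in roc_points))
    return envelope

env = cost_curve([(0.0, 0.0), (0.2, 0.8), (1.0, 1.0)])
print(max(env))  # 0.2: the flat segment from (0.2, 0.8) dominates the middle
```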
2.8 Discuss the advantages and disadvantages of min-max normalization and z-score normalization.
A:
Min-max normalization. Advantages: 1. The computation is relatively simple. 2. When a new sample arrives, the normalized values need to be recomputed only if it exceeds the current maximum or falls below the current minimum. Disadvantages: 1. It is easily distorted by high-leverage points and outliers.

Z-score normalization. Advantages: 1. Low sensitivity to outliers. Disadvantages: 1. The computation is more complex. 2. Normalization must be recomputed every time a new sample arrives.
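Both schemes are easy to sketch from scratch; the outlier sensitivity of min-max scaling shows up immediately (using the population standard deviation here is an assumption; the sample standard deviation would also be reasonable):

```python
def min_max(xs):
    """Scale values linearly into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Center to mean 0 and scale by the (population) standard deviation."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # one outlier
print(min_max(data))   # outlier squeezes the other points into [0, ~0.03]
print(z_score(data))   # outlier has far less influence on the spread
```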
2.9 Briefly describe the process of the chi-square test.

A: Omitted.
2.10 Describe the difference between equations (2.34) and (2.35) used in Friedman test.
A: Omitted for now; I haven't studied the statistics, so I'll fill this in later.