Original link: tecdat.cn/?p=11160
For classification problems, classifier performance is typically defined in terms of the confusion matrix associated with the classifier. From the confusion matrix, sensitivity (recall), specificity, and accuracy can be calculated.
All of these performance metrics are readily available for binary classification problems.
Non-scoring classifier data
To demonstrate the performance metrics of non-scoring classifiers in a multi-class setting, let us consider a classification problem with \(N = 100\) observations and five classes \(G = \{1, \ldots, 5\}\):
ref.labels <- c(rep("A", 45), rep("B", 10), rep("C", 15), rep("D", 25), rep("E", 5))
predictions <- c(rep("A", 35), rep("E", 5), rep("D", 5),
                 rep("B", 9), rep("D", 1),
                 rep("C", 7), rep("B", 5), rep("C", 3),
                 rep("D", 23), rep("C", 2),
                 rep("E", 1), rep("A", 2), rep("B", 2))
# store the labels as factors so that the class levels are available later
df <- data.frame("Prediction" = factor(predictions), "Reference" = factor(ref.labels))
Accuracy and weighted accuracy
In general, the multi-class accuracy is defined as the average of correct predictions:

\[ \text{acc} = \frac{1}{N} \sum_{k=1}^{|G|} \sum_{i:\, y_i = k} I\left(\hat{y}_i = y_i\right) \]

where \(I\) is an indicator function that returns 1 if the predicted class \(\hat{y}_i\) matches the true class \(y_i\), and 0 otherwise.
To be more sensitive to the performance of individual classes, we can assign each class a weight \(w_k\) such that \(\sum_{k=1}^{|G|} w_k = 1\). The larger the value of \(w_k\) for an individual class, the greater the influence of that class's observations on the weighted accuracy. The weighted accuracy is given by:

\[ \text{acc}_w = \sum_{k=1}^{|G|} w_k \, \frac{1}{N_k} \sum_{i:\, y_i = k} I\left(\hat{y}_i = y_i\right) \]

where \(N_k\) is the number of observations belonging to class \(k\).
To weight all classes equally in the weighted average, we can set \(w_k = \frac{1}{|G|}\) for all \(k \in \{1, \ldots, |G|\}\). Note that when anything other than equal weights is used, it is hard to find a principled argument for a particular combination of weights.
Calculating accuracy and weighted accuracy
The accuracy is easy to calculate:
calculate.accuracy <- function(predictions, ref.labels) {
    return(length(which(predictions == ref.labels)) / length(ref.labels))
}
calculate.w.accuracy <- function(predictions, ref.labels, weights) {
    lvls <- levels(ref.labels)
    if (length(weights) != length(lvls)) {
        stop("Number of weights should agree with the number of classes.")
    }
    if (sum(weights) != 1) {
        stop("Weights do not sum to 1")
    }
    # per-class accuracy, then a weighted mean over the classes
    accs <- lapply(lvls, function(x) {
        idx <- which(ref.labels == x)
        return(calculate.accuracy(predictions[idx], ref.labels[idx]))
    })
    acc <- weighted.mean(unlist(accs), weights)
    return(acc)
}
acc <- calculate.accuracy(df$Prediction, df$Reference)
print(paste0("Accuracy is: ", round(acc, 2)))
## [1] "Accuracy is: 0.78"
## [1] "Weighted accuracy is: 0.69"
Micro and macro averages of the F1 score
The micro and macro averages represent two ways of interpreting confusion matrices in a multi-class setting. Here, we need to compute a confusion matrix for every class \(g_i \in G = \{1, \ldots, K\}\) such that the \(i\)-th confusion matrix considers class \(g_i\) as the positive class, while all other classes \(g_j\) with \(j \neq i\) are treated as the negative class.
To illustrate why including the true negatives can be problematic, imagine that there are 10 classes with 10 observations each. Then the confusion matrix for one of the classes might have the following structure:
Prediction/Reference | Class 1 | Other classes |
---|---|---|
Class 1 | 8 | 10 |
Other classes | 2 | 80 |
Based on this matrix, the specificity would be \(\frac{80}{80 + 10} = 88.9\%\), even though class 1 was predicted correctly in only 8 of the 18 cases in which it was predicted (a precision of only 44.4%).
In the following, we will use \(TP_i\), \(FP_i\), and \(FN_i\) to denote the true positives, false positives, and false negatives in the confusion matrix associated with class \(i\). Furthermore, let precision be denoted by \(P\) and recall by \(R\).
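Based on these quantities, the micro- and macro-averaged F1 scores referred to below can be written as follows (these are the standard definitions and match the R code used later in this post), with \(P_i = \frac{TP_i}{TP_i + FP_i}\) and \(R_i = \frac{TP_i}{TP_i + FN_i}\) denoting the precision and recall of class \(i\). The micro average pools the counts over all classes:

\[ P_{\text{micro}} = \frac{\sum_{i=1}^{|G|} TP_i}{\sum_{i=1}^{|G|} \left(TP_i + FP_i\right)}, \quad R_{\text{micro}} = \frac{\sum_{i=1}^{|G|} TP_i}{\sum_{i=1}^{|G|} \left(TP_i + FN_i\right)}, \quad F_1^{\text{micro}} = 2\,\frac{P_{\text{micro}} \cdot R_{\text{micro}}}{P_{\text{micro}} + R_{\text{micro}}} \]

The macro average first averages precision and recall over the individual classes and then takes their harmonic mean:

\[ P_{\text{macro}} = \frac{1}{|G|} \sum_{i=1}^{|G|} P_i, \quad R_{\text{macro}} = \frac{1}{|G|} \sum_{i=1}^{|G|} R_i, \quad F_1^{\text{macro}} = 2\,\frac{P_{\text{macro}} \cdot R_{\text{macro}}}{P_{\text{macro}} + R_{\text{macro}}} \]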
Calculating the micro and macro averages in R
Here I demonstrate how to calculate the micro- and macro-averaged F1 scores in R.
We will use the confusionMatrix function from the caret package to determine the confusion matrices:
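The list cm of one-vs-all confusion matrices used below is not constructed in this excerpt; a minimal sketch, assuming the caret package and the df data frame defined above, is:

library(caret)
# one confusion matrix per class, with that class marked as the positive class
cm <- vector("list", length(levels(df$Reference)))
for (i in seq_along(cm)) {
    positive.class <- levels(df$Reference)[i]
    # the 5x5 table is the same each time; only the 'positive' attribute differs
    cm[[i]] <- confusionMatrix(df$Prediction, df$Reference, positive = positive.class)
}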
Now we can summarize the performance of all classes:
metrics <- c("Precision", "Recall")
print(cm[[1]]$byClass[, metrics])
## Precision Recall
## Class: A 0.9459459 0.7777778
## Class: B 0.5625000 0.9000000
## Class: C 0.8333333 0.6666667
## Class: D 0.7931034 0.9200000
## Class: E 0.1666667 0.2000000
These values indicate that, overall, performance is quite high. However, our hypothetical classifier underperforms for individual classes such as class B (precision) and class E (both precision and recall). We will now investigate how the micro and macro averages of the F1 score are influenced by the model's predictions.
Overall performance with the micro-averaged F1
To compute the micro-averaged F1, the per-class counts \(TP_i\), \(FP_i\), and \(FN_i\) are extracted from the confusion matrices; the function then simply sums these counts and computes \(F_1^{\text{micro}}\) as defined above.
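The definition of get.micro.f1 is not included in this excerpt. Below is a minimal sketch of how the counts could be extracted and aggregated; the helper name get.conf.stats and the variable names are my own choices, based on the structure of caret's confusionMatrix objects:

get.conf.stats <- function(cm) {
    # extract TP, FP, FN for the positive class of each one-vs-all confusion matrix
    out <- lapply(cm, function(x) {
        tp <- x$table[x$positive, x$positive]
        fp <- sum(x$table[x$positive, ]) - tp  # predicted as positive, but wrong
        fn <- sum(x$table[, x$positive]) - tp  # actually positive, but missed
        c(tp = tp, fp = fp, fn = fn)
    })
    return(as.data.frame(do.call(rbind, out)))
}
get.micro.f1 <- function(cm) {
    cm.summary <- get.conf.stats(cm)
    tp <- sum(cm.summary$tp)
    fp <- sum(cm.summary$fp)
    fn <- sum(cm.summary$fn)
    pr <- tp / (tp + fp)
    re <- tp / (tp + fn)
    return(2 * ((pr * re) / (pr + re)))
}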
micro.f1 <- get.micro.f1(cm)
print(paste0("Micro F1 is: ", round(micro.f1, 2)))
## [1] "Micro F1 is: 0.88"
The value of \(F_1^{\text{micro}} = 0.88\) is quite high, indicating good overall performance.
Class-specific performance with the macro-averaged F1
Since each confusion matrix in cm already stores the one-vs-all prediction performance, we only need to extract these values from one of the matrices and then calculate \(F_1^{\text{macro}}\) as defined above:
get.macro.f1 <- function(cm) {
c <- cm[[1]]$byClass # a single matrix is sufficient
re <- sum(c[, "Recall"]) / nrow(c)
pr <- sum(c[, "Precision"]) / nrow(c)
f1 <- 2 * ((re * pr) / (re + pr))
return(f1)
}
macro.f1 <- get.macro.f1(cm)
print(paste0("Macro F1 is: ", round(macro.f1, 2)))
## [1] "Macro F1 is: 0.68"
The value of \(F_1^{\text{macro}} = 0.68\) is decidedly smaller than the micro-averaged F1 (0.88).
Note that, for the present data set, the micro- and macro-averaged F1 scores relate to each other in much the same way as the overall accuracy (0.78) and weighted accuracy (0.69) do.
Precision-recall curves and AUC
The area under the ROC curve (AUC) is a useful tool for evaluating how well a soft classifier separates the classes. In the multi-class setting, we can visualize the performance of a multi-class model by considering its one-vs-all precision-recall curves. The AUC can also be generalized to the multi-class setting.
One-vs-all precision-recall curves
We can visualize the performance of a multi-class model by plotting the performance of \(K\) binary classifiers.
This approach fits \(K\) one-vs-all classifiers, where in the \(i\)-th iteration class \(g_i\) is treated as the positive class and all other classes \(g_j\) with \(j \neq i\) are treated together as the negative class. Note that this method should not be used to plot conventional ROC curves (TPR versus FPR), because the large number of negative instances produced by the dichotomization would cause the FPR to be underestimated. Instead, we consider precision and recall, as in the loop that follows the setup sketch below.
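The loop below relies on a Naive Bayes model (klaR), the ROCR package, a train/test split of the iris data, and an already initialized plot. These preliminaries are not part of this excerpt, so here is a minimal, assumed setup (the seed and the 60/40 split are arbitrary choices):

library(klaR)   # provides NaiveBayes
library(ROCR)   # provides prediction() and performance()
data(iris)
response <- iris$Species
# assumed 60/40 train/test split of iris (seed chosen arbitrarily)
set.seed(12345)
train.idx <- sample(seq_len(nrow(iris)), 0.6 * nrow(iris))
iris.train <- iris[train.idx, ]
iris.test <- iris[-train.idx, ]
# storage for the per-class AUCs and an empty precision-recall canvas
aucs <- rep(NA, length(levels(response)))
colors <- c("red", "blue", "green")
plot(x = NA, y = NA, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Recall", ylab = "Precision", bty = "n")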
for (i in seq_along(levels(response))) {
    cur.class <- levels(response)[i]
    # one-vs-all labels: the current class is positive, all other classes negative
    binary.labels <- as.factor(iris.train$Species == cur.class)
    model <- NaiveBayes(binary.labels ~ ., data = iris.train[, -5])
    pred <- predict(model, iris.test[, -5], type = "raw")
    score <- pred$posterior[, "TRUE"] # posterior for positive class
    test.labels <- iris.test$Species == cur.class
    pred <- prediction(score, test.labels)
    perf <- performance(pred, "prec", "rec")
    roc.x <- unlist(perf@x.values)
    roc.y <- unlist(perf@y.values)
    lines(roc.y ~ roc.x, col = colors[i], lwd = 2)
    # store AUC
    auc <- performance(pred, "auc")
    auc <- unlist(slot(auc, "y.values"))
    aucs[i] <- auc
}
print(paste0("Mean AUC under the precision-recall curve is: ", round(mean(aucs), 2)))
## [1] "Mean AUC under the precision-recall curve is 0.97"Copy the code
The plot shows that setosa can be predicted very well, while versicolor and virginica are harder to separate. The mean AUC of 0.97 indicates that the model separates the three classes well.
Generalizing the AUC to the multi-class setting
Generalized AUC for a single decision value
When classification is based on a single quantity, the AUC can be determined using the multiclass.roc function from the pROC package.
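The call that produced the output below is not included in this excerpt; a sketch using the aSAH data set that ships with pROC (an assumption on my part, chosen because a single marker is used to separate several outcome classes) would be:

library(pROC)
data(aSAH)
# AUC for separating the outcome classes (gos6) using the single marker s100b
multiclass.roc(aSAH$gos6, aSAH$s100b)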
## Multi-class area under the curve: 0.654
The AUC computed by this function is simply the average AUC over all pairwise class comparisons.
The generalized AUC of Hand and Till
The following describes the generalization of the AUC proposed by Hand and Till (2001).
There does not seem to be a publicly available implementation of the multi-class generalization of the AUC due to Hand and Till (2001), so I wrote my own implementation. The function compute.A.conditional determines \(\hat{A}(i|j)\). The multiclass.auc function computes \(\hat{A}(i, j)\) for all pairs of classes with \(i < j\) and then averages these values to obtain the generalized AUC.
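The body of compute.A.conditional is not reproduced in this excerpt. A sketch that follows Hand and Till's definition of \(\hat{A}(i|j)\) via the Mann-Whitney rank statistic (the variable names are my own) is:

compute.A.conditional <- function(pred.matrix, i, j, ref.outcome) {
    # A(i|j): probability that a randomly chosen member of class j has a lower
    # estimated probability of belonging to class i than a randomly chosen
    # member of class i (Hand & Till, 2001)
    i.idx <- which(ref.outcome == i)
    j.idx <- which(ref.outcome == j)
    # class-i membership probabilities assigned to observations of classes i and j
    pred.i <- pred.matrix[i.idx, i]
    pred.j <- pred.matrix[j.idx, i]
    all.preds <- c(pred.i, pred.j)
    classes <- c(rep(i, length(pred.i)), rep(j, length(pred.j)))
    # sum of the ranks of the class-i observations (Mann-Whitney formulation)
    S.i <- sum(rank(all.preds)[classes == i])
    n.i <- length(i.idx)
    n.j <- length(j.idx)
    return((S.i - n.i * (n.i + 1) / 2) / (n.i * n.j))
}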
multiclass.auc <- function(pred.matrix, ref.outcome) {
    labels <- colnames(pred.matrix)
    c <- length(labels)
    pairs <- unlist(lapply(combn(labels, 2, simplify = FALSE), function(x) paste(x, collapse = "/")))
    # A(i,j) is the average of A(i|j) and A(j|i) for every unordered pair of classes
    A.mean <- sapply(combn(labels, 2, simplify = FALSE), function(x) {
        mean(c(compute.A.conditional(pred.matrix, x[1], x[2], ref.outcome),
               compute.A.conditional(pred.matrix, x[2], x[1], ref.outcome)))
    })
    names(A.mean) <- pairs
    A.ij.joint <- sum(unlist(A.mean))
    M <- 2 / (c * (c - 1)) * A.ij.joint
    attr(M, "pair_AUCs") <- A.mean
    return(M)
}
model <- NaiveBayes(iris.train$Species ~ ., data = iris.train[, -5])
pred <- predict(model, iris.test[,-5], type='raw')
pred.matrix <- pred$posterior
ref.outcome <- iris.test$Species
M <- multiclass.auc(pred.matrix, ref.outcome)
print(paste0("Generalized AUC is: ", round(as.numeric(M), 3)))
## [1] "Generalized AUC is: 0.988"
print(attr(M, "pair_AUCs")) # pairwise AUCs
##    setosa/versicolor     setosa/virginica versicolor/virginica
##            1.0000000            1.0000000            0.9627329
Using this method, the generalized AUC is 0.988. The accompanying pairwise AUCs can be interpreted in a similar way.
Summary
For multi-class problems:
- For hard classifiers, you can use the (weighted) accuracy as well as the micro- or macro-averaged F1 score.
- For soft classifiers, you can determine one-vs-all precision-recall curves, or use the generalized AUC of Hand and Till.