Purity:

The purity of a clustering is the fraction of samples that fall into the majority class of their cluster:

Purity = (1/N) * Σ_k max_j |ω_k ∩ c_j|

where N is the total number of samples, ω_k is the set of samples assigned to cluster k, and c_j is the set of samples whose true label is j.

The following table shows the k-means clustering result for 3204 Los Angeles Times articles, with k = 6 and 6 labels.

Python implementation:

import numpy as np

def purity(cluster, labels, k, label_set):
    # p[i, j] counts samples assigned to cluster i whose true label is label_set[j]
    p = np.zeros((k, len(label_set)))
    for i in range(len(cluster)):
        p[int(cluster[i]), label_set.index(labels[i])] += 1

    # each cluster contributes its majority-label count; normalize by sample count
    purity = sum(np.max(p, axis=1)) / len(labels)

    return purity
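A minimal sanity check of the purity function on toy data (the assignments and labels below are assumptions for illustration, not from the Los Angeles Times experiment):

    import numpy as np

    def purity(cluster, labels, k, label_set):
        p = np.zeros((k, len(label_set)))
        for i in range(len(cluster)):
            p[int(cluster[i]), label_set.index(labels[i])] += 1
        return sum(np.max(p, axis=1)) / len(labels)

    # toy example: 6 samples, 2 clusters, true labels from {"a", "b"}
    cluster = [0, 0, 0, 1, 1, 1]
    labels = ["a", "a", "b", "b", "b", "a"]
    score = purity(cluster, labels, 2, ["a", "b"])
    print(score)  # majority counts 2 + 2 -> 4/6 ≈ 0.667

Cluster 0 has majority label "a" (2 of 3 samples) and cluster 1 has majority label "b" (2 of 3), so 4 of the 6 samples are "purely" placed.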

TP, TN, FP, FN

In binary classification:

TP (true positive): a sample that belongs to the positive class is correctly classified as positive.
TN (true negative): a sample that belongs to the negative class is correctly classified as negative.
FP (false positive): a sample that belongs to the negative class is wrongly classified as positive.
FN (false negative): a sample that belongs to the positive class is wrongly classified as negative.

The second letter (P or N) is the class predicted by the model; the first letter (T or F) indicates whether that prediction is correct.
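The four counts can be tallied directly from a list of ground-truth labels and predictions (the values below are a made-up example):

    # hypothetical binary ground truth vs. predictions (1 = positive, 0 = negative)
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 1]

    TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    print(TP, TN, FP, FN)  # 2 2 1 1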

Accuracy

Accuracy = (TN + TP)/(TP + TN + FP + FN)

Note: accuracy is misleading when the data set is imbalanced. For example, if 97% of the samples belong to class X and only 3% do not, a classifier that assigns every sample to X achieves 97% accuracy, yet that does not mean it classifies well.
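The imbalance trap above can be reproduced in a few lines (a sketch with made-up counts matching the 97%/3% example):

    # hypothetical imbalanced data: 97 positives (class X) and 3 negatives
    y_true = [1] * 97 + [0] * 3
    y_pred = [1] * 100  # degenerate classifier that predicts X for everything

    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(accuracy)  # 0.97, even though no negative is ever detected

The 97% figure hides the fact that recall on the negative class is zero, which is why recall and the F-score below are reported alongside accuracy.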

Recall

Also known as Sensitivity, TP Rate, True Positive Rate, or probability of detection. Recall = TP / (TP + FN): among the samples that truly belong to X, the proportion successfully predicted as X (TP).

F-score

The F-score combines Precision and Recall as their harmonic mean: F = 2 * Precision * Recall / (Precision + Recall).

Python implementation (pairwise counting, so it also works for more than two classes):

def f_score(cluster, labels):
    # pairwise counting: every pair of samples is one "decision" --
    # the pair is positive if the two samples share a cluster
    TP, TN, FP, FN = 0.0, 0.0, 0.0, 0.0
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            same_label = (labels[i] == labels[j])
            same_cluster = (cluster[i] == cluster[j])
            if same_cluster:
                if same_label:
                    TP += 1
                else:
                    FP += 1
            elif same_label:
                FN += 1
            else:
                TN += 1
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    fscore = 2 * precision * recall / (precision + recall)
    return fscore, precision, recall, TP + FP + FN + TN

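As a quick sanity check of the pairwise counting, consider four samples in two clusters (toy values, assumed for illustration):

    from itertools import combinations

    # toy assignments: 4 samples, 2 clusters, labels "a"/"b"
    cluster = [0, 0, 1, 1]
    labels = ["a", "a", "a", "b"]

    TP = FP = FN = TN = 0
    for i, j in combinations(range(len(labels)), 2):
        same_cluster = cluster[i] == cluster[j]
        same_label = labels[i] == labels[j]
        if same_cluster and same_label:
            TP += 1
        elif same_cluster:
            FP += 1
        elif same_label:
            FN += 1
        else:
            TN += 1

    precision = TP / (TP + FP)  # 1 / 2
    recall = TP / (TP + FN)     # 1 / 3
    print(2 * precision * recall / (precision + recall))  # 0.4

Of the 6 pairs, only (0, 1) is both co-clustered and co-labeled (TP = 1); pair (2, 3) shares a cluster but not a label (FP = 1), and pairs (0, 2) and (1, 2) share a label but not a cluster (FN = 2).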

Code and data download: download.csdn.net/download/SA…