Purity:
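The purity of a clustering is the fraction of samples that carry the majority label of the cluster they are assigned to:
$$\mathrm{purity} = \frac{1}{N} \sum_{i=1}^{k} \max_{j} n_{ij}$$
where N is the total number of samples, k is the number of clusters, and n_{ij} is the number of samples in cluster i that have label j.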
The following table shows the k-means clustering result for 3204 Los Angeles Times articles, with k = 6 clusters and 6 class labels.
Python implementation:
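import numpy as np

def purity(cluster, labels, k, label_set):
    # p[c, l] counts the samples in cluster c whose true label is label_set[l]
    p = np.zeros((k, len(label_set)))
    for i in range(len(cluster)):
        p[int(cluster[i]), label_set.index(labels[i])] += 1
    # each cluster contributes the size of its majority label
    return np.sum(np.max(p, axis=1)) / len(labels)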
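As a quick check of the purity definition, here is a minimal sketch on made-up toy data (the eight samples, the three labels, and the compact restatement of `purity` are illustrative assumptions, not the Los Angeles Times data):

```python
import numpy as np

def purity(cluster, labels, k, label_set):
    # counts[c, l]: samples in cluster c with true label label_set[l]
    counts = np.zeros((k, len(label_set)))
    for c, y in zip(cluster, labels):
        counts[int(c), label_set.index(y)] += 1
    return counts.max(axis=1).sum() / len(labels)

# Hypothetical toy data: 8 samples, 3 clusters, 3 labels.
cluster = [0, 0, 0, 1, 1, 1, 2, 2]
labels = ["a", "a", "b", "b", "b", "c", "c", "c"]
label_set = ["a", "b", "c"]

# Majority counts per cluster are 2, 2 and 2, so purity = 6 / 8.
score = purity(cluster, labels, 3, label_set)
print(score)  # 0.75
```

A purity of 1.0 would mean every cluster contains samples of a single label only.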
TP, TN, FP, FN
In binary classification:
TP (true positive): a sample that belongs to the positive class and is correctly classified as positive.
TN (true negative): a sample that belongs to the negative class and is correctly classified as negative.
FP (false positive): a sample that belongs to the negative class but is wrongly classified as positive.
FN (false negative): a sample that belongs to the positive class but is wrongly classified as negative.
The second letter (P or N) is the class predicted by the model, and the first letter (T or F) indicates whether that prediction is correct.
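The four counts can be tallied directly from true and predicted labels. The following is a minimal sketch; the function name `confusion_counts`, the `positive` parameter, and the toy labels are assumptions made for illustration:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary problem."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if truth == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn

# Hypothetical labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))  # (3, 3, 1, 1)
```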
Accuracy
Accuracy = (TN + TP)/(TP + TN + FP + FN)
Note: accuracy is misleading when the data set is imbalanced. For example, if 97% of the samples in a classification problem belong to X and only 3% do not, classifying every sample as X yields an accuracy of 97%, which does not mean the classifier performs well.
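The imbalanced case can be reproduced with a few lines. This sketch assumes a data set of 100 samples (97 in X, 3 not in X) and a classifier that labels everything as X; the helper name `accuracy` is illustrative:

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# X is the positive class and every sample is predicted as X:
# the 97 X samples are true positives, the 3 others are false positives.
tp, tn, fp, fn = 97, 0, 3, 0
print(accuracy(tp, tn, fp, fn))  # 0.97

# tn is 0: not a single sample outside X was recognised as such.
```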
Recall
Also known as sensitivity or true positive rate (TPR). Recall = TP / (TP + FN): among the samples that truly belong to X, the proportion that is correctly predicted as X.
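Recall exposes the weakness that accuracy hides in the 97% / 3% example. A minimal sketch, assuming 100 samples and a classifier that predicts X for everything; the helper name and the zero-denominator convention are assumptions:

```python
def recall(tp, fn):
    # Assumption: report 0.0 when the class has no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0

# X as the positive class: all 97 X samples are found.
print(recall(tp=97, fn=0))  # 1.0

# "Not X" as the positive class: none of its 3 samples are found.
print(recall(tp=0, fn=3))   # 0.0
```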
F-score
The F-score combines precision and recall into a single value by taking their harmonic mean:
Precision = TP / (TP + FP)
F-score = 2 * Precision * Recall / (Precision + Recall)
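Because it is a harmonic mean, the F-score stays low whenever either precision or recall is low. The sketch below illustrates this with made-up values; the function name `f_score_from` and the zero-denominator convention are assumptions:

```python
def f_score_from(precision, recall):
    # Assumption: define the score as 0.0 when both inputs are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f_score_from(0.5, 0.5)  # equal inputs give the same value back
lopsided = f_score_from(0.9, 0.1)  # roughly 0.18
arithmetic = (0.9 + 0.1) / 2       # plain average of the same inputs

print(balanced, lopsided, arithmetic)
```

The lopsided case scores far below the arithmetic average of 0.5, which is why the F-score is preferred when both quantities matter.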
(multi-class) Python implementation:
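def f_score(cluster, labels, label_set):
    # Pair counting: every pair of samples (i, j) is treated as one decision.
    # `cluster` is used as a lookup table (sample index -> cluster id, e.g. a
    # dict), so samples without a cluster assignment are skipped.
    TP, TN, FP, FN = 0, 0, 0, 0
    n = len(labels)
    for i in range(n):
        if i not in cluster:
            continue
        for j in range(i + 1, n):
            if j not in cluster:
                continue
            same_label = (labels[i] == labels[j])
            same_cluster = (cluster[i] == cluster[j])
            if same_cluster:
                if same_label:
                    TP += 1  # same cluster, same label
                else:
                    FP += 1  # same cluster, different labels
            elif same_label:
                FN += 1  # different clusters, same label
            else:
                TN += 1  # different clusters, different labels
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    fscore = 2 * precision * recall / (precision + recall)
    return fscore, precision, recall, TP + FP + FN + TN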
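To see the pair counting at work, here is a small self-contained sketch on made-up data. It assumes every sample has a cluster assignment (so `cluster` is a plain list) and uses `itertools.combinations` instead of nested loops; the name `pair_counts` is illustrative:

```python
from itertools import combinations

def pair_counts(cluster, labels):
    """Count sample pairs as (TP, TN, FP, FN)."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_label = labels[i] == labels[j]
        same_cluster = cluster[i] == cluster[j]
        if same_cluster and same_label:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_label:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

# Hypothetical toy data: 6 samples, 2 clusters, 2 labels.
cluster = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "a"]

tp, tn, fp, fn = pair_counts(cluster, labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
fscore = 2 * precision * recall / (precision + recall)
print(tp, tn, fp, fn)  # 2 5 4 4
```

The four counts add up to 15, the number of distinct pairs among 6 samples, and precision, recall, and F-score all come out at one third for this toy input.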
Code and data download: download.csdn.net/download/SA…