The KS curve and the ROC curve are essentially the same thing. Both are built from the hit rate (TPR) and the false alarm rate (FPR): we want TPR to be as high as possible, so that as many customers who are about to churn as possible are identified, and we want FPR to be as low as possible, so that customers who will stay are not misjudged as churners. The difference lies in the axes. The ROC curve uses the false alarm rate (FPR) as the abscissa and the hit rate (TPR) as the ordinate, whereas the KS curve uses the threshold as the abscissa and the difference between the hit rate and the false alarm rate (TPR - FPR) as the ordinate, as shown in the figure below.
As with the ROC curve, a plot alone is not enough; a quantitative index is also needed to measure how well the model predicts. The AUC value plays that role for the ROC curve, and the KS value plays it for the KS curve. The KS value is calculated as follows.
KS = max(TPR - FPR)
The KS value is the peak of the KS curve. Each threshold yields one value of (TPR - FPR), so some threshold maximizes it, and that maximum is the KS value. For example, in the figure above, at a threshold of 40% the hit rate (TPR) is 80% and the false alarm rate (FPR) is 25%, so (TPR - FPR) is 55%, the largest value over all thresholds. The KS value of this model is therefore 55%.
Put more simply, at a threshold of 40% the model catches as many of the "bad guys" as possible while wrongly flagging as few of the "good guys" as possible. At that point the hit rate (TPR) minus the false alarm rate (FPR) is 55%, which is the KS value of the model.
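To make the definition concrete, here is a minimal pure-Python sketch that computes TPR and FPR at several thresholds and takes the largest gap as the KS value. The labels, scores, and the helper name `rates_at` are invented for illustration and are not taken from the churn model in this article.

```python
# Toy data, invented for illustration: 1 = churned customer, 0 = retained customer.
y_true = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1]

def rates_at(threshold, y_true, scores):
    """Return (TPR, FPR) when every score >= threshold is predicted as churn."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives

# Use each distinct score as a candidate threshold, from high to low.
thresholds = sorted(set(scores), reverse=True)
diffs = []
for t in thresholds:
    tpr_t, fpr_t = rates_at(t, y_true, scores)
    diffs.append((tpr_t - fpr_t, t))

# The KS value is the largest (TPR - FPR) over all candidate thresholds.
ks, best_threshold = max(diffs)
print(f"KS = {ks:.2f} at threshold {best_threshold}")
```

With this toy data the largest gap is 0.8 (TPR 100%, FPR 20%), reached at a threshold of 0.4.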
Evaluating the customer churn warning model with the KS curve
As with drawing the ROC curve, the false alarm rate (FPR) and hit rate (TPR) at each threshold can be computed with the code below, which shows that the KS curve and the ROC curve come from the same source.
from sklearn.metrics import roc_curve
fpr,tpr,thres = roc_curve(y_test,y_pred_proba[:,1])
The threshold, false alarm rate, and hit rate are then collected into a table, using the same code as for the ROC curve.
import pandas as pd
a = pd.DataFrame()
a['threshold'] = list(thres)
a['false alarm rate'] = list(fpr)
a['hit rate'] = list(tpr)
The resulting table is shown below (values are rounded to 4 decimal places for readability).
With the false alarm rate and hit rate known at each threshold, the KS curve can be drawn with Matplotlib. The code is as follows.
plt.plot(thres[1:],tpr[1:])
plt.plot(thres[1:],fpr[1:])
plt.plot(thres[1:],tpr[1:]-fpr[1:])
plt.xlabel('threshold')
plt.legend(['tpr','fpr','tpr-fpr'])
plt.gca().invert_xaxis()
plt.show()
Lines 1 to 3 draw three curves with the threshold as the abscissa; the ordinates are the hit rate, the false alarm rate, and the difference between the two. The threshold in the first row of the table is greater than 1, which is meaningless and would distort the plot, so the first row is dropped by slicing: thres[1:], tpr[1:], and fpr[1:] each start from the second element.
Line 4 labels the X-axis with “threshold”.
Line 5 adds the legend.
Line 6 inverts the X-axis so that the thresholds run from large to small, as described in the principle section earlier. Specifically, gca() gets the current axes and invert_xaxis() inverts the X-axis.
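Slicing with [1:] relies on the out-of-range threshold always being the first element. As a hedged alternative, the sketch below drops any point whose threshold is not a valid probability, which should behave the same whether the first threshold is a number above 1 or infinity (as far as I know, recent scikit-learn versions use infinity there). The arrays and the helper name `drop_sentinel` are hypothetical stand-ins, not output from this article's model, and the sketch assumes the scores are probabilities in [0, 1].

```python
import math

# Hypothetical values shaped like roc_curve output; the first threshold is out of range.
thres = [math.inf, 0.93, 0.71, 0.48, 0.22, 0.05]
fpr = [0.0, 0.0, 0.1, 0.25, 0.6, 1.0]
tpr = [0.0, 0.3, 0.6, 0.8, 0.95, 1.0]

def drop_sentinel(thres, fpr, tpr):
    """Keep only the points whose threshold lies in [0, 1]."""
    kept = [(t, f, p) for t, f, p in zip(thres, fpr, tpr) if 0.0 <= t <= 1.0]
    t_kept, f_kept, p_kept = zip(*kept)
    return list(t_kept), list(f_kept), list(p_kept)

thres_plot, fpr_plot, tpr_plot = drop_sentinel(thres, fpr, tpr)

# The third curve in the plot: hit rate minus false alarm rate at each threshold.
gap = [p - f for p, f in zip(tpr_plot, fpr_plot)]
```

The three filtered lists can then be passed to plt.plot in place of the sliced arrays.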
The KS curve is shown in the figure below.
The KS value can be quickly calculated by the following code.
max(tpr-fpr)
Printing the KS value gives 0.4744, which lies in the range [0.3, 0.5], so the model has fairly strong discriminating power.
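It is often useful to know at which threshold the KS value is reached, not only the value itself. The following is a small sketch under stated assumptions: `ks_with_threshold` is a hypothetical helper, and the three lists are made-up stand-ins for the fpr, tpr, and thres arrays returned by roc_curve.

```python
def ks_with_threshold(fpr, tpr, thres):
    """Return (KS value, threshold at which TPR - FPR is largest)."""
    gaps = [float(t) - float(f) for f, t in zip(fpr, tpr)]
    # Index of the largest gap; ties resolve to the first occurrence.
    best = max(range(len(gaps)), key=gaps.__getitem__)
    return gaps[best], float(thres[best])

# Hypothetical stand-in values; in the article these come from roc_curve.
fpr_demo = [0.0, 0.0, 0.1, 0.25, 0.6, 1.0]
tpr_demo = [0.0, 0.3, 0.6, 0.8, 0.95, 1.0]
thres_demo = [1.93, 0.93, 0.71, 0.48, 0.22, 0.05]

ks_value, ks_threshold = ks_with_threshold(fpr_demo, tpr_demo, thres_demo)
print(round(ks_value, 4), ks_threshold)
```

With NumPy arrays, the same lookup can be written as thres[(tpr - fpr).argmax()].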