In Python, you can plot the ROC curve by calculating the hit rate (TPR) and the false alarm rate (FPR) at different thresholds, using the following code.
from sklearn.metrics import roc_curve
fpr, tpr, thres = roc_curve(y_test, y_pred_proba[:, 1])
Line 1 imports the roc_curve() function. Line 2 passes the target variable of the test set, y_test, and the predicted churn probabilities, y_pred_proba[:,1], into roc_curve() to calculate the hit rate and false alarm rate at different thresholds. roc_curve() returns a tuple of three elements: by default, the first element is the false alarm rate, the second is the hit rate, and the third is the thresholds. We therefore assign the results to the variables fpr (false alarm rate), tpr (hit rate), and thres (thresholds), respectively.
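If you want to see this return order for yourself, here is a minimal self-contained sketch with made-up labels and scores (the data below are invented for illustration; in this tutorial the inputs come from the churn model's test set):

from sklearn.metrics import roc_curve

# Made-up example: 1 = churned, 0 = retained
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

fpr_demo, tpr_demo, thres_demo = roc_curve(y_true_demo, y_score_demo)
print(fpr_demo)    # false alarm rate at each threshold
print(tpr_demo)    # hit rate at each threshold
print(thres_demo)  # thresholds, from highest to lowest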
Returning to the churn model: the fpr, tpr, and thres obtained are three one-dimensional arrays, which can be combined into a two-dimensional table with the following code.
import pandas as pd

a = pd.DataFrame()
a['threshold'] = list(thres)
a['false alarm rate'] = list(fpr)
a['hit rate'] = list(tpr)
Print a.head() and a.tail() to view the false alarm rate and hit rate at different thresholds, as shown in the table below.
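For reference, the two calls look like this:

print(a.head())  # first five rows: the highest thresholds
print(a.tail())  # last five rows: the lowest thresholds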
You can see that as the threshold drops, both the hit rate and the false alarm rate increase. A few notes on this table follow.
First, the threshold in the first row of the table means that a customer is judged as churned only when the predicted churn probability is greater than or equal to 193.03%. Since a probability can never exceed 100%, no customer is predicted to churn at this threshold, and both the hit rate and the false alarm rate are 0. This threshold carries no meaning in itself, so why set it? It is the default behavior of the roc_curve() function, described in the official documentation as follows: "thresholds[0] represents no instances being predicted and is arbitrarily set to max(y_score) + 1". In other words, the first threshold is set to the maximum predicted score (0.9303 here) plus 1 to guarantee that no records are selected.
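You can confirm this on your own data with a quick check (a sketch; note that recent scikit-learn releases set the first threshold to np.inf rather than max(y_score) + 1):

# The first threshold should be max(y_score) + 1, i.e. 0.9303 + 1 = 1.9303
print(thres[0])
print(y_pred_proba[:, 1].max() + 1)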
Second, the data in row 2 of the table show that a customer is judged as churned only when the predicted churn probability is 93.03% or higher. This condition is strict, so very few customers are predicted to churn at this point. The hit rate is 0.0028, i.e. the ratio "customers predicted to churn who actually churned / all customers who actually churned" is 0.0028. Assuming there are 5,000 actually churned customers, then under this threshold 5000 × 0.0028 = 14 of them are correctly judged as churned. Meanwhile, the false alarm rate is 0, i.e. the ratio "customers predicted to churn who did not actually churn / all customers who did not churn" is 0: none of the retained customers are misjudged as churned.
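The arithmetic behind that worked example, as a sketch (the 5,000 figure is the assumption made above, not a value from the dataset):

tpr_row2 = 0.0028      # hit rate in row 2 of the table
actual_churned = 5000  # assumed number of actually churned customers
print(tpr_row2 * actual_churned)  # 14.0 customers correctly flagged as churned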
Now that the false alarm rate and hit rate at different thresholds are known, the ROC curve can be drawn with the Matplotlib library, using the following code.
import matplotlib.pyplot as plt
plt.plot(fpr,tpr)
plt.title('ROC')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()
The ROC curve drawn is shown in the figure below.
The AUC value of the model can be quickly calculated by the following code.
from sklearn.metrics import roc_auc_score
score = roc_auc_score(y_test, y_pred_proba[:, 1])
Line 1 imports the roc_auc_score() function. Line 2 passes the test-set target variable y_test and the predicted churn probabilities y_pred_proba[:,1] into roc_auc_score(). The AUC value obtained is 0.81, indicating good predictive performance.
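As a cross-check (a sketch reusing the fpr and tpr arrays computed earlier), the same value can be obtained by integrating the ROC curve directly with scikit-learn's auc() helper, which applies the trapezoidal rule to the (fpr, tpr) points:

from sklearn.metrics import auc

# Area under the ROC curve; should match roc_auc_score(y_test, y_pred_proba[:, 1])
print(auc(fpr, tpr))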