ROC curves let us evaluate the performance of a model across all possible threshold choices. A ROC (Receiver Operating Characteristic) curve shows how well a model can separate two classes, positive and negative.
We need two metrics for ROC curves: TPR and FPR, the true positive rate and the false positive rate. TPR is the share of actual positives that the model predicts as positive, TP / (TP + FN). FPR is the share of actual negatives that the model wrongly predicts as positive, FP / (FP + TN).
The idea is similar to what we previously did with accuracy, but instead of recording just one value per threshold, we record all four outcomes of the confusion table.
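To make the two rates concrete, here is a minimal sketch that computes them at a single threshold. The arrays `y_true` and `y_score` and the threshold of 0.5 are made-up toy values for illustration, not the data used in this article.

```python
import numpy as np

# Hypothetical toy data: 1 = positive class, 0 = negative class
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.4, 0.65, 0.3, 0.2, 0.7, 0.8, 0.1])

threshold = 0.5  # assumed cut-off, chosen only for this example
predicted_positive = y_score >= threshold

actual_positive = y_true == 1
actual_negative = y_true == 0

# The four cells of the confusion table
tp = (predicted_positive & actual_positive).sum()
fn = (~predicted_positive & actual_positive).sum()
fp = (predicted_positive & actual_negative).sum()
tn = (~predicted_positive & actual_negative).sum()

tpr = tp / (tp + fn)  # share of actual positives that were caught
fpr = fp / (fp + tn)  # share of actual negatives that were flagged by mistake
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

In this toy case three of the four positives are caught (TPR = 0.75) and one of the four negatives is flagged by mistake (FPR = 0.25).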
import numpy as np
import pandas as pd


def tpr_fpr_dt(y_true, y_pred):
    """Return confusion-table counts, FPR and TPR for 11 thresholds from 0 to 1."""
    scores = []
    for t in np.linspace(0, 1, 11):
        # Predicted classes at threshold t
        predictions_true = y_pred >= t
        predictions_false = y_pred < t
        # Actual classes
        actual_true = y_true == 1
        actual_false = y_true == 0
        # The four cells of the confusion table
        TN = (predictions_false & actual_false).sum()
        FP = (predictions_true & actual_false).sum()
        FN = (predictions_false & actual_true).sum()
        TP = (predictions_true & actual_true).sum()
        scores.append([t, TN, FP, FN, TP])
    scores = pd.DataFrame(data=scores, columns=["threshold", "TN", "FP", "FN", "TP"])
    scores["FPR"] = scores["FP"] / (scores["FP"] + scores["TN"])
    scores["TPR"] = scores["TP"] / (scores["TP"] + scores["FN"])
    return scores


scores = tpr_fpr_dt(y_val, y_pred_val)
scores
| | threshold | TN | FP | FN | TP | FPR | TPR |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0 | 7994 | 0 | 1048 | 1.000000 | 1.000000 |
| 1 | 0.1 | 5154 | 2840 | 311 | 737 | 0.355266 | 0.703244 |
| 2 | 0.2 | 7421 | 573 | 609 | 439 | 0.071679 | 0.418893 |
| 3 | 0.3 | 7689 | 305 | 699 | 349 | 0.038154 | 0.333015 |
| 4 | 0.4 | 7821 | 173 | 801 | 247 | 0.021641 | 0.235687 |
| 5 | 0.5 | 7901 | 93 | 865 | 183 | 0.011634 | 0.174618 |
| 6 | 0.6 | 7950 | 44 | 917 | 131 | 0.005504 | 0.125000 |
| 7 | 0.7 | 7971 | 23 | 957 | 91 | 0.002877 | 0.086832 |
| 8 | 0.8 | 7980 | 14 | 996 | 52 | 0.001751 | 0.049618 |
| 9 | 0.9 | 7992 | 2 | 1040 | 8 | 0.000250 | 0.007634 |
| 10 | 1.0 | 7994 | 0 | 1048 | 0 | 0.000000 | 0.000000 |
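Both TPR and FPR start at 100%: at a threshold of 0.0 we predict ‘subscribe’ for everyone, so there are no negative predictions. With TN = 0 and FN = 0, both FP / (FP + TN) and TP / (TP + FN) equal 1. As the threshold rises, fewer customers are predicted positive and both rates fall, reaching 0 at a threshold of 1.0.
With FPR and TPR already computed, let’s plot them against the threshold: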
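As a quick sanity check, the rates in the table can be re-derived by hand from the counts. The sketch below copies three rows of counts from the table above (thresholds 0.0, 0.5 and 1.0); the dictionary names `rows` and `rates` are chosen only for this illustration.

```python
# Counts copied from the table above, in the order (TN, FP, FN, TP)
rows = {
    0.0: (0, 7994, 0, 1048),
    0.5: (7901, 93, 865, 183),
    1.0: (7994, 0, 1048, 0),
}

rates = {}
for threshold, (tn, fp, fn, tp) in rows.items():
    fpr = fp / (fp + tn)  # false positives among all actual negatives
    tpr = tp / (tp + fn)  # true positives among all actual positives
    rates[threshold] = (fpr, tpr)
    print(f"threshold={threshold:.1f}  FPR={fpr:.6f}  TPR={tpr:.6f}")
```

Note that the denominators never change: FP + TN is always 7994 (all actual negatives) and TP + FN is always 1048 (all actual positives). Only the split between the two cells moves with the threshold.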
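import matplotlib.pyplot as plt
import seaborn as sns

# Plot both rates against the threshold
sns.lineplot(data=scores, x="threshold", y="TPR", label="TPR")
sns.lineplot(data=scores, x="threshold", y="FPR", label="FPR", ls="--")
plt.legend()
plt.title("TPR and FPR")
plt.ylabel(None);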
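The plot above shows each rate against the threshold. A ROC curve drops the threshold axis and plots TPR directly against FPR. The following is a rough sketch of that view, using the FPR and TPR values copied from the threshold table; the dashed diagonal is added here as a reference for a model that guesses at random, where TPR equals FPR.

```python
import matplotlib.pyplot as plt

# FPR and TPR values copied from the threshold table (thresholds 0.0 to 1.0)
fpr = [1.000000, 0.355266, 0.071679, 0.038154, 0.021641, 0.011634,
       0.005504, 0.002877, 0.001751, 0.000250, 0.000000]
tpr = [1.000000, 0.703244, 0.418893, 0.333015, 0.235687, 0.174618,
       0.125000, 0.086832, 0.049618, 0.007634, 0.000000]

fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(fpr, tpr, marker="o", label="model")
# Reference line for random guessing (an addition for this sketch)
ax.plot([0, 1], [0, 1], ls="--", color="grey", label="random baseline")
ax.set_xlabel("FPR")
ax.set_ylabel("TPR")
ax.set_title("ROC curve")
ax.legend()
```

With only 11 thresholds the curve is coarse; a finer grid in `np.linspace` would give a smoother line.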