Revisiting the ROC and the Precision-Recall Curves

I have written a series of articles on techniques for machine learning modeling with extremely imbalanced target data, where I noted that the ROC curve is not a sensitive measure in such cases. I believe a worked model prediction example will make the point clearer. In this article, I will show you
- How to construct the ROC curve from the confusion matrix,
- Why the ROC curve is not sensitive enough when the target is extremely imbalanced, and
- How to read a precision-recall curve.
I also provide Python code snippets for those of you who may be interested.
(1) What Is Imbalanced Data?
The definition of imbalanced data is straightforward. A dataset is imbalanced if at least one of the classes of the target variable constitutes only a very small minority. In a supervised machine learning model, class imbalance in the target variable can bias the model heavily towards the majority class and reduce its predictive performance on the minority class. Imbalanced data is prevalent in banking, insurance, engineering, and many other fields. In fraud detection, it is common for the imbalance to be on the order of 100 to 1.
(2) Let’s Start with the Confusion Matrix
Assume the target variable is binary and you build a binary classifier. The model’s prediction for each record is a continuous probability from 0.0 to 1.0. You then decide on a cutoff to label the predictions as Positive/Negative, Yes/No, or 1/0.
The performance of a model can be represented in a confusion matrix with four categories. Let’s use the labels “positive” and “negative” to stand for Positive/Negative, Yes/No, or 1/0. True positives (TP) are positive examples that are correctly labeled as positive, and false positives (FP) are negative examples that are incorrectly labeled as positive. Likewise, true negatives (TN) are negative examples correctly labeled as negative, and false negatives (FN) are positive examples incorrectly labeled as negative. A good model is expected to have more true positives and fewer false positives.
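To make the four categories concrete, here is a minimal sketch that counts them for a given cutoff. The function name `confusion_counts`, the labels, and the scores are all made up for illustration; they are not from the wine data used later.

```python
def confusion_counts(y_true, y_prob, cutoff=0.5):
    """Return (TP, FP, TN, FN) for 0/1 labels and predicted probabilities."""
    tp = fp = tn = fn = 0
    for actual, prob in zip(y_true, y_prob):
        # A record is labeled positive when its probability reaches the cutoff.
        predicted = 1 if prob >= cutoff else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.8, 0.05]

tp, fp, tn, fn = confusion_counts(y_true, y_prob, cutoff=0.5)
print("TP, FP, TN, FN at cutoff 0.5:", tp, fp, tn, fn)

# Lowering the cutoff labels more records as positive.
print("TP, FP, TN, FN at cutoff 0.3:", confusion_counts(y_true, y_prob, cutoff=0.3))
```

Note how the counts change with the cutoff: every choice of cutoff gives a different confusion matrix.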

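Given the above confusion matrix, with P = TP + FN the number of actual positives and N = FP + TN the number of actual negatives, we can define the following:
- True positive rate (TPR), also called recall = TP/P = TP/(TP+FN)
- False positive rate (FPR) = FP/N = FP/(FP+TN)
- Precision = TP/(TP+FP)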

(3) The ROC Curve
A confusion matrix evaluates a model at one particular threshold. But the threshold can vary from low to high. Is there a measure that covers the whole range of thresholds? Yes, the Receiver Operating Characteristic (ROC). It is a popular evaluation tool because it visualizes the trade-off between true positives and false positives over the whole range of threshold values. The ROC curve plots TPR against FPR as shown below. The area under the ROC curve (AUC) assesses overall classification performance. The 45-degree dashed line means you select records by chance without any model guidance. If the ROC curve lies on the dashed line, the AUC is 0.5 (half of the square area), which means the model result is no different from a completely random draw. If the ROC curve is very close to the northwest corner, the AUC will be close to 1.0. For a useful model, the AUC lies between 0.5 (random draw) and 1.0 (excellent fit). A rule of thumb is shown in the table. The predictive power of a model can be considered “excellent” if the AUC is more than 0.9, and “good” if the AUC is above 0.8.
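The construction can be sketched in a few lines: sort the records by predicted probability, lower the threshold one distinct score at a time, and record (FPR, TPR) at each step. This is only a plain-Python illustration with made-up labels and scores; the helper names `roc_points` and `auc_trapezoid` are my own, and in practice scikit-learn's `roc_curve` and `roc_auc_score` do this job.

```python
def roc_points(y_true, y_prob):
    """Return the ROC points (FPR, TPR), from (0, 0) up to (1, 1)."""
    p = sum(y_true)              # actual positives
    n = len(y_true) - p          # actual negatives
    # Highest scores first: lowering the threshold adds records in this order.
    pairs = sorted(zip(y_prob, y_true), reverse=True)
    points = [(0.0, 0.0)]        # threshold above every score: nothing is positive
    tp = fp = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # All records sharing this score become positive together.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n, tp / p))
    return points


def auc_trapezoid(points):
    """Area under a curve given as (x, y) points, by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.8, 0.05]

points = roc_points(y_true, y_prob)
auc = auc_trapezoid(points)
print("ROC points:", points)
print("AUC:", round(auc, 4))
```

Each point on the curve is simply the TPR and FPR of one confusion matrix, so the ROC curve is the set of all confusion matrices the model can produce.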

(4) The ROC Is Not Sensitive If the Target Is Extremely Imbalanced
However, the ROC curve is not sensitive enough if the target variable is extremely imbalanced. The AUC does not place more emphasis on one class over the other, so it does not reflect the minority class well. The false positive rate is FPR = FP/N. Because N is large and FP is small by comparison, the FPR stays small and barely moves when FP changes.

Let’s use the numbers in the above example with an imbalanced target to understand this. The actual positives (P) in this example are only 47 out of a total of 1625, so the actual negatives (N) number 1578.
- TPR (TP/P) = 8/47 = 0.17.
- FPR (FP/N) = 7/1578 = 0.004. The large N reflects the size of the majority class, but only indirectly.
- Precision TP/(TP+FP) = 8/(8+7) = 0.53
- Recall TP/(TP+FN) = 8/(8+39) = 0.17
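Do you notice that N does not appear in the denominator of either precision or recall? Precision divides by TP+FP and recall divides by TP+FN, so neither is diluted by the large number of negatives. We know false negatives (FN) are positive cases labeled incorrectly as negative. If a model mislabels most positives as negatives, the FN will be large and recall will be low, even though the FPR remains small.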
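The arithmetic above can be checked with a short sketch. TN = 1571 follows from N = 1578 and FP = 7. The second scenario, in which 14 more negatives are flagged as positive, is a hypothetical variation I added to show how differently FPR and precision react; the helper name `rates` is also my own.

```python
def rates(tp, fp, fn, tn):
    """Compute TPR, FPR, precision and recall from confusion-matrix counts."""
    p = tp + fn   # actual positives
    n = fp + tn   # actual negatives
    return {
        "tpr": tp / p,
        "fpr": fp / n,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# The example from the text: P = 47, N = 1578.
base = rates(tp=8, fp=7, fn=39, tn=1571)

# Hypothetical variation: false positives triple from 7 to 21.
worse = rates(tp=8, fp=21, fn=39, tn=1557)

for name, r in [("base", base), ("worse", worse)]:
    print(name, {k: round(v, 3) for k, v in r.items()})
```

Tripling the false positives moves the FPR by less than one percentage point, while precision falls from about 0.53 to about 0.28. That is the insensitivity of the ROC curve in a nutshell.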
(5) The Precision-Recall (PR) Curve
Davis and Goadrich in this paper propose that Precision-Recall (PR) curves are more informative than ROC curves when dealing with highly skewed datasets. The PR curve plots precision against recall (TPR). Because precision is directly influenced by class imbalance, PR curves are better at highlighting the differences between models on highly imbalanced datasets. When you compare different models in imbalanced settings, the area under the PR curve will be more sensitive than the area under the ROC curve.
Is there a way to summarize the area under the PR curve, as the AUC does for the ROC curve? The short answer is the Average Precision (AP). The AP summarizes a precision-recall curve as the weighted mean of the precisions achieved at each threshold, where the weight is the increase in recall from the previous threshold. See the formula here. Because precision and recall are always between 0 and 1, the AP also falls between 0 (very poor) and 1 (excellent fit).
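As a rough sketch of that weighted mean, the code below computes AP = sum over thresholds of (R_n − R_{n−1}) × P_n, where P_n and R_n are the precision and recall at the n-th threshold. This is the definition I understand scikit-learn's `average_precision_score` to follow; the plain-Python function `average_precision` and the data are my own illustration.

```python
def average_precision(y_true, y_prob):
    """AP as the recall-weighted mean of precision over all thresholds."""
    p = sum(y_true)   # actual positives
    # Highest scores first: each distinct score is one threshold.
    pairs = sorted(zip(y_prob, y_true), reverse=True)
    ap = 0.0
    tp = fp = 0
    prev_recall = 0.0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Records sharing a score cross the threshold together.
        while i < len(pairs) and pairs[i][0] == score:
            tp += pairs[i][1]
            fp += 1 - pairs[i][1]
            i += 1
        precision = tp / (tp + fp)
        recall = tp / p
        # Weight the precision by the gain in recall at this threshold.
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.8, 0.05]

ap = average_precision(y_true, y_prob)
print("Average precision:", round(ap, 4))
```

Thresholds that add only false positives contribute nothing to the sum, because recall does not increase there; they only pull down the precision at later thresholds.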

In the code example below, I will build a model for an extremely imbalanced target. The model is poor, but the area under the ROC curve is still 0.83. In contrast, the area under the PR curve is 0.26, which is far below 1.0.
(6) Build a Model for an Extremely Imbalanced Target
I am going to use the wine quality data from Kaggle.com to create an extremely imbalanced binary target. The target of this dataset is the quality rating from low to high (0–10), as shown below. To make the target extremely imbalanced, I define ratings of 8 or higher as “1” and the rest as “0”. This results in a very imbalanced target (3% “1”s and 97% “0”s). The notebook is available via this link.
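The recoding step can be sketched as follows. The ratings below are invented for illustration and are not the real wine data, so the share of “1”s here will not match the 3% above.

```python
# Hypothetical quality ratings, for illustration only (not the real wine data).
quality = [5, 6, 7, 8, 5, 6, 9, 4, 6, 5]

# Ratings of 8 or higher become the positive class "1"; the rest become "0".
target = [1 if q >= 8 else 0 for q in quality]

positive_share = sum(target) / len(target)
print("Binary target:", target)
print("Share of 1s:", positive_share)
```

With the data loaded into a pandas DataFrame, here hypothetically called `df` with a `quality` column, the same step would be `(df["quality"] >= 8).astype(int)`.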