Priyanka

Summary

The article discusses the limitations of using accuracy as the sole metric for evaluating classification models in machine learning, particularly with imbalanced datasets, and introduces alternative metrics such as Recall, Precision, F1-Score, and ROC-AUC to provide a more nuanced assessment of model performance.

Abstract

When dealing with classification problems in machine learning, especially those with imbalanced data, relying solely on test accuracy can be misleading. The article emphasizes the importance of additional performance metrics like Recall, Precision, F1-Score, and ROC-AUC. Recall, also known as Sensitivity, measures the model's ability to correctly identify all instances of a class. Precision, or Positive Predictive Value, assesses the model's accuracy in classifying instances as positive. The F1-Score provides a balance between Recall and Precision by calculating their harmonic mean. The ROC-AUC curve and score offer insights into the model's discriminative ability between classes, with an AUC value close to 1 indicating excellent performance. The article illustrates these concepts using the iris dataset and a Decision Tree Classifier, demonstrating how to interpret and calculate these metrics, and highlights their utility in evaluating and comparing machine learning models.

Opinions

  • The author suggests that accuracy can be a deceptive metric when dealing with imbalanced datasets, as it treats all misclassifications equally, which may not reflect the true performance of a model.
  • There is an opinion that in certain scenarios, such as medical diagnoses or legal judgments, the cost of False Negatives or False Positives can be significantly higher, necessitating the use of metrics beyond accuracy.
  • The article conveys that a high Recall is crucial in situations where missing a positive case is more critical than incorrectly identifying a negative case as positive.
  • It is implied that Precision is particularly important when the cost of False Positives is high, and it is essential to be confident about the positive predictions made by the model.
  • The author posits that the F1-Score is a valuable metric when there is a need to find an optimal balance between Recall and Precision.
  • The ROC-AUC curve and score are presented as powerful tools for evaluating the overall performance of a classification model across different threshold settings, with a preference for higher AUC values indicating better model discrimination.
  • The author advocates for the use of multi-class classification metrics in evaluating models, particularly in cases where the classes are imbalanced or when the misclassification costs vary between classes.

Beyond Accuracy: Recall, Precision, F1-Score, ROC-AUC

Photo by Afif Ramdhasuma on Unsplash

When talking about classification in Machine Learning, we tend to focus on the test accuracy, i.e., how many instances were classified correctly among the total number of test instances. This can be misleading when it comes to imbalanced data. In this post, we will discuss other performance metrics like Recall, Precision, etc., and what additional advantages they offer in comparison to accuracy.

For ease of explanation, let us consider a simple multi-class classification problem with the iris dataset throughout this post. There are three types of flowers: setosa, versicolor and virginica, which are labelled here as 0, 1, and 2.

Drawbacks of Accuracy

Before discussing other metrics, let us understand the drawbacks of accuracy.

  • Let us consider a scenario where our sample dataset has 90 samples from the classes versicolor and setosa and only 10 samples of virginica. The model classifies all samples of the first two classes correctly but misclassifies every virginica sample, so the accuracy is 90%. This might seem high, but it hides the fact that all 10 virginica samples were misclassified.
  • As we are going to see in the subsequent sections, accuracy weighs all kinds of misclassifications equally, whereas some kinds of errors might be more harmful than others depending on the situation.
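
The first scenario can be checked with a few lines of Python. This is only a minimal sketch on made-up labels: the 45/45 split between setosa and versicolor and the choice of class 1 as the wrong prediction for virginica are assumptions, since the scenario only fixes the totals.

```python
# Hypothetical labels for the scenario above: 0 = setosa, 1 = versicolor, 2 = virginica.
# The 45/45 split between classes 0 and 1 is assumed for illustration.
y_true = [0] * 45 + [1] * 45 + [2] * 10
# Classes 0 and 1 are predicted correctly; every virginica sample is predicted as class 1.
y_pred = [0] * 45 + [1] * 45 + [1] * 10

# Accuracy: share of all samples whose prediction matches the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Share of virginica samples that the model actually identified.
virginica_found = sum(t == 2 and p == 2 for t, p in zip(y_true, y_pred))
virginica_hit_rate = virginica_found / y_true.count(2)

print(f"accuracy = {accuracy:.2f}")
print(f"virginica samples identified = {virginica_hit_rate:.2f}")
```

The accuracy alone looks respectable, while the second number shows that the minority class is never detected.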

Confusion Matrix

Some of the problems mentioned above can be solved by examining the measures that are displayed in a confusion matrix. These terms are very intuitive but can be confusing to remember.

True Positives(TP): These are the test instances that belong to the Positive class and have been predicted correctly, i.e., the instances of a class that have been classified with their true class.

True Negatives(TN): These are the test instances that belong to the Negative class and have been predicted correctly.

These two can be remembered as it is “True” that these are positive or negative respectively.

Based on the knowledge above, pause and think about what False Positives or False Negatives could be.

False Positives(FP): These are the instances that have been incorrectly classified as Positive but their actual class is Negative.

False Negatives(FN): These are the instances that have been incorrectly classified as Negative but their actual class is Positive.

These two can be remembered as it is “False” that these are positive or negative respectively and hence they indicate incorrect predictions.

With the help of the code below, taken from GeeksForGeeks, we will fit a Decision Tree Classifier to the iris data and obtain a prediction.

from sklearn import datasets
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
# loading the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# dividing X, y into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
# fitting a shallow decision tree and predicting the test labels
dtree_model = DecisionTreeClassifier(max_depth = 2).fit(X_train, y_train)
dtree_predictions = dtree_model.predict(X_test)
# plotting the confusion matrix (plot_confusion_matrix was removed in scikit-learn 1.2)
ConfusionMatrixDisplay.from_estimator(dtree_model, X_test, y_test)
plt.show()

This is the Confusion matrix from the above code.

Confusion Matrix for Three Classes

Now, let us label the TP, FP, TN, FN from this matrix. Since there are 3 classes, these values are calculated for each class. The diagonal elements show the correct predictions.

For Class 0, TP = 13, i.e., all of its instances are classified correctly and no other instance is predicted as Class 0; hence FN = FP = 0, and the remaining 25 test instances are TNs. For Class 1, TP = 15, FN = 1, FP = 3. As an exercise, come up with the values for Class 2 (TP = 6, FN = 3, FP = 1).
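
The per-class counts can also be derived programmatically. The sketch below assumes the confusion matrix implied by the values quoted above, with rows as true classes and columns as predicted classes; the matrix literal is reconstructed from those values rather than copied from the plot, and `per_class_counts` is a hypothetical helper written for this illustration.

```python
# Confusion matrix reconstructed from the per-class values quoted in the text
# (assumption: rows = true class, columns = predicted class, as in scikit-learn).
cm = [
    [13, 0, 0],
    [0, 15, 1],
    [0, 3, 6],
]

total = sum(sum(row) for row in cm)

def per_class_counts(cm, k):
    """Return (TP, FP, FN, TN) with class k treated as the positive class."""
    tp = cm[k][k]                          # diagonal entry
    fn = sum(cm[k]) - tp                   # rest of row k: true k, predicted otherwise
    fp = sum(row[k] for row in cm) - tp    # rest of column k: predicted k, true otherwise
    tn = total - tp - fn - fp              # everything not involving class k
    return tp, fp, fn, tn

for k in range(len(cm)):
    tp, fp, fn, tn = per_class_counts(cm, k)
    print(f"class {k}: TP={tp} FP={fp} FN={fn} TN={tn}")
```

Reading a row gives the False Negatives of a class and reading a column gives its False Positives, which is an easy way to keep the two apart.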

Recall

This is also called Sensitivity. It indicates, out of all the instances that actually belong to a class (True Positives + False Negatives), how many were classified correctly. In other words, it tells us how well a model finds the instances of a particular class.

Recall = True Positives / (True Positives + False Negatives)

Precision

This is also called Positive Predictive Value. This metric measures, out of all the instances classified with a particular class (True Positives + False Positives), how many actually belong to the class. If our model classifies something as positive, precision indicates how confident we can be that it is actually positive.

Precision = True Positives / (True Positives + False Positives)

Recall vs(and) Precision

In the above example, class 0 has 13 samples and all of them have been classified correctly. Now, let us say along with these 13 samples, 7 instances of class 2 have also been classified as class 0. In this case, we have a recall of 100% but the precision is only 65%. Had we used only recall to evaluate the model it would have been misleading. Hence, we need to examine both for evaluation of our models. Sometimes it also happens that increasing the precision results in decrease of recall and vice-versa.
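
The numbers in this hypothetical scenario follow directly from the two formulas. Here is a small sketch; the function names `recall` and `precision` are chosen for this illustration and are not library functions.

```python
def recall(tp, fn):
    """Share of actual positives that were predicted as positive."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)

# Hypothetical scenario from the paragraph above, with class 0 as the positive class:
# all 13 true samples are found, and 7 class 2 samples are wrongly labelled as class 0.
tp, fn, fp = 13, 0, 7
print(f"recall = {recall(tp, fn):.2f}")        # 13 / (13 + 0)
print(f"precision = {precision(tp, fp):.2f}")  # 13 / (13 + 7)
```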

Which measure is more important may also change based on our use case. For example, in serious issues like a cancer diagnosis, False Negatives could be dangerous, as people with cancer might go untreated, which results in the disease becoming worse. On the other hand, if someone is falsely classified as positive, more clinical tests can be conducted and they can be reassured that they are healthy. In such cases, we aim for high recall. In the case of finding an accused person guilty, a False Positive (labelling an innocent person as guilty) might be considered more serious, and hence we aim for higher precision.

F1-Score

As Recall and Precision could be at odds, to achieve a balance we could combine them into a single measure known as F1-Score.

F1-score is defined as a harmonic mean of Precision and Recall and like Recall and Precision, it lies between 0 and 1. The closer the value is to 1, the better our model is. The F1-score depends both on Recall and Precision.

F1-Score = 2 * Precision * Recall/(Precision + Recall)
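
To see why a harmonic mean is used rather than a plain average, consider a model with perfect recall but very poor precision. The values below are made up for illustration, and `f1` is a hypothetical helper, not the scikit-learn function.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; taken as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values only: perfect recall, poor precision.
p, r = 0.10, 1.00
print(f"arithmetic mean = {(p + r) / 2:.2f}")
print(f"F1-score = {f1(p, r):.2f}")
```

The arithmetic mean is pulled up by the high recall, while the F1-Score stays close to the weaker of the two measures, so a model cannot score well by excelling at only one of them.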

All the above measures for each class can be calculated using sklearn classification report:

from sklearn.metrics import classification_report
print(classification_report(y_test, dtree_predictions, target_names=['class 0', 'class 1', 'class 2']))
Output from Classification Report

From the above report, we can see that overall accuracy is 0.89 and precision, recall, and f1-score for each class have been calculated. Let us verify the scores for class 1 from our calculations of TP, FP, FN, TN above.

For Class 1, TP = 15, FN = 1, FP = 3
Recall = 15/(15 + 1) = 0.94
Precision = 15/(15 + 3) = 0.83
F1-Score = 2 * 0.94 * 0.83/(0.94 + 0.83) = 0.88

The macro and weighted averages summarise each score across the classes. The macro average is the plain mean of the measure over the three classes, whereas the weighted average weights each class's measure by its proportion of the test instances (its support). These can be used as the final scores of a model and can be further used for comparison with other models.
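
The difference between the two averages can be sketched by hand for recall, using the per-class TP, FP and FN values quoted earlier in the post. This is an illustrative calculation, not the internals of the classification report.

```python
# Per-class (TP, FP, FN) values quoted earlier in the post.
counts = {0: (13, 0, 0), 1: (15, 3, 1), 2: (6, 1, 3)}

# Recall per class, and support = number of true instances of the class (TP + FN).
recalls = {k: tp / (tp + fn) for k, (tp, fp, fn) in counts.items()}
support = {k: tp + fn for k, (tp, fp, fn) in counts.items()}

# Macro average: plain mean over the classes, so every class counts equally.
macro_recall = sum(recalls.values()) / len(recalls)

# Weighted average: each class is weighted by its share of the test instances.
n = sum(support.values())
weighted_recall = sum(recalls[k] * support[k] for k in recalls) / n

print(f"macro recall = {macro_recall:.2f}")
print(f"weighted recall = {weighted_recall:.2f}")
```

Because class 2 has the lowest recall and the fewest instances, it pulls the macro average down more than the weighted one.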

ROC-AUC Curve

Another popular way to measure model performance is the ROC-AUC curve/score. The Receiver Operating Characteristic (ROC) curve shows the performance of the model at different threshold values used for the classification. It plots the True Positive Rate on the y-axis against the False Positive Rate on the x-axis, each calculated at every probability threshold. This curve helps to measure the model’s ability to distinguish between two classes. As we saw in the previous examples, it is not enough to just have a high recall; we also need to make sure that FPs are not too high.

Apart from recall (sensitivity or True Positive Rate), many other metrics can be calculated from the confusion matrix. One of them is the False Positive Rate, which is calculated as shown below and equals 1 - Specificity.

Specificity = True Negatives/(True Negatives + False Positives)
False Positive Rate = Number of False Positives/(Number of False Positives + True Negatives)
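
As a quick check of these two formulas, the sketch below uses Class 1 as the positive class. FP = 3 is quoted earlier in the post; TN = 19 is derived here as an assumption from the 38 test instances implied by the per-class counts (38 minus TP = 15, FN = 1 and FP = 3).

```python
def specificity(tn, fp):
    """Share of actual negatives that were predicted as negative."""
    return tn / (tn + fp)

def false_positive_rate(fp, tn):
    """Share of actual negatives that were wrongly predicted as positive."""
    return fp / (fp + tn)

# Class 1 as the positive class: FP = 3 is quoted in the post,
# TN = 19 is derived (38 test instances - TP 15 - FN 1 - FP 3).
tn, fp = 19, 3
spec = specificity(tn, fp)
fpr = false_positive_rate(fp, tn)
print(f"specificity = {spec:.3f}")
print(f"false positive rate = {fpr:.3f}")
```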

In general, our aim is to reduce the False Positive Rate and have a high True Positive Rate. To measure this, we calculate the AUC, which stands for Area Under the ROC Curve. It lies between 0 and 1, and the closer the AUC is to 1, the better our model is. If the AUC is zero, the model ranks every negative instance above every positive one, i.e., its predictions are exactly inverted. If the AUC is one, the model differentiates between the two classes perfectly: the curve rises vertically to the top-left corner (TPR = 1, FPR = 0) and then runs horizontally. If the AUC is 0.5, the curve lies on the diagonal where TPR and FPR are equal, and the model is no better than a random prediction. Usually, an AUC score of 0.8 or 0.9 is considered to be good.
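
One way to build intuition for these values: the AUC equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain Python on made-up scores; `pairwise_auc` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]              # made-up binary labels
perfect = [0.1, 0.2, 0.8, 0.9]     # every positive scores above every negative
inverted = [0.9, 0.8, 0.2, 0.1]    # every positive scores below every negative
constant = [0.5, 0.5, 0.5, 0.5]    # scores carry no information

print("perfect ranking: ", pairwise_auc(y_true, perfect))
print("inverted ranking:", pairwise_auc(y_true, inverted))
print("constant scores: ", pairwise_auc(y_true, constant))
```

The three cases correspond to the AUC values of one, zero and 0.5 respectively.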

The ROC-AUC curve is defined for a binary classification problem. In a multi-class setting, we need to recast the problem as OneVsRestClassification, i.e., for plotting the graphs, we first consider Class 0 as the positive class and classes 1 and 2 together as the negative class. Similarly, we consider Class 1 as the positive class and classes 0 and 2 as the negative class, and likewise for Class 2. Alternatively, we could also plot the ROC-AUC curve with OneVsOneClassification, and measure the model’s ability to distinguish any two classes by ignoring the third class.

Let us apply the strategy to our current problem using the code given here Scikit-Code-Multi-Class. We change the code to use a decision tree classifier with max_depth = 2 instead of the SVM and plot the ROC-AUC curve of each class.
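
The linked scikit-learn example is not reproduced here. As a rough sketch of the one-vs-rest idea, the code below computes only the AUC scores (no plot) for the same decision tree, under the assumption that the predicted class probabilities are used as scores. Fixing `random_state=0` on the tree is an addition for reproducibility and is not part of the original code.

```python
from sklearn import datasets
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)
dtree_model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# Predicted class probabilities, one column per class.
y_score = dtree_model.predict_proba(X_test)
# One-vs-rest targets: column k is 1 where the true class is k, else 0.
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])

# AUC of each class against the rest.
aucs = {}
for k in range(3):
    aucs[k] = roc_auc_score(y_test_bin[:, k], y_score[:, k])
    print(f"class {k} vs rest: AUC = {aucs[k]:.2f}")

# The same one-vs-rest strategy, macro-averaged by scikit-learn directly.
macro_auc = roc_auc_score(y_test, y_score, multi_class="ovr", average="macro")
print(f"macro-average AUC = {macro_auc:.2f}")
```

The exact numbers may differ from the figure, since the averaging used for the plotted curve is not specified in this post.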

ROC-AUC Curve for Multi-class Classification

From the above graph, we can see ROC-curves of different classes. The class 0 has the highest AUC and class 1 has the lowest AUC. The black dashed line represents random predictions and the blue one shows the average AUC of all three classes. With this information, we could build better models that improve AUC of class 1 and class 2. Here, the AUC represents how well the model distinguishes a particular class from rest of the classes.

In this post, we have discussed some classification metrics which have formed a basis for more advanced metrics. Understanding them is important in optimization and selection of appropriate models.

I am a student of Masters in Data Science @ TU Dortmund. Feel free to connect with me on LinkedIn for any feedback on my posts or any professional communication.
