Summary

The article emphasizes the limitations of using accuracy as the sole metric for evaluating binary classification models, especially in cases of class imbalance.

Abstract

The article "Evaluating Classification Models: Why Accuracy Is Not Enough" discusses the potential pitfalls of relying on accuracy as the primary measure of a classification model's performance. It explains that while accuracy—the proportion of correct predictions—is straightforward and commonly used, it can be misleading, particularly in binary classification tasks with imbalanced datasets. The author illustrates this through a hypothetical example involving a cancer prediction model, where a high accuracy rate fails to reflect the model's effectiveness in predicting the positive class (cancer cases), which is the minority and more critical class. The article underscores the importance of considering other metrics such as precision and recall, which provide a more nuanced understanding of a model's performance in predicting the class of interest. The author concludes by highlighting the concept of class imbalance and the need for models that predict well on the minority class, even at the expense of performance on the majority class.

Opinions

The author believes that accuracy, while intuitive and easy to measure, should be used with caution in evaluating classification models.
It is posited that the accuracy paradox can occur when a model achieves high accuracy but fails to effectively predict the class of interest, especially in imbalanced datasets.
The article suggests that data science practitioners must be aware of the limitations of accuracy and consider metrics like precision and recall for a more accurate assessment of model performance.
The author advocates for the development of models that prioritize predictive performance on the minority class, which is often of greater interest and consequence.
A preference is expressed for models that can predict the minority class well, even if it means a compromise in the performance on the majority class.

Evaluating Classification Models: Why Accuracy Is Not Enough

Accuracy is often the first metric that comes to mind when we evaluate classification models in machine learning. Although it is intuitive to understand and easy to measure, it needs to be used with caution. As data science practitioners, we need to know exactly when and how to use it. In this post, I break down the concept of accuracy and explain, through a hypothetical example, why it can be misleading.

Introduction

It would not be possible to talk about accuracy without first introducing the idea of classification. Classification is the task of categorising data into different classes. In machine learning, it is a supervised learning task, in which models learn from labelled data.

For simplicity, subsequent discussions will be in the context of binary classification — classification with only two possible outcomes. Examples of binary classification in the real world include predicting whether a customer will respond to a marketing ad, detecting emails likely to be spam and predicting whether a tumour is malignant or benign.

What is Accuracy?

Suppose you have trained a machine learning model to perform binary classification. How do you know if the model is doing a good job of making predictions? A metric that commonly comes to mind is accuracy. In essence, accuracy is the proportion of correct predictions generated by the model, i.e. the number of correct predictions divided by the total number of predictions. It is intuitive to understand and easy to measure but, as data science practitioners, we need to know exactly when and how to use it.
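To make the definition concrete, here is a minimal sketch in plain Python. The function name `accuracy` and the toy labels are my own illustrative choices, not something taken from a particular library.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    # Count the positions where the prediction equals the true label.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical toy labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0]

print(accuracy(y_true, y_pred))  # 3 of 5 predictions match -> 0.6
```

Notice that the function only checks whether each prediction matches its label. It never looks at which class the correct predictions belong to.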

The Accuracy Paradox: A Simple Example

When I first started learning data science, phrases like “the accuracy paradox” and “accuracy can be misleading” came up very often. I never really understood the reasons behind them until I dug deeper. I decided to help myself understand better by creating a simple, fictitious example, and I would like to share it with you. Here we go:

Let’s first get some nomenclature out of the way. Consider a dataset where each observation corresponds to a patient and records whether that patient has cancer. The class, in this case, refers to the categorical variable of whether a patient has cancer. Usually, we are interested in knowing whether an observation belongs to one particular class. This is commonly known as the class of interest and is labelled the positive class. The positive class is often also the rarer class, i.e. the minority class (though there may be exceptions to this!).

In our cancer prediction example, we are more interested in knowing whether a patient has cancer, because it is the outcome with more serious consequences. It is also statistically rarer to have cancer. Thus, the positive class corresponds to a patient having cancer.

Suppose that we have 1,000 observations, of which only 100 belong to the positive class, i.e., patients with cancer. Let’s assume that we have a binary classification model trained on a separate dataset, and that the model has generated predictions on these 1,000 observations. Figure 1 shows a confusion matrix that summarises the hypothetical prediction results.

Figure 1: Confusion matrix of the binary classification model (Credits: Zeya)

Model is Accurate… Or Is It?

The cells shaded in green represent observations that the model predicted correctly. Out of 1,000 predictions, 900 were correct. Accuracy is 90%, and you start to jump for joy and think, “Hurray! Our model is great!” In reality, this is misleading. Here’s why:

Recall that accuracy is the proportion of correct predictions made by the model. For binary classification problems, the number of correct predictions consists of two things:

Correctly predicted positive classes (the value of 80 in the top-left quadrant of Figure 1); and
Correctly predicted negative classes (the value of 820 in the bottom-right quadrant of Figure 1)

An accuracy of 90% reflects how well the model predicted the positive and negative classes combined. In this case, 90% of all predictions, cancer and no cancer taken together, were correct.
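The arithmetic behind the 90% can be checked with a short sketch. The post states only the total, the number of cancer patients and the two correct cells. The two error cells below are derived by subtraction and are my own reconstruction, not values quoted from Figure 1.

```python
# Counts stated in the post.
total = 1000
actual_positive = 100  # patients with cancer
true_positive = 80     # cancer correctly predicted
true_negative = 820    # no cancer correctly predicted

# The remaining cells follow by subtraction (derived, not quoted from Figure 1).
actual_negative = total - actual_positive          # 900 patients without cancer
false_negative = actual_positive - true_positive   # cancer cases the model missed
false_positive = actual_negative - true_negative   # healthy patients flagged as cancer

accuracy = (true_positive + true_negative) / total

print(false_negative, false_positive, accuracy)  # 20 80 0.9
```

The 900 correct predictions are dominated by the 820 correctly predicted negatives, which is why the single accuracy figure says little about the cancer cases on their own.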

However, what we want to know in binary classification is how well the model predicts the class of interest, which in this case means how well it predicts cancer. Did our model do a good job of predicting cancer? It is very easy to be misled into thinking that it did. The truth is that we cannot tell from accuracy alone.

How is Accuracy a Paradox?

What do we mean by the phrase “accuracy paradox”? Let’s explain this using our example. On the one hand, accuracy was high, which seemed to imply that the model was reliable. On the other hand, the figure was not useful because it did not tell us how the model performed in predicting the class of interest.

“A paradox is a situation or a statement that seems impossible or is difficult to understand because it contains two opposite facts or characteristics.” — Cambridge Dictionary.

Wrapping Up

If you think about it, the problem with using accuracy in the above example stemmed from the fact that our dataset was highly imbalanced: 100 observations with cancer and 900 without. This is known as class imbalance. Many real-world datasets are imbalanced, which makes it all the more important for us to be aware of the pitfalls of the accuracy metric and to exercise caution when we use it to evaluate classification models.
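One way to see how much the imbalance matters is to score a trivial baseline on the same class counts. The sketch below assumes a hypothetical "model" that ignores its input and always predicts no cancer. It is my own illustration and is not part of the original example.

```python
# Class counts from the example: 100 patients with cancer, 900 without.
y_true = [1] * 100 + [0] * 900

# Hypothetical baseline that always predicts the majority class (0 = no cancer).
y_pred = [0] * len(y_true)

# Fraction of predictions that match the true labels.
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Number of actual cancer cases that the baseline flags as cancer.
cancer_cases_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(accuracy)            # 0.9
print(cancer_cases_found)  # 0
```

Under these assumptions the baseline also reaches 90% accuracy without identifying a single cancer patient, so a 90% score on this dataset cannot, by itself, separate a useful model from a useless one.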

According to Wasikowski and Chen², a model that predicts better on the minority class is preferred, even if it means compromising performance on the majority class. In our example, a good model is one that predicts cancer well. Since the accuracy metric does not differentiate between correctly predicted positive classes and correctly predicted negative classes, it cannot tell us how well the model predicts cancer, i.e. the positive class. So, how then can we assess the model's performance? The answers lie in precision and recall! In my next post, I will share how we can instead use precision and recall to evaluate binary classification models.
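As a brief preview, here is a sketch of precision and recall computed on the cancer example, using their standard definitions. The false negative and false positive counts are assumptions derived by subtraction from the totals in the post (100 cancer patients with 80 found, 900 healthy patients with 820 correctly cleared), and the follow-up post may present these metrics differently.

```python
# Cells of the example's confusion matrix.
tp = 80    # cancer, predicted cancer (stated in the post)
tn = 820   # no cancer, predicted no cancer (stated in the post)
fn = 20    # cancer, predicted no cancer (derived: 100 - 80)
fp = 80    # no cancer, predicted cancer (derived: 900 - 820)

accuracy = (tp + tn) / (tp + tn + fn + fp)
precision = tp / (tp + fp)  # of all cancer predictions, the share that were right
recall = tp / (tp + fn)     # of all actual cancer cases, the share that were found

print(accuracy, precision, recall)  # 0.9 0.5 0.8
```

On these numbers only half of the model's cancer predictions are correct, and one in five cancer cases is missed, neither of which is visible in the 90% accuracy.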

References

1. Foster Provost and Tom Fawcett. Data Science for Business. O’Reilly Media, Inc., first edition, December 2013.
2. Mike Wasikowski and Xue-wen Chen. Combating the Small Sample Class Imbalance Problem Using Feature Selection. IEEE Transactions on Knowledge and Data Engineering, 22(10):1388–1400, October 2010. ISSN 1041-4347.
