Evaluating Classification Models: Why Accuracy Is Not Enough
Accuracy is the metric that most commonly comes to mind when we think about evaluating classification models in machine learning. Although it is intuitive to understand and easy to measure, it needs to be used with caution. As data science practitioners, we need to know exactly when and how to use it. In this post, I break down the concept of accuracy and explain, through a hypothetical example, why it can be misleading.
Introduction
It would not be possible to talk about accuracy without first introducing the idea of classification. Classification is the task of categorising data into different classes. In machine learning, it is a supervised learning task, meaning that models learn from labelled data.
For simplicity, subsequent discussions will be in the context of binary classification — classification with only two possible outcomes. Examples of binary classification in the real world include predicting whether a customer will respond to a marketing ad, detecting emails likely to be spam and predicting whether a tumour is malignant or benign.
What is Accuracy?
Suppose you have trained a machine learning model to perform binary classification. How do you know whether the model is making good predictions? The metric that most often comes to mind is accuracy: the proportion of predictions that the model gets right. It is intuitive and easy to measure but, as the example in the next section shows, it can paint a misleading picture.
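As a minimal sketch of this definition, the snippet below computes accuracy by hand for a handful of made-up labels. The function name `accuracy` and the toy label lists are illustrative assumptions, not part of the example discussed in this post.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels (hypothetical): 1 = positive class, 0 = negative class.
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0]

toy_accuracy = accuracy(y_true, y_pred)  # 4 of the 5 predictions match
print(toy_accuracy)
```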
The Accuracy Paradox: A Simple Example
When I first started learning data science, phrases like “the accuracy paradox” and “accuracy can be misleading” came up very often. I never really understood the reasons behind them until I dug deeper. I decided to help myself understand better by creating a simple, fictitious example, and I would like to share this with you. Here we go:
Let’s first get some nomenclature out of the way. Consider a dataset where each observation corresponds to a patient and records whether that patient has cancer. The class, in this case, is the categorical variable indicating whether a patient has cancer. Usually, we are interested in knowing whether an observation belongs to one particular class. This is commonly known as the class of interest and is labelled the positive class. The positive class is often also the rarer class, i.e. the minority class (though there are exceptions to this!).
In our cancer prediction example, we are more interested in knowing whether a patient has cancer, because it is the outcome with more serious consequences. Having cancer is also statistically rarer than not having it. Thus, the positive class corresponds to a patient having cancer.
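Suppose that we have 1,000 observations, of which only 100 belong to the positive class, i.e. patients with cancer. Let’s assume that we have a binary classification model trained on a separate dataset, and that the model has generated predictions on these 1,000 observations. Figure 1 shows a confusion matrix that summarises the hypothetical prediction results: 80 of the 100 patients with cancer are correctly predicted as having cancer and 820 of the 900 patients without cancer are correctly predicted as not having it, which leaves 20 cancer cases missed and 80 patients wrongly flagged.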
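For readers who prefer code to a picture, here is a rough sketch of how the four cells of such a confusion matrix fit together. The two correct-prediction counts (80 and 820) are the ones this post quotes for Figure 1; the two error counts are derived from the class totals, and the variable names and dictionary layout are my own assumptions.

```python
# Figures quoted in the post.
total = 1000            # all observations
actual_positive = 100   # patients with cancer
true_positive = 80      # cancer correctly predicted
true_negative = 820     # no cancer correctly predicted

# Counts implied by the figures above.
actual_negative = total - actual_positive          # patients without cancer
false_negative = actual_positive - true_positive   # cancer cases missed
false_positive = actual_negative - true_negative   # wrongly flagged as cancer

# Confusion matrix keyed by (actual class, predicted class).
confusion = {
    ("cancer", "cancer"): true_positive,
    ("cancer", "no cancer"): false_negative,
    ("no cancer", "cancer"): false_positive,
    ("no cancer", "no cancer"): true_negative,
}

for (actual, predicted), count in confusion.items():
    print(f"actual={actual:<9} predicted={predicted:<9} count={count}")
```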

Model is Accurate… Or Is It?
The cells shaded in green represent observations that the model predicted correctly. Out of 1,000 predictions, 900 were correct. Accuracy is 90%, and you start to jump for joy and think, “Hurray! Our model is great!” In reality, this is misleading. Here’s why:
Recall that accuracy is the proportion of correct predictions made by the model. For binary classification problems, the number of correct predictions is the sum of two counts:
- Correctly predicted positives (the value of 80 in the top-left quadrant of Figure 1); and
- Correctly predicted negatives (the value of 820 in the bottom-right quadrant of Figure 1).
An accuracy of 90% therefore reflects how well the model predicted the positive and negative classes combined: 90% of all predictions, cancer and no cancer taken together, were correct. It does not mean that the model was 90% accurate on each class.
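Written out with the usual confusion-matrix shorthand, the 90% figure looks like this. The symbols TP, TN, FP and FN (true positives, true negatives, false positives and false negatives) are assumed notation rather than labels taken from Figure 1.

```latex
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{80 + 820}{1000} = 0.90
\]
```

Both correct-prediction counts sit in the same numerator, so the formula cannot say how much of the 90% comes from each class.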
However, what we usually want to know in binary classification is how well the model predicts the class of interest, which in this case is cancer. Did our model do a good job of predicting cancer? It is very easy to be misled into thinking that it did. The truth is that we cannot tell from accuracy alone.
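One way to see this is to compare against a deliberately useless baseline. The sketch below is a hypothetical illustration, not part of the original example: it uses the same class counts (100 with cancer, 900 without) and a "model" that predicts no cancer for every single patient.

```python
# Class counts from the example: 100 patients with cancer, 900 without.
y_true = [1] * 100 + [0] * 900   # 1 = cancer, 0 = no cancer

# Hypothetical baseline: predict "no cancer" for every patient.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
baseline_accuracy = correct / len(y_true)

# Cancer cases the baseline actually detects.
cancers_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"accuracy: {baseline_accuracy:.0%}, cancers found: {cancers_found}")
# prints "accuracy: 90%, cancers found: 0"
```

This baseline reaches the same 90% accuracy as the model in our example without detecting a single case of cancer, so accuracy by itself cannot separate the two.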
How is Accuracy a Paradox?
What do we mean by the phrase “accuracy paradox”? Let’s explain it using our example. On one hand, accuracy was high, which seemed to imply that the model was good. On the other hand, that high accuracy was not useful, because it did not tell us how the model performed in predicting the class of interest.
“A paradox is a situation or a statement that seems impossible or is difficult to understand because it contains two opposite facts or characteristics.” — Cambridge Dictionary.
Wrapping Up
If you think about it, the problem with using accuracy in the above example stems from the fact that our dataset was imbalanced: 100 observations with cancer and 900 without. This is known as class imbalance. Many real-world datasets are imbalanced, which makes it all the more important to be aware of the pitfalls of the accuracy metric and to exercise caution when we use it to evaluate classification models.
According to Wasikowski and Chen², a model that predicts better on the minority class is preferred, even if that means compromising performance on the majority class. In our example, a good model is one that predicts cancer well. Because the accuracy metric does not differentiate between correctly predicted positives and correctly predicted negatives, it cannot tell us how well the model predicts cancer, i.e. the positive class. So how can we assess model performance? The answers lie in precision and recall! In my next post, I will share how we can instead use precision and recall to evaluate binary classification models.
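As a small preview, here is a sketch of what those two metrics give for the numbers in this post. It assumes the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)), and the false positive and false negative counts are derived from the class totals rather than quoted directly.

```python
# Counts from the example; the two error counts are derived from the
# class totals (900 - 820 false positives, 100 - 80 false negatives).
true_positive = 80
true_negative = 820
false_positive = 80
false_negative = 20

total = true_positive + true_negative + false_positive + false_negative

# Standard definitions, assumed here ahead of the follow-up post.
accuracy = (true_positive + true_negative) / total
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)

print(f"accuracy: {accuracy:.0%}, precision: {precision:.0%}, recall: {recall:.0%}")
# prints "accuracy: 90%, precision: 50%, recall: 80%"
```

Unlike accuracy, both of these numbers are computed around the positive class only.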
References
1. Foster Provost and Tom Fawcett. Data Science for Business. O’Reilly Media, Inc., first edition, December 2013.
2. Mike Wasikowski and Xue-wen Chen. Combating the Small Sample Class Imbalance Problem Using Feature Selection. IEEE Transactions on Knowledge and Data Engineering, 22(10):1388–1400, October 2010. ISSN 1041-4347.
That’s all for now. Thank you for reading this post. Have a great day!





