Summary

The article discusses the F1 score as a metric for evaluating classification models, particularly in the context of imbalanced datasets, and explains its calculation, benefits, and limitations.

Abstract

The F1 score is a measure that combines precision and recall into a single metric for model evaluation, particularly useful when dealing with imbalanced datasets where accuracy may not be the best performance indicator. It is the harmonic mean of precision, which measures the proportion of true positives among all predicted positives, and recall, which measures the proportion of true positives among all actual positives. The F1 score penalizes models with low precision or recall, making it a more robust metric than the arithmetic mean. However, it is not suitable for all imbalanced data scenarios, especially when the positive class is the majority. The article also highlights that the F1 score is not symmetric and is affected by the distribution of classes, necessitating careful consideration when interpreting the score, especially in multi-class classification problems where macro, micro, and weighted averaging methods can be used to aggregate F1 scores across classes.

Opinions

The F1 score is preferred over accuracy for imbalanced datasets as it provides a better measure of model usefulness.
Using the arithmetic mean of precision and recall can be misleading, as it does not adequately penalize models with uneven precision and recall values.
The harmonic mean used in the F1 score is more appropriate for model evaluation as it emphasizes the need for balance between precision and recall.
The F1 score is not a panacea for evaluating models on imbalanced datasets; its effectiveness depends on the specific context of the dataset and the importance of correctly identifying each class.
When the positive class is the minority or the dataset is balanced, a high F1 score is indicative of a good model, but a low F1 score does not necessarily imply a poor model.
The F1 score's lack of symmetry means that it may not be the best choice if the focus is on the negative class predictions.
For multi-class classification, the F1 score should be computed for each class, and different averaging methods should be considered to obtain a single performance metric.
While the F1 score is useful, it is recommended to also examine precision, recall, and the confusion matrix for a more comprehensive understanding of a model's performance.

F1 Score: Don’t Use It For All Imbalanced Data

Benefits and pitfalls of the popular metric

TL;DR at the end

We all had to deal with imbalanced data at some point, where accuracy is not the best metric anymore, and we need something more robust. Should you use the F1 score as a metric instead? Let’s deep dive into where it came from and when it’s the right and wrong choice.

Our job is to create a model to classify if people are healthy or sick. We are given data about them, we’ve created multiple classification models, and it’s time to select the best one.

Precision and Recall

A common way to estimate a model's performance is to measure its precision and recall.

Precision — What portion of all predicted positives are actual positives.

Recall — what portion of all actual positives in our data did we predict correctly.

Precision and recall are great metrics, but they’re still two numbers. If you want to compare two different models to decide which is better, it would be easier to have a single number.

Arithmetic mean

One way to combine precision and recall is just their average (arithmetic mean).

This method effectively combines the two metrics into a single value. However, here’s the catch.

Here, we’ve got the same average. But are the models equally good?

The first model could be just calling everything in our dataset a positive without any logic whatsoever, but the second model looks more useful.

When looking for a good model, we want to avoid the ones with either low precision or recall. These are probably not useful models, and we’d like to lower the “score” if one of the numbers is way smaller than the other.

This is where harmonic mean comes into place.

The F1 score is the harmonic mean of Precision and Recall.

Let’s build it together.

Harmonic mean

At the start, we have our normal arithmetic mean.

Now, let’s invert the P and R.

Why? Let’s graph it out.

When a number is inverted, the smaller it is, the exponentially larger the inversion becomes. Small precision or recall would result in a large value.

But we want to punish small precision and recall, not reward it. That’s why we will invert the whole equation, effectively flipping the numerator and denominator.

This will cause lower and lower precision/recall to exponentially increase the denominator, making the result exponentially smaller.

Let’s freeze precision at 1.0 and see what happens when recall changes.

If we achieve a recall of 1.0, then the harmonic mean is simply 1 (the maximum). Once the recall starts decreasing, we start losing points exponentially. The same would be true if precision decreases.

One interesting property arises — Harmonic mean is always smaller than the arithmetic mean (except when both Precision and Recall are equal, then it’s the same).

We can look at it as a more pessimistic average of precision and recall, giving us a better metric to judge the model’s usefulness.

Another way to look at F1 score is as a score that will ‘cling’ to the lower of the two values.

That’s it!

F1 score simply averages up precision and recall in a way that exponentially punishes small values of either of them.

You made it. Here’s a puffin named Bob as a reward.

Some interesting facts about F1 score

The equation

You may have seen a different equation for the F1 score. It is equivalent to the one above.

F1 score will not save you from imbalanced data.

Let’s consider differently balanced datasets.

Case 1 — Our data has more positives than negatives, and we got a great F1 score even though we are absolutely horrible at detecting negatives. This is not a good model.
Case 2 — We have more negatives in our data. We still get a high F1 score, but it is supported by good performance on negatives.
Case 3 — We have more negatives but worse overall performance than case 2, so the F1 score drops. This is desirable.
Case 4 — We have EVEN more negatives and are pretty good at detecting them. However, our F1 score is super low, even though we misclassified only 1% of them. This is desirable when our positives are people without covid, and we really care about not letting anyone sick through.

Overall, the F1 score is a good choice if your positive class is the minority class or the data is balanced. In that case, if your F1 score is high, the model is likely good (Case 2). However, if it’s low, it doesn’t necessarily mean the model is bad (Case 4).

It is not a good choice if your positive class is the majority of your data (Case 1).

When using the F1 score, you should always consider your data distribution and the use case of the model.

Symmetry

The F1 score is computed from precision and recall, which are metrics computed for the positive class. If you swapped the negative and positive labels in your dataset, your precision and recall — and subsequently F1 score — would be different!

F1 score is not symmetric, and if you care more about negative predictions, consider using them for computing the F1 score instead.

Multiple classes

F1 score is always computed for one class — for binary classification, we usually pick the positive class.

We can also compute the F1 score for a classification dataset with many classes. Then, we would need to pick a class in the style of one-vs-the-rest and compute the F1 score for that class.

This leaves us with an F1 score for each class. To aggregate these into a simple number, you can use different methods like macro, micro, and weighted averaging.

Informational value

F1 score is a great metric to judge a model performance in a quickly dissectable single number that we can easily compare across models. However, we always get more information from looking at Precision and Recall separately and even more information from a complete Confusion matrix.

Conclusion/TL;DR

The F1 score is a metric showing how good a classification model is.
It combines Precision and Recall and clings closer to the lower of the two, emphasizing the need for a balance between them.
The F1 score should be evaluated on a dataset where the positive class isn’t the majority.
It’s a great way to use a single number to compare and pick models when used correctly.