Simplifying Precision, Recall and F1 Score
Explaining evaluation metrics in basic terms
Machine learning terms can seem very convoluted, as if they were made to be understood by machines. The field is full of unintuitive and similar-sounding names like False Negatives and True Positives, Precision, Recall, Area Under ROC, Sensitivity, Specificity and Insanity. OK, the last one wasn’t real.
There are some great articles on precision and recall already, but when I read them and other discussions on Stack Exchange, the messy terms all mix up in my mind and I’m left more confused than an unlabelled confusion matrix, so I’ve never felt like I understood them fully.

But to know how well our model is working, it is important to master the evaluation metrics and understand them at a deep level. So what does a data scientist really need to know to evaluate a classification model? I explain the most important metrics below using visuals and examples so they can stick in our brains for good.
Accuracy
Let’s start with the easiest: Accuracy. It is the fraction of all predictions where your model guessed the correct label (the ground truth). If your dataset is pretty balanced and you care about getting every category correct, this is all you need to worry about.

Sadly, if your dataset is imbalanced, like a fraud detection dataset, odds are non-fraud cases take up 80–90% of your labels. So if your model blindly predicts the majority label for every data point, it would still score 80–90% accuracy without catching a single fraud case.
That’s when we need Precision and Recall.
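To make the accuracy trap concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (1 = fraud, 0 = non-fraud, with 90% non-fraud), and the "model" is just a hypothetical baseline that always predicts the majority class.

```python
# Hypothetical toy labels: 1 = fraud, 0 = non-fraud (90% non-fraud).
y_true = [0] * 90 + [1] * 10
# A "model" that blindly predicts the majority class for every data point.
y_pred = [0] * 100


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


acc = accuracy(y_true, y_pred)
# Count how many actual fraud cases the model flagged as fraud.
frauds_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(acc)            # 0.9
print(frauds_caught)  # 0
```

The accuracy looks respectable, yet the model never flags a single fraud case, which is exactly the blind spot that precision and recall expose.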
Precision (Also called Positive Predictive Value)
Precision is the ratio of what the model predicted correctly to everything the model predicted for that class: true positives / (true positives + false positives). For each category/class, there is one precision value.
We focus on precision when we need the predictions to be correct, i.e. ideally you want to make sure the model is right whenever it predicts a label. For example, if you have a football betting model that predicts whether to bet or not, you care most about it being right when it says "bet", because you will take action based on that prediction, whereas you don’t lose money when it tells you not to bet.
When favouring precision, the cost of getting a prediction wrong is much higher than the cost of missing out on the right prediction.
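As a rough illustration of precision, the sketch below counts true and false positives by hand. The betting labels are invented for the example, and returning 0.0 when the model makes no positive predictions is just one assumed convention.

```python
def precision(y_true, y_pred, positive=1):
    """Of everything predicted as `positive`, what fraction really was?"""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    predicted_positive = tp + fp
    # Assumed convention: precision is 0.0 if nothing was predicted positive.
    return tp / predicted_positive if predicted_positive else 0.0


# Hypothetical betting example: 1 = "bet", 0 = "don't bet".
y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # matches where betting would have paid off
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # what the model told us to do

bet_precision = precision(y_true, y_pred)
print(round(bet_precision, 2))  # 0.67
```

Here the model said "bet" three times and was right twice. The two good bets it skipped do not hurt precision at all; they only show up in recall.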

Recall (Also called Sensitivity)
Recall is the ratio of what the model predicted correctly to everything that actually belongs to that class: true positives / (true positives + false negatives). Similarly, for each category/class, there is one recall value.
We care about recall when we want to catch as many examples of a particular class as possible, i.e. ideally you want the model to capture all examples of the class. For example, airport security scanning machines have to make sure the detectors don’t miss any actual bombs/dangerous items, and hence we are okay with sometimes stopping the wrong bag/traveller.
When favouring recall, the cost of missing a prediction is much higher than the cost of a wrong prediction.
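Recall can be sketched the same way, this time dividing by the actual positives rather than the predicted ones. The scanner labels below are made up for illustration, and returning 0.0 when there are no actual positives is an assumed convention.

```python
def recall(y_true, y_pred, positive=1):
    """Of everything that really is `positive`, what fraction did we catch?"""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    actual_positive = tp + fn
    # Assumed convention: recall is 0.0 if there are no actual positives.
    return tp / actual_positive if actual_positive else 0.0


# Hypothetical scanner example: 1 = dangerous item, 0 = harmless bag.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]  # two harmless bags stopped, one item missed

scanner_recall = recall(y_true, y_pred)
print(scanner_recall)  # 0.75
```

The two harmless bags that were stopped do not lower recall; only the one missed dangerous item does.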

F1-Score: Combining Precision and Recall
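If we want our model to have a balanced precision and recall score, we average them to get a single metric. But what kind of average is ideal? For ratios like precision and recall, the harmonic mean (which is what the F1-Score is) is more suitable than the usual arithmetic mean, because it is pulled towards the lower of the two values, so a model cannot score well by being strong on only one of them.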
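The definition of a harmonic mean seems complex: the reciprocal of the arithmetic mean of the reciprocals of your scores. How I approach lengthy definitions is to start from the deepest layer, and understand it layer by layer. There are 3: first take the reciprocal of each score (1/precision and 1/recall), then take the arithmetic mean of those reciprocals, and finally take the reciprocal of that mean. Simplified, this gives F1 = 2 × precision × recall / (precision + recall).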
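The layer-by-layer reading of the harmonic mean can be sketched in a few lines of Python. The precision and recall values are arbitrary numbers picked to show the contrast with the arithmetic mean, and treating F1 as 0 when either score is 0 is an assumed convention.

```python
def f1_layer_by_layer(precision, recall):
    """Harmonic mean of precision and recall, built up in three layers."""
    if precision == 0 or recall == 0:
        return 0.0  # assumed convention: F1 is 0 when either score is 0
    reciprocals = [1 / precision, 1 / recall]   # layer 1: reciprocals
    mean_of_reciprocals = sum(reciprocals) / 2  # layer 2: arithmetic mean
    return 1 / mean_of_reciprocals              # layer 3: reciprocal again


def f1_shortcut(precision, recall):
    """The same quantity written as the usual closed-form formula."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Arbitrary example: a model that is very precise but misses almost everything.
p, r = 0.9, 0.1
arithmetic = (p + r) / 2
f1 = f1_layer_by_layer(p, r)

print(round(arithmetic, 2))  # 0.5
print(round(f1, 2))          # 0.18
```

The arithmetic mean makes this lopsided model look mediocre, while the harmonic mean drags the score down towards the weaker of the two values.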

Trade Offs: Fact of Life
As you may have figured, choosing between precision and recall comes down to what matters more to us: is it the cost of a wrong prediction, or is it the cost of missing out on the truth? Oftentimes, you have to give up one to get more of the other. Google has a great explanation/viz of the trade off, showing how toggling the classification threshold allows us to decide what we care about. Note that this will change our F1 score too.
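To get a feel for the threshold trade off, the sketch below scores a handful of examples and recomputes the metrics at three thresholds. The scores and labels are invented for illustration, and the rule "predict positive when score >= threshold" is an assumption about how the classifier is used.

```python
def precision_recall_f1(labels, scores, threshold):
    """Compute (precision, recall, F1) for the positive class at a threshold."""
    # Assumed decision rule: predict positive when the score reaches the threshold.
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(l == 1 and p == 1 for l, p in zip(labels, preds))
    fp = sum(l == 0 and p == 1 for l, p in zip(labels, preds))
    fn = sum(l == 1 and p == 0 for l, p in zip(labels, preds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical model scores (higher = more confident it is positive).
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

for threshold in (0.8, 0.5, 0.25):
    p, r, f1 = precision_recall_f1(labels, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

With these made-up scores, raising the threshold from 0.25 to 0.8 moves precision from about 0.67 up to 1.0 while recall falls from 1.0 to 0.5, and the F1 score shifts along with them.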

Conclusion
I hope this explains accuracy, precision, recall and F1 in a simple and intuitive way. Together with the examples, I think this is a great start to understanding other evaluation metrics. So is your business objective closer to a betting model, an airport scanner, or a mix of both?

