Comprehensive Guide to Multiclass Classification With Sklearn

Model selection, developing a strategy, and choosing an evaluation metric

Learn how to tackle any multiclass classification problem with Sklearn. The tutorial covers how to choose a model selection strategy, several multiclass evaluation metrics and how to use them, and finishes with hyperparameter tuning to optimize for user-defined metrics.

Introduction

Even though multi-class classification is less common than binary classification, it poses a much bigger challenge. You can take my word for it because this article has been the most challenging post I have ever written (and I have written close to 70).

I found that the topic of multiclass classification is deep and full of nuances. I have read so many articles, read multiple StackOverflow threads, created a few of my own, and spent several hours exploring the Sklearn user guide and doing experiments. The core topics of multiclass classification such as

choosing a strategy to binarize the problem
choosing a base model
understanding excruciatingly many metrics
filtering out a single metric that solves your business problem and customizing it
tuning hyperparameters for this custom metric
and finally putting all the theory into practice with Sklearn

have all been scattered in the dark, sordid corners of the Internet. This was enough to conclude that no single resource shows an end-to-end workflow of dealing with multiclass classification problems on the Internet (maybe, I missed it).

For this reason, this article will be a comprehensive tutorial on how to solve any multiclass supervised classification problem using Sklearn. You will learn both the theory and the implementation of the above core concepts. It is going to be a long and technical read, so get a coffee!

Native multiclass classifiers

Depending on the model you choose, Sklearn approaches multiclass classification problems in 3 different ways. In other words, Sklearn estimators are grouped into 3 categories by their strategy to deal with multi-class data.

The first and the biggest group of estimators are the ones that support multi-class classification natively:

naive_bayes.BernoulliNB
tree.DecisionTreeClassifier
tree.ExtraTreeClassifier
ensemble.ExtraTreesClassifier
naive_bayes.GaussianNB
neighbors.KNeighborsClassifier
svm.LinearSVC (setting multi_class="crammer_singer")
linear_model.LogisticRegression (setting multi_class="multinomial")
linear_model.LogisticRegressionCV (setting multi_class="multinomial")

For an N-class problem, they produce an N by N confusion matrix, and most of the evaluation metrics are derived from it:
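
To make this concrete, here is a minimal sketch. It assumes the small built-in Iris dataset (3 classes) rather than the data we use later, and a DecisionTreeClassifier picked from the list above; the variable names are only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: Iris stands in for any N-class target (here N = 3)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# A native multiclass estimator needs no extra wrapper
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, clf.predict(X_test))
n_classes = len(clf.classes_)
print(cm.shape)
print(cm)
```

The matrix has one row and one column per class, and its cells add up to the number of test samples.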

We will focus on multiclass confusion matrices later in the tutorial.

Binary classifiers with One-vs-One (OVO) strategy

Other supervised classification algorithms were mainly designed for the binary case. However, Sklearn implements two strategies called One-vs-One (OVO) and One-vs-Rest (OVR, also called One-vs-All) to convert a multi-class problem into a series of binary tasks.

OVO splits a multi-class problem into a single binary classification task for each pair of classes. In other words, for each pair, a single binary classifier will be built. For example, a target with 4 classes — brain, lung, breast, and kidney cancer, uses 6 individual classifiers to binarize the problem:

Classifier 1: lung vs. breast
Classifier 2: lung vs. kidney
Classifier 3: lung vs. brain
Classifier 4: breast vs. kidney
Classifier 5: breast vs. brain
Classifier 6: kidney vs. brain

The Sklearn user guide lists these classifiers as the ones that handle multiclass data with the OVO approach:

svm.NuSVC
svm.SVC
gaussian_process.GaussianProcessClassifier (setting multi_class="one_vs_one")

Sklearn also provides a wrapper estimator for the above models under sklearn.multiclass.OneVsOneClassifier:
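
A minimal sketch of the wrapper, assuming the built-in digits dataset (10 classes) and SVC as the base estimator; neither choice comes from the rest of this tutorial. The fitted wrapper stores one binary model per pair of classes in its estimators_ attribute.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Assumption: the digits dataset is used only because it has many classes
X, y = load_digits(return_X_y=True)

ovo = OneVsOneClassifier(SVC())
ovo.fit(X, y)

n_classes = len(ovo.classes_)
# One binary classifier is trained for every pair of classes
n_pairs = n_classes * (n_classes - 1) // 2
print(n_classes, len(ovo.estimators_), n_pairs)
```

With 10 classes, the number of fitted binary classifiers already reaches 45, which hints at why this strategy gets expensive quickly.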

A major downside of this strategy is its computational workload. As each pair of classes requires a separate binary classifier, targets with high cardinality may take too long to train. To compute the number of classifiers that will be built for an N-class problem, the following formula is used:

Number of classifiers = N * (N - 1) / 2

For the 4 cancer types above, this gives 4 * 3 / 2 = 6 classifiers.

In practice, the One-vs-Rest strategy is much preferred because of this disadvantage.

Binary classifiers with One-vs-Rest (OVR) strategy

Alternatively, the OVR strategy creates an individual classifier for each class in the target. Essentially, each binary classifier chooses a single class and marks it as positive, encoding it as 1. The rest of the classes are considered negative labels and, thus, encoded with 0. For classifying 4 types of cancer:

Classifier 1: lung vs. [breast, kidney, brain] (lung cancer, not lung cancer)
Classifier 2: breast vs. [lung, kidney, brain] (breast cancer, not breast cancer)
Classifier 3: kidney vs. [lung, breast, brain] (kidney cancer, not kidney cancer)
Classifier 4: brain vs. [lung, breast, kidney] (brain cancer, not brain cancer)

The Sklearn user guide lists these classifiers as the ones that handle multiclass data with the OVR approach:

ensemble.GradientBoostingClassifier
gaussian_process.GaussianProcessClassifier (setting multi_class="one_vs_rest")
svm.LinearSVC (setting multi_class="ovr")
linear_model.LogisticRegression (setting multi_class="ovr")
linear_model.LogisticRegressionCV (setting multi_class="ovr")
linear_model.SGDClassifier
linear_model.Perceptron

Alternatively, you can use the above models with the default OneVsRestClassifier:
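
Here is a minimal sketch of that wrapper. It assumes the built-in Iris dataset and LogisticRegression as the base estimator, both chosen only to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Assumption: Iris (3 classes) stands in for any multiclass target
X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

# One binary classifier per class: "this class" vs. "everything else"
print(len(ovr.estimators_))

# Share of positive samples each binary classifier sees during training
positive_share = {c: float((y == c).mean()) for c in ovr.classes_}
print(positive_share)
```

The positive share per class is worth printing on your own data, because it shows how lopsided each binary task becomes.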

Even though this strategy significantly lowers the computational cost, the fact that only one class is considered positive and the rest as negative makes each binary problem an imbalanced classification. This problem is even more pronounced for classes with low proportions in the target.

In both approaches, depending on the passed estimator, the results of all binary classifiers can be summarized in two ways:

majority of the vote: each binary classifier predicts one class, and the class that got the most votes from all classifiers is chosen
depending on the argmax of class membership probability scores: classifiers such as LogisticRegression compute probability scores for each class (.predict_proba()). Then, the argmax of the sum of the scores is chosen.
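
The second option can be sketched for the OVR case as follows. This is my own reconstruction under assumptions (Iris data, LogisticRegression as the base estimator), not a copy of Sklearn internals: each binary classifier scores its own class, and the class with the highest score wins.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One column per binary classifier: probability of its own (positive) class
scores = np.column_stack(
    [est.predict_proba(X)[:, 1] for est in ovr.estimators_]
)

# Pick the class whose binary classifier is the most confident
manual_pred = ovr.classes_[scores.argmax(axis=1)]
print((manual_pred == ovr.predict(X)).mean())
```

On this data, the manual argmax agrees with what the wrapper's own predict method returns.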

We will talk more about how to score each of these strategies later in the tutorial.

Sample classification problem and preprocessing pipeline

As an example problem, we will be predicting the quality of diamonds using the Diamonds dataset from Kaggle:

The above output shows the features are on different scales, suggesting we use some type of normalization. This step is essential for many linear-based models to perform well.

The dataset contains a mixture of numeric and categorical features. I covered preprocessing steps for binary classification in my last article in detail. You can easily apply the ideas to the multi-class case, so I will keep the explanations here nice and short.

The target is ‘cut’, which has 5 classes: Ideal, Premium, Very Good, Good, and Fair (descending quality). We will encode the textual features with OneHotEncoder.

Let’s take a quick look at the distributions of each numeric feature to decide what type of normalization to use:

>>> diamonds.hist(figsize=(16, 12));

Price and carat show skewed distributions. We will use a logarithmic transformer to make them as normally distributed as possible. For the rest, simple standardization is enough. If you are not familiar with numeric transformations, check out my article on the topic. Also, the below code contains an example of Sklearn pipelines, and you can learn all about them from here.

Let’s get to work:
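
Below is a minimal sketch of such a pipeline. To keep it runnable without downloading anything, it builds a small synthetic table that only borrows the Diamonds column names, so the values are made up. The log step (FunctionTransformer with np.log1p) and the step name "prep" are my assumptions; the estimator step is named "base".

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Synthetic stand-in with the Diamonds column names (NOT the real data)
rng = np.random.default_rng(0)
n = 300
diamonds = pd.DataFrame({
    "carat": rng.lognormal(-0.5, 0.5, n),
    "color": rng.choice(list("DEFGHIJ"), n),
    "clarity": rng.choice(["SI1", "SI2", "VS1", "VS2"], n),
    "depth": rng.normal(61.7, 1.4, n),
    "table": rng.normal(57.5, 2.2, n),
    "price": rng.lognormal(7.8, 1.0, n),
    "x": rng.normal(5.7, 1.1, n),
    "y": rng.normal(5.7, 1.1, n),
    "z": rng.normal(3.5, 0.7, n),
    "cut": rng.choice(["Ideal", "Premium", "Very Good", "Good", "Fair"], n),
})
X, y = diamonds.drop(columns="cut"), diamonds["cut"]

log_cols = ["carat", "price"]              # skewed: log, then standardize
std_cols = ["depth", "table", "x", "y", "z"]  # plain standardization
cat_cols = ["color", "clarity"]            # one-hot encoding

preprocessor = ColumnTransformer([
    ("log", make_pipeline(FunctionTransformer(np.log1p), StandardScaler()), log_cols),
    ("std", StandardScaler(), std_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

pipeline = Pipeline([
    ("prep", preprocessor),
    ("base", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipeline.fit(X, y)
print(pipeline.classes_)
```

Because the target here is random, the scores of this toy model mean nothing; the point is only the shape of the pipeline.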

The first version of our pipeline uses RandomForestClassifier. Let's look at its confusion matrix by generating predictions:

In lines 8 and 9, we are creating the matrix and using a special Sklearn function to plot it. ConfusionMatrixDisplay also has a display_labels argument, to which we are passing the class names accessed through the pipeline.classes_ attribute.

Interpreting N by N confusion matrix

If you read my other article on binary classification, you know that confusion matrices are the holy grail of supervised classification problems. In a 2 by 2 matrix, the matrix terms are easy to interpret and locate.

Even though it gets more difficult to interpret the matrix as the number of classes increases, there are sure-fire ways to find your way around any matrix of any shape.

The first step is always identifying your positive and negative classes. This depends on the problem you are trying to solve. As a jewelry store owner, I may want my classifier to differentiate Ideal and Premium diamonds better than other types, making these types of diamonds my positive class. Other classes will be considered negative.

Establishing positive and negative classes early on is very important in evaluating model performance and in hyperparameter tuning. After doing this, you should define your true positives, true negatives, false positives, and false negatives. In our case:

Positive classes: Ideal and Premium diamonds
Negative classes: Very Good, Good, and Fair diamonds
True Positives, type 1: actual Ideal, predicted Ideal
True Positives, type 2: actual Premium, predicted Premium
True Negatives: the rest of the diamond types predicted correctly
False Positives: actual value belongs to any of the 3 negative classes but predicted either Ideal or Premium
False Negatives: actual value is either Ideal or Premium but predicted by any of the 3 negative classes.
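
The terms above can be read off a matrix with a few lines of NumPy. The sketch below uses a small made-up matrix (the counts are hypothetical, not our model's results) with the classes in alphabetical order.

```python
import numpy as np

labels = ["Fair", "Good", "Ideal", "Premium", "Very Good"]

# Hypothetical counts: rows = actual class, columns = predicted class
cm = np.array([
    [8, 1, 0, 1, 0],
    [1, 12, 1, 2, 4],
    [0, 0, 50, 3, 2],
    [0, 1, 4, 30, 5],
    [0, 3, 6, 4, 17],
])

pos = [labels.index(c) for c in ("Ideal", "Premium")]
neg = [i for i in range(len(labels)) if i not in pos]

tp_ideal = cm[pos[0], pos[0]]        # actual Ideal, predicted Ideal
tp_premium = cm[pos[1], pos[1]]      # actual Premium, predicted Premium
true_negatives = cm[neg, neg].sum()  # negative classes predicted correctly
false_positives = cm[np.ix_(neg, pos)].sum()  # actual negative, predicted positive
false_negatives = cm[np.ix_(pos, neg)].sum()  # actual positive, predicted negative

print(tp_ideal, tp_premium, true_negatives, false_positives, false_negatives)
```

Note that mix-ups inside a group (for example, Ideal predicted as Premium) fall outside these definitions, so decide up front how you want to treat them.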

Always list out the terms of your matrix in this manner, and the rest of your workflow will be much easier, as you will see in the next section.

How Sklearn computes multiclass classification metrics — ROC AUC score

This section is only about the nitty-gritty details of how Sklearn calculates common metrics for multiclass classification. Specifically, we will peek under the hood of the 4 most common metrics: ROC_AUC, precision, recall, and f1 score. Even though I will give a brief overview of each metric, I will mostly focus on using them in practice. If you want a deeper explanation of what each metric measures, please refer to this article.

The first metric we will discuss is the ROC AUC score or area under the receiver operating characteristic curve. It is mostly used when we want to measure a classifier’s performance to differentiate between each class. This means that ROC AUC is better suited for balanced classification tasks.

In essence, the ROC AUC score is defined for binary classification and for models that can generate class membership probabilities, which are turned into class predictions by comparing them against a decision threshold. Here is a brief overview of the steps to calculate ROC AUC for binary classification:

1. A binary classifier that can generate class membership probabilities is fit, such as LogisticRegression with its predict_proba method.
2. An initial decision threshold close to 0 is chosen, for example 0.1. If the predicted probability is higher than the threshold, the sample is predicted positive, otherwise negative.
3. Using this threshold, a confusion matrix is created.
4. True positive rate (TPR) and false positive rate (FPR) are found.
5. A new threshold is chosen, and steps 3–4 are repeated.
6. Repeat steps 2–5 for various thresholds between 0 and 1 to create a set of TPRs and FPRs.
7. Plot all TPRs vs. FPRs to generate the receiver operating characteristic curve.
8. Calculate the area under this curve.
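
The steps above can be sketched by hand and checked against Sklearn. This is an illustrative version under assumptions: synthetic data from make_classification, scores computed on the training data, and thresholds swept from high to low (the order does not change the curve, it only keeps FPR sorted).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score

# Assumption: a synthetic binary problem, scored in-sample for brevity
X, y = make_classification(n_samples=500, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

# Every distinct score is a candidate threshold; inf gives the (0, 0) point
thresholds = np.r_[np.inf, np.unique(proba)[::-1]]

tpr, fpr = [], []
for t in thresholds:
    pred = proba >= t                  # positive if score reaches the threshold
    tp = np.sum(pred & (y == 1))
    fp = np.sum(pred & (y == 0))
    tpr.append(tp / np.sum(y == 1))    # true positive rate
    fpr.append(fp / np.sum(y == 0))    # false positive rate

manual_auc = auc(fpr, tpr)             # trapezoidal area under the curve
print(manual_auc, roc_auc_score(y, proba))
```

Both numbers should match, which is a handy sanity check that the threshold sweep really is what the metric measures.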

For multiclass classification, you can calculate the ROC AUC for all classes using either OVO or OVR strategies. Since we agreed that OVR is a better option, here is how ROC AUC is calculated for OVR classification:

Each binary classifier created using OVR finds the ROC AUC score for its own class using the above steps.
ROC AUC scores of all classifiers are then averaged using either of these 2 methods:

“macro”: this is simply the arithmetic mean of the scores
“weighted”: this takes class imbalance into account by finding a weighted average. Each ROC AUC is multiplied by the number of samples in its class, the products are summed, and the sum is divided by the total number of samples.

As an example, let’s say there are 100 samples in the target — class 1 (45), class 2 (30), class 3 (25). OVR creates 3 binary classifiers, 1 for each class, and their ROC AUC scores are 0.75, 0.68, 0.84, respectively. The weighted ROC AUC score across all classes will be:

ROC AUC (weighted): ((45 * 0.75) + (30 * 0.68) + (25 * 0.84)) / 100 = 0.7515
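
The same arithmetic in a couple of lines of NumPy, with the macro average added for comparison (the numbers are the ones from the example above):

```python
import numpy as np

scores = np.array([0.75, 0.68, 0.84])  # per-class ROC AUC scores
support = np.array([45, 30, 25])       # samples per class

macro = scores.mean()                            # plain arithmetic mean
weighted = np.average(scores, weights=support)   # weighted by class size
print(round(macro, 4), round(weighted, 4))
```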

Here is the implementation of all this in Sklearn:
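
A minimal sketch of the call, assuming the built-in wine dataset (3 classes) and a plain RandomForestClassifier instead of our diamonds pipeline. The important parts are passing the full predict_proba output and setting multi_class and average.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumption: wine data stands in for the diamonds data
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shape (n_samples, n_classes): one probability column per class
y_proba = clf.predict_proba(X_test)

score = roc_auc_score(y_test, y_proba, multi_class="ovr", average="weighted")
print(score)
```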

Above, we calculated ROC AUC for our diamond classification problem and got an excellent score. Don’t forget to set the multi_class and average parameters properly when using roc_auc_score. If you want to generate the score for a particular class, here is how you do it:
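
One way to do this, sketched under the same assumptions as before (built-in wine data, a plain RandomForestClassifier): treat the class of interest as positive, everything else as negative, and pass only that class's probability column.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)

# Binary ROC AUC per class: "is this class" vs. its probability column
per_class = {
    c: roc_auc_score(y_test == c, y_proba[:, i])
    for i, c in enumerate(clf.classes_)
}
print(per_class)

# Weighting the per-class scores by class size recovers the weighted OVR score
support = [np.sum(y_test == c) for c in clf.classes_]
weighted = np.average(list(per_class.values()), weights=support)
overall = roc_auc_score(y_test, y_proba, multi_class="ovr", average="weighted")
print(weighted, overall)
```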

ROC AUC score is only a good metric to see how the classifier differentiates between classes. A higher ROC AUC score does not necessarily mean a better model. On top of that, we care more about our model’s ability to classify Ideal and Premium diamonds, so a metric like ROC AUC is not a good option for our case.

Precision, Recall and F1 scores for multiclass classification

A better metric to measure our pipeline’s performance would be using precision, recall, and F1 scores. For the binary case, they are easy and intuitive to understand:

In a multiclass case, these 3 metrics are calculated on a per-class basis. For example, let’s look at the confusion matrix again:

Precision tells us what proportion of predicted positives is truly positive. If we want to calculate precision for Ideal diamonds, true positives would be the number of Ideal diamonds predicted correctly (the center of the matrix, 6626). False positives would be any cells that count the number of times our classifier predicted other types of diamonds as Ideal. These would be the cells above and below the center of the matrix (1013 + 521 + 31 + 8 = 1573). Using the formula of precision, we calculate it to be:

Precision (Ideal) = TP / (TP + FP) = 6626 / (6626 + 1573) = 0.808

Recall is calculated similarly. We know the number of true positives: 6626. False negatives would be any cells that count the number of times the classifier predicted an actual Ideal diamond as belonging to any other class. These would be the cells to the right and left of the center of the matrix (3 + 9 + 363 + 111 = 486). Using the formula of recall, we calculate it to be:

Recall (Ideal) = TP / (TP + FN) = 6626 / (6626 + 486) = 0.93

So, how do we choose between recall and precision for the Ideal class? It depends on the type of problem you are trying to solve. If you want to minimize the instances where other, cheaper types of diamonds are predicted as Ideal, you should optimize precision. As a jewelry store owner, you might be sued for fraud for selling cheaper diamonds as expensive Ideal diamonds.

On the other hand, if you want to minimize the instances where you accidentally sell Ideal diamonds for a lower price, you should optimize for recall of the Ideal class. Indeed, you won’t get sued, but you might lose money.

The third option is to have a model that is equally good at the above 2 scenarios. In other words, a model with high precision and recall. Fortunately, there is a metric that measures just that: the F1 score. F1 score takes the harmonic mean of precision and recall and produces a value between 0 and 1:

F1 = 2 * (precision * recall) / (precision + recall)

So, the F1 score for the Ideal class would be:

F1 (Ideal) = 2 * (0.808 * 0.93) / (0.808 + 0.93) = 0.87
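
If you want to double-check the arithmetic, the three numbers for the Ideal class can be recomputed from the counts quoted above:

```python
# Counts for the Ideal class taken from the confusion matrix discussed above
tp, fp, fn = 6626, 1573, 486

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Working with the unrounded precision and recall, the F1 score comes out at roughly 0.866, which rounds to 0.87.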

Up to this point, we calculated the 3 metrics only for the Ideal class. But in multiclass classification, Sklearn computes them for all classes. You can use classification_report to see this:
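
A minimal sketch of the call, assuming the built-in wine dataset and a plain RandomForestClassifier in place of our pipeline. The output_dict=True variant is optional; it returns the same numbers as a dictionary.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Assumption: wine data stands in for the diamonds data
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)
y_pred = RandomForestClassifier(random_state=0).fit(X_train, y_train).predict(X_test)

# Per-class precision, recall, F1 and support, plus averaged rows
print(classification_report(y_test, y_pred))

# Same content as a nested dict, handy for picking out single values
report = classification_report(y_test, y_pred, output_dict=True)
print(report["weighted avg"]["f1-score"])
```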

You can check that our calculations for the Ideal class were correct. The last column of the table, support, shows how many samples there are for each class. Also, the last 2 rows show averaged scores for the 3 metrics. We already covered what macro and weighted averages are in the example of ROC AUC.

For imbalanced classification tasks such as these, you rarely choose averaged precision, recall, or F1 scores. Again, choosing one metric to optimize for a particular class depends on your business problem. For our case, we will choose to optimize the F1 score of Ideal and Premium classes (yes, you can choose multiple classes simultaneously). First, let’s see how to calculate weighted F1 across all classes:
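
A minimal sketch, again assuming the built-in wine dataset and a plain RandomForestClassifier rather than our pipeline:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)
y_pred = RandomForestClassifier(random_state=0).fit(X_train, y_train).predict(X_test)

# F1 per class, averaged with class sizes as weights
weighted_f1 = f1_score(y_test, y_pred, average="weighted")

# The same number appears in the "weighted avg" row of the report
report = classification_report(y_test, y_pred, output_dict=True)
print(weighted_f1, report["weighted avg"]["f1-score"])
```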

The above is consistent with the output of classification_report. To choose the F1 scores for Ideal and Premium classes, specify the labels parameter:
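
Here is a sketch of the labels parameter under the same assumptions as before (wine data, plain RandomForestClassifier). The wine classes are the integers 0, 1 and 2, so labels=[0, 1] plays the role that the Ideal and Premium class names would play for the diamonds.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)
y_pred = RandomForestClassifier(random_state=0).fit(X_train, y_train).predict(X_test)

chosen = [0, 1]  # assumption: the two classes we care about

# Weighted F1 restricted to the chosen classes
subset_f1 = f1_score(y_test, y_pred, labels=chosen, average="weighted")

# Equivalent by hand: per-class F1 weighted by each class's support
per_class = f1_score(y_test, y_pred, labels=chosen, average=None)
support = [np.sum(y_test == c) for c in chosen]
print(subset_f1, np.average(per_class, weights=support))
```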

Finally, let’s see how to optimize these metrics with hyperparameter tuning.

Hyperparameter tuning to optimize model performance for a custom metric

Optimizing the model performance for a metric is almost the same as what we did for the binary case. The only difference is how we pass a scoring function to a hyperparameter tuner like GridSearch.

Up until now, we were using the RandomForestClassifier pipeline, so we will create a hyperparameter grid for this estimator:
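
A sketch of what such a grid can look like. The hyperparameter values are placeholders I picked for illustration, and the one-step pipeline only exists so that the key names can be checked.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Assumption: a stripped-down pipeline whose estimator step is called "base"
pipeline = Pipeline([("base", RandomForestClassifier(random_state=0))])

# Hypothetical search space; keys follow the "<step name>__<parameter>" pattern
param_grid = {
    "base__n_estimators": [100, 200],
    "base__max_depth": [None, 10, 20],
    "base__min_samples_split": [2, 5],
}

# Every key must be a parameter the pipeline actually exposes
valid_keys = pipeline.get_params()
print(all(key in valid_keys for key in param_grid))
```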

Don’t forget to prepend each hyperparameter name with the step name you chose in the pipeline for your estimator. When we created our pipeline, we specified RandomForests as ‘base’. See this discussion for more info.

We will use HalvingGridSearchCV (HGS), which turned out to be much faster than a regular GridSearch. You can read this article to see my experiments:

11 Times Faster Hyperparameter Tuning with HalvingGridSearch

Before we feed the above grid to HGS, let’s create a custom scoring function. In the binary case, we could pass string values as the names of the metrics we wanted to use, such as ‘precision’ or ‘recall.’ But in the multiclass case, those functions accept additional parameters, and we cannot set them if we pass the function names as strings. To solve this, Sklearn provides the make_scorer function:
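
A minimal sketch of make_scorer, assuming a synthetic 3-class problem with integer labels, so labels=[0, 1] stands in for the Ideal and Premium class names. Extra keyword arguments given to make_scorer are forwarded to the metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import train_test_split

# Assumption: synthetic data with classes 0, 1, 2
X, y = make_classification(
    n_samples=300, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# average and labels are passed through to f1_score on every call
custom_f1 = make_scorer(f1_score, average="weighted", labels=[0, 1])

# A scorer is called with (estimator, X, y), the way tuners call it
score = custom_f1(clf, X_test, y_test)
print(score)
```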

As we did in the last section, we passed custom values for the average and labels parameters.

Finally, let’s initialize the HGS and fit it to the full data with 3-fold cross-validation:
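
Here is a small end-to-end sketch. To run quickly it assumes synthetic 3-class data, a one-step pipeline and a tiny placeholder grid; the search setup is the part that matters. Note that HalvingGridSearchCV is still experimental, so it needs the extra enable import.

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.pipeline import Pipeline

# Assumption: synthetic data with classes 0, 1, 2 instead of the diamonds
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=0
)

pipeline = Pipeline([("base", RandomForestClassifier(random_state=0))])

# Placeholder grid, kept tiny on purpose
param_grid = {"base__n_estimators": [20, 50], "base__max_depth": [3, None]}

# Weighted F1 over the two classes we care about
custom_f1 = make_scorer(f1_score, average="weighted", labels=[0, 1])

hgs = HalvingGridSearchCV(
    pipeline,
    param_grid,
    scoring=custom_f1,
    cv=3,
    factor=2,        # keep the best half of the candidates each round
    random_state=0,
)
hgs.fit(X, y)
print(hgs.best_score_, hgs.best_params_)
```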

After the search is done, you can get the best score and estimator with .best_score_ and .best_estimator_ attributes, respectively.

Your model is only as good as the metric you choose to evaluate it with. Hyperparameter tuning will be time-consuming but assuming you did everything right until this point and gave a good enough parameter grid, everything will turn out as expected. If not, it is an iterative process, so take your time by tweaking the preprocessing steps, take a second look at your chosen metrics, and maybe widen your search grid. Thank you for reading!
