CatBoost Regression: Break It Down For Me
A comprehensive (and illustrated) breakdown of the inner workings of CatBoost
CatBoost, short for Categorical Boosting, is a powerful machine learning algorithm that excels in handling categorical features and producing accurate predictions. Traditionally, dealing with categorical data is pretty tricky — requiring one-hot encoding, label encoding, or some other preprocessing technique that can distort the data’s inherent structure. To tackle this issue, CatBoost employs its own built-in encoding system called Ordered Target Encoding.
Let’s see how CatBoost works in practice by building a model to predict how someone might rate the book Murder, She Texted based on their average book rating on Goodreads and their favorite genre.
We asked 6 people to rate Murder, She Texted and collected the other relevant information about them.
This is our current training dataset, which we will use to train (duh) the model.
Step 1: Shuffle the dataset and Encode the Categorical Data Using Ordered Target Encoding
The way we preprocess categorical data is central to the CatBoost algorithm. In this case, we only have one categorical column — Favorite Genre. This column is encoded (aka converted to a numerical value), and how that is done varies depending on whether we have a Regression or a Classification problem. Since we are dealing with a Regression problem (because the variable we want to predict, Murder, She Texted Rating, is continuous), we follow these steps.
1 — Shuffle the dataset:
2 — Put the continuous target variable into discrete buckets: Since we have very little data here, we’ll create 2 buckets of the same size to categorize the target. (Learn more about how to create buckets here).
We put the 3 smallest values of Murder, She Texted Rating in bucket 0 and the rest in bucket 1.
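With only six rows and two buckets, this step boils down to a median-style split. Here is a minimal sketch of it in Python (the ratings in the example call are hypothetical, and the sketch assumes distinct values and an even row count):

```python
def bucketize_two(ratings):
    """Bucket 0 = the smaller half of the ratings, Bucket 1 = the larger half."""
    cutoff = sorted(ratings)[len(ratings) // 2]  # smallest value in the upper half
    return [1 if r >= cutoff else 0 for r in ratings]

# Hypothetical ratings, just to show the shape of the output:
print(bucketize_two([3.5, 4.1, 2.0, 4.8, 3.0, 4.5]))  # -> [0, 1, 0, 1, 0, 1]
```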
3 — Encode the categorical column using a formula: Ordered Target Encoding pretends like it is receiving the data sequentially, one row at a time, and uses this formula to encode the values in Favorite Genre:
- curCount = the number of people we’ve seen before who have the same Favorite Genre and are in Rating Bucket 1
- prior = constant value that is user defined; let’s set it to 0.05 in our case
- maxCount = the number of people we’ve seen before that have the same Favorite Genre
NOTE: If we had more data, we would have more buckets and would use a different formula to encode the categorical data. Read more here.
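To make the formula concrete, here is a minimal Python sketch of it (the function name is mine; the prior of 0.05 is the value we set above):

```python
def ordered_target_encode(cur_count, max_count, prior=0.05):
    """Ordered target encoding: (curCount + prior) / (maxCount + 1)."""
    return (cur_count + prior) / (max_count + 1)

# First row: no earlier rows share its genre, so both counts are 0
print(ordered_target_encode(cur_count=0, max_count=0))  # -> 0.05
```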
Using this formula, let’s encode the first row. Since it’s the first row, we are assuming no data came before it and this row is the only information we have.
Here:
- curCount = the number of people we’ve seen before with Rating Bucket 1 and whose Favorite Genre is Mystery = 0
- maxCount = number of people we’ve seen whose Favorite Genre is Mystery = 0
So, the encoded value of Mystery in the first row is 0.05.
Now for the second row, we assume the only data we have are the first 2 rows.
- curCount = the number of people we’ve seen before with Rating Bucket 1, and whose Favorite Genre is Romance = 0
- maxCount = number of people we’ve seen whose Favorite Genre is Romance = 0
Similar to the first row, the encoded value for the second row is 0.05.
For the third row:
- curCount = the number of people we’ve seen before who were in Rating Bucket 1, and whose Favorite Genre is Mystery = 0
- maxCount = number of people we’ve seen whose Favorite Genre is Mystery = 1
So the encoded value of Mystery in the third row is (0 + 0.05) / (1 + 1) = 0.025.
Similarly, if we do this encoding on the remaining rows, we get:
And that’s how we encode the categorical variables.
For now we can ignore Favorite Genre and only consider Encoded Favorite Genre.
Step 2: Make an Initial Prediction and Calculate Residuals
CatBoost starts by making an initial Murder, She Texted Rating prediction of 0 for all rows.
Then we calculate something called Residuals using this formula:
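The formula is simply the observed rating minus the current prediction. A tiny sketch (the 3.5 here matches the first row’s Residual we see below):

```python
def residual(observed_rating, current_prediction):
    """Residual = the rating we observed minus the model's current prediction."""
    return observed_rating - current_prediction

# With an initial prediction of 0, each Residual is simply the observed rating:
print(residual(observed_rating=3.5, current_prediction=0.0))  # -> 3.5
```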
Step 3: Build a CatBoost Tree
Now that we have the Residuals, we can start to build a CatBoost tree. It might be useful to read my previous article on Decision Trees and XGBoost to give you some context on decision trees.
Find a root node
We determine the best threshold for the tree's root (first split) by comparing how well the tree splits using Favorite Genre (encoded) vs. Average Goodreads Rating as a root node.
First, we need to identify candidate nodes for splitting the tree based on Favorite Genre. To do this, we must sort the values of Favorite Genre in ascending order:
Then we calculate the averages of adjacent values in the sorted Favorite Genre column:
Our candidate values for Favorite Genre splits are these averages — 0.0375, 0.05, 0.2875, and 0.525.
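In code, that candidate-generation step might look like this (a small sketch; the helper name is mine). Applied to the Encoded Favorite Genre column, it yields the four thresholds listed above:

```python
def split_candidates(column_values):
    """Candidate thresholds are the midpoints between adjacent sorted values."""
    ordered = sorted(column_values)
    return sorted({(low + high) / 2 for low, high in zip(ordered, ordered[1:])})
```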
The first candidate we try is Favorite Genre < 0.0375:
The leaves of the tree are shown in green. CatBoost initializes something called the Output of each leaf of the split to 0:
If Favorite Genre is less than 0.0375, we end up in the left leaf; otherwise, we end up in the right leaf. As each row of data is run down the tree, its Residual is placed in the leaf it lands in.
So running the first row down the tree…
…we put its Residual in the right leaf because Favorite Genre is 0.05, which is greater than 0.0375:
Then we keep track of the Leaf Output for that row:
Then we update the value of the Output in the tree to the average of the Residual values in the leaf. In this case, since there is only one Residual in the leaf, the Output is 3.5.
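That update rule is easy to express in code (a minimal sketch):

```python
def leaf_output(residuals_in_leaf):
    """A leaf's Output is the average of the Residuals that have landed in it."""
    return sum(residuals_in_leaf) / len(residuals_in_leaf)

print(leaf_output([3.5]))  # -> 3.5, since the leaf holds only the first row's Residual
```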
Now running the second row down the tree:
We put its Residual in the right leaf too because 0.05 > 0.0375:
We store the Leaf Output value:
And we update the Output value by taking the average of the two Residuals in the leaf:
Now let’s pass the third row down the tree:
Keep track of the Leaf Output:
And update the Output value of the leaf:
Finally, let's run the last 3 rows down the tree. We end up with this tree…
…and this table:
Quantify how “good” this root node is
CatBoost quantifies how good the split is by calculating the Cosine Similarity between the Leaf Output column and the Residuals. The formula for Cosine Similarity is:
where A and B are just two columns we’re trying to compare.
So to calculate the Cosine Similarity of Residuals and Leaf Output column…
…we put the corresponding values in the formula:
And we find Cosine Similarity = 0.786. Thus, the Cosine Similarity for the threshold Favorite Genre < 0.0375 is 0.786.
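For reference, here is the same calculation as a small Python sketch, where `a` would be the Residuals column and `b` the Leaf Output column:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length columns of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```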
Now using the same process as above, we build a tree using the second candidate for root threshold: Favorite Genre < 0.05. If we repeat the same process, we end up with this tree:
…and this table:
…with Cosine Similarity:
NOTE: This is the same value as what we got using the threshold Favorite Genre < 0.0375 because the Residuals end up in the same leaves.
Let’s try the next root threshold candidate: Favorite Genre < 0.2875. We get the tree:
…and this table:
…and with a Cosine Similarity of Residuals and Leaf Outputs = 0.84. Because this Cosine Similarity is greater than that of the other 2 thresholds, we conclude that Favorite Genre < 0.2875 is a better root split than Favorite Genre < 0.0375 and Favorite Genre < 0.05.
Now for the last split candidate: Favorite Genre < 0.525. We get a Cosine Similarity of 0.84, which leads us to conclude that the splits on 0.2875 and 0.525 perform similarly.
But remember, we only tested Favorite Genre candidates. Next, we need to test Average Goodreads Rating root candidates. And to do that, we need to identify our candidates for Average Goodreads Rating splits by sorting the column values in ascending order and calculating the adjacent averages.
For each average, we build a tree and calculate the Cosine Similarity between Leaf Outputs and Residuals:
Comparing the Cosine Similarity values of all candidate root nodes, we find that the Cosine Similarity of Average Goodreads Rating < 3.65 has the highest value of 0.87. So we choose this to be our root node split.
After obtaining the root node, we can expand the tree by adding new branches. To do this, we follow a similar process as before, but instead of selecting a root node, we select a split that branches off from the leaves. Again, the split with the highest Cosine Similarity is chosen.
The caveat with CatBoost trees is that they are symmetric, meaning each branch on the same level uses the exact same threshold.
NOTE: If you’re curious about why we build symmetric trees, you can read about it here.
So an example of one would be:
In this case, both the nodes at the same level use the same split.
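One consequence of symmetry is that a tree can be stored as just one threshold per level, and a row’s path through it becomes a simple bit pattern. Here is a sketch (the depth-2 tree and the example row are hypothetical, purely to illustrate the structure; the thresholds reuse values from earlier):

```python
# A symmetric (oblivious) tree stores one (feature, threshold) test per level,
# reused by every node on that level; a depth-2 tree therefore has 2**2 = 4 leaves.
splits = [("Average Goodreads Rating", 3.65),
          ("Encoded Favorite Genre", 0.2875)]

def leaf_index(row, splits):
    """Each level's yes/no answer contributes one bit of the leaf index."""
    index = 0
    for feature, threshold in splits:
        index = (index << 1) | (1 if row[feature] < threshold else 0)
    return index

print(leaf_index({"Average Goodreads Rating": 3.2, "Encoded Favorite Genre": 0.05}, splits))  # -> 3
```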
Since we have so little data let’s just stick to building trees of depth 1.
NOTE: Tree depth is a parameter in the model that we can tune mainly to avoid overfitting. In most cases, the optimal depth ranges from 4 to 10 and values in the range from 6 to 10 are recommended.
And just like that, we have our first tree!
Step 4: Make New Predictions
Now we make new predictions using the old predictions and this formula:
Learning Rate is another parameter we can tune to avoid overfitting. Read more about it here. For now, let’s set it to 0.1.
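In code, the update looks like this (a sketch, assuming the standard boosting update of old prediction plus learning rate times the row’s Leaf Output):

```python
LEARNING_RATE = 0.1  # the value we just chose

def new_prediction(old_prediction, leaf_output):
    """New Prediction = Old Prediction + Learning Rate * Leaf Output."""
    return old_prediction + LEARNING_RATE * leaf_output

# Illustrative values: an old prediction of 0 and a Leaf Output of 3.5
print(new_prediction(0.0, 3.5))  # -> 0.35
```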
Let’s go back to our table.
Using the formula, we can calculate the new predictions. For the first row, the new prediction will be:
Similarly, if we calculate the new predictions for the rest of the rows we get:
We can see that our new predictions are not very accurate as they still differ significantly from the actual rating of Murder, She Texted. However, they are an improvement from the old predictions of all zeros.
Our next step is to build a new tree. But before doing that, let’s quickly clean up our cluttered table a bit so it’s easier to work with. We can ignore the old Prediction, Residual, and Leaf Output columns…
…rename the New Prediction column to just Prediction (because it's not new anymore)
…and replace the values of Encoded Favorite Genre with our original Favorite Genre:
Step 5: Build a New Tree Using Steps 1–4
Now we repeat the same process that we did to build our first tree to build a second one.
Using Step 1, we shuffle the dataset…
…and encode the categorical data (aka Favorite Genre) using ordered target encoding:
Following Step 2, since we already have predictions, we don’t need to make an initial prediction again. All we do is calculate Residuals using the same formula we saw above:
We get the following Residuals:
Then we build our second CatBoost tree using Step 3. Let’s assume that after testing all the Encoded Favorite Genre and Average Goodreads Rating candidates, we find the best root node to be Encoded Favorite Genre < 0.288. Since the tree has a depth of 1, as previously set, we end up with a tree like this:
…and an updated table with Leaf Outputs:
Finally, using Step 4, we make new predictions. So using this formula…
…we get new predictions:
We can see here that the new predictions are slightly better than the old ones. And if we keep going and build more trees, our predictions get better and better.
NOTE: We continue building trees until our predictions are good, or until we hit the number of trees parameter value, which we can set.
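Zooming out, the whole training loop has roughly this shape (a sketch, not CatBoost’s actual API; `fit_tree` is a stand-in for the tree-building step):

```python
def boost(targets, fit_tree, n_trees=2, learning_rate=0.1):
    """Bare-bones boosting loop tying Steps 2-4 together.

    `fit_tree` is a placeholder for Step 3: given the current Residuals, it returns
    each row's Leaf Output (e.g. via the depth-1 split search sketched earlier).
    """
    predictions = [0.0] * len(targets)                      # Step 2: initial prediction of 0
    for _ in range(n_trees):
        residuals = [t - p for t, p in zip(targets, predictions)]
        leaf_outputs = fit_tree(residuals)                  # Step 3: build a tree on the Residuals
        predictions = [p + learning_rate * out              # Step 4: update the predictions
                       for p, out in zip(predictions, leaf_outputs)]
    return predictions
```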
Making Predictions with Our CatBoost Trees
Let’s assume we finished our model-building process with just the 2 trees we have above (by default, CatBoost builds 1,000). We now have a CatBoost model (obviously it isn’t going to be a good one because we only have 2 trees) with which we can start making predictions.
Now using our model, we want to predict how these 2 people are going to rate Murder, She Texted.
To begin, we need to encode the categorical data — Favorite Genre. The process of encoding new data is similar to how we encoded the training data; the only difference is that we use the entire training dataset for encoding.
We assign the Murder, She Texted Ratings in the training data to the same Rating Buckets we used before:
Now we use this same formula as above to do the encoding:
However, we use the entire training dataset instead of doing it sequentially as we did during the training process. So for instance, the encoding for Favorite Genre Mystery will be:
Similarly, the encodings for the other Favorite Genres are:
We substitute the encoded values in the new dataset:
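As a sketch, the inference-time version of the encoding could look like this (the `genre`/`bucket` row keys are hypothetical; the point is only that the counts now span the full training data):

```python
def encode_for_inference(genre, training_rows, prior=0.05):
    """At prediction time, the counts come from the whole training set, not a prefix of it."""
    same_genre = [row for row in training_rows if row["genre"] == genre]
    cur_count = sum(1 for row in same_genre if row["bucket"] == 1)
    max_count = len(same_genre)
    return (cur_count + prior) / (max_count + 1)
```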
So now for the first person, we go back to our trees and pass the data down them:
Then we use this formula to make our prediction:
So our prediction is:
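Put another way, the final prediction just accumulates each tree’s scaled Leaf Output (a sketch, assuming the same update rule we applied during training):

```python
LEARNING_RATE = 0.1

def predict(leaf_outputs_per_tree, initial_prediction=0.0):
    """Each tree contributes Learning Rate * its Leaf Output to the final prediction."""
    return initial_prediction + sum(LEARNING_RATE * out for out in leaf_outputs_per_tree)
```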
Granted, this is a terrible prediction. But remember this is a pretty terrible model. The more trees we have, the better our model is going to perform.
Similarly, for the second person:
And that’s about it. That’s how we build CatBoost trees and use them to make predictions on new data!
Unless otherwise noted, all images are by the author
You can connect with me on LinkedIn or email me at [email protected] to send me questions and suggestions for any other algorithms that you want illustrated!