Summary

The article discusses the occurrence of negative R² values in machine learning tasks, explaining why this happens, particularly in non-linear models, and how it indicates that the model performs worse than predicting the mean.

Abstract

The author of the article encountered a negative R² value while working on a machine learning task, which initially caused concern due to the expectation that R² should be between 0 and 1. However, after researching, the author understood that a negative R² indicates the model's predictions are worse than using the average as a predictor. This situation can occur when using non-linear models, which do not adhere to the same constraints as ordinary least squares (OLS) linear regression. The article delves into the mathematical underpinnings of R², explaining the sums of squares for total variation (SST), explained variation (SSR), and unexplained variation (SSE). It also touches on the debate regarding the reliability of R² for non-linear models and emphasizes that R² should be used alongside other metrics for a comprehensive evaluation of model performance.

Opinions

The author initially expresses concern over obtaining a negative R² but becomes only mildly concerned after understanding its implications.
There is a reasonable debate in the data science community about whether R² is a reliable metric for non-linear models.
The author suggests that while R² is not inherently useful as a sole goodness-of-fit metric, it can be valuable when used in conjunction with other metrics like MAE, MSE, RMSE, or adjusted R².
The article implies that a negative R² is not necessarily a sign of miscalculation but rather an indication that the model is performing poorly compared to using the sample mean as a predictor.
The author provides a toy dataset example to illustrate how a negative R² can be calculated and what it signifies in a practical context.
The author concludes with a reflective note on the importance of perseverance despite model failures, quoting Oprah Winfrey on failure being a stepping stone to greatness.

SERIES ON THEORIES

Negative R²: Where Did You Go Wrong?

A statistical example

Recently, I went about a machine learning task and obtained a negative R² score on my holdout test set.

“Wait! Isn’t R²… squared? Shouldn’t it be between 0 and 1? What on earth have I done?”

Initially, I was very concerned about the state of my machine learning task. After a couple of minutes of googling, I remained only mildly concerned about the state of my machine learning task. In any case, I got a nice review of what a negative R² value means and how it can accidentally be obtained. This article provides greater insight into the mathematical origins of the statistic to help elucidate the meaning of this result.

A negative R² value indicates that your model’s predictions perform worse than simply using the average of the outcome as the prediction. Your own googling may yield similar statements. Here, I explain why with a data example.

Why use R²?

R² is the Coefficient of Determination, the proportion of variation in the outcome explained by the input. It is frequently used as a performance metric in regression tasks predicting continuous outcomes.

In classical linear regression, the values of R² are bound between 0 and 1. This article provides a nice proof-based explanation as to why. However, the short version is that the proof holds for the ordinary least squares (OLS) modeling approach when including an intercept, with R² computed on the same data used to fit the model. If you use a non-linear model (as many of us in the data science community do), or score a model on held-out data, throw those constraints out the window. The bounds of our Coefficient of Determination now run from negative infinity to 1.
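To make the missing lower bound concrete, here is a minimal sketch assuming the standard definition R² = 1 - SSE / SST. The helper `r_squared` and the three outcome values are hypothetical, chosen only for illustration. The predictions are shifted further and further from the truth, and the score keeps falling with no floor.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SSE / SST, computed against the mean of the true values."""
    mean_true = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - sse / sst

# Hypothetical true outcomes, for illustration only.
y_true = [1.0, 2.0, 3.0]

# Shift every prediction away from the truth by a growing constant offset.
scores = {}
for offset in (0.0, 1.0, 10.0, 100.0):
    y_pred = [t + offset for t in y_true]
    scores[offset] = r_squared(y_true, y_pred)
    print(f"offset={offset:>6}: R^2 = {scores[offset]}")
```

A perfect prediction scores 1, and each larger offset produces a lower score, so nothing stops the metric from dropping far below 0.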

In fact, there is some reasonable debate as to whether R² should be used at all for non-linear models. The argument is that R² is not a routinely reliable indicator of model fit, as demonstrated by Spiess & Neumeyer in a 2010 journal paper.

Incidentally, R² is not inherently useful as a goodness-of-fit metric for OLS either, at least until the assumption of linearity is determined valid through assessment of model residuals.

In any case, while R² should not be used as a sole performance metric, there is no denying that R² can be useful to evaluate model performance in concert with other metrics, such as MAE (mean absolute error), MSE (mean square error), RMSE (root mean square error), or possibly adjusted R².
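As a rough sketch of reporting these metrics side by side, the helper below computes MAE, MSE, RMSE, and R² with only the standard library. The function name `regression_metrics` and the sample values are hypothetical and are not taken from the task described in this article.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    sse = sum(e * e for e in errors)                  # unexplained variation
    sst = sum((t - mean_true) ** 2 for t in y_true)   # total variation
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": 1 - sse / sst}

# Hypothetical values, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MAE, MSE, and RMSE are expressed in the units of the outcome (or its square), while R² is unitless, which is one reason they complement each other.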

What is R²?

In classical regression, the goal is to predict an outcome. The variation of this outcome variable can be represented by 3 values: SSR, SSE, and SST. These are sums of squares of the variation in the sample. Squaring terms prevents us from having to deal with complications of negative values. The different aliases of SSR, SSE, and SST can become particularly confusing, so I have done my best to compile some of the common alternate naming conventions in one place. Going forward, I will only be using the abbreviations from the left two columns in the table below.

In an OLS model with an intercept, SSR + SSE = SST. Always.
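The identity can be checked numerically. The sketch below fits a one-variable OLS line with an intercept using the closed-form formulas, then compares the three sums of squares. The data and the helper names `fit_simple_ols` and `sums_of_squares` are hypothetical, and the constant "ad hoc" predictions stand in for any model that was not fit by OLS.

```python
def fit_simple_ols(x, y):
    """Closed-form OLS fit of y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sums_of_squares(y, y_hat):
    """Return (SST, SSR, SSE) for true values y and fitted values y_hat."""
    y_bar = sum(y) / len(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return sst, ssr, sse

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

intercept, slope = fit_simple_ols(x, y)
ols_fitted = [intercept + slope * xi for xi in x]
sst, ssr, sse = sums_of_squares(y, ols_fitted)
print(f"OLS:    SSR + SSE = {ssr + sse:.4f}, SST = {sst:.4f}")

# Predictions that did not come from an OLS fit need not satisfy the identity.
ad_hoc = [4.0] * len(y)
sst2, ssr2, sse2 = sums_of_squares(y, ad_hoc)
print(f"Ad hoc: SSR + SSE = {ssr2 + sse2:.4f}, SST = {sst2:.4f}")
```

For the OLS fit the two sides agree up to floating-point error. For the ad hoc predictions they do not, which is why the 0-to-1 bound cannot be taken for granted outside OLS.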

These statistics are relevant because the Coefficient of Determination is a function of SSE and SST.

Using a sample of n observations, we can calculate the purple SSE value as the sum of the squared differences between the true outcome (yᵢ) and the fitted/predicted outcome (ŷᵢ) for observation i.

Using the n observations, we can also calculate the blue SST value as the sum of the squared differences between the true outcome (yᵢ) and the sample mean of the outcomes (ȳ).

After computing both SSE and SST, we can compute our desired metric, R².
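Written out from the verbal definitions above, and assuming the standard form of the coefficient of determination, the three quantities are:

```latex
\mathrm{SSE} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\qquad
\mathrm{SST} = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2,
\qquad
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
```

Since SSE and SST are sums of squares, neither can be negative, so R² drops below 0 exactly when SSE exceeds SST.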

Negative R² Example

Let’s use a toy dataset of 10 observations. Each observation has a fitted / predicted estimate from a model, and each also has the true outcome, as shown in the table below. The actual model will not be provided for 2 reasons:

A) I made it up.

B) This performance metric can be applied to fitted values for virtually any model, so it is not relevant to the question of why R² can take on negative values.

Note the mean of the fitted values and the mean of the true values are the same, 5. With the central tendencies being equal, perhaps this model is a reasonable predictor of the observations?

The values from the table are plotted below. Assume the black dots represent the true outcome values. The blue dots represent the predicted (aka fitted) values for the outcome from our model. The total variation of each observation can be partitioned as described. For clarity, only one observation is shown partitioned out in the figure below. The total variation (accounted for by SST) is partitioned into two quantities: the variation explained by the model fit (SSR) and the variation left unexplained, that is, the error in the fitted value (SSE).

For each metric, SSE and SST, there are differences and squares of those differences shown in corresponding columns. The SSE, sum of squares error, for the model is 86.5 while the SST, sum of squares total, is 67.5.

Because SST < SSE, if we scrapped our model and simply predicted the mean true value (5.0) for each point, that estimate would perform better than the fitted values from our current model. While that is not an ideal outcome, it means that this performance is not necessarily the result of a miscalculation. The reason R² is bound between 0 and 1 for an OLS model with an intercept is that, if no better fit can be obtained, its intercept will take on the sample mean and its beta coefficient will take on a value of 0.
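The mean baseline is easy to check in code. The sketch below uses hypothetical values (not the toy table from this article) whose true mean is also 5. Predicting the mean everywhere makes SSE equal to SST, so R² is exactly 0. A second set of predictions has the same mean of 5 but tracks the outcomes badly, and its score falls below 0.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SSE / SST, computed against the mean of the true values."""
    mean_true = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - sse / sst

# Hypothetical true outcomes with mean 5, for illustration only.
y_true = [2.0, 4.0, 6.0, 8.0]

# Baseline: predict the sample mean for every observation.
mean_pred = [sum(y_true) / len(y_true)] * len(y_true)
r2_mean = r_squared(y_true, mean_pred)

# Same mean as the truth (5), but in the opposite order.
reversed_pred = [8.0, 6.0, 4.0, 2.0]
r2_reversed = r_squared(y_true, reversed_pred)

print(f"mean baseline: R^2 = {r2_mean}")
print(f"reversed:      R^2 = {r2_reversed}")
```

Matching the mean of the true values is therefore not enough. What matters is how far each individual prediction sits from its own true value.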

After calculating R² from SSE and SST, we obtain a rounded R² of -0.28, which is indeed negative because the model performs worse than predicting the sample mean alone.
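The arithmetic can be reproduced directly from the two sums reported above, assuming the standard form R² = 1 - SSE / SST:

```python
# Sums of squares reported for the toy dataset in this article.
sse = 86.5  # sum of squares error
sst = 67.5  # sum of squares total

r2 = 1 - sse / sst
r2_rounded = round(r2, 2)
print(r2_rounded)  # -0.28
```

The ratio SSE / SST is about 1.28, meaning the model's squared error is roughly 28% larger than the squared error of predicting the mean.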

In summary, we have explained the origins of the coefficient of determination, a justification of why it can take on negative values in certain circumstances, and an example that shows the calculation for a toy dataset with the negative value.

If we return to my machine learning task that provided the previously worrisome negative R² result, it seems that I need only be mildly concerned because my model does technically function… it’s just a terrible model. Well, better luck (to me) next time.

Comment below with any questions about other performance metrics or statistical quantities!

As always, keep on keeping on, despite any recent failures from negatively performing models.

“Failure is another stepping stone to greatness.” ― Oprah Winfrey
