Summary

The article argues against using Mean Squared Error (MSE) as a loss function for logistic regression due to its inability to strongly penalize misclassifications and its non-convex nature, which can lead to suboptimal solutions.

Abstract

The authors, Rajesh Shreedhar Bhat and Souradip Chakraborty, present a comparative analysis of log loss versus mean squared error (MSE) as loss functions for logistic regression. They demonstrate mathematically and empirically that log loss is superior for this purpose. The article highlights that MSE does not penalize misclassifications severely, especially when there is a complete mismatch between predicted and actual labels. In contrast, log loss assigns an increasingly large penalty as the predicted probability diverges from the actual label, which is desirable in classification tasks. Furthermore, the authors prove that MSE leads to a non-convex loss function in the context of logistic regression, which poses challenges for optimization algorithms that rely on convexity to find global minima. Conversely, they show that the log loss function maintains convexity, ensuring that optimization methods can reliably converge to the best solution. The article concludes by recommending the use of log loss over MSE for binary and multi-class classification problems in logistic regression.

Opinions

The authors believe that the choice of loss function is critical in the performance of logistic regression models.
They assert that MSE is not suitable for classification problems due to its weak penalization of incorrect predictions.
The article emphasizes the importance of convexity in loss functions to ensure that gradient-based optimization techniques can find the global minimum.
The authors advocate for the use of log loss as the standard loss function for logistic regression, based on its ability to penalize errors more effectively and its convex properties.
They suggest that practitioners should be aware of the implications of using MSE in logistic regression to avoid potential pitfalls in model training and performance.

Why not Mean Squared Error (MSE) as a loss function for Logistic Regression? 🤔

Authors: Rajesh Shreedhar Bhat*, Souradip Chakraborty* (* denotes equal contribution).

In this blog post, we compare “log loss” with “mean squared error” for logistic regression and show, through empirical and mathematical analysis, why log loss is the recommended choice.

Equations for both the loss functions are as follows:

Log loss:

Mean Squared Loss:

In the above two equations

y: actual label

ŷ: predicted value

n: number of classes
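
Written out for a single sample, the two losses can be sketched as follows. This is a reconstruction that assumes y_i and ŷ_i are the i-th components of a one-hot label and of the predicted probability vector; the index i is introduced here only for illustration.

```latex
\text{Log loss} = -\sum_{i=1}^{n} y_i \,\log(\hat{y}_i)
\qquad\qquad
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2
```

For n = 2, the log loss sum has exactly two terms, one per class.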

Let's say we have a dataset with 2 classes (n = 2) and the labels are represented as “0” and “1”.

Now we compute the loss value when there is a complete mismatch between the predicted value and the actual label, and see how log loss handles this case better than MSE.

For example:

Let’s say

Actual label for a given sample in a dataset is “1”
Prediction from the model after applying sigmoid function = 0

Loss value when using MSE:

(1- 0)² = 1

Loss value when using log loss:

Before plugging the values into the loss equation, let's look at what the graph of log(x) looks like.

As seen from the above graph, as x tends to 0, log(x) tends to -infinity.

Therefore, the loss value would be:

-(1 * log(0) + 0 * log(1)), which tends to infinity!

As seen above, the loss value using MSE is far smaller than the loss value computed using the log loss function. Hence it is clear that MSE doesn’t strongly penalize misclassifications, even for a complete mismatch!
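
The comparison can be tried numerically with a minimal sketch. The helper names `mse` and `log_loss` are illustrative rather than taken from any library, and the predictions are kept slightly above 0 so that the logarithm stays finite.

```python
import math

def mse(y, y_hat):
    """Squared error for one sample with a scalar label and prediction."""
    return (y - y_hat) ** 2

def log_loss(y, y_hat):
    """Binary log loss for one sample; y_hat must lie strictly inside (0, 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

y = 1  # actual label
# Push the prediction closer and closer to the wrong answer (0).
for y_hat in (0.1, 1e-3, 1e-6, 1e-12):
    print(f"y_hat={y_hat:g}  MSE={mse(y, y_hat):.6f}  log loss={log_loss(y, y_hat):.3f}")
```

As the prediction approaches 0, MSE stays bounded by 1 while the log loss keeps growing.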

However, if there is a perfect match between the predicted value and the actual label, both loss values would be “0”, as shown below.

Actual label: “1”

Predicted: “1”

MSE: (1 - 1)² = 0

Log loss: -(1 * log(1) + 0 * log(0)) = 0, taking 0 * log(0) = 0 by convention

Here we have shown that MSE is not a good choice for binary classification problems. The same argument extends to multi-class classification problems, provided the target values are one-hot encoded.
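
A small sketch of the multi-class case, assuming three classes and one-hot targets. The example vectors and the helper names `mse` and `cross_entropy` are made up for illustration.

```python
import math

def mse(y, y_hat):
    """Mean squared error between a one-hot label and a probability vector."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def cross_entropy(y, y_hat):
    """Multi-class log loss; only the entry of the true class contributes."""
    return -sum(a * math.log(b) for a, b in zip(y, y_hat) if a > 0)

y = [0, 0, 1]                          # true class is the third one
confident_wrong = [0.98, 0.01, 0.01]   # almost all mass on the wrong class
confident_right = [0.01, 0.01, 0.98]   # almost all mass on the right class

for name, y_hat in (("wrong", confident_wrong), ("right", confident_right)):
    print(f"{name}: MSE={mse(y, y_hat):.4f}  log loss={cross_entropy(y, y_hat):.4f}")
```

For the confidently wrong prediction, MSE remains below 1 while the log loss is several times larger.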

MSE and the Problem of Non-Convexity in Logistic Regression

In classification scenarios, we often use gradient-based techniques (Newton-Raphson, gradient descent, etc.) to find the optimal values of the coefficients by minimizing the loss function. If the loss function is not convex, we are not guaranteed to reach the global minimum; we might instead get stuck at a local minimum.

Figure 4: Convex and non-Convex functions

Before diving into why MSE is not a convex function when used in logistic regression, we first look at the conditions for a function to be convex.

A real-valued function defined on an n-dimensional interval is called convex if the line segment between any two points on the graph of the function lies above or on the graph.

If f is twice differentiable and the domain is the real line, then we can characterize it as follows:

f is convex if and only if f''(x) ≥ 0 for all x. Hence, if we can show that the second derivative of our loss function is ≥ 0 everywhere, we can claim it is convex. For more details, you can refer to this video.
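
The second-derivative condition can be probed numerically with a rough sketch. It assumes a finite-difference estimate on a grid is good enough for illustration: a passing check only suggests convexity on the sampled interval and proves nothing. The helper names and the two test functions are illustrative.

```python
def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def looks_convex(f, lo, hi, steps=200, tol=1e-6):
    """True if the estimated f'' is >= -tol at every grid point in [lo, hi]."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return all(second_derivative(f, x) >= -tol for x in xs)

parabola = lambda x: x * x                    # f''(x) = 2 everywhere
double_well = lambda x: x ** 4 - 3 * x * x    # f''(x) = 12x^2 - 6, negative near 0

print(looks_convex(parabola, -3.0, 3.0))
print(looks_convex(double_well, -3.0, 3.0))
```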

Now we mathematically show that the MSE loss function for logistic regression is non-convex.

For simplicity, let's assume we have one feature “x” and binary labels for a given dataset. In the image below, f(x) = MSE and ŷ is the predicted value obtained after applying the sigmoid function.
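
The derivation can be sketched as follows. It is a reconstruction under the assumption that the model is ŷ = sigmoid(θx) with a single coefficient θ and no intercept, and that H(ŷ) denotes the factor of the second derivative whose sign can change.

```latex
\begin{aligned}
\hat{y} &= \frac{1}{1 + e^{-\theta x}},
\qquad
\frac{\partial \hat{y}}{\partial \theta} = \hat{y}\,(1 - \hat{y})\,x \\[4pt]
f &= (y - \hat{y})^2 \\[4pt]
\frac{\partial f}{\partial \theta} &= -2\,(y - \hat{y})\,\hat{y}\,(1 - \hat{y})\,x \\[4pt]
\frac{\partial^2 f}{\partial \theta^2} &= H(\hat{y})\;\hat{y}\,(1 - \hat{y})\,x^2,
\qquad
H(\hat{y}) = -2\left[\,y - 2y\hat{y} - 2\hat{y} + 3\hat{y}^2\,\right] \\[4pt]
y = 0:&\quad H(\hat{y}) = 2\,\hat{y}\,(2 - 3\hat{y}) \\
y = 1:&\quad H(\hat{y}) = 2\,(1 - \hat{y})\,(3\hat{y} - 1)
\end{aligned}
```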

From the above equation, ŷ * (1 - ŷ) lies between 0 and 1/4, so it is never negative. Hence, for the function to be convex, we have to check whether H(ŷ) is non-negative for all values of “x”.

We know that y can take two values, 0 or 1. Let’s check the convexity condition for both cases.

Figure 7: Double derivative of MSE when y=0

So in the above case, when y = 0, it is clear from the equation that H(ŷ) ≥ 0 when ŷ lies in the range [0, 2/3] and H(ŷ) ≤ 0 when ŷ lies in [2/3, 1]. Since the second derivative changes sign, the function is not convex.

Figure 8: Double derivative of MSE when y=1

Now, when y = 1, it is clear from the equation that H(ŷ) ≤ 0 when ŷ lies in the range [0, 1/3] and H(ŷ) ≥ 0 when ŷ lies in [1/3, 1]. Here too the second derivative changes sign, so the function is not convex.

Hence, based on the convexity definition, we have mathematically shown that the MSE loss function for logistic regression is non-convex and therefore not recommended.
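
The sign change can be checked numerically with a short sketch. It assumes the single-feature model ŷ = sigmoid(θx) and uses a closed form for the second derivative of (y - ŷ)² with respect to θ that was derived for this example; the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_second_derivative(theta, x, y):
    """Second derivative of (y - sigmoid(theta * x))**2 with respect to theta."""
    y_hat = sigmoid(theta * x)
    # Factor whose sign decides convexity (called H in the text).
    h = -2.0 * (y - 2.0 * y * y_hat - 2.0 * y_hat + 3.0 * y_hat ** 2)
    return h * y_hat * (1.0 - y_hat) * x ** 2

x = 1.0
for y in (0, 1):
    for theta in (-2.0, 0.0, 2.0):
        y_hat = sigmoid(theta * x)
        curvature = mse_second_derivative(theta, x, y)
        print(f"y={y}  theta={theta:+.1f}  y_hat={y_hat:.3f}  f''={curvature:+.4f}")
```

For each label, the curvature is positive for some values of θ and negative for others, which is what rules out convexity.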

Now comes the question of the convexity of the “log loss” function! We will show mathematically that the log loss function is convex for logistic regression.

Theta: coefficient of the independent variable “x”.
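
The derivation can be sketched as follows. It is a reconstruction assuming the same single-feature model ŷ = sigmoid(θx) and the binary form of the log loss.

```latex
\begin{aligned}
\hat{y} &= \frac{1}{1 + e^{-\theta x}},
\qquad
\frac{\partial \hat{y}}{\partial \theta} = \hat{y}\,(1 - \hat{y})\,x \\[4pt]
f &= -\left[\,y \log \hat{y} + (1 - y)\log(1 - \hat{y})\,\right] \\[4pt]
\frac{\partial f}{\partial \theta} &= -\left[\,y\,(1 - \hat{y}) - (1 - y)\,\hat{y}\,\right] x = (\hat{y} - y)\,x \\[4pt]
\frac{\partial^2 f}{\partial \theta^2} &= \hat{y}\,(1 - \hat{y})\,x^2
= \frac{x^2\, e^{-\theta x}}{\left(1 + e^{-\theta x}\right)^2} \;\ge\; 0
\end{aligned}
```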

As seen in the final expression (the second derivative of the log loss function), the squared terms are always ≥ 0 and, in general, the range of e^x is (0, infinity). Hence the final expression is always ≥ 0, implying that the log loss function is convex in such scenarios!
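
As a sanity check, the curvature of the log loss can be evaluated on a grid of θ values. This is a minimal sketch assuming the single-feature model ŷ = sigmoid(θx) and the closed form ŷ(1 - ŷ)x² for the second derivative; the grid, the feature value, and the function names are arbitrary choices made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(theta, x, y):
    """Binary log loss for one sample under y_hat = sigmoid(theta * x)."""
    y_hat = sigmoid(theta * x)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

def log_loss_second_derivative(theta, x):
    """Closed-form curvature in theta; it does not depend on the label y."""
    y_hat = sigmoid(theta * x)
    return y_hat * (1.0 - y_hat) * x ** 2

x = 1.5
thetas = [i / 10.0 for i in range(-50, 51)]
min_curvature = min(log_loss_second_derivative(t, x) for t in thetas)
print(f"smallest curvature on the grid: {min_curvature:.6f}")

# Cross-check the closed form against a finite-difference estimate at one point.
t, h = 0.3, 1e-4
estimate = (log_loss(t + h, x, 1) - 2.0 * log_loss(t, x, 1) + log_loss(t - h, x, 1)) / (h * h)
print(f"closed form: {log_loss_second_derivative(t, x):.6f}  finite difference: {estimate:.6f}")
```

The curvature stays positive across the whole grid, in contrast to the sign change seen for MSE.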

Final thoughts:

We hope this post helped you understand the drawbacks of using MSE as a loss function in logistic regression. If you have any thoughts, comments or questions, please leave a comment below or contact us on LinkedIn, and don’t forget to click on 👏 if you like the post.

Rajesh Shreedhar Bhat - Data Scientist - WalmartLabs India | LinkedIn


Souradip Chakraborty - Statistical Analyst - Walmart Labs India | LinkedIn


References:

Convex function, Wikipedia (en.wikipedia.org)