Maximizing Regression Model Performance: A Comprehensive Guide to Evaluation Metrics with Python Code and Formulas, including Pros and Cons

Regression models are used to predict the numerical values of a dependent variable based on one or more independent variables. Various evaluation metrics are used to determine the accuracy of the regression models. Here are some of the commonly used evaluation metrics for regression:
Accuracy:
Accuracy is not usually used in regression tasks because it measures the fraction of instances that are classified correctly. In regression, we predict a continuous variable, so there are no classes to match exactly.
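To see why accuracy is a poor fit here, the sketch below applies the classification idea of "fraction of exact matches" to continuous predictions. The values are made up for illustration; the point is only that predictions can be very close to the targets and still score zero.

```python
# Hypothetical example values: predictions that are close but never exactly equal.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [3.01, -0.49, 2.02, 6.98]

# "Accuracy" in the classification sense: fraction of exact matches.
exact_matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = exact_matches / len(y_true)

# Largest absolute error, to show how close the predictions actually are.
max_abs_error = max(abs(t - p) for t, p in zip(y_true, y_pred))

print(accuracy)       # 0.0
print(max_abs_error)  # a small number, well below 0.05
```

This is why the metrics that follow measure how far off the predictions are rather than how often they are exactly right.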
Mean Squared Error (MSE):
MSE is one of the most common metrics used in regression. It calculates the average squared difference between the predicted and actual values. It is given by the following formula:
MSE = (1/n) * Σ (i=1 to n) (yi - ŷi)²
Where,
n is the number of observations,
yi is the actual value,
ŷi is the predicted value.
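The formula can be written out directly in plain Python. This is a minimal sketch for illustration; the function name mean_squared_error_manual and the sample values are assumptions, not part of any library.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n

# Hypothetical sample data.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error_manual(y_true, y_pred)
print(mse)  # 0.375
```

The squared errors are 0.25, 0.25, 0 and 1, so their average is 1.5 / 4 = 0.375.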
Advantages:
· MSE penalizes large errors more heavily than small ones, which makes it useful when large errors (for example, those caused by outliers) are especially costly.
· It is easy to calculate and understand.
Disadvantages:
· MSE is harder to interpret because it is expressed in the squared units of the original data. Taking its square root, the root mean squared error (RMSE), restores the original units.
· Squaring the errors can lead to the metric being dominated by large errors, which may not be representative of the overall performance of the model.
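The two disadvantages above can be illustrated with a handful of made-up errors. In this sketch (all values hypothetical), one large error accounts for almost all of the MSE, and taking the square root brings the result back to the units of the target.

```python
import math

# Hypothetical absolute errors: four small ones and one large one.
errors = [1.0, 1.0, 1.0, 1.0, 10.0]

squared_errors = [e ** 2 for e in errors]
mse = sum(squared_errors) / len(squared_errors)

# Fraction of the total squared error contributed by the single largest error.
share_of_largest = max(squared_errors) / sum(squared_errors)

# The square root of MSE (RMSE) is in the same units as the target variable.
rmse = math.sqrt(mse)

print(mse)               # 20.8
print(share_of_largest)  # roughly 0.96
print(rmse)
```

Here a single error out of five contributes about 96% of the total, which is the sense in which MSE can be dominated by large errors.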
Python Code:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_true, y_pred)
R2 Score:
R2 Score (also known as the coefficient of determination) is a statistical measure that represents the proportion of variance in the dependent variable that is explained by the independent variables. Its best possible value is 1. A model that always predicts the mean of the actual values scores 0, and a model that fits worse than that baseline scores below 0, so the value is not strictly limited to the range 0 to 1. Higher values indicate a better fit. The formula for R2 Score is:
R2 Score = 1 - (Σ(i=1 to n) (yi - ŷi)² / Σ(i=1 to n) (yi - ȳ)²)
Where,
n is the number of observations,
yi is the actual value,
ŷi is the predicted value.
ȳ is the mean of the actual values.
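As a rough sketch, the formula can be computed by hand and tried on three sets of predictions: a good one, one that always predicts the mean, and a deliberately poor one. The function name r2_score_manual and all sample values are assumptions made for illustration.

```python
def r2_score_manual(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    # Note: ss_tot is zero when all actual values are identical; not handled here.
    return 1 - ss_res / ss_tot

# Hypothetical sample data.
y_true = [3.0, -0.5, 2.0, 7.0]
good_pred = [2.5, 0.0, 2.0, 8.0]
mean_pred = [sum(y_true) / len(y_true)] * len(y_true)  # always predict the mean
bad_pred = [7.0, 2.0, -0.5, 3.0]                       # deliberately poor

r2_good = r2_score_manual(y_true, good_pred)
r2_mean = r2_score_manual(y_true, mean_pred)
r2_bad = r2_score_manual(y_true, bad_pred)

print(r2_good)  # about 0.949
print(r2_mean)  # 0.0
print(r2_bad)   # negative: worse than predicting the mean
```

The third result shows that the score can drop below zero when the predictions fit worse than a constant mean prediction.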
Advantages:
· R2 Score provides a measure of how well the model fits the data, which makes it useful for comparing models.
· It is easy to interpret because it is unitless: 1 is a perfect fit, 0 is no better than always predicting the mean, and higher values indicate a better fit.
Disadvantages:
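· R2 Score can be misleading when used to compare models with different numbers of independent variables, because adding variables to a least-squares model never lowers R2 on the training data, even when the new variables are irrelevant.
· It can also be misleading on its own: a high R2 on the training data does not rule out overfitting, so it should also be checked on held-out data.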
Python Code:
from sklearn.metrics import r2_score
r2 = r2_score(y_true, y_pred)
Mean Absolute Error (MAE):
MAE is another commonly used metric in regression. It calculates the average absolute difference between the predicted and actual values. It is given by the following formula:
MAE = (1/n) * Σ(i=1 to n) |yi - ŷi|
Where,
n is the number of observations,
yi is the actual value,
ŷi is the predicted value.
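Written out in plain Python, the formula looks like the sketch below. The function name mean_absolute_error_manual and the sample values are assumptions made for illustration.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    n = len(y_true)
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / n

# Hypothetical sample data.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mae = mean_absolute_error_manual(y_true, y_pred)
print(mae)  # 0.5
```

The absolute errors are 0.5, 0.5, 0 and 1, so the result of 0.5 reads directly as "off by half a unit on average".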
Advantages:
· MAE is interpretable because it is in the same units as the original data.
· It is less sensitive to outliers than MSE because the errors are not squared.
Disadvantages:
· MAE does not punish large errors as much as MSE, which may not be desirable in some cases.
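The trade-off between MAE and MSE can be seen by changing one prediction into an outlier. This is a sketch with made-up numbers; the helper names mae and mse are local to the example.

```python
def mae(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical data: the two prediction lists differ only in the last value.
y_true = [10.0, 12.0, 11.0, 13.0, 12.0]
pred_clean = [11.0, 11.0, 12.0, 12.0, 13.0]    # every error is 1
pred_outlier = [11.0, 11.0, 12.0, 12.0, 32.0]  # the last error is 20

print(mae(y_true, pred_clean), mse(y_true, pred_clean))      # 1.0 1.0
print(mae(y_true, pred_outlier), mse(y_true, pred_outlier))  # 4.8 80.8
```

One bad prediction raises MAE by a factor of about 5 but MSE by a factor of about 80. Whether that milder reaction is a strength or a weakness depends on how costly large errors are in the application.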
Python Code:
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_true, y_pred)
Median Absolute Error:
Median Absolute Error is a robust measure of prediction error: it is the median of the absolute differences between the predicted and actual values. It should not be confused with the median absolute deviation (MAD), which measures the spread of a single set of data around its own median. It is given by the following formula:
Median Absolute Error = median (|yi - ŷi|)
Where,
n is the number of observations over which the median is taken,
yi is the actual value,
ŷi is the predicted value.
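A minimal sketch of the formula using the standard library's statistics.median is shown below. The function name median_absolute_error_manual and the sample values are assumptions made for illustration.

```python
from statistics import median

def median_absolute_error_manual(y_true, y_pred):
    """Median of the absolute differences between actual and predicted values."""
    return median(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred))

# Hypothetical sample data.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

med_ae = median_absolute_error_manual(y_true, y_pred)
print(med_ae)  # 0.5
```

The sorted absolute errors are 0, 0.5, 0.5 and 1. With an even number of values the median is the average of the two middle ones, which gives 0.5.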
Advantages:
· Median Absolute Error is a robust measure of prediction error that is largely unaffected by outliers.
Disadvantages:
· It may not be as sensitive as other metrics to differences between the predicted and actual values, which could make it less useful in some cases.
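Both points above can be seen in one small experiment: making a single prediction wildly wrong leaves the median of the absolute errors unchanged, while the mean of the absolute errors jumps. The data and helper name below are hypothetical.

```python
from statistics import median

def abs_errors(y_true, y_pred):
    return [abs(y - p) for y, p in zip(y_true, y_pred)]

# Hypothetical data: the two prediction lists differ only in the fourth value.
y_true = [3.0, -0.5, 2.0, 7.0, 4.0]
pred_a = [2.5, 0.0, 2.0, 8.0, 4.5]
pred_b = [2.5, 0.0, 2.0, 108.0, 4.5]  # one prediction is off by 101

med_a = median(abs_errors(y_true, pred_a))
med_b = median(abs_errors(y_true, pred_b))
mean_a = sum(abs_errors(y_true, pred_a)) / len(y_true)
mean_b = sum(abs_errors(y_true, pred_b)) / len(y_true)

print(med_a, med_b)    # 0.5 0.5
print(mean_a, mean_b)  # 0.5 20.5
```

The unchanged median is the robustness described under Advantages. It is also the drawback: the metric gives no signal that one prediction became far worse.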
Python Code:
from sklearn.metrics import median_absolute_error
med_ae = median_absolute_error(y_true, y_pred)
Explained Variance Score:
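Explained Variance Score measures the proportion of variance in the dependent variable that is explained by the model. It is similar to R2 Score, but it uses the variance of the residuals instead of their sum of squares, so a constant offset in the predictions is not penalized. The two scores are equal when the residuals have zero mean. Like R2 Score, its best possible value is 1, and it can be negative for a poor model. The formula for Explained Variance Score is:
Explained Variance Score = 1 - Var(y - ŷ) / Var(y)
Where,
Var(y - ŷ) is the variance of the residuals (the differences between the actual and predicted values)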
Var(y) is the variance of the actual values.
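The sketch below computes the score by hand with statistics.pvariance and compares it with a hand-written R2 on predictions that are shifted by a constant. The function names and sample values are assumptions made for illustration.

```python
from statistics import pvariance

def explained_variance_manual(y_true, y_pred):
    """1 minus (variance of the residuals / variance of the actual values)."""
    residuals = [y - y_hat for y, y_hat in zip(y_true, y_pred)]
    return 1 - pvariance(residuals) / pvariance(y_true)

def r2_manual(y_true, y_pred):
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical sample data.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
shifted = [y + 2.0 for y in y_true]  # every prediction is too high by exactly 2

evs = explained_variance_manual(y_true, y_pred)
evs_shifted = explained_variance_manual(y_true, shifted)
r2_shifted = r2_manual(y_true, shifted)

print(evs)          # about 0.957
print(evs_shifted)  # 1.0, because the residuals have zero variance
print(r2_shifted)   # about 0.452, because R2 penalizes the offset
```

The shifted predictions get a perfect Explained Variance Score even though every one of them is wrong by 2, so the score is best read alongside a metric that reflects the size of the errors.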
Advantages:
· Explained Variance Score provides a measure of how well the model fits the data, which makes it useful for comparing models.
· It is easy to interpret because it is unitless: 1 means the model explains all of the variance, and higher values indicate a better fit.
Disadvantages:
· It can be misleading when used to compare models with different numbers of independent variables.
· It ignores systematic bias: predictions that are all off by the same constant amount can still receive a perfect score.
Python Code:
from sklearn.metrics import explained_variance_score
evs = explained_variance_score(y_true, y_pred)
Variance:
In statistics, variance is a measure of how spread out a set of data is. It measures the average squared deviation of the values from their mean. A higher variance indicates that the data points are more spread out from the mean, while a lower variance indicates that the data points are closer to the mean. The formula for variance is:
Variance = Σ(i=1 to n) (xi - x̄)² / n
Where,
n is the number of data points,
xi is the i-th data point,
x̄ is the mean of the data,
The summation symbol Σ means “sum of”.
Another way to write the formula for variance is:
Variance = (Σ(i=1 to n) xi² / n) - x̄²
where xi² is the square of the i-th data point.
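As a quick check, the two ways of writing the variance can be compared on a small made-up data set. Both divide by n, which is the population variance; statistics.pvariance from the standard library computes the same quantity, while the sample variance (statistics.variance) divides by n - 1 instead.

```python
from statistics import pvariance

# Hypothetical sample data with mean 5.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
mean = sum(data) / n

# First form: average squared deviation from the mean.
var_definition = sum((x - mean) ** 2 for x in data) / n

# Second form: mean of the squares minus the square of the mean.
var_shortcut = sum(x ** 2 for x in data) / n - mean ** 2

print(var_definition)   # 4.0
print(var_shortcut)     # 4.0
print(pvariance(data))  # 4.0
```

The two forms are algebraically identical. With floating-point numbers the second form can lose precision when the mean is large compared with the spread, so the first form is usually the safer one to implement.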
Variance is commonly used in statistics to measure the amount of variability or spread in a set of data. It is an important concept in probability theory and is used in many areas of statistics, including hypothesis testing, regression analysis, and machine learning.
References:
https://machinelearningmastery.com/regression-metrics-for-machine-learning/




