This article provides Python functions and explanations for ten regression metrics commonly used by data scientists.
Abstract
The article titled "10 Regression Metrics Data Scientist Must Know (Python-Sklearn Code Included)" presents detailed explanations and Python code for ten important regression metrics used in data science. These metrics include Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Square Error (RMSE), Root Mean Square Logarithmic Error (RMSLE), R², Adjusted R² Score, Mean Absolute Percentage Error (MAPE), Mean Squared Logarithmic Error (MSLE), Symmetric Mean Absolute Percentage Error (SMAPE), and Normalized Root Mean Squared Error (NRMSE). The article explains each metric, provides formulas, and offers Python functions for calculating each one. The author's goal is to equip data scientists with the tools and knowledge needed to assess the performance of their regression models accurately.
Opinions
The Mean Absolute Error (MAE) is a fundamental metric for understanding the average magnitude of errors in a regression model.
The Mean Squared Error (MSE) averages the squared differences between the predicted and actual values, and the Root Mean Square Error (RMSE) takes its square root, which puts the error back in the units of the target variable.
The Root Mean Square Logarithmic Error (RMSLE) is particularly useful for regression models where the target variable has a wide range, as it measures the ratio of prediction and actual values.
R² and Adjusted R² Score are important for understanding the proportion of the variation in the dependent variable that is predictable from the independent variables, with the Adjusted R² Score penalizing for adding independent variables that do not help with prediction.
The Mean Absolute Percentage Error (MAPE) provides a measure of prediction accuracy based on percentage errors.
The Mean Squared Logarithmic Error (MSLE) is a variation of the MSE that depends only on the relative (percentage) difference between the predicted and actual values.
The Symmetric Mean Absolute Percentage Error (SMAPE) is an accuracy measure based on percentage (or relative) errors and is particularly useful for cases where the actual and forecast values can be negative.
The Normalized Root Mean Squared Error (NRMSE) is the RMSE divided by the range of the observed data, that is, the difference between the maximum and minimum observations.
10 Regression Metrics Data Scientist Must Know (Python-Sklearn Code Included)
1. Mean Absolute Error
Definition:
Mean Absolute Error (MAE) is a measure of errors between paired observations expressing the same phenomenon. Examples of Y versus X include comparisons of predicted versus observed, subsequent time versus initial time, and one technique of measurement versus an alternative technique of measurement.
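Formula:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
where y_i is the actual value, \hat{y}_i is the predicted value, and n is the number of samples.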
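from sklearn.metrics import mean_absolute_error

def mae(y_true, y_pred):
    """
    Mean absolute error regression loss.
    Args:
        y_true ([np.array]): test samples
        y_pred ([np.array]): predicted samples
    """
    # Assumed body: delegate to scikit-learn, as the article's other helpers do
    return mean_absolute_error(y_true, y_pred)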
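To make the definition concrete, here is a minimal sketch that computes MAE by hand using only the standard library. The name `mae_manual` and the sample values are illustrative assumptions, not part of the original article.

```python
def mae_manual(y_true, y_pred):
    """Average of the absolute differences between paired values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    absolute_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute_errors) / len(absolute_errors)


# Illustrative values (assumed for this example)
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# Absolute errors are 0.5, 0.5, 0.0 and 1.0, so the mean is 2.0 / 4
print(mae_manual(y_true, y_pred))  # 0.5
```

Because every error contributes in proportion to its size, the result reads directly as "the predictions are off by 0.5 units on average".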
2. Mean Squared Error
Definition:
Mean squared error (MSE) of an estimator measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values.
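Formula:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$
where y_i is the actual value, \hat{y}_i is the predicted value, and n is the number of samples.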
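from sklearn.metrics import mean_squared_error

def mse(y_true, y_pred):
    """
    Mean squared error regression loss.
    Args:
        y_true ([np.array]): test samples
        y_pred ([np.array]): predicted samples
    """
    # Assumed body: delegate to scikit-learn, as the article's other helpers do
    return mean_squared_error(y_true, y_pred)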
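Squaring the errors means one large miss costs more than several small ones. The sketch below illustrates this with two hypothetical prediction sets that have the same total absolute error. The name `mse_manual` and all values are assumptions made for the example.

```python
def mse_manual(y_true, y_pred):
    """Average of the squared differences between paired values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


y_true = [10.0, 10.0, 10.0, 10.0]
# Both prediction sets have a total absolute error of 4
even_errors = [11.0, 11.0, 11.0, 11.0]    # four errors of size 1
one_big_error = [10.0, 10.0, 10.0, 14.0]  # a single error of size 4

print(mse_manual(y_true, even_errors))    # 1.0
print(mse_manual(y_true, one_big_error))  # 4.0
```

Both sets would score the same under a mean absolute error, but the single large miss quadruples the MSE, which is why MSE is sensitive to outliers.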
3. Root Mean Square Error
Definition:
Root-mean-square error (RMSE) is a frequently used measure of the differences between values (sample or population values) predicted by a model or an estimator and the values observed. It represents the square root of the second sample moment of the differences between predicted values and observed values or the quadratic mean of these differences.
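Formula:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$
where y_i is the actual value, \hat{y}_i is the predicted value, and n is the number of samples.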
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """
    Root Mean Square Error
    """
    return np.sqrt(mean_squared_error(y_true, y_pred))
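As a worked example of the "quadratic mean of the differences", the sketch below computes RMSE step by step with the standard library. The name `rmse_manual` and the sample values are illustrative assumptions.

```python
import math


def rmse_manual(y_true, y_pred):
    """Square root of the mean of the squared differences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


y_true = [10.0, 10.0, 10.0, 10.0]
y_pred = [10.0, 10.0, 10.0, 14.0]

# Squared errors are 0, 0, 0 and 16, so the mean is 4 and the root is 2
print(rmse_manual(y_true, y_pred))  # 2.0
```

The result is in the same units as the target. For this data the mean absolute error would be 1.0, so the larger RMSE of 2.0 reflects the extra weight given to the single big miss.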
4. Root Mean Square Logarithmic Error
Definition:
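Root-mean-square logarithmic error (RMSLE) is the root mean squared error of the log-transformed predicted and log-transformed actual values. Because a difference of logarithms is the logarithm of a ratio, RMSLE measures the ratio between the predicted and actual values rather than their absolute difference.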
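import numpy as np

def rmsle(y_true, y_pred):
    """
    Root Mean Squared Logarithmic Error
    Args:
        y_true ([np.array]): test samples
        y_pred ([np.array]): predicted samples
    Returns:
        [float]: root mean squared logarithmic error
    """
    # Assumption: squared differences of log(1 + x) are averaged over the
    # pairs that pass the non-negativity check
    squared_log_errors = []
    for i in range(len(y_true)):
        if y_true[i] < 0 or y_pred[i] < 0:
            continue
        squared_log_errors.append((np.log1p(y_pred[i]) - np.log1p(y_true[i])) ** 2)
    return np.sqrt(np.mean(squared_log_errors))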
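The claim that RMSLE measures a ratio can be checked with a small experiment. The following sketch assumes the common log(1 + y) form of the transform. The name `rmsle_manual` and the values are hypothetical, chosen so that each prediction is double the actual value at very different scales.

```python
import math


def rmsle_manual(y_true, y_pred):
    """Root mean squared difference of log(1 + y) values."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Each prediction is double its actual value, at very different scales
small_scale = rmsle_manual([100.0], [200.0])      # absolute error of 100
large_scale = rmsle_manual([10000.0], [20000.0])  # absolute error of 10000

# Both results land near log(2), about 0.69
print(small_scale, large_scale)
```

The absolute errors differ by a factor of 100, yet the two scores are nearly identical, which is the behavior that makes RMSLE suitable for targets with a wide range.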
5. R2 Score
Definition:
R² (also known as the coefficient of determination in statistics) is the proportion of the variation in the dependent variable that is predictable from the independent variables.
from sklearn.metrics import r2_score

def r2(y_true, y_pred):
    """
    R^2 (coefficient of determination) regression score function.
    Best possible score is 1.0, lower values are worse.
    Args:
        y_true ([np.array]): test samples
        y_pred ([np.array]): predicted samples
    Returns:
        [float]: R2
    """
    return r2_score(y_true, y_pred)
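The "proportion of variation" can be written as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below spells that out with the standard library. The name `r2_manual` and the sample values are illustrative assumptions, and the function assumes `y_true` is not constant (otherwise the total sum of squares is zero).

```python
def r2_manual(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(r2_manual(y_true, y_pred))  # about 0.949

# Always predicting the mean of y_true explains none of the variation
mean_only = [sum(y_true) / len(y_true)] * len(y_true)
print(r2_manual(y_true, mean_only))  # 0.0
```

A model that does worse than predicting the mean gets a negative score, which is why the docstring above says only that lower values are worse rather than giving a lower bound.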
6. Adjusted R2 Score
Definition:
Adjusted R² measures the proportion of variation explained by only those independent variables that really help in explaining the dependent variable. It penalizes the addition of independent variables that do not help with the prediction. The only difference between the R² and Adjusted R² equations is the adjustment for degrees of freedom.
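The degrees-of-freedom adjustment is commonly written as Adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of independent variables. The sketch below assumes that standard form. The function name `adjusted_r2`, its parameter names, and the example numbers are illustrative and do not come from the original article.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Same R^2 and sample size, but more predictors lowers the adjusted score
print(adjusted_r2(0.90, n_samples=50, n_features=2))   # about 0.896
print(adjusted_r2(0.90, n_samples=50, n_features=10))  # about 0.874
```

With the same R² of 0.90, the model with ten predictors is scored lower than the one with two, so an extra variable only pays off if it raises R² by more than the penalty.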
7. Mean Absolute Percentage Error
Definition:
Mean absolute percentage error (MAPE) is a measure of the prediction accuracy of a forecasting method. It usually expresses the accuracy as a ratio defined by the formula below:
$$\mathrm{MAPE} = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|$$
A_t is the actual value and F_t is the forecast value. The absolute value of this ratio is summed over every forecasted point in time and divided by the number of fitted points n.
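import numpy as np

def mape(y_true, y_pred):
    """
    Mean absolute percentage error regression loss.
    Args:
        y_true ([np.array]): test samples
        y_pred ([np.array]): predicted samples
    Returns:
        [float]: mean absolute percentage error
    """
    # Assumed body: mean of |A_t - F_t| / |A_t|, returned as a fraction;
    # undefined when y_true contains zeros
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true))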
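A short worked example can make the ratio concrete. The sketch below uses only the standard library and returns the result as a fraction rather than a percentage. The name `mape_manual`, the zero check, and the sample values are assumptions added for illustration.

```python
def mape_manual(y_true, y_pred):
    """Mean of |A_t - F_t| / |A_t|, returned as a fraction (0.1 means 10%)."""
    if any(a == 0 for a in y_true):
        raise ValueError("MAPE is undefined when an actual value is zero")
    ratios = [abs((a - f) / a) for a, f in zip(y_true, y_pred)]
    return sum(ratios) / len(ratios)


actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]

# Relative errors are 10%, 10% and 0%
print(mape_manual(actual, forecast))  # about 0.0667
```

Because each error is divided by its actual value, a miss of 10 on an actual of 100 counts the same as a miss of 20 on an actual of 200, and the metric cannot be computed when an actual value is zero.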
8. Mean Squared Logarithmic Error
Definition:
Mean squared logarithmic error (MSLE) can be interpreted as a measure of the ratio between the true and predicted values. It is a variation of MSE that depends only on the relative (percentage) difference, not on the absolute difference.
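from sklearn.metrics import mean_squared_log_error

def msle(y_true, y_pred):
    """
    Mean squared logarithmic error regression loss.
    Args:
        y_true ([np.array]): test samples
        y_pred ([np.array]): predicted samples
    Returns:
        [float]: mean squared logarithmic error
    """
    # Assumed body: delegate to scikit-learn, which applies log(1 + x)
    return mean_squared_log_error(y_true, y_pred)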
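One consequence of working with ratios is that MSLE treats under- and over-prediction differently for the same absolute error. The sketch below, which assumes the log(1 + y) form of the transform, illustrates this. The name `msle_manual` and the values are hypothetical.

```python
import math


def msle_manual(y_true, y_pred):
    """Mean of the squared differences between log(1 + y) values."""
    squared_log_errors = [
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return sum(squared_log_errors) / len(squared_log_errors)


# Same absolute error of 50, in opposite directions
under = msle_manual([100.0], [50.0])   # prediction is about half the actual
over = msle_manual([100.0], [150.0])   # prediction is about 1.5x the actual

print(under > over)  # True
```

Predicting 50 for an actual of 100 is off by a factor of about 2, while predicting 150 is off by a factor of about 1.5, so the under-prediction receives the larger penalty even though both miss by 50.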