# Machine Learning Metrics for Regression

## Exploring Advanced Statistical Evaluation Parameters at the University Level

# Introduction

Prepare for a journey into advanced statistical evaluation parameters. The following seven metrics hold the keys to deciphering the intricacies of your data and will empower you to make informed decisions in machine learning and regression analysis:

- Correlation analysis
- Chi² contingency analysis
- p-value analysis
- Kolmogorov-Smirnov test
- R² coefficient of determination
- Explained Variance Score
- Mean Squared Error

# Correlation analysis

Correlation analysis describes the relationship between two or more quantitative variables, assessing how linearly the data relate to each other.
It examines the strength, or magnitude, of the relationship between the data as well as its direction. (Gogtay, N. J., & Thatte, U. M. (2017). Principles of correlation analysis. *Journal of the Association of Physicians of India, 65*(3), 78–81.)

The final result of a correlation analysis is a correlation coefficient whose values range from -1 to 1. A correlation coefficient of plus one means that the two variables are in a positive linear relationship. A correlation coefficient of -1 means that the two variables have a negative linear relationship, while a correlation coefficient of 0 means that there is no linear relationship between the two variables under study.

The correlation coefficient (Pearson's *r*) can be described using the following formula:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Correlation analysis makes it possible to describe the relationship between variables, but it does not provide information about a causal relationship between them. A statistically significant correlation coefficient only indicates an association, without confirming that one variable is the cause of changes in the other variable. Further research is needed to prove a causal relationship.
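The coefficient can be computed directly from its definition as the covariance of the two variables divided by the product of their standard deviations. A minimal pure-Python sketch (the helper name `pearson_r` and the sample values are illustrative):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A perfectly increasing linear relationship yields r = 1
print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # → 1.0
```

A perfectly decreasing relationship would instead return -1, and unrelated data would return a value near 0.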

## Summary

In conclusion, correlation analysis assesses the strength and direction of the relationship between quantitative variables using a correlation coefficient. Values range from -1 to 1, with positive values indicating a positive linear relationship, negative values a negative relationship, and 0 indicating no linear relationship. However, it’s important to note that correlation does not imply causation; it merely identifies associations between variables without confirming causality.

# Chi² Contingency Analysis

The study of cross-category data is widely used in evaluation and research. The chi-square test is one of the most commonly used statistical analyses for testing the stochastic independence of two variables.
(Franke, T. M., Ho, T., & Christie, C. A. (2012). The chi-square test: Often used and more often misinterpreted. *American journal of evaluation, 33*(3), 448–458.)

The chi² test is used to determine whether there is a significant dependence between two categorical variables by comparing the observed frequencies with the frequencies expected under independence.

The variables must be measured on a nominal or ordinal scale, and a random sample of more than 50 observations should be used to obtain a valid result.

Chi² can be calculated from the observed frequencies $O_{ij}$ and the expected frequencies $E_{ij}$ using the following formula:

$$\chi^2 = \sum_{i}\sum_{j}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

However, this chi² value alone is not yet meaningful; in addition, the critical value, which depends on the degrees of freedom and the significance level, must be determined.

Since the formula for the critical value is very complex, the value is usually read from a chi² distribution table instead, indexed by the number of degrees of freedom and the significance level. If the calculated chi² value is greater than the critical value, the two variables are considered stochastically dependent.
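The statistic sums $(O - E)^2 / E$ over all cells of the contingency table, with expected frequencies derived from the row and column totals. A minimal pure-Python sketch (the function name and the 2×2 table are illustrative; 3.841 is the standard critical value for one degree of freedom at the 0.05 level):

```python
def chi2_statistic(table):
    """Chi-squared statistic for a contingency table (list of rows)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected frequency under independence
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical 2x2 table of observed frequencies
observed = [[30, 10],
            [15, 25]]
critical = 3.841  # chi² table value for df = 1, significance level 0.05
print(chi2_statistic(observed) > critical)  # → True (stochastically dependent)
```

For larger tables the degrees of freedom are (rows − 1) × (columns − 1), and the critical value must be looked up accordingly.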

## Summary

In summary, the chi-square contingency analysis is a crucial statistical method for assessing relationships between categorical or ordinal variables. It helps determine if there is significant dependence between them. Researchers use it with appropriately scaled variables and a large, random sample. If the calculated chi-square value exceeds the critical value based on significance level and degrees of freedom, it suggests a non-random relationship, with important implications for data interpretation.

# p-Value Analysis

The p-value is one way to decide whether to accept or reject the null hypothesis, i.e., whether a result is considered statistically significant. (Ugoni, A., & Walker, B. F. (1995). The Chi-square test: an introduction. *COMSIG review, 4*(3), 61.)

The p-value is derived from the sampling distribution of the chosen test statistic under the null hypothesis; in practice it is usually reported directly by statistical software.

First, a significance level must be determined; this is usually 0.05.

Now the p-value is compared with the defined significance level. If the p-value is below the significance level, the probability of observing differences or correlations at least this extreme under the null hypothesis is less than 5%. In this case, the null hypothesis is rejected and a real association or dependence between the variables is assumed. However, care must be taken not to misinterpret or overinterpret the values.
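The decision rule itself is a simple comparison; a minimal sketch (the helper name `is_significant` is illustrative):

```python
def is_significant(p_value, alpha=0.05):
    """Reject the null hypothesis when the p-value is below the significance level."""
    return p_value < alpha

print(is_significant(0.03))  # → True  (reject the null hypothesis)
print(is_significant(0.20))  # → False (fail to reject)
```

A stricter significance level, e.g. `alpha=0.01`, reduces false positives at the cost of requiring stronger evidence.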

## Summary

In conclusion, p-value analysis is a method used to assess the significance of results in hypothesis testing. It compares the calculated p-value to a predetermined significance level (typically 0.05) to determine whether the observed differences or correlations would be unlikely under the null hypothesis. A p-value below the significance level suggests a real association between variables, but caution is needed in interpretation.

# Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov test is a non-parametric test for checking the goodness of fit between two distributions.
It checks whether an underlying probability distribution deviates from an assumed distribution.
(Dodge, Y. (2008). Kolmogorov–Smirnov Test. In *The Concise Encyclopedia of Statistics* (pp. 283–287). New York, NY: Springer New York.)

It is often used to find out whether data are normally distributed since many statistical procedures assume or have as a prerequisite that the data are normally distributed.

In this test, the distribution of the observed data is compared with the theoretical distribution, and the null hypothesis is made that the observed data originate from the theoretical distribution. If the p-value of the test is less than a predetermined significance level, the null hypothesis is rejected and it can be assumed that the observed data do not originate from the theoretical distribution.

The test statistic *D*, the maximum distance between the empirical cumulative distribution function $F_n(x)$ and the theoretical one $F(x)$, can be calculated using this formula:

$$D = \sup_x \lvert F_n(x) - F(x) \rvert$$

A p-value close to zero means that the null hypothesis is rejected, indicating that the distributions differ, whereas a p-value close to one means that the null hypothesis cannot be rejected, i.e., the data are consistent with the assumed distribution. (Steinskog, D. J., Tjøstheim, D. B., & Kvamstø, N. G. (2007). A Cautionary Note on the Use of the Kolmogorov–Smirnov Test for Normality. *Monthly Weather Review, 135*(3), 1151–1157. doi:10.1175/mwr3326.1)

The D-value indicates the maximum vertical distance between the cumulative distribution functions. A D-value of zero means that there is no distance, whereas a value of one shows that the maximum difference is 100%.
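The D-value is the largest gap between the empirical CDF, which jumps by 1/n at each sorted observation, and the theoretical CDF. A minimal pure-Python sketch testing against a Uniform(0, 1) distribution (the function name and sample are illustrative):

```python
def ks_statistic(sample, cdf):
    """Maximum distance between the empirical CDF of `sample` and a theoretical `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # the empirical CDF jumps from i/n to (i+1)/n at x,
        # so the gap must be checked on both sides of the jump
        d = max(d, abs((i + 1) / n - cdf(x)), abs(cdf(x) - i / n))
    return d

uniform_cdf = lambda x: min(max(x, 0.0), 1.0)  # CDF of Uniform(0, 1)
sample = [0.1, 0.2, 0.3, 0.4, 0.5]
print(ks_statistic(sample, uniform_cdf))  # → 0.5
```

The sample above is clustered in the lower half of the interval, so the empirical CDF reaches 1.0 at x = 0.5 while the uniform CDF is only 0.5 there, giving D = 0.5.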

## Summary

In conclusion, the Kolmogorov-Smirnov test is a non-parametric method used to assess the goodness of fit between observed data and a theoretical distribution. It is commonly employed to check if data are normally distributed, crucial for many statistical analyses. By comparing observed and theoretical distributions, this test yields a p-value, where a low value rejects the null hypothesis, suggesting a mismatch between distributions. Additionally, the D-value indicates the maximum vertical difference between cumulative distribution functions.

# R² Coefficient of Determination

R², also called coefficient of determination or multiple correlation coefficient, has long been known in classical regression analysis.
Because of its definition as the proportion of variance explained by the regression model, it is a useful measure of predictive success.
(Nagelkerke, N. J. (1991). A note on a general definition of the coefficient of determination. *Biometrika, 78*(3), 691–692.)

The coefficient of determination can be interpreted as the ratio between the variance of *y* explained by the model and the total variance of *y* and thus is independent of the distribution of *y*.
(Hoffmann, F., Bertram, T., Mikut, R., Reischl, M., & Nelles, O. (2019). Benchmarking in classification and regression. *WIREs Data Mining and Knowledge Discovery, 9*(5), e1318. doi:10.1002/widm.1318)

R² expresses how well samples are likely to be predicted by the model through the proportion of variance explained. The best possible value is 1, corresponding to 100% of the variance explained, and it can be calculated as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
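R² can be computed directly from its definition as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal pure-Python sketch (the function name and sample values are illustrative):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]
print(r2_score(y_true, y_pred))  # → 0.975
```

A perfect prediction yields 1.0, while a model that merely predicts the mean of *y* yields 0; a model worse than the mean can even produce negative values.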

## Summary

In summary, the R² coefficient of determination is a key metric in regression analysis, quantifying the proportion of variance in the dependent variable *y* explained by the regression model.
It serves as a measure of predictive success, with a maximum value of 1 indicating a perfect fit. This metric is independent of the distribution of *y* and is a valuable tool for assessing the quality of regression models.

# Explained Variance Score

The Explained Variance Score is a metric used in machine learning to assess the quality of predictions. It tests how well the prediction of the model explains the variance of the actual data. As with the R² value, 1 is the best possible result.

The EVS value is used primarily in regression models that predict continuous values, as it quantifies a model’s ability to explain the variance present in the continuous data.

The difference between the Explained Variance Score and the R² coefficient of determination is that the Explained Variance Score does not take systematic deviations (bias) in the prediction into account. For this reason, the R² value should generally be preferred.

The EVS value can be determined using the following formula:

$$\mathrm{EVS} = 1 - \frac{\mathrm{Var}(y - \hat{y})}{\mathrm{Var}(y)}$$
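The bias-blindness mentioned above can be demonstrated in a few lines: predictions that are off by a constant offset still achieve a perfect EVS, because the variance of a constant error is zero. A minimal pure-Python sketch (the function name and values are illustrative):

```python
def explained_variance(y_true, y_pred):
    """Explained Variance Score: 1 - Var(y_true - y_pred) / Var(y_true)."""
    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    return 1 - variance(errors) / variance(y_true)

y_true = [1.0, 2.0, 3.0, 4.0]
biased = [2.0, 3.0, 4.0, 5.0]  # every prediction off by a constant +1
print(explained_variance(y_true, biased))  # → 1.0 (the constant bias is ignored)
```

An R² computed on the same biased predictions would be lower, which is why R² is usually the safer choice.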

## Summary

In conclusion, the Explained Variance Score (EVS) is a valuable metric in machine learning for assessing the quality of predictions in regression models. It measures how well a model’s predictions explain the variance in actual data, with a perfect score of 1 indicating an ideal fit. While useful, it’s important to note that EVS does not account for systematic deviations in predictions, making the R² value a preferred choice in some cases for evaluating model performance.

# Mean Squared Error

Validation of a regression model is concerned with the goodness of fit of the regression, particularly its predictive power with unseen data.

Regression analysis examines whether the regression residuals, i.e., the deviation between the predicted and the actual values, are randomly distributed or exhibit a regularity that is not yet explained by the regression model.
A common measure of model accuracy is the mean squared error. (Hoffmann, F., Bertram, T., Mikut, R., Reischl, M., & Nelles, O. (2019). Benchmarking in classification and regression. *WIREs Data Mining and Knowledge Discovery, 9*(5), e1318. doi:10.1002/widm.1318)

This is calculated as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

The MSE describes the average squared deviation between the predicted values and the actual values; the aim is to achieve the smallest possible value. Note that, because the errors are squared, the MSE is expressed in the squared units of the target variable (its square root, the RMSE, is in the original units). Larger deviations are penalized more by squaring than smaller ones.

However, this can be a disadvantage at the same time, since the MSE is prone to outliers due to the squaring of the errors. A few very large errors can greatly increase the MSE, even if most predictions are accurate.
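Both the formula and the outlier sensitivity can be illustrated in a few lines of pure Python (the function name and sample values are illustrative):

```python
def mse(y_true, y_pred):
    """Mean squared error between actual and predicted values."""
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)

y_true = [2.0, 4.0, 6.0, 8.0]
good = [2.1, 3.9, 6.2, 7.8]          # all predictions close to the truth
with_outlier = [2.1, 3.9, 6.2, 18.0]  # one prediction far off

print(mse(y_true, good))          # small error (about 0.025)
print(mse(y_true, with_outlier))  # dominated by the single outlier
```

Here one bad prediction raises the MSE by roughly three orders of magnitude, even though the other three predictions are unchanged.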

## Summary

In summary, the Mean Squared Error (MSE) is a critical metric in regression analysis, assessing the goodness of fit and predictive accuracy of a regression model, especially with unseen data. It quantifies model accuracy by calculating the average squared deviation between predicted and actual values, with the goal of minimizing this value. The MSE penalizes large deviations heavily, which makes it sensitive to outliers: a few very large errors can have a disproportionate impact due to the squaring.

# Conclusion

In summary, these seven crucial evaluation parameters are fundamental tools in the realm of data analysis and regression evaluation. Each parameter serves a specific purpose, and their interplay offers a comprehensive understanding of the underlying data relationships and model performance.

**Correlation Analysis** reveals linear relationships between quantitative variables, quantifying the strength and direction of these associations. While a correlation coefficient of +1 indicates a perfect positive linear relationship, -1 signifies a perfect negative linear relationship, and 0 denotes no linear relationship.

**Chi² Contingency Analysis** is invaluable for assessing associations between categorical variables. By comparing observed and expected frequencies, this test unveils significant dependencies within cross-category data.

**p-Value Analysis** is a critical tool for hypothesis testing.
It quantifies the likelihood that observed differences or correlations are due to random chance. Typically, a significance level of 0.05 is used, and if the p-value falls below this threshold, it suggests genuine associations.

**The Kolmogorov-Smirnov Test** evaluates the goodness of fit between observed data and assumed distributions. Frequently employed to assess data normality, it yields a p-value. A low p-value implies a departure from the theoretical distribution.

**R² Coefficient of Determination** quantifies the proportion of variance explained by regression models, offering insights into predictive success, irrespective of data distribution.

**Explained Variance Score (EVS)** assesses prediction quality in regression models for continuous data. It measures how well a model’s predictions explain variance in the observed data.

**Mean Squared Error (MSE)** is a pivotal metric for regression model accuracy. It computes the average squared deviation between predicted and actual values, emphasizing the minimization of errors.

These parameters collectively empower analysts, researchers, and data scientists to unlock meaningful insights, make informed decisions, and construct robust predictive models. Understanding their applications and relationships enhances our ability to navigate the intricate landscape of data analysis effectively.