Summary

The provided content distinguishes between Variance, Covariance, and Correlation as statistical measures of the relationship between variables, emphasizing their differences and applications.

Abstract

Variance, Covariance, and Correlation are fundamental statistical concepts used to analyze the relationships between variables in a dataset. Variance quantifies the spread of data from the mean within a single variable, while Covariance assesses the joint variability of two variables, indicating whether they move in the same or opposite directions. Correlation, on the other hand, measures both the direction and strength of the relationship between variables, standardizing the measure to a scale of -1 to 1, which is independent of the variables' units. The article further explains the construction of Variance-Covariance and Correlation matrices, and argues that Correlation is often preferred over Covariance due to its unit-free nature and invariance to changes in scale. These measures are crucial for understanding data behavior and are commonly used in statistical modeling and data analysis.

Opinions

The author suggests that Correlation is more interpretable than Covariance because it is unit-free and standardized.
The article conveys the importance of understanding the differences between these measures to effectively apply them in statistical analysis.
It is implied that the Variance-Covariance and Correlation matrices are essential tools in multivariate analysis for understanding the relationships within a dataset.
The author emphasizes that Covariance can vary with the scale of the variables, which can be seen as a limitation in certain analyses.
The preference for Correlation over Covariance is presented as a widely held view in the statistical community.

Variance vs Covariance vs Correlation: What is the Difference?

Twins from Different Universes

Variance, Covariance, and Correlation are common terms in statistics. They often appear in the same context but serve different purposes. In this post, we will explore what they are and how they differ from each other.

What is Variance?

Variance measures variability: how far the values in a dataset spread out from their average.

We can use the following formulas to compute the variance.

N = The number of observations in the population
n = The number of observations in the sample
Xi = ith observation in the data
μ = The population mean
x̄ = The sample mean
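Using the symbols defined above, the standard population and sample variance formulas can be written as follows. The names σ² (population variance) and s² (sample variance) are conventional notation assumed here; they are not taken from the original text.

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2
\qquad\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{x})^2
```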

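For the sample variance, we divide by “n-1” instead of “n” (aka, Bessel’s correction) to correct the bias that arises from measuring deviations around the sample mean rather than the population mean.

Variance can never be negative. It is zero only when every value in the dataset is identical, and the higher the variance, the greater the variability of the data values.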
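As a minimal sketch of the two variance calculations, the following computes the population version (divide by N) and the sample version (divide by n-1) by hand and with NumPy. The data values and the function names are made up for illustration.

```python
import numpy as np

# Hypothetical data, chosen only for illustration.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])


def population_variance(values):
    """Mean squared deviation from the mean, dividing by N."""
    values = np.asarray(values, dtype=float)
    return np.sum((values - values.mean()) ** 2) / values.size


def sample_variance(values):
    """Squared deviations from the sample mean, dividing by n - 1."""
    values = np.asarray(values, dtype=float)
    return np.sum((values - values.mean()) ** 2) / (values.size - 1)


pop_var = population_variance(x)
samp_var = sample_variance(x)

# NumPy equivalents: ddof=0 (the default) divides by N, ddof=1 by n - 1.
np_pop_var = np.var(x)
np_samp_var = np.var(x, ddof=1)

print(pop_var, samp_var)
```

The sample variance is always at least as large as the population variance computed on the same numbers, because the same sum of squares is divided by a smaller number.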

What is Covariance?

Covariance measures how two variables vary together: the degree to which the deviation of one variable (X) from its mean is related to the deviation of another variable (Y) from its mean.

We can use the following formulas to compute the covariance.
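In the same notation as the variance section, the standard population and sample covariance formulas can be written as follows. The symbols σ_XY, s_XY, μ_X, μ_Y, and ȳ are conventional names assumed here for illustration.

```latex
\sigma_{XY} = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu_X)(Y_i - \mu_Y)
\qquad\qquad
s_{XY} = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{x})(Y_i - \bar{y})
```

Setting Y equal to X in either formula gives back the corresponding variance formula, so variance is the covariance of a variable with itself.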

Unlike variance, which can never be negative, covariance can take both positive and negative values. Its value can fall anywhere between -∞ and +∞.

What does the sign of a covariance indicate?

Positive covariance indicates that the two variables (X and Y), on average, move in the same direction. Greater values of X correspond with greater values of Y. When X is greater than its mean, Y is likely greater than its mean. Similarly, when X is less than its mean, Y is likely less than its mean.

Negative covariance indicates that the two variables (X and Y), on average, move in opposite directions. Greater values of X correspond with smaller values of Y. When X is greater than its mean, Y is likely less than its mean. Similarly, when X is less than its mean, Y is likely greater than its mean.

Zero covariance indicates that there is no linear relationship between the two variables (X and Y). The variables can still be related in a nonlinear way.
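The three cases can be sketched with a small example. The data below is invented for illustration, and sample_covariance is a hypothetical helper that applies the n-1 version of the covariance formula. The last case illustrates a caveat: covariance only picks up linear co-movement, so it can be zero even when Y is completely determined by X.

```python
import numpy as np


def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (x.size - 1)


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

y_same = np.array([2.0, 4.0, 5.0, 4.0, 5.0])  # tends to rise with x
y_opposite = 10.0 - y_same                    # tends to fall as x rises
y_curved = (x - 3.0) ** 2                     # U-shaped around the mean of x

cov_same = sample_covariance(x, y_same)
cov_opposite = sample_covariance(x, y_opposite)
cov_curved = sample_covariance(x, y_curved)

print(cov_same, cov_opposite, cov_curved)
```

Here y_curved depends entirely on x, yet its covariance with x is zero because the positive and negative products of deviations cancel out.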

How to Create a Variance-Covariance Matrix?

Variance and covariance usually appear together in a Variance-Covariance Matrix. The Variance-Covariance matrix is a symmetric matrix in which the diagonal elements are the variances of the variables and the off-diagonal elements are the covariances between each pair of variables.

Suppose we have a matrix X with a dimension of “n x k”. This matrix includes n observations of k variables (i.e., X1, X2, X3, …, Xk).

We can define the means of these k variables in the following matrix.

Then we subtract its mean from each column of matrix X to create the de-meaned version of X, Xc.

Lastly, we compute the cross product of the transpose of Xc and Xc and divide it by n. Dividing by n gives the population version; for the sample version, divide by n-1 instead. Then we have the variance-covariance matrix shown in the following format.
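The steps above can be sketched in a few lines of NumPy. The matrix X below is made-up data with n = 5 observations of k = 3 variables, and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical data: n = 5 observations (rows) of k = 3 variables (columns).
X = np.array([
    [4.0, 2.0, 0.60],
    [4.2, 2.1, 0.59],
    [3.9, 2.0, 0.58],
    [4.3, 2.1, 0.62],
    [4.1, 2.2, 0.63],
])
n, k = X.shape

# Step 1: the mean of each column (one mean per variable).
means = X.mean(axis=0)

# Step 2: subtract each column's mean to get the de-meaned matrix Xc.
Xc = X - means

# Step 3: cross product of Xc' and Xc, divided by n (population version).
cov_matrix = (Xc.T @ Xc) / n

# Cross-check against NumPy; bias=True makes np.cov divide by n as well.
cov_matrix_np = np.cov(X, rowvar=False, bias=True)

print(cov_matrix)
```

The result is a k x k matrix. Its diagonal holds the variance of each column, and entry (i, j) holds the covariance between columns i and j.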

What is Correlation?

While Covariance measures how the two variables are varying together, Correlation (or Correlation Coefficient) indicates how strongly the two variables are related to each other and measures both the direction and strength of the relationship.

We can use the following formulas to compute the correlation.
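The standard (Pearson) correlation coefficient divides the covariance by the product of the two standard deviations. The symbols ρ and r, and the subscripted σ and s, are conventional notation assumed here rather than taken from the original.

```latex
\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \, \sigma_Y}
\qquad\qquad
r_{XY} = \frac{s_{XY}}{s_X \, s_Y}
       = \frac{\sum_{i=1}^{n} (X_i - \bar{x})(Y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (X_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{y})^2}}
```

In the sample version the n-1 factors cancel between numerator and denominator, so the choice of divisor does not affect the correlation.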

The value of Correlation always lies between -1 and +1, inclusive.

Positive correlation indicates that when one variable increases, the other variable tends to increase as well. The closer the correlation value is to 1, the more closely the data points follow a straight line with a positive slope; a value of exactly 1 means a perfect increasing linear relationship.

Negative correlation indicates that when one variable increases, the other variable tends to decrease. The closer the correlation value is to -1, the more closely the data points follow a straight line with a negative slope; a value of exactly -1 means a perfect decreasing linear relationship.

Zero correlation indicates that there is no linear relationship between the two variables (X and Y). As with covariance, a nonlinear relationship may still exist.

How to Create a Correlation Matrix?

The correlation matrix is a symmetric matrix where the diagonal elements are 1 and the off-diagonal elements are pairwise correlations.

Let’s first construct matrix D.

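Then we subtract its mean from each column of matrix X and divide the result by that column's standard deviation to create the standardized matrix, Xs.

Lastly, we compute the cross product of the transpose of Xs and Xs and divide it by n. For the diagonal to equal exactly 1, the standard deviations used to build Xs must also be computed with n as the divisor. Then we have the correlation matrix shown in the following format.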
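A minimal NumPy sketch of this construction follows. The data is made up, and one assumption is labeled in the code: since the definition of matrix D is not shown in the text, D is taken here to be the diagonal matrix of the column standard deviations, which is a common choice but may differ from the author's.

```python
import numpy as np

# Hypothetical data: n = 5 observations (rows) of k = 3 variables (columns).
X = np.array([
    [4.0, 2.0, 0.60],
    [4.2, 2.1, 0.59],
    [3.9, 2.0, 0.58],
    [4.3, 2.1, 0.62],
    [4.1, 2.2, 0.63],
])
n, k = X.shape

# ASSUMPTION: D is the diagonal matrix of column standard deviations.
# np.std uses ddof=0 by default, i.e. it divides by n.
std = X.std(axis=0)
D = np.diag(std)

# Standardize: subtract each column's mean, then divide by its std.
Xs = (X - X.mean(axis=0)) / std

# Cross product of Xs' and Xs, divided by n, gives the correlation matrix.
corr_matrix = (Xs.T @ Xs) / n

# Equivalent route under the same assumption: rescale the covariance matrix.
Xc = X - X.mean(axis=0)
cov_matrix = (Xc.T @ Xc) / n
D_inv = np.linalg.inv(D)
corr_from_cov = D_inv @ cov_matrix @ D_inv

print(corr_matrix)
```

Both routes agree with np.corrcoef(X, rowvar=False), and every diagonal entry equals 1 because each variable is perfectly correlated with itself.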

What is the difference between Covariance and Correlation?

Although both covariance and correlation measure how a change in one variable is reflected in another variable, correlation is often preferred over covariance for the following reasons.

Measurement units: Correlation is a unit-free measure that takes a value between -1 and 1. This makes it easier to interpret than covariance.
Change in scale: Covariance is affected by scaling the variables. For example, if we multiply one variable by a constant and the other variable by a different constant, the covariance is multiplied by the product of the two constants. Correlation does not change in this case, as long as both constants are positive (if exactly one of them is negative, the correlation keeps its magnitude but flips its sign).
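The effect of a change in scale can be checked with a short sketch. The heights and weights below are invented values; converting metres to centimetres and kilograms to grams multiplies the two variables by 100 and 1000 respectively.

```python
import numpy as np

# Hypothetical measurements, for illustration only.
height_m = np.array([1.60, 1.70, 1.75, 1.80, 1.90])
weight_kg = np.array([55.0, 65.0, 72.0, 80.0, 90.0])

# Same quantities expressed in different units.
height_cm = 100.0 * height_m
weight_g = 1000.0 * weight_kg

# np.cov and np.corrcoef return 2 x 2 matrices; [0, 1] is the pairwise value.
cov_before = np.cov(height_m, weight_kg)[0, 1]
cov_after = np.cov(height_cm, weight_g)[0, 1]

corr_before = np.corrcoef(height_m, weight_kg)[0, 1]
corr_after = np.corrcoef(height_cm, weight_g)[0, 1]

# Multiplying one variable by a negative constant flips the sign only.
corr_flipped = np.corrcoef(height_m, -weight_kg)[0, 1]

print(cov_before, cov_after)
print(corr_before, corr_after, corr_flipped)
```

The covariance grows by a factor of 100 x 1000 purely because of the unit change, while the correlation is the same in both unit systems.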

Summary
