Deepak Chopra | Talking Data Science

Summary

P-Values are a measure of how likely it is that observed data would have occurred by random chance, and they are used to determine the significance of observed data.

Abstract

P-Values are a statistical concept used in hypothesis testing, linear / logistic regressions, Anova / Ancova tables, and more. They are a measure of how likely it is that observed data would have occurred by random chance. P-Values are used to determine the significance of observed data, and they are compared to a pre-determined alpha level to determine whether the observed data is significantly different from the norm. Significance levels and Alpha levels are decided by the data scientist, and they represent the confidence level and threshold error rate that the data scientist is happy to have in their analysis.

Opinions

  • P-Values are widely used and abused in Data Science.
  • P-Values are a measure of how likely it is that observed data would have occurred by random chance.
  • P-Values are used to determine the significance of observed data.
  • Significance levels and Alpha levels are decided by the data scientist.
  • P-Values are compared to a pre-determined alpha level to determine whether the observed data is significantly different from the norm.
  • P-Values are a statistical concept used in hypothesis testing, linear / logistic regressions, Anova / Ancova tables, and more.
  • It is important to understand what P-Values mean and what they do not mean.
  • P-Values are a must to understand correctly in order to interpret statistical analysis.
  • Significance levels and Alpha levels are decided by the data scientist, and they represent the confidence level and threshold error rate that the data scientist is happy to have in their analysis.

P-Values: An introduction to correct interpretation

Understanding a very widely used (and abused) statistical concept in 5 min


If, like me, you were exposed to Data Science early on, chances are you could not have escaped hearing about ‘p-values’. They are everywhere: in hypothesis testing, linear / logistic regressions, Anova / Ancova tables and so on. At the same time, there are a lot of Analysts / Data Scientists who do not understand what a p-value means and use it blindly to reach conclusions in their analysis. This post aims to build the intuition behind P-values: what they mean and, more importantly, what they do not mean. It also touches on significance levels, alpha values and the null hypothesis, which you must understand to interpret P-values correctly.

What does it mean?

P-Value is a measure of how likely it is that the observed data (or something more extreme) would have occurred by random chance alone.

It conveys how likely the observed data value is under the premise of the Null Hypothesis. → If this likelihood is LOW, the Null Hypothesis might NOT hold; if it is high, there is NO reason to question the Null Hypothesis.

We will talk about the Null Hypothesis later on. Let’s start with a basic example, and then build layers of understanding on top of it.

Consider a scenario in which you collect data on house prices in the UK and plot them. Please see picture-1 (below), where the chart on the left shows the distribution of house prices in the UK. The mean house price is £400K. Clearly, house prices are centred around the mean (represented by the peak). This means that if we were to select a house at random, the likelihood of its price being close to £400K is high, and as we move away from the mean towards either extreme, the likelihood of finding houses at those prices decreases rapidly.

Let’s pick a random value, say £600K on this distribution. The chances (or probability) that you will find a house price of £600K or more (i.e. at least £600K) can be represented by the shaded area towards the right of this point. This is what the ‘P-value’ represents.

Picture-1 | distribution of House Prices in the UK (dummy data) (image by author)

P-value is the total probability of getting a value at least as ‘extreme’ as the observed value when the values are picked randomly from the population’s distribution.
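As a rough sketch of this definition, the shaded area can be computed from a cumulative distribution function. The snippet below assumes the dummy house prices follow a normal distribution with mean £400K. The standard deviation (about £213K) is not given in this post; it is my own assumption, chosen so that 5% of houses cost more than £750K. All names are illustrative.

```python
from statistics import NormalDist

# Assumption: UK house prices (in £K) are roughly normal with mean 400.
# The standard deviation is an assumption, picked so that 5% of houses
# cost more than £750K.
MEAN_PRICE = 400.0
STD_PRICE = 350.0 / NormalDist().inv_cdf(0.95)  # about 212.8

population = NormalDist(mu=MEAN_PRICE, sigma=STD_PRICE)


def right_tail_p_value(observed_price: float) -> float:
    """Probability of drawing a price at least as high as the observed one."""
    return 1.0 - population.cdf(observed_price)


p_600 = right_tail_p_value(600.0)
print(f"P(price >= £600K) = {p_600:.3f}")
```

Under these assumptions the shaded area to the right of £600K comes out at roughly 0.17, i.e. about one house in six would be priced at £600K or more.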

The above definition is useful but does not give a flavour of how it can dictate the outcome of your analysis. In order to understand the uses of P-value, let’s also dive deep into Significance and Alpha levels.

Significance levels and Alpha

The whole idea of statistical testing is to know whether what we are observing is significant (or not) or in other words, whether or not our observed data is significantly different from the norm (or population).

This means we are always comparing two things and trying to determine whether or not the difference is statistically significant. The Significance level is a pre-decided cut-off based on which we judge whether our observations are significantly different or not. It is the pre-determined confidence level that you, as a data scientist, want to have in your analysis.

The Alpha level comes directly from your chosen significance level and acts as a threshold error rate that you, as a data scientist, are willing to accept in your analysis. The alpha value is a threshold p-value: when the observed p-value falls at or below it, you are happy to consider the observed sample value significantly different from the population, or even as belonging to a different distribution.

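Alpha value = 1 − Significance

Please note that the Significance level, and in turn the Alpha level, is decided by you. The most commonly chosen value for significance is 95%. For Significance = 95%, alpha becomes 5%. (In many statistics texts, the 95% figure is called the confidence level and alpha itself is called the significance level; this post uses ‘Significance’ for the 95% figure throughout.)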

In picture-2 below, the orange point represents this threshold alpha cutoff at 5%, and the corresponding house price is £750K. That is, based on the distribution the area on the right of this point (shaded area in orange) is 5%.

Picture-2 | comparing observed values with population distribution (dummy data) (image by author)
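The relationship between the chosen confidence, alpha and the cutoff price can be sketched in a few lines. As before, this assumes a normal dummy population with mean £400K and a standard deviation of about £213K, which I backed out of the £750K cutoff quoted above, so the code mainly shows the mechanics of going from alpha to a cutoff price.

```python
from statistics import NormalDist

confidence = 0.95          # what this post calls the significance level
alpha = 1.0 - confidence   # threshold p-value, 0.05

# Assumption: normal dummy population, mean £400K, sd of about £213K
std_price = 350.0 / NormalDist().inv_cdf(0.95)
population = NormalDist(mu=400.0, sigma=std_price)

# Price that leaves exactly `alpha` of the distribution to its right
cutoff_price = population.inv_cdf(1.0 - alpha)
print(f"alpha = {alpha:.2f}, cutoff = £{cutoff_price:.0f}K")
```

Choosing a stricter confidence (say 99%) shrinks alpha and pushes the cutoff price further to the right.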

Let’s say the green and red dots represent some additional sample observations we have of house prices (they may or may not be from the UK). The green dot represents a house price of £500K (chart-2.1 in picture-2), while the red dot represents a house price of £800K (chart-2.2 in picture-2). We want to check whether or not they come from the original population distribution of UK house prices.

  • Let’s start with the green point corresponding to £500K. The p-value associated with it is more than the threshold alpha (i.e. the area to the right of the green dot is larger than the alpha area shaded in orange). That is, the probability of observing a house price at least as extreme as £500K is greater than 5% (our threshold alpha). This means there is a high likelihood (compared to our threshold of 5%) of observing such a value in the population distribution. As it can be observed with a high probability from this population, there is no reason to think it comes from a different distribution. → This implies the green dot is not significantly different from this population’s distribution.

(Please note that if, as a data scientist, you were more relaxed about the confidence you want, i.e. you lowered confidence below 95% so that alpha rises above 5%, this green dot could turn significant.)

  • Now, let’s look at the red point corresponding to £800K. Clearly, the p-value corresponding to the red point is less than the threshold alpha we have set (i.e. the area to the right of the red dot is smaller than the alpha area shaded in orange). Based on our pre-decided significance level (and hence pre-decided alpha level), we can say that a house price of £800K or more is unlikely to be observed from this population distribution. Therefore, the red dot is significantly different from the population and may well have come from a different distribution (i.e. this house may not be in the UK).

(Please note that if, as a data scientist, you were stricter about the confidence, i.e. you raised confidence beyond 95% and in turn lowered alpha below 5%, this red dot observation could turn non-significant.)
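The comparison for the two dots can be sketched as follows. The normal shape and the standard deviation of about £213K are my assumptions (the post only gives the £400K mean and the £750K cutoff), so the exact p-values are illustrative.

```python
from statistics import NormalDist

ALPHA = 0.05

# Assumption: normal dummy population, mean £400K, sd of about £213K
std_price = 350.0 / NormalDist().inv_cdf(0.95)
population = NormalDist(mu=400.0, sigma=std_price)

observations = {"green": 500.0, "red": 800.0}  # prices in £K

p_values = {}
for name, price in observations.items():
    # Right-tail area: probability of a price at least this high
    p_values[name] = 1.0 - population.cdf(price)
    verdict = "significant" if p_values[name] <= ALPHA else "not significant"
    print(f"{name}: £{price:.0f}K, p-value {p_values[name]:.3f} -> {verdict}")
```

Under these assumptions the green dot has a p-value of roughly 0.32 and the red dot roughly 0.03. So the green dot would only turn significant if alpha were raised above about 0.32, and the red dot would stop being significant if alpha were lowered below about 0.03.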

Confused? Let’s examine in detail what we did above:

We have one distribution (i.e. the population distribution of UK house prices in this case) and a couple of additional sample observations represented by a green and a red dot (corresponding to house prices of £500K and £800K respectively). For both the green dot and the red dot, the observed values (£500K and £800K) are different from the population mean (£400K).

TASK: Determine whether or not the observed house prices belong to the population distribution. This in turn means checking whether the observed values of the green and red dots are significantly different from this population.

Notice that only one of two scenarios can hold:

  1. The house is in the UK and, by random chance, we selected a house priced at the observed value (£500K or £800K), which differs from the mean UK house price of £400K (i.e. the population mean).
  2. The house is not in the UK, and hence we are observing a price (£500K or £800K) that differs from the mean UK house price of £400K (i.e. the population mean).

We start with assuming the baseline, the neutral hypothesis, also known as the Null Hypothesis. We assume the Null Hypothesis to be true from the start and look for evidence to disprove it. In this case,

Null Hypothesis: Green and Red points belong to the population distribution.

Please note that we start by assuming the neutral null hypothesis to be true and then try to gather evidence to reject it.

Steps we implicitly followed in the above framework:

  • We pre-selected our confidence/significance level, which means we pre-decided our alpha level (i.e. 5%, the most commonly accepted one). Note that we want to check whether or not the observed values (£500K and £800K) belong to the population distribution, based on our pre-decided level of confidence.
  • Based on the observed values (£500K and £800K), we found the corresponding p-values, i.e. the likelihood of getting an observation at least this extreme from the population distribution.
  • We compared each p-value with our pre-decided alpha level.

Only two possibilities…

1. Observed P-Value ≤ Alpha:

It means that it is unlikely (compared to our pre-decided alpha level) to get the observed value from the population distribution.

→ The observed value is significantly different from the population distribution. → It may be coming from a different distribution altogether.

Based on the evidence, we can reject the Null hypothesis.

2. Observed P-Value > Alpha:

It means that it is reasonably likely (compared to our pre-decided alpha level) to get the observed value from the population distribution.

→ The observed value is NOT significantly different from the population distribution. → There is no evidence that it comes from a different distribution.

Based on the evidence, we cannot reject the Null hypothesis.

For Alpha = 5%,

p-Value ≤ 0.05 : Reject NULL Hypothesis

p-Value > 0.05 : Cannot Reject Null Hypothesis
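The decision rule is simple enough to write as a small helper. This is only a sketch; the function name `decide` and its return strings are my own, not part of any library.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the rule: reject the null hypothesis when p-value <= alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value <= alpha:
        return "reject the null hypothesis"
    return "cannot reject the null hypothesis"


print(decide(0.03))              # below alpha
print(decide(0.32))              # above alpha
print(decide(0.03, alpha=0.01))  # same p-value, stricter alpha
```

Note that the third call flips the outcome of the first one purely because alpha changed, which is why alpha should be fixed before looking at the data.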

P-value can be thought of as a measure of the evidence against the null hypothesis: the smaller the p-value, the stronger the evidence.

Did You know?

P-Value only conveys whether or not the observed data is significantly different from what is stated in our Null Hypothesis. P-value by no means signifies the magnitude of the difference between the observed data and what is stated in the Null Hypothesis.
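One way to see this is to go slightly beyond the single-house example and test the average price of a sample of houses. The sketch below uses a one-sided z-test on a sample mean, which is my own addition for illustration, and again assumes a normal population with mean £400K and a standard deviation of about £213K.

```python
from math import sqrt
from statistics import NormalDist

POP_MEAN = 400.0
# Assumption: population sd of about £213K (not given in the post)
POP_STD = 350.0 / NormalDist().inv_cdf(0.95)


def mean_p_value(sample_mean: float, n: int) -> float:
    """Right-tail p-value for the mean of n houses (one-sided z-test)."""
    standard_error = POP_STD / sqrt(n)
    z = (sample_mean - POP_MEAN) / standard_error
    return 1.0 - NormalDist().cdf(z)


small_gap = mean_p_value(410.0, n=10_000)  # only £10K above the mean
large_gap = mean_p_value(600.0, n=1)       # £200K above the mean

print(f"£10K gap, 10,000 houses: p = {small_gap:.2e}")
print(f"£200K gap, 1 house:      p = {large_gap:.3f}")
```

The tiny £10K gap produces a far smaller p-value than the £200K gap, simply because it is backed by many more houses. The p-value reflects how surprising the data is under the Null Hypothesis, not how big the difference is.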

Summary

P-value is a probability value corresponding to the likelihood of obtaining a data value (‘test statistic’), which is at least as “extreme” as the actually observed data value (observed ‘test statistic’), under the assumption that Null Hypothesis is correct.

P-value corresponds to how likely your data (or something more extreme) would be under the null hypothesis.

P-Value provides an answer to: assuming no difference (i.e. the Null Hypothesis), what is the likelihood of seeing a difference at least as large as the one we observe in the data?
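The "assuming the Null Hypothesis" part can also be checked by brute force: draw many houses from the population and count how often the price is at least as extreme as the observed one. This is a minimal sketch under the same assumed normal population (mean £400K, standard deviation of about £213K); the seed and number of draws are arbitrary choices.

```python
import random
from statistics import NormalDist

random.seed(0)  # fixed seed so the run is repeatable

# Assumption: normal dummy population, mean £400K, sd of about £213K
std_price = 350.0 / NormalDist().inv_cdf(0.95)
observed = 600.0
n_draws = 200_000

# Draw prices under the null hypothesis: every house comes from the population
draws = (random.gauss(400.0, std_price) for _ in range(n_draws))
simulated_p = sum(price >= observed for price in draws) / n_draws

# Same quantity from the cumulative distribution function
exact_p = 1.0 - NormalDist(400.0, std_price).cdf(observed)
print(f"simulated: {simulated_p:.3f}, exact: {exact_p:.3f}")
```

The simulated fraction should land close to the area computed from the distribution, which is all a p-value is: the share of null-hypothesis outcomes at least as extreme as yours.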

Remember: Null Hypothesis = Neutral, Baseline, No-change hypothesis

  • P-value ≤ Alpha-value → Reject NULL Hypothesis
  • P-value > Alpha-value → Do NOT Reject NULL Hypothesis

Have fun with P-values!
