You say you want statistical significance?

“What is the minimum sample needed to determine statistical significance?”
This is possibly the most common question I have received working as a data scientist for companies that are looking to test new features in their product, searching for new creative direction, or aiming to optimize user experience.
Let’s first start with all the pitfalls I have personally witnessed:

Pitfall Number 1:
Since I work in marketing and we are mainly looking at conversion rates, the main argument I hear is:
“We don’t have enough installs/clicks to determine statistical significance”
However, if you are trying to measure something like click-through rate (CTR = clicks/impressions), then your sample is everyone who saw the ad, not just those who clicked on it. That is, you care about all the (unique) impressions you served. Likewise, if you were measuring conversion rate (CVR = installs/clicks), then you would care about all clicks to the site/app.
This is similar to how you would think about a clinical trial. Your sample is determined as the total number of people in each treatment group, not just those who reacted to the treatment.
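To make that concrete, here is a minimal sketch of where the sample size enters the math. The numbers and the function name `rate_with_standard_error` are made up for illustration; the point is only that the denominator (impressions for CTR, clicks for CVR) is what drives the uncertainty of the estimate.

```python
import math


def rate_with_standard_error(successes, trials):
    """Return the observed rate and its standard error.

    `trials` is the full denominator (impressions for CTR, clicks for CVR),
    not the number of successes.
    """
    rate = successes / trials
    std_err = math.sqrt(rate * (1 - rate) / trials)
    return rate, std_err


# Hypothetical numbers: 150 clicks from 10,000 unique impressions.
ctr, ctr_se = rate_with_standard_error(successes=150, trials=10_000)
print(f"CTR = {ctr:.4f}, standard error = {ctr_se:.5f}")
```

So 150 clicks may sound small, but the sample here is the 10,000 impressions.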

Now that we have that straight, we can move on to the other pitfalls that come up.
Pitfall Number 2:
Let’s assume the test objectives have been set, budget has been allocated, and the test is in progress. Of course, everyone is eager to see how the test is performing because, well, decisions need to be made and time is money. So someone is dedicated to constantly monitoring the test, checking significance every step of the way.
If this is what you are doing, STOP! More often than not, this will lead you to end the test early, the moment significance is found, and that produces false positives.
To gain a better understanding, assume you have an experiment set up and you evaluate results after 200 observations and 500 observations. Below are the different scenarios that can result:

However, if you ended the test the moment you saw significance, it would result in something like this:

Notice that this method has led you to increase the ratio of significant to insignificant results. “Therefore, the reported significance level, the ‘percent of the time the observed difference is due to chance’, will be wrong.”
For more about the implications of this, read here.
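If you want to see the effect for yourself, here is a rough simulation sketch. It runs A/A tests (both groups share the same true rate, so every "significant" result is a false positive) and checks significance at 200 and 500 observations, which I am assuming here to mean observations per group. The 10% base rate, the pooled two-proportion z-test, and the function names are my own choices for illustration, not part of any particular tool.

```python
import math
import random
from statistics import NormalDist


def two_proportion_p_value(x_a, n_a, x_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (x_a + x_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:  # no conversions (or all conversions) in both groups
        return 1.0
    z = (x_a / n_a - x_b / n_b) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_peeking(n_sims=2000, rate=0.10, checkpoints=(200, 500),
                     alpha=0.05, seed=0):
    """Run A/A tests and return two false positive rates:
    (judged only at the final checkpoint, stopped at the first significant one).
    """
    rng = random.Random(seed)
    final_only = 0
    stop_when_significant = 0
    n_max = checkpoints[-1]
    for _ in range(n_sims):
        # Both groups are drawn from the same true rate (hypothetical 10%).
        arm_a = [rng.random() < rate for _ in range(n_max)]
        arm_b = [rng.random() < rate for _ in range(n_max)]
        significant = [
            two_proportion_p_value(sum(arm_a[:n]), n, sum(arm_b[:n]), n) < alpha
            for n in checkpoints
        ]
        final_only += significant[-1]
        stop_when_significant += any(significant)
    return final_only / n_sims, stop_when_significant / n_sims


final_rate, peeking_rate = simulate_peeking()
print(f"Judged once at the end:        {final_rate:.3f}")
print(f"Stopped at first significance: {peeking_rate:.3f}")
```

The first number should land near the 5% you signed up for, while the second comes out higher, and it only gets worse as you add more checkpoints.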
Pitfall Number 3:
That moment when statistical significance is determined and there is no conversation around practical significance. Sometimes the difference between the control and treatment groups is as small as 0.5%!
I mean great, it’s statistically significant, but is it actionable? This will depend on the context…
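Here is a small sketch of how the two questions can be kept separate. The counts are invented, and `min_worthwhile_lift` is a hypothetical threshold that you would have to agree on with the business before the test; the significance check is a pooled two-proportion z-test.

```python
import math
from statistics import NormalDist


def evaluate_test(x_control, n_control, x_treatment, n_treatment,
                  alpha=0.05, min_worthwhile_lift=0.005):
    """Return (p_value, lift, statistically_significant, practically_significant).

    `min_worthwhile_lift` is an assumed business threshold, expressed as an
    absolute difference in rates.
    """
    rate_c = x_control / n_control
    rate_t = x_treatment / n_treatment
    pooled = (x_control + x_treatment) / (n_control + n_treatment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    z = (rate_t - rate_c) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    lift = rate_t - rate_c
    return p_value, lift, p_value < alpha, abs(lift) >= min_worthwhile_lift


# Hypothetical test: one million users per group, 10.0% vs 10.1% conversion.
p_value, lift, stat_sig, practical = evaluate_test(
    100_000, 1_000_000, 101_000, 1_000_000
)
print(f"p-value = {p_value:.4f}, lift = {lift:.4f}")
print(f"statistically significant: {stat_sig}, worth acting on: {practical}")
```

With a sample this large, a 0.1 point lift clears the significance bar, yet it falls well short of the assumed threshold for acting on it.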
Now that we’ve covered all that, let’s discuss specific actions we can take to overcome these issues.
- Determine sample size before starting the test (aka Power Analysis).
- Refrain from checking the results of your test until you have reached the specified number of samples!
This will help immensely with getting clean and accurate results from a test that could have cost a LOT of money.
So let’s get started…
Minimum Sample Size
Let’s say you got tasked with this job: finding the minimum sample size for an upcoming experiment. What would you do? Google a sample size calculator, right? THAT’S RIGHT! Which will lead you to sites like:
Easy peasy. You start playing with the numbers, but wait…
What’s an effect size?? Why does a smaller effect size lead to a much larger sample size?? Power, what’s power? BAH!
You start looking up effect size and dig deep into your memory, trying to remember the statistics you learned years ago! Not happening…

Okay, okay… breathe! Let’s start from the beginning.
To determine the minimum sample size needed for proper evaluation, we need to understand the intimate relationship between all the factors that play into the calculation, which are:
- Effect Size
- Significance level (Probability of a Type 1 Error)
- Power (1 − Probability of a Type 2 Error)
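
To see how these three factors interact, here is a minimal sketch of one common normal-approximation formula for comparing two proportions (two-sided test, equal group sizes). Online calculators may use slightly different formulas, so expect small differences. The function name and the 10% baseline are assumptions for illustration.

```python
import math
from statistics import NormalDist


def min_sample_size_per_group(baseline_rate, expected_rate,
                              alpha=0.05, power=0.80):
    """Approximate minimum sample size per group for a two-sided test
    comparing two proportions, using the unpooled normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance level
    z_power = NormalDist().inv_cdf(power)          # power = 1 - P(Type 2 error)
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate         # absolute effect size
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical 10% baseline conversion rate with shrinking effect sizes.
for expected in (0.15, 0.12, 0.11, 0.105):
    n = min_sample_size_per_group(0.10, expected)
    print(f"10.0% -> {expected:.1%}: {n} per group")
```

The effect size is squared in the denominator, which is why halving the difference you want to detect roughly quadruples the sample you need.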

