Pitfalls in Product Experimentation
Common don'ts that are often overlooked in product experimentation, leading to poor and unreliable results
We all know product experimentation is important, and its benefits have largely been proven by organizations, enabling data-driven decisions on products, features, and processes. Google famously tested 41 shades of blue on links in its search results, and the winning shade reportedly added around $200M in annual revenue. Booking.com has credited much of its scaling and transformation to the sheer number of tests and experiments it runs.
However, product experiments, like any other statistical testing or experimentation, are prone to pitfalls: design and/or execution flaws that can stay hidden or unsuspected throughout the process. It is the duty of the data team (Product Data Analysts/Data Scientists) to guardrail experiment execution and analysis so that the results are reliable. It is therefore important to understand the common pitfalls and how to treat them, as they can mislead the analysis results and conclusions.
If the experiment is not configured and analysed properly, it can produce poor and unreliable results, defeating the initial purpose of the experiment: testing out the treatments and gauging their impact.
Configuration pitfalls
Before looking into the statistical strategy and analysis, it is essential to get the planning and design of the overall experiment right. While the points here seem basic, there is a high chance of them being overlooked (again, precisely because they are so basic), and getting them wrong can render the whole experiment useless.
- Optimizing for the wrong metrics. Metric selection drives the overall decision of whether the treatment changes get rolled out or not. As a rule of thumb, an experiment metric should be relevant to the business and movable/impacted by the treatment given. Two sanity checks: (1) If this metric goes up/down, would you be happy? (2) If you were a user given the treatment, would you do (or stop doing) the activities that move the metric?
- Not maximizing the potential of variations. In the theoretical world, A/B testing (or split testing) is the common term: comparing two versions of something to figure out which performs better. In the practical world, this can be extended to more than two versions (A/B/n testing) or to testing combinations of variables (multivariate testing). Having more variations helps maximize resource utilization and the chance of finding the best option from the experiment. They come with some statistical side effects (e.g. a larger sample size requirement and an inflated familywise error rate; see the sketch after this list), but they are still worth exploring.
- Overlapping experiments. There can be numerous experiments happening at the same time in an organization. Problems occur when different experiments run on similar features and interfere with each other, affecting the same metrics on an overlapping subset of users. A metric increase might then come not from the treatment alone, but also from a treatment in another, overlapping experiment. Organization-wide coordination (from experiment timing to targeting assignment) helps minimize this issue.
- Going directly to full rollout. It might be tempting to run the experiment at full rollout right away to minimize the time needed and get the result as soon as possible. However, experiment changes are still “product releases” and things can go wrong along the way. It is recommended to approach the experiment with a staged rollout to reduce the risk of these releases.
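To make the familywise error rate mentioned above concrete, here is a minimal sketch of how the chance of at least one false positive grows as more variants are compared against the control; the variant counts and significance level are illustrative assumptions, not figures from this article.

```python
# A minimal sketch: probability of at least one false positive when k treatment
# variants are each compared against control at significance level alpha.
# The numbers are assumptions for illustration only.
alpha = 0.05

for k in range(1, 6):  # from a simple A/B up to an A/B/n with 5 variants
    familywise_error = 1 - (1 - alpha) ** k
    print(f"{k} variant(s): familywise error rate = {familywise_error:.3f}")
```

With five variants, the chance of a spurious “winner” is already above 20%, which is why the sample size and correction considerations matter.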
Having a product experimentation platform can be a potential solution to prevent these pitfalls, ensuring standardized metrics and best practices are implemented across the process.
Statistical pitfalls
Product experimentation is the process of continually testing hypotheses for ways to improve your product. Hypothesis testing itself is essentially a form of statistical inference, and hence there are statistical principles to follow in order to do product experiments properly.
Depending on the product context and use case, experiments might be statistically more complicated and require extra care. Below are some of the common pitfalls.
Experiment “peeking”
When running an experiment, it is quite tempting to check the results straight away after deployment and draw (premature) conclusions, especially if the results look good or align with our hypothesis. This is called the experiment “peeking” problem.
Experiment “peeking” occurs when the outcome is erroneously called before the proper sample size has been reached. Even if the initial results show statistical significance, that significance may be purely due to chance, and any inference drawn before the proper sample size is reached is flawed.
The ideal way to tackle this is to fix the sample size at the beginning of the test and defer any conclusion until that sample size is reached (a quick sample size calculation is sketched below). However, in some cases reaching a sufficient sample size might take too long and become impractical. One technique to explore in this case is sequential testing, where the final sample size is dynamic and depends on the data observed during the test: if we observe more extreme results at the start, the test can be ended earlier.
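As an illustration of fixing the sample size up front, below is a minimal sketch using statsmodels for a two-proportion test. The baseline rate, expected uplift, significance level, and power are hypothetical assumptions, not numbers from this article.

```python
# A minimal sketch of a fixed-horizon sample size calculation for comparing
# two conversion rates. All rates and thresholds are assumptions for
# illustration only.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10   # assumed conversion rate of the control experience
expected_rate = 0.12   # assumed conversion rate under the treatment

effect_size = proportion_effectsize(expected_rate, baseline_rate)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,              # significance level
    power=0.80,              # probability of detecting the effect if it exists
    alternative="two-sided",
)
print(f"Required sample size per variant: {n_per_variant:,.0f}")
```

Conclusions would only be drawn once each variant has collected roughly this many users; sequential testing relaxes this by pre-planning interim checks with adjusted thresholds.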
Not setting the right null hypothesis
In product experiments, we set up a null hypothesis to be tested (rejected or not rejected) with the treatment given. The classic null hypothesis is that there is no difference in the variable of interest between the datasets analyzed (control group vs treatment group). This setup is a superiority test, in which we expect a superior outcome in the treatment group, i.e. a positive change in its variable of interest (e.g. means, proportions), in order to proceed with implementing the treatment.
An alternative to this is the non-inferiority test, in which we have reason to implement the tested variant as long as it is not substantially worse than the control. The null hypothesis in this test is something along the lines of “the variable of interest in the variant is X% worse than the control, or more”. In this test, we can proceed with implementing the treatment even if it performs somewhat worse than the control, as long as it stays within the “margin of caring”.
This non-inferiority test can be useful for changes that are expected to have some negative impact (e.g. testing the impact of removing a feature on booking conversion) or for secondary metrics in an experiment that we can accept decreasing up to a certain threshold in exchange for an increase in the primary metric.
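As a rough sketch of what a non-inferiority check on proportions can look like, here is a one-sided z-test against a non-inferiority margin. The counts, sample sizes, and the 2-percentage-point margin are hypothetical assumptions, not data from this article.

```python
# A minimal sketch of a one-sided z-test for non-inferiority of two proportions.
# All counts and the margin are assumptions for illustration only.
import numpy as np
from scipy.stats import norm

margin = 0.02                          # "margin of caring": at most 2pp worse is acceptable
conv_control, n_control = 1_050, 10_000
conv_treat, n_treat = 1_020, 10_000

p_c = conv_control / n_control
p_t = conv_treat / n_treat

# H0: p_t - p_c <= -margin  (treatment is worse than the margin allows)
# H1: p_t - p_c >  -margin  (treatment is non-inferior)
se = np.sqrt(p_t * (1 - p_t) / n_treat + p_c * (1 - p_c) / n_control)
z = (p_t - p_c + margin) / se
p_value = 1 - norm.cdf(z)

print(f"z = {z:.2f}, one-sided p-value = {p_value:.4f}")
# A small p-value rejects H0, i.e. the treatment is non-inferior to the control.
```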
Contamination
The commonly used hypothesis tests, the z-test and the t-test, run under the assumption that the data are independently sampled from a normal distribution. While in most cases this can be fulfilled by ensuring randomized, non-duplicate assignments, it can be tricky in some cases.
For example, consider experimenting with delivery pricing in an on-demand delivery app. Though the treatment is isolated to selected users, there might be some impact on the non-treatment group as well, as the delivery fleet is shared across the area (instead of per customer). This is called contamination, or a network effect, in which different treatments of an experiment interfere with each other.
One common solution is to utilize a “switchback experiment”. In this setup, all users in the experiment are exposed to the same experience at any given time, and randomization happens on a time interval and region (or another granularity at which the treatment effect can be isolated). The metrics of interest are then averaged across time intervals.
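A minimal sketch of what generating a switchback randomization schedule could look like is shown below; the regions, window length, and date range are hypothetical assumptions.

```python
# A minimal sketch of a switchback randomization schedule: each (region, time
# window) unit is independently assigned to control or treatment. Regions,
# window length, and dates are assumptions for illustration only.
import itertools
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
regions = ["region_a", "region_b", "region_c"]                    # hypothetical regions
windows = pd.date_range("2024-01-01", periods=7 * 6, freq="4H")   # one week of 4-hour windows

schedule = pd.DataFrame(
    list(itertools.product(regions, windows)),
    columns=["region", "window_start"],
)
schedule["variant"] = rng.choice(["control", "treatment"], size=len(schedule))

print(schedule.head())
# Metrics (e.g. conversion per unit) are later computed per (region, window)
# and averaged within each variant to estimate the treatment effect.
```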
Multiple comparison problem
The multiple-comparison problem is a well-known issue in statistics. It occurs when one considers a set of statistical inferences simultaneously or infers a subset of parameters selected based on the observed values.
For example, suppose we’re experimenting with a new UI page (treatment) against the old UI page (control) of an e-commerce platform. Instead of testing mainly for the impact on booking conversion, we also check numerous other (not-so-relevant) metrics like search-bar clicks, per-category clicks, session duration, coupon usage rate, and so on. As more attributes are compared, it becomes increasingly likely that the treatment and control groups will appear to differ on at least one attribute due to random sampling error alone.
To control this problem statistically, there are approaches such as the Bonferroni correction, which lowers the p-value threshold needed to call a result significant; a small sketch is shown below.
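Here is a minimal sketch of applying the Bonferroni correction with statsmodels; the p-values are made-up assumptions for illustration.

```python
# A minimal sketch of the Bonferroni correction across several compared metrics.
# The p-values are made up for illustration only.
from statsmodels.stats.multitest import multipletests

p_values = [0.012, 0.040, 0.300, 0.008, 0.049]   # one raw p-value per metric

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

for p_raw, p_adj, significant in zip(p_values, p_adjusted, reject):
    print(f"raw p = {p_raw:.3f}  adjusted p = {p_adj:.3f}  significant = {significant}")
# With 5 comparisons, each raw p-value is effectively tested against 0.05 / 5 = 0.01.
```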
Taking it to the next level
The common pitfalls above aside, product experiment results can still end up less than ideal or reliable. There are some further caveats to consider and keep in mind when analyzing the experiment results.
- Novelty effect. When changes are introduced in a product, users are typically curious to explore more, and hence drive a change in business metrics. However, this effect is temporary, as the interest normalizes after a while once the change becomes less novel. With this in mind, it is often a good idea to establish a “burn-in period” and ignore the data collected in the initial period of the experiment (see the sketch after this list).
- Consider seasonality. Some product/feature usage follows a seasonal lifecycle that can impact experiments. For example, an entertainment site might see much higher traffic on weekends than on weekdays. When running an experiment on such a product, we should try to cover both weekends and weekdays to get a holistic estimate of the treatment impact.
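Below is a minimal sketch of how a burn-in period and full-week coverage could be applied when analyzing results; the column names, dates, and burn-in length are hypothetical assumptions, not from this article.

```python
# A minimal sketch: drop the burn-in period and keep only whole weeks before
# computing experiment metrics. Column names, dates, and the burn-in length
# are assumptions for illustration only.
import pandas as pd

events = pd.DataFrame({
    "event_date": pd.date_range("2024-03-01", periods=24, freq="D"),
    "variant": ["control", "treatment"] * 12,
    "converted": [0, 1] * 12,
})

experiment_start = pd.Timestamp("2024-03-01")
burn_in_days = 3                                       # skip the novelty-effect window
analysis_start = experiment_start + pd.Timedelta(days=burn_in_days)
analysis_end = analysis_start + pd.Timedelta(days=14)  # whole weeks, balancing weekdays/weekends

mask = (events["event_date"] >= analysis_start) & (events["event_date"] < analysis_end)
print(events.loc[mask].groupby("variant")["converted"].mean())
```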
There is no single perfect, easy way to run a product experiment, as it varies with the population and treatment context. Also, not every real-world impact can be easily quantified. But still, product experiments done in a statistically sound way can help bring scientific reasoning to business decisions. They work hand-in-hand with user research and the product/business sense ingrained in the product domain experts: you as the PM or Data Analyst.