Statistics can seem daunting to a designer with a creative background, but I’ve found you don’t need to be an expert statistician to accurately analyse test results if you follow a few guidelines and ask some honest questions.
Here’s the first question I try to ask when interpreting test results – is there enough data to support the findings of the test? A good experiment needs to include a representative sample of users in order to be sufficient, with a baseline minimum of 100 conversions regardless of the number of users. Note that the required sample size doesn’t scale with traffic: a website with 1,000,000 visitors per day will simply reach a sufficient sample far sooner than a website with 1,000 visitors per day.
You can estimate a workable sample size using a sample size calculator or duration calculator. Bear in mind, however, that these calculators ask you to assume an effect size up front, which risks building your expectations into the test, when the objective of A/B testing is to let the data tell you the outcome.
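As a rough illustration of what those calculators do under the hood, here is a minimal sketch of the standard two-proportion sample size formula. The function name and default values are my own assumptions, not taken from any particular tool, and it uses only the Python standard library:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect an absolute
    lift over a baseline conversion rate, using the normal approximation
    for a two-sided, two-proportion test."""
    p1, p2 = baseline, baseline + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. ~1.96 for 95% confidence
    z_power = NormalDist().inv_cdf(power)          # e.g. ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / lift ** 2)

# Detecting a 1% absolute lift over a 5% baseline needs roughly
# 8,000+ visitors per variant:
print(sample_size_per_variant(0.05, 0.01))
```

Notice that the smaller the lift you want to detect, the more visitors you need – which is exactly why assuming an effect size up front shapes the test before it starts.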
Are there variances over time in a test that might skew the results? Are there consistent patterns of data that support the findings? If there is a sudden extreme lift or drop in a variation that dramatically changes the outcome of an experiment, it’s worth investigating the cause and determining whether this has skewed the test results. If the conversion rate over the duration of your test is relatively constant, it can be assumed that the data is consistent.
Is there enough data that falls outside the natural variance of a test? Natural variance – the random fluctuation you would see even with no real change – can easily be ±2% of the conversion rate and should be factored into test results. For example, if a test with sufficient and consistent data shows only a 0.01% lift, it can be assumed that this lift falls within the natural variance of the test and a negative outcome should be recorded. If the test was sound but other factors caused the negative outcome, it may be worth setting up a new test, influenced by the outcome of the original experiment.
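One way to check whether an observed lift sits outside random noise is a two-proportion z-test. Here is a minimal, stdlib-only sketch (the function name and the example numbers are mine, purely for illustration):

```python
import math
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two observed
    conversion rates (normal approximation, pooled standard error)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A clear lift (5.0% -> 6.5%) on 2,000 visitors each comes out significant...
print(two_proportion_p_value(100, 2000, 130, 2000))  # ~0.04
# ...while a tiny lift is indistinguishable from natural variance:
print(two_proportion_p_value(100, 2000, 101, 2000))  # ~0.9
```

A p-value below your chosen threshold (commonly 0.05) suggests the lift is unlikely to be pure chance; a large p-value means the result is well within the noise.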
Statistical power is the probability that a test will detect a real effect rather than miss it – the likelihood that the test correctly rejects the null hypothesis when your hypothesis is true. For example, an experiment with 80% power has an 80% chance of producing a significant result if the effect it is looking for genuinely exists.
Experiments are often run at a 95% confidence level, but this can vary from 50% to 99%, depending on the framework of the testing situation. Choosing a confidence level is ultimately a judgment call when analysing test results, as it answers the question “how much certainty do you need before acting on these test results?”. There are excellent calculators available to help you find the statistical power or significance of an A/B test.
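Those calculators typically rely on the same normal approximation. As a hedged sketch of how power might be estimated for a planned effect size and traffic level (names and defaults below are my assumptions, not any specific tool’s API):

```python
import math
from statistics import NormalDist

def approximate_power(p1, p2, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided, two-proportion test: the
    chance of a significant result if the true rates are p1 and p2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = math.sqrt(p1 * (1 - p1) / n_per_variant
                   + p2 * (1 - p2) / n_per_variant)
    z_effect = abs(p2 - p1) / se
    return NormalDist().cdf(z_effect - z_alpha)

# More traffic per variant means more power to detect the same effect:
print(approximate_power(0.05, 0.06, 2000))   # under-powered
print(approximate_power(0.05, 0.06, 10000))  # well-powered
```

If the power comes out low, the honest conclusion is usually “run the test longer”, not “the variation doesn’t work”.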
With a statistical power of 80% and a confidence level of 95%, we can make this statement:
“If this variation genuinely performs differently, a test with 80% power had an 80% chance of detecting that difference; and a result significant at the 95% confidence level has less than a 5% probability of arising from random chance rather than the changes made.”
I wrote this post as a reminder to be analytical in my approach to A/B testing, as it’s too easy to let your own biases cloud your judgment, influence your design decisions, and shape how you interpret the results.
Although most A/B testing and conversion rate optimisation software includes analytical tools to help determine the outcomes of experiments, designers can’t rely completely on these insights to get accurate results and should take other factors into consideration.