Are you sure your A/B test result is statistically credible?
Dec 12, 2024

I wrote this example for my PM-aspiring students (as part of a lecture in a Product Manager course) to show how statistical inference is used in UX research, in this case A/B testing. The goal is not to explain what statistical inference is mathematically, but to introduce the term in the simplest manner possible, and to equip them with a practical tool they can use right away in simple, real-work situations.
The experiment
Say, you manage the UX for an e-commerce website, and you want to test a new design: a “sticky” checkout button (a button that stays visible as users scroll) on product detail pages. The goal is to see if it increases the checkout conversion rate.
Research Question
Does adding a sticky checkout button increase the checkout conversion rate compared to the existing design?
Research Method
The most appropriate method here is A/B testing. You'll divide users into two groups:
Group A (Control): Users see and use the original product page (no sticky button).
Group B (Variant): Users see and use the product page with the sticky checkout button.
And of course you cannot run the experiment on all visiting users (the whole population), which number around 1 million per day for that product detail page. You'll conduct the experiment on only a small number of users (a sample).
Data to Collect
Number of users in each group.
Number of users who completed the checkout process.
Metric to measure the impact
Checkout conversion rate (CR): the number of users who completed checkout divided by the total number of users in the group.
Applying Statistical Methods
Once the experiment is complete, we use descriptive statistics to summarize the data, i.e., calculate the conversion rate (CR) for each group:
| Group | Total users | Conversion rate (CR) |
|---|---|---|
| A (Control) | 5,000 | 8% (400 checkouts) |
| B (Variant) | 5,000 | 10% (500 checkouts) |
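The descriptive step above can be sketched in a few lines of Python (the variable names are illustrative, not from any particular tool):

```python
# Descriptive statistics: compute each group's conversion rate (CR)
# from the collected counts.
users_a, checkouts_a = 5000, 400   # Group A (Control)
users_b, checkouts_b = 5000, 500   # Group B (Variant)

cr_a = checkouts_a / users_a       # 0.08
cr_b = checkouts_b / users_b       # 0.10

print(f"Control CR: {cr_a:.1%}")   # Control CR: 8.0%
print(f"Variant CR: {cr_b:.1%}")   # Variant CR: 10.0%
```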
→ At this point, we can see a difference in performance, but is the difference statistically significant?
In other words, are you sure the difference you observed in this sample reflects a real difference in the entire population?
This is where we use Inferential statistics!
So this is what you will do:
Hypothesis testing
Null Hypothesis (H₀): There is no difference in conversion rate between the Control (A) and Variant (B) groups.
Alternative Hypothesis (H₁): The conversion rate for the Variant (B) is greater than the Control (A).
t-test
This is how you will tell whether you can reject the null hypothesis above.
Since we are comparing two proportions (8% vs 10%) from two independent groups, we use a two-sample test for proportions (strictly speaking a two-proportion z-test, which is why the statistic below is reported as Z; with samples this large, a t-test gives virtually the same result).
If we plug our numbers into any off-the-shelf statistical tool that supports a two-sample test for proportions (e.g., this-tool-here), we get results similar to: Z=3.49 and p=0.0002.
→ We can reject the null hypothesis with statistical confidence, because p=0.0002 is much smaller than 0.05.
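If you'd rather not rely on an online calculator, the same numbers can be reproduced with a hand-rolled two-proportion z-test (one-sided, matching H₁ above) using only Python's standard library. This is a minimal sketch, not production code:

```python
from math import sqrt, erfc

# Two-proportion z-test (one-sided): does the Variant's CR exceed the Control's?
n_a, x_a = 5000, 400   # Control: users, completed checkouts
n_b, x_b = 5000, 500   # Variant: users, completed checkouts

p_a, p_b = x_a / n_a, x_b / n_b
p_pool = (x_a + x_b) / (n_a + n_b)                    # pooled proportion under H0
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error of the difference

z = (p_b - p_a) / se                                  # test statistic
p_value = 0.5 * erfc(z / sqrt(2))                     # one-sided tail P(Z > z)

print(f"Z = {z:.2f}, p = {p_value:.4f}")              # Z = 3.49, p = 0.0002
```

Since p is far below the conventional 0.05 threshold, the observed 2-percentage-point lift is very unlikely to be a sampling fluke.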
And now you can confidently say: The sticky button significantly increased conversion rates! We should implement it site-wide.

I know this might be too concise
I gave this example to provide a general idea of, firstly, how an A/B test is conducted, and secondly, with an emphasis on statistical significance, something that, surprisingly, is often missed by PMs doing A/B testing, even at well-known tech companies in Vietnam.
After getting a rough understanding of A/B testing, I recommend you look deeper into statistical significance yourself (don't worry, there is plenty of great online material that is friendly to newbies like you and me, e.g., this-article) to really understand what it is and why it is needed.
And lastly, I also wrote another article, When the ‘Magic Number 30’ in Sample Size goes wrong. Despite its emphasis on sample size, (I hope) this additional article will give you a deeper understanding of the t-test and statistical significance, so that you can avoid falling into traps caused by a lack of fundamental understanding of statistics.
