A client sent me a screenshot once. Their testing tool had a big green banner. “Significant at 95%.” As you can imagine, they were pretty happy – after all, roughly 70% of all experiments are inconclusive. Anyway, they wanted to ship the variant immediately. Before they did that, I asked what the actual revenue difference was between control and variant. They didn’t know. The tool said “significant”, so they assumed it meant good.

Statistical significance in A/B testing is one of those concepts that sounds technical. When I’m delivering workshops, even saying it out loud (20+ times) becomes a tongue twister. Not my favourite. But at its core, it’s not too complicated once you strip away the jargon. And it gets misused constantly, by people who understand it and people who don’t.

What it actually means

Statistical significance is the probability that the difference you’re seeing between control and variant is not down to chance.

That’s it. When a test reaches 95% statistical significance, you’re saying there’s a 5% probability this result is random noise. You’re not saying the variant is better. You’re not saying you should ship it. You’re saying you’re reasonably confident the difference is real.

The term for that 5% is the p-value. Strictly, a p-value of 0.05 means that if there were no real difference at all, you’d only see a gap this large about 1 time in 20; the shorthand most of us use is that there’s a 1 in 20 chance you’re looking at a fluke. People flip this in their heads and think 95% significance means a 95% chance of success. It doesn’t. It means a 95% chance the result didn’t happen by accident. Whether the result is worth acting on is an entirely separate question.

Think of it like a court case. “Beyond reasonable doubt” doesn’t tell anyone what the right outcome is for the victim’s family. It just tells you the evidence cleared a threshold. What happens after that requires judgment.
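If you’re curious where that number comes from, here’s a minimal sketch of the kind of calculation sitting behind the green banner. I’m assuming a standard two-proportion z-test and made-up traffic figures; the exact method varies by platform, so treat this as an illustration rather than what your tool actually does.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                 # pooled rate, assuming no real difference
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error of the difference
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))                # two-sided p-value

# Hypothetical test: 10,000 visitors per arm, 500 vs 550 conversions
p = two_proportion_p_value(500, 10_000, 550, 10_000)
print(f"p-value: {p:.3f}")   # roughly 0.11, so this one wouldn't clear the 0.05 bar
```

A 10% relative lift, and it still doesn’t clear the threshold, because with this much traffic the gap could plausibly be noise. The p-value is a statement about evidence, not about value.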

Where 95% came from

Here’s the part most people don’t know or were never told. The 95% threshold wasn’t designed for website experiments. In fact, it wasn’t designed for business decisions at all. It came from Ronald Fisher, a British statistician working in agricultural science in the 1920s. Fisher was trying to determine whether fertilisers made a measurable difference to crop yields. He needed a threshold that would help him distinguish real effects from natural variation in soil and weather. He landed on 0.05, effectively 95% confidence, as a working convention. Something to orient around. He never claimed it was a universal law.

Fisher himself wrote that no scientific worker has a fixed level of significance at which they reject hypotheses year after year. He was describing a useful rule of thumb for a specific context. Decades later, the academic and then commercial research world picked it up, formalised it, and eventually the A/B testing software industry embedded it as the default. Now teams running two-week tests on checkout buttons are held to the same threshold a man invented for watching wheat grow.

That’s not a conspiracy or a scandal. It’s just how norms travel. Someone credible does something for a good reason, someone else copies it without reading the footnotes, and eventually it becomes “the standard” because enough tools and enough teams repeat it often enough that nobody questions it anymore.

Why treating it as a binary pass/fail is a problem

The damage comes when teams treat 95% as a switch. Below it, the test failed. Above it, ship it. That framing collapses a lot of important information.

I’ve seen tests reach 97% significance with a 0.3% lift in conversion rate. Technically significant, practically meaningless. The variant would not move the business. I’ve also seen tests that stalled at 88% confidence with a consistent 12% lift across four weeks and a directional story that matched everything else we knew about that user segment. The tool said not yet. The context said go.

This is the gap between statistical significance and practical significance, and it’s a gap the tool will never close for you. Statistical significance tells you the result is real. Practical significance asks whether the result matters. A small business running low traffic volumes might never reach 95% on anything meaningful. Are they supposed to make no decisions? A large retailer might hit 95% on a change that adds one extra order per thousand visitors. Is that worth the engineering resource to ship?

The number alone can’t answer either question. Someone who understands the business has to. 
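To put rough shape on that retailer example, here’s the back-of-envelope arithmetic I’d want to see before shipping. Every figure below is an assumption I’ve made up for illustration, not data from a real programme.

```python
# Illustrative arithmetic only; every figure here is an assumption.
monthly_visitors   = 500_000
extra_orders_per_k = 1          # one extra order per 1,000 visitors
avg_order_value    = 45.0       # assumed average order value in GBP
build_cost         = 12_000     # assumed engineering cost to ship and maintain

extra_orders  = monthly_visitors / 1_000 * extra_orders_per_k   # 500 extra orders per month
extra_revenue = extra_orders * avg_order_value                  # £22,500 per month

print(f"Extra revenue per month: £{extra_revenue:,.0f}")
print(f"Months to pay back the build: {build_cost / extra_revenue:.1f}")
```

With those made-up numbers the change pays for itself within a month. Cut the traffic by a factor of ten or triple the build cost and the same “significant” result stops being worth shipping. That’s the judgment the significance score can’t make for you.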

The case for keeping humans in the loop

There’s a version of CRO where the tools make all the calls. Test hits 95%, variant wins, it ships automatically. Some platforms market this as efficiency. But I really think it’s a way of making bad decisions faster.

A test result sits inside a context that no tool can fully read.

Is the variant consistent with where the brand is heading?
Does the winning copy conflict with a campaign launching in six weeks?
Did the test period overlap with a promotional event that inflated conversion rates across the board?
Does the uplift hold across all device types or only on desktop where traffic is declining anyway?

These aren’t edge cases. They come up on almost every programme I’ve worked on. The statistical output is an input to the decision. It’s not the decision. Teams that are new to testing often look at the results and think: green = good, red = bad, grey = a waste of time (so let’s pretend that experiment didn’t happen). But I always say that the traffic light signalling most experimentation tools use is misleading. Whether the tool says you have a winner, a loser or an inconclusive result, your wider context should be brought to the table to determine what needs to happen next.

Fisher’s original threshold was a tool for a scientist who still made a judgment call after seeing the number. The p-value was never meant to replace the scientist. Somewhere along the way, the CRO industry decided it could.

What a more useful interpretation looks like

This doesn’t mean ignore significance levels. It means read them with more precision. 75% confidence on a low-risk cosmetic change with a consistent directional uplift across two weeks is probably fine to act on. 94% confidence on a structural checkout redesign with high implementation cost and a history of inconsistent results across segments is probably not enough to ship without more data.

The threshold you use should relate to the risk of the decision. Fisher understood this. He was calibrating confidence to the cost of being wrong. A false positive in agricultural research costs a farming season. A false positive on a homepage hero image costs you a rollback. A false positive on a new checkout flow might cost you a quarter of revenue. The stakes are different. The required confidence should move accordingly.

Some teams set lower thresholds, around 80 to 85%, for low-traffic pages or lower-risk changes. Others require higher confidence, above 95%, for permanent changes to high-value flows. The point is that someone in the team is making an active decision about the threshold based on context, rather than defaulting to 95% because the tool said so.
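One way to make that an active decision rather than a tool default is simply to write the thresholds down. The sketch below is illustrative, not a standard: the bands, names and numbers are assumptions, and every team should set its own based on the cost of being wrong.

```python
# Illustrative risk bands; thresholds are assumptions, not recommendations.
REQUIRED_CONFIDENCE = {
    "cosmetic":   0.80,   # copy tweaks, imagery, cheap to roll back
    "standard":   0.90,   # typical page-level changes
    "structural": 0.95,   # checkout flows, pricing, hard to reverse
}

def ready_to_ship(observed_confidence: float, risk_band: str) -> bool:
    """Compare the tool's confidence figure against the threshold for this level of risk."""
    return observed_confidence >= REQUIRED_CONFIDENCE[risk_band]

print(ready_to_ship(0.94, "cosmetic"))     # True: plenty for a low-risk change
print(ready_to_ship(0.94, "structural"))   # False: same number, higher stakes
```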

Alongside statistical confidence, look at the absolute numbers. What’s the actual difference in conversions per thousand visitors? What’s the revenue impact over a month? What does the result look like by device, by new versus returning users, by traffic source? A single significance score flattens all of that. Digging into it is what turns a test result into a business decision.
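As a sketch of what that digging can look like, here’s a made-up device breakdown for a single test. All the numbers are invented; the point is that a healthy blended result can hide a split that matters.

```python
# Hypothetical per-device results for one test; all numbers invented for illustration.
segments = {
    #           visitors_A, conv_A, visitors_B, conv_B
    "desktop": (40_000, 1_600, 40_000, 1_840),
    "mobile":  (60_000, 1_500, 60_000, 1_470),
}

for device, (n_a, c_a, n_b, c_b) in segments.items():
    rate_a, rate_b = c_a / n_a, c_b / n_b
    lift = (rate_b - rate_a) / rate_a              # relative lift
    per_k = (rate_b - rate_a) * 1_000              # extra conversions per 1,000 visitors
    print(f"{device:8s} {rate_a:.2%} -> {rate_b:.2%}  "
          f"lift {lift:+.1%}  ({per_k:+.1f} conversions per 1,000 visitors)")
```

Desktop carries the whole uplift and mobile actually dips. If desktop is where your traffic is declining and mobile is where growth is coming from, the blended “winner” isn’t the win it looks like.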

What this means for your programme

The next time your tool shows a result, ask two questions before acting:

  1. Is this statistically significant, and what does that actually mean given our traffic levels and the length of this test?
  2. Is this practically significant, and does this lift matter enough to act on given the cost, risk, and context of shipping this change?

If you can answer both clearly, you’re making a decision. If you’re just reading a number off a dashboard, you’re not.

The 95% threshold is a useful reference point. It’s not a verdict. Ronald Fisher never said it was. He was watching wheat, not checkout flows, and even he kept his own judgment in the process.

If you want to get better at assessing the risk of each experiment, our free Risk Ranker tool might be what you need. It’s not only a way to prioritise your backlog; it also lets you factor in the associated risk. Try it out and let me know if it helps.

Kyle Newsam

An optimizer by trade & lifestyle. Truly, any experience or interaction becomes an experiment and something I can learn from. Currently moving around the globe, working from the coolest locations that the younger me could never have imagined.
