The A/B Testing Mistakes That Kill Programmes (With Real Examples)

I’ve reviewed a lot of experimentation programmes, so I get to see the good, the bad and the ugly. Some teams run tests every week and have win rates that look impressive on a slide, but a revenue line that hasn’t moved in two years. The A/B testing mistakes that cause this aren’t always obvious, but they do compound. By the time someone notices the programme isn’t working, they’re six months deep into a problem that may have had a simple fix.

Every item below comes from something I’ve seen happen in a real programme, with a real cost attached.

1. Using revenue as the primary metric when your traffic can’t support it

This is the most common mistake I see in programmes run by teams who should know better. At the same time, I kind of understand it. When you’re setting up an experiment, it feels logical to want to influence revenue, regardless of what you’re testing. But revenue is a noisy metric. It swings with seasonality, with promo activity, with things completely outside your experiment. And to detect a meaningful revenue lift with statistical confidence, you often need sample sizes that would take a year to accumulate at normal traffic volumes. Most teams don’t have that runway. So they run a six-week test, get a noisy result, call it inconclusive, and move on.

The fix is to identify a metric that sits closer to the behaviour you’re trying to change and has enough volume to give you signal faster. Ask: “What is the action or behaviour that this experiment is trying to influence?” Revenue can sit further down the metric chain as a guardrail. But if it’s your primary success metric on a site doing 50,000 sessions a month, you’re building your programme on a foundation that will never give you clean answers.

This is what I’d call the optimiser ignorance problem. It’s not that teams are being lazy. They’re measuring what feels important, rather than what’s measurable at their traffic volume. The result is a programme that looks active but produces almost no reliable learning.
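To make the traffic problem concrete, here is a rough Python sketch comparing the sample size needed for a binary behavioural metric (such as add-to-cart rate) against revenue per session. It uses the standard normal-approximation formulas for a two-sided test at 5% significance and 80% power. The 50,000 sessions a month comes from the example above; every other input (the 10% baseline rate, the 2.00 mean and 20.00 standard deviation for revenue per session, the 5% relative lift) is an illustrative assumption, not data from a real programme.

```python
from math import ceil
from statistics import NormalDist


def z_scores(alpha=0.05, power=0.8):
    """Critical values for a two-sided test at the given alpha and power."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2), nd.inv_cdf(power)


def n_per_arm_proportion(baseline, relative_lift, alpha=0.05, power=0.8):
    """Visitors per arm to detect a relative lift in a conversion rate."""
    z_a, z_b = z_scores(alpha, power)
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_a + z_b) ** 2 * variance / (p2 - p1) ** 2)


def n_per_arm_mean(mean, std_dev, relative_lift, alpha=0.05, power=0.8):
    """Visitors per arm to detect a relative lift in a continuous metric."""
    z_a, z_b = z_scores(alpha, power)
    delta = mean * relative_lift
    return ceil(2 * (z_a + z_b) ** 2 * std_dev ** 2 / delta ** 2)


def months_to_run(n_per_arm, monthly_sessions, arms=2):
    """Months of traffic needed if all sessions enter the test."""
    return n_per_arm * arms / monthly_sessions


if __name__ == "__main__":
    monthly_sessions = 50_000  # figure from the article
    # Assumed inputs, for illustration only.
    n_cart = n_per_arm_proportion(baseline=0.10, relative_lift=0.05)
    n_rev = n_per_arm_mean(mean=2.00, std_dev=20.00, relative_lift=0.05)
    for label, n in [("add-to-cart rate", n_cart), ("revenue per session", n_rev)]:
        months = months_to_run(n, monthly_sessions)
        print(f"{label}: {n:,} per arm, about {months:.1f} months of traffic")
```

On these assumed inputs the revenue metric needs roughly ten times the sample of the behavioural one, which is the difference between a test that finishes in a couple of months and one that needs about two years. Your own numbers will differ, but the shape of the gap usually doesn’t.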

2. Testing the right page for the wrong bottleneck

I worked with a large online university when I was at Jellyfish agency. The brief was to improve RFI (request-for-information) form completion rates. We ran tests. Some of them worked. More people completed the form. And then we looked downstream and realised we’d funnelled a larger volume of prospective students into an advisor follow-up process that was already overwhelmed. Response times got worse. Conversion from form completion to enrolled student barely moved.

Although it was my job, optimising the form was the wrong intervention. The bottleneck wasn’t the form. It was the process that started once the form was submitted.

On the surface, it looked like a good outcome and my experiments were a success, but as soon as I zoomed out, the bigger problem was clear. The business would see far more benefit from fixing the real bottleneck that was impacting enrolments and revenue.

You can produce positive test results at the page level while the actual constraint sits somewhere the test can’t see. It’s very underrated but…

Before you run anything, map out the whole customer funnel with real drop-off data at every stage. Your next test should target the biggest drop. Not the most testable page.
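As a minimal sketch of that mapping exercise, the snippet below takes stage-by-stage counts and reports the step with the largest drop-off. The stage names and numbers are hypothetical, loosely modelled on the university example, and ranking by drop-off rate (rather than absolute volume lost) is my assumption about what “biggest” means here.

```python
def stage_drops(funnel):
    """Return (from_stage, to_stage, drop_rate) for each adjacent pair.

    `funnel` is an ordered list of (stage_name, count) tuples.
    """
    drops = []
    for (name_a, count_a), (name_b, count_b) in zip(funnel, funnel[1:]):
        drop_rate = 1 - count_b / count_a if count_a else 0.0
        drops.append((name_a, name_b, drop_rate))
    return drops


def biggest_drop(funnel):
    """The transition losing the largest share of people who reached it."""
    return max(stage_drops(funnel), key=lambda d: d[2])


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    funnel = [
        ("landing", 10_000),
        ("form_start", 3_000),
        ("form_complete", 1_500),
        ("advisor_contact", 300),
        ("enrolled", 150),
    ]
    for start, end, rate in stage_drops(funnel):
        print(f"{start} -> {end}: {rate:.0%} drop")
    print("Biggest drop:", biggest_drop(funnel)[:2])
```

In a funnel shaped like this one, the worst step sits after the form, in a part of the journey a page-level test never touches.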

3. Stopping tests early because the numbers look good

I remember working with a UAE-based mattress company. I was called in as a consultant, but the relationship didn’t last long. One of the things that surfaced while I was working with them was their need for speed. The CEO didn’t really care about statistical rigour; he just wanted as many tests shipped in the shortest time possible.

At times, that meant shipping a variant that had hit 95% statistical significance on day four. At other times, it meant shipping a variant he liked that was still hovering at 75% statistical significance on day 10. Wild. Rapid experimentation is fine in principle, but testing like this almost guarantees that false positives will be shipped.

Statistical significance thresholds are calibrated for the point at which a test is designed to end, not for the moment you first check the dashboard or see an uplift that you like.

If you’re looking at results daily and stopping when they look positive, you will ship losing variants that happened to be winning on the day you looked.
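A small simulation shows how much damage daily peeking does. The sketch below runs A/A tests, where both arms share the same true conversion rate, so every “significant” result is a false positive. It compares stopping at the first significant daily check against checking once at the planned end. The two-proportion z-test, the 14-day run, 200 visitors per arm per day and the 10% conversion rate are all assumptions chosen for illustration.

```python
import random
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_tests=400, days=14, daily_per_arm=200, rate=0.10,
             alpha=0.05, seed=1):
    """Return (peeking_rate, fixed_rate) of false positives in A/A tests."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        ever_significant = False
        p = 1.0
        for _day in range(days):
            # Both arms draw from the same true rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(daily_per_arm))
            conv_b += sum(rng.random() < rate for _ in range(daily_per_arm))
            n += daily_per_arm
            p = p_value(conv_a, n, conv_b, n)
            if p < alpha:
                ever_significant = True  # a daily peeker would stop and ship here
        peeking_hits += ever_significant
        fixed_hits += p < alpha  # only the final, pre-planned check counts
    return peeking_hits / n_tests, fixed_hits / n_tests


if __name__ == "__main__":
    peeking, fixed = simulate()
    print(f"False positives when stopping at first significant peek: {peeking:.1%}")
    print(f"False positives when checking once at the end: {fixed:.1%}")
```

Any test that is significant on the final day also counts under peeking, so the peeking rate can never be lower than the fixed-horizon rate. In runs like this it typically lands well above the nominal 5%.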

The fix is pre-registration. Decide your sample size before the test starts. Decide your minimum detectable effect. Set a run time based on those numbers, not based on how quickly the results look encouraging. Then don’t stop early. The discipline of not peeking is one of the things that separates programmes with a reliable win rate from programmes where the “wins” don’t replicate in real life.
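One way to make pre-registration hard to ignore is to write the plan down as data before launch and let it decide when the test can be called. The sketch below is an assumed structure, not a standard: the `ExperimentPlan` class, its field names and the example numbers are all hypothetical, and rounding the run time up to whole weeks is my own addition to cover weekly traffic cycles.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from math import ceil


@dataclass(frozen=True)  # frozen: the plan cannot be edited after launch
class ExperimentPlan:
    name: str
    primary_metric: str
    minimum_detectable_effect: float  # relative lift, e.g. 0.05 for 5%
    sample_size_per_arm: int
    daily_visitors_per_arm: int
    start: date

    @property
    def run_days(self):
        """Days needed to reach the sample size, rounded up to whole weeks."""
        days = ceil(self.sample_size_per_arm / self.daily_visitors_per_arm)
        return ceil(days / 7) * 7

    @property
    def end(self):
        return self.start + timedelta(days=self.run_days)

    def ready_to_call(self, today, visitors_per_arm):
        """True only once both the planned end date and sample size are reached."""
        return today >= self.end and visitors_per_arm >= self.sample_size_per_arm


if __name__ == "__main__":
    # Hypothetical plan for illustration only.
    plan = ExperimentPlan(
        name="checkout-trust-signal",
        primary_metric="payment_completion_rate",
        minimum_detectable_effect=0.05,
        sample_size_per_arm=57_760,
        daily_visitors_per_arm=2_000,
        start=date(2024, 1, 1),
    )
    print("Planned run:", plan.run_days, "days, ending", plan.end)
    print("Callable on day four?", plan.ready_to_call(date(2024, 1, 5), 8_000))
```

The point of the design is that “can we stop yet?” gets answered by the plan, not by whoever is looking at the dashboard that morning.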

4. Not setting up the right metrics before the test starts

I ran a metric mapping workshop with the Levi’s team. The starting point was that they were tracking macro conversions, and that was essentially it. Add to cart, purchase. Nothing in between. So when a test produced a lift in add-to-cart but not in purchase, there was no way to understand why. Was checkout abandonment increasing? Was there a payment issue? Were people adding and then hitting a friction point in size selection? Nobody knew, because nobody was measuring it.

Metric mapping is the process of identifying every meaningful micro-conversion between landing and goal completion, and making sure all of them are tracked within your experiment before it goes live. Metric mapping takes a few hours max if you do it properly. Without it, you end up with test results that tell you what happened but not why, which means you can’t build on them.

The output of good metric mapping isn’t just better analysis and context. It can add a diagnostic layer that tells you where to test next.
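At its simplest, a metric map is a list of funnel steps checked against what the experiment is actually tracking. The sketch below illustrates that gap check. The step names are hypothetical, based on the friction points mentioned above (size selection, checkout, payment), and are not the real event names from any client setup.

```python
def untracked_steps(funnel_steps, tracked_events):
    """Funnel steps, in order, that have no tracked event behind them."""
    return [step for step in funnel_steps if step not in tracked_events]


if __name__ == "__main__":
    # Hypothetical funnel between landing and purchase.
    funnel_steps = [
        "product_view",
        "size_selected",
        "add_to_cart",
        "checkout_started",
        "payment_submitted",
        "purchase",
    ]
    # Macro conversions only, as in the starting point described above.
    tracked_events = {"add_to_cart", "purchase"}

    gaps = untracked_steps(funnel_steps, tracked_events)
    if gaps:
        print("Not ready to launch. Untracked steps:", ", ".join(gaps))
    else:
        print("Every mapped step is tracked.")
```

Run before launch, a check like this turns “nobody was measuring it” into a blocker rather than a post-mortem finding.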

5. Running tests without a real hypothesis

A hypothesis is not a description of a change. “We will add a trust badge to the checkout page” is not a hypothesis. It’s a plan.

A hypothesis is a falsifiable prediction with a mechanism. It sounds like this… “Visitors are abandoning at payment because they’re uncertain about data security. Adding a visible trust signal at the point of card entry will reduce that uncertainty and increase payment completion.” That’s testable. It names the problem, the intervention, the mechanism, and the expected outcome. If it loses, you’ve learned something about whether that specific friction was real. If it wins, you understand why, and you can apply that learning elsewhere.

Most teams skip this because it takes longer than just queueing up the test – especially if a lot of your testing pipeline is fed to you sporadically by leadership. But tests without hypotheses produce results you can’t learn from. You’ll know a variant won, but you won’t know why, and you can’t replicate or extend it.
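If your test ideas arrive as one-line requests, a lightweight template can flag what is missing before anything gets queued. This is a sketch under my own assumptions: the `Hypothesis` class and its field names are hypothetical, taken from the four parts described above (problem, intervention, mechanism, expected outcome).

```python
from dataclasses import dataclass, fields


@dataclass
class Hypothesis:
    problem: str
    intervention: str
    mechanism: str
    expected_outcome: str

    def missing_parts(self):
        """Names of any parts left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


if __name__ == "__main__":
    # A plan, not a hypothesis: only the change is described.
    plan_only = Hypothesis(
        problem="",
        intervention="Add a trust badge to the checkout page",
        mechanism="",
        expected_outcome="",
    )
    print("Missing:", plan_only.missing_parts())

    # The full example from the text above.
    full = Hypothesis(
        problem="Visitors abandon at payment because they are uncertain about data security",
        intervention="Add a visible trust signal at the point of card entry",
        mechanism="The signal reduces uncertainty about data security",
        expected_outcome="Payment completion increases",
    )
    print("Missing:", full.missing_parts())
```

A filled-in template doesn’t prove the hypothesis is a good one, but an empty mechanism field is a reliable sign that you’re about to test a plan.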

6. Treating a losing test as a failure

The Wine Society is an example I keep coming back to. They ran an experiment where a variant lost, significantly. And instead of writing it off or feeling like something had gone wrong, we quantified what would have happened if that variant had been shipped without testing it first. Based on the traffic volume, the conversion impact, and the average order value, shipping without testing would have cost a significant amount of revenue over the following months.

Instead of trying to hide a losing experiment, they were able to show the value of experimentation in a more effective way. That number landed differently with leadership than any winning test result had, because it reframed the programme. Experimentation isn’t just a growth lever. It’s risk management. You’re not just finding things that work. You’re catching things that don’t work before they go live permanently.

A losing test is cost avoidance. Quantify it. Put it in your reporting. It changes how stakeholders think about the value of the programme.
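The arithmetic behind a cost-avoidance figure is simple enough to sketch. The function below multiplies the traffic, the conversion drop and the average order value over a chosen horizon. All the numbers in the example are hypothetical; they are not The Wine Society’s figures, and the six-month horizon is an assumption you would need to agree with stakeholders.

```python
def avoided_loss(monthly_visitors, baseline_rate, variant_rate,
                 average_order_value, months):
    """Revenue that would have been lost by shipping the losing variant untested."""
    lost_orders_per_month = monthly_visitors * (baseline_rate - variant_rate)
    return lost_orders_per_month * average_order_value * months


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    loss = avoided_loss(
        monthly_visitors=100_000,
        baseline_rate=0.030,   # control conversion rate
        variant_rate=0.027,    # losing variant conversion rate
        average_order_value=80.0,
        months=6,
    )
    print(f"Estimated revenue protected by testing first: {loss:,.0f}")
```

This treats the measured drop as if it would have held steady for the whole period, which is a simplification. If you report a figure like this, state the horizon and the assumption next to it.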

7. No governance, so results reach nobody

I was at Samsung at a point where the global experimentation team were genuinely capable. Tests were well-designed. Their analysis was solid. But results were only being shared within the immediate team and going nowhere. If someone left or moved to a different project, the insight left with them. Other teams making product and design decisions had no idea experiments had been run that were directly relevant to what they were building. And as a global business, experimentation teams in other regions also had no visibility on what was or wasn’t working.

Governance isn’t the sexy part of experimentation. But without it, a CRO programme is a closed loop. Learning doesn’t compound, because the organisation isn’t ingesting it. You need a results repository that isn’t just a spreadsheet someone forgot to update. You need a defined process for surfacing relevant results to adjacent teams. And you need someone whose job it is to make sure that happens.

The teams that build compounding programmes are the ones where insight travels beyond the silo (…the silo that everyone promised wouldn’t exist). Where a losing test on one team informs a decision by another team six months later. That only happens if there’s infrastructure for it.

8. Building elaborate stakeholder programmes that nobody reads

I’ve seen teams spend a lot of time building beautiful experiment newsletters, insight decks, and monthly readouts that executives ignore. The instinct makes sense. You want to build support for the programme by showing results, and all of the recommendations online say that this is the stuff that works. But the format is probably wrong.

Stakeholders already have meeting rhythms, reporting cadences, and comms channels they actually use. Asking them to engage with a new one is asking them to change behaviour. Most won’t. The more effective approach is to embed experiment results into communications that already exist. QBR meetings… The monthly leadership update… The product review… Get your results into those rooms in the format those rooms use, rather than building a parallel channel and hoping people migrate to it.

9. Jumping to personalisation before diagnosing correctly

I had a client in the Middle East, in the banking space. They came in wanting a personalisation strategy. Actually, they didn’t want a strategy… they just wanted to run experiments for specific segments, test dynamic content, AI-driven recommendations. The works.

When I asked how many A/B tests they’d run in the last twelve months, the number was low: fewer than 10. When I asked what their primary funnel drop-offs were, I didn’t get a clear response. When I asked what hypotheses they’d formed about why drop-offs were even happening, again, there wasn’t a clear answer.

Personalisation at that stage isn’t a strategy. It’s a distraction with a large implementation cost. Personalisation compounds the effects of whatever your baseline experience is. If the baseline experience has undiagnosed conversion problems, you’re personalising those problems at scale.

The same applies to AI tooling in CRO more broadly. I see teams adopting AI-assisted copy generation, AI-powered heatmap analysis, AI experiment planning, before they’ve answered the basic question of where their funnel is actually breaking and why. Speed isn’t the constraint for most CRO programmes, especially in the early stages. Clarity is. AI doesn’t fix a lack of strategic diagnosis. It accelerates in whichever direction you were already pointed, which isn’t helpful if that direction is wrong.

Where to start if you recognise your programme in this list

Most of these mistakes share a root cause. They happen before the test goes live. The metric is wrong before the test starts. The hypothesis is missing before the test starts. The governance structure is absent before the test starts. The funnel diagnosis hasn’t been done before the test starts.

The place to intervene is test design, not test analysis. Honestly, the best experimentation is probably 90% test design. By the time you’re looking at results, most of these problems are already baked in.

Start with your last five tests. Apply the questions in this article to each one.

Did the metric have enough volume?
Was the hypothesis falsifiable?
Was the bottleneck you were targeting actually the biggest one in the funnel?

If the answer to any of those is no, you’re not in a testing problem. You’re in a design problem, and that’s fixable.

Then build the habit of catching these issues before tests go live, not after. That’s exactly what the Experiment Validator is for. It runs through the most common test design mistakes before an experiment goes live, so the problems get caught at the point where catching them actually costs you nothing.

Kyle Newsam
