I’ve reviewed hundreds of experiment briefs across a dozen programmes. The number one thing that separates the good from the bad isn’t their testing tool, traffic volume or analytics setup. It’s whether they can write a proper hypothesis before they build anything.

Most can’t. Not really. What they write looks like a hypothesis. But when you pull at it, it falls apart. And when the experiment ends, win or lose, they don’t know what they actually learned. That’s the problem we’re fixing here.

What a CRO Hypothesis Actually Is

A hypothesis is a falsifiable prediction. That’s the whole thing. It tells you what you expect to happen, why you expect it, and how you’ll know whether you were right.

Think about a doctor ordering a test. Before they run it, they have a theory: the patient has X because of Y, so the result will show Z. They’re not just poking around. They have a specific expectation and the test will either support or contradict it. That’s exactly how experiment hypotheses should work.

In CRO, a hypothesis is not a description of what you’re changing. “We’ll add a trust badge to the checkout page” is not a hypothesis. A hypothesis tells you why that change should work, who it should work for and what you expect to see in the data when it does.

The reason this matters so much is simple. If you don’t write it down before you run the test, you’ll retrofit the explanation after. You’ll look at the results and tell a story that fits the numbers. That story might feel like a learning, but it isn’t.

The Components of a Strong Hypothesis

A good CRO hypothesis has four parts. Not three, not two. Four. Most frameworks only give you three, which is why most hypotheses are incomplete.

The first part is the observation. What have you seen in the data, in user research, or in session recordings that tells you something is wrong? This is not an assumption. It’s evidence. “Checkout abandonment is 74% and heatmaps show almost no one is clicking the security icon” is an observation. “People don’t trust us” is a guess.

The second part is the proposed change. What specifically are you changing, and where? Be precise here. “We’ll make the CTA bigger” is too vague. “We’ll change the primary CTA on the product page from a text link to a high-contrast button above the fold” is a change you could actually brief to a designer without a follow-up call.

The third part is the expected outcome. What metric moves, in what direction, by how much? You don’t need to nail the exact number. But you do need to name the metric. “We expect checkout initiation rate to increase” is the minimum. If you have enough historical data to estimate a range, use it.
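To make “estimate a range” concrete, here’s a minimal sketch of one rough way to do it. The weekly checkout initiation rates below are invented for illustration; the point is that your baseline and its normal week-to-week wobble tell you what counts as a real move rather than noise.

from statistics import mean, stdev

# Hypothetical weekly checkout initiation rates pulled from analytics
weekly_rates = [0.284, 0.291, 0.277, 0.302, 0.288, 0.295, 0.281, 0.299]

baseline = mean(weekly_rates)
wobble = stdev(weekly_rates)

# Anything within roughly two standard deviations of the baseline is
# ordinary week-to-week variation, not a lift worth claiming.
print(f"Baseline: {baseline:.1%}")
print(f"Normal range: {baseline - 2 * wobble:.1%} to {baseline + 2 * wobble:.1%}")

If the lift you’re predicting sits inside that normal range, the hypothesis isn’t predicting anything you could tell apart from noise, and that’s worth knowing before you build.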

The fourth part, and the one most teams skip, is the reasoning. Why do you believe this change will produce this outcome? What is the mechanism? This is where you cite the user behaviour, the research finding, or the cognitive principle that connects the change to the result. Without this, the other three parts are just decoration.

Put those four together and you get a hypothesis that sounds like this: “We’ve observed that 68% of users who reach the pricing page exit without scrolling past the hero section. We believe this is because the value proposition is buried. If we move the three key benefits above the fold and rewrite the headline to lead with outcome rather than feature, we expect time on page to increase and plan upgrade clicks to improve. We’re basing this on five user interviews where participants said they didn’t understand why the paid plan was worth it until they scrolled down.”

That’s a real hypothesis. You can argue with it. You can test it. And when the results come in, you’ll know exactly what you learned regardless of which way it goes.

Weak Hypotheses vs Strong Ones: Real Examples

The fastest way to understand what makes a hypothesis strong is to see the weak version sitting next to it.

Here’s a weak one, the kind I see constantly:

“If we change the button colour, more people will convert.”

There’s no observation behind it. There’s no reasoning. The outcome is vague. And if the test loses, you’ve learned absolutely nothing you can act on. You don’t know if the colour was the wrong choice, if the button placement was the problem, or if colour was never the issue at all.

Here’s the same idea written properly:

“Exit surveys on the product page show 22% of respondents said they weren’t sure what would happen after clicking ‘Submit’. If we change the CTA copy from ‘Submit’ to ‘Start your free trial’ and increase button contrast from a 2.3:1 to a 4.8:1 WCAG ratio, we expect click-through rate to improve because users will have clearer expectations about the next step. This is supported by three usability test participants who hesitated on the button and verbalised uncertainty.”

Same general idea. Completely different level of clarity and usefulness.

Here’s another weak one:

“Adding social proof will increase signups.”

This is so broad it’s almost meaningless. What kind of social proof? Where on the page? What’s the evidence that social proof is even the relevant lever? What if the real problem is that people don’t trust the product itself, not the number of users?

Strong version:

“Scroll depth data shows 80% of users on the signup page don’t reach the testimonial section. We believe moving a single, specific testimonial from a recognisable company to just above the signup form will reduce form abandonment, because the current placement means most users never see any validation before they’re asked to commit. We’re testing this against the control and watching form start rate as the primary metric.”

That second version tells you exactly what to build, what you expect to see, and why. It also gives you a clean learning either way. If it wins, you know visible social proof near the form drives commitment. If it loses, you know something else is blocking the decision, and now you go find out what.

The Most Common Hypothesis Mistakes

The first mistake is writing the hypothesis after the experiment is already designed. This happens more than people admit. The designer mocked something up, the developer already estimated it, and now someone’s asking for a hypothesis to fill in the brief. So the hypothesis gets written backwards, justifying the design rather than predicting the outcome. That’s not a hypothesis. That’s a brief with an alibi.

The second mistake is testing multiple changes under one hypothesis. “We’re redesigning the product page” is not a testable hypothesis. It’s a project. If five things change at once and the test wins, you have no idea which change drove the result. If it loses, you have no idea what to fix. You’ve run an experiment that can only answer one question: did this specific version of the page outperform the original? That’s rarely the question you actually needed to answer.

The third mistake is choosing the wrong primary metric. This usually happens when teams default to the macro conversion because it’s the number the business cares about. But if your test is on a product page and your macro conversion is a purchase that requires three more steps, you might need a sample size of two million sessions to reach significance. Pick the metric that’s closest to the behaviour you’re changing. You can still watch the downstream metrics. But be clear on what the test is actually designed to move.
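To put a number on that, here’s a rough sketch using the standard two-proportion approximation. The 1.5% purchase rate, the 30% checkout initiation rate and the 5% relative lift are invented for illustration, and your testing tool will do this calculation for you, but the shape of the result is the point: the lower the rate of your chosen metric, the more traffic you need to detect the same relative lift.

from statistics import NormalDist

def sessions_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate sessions needed per variant (two-proportion z-test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

# Downstream purchase at 1.5% vs on-page checkout initiation at 30%,
# both chasing the same 5% relative lift.
print(round(sessions_per_variant(0.015, 0.05)))  # roughly 420,000 per variant
print(round(sessions_per_variant(0.30, 0.05)))   # roughly 15,000 per variant

Same change, same hoped-for lift, wildly different traffic bills.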

The fourth mistake is confusing correlation with mechanism. I see this in the reasoning section all the time. “Our top competitors use this design, so it must work.” That’s not a mechanism. It’s an assumption stacked on top of an assumption. Your competitors might be wrong. Or it might work for their audience but not yours. The reasoning in your hypothesis needs to connect your specific observation to your specific proposed change through a named behavioural principle, a piece of user research, or a measurable signal from your own data.

The fifth mistake is not writing down what would falsify the hypothesis. Before you run the test, you should be able to answer: what result would tell you this theory was wrong? If the answer is “nothing, because we’ll learn either way”, you haven’t written a hypothesis. You’ve written an excuse to run the test regardless of what you find.

Why the Hypothesis Is the Foundation of a Clean Experiment

Here’s what I’ve seen happen in programmes that skip proper hypothesis writing. The test runs. Results come in. If it wins, everyone celebrates and ships it. If it loses, there’s a brief moment of “hm” and then the experiment gets archived and nobody looks at it again. No learning gets documented. The same assumption resurfaces six months later under a slightly different design. The same test runs again. Same outcome.

That cycle is expensive. Not just in time and development cost, but in what it does to confidence in the programme. People start to feel like testing doesn’t work. They start pushing for bigger, more dramatic redesigns because the small iterative tests “never move the needle.” And then the programme turns into a series of massive, untestable bets with no mechanism to understand what’s driving the outcome.

A proper hypothesis breaks that cycle. Write one well and every result teaches you something, win or lose. A loss tells you your assumption was wrong, which means your mental model of how your user makes decisions just got updated. That is genuinely valuable. It’s what the experiment was for. We run experiments precisely because we don’t know the answer. The hypothesis is how you extract the learning regardless of which direction the data moves.

It also makes prioritisation cleaner. When you have twenty experiment ideas and you force every one of them through a proper hypothesis structure, the weak ones fall apart immediately. Half your backlog will reveal itself as “we just want to try this” rather than “we have a specific belief based on evidence that we need to validate.” That’s useful information. Those weak ideas don’t disappear, but they get deprioritised until you have the evidence to write a real hypothesis behind them.

A Framework You Can Use Right Now

If you want a repeatable structure, use this. It isn’t originally mine; it’s a synthesis of what actually works across programmes I’ve run and audited, and it’s the version I’ve seen hold up best under pressure.

Start with: “We have observed that…” and fill in the specific data point or qualitative signal. Then: “We believe that if we…” and describe the specific change. Then: “Then we expect…” and name the metric and direction. Then: “Because…” and give the mechanism. Finally: “We will know we were wrong if…” and name the falsification condition.

That last sentence is the one most teams find hardest to write. Good. That difficulty is the point. It forces you to be specific about what the test is actually testing.

Running this structure across your current backlog will surface something interesting. Some of your ideas will have strong observations but weak mechanisms. Some will have clear mechanisms but no data to support the observation. That gap analysis tells you exactly what research or analytics work needs to happen before the test is worth building.
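If you want to run that audit systematically rather than by eye, here’s a minimal sketch of the structure as a checklist. The field names and the example idea are mine, not a standard, and a spreadsheet with one column per component works just as well.

from dataclasses import dataclass, fields

@dataclass
class Hypothesis:
    observation: str       # "We have observed that..."
    change: str            # "We believe that if we..."
    expected_outcome: str  # "Then we expect..."
    mechanism: str         # "Because..."
    falsification: str     # "We will know we were wrong if..."

def missing_parts(h: Hypothesis) -> list[str]:
    """Components left blank, i.e. the gaps to close before the test is built."""
    return [f.name for f in fields(h) if not getattr(h, f.name).strip()]

idea = Hypothesis(
    observation="68% of users exit the pricing page without scrolling past the hero",
    change="move the three key benefits above the fold",
    expected_outcome="plan upgrade clicks increase",
    mechanism="",       # nothing yet connects the change to the outcome
    falsification="",   # no result named that would prove the theory wrong
)
print(missing_parts(idea))  # ['mechanism', 'falsification'] -> research before build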

Where to Start If Your Programme Has Never Done This Properly

Pick one live test or one idea currently in your backlog. Write the full hypothesis using the structure above. Don’t clean it up. Write the messy first version and then look at which of the four components is weakest. That weakness is your diagnostic. Weak observation means you need more data. Weak mechanism means you need user research. Weak outcome definition means you need a conversation with whoever owns the metrics. Weak falsification condition means you need to pressure test the assumption before you build anything.

Do this for five ideas. By the fifth one you’ll have a completely different sense of what’s actually worth testing and what’s just noise dressed up as opportunity.

And if you want to pressure test your hypotheses before they go into development, the Experiment Validator runs your experiment brief through the exact quality criteria that separate clean, learnable tests from wasted build time. Worth running every idea through it before it touches a developer’s sprint.

Kyle Newsam

An optimizer by trade & lifestyle. Truly, any experience or interaction becomes an experiment & something I can learn from. Currently moving around the globe, working from the coolest locations that the younger me could never have imagined.
