I was in a sprint planning meeting with a large online retailer. The team had 23 experiment ideas on the board. Someone pointed at the one about changing the hero image colour and said, “Let’s do that one first, it’ll be quick.”

Nobody pushed back. Quick became the prioritisation framework.

And that’s not an edge case. That’s how a lot of CRO programmes actually run. The loudest voice in the room, or the easiest thing to build, gets the slot. Everything else waits. And because experimentation cycles are long, the order you run things in genuinely matters. Get it wrong and you’re six months in with nothing to show for it.

So let’s talk about how to actually prioritise experiments. Not in theory. In practice.

Why Most Prioritisation Fails Before It Starts

The honest reason most teams don’t prioritise well is that they’re optimising when they should be diagnosing. There’s a difference between having a list of ideas and having a list of hypotheses grounded in evidence. Most backlogs are idea lists with a thin coat of justification painted over the top.

If you don’t know why a problem exists, you can’t rank solutions accurately. You’re just guessing at impact. And any framework you apply to guesses is still guessing.

So before you touch a prioritisation framework, the question to ask about each experiment idea is simple. What is the specific evidence that this problem exists? Not “we think checkout drop-off is high.” Where exactly? For which users? After which action? The more specific the diagnosis, the more honest the prioritisation.

The Frameworks You’ve Heard Of (And Why They’re Incomplete)

PIE and ICE are the two most common prioritisation frameworks in CRO. If you haven’t heard of them, PIE stands for Potential, Importance, Ease, and ICE stands for Impact, Confidence, Ease. You score each experiment on those dimensions, average the scores, and rank your list.
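If it helps to see the mechanics, here’s a minimal sketch of that scoring step. The 1-to-10 scale, the example ideas, and the scores are illustrative assumptions, not a prescribed template.

```python
# Minimal sketch of PIE/ICE-style scoring: score each idea 1-10 per
# dimension, average the scores, and rank the backlog.
# The example ideas and their scores are illustrative assumptions.
from statistics import mean

backlog = [
    {"name": "Hero image colour", "potential": 4, "importance": 3, "ease": 9},
    {"name": "Checkout error copy", "potential": 8, "importance": 9, "ease": 6},
]

for idea in backlog:
    idea["score"] = mean([idea["potential"], idea["importance"], idea["ease"]])

for idea in sorted(backlog, key=lambda i: i["score"], reverse=True):
    print(f"{idea['name']}: {idea['score']:.1f}")
```

Notice the hero image idea still scores a respectable 5.3 off the back of “ease” alone. That’s exactly the trap.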

They’re not bad. They’re just incomplete.

The problem is they score upside without scoring risk. They’ll tell you which experiment looks most promising. But they won’t tell you which experiment could burn three weeks of engineering time and produce uninterpretable results because the test was set up wrong, the sample was too small, or the hypothesis was too vague to falsify.

A test that scores highly on PIE but runs on a page with 200 monthly visitors isn’t worth running. A test with a compelling hypothesis but no clear metric isn’t worth running. A test that touches five elements at once isn’t worth running, not yet anyway, because you won’t know what moved the needle.

Potential and impact scores assume the experiment is viable. Often it isn’t.

What a Proper Prioritisation Framework Actually Scores

Here’s how I think about it. A good prioritisation framework scores experiments across three categories, not two.

First, expected value. This is roughly what PIE and ICE are trying to capture. How much traffic does this page or flow get? How significant is the problem you’re solving? What’s a realistic uplift range based on comparable tests? You want a number here, not a gut feeling. Pull your analytics. If a page gets 50,000 sessions a month and the exit rate is 68%, you have something to work with. If it gets 4,000 sessions and the exit rate is 52%, you probably don’t, at least not right now.
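To make that concrete, here’s the kind of back-of-the-envelope sum I mean. The traffic, conversion rate, and uplift range below are hypothetical; the uplift range in particular should come from comparable past tests, not a hunch.

```python
# Back-of-the-envelope expected value for one experiment idea.
# All inputs are hypothetical placeholders; swap in your own analytics.
monthly_sessions = 50_000
baseline_conversion = 0.032          # current conversion rate on this flow
uplift_range = (0.03, 0.08)          # realistic relative uplift, low to high

baseline_conversions = monthly_sessions * baseline_conversion
low = baseline_conversions * uplift_range[0]
high = baseline_conversions * uplift_range[1]
print(f"Extra conversions per month: roughly {low:.0f} to {high:.0f}")
```

Two numbers, and the conversation stops being about gut feel.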

Second, hypothesis quality. This is the one everyone skips. A hypothesis isn’t just “if we change X, Y will improve.” A testable hypothesis names the problem, names the proposed mechanism, and names a specific measurable outcome. “Users are abandoning the form at the postcode field because the inline error message appears before they’ve finished typing, causing friction. If we delay the validation trigger, form completions will increase.” That’s a hypothesis. “Let’s test a shorter form” is not.

Hypothesis quality matters for prioritisation because vague hypotheses produce uninterpretable results. You can’t learn from a test you don’t understand. And learning is the actual point of experimentation, not just winning.

Third, experimental risk. This is the missing dimension. Risk here means: what are the realistic threats to this test producing clean, actionable data? That includes sample size viability, test duration, the number of variables being changed, whether the metric is actually sensitive enough to detect the change you’re predicting, and whether there are external factors that could contaminate results during the test window.

Score all three categories. Weight them appropriately for your context. Then rank.
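If you want a single ranked number out of that, one simple way is a weighted score where risk pulls the total down rather than up. The 0-to-10 scales and the weights here are placeholder assumptions you’d tune for your own context; this is a sketch of the idea, not the exact formula behind any particular tool.

```python
# One way to fold the three categories into a single ranking score.
# Each category is scored 0-10. The weights are placeholder assumptions;
# risk is subtracted so a risky test falls down the list, not up it.
WEIGHTS = {"expected_value": 0.45, "hypothesis_quality": 0.35, "risk": 0.20}

def priority_score(expected_value, hypothesis_quality, risk):
    return (WEIGHTS["expected_value"] * expected_value
            + WEIGHTS["hypothesis_quality"] * hypothesis_quality
            - WEIGHTS["risk"] * risk)

# Example: strong evidence, decent hypothesis, moderate execution risk.
print(round(priority_score(expected_value=8, hypothesis_quality=7, risk=4), 2))
```

The exact weights matter less than the fact that risk is in the formula at all.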

The Specific Criteria That Separate High-Priority Tests From Low-Priority Ones

Let me give you the actual criteria I use. Not a scorecard template. The specific questions I ask.

Does this test have enough traffic to reach statistical significance in under four weeks? If not, either the test needs to run longer than is practical, or it needs to be deprioritised until traffic increases. A test that needs six months to reach significance isn’t worth running.
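Here’s roughly how I’d sanity-check that, using the standard two-proportion sample-size approximation at 95% confidence and 80% power. The traffic and conversion figures are hypothetical; plug in your own analytics numbers.

```python
# Quick viability check: roughly how many weeks to reach significance?
# Standard two-proportion sample-size approximation at 95% confidence
# (z_alpha = 1.96, two-sided) and 80% power (z_beta = 0.84).
# Traffic and conversion figures are hypothetical.
from math import sqrt, ceil

def weeks_to_significance(weekly_traffic, baseline_rate, relative_uplift,
                          n_variants=2, z_alpha=1.96, z_beta=0.84):
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_uplift)
    p_bar = (p1 + p2) / 2
    n_per_variant = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                      + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
                     / (p2 - p1) ** 2)
    return ceil(n_per_variant * n_variants / weekly_traffic)

# Example: 12,000 sessions a week, 3% baseline, hoping for a 10% relative lift.
print(weeks_to_significance(12_000, 0.03, 0.10), "weeks")
```

With those example numbers the answer comes out at roughly nine weeks, which by the four-week rule above means the test waits.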

Is the primary metric directly influenced by the change being tested? This sounds obvious until you’re in a room where someone wants to measure revenue impact from a change to the returns policy page. This happens more often than you might think. Measure the thing closest to the change. Secondary metrics can tell you a story. Primary metrics need to be tight.

Has this problem been confirmed in at least two data sources? A funnel drop-off in GA means something. A funnel drop-off in GA plus session recordings showing users stopping at the same point means more. Qualitative plus quantitative is a stronger signal than either alone. Tests built on single-source evidence are lower confidence and should score lower.

Is the hypothesis falsifiable? Can you describe what a losing result would look like, and would that result still tell you something useful? Losses are not failures. They’re the point. The reason we experiment is because we don’t know the answer. But a test where a loss teaches you nothing is a wasted test.

How much does this test rely on engineering resources? Not because easy is better. Because resource-heavy tests carry execution risk. If the test needs four weeks of dev time just to build, it’s vulnerable to deprioritisation before it even launches. Factor that in.

What to Do With the Tests That Don’t Make the Cut

Deprioritised doesn’t mean discarded. It just means not yet.

Some ideas belong in a holding area until the traffic exists to support them. Some hypotheses need more diagnostic work before they’re ready to test. Some tests are waiting for a redesign or a platform migration to make them viable.

The point of a prioritisation framework isn’t to kill ideas. It’s to protect your experimentation capacity for the work most likely to generate clear, actionable results. Capacity is finite. Especially for smaller teams running two or three concurrent tests at a time.

The worst thing you can do is run a test you already know is underpowered or under-defined, watch it produce inconclusive results, and then have a senior stakeholder use that inconclusive result as evidence that “CRO doesn’t work here.” That happens constantly. Good prioritisation protects against it.

Writing Before Layout

One more thing while we’re here. If you’re stacking your backlog with layout tests and barely touching copy, you’re leaving results on the table. Writing changes (headlines, CTAs, value propositions, error messages) regularly outperform layout changes in controlled tests. There’s less drama involved in writing a new headline than in redesigning a page section, which is probably why it doesn’t get the attention. But the data across 14 years of running experiments tells a consistent story. Start with what the page says before you change where things sit.

Score Your Experiments Before You Commit to Running Them

Everything above is the thinking. But thinking doesn’t scale when you’ve got a backlog of 30 ideas and a sprint starting Monday.

That’s exactly why we built the Risk Ranker™. It’s a free prioritisation tool that walks you through the key criteria, scores each experiment idea across expected value, hypothesis quality, and experimental risk, and gives you a ranked output so you always know what to run next. No spreadsheet setup. No arguments about whose gut feeling wins. Just a clear score with the reasoning behind it.

If you’ve got a backlog sitting on a sticky note or in a Notion doc and you’re not sure where to start, this is where to start.

