I’ve sat on strategy calls with teams who have invested hundreds of thousands into their experimentation tool but have barely scratched the surface of its features. Then one day someone asks “why aren’t we running multi-armed bandit experiments?” and the truth is, they don’t know when or why they should be using them, so they don’t.
It got me thinking about the other types of tests that are easy to take for granted or simply ignore because people don’t know if they should be using them. So here’s a full breakdown of every major experimentation method worth knowing… what each one is actually designed to do, what it costs you in traffic and complexity, and where I’ve seen each one help or hurt in practice. Not textbook definitions.
A/B Tests: The Right Starting Point for Most Teams
An A/B test is the simplest unit of experimentation. You take one element, create two versions (control and variant), split your traffic between them, and measure which performs better against a defined metric.
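To make "measure which performs better" concrete, here's a minimal sketch of one common way to compare two conversion rates: a pooled two-proportion z-test. The function name and the visitor and conversion counts are made up for illustration, and your testing tool may well use a different statistical method under the hood.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for variant B versus control A."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate: the conversion rate if A and B were really the same
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: 10,000 visitors per arm, 5.0% vs 5.6% conversion
z, p = two_proportion_z_test(500, 10_000, 560, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

In this made-up example the variant converts at 5.6% against 5.0%, yet the p-value lands just above the usual 0.05 threshold. A difference that looks obvious on a dashboard can still be within the range of noise.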
When you change one thing, you know what caused the difference. When you change ten things and the variant wins, you’ve learned that the combination worked. You have no idea which element drove it, or whether one element actively hurt performance while others compensated. You’ve generated a result without generating understanding.
I think most teams should probably spend longer in pure A/B testing than they do. Not because it’s easy, but because the discipline of isolating a single variable forces you to get specific about what question you’re actually asking. It can get boring though, because if you don’t find a winner it’s natural to think that testing more things at once will increase your chance of finding a winner. Which in some sense is true – but like I say, it can cause you to get vague about what you’re trying to learn. Getting specific is hard. Skipping to a more complex method before you’ve done the basics is how teams end up with beautiful roadmaps and thin learning.
Where A/B tests are most powerful: headline and copy changes, CTA text, form field labels, single component changes like hero images or trust signals. Anywhere you can articulate one clear hypothesis. I’ve seen more conversion lift come from a changed headline than from any layout redesign. Writing changes outperform visual changes far more often than the industry acknowledges, and A/B testing is the method that makes that visible.
A/B/n Tests: Multiple Variants
An A/B/n test is slightly different. It can look like one of two things:
- Same element, multiple variants, one control. Instead of testing two versions of a headline, you test four.
- Different elements, multiple variants, the original page is the control. Instead of just isolating one element, you test related elements that are tied to your hypothesis.
This sounds efficient. In practice, it comes with a cost that teams underestimate.
Same element, multiple variants
The traffic requirement scales with the number of variants. If you need 5,000 conversions per variant to detect a meaningful effect, running four variants means 20,000 conversions before you have a result you can trust. On most sites, that’s weeks of test time, sometimes months. The longer a test runs, the higher the risk that external factors (seasonality, algorithm changes, competitor moves, etc.) contaminate the result.
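The arithmetic is worth running against your own numbers before you commit to extra variants. Here's a rough sketch. The 5,000 conversions per variant comes from the example above. The 2,500 conversions per week is an assumption I've made up, so swap in your own figure.

```python
import math


def total_conversions_needed(conversions_per_variant, variants):
    """Every variant needs its own full sample, so the total scales linearly."""
    return conversions_per_variant * variants


def weeks_to_finish(conversions_per_variant, variants, weekly_conversions):
    """Round up, since a part-finished week still has to run."""
    total = total_conversions_needed(conversions_per_variant, variants)
    return math.ceil(total / weekly_conversions)


WEEKLY_CONVERSIONS = 2_500  # hypothetical site-wide figure

for variants in (2, 4):
    total = total_conversions_needed(5_000, variants)
    weeks = weeks_to_finish(5_000, variants, WEEKLY_CONVERSIONS)
    print(f"{variants} variants: {total:,} conversions, about {weeks} weeks")
```

At that assumed pace, two variants take about 4 weeks and four variants about 8. That's the real price of the extra variants: twice as long exposed to whatever else changes in the market.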
The benefit of this type of test is that one iteration of the element may prove effective where the others don’t. Had you limited yourself to testing a single variant, you might never have found the insight. Data from Optimizely suggests that programs testing more variants per experiment can see up to 3.5x the impact:
“testing 4+ variations yields 3.5x the expected impact compared to typical A/B tests and drives 27.4% higher uplifts”
Different element, multiple variants
There are times when traffic constraints mean that testing only one element on a page won’t produce an effect large enough to detect. By testing multiple elements together, you may see a larger, more detectable effect for the same traffic volume. The thing to keep in mind here is how you can test multiple elements like this and still learn from the experiment. If the changes are disconnected and have no common hypothesis, any insight will be hard to glean from the results.
An example of how you could test multiple different elements at once and still be able to learn would be if you hypothesised that there was a lack of trust or authority on a certain page so you decided to test the following together:
- Trust focused wording
- Security seals or trust symbols
Although you wouldn’t know what had the biggest impact, you’d still be able to see how strengthening trust indicators performed against the control. There’s a time and place for this kind of testing and I don’t think you need to stick to the ‘purist’ approach by going super granular every time – just my opinion.
A/B/n testing is worth considering when you genuinely don’t have a strong enough hypothesis to commit to a single direction. It’s a hypothesis-generation tool as much as a validation tool. But don’t run four variants because you can’t decide. Make a decision, test it, learn from it. If it loses, the loss tells you something.
Split URL Tests: When You’re Testing Big Ideas
A split URL test doesn’t change an element on a page. It routes users to entirely different URLs. Different page designs, different flows, different structures. This is the right method when the hypothesis is too large for a component test, or when an A/B/n test would actually be more limiting.
If you want to test whether a multi-step checkout outperforms a single-page checkout, you can’t do that by changing a button. The structural change requires a different page, which requires a split URL test. Same logic applies to landing page redesigns, navigation restructures, or testing a new product page template against the existing one.
The risk here is what it’s always been with big changes… if the new version wins, you know the overall direction worked. If it loses, you’re diagnosing a lot of factors and trying to figure out which part let you down. I’d rather test the big idea early in a program to validate strategic direction, and then use A/B tests to optimise within the winning structure. Or if I find myself doing a lot of micro-optimisation, it can be a good way to think bigger and address macro pains. Too many teams run A/B tests on small elements while never challenging the underlying page structure at all. They’re optimising a flawed template rather than questioning whether the template is the problem.
Multivariate Tests: Powerful, Expensive, and Usually Misused
Multivariate testing lets you test multiple elements simultaneously (headline, image, CTA, and trust badge all in one test) and in theory, it measures not just which version of each element performs best, but how the elements interact with each other. That last part, the interaction effects, is where the value is, and also where the complexity explodes.
A full factorial multivariate test with three elements and three versions of each produces 27 combinations. To detect a meaningful difference across 27 combinations at adequate statistical power, you need A LOT of traffic. Most teams don’t have it. And when you try to run a multivariate test on insufficient traffic, you end up with underpowered results across most combinations, and the “winner” you identify (if you get that far) is often noise.
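If you want to see how quickly the combinations pile up, here's a small sketch that enumerates a full factorial design. The element names and version labels are placeholders, and the 5,000 conversions per combination is an illustrative assumption rather than a rule.

```python
from itertools import product

# Hypothetical test plan: three elements, three versions of each
elements = {
    "headline": ["headline_1", "headline_2", "headline_3"],
    "image": ["image_1", "image_2", "image_3"],
    "cta": ["cta_1", "cta_2", "cta_3"],
}

# Full factorial: every version of every element paired with every other
combinations = list(product(*elements.values()))

CONVERSIONS_PER_COMBINATION = 5_000  # assumption, for illustration only
total_needed = len(combinations) * CONVERSIONS_PER_COMBINATION

print(f"{len(combinations)} combinations")
print(f"{total_needed:,} conversions needed at that sample size")
```

Three elements with three versions each gives 27 cells to fill. Add a fourth element with three versions and you're at 81, which is why the traffic question has to come before the setup.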
I’ve seen teams run multivariate tests because a platform feature made it easy to set up. Easy to set up is not the same as appropriate to run. The question to ask before any multivariate test is whether you have a specific hypothesis about interaction effects. Not just “I want to test these three things at once”.
If you genuinely believe that a particular headline only works when paired with a specific image, that’s a hypothesis worth testing as a multivariate experiment. If you just want to move faster, run three sequential A/B tests instead.
I know it probably sounds like I’m against multivariate experiments but I promise you I’m not. It’s just that they’re probably the experiment type that I see teams misuse the most and essentially waste a lot of time on. The teams that I see use it well have the traffic to support it and a specific reason to care about interactions. If that’s not you, sequential A/B testing gives you more learnable results faster.
Multi-Armed Bandits: When Speed Beats Learning
The multi-armed bandit (MAB) is named after the slot machine problem:
If you have multiple slot machines with unknown payouts, how do you allocate your pulls to maximise return while still exploring which machine pays best?
The classic tradeoff is exploration versus exploitation. You need to explore to find the best option, but every pull on a worse machine costs you potential reward.
In CRO terms, a MAB test dynamically allocates traffic based on live performance. Instead of splitting traffic 50/50 until a winner is declared, the algorithm starts shifting more traffic toward the better-performing variant as evidence accumulates. You get more conversions during the test period because you’re not stuck sending equal traffic to a variant that’s clearly losing.
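Here's a minimal sketch of how that traffic shifting can work, using Thompson sampling. That's one common bandit algorithm, not necessarily the one your platform uses. The conversion rates, visitor count, and function name are all made up for the simulation.

```python
import random


def thompson_sampling(true_rates, visitors, seed=0):
    """Simulate a bandit test; return (times_shown, conversions) per variant."""
    rng = random.Random(seed)
    arms = len(true_rates)
    conversions = [0] * arms
    misses = [0] * arms
    times_shown = [0] * arms

    for _ in range(visitors):
        # Draw a plausible conversion rate for each variant from its
        # Beta posterior (starting from a uniform Beta(1, 1) prior)
        draws = [
            rng.betavariate(1 + conversions[i], 1 + misses[i])
            for i in range(arms)
        ]
        # Show the variant whose draw is highest
        arm = draws.index(max(draws))
        times_shown[arm] += 1
        # Simulated visitor behaviour; in production this is real data
        if rng.random() < true_rates[arm]:
            conversions[arm] += 1
        else:
            misses[arm] += 1

    return times_shown, conversions


# Hypothetical rates: control converts at 4%, variant at 8%
shown, converted = thompson_sampling([0.04, 0.08], visitors=5_000)
print("traffic per variant:", shown)
print("conversions per variant:", converted)
```

Early on the draws are wide and both variants get shown. As evidence builds, the weaker variant's draws rarely come out on top, so it gets starved of traffic. That's also why you end up with very little data on the loser.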
Where this beats traditional A/B testing is in short-lived campaigns, promotional periods, or contexts where the cost of showing a losing variant is high and the learning value of understanding exactly why it lost is low. Think of something like Black Friday, where you need to optimise fast and the result doesn’t need to generalise anywhere.
The weakness of running a MAB is that it’s not ideal for broader learning. A MAB can identify a winner faster, but it reaches that conclusion with less statistical rigour than a traditional test, and it doesn’t tell you what drove the result. If you’re running a programme designed to build cumulative understanding, MABs can actually slow you down by generating results that don’t inform the next hypothesis. You win the battle and skip the debrief. Over time, that compounds into a programme that optimises locally without ever understanding why anything works.
Contextual Multi-Armed Bandits: Personalisation at the Experiment Level
A contextual MAB (CMAB) adds user context to the allocation decision. Instead of just tracking which variant is winning overall, it tracks which variant wins for which type of user.
A first-time visitor from organic search might respond better to variant A.
A returning user from email might respond better to variant B.
The algorithm learns these relationships and allocates accordingly.
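A very simplified way to picture this is one bandit per segment. The sketch below does exactly that with Thompson sampling. Real contextual bandits usually model context with many features rather than a single segment label, so treat this as an illustration only. The segment names and conversion rates are invented.

```python
import random


def contextual_thompson(true_rates, visitors, seed=0):
    """true_rates maps segment -> {variant: conversion rate}.

    Returns how often each variant was shown within each segment.
    """
    rng = random.Random(seed)
    segments = list(true_rates)
    # stats[segment][variant] = [conversions, non_conversions]
    stats = {s: {v: [0, 0] for v in true_rates[s]} for s in segments}
    shown = {s: {v: 0 for v in true_rates[s]} for s in segments}

    for _ in range(visitors):
        segment = rng.choice(segments)  # simulated arriving visitor
        # Only this segment's history informs the allocation decision
        draws = {
            v: rng.betavariate(1 + wins, 1 + losses)
            for v, (wins, losses) in stats[segment].items()
        }
        variant = max(draws, key=draws.get)
        shown[segment][variant] += 1
        did_convert = rng.random() < true_rates[segment][variant]
        stats[segment][variant][0 if did_convert else 1] += 1

    return shown


# Hypothetical segments with opposite preferences
rates = {
    "new_organic": {"A": 0.10, "B": 0.04},
    "returning_email": {"A": 0.04, "B": 0.10},
}
allocation = contextual_thompson(rates, visitors=6_000)
for segment, counts in allocation.items():
    print(segment, counts)
```

Notice that every segment has to earn its own evidence. Split the same traffic across ten segments instead of two and each one learns far more slowly.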
This is personalisation built into the experiment itself, and the ceiling is meaningfully higher than a standard MAB or A/B test. The complexity is also meaningfully higher. You need sufficient traffic across each segment to learn stable patterns, you need clean user identification, and you need to be able to act on the personalisation at scale. Many teams don’t have the infrastructure for this.
If you’re considering a CMAB before you’ve validated basic segmentation hypotheses with clean A/B tests, you’re reaching for the most expensive tool before understanding the problem. Get clear on who your users are and what signals distinguish them first. The method can come later.
Feature Flagging as Experimentation
Feature flags are how product and engineering teams control which users see which version of a feature without a code deployment for every change. In a CRO context, this matters because not everything you want to test is a DOM element you can change with a WYSIWYG editor.
Things like pricing model changes, backend logic, subscription tiers, and product feature rollouts need to be tested at the infrastructure level, not the front-end layer. A feature flag gives you that control. You ship the new pricing structure to 50% of users and measure the impact on sign-up rate, cancellation rate, and lifetime value. You don’t have to stop measuring at the conversion. This connects directly to where the real data lives.
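Under the hood, a percentage rollout often comes down to hashing the user ID into a stable bucket. Here's a minimal sketch of that idea. It isn't any particular vendor's implementation, and the flag name and user IDs are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, flag_name: str, rollout_percent: float) -> bool:
    """Deterministically decide whether a user sees the flagged feature."""
    # Hash flag + user so the same user can land in different buckets
    # for different flags
    key = f"{flag_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Map the first 8 hex characters to a number in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < rollout_percent / 100


# Hypothetical flag: ship new pricing to 50% of users
users = [f"user-{i}" for i in range(10_000)]
exposed = [u for u in users if in_rollout(u, "new_pricing", 50)]
print(f"{len(exposed) / len(users):.1%} of users see the new pricing")
```

Because the assignment depends only on the user ID and flag name, a user gets the same experience on every visit and every device where they're identified. You can also recompute who saw what later, when you join against warehouse data.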
Most teams stop measuring at the conversion event. But the metric that actually matters for a subscription business isn’t sign-up rate. It’s average membership duration, churn rate, and long-term value. That data often doesn’t live in your A/B testing platform. It lives in your data warehouse. Feature flagging, combined with warehouse-native experimentation, is how you connect the experiment to the outcome that actually counts. Most teams haven’t got there yet, but the ones optimising for sign-up rate without understanding what happens to those sign-ups six months later are solving the wrong problem.
Sequential Testing and Continuous Monitoring
Traditional A/B testing asks you to define a sample size before you start, run until you hit it, and look at results once. In practice, most teams check results daily and stop tests early when they see something significant. This is called peeking, and it dramatically increases false positive rates. The more you check, the more likely you are to see a “significant” result that’s actually noise.
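You can see the peeking problem for yourself with a simulated A/A test, where both arms are identical and any "winner" is a false positive by construction. This is a rough sketch. The conversion rate, number of looks, and sample sizes are arbitrary assumptions.

```python
import math
import random


def z_stat(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic (0.0 if there is no variation)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if std_err == 0 else (conv_b / n_b - conv_a / n_a) / std_err


def simulate_peeking(experiments=500, looks=10, visitors_per_look=200,
                     rate=0.10, z_crit=1.96, seed=1):
    """Return (false positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    flagged_on_any_look = 0
    flagged_at_end = 0
    for _experiment in range(experiments):
        conv_a = conv_b = n = 0
        saw_significance = False
        for _look in range(looks):
            # Both arms share the same true rate: there is no real effect
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            n += visitors_per_look
            z = z_stat(conv_a, n, conv_b, n)
            if abs(z) > z_crit:
                saw_significance = True  # a peeker would have stopped here
        flagged_on_any_look += saw_significance
        flagged_at_end += abs(z) > z_crit
    return flagged_on_any_look / experiments, flagged_at_end / experiments


peeking_rate, single_look_rate = simulate_peeking()
print(f"false positives when peeking at every look: {peeking_rate:.1%}")
print(f"false positives with one look at the end:   {single_look_rate:.1%}")
```

The single-look rate should sit near the 5% you signed up for. The peeking rate comes out well above it, even though nothing about the test changed except how often someone looked.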
Sequential testing is designed for teams that need to monitor results continuously. It uses adjusted stopping rules that account for repeated looks at the data. You can check whenever you want, and the method controls for the increased false positive risk. The tradeoff is that it typically requires more total conversions to reach a conclusion than a fixed-horizon test.
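To give a feel for what an adjusted stopping rule looks like, here's a deliberately crude sketch that splits the false positive budget evenly across a planned number of looks (a Bonferroni correction). Production sequential methods are more efficient than this, so treat it as a stand-in for the idea, not as what your tool does. The function names are my own.

```python
from statistics import NormalDist


def adjusted_threshold(planned_looks, alpha=0.05):
    """Two-sided z threshold with alpha split evenly across looks."""
    per_look_alpha = alpha / planned_looks
    return NormalDist().inv_cdf(1 - per_look_alpha / 2)


def should_stop(z, planned_looks, alpha=0.05):
    """Stop only if the evidence clears the adjusted bar."""
    return abs(z) > adjusted_threshold(planned_looks, alpha)


for looks in (1, 5, 10):
    print(f"{looks} looks: stop when |z| > {adjusted_threshold(looks):.2f}")
```

A z of 2.2 would clear the single-look bar of about 1.96 but not the adjusted bar for 10 looks, which is about 2.81. That higher bar is where the extra conversions come from.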
Which Method You Actually Need
Here’s what I keep coming back to… the method that produces the most learnable results is more valuable than the method that sounds most advanced. A clean A/B test that teaches you something about your users is worth more than a CMAB that produces a winner you can’t explain.
The progression that works in practice is this. Start with A/B tests on single elements. Build the muscle of writing sharp hypotheses. When you have enough clean tests to understand your users’ primary objections and motivations, you can challenge bigger structural assumptions (possibly with split URL tests). If your traffic supports it and you have a specific reason to care about element interactions, consider multivariate. Use bandits for short-horizon decisions where learning matters less than performance. Use feature flags for anything the front-end can’t touch.
What I’d push back on is the assumption that moving up that list means maturing. More complex is not more rigorous. I’ve seen mature programmes built almost entirely on well-designed A/B tests, and I’ve seen programmes drowning in sophisticated methodology that couldn’t tell you why a single test won. The programme built on clear questions and disciplined isolation almost always has better cumulative understanding, and cumulative understanding is what compounding growth is built on.





