Momentum is building in your experimentation programme, amazing… but now you have the fun challenge of running even more tests to scale it. And that raises a new question you hadn’t considered before: should we be using feature flags or stick with our A/B testing tool? Then it goes quiet, because nobody quite knows what the actual difference is in practice, or more importantly, which one you need.

Here’s the short version. Both are ways to run controlled experiments. The difference is where they live, how they’re implemented, and what they’re designed to do well. Getting that wrong costs you either months of blocked tests or a tool your team won’t (or can’t) actually use.

What each tool is actually built for

A traditional A/B testing tool (think Optimizely, VWO, Convert, etc.) lives in the browser. It intercepts the page after it loads and modifies elements visually. Most of them have a WYSIWYG editor, so a marketer can click on a headline, type a new one, and launch a test without writing a line of code. That’s genuinely useful. It’s also the ceiling. You’re modifying what’s already rendered. You’re not touching the underlying logic of the product.
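To make that concrete, here’s roughly what a visual editor is generating behind the scenes. It’s a tiny sketch, not any vendor’s actual output, and the selector and headline copy are made up for illustration:

```typescript
// Runs in the browser after the page has already rendered.
// A visual editor produces something equivalent to this for each change.
// The selector and replacement text below are hypothetical examples.
const headline = document.querySelector<HTMLElement>('.hero__headline');

if (headline) {
  // Variant B: swap the copy that's already on the page.
  headline.textContent = 'Start your free trial today';
}
```

Everything happens after the server has decided what to send, which is exactly why this approach can restyle a page but can’t change what the product actually does.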

Feature flags live in the code. A developer wraps a feature in a flag, and the experimentation platform decides which version of that code runs for which user, before anything gets sent to the browser. That opens up a lot of possibilities. You’re not changing what a page looks like after it loads. You’re changing how the product behaves at the point it’s built.
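In code, that wrapping looks something like the sketch below. It uses a hypothetical flag client rather than any specific vendor’s SDK, and the flag key and ranking functions are made up for illustration:

```typescript
// Minimal sketch of a flag-wrapped feature. The client here is a hypothetical
// stand-in for whatever SDK your experimentation platform provides.
type FlagClient = {
  isEnabled(flagKey: string, userId: string): boolean;
};

// In a real setup this comes from the platform's SDK, initialised with your
// SDK key and traffic split; it's stubbed here so the sketch stands alone.
const flagClient: FlagClient = {
  isEnabled: () => false, // the platform decides this per user, per flag
};

function getRecommendations(userId: string, items: string[]): string[] {
  // The decision happens in code, before anything is sent to the browser.
  if (flagClient.isEnabled('personalised-ranking', userId)) {
    return rankByUserHistory(userId, items); // variant: new behaviour under test
  }
  return rankByPopularity(items); // control: current behaviour
}

// Hypothetical ranking functions, included only so the sketch stands alone.
function rankByUserHistory(userId: string, items: string[]): string[] { return items; }
function rankByPopularity(items: string[]): string[] { return items; }
```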

So anything involving dynamic content, pricing logic, subscription model variants, search ranking changes, personalised recommendations, anything where the front-end change needs to map to actual back-end behaviour, that’s feature flag territory. The A/B testing tool can’t reach it. A visual editor can change the text on a pricing page, but it can’t change what you’re actually charged at checkout.
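To show where that bites, here’s a rough server-side sketch where the flag controls the amount actually charged rather than the copy on the page. The flag key, discount, and function names are all hypothetical:

```typescript
// Server-side pricing decision. `isEnabled` stands in for whatever your
// experimentation platform's SDK exposes; nothing here is a real API.
declare function isEnabled(flagKey: string, userId: string): boolean;

function getChargeAmountCents(userId: string, basePriceCents: number): number {
  if (isEnabled('annual-plan-discount', userId)) {
    // Variant: the amount charged at checkout actually changes.
    return Math.round(basePriceCents * 0.85); // hypothetical 15% discount
  }
  return basePriceCents; // control: charged as normal
}
```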

The team capability question comes first

Before you get to what you want to test, you need to be honest about who’s doing the testing.

Dev-heavy teams with strong engineering support often find feature flags completely natural. It fits how they already think about shipping code. Wrapping something in a flag, hooking it into an SDK, setting up traffic splits in a dashboard, that’s a familiar workflow. For those teams, a WYSIWYG editor can feel like a step backwards.

For teams without that technical depth, the WYSIWYG editor isn’t a compromise. It’s the reason the programme exists at all. Marketing teams, growth teams, CRO specialists who aren’t developers… they need a tool that lets them move without filing a ticket and waiting two weeks. Modern A/B testing tools have built that out well, and some now have AI-assisted editing on top.

So the early decision often isn’t really about the tools. It’s about who’s going to use them.

Why mature programmes end up with both

The pattern I see at a certain stage of programme maturity is a split that actually makes sense. Marketing runs the A/B testing tool. Product and engineering run feature flags. They’re testing different things, at different layers of the product, with different skill sets.

That can work well. It can also become completely fragmented if nobody’s coordinating it. What tends to happen without oversight is that the same idea gets queued in both tools, or a test that should have gone to feature flags gets shoehorned into the visual editor because that’s where whoever thought of it has access. The experiment runs, the results are messy, and nobody knows if it was the idea or the implementation that caused the problem.

The fix isn’t complicated. Someone needs to look at the test backlog and route ideas to the right tool: these belong in the A/B testing tool, those need feature flags. That single decision point, made consistently, is what lets a programme run at real velocity across both tracks instead of bumping along in one lane.

The billing point most teams don’t consider

There’s a pricing detail worth understanding before you sign anything.

For example, on Optimizely Web, every unique visitor who lands on your site counts toward your monthly active user (MAU) limit, whether or not they’re in an active experiment. Your full traffic volume is your billing volume.

On Optimizely Feature Experimentation, an MAU is only counted if that user is actually exposed to a running experiment. So if you’re running three experiments at once and they only touch a portion of your traffic, you’re only billed for the users who hit those experiments.

For high-traffic sites with targeted experimentation, the feature flag billing model can be meaningfully cheaper. Worth thinking about before you assume web is the lower-cost entry point.

When tests outgrow the A/B testing tool

Starting with a traditional A/B testing tool is usually the right call for teams earlier in their experimentation journey. You get the visual editor, you get accessible reporting, you can run tests without engineering dependency, and you can start proving the programme’s value quickly. That matters. Leadership buy-in comes from wins they can see, and the early wins often come from exactly the kind of test a WYSIWYG tool handles well.

But there’s a point where the test ideas start hitting the ceiling. You’ve done the obvious funnel tests. Something’s clicked, leadership is interested, and now someone wants to test a discount mechanic or compare a subscription model against a one-off purchase. The visual editor just won’t cut it.

Quip ran into this directly. They sell a smart toothbrush, and they started out testing the usual things across the website. Standard stuff. But at some point the interesting questions stopped being about page layout and started being about how the subscription model should work.

Should they push replacement brush head subscriptions at checkout?
What happens to lifetime value when they do?

You can’t test that with DOM manipulation. They moved into feature experimentation, ran experiments on the subscription and cross-sell mechanic, and found that getting people onto a replacement subscription increased both average order value and customer lifetime value. The programme evolved because the questions evolved.

I saw a version of this play out differently with a team working on a global e-commerce platform (a big brand you’d know). There was pressure from above to roll out a third-party sizing configurator across all product pages. The head of e-commerce wasn’t convinced it would help conversion. In theory it sounded right, but nobody actually knew.

Rather than roll it out site-wide and hope, they used feature flags to test it across specific categories first. The results weren’t good. The configurator was hurting conversion, and the cost of the third-party platform on top of that meant the numbers didn’t stack up. They went back to leadership with actual data instead of a gut feeling, and the rollout was stopped. Feature flags made that possible. A visual editor would have struggled to run the test at that level of scope, and a blanket rollout without testing would have been expensive.

Where to start

If your team doesn’t have strong engineering support and you’re still building the case for experimentation internally, start with a browser-based A/B testing tool. Get wins on the board. Make the programme visible. When your test ideas start consistently hitting walls because the tool can’t reach what you want to change, that’s the signal to bring in feature flags, either as the primary tool or alongside what you already have.

If your team is already dev-heavy and the ideas on your backlog involve pricing logic, product features, or dynamic content, go straight to feature flags. The visual editor isn’t going to serve you.

And if you’re already running both and the coordination piece feels messy, the fix is process before it’s tooling. Someone needs to own the decision of what goes where. That’s not a technology problem.

If you’re not sure where your programme currently sits on that maturity curve, the Experimentation Maturity Quiz can help you work that out in a few minutes. It’s worth doing before you commit to a tool, because the right answer depends entirely on where you actually are, not where you think you should be.

Kyle Newsam

An optimizer by trade & lifestyle. Truly any experience or interaction becomes an experiment & something I can learn from. Currently moving around the globe, working from the coolest locations that the younger me could never have imagined.