I had a conversation with a friend who runs a small CRO agency. She’d spent two weeks building an automation that pulled GA4 data, ran it through a scoring model, and spat out a prioritised test backlog every Monday morning. She was proud of it. It looked great. And it was recommending tests on the checkout confirmation page for one of her clients – a page that got maybe 3% of site traffic – while the product listing pages were haemorrhaging visitors with a 78% bounce rate. The automation was working perfectly. The strategy was wrong. And the automation had no way of knowing that.
That’s the tension at the heart of CRO automation. Not whether to do it. But whether to trust it for the right things.
This piece is about making that distinction clearly, building something that actually helps a lean team move faster, and avoiding building things that don’t really solve a problem.
The two categories of CRO work. And why mixing them up breaks everything.
Before you automate anything, you need to separate two fundamentally different types of CRO activity. One is admin. The other is judgment.
Admin is everything that involves moving, formatting, aggregating, or presenting information that already exists. Pulling weekly GA4 data into a tracker. Generating a test brief template from a hypothesis. Flagging when a test has hit statistical significance. Sending a Slack message when an experiment goes live. Scheduling QA checklists. All of this is mechanical. It has a right answer. Automating it saves time and reduces error.
Judgment is everything else. Deciding which page has the highest conversion leverage right now. Interpreting why a test lost when the data doesn’t give you a clean answer. Working out whether a 4% uplift is real or a fluke of your randomisation. Choosing between three strong hypotheses when you can only run one. These require context, experience, and reasoning that automation can’t [currently] replicate. Automating these things doesn’t save time. It produces confident-looking wrong answers, faster.
The practical question for every automation you consider is this. What happens when it gets it wrong? For admin tasks, a wrong answer is a minor inconvenience. For strategic decisions, it’s three weeks of dev time on a test that shouldn’t have been built. That asymmetry should shape every decision you make about what to hand off to a tool.
What’s actually worth automating in a CRO workflow
Let’s be specific, because this is where most articles go vague and useless.
Data aggregation
Data aggregation is the first place to start. Most CRO practitioners spend hours every week pulling numbers from GA4, Search Console, heatmap tools, and session recording platforms, then manually dropping them into a spreadsheet or deck. This is completely automatable. A Google Analytics to Airtable integration via Zapier or Make, or a direct API pull into a Google Sheet, gets you weekly traffic, conversion rate, and revenue data without touching it. Search Console can feed you the keyword and page performance data you need for pre-test diagnosis. Set it up once, it runs forever.
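As a rough illustration of what the aggregation step does once the rows have arrived, here is a minimal Python sketch. It assumes the weekly pull has already been exported as simple rows; the field names (`page`, `sessions`, `conversions`, `revenue`) and the sample figures are hypothetical, not a real GA4 schema.

```python
from collections import defaultdict

def summarise_pages(rows):
    """Aggregate raw rows into per-page sessions, conversions, revenue and conversion rate."""
    totals = defaultdict(lambda: {"sessions": 0, "conversions": 0, "revenue": 0.0})
    for row in rows:
        page = totals[row["page"]]
        page["sessions"] += row["sessions"]
        page["conversions"] += row["conversions"]
        page["revenue"] += row["revenue"]

    summary = []
    for path, t in totals.items():
        # Guard against pages with no sessions in the period
        rate = t["conversions"] / t["sessions"] if t["sessions"] else 0.0
        summary.append({"page": path, **t, "conversion_rate": rate})

    # Highest-traffic pages first
    return sorted(summary, key=lambda r: r["sessions"], reverse=True)

# Hypothetical rows standing in for a weekly export
weekly_rows = [
    {"page": "/products", "sessions": 4000, "conversions": 40, "revenue": 2000.0},
    {"page": "/products", "sessions": 1000, "conversions": 10, "revenue": 500.0},
    {"page": "/checkout/confirmation", "sessions": 150, "conversions": 120, "revenue": 6000.0},
]

for row in summarise_pages(weekly_rows):
    print(f'{row["page"]}: {row["sessions"]} sessions, {row["conversion_rate"]:.1%} conversion rate')
```

The output of something like this is what lands in the tracker. Deciding what the numbers mean is still a separate job.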
Test documentation
Test documentation and brief generation is the second high-value area. A well-structured test brief has maybe eight fields:
- the page
- the hypothesis
- the target metric
- the secondary metrics
- the sample size needed
- the expected duration
- the risk level
- the success criteria
Once you’ve defined what a brief looks like, you can use Claude or GPT-4 to draft one from a short hypothesis prompt. It won’t be perfect. But it gets you 80% of the way there in 30 seconds instead of 20 minutes. Importantly, the practitioner still reviews and refines it. The time saving is in the scaffolding.
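One way to pin down “what a brief looks like” is to write the eight fields as a small data structure and build the drafting prompt from it. This is a sketch under assumptions: the class name `ExperimentBrief`, the function `build_brief_prompt`, and the prompt wording are all hypothetical, and the step that sends the prompt to a model is deliberately left out.

```python
from dataclasses import dataclass, fields

@dataclass
class ExperimentBrief:
    """The eight fields of a test brief, as listed above."""
    page: str
    hypothesis: str
    target_metric: str
    secondary_metrics: str
    sample_size_needed: str
    expected_duration: str
    risk_level: str
    success_criteria: str

def build_brief_prompt(hypothesis: str, page: str) -> str:
    """Build a drafting prompt that asks for every brief field by name."""
    field_list = "\n".join(f"- {f.name}" for f in fields(ExperimentBrief))
    return (
        "Draft an A/B test brief for review by a CRO practitioner.\n"
        f"Page: {page}\n"
        f"Hypothesis: {hypothesis}\n"
        "Fill in each of these fields, and mark anything you are unsure of as TODO:\n"
        f"{field_list}"
    )

print(build_brief_prompt(
    hypothesis="Showing delivery dates on listing cards will increase add-to-basket rate",
    page="/products",
))
```

Keeping the field list in one place means the prompt and the tracker can’t drift apart. The draft that comes back is scaffolding for the practitioner to review, not a finished brief.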
Statistical confidence
Significance and sample size calculations are almost embarrassingly manual in most teams. There are good calculators online, but building one into your Airtable base or your test tracker means it runs automatically as your traffic and conversion data updates. Your team stops guessing when to call a test.
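For reference, the arithmetic behind such a calculator is small. The sketch below uses the standard normal-approximation formulas for comparing two conversion rates (a fixed-horizon, two-sided test). The function names, the default 5% significance level, and the 80% power are assumptions chosen for illustration, not settings taken from any particular tool.

```python
import math
from statistics import NormalDist

def required_sample_per_variant(baseline_rate, relative_uplift, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant to detect a relative uplift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_uplift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test p-value using a pooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical inputs: 3% baseline, aiming to detect a 10% relative uplift
print(required_sample_per_variant(0.03, 0.10))
print(round(two_sided_p_value(300, 10_000, 360, 10_000), 4))
```

These formulas assume the sample size is fixed in advance. If the tracker re-checks significance every time the data updates, stopping at the first significant reading raises the false-positive rate, so it is safer to treat the alert as “ready for review once the planned sample is reached”.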
Internal comms
QA checklists and pre-launch verification are worth automating wherever possible. A simple Airtable form that fires a Slack notification to the relevant stakeholders when a test is ready for review, with a link to the QA checklist, removes a category of dropped balls entirely. Not glamorous. Extremely valuable.
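If you’d rather script the notification than configure it in a no-code tool, a minimal sketch looks like this. It assumes a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names, the message wording, and the example URL are hypothetical, and the example only builds the payload rather than sending it.

```python
import json
import urllib.request

def build_qa_message(test_name, page, checklist_url, reviewers):
    """Build the JSON payload for a 'ready for QA' notification."""
    text = (
        f"Test ready for QA: {test_name}\n"
        f"Page: {page}\n"
        f"Reviewers: {', '.join(reviewers)}\n"
        f"Checklist: {checklist_url}"
    )
    return {"text": text}

def post_to_slack(webhook_url, payload):
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status

payload = build_qa_message(
    test_name="Listing page delivery dates",
    page="/products",
    checklist_url="https://example.com/qa-checklist",  # placeholder link
    reviewers=["Design lead", "Front-end dev"],
)
print(json.dumps(payload, indent=2))
# post_to_slack(webhook_url, payload) would send it; the webhook URL comes from your Slack workspace
```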
Reporting
Reporting is the most visible automation win for stakeholders. An automated weekly digest that pulls test status, live experiments, recent results, and the pipeline view takes about three hours to set up properly and saves about three hours a week forever. It also means leadership always has visibility without the CRO practitioner needing to prepare a deck. That changes how you use meetings. As with everything above, the insight and judgment in that digest still need a practitioner’s context factored in, but automation removes most of the assembly time.
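A digest like that is mostly string assembly over the test tracker. Here is a minimal sketch, assuming each experiment is a record with a `status` of `live`, `concluded`, or `backlog`. Those status values, the field names, and the sample records are hypothetical.

```python
from collections import Counter

def build_weekly_digest(experiments):
    """Return a plain-text digest: status counts, live tests, recent results."""
    counts = Counter(e["status"] for e in experiments)
    lines = [
        "Weekly experimentation digest",
        f"Live: {counts['live']} | Concluded this week: {counts['concluded']} | Backlog: {counts['backlog']}",
        "",
        "Live experiments:",
    ]
    lines += [f"- {e['name']} ({e['page']})" for e in experiments if e["status"] == "live"]
    lines += ["", "Recent results:"]
    for e in experiments:
        if e["status"] == "concluded":
            # The note is written by a person; the digest only carries it
            note = e.get("note", "pending review")
            lines.append(f"- {e['name']}: {e['result']} (practitioner note: {note})")
    return "\n".join(lines)

# Hypothetical tracker records
experiments = [
    {"name": "Listing page delivery dates", "page": "/products", "status": "live"},
    {"name": "Basket upsell module", "page": "/basket", "status": "concluded",
     "result": "no significant difference"},
    {"name": "Homepage hero copy", "page": "/", "status": "backlog"},
]

print(build_weekly_digest(experiments))
```

The `note` field is the point of the design. The script reports what happened, and the interpretation stays a human-written field that shows as “pending review” until someone fills it in.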
What you shouldn’t automate
Hypothesis generation
Hypothesis generation is not automatable in any meaningful way right now. AI tools can generate a list of things to test. But that list will be largely irrelevant unless it is filtered by someone who understands the specific business context, user research, previous test history, and the current commercial priorities. I’ve seen teams feed their GA4 data into Claude and ask it to suggest test ideas. The suggestions are coherent. They are also disconnected from anything that would actually move the needle for that specific business at that specific moment. Speed isn’t typically the constraint when it comes to hypothesis generation. Clarity is.
Results interpretation
Results interpretation is another area to leave alone. A test result isn’t just a number. It’s a number with context – the context of what happened during the test window, the quality of the randomisation, the segments that drove the result, the relationship between the primary metric and secondary metrics, and more. I ran a test at a SaaS business once where the sign-up rate went up by 6% and the business wanted to ship it. But when we looked at downstream data in the warehouse, the new cohort had a meaningfully shorter average membership duration. The test had optimised for sign-up at the cost of retention. No automation would have caught that. You catch it by knowing what questions to ask.
Prioritisation
Prioritisation is also a judgment call, even if tools can assist with it. Which experiment you run next is a function of confidence, expected impact, implementation cost, stakeholder appetite, and strategic alignment. A scoring model can give you a starting point. It can’t tell you that now is a bad time to test the checkout because the dev team has a product release in two weeks that will invalidate the results. Even when the teams I work with are using the Risk Ranker, I remind them that it’s a guide and they can’t just rely on a score without thinking of the wider business context.
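To make “a starting point” concrete, here is a deliberately simple scoring sketch. It is not the Risk Ranker, and the formula (confidence times impact, divided by effort), the 1 to 5 scales, and the `blocker` field are all assumptions for illustration. The score orders the list, and the blocker note carries the context the score can’t see.

```python
def rank_backlog(backlog):
    """Score each idea and sort by score, keeping any human-written blocker note."""
    ranked = []
    for item in backlog:
        score = item["confidence"] * item["impact"] / item["effort"]
        ranked.append({**item, "score": round(score, 2)})
    ranked.sort(key=lambda i: i["score"], reverse=True)
    return ranked

# Hypothetical backlog, scored 1-5 on each dimension
backlog = [
    {"name": "Listing page filters", "confidence": 3, "impact": 4, "effort": 2, "blocker": None},
    {"name": "Checkout trust badges", "confidence": 4, "impact": 5, "effort": 2,
     "blocker": "Product release in two weeks will invalidate results"},
]

for item in rank_backlog(backlog):
    flag = f" | NEEDS REVIEW: {item['blocker']}" if item["blocker"] else ""
    print(f"{item['score']:>5} {item['name']}{flag}")
```

In this example the top-scoring idea is the one a practitioner would hold back. The script surfaces the conflict rather than resolving it, because that call belongs to a person.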
MCPs & CLIs – what they actually change for CRO data workflows
MCPs & CLIs are worth understanding if you’re building anything serious with AI in your CRO stack. The short version:
MCP (the Model Context Protocol) has become a standardised way for AI models to connect to external data sources and tools.
Instead of copy-pasting your GA4 data into Claude and asking a question, an MCP server lets Claude (for example) query your analytics data directly, pull from your test tracker, and reason across multiple live data sources in a single conversation.
For CRO practitioners, this matters in a specific way. Right now, the friction in using AI for CRO work is the data preparation. You spend ten minutes getting the data into a format the AI can use, then two minutes asking the question. MCPs & CLIs flip that ratio. When your GA4, your Airtable test backlog, and your Search Console data are all connected, you can ask a question like “which pages in our current backlog have the highest traffic and lowest conversion rate, and which of those have active experiments running?” and get a real answer in seconds.
That is genuinely useful. It compresses the dreaded, repetitive admin work of cross-referencing multiple sources. It doesn’t compress the judgment work of deciding what the answer means or what to do about it. The distinction still applies. It makes the admin automation more powerful, not the judgment automation more reliable.
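Underneath, answering a question like that is a join across three sources. The sketch below shows that cross-referencing in plain Python, with small hypothetical in-memory records standing in for the backlog, the analytics data, and the list of live experiments. It illustrates the admin work being compressed, not how any MCP server is implemented.

```python
def backlog_opportunities(backlog, page_metrics, live_experiment_pages):
    """Join backlog ideas to page metrics and flag pages with a live experiment."""
    rows = []
    for item in backlog:
        metrics = page_metrics.get(item["page"])
        if metrics is None:
            continue  # no analytics data for this page yet
        rows.append({
            "page": item["page"],
            "idea": item["idea"],
            "sessions": metrics["sessions"],
            "conversion_rate": metrics["conversion_rate"],
            "has_live_experiment": item["page"] in live_experiment_pages,
        })
    # Highest traffic first, then lowest conversion rate
    return sorted(rows, key=lambda r: (-r["sessions"], r["conversion_rate"]))

# Hypothetical data from the three sources
backlog = [
    {"page": "/products", "idea": "Delivery dates on listing cards"},
    {"page": "/basket", "idea": "Simplify promo code field"},
    {"page": "/new-landing", "idea": "Hero video"},
]
page_metrics = {
    "/products": {"sessions": 5000, "conversion_rate": 0.010},
    "/basket": {"sessions": 1200, "conversion_rate": 0.350},
}
live_experiment_pages = {"/basket"}

for row in backlog_opportunities(backlog, page_metrics, live_experiment_pages):
    print(row)
```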
Building these connections requires some technical setup, typically involving an MCP server that sits between your data sources and the AI client. Tools like Anthropic’s Claude Code support MCPs & CLIs natively, and there are open-source MCP servers for Google Analytics, Airtable, and various other tools that are worth exploring if your team has any technical capacity. This is early infrastructure. But it’s where the most capable lean CRO teams will be operating in the next 18 months.
What a smart automated CRO stack looks like for a lean team
You don’t need much. Here’s what actually works for a team of one to three people doing serious CRO work.
Airtable is the connective tissue. It holds your test backlog, your live experiments, your results archive, and your hypothesis library. Everything else connects to it. Automations within Airtable handle status notifications, duration tracking, and brief template generation. The base becomes the single source of truth for the programme.
GA4, Search Console & possibly your experimentation tool feed into it via API or via a tool like Supermetrics or Porter Metrics if you want a no-code route. Weekly pulls update your opportunity sizing data automatically. You know which pages are worth testing without pulling reports manually.
Slack is where the programme surfaces. Automated notifications for test launches, QA requests, significance alerts, and weekly digests mean stakeholders stay informed without meetings. This matters more than it sounds. The most capable experimentation teams I’ve seen plateau because they didn’t have support from above. Automation that keeps leadership informed passively is a politics tool as much as a productivity tool.
A significance and duration calculator, built into the Airtable base or as a simple Google Sheet, means test calls are consistent and defensible. No more gut-feel calls on when to stop a test.
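The duration half of that calculator is simple division. A minimal sketch follows, assuming you already know the required sample per variant. The function name and parameters are hypothetical, and rounding up to whole weeks is a common convention assumed here so that every weekday is represented equally.

```python
import math

def estimated_duration_days(sample_per_variant, daily_visitors, variants=2, traffic_share=1.0):
    """Estimate test duration in days, rounded up to whole weeks."""
    visitors_per_variant_per_day = daily_visitors * traffic_share / variants
    days = math.ceil(sample_per_variant / visitors_per_variant_per_day)
    # Round up to a multiple of 7 so the test covers full weekly cycles
    return math.ceil(days / 7) * 7

# Hypothetical inputs: 20,000 visitors needed per variant, 4,000 eligible visitors a day
print(estimated_duration_days(20_000, 4_000))
```

Writing the expected end date into the tracker before launch is what makes the stopping decision defensible later.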
That’s it. Four components. No proprietary experimentation platform required, although if you’re running high-volume testing, adding one on top of this infrastructure makes sense. The stack serves a lean team running five to fifteen experiments a month without drowning in administration.
Common mistakes teams make when they start automating
The first mistake is automating before diagnosing. Teams reach for automation when the real problem is that they don’t have a clear view of where the conversion opportunity is. Automation makes a broken experimentation programme even messier, and the cracks become apparent even quicker.
The second mistake is trusting AI outputs without review. Automated brief drafts, AI-generated hypotheses, and LLM-written analysis summaries all need human eyes before they’re used. A team that’s too busy to review the outputs properly and starts treating them as finished work might just let mistakes slip through the cracks. And that’s where automation shifts from saving time to introducing error at speed.
The third mistake is building automation that nobody uses. I’ve seen teams spend months building elaborate reporting systems that leadership ignores because nobody consulted leadership on what they actually wanted to see. Build the simplest version first. Confirm it’s useful. Then extend it. This is honestly a big risk in my opinion. As soon as something automated isn’t perceived as valuable, it’s seen as noise and people will either mentally ignore it or manually find a way to filter or block it.
The fourth mistake is mistaking automation for a CRO strategy. A losing test is not a failure. It’s the answer to a question you were previously guessing at. The Wine Society work made this concrete for me – being able to quantify what a losing variant would have cost if shipped without testing reframes the programme as risk management. That framing, that thinking, is what makes a CRO programme valuable. No automation produces that. A practitioner with clarity produces that. Automation just clears the space for them to do it.
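The arithmetic behind “what a losing variant would have cost” is short enough to show. This is a generic sketch with made-up numbers, not figures from any client work. It assumes the measured relative change in conversion rate would have held for a full year, which is itself a judgment call.

```python
def annual_cost_of_shipping_loser(annual_sessions, baseline_rate, relative_change, average_order_value):
    """Estimate revenue lost per year if a losing variant had shipped untested."""
    baseline_orders = annual_sessions * baseline_rate
    variant_orders = baseline_orders * (1 + relative_change)
    return (baseline_orders - variant_orders) * average_order_value

# Hypothetical inputs: 1m sessions a year, 2% conversion, a variant that lost by 5%, 80 average order value
cost = annual_cost_of_shipping_loser(
    annual_sessions=1_000_000,
    baseline_rate=0.02,
    relative_change=-0.05,
    average_order_value=80,
)
print(f"Estimated annual revenue protected by testing: {cost:,.0f}")
```

The calculation is trivial. Choosing to frame a losing test this way, and knowing which assumptions to challenge, is the practitioner’s contribution.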
Where to start
If your team is currently managing experiments in a spreadsheet, start with Airtable. Get your backlog, live tests, and results in one place before you connect anything to anything else. Structure precedes automation.
If you have the basics in place and want to move toward the stack described above, start with the data feeds. Get GA4 pulling into your base automatically. That single change removes the most repetitive manual work and gives you always-current opportunity data.
If you’re already running a structured programme and want to bring AI into the workflow, start with brief drafting. It’s the lowest-risk application, the easiest to review, and the most immediately time-saving.
And if you haven’t formalised how you prioritise experiments yet, that’s the right place to put your energy before you automate anything else. Prioritisation is where most programmes leak value, and it’s the one area where a clear framework pays back consistently over time. The Risk Ranker is built for exactly this. It gives lean teams a structured, defensible way to score and rank experiments before deciding what to build. Worth running your current backlog through it before you automate the prioritisation logic, so you know what good looks like first.





