Your A/B tests are worthless if your analytics are broken

I once audited a testing program that had been running for eight months. Fourteen tests completed. The team had a nice spreadsheet with winners and losers, projected revenue lifts, the whole thing. Very professional looking.

Their conversion tracking was double-firing on iOS Safari. Had been since day one.

Every single test result was contaminated. Eight months. Fourteen tests. Zero usable data. The projected revenue lift they’d been reporting to leadership? Fiction.

This isn’t a rare story. I’d say about half the A/B testing programs I audit have at least one tracking problem serious enough to invalidate results. Not minor discrepancies. Structural failures that make the data unreliable at a fundamental level.

A/B testing has a prerequisite nobody checks

Here’s how most companies start A/B testing: they pick a tool (Optimizely, VWO, AB Tasty, whatever), install it, and start running tests. Maybe they read a blog post about statistical significance. They feel good about being “data-driven.”

What they don’t do is verify that their underlying measurement is accurate. Many companies carried over broken setups from the GA4 migration and never looked back.

Think about what an A/B test actually does. It splits traffic, shows different experiences, and measures which one performs better against a goal. That goal is tracked by your analytics. If your analytics are wrong, your test results are wrong. Full stop.

It doesn’t matter how perfect your sample size calculations are. It doesn’t matter if you wait for 95% confidence. If the thing you’re measuring is broken, the math just makes you confidently wrong.

I’ve seen $50K testing programs built on tracking that was wrong by 30%. Not a rounding error. Thirty percent.

The failures I keep finding

Let me walk through the specific ways tracking breaks A/B tests, because these aren’t theoretical. I find these in real audits, regularly.

Conversion goals that fire twice on mobile. This is the most common one. Your purchase confirmation page loads, fires the conversion event, then something triggers a re-render (maybe a payment widget, maybe a third-party script), and the event fires again. Desktop handles it fine because of faster rendering. Mobile doesn’t. Now your mobile conversion rate is inflated by 15-40%, depending on the page. Your A/B test thinks the variant is winning on mobile when really it’s just double-counting more efficiently.
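One way to catch this in your own data is to look for purchase events that share a transaction ID. The sketch below assumes you can export raw events as a list of records, and that each purchase carries a `transaction_id` (GA4's purchase event has a parameter by that name). The record layout, the sample data, and the function names are made up for illustration.

```python
from collections import Counter


def find_duplicate_conversions(events):
    """Return {transaction_id: count} for purchase IDs recorded more than once."""
    counts = Counter(
        e["transaction_id"] for e in events if e["name"] == "purchase"
    )
    return {tid: n for tid, n in counts.items() if n > 1}


def inflation(events):
    """Fraction by which raw purchase events overstate unique purchases."""
    ids = [e["transaction_id"] for e in events if e["name"] == "purchase"]
    unique = len(set(ids))
    return (len(ids) - unique) / unique if unique else 0.0


# Hypothetical export: T1002 fired twice on mobile after a re-render.
events = [
    {"name": "purchase", "transaction_id": "T1001", "device": "desktop"},
    {"name": "purchase", "transaction_id": "T1002", "device": "mobile"},
    {"name": "purchase", "transaction_id": "T1002", "device": "mobile"},
    {"name": "page_view", "transaction_id": None, "device": "mobile"},
    {"name": "purchase", "transaction_id": "T1003", "device": "mobile"},
]

duplicates = find_duplicate_conversions(events)
overcount = inflation(events)
```

Run the same check per device and per variant. If the overcount differs between groups, the double-firing is not just inflating your numbers, it is skewing the comparison.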

Your test tool and GA4 count sessions differently. Optimizely counts a visitor one way. GA4 counts them another. The numbers will never match perfectly, but when they diverge by more than 5-10%, you have a problem. I’ve seen cases where the test tool reported a 12% lift while GA4 showed the variant performing worse. Which one do you believe? Neither, probably.

Variant assignment gets lost on redirect. If your test involves a redirect (sending users to a different URL), the variant parameter can get stripped. This happens with certain CDN configurations, with some consent management platforms, and with cross-domain tracking setups. The user sees variant B but gets counted as variant A. Your results become noise.
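A quick way to see whether this is happening is to look for users who were logged under more than one variant. This is a minimal sketch that assumes you can pull exposure records as `(user_id, variant)` pairs from your test tool or analytics. The data shape and names are hypothetical.

```python
from collections import defaultdict


def users_with_mixed_variants(exposures):
    """Return {user_id: [variants]} for users recorded in more than one variant."""
    seen = defaultdict(set)
    for user_id, variant in exposures:
        seen[user_id].add(variant)
    return {u: sorted(v) for u, v in seen.items() if len(v) > 1}


# Hypothetical log: u2 was shown B, then counted as A after a redirect.
exposures = [
    ("u1", "A"),
    ("u1", "A"),
    ("u2", "B"),
    ("u2", "A"),
    ("u3", "B"),
]

mixed = users_with_mixed_variants(exposures)
```

In a healthy test this should come back empty, or very close to it.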

Revenue tracking that misses payment methods. Your purchase event fires correctly for credit card payments. But PayPal redirects users offsite and back, and the return URL doesn’t always trigger the same event flow. Or Apple Pay uses a different checkout modal that doesn’t hit the confirmation page. Now your revenue data is systematically missing, say, 20% of transactions, and your A/B test on checkout flow optimization is being judged on incomplete revenue.

Consent mode blocking the test tool. This one’s gotten worse since consent requirements tightened. User declines cookies, consent management platform blocks the testing script entirely, user never enters the test. But they still show up in your analytics (because you’re using cookieless measurement or consent mode). Now your test population and your analytics population are different groups of people.

How bad tracking biases your results

Let me be specific about what happens to your test results when tracking is broken, because the failure modes matter.

False positives. Your test says variant B wins when it actually performs the same or worse. This happens when tracking errors affect one variant more than the other. Double-firing events, for instance, might hit variant B harder because of a layout difference that causes more re-renders. You implement the “winner,” your actual conversion rate doesn’t change (or drops), and you blame it on regression to the mean. It wasn’t regression. The test was wrong.
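To make that concrete, here is the arithmetic with made-up numbers. Assume both variants truly convert at 3%, but 5% of conversions double-fire in variant A and 20% double-fire in variant B. All three figures are illustrative assumptions, not measurements.

```python
def observed_rate(true_rate, double_fire_share):
    """Conversion rate as recorded when a share of conversions is counted twice."""
    return true_rate * (1 + double_fire_share)


true_rate = 0.03  # identical real performance in both variants (assumed)

rate_a = observed_rate(true_rate, 0.05)  # 5% of conversions double-fire
rate_b = observed_rate(true_rate, 0.20)  # 20% of conversions double-fire

# Relative "lift" of B over A that exists only in the tracking.
apparent_lift = rate_b / rate_a - 1
print(f"Apparent lift: {apparent_lift:.1%}")
```

That works out to an apparent lift of roughly 14% from two variants that perform identically.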

False negatives. Your test says there’s no significant difference when variant B actually is better. This happens when tracking noise drowns out the real signal. If your conversion data has a 15% error rate, and the true lift from your variant is 8%, good luck detecting that. The noise is bigger than the signal. You call the test inconclusive and move on, missing a real improvement.

Underpowered tests that run forever. When your conversion data is noisy because of tracking errors, the variance in your data increases. Higher variance means you need a larger sample size to detect the same effect. Tests that should take two weeks take six. Tests that should take six weeks never reach significance at all. Your team starts to think “our traffic is too low for testing” when actually your traffic is fine. Your data is just noisy.
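Here is a rough sketch of one simple case: events that get dropped at random, equally in both variants. It uses the standard normal-approximation sample size formula for comparing two proportions. The 3% baseline, 8% relative lift, and 20% event loss are assumptions picked for illustration, and other kinds of tracking error (double-firing, misattributed variants) will behave differently.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_base, rel_lift, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect a relative lift (two-sided test)."""
    p_var = p_base * (1 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_var - p_base) ** 2)


# Clean tracking: 3% baseline, 8% relative lift.
clean = sample_size_per_arm(0.03, 0.08)

# Same test, but 20% of conversion events never get recorded,
# so the measured baseline drops to 2.4% while the relative lift stays 8%.
lossy = sample_size_per_arm(0.03 * 0.8, 0.08)

extra = lossy / clean - 1
print(f"clean: {clean}, lossy: {lossy}, extra traffic needed: {extra:.0%}")
```

Even this mild, evenly spread failure means you need about a quarter more traffic for the same test. Errors that add real noise, rather than just shrinking the counts, cost more.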

The pre-testing audit: 5 things to verify

Before you run any A/B test, check these five things. Not some of them. All of them.

1. Fire every conversion event manually and count. Debug your tracking by going through your funnel on desktop Chrome, mobile Safari, mobile Chrome, and Firefox. Complete a conversion on each. Check your analytics. Did exactly one conversion event fire per completion? On every browser? If not, stop. Fix that first.

2. Compare session counts between your test tool and analytics. Install your testing tool and run a simple A/A test (same experience for both groups) for a week. Compare the session count in your test tool to GA4. If they diverge by more than 10%, investigate why. Common causes: bot filtering differences, consent blocking one tool but not the other, timezone mismatches.

3. Test every conversion path. Don’t just test credit card checkout. Test PayPal. Test Apple Pay. Test Google Pay. Test the mobile checkout flow. Test what happens when someone completes a purchase after leaving the site and coming back. Every path that leads to a conversion should fire the event correctly.

4. Verify variant assignment persistence. Start a test, get assigned to variant B, end your session (close the browser, but keep your cookies), and come back. Are you still in variant B? Now do the same on mobile. Now do it after going through a redirect. If variant assignment doesn’t stick, your test is mixing audiences.

5. Check consent mode interactions. Decline all cookies. Does the test tool still load? Does it still assign variants? If not, what percentage of your users decline cookies? If 30% of your traffic opts out and is excluded from tests, your test results only represent 70% of your audience. That might be fine. But you should know.
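Check 2 is easy to turn into a number you can track. The sketch below compares the two session counts from an A/A week and flags anything past the 10% line. Measuring the gap against the larger count is my own choice here, and the session figures are invented.

```python
def divergence(test_tool_sessions, analytics_sessions):
    """Relative gap between two session counts, measured against the larger one."""
    larger = max(test_tool_sessions, analytics_sessions)
    if larger == 0:
        return 0.0
    return abs(test_tool_sessions - analytics_sessions) / larger


def needs_investigation(test_tool_sessions, analytics_sessions, threshold=0.10):
    """True when the two tools disagree by more than the threshold."""
    return divergence(test_tool_sessions, analytics_sessions) > threshold


# Hypothetical A/A week: test tool saw 18,400 sessions, GA4 saw 21,000.
gap = divergence(18_400, 21_000)
flag = needs_investigation(18_400, 21_000)
print(f"gap: {gap:.1%}, investigate: {flag}")
```

A gap like this one (about 12%) is the point where you go looking at bot filtering, consent blocking, and timezone settings before trusting either number.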

When to fix first vs. test anyway

I’m not going to tell you that everything needs to be perfect before you test. That’s not realistic. But there’s a threshold.

If your conversion tracking is off by less than 5%, and the error is evenly distributed across variants, you can probably test. The noise is small enough that a real winner will still show up, especially for larger effect sizes.

If your conversion tracking is off by more than 10%, or the error is unevenly distributed, fix it first. I don’t care how much pressure you’re under to show test results. Running tests on bad data is worse than running no tests at all, because bad test results lead to bad decisions that look data-driven. That’s more dangerous than going with your gut.
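If it helps to have the threshold written down somewhere your team can't argue with it, it can be sketched as a small function. The two thresholds come from the paragraphs above. The middle band (5% to 10% error, evenly distributed) isn't covered by them, so labeling it a judgment call is my assumption, as are the function name and return strings.

```python
def readiness(error_rate, evenly_distributed):
    """Classify a tracking setup using the 5% / 10% thresholds."""
    if error_rate > 0.10 or not evenly_distributed:
        return "fix first"
    if error_rate < 0.05:
        return "ok to test"
    # Between the two thresholds: not covered by the rule, decide case by case.
    return "judgment call"


print(readiness(0.03, evenly_distributed=True))   # small, even error
print(readiness(0.03, evenly_distributed=False))  # small but uneven error
print(readiness(0.12, evenly_distributed=True))   # large error
```

Note that an uneven error sends you to "fix first" no matter how small it is, because that is the kind of error that manufactures fake winners.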

Here’s my rule: spend one sprint fixing measurement before you spend a quarter running tests. If your tracking is solid, your tests will be faster, more accurate, and more likely to produce actionable results. If your tracking is broken, you’ll spend that quarter generating numbers that look like insights but aren’t.

The uncomfortable truth

A/B testing is supposed to remove opinion from decision-making. It’s supposed to let the data decide. But if your data is wrong, you’re still making opinion-based decisions. You’re just dressing them up in statistics.

The companies that get the most value from testing aren’t the ones with the fanciest test ideas or the most sophisticated statistical models. They’re the ones that invested in getting their measurement right first. Boring? Yes. But boring infrastructure work is what separates testing programs that actually move revenue from testing programs that produce impressive-looking reports full of meaningless numbers.

Check your tracking. Then test.