
A/B Testing

Controlled experiment comparing two product variants to measure the causal effect of a change on user behavior. The gold standard for validating that changes drive desired outcomes.

What is A/B Testing?

An A/B test is a randomized controlled experiment in which one portion of users experiences variant A (the control, usually the existing design) and another portion experiences variant B (the treatment, the new design). Random assignment makes the two groups comparable on average, so the change under test is the only systematic difference between them. Comparing a specific metric (conversion rate, time-to-task, retention, etc.) across the groups then isolates the causal effect of that change. A/B testing is the most rigorous method for validating that a product change actually improves outcomes, not in a small, curated user test, but at scale in real use.
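
One common way to implement random assignment is to hash a stable user identifier, so that the same user always sees the same variant without storing any state. The sketch below illustrates that idea under stated assumptions: the function name `assign_variant`, the experiment name, and the user ID format are all hypothetical, and a real system may assign users differently.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'A' (control) or 'B' (treatment).

    Hypothetical helper: salting the hash with the experiment name keeps
    assignments in different experiments independent of each other.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as an integer and scale it to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "B" if bucket < treatment_share else "A"


if __name__ == "__main__":
    counts = {"A": 0, "B": 0}
    for i in range(10_000):
        counts[assign_variant(f"user-{i}", "checkout-button")] += 1
    # The split should be close to 50/50, though not exactly even.
    print(counts)
```

Because the assignment depends only on the user ID and the experiment name, a returning user lands in the same group on every visit, which keeps the two groups cleanly separated for the whole test.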

The power of A/B testing lies in its ability to detect small, real effects. A 2% improvement in conversion sounds modest but compounds across millions of users into substantial revenue. Conversely, changes that feel obviously good in design review can actually harm metrics when exposed to real users. A/B testing replaces conviction with evidence.

Test Design: Hypothesis, Metric, Sample Size

Before running a test, you must define: What is your hypothesis (why you believe the change will improve outcomes)? What metric validates it (the outcome variable you measure)? What’s the minimum effect size you’d consider meaningful? How many users do you need to detect that effect with confidence?

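Sample size depends on baseline conversion, the minimum effect size you want to detect, the significance level, and statistical power. A rare event (2% baseline conversion) requires much larger samples than a common event (50% baseline) to detect the same relative lift. Tests of sizable effects on common events often need 1,000-10,000 users per variant to reach statistical significance; small lifts on rare events can need tens of thousands or more. Statistical significance at the conventional 5% level means a difference this large would appear less than 5% of the time if the variants truly performed the same. Running a test for too short a time risks measuring novelty effects (users change behavior because something is new) and missing slower patterns; running far longer than the planned sample requires mostly delays the decision.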
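
To make the dependence on baseline and effect size concrete, here is a minimal sketch of the standard normal-approximation formula for comparing two proportions. The function name and the default values (two-sided 5% significance, 80% power) are assumptions chosen for illustration, not a prescription.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Same 10% relative lift, very different baselines.
    print("2% baseline: ", sample_size_per_variant(0.02, 0.10))
    print("50% baseline:", sample_size_per_variant(0.50, 0.10))
```

Under these assumptions, the 2% baseline needs a sample roughly fifty times larger than the 50% baseline to detect the same relative lift.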

Runtime Duration & Stopping Rules

Calendar duration matters. A test running Monday-Friday produces different results than one running Friday-Monday, because user behavior changes by day of week. Most tests run at least one full week and often two weeks to average across day-of-week and day-in-cycle effects. Stopping a test early because “results look good” introduces bias—you’re more likely to stop on a lucky streak.
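
The cost of stopping on a lucky streak can be illustrated with a small A/A simulation, in which both groups have the same true conversion rate, so every "significant" result is a false positive. This is a sketch under assumed parameters (14 days, 200 users per day per group, 10% conversion), all of which are hypothetical.

```python
import math
import random


def z_stat(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Pooled two-proportion z statistic."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se


def simulate_aa(days: int = 14, users_per_day: int = 200, rate: float = 0.10,
                sims: int = 500, seed: int = 0) -> tuple[float, float]:
    """Return (false-positive rate with daily peeking, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _sim in range(sims):
        conv_a = conv_b = n = 0
        crossed = False
        z = 0.0
        for _day in range(days):
            conv_a += sum(rng.random() < rate for _ in range(users_per_day))
            conv_b += sum(rng.random() < rate for _ in range(users_per_day))
            n += users_per_day
            z = z_stat(conv_a, n, conv_b, n)
            if abs(z) > 1.96:      # a team that peeks would stop here
                crossed = True
        peeking_hits += crossed
        fixed_hits += abs(z) > 1.96  # only the final, pre-planned look counts
    return peeking_hits / sims, fixed_hits / sims


if __name__ == "__main__":
    peeking, fixed = simulate_aa()
    print(f"false positives with daily peeking: {peeking:.1%}")
    print(f"false positives with a single final look: {fixed:.1%}")
```

Since the final look is one of the daily looks, the peeking rate can never be lower than the fixed-duration rate, and in practice it tends to be noticeably higher.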

Sequential testing (pre-planned interim analyses with adjusted significance thresholds) is more sophisticated: you can stop early with confidence if the effect is clearly strong, or if the data show it is very unlikely to reach a meaningful size. But this requires statistical discipline and is rare in practice. Most product teams use fixed-duration tests with predetermined stopping rules.
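
As a rough illustration of what an adjusted threshold looks like, the sketch below splits the overall significance level evenly across the planned looks (a Bonferroni correction). This is a deliberately simple and conservative stand-in, not the method the text has in mind; dedicated group-sequential boundaries are less conservative. It covers only stopping for a strong effect, not stopping for futility, and the function names are hypothetical.

```python
from statistics import NormalDist


def look_threshold(alpha: float = 0.05, planned_looks: int = 4) -> float:
    """Two-sided |z| threshold to apply at each of the pre-planned looks."""
    per_look_alpha = alpha / planned_looks  # Bonferroni split of the error budget
    return NormalDist().inv_cdf(1 - per_look_alpha / 2)


def should_stop(z: float, alpha: float = 0.05, planned_looks: int = 4) -> bool:
    """True if the observed z statistic clears the adjusted threshold at this look."""
    return abs(z) > look_threshold(alpha, planned_looks)


if __name__ == "__main__":
    print("single look: ", round(look_threshold(planned_looks=1), 3))
    print("four looks:  ", round(look_threshold(planned_looks=4), 3))
    print("stop at z = 2.1 with four looks?", should_stop(2.1))
```

The key point is that each interim look must clear a stricter bar than the usual 1.96, and the number of looks has to be fixed before the test starts.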

Why It Matters for Product People

A/B testing is the discipline that separates product leadership from guessing-at-scale. Everyone has intuitions about what works. Testing forces those intuitions against reality. The surprising finding (from countless tests) is how often strong intuition is wrong—changes that designers are convinced will help actually harm metrics; others unexpectedly drive engagement.

For executives, A/B testing provides the language for resource allocation: instead of “I believe this will improve retention,” it becomes “We tested this and measured a 3% lift in 30-day retention with 95% confidence.” This shifts conversations from opinion to accountability.
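
A statement like the one above rests on a simple calculation. The following sketch shows one way to compute the lift, a p-value, and a confidence interval for the difference between two proportions using the normal approximation. The function name and the retention counts in the example are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist


def compare_proportions(conv_a: int, n_a: int, conv_b: int, n_b: int,
                        confidence: float = 0.95) -> dict:
    """Two-sided z-test and confidence interval for B minus A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error for the test of "no difference".
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the interval around the observed difference.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return {
        "absolute_diff": diff,
        "relative_lift": diff / p_a,
        "p_value": p_value,
        "ci": (diff - z_crit * se, diff + z_crit * se),
    }


if __name__ == "__main__":
    # Hypothetical counts: users retained at day 30 out of users exposed.
    result = compare_proportions(conv_a=4000, n_a=10_000, conv_b=4120, n_b=10_000)
    for key, value in result.items():
        print(key, value)
```

Reporting the interval alongside the point estimate makes clear how much uncertainty remains: an observed 3% relative lift is only convincing if the interval for the difference excludes zero.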

A/B testing is central to experiment-driven development and hypothesis-driven development methodologies. It depends on robust feature flag infrastructure to safely deploy variants. Results feed into cohort analysis to understand which user segments benefit most (and which may be harmed).