A/B Test Calculator
What is an A/B test calculator?
An A/B test calculator is a statistical tool that compares two versions of a webpage, email, advertisement, or product experience to determine whether the difference in their conversion rates is statistically significant. You enter the number of visitors and conversions for a control group (A) and a variant group (B), and the calculator runs a two-proportion z-test to tell you whether the observed difference is larger than chance alone would plausibly produce.
A/B testing is the standard method behind most website and product experiments. Instead of guessing which headline, button color, or checkout flow performs better, a team splits traffic between two versions and lets the data decide. The calculator's job is to translate raw counts — visitors and conversions — into a defensible statistical conclusion: real difference, or random noise. Statistics Fundamentals covers the full statistical theory behind this method in the hypothesis testing guide.
A/B test formulas
An A/B test rests on four formulas: conversion rate, conversion lift, the pooled standard error, and the z-score. The z-score is then converted into a p-value to determine significance.
Conversion Rate
CR = Conversions / Visitors
Example:
CR_A = 250 / 5000 = 0.0500 (5.00%)
CR_B = 295 / 5000 = 0.0590 (5.90%)
Conversion Lift (Relative)
Lift = (CR_B − CR_A) / CR_A × 100
Example:
(0.0590 − 0.0500) / 0.0500 × 100 = 18.0%
Pooled Standard Error
p̄ = (x_A + x_B) / (n_A + n_B)
SE = √[ p̄(1−p̄)(1/n_A + 1/n_B) ]
Where x = conversions, n = visitors
Z-Score & Decision Rule
Z = (p̂_B − p̂_A) / SE
If p-value < 0.05 → reject H₀ (statistically significant)
The pooled standard error assumes the null hypothesis is true — that both variants share a single underlying conversion rate — which is the standard approach for a two-proportion z-test used in significance testing. This differs slightly from the unpooled standard error used when building a confidence interval for the difference between two independent proportions; both versions are common in CRO tooling. See the proportion hypothesis testing page for the full derivation.
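The four formulas above can be chained together in a few lines. The following is a minimal Python sketch, not the calculator's actual implementation: the function name `two_proportion_z_test` is illustrative, and it assumes a two-sided p-value, which this page does not state explicitly.

```python
import math


def two_proportion_z_test(visitors_a, conversions_a, visitors_b, conversions_b):
    """Two-proportion z-test with the pooled standard error (two-sided, by assumption)."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate: the single conversion rate both groups share under the null hypothesis.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / se
    # Two-sided p-value from the standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    lift_pct = (rate_b - rate_a) / rate_a * 100
    return {"rate_a": rate_a, "rate_b": rate_b, "lift_pct": lift_pct,
            "se": se, "z": z, "p_value": p_value}


result = two_proportion_z_test(5000, 250, 5000, 295)
print(f"lift = {result['lift_pct']:.1f}%, z = {result['z']:.2f}, p = {result['p_value']:.3f}")
```

For the running example (250 / 5,000 versus 295 / 5,000) this gives z ≈ 1.98 and a two-sided p-value of about 0.047, which is just under the 0.05 threshold.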
Assumptions behind the two-proportion z-test
The two-proportion z-test that powers this calculator depends on a small set of conditions. Results are reliable when each one holds.
- Each visitor must be randomly assigned to A or B, and one visitor's outcome should not affect another's.
- Both n×p̂ and n×(1−p̂) should be at least 5 for each group, so the normal approximation to the binomial distribution holds.
- The test should run to a pre-calculated sample size rather than being stopped the moment results look favorable.
- Testing many metrics or variants at once and reporting only the significant one inflates the false-positive rate (the multiple comparison problem).
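The sample-size condition is easy to check before running the test. A small sketch, assuming the threshold of 5 stated above (the helper name `normal_approximation_ok` is made up for illustration):

```python
def normal_approximation_ok(visitors, conversions, minimum=5):
    """Return True when n*p_hat and n*(1 - p_hat) both reach the minimum count."""
    # n * p_hat is simply the conversion count; n * (1 - p_hat) is the non-conversion count.
    successes = conversions
    failures = visitors - conversions
    return successes >= minimum and failures >= minimum


print(normal_approximation_ok(5000, 250))  # True: 250 conversions, 4,750 non-conversions
print(normal_approximation_ok(40, 2))      # False: only 2 conversions
```

Run the check separately for each variant; if either group fails, the z-test's p-value should be treated with caution.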
How many visitors do you need for an A/B test?
The required sample size depends on four inputs: your baseline conversion rate, the minimum lift you care about detecting, your chosen confidence level, and your statistical power. Smaller effects, lower baseline rates, and stricter confidence requirements all increase the number of visitors you need.
Table: approximate sample size per variant at 95% confidence, 80% power
| Baseline Rate | 10% Relative Lift | 20% Relative Lift | 30% Relative Lift |
|---|---|---|---|
| 2% | ~75,200 | ~18,800 | ~8,400 |
| 5% | ~29,000 | ~7,260 | ~3,230 |
| 10% | ~13,500 | ~3,400 | ~1,520 |
| 20% | ~5,880 | ~1,480 | ~660 |
Figures are rounded estimates; use the Sample Size tab in the calculator for an exact number based on your own inputs.
Statistical power is the probability of detecting a real effect when one exists — it guards against a false negative (Type II error). The industry-standard power level is 80%; some teams running high-stakes tests use 90%, which requires a larger sample. Evan Miller's widely used A/B test sample size calculator applies the same underlying formula and is a common cross-check in CRO teams.
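To show how the four inputs interact, here is a hedged Python sketch of one common normal-approximation formula for the per-variant sample size, n = (z₁₋α/₂ + z₁₋β)² × [p₁(1−p₁) + p₂(1−p₂)] / (p₂ − p₁)². This is an assumption for illustration: tools differ in the variance term they use, so the output can differ by 10% or more from the rounded table figures above and from the calculator's Sample Size tab.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_lift, confidence=0.95, power=0.80):
    """Approximate visitors needed per variant (two-sided test, unpooled variance)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)  # rate the variant would reach at the target lift
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the confidence level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the power level
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return math.ceil(n)


for lift in (0.10, 0.20, 0.30):
    print(f"5% baseline, {lift:.0%} relative lift: {sample_size_per_variant(0.05, lift):,}")
```

The pattern matches the text: halving the lift you want to detect roughly quadruples the required sample, and raising power from 80% to 90% increases it further.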
How long should an A/B test run?
An A/B test should run until it reaches its planned sample size, and for at least one to two full business cycles — typically one to two weeks — so day-of-week and other recurring effects average out across both variants. Divide the total visitors needed by your daily eligible traffic to estimate the number of days, then round up to the nearest full week.
Worked example: A test needs 3,070 visitors per variant (6,140 total). The page gets 900 eligible visitors per day. 6,140 ÷ 900 ≈ 6.8 days, rounded up to 7 days. Because 7 days covers only a single weekly cycle, the recommended run time is 14 days (two weeks), so that every weekday and weekend day is observed twice.
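The same duration arithmetic as a short Python sketch. The function name and the `min_weeks=2` default are assumptions that mirror the two-week recommendation in the worked example:

```python
import math


def estimated_test_days(visitors_per_variant, daily_eligible_visitors, variants=2, min_weeks=2):
    """Days to run: enough traffic for the planned sample, rounded up to whole weeks."""
    total_visitors = visitors_per_variant * variants
    raw_days = math.ceil(total_visitors / daily_eligible_visitors)
    # Round up to full weeks so every day of the week appears equally often.
    full_weeks = math.ceil(raw_days / 7)
    return max(full_weeks, min_weeks) * 7


print(estimated_test_days(3070, 900))  # 14
```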
How to use the A/B test calculator
Collect visitor and conversion counts for both variants, enter them with your chosen confidence level, and read the significance, lift, and confidence interval from the results panel. Here is the complete process.
- Record the number of visitors and conversions for Variant A (the current, unchanged version).
- Record the same two numbers for Variant B (the new version being tested).
- Choose a confidence level: 95% is the standard choice for most experiments; use 99% for higher-stakes decisions.
- Check the p-value against your threshold (commonly 0.05) and confirm whether the confidence interval for the difference excludes zero.
- Look at both absolute lift (percentage points) and relative lift (percent change); they tell different parts of the story.
- Deploy the winning variant if the result is significant and the sample size was reached as planned; otherwise continue testing or conclude the test was inconclusive.
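The confidence-interval check in the steps above can be sketched as follows. This assumes the unpooled standard error described in the formulas section; the function name is illustrative, and the calculator's own implementation may differ in detail.

```python
import math
from statistics import NormalDist


def difference_confidence_interval(visitors_a, conversions_a, visitors_b, conversions_b,
                                   confidence=0.95):
    """Confidence interval for (rate_b - rate_a) using the unpooled standard error."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    diff = rate_b - rate_a  # absolute lift, as a proportion
    # Unpooled: each group contributes its own variance estimate.
    se = math.sqrt(rate_a * (1 - rate_a) / visitors_a + rate_b * (1 - rate_b) / visitors_b)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff - z_star * se, diff + z_star * se


low, high = difference_confidence_interval(5000, 250, 5000, 295)
print(f"95% CI for the difference: [{low:.4f}, {high:.4f}]")
```

For the running example (250 / 5,000 versus 295 / 5,000) the interval runs from roughly 0.0001 to 0.0179, that is, about 0.01 to 1.79 percentage points. It excludes zero, but only narrowly.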
Worked A/B testing examples
The five scenarios below show typical inputs and outputs across common test types. Numbers are illustrative; paste them into the calculator above to confirm.
Table: five worked A/B test scenarios
| Test | Visitors A / B | Conversions A / B | Lift | P-Value (two-sided) | Result |
|---|---|---|---|---|---|
| Headline A vs B | 4,000 / 4,000 | 180 / 224 | +24.4% | 0.025 | Significant |
| CTA button: red vs green | 6,200 / 6,150 | 372 / 369 | 0.0% | 1.00 | Not significant |
| Checkout: 1-step vs 2-step | 3,500 / 3,480 | 245 / 298 | +22.3% | 0.015 | Significant |
| Email subject line | 12,000 / 12,000 | 1,440 / 1,512 | +5.0% | 0.157 | Not significant |
| Pricing toggle default | 2,100 / 2,090 | 105 / 146 | +39.7% | 0.007 | Significant |
Interpretation notes
The CTA button test shows why color alone rarely moves conversion: both variants convert at exactly 6.00%, so the lift is zero and p = 1.00, meaning the data is fully consistent with no real difference. The pricing toggle test shows the opposite pattern: a large relative lift on a smaller sample can still clear the significance bar when the effect is strong enough relative to the noise. The email subject line test illustrates a non-significant result that still leans positive (p = 0.157, above 0.05): treat this as inconclusive rather than as a loss, and consider running it longer or retesting.
A/B testing methodologies compared
A/B testing vs. multivariate testing
| Aspect | A/B Testing | Multivariate Testing (MVT) |
|---|---|---|
| What changes | One element, two versions | Multiple elements, multiple combinations |
| Traffic required | Lower | Higher — splits across many combinations |
| What you learn | Which version wins | Which elements interact and how |
| Best for | High-impact, single changes | Mature, high-traffic pages with several variables |
Bayesian vs. Frequentist A/B testing
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Core output | P-value against a fixed null hypothesis | Probability that B beats A, given the data |
| Sample size | Fixed, set before the test | Can update continuously as data arrives |
| Interpretation | Requires care; not a direct probability statement | More directly intuitive ("87% chance B is better") |
| This calculator | Uses this approach | Not covered here |
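
To make the "probability that B beats A" output in the table concrete, here is a rough Monte Carlo sketch in Python. The calculator on this page does not do this; the sketch assumes uniform Beta(1, 1) priors on both conversion rates, and the function name, draw count, and seed are arbitrary choices for illustration.

```python
import random


def probability_b_beats_a(visitors_a, conversions_a, visitors_b, conversions_b,
                          draws=20000, seed=0):
    """Estimate P(rate_b > rate_a) from Beta posteriors under uniform Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each rate: Beta(1 + conversions, 1 + non-conversions).
        sample_a = rng.betavariate(1 + conversions_a, 1 + visitors_a - conversions_a)
        sample_b = rng.betavariate(1 + conversions_b, 1 + visitors_b - conversions_b)
        if sample_b > sample_a:
            wins += 1
    return wins / draws


print(f"P(B beats A) ~ {probability_b_beats_a(5000, 250, 5000, 295):.3f}")
```

The result is a direct probability statement about the variants, which is a different quantity from the frequentist p-value and should not be compared against the 0.05 threshold.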
Statistical significance vs. practical significance
| Aspect | Statistical Significance | Practical Significance |
|---|---|---|
| Question answered | Is the difference likely real, not chance? | Is the difference large enough to matter to the business? |
| Driven by | Sample size and effect size | Revenue, cost, and implementation effort |
| Risk | Tiny, meaningless effects can become significant at huge sample sizes | A meaningful-looking lift can still be statistical noise on a small sample |
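
The "Risk" row can be illustrated with made-up numbers: the same 5.00% versus 5.10% conversion rates tested at two very different sample sizes. This is a sketch with hypothetical counts, using a two-sided pooled z-test.

```python
import math


def two_sided_p_value(visitors_a, conversions_a, visitors_b, conversions_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (conversions_b / visitors_b - conversions_a / visitors_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: identical rates (5.00% vs 5.10%), different sample sizes.
small = two_sided_p_value(10_000, 500, 10_000, 510)
huge = two_sided_p_value(1_000_000, 50_000, 1_000_000, 51_000)
print(f"10,000 per variant:    p = {small:.3f}")
print(f"1,000,000 per variant: p = {huge:.4f}")
```

The 2% relative lift is far from significant at 10,000 visitors per variant but clearly significant at one million. Whether a 0.1 percentage point gain is worth shipping is a business question the p-value cannot answer.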
Common A/B testing mistakes to avoid
- Checking results daily and stopping as soon as p drops below 0.05 inflates the false-positive rate well above the stated 5%. Decide the sample size in advance and wait for it.
- Running a test with too few visitors makes it unlikely to detect a real, smaller effect even if one exists; the test ends inconclusive by design.
- Testing many metrics or many variants and reporting only the one that happened to be significant overstates the evidence. Use a correction (such as Bonferroni) or pre-register the primary metric.
- Running a test for three days captures only part of a weekly cycle. Run for full weeks so weekday and weekend behavior are both represented in both variants.
- If your 50/50 split actually delivers, for example, 5,400 visitors to A and 4,600 to B, something in the randomization or tracking is broken, and results should not be trusted until it is fixed.
- A p-value below 0.05 does not mean there is a 95% probability the variant is better; it means the observed data would be unlikely if there were truly no difference. See the interpretation section below for the precise wording.
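Two of the checks above are simple to automate. The sketch below is one possible approach, not something this page prescribes: it tests for sample ratio mismatch with a chi-square goodness-of-fit test (1 degree of freedom) and computes a Bonferroni-adjusted threshold. Function names are illustrative.

```python
import math


def srm_p_value(visitors_a, visitors_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value for the observed traffic split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi_square = ((visitors_a - expected_a) ** 2 / expected_a
                  + (visitors_b - expected_b) ** 2 / expected_b)
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi_square / 2))


def bonferroni_threshold(alpha, comparisons):
    """Per-comparison significance threshold when testing several metrics or variants."""
    return alpha / comparisons


print(f"5,400 vs 4,600: p = {srm_p_value(5400, 4600):.2e}")  # far below any usual threshold
print(f"5,030 vs 4,970: p = {srm_p_value(5030, 4970):.2f}")  # consistent with a 50/50 split
print(f"threshold for 5 metrics: {bonferroni_threshold(0.05, 5):.3f}")
```

A very small SRM p-value means the split itself is implausible under correct randomization, so the conversion comparison should wait until the cause is found.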
How to interpret A/B test results correctly
A p-value tells you how surprising your data would be if the two variants truly had identical conversion rates — it is not the probability that the variant is better. The correct reading separates the statistical claim from the business decision.
✓ Correct interpretation: "If A and B truly performed the same, data this extreme would occur less than 5% of the time. We treat this as evidence that the difference is real."
This distinction matters because the p-value is conditional on the null hypothesis being true; it says nothing directly about the probability that the null hypothesis itself is true. The American Statistical Association's formal guidance on this point, published in The American Statistician (2016), is considered the standard reference for correct p-value interpretation across applied fields.
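One way to see what the p-value does and does not claim is to simulate A/A tests, where both groups share the same true conversion rate. The sketch below uses hypothetical parameters (10% true rate, 500 visitors per group, 2,000 simulated tests) and a two-sided pooled z-test.

```python
import math
import random


def two_sided_p_value(n_a, x_a, n_b, x_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (x_b / n_b - x_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(true_rate=0.10, visitors=500, trials=2000, alpha=0.05, seed=1):
    """Share of simulated A/A tests that come out 'significant' purely by chance."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(trials):
        # Both groups are drawn from the same true rate, so the null hypothesis holds.
        x_a = sum(rng.random() < true_rate for _ in range(visitors))
        x_b = sum(rng.random() < true_rate for _ in range(visitors))
        if two_sided_p_value(visitors, x_a, visitors, x_b) < alpha:
            significant += 1
    return significant / trials


print(f"Share of A/A tests with p < 0.05: {false_positive_rate():.3f}")
```

The share should land near 5%: that is what the threshold controls. It describes how often identical variants produce a "significant" result, not the probability that any particular winner is real.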
A/B testing formula and term glossary
The table below collects the core formulas and terms used throughout this page, for quick reference.
Table: A/B testing glossary — 12 key terms
| Term | Symbol / Formula | Plain-English definition | Primary use |
|---|---|---|---|
| A/B Test | N/A | A randomized experiment comparing a control (A) and a variant (B) | Comparing page or product versions |
| Conversion Rate | CR = Conversions / Visitors | Share of visitors who complete a target action | Baseline metric for every test |
| Conversion Lift | (CR_B − CR_A) / CR_A × 100 | Relative percentage change in conversion rate | Headline metric for stakeholders |
| Statistical Significance | Via two-proportion z-test | Whether the observed difference would be unlikely if only random chance were at work | Deciding whether to trust a result |
| P-Value | From z-score | Probability of data this extreme if there were truly no difference | Primary decision metric |
| Confidence Interval | (p̂_B − p̂_A) ± z* × SE | Range likely to contain the true difference between variants | Communicating uncertainty around the lift |
| Z-Test (two-proportion) | Z = (p̂_B − p̂_A) / SE | Statistical test comparing two population proportions | The math engine of this calculator |
| Statistical Power | 1 − β | Probability of detecting a real effect when one exists | Sample size planning |
| MDE | Set by the user | Minimum Detectable Effect — smallest lift you care about catching | Sample size calculation |
| Sample Size | Per-variant n | Number of visitors required per variant before analyzing results | Test planning, avoiding underpowered tests |
| Sample Ratio Mismatch | Observed split ≠ intended split | A sign that randomization or tracking is broken | Pre-analysis data quality check |
| Significance Level (α) | 1 − confidence level | Maximum acceptable false-positive rate, usually 0.05 | Setting the p-value threshold before testing |
Sources and further reading
Authority sources cited in this guide:
- Wasserstein, R.L. & Lazar, N.A. (2016). "The ASA Statement on p-Values." The American Statistician. tandfonline.com
- National Institute of Standards and Technology (NIST). Engineering Statistics Handbook — Two-Sample Tests for Proportions. itl.nist.gov
- Miller, E. How Not to Run an A/B Test and Sample Size Calculator. evanmiller.org
- Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments. Cambridge University Press.
- Optimizely. Stats Engine Methodology. optimizely.com
- OpenStax. Introductory Statistics, Chapter 10: Hypothesis Testing with Two Samples. openstax.org
Frequently asked questions
An A/B test calculator compares two versions of a page, email, or product experience by running a statistical test on conversion data. It tells you whether the difference between the two versions is large enough to be real, rather than the result of random visitor variation. This tool runs a two-proportion z-test and returns conversion rates, lift, p-value, z-score, and a confidence interval.
To calculate statistical significance, compute the conversion rate for each variant, find the pooled standard error, divide the difference in rates by that standard error to get a z-score, and convert the z-score to a p-value using the standard normal distribution. If the p-value is below your significance threshold (commonly 0.05), the result is statistically significant. Use the Significance Test tab above to run this calculation automatically.
Relative lift is (Conversion Rate B − Conversion Rate A) ÷ Conversion Rate A, expressed as a percentage. Absolute lift is simply Conversion Rate B minus Conversion Rate A, in percentage points. Both numbers matter: relative lift is easier to compare across tests, while absolute lift shows the raw size of the effect.
Required sample size depends on your baseline conversion rate, the minimum lift you want to detect, your confidence level, and your statistical power. As a rough reference, detecting a 20% relative lift on a 5% baseline rate at 95% confidence and 80% power requires roughly 7,260 visitors per variant. Use the Sample Size tab above to calculate the exact number for your situation.
95% confidence is the standard choice for most A/B tests and corresponds to a 0.05 significance threshold. Use 99% confidence for high-stakes decisions, such as changes to a checkout flow or pricing page, where a false positive is costly. Lower thresholds like 90% are sometimes used for low-risk, exploratory tests where speed matters more than certainty.
An A/B test should run until it reaches its planned sample size, and for at least one to two full business cycles, typically one to two weeks, so day-of-week effects average out across both variants. Stopping a test the moment it first looks significant, before the planned sample size, inflates the false-positive rate.
Statistical power is the probability of correctly detecting a real difference between variants when one truly exists. It is calculated as 1 minus the Type II error rate (β). The standard target is 80% power; some teams use 90% power for higher-stakes tests, which requires a larger sample size.
The minimum detectable effect is the smallest relative lift you want the test to be able to catch reliably. Setting a smaller MDE makes a test more sensitive but requires a much larger sample size; setting a larger MDE requires fewer visitors but will miss smaller real improvements.
A p-value below 0.05 is the conventional threshold for statistical significance, corresponding to a 95% confidence level. Some teams use a stricter 0.01 threshold for high-stakes decisions, while exploratory tests sometimes use 0.10. The threshold should be set before the test starts, not chosen after looking at the data.
The confidence interval for the difference between two proportions is (p̂_B − p̂_A) ± z* × SE, where SE is the unpooled standard error and z* is the critical value for your chosen confidence level. If this interval does not contain zero, the result is statistically significant at that confidence level.
An A/B test calculator takes visitor and conversion counts for two groups, computes each group's conversion rate, calculates the pooled standard error under the assumption of no difference, derives a z-score, and converts that z-score into a p-value using the cumulative standard normal distribution. The calculator on this page performs all of these steps and shows the full working in the Step-by-Step tab.
A two-proportion z-test is a statistical test used to determine whether two population proportions — such as two conversion rates — differ significantly. It assumes large enough samples for the normal approximation to the binomial distribution to hold, and it is the standard method behind most A/B testing tools, including this calculator.
Frequentist A/B testing, used by this calculator, computes a p-value and asks how surprising the observed data would be if there were truly no difference between variants. Bayesian A/B testing instead estimates the probability that one variant beats the other, given the data and a prior belief, and can update continuously as data arrives. Frequentist tests require a fixed sample size set in advance; Bayesian methods are generally more tolerant of early peeking, though both approaches can be misused.
Stop an A/B test when it reaches the sample size calculated before the test began, and after it has run for at least one full business cycle. Stopping earlier because the result happens to look significant on a given day is a common source of false positives, often called peeking bias.
Results from a small sample can still be statistically valid if the calculated p-value is significant and both np̂ and n(1−p̂) are at least 5 for each variant, but small samples make it harder to detect real, smaller effects and produce wider confidence intervals. Checking the confidence interval, not just the p-value, gives a more complete picture of the uncertainty involved.
The most common causes of misleading A/B test results are stopping a test early after repeated peeking, testing multiple metrics or variants without correcting the significance threshold, and running a test too briefly to account for day-of-week or seasonal variation. Setting a sample size and significance threshold in advance, and sticking to them, addresses most of these causes directly.
A valid A/B test uses random, independent assignment of visitors to each variant, reaches a sample size calculated before the test began, runs for at least one full business cycle, and tests one primary metric defined in advance. Checking for sample ratio mismatch — an unexpected split between variants — is also part of validating results before drawing conclusions.
For the underlying statistical theory behind this calculator, see the hypothesis testing guide on Statistics Fundamentals, which covers proportion tests, confidence intervals, and the formal logic of significance testing in detail.