A/B Test Calculator: Statistical Significance & Lift (Free)

Q: What is an A/B test calculator?

An A/B test calculator compares two versions of a page, email, or product experience by running a statistical test on conversion data. It tells you whether the difference between the two versions is large enough to be real, rather than the result of random visitor variation.

Q: How do you calculate statistical significance in A/B testing?

Statistical significance in A/B testing is usually calculated with a two-proportion z-test. You compute the conversion rate for each variant, find the pooled standard error, divide the difference in rates by that standard error to get a z-score, and convert the z-score to a p-value. If the p-value is below your significance threshold (commonly 0.05), the result is statistically significant.

Q: What sample size do I need for an A/B test?

Required sample size depends on your baseline conversion rate, the minimum lift you want to detect, your confidence level, and your statistical power. As a rough reference, detecting a 20% relative lift on a 5% baseline rate at 95% confidence and 80% power requires roughly 7,260 visitors per variant. Smaller effects and lower baseline rates both require larger samples.

Q: How long should an A/B test run?

An A/B test should run long enough to reach its pre-calculated sample size, and for at least one to two full business cycles (commonly 1–2 weeks) so day-of-week effects average out. Stopping a test the moment it first looks significant, before reaching the planned sample size, inflates the false-positive rate.

Q: What is the difference between Bayesian and Frequentist A/B testing?

Frequentist A/B testing calculates a p-value and asks how surprising the observed data would be if there were truly no difference between variants. Bayesian A/B testing instead calculates the probability that one variant beats the other, given the data and a prior belief, and updates continuously as data arrives. Frequentist tests require a fixed sample size set in advance; Bayesian methods are generally more tolerant of early peeking, though both can be misused.

A/B Test Calculator

Significance Test tab
Method: Two-proportion z-test
Formula: Z = (p̂_B − p̂_A) / SE
Inputs: Visitors A and Conversions A for the control (Variant A); Visitors B and Conversions B for the variant (Variant B); Confidence Level

Sample Size tab
Formula: n = (z_α/2 + z_β)² × (p₁(1−p₁) + p₂(1−p₂)) / (p₂ − p₁)²
Inputs: Baseline Conv. Rate (%); Min. Detectable Effect (% relative); Confidence Level; Statistical Power (%)

Duration estimate
Formula: Days = Total Sample Size ÷ Daily Traffic
Inputs: Daily Traffic (eligible visitors); Required Sample Size (per variant)

Step-by-Step tab
Shows the full step-by-step solution once a calculation has been run in the Significance Test tab.

Key Formulas

Conversion Rate: CR = Conversions / Visitors

Relative Lift: (CR_B − CR_A) / CR_A × 100

Z-Score (pooled): Z = (p̂_B − p̂_A) / SE

Decision Rule: p < 0.05 → significant

Quick Decision Guide

p < 0.05: Result is statistically significant at 95% confidence — safe to consider deploying the winner.
p ≥ 0.05: Not enough evidence yet — keep testing or treat the result as inconclusive.
Before testing: Use the Sample Size tab to plan how many visitors you need.

What is an A/B test calculator?

An A/B test calculator is a statistical tool that compares two versions of a webpage, email, advertisement, or product experience to determine whether the difference in their conversion rates is statistically significant. You enter the number of visitors and conversions for a control group (A) and a variant group (B), and the calculator runs a two-proportion z-test to tell you whether the observed difference is larger than chance alone would plausibly produce.

A/B testing is the standard method behind most website and product experiments. Instead of guessing which headline, button color, or checkout flow performs better, a team splits traffic between two versions and lets the data decide. The calculator's job is to translate raw counts — visitors and conversions — into a defensible statistical conclusion: real difference, or random noise. Statistics Fundamentals covers the full statistical theory behind this method in the hypothesis testing guide.

A/B test formulas

An A/B test rests on four formulas: conversion rate, conversion lift, the pooled standard error, and the z-score. The z-score is then converted into a p-value to determine significance.

Conversion Rate

CR = Conversions / Visitors

Example:
CR_A = 250 / 5000 = 0.0500 (5.00%)
CR_B = 295 / 5000 = 0.0590 (5.90%)

Conversion Lift (Relative)

Lift = (CR_B − CR_A) / CR_A × 100

Example:
(0.0590 − 0.0500) / 0.0500 × 100
= 18.0%

Pooled Standard Error

p̄ = (x_A + x_B) / (n_A + n_B)
SE = √[ p̄(1−p̄)(1/n_A + 1/n_B) ]

Where x = conversions, n = visitors

Z-Score & Decision Rule

Z = (p̂_B − p̂_A) / SE

If p-value < 0.05 → reject H₀
(statistically significant)

The pooled standard error assumes the null hypothesis is true — that both variants share a single underlying conversion rate — which is the standard approach for a two-proportion z-test used in significance testing. This differs slightly from the unpooled standard error used when building a confidence interval for the difference between two independent proportions; both versions are common in CRO tooling. See the proportion hypothesis testing page for the full derivation.
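The four formulas above can be chained together in a few lines. The following is a minimal sketch rather than the calculator's actual implementation: the function name `two_proportion_z_test` is hypothetical, and a two-sided p-value is assumed because the page does not state which tail convention the tool uses.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test (two-sided p-value is an assumption)."""
    cr_a = conv_a / n_a                               # conversion rate, control
    cr_b = conv_b / n_b                               # conversion rate, variant
    pooled = (conv_a + conv_b) / (n_a + n_b)          # p-bar, shared rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (cr_b - cr_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # area in both tails
    lift_pct = (cr_b - cr_a) / cr_a * 100             # relative lift
    return {"cr_a": cr_a, "cr_b": cr_b, "lift_pct": lift_pct,
            "z": z, "p_value": p_value}


# Same counts as the conversion-rate example above
result = two_proportion_z_test(250, 5000, 295, 5000)
print(result)
```

For the example counts (250 and 295 conversions on 5,000 visitors each) this gives an 18.0% relative lift, a z-score of about 1.98, and a two-sided p-value of about 0.047.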

Assumptions behind the two-proportion z-test

The two-proportion z-test that powers this calculator depends on a small set of conditions. Results are reliable when each one holds.

Random, independent assignment

Each visitor must be randomly assigned to A or B, and one visitor's outcome should not affect another's.

Sufficient sample size

Both n×p̂ and n×(1−p̂) should be at least 5 for each group, so the normal approximation to the binomial distribution holds.

Fixed sample size, set in advance

The test should run to a pre-calculated sample size rather than being stopped the moment results look favorable.

One comparison, one test

Testing many metrics or variants at once and reporting only the significant one inflates the false-positive rate (the multiple comparison problem).

How many visitors do you need for an A/B test?

The required sample size depends on four inputs: your baseline conversion rate, the minimum lift you care about detecting, your chosen confidence level, and your statistical power. Smaller effects, lower baseline rates, and stricter confidence requirements all increase the number of visitors you need.

Table: approximate sample size per variant at 95% confidence, 80% power

Baseline Rate	10% Relative Lift	20% Relative Lift	30% Relative Lift
2%	~75,200	~18,800	~8,400
5%	~29,000	~7,260	~3,230
10%	~13,500	~3,400	~1,520
20%	~5,880	~1,480	~660

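Figures are rounded approximations. The formula shown for the Sample Size tab, which uses each variant's own variance, evaluates to somewhat larger numbers (about 8,150 per variant for a 5% baseline and a 20% relative lift), so use the Sample Size tab in the calculator for an exact number based on your own inputs.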

Statistical power is the probability of detecting a real effect when one exists — it guards against a false negative (Type II error). The industry-standard power level is 80%; some teams running high-stakes tests use 90%, which requires a larger sample. Evan Miller's widely used A/B test sample size calculator applies the same underlying formula and is a common cross-check in CRO teams.
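As a rough illustration, the sample size formula n = (z_α/2 + z_β)² × (p₁(1−p₁) + p₂(1−p₂)) / (p₂ − p₁)² can be evaluated directly. This is a sketch under stated assumptions: the function name and parameters are hypothetical, the test is assumed two-sided, and other tools (including the table above) may use slightly different variance approximations and therefore return different numbers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, confidence=0.95, power=0.80):
    """Visitors needed per variant, using the unpooled-variance formula."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)                          # rate after the lift
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)    # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


# 5% baseline, 20% relative lift, 95% confidence, 80% power
n = sample_size_per_variant(0.05, 0.20)
print(n)
```

Raising `power` to 0.90 or shrinking `relative_mde` increases the result, which matches the direction described in the text.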

How long should an A/B test run?

An A/B test should run until it reaches its planned sample size, and for at least one to two full business cycles — typically one to two weeks — so day-of-week and other recurring effects average out across both variants. Divide the total visitors needed by your daily eligible traffic to estimate the number of days, then round up to the nearest full week.

Worked example: A test needs 3,070 visitors per variant (6,140 total). The page gets 900 eligible visitors per day. 6,140 ÷ 900 ≈ 6.8 days, rounded up to 7 days. Because 7 days covers only a single weekly cycle, the recommended run time is 14 days (two weeks), so that one unusual week does not drive the result.
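The duration arithmetic in the worked example can be sketched as follows. The helper name `estimate_run_days` and the `min_weeks=2` default are assumptions chosen to mirror the two-week recommendation above, not settings taken from the calculator.

```python
from math import ceil


def estimate_run_days(n_per_variant, daily_traffic, variants=2, min_weeks=2):
    """Return (raw days needed, recommended days rounded up to full weeks)."""
    total_visitors = n_per_variant * variants
    raw_days = ceil(total_visitors / daily_traffic)
    weeks = max(min_weeks, ceil(raw_days / 7))   # never shorter than min_weeks
    return raw_days, weeks * 7


raw_days, recommended_days = estimate_run_days(3070, 900)
print(raw_days, recommended_days)  # 7 14
```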

How to use the A/B test calculator

Collect visitor and conversion counts for both variants, enter them with your chosen confidence level, and read the significance, lift, and confidence interval from the results panel. Here is the complete process.

Collect control data

Record the number of visitors and conversions for Variant A (the current, unchanged version).

Collect variant data

Record the same two numbers for Variant B (the new version being tested).

Enter values and set the confidence level

95% is the standard choice for most experiments; use 99% for higher-stakes decisions.

Review the statistical significance output

Check the p-value against your threshold — commonly 0.05 — and confirm whether the confidence interval for the difference excludes zero.

Interpret the lift

Look at both absolute lift (percentage points) and relative lift (percent change) — they tell different parts of the story.

Make the business decision

Deploy the winning variant if the result is significant and the sample size was reached as planned; otherwise continue testing or conclude the test was inconclusive.
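The lift and confidence-interval readings in the steps above can be reproduced with a short sketch. It assumes the unpooled standard error for the interval, as described in the formulas section; the function name `lift_with_ci` and the output keys are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def lift_with_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Absolute and relative lift plus a CI for the difference (unpooled SE)."""
    cr_a, cr_b = conv_a / n_a, conv_b / n_b
    diff = cr_b - cr_a
    se = sqrt(cr_a * (1 - cr_a) / n_a + cr_b * (1 - cr_b) / n_b)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return {
        "absolute_lift_pp": diff * 100,            # percentage points
        "relative_lift_pct": diff / cr_a * 100,    # percent change vs. control
        "ci_low_pp": (diff - z_star * se) * 100,
        "ci_high_pp": (diff + z_star * se) * 100,
    }


summary = lift_with_ci(250, 5000, 295, 5000)
print(summary)
```

With 250 and 295 conversions on 5,000 visitors each, the absolute lift is 0.9 percentage points, the relative lift is 18%, and the 95% interval runs from just above 0 to roughly 1.8 percentage points, so it narrowly excludes zero.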

Worked A/B testing examples

The five scenarios below show typical inputs and outputs across common test types. Numbers are illustrative; paste them into the calculator above to confirm.

Table: five worked A/B test scenarios

Test	Visitors A / B	Conversions A / B	Lift	P-Value	Result
Headline A vs B	4,000 / 4,000	180 / 224	+24.4%	0.025	Significant
CTA button: red vs green	6,200 / 6,150	372 / 369	0.0%	1.00	Not significant
Checkout: 1-step vs 2-step	3,500 / 3,480	245 / 298	+22.3%	0.015	Significant
Email subject line	12,000 / 12,000	1,440 / 1,512	+5.0%	0.157	Not significant
Pricing toggle default	2,100 / 2,090	105 / 146	+39.7%	0.007	Significant

P-values shown are two-sided, from the pooled two-proportion z-test described above.

Interpretation notes

The CTA button test shows why color alone rarely moves conversion: both arms convert at exactly 6.0%, so the lift is zero and p = 1.00, meaning the data gives no evidence of any difference. The pricing toggle test shows the opposite pattern: a large relative lift on a smaller sample can still clear the significance bar when the effect is strong enough relative to the noise. The email subject line test shows a modest lift that falls short of significance (p ≈ 0.157) even with 12,000 visitors per variant: treat this as inconclusive rather than as a loss, and consider running it longer or retesting.
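The scenarios can also be checked outside the calculator. The script below is a sketch that assumes a two-sided, pooled two-proportion z-test; the `SCENARIOS` list simply restates the visitor and conversion counts from the table, and `summarize` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

# (test name, visitors A, visitors B, conversions A, conversions B)
SCENARIOS = [
    ("Headline A vs B", 4000, 4000, 180, 224),
    ("CTA button: red vs green", 6200, 6150, 372, 369),
    ("Checkout: 1-step vs 2-step", 3500, 3480, 245, 298),
    ("Email subject line", 12000, 12000, 1440, 1512),
    ("Pricing toggle default", 2100, 2090, 105, 146),
]


def summarize(n_a, n_b, x_a, x_b):
    """Return (relative lift in %, two-sided p-value) for one scenario."""
    cr_a, cr_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (cr_b - cr_a) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return (cr_b - cr_a) / cr_a * 100, p


results = {name: summarize(n_a, n_b, x_a, x_b)
           for name, n_a, n_b, x_a, x_b in SCENARIOS}

for name, (lift, p) in results.items():
    verdict = "Significant" if p < 0.05 else "Not significant"
    print(f"{name:28s} lift {lift:6.1f}%   p = {p:.3f}   {verdict}")
```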

A/B testing methodologies compared

A/B testing vs. multivariate testing

Aspect	A/B Testing	Multivariate Testing (MVT)
What changes	One element, two versions	Multiple elements, multiple combinations
Traffic required	Lower	Higher — splits across many combinations
What you learn	Which version wins	Which elements interact and how
Best for	High-impact, single changes	Mature, high-traffic pages with several variables

Bayesian vs. Frequentist A/B testing

Aspect	Frequentist	Bayesian
Core output	P-value against a fixed null hypothesis	Probability that B beats A, given the data
Sample size	Fixed, set before the test	Can update continuously as data arrives
Interpretation	Requires care; not a direct probability statement	More directly intuitive ("87% chance B is better")
This calculator	Uses this approach	Not covered here

Statistical significance vs. practical significance

Aspect	Statistical Significance	Practical Significance
Question answered	Is the difference likely real, not chance?	Is the difference large enough to matter to the business?
Driven by	Sample size and effect size	Revenue, cost, and implementation effort
Risk	Tiny, meaningless effects can become significant at huge sample sizes	A meaningful-looking lift can still be statistical noise on a small sample

Common A/B testing mistakes to avoid

Stopping early (peeking bias)

Checking results daily and stopping as soon as p drops below 0.05 inflates the false-positive rate well above the stated 5%. Decide the sample size in advance and wait for it.
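A small simulation makes the inflation visible. The sketch below is illustrative only: it assumes both variants share the same true 10% conversion rate (so every "significant" result is a false positive), ten evenly spaced looks, and hypothetical batch sizes. Exact rates vary with the seed and settings.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from the pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return 1.0                                   # no variation, nothing to test
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(runs=500, looks=10, batch=200, rate=0.10, alpha=0.05, seed=1):
    """Compare false-positive rates: stop at any significant look vs. final look only."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        flagged_at_any_look = False
        p = 1.0
        for _ in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            p = p_value(conv_a, n, conv_b, n)
            if p < alpha:
                flagged_at_any_look = True
        peeking_hits += flagged_at_any_look
        fixed_hits += p < alpha                      # decision at the planned end only
    return peeking_hits / runs, fixed_hits / runs


peeking_rate, fixed_rate = simulate()
print(f"peeking: {peeking_rate:.1%}, fixed horizon: {fixed_rate:.1%}")
```

The fixed-horizon rate should land near the nominal 5%, while the peeking rate is typically several times higher.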

Underpowered tests

Running a test with too few visitors makes it unlikely to detect a real, smaller effect even if one exists — the test ends inconclusive by design.

The multiple comparison problem

Testing many metrics or many variants and reporting only the one that happened to be significant overstates the evidence. Use a correction (such as Bonferroni) or pre-register the primary metric.
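A Bonferroni correction divides the significance level by the number of comparisons. The sketch below uses made-up p-values for three hypothetical metrics purely to show the mechanics.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the adjusted threshold and a per-metric significance flag."""
    threshold = alpha / len(p_values)
    return threshold, {name: p < threshold for name, p in p_values.items()}


# Hypothetical p-values for three metrics measured in one experiment
threshold, flags = bonferroni_significant(
    {"signup": 0.012, "add_to_cart": 0.030, "revenue": 0.200}
)
print(round(threshold, 4), flags)
```

Here `add_to_cart` would pass an uncorrected 0.05 threshold but fails the adjusted threshold of about 0.0167, while `signup` passes both.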

Seasonal and day-of-week effects

Running a test for three days captures only part of a weekly cycle. Run for full weeks so weekday and weekend behavior are both represented in both variants.

Traffic imbalance (sample ratio mismatch)

If your 50/50 split actually delivers, for example, 5,400 visitors to A and 4,600 to B, something in the randomization or tracking is broken, and results should not be trusted until it is fixed.
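One common way to check for sample ratio mismatch is a chi-square goodness-of-fit test on the visitor counts. The sketch below is an assumption about method, not a feature of the calculator; `srm_p_value` is a hypothetical helper, and the alarm threshold (often set strictly, for example 0.001) is a team convention.

```python
from math import erfc, sqrt


def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Chi-square test (1 degree of freedom) of observed vs. intended split."""
    total = n_a + n_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = (n_a - expected_a) ** 2 / expected_a + (n_b - expected_b) ** 2 / expected_b
    # Upper tail of a chi-square distribution with 1 degree of freedom
    return chi2, erfc(sqrt(chi2 / 2))


chi2, p = srm_p_value(5400, 4600)
print(chi2, p)
```

For the 5,400 / 4,600 split in the example, the statistic is 64 and the p-value is vanishingly small, so a split this lopsided is very unlikely to come from a working 50/50 randomizer.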

Misinterpreting significance

A p-value below 0.05 does not mean there is a 95% probability the variant is better — it means the observed data would be unlikely if there were truly no difference. See the interpretation section below for the precise wording.

How to interpret A/B test results correctly

A p-value tells you how surprising your data would be if the two variants truly had identical conversion rates — it is not the probability that the variant is better. The correct reading separates the statistical claim from the business decision.

✗ Wrong interpretation: "There is a 95% chance Variant B is the real winner."

✓ Correct interpretation: "If A and B truly performed the same, data this extreme would occur less than 5% of the time. We treat this as evidence that the difference is real."

This distinction matters because the p-value is conditional on the null hypothesis being true; it says nothing directly about the probability that the null hypothesis itself is true. The American Statistical Association's formal guidance on this point, published in The American Statistician (2016), is considered the standard reference for correct p-value interpretation across applied fields.

A/B testing formula and term glossary

The table below collects the core formulas and terms used throughout this page, for quick reference.

Table: A/B testing glossary — 12 key terms

Term	Symbol / Formula	Plain-English definition	Primary use
A/B Test	N/A	A randomized experiment comparing a control (A) and a variant (B)	Comparing page or product versions
Conversion Rate	CR = Conversions / Visitors	Share of visitors who complete a target action	Baseline metric for every test
Conversion Lift	(CR_B−CR_A)/CR_A × 100	Relative percentage change in conversion rate	Headline metric for stakeholders
Statistical Significance	Via two-proportion z-test	The observed difference would be unlikely if there were truly no difference between variants	Deciding whether to trust a result
P-Value	From z-score	Probability of data this extreme if there were truly no difference	Primary decision metric
Confidence Interval	(p̂_B−p̂_A) ± z×SE	Range likely to contain the true difference between variants	Communicating uncertainty around the lift
Z-Test (two-proportion)	Z=(p̂_B−p̂_A)/SE	Statistical test comparing two population proportions	The math engine of this calculator
Statistical Power	1 − β	Probability of detecting a real effect when one exists	Sample size planning
MDE	Set by the user	Minimum Detectable Effect — smallest lift you care about catching	Sample size calculation
Sample Size	Per-variant n	Number of visitors required per variant before analyzing results	Test planning, avoiding underpowered tests
Sample Ratio Mismatch	Observed split ≠ intended split	A sign that randomization or tracking is broken	Pre-analysis data quality check
Significance Level (α)	1 − confidence level	Maximum acceptable false-positive rate, usually 0.05	Setting the p-value threshold before testing

Sources and further reading

Authority sources cited in this guide:

Wasserstein, R.L. & Lazar, N.A. (2016). "The ASA Statement on p-Values." The American Statistician. tandfonline.com
National Institute of Standards and Technology (NIST). Engineering Statistics Handbook — Two-Sample Tests for Proportions. itl.nist.gov
Miller, E. How Not to Run an A/B Test and Sample Size Calculator. evanmiller.org
Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments. Cambridge University Press.
Optimizely. Stats Engine Methodology. optimizely.com
OpenStax. Introductory Statistics, Chapter 10: Hypothesis Testing with Two Samples. openstax.org

Frequently asked questions

Q: What is an A/B test calculator?

An A/B test calculator compares two versions of a page, email, or product experience by running a statistical test on conversion data. It tells you whether the difference between the two versions is large enough to be real, rather than the result of random visitor variation. This tool runs a two-proportion z-test and returns conversion rates, lift, p-value, z-score, and a confidence interval.

Q: How do you calculate statistical significance in A/B testing?

Calculate the conversion rate for each variant, find the pooled standard error, divide the difference in rates by that standard error to get a z-score, and convert the z-score to a p-value using the standard normal distribution. If the p-value is below your significance threshold (commonly 0.05), the result is statistically significant. Use the Significance Test tab above to run this calculation automatically.

Q: How do you calculate conversion lift?

Relative lift is (Conversion Rate B − Conversion Rate A) ÷ Conversion Rate A, expressed as a percentage. Absolute lift is simply Conversion Rate B minus Conversion Rate A, in percentage points. Both numbers matter: relative lift is easier to compare across tests, while absolute lift shows the raw size of the effect.

Q: What sample size do I need for an A/B test?

Required sample size depends on your baseline conversion rate, the minimum lift you want to detect, your confidence level, and your statistical power. As a rough reference, detecting a 20% relative lift on a 5% baseline rate at 95% confidence and 80% power requires roughly 7,260 visitors per variant. Use the Sample Size tab above to calculate the exact number for your situation.

Q: What confidence level should I use for an A/B test?

95% confidence is the standard choice for most A/B tests and corresponds to a 0.05 significance threshold. Use 99% confidence for high-stakes decisions, such as changes to a checkout flow or pricing page, where a false positive is costly. Lower confidence levels like 90% are sometimes used for low-risk, exploratory tests where speed matters more than certainty.

Q: How long should an A/B test run?

An A/B test should run until it reaches its planned sample size, and for at least one to two full business cycles, typically one to two weeks, so day-of-week effects average out across both variants. Stopping a test the moment it first looks significant, before the planned sample size, inflates the false-positive rate.

Q: What is statistical power in A/B testing?

Statistical power is the probability of correctly detecting a real difference between variants when one truly exists. It is calculated as 1 minus the Type II error rate (β). The standard target is 80% power; some teams use 90% power for higher-stakes tests, which requires a larger sample size.

Q: What is the minimum detectable effect (MDE)?

The minimum detectable effect is the smallest relative lift you want the test to be able to catch reliably. Setting a smaller MDE makes a test more sensitive but requires a much larger sample size; setting a larger MDE requires fewer visitors but will miss smaller real improvements.

Q: What p-value is considered significant?

A p-value below 0.05 is the conventional threshold for statistical significance, corresponding to a 95% confidence level. Some teams use a stricter 0.01 threshold for high-stakes decisions, while exploratory tests sometimes use 0.10. The threshold should be set before the test starts, not chosen after looking at the data.

Q: How do you calculate a confidence interval for an A/B test?

The confidence interval for the difference between two proportions is (p̂_B − p̂_A) ± z* × SE, where SE uses the unpooled standard error and z* is the critical value for your chosen confidence level. If this interval does not contain zero, the result is statistically significant at that confidence level.

Q: How does an A/B test calculator work?

The calculator takes visitor and conversion counts for two groups, computes each group's conversion rate, calculates the pooled standard error under the assumption of no difference, derives a z-score, and converts that z-score into a p-value using the cumulative standard normal distribution. The calculator on this page performs all of these steps and shows the full working in the Step-by-Step tab.

Q: What is a two-proportion z-test?

A two-proportion z-test is a statistical test used to determine whether two population proportions, such as two conversion rates, differ significantly. It assumes large enough samples for the normal approximation to the binomial distribution to hold, and it is the standard method behind most A/B testing tools, including this calculator.

Q: What is the difference between Bayesian and Frequentist A/B testing?

Frequentist A/B testing, used by this calculator, computes a p-value and asks how surprising the observed data would be if there were truly no difference between variants. Bayesian A/B testing instead estimates the probability that one variant beats the other, given the data and a prior belief, and can update continuously as data arrives. Frequentist tests require a fixed sample size set in advance; Bayesian methods are generally more tolerant of early peeking, though both approaches can be misused.

Q: When should you stop an A/B test?

Stop an A/B test when it reaches the sample size calculated before the test began, and after it has run for at least one full business cycle. Stopping earlier because the result happens to look significant on a given day is a common source of false positives, often called peeking bias.

Q: Are A/B test results valid with a small sample?

Results from a small sample can still be statistically valid if the calculated p-value is significant and both np̂ and n(1−p̂) are at least 5 for each variant, but small samples make it harder to detect real, smaller effects and produce wider confidence intervals. Checking the confidence interval, not just the p-value, gives a more complete picture of the uncertainty involved.

Q: What causes false positives in A/B testing?

The most common causes of false positives are stopping a test early after repeated peeking, testing multiple metrics or variants without correcting the significance threshold, and running a test too briefly to account for day-of-week or seasonal variation. Setting a sample size and significance threshold in advance, and sticking to them, addresses most of these causes directly.

Q: What makes an A/B test valid?

A valid A/B test uses random, independent assignment of visitors to each variant, reaches a sample size calculated before the test began, runs for at least one full business cycle, and tests one primary metric defined in advance. Checking for sample ratio mismatch (an unexpected split between variants) is also part of validating results before drawing conclusions.

For the underlying statistical theory behind this calculator, see the hypothesis testing guide on Statistics Fundamentals, which covers proportion tests, confidence intervals, and the formal logic of significance testing in detail.
