How Statistics Powers A/B Testing (Real Examples & Calculator)

What Is A/B Testing?

Quick Definition

A/B testing is a controlled experiment that randomly splits visitors between two or more versions of a page, email, or product experience, then measures which version performs better against one clearly defined metric, such as conversion rate. The word "controlled" is doing the important work: everything about the two experiences is identical except the one element being tested, and assignment to a group is random.

The mechanics are simple. Half of incoming traffic sees version A (the control, usually the current experience). The other half sees version B (the variant, the change being tested). Both groups are exposed at the same time, to the same traffic sources, under the same conditions. Any difference in the metric that exceeds ordinary sampling noise is therefore attributable to the one thing that changed, not to a different week, a different ad campaign, or a different mix of visitors.

Marketing teams run A/B tests on headlines, email subject lines, and call-to-action button copy. Product teams run them on onboarding flows, pricing pages, and feature rollouts. In both cases the underlying question is identical: does this specific change move the metric that matters, or does it just look like it does?

⚡ Quick Reference — A/B Testing & Statistical Significance

A/B test: randomized, controlled comparison of two versions on one metric
Statistical significance: the verdict that a result this large would be unlikely to happen by chance if there were no real difference
p-value: the probability of a result at least this large under that no-difference assumption, commonly compared against 0.05
Confidence interval: the plausible range for the true size of the effect, not just a single point estimate
Statistical power: the chance the test detects a real effect of the size you actually care about
Practical significance: whether the effect, once confirmed real, is large enough to be worth acting on

Why Statistics Decides the Winner

Every A/B test compares two samples, not two populations. You never see every visitor who could possibly land on your page, only the few thousand who happened to arrive during the test window. Two identical experiences, shown to two different random samples, will almost never produce exactly the same conversion rate, purely because of who happened to show up in each group. Statistics is the tool that separates that ordinary sampling variation from a real, repeatable effect.

This is the core reason a raw comparison of two numbers is not enough. If control converts at 8.20% and the variant converts at 9.36%, that 1.16 percentage point gap could mean the variant genuinely performs better, or it could mean 5,000 random visitors in the variant group happened to be slightly more purchase-ready that week. A significance test quantifies exactly how surprising the observed gap would be if there were truly no difference at all, and that number, the p-value, is what turns "it looks higher" into "we have evidence it is higher."

The underlying mechanism is the central limit theorem: given enough observations, the distribution of possible sample conversion rates clusters predictably around the true rate, in a shape close to normal. That predictable shape is what lets a two-proportion z-test convert a raw percentage-point gap into a probability. Without it, "9.36% beat 8.20%" is just an observation. With it, it becomes a testable claim.
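
The clustering described above can be checked with a small simulation. The sketch below is illustrative only: it assumes a true conversion rate of 8.2% and 5,000 visitors per sample (values borrowed from the headline example later in this guide), draws repeated samples from one identical experience, and compares the spread of the simulated rates with the standard error that the normal approximation predicts. The constant names and the seed are arbitrary choices, not part of any tool.

```python
import math
import random
import statistics

TRUE_RATE = 0.082      # assumed true conversion rate (illustrative)
VISITORS = 5_000       # visitors per simulated sample
SAMPLES = 300          # number of repeated samples to draw

def simulate_rates(true_rate, visitors, samples, seed=42):
    """Return the conversion rate observed in each simulated sample."""
    rng = random.Random(seed)
    rates = []
    for _ in range(samples):
        conversions = sum(rng.random() < true_rate for _ in range(visitors))
        rates.append(conversions / visitors)
    return rates

rates = simulate_rates(TRUE_RATE, VISITORS, SAMPLES)

observed_spread = statistics.stdev(rates)
predicted_spread = math.sqrt(TRUE_RATE * (1 - TRUE_RATE) / VISITORS)

print(f"mean of sample rates: {statistics.mean(rates):.4f}")
print(f"observed spread:      {observed_spread:.5f}")
print(f"predicted std. error: {predicted_spread:.5f}")
```

No two simulated samples need to agree exactly, yet their spread should sit close to the predicted standard error. That predictability is what the z-test relies on.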

Understanding Statistical Significance

Featured Definition — Statistical Significance in A/B Testing

Statistical significance is a measure of how unlikely an observed difference between two variants would be if there were truly no underlying difference between them. It is reported as a p-value. When the p-value falls below a threshold set in advance, commonly 0.05, the result is called statistically significant, meaning the gap is unlikely to be explained by random sampling alone.

p-value < α → significant

α = 0.05 (typical)

A simple analogy: imagine flipping two coins 100 times each. Coin A lands heads 48 times, coin B lands heads 56 times. Is coin B actually biased toward heads, or is an 8-flip gap just what you'd expect from two fair coins over 100 flips? A significance test answers exactly that question for conversion rates instead of coin flips. It does not tell you whether your variant is good; it tells you whether the gap you measured is more than noise.
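
The coin question can be answered exactly rather than by intuition. The sketch below assumes both coins are fair and adds up the probability of every pair of outcomes whose head counts differ by 8 or more in either direction. The function names are illustrative.

```python
from math import comb

FLIPS = 100          # flips per coin, as in the analogy
OBSERVED_GAP = 8     # 56 heads minus 48 heads

def heads_pmf(k, n=FLIPS):
    """Probability that a fair coin lands heads exactly k times in n flips."""
    return comb(n, k) / 2 ** n

def prob_gap_at_least(gap, n=FLIPS):
    """Chance two fair coins differ by `gap` or more heads, in either direction."""
    pmf = [heads_pmf(k, n) for k in range(n + 1)]
    return sum(
        pmf[a] * pmf[b]
        for a in range(n + 1)
        for b in range(n + 1)
        if abs(a - b) >= gap
    )

p_chance = prob_gap_at_least(OBSERVED_GAP)
print(f"P(gap >= {OBSERVED_GAP} heads | both coins fair) = {p_chance:.3f}")
```

The probability comes out far above 0.05, so an 8-flip gap between two fair coins is unremarkable. By itself it is not evidence that coin B is biased.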

Statistical significance is not the same as importance, size, or business value, a distinction covered in detail later in this guide. It answers one narrow question: is this difference real, or could it plausibly be chance? A tiny, commercially irrelevant lift can be statistically significant with enough traffic, and a genuinely large lift can fail to reach significance with too little traffic. Both scenarios show up in the worked examples below.

The Statistical Concepts Behind Every A/B Test

A handful of concepts do all the real work in A/B testing statistics. Each one below is defined on its own, then tied back into the worked examples in the next section so the definitions are not abstract.

Null and Alternative Hypothesis

Every test starts with two competing statements. The null hypothesis (H₀) claims there is no real difference between control and variant, that any gap in the data is due to chance. The alternative hypothesis (H₁) claims a real difference exists. A significance test never "proves" the alternative hypothesis; it only measures whether the data gives enough evidence to reject the null. See the null and alternative hypothesis guide for the full mechanics of writing and testing these statements.

p-value

The p-value is the probability of observing a gap at least as large as the one measured, assuming the null hypothesis is true. A small p-value means the observed gap would be rare if the variants were truly identical, which is evidence against the null. A p-value is not the probability that the null hypothesis is true, a common misreading covered in the p-values guide.

Significance Level (α) and Confidence Level

The significance level, written α (alpha), is the threshold you set before the test starts, most commonly 0.05. It defines how much risk of a false positive you are willing to accept. The confidence level is simply 1 − α, expressed as a percentage: an α of 0.05 corresponds to 95% confidence. See the significance level guide for how to choose this threshold deliberately rather than by default.

Confidence Interval

A confidence interval gives a plausible range for the true difference between variants, not just a single number. A 95% confidence interval means that if you repeated the experiment many times, about 95% of the intervals you'd construct would contain the true difference. A narrow interval means a precise estimate; a wide interval means real uncertainty, even if the point estimate looks promising. Full treatment at the confidence interval for a proportion guide.

Statistical Power

Statistical power is the probability that a test correctly detects a real effect of a given size, if one truly exists. Power is typically set at 80%, meaning the test is designed to catch a real effect four times out of five. Low power is one of the most common, least visible failure modes in A/B testing: an underpowered test that reports "no significant difference" may simply have never had a fair chance to find the difference. See the statistical power guide.

Effect Size and Minimum Detectable Effect

Effect size measures how large the difference between variants actually is, in either absolute or relative terms. The minimum detectable effect (MDE) is the smallest effect size a test is designed, in advance, to reliably catch. Setting the MDE too small demands an enormous sample; setting it too large means the test will miss real but modest improvements. See the effect size guide.

Sample Size and Standard Error

Sample size (n) is the number of visitors assigned to each variant. Standard error measures how much a sample conversion rate is expected to fluctuate from the true rate purely due to sampling. Larger samples produce smaller standard errors, which is why the same percentage-point gap can be statistically significant at one sample size and indistinguishable from noise at a smaller one. The three worked examples below show the pattern from the other side: as the sample grows, progressively smaller gaps reach significance.
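
To make the link between sample size and standard error concrete, the sketch below holds one gap fixed (8.20% versus 9.36%, the rates from Example 1) and recomputes the pooled standard error, z-statistic, and two-tailed p-value at several group sizes. The group sizes other than 5,000 are hypothetical, and the helper assumes equal-sized groups.

```python
from math import erfc, sqrt

def two_tailed_p(rate_a, rate_b, n_per_group):
    """Pooled two-proportion z-test for equal group sizes; returns (se, z, p)."""
    pooled = (rate_a + rate_b) / 2          # valid because groups are equal-sized
    se = sqrt(pooled * (1 - pooled) * (2 / n_per_group))
    z = (rate_b - rate_a) / se
    p = erfc(abs(z) / sqrt(2))              # two-tailed normal tail probability
    return se, z, p

CONTROL_RATE, VARIANT_RATE = 0.0820, 0.0936   # the gap from Example 1

results = {}
for n in (500, 1_000, 5_000, 20_000):         # hypothetical group sizes
    se, z, p = two_tailed_p(CONTROL_RATE, VARIANT_RATE, n)
    results[n] = (se, z, p)
    print(f"n={n:>6}  SE={se:.5f}  z={z:.2f}  p={p:.4f}")
```

The gap never changes. Only the standard error shrinks as n grows, and that alone moves the result from unconvincing to significant.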

Concept	Plain-English Meaning	Typical Default
Null hypothesis (H₀)	No real difference between control and variant	p₁ = p₂
Alternative hypothesis (H₁)	A real difference exists	p₁ ≠ p₂ (two-tailed)
Significance level (α)	Acceptable risk of a false positive, set in advance	0.05
p-value	Probability of a gap at least this large if H₀ were true	compare to α
Confidence level	1 − α, expressed as a percentage	95%
Statistical power	Probability of detecting a real effect if one exists	80%
Minimum detectable effect	Smallest lift the test is designed to catch	10–20% relative

Real Example 1: Landing Page Headline Test

A SaaS company tests two headlines on its pricing landing page. Control keeps the existing headline; the variant tests a benefit-led rewrite. Traffic is split 50/50 and the test runs for two full weeks.

Group	Visitors	Conversions	Conversion Rate
Control (original headline)	5,000	410	8.20%
Variant (new headline)	5,000	468	9.36%

Worked Example — Two-Proportion z-Test

Is the 1.16 percentage point gap statistically significant at 95% confidence?

State the hypotheses: H₀: p₁ = p₂ (no real difference). H₁: p₁ ≠ p₂ (a real difference exists). Two-tailed test, α = 0.05.

Pooled proportion: p̂ = (410 + 468) / (5,000 + 5,000) = 878 / 10,000 = 0.0878

Standard error: SE = √[0.0878 × 0.9122 × (1/5,000 + 1/5,000)] = 0.00566

z-statistic: z = (0.0936 − 0.0820) / 0.00566 = 2.049

Two-tailed p-value: p = 0.0404. Since 0.0404 < 0.05, the result clears the significance threshold.

✓ Significant at 95% confidence (p = 0.040). The 95% confidence interval for the true difference is [0.05, 2.27] percentage points, entirely above zero. Relative uplift: +14.1%. The new headline is the better bet, though the interval's lower bound is thin, so this is a real but not overwhelming win.
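
The steps above can be reproduced in a few lines of Python using only the standard library. This is a minimal sketch of the pooled, two-tailed calculation for this one example, with variable names chosen for readability rather than taken from any tool.

```python
from math import erfc, sqrt

# Observed data from the headline test
control_visitors, control_conversions = 5_000, 410
variant_visitors, variant_conversions = 5_000, 468

p_control = control_conversions / control_visitors    # 0.0820
p_variant = variant_conversions / variant_visitors    # 0.0936

# Step 1: pooled proportion under H0 (both groups share one true rate)
pooled = (control_conversions + variant_conversions) / (
    control_visitors + variant_visitors
)

# Step 2: standard error of the difference under H0
se = sqrt(pooled * (1 - pooled) * (1 / control_visitors + 1 / variant_visitors))

# Step 3: z-statistic
z = (p_variant - p_control) / se

# Step 4: two-tailed p-value from the standard normal distribution
p_value = erfc(abs(z) / sqrt(2))

print(f"pooled={pooled:.4f}  SE={se:.5f}  z={z:.3f}  p={p_value:.4f}")
```

Swapping in the counts from the other worked examples reruns the same four steps without any other changes.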

Real Example 2: Email Subject Line Test

An e-commerce brand tests two subject lines for a weekly promotional email, measuring open rate across 40,000 recipients split evenly between versions.

Group	Recipients	Opens	Open Rate
Version A (informational)	20,000	2,400	12.00%
Version B (curiosity-led)	20,000	2,560	12.80%

Worked Example — Two-Proportion z-Test

Is the 0.80 percentage point gap in open rate statistically significant?

Pooled proportion: p̂ = (2,400 + 2,560) / 40,000 = 0.1240

Standard error: SE = √[0.1240 × 0.8760 × (1/20,000 + 1/20,000)] = 0.00330

z-statistic: z = (0.1280 − 0.1200) / 0.00330 = 2.427

Two-tailed p-value: p = 0.0152

✓ Significant at 95% confidence, and at 98% confidence too (p = 0.015). 95% CI for the difference: [0.15, 1.45] percentage points. Relative uplift: +6.7%. Because the sample here is four times larger than Example 1's, a smaller absolute gap still produced a stronger, more confident result — a direct illustration of how sample size drives statistical certainty.

Real Example 3: Checkout Button Color Test

A high-traffic e-commerce checkout page tests a green "Complete Purchase" button against the existing blue one, running until each variant has 250,000 visitors, a scale only reachable by sites with very heavy checkout traffic.

Group	Visitors	Conversions	Conversion Rate
Control (blue button)	250,000	10,000	4.00%
Variant (green button)	250,000	10,375	4.15%

Worked Example — Two-Proportion z-Test

Is a 0.15 percentage point gap statistically significant at this sample size?

Pooled proportion: p̂ = (10,000 + 10,375) / 500,000 = 0.04075

Standard error: SE = √[0.04075 × 0.95925 × (1/250,000 + 1/250,000)] = 0.000559

z-statistic: z = (0.0415 − 0.0400) / 0.000559 = 2.682

Two-tailed p-value: p = 0.0073, significant even at the stricter 99% confidence level.

⚠ Statistically significant, practically marginal. The 95% CI for the difference is [0.04, 0.26] percentage points, entirely above zero — the green button really is better, with high confidence. But the absolute lift is 0.15 percentage points and the relative uplift is 3.75%. Whether that's worth a design change, QA cycle, and rollout depends on the cost of implementation, not on the p-value. This is the clearest illustration in this guide of why statistical and practical significance are two separate questions.

Statistical Significance vs Practical Significance

The checkout button example above is not an edge case; it is the single most common way A/B testing misleads well-intentioned teams. A p-value only answers "is this difference probably real?" It says nothing about whether the difference is big enough to justify the engineering time, the design review, or the risk of touching a high-traffic checkout flow.

Property	📊 Statistical Significance	💰 Practical Significance
Question it answers	Is this difference probably real, not random noise?	Is this difference large enough to matter for the business?
Depends heavily on	Sample size — larger samples detect smaller real effects	Cost of implementation, revenue impact, opportunity cost
Expressed as	A p-value compared against α	Absolute lift, relative uplift, or projected revenue impact
Risk of ignoring it	Shipping changes that are actually just noise	Shipping changes that are real but not worth the cost to build
Example 3 verdict	Significant (p = 0.0073)	Marginal — 0.15pp absolute lift; judgment call

✅ Decision Rule You Can Memorize

Ask both questions in order. First: is the p-value below your threshold? If not, stop — you don't have evidence of a real difference. If yes, ask a second question: does the confidence interval's lower bound represent a lift worth the cost of shipping it? Only ship when both answers are yes.

Confidence Intervals Explained

A p-value gives a single yes-or-no signal. A confidence interval gives a range, and the range is often more useful for a business decision than the p-value alone, because it shows both how big the effect might be and how uncertain that estimate still is.

Confidence Intervals: Overlapping vs Non-Overlapping

Two intervals that overlap don't automatically mean "not significant," and non-overlapping intervals aren't the only valid test — the two-proportion z-test computed earlier is the precise method. But visually comparing confidence intervals is a fast, intuitive gut check, and it makes one thing obvious that a single p-value hides: how much uncertainty remains. Example 1's interval barely clears zero at the low end (0.05 percentage points); Example 3's interval clears zero more comfortably in relative terms but describes a much smaller effect. Both are "significant." They are not equally convincing.

To build these intervals yourself for a single conversion rate or the gap between two, use the confidence interval for a proportion calculator or the margin of error guide for the underlying formula.

Sample Size and Statistical Power

Sample size is decided before a test starts, not checked afterward. Four inputs determine how many visitors each variant needs: the baseline conversion rate, the minimum detectable effect, the significance level, and the desired statistical power.

Sample Size per Variant — Two-Proportion Test

n = (z_α/2 + z_β)² × [p₁(1−p₁) + p₂(1−p₂)] / (p₂ − p₁)²

z_α/2 = 1.96 for 95% confidence (two-tailed)
z_β = 0.84 for 80% power
p₁ = baseline conversion rate
p₂ = baseline rate + minimum detectable effect in absolute terms (for a relative MDE, p₂ = p₁ × (1 + MDE))

Two patterns fall directly out of this formula, and both matter more in practice than the algebra itself. Lower baseline conversion rates require dramatically larger samples to detect the same relative lift: the absolute gap p₂ − p₁ shrinks in proportion to the baseline and is squared in the denominator, while the variance term p(1−p) in the numerator shrinks only about linearly. Chasing a smaller minimum detectable effect gets expensive fast for the same reason: cutting the MDE in half roughly quadruples the required sample size.

Baseline Rate	10% Relative MDE	20% Relative MDE	30% Relative MDE
2%	80,679 / variant	21,106 / variant	9,795 / variant
5%	31,231 / variant	8,155 / variant	3,778 / variant
10%	14,749 / variant	3,839 / variant	1,772 / variant
20%	6,507 / variant	1,680 / variant	769 / variant
30%	3,760 / variant	961 / variant	435 / variant

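Sample Size Cheat Sheet: assumes α = 0.05 (two-tailed) and 80% power. Figures are visitors required per variant, calculated with the formula above using unrounded z-values (1.95996 and 0.84162) and rounded up, so plugging in the rounded 1.96 and 0.84 gives slightly smaller numbers. Use the sample size calculator for your own exact numbers.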
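
The formula translates directly into code. The sketch below is one way to implement it, assuming the MDE is given as a relative lift and taking the z-values from Python's `statistics.NormalDist` rather than the rounded 1.96 and 0.84. The function name is illustrative.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-tailed two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)              # rate if the lift is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # about 0.84 for 80% power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return ceil(n)                                  # round up to whole visitors

for baseline in (0.02, 0.05, 0.10, 0.20, 0.30):
    row = [sample_size_per_variant(baseline, mde) for mde in (0.10, 0.20, 0.30)]
    print(f"{baseline:.0%}: {row}")
```

Run as written, the loop walks the same baseline rates and relative MDEs as the cheat sheet, so its output can be compared against the table row by row.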

⚠️ What an Underpowered Test Actually Looks Like

Imagine testing a page with a 2% baseline conversion rate, hoping to detect a 10% relative lift, but only running 10,000 visitors per variant instead of the 80,679 the table above calls for. The test will very likely report "not significant" even if the variant is genuinely 10% better — not because the lift isn't real, but because the sample was never large enough to detect it. A negative result from an underpowered test proves nothing.
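
How underpowered is that test? The sketch below estimates power with the standard normal approximation. It assumes the true lift is exactly 10% relative and ignores the negligible chance of a significant result in the wrong direction. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(baseline, relative_lift, n_per_variant, alpha=0.05):
    """Approximate power of a two-tailed two-proportion z-test.

    Uses the normal approximation and ignores the negligible chance of
    a significant result in the wrong direction.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p2 - p1) / se - z_crit)

underpowered = approx_power(0.02, 0.10, 10_000)
planned = approx_power(0.02, 0.10, 80_679)

print(f"power with 10,000 per variant: {underpowered:.0%}")
print(f"power with 80,679 per variant: {planned:.0%}")
```

Under these assumptions the 10,000-visitor test has well under a one-in-four chance of detecting the lift, while the planned sample recovers the 80% target.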

Common Mistakes That Invalidate A/B Tests

Mistake	What Goes Wrong	What To Do Instead
Peeking and stopping early	Checking the dashboard daily and stopping the moment it shows "significant" inflates the real false-positive rate well beyond 5%	Set sample size and duration in advance; only interpret the result once that point is reached
Sample size too small	Underpowered tests report "no difference" even when a real effect exists	Calculate required sample size before launch using your MDE and baseline rate
Ignoring statistical power	A test with 50% power will miss a real effect roughly half the time	Design for at least 80% power against a realistic minimum detectable effect
Multiple testing / running many metrics	Testing 20 independent metrics at α = 0.05 gives roughly a 64% chance (1 − 0.95²⁰) that at least one shows "significant" purely by chance	Pick one primary metric in advance, or apply a correction (e.g., Bonferroni) for secondary metrics
Seasonal or day-of-week bias	Running a test for three days captures one slice of behavior, not the full weekly cycle	Run tests for full week multiples to average out weekday/weekend differences
Sample ratio mismatch	Traffic doesn't split as configured (e.g., 55/45 instead of 50/50), often signaling a tracking bug	Check the actual visitor split against the intended split before trusting the result

Of these, peeking is the most underestimated. Statistician and analyst Evan Miller's widely cited note on the subject demonstrates that repeatedly checking a test and stopping at the first "significant" reading can push the real false-positive rate to over five times the stated 5% threshold, because every additional look is another chance to catch a random fluctuation (Evan Miller, "How Not To Run An A/B Test"). The fix costs nothing: decide the sample size before you start, and treat the p-value as meaningless until you get there.
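
The inflation from peeking is easy to reproduce. The sketch below is a simulation under stated assumptions, not a reconstruction of Miller's analysis: it runs A/A tests (both arms share a hypothetical 10% conversion rate, so every "significant" result is a false positive), checks the p-value at ten evenly spaced looks, and compares stopping at the first significant look with reading the result once at the end. The look count, visitors per look, and seed are arbitrary.

```python
import random
from math import erfc, sqrt

def p_value(conv_a, conv_b, n):
    """Two-tailed pooled z-test p-value for two groups of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0                      # no conversions yet: nothing to test
    z = (conv_b - conv_a) / n / se
    return erfc(abs(z) / sqrt(2))

def run_aa_test(rng, rate=0.10, looks=10, visitors_per_look=200, alpha=0.05):
    """Simulate one A/A test; return (significant_at_any_look, significant_at_end)."""
    conv_a = conv_b = n = 0
    peeked_significant = False
    p = 1.0
    for _ in range(looks):
        conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
        conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
        n += visitors_per_look
        p = p_value(conv_a, conv_b, n)
        if p < alpha:
            peeked_significant = True   # a peeker would have stopped here
    return peeked_significant, p < alpha

rng = random.Random(7)
EXPERIMENTS = 500
outcomes = [run_aa_test(rng) for _ in range(EXPERIMENTS)]

peeking_rate = sum(peeked for peeked, _ in outcomes) / EXPERIMENTS
fixed_rate = sum(final for _, final in outcomes) / EXPERIMENTS

print(f"false positives, stop at first significant look: {peeking_rate:.1%}")
print(f"false positives, single look at the end:         {fixed_rate:.1%}")
```

Because the final look is one of the ten, the peeking rate can never be lower than the fixed-horizon rate. The simulation shows how much higher it gets when there is no real effect to find.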

How Experimentation Teams Apply This at Scale

Companies running thousands of simultaneous experiments treat statistical rigor as infrastructure, not as a one-off calculation. Automated checks for sample ratio mismatch, pre-registered metrics, and standardized significance thresholds all exist to prevent the exact mistakes listed above from slipping through at volume, where a single unchecked bias can quietly corrupt hundreds of decisions.

The most detailed public account of this discipline comes from practitioners who built experimentation platforms at Microsoft, Google, and LinkedIn, documented in Ron Kohavi, Diane Tang, and Ya Xu's Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing (Cambridge University Press, 2020). The book's central argument is not that the statistics are exotic (the two-proportion z-test used throughout this guide is standard) but that trustworthiness comes from process: pre-registering the metric before launch, guarding against peeking, and auditing for instrumentation errors before ever trusting a p-value (experimentguide.com).

For a small team, the same discipline scales down cleanly: write down the metric, the sample size, and the stopping point before the test launches, and don't revisit those decisions once the data starts coming in.

Tools for Running Statistical A/B Tests

Tool	Best For	Statistical Approach	Consideration
Google Analytics 4	Teams already instrumented in GA4 running lightweight tests	Basic comparison; no built-in sequential correction	Free, but significance testing is limited compared to dedicated tools
Optimizely	Enterprise experimentation programs	Sequential testing with a built-in stats engine	Mature feature set; higher cost and setup overhead
VWO	Mid-market CRO teams	Bayesian and frequentist reporting options	Good balance of usability and statistical detail
AB Tasty	Marketing-led testing programs	Frequentist significance with visual editor	Strong for non-technical marketing teams
Adobe Target	Enterprises already in the Adobe ecosystem	Frequentist and multi-armed bandit options	Best value when bundled with Adobe Experience Cloud
Microsoft Clarity	Free qualitative context alongside a test	Not a significance testing tool	Pairs with a dedicated tool; doesn't replace one
R	Custom statistical analysis and simulation	Full control — prop.test(), power.prop.test(), and more	Requires statistical and coding fluency
Python (SciPy / statsmodels)	Data teams building custom pipelines	Full control — proportions_ztest and equivalents	Same tradeoff as R; integrates well with existing data infra
Excel / Google Sheets	Manual, ad hoc calculations	Manual z-test formula, as shown in this guide	No automation or peeking protection — easy to misuse

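None of these tools change the underlying math. A pooled, two-tailed two-proportion z-test computed by hand, in Excel, in Python, or inside an enterprise platform's dashboard will produce the same p-value given the same inputs, provided the tool is actually running that test: R's prop.test(), for example, applies a continuity correction by default, and sequential or Bayesian engines report different quantities altogether. What differs most is how well each tool protects you from the mistakes in the section above, particularly peeking, which is why platforms built around sequential testing exist in the first place.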

A/B Test Statistical Significance Calculator

Enter your own control and variant numbers below. The calculator runs the same two-proportion z-test used in every worked example above and reports the z-statistic, p-value, uplift, and confidence interval for the difference.

🧮 A/B Test Significance Calculator (interactive widget)

Inputs: visitors and conversions for Control (A) and for Variant (B), plus a confidence level.
Outputs: z-statistic, p-value, relative uplift, and confidence interval for the difference, with an optional step-by-step breakdown.

This calculator uses a two-tailed, pooled two-proportion z-test, the same method documented by NIST's Engineering Statistics Handbook §7.3.3. For sample size planning before you launch a test, use the dedicated A/B test calculator or the sample size calculator.
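
For readers who prefer to run the numbers offline, the calculation can be sketched as a small Python function. This is an approximation of what such a calculator does, not its actual source: the function and field names are hypothetical, the confidence interval is assumed to use the unpooled standard error (which reproduces the intervals quoted in the worked examples), and input validation such as zero visitors or zero control conversions is left out.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist

@dataclass
class ABResult:
    z: float
    p_value: float
    relative_uplift: float
    ci_low: float
    ci_high: float
    significant: bool

def ab_significance(visitors_a, conversions_a, visitors_b, conversions_b,
                    confidence=0.95):
    """Two-tailed, pooled two-proportion z-test with a CI for the difference."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    diff = rate_b - rate_a

    # Test statistic uses the pooled rate (the null hypothesis says rates match)
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Interval uses the unpooled standard error (no assumption that rates match)
    se_unpooled = sqrt(rate_a * (1 - rate_a) / visitors_a
                       + rate_b * (1 - rate_b) / visitors_b)
    margin = NormalDist().inv_cdf(1 - (1 - confidence) / 2) * se_unpooled

    return ABResult(z, p_value, diff / rate_a, diff - margin, diff + margin,
                    p_value < 1 - confidence)

example_1 = ab_significance(5_000, 410, 5_000, 468)
print(example_1)
```

The interval bounds are returned as proportions, so multiply by 100 to read them in percentage points. Passing `confidence=0.99` applies the stricter threshold discussed in Example 3.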

Should You Declare a Winner? A Decision Flowchart

Decision Flow — From Raw Data to a Ship / No-Ship Call

This flow deliberately checks four things in sequence rather than one: the sample size discipline first, the statistical test second, the business value third, and the precision of the estimate fourth. Skipping straight to "is the p-value under 0.05?" is how the checkout button example above gets shipped without anyone asking whether it was worth building.
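
The four checks can be written down as a short function. The sketch below follows the order described above, but the outcome labels, the optional precision check, and the minimum worthwhile lift are hypothetical placeholders that each team would set for itself.

```python
def ship_decision(reached_planned_sample, p_value, ci_low, ci_high,
                  min_worthwhile_lift, alpha=0.05, max_ci_width=None):
    """Walk the four checks in order and return (decision, reason)."""
    # 1. Sample size discipline: never interpret an unfinished test
    if not reached_planned_sample:
        return "keep running", "planned sample size not reached"
    # 2. Statistical test
    if p_value >= alpha:
        return "no ship", "no evidence of a real difference"
    # 3. Business value: even the pessimistic end of the CI should pay off
    if ci_low < min_worthwhile_lift:
        return "judgment call", "lift may be too small to justify the cost"
    # 4. Precision: optionally require a reasonably tight estimate
    if max_ci_width is not None and ci_high - ci_low > max_ci_width:
        return "consider retest", "estimate is real but imprecise"
    return "ship", "significant, worthwhile, and precise enough"

# Example 3 (checkout button), lifts in absolute proportion units.
# The 0.1 percentage point bar is a hypothetical business threshold.
decision, reason = ship_decision(
    reached_planned_sample=True,
    p_value=0.0073,
    ci_low=0.0004,
    ci_high=0.0026,
    min_worthwhile_lift=0.001,
)
print(decision, "-", reason)
```

With that hypothetical threshold, the checkout button result passes the statistical check and stalls at the business-value check.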

A/B Testing Checklist

Before Launch

Pick one primary metric that defines success
Calculate required sample size from your baseline rate and MDE
Decide the significance level (α) and power target in advance
Set a fixed test duration covering full business cycles
Confirm random, unbiased assignment to each variant

During the Test

Don't act on interim p-values — resist the urge to peek and stop
Monitor for sample ratio mismatch against your intended split
Watch for tracking or instrumentation errors early, not at the end
Let the test run to the pre-calculated sample size and duration

After the Test

Check statistical significance and practical significance together
Review the confidence interval's width, not just the point estimate
Only segment results if segments were pre-registered before launch
Document the result and, for high-stakes changes, consider a confirmation test

Glossary of A/B Testing Statistics Terms

Term	Notation	Plain-English Definition	Importance
A/B Testing	—	A randomized, controlled comparison of two experiences on one metric	The experimental method this entire guide is built around
Statistical Significance	p < α	Evidence that an observed gap is unlikely to be pure chance	Separates real effects from sampling noise
p-value	p	Probability of a gap this extreme if there were truly no difference	The core output of a significance test
Confidence Interval	CI	A plausible range for the true size of the effect	Shows both the estimate and its uncertainty
Confidence Level	1 − α	How often the method produces an interval containing the truth	Set before the test; 95% is standard
Null Hypothesis	H₀	The default assumption that no real difference exists	The claim a significance test tries to reject
Alternative Hypothesis	H₁	The claim that a real difference exists	What you accept if H₀ is rejected
Statistical Power	1 − β	Probability of detecting a real effect if one exists	Prevents wasted, underpowered tests
Sample Size	n	Number of visitors assigned to each variant	Directly drives both precision and power
Effect Size	—	The magnitude of the difference between variants	Determines how much sample size is required
Standard Error	SE	Expected sampling variation of a conversion rate estimate	The denominator that turns a gap into a z-score
Conversion Rate	p̂	Proportion of visitors completing the target action	The metric being compared between variants
Type I Error	α	False positive — declaring a winner that isn't real	Controlled directly by the significance threshold
Type II Error	β	False negative — missing a real winner	Controlled by statistical power and sample size
Practical Significance	—	Whether a confirmed real effect is large enough to matter	The business judgment call that follows the statistics

For the broader statistical foundations behind every term above, see the hypothesis testing pillar page, the confidence intervals pillar page, and the full Statistics Fundamentals glossary.

Frequently Asked Questions — Statistical Significance in A/B Testing

What is statistical significance in A/B testing?

Statistical significance in A/B testing is a measure of how likely it is that the difference you observed between a control and a variant would happen by random chance alone if there were truly no difference between them. It is expressed as a p-value. A p-value below your chosen threshold, commonly 0.05, means the observed gap is unlikely to be pure noise, so you treat it as a real effect rather than a fluke of who happened to land in each group.

Why does statistical significance matter in A/B testing?

Without a significance check, a conversion rate that looks higher in a small sample can simply be sampling noise, not a real improvement. Rolling that change out to every visitor could leave revenue on the table or even cost conversions. Statistical significance is the filter that separates a change worth keeping from a coincidence that happened to look good for a few thousand visitors.

What p-value counts as statistically significant?

Most teams use a p-value threshold (alpha) of 0.05, meaning there is a 5% chance of seeing a difference this large if the variants were actually identical. Some high-traffic teams tighten this to 0.01 to reduce false positives when running many simultaneous tests, while some early-stage teams accept 0.10 to move faster on lower-stakes decisions. The threshold should be set before the test starts, not chosen afterward to fit the result.

Should I use 90%, 95%, or 99% confidence?

95% confidence (matching a 0.05 alpha) is the standard default for most conversion rate optimization tests. Choose 99% for high-stakes or hard-to-reverse changes such as pricing or checkout flows, where a false positive is costly. Choose 90% only for low-risk, easily reversible tests where speed matters more than certainty, and be aware you are accepting a higher false-positive rate in exchange.

How long should an A/B test run?

Run the test until it reaches the sample size you calculated in advance, and let it cover at least one full business cycle, typically one to two full weeks, so weekday and weekend behavior are both represented. Stopping as soon as the dashboard flashes "significant" inflates the false-positive rate, a problem statisticians call repeated significance testing or peeking.

How do I calculate the sample size for an A/B test?

Sample size depends on your baseline conversion rate, the smallest lift you care about detecting (the minimum detectable effect), your significance threshold, and your desired statistical power, usually 80%. Smaller baseline rates and smaller effects both require larger samples. A sample size calculator or the formula n = (z_α/2 + z_β)² × [p₁(1−p₁) + p₂(1−p₂)] / (p₂ − p₁)² will give you the number of visitors needed per variant.

What is statistical power, and why do underpowered tests mislead?

Statistical power is the probability that a test detects a real effect of a given size if one truly exists. An underpowered test, one with too small a sample for the effect size you're hoping to catch, will frequently report "no significant difference" even when a real, meaningful difference exists. That absence of significance gets misread as proof the variants are equal, when it may simply mean the test never had a fair chance to detect the difference.

Can I stop a test as soon as it reaches significance?

No. Checking a test repeatedly and stopping the moment it crosses the significance threshold sharply inflates the real false-positive rate, sometimes several times higher than the stated 5%. Decide on a sample size and duration before the test starts, and only interpret the p-value once that pre-set point is reached, unless you are using a testing tool built specifically for continuous monitoring, such as sequential testing.

What is the difference between statistical and practical significance?

Statistical significance tells you a difference is probably real, not random noise. Practical significance tells you whether that difference is large enough to matter for the business. With a large enough sample, even a tiny, commercially irrelevant lift can become statistically significant, exactly what happened in Example 3 above. The right question is always both: is this real, and is it worth acting on?

Should I use a one-tailed or a two-tailed test?

Two-tailed tests are the safer default for A/B testing because they detect a difference in either direction, protecting you from missing the case where your variant performs worse than control. A one-tailed test only checks one direction and requires a smaller sample to reach significance, but it should only be used when a result in the other direction is truly irrelevant to the decision, which is rare in conversion optimization. See the one-tailed vs two-tailed test guide for the full comparison.

What are Type I and Type II errors?

A Type I error is a false positive: concluding the variant beats control when it actually does not. A Type II error is a false negative: concluding there is no difference when the variant actually does perform better. The significance threshold (alpha) controls the Type I error rate, while statistical power controls the Type II error rate. Reducing one type of error, all else equal, tends to increase the other, so both need to be set deliberately. Full detail at the Type I and Type II errors guide.

Is a p-value the same as a Bayesian probability to beat baseline?

Not exactly. A classic p-value answers "how surprising is this data if the variants are identical?" A Bayesian "probability to beat baseline" answers a different question: "given this data, how likely is it that the variant is actually better?" The two numbers often land close together but are not interchangeable, and some platforms blur the distinction in their dashboards. Check your tool's documentation to know which framework it is actually reporting, and see the Bayesian vs frequentist guide for the underlying difference.

Sources cited in this guide: NIST/SEMATECH e-Handbook of Statistical Methods, §7.3.3 — Comparing Two Proportions · NIST/SEMATECH e-Handbook, §7.1 — Product and Process Comparisons · Penn State STAT 200, Lesson 9.1 — Two Independent Proportions · Evan Miller — "How Not To Run An A/B Test" · Kohavi, Tang & Xu — Trustworthy Online Controlled Experiments (Cambridge University Press, 2020) · OpenIntro Statistics, 4th Ed. (Diez, Çetinkaya-Rundel, Barr)