July 2, 2026
BY: Statistics Fundamentals Team
Reviewed By: Minsa A (Senior Statistics Editor)

How Statistics Powers A/B Testing (Real Examples)

A headline test shows 9.36% conversion versus an 8.20% control. Is that a real improvement, or did the winning group just happen to get slightly better visitors that week? Statistics is the only tool that answers that question honestly, and the answer changes what you should do next.

This guide walks through the actual math behind statistical significance in A/B testing: the two-proportion z-test, p-values, confidence intervals, and statistical power. It includes three fully worked examples with real numbers, a free significance calculator, a sample size cheat sheet, a decision flowchart, and a list of the mistakes that quietly invalidate most A/B tests. Every statistic on this page was computed directly from the sample sizes and conversion counts shown, not invented after the fact.

What You'll Learn
  • ✓ How the two-proportion z-test decides whether an A/B test has a real winner
  • ✓ Three worked examples, each computed step by step from real conversion counts
  • ✓ Why a statistically significant result can still be a bad business decision
  • ✓ How to calculate the sample size you need before you launch a test
  • ✓ The mistakes, like peeking and stopping early, that quietly break most tests
  • ✓ A free calculator to test your own control-versus-variant data right now

What Is A/B Testing?

Quick Definition
A/B testing is a controlled experiment that randomly splits visitors between two or more versions of a page, email, or product experience, then measures which version performs better against one clearly defined metric, such as conversion rate. The word "controlled" is doing the important work: everything about the two experiences is identical except the one element being tested, and assignment to a group is random.

The mechanics are simple. Half of incoming traffic sees version A (the control, usually the current experience). The other half sees version B (the variant, the change being tested). Both groups are exposed at the same time, to the same traffic sources, under the same conditions. Whatever difference shows up in the metric, beyond ordinary sampling variation, is attributable to the one thing that changed, not to a different week, a different ad campaign, or a different mix of visitors.

Marketing teams run A/B tests on headlines, email subject lines, and call-to-action button copy. Product teams run them on onboarding flows, pricing pages, and feature rollouts. In both cases the underlying question is identical: does this specific change move the metric that matters, or does it just look like it does?

⚡ Quick Reference — A/B Testing & Statistical Significance
  • A/B test: randomized, controlled comparison of two versions on one metric
  • Statistical significance: a result is significant when a gap this large would be unlikely to happen by chance if there were no real difference
  • p-value: the number that expresses that probability, commonly compared against 0.05
  • Confidence interval: the plausible range for the true size of the effect, not just a single point estimate
  • Statistical power: the chance the test detects a real effect of the size you actually care about
  • Practical significance: whether the effect, once confirmed real, is large enough to be worth acting on

Why Statistics Decides the Winner

Every A/B test compares two samples, not two populations. You never see every visitor who could possibly land on your page, only the few thousand who happened to arrive during the test window. Two identical experiences, shown to two different random samples, will almost never produce exactly the same conversion rate, purely because of who happened to show up in each group. Statistics is the tool that separates that ordinary sampling variation from a real, repeatable effect.

This is the core reason a raw comparison of two numbers is not enough. If control converts at 8.20% and the variant converts at 9.36%, that 1.16 percentage point gap could mean the variant genuinely performs better, or it could mean 5,000 random visitors in the variant group happened to be slightly more purchase-ready that week. A significance test quantifies exactly how surprising the observed gap would be if there were truly no difference at all, and that number, the p-value, is what turns "it looks higher" into "we have evidence it is higher."

The underlying mechanism is the central limit theorem: given enough observations, the distribution of possible sample conversion rates clusters predictably around the true rate, in a shape close to normal. That predictable shape is what lets a two-proportion z-test convert a raw percentage-point gap into a probability. Without it, "9.36% beat 8.20%" is just an observation. With it, it becomes a testable claim.

Understanding Statistical Significance

Featured Definition — Statistical Significance in A/B Testing
Statistical significance is a measure of how unlikely an observed difference between two variants would be if there were truly no underlying difference between them. It is reported as a p-value. When the p-value falls below a threshold set in advance, commonly 0.05, the result is called statistically significant, meaning the gap is unlikely to be explained by random sampling alone.
p-value < α → significant
α = 0.05 (typical)

A simple analogy: imagine flipping two coins 100 times each. Coin A lands heads 48 times, coin B lands heads 56 times. Is coin B actually biased toward heads, or is an 8-flip gap just what you'd expect from two fair coins over 100 flips? A significance test answers exactly that question for conversion rates instead of coin flips. It does not tell you whether your variant is good; it tells you whether the gap you measured is more than noise.
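The coin question can be answered exactly, without any approximation. The sketch below is an illustration of the analogy only, not the z-test this guide uses for conversion rates; the function name `prob_gap_at_least` is hypothetical. It counts every equally likely outcome of two fair coins flipped 100 times each and reports how often the head counts differ by 8 or more in either direction.

```python
from math import comb

def prob_gap_at_least(flips: int, gap: int) -> float:
    """Exact chance that two fair coins, each flipped `flips` times,
    end up with head counts that differ by `gap` or more."""
    total_outcomes = 2 ** (2 * flips)  # every joint flip sequence is equally likely
    favorable = 0
    for heads_a in range(flips + 1):
        for heads_b in range(flips + 1):
            if abs(heads_a - heads_b) >= gap:
                # Number of flip sequences that produce this pair of head counts.
                favorable += comb(flips, heads_a) * comb(flips, heads_b)
    return favorable / total_outcomes

chance = prob_gap_at_least(flips=100, gap=8)
print(f"P(gap of 8+ heads | both coins fair) = {chance:.3f}")
```

The result is a little under 30%, far above a 0.05 threshold, so an 8-flip gap over 100 flips is ordinary noise rather than evidence that coin B is biased.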

Statistical significance is not the same as importance, size, or business value, a distinction covered in detail later in this guide. It answers one narrow question: is this difference real, or could it plausibly be chance? A tiny, commercially irrelevant lift can be statistically significant with enough traffic, and a genuinely large lift can fail to reach significance with too little traffic. The first scenario is Example 3 below; the second is the underpowered test described in the sample size section.

The Statistical Concepts Behind Every A/B Test

A handful of concepts do all the real work in A/B testing statistics. Each one below is defined on its own, then tied back into the worked examples in the next section so the definitions are not abstract.

Null and Alternative Hypothesis

Every test starts with two competing statements. The null hypothesis (H₀) claims there is no real difference between control and variant, that any gap in the data is due to chance. The alternative hypothesis (H₁) claims a real difference exists. A significance test never "proves" the alternative hypothesis; it only measures whether the data gives enough evidence to reject the null. See the null and alternative hypothesis guide for the full mechanics of writing and testing these statements.

p-value

The p-value is the probability of observing a gap at least as large as the one measured, assuming the null hypothesis is true. A small p-value means the observed gap would be rare if the variants were truly identical, which is evidence against the null. A p-value is not the probability that the null hypothesis is true, a common misreading covered in the p-values guide.

Significance Level (α) and Confidence Level

The significance level, written α (alpha), is the threshold you set before the test starts, most commonly 0.05. It defines how much risk of a false positive you are willing to accept. The confidence level is simply 1 − α, expressed as a percentage: an α of 0.05 corresponds to 95% confidence. See the significance level guide for how to choose this threshold deliberately rather than by default.

Confidence Interval

A confidence interval gives a plausible range for the true difference between variants, not just a single number. A 95% confidence interval means that if you repeated the experiment many times, about 95% of the intervals you'd construct would contain the true difference. A narrow interval means a precise estimate; a wide interval means real uncertainty, even if the point estimate looks promising. Full treatment at the confidence interval for a proportion guide.

Statistical Power

Statistical power is the probability that a test correctly detects a real effect of a given size, if one truly exists. Power is typically set at 80%, meaning the test is designed to catch a real effect four times out of five. Low power is one of the most common, least visible failure modes in A/B testing: an underpowered test that reports "no significant difference" may simply have never had a fair chance to find the difference. See the statistical power guide.

Effect Size and Minimum Detectable Effect

Effect size measures how large the difference between variants actually is, in either absolute or relative terms. The minimum detectable effect (MDE) is the smallest effect size a test is designed, in advance, to reliably catch. Setting the MDE too small demands an enormous sample; setting it too large means the test will miss real but modest improvements. See the effect size guide.

Sample Size and Standard Error

Sample size (n) is the number of visitors assigned to each variant. Standard error measures how much a sample conversion rate is expected to fluctuate from the true rate purely due to sampling. Larger samples produce smaller standard errors, which is why the same percentage-point gap can be statistically significant at one sample size and indistinguishable from noise at a smaller one. The three worked examples below show the related pattern: progressively smaller gaps reach significance as the samples grow.
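To make the link between sample size and standard error concrete, here is a minimal sketch that holds the 8.20% versus 9.36% rates from the opening example fixed and varies only the number of visitors per group. The 1,000 and 20,000 group sizes are hypothetical, chosen purely for contrast, and `pooled_standard_error` is an illustrative helper name.

```python
from math import sqrt

def pooled_standard_error(p_control: float, p_variant: float, n_per_group: int) -> float:
    """Standard error of the difference under H0, for two equal-sized groups."""
    pooled = (p_control + p_variant) / 2  # equal groups: pooled rate is the simple average
    return sqrt(pooled * (1 - pooled) * (2 / n_per_group))

p_control, p_variant = 0.0820, 0.0936
for n in (1_000, 5_000, 20_000):  # 1,000 and 20,000 are hypothetical sizes
    se = pooled_standard_error(p_control, p_variant, n)
    z = (p_variant - p_control) / se
    print(f"n = {n:>6,} per group   SE = {se:.5f}   z = {z:.2f}")
```

The gap never changes, but the z-statistic grows with the square root of the sample size: below the 1.96 cutoff at 1,000 visitors per group, just past it at 5,000, and well past it at 20,000.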

Concept | Plain-English Meaning | Typical Default
Null hypothesis (H₀) | No real difference between control and variant | p₁ = p₂
Alternative hypothesis (H₁) | A real difference exists | p₁ ≠ p₂ (two-tailed)
Significance level (α) | Acceptable risk of a false positive, set in advance | 0.05
p-value | Probability of a gap this large if H₀ were true | compare to α
Confidence level | 1 − α, expressed as a percentage | 95%
Statistical power | Probability of detecting a real effect if one exists | 80%
Minimum detectable effect | Smallest lift the test is designed to catch | 10–20% relative

Real Example 1: Landing Page Headline Test

A SaaS company tests two headlines on its pricing landing page. Control keeps the existing headline; the variant tests a benefit-led rewrite. Traffic is split 50/50 and the test runs for two full weeks.

Group | Visitors | Conversions | Conversion Rate
Control (original headline) | 5,000 | 410 | 8.20%
Variant (new headline) | 5,000 | 468 | 9.36%

Worked Example — Two-Proportion z-Test

Is the 1.16 percentage point gap statistically significant at 95% confidence?

1. State the hypotheses: H₀: p₁ = p₂ (no real difference). H₁: p₁ ≠ p₂ (a real difference exists). Two-tailed test, α = 0.05.
2. Pooled proportion: p̂ = (410 + 468) / (5,000 + 5,000) = 878 / 10,000 = 0.0878
3. Standard error: SE = √[0.0878 × 0.9122 × (1/5,000 + 1/5,000)] = 0.00566
4. z-statistic: z = (0.0936 − 0.0820) / 0.00566 = 2.049
5. Two-tailed p-value: p = 0.0404. Since 0.0404 < 0.05, the result clears the significance threshold.

✓ Significant at 95% confidence (p = 0.040). The 95% confidence interval for the true difference is [0.05, 2.27] percentage points, entirely above zero. Relative uplift: +14.1%. The new headline is the better bet, though the interval's lower bound is thin, so this is a real but not overwhelming win.
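The worked steps for this example translate almost line for line into code. The following is a minimal sketch using only Python's standard library; the function name `two_proportion_z_test` and its argument names are illustrative, not part of any particular testing tool.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Pooled, two-tailed two-proportion z-test. Returns (z, p_value)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)                # pooled proportion
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))  # standard error under H0
    z = (rate_b - rate_a) / se                              # z-statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))            # two-tailed p-value
    return z, p_value

# Headline test: control 410 / 5,000, variant 468 / 5,000.
z, p = two_proportion_z_test(410, 5_000, 468, 5_000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

Run on the headline data, it should agree with the hand calculation (z ≈ 2.049, p ≈ 0.040). The same function applies unchanged to any control-versus-variant counts.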

Real Example 2: Email Subject Line Test

An e-commerce brand tests two subject lines for a weekly promotional email, measuring open rate across 40,000 recipients split evenly between versions.

Group | Recipients | Opens | Open Rate
Version A (informational) | 20,000 | 2,400 | 12.00%
Version B (curiosity-led) | 20,000 | 2,560 | 12.80%

Worked Example — Two-Proportion z-Test

Is the 0.80 percentage point gap in open rate statistically significant?

1. Pooled proportion: p̂ = (2,400 + 2,560) / 40,000 = 0.1240
2. Standard error: SE = √[0.1240 × 0.8760 × (1/20,000 + 1/20,000)] = 0.00330
3. z-statistic: z = (0.1280 − 0.1200) / 0.00330 = 2.427
4. Two-tailed p-value: p = 0.0152

✓ Significant at 95% confidence, and at 98% confidence too (p = 0.015). 95% CI for the difference: [0.15, 1.45] percentage points. Relative uplift: +6.7%. Because the sample here is four times larger than Example 1's, a smaller absolute gap still produced a stronger, more confident result — a direct illustration of how sample size drives statistical certainty.

Real Example 3: Checkout Button Color Test

A high-traffic e-commerce checkout page tests a green "Complete Purchase" button against the existing blue one, running until each variant has 250,000 visitors, a scale only reachable by sites with very heavy checkout traffic.

Group | Visitors | Conversions | Conversion Rate
Control (blue button) | 250,000 | 10,000 | 4.00%
Variant (green button) | 250,000 | 10,375 | 4.15%

Worked Example — Two-Proportion z-Test

Is a 0.15 percentage point gap statistically significant at this sample size?

1. Pooled proportion: p̂ = (10,000 + 10,375) / 500,000 = 0.04075
2. Standard error: SE = √[0.04075 × 0.95925 × (1/250,000 + 1/250,000)] = 0.000559
3. z-statistic: z = (0.0415 − 0.0400) / 0.000559 = 2.682
4. Two-tailed p-value: p = 0.0073, significant even at the stricter 99% confidence level.

⚠ Statistically significant, practically marginal. The 95% CI for the difference is [0.04, 0.26] percentage points, entirely above zero — the green button really is better, with high confidence. But the absolute lift is 0.15 percentage points and the relative uplift is 3.75%. Whether that's worth a design change, QA cycle, and rollout depends on the cost of implementation, not on the p-value. This is the clearest illustration in this guide of why statistical and practical significance are two separate questions.

Statistical Significance vs Practical Significance

The checkout button example above is not an edge case; it is the single most common way A/B testing misleads well-intentioned teams. A p-value only answers "is this difference probably real?" It says nothing about whether the difference is big enough to justify the engineering time, the design review, or the risk of touching a high-traffic checkout flow.

Property | 📊 Statistical Significance | 💰 Practical Significance
Question it answers | Is this difference probably real, not random noise? | Is this difference large enough to matter for the business?
Depends heavily on | Sample size: larger samples detect smaller real effects | Cost of implementation, revenue impact, opportunity cost
Expressed as | A p-value compared against α | Absolute lift, relative uplift, or projected revenue impact
Risk of ignoring it | Shipping changes that are actually just noise | Shipping changes that are real but not worth the cost to build
Example 3 verdict | Significant (p = 0.0073) | Marginal: 0.15pp absolute lift; judgment call

Decision Rule You Can Memorize

Ask both questions in order. First: is the p-value below your threshold? If not, stop — you don't have evidence of a real difference. If yes, ask a second question: does the confidence interval's lower bound represent a lift worth the cost of shipping it? Only ship when both answers are yes.

Confidence Intervals Explained

A p-value, once compared against α, gives a single yes-or-no signal. A confidence interval gives a range, and the range is often more useful for a business decision than the p-value alone, because it shows both how big the effect might be and how uncertain that estimate still is.

Confidence Intervals: Overlapping vs Non-Overlapping

[Figure: two panels, each showing the control and variant confidence intervals side by side]
  • Overlapping CIs → Not Yet Significant: the ranges overlap, so the data can't rule out "no real difference."
  • Non-Overlapping CIs → Significant: the ranges are separated, so the variant is reliably higher.

Example 1's 95% CI for the difference is [0.05, 2.27] percentage points, which narrowly clears zero. Example 3's 95% CI is [0.04, 0.26] percentage points, which clears zero comfortably, but the range itself is small. A confidence interval that just barely excludes zero is still statistically significant, but worth treating cautiously.

Two intervals that overlap don't automatically mean "not significant," and non-overlapping intervals aren't the only valid test — the two-proportion z-test computed earlier is the precise method. But visually comparing confidence intervals is a fast, intuitive gut check, and it makes one thing obvious that a single p-value hides: how much uncertainty remains. Example 1's interval barely clears zero at the low end (0.05 percentage points); Example 3's interval clears zero more comfortably in relative terms but describes a much smaller effect. Both are "significant." They are not equally convincing.
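The worked examples report an interval for the difference without showing the arithmetic. One way to compute it is sketched below, assuming the unpooled (Wald) standard error, in which each group keeps its own variance. That choice is an assumption about the method; it does reproduce the published bounds to two decimals. The function name `diff_confidence_interval` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for (rate_b - rate_a), returned in percentage points."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Unpooled SE: H0 is no longer assumed, so each group contributes its own variance.
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    diff = rate_b - rate_a
    return (diff - z_crit * se) * 100, (diff + z_crit * se) * 100

examples = {
    "Example 1": (410, 5_000, 468, 5_000),
    "Example 3": (10_000, 250_000, 10_375, 250_000),
}
for label, counts in examples.items():
    low, high = diff_confidence_interval(*counts)
    print(f"{label}: [{low:.2f}, {high:.2f}] percentage points")
```

Both intervals exclude zero, but the width tells the rest of the story: Example 1 spans more than two percentage points, while Example 3 spans about a fifth of one.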

To build these intervals yourself for a single conversion rate or the gap between two, use the confidence interval for a proportion calculator or the margin of error guide for the underlying formula.

Sample Size and Statistical Power

Sample size is decided before a test starts, not checked afterward. Four inputs determine how many visitors each variant needs: the baseline conversion rate, the minimum detectable effect, the significance level, and the desired statistical power.

Sample Size per Variant — Two-Proportion Test
n = (z_α/2 + z_β)² × [p₁(1−p₁) + p₂(1−p₂)] / (p₂ − p₁)²
where:
  • z_α/2 = 1.96 for 95% confidence (two-tailed)
  • z_β = 0.84 for 80% power
  • p₁ = baseline conversion rate
  • p₂ = baseline rate + minimum detectable effect

Two patterns fall directly out of this formula, and both matter more in practice than the algebra itself. Lower baseline conversion rates require dramatically larger samples to detect the same relative lift: a 10% relative lift on a 2% baseline is an absolute gap of only 0.2 percentage points, and that gap is squared in the denominator, while the variance term p(1−p) in the numerator shrinks only roughly in proportion to the baseline. And chasing a smaller minimum detectable effect gets expensive fast for the same reason: cutting the MDE in half roughly quadruples the required sample size.

Baseline Rate | 10% Relative MDE | 20% Relative MDE | 30% Relative MDE
2% | 80,679 / variant | 21,106 / variant | 9,795 / variant
5% | 31,231 / variant | 8,155 / variant | 3,778 / variant
10% | 14,749 / variant | 3,839 / variant | 1,772 / variant
20% | 6,507 / variant | 1,680 / variant | 769 / variant
30% | 3,760 / variant | 961 / variant | 435 / variant

Sample Size Cheat Sheet — assumes α = 0.05 (two-tailed) and 80% power. Figures are visitors required per variant, calculated with the formula above. Use the sample size calculator for your own exact numbers.
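The cheat sheet figures can be reproduced with a few lines of code. This sketch assumes the critical values are taken at full precision from the normal distribution rather than rounded to 1.96 and 0.84, and that results are rounded up to the next whole visitor; under those assumptions it matches the table. The function name `sample_size_per_variant` is illustrative.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-tailed two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # rate the test must be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)

for baseline in (0.02, 0.05, 0.10, 0.20, 0.30):
    row = [sample_size_per_variant(baseline, mde) for mde in (0.10, 0.20, 0.30)]
    print(f"{baseline:.0%} baseline: {row}")
```

Plugging in the rounded values 1.96 and 0.84 instead gives slightly smaller figures, which is worth knowing when comparing this table against other calculators.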

⚠️ What an Underpowered Test Actually Looks Like

Imagine testing a page with a 2% baseline conversion rate, hoping to detect a 10% relative lift, but only running 10,000 visitors per variant instead of the 80,679 the table above calls for. The test will very likely report "not significant" even if the variant is genuinely 10% better — not because the lift isn't real, but because the sample was never large enough to detect it. A negative result from an underpowered test proves nothing.
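How likely is "very likely"? A rough answer comes from the normal approximation to statistical power. The sketch below is a simplification: it uses the same unpooled variance as the sample size formula above, and `approx_power` is an illustrative name rather than a library function.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p1, p2, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-tailed two-proportion test."""
    norm = NormalDist()
    se = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = abs(p2 - p1) / se  # the true gap, measured in standard errors
    # Chance the observed z-statistic lands beyond either critical value.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

baseline, lifted = 0.02, 0.022  # 2% baseline with a 10% relative lift
power_small = approx_power(baseline, lifted, 10_000)
power_planned = approx_power(baseline, lifted, 80_679)
print(f"Power at 10,000 per variant: {power_small:.0%}")
print(f"Power at 80,679 per variant: {power_planned:.0%}")
```

Under these assumptions the 10,000-visitor test detects the real lift in fewer than one run out of five, while the planned sample size reaches the 80% target.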

Common Mistakes That Invalidate A/B Tests

Mistake | What Goes Wrong | What To Do Instead
Peeking and stopping early | Checking the dashboard daily and stopping the moment it shows "significant" inflates the real false-positive rate well beyond 5% | Set sample size and duration in advance; only interpret the result once that point is reached
Sample size too small | Underpowered tests report "no difference" even when a real effect exists | Calculate required sample size before launch using your MDE and baseline rate
Ignoring statistical power | A test with 50% power will miss a real effect roughly half the time | Design for at least 80% power against a realistic minimum detectable effect
Multiple testing / running many metrics | Testing 20 independent metrics at α = 0.05 gives roughly a 64% chance at least one shows "significant" purely by chance | Pick one primary metric in advance, or apply a correction (e.g., Bonferroni) for secondary metrics
Seasonal or day-of-week bias | Running a test for three days captures one slice of behavior, not the full weekly cycle | Run tests for full week multiples to average out weekday/weekend differences
Sample ratio mismatch | Traffic doesn't split as configured (e.g., 55/45 instead of 50/50), often signaling a tracking bug | Check the actual visitor split against the intended split before trusting the result

Of these, peeking is the most underestimated. Statistician and analyst Evan Miller's widely cited note on the subject demonstrates that repeatedly checking a test and stopping at the first "significant" reading can push the real false-positive rate to over five times the stated 5% threshold, because every additional look is another chance to catch a random fluctuation (Evan Miller, "How Not To Run An A/B Test"). The fix costs nothing: decide the sample size before you start, and treat the p-value as meaningless until you get there.
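The effect of peeking is easy to see in a simulation. The sketch below runs A/A tests, where both groups share the same true conversion rate, so every "significant" reading is a false positive by construction. All parameters (300 tests, 2,000 visitors per group, a look every 200 visitors, a 10% true rate, the seed) are arbitrary choices for illustration, not figures from Evan Miller's note.

```python
import random
from math import sqrt

def is_significant(conv_a, conv_b, n, z_crit=1.96):
    """Pooled two-proportion z-test for two groups of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0, 1):
        return False  # no variation yet, nothing to test
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return abs(conv_b - conv_a) / n / se > z_crit

def simulate(n_tests=300, n_final=2_000, look_every=200, rate=0.10, seed=1):
    """Return (peeking false-positive rate, fixed-horizon false-positive rate)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        flagged_at_any_look = False
        for visitor in range(1, n_final + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if visitor % look_every == 0 and is_significant(conv_a, conv_b, visitor):
                flagged_at_any_look = True  # a peeker would have stopped here
        peeking_hits += flagged_at_any_look
        fixed_hits += is_significant(conv_a, conv_b, n_final)  # single look at the end
    return peeking_hits / n_tests, fixed_hits / n_tests

peeking_rate, fixed_rate = simulate()
print(f"False positives with 10 looks: {peeking_rate:.1%}")
print(f"False positives with 1 look:   {fixed_rate:.1%}")
```

The exact percentages depend on the seed, but the peeking rate can never be lower than the fixed-horizon rate, because the final look is the same test in both cases and every earlier look only adds chances to be fooled.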

How Experimentation Teams Apply This at Scale

Companies running thousands of simultaneous experiments treat statistical rigor as infrastructure, not as a one-off calculation. Automated checks for sample ratio mismatch, pre-registered metrics, and standardized significance thresholds all exist to prevent the exact mistakes listed above from slipping through at volume, where a single unchecked bias can quietly corrupt hundreds of decisions.

The most detailed public account of this discipline comes from practitioners who built experimentation platforms at Microsoft, Google, and LinkedIn, documented in Ron Kohavi, Diane Tang, and Ya Xu's Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing (Cambridge University Press, 2020). The book's central argument is not that the statistics are exotic (the two-proportion z-test used throughout this guide is standard) but that trustworthiness comes from process: pre-registering the metric before launch, guarding against peeking, and auditing for instrumentation errors before ever trusting a p-value (experimentguide.com).

For a small team, the same discipline scales down cleanly: write down the metric, the sample size, and the stopping point before the test launches, and don't revisit those decisions once the data starts coming in.
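One of the automated checks mentioned above, sample ratio mismatch, is simple enough for a small team to script. A minimal sketch follows, assuming a chi-square goodness-of-fit test on the visitor counts. The visitor counts and the 0.001 alert threshold are hypothetical choices for illustration, and `srm_p_value` is not a function from any testing platform.

```python
from math import erfc, sqrt

def srm_p_value(visitors_a: int, visitors_b: int, expected_share_a: float = 0.5) -> float:
    """Chi-square goodness-of-fit test (1 degree of freedom) on the traffic split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi_square = ((visitors_a - expected_a) ** 2 / expected_a
                  + (visitors_b - expected_b) ** 2 / expected_b)
    # With 1 degree of freedom, the chi-square tail probability is erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi_square / 2))

# Hypothetical counts: a 55/45 split where 50/50 was configured.
p_srm = srm_p_value(5_500, 4_500)
print(f"SRM p-value: {p_srm:.2e}")
if p_srm < 0.001:  # assumed alert threshold, deliberately strict
    print("Split differs from the configured ratio: check tracking before trusting results")
```

A strict threshold suits this check because a mismatch usually signals a bug rather than an effect, so a false alarm costs far less than a missed one.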

Tools for Running Statistical A/B Tests

Tool | Best For | Statistical Approach | Consideration
Google Analytics 4 | Teams already instrumented in GA4 running lightweight tests | Basic comparison; no built-in sequential correction | Free, but significance testing is limited compared to dedicated tools
Optimizely | Enterprise experimentation programs | Sequential testing with a built-in stats engine | Mature feature set; higher cost and setup overhead
VWO | Mid-market CRO teams | Bayesian and frequentist reporting options | Good balance of usability and statistical detail
AB Tasty | Marketing-led testing programs | Frequentist significance with visual editor | Strong for non-technical marketing teams
Adobe Target | Enterprises already in the Adobe ecosystem | Frequentist and multi-armed bandit options | Best value when bundled with Adobe Experience Cloud
Microsoft Clarity | Free qualitative context alongside a test | Not a significance testing tool | Pairs with a dedicated tool; doesn't replace one
R | Custom statistical analysis and simulation | Full control: prop.test(), power.prop.test(), and more | Requires statistical and coding fluency
Python (SciPy / statsmodels) | Data teams building custom pipelines | Full control: proportions_ztest and equivalents | Same tradeoff as R; integrates well with existing data infra
Excel / Google Sheets | Manual, ad hoc calculations | Manual z-test formula, as shown in this guide | No automation or peeking protection; easy to misuse

None of these tools change the underlying math. A two-proportion z-test computed by hand, in Excel, in Python, or inside an enterprise platform's dashboard will agree on the same p-value given the same inputs and the same settings (pooled, two-tailed, no continuity correction; R's prop.test(), for example, applies a continuity correction by default). What differs is how well each tool protects you from the mistakes in the section above, particularly peeking, which is why platforms built around sequential testing exist in the first place.

A/B Test Statistical Significance Calculator

Enter your own control and variant numbers below. The calculator runs the same two-proportion z-test used in every worked example above and reports the z-statistic, p-value, uplift, and confidence interval for the difference.

This calculator uses a two-tailed, pooled two-proportion z-test, the same method documented by NIST's Engineering Statistics Handbook §7.3.3. For sample size planning before you launch a test, use the dedicated A/B test calculator or the sample size calculator.

Should You Declare a Winner? A Decision Flowchart

Decision Flow — From Raw Data to a Ship / No-Ship Call

1. Reached the pre-set sample size and duration? If no: keep running. Don't peek and stop early. If yes: go to step 2.
2. p-value below your chosen α (e.g., 0.05)? If no: not significant. No detectable difference yet. If yes: go to step 3.
3. Is the effect large enough to justify the cost to ship it? If no: statistically real, practically negligible. Document, don't ship. If yes: go to step 4.
4. Confidence interval excludes zero throughout, comfortably? If no: interval too wide or borderline. Run longer or add traffic. If yes: declare a winner. Ship the variant.

This flow deliberately checks four things in sequence rather than one: the sample size discipline first, the statistical test second, the business value third, and the precision of the estimate fourth. Skipping straight to "is the p-value under 0.05?" is how the checkout button example above gets shipped without anyone asking whether it was worth building.
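The four checks can also be written as a small function that returns the first one that fails. This is a sketch only. The two business thresholds are assumptions that each team has to supply: `min_worthwhile_lift_pp` (the smallest lift worth shipping) and `comfort_margin_pp` (how far above zero the interval's lower bound must sit to count as "comfortably" clear). The flowchart itself does not define either number.

```python
def ship_decision(reached_planned_sample: bool, p_value: float, lift_pp: float,
                  ci_low_pp: float, min_worthwhile_lift_pp: float,
                  comfort_margin_pp: float, alpha: float = 0.05) -> str:
    """Apply the four checks in order; all lifts are in percentage points."""
    if not reached_planned_sample:
        return "keep running"
    if p_value >= alpha:
        return "not significant"
    if lift_pp < min_worthwhile_lift_pp:
        return "real but negligible: document, don't ship"
    if ci_low_pp <= comfort_margin_pp:
        return "borderline interval: run longer or add traffic"
    return "ship the variant"

# Hypothetical thresholds: a lift under 0.5pp is not worth building,
# and the interval's lower bound should sit above 0.1pp.
thresholds = dict(min_worthwhile_lift_pp=0.5, comfort_margin_pp=0.1)

example_1 = ship_decision(True, p_value=0.0404, lift_pp=1.16, ci_low_pp=0.05, **thresholds)
example_3 = ship_decision(True, p_value=0.0073, lift_pp=0.15, ci_low_pp=0.04, **thresholds)
print("Example 1:", example_1)
print("Example 3:", example_3)
```

With these particular thresholds, the headline test is flagged as borderline and the button test as negligible, even though both have p-values below 0.05. Different thresholds would change those verdicts, which is the point: the last two checks are business judgments, not statistical outputs.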

A/B Testing Checklist

Before Launch

  • Pick one primary metric that defines success
  • Calculate required sample size from your baseline rate and MDE
  • Decide the significance level (α) and power target in advance
  • Set a fixed test duration covering full business cycles
  • Confirm random, unbiased assignment to each variant

During the Test

  • Don't act on interim p-values — resist the urge to peek and stop
  • Monitor for sample ratio mismatch against your intended split
  • Watch for tracking or instrumentation errors early, not at the end
  • Let the test run to the pre-calculated sample size and duration

After the Test

  • Check statistical significance and practical significance together
  • Review the confidence interval's width, not just the point estimate
  • Only segment results if segments were pre-registered before launch
  • Document the result and, for high-stakes changes, consider a confirmation test

Glossary of A/B Testing Statistics Terms

Term | Notation | Plain-English Definition | Importance
A/B Testing | - | A randomized, controlled comparison of two experiences on one metric | The experimental method this entire guide is built around
Statistical Significance | p < α | Evidence that an observed gap is unlikely to be pure chance | Separates real effects from sampling noise
p-value | p | Probability of a gap this extreme if there were truly no difference | The core output of a significance test
Confidence Interval | CI | A plausible range for the true size of the effect | Shows both the estimate and its uncertainty
Confidence Level | 1 − α | How often the method produces an interval containing the truth | Set before the test; 95% is standard
Null Hypothesis | H₀ | The default assumption that no real difference exists | The claim a significance test tries to reject
Alternative Hypothesis | H₁ | The claim that a real difference exists | What you accept if H₀ is rejected
Statistical Power | 1 − β | Probability of detecting a real effect if one exists | Prevents wasted, underpowered tests
Sample Size | n | Number of visitors assigned to each variant | Directly drives both precision and power
Effect Size | - | The magnitude of the difference between variants | Determines how much sample size is required
Standard Error | SE | Expected sampling variation of a conversion rate estimate | The denominator that turns a gap into a z-score
Conversion Rate | - | Proportion of visitors completing the target action | The metric being compared between variants
Type I Error | α | False positive: declaring a winner that isn't real | Controlled directly by the significance threshold
Type II Error | β | False negative: missing a real winner | Controlled by statistical power and sample size
Practical Significance | - | Whether a confirmed real effect is large enough to matter | The business judgment call that follows the statistics

For the broader statistical foundations behind every term above, see the hypothesis testing pillar page, the confidence intervals pillar page, and the full Statistics Fundamentals glossary.

Frequently Asked Questions — Statistical Significance in A/B Testing

Statistical significance in A/B testing is a measure of how likely it is that the difference you observed between a control and a variant would happen by random chance alone if there were truly no difference between them. It is expressed as a p-value. A p-value below your chosen threshold, commonly 0.05, means the observed gap is unlikely to be pure noise, so you treat it as a real effect rather than a fluke of who happened to land in each group.
Without a significance check, a conversion rate that looks higher in a small sample can simply be sampling noise, not a real improvement. Rolling that change out to every visitor could leave revenue on the table or even cost conversions. Statistical significance is the filter that separates a change worth keeping from a coincidence that happened to look good for a few thousand visitors.
Most teams use a significance threshold (alpha) of 0.05, meaning they accept a 5% chance of declaring a difference when the variants are actually identical. Some high-traffic teams tighten this to 0.01 to reduce false positives when running many simultaneous tests, while some early-stage teams accept 0.10 to move faster on lower-stakes decisions. The threshold should be set before the test starts, not chosen afterward to fit the result.
95% confidence (matching a 0.05 alpha) is the standard default for most conversion rate optimization tests. Choose 99% for high-stakes or hard-to-reverse changes such as pricing or checkout flows, where a false positive is costly. Choose 90% only for low-risk, easily reversible tests where speed matters more than certainty, and be aware you are accepting a higher false-positive rate in exchange.
Run the test until it reaches the sample size you calculated in advance, and let it cover at least one full business cycle, typically one to two full weeks, so weekday and weekend behavior are both represented. Stopping as soon as the dashboard flashes "significant" inflates the false-positive rate, a problem statisticians call repeated significance testing or peeking.
Sample size depends on your baseline conversion rate, the smallest lift you care about detecting (the minimum detectable effect), your significance threshold, and your desired statistical power, usually 80%. Smaller baseline rates and smaller effects both require larger samples. A sample size calculator or the formula n = (z_α/2 + z_β)² × [p1(1−p1) + p2(1−p2)] / (p2−p1)² will give you the number of visitors needed per variant.
Statistical power is the probability that a test detects a real effect of a given size if one truly exists. An underpowered test, one with too small a sample for the effect size you're hoping to catch, will frequently report "no significant difference" even when a real, meaningful difference exists. That absence of significance gets misread as proof the variants are equal, when it may simply mean the test never had a fair chance to detect the difference.
Stopping a test as soon as it reaches significance is not safe. Checking a test repeatedly and stopping the moment it crosses the significance threshold sharply inflates the real false-positive rate, sometimes to several times the stated 5%. Decide on a sample size and duration before the test starts, and only interpret the p-value once that pre-set point is reached, unless you are using a testing tool built specifically for continuous monitoring, such as sequential testing.
Statistical significance tells you a difference is probably real, not random noise. Practical significance tells you whether that difference is large enough to matter for the business. With a large enough sample, even a tiny, commercially irrelevant lift can become statistically significant, exactly what happened in Example 3 above. The right question is always both: is this real, and is it worth acting on?
Two-tailed tests are the safer default for A/B testing because they detect a difference in either direction, protecting you from missing the case where your variant performs worse than control. A one-tailed test only checks one direction and requires a smaller sample to reach significance, but it should only be used when a result in the other direction is truly irrelevant to the decision, which is rare in conversion optimization. See the one-tailed vs two-tailed test guide for the full comparison.
A Type I error is a false positive: concluding the variant beats control when it actually does not. A Type II error is a false negative: concluding there is no difference when the variant actually does perform better. The significance threshold (alpha) controls the Type I error rate, while statistical power controls the Type II error rate. Reducing one type of error, all else equal, tends to increase the other, so both need to be set deliberately. Full detail at the Type I and Type II errors guide.
A Bayesian "probability to beat baseline" is not exactly the same thing as a p-value. A classic p-value answers "how surprising is this data if the variants are identical?" A Bayesian "probability to beat baseline" answers a different question: "given this data, how likely is it that the variant is actually better?" The two numbers often land close together but are not interchangeable, and some platforms blur the distinction in their dashboards. Check your tool's documentation to know which framework it is actually reporting, and see the Bayesian vs frequentist guide for the underlying difference.

Sources cited in this guide: NIST/SEMATECH e-Handbook of Statistical Methods, §7.3.3 — Comparing Two Proportions · NIST/SEMATECH e-Handbook, §7.1 — Product and Process Comparisons · Penn State STAT 200, Lesson 9.1 — Two Independent Proportions · Evan Miller — "How Not To Run An A/B Test" · Kohavi, Tang & Xu — Trustworthy Online Controlled Experiments (Cambridge University Press, 2020) · OpenIntro Statistics, 4th Ed. (Diez, Çetinkaya-Rundel, Barr)