What does sampling with replacement mean in bootstrap?

Sampling with replacement means each observation can be selected more than once within a single bootstrap resample. After drawing a data point, it is placed back into the pool before the next draw, so some values appear multiple times in a bootstrap sample while others may not appear at all.

How does bootstrap sampling work in machine learning?

In machine learning, bootstrap sampling is the foundation of bagging (Bootstrap Aggregating). Multiple training subsets are created by sampling the original dataset with replacement. A separate model is trained on each subset, and their predictions are averaged (for regression) or majority-voted (for classification). Random Forest is the most common example of this approach.

What is the difference between bootstrap sampling and cross-validation?

Bootstrap sampling draws samples with replacement and is primarily used for statistical inference (standard errors, confidence intervals) and ensemble learning. Cross-validation partitions data into non-overlapping folds without replacement and is used to estimate model generalization error. Out-of-bag bootstrap error estimates tend to be pessimistic on small datasets, because each model trains on only about 63.2% of the distinct observations; cross-validation is generally preferred for model selection.

How many bootstrap resamples (B) should you use?

For estimating standard errors, B = 200 is often sufficient. For constructing confidence intervals, B = 1,000 to 2,000 is recommended. For high-precision inference or BCa intervals, B = 5,000 to 10,000 is common. The computational cost is the main constraint.

Bootstrap Sampling: Definition, How It Works & Examples (2026)

What Is Bootstrap Sampling? (Definition)

Definition — Bootstrap Sampling

Bootstrap sampling is a nonparametric resampling technique that repeatedly draws random samples of size n from an original dataset, with replacement, to estimate the sampling distribution of a statistic. By computing the statistic across B resamples and examining the resulting empirical distribution, you can estimate standard errors, bias, and confidence intervals without assuming the population follows any particular distribution.

P(X* = xᵢ) = 1/n for each observation

The method was introduced by Bradley Efron in his 1979 paper "Bootstrap Methods: Another Look at the Jackknife," published in The Annals of Statistics. The name comes from the phrase "pulling oneself up by one's bootstraps" — the idea of generating information about a population from the sample itself, with no external data required.

The core assumption is the plug-in principle: treat your observed sample as the best available proxy for the population. If you had access to the actual population you would sample from it repeatedly; since you do not, you sample from your empirical data repeatedly instead. As the original sample size grows and becomes more representative of the population, bootstrap estimates become more accurate.

1979: Year Efron introduced the bootstrap
B = 1,000+: Typical number of resamples
~63.2%: Unique observations per resample (on average)
None: Parametric assumptions required

Featured Snippet — Bootstrap Sampling Definition

Bootstrap sampling is a statistical resampling method that draws random samples of size n from an observed dataset with replacement, repeating the process B times (typically 1,000–10,000). Each resample estimates a statistic (mean, median, regression coefficient, etc.), and the collection of B estimates forms a bootstrap distribution used to calculate standard errors and confidence intervals without parametric assumptions.

What Does "Sampling With Replacement" Mean?

Sampling with replacement means each observation is placed back into the pool after being drawn, so the same data point can be selected more than once in a single bootstrap resample. For a dataset of n = 5 values, each bootstrap resample still contains n = 5 draws, but some values may appear two or three times while others may not appear at all.

On average, any single bootstrap resample includes approximately 63.2% of the original dataset's unique observations (since the probability of any given observation being excluded from one resample is (1 − 1/n)ⁿ → 1/e ≈ 0.368). The remaining ~36.8% not selected in a given resample are called the out-of-bag (OOB) observations — a concept used directly in Random Forest to estimate generalization error without a separate test set.
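The 63.2% figure can be checked numerically. The sketch below is illustrative only: it uses Python's standard library, and the sample size, trial count, and seed are arbitrary choices, not values from the text. It compares the exact inclusion probability 1 − (1 − 1/n)ⁿ with the average share of distinct observations seen in simulated resamples.

```python
import random

def unique_fraction(n, rng):
    """Share of distinct original indices that appear in one resample of size n."""
    resample = rng.choices(range(n), k=n)  # n draws with replacement
    return len(set(resample)) / n

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 1000                 # arbitrary illustrative sample size
trials = 200

simulated = sum(unique_fraction(n, rng) for _ in range(trials)) / trials
exact = 1 - (1 - 1 / n) ** n   # probability a given observation is included

print(f"exact inclusion probability for n={n}: {exact:.4f}")
print(f"simulated mean unique fraction:       {simulated:.4f}")
```

The observations whose indices are missing from a given resample are that resample's out-of-bag set.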

Key Distinction: Sampling With vs. Without Replacement

Bootstrap sampling always uses replacement. Simple random sampling (used in surveys and study design) typically samples without replacement, so each observation can appear at most once. The distinction matters: drawing n observations without replacement from a dataset of size n always returns the same n observations in a different order, so every resample would yield the same statistic and reveal nothing about sampling variability.
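A short sketch of this distinction, using the heart rate values that appear in the worked example and Python's standard library (the seed is arbitrary): drawing n values without replacement only reorders the data, while drawing with replacement can change the composition of the sample and therefore its mean.

```python
import random

rng = random.Random(1)
data = [62, 70, 68, 75, 65]
n = len(data)

without = rng.sample(data, k=n)      # n draws WITHOUT replacement: a permutation
with_repl = rng.choices(data, k=n)   # n draws WITH replacement: a bootstrap resample

print("without replacement:", without, "mean =", sum(without) / n)
print("with replacement:   ", with_repl, "mean =", sum(with_repl) / n)
```

The mean of the permuted sample is always 68.0, whereas the mean of the resample varies from run to run.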

How Bootstrap Sampling Works — 6 Steps

How Does Bootstrap Sampling Work? (Quick Answer)

Start with original data of size n → draw n observations with replacement → compute your statistic → repeat B times → compile the B statistics into a distribution → use that distribution to estimate standard error or build a confidence interval.

Collect the Original Dataset

Begin with your observed sample X = {x₁, x₂, …, xₙ} of size n. This is your empirical population — it must be representative of the real population for bootstrap estimates to be meaningful. The bootstrap does not fix a bad or biased sample.

Draw a Bootstrap Resample With Replacement

Randomly select n observations from X, one at a time, replacing each observation before the next draw. This produces a bootstrap sample X* = {x*₁, x*₂, …, x*ₙ} where some values from X may appear multiple times and others may be absent. Each draw has probability 1/n for any observation.

Compute the Statistic of Interest

Calculate your target statistic θ̂ on the bootstrap resample — this could be the mean (x̄*), median, standard deviation, correlation coefficient, regression slope, or any other quantity. Call this bootstrap replicate θ*ᵇ.

Repeat B Times

Repeat steps 2 and 3 independently B times, where B is typically 1,000 to 10,000. Each repetition yields one bootstrap replicate θ*ᵇ. More resamples give more stable estimates: for standard errors B = 200 is often enough; for confidence intervals, B = 1,000–2,000 is recommended by Efron and Tibshirani.

Build the Bootstrap Distribution

Collect all B replicates {θ*¹, θ*², …, θ*ᴮ} to form the bootstrap distribution Ω_boot. This empirical distribution approximates the true sampling distribution of θ̂ — the same distribution you would observe if you could draw many samples from the actual population.

Estimate Standard Error, Bias, or Confidence Intervals

Use the bootstrap distribution to compute: the bootstrap standard error (standard deviation of the B replicates), the bootstrap bias (mean of replicates minus the original statistic), or a 95% confidence interval (the 2.5th and 97.5th percentiles of the B replicates for the percentile method).
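The six steps can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the helper name `bootstrap_replicates`, the choice of the median as the statistic, B = 2,000, and the seed are all assumptions made for the example.

```python
import random
import statistics

def bootstrap_replicates(data, statistic, B, rng):
    """Steps 2-4: draw B resamples of size n with replacement and
    evaluate the statistic on each one."""
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(B)]

rng = random.Random(42)
data = [62, 70, 68, 75, 65]            # step 1: the observed sample
theta_hat = statistics.median(data)    # statistic on the original data

B = 2000
reps = bootstrap_replicates(data, statistics.median, B, rng)  # steps 2-5

# Step 6: summaries of the bootstrap distribution.
se_boot = statistics.stdev(reps)                  # bootstrap standard error
bias_boot = statistics.mean(reps) - theta_hat     # bootstrap bias estimate
reps.sort()
lower_rank = B * 25 // 1000                       # rank of the 2.5th percentile
upper_rank = B * 975 // 1000                      # rank of the 97.5th percentile
ci = (reps[lower_rank - 1], reps[upper_rank - 1])

print(f"median = {theta_hat}, SE = {se_boot:.3f}, bias = {bias_boot:.3f}, 95% CI = {ci}")
```

Any function that maps a list of values to a number can be passed as `statistic`, which is what makes the procedure general.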

Bootstrap Sampling: Process Flow Diagram

[Diagram: Original dataset X = {x₁…xₙ} of size n → resamples #1, #2, …, #B (X*¹, X*², …, X*ᴮ; each of size n, drawn with replacement; repeat 1,000–10,000×) → statistics θ*¹, θ*², …, θ*ᴮ → bootstrap distribution {θ*¹…θ*ᴮ} → SE, bias, confidence interval]

Each resample is independent and drawn with replacement from the same original dataset.

Worked Example — Bootstrap Sampling Step by Step

The following example uses a small dataset to keep the arithmetic transparent. In practice you would use hundreds or thousands of resamples; here, three resamples show the mechanics clearly. This mirrors the approach described in Efron and Tibshirani's foundational textbook An Introduction to the Bootstrap (Chapman & Hall, 1993).

Worked Example — Estimating Bootstrap Mean and Standard Error

Problem: A researcher records the resting heart rates (bpm) of five participants: X = {62, 70, 68, 75, 65}. Use bootstrap sampling to estimate the standard error of the sample mean.

Original dataset: X = {62, 70, 68, 75, 65}, n = 5
Original sample mean: x̄ = (62 + 70 + 68 + 75 + 65) / 5 = 340 / 5 = 68.0 bpm

Bootstrap Resample 1 (drawn with replacement from X):
X*¹ = {70, 62, 70, 65, 68} — note 70 appears twice
Mean*¹ = (70 + 62 + 70 + 65 + 68) / 5 = 335 / 5 = 67.0

Bootstrap Resample 2:
X*² = {75, 75, 62, 68, 70} — 75 appears twice; 65 absent
Mean*² = (75 + 75 + 62 + 68 + 70) / 5 = 350 / 5 = 70.0

Bootstrap Resample 3:
X*³ = {62, 65, 65, 68, 62} — 62 and 65 each appear twice; 70 and 75 absent
Mean*³ = (62 + 65 + 65 + 68 + 62) / 5 = 322 / 5 = 64.4

In practice, repeat for B = 1,000 resamples. Here, using B = 3 for illustration, the bootstrap mean of means = (67.0 + 70.0 + 64.4) / 3 = 67.13

Bootstrap Standard Error (SE_boot):
SE_boot = √[(Σ(θ*ᵇ − θ̄*)²) / (B − 1)]
Deviations from 67.13: (67.0−67.13)² + (70.0−67.13)² + (64.4−67.13)² = 0.017 + 8.237 + 7.453 = 15.707
SE_boot = √(15.707 / 2) = √7.853 ≈ 2.80 bpm

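✅ With B = 1,000 resamples the bootstrap SE settles near 2.0 bpm for this dataset (the limiting value is √(19.6/5) ≈ 1.98 bpm, because the bootstrap variance of the mean uses the divisor-n variance 98/5 = 19.6). This is comparable to the analytical SE = s/√n = 4.95/√5 ≈ 2.21 bpm. The bootstrap estimate required no normality assumption.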
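The arithmetic in this worked example can be reproduced with Python's standard library. The sketch below recomputes the three resample means and the B = 3 standard error, then evaluates two reference values for the same data: the analytical s/√n and the large-B limit of the bootstrap SE of the mean, which divides the sum of squares by n instead of n − 1.

```python
import math
import statistics

data = [62, 70, 68, 75, 65]
resamples = [
    [70, 62, 70, 65, 68],
    [75, 75, 62, 68, 70],
    [62, 65, 65, 68, 62],
]

means = [statistics.mean(r) for r in resamples]   # 67.0, 70.0, 64.4
se_b3 = statistics.stdev(means)                   # divides by B - 1 = 2

n = len(data)
se_analytical = statistics.stdev(data) / math.sqrt(n)    # s / sqrt(n)
se_ideal_boot = statistics.pstdev(data) / math.sqrt(n)   # limit of SE_boot as B grows

print(f"SE from 3 resamples:   {se_b3:.2f}")
print(f"analytical s/sqrt(n):  {se_analytical:.2f}")
print(f"large-B bootstrap SE:  {se_ideal_boot:.2f}")
```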

Methodology follows Efron, B. & Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC. The standard error formula above matches equation 6.1 from that text. See also the sampling distributions overview on Statistics Fundamentals.

Bootstrap Confidence Intervals

The most common reason practitioners use bootstrap sampling is to construct a confidence interval for a statistic without making distributional assumptions. There are several methods; the two most widely used are the percentile method and the BCa method.

Percentile Method

Bootstrap Percentile Confidence Interval (95%)

CI = [θ*_(α/2), θ*_(1−α/2)]

where:
θ*_(α/2) = 2.5th percentile of the B bootstrap replicates
θ*_(1−α/2) = 97.5th percentile of the B bootstrap replicates
B ≥ 1,000 resamples recommended

After running B = 1,000 bootstrap resamples and computing the mean for each, sort the 1,000 means from smallest to largest. A 95% confidence interval takes the value at position 25 (the 2.5th percentile) as the lower bound and the value at position 975 (the 97.5th percentile) as the upper bound. No formula for the standard normal or t-distribution is needed.

Worked Example — Bootstrap 95% Confidence Interval

Using the heart rate example with B = 1,000 resamples (simulated), suppose the sorted bootstrap means range from 63.4 to 72.8. The 95% bootstrap CI is approximately [64.2, 71.8].

Generate B = 1,000 bootstrap resamples from X = {62, 70, 68, 75, 65}, each of size n = 5 with replacement. Compute the mean for each resample.

Sort the 1,000 bootstrap means from lowest to highest: {63.4, 63.6, 63.8, …, 72.6, 72.8}.

Extract percentiles: Lower bound = value at rank 25 (2.5th percentile) ≈ 64.2. Upper bound = value at rank 975 (97.5th percentile) ≈ 71.8.

✅ 95% Bootstrap Percentile CI: [64.2, 71.8] bpm. We are 95% confident the true population mean resting heart rate falls between 64.2 and 71.8 bpm, based on this sample and resampling method. Compare this to the analytical t-interval for the same data.
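The rank-based extraction can be sketched as follows, using only the standard library. The seed is arbitrary, so the exact bounds will differ slightly from the illustrative [64.2, 71.8] quoted above; the point is the mechanics of sorting the replicates and reading off ranks 25 and 975.

```python
import random
import statistics

rng = random.Random(7)
data = [62, 70, 68, 75, 65]
n, B = len(data), 1000

# Sorted bootstrap distribution of the mean.
boot_means = sorted(statistics.fmean(rng.choices(data, k=n)) for _ in range(B))

lower = boot_means[25 - 1]     # value at rank 25  (2.5th percentile)
upper = boot_means[975 - 1]    # value at rank 975 (97.5th percentile)

print(f"95% percentile CI: [{lower:.1f}, {upper:.1f}]")
```

Library percentile functions may interpolate between neighboring ranks, so their output can differ from this simple rank lookup in the last decimal place.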

BCa Method (Bias-Corrected and Accelerated)

The BCa method adjusts the percentile boundaries for two sources of distortion: bias (the bootstrap distribution is not centered on the original estimate) and acceleration (the standard error of the statistic changes with the parameter value). BCa intervals are more accurate than the basic percentile method, especially for skewed statistics or small samples. Most statistical software packages (R's boot library, Python's scipy.stats.bootstrap) default to BCa or offer it as an option.
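To make the two corrections concrete, here is a hedged sketch of one standard way to compute a BCa interval with the standard library: the bias correction z0 comes from the share of replicates below the original estimate, and the acceleration a comes from leave-one-out (jackknife) values. The function name `bca_interval`, the simple rank lookup, and the seed are assumptions of this example; production code would normally rely on a library routine instead.

```python
import random
import statistics

def bca_interval(data, statistic, B=2000, alpha=0.05, rng=None):
    """Sketch of a BCa interval; assumes some replicates fall on each side of theta_hat."""
    rng = rng or random.Random()
    n = len(data)
    theta_hat = statistic(data)
    reps = sorted(statistic(rng.choices(data, k=n)) for _ in range(B))
    norm = statistics.NormalDist()

    # Bias correction: how far the bootstrap distribution sits from theta_hat.
    prop_below = sum(r < theta_hat for r in reps) / B
    z0 = norm.inv_cdf(prop_below)          # requires 0 < prop_below < 1

    # Acceleration from leave-one-out (jackknife) values of the statistic.
    jack = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    jack_mean = statistics.fmean(jack)
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    a = num / den

    def adjusted_quantile(p):
        z = norm.inv_cdf(p)
        p_adj = norm.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
        idx = min(B - 1, max(0, int(p_adj * B)))   # simple rank lookup, no interpolation
        return reps[idx]

    return adjusted_quantile(alpha / 2), adjusted_quantile(1 - alpha / 2)

data = [62, 70, 68, 75, 65]
low, high = bca_interval(data, statistics.fmean, rng=random.Random(3))
print(f"BCa 95% CI (sketch): [{low:.1f}, {high:.1f}]")
```

When z0 and a are both zero, the adjusted percentiles reduce to the plain 2.5th and 97.5th percentiles, so the percentile method is a special case.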

Bootstrap Sampling in Python

Python's NumPy library makes bootstrap sampling straightforward with np.random.choice. The scipy.stats.bootstrap function (added in SciPy 1.7) provides BCa confidence intervals directly. In scikit-learn, sklearn.utils.resample draws a single resample (with replacement by default), and the same mechanism is used inside BaggingClassifier and RandomForestClassifier.

Python — Bootstrap Sampling (NumPy)

import numpy as np

# Original dataset
data = np.array([62, 70, 68, 75, 65])
n = len(data)
B = 10000  # number of bootstrap resamples

def bootstrap_replicate_1d(data, func):
    """Draw one bootstrap resample and compute the statistic."""
    bs_sample = np.random.choice(data, size=len(data), replace=True)
    return func(bs_sample)

def draw_bs_reps(data, func, size=10000):
    """Draw B bootstrap replicates of a statistic."""
    return np.array([bootstrap_replicate_1d(data, func)
                     for _ in range(size)])

# Generate bootstrap distribution of the mean
bs_means = draw_bs_reps(data, np.mean, size=B)

# Bootstrap standard error
se_boot = np.std(bs_means, ddof=1)

# 95% percentile confidence interval
ci_lower = np.percentile(bs_means, 2.5)
ci_upper = np.percentile(bs_means, 97.5)

print(f"Original mean: {np.mean(data):.2f}")
print(f"Bootstrap SE: {se_boot:.4f}")
print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")

Python — Using scipy.stats.bootstrap (BCa intervals)

from scipy import stats
import numpy as np

data = np.array([62, 70, 68, 75, 65])

result = stats.bootstrap(
    (data,),
    np.mean,
    n_resamples=9999,
    confidence_level=0.95,
    method='BCa'   # also try 'percentile' or 'basic'
)

print(f"BCa 95% CI: {result.confidence_interval}")
print(f"Bootstrap SE: {result.standard_error:.4f}")

NumPy documentation: numpy.random.choice. SciPy documentation: scipy.stats.bootstrap (BCa intervals).

Bootstrap Sampling Formulas

Quantity	Formula	Notes
SE_boot	√[ Σ(θ*ᵇ − θ̄*)² / (B−1) ]	Standard deviation of B bootstrap replicates
Bias_boot	(1/B) Σθ*ᵇ − θ̂	Mean of replicates minus original estimate
Percentile CI (95%)	[ θ*_0.025, θ*_0.975 ]	2.5th and 97.5th percentile of sorted replicates
Bias-corrected estimate	θ̂_bc = 2θ̂ − θ̄*	Corrects for bootstrap bias
Draw probability	P(X* = xᵢ) = 1/n	Equal chance of selecting any observation each draw
Prob. observation included	1 − (1 − 1/n)ⁿ → 1 − 1/e ≈ 0.632	~63.2% unique obs. per resample on average
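The bias rows of the table are easiest to see with a statistic that is known to be biased. The sketch below is an illustration under assumptions (standard library only, arbitrary seed and B): it applies the formulas to the plug-in variance, which divides by n and therefore underestimates the population variance on average.

```python
import random
import statistics

rng = random.Random(11)
data = [62, 70, 68, 75, 65]
n, B = len(data), 5000

theta_hat = statistics.pvariance(data)   # plug-in variance (divides by n): 19.6
reps = [statistics.pvariance(rng.choices(data, k=n)) for _ in range(B)]

theta_bar = statistics.fmean(reps)       # mean of the B replicates
se_boot = statistics.stdev(reps)         # sqrt(sum((rep - theta_bar)^2) / (B - 1))
bias_boot = theta_bar - theta_hat        # (1/B) * sum(reps) - theta_hat
theta_bc = 2 * theta_hat - theta_bar     # bias-corrected estimate

print(f"estimate = {theta_hat:.2f}, bias = {bias_boot:.2f}, corrected = {theta_bc:.2f}")
```

The estimated bias is negative here, so the corrected value moves upward, toward the usual divisor n − 1 sample variance of 24.5.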

Bootstrap Sampling in Machine Learning

Bootstrap sampling is not just a tool for statisticians; it is the foundation of some of the most widely used ensemble methods in machine learning. Understanding it makes bagging and Random Forest intuitive, and it clarifies the related row-subsampling options found in gradient boosting frameworks.

Bagging (Bootstrap Aggregating)

Bagging, short for Bootstrap Aggregating, was introduced by Leo Breiman in 1996. The procedure: generate B bootstrap samples from the training data, fit a separate model (decision tree, or any other learner) to each sample, then aggregate predictions by averaging (regression) or majority voting (classification). Because each model is trained on a slightly different bootstrap sample, individual models make different errors, and averaging reduces overall variance.

Bagging — Ensemble Prediction Formula

f̂_bag(x) = (1/B) Σ f*ᵇ(x)

where:
f*ᵇ = model trained on bootstrap resample b
B = number of bootstrap resamples
Average for regression; majority vote for classification
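A minimal regression sketch of the averaging formula, written with the standard library so it stays dependency-free. The toy data, the `fit_line` helper, and the use of a straight-line fit as the base learner are assumptions for illustration; bagging is usually applied to higher-variance learners such as decision trees.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares line y = a + b*x; falls back to a constant if x has no spread."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

rng = random.Random(5)
# Hypothetical toy data: y is roughly 2x + 1 plus Gaussian noise.
xs = [float(i) for i in range(20)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

B = 100
models = []
for _ in range(B):
    idx = rng.choices(range(len(xs)), k=len(xs))   # bootstrap resample of row indices
    models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))

def predict_bagged(x):
    """f_bag(x) = (1/B) * sum of the B individual model predictions."""
    return statistics.fmean([a + b * x for a, b in models])

print(f"bagged prediction at x = 10: {predict_bagged(10.0):.2f}")
```

For classification, the final averaging step would be replaced by a majority vote over the B predicted labels.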

Bootstrap Sampling in Random Forest

Random Forest extends bagging by adding a second source of randomization: at each node split, only a random subset of features is considered (typically √p for classification, p/3 for regression, where p is the total number of features). This decorrelates the individual trees further, reducing variance beyond what bagging alone achieves.

The role of bootstrap sampling in Random Forest is to create B different training sets, ensuring that no single influential data point dominates every tree. The out-of-bag observations (the ~36.8% not selected in each resample) are used to estimate generalization error — a free cross-validation estimate that requires no separate test split. This is described in Breiman's original 2001 paper in Machine Learning (vol. 45, pp. 5–32), available through the Springer archive.
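The out-of-bag idea can be sketched without any tree library. In the illustration below, a straight-line fit stands in for a tree, and the toy data, helper name, and seed are assumptions: each point is predicted only by models whose resample did not contain it, and those predictions are averaged and scored.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares line y = a + b*x; falls back to a constant if x has no spread."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

rng = random.Random(8)
n = 30
xs = [float(i) for i in range(n)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]   # hypothetical toy data

B = 200
oob_preds = [[] for _ in range(n)]   # predictions for point i from models that never saw i
oob_sizes = []

for _ in range(B):
    idx = rng.choices(range(n), k=n)
    in_bag = set(idx)
    a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
    oob = [i for i in range(n) if i not in in_bag]
    oob_sizes.append(len(oob) / n)
    for i in oob:
        oob_preds[i].append(a + b * xs[i])

# Average the OOB predictions per point, then score against the observed targets.
scored = [(statistics.fmean(p), ys[i]) for i, p in enumerate(oob_preds) if p]
oob_mse = statistics.fmean([(pred - y) ** 2 for pred, y in scored])
mean_oob_fraction = statistics.fmean(oob_sizes)

print(f"mean OOB fraction: {mean_oob_fraction:.3f}, OOB MSE: {oob_mse:.3f}")
```

No observation is ever scored by a model that trained on it, which is why the OOB error behaves like a held-out estimate.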

Random Forest: Each tree trained on a distinct bootstrap sample. OOB score provides unbiased error estimate without a held-out test set.

Bagging Classifier: B classifiers, each trained on a bootstrap resample. Majority vote reduces variance without increasing bias.

Quantitative Finance: Value-at-Risk estimation via bootstrap avoids assuming return distributions are normal, which is important during market stress events.

Clinical Trials: Estimate confidence intervals for treatment effects in small-sample medical studies where parametric assumptions are untenable.

A/B Testing: Bootstrap hypothesis testing compares conversion rates or revenue metrics without assuming normality of user behavior distributions.

Survey Research: Estimate variance for complex sample designs where analytic formulas are unavailable, including stratified and cluster sampling.

Bootstrap vs Cross-Validation vs Jackknife

Bootstrap sampling, cross-validation, and the jackknife are all resampling methods, but they serve different purposes and have different tradeoffs. The table below compares the three.

Feature	Bootstrap Sampling	Cross-Validation	Jackknife
Primary use	Standard errors, confidence intervals, ensemble learning	Model selection, generalization error estimation	Bias estimation, SE for smooth statistics
Resampling method	With replacement; same sample size	Without replacement; partitioned folds	Leave-one-out; n resamples of size n−1
Number of resamples	B = 1,000–10,000 (chosen by user)	k = 5 or 10 (fixed folds)	Exactly n (one per observation)
Parametric assumptions	None required	None required	Works best for smooth, differentiable statistics
Small sample performance	Usable for moderate n; unreliable for very small n (below roughly 10–15)	Can be unstable for very small n	Can fail for non-smooth statistics (e.g., median)
Computational cost	High (many resamples)	Moderate (k model fits)	n model fits; manageable
ML application	Bagging, Random Forest	Model hyperparameter tuning	Influence function estimation
Best used when	Need CI for any statistic without distributional assumptions	Selecting between candidate models	Quick bias correction with small n

When to Use Bootstrap Sampling

Choose bootstrap sampling when: (1) your sample is small and normality is questionable, (2) your statistic has no known analytical sampling distribution (e.g., the median, a correlation ratio, a quantile), (3) you need confidence intervals for a complex estimator, or (4) you are building an ensemble model via bagging. Use cross-validation instead when comparing and selecting between different model architectures.

Bootstrap Sampling Variations

Variant	Use Case	Key Difference from Standard Bootstrap
Nonparametric Bootstrap	General inference, no distribution assumed	Resamples directly from the observed data — the standard method described above
Parametric Bootstrap	When distribution is known (e.g., Poisson count data)	Fits a parametric model to data, then generates resamples from that fitted model
Stratified Bootstrap	Preserving class ratios in classification tasks	Resamples separately within each stratum/class to maintain proportions
Block Bootstrap	Time series, panel data, spatially correlated data	Resamples contiguous blocks to preserve serial correlation structure
Residual Bootstrap	Linear regression inference	Keeps predictors fixed; resamples regression residuals only
BCa Bootstrap	Skewed statistics, improved accuracy	Adjusts CI boundaries for bias and acceleration factors
Double Bootstrap	Calibrating CI coverage probability	Bootstraps within bootstraps to correct CI undercoverage
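As one example from the table, a moving block bootstrap can be sketched as follows. The function name, block length, and toy series are assumptions for illustration; in practice the block length has to be chosen to match the dependence structure of the data.

```python
import random

def moving_block_resample(series, block_len, rng):
    """Concatenate randomly chosen contiguous blocks until the resample has length n."""
    n = len(series)
    starts = range(n - block_len + 1)      # every valid block start position
    out = []
    while len(out) < n:
        s = rng.choice(starts)             # block starts are drawn with replacement
        out.extend(series[s:s + block_len])
    return out[:n]

rng = random.Random(2)
series = list(range(100, 120))             # hypothetical ordered series of 20 points
resample = moving_block_resample(series, block_len=5, rng=rng)

print(resample)
```

Within each block the original ordering is preserved, which is what keeps short-range serial correlation intact.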

Limitations of Bootstrap Sampling

Bootstrap sampling is powerful, but it is not a cure-all. Several situations call for caution or a different approach.

⚡ When Bootstrap Sampling Can Fail

Very small samples (n < 10–15): The empirical distribution approximates the population poorly. Bootstrap CIs may have poor coverage, and edge cases (like the sample maximum) can behave erratically.
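Non-smooth statistics: The bootstrap does not work well for extreme order statistics such as the sample maximum or minimum. A resample contains the observed maximum with probability 1 − (1 − 1/n)ⁿ ≈ 0.632, so the bootstrap distribution piles up on a single value instead of approximating the true sampling distribution. The m-out-of-n bootstrap or smoothed bootstrap variants can help; the jackknife is not a remedy, since it also fails for non-smooth statistics.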
Dependent data: The standard bootstrap assumes observations are independent. For time series or clustered data, use the block bootstrap instead to preserve serial or spatial correlation.
Biased or unrepresentative samples: Bootstrap resampling from a bad sample produces a bad distribution. Garbage in, garbage out — bootstrap is not a substitute for proper study design. See the study design section for guidance on data collection.
Computational cost: B = 10,000 resamples on a large dataset with a slow statistic can be expensive. Parallelization (via multiprocessing or vectorized NumPy) is advisable for production pipelines.
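The failure for the sample maximum is easy to demonstrate. In this sketch (standard library only; the uniform sample, n = 50, B = 4,000, and the seed are arbitrary assumptions), a large share of bootstrap replicates of the maximum equal the observed maximum exactly, whereas the true sampling distribution of the maximum of a continuous variable puts no mass on any single value.

```python
import random

rng = random.Random(4)
data = [rng.random() for _ in range(50)]   # hypothetical sample from Uniform(0, 1)
n, B = len(data), 4000
sample_max = max(data)

# Count how many bootstrap replicates of the maximum equal the observed maximum.
hits = sum(max(rng.choices(data, k=n)) == sample_max for _ in range(B))
share_at_max = hits / B
expected = 1 - (1 - 1 / n) ** n            # chance the observed max is in a resample

print(f"share of replicates equal to the sample max: {share_at_max:.3f}")
print(f"theoretical share:                           {expected:.3f}")
```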

Frequently Asked Questions

What is bootstrap sampling in simple terms?

Bootstrap sampling means repeatedly drawing random samples from your existing dataset, each sample the same size as the original and drawn with replacement (so the same data point can appear more than once). By computing a statistic — like the mean — across hundreds of these resamples, you build a picture of how that statistic would vary if you collected many new samples from the same population. This lets you estimate uncertainty without collecting more data.

Why is bootstrap sampling done with replacement?

Replacement is what makes bootstrap resamples genuinely different from one another. If you sampled without replacement from a dataset of size n, every resample would contain exactly the same n values in a different order — no new information about sampling variability would be generated. Sampling with replacement means each bootstrap sample is a new, slightly different version of the dataset, allowing the distribution of statistics across resamples to reflect real sampling uncertainty.

What is the role of bootstrap sampling in bagging?

In bagging (Bootstrap Aggregating), bootstrap sampling creates B distinct training sets from the original training data. A separate model — usually a decision tree — is fitted to each training set. Because each model sees a slightly different version of the data, individual models make different errors. Averaging (for regression) or majority-voting (for classification) these diverse models reduces variance without meaningfully increasing bias, improving prediction accuracy. Random Forest builds on this by also randomizing which features are considered at each split.

How many bootstrap resamples (B) do I need?

For standard error estimation, B = 200 is often sufficient. For percentile confidence intervals, Efron and Tibshirani recommend B = 1,000 as a minimum. For BCa intervals or when high precision matters, B = 5,000–10,000 is common. The bootstrap distribution converges as B increases; beyond B = 10,000, improvements are usually negligible compared to the variance from your original finite sample.
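One way to see the effect of B is to repeat the whole bootstrap several times and look at how much the SE estimate itself moves between runs. The sketch below is illustrative (standard library only; the values of B, the 20 repetitions, and the seed are arbitrary choices).

```python
import random
import statistics

def boot_se(data, B, rng):
    """Bootstrap standard error of the mean from B resamples."""
    n = len(data)
    reps = [statistics.fmean(rng.choices(data, k=n)) for _ in range(B)]
    return statistics.stdev(reps)

rng = random.Random(9)
data = [62, 70, 68, 75, 65]

spread = {}
for B in (50, 200, 2000):
    estimates = [boot_se(data, B, rng) for _ in range(20)]
    spread[B] = statistics.stdev(estimates)   # run-to-run variability of SE_boot

for B, s in spread.items():
    print(f"B = {B:>4}: run-to-run SD of the SE estimate = {s:.3f}")
```

The run-to-run variability shrinks roughly in proportion to 1/√B, which is why gains become small once B is in the thousands.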

What is the bootstrap distribution vs the sampling distribution?

The true sampling distribution is the theoretical distribution of a statistic computed across infinitely many samples from the actual population — usually unknown. The bootstrap distribution approximates it empirically: instead of drawing from the real population, you draw from your observed sample. As sample size n grows and your sample better represents the population, the bootstrap distribution converges to the true sampling distribution. For finite samples it is an approximation; accuracy depends on how well your observed data represents the population.

Can bootstrap sampling be used for hypothesis testing?

Yes. Bootstrap hypothesis tests can test whether an observed difference between groups is statistically significant. (Permutation tests are a closely related method that reshuffles group labels without replacement.) Under the null hypothesis, you pool the groups, generate bootstrap samples from the pooled data, compute the test statistic (e.g., difference in means) for each resample, and count how often the bootstrap statistic is at least as extreme as the observed value. That proportion is the p-value. This approach sidesteps normality assumptions and is widely used in A/B testing and biostatistics. See the hypothesis testing guide for the parametric equivalent.
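A hedged sketch of the pooled procedure for a difference in means, using only the standard library. The two groups are hypothetical data, the function name is made up for this example, and the (count + 1) / (B + 1) form of the p-value is one common convention, not the only one.

```python
import random
import statistics

def pooled_bootstrap_pvalue(a, b, B, rng):
    """Two-sided bootstrap test for a difference in means under a pooled null."""
    observed = statistics.fmean(a) - statistics.fmean(b)
    pooled = a + b                                   # null hypothesis: one common population
    count = 0
    for _ in range(B):
        a_star = rng.choices(pooled, k=len(a))       # resample both groups from the pool
        b_star = rng.choices(pooled, k=len(b))
        diff = statistics.fmean(a_star) - statistics.fmean(b_star)
        if abs(diff) >= abs(observed):
            count += 1
    return observed, (count + 1) / (B + 1)

rng = random.Random(6)
control = [62, 70, 68, 75, 65, 71, 66, 69]       # hypothetical measurements
treatment = [74, 79, 72, 81, 77, 76, 80, 73]     # hypothetical measurements

observed, p_value = pooled_bootstrap_pvalue(control, treatment, B=5000, rng=rng)
print(f"observed difference = {observed:.2f}, p-value = {p_value:.4f}")
```

Replacing the two `rng.choices` calls with a single shuffle of the pooled data, split at `len(a)`, would turn this into a permutation test.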

What is stratified bootstrap sampling?

Stratified bootstrap sampling divides the dataset into groups (strata) by some variable — commonly the class label in a classification problem — and samples with replacement separately within each stratum. This ensures that the proportion of each class remains the same in every bootstrap resample, preventing the class imbalance from changing between resamples. Scikit-learn's StratifiedKFold applies the same idea to cross-validation.
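A minimal sketch of the stratified variant, assuming rows are tuples whose second element is the class label (the data, function name, and seed are illustrative). Each stratum is resampled with replacement to its own original size, so the class counts are identical in every resample.

```python
import random
from collections import Counter

def stratified_resample(rows, label_of, rng):
    """Resample with replacement inside each stratum, keeping stratum sizes fixed."""
    strata = {}
    for row in rows:
        strata.setdefault(label_of(row), []).append(row)
    out = []
    for members in strata.values():
        out.extend(rng.choices(members, k=len(members)))
    return out

rng = random.Random(10)
# Hypothetical imbalanced dataset: 3 positive rows and 9 negative rows.
rows = [(i, "pos") for i in range(3)] + [(i, "neg") for i in range(3, 12)]

resample = stratified_resample(rows, label_of=lambda r: r[1], rng=rng)
counts = Counter(label for _, label in resample)

print(counts)
```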
