What Is Bootstrap Sampling? (Definition)
Bootstrap sampling is a statistical resampling method that draws random samples of size n from an observed dataset with replacement, repeating the process B times (typically 1,000–10,000). A statistic (mean, median, regression coefficient, etc.) is computed on each resample, and the collection of B estimates forms a bootstrap distribution used to calculate standard errors and confidence intervals without parametric assumptions.
The method was introduced by Bradley Efron in his 1979 paper "Bootstrap Methods: Another Look at the Jackknife," published in The Annals of Statistics. The name comes from the phrase "pulling oneself up by one's bootstraps": the idea of generating information about a population from the sample itself, with no external data required.
The core assumption is the plug-in principle: treat your observed sample as the best available proxy for the population. If you had access to the actual population you would sample from it repeatedly; since you do not, you sample from your empirical data repeatedly instead. As the original sample size grows and becomes more representative of the population, bootstrap estimates become more accurate.
What Does "Sampling With Replacement" Mean?
Sampling with replacement means each observation is placed back into the pool after being drawn, so the same data point can be selected more than once in a single bootstrap resample. For a dataset of n = 5 values, each bootstrap resample still contains n = 5 draws, but some values may appear two or three times while others may not appear at all.
On average, any single bootstrap resample includes approximately 63.2% of the original dataset's unique observations (since the probability of any given observation being excluded from one resample is (1 − 1/n)ⁿ → 1/e ≈ 0.368). The remaining ~36.8% not selected in a given resample are called the out-of-bag (OOB) observations — a concept used directly in Random Forest to estimate generalization error without a separate test set.
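The 63.2% figure can be checked numerically. The snippet below is a minimal sketch: the seed, the sample size n = 1,000, and the helper name `unique_fraction` are illustrative choices, not part of any library.

```python
import numpy as np

def unique_fraction(n, rng):
    """Fraction of the n original observations that appear in one bootstrap resample."""
    idx = rng.integers(0, n, size=n)   # n draws with replacement, as row indices
    return np.unique(idx).size / n

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Theoretical inclusion probability 1 - (1 - 1/n)^n for a few sample sizes
for n in (5, 50, 1000):
    print(n, round(1 - (1 - 1 / n) ** n, 4))

# Empirical average over 2,000 resamples of a dataset with n = 1,000
simulated = np.mean([unique_fraction(1000, rng) for _ in range(2000)])
print(f"simulated: {simulated:.3f}, limit 1 - 1/e: {1 - np.exp(-1):.3f}")
```

For small n the inclusion probability sits a little above the limit (about 0.672 for n = 5), so 63.2% is best read as a large-sample approximation.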
Bootstrap sampling always uses replacement. Simple random sampling (used in surveys and study design) typically samples without replacement, so each observation can only appear once. The distinction matters: drawing n observations without replacement from a dataset of size n would always produce the same n observations in a different order, giving no information about sampling variability.
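A small sketch of that distinction, using NumPy's `Generator.choice` on the five heart-rate values that appear later in the worked example. The seed and the 1,000 repetitions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility
data = np.array([62, 70, 68, 75, 65])

# Without replacement, drawing all n values just shuffles the data
no_repl = rng.choice(data, size=data.size, replace=False)
# With replacement, values can repeat or drop out
with_repl = rng.choice(data, size=data.size, replace=True)
print(np.sort(no_repl), no_repl.mean())      # same five values, so the mean is always 68.0
print(np.sort(with_repl), with_repl.mean())

# Spread of the mean across 1,000 "resamples" under each scheme
means_no_repl = [rng.choice(data, size=data.size, replace=False).mean() for _ in range(1000)]
means_with_repl = [rng.choice(data, size=data.size, replace=True).mean() for _ in range(1000)]
print(np.std(means_no_repl), np.std(means_with_repl))
```

The first standard deviation is zero because every without-replacement draw is a permutation of the same values; only the with-replacement draws produce variation in the statistic.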
How Bootstrap Sampling Works — 6 Steps
Start with original data of size n → draw n observations with replacement → compute your statistic → repeat B times → compile the B statistics into a distribution → use that distribution to estimate standard error or build a confidence interval.
Collect the Original Dataset
Begin with your observed sample X = {x₁, x₂, …, xₙ} of size n. This is your empirical population — it must be representative of the real population for bootstrap estimates to be meaningful. The bootstrap does not fix a bad or biased sample.
Draw a Bootstrap Resample With Replacement
Randomly select n observations from X, one at a time, replacing each observation before the next draw. This produces a bootstrap sample X* = {x*₁, x*₂, …, x*ₙ} where some values from X may appear multiple times and others may be absent. Each draw has probability 1/n for any observation.
Compute the Statistic of Interest
Calculate your target statistic θ̂ on the bootstrap resample — this could be the mean (x̄*), median, standard deviation, correlation coefficient, regression slope, or any other quantity. Call this bootstrap replicate θ*ᵇ.
Repeat B Times
Repeat steps 2 and 3 independently B times, where B is typically 1,000 to 10,000. Each repetition yields one bootstrap replicate θ*ᵇ. More resamples give more stable estimates: for standard errors B = 200 is often enough; for confidence intervals, B = 1,000–2,000 is recommended by Efron and Tibshirani.
Build the Bootstrap Distribution
Collect all B replicates {θ*¹, θ*², …, θ*ᴮ} to form the bootstrap distribution Ωboot. This empirical distribution approximates the true sampling distribution of θ̂ — the same distribution you would observe if you could draw many samples from the actual population.
Estimate Standard Error, Bias, or Confidence Intervals
Use the bootstrap distribution to compute: the bootstrap standard error (standard deviation of the B replicates), the bootstrap bias (mean of replicates minus the original statistic), or a 95% confidence interval (the 2.5th and 97.5th percentiles of the B replicates for the percentile method).
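The six steps can be sketched in a single function. This is an illustrative outline under simple assumptions (one-dimensional data, percentile interval); the function name `bootstrap_summary`, its defaults, and the returned dictionary layout are hypothetical.

```python
import numpy as np

def bootstrap_summary(data, stat, B=2000, alpha=0.05, seed=0):
    """Resample, compute replicates, then summarize SE, bias, and percentile CI."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)                      # step 1: the original sample
    n = data.size
    theta_hat = stat(data)                       # statistic on the original sample
    reps = np.empty(B)
    for b in range(B):                           # step 4: repeat B times
        resample = rng.choice(data, size=n, replace=True)   # step 2
        reps[b] = stat(resample)                             # step 3
    # steps 5 and 6: the replicates form the bootstrap distribution
    se = reps.std(ddof=1)                        # bootstrap standard error
    bias = reps.mean() - theta_hat               # bootstrap bias estimate
    lo, hi = np.percentile(reps, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return {"estimate": theta_hat, "se": se, "bias": bias, "ci": (lo, hi)}

summary = bootstrap_summary([62, 70, 68, 75, 65], np.median)
print(summary)
```

Passing a different callable (for example `np.mean` or `np.std`) changes the statistic without touching the resampling logic, which is the main appeal of the procedure.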
Bootstrap Sampling — Process Flow Diagram
[Diagram: original sample X = {x₁…xₙ} (size n) → bootstrap resamples X*¹, X*², …, X*ᴮ (each of size n), repeated 1,000–10,000 times.]
Each resample is independent and drawn with replacement from the same original dataset.
Worked Example — Bootstrap Sampling Step by Step
The following example uses a small dataset to keep the arithmetic transparent. In practice you would use hundreds or thousands of resamples; here, three resamples show the mechanics clearly. This mirrors the approach described in Efron and Tibshirani's foundational textbook An Introduction to the Bootstrap (Chapman & Hall, 1993).
Problem: A researcher records the resting heart rates (bpm) of five participants: X = {62, 70, 68, 75, 65}. Use bootstrap sampling to estimate the standard error of the sample mean.
Original dataset: X = {62, 70, 68, 75, 65}, n = 5
Original sample mean: x̄ = (62 + 70 + 68 + 75 + 65) / 5 = 340 / 5 = 68.0 bpm
Bootstrap Resample 1 (drawn with replacement from X):
X*¹ = {70, 62, 70, 65, 68} — note 70 appears twice
Mean*¹ = (70 + 62 + 70 + 65 + 68) / 5 = 335 / 5 = 67.0
Bootstrap Resample 2:
X*² = {75, 75, 62, 68, 70} — 75 appears twice; 65 absent
Mean*² = (75 + 75 + 62 + 68 + 70) / 5 = 350 / 5 = 70.0
Bootstrap Resample 3:
X*³ = {62, 65, 65, 68, 62} — 62 and 65 each appear twice; 70 and 75 absent
Mean*³ = (62 + 65 + 65 + 68 + 62) / 5 = 322 / 5 = 64.4
In practice, repeat for B = 1,000 resamples. Here, using B = 3 for illustration, the bootstrap mean of means = (67.0 + 70.0 + 64.4) / 3 = 67.13
Bootstrap Standard Error (SEboot):
SEboot = √[(Σ(θ*ᵇ − θ̄*)²) / (B − 1)]
Sum of squared deviations from 67.13: (67.0−67.13)² + (70.0−67.13)² + (64.4−67.13)² = 0.017 + 8.237 + 7.453 = 15.707
SEboot = √(15.707 / 2) = √7.853 ≈ 2.80 bpm
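✅ With B = 1,000 resamples the bootstrap SE settles near 2.0 bpm for this dataset (its large-B limit is √(19.6/5) ≈ 1.98 bpm), comparable to the analytical SE = s/√n = 4.95/√5 ≈ 2.21 bpm. The bootstrap value is slightly smaller because resampling uses the plug-in variance with divisor n rather than n − 1. The bootstrap estimate required no normality assumption.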
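The arithmetic in this example can be reproduced in a few lines. The sketch also computes two reference values: the analytical SE of the mean, and the value the bootstrap SE of the mean approaches as B grows, which for the sample mean is the standard deviation with divisor n, divided by √n.

```python
import numpy as np

data = np.array([62, 70, 68, 75, 65])
n = data.size

# The three resamples from the worked example
resamples = np.array([
    [70, 62, 70, 65, 68],
    [75, 75, 62, 68, 70],
    [62, 65, 65, 68, 62],
])
means = resamples.mean(axis=1)            # 67.0, 70.0, 64.4
se_b3 = means.std(ddof=1)                 # SE from only B = 3 replicates

# Analytical SE of the mean: s / sqrt(n), with s using the n - 1 divisor
se_analytic = data.std(ddof=1) / np.sqrt(n)

# Large-B limit of the bootstrap SE of the mean: plug-in SD (divisor n) / sqrt(n)
se_boot_limit = data.std(ddof=0) / np.sqrt(n)

print(f"{se_b3:.2f} {se_analytic:.2f} {se_boot_limit:.2f}")  # 2.80 2.21 1.98
```

The gap between 2.80 and 1.98 is a reminder that three replicates are far too few; the B = 3 value is only there to show the mechanics.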
Bootstrap Confidence Intervals
The most common reason practitioners use bootstrap sampling is to construct a confidence interval for a statistic without making distributional assumptions. There are several methods; the two most widely used are the percentile method and the BCa method.
Percentile Method
95% percentile CI = [θ*(α/2), θ*(1−α/2)], with α = 0.05
θ*(α/2) = 2.5th percentile of the B bootstrap replicates
θ*(1−α/2) = 97.5th percentile of the B bootstrap replicates
B ≥ 1,000 resamples recommended
After running B = 1,000 bootstrap resamples and computing the mean for each, sort the 1,000 means from smallest to largest. A 95% confidence interval takes the value at position 25 (the 2.5th percentile) as the lower bound and the value at position 975 (the 97.5th percentile) as the upper bound. No formula for the standard normal or t-distribution is needed.
Using the heart rate example with B = 1,000 resamples (simulated), suppose the sorted bootstrap means range from 63.4 to 72.8. The 95% bootstrap CI is approximately [64.2, 71.8].
Generate B = 1,000 bootstrap resamples from X = {62, 70, 68, 75, 65}, each of size n = 5 with replacement. Compute the mean for each resample.
Sort the 1,000 bootstrap means from lowest to highest: {63.4, 63.6, 63.8, …, 72.6, 72.8}.
Extract percentiles: Lower bound = value at rank 25 (2.5th percentile) ≈ 64.2. Upper bound = value at rank 975 (97.5th percentile) ≈ 71.8.
✅ 95% Bootstrap Percentile CI: [64.2, 71.8] bpm. We are 95% confident the true population mean resting heart rate falls between 64.2 and 71.8 bpm, based on this sample and resampling method. Compare this with the analytical t-interval for the same data, 68.0 ± 2.776 × 2.21 ≈ [61.9, 74.1] bpm, which is wider; with n = 5 the percentile interval tends to be too narrow.
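For reference, the analytical t-interval for the same five values can be computed with `scipy.stats.t`. This is a short sketch of the standard formula x̄ ± t × s/√n with n − 1 degrees of freedom; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

data = np.array([62, 70, 68, 75, 65])
n = data.size
mean = data.mean()
se = data.std(ddof=1) / np.sqrt(n)       # analytical standard error of the mean

# Two-sided 95% t-interval with n - 1 = 4 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)
t_lower, t_upper = mean - t_crit * se, mean + t_crit * se
print(f"t critical value: {t_crit:.3f}")
print(f"t-interval: [{t_lower:.2f}, {t_upper:.2f}]")
```

The t-interval comes out wider than the percentile interval quoted above. With only five observations, that difference mostly reflects the percentile method's known tendency to undercover at very small n rather than a gain in precision.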
BCa Method (Bias-Corrected and Accelerated)
The BCa method adjusts the percentile boundaries for two sources of distortion: bias (the bootstrap distribution is not centered on the original estimate) and acceleration (the standard error of the statistic changes with the parameter value). BCa intervals are more accurate than the basic percentile method, especially for skewed statistics or small samples. Most statistical software packages (R's boot library, Python's scipy.stats.bootstrap) default to BCa or offer it as an option.
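To make the two corrections concrete, here is a hand-rolled sketch of a BCa interval following the usual textbook recipe: a bias-correction term z0 from the share of replicates below the original estimate, and an acceleration term a from leave-one-out (jackknife) estimates. The function name `bca_interval` and the synthetic skewed sample are assumptions for illustration; for real work the library implementations mentioned above are the safer choice.

```python
import numpy as np
from scipy.stats import norm

def bca_interval(data, stat, B=5000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = data.size
    theta_hat = stat(data)

    # Bootstrap replicates from a (B, n) matrix of row indices
    idx = rng.integers(0, n, size=(B, n))
    reps = np.array([stat(data[row]) for row in idx])

    # Bias correction: share of replicates below the original estimate
    z0 = norm.ppf(np.mean(reps < theta_hat))

    # Acceleration from the n leave-one-out estimates
    jack = np.array([stat(np.delete(data, i)) for i in range(n)])
    d = jack.mean() - jack
    a = np.sum(d ** 3) / (6.0 * np.sum(d ** 2) ** 1.5)

    # Shift the nominal percentile levels, then read them off the replicates
    z = norm.ppf([alpha / 2, 1 - alpha / 2])
    adj = norm.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
    return tuple(np.percentile(reps, 100 * adj))

rng = np.random.default_rng(42)
sample = rng.exponential(scale=10.0, size=40)   # illustrative right-skewed data
bca_lo, bca_hi = bca_interval(sample, np.mean)
print(f"BCa 95% CI for the mean: [{bca_lo:.2f}, {bca_hi:.2f}]")
```

When z0 and a are both zero the adjusted levels reduce to α/2 and 1 − α/2, so the plain percentile interval is a special case of this calculation.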
Bootstrap Sampling in Python
Python's NumPy library makes bootstrap sampling straightforward with np.random.choice. The scipy.stats.bootstrap function (added in SciPy 1.7) provides BCa confidence intervals directly. The scikit-learn library does not expose a standalone bootstrap function, but the underlying mechanism is used inside BaggingClassifier and RandomForestClassifier.
```python
import numpy as np

# Original dataset
data = np.array([62, 70, 68, 75, 65])
n = len(data)
B = 10000  # number of bootstrap resamples

def bootstrap_replicate_1d(data, func):
    """Draw one bootstrap resample and compute the statistic."""
    bs_sample = np.random.choice(data, size=len(data), replace=True)
    return func(bs_sample)

def draw_bs_reps(data, func, size=10000):
    """Draw B bootstrap replicates of a statistic."""
    return np.array([bootstrap_replicate_1d(data, func) for _ in range(size)])

# Generate bootstrap distribution of the mean
bs_means = draw_bs_reps(data, np.mean, size=B)

# Bootstrap standard error
se_boot = np.std(bs_means, ddof=1)

# 95% percentile confidence interval
ci_lower = np.percentile(bs_means, 2.5)
ci_upper = np.percentile(bs_means, 97.5)

print(f"Original mean: {np.mean(data):.2f}")
print(f"Bootstrap SE: {se_boot:.4f}")
print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
```
```python
from scipy import stats
import numpy as np

data = np.array([62, 70, 68, 75, 65])

result = stats.bootstrap(
    (data,),
    np.mean,
    n_resamples=9999,
    confidence_level=0.95,
    method='BCa'  # also try 'percentile' or 'basic'
)

print(f"BCa 95% CI: {result.confidence_interval}")
print(f"Bootstrap SE: {result.standard_error:.4f}")
```
Bootstrap Sampling Formulas
| Quantity | Formula | Notes |
|---|---|---|
| SEboot | √[ Σ(θ*ᵇ − θ̄*)² / (B−1) ] | Standard deviation of B bootstrap replicates |
| Biasboot | (1/B) Σθ*ᵇ − θ̂ | Mean of replicates minus original estimate |
| Percentile CI (95%) | [ θ*0.025, θ*0.975 ] | 2.5th and 97.5th percentile of sorted replicates |
| Bias-corrected estimate | θ̂bc = 2θ̂ − θ̄* | Corrects for bootstrap bias |
| Draw probability | P(X* = xᵢ) = 1/n | Equal chance of selecting any observation each draw |
| Prob. observation included | 1 − (1 − 1/n)ⁿ → 1 − 1/e ≈ 0.632 | ~63.2% unique obs. per resample on average |
Bootstrap Sampling in Machine Learning
Bootstrap sampling is not just a tool for statisticians: it is the foundation of some of the most widely used ensemble methods in machine learning. Understanding bootstrap sampling gives an intuitive grasp of bagging and Random Forest, and of the row-subsampling options that many gradient boosting frameworks offer.
Bagging (Bootstrap Aggregating)
Bagging, short for Bootstrap Aggregating, was introduced by Leo Breiman in 1996. The procedure: generate B bootstrap samples from the training data, fit a separate model (decision tree, or any other learner) to each sample, then aggregate predictions by averaging (regression) or majority voting (classification). Because each model is trained on a slightly different bootstrap sample, individual models make different errors, and averaging reduces overall variance.
f̂bag(x) = (1/B) Σ f*ᵇ(x), summed over b = 1, …, B
f*ᵇ = model trained on bootstrap resample b
B = number of bootstrap resamples
Average for regression; majority vote for classification
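A NumPy-only sketch of bagging for regression follows. To keep it dependency-free, a degree-5 polynomial fit (`np.polyfit`) stands in for the base learner where a decision tree would normally be used; the synthetic sine data, the helper names `fit_bagged` and `predict_bagged`, and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data (illustrative): noisy sine curve
x = np.sort(rng.uniform(0, 6, size=80))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

def fit_bagged(x, y, B=100, degree=5, rng=rng):
    """Fit B polynomial models, each on its own bootstrap resample."""
    n = x.size
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)           # bootstrap resample (row indices)
        models.append(np.polyfit(x[idx], y[idx], degree))
    return models

def predict_bagged(models, x_new):
    """Aggregate by averaging the B individual predictions (regression case)."""
    preds = np.array([np.polyval(coefs, x_new) for coefs in models])
    return preds.mean(axis=0)

models = fit_bagged(x, y)
x_grid = np.linspace(0.5, 5.5, 5)
bagged = predict_bagged(models, x_grid)
print(np.round(bagged, 2))
print(np.round(np.sin(x_grid), 2))   # noise-free target, for comparison
```

For classification the last step would change from averaging to a majority vote over the B predicted labels; the resampling loop stays the same.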
Bootstrap Sampling in Random Forest
Random Forest extends bagging by adding a second source of randomization: at each node split, only a random subset of features is considered (typically √p for classification, p/3 for regression, where p is the total number of features). This decorrelates the individual trees further, reducing variance beyond what bagging alone achieves.
The role of bootstrap sampling in Random Forest is to create B different training sets, ensuring that no single influential data point dominates every tree. The out-of-bag observations (the ~36.8% not selected in each resample) are used to estimate generalization error — a free cross-validation estimate that requires no separate test split. This is described in Breiman's original 2001 paper in Machine Learning (vol. 45, pp. 5–32), available through the Springer archive.
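The out-of-bag idea can be sketched without any tree library. Below, a cubic polynomial fit stands in for each tree (an assumption made only to keep the example short), and every observation is scored using only the models whose bootstrap resample did not contain it. Data and parameter values are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, size=80))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
n, B, degree = x.size, 200, 3

oob_sum = np.zeros(n)     # running sum of OOB predictions per observation
oob_count = np.zeros(n)   # number of models for which each observation was OOB
oob_fractions = []

for _ in range(B):
    idx = rng.integers(0, n, size=n)      # bootstrap resample (row indices)
    oob = np.ones(n, dtype=bool)
    oob[idx] = False                      # True where the observation was never drawn
    coefs = np.polyfit(x[idx], y[idx], degree)
    oob_sum[oob] += np.polyval(coefs, x[oob])
    oob_count[oob] += 1
    oob_fractions.append(oob.mean())

seen = oob_count > 0                      # observations that were OOB at least once
oob_pred = oob_sum[seen] / oob_count[seen]
oob_mse = np.mean((y[seen] - oob_pred) ** 2)
print(f"mean OOB fraction: {np.mean(oob_fractions):.3f}")
print(f"OOB mean squared error: {oob_mse:.3f}")
```

Because each prediction comes only from models that never saw the point, the OOB error behaves like a held-out estimate while still using all of the data for training.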
Random Forest
Each tree is trained on a distinct bootstrap sample. The OOB score provides an approximately unbiased error estimate without a held-out test set.
Bagging Classifier
B classifiers, each trained on a bootstrap resample. Majority vote reduces variance without increasing bias.
Quantitative Finance
Value-at-Risk estimation via bootstrap avoids assuming return distributions are normal — important during market stress events.
Clinical Trials
Estimate confidence intervals for treatment effects in small-sample medical studies where parametric assumptions are untenable.
A/B Testing
Bootstrap hypothesis testing compares conversion rates or revenue metrics without assuming normality of user behavior distributions.
Survey Research
Estimate variance for complex sample designs where analytic formulas are unavailable, including stratified and cluster sampling.
Bootstrap vs Cross-Validation vs Jackknife
Bootstrap sampling, cross-validation, and the jackknife are all resampling methods, but they serve different purposes and have different tradeoffs. The table below compares the three.
| Feature | Bootstrap Sampling | Cross-Validation | Jackknife |
|---|---|---|---|
| Primary use | Standard errors, confidence intervals, ensemble learning | Model selection, generalization error estimation | Bias estimation, SE for smooth statistics |
| Resampling method | With replacement; same sample size | Without replacement; partitioned folds | Leave-one-out; n resamples of size n−1 |
| Number of resamples | B = 1,000–10,000 (chosen by user) | k = 5 or 10 (fixed folds) | Exactly n (one per observation) |
| Parametric assumptions | None required | None required | Works best for smooth, differentiable statistics |
| Small sample performance | Usable for modest n, but unreliable for very small samples (n < 10–15) | Can be unstable for very small n | Can fail for non-smooth statistics (e.g., median) |
| Computational cost | High (many resamples) | Moderate (k model fits) | n model fits — manageable |
| ML application | Bagging, Random Forest | Model hyperparameter tuning | Influence function estimation |
| Best used when | Need CI for any statistic without distributional assumptions | Selecting between candidate models | Quick bias correction with small n |
Choose bootstrap sampling when: (1) your sample is small and normality is questionable, (2) your statistic has no known analytical sampling distribution (e.g., the median, a correlation ratio, a quantile), (3) you need confidence intervals for a complex estimator, or (4) you are building an ensemble model via bagging. Use cross-validation instead when comparing and selecting between different model architectures.
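To make the jackknife column concrete, the sketch below computes a jackknife standard error from the n leave-one-out estimates and sets it beside a bootstrap standard error for the same data. The helper names are hypothetical. For the sample mean, the jackknife SE equals s/√n exactly.

```python
import numpy as np

def jackknife_se(data, stat):
    """SE from n leave-one-out estimates: sqrt((n-1)/n * sum of squared deviations)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    loo = np.array([stat(np.delete(data, i)) for i in range(n)])
    return np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))

def bootstrap_se(data, stat, B=5000, seed=0):
    """SE as the standard deviation of B bootstrap replicates."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    reps = [stat(rng.choice(data, size=data.size, replace=True)) for _ in range(B)]
    return np.std(reps, ddof=1)

data = [62, 70, 68, 75, 65]
jk_mean, bs_mean = jackknife_se(data, np.mean), bootstrap_se(data, np.mean)
print(f"mean:   jackknife {jk_mean:.3f}, bootstrap {bs_mean:.3f}")
print(f"median: jackknife {jackknife_se(data, np.median):.3f}, "
      f"bootstrap {bootstrap_se(data, np.median):.3f}")
```

The two methods agree closely for the mean; for the median, a non-smooth statistic, the jackknife figure should be treated with the caution noted in the table.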
Bootstrap Sampling Variations
| Variant | Use Case | Key Difference from Standard Bootstrap |
|---|---|---|
| Nonparametric Bootstrap | General inference, no distribution assumed | Resamples directly from the observed data — the standard method described above |
| Parametric Bootstrap | When distribution is known (e.g., Poisson count data) | Fits a parametric model to data, then generates resamples from that fitted model |
| Stratified Bootstrap | Preserving class ratios in classification tasks | Resamples separately within each stratum/class to maintain proportions |
| Block Bootstrap | Time series, panel data, spatially correlated data | Resamples contiguous blocks to preserve serial correlation structure |
| Residual Bootstrap | Linear regression inference | Keeps predictors fixed; resamples regression residuals only |
| BCa Bootstrap | Skewed statistics, improved accuracy | Adjusts CI boundaries for bias and acceleration factors |
| Double Bootstrap | Calibrating CI coverage probability | Bootstraps within bootstraps to correct CI undercoverage |
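As one example from the table, here is a minimal sketch of a moving-block bootstrap for serially correlated data. The helper name `moving_block_resample`, the block length of 10, and the simulated AR(1) series are all illustrative assumptions; choosing a block length for real data takes more care.

```python
import numpy as np

def moving_block_resample(series, block_len, rng):
    """One moving-block bootstrap resample with the same length as `series`."""
    series = np.asarray(series)
    n = series.size
    n_blocks = int(np.ceil(n / block_len))
    # Random start positions; each block is a contiguous run of block_len values
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    blocks = [series[s:s + block_len] for s in starts]
    return np.concatenate(blocks)[:n]      # trim the final block to length n

rng = np.random.default_rng(0)

# Illustrative autocorrelated series: AR(1) with coefficient 0.7
noise = rng.normal(size=200)
series = np.empty(200)
series[0] = noise[0]
for t in range(1, 200):
    series[t] = 0.7 * series[t - 1] + noise[t]

block_means = [moving_block_resample(series, 10, rng).mean() for _ in range(2000)]
iid_means = [rng.choice(series, size=series.size, replace=True).mean() for _ in range(2000)]
block_se, iid_se = np.std(block_means, ddof=1), np.std(iid_means, ddof=1)
print(f"block SE: {block_se:.3f}, iid SE: {iid_se:.3f}")
```

With positively autocorrelated data the ordinary bootstrap typically understates the standard error of the mean, because resampling individual points destroys the dependence that the blocks preserve.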
Limitations of Bootstrap Sampling
Bootstrap sampling is powerful, but it is not a cure-all. Several situations call for caution or a different approach.
- Very small samples (n < 10–15): The empirical distribution approximates the population poorly. Bootstrap CIs may have poor coverage, and edge cases (like the sample maximum) can behave erratically.
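- Non-smooth statistics: The bootstrap does not work well for extreme-value statistics such as the sample maximum or minimum, because a resample can never exceed the observed extremes and the bootstrap distribution piles up on a few observed values. Smoothed bootstrap or m-out-of-n bootstrap variants can help; the jackknife has the same weakness for non-smooth statistics.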
- Dependent data: The standard bootstrap assumes observations are independent. For time series or clustered data, use the block bootstrap instead to preserve serial or spatial correlation.
- Biased or unrepresentative samples: Bootstrap resampling from a bad sample produces a bad distribution. Garbage in, garbage out — bootstrap is not a substitute for proper study design. See the study design section for guidance on data collection.
- Computational cost: B = 10,000 resamples on a large dataset with a slow statistic can be expensive. Parallelization (via multiprocessing or vectorized NumPy) is advisable for production pipelines.
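On the computational point, one common way to vectorize with NumPy is to draw all resample indices at once as a (B, n) matrix instead of looping over B separate draws. The sketch below assumes the statistic accepts an `axis` argument (as `mean` does) and that the B × n index matrix fits in memory.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([62, 70, 68, 75, 65], dtype=float)
B, n = 10_000, data.size

# One (B, n) matrix of random row indices replaces B separate resampling calls
idx = rng.integers(0, n, size=(B, n))
bs_means = data[idx].mean(axis=1)          # all B replicate means in one pass

se_vec = bs_means.std(ddof=1)
ci_vec = np.percentile(bs_means, [2.5, 97.5])
print(f"SE: {se_vec:.3f}")
print(f"95% percentile CI: [{ci_vec[0]:.2f}, {ci_vec[1]:.2f}]")
```

For large n the index matrix can become the memory bottleneck, in which case processing the resamples in batches is a reasonable compromise.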
Frequently Asked Questions
Bootstrap sampling means repeatedly drawing random samples from your existing dataset, each sample the same size as the original and drawn with replacement (so the same data point can appear more than once). By computing a statistic — like the mean — across hundreds of these resamples, you build a picture of how that statistic would vary if you collected many new samples from the same population. This lets you estimate uncertainty without collecting more data.
Replacement is what makes bootstrap resamples genuinely different from one another. If you drew n observations without replacement from a dataset of size n, every resample would contain exactly the same n values in a different order, so no new information about sampling variability would be generated. Sampling with replacement means each bootstrap sample is a new, slightly different version of the dataset, allowing the distribution of statistics across resamples to reflect real sampling uncertainty.
In bagging (Bootstrap Aggregating), bootstrap sampling creates B distinct training sets from the original training data. A separate model — usually a decision tree — is fitted to each training set. Because each model sees a slightly different version of the data, individual models make different errors. Averaging (for regression) or majority-voting (for classification) these diverse models reduces variance without meaningfully increasing bias, improving prediction accuracy. Random Forest builds on this by also randomizing which features are considered at each split.
For standard error estimation, B = 200 is often sufficient. For percentile confidence intervals, Efron and Tibshirani recommend B = 1,000 as a minimum. For BCa intervals or when high precision matters, B = 5,000–10,000 is common. The bootstrap distribution converges as B increases; beyond B = 10,000, improvements are usually negligible compared to the variance from your original finite sample.
The true sampling distribution is the theoretical distribution of a statistic computed across infinitely many samples from the actual population — usually unknown. The bootstrap distribution approximates it empirically: instead of drawing from the real population, you draw from your observed sample. As sample size n grows and your sample better represents the population, the bootstrap distribution converges to the true sampling distribution. For finite samples it is an approximation; accuracy depends on how well your observed data represents the population.
Yes, the bootstrap can be used for hypothesis testing. Bootstrap hypothesis tests are closely related to permutation tests, which reshuffle group labels without replacement instead of resampling with replacement. Both can test whether an observed difference between groups is statistically significant. Under the null hypothesis, you pool the groups, draw bootstrap samples from the pooled data, compute the test statistic (e.g., difference in means) for each resample, and count how often the resampled statistic is at least as extreme as the observed value. That proportion is the p-value. This approach sidesteps normality assumptions and is widely used in A/B testing and biostatistics. See the hypothesis testing guide for the parametric equivalent.
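A sketch of one such test for a difference in means: both groups are resampled with replacement from the pooled data, which imposes the null hypothesis that the groups share a single distribution. The function name `bootstrap_mean_diff_test` and the two small samples are invented for illustration.

```python
import numpy as np

def bootstrap_mean_diff_test(a, b, B=10_000, seed=0):
    """Two-sided bootstrap test of H0: both groups come from one distribution."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])        # under H0 the group labels carry no information
    # Resample both groups with replacement from the pooled data
    a_star = rng.choice(pooled, size=(B, a.size), replace=True)
    b_star = rng.choice(pooled, size=(B, b.size), replace=True)
    diffs = a_star.mean(axis=1) - b_star.mean(axis=1)
    # Share of resampled differences at least as extreme as the observed one
    p_value = np.mean(np.abs(diffs) >= abs(observed))
    return observed, p_value

# Hypothetical A/B measurements, invented for illustration only
group_a = [12.1, 11.4, 13.0, 12.7, 11.9, 12.3, 13.4, 12.0]
group_b = [11.2, 10.9, 11.8, 11.5, 10.7, 11.9, 11.1, 11.4]
observed, p_value = bootstrap_mean_diff_test(group_a, group_b)
print(f"difference: {observed:.4f}, p-value: {p_value:.4f}")
```

Swapping the two `rng.choice` calls for a single shuffle of the pooled values, split back into groups of the original sizes, would turn this into a permutation test.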
Stratified bootstrap sampling divides the dataset into groups (strata) by some variable — commonly the class label in a classification problem — and samples with replacement separately within each stratum. This ensures that the proportion of each class remains the same in every bootstrap resample, preventing the class imbalance from changing between resamples. Scikit-learn's StratifiedKFold applies the same idea to cross-validation.
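A minimal sketch of the stratified variant: row indices are resampled with replacement inside each class, so every resample keeps the original class counts. The helper name `stratified_bootstrap_indices` and the 90/10 label vector are illustrative.

```python
import numpy as np

def stratified_bootstrap_indices(labels, rng):
    """Row indices for one resample that keeps every class count unchanged."""
    labels = np.asarray(labels)
    picked = []
    for cls in np.unique(labels):
        members = np.flatnonzero(labels == cls)    # rows belonging to this stratum
        picked.append(rng.choice(members, size=members.size, replace=True))
    return np.concatenate(picked)

rng = np.random.default_rng(0)
labels = np.array([0] * 90 + [1] * 10)      # illustrative 90/10 class imbalance
idx = stratified_bootstrap_indices(labels, rng)
print(np.bincount(labels[idx]))             # class counts match the original: [90 10]
```

An ordinary bootstrap of the same 100 rows would let the minority count drift from resample to resample, which is the instability stratification is meant to avoid.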
Related Topics on Statistics Fundamentals
Bootstrap sampling connects to several other statistical concepts covered in depth on Statistics Fundamentals. The pages below cover the prerequisite and follow-on topics most relevant to understanding and applying bootstrap methods.
Sampling Distributions
The theoretical framework that bootstrap distributions approximate. Essential background before this topic.
Distribution of the Sample Mean
The analytical sampling distribution of x̄ — what the bootstrap approximates empirically.
Confidence Intervals
Parametric CI methods — compare with the bootstrap percentile interval covered here.
Confidence Interval for the Mean
The t-based CI — the parametric alternative to the bootstrap CI for the mean.
Central Limit Theorem
Why the CLT justifies parametric methods — and why bootstrap is useful when n is too small for CLT to apply.
Hypothesis Testing
Classical parametric inference — the traditional alternative to bootstrap-based testing.
Variance
Bootstrap standard error is the standard deviation of bootstrap replicates — variance is the underlying concept.
Normal Distribution
Bootstrap is most valuable when data departs from normality. Learn the normal distribution here.
Study Design
Bootstrap cannot fix a biased sample. Good study design is the necessary first step.