Typically in statistics, we use theory to derive the sampling distribution of a statistic. From the sampling distribution, we can obtain the variance, construct confidence intervals, perform hypothesis tests, and more.
Challenge: the theory may be unavailable; for many statistics and sample sizes the sampling distribution is unknown or intractable.
Basic idea of bootstrapping: treat the observed sample as a stand-in for the population and mimic repeated sampling by resampling from the data itself.
1 Nonparametric Bootstrap
Let \(X_1, \dots, X_n \sim F\) with pdf \(f(x)\). Recall, the cdf is defined as \[ F(x) = P(X \le x) = \int_{-\infty}^x f(t)\, dt. \] Since \(F\) is unknown, we estimate it with the empirical cdf \[ F_n(x) = \frac{1}{n} \sum_{i = 1}^n \mathbb{I}(X_i \le x). \]
In practice, this leads to the following function. Let \(X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}\) be the order statistics of the sample. Then, \[ F_n(x) = \begin{cases} 0 & x < X_{(1)} \\ \frac{i}{n} & X_{(i)} \le x < X_{(i + 1)}; \quad i = 1, \dots, n-1 \\ 1 & x \ge X_{(n)} \end{cases} \]
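In R, the built-in ecdf function computes \(F_n\) directly. A minimal sketch, using the same small sample that appears below:
# empirical cdf of a small sample
x <- c(2, 2, 1, 1, 5, 4, 4, 3, 1, 2)
Fn <- ecdf(x)  # Fn is the step function F_n
Fn(3)          # proportion of observations <= 3, here 0.7
plot(Fn, main = "Empirical cdf")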
Theoretical: if \(F\) were known, we could sample repeatedly from \(F\) to approximate the sampling distribution of any statistic.
Bootstrap: since \(F\) is unknown, sample repeatedly from \(F_n\) instead.
The idea behind the bootstrap is to sample many data sets from \(F_n(x)\), which can be achieved by resampling from the data with replacement.
# observed data
x <- c(2, 2, 1, 1, 5, 4, 4, 3, 1, 2)
# create 10 bootstrap samples
x_star <- matrix(NA, nrow = length(x), ncol = 10)
for(i in 1:10) {
x_star[, i] <- sample(x, length(x), replace = TRUE)
}
x_star
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 5 2 4 4 1 2 2 1 5 1
## [2,] 4 5 1 1 1 2 1 1 4 2
## [3,] 4 2 5 1 2 2 1 4 4 3
## [4,] 4 5 1 3 2 4 4 4 3 1
## [5,] 4 1 2 1 1 1 5 2 1 1
## [6,] 4 2 2 2 4 4 3 2 1 2
## [7,] 1 5 4 4 1 2 1 2 1 4
## [8,] 3 1 1 1 4 1 4 1 4 2
## [9,] 1 4 4 2 2 1 4 3 2 1
## [10,] 4 1 2 3 4 5 5 5 2 4
mean(x)
## [1] 2.5
colMeans(x_star)
## [1] 3.4 2.8 2.6 2.2 2.2 2.4 3.0 2.5 2.7 2.1
ggplot() +
geom_histogram(aes(colMeans(x_star)), binwidth = .05) +
geom_vline(aes(xintercept = mean(x)), lty = 2, colour = "red") +
xlab("Sampling distribution of the mean via bootstrapping")1.1 Algorithm
Goal: estimate the sampling distribution of a statistic based on observed data \(x_1, \dots, x_n\).
Let \(\theta\) be the parameter of interest and \(\hat{\theta}\) be an estimator of \(\theta\). Then, for \(b = 1, \dots, B\):
1. Sample \(x^{*(b)}_1, \dots, x^{*(b)}_n\) with replacement from \(x_1, \dots, x_n\).
2. Compute \(\hat{\theta}^{(b)}\) from the bootstrap sample \(x^{*(b)}_1, \dots, x^{*(b)}_n\).
The empirical distribution of \(\hat{\theta}^{(1)}, \dots, \hat{\theta}^{(B)}\) approximates the sampling distribution of \(\hat{\theta}\).
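A reusable sketch of this algorithm (the function name boot_dist and its arguments are ours, not from a package):
# generic bootstrap: B replicates of a statistic computed on resamples
boot_dist <- function(x, statistic, B = 1000) {
  replicate(B, statistic(sample(x, length(x), replace = TRUE)))
}
# e.g., theta_hat_star <- boot_dist(x, median, B = 1000)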
1.2 Properties of Estimators
We can use the bootstrap to estimate different properties of estimators.
1.2.1 Standard Error
Recall \(se(\hat{\theta}) = \sqrt{Var(\hat{\theta})}\). We can get a bootstrap estimate of the standard error: \[ \hat{se}_B(\hat{\theta}) = \sqrt{\frac{1}{B - 1} \sum_{b = 1}^B \left(\hat{\theta}^{(b)} - \bar{\hat{\theta}}\right)^2}, \quad \text{where } \bar{\hat{\theta}} = \frac{1}{B} \sum_{b = 1}^B \hat{\theta}^{(b)}. \]
1.2.2 Bias
Recall \(\text{bias}(\hat{\theta}) = E[\hat{\theta} - \theta] = E[\hat{\theta}] - \theta\).
Example 1.2
We can get a bootstrap estimate of the bias: \[ \widehat{\text{bias}}(\hat{\theta}) = \bar{\hat{\theta}} - \hat{\theta}, \] where \(\hat{\theta}\) is the estimate from the original sample.
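In R, both estimates are one-liners once the replicates are stored. Here theta_hat_star holds the \(B\) replicates and theta_hat the estimate from the original sample (names assumed for illustration):
# bootstrap standard error: sd of the replicates
sd(theta_hat_star)
# bootstrap estimate of bias: mean of replicates minus original estimate
mean(theta_hat_star) - theta_hat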
Overall, we seek estimators with small standard error and small bias.
1.3 Sample Size and # Bootstrap Samples
\[ n = \text{sample size} \quad \& \quad B = \# \text{ bootstrap samples} \]
If \(n\) is too small, or the sample isn’t representative of the population, the bootstrap distribution will not resemble the true sampling distribution; resampling cannot recover information that the sample never contained.
Guidelines for \(B\) – a few hundred replicates (e.g. \(B \ge 200\)) are often adequate for standard errors, while confidence intervals based on tail quantiles typically need \(B \ge 1000\).
Best approach – increase \(B\) until the results stabilize; with modern computing the cost of a larger \(B\) is usually negligible.
Your Turn
In this example, we explore bootstrapping in the rare case where we know the values for the entire population. If you have all the data from the population, you don’t need to bootstrap (or really, any inference at all). Comparing bootstrap estimates to the known truth is what makes this example useful for learning.
The bootstrap package contains the average LSAT and GPA for admission for the population of \(82\) USA law schools (an old data set – there are now over \(200\) law schools). The package also contains a random sample of size \(n = 15\) from this population.
## LSAT GPA
## 1 576 3.39
## 2 635 3.30
## 3 558 2.81
## 4 578 3.03
## 5 666 3.44
## 6 580 3.07
ggplot() +
geom_point(aes(LSAT, GPA), data = law) +
geom_point(aes(LSAT, GPA), data = law82, pch = 1)
We will estimate the correlation \(\theta = \rho(\text{LSAT}, \text{GPA})\) between these two variables and use a bootstrap to estimate the sampling distribution of \(\hat{\theta}\).
cor(law$LSAT, law$GPA) # sample correlation (n = 15)
## [1] 0.7763745
cor(law82$LSAT, law82$GPA) # population correlation (all 82 schools)
## [1] 0.7599979
# set up the bootstrap
B <- 200
n <- nrow(law)
r <- numeric(B) # storage
for(b in 1:B) {
## Your Turn: Do the bootstrap!
}
Plot the sampling distribution of \(\hat{\theta}\). Add vertical lines for the true value \(\theta\) and the sample estimate \(\hat{\theta}\).
Estimate \(sd(\hat{\theta})\).
Estimate the bias of \(\hat{\theta}\).
1.4 Bootstrap CIs
We will look at five different ways to create confidence intervals using the bootstrap and discuss which to use when.
Percentile Bootstrap CI
Basic Bootstrap CI
Standard Normal Bootstrap CI
Bootstrap \(t\)
Accelerated Bias-Corrected (BCa)
Key ideas: all of these intervals are built from the bootstrap replicates \(\hat{\theta}^{(1)}, \dots, \hat{\theta}^{(B)}\); they differ in how the replicates are turned into endpoints and in how much they correct for bias and skewness.
1.4.1 Percentile Bootstrap CI
Let \(\hat{\theta}^{(1)}, \dots, \hat{\theta}^{(B)}\) be bootstrap replicates and let \(\hat{\theta}_{\alpha/2}\) be the \(\alpha/2\) quantile of \(\hat{\theta}^{(1)}, \dots, \hat{\theta}^{(B)}\).
Then, the \(100(1 - \alpha)\%\) Percentile Bootstrap CI for \(\theta\) is \[ \left(\hat{\theta}_{\alpha/2},\ \hat{\theta}_{1 - \alpha/2}\right). \]
In R, if bootstrap.reps = c(\(\hat{\theta}^{(1)}, \dots, \hat{\theta}^{(B)}\)), the percentile CI is quantile(bootstrap.reps, c(alpha/2, 1 - alpha/2)).
Assumptions/usage – works well when the bootstrap distribution is roughly symmetric and the estimator is close to unbiased; simple and transformation-respecting, but coverage suffers under bias or heavy skew.
1.4.2 Basic Bootstrap CI
The \(100(1 - \alpha)\%\) Basic Bootstrap CI for \(\theta\) is \[ \left(2\hat{\theta} - \hat{\theta}_{1 - \alpha/2},\ 2\hat{\theta} - \hat{\theta}_{\alpha/2}\right), \] which reflects the bootstrap quantiles around the original estimate \(\hat{\theta}\).
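A sketch in the same notation, with theta_hat the original estimate and alpha the error rate (names assumed for illustration):
alpha <- 0.05
# basic bootstrap CI: reflect the bootstrap quantiles around theta_hat
2*theta_hat - quantile(bootstrap.reps, c(1 - alpha/2, alpha/2))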
Assumptions/usage – assumes the distribution of \(\hat{\theta}^* - \hat{\theta}\) approximates that of \(\hat{\theta} - \theta\); can produce limits outside the parameter space.
1.4.3 Standard Normal Bootstrap CI
From the CLT, \[ z = \frac{\hat{\theta} - E[\hat{\theta}]}{se(\hat{\theta})} \stackrel{\cdot}{\sim} N(0, 1). \]
So, the \(100(1 - \alpha)\%\) Standard Normal Bootstrap CI for \(\theta\) is \[ \hat{\theta} \pm z_{1 - \alpha/2}\, \hat{se}(\hat{\theta}), \] where \(\hat{se}(\hat{\theta})\) is the bootstrap estimate of the standard error.
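A sketch in the same notation (theta_hat and bootstrap.reps as above):
alpha <- 0.05
# standard normal bootstrap CI: theta_hat +/- z * bootstrap se
theta_hat + c(-1, 1)*qnorm(1 - alpha/2)*sd(bootstrap.reps)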
Assumptions/usage – assumes the sampling distribution of \(\hat{\theta}\) is approximately Normal and that \(\hat{\theta}\) is approximately unbiased.
1.4.4 Bootstrap \(t\) CI (Studentized Bootstrap)
Even if the distribution of \(\hat{\theta}\) is Normal and \(\hat{\theta}\) is unbiased for \(\theta\), the Normal distribution is not exactly correct for \(z\).
Additionally, the distribution of \(\hat{se}(\hat{\theta})\) is unknown.
\(\Rightarrow\) The bootstrap \(t\) interval does not use a Student \(t\) distribution as the reference distribution; instead, we estimate the distribution of a “t type” statistic by resampling.
The \(100(1 - \alpha)\%\) Bootstrap \(t\) CI is \[ \left(\hat{\theta} - t^*_{1 - \alpha/2}\, \hat{se}(\hat{\theta}),\ \hat{\theta} - t^*_{\alpha/2}\, \hat{se}(\hat{\theta})\right), \] where \(t^*_q\) denotes the \(q\) quantile of the resampled \(t\)-type statistics.
Overview
To estimate the “t style distribution” for \(\hat{\theta}\), compute for each bootstrap sample both \(\hat{\theta}^{(b)}\) and its standard error \(\hat{se}(\hat{\theta}^{(b)})\) (from a formula, or from a nested bootstrap), and form \[ t^{(b)} = \frac{\hat{\theta}^{(b)} - \hat{\theta}}{\hat{se}(\hat{\theta}^{(b)})}, \quad b = 1, \dots, B. \] The quantiles \(t^*_{\alpha/2}\) and \(t^*_{1 - \alpha/2}\) are then taken from \(t^{(1)}, \dots, t^{(B)}\); a sketch follows.
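A minimal sketch for the mean, where \(\hat{se}\) has a closed form (the function boot_t_ci is ours, not from a package):
# studentized bootstrap CI for the mean
boot_t_ci <- function(x, B = 1000, alpha = 0.05) {
  theta_hat <- mean(x)
  se_hat <- sd(x)/sqrt(length(x))
  t_star <- replicate(B, {
    x_star <- sample(x, length(x), replace = TRUE)
    (mean(x_star) - theta_hat)/(sd(x_star)/sqrt(length(x_star)))
  })
  # note the reversed quantiles: the upper limit uses the lower t* quantile
  theta_hat - quantile(t_star, c(1 - alpha/2, alpha/2))*se_hat
}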
Assumptions/usage – tends to have accurate coverage, but requires a standard error estimate for each bootstrap sample (possibly a nested bootstrap) and can be computationally expensive.
1.4.5 BCa CIs
Modified version of percentile intervals that adjusts for bias of the estimator and skewness of the sampling distribution.
This method automatically selects a transformation so that the normality assumption holds.
Idea:
The BCa method uses bootstrapping to estimate the bias and skewness of the sampling distribution, then modifies which percentiles are chosen to get appropriate confidence limits for a given data set.
In summary, BCa intervals are percentile intervals whose endpoints are shifted by two estimated constants: a bias correction \(z_0\) and an acceleration \(a\) that accounts for skewness.
Your Turn
We will consider a telephone repair example from Hesterberg (2014). Verizon has repair times for two groups of customers: ILEC, customers of the “Incumbent” local exchange carrier, and CLEC, customers of a “Competitive” local exchange carrier.
## Time Group
## 1 17.50 ILEC
## 2 2.40 ILEC
## 3 0.00 ILEC
## 4 0.65 ILEC
## 5 22.23 ILEC
## 6 1.20 ILEC
Verizon %>%
group_by(Group) %>%
summarize(mean = mean(Time), sd = sd(Time), min = min(Time), max = max(Time)) %>%
kable()
| Group | mean | sd | min | max |
|---|---|---|---|---|
| CLEC | 16.509130 | 19.50358 | 0 | 96.32 |
| ILEC | 8.411611 | 14.69004 | 0 | 191.60 |
1.5 Bootstrapping CIs
There are many bootstrapping packages in R; we will use the boot package. The function boot generates \(R\) resamples of the data and computes the desired statistic(s) for each sample. This function requires 3 arguments:
- data = the data from the original sample (data.frame or matrix).
- statistic = a function to compute the statistic from the data where the first argument is the data and the second argument is the indices of the observations in the bootstrap sample.
- \(R\) = the number of bootstrap replicates.
library(boot) # package containing the bootstrap function
mean_func <- function(x, idx) {
mean(x[idx])
}
ilec_times <- Verizon[Verizon$Group == "ILEC",]$Time
boot.ilec <- boot(ilec_times, mean_func, 2000)
plot(boot.ilec)
If we want bootstrap CIs, we can use the boot.ci function to generate several types of nonparametric bootstrap confidence intervals at once; here we request the normal, basic, percentile, and BCa intervals (the studentized interval requires more work, as shown below).
boot.ci(boot.ilec, conf = .95, type = c("perc", "basic", "norm", "bca"))
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot.ilec, conf = 0.95, type = c("perc", "basic",
## "norm", "bca"))
##
## Intervals :
## Level Normal Basic
## 95% ( 7.719, 9.114 ) ( 7.709, 9.119 )
##
## Level Percentile BCa
## 95% ( 7.704, 9.114 ) ( 7.752, 9.164 )
## Calculations and Intervals on Original Scale
## we can do some of these on our own
## normal
mean(boot.ilec$t) + c(-1, 1)*qnorm(.975)*sd(boot.ilec$t)
## [1] 7.709670 9.104182
## normal is bias corrected
2*mean(ilec_times) - (mean(boot.ilec$t) - c(-1, 1)*qnorm(.975)*sd(boot.ilec$t))
## [1] 7.719039 9.113551
## percentile
quantile(boot.ilec$t, c(.025, .975))
##     2.5%    97.5%
## 7.707656 9.111150
## basic
2*mean(ilec_times) - quantile(boot.ilec$t, c(.975, .025))
##    97.5%     2.5%
## 7.712071 9.115565
To get the studentized bootstrap CI, we need our statistic function to also return the variance of \(\hat{\theta}\).
mean_var_func <- function(x, idx) {
c(mean(x[idx]), var(x[idx])/length(idx))
}
boot.ilec_2 <- boot(ilec_times, mean_var_func, 2000)
boot.ci(boot.ilec_2, conf = .95, type = "stud")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot.ilec_2, conf = 0.95, type = "stud")
##
## Intervals :
## Level Studentized
## 95% ( 7.733, 9.231 )
## Calculations and Intervals on Original Scale
Which CI should we use?
1.6 Bootstrapping for the difference of two means
Given iid draws of size \(n\) and \(m\) from two populations, to compare the means of the two groups using the bootstrap, resample within each group independently: draw \(n\) observations with replacement from the first sample and \(m\) with replacement from the second, compute the difference in means, and repeat \(B\) times.
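A by-hand sketch of this scheme (the function boot_diff_means is ours; two.boot below packages the same idea):
# two-sample bootstrap: resample within each group independently
boot_diff_means <- function(x, y, B = 1000) {
  replicate(B, {
    mean(sample(x, length(x), replace = TRUE)) -
      mean(sample(y, length(y), replace = TRUE))
  })
}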
The function two.boot in the simpleboot package is used to bootstrap the difference between univariate statistics. Use the bootstrap to compute the shape, bias, and bootstrap standard error for the samples from the Verizon data set of CLEC and ILEC customers.
library(simpleboot)
clec_times <- Verizon[Verizon$Group == "CLEC",]$Time
diff_means.boot <- two.boot(ilec_times, clec_times, "mean", R = 2000)
ggplot() +
geom_histogram(aes(diff_means.boot$t)) +
xlab("mean(ilec) - mean(clec)")
qqnorm(diff_means.boot$t)
qqline(diff_means.boot$t)
Which confidence intervals should we use?
Is there evidence to reject \[ H_0: \mu_1 - \mu_2 = 0 \quad \text{vs.} \quad H_a: \mu_1 - \mu_2 < 0\,? \]
2 Parametric Bootstrap
In a nonparametric bootstrap, we resample the observed data with replacement; that is, we sample from the empirical cdf \(F_n\).
In a parametric bootstrap, we first fit a parametric model \(F_{\hat{\theta}}\) to the data, then draw each bootstrap sample from the fitted model rather than from the data.
For both methods, the statistic of interest is recomputed on every bootstrap sample, and the replicates \(\hat{\theta}^{(1)}, \dots, \hat{\theta}^{(B)}\) estimate its sampling distribution.
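A minimal parametric bootstrap sketch for the mean of an Exponential sample (the model and values are illustrative assumptions, not part of the Verizon analysis):
# parametric bootstrap: fit Exponential(rate), then simulate from the fit
set.seed(1)
x <- rexp(50, rate = 2)  # stand-in observed data
rate_hat <- 1/mean(x)    # MLE of the rate
theta_hat_star <- replicate(2000, mean(rexp(length(x), rate = rate_hat)))
sd(theta_hat_star)       # parametric bootstrap se of the mean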
2.1 Bootstrapping for linear regression
Consider the regression model \(Y_i = \boldsymbol x_i^T\boldsymbol \beta + \epsilon_i, i = 1, \dots, n\) with \(\epsilon_i \stackrel{iid}{\sim}N(0, \sigma^2)\).
Two approaches for bootstrapping linear regression models –
2.1.1 Bootstrapping the residuals
1. Fit the regression model using the original data and compute the residuals, \[ \hat{\epsilon}_i = y_i - \hat{y}_i = y_i - \boldsymbol x_i^T\hat{\boldsymbol \beta}, \quad i = 1, \dots, n \]
2. Sample \(\hat{\epsilon}^*_1, \dots, \hat{\epsilon}^*_n\) with replacement from \(\hat{\epsilon}_1, \dots, \hat{\epsilon}_n\).
3. Create the bootstrap sample \[ y_i^* = \boldsymbol x_i^T\hat{\boldsymbol \beta} + \epsilon^*_i, \quad i = 1, \dots, n \]
4. Estimate \(\hat{\boldsymbol\beta}^*\) by refitting the regression to \((y_i^*, \boldsymbol x_i)\), \(i = 1, \dots, n\).
5. Repeat steps 2-4 \(B\) times to create \(B\) bootstrap estimates \(\hat{\boldsymbol\beta}^{*(1)}, \dots, \hat{\boldsymbol\beta}^{*(B)}\).
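A minimal sketch of these steps for simple linear regression on made-up data (not the Puromycin exercise below):
# bootstrap the residuals of a simple linear regression
set.seed(1)
x <- runif(30)
y <- 1 + 2*x + rnorm(30, sd = 0.5)  # made-up data
fit <- lm(y ~ x)                    # step 1: fit and store pieces
y_hat <- fitted(fit)
e_hat <- resid(fit)
beta_star <- replicate(2000, {
  y_star <- y_hat + sample(e_hat, replace = TRUE)  # steps 2-3
  coef(lm(y_star ~ x))                             # step 4
})
apply(beta_star, 1, sd)  # bootstrap se of (intercept, slope)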
Assumptions: the model form is correct and the errors \(\epsilon_i\) are iid with constant variance; the design points \(\boldsymbol x_i\) are held fixed.
2.1.2 Paired bootstrapping
Resample \(z_i^* = (y_i, \boldsymbol x_i)^*\) from the empirical distribution of the pairs \((y_i, \boldsymbol x_i)\).
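A minimal paired-bootstrap sketch on made-up data, resampling rows with replacement (the boot package does the same thing through its index argument):
# paired bootstrap: resample (y, x) pairs
set.seed(1)
x <- runif(30)
y <- 1 + 2*x + rnorm(30, sd = 0.5)  # made-up data
beta_star_pairs <- replicate(2000, {
  idx <- sample(length(y), replace = TRUE)
  coef(lm(y[idx] ~ x[idx]))
})
apply(beta_star_pairs, 1, sd)  # bootstrap se of (intercept, slope)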
Assumptions: the pairs \((y_i, \boldsymbol x_i)\) are iid draws from some joint distribution; the model form need not be exactly correct.
2.1.3 Which to use?
Standard inferences - appropriate when the model is correct and the errors are Normal (or \(n\) is large).
Bootstrapping the residuals - keeps the design fixed; appropriate when the model form is trusted and the errors are iid, but Normality is in doubt.
Paired bootstrapping - more robust to model misspecification and nonconstant error variance; natural when the \(\boldsymbol x_i\) are themselves random.
Your Turn
This data set is the Puromycin data in R. The goal is to build a regression model for the rate of an enzymatic reaction as a function of the substrate concentration.
## conc rate state
## 1 0.02 76 treated
## 2 0.02 47 treated
## 3 0.06 97 treated
## 4 0.06 107 treated
## 5 0.11 123 treated
## 6 0.11 139 treated
dim(Puromycin)
## [1] 23  3
ggplot(Puromycin) +
geom_point(aes(conc, rate))
ggplot(Puromycin) +
geom_point(aes(log(conc), rate))
2.1.4 Standard regression
##
## Call:
## lm(formula = rate ~ conc, data = Puromycin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.861 -15.247 -2.861 15.686 48.054
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93.92 8.00 11.74 1.09e-10 ***
## conc 105.40 16.92 6.23 3.53e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.82 on 21 degrees of freedom
## Multiple R-squared: 0.6489, Adjusted R-squared: 0.6322
## F-statistic: 38.81 on 1 and 21 DF, p-value: 3.526e-06
## 2.5 % 97.5 %
## (Intercept) 77.28643 110.5607
## conc 70.21281 140.5832
##
## Call:
## lm(formula = rate ~ log(conc), data = Puromycin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.250 -12.753 0.327 12.969 30.166
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 190.085 6.332 30.02 < 2e-16 ***
## log(conc) 33.203 2.739 12.12 6.04e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.2 on 21 degrees of freedom
## Multiple R-squared: 0.875, Adjusted R-squared: 0.869
## F-statistic: 146.9 on 1 and 21 DF, p-value: 6.039e-11
## 2.5 % 97.5 %
## (Intercept) 176.91810 203.2527
## log(conc) 27.50665 38.8987
2.1.5 Paired bootstrap
2.1.6 Bootstrapping the residuals
# Your turn
library(boot)
reg_func_2 <- function(dat, idx) {
# write a regression function that returns fitted beta
# from fitting a y that is created from the residuals
}
# use the boot function to get the bootstrap samples
# examing the bootstrap sampling distribution, make histograms
# get confidence intervals for beta_0 and beta_1 using boot.ci
3 Bootstrapping Dependent Data
Suppose we have dependent data \(\boldsymbol y = (y_1, \dots, y_n)\) generated from some unknown distribution \(F = F_{\boldsymbol Y} = F_{(Y_1, \dots, Y_n)}\).
Goal: estimate the sampling distribution of a statistic computed from the dependent data.
Challenge: resampling individual observations with replacement destroys the dependence structure, so the ordinary bootstrap fails.
We will consider two approaches –
3.1 Model-based approach
If we assume an AR(1) model for the data, we can consider a method similar to bootstrapping residuals for linear regression.
Model-based – the performance of this approach depends on the model being appropriate for the data.
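A minimal sketch, assuming an AR(1) fit with ar() and rebuilding series from resampled innovations (the simulated series is a stand-in for real data; details such as handling the series mean vary):
# model-based bootstrap for an AR(1) series
set.seed(1)
y <- arima.sim(list(ar = 0.6), n = 200)   # stand-in dependent series
fit <- ar(y, order.max = 1, aic = FALSE)  # fit AR(1)
e_hat <- na.omit(fit$resid)               # estimated innovations
theta_hat_star <- replicate(2000, {
  e_star <- sample(e_hat, length(y), replace = TRUE)
  y_star <- stats::filter(e_star, fit$ar, method = "recursive") + fit$x.mean
  mean(y_star)
})
sd(theta_hat_star)  # se of the mean that respects the AR(1) dependence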
3.2 Nonparametric approach
To deal with dependence in the data, we will employ a nonparametric block bootstrap.
Idea: resample blocks of consecutive observations rather than individual observations, so that the dependence within each block is preserved.
3.2.1 Nonoverlapping Blocks (NBB)
Consider splitting \(\boldsymbol Y = (Y_1, \dots, Y_n)\) into \(b\) consecutive nonoverlapping blocks of length \(\ell\) (so that \(n = b\ell\), dropping any leftover observations).
We can then rewrite the data as \(\boldsymbol Y = (\boldsymbol B_1, \dots, \boldsymbol B_b)\) with \(\boldsymbol B_k = (Y_{(k-1)\ell + 1}, \dots, Y_{k\ell})\), \(k = 1, \dots, b\).
Note, the order of data within the blocks must be maintained, but the order of the blocks that are resampled does not matter.
3.2.2 Moving Blocks (MBB)
Now consider splitting \(\boldsymbol Y = (Y_1, \dots, Y_n)\) into overlapping blocks of adjacent data points of length \(\ell\).
We can then write the blocks as \(\boldsymbol B_k = (Y_k,\dots, Y_{k + \ell -1})\), \(k = 1, \dots, n - \ell + 1\).
3.2.3 Choosing Block Size
If the block length is too short, the dependence between nearby observations is broken and the block bootstrap behaves like the independent bootstrap, underestimating the variance.
If the block length is too long, there are too few distinct blocks to resample, and the bootstrap replicates become too variable.
Your Turn
We will look at the annual numbers of lynx trappings for 1821–1934 in Canada. Taken from Brockwell & Davis (1991).
Goal: Estimate the sampling distribution of the mean.
mean(lynx)
## [1] 1538.018
3.2.4 Independent Bootstrap
library(simpleboot)
B <- 10000
## Your turn: perform the independent bootstrap
## what is the bootstrap estimate of the se?
We must account for the dependence to obtain a correct estimate of the variance!
The autocorrelation (acf) at the dominant lags is positive, so the independent bootstrap underestimates the standard error.
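To see this for the lynx series:
# autocorrelation of the series; note the positive low-lag correlations
acf(lynx)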
3.2.5 Non-overlapping Block Bootstrap
# function to create non-overlapping blocks
nb <- function(x, b) {
n <- length(x)
l <- n %/% b
blocks <- matrix(NA, nrow = b, ncol = l)
for(i in 1:b) {
blocks[i, ] <- x[((i - 1)*l + 1):(i*l)]
}
blocks
}
# Your turn: perform the NBB with b = 10 and l = 11
theta_hat_star_nbb <- rep(NA, B)
nb_blocks <- nb(lynx, 10)
for(i in 1:B) {
# sample blocks
# get theta_hat^*
}
# Plot your results to inspect the distribution
# What is the estimated standard error of theta hat? The Bias?
3.2.6 Moving Block Bootstrap
# function to create overlapping blocks
mb <- function(x, l) {
n <- length(x)
blocks <- matrix(NA, nrow = n - l + 1, ncol = l)
for(i in 1:(n - l + 1)) {
blocks[i, ] <- x[i:(i + l - 1)]
}
blocks
}
# Your turn: perform the MBB with l = 11
mb_blocks <- mb(lynx, 11)
theta_hat_star_mbb <- rep(NA, B)
for(i in 1:B) {
# sample blocks
# get theta_hat^*
}
# Plot your results to inspect the distribution
# What is the estimated standard error of theta hat? The Bias?
4 Summary
Bootstrap methods are simulation methods for frequentist inference.
Bootstrap methods are useful for estimating standard errors, bias, and confidence intervals when the theoretical sampling distribution of a statistic is unknown or intractable.
Bootstrap methods can fail when the sample is small or unrepresentative, when the statistic is driven by a few extreme observations (e.g. the maximum), or when dependence in the data is ignored.