Typically in statistics, we use theory to derive the sampling distribution of a statistic. From the sampling distribution, we can obtain the variance, construct confidence intervals, perform hypothesis tests, and more.

Challenge: for many statistics, the theoretical sampling distribution is unknown, intractable, or rests on assumptions (e.g., Normality or large \(n\)) that may not hold.

Basic idea of bootstrapping: treat the observed sample as a stand-in for the population, and approximate the sampling distribution of a statistic by recomputing the statistic on many resamples drawn from the data.

1 Nonparametric Bootstrap

Let \(X_1, \dots, X_n \sim F\) with pdf \(f(x)\). Recall, the cdf is defined as \[ F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\, dt. \]
Definition 1.1 The empirical cdf is a function which estimates the cdf using observed data, \[ \hat{F}(x) = F_n(x) = \text{ proportion of sample points that fall in } (-\infty, x]. \]

In practice, this leads to the following function. Let \(X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}\) be the order statistics of the sample. Then, \[ F_n(x) = \begin{cases} 0 & x < X_{(1)} \\ \frac{i}{n} & X_{(i)} \le x < X_{(i + 1)}; \quad i = 1, \dots, n-1 \\ 1 & x \ge X_{(n)} \end{cases} \]




Theoretical world: \(F \;\Rightarrow\; X_1, \dots, X_n \;\Rightarrow\; \hat{\theta}\) (the true distribution generates the sample, which yields the estimate).

Bootstrap world: \(F_n \;\Rightarrow\; X^*_1, \dots, X^*_n \;\Rightarrow\; \hat{\theta}^*\) (the empirical cdf plays the role of \(F\)).

Example 1.1 Let \(\boldsymbol x = 2, 2, 1, 1, 5, 4, 4, 3, 1, 2\) be an observed sample. Find \(F_n(x)\).
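In R, the empirical cdf can be computed with the built-in ecdf function; a quick sketch for checking the answer:

    x <- c(2, 2, 1, 1, 5, 4, 4, 3, 1, 2)
    Fn <- ecdf(x)  # the empirical cdf as a step function
    Fn(2)          # proportion of sample points <= 2, here 0.6
    plot(Fn)       # step-function plot of F_n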

The idea behind the bootstrap is to sample many data sets from \(F_n(x)\), which can be achieved by resampling from the data with replacement.
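Output like the matrix below can be generated along these lines (a sketch; the exact draws depend on the random seed):

    x <- c(2, 2, 1, 1, 5, 4, 4, 3, 1, 2)
    B <- 10

    # each column is one bootstrap sample of size n = 10
    samples <- replicate(B, sample(x, replace = TRUE))
    samples             # a 10 x 10 matrix; column j is the jth resample
    mean(x)             # mean of the original sample
    colMeans(samples)   # bootstrap replicates of the sample mean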

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    5    2    4    4    1    2    2    1    5     1
##  [2,]    4    5    1    1    1    2    1    1    4     2
##  [3,]    4    2    5    1    2    2    1    4    4     3
##  [4,]    4    5    1    3    2    4    4    4    3     1
##  [5,]    4    1    2    1    1    1    5    2    1     1
##  [6,]    4    2    2    2    4    4    3    2    1     2
##  [7,]    1    5    4    4    1    2    1    2    1     4
##  [8,]    3    1    1    1    4    1    4    1    4     2
##  [9,]    1    4    4    2    2    1    4    3    2     1
## [10,]    4    1    2    3    4    5    5    5    2     4
## [1] 2.5
##  [1] 3.4 2.8 2.6 2.2 2.2 2.4 3.0 2.5 2.7 2.1

1.1 Algorithm

Goal: estimate the sampling distribution of a statistic based on observed data \(x_1, \dots, x_n\).

Let \(\theta\) be the parameter of interest and \(\hat{\theta}\) be an estimator of \(\theta\). Then,

  1. For \(b = 1, \dots, B\), draw a bootstrap sample \(x^{*(b)}_1, \dots, x^{*(b)}_n\) by sampling from \(x_1, \dots, x_n\) with replacement (i.e., from \(F_n\)).

  2. Compute the bootstrap replicate \(\hat{\theta}^{(b)}\) from the \(b\)th bootstrap sample.

  3. Use the empirical distribution of \(\hat{\theta}^{(1)}, \dots, \hat{\theta}^{(B)}\) to approximate the sampling distribution of \(\hat{\theta}\).
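A generic sketch of this algorithm in R, where statistic is any function mapping a sample to a number:

    bootstrap.dist <- function(x, statistic, B = 2000) {
      # one resample of size n per replicate, drawn from F_n
      replicate(B, statistic(sample(x, size = length(x), replace = TRUE)))
    }

    # e.g., approximate sampling distribution of the median
    theta.boot <- bootstrap.dist(x, median)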

1.2 Properties of Estimators

We can use the bootstrap to estimate different properties of estimators.

1.2.1 Standard Error

Recall \(se(\hat{\theta}) = \sqrt{Var(\hat{\theta})}\). We can get a bootstrap estimate of the standard error as the sample standard deviation of the bootstrap replicates: \[ \hat{se}(\hat{\theta}) = \sqrt{\frac{1}{B - 1}\sum_{b = 1}^{B}\left(\hat{\theta}^{(b)} - \bar{\hat{\theta}}^*\right)^2}, \qquad \bar{\hat{\theta}}^* = \frac{1}{B}\sum_{b = 1}^{B}\hat{\theta}^{(b)}. \]

1.2.2 Bias

Recall \(\text{bias}(\hat{\theta}) = E[\hat{\theta} - \theta] = E[\hat{\theta}] - \theta\).

Example 1.2






We can get a bootstrap estimate of the bias by comparing the average of the bootstrap replicates to the estimate from the original sample: \[ \widehat{\text{bias}}(\hat{\theta}) = \bar{\hat{\theta}}^* - \hat{\theta} = \frac{1}{B}\sum_{b = 1}^{B}\hat{\theta}^{(b)} - \hat{\theta}. \]
Overall, we seek statistics with small standard error and small bias.
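Given the replicates, both estimates are one-liners (a sketch, assuming theta.boot holds \(\hat{\theta}^{(1)}, \dots, \hat{\theta}^{(B)}\) and theta.hat is the estimate from the original sample):

    sd(theta.boot)                # bootstrap estimate of se(theta.hat)
    mean(theta.boot) - theta.hat  # bootstrap estimate of bias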

1.3 Sample Size and # Bootstrap Samples

\[ n = \text{sample size} \quad \& \quad B = \# \text{ bootstrap samples} \]

If \(n\) is too small, or the sample isn’t representative of the population, the bootstrap will perform poorly: resampling cannot recover information that was never in the data, and \(F_n\) will be a poor estimate of \(F\) no matter how large \(B\) is.

Guidelines for \(B\): roughly \(B \ge 200\) is usually adequate for standard error and bias estimates, while confidence intervals need more, typically \(B \ge 1000\).

Best approach – increase \(B\) until the results stabilize; replicates are computationally cheap, so err on the side of more.

Your Turn

In this example, we explore bootstrapping in the rare case where we know the values for the entire population. If you have all the data from the population, you don’t need the bootstrap (or, really, any inference), but comparing bootstrap estimates to the truth is a useful way to learn how the method behaves.

The bootstrap package contains the average LSAT and GPA for admission for the population of \(82\) USA law schools (an old data set – there are now over \(200\) law schools). The package also contains a random sample of size \(n = 15\) from this population.

##   LSAT  GPA
## 1  576 3.39
## 2  635 3.30
## 3  558 2.81
## 4  578 3.03
## 5  666 3.44
## 6  580 3.07

We will estimate the correlation \(\theta = \rho(\text{LSAT}, \text{GPA})\) between these two variables and use the bootstrap to estimate the sampling distribution of \(\hat{\theta}\); a sketch of one approach follows the tasks below.

## [1] 0.7763745
## [1] 0.7599979
  1. Plot the sampling distribution of \(\hat{\theta}\). Add vertical lines for the true value \(\theta\) and the sample estimate \(\hat{\theta}\).

  2. Estimate \(sd(\hat{\theta})\).

  3. Estimate the bias of \(\hat{\theta}\).
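A minimal sketch of one way to carry out these tasks, using the law (sample) and law82 (population) data from the bootstrap package:

    library(bootstrap)  # contains the law (n = 15) and law82 (N = 82) data

    B <- 2000
    theta.hat <- cor(law$LSAT, law$GPA)
    theta <- cor(law82$LSAT, law82$GPA)  # "true" value, known here

    # resample rows of the sample and recompute the correlation
    theta.boot <- replicate(B, {
      idx <- sample(seq_len(nrow(law)), replace = TRUE)
      cor(law$LSAT[idx], law$GPA[idx])
    })

    hist(theta.boot)                    # 1. sampling distribution
    abline(v = c(theta, theta.hat), lty = 1:2)
    sd(theta.boot)                      # 2. estimated sd of theta.hat
    mean(theta.boot) - theta.hat        # 3. estimated bias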

1.4 Bootstrap CIs

We will look at five different ways to create confidence intervals using the bootstrap and discuss which to use when.



  1. Percentile Bootstrap CI

  2. Basic Bootstrap CI

  3. Standard Normal Bootstrap CI

  4. Bootstrap \(t\)

  5. Accelerated Bias-Corrected (BCa)



Key ideas: the five methods trade off simplicity against coverage accuracy, and they differ in whether they respect transformations of \(\theta\).

1.4.1 Percentile Bootstrap CI

Let \(\hat{\theta}^{(1)}, \dots, \hat{\theta}^{(B)}\) be bootstrap replicates and let \(\hat{\theta}_{\alpha/2}\) be the \(\alpha/2\) quantile of \(\hat{\theta}^{(1)}, \dots, \hat{\theta}^{(B)}\).

Then, the \(100(1 - \alpha)\%\) Percentile Bootstrap CI for \(\theta\) is \[ \left(\hat{\theta}_{\alpha/2},\ \hat{\theta}_{1 - \alpha/2}\right). \]

In R, if bootstrap.reps = c(\(\hat{\theta}^{(1)}, \dots, \hat{\theta}^{(B)}\)), the percentile CI is
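For a chosen \(\alpha\):

    quantile(bootstrap.reps, c(alpha/2, 1 - alpha/2))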



Assumptions/usage: relies on the bootstrap distribution approximating the sampling distribution well; works best when that distribution is roughly symmetric and the estimator is nearly unbiased. It is transformation respecting, but only first-order accurate.

1.4.2 Basic Bootstrap CI

The \(100(1 - \alpha)\%\) Basic Bootstrap CI for \(\theta\) is \[ \left(2\hat{\theta} - \hat{\theta}_{1 - \alpha/2},\ 2\hat{\theta} - \hat{\theta}_{\alpha/2}\right), \] which comes from approximating the distribution of \(\hat{\theta} - \theta\) by the bootstrap distribution of \(\hat{\theta}^{(b)} - \hat{\theta}\) and pivoting.

Assumptions/usage: assumes the distribution of \(\hat{\theta} - \theta\) is well approximated by that of \(\hat{\theta}^{(b)} - \hat{\theta}\); like the percentile interval, it is only first-order accurate and can misbehave for skewed sampling distributions.
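In R, with the same bootstrap.reps and point estimate theta.hat:

    2 * theta.hat - quantile(bootstrap.reps, c(1 - alpha/2, alpha/2))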

1.4.3 Standard Normal Bootstrap CI

From the CLT, \[ z = \frac{\hat{\theta} - E[\hat{\theta}]}{se(\hat{\theta})} \stackrel{\cdot}{\sim} N(0, 1). \]

So, the \(100(1 - \alpha)\%\) Standard Normal Bootstrap CI for \(\theta\) is \[ \hat{\theta} \pm z_{1 - \alpha/2}\, \hat{se}(\hat{\theta}), \] where \(\hat{se}(\hat{\theta})\) is the bootstrap estimate of the standard error.

Assumptions/usage: assumes \(\hat{\theta}\) is approximately Normal and essentially unbiased; simple, but it ignores skewness of the sampling distribution.
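In R, using the bootstrap standard error from the replicates:

    theta.hat + qnorm(c(alpha/2, 1 - alpha/2)) * sd(bootstrap.reps)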

1.4.4 Bootstrap \(t\) CI (Studentized Bootstrap)

Even if the distribution of \(\hat{\theta}\) is Normal and \(\hat{\theta}\) is unbiased for \(\theta\), the Normal distribution is not exactly correct for \(z\).



Additionally, the distribution of \(\hat{se}(\hat{\theta})\) is unknown.



\(\Rightarrow\) The bootstrap \(t\) interval does not use a Student \(t\) distribution as the reference distribution; instead, we estimate the distribution of a “t type” statistic by resampling.


The \(100(1 - \alpha)\%\) Bootstrap \(t\) CI is \[ \left(\hat{\theta} - t^*_{1 - \alpha/2}\, \hat{se}(\hat{\theta}),\ \hat{\theta} - t^*_{\alpha/2}\, \hat{se}(\hat{\theta})\right), \] where \(t^*_q\) denotes the \(q\) quantile of the bootstrapped statistics \(t^{(b)}\) defined below.

Overview: resample to build up the distribution of a studentized statistic, then use its quantiles in place of Normal (or Student \(t\)) quantiles.

To estimate the “t style distribution” for \(\hat{\theta}\), for each bootstrap sample compute both \(\hat{\theta}^{(b)}\) and its estimated standard error \(\hat{se}(\hat{\theta}^{(b)})\) (often via a second level of bootstrapping within each sample), and form \[ t^{(b)} = \frac{\hat{\theta}^{(b)} - \hat{\theta}}{\hat{se}(\hat{\theta}^{(b)})}, \quad b = 1, \dots, B. \]
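A sketch of the nested computation; the inner bootstrap that estimates \(\hat{se}(\hat{\theta}^{(b)})\) is what makes this interval expensive:

    boot.t.ci <- function(x, statistic, B = 1000, B.inner = 100, alpha = 0.05) {
      theta.hat <- statistic(x)
      # bootstrap se of theta.hat from the original data
      se.hat <- sd(replicate(B, statistic(sample(x, replace = TRUE))))
      # t-type statistic for each outer bootstrap sample
      t.star <- replicate(B, {
        x.b <- sample(x, replace = TRUE)
        se.b <- sd(replicate(B.inner, statistic(sample(x.b, replace = TRUE))))
        (statistic(x.b) - theta.hat) / se.b
      })
      # (theta.hat - t*_{1-alpha/2} se.hat, theta.hat - t*_{alpha/2} se.hat)
      theta.hat - quantile(t.star, c(1 - alpha/2, alpha/2)) * se.hat
    }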

Assumptions/usage: does not require the sampling distribution to be Normal and is second-order accurate; the cost is the nested resampling needed for \(\hat{se}(\hat{\theta}^{(b)})\), and the intervals can be erratic for small samples.

1.4.5 BCa CIs

Modified version of the percentile interval that adjusts for bias of the estimator and skewness of the sampling distribution.

This method automatically selects a transformation so that the normality assumption holds.

Idea: if some monotone transformation of \(\hat{\theta}\) were Normal, possibly with a bias and a standard deviation that changes with \(\theta\), we could build an accurate interval on the transformed scale and map it back. BCa does this implicitly: it estimates a bias-correction factor \(z_0\) and an acceleration factor \(a\) (typically via the jackknife), then shifts which bootstrap percentiles are used as the interval endpoints.
The BCa method uses bootstrapping to estimate the bias and skewness then modifies which percentiles are chosen to get the appropriate confidence limits for a given data set.

In summary, the standard Normal, percentile, and basic intervals are simple but only first-order accurate; the bootstrap \(t\) and BCa intervals are second-order accurate at the cost of extra computation; and the percentile and BCa intervals are transformation respecting. BCa is often the default recommendation, though it typically needs a larger \(B\) to estimate the adjusted percentiles well.

Your Turn

We will consider a telephone repair example from Hesterberg (2014). Verizon has repair times for two groups of customers: ILEC, customers of the “Incumbent” local exchange carrier, and CLEC, customers of a “Competitive” local exchange carrier.

##    Time Group
## 1 17.50  ILEC
## 2  2.40  ILEC
## 3  0.00  ILEC
## 4  0.65  ILEC
## 5 22.23  ILEC
## 6  1.20  ILEC
Group       mean        sd    min     max
CLEC   16.509130  19.50358      0   96.32
ILEC    8.411611  14.69004      0  191.60

1.5 Bootstrapping CIs

There are many bootstrapping packages in R; we will use the boot package. The function boot generates \(R\) resamples of the data and computes the desired statistic(s) for each sample. This function requires 3 arguments:

  1. data = the data from the original sample (data.frame or matrix).
  2. statistic = a function to compute the statistic from the data, where the first argument is the data and the second argument is the indices of the observations in the bootstrap sample.
  3. R = the number of bootstrap replicates.

If we want bootstrap CIs, we can use the boot.ci function to generate the 5 different nonparametric bootstrap confidence intervals.
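A sketch of the calls that produce output like the following, assuming the repair times sit in a data frame named verizon with columns Time and Group (the name verizon is an assumption):

    library(boot)

    ilec <- verizon$Time[verizon$Group == "ILEC"]  # hypothetical data frame name
    mean.fun <- function(data, idx) mean(data[idx])

    boot.ilec <- boot(ilec, statistic = mean.fun, R = 2000)
    boot.ci(boot.ilec, conf = 0.95, type = c("perc", "basic", "norm", "bca"))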

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot.ilec, conf = 0.95, type = c("perc", "basic", 
##     "norm", "bca"))
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   ( 7.719,  9.114 )   ( 7.709,  9.119 )  
## 
## Level     Percentile            BCa          
## 95%   ( 7.704,  9.114 )   ( 7.752,  9.164 )  
## Calculations and Intervals on Original Scale
## [1] 7.709670 9.104182
## [1] 7.719039 9.113551
##     2.5%    97.5% 
## 7.707656 9.111150
##    97.5%     2.5% 
## 7.712071 9.115565

To get the studentized bootstrap CI, we need our statistic function to also return the variance of \(\hat{\theta}\).
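A sketch, reusing the ilec vector from above; the statistic function now returns the estimate and its estimated variance:

    mean.var.fun <- function(data, idx) {
      x <- data[idx]
      c(mean(x), var(x) / length(x))  # theta.hat and estimated Var(theta.hat)
    }

    boot.ilec_2 <- boot(ilec, statistic = mean.var.fun, R = 2000)
    boot.ci(boot.ilec_2, conf = 0.95, type = "stud")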

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot.ilec_2, conf = 0.95, type = "stud")
## 
## Intervals : 
## Level    Studentized     
## 95%   ( 7.733,  9.231 )  
## Calculations and Intervals on Original Scale

Which CI should we use?

1.6 Bootstrapping for the difference of two means

Given iid draws of size \(n\) and \(m\) from two populations, to compare the means of the two groups using the bootstrap:

  1. Draw a bootstrap sample of size \(n\) with replacement from the first group and, independently, a bootstrap sample of size \(m\) from the second group.

  2. Compute the difference in the resampled means, \(\hat{\theta}^{(b)} = \bar{x}^{*(b)}_1 - \bar{x}^{*(b)}_2\).

  3. Repeat \(B\) times and use the replicates to approximate the sampling distribution of \(\bar{x}_1 - \bar{x}_2\).

The function two.boot in the simpleboot package is used to bootstrap the difference between univariate statistics. Use the bootstrap to compute the shape, bias, and bootstrap standard error for the samples from the Verizon data set of CLEC and ILEC customers; a sketch follows.
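A sketch, assuming ilec and clec hold the repair times for the two groups:

    library(simpleboot)

    # bootstrap distribution of mean(ilec) - mean(clec)
    boot.diff <- two.boot(ilec, clec, mean, R = 2000)

    hist(boot.diff$t)                 # shape of the bootstrap distribution
    mean(boot.diff$t) - boot.diff$t0  # bootstrap estimate of bias
    sd(boot.diff$t)                   # bootstrap standard error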

Which confidence intervals should we use?

Is there evidence to reject \(H_0: \mu_1 - \mu_2 = 0\) in favor of \(H_a: \mu_1 - \mu_2 < 0\)?

2 Parametric Bootstrap

In a nonparametric bootstrap, we make no assumption about the form of \(F\): we estimate it with the empirical cdf \(F_n\) and draw the bootstrap samples by resampling the observed data with replacement.

In a parametric bootstrap, we assume \(F\) belongs to a parametric family \(F_{\boldsymbol\theta}\), estimate \(\hat{\boldsymbol\theta}\) from the data (e.g., by maximum likelihood), and draw the bootstrap samples from the fitted model \(F_{\hat{\boldsymbol\theta}}\).

For both methods, we compute \(\hat{\theta}^{(b)}\) from each bootstrap sample and use the \(B\) replicates to approximate the sampling distribution of \(\hat{\theta}\).
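A minimal parametric bootstrap sketch, assuming an Exponential model (the data here are simulated purely for illustration):

    x <- rexp(50, rate = 2)  # illustrative "observed" sample
    rate.hat <- 1 / mean(x)  # MLE of the Exponential rate

    B <- 2000
    rate.boot <- replicate(B, {
      x.star <- rexp(length(x), rate = rate.hat)  # draw from fitted model
      1 / mean(x.star)                            # re-estimate on x.star
    })
    sd(rate.boot)  # parametric bootstrap se of the MLE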

2.1 Bootstrapping for linear regression

Consider the regression model \(Y_i = \boldsymbol x_i^T\boldsymbol \beta + \epsilon_i, i = 1, \dots, n\) with \(\epsilon_i \stackrel{iid}{\sim}N(0, \sigma^2)\).




Two approaches for bootstrapping linear regression models –

  1. Bootstrapping the residuals

  2. Paired bootstrapping

2.1.1 Bootstrapping the residuals

  1. Fit the regression model using the original data

  2. Compute the residuals from the regression model, \[ \hat{\epsilon}_i = y_i - \hat{y}_i = y_i - \boldsymbol x_i^T\hat{\boldsymbol \beta}, \quad i = 1, \dots, n \]

  3. Sample \(\hat{\epsilon}^*_1, \dots, \hat{\epsilon}^*_n\) with replacement from \(\hat{\epsilon}_1, \dots, \hat{\epsilon}_n\).

  4. Create the bootstrap sample \[ y_i^* = \boldsymbol x_i^T\hat{\boldsymbol \beta} + \epsilon^*_i, \quad i = 1, \dots, n \]

  5. Estimate \(\hat{\boldsymbol\beta}^*\)

  6. Repeat steps 3-5 \(B\) times to create \(B\) bootstrap estimates \(\hat{\boldsymbol\beta}^{*(1)}, \dots, \hat{\boldsymbol\beta}^{*(B)}\).
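A sketch of this algorithm using the built-in cars data (the model dist ~ speed is purely illustrative):

    fit <- lm(dist ~ speed, data = cars)  # step 1
    res <- residuals(fit)                 # step 2
    y.fit <- fitted(fit)

    B <- 2000
    beta.boot <- replicate(B, {
      y.star <- y.fit + sample(res, replace = TRUE)  # steps 3-4
      coef(lm(y.star ~ speed, data = cars))          # step 5
    })
    apply(beta.boot, 1, sd)  # bootstrap se for each coefficient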

Assumptions: the model form is correct, and the errors are iid with mean zero and constant variance (the design \(\boldsymbol x_i\) is treated as fixed).

2.1.2 Paired bootstrapping

Resample \(z_i^* = (y_i, \boldsymbol x_i)^*\) from the empirical distribution of the pairs \((y_i, \boldsymbol x_i)\).
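A sketch of the paired bootstrap with the same illustrative cars model:

    B <- 2000
    beta.boot <- replicate(B, {
      idx <- sample(seq_len(nrow(cars)), replace = TRUE)  # resample pairs
      coef(lm(dist ~ speed, data = cars[idx, ]))
    })
    apply(beta.boot, 1, sd)  # bootstrap se for each coefficient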





Assumptions: only that the pairs \((y_i, \boldsymbol x_i)\) are iid draws from some joint distribution; the model form and error variance need not be correct, which makes this approach more robust to misspecification and heteroskedasticity.





2.1.3 Which to use?

  1. Standard inferences - appropriate when the errors are (approximately) Normal, or \(n\) is large, and the model is believed correct; cheapest and most efficient when the assumptions hold.


  2. Bootstrapping the residuals - keeps the design fixed and mimics standard inference; use when the model form and constant error variance are trusted but Normality of the errors is not.





  3. Paired bootstrapping - the most robust option; use when the model may be misspecified, the errors may be heteroskedastic, or the \(\boldsymbol x_i\) are themselves random.





Your Turn

This data set is the Puromycin data in R. The goal is to build a regression model for the rate of an enzymatic reaction as a function of the substrate concentration.

##   conc rate   state
## 1 0.02   76 treated
## 2 0.02   47 treated
## 3 0.06   97 treated
## 4 0.06  107 treated
## 5 0.11  123 treated
## 6 0.11  139 treated
## [1] 23  3

2.1.4 Standard regression

## 
## Call:
## lm(formula = rate ~ conc, data = Puromycin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.861 -15.247  -2.861  15.686  48.054 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    93.92       8.00   11.74 1.09e-10 ***
## conc          105.40      16.92    6.23 3.53e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.82 on 21 degrees of freedom
## Multiple R-squared:  0.6489, Adjusted R-squared:  0.6322 
## F-statistic: 38.81 on 1 and 21 DF,  p-value: 3.526e-06
##                2.5 %   97.5 %
## (Intercept) 77.28643 110.5607
## conc        70.21281 140.5832
## 
## Call:
## lm(formula = rate ~ log(conc), data = Puromycin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.250 -12.753   0.327  12.969  30.166 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  190.085      6.332   30.02  < 2e-16 ***
## log(conc)     33.203      2.739   12.12 6.04e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.2 on 21 degrees of freedom
## Multiple R-squared:  0.875,  Adjusted R-squared:  0.869 
## F-statistic: 146.9 on 1 and 21 DF,  p-value: 6.039e-11
##                 2.5 %   97.5 %
## (Intercept) 176.91810 203.2527
## log(conc)    27.50665  38.8987

3 Bootstrapping Dependent Data

Suppose we have dependent data \(\boldsymbol y = (y_1, \dots, y_n)\) generated from some unknown distribution \(F = F_{\boldsymbol Y} = F_{(Y_1, \dots, Y_n)}\).





Goal: estimate the sampling distribution of a statistic (e.g., the sample mean) computed from the dependent data \(y_1, \dots, y_n\).

Challenge: resampling individual observations with replacement destroys the dependence structure; the resulting bootstrap samples behave like iid data, so variance estimates based on them can be badly wrong.

We will consider 2 approaches: a model-based approach and a nonparametric (block bootstrap) approach.

3.1 Model-based approach

Example 3.1 Suppose we observe a time series \(\boldsymbol Y = (Y_1, \dots, Y_n)\) which we assume is generated by an AR(1) process, i.e., \[ Y_t = \phi Y_{t-1} + \epsilon_t, \quad \epsilon_t \stackrel{iid}{\sim} (0, \sigma^2), \quad |\phi| < 1. \]

If we assume an AR(1) model for the data, we can consider a method similar to bootstrapping residuals for linear regression:

  1. Estimate \(\hat{\phi}\) from the observed series.

  2. Compute the residuals \(\hat{\epsilon}_t = Y_t - \hat{\phi} Y_{t-1}\), \(t = 2, \dots, n\), and center them.

  3. Sample \(\epsilon^*_2, \dots, \epsilon^*_n\) with replacement from the centered residuals.

  4. Reconstruct a bootstrap series recursively: \(Y^*_1 = Y_1\) and \(Y^*_t = \hat{\phi} Y^*_{t-1} + \epsilon^*_t\), \(t = 2, \dots, n\).

  5. Compute the statistic of interest on \(\boldsymbol Y^*\), and repeat \(B\) times.
Model-based – the performance of this approach depends on the model being appropriate for the data.
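A sketch of this procedure for the mean of an observed series y (note that ar() fits the mean-centered form of the model, so the recursion below includes \(\hat{\mu}\)):

    fit <- ar(y, order.max = 1, aic = FALSE)  # step 1
    phi.hat <- fit$ar
    mu.hat <- fit$x.mean
    res <- na.omit(fit$resid)
    res <- res - mean(res)                    # step 2: centered residuals

    B <- 2000
    mean.boot <- replicate(B, {
      e.star <- sample(res, length(y), replace = TRUE)    # step 3
      y.star <- numeric(length(y))
      y.star[1] <- y[1]
      for (t in 2:length(y)) {                            # step 4
        y.star[t] <- mu.hat + phi.hat * (y.star[t - 1] - mu.hat) + e.star[t]
      }
      mean(y.star)                                        # step 5
    })
    sd(mean.boot)  # model-based bootstrap se of the mean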

3.2 Nonparametric approach

To deal with dependence in the data, we will employ a nonparametric block bootstrap.

Idea: resample blocks of consecutive observations, rather than individual data points, so that the dependence structure within each block is preserved.

3.2.1 Nonoverlapping Blocks (NBB)

Consider splitting \(\boldsymbol Y = (Y_1, \dots, Y_n)\) into \(b\) consecutive blocks of length \(\ell\) (assume for simplicity \(n = b\ell\)).









We can then rewrite the data as \(\boldsymbol Y = (\boldsymbol B_1, \dots, \boldsymbol B_b)\) with \(\boldsymbol B_k = (Y_{(k-1)\ell + 1}, \dots, Y_{k\ell})\), \(k = 1, \dots, b\).

To create a bootstrap series, sample \(b\) blocks \(\boldsymbol B^*_1, \dots, \boldsymbol B^*_b\) with replacement from \(\{\boldsymbol B_1, \dots, \boldsymbol B_b\}\) and concatenate them, \(\boldsymbol Y^* = (\boldsymbol B^*_1, \dots, \boldsymbol B^*_b)\). Compute the statistic on \(\boldsymbol Y^*\) and repeat \(B\) times.

Note, the order of data within the blocks must be maintained, but the order of the blocks that are resampled does not matter.

3.2.2 Moving Blocks (MBB)

Now consider splitting \(\boldsymbol Y = (Y_1, \dots, Y_n)\) into overlapping blocks of adjacent data points of length \(\ell\).









We can then write the blocks as \(\boldsymbol B_k = (Y_k,\dots, Y_{k + \ell -1})\), \(k = 1, \dots, n - \ell + 1\). To generate a bootstrap series, sample \(\lceil n/\ell \rceil\) blocks with replacement from \(\{\boldsymbol B_1, \dots, \boldsymbol B_{n - \ell + 1}\}\), concatenate them, and truncate to length \(n\).

3.2.3 Choosing Block Size

If the block length is too short, the dependence structure is not preserved: the resampled series is too close to iid, and the variance is typically underestimated.

If the block length is too long, there are too few distinct blocks to resample, so the bootstrap distribution is based on little information and becomes highly variable. In practice, \(\ell\) should grow with \(n\) (but more slowly) to balance these two errors.

Your Turn

We will look at the annual numbers of lynx trappings for 1821–1934 in Canada, taken from Brockwell & Davis (1991).

Goal: estimate the sampling distribution of the mean.

## [1] 1538.018

3.2.4 Independent Bootstrap

We must account for the dependence to obtain a correct estimate of the variance!

The autocorrelation (acf) at the dominant lags is positive, so the independent bootstrap underestimates the standard error.
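For comparison, a block-bootstrap sketch using boot::tsboot; the block length l = 12 is an arbitrary choice here and should be tuned in practice:

    library(boot)

    # moving-block bootstrap for the mean of the lynx series
    boot.lynx <- tsboot(lynx, statistic = mean, R = 2000, l = 12, sim = "fixed")
    sd(boot.lynx$t)  # compare to the (too small) independent-bootstrap se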

4 Summary

Bootstrap methods are simulation methods for frequentist inference.

Bootstrap methods are useful for estimating standard errors, bias, and confidence intervals for statistics whose sampling distributions are unknown or hard to derive, with minimal assumptions about the underlying distribution.

Bootstrap methods can fail when the sample is too small or unrepresentative of the population, when the statistic is nonsmooth or sits at the boundary of the parameter space (e.g., extreme order statistics), or when dependence in the data is ignored.