13  Bootstrap Confidence Intervals

Author

Shelly Sinclair and Alvonee Penn

Published

May 28, 2024

13.1 Bootstrap

What is bootstrapping?

Bootstrapping is a technique from Efron (1979) that is built on a simple idea: if the data we have is a sample from a population, why don’t we sample from our own data to make more samples? Now, because we don’t have access to any new data, we’re going to take samples of our data set with replacement.
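
To make "sampling with replacement" concrete, here is a minimal sketch in base R. The vector toy_sample and its values are made up for illustration and are not part of the penguin analysis.

```r
# A toy "sample" of five observations (hypothetical values for illustration)
toy_sample <- c(2, 4, 6, 8, 10)

set.seed(1)

# Draw a resample of the same size, WITH replacement:
# some values may appear more than once, others not at all
resample <- sample(toy_sample, size = length(toy_sample), replace = TRUE)
resample

# Every resampled value comes from the original data; no new data are created
all(resample %in% toy_sample)

# The statistic of interest (here the mean) varies from resample to resample
mean(toy_sample)
mean(resample)
```

Repeating the resampling step many times and recording the statistic each time gives the bootstrap distribution of that statistic.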

13.1.1 When to use bootstrapping?

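Bootstrapping does not add new information or increase the sample size. Its purpose is to approximate the sampling distribution of a statistic by resampling the data we already have. This is useful when the sample we have been given is small, or when the theoretical distribution of the statistic is unknown or hard to derive.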

13.2 Packages

palmerpenguins - contains the dataset we will be using in this lesson.

tidyverse - includes the packages to be used for data wrangling.

boot - provides the boot() and boot.ci() functions for generating bootstrap replicates and confidence intervals.

# install.packages("palmerpenguins")
# install.packages("boot")

library(palmerpenguins)
library(boot)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

13.3 Penguins

These data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica. The palmerpenguins package contains two datasets. One is called penguins, and is a simplified version of the raw data.

The penguins dataset includes measurements for penguin species, island in the Palmer Archipelago, size (flipper length, body mass, bill dimensions), and sex. The full dataset is a tibble with 344 rows and 8 variables (Horst, Hill, and Gorman 2020).

Research Question: What is the difference in flipper length of Adelie penguins from two different islands?

# load the data
data(penguins)

# see the structure of the data
str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

First, we filter the data to the Adelie penguins on two of the islands where they reside, Torgersen and Biscoe. We keep only the species, island, and flipper length columns, sort the rows by island, and drop rows with missing values.

For simplicity, we excluded the island of Dream because its sample was larger than the Torgersen and Biscoe samples.

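new_penguins_df <- 
  penguins %>% 
  # keep Adelie penguins from every island except Dream
  filter(species == "Adelie", island != "Dream") %>% 
  select(species, island, flipper_length_mm) %>% 
  # sort by island (Biscoe first, then Torgersen)
  arrange(island) %>% 
  # remove rows with missing flipper length
  drop_na()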

str(new_penguins_df)
tibble [95 × 3] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ flipper_length_mm: int [1:95] 174 180 189 185 180 187 183 187 172 180 ...

13.3.1 Distribution

Boxplots and histograms will be useful to understand the distribution of the data.

Based on these plots, the flipper lengths do not appear to be normally distributed.

# check the boxplot of the data
boxplot(
  new_penguins_df$flipper_length_mm ~ new_penguins_df$island, las = 1, 
  ylab = "Flipper Length (mm)",
  xlab = "Island",
  main = "Flipper Length by Island"
)

# check the histogram of the data
hist(
  x = new_penguins_df$flipper_length_mm,
  main = "Distribution of Flipper Length (mm)",
  xlab = "Flipper Length"
)
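
A visual impression of non-normality can be supplemented with a normal Q-Q plot and a Shapiro-Wilk test. The following is a minimal sketch rather than part of the original analysis. It rebuilds the flipper length vector directly from penguins so that it runs on its own, and the object names flipper and sw are our own choices.

```r
library(palmerpenguins)

# Flipper lengths for Adelie penguins on Biscoe and Torgersen (Dream excluded)
flipper <- with(
  penguins,
  flipper_length_mm[species == "Adelie" & island != "Dream"]
)
flipper <- flipper[!is.na(flipper)]

# Normal Q-Q plot: points far from the reference line suggest non-normality
qqnorm(flipper, main = "Normal Q-Q Plot of Flipper Length (mm)")
qqline(flipper)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(flipper)
sw
```

The test result may or may not agree with the visual impression, so it is best read alongside the plots rather than in place of them.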

13.3.2 Bootstrapping Test

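To answer the research question, we are interested in the difference in mean flipper length between the two islands. If that difference is significant, we can reject the null hypothesis of no difference, which would indicate that there is a difference in flipper length among penguins of the same species that come from different islands. As a first step, the code below shows how to draw bootstrap resamples, with replacement and of the same size as the original data, from the pooled flipper lengths.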

# set a seed so that our random results can be replicated by other people:
set.seed(20150516)

# take a random re-sample of the data that is the *same size*
N <- length(new_penguins_df$flipper_length_mm)

# a random sample:
sample(new_penguins_df$flipper_length_mm, size = N, replace = TRUE)
 [1] 184 192 198 195 195 176 188 183 184 193 199 198 184 190 198 195 195 199 193
[20] 197 198 189 197 188 189 199 200 190 183 198 194 190 191 196 189 195 198 197
[39] 191 184 198 180 195 186 193 193 191 195 190 198 189 181 197 196 182 200 188
[58] 184 202 189 197 186 181 195 181 191 185 193 196 185 192 199 186 196 180 190
[77] 190 195 197 193 191 181 195 190 186 189 192 187 190 195 195 182 172 194 181
# number of bootstrap samples
B_int <- 10000

# create a list of these thousands of samples 
bootstrapSamples_ls <- map(
  .x = 1:B_int,
  .f = ~{
    sample(new_penguins_df$flipper_length_mm, size = N, replace = TRUE)
  }
)

# subset of the random samples 
bootstrapSamples_ls[1:3]
[[1]]
 [1] 183 190 189 188 181 198 181 172 187 189 189 193 180 197 191 190 196 191 195
[20] 181 193 190 190 186 188 195 190 197 198 190 180 198 194 188 195 191 203 199
[39] 190 189 195 186 189 199 202 197 189 190 194 190 181 190 190 181 186 196 174
[58] 185 174 202 191 184 181 184 193 190 190 190 191 196 189 195 195 198 193 190
[77] 197 184 186 188 193 190 191 195 198 180 191 185 189 192 183 192 199 186 195

[[2]]
 [1] 187 194 187 189 184 188 187 187 184 197 193 191 187 189 190 172 187 186 180
[20] 193 191 195 195 180 184 189 197 191 187 186 186 187 184 188 190 193 198 190
[39] 195 198 184 197 195 195 195 198 194 191 198 197 198 186 194 195 189 186 181
[58] 180 191 180 191 193 196 191 202 191 187 181 199 172 181 191 195 195 194 198
[77] 191 191 190 192 190 199 195 193 195 197 188 181 190 185 186 191 174 193 195

[[3]]
 [1] 191 196 203 195 185 195 193 186 186 202 186 203 187 180 185 186 192 202 186
[20] 192 200 195 184 185 195 193 199 190 189 185 181 181 188 197 181 190 188 185
[39] 187 184 184 195 199 186 200 186 192 195 190 182 189 191 203 193 195 191 191
[58] 199 195 198 187 191 195 190 190 187 189 192 186 199 193 190 187 181 190 191
[77] 190 190 183 193 190 197 181 190 187 198 187 190 200 184 190 184 186 191 193
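
The research question asks about the difference between islands, while the resamples above pool both islands together. One possible way to address the question directly is to resample each island separately and bootstrap the difference in means. The following is a sketch under that assumption, not the authors' method. The object names (adelie, biscoe, torgersen, boot_diffs, ci_diff) are our own.

```r
library(palmerpenguins)

# Adelie penguins on the two islands of interest, with flipper length recorded
adelie <- subset(
  penguins,
  species == "Adelie" &
    island %in% c("Biscoe", "Torgersen") &
    !is.na(flipper_length_mm)
)

biscoe    <- adelie$flipper_length_mm[adelie$island == "Biscoe"]
torgersen <- adelie$flipper_length_mm[adelie$island == "Torgersen"]

# Observed difference in mean flipper length (Torgersen minus Biscoe)
observed_diff <- mean(torgersen) - mean(biscoe)
observed_diff

set.seed(20150516)
B <- 10000

# Resample each island with replacement, keeping its own sample size,
# and record the difference in means for every bootstrap replicate
boot_diffs <- replicate(B, {
  mean(sample(torgersen, replace = TRUE)) - mean(sample(biscoe, replace = TRUE))
})

# 95% percentile confidence interval for the difference in means
ci_diff <- quantile(boot_diffs, probs = c(0.025, 0.975))
ci_diff
```

If the resulting interval excludes 0, that is evidence of a difference in mean flipper length between the two islands. If it includes 0, the data are consistent with no difference.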

13.3.3 Building Confidence Intervals for Various Statistics: Example 1

# The Sample Mean
bootMeans_num <-
  bootstrapSamples_ls %>%
  # the map_dbl() function takes in a list and returns an atomic vector of type
  #   double (numeric)
  map_dbl(mean)

# histogram of the bootstrap means, which should look approximately normal
hist(bootMeans_num)

# 95% percentile confidence interval for the mean
quantile(bootMeans_num, probs = c(0.025, 0.975))
    2.5%    97.5% 
188.7682 191.3684 
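
The same percentile approach works for statistics other than the mean. As an illustration, here is a minimal sketch for the median. It rebuilds the flipper length vector so that it runs on its own, and the names flipper, boot_medians, and ci_median are our own.

```r
library(palmerpenguins)

# Flipper lengths for Adelie penguins on Biscoe and Torgersen (Dream excluded)
flipper <- with(
  penguins,
  flipper_length_mm[species == "Adelie" & island != "Dream"]
)
flipper <- flipper[!is.na(flipper)]

set.seed(20150516)

# Compute the median of each of 10,000 bootstrap resamples
boot_medians <- replicate(
  10000,
  median(sample(flipper, size = length(flipper), replace = TRUE))
)

# The bootstrap distribution of a median is often lumpier than that of a mean,
# because only a few distinct data values can be the median of a resample
hist(boot_medians, main = "Bootstrap Medians", xlab = "Median Flipper Length (mm)")

# 95% percentile confidence interval for the median
ci_median <- quantile(boot_medians, probs = c(0.025, 0.975))
ci_median
```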

13.3.4 Building Confidence Intervals for Various Statistics: Example 2

Source: https://www.geeksforgeeks.org/bootstrap-confidence-interval-with-r-programming/

# Custom function to find the correlation between petal length and petal width
# in the built-in iris dataset
corr.fun <- function(data, idx) {

  # idx is the vector of resampled row indices supplied by boot()
  df <- data[idx, ]

  # Find the Spearman correlation between the
  # 3rd (Petal.Length) and 4th (Petal.Width) columns of the dataset
  cor(df[, 3], df[, 4], method = "spearman")
}

# Setting the seed for reproducibility of results
set.seed(42)

# Calling the boot function with the dataset
bootstrap <- boot(iris, corr.fun, R = 1000)

# Display the result of boot function
bootstrap

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = iris, statistic = corr.fun, R = 1000)


Bootstrap Statistics :
     original       bias    std. error
t1* 0.9376668 -0.002717295 0.009436212
# Plot the bootstrap sampling distribution (base graphics: histogram and Q-Q plot)
plot(bootstrap)

# Function to find the bootstrap CI
boot.ci(
  boot.out = bootstrap,
  type = "perc"
)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = bootstrap, type = "perc")

Intervals : 
Level     Percentile     
95%   ( 0.9142,  0.9519 )  
Calculations and Intervals on Original Scale
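
The boot() workflow can also be applied to the penguin data from Example 1. The following is a minimal sketch, assuming the same subset of Adelie penguins. The names flipper_df, mean_fun, flipper_boot, and flipper_ci are our own. The resulting interval should be broadly similar to the quantile-based one in Example 1, though not identical, because the seed and the number of replicates differ.

```r
library(boot)
library(palmerpenguins)

# Adelie penguins on Biscoe and Torgersen with flipper length recorded
flipper_df <- as.data.frame(subset(
  penguins,
  species == "Adelie" & island != "Dream" & !is.na(flipper_length_mm)
))

# Statistic function in the form boot() expects:
# the data first, then the vector of resampled row indices
mean_fun <- function(data, idx) {
  mean(data$flipper_length_mm[idx])
}

set.seed(42)

# 1000 bootstrap replicates of the mean flipper length
flipper_boot <- boot(data = flipper_df, statistic = mean_fun, R = 1000)
flipper_boot

# 95% percentile confidence interval
flipper_ci <- boot.ci(boot.out = flipper_boot, type = "perc")
flipper_ci
```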

13.4 Conclusion

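The bootstrapping method is useful when working with relatively small samples or when the sampling distribution of a statistic is hard to derive. It does not increase the sample size or make the data more normal. Instead, resampling with replacement approximates the sampling distribution of a statistic, from which we can build confidence intervals.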

Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.