Bootstrapping

STAT 20: Introduction to Probability and Statistics

Agenda

  • Concept Question
  • Bootstrapping with infer
  • Lab 5.2
  • Activity: The Bootstrap

Concept Question

Which of these is a valid bootstrap sample?

01:00




Original Sample

  name   species     length
  Gus    Chinstrap     50.7
  Luz    Gentoo        48.5
  Ida    Chinstrap     52.8
  Ola    Gentoo        44.5
  Abe    Adelie        42.0

Bootstrap Sample A

  name   species     length
  Ida    Chinstrap     52.8
  Luz    Gentoo        48.5
  Abe    Adelie        42.0
  Ola    Gentoo        44.5
  Ida    Chinstrap     52.8

Bootstrap Sample B

  name   species     length
  Ola    Gentoo        44.5
  Gus    Chinstrap     50.7
  Ida    Chinstrap     52.8
  Luz    Gentoo        48.5
  Gus    Chinstrap     50.7
  Gus    Chinstrap     50.7

Bootstrap Sample C

  name   species     length
  Gus    Chinstrap     50.7
  Ola    Gentoo        48.5
  Ola    Chinstrap     52.8
  Ida    Gentoo        44.5
  Ida    Adelie        42.0

Bootstrap Sample D

  name   species     length
  Gus    Chinstrap     50.7
  Abe    Adelie        42.0
  Gus    Chinstrap     50.7
  Gus    Chinstrap     50.7
  Gus    Chinstrap     50.7

The Bootstrap

Parameters and Statistics


Our Goal: Assess the sampling error or variability in our estimate of some population parameter


Our Tool: The Bootstrap
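
The mechanics in miniature: a bootstrap sample is drawn from the original sample with replacement, keeps whole rows intact, and has the same size as the original sample. As a rough sketch (the tribble below simply re-creates the five penguins from the concept question; the name original is just for illustration), one resample could be drawn with dplyr:

library(tidyverse)

# Re-create the original sample of five penguins from the concept question
original <- tribble(
  ~name, ~species,    ~length,
  "Gus", "Chinstrap", 50.7,
  "Luz", "Gentoo",    48.5,
  "Ida", "Chinstrap", 52.8,
  "Ola", "Gentoo",    44.5,
  "Abe", "Adelie",    42.0
)

# One bootstrap sample: draw n = 5 whole rows, with replacement
original %>%
  slice_sample(n = nrow(original), replace = TRUE)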

Bootstrapping with infer

Example: Penguins

Let’s consider our 344 penguins to be a simple random sample (SRS) from the broader population of Antarctic penguins. What are point and interval estimates for the population proportion of penguins that are Adelie?


library(tidyverse)        # for mutate() and ggplot()
library(palmerpenguins)   # assumed source of the penguins data (344 penguins)

# TRUE when the penguin is an Adelie, FALSE otherwise
penguins <- penguins %>%
  mutate(is_adelie = species == "Adelie")

penguins %>%
  ggplot(aes(x = is_adelie)) +
  geom_bar()




Point estimate

# Point estimate: the sample proportion of Adelie penguins
# (the mean of a logical vector is the proportion of TRUEs)
obs_stat <- penguins %>%
  summarize(p_adelie = mean(is_adelie))
obs_stat
# A tibble: 1 × 1
  p_adelie
     <dbl>
1    0.442

Generating one bootstrap sample

library(infer)

# specify() declares the response and which level counts as a "success";
# generate() draws one resample of 344 rows, with replacement
penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 1, 
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 FALSE    
 3         1 TRUE     
 4         1 FALSE    
 5         1 TRUE     
 6         1 TRUE     
 7         1 FALSE    
 8         1 TRUE     
 9         1 TRUE     
10         1 TRUE     
# ℹ 334 more rows

Two more bootstrap samples

penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 1, 
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 TRUE     
 3         1 FALSE    
 4         1 FALSE    
 5         1 FALSE    
 6         1 TRUE     
 7         1 TRUE     
 8         1 FALSE    
 9         1 FALSE    
10         1 FALSE    
# ℹ 334 more rows
penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 1, 
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 TRUE     
 3         1 TRUE     
 4         1 FALSE    
 5         1 FALSE    
 6         1 TRUE     
 7         1 TRUE     
 8         1 FALSE    
 9         1 FALSE    
10         1 FALSE    
# ℹ 334 more rows

Visualizing 9 bootstrap samples

penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 9, 
           type = "bootstrap") %>%
  ggplot(aes(x = is_adelie)) +
  geom_bar() +
  facet_wrap(vars(replicate),
             nrow = 3)

Calculating 9 values of \(\hat{p}\)

penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 9, 
           type = "bootstrap") %>%
  calculate(stat = "prop")
Response: is_adelie (factor)
# A tibble: 9 × 2
  replicate  stat
      <int> <dbl>
1         1 0.404
2         2 0.430
3         3 0.404
4         4 0.433
5         5 0.468
6         6 0.448
7         7 0.427
8         8 0.413
9         9 0.474

Note the change in data frame size: generate() produced 9 × 344 = 3,096 resampled rows, and calculate() collapses them to just 9 rows, one \(\hat{p}\) per replicate.
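
For intuition, calculate(stat = "prop") does essentially what a grouped summary would do: compute one proportion per replicate. A rough dplyr equivalent is sketched below (note that specify() has converted is_adelie to a factor, so we compare against the label "TRUE"):

penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 9, 
           type = "bootstrap") %>%
  group_by(replicate) %>%                       # 9 groups of 344 resampled rows
  summarize(stat = mean(is_adelie == "TRUE"))   # one p-hat per replicate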

The bootstrap distribution (reps = 500)

penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 500, 
           type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  ggplot(aes(x = stat)) +
  geom_histogram()

Interval Estimate

We can extract the middle 95% by identifying the .025 quantile and the .975 quantile of the bootstrap distribution with get_ci().

penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 500, 
           type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = .95)
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.392    0.494
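
Under the hood, get_ci() with level = .95 uses the percentile method by default, so it is equivalent to taking those two quantiles yourself. A sketch (boot_dist is just an illustrative name for the saved bootstrap statistics):

# Save the bootstrap distribution (illustrative name: boot_dist)
boot_dist <- penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 500, 
           type = "bootstrap") %>%
  calculate(stat = "prop")

# Middle 95%: the .025 and .975 quantiles of the bootstrap statistics
boot_dist %>%
  summarize(lower_ci = quantile(stat, .025),
            upper_ci = quantile(stat, .975))

Because the resamples are random, the endpoints will wobble slightly from run to run.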

Documentation: infer.tidymodels.org

Lab 5.2

30:00

Break

05:00

Activity: The Bootstrap

Why do we care about sampling error or variability anyway?

Uncertainty Quantification

Confidence intervals give us a sense of uncertainty.


We quantify this uncertainty with a range of values around the point estimate:

  • Wide range = high uncertainty
  • Narrow range = low uncertainty

But when do we really care about uncertainty? When isn’t a point estimate good enough?

The First Chemotherapy Trial: Context

Detailed in The Emperor of All Maladies: A Biography of Cancer.

  • We have known about cancer for a long time. For most of that time, we had essentially zero ability to treat it (in spite of great efforts)
  • By the 1900s, we knew that cancer cells divide faster than normal cells. So, scientists hypothesized that a chemical that kills quickly dividing cells while sparing normal, non-cancerous cells would be a good treatment
  • But finding such a substance was tricky
  • Late 1940s: Sidney Farber came upon aminopterin, a folic acid antagonist (a chemotherapy), which seemed promising based on previous scientific observations

The First Chemotherapy Trial: The Trial

  • 1947: Dr. Farber tried this chemotherapy on children with a certain type of leukemia. Children with this type of leukemia typically lived only a few months after their diagnosis
  • Small trial of 16 children
  • Results: Chemotherapy extended life by a few additional months compared to no treatment.
  • These results were groundbreaking since essentially no previous treatment was found to do anything, despite decades of efforts.

The First Chemotherapy Trial: The Stakes

  • Chemotherapies are poison. They have bad side effects and can kill patients (especially when used improperly).
  • Small sample size: Dr. Farber didn’t want to unnecessarily give many children a poison if that poison would not improve their prognosis.
  • Effect size: The treatment only extended life for a few months, so it had a small effect size.
  • The right conclusions mattered. You really don’t want to unnecessarily give people chemotherapy if it doesn’t help. On the other hand, if there is a drug that can extend life, you really do want to give it to patients.

Uncertainty quantification matters when…

  • You make a decision based on the estimate
  • Making the right decision is important
  • You have a small sample size (why? see the sketch below)
  • It is expensive to increase the sample size

Uncertainty Quantification matters when it could change the decision you would have made based on the point estimate alone, and that decision has real consequences.
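
To see why a small sample size raises the stakes, here is a sketch using the penguins example and the tools from earlier in this lecture (not the trial data): bootstrapping a subsample of just 16 penguins produces a far wider 95% interval than the full sample of 344 did.

set.seed(20)   # for reproducibility of the random subsample

penguins %>%
  slice_sample(n = 16) %>%                 # a hypothetical small sample
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 500, 
           type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = .95)                      # typically far wider than the full-sample interval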

Uncertainty Quantification in the Age of Big Data

Consider Meta. This company runs experiments on Facebook users, and each experiment has an incredibly large sample size. So what?


“Because they are so large, studies based on supersized samples can produce results that are statistically significant but at the same time are substantively trivial. It’s simple math: The larger the sample size, the smaller any differences need to be to be statistically significant—that is, highly likely to be truly different from each other.”

From: Pew Research
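
To make the “simple math” concrete, here is a sketch with made-up numbers (not from any actual Meta experiment): a 0.2 percentage point difference between two groups is buried in sampling noise with a thousand users per group, but sits roughly nine standard errors from zero with ten million users per group, so it would be called statistically significant while remaining substantively trivial.

# Hypothetical group proportions differing by 0.2 percentage points
p1 <- 0.500
p2 <- 0.502

# Standard error of the difference in proportions with n users per group
se_diff <- function(n) sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)

(p2 - p1) / se_diff(1e3)   # ~0.09 standard errors: indistinguishable from noise
(p2 - p1) / se_diff(1e7)   # ~9 standard errors: "statistically significant"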