Bootstrapping

STAT 20: Introduction to Probability and Statistics

Agenda

  • Concept Question
  • Bootstrapping with infer
  • Lab 5.2
  • Activity: The Bootstrap

Concept Question

Which of these is a valid bootstrap sample?

01:00




Original Sample

  name   species     length
  Gus    Chinstrap     50.7
  Luz    Gentoo        48.5
  Ida    Chinstrap     52.8
  Ola    Gentoo        44.5
  Abe    Adelie        42.0

Bootstrap Sample A

  name   species     length
  Ida    Chinstrap     52.8
  Luz    Gentoo        48.5
  Abe    Adelie        42.0
  Ola    Gentoo        44.5
  Ida    Chinstrap     52.8

Bootstrap Sample B

  name   species     length
  Ola    Gentoo        44.5
  Gus    Chinstrap     50.7
  Ida    Chinstrap     52.8
  Luz    Gentoo        48.5
  Gus    Chinstrap     50.7
  Gus    Chinstrap     50.7

Bootstrap Sample C

  name   species     length
  Gus    Chinstrap     50.7
  Ola    Gentoo        48.5
  Ola    Chinstrap     52.8
  Ida    Gentoo        44.5
  Ida    Adelie        42.0

Bootstrap Sample D

  name   species     length
  Gus    Chinstrap     50.7
  Abe    Adelie        42.0
  Gus    Chinstrap     50.7
  Gus    Chinstrap     50.7
  Gus    Chinstrap     50.7

The Bootstrap

Parameters and Statistics


Our Goal: Assess the sampling error or variability in our estimate of some population parameter


Our Tool: The Bootstrap
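
The mechanics in miniature: a bootstrap sample is drawn from the original sample with replacement, keeps whole rows intact, and has the same size as the original sample. As a rough sketch (the tribble below simply re-creates the five penguins from the concept question; the name original is just for illustration), one resample could be drawn with dplyr:

library(tidyverse)

# Re-create the original sample of five penguins from the concept question
original <- tribble(
  ~name, ~species,    ~length,
  "Gus", "Chinstrap", 50.7,
  "Luz", "Gentoo",    48.5,
  "Ida", "Chinstrap", 52.8,
  "Ola", "Gentoo",    44.5,
  "Abe", "Adelie",    42.0
)

# One bootstrap sample: draw n = 5 whole rows, with replacement
original %>%
  slice_sample(n = nrow(original), replace = TRUE)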

Bootstrapping with infer

Example: Penguins

Let’s consider our 344 penguins to be a simple random sample (SRS) from the broader population of Antarctic penguins. What are point and interval estimates for the population proportion of penguins that are Adelie?


library(tidyverse)        # for mutate() and ggplot()
library(palmerpenguins)   # assumed source of the penguins data (344 penguins)

# TRUE when the penguin is an Adelie, FALSE otherwise
penguins <- penguins %>%
  mutate(is_adelie = species == "Adelie")

penguins %>%
  ggplot(aes(x = is_adelie)) +
  geom_bar()




Point estimate

# Point estimate: the sample proportion of Adelie penguins
# (the mean of a logical vector is the proportion of TRUEs)
obs_stat <- penguins %>%
  summarize(p_adelie = mean(is_adelie))
obs_stat
# A tibble: 1 × 1
  p_adelie
     <dbl>
1    0.442

Generating one bootstrap sample

library(infer)

# specify() declares the response and which level counts as a "success";
# generate() draws one resample of 344 rows, with replacement
penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 1, 
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 FALSE    
 3         1 TRUE     
 4         1 FALSE    
 5         1 TRUE     
 6         1 TRUE     
 7         1 FALSE    
 8         1 TRUE     
 9         1 TRUE     
10         1 TRUE     
# ℹ 334 more rows

Two more bootstrap samples

penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 1, 
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 TRUE     
 3         1 FALSE    
 4         1 FALSE    
 5         1 FALSE    
 6         1 TRUE     
 7         1 TRUE     
 8         1 FALSE    
 9         1 FALSE    
10         1 FALSE    
# ℹ 334 more rows
penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 1, 
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 TRUE     
 3         1 TRUE     
 4         1 FALSE    
 5         1 FALSE    
 6         1 TRUE     
 7         1 TRUE     
 8         1 FALSE    
 9         1 FALSE    
10         1 FALSE    
# ℹ 334 more rows

Visualizing 9 bootstrap samples

penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 9, 
           type = "bootstrap") %>%
  ggplot(aes(x = is_adelie)) +
  geom_bar() +
  facet_wrap(vars(replicate),
             nrow = 3)

Calculating 9 values of \(\hat{p}\)

penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 9, 
           type = "bootstrap") %>%
  calculate(stat = "prop")
Response: is_adelie (factor)
# A tibble: 9 × 2
  replicate  stat
      <int> <dbl>
1         1 0.404
2         2 0.430
3         3 0.404
4         4 0.433
5         5 0.468
6         6 0.448
7         7 0.427
8         8 0.413
9         9 0.474

Note the change in data frame size: generate() produced 9 × 344 = 3,096 resampled rows, and calculate() collapses them to just 9 rows, one \(\hat{p}\) per replicate.
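
For intuition, calculate(stat = "prop") does essentially what a grouped summary would do: compute one proportion per replicate. A rough dplyr equivalent is sketched below (note that specify() has converted is_adelie to a factor, so we compare against the label "TRUE"):

penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 9, 
           type = "bootstrap") %>%
  group_by(replicate) %>%                       # 9 groups of 344 resampled rows
  summarize(stat = mean(is_adelie == "TRUE"))   # one p-hat per replicate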

The bootstrap distribution (reps = 500)

penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 500, 
           type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  ggplot(aes(x = stat)) +
  geom_histogram()

Interval Estimate

We can extract the middle 95% by identifying the .025 quantile and the .975 quantile of the bootstrap distribution with get_ci().

penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 500, 
           type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = .95)
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.392    0.494
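
Under the hood, get_ci() with level = .95 uses the percentile method by default, so it is equivalent to taking those two quantiles yourself. A sketch (boot_dist is just an illustrative name for the saved bootstrap statistics):

# Save the bootstrap distribution (illustrative name: boot_dist)
boot_dist <- penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 500, 
           type = "bootstrap") %>%
  calculate(stat = "prop")

# Middle 95%: the .025 and .975 quantiles of the bootstrap statistics
boot_dist %>%
  summarize(lower_ci = quantile(stat, .025),
            upper_ci = quantile(stat, .975))

Because the resamples are random, the endpoints will wobble slightly from run to run.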

Documentation: infer.tidymodels.org

Lab 5.2

30:00

Break

05:00

Activity: The Bootstrap

Why do we care about sampling error or variability anyway?

Uncertainty Quantification

Confidence intervals give us a sense of uncertainty.


We quantify this uncertainty with a range of values around the point estimate:

  • Wide range = high uncertainty
  • Narrow range = low uncertainty

But when do we really care about uncertainty? When isn’t a point estimate good enough?

The First Chemotherapy Trial: Context

Detailed in The Emperor of All Maladies: A Biography of Cancer.

  • We have known about cancer for a long time. For most of that time, we had essentially zero ability to treat it (in spite of great efforts)
  • By the 1900s, we knew that cancer cells divide faster than normal cells. So, scientists hypothesized that a chemical that kills quickly dividing cells while sparing normal, non-cancerous cells would be a good treatment
  • But finding such a substance was tricky
  • Late 1940s: Sidney Farber came upon aminopterin, a folic acid antagonist (a chemotherapy), which seemed promising based on previous scientific observations

The First Chemotherapy Trial: The Trial

  • 1947: Dr. Farber tried this chemotherapy on children with a certain type of leukemia. Children with this type of leukemia typically lived only a few months after their diagnosis
  • Small trial of 16 children
  • Results: Chemotherapy extended life by a few additional months compared to no treatment.
  • These results were groundbreaking since essentially no previous treatment was found to do anything, despite decades of efforts.

The First Chemotherapy Trial: The Stakes

  • Chemotherapies are poison. They have bad side effects and can kill patients (especially when used improperly).
  • Small sample size: Dr. Farber didn’t want to unnecessarily give many children a poison if that poison would not improve their prognosis.
  • Effect size: The treatment only extended life for a few months, so it had a small effect size.
  • The right conclusions mattered. You really don’t want to unnecessarily give people chemotherapy if it doesn’t help. On the other hand, if there is a drug that can extend life, you really do want to give it to patients.

Uncertainty quantification matters when…

  • You make a decision based on the estimate
  • Making the right decision is important
  • You have a small sample size (why? see the sketch below)
  • It is expensive to increase the sample size

Uncertainty Quantification matters when it could change the decision you would have made based on the point estimate alone, and that decision has real consequences.
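
To see why a small sample size raises the stakes, here is a sketch using the penguins example and the tools from earlier in this lecture (not the trial data): bootstrapping a subsample of just 16 penguins produces a far wider 95% interval than the full sample of 344 did.

set.seed(20)   # for reproducibility of the random subsample

penguins %>%
  slice_sample(n = 16) %>%                 # a hypothetical small sample
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 500, 
           type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = .95)                      # typically far wider than the full-sample interval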

Uncertainty Quantification in the Age of Big Data

Consider Meta. This company runs experiments on Facebook users, and each experiment has an incredibly large sample size. So what?


“Because they are so large, studies based on supersized samples can produce results that are statistically significant but at the same time are substantively trivial. It’s simple math: The larger the sample size, the smaller any differences need to be to be statistically significant—that is, highly likely to be truly different from each other.”

From: Pew Research
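
To make the “simple math” concrete, here is a sketch with made-up numbers (not from any actual Meta experiment): a 0.2 percentage point difference between two groups is buried in sampling noise with a thousand users per group, but sits roughly nine standard errors from zero with ten million users per group, so it would be called statistically significant while remaining substantively trivial.

# Hypothetical group proportions differing by 0.2 percentage points
p1 <- 0.500
p2 <- 0.502

# Standard error of the difference in proportions with n users per group
se_diff <- function(n) sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)

(p2 - p1) / se_diff(1e3)   # ~0.09 standard errors: indistinguishable from noise
(p2 - p1) / se_diff(1e7)   # ~9 standard errors: "statistically significant"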