infer
Which of these is a valid bootstrap sample?
name | species | length |
---|---|---|
Gus | Chinstrap | 50.7 |
Luz | Gentoo | 48.5 |
Ida | Chinstrap | 52.8 |
Ola | Gentoo | 44.5 |
Abe | Adelie | 42.0 |

name | species | length |
---|---|---|
Ida | Chinstrap | 52.8 |
Luz | Gentoo | 48.5 |
Abe | Adelie | 42.0 |
Ola | Gentoo | 44.5 |
Ida | Chinstrap | 52.8 |

name | species | length |
---|---|---|
Ola | Gentoo | 44.5 |
Gus | Chinstrap | 50.7 |
Ida | Chinstrap | 52.8 |
Luz | Gentoo | 48.5 |
Gus | Chinstrap | 50.7 |
Gus | Chinstrap | 50.7 |

name | species | length |
---|---|---|
Gus | Chinstrap | 50.7 |
Ola | Gentoo | 48.5 |
Ola | Chinstrap | 52.8 |
Ida | Gentoo | 44.5 |
Ida | Adelie | 42.0 |

name | species | length |
---|---|---|
Gus | Chinstrap | 50.7 |
Abe | Adelie | 42.0 |
Gus | Chinstrap | 50.7 |
Gus | Chinstrap | 50.7 |
Gus | Chinstrap | 50.7 |
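A minimal base R sketch of how a bootstrap sample is drawn, assuming the first table above is the original sample of five penguins: resample whole rows with replacement until the resample has as many rows as the original. The names `original`, `idx`, and `boot_sample` are illustrative.

```r
# Original sample: the five penguins from the first table
# (assumed here to be the observed data).
original <- data.frame(
  name    = c("Gus", "Luz", "Ida", "Ola", "Abe"),
  species = c("Chinstrap", "Gentoo", "Chinstrap", "Gentoo", "Adelie"),
  length  = c(50.7, 48.5, 52.8, 44.5, 42.0)
)

set.seed(1)  # arbitrary seed so the draw is reproducible

# Draw row indices with replacement, as many as there are rows in the original.
idx <- sample(nrow(original), size = nrow(original), replace = TRUE)

# Keep each row intact: name, species, and length travel together.
boot_sample <- original[idx, ]
rownames(boot_sample) <- NULL

boot_sample
```

Under this procedure, a bootstrap sample has the same number of rows as the original, contains only rows that appear in the original, and may repeat some rows while omitting others.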
Our Goal: Assess the sampling error or variability in our estimate of some population parameter
Our Tool: The Bootstrap
Let’s consider our 344 penguins to be a simple random sample (SRS) from the broader population of Antarctic penguins. What are a point estimate and an interval estimate for the population proportion of penguins that are Adelie?
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups: replicate [1]
replicate is_adelie
<int> <fct>
1 1 FALSE
2 1 FALSE
3 1 TRUE
4 1 FALSE
5 1 TRUE
6 1 TRUE
7 1 FALSE
8 1 TRUE
9 1 TRUE
10 1 TRUE
# ℹ 334 more rows
penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 1,
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups: replicate [1]
replicate is_adelie
<int> <fct>
1 1 FALSE
2 1 TRUE
3 1 FALSE
4 1 FALSE
5 1 FALSE
6 1 TRUE
7 1 TRUE
8 1 FALSE
9 1 FALSE
10 1 FALSE
# ℹ 334 more rows
penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 1,
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups: replicate [1]
replicate is_adelie
<int> <fct>
1 1 FALSE
2 1 TRUE
3 1 TRUE
4 1 FALSE
5 1 FALSE
6 1 TRUE
7 1 TRUE
8 1 FALSE
9 1 FALSE
10 1 FALSE
# ℹ 334 more rows
Response: is_adelie (factor)
# A tibble: 9 × 2
replicate stat
<int> <dbl>
1 1 0.404
2 2 0.430
3 3 0.404
4 4 0.433
5 5 0.468
6 6 0.448
7 7 0.427
8 8 0.413
9 9 0.474
Note the change in data frame size.
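The output above has one row per replicate rather than one row per penguin, which is what happens when each bootstrap resample is collapsed to a single statistic. Here is a sketch of a pipeline that could produce a table of this shape, assuming nine replicates and the sample proportion as the statistic. The stand-in `penguins` data frame (152 Adelie out of 344, the count in the palmerpenguins data) is an assumption used only to keep the example self-contained, and `boot_props` is an illustrative name.

```r
library(dplyr)  # provides %>%
library(infer)

# Stand-in data: 344 penguins, 152 flagged as Adelie
# (assumed to mirror the data used in the lecture).
penguins <- data.frame(
  is_adelie = factor(rep(c("TRUE", "FALSE"), times = c(152, 192)))
)

set.seed(2)  # arbitrary seed for reproducibility

boot_props <- penguins %>%
  specify(response = is_adelie, success = "TRUE") %>%
  generate(reps = 9, type = "bootstrap") %>%  # 9 resamples of 344 rows each
  calculate(stat = "prop")                    # one proportion per resample

boot_props  # 9 rows with columns replicate and stat
```

Before `calculate()`, the generated data has 9 × 344 rows; afterwards it has 9.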
We can extract the middle 95% of the bootstrap distribution by identifying its .025 and .975 quantiles with get_ci().
# A tibble: 1 × 2
lower_ci upper_ci
<dbl> <dbl>
1 0.392 0.494
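To make the percentile idea concrete, here is a base R sketch of roughly what a percentile interval computes: resample, record the proportion each time, then take the .025 and .975 quantiles of those proportions. The 152-of-344 Adelie count and the 1,000 replicates are assumptions, so the endpoints will not match the output above exactly.

```r
# Stand-in sample: TRUE for Adelie, FALSE otherwise
# (152 of 344 is an assumed count).
is_adelie <- rep(c(TRUE, FALSE), times = c(152, 192))

set.seed(3)  # arbitrary seed for reproducibility

# 1,000 bootstrap proportions: resample with replacement,
# then take the mean of the TRUE/FALSE values.
boot_stats <- replicate(1000, mean(sample(is_adelie, replace = TRUE)))

# Point estimate: the proportion in the observed sample.
point_estimate <- mean(is_adelie)

# Interval estimate: the middle 95% of the bootstrap distribution.
ci <- quantile(boot_stats, probs = c(0.025, 0.975))

point_estimate
ci
```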
infer.tidymodels.org
Why do we care about sampling error or variability anyway?
Confidence intervals give us a sense of uncertainty.
We quantify this uncertainty with a range of plausible values around the point estimate.
But when do we really care about uncertainty? When isn’t a point estimate good enough?
Detailed in The Emperor of All Maladies: A Biography of Cancer.
Uncertainty quantification matters when it could change the decision you would have made based on the point estimate alone, and that decision has real consequences.
Consider Meta. This company runs experiments on Facebook users, and each experiment has an incredibly large sample size. So what?
“Because they are so large, studies based on supersized samples can produce results that are statistically significant but at the same time are substantively trivial. It’s simple math: The larger the sample size, the smaller any differences need to be to be statistically significant—that is, highly likely to be truly different from each other.”
From: Pew Research
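The arithmetic behind the quote can be sketched with a two-sample test of proportions. The rates (50.1% versus 50.0%) and the group sizes below are made up for illustration; they do not come from Meta or Pew.

```r
# Same tiny difference in proportions (0.501 vs 0.500),
# tested at two very different (hypothetical) group sizes.

# 1,000 users per group: 501 vs 500 successes.
p_small <- prop.test(x = c(501, 500), n = c(1000, 1000))$p.value

# 10 million users per group: 5,010,000 vs 5,000,000 successes.
p_large <- prop.test(x = c(5010000, 5000000), n = c(1e7, 1e7))$p.value

p_small  # large p-value: the difference is not detectable
p_large  # very small p-value: "significant", yet the gap is still 0.1 percentage points
```

The size of the difference is identical in both tests; only the sample size changes the verdict, which is why the width of the interval and the practical size of the effect both need to be reported.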