Class 20
We collected 1000 samples from the height variable in the yrbss data set. For each sample, we tested the null hypothesis \(H_0: \mu = \mu_0\) against the two-sided alternative \(H_A: \mu \neq \mu_0\).
We did the tests with sample size 9 and 5% significance level:
n <- 9
alpha <- 0.05
First we tested with a true null hypothesis:
mu0 <- mean(~height, data = yrbss, na.rm = TRUE)
In other words, \(H_0: \mu = 1.691241\).
We got results similar to this:
tally(~reject, data = test_results)
reject
TRUE FALSE
52 948
Since \(H_0\) was true, each rejection is a case of Type I error.
With \(\alpha = 0.05\), we should see a number close to 5% of 1000, that is, about 50 rejections. The 52 we got is in line with that.
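The notes do not show how `test_results` was built. The following is a minimal sketch of one way to get such a table, not the code used in class. It assumes a normal population instead of the actual yrbss heights, and the standard deviation `sigma` is an assumed value that the notes do not give.

```r
# Sketch: run a one-sample t test on 1000 simulated samples and count rejections.
# ASSUMPTION: heights come from a normal population; the class sampled from
# the yrbss height variable instead.
set.seed(20)                 # arbitrary seed, only for reproducibility

n     <- 9
alpha <- 0.05
mu    <- 1.691241            # population mean reported in the notes
sigma <- 0.105               # ASSUMED population sd (not given in the notes)
mu0   <- mu                  # the null hypothesis is true

# One p-value per simulated sample of size n
p_values <- replicate(
  1000,
  t.test(rnorm(n, mean = mu, sd = sigma), mu = mu0)$p.value
)

# Reject H0 whenever the p-value falls below alpha
test_results <- data.frame(reject = p_values < alpha)

table(test_results$reject)   # counts of FALSE and TRUE
mean(test_results$reject)    # observed Type I error rate
```

Changing `mu0` to a value different from `mu` turns the same sketch into a Type II error experiment: every `FALSE` is then a failure to reject a false null hypothesis.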
Then we tested with a false null hypothesis:
mu0 <- 1.8
We got results similar to this:
tally(~reject, data = test_results)
reject
TRUE FALSE
783 217
Since \(H_0\) was false, each failure to reject is a case of Type II error. We got a Type II error for 217 samples out of 1000, which is 21.7%.
When we increased the sample size to 25:
n <- 25
we got results similar to this:
tally(~reject, data = test_results)
reject
TRUE FALSE
996 4
This time we got a Type II error for 4 samples out of 1000, which is 0.4%.
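Finally we tested with an “almost true” null hypothesis, \(\mu_0 = 1.7\), while the true mean is 1.691241. Note that \(n\) is still 25.
mu0 <- 1.7
tally(~reject, data = test_results)
reject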
TRUE FALSE
74 926
With \(n = 100\) we get:
tally(~reject, data = test_results)
reject
TRUE FALSE
132 868
With \(n = 900\) we get:
tally(~reject, data = test_results)
reject
TRUE FALSE
732 268
If \(H_0\) is really true, and the samples are simple random samples, the probability of Type I error is \(\alpha\).
When \(H_0\) is false, the probability of Type II error decreases with increasing sample size.
It also seems to depend on “how false” the null hypothesis is.
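Both observations can be checked without simulation. The sketch below uses `power.t.test()` from base R's stats package to compute the Type II error probability as one minus the power. It assumes a normal population, and the standard deviation `sigma` is an assumed value that the notes do not give, so the numbers will only roughly match the simulated tallies. The helper `type2_error()` is a hypothetical name introduced for illustration.

```r
# Sketch: Type II error probability of a two-sided one-sample t test.
# ASSUMPTIONS: normal population; sigma = 0.105 is not stated in the notes.
mu    <- 1.691241            # true mean reported in the notes
sigma <- 0.105               # ASSUMED population sd
alpha <- 0.05

# Hypothetical helper: beta = 1 - power for a given n and mu0
type2_error <- function(n, mu0) {
  pw <- power.t.test(
    n = n, delta = abs(mu0 - mu), sd = sigma, sig.level = alpha,
    type = "one.sample", alternative = "two.sided",
    strict = TRUE            # count rejections in both tails
  )$power
  1 - pw
}

# All combinations of the sample sizes and null values used in the notes
grid <- expand.grid(n = c(9, 25, 100, 900), mu0 = c(1.7, 1.8))
grid$beta <- mapply(type2_error, grid$n, grid$mu0)
grid
```

Reading down the `beta` column for a fixed `mu0` shows the effect of sample size. Comparing the two `mu0` values at the same `n` shows the effect of “how false” the null hypothesis is.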
If \(H_0\) is “almost true”, so that it can be considered true for practical purposes, the probability of rejecting it (in practical terms, a Type I error) increases with sample size.
With very large samples, the test becomes ridiculously sensitive, leading to rejection of perfectly reasonable models.
If you need the precision that comes with large samples, you should not be doing hypothesis tests. It is better to use an interval estimate instead.
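To illustrate the difference, here is a small sketch with simulated data. It again assumes a normal population with an assumed standard deviation, and the sample size of 10000 is chosen only for illustration. The test rejects the “almost true” value 1.7, while the confidence interval shows how small the discrepancy actually is.

```r
# Sketch: with a very large sample, compare the test result with an interval estimate.
# ASSUMPTIONS: normal population, sd = 0.105, n = 10000 (all illustrative).
set.seed(20)                 # arbitrary seed, only for reproducibility
x <- rnorm(10000, mean = 1.691241, sd = 0.105)

res <- t.test(x, mu = 1.7, conf.level = 0.95)

res$p.value                  # p-value for H0: mu = 1.7
res$conf.int                 # 95% confidence interval for mu
diff(res$conf.int)           # width of the interval
```

The interval lets the reader judge whether a difference of under a centimetre matters in practice, which the bare reject/do-not-reject decision does not.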
For an “almost true” \(H_0\), the probability of rejection is always (possibly a lot) larger than \(\alpha\).

Researchers studying the number of electric fish species living in various parts of the Amazon basin were interested in whether the presence of tributaries affected the local number of electric fish species in the main rivers (Fernandes et al. 2004). They counted the number of electric fish species above and below the entrance point of a major tributary at 12 different river locations. Here’s what they found:
| Tributary | Upstream number of species | Downstream number of species |
|---|---|---|
| Içá | 14 | 19 |
| Jutaí | 11 | 18 |
| Japurá | 8 | 8 |
| Coari | 5 | 7 |
| Purus | 10 | 16 |
| Manacapuru | 5 | 6 |
| Negro | 23 | 24 |
| Madeira | 29 | 30 |
| Trombetas | 19 | 16 |
| Tapajós | 16 | 20 |
| Xingu | 25 | 21 |
| Tocantins | 10 | 12 |
They wanted to know if the presence of a tributary affects the number of species.
In other words, they wanted to know if there is a significant difference between the number of species found downstream of tributaries and the number of species found upstream of tributaries.
The data was given as a table with three columns:
Tributary, Upstream number of species, Downstream number of species
This is not a tidy data set! (Why?)
One of the first things we will learn is how to transform data sets like this.
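As a preview, here is a minimal base R sketch of one such transformation. It takes one possible reading of the tidiness problem, namely that “upstream” and “downstream” are values of a location variable spread across two columns. The names `fish_wide`, `fish_long`, `location` and `species` are hypothetical, chosen for illustration.

```r
# Wide table as given above: one row per tributary, two count columns.
fish_wide <- data.frame(
  tributary  = c("Içá", "Jutaí", "Japurá", "Coari", "Purus", "Manacapuru",
                 "Negro", "Madeira", "Trombetas", "Tapajós", "Xingu", "Tocantins"),
  upstream   = c(14, 11, 8, 5, 10, 5, 23, 29, 19, 16, 25, 10),
  downstream = c(19, 18, 8, 7, 16, 6, 24, 30, 16, 20, 21, 12)
)

# Long table: one row per (tributary, location) pair, one count column.
fish_long <- data.frame(
  tributary = rep(fish_wide$tributary, times = 2),
  location  = rep(c("upstream", "downstream"), each = nrow(fish_wide)),
  species   = c(fish_wide$upstream, fish_wide$downstream)
)

head(fish_long)
```

In the tidyverse, `tidyr::pivot_longer()` is a common tool for the same reshaping. The base R version is used here only to keep the sketch free of extra packages.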