Class 16
Goal: Create or evaluate a mathematical model for some random variable(s), using a sample of its (their) values.
Model for a random variable: A distribution with some parameters.
Last time we set up an experiment:
We took all values of the height variable from the very large YRBSS data set and used that as the population.
We chose a sample size \(n\) and collected a large number (1000) of samples of size \(n\) from the population.
We calculated the mean of each sample.
We looked at the distribution of these means.
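The experiment above can be sketched in plain JavaScript (no libraries). The population model and all numbers below are made up for illustration; they are not the YRBSS heights.

```javascript
// Sketch of the experiment: draw many samples from a population, compute each
// sample's mean, and look at how those means are distributed.

// One draw from N(mu, sigma) via the Box-Muller transform.
function randNormal(mu, sigma) {
  const u1 = 1 - Math.random(), u2 = Math.random();   // u1 in (0, 1]
  return mu + sigma * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function stdev(xs) {          // sample standard deviation (n - 1 divisor)
  const m = mean(xs);
  return Math.sqrt(xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1));
}

// Hypothetical population of heights: N(170, 10).
const mu = 170, sigma = 10, n = 25, reps = 1000;

// 1000 samples of size 25, reduced to 1000 sample means.
const means = Array.from({ length: reps }, () =>
  mean(Array.from({ length: n }, () => randNormal(mu, sigma)))
);

console.log(mean(means).toFixed(2), stdev(means).toFixed(2));
```

The means should cluster around \(\mu = 170\), with spread close to \(\sigma/\sqrt{n} = 10/\sqrt{25} = 2\).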
Two ways to use a sample:
Estimation: use the sample to find an estimate for one parameter of the model. This requires no previous knowledge of the parameter, although we may have to make some assumptions about the model.
Model checking: we have a model that we think may work, and we want to test whether it is consistent with the data.
Suppose I use the sample mean to estimate the population mean.
We know that most of the time the estimate will be pretty good.
What does “most of the time” mean?
What does “pretty good” mean?
A confidence interval provides an estimate for a population parameter along with a margin of error that gives a plausible range of values for the population parameter.
A confidence interval for a population mean \(\mu\) has the general form \[\overline{x} \pm m \text{, or } (\overline{x} - m, \overline{x} + m), \] where \(m\) is the margin of error.
To calculate \(m\), we use what is known about the sampling distribution of \(\overline{X}\).
\[\overline{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)\]
Unfortunately, we usually do not know \(\sigma\)!
\(\mu\): unknown population mean
\(\sigma\): unknown population standard deviation
\(n\): the number of observations in a sample drawn from the population
\(\overline{x}\): sample mean from a sample taken from the population
\(s\): calculated sample standard deviation from the same sample used to calculate \(\overline{x}\)
Instead of using \(\sigma\), use \(s\) (the sample standard deviation)!
Unfortunately, that will introduce an extra error, especially for small samples.
We will not be able to use the normal distribution; instead, we use Student's \(t\) distribution.
The \(t\) distribution is symmetric, bell-shaped, and centered at 0.
It is very close to the standard normal distribution, but has one additional parameter called degrees of freedom (df).
The tails of a \(t\) distribution are thicker than those in a normal distribution. This adjusts for the variability introduced by using \(s\) as an estimate of \(\sigma\).
When \(df\) is large (\(df \geq 30\)), the \(t\) and \(z\) distributions are virtually identical.
In 1908, W. S. Gosset (a.k.a. “Student”) discovered that if the population is normally distributed then the quantity
\[t = \frac{\overline{X} - \mu}{\frac{s}{\sqrt{n}}}\]
has a \(t\) distribution with \(n-1\) degrees of freedom.
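As a quick illustration, here is the \(t\) statistic computed for a small made-up sample against a hypothesized mean \(\mu = 5\) (both the data and \(\mu\) are invented for this example):

```javascript
// t statistic: (xbar - mu) / (s / sqrt(n)), compared to a t distribution, df = n - 1.
const sample = [4.1, 5.2, 6.0, 4.8, 5.5, 5.9, 4.4, 5.1, 5.7, 4.9];
const n = sample.length, mu = 5;                      // hypothesized mean (made up)

const xbar = sample.reduce((a, b) => a + b, 0) / n;   // sample mean: 5.16
const s = Math.sqrt(
  sample.reduce((a, x) => a + (x - xbar) ** 2, 0) / (n - 1)
);                                                    // sample SD

const t = (xbar - mu) / (s / Math.sqrt(n));
console.log(t.toFixed(3));                            // → 0.804
```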
Central Limit Theorem:
The standard deviation of the random variable \(\overline{X}\) is
\[\text{SD}_{\overline{X}} = \dfrac{\sigma_x}{\sqrt{n}}\]
Thus, the variability of a sample mean is inversely proportional to the square root of the sample size.
Typically, \(\sigma_x\) is unknown and estimated by \(s_x\).
The quantity \(\frac{s_x}{\sqrt{n}}\) is what is usually meant by the standard error of \(\overline{X}\).
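A minimal sketch of the standard error computation (the data are made up):

```javascript
// Standard error of the sample mean: s / sqrt(n), computed from the data alone.
function standardError(xs) {
  const n = xs.length;
  const m = xs.reduce((a, b) => a + b, 0) / n;                       // sample mean
  const s = Math.sqrt(
    xs.reduce((a, x) => a + (x - m) ** 2, 0) / (n - 1)              // sample SD
  );
  return s / Math.sqrt(n);
}

// Toy data: mean 3, s^2 = 10/4 = 2.5, so SE = sqrt(2.5/5) = sqrt(0.5).
console.log(standardError([1, 2, 3, 4, 5]));   // → 0.7071...
```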
A \(100\cdot(1 - \alpha)\)% confidence interval for \(\mu\) is given by \[\overline{x} \pm t^\star \times \dfrac{s}{\sqrt{n}} = \left(\overline{x} - t^\star \times \dfrac{s}{\sqrt{n}}, \overline{x} + t^\star \times \dfrac{s}{\sqrt{n}}\right)\] where \(t^\star\), the critical \(t\)-value, is the point on a \(t\) distribution with \(n-1\) degrees of freedom that has area \(1 - \alpha/2\) to the left (and area \(\alpha/2\) to the right).
The quantity \(100\cdot(1 - \alpha)\)% is called the confidence level and denoted by \(L\).
The confidence level is also called the confidence coefficient.
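A worked example with made-up data: for \(n = 10\), the critical value with area \(0.975\) to the left on \(df = 9\) is \(t^\star = 2.262\) (a standard \(t\)-table value).

```javascript
// 95% confidence interval for mu from a made-up sample of n = 10 observations.
const sample = [4.1, 5.2, 6.0, 4.8, 5.5, 5.9, 4.4, 5.1, 5.7, 4.9];
const n = sample.length;

const xbar = sample.reduce((a, b) => a + b, 0) / n;   // 5.16
const s = Math.sqrt(
  sample.reduce((a, x) => a + (x - xbar) ** 2, 0) / (n - 1)
);

const tStar = 2.262;                  // t*: area 0.975 to the left, df = 9
const m = tStar * s / Math.sqrt(n);   // margin of error

console.log(`${(xbar - m).toFixed(2)} to ${(xbar + m).toFixed(2)}`);  // → 4.71 to 5.61
```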
The correct interpretation:
The method illustrated for computing a 95% confidence interval will produce an interval that (on average) contains the true population mean 95 times out of 100.
The other 5 out of 100 will miss the population mean.
Unfortunately, we do not know whether a particular interval contains the population mean.
Simulation: 100 samples from the standard normal distribution, each producing one confidence interval (assumes the jStat library is loaded as a global jStat object):

```javascript
var sample_size = 30;   // n, the size of each sample
var conf_level = 95;    // confidence level L, in percent

// Build one interval: draw a sample, then compute xbar, s, and the margin of error.
var gen_interval = function(i, size, L) {
  var sample = Array.from({length: size}, function() {
    return jStat.normal.sample(0, 1);
  });
  // Critical t-value: area 1 - alpha/2 to the left, df = size - 1.
  var crit = jStat.studentt.inv(1 - (1 - L / 100) / 2, size - 1);
  var xbar = jStat.mean(sample);
  var sd = jStat.stdev(sample, true);        // true => sample SD (n - 1 divisor)
  var merror = crit * sd / Math.sqrt(size);
  // The true mean is 0, so the interval misses it exactly when |xbar| >= merror.
  return {int: i + 1, low: xbar - merror, high: xbar + merror,
          fail: Math.abs(xbar) >= merror};
};

var intervals = Array.from({length: 100}, (_, i) =>
  gen_interval(i, sample_size, conf_level));
```
Accuracy is given by the confidence level. This tells you how likely you are to “hit the target”.
Precision is determined by the margin of error. This tells you how precisely the target is determined when you hit it.
For a fixed sample size, if you want to increase the accuracy, you have to decrease the precision, and vice versa.
You can increase one without sacrificing the other by increasing the sample size.
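A quick sketch of how the margin of error shrinks with \(n\). For simplicity it uses the large-sample critical value \(z^\star = 1.96\) in place of \(t^\star\) (a reasonable approximation once \(n \geq 30\)); the value of \(s\) is made up.

```javascript
// Margin of error m = z* * s / sqrt(n), with s held fixed as n grows.
const s = 10, zStar = 1.96;   // s is invented; z* = 1.96 is the 95% normal value

for (const n of [30, 120, 480]) {
  const m = zStar * s / Math.sqrt(n);
  console.log(`n = ${n}: margin of error = ${m.toFixed(2)}`);
}
```

Quadrupling the sample size halves the margin of error, since \(m \propto 1/\sqrt{n}\).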
The data used to calculate the confidence interval are from a simple random sample taken from the target population.
One of the following has to be true:
The population is normally distributed with (unknown) mean \(\mu\) and (unknown) standard deviation \(\sigma\).
The sample size is at least 30, and the population is not visibly skewed.
If the population is skewed, a larger sample is needed (several hundred observations).
A \(100\cdot(1 - \alpha)\)% CI for \(\mu\) is given by \[\overline{x} \pm t^\star \times \dfrac{s}{\sqrt{n}} = \left(\overline{x} - t^\star \times \dfrac{s}{\sqrt{n}}, \overline{x} + t^\star \times \dfrac{s}{\sqrt{n}}\right) \] where \(t^\star\), the critical \(t\)-value, is the point on a \(t\) distribution with \(n-1\) degrees of freedom that has area \((1 - \alpha/2)\) to the left (and area \(\alpha/2\) to the right).
A left or lower \(100\cdot(1 - \alpha)\)% CI for \(\mu\) is given by \[\left(\overline{x} + t^\star \times \dfrac{s}{\sqrt{n}}, \infty\right) \] where \(t^\star\), the critical \(t\)-value, is the point on a \(t\) distribution with \(n-1\) degrees of freedom that has area \(\alpha\) to the left (and area \(1 - \alpha\) to the right). Note that this \(t^\star\) is negative, so the endpoint lies below \(\overline{x}\).
A right or upper \(100\cdot(1 - \alpha)\)% CI for \(\mu\) is given by \[\left(-\infty, \overline{x} + t^\star \times \dfrac{s}{\sqrt{n}}\right),\] where \(t^\star\), the critical \(t\)-value, is the point on a \(t\) distribution with \(n-1\) degrees of freedom that has area \(\alpha\) to the right (and area \(1 - \alpha\) to the left).
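A worked one-sided example with made-up data: for a 95% right (upper) CI with \(n = 10\), the critical value with area \(0.05\) to the right on \(df = 9\) is \(t^\star = 1.833\) (a standard \(t\)-table value).

```javascript
// 95% upper confidence bound for mu from a made-up sample of n = 10 observations.
const sample = [4.1, 5.2, 6.0, 4.8, 5.5, 5.9, 4.4, 5.1, 5.7, 4.9];
const n = sample.length;

const xbar = sample.reduce((a, b) => a + b, 0) / n;
const s = Math.sqrt(
  sample.reduce((a, x) => a + (x - xbar) ** 2, 0) / (n - 1)
);

const tStar = 1.833;                               // area 0.05 to the right, df = 9
const upper = xbar + tStar * s / Math.sqrt(n);     // interval is (-infinity, upper)

console.log(`mu < ${upper.toFixed(2)} with 95% confidence`);  // → mu < 5.52 ...
```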