Math 132B

Class 15

What is statistical inference?

Goal: Create a mathematical model for some random variable(s),

or “evaluate” an existing model for some random variable(s),

using a sample of its (their) values.

Model for a random variable: A distribution with some parameters.

Population

The random variable is tied to some “population”:
- All patients in some hospital who have a certain disease
- All patients in the world who currently have a certain disease
- All potential patients in the world who currently have a certain disease, had it in the past, or will have it in the future
- All plants of a certain species in a given forest
- All plants of a certain species, in the past, now, and in the future
- All adults currently living in the US
- All current, past and future adults in the US
We then talk about the population distribution, the population parameter, and a model of the population.

Unreasonably optimistic goal

Figure out exactly what model to use.

That means a complete description of the distribution, including the exact values of all the parameters.

We only have a sample of the values; that is not going to be enough.

Realistic goals

Figure out something about one of the parameters, or
Figure out something about the way the variable is distributed.

Perhaps we already have some idea about the type of the distribution. Can we learn something about its parameter(s)?

What can we learn from a sample?

Suppose I have already decided that a normal distribution would be an appropriate model for my population.
I want to know what I should use as mean and standard deviation.
I collect a sample and calculate its mean and standard deviation.
What does it tell me?
It may be useful to know something about the relationship between the sample mean and the population mean…
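
For reference, the two sample quantities mentioned above are usually computed as follows. The notation \(x_1, \dots, x_n\) for the sample values, \(\bar{x}\) for the sample mean, and \(s\) for the sample standard deviation is an assumption here; the slides do not fix a notation.

\[\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}\]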

Let’s Experiment

The YRBSS data set contains 13,583 observations from surveys conducted from 1991 to 2013.
It has 13 variables; we will look at height.
The data set is available as part of the oibiostat package for R.

Central Limit Theorem (part 1)

When sampling from a population that is normally distributed with mean \(\mu\) and standard deviation \(\sigma\):

The sample means are also normally distributed…
… with the same mean \(\mu\)…
… and standard deviation \[\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\] where \(n\) is the sample size.
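
One way to check this claim is by simulation. The sketch below is a minimal illustration, assuming a hypothetical normal population with \(\mu = 170\) and \(\sigma = 10\) and samples of size \(n = 25\). These values are chosen for illustration only and are not taken from the YRBSS data. It uses only Python's standard library.

```python
import random
import statistics

# Hypothetical population parameters (illustration only, not from the slides).
MU = 170.0            # population mean
SIGMA = 10.0          # population standard deviation
N = 25                # size of each sample
NUM_SAMPLES = 20_000  # how many samples we draw

rng = random.Random(132)  # fixed seed so the run is reproducible


def sample_mean(n):
    """Draw n values from Normal(MU, SIGMA) and return their mean."""
    return statistics.fmean(rng.gauss(MU, SIGMA) for _ in range(n))


sample_means = [sample_mean(N) for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
predicted_sd = SIGMA / N ** 0.5  # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean {MU})")
print(f"SD of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {predicted_sd:.3f})")
```

The two printed pairs should agree closely: the sample means center on \(\mu\), and their spread matches \(\sigma/\sqrt{n} = 2\).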

Central Limit Theorem (part 2)

When sampling from any random variable \(X\) with mean (expected value) \(\mu\) and finite standard deviation \(\sigma\), for large enough samples:

The sample means are approximately normally distributed…
… with the same mean \(\mu\)…
… and standard deviation \[\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\] where \(n\) is the sample size.
The approximation gets better as the sample size increases.
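
To see the approximation improve with sample size, here is a minimal simulation sketch. It assumes a hypothetical, strongly right-skewed population (exponential with rate 1, so \(\mu = 1\) and \(\sigma = 1\)) and uses the skewness of the sample means as a rough measure of how far they are from a normal shape. Both the population and the `skewness` helper are illustrative choices, not part of the slides.

```python
import random
import statistics

rng = random.Random(15)  # fixed seed so the run is reproducible

# Hypothetical skewed population: exponential with rate 1,
# so its mean and standard deviation are both 1.
MU = 1.0
SIGMA = 1.0
NUM_SAMPLES = 20_000


def skewness(values):
    """Average cubed z-score; close to 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)


def simulate(n):
    """Return (mean, SD, skewness) of NUM_SAMPLES sample means of size n."""
    means = [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(NUM_SAMPLES)
    ]
    return statistics.fmean(means), statistics.stdev(means), skewness(means)


results = {n: simulate(n) for n in (2, 10, 50)}

for n, (m, sd, skew) in results.items():
    print(f"n={n:3d}  mean={m:.3f}  SD={sd:.3f}  "
          f"sigma/sqrt(n)={SIGMA / n ** 0.5:.3f}  skewness={skew:.2f}")
```

For every \(n\), the mean stays near \(\mu\) and the standard deviation tracks \(\sigma/\sqrt{n}\). The skewness shrinks toward 0 as \(n\) grows, which is the sense in which the sample means become "more normal".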

What does this mean?

With a large sample, the sample mean is likely to be close to the population mean, because the standard deviation of the sample means, \(\sigma/\sqrt{n}\), shrinks as \(n\) grows.
Even if we know nothing about the population distribution, we do know (approximately) the distribution of the sample means, so we can calculate probabilities!
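
As an example of such a calculation, the sketch below estimates the probability that a sample mean lands within 2 units of the population mean. The values \(\mu = 170\), \(\sigma = 10\), and \(n = 100\) are hypothetical, chosen only to make the arithmetic easy to follow. The normal distribution for the sample mean is the approximation given by the Central Limit Theorem, not an exact result.

```python
from statistics import NormalDist

# Hypothetical values for illustration (not from the slides).
MU = 170.0    # population mean
SIGMA = 10.0  # population standard deviation
N = 100       # sample size

# Approximate distribution of the sample mean, per the Central Limit Theorem.
standard_error = SIGMA / N ** 0.5  # sigma / sqrt(n)
sample_mean_dist = NormalDist(mu=MU, sigma=standard_error)

# P(MU - 2 < sample mean < MU + 2)
prob_within_2 = sample_mean_dist.cdf(MU + 2) - sample_mean_dist.cdf(MU - 2)

print(f"standard deviation of the sample mean: {standard_error:.2f}")
print(f"P(sample mean within 2 of the population mean) = {prob_within_2:.4f}")
```

Here \(\sigma/\sqrt{n} = 1\), so the question is how often a normal variable falls within 2 standard deviations of its mean, which is about 95% of the time.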