Math 132B

Class 26

Situation

  • Two numerical variables

  • What kind of association is there between them?

  • Lots of possibilities

  • First resource: scatterplot!

Example 1

Positive linear association

Example 2

Positive linear association, much weaker

Example 3

Strong negative linear association

Example 4

Non-linear association

Example 5

Non-linear association

Example 6

Unclear (very weak linear) association

Example 7

No association, or very weak association

Strength of linear association

[Example scatterplot; sample means \(\overline{x} = 3.80939\) and \(\overline{y} = 7.7912211\)]

  • Not applicable if the scatterplot shows a nonlinear association!

  • Use only if scatterplot shows a linear association or no clear association

  • Start by centering the data so that the point \((0,0)\) is at its center: subtract the means to get \(x - \overline{x}\) and \(y - \overline{y}\).

  • Multiply the centered values together and take the mean.

  • To compensate for the different spreads of the two variables, divide by both standard deviations.

Pearson’s correlation coefficient

\[r = \frac{\text{mean}((x - \overline{x})\cdot (y - \overline{y}))}{\sigma_x\cdot\sigma_y}\]

where:

  • \(\overline{x}\) is the sample mean of the \(x\) variable
  • \(\overline{y}\) is the sample mean of the \(y\) variable
  • \(\sigma_x\) and \(\sigma_y\) are the standard deviations of the \(x\) and \(y\) variables (computed with divisor \(n\), to match the mean in the numerator; using divisor \(n-1\) consistently in both numerator and denominator gives the same \(r\))

In R: cor(x, y) (or, with a package such as mosaic loaded, the formula interface cor(y ~ x, data = ...))
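The formula above can also be computed step by step. Here is a minimal Python sketch (the data set is made up for illustration):

```python
from statistics import mean, pstdev

# Small made-up data set (for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

xbar, ybar = mean(x), mean(y)

# Numerator: mean of the products of the centered values
num = mean([(a - xbar) * (b - ybar) for a, b in zip(x, y)])

# Divide by both standard deviations (divisor n, to match the mean above)
r = num / (pstdev(x) * pstdev(y))
```

For this data \(r\) comes out close to \(1\), matching the strong positive linear trend.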

Properties:

  • Between \(-1\) and \(1\)

  • \(r > 0\) indicates positive correlation

  • \(r < 0\) indicates negative correlation

  • Larger \(\left\lvert r\right\rvert\) means stronger linear correlation

Hypothesis testing

  • Population correlation coefficient is denoted \(\rho\) (rho).

  • \(H_0\): There is no correlation (\(\rho = 0\)).

  • \(H_A\): There is a (positive, negative) correlation (\(\rho \neq (>, <) 0\)).

  • We have a sample with correlation coefficient \(r\); we want to know whether it is strong enough evidence to reject \(H_0\).

Permutation test!
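A permutation test for correlation can be sketched as follows (Python; the data, rep count, and function names are made up for illustration): shuffle one variable to destroy any association, recompute \(r\) each time, and count how often the shuffled \(\lvert r\rvert\) is at least as extreme as the observed one.

```python
import random
from statistics import mean, pstdev

def pearson_r(x, y):
    xbar, ybar = mean(x), mean(y)
    num = mean([(a - xbar) * (b - ybar) for a, b in zip(x, y)])
    return num / (pstdev(x) * pstdev(y))

def permutation_test(x, y, reps=2000, seed=0):
    random.seed(seed)
    observed = pearson_r(x, y)
    y_shuffled = list(y)
    extreme = 0
    for _ in range(reps):
        random.shuffle(y_shuffled)          # break any x-y association
        if abs(pearson_r(x, y_shuffled)) >= abs(observed):
            extreme += 1
    return observed, extreme / reps         # r and two-sided p-value

# Made-up data with a clear positive association
x = list(range(20))
y = [2 * a + (-1) ** a for a in x]          # deterministic "noise"
r_obs, p_value = permutation_test(x, y)
```

With such a strong association, almost no shuffled sample matches the observed \(\lvert r\rvert\), so the p-value is tiny and we reject \(H_0\).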

T-test for correlation

We can also calculate a t-statistic:

\[t = r\sqrt{\frac{n-2}{1-r^2}}\]

This has t-distribution with \(n-2\) degrees of freedom.
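The t-statistic itself is easy to compute directly; a small Python sketch (the values \(r = 0.45\), \(n = 30\) are made up for illustration):

```python
import math

def correlation_t(r, n):
    # t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom
    return r * math.sqrt((n - 2) / (1 - r ** 2))

t = correlation_t(0.45, 30)
# Compare t against a t-distribution with 28 degrees of freedom
```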

OK, so there is a linear association

What now?

Find a mathematical model for the association:

\[y = ax + b\]

The data are scattered, so the relation will actually be

\[ y = ax + b + \text{ "noise"} \]

Example data

The official name for the “noise” terms is residuals.

Terminology

  • predicted value of \(y\): \[\widehat{y} = ax + b\]

  • actual or observed value of \(y\): \[y = \widehat{y} + e\]

  • \(e = y - \widehat{y}\) is the residual
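In code, the residual is just observed minus predicted. A one-line illustration (the line and the data point are made up):

```python
# Made-up example: a fitted line y-hat = 2x + 1 and one observed point
def y_hat(x):
    return 2 * x + 1        # predicted value

x_obs, y_obs = 3.0, 8.2     # observed data point
e = y_obs - y_hat(x_obs)    # residual: observed minus predicted
```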

Another look at the plot

“Ideal” model

  • Residuals are as small as possible.
    • The sum of squares of the residuals should be as small as possible.
    • Line of best fit aka least squares line.
  • Residuals are independent of \(x\).
    • No association between \(x\) and the residuals.
    • The variation of the residuals should not depend on \(x\) (homoscedasticity)

Parameters vs. statistics again

  • The population model: \[\widehat{y} = \beta_0 + \beta_1 x\]

  • The estimate based on the sample: \[\widehat{y} = b_0 + b_1 x\]

  • \(\widehat{y}\) is the predicted value

  • \(b_0\) is the point estimate of \(\beta_0\): the intercept.

  • \(b_1\) is the point estimate of \(\beta_1\): the rate of change.

Version with the “noise”

  • The population model: \[y = \beta_0 + \beta_1 x + \varepsilon\]

  • The estimate based on the sample: \[y = b_0 + b_1 x + e\]

  • \(e\) is the residual (noise)

  • \(\varepsilon\) is the random variable representing the residuals

Calculating the point estimates

\[b_1 = r\frac{s_y}{s_x}\]

and

\[b_0 = \overline{y} - b_1\overline{x}\]

Usually done by computer
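As a minimal sketch of these two formulas in Python (the data and function name are made up; a real analysis would use R's lm):

```python
from statistics import mean, pstdev

def fit_line(x, y):
    xbar, ybar = mean(x), mean(y)
    # Pearson's r, as defined earlier
    r = mean([(a - xbar) * (b - ybar) for a, b in zip(x, y)]) / (pstdev(x) * pstdev(y))
    b1 = r * pstdev(y) / pstdev(x)   # slope: b1 = r * s_y / s_x
    b0 = ybar - b1 * xbar            # intercept: b0 = ybar - b1 * xbar
    return b0, b1

# Points lying exactly on y = 2x + 1 recover b0 = 1, b1 = 2
b0, b1 = fit_line([1, 2, 3], [3, 5, 7])
```

Note that the ratio \(s_y/s_x\) is the same whichever divisor (\(n\) or \(n-1\)) is used, as long as it is used for both.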

Calculating the point estimates

lm(RFFT ~ Age, data = prevend.samp)

Call:
lm(formula = RFFT ~ Age, data = prevend.samp)

Coefficients:
(Intercept)          Age  
    137.550       -1.261  

The least squares line can be written as \[ \widehat{\text{RFFT}} = 137.55 - 1.26 (\text{Age}) \]
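The fitted line can then be used for prediction; for example, plugging an age into the coefficients above (Python sketch; the function name is made up):

```python
def predict_rfft(age):
    # Least squares line from the R output above
    return 137.55 - 1.26 * age

score = predict_rfft(60)   # predicted RFFT score for a 60-year-old
```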