Math 132B

Class 26

Situation

  • Two numerical variables

  • What kind of association is there between them?

  • Lots of possibilities

  • First resource: scatterplot!

Example 1

Positive linear association

Example 2

Positive linear association, much weaker

Example 3

Strong negative linear association

Example 4

Non-linear association

Example 5

Non-linear association

Example 6

Unclear (very weak linear) association

Example 7

No association, or very weak association

Strength of linear association

[Example scatterplot; sample means \(\overline{x} = 3.80939\) and \(\overline{y} = 7.7912211\)]

  • Not applicable if the scatterplot shows a nonlinear association!

  • Use only if scatterplot shows a linear association or no clear association

  • Start by centering the data so that the point \((0,0)\) is at its center: subtract the means to get \(x - \overline{x}\) and \(y - \overline{y}\).

  • Multiply the centered values together and take the mean.

  • To compensate for the different spreads of the two variables, divide by both standard deviations.

Pearson’s correlation coefficient

\[r = \frac{\text{mean}((x - \overline{x})\cdot (y - \overline{y}))}{\sigma_x\cdot\sigma_y}\]

where:

  • \(\overline{x}\) is the sample mean of the \(x\) variable
  • \(\overline{y}\) is the sample mean of the \(y\) variable
  • \(\sigma_x\) and \(\sigma_y\) are the standard deviations of the \(x\) and \(y\) variables (computed with divisor \(n\), to match the mean in the numerator; using divisor \(n-1\) consistently in both numerator and denominator gives the same \(r\))

In R: cor(x, y) (or, with a package such as mosaic loaded, the formula interface cor(y ~ x, data = ...))
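The formula above can also be computed step by step. Here is a minimal Python sketch (the data set is made up for illustration):

```python
from statistics import mean, pstdev

# Small made-up data set (for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

xbar, ybar = mean(x), mean(y)

# Numerator: mean of the products of the centered values
num = mean([(a - xbar) * (b - ybar) for a, b in zip(x, y)])

# Divide by both standard deviations (divisor n, to match the mean above)
r = num / (pstdev(x) * pstdev(y))
```

For this data \(r\) comes out close to \(1\), matching the strong positive linear trend.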

Properties:

  • Between \(-1\) and \(1\)

  • \(r > 0\) indicates positive correlation

  • \(r < 0\) indicates negative correlation

  • Larger \(\left\lvert r\right\rvert\) means stronger linear correlation

Hypothesis testing

  • Population correlation coefficient is denoted \(\rho\) (rho).

  • \(H_0\): There is no correlation (\(\rho = 0\)).

  • \(H_A\): There is a (positive, negative) correlation (\(\rho \neq (>, <) 0\)).

  • We have a sample with correlation coefficient \(r\); we want to know whether it is strong enough evidence to reject \(H_0\).

Permutation test!
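A permutation test for correlation can be sketched as follows (Python; the data, rep count, and function names are made up for illustration): shuffle one variable to destroy any association, recompute \(r\) each time, and count how often the shuffled \(\lvert r\rvert\) is at least as extreme as the observed one.

```python
import random
from statistics import mean, pstdev

def pearson_r(x, y):
    xbar, ybar = mean(x), mean(y)
    num = mean([(a - xbar) * (b - ybar) for a, b in zip(x, y)])
    return num / (pstdev(x) * pstdev(y))

def permutation_test(x, y, reps=2000, seed=0):
    random.seed(seed)
    observed = pearson_r(x, y)
    y_shuffled = list(y)
    extreme = 0
    for _ in range(reps):
        random.shuffle(y_shuffled)          # break any x-y association
        if abs(pearson_r(x, y_shuffled)) >= abs(observed):
            extreme += 1
    return observed, extreme / reps         # r and two-sided p-value

# Made-up data with a clear positive association
x = list(range(20))
y = [2 * a + (-1) ** a for a in x]          # deterministic "noise"
r_obs, p_value = permutation_test(x, y)
```

With such a strong association, almost no shuffled sample matches the observed \(\lvert r\rvert\), so the p-value is tiny and we reject \(H_0\).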

T-test for correlation

We can also calculate a t-statistic:

\[t = r\sqrt{\frac{n-2}{1-r^2}}\]

This has t-distribution with \(n-2\) degrees of freedom.
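The t-statistic itself is easy to compute directly; a small Python sketch (the values \(r = 0.45\), \(n = 30\) are made up for illustration):

```python
import math

def correlation_t(r, n):
    # t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom
    return r * math.sqrt((n - 2) / (1 - r ** 2))

t = correlation_t(0.45, 30)
# Compare t against a t-distribution with 28 degrees of freedom
```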

OK, so there is a linear association

What now?

Find a mathematical model for the association:

\[y = ax + b\]

The data are scattered, so the relation will actually be

\[ y = ax + b + \text{ "noise"} \]

Example data

The official name for the “noise” terms is residuals.

Terminology

  • predicted value of \(y\): \[\widehat{y} = ax + b\]

  • actual or observed value of \(y\): \[y = \widehat{y} + e\]

  • \(e = y - \widehat{y}\) is the residual
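In code, the residual is just observed minus predicted. A one-line illustration (the line and the data point are made up):

```python
# Made-up example: a fitted line y-hat = 2x + 1 and one observed point
def y_hat(x):
    return 2 * x + 1        # predicted value

x_obs, y_obs = 3.0, 8.2     # observed data point
e = y_obs - y_hat(x_obs)    # residual: observed minus predicted
```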

Another look at the plot

“Ideal” model

  • Residuals are as small as possible.
    • The sum of squares of the residuals should be as small as possible.
    • Line of best fit aka least squares line.
  • Residuals are independent of \(x\).
    • No association between \(x\) and the residuals.
    • The variation of the residuals should not depend on \(x\) (homoscedasticity)

Parameters vs. statistics again

  • The population model: \[\widehat{y} = \beta_0 + \beta_1 x\]

  • The estimate based on the sample: \[\widehat{y} = b_0 + b_1 x\]

  • \(\widehat{y}\) is the predicted value

  • \(b_0\) is the point estimate of \(\beta_0\): the intercept.

  • \(b_1\) is the point estimate of \(\beta_1\): the rate of change.

Version with the “noise”

  • The population model: \[y = \beta_0 + \beta_1 x + \varepsilon\]

  • The estimate based on the sample: \[y = b_0 + b_1 x + e\]

  • \(e\) is the residual (noise)

  • \(\varepsilon\) is the random variable representing the residuals

Calculating the point estimates

\[b_1 = r\frac{s_y}{s_x}\]

and

\[b_0 = \overline{y} - b_1\overline{x}\]

Usually done by computer
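As a minimal sketch of these two formulas in Python (the data and function name are made up; a real analysis would use R's lm):

```python
from statistics import mean, pstdev

def fit_line(x, y):
    xbar, ybar = mean(x), mean(y)
    # Pearson's r, as defined earlier
    r = mean([(a - xbar) * (b - ybar) for a, b in zip(x, y)]) / (pstdev(x) * pstdev(y))
    b1 = r * pstdev(y) / pstdev(x)   # slope: b1 = r * s_y / s_x
    b0 = ybar - b1 * xbar            # intercept: b0 = ybar - b1 * xbar
    return b0, b1

# Points lying exactly on y = 2x + 1 recover b0 = 1, b1 = 2
b0, b1 = fit_line([1, 2, 3], [3, 5, 7])
```

Note that the ratio \(s_y/s_x\) is the same whichever divisor (\(n\) or \(n-1\)) is used, as long as it is used for both.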

Calculating the point estimates

lm(RFFT ~ Age, data = prevend.samp)

Call:
lm(formula = RFFT ~ Age, data = prevend.samp)

Coefficients:
(Intercept)          Age  
    137.550       -1.261  

The least squares line can be written as \[ \widehat{\text{RFFT}} = 137.55 - 1.26 (\text{Age}) \]
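The fitted line can then be used for prediction; for example, plugging an age into the coefficients above (Python sketch; the function name is made up):

```python
def predict_rfft(age):
    # Least squares line from the R output above
    return 137.55 - 1.26 * age

score = predict_rfft(60)   # predicted RFFT score for a 60-year-old
```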