Math 132B

Class 25

Situation

  • Two numerical variables

  • What kind of association is there between them?

  • Lots of possibilities

  • First resource: scatterplot!

Example 1

Positive linear association

Example 2

Positive linear association, much weaker

Example 3

Strong negative linear association

Example 4

Non-linear association

Example 5

Non-linear association

Example 6

Unclear (very weak linear) association

Example 7

No association, or very weak association

Strength of linear association

\(\overline{x} = 4.0508382\), \(\overline{y} = 7.9387814\)

Quadrants

General case

Not quite there yet

Mean of \((x- \overline{x})(y - \overline{y})\):

Set a: 1.478

Set b: 4.903
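The problem just noted can be reproduced on synthetic data. Below is a Python sketch (the class uses R; the data here are made up, not sets a and b): doubling the spread of both variables quadruples the mean of the deviation products, even though the pattern of association is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + rng.normal(scale=0.3, size=200)  # clear positive linear association

def mean_product(x, y):
    """Mean of (x - xbar)(y - ybar)."""
    return np.mean((x - x.mean()) * (y - y.mean()))

m1 = mean_product(x, y)
m2 = mean_product(2 * x, 2 * y)  # same pattern, twice the spread
# m2 is 4 * m1: the measure reflects spread, not just strength of association
```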

How to fix this problem?

  • Trying to use the mean of \((x - \overline{x})(y - \overline{y})\) to measure strength of correlation.

  • Two data sets with similar amount of “scatter” give us very different results: the data set with larger spread has larger deviations.

  • How do we measure spread?

  • Standard deviations:

    • Set a: \(s_x = 1.4764001\) and \(s_y = 1.0384449\)
    • Set b: \(s_x = 2.6967296\) and \(s_y = 1.8826342\)
  • An idea: divide each of the deviations by the corresponding standard deviation!

\[\text{Mean of } \frac{x - \overline{x}}{s_x} \frac{y - \overline{y}}{s_y}\]

Almost there

Mean of \(\frac{x- \overline{x}}{s_x}\frac{y - \overline{y}}{s_y}\):

Set a: 0.964

Set b: 0.966

Last (optional) adjustment

  • Mean of \(\frac{x- \overline{x}}{s_x}\frac{y - \overline{y}}{s_y}\):

    \[\frac{1}{n}\sum\frac{x- \overline{x}}{s_x}\frac{y - \overline{y}}{s_y}\]

  • A small change improves the estimator's properties (divide by \(n-1\), as with the sample variance):

    \[\frac{1}{\color{red}{n-1}}\sum\frac{x- \overline{x}}{s_x}\frac{y - \overline{y}}{s_y}\]

  • Correlation coefficient \(r\)
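The definition can be checked numerically. In this Python sketch with made-up data (the class uses R), computing \(r\) as the \(\frac{1}{n-1}\)-sum of standardized products, with sample standard deviations, agrees with the library value of Pearson's \(r\).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)

n = len(x)
sx = x.std(ddof=1)  # sample standard deviations (n - 1 in the denominator)
sy = y.std(ddof=1)

# r = 1/(n-1) * sum of standardized products
r = np.sum((x - x.mean()) / sx * (y - y.mean()) / sy) / (n - 1)

r_lib = np.corrcoef(x, y)[0, 1]  # Pearson's r from the library
```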

Strong positive correlation

\(r = 0.995\)

Strong negative correlation

\(r = -0.993\)

Weak positive correlation

\(r = 0.611\)

Weak negative correlation

\(r = -0.672\)

Very weak positive correlation

\(r = 0.561\)

Very weak negative correlation

\(r = -0.391\)

No association

\(r = 0.003\)

Nonlinear association

Correlation coefficient not useful!

Perfect positive correlation

\(r = 1\)

Perfect negative correlation

\(r = -1\)

Strength of linear association

  • Not applicable if the scatterplot shows a nonlinear association!

  • Use it only if the scatterplot shows a linear association or no clear association

  • Start by moving data so that the point \((0,0)\) is in the center (subtracting means): \(x - \overline{x}\) and \(y - \overline{y}\).

  • Multiply together and calculate the “mean”

  • To compensate for different spreads, divide by both standard deviations

Pearson’s correlation coefficient

\[r = \frac{\text{mean}((x - \overline{x})\cdot (y - \overline{y}))}{\sigma_x\cdot\sigma_y}\]

where:

  • \(\overline{x}\) is the sample mean of the \(x\) variable
  • \(\overline{y}\) is the sample mean of the \(y\) variable
  • \(\sigma_x\) is the standard deviation of the \(x\) variable (computed with \(n\) in the denominator, to match the mean in the numerator)
  • \(\sigma_y\) is the standard deviation of the \(y\) variable (computed with \(n\) in the denominator)

In R: cor(x, y), or with the formula interface, cor.test(~ x + y, data = ...)

Properties:

  • Between \(-1\) and \(1\)

  • \(r > 0\) indicates positive correlation

  • \(r < 0\) indicates negative correlation

  • Larger \(\left\lvert r\right\rvert\) means a stronger linear correlation
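These properties can be illustrated in code (a Python sketch with made-up data): \(r\) stays in \([-1, 1]\), its sign matches the direction of the association, and it is unchanged by a change of units (positive linear rescaling of either variable).

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = -3 * x + rng.normal(size=100)  # negative association

r = np.corrcoef(x, y)[0, 1]

# Changing units (e.g. inches to cm) is a positive linear rescaling
# and does not change r:
r_rescaled = np.corrcoef(10 * x + 5, 0.5 * y - 2)[0, 1]
```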

Hypothesis testing

  • Population correlation coefficient is denoted \(\rho\) (rho).

  • \(H_0\): There is no correlation (\(\rho = 0\)).

  • \(H_A\): There is a correlation (\(\rho \neq 0\)); the one-sided versions are a positive (\(\rho > 0\)) or negative (\(\rho < 0\)) correlation.

  • We have a sample with correlation coefficient \(r\); we want to know whether it is strong enough evidence to reject \(H_0\).

Permutation test!
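A minimal sketch of the permutation test in Python (made-up data; in class this would be done in R): under \(H_0\) any pairing of the \(x\) and \(y\) values is equally likely, so we shuffle \(y\), recompute \(r\), and see how often a shuffled \(\left\lvert r\right\rvert\) is at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=40)
y = 0.8 * x + rng.normal(size=40)  # truly correlated data

r_obs = np.corrcoef(x, y)[0, 1]

# Break the pairing: shuffle y, recompute r, repeat many times
perm_rs = np.array([
    np.corrcoef(x, rng.permutation(y))[0, 1]
    for _ in range(2000)
])

# Two-sided p-value: fraction of shuffles at least as extreme as observed
p_value = np.mean(np.abs(perm_rs) >= np.abs(r_obs))
```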

T-test for correlation

We can also calculate a t-statistic:

\[t = r\sqrt{\frac{n-2}{1-r^2}}\]

Under \(H_0\), this statistic has a \(t\)-distribution with \(n-2\) degrees of freedom.
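One way to sanity-check this formula (a Python sketch with made-up data): \(r\sqrt{(n-2)/(1-r^2)}\) is exactly the usual \(t\)-statistic for the slope of the least squares line, \(b_1 / SE(b_1)\).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
x = rng.normal(size=n)
y = x + rng.normal(size=n)

r = np.corrcoef(x, y)[0, 1]
t = r * np.sqrt((n - 2) / (1 - r**2))  # df = n - 2

# The same number, computed as the t-statistic for the least squares slope
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
se_b1 = np.sqrt(np.sum(resid**2) / (n - 2)) / np.sqrt(np.sum((x - x.mean()) ** 2))
t_slope = b1 / se_b1
```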

OK, so there is a linear association

What now?

Find a mathematical model for the association:

\[y = ax + b\]

Data is scattered, so the relation will actually be

\[ y = ax + b + \text{ "noise"} \]

Terminology

  • predicted value of \(y\): \[\widehat{y} = ax + b\]

  • actual or observed value of \(y\): \[y = \widehat{y} + e\]

  • \(e = y - \widehat{y}\) is the residual

Sum of Squares

“Ideal” model

  • Residuals are as small as possible.
    • The sum of squares of the residuals should be as small as possible.
    • Line of best fit aka least squares line.
  • Residuals are independent of \(x\).
    • No association between \(x\) and the residuals.
    • The variation of the residuals should not depend on \(x\) (homoscedasticity)

Parameters vs. statistics again

  • The population model: \[\widehat{y} = \beta_0 + \beta_1 x\]

  • The estimate based on the sample: \[\widehat{y} = b_0 + b_1 x\]

  • \(\widehat{y}\) is the predicted value

  • \(b_0\) is the point estimate of \(\beta_0\): the intercept.

  • \(b_1\) is the point estimate of \(\beta_1\): the rate of change.

Version with the “noise”

  • The population model: \[y = \beta_0 + \beta_1 x + \varepsilon\]

  • The estimate based on the sample: \[y = b_0 + b_1 x + e\]

  • \(e\) is the residual (noise)

  • \(\varepsilon\) is the random variable representing the residuals

Formulas for the Point Estimates

What we need:

  • The correlation coefficient \(r\)
  • The mean of the \(x\) variable: \(\overline{x}\)
  • The mean of the \(y\) variable: \(\overline{y}\)
  • The standard deviation of the \(x\) variable: \(s_x\)
  • The standard deviation of the \(y\) variable: \(s_y\)

\[b_1 = \frac{s_y}{s_x}\cdot r\]

\[b_0 = \overline{y} - b_1\cdot\overline{x}\]

\[\widehat{y} - \overline{y} = b_1\cdot \left(x - \overline{x}\right)\]
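A quick numerical check of these formulas in Python (made-up data; the class uses R's lm()): the slope and intercept computed from \(r\), the means, and the standard deviations agree with a direct least squares fit.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=4, scale=1.5, size=80)
y = 2 * x + 1 + rng.normal(size=80)

r = np.corrcoef(x, y)[0, 1]
sx, sy = x.std(ddof=1), y.std(ddof=1)

b1 = (sy / sx) * r               # slope
b0 = y.mean() - b1 * x.mean()    # intercept

# Direct least squares fit for comparison
b1_fit, b0_fit = np.polyfit(x, y, deg=1)
```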

Calculating the point estimates

lm(RFFT ~ Age, data = prevend.samp)

Call:
lm(formula = RFFT ~ Age, data = prevend.samp)

Coefficients:
(Intercept)          Age  
    137.550       -1.261  

The least squares line can be written as \[ \widehat{\text{RFFT}} = 137.55 - 1.26 (\text{Age}) \]
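To use the fitted line, plug an age into the equation. A small Python sketch (predict_rfft is a hypothetical helper; the coefficients are taken from the lm() output above):

```python
def predict_rfft(age):
    # Coefficients from the fitted least squares line above
    return 137.550 - 1.261 * age

# Predicted RFFT score for a 60-year-old:
pred_60 = predict_rfft(60)  # 137.550 - 1.261 * 60 = 61.89
```

In R, the same prediction would come from calling predict() on the fitted lm object.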