# Correlation

#### Information

Correlation measures how well the data fit a straight line, i.e. how close the data points are to the line of best fit (LOBF).

If the data fit a curve, do not use the product moment correlation coefficient r, as it only measures how closely the data follow a straight line.

For correlation to work, the data must be jointly normally distributed. This means that most of the data is centred around the mean, as shown in the diagram below: for both variables x and y, most of the data is concentrated around the central point, and there are fewer data points the further you move from the mean.

(The graphs show the number of data points against each variable.)

Combining these two graphs into a 3-D graph, we can see that the data points are centred about one point, as shown in the graph below.

To find the value of r (the product moment correlation coefficient), we use this equation:

$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$$

From previous knowledge we know how to find two of these values, but as they are so important we will show their equations once more (where n is the number of data points):

$$S_{xx} = \sum{(x-\bar{x})^2} \equiv \sum{x^2} - \frac{(\sum{x})^2}{n}$$

$$S_{yy} = \sum{(y-\bar{y})^2} \equiv \sum{y^2} - \frac{(\sum{y})^2}{n}$$

$$S_{xy} = \sum{(x-\bar{x})(y-\bar{y})} \equiv \sum{xy} - \frac{(\sum{x})(\sum{y})}{n}$$
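The formulas above can be turned into a short calculation. The following is a minimal Python sketch, not a definitive implementation: the function names `summary_sums` and `pmcc` and the three data points are made up for illustration.

```python
from math import sqrt

def summary_sums(xs, ys):
    """Return (S_xx, S_yy, S_xy) using the right-hand (shortcut) forms."""
    n = len(xs)  # number of data points
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xx, s_yy, s_xy

def pmcc(xs, ys):
    """Product moment correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    s_xx, s_yy, s_xy = summary_sums(xs, ys)
    return s_xy / sqrt(s_xx * s_yy)

# Made-up data lying exactly on the line y = 2x, so r should be +1
print(pmcc([1, 2, 3], [2, 4, 6]))  # 1.0
```

Because the points lie exactly on a straight line with positive gradient, the sketch returns a perfect positive correlation.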

We can place r on a scale to show how its value corresponds to the correlation we would see on a graph:

| -1.0 | -0.5 | 0 | +0.5 | +1.0 |
|---|---|---|---|---|
| Perfect Negative Correlation | Moderate Negative Correlation | No Correlation | Moderate Positive Correlation | Perfect Positive Correlation |

#### Product Moment Correlation Coefficient (r)

Now let us work through some examples of finding the product moment correlation coefficient (r).

#### Example 1

Given the data points:

| x | y | $$x^2$$ | $$y^2$$ | xy |
|---|---|---|---|---|
| 5 | 240 | 25 | 57600 | 1200 |
| 5 | 232 | 25 | 53824 | 1160 |
| 7 | 227 | 49 | 51529 | 1589 |
| 8 | 222 | 64 | 49284 | 1776 |
| 10 | 215 | 100 | 46225 | 2150 |

$$n = 5$$

$$\sum{x} = 35$$

$$\sum{y} = 1136$$

$$\sum{x^2} = 263$$

$$\sum{y^2} = 258462$$

$$\sum{xy} = 7875$$

$$S_{xy} = \sum{xy} - \frac{(\sum{x})(\sum{y})}{n} = 7875 - \frac{35 \times 1136}{5} = -77$$

$$S_{xx} = \sum{x^2} - \frac{(\sum{x})^2}{n} = 263 - \frac{35^2}{5} = 18$$

$$S_{yy} = \sum{y^2} - \frac{(\sum{y})^2}{n} = 258462 - \frac{1136^2}{5} = 362.8$$

$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{-77}{\sqrt{18 \times 362.8}} = -0.953$$

This means that the two variables have a **very strong negative correlation**.
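As a check on the arithmetic, the calculation for the five data points in the table can be reproduced in a few lines. This is a rough Python sketch; the variable names are chosen here for illustration only.

```python
from math import sqrt

# The five (x, y) pairs from the Example 1 table
x = [5, 5, 7, 8, 10]
y = [240, 232, 227, 222, 215]
n = len(x)  # 5 data points

# Shortcut forms of the summary statistics
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
s_yy = sum(b * b for b in y) - sum(y) ** 2 / n

r = s_xy / sqrt(s_xx * s_yy)
print(round(s_xy, 1), round(s_xx, 1), round(s_yy, 1), round(r, 3))
# -77.0 18.0 362.8 -0.953
```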

#### Example 2

| x | y |
|---|---|
| 4 | 8 |
| 10 | 6 |
| 9 | 10 |
| 7 | 5 |
| 6 | 8 |
| 5 | 4 |
| 7 | 8 |
| 9 | 9 |
| 8 | 7 |
| 6 | 4 |

$$n = 10$$

$$\sum{x} = 71$$

$$\sum{y} = 69$$

$$\sum{x^2} = 537$$

$$\sum{y^2} = 515$$

$$\sum{xy} = 502$$

$$S_{xy} = \sum{xy} - \frac{(\sum{x})(\sum{y})}{n} = 502 - \frac{71 \times 69}{10} = 12.1$$

$$S_{xx} = \sum{x^2} - \frac{(\sum{x})^2}{n} = 537 - \frac{71^2}{10} = 32.9$$

$$S_{yy} = \sum{y^2} - \frac{(\sum{y})^2}{n} = 515 - \frac{69^2}{10} = 38.9$$

$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{12.1}{\sqrt{32.9 \times 38.9}} = 0.338$$

This means that the two variables have a **weak positive correlation**.
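The same result can be reached from the definition forms, $$\sum{(x-\bar{x})(y-\bar{y})}$$ and so on, instead of the shortcut forms used above. The Python sketch below (variable names are illustrative) works from the deviations about the mean for the ten data points of this example and should agree with the values just calculated.

```python
from math import sqrt

# The ten (x, y) pairs from the Example 2 table
x = [4, 10, 9, 7, 6, 5, 7, 9, 8, 6]
y = [8, 6, 10, 5, 8, 4, 8, 9, 7, 4]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n  # means of x and y

# Definition forms: sums of products of deviations from the mean
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
s_xx = sum((a - x_bar) ** 2 for a in x)
s_yy = sum((b - y_bar) ** 2 for b in y)

r = s_xy / sqrt(s_xx * s_yy)
print(round(s_xy, 1), round(s_xx, 1), round(s_yy, 1), round(r, 3))
# 12.1 32.9 38.9 0.338
```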

#### Example 3

The values of x represent the head and body length of dormice and the values of y represent their tail length. Both are recorded in metres.

We are also given that:

$$S_{xy} = 416.3,$$

$$S_{xx} = 1280.55,$$

$$S_{yy} = 282.8$$

#### Part a

Calculate the value of the product moment correlation coefficient between x and y

$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{416.3}{\sqrt{1280.55 \times 282.8}} = 0.692$$

#### Part b

Interpret the r value in the context of this question

From the r value we can see that there is a moderate positive correlation between the 'head and body length' and the 'tail length' of dormice.

#### Part c

Write down the value of the product moment correlation coefficient if the measurements had been recorded in centimetres

The 'pmcc' would still be 0.692. A change of units multiplies $$S_{xy}$$ and $$\sqrt{S_{xx}S_{yy}}$$ by the same factor, which cancels in the fraction, so r is unchanged.
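The effect of changing units can be checked numerically. The raw dormouse measurements are not given here, so the Python sketch below uses made-up lengths purely for illustration; the function name `pmcc` is also an assumption of this sketch.

```python
from math import sqrt, isclose

def pmcc(xs, ys):
    """r = S_xy / sqrt(S_xx * S_yy), using the shortcut forms."""
    n = len(xs)
    s_xy = sum(a * b for a, b in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(a * a for a in xs) - sum(xs) ** 2 / n
    s_yy = sum(b * b for b in ys) - sum(ys) ** 2 / n
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical lengths in metres (made up, NOT the dormouse data)
x_m = [0.08, 0.09, 0.10, 0.11, 0.12]
y_m = [0.060, 0.066, 0.065, 0.072, 0.075]

# The same measurements converted to centimetres
x_cm = [100 * v for v in x_m]
y_cm = [100 * v for v in y_m]

r_m = pmcc(x_m, y_m)
r_cm = pmcc(x_cm, y_cm)
print(isclose(r_m, r_cm))  # True: r is the same in either unit
```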

#### Part d

Why is it not generally advisable to calculate the value of the 'pmcc' without first viewing a scatter diagram of the data? Illustrate your answer with a sketch.

Three reasons to draw a sketch first are the possible existence of:

- Outliers
- Non-linear relationships
- Multiple linear relationships

e.g. The data follow a curve, so the 'pmcc' is not a suitable measure.

e.g. The data show more than one relationship, so a single 'pmcc' would be misleading.
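The curved case can be illustrated with a small numerical example. In the Python sketch below, the data are made up so that y is determined exactly by x (y = x²), yet the 'pmcc' comes out as zero, which a scatter diagram would immediately show to be misleading.

```python
from math import sqrt

# Made-up data on the curve y = x^2, symmetric about x = 0
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]  # [4, 1, 0, 1, 4]
n = len(x)

s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
s_yy = sum(b * b for b in y) - sum(y) ** 2 / n

r = s_xy / sqrt(s_xx * s_yy)
print(r)  # 0.0, despite the exact (non-linear) relationship
```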