# Correlation and Regression

A Level: AQA, Edexcel, OCR

## Correlation and Regression

Correlation and regression both deal with data measured in pairs, called bivariate data. Correlation describes how closely the two variables are linked. Regression finds the line of best fit through the data.


## Correlation and Scatter Graphs

When we have bivariate data, one variable will be the independent (or explanatory) variable, and the other will be the dependent (or response) variable. The independent variable is the one that you can control, and it goes on the $x$-axis. The dependent variable is the one that is being affected, and it goes on the $y$-axis.

There won’t always be a clear independent and dependent variable, and in that case it is not as important which way round they go on a scatter graph.

Correlation comes in three flavours: positive, negative and no correlation. In positive correlation, as one variable goes up so does the other. In negative correlation, as one variable goes up the other goes down. In no correlation, there is no clear link between the variables.

Correlation can also be strong or weak. In strong correlation, the data points lie very close to a straight line. In weak correlation, the data points are more widely scattered around a line.
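One common way to put a number on the direction and strength of correlation is the product moment correlation coefficient $r$, which runs from $-1$ (perfect negative) to $1$ (perfect positive). The sketch below is only an illustration: the function name `pmcc` and the data sets are made up for this example, and the text above does not itself rely on $r$.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Made-up illustrative data (not taken from the text)
xs = [1, 2, 3, 4, 5]
strong_positive = [2, 4, 6, 8, 10]   # points lie exactly on a rising line
strong_negative = [10, 8, 6, 4, 2]   # points lie exactly on a falling line
weaker_positive = [3, 1, 4, 2, 5]    # rising trend with a lot of scatter

print(pmcc(xs, strong_positive))  # 1.0
print(pmcc(xs, strong_negative))  # -1.0
print(pmcc(xs, weaker_positive))  # 0.5
```

The sign of $r$ matches the direction of the correlation, and values closer to $0$ correspond to weaker correlation.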

Outliers are usually easy to spot on a scatter graph. They can be ignored in subsequent calculations, but if you do plan to ignore an outlier, make sure you clearly mark it as such on the graph.

You should also be aware of clusters, where the data forms several separate groups on the graph. We can talk about overall correlation and correlation within clusters. For example, a graph can show negative correlation overall but positive correlation within each cluster.


## Regression

The regression line of $\mathbf{y}$ on $\mathbf{x}$ is the line of best fit. It is always written in the same form:

$y=a+bx$

$a$ is the $y$ intercept

$b$ is the gradient
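As a rough sketch of where $a$ and $b$ come from, the regression line of $y$ on $x$ is usually calculated by the least-squares method, which gives $b=\dfrac{S_{xy}}{S_{xx}}$ and $a=\bar{y}-b\bar{x}$. The function name `regression_line` and the data below are made up for illustration.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least-squares regression line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx            # gradient
    a = mean_y - b * mean_x    # y-intercept
    return a, b

# Made-up data lying exactly on y = 1 + 2x
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

a, b = regression_line(xs, ys)
print(f"y = {a} + {b}x")  # y = 1.0 + 2.0x
```

With real data the points will not lie exactly on the line, but the same two formulas still give the line of best fit.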

The regression line can be used to predict values of the dependent variable. This comes in two flavours:

• Interpolation – the value of $x$ used in the prediction falls inside the range of the $x$ values in the data. The predicted value should be reliable.
• Extrapolation – the value of $x$ used in the prediction falls outside the range of the $x$ values in the data. The predicted value might be unreliable.
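The distinction above can be sketched in a few lines of code. This is a minimal illustration: the function name `predict`, the line $y=2+3x$ and the $x$ values are all hypothetical.

```python
def predict(a, b, x_new, xs):
    """Predict y from y = a + b*x and say whether it is
    interpolation or extrapolation relative to the observed xs."""
    if min(xs) <= x_new <= max(xs):
        kind = "interpolation"   # inside the data range: should be reliable
    else:
        kind = "extrapolation"   # outside the data range: might be unreliable
    return a + b * x_new, kind

# Hypothetical regression line y = 2 + 3x fitted to these x values
observed_xs = [1, 2, 5, 8]

print(predict(2, 3, 4, observed_xs))   # (14, 'interpolation')
print(predict(2, 3, 10, observed_xs))  # (32, 'extrapolation')
```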

A Level: Edexcel

## Regression with Coded Data

Regression can also be done on coded data. We substitute the coding into the regression line, then rearrange to get it back into straight-line form.

Example: $y=4+10x$, with coding $s=2y$, $t=x-3$ becomes:

$\begin{aligned}\dfrac{s}{2}&=4+10(t+3)\\[1.2em]&=4+10t+30\\[1.2em]&=34+10t\end{aligned}$

$s=68+20t$
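The same substitution can be done once in general. Assuming the coding is linear, say $s=cy+d$ and $t=ex+f$ (the letters $c$, $d$, $e$, $f$ are introduced here only for illustration), the line $y=a+bx$ becomes $s=d+c\left(a-\dfrac{bf}{e}\right)+\dfrac{cb}{e}t$. A minimal sketch, using exact fractions and a hypothetical function name:

```python
from fractions import Fraction

def code_regression_line(a, b, c, d, e, f):
    """Given y = a + b*x with codings s = c*y + d and t = e*x + f,
    return (intercept, gradient) of the coded line s = intercept + gradient*t."""
    a, b, c, d, e, f = (Fraction(v) for v in (a, b, c, d, e, f))
    gradient = c * b / e
    intercept = d + c * (a - b * f / e)
    return intercept, gradient

# The example above: y = 4 + 10x with s = 2y and t = x - 3
intercept, gradient = code_regression_line(a=4, b=10, c=2, d=0, e=1, f=-3)
print(f"s = {intercept} + {gradient}t")  # s = 68 + 20t
```

This agrees with the result obtained by substituting and rearranging by hand.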

Also, we can form regression lines from non-linear data in some cases.

Example: $y=ax^{n}$ becomes $\log(y)=\log(a)+n\log(x)$, which is a linear regression of $\log(y)$ on $\log(x)$ with gradient $n$ and intercept $\log(a)$.
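To see this transformation in action, the sketch below takes logs of both variables, fits a straight line to the logged values by least squares, and then recovers $a$ and $n$. The function name `fit_power_law` and the data are made up for illustration, and base-10 logs are an arbitrary choice (any base works, as long as it is used consistently).

```python
from math import log10

def fit_power_law(xs, ys):
    """Estimate a and n in y = a * x**n by linear regression
    of log(y) on log(x). Requires all x and y to be positive."""
    log_x = [log10(x) for x in xs]
    log_y = [log10(y) for y in ys]
    count = len(xs)
    mean_x = sum(log_x) / count
    mean_y = sum(log_y) / count
    s_xy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    s_xx = sum((u - mean_x) ** 2 for u in log_x)
    n = s_xy / s_xx               # gradient of the log-log line
    log_a = mean_y - n * mean_x   # intercept of the log-log line
    return 10 ** log_a, n

# Made-up data lying exactly on y = 3x^2
xs = [1, 2, 4, 8]
ys = [3, 12, 48, 192]

a, n = fit_power_law(xs, ys)
print(round(a, 6), round(n, 6))  # approximately a = 3 and n = 2
```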


## Example 1: Correlation

Plot a scatter graph then describe the correlation of this data:

[4 marks]

Plot the points on the graph:

This graph shows positive correlation.


## Example 2: Regression

Data is collected about the temperature of the water in a kettle in °C over time in minutes. The regression line is:

$y=20+40x$

What are the gradient and $y$-intercept, and what do they mean in the context of the experiment? How long does the kettle take to boil?

[5 marks]

Gradient is $40$°C per minute

y-intercept is $20$°C

In context, this means that the water starts at a temperature of $20$°C and rises by $40$°C every minute.

The kettle finishes boiling at $100$°C. Substitute this value into the expression:

$100=20+40x$

$40x=80$

$x=2$ minutes
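The rearrangement above can be checked with a couple of lines of code. This is only a sketch, and the function name `time_to_reach` is made up for this example.

```python
def time_to_reach(target, a, b):
    """Solve target = a + b*x for x."""
    return (target - a) / b

# Regression line y = 20 + 40x, boiling point 100°C
minutes = time_to_reach(100, a=20, b=40)
print(minutes)  # 2.0
```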


## Correlation and Regression Example Questions

Plot the points on the graph:

This graph shows positive correlation.

Plot the points on the graph.

This graph shows negative correlation.

i) $a$ is the $y$-intercept

$b$ is the gradient

ii) Negative correlation means $b<0$

iii) $y=a$ because $a$ is the y-intercept

iv) $b$ is the gradient, so it is the change in migration distance per unit increase in a bird's weight.

$a$ is the migration distance predicted for a bird of weight $0$. Since a bird cannot have zero weight, this interpretation is not sensible.

$y=8x+3$

$s=\dfrac{y}{10}+4$

Rearrange to find $y$

$s-4=\dfrac{y}{10}$

$y=10(s-4)$

$t=4(x+3)$

Rearrange to find $x$

$\dfrac{t}{4}=x+3$

$x=\dfrac{t}{4}-3$

Substitute in the expressions for $x$ and $y$

$10(s-4)=8\left(\dfrac{t}{4}-3\right)+3$

$10s-40=2t-24+3$

$10s-40=2t-21$

$10s=2t+19$

$s=\dfrac{1}{5}t+\dfrac{19}{10}$

This graph shows correlation in clusters.

There is negative correlation within the clusters.

There is no correlation overall.
