Correlation and Regression

A Level: AQA, Edexcel, OCR

Correlation and regression both apply to data measured in pairs, called bivariate data. Correlation describes how closely the two variables are linked and in which direction; it does not by itself show that one variable causes the other to change. Regression finds the line of best fit through the data.

Correlation and Scatter Graphs

When we have bivariate data, one variable will be the independent (or explanatory) variable, and the other will be the dependent (or response) variable. The independent variable is the one that you can control, and it goes on the x-axis. The dependent variable is the one that is being affected, and it goes on the y-axis.

There won’t always be a clear independent and dependent variable, and in that case it is not as important which way round they go on a scatter graph.

Correlation comes in three flavours: positive, negative and no correlation. In positive correlation, as one variable goes up so does the other. In negative correlation, as one variable goes up the other goes down. In no correlation, there is no clear link between the variables.

Correlation can also be strong or weak. In strong correlation, the data points lie very close to a straight line. In weak correlation, the points follow a trend but are scattered more widely around a line.
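One common way to put a number on both the direction and the strength of a correlation is the product moment correlation coefficient r, which lies between -1 and 1. The sketch below is only an illustration: the function name `pmcc` and the data values are made up for the example and do not come from these notes.

```python
import math

def pmcc(xs, ys):
    """Product moment correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up illustrative data (an assumption, not taken from the notes).
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 5]      # tends to go up as x goes up
falling = [10, 8, 6, 4, 2]    # lies exactly on a downward line

r_rising = pmcc(xs, rising)    # positive, but less than 1
r_falling = pmcc(xs, falling)  # perfect negative correlation
```

A value of r near 1 or -1 suggests strong correlation, a value near 0 suggests weak or no linear correlation, and the sign gives the direction.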

Outliers look obvious on a scatter graph. They can be ignored in subsequent calculation – but if you do plan to ignore an outlier make sure you clearly mark it on the graph as such.

You should also be aware of clusters. This is where the data forms several separate groups on the graph. We can talk about overall correlation and correlation in clusters. For example, in the graph on the right, there is negative correlation overall but positive correlation in the clusters.

Regression

The regression line of y on x is the line of best fit. It is always written in the same form:

y=a+bx

a is the y intercept

b is the gradient

The regression line can be used to predict values of the dependent variable. This comes in two flavours:

  • Interpolation – If the value of x being used in the prediction falls inside the range of the values of x in the data. The predicted value should be reliable.
  • Extrapolation – If the value of x being used in the prediction falls outside of the range of the values of x in the data. The predicted value might be unreliable.
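The two kinds of prediction above can be sketched in code. This is a minimal example that assumes the line of best fit is found by the usual least-squares method. The helper names `fit_line` and `predict` and the data are hypothetical, chosen only to illustrate the idea.

```python
def fit_line(xs, ys):
    """Least-squares regression line of y on x, returned as (a, b) for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx            # gradient
    a = mean_y - b * mean_x    # y-intercept
    return a, b

def predict(xs, a, b, x_new):
    """Predict y at x_new and say whether this is interpolation or extrapolation."""
    kind = "interpolation" if min(xs) <= x_new <= max(xs) else "extrapolation"
    return a + b * x_new, kind

# Made-up data lying exactly on y = 1 + 2x (an assumption for illustration).
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
a, b = fit_line(xs, ys)

inside = predict(xs, a, b, 2.5)   # x lies within the data range 0 to 4
outside = predict(xs, a, b, 10)   # x lies beyond the data range
```

Both calls return a predicted value, but only the first is labelled as interpolation; the second should be treated with caution.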
A Level: Edexcel

Regression with Coded Data

Regression can also be done on coded data. We rearrange each coding to make x and y the subject, substitute these into the regression line, then rearrange to get it back into straight-line form.

Example: y=4+10x, with coding s=2y, t=x-3 becomes:

\begin{aligned}\dfrac{s}{2}&=4+10(t+3)\\[1.2em]&=4+10t+30\\[1.2em]&=34+10t\end{aligned}

s=68+20t
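The same substitution can be checked in code for any linear coding. The sketch below assumes codings of the form s = y_scale*y + y_shift and t = x_scale*x + x_shift. The function name and parameter names are hypothetical, introduced only for this illustration.

```python
from fractions import Fraction

def code_regression_line(a, b, y_scale, y_shift, x_scale, x_shift):
    """Turn y = a + b*x into s = A + B*t, returned as (A, B).

    Assumes s = y_scale*y + y_shift and t = x_scale*x + x_shift.
    """
    a, b = Fraction(a), Fraction(b)
    y_scale, y_shift = Fraction(y_scale), Fraction(y_shift)
    x_scale, x_shift = Fraction(x_scale), Fraction(x_shift)
    # Substitute x = (t - x_shift)/x_scale and y = (s - y_shift)/y_scale
    # into y = a + b*x, then make s the subject.
    new_b = y_scale * b / x_scale
    new_a = y_shift + y_scale * a - new_b * x_shift
    return new_a, new_b

# The worked example: y = 4 + 10x with s = 2y and t = x - 3.
intercept, gradient = code_regression_line(4, 10, 2, 0, 1, -3)
```

Here `intercept` is 68 and `gradient` is 20, agreeing with s=68+20t. Exact fractions are used so that no rounding error creeps in.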

 

Also, we can form regression lines from non-linear data in some cases.

Example: taking logs of both sides, y=ax^{n} becomes \log(y)=\log(a)+n\log(x), which is a straight-line regression of \log(y) on \log(x) with gradient n and intercept \log(a).
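A short sketch of this idea: take logs of both variables, fit a straight line, then read off n from the gradient and a from the intercept. The data is made up from y = 3x^2 purely for illustration, base-10 logs are an arbitrary choice, and `fit_line` is a hypothetical least-squares helper.

```python
import math

def fit_line(xs, ys):
    """Least-squares line through (xs, ys), returned as (intercept, gradient)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    gradient = s_xy / s_xx
    return mean_y - gradient * mean_x, gradient

# Made-up data generated from y = 3 * x**2 (an assumption for illustration).
xs = [1, 2, 3, 4, 5]
ys = [3 * x ** 2 for x in xs]

# Regression is carried out on log(y) against log(x).
log_x = [math.log10(x) for x in xs]
log_y = [math.log10(y) for y in ys]
log_a, n = fit_line(log_x, log_y)

a = 10 ** log_a   # undo the log to recover a
```

Because the data follows the power law exactly, n comes out very close to 2 and a very close to 3. With real data the points would only lie roughly on a line.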


Example 1: Correlation

Plot a scatter graph then describe the correlation of this data:

[4 marks]

Plot the points on the graph:

This graph shows positive correlation.

Example 2: Regression

Data is collected about the temperature of the water in a kettle in °C over time in minutes. The regression line is:

y=20+40x

What are the gradient and y-intercept, and what do these mean in the context of the experiment? How long does the kettle take to boil?

[5 marks]

Gradient is 40°C per minute

y-intercept is 20°C

In context, this means that the water starts at a temperature of 20°C and rises by 40°C every minute.

Water boils at 100°C. Substitute y=100 into the regression line:

100=20+40x

40x=80

x=2 minutes
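The same rearrangement works for any target value on a regression line y=a+bx. A minimal sketch, where the function name `time_to_reach` is hypothetical:

```python
def time_to_reach(target_temp, a, b):
    """Solve target_temp = a + b*x for x, where y = a + b*x is the regression line."""
    return (target_temp - a) / b

# Kettle example: y = 20 + 40x, boiling at 100 degrees C.
minutes = time_to_reach(100, a=20, b=40)
```

This gives 2 minutes, matching the working above.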

Correlation and Regression Example Questions

Question 1: The table below shows the results of fuel efficiency tests where x is the amount of fuel placed into the vehicle in litres and y is the distance the vehicle travelled before it ran out of fuel.

i) Plot a scatter graph of this data.

ii) What kind of correlation is shown?

[4 marks]

Plot the points on the graph:

 

 

This graph shows positive correlation.

Question 2: The data below shows how long bacteria lived after being placed into more and more alkaline solutions. x is the pH of the solution and y is the time in seconds the bacterium lived.

i) Plot this data on a scatter graph.

ii) Describe the correlation.

[4 marks]

Plot the points on the graph.

 

 

This graph shows negative correlation.

Question 3: A regression line of y=a+bx describes a correlation.

i) What are a and b?

ii) If there is negative correlation, is b>0?

iii) What value does y take when x=0?

iv) Suppose this regression line describes the distance birds travel to migrate (y) against their size, measured by weight (x). What does b mean? Is the physical interpretation of a sensible?

[6 marks]

i) a is y-intercept

b is gradient

 

ii) No. Negative correlation means the line slopes downwards, so b<0

 

iii) y=a because a is the y-intercept

 

iv) b is the gradient, so it is the extra distance a bird travels to migrate for each additional unit of weight.

a is the travel distance of a bird of 0 weight. Since birds cannot have no weight, this is not sensible.

Question 4: The regression line of y on x is y=8x+3. The data is coded with s=\dfrac{y}{10}+4 and t=4(x+3). Find the regression line of s on t.

[4 marks]

A Level: Edexcel

y=8x+3

 

s=\dfrac{y}{10}+4

Rearrange to find y

s-4=\dfrac{y}{10}

y=10(s-4)

 

t=4(x+3)

Rearrange to find x

\dfrac{t}{4}=x+3

x=\dfrac{t}{4}-3

 

Substitute in the expressions for x and y

10(s-4)=8\left(\dfrac{t}{4}-3\right)+3

10s-40=2t-24+3

10s-40=2t-21

10s=2t+19

s=\dfrac{1}{5}t+\dfrac{19}{10}
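One way to check an answer like this is to pick a few values of x, find y from the original line, code both values, and confirm that the coded point lies on the new line. The sketch below does this with exact fractions. The function names and the choice of test values are assumptions made for the illustration.

```python
from fractions import Fraction

def y_from_x(x):
    """Original regression line y = 8x + 3."""
    return 8 * x + 3

def code_point(x, y):
    """Apply the coding s = y/10 + 4 and t = 4(x + 3)."""
    s = Fraction(y, 10) + 4
    t = 4 * (x + 3)
    return s, t

def s_from_t(t):
    """Claimed regression line of s on t: s = t/5 + 19/10."""
    return Fraction(1, 5) * t + Fraction(19, 10)

# Arbitrary integer test values of x (an assumption, any values would do).
checks = []
for x in [-2, 0, 1, 5]:
    s, t = code_point(x, y_from_x(x))
    checks.append(s == s_from_t(t))
```

Every entry of `checks` is True, so each coded point satisfies s=\dfrac{1}{5}t+\dfrac{19}{10}.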

Question 5: Describe as fully as possible the correlation pictured.

[3 marks]

This graph shows correlation in clusters.

There is negative correlation within the clusters.

There is no correlation overall.
