# Correlation and Regression

## Correlation and Regression Revision

**Correlation** and **regression** both pertain to data measured in pairs, called **bivariate data**. **Correlation** describes **how closely linked** the two variables are, and whether they **rise together or move in opposite directions**. **Regression** finds the **line of best fit** through the data.

**Correlation and Scatter Graphs**

When we have **bivariate data**, one variable will be the **independent** (or **explanatory**) variable, and the other will be the **dependent** (or **response**) variable. The **independent** variable is the **one that you can control**, and it goes on the x-axis. The **dependent** variable is the **one that is being affected**, and it goes on the y-axis.

There won’t always be a clear **independent** and **dependent** variable, and in that case it is **not as important** which way round they go on a scatter graph.

**Correlation** comes in **three flavours**: **positive**, **negative** and **no correlation**. In **positive correlation**, as **one variable goes up so does the other**. In **negative correlation**, as **one variable goes up the other goes down**. In **no correlation**, there is **no clear link** between the variables.

**Correlation** can also be **strong** or **weak**. In **strong correlation**, the data is **very close to forming a line**. In **weak correlation**, the data is **not close to forming a line**.
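The direction and strength described above can also be put into a single number. The sketch below is one possible way to do that in Python, using the product moment correlation coefficient r, which these notes do not define. The function names, the sample data and the cutoffs 0.7 and 0.2 are all illustrative assumptions, not standard thresholds.

```python
import math

def pearson_r(xs, ys):
    """Product moment correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def describe(r, strong_cutoff=0.7, none_cutoff=0.2):
    """Turn r into words. Both cutoffs are assumptions chosen for illustration."""
    if abs(r) < none_cutoff:
        return "no correlation"
    strength = "strong" if abs(r) >= strong_cutoff else "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"

# Made-up data that lies close to a rising straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(describe(pearson_r(xs, ys)))  # strong positive correlation
```

A value of r near 1 or -1 corresponds to points lying close to a line, and a value near 0 corresponds to no clear linear link.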

**Outliers** are usually **easy to spot** on a scatter graph because they sit well away from the rest of the data. They can be **ignored in subsequent calculations**, but if you do plan to ignore an outlier make sure you **clearly mark** it on the graph as such.

You should also be aware of **clusters**. This is where the data forms **several separate groups on the graph**. We can talk about **overall correlation** and **correlation in clusters**. For example, in the graph on the right, there is **negative correlation** overall but **positive correlation** in the clusters.

**Regression**

The **regression line of y on x** is the **line of best fit**. It is always written in the same form:

y=a+bx

a is the y intercept

b is the gradient
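These notes do not say how a and b are found. A common choice is the least squares method, and the sketch below assumes that method. The function name and the data are made up for illustration; the data lies exactly on y=3+2x so the result is easy to check by hand.

```python
def regression_line(xs, ys):
    """Least squares regression line of y on x, returned as (a, b) for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # gradient
    a = mean_y - b * mean_x    # y-intercept: the line passes through the mean point
    return a, b

# Illustrative data only, chosen to lie exactly on y = 3 + 2x.
xs = [1, 2, 3, 4]
ys = [5, 7, 9, 11]
a, b = regression_line(xs, ys)
print(a, b)  # 3.0 2.0
```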

The **regression line** can be used to predict values of the **dependent variable**. This comes in two flavours:

**Interpolation**: the value of x being used in the prediction **falls inside** the range of the values of x in the data. The predicted value **should be reliable**.

**Extrapolation**: the value of x being used in the prediction **falls outside** the range of the values of x in the data. The predicted value **might be unreliable**.
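The distinction between the two kinds of prediction can be sketched as a small check. This is a minimal illustration, assuming a regression line y=a+bx is already known; the function name `predict` and the numbers in the usage lines are hypothetical.

```python
def predict(a, b, x, xs):
    """Predict y from y = a + b*x and label the prediction.

    xs is the list of x values in the original data; a prediction is
    interpolation when x lies within their range, extrapolation otherwise.
    """
    y = a + b * x
    if min(xs) <= x <= max(xs):
        kind = "interpolation"
    else:
        kind = "extrapolation"
    return y, kind

data_x = [1, 2, 3, 4]                 # illustrative x values
print(predict(3, 2, 2.5, data_x))     # (8.0, 'interpolation')
print(predict(3, 2, 10, data_x))      # (23, 'extrapolation')
```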

**Regression with Coded Data**

Regression can also be done on **coded data**. To convert between the two, **rearrange the coding** to get the original variables in terms of the coded ones, **substitute** into the **regression line**, then rearrange to get it back into straight line form.

**Example:** y=4+10x, with coding s=2y, t=x-3. Rearranging the coding gives y=\dfrac{s}{2} and x=t+3, so the regression line becomes:

\begin{aligned}\dfrac{s}{2}&=4+10(t+3)\\[1.2em]&=4+10t+30\\[1.2em]&=34+10t\end{aligned}

s=68+20t
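The same substitution can be done once in general. The sketch below assumes both codings are linear, written as s=py+q and t=rx+u; the letters p, q, r, u and the function name are introduced here for illustration only. Substituting gives s=(q+pa-\dfrac{pb}{r}u)+\dfrac{pb}{r}t, which is what the code computes with exact fractions.

```python
from fractions import Fraction

def code_regression(a, b, p, q, r, u):
    """Given y = a + b*x with codings s = p*y + q and t = r*x + u,
    return (intercept, gradient) of the regression line of s on t."""
    a, b, p, q, r, u = (Fraction(v) for v in (a, b, p, q, r, u))
    gradient = p * b / r
    intercept = q + p * a - gradient * u
    return intercept, gradient

# The example above: y = 4 + 10x, s = 2y (p=2, q=0), t = x - 3 (r=1, u=-3).
intercept, gradient = code_regression(4, 10, 2, 0, 1, -3)
print(intercept, gradient)  # 68 20
```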

Also, we can form **regression lines from non-linear data** in some cases.

**Example:** y=ax^{n} becomes \log(y)=\log(a)+n\log(x), which is a straight line when \log(y) is plotted against \log(x), with y-intercept \log(a) and gradient n.
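A short sketch of this idea in Python follows. It assumes a least squares fit on the logged values and base 10 logarithms; the function name and the data (generated from y=3x^{2}) are illustrative only.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**n by a least squares line through (log x, log y).

    Returns (a, n). Requires all x and y values to be positive.
    """
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    count = len(xs)
    mean_lx = sum(log_x) / count
    mean_ly = sum(log_y) / count
    sxy = sum((lx - mean_lx) * (ly - mean_ly) for lx, ly in zip(log_x, log_y))
    sxx = sum((lx - mean_lx) ** 2 for lx in log_x)
    gradient = sxy / sxx                       # this is n
    intercept = mean_ly - gradient * mean_lx   # this is log(a)
    return 10 ** intercept, gradient

# Illustrative data generated from y = 3 * x**2.
xs = [1, 2, 4, 8]
ys = [3 * x ** 2 for x in xs]
a, n = fit_power_law(xs, ys)
print(round(a, 6), round(n, 6))  # approximately 3 and 2
```

The intercept of the fitted line is \log(a), so it has to be converted back with a power of 10 to recover a.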

**Example 1: Correlation**

Plot a scatter graph then describe the **correlation** of this data:

**[4 marks]**

Plot the points on the graph:

This graph shows positive correlation.

**Example 2: Regression**

The temperature y of the water in a kettle, in °C, is recorded against time x, in minutes. The **regression line** is:

y=20+40x

What are the **gradient** and **y-intercept**, and what do these mean in the context of the experiment? How long does the kettle take to boil?

**[5 marks]**

Gradient is 40°C per minute

y-intercept is 20°C

In context, this means that the water starts at a temperature of 20°C and rises by 40°C every minute.

The kettle finishes boiling at 100°C. Substitute y=100 into the regression line:

100=20+40x

40x=80

x=2 minutes
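The rearrangement above can be checked with a couple of lines of Python. This is only a sketch; the function name `time_to_reach` is hypothetical.

```python
def time_to_reach(target, a, b):
    """Solve target = a + b*x for x, where y = a + b*x is the regression line."""
    return (target - a) / b

# Kettle example: y = 20 + 40x, boiling at 100 degrees C.
print(time_to_reach(100, 20, 40))  # 2.0
```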

## Correlation and Regression Example Questions

**Question 1: **The table below shows the results of fuel efficiency tests where x is the amount of fuel placed into the vehicle in litres and y is the distance the vehicle travelled before it ran out of fuel.

i) Plot a scatter graph of this data.

ii) What kind of correlation is shown?

**[4 marks]**

Plot the points on the graph:

This graph shows positive correlation.

**Question 2: **The data below shows how long bacteria lived after being placed into more and more alkaline solutions. x is the pH of the solution and y is the time in seconds the bacterium lived.

i) Plot this data on a scatter graph.

ii) Describe the correlation.

**[4 marks]**

Plot the points on the graph.

This graph shows negative correlation.

**Question 3: **A regression line of y=a+bx describes a correlation.

i) What are a and b?

ii) If there is negative correlation, is b>0?

iii) What value does y take when x=0?

iv) Suppose this regression line describes the distance birds travel to migrate (y) against their size (x). What does b mean? Is the physical interpretation for a sensible?

**[6 marks]**

i) a is the y-intercept

b is the gradient

ii) No. Negative correlation means b<0

iii) y=a because a is the y-intercept

iv) b is the gradient, so it is the extra distance a bird travels to migrate per unit increase in size.

a is the travel distance of a bird of size 0. Since a bird cannot have zero size, this is not sensible.

**Question 4: **The regression line of y on x is y=8x+3. The data is coded with s=\dfrac{y}{10}+4 and t=4(x+3). Find the regression line of s on t.

**[4 marks]**

y=8x+3

s=\dfrac{y}{10}+4

Rearrange to find y

s-4=\dfrac{y}{10}

y=10(s-4)

t=4(x+3)

Rearrange to find x

\dfrac{t}{4}=x+3

x=\dfrac{t}{4}-3

Substitute in the expressions for x and y

10(s-4)=8\left(\dfrac{t}{4}-3\right)+3

10s-40=2t-24+3

10s-40=2t-21

10s=2t+19

s=\dfrac{1}{5}t+\dfrac{19}{10}
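One way to check an answer like this is to pick some values of x, work out y from the original line, code both values, and confirm the coded pair satisfies the new line. The sketch below does this with exact fractions; the function name and the test values of x are arbitrary choices for illustration.

```python
from fractions import Fraction

def check_coded_line(xs):
    """Return True if every coded point (t, s) lies on s = t/5 + 19/10."""
    for x in xs:
        x = Fraction(x)
        y = 8 * x + 3        # original regression line
        s = y / 10 + 4       # coding for y
        t = 4 * (x + 3)      # coding for x
        if s != Fraction(1, 5) * t + Fraction(19, 10):
            return False
    return True

print(check_coded_line([-2, 0, 1, 5, 10]))  # True
```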

**Question 5: **Describe as fully as possible the correlation pictured.

**[3 marks]**

This graph shows correlation in clusters.

There is negative correlation within the clusters.

There is no correlation overall.