Chapter 15. Correlation and Regression: What’s My Line?

Have you ever wondered how two things are connected?

So far we’ve looked at statistics that tell you about just one variable—like men’s height, points scored by basketball players, or how long gumball flavor lasts—but there are other statistics that tell you about the connection between variables. Seeing how things are connected can give you a lot of information about the real world, information that you can use to your advantage. Stay with us while we show you the key to spotting connections: correlation and regression.

Concerts are best when they’re in the open air—at least that’s what these groovy guys think. They have a thriving business organizing open-air concerts, and ticket sales for the summer look promising.

Today’s concert looks like it will be one of their best ones ever. The band has just started rehearsing, but there’s a cloud on the horizon...

Before too long the sky’s overcast, temperatures are dipping, and it looks like rain. Even worse, ticket sales are hit. The guys are in trouble, and they can’t afford for this to happen again.

What the guys want is to be able to predict what concert attendance will be, given the predicted hours of sunshine. That way, they’ll be able to gauge the impact an overcast day is likely to have on attendance. If it looks like attendance will fall below 3,500 people, the point where ticket sales won’t cover expenses, then they’ll cancel the concert.

They need your help.

Here’s sample data showing the predicted hours of sunshine and concert attendance for different events. How can we use this to estimate ticket sales based on the predicted hours of sunshine for the day?

Sunshine (hours)             1.9   2.5   3.2   3.8   4.7   5.5   5.9   7.2
Concert attendance (100’s)   22    33    30    42    38    49    42    55

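You might expect to tackle this the way we’ve predicted outcomes before, by finding a mean and standard deviation and working from there. Most of the time, that’s exactly the sort of thing we’d need to do to predict likely outcomes.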

The problem this time is, what would we find the mean and standard deviation of? Would we use the concert attendance as the basis for our calculations, or would we use the hours of sunshine? Neither one of them gives us all the information that we need. Instead of considering just one set of data, we need to look at both.

So far we’ve looked at independent random variables, but not ones that are dependent. We can assume that if the weather is poor, the probability of high attendance at an open air concert will be lower than if the weather is sunny. But how do we model this connection, and how do we use this to predict attendance based on hours of sunshine?

It all comes down to the type of data.

Up until now, the sort of data we’ve been dealing with has been univariate.

Univariate data concerns the frequency or probability of a single variable. As an example, univariate data could describe the winnings at a casino or the weights of brides in Statsville. In each case, just one thing is being described.

What univariate data can’t do is show you connections between sets of data. For example, if you had univariate data describing the attendance figures at an open air concert, it wouldn’t tell you anything about the predicted hours of sunshine on that day. It would just give you figures for concert attendance.

So what if we do need to know what the connection is between variables? While univariate data can’t give us this information, there’s another type of data that can—bivariate data.
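
One way to picture the difference is to look at how you might store each kind of data. This is only a sketch in Python, and the variable names are our own choice rather than anything from the concert organizers: univariate data is a single list of values, while bivariate data keeps the two values for each observation paired together.

```python
from statistics import mean

# Univariate: one value per observation (attendance in 100's).
attendance = [22, 33, 30, 42, 38, 49, 42, 55]
print(mean(attendance))  # summarizes attendance, but says nothing about sunshine

# Bivariate: two values per observation, kept together as (x, y) pairs.
sunshine = [1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2]
observations = list(zip(sunshine, attendance))
print(observations[0])  # the first event: (1.9, 22)
```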

Just as with univariate data, you can draw charts for bivariate data to help you see patterns. Instead of plotting a value against its frequency or probability, you plot one variable on the x-axis and the other variable against it on the y-axis. This helps you to visualize the connection between the two variables.

This sort of chart is called a scatter diagram or scatter plot, and drawing one of these is a lot like drawing any other sort of chart.

Start off by drawing two axes, one vertical and one horizontal. Use the x-axis for one variable and the y-axis for the other. The independent variable normally goes along the x-axis, leaving the dependent variable to go on the y-axis. Once you’ve drawn your axes, you then take the values for each observation and plot them on the scatter plot.

Here’s a scatter plot showing the number of hours of sunshine and concert attendance figures for particular events or observations. As the predicted number of hours of sunshine is the independent variable, we’ve plotted it on the x-axis. The concert attendance is the dependent variable, so that’s on the y-axis.

x (sunshine)     1.9   2.5   3.2   3.8   4.7   5.5   5.9   7.2
y (attendance)   22    33    30    42    38    49    42    55
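
If you’d like to draw the scatter diagram yourself, here’s a rough text-only sketch in Python. The function name `ascii_scatter` and the grid size are our own assumptions, and a real charting tool would give a much nicer picture; the point is simply that each observation becomes one mark, with x deciding how far across it goes and y deciding how far up.

```python
def ascii_scatter(xs, ys, width=40, height=12):
    """Return a list of text rows with '*' marking each (x, y) observation."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column or row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # flip so larger y is nearer the top
    return ["".join(line) for line in grid]

sunshine = [1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2]   # x: independent variable
attendance = [22, 33, 30, 42, 38, 49, 42, 55]          # y: dependent variable

rows = ascii_scatter(sunshine, attendance)
for line in rows:
    print("|" + line)
print("+" + "-" * 40)
```

The marks drift from bottom left to top right, which is the pattern the rest of the chapter sets out to measure.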

Can you see how the scatter diagram helps you visualize patterns in the data? Can you see how this might help us to define the connection between open-air concert attendance and the predicted number of hours of sunshine for the day?

As you can see, scatter diagrams are useful because they show the actual pattern of the data. They enable you to more clearly visualize what connection there is between two variables, if indeed there’s any connection at all.

The scatter diagram for the concert data shows a distinct pattern—the data points are clustered along a straight line. We call this a correlation.

A correlation between two variables doesn’t necessarily mean that one caused the other or that they’re actually related in real life.

A correlation between two variables means that there’s some sort of mathematical relationship between the two. This means that when we plot the values on a chart, we can see a pattern and make predictions about what the missing values might be. What we don’t know is whether there’s an actual relationship between the two variables, and we certainly don’t know whether one caused the other, or if there’s some other factor at work.

As an example, suppose you gather data and find that over time, the number of coffee shops in a particular town increases, while the number of record shops decreases. While this may be true, we can’t say that there is a real-life relationship between the number of coffee shops and the number of record shops. In other words, we can’t say that the increase in coffee shops caused the decline in the record shops. What we can say is that as the number of coffee shops increases, the number of record shops decreases.

So far you’ve seen how scatter diagrams can help you see whether there’s a correlation between values, by showing you if there’s some sort of pattern. But how can you use this to predict concert attendance, based on the predicted amount of sunshine? How would you use your existing scatter diagram to predict the concert attendance if you know how many hours of sunshine are expected for the day?

One way of doing this is to draw a straight line through the points on the scatter diagram, making it fit the points as closely as possible. You won’t be able to get the straight line to go through every point, but if there’s a linear correlation, you should be able to make sure every point is reasonably close to the line you draw. Doing this means that you can read off an estimate for the concert attendance based on the predicted amount of sunshine.

The line that best fits the data points is called the line of best fit.

Drawing the line in this way is just a best guess.

The trouble with drawing a line in this way is that it’s an estimate, so any predictions you make on the basis of it can be suspect. You have no precise way of measuring whether it’s really the best fitting line. It’s subjective, and the quality of the line’s fit depends on your judgment.

Imagine if you asked three different people to draw what each of them thinks is the line of best fit for the open-air concert data. It’s quite likely that each person would come up with a slightly different line.

All three lines could conceivably be a line of best fit for the data, but what we can’t tell is which one’s really best.

What we really need is some alternative to drawing the line of best fit by eye. Instead of guessing what the line should be, it would be more reliable if we had a mathematical or statistical way of using the data we have available to find the line that fits best.

Let’s take a look at what we need from the line of best fit, y = a + bx.

The best fitting line is the one that most accurately predicts the true values of all the points. This means that for each known value of x, we need each of the y values in the data set to be as close as possible to what we’d estimate them to be using the line of best fit. In other words, given a certain number of hours of sunshine, we want our estimates for open-air concert attendance to be as close as possible to the actual values.

The line of best fit is the line y = a + bx that minimizes the distances between the actual observations of y and what we estimate those values of y to be for each corresponding value of x.

Let’s represent each of the y values in our data set using yi, and its estimate using the line of best fit as ŷi. This is the same notation that we used for point estimators in previous chapters, as the ^ symbol indicates estimates.

We want to minimize the total distance between each actual value of y and our estimate of it based on the line of best fit. In other words, we need to minimize the total differences between yi and ŷi. We could try doing this by minimizing

Σ(yi – ŷi)

but the problem with this is that the positive and negative distances will cancel each other out. We need to take a slightly different approach, and it’s one that we’ve seen before.
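
To see the cancelling in action, here’s a small Python sketch. It assumes lines that pass through the point given by the means of x and y, and the slopes tried are arbitrary values picked purely for illustration. Whatever the slope, the signed distances add up to essentially zero, so the plain total can’t tell a good line from a bad one.

```python
from statistics import mean

sunshine = [1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2]
attendance = [22, 33, 30, 42, 38, 49, 42, 55]
x_bar, y_bar = mean(sunshine), mean(attendance)

def total_signed_distance(b):
    """Sum of (y - y_hat) for the line with slope b through (x_bar, y_bar)."""
    a = y_bar - b * x_bar
    return sum(y - (a + b * x) for x, y in zip(sunshine, attendance))

# Arbitrary illustrative slopes: flat, shallow, steep, very steep.
for slope in (0.0, 3.0, 5.0, 10.0):
    print(slope, round(total_signed_distance(slope), 9))
```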

Can you remember when we first derived the variance? We wanted to look at the total distance between sets of values and the mean, but the total distances cancelled each other out. To get around this, we added together all the distances squared instead to ensure that all values were positive.

We have a similar situation here. Instead of looking at the total distance between the actual and expected points, we need to add together the distances squared. That way, we make sure that all the values are positive.

The total sum of the distances squared is called the sum of squared errors, or SSE. It’s given by:

SSE = Σ(y – ŷ)²

In other words, we take each value of y, subtract the predicted value of y from the line of best fit, square it, and then add all the results together.
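
Here’s how that recipe might look as a short Python function. It’s a sketch rather than a definitive implementation: the name `sse` and the two candidate lines are our own, chosen by eye just to have something to compare. The line with the smaller SSE is the one whose predictions sit closer to the actual attendance figures.

```python
def sse(a, b, xs, ys):
    """Sum of squared errors for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

sunshine = [1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2]
attendance = [22, 33, 30, 42, 38, 49, 42, 55]

# Two hypothetical candidate lines, picked by eye for illustration only.
sse_first = sse(20.0, 4.0, sunshine, attendance)    # y = 20 + 4x
sse_second = sse(16.0, 5.3, sunshine, attendance)   # y = 16 + 5.3x
print(round(sse_first, 2), round(sse_second, 2))
print("second line fits better:", sse_second < sse_first)
```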

The variance and SSE are calculated in similar ways.

The SSE isn’t the variance, but it does deal with the distance squared between two particular points. It gives the total of the distances squared between the actual value of y and what we predict the value of y to be, based on the line of best fit.

What we need to do now is use the data to find the values of a and b that minimize the SSE, based on the line y = a + bx.

We’ve said that we want to minimize the sum of squared errors, Σ(y – ŷ)², where ŷ = a + bx. By doing this, we’ll be able to find optimal values for a and b, and that will give us the equation for the line of best fit.
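
If you’re curious where the optimal values come from, here’s a sketch of the usual calculus argument. It isn’t something you need for the calculations that follow, and the subscript i notation is our own addition: write the SSE as a function of a and b, set both partial derivatives to zero, and solve.

```latex
\begin{aligned}
\mathrm{SSE}(a, b) &= \sum_i (y_i - a - b x_i)^2 \\
\frac{\partial\,\mathrm{SSE}}{\partial a} = -2 \sum_i (y_i - a - b x_i) = 0
  &\;\Rightarrow\; a = \bar{y} - b\bar{x} \\
\frac{\partial\,\mathrm{SSE}}{\partial b} = -2 \sum_i x_i (y_i - a - b x_i) = 0
  &\;\Rightarrow\; \sum_i x_i \bigl[(y_i - \bar{y}) - b (x_i - \bar{x})\bigr] = 0 \\
b &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
\end{aligned}
```

The last step uses the fact that the deviations from a mean sum to zero, so replacing x_i with (x_i – x̄) in both sums changes nothing.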

Let’s see if we can use this to find the slope of the line y = a + bx for the concert data. First of all, here’s a reminder of the data:

x (sunshine)     1.9   2.5   3.2   3.8   4.7   5.5   5.9   7.2
y (attendance)   22    33    30    42    38    49    42    55

Let’s start by finding the values of x̄ and ȳ, the sample means of the x and y values. We calculate these in exactly the same way as before, so

x̄ = (1.9 + 2.5 + 3.2 + 3.8 + 4.7 + 5.5 + 5.9 + 7.2)/8
  = 34.7/8
  = 4.3375

ȳ = (22 + 33 + 30 + 42 + 38 + 49 + 42 + 55)/8
  = 311/8
  = 38.875

Now that we’ve found x̄ and ȳ, we can use them to help us find the value of b using the formula b = Σ(x – x̄)(y – ȳ)/Σ(x – x̄)².

We’re part of the way through calculating the value of b, where y = a + bx. We’ve found that x̄ = 4.3375, ȳ = 38.875, and Σ(x – x̄)(y – ȳ) = 122.53. The final thing we have left to find is Σ(x – x̄)². Squaring each distance from x̄ and adding the results together gives Σ(x – x̄)² = 23.02.

We find the value of b by dividing Σ(x – x̄)(y – ȳ) by Σ(x – x̄)². This gives us

b = 122.53/23.02
  = 5.32

In other words, the line of best fit for the data is y = a + 5.32x. But what’s a?
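
Here’s a minimal Python sketch of the same steps, with variable names of our own choosing. One thing to be aware of: carried through at full precision, the sum Σ(x – x̄)(y – ȳ) comes out at about 122.84 rather than the 122.53 quoted above, so the slope the code prints is nearer 5.34 than 5.32. The method is exactly the same either way.

```python
from statistics import mean

sunshine = [1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2]   # x
attendance = [22, 33, 30, 42, 38, 49, 42, 55]          # y

x_bar = mean(sunshine)
y_bar = mean(attendance)

# Numerator: sum of (x - x_bar)(y - y_bar) over all observations.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sunshine, attendance))
# Denominator: sum of (x - x_bar) squared.
s_xx = sum((x - x_bar) ** 2 for x in sunshine)

b = s_xy / s_xx
print(round(x_bar, 4), round(y_bar, 3))
print(round(s_xy, 2), round(s_xx, 2), round(b, 2))
```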

So far we’ve found what the optimal value of b is for the line of best fit y = a + bx. What we don’t know yet is the value of a.

The line needs to go through point (x̄, ȳ).

The line of best fit needs to go through the point (x̄, ȳ), the means of x and y. We can make sure this happens by substituting x̄ and ȳ into the equation for the line y = a + bx. This gives us

ȳ = a + bx̄

or

a = ȳ – bx̄

We’ve already found values for x̄, ȳ, and b. Substituting in these values gives us

a = ȳ – bx̄
  = 38.875 – 5.32 × 4.3375
  = 15.80 (to 2 decimal places)

This means that the line of best fit is given by

y = 15.80 + 5.32x
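
To pull both steps together, here’s a sketch of a small Python helper that returns a and b for any paired data. The name `fit_line` is our own. Because it keeps full precision throughout, its intercept for the concert data lands near 15.73 rather than 15.80; plugging in the rounded slope of 5.32 reproduces the 15.80 worked out by hand.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar   # forces the line through (x_bar, y_bar)
    return a, b

sunshine = [1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2]
attendance = [22, 33, 30, 42, 38, 49, 42, 55]

a, b = fit_line(sunshine, attendance)
a_from_rounded_b = mean(attendance) - 5.32 * mean(sunshine)
print(round(a, 2), round(b, 2))
print(round(a_from_rounded_b, 2))
```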

So far you’ve used linear regression to model the connection between predicted hours of sunshine and concert attendance. Once you know what the predicted amount of sunshine is, you can predict concert attendance using y = a + bx.

Being able to predict attendance means you’ll be able to really help the concert organizers know what they can expect ticket sales to be, and also what sort of profit they can reasonably expect to make from each event.
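
Here’s one way the organizers might put the line to work, sketched in Python. The function names and the `BREAK_EVEN` constant are our own; the coefficients are the rounded ones from y = 15.80 + 5.32x, and the break-even point is the 3,500 people mentioned at the start, written as 35 because attendance is measured in hundreds.

```python
BREAK_EVEN = 35  # 3,500 people, in the same units as y (100's)

def predict_attendance(sunshine_hours, a=15.80, b=5.32):
    """Estimated attendance (100's) from the line y = a + b*x."""
    return a + b * sunshine_hours

def should_cancel(sunshine_hours):
    """True if predicted attendance falls below the break-even point."""
    return predict_attendance(sunshine_hours) < BREAK_EVEN

for hours in (3.0, 5.0):
    print(hours, round(predict_attendance(hours), 2), should_cancel(hours))

# Hours of sunshine at which the prediction exactly hits break-even.
break_even_hours = (BREAK_EVEN - 15.80) / 5.32
print(round(break_even_hours, 2))
```

With these figures, a forecast of 3 hours of sunshine points towards cancelling, while 5 hours comfortably clears the break-even point.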

It’s the line of best fit, but we don’t know how accurate it is.

The line y = a + bx is the best line we could have come up with, but how accurately does it model the connection between the amount of sunshine and the concert attendance? There’s one thing left to consider, the strength of correlation of the regression line.

What would be really useful is if we could come up with some way of indicating how far the points are dispersed away from the line, as that will give an indication of how accurate we can expect our predictions to be based on what we already know.

Let’s look at a few examples.

The line of best fit of a set of data is the best line we can come up with to model the mathematical relationship between two variables.

Even though it’s the line that fits the data best, it’s unlikely that the line will fit precisely through every single point. Let’s look at some different sets of data to see how closely the line fits the data.

The correlation coefficient is a number between –1 and 1 that describes the scatter of data points away from the line of best fit. It’s a way of gauging how well the regression line fits the data. It’s normally represented by the letter r.

If r is –1, the data has a perfect negative linear correlation, with all of the data points lying on a straight line. If r is 1, the data has a perfect positive linear correlation. If r is 0, then there is no linear correlation.

Usually r is somewhere between these values, as –1, 0, and 1 are all extreme.

If r is negative, then there’s a negative linear correlation between the two variables. The closer r gets to –1, the stronger the correlation, and the closer the points are to the line.

If r is positive, then there’s a positive linear correlation between the variables. The closer r gets to 1, the stronger the correlation.

In general, as r gets closer to 0, the linear correlation gets weaker. This means that the regression line won’t be able to predict y values as accurately as when r is close to 1 or –1. The pattern might be random, or the relationship between the variables might not be linear.
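
The three extreme cases are easy to check with made-up data. The sketch below uses a common form of the correlation coefficient, r = Σ(x – x̄)(y – ȳ)/√(Σ(x – x̄)² Σ(y – ȳ)²), which is algebraically the same as the b × sx/sy form used for the concert data later on. The function name and the data sets are our own inventions for illustration.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Correlation coefficient r for paired data."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

xs = [-3, -2, -1, 0, 1, 2, 3]
r_up = correlation(xs, [2 * x + 1 for x in xs])     # points on a rising line
r_down = correlation(xs, [5 - 3 * x for x in xs])   # points on a falling line
r_curve = correlation(xs, [x * x for x in xs])      # a symmetric curve, not a line
print(r_up, r_down, r_curve)
```

The last data set is worth a second look: y depends entirely on x, yet r is 0, because the relationship isn’t linear.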

If we can calculate r for the concert data, we’ll have an idea of how accurately we can predict concert attendance based on the predicted hours of sunshine. So how do we calculate r? Turn the page and we’ll show you how.

So how do we calculate the correlation coefficient, r?

We’re not going to show you the proof for this, but the correlation coefficient r is given by

r = bsx/sy

where sx is the standard deviation of the x values in the sample, and sy is the standard deviation of the y values.

We’ve already done most of the hard work.

Since we’ve already calculated b, all we have left to find is sx and sy. What’s more, we’re already most of the way towards finding sx.

When we calculated b, we needed to find the value of Σ(x – x̄)². If we divide this by n – 1, this actually gives us the sample variance of the x values. If we then take the square root, we’ll have sx. In other words,

sx = √(Σ(x – x̄)²/(n – 1))

The only remaining piece of the equation we have to find is sy, the standard deviation of the y values in the sample. We calculate this in a similar way to finding sx.

sy = √(Σ(y – ȳ)²/(n – 1))

Let’s try finding what r is for the concert attendance data.

Let’s use the formula to find the value of r for the concert data. First of all, here’s a reminder of the data:

x (sunshine)     1.9   2.5   3.2   3.8   4.7   5.5   5.9   7.2
y (attendance)   22    33    30    42    38    49    42    55

To find r, we need to know the values of b, sx, and sy so that we can use them in the formula r = bsx/sy. So far we’ve found that

b = 5.32

but what about sx and sy?

Let’s start with sx. We found earlier that Σ(x – x̄)² = 23.02, and we know that the sample size is 8. This means that if we divide 23.02 by 7, we’ll have the sample variance of x. To find sx, we take the square root.

sx = √(23.02/7)
   = 1.81 (to 2 decimal places)

The only piece of the formula we have left to find is sy. We already know that ȳ = 38.875, as we found it earlier on, so this means that

Σ(y – ȳ)² = (22 – 38.875)² + (33 – 38.875)² + ... + (55 – 38.875)²
          = 780.875

We can now use this to find sy, by dividing by n – 1 and taking the square root.

sy = √(780.875/7)
   = 10.56 (to 2 decimal places)
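
If you want to double-check the two standard deviations, here’s a short Python sketch. The helper name `sample_sd` is our own; it follows the divide-by-n – 1 recipe described above, and the standard library’s `statistics.stdev` uses the same divisor, so the two should agree.

```python
from math import sqrt
from statistics import mean, stdev

def sample_sd(values):
    """Sample standard deviation: divide the squared distances by n - 1."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

sunshine = [1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2]
attendance = [22, 33, 30, 42, 38, 49, 42, 55]

s_x = sample_sd(sunshine)
s_y = sample_sd(attendance)
print(round(s_x, 2), round(s_y, 2))
print(round(stdev(sunshine), 2), round(stdev(attendance), 2))
```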

All we need to do now is use b, sx, and sy to find the value of the correlation coefficient r.

Now that we’ve found that b = 5.32, sx = 1.81, and sy = 10.56, we can put them together to find r.

r = bsx/sy
  = 5.32 × 1.81/10.56
  = 0.91 (to 2 decimal places)

As r is very close to 1, this means that there’s strong positive correlation between open air concert attendance and hours of predicted sunshine. In other words, based on the data that we have, we can expect the line of best fit, y = 15.80 + 5.32x, to give a reasonably good estimate of the expected concert attendance based on the predicted hours of sunshine.
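
As a final check, here’s a Python sketch that works out r two ways: from the slope, as r = b × sx/sy, and directly from the sums of squares. The variable names are our own. At full precision both routes give a value of about 0.92, a shade above the 0.91 you get from the rounded figures; either way, the conclusion of a strong positive correlation stands.

```python
from math import sqrt
from statistics import mean, stdev

sunshine = [1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2]
attendance = [22, 33, 30, 42, 38, 49, 42, 55]

x_bar, y_bar = mean(sunshine), mean(attendance)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sunshine, attendance))
s_xx = sum((x - x_bar) ** 2 for x in sunshine)
s_yy = sum((y - y_bar) ** 2 for y in attendance)

b = s_xy / s_xx
r_from_slope = b * stdev(sunshine) / stdev(attendance)   # r = b * sx / sy
r_direct = s_xy / sqrt(s_xx * s_yy)                      # same value, no slope needed
r_rounded_inputs = 5.32 * 1.81 / 10.56                   # the hand calculation

print(round(r_from_slope, 3), round(r_direct, 3), round(r_rounded_inputs, 3))
```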

The concert organizers are amazed at the work you’ve done with their concert data. They now have a way of predicting what attendance will be like at their concerts based on the weather reports, which means they have a way of maximizing their profits.

We’re sad to see you leave, but there’s nothing like taking what you’ve learned and putting it to use. There are still a few more gems for you in the back of the book, some handy probability tables, and an index to read through, and then it’s time to take all these new ideas and put them into practice. We’re dying to hear how things go, so drop us a line at the Head First Labs web site, www.headfirstlabs.com, and let us know how Statistics is paying off for YOU!