Never trust the weather

Concerts are best when they’re in the open air—at least that’s what these groovy guys think. They have a thriving business organizing open-air concerts, and ticket sales for the summer look promising.

Today’s concert looks like it will be one of their best ones ever. The band has just started rehearsing, but there’s a cloud on the horizon...

Before too long the sky’s overcast, temperatures are dipping, and it looks like rain. Even worse, ticket sales are hit. The guys are in trouble, and they can’t afford for this to happen again.

What the guys want is to be able to predict what concert attendance will be given predicted hours of sunshine. That way, they’ll be able to gauge the impact an overcast day is likely to have on attendance. If it looks like attendance will fall below 3,500 people, the point where ticket sales won’t cover expenses, then they’ll cancel the concert

They need your help.

Let’s analyze sunshine and attendance

Here’s sample data showing the predicted hours of sunshine and concert attendance for different events. How can we use this to estimate ticket sales based on the predicted hours of sunshine for the day?

Sunshine (hours)	1.9	2.5	3.2	3.8	4.7	5.5	5.9	7.2
Concert attendance (100’s)	22	33	30	42	38	49	42	55

Most of the time, that’s exactly the sort of thing we’d need to do to predict likely outcomes.

The problem this time is, what would we find the mean and standard deviation of? Would we use the concert attendance as the basis for our calculations, or would we use the hours of sunshine? Neither one of them gives us all the information that we need. Instead of considering just one set of data, we need to look at both.

So far we’ve looked at independent random variables, but not ones that are dependent. We can assume that if the weather is poor, the probability of high attendance at an open air concert will be lower than if the weather is sunny. But how do we model this connection, and how do we use this to predict attendance based on hours of sunshine?

It all comes down to the type of data.

Brain Power

How would you go about modelling the connection between sets of data?

Exploring types of data

Up until now, the sort of data we’ve been dealing with has been univariate.

Univariate data concerns the frequency or probability of a single variable. As an example, univariate data could describe the winnings at a casino or the weights of brides in Statsville. In each case, just one thing is being described.

What univariate data can’t do is show you connections between sets of data. For example, if you had univariate data describing the attendance figures at an open air concert, it wouldn’t tell you anything about the predicted hours of sunshine on that day. It would just give you figures for concert attendance.

So what if we do need to know what the connection is between variables? While univariate data can’t give us this information, there’s another type of data that can—bivariate data.

All about bivariate data

Bivariate data gives you the value of two variables for each observation, not just one. As an example, it can give you both the predicted hours of sunshine and the concert attendance for a single event or observation, like this.

Sunshine (hours)	1.9	2.5	3.2	3.8	4.7 Note Bivariate data gives you the value of two variables for each observation.	5.5	5.9	7.2
Concert attendance (100’s)	22	33	30	42	38	49	42	55

If one of the variables has been controlled in some way or is used to explain the other, it is called the independent or explanatory variable. The other variable is called the dependent or response variable. In our example, we want to use sunshine to predict attendance, so sunshine is the independent variable, and attendance is the dependent.

Visualizing bivariate data

Just as with univariate data, you can draw charts for bivariate data to help you see patterns. Instead of plotting a value against its frequency or probability, you plot one variable on the x-axis and the other variable against it on the y-axis. This helps you to visualize the connection between the two variables.

This sort of chart is called a scatter diagram or scatter plot, and drawing one of these is a lot like drawing any other sort of chart.

Start off by drawing two axes, one vertical and one horizontal. Use the x-axis for one variable and the y-axis for the other. The independent variable normally goes along the x-axis, leaving the dependent variable to go on the y-axis. Once you’ve drawn your axes, you then take the values for each observation and plot them on the scatter plot.

Here’s a scatter plot showing the number of hours of sunshine and concert attendance figures for particular events or observations. As the predicted number of hours sunshine is the independent variable, we’ve plotted it on the x-axis. The concert attendance is the dependent variable, so that’s on the y-axis.

x (sunshine) Note Hours sunshine goes on the x-axis, attendance on the y-axis.	1.9	2.5	3.2	3.8	4.7	5.5	5.9	7.2 Note Here’s the data.
y (attendance)	22	33	30	42	38	49	42	55

Can you see how the scatter diagram helps you visualize patterns in the data? Can you see how this might help us to define the connection between open air concert attendance and predicted number of hours sunshine for the day?

Scatter diagrams show you patterns

As you can see, scatter diagrams are useful because they show the actual pattern of the data. They enable you to more clearly visualize what connection there is between two variables, if indeed there’s any connection at all.

The scatter diagram for the concert data shows a distinct pattern—the data points are clustered along a straight line. We call this a correlation.

Correlation vs. causation

A correlation between two variables doesn’t necessarily mean that one caused the other or that they’re actually related in real life.

A correlation between two variables means that there’s some sort of mathematical relationship between the two. This means that when we plot the values on a chart, we can see a pattern and make predictions about what the missing values might be. What we don’t know is whether there’s an actual relationship between the two variables, and we certainly don’t know whether one caused the other, or if there’s some other factor at work.

As an example, suppose you gather data and find that over time, the number of coffee shops in a particular town increases, while the number of record shops decreases. While this may be true, we can’t say that there is a real-life relationship between the number of coffee shops and the number of record shops. In other words, we can’t say that the increase in coffee shops caused the decline in the record shops. What we can say is that as the number of coffee shops increases, the number of record shops decreases.

We need to predict the concert attendance

So far we’ve looked at what bivariate data is, and how scatter diagrams can show whether there’s a mathematical relationship between the two variables. What we haven’t looked at yet is how we can use this to make predictions.

What we need to do next is see how we can use the data to make predictions for concert attendance, based on predicted hours of sunshine.

Brain Power

How do you think we could go about making predictions like this for bivariate data?

Predict values with a line of best fit

So far you’ve seen how scatter diagrams can help you see whether there’s a correlation between values, by showing you if there’s some sort of pattern. But how can you use this to predict concert attendance, based on the predicted amount of sunshine? How would you use your existing scatter diagram to predict the concert attendance if you know how many hours of sunshine are expected for the day?

One way of doing this is to draw a straight line through the points on the scatter diagram, making it fit the points as closely as possible. You won’t be able to get the straight line to go through every point, but if there’s a linear correlation, you should be able to make sure every point is reasonably close to the line you draw. Doing this means that you can read off an estimate for the concert attendance based on the predicted amount of sunshine.

The line that best fits the data points is called the line of best fit.

Drawing the line in this way is just a best guess.

The trouble with drawing a line in this way is that it’s an estimate, so any predictions you make on the basis of it can be suspect. You have no precise way of measuring whether it’s really the best fitting line. It’s subjective, and the quality of the line’s fit depends on your judgment.

Your best guess is still a guess

Imagine if you asked three different people to draw what each of them think is the line of best fit for the open air concert data. It’s quite likely that each person would come up with a slightly different line of best fit, like this:

All three lines could conceivably be a line of best fit for the data, but what we can’t tell is which one’s really best.

What we really need is some alternative to drawing the line of best fit by eye. Instead of guessing what the line should be, it will be more reliable if we had a mathematical or statistical way of using the data we have available to find the line that fits best.

We need to find the equation of the line

The equation for a straight line takes the form y = a + bx, where a is the point where the line crosses the y-axis, and b is the slope of the line. This means that we can write the line of best fit in the form y = a + bx.

In our case, we’re using x to represent the predicted number of hours of sunshine, and y to represent the corresponding open air concert figures. If we can use the concert attendance data to somehow find the most suitable values of a and b, we’ll have a reliable way to find the equation of the line, and a more reliable way of predicting concert attendance based on predicted hour of sunshine.

We need to minimize the errors

Let’s take a look at what we need from the line of best fit, y = a + bx.

The best fitting line is the one that most accurately predicts the true values of all the points. This means that for each known value of x, we need each of the y variables in the data set to be as close as possible to what we’d estimate them to be using the line of best fit. In other words, given a certain number of hours sunshine, we want our estimates for open air concert attendance to be as close as possible to the actual values.

The line of best fit is the line y = a + bx that minimizes the distances between the actual observations of y and what we estimate those values of y to be for each corresponding value of x.

Let’s represent each of the y values in our data set using y_i, and its estimate using the line of best fit as . This is the same notation that we used for point estimators in previous chapters, as the ^ symbol indicates estimates.

We want to minimize the total distance between each actual value of y and our estimate of it based on the line of best fit. In other words, we need to minimize the total differences between y_i and . We could try doing this by minimizing

but the problem with this is that all of the distances will actually cancel each other out. We need to take a slightly different approach, and it’s one that we’ve seen before.

Introducing the sum of squared errors

Can you remember when we first derived the variance? We wanted to look at the total distance between sets of values and the mean, but the total distances cancelled each other out. To get around this, we added together all the distances squared instead to ensure that all values were positive.

The total sum of the distances squared is called the sum of squared errors, or SSE. It’s given by:

In other words, we take each value of y, subtract the predicted value of y from the line of best fit, square it, and then add all the results together.

The variance and SSE are calculated in similar ways.

The SSE isn’t the variance, but it does deal with the distance squared between two particular points. It gives the total of the distances squared between the actual value of y and what we predict the value of y to be, based on the line of best fit.

What we need to do now is use the data to find the values of a and b that minimize the SSE, based on the line y = a + bx.

Find the equation for the line of best fit

We’ve said that we want to minimize the sum of squared errors, , where y = a + bx. By doing this, we’ll be able to find optimal values for a and b, and that will give us the equation for the line of best fit.

Let’s start with b

The value of b for the line y = a + bx gives us the slope, or steepness, of the line. In other words, b is the slope for the line of best fit.

We’re not going to show you the proof for this, but the value of b that minimizes the SSE is given by

The calculation looks tricky at first, but it’s not that difficult with practice.

First of all, find x̄ and ȳ, the means of the x and y values for the data that you have. Once you’ve done that, calculate (x – x̄) multiplied by (y – ȳ) for every observation in your data set, and add the results together. Finally, divide the whole lot by Σ(x – x̄)². This last part of the equation is very similar to how you calculate the variance of a sample. The only difference is that you don’t divide by (n – 1). You can also get software packages that work all of this out for you.

Let’s take a look at how you use this in practice.

Relax

If you need to calculate this in an exam, you will almost certainly be given the formula.

This means that you won’t have to memorize the formula, just know how to use it.

Finding the slope for the line of best fit

Let’s see if we can use this to find the slope of the line y = a + bx for the concert data. First of all, here’s a reminder of the data:

x (sunshine)	1.9	2.5	3.2	3.8	4.7	5.5	5.9	7.2
y (attendance)	22	33	30	42	38	49	42	55

Let’s start by finding the values of x̄ and ȳ, the sample means of the x and y values. We calculate these in exactly the same way as before, so

x̄	= (1.9 + 2.5 + 3.2 + 3.8 + 4.7 + 5.5 + 5.9 + 7.2)/8
	= 34.7/8
	= 4.3375

Note

Use the values of x to find x̄, and the values of y to find ȳ.

ȳ	= (22 + 33 + 30 + 42 + 38 + 49 + 42 + 55)/8
	= 311/8
	= 38.875

Now that we’ve found x̄ and ȳ, we can use them to help us find the value of b using the formula on the opposite page.

We use x̄ and ȳ to help us find b

The first part of the formula is Σ(x – x̄)(y – ȳ). To find this, we take the x and y values for each observation, subtract x̄ from the x value, subtract ȳ from the y value, and then multiply the two together. Once we’ve done this for every observation, we then add the whole lot up together.

Finding the slope for the line of best fit, part ii

Here’s a reminder of the data for concert attendance and predicted hours of sunshine:

We’re part of the way through calculating the value of b, where y = a + bx. We’ve found that x̄ = 4.3375, ȳ = 38.875, and Σ(x – x̄)(y – ȳ) = 122.53. The final thing we have left to find is Σ(x – x̄)². Let’s give it a go

We find the value of b by dividing Σ(x – x̄)(y – ȳ) by Σ(x – x̄)². This gives us

b	= 122.53/23.02
	= 5.32

Note

We’ve found b. This gives the slope for the line of best fit.

In other words, the line of best fit for the data is y = a + 5.32x. But what’s a?

We’ve found b, but what about a?

So far we’ve found what the optimal value of b is for the line of best fit y = a + bx. What we don’t know yet is the value of a.

The line needs to go through point (x̄, ȳ).

It’s good for the line of best fit to go through the the point (x̄, ȳ), the means of x and y. We can make sure this happens by substituting x̄ and ȳ into the equation for the line y = a + bx. This gives us

ȳ = a + bx̄

or

a = ȳ – bx̄

We’ve already found values for x̄, ȳ, and b. Substituting in these values gives us

Relax

If you’re taking a statistics exam, it’s likely you’ll be given this formula.

This means that you’re unlikely to have to memorize it, you just need to know how to use it.

This means that the line of best fit is given by

y = 15.80 + 5.32x

You’ve made the connection

So far you’ve used linear regression to model the connection between predicted hours of sunshine and concert attendance. Once you know what the predicted amount of sunshine is, you can predict concert attendance using y = a + bx.

Being able to predict attendance means you’ll be able to really help the concert organizers know what they can expect ticket sales to be, and also what sort of profit they can reasonably expect to make from each event.

It’s the line of best fit, but we don’t know how accurate it is.

The line y = a + bx is the best line we could have come up with, but how accurately does it model the connection between the amount of sunshine and the concert attendance? There’s one thing left to consider, the strength of correlation of the regression line.

What would be really useful is if we could come up with some way of indicating how far the points are dispersed away from the line, as that will give an indication of how accurate we can expect our predictions to be based on what we already know.

Let’s look at a few examples.

Brain Power

Why do you think it’s important to know the strength of the correlation? What difference do you think this would make to the concert organizers?

Let’s look at some correlations

The line of best fit of a set of data is the best line we can come up with to model the mathematical relationship between two variables.

Even though it’s the line that fits the data best, it’s unlikely that the line will fit precisely through every single point. Let’s look at some different sets of data to see how closely the line fits the data.

Accurate linear correlation

For this set of data, the linear correlation is an accurate fit of the data. The regression line isn’t 100% perfect, but it’s very close. It’s likely that any predictions made on the basis of it will be accurate.

No linear correlation

For this set of data, there is no linear correlation. It’s possible to calculate a regression line using least squares regression, but any predictions made are unlikely to be accurate.

Can you see what the problem is?

Both sets of data have a regression line, but the actual fit of the data varies quite a lot. For the first set of data, the correlation is very tight, but for the second, the points are scattered too widely for the regression line to be useful.

Least squares estimates can be used to predict values, which means they would be helpful if there was some way of indicating how tightly the data points fit the line, and how accurate we can expect any predictions to be as a result.

There’s a way of calculating the fit of the line, called the correlation coefficient.

The correlation coefficient measures how well the line fits the data

The correlation coefficient is a number between –1 and 1 that describes the scatter of data points away from the line of best fit. It’s a way of gauging how well the regression line fits the data. It’s normally represented by the letter r.

If r is –1, the data is a perfect negative linear correlation, with all of the data point in a straight line. If r is 1, the data is a perfect positive linear correlation. If r is 0, then there is no correlation.

Usually r is somewhere between these values, as –1, 0, and 1 are all extreme.

If r is negative, then there’s a negative linear correlation between the two variables. The closer r gets to –1, the stronger the correlation, and the closer the points are to the line.

If r is positive, then there’s a positive linear correlation between the variables. The closer r gets to 1, the stronger the correlation.

In general, as r gets closer to 0, the linear correlation gets weaker. This means that the regression line won’t be able to predict y values as accurately as when r is close to 1 or –1. The pattern might be random, or the relationship between the variables might not be linear.

If we can calculate r for the concert data, we’ll have an idea of how accurately we can predict concert attendance based on the predicted hours of sunshine. So how do we calculate r? Turn the page and we’ll show you how.

There’s a formula for calculating the correlation coefficient, r

So how do we calculate the correlation coefficient, r?

We’re not going to show you the proof for this, but the correlation coefficient r is given by

where s_x is the standard deviation of the x values in the sample, and s_y is the standard deviation of the y values.

We’ve already done most of the hard work.

Since we’ve already calculated b, all we have left to find is s_x and s_y. What’s more, we’re already most of the way towards finding s_x.

When we calculated b, we needed to find the value of Σ(x – x̄)². If we divide this by n – 1, this actually gives us the sample variance of the x values. If we then take the square root, we’ll have s_x. In other words,

The only remaining piece of the equation we have to find is s_y, the standard deviation of the y values in the sample. We calculate this in a similar way to finding s_x.

Let’s try finding what r is for the concert attendance data.

Find r for the concert data

Let’s use the formula to find the value of r for the concert data. First of all, here’s a reminder of the data:

x (sunshine)	1.9	2.5	3.2	3.8	4.7	5.5	5.9	7.2
y (attendance)	22	33	30	42	38	49	42	55

To find r, we need to know the values of b, s_x, and s_y so that we can use them in the formula on the opposite page. So far we’ve found that

b = 5.32

Note

This is the slope of the line we found earlier.

but what about s_x and s_y?

Let’s start with s_x. We found earlier that Σ(x – x̄)² = 23.02, and we know that the sample size is 8. This means that if we divide 23.02 by 7, we’ll have the sample variance of x. To find s_x, we take the square root.

The only piece of the formula we have left to find is s_y. We already know that ȳ = 38.875, as we found it earlier on, so this means that

We can now use this to find s_y, by dividing by n – 1 and taking the square root.

All we need to do now is use b, s_x, and s_y to find the value of the correlation coefficient r.

Find r for the concert data, continued

Now that we’ve found that b = 5.32, s_x = 1.81, and s_y = 10.56, we can put them together to find r.

r	= bs_x/s_y
	= 5.32 x 1.81/10.56
	= 0.91 (to 2 decimal places)

As r is very close to 1, this means that there’s strong positive correlation between open air concert attendance and hours of predicted sunshine. In other words, based on the data that we have, we can expect the line of best fit, y = 15.80 + 5.32x, to give a reasonably good estimate of the expected concert attendance based on the predicted hours of sunshine.

You’ve saved the day!

The concert organizers are amazed at the work you’ve done with their concert data. They now have a way of predicting what attendance will be like at their concerts based on the weather reports, which means they have a way of maximizing their profits.

Leaving town...

It’s been great having you here in Statsville!

We’re sad to see you leave, but there’s nothing like taking what you’ve learned and putting it to use. There are still a few more gems for you in the back of the book, some handy probability tables, and an index to read though, and then it’s time to take all these new ideas and put them into practice. We’re dying to hear how things go, so drop us a line at the Head First Labs web site, www.headfirstlabs.com, and let us know how Statistics is paying off for YOU!

Q:	Q: So are we saying that the predicted sunshine causes low ticket sales?
A:	A: The bivariate data shows that there is a mathematical relationship between the two variables, but we can’t use it to demonstrate cause and effect. It’s intuitively possible that more people will go to open air concerts when it’s sunny, but we can’t say for certain that sunshine causes this. We’d need to do more research, as there may be other factors.
Q:	Q: Other factors? Like what?
A:	A: One example would be the popularity of the artist performing. If a well-known artist is holding a concert, then fans may want to go to the concert no matter what the weather. Similarly, an unpopular artist is unlikely to have the same dedication from fans.
Q:	Q: Do scatter diagrams use populations or samples of data?
A:	A: They can use either. A lot of the time, you’ll actually be using samples, but the process of plotting a scatter diagram is the same irrespective of whether you have a sample or a population.
Q:	Q: If there’s a correlation between two variables, does it have to be linear?
A:	A: Correlation measures linear relationships, but not all relationships are linear. As an example, a strong relationship between two variables could be a distinctive curve, such as y = x². In this chapter, we’re only going to be dealing with linear relationships, though.

Q:	Q: It looks like the formulas you’ve given are for samples rather than populations. Is that right?
A:	A: That’s right. We’ve used samples rather than populations because the data we’ve been given is a sample. There’s nothing to stop you using a population if you have the data, just use μ instead of x̄.
Q:	Q: Is the value of b always positive?
A:	A: No, it isn’t. Whether b is positive or negative actually depends on the type of linear correlation. For positive linear correlation, b is positive. For negative linear correlation, b is negative.
Q:	Q: I’ve heard of the term gradient. What’s that?
A:	A: Gradient is another term for the slope of the line, b.
Q:	Q: What about if there’s no correlation? Can I still work out b?
A:	A: If there’s no correlation, you can still technically find a line of best fit, but it won’t be an effective model of the data, and you won’t be able to make accurate predictions using it.
Q:	Q: Is there an easy way of calculating b?
A:	A: Calculating b is tricky if you have lots of observations, but you can get software packages to calculate this for you.

y	= 15.80 + 5.32x
	= 15.80 + 5.32 x 6
	= 15.80 + 31.92
	= 47.72

y	= 15.80 + 5.32x
35	= 15.80 + 5.32x
35 – 15.80	= 5.32x
19.2	= 5.32x
x	= 19.2/5.32
	= 3.61 (to 2 decimal places)

Q:	Q: I’ve seen other ways of calculating r. Are they wrong?
A:	A: There are several different forms of the equation for finding r, but underneath, they’re basically the same. We’ve used the simplest form of the equation so that it’s easier to see what you’ve already calculated through finding b.
Q:	Q: Are the results accurate with such a small sample?
A:	A: A larger sample would definitely be better, but we used a small sample just to make the calculations easier to follow.
Q:	Q: You haven’t proved or derived why you calculate the values of b and r in this way. Why not?
A:	A: Deriving the formula for b and r is quite complex and involved, so we’ve decided not to go through this in the book. The key thing is that you understand when and how to use them.
Q:	Q: What’s the expected concert attendance if the predicted hours of sunshine is 0?
A:	A: We can’t say for certain because this is quite a way outside the range of data we have. The line of best fit is a pretty good estimate for the range of data that we have, but we can’t say with any certainty what the concert attendance will be like outside this range. The data might follow a different pattern outside this range, so any estimate we gave would be unreliable.
Q:	Q: When we were looking at averages, we saw that univariate data can have outliers. What about bivariate data?
A:	A: Yes, bivariate data can have outliers too. Outliers are points that lie a long way from your regression line. If you have outliers, then this can mean that you have anomalies in your data set, or alternatively, that your regression line isn’t a good fit of the data.
Q:	Q: I’ve heard of influential observations. What are they?
A:	A: Influential observations are points that lie a long way horizontally from the rest of the data. Because of this, they have the effect of pulling the regression line towards them.
Q:	Q: So is an influential observation the same as an outlier?
A:	A: No. Outliers lie a long way from the line. Influential observations lie a long way horizontally from the data.

Radiation exposure (minutes)	4	4.5	5	5.5	6	6.5	7
Weight (tons)	12	10	8	9.5	8	9	6

Radiation exposure (minutes)	4	4.5	5	5.5	6	6.5	7
Weight (tons)	12	10	8	9.5	8	9	6

x̄	= (4 + 4.5 + 5 + 5.5 + 6 + 6.5 + 7)/7
	= 38.5/7
	= 5.5
ȳ	= (12 + 10 + 8 + 9.5 + 8 + 9 + 6)/7
	= 62.5/7
	= 8.9 (to 2 decimal places)

a	= ȳ – bx̄
	= 8.9 + 1.43 x 5.5
	= 8.9 + 7.86
	= 16.76