Chapter 14

Regression: Linear, Multiple, and the General Linear Model

IN THIS CHAPTER

check Summarizing a relationship

check Working with regression

check Taking another look at ANOVA

check Exploring analysis of covariance

check Examining the general linear model

One of the main things you do when you work with statistics is make predictions. The idea is to use data from one or more variables to predict the value of another variable. To do this, you have to understand how to summarize relationships among variables, and to test hypotheses about those relationships.

In this chapter, I introduce regression, a statistical way to do just that. Regression also enables you to use the details of relationships to make predictions. First, I show you how to analyze the relationship between one variable and another. Then I show you how to analyze the relationship between a variable and two others. Finally, I let you in on the connection between regression and ANOVA.

The Plot of Scatter

FarMisht Consulting, Inc., is a consulting firm with a wide range of specialties. It receives numerous applications from people interested in becoming FarMisht consultants. Accordingly, FarMisht Human Resources has to be able to predict which applicants will succeed and which ones will not. They’ve developed a Performance measure that they use to assess their current employees. The scale is 0–100, where 100 indicates top performance.

What’s the best prediction for a new applicant? Without knowing anything about an applicant, and knowing only their own employees’ Performance scores, the answer is clear: It’s the average Performance score among their employees. Regardless of who the applicant is, that’s all the Human Resources team can say if its knowledge is limited to those scores.

With more knowledge about the employees and about the applicants, a more accurate prediction becomes possible. For example, if FarMisht develops an aptitude test and assesses its employees, Human Resources can match up every employee’s Performance score with their Aptitude score and see whether the two pieces of data are somehow related. If they are, an applicant can take the FarMisht aptitude test, and Human Resources can use that score (and the relationship between Aptitude and Performance) to help make a prediction.

Figure 14-1 shows the Aptitude-Performance matchup in a graphical way. Because the points are scattered, it’s called a scatter plot. By convention, the vertical axis (the y-axis) represents what you’re trying to predict. That’s also called the dependent variable, or the y-variable. In this case, that’s Performance. Also by convention, the horizontal axis (the x-axis) represents what you’re using to make your prediction. That’s also called the independent variable, or x-variable. Here, that’s Aptitude.


FIGURE 14-1: Aptitude and Performance at FarMisht Consulting.

Each point in the graph represents an individual’s Performance and Aptitude. In a scatter plot for a real-life corporation, you’d see many more points than I show here. The general tendency of the set of points seems to be that high Aptitude scores are associated with high Performance scores and that low Aptitude scores are associated with low Performance scores.

I’ve singled out one of the points. It shows a FarMisht employee with an Aptitude score of 54 and a Performance score of 58. I also show the average Performance score, to give you a sense that knowing the Aptitude-Performance relationship provides an advantage over knowing only the mean.

How do you make that advantage work for you? You start by summarizing the relationship between Aptitude and Performance. The summary is a line through the points. How and where do you draw the line?

I get to that in a minute. First, I have to tell you about lines in general.

Graphing Lines

In the world of mathematics, a line is a way to picture a relationship between an independent variable (x) and a dependent variable (y). In this relationship,

y = 4 + 2x

If you supply a value for x, you can figure out the corresponding value for y. The equation says to multiply the x-value by 2 and then add 4.

If x = 1, for example, y = 6. If x = 2, y = 8. Table 14-1 shows a number of x-y pairs in this relationship, including the pair in which x = 0.

TABLE 14-1 x-y Pairs in y = 4 + 2x

x    y
0    4
1    6
2    8
3    10
4    12
5    14
6    16

Figure 14-2 shows these pairs as points on a set of x-y axes, along with a line through the points. Each time I list an x-y pair in parentheses, the x-value is first.


FIGURE 14-2: The graph for y = 4 + 2x.

As the figure shows, the points fall neatly onto the line. The line graphs the equation y = 4 + 2x. In fact, whenever you have an equation like this, where x isn’t squared or cubed or raised to any power higher than 1, you have what mathematicians call a linear equation. (If x is raised to a higher power than 1, you connect the points with a curve, not a line.)

remember A couple of things to keep in mind about a line: You can describe a line in terms of how slanted it is, and where it runs into the y-axis.

The how-slanted-it-is part is the slope. The slope tells you how much y changes when x changes by one unit. In the line shown in Figure 14-2, when x changes by one (from 4 to 5, for example), y changes by two (from 12 to 14).

The where-it-runs-into-the-y-axis part is called the y-intercept (or sometimes just the intercept). That’s the value of y when x = 0. In Figure 14-2, the y-intercept is 4.

You can see these numbers in the equation. The slope is the number that multiplies x, and the intercept is the number you add to that product. In general,

y = a + bx

where a represents the intercept and b represents the slope.

The slope can be a positive number, a negative number, or 0. In Figure 14-2, the slope is positive. If the slope is negative, the line is slanted in a direction opposite to what you see in Figure 14-2: y decreases as x increases. If the slope is 0, the line is parallel to the horizontal axis, and y doesn’t change as x changes.

The same applies to the intercept — it can be a positive number, a negative number, or 0. If the intercept is positive, the line crosses the y-axis above the x-axis. If the intercept is negative, the line crosses the y-axis below the x-axis. If the intercept is 0, the line passes through the origin, the point where the x-axis and y-axis intersect.
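If you want to check the numbers in Table 14-1 for yourself, here’s a quick R sketch (the object names are mine) that evaluates the line at each x-value:

x <- 0:6          # the x-values in Table 14-1
a <- 4            # the intercept
b <- 2            # the slope
y <- a + b*x
y                 # 4 6 8 10 12 14 16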

And now, back to what I was originally talking about.

Regression: What a Line!

I mention earlier that a line is the best way to summarize the relationship in the scatter plot in Figure 14-1. It’s possible to draw an infinite number of straight lines through the scatter plot. Which one best summarizes the relationship?

Intuitively, the “best fitting” line ought to be the one that passes through the maximum number of points and isn’t too far away from the points it doesn’t pass through. For statisticians, that line has a special property: If you draw that line through the scatter plot, then draw distances (in the vertical direction) between the points and the line, and then square those distances and add them up, the sum of the squared distances is a minimum.

Statisticians call this line the regression line, and they indicate it as

y' = a + bx

Each y' is a point on the line. It represents the best prediction of y for a given value of x.

To figure out exactly where this line is, you calculate its slope and its intercept. For a regression line, the slope and intercept are called regression coefficients.

The formulas for the regression coefficients are pretty straightforward. For the slope, the formula is

b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²

The intercept formula is

a = ȳ - bx̄

I illustrate with an example. To keep the numbers manageable and comprehensible, I use a small sample instead of the hundreds (or perhaps thousands) of employees you’d find in a scatter plot for a corporation. Table 14-2 shows a sample of data from 16 FarMisht consultants.

TABLE 14-2 Aptitude Scores and Performance Scores for 16 FarMisht Consultants

Consultant   Aptitude   Performance
1            45         56
2            81         74
3            65         56
4            87         81
5            68         75
6            91         84
7            77         68
8            61         52
9            55         57
10           66         82
11           82         73
12           93         90
13           76         67
14           83         79
15           61         70
16           74         66
Mean         72.81      70.63
Variance     181.63     126.65
Standard Deviation   13.48   11.25

For this set of data, the slope of the regression line is

b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² = 1780.88 / 2724.44 ≈ 0.654

The intercept is

a = ȳ - bx̄ = 70.63 - (0.654)(72.81) ≈ 23.03

So the equation of the best-fitting line through these 16 points is

y' = 23.03 + 0.654x

Or, in terms of Performance and Aptitude, it’s

Predicted Performance = 23.03 + 0.654 × Aptitude

remember The slope and the intercept of a regression line are generically called regression coefficients.

Using regression for forecasting

Based on this sample and this regression line, you can take an applicant’s Aptitude score — say, 85 — and predict the applicant’s Performance:

Predicted Performance = 23.03 + 0.654 × 85 ≈ 78.6

Without this regression line, the only prediction is the mean Performance, 70.63.
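By the way, you don’t have to wait for the R section later in this chapter to check these numbers. Here’s a sketch that applies the slope and intercept formulas directly to the scores in Table 14-2 (the vector names match the ones I use later in this chapter):

Aptitude <- c(45, 81, 65, 87, 68, 91, 77, 61, 55, 66, 82, 93,
      76, 83, 61, 74)
Performance <- c(56, 74, 56, 81, 75, 84, 68, 52, 57, 82, 73, 90,
      67, 79, 70, 66)

# slope: sum of cross-products divided by the sum of squared x-deviations
b <- sum((Aptitude - mean(Aptitude)) * (Performance - mean(Performance))) /
     sum((Aptitude - mean(Aptitude))^2)

# intercept: mean of y minus slope times mean of x
a <- mean(Performance) - b * mean(Aptitude)

b                # about 0.654
a                # about 23.03
a + b * 85       # predicted Performance for an Aptitude of 85, about 78.6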

Variation around the regression line

In Chapter 5, I describe how the mean doesn’t tell the whole story about a set of data. You have to show how the scores vary around the mean. For that reason, I introduce the variance and standard deviation.

You have a similar situation here. To get the full picture of the relationship in a scatter plot, you have to show how the scores vary around the regression line. Here, I introduce the residual variance and standard error of estimate, which are analogous to the variance and the standard deviation.

The residual variance is sort of an average of the squared deviations of the observed y-values around the predicted y-values. Each deviation of a data point from a predicted point (y - y') is called a residual; hence, the name. The formula is

s²yx = Σ(y - y')² / (N - 2)

I say “sort of” because the denominator is N-2 rather than N. Telling you the reason for the –2 is beyond the scope of this discussion. As I mention earlier, the denominator of a variance estimate is degrees of freedom (df), and that concept comes in handy in a little while.

The standard error of estimate is

syx = √[Σ(y - y')² / (N - 2)]

To show you how the residual error and the standard error of estimate play out for the data in the example, here’s Table 14-3. This table extends Table 14-2 by showing the predicted Performance score for each given Aptitude score:

TABLE 14-3 Aptitude Scores, Performance Scores, and Predicted Performance Scores for 16 FarMisht Consultants

Consultant   Aptitude   Performance   Predicted Performance
1            45         56            52.44
2            81         74            75.98
3            65         56            65.52
4            87         81            79.90
5            68         75            67.48
6            91         84            82.51
7            77         68            73.36
8            61         52            62.90
9            55         57            58.98
10           66         82            66.17
11           82         73            76.63
12           93         90            83.82
13           76         67            72.71
14           83         79            77.28
15           61         70            62.90
16           74         66            71.40
Mean         72.81      70.63
Variance     181.63     126.65
Standard Deviation   13.48   11.25

As the table shows, sometimes the predicted Performance score is pretty close, and sometimes it’s not.

For these data, the residual variance is

s²yx = Σ(y - y')² / (N - 2) = 735.7 / 14 ≈ 52.55

The standard error of estimate is

syx = √52.55 ≈ 7.25
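Here’s a short R sketch of those two calculations. It reuses the Aptitude, Performance, a, and b objects from the earlier sketch in “Using regression for forecasting”:

predicted <- a + b * Aptitude                    # y' for each consultant
resid.var <- sum((Performance - predicted)^2) / (length(Performance) - 2)
resid.var                                        # residual variance, about 52.55
sqrt(resid.var)                                  # standard error of estimate, about 7.25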

If the residual variance and the standard error of estimate are small, the regression line is a good fit to the data in the scatter plot. If the residual variance and the standard error of estimate are large, the regression line is a poor fit.

What’s “small”? What’s “large”? What’s a “good” fit?

Keep reading.

Testing hypotheses about regression

The regression equation you are working with:

y' = a + bx

summarizes a relationship in a scatter plot of a sample. The regression coefficients a and b are sample statistics. You can use these statistics to test hypotheses about population parameters, and that’s what you do in this section.

The regression line through the population that produces the sample (like the entire set of FarMisht consultants) is the graph of an equation that consists of parameters rather than statistics. By convention, remember, Greek letters stand for parameters, so the regression equation for the population is

y = α + βx + ε

The first two Greek letters on the right are α (alpha) and β (beta), the equivalents of a and b. What about that last one? It looks something like the Greek equivalent of e. What’s it doing there?

That last term is the Greek letter epsilon. It represents “error” in the population. In a way, error is an unfortunate term. It’s a catchall for “things you don’t know or things you have no control over.” Error is reflected in the residuals — the deviations from the predictions. The more you understand about what you’re measuring, the more you decrease the error.

You can’t measure the error in the relationship between Aptitude and Performance, but it’s lurking there. Someone might score low on the aptitude test, for example, and then go on to have a wonderful consulting career with a higher-than-predicted Performance. On a scatter plot, this person’s Aptitude-Performance point looks like an error in prediction. As you find out more about that person, you might discover that she was sick on the day of the aptitude test, and that explains the “error.”

You can test hypotheses about α, β, and ε, and that’s what you do in the upcoming subsections.

Testing the fit

You begin with a test of how well the regression line fits the scatter plot. This is a test of ε, the error in the relationship.

The objective is to decide whether or not the line really does represent a relationship between the variables. It’s possible that what looks like a relationship is just due to chance and the equation of the regression line doesn’t mean anything (because the amount of error is overwhelming) — or it’s possible that the variables are strongly related.

These possibilities are testable, and you set up hypotheses to test them:

  • H0: No real relationship
  • H1: Not H0

Although those hypotheses make nice light reading, they don’t set up a statistical test. To set up the test, you have to consider the variances. To consider the variances, you start with the deviations. Figure 14-3 focuses on one point in a scatter plot and its deviation from the regression line (the residual) and from the mean of the y-variable. It also shows the deviation between the regression line and the mean.


FIGURE 14-3: The deviations in a scatter plot.

As the figure shows, the distance between the point and the regression line and the distance between the regression line and the mean add up to the distance between the point and the mean:

y - ȳ = (y - y') + (y' - ȳ)

This sets the stage for some other important relationships.

Start by squaring each deviation. That gives you (y - y')², (y' - ȳ)², and (y - ȳ)². If you add up each of the squared deviations, you have

Σ(y - y')²

You just saw this one. That's the numerator for the residual variance. It represents the variability around the regression line — the “error” I mention earlier. In the terminology of Chapter 12, the numerator of a variance is called a sum of squares, or SS. So this is SSResidual.

Σ(y' - ȳ)²

This one is new. The deviation (y' - ȳ) represents the gain in prediction due to using the regression line rather than the mean. The sum reflects this gain and is called SSRegression.

Σ(y - ȳ)²

I show you this one in Chapter 5 — although I use x rather than y. That’s the numerator of the variance of y. In Chapter 12 terms, it’s the numerator of total variance. This one is SSTotal.

This relationship holds among these three sums:

SSTotal = SSRegression + SSResidual

Each one is associated with a value for degrees of freedom — the denominator of a variance estimate. As I point out in the preceding section, the df for SSResidual is N–2. The df for SSTotal is N–1. (See Chapters 5 and 12.) As with the SS, the degrees of freedom add up:

dfTotal = dfRegression + dfResidual

This leaves one degree of freedom for Regression.

Where is this all headed, and what does it have to do with hypothesis testing? Well, since you asked, you get variance estimates by dividing SS by df. Each variance estimate is called a mean-square, abbreviated MS (again, see Chapter 12):

MSRegression = SSRegression / dfRegression
MSResidual = SSResidual / dfResidual
MSTotal = SSTotal / dfTotal

Now for the hypothesis part. If H0 is true and what looks like a relationship between x and y is really no big deal, the piece that represents the gain in prediction because of the regression line (MSRegression) should be no greater than the variability around the regression line (MSResidual). If H0 is not true, and the gain in prediction is substantial, then MSRegression should be a lot bigger than MSResidual.

So the hypotheses are now set up as

  • H0: σ2Regression ≤ σ2Residual
  • H1: σ2Regression > σ2Residual

These are hypotheses you can test. How? To test a hypothesis about two variances, you use an F test. (See Chapter 11.) The test statistic here is

F = MSRegression / MSResidual

To show you how it all works, I apply the formulas to the FarMisht example. The MSResidual is the same as s²yx from the preceding section, and that value is 52.55. The MSRegression is

MSRegression = SSRegression / dfRegression = 1164.1 / 1 = 1164.1

This sets up the F:

F = MSRegression / MSResidual = 1164.1 / 52.55 = 22.15

With 1 and 14 df and α = .05, the critical value of F is 4.60. (Use qf() to verify.) The calculated F is greater than the critical F, so the decision is to reject H0. That means the regression line provides a good fit to the data in the sample.
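If you’d like to verify those mean squares and the F-ratio in R, here’s a sketch that builds on the objects from the earlier sketches:

SS.Total <- sum((Performance - mean(Performance))^2)    # about 1899.8
SS.Residual <- sum((Performance - predicted)^2)         # about 735.7
SS.Regression <- SS.Total - SS.Residual                 # about 1164.1
MS.Regression <- SS.Regression / 1                      # 1 df for Regression
MS.Residual <- SS.Residual / (length(Performance) - 2)  # 14 df for Residual
MS.Regression / MS.Residual                             # F, about 22.15
qf(.95, df1 = 1, df2 = 14)                              # critical F, about 4.60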

Testing the slope

Another question that arises in linear regression is whether the slope of the regression line is significantly different from zero. If it’s not, the mean is just as good a predictor as the regression line.

The hypotheses for this test are

  • H0: β ≤ 0
  • H1: β > 0

The statistical test is t, which I discuss in Chapters 9, 10, and 11 in connection with means. The t-test for the slope is

t = (b - β) / sb

with df = N–2. The denominator estimates the standard error of the slope. This term sounds more complicated than it is. The formula is

sb = syx / (sx √(N - 1))

where syx is the standard error of estimate and sx is the standard deviation of the x-variable. For the data in the example,

sb = 7.25 / (13.48 × √15) ≈ 0.139
t = (0.654 - 0) / 0.139 ≈ 4.71

This is larger than the critical value of t for this one-tailed test with 14 df and α = .05 (1.76), so the decision is to reject H0.
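Here’s the slope test as a quick R sketch, again reusing objects from the earlier sketches:

s.yx <- sqrt(MS.Residual)                                  # standard error of estimate
s.b <- s.yx / (sd(Aptitude) * sqrt(length(Aptitude) - 1))  # about 0.139
b / s.b                                                    # t, about 4.71
qt(.95, df = 14)                                           # one-tailed critical t, about 1.76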

Testing the intercept

Finally, here’s the hypothesis test for the intercept. The hypotheses are

  • H0: α = 0
  • H1: α ≠ 0

The test, once again, is a t-test. The formula is

t = (a - α) / sa

The denominator is the estimate of the standard error of the intercept. Without going into detail, the formula for sa is

sa = syx √(1/N + x̄² / ((N - 1)sx²))

where N is the number of x-y pairs, sx² is the variance of the x-variable, and x̄² is the squared mean of the x-variable. Applying this formula to the data in the example,

sa = 7.25 × √(1/16 + (72.81)² / (15 × 181.63)) ≈ 10.27

The t-test is

t = 23.03 / 10.27 ≈ 2.24

With 14 degrees of freedom, and the probability of a Type I error at .05, the critical t is 2.14 for a two-tailed test. It’s a two-tailed test because H1 is that the intercept doesn’t equal zero — it doesn’t specify whether the intercept is greater than zero or less than zero. Because the calculated value is greater than the critical value, the decision is to reject H0.
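And the intercept test as a sketch, one more time reusing the earlier objects:

N <- length(Aptitude)
s.a <- s.yx * sqrt(1/N + mean(Aptitude)^2 / ((N - 1) * var(Aptitude)))
s.a                         # about 10.27
a / s.a                     # t, about 2.24
qt(.975, df = 14)           # two-tailed critical t, about 2.14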

Linear Regression in R

Time to see how R handles linear regression. To start the analysis for this example, create a vector for the Aptitude scores and another for the Performance scores:

Aptitude <- c(45, 81, 65, 87, 68, 91, 77, 61, 55, 66, 82, 93,
      76, 83, 61, 74)

Performance <- c(56, 74, 56, 81, 75, 84, 68, 52, 57, 82, 73, 90,
      67, 79, 70, 66)

Then use the two vectors to create a data frame

FarMisht.frame <- data.frame(Aptitude,Performance)

The lm() (linear model) function performs the analysis:

FM.reg <-lm(Performance ~ Aptitude, data=FarMisht.frame)

As always, the tilde (~) operator signifies “depends on,” so this formula says that Performance (the dependent variable) depends on Aptitude (the independent variable).

Applying summary() to FM.reg produces the regression information:

> summary(FM.reg)

Call:
lm(formula = Performance ~ Aptitude, data = FarMisht.frame)

Residuals:
Min 1Q Median 3Q Max
-10.9036 -5.3720 -0.4379 4.2111 15.8281

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.0299 10.2732 2.242 0.041697 *
Aptitude 0.6537 0.1389 4.707 0.000337 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.249 on 14 degrees of freedom
Multiple R-squared: 0.6128, Adjusted R-squared: 0.5851
F-statistic: 22.15 on 1 and 14 DF, p-value: 0.0003368

The first couple of lines provide summary information about the residuals. The coefficients table shows the intercept and slope of the regression line. If you divide each number in the Estimate column by the adjoining number in the Std. Error column, you get the number in the t value column. These t-values, of course, are the significance tests I mention earlier for the intercept and the slope. The low p-values (each less than .05) indicate rejection of the null hypothesis (that the coefficient is 0) for each coefficient.

The bottom part of the output shows the info on how well the line fits the scatter plot. It presents the standard error of the residual, followed by Multiple R-squared and Adjusted R-squared. These last two range from 0 to 1.00 (the higher the value, the better the fit). I discuss them in Chapter 15, but for now I’ll leave them alone. The F-statistic corresponds to the F-ratio I show you earlier. Its high value and low associated p-value indicate that the line is a great fit to the scatter plot.

remember I refer to the result of the linear regression analysis as “the linear model.”

Features of the linear model

The linear model produced by lm() is an object that provides information, if you ask for it in the right way. As I already showed you, applying summary() gives all the information you need about the analysis.

You can also zero in on the coefficients:

> coefficients(FM.reg)
(Intercept) Aptitude
23.029869 0.653667

and on their confidence intervals:

> confint(FM.reg)
2.5 % 97.5 %
(Intercept) 0.9961369 45.0636002
Aptitude 0.3558034 0.9515307

Applying fitted(FM.reg) produces the fitted values, and residuals(FM.reg) gives the residuals.

Making predictions

The value of linear regression is that it gives you the ability to predict, and R provides a function that does just that: predict() applies a set of x-values to the linear model and returns the predicted values. Imagine two applicants with Aptitude scores of 85 and 62:

predict(FM.reg,data.frame(Aptitude=c(85,62)))

The first argument is the linear model, and the second makes a data frame out of the vector of values for the independent variable. Running this function produces these predicted values:

1 2
78.59157 63.55723
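If you also want interval estimates around those predictions, predict() accepts an interval argument. Here’s a quick sketch: "confidence" gives a confidence interval for the mean prediction at each Aptitude score, and "prediction" gives a wider interval for an individual applicant’s score:

predict(FM.reg, data.frame(Aptitude=c(85,62)), interval = "confidence")
predict(FM.reg, data.frame(Aptitude=c(85,62)), interval = "prediction")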

Visualizing the scatter plot and regression line

With the ggplot2 package, you can visualize a scatter plot and its regression line in three statements. The first statement, as always, indicates the data source and maps the components of the data to components of the plot:

ggplot(FarMisht.frame,aes(x=Aptitude,y=Performance))

The second statement plots points in the graph

geom_point()

and the third specifies a geom function that adds the regression line (as indicated by the method = lm argument):

geom_smooth(method=lm)

Putting all three together

ggplot(FarMisht.frame,aes(x=Aptitude,y=Performance)) +
geom_point()+
geom_smooth(method=lm)

produces Figure 14-4.


FIGURE 14-4: Scatter plot and regression line for the 16 FarMisht consultants.

The shaded area represents the 95 percent confidence interval around the regression line.

Plotting the residuals

After a regression analysis, it’s a good idea to plot the residuals against the predicted values. If the residuals form a random pattern around a horizontal line at zero, that’s evidence in favor of a linear relationship between the independent variable and the dependent variable.

Figure 14-5 shows the residual plot for the example. The pattern of residuals around the line is consistent with a linear model.


FIGURE 14-5: Residuals plot for the FarMisht example.

The plot is based on FM.reg, the linear model. Here’s the ggplot() statement:

ggplot(FM.reg, aes(x=fitted(FM.reg), y=residuals(FM.reg)))

The x and y mappings are based on information from the analysis. As you might guess, fitted(FM.reg) retrieves the predicted values, and residuals(FM.reg) retrieves the residuals.

To plot points, add the appropriate geom function:

geom_point()

and then a geom function for the dashed horizontal line whose y-intercept is 0:

geom_hline(yintercept = 0, linetype = "dashed" )

So the code for Figure 14-5 is

ggplot(FM.reg, aes(x=fitted(FM.reg), y=residuals(FM.reg)))+
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed" )
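Base R, by the way, offers a quicker (if plainer) alternative. Applying plot() to the linear model produces a set of diagnostic plots, and the first one is residuals versus fitted values:

plot(FM.reg, which = 1)   # residuals vs. fitted values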

Juggling Many Relationships at Once: Multiple Regression

Linear regression is a great tool for making predictions. When you know the slope and the intercept of the line that relates two variables, you can take a new x-value and predict a new y-value. In the example you’ve been working through in this chapter, you take an Aptitude score and predict a Performance score for a FarMisht applicant.

What if you knew more than just the Aptitude score for each applicant? For example, imagine that the FarMisht management team decides that a particular personality type is ideal for their consultants. So they develop the FarMisht Personality Inventory, a 20-point scale in which a higher score indicates greater compatibility with the FarMisht corporate culture and, presumably, predicts better performance. The idea is to use that data along with Aptitude scores to predict performance.

Table 14-4 shows the Aptitude, Performance, and Personality scores for the 16 current consultants. Of course, in a real-life corporation, you might have many more employees in the sample.

TABLE 14-4 Aptitude, Performance, and Personality Scores for 16 FarMisht Consultants

Consultant   Aptitude   Performance   Personality
1            45         56            9
2            81         74            15
3            65         56            11
4            87         81            15
5            68         75            14
6            91         84            19
7            77         68            12
8            61         52            10
9            55         57            9
10           66         82            14
11           82         73            15
12           93         90            14
13           76         67            16
14           83         79            18
15           61         70            15
16           74         66            12
Mean         72.81      70.63         13.63
Variance     181.63     126.65        8.65
Standard Deviation   13.48   11.25    2.94

When you work with more than one independent variable, you’re in the realm of multiple regression. As in linear regression, you find regression coefficients. In the case of two independent variables, you’re looking for the best-fitting plane through a three-dimensional scatter plot. Once again, “best-fitting” means that the sum of the squared distances from the data points to the plane is a minimum.

Here’s the equation of the regression plane:

y' = a + b₁x₁ + b₂x₂

For this example, that translates to

Predicted Performance = a + b₁ × Aptitude + b₂ × Personality

You can test hypotheses about the overall fit, and about all three of the regression coefficients.

I don’t walk you through all the formulas for finding the coefficients, because that gets really complicated. Instead, I go right to the R analysis.

Here are a few things to bear in mind before I proceed:

  • You can have any number of x-variables. (I use two in this example.)
  • Expect the coefficient for Aptitude to change from linear regression to multiple regression. Expect the intercept to change, too.
  • Expect the standard error of estimate to decrease from linear regression to multiple regression. Because multiple regression uses more information than linear regression, it reduces the error.

Multiple regression in R

I begin by adding a vector for the personality scores in Column 4 of Table 14-4:

Personality <- c(9, 15, 11, 15, 14, 19, 12, 10, 9, 14, 15, 14,
      16, 18, 15, 12)

And then I add that vector to the data frame:

FarMisht.frame["Personality"] = Personality

Applying lm() produces the analysis:

FM.multreg <- lm(Performance ~ Aptitude + Personality,
      data = FarMisht.frame)

And applying summary() gives the information:

> summary(FM.multreg)

Call:
lm(formula = Performance ~ Aptitude + Personality, data       = FarMisht.frame)

Residuals:
Min 1Q Median 3Q Max
-8.689 -2.834 -1.840 2.886 13.432

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.2825 9.6595 2.100 0.0558 .
Aptitude 0.3905 0.1949 2.003 0.0664 .
Personality 1.6079 0.8932 1.800 0.0951 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.73 on 13 degrees of freedom
Multiple R-squared: 0.69, Adjusted R-squared: 0.6423
F-statistic: 14.47 on 2 and 13 DF, p-value: 0.0004938

So the generic equation for the regression plane is

y' = 20.28 + 0.39x₁ + 1.61x₂

Or, in terms of this example

Predicted Performance = 20.28 + 0.39 × Aptitude + 1.61 × Personality

Again, the high F-value and low p-value indicate that the regression plane is an excellent fit for the scatter plot.

Making predictions

Once again, predict() enables predictions of Performance. This time, I use it with the multiple regression model: FM.multreg. Imagine two applicants: One has Aptitude and Personality scores of 85 and 14, and the other has Aptitude and Personality scores of 62 and 17. This requires two vectors — one for the Aptitude scores and one for the Personality scores:

> predict(FM.multreg, data.frame(Aptitude = c(85,62),
      Personality=c(14,17)))

1 2
75.98742 71.82924
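You can check the first of those predictions by hand with the coefficients from the summary. Here’s a quick sketch:

b.mult <- coefficients(FM.multreg)
b.mult[1] + b.mult[2]*85 + b.mult[3]*14    # about 75.99, matching predict()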

Visualizing the 3D scatter plot and regression plane

The ggplot2 package, for all its wonderful features, does not provide a way to draw 3-dimensional graphics — like a scatter plot for a dependent variable and two independent variables. Never fear, however: R has a number of other ways to do this. In this section, I show you two of them.

The scatterplot3d package

If you want to make a nifty three-dimensional scatter plot like the one shown in Figure 14-6 (a figure that looks good on a printed page), the scatterplot3d() function is for you.


FIGURE 14-6: Scatter plot for the FarMisht multiple regression example, rendered in scatterplot3d().

First, install the scatterplot3d package. On the Packages tab, find scatterplot3d and select its check box.

Next, write a statement that creates the plot:

with (FarMisht.frame,
(splot <- scatterplot3d(Performance ~ Aptitude +
      Personality, type = "h", pch = 19)))

If you use with(), you don’t have to repeat the name of the data frame three times. The first argument to scatterplot3d() is the formula for setting up the linear model. The second argument adds the vertical lines from the x-y plane to the data points. Those vertical lines aren’t absolutely necessary, but I think they help the viewer understand where the points are in the plot. The third argument specifies what the plot characters look like.

The function produces an object that you can use to embellish the plot. For example, here’s how to add the regression plane and produce Figure 14-7:

splot$plane3d(FM.multreg,lty="dashed")


FIGURE 14-7: Scatter plot for the FarMisht multiple regression example, complete with regression plane.

car and rgl: A package deal

If you have to present a 3D scatter plot to an audience and you want to dazzle them with an interactive plot, the next method is for you.

The plot-creating function is called scatter3d(), and it lives in the car package. On the Packages tab, click Install. In the Install Packages dialog box, type car and click Install. When car appears on the Packages tab, select its check box.

This function works with the rgl package, which uses tools from the Open Graphics Library (OpenGL), a toolset for creating 2D and 3D graphics. You’ll find OpenGL tools at work in virtual reality, computer-aided design, flight simulation, and a number of other applications.

On the Packages tab, find rgl and select its check box.

With those two packages installed, run this function:

scatter3d(Performance ~ Aptitude + Personality,
      data=FarMisht.frame)

This opens an RGL window with the 3D scatter plot shown in Figure 14-8. As you can see, the scatter plot shows the regression plane and the residuals.


FIGURE 14-8: Scatter plot for the FarMisht multiple regression example, rendered in scatter3d().

You can move the mouse inside this plot, press the left mouse button, and rotate the plot to present different angles. You can also use the scroll wheel to zoom in or out of the plot. Try it!

ANOVA: Another Look

Here’s a statement you might find radical: Analysis of variance and linear regression are really the same thing.

They’re both part of what’s called the General Linear Model (GLM). In linear regression, the objective is to predict a value of a dependent variable given a value of an independent variable. In ANOVA, the objective is to decide whether several sample means differ enough from one another to enable you to reject the null hypothesis about levels of the independent variable.

How are they similar? It’s easier to see the connection if you rethink ANOVA: Given the data, imagine that the objective is to predict the dependent variable given the level of the independent variable. What would be the best prediction? For any level of the independent variable, that would be the mean of the sample for that level — also known as the “group mean.” This means that deviations from the group mean (the best predicted value) are residuals, and this is why, in an R ANOVA, the MSError is called MSResiduals.

It goes deeper than that. To show you how, I revisit the ANOVA example from Chapter 12. For convenience, here’s Table 12-1 reproduced as Table 14-5.

TABLE 14-5 Data from Three Training Methods (ANOVA Example from Chapter 12)

                     Method 1   Method 2   Method 3
                     95         83         68
                     91         89         75
                     89         85         79
                     90         89         74
                     99         81         75
                     88         89         81
                     96         90         73
                     98         82         77
                     95         84
                                80
Mean                 93.44      85.20      75.25
Variance             16.28      14.18      15.64
Standard Deviation   4.03       3.77       3.96

You have to test

  • H0: μ1 = μ2 = μ3
  • H1: Not H0

To use the aov() function to produce an analysis of variance, set up the data in long format. Here are the first six rows:

> head(Training.frame)
Method Score
1 method1 95
2 method1 91
3 method1 89
4 method1 90
5 method1 99
6 method1 88
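If you don’t have Training.frame handy from Chapter 12, here’s one way to rebuild it from the scores in Table 14-5 (a sketch; the column names match the output above):

Score <- c(95, 91, 89, 90, 99, 88, 96, 98, 95,        # Method 1
           83, 89, 85, 89, 81, 89, 90, 82, 84, 80,    # Method 2
           68, 75, 79, 74, 75, 81, 73, 77)            # Method 3
Method <- rep(c("method1","method2","method3"), times = c(9, 10, 8))
Training.frame <- data.frame(Method, Score)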

The result of the analysis is

> analysis <-aov(Score~Method,data = Training.frame)
> summary(analysis)
Df Sum Sq Mean Sq F value Pr(>F)
Method 2 1402.7 701.3 45.82 6.38e-09 ***
Residuals 24 367.3 15.3
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What if you tried a linear regression analysis on the data?

> reg.analysis <-lm(Score~Method,data = Training.frame)
> summary(reg.analysis)

Call:
lm(formula = Score ~ Method, data = Training.frame)

Residuals:
Min 1Q Median 3Q Max
-7.250 -2.822 -0.250 3.775 5.750

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 93.444 1.304 71.657 < 2e-16 ***
Methodmethod2 -8.244 1.798 -4.587 0.000119 ***
Methodmethod3 -18.194 1.901 -9.571 1.15e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.912 on 24 degrees of freedom
Multiple R-squared: 0.7925, Adjusted R-squared: 0.7752
F-statistic: 45.82 on 2 and 24 DF, p-value: 6.381e-09

You see a good bit more information than in the ANOVA table, but the bottom line shows the same F-ratio and associated information as the analysis of variance. Also, the coefficients provide the group means: The intercept (93.444) is the mean of Method 1, the intercept plus the second coefficient (–8.244) is the mean of Method 2 (85.20), and the intercept plus the third coefficient (–18.194) is the mean of Method 3 (75.25). Check the means in Table 14-5, if you don’t believe me.

A bit more on the coefficients: The Intercept represents Method 1, which is a baseline against which to compare each of the others. The t-value for Method 2 (along with its associated probability, which is much less than .05) shows that Method 2 differs significantly from Method 1. It’s the same story for Method 3, which also differs significantly from Method 1.
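A quick sketch that recovers the group means from those coefficients:

b.anova <- coefficients(reg.analysis)
b.anova[1]                 # mean of Method 1, 93.44
b.anova[1] + b.anova[2]    # mean of Method 2, 85.20
b.anova[1] + b.anova[3]    # mean of Method 3, 75.25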

Here’s a question that should be forming in your mind: How can you perform a linear regression when the independent variable (Method) is categorical rather than numerical?

Glad you asked.

remember To perform a regression analysis with categorical data, R (and other statistical software packages) recodes the levels of a variable like Method into combinations of numeric dummy variables. The only values a dummy variable can take are 0 or 1: 0 indicates the absence of a categorical value; 1 indicates the presence of a categorical value.

I’ll do this manually. For the three levels of Method (Method 1, Method 2, and Method 3), I need two dummy variables. I’ll call them D1 and D2. Here’s how I (arbitrarily) assign the values:

  • For Method 1, D1 = 0 and D2 = 0
  • For Method 2, D1 = 1, and D2 = 0
  • For Method 3, D1 = 0, and D2 = 1
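Here’s a sketch that builds those two dummy columns from Training.frame (as defined earlier) and puts them into a new data frame, following the assignments in the preceding list:

D1 <- ifelse(Training.frame$Method == "method2", 1, 0)
D2 <- ifelse(Training.frame$Method == "method3", 1, 0)
Training.frame.w.Dummies <- data.frame(Method = Training.frame$Method,
                                       D1, D2,
                                       Score = Training.frame$Score)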

To illustrate further, here’s a data frame called Training.frame.w.Dummies. Ordinarily, I wouldn’t show you all 27 rows of a data frame, but here I think it’s instructive:

> Training.frame.w.Dummies
Method D1 D2 Score
1 method1 0 0 95
2 method1 0 0 91
3 method1 0 0 89
4 method1 0 0 90
5 method1 0 0 99
6 method1 0 0 88
7 method1 0 0 96
8 method1 0 0 98
9 method1 0 0 95
10 method2 1 0 83
11 method2 1 0 89
12 method2 1 0 85
13 method2 1 0 89
14 method2 1 0 81
15 method2 1 0 89
16 method2 1 0 90
17 method2 1 0 82
18 method2 1 0 84
19 method2 1 0 80
20 method3 0 1 68
21 method3 0 1 75
22 method3 0 1 79
23 method3 0 1 74
24 method3 0 1 75
25 method3 0 1 81
26 method3 0 1 73
27 method3 0 1 77

These lines of code

model.w.Dummies <- lm(Score ~ D1 + D2,
            data= Training.frame.w.Dummies)
summary(model.w.Dummies)

produce the same result as the analysis of variance and the linear regression I showed you earlier. The only difference is that the coefficients are expressed in terms of the dummy variables:

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 93.444 1.304 71.657 < 2e-16 ***
D1 -8.244 1.798 -4.587 0.000119 ***
D2 -18.194 1.901 -9.571 1.15e-09 ***

So, dummy variables enable a linear regression model with categorical independent variables. In fact, linear regression with categorical independent variables is the analysis of variance.

Analysis of Covariance: The Final Component of the GLM

In this chapter, I’ve shown you how linear regression works with a numeric independent (predictor) variable, and with a categorical independent (predictor) variable. Is it possible to have a study with both a numeric predictor variable and a categorical predictor variable?

Absolutely! The analytical tool for this type of study is called the Analysis of Covariance (ANCOVA). It’s the third and final component of the General Linear Model. (Linear regression and ANOVA are the first two.) The easiest way to describe it is with an example.

Make sure you have the MASS package installed. On the Packages tab, find its check box and select it, if it isn’t already. In the MASS package is a data frame called anorexia. (I use it in Chapter 2.) This data frame contains data for 72 young women randomly selected for one of three types of treatment for anorexia: Cont (a control condition with no therapy), CBT (cognitive behavioral therapy), or FT (family treatment).

Here are the first six rows:

> head(anorexia)
Treat Prewt Postwt
1 Cont 80.7 80.2
2 Cont 89.4 80.1
3 Cont 91.8 86.4
4 Cont 74.0 86.3
5 Cont 78.1 76.1
6 Cont 88.3 78.1

Prewt is the weight before treatment, and Postwt is the weight after treatment. What you need, of course, is a variable that indicates the amount of weight gained during treatment. I’ll call it WtGain, and here’s how to add it to the data frame:

anorexia["WtGain"]=anorexia["Postwt"]-anorexia["Prewt"]

Now:

> head(anorexia)
Treat Prewt Postwt WtGain
1 Cont 80.7 80.2 -0.5
2 Cont 89.4 80.1 -9.3
3 Cont 91.8 86.4 -5.4
4 Cont 74.0 86.3 12.3
5 Cont 78.1 76.1 -2.0
6 Cont 88.3 78.1 -10.2

Figure 14-9 plots the data points for this data frame.


FIGURE 14-9: Weight Gain versus Treat in the anorexia data frame.

Here’s the code for this plot, in case you’re curious:

ggplot(anorexia,aes(x=Treat,y=WtGain))+
geom_point()

An analysis of variance or a linear regression analysis would be appropriate to test these:

H0: μCont = μCBT = μFT

H1: Not H0

Here’s the linear regression model:

> anorexia.linreg <-lm(WtGain ~ Treat, data=anorexia)
> summary(anorexia.linreg)

Call:
lm(formula = WtGain ~ Treat, data = anorexia)

Residuals:
Min 1Q Median 3Q Max
-12.565 -4.543 -1.007 3.846 17.893

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.007 1.398 2.151 0.0350 *
TreatCont -3.457 2.033 -1.700 0.0936 .
TreatFT 4.258 2.300 1.852 0.0684 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.528 on 69 degrees of freedom
Multiple R-squared: 0.1358, Adjusted R-squared: 0.1108
F-statistic: 5.422 on 2 and 69 DF, p-value: 0.006499

The F-ratio and p-value in the bottom line tell you that you can reject the null hypothesis.

Let’s look at the coefficients. The intercept represents CBT. This is the baseline against which you compare the other treatments. The t-values and associated probabilities (greater than .05) tell you that neither of those levels differs from CBT. The significant F-ratio must result from some other comparisons.

Also, check the coefficients against the treatment means. Here’s a quick and easy way to find the treatment means: Use the function tapply() to apply mean() and find the mean WtGain in the levels of Treat:

> with (anorexia, tapply(WtGain,Treat,mean))
CBT Cont FT
3.006897 -0.450000 7.264706

The intercept, remember, is the mean for CBT. Add the intercept to the next coefficient to calculate the mean for Cont, and add the intercept to the final coefficient to calculate the mean for FT.
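A quick sketch of that arithmetic:

b.anor <- coefficients(anorexia.linreg)
b.anor[1]                 # mean for CBT, about 3.01
b.anor[1] + b.anor[2]     # mean for Cont, about -0.45
b.anor[1] + b.anor[3]     # mean for FT, about 7.26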

If you prefer to see the F-ratio and associated statistics in an ANOVA table, you can apply the anova() function to the model:

> anova(anorexia.linreg)
Analysis of Variance Table

Response: WtGain
Df Sum Sq Mean Sq F value Pr(>F)
Treat 2 614.6 307.322 5.4223 0.006499 **
Residuals 69 3910.7 56.677
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You can dig a little deeper. Suppose weight gain depends not only on type of treatment but also on a person’s initial weight (which is called a covariate). Taking Prewt into consideration might yield a more accurate picture. Treat is a categorical variable, and Prewt is a numerical variable. Figure 14-10 shows a plot based on the two variables.


FIGURE 14-10: Weight Gain versus Treat and Prewt in the anorexia data frame.

The code for this plot is

ggplot(anorexia, aes(x=Prewt,y=WtGain, shape = Treat)) +
geom_point(size=2.5)

The first statement maps Prewt to the x-axis, WtGain to the y-axis, and Treat to shape. Thus, the shape of a data point reflects its treatment group. The second statement specifies that points appear in the plot. Its size argument enlarges the data points and makes them easier to see.

For the analysis of covariance, I use the lm() function to create a model based on both Treat and Prewt:

> anorexia.T.and.P <- lm(WtGain ~ Treat + Prewt, data=anorexia)
> summary(anorexia.T.and.P)

Call:
lm(formula = WtGain ~ Treat + Prewt, data = anorexia)

Residuals:
Min 1Q Median 3Q Max
-14.1083 -4.2773 -0.5484 5.4838 15.2922

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.7711 13.3910 3.717 0.000410 ***
TreatCont -4.0971 1.8935 -2.164 0.033999 *
TreatFT 4.5631 2.1333 2.139 0.036035 *
Prewt -0.5655 0.1612 -3.509 0.000803 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.978 on 68 degrees of freedom
Multiple R-squared: 0.2683, Adjusted R-squared: 0.236
F-statistic: 8.311 on 3 and 68 DF, p-value: 8.725e-05

Note in the last line that the degrees of freedom have changed from the first analysis: Adding Prewt takes a degree of freedom from the residuals (69 becomes 68) and adds it to the model (2 df becomes 3). Note also that the F-ratio is higher and the p-value considerably lower than in the first analysis.

And now look at the coefficients. Unlike the original analysis, the t-values and associated probabilities (less than .05) for Cont and FT show that each one differs significantly from CBT.

So it seems that adding Prewt to the analysis has helped uncover treatment differences. Bottom line: The ANCOVA shows that when evaluating the effect of an anorexia treatment, it’s important to also know an individual’s pretreatment weight.

But “it seems” is not really enough for statisticians. Can you really be sure that the ANCOVA adds value? To find out, you have to compare the linear regression model with the ANCOVA model. To make the comparison, use the anova() function, which does double-duty: In addition to creating an ANOVA table for a model (which is the way you saw it used earlier), you can use it to compare models. Here’s how:

> anova(anorexia.linreg,anorexia.T.and.P)
Analysis of Variance Table

Model 1: WtGain ~ Treat
Model 2: WtGain ~ Treat + Prewt
Res.Df RSS Df Sum of Sq F Pr(>F)
1 69 3910.7
2 68 3311.3 1 599.48 12.311 0.0008034 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What do the numbers in the table mean? The RSS indicates the residual sums of squares from each model. They’re next to their degrees of freedom in the Res.DF column. In the Df column, 1 is the difference between the two Res.Dfs. In the Sum of Sq column, 599.48 is the difference between the two RSS. The F-ratio is the result of dividing two mean squares: The mean square for the numerator is 599.48 divided by its df (1), and the mean square for the denominator is 3311.3 divided by its df (68). The high F-ratio and low Pr(>F) (probability of a Type 1 error) tell you that adding Prewt significantly lowered the residual sum of squares. In English, that means it was a good idea to add Prewt to the mix.
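Here’s that arithmetic as a quick R sketch, using the (rounded) numbers from the table:

SS.diff <- 3910.7 - 3311.3                          # the Sum of Sq column (599.48 before rounding)
(SS.diff / 1) / (3311.3 / 68)                       # F, about 12.31
pf(12.311, df1 = 1, df2 = 68, lower.tail = FALSE)   # Pr(>F), about 0.0008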

technicalstuff Statisticians would say that this analysis statistically controls for the effects of the covariate (Prewt).

But wait — there’s more

In an analysis of covariance, it’s important to ask whether the relationship between the dependent variable and the numerical predictor variable is the same across the levels of the categorical variable. In this example, that’s the same as asking if the slope of the regression line between WtGain and Prewt is the same for the scores in Cont as it is for the scores in CBT and for the scores in FT. If the slopes are the same, that’s called homogeneity of regression. If not, you have an interaction of Prewt and Treat, and you have to be careful about how you state your conclusions.

Adding the regression lines to the plot in Figure 14-10 is helpful. To do this, I add this line to the code that produced Figure 14-10:

geom_smooth(method = lm,se = FALSE, aes(linetype=Treat))

This instructs ggplot to add a separate line that “smoothes” the data within each treatment group. The method argument specifies lm (linear modelling) so that each line is a regression line. The next argument, se=FALSE, prevents the plotting of the confidence interval around each line. Finally, the aesthetic mapping indicates that the line for each level of Treat will look different. So the full code is

ggplot(anorexia, aes(x=Prewt,y=WtGain, shape = Treat)) +
geom_point(size=2.5) +
geom_smooth(method = lm,se = FALSE, aes(linetype=Treat))

and the result is Figure 14-11.


FIGURE 14-11: Weight Gain versus Treat and Prewt in the anorexia data frame, with a regression line for the scores in each level of Treat.

As you can see, the three negatively sloped regression lines are not parallel. The line for CBT parallels the line for FT, but the line for Cont (the control condition) has a much steeper negative slope. Because patients in the control group received no therapy, this sounds fairly intuitive: Many of these anorexic patients (the heavier ones) continued to lose weight rather than gain it, resulting in the steep negative slope for that line.

Apparently, we have a Treat X Prewt interaction. Does the analysis bear this out?

To include the interaction in the model, I have to add Treat*Prewt to the formula:

anorexia.w.interaction <- lm(WtGain ~ Treat + Prewt +
      Treat*Prewt, data=anorexia)

Does adding the interaction make a difference?

> anova(anorexia.T.and.P,anorexia.w.interaction)
Analysis of Variance Table

Model 1: WtGain ~ Treat + Prewt
Model 2: WtGain ~ Treat + Prewt + Treat * Prewt
Res.Df RSS Df Sum of Sq F Pr(>F)
1 68 3311.3
2 66 2844.8 2 466.48 5.4112 0.006666 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It sure does! In your conclusions about this study, you have to include the caveat that the relationship between pre-weight and weight-gain is different for the control than it is for the cognitive-behavioral treatment and for the family treatment.