Multiple Linear Regression
15.1 The General Idea
Simple regression (Chapter 14) addresses a single explanatory variable (X) in relation to a response variable (Y):

X → Y

Multiple regression is an extension of simple linear regression that addresses multiple explanatory variables (X1, X2, …, Xk) in relation to a response variable (Y):

X1, X2, …, Xk → Y
To start our discussion of multiple regression, let X1 represent the explanatory variable of prime interest. All other explanatory variables in the model (X2, …, Xk) will be considered “extraneous” for now. The multiple regression model holds constant the influence of the extraneous variables, allowing us to more readily isolate the effects of the primary explanatory variable. The intention is to “adjust out” confounding effects, leaving the regression coefficient associated with the primary explanatory variable relatively unconfounded. When interest later shifts to (say) explanatory variable X2, it then becomes the primary explanatory variable and all others are extraneous.
It must be pointed out that the causal nature of an explanatory variable from a nonexperimental study is not tested in a strong way even with multiple regression models. Adjustments imposed by multiple regression are passive. To get a more robust test of cause, active experimentation is needed. “To find out what happens to a system when you interfere with it, you have to interfere with it (not just passively observe it).”a Therefore, as was the case with simple regression, it should not be taken for granted that statistics from multiple regression models reflect actual causal relations. An observed association may be causal or non-causal depending on nature, not on the model.
Finally, we will not consider all the mathematics behind the multiple regression model. Instead, we will rely on statistical software for computations. This will allow us to introduce the model without getting bogged down in its mathematical complexities.b
15.2 The Multiple Linear Regression Model
Simple linear regression and multiple regression are based on similar lines of reasoning.c In a simple regression population model, the expected value of Y at a given level of X is based on the linear equation:

μY|x = α + βx

where μY|x is the expected value of Y given x, α is the parameter indicating the model’s intercept, and β is the parameter indicating the model’s slope. We do not know the true values of the model parameters, so we estimate them with a least squares regression line that minimizes ∑residuals². This derives estimates of the regression parameters that predict values of Y with the equation:

ŷ = a + bx

where a is the intercept estimate and b is the slope estimate.
The standard error of the regression is the root mean square of the residuals:

sY|x = √(∑residuals² / (n − 2))
Inferential techniques use this standard error estimate in their application (Section 14.4).
Now consider a regression model with two explanatory variables. The expected value of Y at given levels of X1 and X2 is:

μY|x1,x2 = α + β1x1 + β2x2

As was the case with simple linear regression, a least squares method minimizing ∑residuals² is used to fit the model:

ŷ = a + b1x1 + b2x2

where a is the intercept estimate, b1 is the slope estimate for X1, and b2 is the slope estimate for X2.
While a simple regression model with one independent variable fits a response line in two-dimensional space, a multiple regression model with two independent variables fits a response plane in three-dimensional space. Figure 15.1 depicts a response plane for a multiple regression model with two explanatory variables.
Multiple regression models can accommodate more than two explanatory variables by fitting a response surface in k + 1 dimensional space, where k represents the number of explanatory variables in the model. For example, a model with three explanatory variables fits a response surface in four-dimensional space. Although we cannot visualize four dimensions, we can still use linear algebra to find the best-fitting response surface. As before, this is achieved by minimizing the ∑residuals² for the surface. The result is a multiple regression model with k explanatory variables in which:

ŷ = a + b1x1 + b2x2 + ··· + bkxk
FIGURE 15.1 Three-dimensional response plane.
The coefficients derived by this model have interpretations similar to those derived by simpler models. The intercept estimate (a) is where the regression surface crosses the Y axis. The slope estimate associated with X1 (b1) predicts the amount of change in Y per unit increase in X1 while holding the other variables in the model constant; the slope estimate associated with X2 (b2) predicts the amount of change in Y per unit increase in X2 while holding the other variables in the model constant; and so on.
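For readers who want to see the least-squares machinery in action, the following Python sketch (not part of the text, which relies on SPSS for computation) simulates a small data set with two explanatory variables and finds the coefficients a, b1, and b2 that minimize ∑residuals². The variable names and simulated values are illustrative assumptions only.

```python
import numpy as np

# Simulated illustration: 100 observations on two explanatory variables
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=100)
x2 = rng.uniform(0, 5, size=100)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(scale=0.5, size=100)

# Design matrix with a leading column of 1s for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])

# lstsq finds the coefficients that minimize the sum of squared residuals
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef
print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")

# Predicted values lie on the fitted response plane
y_hat = X @ coef
print("sum of squared residuals:", np.sum((y - y_hat) ** 2))
```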
ILLUSTRATIVE EXAMPLE
Data (Forced expiratory volumes). Our illustrative data comes from a health survey done in adolescents. The illustrative data set includes information on 654 individuals between 3 and 19 years of age. Table 15.1 is a code book for the variables in the data set.
The response variable is a measure of respiratory function called forced expiratory volume (variable name FEV) that measures respiratory capacity in liters/second units. High values represent high respiratory capacity. We will consider the effects of two explanatory variables: SMOKE (coded 0 for “nonsmoker” and 1 for “smoker”) and AGE (years). Individuals were classified as ever-smokers if they currently smoked or had at some time smoked as much as one cigarette per week. We wish to quantify the effects of SMOKE on FEV while adjusting for AGE.
TABLE 15.1 Summary statistics from illustrative data set, n = 654.
Data from Rosner, B. (1990). Fundamentals of Biostatistics (Third ed.). Belmont, CA: Duxbury Press. Data are stored online in the file FEV.*.
15.3 Categorical Explanatory Variables in Regression Models
Let us start by looking at the effect of smoking (variable name SMOKE) on forced expiratory volume (FEV) while ignoring AGE for now. The response variables in regression models are sometimes referred to as the dependent variables and the explanatory variables are referred to as independent variables. In this example, FEV is the dependent variable and SMOKE is the independent variable. Notice that SMOKE is categorical. Until this point, regression models we’ve considered have addressed quantitative independent (explanatory) variables only. Fortunately, regression models can accommodate binary explanatory variables as long as they are coded 0 for “absence of the attribute” and 1 for “presence of the attribute.” Variables of this sort are called indicator variables or dummy variables.
Regression models can handle categorical explanatory variables with more than two levels of the attribute by reexpressing the variable with multiple indicator variables. Table 15.2 describes how this is done. For now, we will address only binary explanatory variables.
Figure 15.2 is a scatterplot of FEV by SMOKE. The group means are 2.566 L/sec for SMOKE = 0 and 3.277 L/sec for SMOKE = 1. The line connecting the two means on the plot is the regression line for the relationship.
TABLE 15.2 Accommodating a categorical variable with more than two levels of the attribute. When the explanatory variable has k levels of the attribute, it is reexpressed with k − 1 dummy variables. For example, an attribute classified into three levels is reexpressed with two dummy variables. As an example, the variable SMOKE2 classifies individuals into three levels of smoking (0 = nonsmoker, 1 = former smoker, 2 = current smoker). To incorporate this information into a regression model, it is recoded as follows:
SMOKE2    DUMMY1    DUMMY2
0         0         0
1         1         0
2         0         1
Here’s an example of programming code that can be used for this purpose:
IF SMOKE2 = 1 THEN DUMMY1 = 1 ELSE DUMMY1 = 0
IF SMOKE2 = 2 THEN DUMMY2 = 1 ELSE DUMMY2 = 0
The variable DUMMY1 is coded so that 0 = not a former smoker, 1 = former smoker. The variable DUMMY2 is coded so that 0 = not a current smoker, 1 = current smoker. When DUMMY1 and DUMMY2 both equal 0, the individual is identified as a nonsmoker.
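The same recoding can be written in a general-purpose environment. This is a hypothetical Python sketch; the column name SMOKE2 and the tiny example data set are assumptions used only to illustrate the k − 1 dummy-variable scheme.

```python
import pandas as pd

# Hypothetical three-level smoking variable: 0 = nonsmoker, 1 = former, 2 = current
df = pd.DataFrame({"SMOKE2": [0, 1, 2, 2, 0, 1]})

# k = 3 levels -> k - 1 = 2 dummy variables; nonsmoker (0) is the reference level
df["DUMMY1"] = (df["SMOKE2"] == 1).astype(int)  # 1 = former smoker
df["DUMMY2"] = (df["SMOKE2"] == 2).astype(int)  # 1 = current smoker

# pandas can also generate the k - 1 dummies automatically
dummies = pd.get_dummies(df["SMOKE2"], prefix="SMOKE2", drop_first=True, dtype=int)
print(df.join(dummies))
```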
The regression line is linked to the means as follows:
• The mean response in group 0 = a.
• The mean response in group 1 = a + b.
• The difference between mean responses = (a + b) − a = b.
Figure 15.3 contains SPSS output for this regression model. FEV is the dependent variable and SMOKE is the independent variable. This output indicates an intercept (a) of 2.566 and slope (b) of 0.711. Therefore, the regression model is ŷ = 2.566 + 0.711x.
FIGURE 15.2 Scatterplot and regression line for forced expiratory volume in nonsmokers (SMOKE = 0) and smokers (SMOKE = 1), illustrative data set.
• The mean response in group 0 (a) is 2.566.
• The mean response in group 1 (a + b) is 2.566 + 0.711 = 3.277.
• The difference between mean responses (b) is 0.711.
Analogous statements can be made about the population regression model μY|x = α + βx:
• The expected response in group 0 is μY |x=0 = α + β(0) = α.
• The expected response in group 1 is μY |x=1 = α + β(1) = α + β.
• The difference in expectations μY |x=1 − μY |x=0 = (α + β) − α = β.
Thus, testing H0: μ1 − μ0 = 0 is equivalent to testing H0: β = 0. In addition, a confidence interval for β is equivalent to a confidence interval for μ1 − μ0.
FIGURE 15.3 Screenshot of SPSS output for the illustrative example in which the response variable (FEV) is regressed on the binary explanatory variable (SMOKE). Graph produced with SPSS for Windows, Rel. 11.0.1.2001. Chicago: SPSS Inc. Reprint Courtesy of International Business Machines Corporation.
The output in Figure 15.3 reveals tstat = 6.464 with 652 df and P ≈ 0.000. It also reveals a confidence interval for β of (0.495 to 0.927) L/sec. These results suggest that the children who smoked had substantially greater lung capacity than nonsmokers—but how could this be true given what we know about the deleterious effects of smoking? The answer lies in the fact that the children in the sample who smoked were older than those who did not smoke.d AGE confounded the observed relationship between SMOKE and FEV. Fortunately, a multiple regression model can be used to adjust for AGE while comparing FEV levels in smokers and nonsmokers.
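The regression of FEV on the 0/1 indicator SMOKE can be reproduced in any reliable package. The sketch below uses Python’s statsmodels and assumes the data have been exported to a file named fev.csv with columns FEV and SMOKE; the file name and layout are assumptions, since the original data are distributed as FEV.*.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("fev.csv")  # assumed file name and column layout

# Regress FEV on the 0/1 indicator SMOKE
fit = smf.ols("FEV ~ SMOKE", data=df).fit()
print(fit.params)      # intercept = mean FEV in nonsmokers; slope = difference in means
print(fit.conf_int())  # 95% CI for the slope = 95% CI for the difference in group means

# The same comparison expressed directly as group means
print(df.groupby("SMOKE")["FEV"].mean())
```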
15.4 Regression Coefficients
For multiple regression models, we rely on statistical packages to calculate the intercept and slope coefficients. Figure 15.4 depicts the SPSS dialog box needed to estimate the effect of SMOKE on FEV while adjusting for AGE.e Figure 15.5 is a screenshot of output from the procedure. The intercept coefficient (a) is 0.367,f the slope coefficient for SMOKE (b1) is −0.209, and the slope coefficient for AGE (b2) is 0.231. The multiple regression model is:

ŷi = 0.367 − 0.209x1i + 0.231x2i

where ŷi is the predicted value of the ith observation, x1i is the value of SMOKE for the ith observation, and x2i is the value of the AGE variable for the ith observation.
FIGURE 15.4 Screenshot of SPSS dialog box for setting up multiple regression. Graph produced with SPSS for Windows, Rel. 11.0.1.2001. Chicago: SPSS Inc. Reprint Courtesy of International Business Machines Corporation.
• The intercept in this model (0.367) is an extrapolation of where the regression plane would slice through the Y axis. This has no practical application.
• The slope for SMOKE (−0.209) predicts that going from SMOKE = 0 (nonsmoker) to SMOKE = 1 (smoker) is associated with a 0.209 L/sec decrease in mean FEV, holding AGE constant.
• The slope for AGE (0.231) predicts that each additional year of age is associated with a 0.231 L/sec increase in mean FEV, holding SMOKE constant.
This multiple regression model has adjusted for AGE to provide a more meaningful prediction of the effects of smoking.
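To see how the fitted equation adjusts for age, the short sketch below plugs values into the coefficients reported above. The helper function is hypothetical and simply restates the fitted equation.

```python
def predicted_fev(smoke: int, age: float) -> float:
    """Predicted FEV (L/sec) from the reported model: 0.367 - 0.209*SMOKE + 0.231*AGE."""
    return 0.367 - 0.209 * smoke + 0.231 * age

# Comparing a smoker and a nonsmoker of the same age isolates the SMOKE effect
print(predicted_fev(smoke=0, age=10))  # nonsmoker, age 10
print(predicted_fev(smoke=1, age=10))  # smoker, age 10: 0.209 L/sec lower
```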
Figure 15.5 includes the following results:
• The standard error of the slope (SEbi) for each independent variable i = 1 through k is listed in the column labeled “Std. Error.” (The formula is not presented; we rely instead on statistical software.) Based on the output in Figure 15.5, the standard error of the slope associated with SMOKE (SEb1) is 0.081, and the standard error of the slope associated with AGE (SEb2) is 0.008. As is the case with all standard errors, these statistics quantify the precision of their associated estimates.
FIGURE 15.5 Screenshot of multiple linear regression output, illustrative example. Graph produced with SPSS for Windows, Rel. 11.0.1.2001. Chicago: SPSS Inc. Reprint Courtesy of International Business Machines Corporation.
• The last two columns of Figure 15.5 list 95% confidence intervals for the regression coefficients. The 95% confidence interval for the slope associated with SMOKE is (−0.368 to −0.050). The 95% confidence interval for the slope associated with AGE is (0.215 to 0.247). The interpretation of these confidence intervals is similar to that of intervals derived from simple regression models. However, the confidence limits have now been adjusted for the other independent variables in the model. For example, the 95% confidence interval for SMOKE suggests that as we go from 0 (nonsmoker) to 1 (smoker), the dependent variable FEV decreases (the confidence limits have negative signs) by between 0.368 and 0.050 L/sec after controlling for AGE.
• The Standardized Coefficients represent the predicted change in Y, in standard deviation units, per standard deviation increase in Xi. The standardized coefficient for the slope of variable i is equal to bi·(sXi / sY), where sXi and sY are the sample standard deviations of Xi and Y.
• The t column provides t-statistics for testing H0: βi = 0. In each case, tstat = bi / SEbi with n − k − 1 degrees of freedom. Each of these test statistics has been adjusted for the contributions of the other independent variables in the model. (A sketch reproducing these calculations appears after this list.)
• The Sig. column provides two-sided P-values for each of the t-tests. For example, Figure 15.5 shows that P = 0.010 in testing H0: β1 = 0, where β1 represents the population slope associated with the smoke variable.
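The t-statistic, two-sided P-value, and confidence interval for any one coefficient can be reproduced from the coefficient and its standard error. A minimal sketch, using the SMOKE coefficient and standard error from Figure 15.5 and scipy’s t distribution:

```python
from scipy import stats

b1, se_b1 = -0.209, 0.081   # slope and standard error for SMOKE (Figure 15.5)
n, k = 654, 2               # observations and explanatory variables
df_resid = n - k - 1        # 651 degrees of freedom

t_stat = b1 / se_b1                               # about -2.6
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)   # two-sided P of about 0.01
t_crit = stats.t.ppf(0.975, df_resid)             # critical value, about 1.96
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)   # about (-0.368, -0.050)
print(t_stat, p_value, ci)
```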
15.5 ANOVA for Multiple Linear Regression
Test Procedure
Analysis of variance can be used to test the overall fit of a multiple regression model. Here is a step-by-step procedure for the test.
A. Hypotheses. The null hypothesis is H0: the multiple regression model does not fit in the population. The alternative hypothesis is Ha: the multiple regression model does fit in the population. These statements are functionally equivalent to H0: all βi = 0 versus Ha: at least one of the βi ≠ 0. Note that some of the population slopes can be 0 under the alternative hypothesis.
B. Test statistics. As discussed in Section 14.4, two components of variance are analyzed: the mean square of the residuals and the mean square of the regression.
Mean square residual: Recall that a residual is the difference between an observed response and the response predicted by the regression model. For the ith observation:

residuali = yi − ŷi

where yi is the observed value and ŷi is the predicted value of yi. The multiple regression model has been constructed so that a, b1, b2, …, bk minimize ∑residuals². The variance of the residuals (call it σ²) is assumed to be constant at all levels and combinations of the explanatory variables. We estimate this variance with the statistic:

Mean square residual = ∑residuals² / dfresiduals

where ∑residuals² is the sum of squared residuals, dfresiduals = n − k − 1, and k represents the number of explanatory variables in the model. You lose (k + 1) degrees of freedom in this variance estimate because k + 1 parameters are estimated: the slope for each explanatory variable plus the intercept. For the illustrative data, you lose 3 degrees of freedom in estimating α, β1, and β2. Therefore, dfresiduals = 654 − 2 − 1 = 651.

The symbol s²Y|x1,x2 can be used to represent the mean square residual. This is the estimated variance of Y given X1 and X2 on the regression plane. The square root of the mean square residual is the standard error of the multiple regression:

sY|x1,x2 = √(mean square residual)
Mean square regression: The mean square regression quantifies the extent to which the response surface deviates from the grand mean of Y:

Mean square regression = ∑regressions² / dfregression

where ∑regressions² is the sum of squares of the regression components and dfregression = k. The regression component of observation i is ŷi − ȳ, where ŷi is the predicted value of Y and ȳ is the grand mean of Y.
ANOVA table and F-statistic: Sums of squares, mean squares, and degrees of freedom are organized to form an ANOVA table (see Table 15.3).
a Some statistical packages will indicate this line of output with the label “Model” (i.e., regression model).
b Some statistical packages will indicate this line of output with the label “Error” (i.e., residual error).
The F-statistic is:

Fstat = mean square regression / mean square residual
Figure 15.6 shows computer output of the ANOVA table for the illustrative example. This reveals Fstat = 443.25 with 2 and 651 degrees of freedom.
C. P-value. A conservative estimate of the P-value can be derived from Appendix Table D by looking up the tail region for an F-distribution with 2 numerator df and 100 denominator df. (Always go down to the next available degrees of freedom for a conservative P-value.) This lets us know that an F2,100 critical value of 7.41 has a right-tail area of 0.001. The observed F-statistic falls further into this tail. Therefore, P < 0.001.
D. Significance level. The evidence against the null hypothesis is significant at the α = 0.001 level.
E. Conclusion. The F-test is directed toward H0: β1 = β2 = 0 versus Ha: “H0 is false.” Therefore, this analysis merely tells us that age and/or smoking are significantly associated with forced expiratory volume in this adolescent population (P < 0.001). To determine which slopes are significant, we must look at the individual contributions of the variables as addressed in Section 15.4. (A short computational sketch follows these steps.)
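The F-statistic and its P-value can be assembled from the sums of squares reported in Figure 15.6, following the formulas above. A minimal sketch using scipy’s F distribution:

```python
from scipy import stats

ss_regression, ss_total = 283.058, 490.920   # from Figure 15.6
ss_residual = ss_total - ss_regression
n, k = 654, 2
df_regression, df_residual = k, n - k - 1    # 2 and 651

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual                      # about 443
p_value = stats.f.sf(f_stat, df_regression, df_residual)  # well below 0.001
print(f_stat, p_value)
```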
FIGURE 15.6 Screenshot of computer output, multiple linear regression ANOVA statistics, forced expiratory volume illustrative example. Graph produced with SPSS for Windows, Rel. 11.0.1.2001. Chicago: SPSS Inc. Reprint Courtesy of International Business Machines Corporation.
Model Fit
The ratio of the regression sum of squares to the total sum of squares forms the multiple coefficient of determination (R²):

R² = SSregression / SStotal
This statistic quantifies the proportion of the variance in Y that is explained by the regression model, thus providing a statistic of overall model fit.
The output in Figure 15.6 shows that SSregression = 283.058 and SStotal = 490.920. Therefore, R² = 283.058 / 490.920 = 0.577. This suggests that about 58% of the variability of the response is explained by the independent variables in the model.
The square root of the multiple coefficient of determination is the multiple correlation coefficient (R): R = √R². This statistic is analogous to Pearson’s correlation coefficient (r) except that it considers the contribution of multiple explanatory factors. This statistic will always be a positive number between 0 and 1. The multiple correlation coefficient for the illustrative example is R = √0.577 = 0.76.
Figure 15.7 is a screenshot of output of model summary statistics. Besides R and R², this output includes two additional statistics: the Adjusted R Square and the Std. Error of the Estimate. The Adjusted R Square, equal to 1 − (1 − R²)(n − 1)/(n − k − 1), reflects the goodness of fit of the model in the population more closely than R² does.g The Std. Error of the Estimate is the standard error of the multiple regression, sY|x1,x2 = √(mean square residual).
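All of these model-fit statistics follow directly from the sums of squares. A short sketch using the values reported in Figure 15.6:

```python
import math

ss_regression, ss_total = 283.058, 490.920   # from Figure 15.6
n, k = 654, 2

r_squared = ss_regression / ss_total                           # about 0.577
r = math.sqrt(r_squared)                                       # about 0.76
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)    # about 0.575
std_error_estimate = math.sqrt((ss_total - ss_regression) / (n - k - 1))  # about 0.57
print(r_squared, r, adj_r_squared, std_error_estimate)
```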
FIGURE 15.7 Screenshot of computer output, multiple linear regression model summary, forced expiratory volume illustrative example. Graph produced with SPSS for Windows, Rel. 11.0.1.2001. Chicago: SPSS Inc. Reprint Courtesy of International Business Machines Corporation.
15.6 Examining Multiple Regression Conditions
The conditions needed for multiple linear regression mirror those of simple linear regression. Recall the mnemonic “LINE,” which stands for linearity, independence, normality, and equal variance (Section 14.4). Statistical packages often incorporate tools to help examine the multiple regression response surface for these conditions. Although a thorough consideration of these tools is beyond the scope of this general text, we examine two such tools by way of introduction.
Figure 15.8 is a plot of standardized residuals against standardized predicted values for the illustrative example. The standardized residual of observation i is residuali / sY|x1,x2, and its standardized predicted value is (ŷi − ȳ) / sŷ, where sŷ is the standard deviation of the predicted values. This plot shows less variation at the lower end of predicted values than at the higher end. This is not unexpected with these data; the lower FEV values are likely to represent younger subjects, in whom respiratory function is expected to be less variable. The extent to which this undermines the equal variance assumption and the utility of the inferential model is a matter of judgment and may warrant further consideration by a specialist.
FIGURE 15.8 Standardized residuals plotted against standardized predicted values, multiple linear regression model. Graph produced with SPSS for Windows, Rel. 11.0.1.2001. Chicago: SPSS Inc. Reprint Courtesy of International Business Machines Corporation.
FIGURE 15.9 Normal Q-Q plot, multiple linear regression illustration. Graph produced with SPSS for Windows, Rel. 11.0.1.2001. Chicago: SPSS Inc. Reprint Courtesy of International Business Machines Corporation.
Figure 15.9 is a Normal Q-Q plot of the standardized residuals. In this plot, we look for deviations from the diagonal line as evidence of non-Normality (Section 7.4). The data in this plot does a fairly good job adhering to the diagonal line, suggesting no major departures from Normality.
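Diagnostic plots like Figures 15.8 and 15.9 can be produced in most packages. The sketch below assumes the hypothetical fev.csv layout used earlier and standardizes the residuals and predicted values before plotting; it approximates SPSS’s plots rather than reproducing them exactly.

```python
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("fev.csv")  # assumed file name and columns FEV, SMOKE, AGE
fit = smf.ols("FEV ~ SMOKE + AGE", data=df).fit()

# Standardized residuals vs. standardized predicted values (cf. Figure 15.8)
std_resid = fit.resid / fit.resid.std()
std_pred = (fit.fittedvalues - fit.fittedvalues.mean()) / fit.fittedvalues.std()
plt.scatter(std_pred, std_resid, s=10)
plt.axhline(0, linestyle="--")
plt.xlabel("Standardized predicted value")
plt.ylabel("Standardized residual")
plt.show()

# Normal Q-Q plot of the standardized residuals (cf. Figure 15.9)
stats.probplot(std_resid, dist="norm", plot=plt)
plt.show()
```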
Summary Points (Multiple Regression)
1. Multiple regression is used to quantify the linear relationship between an explanatory variable X1 and a quantitative response variable Y while adjusting for potential confounding variables X2, X3, …, Xk.
2. Categorical explanatory variables can be incorporated into a regression model by coding them as 0/1 dummy variables.
3. The population model for multiple regression is μY|x1, x2, …, xk = α + β1x1 + β2x2 + ··· + βkxk, where μY|x1, x2, …, xk represents the expected value of response variable Y given specific values for explanatory variables X1, X2, …, Xk; β1, β2, …, βk are the population slopes for explanatory variables X1, X2, …, Xk; and α is the population intercept.
4. We rely on statistical software to provide estimates for each regression coefficient. These estimates are called a, b1, b2, …, bk. The regression equation for the data is ŷ = a + b1x1 + b2x2 + ··· + bkxk, where ŷ represents the predicted value of Y given values x1, x2, …, xk.
(a) Each slope b1, b2, …, bk is interpreted as follows: if all the X variables except Xi are held constant, the predicted change in Y per unit change in Xi is equal to bi.
(b) Each slope estimate can be tested for statistical significance with t-test H0: βi = 0. We rely on statistical software to compute the statistics.
(c) 95% confidence intervals for each βi are provided by the software.
5. The residual associated with observation i (yi − ŷi) is the distance of the data point from the regression plane fit by the model. It is assumed that these residuals show constant variance at all points along the plane and are Normally distributed with a mean of 0: residuals ~ N(0, σ). These assumptions can be examined with a residual plot.
6. The variance of the regression, σ², is estimated by the mean square residual ∑residuals² / dfresiduals, where ∑residuals² is the sum of squared residuals and dfresiduals = n − k − 1. The mean square residual is also called the mean square error.
Vocabulary
Adjusted R square
Dependent variables
Dummy variables
Independent variables
Indicator variables
Multiple correlation coefficient (R)
Multiple regression
Multiple coefficient of determination (R²)
Response line
Response plane
Response surface
Simple regression
Standard error of the multiple regression (sY|x1, x2, …, xk)
Exercises

15.1 The relation between FEV and SEX in the illustrative data set. Download the illustrative data set (FEV.*) used in this chapter. See Table 15.1 for a codebook of the variables. Examine the relationship between FEV and SEX. Then, examine the relationship between FEV and SEX while adjusting for AGE. Did AGE confound the relationship?
15.2 Cognitive function in centenarians. A geriatric researcher looked at the effect of AGE (years), EDUCATION level (years of schooling), and SEX (0 = female, 1 = male) on cognitive function in 162 centenarians (fictitious data). Cognitive function was measured using standardized psychometric and mental functioning tests with higher scores corresponding to better cognitive ability. Using these data, a multiple regression computer run derived the following multiple regression model:
Based on this model, describe the effect of AGE, EDUCATION, and SEX on cognition scores in centenarians.
______________
a George Box cited in Gilbert, J. P., & Mosteller, F. (1972). The urgent need for experimentation. In F. Mosteller & D. P. Moynihan (Eds.), On Equality of Educational Opportunity. New York: Vintage, p. 372.
b For discussions of the mathematics of multiple regression model fitting, see Kutner, M. H., Nachtsheim, C. J., & Neter, J. (2004). Applied Linear Regression Models (4th ed.). New York: McGraw-Hill/Irwin.
c Chapter 14 covers basic concepts for simple linear regression.
d The 589 nonsmokers have a mean age of 9.5 years (standard deviation 2.7 years). The 65 smokers have a mean age of 13.5 (standard deviation 2.3 years).
e The illustration uses SPSS, but any reliable statistical package would do.
f This is labeled (constant).
g This adjustment is needed because as predictors are added to the model, some of the variance in Y is explained simply by chance.