Multiple Linear Regression
15.1 The General Idea
Simple regression (Chapter 14) addresses a single explanatory variable (X) in relation to a response variable (Y):

X → Y

Multiple regression is an extension of simple linear regression that addresses multiple explanatory variables (X1, X2, …, Xk) in relation to a response variable (Y):

X1, X2, …, Xk → Y
To start our discussion of multiple regression, let X1 represent the explanatory variable of prime interest. All other explanatory variables in the model (X2, …, Xk) will be considered “extraneous” for now. The multiple regression model holds constant the influence of the extraneous variables, allowing us to more readily isolate the effects of the primary explanatory variable. The intention is to “adjust out” confounding effects, leaving the regression coefficient associated with the primary explanatory variable relatively unconfounded. When interest later shifts to (say) explanatory variable X2, it then becomes the primary explanatory variable and all others are extraneous.
It must be pointed out that the causal nature of an explanatory variable from a nonexperimental study is not tested in a strong way even with multiple regression models. Adjustments imposed by multiple regression are passive. To get a more robust test of cause, active experimentation is needed. “To find out what happens to a system when you interfere with it, you have to interfere with it (not just passively observe it).”a Therefore, as was the case with simple regression, it should not be taken for granted that statistics from multiple regression models reflect actual causal relations. An observed association may be causal or non-causal depending on nature, not on the model.
Finally, we will not consider all the mathematics behind the multiple regression model. Instead, we will rely on statistical software for computations. This will allow us to introduce the model without getting bogged down in its mathematical complexities.b
15.2 The Multiple Linear Regression Model
Simple linear regression and multiple regression are based on similar lines of reasoning.c In a simple regression population model, the expected value of Y at a given level of X is based on the linear equation:

μY|x = α + βx

where μY|x is the expected value of Y given x, α is the parameter indicating the model’s intercept, and β is the parameter indicating the model’s slope. We do not know the true values of the model parameters, so we estimate them with a least squares regression line that minimizes ∑residuals². This derives estimates of the regression parameters that predict values of Y with the equation:

ŷ = a + bx

where a is the intercept estimate and b is the slope estimate.
The standard error of the regression is the root mean square of the residuals:

sY|x = √(∑residuals² / (n − 2))
Inferential techniques use this standard error estimate in their application (Section 14.4).
Now consider a regression model with two explanatory variables. The expected value of Y at given levels of X1 and X2 is:

μY|x1,x2 = α + β1x1 + β2x2

As was the case with simple linear regression, a least squares method minimizing ∑residuals² is used to fit the model:

ŷ = a + b1x1 + b2x2

where a is the intercept estimate, b1 is the slope estimate for X1, and b2 is the slope estimate for X2.
While a simple regression model with one independent variable fits a response line in two-dimensional space, a multiple regression model with two independent variables fits a response plane in three-dimensional space. Figure 15.1 depicts a response plane for a multiple regression model with two explanatory variables.
Multiple regression models can accommodate more than two explanatory variables by fitting a response surface in k + 1 dimensional space, where k represents the number of explanatory variables in the model. For example, a model with three explanatory variables fits a response surface in four-dimensional space. Although we cannot visualize four dimensions, we can still use linear algebra to find the best-fitting response surface. As before, this is achieved by minimizing the ∑residuals² for the surface. The result is a multiple regression model with k explanatory variables in which:

ŷ = a + b1x1 + b2x2 + ··· + bkxk
FIGURE 15.1 Three-dimensional response plane.
The coefficients derived by this model have interpretations similar to those derived by simpler models. The intercept estimate (a) is where the regression surface crosses the Y axis. The slope estimate associated with X1 (b1) predicts the amount of change in Y per unit increase in X1 while holding the other variables in the model constant; the slope estimate associated with X2 (b2) predicts the amount of change in Y per unit increase in X2 while holding the other variables in the model constant; and so on.
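For readers who want to see the least-squares machinery in action, the following Python sketch (not part of the text, which relies on SPSS for computation) simulates a small data set with two explanatory variables and finds the coefficients a, b1, and b2 that minimize ∑residuals². The variable names and simulated values are illustrative assumptions only.

```python
import numpy as np

# Simulated illustration: 100 observations on two explanatory variables
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=100)
x2 = rng.uniform(0, 5, size=100)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(scale=0.5, size=100)

# Design matrix with a leading column of 1s for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])

# lstsq finds the coefficients that minimize the sum of squared residuals
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef
print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")

# Predicted values lie on the fitted response plane
y_hat = X @ coef
print("sum of squared residuals:", np.sum((y - y_hat) ** 2))
```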
ILLUSTRATIVE EXAMPLE
Data (Forced expiratory volumes). Our illustrative data comes from a health survey done in adolescents. The illustrative data set includes information on 654 individuals between 3 and 19 years of age. Table 15.1 is a code book for the variables in the data set.
The response variable is a measure of respiratory function called forced expiratory volume (variable name FEV) that measures respiratory capacity in liters/second units. High values represent high respiratory capacity. We will consider the effects of two explanatory variables: SMOKE (coded 0 for “nonsmoker” and 1 for “smoker”) and AGE (years). Individuals were classified as ever-smokers if they currently smoked or had at some time smoked as much as one cigarette per week. We wish to quantify the effects of SMOKE on FEV while adjusting for AGE.
TABLE 15.1 Summary statistics from illustrative data set, n = 654.
Data from Rosner, B. (1990). Fundamentals of Biostatistics (Third ed.). Belmont, CA: Duxbury Press. Data are stored online in the file FEV.*.
15.3 Categorical Explanatory Variables in Regression Models
Let us start by looking at the effect of smoking (variable name SMOKE) on forced expiratory volume (FEV) while ignoring AGE for now. The response variables in regression models are sometimes referred to as the dependent variables and the explanatory variables are referred to as independent variables. In this example, FEV is the dependent variable and SMOKE is the independent variable. Notice that SMOKE is categorical. Until this point, regression models we’ve considered have addressed quantitative independent (explanatory) variables only. Fortunately, regression models can accommodate binary explanatory variables as long as they are coded 0 for “absence of the attribute” and 1 for “presence of the attribute.” Variables of this sort are called indicator variables or dummy variables.
Regression models can handle categorical explanatory variables with more than two levels of the attribute by reexpressing the variable with multiple indicator variables. Table 15.2 describes how this is done. For now, we will address only binary explanatory variables.
Figure 15.2 is a scatterplot of FEV by SMOKE. The group means are 2.566 L/sec for SMOKE = 0 and 3.277 L/sec for SMOKE = 1. The line connecting the two means on the plot is the regression line for the relationship.
TABLE 15.2 Accommodating a categorical variable with more than two levels of the attribute. When the explanatory variable has k levels of the attribute, it is reexpressed with k − 1 dummy variables. For example, an attribute classified into three levels is reexpressed with two dummy variables. As an example, the variable SMOKE2 classifies individuals into three levels of smoking (0 = nonsmoker, 1 = former smoker, 2 = current smoker). To incorporate this information into a regression model, it is recoded as follows:
SMOKE2    DUMMY1    DUMMY2
0         0         0
1         1         0
2         0         1
Here’s an example of programming code that can be used for this purpose:
IF SMOKE2 = 1 THEN DUMMY1 = 1 ELSE DUMMY1 = 0
IF SMOKE2 = 2 THEN DUMMY2 = 1 ELSE DUMMY2 = 0
The variable DUMMY1 is coded so that 0 = not a former smoker, 1 = former smoker. The variable DUMMY2 is coded so that 0 = not a current smoker, 1 = current smoker. When DUMMY1 and DUMMY2 both equal 0, the individual is identified as a nonsmoker.
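The same recoding can be written in a general-purpose environment. This is a hypothetical Python sketch; the column name SMOKE2 and the tiny example data set are assumptions used only to illustrate the k − 1 dummy-variable scheme.

```python
import pandas as pd

# Hypothetical three-level smoking variable: 0 = nonsmoker, 1 = former, 2 = current
df = pd.DataFrame({"SMOKE2": [0, 1, 2, 2, 0, 1]})

# k = 3 levels -> k - 1 = 2 dummy variables; nonsmoker (0) is the reference level
df["DUMMY1"] = (df["SMOKE2"] == 1).astype(int)  # 1 = former smoker
df["DUMMY2"] = (df["SMOKE2"] == 2).astype(int)  # 1 = current smoker

# pandas can also generate the k - 1 dummies automatically
dummies = pd.get_dummies(df["SMOKE2"], prefix="SMOKE2", drop_first=True, dtype=int)
print(df.join(dummies))
```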
The regression line is linked to the means as follows:
• The mean response in group 0 = a.
• The mean response in group 1 = a + b.
• The difference between mean responses = (a + b) − a = b.
Figure 15.3 contains SPSS output for this regression model. FEV is the dependent variable and SMOKE is the independent variable. This output indicates an intercept (a) of 2.566 and slope (b) of 0.711. Therefore, the regression model is ŷ = 2.566 + 0.711x.
FIGURE 15.2 Scatterplot and regression line for forced expiratory volume in nonsmokers (SMOKE = 0) and smokers (SMOKE = 1), illustrative data set.
• The mean response in group 0 (a) is 2.566.
• The mean response in group 1 (a + b) is 2.566 + 0.711 = 3.277.
• The difference between mean responses (b) is 0.711.
Analogous statements can be made about the population regression model μY|x = α + βx:
• The expected response in group 0 is μY |x=0 = α + β(0) = α.
• The expected response in group 1 is μY |x=1 = α + β(1) = α + β.
• The difference in expectations μY |x=1 − μY |x=0 = (α + β) − α = β.
Thus, testing H0: μ1 − μ0 = 0 is equivalent to testing H0: β = 0. In addition, a confidence interval for β is equivalent to a confidence interval for μ1 − μ0.
FIGURE 15.3 Screenshot of SPSS output for the illustrative example in which the response variable (FEV) is regressed on the binary explanatory variable (SMOKE). Graph produced with SPSS for Windows, Rel. 11.0.1.2001. Chicago: SPSS Inc. Reprint Courtesy of International Business Machines Corporation.
The output in Figure 15.3 reveals tstat = 6.464 with 652 df and P ≈ 0.000. It also reveals a confidence interval for β of (0.495 to 0.927) L/sec. These results suggest that the children who smoked had substantially greater lung capacity than nonsmokers—but how could this be true given what we know about the deleterious effects of smoking? The answer lies in the fact that the children in the sample who smoked were older than those who did not smoke.d AGE confounded the observed relationship between SMOKE and FEV. Fortunately, a multiple regression model can be used to adjust for AGE while comparing FEV levels in smokers and nonsmokers.
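The regression of FEV on the 0/1 indicator SMOKE can be reproduced in any reliable package. The sketch below uses Python’s statsmodels and assumes the data have been exported to a file named fev.csv with columns FEV and SMOKE; the file name and layout are assumptions, since the original data are distributed as FEV.*.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("fev.csv")  # assumed file name and column layout

# Regress FEV on the 0/1 indicator SMOKE
fit = smf.ols("FEV ~ SMOKE", data=df).fit()
print(fit.params)      # intercept = mean FEV in nonsmokers; slope = difference in means
print(fit.conf_int())  # 95% CI for the slope = 95% CI for the difference in group means

# The same comparison expressed directly as group means
print(df.groupby("SMOKE")["FEV"].mean())
```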
15.4 Regression Coefficients
For multiple regression models, we rely on statistical packages to calculate the intercept and slope coefficients. Figure 15.4 depicts the SPSS dialog box needed to estimate the effect of SMOKE on FEV while adjusting for AGE.e Figure 15.5 is a screenshot of output from the procedure. The intercept coefficient (a) is 0.367,f the slope coefficient for SMOKE (b1) is −0.209, and the slope coefficient for AGE (b2) is 0.231. The multiple regression model is:

ŷi = 0.367 − 0.209x1i + 0.231x2i

where ŷi is the predicted value of the ith observation, x1i is the value of SMOKE for the ith observation, and x2i is the value of the AGE variable for the ith observation.
FIGURE 15.4 Screenshot of SPSS dialog box for setting up multiple regression. Graph produced with SPSS for Windows, Rel. 11.0.1.2001. Chicago: SPSS Inc. Reprint Courtesy of International Business Machines Corporation.
• The intercept in this model (0.367) is an extrapolation of where the regression plane would slice through the Y axis. This has no practical application.
• The slope for SMOKE (−0.209) predicts that going from SMOKE = 0 (nonsmoker) to SMOKE = 1 (smoker) is associated with a 0.209 L/sec decrease in mean FEV, holding AGE constant.
• The slope for AGE (0.231) predicts that each additional year of age is associated with a 0.231 L/sec increase in mean FEV, holding SMOKE constant.
This multiple regression model has adjusted for AGE to provide a more meaningful prediction of the effects of smoking.
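To see how the fitted equation adjusts for age, the short sketch below plugs values into the coefficients reported above. The helper function is hypothetical and simply restates the fitted equation.

```python
def predicted_fev(smoke: int, age: float) -> float:
    """Predicted FEV (L/sec) from the reported model: 0.367 - 0.209*SMOKE + 0.231*AGE."""
    return 0.367 - 0.209 * smoke + 0.231 * age

# Comparing a smoker and a nonsmoker of the same age isolates the SMOKE effect
print(predicted_fev(smoke=0, age=10))  # nonsmoker, age 10
print(predicted_fev(smoke=1, age=10))  # smoker, age 10: 0.209 L/sec lower
```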
Figure 15.5 includes the following results:
• The standard error of the slope (SEbi) for each independent variable i = 1 through k is listed in the column labeled “Std. Error.” (The formula is not presented; we rely instead on statistical software.) Based on the output in Figure 15.5, the standard error of the slope associated with SMOKE (SEb1) is 0.081, and the standard error of the slope associated with AGE (SEb2) is 0.008. As is the case with all standard errors, these statistics quantify the precision of their associated estimates.
FIGURE 15.5 Screenshot of multiple linear regression output, illustrative example. Graph produced with SPSS for Windows, Rel. 11.0.1.2001. Chicago: SPSS Inc. Reprint Courtesy of International Business Machines Corporation.
• The last two columns of Figure 15.5 list 95% confidence intervals for the regression coefficients. The 95% confidence interval for the slope associated with SMOKE is (−0.368 to −0.050). The 95% confidence interval for the slope associated with AGE is (0.215 to 0.247). The interpretation of these confidence intervals is similar to that of intervals derived from simple regression models. However, the confidence limits have now been adjusted for the other independent variables in the model. For example, the 95% confidence interval for SMOKE suggests that as we go from 0 (nonsmoker) to 1 (smoker), the dependent variable FEV decreases (the confidence limits have negative signs) by between 0.368 and 0.050 L/sec after controlling for AGE.
• The Standardized Coefficients represent the predicted change in Y, in standard deviation units, per standard deviation increase in Xi. The standardized coefficient for the slope of variable i is equal to bi·(sXi / sY), where sXi and sY are the sample standard deviations of Xi and Y.
• The t column provides t-statistics for testing H0: βi = 0. In each case, tstat = bi / SEbi with n − k − 1 degrees of freedom. Each of these test statistics has been adjusted for the contributions of the other independent variables in the model. (A sketch reproducing these calculations appears after this list.)
• The Sig. column provides two-sided P-values for each of the t-tests. For example, Figure 15.5 shows that P = 0.010 in testing H0: β1 = 0, where β1 represents the population slope associated with the smoke variable.
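The t-statistic, two-sided P-value, and confidence interval for any one coefficient can be reproduced from the coefficient and its standard error. A minimal sketch, using the SMOKE coefficient and standard error from Figure 15.5 and scipy’s t distribution:

```python
from scipy import stats

b1, se_b1 = -0.209, 0.081   # slope and standard error for SMOKE (Figure 15.5)
n, k = 654, 2               # observations and explanatory variables
df_resid = n - k - 1        # 651 degrees of freedom

t_stat = b1 / se_b1                               # about -2.6
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)   # two-sided P of about 0.01
t_crit = stats.t.ppf(0.975, df_resid)             # critical value, about 1.96
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)   # about (-0.368, -0.050)
print(t_stat, p_value, ci)
```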
15.5 ANOVA for Multiple Linear Regression
Test Procedure
Analysis of variance can be used to test the overall fit of a multiple regression model. Here is a step-by-step procedure for the test.
A. Hypotheses. The null hypothesis is H0: the multiple regression model does not fit in the population. The alternative hypothesis is Ha: the multiple regression model does fit in the population. These statements are functionally equivalent to H0: all βi = 0 versus Ha: at least one of the βi ≠ 0. Note that some of the population slopes can be 0 under the alternative hypothesis.
B. Test statistics. As discussed in Section 14.4, two components of variance are analyzed: the mean square of the residuals and the mean square of the regression.
Mean square residual: Recall that a residual is the difference between an observed response and the response predicted by the regression model. For the ith observation:

residuali = yi − ŷi

where yi is the observed value and ŷi is the predicted value of yi. The multiple regression model has been constructed so that a, b1, b2, …, bk minimize ∑residuals². The variance of the residuals (call it σ²) is assumed to be constant at all levels and combinations of the explanatory variables. We estimate this variance with the statistic:

Mean square residual = ∑residuals² / dfresiduals

where ∑residuals² is the sum of squared residuals, dfresiduals = n − k − 1, and k represents the number of explanatory variables in the model. You lose (k + 1) degrees of freedom in this variance estimate because k + 1 parameters are estimated: the slope for each explanatory variable plus the intercept. For the illustrative data, you lose 3 degrees of freedom in estimating α, β1, and β2. Therefore, dfresiduals = 654 − 2 − 1 = 651.

The symbol s²Y|x1,x2 can be used to represent the mean square residual. This is the estimated variance of Y given X1 and X2 on the regression plane. The square root of the mean square residual is the standard error of the multiple regression:

sY|x1,x2 = √(mean square residual)
Mean square regression: The mean square regression quantifies the extent to which the response surface deviates from the grand mean of Y:

Mean square regression = ∑regressions² / dfregression

where ∑regressions² is the sum of squares of the regression components and dfregression = k. The regression component of observation i is ŷi − ȳ, where ŷi is the predicted value of Y and ȳ is the grand mean of Y.
ANOVA table and F-statistic: Sums of squares, mean squares, and degrees of freedom are organized to form an ANOVA table (see Table 15.3).
a Some statistical packages will indicate this line of output with the label “Model” (i.e., regression model).
b Some statistical packages will indicate this line of output with the label “Error” (i.e., residual error).
The F-statistic is:

Fstat = mean square regression / mean square residual
Figure 15.6 shows computer output of the ANOVA table for the illustrative example. This reveals Fstat = 443.25 with 2 and 651 degrees of freedom.
C. P-value. A conservative estimate of the P-value can be derived from Appendix Table D by looking up the tail region for an F-distribution with 2 numerator df and 100 denominator df. (Always go down to the next available degrees of freedom for a conservative P-value.) This lets us know that an F2,100 critical value of 7.41 has a right-tail area of 0.001. The observed F-statistic falls further into this tail. Therefore, P < 0.001.
D. Significance level. The evidence against the null hypothesis is significant at the α = 0.001 level.
E. Conclusion. The F-test is directed toward H0: β1 = β2 = 0 versus Ha: “H0 is false.” Therefore, this analysis merely tells us that age and/or smoking are significantly associated with forced expiratory volume in this adolescent population (P < 0.001). To determine which slopes are significant, we must look at the individual contributions of the variables as addressed in Section 15.4. (A short computational sketch follows these steps.)
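The F-statistic and its P-value can be assembled from the sums of squares reported in Figure 15.6, following the formulas above. A minimal sketch using scipy’s F distribution:

```python
from scipy import stats

ss_regression, ss_total = 283.058, 490.920   # from Figure 15.6
ss_residual = ss_total - ss_regression
n, k = 654, 2
df_regression, df_residual = k, n - k - 1    # 2 and 651

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual                      # about 443
p_value = stats.f.sf(f_stat, df_regression, df_residual)  # well below 0.001
print(f_stat, p_value)
```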
FIGURE 15.6 Screenshot of computer output, multiple linear regression ANOVA statistics, forced expiratory volume illustrative example. Graph produced with SPSS for Windows, Rel. 11.0.1.2001. Chicago: SPSS Inc. Reprint Courtesy of International Business Machines Corporation.
Model Fit
The ratio of the regression sum of squares to the total sum of squares forms the multiple coefficient of determination (R²):

R² = SSregression / SStotal
This statistic quantifies the proportion of the variance in Y that is explained by the regression model, thus providing a statistic of overall model fit.
The output in Figure 15.6 shows that SSregression = 283.058 and SStotal = 490.920. Therefore, R² = 283.058 / 490.920 = 0.577. This suggests that about 58% of the variability of the response is explained by the independent variables in the model.
The square root of the multiple coefficient of determination is the multiple correlation coefficient (R): R = √R². This statistic is analogous to Pearson’s correlation coefficient (r) except that it considers the contribution of multiple explanatory factors. This statistic will always be a positive number between 0 and 1. The multiple correlation coefficient for the illustrative example is R = √0.577 = 0.76.
Figure 15.7 is a screenshot of output of model summary statistics. Besides R and R², this output includes two additional statistics: the Adjusted R Square and the Std. Error of the Estimate. The Adjusted R Square, equal to 1 − (1 − R²)(n − 1)/(n − k − 1), reflects the goodness of fit of the model in the population more closely than R² does.g The Std. Error of the Estimate is the standard error of the multiple regression, sY|x1,x2 = √(mean square residual).
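All of these model-fit statistics follow directly from the sums of squares. A short sketch using the values reported in Figure 15.6:

```python
import math

ss_regression, ss_total = 283.058, 490.920   # from Figure 15.6
n, k = 654, 2

r_squared = ss_regression / ss_total                           # about 0.577
r = math.sqrt(r_squared)                                       # about 0.76
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)    # about 0.575
std_error_estimate = math.sqrt((ss_total - ss_regression) / (n - k - 1))  # about 0.57
print(r_squared, r, adj_r_squared, std_error_estimate)
```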
FIGURE 15.7 Screenshot of computer output, multiple linear regression model summary, forced expiratory volume illustrative example. Graph produced with SPSS for Windows, Rel. 11.0.1.2001. Chicago: SPSS Inc. Reprint Courtesy of International Business Machines Corporation.
15.6 Examining Multiple Regression Conditions
The conditions needed for multiple linear regression mirror those of simple linear regression. Recall the mnemonic “LINE,” which stands for linearity, independence, normality, and equal variance (Section 14.4). Statistical packages often incorporate tools to help examine the multiple regression response surface for these conditions. Although a thorough consideration of these tools is beyond the scope of this general text, we examine two such tools by way of introduction.
Figure 15.8 is a plot of standardized residuals against standardized predicted values for the illustrative example. The standardized residual of observation i is residuali / sY|x1,x2, and its standardized predicted value is (ŷi − ȳ) / sŷ, where sŷ is the standard deviation of the predicted values. This plot shows less variation at the lower end of predicted values than at the higher end. This is not unexpected with these data; the lower FEV values are likely to represent younger subjects, in whom respiratory function is expected to be less variable. The extent to which this undermines the equal variance assumption and the utility of the inferential model is a matter of judgment and may warrant further consideration by a specialist.
FIGURE 15.8 Standardized residuals plotted against standardized predicted values, multiple linear regression model. Graph produced with SPSS for Windows, Rel. 11.0.1.2001. Chicago: SPSS Inc. Reprint Courtesy of International Business Machines Corporation.
FIGURE 15.9 Normal Q-Q plot, multiple linear regression illustration. Graph produced with SPSS for Windows, Rel. 11.0.1.2001. Chicago: SPSS Inc. Reprint Courtesy of International Business Machines Corporation.
Figure 15.9 is a Normal Q-Q plot of the standardized residuals. In this plot, we look for deviations from the diagonal line as evidence of non-Normality (Section 7.4). The data in this plot does a fairly good job adhering to the diagonal line, suggesting no major departures from Normality.
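Diagnostic plots like Figures 15.8 and 15.9 can be produced in most packages. The sketch below assumes the hypothetical fev.csv layout used earlier and standardizes the residuals and predicted values before plotting; it approximates SPSS’s plots rather than reproducing them exactly.

```python
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("fev.csv")  # assumed file name and columns FEV, SMOKE, AGE
fit = smf.ols("FEV ~ SMOKE + AGE", data=df).fit()

# Standardized residuals vs. standardized predicted values (cf. Figure 15.8)
std_resid = fit.resid / fit.resid.std()
std_pred = (fit.fittedvalues - fit.fittedvalues.mean()) / fit.fittedvalues.std()
plt.scatter(std_pred, std_resid, s=10)
plt.axhline(0, linestyle="--")
plt.xlabel("Standardized predicted value")
plt.ylabel("Standardized residual")
plt.show()

# Normal Q-Q plot of the standardized residuals (cf. Figure 15.9)
stats.probplot(std_resid, dist="norm", plot=plt)
plt.show()
```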
Summary Points (Multiple Regression)
1. Multiple regression is used to quantify the linear relationship between an explanatory variable X1 and a quantitative response variable Y while adjusting for potential confounding variables X2, X3, …, Xk.
2. Categorical explanatory variables can be incorporated into a regression model by coding them as 0/1 dummy variables.
3. The population model for multiple regression is μY|x1, x2, …, xk = α + β1x1 + β2x2 + ··· + βkxk, where μY|x1, x2, …, xk represents the expected value of response variable Y given specific values for explanatory variables X1, X2, …, Xk; β1, β2, …, βk are the population slopes for explanatory variables X1, X2, …, Xk; and α is the population intercept.
4. We rely on statistical software to provide estimates for each regression coefficient. These estimates are called a, b1, b2, …, bk. The regression equation for the data is ŷ = a + b1x1 + b2x2 + ··· + bkxk, where ŷ represents the predicted value of Y given values x1, x2, …, xk.
(a) Each slope b1, b2, …, bk is interpreted as follows: if all the X variables except Xi are held constant, the predicted change in Y per unit change in Xi is equal to bi.
(b) Each slope estimate can be tested for statistical significance with t-test H0: βi = 0. We rely on statistical software to compute the statistics.
(c) 95% confidence intervals for each βi are provided by the software.
5. The residual associated with observation i (yi − ŷi) is the distance of the data point from the regression plane fit by the model. It is assumed that these residuals show constant variance at all points along the plane and are Normally distributed with a mean of 0: residuals ~ N(0, σ). These assumptions can be examined with a residual plot.
6. The variance of the regression, σ², is estimated by the mean square residual ∑residuals² / dfresiduals, where ∑residuals² is the sum of squared residuals and dfresiduals = n − k − 1. The mean square residual is also called the mean square error.
Vocabulary
Adjusted R square
Dependent variables
Dummy variables
Independent variables
Indicator variables
Multiple correlation coefficient (R)
Multiple regression
Multiple coefficient of determination (R²)
Response line
Response plane
Response surface
Simple regression
Standard error of the multiple regression (sY|x1, x2, …, xk)
Exercises

15.1 The relation between FEV and SEX in the illustrative data set. Download the illustrative data set (FEV.*) used in this chapter. See Table 15.1 for a codebook of the variables. Examine the relationship between FEV and SEX. Then, examine the relationship between FEV and SEX while adjusting for AGE. Did AGE confound the relationship?
15.2 Cognitive function in centenarians. A geriatric researcher looked at the effect of AGE (years), EDUCATION level (years of schooling), and SEX (0 = female, 1 = male) on cognitive function in 162 centenarians (fictitious data). Cognitive function was measured using standardized psychometric and mental functioning tests with higher scores corresponding to better cognitive ability. Using these data, a multiple regression computer run derived the following multiple regression model:
Based on this model, describe the effect of AGE, EDUCATION, and SEX on cognition scores in centenarians.
______________
a George Box cited in Gilbert, J. P., & Mosteller, F. (1972). The urgent need for experimentation. In F. Mosteller & D. P. Moynihan (Eds.), On Equality of Educational Opportunity. New York: Vintage, p. 372.
b For discussions of the mathematics of multiple regression model fitting, see Kutner, M. H., Nachtsheim, C. J., & Neter, J. (2004). Applied Linear Regression Models (4th ed.). New York: McGraw-Hill/Irwin.
c Chapter 14 covers basic concepts for simple linear regression.
d The 589 nonsmokers have a mean age of 9.5 years (standard deviation 2.7 years). The 65 smokers have a mean age of 13.5 (standard deviation 2.3 years).
e The illustration uses SPSS, but any reliable statistical package would do.
f This is labeled (constant).
g This adjustment is needed because as predictors are added to the model, some of the variance in Y is explained simply by chance.