8

Regression and correlation

DOUGLAS G ALTMAN, MARTIN J GARDNER

The most common statistical analyses are those that examine one or two groups of individuals with respect to a single variable (see chapters 4 to 7). Also common are those analyses that consider the relation between two variables in one group of subjects. We use regression analysis to predict one variable from another, and correlation analysis to see if the values of two variables are associated. The purposes of these two analyses are distinct, and usually one only should be used.

We outline the calculation of the linear regression equation for predicting one variable from another and show how to calculate confidence intervals for the population value of the slope and intercept of the line, for the line itself, and for predictions made using the regression equation. We explain how to obtain a confidence interval for the population value of the difference between the slopes of regression lines from two groups of subjects and how to calculate a confidence interval for the vertical distance between two parallel regression lines.

We also describe the calculations of confidence intervals for Pearson’s correlation coefficient and Spearman’s rank correlation coefficient.

The calculations have been carried out to full arithmetical precision, as is recommended practice (see chapter 14), but intermediate steps are shown as rounded results. Methods of calculating confidence intervals for different aspects of regression and correlation are demonstrated. The appropriate ones to use depend on the particular problem being studied.

The interpretation of confidence intervals has been discussed in chapters 1 and 3. Confidence intervals convey only the effects of sampling variation on the estimated statistics and cannot control for other errors such as biases in design, conduct, or analysis.

Linear regression analysis

For two variables x and y we wish to calculate the regression equation for predicting y from x. We call y the dependent or outcome variable and x the independent or explanatory variable. The equation for the population regression line is

\[ y = A + Bx \]

where A is the intercept on the vertical y axis (the value of y when x = 0) and B is the slope of the line. In standard regression analysis it is assumed that the distribution of the y variable at each value of x is Normal with the same standard deviation, but no assumptions are made about the distribution of the x variable. Sample estimates a (of A) and b (of B) are needed and also the means of the two variables (x̄ and ȳ), the standard deviations of the two variables (sx and sy), and the residual standard deviation of y about the regression line (sres). The formulae for deriving a, b, and sres are given under “Technical details” at the end of this chapter.

All the following confidence intervals associated with a single regression line use the quantity t1 – α/2, the appropriate value from the t distribution with n – 2 degrees of freedom where n is the sample size. Thus, for a 95% confidence interval we need the value that cuts off the top 2·5% of the t distribution, denoted t0·975.

A fitted regression line should be used to make predictions only within the observed range of the x variable. Extrapolation outside this range is unwarranted and may mislead.1

It is always advisable to plot the data to see whether a linear relationship between x and y is reasonable. In addition a plot of the “residuals” (“observed minus predicted”—see “Technical details” at the end of this chapter) is useful to check the distributional assumptions for the y variable.
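
As a rough illustration of these checks, the sketch below (Python, assuming numpy and matplotlib are available) fits a straight line to simulated data and plots both the raw data and the residuals. The variable names and values are invented for illustration and are not the trial data of Table 8.1.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated illustrative data: x like a blood pressure (mmHg), y like a
# haemoglobin concentration (%); these are NOT the values from Table 8.1
rng = np.random.default_rng(0)
x = rng.uniform(90, 115, 10)
y = 20.0 - 0.12 * x + rng.normal(0, 0.5, 10)

b, a = np.polyfit(x, y, 1)          # slope and intercept of the least squares line
y_fit = a + b * x
residuals = y - y_fit               # "observed minus predicted"

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, y)                   # is a straight line reasonable?
ax1.plot(np.sort(x), a + b * np.sort(x))
ax1.set_xlabel("x (e.g. mean arterial blood pressure, mmHg)")
ax1.set_ylabel("y (e.g. total glycosylated haemoglobin, %)")

ax2.scatter(y_fit, residuals)       # residual plot to check the assumptions about y
ax2.axhline(0, linestyle="--")
ax2.set_xlabel("Fitted value")
ax2.set_ylabel("Residual")
plt.show()
```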

Illustrative data set

Table 8.1 shows data from a clinical trial of enalapril versus placebo in diabetic patients.2 The variables studied are mean arterial blood pressure (mmHg) and total glycosylated haemoglobin concentration (%). The analyses presented here are illustrative and do not relate directly to the clinical trial. Most of the methods for calculating confidence intervals are demonstrated using only the data from the 10 subjects who received enalapril.

Table 8.1 Mean arterial blood pressure and total glycosylated haemoglobin concentration in two groups of 10 diabetics on entry to a clinical trial of enalapril versus placebo2

img_075_001.gif

Single sample

We want to describe the way total glycosylated haemoglobin concentration changes with mean arterial blood pressure. The regression line of total glycosylated haemoglobin (TGH) concentration on mean arterial blood pressure (MAP) for the 10 subjects receiving enalapril is found to be

\[ \text{TGH} = 20.19 - (0.1168 \times \text{MAP}) \]

The estimated slope of the line is negative, indicating lower total glycosylated haemoglobin concentrations for subjects with higher mean arterial blood pressure.

The other quantities needed to obtain the various confidence intervals are shown in Table 8.1. The calculations use 95% confidence intervals. For this we need the value of t0.975 with 8 degrees of freedom, and Table 18.2 shows this to be 2·306.

Confidence interval for the slope of the regression line

The slope of the sample regression line estimates the mean change in y for a unit change in x. The standard error of the slope, b, is calculated as

\[ \mathrm{SE}(b) = \frac{s_{\mathrm{res}}}{s_x\sqrt{n-1}} \]

The 100(1 – α)% confidence interval for the population value of the slope, B, is given by

\[ b - t_{1-\alpha/2}\,\mathrm{SE}(b) \quad\text{to}\quad b + t_{1-\alpha/2}\,\mathrm{SE}(b) \]

Worked example
The standard error of the slope is

img_076_003.gif

The 95% confidence interval for the population value of the slope is thus

img_076_004.gif

that is, from –0·178 to –0·056% per mmHg.
For brevity, in further calculations on these data we will describe the units as %.
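
The slope and its standard error are reported directly by standard software. A minimal sketch in Python (assuming numpy and scipy are available; the data are simulated, not the values in Table 8.1) combines the standard error returned by scipy.stats.linregress with the t value on n – 2 degrees of freedom:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(90, 115, 10)                     # simulated explanatory variable
y = 20.0 - 0.12 * x + rng.normal(0, 0.5, 10)     # simulated outcome variable

n = len(x)
fit = stats.linregress(x, y)                     # slope, intercept, ..., stderr of slope
t_crit = stats.t.ppf(0.975, df=n - 2)            # t value for a 95% interval

ci_low = fit.slope - t_crit * fit.stderr
ci_high = fit.slope + t_crit * fit.stderr
print(f"slope = {fit.slope:.4f}, 95% CI {ci_low:.4f} to {ci_high:.4f}")
```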

Confidence interval for the mean value of y for a given value of x and for the regression line

The estimated mean value of y for any chosen value of x, say x0, is obtained from the fitted regression line as

\[ y_{\mathrm{fit}} = a + bx_0 \]

The standard error of yfit is given by

\[ \mathrm{SE}(y_{\mathrm{fit}}) = s_{\mathrm{res}}\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{(n-1)s_x^2}} \]

The 100(1 – α)% confidence interval for the population mean value of y at x = x0 is then

\[ y_{\mathrm{fit}} - t_{1-\alpha/2}\,\mathrm{SE}(y_{\mathrm{fit}}) \quad\text{to}\quad y_{\mathrm{fit}} + t_{1-\alpha/2}\,\mathrm{SE}(y_{\mathrm{fit}}) \]

When this calculation is made for all values of x in the observed range of x a 100(1 – α)% confidence interval for the position of the population regression line is obtained. Because of the expression (x0 – x̄)² in the formula for SE(yfit) the confidence interval becomes wider with increasing distance of x0 from x̄.

Worked example
The confidence interval for the mean total glycosylated haemoglobin concentration can be calculated for any specified value of mean arterial blood pressure. If the mean arterial blood pressure of interest is 100 mmHg the estimated total glycosylated haemoglobin concentration is yfit = 20·19 – (0·1168 × 100) = 8·51%. The standard error of this estimated value is

img_077_001.gif

The 95% confidence interval for the mean total glycosylated haemoglobin concentration for the population of diabetic subjects with a mean arterial blood pressure of 100 mmHg is thus

img_077_002.gif

that is, from 8·10% to 8·92%.
By calculating the 95% confidence interval for the mean total glycosylated haemoglobin concentration for all values of mean arterial blood pressure within the range of observations we get a 95% confidence interval for the population regression line. This is shown in Figure 8.1.
The confidence interval becomes wider as the mean arterial blood pressure moves away from the mean of 101·2 mmHg.
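
A sketch of the same calculation in Python (numpy and scipy assumed; simulated data rather than the trial values) gives the fitted mean and its confidence interval at a chosen x0, and repeats the calculation over a grid of x values to trace the confidence band for the line:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(90, 115, 10)
y = 20.0 - 0.12 * x + rng.normal(0, 0.5, 10)

n = len(x)
b, a = np.polyfit(x, y, 1)
s_res = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))   # residual SD about the line
t_crit = stats.t.ppf(0.975, df=n - 2)

x0 = 100.0                                                   # chosen value of x
y_fit = a + b * x0
se_fit = s_res * np.sqrt(1 / n + (x0 - x.mean()) ** 2 /
                         ((n - 1) * x.var(ddof=1)))
print(f"mean y at x0 = {x0}: {y_fit:.2f}, "
      f"95% CI {y_fit - t_crit * se_fit:.2f} to {y_fit + t_crit * se_fit:.2f}")

# Repeating over a grid of x values traces the confidence band for the line
grid = np.linspace(x.min(), x.max(), 50)
se_band = s_res * np.sqrt(1 / n + (grid - x.mean()) ** 2 /
                          ((n - 1) * x.var(ddof=1)))
band_low = a + b * grid - t_crit * se_band
band_high = a + b * grid + t_crit * se_band
```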

Confidence interval for the intercept of the regression line

The intercept of the regression line on the y axis is generally of less interest than the slope of the line and does not usually have any obvious clinical interpretation. It can be seen that the intercept is the fitted value of y when x is zero.

Thus a 100(1 – α)% confidence interval for the population value of the intercept, A, can be obtained using the formula from the preceding section with x0 = 0 and yfit = a. The standard error of a is given by

\[ \mathrm{SE}(a) = s_{\mathrm{res}}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{(n-1)s_x^2}} \]

The confidence interval for A is thus given by

\[ a - t_{1-\alpha/2}\,\mathrm{SE}(a) \quad\text{to}\quad a + t_{1-\alpha/2}\,\mathrm{SE}(a) \]

Worked example
The confidence interval for the population value of the intercept is the confidence interval for yfit when x = 0, and is calculated as before. In this case the intercept is 20·19%, with a standard error of 2·67%. Thus the 95% confidence interval is from 14·03% to 26·35%. Clearly in this example the intercept, relating to a mean arterial blood pressure of zero, is extrapolated well below the range of the data and is of no interest in itself.

Figure 8.1 Regression line of total glycosylated haemoglobin concentration on mean arterial blood pressure, with 95% confidence interval for the population mean total glycosylated haemoglobin concentration.

img_078_001.gif

Prediction interval for an individual (and all individuals)

It is useful to calculate the uncertainty in yfit as a predictor of y for an individual. The range of uncertainty is called a prediction (or tolerance) interval. A prediction interval is wider than the associated confidence interval for the mean value of y because the scatter of data about the regression line is more important. Unlike the confidence interval for the slope, the width of the prediction interval is not greatly influenced by sample size.

For an individual whose value of x is x0 the predicted value of y is yfit, given by

\[ y_{\mathrm{fit}} = a + bx_0 \]

To calculate the prediction interval we first estimate the standard deviation (spred) of individual values of y when x equals x0 as

\[ s_{\mathrm{pred}} = s_{\mathrm{res}}\sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{(n-1)s_x^2}} \]

The 100(1 – α)% prediction interval is then

\[ y_{\mathrm{fit}} - t_{1-\alpha/2}\,s_{\mathrm{pred}} \quad\text{to}\quad y_{\mathrm{fit}} + t_{1-\alpha/2}\,s_{\mathrm{pred}} \]

When this calculation is made for all values of x in the observed range the estimated prediction interval should include the values of y for 100(1 – α)% of subjects in the population.

Worked example
The 95% prediction interval for the total glycosylated haemoglobin concentration of an individual subject with a mean arterial blood pressure of 100 mmHg is obtained by first calculating spred:

img_079_003.gif

The 95% prediction interval is then given by

img_079_004.gif

that is, from 7·18 to 9·84%.
The contrast with the narrower 95% confidence interval for the mean total glycosylated haemoglobin concentration for a mean arterial blood pressure of 100 mmHg calculated above is noticeable. The 95% prediction intervals for the range of observed levels of mean arterial blood pressure are shown in Figure 8.2 and again these widen on moving away from the mean arterial blood pressure of 101·2 mmHg.
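
The prediction interval differs from the confidence interval above only through the extra “1 +” term in spred. A minimal Python sketch (numpy and scipy assumed; simulated data, not those of Table 8.1):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(90, 115, 10)
y = 20.0 - 0.12 * x + rng.normal(0, 0.5, 10)

n = len(x)
b, a = np.polyfit(x, y, 1)
s_res = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))
t_crit = stats.t.ppf(0.975, df=n - 2)

x0 = 100.0
y_fit = a + b * x0
# The extra "1 +" term reflects the scatter of individuals about the line
s_pred = s_res * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 /
                         ((n - 1) * x.var(ddof=1)))
print(f"95% prediction interval at x0 = {x0}: "
      f"{y_fit - t_crit * s_pred:.2f} to {y_fit + t_crit * s_pred:.2f}")
```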

Two samples

Regression lines fitted to observations from two independent groups of subjects can be analysed to see if they come from populations with regression lines that are parallel or even coincident.3

Confidence interval for the difference between the slopes of two regression lines

If we have fitted regression lines to two different sets of data on the same two variables we can construct a confidence interval for the difference between the population regression slopes using a similar approach to that for a single regression line. The standard error of the difference between the slopes is given by first calculating spool, the pooled residual standard deviation, as

\[ s_{\mathrm{pool}} = \sqrt{\frac{(n_1-2)s_{\mathrm{res},1}^2 + (n_2-2)s_{\mathrm{res},2}^2}{n_1+n_2-4}} \]

and then

\[ \mathrm{SE}(b_1-b_2) = s_{\mathrm{pool}}\sqrt{\frac{1}{(n_1-1)s_{x_1}^2} + \frac{1}{(n_2-1)s_{x_2}^2}} \]

where the suffixes 1 and 2 indicate values derived from the two separate sets of data. The 100(1 – α)% confidence interval for the population difference between the slopes is now given by

\[ (b_1-b_2) - t_{1-\alpha/2}\,\mathrm{SE}(b_1-b_2) \quad\text{to}\quad (b_1-b_2) + t_{1-\alpha/2}\,\mathrm{SE}(b_1-b_2) \]

where t1 – α/2 is the appropriate value from the t distribution with n1 + n2 – 4 degrees of freedom.

Figure 8.2 Regression line of total glycosylated haemoglobin concentration on mean arterial blood pressure, with 95% prediction interval for an individual total glycosylated haemoglobin concentration.

img_080_001.gif

Worked example
The regression line for the placebo group from the data in Table 8.1 is

img_081_001.gif

The difference between the estimated slopes of the two regression lines is –0·1168 – (–0·09268) = –0·02412%. The standard error of this difference is found by first calculating spool as

img_081_002.gif

and then

img_081_003.gif

From Table 18.2 the value of t0·975 with 16 degrees of freedom is 2·120, so the 95% confidence interval for the population difference between the slopes is

img_081_004.gif

that is, from –0·147 to 0·098%.
Since a zero difference between slopes is near the middle of this confidence interval there is no evidence that the two population regression lines have different slopes. This is not surprising in this example as the subjects were allocated at random to the treatment groups.
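
The calculation for two groups can be scripted along the same lines. The sketch below (Python with numpy and scipy; both data sets are simulated, not the trial data) applies the pooled formulae given above:

```python
import numpy as np
from scipy import stats

def line_summary(x, y):
    """Slope, residual SD, and sum of squares of x about its mean."""
    n = len(x)
    b, a = np.polyfit(x, y, 1)
    s_res = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))
    sxx = (n - 1) * np.var(x, ddof=1)        # equals (n-1) * sx^2
    return n, b, s_res, sxx

rng = np.random.default_rng(4)
x1 = rng.uniform(90, 115, 10)
y1 = 20.0 - 0.12 * x1 + rng.normal(0, 0.5, 10)
x2 = rng.uniform(90, 115, 10)
y2 = 18.0 - 0.09 * x2 + rng.normal(0, 0.5, 10)

n1, b1, s1, sxx1 = line_summary(x1, y1)
n2, b2, s2, sxx2 = line_summary(x2, y2)

s_pool = np.sqrt(((n1 - 2) * s1**2 + (n2 - 2) * s2**2) / (n1 + n2 - 4))
se_diff = s_pool * np.sqrt(1 / sxx1 + 1 / sxx2)
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 4)

diff = b1 - b2
print(f"difference in slopes {diff:.4f}, "
      f"95% CI {diff - t_crit * se_diff:.4f} to {diff + t_crit * se_diff:.4f}")
```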

Confidence interval for the common slope of two parallel regression lines

If the chosen confidence interval—for example, 95%—for the difference between population values of the slopes includes zero it is reasonable to fit two parallel regression lines with the same slope and calculate a confidence interval for their common slope. In practice we would usually perform this analysis by multiple regression using a statistical software package (see below). The calculation can be done, however, using the results obtained by fitting separate regression lines to the two groups and the standard deviations of the x and y values in the two groups: sx1, sx2, sy1, and sy2. First we define the quantity w as

img_081_005.gif

The common slope of the parallel lines (bpar) is estimated as

img_081_006.gif

The residual standard deviation of y around the parallel lines (spar) is given by

img_082_001.gif

and the standard error of the slope by

img_082_002.gif

The 100(1 – α)% confidence interval for the population value of the common slope is then

\[ b_{\mathrm{par}} - t_{1-\alpha/2}\,\mathrm{SE}(b_{\mathrm{par}}) \quad\text{to}\quad b_{\mathrm{par}} + t_{1-\alpha/2}\,\mathrm{SE}(b_{\mathrm{par}}) \]

where t1 – α/2 is the appropriate value from the t distribution with n1 + n2 – 3 degrees of freedom.

Worked example
We first calculate the quantity w as

img_082_004.gif

The common slope of the parallel lines is then found as

img_082_005.gif

The residual standard deviation of y around the parallel lines is

img_082_006.gif

The standard error of the common slope is thus

img_082_007.gif

From Table 18.2 the value of t0·975 with 17 degrees of freedom is 2·110, so the 95% confidence interval for the population value of bpar is

img_082_008.gif

that is, from –0·165 to –0·047%.

Confidence interval for the vertical distance between two parallel regression lines

The intercepts of the two parallel lines with the y axis are given by

\[ a_1 = \bar{y}_1 - b_{\mathrm{par}}\bar{x}_1 \quad\text{and}\quad a_2 = \bar{y}_2 - b_{\mathrm{par}}\bar{x}_2 \]

We are usually more interested in the difference between the intercepts, which is the vertical distance between the parallel lines. This is the same as the difference between the fitted y values for the two groups at the same value of x, and is equivalent to adjusting the observed mean values of y for the mean values of x, a method known as analysis of covariance.3 The adjusted mean difference (ydiff) is calculated as

\[ y_{\mathrm{diff}} = \bar{y}_1 - \bar{y}_2 - b_{\mathrm{par}}(\bar{x}_1-\bar{x}_2) \]

and the standard error of ydiff is

\[ \mathrm{SE}(y_{\mathrm{diff}}) = s_{\mathrm{par}}\sqrt{\frac{1}{n_1} + \frac{1}{n_2} + \frac{(\bar{x}_1-\bar{x}_2)^2}{(n_1-1)s_{x_1}^2 + (n_2-1)s_{x_2}^2}} \]

The 100(1 – α)% confidence interval for the population value of ydiff is then

\[ y_{\mathrm{diff}} - t_{1-\alpha/2}\,\mathrm{SE}(y_{\mathrm{diff}}) \quad\text{to}\quad y_{\mathrm{diff}} + t_{1-\alpha/2}\,\mathrm{SE}(y_{\mathrm{diff}}) \]

where t1 – α/2 is the appropriate value from the t distribution with n1 + n2 – 3 degrees of freedom.

Worked example
Using the calculated value for the common slope the adjusted difference between the mean total glycosylated haemoglobin concentration in the two groups is

img_083_005.gif

and its standard error is

img_083_006.gif

The 95% confidence interval for the population value of ydiff is then given by

img_083_007.gif

that is, from –0·29 to 1·20%.
Figure 8.3 illustrates the effect of adjustment.

Figure 8.3 Illustration of the calculation of the adjusted difference between mean total glycosylated haemoglobin concentrations in two groups.
   Differences between means:
   ×, observed difference = ȳ1 – ȳ2 = 0·20%;
   × ×, adjusted difference = ydiff = 0·45%.

img_084_001.gif
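
As noted above, fitting parallel lines is most conveniently done as a multiple regression with a 0/1 group indicator. The following Python sketch (numpy and scipy assumed; the data are simulated, not those of Table 8.1) builds the design matrix directly: the coefficient of x is the common slope bpar and the coefficient of the group indicator is the adjusted difference ydiff, each with a confidence interval on n1 + n2 – 3 degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x1 = rng.uniform(90, 115, 10)
y1 = 20.0 - 0.12 * x1 + rng.normal(0, 0.5, 10)
x2 = rng.uniform(90, 115, 10)
y2 = 20.5 - 0.12 * x2 + rng.normal(0, 0.5, 10)

x = np.concatenate([x1, x2])
y = np.concatenate([y1, y2])
group = np.concatenate([np.ones(10), np.zeros(10)])   # 1 = group 1, 0 = group 2

# Design matrix for two parallel lines: intercept, common slope, group offset
X = np.column_stack([np.ones_like(x), x, group])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef
df = len(y) - X.shape[1]                               # n1 + n2 - 3
s_par = np.sqrt(residuals @ residuals / df)            # residual SD about the lines
cov = s_par**2 * np.linalg.inv(X.T @ X)                # covariance of the coefficients
se = np.sqrt(np.diag(cov))
t_crit = stats.t.ppf(0.975, df=df)

b_par, y_diff = coef[1], coef[2]    # common slope; vertical distance between the lines
print(f"common slope {b_par:.4f}, 95% CI "
      f"{b_par - t_crit * se[1]:.4f} to {b_par + t_crit * se[1]:.4f}")
print(f"adjusted difference {y_diff:.3f}, 95% CI "
      f"{y_diff - t_crit * se[2]:.3f} to {y_diff + t_crit * se[2]:.3f}")
```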

More than two samples

The methods described for two groups can be extended to the case of more than two groups of individuals,3 although such problems are rather rare. The calculations are best done using software for multiple regression, as discussed below.

We have shown linear regression with a continuous explanatory variable but this is not a requirement. Regression with a single binary explanatory variable is equivalent to performing a two-sample t test. When several explanatory variables are considered at once using multiple regression, as described below, it is common for some to be binary.

Binary outcome variable—logistic regression

In many studies the outcome variable of interest is the presence or absence of some condition, such as responding to treatment or having a myocardial infarction. When we have a binary outcome variable and give the categories numerical values of 0 and 1, usually representing “No” and “Yes” respectively, then the mean of these values in a sample of individuals is the same as the proportion of individuals with the characteristic. We might expect, therefore, that the appropriate regression model would predict the proportion of subjects with the feature of interest (or, equivalently, the probability of an individual having that characteristic) for different values of the explanatory variable.

The basic principle of logistic regression is much the same as for ordinary multiple regression. The main difference is that the model predicts a transformation of the outcome of interest. If p is the proportion of individuals with the characteristic, the transformation we use is the logit transformation, logit(p) = log[p/(1 – p)]. The regression coefficient for an explanatory variable compares the estimated outcome associated with two values of that variable, and is the log of the odds ratio (see chapter 7). We can thus use the model to estimate the odds ratio as eb, where b is the estimated regression coefficient (log odds ratio). A confidence interval for the odds ratio is obtained by applying the same transformation to the confidence interval for the regression coefficient. An example is given below, in the section on multiple regression.
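
A sketch of such an analysis using the statsmodels package in Python is shown below; the data are simulated and the variable names (age, treated) are invented for illustration only. The regression coefficients and their confidence intervals are exponentiated to give odds ratios:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 200
age = rng.uniform(40, 80, n)                     # continuous explanatory variable
treated = rng.integers(0, 2, n)                  # binary explanatory variable (0/1)
logit_p = -6 + 0.08 * age - 0.5 * treated        # "true" model used only to simulate
outcome = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(float)

X = sm.add_constant(np.column_stack([age, treated]))
fit = sm.Logit(outcome, X).fit(disp=0)

odds_ratios = np.exp(fit.params[1:])             # e^b for each variable (skip constant)
or_ci = np.exp(fit.conf_int()[1:])               # 95% CI on the odds ratio scale
print(odds_ratios)
print(or_ci)
```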

Outcome is time to an event—Cox regression

The regression method introduced by Cox in 1972 is used widely when the outcome is the time to an event.4 It is also known as proportional hazards regression analysis. The underlying methodology is complex, but the resulting regression model has a similar interpretation to a logistic regression model. Again the explanatory variable could be continuous or binary (for example, treatment in a randomised trial).

In this model the regression coefficient is the logarithm of the relative hazard (or “hazard ratio”) at a given time. The hazard represents the risk of the event in a very short time interval after a given time, given survival that far. The hazard ratio is interpreted as the relative risk of the event for two groups defined by different values of the explanatory variable. The model makes the strong assumption that this ratio is the same at all times after the start of follow up (for example, after randomisation in a controlled trial).

We can use the model to estimate the hazard ratio as eb, where b is the estimated regression coefficient (log hazard ratio). As for logistic regression, a confidence interval is obtained by applying the same transformation to the confidence interval for the regression coefficient. An example is given below, in the section on multiple regression. A more detailed explanation of the method is given in chapter 9.
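
The sketch below shows one way this might be done in Python, assuming the lifelines package is available; the data frame, column names, and simulated survival times are illustrative only. The coefficient and its confidence interval are exponentiated to give the hazard ratio:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(7)
n = 200
treated = rng.integers(0, 2, n)
# Simulated survival times with a hazard roughly halved by treatment
time = rng.exponential(scale=np.where(treated == 1, 2.0, 1.0), size=n)
event = (rng.random(n) < 0.8).astype(int)        # some observations are censored

df = pd.DataFrame({"time": time, "event": event, "treated": treated})
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")

hazard_ratio = np.exp(cph.params_["treated"])              # e^b
hr_ci = np.exp(cph.confidence_intervals_.loc["treated"])   # 95% CI for the hazard ratio
print(hazard_ratio, hr_ci.values)
```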

Several explanatory variables—multiple regression

In much medical research there are several potential explanatory variables. The principles of regression—linear, logistic, or Cox— can be extended fairly simply by using multiple regression.4 Further, the analysis may include binary as well as continuous explanatory variables. Standard statistical software can perform such calculations in a straightforward way. There are many issues relating to such analysis that are beyond the scope of this book—for discussion see, for example, Altman.4

Although the nature of the prediction varies according to the model, in each case the multiple regression analysis produces a regression equation (or “model”) which is effectively a weighted combination of the explanatory variables. The regression coefficients are the weights given to each variable.

For each variable in the regression model there is a standard error, so that it is easy to calculate a confidence interval for any particular variable in the model, using the standard approach of chapter 3.

The multiple linear regression model is

\[ Y = b_0 + b_1X_1 + b_2X_2 + \cdots + b_kX_k \]

where Y is the outcome variable, X1 to Xk are k explanatory variables, b0 is a constant (intercept), and b1 to bk are the regression coefficients. The nature of the predicted outcome, Y, varies according to the type of model, as discussed above, but the regression model has the same form in each case. The main difference in multiple regression is that the relation between the outcome and a particular explanatory variable is “adjusted” for the effects of other variables.
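
A brief sketch of a multiple linear regression in Python with statsmodels (simulated data; one continuous and one binary explanatory variable, both invented for illustration) showing the adjusted coefficients and their confidence intervals:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 50
blood_pressure = rng.uniform(90, 115, n)         # continuous explanatory variable
treated = rng.integers(0, 2, n)                  # binary explanatory variable (0/1)
outcome = 20 - 0.1 * blood_pressure - 0.5 * treated + rng.normal(0, 0.5, n)

X = sm.add_constant(np.column_stack([blood_pressure, treated]))
fit = sm.OLS(outcome, X).fit()

print(fit.params)      # b0 (constant), b1, b2: each adjusted for the other variable
print(fit.conf_int())  # 95% confidence interval for each coefficient
```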

The explanatory variables can be either continuous or binary. Table 8.2 shows the interpretation of regression coefficients for different types of outcome variable and for either continuous or binary explanatory variables. For binary variables it is assumed that these have numerical codes that differ by one; most often these are 0 and 1.

Table 8.2 Interpretation of regression coefficients for different types of outcome variable and for either continuous or binary explanatory variables

img_087_001.gif

For multiple logistic regression models the regression coefficients and their confidence intervals need to be exponentiated (antilogged) to give the estimated odds ratio and its confidence interval. The same process is needed in Cox regression to get the estimated hazard ratio with its confidence interval.

Examples

Table 8.3 shows a multiple logistic regression model from a study of 183 men with unstable angina.5 The regression coefficients and confidence intervals have been converted to the odds ratio scale.

Some points to note about this analysis are:

Table 8.3 Logistic regression model for predicting cardiac death or non-fatal myocardial infarction.5

img_087_002.gif

Table 8.4 Association between number of children and risk of testicular cancer in Danish men (514 cases and 720 controls)6

img_088_001.gif
Table 8.4 shows part of a multiple logistic regression model with an ordinal explanatory variable. The number of children was compared for men with testicular cancer and controls, and adjustment made in the model for cryptorchidism, testicular atrophy, and other characteristics. Here each group with children is compared with the reference group with no children. This is achieved in the regression model by creating three binary variables indicating respectively whether or not each man has 1, 2, or 3 children. At most one of these ‘dummy’ variables will be 1, and the others are zero.4
Note that the overall evaluation of the relation between this ordinal variable and outcome (here risk of testicular cancer) should be based on the trend across the four groups (see “Multiple comparisons” in chapter 13). Here there was a highly significant trend of decreasing risk in relation to number of children even though one of the confidence intervals does not exclude unity.6
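
The “dummy” variables described above can be created automatically by most packages. For example, a small Python sketch with pandas (the data values are hypothetical):

```python
import pandas as pd

# Hypothetical ordinal variable: number of children, with "0" as the reference group
children = pd.Series([0, 1, 2, 3, 1, 0, 2], name="children")

# One 0/1 indicator per non-reference category; at most one is 1 for each subject
dummies = pd.get_dummies(children, prefix="children", drop_first=True).astype(int)
print(dummies)
```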

Correlation analysis

Pearson’s product moment correlation coefficient

The correlation coefficient usually calculated is the product moment correlation coefficient or Pearson’s r. This measures the degree of linear ‘co-relation’ between two variables x and y. The formula for calculating r for a sample of observations is given at the end of the chapter.

A confidence interval for the population value of r can be constructed by using a transformation of r to a quantity Z, which has an approximately Normal distribution. This calculation relies on the assumption that x and y have a joint bivariate Normal distribution (in practice, that the distributions of both variables are reasonably Normal).

The transformed value, Z, is given by

\[ Z = \frac{1}{2}\log_e\!\left(\frac{1+r}{1-r}\right) \]

which for all values of r has standard error 1/√(n – 3), where n is the sample size. For a 100(1 – α)% confidence interval we then calculate the two quantities

\[ F = Z - \frac{z_{1-\alpha/2}}{\sqrt{n-3}} \quad\text{and}\quad G = Z + \frac{z_{1-\alpha/2}}{\sqrt{n-3}} \]

where z1 – α/2 is the appropriate value from the standard Normal distribution for the 100(1 – α/2)% percentile (Table 18.1).

The values F and G need to be transformed back to the original scale to give a 100(1 – α)% confidence interval for the population correlation coefficient as

\[ \frac{e^{2F}-1}{e^{2F}+1} \quad\text{to}\quad \frac{e^{2G}-1}{e^{2G}+1} \]

Worked example
Table 8.5 shows the basal metabolic rate and total energy expenditure in 24 hours from a study of 13 non-obese women.7 The data are ranked by increasing basal metabolic rate. Pearson’s r for these data is 0·7283, and the transformed value Z is

\[ Z = \frac{1}{2}\log_e\!\left(\frac{1 + 0.7283}{1 - 0.7283}\right) = 0.925 \]

Table 8.5 Basal metabolic rate and isotopically measured 24-hour energy expenditure in 13 non-obese women7

img_090_001.gif
The values of F and G for a 95% confidence interval are

\[ F = 0.925 - \frac{1.96}{\sqrt{13-3}} = 0.305 \]

and

\[ G = 0.925 + \frac{1.96}{\sqrt{13-3}} = 1.545 \]

From these values we derive the 95% confidence interval for the population correlation coefficient as

\[ \frac{e^{2 \times 0.305}-1}{e^{2 \times 0.305}+1} \quad\text{to}\quad \frac{e^{2 \times 1.545}-1}{e^{2 \times 1.545}+1} \]

that is, from 0·30 to 0·91.
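
The worked example can be reproduced with a few lines of Python (numpy and scipy assumed), using only the values of r and n given above; np.arctanh and np.tanh perform the transformation and its inverse:

```python
import numpy as np
from scipy import stats

r, n = 0.7283, 13                      # Pearson's r and sample size from the example
z_crit = stats.norm.ppf(0.975)         # 1.96 for a 95% interval

Z = np.arctanh(r)                      # Fisher's transformation, 0.5*log((1+r)/(1-r))
se = 1 / np.sqrt(n - 3)
F, G = Z - z_crit * se, Z + z_crit * se
ci = np.tanh([F, G])                   # back-transform, (e^(2F)-1)/(e^(2F)+1) etc.
print(f"95% CI for the correlation: {ci[0]:.2f} to {ci[1]:.2f}")   # about 0.30 to 0.91
```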

Spearman’s rank correlation coefficient

If either the distributional assumptions are not met or the relation between x and y is not linear we can use a rank method to assess a more general relation between the values of x and y. To calculate Spearman’s rank correlation coefficient (rs) the values of x and y for the n individuals have to be ranked separately in order of increasing size from 1 to n. Spearman’s rank correlation coefficient is then obtained either by using the standard formula for Pearson’s product moment correlation coefficient on the ranks of the two variables, or (as shown below under “Technical details”) using the difference in their two ranks for each individual. The distribution of rs is similar to that of Pearson’s r, so that confidence intervals can be constructed as shown in the previous section.
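
A short Python sketch (numpy and scipy assumed; the data are simulated to be monotonic but not linear) computes rs with scipy.stats.spearmanr and applies the same transformation-based interval as for Pearson’s r; this interval should be regarded as approximate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
x = rng.uniform(0, 10, 13)
y = x**2 + rng.normal(0, 5, 13)        # monotonic but not linear relation

rs, _ = stats.spearmanr(x, y)          # Spearman's rank correlation coefficient
n = len(x)
z_crit = stats.norm.ppf(0.975)
se = 1 / np.sqrt(n - 3)
F, G = np.arctanh(rs) - z_crit * se, np.arctanh(rs) + z_crit * se
print(f"rs = {rs:.2f}, approximate 95% CI {np.tanh(F):.2f} to {np.tanh(G):.2f}")
```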

Technical details: formulae for regression and correlation analyses

We strongly recommend that statistical software is used to perform regression or correlation analyses. Some formulae are given here to help explain the underlying principles.

Regression

The slope of the regression line is given by

\[ b = \frac{\sum(x-\bar{x})(y-\bar{y})}{\sum(x-\bar{x})^2} \]

where ∑ represents summation over the sample of size n. The intercept is given by

\[ a = \bar{y} - b\bar{x} \]

The difference between the observed and predicted values of y for an individual with observed values x0 and y0 is y0yfit, where yfit = a + bx0. The standard deviation of these differences (called “residuals”) is thus a measure of how well the line fits the data.

The residual standard deviation of y about the regression line is

\[ s_{\mathrm{res}} = \sqrt{\frac{\sum(y - y_{\mathrm{fit}})^2}{n-2}} \]

Most statistical computer programs give all the necessary quantities to derive confidence intervals, but you may find that the output refers to sres as the ‘standard error of the estimate’.
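
These formulae are easily checked against library output. A minimal Python sketch with numpy (simulated data):

```python
import numpy as np

rng = np.random.default_rng(10)
x = rng.uniform(90, 115, 10)
y = 20.0 - 0.12 * x + rng.normal(0, 0.5, 10)
n = len(x)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)                      # observed minus predicted
s_res = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# The same slope and intercept are returned by, e.g., scipy.stats.linregress
print(b, a, s_res)
```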

Correlation

The correlation coefficient (Pearson’s r) is estimated by

\[ r = \frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2\sum(y-\bar{y})^2}} \]

Spearman’s rank correlation coefficient is given by

\[ r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)} \]

where di is the difference in the ranks of the two variables for the ith individual. Alternatively, rs can be obtained by applying the formula for Pearson’s r to the ranks of the variables. The calculation of rs should be modified when there are tied ranks in the data, but the effect is minimal unless there are many tied ranks.
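
A short Python check (numpy and scipy assumed; simulated data with no tied ranks) confirms that the di formula and Pearson’s r applied to the ranks give the same value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x = rng.normal(size=13)
y = x + rng.normal(size=13)
n = len(x)

rank_x, rank_y = stats.rankdata(x), stats.rankdata(y)
d = rank_x - rank_y
rs_formula = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))   # the d_i formula
rs_pearson = np.corrcoef(rank_x, rank_y)[0, 1]             # Pearson's r on the ranks
print(rs_formula, rs_pearson)    # identical when there are no tied ranks
```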

1 Altman DG, Bland JM. Generalisation and extrapolation. BMJ 1998;317:409–10.

2 Marre M, Leblanc H, Suarez L, Guyenne T-T, Ménard J, Passa P. Converting enzyme inhibition and kidney function in normotensive diabetic patients with persistent micro-albuminuria. BMJ 1987;294:1448–52.

3 Armitage P, Berry G. Statistical methods in medical research. 3rd edn. Oxford: Blackwell Science, 1994: 336–7.

4 Altman DG. Practical statistics for medical research. London: Chapman & Hall, 1991: 336–58.

5 Stubbs P, Collinson P, Moseley D, Greenwood T, Noble M. Prospective study of the role of cardiac troponin T in patients admitted with unstable angina. BMJ 1996;313:262–4.

6 Møller H, Skakkebæk NE. Risk of testicular cancer in subfertile men: case-control study. BMJ 1999;318:559–62.

7 Prentice AM, Black AE, Coward WA, et al. High levels of energy expenditure in obese women. BMJ 1986;292:983–7.