Chapter 12

Alternate Methods of Regression

Imagine how statisticians might feel about the powerful statistics programs that are now in our hands. It is so easy to key-in a set of data and calculate a wide variety of statistics—regardless what those statistics are or what they mean. There also is a need to check that things are done correctly in the statistical analyses we perform in our laboratories.

—James O. Westgard [1998]

IN THE PREVIOUS CHAPTER, WE FOCUSED EXCLUSIVELY ON ordinary least-squares (OLS) linear regression, both because it is the most common modeling technique and because the limitations and caveats we outlined there apply to virtually all modeling techniques. But OLS is not the only modeling technique. To diminish the effect of outliers, and to treat prediction errors as proportional to their absolute magnitude rather than to their squares, one should use least-absolute-deviation (LAD) regression. This would be the case if the conditional distribution of the dependent variable had heavy tails (that is, compared to the normal distribution, an increased probability of values far from the mean).

One should also employ LAD regression when the conditional distribution of the dependent variable given the predictors is not symmetric and we wish to estimate its median rather than its mean value.

If it is not clear which variable should be viewed as the predictor and which the dependent variable, as is the case when evaluating two methods of measurement, then one should employ Deming or error in variable (EIV) regression.

If one’s primary interest is not in the expected value of the dependent variable but in its extremes (the number of bacteria that will survive treatment or the number of individuals who will fall below the poverty line), then one ought to consider the use of quantile regression.

If distinct strata exist, one should consider developing separate regression models for each stratum, a technique known as ecological regression, discussed in the next-to-last section of the present chapter.

If one’s interest is in classification or if the majority of one’s predictors are dichotomous, then one should consider the use of classification and regression trees (CART) discussed in the next chapter.

If the outcomes are limited to success or failure, one ought to employ logistic regression. If the outcomes are counts rather than continuous measurements, one should employ a generalized linear model (GLM). See Chapter 14.

LINEAR VERSUS NONLINEAR REGRESSION

Linear regression is a much misunderstood and mistaught concept. If a linear model provides a good fit to data, this does not imply that a plot of the dependent variable against the predictor would be a straight line, only that a plot of the dependent variable against some (not necessarily monotonic) function of the predictor would be a straight line.

For example, y = A + B log[x] and y = A cos(x) + B sin(x) are both linear models whose coefficients A and B might be derived by OLS or LAD methods. y = Ax^5 is a linear model; y = x^A is nonlinear.
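To make the distinction concrete, here is a brief sketch in R, using simulated data of our own devising, that fits both of the linear models above with lm(); the nonlinear model y = x^A would instead require a routine such as nls().

```r
# Both models are linear in their coefficients A and B, even though they
# are curved in x, so lm() fits them directly. Simulated data for
# illustration only.
set.seed(1)
x  <- seq(0.1, 10, length.out = 100)
y1 <- 2 + 3 * log(x) + rnorm(100, sd = 0.5)          # y = A + B log(x)
y2 <- 2 * cos(x) + 3 * sin(x) + rnorm(100, sd = 0.5) # y = A cos(x) + B sin(x)

coef(lm(y1 ~ log(x)))               # recovers A and B
coef(lm(y2 ~ cos(x) + sin(x) - 1))  # "- 1": this model has no intercept
# y = A x^5 is also linear (regress y on x^5); y = x^A is not, and would
# require a nonlinear routine such as nls().
```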

LEAST-ABSOLUTE-DEVIATION REGRESSION

The two most popular linear regression methods for estimating model coefficients are referred to as ordinary-least-squares (OLS) and least-absolute-deviation (LAD) goodness of fit, respectively. Because they are popular, a wide selection of computer software is available to help us do the calculations.

With least-squares goodness of fit, we seek to minimize the sum

$$\sum_{i=1}^{n} \left(Y_i - a - bX_i\right)^2,$$

where Yi denotes the variable we wish to predict, Xi the corresponding value of the predictor on the ith occasion, and a and b the to-be-estimated intercept and slope. With the LAD method, we seek to minimize the sum of the absolute deviations between the observed and the predicted values:

$$\sum_{i=1}^{n} \left|Y_i - a - bX_i\right|.$$

Those who have taken calculus know the OLS minimum is obtained when

$$\frac{\partial}{\partial a}\sum_{i=1}^{n}\left(Y_i - a - bX_i\right)^2 = 0
\quad\text{and}\quad
\frac{\partial}{\partial b}\sum_{i=1}^{n}\left(Y_i - a - bX_i\right)^2 = 0,$$

that is, when

$$\hat{b} = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}$$

and

$$\hat{a} = \bar{Y} - \hat{b}\,\bar{X}.$$
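Readers can verify these closed-form estimates numerically; the following R sketch, with simulated data of our own devising, computes them directly and checks the result against lm().

```r
# Computing the closed-form OLS estimates directly and checking them
# against lm(), using simulated data.
set.seed(7)
X <- rnorm(30)
Y <- 1 + 2 * X + rnorm(30)

b.hat <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
a.hat <- mean(Y) - b.hat * mean(X)

c(a.hat, b.hat)
coef(lm(Y ~ X))   # identical estimates
```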

Least-absolute-deviation regression (LAD) attempts to correct one of the major flaws of OLS, that of sometimes giving excessive weight to extreme values. The LAD method solves for those values of the coefficients in the regression equation for which the sum of the absolute deviations Σ|yi − R[xi]| is a minimum.

Finding the LAD minimum is more complicated and requires linear programming, but as there is plenty of commercially available software to do the calculations for us, we need not worry about the complexity.

Algorithms for LAD regression are given in Barrodale and Roberts [1973]. Stata’s qreg function provides LAD regression, as does R’s quantreg package.
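As an illustration, the following R sketch contrasts the two fits on simulated data of our own devising, loosely in the spirit of Figure 12.1 below; setting tau = 0.5 in quantreg’s rq function yields the LAD (median-regression) fit.

```r
# OLS versus LAD on simulated data with a single extreme observation.
# rq() with tau = 0.5 is median (LAD) regression.
library(quantreg)

set.seed(2)
age <- sample(30:70, 40, replace = TRUE)
sbp <- 100 + 0.8 * age + rnorm(40, sd = 8)
age[1] <- 47
sbp[1] <- 220                      # one outlier

coef(lm(sbp ~ age))                # OLS: slope dragged by the outlier
coef(rq(sbp ~ age, tau = 0.5))     # LAD: largely unaffected
```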

LAD regression should be used in preference to OLS in four circumstances:

1. To reduce the influence of outliers.
2. If the losses associated with prediction errors increase in proportion to the size of the error, rather than large errors being substantially more costly than small ones.
3. If the conditional distribution of Y|X = x* is not symmetric and we wish to estimate the median of Y|X = x rather than its mean value.
4. If the conditional distribution of Y|X = x is heavy in the tails.

Figure 12.1 depicts systolic blood pressure as a function of age. Each circle corresponds to a pair of observations on a single individual. The solid line is the LAD regression line; the dotted line is the OLS regression line. A single individual, a 47-year-old with a systolic blood pressure of 220, is responsible for the difference between the two lines. Which line do you feel it would be better to use for prediction purposes?

FIGURE 12.1. Systolic blood pressure as a function of age: LAD fit (solid line) and OLS fit (dotted line).


Drawbacks of LAD Regression

Opinions differ as to whether LAD is unstable in the sense that a small change in the data can cause a relatively large change in the fitted plane. Ellis [1998] reports that a change in the value of one observation by as little as 1/20,000th of the interquartile range of the predictor can result in a marked change in the slope of the LAD line.

Portnoy and Mizera [1998] strongly disagree and our own investigations support their thesis. The entire discussion may be viewed at http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.ss/1028905829.

ERRORS-IN-VARIABLES REGRESSION

The need for errors-in-variables (EIV) or Deming regression is best illustrated by the struggles of a small medical device firm to bring its product to market. First, it must convince regulators that its long-lasting device provides results equivalent to those of a less-efficient device already on the market. In other words, it needs to show that the values V recorded by its device bear a linear relation to the values W recorded by its competitor’s device, that is, that E(V) = a + bW.

But the errors inherent in measuring W (the so-called predictor) are as large as, if not larger than, the variation inherent in the output V of the new device. The EIV regression method used to demonstrate equivalence differs in two respects from OLS:

1. With OLS, we are trying to minimize the sum of squares $\sum_i (y_{oi} - y_{pi})^2$, where $y_{oi}$ is the $i$th observed value of $Y$ and $y_{pi}$ is the $i$th predicted value. With EIV, we are trying to minimize the sums of squared errors going both ways: $\sum_i (y_{oi} - y_{pi})^2/\mathrm{Var}\,Y + \sum_i (x_{oi} - x_{pi})^2/\mathrm{Var}\,X$.
2. The coefficients of the EIV regression line depend on the ratio λ = Var X/Var Y.

Unfortunately, in cases involving only single measurements by each method, the ratio λ may be unknown and is often assigned a default value of one. In a simulated comparison of two electrolyte methods, Linnet [1998] found that misspecification of λ produced a bias that amounted to two-thirds of the maximum bias of the ordinary least-squares regression method. Standard errors and the results of hypothesis testing also became misleading. In a simulated comparison of two glucose methods, Linnet found that a misspecified error ratio resulted in only a negligible bias. Given a short range of values in relation to the measurement errors, it is important that λ be correctly estimated, either from duplicate sets of measurements or, in the case of single measurement sets, specified from quality-control data. Even with a misspecified error ratio, Linnet found that Deming regression analysis is likely to perform better than least-squares regression analysis.
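For readers who wish to experiment, here is a minimal sketch of Deming regression in R using the standard closed-form solution; the simulated data and the function name deming are our own illustrative choices, and dedicated packages (such as mcr) provide fuller implementations.

```r
# A minimal sketch of Deming (EIV) regression via the closed-form
# solution, with lambda = Var(errors in X) / Var(errors in Y) as in the
# text above.
deming <- function(x, y, lambda = 1) {
  delta <- 1 / lambda           # ratio of y-error to x-error variance
  sxx <- var(x); syy <- var(y); sxy <- cov(x, y)
  b <- (syy - delta * sxx +
        sqrt((syy - delta * sxx)^2 + 4 * delta * sxy^2)) / (2 * sxy)
  a <- mean(y) - b * mean(x)
  c(intercept = a, slope = b)
}

# Two methods measuring the same quantity, both with error (simulated):
set.seed(3)
truth <- runif(50, 5, 15)
w <- truth + rnorm(50, sd = 1)             # comparison method
v <- 1 + 1.1 * truth + rnorm(50, sd = 1)   # new device

deming(w, v, lambda = 1)   # slope near the true 1.1
coef(lm(v ~ w))            # OLS slope biased toward zero
```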


WHEN DOES THIS DIFFERENCE MATTER?
When the relative errors for the two methods are similar and the correlation coefficient is greater than 0.8, the OLS regression slope can be approximated as:

$$b_{\mathrm{OLS}} \approx \rho \, b_{\mathrm{Deming}},$$

where ρ is the correlation coefficient. This means that the OLS slope routinely underestimates the actual slope of the data. For ρ less than 0.8, the approximation is no longer as accurate, but differences of 20% or more continue to exist between the slopes calculated by the two methods.
For many clinical chemistry procedures, ρ is greater than 0.995 and there is very little difference between OLS and Deming regression. For predictors such as electrolytes and many hematology parameters (especially the white cells), ρ can easily be less than 0.95, and sometimes in the range of 0.2 to 0.8. In these cases, the use of Deming statistics makes a large difference in the results.
One such example, depicted in Figure 12.2, arises when activated partial thromboplastin time (APTT) is used to determine the correct dose of heparin (a blood thinner). Either too much or too little heparin could seriously impair a patient’s health. But which line are we to use?

FIGURE 12.2. Solid line is OLS; dashed line is EIV; λ = 1600.


In practice, Stöckl, Dewitte, and Thienpont [1998] find that it is not the statistical model but the quality of the analytical input data that is crucial for interpretation of method comparison studies.

Correlation versus Slope of Regression Line

Perfect correlation (ρ² = 1) does not imply that two variables are identical but rather that one of them, Y, say, can be written as a linear function of the other, Y = a + bX, where b is the slope of the regression line and a is the intercept.

How Big Should the Sample Be?

In method comparison studies, we need to be sure that differences of medical importance are detected. As discussed in Chapter 2, for a given difference, the necessary number of samples depends on the range of values and the analytical standard deviations of the methods involved.

Linnet [1999] finds that the sample sizes of 40–100 conventionally used in method comparison studies often are inadequate. A main factor is the range of values, which should be as wide as possible for the given analyte. At a range ratio (maximum value divided by minimum value) of 2, a total of 544 samples is required to detect one standardized slope deviation; the required number decreases to 64 at a range ratio of 10 (proportional analytical error). For electrolytes having very narrow ranges of values, very large sample sizes usually are necessary. In the case of proportional analytical error, application of a weighted approach is important to assure an efficient analysis; for example, at a range ratio of 10, the weighted approach reduces the required number of samples by more than 50%.


NINE GUIDELINES*
1. Use statistics to provide estimates of errors, not as indicators of acceptability.
2. Recognize that the main purpose of the method comparison experiment is to obtain an estimate of systematic error or bias.
3. Obtain estimates of systematic error at important medical decision concentrations.
4. When there is a single medical decision concentration, make the estimate of systematic error near the mean of the data.
5. When there are two or more medical decision concentrations, use the correlation coefficient, r, to assess whether the range of data is adequate for using ordinary regression analysis.
6. When the correlation coefficient exceeds 0.975, use the comparison plot along with ordinary linear regression statistics.
7. When the correlation coefficient is close to zero, improve the data or change the statistical technique.
8. When in doubt about the validity of the statistical technique, see whether the choice of statistics changes the outcome or decision on acceptability.
9. Plan the experiment carefully and collect the data appropriate for the statistical technique to be used.
*Abstracted from Westgard [1998].

QUANTILE REGRESSION

Linear regression techniques (OLS, LAD, or EIV) are designed to help us predict expected values, as in E(Y) = μ + βX. But what if our real interest is in predicting extreme values? What if, for example, we would like to characterize the observations of Y that are likely to lie in the upper and lower tails of Y’s distribution? This would certainly be the case for economists and welfare workers who want to predict the number of individuals whose incomes will place them below the poverty line; physicians, bacteriologists, and public health officers who want to estimate the proportion of bacteria that will remain untouched by various doses of an antibiotic; ecologists and nature lovers who want to estimate the number of species that might perish in a toxic waste spill; and industrialists and retailers who want to know what proportion of the population might be interested in, and can afford, their new product.

In estimating the τth quantile,¹ we try to find the value of β for which $\sum_k \rho_\tau(y_k - f[x_k, \beta])$ is a minimum, where

$$\rho_\tau(u) = u\left(\tau - I[u < 0]\right) =
\begin{cases}
\tau\, u, & u \ge 0,\\
(\tau - 1)\, u, & u < 0.
\end{cases}$$
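Written directly from this definition, the check function is a one-liner in R:

```r
# The check function rho_tau, transcribed from the definition above.
rho_tau <- function(u, tau) u * (tau - (u < 0))

rho_tau( 2, 0.9)   # 1.8: shortfalls weighted by tau
rho_tau(-2, 0.9)   # 0.2: overshoots weighted by 1 - tau
```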

Even when expected values or medians lie along a straight line, other quantiles may follow a curved path. Koenker and Hallock applied the method of quantile regression to data taken from Ernst Engel’s 1857 study of the dependence of households’ food expenditure on household income. As Figure 12.3 reveals, not only did food expenditure increase with household income, as expected, but the dispersion of the expenditures increased as well.

FIGURE 12.3. Engel data with quantile regression lines superimposed.

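The Engel data ship with R’s quantreg package, so the flavor of this analysis is easy to reproduce; the particular quantiles below are illustrative choices of our own.

```r
# Quantile regression fits to the Engel food-expenditure data, which
# are bundled with the quantreg package.
library(quantreg)
data(engel)

taus <- c(0.05, 0.25, 0.50, 0.75, 0.95)
fits <- rq(foodexp ~ income, tau = taus, data = engel)
coef(fits)   # one (intercept, slope) column per quantile

plot(foodexp ~ income, data = engel, cex = 0.5)
for (j in seq_along(taus))
  abline(a = coef(fits)[1, j], b = coef(fits)[2, j], lty = 2)
```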

Some precautions are necessary. As Brian Cade notes, the most common errors associated with quantile regression include:

1. Failing to evaluate whether the model form is appropriate, for example, forcing a linear fit through an obviously nonlinear response. (Of course, this is also a concern with mean regression, whether OLS, LAD, or EIV.)
2. Trying to overinterpret a single quantile estimate (say, 0.85) with a statistically significant nonzero slope (p < 0.05) when the majority of adjacent quantiles (say, 0.5 to 0.84 and 0.86 to 0.95) have slopes clearly indistinguishable from zero (p > 0.20).
3. Failing to use all the information a quantile regression provides. Even if you think you are interested only in relations near the maximum (say, 0.90 to 0.99), your understanding will be enhanced by having estimates (and sampling variation via confidence intervals) across a wide range of quantiles (say, 0.01 to 0.99).

SURVIVAL ANALYSIS

Survival analysis is used to assess time-to-event data including time to recovery and time to revision.

Most contemporary survival analysis is built around the Cox model, for which the hazard function takes the form $\lambda(t \mid X) = \lambda_0(t)\exp(X\beta)$, where for each observation $X$ is a $1 \times p$ row vector of covariate values and $\beta$ is a $p \times 1$ column vector of to-be-estimated coefficients. One possible source of error in the application of this model is unrecognized heterogeneity among subjects, as the following example illustrates.

Therneau and Grambsch [2000] cite the example of a heterogeneous population, 40% of whom acquire an infection at a rate of once per year and respond to a drug approximately half the time, 40% of whom acquire an infection at a rate of twice per year and respond to a drug approximately half the time, and 20% of whom acquire the infection as often as ten times per year and respond to the drug only 20% of the time. Let us launch a study with 1000 individuals in each treatment arm. Assuming that the infections follow a simple Poisson process, the expected number of individuals who will finish the year with k = 1, 2, … infections is given in the following table:

[Table: expected numbers of individuals finishing the year with k = 0, 1, 2, … infections, by treatment arm.]

Clearly, in this example the treatment helps reduce the number of infected individuals. But does it reduce the number of infections? You will need to construct a table for your own data similar to the one above before you can be sure whether the heterogeneity of individuals’ susceptibility plays a role.
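For readers who want to experiment with the Cox model itself, here is a minimal sketch in R using the survival package and its bundled lung data; it is purely illustrative and is not the Therneau and Grambsch example above.

```r
# A minimal Cox proportional-hazards fit on the survival package's
# bundled lung dataset.
library(survival)

fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(fit)   # exp(coef) gives the estimated hazard ratios

cox.zph(fit)   # check the proportional-hazards assumption before trusting the fit
```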

THE ECOLOGICAL FALLACY

The Court wrote in NAACP v. City of Niagara Falls, “Simple regression does not allow for the effects of racial differences in voter turnout; it assumes that turnout rates between racial groups are the same.”² Whenever distinct strata exist, one ought to develop separate regression models for each stratum. Failure to do so constitutes the ecological fallacy.

In the 2004 election for Governor of the State of Washington, out of the over 2.8 million votes counted, just 261 votes separated the two leading candidates, Christine Gregoire and Dino Rossi, with Mr. Rossi in the lead. Two recounts later, Ms. Gregoire was found to be ahead by 129 votes. There were many problems with the balloting, including the discovery that some 647 felons voted despite having lost the right to vote. Borders et al. v. King County et al. represents an attempt to overturn the results, arguing that if the illegal votes were deducted from each precinct proportional to the relative number of votes cast for each candidate, Mr. Rossi would have won the election.

The Court finds that the method of proportionate deduction and the assumption relied upon by Professors Gill and Katz are a scientifically unaccepted use of the method of ecological inference. In particular, Professors Gill and Katz committed what is referred to as the ecological fallacy in making inferences about a particular individual’s voting behavior using only information about the average behavior of groups; in this case, voters assigned to a particular precinct. The ecological fallacy leads to erroneous and misleading results. Election results vary significantly from one similar precinct to another, from one election to another in the same precinct, and among different candidates of the same party in the same precinct. Felons and others who vote illegally are not necessarily the same as others in the precinct.

… [T]he Court finds that the statistical methods used in the reports of Professors Gill and Katz ignore other significant factors in determining how a person is likely to vote. In this case, in light of the candidates, gender may be as significant or a more significant factor than others. The illegal voters were disproportionately male and less likely to have voted for the female candidate.³

To see how stratified regression would be applied in practice, consider a suit⁴ to compel redistricting to create a majority-Hispanic district in Los Angeles County. The plaintiffs offered in evidence two regression equations to demonstrate differences in the voting behavior of Hispanics and non-Hispanics:

$$Y_{hi} = C_h + b_h X_{hi} + \varepsilon_{hi}, \qquad Y_{ti} = C_t + b_t X_{hi} + \varepsilon_{ti},$$

where $Y_{hi}$ and $Y_{ti}$ are the predicted proportions of voters in the $i$th precinct for the Hispanic candidate and for all candidates, respectively; $C_h$ and $C_t$ are the percentages of non-Hispanic voters who voted for the Hispanic candidate or any candidate; $b_h$ and $b_t$ are the added percentages of Hispanic voters who voted for the Hispanic candidate or any candidate; $X_{hi}$ is the percentage of registered voters in the $i$th precinct who are Hispanic; and $\varepsilon_{hi}$ and $\varepsilon_{ti}$ are random or otherwise unexplained fluctuations.

If there were no differences in the voting behavior of Hispanics and non-Hispanics, then we would expect our estimates of bh and bt to be close to zero. Instead, the plaintiffs showed that the best fit to the data was provided by the equations

[The fitted versions of the two equations above, with the plaintiffs’ estimates of $C_h$, $b_h$, $C_t$, and $b_t$.]

Of course, other estimates of the Cs and bs are possible, as only the Xs and Ys are known with certainty. It is conceivable, though unlikely, that few if any of the Hispanics actually voted for the Hispanic candidate.
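The following R sketch shows how the plaintiffs’ two equations would be fit in practice; the precinct data are hypothetical inventions of our own, as the actual Garza data are not reproduced in this text.

```r
# Fitting the two equations by OLS on hypothetical precinct data.
set.seed(4)
n  <- 120                                     # precincts
Xh <- runif(n)                                # proportion Hispanic registrants
Yh <- 0.10 + 0.60 * Xh + rnorm(n, sd = 0.05)  # votes for Hispanic candidate
Yt <- 0.40 + 0.25 * Xh + rnorm(n, sd = 0.05)  # votes for any candidate

coef(lm(Yh ~ Xh))   # intercept estimates C_h, slope estimates b_h
coef(lm(Yt ~ Xh))   # intercept estimates C_t, slope estimates b_t
# Slopes far from zero point to differences in voting behavior between
# Hispanic and non-Hispanic voters.
```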

NONSENSE REGRESSION

Nonlinear regression methods are appropriate when the form of the nonlinear model is known in advance. For example, a typical pharmacological model will have the form A exp[bX] + C exp[dW]. The presence of numerous locally optimal but globally suboptimal solutions creates challenges, and validation is essential. See, for example, Gallant [1987] and Carroll et al. [1995].
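When the model form is known, R’s nls function will fit it; the following sketch uses simulated data of our own devising and, given the local optima just mentioned, one should try several starting values and validate the resulting fit.

```r
# Fitting a model of known nonlinear form, A*exp(b*X) + C*exp(d*W),
# with nls() on simulated data.
set.seed(5)
X <- runif(80, 0, 2)
W <- runif(80, 0, 2)
y <- 3 * exp(-1.2 * X) + 1.5 * exp(0.8 * W) + rnorm(80, sd = 0.2)

fit <- nls(y ~ A * exp(b * X) + C * exp(d * W),
           start = list(A = 1, b = -1, C = 1, d = 1))
coef(fit)   # estimates of A, b, C, d; rerun with other starts to check
```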

To be avoided is the recent spate of proprietary algorithms, available solely in software form, that guarantee to find a best-fitting solution. In the words of John von Neumann, “With four parameters I can fit an elephant and with five I can make him wiggle his trunk.” Goodness of fit is no guarantee of predictive success, a topic we take up repeatedly in subsequent chapters.

REPORTING THE RESULTS

Use a graph to report the results of a univariate regression only if one of the following is true:

1. The relationship is not a straight line, or
2. You also wish to depict the confidence limits.

Confidence limits should not be parallel; rather, they will appear hyperbolic around a regression line, reflecting the greater uncertainty at the extremes of the distribution.
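A sketch in R, with simulated data of our own devising:

```r
# A univariate regression with pointwise confidence limits; note how the
# limits flare at the extremes of the data rather than running parallel.
set.seed(6)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)

fit  <- lm(y ~ x)
grid <- data.frame(x = seq(0, 10, length.out = 100))
ci   <- predict(fit, newdata = grid, interval = "confidence")

plot(x, y)
lines(grid$x, ci[, "fit"])            # fitted regression line
lines(grid$x, ci[, "lwr"], lty = 2)   # lower confidence limit
lines(grid$x, ci[, "upr"], lty = 2)   # upper confidence limit
```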

SUMMARY

In this chapter, we distinguished linear from nonlinear regression and described a number of alternatives to ordinary least-squares regression, including least-absolute-deviation, errors-in-variables (Deming), and quantile regression. We also noted the importance of using separate regression equations for each identifiable stratum.

TO LEARN MORE

Consider using LAD regression when analyzing software data sets [Miyazaki et al., 1994] or meteorological data [Mielke et al., 1996], but heed the caveats noted by Ellis [1998].

Only iteratively reweighted general Deming regression produces statistically unbiased estimates of systematic bias and reliable confidence intervals of bias. For details of the recommended technique, see Martin [2000].

Mielke and Berry [2001, Section 5.4] provide a comparison of MRPP, Cade–Richards, and OLS regression methods. Stöckl, Dewitte, and Thienpont [1998] compare ordinary linear regression, Deming regression, standardized principal component analysis, and Passing–Bablok regression.

For more on quantile regression, download Blossom and its accompanying manual from http://www.fort.usgs.gov/products/software/blossom/.

For R code to implement any of the preceding techniques, see Good [2005, 2012].

Notes

* Y|X is read as Y given X.

1  τ is pronounced tau.

2  65 F.3d 1002, n2 (2nd Cir. 1994).

3  Quotations are from a transcript of the decision by Chelan County Superior Court Judge John Bridges, June 6, 2005.

4  Garza et al v. County of Los Angeles, 918 F.2d 763 (9th Cir), cert. denied.