THE INTENT TO TREAT
In the early 1980s, Richard Peto, one of the leading biostatisticians in Great Britain, had a problem. He was analyzing the results of several clinical trials comparing different treatments for cancer. Following the dictates of R. A. Fisher’s experimental design, the typical clinical trial identified a group of patients who were in need of treatment and assigned them, at random, to different experimental methods of treatment.
The analysis of such data should have been relatively straightforward. The percentage of patients who survived for five years would have been compared among the treatment groups, using Fisher’s methods. A more subtle comparison could have been made, using Aalen’s martingale approach to analyze the time from the beginning of the study to each patient’s death as the basic measure of effect. Either way, the analysis was based on the initial randomization of patients to treatment. Following Fisher’s dictums, the assignment of patients to treatment was completely independent of the outcome of the study, and p-values for hypothesis tests could be calculated.
Peto’s problem was that not all the patients had been given the treatments to which they were randomized. These were human beings, suffering from painful and, in many cases, terminal disease. The physicians treating them felt impelled to abandon the experimental treatment, or at least to modify it, if they felt it was in the best interests of the patient. Blind adherence to an arbitrary treatment, without considering the patient’s needs and responses, would have been unethical. Contrary to Fisher’s dictums, the patients in these studies were often provided with new treatments, with the choice of treatment depending upon the patient’s response.
This was a typical problem in cancer studies. It had been a problem in such studies since they were first started in the 1950s. Until Peto came on the scene, the usual procedure was to analyze only those patients who remained on their randomized treatment. All other patients were dropped from the analysis. Peto realized that this could lead to serious errors. For instance, suppose we were comparing an active treatment with treatment by a placebo, a drug that has no biological effect. Suppose that patients who fail to respond are switched to a standard treatment. The patients who fail on placebo will be switched over and left out of the analysis. The only patients who remain on placebo will be patients who, for some other reason, are responding. The placebo will be found to be as effective as (or perhaps even more effective than) the active treatment—if the patients who remained on placebo and responded are the only placebo-treated patients used in the analysis.
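The distortion can be made concrete with a toy simulation. This is a minimal sketch: the response rates and the rule that every placebo non-responder is switched and dropped are invented for illustration, not taken from any real trial.

```python
import random

random.seed(1)
n = 1000

# Invented rates: the active drug works for 40% of patients; an inert
# placebo is followed by spontaneous improvement in 20% of patients.
active_successes = sum(random.random() < 0.40 for _ in range(n))

# On the placebo arm, every non-responder is switched to standard
# treatment and dropped, so only responders remain "on protocol."
placebo_outcomes = [random.random() < 0.20 for _ in range(n)]
placebo_on_protocol = [r for r in placebo_outcomes if r]

print("Active arm response rate:", active_successes / n)  # about 0.40
print("Placebo rate among those left in the analysis:",
      sum(placebo_on_protocol) / len(placebo_on_protocol))  # exactly 1.0
```

Under this invented switching rule, the per-protocol analysis reports a 100 percent response rate for the inert pill, which is exactly the error Peto feared.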
Edmund Gehan, who was at M. D. Anderson Hospital in Texas, had seen the problem before Peto. His solution at the time was to propose that these studies did not meet Fisher’s requirements, so they could not be considered useful experiments for comparing treatments. Instead, the records from these studies consisted of careful observations taken on patients given different types of treatments. The best that could be expected would be a general description of their outcome, with hints of possible future treatment. Later, Gehan considered various other solutions to this
problem, but his first conclusion reflects the frustration of someone who tries to apply the methods of statistical analysis to a poorly designed or executed experiment.
Peto suggested a straightforward solution. The patients had been randomized to receive specific treatments. The act of randomization is what made it possible to calculate the p-values of hypothesis tests comparing those treatments. He suggested that each patient be treated in the analysis as if he or she had been on the treatment to which he or she had been randomized. The analyst would ignore all treatment changes that occurred during the course of the study. If a patient was randomized to treatment A and was moved off of that treatment just before the end of the study, that patient was analyzed as a treatment A patient. If the patient randomized to treatment A was on treatment A for only a week, that patient was analyzed as a treatment A patient. If the patient randomized to treatment A never took a single pill from treatment A but was hospitalized and put on alternative therapies immediately after entering the study, that patient was analyzed as a treatment A patient.
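In code, the rule is almost trivial: group each patient by the arm that randomization assigned, and never consult the treatment log. The record fields below are hypothetical, invented only to show the bookkeeping.

```python
from collections import defaultdict

# Hypothetical records: 'assigned' is the randomized arm; 'received' is
# what actually happened and is deliberately never consulted.
patients = [
    {"assigned": "A", "received": ["A"], "survived_5yr": True},
    {"assigned": "A", "received": ["A", "standard"], "survived_5yr": False},
    {"assigned": "A", "received": ["standard"], "survived_5yr": True},
    {"assigned": "B", "received": ["B"], "survived_5yr": False},
]

arms = defaultdict(list)
for p in patients:
    # Intent to treat: the assigned arm alone defines the group.
    arms[p["assigned"]].append(p["survived_5yr"])

for arm, outcomes in sorted(arms.items()):
    print(arm, "five-year survival:", sum(outcomes) / len(outcomes))
```

The third patient, who never took a single pill of treatment A, is still counted in arm A, exactly as Peto prescribed.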
This approach may seem foolish at first glance. One can produce scenarios in which a standard treatment is being compared to an experimental one, with patients switched to the standard if they fail. Then, if the experimental treatment is worthless, all or most of the patients randomized to it will be switched to the standard, and the analysis will find the two treatments the same. As Richard Peto made clear in his proposal, this method of analyzing the results of a study cannot be used to show that treatments are equivalent. Its conclusions are meaningful only when the analysis finds that the treatments differ in effect.
Peto’s solution came to be called the “intent to treat” method. The justification for this name and for its use in general was the following: If we are interested in the overall results of a medical policy that would recommend the use of a given treatment, the physician has to be given the freedom to modify treatment as she sees fit. An analysis of a clinical trial, using Peto’s solution, would determine
whether it is good public policy to recommend a given treatment as a starting treatment. The application of the intent to treat method of analysis was proposed as a sensible one for large government-sponsored studies designed to determine good public policy.
Unfortunately, there is a tendency for some scientists to use statistical methods without knowing or understanding the mathematics behind them. This often appears in the world of clinical research. Peto had pointed out the limitations of his solution. In spite of this, the intent to treat method became enshrined in medical doctrine at many universities and came to be seen as the only correct method of statistical analysis of a clinical trial. Yet many clinical trials, especially those in cancer, are designed to show therapeutic equivalence: that a new treatment is at least as good as the standard while presenting fewer side effects. As Peto pointed out, his solution can be used only to find differences, and failure to find differences does not mean that the treatments are equivalent.
The problem lay, to some extent, in the rigidity of the Neyman-Pearson formulation. The standard version of the Neyman-Pearson formulation found in elementary statistics textbooks tends to present hypothesis testing as a cut-and-dried procedure. Many purely arbitrary aspects of the methods are presented as immutable.
While many of these arbitrary elements may not be appropriate for clinical research, the need that some medical scientists have to
use “correct” methods has enshrined an extremely rigid version of the Neyman-Pearson formulation. Nothing is acceptable unless the p-value cutoff is fixed in advance and preserved by the statistical procedure. This was one reason why Fisher opposed the Neyman-Pearson formulation. He did not think that the use of p-values and significance tests should be subjected to such rigorous requirements. He objected, in particular, to the fact that Neyman would fix the probability of a false positive in advance and act only if the p-value was less than that. Fisher suggested in his book Statistical Methods and Scientific Inference that the final decision about what p-value would be significant should depend upon the circumstances. I use the word suggested because Fisher was never quite clear on how he would use p-values. He only presented examples.
In 1977, David R. Cox (of Box and Cox from chapter 23) took up Fisher’s arguments and extended them. To distinguish between Fisher’s use of p-values and the Neyman-Pearson formulation, he called Fisher’s method “significance testing” and the Neyman-Pearson formulation “hypothesis testing.” By the time Cox wrote his paper, the calculation of statistical significance (through the use of p-values) had become one of the most widely used methods of scientific research. Thus, Cox reasoned, the method had proven to be useful in science. In spite of the acrimonious dispute between Fisher and Neyman, in spite of the insistence of statisticians like W. Edwards Deming that hypothesis tests were useless, in spite of the rise of Bayesian statistics, which has no place for p-values and significance tests—in spite of all these criticisms from mathematical statisticians, significance testing and p-values are constantly being used. How, Cox asked, do scientists actually use these tests? How do they know that the results of such tests are true or useful? He discovered that, in practice, scientists use hypothesis tests primarily for refining their views of reality by eliminating
unnecessary parameters, or for deciding between two differing models of reality.
George Box (the other half of Box and Cox) approached the problem from a slightly different perspective. Scientific research, he noted, consisted of more than a single experiment. The scientist arrives at the experiment with a large body of prior knowledge or at least with a prior expectation of what might be the result. The study is designed to refine that knowledge, and the design depends upon what type of refinement is sought. Up to this point, Box and Cox are saying much the same thing. To Box, this one experiment is part of a stream of experiments. The data from this experiment are compared to data from other experiments. The previous knowledge is then reconsidered in terms of both the new experiment and new analyses of the old experiments. The scientists never cease to return to older studies to refine their interpretation of them in terms of the newer studies.
As an example of Box’s approach, consider the manufacturer of paper who is using one of Box’s major innovations, evolutionary operation (EVOP). With Box’s EVOP, the manufacturer introduces experiments into the production run. The humidity, speed, sulfur, and temperature are modified slightly in various ways. The resulting change in paper strength is not great. It cannot be great and still produce a salable product. Yet these slight differences, subjected to Fisher’s analysis of variance, can be used to propose another experiment, one in which the average strength across all the runs is slightly increased; those new runs, in turn, point the direction of still another slight increase in strength. The results of each stage in EVOP are compared to previous stages. Experiments that seem to produce anomalous results are rerun. The procedure continues forever—there is no final “correct” solution. In Box’s model, the sequence of scientific experiments followed
by examination and reexamination of data has no end—there is no final scientific truth.
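A caricature of the EVOP cycle, reduced to a single factor, might look like the sketch below. The strength function, step sizes, and run counts are all invented; a real application would vary several factors at once and judge the runs with an analysis of variance rather than a simple comparison of averages.

```python
import random

random.seed(2)

def paper_strength(temp):
    # Invented stand-in for the paper machine: strength peaks at an
    # unknown temperature (here 180), with run-to-run noise.
    return 50.0 - (temp - 180.0) ** 2 / 100.0 + random.gauss(0, 0.2)

setpoint = 170.0
for cycle in range(20):
    # Nudge the process slightly each way -- small enough that every
    # run still yields salable paper -- and average a few runs of each.
    low = sum(paper_strength(setpoint - 1.0) for _ in range(5)) / 5
    high = sum(paper_strength(setpoint + 1.0) for _ in range(5)) / 5
    # Shift the setpoint a small step toward the stronger side.
    setpoint += 0.5 if high > low else -0.5

print("Setpoint after 20 cycles:", setpoint)  # drifts toward the optimum
```

The loop never declares victory; it simply keeps drifting toward stronger paper, which is Box's point: there is no terminal experiment, only the next cycle.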
Deming and many other statisticians have rejected the use of hypothesis tests outright. They insist that Fisher’s work on methods of estimation should form the basis of statistical analyses. It is the parameters of the distribution that should be estimated. It makes no sense, they argue, to run analyses that deal with these parameters only indirectly, through p-values and arbitrary hypotheses. These statisticians continue to use Neyman’s confidence intervals to measure the uncertainty of their conclusions; but Neyman-Pearson hypothesis testing, they contend, belongs in the waste bin of history, along with Karl Pearson’s method of moments. It is interesting to note that Neyman himself seldom used p-values and hypothesis tests in his own applied papers.
This rejection of hypothesis testing, and the Box and Cox reformulations of Fisher’s concept of significance testing, may cast doubt on Richard Peto’s solution to the problem he found in clinical cancer studies. But the basic problem he faced remains. What do you do when the experiment has been modified by allowing the consequences of the treatment to change the treatment? Abraham Wald had shown how a particular type of modification could be accommodated, leading to sequential analysis. In Peto’s case, however, the oncologists were not following Wald’s sequential methods. They were inserting different treatments as they perceived the need.
In some ways, this is a problem that William Cochran of Johns Hopkins University dealt with in the 1960s. The city of Baltimore wanted to determine if public housing had an effect on the social attitudes and progress of poor people. They approached the
statistics group at Johns Hopkins to help them set up an experiment. Following Fisher’s methods, the Johns Hopkins statisticians suggested they take a group of people, whether they had applied for public housing or not, and randomly assign some of them to public housing and refuse it to the others. This horrified the city officials. When openings were announced in public housing, it was their practice to respond on a first-come, first-served basis. It was only fair. They could not deny their rights to people who rushed to be “first,” least of all on the basis of a computer-generated randomization. The Johns Hopkins statistics group pointed out, however, that those who rushed to apply were often the most energetic and ambitious. If this was true, then those in public housing would do better than the others—without the housing itself having any effect.
Cochran’s solution was to concede that a designed scientific experiment would not be possible. Instead, by following families who went into public housing and those who did not, they would have an observational study, in which the families differed by many factors, such as age, educational level, religion, and family stability. He proposed methods for running a statistical analysis of such observational studies. He would do this by adjusting the outcome measurement for a given family to take these different factors into account. He would set up a mathematical model in which there would be an effect due to age, an effect due to whether it was an intact family, an effect due to religion, and so forth. Once the parameters of all these effects had been estimated, the remaining differences in effect would be used to determine the effect of public housing.
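In modern terms this is a regression adjustment. The sketch below invents a small data set in which intact families are more likely to enter public housing, so a naive comparison of the housed and unhoused overstates the housing effect, while fitting all the effects at once recovers the true value. Every number here is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Invented covariates: age and an intact-family indicator.
age = rng.uniform(20, 60, n)
intact = rng.integers(0, 2, n)

# Intact families apply faster, so they are likelier to be housed.
housed = (rng.uniform(0, 1, n) < 0.2 + 0.5 * intact).astype(float)

# Simulated outcome: covariates matter; housing's true effect is 2.0.
outcome = 0.1 * age + 3.0 * intact + 2.0 * housed + rng.normal(0, 1, n)

naive = outcome[housed == 1].mean() - outcome[housed == 0].mean()

# Cochran-style adjustment: estimate all the effects together, so the
# housing coefficient is net of age and family structure.
X = np.column_stack([np.ones(n), age, intact, housed])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)

print("Naive difference:", naive)      # biased upward by family structure
print("Adjusted estimate:", coef[3])   # close to the true 2.0
```

The naive difference absorbs the advantage of intact families; the adjusted coefficient strips it out, which is the whole point of Cochran's program.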
When a clinical study announces that the difference in effect has been adjusted for patient age or patient sex, this means that researchers have applied some of Cochran’s methods to estimate the underlying effect of treatment, taking into account the effect of imbalances in the assignment of treatment to patients. Almost all sociological studies use Cochran’s methods. The authors of these studies may not recognize them as coming from William Cochran,
and many of the specific techniques predate his work, but Cochran put the approach on a solid theoretical foundation, and his papers on observational studies have influenced medicine, sociology, political science, and astronomy—all areas in which random assignment of “treatment” is either impossible or unethical.
In the 1980s and 1990s, Donald Rubin of Harvard University proposed a different approach to Peto’s problem. In Rubin’s model, each patient is assumed to have a possible response to each of the treatments. If there are two treatments, each patient has a potential response to both treatment A and treatment B. We can observe the patient under only one of these treatments, the one to which the patient had been assigned. We can set up a mathematical model in which there is a symbol in the formula for each of those possible responses. Rubin derived conditions on this mathematical model that are needed to estimate what might have happened had the patient been put on the other treatment.
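The skeleton of Rubin’s model can be written down directly. In the sketch below, all numbers invented, each simulated patient carries a response under A and a response under B, only one of which is ever revealed; because assignment is randomized independently of those potential responses, the difference in observed group means estimates the average effect.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# Each patient's two potential responses; the true effect of A over B
# is 1.5 for everyone (an invented number).
y_under_b = rng.normal(10.0, 2.0, n)
y_under_a = y_under_b + 1.5

# Randomization: assignment is independent of the potential responses.
on_a = rng.integers(0, 2, n).astype(bool)

# Only one potential response per patient is ever observed.
observed = np.where(on_a, y_under_a, y_under_b)

effect = observed[on_a].mean() - observed[~on_a].mean()
print("Estimated average effect of A over B:", effect)  # close to 1.5
```

When assignment is not independent of the potential responses, as in Peto's switched patients, this simple difference fails, and Rubin's conditions describe what must hold for an adjusted estimate to stand in its place.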
Rubin’s models and Cochran’s methods can be applied in modern statistical analyses because they make use of the computer to engage in massive amounts of number crunching. Even if they had been proposed during Fisher’s time, they would not have been feasible. They require the computer because the mathematical models are highly involved and complicated. They often require iterative techniques, in which the computer runs through thousands or even millions of estimations, the sequence of estimations converging on the final answer.
These Cochran and Rubin methods are highly model-specific. That is, they will not produce correct answers unless the complicated mathematical models they use come close to describing reality. They require the analyst to devise a mathematical model that matches reality in all or most of its aspects. If reality does not match the model, then the results of the analysis may not hold. A concomitant part of approaches like those of Cochran and Rubin has been the effort to determine the degree to which the conclusions are robust. Current mathematical investigations are looking at how far reality can be from the model before the conclusions are no longer true. Before he died in 1980, William Cochran was examining these questions.
Methods of statistical analysis can be thought of as lying on a continuum, with highly model-bound methods like those proposed by Cochran and Rubin on one end. At the other end, there are nonparametric methods, which examine data in terms of the most general type of patterns. Just as the computer has made highly model-bound methods feasible, there has been a computer revolution at the other end of statistical modeling—this nonparametric end where little or no mathematical structure is assumed and the data are allowed to tell their story without forcing them into preconceived models. These methods go by fanciful names like “the bootstrap.” They are the subject of the next chapter.