Chapter 7

THE MEASUREMENT PROBLEM

ON THURSDAY MORNINGS, HEART SURGERY AT Columbia-Presbyterian gets off to a late start to allow time for internal meetings. There are a half dozen or so under way by 6:30—on research projects and the like—and a full staff meeting at 7:30. One Thursday a month, the staff meeting is devoted to “Journal Club,” in which four or five recent journal articles are selected for formal presentations by residents. Those were my favorite meetings. Everything a resident does is grist for his final evaluation, so the presentations are usually well prepared, and the surgeons, who are utterly immersed in their craft, become intensely engaged in discussions digging into, say, the fluid-dynamics of different-shaped bypasses.

At one of the first Journal Clubs I attended, in February 2006, the agenda included a recent New England Journal of Medicine article suggesting that an antibleeding drug, aprotinin, which is often used in heart surgery, was associated with a number of dangerous side effects, including strokes, cardiac and renal (kidney) effects, and increased mortality. The conclusion was not based on clinical trials, but rather on a statistical analysis of a very large database of surgeries. Since there appeared to be alternative drugs with much the same beneficial effects, the article suggested that continued routine use of aprotinin was “not prudent.” The presentation was delivered in a monotone, I was still adjusting to the peculiarities of time in a surgical universe, and this was my second meeting of the morning. Sitting in the back of the room, I mostly concentrated on not visibly drifting off.

Then the resident finished—and to my astonishment, the room almost exploded with anger. Jack Shanewise, the cardiothoracic anesthesiology chief, called the article “completely irresponsible,” and went on to say that the NEJM had been “burned on Vioxx” and was now trying to seize the high ground on aprotinin. (The NEJM had indeed published a company-sponsored paper on Vioxx, a blockbuster arthritis drug, that it later discovered had excluded crucial data on cardiac side effects.) All the doctors were seething over the “not prudent” tagline which, I belatedly realized, was doctor-speak for “should be sued.” For emphasis, the NEJM editors had also included the “not prudent” phrase in the article’s abstract, so it had been widely picked up in the press. The resident flicked back through his PowerPoint slides to a Google search page showing the New York Times report on the NEJM article. Sure enough, it was bordered by a string of lawyer ads, soliciting aprotinin cases. An older doctor near me murmured, “What kind of a country are we living in?”

Smith focused on the article’s methodology: “If you believe in the religion of propensity scoring,” he said, “you have to stop using aprotinin. But if you don’t have a Ph.D. in math, it’s impossible to know whether this is a good study or not.” “Propensity scoring” was the primary analytic technique that the NEJM study relied on. It is of quite recent vintage, and can retrospectively create simulated clinical trial–style experimental and control groups from large databases. It requires great facility in advanced statistics, large amounts of computer power, and a fair degree of judgment-based data manipulation. And Smith was certainly right that hardly anyone in the room had the mathematical background to judge the quality of the study.*

The discussion went on for almost an hour, and it gradually became clear how deep the doctors’ anger was, and how various its targets. Partly, it was simple embarrassment. More than a decade ago, the surgeons at Columbia had developed the first widely used protocols for the safe use of aprotinin. Shanewise arrived much later, but came from Emory University, where the cardiothoracic anesthesiology program is a strong aprotinin advocate. In the year and a half he had been at Columbia, he had been working to increase the use of aprotinin. Now here is the NEJM telling him, in effect, that he has been “imprudent.”

And there was anger over their apparent loss of control over professional standards. Statisticians were now going to tell the surgeons what good medicine was. Almost everyone in the room felt strongly that in cases where postoperative bleeding was a serious risk, aprotinin had proved to be a highly effective counteragent. But it was not just a matter of opinion. One of the surgeons said, “We’ve learned not to trust anecdotal evidence. We’ve had it drummed into us that the gold standard for evidence is the randomized clinical trial.” Aprotinin, in fact, had been one of the most studied of surgical drugs. More than sixty randomized clinical trials (RCTs) had almost uniformly demonstrated its safety and effectiveness. While some of the earliest studies suggested the possibility of renal and cardiac complications, they mostly dated from before the Columbia protocols had been widely adopted. Two very large recent “meta-analyses”—studies that combined the results of multiple smaller studies—had strongly confirmed aprotinin’s usefulness and safety. But the NEJM article dismissed that evidence. Each of the individual studies was small—“underpowered” in the jargon—and apt to miss subtle variations in outcomes. Worse, they were all financed by aprotinin’s maker, Bayer, the German pharmaceutical giant.

And the surgeons were angry at the government. Smith said: “The government made the choice to delegate drug testing to the drug companies. There’s no money for independent testing. That’s why we’re in this position.” There was also, of course, anger at the tort lawyers. But surprisingly, perhaps the most intense anger of all was focused on Bayer and the corrupting influence of the drug companies. Shanewise said: “There’s no one in this room who isn’t compromised. You can’t go to an anesthesiology conference that isn’t paid for by Bayer. Nobody here can say that they’ve never had a meal that was paid for by Bayer.” Then he said, “The other night I was attending at an operation on an eighty-five-year-old woman. She was exactly the kind of case that I think needs aprotinin. And I didn’t know what to do. I finally asked myself, What would I do if this was my mother on the table—and then I gave her the aprotinin. Everything came out all right, thank God.”

I could sympathize with the surgeons. Old points of reference were visibly wobbling, and the queasiness they felt was palpable. But I was also pretty sure they were wrong, and that sooner or later they were going to have to swallow the study. I had seen the usefulness of “data mining” techniques in business—using powerful computers to discover unexpected patterns in standard databases—and thought it was much underused in medicine. At one point in the meeting, Mike Argenziano, who has far more mathematical training than most surgeons, quietly warned the group that propensity analyses were producing a lot of useful results and that they shouldn’t dismiss the study too quickly. Later that day, I walked over to the office of a friend of mine at Columbia, a senior health care economist, and asked her to look at the paper. She read it through and said, “I’m not the one to analyze this in depth, but on the face of it, it looks like a well-executed study. I’m afraid your surgeon friends may have to learn to live with it.”

It turns out, however, that whatever their reasons for mistrusting the study, the surgeons’ doubts were well founded. What follows is a twisty tale that is still not completely resolved as of March 2007. But it sheds a stark light on the challenge of assembling useful medical evidence. The NEJM study has been, if not quite discredited, strongly challenged on technical grounds, but some of those same analyses have also helped expose the weaknesses of RCTs. Many RCTs, perhaps even most, are useless or worse. Partly that’s because pharmaceutical companies have abused the trial process, but it’s also because of the great difficulty and expense of running good trials. Statistical analyses of nonrandomized medical databases, like that in the NEJM study, are therefore becoming of crucial importance. But the multiple problems with the NEJM paper demonstrate how poorly equipped the medical profession, its professional journals, and its regulators are to take on such a challenge.

The Problem with Drug Companies

The degree of press attention to the NEJM article was surprising at first. Although aprotinin creates a nice revenue stream for Bayer—about $300 million in 2005—it has nothing like the public profile of consumer blockbusters like Vioxx or Lipitor, and since it’s usually administered on the operating table along with a panoply of other drugs, few patients are even aware of it.

But the alarm over aprotinin came in the midst of a long series of devastating revelations about the ethics and the business practices of “Big Pharma.” Movies are usually a good barometer of American attitudes. The central premise of the 2005 hit movie The Constant Gardener was that a global drug company was killing poor African children in order to test a highly toxic experimental drug, and commissioning more murders to cover up its crimes. The fact that millions of people apparently found the story perfectly plausible suggests how low the industry’s reputation has sunk.

To be fair, the failings of the drug companies are often exaggerated. Drug prices are higher in America than in other countries, but by less than usually reported, taking into account the discounts negotiated by managed-care groups and Medicare insurance vendors. In addition, the use of less expensive generics—unbranded copies of popular drugs that have outlived their patent periods—is much higher in America than in other countries, and they are generally cheaper as well. It is also true that drug companies take far too much credit for innovative research—most basic research is financed by the federal government and performed in university laboratories. But there is a large gap between an exciting laboratory compound and a therapeutically useful drug that can be safely absorbed by humans. Creating effective AIDS drugs wasn’t as glamorous as decrypting the HIV virus, but was arguably as difficult, and the companies deserve full credit for their achievements.

Still, all accomplishments notwithstanding, the degree of corruption in the drug industry is disheartening. Enormous revenue flows turn on discretionary choices by a variety of gatekeepers—the patent office, the Food and Drug Administration, and, most of all, thousands and thousands of prescribing physicians. Big Pharma’s annual lobbying budget, therefore, usually tops that of any other industry, and the companies shower billions of promotional money on working doctors—free drugs, expense-paid trips, lunches, consulting and speaking fees. Getting physicians with financial ties to drug companies off crucial FDA approval committees has been like rooting out roaches. Physicians at Stanford University Medical School were recently advised that they could not be paid for signing journal articles written by drug industry contractors—the mere necessity for such an advisement is itself a comment on the state of some doctors’ ethical barometers.

At times, the wrongdoing is not even subtle. TAP Pharmaceuticals, for example, a joint venture between Abbott and the Japanese drug giant Takeda, paid $875 million in 2001 to settle criminal charges and civil liabilities related to its best-selling cancer drug Lupron. (It was overbilling Medicare for the drug in a way that benefited the doctors who prescribed it.) Two years later, AstraZeneca paid $355 million to settle roughly the same charges related to its Lupron competitor. And the year after that, Warner-Lambert, a Pfizer subsidiary, paid $430 million in criminal and civil settlements for illegal marketing of a drug called Neurontin. None of this is Constant Gardener territory, but it all lends credibility to the movie’s story line.

The NEJM aprotinin article therefore fit squarely into an evolving media narrative of scandal at Big Pharma. Bayer charges well over $1,000 a dose for the drug in the United States, which must be fifty to a hundred times its production cost; the company justifies the price by pointing to the large savings and reduced transfusions from using aprotinin. But there are alternatives. Aprotinin belongs to a class of agents loosely categorized as “antifibrinolytics.”* There are two other, much cheaper antifibrinolytics, operating via different chemical pathways, that appear to have effects broadly similar to aprotinin’s. Almost all of the RCTs that demonstrate the benefits of aprotinin are comparing aprotinin to placebos, not to the other antifibrinolytics. But precisely because the alternatives are so inexpensive, no company has been willing to invest in testing them.

The high prices might also be justifiable if Bayer had invented aprotinin, but the drug has been around since the 1920s. Bayer has marketed an aprotinin product, Trasylol, for a very long time—it was occasionally used for certain pancreatic conditions. The company struck gold in the mid-1980s when a British cardiac anesthesiologist, David Royston, made what he calls the “purely serendipitous” discovery that aprotinin had a powerful antibleeding effect. At least a half dozen other companies around the world make aprotinin, but Bayer is the only one that invested in the clinical trials required to win FDA approval of aprotinin as an antibleeding drug for heart surgery. It is also the only one to produce it in an FDA-approved factory, and it may also have cornered the market on cow lung that is certified free of “mad cow” disease, the raw material for aprotinin. Even with those advantages, the low cost of the alternatives means that Bayer still must spend heavily on aprotinin promotion—sponsoring anesthesiology events, contributing to anesthesiologists’ research programs, and signing up prominent anesthesiologists as speakers and panelists.

Finally, a striking feature of the NEJM article is how readily it dismisses the broad array of RCTs attesting to aprotinin’s safety and effectiveness, merely on the grounds of their sponsorship by Bayer. In fact, sophisticated industry observers have long since learned to be skeptical of drug companies bearing trial reports.

The Problem with Randomized Clinical Trials

The worst of Big Pharma’s corruptions, perhaps, has been the suborning of the clinical testing process—and allowing it to happen may be the greatest failing of the FDA. Clinical tests financed by drug companies, which are the majority of tests, tend to be far more favorable to the companies than tests financed independently—3.6 times as much, according to a recent large survey—and negative test results have been routinely suppressed. A review of tests involving fluoxetine (Prozac) shows how simple-minded the deceptions can be. When fluoxetine is the experimental drug—the one that is being tested—it’s typically given in higher dosages than the control drug, so the game is tilted to a positive outcome. Flip the positions, however—so fluoxetine is the control drug—and the dosages are reversed, once again favoring the experimental. Most acceptance trials, in any case, are just against a placebo, even though there may be much cheaper, equally effective, drugs already on the market. Approvals are almost always accompanied by Phase IV agreements, in which the companies agree to conduct further tests to search for side effects and to measure their product against alternative drugs, but the agreements are routinely ignored, without apparent consequence.

If company cheating were the only problem with RCTs, however, it might be relatively straightforward to fix. But there are factors inherent even in honest RCTs that often make their results unreliable, or outright wrong. For example, if you want an unambiguous test of a drug’s efficacy, you need clean experimental and control groups—people suffering only from the target disease and not undergoing any confounding treatments. Compared to people doctors see in their practices, therefore, RCT subjects tend to be younger, freer of comorbidities, and much less likely to be consuming other drugs. One recent study of trials involving nearly 8,500 cardiac-treatment patients showed that the mortality rate of patients eligible to be in the trials, but not enrolled, was twice as high as for the patients actually enrolled in the trials.*

The “end points,” or targeted results, in drug trials, moreover, tend to be very simple. For the aprotinin trials, for example, a common end point was the number of postoperative transfusions. When the experimental drug is being compared only with a placebo, and measured on a very precise end point with a fairly large expected effect, fifty, or even fewer, subjects in each of the control and experimental groups is usually enough for statistical significance. So the companies have become experts at calibrating studies to be just large enough to show the effect they want, but not so large as to reveal less obvious side effects, as the NEJM charged was the case with aprotinin. Meta-analyses, aggregating smaller studies, are often helpful, but can easily be confounded by relatively minor divergences in the smaller studies.
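
The arithmetic behind that calibration can be sketched with a textbook normal-approximation power calculation. The sketch below is only an illustration: the function name and every event rate in it are invented for the example, not taken from the aprotinin trials, and the approximation is rough for rare events in small groups.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, p_treated, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two event rates."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Standard error of the difference between the two observed rates.
    se = sqrt(p_control * (1 - p_control) / n_per_group
              + p_treated * (1 - p_treated) / n_per_group)
    shift = abs(p_control - p_treated) / se
    # Chance that the test statistic lands beyond either critical value.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Hypothetical end point: 60% of placebo patients need a transfusion, 30% on the drug.
efficacy_power = power_two_proportions(0.60, 0.30, n_per_group=50)

# Hypothetical rare side effect: 1% of placebo patients, 2% on the drug.
side_effect_power_small = power_two_proportions(0.01, 0.02, n_per_group=50)
side_effect_power_large = power_two_proportions(0.01, 0.02, n_per_group=2000)

print(f"transfusion effect, 50 per group:  {efficacy_power:.2f}")
print(f"rare side effect, 50 per group:    {side_effect_power_small:.2f}")
print(f"rare side effect, 2000 per group:  {side_effect_power_large:.2f}")
```

Under these assumed rates, fifty patients per group are very likely to show the transfusion benefit, while the same trial has only a small chance of detecting a doubling of the rare side effect. Catching that would take groups in the thousands.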

But the most important problem is that it is so hard, and so expensive, to run good RCTs. The National Institutes of Health (NIH) recently completed a classic study of hypertension drugs, involving four expensive prescription compounds and a traditional generic diuretic. To the experts’ surprise, the diuretic came out the clear winner on both effectiveness and safety. One of the prescription drugs was proven so dangerous that it was actually withdrawn. But the study involved 42,000 subjects and took eight years to complete. Another large-scale study is under way in Canada, sponsored by the local NIH equivalent, that should finally resolve the controversy over aprotinin. Commencing in 2002, it involves a comprehensive analysis of 4,000 cardiac surgery patients in 600 institutions throughout Canada, and will include aprotinin and both of the alternative drugs. Results should become available starting in 2008.

Trials of surgical techniques or medical devices are even harder to mount than drug trials. Sherwin Nuland, the Yale surgeon and a specialist in medical ethics, maintains that it is “not possible” to run RCTs in surgery. Subterfuges, like making an incision to simulate a surgical intervention, or to implant a dummy device, arguably violate the precept to “do no harm,” and few doctors are willing to cooperate in such ruses. Surgical technology is also continuously evolving. Several large trials of heart bypass surgery were implemented over periods when outcomes were rapidly improving, which greatly complicates their interpretation. In surgical trials, moreover, all comparative analyses are inevitably confounded by variations in individual technical skill—some surgeons would get better results with almost any procedure.

Undertakings like the NIH hypertension study and the Canadian aprotinin study are surely what we have in mind when we talk about RCTs as the “gold standard” for medical evidence—studies with the scale, financing muscle, and analytic firepower to provide definitive answers to important questions. But it is hopelessly infeasible to apply that order of screening to every new therapy in the explosion of invention emanating from our university and company laboratories. The “gold standard” RCT will almost always be more of a platonic than a practical ideal. Insisting that RCTs are the only useful medical evidence merely begets a pestilence of bad RCTs that benefit no one but the purveyors of questionable medicine.

So the quest to develop a toolkit of reliable statistical techniques to supplement, or possibly replace, RCTs is an important one, and the “propensity scoring” approach used in the NEJM article is one of the more promising of the current candidates.

The Promise of Propensity Scoring

In an ideal world, the oceans of medical records held by hospitals and other providers should be a rich source of information on the safety and effectiveness of medical procedures. But even if records were more accessible to researchers, making effective use of them would be problematic. In the research jargon, they are merely “observational” databases: causal associations derived from them are generally regarded as meaningless. To see why, assume you were trying to analyze the mortality effects of smoking from a large, but uncontrolled, database. A direct comparison of the mortality rates of pipe smokers and cigarette smokers would probably show that cigarette smokers have the better life expectancy, which seems odd, since cigarette smokers are more likely to inhale. The puzzle goes away, however, if you plug smokers’ age into the database. Pipe smokers tend to be older than cigarette smokers, and on an age-adjusted basis, cigarette smokers actually do have the higher mortality rate.

A simple way to make the age adjustment is to stratify the sample by age—say, comparing everyone 20–24, 25–29, 30–34, the finer the strata the better—and the age relationship will be readily apparent. In real life, however, any large, interesting set of observations will contain so many “confounding variables”—variables that could affect outcomes of interest—that the subgroupings (pipe smoker, aged 20–24, overweight, and upper income) quickly become unmanageable. Powerful desktop computers now permit analysts to tackle very large data sets with mathematical tools, like “multiple regression analysis,” that can often correct for biases buried in the data. While they are very useful, such tools can also be “remarkably misleading,” in the words of one expert, primarily because of assumptions embedded in the software about the normal behavior of databases.
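
The pipe-versus-cigarette puzzle is easy to reproduce with a few lines of arithmetic. The counts below are invented purely for illustration, with just two age bands for brevity, and the function names are hypothetical. The sketch compares crude death rates, rates within each age band, and a directly age-standardized rate.

```python
# Hypothetical counts, invented for illustration: (deaths, people) per age band.
COHORT = {
    "pipe":      {"under 50": (2, 200),  "50 and over": (80, 800)},
    "cigarette": {"under 50": (16, 800), "50 and over": (30, 200)},
}


def crude_rate(strata):
    """Death rate ignoring age altogether."""
    deaths = sum(d for d, _ in strata.values())
    people = sum(n for _, n in strata.values())
    return deaths / people


def stratum_rates(strata):
    """Death rate within each age band."""
    return {age: d / n for age, (d, n) in strata.items()}


def standardized_rate(strata, reference):
    """Weight each band's rate by that band's share of a reference population."""
    total = sum(reference.values())
    return sum((d / n) * reference[age] / total for age, (d, n) in strata.items())


# Reference population: everyone in the database, by age band.
REFERENCE = {
    age: sum(COHORT[group][age][1] for group in COHORT)
    for age in COHORT["pipe"]
}

for group, strata in COHORT.items():
    print(group,
          f"crude={crude_rate(strata):.3f}",
          f"by age={stratum_rates(strata)}",
          f"age-adjusted={standardized_rate(strata, REFERENCE):.3f}")
```

In this made-up database the cigarette smokers have the lower crude death rate only because most of them are young. Within each age band, and after age adjustment, they fare worse.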

Propensity scoring was developed in the mid-1980s to improve the validity of data extracted from observational databases. For example, suppose you have a large observational database of cardiac patients who have been treated either with surgery or only with medication, and you want to study their comparative mortality rates. The problem, of course, is that doctors are strongly disposed to recommend surgery for the sickest patients, so you would expect them to have high mortality rates.

The propensity analyst would approach the problem by first identifying a wide range of variables that might relate to the referral for surgery, like left main disease, multiple occlusions, etc. She would then use standard software to run hundreds or even thousands of multiple regression analyses to tease out which variables are most important, and the strength of each relationship. Each variable’s influence on the referral decision would be captured in a propensity score. The separate scores for each variable can then be combined into an overall score for a patient that approximately captures the likelihood of his being referred for surgery.

And now comes the rabbit out of the hat. The analyst can then stratify all the patients in the database by their propensity scores, in the same way as we stratified smokers for age. The patients in each stratum are then more or less equally likely to have been referred for surgery, whether they actually were or not. But that’s not a bad definition of a randomly selected group. The whole point of random selection is to ensure that people selected in the control group had an equal chance of being in the experimental group, so selection bias is eliminated. The propensity analyst can therefore proceed to analyze each of the matched strata as if it were the output of a real RCT.
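
The trick can be sketched on simulated data. Everything in the sketch below is an assumption made for illustration: a single "severity" covariate stands in for the wide range of clinical variables, the referral and mortality rules are invented, and a crude hand-rolled logistic fit stands in for the statistical packages a real analyst would use. Surgery is built to cut the risk of death by ten percentage points, yet the sickest patients are the ones most likely to get it.

```python
import math
import random


def simulate(n=6000, seed=0):
    """Invented patients: sicker ones are more likely to be sent to surgery."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severity = rng.random()
        p_surgery = 1 / (1 + math.exp(-(6 * severity - 3)))
        surgery = rng.random() < p_surgery
        # Assumed truth: surgery lowers the risk of death by 0.10.
        p_death = 0.12 + 0.55 * severity - (0.10 if surgery else 0.0)
        rows.append((severity, surgery, rng.random() < p_death))
    return rows


def fit_propensity(rows, steps=300, rate=2.0):
    """Logistic model of P(surgery | severity), fitted by plain gradient ascent."""
    b0 = b1 = 0.0
    n = len(rows)
    for _ in range(steps):
        g0 = g1 = 0.0
        for severity, surgery, _died in rows:
            p = 1 / (1 + math.exp(-(b0 + b1 * severity)))
            g0 += surgery - p
            g1 += (surgery - p) * severity
        b0 += rate * g0 / n
        b1 += rate * g1 / n
    return lambda severity: 1 / (1 + math.exp(-(b0 + b1 * severity)))


def risk_difference(rows):
    """Death rate with surgery minus death rate without it."""
    treated = [died for _, surgery, died in rows if surgery]
    control = [died for _, surgery, died in rows if not surgery]
    return sum(treated) / len(treated) - sum(control) / len(control)


def stratified_difference(rows, score, strata=5):
    """Average the risk difference across equal-sized propensity-score strata."""
    ranked = sorted(rows, key=lambda row: score(row[0]))
    size = len(ranked) // strata
    diffs = [risk_difference(ranked[i * size:(i + 1) * size])
             for i in range(strata)]
    return sum(diffs) / len(diffs)


rows = simulate()
score = fit_propensity(rows)
naive = risk_difference(rows)
adjusted = stratified_difference(rows, score)
print(f"naive difference:    {naive:+.3f}")
print(f"adjusted difference: {adjusted:+.3f}")
```

Compared naively, surgery looks harmful, because the surgical patients were sicker to begin with. Compared within strata of similar propensity, it looks protective, as it was built to be. With only five strata some confounding remains inside each one, so the adjusted figure is an approximation of the built-in benefit, not an exact recovery of it.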

That capsule description also suggests the limitations of propensity scoring. In the first place, you can never be sure that you haven’t missed an important confounding variable. The clinching advantage of randomized studies is that, as long as the control and experimental groups are large enough, all of the important variables will be more or less equally represented in both groups, whether or not the analyst is aware of them. A second problem is that propensity analysis is not an impersonal, algorithm-driven, processing job, like a multiple regression. Instead, the analyst is constantly making crucial judgment calls—selecting the initial variable set, deciding how deeply to dig for unknown confounding variables, constructing the subgroupings, and many others. Dr. Eugene Blackstone, a heart surgeon who has made a personal mission out of spreading the propensity scoring gospel, says “it takes lots of love” to build a propensity scoring data set, remassaging the data again and again before constructing the final data structure.

Just on its pedigree, the NEJM aprotinin study commands attention. The primary author, Dennis Mangano, is both a cardiac anesthesiologist and a mathematics Ph.D., and heads the Ischemia Research and Education Foundation, dedicated to improving the treatment of diseases marked by insufficient arterial perfusion. The foundation does not accept funds from pharmaceutical or medical device companies and has been a pioneer in the application of propensity methods to cardiac research. One of its, and Mangano’s, major accomplishments has been the construction of a very large observational database of CABG patients that includes nearly 7,500 data points on more than 5,000 patients drawn from 69 hospitals in 17 countries. The data were collected between 1996 and 2000, and Mangano has used them for several other well-received analyses. A 2002 report, for example, is considered a decisive demonstration of the great value of aspirin therapy after bypass surgery.

The results from the NEJM aprotinin study, moreover, were quite stark. Patients receiving aprotinin seemed much more likely to suffer renal failure or require dialysis, and there were also statistically significant increases in death rates, and adverse myocardial and cerebral events. Further, most adverse outcomes in the aprotinin group were dose-dependent—they were much more likely to occur in high-dose patients. And the side effects were much less frequent with the alternative antifibrinolytics. Unusually for a propensity-based study, Mangano also presented a forceful review of clinical and laboratory data supporting his statistical findings. Aprotinin has a strong affinity for the kidneys, and there are reasons to believe that it may damage the renal tubules; other research lent plausibility to its potential myocardial and cerebral effects. Warning signs of kidney damage had long since cropped up in the randomized clinical trials that had supported the drug’s safety, and the FDA had specifically cited the danger of “kidney toxicity” in its original approval letter. But, as Mangano acidly notes, company-sponsored trials were generally too small to kick out an unambiguous conclusion on renal effects.

Mangano concluded by briefly acknowledging the pitfalls of propensity scoring, but maintained that since a large and unusually thorough study had uncovered real dangers associated with aprotinin, and since safe alternatives were readily available, he had made his case. As he put it in a later exchange, “shouldn’t we now err on the side of protecting the patient…instead of protecting the drug?”

The Problems of Propensity Scoring

Strikingly, after the initial shock of the Mangano release wore off, the anesthesiology profession more or less officially closed ranks in favor of aprotinin. An updated set of surgical “Practice Guidelines” issued by the American Society of Anesthesiologists in July 2006 maintained its previous recommendation regarding the use of aprotinin, pointedly ignoring the Mangano study and citing only the RCT-based meta-analyses on aprotinin’s safety and effectiveness.

A reaction against propensity scoring appears to have been brewing for some time. As early as 2000, an NEJM editorial worried that the spread of statistical analytic techniques threatened to undermine the commitment to RCTs, which it called the only sure way of eliminating patient selection bias. Propensity scoring, in particular, since it is one of the more accessible of the newer approaches, seems to be popping up in almost every issue of most of the major journals. One recent paper suggests that with the proliferation of the technique, the quality of the studies has been slipping, and many appear to be badly conceived.

I reached out to Donald Rubin, a professor of statistics at Harvard. He is the coinventor of propensity scoring and has written widely on its applications. My email drew an immediate response:

The analyses by Mangano et al. are a misapplication of propensity score methods and should not be the basis of any conclusions regarding the use of aprotinin. I am surprised the study was considered publishable by any journal. My criticisms are numerous.

He also told me, however, that he had been engaged by Bayer to consult on their response to the article, and we agreed I should look for an expert without ties to any of the parties.

The detailed criticisms below are based mostly on discussions with two professional statisticians, Stan Young and Bob Obenchain, both with a substantial record of scholarly publication in statistical methods. Dr. Young was a public witness at a September 2006 FDA hearing on the aprotinin controversy. He is director of bioinformatics at the National Institute of Statistical Sciences, a university-sponsored consortium in North Carolina’s Research Triangle, and an adjunct professor of statistics at three universities. Young has no relation with Bayer and was not paid for his time at the hearing; he was there, he said, “solely because I take the misuse of statistics personally.” He is also, however, a thirty-year veteran of the industry, and pharmaceutical companies are among the institute’s affiliates.* Dr. Obenchain has had a long career at Eli Lilly, where Young was also employed, and is an expert on propensity methods. Finally, Bayer submitted a briefing paper prior to the September hearing with a detailed critique of Mangano’s methods, which I assume was partly Rubin’s work.

I’ve extracted three classes of criticisms of the Mangano paper, only one of which is technical. The first relates to evidence of carelessness in the presentation. In scholarly articles, authors also prepare abstracts—short summaries of the article’s main points. The abstract for the Mangano article includes the conclusion that “aprotinin was associated with a doubling of the risk of renal failure requiring dialysis among patients undergoing complex coronary-artery surgery (odds ratio 2.59).” The “odds ratio” means, roughly, that the odds of dialysis-requiring renal failure were 2.59 times higher for the patients given aprotinin. That was arguably the most damning charge against aprotinin and the one most widely picked up by the lay press. It was also not what the article actually said. If one goes through the tables of data, they show that aprotinin patients undergoing complex heart surgery had 2.59 times the odds of suffering a renal event, which was defined as either a renal failure requiring dialysis or an increase in serum creatinine levels. An increase in serum creatinine levels is suggestive of renal problems and clearly not a good thing, but it is often transitory, and there are a number of effective treatments, like ACE inhibitors. Lumping catastrophic outcomes like near-total kidney failure together with changes in chemical markers, especially without providing any information on the proportion of each, seems obfuscatory at best. But misstating an important result in the abstract is sloppy in the extreme. That the editors of so prestigious a journal as the NEJM, in a piece calculated for maximum impact, would let such a howler slip by is incomprehensible.
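
A small worked example shows why the distinction matters. The counts below are invented for illustration and say nothing about the proportions in Mangano's database, which the article did not report. The sketch computes an odds ratio and a risk ratio for a composite "renal event" and for dialysis alone.

```python
def odds_ratio(events_a, total_a, events_b, total_b):
    """Odds of the event in group A divided by the odds in group B."""
    odds_a = events_a / (total_a - events_a)
    odds_b = events_b / (total_b - events_b)
    return odds_a / odds_b


def risk_ratio(events_a, total_a, events_b, total_b):
    """Probability of the event in group A divided by that in group B."""
    return (events_a / total_a) / (events_b / total_b)


# Hypothetical counts per 1,000 patients in each group.
N = 1000
drug = {"dialysis": 12, "creatinine rise only": 88}
no_drug = {"dialysis": 10, "creatinine rise only": 32}

# Composite end point: either outcome counts as a "renal event".
composite_or = odds_ratio(sum(drug.values()), N, sum(no_drug.values()), N)
composite_rr = risk_ratio(sum(drug.values()), N, sum(no_drug.values()), N)

# The catastrophic outcome taken alone.
dialysis_or = odds_ratio(drug["dialysis"], N, no_drug["dialysis"], N)

print(f"composite renal event: odds ratio {composite_or:.2f}, "
      f"risk ratio {composite_rr:.2f}")
print(f"dialysis alone:        odds ratio {dialysis_or:.2f}")
```

With these made-up counts the composite odds ratio is above 2.5, while the odds ratio for dialysis alone is close to 1, since nearly all of the excess comes from the milder outcome. An odds ratio is also somewhat larger than the corresponding risk ratio unless the event is rare, so "times more likely" is only an approximate reading of it.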

Second, the article is full of anomalous details that call for explanation. For example, a large group of database patients with very high death rates were excluded from the analysis. Depending on whether they received aprotinin or not, they could greatly skew the outcomes, but there is no explanation as to why they were excluded. Mangano has also issued several reports based on this same database in the past, but the heart attack death rate is much higher in this report than in any of the previous ones, again with no explanation. Mangano’s analysis also appears to omit several potentially important confounding variables. Germany, for example, uses aprotinin far more than any other country, so German patients must be disproportionately represented in the aprotinin group. But Germany also has a very high cardiac morbidity and mortality rate, suggesting that the reported adverse outcomes just confirm the Germanness of the aprotinin group.* Certain treatment variables that are strongly associated with adverse outcomes, like time on bypass and the use of fresh frozen plasma products, also appear to have been excluded. Finally, Mangano states that his methods achieved the magic propensity balance among people receiving and not receiving aprotinin. The article’s aprotinin patients, however, were much more likely to have certain risk factors than the other patients, and the disparities are large enough to suggest that it might have been difficult to construct useful subgroups.

While the questions in the preceding paragraph raise flags, and there are a lot of them, Mangano may well have satisfactory answers for them all. Which brings us to the third problem: the secrecy surrounding his data. As of March 2007, more than a year after the publication of his article, Mangano has refused to release his underlying data and detailed analyses. A critical test of scientific claims is that they can be replicated by other workers. The National Academy of Sciences has adopted principles that require authors of scholarly articles to make their data freely available to other researchers. Similarly, FDA approval submissions must contain all relevant data in a form suitable for reanalysis by the agency’s statisticians, since it is the only way to be sure that claims are truly supported. The same considerations apply to Mangano.

Mangano’s unwillingness to submit his data seems perverse—there is almost no way that the FDA could overturn a previous approval that was based on a substantial body of statistical material simply on his say-so. Publicly, Mangano insists that he did offer up his data, but that is disingenuous. At a September 2006 public hearing, before a panel of expert FDA advisors, Mangano made it clear that he would allow FDA staff to look at the data, but only on his computer and while he was physically present, and he categorically refused to give them the data so they could analyze it themselves. Negotiations had stretched on for months. As the FDA representative stated at the hearing:

We were offered limited access to the data. There was this qualifying expectation that our examination be chaperoned, or supervised, if you will…. Our statisticians felt uncomfortable having a supervised access to that data in the sense that we would not have the ability to explore the data and at least verify the mathematics and the statistical aspects.*

Given that background, it is surprising that an NEJM editorial published on November 23, 2006, should state that Mangano had “offered the FDA unrestricted access to the data, but the offer was not accepted.” Mangano, indeed, made that claim in a letter that accompanied the editorial, but the NEJM editors must have known what had actually transpired. I later corresponded with Dr. William Hiatt, of the FDA’s Cardio-Renal Advisory Committee, who chaired the hearing. He told me he hoped the FDA staff would execute a full propensity analysis with the Mangano data, which seems reasonable enough, but that would entail a couple of months of work at least, and would be impossible under the Mangano restrictions.

A second observational study, by a Canadian anesthesiologist, Keyvan Karkouti, was also presented at the hearing. It used classic propensity-matching methods to compare outcomes from aprotinin and tranexamic acid, one of the two cheaper antifibrinolytics, in nearly a thousand “moderately high-risk” patients. It was not favorable to aprotinin. Aprotinin was no more effective than the cheaper alternative and had significant renal effects, defined as increased creatinine levels, but no increased mortality and no additional strokes, cardiac events, or cerebral events. There were also more cases of renal failure in the aprotinin group, but not at the level of statistical significance. Karkouti clearly laid out his methods and furnished all the supporting data to the FDA. No one, including the Bayer representatives, challenged his findings, and the study helped confirm the committee’s conviction that aprotinin was associated with renal effects.

At the end of the hearing, left with little choice, the expert panel voted 18 to 0, with one abstention, to ignore the Mangano paper and maintain, with some minor adjustments, the current recommendation supporting the use of aprotinin for patients at a high risk of bleeding. In doing so, they specifically recognized the presence of renal effects, but felt that they were outweighed by the dangers of extensive bleeding in high-risk patients. They were unconvinced that aprotinin caused “renal failure requiring dialysis” since the number of observed events was very small, and they saw no evidence of elevated mortality risks, or cerebral or cardiac events. Finally, they noted that while the effects of aprotinin in reducing transfusion requirements were well established, there were no data to support the assumption that it thereby improved mortality.

Soap Opera

With the advisory committee essentially confirming the value of aprotinin in appropriate cases, the undermining of the Mangano study seemed complete. But then, a week after the FDA hearing, Dr. Alexander Walker, a former chairman of the Department of Epidemiology at Harvard University, informed the FDA that he had been engaged by Bayer’s German parent to carry out an analysis of aprotinin effects from a claims database maintained by United Healthcare, a large HMO, and apparently had come up with results not unlike Mangano’s.

The FDA naturally suspended its review of aprotinin until it could analyze the new Bayer data. While the agency’s press release was couched in bland officialese, several of the advisory committee members expressed outrage at Bayer’s behavior. Bayer claims it fully intended to release the data once its analysis was completed. In any case, their failure to disclose that they were undertaking such a study was itself a violation of their terms of engagement with the FDA. In the wake of multiple revelations that companies were repressing data unfavorable to their products, the FDA had reconfirmed the requirement that companies fully disclose internal studies on the performance of drugs under agency jurisdiction. The company has “reassigned” the two executives who contracted for the study, and in time-honored corporate damage-control style has engaged an outside counsel to review what went wrong with their internal processes.

Walker’s approach is an appealing one. Claims databases are usually regarded as not useful for treatment analyses because of the frequency of coding errors and the incentive to overcode. Walker finesses the problem by inferring disease characteristics from treatment patterns (the tests, the prescriptions, the number of visits, etc.) rather than from codes. Assuming his methods withstand critical analysis, they offer a nice counterpoint to Mangano’s approach. Mangano’s CABG database is highly detailed, and has fed several studies, but was very expensive to construct, and will inevitably obsolesce as treatments evolve. Walker, on the other hand, should be able to quickly extract at least indicative results on a variety of questions, and his results will gain authority from the very large databases he draws from. His aprotinin study reportedly involved more than 60,000 cases and was executed in just a few months.

As of March 2007, the FDA has still not reacted to the Walker study, and Walker himself is under a confidentiality rule. In December 2006, however, the agency released a label update for aprotinin that includes a modest toughening of restrictions, all of them consistent with the consensus of the advisory committee. The most extensive warnings related to the possibility of anaphylactic shock in patients who have had previous exposure to the drug, an issue that was not discussed in Mangano’s NEJM paper. (Anaphylaxis is an immune response; it occasionally causes a fatal reaction to bee stings in sensitized individuals.) The new label also added an explicit warning on the risk of renal dysfunction but noted that such effects were most common in people with preexisting renal conditions, and usually were “not severe and [were] reversible.” Finally, it reinforced the use indication by adding, in italics, that aprotinin should be restricted to patients “who are at increased risk for blood loss and blood transfusion.” There was no mention of Mangano’s paper, and the cited evidence was all drawn from meta-analyses of accumulated RCTs. (FDA drug labels, it should be noted, are quasi-advisory. Doctors may legally prescribe an approved drug for anything at all, although companies are barred from marketing off-label uses. Off-label applications for aprotinin in fields like orthopedic surgery outstrip on-label use.)

While Mangano and the NEJM deserve credit for precipitating both the label review and a rethinking of the trend toward increased aprotinin use, their primary findings on mortality risk were clearly rejected by both the FDA and the anesthesiology profession, and the episode has probably raised the barrier to professional acceptance of statistical results from observational databases. The FDA’s silence on the Walker study is perhaps to be expected. Given the novelty of his methods, the likely next step would be to convene an advisory committee meeting to review his methods and conclusions, probably in the summer of 2007. Our story, therefore, ends on an inconclusive note—a tawdry melodrama still missing parts of the final act.

But since we’re on the subject of soap opera, an earlier act in the play still puzzles: what on earth was going on at the NEJM?* Poking a powerful profession in the eye, as the Mangano article did, is part of a journal’s job. But the most lethal mode of scholarly assault is usually to downplay the rhetoric and let the data speak for themselves. In this case, the presentation was oddly tilted almost to provoke a fight. The article’s tone was often caustic, doctors complained that press releases went out well before the article was available, and there was the stark “not prudent” conclusion highlighted in the abstract. Nothing wrong with that, if your data are unassailable. But the paper is so shot through with apparent inconsistencies that the professional associations and the FDA have effectively ignored it.

A little history, however, suggests that the NEJM may have been anxious for a confrontation. The current editor, Jeffrey Drazen, was appointed in 2000, succeeding two crusading editors, Marcia Angell and Jerome Kassirer, who are strong critics of the drug industry. Since their departures, both have published exposés of medical corruption in general and the pharmaceutical industry in particular. The choice of Drazen was especially controversial, for he had had financial relations with twenty-one pharmaceutical companies between 1994 and 2000, the half dozen years before his appointment. Early in his tenure, the NEJM published the flagship study that established Vioxx as a multibillion-dollar product—the now-famous paper that withheld the data on cardiac events that later led to the drug’s withdrawal. As those cardiac data became known, the NEJM published an expression of concern about its earlier paper in December 2005, and after allowing the authors a chance to explain themselves, withdrew the article in March 2006. A few weeks later, however, the Wall Street Journal revealed that the NEJM had been informed of the data discrepancies by a knowledgeable third party as early as 2001. Criticism of Drazen and the NEJM, especially in the British medical journals, was vitriolic—the episode was more proof “that medical journals are an extension of the marketing arm of pharmaceutical companies,” as the Journal of the Royal Society of Medicine put it.

The Mangano study would have been in preparation for publication at about the same time the embarrassing revelations on Vioxx were being broadly released. Drazen and the other NEJM editors knew that they would be fiercely criticized. What better way to prove their cojones than to go full-bore against aprotinin? The NEJM, after all, is also a big business with estimated profits of perhaps $75 million a year. An impression that the journal, or its editor, is an industry patsy may not be good for sales. Dennis Mangano, with his antiestablishment reputation, is the perfect antidote, and he has done good work in the past. But that is no excuse for suspending standards.

This is a sad, messy tale. But perhaps some good may come of it. New devices, new procedures, and new medications are proliferating at an accelerating rate, and it is simply not possible to subject them all to “gold standard” RCTs. So enthusiasm for the statistical treasures potentially lurking in medicine’s vast observational databases is entirely justified. And by assembling his database of heart surgery patients Mangano has done a signally important piece of work. But as frequently happens in the early stage of a hot new procedure, enthusiasm can overweigh judgment. Journals should simply stop publishing statistics-based claims from observational data—like each month’s boatload of new articles based on propensity scoring—until they have assembled panels of expert reviewers, just as they do for medical claims. (And doctors who have had a couple of courses in statistics are not experts.) Upgrading capabilities for handling advanced quantitative methodologies will not be a trivial undertaking for medical journals, and may require substantial reworking of current acceptance processes. But if they are not prepared to make those investments, they should stay out of the game.

As soap operas go, and this one has been especially depressing, it could still have a happy ending. One of the FDA advisory panel members, David DeMets, chairman of the biostatistics department at the University of Wisconsin, said in his closing remarks:

Well, this has been a fascinating day…. [W]hen I looked at the New England Journal paper I was disturbed by it…. I pretty much dismissed the conclusions that were drawn. But…the reason it’s been fascinating today is…. [i]n the post-Vioxx era…we’re going to have to rely on observational data of this kind to make, to get, some further information. We will not have randomized trials of long duration for rare events. But I think today’s discussion has demonstrated just how big a challenge that is…. It’s very tricky stuff and it’s very hard to do…. We’re going to have to really drill down on the analysis details a lot more than we do in, say, randomized trials.

I think that’s exactly right, and if it results in medical journals upgrading their abilities to handle advanced quantitative arguments as well, that would be a happy ending indeed.