Chapter Ten

INVENTING A DISCIPLINE


STEVE GOODMAN has spent much of his career thinking about the many ways that medical research can go awry. It was the intellectual drumbeat that kept him stimulated as he performed more mundane tasks as a biostatistician and epidemiologist at Johns Hopkins University, such as helping scientists there design valid clinical trials. Finally, Goodman decided to turn his full attention to issues of rigor and reproducibility in biomedicine. He moved to Stanford University in 2011 and two years later cofounded a new endeavor called METRICS, an acronym for Meta-Research Innovation Center at Stanford. “We do research on research,” Goodman told me. “To figure out what’s wrong and how to make research better, you have to study research. That’s what meta-research is. It’s not like metaphysics. It’s real. And we look at real things.”

“I wanted to call it the Center for Medical Truth,” he told me. “This was roundly nixed.” That rejected name reveals the depth of Goodman’s concern about the heart of medical research.

Not only does METRICS have an unusual mission; it has an unlikely history. The codirectors, Goodman and John Ioannidis, are not natural partners. Goodman is deliberative, while Ioannidis moves rapidly from project to project, publishing dozens of papers every year. The two scientists even faced off in a very public disagreement a decade ago. But their different styles and approaches may actually be an asset as they seek to disentangle the many factors responsible for the lapses in rigor and reproducibility in biomedicine.

Both scientists had spent years doing research that formed the foundation of meta-research. During his medical training in Greece and later at Harvard, Ioannidis realized that the medical literature was deeply unreliable. He launched his career studying the shortcomings of research involving human subjects. “Most of the time what we would find out was that the data were horrible,” he told me. “The analysis had major problems. There were strong biases.… And most of the time, if you had to be honest, you would conclude despite all this data, I really don’t know what’s going on here.” In the 1990s, he and Goodman independently became part of a wave to clean up the methods used in clinical medical research. It was a transformative time for medicine, because doctors were gradually relying less on intangible “expert judgment” and trying instead to make treatment decisions based on data. The movement toward data-driven medicine, of course, needed to rest on good data and careful analysis. Often it didn’t.

In one classic study, Ioannidis looked at papers from major medical journals that other researchers had cited at least a thousand times—a mark that they were having a major impact on the field. Of the forty-nine studies that met this criterion, seven had been flatly contradicted by further studies. Those included some famous mistakes in biomedicine, such as the claim that estrogen and progestin benefitted women who had had hysterectomies, when in fact the drug combination increased the risk of heart disease and breast cancer. Another ballyhooed study, which found that vitamin E reduced heart disease risk, turned out not to be true either. A few years later, Ioannidis followed up to see whether scientists were still citing that original, disproven study. The answer was yes—and frequently! Years after two of the largest and most expensive medical studies ever undertaken had debunked the claim that vitamin E reduces heart disease, half of all articles on the subject still cited the original study favorably. It left him shaking his head in disbelief. “How many trials of a billion dollars each can we do to refute a single claim out of the millions of claims that observational studies put forth?” he asked me. “We would need quintillions of dollars just to show what things are worthless before we start doing our real job,” which is finding treatments that actually work. He suspects that scientists who had spent their careers studying vitamin E kept on defending the positive findings. “They were living in their own bubble, unperturbed by the evidence.”

“This is one major reason why having lots of false results circulating in the literature is not a good idea. These results get entrenched. You cannot get rid of them,” he told me. “There will also be lots of people who are unaware of it who will just hit upon the paper and will never know that this thing has been refuted.”

Ioannidis’s papers have now collectively garnered more than 100,000 citations. None is more widely known than his 2005 paper titled “Why Most Published Research Findings Are False.” It didn’t make the public splash that Glenn Begley’s did, but it became a touchstone for many academic discussions about the foibles of biomedical research. The odd thing about the heavily cited paper is that it contains no hard data. There’s no survey of researchers or random sampling of papers in the scientific literature. Instead, it’s an essay in which Ioannidis makes a purely statistical argument. In essence, he concludes that simply by looking at how scientific research is designed and executed, one can tell that many published findings are nothing more than false positives.

At first the paper caught the attention of statisticians and study designers. “Gradually more people started seeing these problems and being interested in these problems, and wanted to see what was happening in their field,” Ioannidis said. The provocative title probably helped. “I think actually the title was a bit of a risk because if the paper didn’t have substance then it would very easily backfire,” Ioannidis said.

Goodman, for one, was not convinced by Ioannidis’s statistical argument. He and a colleague, Sander Greenland, questioned its underlying assumptions. “We agree with the paper’s conclusions and recommendations that many medical research findings are less definitive than readers suspect,” Goodman and Greenland wrote, but “the claim that ‘most research findings are false for most research designs and for most fields’ must be considered as yet unproven.”

In addition to the pushback Ioannidis got from Goodman, Jeff Leek at Johns Hopkins also published a critique, using actual data from top medical journals as input for his calculations. Leek’s paper concluded that the failure rate in those publications was probably around 14 percent, which is not nearly as dire as the assertion that “most” findings are false. But that’s not necessarily inconsistent with Ioannidis’s essay, which noted that failure rates range from 15 to 99 percent or more, depending on the size and design of the study. Large clinical studies fare best, while the findings of smaller laboratory studies, based on small samples, are, statistically speaking, unlikely to be true most of the time. (Scientists do raise their eyebrows when you suggest that an entire class of small studies is wrong 99 percent of the time, as the Ioannidis calculation suggests.) Other scientists quoting his paper rarely mention that the success rate varies dramatically, depending on the study type. “I think that oversimplifying to just get an average is not very helpful,” Ioannidis told me, though his paper’s title of course encouraged readers to do just that. Whatever its flaws, there’s no question that the paper helped stimulate the current conversation about irreproducibility in biomedical research and how to deal with it.
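To see roughly where those numbers come from, here is a minimal sketch of the kind of calculation at the heart of Ioannidis’s essay. It is an illustration only, with made-up figures rather than values from his paper, and it omits the bias term he also models: given the prior odds that a tested hypothesis is true, a study’s statistical power, and the significance threshold, it estimates what share of “positive” findings will turn out to be false.

```python
# A toy version of the calculation in Ioannidis's 2005 essay (simplified:
# it ignores the bias term he also models, and the numbers below are
# illustrative, not taken from the paper).

def false_positive_share(prior_odds, power, alpha=0.05):
    """Fraction of statistically significant findings expected to be false.

    prior_odds: odds that a hypothesis being tested is actually true
                (1.0 means one true hypothesis for every false one).
    power:      probability a study detects a real effect (1 - beta).
    alpha:      significance threshold (chance a null effect "passes").
    """
    true_positives = power * prior_odds   # real effects that reach significance
    false_positives = alpha * 1.0         # null effects that reach significance
    return false_positives / (true_positives + false_positives)

# A large, well-powered trial testing a plausible hypothesis:
print(false_positive_share(prior_odds=1.0, power=0.8))   # ~0.06

# A small, underpowered study testing a long-shot hypothesis:
print(false_positive_share(prior_odds=0.1, power=0.2))   # ~0.71
```

Run with numbers typical of a large clinical trial, only a small share of positive findings are expected to be false; run with numbers typical of a small study chasing a long-shot idea, most of them are, which is the pattern behind the 15-to-99-percent range.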


Scientists looking for ways to improve the reliability of laboratory research have much to learn from an earlier push to improve medical research involving human subjects. Over the past twenty years, there’s been significant progress in that arena. The best clinical studies are now designed and carried out with great care (and at great expense). They involve many patients and often multiple research centers to make the findings more robust. And those improvements have helped push medical science forward, providing credible evidence for treating and managing disease, while gradually driving bad ideas out of practice. To cite one example, a careful study determined that hormone replacement therapy with the combination of estrogen and progestin was deadly for many women during the years doctors prescribed those drugs together. By one estimate, that corrective study triggered a change in medical practice that averted 126,000 breast cancer deaths and 76,000 heart disease fatalities between 2003 and 2012.

“I would be the last to say we’ve solved all the problems of clinical research,” Goodman told me. “But at least we have a decent template of what needs to be done.” The question is how to apply the lessons from clinical research to the world of laboratory science. There’s no cookie-cutter solution. Each field of science has its own particular culture, so each will have to develop its own ways to improve rigor. For example, Goodman discovered that in psychology, a single experiment often becomes the basis for an entire career, and replication is actually discouraged. “To redo an experiment is taken as a personal attack on the integrity and on the theories of the person who did the original work,” Goodman found. “I thought I couldn’t be shocked, but this is truly shocking.”

There’s also a big difference between laboratory research, where scientists are trying to figure out the mechanism of disease, and drug testing in human subjects, where overarching questions are practical: Is a drug or procedure safe and effective? A clinical trial sets out to answer those comparatively straightforward yes or no questions. Lab research often explores the much more difficult question of why. As a result, clinical studies and basic biomedical research “are profoundly different cultures.”

Importing solutions directly from clinical research and applying them in the laboratory is not likely to work. Goodman made an analogy to foreign aid projects, where Westerners parachute in and attempt to impose a solution, without fully understanding the local customs. “You can make it much worse if you don’t respect the culture,” he said. What’s more, each area of biomedicine has its own insular subculture, in which ideas and methods—good and bad—circulate and reverberate. Norms change over time, but they don’t transmit easily from one field to another.

Many of the needed improvements—randomizing animals in studies, keeping lab personnel blind to which animals are in the test group and which are in the comparison group, not changing the end point once the experiment is under way, starting with an adequate sample size—can be made unobtrusively. Often, lab scientists making mistakes in those areas “didn’t know it was important,” Goodman said. Now that the word is spreading, he expects this will start to change—leading eventually, he hopes, to a social transformation of biomedical science.

Goodman and Ioannidis are trying to accelerate that social transformation. To that end, in the fall of 2015, they stood before a crowd arranged around tables in an airy conference center on the Stanford campus. They had invited dozens of scientists from the United States and Europe who have studied the issue of rigor in biomedical research and asked them a provocative question: How can this nascent community chart a research agenda to study these systematic problems and identify and test potential solutions?

The discussion touched on four topics that generally arise when scientists think about how to fix the broken system: getting individual scientists to change their ways, getting journals to change their incentives, getting funding agencies to promote better practices, and, last but not least, getting universities to grapple with these issues. Of course these are interlocking challenges.

Take, for instance, the fact that universities rely far too heavily on the number of journal publications to judge scientists for promotion and tenure. Brian Nosek said that when he went up for promotion to full professor at the University of Virginia, the administration told him to print out all his publications and deliver them in a stack. Being ten years into his career, he’d published about a hundred papers. “So my response was, what are you going to do? Weigh them?” He knew it was far too much effort for the review committee to read one hundred studies. “So the message that’s being delivered to me was… volume matters.” Nosek told me that delivering his three best papers would have provided a more meaningful view of his accomplishments to date. And if he’d known up front that the quality of his findings, rather than the sheer number of his papers, would be key to getting tenure, that “would totally change the incentives” for his course of research. He would have spent more time thinking about a few big and interesting questions to tackle rather than worrying about populating his list of publications.

Frank Miedema, dean of the medical school at the University Medical Center in Utrecht, Holland, looked at this issue from an institution’s point of view. He complained that scientists at his medical center publish 3,500 papers a year, “and I don’t know who reads them. Have you read one of our papers?” he asked the audience at Stanford. The answer was silence. The push for quantity is utterly misplaced, he argued. Scientists aren’t asking questions with important answers; they’re asking easily answerable questions. Nobody wants to spend four years on a risky project that has a big potential payoff but could also fall flat. Such a project might be a powerful way to move medical science forward, but the risks to a scientist’s career are enormous, given the current incentive structure.

Miedema said he’s trying an experiment at his institution to break out of that academic trap. “Ask the patients, and they tell you what they want. That’s what we do.” His medical school judges the scientists there on the public impact of their work and cares less about curiosity-driven studies. “Most papers are never used, and rarely read,” he said. “If we don’t incentivize and reward people to do the right things, they will not do the right things, and they will keep on publishing this waste, and nothing will change,” he said. “You guys will be here ten years from now, and John [Ioannidis] will have even less hair… and there will be no change in the system.”

Robert Califf, then awaiting confirmation as Food and Drug Administration (FDA) commissioner, said that sensibility is starting to take hold in the United States as well. “Academia has to clean up its shop and get out of the ego business and get into the business of answering questions that matter to patients,” he said. “But the beauty is the patients are gradually going to be taking control.” If academics don’t answer the questions the public cares about, “there’s a very high chance you won’t get funded because they’re going to have a lot to say about it.” Politicians already steer some biomedical research dollars through the Defense Department, which is heavily influenced by patient advocacy groups that participate in the peer review process. Those groups are trying to get more involved in setting the research agenda among scientists funded by the National Institutes of Health (NIH) as well. But they also need cooperation from scientists, who are used to dreaming up their own ideas and getting grants to follow their intellectual muses. Both kinds of science are essential; it’s a question of balance.

Incentives to improve academic research can also come from pharmaceutical companies, which have whittled away their own research departments and depend increasingly on academia for new product leads. Glenn Begley’s shot across the bow in 2012 laid out the problem in stark terms. Some universities are now pursuing research that blends pure scientific exploration and commercialization. Barbara Slusher is pioneering one of those efforts. She has been on both sides of the divide and runs an operation at Johns Hopkins University to take the best from each approach. Academics sometimes disdain industry scientists as unimaginative. But industry gets some important things right. And as they’re preparing to move a drug through the FDA and toward market, they must follow FDA-approved good laboratory practice guidelines, which adds a layer of bureaucracy (largely careful documentation) to their work.

Slusher tries to use some of the tools of industry to validate ideas from academic labs—before they head toward drug development. She said she’s trying to avoid more papers like Glenn Begley’s, in which pharma complains about the poor quality of academic research. “It’s not good. Not good. So our thought is let’s keep it in-house. Let’s keep it within the family” before the ideas go on to pharma. “If you talk about solutions, I think that’s something we’ll see a lot more of.”

Even at Hopkins, one of the nation’s top research institutions, her lab often struggles to reproduce results from the university’s labs. They’ve explored dozens of promising ideas. “I’d say we’ve had better than 50 percent reproduction, and I think that’s probably because we’re working hand in hand with the faculty that made the initial discovery,” she told me. “We’ve got to get rid of this irreproducibility issue. That’s a problem. We’ve got to get better.”

Glenn Begley has weighed in on this as well. He and two colleagues wrote that good laboratory practice, which works well in industry, should be adapted for academia. “The scientific community should come up with a similar system for research, which we term good institutional practice (GIP). If funding depended on a certified record of compliance with GIP, robust research would get due recognition.” Michael Rosenblatt from Merck has suggested an even more aggressive remedy: drug companies should fund more research at universities, but, in exchange, universities should offer a money-back guarantee if the findings don’t hold up. That would obviously make universities take a more active role in ensuring the reproducibility of research conducted within their walls.

Journal publishers could also play a role in easing the problems of reproducibility. One simple step would be to publish more studies that report “negative results,” that is, that fail to replicate a previously reported positive finding. High-profile journals are reluctant to do that now because those follow-up studies get cited less frequently than new and exciting ones, potentially reducing a publication’s impact factor and therefore its profits. Daniele Fanelli at METRICS has also suggested that journals set up a system of “self-retraction” so that scientists who find honest errors can flag them in the journal that published the original work. You’d think that would already happen, but in fact, because retractions are often assumed to be the result of questionable behavior, scientists are loath to admit honest errors. Colleagues wonder about the backstory, which can harm reputations and careers. Retractions “are often a source of dispute among authors and a legal headache for journal editors,” Fanelli wrote. These self-retractions would be signed by all authors as a signal that they were the result of honest error. He went on to suggest that journals should consider a year of “scientific jubilee” during which papers could be self-retracted, no questions asked. “The literature would be purged, repentant scientists would be rewarded, and those who had sinned, blessed with a second chance, would avoid future temptation.”

There’s also plenty of room to improve the journals’ own peer review process. Steve Goodman was startled to discover that Science magazine didn’t have a formal board of statistics editors until 2015 (though it did use statisticians as reviewers before then). “This has been recognized as absolutely critical to the review of empirical science for decades. And yet Science magazine just figured it out. How could that be?” Another problem: peer review is usually unpaid, so scientists may delegate the job to graduate students or spend less time than it might take to uncover problems with a paper.

Many scientists aren’t waiting for journals to change their ways. Social media has created many avenues for scientists to carry on these conversations outside the traditional channels. Scientific firebrands like Michael Eisen at the University of California (UC), Berkeley, tweet out 140-character critiques of their colleagues’ work. Paul Knoepfler at UC-Davis writes pointed blog posts about research that concerns him. Other scientists are posting comments on a site called PubPeer, which allows them to take anonymous potshots at research articles. And the NIH has gotten in on the act as well, creating PubMed Commons, an on-the-record comment section connected to the main publications database.

A British organization, the Faculty of 1000, started a Preclinical Reproducibility and Robustness channel on its website, which accepts papers that critique or describe failed replications of previous studies. Articles are posted, and peer review comes in the form of comments. Scientists have also started posting to bioRxiv.org, a “preprint” site that doesn’t require peer review upfront but counts on scientists who comment on those papers to serve in place of journal gatekeepers. These movements could eventually devalue the journal article as the ultimate currency of scientific research and move toward a more fluid world where the record evolves along with the science. Ahmed Alkhateeb, a postdoc at Harvard Medical School, wrote an opinion piece suggesting that scientists should publish more bite-sized bits of research, with greater focus on a small increment of new data and less focus on the analysis that attempts to weave that new information into a broader scientific narrative. He argued that this system would reduce the incentive for scientists to focus unduly on data that support a popular hypothesis. A more nimble publication system might also encourage scientists to publish confirmatory or negative results.

Marcia McNutt, who spoke at the Stanford meeting as editor in chief of Science (before becoming president of the National Academy of Sciences), worried that deemphasizing the role of journals as gatekeepers would make it even more difficult for young scientists and students to know what to trust in the literature. At the same time, she acknowledged the limits of scientific publication by recounting a story about John Maddox, longtime editor in chief of Nature. Someone once asked him how much of what Nature published was wrong, “and he famously answered, ‘All of it,’” McNutt said. “What he meant by that is, viewed through the lens of time, just about everything that we write down we’ll look back at and say, ‘That isn’t quite right. That doesn’t really look like how we would express things today.’ So most papers don’t stand up to the test of time.” McNutt herself said that of the papers deep in Science’s archives, “I probably wouldn’t publish them again today, even if I didn’t care about how up-to-date they were.” Her point wasn’t that journals are useless, of course, but that scientific findings are provisional and should be treated as such.

Many of these suggestions have financial implications, whether for journals, which could lose stature if they publish less flashy papers, or for universities, which might have to offer money-back guarantees to funders from industry. But Brian Nosek argued that money isn’t the only way to change human behavior. “The solutions don’t require a huge shift in budget,” he said at the Stanford meeting. “They require a small shift in budget.” And sometimes small incentives can have an outsized impact. One idea he has pursued involves awarding “badges” to scientists who do the right thing. Like a gold star on an elementary school assignment, these visible tokens mark published papers whose authors have agreed to share their data. “Badges are stupid. But they work,” he said.

The Center for Open Science ran an analysis after the journal Psychological Science started publishing openness badges in 2014. Though many scientists who published in the journal didn’t seek this goody-goody mark of approval, their behavior changed nonetheless. A year after the journal started posting badges, the percentage of papers with open data rose from 3 to 38 percent, Nosek and his colleagues found.

Of course, reproducibility would improve if scientists took simple technical steps, such as validating cell lines, running proper controls with their antibody experiments, choosing adequate sample sizes for mouse studies, deciding in advance what hypothesis they were testing, and so on. Scientists like Nosek hope to make those practices more common simply by raising awareness. One vehicle for doing that is publishing guidelines and checklists for scientists to follow. The ARRIVE guidelines, for example, provide a template for scientists who publish results from animal experiments. Nosek convened a committee that developed the Transparency and Openness Promotion (TOP) guidelines. A survey of animal-research guidelines in 2013 identified twenty-six distinct sets, including fifty-five specific recommendations (such as randomization and adequate sample size). Major journals have adopted publication guidelines, negotiated at an NIH-sponsored meeting, that they ask authors to follow. Nature, for example, requires scientists to complete a checklist stating whether they have authenticated their cell lines—but the journal may still publish a “hot” paper even if scientists haven’t fully complied. And even the most widely accepted guidelines are frequently ignored.

That leads to an obvious conclusion: awareness isn’t enough. The social context of science needs to change in order to create the incentives that will lead scientists to raise their standards. Yet most of the scientists who are trying to fix the problems of reproducibility are biologists or physicians, not social scientists equipped to think about remodeling the culture of science. Social scientists who pay attention to the study of scientific research tend to work on esoteric topics rather than the more nuts-and-bolts issues involved in understanding and changing ongoing behaviors. But a few pioneers, like Nosek and Brian Martinson, have pushed into this territory. Jonathan Kimmelman, who focuses on biomedical ethics at McGill University in Montreal, is another. At the Stanford meeting, he challenged researchers to think more deeply about modifying scientists’ behavior.

Kimmelman argued provocatively that since science can never free itself of missteps and irreproducible results, it would be helpful for scientists, when they report a result, to state how much confidence they have in their findings. If it’s a wild idea, declare that you don’t have a whole lot of confidence in the result, and scientists following up on it can proceed at their own risk. If you’re very confident of your result, say so. And if you have a good track record, that will instill confidence in your findings. Of course this system will only work if these subjective judgments are better than the flip of a coin. Kimmelman has been running experiments to measure how well scientists can, in fact, make these predictions.

Scientists make judgments all the time, not only about their own work but about the papers they read. Kimmelman hopes that these judgments can be quantified and reported as a matter of course. With this strategy, Kimmelman is trying to take advantage of human abilities that are not conveyed in the dry analysis of journal articles. It’s “getting to what’s going on in the heads of people,” he told me. “That’s not only one of the missing pieces in the puzzle here, but I think it’s a really, really critical issue.”

During our conversation he floated an idea that seemed almost heretical: maybe a certain amount of error is necessary, because it gives scientists something to argue over. The stock market wouldn’t work if everyone was in complete agreement about the value of a given share of stock. Nobody would buy or sell anything. “As with every economy, you may need a lot of riffraff” in science, Kimmelman suggested. This idea comes from looking at biomedical research as an interwoven system. Right now, he argued, “there’s too much emphasis on the individual. What matters is where a community is going, not a particular lab,” he said. “I don’t mean to come off as an apologist at all. I’m not,” he said. “In fact in most settings I guess I would be considered a critic. But I think there are a lot of aspects about reproducibility that we don’t really understand well conceptually. I just think there’s still further work to be done to clarify those kinds of things.”

One final idea sounds downright counterintuitive: to speed the development of medicine, biomedical science should actually slow down. This means taking on fewer projects and doing them more carefully. It means improving the quality of the scientific literature by publishing fewer, more careful papers. In 1963, physicist-historian Derek de Solla Price warned that the scientific literature was growing exponentially and would eventually become unmanageable unless something was done to change the incentives in science. Daniel Sarewitz at Arizona State University wrote that Price’s premonition is coming true. “Today, the interrelated problems of scientific quantity and quality are a frightening manifestation of what he foresaw. It seems extraordinarily unlikely that these problems will be resolved through the home remedies of better statistics and lab practice, as important as they may be.” Given the current reality, Sarewitz told me, scientists and the public would be better off if we actually expected less from science.

We should not assume that every paper in the literature falls neatly into the “good” basket or the “bad” basket. Much of it is provisional, and its true worth may take decades to appreciate. Some of today’s medical advances stem from discoveries made decades ago, and presumably some of today’s discoveries will prove valuable only many years from now. If we curb our enthusiasm a bit, scientists will be less likely to run headlong after dubious ideas like transdifferentiation, and the public will be less likely to embrace the latest dietary fad. Of course this is a discouraging point of view for patients and advocates looking for rapid progress in the search for treatments and cures. But it’s important to distinguish between speed and haste.

A focus on quality over quantity would give Arturo Casadevall’s students at Johns Hopkins a chance to think instead of simply running another experiment. It suggests a path (not without pain, alas) out of the structural morass that has made the biomedical research system financially unsustainable, with too many labs competing for the available funding. And in the end it speaks to a value that many scientists hold deeply: being right should matter most of all.