RARELY IS A scientific discovery so galvanizing that Congress passes a resolution calling for more funding of the research. But Congress did just that in 2002, on the heels of an exciting announcement. Researchers at the Food and Drug Administration (FDA) and the National Institutes of Health (NIH) trumpeted a new test they’d developed to detect ovarian cancer, which up until that point could only be diagnosed with surgery. Many women learn they have the disease in its later stages, when it’s especially hard to treat, while others undergo unnecessary surgery simply to rule it out.
News of this putative new test, based on a novel technology, made headlines. The researchers who discovered it ended up on the TODAY show. Scientists were excited as well. Keith Baggerly and his colleagues at MD Anderson Cancer Center in Houston started scrambling to assemble a lab that would let them do this kind of testing. If it worked for ovarian cancer, they reasoned, it seemed likely that the technology could diagnose many other forms of cancer in early, more treatable stages. And the excitement arose in part because this was no ordinary blood test. Here’s the novel concept: scientists extracted proteins from blood samples and put them into a device called a mass spectrometer—basically a sorting machine that separates molecules by mass. The original researchers compared results from fifty women with ovarian cancer and fifty healthy women. They reported seeing a notable difference, identifying a pattern in the cancer cases most of the time. This idea marked the first step into an exciting frontier. Instead of searching for one particular molecule as most lab tests do, this test looked for a broad pattern, a protein spectrum. It seemed like the start of something big. Really big.
Baggerly naturally wanted to see if he could also see the pattern in the data. “We looked at it fairly extensively for a few months,” Baggerly said, “and we couldn’t find the patterns that they were reporting.” Other scientists started raising doubts as well, commenting on the paper’s methods and conclusions in short letters to the Lancet, which had published the original research. Baggerly kept gnawing on this bone and eventually realized that the data he had been analyzing from the original report had already been cleaned up a bit. When he went all the way back to the raw data, lo and behold, he did see a significant difference between women with and without ovarian cancer.
But there was a problem. The difference Baggerly saw was in data that scientists generally throw away because it’s untrustworthy. It reflected “noise” generated by the machine, he concluded, and had nothing to do with detecting a real protein fingerprint for ovarian cancer. And that explains the seeming difference between the women with ovarian cancer and the healthy group: The samples from the women with ovarian cancer had been run on one day and the samples from the comparison group on another. Apparently there was some subtle difference in how the mass spectrometer operated from one day to the next. The “ovarian cancer” test was really measuring nothing more than spurious signals from the machine. This is a classic example of the “batch effect,” in which apparent biological differences actually stem from nothing more than batch-to-batch variation in data collection and analysis.
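To see how easily this can happen, consider a minimal simulation, a sketch of the general idea rather than the researchers' actual data or code. The two simulated groups below are biologically identical; the only thing separating them is a modest day-to-day drift in the instrument, and yet a standard statistical test declares them different:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Fifty simulated "cases" and fifty "controls" with NO real biological
# difference in the measured signal.
cases = rng.normal(loc=10.0, scale=1.0, size=50)      # run on day 1
controls = rng.normal(loc=10.0, scale=1.0, size=50)   # run on day 2

# Hypothetical instrument drift: the machine reads slightly low on day 2.
controls_measured = controls - 0.8
cases_measured = cases

t_stat, p_value = stats.ttest_ind(cases_measured, controls_measured)
print(f"p-value comparing cases to controls: {p_value:.4f}")
# With a drift of 0.8 standard deviations and 50 samples per group, the
# comparison comes out "significant" nearly every time -- even though the
# only thing separating the groups is the day they were run.
```

Had the cases and controls been interleaved on the same days, the drift would have hit both groups equally and the spurious signal would have vanished.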
The Lancet, which had published the original article, wasn’t interested in publishing Baggerly’s analysis, he told me. “We yelled about this and eventually made a bit of a stink,” but to no avail. Then, in 2004, Baggerly was at a convention of cancer doctors when he came across pharmaceutical salesmen from a company called Correlogic Systems, promoting the OvaCheck blood test based on this dubious technology. Baggerly had had enough. The medical journal wouldn’t publish Baggerly’s analysis, but the New York Times would. And soon the FDA stepped in and told the company to stop marketing its product until it could prove that it worked. Correlogic kept trying but could not make a compelling case. In 2010 the company filed for bankruptcy (and the OvaCheck name is now used for a completely different type of blood test for ovarian cancer).
The batch effect is a stark reminder that, as biomedicine becomes more heavily reliant on massive data analysis, there are ever more ways to go astray. Analytical errors alone account for almost one in four irreproducible results in biomedicine, according to Leonard Freedman’s estimate. A large part of the problem is that biomedical researchers are often not well trained in statistics. Worse, researchers often follow the traditional practices of their fields, even when those practices are deeply problematic. For example, biomedical research has embraced a dubious method of determining whether results are likely to be true by relying far too heavily on a gauge of significance called the p-value (more about that soon). Potential help is often not far away: major universities have biostatisticians on staff who are usually aware of the common pitfalls in experiment design and subsequent analysis, but they are not enlisted as often as they could be.
Keith Baggerly stands out among biostatisticians: he actively investigates research that he has doubts about, and he doesn’t hesitate to go public with what he finds. He radiates the poise and self-assurance of a sports coach, or maybe a referee. But personal charm has its limits. “There are some people who have received e-mails from me and are apprehensive about what I’m going to do if I get their data,” Baggerly said. “I can understand that reaction in light of public events,” but really, he insists, he’s just trying to get at the truth. At times, that means digging into other scientists’ work. Naturally, that doesn’t make everyone happy.
A few years ago, he placed an informal wager of sorts with a few of his colleagues at other universities. He challenged them to come up with the most egregious examples of the batch effect. The “winning” examples would be published in a journal article. It was a first stab at determining how widespread this error is in the world of biomedicine. The batch effect turns out to be common.
Baggerly had a head start in this contest because he’d already exposed the problems with the OvaCheck test. But colleagues at Johns Hopkins were not to be outdone. Their entry involved a research paper that appeared to get at the very heart of a controversial issue: one purporting to show genetic differences between Asians and Caucasians. There’s a long, painful, failure-plagued history of people using biology to support prejudice, so modern studies of race and genetics meet with suspicion. The paper in question had been coauthored by a white man and an Asian woman (a married couple, as it happens), lowering the index of suspicion. Still, the evidence would need to be substantial.
The two researchers, Richard Spielman and Vivian Cheung, were prominent geneticists at the University of Pennsylvania. (Spielman died in 2009.) In a 2007 study, they examined 4,197 genes in both Caucasians and Asians. Instead of looking at whether the genes themselves were different, they asked whether some genes were more likely to be switched on in one race versus another. (Genes are the message written in DNA, but most genes are silent most of the time, the spools of DNA where they reside knotted up tight. Biology gets interesting when our cells activate certain genes to read them and carry out their instructions.) They found that about a quarter of those genes were switched on or off in one race but not the other. Their paper, “Common Genetic Variants Account for Differences in Gene Expression Among Ethnic Groups,” landed with a splash when published in Nature Genetics.
But some scientists had their doubts. Joshua Akey, along with biostatistician Jeff Leek and colleagues at the University of Washington, had performed a similar comparison between Caucasians and Africans and found a much smaller difference in that case. Because other genetic studies show that Caucasians are more closely related to Asians than Africans, Akey didn’t expect to see the dramatic effect that Spielman and Cheung had reported. So he dug deeper. These experiments were run on a type of biological chip called a microarray—essentially a chip packed full of carefully arranged dots containing DNA. These tests allow scientists to make thousands of comparisons simultaneously—in this case to measure which genes are turned on and which are turned off.
The University of Washington team tracked down the details about the microarrays used in the experiment at Penn. They discovered that the data taken from the Caucasians had mostly been produced in 2003 and 2004, while the microarrays studying Asians had been produced in 2005 and 2006. That’s a red flag because microarrays vary from one manufacturing lot to the next, so results can differ from one day to the next, let alone from year to year. They then asked a basic question of all the genes on the chips (not just the ones that differed between Asians and Caucasians): Were they behaving the same in 2003–2004 as they were in 2005–2006? The answer was an emphatic no. In fact, the difference between years overwhelmed the apparent difference between races. The researchers wrote up a short analysis and sent it to Nature Genetics, concluding that the original findings were another instance of the batch effect.
These case studies became central examples in the research paper that Baggerly, Leek, and colleagues published in 2010, pointing out the perils of the batch effect. In that Nature Reviews Genetics paper, they conclude that these problems “are widespread and critical to address.”
“Every single assay we looked at, we could find examples where this problem was not only large but it could lead to clinically incorrect findings,” Baggerly told me. That means in many instances a patient’s health could be on the line if scientists rely on findings of this sort. “And these are not avoidable problems.” If you start out with data from different batches, you can’t correct for that in the analysis. In biology today, researchers are inevitably trying to tease out a faint message from the cacophony of data, so the tests themselves must be tuned to pick up tiny changes. That also leaves them exquisitely sensitive to small perturbations—like subtle differences between microarray chips or the air temperature and humidity when a mass spectrometer is running. Baggerly now routinely checks the dates when data are collected—and if cases and controls have been processed at different times, his suspicions quickly rise. It’s a simple and surprisingly powerful method for rooting out spurious results.
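That date check is easy to automate. Here is one way it might look, a sketch that assumes your sample metadata records a processing date (the little table below is invented for illustration): cross-tabulate run date against case/control status and ask whether the two are associated.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical sample metadata; in practice the dates come from the lab's
# own records of when each sample was processed.
meta = pd.DataFrame({
    "group":    ["case"] * 6 + ["control"] * 6,
    "run_date": ["day 1"] * 6 + ["day 2"] * 6,  # cases and controls run on different days
})

# Cross-tabulate processing date against case/control status.
table = pd.crosstab(meta["run_date"], meta["group"])
print(table)

# If run date predicts group membership -- here it predicts it perfectly --
# any "biological" difference is confounded with the batch and cannot be
# cleanly separated afterward.
chi2, p, dof, expected = chi2_contingency(table)
print(f"association between run date and group: p = {p:.3f}")
```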
Senior-level biologists didn’t grow up thinking about problems like this. Thirty years ago, it was a minor miracle to generate fifty points of genetic data after weeks of toil. These days biologists can generate 50 million points of data before lunch. So being on the alert for issues like batch effects requires a whole new mind-set. Biology is no longer simply a descriptive science—numbers matter more all the time. That said, there are still many older scientists “who wanted to do science but didn’t like math, so they thought, ‘Ah ha! Biology, no problem!’” said Keith Yamamoto, vice chancellor for research at the University of California, San Francisco. “When I was in training in molecular biology [in the 1970s], friends would say, ‘If you have to resort to statistics, think of a better experiment.’” As Yamamoto surveys the reproducibility issues today, he figures that mathematical and analytical problems are even more common than are the errors caused by inappropriate animal models or contaminated cell lines. And because more and more biology now revolves around “big data,” scientists need to adapt to this new reality.
Errors are a reminder that new realms provide new pitfalls, but there’s a big upside when this kind of science is done right. The grandest big-data prize in biology was the sequencing of the human genome itself. The code, written in simple units labeled A, T, C, and G, runs to 3 billion characters. Buried in there are approximately 23,000 sequences that encode our genes. At first, scientists hoped that simply identifying and reading these genes would in essence give them a blueprint for human beings—or at least a list of all our parts. Decoding the genome was a true tour de force, and the information it provided is a vital part of the fabric of science today. Ordinary patients may soon have their individual genomes scanned routinely to help doctors identify their susceptibility to disease.
But the scales didn’t drop from our eyes when the genome was revealed. It’s giving up its secrets gradually, reluctantly. Scientists constantly scour the data looking for a specific tidbit of information. Many of these projects involve wrangling huge amounts of data to sift, compare, match, and otherwise piece together a puzzle of gargantuan proportions. When scientists toil with a few genes, that’s called genetics. When they wrestle with massive amounts of data all at once, that’s genomics.
And genomics is just one branch of the “-omics” world. There’s proteomics, in which scientists study thousands of proteins that make up the enzymes and other components of human cells (the ovarian cancer test is an example); there’s transcriptomics, which looks at whether genes are turned on or off (the study comparing Asians and Caucasians falls into this category); there’s lipidomics, which studies varieties of lipids, or fat molecules, that are essential parts of us; there’s metabolomics.… Well, you get the idea.
Alas, when scientists first dived into the -omics, they did not fully appreciate what they were getting themselves into. If you can survey thousands of genes at the same time to look for correlations between them and a given disease or other effect, it’s painfully easy to get it wrong. Many correlations occur just by chance, so you will quickly generate hundreds that look real but aren’t. In fact, if a test is looking for a rare event, most of the time the apparently positive findings will be false. The more correlations you look for, the more erroneous findings you will encounter. And there’s no telling which ones are the errors.
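A quick simulation shows the scale of the problem. In this sketch (made-up numbers, no real genes), thousands of measurements are compared between two groups that are, by construction, identical; a standard test still flags hundreds of them:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_genes, n_per_group = 4000, 20
# Simulated expression values with NO true difference between the groups.
group_a = rng.normal(size=(n_genes, n_per_group))
group_b = rng.normal(size=(n_genes, n_per_group))

# One t-test per gene, all at once.
_, p_values = stats.ttest_ind(group_a, group_b, axis=1)
hits = (p_values < 0.05).sum()
print(f"genes 'significant' at p < 0.05: {hits} of {n_genes}")
# Roughly 5 percent -- about 200 genes -- clear the bar purely by chance,
# and nothing in the p-values themselves says which ones those are.
```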
Over the years breathless headlines have celebrated scientists claiming to have found a gene linked to schizophrenia, obesity, depression, heart disease—you name it. These represent thousands of small-scale efforts in which labs went hunting for genes and thought they’d caught the big one. Most were dead wrong. John Ioannidis at Stanford set out in 2011 to review the vast sea of genomics papers. He and his colleagues looked at reported genetic links for obesity, depression, osteoporosis, coronary artery disease, high blood pressure, asthma, and other common conditions. He analyzed the flood of papers from the early days of genomics. “We’re talking tens of thousands of papers, and almost nothing survived” closer inspection. He says only 1.2 percent of the studies actually stood the test of time as truly positive results. The rest are what’s known in the business as false positives.
The field has come a long way since then. Ioannidis was among the scientists who pushed for more rigorous analytical approaches to genomics research. The formula for success was to insist on big studies, to make careful measurements, to use stringent statistics, and to have scientists in various labs collaborate with one another—“you know, doing things right, the way they should be done,” Ioannidis said. Under the best of these circumstances, several scientists go after exactly the same question in different labs. If they get the same results, that provides high confidence that they’re not chasing statistical ghosts. These improved standards for genomics research have largely taken hold, Ioannidis told me. “We went from an unreliable field to a highly reliable field.” He counts this as one of the great success stories in improving the reproducibility of biomedical science. Mostly. “There’s still tons of research being done the old-fashioned way,” he lamented. He’s found that 70 percent of this substandard genomics work is taking place in China. The studies are being published in English-language journals, he said, “and almost all of them are wrong.”
Scientists could have avoided many of the problems in analyzing big data sets had they had a clear understanding of one key concept: statistical significance. In fact, that lack of understanding plagues many areas of biomedical research. Surprisingly, many researchers have only a poor, formulaic grasp of this critical concept. And flaws in that understanding undercut many published results, from the simplest experiment on up to the million-dollar genomics scan. The term “statistical significance” is bandied about all the time. Generally, it’s the lowest hurdle that a scientist has to clear in order to publish a result in the scientific literature.
The conventional (but wrong) understanding is that a study finding reaches statistical significance if there’s a 95 percent chance that it is correct and only a 5 percent chance that it is wrong. This probability is frequently associated with something called a p-value. If an experiment’s p-value is less than or equal to 0.05 (that’s five-hundredths, or 5 percent), scientists will declare success, and many a journal will happily publish that result. But, while this definition is widely used, it doesn’t mean what many scientists think it means. A result with a p-value less than 0.05 is not in fact at least 95 percent likely to be true. And in reality it sets the bar very low.
A bit of historical context will help here. One of the great scientific minds of the twentieth century was biologist and statistician Ronald “R. A.” Fisher, who developed fundamental ideas that remain at the heart of statistics today. In particular, nearly one hundred years ago he invented a simple mathematical formula, called Fisher’s Exact Test, to measure the strength of an observation. This test is used—in fact, abused—widely throughout biomedicine today. Here’s how it came into being.
A colleague of Fisher’s, Muriel Bristol, claimed that she could tell whether milk or tea had been poured into her cup first. So, out of her sight, Fisher poured her eight cups of tea, four with the tea poured first, four with the milk first. Bristol’s challenge was to identify which was which. Fisher hypothesized that Bristol could not actually tell the difference, and to measure that he devised a simple statistical test to judge the results. Note that Fisher’s test applies to a very specific circumstance. It starts with the assumption that the claim (Bristol can differentiate the cups) is false, and it measures the outcome against that expectation. It does not set a magic threshold that “proves” Bristol can tell which cups received milk first. And most importantly, the test does not predict how Bristol would perform in a second round of tests. As the story goes, she identified all eight cups correctly. She had only a 1.4 percent chance of doing that if she simply chose at random—that’s a p-value of 0.014, which many biologists today would take as strong evidence that she could tell the difference. In fact, the p-value provides hints but no conclusions about whether she knew, got lucky, or cheated. It’s not a measure of the truth but rather a much more limited statement: how surprising her perfect performance would be if Fisher had been right that she could not tell the difference between tea prepared in the two ways.
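The arithmetic behind that 1.4 percent is simple enough to check by hand, or with a few lines of code (a sketch of the calculation, not anything Fisher himself wrote down this way): there are seventy ways to choose which four of the eight cups were milk-first, and random guessing gets them all right in only one of those seventy.

```python
from math import comb
from scipy.stats import fisher_exact

# Eight cups, four poured milk-first. A random guesser who knows there are
# four of each has one chance in "8 choose 4" of labeling all eight correctly.
print(1 / comb(8, 4))                     # 0.0142857... i.e. about 1.4 percent

# The same number falls out of Fisher's exact test applied to the 2x2 table
# of a perfect performance: 4 milk-first cups called milk-first, and so on.
table = [[4, 0],
         [0, 4]]
odds_ratio, p = fisher_exact(table, alternative="greater")
print(p)                                  # also about 0.014
```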
Fisher’s idea was that when scientists perform experiments, they should use this test as a guide to gauge the strength of their findings, and the p-value was part of that. He emphatically urged them to perform their experiments many times over to see whether the results held. And he didn’t establish a bright line that defines what qualifies as statistically significant. Unfortunately, most modern researchers have summarily dismissed his wise counsel. For starters, scientists have gradually come to use p-values as a shortcut that allows them to draw a bright line. The result of any one experiment is now judged statistically significant if it reaches a p-value of less than 0.05.
What’s wrong with this? Plenty. In the winter of 2015, the National Academy of Sciences convened a workshop to explore how better statistical methods could reduce the problem of irreproducible science. One session was devoted to the perils of the p-value. Dennis Boos from North Carolina State University offered a simple thought experiment. Say you’re a scientist with a research finding that just barely reaches that magic (and arbitrary) threshold of significance, a p-value of 0.05. It’s worth noting that a large number of results in the scientific literature do come in around that number. That’s because studies are often designed at the outset to reach that mark. To conserve resources and save time, researchers set up a study that’s just big enough to yield a result that clears the magic threshold of p < 0.05.
Boos asked his colleagues to consider what would happen if you ran that experiment again. Unless you land exactly on p = 0.05 a second time, there’s a fifty-fifty chance the new p-value will be higher and a fifty-fifty chance that it will be lower. In other words, there’s a strong chance your second experiment will have a p-value greater than 0.05 and therefore fail the traditional test of statistical significance. The exact same experiment would be deemed insignificant. That’s rather startling to contemplate, because before a scientist performed that second experiment, he or she was likely under the mistaken impression that there was only a 5 percent chance that the finding wouldn’t hold up. Oops. You know the saying on Wall Street that “past performance does not predict future returns?” Well, that applies to p-values as well.
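A simulation makes Boos’s point vivid. In this sketch (made-up numbers, not the calculation he presented), the true effect is tuned so that a study of this size lands right at p = 0.05 on average; rerunning the “replication” ten thousand times shows how often it misses the bar:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

n = 30                               # samples per group (an arbitrary choice)
effect = 1.96 * np.sqrt(2 / n)       # a true effect sized so the expected test
                                     # statistic sits right at the 0.05 cutoff

misses, reps = 0, 10_000
for _ in range(reps):
    treated = rng.normal(effect, 1.0, n)   # replication sample, real effect present
    control = rng.normal(0.0, 1.0, n)
    _, p = stats.ttest_ind(treated, control)
    misses += p > 0.05

print(f"replications that fail to reach p < 0.05: {misses / reps:.0%}")
# Close to half the replications miss the cutoff, even though the effect
# is genuinely there.
```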
The statistical elite that had gathered in the walnut-paneled lecture room at the academy nodded in knowing agreement with Boos. “We wouldn’t be where we are today if even 5 percent of people [scientists] understood this particular point,” said Stanford’s Steve Goodman, one of the beacons of statistical reasoning. What if scientists really wanted to have a 95 percent chance that their experiment, when run a second time, would still achieve a statistically significant result? Valen Johnson from Texas A&M University had run those numbers. He said scientists could use other, more powerful statistical methods. But those devoted to sticking with p-values should aim for a result ten times more stringent: a p-value of 0.005 rather than the traditional 0.05. That tougher standard would achieve the goal that many scientists already believe they are reaching: a finding that’s 95 percent likely to remain statistically significant if a study is run again. By not doing that, he said, “we’re off by a factor of ten, and this is causing non-reproducibility of scientific studies.”
That’s a much higher bar and would by implication invalidate a large chunk of “significant” findings in the scientific literature today. That’s not to say that all those results are wrong, just that scientists and journal editors place far too much confidence in them. Johnson said, “I have received a lot of pushback about this proposal to raise the bar of statistical significance from scientists, who say, ‘This is going to destroy my career, and I can’t do experiments anymore. I’ll never get a p-value of 0.005.’” But it’s not quite as daunting as it sounds. In many cases, scientists can reach that target (if indeed the phenomenon they’re studying is real) if they increase the number of people, or animals, or samples in their studies by 60 percent. In the process, they’d weed out many dubious results.
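That 60 percent figure is roughly what a textbook sample-size formula gives under simple assumptions. As a back-of-the-envelope check (my own arithmetic for a two-sided z-test at 90 percent power, not Johnson’s calculation), the required sample size grows with the square of the sum of two standard normal quantiles:

```python
from scipy.stats import norm

def sample_size_ratio(alpha_new, alpha_old, power=0.90):
    """How much larger a study must be to hit alpha_new instead of alpha_old,
    holding power fixed, for a simple two-sided z-test."""
    z_beta = norm.ppf(power)
    z_new = norm.ppf(1 - alpha_new / 2)
    z_old = norm.ppf(1 - alpha_old / 2)
    return ((z_new + z_beta) / (z_old + z_beta)) ** 2

extra = sample_size_ratio(0.005, 0.05) - 1
print(f"about {extra:.0%} more subjects needed")   # roughly 60 percent
```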
Scientific societies aren’t calling for that stringent an approach to biomedical research, but at least the problem with p-values has been getting some attention. In 2016, the American Statistical Association decided things had gotten so out of hand that it convened a group of statistics experts to write a statement about the pitfalls of p-values. It says what should be obvious to scientists but clearly is not: “P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.” It goes on to caution against using this statistical test as the sine qua non for analysis and underscores another key point: just because a finding is “statistically significant” doesn’t make it meaningful. Plenty of statistically significant results, even if found repeatedly, indicate a trivial difference that has no consequential bearing on human health.
In 2010, Uri Simonsohn, an economist at the University of Pennsylvania, came to a similarly sobering conclusion about the pitfalls of p-values through a completely different mental route. He went with a couple of colleagues to a conference on the topic of consumer behavior. “We saw a lot of findings that we thought were hard to believe,” Simonsohn told me. “We noticed that when we were confronted with a finding that was hard to believe, we were siding with our intuition instead of the science. We thought that’s fundamentally wrong. You are supposed to change your belief based on the evidence. We were dismissing the evidence.”
That disturbing realization got Simonsohn and his colleagues wondering how hard it would be to show that something was “true” when in fact it was not. They were stunned by the answer. “It was extremely easy to find evidence for something that was not true.” With little or no conscious effort, scientists can look at their data, pull out the bits that support a hypothesis, and ignore the bits that don’t. Alternately, scientists can watch as their data are being generated and, the moment they reach a point of statistical significance, stop the experiment—even though more data could easily undermine their conclusion.
In a widely read 2011 paper, Simonsohn and his colleagues described this kind of manipulation. They called it p-hacking. The idea is simply to look at your data six ways from Sunday until some correlation reaches the p-value of 0.05 or less, at which point, by the conventions of biomedical science, it becomes a “significant” result. In the years since he published that paper, Simonsohn has come to realize that p-hacking is incredibly common in all branches of science. “Everybody p-hacks to some extent,” he told me. “Nobody runs an analysis once, and if it doesn’t work, throws everything away. Everybody tries more things.”
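A small simulation shows why p-hacking is so seductive. In this sketch (invented data, no real study), there is no effect at all, but the analyst measures five correlated outcomes and reports whichever comparison comes out best:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def one_phacked_study(n=20, n_outcomes=5):
    """A null study: no real effect, but five correlated outcomes are tested
    and only the smallest p-value is kept."""
    shared_a, shared_b = rng.normal(size=n), rng.normal(size=n)
    best_p = 1.0
    for _ in range(n_outcomes):
        # each outcome = shared individual differences + outcome-specific noise
        a = shared_a + rng.normal(size=n)
        b = shared_b + rng.normal(size=n)
        best_p = min(best_p, stats.ttest_ind(a, b).pvalue)
    return best_p

studies = 2000
false_hits = sum(one_phacked_study() < 0.05 for _ in range(studies))
print(f"'significant' null studies after cherry-picking: {false_hits / studies:.0%}")
# Far more than the nominal 5 percent, even though nothing real is going on.
```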
If p-hacking weren’t trouble enough, Simonsohn points to one other pervasive problem in research: scientists run an experiment first and come up with a hypothesis that fits the data only afterward. The “Texas sharpshooter fallacy” provides a good analogy. A man wanders by a barn in Texas and is amazed to see bullet holes in the exact bull’s-eyes of a series of targets. A young boy comes out and says he’s the marksman. “How did you do it?” the visitor asks. “Easy. I shot the barn first and painted the targets later,” the boy answers. In science, the equivalent practice is so common it has a name: HARKing, short for “hypothesizing after the results are known.”
It often starts out in all innocence, when scientists confuse exploratory research with confirmatory research. This may seem like a subtle point, but it’s not. Statistical tests that scientists use to differentiate true effects from random noise rest on an assumption that the scientist started with a hypothesis, designed an experiment to test that hypothesis, and is now measuring the results of that test. P-values and other statistical tools are set up explicitly for that kind of confirmatory test. But if a scientist fishes around and finds something provocative and unexpected in his or her data, the experiment silently and subtly undergoes a complete change of character. All of a sudden it’s an exploratory study. It’s fine to report those findings as unexpected and exciting, but it’s just plain wrong to recast your results as a new hypothesis backed by evidence. The fancy statistics aren’t simply inappropriate; they are misleading.
Of course exploration is the essence of science. Scientists at the lab bench slip easily back and forth between the exploratory and confirmatory modes of research. Both are vital to the enterprise. The problems come when scientists lose track of where they are in this fluid world of confirmation and exploration. They have known about the hazards of blurring this line for decades. Classic papers on the topic date from the 1950s, 1960s, and 1970s. Yet press releases, news coverage, and the scientists themselves often muddy this important distinction, which probably explains why so many “discoveries” about coffee, aspirin, vitamins, or what have you end up getting overturned when the next study comes along.
Scientists may have good intentions. For example, perhaps a researcher testing a drug notices that it has a positive effect on a small subset of the people who try it. No doubt that’s potentially exciting. That researcher may be strongly tempted to restructure the analysis to see whether, with a new set of assumptions, the unexpected result will be significant (in the statistical as well as the practical sense). When changing your hypothesis after seeing the results, “usually you have good justification for doing what you did,” Simonsohn said, “but you also could have had great justification for doing 10 different things.” The fundamental problem here is that scientists who do engage in such restructuring often don’t even realize how badly they are abusing the tools of analysis.
At the same time, there is a strong commercial motivation to put data in the best possible light. Researchers trying to develop a drug for the market will look for statistical methods that will be most compelling to reviewers at the FDA, who weigh new-drug approval. This recurring issue is a major factor behind drug withdrawals. One disastrous example involved the anti-arthritis drug Vioxx. A best-seller when first approved in 1999, it was eventually taken off the market when more careful analysis linked it to an increased risk of heart attack. Drug maker Merck had known about the comparatively high rate of heart attacks, which its analysis wrote off as unimportant. The company argued that the patients in the comparison group actually had a lower rate of heart disease, so Vioxx was not to blame. The company lost that argument. In fact, dozens of drugs have been removed from the market after updated analysis and experience identified unacceptable risks.
Biostatisticians can see through this kind of questionable analysis—if they can see the analysis in the first place. But published methods are often vague or deliberately kept secret. An AIDS advocacy group sued the FDA to release the background data and analysis the agency had used to approve the drug Truvada as a means of preventing HIV infection. The company, Gilead Sciences, objected on the grounds that disclosing its analytical methods could help rival companies navigate the FDA drug-approval process. The FDA sided with the drug company. Biostatisticians reacted to that story with disbelief, since good analytical methods aren’t trade secrets and shouldn’t be hidden from scrutiny. The judge in the case agreed.
The lesson here is that solid analysis also requires disclosure and openness. That information, after all, allows other scientists to understand the all-important details upon which a conclusion rests. It’s a critical element of science’s vaunted process of self-correction. But biomedical research is often more opaque than transparent, and that contributes to the troubles with reproducibility. Here, bad incentives and bad habits are both to blame.