Chapter Nine

THE CHALLENGE OF PRECISION MEDICINE

WHEN CAROLYN COMPTON was working as a pathologist at Massachusetts General Hospital, one of the world’s most celebrated medical centers, she never knew when a cancerous colon removed in an operating room would show up at her lab. It could take days. “I can tell you that there was no urgency” to get a colon from the operating room to the pathologist who would diagnose the disease. “A big colon gets put into a bag. It sits in the operating room until a circulating nurse gets around to putting it in the holding refrigerator in the operating room. At the end of the day the same guy who delivers the mail at Mass General comes around and puts it in a cart. He takes it two buildings over, to the pathology department. There it goes onto another bench to get logged in by a technician and it goes into a refrigerator. If it’s a three-day weekend, the resident on call doesn’t come in until Tuesday, opens up the colon and takes a piece of the cancer and puts it into formalin,” a preservative.

This delay was not a problem for the patient when Compton worked at Harvard in the 1990s, and it isn’t one today. Pathologists can still stain that piece of colon, study it under the microscope, and diagnose the type and stage of cancer. “That [drawn-out process] met the standard of care, and still does,” Compton told me. But she has gradually come to realize that this rather casual attention to tissue collection and preservation spells real trouble for biomedical research that might be conducted on a sample of colon cancer. These tissues are perishable, so studies that depend on fine molecular measurements are likely to be irreproducible.

These days, scientists are trying to wring a great deal more information out of the tissue than they can see through a microscope. Precision medicine could potentially correlate specific snippets of DNA, proteins, and other molecules with disease diagnosis or prognosis. Many of these molecules are quite fragile. Compton says even the anesthesia used in the operating room to knock out the patient can affect them. These molecules can change more when surgeons cut off the blood supply to the tissue to be removed. And once the organ is out of the body, the stability of those critical biological molecules will vary depending on the room temperature and—significantly—the amount of time the tissue sits around before it’s preserved. “We will not have precision medicine unless we can fix this problem,” Compton said.

Compton, now at Arizona State University, says it’s taken quite a while for pathologists to realize just how important all those factors could be. One wake-up call came from the lab of David Hicks at the University of Rochester. Starting in 2006, he was trying to unravel a serious medical mystery. The Food and Drug Administration (FDA) had approved a test to help diagnose a particular variant of breast cancer, called HER2-positive. The test itself was certified as very accurate and reliable. Yet about 20 percent of the time it reported that a tissue sample lacked the HER2 trait, when in fact it was present; and up to 20 percent of the time the test “found” the HER2 trait even though it was not there. That’s bad either way. Either women who could benefit from Herceptin, a drug that targets HER2, weren’t getting it, or women were receiving an expensive drug that was not only worthless to them but also had side effects.

But if the test itself wasn’t at fault, the problem must lie in its use. Hicks’s colleagues solved at least part of that mystery with a simple experiment. They let breast tissue from biopsies sit out for an hour or two before testing it. And that was enough to degrade the sample and turn a positive result into a negative one. The molecule detected by the HER2 test breaks down at room temperature. “You can have the best test in the world and still get the wrong answer if you bugger up what you are testing,” Compton said.

That observation sparked action. Two leading professional societies, representing pathologists and clinical oncologists, drew up new rules in 2010 that, among other improvements to reduce false readings, required preservation of breast tissue within an hour of surgery. That has helped make the HER2 test significantly more reliable. But Compton noted with exasperation that breast cancer is the only cancer for which doctors are required to pay attention to the clock after surgically removing tissue. There are no standards for treating samples from the more than two hundred other forms of cancer. She’s not simply concerned about patient care. She’s thinking about how to improve the reliability of scientific research based on those samples.

Compton explained that pathologists come in two varieties: clinical pathologists, who diagnose disease, and anatomical pathologists, who do research. Clinical pathologists follow long lists of federal standards and professional practices. “In laboratory medicine, accuracy of measurement and reproducibility of measurement is everything. Everything,” she said. “In fact all regulation related to laboratory medicine is focused on your ability to calibrate and reproduce reliable analytic results from run to run from day to day from lab to lab.” After all, an error can lead to misdiagnosis, with life-and-death consequences. “You would never believe a result from a [blood sample] if you hadn’t handled the blood tube properly. You would just order another draw.”

Anatomical pathologists and allied medical researchers, on the other hand, tend not to think as much about the quality of their starting materials. “They would come to the pathology department and say, ‘Can I get twenty [paraffin-preserved] blocks of colon cancer?’” Compton said. “They were glad to get their hands on anything.” They would take those starting materials back to their labs, “spend a huge amount of time and money analyzing them and then get results that nobody could interpret—and they never could actually interpret!” Compton says there are no national standards for handling tissue in research labs.

Medical researchers are already getting a taste of the daunting problems that will affect precision medicine if they aren’t absolutely scrupulous about sample collection. Paul Tempst and colleagues at the Memorial Sloan Kettering Cancer Center in New York City were running a study involving proteins taken from blood samples. The scientists were initially encouraged to find a difference between blood samples taken from cancer patients and those taken from healthy people, but Tempst became concerned that the result might simply reflect a batch effect. Sure enough, he eventually tracked down an unexpected culprit: the test tubes in which the blood had been collected. Tempst realized that samples from the healthy people were collected in a clinic, while samples from people with cancer came from a hospital. And it turned out that the hospital used one type of test tube to collect blood, the clinic another. That seemingly trivial difference was enough to render his results meaningless.
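The statistical trap here is worth making concrete. Because every healthy sample came from one tube type and every cancer sample from another, disease status and collection batch were perfectly confounded, so any downstream analysis would mistake the tube artifact for biology. Here is a minimal simulation of that effect, with invented numbers rather than Tempst’s actual data:

```python
# A batch effect in miniature: disease status has NO real effect on the
# measured protein level, but the collection tube does. All numbers invented.
import random
import statistics

random.seed(0)

def draw_samples(tube_offset, n=50):
    # True biology is identical for everyone: protein ~ N(100, 10).
    # The tube type adds a systematic measurement offset on top.
    return [random.gauss(100, 10) + tube_offset for _ in range(n)]

healthy = draw_samples(tube_offset=0)   # clinic: tube type A
cancer = draw_samples(tube_offset=8)    # hospital: tube type B

shift = statistics.mean(cancer) - statistics.mean(healthy)
print(f"Apparent 'cancer signature': {shift:.1f} units")  # ~8, all tube artifact
```

Because tube type tracks disease status exactly, nothing in the data itself can separate the spurious shift from a genuine biomarker; only collecting both groups the same way can.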

Collecting the starting materials correctly is an essential first step, but it’s just a start. Carolyn Compton started thinking about these issues when she was working at the National Cancer Institute (NCI). She was not alone. Her friend, Anna Barker, then NCI deputy director, was growing alarmed as well. After a dinner conversation veered deeply into the topic, “that really set me on a path to say, ‘Let’s create some standard operating procedures and best practices,’” Barker told me. Samples were one thing, but “how do you collect the data? How do you exchange the data? How do you analyze the data?” Individual researchers invent their own ways to do this. “Everybody wants to do their own thing,” Barker said. Even though that’s the ethos of biomedical research, she realized that simply would not work once scientists tried to pool their data. That way lay mayhem.

With that in mind, in 2004 Barker set out to assemble The Cancer Genome Atlas (TCGA), a massive compendium that would map out a multitude of genetic changes associated with various cancers. And to make sure the data were comparable, she didn’t fund individual investigators. Instead, she contracted with researchers to perform specific tasks and to meet specific standards. “It was based on creating a situation where we would get reproducible data. So that means we controlled everything about this project,” she said. Barker specified how tissue would be collected, handled, stored, and sampled, how the DNA would be sequenced, and how the results would be analyzed. The work may not have felt creative to the scientists doing it, but Barker came away from the decadelong project with a sense that she had not only gathered a pile of useful data but convinced scientists to work toward one collective goal rather than pursuing individual projects.

That experience is by far the exception. Scientists are slow to create standards and even slower to adopt them. Something as commonsensical as authenticating cell lines has been a slog. Yet standards are hardly a new idea in science and technology. “We have tons of standards. We have more standards than you’d ever want to think about,” Barker said, for everything from lightbulbs and USB ports to food purity. But they don’t permeate biomedical research. “How many standards do we have in whole genome sequencing? That would be none at this point,” she said. Or how about a standard to search for mutations in genome sequences? “Nobody does it the same way.” Scientists have proposed many standards over the years, but their colleagues are often unaware of them, and, in any case, there’s no easy way to impose them on an entire field.

Biomedical science just can’t keep on going this way. “Biology has become quantitative,” she said. “That’s a transition that we’re just beginning, and it’s going to leave a lot of people behind. We haven’t trained a lot of our biologists to think mathematically or to understand or analyze data. Most people understand that we’re in the digital revolution. Well, your genome is digital data.”

* * *

As John Ioannidis discovered, the early efforts to collect meaningful genome data were a miserable failure. The scientific literature swelled with tens of thousands of papers reporting the discovery of a genetic marker for this or that disease. Attentive news consumers will remember many occasions in which researchers announced “a gene” for schizophrenia or colon cancer or leukemia. Of those, fewer than a dozen putative discoveries have been solid enough to lead to an FDA-approved blood test or therapy. Companies will happily screen your blood for suspicious traits, but most of the results aren’t concrete enough to form the basis for medical treatment. And it has been a struggle to get reproducible results, despite many millions of dollars in taxpayer investment.

This is especially true for studies in which scientists are trying to use genomic information in the quest for new cancer drugs. Even results from the world’s top laboratories sometimes disagree. Jeffrey Settleman and colleagues at Massachusetts General Hospital and the Harvard Medical School set out to look for new cancer drugs by screening compounds in more than six hundred cancer cell lines. Each line had been genetically fingerprinted, so the scientists could not only identify cancers that responded to individual drugs but look for genetic patterns as well. Drugs can be effective in different cancer types if those cancers are genetically similar. In 2012, the group published the results from more than 48,000 individual tests involving 130 potential drugs in these cancer cells. In collaboration with another top lab, the Wellcome Trust Sanger Institute in England, the work identified a few promising leads, linking drugs, genes, and specific cancers.

At the same time, a second consortium had set up a similar massive experiment. Scientists at the Broad Institute in Cambridge, Massachusetts, joined forces with the Novartis drug company to screen twenty-four different drugs in nearly five hundred cancer cell lines. The combined efforts cost tens of millions of dollars and constitute the largest public collections of genetic and drug data in the world. The Broad team published its first findings in the same issue of Nature as the team from Mass General and also highlighted a few leads for future drug development.

John Quackenbush and Benjamin Haibe-Kains, at the Dana-Farber Cancer Institute in Boston, decided to compare these two efforts to see if their results matched—a potentially powerful way to validate the data, since the two research teams used the same starting materials but different testing procedures and analytical methods. “We thought, what could be better than this?” Quackenbush told me. Quackenbush and Haibe-Kains identified fifteen drugs and 471 cell lines common to both experiments. The following year, they published a bombshell in Nature: the results of the two experiments showed almost no correlation. Only one of the fifteen drugs really seemed to behave the same way in the two studies. “If you want to build a predictor of drug response, and you’re using those data, you’re in trouble,” Haibe-Kains told me. Quackenbush chimed in: “How can you ever hope to take data from these cell lines and make a prediction you can take into patients? It just doesn’t work.”
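The core of their comparison is simple to state: for each drug, line up the response summaries (for instance, the area under each dose-response curve) from the two screens across the shared cell lines and compute a rank correlation. A minimal sketch of that check, using invented values rather than the published data:

```python
# Cross-study consistency check for a single drug: correlate its response
# measure across the cell lines shared by two screens. Values are invented.
from scipy.stats import spearmanr

# Hypothetical area-under-curve values for the same drug in the same ten
# cell lines, one list per screen.
screen_1 = [0.91, 0.40, 0.77, 0.55, 0.63, 0.82, 0.35, 0.49, 0.70, 0.60]
screen_2 = [0.52, 0.88, 0.41, 0.73, 0.39, 0.66, 0.90, 0.58, 0.47, 0.75]

rho, p = spearmanr(screen_1, screen_2)
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
# A rho near zero for most of the fifteen shared drugs is essentially
# what Quackenbush and Haibe-Kains found.
```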

Their paper caused quite a stir in the field, generating both consternation and some criticism of their own analysis. Quackenbush and Haibe-Kains accepted the criticism and tweaked their findings. That “moved the needle a little more on the side of consistency” between the two data sets, Haibe-Kains said. Consistency for two or three drugs improved, “but we are still extremely far from taking one data set, validating it on another data set, and jumping directly to the patient.”

Two years later the authors of the original studies fired back with their own reanalysis. Using a more relaxed standard and some unorthodox statistical techniques, they concluded that the results, while far from a perfect fit, gave them a correlation that “seems reasonable” and “acceptable,” especially for the large effects. More than 90 percent of the time, neither experiment showed that a drug was likely to work, and a lot of the disagreement had to do with variation in the less dramatic results, which author Levi Garraway argued weren’t very useful in any event. “The reality for most cancer drugs is that most patients don’t respond,” he said. The rare individuals who do are the interesting cases. He said his studies focus on identifying those rare dramatic effects, and those kinds of results are more consistent across the two studies.

But Quackenbush and Haibe-Kains expected much more from this rich and expensive trove of data: they hoped to find new clues to disease by combining the less dramatic results into medically useful insights. That would be hard to do because there wasn’t even agreement about where to draw the line between uninteresting noisy data and interesting but less dramatic findings. The conflict spiraled into a heated back-and-forth over who was right and who was wrong. “We wasted our time on defending our position and point of view instead of working together to make it [the analytic process] better,” Haibe-Kains said. He ended up hiring a postdoc to work full time on science related to the controversy. “We both have lost something there.”

Garraway was also displeased with how events unfolded. He noted that he and Quackenbush are both affiliated with Harvard’s Dana-Farber Cancer Institute, “so I will admit to being somewhat disappointed that if you publish a paper like that we would never have talked about it in advance,” he told me. (Quackenbush said he had reached out to Garraway’s group but was rebuffed.) That conversation seems even less likely to take place now, since Garraway didn’t discuss his own reanalysis with Quackenbush before publishing it, either. Quackenbush, a computational biologist, would have panned it. He said he would never have allowed a student of his to use the reanalysis techniques that Garraway and his colleagues applied. “I’d ask them what demon of data dredging possessed them to go and make that kind of analysis.”

As is often the case in science, more data at least partly resolved the dispute. Genentech scientists ran similar cell line studies and compared their results with those of the two other efforts. Their findings were quite similar to those of the Broad Institute but also agreed with some of the Massachusetts General Hospital research. But that third analysis also focused specifically on large effects and set a lower bar for “agreement” than Quackenbush and Haibe-Kains had.

Garraway said the question isn’t whether the two labs would always produce the same exact results. Since they used different techniques and tests, the experiments were not designed to replicate one another exactly, the way running a second test with the same ingredients would. He said agreement between two different techniques enhances confidence in the result. When both are technically valid and they don’t agree, the divergences can potentially reveal something important about cancer biology—if you can explain why the results differ.

The story doesn’t end there, however. Even before this problem arose, Peter Sorger at Harvard had been troubled by a deeper question: Are any of these experiments producing rigorous findings in the first place? He had serious doubts. For decades, scientists analyzing cancer cell-line tests had been ignoring some critical biology, and that called much of the work into question. For example, these tests typically don’t take into account that different types of cancer cells grow at different rates. As a result, scientists can be fooled into thinking that a drug is effective at slowing cancerous growth, when in fact the cells are just proliferating slowly to begin with. Sorger ran a series of experiments to show that the standard approach was horribly flawed. He then developed a straightforward method to correct for that.
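His lab’s published fix is, as best I can identify it, the growth rate (GR) metric, which scores a drug by how much it slows the cells’ growth rate over the assay rather than by raw end-of-assay cell counts. A sketch of that correction, assuming that formulation and using invented cell counts:

```python
# Growth-rate-corrected drug response, in the spirit of the GR metric from
# Sorger's group (my attribution; all cell counts below are invented).
import math

def gr_value(x_treated, x_control, x_initial):
    """GR = 1 means no drug effect, 0 means complete growth arrest,
    and negative values mean the drug is actually killing cells."""
    k_treated = math.log2(x_treated / x_initial)   # doublings with drug
    k_control = math.log2(x_control / x_initial)   # doublings without drug
    return 2 ** (k_treated / k_control) - 1

# Two cell lines under the same drug. By the traditional treated/control
# ratio, the fast grower looks far more sensitive (0.25 vs. 0.63)...
fast = gr_value(x_treated=2000, x_control=8000, x_initial=1000)
slow = gr_value(x_treated=1250, x_control=2000, x_initial=1000)

# ...but the growth-rate-corrected values are nearly identical: the drug
# slows proliferation in both lines by about the same fraction.
print(f"fast-growing line: GR = {fast:.2f}")  # ~0.26
print(f"slow-growing line: GR = {slow:.2f}")  # ~0.25
```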

When Genentech published its findings, they contained the right details, enabling Sorger to apply his correction to the data. And the result: only 40 percent of the correlations the company scientists had reported between gene variations and drug sensitivity remained valid after the correction was applied. In one case, the Genentech study found a thousandfold increase in drug sensitivity among ovarian cell lines that carried a certain mutation. But when Sorger corrected for the fact that the mutation made the cells proliferate much faster to begin with, the drug effect simply disappeared.

And it’s not just cell proliferation rates that matter. The density of cells in a flask can also have a profound effect on a result. Many academic researchers aren’t correcting for that, either. “From a theoretical standpoint, the way in which drug response had been traditionally measured is just not sound,” Sorger said. When he talks to oncologists about his findings, he said, they are aghast. “We use that information to determine how we prescribe drugs to patients,” he said. “I think that makes the whole issue much more concerning.”

He has devoted about one hundred people in his lab to sorting out these issues, which are fundamentally a matter of reproducibility. That’s not a direction he expected to take. And though it has become a truism that research on cancer cell lines doesn’t translate into meaningful treatments, Sorger believes it does not have to be so. He worries that the field could give up on the approach altogether, when researchers could instead improve it by thinking more deeply about the underlying biology and applying those lessons. And Sorger is deeply frustrated that the serious issues he has raised don’t seem to have sunk in—most experiments still use the old techniques.

Sorger has been arguing that simple changes to test protocols would enable researchers to gather much more meaningful data. The problem, though, is that scientists have already sunk tens of millions of dollars into doing the work one way, and it’s not an easy call to go back and do it all again with different procedures. “One of the things is that when you start a large-scale project you often have to make trade-offs,” Levi Garraway told me. “You can’t always get everything for every project.” That said, when a powerful new technique comes along, labs do sometimes go back and redo many experiments. Garraway agreed that Sorger has made a good argument for changing the way these kinds of tests should be run. “I’m not promising it will happen, but it’s eminently possible,” he said.

* * *

Scientists pursuing the dream of precision medicine also have a great deal of work to do in order to make biomarkers more trustworthy. Researchers know that if they can find reliable biomarkers to diagnose and track the progression of a disease, they will learn much more quickly whether a potential drug works. For example, researchers long ago learned to measure the amount of HIV in the blood, and by doing that they could readily tell whether a drug would beat back the virus. That greatly accelerated drug development because pharmaceutical companies didn’t have to wait to see if people lived longer—they could simply measure the effect of the drug on viral load. “You still have to make sure you get the dose right and you understand the toxicities and everything, but it tells you very quickly whether or not the drug is going to have an effect,” said Janet Woodcock at the FDA.

But most biomarkers reported in the scientific literature have been dismal failures. Of all the problems in biomedical research, “the irreproducibility and the lack of rigor on the biomarker side is probably the most painful,” Woodcock said. She blames academic researchers for insufficient rigor in their initial efforts to find biomarkers. “The biomedical research community believes if you publish a paper on a biomarker, then it’s real. And most of them are wrong. They aren’t predictive, or they don’t add additional value. Or they’re just plain old wrong.” To figure out exactly what a biomarker does and doesn’t tell you, “you have to do a lot of work. People don’t want to do that work. So this problem isn’t just in the laboratory. It extends over into the clinic in a big-time way.”

Woodcock sees a lot of potential for biomarkers. For example, in her field of rheumatology, doctors have known for many years that osteoarthritis progresses rapidly in some people and very slowly in others. There must be a biological reason for that difference, and discovering that could point to new approaches to treating the disease. Osteoarthritis strikes millions of people, causing disability, joint pain, and expensive joint-replacement surgeries. “There have been like ten thousand papers published on osteoarthritis biomarkers with no rigorous correlative science going on,” Woodcock said. A number of projects, funded by both government and industry, are sifting through all of those leads to find biomarkers that are actually reliable, “but the effort, compared to the magnitude of biomedical research enterprise, it’s like spit or something” in an ocean of dubious data. “The dirty little secret is it costs tens of millions of dollars. It’s expensive.”

Drug companies have sometimes made that investment, and that’s why there are a few successful tests on the market today, particularly for genetic profiling of breast cancer. Those tests show the promise of these technologies, but they are the exceptions. Part of the problem is that a successful biomarker isn’t likely to be a big moneymaker, certainly not compared with a new cancer drug, which can sell for tens of thousands of dollars for a course of treatment.

Scientists are also reluctant to admit, even to themselves, that they are facing a dead end, especially if they have built an entire lab around a particular idea. Josh LaBaer at Arizona State University said that if you work for a pharmaceutical company and your research flops, the company will probably just assign you to a new project. “It’s not so easy in academia,” LaBaer said. “I don’t know if there’s a way in academia to make it so that people can retain their positions but nonetheless walk away from data that isn’t looking encouraging. I think that’s a big part of this reproducibility problem. There’s this need to stick with what you find because your career depends upon it. If you could report it as negative and say it didn’t work and still survive, I think you’d be more inclined to do that.”

LaBaer has been using his position as an editor of the Journal of Proteome Research to stanch the flow of biomarker papers that go nowhere. “I’ve gotten pretty tough lately,” he told me. He won’t accept a paper that simply reports a correlation between a biomarker and a medical condition. He tells authors they need to use that observation to generate a testable hypothesis and then test it. He won’t even review those papers anymore. “I send them back.” It’s not helpful to have the literature filled with papers that rarely pan out. “I’ve gotten a lot of heat for that, but that’s really been my policy.” I asked him whether scientists actually do follow up with more rigorous studies, or if they simply take their papers to one of the thousands of less selective journals to get the work published anyway. “I don’t know,” LaBaer shrugged. And it matters. “It’s going to impact all of us, because more and more people are doing massive literature searches to build databases of information by summarizing the literature automatically.” That makes it even harder to find biomarkers that could actually make a difference for diagnosing and treating disease.

* * *

Anna Barker is trying to find a way through this morass—starting with tissue collection and including drug discovery and biomarker validation—by taking a fresh approach to one of the most challenging cancers: glioblastoma. “Nothing has changed in this cancer for the last hundred years,” she said, alluding to the hundreds of failed efforts to find a viable treatment. Barker organized a coordinated effort to take on this hardest of hard cancers, figuring if she can crack it, people will have to sit up and pay attention to how she did it. “We want to try to do everything right from square one,” she said. That starts with highly regimented collection and testing of tissue samples, along with treatment procedures that are followed to the letter from one hospital to the next. This is an international effort, involving hospitals in Australia and China (which funnels its glioblastoma patients to four hospitals nationwide).

The study itself represents a major departure from the way clinical trials are typically done. Usually a drug company pays a group of researchers to test a single drug. A large-scale trial involves hundreds or thousands of patients, and once the plan is set, the study continues unchanged until there’s an obvious victory or an obvious problem, or until all patients have been enrolled and examined. Barker’s glioblastoma study, called GBM Agile, is instead an “adaptive trial,” which means researchers try to learn something from every single patient as the study progresses. If one patient seems to respond to treatment better than others, scientists will look for genetic clues and other leads that could help them modify the trial for the next patient. And the same team will test an array of drugs from a variety of manufacturers, singly or in combination, depending on where the ongoing study is leading them.
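To make the adaptive idea concrete: in the simplest form of Bayesian adaptive randomization, the trial keeps a running estimate of each arm’s response rate and tilts each new patient’s assignment toward the arms performing best so far. The toy sketch below shows only that bare principle, with invented arm names and response rates; GBM Agile’s actual statistical design is far more elaborate:

```python
# Toy illustration of Bayesian adaptive randomization (Thompson sampling).
# This is the bare principle behind adaptive trials, not GBM Agile's design.
import random

random.seed(1)

# Beta(alpha, beta) posterior over each arm's response rate, starting flat.
arms = {"drug_A": [1, 1], "drug_B": [1, 1], "combo_AB": [1, 1]}
true_response = {"drug_A": 0.15, "drug_B": 0.20, "combo_AB": 0.40}  # unknown in a real trial

for patient in range(200):
    # Sample a plausible response rate for each arm from its posterior,
    # then assign this patient to the arm that drew the highest value.
    arm = max(arms, key=lambda a: random.betavariate(*arms[a]))
    responded = random.random() < true_response[arm]
    arms[arm][0 if responded else 1] += 1  # update that arm's posterior

for arm, (a, b) in arms.items():
    treated, responses = a + b - 2, a - 1
    print(f"{arm}: {treated} patients, {responses} responses")
```

Run with these invented rates, the allocation drifts toward the best arm as evidence accumulates, which is exactly the sense in which the trial “learns something from every single patient.”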

Researchers studying breast cancer pioneered this concept in the early 2000s, but adaptive trial designs are seldom used, in part because they are tougher than standard clinical trials to get approved by the FDA. Barker has navigated that challenge. She has also developed a network of about 150 collaborators working toward the project’s common goal of finding a viable treatment. In this way the trial design gets away from ego-driven research and instead rewards collaboration. “There’s got to be a better way,” she said. “There’s got to be a better way.”

It might seem that Barker is stacking the deck against herself by choosing such an unyielding cancer as the project’s focus. But she doesn’t see it that way. “We need some demonstrable success in the rare tumors, because frankly most diseases are going to become rare diseases.” That is the paradox of precision medicine: instead of homing in on a common treatment for each disease, researchers may end up with a much more complicated and expensive problem to confront. Each patient will have a much more specific genetic diagnosis. And doctors will no longer be treating two hundred different types of cancer; they will potentially be treating thousands of unique diseases, each fine-tuned to the genetics of the individual or to the genetic pattern of the tumor.

The deepest challenge in realizing the potential of precision medicine is in changing the underlying incentives in biomedical research. That means reengineering the culture. The question is how to do that. Step one is to make sure the problems and the perverse incentives are well understood. Step two is to figure out how to create new incentives for scientists, universities, and funding agencies. If this is starting to sound like the germ of a research initiative, it is. In fact, a whole new field is emerging—one designed to study problems in how scientific research is conducted and to identify solutions. It’s called meta-research.